
Famous in-memory data format

Apache Arrow is something of a holy grail of analytics, and a relatively recent one. It is a language-independent columnar format for data in memory. It lets you hand objects from one process to another very quickly, often without copying at all — from pandas to PyTorch, from pandas to TensorFlow, from CUDA to PyTorch, from one node to another, and so on. This makes it the workhorse behind a large number of frameworks for both analytics and big data.

I actually don’t know of any other in-memory format that combines complex data types, dynamic schemas, performance, and broad platform support.

Apache Arrow itself is not a storage or execution engine. It is designed to serve as a foundation for the following types of systems:

  • SQL execution engines (Drill, Impala, etc.)
  • Data analysis systems (pandas, Spark, etc.)
  • Streaming and queueing systems (Kafka, Storm, etc.)
  • Storage systems (Parquet, Kudu, Cassandra, etc.)
  • Machine learning libraries (TensorFlow, Petastorm, RAPIDS, etc.)

Please do not think of Arrow as part of the Parquet format or part of PySpark. It is a separate, self-contained format — one that I think is a bit undervalued and deserves to be taught alongside all the other big data formats.

Thank you for reading!

Any questions? Leave your comment below to start fantastic discussions!

Check out my blog or come to say hi 👋 on Twitter or subscribe to my telegram channel.
Plan your best!
