
Famous in-memory data format

Kirill Bobrov
Nov 4, 2020

Apache Arrow is something of a holy grail of analytics that was invented not so long ago. It is a language-independent columnar format for data in memory. It lets you hand objects from one process to another very quickly: from pandas to PyTorch, from pandas to TensorFlow, from CUDA to PyTorch, from one node to another node, and so on. This makes it the workhorse of a large number of frameworks for both analytics and big data.

I honestly don’t know of any other in-memory format that combines support for complex data, dynamic schemas, high performance, and broad platform support.

Apache Arrow itself is not a storage or execution engine. It is designed to serve as a foundation for the following types of systems:

  • SQL execution engines (Drill, Impala, etc.)
  • Data analysis systems (pandas, Spark, etc.)
  • Streaming and queueing systems (Kafka, Storm, etc.)
  • Storage systems (Parquet, Kudu, Cassandra, etc.)
  • Machine learning libraries (TensorFlow, Petastorm, RAPIDS, etc.)

Please do not think of Arrow as part of the Parquet format or part of PySpark. It is a separate, self-contained format which I think is a bit undervalued and deserves to be taught alongside all the other big data formats.

Thank you for reading!

Any questions? Leave your comment below to start fantastic discussions!

Check out my blog or come to say hi 👋 on Twitter or subscribe to my telegram channel.
Plan your best!


Written by Kirill Bobrov

helping robots conquer the earth and trying not to increase entropy using Python, Data Engineering, ML. Check out my blog—luminousmen.com
