Apache Arrow is something of a holy grail of analytics, and it was invented not so long ago. It is a columnar format for storing data in memory. It lets you hand objects from one process to another very quickly — from pandas to PyTorch, from pandas to TensorFlow, from CUDA to PyTorch, from one node to another node, and so on. This makes it the workhorse of a large number of frameworks for both analytics and big data.
I actually don’t know of any other in-memory format that combines complex data types, dynamic schemas, high performance, and such broad platform support.
Apache Arrow itself is not a storage or execution engine. It is designed to serve as a foundation for the following types of systems:
- SQL execution engines (Drill, Impala, etc.)
- Data analysis systems (pandas, Spark, etc.)
- Streaming and queueing systems (Kafka, Storm, etc.)
- Storage systems (Parquet, Kudu, Cassandra, etc.)
- Machine learning libraries (TensorFlow, Petastorm, Rapids, etc.)
Please do not think of it as part of the Parquet format or part of PySpark. It is a separate, self-contained format that I think is a bit undervalued and deserves to be taught alongside all the other big data formats.
Thank you for reading!
Any questions? Leave a comment below to start a discussion!