It annoys me how many big data engineers confuse S3 with HDFS and assume the two are the same thing.

That’s not true.

HDFS is a distributed file system designed to store big data. It runs on a cluster of machines that can also run other workloads, such as compute jobs. S3 is AWS’s object store, not a file store: all data in S3 is kept as objects, each associated with a key (the object’s name), a value (the object’s content), and, when versioning is enabled, a VersionID. You can write, read, list, and delete whole objects, but little else, because S3 is not a file system: there are no real directories, and an existing object cannot be edited in place, only replaced. S3 offers practically unlimited storage in the cloud, whereas HDFS capacity is bounded by the disks in the cluster. S3 used to apply overwrites and deletes in an eventually consistent way; since December 2020 it provides strong read-after-write consistency.
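To make the key, value, and VersionID idea concrete, here is a minimal in-memory sketch of an object store. It is a toy, not the S3 API: the class `ToyObjectStore` and its methods `put`, `get`, and `list_keys` are hypothetical names made up for illustration, and the version IDs are just random hex strings.

```python
import uuid
from collections import defaultdict
from typing import List, Optional


class ToyObjectStore:
    """Hypothetical in-memory stand-in for a versioned bucket (not the real S3 API)."""

    def __init__(self) -> None:
        # key -> list of (version_id, content), oldest first
        self._versions = defaultdict(list)

    def put(self, key: str, content: bytes) -> str:
        # Every write stores a whole new version; nothing is edited in place.
        version_id = uuid.uuid4().hex
        self._versions[key].append((version_id, content))
        return version_id

    def get(self, key: str, version_id: Optional[str] = None) -> bytes:
        versions = self._versions.get(key)
        if not versions:
            raise KeyError(key)
        if version_id is None:
            # No version requested: return the latest one.
            return versions[-1][1]
        for vid, content in versions:
            if vid == version_id:
                return content
        raise KeyError((key, version_id))

    def list_keys(self, prefix: str = "") -> List[str]:
        # "Directories" are only a naming convention: listing is a prefix match on keys.
        return sorted(k for k in self._versions if k.startswith(prefix))


store = ToyObjectStore()
v1 = store.put("logs/2024/app.log", b"first")
v2 = store.put("logs/2024/app.log", b"second")
store.put("logs/readme.txt", b"hello")

latest = store.get("logs/2024/app.log")        # content of the newest version
original = store.get("logs/2024/app.log", v1)  # an older version is still retrievable
under_logs = store.list_keys("logs/")          # every key that starts with "logs/"
```

Notice that the slashes in the keys carry no meaning for the store itself. A path like `logs/2024/` only looks like a directory because the listing filters on a prefix, which is the main practical difference from a file system such as HDFS.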

There are many other criteria, such as cost, SLA, durability, and elasticity, plus features like custom lifecycle rules and version control over objects. But let’s not dwell on them; S3 wins there anyway.

Hadoop and HDFS made it cheap to store and distribute large amounts of data. But now that everyone is moving to cloud architectures, the benefits of HDFS are minimal and not worth the complexity it brings. That’s why, now and in the future, organizations will use S3 as the backend for their data storage solutions.

