
So I wrote a book.

No one I know has ever written a book. I decided to do it myself as an experiment, and in this post I want to tell you a little about it. Maybe one day you can learn from my example.

Writing a book is fucking hard work, especially when you are not Stephen King. It is even harder when the publisher sets a hard deadline. Fortunately, I had no such constraints: I published the book myself and handled the whole process from beginning to end. But I spent some time digging through…



Apache Spark is considered a powerful complement to Hadoop, big data’s original technology. Spark is a more accessible, powerful, and capable tool for tackling various big data challenges. It has become mainstream and the most in-demand big data framework across all major industries. Spark has been part of Hadoop since version 2.0, and it is one of the most useful technologies for Python big data engineers.

This series of posts is a single-stop resource that gives an overview of the Spark architecture, and it is a good fit for people looking to learn Spark.

Whole series:





In this post, I will talk about my experience with the AWS Solutions Architect Associate certification and how I prepared for it.

AWS certification lets developers confirm their qualifications and skills in working with AWS services, and the preparation process itself provides additional hands-on experience with those services. Beyond that, you also gain knowledge about architectural patterns that can be applied anywhere else: how solutions are built in the cloud, along with their limitations and problems.

Documentation and videos on the AWS services pages are certainly not a bad start, but to prepare for the exam…



Since childhood, we have known that we must wash our hands when we come in from the street. However, we do not really think about what to do after surfing online.

Everyone should decide for themselves what level of security is personally acceptable. You have to understand what the protected information would cost you if you lost it. If you have important information, are afraid of losing it, and the mere thought of it reaching your enemies scares you, then you should think about information security.

Those who read me or know me personally understand that…



In Python, it is common practice to write all the application dependencies that are installed via pip into a separate text file called requirements.txt.

It is good practice to fully pin package versions in your requirements file. In that case, everything will be there: both the direct dependencies of our application and their transitive dependencies, and so on.
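As an illustration, the standard library can reproduce the pinned "name==version" entries that pip freeze writes out. This is a minimal sketch; the actual names and versions depend entirely on your environment:

```python
from importlib import metadata

# Build "name==version" lines for every installed distribution --
# the same pinned form that `pip freeze` writes to requirements.txt.
pinned = sorted(
    f"{dist.metadata['Name']}=={dist.version}"
    for dist in metadata.distributions()
    if dist.metadata["Name"]  # skip entries with broken metadata
)

for line in pinned:
    print(line)
```

In practice you would simply run pip freeze > requirements.txt, but note that it dumps everything installed, which is exactly the problem discussed next.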

But sometimes, especially on a long-lived project, it is hard to tell which dependencies were the original, direct ones. Yet you need to update them in time and avoid depending on packages that are outdated or no longer needed for some reason.

For example, which of the following dependencies are…


Photo by Jerry Attrick from FreeImages

In the current zoo of data storage and analysis systems within a single organization, it is sometimes necessary to collect reference data: data that is reliable, synchronized across different subsystems, normalized, deduplicated, and cleaned. This problem arises in organizations where departments have long worked independently, collecting data from their systems in their own formats.

The technological solution to such problems is the introduction of an MDM system. An MDM, or Master Data Management, system is essentially a set of processes, standards, and rules for working with and storing data in a uniform way across the organization…
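To make "normalized, deduplicated and cleaned" concrete, here is a toy sketch of one such cleaning step. The records and the matching rule (normalized email) are invented for illustration; real MDM pipelines use far richer matching logic:

```python
# Hypothetical customer records collected from two departmental systems,
# each with its own formatting conventions.
records = [
    {"name": "Alice Smith ", "email": "ALICE@example.com"},
    {"name": "alice smith", "email": "alice@example.com "},
    {"name": "Bob Jones", "email": "bob@example.com"},
]

def normalize(rec):
    # Bring every record to one uniform format.
    return {
        "name": rec["name"].strip().title(),
        "email": rec["email"].strip().lower(),
    }

# Deduplicate on the normalized email, keeping the first occurrence --
# the surviving rows form the "golden" reference data.
master = {}
for rec in map(normalize, records):
    master.setdefault(rec["email"], rec)

print(list(master.values()))  # two records remain instead of three
```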



The workloads of web applications are uncertain: sometimes they serve a huge number of requests, and sometimes they sit idle with hardly any traffic. Hosting applications on virtual machines in the cloud forces us to pay for that idle time too. To solve this problem we must look at load balancing, DNS lookup, and automatic scaling. All of this is difficult to manage, and on pet projects it makes zero sense.

Serverless technologies are several years old, and their popularity increases every year. For highly loaded systems they offer a simple path to near-infinite scaling…
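For a sense of the programming model, a minimal handler in the AWS Lambda style might look like this. The event shape follows API Gateway's proxy integration; the greeting logic is invented for the example:

```python
import json

def handler(event, context=None):
    # API Gateway passes query parameters inside the event dict;
    # fall back to a default when none are supplied.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")

    # The returned dict maps directly to an HTTP status code and body.
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

The platform spins such functions up on demand and bills per invocation, which is exactly what makes the idle-time problem above disappear.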



Apache Arrow is a holy grail of analytics that was invented not so long ago. It is a special format for storing columnar data in memory. It allows you to move objects from one process to another very quickly: from pandas to PyTorch, from pandas to TensorFlow, from CUDA to PyTorch, from one node to another, and so on. This makes it the workhorse of a large number of frameworks for both analytics and big data.

I actually do not know of any other in-memory format that combines complex data, dynamic schemas, performance, and broad platform support.

Apache Arrow itself is not a storage…

luminousmen

Helping robots conquer the Earth and trying not to increase entropy, using Python, Big Data, and Machine Learning. Check out my blog: luminousmen.com
