Apache Spark is widely considered a powerful complement to Hadoop, big data’s original technology. Spark is a more accessible, powerful, and capable tool for tackling various big data challenges. It has become mainstream and is the most in-demand big data framework across all major industries. Spark has been part of the Hadoop ecosystem since Hadoop 2.0, and it is one of the most useful technologies for Python big data engineers.
This series of posts is a single-stop resource that gives an overview of Spark’s architecture; it’s a good fit for people looking to learn Spark.
In this post, I will talk about my experience with the AWS Solutions Architect Associate certification and how I prepared for it.
AWS certification allows developers to confirm their qualifications and skills in working with AWS services, and the preparation process itself provides additional hands-on experience with those services. Beyond all of this, you also gain knowledge of architectural patterns that can be applied anywhere else: how solutions are built in the cloud, and their limitations and problems.
Documentation and the videos on the AWS service pages are certainly not a bad start, but to prepare for the exam it is very helpful to have real experience in the cloud and systematic knowledge. …
Since childhood, we have known that when we come in from the street, we have to wash our hands. However, we rarely think about what to do after surfing online.
Everyone should decide for themselves what level of security is personally acceptable. You have to understand what the protected information would cost you if you lost it. If you have important information and are afraid of losing it, and the mere thought of it reaching your enemies scares you, then you should think about information security.
Those who read me or know me personally understand that I care a great deal about personal security, and I want to share how I deal with it. …
In Python, it is common practice to list all the application dependencies installed via pip in a separate text file called requirements.txt.
It’s good practice to pin exact package versions in your requirements file. In that case, everything will be there: both the direct dependencies of your application and their dependencies’ dependencies, and so on.
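For illustration, here is what a fully pinned file might look like for an app that directly depends only on requests; the version numbers are illustrative, but the other entries really are packages that requests pulls in transitively:

```text
# requirements.txt -- the app imports only requests,
# but pinning captures its transitive dependencies too
certifi==2023.7.22
charset-normalizer==3.2.0
idna==3.4
requests==2.31.0
urllib3==2.0.4
```

Nothing in the file itself marks which line is the direct dependency.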
But sometimes, especially on a long-lived project, it’s hard to tell which dependencies were the original ones. They need to be updated on time, and you shouldn’t depend on packages that are outdated or, for whatever reason, no longer needed.
For example, which of the following dependencies are the original ones? …
In the current zoo of data storage and analysis systems within a single organization, it is sometimes necessary to collect reference data: data that is reliable, synchronized from different subsystems, normalized, deduplicated, and cleaned. This problem arises in organizations where departments have worked independently for a long time, collecting data from their systems in their own formats.
The technological solution to such problems is the introduction of an MDM system. An MDM, or Master Data Management, system is essentially a set of processes, standards, and rules for working with and storing data in a uniform way across the organization. As a result, it creates so-called golden records, which represent entities (these can be anything, depending on the business) and their relations. …
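The merge step behind a golden record can be sketched in a few lines. This is a toy illustration, not a real MDM engine; the field names, the match key, and the "latest non-empty value wins" survivorship rule are all hypothetical choices:

```python
# Toy golden-record merge: deduplicate rows describing the same entity
# (matched on a normalized key) across subsystems, letting the most
# recently updated non-empty value win for each field.

def golden_record(rows, key="email"):
    """Merge rows that share a normalized key into one golden record."""
    merged = {}
    for row in sorted(rows, key=lambda r: r["updated_at"]):
        k = row[key].strip().lower()       # normalize the match key
        rec = merged.setdefault(k, {})
        for field, value in row.items():
            if value:                      # later non-empty values win
                rec[field] = value
    return list(merged.values())

# Two subsystems describe the same customer slightly differently:
crm = {"email": "Ann@x.com", "name": "Ann", "phone": "", "updated_at": 1}
billing = {"email": "ann@x.com", "name": "Ann Lee", "phone": "555", "updated_at": 2}
print(golden_record([crm, billing]))
```

A real MDM system adds fuzzy matching, lineage, and per-field survivorship policies on top of this basic idea.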
The workloads of web applications are unpredictable: sometimes they serve a huge number of requests, and sometimes they sit idle with hardly any. Hosting applications on virtual machines in the cloud forces us to pay for the idle time too. To solve this problem we have to deal with load balancing, DNS lookup, and automatic scaling. All of that is difficult to manage, and on pet projects it makes zero sense.
Serverless technologies are several years old, and their popularity increases every year. For highly loaded systems they offer a simple path to near-infinite scaling, and for simple pet projects they are a great opportunity for free hosting. …
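The unit of a serverless deployment is just a function. A minimal AWS Lambda-style handler in Python looks like this; the event shape follows the API Gateway proxy format, and the greeting logic is of course hypothetical:

```python
import json

def handler(event, context):
    """Minimal Lambda-style handler: read a query parameter, return JSON."""
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }

# The same function can be invoked locally -- no servers to manage:
print(handler({"queryStringParameters": {"name": "dev"}}, None))
```

The platform handles routing, scaling, and idling down to zero; you pay only while the function runs.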
Apache Arrow is a holy grail of analytics that was invented not so long ago. It is a special format for columnar data storage in memory. It allows you to move objects from one process to another very quickly: from pandas to PyTorch, from pandas to TensorFlow, from CUDA to PyTorch, from one node to another, and so on. This makes it the workhorse of a large number of frameworks for both analytics and big data.
I actually don’t know of any other in-memory format with comparable support for complex data, dynamic schemas, performance, and platforms.
Apache Arrow itself is not a storage or execution engine. It is designed to serve as a foundation for the following types of…
There are a ton of free comment widgets available; I tried Disqus, Facebook, and Livefyre, for example. They all have huge disadvantages: privacy, page-loading time, the number of requests, limited functionality, and pointless logins to third-party systems. But there is a new (at least to me) idea: store comments in GitHub issues.
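The mechanics are simple: each post maps to one issue, and the page fetches that issue's comments from the GitHub REST API (`GET /repos/{owner}/{repo}/issues/{number}/comments`). A sketch of the data flow, demonstrated offline on a payload shaped like the API response (the repo name and payload are hypothetical):

```python
import json

def comments_url(owner, repo, issue_number):
    # Public GitHub REST endpoint for an issue's comments.
    return (f"https://api.github.com/repos/{owner}/{repo}"
            f"/issues/{issue_number}/comments")

def render(comments_json):
    """Turn the API payload into (author, text) pairs for the page."""
    return [(c["user"]["login"], c["body"]) for c in json.loads(comments_json)]

# Offline demo with a sample payload instead of a live HTTP call:
sample = '[{"user": {"login": "octocat"}, "body": "Nice post!"}]'
print(comments_url("someuser", "blog-comments", 42))
print(render(sample))
```

Readers comment by commenting on the issue itself, so authentication, moderation, and storage are all GitHub's problem, not yours.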
The benefits of this approach are immediate:
During their work, developers frequently need to update their services and deploy them to servers. When the number of projects is small this is not an issue: releases and deployments are rare, and tests are run manually. But as the number of services and tasks grows, executing the same tasks takes more and more time.
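Automating exactly these repeated chores is what CI/CD pipelines are for. As a hedged sketch, a minimal GitHub Actions workflow for a Python service might look like this (the job names, branch, and commands are hypothetical):

```yaml
# .github/workflows/ci.yml -- run tests on every push, deploy from main
name: ci
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install -r requirements.txt
      - run: pytest
  deploy:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - run: echo "deployment step goes here"
```

Once this is in place, every push gets the same tests and every release the same deployment, no matter how many services you run.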
Let’s look at the typical feature-implementation process on the majority of projects:
We should understand that ML models are not static: as soon as the data changes, so do the models and their predictions, so it is necessary to constantly monitor ML pipelines, retrain, optimize, and so on. All of these are problems that unfold over time, which engineers and data scientists have to solve, and they are non-trivial from many points of view. Solutions may have huge time horizons, but the worst part is that they need to be maintained afterwards. Eww. As engineers, we love to create things, but we don’t want to maintain them. To automate data preprocessing, feature engineering, model selection and configuration, and the evaluation of results, the AutoML process was invented. …
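At its core, the model-selection part of AutoML is a search over candidate models and hyperparameters, scored on held-out data. A deliberately tiny pure-Python illustration (the dataset, the two "model families", and the candidate grid are all made up; real AutoML systems search vastly larger spaces with smarter strategies):

```python
from itertools import product

# Hypothetical 1-D dataset where y is roughly 2*x:
train = [(1, 2.1), (2, 3.9), (3, 6.2)]
valid = [(4, 8.1), (5, 9.8)]

def make_model(kind, w):
    if kind == "linear":          # y = w * x
        return lambda x: w * x
    return lambda x: w            # "constant" baseline model

def val_error(model):
    """Sum of squared errors on the validation split."""
    return sum((model(x) - y) ** 2 for x, y in valid)

# Exhaustively score every (model family, hyperparameter) candidate
# and keep the best one -- the essence of automated model selection.
candidates = product(["linear", "constant"], [0.5, 1.0, 2.0, 3.0])
best = min(candidates, key=lambda c: val_error(make_model(*c)))
print(best)
```

Swapping the toy grid for pipelines of preprocessing steps, feature transforms, and model configurations turns this loop into what AutoML frameworks actually do.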