
In Python, it is common practice to list all the application dependencies installed via pip in a separate text file called requirements.txt.

It’s good practice to fully pin package versions in your requirements file. And in our case, everything will be there — both the direct dependencies of our application and their transitive dependencies, and so on.
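For illustration, a fully pinned file might look like this — the packages and versions below are just an example, not taken from any real project:

```
# requirements.txt — every line looks the same, whether we asked for the
# package ourselves or pip pulled it in as a dependency of a dependency
requests==2.25.1
certifi==2020.12.5
chardet==4.0.0
idna==2.10
urllib3==1.26.3
```

In this toy example only requests was installed on purpose; the other four lines came along with it, but nothing in the file tells you that.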

But sometimes, especially on a long-lived project, it’s hard to tell which dependencies were the original ones. Yet they need to be updated on time, and we should not keep depending on packages that are outdated or no longer needed for some reason.

For example, which of the following dependencies are the original ones? …


Photo by Jerry Attrick from FreeImages

In the current zoo of data storage and analysis systems within a single organization, it is sometimes necessary to collect reference data: data that is reliable, synchronized from different subsystems, normalized, deduplicated, and cleaned. This need arises in organizations where departments have been working independently for a long time, collecting data from their own systems in their own formats.
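To make "normalized, deduplicated and cleaned" a little more concrete, here is a tiny pandas sketch; the column names and cleanup rules are illustrative assumptions only, not part of any particular MDM product:

```python
import pandas as pd

# Toy "customer" records collected by two departments in different formats.
crm = pd.DataFrame({"name": ["Alice Smith ", "BOB JONES"],
                    "email": ["Alice@X.com", "bob@y.com "]})
billing = pd.DataFrame({"name": ["alice smith", "Carol Lee"],
                        "email": ["alice@x.com", "carol@z.com"]})

merged = pd.concat([crm, billing], ignore_index=True)

# Normalize: trim whitespace, unify case, then deduplicate on the email key.
merged["name"] = merged["name"].str.strip().str.title()
merged["email"] = merged["email"].str.strip().str.lower()
golden = merged.drop_duplicates(subset="email").reset_index(drop=True)

print(golden)  # a single "golden record" per customer
```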

The technological solution to such problems is the introduction of an MDM system. An MDM, or Master Data Management, system is essentially a set of processes, standards, and rules for working with and storing data in a uniform way across the organization. …



Web application traffic is unpredictable: sometimes an application serves a huge load, and sometimes it sits idle with barely any requests. Hosting applications on virtual machines in the cloud forces us to pay for that idle time too. To solve this problem we have to deal with load balancing, DNS lookups, and automatic scaling. All of this is difficult to manage, and on pet projects it makes zero sense.

Serverless technologies are several years old, and their popularity increases every year. For highly loaded systems they offer a simple path to near-infinite scaling, and for simple pet projects they are a great opportunity for free hosting. …
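For a sense of scale, here is a minimal sketch of what a Python function behind something like AWS Lambda plus an HTTP gateway can look like; the handler name and event shape are assumptions for illustration, not a recipe from this article:

```python
import json

def handler(event, context):
    """Entry point that AWS Lambda invokes; 'event' carries the request data."""
    params = (event or {}).get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }
```

No servers, load balancers, or scaling rules to manage — the platform runs the function only when a request actually arrives.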



Apache Arrow is a holy grail of analytics that was invented not so long ago. It is a columnar format for storing data in memory. It allows you to move data from one process to another very quickly — from pandas to PyTorch, from pandas to TensorFlow, from CUDA to PyTorch, from one node to another, and so on. This makes it the workhorse behind a large number of frameworks for both analytics and big data.
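As a small illustration (pandas and pyarrow are assumed to be installed, and the data is made up), converting a DataFrame to an Arrow table and back looks like this:

```python
import pandas as pd
import pyarrow as pa

# A toy DataFrame standing in for real analytical data.
df = pd.DataFrame({"user_id": [1, 2, 3], "spend": [9.99, 0.0, 42.5]})

# Convert to an Arrow table — a columnar, language-agnostic in-memory layout.
table = pa.Table.from_pandas(df)
print(table.schema)

# And back to pandas; other Arrow-aware consumers can read the same columns
# without re-serializing the data row by row.
round_tripped = table.to_pandas()
```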

I actually don’t know of any other in-memory format that combines complex data, dynamic schemas, performance, and broad platform support.

Apache Arrow itself is not a storage or execution engine. It is designed to serve as a foundation for the following types of…


Photo by Andrew Neel on Unsplash

There are a ton of free comment widgets available — I tried Disqus, Facebook, and Livefyre, for example. They all have serious disadvantages: privacy, page load time, number of requests, limited functionality, and pointless logins to third-party systems. But there is a new (at least to me) idea: store comments in GitHub issues.
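The core of the trick is that each post maps to one GitHub issue, and the issue’s comments are fetched at view time. A rough Python sketch of that read path — the repository and issue number below are placeholders, not a real mapping:

```python
import requests

# Hypothetical repository and issue number — replace with your own
# post-to-issue mapping.
OWNER, REPO, ISSUE = "your-user", "your-blog-comments", 1

url = f"https://api.github.com/repos/{OWNER}/{REPO}/issues/{ISSUE}/comments"
resp = requests.get(url, headers={"Accept": "application/vnd.github+json"})
resp.raise_for_status()

# Each comment on the issue becomes a comment under the blog post.
for comment in resp.json():
    print(comment["user"]["login"], "wrote:", comment["body"])
```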

The benefits of this approach are immediate:



Why do we need it?

In day-to-day work, developers frequently need to update their services and deploy them to servers. When the number of projects is small, this is not an issue: releases and deployments are rare, and tests are run manually. But as the number of services and tasks grows, executing the same tasks takes more and more time.

Let’s look at the typical process of feature implementation for the majority of projects:


Photo by Pietro Jeng on Unsplash

We should understand that ML models are not static — as soon as the data changes, so do the models and their predictions, so it is necessary to constantly monitor ML pipelines, retraining, optimization, and so on. All of these are “time series” problems that engineers and data scientists have to solve, and they are not trivial from many points of view. …
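One common (though by no means the only) way to catch such change is to compare the distribution of a feature in production against the training data. A minimal sketch with SciPy — the arrays here are synthetic stand-ins:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

# Stand-ins for a feature at training time and the same feature in production.
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
live_feature = rng.normal(loc=0.3, scale=1.0, size=10_000)  # drifted mean

# Kolmogorov–Smirnov test: a small p-value suggests the distributions differ,
# i.e. the data has drifted and the model may need retraining.
statistic, p_value = ks_2samp(train_feature, live_feature)
print(f"KS statistic={statistic:.3f}, p-value={p_value:.4f}")
```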


Photo by fabio on Unsplash

It really annoys me that all sorts of big data engineers confuse S3 and HDFS, assuming that S3 is the same as HDFS.

That’s not true.

HDFS is a distributed file system designed to store big data. It runs on physical machines that can also run other things. S3 is AWS’s object storage; it has nothing to do with storing files as such — all data in S3 is stored as objects, each associated with a key (the object name), a value (the object content), and a version ID. …
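The key/value nature of S3 is visible in a minimal boto3 sketch; the bucket name below is a placeholder, and a version ID only shows up when bucket versioning is enabled:

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-example-bucket"  # placeholder bucket name

# There is no real directory tree: the "path" is just part of the key string.
s3.put_object(Bucket=BUCKET, Key="reports/2021/summary.json", Body=b'{"ok": true}')

obj = s3.get_object(Bucket=BUCKET, Key="reports/2021/summary.json")
print(obj["Body"].read())        # the value (object content)
print(obj.get("VersionId"))      # present only when versioning is enabled
```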



“In the new world, it is not the big fish that eats the small fish, it’s the fast fish that eats the slow fish.” — Klaus Schwab.

Unfortunately, microservices are today the de facto development standard for any large product. Why is that? Because the market has become crowded with competitors, and it is not only big companies competing with each other — everyone is in the race. Everyone wants to deliver new features quickly, frequently, and reliably. …


Image by xresch from Pixabay

Our ML algorithms are fine, but good results require a sizeable team of data specialists, data engineers, domain experts, and support staff. And as if the number and cost of expert staff were not constraining enough, our understanding of how to optimize the number of nodes, layers, and hyperparameters is still primitive. …

About

luminousmen

helping robots conquer the earth and trying not to increase entropy using Python, Big Data, Machine Learning. Check out my blog — luminousmen.com
