1. Introduction
Kafka is not exactly a newcomer on the software market: roughly ten years have passed since LinkedIn released it. Although products with similar functionality were already available at the time, its open source code and broad support from the expert community, in particular through the Apache Incubator, allowed it to find its feet quickly and, later, to compete seriously with alternative solutions.

2. Data transformation
As mentioned above, the Kafka ecosystem is steadily evolving into a full-fledged ETL tool; in other words, you can build and maintain comprehensive ETL solutions on the Kafka ecosystem alone. Yet we have not yet touched on the core part of the ETL process: data transformation. Since business requirements are always unique, there are no ready-made transformation solutions; each case is a separate task that calls for an individual approach. In this article we focus on one of these tools, one that has been booming in recent years: ksqlDB.

2.2 Data transformation in SQL queries
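As a sketch of what this section discusses, here is what declaring a stream over a Kafka topic and then filtering and aggregating it looks like in ksqlDB (the topic, stream, and column names are illustrative, not taken from the article):

```sql
-- Declare a stream over an existing Kafka topic
-- (all identifiers here are illustrative).
CREATE STREAM orders (
    order_id VARCHAR,
    amount   DOUBLE,
    country  VARCHAR
  ) WITH (
    KAFKA_TOPIC  = 'orders',
    VALUE_FORMAT = 'JSON'
  );

-- Convert, filter, and aggregate the stream:
-- running total of large orders per country.
SELECT country, SUM(amount) AS total
FROM orders
WHERE amount > 100
GROUP BY country
EMIT CHANGES;
```

The `EMIT CHANGES` clause marks this as a push query: results are emitted continuously as new records arrive in the topic.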
ksqlDB offers a fairly rich SQL syntax that lets you convert, filter, and aggregate the data flowing through streams with ordinary SELECT queries.

2.3 Saving the obtained results in a separate topic
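The pattern this section describes, persisting a query's output into its own Kafka topic, can be sketched like this (the stream, topic, and column names are illustrative):

```sql
-- Persist the transformed result into a separate Kafka topic.
-- ksqlDB runs this as a persistent query: every new record in
-- the source stream is transformed and written to the new topic.
CREATE STREAM orders_enriched
  WITH (
    KAFKA_TOPIC  = 'orders_enriched',
    VALUE_FORMAT = 'JSON'
  ) AS
  SELECT order_id,
         UCASE(country) AS country,
         amount * 1.2   AS amount_with_tax
  FROM orders
  WHERE amount > 100;
```

Unlike a transient SELECT, this statement keeps running on the ksqlDB server until it is explicitly terminated.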
We have shown previously that streams in ksqlDB can be created on top of Kafka topics. But can we also write the transformed data back to Kafka? Suppose we have written a complex SELECT query that performs all the transformations we need, and now we want to save its result in a separate topic. This is exactly what a persistent CREATE STREAM ... AS SELECT statement does.

2.4 Transforming data formats in ksqlDB
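A minimal sketch of a format conversion of the kind this section covers, re-serializing a JSON stream as Avro by deriving a new stream with a different VALUE_FORMAT (names are illustrative, and Avro output assumes a Schema Registry is configured):

```sql
-- Re-serialize the JSON stream as Avro: the derived stream
-- carries the same rows, but its backing topic stores them
-- in the Avro format (requires a configured Schema Registry).
CREATE STREAM orders_avro
  WITH (
    KAFKA_TOPIC  = 'orders_avro',
    VALUE_FORMAT = 'AVRO'
  ) AS
  SELECT * FROM orders;
```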
In chapter 2.2 we essentially performed conversions that extracted data from JSON and produced something resembling XML. The mere availability of this feature, SQL for data conversion, is a big advantage. However, data transformation is not limited to this: ksqlDB can also re-serialize a stream in a different data format.

2.5 Lambda and Kappa architectures from the ksqlDB perspective
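The "cold data only" condition discussed in this section, restricting batch queries in the target database to fully ingested days and leaving "today's" data to Kafka, might be expressed like this (the table and column names are illustrative):

```sql
-- Batch ("cold") query against the target database: count only
-- data up to, but not including, today, so that records still
-- retained in Kafka are not accounted for twice.
SELECT country, SUM(amount) AS total
FROM orders_history
WHERE event_date < CURRENT_DATE
GROUP BY country;
```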
Despite all the advantages of the lambda architecture, the major drawback of this approach to designing Big Data systems is its complexity, caused by the duplication of processing logic in the "cold" and "hot" paths. Let us examine what a lambda architecture could look like when queries are written against Kafka. Here there is no separation into two paths, but the duplication of query logic persists. The speed layer no longer needs separate storage, because the fresh data is already in Kafka. The issue of strict data partitioning still has to be solved, however, otherwise we risk counting part of the data twice. Suppose the Kafka topics are configured to retain data for more than a day. Then the data already stored in the target database (see Figure 1) is considered cold, and all our database queries must take the corresponding condition into account (exclude "today's" data).

3. Conclusion
The transformation process is one of the cornerstones of ETL. The big data universe offers a wealth of applications of all kinds that have been more or less successful at solving this problem. Each has its own strengths and weaknesses, and not all of them can provide both high reliability and the ability to process large data streams quickly. Thus, Big Data transformation tools often work in conjunction with Kafka, which suits such tasks well, although it usually acts as a simple message broker.