Tech Talk: Data integration in the world of microservices with Fabian Wollert & Valentine Gogichashvili

This week, we’re featuring presentations from Microservices Day London. Props to nearForm for assembling an all-star lineup of speakers! In case you don’t have time to watch the videos and prefer to read through the talks, we’ve selected some standouts to offer in transcript form! We continue this series with Fabian Wollert, Business Intelligence Big Data Manager at Zalando and Valentine Gogichashvili, Head of Data Engineering at Zalando.

We hope you enjoy the presentation as much as we did!

Data Integration in the World of Microservices

Presented at Microservices Day London




As we were preparing the presentation, originally we thought it would be 45 minutes. For 25 minutes, we had to really think about what is important. We wanted to give as much context as possible so that you understand the problems we are trying to solve. Let’s see how we manage to do that.

What Zalando is …I suppose in Great Britain, you know what Zalando is. How many people actually know what Zalando is? Okay, wonderful. How many people’s partners, girlfriends, boyfriends are buying stuff at Zalando? So you are the only ones who are buying with Zalando. Very good. Very good. Oh shoot, the technical problems, they solve the technical problems nearly.

[1:30] When I started at Zalando, more than five years ago, the company was very small. The technology department was only about 40 people in all. Now the technology department is 1,100 people. We are hiring up to 50 developers per month. We sucked in every talent in Berlin. We are sucking in all the talent from the cities around, and by around I mean Moscow, Budapest, Bucharest, India, which is not a city, I know. The focus should be on the screen, I think. Or I should enable it now … Wonderful. We are a little bit squashed here, but whatever. Excuse us for this technical problem.

I am Valentine, and this is Fabian, who is actually doing things; I’m just talking about them. What is Zalando? Zalando is a fashion store, as I already asked you. We operate in 15 countries. We have fulfillment centers and so on. Revenues are quite high. We recently had a very successful IPO. The number of offices that we have is also quite high. One of the biggest is in Dublin, and Adrian was a little bit angry with us about this. The specialists have to come from somewhere. We are rapidly growing. As I said, I want to give you a little bit of the context.

How did it start when I started at Zalando? It was a classical tiny online shop with PHP and MySQL. We were, at that moment, the biggest users of the Magento system in the world, and at some point we understood it would not work. Actually it was not me who understood; it was the guys who were operating this thing, and then they called me and I moved it from MySQL to Postgres. We did the reboot in six weeks, without weekends, and we had a wonderful system. We had microservices using RPC, Postgres and so on. This is what it looked like. Microservices, as we call it. We would now call this system a vintage infrastructure. We still have 150 of these microservices running.

The typical way of operating this kind of business logic and these services is that every service has business logic that accesses a database, and then there is the ETL process: business intelligence guys and database engineers know how to extract data from the databases. The ETL process is sometimes not visible to the developers, because the DBAs know the databases and can extract data from there. We have a data warehouse system where our analysts can do their analytical work. On the one hand, it’s use-case specific, but on the other hand, you have what you need, because you usually know what you need. With this architecture we were quite happy, but we were growing further and further, and the organizational structures were moving and changing and cracking, and we were fixing and changing them. At some point, actually all the time, people wanted access to cool technologies, and the typical answer from management was, “yeah, we’re afraid.”

Building Autonomy

This is what started to happen: people started to leave. As you understand, we are growing fast, and we cannot afford to let talent leave. What we did last year was Radical Agility, which was a huge change for the whole of Zalando, especially Zalando technology. Autonomy, purpose, and mastery were proclaimed the main ideas. Teams got autonomy; teams can decide what they do. Teams own their technology stack and are responsible for it. We also proclaimed that AWS and the cloud are where we are moving to. We want to abandon our data centers by the end of this year, if we manage that. We are working on it.

We also proclaimed that services will be built in the form of microservices and that they will talk to each other using REST APIs. As you can see, if you are building a standalone service, it should actually own its data. It shouldn’t write to one single big database that can be accessed by business analysts, even though they want that.

Nakadi event bus

Applications are hidden behind the walls of VPCs in AWS, so that one team in its own AWS account, in its own VPC, cannot destroy other teams’ infrastructure. They are responsible only for the infrastructure in their AWS account. This is how the guys were running around at that moment. We saw that the ETL processes could break completely. We started building Nakadi, the Nakadi event bus, into the system. It’s a very small broker on top of Kafka. Unfortunately we don’t have time to go into detail on Nakadi… It is just an event bus, with some minor tweaks and additions that help developers work with an event bus like Kafka: for example security, minimal schema validation, and so on. The work is in progress and it’s almost completely open source. So if you want, you can have a look.

The idea is very simple. Every microservice pushes into the event bus, and from the event bus you push things into business intelligence. But this small arrow, which is actually not that small, is a question mark: how do we actually do this? What we built, what we’re trying to do, is more or less … Now I want to hand over to Fabian, who is building this part and who can give you a better idea of the infrastructure. I will be available for questions, if there are any.

Fabian: Thank you, Valentine. Let me introduce you a little bit to what that question mark actually is. It’s going to be a little more technical than the previous, by the way really great, presentations. I hope that’s fine for you guys. We have basically built our own data integration platform, which we call Saiki. It’s Javanese for “queue”, and that already hints at what we actually use here. No one on our team speaks Javanese, don’t worry, it’s Google Translate. We built Saiki.

Saiki platform

What does Saiki consist of? Let’s dive into this. First, we’re using Kafka here as well. You could ask now, “Why do we have another Kafka here? There is already a Kafka in Nakadi, why not use that one?” What Valentine quickly went through was that every team has to use REST APIs only, communicate via JSON, and be as flexible as possible. We actually wanted to decouple from Nakadi as much as possible, so first we decoupled the accounts. There are two separate AWS accounts, and our own rules of play say that we cannot access the Kafka in the Nakadi account natively, so we basically have to set up our own in order to use tools that talk to the Kafka APIs natively.

[11:40] I hope all of you know Kafka; it was already talked about a little bit here. It’s a unified log by LinkedIn. It’s basically a distributed log service. I sometimes also refer to it as a kind of buffer. The events come into Nakadi and we can get them as fast as we want, and then also process them as fast as we want. What also comes with our own Kafka is, obviously, our own configuration possibilities: we can enable log compaction on it, or define a different retention time because we want to keep the messages longer, something like that.
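To make the log compaction remark concrete, here is a minimal sketch of what compaction does to a keyed log: only the most recent record per key survives. This is plain Python that imitates the behavior, not Kafka itself, and the keys and values are made up for illustration. In Kafka, the corresponding topic-level settings are `cleanup.policy` (set to `compact`) and `retention.ms`; the talk does not say which values the Saiki cluster uses.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (key, value); hypothetical shape for illustration


def compact(log: Iterable[Record]) -> List[Record]:
    """Keep only the latest record for each key, in original log order."""
    records = list(log)
    latest_offset = {}
    for offset, (key, _value) in enumerate(records):
        latest_offset[key] = offset  # later offsets overwrite earlier ones
    return [
        record
        for offset, record in enumerate(records)
        if latest_offset[record[0]] == offset
    ]


log = [
    ("order-1", "CREATED"),
    ("order-2", "CREATED"),
    ("order-1", "PAID"),
    ("order-1", "SHIPPED"),
]
compacted = compact(log)
print(compacted)
```

A longer retention time is the other knob mentioned: instead of keeping one record per key, the whole log is kept for a longer period before old segments are deleted.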

That means we now have all the events coming from all the applications in our account. But we don’t want to have them in Kafka, we want to have them in our data warehouse, so everyone in the BI department can access them. So we need another component here, which we call Tukang. This is basically a small service that takes all the messages out of Kafka and puts them into S3, and exposes the metadata via a REST API: where the data is stored, how long it’s going to be stored there, and some other metadata that the warehouse needs in order to download this data into the data warehouse. There’s also some cleansing of the data. For example, the teams can send the data out of order. Tukang already takes care of that in the first step. There might also be duplicates in there, which it takes care of too.
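The cleansing step described for Tukang (dropping duplicates and restoring order) and the metadata it exposes can be sketched roughly as follows. This is not Tukang’s actual code: the event fields `eid` and `occurred_at`, the function names, the bucket name, and the metadata layout are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List


def cleanse(events: Iterable[Dict]) -> List[Dict]:
    """Drop duplicate events, then restore chronological order."""
    seen_ids = set()
    unique = []
    for event in events:
        if event["eid"] in seen_ids:  # duplicate delivery, skip it
            continue
        seen_ids.add(event["eid"])
        unique.append(event)
    # sorted() is stable, so events with equal timestamps keep arrival order
    return sorted(unique, key=lambda e: e["occurred_at"])


def describe_batch(bucket: str, key: str, events: List[Dict], retention_days: int) -> Dict:
    """Metadata a warehouse loader could read before downloading the batch."""
    return {
        "location": f"s3://{bucket}/{key}",
        "event_count": len(events),
        "retention_days": retention_days,
    }


raw = [
    {"eid": "c", "occurred_at": 30},
    {"eid": "a", "occurred_at": 10},
    {"eid": "c", "occurred_at": 30},  # delivered twice
    {"eid": "b", "occurred_at": 20},
]
clean = cleanse(raw)
meta = describe_batch("example-bucket", "orders/batch-0001.json", clean, retention_days=7)
print(clean, meta)
```

A real service would have to bound the deduplication state and decide how long to wait for late events; the sketch ignores both and works on one finite batch.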

Afterwards, it materializes the data in S3 and provides it so the data warehouse can download it locally. This is it, basically. This is how we’re doing the whole data integration thing right now. But, as you’ve seen, I’ve left a little bit of space, because this actually gives us more possibilities than just the classic ETL loading process. I will come to that in a second.

Delta loads to event stream

[14:02] Let’s quickly compare the old process and the new process. The old process relied on delta loads. We basically opened a JDBC connection to all the Postgres databases and pulled the data into our data warehouse. The new process relies completely on the event stream. We only use RESTful HTTPS connections now.
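For readers unfamiliar with delta loads, the old process can be pictured as a query that fetches only the rows changed since the last run. The sketch below assumes a watermark on an `updated_at` column, which is a common way to implement delta loads but is not confirmed by the talk. An in-memory SQLite database stands in for Postgres, and the table and column names are hypothetical.

```python
import sqlite3


def delta_load(conn: sqlite3.Connection, last_watermark: int):
    """Fetch rows changed since the previous load and advance the watermark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


# SQLite stands in for a team's Postgres database in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "PAID", 100), (2, "CREATED", 150), (3, "SHIPPED", 200)],
)

rows, watermark = delta_load(conn, last_watermark=100)
print(rows, watermark)
```

The contrast with the event stream is that this approach needs direct database access and knowledge of each schema, which is exactly what autonomous teams with their own persistence layers no longer offer.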

One big issue we are currently facing is actually … Before, because it was all in the data center, we could just access the Postgres databases, check the real source data, and compare it to the data warehouse data. Since every team is autonomous and access to their persistence layer is restricted, we cannot do that anymore. This means we have to trust the delivery teams even more with the correctness of their data, which is fine for us as long as they do their job. That should be fine, right? Obviously the old process was kind of Postgres-dependent. Now everyone can do whatever they want; as long as they send an event to Nakadi, we are fine with it.

Another thing I want to quickly point out is that in the old world we had a so-to-say N-to-one data stream: we had several Postgres databases, but one data warehouse that was getting all of this data. Now we have an N-to-M stream, because we can also split this box up here and say, “Okay, we are getting the data into the data warehouse,” but we can also load it into some other database for data science, or more data-science-focused, purposes.

[15:48] What else can we do with this platform? It actually allows us to do far more than just the normal data integration stream into the data warehouse. I said we are using Kafka, and we chose it because of the strong support from the community, for example for stream processing. We decided to focus a lot on stream processing with a framework called Apache Flink. Who of you has heard of Apache Flink, actually? Well, word gets around, I would say. Maybe some of you know Spark Streaming.

Essentially Apache Flink is kind of like Spark Streaming, but even better, because with Spark Streaming you process … they say it’s micro-batches, and that’s essentially true. So Spark waits for a batch of, let’s say, 10, and processes that batch. Flink processes each event independently on one node, a little bit like Storm. It’s written in Scala. It’s written in Berlin; it was actually founded at the Technical University in Berlin. That also means we can basically get the guys into our office in thirty minutes, which is really cool. I guess in London it’s kind of the same, it’s a few hours’ flight. It’s highly scalable, which is very important for us. If you want to read more about why we chose Flink over Spark, feel free to check out the blog post we wrote about it.
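The difference between the two processing models can be illustrated without either framework. The sketch below is plain Python, not the Spark or Flink API, and all function names are hypothetical. It also batches by count, following the “batch of 10” in the talk, whereas Spark Streaming in practice forms its micro-batches by time interval.

```python
from typing import Callable, Iterable, List


def process_micro_batches(
    events: Iterable[int], batch_size: int, handle_batch: Callable[[List[int]], None]
) -> None:
    """Micro-batch model: buffer events and hand them over in groups."""
    batch: List[int] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            handle_batch(batch)
            batch = []
    if batch:  # flush the incomplete final batch
        handle_batch(batch)


def process_per_event(events: Iterable[int], handle_event: Callable[[int], None]) -> None:
    """Per-event model: every event is handled as soon as it arrives."""
    for event in events:
        handle_event(event)


batch_calls: List[List[int]] = []
event_calls: List[int] = []
process_micro_batches(range(25), 10, lambda b: batch_calls.append(list(b)))
process_per_event(range(25), event_calls.append)
print(len(batch_calls), len(event_calls))
```

In the micro-batch model the first event is only seen by the handler once the batch is full, which is the latency the per-event model avoids.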

As I said, it has very good connectors to Kafka and to Elasticsearch, which we are adding on top of the stream processing. Or not on top: we’re writing the data into Elasticsearch, which enables real-time monitoring for us. Real-time monitoring in the sense of … I think someone touched on the topic briefly: business process monitoring, so checking technically whether the platform works. For example, check whether all the velocities are right: “Are we currently getting enough orders for that time frame, or is there a drop?” The same goes for delivery velocities, or for checking SLAs of correlated events. For example, is a shipment sent out within 48 hours of the order?
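An SLA check on correlated events, like the shipment example, boils down to joining two event types on a shared key and comparing timestamps. The following is a minimal batch-style sketch in plain Python, assuming the SLA means “shipped within 48 hours of the order”; the function name, the order IDs, and the data layout are hypothetical. The platform described in the talk would run this kind of check continuously in the stream processor.

```python
from datetime import datetime, timedelta
from typing import Dict, List

SLA = timedelta(hours=48)  # assumed meaning: ship within 48 hours of ordering


def find_sla_breaches(
    orders: Dict[str, datetime],
    shipments: Dict[str, datetime],
    now: datetime,
    sla: timedelta = SLA,
) -> List[str]:
    """Return IDs of orders shipped too late or still unshipped past the deadline."""
    breaches = []
    for order_id, ordered_at in orders.items():
        deadline = ordered_at + sla
        shipped_at = shipments.get(order_id)
        if shipped_at is None:
            if now > deadline:  # no shipment event yet and deadline has passed
                breaches.append(order_id)
        elif shipped_at > deadline:  # shipped, but too late
            breaches.append(order_id)
    return sorted(breaches)


orders = {
    "A": datetime(2016, 1, 1, 0, 0),
    "B": datetime(2016, 1, 1, 0, 0),
    "C": datetime(2016, 1, 1, 0, 0),
    "D": datetime(2016, 1, 4, 12, 0),
}
shipments = {
    "A": datetime(2016, 1, 2, 0, 0),  # shipped after 24 hours
    "B": datetime(2016, 1, 4, 0, 0),  # shipped after 72 hours
}
breaches = find_sla_breaches(orders, shipments, now=datetime(2016, 1, 5, 0, 0))
print(breaches)
```

The velocity checks mentioned in the same breath follow a similar pattern: count events per time window and compare against an expected range.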

Data lake

We can analyze the data on the fly and visualize it for the tech controlling team. We started with Kibana, but we are kind of shifting right now to a custom-written Python Flask application. Another possibility for extending the platform is a concept we call the Data lake. It’s kind of a buzzword nowadays, but we see the Data lake under the motto “Free the data from the silos”. We basically want to bring high-quality data together. Currently there are those data entity boundaries. Every team knows this and that, and you probably cannot really prevent that, but we want to bridge those boundaries as well as possible with the Data lake. We store the data in centralized, secure, cost-efficient storage, which in our case will be S3, because it’s currently the most convenient to use. We are also investigating HDFS, maybe on top of EC2 or something, but currently our favored solution is S3. We want to store, potentially, all data that is useful for analytical activities in the enterprise.

So that’s basically it. That’s the Saiki platform we built. This was already a technical deep dive, but there’s actually way more to it. For example, the still-big-but-small arrow to the Kafka cluster is actually two components, and there’s obviously a little bit more to it. But generally speaking, this is the stream of events. It goes through Kafka and into the REST API, which the data warehouse calls in order to download the data from AWS.

Open source projects

I’m coming now to my last slide, and I’ve got five minutes, so I can elaborate a little bit more on each of these links, which is cool. Like many of you probably are, we are big fans of open source. We have a dedicated GitHub open source site where you can check out all of our projects. For example, if you’re planning on getting into microservices in the future, we wrote some frameworks for Python and for Scala/Play to … You basically define your RESTful API first with the [inaudible] file, then just connect all the methods to it from the backend Python code, and you can basically have a Python service up and running in fifteen minutes. It’s pretty cool.

We also open-sourced all our guidelines on RESTful APIs, which we think is cool, describing what the teams should follow to communicate with each other. Everything I just said about Saiki we want to publish as much as possible on our wiki on GitHub. I’m just starting to create it, so don’t blame me if there’s not that much yet, but there’s more coming in the next weeks for sure. We write blog posts all the time, so feel free to check those out. And we also open-sourced our deployment organization tools for AWS, which create immutable stacks. Also, since we’re from Germany, we have a lot of audit and compliance requirements that we have to follow, and the STUPS infrastructure framework handles this for us on AWS. And that’s about it, thank you.

About the speakers

Fabian Wollert is Business Intelligence Big Data Manager at Zalando, an e-commerce company based in Berlin. At Zalando, Fabian is working on new concepts to address business intelligence challenges in an agile and scalable infrastructure to ensure reporting and analytic capabilities for the future. Whether it’s data lakes or stream processing, Fabian is enthusiastic about embracing new technologies.

As Head of Data Engineering at Zalando, Valentine’s main goal is to create a data integration platform that enables data science and business intelligence teams to access and process data easily and efficiently. One of his primary activities in this job was to help the company’s growing team of engineers to migrate from MySQL to PostgreSQL. He served as head of Zalando’s database team for several years.
