Reading "State Management in Apache Flink"


Updated on 2018-09-09

I’ve been playing with Flink (1.6) and Spark Structured Streaming (2.3.1) recently. I’m no expert in either framework, so my opinions are based on my very short experience with each of them.

In Flink, stream processing is a first-class citizen. There are a lot of nice features that make it easy and flexible to support various streaming scenarios. You can do complex event processing (CEP) and co-processing with multiple input sources. One great feature is that you can attach user-defined state to almost any operator. Since version 1.6, Flink provides the so-called “broadcast state”, which is very useful for streaming applications with dynamic configurations. With Kafka’s support for transactional writes, Flink is now able to achieve end-to-end exactly-once semantics. Here is the blog explaining how they use two-phase commit to implement this.
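To make the “attach state to almost any operator” point concrete, here is a minimal keyed-state sketch using Flink’s KeyedProcessFunction, tracking a running maximum per key. The SensorReading event type and its fields are my own assumptions for illustration, not from the paper:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class RunningMax
        extends KeyedProcessFunction<String, RunningMax.SensorReading, Tuple2<String, Double>> {

    // Hypothetical event type, assumed for this example.
    public static class SensorReading {
        public String id;
        public double temperature;
    }

    // Keyed state: Flink keeps one value per key and checkpoints it automatically.
    private transient ValueState<Double> maxSoFar;

    @Override
    public void open(Configuration parameters) {
        maxSoFar = getRuntimeContext().getState(
                new ValueStateDescriptor<>("max-so-far", Double.class));
    }

    @Override
    public void processElement(SensorReading reading, Context ctx,
                               Collector<Tuple2<String, Double>> out) throws Exception {
        Double current = maxSoFar.value();          // read this key's state
        if (current == null || reading.temperature > current) {
            maxSoFar.update(reading.temperature);   // update this key's state
            out.collect(Tuple2.of(reading.id, reading.temperature));
        }
    }
}
```

Wiring it up is just `stream.keyBy(r -> r.id).process(new RunningMax())`; the state lives inside the operator and is snapshotted with the rest of the pipeline.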

Spark was originally designed for batch processing. Even though it now supports streaming through Spark Streaming and Structured Streaming, its advantages still lie in batch-oriented applications. The data model is tightly coupled with the Dataset/DataFrame abstraction, which makes it friendly for ETL-like data manipulation but not so nice for operations like event processing. I’m not saying that Spark has no advantages over Flink. For example, Spark can scale dynamically with the traffic load; Flink currently lacks this feature due to its more complicated state management (the answer here says that it is a coming feature). Also, I note that in the newly introduced “continuous mode”, dynamic scaling is not supported for a similar reason as in Flink. If I had to pick one framework for ETL only, where latency is not the major concern, Spark would actually be a good candidate. However, if my main application is event processing, Flink is definitely the better choice. Flink also provides a very nice UI to monitor application status; in contrast, the Spark UI is way too complicated for stream processing applications.
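To show what I mean by ETL-friendly, here is a minimal Structured Streaming sketch; the schema and paths are assumptions I made up for illustration:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class EtlJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("etl-sketch").getOrCreate();

        // Read a stream of JSON files from an input directory (schema and path assumed).
        Dataset<Row> events = spark.readStream()
                .schema("id STRING, temperature DOUBLE, ts TIMESTAMP")
                .json("/data/incoming");

        // A typical ETL-style transformation: filter and project.
        Dataset<Row> cleaned = events
                .filter(col("temperature").isNotNull())
                .select("id", "temperature", "ts");

        // Write out as Parquet; the checkpoint location makes the sink fault tolerant.
        cleaned.writeStream()
                .format("parquet")
                .option("path", "/data/cleaned")
                .option("checkpointLocation", "/data/checkpoints")
                .start()
                .awaitTermination();
    }
}
```

Everything here is set-oriented (filter, project), which is exactly where the DataFrame model shines; per-event logic with custom state is where it gets awkward.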

In conclusion, Flink is generally more suitable for stream processing applications. But for now, if your application has a highly dynamic traffic load and latency is not your major concern, pick Spark.


Updated on 2018-02-05

I recently read an excellent blog about exactly-once stream processing. It details the typical solutions for exactly-once processing used by various open source projects. Whether the solution is based on streaming or mini-batches, exactly-once processing inevitably incurs latency. For example, in Flink, the state at each operator can only be read at each checkpoint, so that nothing is exposed that might be rolled back after a crash.
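In Flink this trade-off surfaces directly in the checkpoint configuration: results that must be exactly-once are only published when a checkpoint completes, so the checkpoint interval lower-bounds the end-to-end latency. A minimal sketch:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 10 seconds with exactly-once semantics.
        // Transactional sinks can commit only on checkpoint completion, so this
        // interval is effectively a floor on end-to-end latency.
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

        // ... build and execute the pipeline here ...
    }
}
```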

===

I recently read the VLDB’17 paper “State Management in Apache Flink”. In one sentence,

The Apache Flink system is an open-source project that provides a full software stack for programming, compiling and running distributed continuous data processing pipelines.

For me, Flink sounds like yet another computation framework, an alternative to Spark and MapReduce with a workflow management tool. However,

In contrast to batch-centric job management which prioritizes reconfiguration and coordination, Flink employs a schedule-once, long-running allocation of tasks.

How exactly does a streaming-centric framework differ from a batch-centric one? Conceptually, there is no fundamental difference between the two: any batch processing framework can work “like” a streaming framework by reducing the size of each batch to 1. In practice, however, they are indeed different. A batch-centric framework usually involves a working procedure such as the following:

```
batch 1 start
do some job
batch 1 end
update some state

batch 2 start
do some job
batch 2 end
update some state

...
```

Note that the job is started and ended within each batch. In contrast, a streaming-centric framework works as follows:

```
start a job

receive new data
process the data
update some state
pass the data to the next operator

receive new data
process the data
update some state
pass the data to the next operator

...

end the job
```

The comparison is clear: a job in a streaming-centric framework runs continuously, without being started and stopped repeatedly as in a batch-centric framework. Since starting and stopping a job incurs some cost, a batch-centric framework usually performs less efficiently than a streaming-centric one. Additionally, if the application is mission critical (e.g., malicious event detection), processing data in batches usually means high latency. However, if the task is batch-by-batch in nature, a batch-centric framework usually performs as efficiently as a streaming-centric one.

Another problem is snapshotting, a key capability for a processing pipeline. A snapshot consists of both state and data. The global state of the pipeline is composed of the sub-states of the individual operators, and each of these is either a keyed state or an operator state. The former covers any state indexed by a key extracted from the data (e.g., a count per key); the latter is state scoped to the operator instance itself (e.g., the offsets of its input). Snapshotting the data is tricky, and Flink assumes that
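To make the distinction concrete, the snippet below sketches operator state (as opposed to the keyed state shown earlier), loosely following the buffering-sink pattern from the Flink documentation; the class and state names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

// A sink that buffers elements in *operator* state (not indexed by any key),
// so the buffer survives failures and is redistributed when the job rescales.
public class BufferingSink implements SinkFunction<String>, CheckpointedFunction {

    private transient ListState<String> checkpointedBuffer; // operator state handle
    private final List<String> buffer = new ArrayList<>();

    @Override
    public void invoke(String value) {
        buffer.add(value);
        // ... flush to the external system once the buffer is large enough ...
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        // Called when the checkpoint barrier reaches this operator:
        // persist whatever is still buffered.
        checkpointedBuffer.clear();
        for (String v : buffer) {
            checkpointedBuffer.add(v);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        checkpointedBuffer = ctx.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("buffered-elements", String.class));
        if (ctx.isRestored()) {
            for (String v : checkpointedBuffer.get()) {
                buffer.add(v); // restore the buffer after a failure
            }
        }
    }
}
```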

Input data streams are durably logged and indexed externally allowing dataflow sources to re-consume their input, upon recovery, from a specific logical time (offset) by restoring their state. This functionality is typically provided by file systems and message queues such as Apache Kafka
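In practice this assumption is satisfied by reading from Kafka: the consumer keeps its partition offsets as operator state and re-consumes from the restored offsets after a failure. A minimal sketch, with broker and topic settings assumed:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class ReplayableSourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000L);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker
        props.setProperty("group.id", "flink-demo");              // assumed group

        // The consumer checkpoints its partition offsets as operator state.
        // On recovery, Flink restores the offsets and re-consumes from Kafka,
        // which is exactly the durably logged, replayable input the paper assumes.
        env.addSource(new FlinkKafkaConsumer011<>("events", new SimpleStringSchema(), props))
           .print();

        env.execute("replayable source sketch");
    }
}
```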

Each operator snapshots its current state upon processing a marker (a checkpoint barrier) in the dataflow. With the markers and the snapshotted state of each operator, we can always restore the system from the last completed snapshot. One should note that keyed state is associated with an operator, so data with the same key must be physically processed at the same node; otherwise the per-key state would be split across nodes. Consequently, there must be a shuffle before such operators, unless the data is already partitioned so that records with the same key arrive at a single node.
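That shuffle is exactly what keyBy introduces in the Flink API: all records with the same key are routed to the one parallel task that owns that key’s state. A minimal sketch:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KeyedShuffleExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> events =
                env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3));

        // keyBy introduces the hash shuffle: all records with the same key are
        // routed to the same parallel task, which owns that key's state.
        events.keyBy(0)   // key by the first tuple field
              .sum(1)     // keyed state: one running sum per key
              .print();

        env.execute("keyed shuffle sketch");
    }
}
```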

In conclusion, Flink is great, as streaming-centric frameworks have some fundamental advantages over batch-centric frameworks. However, since batch-centric frameworks such as MapReduce and Spark are already widely deployed, there must be really strong motivations to migrate existing systems to this new framework. Moreover, implementation quality and the contributor community are two very important factors in the adoption of a newborn framework, and Spark has long been a very popular project. Maybe a higher-level project such as Apache Beam is a good direction: Beam hides the low-level execution engine behind a unified interface, and any application written with Beam is then compiled to run on an execution engine such as Spark or Flink.
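For a flavor of what that looks like, here is a minimal Beam pipeline sketch; the input/output paths are assumptions, and the runner (FlinkRunner, SparkRunner, ...) is selected at launch time without changing the pipeline code:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;

public class BeamLineCount {
    public static void main(String[] args) {
        // The execution engine is picked at launch time, e.g. --runner=FlinkRunner
        // or --runner=SparkRunner; the pipeline code stays unchanged.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.read().from("/data/input.txt"))   // input path is an assumption
         .apply(Count.perElement())                      // count identical lines
         .apply(MapElements.via(new SimpleFunction<KV<String, Long>, String>() {
             @Override
             public String apply(KV<String, Long> kv) {
                 return kv.getKey() + ": " + kv.getValue();
             }
         }))
         .apply(TextIO.write().to("/data/line-counts")); // output prefix is an assumption

        p.run().waitUntilFinish();
    }
}
```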