首页 > 其他分享 >Flink and Kafka Streams: a Comparison and Guideline for Users

Flink and Kafka Streams: a Comparison and Guideline for Users

时间:2023-08-16 23:23:16浏览次数:44  
标签:Comparison processing Users stream Flink Kafka application Streams

This blog post is written jointly by Stephan Ewen, CTO of data Artisans, and Neha Narkhede, CTO of Confluent. Stephan Ewen is PMC member of Apache Flink and co-founder and CTO of data Artisans. Before founding data Artisans, Stephan was leading the development that led to the creation of Apache Flink. Stephan holds a PhD. in Computer Science from TU Berlin. You can also find this post on the data Artisans blog.

The open source stream processing space is exploding, with more streaming platforms available than ever. presenting users with many alternatives. In the Apache Software Foundation alone, there are now over 10 stream processing projects, some in incubation and others graduated to top-level project status. While the availability of alternatives benefits users and the industry as a whole by enabling competition and thus, encouraging innovation, it can also be quite confusing: with all these options, which is the best stream processing system for me now, and in the future? Stream processors can be evaluated on several dimensions, including performance (throughput and latency), integration with other systems, ease of use, fault tolerance guarantees, etc, but making such a comparison is not the topic of its post (and we are certainly biased).

For some time now, the Apache Kafka project has served as a common denominator in most open source stream processors as the the de-facto storage layer for storing and moving potentially large volumes of streaming data with low latency. Recently, the Kafka community introduced Kafka Streams, a stream processing library that ships as part of Apache Kafka. With the addition of Kafka Streams and Kafka Connect, Kafka has now added significant stream processing capabilities.

In this post, we focus on discussing how Flink and Kafka Streams compare with each other on stream processing, and we attempt to provide clarity on that question in this post. Flink and Kafka Streams were created with different use cases in mind. While they have some overlap in their applicability, they are designed to solve orthogonal problems and have very different sweet spots and placement in the data infrastructure stack.

First, let’s look into a quick introduction to Flink and Kafka Streams.

What is Apache Flink?

Apache Flink’s roots are in high-performance cluster computing, and data processing frameworks. Flink runs self-contained streaming computations that can be deployed on resources provided by a resource manager like YARN, Mesos, or Kubernetes. Flink jobs consume streams and produce data into streams, databases, or the stream processor itself. Flink is commonly used with Kafka as the underlying storage layer, but is independent of it.

Before Flink, users of stream processing frameworks had to make hard choices and trade off either latency, throughput, or result accuracy. Flink was the first open source framework (and still the only one), that has been demonstrated to deliver (1) throughput in the order of tens of millions of events per second in moderate clusters, (2) sub-second latency that can be as low as few 10s of milliseconds, (3) guaranteed exactly once semantics for application state, as well as exactly once end-to-end delivery with supported sources and sinks (e.g., pipelines from Kafka to Flink to HDFS or Cassandra), and (4) accurate results in the presence of out of order data arrival through its support for event time. Flink is based on a cluster architecture with master and worker nodes. Flink clusters are highly available, and can be deployed standalone or with resource managers such as YARN and Mesos. This architecture is what allows Flink to use a lightweight checkpointing mechanism to guarantee exactly-once results in the case of failures, as well allow easy and correct re-processing via savepoints without sacrificing latency or throughput. Finally, Flink is also a full-fledged batch processing framework, and, in addition to its DataStream and DataSet APIs (for stream and batch processing respectively), offers a variety of higher-level APIs and libraries, such as CEP (for Complex Event Processing), SQL and Table (for structured streams and tables), FlinkML (for Machine Learning), and Gelly (for graph processing). Flink has been proven to run very robustly in production at very large scale by several companies, powering applications that are used every day by end customers.

What is the Kafka Streams API?

In contrast, the Streams API is a powerful, embeddable stream processing engine for building standard Java applications for stream processing in a simple manner. Such Java applications are particularly well-suited, for example, to build reactive and stateful applications, microservices, and event-driven systems. As a native component of Apache Kafka since version 0.10, the Streams API is an out-of-the-box stream processing solution that builds on top of the battle-tested foundation of Kafka to make these stream processing applications highly scalable, elastic, fault-tolerant, distributed, and simple to build. The gap the Streams API fills is less the analytics-focused domain and more building core applications and microservices that process data streams.

The goal of the Streams API is to simplify stream processing enough to make it accessible as a mainstream application programming model. To aid in that goal, there are a few deliberate design decisions made in the Streams API — 1) It is an embeddable library with no cluster, just Kafka and your application. With the Streams API you can focus on building applications that drive your business rather than on building clusters. This makes it significantly more approachable to application developers looking to do stream processing, as it seamlessly integrates with a company’s existing packaging, deployment, monitoring and operations tooling 2) It is fully integrated with core abstractions in Kafka, so all the strengths of Kafka — failover, elasticity, fault-tolerance, scalability and security — are available and built-in to the Streams API; Kafka is battle-tested and is deployed at scale in thousands of companies worldwide, allowing the Streams API to build on that strong foundation 3) It introduces new concepts and functionality to allow for stream processing, such as fully integrating the abstractions of streams and of tables, which you can use interchangeably within your application to achieve, for example, highly performant join operations and continuous queries.

Flink vs Kafka Streams API: Major Differences

The table below lists the most important differences between Kafka and Flink:

  Apache Flink Kafka Streams API
Deployment Flink is a cluster framework, which means that the framework takes care of deploying the application, either in standalone Flink clusters, or using YARN, Mesos, or containers (Docker, Kubernetes) The Streams API is a library that any  standard Java application can embed and hence does not attempt to dictate a deployment method; you can thus deploy applications with essentially any deployment technology — including but not limited to: containers (Docker, Kubernetes), resource managers (Mesos, YARN), deployment automation (Puppet, Chef, Ansible), and custom in-house tools.
Life cycle User’s stream processing code is deployed and run as a job in the Flink cluster User’s stream processing code runs inside their application
Typically owned by Data infrastructure or BI team Line of business team that manages the respective application
Coordination Flink Master (JobManager),
part of the streaming program
Leverages the Kafka cluster for coordination, load balancing, and  fault-tolerance.
Source of continuous data Kafka, File Systems, other message queues Strictly Kafka with the Connect API in Kafka serving to address the data into, data out of Kafka problem
Sink for results Kafka, other MQs, file system, analytical database, key/value stores, stream processor state, and other external systems Kafka, application state, operational database or any external system
Bounded and unbounded data streams Unbounded and Bounded Unbounded
Semantical Guarantees Exactly once for internal Flink state; end-to-end exactly once with selected sources and sinks (e.g., Kafka to Flink to HDFS); at least once when Kafka is used as a sink, is likely to be exactly-once end-to-end with Kafka in the future Exactly-once end-to-end with Kafka

 

The fundamental differences between a Flink and a Streams API program lie in the way these are deployed and managed (which often has implications to who owns these applications from an organizational perspective) and how the parallel processing (including fault tolerance) is coordinated. These are core differences – they are ingrained in the architecture of these two systems.

Deployment and Organizational Management

A Flink streaming program is modeled as an independent stream processing computation and is typically known as a job. The entire lifecycle of a Flink job is the responsibility of the Flink framework; be it deployment, fault-tolerance or upgrades. The resources used by a Flink job come from resource managers like YARN, Mesos, pools of deployed Docker containers in existing clusters (e.g., a Hadoop cluster in case of YARN), or from standalone Flink installations. Flink jobs can start and stop themselves, which is important for finite streaming jobs or batch jobs. From an ownership perspective, a Flink job is often the responsibility of the team that owns the cluster that the framework runs, often the data infrastructure, BI or ETL team.

The Streams API in Kafka is a library that can be embedded inside any standard Java application. As such, the lifecycle of a Kafka Streams API application is the responsibility of the application developer or operator. The Streams API does not dictate how the application should be configured, monitored or deployed and seamlessly integrates with a company’s existing packaging, deployment, monitoring and operations tooling. From an ownership perspective, a Streams application is often the responsibility of the respective product teams.

Besides affecting the deployment model, running the stream processing computation embedded inside your application vs. as an independent process in a cluster touches issues like resource isolation or separation vs. unification of concerns. For instance, running a stream processing computation inside your application means that it uses the packaging and deployment model of the application itself. And running a stream processing computation on a central cluster means that you can allow it to be managed centrally and use the packaging and deployment model already offered by the cluster. Likewise, running a stream processing computation on a central cluster provides separation of concerns as the stream processing part of the application’s business logic lives separately from the rest of the application and the message transport layer (for example, this means that resources dedicated to stream processes are isolated from resources dedicated to Kafka). On the other hand, running a stream processing computation inside your application is convenient if you want to manage your entire application, along with the stream processing part, using a uniform set of operational tooling. Depending on the requirements of a specific application, one or the other approach may be more suitable.

Call out: Stream processing is used in a variety of places in an organization — from user-facing applications to running analytics on streaming data. The Streams API in Kafka and Flink are used in both capacities. The main distinction lies in where these applications live — as jobs in a central cluster (Flink), or inside microservices (Streams API).

Distributed Coordination and Fault Tolerance

The biggest difference between the two systems with respect to distributed coordination is that Flink has a dedicated master node for coordination, while the Streams API relies on the Kafka broker for distributed coordination and fault tolerance, via the Kafka’s consumer group protocol. While this sounds like a subtle difference at first, the implications are quite significant.

In Apache Flink, fault tolerance, scaling, and even distribution of state are globally coordinated by the dedicated master node. Flink’s master node implements its own high availability mechanism based on ZooKeeper. A failure of one node (or one operator) frequently triggers  recovery actions in other operators as well (such as rolling back changes). This approach helps Flink to get its high throughput with exactly once guarantees, it enables Flink’s savepoint feature (for application snapshots and program and framework upgrades), and it powers Flink’s exactly-once sinks (e.g., HDFS and Cassandra, but not Kafka). Even for nondeterministic programs, Flink can that way guarantee results that are equivalent to a valid failure-free execution. It is worth pointing out that since Kafka does not provide an exactly-once producer yet, Flink when used with Kafka as a sink does not provide end to end exactly-once guarantees as a result.

The Streams API in Kafka provides fault-tolerance, guarantees continuous processing and high availability by leveraging core primitives in Kafka. Each shard or instance of the user’s application or microservice acts independently. All coordination is done by the Kafka brokers; the individual application instances simply receive callbacks to either pick up additional partitions (scale up) or to relinquish partitions (scale down). Fault tolerance is built-in to the Kafka protocol; if an application instance dies or a new one is started, it automatically receives a new set of partitions from the brokers to manage and process. The application that embeds the Streams API program does not have to integrate with any special fault tolerance APIs or even be aware of the fault tolerance model. This allows for a very lightweight integration; any standard Java application can use the Streams API.

To summarize, while the global coordination model is powerful for streaming jobs in Flink, it works less well for standalone applications and microservices that need to do stream processing: the application would have to participate in Flink’s checkpointing (implement some APIs) and would need to participate in the recovery of other failed shards by rolling back certain state changes to maintain consistency. That is clearly not as lightweight as the Streams API approach. Again, both approaches show their strength in different scenarios.

Conclusion

In summary, while there certainly is an overlap between the Streams API in Kafka and Flink, they live in different parts of a company, largely due to differences in their architecture and thus we see them as complementary systems. The Streams API makes stream processing accessible as an application programming model, that applications built as microservices can avail from, and benefits from Kafka’s core competency —performance, scalability, security, reliability and soon, end-to-end exactly-once — due to its tight integration with core abstractions in Kafka. Flink, on the other hand, is a great fit for applications that are deployed in existing clusters and benefit from throughput, latency, event time semantics, savepoints and operational features, exactly-once guarantees for application state, end-to-end exactly-once guarantees (except when used with Kafka as a sink today), and batch processing.

Finally, Flink and core Kafka (the message transport layer) are of course complementary, and together are a great fit for a streaming architecture. The data Artisans and Confluent teams remain committed to guaranteeing that Flink and Kafka work great together in all subsequent releases of the frameworks.

Try out Kafka’s Streams API

If you have enjoyed this article, you might want to continue with the following resources to learn more about Apache Kafka’s Streams API:

 

标签:Comparison,processing,Users,stream,Flink,Kafka,application,Streams
From: https://www.cnblogs.com/WLCYSYS/p/17636465.html

相关文章

  • 开源XL-LightHouse与Flink、ClickHouse之类技术相比有什么优势
    Flink是一款非常优秀的流式计算框架,而ClickHouse是一款非常优秀的OLAP类引擎,它们是各自所处领域的佼佼者,这一点是毋庸置疑的。Flink除了各种流式计算场景外也必然可以用于流式统计,ClickHouse同样也可以用于流式统计,但我不认为它们是优秀的流式统计工具。XL-Lighthouse在流式统计这......
  • 中电金信:技术实践|Flink多线程实现异构集群的动态负载均衡
    导语:ApacheFlink是一个框架和分布式处理引擎,用于对无界和有界数据流进行有状态计算。本文主要从实际案例入手并结合作者的实践经验,向各位读者分享当应用场景中异构集群无法做到负载均衡时,如何通过Flink的自定义多线程来实现异构集群的动态负载均衡。●1. 前言●2. 出现的问......
  • Apache Flink目录遍历漏洞复现CVE-2020-17519
    ApacheFlink目录遍历漏洞复现CVE-2020-17519前置知识ApacheFlink:ApacheFlink是一个框架和分布式处理引擎,用于在无边界和有边界数据流上进行有状态的计算。Flink能在所有常见集群环境中运行,并能以内存速度和任意规模进行计算。漏洞利用条件:ApacheFlink版本为1.11.0......
  • 8-13|Cannot run program "C:\Users\Administrator\AppData\Local\Temp\GoLand
    您的错误消息指的是尝试运行的程序与您当前的Windows版本不兼容。这可能是因为您正在使用一个旧版本的Windows(例如32位的版本)并试图运行一个为新版本(例如64位)编译的程序。以下是解决这个问题的建议步骤:1.**确认您的操作系统版本**: -打开“运行”(按`Win+R`),输入`msinfo32......
  • 史上最全Flink面试题,高薪必备,大数据面试宝典
    文章很长,且持续更新,建议收藏起来,慢慢读!疯狂创客圈总目录博客园版为您奉上珍贵的学习资源:免费赠送:《尼恩Java面试宝典》持续更新+史上最全+面试必备2000页+面试必备+大厂必备+涨薪必备免费赠送:《尼恩技术圣经+高并发系列PDF》,帮你实现技术自由,完成职业升级,薪......
  • flink-cdc同步mysql数据到elasticsearch
    1,什么是cdcCDC是(ChangeDataCapture变更数据获取)的简称。核心思想是,监测并捕获数据库的变动(包括数据或数据表的插入INSERT、更新UPDATE、删除DELETE等),将这些变更按发生的顺序完整记录下来,写入到消息中间件中以供其他服务进行订阅及消费。2,flink的cdc项目地址:https://github......
  • 基于 Flink & Paimon 实现 Streaming Warehouse 数据一致性管理
    一、背景早期的数仓生产体系主要以离线数仓为主,业务按照自己的业务需求将数仓分为不同的层次,例如DWD、DWS、ADS等。在离线数仓中,业务数据会经过离线ETL加工进入数仓,层与层之间的数据转换也会使用离线ETL来进行处理。ADS层可以直接对外提供Serving能力,中间层通常会使用Hiv......
  • Flink Unaligned Checkpoint 在 Shopee 的优化和实践
    一、Checkpoint存在的问题1.1数据库的Snapshot和恢复机制首先看一下数据库的Snapshot和恢复机制。数据库周期性地快照,每一次Snapshot是一个全量快照,同时要持续地写ChangeLog或WAL。当数据库crash后,新的数据库实例会从Snapshot3以及最新的ChangeLog恢复。1.2Fli......
  • Flink源码解析(零)——源码解析系列随笔说明
    00、博主仅是数据开发及数仓开发工程师,出于提升自身对Flink系统原理掌握考虑,自愿花费精力整理源码解析系列随笔,并非专业Flink系统开发人员,在源码解析过程中出现非专业行为望见谅。希望Flink系统开发专业人员多提意见,不胜感激。01、Flink源码解析系列随笔主要基于Flink1.17.1版本......
  • 一、Flink-1.13.6源码编译运行
    1、概述本节演示如何在本地编译、运行Flink源码。技术有限,欢迎各位大佬在评论区批评指正。2、版本说明名称版本flink1.13.6jdk1.8Maven3.2.5操作系统Mac3、编译Flink源码1)从github下载Flink源码gitclonehttps://github.com/apache/flink......