
Introducing some core concepts within Kafka


Introduction

Today, we talk about some core concepts within Kafka. Apache Kafka is a distributed publish-subscribe messaging system. It was originally developed at LinkedIn and later became part of the Apache project. Kafka is a fast, scalable, partitioned, and replicated commit log service that is distributed by design. Producers publish messages to Kafka topics, and consumers subscribe to these topics and consume the messages. A server in a Kafka cluster is called a broker. For each topic, the Kafka cluster maintains partitions for scaling, parallelism, and fault tolerance. Each partition is an ordered, immutable sequence of messages that is continually appended to a commit log. The messages in the partitions are each assigned a sequential id number called the offset.

How should I organize this article? I think it is better to introduce the architecture of Kafka before the core concepts. So let's first talk about Kafka's architecture.

Architecture

Firstly, I want to introduce the basic concepts of Kafka. Its architecture consists of the following components:

  • A stream of messages of a particular type is defined as a topic. A message is defined as a payload of bytes, and a topic is a category or feed name to which messages are published.
  • A producer can be anyone who publishes messages to a topic.
  • The published messages are then stored at a set of servers called brokers, which together form the Kafka cluster.
  • A consumer can subscribe to one or more topics and consume the published messages by pulling data from the brokers.

These components make up the overall architecture of Kafka. Since Kafka is distributed in nature, a Kafka cluster typically consists of multiple brokers. To balance load, a topic is divided into multiple partitions, and each broker stores one or more of those partitions. Multiple producers and consumers can publish and retrieve messages at the same time.

Now, let me talk about these three concepts: topic, producer, and consumer.

topic

Officially, a Kafka topic is a fundamental concept in the Apache Kafka streaming platform, serving as a logical channel or category for organizing and categorizing data streams. Topics act as message queues, where data is published by producers and consumed by consumers in a distributed and real-time manner. Each Kafka topic represents a specific data feed or stream, allowing related information to be grouped together. Topics are highly scalable and durable, often partitioned across multiple servers to enable parallel processing and fault tolerance. They facilitate decoupling data producers from consumers in event-driven architectures, making it possible to build efficient, resilient, and scalable data pipelines and applications that can handle vast volumes of data with low latency.

A Kafka topic partition is a critical mechanism in the Apache Kafka messaging system that allows for the parallel and distributed processing of data streams. It involves breaking down a Kafka topic into smaller, independently manageable units, or topic partitions. Each partition is essentially a linear, ordered log of messages, and it serves as the unit of parallelism, allowing multiple consumers to read and process messages simultaneously. Partitions enable Kafka to achieve high throughput and scalability, as they can be distributed across multiple Kafka brokers and processed concurrently by consumers. Additionally, partitioning ensures fault tolerance, as each partition has replicas on other servers, ensuring data durability and availability even during hardware failures.
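
As a concrete sketch, the partition count and replication factor are fixed when a topic is created. With the Java AdminClient it looks roughly like this (the topic name, counts, and broker address are illustrative assumptions, not values from this article):

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 4 partitions for parallelism, replication factor 3 for fault tolerance
            NewTopic clicks = new NewTopic("clicks", 4, (short) 3);
            admin.createTopics(Collections.singletonList(clicks)).all().get();
        }
    }
}
```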

There is a well-known definition of a topic: Kafka topics are a permanent log of events or sequences. Every topic can serve information to numerous consumers, which is why producers are in some cases referred to as publishers, and consumers are called subscribers. Partitions serve to replicate information across brokers; every Kafka topic is divided into partitions, and each partition can be placed on a different node.

So, what role do Kafka topics play? The role of Kafka topics in data streaming is fundamental to the functionality and efficiency of Kafka. Topics serve as the primary organizational structure for Kafka streams. Their role can be summarized in several key aspects:

  • Data organization - Kafka topics categorize and organize data into logical channels or categories. Each topic represents a specific data stream or feed, allowing data to be organized based on its source, purpose, or content.
  • Decoupling producers and consumers - Kafka topics enable data producers and consumers to be decoupled. Producers publish data to topics without knowing who or what will consume it, promoting loose coupling and flexibility in system design.
  • Parallel processing - Topics are divided into partitions, which can be processed independently and in parallel. This parallelism facilitates the high-throughput processing of data, making Kafka suitable for real-time and high-volume data streams.
  • Data retention - Kafka topics can retain data for a configurable period, ensuring consumers can access historical data, replay events, or perform analytics on past data.
  • Scalability and fault tolerance - Kafka topics are partitioned across multiple brokers, distributing the data load and ensuring fault tolerance. This architecture allows Kafka to scale horizontally to handle massive data volumes.
  • Message ordering - Within a partition, Kafka maintains strict message order, ensuring that data is processed sequentially, which is critical for applications requiring ordered event processing.

I believe readers now know the definition of a topic, the role topics play in Kafka, and how Kafka achieves high performance, high scalability, and fault tolerance through topics. Now let's talk about producers within Kafka.

producer

In a nutshell, producers create new messages. In other publish/subscribe systems, these may be called publishers or writers. A message is produced to a specific topic. By default, the producer will balance messages over all partitions of a topic evenly. In some cases, the producer will direct messages to specific partitions; this is typically done using the message key and a partitioner that will generate a hash of the key and map it to a specific partition. This ensures that all messages produced with a given key will get written to the same partition.

There are many reasons an application might need to write messages to Kafka: recording user activities for auditing or analysis, recording metrics, storing log messages, recording information from smart appliances, communicating asynchronously with other applications, buffering information before writing to a database, and much more. Those diverse use cases also imply diverse requirements: is every message critical, or can we tolerate the loss of messages? Are we OK with accidentally duplicating messages? Are there any strict latency or throughput requirements we need to support?

Consider some particular situations. In a credit card transaction processing example, it is critical to never lose a single message or duplicate any messages. Latency should be low, but latencies up to 500 ms can be tolerated, and throughput should be very high: we expect to process up to a million messages a second. A different use case might be to store click information from a website. In that case, some message loss or a few duplicates can be tolerated; latency can be high as long as there is no impact on user experience. In other words, we don't mind if it takes a few seconds for a message to arrive at Kafka, as long as the next page loads immediately after the user clicks on a link. Throughput will depend on the level of activity we anticipate on our website.
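
These two scenarios call for quite different producer configurations. As a minimal sketch with the standard producer client settings (the specific values shown are illustrative assumptions, not recommendations):

```java
import java.util.Properties;

public class ProducerProfiles {
    // Reliability-first, e.g. payment processing: never lose or duplicate a message.
    static Properties reliable() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        p.put("acks", "all");                 // wait for all in-sync replicas
        p.put("enable.idempotence", "true");  // client retries cannot create duplicates
        p.put("delivery.timeout.ms", "120000");
        return p;
    }

    // Throughput-first, e.g. website clickstream: a little loss is tolerable.
    static Properties fast() {
        Properties p = new Properties();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("acks", "1");               // leader acknowledgment only
        p.put("linger.ms", "20");         // wait briefly to build larger batches
        p.put("compression.type", "lz4"); // trade CPU for network throughput
        return p;
    }
}
```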

So we must understand that different requirements will influence the way we use the producer API to write messages to Kafka and the configuration we choose. While the producer API is very simple, there is a bit more that goes on under the hood of the producer when we send data. The main steps involved in sending data to Kafka are described below.

We start producing messages to Kafka by creating a ProducerRecord, which must include the topic we want to send the record to and a value. Optionally, we can also specify a key, a partition, a timestamp, and/or a collection of headers. Once we send the ProducerRecord, the first thing the producer will do is serialize the key and value objects to byte arrays so they can be sent over the network.

Next, if we didn't explicitly specify a partition, the data is sent to a partitioner, which will choose a partition for us, usually based on the ProducerRecord key. Once a partition is selected, the producer knows which topic and partition the record will go to. It then adds the record to a batch of records that will also be sent to the same topic and partition. A separate thread is responsible for sending those batches of records to the appropriate Kafka brokers.

When the broker receives the messages, it sends back a response. If the messages were successfully written to Kafka, it will return a RecordMetadata object with the topic, partition, and offset of the record within the partition. If the broker failed to write the message, it will return an error. When the producer receives an error, it may retry sending the message a few more times before giving up and returning an error.
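
Putting these steps together, here is a minimal sketch of producing a keyed message with the Java client (the topic name, key, value, and broker address are illustrative assumptions):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("user-42") is hashed by the default partitioner, so all
            // messages with this key land in the same partition of "clicks".
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("clicks", "user-42", "clicked /home");

            // Asynchronous send: on success the callback gets a RecordMetadata
            // with topic, partition, and offset; on failure it gets the exception
            // (after the client has exhausted its internal retries).
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("topic=%s partition=%d offset=%d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        } // try-with-resources closes the producer, flushing pending batches
    }
}
```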

consumer

Applications that need to read data from Kafka use a KafkaConsumer to subscribe to Kafka topics and receive messages from those topics. Reading data from Kafka is a bit different than reading data from other messaging systems, and there are a few unique concepts and ideas involved. I will start by explaining some of the important concepts.

To understand how to read data from Kafka, you first need to understand its consumers and consumer groups. The following sections cover those concepts.

Suppose you have an application that needs to read messages from a Kafka topic, run some validations against them, and write the results to another data store. In this case, your application will create a consumer object, subscribe to the appropriate topic, and start receiving messages, validating them, and writing the results. This may work for a while, but what if the rate at which producers write messages to the topic exceeds the rate at which your application can validate them? If you are limited to a single consumer reading and processing the data, your application may fall further and further behind, unable to keep up with the rate of incoming messages. Obviously there is a need to scale consumption from topics. Just like multiple producers can write to the same topic, we need to allow multiple consumers to read from the same topic, splitting the data between them.

Typically, Kafka consumers are part of a consumer group. When multiple consumers are subscribed to a topic and belong to the same consumer group, each consumer in the group will receive messages from a different subset of the partitions in the topic. For example, suppose there is a topic with four partitions, and we create a consumer C1 belonging to a consumer group called G1. If we use C1 to subscribe to the topic, C1 will get all messages from all four partitions, since it is the only consumer in G1. If we add a new consumer, C2, to group G1, each consumer will get messages from two partitions: perhaps messages from partitions 0 and 2 go to C1, and messages from partitions 1 and 3 go to C2. Naturally, if G1 has four consumers, then each will read messages from a single partition, because the number of consumers in G1 is equal to the number of partitions. If we add more consumers to a single group subscribed to a single topic than there are partitions, some of the consumers will be idle and get no messages at all.
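
As a sketch, here is what consumer C1 in group G1 might look like with the Java client (the topic name and broker address are illustrative assumptions; running a second copy of this program with the same group.id is all it takes to split the partitions between two instances):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class SimpleConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "G1"); // consumers sharing this id split the partitions
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("clicks"));
            while (true) {
                // poll() returns records only from the partitions currently
                // assigned to this consumer within group G1.
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```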

It is common for Kafka consumers to do high-latency operations such as writing to a database or performing a time-consuming computation on the data. In these cases, a single consumer can't possibly keep up with the rate at which data flows into a topic, and adding more consumers that share the load, with each consumer owning just a subset of the partitions and messages, is our main method of scaling. This is a good reason to create topics with a large number of partitions. Keep in mind, though, that there is no point in adding more consumers than you have partitions in a topic, because some of the consumers will just be idle.

In summary, you can add consumers to an existing consumer group to scale the reading and processing of messages from topics, and each additional consumer in the group will only get a subset of the messages.

As we saw in the previous section, consumers in a consumer group share ownership of the partitions in the topics they subscribe to. When we add a new consumer to the group, it starts consuming messages from partitions previously consumed by another consumer. The same thing happens when a consumer shuts down or crashes: it leaves the group, and the partitions it used to consume will be consumed by one of the remaining consumers. Moving partition ownership from one consumer to another is called a rebalance. Rebalances are important because they provide the consumer group with high availability and scalability, allowing us to easily and safely add and remove consumers, but in the normal course of events they can be fairly undesirable, because when one consumer fails, the other consumers in the same group are disrupted as well.
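
If an application needs to react when partition ownership moves, the Java client lets you pass a ConsumerRebalanceListener when subscribing. A minimal sketch, written as a fragment that would replace the plain subscribe() call in the consumer example above (the println bodies are placeholders):

```java
import java.util.Collection;
import java.util.Collections;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

consumer.subscribe(Collections.singletonList("clicks"), new ConsumerRebalanceListener() {
    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Called before ownership is taken away: a good place to commit
        // offsets or flush in-flight work for the partitions being lost.
        System.out.println("Revoked: " + partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Called after the rebalance, with this consumer's new partitions.
        System.out.println("Assigned: " + partitions);
    }
});
```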

There is another feature called static group membership. All of the above is true unless you configure a consumer with a unique group.instance.id, which makes the consumer a static member of the group. When a consumer first joins a consumer group as a static member, it is assigned a set of partitions according to the partition assignment strategy the group is using, as normal. However, when this consumer shuts down, it does not automatically leave the group; it remains a member until its session times out. When the consumer rejoins the group, it is recognized by its static identity and is reassigned the same partitions it previously held, without triggering a rebalance. The group coordinator, which caches the assignment for each member of the group, does not need to trigger a rebalance but can simply send the cached assignment to the rejoining static member.

Static group membership is useful when your application maintains local state or a cache that is populated from the partitions assigned to each consumer. When re-creating this cache is time-consuming, you don't want that process to happen every time a consumer restarts. On the flip side, because the partitions owned by each consumer will not get reassigned when that consumer restarts, for a certain duration no consumer will consume messages from those partitions. It is important to note that static members of consumer groups do not leave the group proactively when they shut down, and detecting when they are "really gone" depends on the session.timeout.ms configuration. You will want to set it high enough to avoid triggering rebalances on a simple application restart, but low enough to allow automatic reassignment of their partitions and avoid large gaps in processing those partitions.
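
Static membership is enabled purely through configuration. A minimal sketch extending the consumer settings from the earlier example (the instance id and the 5-minute timeout are illustrative assumptions, not recommendations):

```java
// Additional settings on the consumer Properties from the example above:

// A unique, stable id per instance makes this consumer a static member.
props.put("group.instance.id", "consumer-host-1"); // assumed naming scheme

// Static members are only considered gone once the session times out, so pick
// a value high enough to survive a restart without a rebalance, but low
// enough to avoid long gaps in processing the member's partitions.
props.put("session.timeout.ms", "300000"); // 5 minutes, an assumption
```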

Conclusion

It is time to summarize this article. Overall, it showed the basic concepts in Kafka, including the basic architecture and how producers and consumers work. I hope it will be helpful for you. Next time I will show more details about these concepts: how to configure them and how to write the code. See you!

 

    git进行了版本升级,添加了新的目录安全限制。造成在进行git常规操作时,或在各类编辑器如VSCode中无法发现.git文件,报错:fatal:unsaferepository(xxxisownedbysomeoneelse.)Toaddanexceptionforthisdirectory,callgitconfig–global–addsafe.directory方......