SeaTunnel Study Notes

Date: 2023-03-25 17:57:44
Tags: SeaTunnel, synchronization, notes, study, connectors, Spark, data

1 Introduction

About SeaTunnel

SeaTunnel is an easy-to-use, high-performance distributed data integration platform that supports real-time synchronization of massive data. It can stably and efficiently synchronize tens of billions of records every day, and has been used in production by nearly 100 companies.


Why do we need SeaTunnel?

SeaTunnel focuses on data integration and data synchronization, and is mainly designed to solve common problems in the field of data integration:
Various data sources: There are hundreds of commonly used data sources, and their versions are often incompatible with one another. As new technologies emerge, more data sources keep appearing. It is difficult for users to find a tool that supports all of these data sources fully and quickly.
Complex synchronization scenarios: Data synchronization needs to support various scenarios such as offline full synchronization, offline incremental synchronization, CDC, real-time synchronization, and full database synchronization.
High resource demand: Existing data integration and data synchronization tools often require vast computing resources or JDBC connection resources to complete real-time synchronization of a massive number of small tables. This increases the burden on enterprises to a certain extent.
Lack of quality and monitoring: Data integration and synchronization processes often lose or duplicate data. The synchronization process lacks monitoring, so it is impossible to see intuitively what is actually happening to the data while a task runs.
Complex technology stack: Enterprises use different technology components, and users need to develop a corresponding synchronization program for each component to complete data integration.
Difficulty in management and maintenance: Constrained by different underlying technology components (Flink/Spark), offline synchronization and real-time synchronization often have to be developed and managed separately, which increases the difficulty of management and maintenance.


Features of SeaTunnel

Rich and extensible Connector: SeaTunnel provides a Connector API that does not depend on a specific execution engine. Connectors (Source, Transform, Sink) developed against this API can run on many different engines, such as the currently supported SeaTunnel Engine, Flink, and Spark.
Connector plug-in: The plug-in design allows users to easily develop their own Connector and integrate it into the SeaTunnel project. SeaTunnel currently supports more than 70 Connectors, and the number is growing quickly. The project documentation lists the currently supported connectors.
Batch-stream integration: Connectors developed against the SeaTunnel Connector API are compatible with offline synchronization, real-time synchronization, full synchronization, incremental synchronization, and other scenarios. This greatly reduces the difficulty of managing data integration tasks.
Distributed snapshots: SeaTunnel supports a distributed snapshot algorithm to ensure data consistency.
Multi-engine support: SeaTunnel uses SeaTunnel Engine for data synchronization by default. SeaTunnel also supports using Flink or Spark as the execution engine of the Connector, to fit the technical components an enterprise already has. Multiple versions of Spark and Flink are supported.
JDBC multiplexing, database log multi-table parsing: SeaTunnel supports multi-table or whole-database synchronization, which solves the problem of excessive JDBC connections. It also supports multi-table or whole-database log reading and parsing, which avoids repeated reading and parsing of logs in CDC multi-table synchronization scenarios.
High throughput and low latency: SeaTunnel supports parallel reading and writing, providing stable and reliable data synchronization with high throughput and low latency.
Real-time monitoring: SeaTunnel exposes detailed monitoring information for each step of the data synchronization process, so users can easily see the number of records, data size, QPS, and other information read and written by a synchronization task.
Two job development methods: coding and canvas design. The SeaTunnel web project (https://github.com/apache/incubator-seatunnel-web) provides visual management of jobs, scheduling, running, and monitoring.
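
To make the "engine-independent Connector" idea more concrete, here is a minimal Python sketch. It is not SeaTunnel's actual Connector API (which is written in Java); every class and function name below (`ListSource`, `UppercaseName`, `CollectSink`, `run_local`, `run_batched`) is hypothetical and exists only for illustration. The point is that a source, a transform, and a sink written against small interfaces can be driven by two different toy "engines" and still produce the same result.

```python
from typing import Dict, Iterator, List, Protocol

Row = Dict[str, object]


class Source(Protocol):
    def read(self) -> Iterator[Row]: ...


class Transform(Protocol):
    def apply(self, row: Row) -> Row: ...


class Sink(Protocol):
    def write(self, row: Row) -> None: ...


class ListSource:
    """Hypothetical source: yields rows from an in-memory list."""

    def __init__(self, rows: List[Row]) -> None:
        self.rows = rows

    def read(self) -> Iterator[Row]:
        yield from self.rows


class UppercaseName:
    """Hypothetical transform: upper-cases the 'name' field of each row."""

    def apply(self, row: Row) -> Row:
        return {**row, "name": str(row["name"]).upper()}


class CollectSink:
    """Hypothetical sink: collects written rows in a list."""

    def __init__(self) -> None:
        self.written: List[Row] = []

    def write(self, row: Row) -> None:
        self.written.append(row)


def run_local(source: Source, transforms: List[Transform], sink: Sink) -> int:
    """Toy engine 1: process rows one at a time; return rows written."""
    count = 0
    for row in source.read():
        for t in transforms:
            row = t.apply(row)
        sink.write(row)
        count += 1
    return count


def run_batched(source: Source, transforms: List[Transform], sink: Sink,
                batch_size: int = 2) -> int:
    """Toy engine 2: buffer rows into batches before processing them."""
    count = 0
    batch: List[Row] = []

    def flush() -> None:
        nonlocal count
        for row in batch:
            for t in transforms:
                row = t.apply(row)
            sink.write(row)
            count += 1
        batch.clear()

    for row in source.read():
        batch.append(row)
        if len(batch) >= batch_size:
            flush()
    flush()  # write any rows left in a partial batch
    return count
```

Running the same `ListSource`, `UppercaseName`, and a fresh `CollectSink` through `run_local` and then `run_batched` gives identical output, which is the property the Connector API is described as providing across SeaTunnel Engine, Flink, and Spark. The returned row count is a stand-in for the kind of per-task metrics mentioned in the monitoring feature.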


From: https://www.cnblogs.com/route/p/17255240.html
