
Deploying the Apache Amoro Data Lake Management and Governance Tool


I. Introduction to Amoro

On March 11, 2024, the Amoro project passed its vote and officially entered the Apache Software Foundation (ASF) Incubator, becoming an ASF incubating project.

Amoro is a lakehouse management system built on top of open data lake table formats. Starting in 2020, the NetEase big data team explored a lakehouse architecture internally based on Apache Iceberg and incubated the streaming lakehouse service Arctic.

Official website: https://amoro.apache.org/

II. Installation

Note: if this is an upgrade of an existing installation, stop the service first and back it up before continuing.

1. Download the Amoro package (as the root user)

cd /root/wang

wget https://******/amoro/amoro-0.7.0-gaotu.tar.gz

2. Extract the package (as the root user)

tar -zxf amoro-0.7.0-gaotu.tar.gz

mv amoro-0.7.0 amoro

3. Download the MySQL JDBC driver

cd /root/wang/amoro/lib

MYSQL_JDBC_DRIVER_VERSION=8.0.30
wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/${MYSQL_JDBC_DRIVER_VERSION}/mysql-connector-java-${MYSQL_JDBC_DRIVER_VERSION}.jar
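
An optional sanity check that the driver actually landed in Amoro's lib directory (the path follows the layout used in this walkthrough):

# Confirm the MySQL connector jar is present in Amoro's lib directory
ls -lh /root/wang/amoro/lib/mysql-connector-java-*.jar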

4. Create the amoro database

mysql -h127.0.0.1 -uroot -p123456

CREATE DATABASE IF NOT EXISTS amoro;
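
The example above connects as root. In practice you may prefer a dedicated account for AMS; a minimal sketch (the amoro user name and password below are placeholders, not values from the original setup):

mysql -h127.0.0.1 -uroot -p123456 <<'SQL'
-- Dedicated account for AMS instead of reusing root (names are illustrative)
CREATE USER IF NOT EXISTS 'amoro'@'%' IDENTIFIED BY 'amoro_password';
GRANT ALL PRIVILEGES ON amoro.* TO 'amoro'@'%';
FLUSH PRIVILEGES;
SQL

If you use a dedicated account, update the username and password in the database section of the configuration in the next step accordingly.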

5. Edit the configuration (you can copy over a previous configuration file directly)

cd /root/wang/amoro/conf

Items to modify: server-expose-host (this host's internal IP), the http-server bind-port (service port 9092), and the url: jdbc:mysql entry (MySQL username and password).

ams:
  admin-username: admin
  admin-password: admin
  server-bind-host: "0.0.0.0"
  server-expose-host: "<local IP address>"
 
  thrift-server:
    max-message-size: 104857600 # 100MB
    selector-thread-count: 2
    selector-queue-size: 4
    table-service:
      bind-port: 1260
      worker-thread-count: 20
    optimizing-service:
      bind-port: 1261
 
  http-server:
    bind-port: 9092
    rest-auth-type: basic
 
  refresh-external-catalogs:
    interval: 180000 # 3min
    thread-count: 10
    queue-size: 1000000
 
  refresh-tables:
    thread-count: 10
    interval: 60000 # 1min
 
  self-optimizing:
    commit-thread-count: 10
    runtime-data-keep-days: 30
    runtime-data-expire-interval-hours: 1
 
  optimizer:
    heart-beat-timeout: 60000 # 1min
    task-ack-timeout: 30000 # 30s
    polling-timeout: 3000 # 3s
    max-planning-parallelism: 1 # default 1
 
  blocker:
    timeout: 60000 # 1min
 
  # optional features
  expire-snapshots:
    enabled: true
    thread-count: 10
 
  clean-orphan-files:
    enabled: true
    thread-count: 10
 
  clean-dangling-delete-files:
    enabled: true
    thread-count: 10
 
  sync-hive-tables:
    enabled: true
    thread-count: 10
 
  data-expiration:
    enabled: false
    thread-count: 10
    interval: 1d
 
  auto-create-tags:
    enabled: true
    thread-count: 3
    interval: 60000 # 1min
 
#  database:
#    type: derby
#    jdbc-driver-class: org.apache.derby.jdbc.EmbeddedDriver
#    url: jdbc:derby:/root/amoro/derby-persistent;create=true
#    connection-pool-max-total: 20
#    connection-pool-max-idle: 16
#    connection-pool-max-wait-millis: 1000
 
#    MySQL database configuration.
  database:
    type: mysql
    jdbc-driver-class: com.mysql.cj.jdbc.Driver
    url: jdbc:mysql://127.0.0.1:3306/amoro?useUnicode=true&characterEncoding=UTF8&autoReconnect=true&useAffectedRows=true&allowPublicKeyRetrieval=true&useSSL=false
    username: root
    password: 123456
    connection-pool-max-total: 20
    connection-pool-max-idle: 16
    connection-pool-max-wait-millis: 1000
 
  #  Postgres database configuration.
  #  database:
  #    type: postgres
  #    jdbc-driver-class: org.postgresql.Driver
  #    url: jdbc:postgresql://127.0.0.1:5432/db
  #    username: user
  #    password: passwd
  #    connection-pool-max-total: 20
  #    connection-pool-max-idle: 16
  #    connection-pool-max-wait-millis: 1000
 
  terminal:
    backend: local
    local.spark.sql.iceberg.handle-timestamp-without-timezone: false
 
#  Kyuubi terminal backend configuration.
#  terminal:
#    backend: kyuubi
#    kyuubi.jdbc.url: jdbc:hive2://127.0.0.1:10009/
 
#  High availability configuration.
#  ha:
#    enabled: true
#    cluster-name: default
#    zookeeper-address: 192.168.88.170:2181,192.168.88.104:2182,192.168.88.164:2183
 
 
containers:
  - name: localContainer
    container-impl: org.apache.amoro.server.manager.LocalOptimizerContainer
    properties:
      export.JAVA_HOME: "/usr/local/jdk"   # JDK environment
 
#containers:
 
# - name: KubernetesContainer
#   container-impl: org.apache.amoro.server.manager.KubernetesOptimizerContainer
#    properties:
#     kube-config-path: ~/.kube/config
#     image: apache/amoro:{version}
#     namespace: default
 
  - name: flinkContainer
    container-impl: org.apache.amoro.server.manager.FlinkOptimizerContainer
    properties:
      flink-home: /usr/local/service/flink/                                     # Flink install home
      target: yarn-per-job                                        # Flink run target, (yarn-per-job, yarn-application, kubernetes-application)
      export.JVM_ARGS: -Djava.security.krb5.conf=/etc/krb5.conf   # Flink launch JVM args, e.g. Kerberos config when using Kerberos
      export.HADOOP_CONF_DIR: /usr/local/service/hadoop/etc/hadoop/                   # Hadoop config dir
      export.HADOOP_USER_NAME: hadoop                             # Hadoop user submit on yarn
      export.FLINK_CONF_DIR: /usr/local/service/flink/conf/                     # Flink config dir
#      # flink kubernetes application properties.
#      job-uri: "local:///opt/flink/usrlib/optimizer-job.jar"      # Optimizer job main jar for kubernetes application
#      flink-conf.kubernetes.container.image: "apache/amoro-flink-optimizer:{version}"   # Optimizer image ref
#      flink-conf.kubernetes.service-account: flink                # Service account that is used within kubernetes cluster.
      flink-conf.jobmanager.memory.process.size: 1024M
      flink-conf.taskmanager.memory.process.size: 1024M
 
 
#containers:
  - name: sparkContainer
    container-impl: org.apache.amoro.server.manager.SparkOptimizerContainer
    properties:
      spark-home: /usr/local/service/spark/                                     # Spark install home
      master: yarn                                                # The cluster manager to connect to. See the list of https://spark.apache.org/docs/latest/submitting-applications.html#master-urls.
      deploy-mode: cluster                                        # Spark deploy mode, client or cluster
      export.JVM_ARGS: -Djava.security.krb5.conf=/etc/krb5.conf   # Spark launch JVM args, e.g. Kerberos config when using Kerberos
      export.HADOOP_CONF_DIR: /usr/local/service/hadoop/etc/hadoop/                   # Hadoop config dir
      export.HADOOP_USER_NAME: hadoop                             # Hadoop user submit on yarn
      export.SPARK_CONF_DIR: /usr/local/service/spark/conf/                     # Spark config dir
#      # spark kubernetes application properties.
#      job-uri: "local:///opt/spark/usrlib/optimizer-job.jar"      # Optimizer job main jar for kubernetes application
#      ams-optimizing-uri: thrift://ams.amoro.service.local:1261   # AMS optimizing uri
#      spark-conf.spark.dynamicAllocation.enabled: "true"          # Enabling DRA feature can make full use of computing resources
      spark-conf.spark.shuffle.service.enabled: "true"           # If spark DRA is used on kubernetes, we should set it false
      spark-conf.spark.dynamicAllocation.shuffleTracking.enabled: "true"                          # Enables shuffle file tracking for executors, which allows dynamic allocation without the need for an external shuffle service
#      spark-conf.spark.kubernetes.container.image: "apache/amoro-spark-optimizer:{version}"       # Optimizer image ref
#      spark-conf.spark.kubernetes.namespace: <spark-namespace>                                    # Namespace that is used within kubernetes cluster
#      spark-conf.spark.kubernetes.authenticate.driver.serviceAccountName: <spark-sa>              # Service account that is used within kubernetes cluster.
      spark-conf.spark.driver.userClassPathFirst: "true"
      spark-conf.spark.executor.userClassPathFirst: "true"
      spark-conf.spark.executor.instances: 1
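
Before moving on, a quick way to double-check that the items listed above were actually changed; the file name config.yaml is the usual AMS configuration file name and is assumed here:

# Show the values that need to be customized in the AMS configuration
grep -nE 'server-expose-host|bind-port|jdbc:mysql|username|password' /root/wang/amoro/conf/config.yaml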

6. Move the installation to the service directory

cd /root/wang

cp -R amoro /usr/local/service/

7. Adjust directory ownership and permissions

cd /usr/local/service/

chown hadoop:hadoop amoro -R

chmod 755 -R amoro

8. Service management (as the hadoop user)

sudo su - hadoop

cd  /usr/local/service/amoro/bin

Start the service: sh ams.sh start

Stop the service: sh ams.sh stop

Restart the service: sh ams.sh restart
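
A minimal post-start check, assuming the ports configured above (1260/1261 for the Thrift services, 9092 for the web UI) and that AMS writes its logs under the installation directory:

# Verify AMS is listening on its Thrift and HTTP ports
ss -ltnp | grep -E '1260|1261|9092'

# Tail the server log for startup errors (the logs/ path under the install dir is an assumption)
tail -n 100 /usr/local/service/amoro/logs/*.log

# The web UI should respond on the http-server port (login admin/admin as configured above)
curl -I http://127.0.0.1:9092/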

 

 

III. Management

1. Automatic governance is disabled by default. Set the following table properties to roll governance out gradually to selected tables:

alter table data_lake_ods.test_table set tblproperties (
  'self-optimizing.enabled' = 'true',
  'clean-dangling-delete-files.enabled' = 'true',
  'clean-orphan-file.enabled' = 'true',
  'table-expire.enabled' = 'true'
);
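
To confirm the properties took effect, you can list them again from the same engine; a sketch using the spark-sql CLI, assuming your session is already configured with the catalog that holds this table:

# List the table's properties and check that the governance flags are set
spark-sql -e "show tblproperties data_lake_ods.test_table;"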

2. Calling the REST API

curl -H "Authorization: Basic 替换符" http://127.0.0.1:9092/api/ams/v1/optimize/optimizerGroups/all/optimizers

How to generate the Authorization value (the <token> placeholder above): Base64(username:password)
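
For example, with the admin/admin credentials configured earlier, the token can be generated and the call issued as follows (a sketch; adjust credentials, host, and port to your deployment):

# Base64-encode "username:password" and query the optimizers endpoint
TOKEN=$(echo -n 'admin:admin' | base64)
curl -H "Authorization: Basic ${TOKEN}" http://127.0.0.1:9092/api/ams/v1/optimize/optimizerGroups/all/optimizers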

 

 

 

 

IV. Articles

1. NetEase's lakehouse management system Amoro enters the Apache Incubator

https://www.sohu.com/a/767189247_355140

 

From: https://www.cnblogs.com/robots2/p/18339309
