
Building a dolphinscheduler + spark + hadoop image


The project needs to run Spark cluster tasks from ds (DolphinScheduler), and the deliverable is a single image, so all three components have to be packed into one image and configured accordingly.

 

1. Prepare the base images

Someone has already built a spark+hadoop image; see https://zhuanlan.zhihu.com/p/421375012 for details.

Pull that image:

docker pull s1mplecc/spark-hadoop:3

Then pull the ds-worker image:

docker pull apache/dolphinscheduler-worker:latest

 

2. Copy /opt/dolphinscheduler from a ds-worker container to the local machine

First start a ds-worker container (for details see the official ds docs: https://dolphinscheduler.apache.org/zh-cn/docs/3.2.1/guide/start/docker).
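A minimal sketch of a throwaway container just for this copy (the name ds-worker-tmp is illustrative; docker create works too, since docker cp does not need a running container):

docker run -d --name ds-worker-tmp apache/dolphinscheduler-worker:latest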

Then copy /opt/dolphinscheduler into the current directory, where c515 is the container id:

docker cp c515:/opt/dolphinscheduler .
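
A quick check that the copy landed (bin/start.sh is used again in step 6):

ls dolphinscheduler/bin
# start.sh should be among the scripts listed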

 

3. Write a Dockerfile and build the image

FROM s1mplecc/spark-hadoop:3

ENV DOCKER=true
ENV TZ=Asia/Shanghai
ENV DOLPHINSCHEDULER_HOME=/opt/dolphinscheduler

RUN apt-get update && \
    apt-get install -y --no-install-recommends sudo && \
    rm -rf /var/lib/apt/lists/*

WORKDIR $DOLPHINSCHEDULER_HOME

COPY ./dolphinscheduler $DOLPHINSCHEDULER_HOME

EXPOSE 1235

Then run:

docker build -t dolphin-spark-hadoop:1 .
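
A quick sanity check that both stacks now coexist in the image (the path follows from the Dockerfile; if the base image's entrypoint interferes, add --entrypoint ls):

docker run --rm dolphin-spark-hadoop:1 ls /opt/dolphinscheduler/bin
# the ds scripts should sit alongside the image's spark and hadoop installs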

 

4. Write docker-compose.yml

First, use the yml files provided by s1mplecc/spark-hadoop and by dolphinscheduler as references. They follow below.

s1mplecc/spark-hadoop:

version: '2'

services:
  spark:
    image: s1mplecc/spark-hadoop:3
    hostname: master
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - ~/docker/spark/share:/opt/share
    ports:
      - '8080:8080'
      - '4040:4040'
      - '8088:8088'
      - '8042:8042'
      - '9870:9870'
      - '19888:19888'
  spark-worker-1:
    image: s1mplecc/spark-hadoop:3
    hostname: worker1
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - ~/docker/spark/share:/opt/share
    ports:
      - '8081:8081'
  spark-worker-2:
    image: s1mplecc/spark-hadoop:3
    hostname: worker2
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    volumes:
      - ~/docker/spark/share:/opt/share
    ports:
      - '8082:8081'

 

And the one provided by dolphinscheduler:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

version: "3.8"

services:
  dolphinscheduler-postgresql:
    image: bitnami/postgresql:11.11.0
    ports:
      - "5432:5432"
    profiles: ["all", "schema"]
    environment:
      POSTGRESQL_USERNAME: root
      POSTGRESQL_PASSWORD: root
      POSTGRESQL_DATABASE: dolphinscheduler
    volumes:
      - dolphinscheduler-postgresql:/bitnami/postgresql
    healthcheck:
      test: ["CMD", "bash", "-c", "cat < /dev/null > /dev/tcp/127.0.0.1/5432"]
      interval: 5s
      timeout: 60s
      retries: 120
    networks:
      - dolphinscheduler

  dolphinscheduler-zookeeper:
    image: bitnami/zookeeper:3.6.2
    profiles: ["all"]
    environment:
      ALLOW_ANONYMOUS_LOGIN: "yes"
      ZOO_4LW_COMMANDS_WHITELIST: srvr,ruok,wchs,cons
    volumes:
      - dolphinscheduler-zookeeper:/bitnami/zookeeper
    healthcheck:
      test: ["CMD", "bash", "-c", "cat < /dev/null > /dev/tcp/127.0.0.1/2181"]
      interval: 5s
      timeout: 60s
      retries: 120
    networks:
      - dolphinscheduler

  dolphinscheduler-schema-initializer:
    image: ${HUB}/dolphinscheduler-tools:${TAG}
    env_file: .env
    profiles: ["schema"]
    command: [ tools/bin/upgrade-schema.sh ]
    depends_on:
      dolphinscheduler-postgresql:
        condition: service_healthy
    volumes:
      - dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - dolphinscheduler-shared-local:/opt/soft
      - dolphinscheduler-resource-local:/dolphinscheduler
    networks:
      - dolphinscheduler

  dolphinscheduler-api:
    image: ${HUB}/dolphinscheduler-api:${TAG}
    ports:
      - "12345:12345"
      - "25333:25333"
    profiles: ["all"]
    env_file: .env
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:12345/dolphinscheduler/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      dolphinscheduler-zookeeper:
        condition: service_healthy
    volumes:
      - dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - dolphinscheduler-shared-local:/opt/soft
      - dolphinscheduler-resource-local:/dolphinscheduler
    networks:
      - dolphinscheduler

  dolphinscheduler-alert:
    image: ${HUB}/dolphinscheduler-alert-server:${TAG}
    profiles: ["all"]
    env_file: .env
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:50053/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    volumes:
      - dolphinscheduler-logs:/opt/dolphinscheduler/logs
    networks:
      - dolphinscheduler

  dolphinscheduler-master:
    image: ${HUB}/dolphinscheduler-master:${TAG}
    profiles: ["all"]
    env_file: .env
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:5679/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      dolphinscheduler-zookeeper:
        condition: service_healthy
    volumes:
      - dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - dolphinscheduler-shared-local:/opt/soft
    networks:
      - dolphinscheduler

  dolphinscheduler-worker:
    image: ${HUB}/dolphinscheduler-worker:${TAG}
    profiles: ["all"]
    env_file: .env
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:1235/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      dolphinscheduler-zookeeper:
        condition: service_healthy
    volumes:
      - dolphinscheduler-worker-data:/tmp/dolphinscheduler
      - dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - dolphinscheduler-shared-local:/opt/soft
      - dolphinscheduler-resource-local:/dolphinscheduler
    networks:
      - dolphinscheduler

networks:
  dolphinscheduler:
    driver: bridge

volumes:
  dolphinscheduler-postgresql:
  dolphinscheduler-zookeeper:
  dolphinscheduler-worker-data:
  dolphinscheduler-logs:
  dolphinscheduler-shared-local:
  dolphinscheduler-resource-local:

 

All we have to do is replace the worker service in the ds deployment file with the dolphin-spark-hadoop image built in step 3. The finished yaml file:

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

version: "3.8"

services:
  dolphinscheduler-postgresql:
    image: bitnami/postgresql:11.11.0
    ports:
      - "5432:5432"
    profiles: ["all", "schema"]
    environment:
      POSTGRESQL_USERNAME: root
      POSTGRESQL_PASSWORD: root
      POSTGRESQL_DATABASE: dolphinscheduler
    volumes:
      - dolphinscheduler-postgresql:/bitnami/postgresql
    healthcheck:
      test: ["CMD", "bash", "-c", "cat < /dev/null > /dev/tcp/127.0.0.1/5432"]
      interval: 5s
      timeout: 60s
      retries: 120
    networks:
      - dolphinscheduler

  dolphinscheduler-zookeeper:
    image: bitnami/zookeeper:3.6.2
    profiles: ["all"]
    environment:
      ALLOW_ANONYMOUS_LOGIN: "yes"
      ZOO_4LW_COMMANDS_WHITELIST: srvr,ruok,wchs,cons
    volumes:
      - dolphinscheduler-zookeeper:/bitnami/zookeeper
    healthcheck:
      test: ["CMD", "bash", "-c", "cat < /dev/null > /dev/tcp/127.0.0.1/2181"]
      interval: 5s
      timeout: 60s
      retries: 120
    networks:
      - dolphinscheduler

  dolphinscheduler-schema-initializer:
    image: ${HUB}/dolphinscheduler-tools:${TAG}
    env_file: .env
    profiles: ["schema"]
    command: [ tools/bin/upgrade-schema.sh ]
    depends_on:
      dolphinscheduler-postgresql:
        condition: service_healthy
    volumes:
      - dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - dolphinscheduler-shared-local:/opt/soft
      - dolphinscheduler-resource-local:/dolphinscheduler
    networks:
      - dolphinscheduler

  dolphinscheduler-api:
    image: ${HUB}/dolphinscheduler-api:${TAG}
    ports:
      - "12345:12345"
      - "25333:25333"
    profiles: ["all"]
    env_file: .env
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:12345/dolphinscheduler/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      dolphinscheduler-zookeeper:
        condition: service_healthy
    volumes:
      - dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - dolphinscheduler-shared-local:/opt/soft
      - dolphinscheduler-resource-local:/dolphinscheduler
    networks:
      - dolphinscheduler

  dolphinscheduler-alert:
    image: ${HUB}/dolphinscheduler-alert-server:${TAG}
    profiles: ["all"]
    env_file: .env
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:50053/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    volumes:
      - dolphinscheduler-logs:/opt/dolphinscheduler/logs
    networks:
      - dolphinscheduler

  dolphinscheduler-master:
    image: ${HUB}/dolphinscheduler-master:${TAG}
    profiles: ["all"]
    env_file: .env
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:5679/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      dolphinscheduler-zookeeper:
        condition: service_healthy
    volumes:
      - dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - dolphinscheduler-shared-local:/opt/soft
    networks:
      - dolphinscheduler

  dolphinscheduler-worker-1:
    image: dolphin-spark-hadoop:1
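    # hostname must stay "master": the other worker services point at spark://master:7077,
    # and the image's Hadoop config addresses the cluster by these hostnames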
    hostname: master
    profiles: ["all"]
    env_file: .env
    environment:
      - SPARK_MODE=master
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:1235/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      dolphinscheduler-zookeeper:
        condition: service_healthy
    volumes:
      - dolphinscheduler-worker-data:/tmp/dolphinscheduler
      - dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - dolphinscheduler-shared-local:/opt/soft
      - dolphinscheduler-resource-local:/dolphinscheduler
    ports:
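      # 8080: Spark master UI, 4040: Spark application UI,
      # 8088/8042: YARN RM/NM UIs, 9870: HDFS NameNode UI, 19888: JobHistory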
      - '8080:8080'
      - '4040:4040'
      - '8088:8088'
      - '8042:8042'
      - '9870:9870'
      - '19888:19888'
    networks:
      - dolphinscheduler
    
  dolphinscheduler-worker-2:
    image: dolphin-spark-hadoop:1
    hostname: worker1
    profiles: ["all"]
    env_file: .env
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:1235/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      dolphinscheduler-zookeeper:
        condition: service_healthy
    volumes:
      - dolphinscheduler-worker-data:/tmp/dolphinscheduler
      - dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - dolphinscheduler-shared-local:/opt/soft
      - dolphinscheduler-resource-local:/dolphinscheduler
    ports:
      - '8081:8081'
    networks:
      - dolphinscheduler
      
  dolphinscheduler-worker-3:
    image: dolphin-spark-hadoop:1
    hostname: worker2
    profiles: ["all"]
    env_file: .env
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://master:7077
      - SPARK_WORKER_MEMORY=1G
      - SPARK_WORKER_CORES=1
      - SPARK_RPC_AUTHENTICATION_ENABLED=no
      - SPARK_RPC_ENCRYPTION_ENABLED=no
      - SPARK_LOCAL_STORAGE_ENCRYPTION_ENABLED=no
      - SPARK_SSL_ENABLED=no
    healthcheck:
      test: [ "CMD", "curl", "http://localhost:1235/actuator/health" ]
      interval: 30s
      timeout: 5s
      retries: 3
    depends_on:
      dolphinscheduler-zookeeper:
        condition: service_healthy
    volumes:
      - dolphinscheduler-worker-data:/tmp/dolphinscheduler
      - dolphinscheduler-logs:/opt/dolphinscheduler/logs
      - dolphinscheduler-shared-local:/opt/soft
      - dolphinscheduler-resource-local:/dolphinscheduler
    ports:
      - '8082:8081'
    networks:
      - dolphinscheduler

networks:
  dolphinscheduler:
    driver: bridge

volumes:
  dolphinscheduler-postgresql:
  dolphinscheduler-zookeeper:
  dolphinscheduler-worker-data:
  dolphinscheduler-logs:
  dolphinscheduler-shared-local:
  dolphinscheduler-resource-local:

Note: s1mplecc/spark-hadoop runs one master and two workers, i.e. 3 spark+hadoop nodes in total, so 3 matching ds-worker services have to be brought up. To run more worker nodes, the workers list in the Hadoop config has to be extended, which means either rebuilding the image or mounting a new workers file (see the sketch below), and the ds-worker services must again match the spark+hadoop nodes one to one.
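
A sketch of the mount-based variant, assuming the Hadoop config lives in the usual $HADOOP_HOME/etc/hadoop location inside the image (worth verifying for this base image):

mkdir -p hadoop-conf
# one hostname per line, matching the hostnames in the compose file
cat > hadoop-conf/workers <<'EOF'
worker1
worker2
worker3
EOF

Each spark+hadoop service would then mount this file over the image's copy, e.g. - ./hadoop-conf/workers:/opt/hadoop/etc/hadoop/workers (again an assumed in-container path), and one more ds-worker service would be added per extra node.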

 

5. Run docker-compose

Following the official ds docs:

# To initialize or upgrade the database schema, bring the stack up with the schema profile
$ docker-compose --profile schema up -d

# Start all dolphinscheduler services with the all profile
$ docker-compose --profile all up -d

Note: these must be run in the directory containing the yaml file written in step 4.
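
That compose file references ${HUB}/${TAG} for the ds images and every ds service loads env_file: .env, so the same directory also needs a .env file. A minimal sketch with assumed values; the .env shipped with the official ds release additionally carries database, zookeeper, and timezone settings:

HUB=apache
TAG=3.2.1   # match the ds version you deploy; the docs linked above are for 3.2.1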

 

6. Enter the containers and run the Hadoop cluster startup script and the ds-worker startup script

Inside the container (the image's WORKDIR is /opt/dolphinscheduler, so the Hadoop script is one level up, at /opt/start-hadoop.sh):

../start-hadoop.sh

Then run:

nohup /opt/dolphinscheduler/bin/start.sh > /dev/null 2>&1 &
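
Both steps can also be driven from the host; a sketch, assuming Compose v2 container naming (<project>-<service>-1) and that each of the three ds-worker containers runs its own worker process:

# start the Hadoop cluster from the master container (hostname "master")
docker exec <project>-dolphinscheduler-worker-1-1 /opt/start-hadoop.sh
# start the ds worker process in all three containers
for i in 1 2 3; do
  docker exec "<project>-dolphinscheduler-worker-$i-1" \
    bash -c 'nohup /opt/dolphinscheduler/bin/start.sh > /dev/null 2>&1 &'
done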

Shortly afterwards ds, spark, and hadoop are all up, and Spark tasks can be run from ds ^_^
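
For an end-to-end check of the Spark side, the stock SparkPi example can be submitted from the master container (assuming spark-submit is on the PATH and the examples jar ships with the image, as in typical Spark distributions):

docker exec <project>-dolphinscheduler-worker-1-1 bash -c \
  'spark-submit --master spark://master:7077 \
    --class org.apache.spark.examples.SparkPi \
    "$SPARK_HOME"/examples/jars/spark-examples_*.jar 10'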

 
