
Error when running a Spark-Shell program

Tags: python apache-spark pyspark apache-spark-sql

I am trying to create a Spark shell program, but I run into an error when executing it.

Below is the code I am executing.

from pyspark.sql import *
from pyspark import SparkConf
from lib.logger import Log4j

# conf = SparkConf()
# conf.set("spark.executor.extraJavaOptions", "-Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark")


if __name__ == "__main__":
    spark = SparkSession.builder \
            .appName("Hello Spark") \
            .master("local[3") \
            .getOrCreate()

    logger = Log4j(spark)
    logger.info("Starting HelloSpark")

    # your processing code

    logger.info("Finished HelloSpark")

    # spark.stop()

Python version: 3.12.4

PS C:\Spark\spark-3.5.1-bin-hadoop3> python --version
Python 3.12.4

Java version: 11.0.23

PS C:\Spark\spark-3.5.1-bin-hadoop3> Java --version
java 11.0.23 2024-04-16 LTS
Java(TM) SE Runtime Environment 18.9 (build 11.0.23+7-LTS-222)
Java HotSpot(TM) 64-Bit Server VM 18.9 (build 11.0.23+7-LTS-222, mixed mode)
PS C:\Spark\spark-3.5.1-bin-hadoop3>

Error:

Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.

Install the latest PowerShell for new features and improvements! https://aka.ms/PSWindows

PS C:\Spark\spark-3.5.1-bin-hadoop3> spark-submit --properties-file C:\Spark\spark-3.5.1-bin-hadoop3\conf\spark-defaults.conf 'C:\Users\JainRonit\OneDrive - STCO\Desktop\Personal\Study\Coding\Pyspark\02-Spark-First-Project\HelloSpark.py'
24/07/26 11:13:49 INFO SparkContext: Running Spark version 3.5.1
24/07/26 11:13:49 INFO SparkContext: OS info Windows 11, 10.0, amd64
24/07/26 11:13:49 INFO SparkContext: Java version 11.0.23
24/07/26 11:13:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
24/07/26 11:13:50 ERROR SparkContext: Error initializing SparkContext.
java.lang.Exception: spark.executor.extraJavaOptions is not allowed to set Spark options (was '-Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=HelloSpark'). Set them directly on a SparkConf or in a properties file when using ./bin/spark-submit.
        at org.apache.spark.SparkConf.$anonfun$validateSettings$4(SparkConf.scala:525)
        at org.apache.spark.SparkConf.$anonfun$validateSettings$4$adapted(SparkConf.scala:521)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:521)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:410)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:834)
24/07/26 11:13:50 INFO SparkContext: SparkContext is stopping with exitCode 0.
24/07/26 11:13:50 INFO SparkContext: Successfully stopped SparkContext
Traceback (most recent call last):
  File "C:\Users\JainRonit\OneDrive - STCO\Desktop\Personal\Study\Coding\Pyspark\02-Spark-First-Project\HelloSpark.py", line 13, in <module>
    .getOrCreate()
     ^^^^^^^^^^^^^
  File "C:\Spark\spark-3.5.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\sql\session.py", line 497, in getOrCreate
  File "C:\Spark\spark-3.5.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\context.py", line 515, in getOrCreate
  File "C:\Spark\spark-3.5.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\context.py", line 203, in __init__
  File "C:\Spark\spark-3.5.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\context.py", line 296, in _do_init
  File "C:\Spark\spark-3.5.1-bin-hadoop3\python\lib\pyspark.zip\pyspark\context.py", line 421, in _initialize_context
  File "C:\Spark\spark-3.5.1-bin-hadoop3\python\lib\py4j-0.10.9.7-src.zip\py4j\java_gateway.py", line 1587, in __call__
  File "C:\Spark\spark-3.5.1-bin-hadoop3\python\lib\py4j-0.10.9.7-src.zip\py4j\protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.Exception: spark.executor.extraJavaOptions is not allowed to set Spark options (was '-Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=HelloSpark'). Set them directly on a SparkConf or in a properties file when using ./bin/spark-submit.
        at org.apache.spark.SparkConf.$anonfun$validateSettings$4(SparkConf.scala:525)
        at org.apache.spark.SparkConf.$anonfun$validateSettings$4$adapted(SparkConf.scala:521)
        at scala.Option.foreach(Option.scala:407)
        at org.apache.spark.SparkConf.validateSettings(SparkConf.scala:521)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:410)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:834)

24/07/26 11:13:50 INFO ShutdownHookManager: Shutdown hook called
24/07/26 11:13:50 INFO ShutdownHookManager: Deleting directory C:\Users\JainRonit\AppData\Local\Temp\spark-0326d309-090a-4a5f-af13-d7fe347ab38d

The spark-defaults.conf file:

spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=HelloSpark

I tried running my code with these defaults set in spark-defaults.conf, but I hit the error above during execution.


The error message clearly points to the problem:

java.lang.Exception: spark.executor.extraJavaOptions is not allowed to set Spark options (was '-Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=HelloSpark'). Set them directly on a SparkConf or in a properties file when using ./bin/spark-submit.

Although you are using a spark-defaults.conf file, the value of spark.executor.extraJavaOptions contains Spark options (the -Dspark.* entries). Those are Spark configuration properties, not plain JVM options, and Spark refuses to accept them through the executor's extraJavaOptions.

To fix this, you need to set these properties directly on your SparkConf object, or pass them to spark-submit with the --conf option.
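
For the --conf route, a possible invocation is sketched below (assuming you keep the same three settings; note that the spark.* entry is passed as its own property rather than being embedded in extraJavaOptions, which is exactly what the validation rejects):

spark-submit --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties -Dlogfile.name=HelloSpark" --conf spark.yarn.app.container.log.dir=app-logs 'C:\Users\JainRonit\OneDrive - STCO\Desktop\Personal\Study\Coding\Pyspark\02-Spark-First-Project\HelloSpark.py'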

Here is the adjusted code along with the steps to run it:

1. Modify the HelloSpark.py code:

from pyspark.sql import SparkSession
from pyspark import SparkConf
from lib.logger import Log4j

if __name__ == "__main__":
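    # Set the app name, master URL, and the logging-related properties directly on
    # SparkConf instead of packing them into spark.executor.extraJavaOptions.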
    conf = SparkConf() \
        .setAppName("Hello Spark") \
        .setMaster("local[3]") \
        .set("log4j.configuration", "file:log4j.properties") \
        .set("spark.yarn.app.container.log.dir", "app-logs") \
        .set("logfile.name", "HelloSpark")

    spark = SparkSession.builder \
            .config(conf=conf) \
            .getOrCreate()

    logger = Log4j(spark)
    logger.info("Starting HelloSpark")

    # your processing code

    logger.info("Finished HelloSpark")

    # spark.stop()

In this modified code:

  • We create a SparkConf object and set the application name, the master URL, and the logging-related properties directly on it.
  • We then pass the SparkConf object to SparkSession.builder so that these settings are applied (an equivalent builder-only form is sketched below).
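
If you prefer not to build a separate SparkConf object, the same settings can also be passed straight to the builder with config(key, value). A minimal sketch of an equivalent form, assuming the same property names as above:

spark = SparkSession.builder \
        .appName("Hello Spark") \
        .master("local[3]") \
        .config("log4j.configuration", "file:log4j.properties") \
        .config("spark.yarn.app.container.log.dir", "app-logs") \
        .config("logfile.name", "HelloSpark") \
        .getOrCreate()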

2. Run the code with spark-submit:

spark-submit 'C:\Users\JainRonit\OneDrive - STCO\Desktop\Personal\Study\Coding\Pyspark\02-Spark-First-Project\HelloSpark.py'

Note that the --properties-file option is no longer needed, because the configuration is now set directly in the code.
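
To confirm that these settings were actually picked up, you can read them back from the running session. A small check (a sketch, assuming it is placed after getOrCreate() in the script above):

    conf_check = spark.sparkContext.getConf()
    # Both keys should print the values set above instead of "not set"
    print(conf_check.get("spark.yarn.app.container.log.dir", "not set"))
    print(conf_check.get("logfile.name", "not set"))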

With these changes, you should be able to run your Spark application successfully with logging configured correctly.

From: 78796536
