I am trying to display my results in PySpark.
I am using Spark 3.5.1 and PySpark 3.5.1 with Java 8 installed, and everything appears to be set up correctly.
Some answers suggest adding this:
import findspark
findspark.init()
or adding this to the configuration:
.config("spark.memory.offHeap.enabled","true") \
.config("spark.memory.offHeap.size","10g") \
But I still get this error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
Cell In[42], line 3
1 import findspark
2 findspark.init()
----> 3 value_counts.show()
File ...new_env2\Lib\site-packages\pyspark\sql\dataframe.py:945, in DataFrame.show(self, n, truncate, vertical)
885 def show(self, n: int = 20, truncate: Union[bool, int] = True, vertical: bool = False) -> None:
886 """Prints the first ``n`` rows to the console.
887
888 .. versionadded:: 1.3.0
(...)
943 name | Bob
944 """
--> 945 print(self._show_string(n, truncate, vertical))
File ...new_env2\Lib\site-packages\pyspark\sql\dataframe.py:963, in DataFrame._show_string(self, n, truncate, vertical)
957 raise PySparkTypeError(
958 error_class="NOT_BOOL",
959 message_parameters={"arg_name": "vertical", "arg_type": type(vertical).__name__},
960 )
962 if isinstance(truncate, bool) and truncate:
--> 963 return self._jdf.showString(n, 20, vertical)
964 else:
965 try:
File ...new_env2\Lib\site-packages\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
1316 command = proto.CALL_COMMAND_NAME +\
1317 self.command_header +\
1318 args_command +\
1319 proto.END_COMMAND_PART
1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
1323 answer, self.gateway_client, self.target_id, self.name)
1325 for temp_arg in temp_args:
1326 if hasattr(temp_arg, "_detach"):
File ....new_env2\Lib\site-packages\pyspark\errors\exceptions\captured.py:179, in capture_sql_exception.<locals>.deco(*a, **kw)
177 def deco(*a: Any, **kw: Any) -> Any:
178 try:
--> 179 return f(*a, **kw)
180 except Py4JJavaError as e:
181 converted = convert_exception(e.java_exception)
File ~\OneDrive\Bureau\Desktop\big_data\big_data\new_env2\Lib\site-packages\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))
Py4JJavaError: An error occurred while calling o194.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 3.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3.0 (TID 50) (Zaineb executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:612)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:594)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:789)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.ContextAwareIterator.hasNext(ContextAwareIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1160)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1214)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1217)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:322)
at org.apache.spark.sql.execution.python.BasePythonUDFRunner$PythonUDFWriterThread.writeIteratorToStream(PythonUDFRunner.scala:58)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:451)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1928)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:282)
Caused by: java.io.EOFException
at java.base/java.io.DataInputStream.readInt(DataInputStream.java:398)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
... 26 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:612)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:594)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:789)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.ContextAwareIterator.hasNext(ContextAwareIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1160)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1214)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1217)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:322)
at org.apache.spark.sql.execution.python.BasePythonUDFRunner$PythonUDFWriterThread.writeIteratorToStream(PythonUDFRunner.scala:58)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:451)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1928)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:282)
Caused by: java.io.EOFException
at java.base/java.io.DataInputStream.readInt(DataInputStream.java:398)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
... 26 more
I have tried different answers from various forums for the same problem, but they all suggest trying findspark or checking that my Java version is appropriate.
Has anyone run into the same error?
Judging from the stack trace, the root problem is not `findspark` or the off-heap memory configuration. The message "Python worker exited unexpectedly (crashed)" means the Python worker process died while executing a Spark task. That can happen for a variety of reasons and needs further investigation.
Here are some steps you can try to debug this:

- **Check the Python worker logs:** The error indicates that a Python worker process crashed, so the worker logs are the most important place to look for details about why. The log location depends on your Spark deployment, but they can usually be found in the `work` directory on the Spark master node. Look for the log files corresponding to the failed task attempts.
- **Simplify the code:** Start from the simplest possible example, such as loading a small dataset and showing it (see the first sketch after this list). If that works, the problem lies in a specific part of your code; add pieces back step by step until you find what triggers the crash.
- **Resource limits:** A crashing Python worker can be caused by resource limits, such as running out of memory. Try increasing the memory allocated to the Python workers by setting the `spark.executor.memory` and `spark.python.worker.memory` configuration parameters (see the second sketch below).
- **Dependency conflicts:** Make sure every library installed in your Spark environment is compatible with your Spark and Python versions. Incompatible dependencies can cause unexpected behaviour, including worker crashes. Try creating a fresh virtual environment and installing only the libraries your code actually needs.
- **Data issues:** The data itself can crash the Python worker, for example if it contains corrupted records or unexpected values. Try some basic data validation and cleaning and see whether that resolves the problem (see the third sketch below).
- **Spark version:** Although nothing suggests you are on an old release, particular Spark versions sometimes have known issues. Consider upgrading to the latest version, or downgrading to a known-stable one, and see whether that helps.
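For the "simplify the code" step, a minimal sanity check could look like this (column names and values are made up); if it shows fine, the crash is likely tied to a specific transformation or UDF in your pipeline:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sanity_check").getOrCreate()

# A tiny in-memory DataFrame with made-up values
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["key", "value"])

# If this shows fine, add your real transformations back one at a time
df.groupBy("key").count().show()
```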
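For the resource-limit step, the memory-related settings can be passed when the session is created; the sizes below are arbitrary examples, not recommendations:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Example sizes only -- tune them to what your machine can actually spare
conf = (
    SparkConf()
    .set("spark.executor.memory", "4g")         # JVM executor memory
    .set("spark.python.worker.memory", "2g")    # per-Python-worker memory hint
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```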
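And for the data-issues step, a rough screening pass might look like the sketch below (the file path and column names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical file and column names -- replace with your own data
df = spark.read.csv("your_data.csv", header=True, inferSchema=True)

# Count nulls per column to spot obviously broken records
df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).show()

# Drop rows that are missing a key column, then retry the failing action
df.dropna(subset=["key_column"]).show(5)
```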
If you are still stuck after trying these steps, please add the following information so I can help more specifically:

- Your Spark and Python versions
- How your Spark cluster is set up (local, standalone, YARN, etc.)
- The code snippet you are trying to run
- Any relevant error messages from the Python worker logs
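For the first item, a quick way to collect the version information (assuming a session can be created at all) is:

```python
import sys
import pyspark
from pyspark.sql import SparkSession

print("Python :", sys.version)
print("PySpark:", pyspark.__version__)

# The JVM-side Spark version, reported once a session exists
spark = SparkSession.builder.getOrCreate()
print("Spark  :", spark.version)
```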
Providing this extra information will make it easier to pinpoint the cause.