I am trying to display my results in PySpark.
I am using Spark 3.5.1 and PySpark 3.5.1 with Java 8 installed, and everything appears to be set up correctly.
Some answers suggest adding this:
import findspark
findspark.init()
or adding this to the configuration:
.config("spark.memory.offHeap.enabled","true") \
.config("spark.memory.offHeap.size","10g") \
But I still get this error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
Cell In[42], line 3
1 import findspark
2 findspark.init()
----> 3 value_counts.show()
File ...new_env2\Lib\site-packages\pyspark\sql\dataframe.py:945, in DataFrame.show(self, n, truncate, vertical)
885 def show(self, n: int = 20, truncate: Union[bool, int] = True, vertical: bool = False) -> None:
886 """Prints the first ``n`` rows to the console.
887
888 .. versionadded:: 1.3.0
(...)
943 name | Bob
944 """
--> 945 print(self._show_string(n, truncate, vertical))
File ...new_env2\Lib\site-packages\pyspark\sql\dataframe.py:963, in DataFrame._show_string(self, n, truncate, vertical)
957 raise PySparkTypeError(
958 error_class="NOT_BOOL",
959 message_parameters={"arg_name": "vertical", "arg_type": type(vertical).__name__},
960 )
962 if isinstance(truncate, bool) and truncate:
--> 963 return self._jdf.showString(n, 20, vertical)
964 else:
965 try:
File ...new_env2\Lib\site-packages\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
1316 command = proto.CALL_COMMAND_NAME +\
1317 self.command_header +\
1318 args_command +\
1319 proto.END_COMMAND_PART
1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
1323 answer, self.gateway_client, self.target_id, self.name)
1325 for temp_arg in temp_args:
1326 if hasattr(temp_arg, "_detach"):
File ....new_env2\Lib\site-packages\pyspark\errors\exceptions\captured.py:179, in capture_sql_exception.<locals>.deco(*a, **kw)
177 def deco(*a: Any, **kw: Any) -> Any:
178 try:
--> 179 return f(*a, **kw)
180 except Py4JJavaError as e:
181 converted = convert_exception(e.java_exception)
File ~\OneDrive\Bureau\Desktop\big_data\big_data\new_env2\Lib\site-packages\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))
Py4JJavaError: An error occurred while calling o194.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 3.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3.0 (TID 50) (Zaineb executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:612)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:594)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:789)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.ContextAwareIterator.hasNext(ContextAwareIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1160)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1214)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1217)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:322)
at org.apache.spark.sql.execution.python.BasePythonUDFRunner$PythonUDFWriterThread.writeIteratorToStream(PythonUDFRunner.scala:58)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:451)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1928)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:282)
Caused by: java.io.EOFException
at java.base/java.io.DataInputStream.readInt(DataInputStream.java:398)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
... 26 more
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:612)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:594)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:789)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at org.apache.spark.ContextAwareIterator.hasNext(ContextAwareIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1160)
at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1214)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1217)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:322)
at org.apache.spark.sql.execution.python.BasePythonUDFRunner$PythonUDFWriterThread.writeIteratorToStream(PythonUDFRunner.scala:58)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:451)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1928)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:282)
Caused by: java.io.EOFException
at java.base/java.io.DataInputStream.readInt(DataInputStream.java:398)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
... 26 more
I have tried different answers from various forums for the same problem, but they all suggest trying findspark or checking that my Java version is appropriate.
Has anyone run into the same error?
Judging from the stack trace, the root problem is not `findspark` or the off-heap memory configuration. The message "Python worker exited unexpectedly (crashed)" means the Python worker process died while executing a Spark task. That can happen for a variety of reasons and needs further investigation.
Here are some steps you can try to debug this:

- **Check the Python worker logs:** The error indicates that a Python worker process crashed, so the worker logs are the most important place to look for details about why. The log location depends on your Spark deployment, but they can usually be found in the `work` directory on the Spark master node. Look for the log files corresponding to the failed task attempts.
- **Simplify the code:** Start from the simplest possible example, such as loading a small dataset and showing it (see the first sketch after this list). If that works, the problem lies in a specific part of your code; add pieces back step by step until you find what triggers the crash.
- **Resource limits:** A crashing Python worker can be caused by resource limits, such as running out of memory. Try increasing the memory allocated to the Python workers by setting the `spark.executor.memory` and `spark.python.worker.memory` configuration parameters (see the second sketch below).
- **Dependency conflicts:** Make sure every library installed in your Spark environment is compatible with your Spark and Python versions. Incompatible dependencies can cause unexpected behaviour, including worker crashes. Try creating a fresh virtual environment and installing only the libraries your code actually needs.
- **Data issues:** The data itself can crash the Python worker, for example if it contains corrupted records or unexpected values. Try some basic data validation and cleaning and see whether that resolves the problem (see the third sketch below).
- **Spark version:** Although nothing suggests you are on an old release, particular Spark versions sometimes have known issues. Consider upgrading to the latest version, or downgrading to a known-stable one, and see whether that helps.
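For the "simplify the code" step, a minimal sanity check could look like this (column names and values are made up); if it shows fine, the crash is likely tied to a specific transformation or UDF in your pipeline:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sanity_check").getOrCreate()

# A tiny in-memory DataFrame with made-up values
df = spark.createDataFrame([("a", 1), ("b", 2), ("c", 3)], ["key", "value"])

# If this shows fine, add your real transformations back one at a time
df.groupBy("key").count().show()
```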
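For the resource-limit step, the memory-related settings can be passed when the session is created; the sizes below are arbitrary examples, not recommendations:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Example sizes only -- tune them to what your machine can actually spare
conf = (
    SparkConf()
    .set("spark.executor.memory", "4g")         # JVM executor memory
    .set("spark.python.worker.memory", "2g")    # per-Python-worker memory hint
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```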
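And for the data-issues step, a rough screening pass might look like the sketch below (the file path and column names are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical file and column names -- replace with your own data
df = spark.read.csv("your_data.csv", header=True, inferSchema=True)

# Count nulls per column to spot obviously broken records
df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).show()

# Drop rows that are missing a key column, then retry the failing action
df.dropna(subset=["key_column"]).show(5)
```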
If you are still stuck after trying these steps, please add the following information so I can help more specifically:

- Your Spark and Python versions
- How your Spark cluster is set up (local, standalone, YARN, etc.)
- The code snippet you are trying to run
- Any relevant error messages from the Python worker logs
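For the first item, a quick way to collect the version information (assuming a session can be created at all) is:

```python
import sys
import pyspark
from pyspark.sql import SparkSession

print("Python :", sys.version)
print("PySpark:", pyspark.__version__)

# The JVM-side Spark version, reported once a session exists
spark = SparkSession.builder.getOrCreate()
print("Spark  :", spark.version)
```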
Providing this extra information will make it easier to pinpoint the cause.