
Error when trying to display results with the PySpark show function

Tags: python apache-spark pyspark apache-spark-sql

I am trying to display my results in PySpark.

I am using Spark 3.5.1 and PySpark 3.5.1 with Java 8 installed, and everything appears to be set up correctly.

Some answers suggest adding this:

import findspark
findspark.init()

Or adding this to the configuration:

.config("spark.memory.offHeap.enabled","true") \
.config("spark.memory.offHeap.size","10g") \

But I still get this error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[42], line 3
      1 import findspark
      2 findspark.init()
----> 3 value_counts.show()

File ...new_env2\Lib\site-packages\pyspark\sql\dataframe.py:945, in DataFrame.show(self, n, truncate, vertical)
    885 def show(self, n: int = 20, truncate: Union[bool, int] = True, vertical: bool = False) -> None:
    886     """Prints the first ``n`` rows to the console.
    887 
    888     .. versionadded:: 1.3.0
   (...)
    943     name | Bob
    944     """
--> 945     print(self._show_string(n, truncate, vertical))

File ...new_env2\Lib\site-packages\pyspark\sql\dataframe.py:963, in DataFrame._show_string(self, n, truncate, vertical)
    957     raise PySparkTypeError(
    958         error_class="NOT_BOOL",
    959         message_parameters={"arg_name": "vertical", "arg_type": type(vertical).__name__},
    960     )
    962 if isinstance(truncate, bool) and truncate:
--> 963     return self._jdf.showString(n, 20, vertical)
    964 else:
    965     try:

File ...new_env2\Lib\site-packages\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File ....new_env2\Lib\site-packages\pyspark\errors\exceptions\captured.py:179, in capture_sql_exception.<locals>.deco(*a, **kw)
    177 def deco(*a: Any, **kw: Any) -> Any:
    178     try:
--> 179         return f(*a, **kw)
    180     except Py4JJavaError as e:
    181         converted = convert_exception(e.java_exception)

File ~\OneDrive\Bureau\Desktop\big_data\big_data\new_env2\Lib\site-packages\py4j\protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o194.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 3.0 failed 1 times, most recent failure: Lost task 2.0 in stage 3.0 (TID 50) (Zaineb executor driver): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:612)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:594)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:789)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.ContextAwareIterator.hasNext(ContextAwareIterator.scala:39)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1160)
    at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
    at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1214)
    at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1217)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:322)
    at org.apache.spark.sql.execution.python.BasePythonUDFRunner$PythonUDFWriterThread.writeIteratorToStream(PythonUDFRunner.scala:58)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:451)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1928)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:282)
Caused by: java.io.EOFException
    at java.base/java.io.DataInputStream.readInt(DataInputStream.java:398)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
    ... 26 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2856)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2792)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2791)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2791)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1247)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1247)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1247)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:3060)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2994)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2983)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:612)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:594)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:789)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.ContextAwareIterator.hasNext(ContextAwareIterator.scala:39)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$GroupedIterator.takeDestructively(Iterator.scala:1160)
    at scala.collection.Iterator$GroupedIterator.go(Iterator.scala:1176)
    at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1214)
    at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1217)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator.foreach(Iterator.scala:943)
    at scala.collection.Iterator.foreach$(Iterator.scala:943)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:322)
    at org.apache.spark.sql.execution.python.BasePythonUDFRunner$PythonUDFWriterThread.writeIteratorToStream(PythonUDFRunner.scala:58)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.$anonfun$run$1(PythonRunner.scala:451)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1928)
    at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:282)
Caused by: java.io.EOFException
    at java.base/java.io.DataInputStream.readInt(DataInputStream.java:398)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:774)
    ... 26 more

I have tried various answers from different forums for the same problem, but they all suggest trying findspark or checking which Java version is right for me.

Has anyone encountered the same error?


Judging from the stack trace, the root problem is not findspark or the off-heap memory configuration. The message "Python worker exited unexpectedly (crashed)" indicates that the Python worker process crashed while executing a Spark task. This can have many causes and needs further investigation.

Here are some steps you can try in order to debug the issue:

  1. Check the Python worker logs: The error message says that a Python worker process crashed, so it is important to check the worker logs for more detail about the cause. The log location depends on your Spark deployment, but in a standalone setup they can usually be found in the work directory on the worker nodes. Look for the log files corresponding to the failed task attempt.

  2. Simplify the code: Start from the simplest possible example, for instance loading a small dataset and just displaying it (see the first sketch after this list). If that works, the problem lies in a specific part of your code; add your own code back step by step until you find what triggers the crash.

  3. Resource limits: A crashing Python worker can be caused by resource limits, for example running out of memory. Try increasing the memory available to the Python workers, which you can do by setting the spark.executor.memory and spark.python.worker.memory configuration parameters (a configuration sketch follows the list).

  4. Dependency conflicts: Make sure all libraries installed in your Spark environment are compatible with your Spark and Python versions. Incompatible dependencies can lead to unexpected behaviour, including worker crashes. Try creating a fresh virtual environment and installing only the libraries your code actually needs (a quick version check is sketched after the list).

  5. Data issues: The data itself can crash a Python worker, for example when it contains corrupted records or unexpected values. Try some basic validation and cleaning of the data and see whether that resolves the problem (a validation sketch follows the list).

  6. Spark version: Although you did not mention using an old Spark version, certain releases do have known issues. Consider upgrading to the latest version, or downgrading to a version known to be stable, and see whether that resolves the problem.
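
As a starting point for step 2, here is a minimal sketch that does not touch your data at all; the app name and column names are made up for illustration:

from pyspark.sql import SparkSession

# Minimal sanity check: no external data, no Python UDFs.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("sanity-check") \
    .getOrCreate()

# A tiny in-memory DataFrame with made-up columns.
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.show()

# If this works, re-add your own steps one at a time (for example the
# aggregation that produced value_counts) until the crash reappears.
df.groupBy("key").count().show()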
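
For step 3, a configuration sketch with illustrative values only; size them to your machine. In local mode the driver and executor share one JVM, so spark.driver.memory is usually the setting that matters most:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("memory-tuning")
    .config("spark.driver.memory", "4g")          # driver/executor heap in local mode
    .config("spark.executor.memory", "4g")        # executor heap in cluster mode
    .config("spark.python.worker.memory", "2g")   # memory a Python worker may use before spilling
    .getOrCreate()
)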
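
For step 4, a quick way to confirm which interpreter and PySpark version are actually in use; whether PYSPARK_PYTHON / PYSPARK_DRIVER_PYTHON are set at all depends on your setup:

import os
import sys
import pyspark

# The driver and the workers must run compatible Python versions. These
# environment variables decide which interpreter Spark spawns for the workers;
# if they point at a different environment, workers can crash even though
# the driver code imports fine.
print("Python (driver):", sys.version)
print("PySpark:", pyspark.__version__)
print("PYSPARK_PYTHON:", os.environ.get("PYSPARK_PYTHON"))
print("PYSPARK_DRIVER_PYTHON:", os.environ.get("PYSPARK_DRIVER_PYTHON"))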
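
For step 5, a small validation sketch that uses only built-in functions (no Python UDFs); df stands for your own DataFrame:

from pyspark.sql import functions as F

# Count nulls per column with built-in expressions, which run in the JVM
# and should work even if a Python UDF is what crashes the worker.
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
)
null_counts.show()

# Try the failing operation on a small slice first; if a sample works but the
# full dataset does not, the crash is likely data-dependent or memory-related.
df.limit(100).show()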

If you still run into the problem after trying these steps, please provide the following additional information so I can offer more specific help:

  • Your Spark and Python versions
  • Your Spark cluster setup (local, standalone, YARN, etc.)
  • The code snippet you are trying to run
  • Any relevant error messages from the Python worker logs

More information will make it possible to pinpoint the cause of the problem more accurately.

From: 78795824
