PySpark DataFrame does not return rows whose value has more than 8 digits

Posted: 2024-07-25 13:12:48
Tags: python dataframe apache-spark pyspark apache-spark-sql

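I created a sample DataFrame in PySpark in which the ID column contains some values with more than 8 digits. However, the filter only returns rows whose ID has 8 digits or fewer. Can anyone suggest how to write the code so that it returns every row that matches the condition?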

# importing sparksession from pyspark.sql module 
from pyspark.sql import SparkSession 
  
# creating sparksession and giving an app name 
spark = SparkSession.builder.appName('sparkdf').getOrCreate() 
data = [["2116722", "sravan", "company 1"], 
        ["2716722", "ojaswi", "company 2"], 
        ["2119722", "bobby", "company 3"], 
        ["21156311722", "sravan", "company 1"], 
        ["21422", "ojaswi", None], 
        ["2216722", "rohith", "company 2"], 
        ["3116722672", "gnanesh", "company 1"], 
        ["2156722", None, "company 2"], 
        ["4115666122", "bobby", "company 3"], 
        ["21190745", "rohith", "company 2"]] 
  
# specify column names 
columns = ['ID', 'Employee NAME', 'Company Name'] 
  
# creating a dataframe from the lists of data 
dataframe = spark.createDataFrame(data, columns) 
dataframe.where(dataframe["ID"] > 100000).show()

Output:

+--------+-------------+------------+
|      ID|Employee NAME|Company Name|
+--------+-------------+------------+
| 2116722|       sravan|   company 1|
| 2716722|       ojaswi|   company 2|
| 2119722|        bobby|   company 3|
| 2216722|       rohith|   company 2|
| 2156722|         NULL|   company 2|
|21190745|       rohith|   company 2|
+--------+-------------+------------+

Expected output:

+-----------+-------------+------------+
|         ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|    2116722|       sravan|   company 1|
|    2716722|       ojaswi|   company 2|
|    2119722|        bobby|   company 3|
|    2216722|       rohith|   company 2|
|    2156722|         NULL|   company 2|
|   21190745|       rohith|   company 2|
|21156311722|       sravan|   company 1|
| 4115666122|        bobby|   company 3|
| 3116722672|      gnanesh|   company 1|
+-----------+-------------+------------+


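The problem is that the "ID" column holds strings (every ID in data is quoted), but the filter compares it with the integer literal 100000. To evaluate that comparison, Spark implicitly casts the string column to a 32-bit integer, at least under the default non-ANSI settings of Spark 3.x, which is consistent with the output above. A 32-bit integer tops out at 2147483647, so "21156311722", "3116722672", and "4115666122" do not fit; the cast yields NULL for them, and the filter drops those rows. Python's own integers are not the limit here, since they have arbitrary precision. Casting "ID" to a string would not help either: the column is already a string, and comparing it with "100000" is lexicographic, so "21422" would pass the filter. Instead, cast "ID" to a 64-bit integer (LongType) for the comparison.

Try the code below:

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType

# create the SparkSession and give the app a name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
data = [["2116722", "sravan", "company 1"],
        ["2716722", "ojaswi", "company 2"],
        ["2119722", "bobby", "company 3"],
        ["21156311722", "sravan", "company 1"],
        ["21422", "ojaswi", None],
        ["2216722", "rohith", "company 2"],
        ["3116722672", "gnanesh", "company 1"],
        ["2156722", None, "company 2"],
        ["4115666122", "bobby", "company 3"],
        ["21190745", "rohith", "company 2"]]

# specify column names
columns = ['ID', 'Employee NAME', 'Company Name']

# create a DataFrame from the list of rows
dataframe = spark.createDataFrame(data, columns)

# cast "ID" to a 64-bit integer for the comparison only;
# the column stored in the DataFrame remains a string
dataframe.where(dataframe["ID"].cast(LongType()) > 100000).show()

In this code, dataframe["ID"].cast(LongType()) converts the "ID" values to 64-bit integers before the comparison, so the large IDs no longer become NULL and every row with an ID greater than 100000 is returned. Only "21422" is filtered out, because it really is smaller than 100000.

This change should give you the expected output, with the rows listed in their original order.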
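
To make the three possible comparisons concrete, here is a minimal plain-Python sketch that needs no Spark installation. It only imitates the casts; the helper name cast_to_int32 is hypothetical, and the rule that an out-of-range value becomes NULL (None here) is an assumption about Spark's non-ANSI behavior rather than Spark's actual implementation.

```python
# Plain-Python imitation of the three ways the filter could be evaluated.
# Assumption: a string -> 32-bit int cast yields NULL (None) on overflow,
# as Spark does with non-ANSI settings. This is not Spark code.

INT32_MAX = 2**31 - 1   # 2147483647, upper bound of a signed 32-bit integer
INT32_MIN = -2**31

ids = ["2116722", "2716722", "2119722", "21156311722", "21422",
       "2216722", "3116722672", "2156722", "4115666122", "21190745"]


def cast_to_int32(text):
    """Hypothetical helper: parse digits, return None if outside 32-bit range."""
    value = int(text)
    return value if INT32_MIN <= value <= INT32_MAX else None


# 1) 32-bit cast: overflowing IDs become None and never pass the filter.
kept_int32 = [i for i in ids
              if cast_to_int32(i) is not None and cast_to_int32(i) > 100000]

# 2) 64-bit cast: every ID in this data set fits, so the comparison is numeric.
kept_int64 = [i for i in ids if int(i) > 100000]

# 3) String comparison: lexicographic, so "21422" > "100000" because "2" > "1".
kept_string = [i for i in ids if i > "100000"]

print(kept_int32)   # the six rows from the original output
print(kept_int64)   # nine rows, everything except "21422"
print(kept_string)  # all ten rows, including "21422"
```

The first list matches the output in the question, the second matches the expected output, and the third shows why comparing against the string "100000" would let "21422" through.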

From: 78788103