I created a sample DataFrame in PySpark whose ID column contains some values longer than 8 digits. But the filter only returns the rows with shorter IDs (8 digits or fewer) in the ID field. Can anyone suggest how to write the condition properly so that all matching rows are returned?
# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
data = [["2116722", "sravan", "company 1"],
["2716722", "ojaswi", "company 2"],
["2119722", "bobby", "company 3"],
["21156311722", "sravan", "company 1"],
["21422", "ojaswi", None],
["2216722", "rohith", "company 2"],
["3116722672", "gnanesh", "company 1"],
["2156722", None, "company 2"],
["4115666122", "bobby", "company 3"],
["21190745", "rohith", "company 2"]]
# specify column names
columns = ['ID', 'Employee NAME', 'Company Name']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.where(dataframe["ID"] > 100000).show()
Output:
+--------+-------------+------------+
| ID|Employee NAME|Company Name|
+--------+-------------+------------+
| 2116722| sravan| company 1|
| 2716722| ojaswi| company 2|
| 2119722| bobby| company 3|
| 2216722| rohith| company 2|
| 2156722| NULL| company 2|
|21190745| rohith| company 2|
+--------+-------------+------------+
Expected output:
+-----------+-------------+------------+
|         ID|Employee NAME|Company Name|
+-----------+-------------+------------+
|    2116722|       sravan|   company 1|
|    2716722|       ojaswi|   company 2|
|    2119722|        bobby|   company 3|
|    2216722|       rohith|   company 2|
|    2156722|         NULL|   company 2|
|   21190745|       rohith|   company 2|
|21156311722|       sravan|   company 1|
| 4115666122|        bobby|   company 3|
| 3116722672|      gnanesh|   company 1|
+-----------+-------------+------------+
The problem is that the numeric comparison is being run against the 'ID' column, which was created from string data and is therefore a StringType column. When a string column is compared with an integer literal, Spark implicitly casts the strings to an integer type in order to evaluate the comparison; under the default (non-ANSI) cast behavior, any value that does not fit in that type becomes NULL, the comparison with NULL yields NULL, and the row is silently filtered out. That is why the 10- and 11-digit IDs disappear. To fix this, the 'ID' column should be cast to a numeric type wide enough for the large values, such as LongType, before comparing.
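A plain-Python sketch of that overflow-to-NULL behavior (this only models Spark's default non-ANSI cast semantics; the 32-bit bound is an assumption about the implicit target type, not something taken from the question):

```python
# Model of a non-ANSI string-to-int cast: values outside the signed 32-bit
# range come back as NULL (None here), so NULL > 100000 is NULL and the
# row is dropped from the filter result.
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

def cast_to_int32(s):
    """Mimic a lenient cast('int') on a string: None when out of range."""
    try:
        n = int(s)
    except ValueError:
        return None
    return n if INT32_MIN <= n <= INT32_MAX else None

ids = ["2116722", "21156311722", "4115666122", "21190745"]
kept = [s for s in ids if (v := cast_to_int32(s)) is not None and v > 100000]
print(kept)  # the two in-range IDs survive; the >32-bit IDs are dropped
```

Running this keeps only "2116722" and "21190745", mirroring the truncated output above.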
Try the following code:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
data = [["2116722", "sravan", "company 1"],
["2716722", "ojaswi", "company 2"],
["2119722", "bobby", "company 3"],
["21156311722", "sravan", "company 1"],
["21422", "ojaswi", None],
["2216722", "rohith", "company 2"],
["3116722672", "gnanesh", "company 1"],
["2156722", None, "company 2"],
["4115666122", "bobby", "company 3"],
["21190745", "rohith", "company 2"]]
# specify column names
columns = ['ID', 'Employee NAME', 'Company Name']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
# cast the 'ID' column to a 64-bit long type
dataframe = dataframe.withColumn("ID", dataframe["ID"].cast(LongType()))
dataframe.where(dataframe["ID"] > 100000).show()
In this code,
dataframe.withColumn("ID", dataframe["ID"].cast(LongType()))
casts the 'ID' column to LongType, a 64-bit integer that is wide enough for every ID in the sample. The comparison is then evaluated numerically, so all rows whose ID is greater than 100000 are returned, including the 10- and 11-digit ones. (Note that casting to a string and comparing with `> "100000"` would not work: string comparison is lexicographic, so "21422" would incorrectly pass the filter.)
This change should produce the expected output.
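As a side note, a signed 64-bit long tops out near 9.2 × 10^18 (19 digits), so it comfortably covers the 11-digit IDs here. A quick plain-Python check over the sample data confirms which IDs overflow 32 bits and that none overflow 64 bits (the bounds below are the standard two's-complement limits, not anything Spark-specific):

```python
# Sanity check: every sample ID fits in a signed 64-bit long, while the
# three long IDs overflow a signed 32-bit int.
ids = ["2116722", "2716722", "2119722", "21156311722", "21422",
       "2216722", "3116722672", "2156722", "4115666122", "21190745"]
INT32_MAX = 2**31 - 1   # 2147483647
INT64_MAX = 2**63 - 1   # 9223372036854775807

overflow_int32 = [s for s in ids if int(s) > INT32_MAX]
overflow_int64 = [s for s in ids if int(s) > INT64_MAX]
print(overflow_int32)  # exactly the three IDs missing from the bad output
print(overflow_int64)  # empty: LongType is wide enough
```

If IDs could ever exceed 19 digits, casting to DecimalType with sufficient precision would be the safer choice.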
Tags: python, dataframe, apache-spark, pyspark, apache-spark-sql From: 78788103