
PySpark Functions

1. Select Columns

- Example

df = df.select(
	"customer_id",
	"customer_name"
)
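
- Columns can also be selected as expressions and renamed inline with alias; a minimal sketch, assuming the same DataFrame df:

import pyspark.sql.functions as F

# Keep customer_id as-is and expose customer_name under the alias "name"
df = df.select(
	F.col("customer_id"),
	F.col("customer_name").alias("name")
)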

2. Creating or Replacing a column

- Example

import pyspark.sql.functions as F

df = df.withColumn("always_one", F.lit(1))
df = df.withColumn("customer_id_copy", F.col("customer_id"))
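
- withColumn accepts arbitrary column expressions, not just literals; a sketch with hypothetical price and quantity columns:

# Derive a new column from two existing ones (price and quantity are assumed to exist)
df = df.withColumn("total_amount", F.col("price") * F.col("quantity"))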

3. Rename a column

df.withColumnRenamed(<former_column_name>, <new_column_name>)
- Example
df = df.withColumnRenamed("sap_product_code", "product_code")

4. Creating columns

Returning a column that contains <value> in every row: F.lit(<value>)
- Example
df = df.withColumn("test", F.lit(1))

- Example for null values: you have to give the column a type, since None has no type
df = df.withColumn("null_column", F.lit(None).cast("string"))
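
- The same cast works on existing columns; a sketch, assuming customer_id is currently stored as a string:

# Convert customer_id to a numeric type
df = df.withColumn("customer_id", F.col("customer_id").cast("long"))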

5. If then else statements

F.when(<condition>, <column>).otherwise(<column>)
- Example
df = df.withColumn(
	"new_column",
	F.when(
		F.col("source") == "OK",
		F.lit("OneKey")
	).when(
		F.col("source") == "ABV_BC",
		F.lit("Business Contact")
	).otherwise(
		F.lit("other source")
	)
)
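
- Conditions can be combined with & (and), | (or) and ~ (not); each comparison needs its own parentheses. A sketch, assuming a country column alongside source:

df = df.withColumn(
	"is_ok_us",
	F.when(
		(F.col("source") == "OK") & (F.col("country") == "US"),
		F.lit(1)
	).otherwise(F.lit(0))
)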

6. Concatenating columns

F.concat(<column_1>, <column_2>, <column_3>, ...)
- Example
df = df.withColumn(
	"new_column",
	F.concat(
		F.col("firstname"),
		F.col("lastname")
	)
)
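
- To put a separator between the values, F.concat_ws takes the separator as its first argument; a sketch with the same columns:

# Concatenate firstname and lastname with a space in between
df = df.withColumn(
	"full_name",
	F.concat_ws(" ", F.col("firstname"), F.col("lastname"))
)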

7. Joining datasets

dataset_a.join(dataset_b, on="column_to_join_on", how="left")
- Example
customer_with_address = customer.join(address, on="customer_id", how="left")
- Example with multiple columns to join on
dataset_c = dataset_a.join(dataset_b, on=["customer_id", "territory", "product"], how="inner")
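
- When the key columns are named differently on each side, a join condition can be passed instead of a column name; a sketch, assuming the address dataset calls its key cust_id:

customer_with_address = customer.join(
	address,
	on=customer["customer_id"] == address["cust_id"],
	how="left"
)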

8. Grouping by

# Example
import pyspark.sql.functions as F

aggregated_calls = calls.groupBy("customer_id").agg(
  F.mean("duration").alias("mean_duration")
)
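
- Several aggregations can go into a single agg call; a sketch reusing the calls dataset:

aggregated_calls = calls.groupBy("customer_id").agg(
  F.mean("duration").alias("mean_duration"),
  F.max("duration").alias("max_duration"),
  F.count("duration").alias("call_count")
)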

9. Pivoting

- Example (a pivot must be followed by an aggregation to return a DataFrame)
customer_specialty = specialty.groupBy("customer_id").pivot("priority").count()
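
- Listing the expected pivot values up front fixes the output columns and saves Spark a pass over the data to discover them; a sketch, assuming priority takes the values "high" and "low":

customer_specialty = (
	specialty.groupBy("customer_id")
	.pivot("priority", ["high", "low"])
	.count()
)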

10. Window functions

- Example
from pyspark.sql.window import Window
window = Window.partitionBy("l0_customer_id", "address_id").orderBy(F.col("ordered_code_locale"))
ordered_code_locale = dataset.withColumn(
	"order_code_locale_row",
	F.row_number().over(window)
)
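
- Other functions can run over the same window, for example F.lag to fetch the previous row's value; a sketch reusing the window defined above:

# Value of ordered_code_locale from the previous row within the partition
dataset = dataset.withColumn(
	"previous_code_locale",
	F.lag("ordered_code_locale").over(window)
)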

11. Iterating over columns

# Example only with the column name
# Adds the "new_name_" prefix to all the columns of the dataset
for column_name in dataset.columns:
  dataset = dataset.withColumnRenamed(column_name, "new_name_{}".format(column_name))

# Example with the column types
# Replaces all column values with "Test"
for column_name, column_type in dataset.dtypes:
  dataset = dataset.withColumn(column_name, F.lit("Test"))
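
- The column list can also drive a single select, which avoids chaining many withColumn calls; a sketch that casts every column to string:

# Cast all columns at once; each column keeps its original name
dataset = dataset.select(
  [F.col(column_name).cast("string") for column_name in dataset.columns]
)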

12. Iterating over dictionaries

# Define a dictionary
my_dictionary = {
  "dog": "Alice",
  "cat": "Johnny"
}

# Iterate through the dictionary
for animal, name in my_dictionary.items():
  # Do something
  print(animal, name)

# Iterate through the dictionary
for animal in my_dictionary.keys():
  # Do something
  print(animal)

# Iterate through the dictionary
for name in my_dictionary.values():
  # Do something
  print(name)
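
- In PySpark, a dictionary works well as a rename mapping; a sketch with hypothetical old-to-new name pairs:

rename_mapping = {
  "sap_product_code": "product_code",
  "l0_customer_id": "customer_id"
}
# Rename each column listed in the mapping
for former_name, new_name in rename_mapping.items():
  dataset = dataset.withColumnRenamed(former_name, new_name)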

13. Iterating over lists

my_list = [
  "dog",
  "cat"
]

# Iterate through the list
for animal in my_list:
  # Do something
  print(animal)

# Iterate through the list, and get the index of the current element
for index, animal in enumerate(my_list):
  # Do something
  print(index, animal)
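
- In PySpark, a list of column names can be passed straight to select; a minimal sketch:

columns_to_keep = ["customer_id", "customer_name"]
df = df.select(columns_to_keep)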
