
PySpark Functions

1. Select Columns

- Example

df = df.select(
	"customer_id",
	"customer_name"
)
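
- Columns can also be selected as expressions and renamed inline with alias; a minimal sketch, assuming the same DataFrame df:

import pyspark.sql.functions as F

# Keep customer_id as-is and expose customer_name under the alias "name"
df = df.select(
	F.col("customer_id"),
	F.col("customer_name").alias("name")
)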

2. Creating or Replacing a column

- Example

import pyspark.sql.functions as F

df = df.withColumn("always_one", F.lit(1))
df = df.withColumn("customer_id_copy", F.col("customer_id"))
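
- withColumn accepts arbitrary column expressions, not just literals; a sketch with hypothetical price and quantity columns:

# Derive a new column from two existing ones (price and quantity are assumed to exist)
df = df.withColumn("total_amount", F.col("price") * F.col("quantity"))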

3. Rename a column

df.withColumnRenamed(<former_column_name>, <new_column_name>)
- Example
df = df.withColumnRenamed("sap_product_code", "product_code")

4. Creating columns

Returning a column that contains <value> in every row: F.lit(<value>)
- Example
df = df.withColumn("test", F.lit(1))

- Example for null values: you have to give the column a type, since None has no type
df = df.withColumn("null_column", F.lit(None).cast("string"))
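
- The same cast works on existing columns; a sketch, assuming customer_id is currently stored as a string:

# Convert customer_id to a numeric type
df = df.withColumn("customer_id", F.col("customer_id").cast("long"))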

5. If then else statements

F.when(<condition>, <column>).otherwise(<column>)
- Example
df = df.withColumn(
	"new_column",
	F.when(
		F.col("source") == "OK",
		F.lit("OneKey")
	).when(
		F.col("source") == "ABV_BC",
		F.lit("Business Contact")
	).otherwise(
		F.lit("other source")
	)
)
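
- Conditions can be combined with & (and), | (or) and ~ (not); each comparison needs its own parentheses. A sketch, assuming a country column alongside source:

df = df.withColumn(
	"is_ok_us",
	F.when(
		(F.col("source") == "OK") & (F.col("country") == "US"),
		F.lit(1)
	).otherwise(F.lit(0))
)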

6. Concatenating columns

F.concat(<column_1>, <column_2>, <column_3>, ...)
- Example
df = df.withColumn(
	"new_column",
	F.concat(
		F.col("firstname"),
		F.col("lastname")
	)
)
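
- To put a separator between the values, F.concat_ws takes the separator as its first argument; a sketch with the same columns:

# Concatenate firstname and lastname with a space in between
df = df.withColumn(
	"full_name",
	F.concat_ws(" ", F.col("firstname"), F.col("lastname"))
)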

7. Joining datasets

dataset_a.join(dataset_b, on="column_to_join_on", how="left")
- Example
customer_with_address = customer.join(address, on="customer_id", how="left")
- Example with multiple columns to join on
dataset_c = dataset_a.join(dataset_b, on=["customer_id", "territory", "product"], how="inner")
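
- When the key columns are named differently on each side, a join condition can be passed instead of a column name; a sketch, assuming the address dataset calls its key cust_id:

customer_with_address = customer.join(
	address,
	on=customer["customer_id"] == address["cust_id"],
	how="left"
)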

8. Grouping by

# Example
import pyspark.sql.functions as F

aggregated_calls = calls.groupBy("customer_id").agg(
  F.mean("duration").alias("mean_duration")
)
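
- Several aggregations can go into a single agg call; a sketch reusing the calls dataset:

aggregated_calls = calls.groupBy("customer_id").agg(
  F.mean("duration").alias("mean_duration"),
  F.max("duration").alias("max_duration"),
  F.count("duration").alias("call_count")
)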

9. Pivoting

- Example (a pivot must be followed by an aggregation to return a DataFrame)
customer_specialty = specialty.groupBy("customer_id").pivot("priority").count()
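
- Listing the expected pivot values up front fixes the output columns and saves Spark a pass over the data to discover them; a sketch, assuming priority takes the values "high" and "low":

customer_specialty = (
	specialty.groupBy("customer_id")
	.pivot("priority", ["high", "low"])
	.count()
)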

10. Window functions

- Example
from pyspark.sql.window import Window
window = Window.partitionBy("l0_customer_id", "address_id").orderBy(F.col("ordered_code_locale"))
ordered_code_locale = dataset.withColumn(
	"order_code_locale_row",
	F.row_number().over(window)
)
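
- Other functions can run over the same window, for example F.lag to fetch the previous row's value; a sketch reusing the window defined above:

# Value of ordered_code_locale from the previous row within the partition
dataset = dataset.withColumn(
	"previous_code_locale",
	F.lag("ordered_code_locale").over(window)
)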

11. Iterating over columns

# Example only with the column name
# Adds the "new_name_" prefix to all the columns of the dataset
for column_name in dataset.columns:
  dataset = dataset.withColumnRenamed(column_name, "new_name_{}".format(column_name))

# Example with the column types
# Replaces all column values with "Test"
for column_name, column_type in dataset.dtypes:
  dataset = dataset.withColumn(column_name, F.lit("Test"))
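
- The column list can also drive a single select, which avoids chaining many withColumn calls; a sketch that casts every column to string:

# Cast all columns at once; each column keeps its original name
dataset = dataset.select(
  [F.col(column_name).cast("string") for column_name in dataset.columns]
)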

12. Iterating over dictionaries

# Define a dictionary
my_dictionary = {
  "dog": "Alice",
  "cat": "Johnny"
}

# Iterate through the dictionary
for animal, name in my_dictionary.items():
  # Do something
  print(animal, name)

# Iterate through the dictionary
for animal in my_dictionary.keys():
  # Do something
  print(animal)

# Iterate through the dictionary
for name in my_dictionary.values():
  # Do something
  print(name)
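
- In PySpark, a dictionary works well as a rename mapping; a sketch with hypothetical old-to-new name pairs:

rename_mapping = {
  "sap_product_code": "product_code",
  "l0_customer_id": "customer_id"
}
# Rename each column listed in the mapping
for former_name, new_name in rename_mapping.items():
  dataset = dataset.withColumnRenamed(former_name, new_name)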

13. Iterating over lists

my_list = [
  "dog",
  "cat"
]

# Iterate through the list
for animal in my_list:
  # Do something
  print(animal)

# Iterate through the list, and get the index of the current element
for index, animal in enumerate(my_list):
  # Do something
  print(index, animal)
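
- In PySpark, a list of column names can be passed straight to select; a minimal sketch:

columns_to_keep = ["customer_id", "customer_name"]
df = df.select(columns_to_keep)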
