方法 | 介绍 | 简单使用 |
flatmap | 对RDD中的每一个元素进行先map再压扁,最后返回操作的结果 | scala> sc.parallelize(Array("a b c", "d e f", "h i j")).collect scala> sc.parallelize(Array("a b c", "d e f", "h i j")).flatMap(_.split(" ")).collect |
sortBy | sortBy(x=>x,true) //默认升序 sortBy(x=>x+"",true)//变成了字符串,结果为字典顺序
| scala> sc.parallelize(List(5, 6, 4, 7, 3, 8, 2, 9, 1, 10)).sortBy(x=>x,true).collect scala> sc.parallelize(List(5, 6, 4, 7, 3, 8, 2, 9, 1, 10)).sortBy(x=>x+"",true).collect |
union | 并集 | |
intersection | 交集 | |
subtract | 差集 | |
cartesian | 笛卡尔积 | |
join | join(内连接)聚合具有相同key组成的value元组 | val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 2), ("kitty", 3)))
scala> rdd1.join(rdd2).collect |
rightOuterJoin | | scala> rdd1.rightOuterJoin(rdd2).collect res43: Array[(String, (Option[Int], Int))] = Array((tom,(Some(1),8)), (tom,(Some(1),2)), (jerry,(Some(2),9)), (shuke,(None,7))) |
groupbykey | groupByKey()的功能是,对具有相同键的值进行分组 | scala> val rdd6 = sc.parallelize(Array(("tom",1), ("jerry",2), ("kitty",3), scala> rdd6.groupByKey.collect |
cogroup | 先在RDD内部按照key分组,再在多个RDD间按照key分组 | val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3),
scala> rdd1.cogroup(rdd2).collect |
groupBy | 根据指定的函数中的规则/key进行分组 | scala> val intRdd = sc.parallelize(List(1,2,3,4,5,6)) scala> intRdd.groupBy(x=>{if(x%2==0) "even" else "odd"}).collect |
reduce | 注意reduce是Action算子 | val rdd1 = sc.parallelize(List(1, 2, 3, 4, 5)) //reduce聚合 val result = rdd1.reduce(_ + _) //第一_ 上次一个运算的结果,第二个_ 这一 次进来的元素 |
reducebykey | 注意reducebykey是转换算子 | |
repartition | 改变分区数: 注意: | scala> val rdd1 = sc.parallelize(1 to 10,3) scala> rdd1.repartition(2).partitions.length scala> rdd1.partitions.length |
collect | 显示数据 | |
count | 求RDD中最外层元素的个数 | scala> val rdd3 = sc.parallelize(List(List("a b c", "a b b"),List("e f g", "a f g"), List("h i j", "a a b"))) scala> rdd3.count |
distinct | 去重 | scala> val rdd = sc.parallelize(Array(1,2,3,4,5,5,6,7,8,1,2,3,4), 3) scala> rdd.distinct.collect |
top | 取出最大的前N个 | scala> sc.parallelize(List(3,6,1,2,4,5)).top(2) res56: Array[Int] = Array(6, 5) |
take | //按照原来的顺序取前N个 | scala> sc.parallelize(List(3,6,1,2,4,5)).take(2) res57: Array[Int] = Array(3, 6) |
first | 按照原来的顺序取前第一个 | scala> sc.parallelize(List(3,6,1,2,4,5)).first res58: Int = 3 |
标签:parallelize,scala,Int,List,RDD,sc,Array,方法 From: https://blog.51cto.com/u_12277263/5809177