Table of Contents
Learning Spark Python API Functions: pyspark API (1)
Learning Spark Python API Functions: pyspark API (2)
Learning Spark Python API Functions: pyspark API (3)
Learning Spark Python API Functions: pyspark API (4)
Spark supports Scala, Java, and Python. This article walks through the pyspark API with diagrams and simple examples.
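All of the examples below assume an existing SparkContext named sc, which the pyspark shell provides automatically. When running the snippets as a standalone script instead, a context can be created along the following lines (a minimal sketch; the application name is arbitrary):
# create a SparkContext manually when not using the pyspark shell
# (sketch only; 'pyspark-api-examples' is an arbitrary application name)
from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('pyspark-api-examples')
sc = SparkContext(conf = conf)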


histogram
# histogram (example #1)
x = sc.parallelize([1,3,1,2,3])
y = x.histogram(buckets = 2)
print(x.collect())
print(y)
[1, 3, 1, 2, 3]
([1, 2, 3], [2, 3])
# histogram (example #2)
x = sc.parallelize([1,3,1,2,3])
y = x.histogram([0,0.5,1,1.5,2,2.5,3,3.5])
print(x.collect())
print(y)
[1, 3, 1, 2, 3]
([0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5], [0, 0, 2, 0, 1, 0, 2])
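To read the output of example #1: when histogram is given an integer, it splits the range between the RDD's minimum and maximum into that many evenly spaced buckets, all of them half-open except the last, which is closed. The boundaries [1, 2, 3] therefore describe the buckets [1, 2) and [2, 3], and their counts over [1, 3, 1, 2, 3] are [2, 3]. In example #2 the bucket boundaries are supplied explicitly and must be given in sorted order.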
mean
# mean
x = sc.parallelize([1,3,2])
y = x.mean()
print(x.collect())
print(y)
[1, 3, 2]
2.0
variance
# variance
x = sc.parallelize([1,3,2])
y = x.variance()  # divides by N
print(x.collect())
print(y)
[1, 3, 2]
0.666666666667
stdev
# stdev
x = sc.parallelize([1,3,2])
y = x.stdev()  # divides by N
print(x.collect())
print(y)
[1, 3, 2]
0.816496580928
sampleStdev
# sampleStdev
x = sc.parallelize([1,3,2])
y = x.sampleStdev()  # divides by N-1
print(x.collect())
print(y)
[1, 3, 2]
1.0
sampleVariance
# sampleVariance
x = sc.parallelize([1,3,2])
y = x.sampleVariance()  # divides by N-1
print(x.collect())
print(y)
[1, 3, 2]
1.0
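To see where the four statistics above come from: the mean of [1, 3, 2] is 2, and the squared deviations sum to (1-2)² + (3-2)² + (2-2)² = 2. variance and stdev divide by N = 3, giving 2/3 ≈ 0.667 and its square root ≈ 0.816; sampleVariance and sampleStdev divide by N-1 = 2, giving 1.0 and 1.0.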
countByValue
# countByValue
x = sc.parallelize([1,3,1,2,3])
y = x.countByValue()
print(x.collect())
print(y)
[1, 3, 1, 2, 3]
defaultdict(<type 'int'>, {1: 2, 2: 1, 3: 2})
top
# top
x = sc.parallelize([1,3,1,2,3])
y = x.top(num = 3)
print(x.collect())
print(y)
[1, 3, 1, 2, 3]
[3, 3, 2]
takeOrdered
# takeOrdered
x = sc.parallelize([1,3,1,2,3])
y = x.takeOrdered(num = 3)
print(x.collect())
print(y)
[1, 3, 1, 2, 3]
[1, 1, 2]
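Both top and takeOrdered also accept an optional key function that controls the ordering. For instance, takeOrdered can return the largest elements first by negating the key (a small sketch, not part of the original examples):
# takeOrdered with a key function: order by the negated value, i.e. largest first
x = sc.parallelize([1,3,1,2,3])
y = x.takeOrdered(num = 3, key = lambda v: -v)
print(y)
[3, 3, 2]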
take
# take
x = sc.parallelize([1,3,1,2,3])
y = x.take(num = 3)
print(x.collect())
print(y)
[1, 3, 1, 2, 3]
[1, 3, 1]
first
# first
x = sc.parallelize([1,3,1,2,3])
y = x.first()
print(x.collect())
print(y)
[1, 3, 1, 2, 3]
1
collectAsMap
# collectAsMap
x = sc.parallelize([('C',3),('A',1),('B',2)])
y = x.collectAsMap()
print(x.collect())
print(y)
[('C', 3), ('A', 1), ('B', 2)]
{'A': 1, 'C': 3, 'B': 2}
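Note that collectAsMap pulls the entire RDD back to the driver as an ordinary Python dict, so it is only suitable for small data sets; if the same key occurs more than once, only one of its values is kept in the resulting dict.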
keys
# keys
x = sc.parallelize([('C',3),('A',1),('B',2)])
y = x.keys()
print(x.collect())
print(y.collect())
[('C', 3), ('A', 1), ('B', 2)]
['C', 'A', 'B']
values
# values
x = sc.parallelize([('C',3),('A',1),('B',2)])
y = x.values()
print(x.collect())
print(y.collect())
[('C', 3), ('A', 1), ('B', 2)]
[3, 1, 2]
reduceByKey
# reduceByKey
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
y = x.reduceByKey(lambda agg, obj: agg + obj)
print(x.collect())
print(y.collect())
[('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
[('A', 12), ('B', 3)]
reduceByKeyLocally
# reduceByKeyLocally
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
y = x.reduceByKeyLocally(lambda agg, obj: agg + obj)
print(x.collect())
print(y)
[('B', 1), ('B', 2), ('A', 3), ('A', 4), ('A', 5)]
{'A': 12, 'B': 3}
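The two examples above differ in where the result lives: reduceByKey produces another RDD, which is why y.collect() is needed, while reduceByKeyLocally merges the values and hands an ordinary Python dict straight back to the driver. For an associative operation like addition, operator.add can stand in for the lambda (a small sketch, not part of the original examples):
# reduceByKey with operator.add, equivalent to the lambda-based example above
from operator import add
x = sc.parallelize([('B',1),('B',2),('A',3),('A',4),('A',5)])
y = x.reduceByKey(add)
print(y.collect())
[('A', 12), ('B', 3)]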
Unless otherwise stated, every article on this blog is original. The copyright of original articles belongs to 过往记忆大数据 (iteblog), and they may not be reproduced without permission.
Permalink: Learning Spark Python API Functions: pyspark API (3) (https://www.iteblog.com/archives/1399.html)

