How to store each element in a dictionary and count dictionary values with pyspark? - pyspark

I want to count how many times each element occurs, using a dictionary. I tried this code:
from collections import defaultdict

def f_items(data, steps=0):
    items = defaultdict(int)
    for element in data:
        if element in items:
            items[element] += 1
        else:
            items[element] = 1
    return items.items()
data = [[1, 2, 3, 'E'], [1, 2, 3, 'E'], [5, 2, 7, 112, 'A']]
rdd = sc.parallelize(data)
items = rdd.flatMap(lambda data: [y for y in f_items(data)], True)
print(items.collect())
The output of this code is shown below:
[(1, 1), (2, 1), (3, 1), ('E', 1), (1, 1), (2, 1), (3, 1), ('E', 1), (5, 1), (2, 1), (7, 1), (112, 1), ('A', 1)]
But it should show the following result:
[(1, 2), (2, 3), (3, 3), ('E', 2), (5, 1), (7, 1), (112, 1), ('A', 1)]
How to achieve this?

Your last step should be a reduceByKey function call on the items rdd.
final_items = items.reduceByKey(lambda x, y: x + y)
print(final_items.collect())
You can look at this link to see some examples of reduceByKey in Scala, Java and Python.
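For reference, a minimal end-to-end sketch of the whole pipeline (assuming a live SparkContext named sc, as in the question); note that collect() returns the pairs in no particular order:

from collections import defaultdict

# Count occurrences within each sub-list, then sum the per-list counts by key.
def f_items(data):
    items = defaultdict(int)
    for element in data:
        items[element] += 1
    return items.items()

data = [[1, 2, 3, 'E'], [1, 2, 3, 'E'], [5, 2, 7, 112, 'A']]
rdd = sc.parallelize(data)

final_items = rdd.flatMap(f_items).reduceByKey(lambda x, y: x + y)
print(final_items.collect())
# e.g. [(1, 2), (2, 3), (3, 3), ('E', 2), (5, 1), (7, 1), (112, 1), ('A', 1)]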

Related

Getting the first item for a tuple for each row in a list in Scala

I am looking to do this in Scala, but nothing I have tried works. In PySpark it works as expected:
from operator import itemgetter
rdd = sc.parallelize([(0, [(0,'a'), (1,'b'), (2,'c')]), (1, [(3,'x'), (5,'y'), (6,'z')])])
mapped = rdd.mapValues(lambda v: map(itemgetter(0), v))
Output
mapped.collect()
[(0, [0, 1, 2]), (1, [3, 5, 6])]
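A small aside: under Python 3, map() returns a lazy iterator rather than a list, so when reproducing this it is safer to materialise the values explicitly, for example:

# Python 3-friendly variant: build a plain list instead of a lazy map object
mapped = rdd.mapValues(lambda v: [t[0] for t in v])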
val rdd = sparkContext.parallelize(List(
  (0, Array((0, "a"), (1, "b"), (2, "c"))),
  (1, Array((3, "x"), (5, "y"), (6, "z")))
))

rdd
  .mapValues(v => v.map(_._1))
  .foreach(v => println(v._1 + "; " + v._2.toSeq.mkString(",")))
Output:
0; 0,1,2
1; 3,5,6

Group_by_key in order in Pyspark

rrr = sc.parallelize([1, 2, 3])
fff = sc.parallelize([5, 6, 7, 8])
test = rrr.cartesian(fff)
Here's test:
[(1, 5),(1, 6),(1, 7),(1, 8),
(2, 5),(2, 6),(2, 7),(2, 8),
(3, 5),(3, 6),(3, 7),(3, 8)]
Is there a way to preserve the order after calling groupByKey:
test.groupByKey().mapValues(list).take(2)
The output is this, where each list is in arbitrary order:
Out[255]: [(1, [8, 5, 6, 7]), (2, [5, 8, 6, 7]), (3, [6, 8, 7, 5])]
The desired output is:
[(1, [5,6,7,8]), (2, [5,6,7,8]), (3, [5,6,7,8])]
How to achieve this?
You can add one more mapValues to sort the lists:
result = test.groupByKey().mapValues(list).mapValues(sorted)
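For completeness, a small sketch of what this yields; since sorted() accepts any iterable and returns a list, the intermediate mapValues(list) can also be dropped:

result = test.groupByKey().mapValues(sorted)  # sort each group into a list
print(result.take(3))
# e.g. [(1, [5, 6, 7, 8]), (2, [5, 6, 7, 8]), (3, [5, 6, 7, 8])]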

AggregateByKey in Pyspark not giving expected output

I have an RDD which has 2 partitions and key-value pair data:
rdd5.glom().collect()
[[(u'hive', 1), (u'python', 1), (u'spark', 1), (u'hive', 1), (u'spark', 1), (u'python', 1)],
 [(u'spark', 1), (u'java', 1), (u'java', 1), (u'spark', 1)]]
When I perform aggregateByKey:
rdd6 = rdd5.aggregateByKey((0, 0),
                           lambda acc, val: (acc[0] + 1, acc[1] + val),
                           lambda acc1, acc2: (acc1[1] + acc2[1]) / acc1[0] + acc2[0])
It is not giving me the expected result:
Output:
[(u'python', (2, 2)), (u'spark', 1), (u'java', (2, 2)), (u'hive', (2, 2))]
Expected:
[(u'python', 1), (u'spark', 1), (u'java', 1), (u'hive', 1)]
I can see that keys present in only one partition are not giving me the expected output. What changes should I make to achieve that?
OK, so below is the way to do this using both reduceByKey and aggregateByKey.
The problem you had with aggregateByKey is that the last function is responsible for merging two accumulators. It has to return the same structure as all the other functions, so that merging with yet another accumulator (from another partition) will work again.
It is very similar to combineByKey, see here.
rdd = sc.parallelize([(u'hive', 1), (u'python', 1), (u'spark', 1),
                      (u'hive', 1), (u'spark', 1), (u'python', 1),
                      (u'spark', 1), (u'java', 1), (u'java', 1), (u'spark', 1)])

print(rdd.aggregateByKey((0, 0),
                         lambda acc, val: (acc[0] + 1, acc[1] + val),
                         lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1])).collect())

print(rdd.mapValues(lambda x: (1, x))
         .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
         .collect())
[(u'spark', (4, 4)), (u'java', (2, 2)), (u'hive', (2, 2)), (u'python', (2, 2))]
[(u'spark', (4, 4)), (u'java', (2, 2)), (u'hive', (2, 2)), (u'python', (2, 2))]
If you are trying to average the values, you can add another mapValues at the end like so:
print(rdd.aggregateByKey((0, 0),
                         lambda acc, val: (acc[0] + 1, acc[1] + val),
                         lambda acc1, acc2: (acc1[0] + acc2[0], acc1[1] + acc2[1]))
         .mapValues(lambda x: x[1] * 1.0 / x[0])
         .collect())
[(u'spark', 1.0), (u'java', 1.0), (u'hive', 1.0), (u'python', 1.0)]
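Since the answer mentions combineByKey, here is a rough sketch of the equivalent with that API (same accumulator shape, just with an explicit createCombiner step):

# createCombiner: turn the first value for a key into a (count, sum) accumulator
# mergeValue:     fold another value from the same partition into the accumulator
# mergeCombiners: merge accumulators coming from different partitions
averaged = (rdd.combineByKey(lambda v: (1, v),
                             lambda acc, v: (acc[0] + 1, acc[1] + v),
                             lambda a, b: (a[0] + b[0], a[1] + b[1]))
               .mapValues(lambda x: x[1] * 1.0 / x[0]))
print(averaged.collect())
# e.g. [(u'spark', 1.0), (u'java', 1.0), (u'hive', 1.0), (u'python', 1.0)]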

Pyspark - after groupByKey, how to count distinct values per key?

I would like to find how many distinct values there are per key. For example, suppose I have:
x = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 2), ("a", 2)])
And I have done this using groupByKey:
sorted(x.groupByKey().map(lambda x : (x[0], list(x[1]))).collect())
x.groupByKey().mapValues(len).collect()
the output will be like:
[('a', [1, 1, 2]), ('b', [1, 2])]
[('a', 3), ('b', 2)]
However, I want only the distinct values in each list, so the output should be like:
[('a', [1, 2]), ('b', [1, 2])]
[('a', 2), ('b', 2)]
I am very new to Spark and have tried to apply the distinct() function in various places, but every attempt failed :-(
Thanks a lot in advance!
You can use set instead of list:
sorted(x.groupByKey().map(lambda x : (x[0], set(x[1]))).collect())
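To also get the distinct counts per key, a small sketch along the same lines:

# Length of the de-duplicated value set gives the distinct count per key
x.groupByKey().mapValues(lambda vals: len(set(vals))).collect()
# e.g. [('a', 2), ('b', 2)]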
You can try a number of approaches for this. I solved it using the approach below:
from operator import add
x = sc.parallelize([("a", 1), ("b", 1), ("a", 1), ("b", 2), ("a", 2)])
x = x.map(lambda n:((n[0],n[1]), 1))
x.groupByKey().map(lambda n:(n[0][0],1)).reduceByKey(add).collect()
Output:
[('b', 2), ('a', 2)]
Hope this will help you.
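Since the question mentions distinct(): another sketch is to de-duplicate the (key, value) pairs first and then count per key (countByKey returns a dict-like result on the driver rather than an RDD):

# Drop duplicate (key, value) pairs, then count the remaining values per key
x.distinct().countByKey()
# e.g. defaultdict(<class 'int'>, {'a': 2, 'b': 2})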

Idiomatic scala solution to combining sequences

Imagine a function combineSequences: (seqs: Set[Seq[Int]])Set[Seq[Int]] that combines sequences when the last item of the first sequence matches the first item of the second sequence. For example, if you have the following sequences:
(1, 2)
(2, 3)
(5, 6, 7, 8)
(8, 9, 10)
(3, 4, 10)
The result of combineSequences would be:
(5, 6, 7, 8, 8, 9, 10)
(1, 2, 2, 3, 3, 4, 10)
Because sequences 1, 2, and 5 combine together. If multiple sequences could combine to create a different result, the decision is arbitrary. For example, if we have the sequences:
(1, 2)
(2, 3)
(2, 4)
There are two correct answers. Either:
(1, 2, 2, 3)
(2, 4)
Or:
(1, 2, 2, 4)
(2, 3)
I can only think of a very imperative and fairly opaque implementation. I'm wondering if anyone has a solution that would be more idiomatic Scala. I've run into related problems a few times now.
Certainly not the most optimized solution, but I've gone for readability.
def combineSequences[T](seqs: Set[Seq[T]]): Set[Seq[T]] = {
  if (seqs.isEmpty) seqs
  else {
    val (seq1, otherSeqs) = (seqs.head, seqs.tail)
    otherSeqs.find(_.headOption == seq1.lastOption) match {
      case Some(seq2) => combineSequences(otherSeqs - seq2 + (seq1 ++ seq2))
      case None =>
        otherSeqs.find(_.lastOption == seq1.headOption) match {
          case Some(seq2) => combineSequences(otherSeqs - seq2 + (seq2 ++ seq1))
          case None       => combineSequences(otherSeqs) + seq1
        }
    }
  }
}
REPL test:
scala> val seqs = Set(Seq(1, 2), Seq(2, 3), Seq(5, 6, 7, 8), Seq(8, 9, 10), Seq(3, 4, 10))
seqs: scala.collection.immutable.Set[Seq[Int]] = Set(List(1, 2), List(2, 3), List(8, 9, 10), List(5, 6, 7, 8), List(3, 4, 10))
scala> combineSequences( seqs )
res10: Set[Seq[Int]] = Set(List(1, 2, 2, 3, 3, 4, 10), List(5, 6, 7, 8, 8, 9, 10))
scala> val seqs = Set(Seq(1, 2), Seq(2, 3, 100), Seq(5, 6, 7, 8), Seq(8, 9, 10), Seq(100, 4, 10))
seqs: scala.collection.immutable.Set[Seq[Int]] = Set(List(100, 4, 10), List(1, 2), List(8, 9, 10), List(2, 3, 100), List(5, 6, 7, 8))
scala> combineSequences( seqs )
res11: Set[Seq[Int]] = Set(List(5, 6, 7, 8, 8, 9, 10), List(1, 2, 2, 3, 100, 100, 4, 10))