PySpark: Count pair frequency occurrences

Let's say I have a dataset as follows:
1: a, b, c
2: a, d, c
3: c, d, e
I want to write PySpark code to count the occurrences of each pair, such as (a,b), (a,c), (b,c), etc.
Expected output:
(a,b) 1
(b,c) 1
(c,d) 2
etc..
Note that (c,d) and (d,c) should be counted as the same pair.
How should I go about it?
So far, I have written the code to read the data from the text file as follows:
sc = SparkContext("local", "bp")
spark = SparkSession(sc)
data = sc.textFile('doc.txt')
dataFlatMap = data.flatMap(lambda x: x.split(" "))
Any pointers would be appreciated.

I relied on the answer in this question - How to create a Pyspark Dataframe of combinations from list column
Below is code that creates a UDF in which itertools.combinations is applied to the list of items. The combinations inside the UDF are sorted to avoid double-counting occurrences such as ("a", "b") and ("b", "a"). Once you have the combinations, you can groupBy and count rows. You may want to count distinct rows in case list elements repeat, as in ("a", "a", "b"), but that depends on your requirements.
import pyspark.sql.functions as F
import itertools
from pyspark.sql.types import *
data = [(1, ["a", "b", "c"]), (2, ["a", "d", "c"]), (3, ["c", "d", "e"])]
df = spark.createDataFrame(data, schema = ["id", "arr"])
# df is
# id arr
# 1 ["a", "b", "c"]
# 2 ["a", "d", "c"]
# 3 ["c", "d", "e"]
@F.udf(returnType=ArrayType(ArrayType(StringType())))
def combinations_udf(arr):
    # all unordered pairs, each sorted so ("b", "a") and ("a", "b") collapse into the same pair
    x = list(itertools.combinations(arr, 2))
    return [sorted([y[0], y[1]]) for y in x]
df1 = df.withColumn("combinations", F.explode(combinations_udf("arr")))
df_ans = (df1
          .groupBy("combinations")
          .agg(F.countDistinct("id").alias("count"))
          .orderBy(F.desc("count")))
For the given dataframe df, df_ans is
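Reconstructing the result by hand from the pairs generated above (rows with equal counts may appear in a different order), df_ans should look roughly like:
+------------+-----+
|combinations|count|
+------------+-----+
|      [a, c]|    2|
|      [c, d]|    2|
|      [a, b]|    1|
|      [b, c]|    1|
|      [a, d]|    1|
|      [c, e]|    1|
|      [d, e]|    1|
+------------+-----+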

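Since the question starts from doc.txt rather than from an in-memory list, here is a minimal sketch of how a DataFrame shaped like df above could be built from the text file; it assumes each line looks exactly like "1: a, b, c", as in the sample data, so the parsing details are an assumption:
# assumes lines of the form "1: a, b, c" in doc.txt
raw = spark.read.text("doc.txt")
parsed = raw.rdd.map(lambda row: row.value.split(":", 1))
rows = parsed.map(lambda p: (int(p[0]), [t.strip() for t in p[1].split(",")]))
df = spark.createDataFrame(rows, schema=["id", "arr"])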
Related

Spark RDD map - many to one

I have an RDD in the following form:
[ ("a") -> (pos3, pos5), ("b") -> (pos1, pos7), .... ]
and
(pos1 ,pos2, ............, posn)
Q: How can I map each position to its key? (to be something like the following)
("b", "e", "a", "d", "a" .....)
// "b" corresponds to pos1, "e" corresponds to pos2, and ...
Example (edit):
// chunk of my data
val data = Vector(("a",(124)), ("b",(125)), ("c",(121, 123)), ("d",(122)),..)
val rdd = sc.parallelize(data)
// from rdd I can create my position rdd which is something like:
val positions = Vector(1,2,3,4,.......125) // my positions
// I want to map each position to my tokens ("a", "b", "c", ...) to achieve:
Vector("a", "b", "a", ...)
// "a" corresponds to pos1, "b" corresponds to pos2, ...
Not sure you have to use Spark to address this specific use case (starting with a Vector and ending with a Vector containing all your data characters).
Nevertheless, here is a suggestion if it suits your needs:
val data = Vector(("a",Set(124)), ("b", Set(125)), ("c", Set(121, 123)), ("d", Set(122)))
val rdd = spark.sparkContext.parallelize(data)
val result = rdd.flatMap{case (k,positions) => positions.map(p => Map(p -> k))}
.reduce(_ ++ _) // here, we aggregate the Map objects together, reducing within partitions first and then merging the executors' results
.toVector
.sortBy(_._1) //We sort data based on position
.map(_._2) // We only keep characters
.mkString
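For the sample data above, the sorted intermediate vector is Vector((121,"c"), (122,"d"), (123,"c"), (124,"a"), (125,"b")), so the final string should be "cdcab"; if you want a Vector of tokens instead of a String, simply drop the trailing mkString.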

Filter elements of RDD[Array[String]] by an array of 0/1

I want to select some elements (features) of an RDD based on a binary array. I have an array of 0s and 1s, of size 40, that specifies whether an element is present at that index or not.
My RDD was created from the kddcup99 dataset:
val rdd = sc.textFile("./data/kddcup.txt")
val data = rdd.map(_.split(','))
How can I filter or select the elements of data (RDD[Array[String]]) whose corresponding index in the binary array is 1?
If I understood your question correctly, you have an array like:
val arr = Array(1, 0, 1, 1, 1, 0)
And an RDD[Array[String]] which looks like:
val rdd = sc.parallelize(Array(
Array("A", "B", "C", "D", "E", "F") ,
Array("G", "H", "I", "J", "K", "L")
) )
Now, to get the elements at the indices where arr has 1, you first need to get the indices that have 1 as their value in arr:
val requiredIndices = arr.zipWithIndex.filter(_._1 == 1).map(_._2)
requiredIndices: Array[Int] = Array(0, 2, 3, 4)
Then, similarly, on the RDD you can use zipWithIndex and contains to check whether each index is present in your requiredIndices array:
rdd.map(_.zipWithIndex.filter(x => requiredIndices.contains(x._2) ).map(_._1) )
// Array[Array[String]] = Array(Array(A, C, D, E), Array(G, I, J, K))

Add random elements to keyed RDD from the same RDD

Imagine we have a keyed RDD, RDD[(Int, List[String])], with thousands of keys and thousands to millions of values:
val rdd = sc.parallelize(Seq(
(1, List("a")),
(2, List("a", "b")),
(3, List("b", "c", "d")),
(4, List("f"))))
For each key I need to add random values from other keys. The number of elements to add varies and depends on the number of elements in the key, so the output could look like:
val rdd2: RDD[(Int, List[String])] = sc.parallelize(Seq(
(1, List("a", "c")),
(2, List("a", "b", "b", "c")),
(3, List("b", "c", "d", "a", "a", "f")),
(4, List("f", "d"))))
I came up with the following solution, which is obviously not very efficient (note: flattening and aggregation are optional, I'm fine with flattened data):
// flatten the input RDD
val rddFlat: RDD[(Int, String)] = rdd.flatMap(x => x._2.map(s => (x._1, s)))
// calculate number of elements for each key
val count = rddFlat.countByKey().toSeq
// for each key, take samples from the input RDD, change the original key and union all RDDs
val rddRandom: RDD[(Int, String)] = count.map { x =>
(x._1, rddFlat.sample(withReplacement = true, x._2.toDouble / count.map(_._2).sum, scala.util.Random.nextLong()))
}.map(x => x._2.map(t => (x._1, t._2))).reduce(_.union(_))
// union the input RDD with the random RDD and aggregate
val rddWithRandomData: RDD[(Int, List[String])] = rddFlat
.union(rddRandom)
.aggregateByKey(List[String]())(_ :+ _, _ ++ _)
What's the most efficient and elegant way to achieve that?
I use Spark 1.4.1.
Looking at the current approach, and in order to ensure the scalability of the solution, the area of focus should probably be to come up with a sampling mechanism that can be done in a distributed fashion, removing the need to collect the keys back to the driver.
In a nutshell, we need a distributed way to take a weighted sample of all the values.
What I propose is to create a matrix keys x values where each cell is the probability of the value being chosen for that key. Then, we can randomly score that matrix and pick those values that fall within the probability.
Let's write a Spark-based algorithm for that:
import scala.util.Random

// sample data to guide us.
// Note that I'm using distinguishable data across keys to see how the sampled data distributes over the keys
val data = sc.parallelize(Seq(
(1, List("A", "B")),
(2, List("x", "y", "z")),
(3, List("1", "2", "3", "4")),
(4, List("foo", "bar")),
(5, List("+")),
(6, List())))
val flattenedData = data.flatMap{case (k,vlist) => vlist.map(v=> (k,v))}
val values = data.flatMap{case (k,list) => list}
val keysBySize = data.map{case (k, list) => (k,list.size)}
val totalElements = keysBySize.map{case (k,size) => size}.sum
val keysByProb = keysBySize.mapValues{size => size.toDouble/totalElements}
val probMatrix = keysByProb.cartesian(values)
val scoredSamples = probMatrix.map{case ((key, prob),value) =>
((key,value),(prob, Random.nextDouble))}
scoredSamples looks like this:
((1,A),(0.16666666666666666,0.911900315814998))
((1,B),(0.16666666666666666,0.13615047422122906))
((1,x),(0.16666666666666666,0.6292430257377151))
((1,y),(0.16666666666666666,0.23839887096373114))
((1,z),(0.16666666666666666,0.9174808344986465))
...
val samples = scoredSamples.collect{case (entry, (prob,score)) if (score<prob) => entry}
samples looks like this:
(1,foo)
(1,bar)
(2,1)
(2,3)
(3,y)
...
Now, we union our sampled data with the original and have our final result.
val result = (flattenedData union samples).groupByKey.mapValues(_.toList)
result.collect()
(1,List(A, B, B))
(2,List(x, y, z, B))
(3,List(1, 2, 3, 4, z, 1))
(4,List(foo, bar, B, 2))
(5,List(+, z))
Given that the whole algorithm is written as a sequence of transformations on the original data, with minimal shuffling (only the last groupByKey, which is done over a minimal result set), it should be scalable. The only limitation would be the list of values per key in the groupByKey stage, which is only there to comply with the representation used in the question.

Match two RDDs [String]

I am trying to match two RDDs:
RDD1 contains a huge number of words [String] and RDD2 contains city names [String].
I want to return an RDD with the elements from RDD1 that are in RDD2.
Something like the opposite of subtract.
Afterwards I want to count the occurrence of each remaining word, but that won't be a problem.
thank you
I want to return an RDD with the elements from RDD1 that are in RDD2
If I got you right:
rdd1.subtract(rdd1.subtract(rdd2))
Note the difference between this code and intersection:
val rdd1 = sc.parallelize(Seq("a", "a", "b", "c"))
val rdd2 = sc.parallelize(Seq("a", "c", "d"))
val diff = rdd1.subtract(rdd2)
rdd1.subtract(diff).collect()
res0: Array[String] = Array(a, a, c)
rdd1.intersection(rdd2).collect()
res1: Array[String] = Array(a, c)
So, if your first RDD contains duplicates and your goal is to take those duplicates into account, you may prefer the double-subtract solution. Otherwise, intersection fits well.
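For the follow-up word counting mentioned in the question, a minimal sketch building on the double-subtract variant (using the same sample RDDs as above) could be:
val matched = rdd1.subtract(rdd1.subtract(rdd2)) // Array(a, a, c)
val counts = matched.map(word => (word, 1)).reduceByKey(_ + _)
counts.collect() // Array((a,2), (c,1)), order may vary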

IndexedSeq and removing multiple elements

I have mySeq: IndexedSeq[A] and another myIncludedSeq: IndexedSeq[A], where every element of myIncludedSeq is contained in mySeq.
I want to create a new IndexedSeq[A] from mySeq without the elements from myIncludedSeq.
I can't find any nice functional approach for this problem. How would you approach it?
Example:
val mySeq = IndexedSeq("a", "b", "a", "c", "d", "a")
val myIncludedSeq = IndexedSeq("a", "d", "a")
//magic
val expectedResult = IndexedSeq("b", "c", "a") //the order does not matter
How about this?
val original = IndexedSeq("a", "b", "a", "c", "d", "a")
val exclude = IndexedSeq("a", "d", "a")
val result = original.diff(exclude)
// IndexedSeq[String] = Vector(b, c, a)
From the List.diff documentation:
Computes the multiset difference between this list and another sequence.
Returns a new list which contains all elements of this list except
some of occurrences of elements that also appear in "excluding" list. If an
element value x appears n times in that, then the first n occurrences
of x will not form part of the result, but any following occurrences
will.
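In the example above, exclude contains "a" twice, so only the first two occurrences of "a" in original are removed and the third one survives; that is why the result still contains a single "a".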