Spark Map[RDD] - many-to-one map - Scala

I have an RDD in the following form:
[ ("a") -> (pos3, pos5), ("b") -> (pos1, pos7), .... ]
and
(pos1 ,pos2, ............, posn)
Q: How can I map each position to its key, to get something like the following?
("b", "e", "a", "d", "a" .....)
// "b" corresponds to pos 1, "e" corresponds to pos 2, and so on
Example (edit):
// chunk of my data
val data = Vector(("a",(124)), ("b",(125)), ("c",(121, 123)), ("d",(122)),..)
val rdd = sc.parallelize(data)
// from rdd I can create my position rdd which is something like:
val positions = Vector(1,2,3,4,.......125) // my positions
// I want to map each position to my tokens ("a", "b", "c", ...) to achieve:
Vector("a", "b", "a", ...)
// "a" corresponds to pos 1, "b" corresponds to pos 2, ...

I'm not sure you need Spark for this specific use case (you start with a Vector and end with a Vector containing all your characters).
Nevertheless, here is a suggestion, if it suits your needs:
val data = Vector(("a",Set(124)), ("b", Set(125)), ("c", Set(121, 123)), ("d", Set(122)))
val rdd = spark.sparkContext.parallelize(data)
val result = rdd
  .flatMap { case (k, positions) => positions.map(p => Map(p -> k)) }
  .reduce(_ ++ _) // aggregate the Map objects: partitions are reduced first, then the executors' results are merged
  .toVector
  .sortBy(_._1) // sort the data by position
  .map(_._2) // keep only the characters
  .mkString
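For comparison, here is a minimal plain-Python sketch of the same idea (no Spark): invert the token -> positions mapping into position -> token, then read the tokens back in position order. The helper name tokens_by_position is hypothetical, and the sketch assumes each position belongs to exactly one token.

```python
# Plain-Python sketch (no Spark): invert token -> positions into position -> token.
data = [("a", {124}), ("b", {125}), ("c", {121, 123}), ("d", {122})]


def tokens_by_position(pairs):
    """Return the tokens ordered by the position each one occupies."""
    position_to_token = {}
    for token, positions in pairs:
        for pos in positions:
            # assumption: a position is claimed by one token only
            position_to_token[pos] = token
    return [position_to_token[pos] for pos in sorted(position_to_token)]


result = tokens_by_position(data)
print(result)           # ['c', 'd', 'c', 'a', 'b']
print("".join(result))  # cdcab
```

Stopping before the join gives the list of tokens the question asks for; the Scala answer's final mkString corresponds to the join.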

Related

PySpark: Count pair frequency occurrences

Let's say I have a dataset as follows:
1: a, b, c
2: a, d, c
3: c, d, e
I want to write PySpark code to count the occurrences of each pair, such as (a,b), (a,c), (b,c), etc.
Expected output:
(a,b) 1
(b,c) 1
(c,d) 2
etc..
Note that (c,d) and (d,c) should be counted as the same pair.
How should I go about it?
So far, I have written the code to read the data from a text file as follows:
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext("local", "bp")
spark = SparkSession(sc)
data = sc.textFile('doc.txt')
dataFlatMap = data.flatMap(lambda x: x.split(" "))
Any pointers would be appreciated.
I relied on the answer to this question: How to create a Pyspark Dataframe of combinations from list column
The code below creates a UDF that applies the itertools.combinations function to the list of items. Each combination is sorted inside the UDF to avoid double counting pairs such as ("a", "b") and ("b", "a"). Once you have the combinations, you can groupBy and count rows. You may want to count distinct rows in case list elements repeat, like ("a", "a", "b"), but this depends on your requirements.
import pyspark.sql.functions as F
import itertools
from pyspark.sql.types import *
data = [(1, ["a", "b", "c"]), (2, ["a", "d", "c"]), (3, ["c", "d", "e"])]
df = spark.createDataFrame(data, schema = ["id", "arr"])
# df is
# id arr
# 1 ["a", "b", "c"]
# 2 ["a", "d", "c"]
# 3 ["c", "d", "e"]
@F.udf(returnType=ArrayType(ArrayType(StringType())))
def combinations_udf(arr):
    # all unordered pairs; sorting each pair makes (c, d) and (d, c) identical
    return [sorted(pair) for pair in itertools.combinations(arr, 2)]
# one row per pair
df1 = df.withColumn("combinations", F.explode(combinations_udf("arr")))
df_ans = (df1
    .groupBy("combinations")
    .agg(F.countDistinct("id").alias("count"))
    .orderBy(F.desc("count")))
For the given dataframe df, df_ans is
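As a cross-check that does not need a Spark session, the same pair counts can be sketched in plain Python with itertools.combinations and collections.Counter. This only illustrates the counting logic, it is not the DataFrame code; pair_counts is a hypothetical helper name.

```python
from collections import Counter
from itertools import combinations

rows = [(1, ["a", "b", "c"]), (2, ["a", "d", "c"]), (3, ["c", "d", "e"])]


def pair_counts(rows):
    """Count, for each unordered pair, the number of distinct rows containing it."""
    counts = Counter()
    for _row_id, items in rows:
        # a set of sorted pairs: (c, d) == (d, c), and repeats within a row count once
        pairs = {tuple(sorted(p)) for p in combinations(items, 2)}
        counts.update(pairs)
    return counts


counts = pair_counts(rows)
# most frequent pairs first, ties broken alphabetically
for pair, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
    print(pair, n)
```

For the three sample rows this gives a count of 2 for ("a", "c") and ("c", "d"), and 1 for each of the other five pairs.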

Counting number of occurrences of Array element in a RDD

I have an RDD1 of key-value pairs of type [(String, Array[String])] (I will refer to them as (X, Y)), and an Array[String] Z.
For every element in Z, I'm trying to count how many X instances have that element in Y. I want my output as ((X, Z(i)), #ofinstances).
RDD1 = ((A, (2, 3, 4)), (B, (4, 4, 4)), (A, (4, 5)))
Z = (1, 4)
then I want to get:
(((A, 4), 2), ((B, 4), 1))
Hope that made sense.
As you can see above, I only want an element if there is at least one occurrence.
I have tried this so far:
val newRDD = RDD1.map{case(x, y) => for(i <- 0 to (z.size-1)){if(y.contains(z(i))) {((x, z(i)), 1)}}}
My output here is an RDD[Unit].
I'm not sure if what I'm asking for is even possible, or if I have to do it another way.
So it is just another word count
val rdd = sc.parallelize(Seq(
("A", Array("2", "3", "4")),
("B", Array("4", "4", "4")),
("A", Array("4", "5"))))
val z = Array("1", "4")
To make lookups efficient convert z to Set:
val zs = z.toSet
val result = rdd
  .flatMapValues(_.filter(zs contains _).distinct)
  .map((_, 1))
  .reduceByKey(_ + _)
where
_.filter(zs contains _).distinct
keeps only the values that occur in z and deduplicates them.
result.take(2).foreach(println)
// ((B,4),1)
// ((A,4),2)
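The same "word count" logic can be sketched in plain Python (no Spark) to make the three steps explicit: filter each record's values against Z, deduplicate within the record, then count per (key, value). The name count_matches is hypothetical.

```python
from collections import Counter

rdd1 = [("A", ["2", "3", "4"]), ("B", ["4", "4", "4"]), ("A", ["4", "5"])]
z = ["1", "4"]


def count_matches(pairs, wanted):
    """Count, per (key, value), how many records of that key contain the value."""
    wanted_set = set(wanted)  # set membership is O(1) on average
    counts = Counter()
    for key, values in pairs:
        # set(values) deduplicates within one record, like .distinct
        for value in set(values) & wanted_set:
            counts[(key, value)] += 1
    return dict(counts)


print(count_matches(rdd1, z))  # {('A', '4'): 2, ('B', '4'): 1}
```

Values of Z that never occur, such as "1" here, simply produce no entry, which matches the requirement of only reporting elements with at least one occurrence.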

Filter elements of RDD[Array[String]] by an array of 0s and 1s

I want to select some elements (features) of an RDD based on a binary array. I have an array of 0s and 1s, of size 40, that specifies whether the element at each index is present or not.
My RDD was created from the kddcup99 dataset:
val rdd = sc.textFile("./data/kddcup.txt")
val data = rdd.map(_.split(','))
How can I filter or select the elements of data (RDD[Array[String]]) whose corresponding index in the binary array has the value 1?
If I understood your question correctly, you have an array like:
val arr = Array(1, 0, 1, 1, 1, 0)
And an RDD[Array[String]] which looks like:
val rdd = sc.parallelize(Array(
  Array("A", "B", "C", "D", "E", "F"),
  Array("G", "H", "I", "J", "K", "L")
))
Now, to get elements at the indices where arr has 1, you need to first get the indices which have 1 as the value in arr
val requiredIndices = arr.zipWithIndex.filter(_._1 == 1).map(_._2)
requiredIndices: Array[Int] = Array(0, 2, 3, 4)
And then, similarly, with the RDD you can use zipWithIndex and contains to check whether each index is in your requiredIndices array:
rdd.map(_.zipWithIndex.filter(x => requiredIndices.contains(x._2)).map(_._1)).collect()
// Array[Array[String]] = Array(Array(A, C, D, E), Array(G, I, J, K))
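The per-row selection can also be sketched in plain Python (no Spark) by pairing each value with its mask bit. select_by_mask is a hypothetical helper, and the sketch assumes every row is at least as long as the mask.

```python
mask = [1, 0, 1, 1, 1, 0]
rows = [
    ["A", "B", "C", "D", "E", "F"],
    ["G", "H", "I", "J", "K", "L"],
]


def select_by_mask(row, mask):
    """Keep the values of row whose mask bit is 1."""
    # zip pairs each value with the bit at the same index
    return [value for value, keep in zip(row, mask) if keep == 1]


selected = [select_by_mask(row, mask) for row in rows]
print(selected)  # [['A', 'C', 'D', 'E'], ['G', 'I', 'J', 'K']]
```

Pairing values with bits avoids the repeated contains lookup on the index array, which may matter little for 40 features but keeps the work per row linear.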

Add random elements to keyed RDD from the same RDD

Imagine we have a keyed RDD of type RDD[(Int, List[String])] with thousands of keys and thousands to millions of values:
val rdd = sc.parallelize(Seq(
(1, List("a")),
(2, List("a", "b")),
(3, List("b", "c", "d")),
(4, List("f"))))
For each key, I need to add random values from other keys. The number of elements to add varies and depends on the number of elements in the key, so the output could look like:
val rdd2: RDD[(Int, List[String])] = sc.parallelize(Seq(
(1, List("a", "c")),
(2, List("a", "b", "b", "c")),
(3, List("b", "c", "d", "a", "a", "f")),
(4, List("f", "d"))))
I came up with the following solution, which is obviously not very efficient (note: flattening and aggregation are optional, I'm fine with flattened data):
// flatten the input RDD
val rddFlat: RDD[(Int, String)] = rdd.flatMap(x => x._2.map(s => (x._1, s)))
// calculate number of elements for each key
val count = rddFlat.countByKey().toSeq
// foreach key take samples from the input RDD, change the original key and union all RDDs
val rddRandom: RDD[(Int, String)] = count.map { x =>
(x._1, rddFlat.sample(withReplacement = true, x._2.toDouble / count.map(_._2).sum, scala.util.Random.nextLong()))
}.map(x => x._2.map(t => (x._1, t._2))).reduce(_.union(_))
// union the input RDD with the random RDD and aggregate
val rddWithRandomData: RDD[(Int, List[String])] = rddFlat
.union(rddRandom)
.aggregateByKey(List[String]())(_ :+ _, _ ++ _)
What's the most efficient and elegant way to achieve that?
I use Spark 1.4.1.
Looking at the current approach, and to ensure the solution scales, the focus should probably be on a sampling mechanism that can run in a distributed fashion, removing the need to collect the keys back to the driver.
In a nutshell, we need a distributed method to take a weighted sample of all the values.
What I propose is to create a keys x values matrix where each cell holds the probability of that value being chosen for that key. Then we can randomly score that matrix and pick the values whose score falls within the probability.
Let's write a Spark-based algorithm for that:
import scala.util.Random
// Sample data to guide us.
// Note that I'm using distinguishable data across keys to see how the sampled data distributes over the keys.
val data = sc.parallelize(Seq(
(1, List("A", "B")),
(2, List("x", "y", "z")),
(3, List("1", "2", "3", "4")),
(4, List("foo", "bar")),
(5, List("+")),
(6, List())))
val flattenedData = data.flatMap{case (k,vlist) => vlist.map(v=> (k,v))}
val values = data.flatMap{case (k,list) => list}
val keysBySize = data.map{case (k, list) => (k,list.size)}
val totalElements = keysBySize.map{case (k,size) => size}.sum
val keysByProb = keysBySize.mapValues{size => size.toDouble/totalElements}
val probMatrix = keysByProb.cartesian(values)
val scoredSamples = probMatrix.map{case ((key, prob),value) =>
((key,value),(prob, Random.nextDouble))}
scoredSamples looks like this:
((1,A),(0.16666666666666666,0.911900315814998))
((1,B),(0.16666666666666666,0.13615047422122906))
((1,x),(0.16666666666666666,0.6292430257377151))
((1,y),(0.16666666666666666,0.23839887096373114))
((1,z),(0.16666666666666666,0.9174808344986465))
...
val samples = scoredSamples.collect{case (entry, (prob,score)) if (score<prob) => entry}
samples looks like this:
(1,foo)
(1,bar)
(2,1)
(2,3)
(3,y)
...
Now, we union our sampled data with the original and have our final result.
val result = (flattenedData union samples).groupByKey.mapValues(_.toList)
result.collect()
(1,List(A, B, B))
(2,List(x, y, z, B))
(3,List(1, 2, 3, 4, z, 1))
(4,List(foo, bar, B, 2))
(5,List(+, z))
Given that the whole algorithm is written as a sequence of transformations on the original data (see DAG below), with minimal shuffling (only the last groupByKey, which is done over a minimal result set), it should be scalable. The only limitation would be the list of values per key in the groupByKey stage, which is only there to comply with the representation used in the question.
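To make the scoring step concrete, here is a minimal single-machine sketch in plain Python (no Spark) of the same weighted sampling. The function name add_random_values and the rng parameter are assumptions for illustration. As in the Spark version, a key can also draw values from its own list, and the data set is assumed to contain at least one value.

```python
import random

data = [(1, ["A", "B"]), (2, ["x", "y", "z"]), (3, ["1", "2", "3", "4"]),
        (4, ["foo", "bar"]), (5, ["+"]), (6, [])]


def add_random_values(data, rng):
    """Append to each key values sampled from the whole data set."""
    all_values = [v for _key, values in data for v in values]
    total = len(all_values)  # assumed to be > 0
    result = []
    for key, values in data:
        prob = len(values) / total  # the key's share of all values
        # score every (key, value) cell; keep the value when its score falls below prob
        sampled = [v for v in all_values if rng.random() < prob]
        result.append((key, values + sampled))
    return result


rng = random.Random(42)  # fixed seed so the sketch is reproducible
for key, values in add_random_values(data, rng):
    print(key, values)
```

Each of the total cells of a key is kept with probability size / total, so the expected number of added values equals the key's own size. The actual number varies from run to run, and a key with an empty list never receives samples.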

Scala: match data between two lists

The first list contains the following data:
List(("A",66729122803169854198650092,"SD"),("B",14941578978240528153321786,"HD"),("C",14941578978240528153321786,"PD"))
and the second list contains the following data:
List(("X",14941578978240528153321786),("Y",68277588597782900503675727),("Z",14941578978240528153321786),("L",66729122803169854198650092))
Using these two lists, I want to build the following list by matching the number in the second position of the first list against the number in the second position of the second list, so my output should be:
List(("X",14941578978240528153321786,"B","HD"),("X",14941578978240528153321786,"C","PD"), ("Y",68277588597782900503675727,"",""),("Z",14941578978240528153321786,"B","HD"),("Z",14941578978240528153321786,"C","PD"), ("L",66729122803169854198650092,"A","SD"))
val tuples3 = List(
  ("A", "66729122803169854198650092", "SD"),
  ("B", "14941578978240528153321786", "HD"),
  ("C", "14941578978240528153321786", "PD"))
val tuples2 = List(
  ("X", "14941578978240528153321786"),
  ("Y", "68277588597782900503675727"),
  ("Z", "14941578978240528153321786"),
  ("L", "66729122803169854198650092"))
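Group the first list by the target field:
val tuples3Grouped =
  tuples3
    .groupBy(_._2)
    // strict map instead of mapValues: on Scala 2.13 mapValues returns a view without withDefaultValue
    .map { case (number, ts) => number -> ts.map(t => (t._1, t._3)) }
    .withDefaultValue(List(("", "")))
Join both lists on that field:
val result = for {
  (first, second) <- tuples2
  t <- tuples3Grouped(second)
} yield (first, second, t._1, t._2)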
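The group-then-join pattern above can be sketched in plain Python as well, which makes the default for unmatched numbers explicit. left_join is a hypothetical helper name, and the numbers are kept as strings, as in the Scala code.

```python
from collections import defaultdict

tuples3 = [
    ("A", "66729122803169854198650092", "SD"),
    ("B", "14941578978240528153321786", "HD"),
    ("C", "14941578978240528153321786", "PD"),
]
tuples2 = [
    ("X", "14941578978240528153321786"),
    ("Y", "68277588597782900503675727"),
    ("Z", "14941578978240528153321786"),
    ("L", "66729122803169854198650092"),
]


def left_join(left, right):
    """Join left (name, number) rows with right (name, number, tag) rows on number."""
    grouped = defaultdict(list)
    for name, number, tag in right:
        grouped[number].append((name, tag))
    result = []
    for name, number in left:
        # unmatched numbers get a single row padded with empty strings
        for r_name, r_tag in grouped.get(number, [("", "")]):
            result.append((name, number, r_name, r_tag))
    return result


for row in left_join(tuples2, tuples3):
    print(row)
```

This produces two rows each for X and Z, one padded row for Y, and one row for L, in the order of the second list.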