How to use RDD.flatMap? - scala

I have a text file with lines that contain a userid and a rid separated by | (pipe). Each rid value corresponds to many labels in another file.
How can I use flatMap to implement a method like the following:
xRdd = sc.textFile("file.txt").flatMap { line =>
  val (userid, rid) = line.split("\\|")
  val labelsArr = getLabels(rid)
  labelsArr.foreach { i =>
    ((userid, i), 1)
  }
}
At compile time, I get an error:
type mismatch; found : Unit required: TraversableOnce[?]

Piecing together the information provided, it seems you need to replace the foreach with a map: foreach returns Unit, while flatMap expects a TraversableOnce.
xRdd = sc.textFile("file.txt").flatMap { line =>
  val Array(userid, rid) = line.split("\\|")
  val labelsArr = getLabels(rid)
  labelsArr.map(i => ((userid, i), 1))
}

This is exactly the reason why I said here and here that Scala's for-comprehension can make things easier, and it should help you out too.
When you see a series of flatMap and map calls, the nesting should trigger some thinking about ways to cut the "noise". That begs for a simpler solution, doesn't it?
See the following and appreciate Scala (and its for-comprehension) for yourself!
val lines = sc.textFile("file.txt")
val pairs = for {
  line <- lines
  Array(userid, rid) = line.split("\\|")
  label <- getLabels(rid)
} yield ((userid, label), 1)
If you throw in Spark SQL to the mix, things would get even simpler. Just to whet your appetite:
scala> pairs.toDF.show
+-----------------+---+
| _1| _2|
+-----------------+---+
| [jacek,1]| 1|
|[jacek,getLabels]| 1|
| [agata,2]| 1|
|[agata,getLabels]| 1|
+-----------------+---+
I'm sure you can guess what was inside my file.txt file, can't you?
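(If you'd rather not guess: one reconstruction that reproduces the output above, with the file contents and the getLabels stub being purely my assumption, is a two-line file.txt containing jacek|1 and agata|2 together with a stub like the following.)
// hypothetical stub, not from the original answer
def getLabels(rid: String): Seq[String] = Seq(rid, "getLabels")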

Related

Programmatically add one or more condition filters in Scala Spark

How do I get all lines from a raw CSV file by filtering with multiple conditions?
I have a raw file and I change it into a DataFrame.
val text = sc.textFile("hdfs:///data/text/")
case class TextFile(id:String, time:String,text:String)
val textDf = text.map(_.split(","))
  .map(s => TextFile(s(0).toString(), s(1).toString(), s(2).toString()))
  .toDF()
And I also have a condition file.
val findWord = sc.textFile("hdfs:///condition/text.txt").collect.toList
If I knew what the conditions were, I would just write something like this:
textDf.filter(lower($"text").contains("ok") || lower($"text").contains("yes"))
There are various conditions, so I tried this:
val test = findWord.map(v => s"""lower($$"text").contains("$v")""").mkString(" || ");
textDf.filter(test).collect
but I can't run it. print(test) produces exactly the condition string I need, but it can't be used in the DataFrame filter:
org.apache.spark.sql.catalyst.parser.ParseException:
How do I solve this problem?
Thanks for your help and advice.
Building the condition as a String is not the best practice, I would say. You can manipulate the Column class instead, like this:
val condition = words.map(v => col("text").contains(s"$v")).reduce(_||_)
Which produces the following Column:
condition: org.apache.spark.sql.Column = (((contains(text, yes) OR contains(text, ok)) OR contains(text, k)) OR contains(text, y))
A full example:
val words = List("yes", "ok", "k", "y")
val condition = words.map(v => col("text").contains(s"$v")).reduce(_||_)
val df = Seq( ("word"), ("text"), ("ok"), ("abc"), ("y") ).toDF("text")
df.filter(condition).show
Output:
+----+
|text|
+----+
| ok|
| y|
+----+
You can dynamically construct your filtering condition based on the findWords file. Supposing findWords is a List[String], you can do something like this:
val accFilter = lit("1") === "1" // a column that has a default true condition

val composedFilter = findWords.foldLeft(accFilter) { case (accFilter, word) =>
  accFilter || lower($"text").contains(word)
}
This builds the filter as a chain of || conditions. Then you simply do:
textDf.filter(composedFilter)

Looking for ways to optimize the program [Scala/Spark]

I have an RDD which looks like the following:
( (tag_1, set_1), (tag_2, set_2) ) , ... , ( (tag_M, set_M), (tag_L, set_L) ), ...
And for each pair from the RDD I'm going to compute the expression
p(k) = [ product over j = 0,...,n_2-k-1 of (1 - n_1/(N-j)) ] * [ product over j = 0,...,k-1 of (n_1-j)(n_2-j) / ((N-n_2+k-j)(k-j)) ]
for k = 0,...,3 and find the sum p(0)+...+p(3). For each pair of pairs, n_1 is the size of the set in the first pair and n_2 is the size of the set in the second pair.
For now I wrote the following:
val N = 1000

pairRDD.map({
  case ((t1, l1), (t2, l2)) => (t1, t2, {
    val n_1 = l1.size
    val n_2 = l2.size
    val vals = (0 to 3).map(k => {
      val P1 = (0 to (n_2 - k - 1))
        .map(j => 1 - n_1 / (N - j.toDouble))
        .foldLeft(1.0)(_ * _)
      val P2 = (0 to (k - 1))
        .map(j => (n_1 - j.toDouble) * (n_2 - j.toDouble) / (N - n_2 + k.toDouble - j.toDouble) / (k.toDouble - j.toDouble))
        .foldLeft(1.0)(_ * _)
      P1 * P2
    })
    vals.sum.toDouble
  })
})
The problem is that it seems to run really slowly, and I hope there are some features of Scala/Spark that I don't know about that could reduce the execution time.
Edit:
1) In the first place I have a CSV file with 2 columns: tag and message_id. For each tag I find the messages it occurs in and create pairs like I described above (tagIdsZipped). The code is here
2) Then I want to compute the expression for each pair and write the result to a file.
Actually, I would also like to filter the result, but that would take even longer, so I'm not even trying for now.
3) No, actually I don't, but the problems happened when I tried to use this code. Previously I did the following:
val tagPairsWithMeasure: RDD[(Tag, Tag, Measure)] = tagIdsZipped.map({
  case ((t1, l1), (t2, l2)) => (t1, t2, {
    val numer = l1.intersect(l2).size
    val denom = Math.sqrt(l1.size) * Math.sqrt(l2.size)
    numer.toDouble / denom
  })
})
and everything worked fine. (see 4) )
4) In the file I described in 1) there are about 25 million rows (~1.2 GB). I'm computing on a Xeon E5-2673 @ 2.4 GHz with 32 GB RAM. It took about 1.5 h to execute the code with the function I described in 3). I see that there are more operations now, but it ran for about 3 hours and only about 25% of the task was done. The main problem is that I will have to work with about 3 times more data, but I can't even manage it on the 'smaller' set.
Thank you in advance!
As has been mentioned, there is not much to improve on the Spark side here.
The biggest issue I can see is the use of range.map.
(0 to (n_2-k-1)) creates a Range object.
Calling map on it creates a Vector, which allocates a lot of memory.
The simplest solution is to work with iterators, since foldLeft is a streaming-friendly function:
(0 to (n_2-k-1)).iterator instead of (0 to (n_2-k-1))
It also probably makes sense to try rewriting it imperatively using vars, loops and arrays, since computation inside a loop is extremely cheap. But that is a weapon of last resort; a sketch of what it could look like follows.
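For illustration only, here is a rough sketch of that imperative rewrite, computing the same p(0)+...+p(3) as the map/foldLeft code above (n_1, n_2 and N as in the question; a sketch of the idea, not a benchmarked rewrite):
def pSum(n_1: Int, n_2: Int, N: Int): Double = {
  var total = 0.0
  var k = 0
  while (k <= 3) {
    // same range as (0 to (n_2 - k - 1)), but no intermediate Vector
    var p1 = 1.0
    var j = 0
    while (j <= n_2 - k - 1) {
      p1 *= 1 - n_1 / (N - j.toDouble)
      j += 1
    }
    // same range as (0 to (k - 1)); empty when k == 0, so p2 stays 1.0
    var p2 = 1.0
    j = 0
    while (j <= k - 1) {
      p2 *= (n_1 - j.toDouble) * (n_2 - j.toDouble) /
        (N - n_2 + k.toDouble - j.toDouble) / (k.toDouble - j.toDouble)
      j += 1
    }
    total += p1 * p2
    k += 1
  }
  total
}

// and then, inside the map:
// case ((t1, l1), (t2, l2)) => (t1, t2, pSum(l1.size, l2.size, N))
The while loops avoid the Range-to-Vector allocations entirely, at the cost of less idiomatic code.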
Have you tried using DataFrames?
Maybe you can create a DataFrame with a schema like this:
tagIdsDF
+-----------------------------+
|tag1 | set1 |tag2 | set2 |
+-----------------------------+
|tag_1 |set_1 |tag_2 |set_2 |
|... |
|tag_M |set_M |tag_L |set_L |
+-----------------------------+
and define a UDF to compute the sum:
val pFun = udf((l1: Seq[Double], l2: Seq[Double]) => {
  val n_1 = l1.size
  val n_2 = l2.size
  val vals = (0 to 3).map(k => {
    val P1 = (0 to (n_2 - k - 1))
      .map(j => 1 - n_1 / (N - j.toDouble))
      .foldLeft(1.0)(_ * _)
    val P2 = (0 to (k - 1))
      .map(j => (n_1 - j.toDouble) * (n_2 - j.toDouble) / (N - n_2 + k.toDouble - j.toDouble) / (k.toDouble - j.toDouble))
      .foldLeft(1.0)(_ * _)
    P1 * P2
  })
  vals.sum.toDouble
})
Notice that you don't need to pass tag_1/tag_2, because this information is already in the resulting DataFrame. Then you can call it like this:
val tagWithMeasureDF = tagIdsDF.withColumn("measure", pFun($"set1", $"set2"))
and you get this df:
tagWithMeasureDF
+-----------------------------+---------+
|tag1 | set1 |tag2 | set2 | measure |
+---------------------------------------+
|tag_1 |set_1 |tag_2 |set_2 | m1 |
|... ... ...|
|tag_M |set_M |tag_L |set_L | mn |
+---------------------------------------+
Doing something like this may help you achieve the desired performance.
Hope this helps, and if it works, tell me!
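One step the answer leaves implicit is how to get from the pair RDD in the question to tagIdsDF. A minimal sketch, assuming the tags are Strings and the sets hold Double values matching the pFun signature above (only their sizes are actually used):
import spark.implicits._

// sketch: flatten each ((tag1, set1), (tag2, set2)) pair into four columns
val tagIdsDF = tagIdsZipped
  .map { case ((t1, s1), (t2, s2)) => (t1, s1.toSeq, t2, s2.toSeq) }
  .toDF("tag1", "set1", "tag2", "set2")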

Comparing values from different keys in scala / spark

I am trying to find the difference between values for keys that are related (but not the same). For example, let's say that I have the following map:
("John_1",["a","b","c"])
("John_2",["a","b"])
("John_3",["b","c"])
("Mary_5",["a","d"])
("John_5",["c","d","e"])
I want to compare the contents of Name_# to Name_(#-1) and get the difference. So, for the example above, I would like to get:
("John_1",["a","b","c"]) //Since there is no John_0, all of the contents are new, so I keep them all
("John_2",[]) //Since all of the contents of John_2 appear in John_1, the resulting list is empty (for now, I don't care about what happened to "c")
("John_3",["c"]) //In this case, "c" is a new item (because I don't care whether it existed prior to John_2). Again, I don't care what happened to "a".
("Mary_5",["a","d"]) //There is no Mary_4, so all the items are kept
("John_5",["c","d","e"]) //There is no John_4, so all the items are kept.
I was thinking of doing some kind of aggregateByKey and then just finding the difference between the lists, but I do not know how to make the match between the keys that I care about, namely Name_# with Name_(#-1).
Split "id":
import org.apache.spark.sql.functions._
val df = Seq(
  ("John_1", Seq("a", "b", "c")), ("John_2", Seq("a", "b")),
  ("John_3", Seq("b", "c")), ("Mary_5", Seq("a", "d")),
  ("John_5", Seq("c", "d", "e"))
).toDF("key", "values").withColumn(
  "user", split($"key", "_")(0)
).withColumn("id", split($"key", "_")(1).cast("long"))
Add window:
val w = org.apache.spark.sql.expressions.Window
.partitionBy($"user").orderBy($"id")
and a udf:
val diff = udf((x: Seq[String], y: Seq[String]) => y.diff(x))
and compute:
df
.withColumn("is_previous", coalesce($"id" - lag($"id", 1).over(w) === 1, lit(false)))
.withColumn("diff", when($"is_previous", diff( lag($"values", 1).over(w), $"values")).otherwise($"values"))
.show
// +------+---------+----+---+-----------+---------+
// | key| values|user| id|is_previous| diff|
// +------+---------+----+---+-----------+---------+
// |Mary_5| [a, d]|Mary| 5| false| [a, d]|
// |John_1|[a, b, c]|John| 1| false|[a, b, c]|
// |John_2| [a, b]|John| 2| true| []|
// |John_3| [b, c]|John| 3| true| [c]|
// |John_5|[c, d, e]|John| 5| false|[c, d, e]|
// +------+---------+----+---+-----------+---------+
I managed to solve my issue as follows:
First create a function that computes the previous key from the current key
def getPrevKey(k: String): String = {
  val Array(n, h) = k.split("_")
  val i = h.toInt
  val sb = new StringBuilder
  sb.append(n).append("_").append(i - 1)
  sb.toString
}
Then, create a copy of my RDD with the shifted key:
val copyRdd = myRdd.map(row => {
  val k1 = row._1
  val v1 = row._2
  val k2 = getPrevKey(k1)
  (k2, v1)
})
And finally, I union both RDDs and reduce by key by taking the difference between the lists:
val result = myRdd.union(copyRdd)
.reduceByKey(_.diff(_))
This gets me the exact result I need, but has the problem that it requires a lot of memory due to the union. The final result is not that large, but the partial results really weigh down the process.

How to count the number of words per line in text file using RDD?

Is there a way to count the number of word occurrences for each line of an RDD and not the complete RDD using map and reduce?
For example, if an RDD[String] contains these two lines:
Let's have some fun.
To have fun you don't need any plans.
then the output should be like a map containing the key value pairs:
("Let's",1)
("have",1)
("some",1)
("fun",1)
("To",1)("have",1)("fun",1)("you",1)("don't",1)("need",1)("plans",1)
Please, don't use the RDD API if you've just started using Spark and no one told you to use it. There's a much nicer and often more efficient Spark SQL API for this and many other distributed computations over large datasets in Spark.
Using the RDD API is like using assembler for something you could write in Scala (or another higher-level programming language). It's certainly too much to think about when starting your journey into Spark, which is why I'd personally recommend the higher-level API of Spark SQL with DataFrames and Datasets in the first place.
Given the input:
$ cat input.txt
Let's have some fun.
To have fun you don't need any plans.
and that you were to use Dataset API, you could do the following:
import org.apache.spark.sql.functions._

val lines = spark.read.text("input.txt").withColumnRenamed("value", "line")
val wordsPerLine = lines.withColumn("words", explode(split($"line", "\\s+")))
scala> wordsPerLine.show(false)
+-------------------------------------+------+
|line |words |
+-------------------------------------+------+
|Let's have some fun. |Let's |
|Let's have some fun. |have |
|Let's have some fun. |some |
|Let's have some fun. |fun. |
| | |
|To have fun you don't need any plans.|To |
|To have fun you don't need any plans.|have |
|To have fun you don't need any plans.|fun |
|To have fun you don't need any plans.|you |
|To have fun you don't need any plans.|don't |
|To have fun you don't need any plans.|need |
|To have fun you don't need any plans.|any |
|To have fun you don't need any plans.|plans.|
+-------------------------------------+------+
scala> wordsPerLine.
groupBy("line", "words").
count.
withColumn("word_count", struct($"words", $"count")).
select("line", "word_count").
groupBy("line").
agg(collect_set("word_count")).
show(truncate = false)
+-------------------------------------+------------------------------------------------------------------------------+
|line |collect_set(word_count) |
+-------------------------------------+------------------------------------------------------------------------------+
|To have fun you don't need any plans.|[[fun,1], [you,1], [don't,1], [have,1], [plans.,1], [any,1], [need,1], [To,1]]|
|Let's have some fun. |[[have,1], [fun.,1], [Let's,1], [some,1]] |
| |[[,1]] |
+-------------------------------------+------------------------------------------------------------------------------+
Done. Simple, isn't it?
See functions object (for explode and struct functions).
According to my understanding you can do the following
You said that you have RDD[String] data
val data = Seq("Let's have some fun.",
"To have fun you don't need any plans.")
val rddData = sparkContext.parallelize(data)
You can apply flatMap to split the lines and create (word, 1) tuples in a map function:
val output = rddData.flatMap(_.split(" ")).map(word => (word, 1))
that should give you your desired output
output.foreach(println)
To get occurrences by line you should do the following:
val output = rddData
  .map(_.split(" ")
    .map((_, 1))
    .groupBy(_._1)
    .map { case (group: String, traversable) => traversable.reduce { (a, b) => (a._1, a._2 + b._2) } }
    .toList)
  .flatMap(tuple => tuple)
What you want is to transform a line into a Map(word, count). So you can define a function that counts words per line:
def wordsCount(line: String): Map[String, Int] = {
  line.split(" ").map(v => (v, 1)).groupBy(_._1).mapValues(_.size)
}
then just apply it to your RDD[String]:
val lines:RDD[String] = ...
val wordsByLineRDD:RDD[Map[String,Int]] = lines.map(wordsCount)
// this should give you a Map per line with count of each word
wordsByLineRDD.take(2)
// Something like
// Array(Map(some -> 1, have -> 1, Let's -> 1, fun. -> 1), Map(any -> 1, have -> 1, don't -> 1, you -> 1, need -> 1, fun -> 1, To -> 1, plans. -> 1))
Although it is an old question, I was looking for an answer to this in PySpark. I finally managed it like below.
file_ = cont_.parallelize (
["shots are shots that are shots with more big shots by big people",
"people comes in all shapes and sizes, as people are idoits of the idiots",
"i know what i am writing is nonsense, but i don't care because i am doing this to test my spark program",
"my spark is a current spark, that spark in my eyes."]
)
file_ \
.map(lambda x : [((x, i), 1) for i in x.split()]) \
.flatMap(lambda x : x) \
.reduceByKey(lambda x, y : x + y) \
.sortByKey(False) \
.map(lambda x : (x[0][1], x[1])) \
.collect()
Let's say you have your RDD like this:
val data = Seq("Let's have some fun.",
"To have fun you don't need any plans.")
val rddData = sparkContext.parallelize(data)
Then simply apply flatMap and then map:
val res = rddData.flatMap(line => line.split(" ")).map(word => (word,1))
Expected Output
res.take(100)
res4: Array[(String, Int)] = Array((Let's,1), (have,1), (some,1), (fun.,1), (To,1), (have,1), (fun,1), (you,1), (don't,1), (need,1), (any,1), (plans.,1))

Stratified sampling in Spark

I have a data set which contains user and purchase data. Here is an example, where the first element is userId, the second is productId, and the third indicates a boolean.
(2147481832,23355149,1)
(2147481832,973010692,1)
(2147481832,2134870842,1)
(2147481832,541023347,1)
(2147481832,1682206630,1)
(2147481832,1138211459,1)
(2147481832,852202566,1)
(2147481832,201375938,1)
(2147481832,486538879,1)
(2147481832,919187908,1)
...
I want to make sure I only take 80% of each user's data and build an RDD, while taking the remaining 20% and building another RDD. Let's call them train and test. I would like to stay away from using groupBy to start with, since it can create memory problems because the data set is large. What's the best way to do this?
I could do the following, but this will not give 80% of each user:
val percentData = data.map(x => ((math.random * 100).toInt, (x._1, x._2, x._3)))
val train = percentData.filter(x => x._1 < 80).values.repartition(10).cache()
One possible solution is in Holden's answer, and here are some other solutions:
Using RDDs:
You can use the sampleByKeyExact transformation, from the PairRDDFunctions class.
sampleByKeyExact(boolean withReplacement, scala.collection.Map fractions, long seed)
Return a subset of this RDD sampled by key (via stratified sampling) containing exactly math.ceil(numItems * samplingRate) for each stratum (group of pairs with the same key).
And this is how I would do it.
Considering the following list:
val seq = Seq(
(2147481832,23355149,1),(2147481832,973010692,1),(2147481832,2134870842,1),(2147481832,541023347,1),
(2147481832,1682206630,1),(2147481832,1138211459,1),(2147481832,852202566,1),(2147481832,201375938,1),
(2147481832,486538879,1),(2147481832,919187908,1),(214748183,919187908,1),(214748183,91187908,1)
)
I would create a pair RDD, mapping all the users as keys:
val data = sc.parallelize(seq).map(x => (x._1,(x._2,x._3)))
Then I'll set up the fractions for each key as follows, since sampleByKeyExact takes a Map of fractions per key:
val fractions = data.map(_._1).distinct.map(x => (x,0.8)).collectAsMap
What I have done here is map over the keys to find the distinct keys and then associate each with a fraction equal to 0.8. I collect the whole thing as a Map.
Now, to sample:
import org.apache.spark.rdd.PairRDDFunctions
val sampleData = data.sampleByKeyExact(false, fractions, 2L)
or
val sampleData = data.sampleByKeyExact(withReplacement = false, fractions = fractions,seed = 2L)
You can check the count on your keys, data or data sample:
scala > data.count
// [...]
// res10: Long = 12
scala > sampleData.count
// [...]
// res11: Long = 10
Using DataFrames :
Let's consider the same data (seq) from the previous section.
val df = seq.toDF("keyColumn","value1","value2")
df.show
// +----------+----------+------+
// | keyColumn| value1|value2|
// +----------+----------+------+
// |2147481832| 23355149| 1|
// |2147481832| 973010692| 1|
// |2147481832|2134870842| 1|
// |2147481832| 541023347| 1|
// |2147481832|1682206630| 1|
// |2147481832|1138211459| 1|
// |2147481832| 852202566| 1|
// |2147481832| 201375938| 1|
// |2147481832| 486538879| 1|
// |2147481832| 919187908| 1|
// | 214748183| 919187908| 1|
// | 214748183| 91187908| 1|
// +----------+----------+------+
We will need the underlying RDD, on which we create tuples of the elements by defining our key to be the first column:
val data: RDD[(Int, Row)] = df.rdd.keyBy(_.getInt(0))
val fractions: Map[Int, Double] = data.map(_._1)
.distinct
.map(x => (x, 0.8))
.collectAsMap
val sampleData: RDD[Row] = data.sampleByKeyExact(withReplacement = false, fractions, 2L)
.values
val sampleDataDF: DataFrame = spark.createDataFrame(sampleData, df.schema) // you can use sqlContext.createDataFrame(...) instead for Spark 1.6
You can now check the count on your keys or df or data sample :
scala > df.count
// [...]
// res9: Long = 12
scala > sampleDataDF.count
// [...]
// res10: Long = 10
Since Spark 1.5.0 you can use DataFrameStatFunctions.sampleBy method:
df.stat.sampleBy("keyColumn", fractions, seed)
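For example, a small sketch reusing df from above (the 80%-for-every-key fractions map is my own assumption here, mirroring the RDD section):
// per-key fractions built straight from the DataFrame: 80% for each key
val dfFractions: Map[Int, Double] = df.select("keyColumn").distinct
  .collect()
  .map(r => r.getInt(0) -> 0.8)
  .toMap

val sampleDF = df.stat.sampleBy("keyColumn", dfFractions, seed = 2L)
Note that, unlike sampleByKeyExact, sampleBy samples each row independently, so the per-key fractions are only approximate.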
Something like this may be well suited to something like "BlinkDB", but let's look at the question. There are two ways to interpret what you've asked. One is:
1) You want 80% of your users, and you want all of the data for them.
2) You want 80% of each user's data.
For #1 you could do a map to get the user ids, call distinct, and then sample 80% of them (you may want to look at kFold in MLUtils or BernoulliCellSampler); a rough sketch of this is below. You can then filter your input data to just the set of IDs you want.
For #2 you could look at BernoulliCellSampler and simply apply it directly.
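Here is a rough sketch of interpretation #1, assuming the data is an RDD of (userId, productId, flag) triples as in the question and that the number of distinct users fits comfortably on the driver:
// sample 80% of the users (approximately), then keep every row for those users
val userIds = data.map(_._1).distinct()
val sampledUserIds = userIds
  .sample(withReplacement = false, fraction = 0.8, seed = 42L)
  .collect()
  .toSet

val train = data.filter(x => sampledUserIds.contains(x._1))
val test  = data.filter(x => !sampledUserIds.contains(x._1))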