Spark Scala - get number of unique values by keys - scala

This is a beginner question. I have a text file containing computer login information. Once I filter out the bad records and map each line to only the two elements I need, I get an RDD that looks like:
(user10,Server1)
(user40,Server2)
(user20,Server2)
(user25,Server2)
(user30,Server2)
(user30,Server2)
(user71,Server1)
(user10,Server1)
For each server, I need to find the count of unique users; I would like to get something like:
(server1,2)
(server2,4)
I need to stay at the RDD level (no DataFrames yet), and I don't know how to proceed. Any help is appreciated.

Here is a solution that should be easy to understand.
import org.apache.spark.rdd.RDD

def logic(data: RDD[(String, String)]): RDD[(String, Int)] = {
  data
    .map { case (user, server) =>
      (server, Set(user))            // a singleton Set per record
    }
    .reduceByKey(_ ++ _)             // union the Sets, which removes duplicate users
    .map { case (server, userSet) =>
      (server, userSet.size)         // number of distinct users per server
    }
}
A Set is used here as the tool for finding unique users: unioning the per-record singleton Sets removes duplicates, and the final size is the number of distinct users.
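For comparison, here is a minimal alternative sketch (not from the original answer, reusing the same RDD types as above) that avoids building a Set per record by deduplicating the pairs first:
// Alternative sketch: drop duplicate (user, server) pairs, then do a plain
// word-count style aggregation per server.
def logicViaDistinct(data: RDD[(String, String)]): RDD[(String, Int)] =
  data
    .distinct()                               // each (user, server) pair kept once
    .map { case (_, server) => (server, 1) }
    .reduceByKey(_ + _)                       // distinct users per server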

If you have already reduced the input text file to the following RDD
(user10,Server1)
(user40,Server2)
(user20,Server2)
(user25,Server2)
(user30,Server2)
(user30,Server2)
(user71,Server1)
(user10,Server1)
The final RDD you need is similar to the word-count examples that are abundant on the web, but with one small trick. You can do the following:
val finalRdd = rdd
  .groupBy(x => (x._1, x._2))   // group identical (user, server) pairs
  .map { case (k, v) => k }     // keep each pair once (effectively a distinct)
  .map(x => (x._2, 1))          // (server, 1)
  .reduceByKey(_ + _)           // count distinct users per server
finalRdd would be:
(Server2,4)
(Server1,2)

Related

read a file in scala and get key value pairs as Map[String, List[String]]

I am reading a file and getting the records as a Map[String, List[String]] in Spark Scala. I want to achieve the same thing in pure Scala, without any Spark references (not reading an RDD). What should I change to make it work in a pure Scala way?
rdd
  .filter(x => (x != null) && (x.length > 0))
  .zipWithIndex()
  .map { case (line, index) =>
    val array = line.split("~").map(_.trim)
    (array(0), array(1), index)
  }
  .groupBy(_._1)
  .mapValues(x => x.toList.sortBy(_._3).map(_._2))
  .collect
  .toMap
For the most part it will remain the same, except for the groupBy part done on the RDD. Scala's List also has the map, filter, reduce, etc. methods, so they can be used in almost the same fashion.
import scala.io.Source
val lines = Source.fromFile("filename.txt").getLines.toList
Once the file is read and stored in a List, the methods can be applied to it.
For the groupBy part, one simple approach is to sort the tuples on the key; that effectively clusters the tuples with the same key together.
// assuming arr holds the (key, value, index) tuples built in the question
val grouped = scala.util.Sorting.stableSort(
  arr,
  (e1: (String, String, Long), e2: (String, String, Long)) => e1._1 < e2._1)
There are definitely better solutions, but this effectively does the same task.
I came up with the approach below:
import scala.io.Source

Source.fromInputStream(getClass.getResourceAsStream(filePath)).getLines
  .filter(line => (line != null) && (line.length > 0))
  .map(_.split("~")).toList
  .groupBy(_(0))
  .map { case (key, values) => (key, values.map(_(1))) }

Serialising temp collections created in Spark executors during task execution

I'm trying to find an effective way of writing collections created inside tasks to the output files of the job. For example, if we iterate over an RDD using foreach, we can create data structures that are local to the executor, e.g. the ListBuffer arr in the following code snippet. My problem is: how do I serialise arr and write it to a file?
(1) Should I use the FileWriter API, or will Spark's saveAsTextFile work?
(2) What are the advantages of using one over the other?
(3) Is there a better way of achieving the same?
PS: The reason I am using foreach instead of map is that I might not be able to transform all my RDD rows, and I want to avoid getting null values in the output.
val dataSorted: RDD[(Int, Int)] = <Some Operation>
val arr = ListBuffer[(String, String)]()   // driver-side buffer
dataSorted.foreach {
  case (e, r) =>
    if (e > 1000) {                        // e is the first Int of each pair
      arr += (("a", "b"))
    }
}
Thanks,
Devj
You should not use driver-side variables, but Accumulators. There are articles about them with code examples, and a related question has a simplified code example of a custom AccumulatorParam.
Write your own accumulator that is able to add (String, String) pairs, or use the built-in CollectionAccumulator. The latter is an implementation of AccumulatorV2, the new accumulator API from Spark 2.
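A minimal sketch of the CollectionAccumulator route, assuming sc is the SparkContext and reusing the question's dataSorted RDD and its condition:
import scala.collection.JavaConverters._

val acc = sc.collectionAccumulator[(String, String)]("pairs")

dataSorted.foreach { case (e, r) =>
  if (e > 1000) {
    acc.add(("a", "b"))      // safe to call from executor tasks
  }
}

// acc.value is a java.util.List; read it on the driver once the action has finished
val pairs = acc.value.asScala.toList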
Another way is to use Spark's built-in filter and map functions. Thanks @ImDarrenG for suggesting flatMap, but I think filter and map will be easier:
val result: Array[(String, String)] = someRDD
  .filter(x => x._1 > 1000)   // keep only the good rows
  .map(x => ("a", "b"))
  .collect()                  // convert to an array on the driver
The Spark API saves you some file handling code but essentially achieves the same thing.
The exception is if you are not using, say, HDFS, and do not want your output to be partitioned (spread across the executors' file systems). In that case you will need to collect the data to the driver and use a FileWriter to write to a single file (or files); how you achieve that will depend on how much data you have. If you have more data than the driver has memory, you will need to handle it differently.
As mentioned in another answer, you're creating an array in your driver, while adding items from your executors, which will not work in a cluster environment. Something like this might be a better way to map your data and handle nulls:
val outputRDD = dataSorted.flatMap {
  case (e, r) =>
    if (e > 1000) {
      Some(("a", "b"))
    } else {
      None                // dropped by flatMap, so no nulls end up in the output
    }
}
// save outputRDD to a file (or files) here using the appropriate method...
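For completeness, here is a hedged sketch of the two saving options mentioned above; the paths and the comma formatting are purely illustrative.
import java.io.{File, PrintWriter}

// Option 1: let Spark write the output (one part file per partition, e.g. on HDFS).
outputRDD.map { case (a, b) => s"$a,$b" }.saveAsTextFile("hdfs:///tmp/output")

// Option 2: collect to the driver and write a single local file
// (only viable when the collected data fits in driver memory).
val writer = new PrintWriter(new File("/tmp/output.txt"))
try {
  outputRDD.collect().foreach { case (a, b) => writer.println(s"$a,$b") }
} finally {
  writer.close()
}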

Apache Spark - Intersection of Multiple RDDs

In Apache Spark one can union multiple RDDs efficiently by using the sparkContext.union() method. Is there something similar for intersecting multiple RDDs? I have searched the sparkContext methods and could not find anything there or anywhere else. One solution could be to union the RDDs and then retrieve the duplicates, but I do not think that would be very efficient. Assume I have the following example with key/value pair collections:
val rdd1 = sc.parallelize(Seq((1,1.0),(2,1.0)))
val rdd2 = sc.parallelize(Seq((1,2.0),(3,4.0),(3,1.0)))
I want to retrieve a new collection which has the following elements:
(1,2.0) (1,1.0)
But of course for multiple RDDs, not just two.
Try:
val rdds = Seq(
  sc.parallelize(Seq(1, 3, 5)),
  sc.parallelize(Seq(3, 5)),
  sc.parallelize(Seq(1, 3))
)

rdds
  .map(rdd => rdd.map(x => (x, None)))                    // turn each element into a key
  .reduce((x, y) => x.join(y).keys.map(x => (x, None)))   // join keeps only the common keys
  .keys
There is an intersection method on RDD, but it only takes one other RDD:
def intersection(other: RDD[T]): RDD[T]
Let's implement the method you want in terms of this one.
def intersectRDDs[T](rdds: Seq[RDD[T]]): RDD[T] = {
  rdds.reduce { case (left, right) => left.intersection(right) }
}
If you've looked at the implementation of Spark joins, you can optimize the execution by putting the largest RDD first:
def intersectRDDs[T](rdds: Seq[RDD[T]]): RDD[T] = {
  rdds
    .sortBy(rdd => -1 * rdd.partitions.length)
    .reduce { case (left, right) => left.intersection(right) }
}
EDIT: It looks like I misread your example: your text read as though you were searching for the inverse of rdd.union's behavior, but your example implies you want to intersect by key. My answer does not address that case.
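For reference, and not part of the original answers, here is a hedged sketch of that intersect-by-key variant: keep every pair whose key appears in all of the RDDs.
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

def intersectByKey[K: ClassTag, V: ClassTag](rdds: Seq[RDD[(K, V)]]): RDD[(K, V)] = {
  // keys that are present in every RDD
  val commonKeys = rdds.map(_.keys.distinct()).reduce(_.intersection(_))
  // keep the original pairs whose key survived
  rdds.reduce(_ union _)
    .join(commonKeys.map(k => (k, ())))
    .mapValues { case (v, _) => v }
    .distinct()
}
With the question's rdd1 and rdd2, this yields (1,1.0) and (1,2.0).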

Spark RDD pipe value from tuple

I have a Spark RDD where each element is a tuple in the form (key, input). I would like to use the pipe method to pass the inputs to an external executable and generate a new RDD of the form (key, output). I need the keys for correlation later.
Here is an example using the spark-shell:
val data = sc.parallelize(
  Seq(
    ("file1", "one"),
    ("file2", "two two"),
    ("file3", "three three three")))

// Incorrectly processes the data (calls toString() on each tuple)
data.pipe("wc")

// Loses the keys, generates extraneous results
data.map(elem => elem._2).pipe("wc")
Thanks in advance.
The solution with map is not correct: map is not guaranteed to preserve the partitioning, so using zip afterwards can fail. You need to use mapValues to preserve the partitioning of the initial RDD.
data.zip(
  data
    .mapValues { _.toString }
    .pipe("my_executable")
).map { case ((key, input), output) =>
  (key, output)
}
Assuming you cannot pass the label in and out of the executable, this might work:
rdd
  .map(x => x._1)
  .zip(rdd
    .map(x => x._2)
    .pipe("my executable"))
Please be aware that this can be fragile, and it will definitely break if your executable does not produce exactly one output line for each input record.
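If the external program can be made to echo a leading tag back on every output line (an assumption, not something the answers above rely on), another option is to embed the key in each input line and parse it back out:
// Hedged sketch: prepend the key to every line piped to the (hypothetical)
// executable, and recover it from the output. Only works if the executable
// preserves the leading key field on each output line.
val piped = data
  .map { case (key, input) => s"$key\t$input" }
  .pipe("my_executable")
  .map { line =>
    val Array(key, output) = line.split("\t", 2)   // assumes the tab survives
    (key, output)
  }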

Summary Statistics for string types in spark

Is there something like the summary function in Spark, like the one in R?
The summary calculation that comes with Spark (MultivariateStatisticalSummary) operates only on numeric types.
I am interested in getting the results for string types as well, like the four most frequently occurring strings (a groupBy kind of operation), the number of uniques, etc.
Is there any preexisting code for this?
If not, please suggest the best way to deal with string types.
I don't think there is such a thing for String in MLlib, but it would probably be a valuable contribution if you implement it.
Calculating just one of these metrics is easy. E.g. for top 4 by frequency:
def top4(rdd: org.apache.spark.rdd.RDD[String]) =
  rdd
    .map(s => (s, 1))
    .reduceByKey(_ + _)
    .map { case (s, count) => (count, s) }
    .top(4)
    .map { case (count, s) => s }
Or number of uniques:
def numUnique(rdd: org.apache.spark.rdd.RDD[String]) =
  rdd.distinct.count
But doing this for all metrics in a single pass takes more work.
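As a rough illustration (not from the original answer), both metrics can be derived from a single frequency table, so the input is shuffled only once:
// Sketch: build the per-string counts once, then compute both metrics from them.
def stringSummary(rdd: org.apache.spark.rdd.RDD[String]): (Long, Seq[String]) = {
  val freqs = rdd.map(s => (s, 1L)).reduceByKey(_ + _).cache()
  val uniques = freqs.count()                   // number of distinct strings
  val top4 = freqs
    .map { case (s, count) => (count, s) }
    .top(4)
    .map { case (_, s) => s }
  (uniques, top4.toSeq)
}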
These examples assume that, if you have multiple "columns" of data, you have split each column into a separate RDD. This is a good way to organize the data, and it's necessary for operations that perform a shuffle.
What I mean by splitting up the columns:
def split(together: RDD[(Long, Seq[String])],
          columns: Int): Seq[RDD[(Long, String)]] = {
  together.cache() // We will do N passes over this RDD.
  (0 until columns).map { i =>
    together.mapValues(s => s(i))
  }
}
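For instance (illustrative, with a hypothetical rows RDD of type RDD[(Long, Seq[String])]), the per-column metrics could then be computed like this:
// Usage sketch: apply the metrics above to each column produced by split().
val cols = split(rows, columns = 3)
cols.foreach { col =>
  val values = col.values     // drop the row ids
  println(s"uniques: ${numUnique(values)}, top 4: ${top4(values).mkString(", ")}")
}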