Spark-Scala: Map the first element of list with every other element of list when lists are of varying length - scala

I have dataset of the following type in a textile:
1004,bb5469c5|2021-09-19 01:25:30,4f0d-bb6f-43cf552b9bc6|2021-09-25 05:12:32,1954f0f|2021-09-19 01:27:45,4395766ae|2021-09-19 01:29:13,
1018,36ba7a7|2021-09-19 01:33:00,
1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59,77a90a1c97b|2021-09-19 01:34:53,
1022,3623fe40|2021-09-19 01:33:00,
1028,6c77d26c-6fb86|2021-09-19 01:50:50,f0ac93b3df|2021-09-19 01:51:11,
1032,ac55-4be82f28d|2021-09-19 01:54:20,82229689e9da|2021-09-23 01:19:47,
I read the file using sc.textFile which returns an RDD of type Array[String] after which I perform the operations .map(x=>x.substring(1,x.length()-1)).map(x=>x.split(",").toList)
After split.toList I want to map the first element of each of the lists obtained to every other element of the list for which I use .map(x=>(x(0),x(1))).toDF("c1","c2")
This works fine for those lists which have only one value after split but skips on all other elements of the lists having more than one value for obvious reasons. For eg:
.map(x=>(x(0),x(1))) returns [1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59] but skips out on the third element here 77a90a1c97b|2021-09-19 01:34:53
How can I write a map function which returns [1020,23fe40-4796-ad3d-6d5499b|2021-09-19 01:38:59], [1020,77a90a1c97b|2021-09-19 01:34:53] given that all the lists created using .map(x=>x.split(",").toList) are of varying lengths (have varying number of elements)?

I noted the ',' at the end of the file, but split ignores nulls.
The solution is as follows, just try it and you will see it works:
// x._n cannot work here initially.
val rdd = spark.sparkContext.textFile("/FileStore/tables/oddfile_01.txt")
val rdd2 = rdd.map(line => line.split(','))
val rdd3 = rdd2.map(x => (x(0), x.tail.toList))
val rdd4 = rdd3.flatMap{case (x, y) => y.map((x, _))}
rdd4.collect
Cardinality does change in this approach though.

Related

Passing parameters to foldLeft scala

I am trying to use Scala's foldLeft function to compare a Stream with a List of Lists.
This is a snippet of the code I have
def freqParse(pairs: (Pair,Int,String), record: String): (Pair,Int,String) ={
val m: Pair = ("","")
val t: FreqPairs = Map((m,0))
(pairs._1,pairs._2,pairs._3)
}
val freqItems = items.map(v => (v._1)).toList
val cross = freqItems.flatMap(x => freqItems.map(y => (x, y))) // cross product to get pair of frequent items
val freq = lines.foldLeft(pairs,0,delim)(freqParse)
cross is basically a lists where each element is a pair of strings List(List(a,b), List(a,c)...).
lines is an input stream with a record per line (25902 lines in total).
I want to count how many times each pair (or each element in cross) occurs in the entirety of the stream. Essentially comparing all elements in cross to all elements in lines.
I decided foldLeft because that way I can take each line in the stream and split it at a delimiter and then check if both the elements in cross appear or not.
I am able to split each record in lines but I don't know how to pass the cross variable to the function to begin the comparison.

Scala. Need for loop where the iterations return a growing list

I have a function that takes a value and returns a list of pairs, pairUp.
and a key set, flightPass.keys
I want to write a for loop that runs pairUp for each value of flightPass.keys, and returns a big list of all these returned values.
val result:List[(Int, Int)] = pairUp(flightPass.keys.toSeq(0)).toList
for (flight<- flightPass.keys.toSeq.drop(1))
{val result:List[(Int, Int)] = result ++ pairUp(flight).toList}
I've tried a few different variations on this, always getting the error:
<console>:23: error: forward reference extends over definition of value result
for (flight<- flightPass.keys.toSeq.drop(1)) {val result:List[(Int, Int)] = result ++ pairUp(flight).toList}
^
I feel like this should work in other languages, so what am I doing wrong here?
First off, you've defined result as a val, which means it is immutable and can't be modified.
So if you want to apply "pairUp for each value of flightPass.keys", why not map()?
val result = flightPass.keys.map(pairUp) //add .toList if needed
A Scala method which converts a List of values into a List of Lists and then reduces them to a single List is called flatMap which is short for map then flatten. You would use it like this:
flightPass.keys.toSeq.flatMap(k => pairUp(k))
This will take each 'key' from flightPass.keys and pass it to pairUp (the mapping part), then take the resulting Lists from each call to pairUp and 'flatten' them, resulting in a single joined list.

Spark: Dividing one array by elements in another

I am new to Apache Spark and Scala. I am trying to understand something here: -
I have one array:
Companies= Array(
(Microsoft,478953),
(IBM,332042),
(JP Morgan,226003),
(Google,342033)
)
I wanted to divide this by another array, element by element:
Count = Array((Microsoft,4), (IBM,3), (JP Morgan,2), (Google,3))
I used this code :
val result: Array[(String, Double)] = wordMapCount
.zip(letterMapCount)
.map { case ((letter, wc), (_, lc)) => (letter, lc.toDouble / wc) }
From here: Divide Arrays
This works. However, I do not understand it. Why does zip require the second array and not the first one also the case matching how is that working here?
Why does zip require the second array and not the first one?
Because that's how zip works. It takes two separate RDD instances and maps one over the other to create pair of the first and second element:
def zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]
case matching how is that working here
You have two tuples:
(Microsoft, 478953), (Microsoft,4)
What this partial function does decomposition of the tuple type via a call to Tuple2.unapply. This:
case ((letter, wc), (_, lc))
Means "extract the first argument (_1) from the first tuple into a fresh value named letter, and the second argument (_2) to a fresh value named wc. Same goes for the second tuple. And then, it creates a new tuple with letter as the first value and the division of lc and wc as the second argument.

How to transform Array[(Double, Double)] into Array[Double] in Scala?

I'm using MLlib of Spark (v1.1.0) and Scala to do k-means clustering applied to a file with points (longitude and latitude).
My file contains 4 fields separated by comma (the last two are the longitude and latitude).
Here, it's an example of k-means clustering using Spark:
https://spark.apache.org/docs/1.1.0/mllib-clustering.html
What I want to do is to read the last two fields of my files that are in a specific directory in HDFS, transform them into an RDD<Vector> o use this method in KMeans class:
train(RDD<Vector> data, int k, int maxIterations)
This is my code:
val data = sc.textFile("/user/test/location/*")
val parsedData = data.map(s => Vectors.dense(s.split(',').map(fields => (fields(2).toDouble,fields(3).toDouble))))
But when I run it in spark-shell I get the following error:
error: overloaded method value dense with alternatives: (values:
Array[Double])org.apache.spark.mllib.linalg.Vector (firstValue:
Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (Array[(Double, Double)])
So, I don't know how to transform my Array[(Double, Double)] into Array[Double]. Maybe there is another way to read the two fields and convert them into RDD<Vector>, any suggestion?
Previous suggestion using flatMap was based on the assumption that you wanted to map over the elements of the array given by the .split(",") - and offered to satisfy the types, by using Array instead of Tuple2.
The argument received by the .map/.flatMap functions is an element of the original collection, so should be named 'field' (singluar) for clarity. Calling fields(2) selects the 3rd character of each of the elements of the split - hence the source of confusion.
If what you're after is the 3rd and 4th elements of the .split(",") array, converted to Double:
s.split(",").drop(2).take(2).map(_.toDouble)
or if you want all BUT the first to fields converted to Double (if there may be more than 2):
s.split(",").drop(2).map(_.toDouble)
There're two 'factory' methods for dense Vectors:
def dense(values: Array[Double]): Vector
def dense(firstValue: Double, otherValues: Double*): Vector
While the provided type above is Array[Tuple2[Double,Double]] and hence does not type-match:
(Extracting the logic above:)
val parseLineToTuple: String => Array[(Double,Double)] = s => s=> s.split(',').map(fields => (fields(2).toDouble,fields(3).toDouble))
What is needed here is to create a new Array out of the input String, like this: (again focusing only on the specific parsing logic)
val parseLineToArray: String => Array[Double] = s=> s.split(",").flatMap(fields => Array(fields(2).toDouble,fields(3).toDouble)))
Integrating that in the original code should solve the issue:
val data = sc.textFile("/user/test/location/*")
val vectors = data.map(s => Vectors.dense(parseLineToArray(s))
(You can of course inline that code, I separated it here to focus on the issue at hand)
val parsedData = data.map(s => Vectors.dense(s.split(',').flatMap(fields => Array(fields(2).toDouble,fields(3).toDouble))))

How to create a map from a RDD[String] using scala?

My file is,
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
Here there are 7 rows & 5 columns(0,1,2,3,4)
I want the output as,
Map(0 -> Set("sunny","overcast","rainy"))
Map(1 -> Set("hot","mild","cool"))
Map(2 -> Set("high","normal"))
Map(3 -> Set("false","true"))
Map(4 -> Set("yes","no"))
The output must be the type of [Map[Int,Set[String]]]
EDIT: Rewritten to present the map-reduce version first, as it's more suited to Spark
Since this is Spark, we're probably interested in parallelism/distribution. So we need to take care to enable that.
Splitting each string into words can be done in partitions. Getting the set of values used in each column is a bit more tricky - the naive approach of initialising a set then adding every value from every row is inherently serial/local, since there's only one set (per column) we're adding the value from each row to.
However, if we have the set for some part of the rows and the set for the rest, the answer is just the union of these sets. This suggests a reduce operation where we merge sets for some subset of the rows, then merge those and so on until we have a single set.
So, the algorithm:
Split each row into an array of strings, then change this into an
array of sets of the single string value for each column - this can
all be done with one map, and distributed.
Now reduce this using an
operation that merges the set for each column in turn. This also can
be distributed
turn the single row that results into a Map
It's no coincidence that we do a map, then a reduce, which should remind you of something :)
Here's a one-liner that produces the single row:
val data = List(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val row = data.map(_.split("\\W+").map(s=>Set(s)))
.reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}
Converting it to a Map as the question asks:
val theMap = row.zipWithIndex.map(_.swap).toMap
Zip the list with the index, since that's what we need as the key of
the map.
The elements of each tuple are unfortunately in the wrong
order for .toMap, so swap them.
Then we have a list of (key, value)
pairs which .toMap will turn into the desired result.
These don't need to change AT ALL to work with Spark. We just need to use a RDD, instead of the List. Let's convert data into an RDD just to demo this:
val conf = new SparkConf().setAppName("spark-scratch").setMaster("local")
val sc= new SparkContext(conf)
val rdd = sc.makeRDD(data)
val row = rdd.map(_.split("\\W+").map(s=>Set(s)))
.reduce{(a, b) => (a zip b).map{case (l, r) => l ++ r}}
(This can be converted into a Map as before)
An earlier oneliner works neatly (transpose is exactly what's needed here) but is very difficult to distribute (transpose inherently needs to visit every row)
data.map(_.split("\\W+")).transpose.map(_.toSet)
(Omitting the conversion to Map for clarity)
Split each string into words.
Transpose the result, so we have a list that has a list of the first words, then a list of the second words, etc.
Convert each of those to a set.
Maybe this do the trick:
val a = Array(
"sunny,hot,high,FALSE,no",
"sunny,hot,high,TRUE,no",
"overcast,hot,high,FALSE,yes",
"rainy,mild,high,FALSE,yes",
"rainy,cool,normal,FALSE,yes",
"rainy,cool,normal,TRUE,no",
"overcast,cool,normal,TRUE,yes")
val b = new Array[Map[String, Set[String]]](5)
for (i <- 0 to 4)
b(i) = Map(i.toString -> (Set() ++ (for (s <- a) yield s.split(",")(i))) )
println(b.mkString("\n"))