Mapping multiple arrays in Scala

I have two arrays in Scala, both with the same number of elements.
val v = myGraph.vertices.collect.map(_._1)
which gives:
Array[org.apache.spark.graphx.VertexId] = Array(-7023794695707475841, -44591218498176864, 757355101589630892, 21619280952332745)
and another
val w = myGraph.vertices.collect.map(_._2._2)
which gives:
Array[String] = Array(2, 3, 1, 2)
and I want to create a string using
val z = v.map("{id:" + _ + "," + "group:" + "1" + "}").mkString(",")
which gives:
String = {id:-7023794695707475841,group:1},{id:-44591218498176864,group:1},{id:757355101589630892,group:1},{id:21619280952332745,group:1}
But now, instead of the hardcoded group of "1", I want to map in the numbers from the w array to give:
String = {id:-7023794695707475841,group:2},{id:-44591218498176864,group:3},{id:757355101589630892,group:1},{id:21619280952332745,group:2}
How do I do this?

There's a method in Scala collections called zip which pairs up two collections just the way you need.
val v = Array(-37581, -44864, 757102, 21625)
val w = Array(2, 3, 1, 2)
val z = v.zip(w).map {
  case (id, group) => "{id:" + id + "," + "group:" + group + "}"
}.mkString(",")
Value z becomes:
{id:-37581,group:2},{id:-44864,group:3},{id:757102,group:1},{id:21625,group:2}
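The same pairing reads a bit more cleanly with string interpolation (available since Scala 2.10); this is just a stylistic variant of the answer above, using the same sample data:

```scala
// zip pairs the two arrays element-wise; the interpolated string
// replaces the chained + concatenation.
val v = Array(-37581, -44864, 757102, 21625)
val w = Array(2, 3, 1, 2)
val z = v.zip(w)
  .map { case (id, group) => s"{id:$id,group:$group}" }
  .mkString(",")
```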

Related

Strange behaviour in curly braces vs braces in scala

I've read through several Stack Overflow questions on the differences between curly braces and parentheses, such as What is the formal difference in Scala between braces and parentheses, and when should they be used?, but I didn't find the answer to the following question:
object Test {
  def main(args: Array[String]) {
    val m = Map("foo" -> 3, "bar" -> 4)
    val m2 = m.map(x => {
      val y = x._2 + 1
      "(" + y.toString + ")"
    })
    // The following DOES NOT work
    // m.map(x =>
    //   val y = x._2 + 1
    //   "(" + y.toString + ")"
    // )
    println(m2)
    // The following works.
    // If you explain {} as a block, and inside the block is a function,
    // m.map will take a function: how does this function take 2 lines?
    val m3 = m.map { x =>
      val y = x._2 + 2       // this line
      "(" + y.toString + ")" // and this line both belong to the same function
    }
    println(m3)
  }
}
The answer is very simple: when you use something like:
...map(x => x + 1)
You can only have one expression. So, something like:
scala> List(1,2).map(x => val y = x + 1; y)
<console>:1: error: illegal start of simple expression
List(1,2).map(x => val y = x + 1; y)
...
Simply doesn't work. Now, let's contrast this with:
scala> List(1,2).map{x => val y = x + 1; y} // or
scala> List(1,2).map(x => { val y = x + 1; y })
res4: List[Int] = List(2, 3)
And going even a little further:
scala> 1 + 3 + 4
res8: Int = 8
scala> {val y = 1 + 3; y} + 4
res9: Int = 8
By the way, the last y never leaves the scope of the {}:
scala> y
<console>:18: error: not found: value y

Spark Scala Understanding reduceByKey(_ + _)

I can't understand reduceByKey(_ + _) in this first Spark example in Scala:
object WordCount {
  def main(args: Array[String]): Unit = {
    val inputPath = args(0)
    val outputPath = args(1)
    val sc = new SparkContext()
    val lines = sc.textFile(inputPath)
    val wordCounts = lines.flatMap { line => line.split(" ") }
      .map(word => (word, 1))
      .reduceByKey(_ + _) // I can't understand this line
    wordCounts.saveAsTextFile(outputPath)
  }
}
reduce takes two elements and produces a third by applying a function to them.
The code you showed is equivalent to the following:
reduceByKey((x, y) => x + y)
Instead of defining dummy variables and writing a lambda, Scala is smart enough to figure out that what you are trying to achieve is to apply a function (a sum, in this case) to any two parameters it receives, hence the syntax:
reduceByKey(_ + _)
reduceByKey takes a function of two parameters, applies it, and returns the result.
reduceByKey(_ + _) is equivalent to reduceByKey((x, y) => x + y)
Example:
val numbers = Array(1, 2, 3, 4, 5)
val sum = numbers.reduceLeft[Int](_ + _)
println("The sum of the numbers one through five is " + sum)
Results:
The sum of the numbers one through five is 15
numbers: Array[Int] = Array(1, 2, 3, 4, 5)
sum: Int = 15
Similarly, reduceByKey(_ ++ _) is equivalent to reduceByKey((x, y) => x ++ y)
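To see why reduceByKey is described this way, here is a plain-collections sketch of its semantics: group the pairs by key, then reduce each key's values with the supplied function. Spark does the same thing in a distributed fashion, so this is only an illustration, not Spark's implementation:

```scala
// pairs is shaped like the (word, 1) tuples produced by the map step
val pairs = Seq(("the", 1), ("cat", 1), ("the", 1))
// group by key, then combine each key's values with _ + _
val counts = pairs
  .groupBy { case (word, _) => word }
  .map { case (word, ps) => (word, ps.map(_._2).reduce(_ + _)) }
// counts: Map(the -> 2, cat -> 1)
```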

String Hive function for split key-value pair into two columns

How can I split this data, T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0, into two columns using a Hive function?
For example
T 32
P 1
A 420
H 60
R 0.30841494477846165
S 0
You can do this with a regex implementation:
import java.util.regex.Pattern

def main(args: Array[String]): Unit = {
  val s = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
  val pattern = "[A-Z]_\\d+\\.?\\d*"
  val buff = new StringBuilder
  val m = Pattern.compile(pattern).matcher(s)
  while (m.find()) {
    buff.append(m.group(0)).append("\n")
  }
  println("output:\n" + buff.toString.replaceAll("_", " "))
}
Output:
output:
T 32
P 1
A 420
H 60
R 0.30841494477846165
S 0
If you need to collect the data for further processing, and you're guaranteed it's always paired correctly, you could do something like this.
scala> val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
str: String = T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0
scala> val data = str.split("_").sliding(2,2)
data: Iterator[Array[String]] = non-empty iterator
scala> data.toList // just to see it
res29: List[Array[String]] = List(Array(T, 32), Array(P, 1), Array(A, 420), Array(H, 60), Array(R, 0.30841494477846165), Array(S, 0))
You can split your string to get an array, zipWithIndex, and filter based on the index to get two arrays, col1 and col2, which you can then use for printing:
val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
val tmp = str.split('_').zipWithIndex
val col1 = tmp.filter( p => p._2 % 2 == 0 ).map( p => p._1)
val col2 = tmp.filter( p => p._2 % 2 != 0 ).map( p => p._1)
//col1: Array[String] = Array(T, P, A, H, R, S)
//col2: Array[String] = Array(32, 1, 420, 60, ...
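Yet another way to pair the tokens up, without the index bookkeeping, is grouped(2), which walks the split array two elements at a time; a small sketch on the same input string:

```scala
val str = "T_32_P_1_A_420_H_60_R_0.30841494477846165_S_0"
// grouped(2) yields Array(T, 32), Array(P, 1), ... in order
val rows = str.split('_').grouped(2)
  .collect { case Array(k, v) => s"$k $v" }
  .mkString("\n")
// rows starts with "T 32" and ends with "S 0"
```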

Scala: handle a tuple where the second element is an array of strings

I have an RDD with the following structure:
org.apache.spark.rdd.RDD[(String, Array[String])] = MappedRDD[40] at map at <console>:14
Here is what x.take(1) looks like:
Array[(String, Array[String])] = Array((8239427349237423,Array(122641|2|2|1|1421990315711|38|6487985623452037|684|, 1229|2|1|1|1411349089424|87|462966136107937|1568|.....))
For each string in the array I want to split on "|", take the element at index 6, and return it with the first element of the tuple, as follows:
8239427349237423-6487985623452037
8239427349237423-4629661361079371
I started as follows:
def getValues(lines: Array[String]) {
  for (line <- lines) {
    line.split("|")(6)
  }
}
I also tried the following:
val b = x.map(a => (a._1, a._2.flatMap(y => y.split("|")(6))))
But that ended up giving me the following:
Array[(String, Array[Char])] = Array((8239427349237423,Array(1, 2, 4, |, 9, |, 4, 1, 7, 6, |, 2, 9, 2, 7, 2, |, 7, |,....)))
If you want to do it for the whole x you can use flatMap:
def getValues(x: Array[(String, Array[String])]) =
  x flatMap (line => line._2 map (line._1 + "-" + _.split("\\|")(6)))
Or, maybe a bit more clearly, with a for-comprehension:
def getValues(x: Array[(String, Array[String])]) =
  for {
    (fst, snd) <- x
    line <- snd
  } yield fst + "-" + line.split("\\|")(6)
You have to call split with a "\\|" argument, because split takes a regular expression and | is a special symbol, so you need to escape it. (Edit: or you can use '|' (a Char), as suggested by @BenReich.)
To answer your comment, you can modify getValues to take a single element from x as an argument:
def getValues(item: (String, Array[String])) =
  item._2 map (item._1 + "-" + _.split('|')(6))
And then call it with
x flatMap getValues
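As a quick sanity check, the same code runs unchanged on a plain Array (flatMap and map exist on both Arrays and RDDs); the sample data below is the two records shown in the question's x.take(1):

```scala
def getValues(item: (String, Array[String])) =
  item._2 map (item._1 + "-" + _.split('|')(6))

val x = Array(
  ("8239427349237423", Array(
    "122641|2|2|1|1421990315711|38|6487985623452037|684|",
    "1229|2|1|1|1411349089424|87|462966136107937|1568|")))

// flatMap concatenates the per-tuple results into one flat Array
val result = x flatMap getValues
// result: Array(8239427349237423-6487985623452037,
//               8239427349237423-462966136107937)
```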

In Scala, how do I keep track of running totals without using var?

For example, suppose I wish to read in fat, carbs and protein values and print the running total of each. An imperative style would look like the following:
var totalFat = 0.0
var totalCarbs = 0.0
var totalProtein = 0.0
var lineNumber = 0

for (lineData <- allData) {
  totalFat += lineData...
  totalCarbs += lineData...
  totalProtein += lineData...
  lineNumber += 1
  printCSV(lineNumber, totalFat, totalCarbs, totalProtein)
}
How would I write the above using only vals?
Use scanLeft.
val zs = allData.scanLeft((0, 0.0, 0.0, 0.0)) { case (r, c) =>
  val lineNr = r._1 + 1
  val fat = r._2 + c...
  val carbs = r._3 + c...
  val protein = r._4 + c...
  (lineNr, fat, carbs, protein)
}
zs foreach Function.tupled(printCSV)
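Since the answer's snippet elides the field accesses, here is a runnable version of the scanLeft idea under an assumed record shape (LineData and its fields are hypothetical, standing in for the question's data):

```scala
// hypothetical record for one input row
case class LineData(fat: Double, carbs: Double, protein: Double)
val allData = List(LineData(1.0, 2.0, 3.0), LineData(0.5, 1.5, 2.5))

// accumulator is (lineNumber, totalFat, totalCarbs, totalProtein)
val zs = allData.scanLeft((0, 0.0, 0.0, 0.0)) { case ((n, f, c, p), d) =>
  (n + 1, f + d.fat, c + d.carbs, p + d.protein)
}.tail // drop the initial all-zero accumulator
// zs: List((1,1.0,2.0,3.0), (2,1.5,3.5,5.5))
```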
Recursion. Pass the sums from the previous row to a function that adds them to the values from the current row, prints them to CSV, and passes them to itself...
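That recursive approach can be sketched as follows, with the running totals carried as parameters instead of vars (the LineData record and its fields are hypothetical):

```scala
import scala.annotation.tailrec

// hypothetical record for one input row
case class LineData(fat: Double, carbs: Double, protein: Double)

@tailrec
def totals(rows: List[LineData], n: Int = 0, f: Double = 0.0,
           c: Double = 0.0, p: Double = 0.0): (Int, Double, Double, Double) =
  rows match {
    case Nil => (n, f, c, p)
    case d :: rest =>
      // printCSV(n + 1, f + d.fat, ...) would go here
      totals(rest, n + 1, f + d.fat, c + d.carbs, p + d.protein)
  }
```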
You can transform your data with map and get the total result with sum:
val total = allData map { ... } sum
With scanLeft you get the partial sums at each step:
val steps = allData.scanLeft(0) { case (sum, lineData) => sum + lineData }
val result = steps.last
If you want to create several new values in one iteration step, I would prefer a class which holds the values:
case class X(i: Int, str: String)

object X {
  def empty = X(0, "")
}

(1 to 10).scanLeft(X.empty) { case (sum, data) => X(sum.i + data, sum.str + data) }
It's just a jump to the left,
and then a fold to the right /:
class Data(val a: Int, val b: Int, val c: Int)

val list = List(new Data(3, 4, 5), new Data(4, 2, 3),
                new Data(0, 6, 2), new Data(2, 4, 8))

val res = (new Data(0, 0, 0) /: list) ((acc, x) =>
  new Data(acc.a + x.a, acc.b + x.b, acc.c + x.c))