Spark RDD tuple transformation - scala

I'm trying to transform an RDD of tuples of Strings from this format:
(("abc","xyz","123","2016-02-26T18:31:56"),"15")
to
(("abc","xyz","123"),"2016-02-26T18:31:56","15")
Basically, I want to separate out the timestamp string as its own tuple element. I tried the following, but it's still not clean or correct:
val result = rdd.map(r => (r._1.toString.split(",").toVector.dropRight(1).toString, r._1.toString.split(",").toList.last.toString, r._2))
However, it results in
(Vector(("abc", "xyz", "123"),"2016-02-26T18:31:56"),"15")
The expected output I'm looking for is
(("abc", "xyz", "123"),"2016-02-26T18:31:56","15")
This way I can access the elements using r._1, r._2 (the timestamp string) and r._3 in a separate map operation.
Any hints/pointers will be greatly appreciated.

Vector.toString will include the String 'Vector' in its result. Instead, use Vector.mkString(",").
Example:
scala> val xs = Vector(1,2,3)
xs: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3)
scala> xs.toString
res25: String = Vector(1, 2, 3)
scala> xs.mkString
res26: String = 123
scala> xs.mkString(",")
res27: String = 1,2,3
However, if you want to be able to access (abc,xyz,123) as a Tuple and not as a string, you could also do the following:
val res = rdd.map{
case ((a:String,b:String,c:String,ts:String),d:String) => ((a,b,c),ts,d)
}
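The same pattern match can be tried without a cluster, since it only depends on the shape of the tuples. A minimal sketch on a local Seq, where the object name TupleReshape and the sample record are illustrative assumptions and not part of the original question:

```scala
object TupleReshape {
  // Hypothetical sample record in the shape described in the question.
  val records: Seq[((String, String, String, String), String)] =
    Seq((("abc", "xyz", "123", "2016-02-26T18:31:56"), "15"))

  // Destructure the nested 4-tuple and rebuild it as (3-tuple, timestamp, value).
  val reshaped: Seq[((String, String, String), String, String)] =
    records.map { case ((a, b, c, ts), d) => ((a, b, c), ts, d) }
}
```

After this, the timestamp is available as _2 and the original second element as _3 of each reshaped record.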

Related

length of each word in array by using scala

I have data like this below. In an array we have different words
scala> val x=rdd.flatMap(_.split(" "))
x: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at flatMap at <console>:26
scala> x.collect
res41: Array[String] = Array(Roses, are, red, Violets, are, blue)
I want to find the length of each word in the array in Scala.
Spark lets you chain the functions defined on an RDD[T], which is RDD[String] in your case. You can add a map after your flatMap to get the lengths.
val sentence: String = "Apache Spark is a cluster compute engine"
val sentenceRDD = sc.makeRDD(List(sentence))
val wordLength = sentenceRDD.flatMap(_.split(" ")).map(_.length)
wordLength.take(2)
If x is a plain Scala Array (like the collected output in your question), you can pair each word with its length like this:
>x.map(s => s -> s.length)
This will print out the following:
Array[(String, Int)] = Array((Roses,5), (are,3), (red,3), (Violets,7), (are,3), (blue,4))
If x is a Spark RDD, add collect() to bring the results back to the driver:
>x.map(s => s -> s.length).collect()
This will print out the following:
Array[(String, Int)] = Array((Roses,5), (are,3), (red,3), (Violets,7), (are,3), (blue,4))
If you want only the length of each word then use this:
>x.map(_.length).collect()
Output:
Array(5,3,3,7,3,4)
You can just map over the array:
val a = Array("Roses", "are", "red", "Violets", "are", "blue")
val b = a.map(x => x.length)
This will give you Array[Int] = Array(5, 3, 3, 7, 3, 4)

Accessing a specific element of an Array RDD in apache-spark scala

I have an RDD containing key-value pairs. I want to get the element with a specific key (say 4).
scala> val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:27
scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[1] at keyBy at <console>:29
I tried to apply a filter on it, but I get an error:
scala> val c = b.filter(p => p(0) = 4);
<console>:31: error: value update is not a member of (Int, String)
val c = b.filter(p => p(0) = 4);
I want to print the key,value pair with specific key (say 4) as Array((4,lion))
The data is always coming in the form of an array of key,value pair
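Use p._1 instead of p(0), and == instead of = for the comparison. Writing p(0) = 4 is treated as a call to p.update(0, 4), which is why the error mentions "value update".
import org.apache.spark.rdd.RDD
val rdd = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 1)
val kvRdd: RDD[(Int, String)] = rdd.keyBy(_.length)
val filterRdd: RDD[(Int, String)] = kvRdd.filter(p => p._1 == 4)
// display rdd
println(filterRdd.collect().toList)
// prints: List((4,lion))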
There's a lookup method applicable to RDDs of Key-Value pairs (RDDs of type RDD[(K,V)]) that directly offers this functionality.
b.lookup(4)
// res4: Seq[String] = WrappedArray(lion)
b.lookup(5)
// res6: Seq[String] = WrappedArray(tiger, eagle)
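To see how filter and lookup relate, here is a rough local-collection sketch of the same idea. It uses a plain List instead of an RDD, and the names KeyLookup, pairsWithKey and valuesForKey are made up for illustration; they are not Spark APIs:

```scala
object KeyLookup {
  val words = List("dog", "tiger", "lion", "cat", "spider", "eagle")

  // Local stand-in for rdd.keyBy(_.length): pair each word with its length.
  val keyed: List[(Int, String)] = words.map(w => (w.length, w))

  // Same idea as filter(p => p._1 == k): keeps the whole (key, value) pair.
  def pairsWithKey(k: Int): List[(Int, String)] =
    keyed.filter { case (key, _) => key == k }

  // Same idea as lookup(k): returns only the values for that key.
  def valuesForKey(k: Int): List[String] =
    keyed.collect { case (key, value) if key == k => value }
}
```

So filter is the one to use when the (key, value) pair itself is wanted, as in the question, while lookup drops the key and returns only the values.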

Concatenate String to each element of a List in a Spark dataframe with Scala

I have two columns in a Spark dataframe: one is a String, and the other is a List of Strings. How do I create a new column that is the concatenation of the String in column one with each element of the list in column 2, resulting in another list in column 3.
For example, if column 1 is "a", and column 2 is ["A","B"], I'd like the output in column 3 of the dataframe to be ["aA","aB"].
So far, I have:
val multiplier = (x1: String, x2: Seq[String]) => {x1+x2}
val multiplierUDF = udf(multiplier)
val df2 = df1
.withColumn("col3", multiplierUDF(df1("col1"),df1("col2")))
which gives aWrappedArray(A,B)
I suggest you try your udf functions outside of spark, and get them working for local variables first. If you do:
val multiplier = (x1: String, x2: Seq[String]) => {x1+x2}
multiplier("a", Seq("A", "B"))
// output
res1: String = aList(A, B)
You'll see multiplier doesn't do what you want.
I think you're looking for:
val multiplier = (x1: String, x2: Seq[String]) => x2.map(x1+_)
multiplier("a", Seq("A", "B"))
//output
res2: Seq[String] = List(aA, aB)
I think you should redefine your UDF to something similar to my function append
val a = Seq("A", "B")
val p = "a"
def append(init: String, tails: Seq[String]) = tails.map(x => init + x)
append(p, a)
//res1: Seq[String] = List(aA, aB)

Selecting multiple arbitrary columns from Scala array using map()

I'm new to Scala (and Spark). I'm trying to read in a csv file and extract multiple arbitrary columns from the data. The following function does this, but with hard-coded column indices:
def readCSV(filename: String, sc: SparkContext): RDD[String] = {
val input = sc.textFile(filename).map(line => line.split(","))
val out = input.map(csv => csv(2)+","+csv(4)+","+csv(15))
return out
}
Is there a way to use map with an arbitrary number of column indices passed to the function in an array?
If you have a sequence of indices, you could map over it and return the values :
scala> val m = List(List(1,2,3), List(4,5,6))
m: List[List[Int]] = List(List(1, 2, 3), List(4, 5, 6))
scala> val indices = List(0,2)
indices: List[Int] = List(0, 2)
// For each inner sequence, get the relevant values
// indices.map(inner) is the same as indices.map(i => inner(i))
scala> m.map(inner => indices.map(inner))
res1: List[List[Int]] = List(List(1, 3), List(4, 6))
// If you want to join all of them use .mkString
scala> m.map(inner => indices.map(inner).mkString(","))
res2: List[String] = List(1,3, 4,6) // that's actually a List containing 2 String
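Applied to a single CSV line, the same idea could look like the sketch below. The helper name selectColumns is hypothetical, and it assumes simple comma-separated lines with no quoted fields:

```scala
object ColumnSelect {
  // Hypothetical helper: pick the given column indices from one CSV line
  // and join them back into a comma-separated String.
  // Assumes plain comma-separated input without quoted or escaped commas.
  def selectColumns(line: String, indices: Seq[Int]): String = {
    // The limit of -1 keeps trailing empty fields, which split(",") would drop.
    val cols = line.split(",", -1)
    indices.map(i => cols(i)).mkString(",")
  }
}
```

In the readCSV function from the question, the hard-coded map could then become something like sc.textFile(filename).map(line => ColumnSelect.selectColumns(line, indices)), with indices passed in as a parameter. An index beyond the number of columns will throw an exception, so real input may need a bounds check.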

Simple question about tuple of scala

I'm new to Scala, and I'm currently learning about tuples.
I can define a tuple as follows and get its items:
val tuple = ("Mike", 40, "New York")
println("Name: " + tuple._1)
println("Age: " + tuple._2)
println("City: " + tuple._3)
My questions are:
1. How do I get the length of a tuple?
2. Is a tuple mutable? Can I modify its items?
3. Are there any other useful operations we can do on a tuple?
Thanks in advance!
1] tuple.productArity
2] No.
3] Some interesting operations you can perform on tuples: (a short REPL session)
scala> val x = (3, "hello")
x: (Int, java.lang.String) = (3,hello)
scala> x.swap
res0: (java.lang.String, Int) = (hello,3)
scala> x.toString
res1: java.lang.String = (3,hello)
scala> val y = (3, "hello")
y: (Int, java.lang.String) = (3,hello)
scala> x == y
res2: Boolean = true
scala> x.productPrefix
res3: java.lang.String = Tuple2
scala> val xi = x.productIterator
xi: Iterator[Any] = non-empty iterator
scala> while(xi.hasNext) println(xi.next)
3
hello
See scaladocs of Tuple2, Tuple3 etc for more.
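For the first point, a small sketch using the tuple from the question shows productArity together with productIterator. The object name TupleInfo is only for illustration:

```scala
object TupleInfo {
  val tuple = ("Mike", 40, "New York")

  // Number of elements in the tuple.
  val size: Int = tuple.productArity

  // The elements as a List[Any]; the static element types are lost here.
  val elems: List[Any] = tuple.productIterator.toList
}
```

Note that productIterator yields Any, so it is mostly useful for generic printing or inspection and not for type-safe access.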
One thing that you can also do with a tuple is to extract the content using the match expression:
def tupleview(tup: Any): Unit = {
tup match {
case (a: String, b: String) =>
println("A pair of strings: "+a + " "+ b)
case (a: Int, b: Int, c: Int) =>
println("A triplet of ints: "+a + " "+ b + " " +c)
case _ => println("Unknown")
}
}
tupleview( ("Hello", "Freewind"))
tupleview( (1,2,3))
Gives:
A pair of strings: Hello Freewind
A triplet of ints: 1 2 3
Tuples are immutable, but, like all case classes, they have a copy method that can be used to create a new Tuple with a few changed elements:
scala> (1, false, "two")
res0: (Int, Boolean, java.lang.String) = (1,false,two)
scala> res0.copy(_2 = true)
res1: (Int, Boolean, java.lang.String) = (1,true,two)
scala> res1.copy(_1 = 1f)
res2: (Float, Boolean, java.lang.String) = (1.0,true,two)
Concerning question 3:
A useful thing you can do with Tuples is to store parameter lists for functions:
def f(i:Int, s:String, c:Char) = s * i + c
List((3, "cha", '!'), (2, "bora", '.')).foreach(t => println((f _).tupled(t)))
//--> chachacha!
//--> borabora.
[Edit] As Randall remarks, you'd better use something like this in "real life":
def f(i:Int, s:String, c:Char) = s * i + c
val g = (f _).tupled
List((3, "cha", '!'), (2, "bora", '.')).foreach(t => println(g(t)))
In order to extract the values from tuples in the middle of a "collection transformation chain" you can write:
val words = List((3, "cha"),(2, "bora")).map{ case(i,s) => s * i }
Note the curly braces around the case, parentheses won't work.
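As a small illustration of that note, the sketch below puts the case form next to an equivalent version using the _1 and _2 accessors, which does work with ordinary parentheses. The object name TupleInMap is just an assumption for the example:

```scala
object TupleInMap {
  val pairs = List((3, "cha"), (2, "bora"))

  // Pattern-matching form: the case clause needs curly braces.
  val viaCase: List[String] = pairs.map { case (i, s) => s * i }

  // Accessor form: a plain lambda in parentheses, no destructuring.
  val viaAccessors: List[String] = pairs.map(t => t._2 * t._1)
}
```

Both produce the same list; the case form is usually easier to read once the tuple has more than two elements.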
Another nice trick regarding question 3 (as 1 and 2 are already answered by others):
val tuple = ("Mike", 40, "New York")
tuple match {
case (name, age, city) =>{
println("Name: " + name)
println("Age: " + age)
println("City: " + city)
}
}
Edit: in fact, this is really a feature of pattern matching and case classes; a tuple is just a simple example of a case class.
1. You know the size of a tuple; it's part of its type. For example, if you define a function def f(tup: (Int, Int)), you know the length of tup is 2 because values of type (Int, Int) (aka Tuple2[Int, Int]) always have a length of 2.
2. No.
3. Not really. Tuples are useful for storing a fixed number of items of possibly different types and passing them around, putting them into data structures, etc. There's really not much you can do with them other than creating tuples and getting stuff out of tuples.
1 and 2 have already been answered.
A very useful thing that you can use tuples for is to return more than one value from a method or function. Simple example:
// Get the min and max of two integers
def minmax(a: Int, b: Int): (Int, Int) = if (a < b) (a, b) else (b, a)
// Call it and assign the result to two variables like this:
val (x, y) = minmax(10, 3) // x = 3, y = 10
Using shapeless, you easily get a lot of useful methods, that are usually available only on collections:
import shapeless.syntax.std.tuple._
val t = ("a", 2, true, 0.0)
val first = t(0)
val second = t(1)
// etc
val head = t.head
val tail = t.tail
val init = t.init
val last = t.last
val v = (2.0, 3L)
val concat = t ++ v
val append = t :+ 2L
val prepend = 1.0 +: t
val take2 = t take 2
val drop3 = t drop 3
val reverse = t.reverse
val zip = t zip (2.0, 2, "a", false)
val (unzip, other) = zip.unzip
val list = t.toList
val array = t.toArray
val set = t.to[Set]
Everything is typed as one would expect (that is first has type String, concat has type (String, Int, Boolean, Double, Double, Long), etc.)
The last method above (.to[Collection]) should be available in the next release (as of 2014/07/19).
You can also "update" a tuple
val a = t.updatedAt(1, 3) // gives ("a", 3, true, 0.0)
but that will return a new tuple instead of mutating the original one.