Can someone explain this line of Scala code to me? - scala

Scala syntax has been driving me nuts. Below is a line of Scala from a Spark driver program. I get most of it except the very end.
val ratings = lines.map(x => x.toString().split("\t")(2))
The (2) just floating out there doesn't make sense. I understand intellectually that it's the third item in the RDD, but why is there not a dot or something connecting it to the rest of the statement?

It's Scala's syntax for accessing an Array element.
x.toString().split("\t")
The above returns an Array. Adding the (2) returns the third element in that array. This is syntactic sugar for calling .apply(2) on the array, which gives you the element at the supplied index.
An example:
val numbers = Array("beaver", "aardvark", "warthog")
numbers(0) // "beaver"; same as numbers.apply(0)
numbers(1) // "aardvark"
numbers(2) // "warthog"

Because the string x is split into an array and this is the syntax to access the array element

In my observation
val fruits = Array("Apple", "Banana", "Orange");
fruits.map(x => x.toString().split("\t")(0))
Array[String] = Array(Apple, Banana, Orange)
fruits.map(x => x.toString().split("\t"))
Array[Array[String]] = Array(Array(Apple), Array(Banana), Array(Orange))
fruits.map(x => x.toString())
Array[String] = Array(Apple, Banana, Orange)

Related

Filter array of tuple with an array of Double SCALA

I am very new to scala. Here is my issue:
I have an array:
val numbers = Array(1, 2, 3, 4, 5)
And an array of tupples.
val arrayTuple= Array((1,2),(10,5),(40,5),(3,4))
I would like to filter this list and keep only tuples that have their first elment in the list numbers.
val filtered=arrayTuple.filter(numbers.contains(_.1)).map(x=>x)
But it doesn't work. Can you help me please. Thank you
Your syntax to access the first element of a tuple is wrong (see the Scaladoc). You also don't need the map:
val filtered = arrayTuple.filter(t => numbers.contains(t._1))

How do I split a Spark rdd Array[(String, Array[String])]?

I'm practicing on doing sorts in the Spark shell. I have an rdd with about 10 columns/variables. I want to sort the whole rdd on the values of column 7.
rdd
org.apache.spark.rdd.RDD[Array[String]] = ...
From what I gather the way to do that is by using sortByKey, which in turn only works on pairs. So I mapped it so I'd have a pair consisting of column7 (string values) and the full original rdd (array of strings)
rdd2 = rdd.map(c => (c(7),c))
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
I then apply sortByKey, still no problem...
rdd3 = rdd2.sortByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
But now how do I split off, collect and save that sorted original rdd from rdd3 (Array[String])? Whenever I try a split on rdd3 it gives me an error:
val rdd4 = rdd3.map(_.split(',')(2))
<console>:33: error: value split is not a member of (String, Array[String])
What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?
what you did with rdd2 = rdd.map(c => (c(7),c)) is to map it to a tuple.
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])]
exactly as it says :).
now if you want to split the record you need to get it from this tuple.
you can map again, taking only the second part of the tuple (which is the array of Array[String]...) like so : rdd3.map(_._2)
but i would strongly suggest to use try rdd.sortBy(_(7)) or something of this sort. this way you do not need to bother yourself with tuple and such.
if you want to sort the rdd using the 7th string in the array, you can just do it directly by
rdd.sortBy(_(6)) // array starts at 0 not 1
or
rdd.sortBy(arr => arr(6))
That will save you all the hassle of doing multiple transformations. The reason why rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is because that's not how you access an element inside an Array. To access the 7th element of an array, say arr, you should do arr(6).
To test this, i did the following:
val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))
// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))
Here's the result:
Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd"))
just do this:
val rdd4 = rdd3.map(_._2)
I thought you don't familiar with Scala,
So, below should help you understand more,
rdd3.map(kv => {
println(kv._1) // This represent String
println(kv._2) // This represent Array[String]
})

Parse and concatenate two columns

I am trying to parse and concatenate two columns at the same time using the following expression:
val part : RDD[(String)] = sc.textFile("hdfs://xxx:8020/user/sample_head.csv")
.map{line => val row = line split ','
(row(1), row(2)).toString}
which returns something like:
Array((AAA,111), (BBB,222),(CCC,333))
But how could I directly get:
Array(AAA, 111 , BBB, 222, CCC, 333)
Your toString() on a tuple really doesn't make much sense to me. Can you explain why do you want to create strings from tuples and then split them again later?
If you are willing to map each row into a list of elements instead of a stringified tuple of elements, you could rewrite
(row(1), row(2)).toString
to
List(row(1), row(2))
and simply flatten the resulting list:
val list = List("0,aaa,111", "1,bbb,222", "2,ccc,333")
val tuples = list.map{ line =>
val row = line split ','
List(row(1), row(2))}
val flattenedTuples = tuples.flatten
println(flattenedTuples) // prints List(aaa, 111, bbb, 222, ccc, 333)
Note that what you are trying to achieve involves flattening and can be done using flatMap, but not using just map. You need to either flatMap directly, or do map followed by flatten like I showed you (I honestly don't remember if Spark supports flatMap). Also, as you can see I used a List as a more idiomatic Scala data structure, but it's easily convertible to Array and vice versa.

In Scala, how to get a slice of a list from nth element to the end of the list without knowing the length?

I'm looking for an elegant way to get a slice of a list from element n onwards without having to specify the length of the list. Lets say we have a multiline string which I split into lines and then want to get a list of all lines from line 3 onwards:
string.split("\n").slice(3,X) // But I don't know what X is...
What I'm really interested in here is whether there's a way to get hold of a reference of the list returned by the split call so that its length can be substituted into X at the time of the slice call, kind of like a fancy _ (in which case it would read as slice(3,_.length)) ? In python one doesn't need to specify the last element of the slice.
Of course I could solve this by using a temp variable after the split, or creating a helper function with a nice syntax, but I'm just curious.
Just drop first n elements you don't need:
List(1,2,3,4).drop(2)
res0: List[Int] = List(3, 4)
or in your case:
string.split("\n").drop(2)
There is also paired method .take(n) that do the opposite thing, you can think of it as .slice(0,n).
In case you need both parts, use .splitAt:
val (left, right) = List(1,2,3,4).splitAt(2)
left: List[Int] = List(1, 2)
right: List[Int] = List(3, 4)
The right answer is takeRight(n):
"communism is sharing => resource saver".takeRight(3)
//> res0: String = ver
You can use scala's list method 'takeRight',This will not throw exception when List's length is not enough, Like this:
val t = List(1,2,3,4,5);
t.takeRight(3);
res1: List[Int] = List(3,4,5)
If list is not longer than you want take, this will not throw Exception:
val t = List(4,5);
t.takeRight(3);
res1: List[Int] = List(4,5)
get last 2 elements:
List(1,2,3,4,5).reverseIterator.take(2)

Get item in the list in Scala?

How in the world do you get just an element at index i from the List in scala?
I tried get(i), and [i] - nothing works. Googling only returns how to "find" an element in the list. But I already know the index of the element!
Here is the code that does not compile:
def buildTree(data: List[Data2D]):Node ={
if(data.length == 1){
var point:Data2D = data[0] //Nope - does not work
}
return null
}
Looking at the List api does not help, as my eyes just cross.
Use parentheses:
data(2)
But you don't really want to do that with lists very often, since linked lists take time to traverse. If you want to index into a collection, use Vector (immutable) or ArrayBuffer (mutable) or possibly Array (which is just a Java array, except again you index into it with (i) instead of [i]).
Safer is to use lift so you can extract the value if it exists and fail gracefully if it does not.
data.lift(2)
This will return None if the list isn't long enough to provide that element, and Some(value) if it is.
scala> val l = List("a", "b", "c")
scala> l.lift(1)
Some("b")
scala> l.lift(5)
None
Whenever you're performing an operation that may fail in this way it's great to use an Option and get the type system to help make sure you are handling the case where the element doesn't exist.
Explanation:
This works because List's apply (which sugars to just parentheses, e.g. l(index)) is like a partial function that is defined wherever the list has an element. The List.lift method turns the partial apply function (a function that is only defined for some inputs) into a normal function (defined for any input) by basically wrapping the result in an Option.
Why parentheses?
Here is the quote from the book programming in scala.
Another important idea illustrated by this example will give you insight into why arrays are accessed with parentheses in Scala. Scala has fewer special cases than Java. Arrays are simply instances of classes like any other class in Scala. When you apply parentheses surrounding one or more values to a variable, Scala will transform the code into an invocation of a method named apply on that variable. So greetStrings(i) gets transformed into greetStrings.apply(i). Thus accessing an element of an array in Scala is simply a method call like any other. This principle is not restricted to arrays: any application of an object to some arguments in parentheses will be transformed to an apply method call. Of course this will compile only if that type of object actually defines an apply method. So it's not a special case; it's a general rule.
Here are a few examples how to pull certain element (first elem in this case) using functional programming style.
// Create a multdimension Array
scala> val a = Array.ofDim[String](2, 3)
a: Array[Array[String]] = Array(Array(null, null, null), Array(null, null, null))
scala> a(0) = Array("1","2","3")
scala> a(1) = Array("4", "5", "6")
scala> a
Array[Array[String]] = Array(Array(1, 2, 3), Array(4, 5, 6))
// 1. paratheses
scala> a.map(_(0))
Array[String] = Array(1, 4)
// 2. apply
scala> a.map(_.apply(0))
Array[String] = Array(1, 4)
// 3. function literal
scala> a.map(a => a(0))
Array[String] = Array(1, 4)
// 4. lift
scala> a.map(_.lift(0))
Array[Option[String]] = Array(Some(1), Some(4))
// 5. head or last
scala> a.map(_.head)
Array[String] = Array(1, 4)
Please use parentheses () to access the list of elements, as shown below.
list_name(index)