Selecting multiple arbitrary columns from Scala array using map() - scala

I'm new to Scala (and Spark). I'm trying to read in a csv file and extract multiple arbitrary columns from the data. The following function does this, but with hard-coded column indices:
def readCSV(filename: String, sc: SparkContext): RDD[String] = {
val input = sc.textFile(filename).map(line => line.split(","))
val out = input.map(csv => csv(2)+","+csv(4)+","+csv(15))
return out
}
Is there a way to use map with an arbitrary number of column indices passed to the function in an array?

If you have a sequence of indices, you can map over it and return the values:
scala> val m = List(List(1,2,3), List(4,5,6))
m: List[List[Int]] = List(List(1, 2, 3), List(4, 5, 6))
scala> val indices = List(0,2)
indices: List[Int] = List(0, 2)
// For each inner sequence, get the relevant values
// indices.map(inner) is the same as indices.map(i => inner(i))
scala> m.map(inner => indices.map(inner))
res1: List[List[Int]] = List(List(1, 3), List(4, 6))
// If you want to join all of them use .mkString
scala> m.map(inner => indices.map(inner).mkString(","))
res2: List[String] = List(1,3, 4,6) // that's actually a List containing 2 Strings: "1,3" and "4,6"
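The same idea carries over to the CSV lines from the question. Below is a minimal sketch using plain Scala collections rather than Spark; selectColumns is a hypothetical helper name, not something from the original post, and it assumes every line has at least as many fields as the largest index.

```scala
// Hypothetical helper (the name is an assumption): pick the given column
// indices out of one comma-separated line and join them back together.
def selectColumns(line: String, indices: Seq[Int]): String = {
  // limit -1 keeps trailing empty fields instead of dropping them
  val fields = line.split(",", -1)
  indices.map(i => fields(i)).mkString(",")
}

val lines = List("a,b,c,d,e", "f,g,h,i,j")
val selected = lines.map(line => selectColumns(line, Seq(0, 2, 4)))
// selected: List("a,c,e", "f,h,j")
```

With Spark, the same function could presumably be passed to map on the RDD returned by sc.textFile, with the index sequence taken as an extra parameter of readCSV.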

Related

How to calculate length of string in a tuple in scala

Given a list of tuples, where the 1st element of the tuple is an integer and the second element is a string,
scala> val tuple2 : List[(Int,String)] = List((1,"apple"),(2,"ball"),(3,"cat"),(4,"doll"),(5,"eggs"))
tuple2: List[(Int, String)] = List((1,apple), (2,ball), (3,cat), (4,doll), (5,eggs))
I want to print the numbers where the corresponding string length is 4.
Can this be done in one line ?
You need .collect, which combines filter and map in one step.
Given your input:
scala> val input : List[(Int,String)] = List((1,"apple"),(2,"ball"),(3,"cat"),(4,"doll"),(5,"eggs"))
input: List[(Int, String)] = List((1,apple), (2,ball), (3,cat), (4,doll), (5,eggs))
keep the numbers whose string has length 4:
scala> input.collect { case(number, string) if string.length == 4 => number}
res2: List[Int] = List(2, 4, 5)
An alternative solution using filter + map:
scala> input.filter { case(number, string) => string.length == 4 }
.map { case (number, string) => number}
res4: List[Int] = List(2, 4, 5)
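To make the "collect is filter + map" claim concrete, here is a small sketch that names the partial function and applies it both ways. The value names lengthFour, viaCollect and viaFilterMap are made up for illustration.

```scala
val input: List[(Int, String)] =
  List((1, "apple"), (2, "ball"), (3, "cat"), (4, "doll"), (5, "eggs"))

// A partial function that is only defined for tuples whose string has length 4.
val lengthFour: PartialFunction[(Int, String), Int] = {
  case (number, string) if string.length == 4 => number
}

// collect applies the partial function wherever it is defined ...
val viaCollect = input.collect(lengthFour)

// ... which matches filtering on isDefinedAt and then mapping.
val viaFilterMap = input.filter(lengthFour.isDefinedAt).map(lengthFour)
```

Both values should hold the same list, so which form to use is mostly a matter of taste; collect traverses the list once.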
You can filter and print as below:
tuple2.filter(_._2.length == 4).foreach(x => println(x._1))
You should get the output:
2
4
5
I like @prayagupd's answer using collect, but foldLeft is one of my favourite functions in Scala! You can use foldLeft:
scala> val input : List[(Int,String)] = List((1,"apple"),(2,"ball"),(3,"cat"),(4,"doll"),(5,"eggs"))
input: List[(Int, String)] = List((1,apple), (2,ball), (3,cat), (4,doll), (5,eggs))
scala> input.foldLeft(List.empty[Int]){case (acc, (n,str)) => if(str.length ==4) acc :+ n else acc}
res3: List[Int] = List(2, 4, 5)
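One caveat with the fold above: appending to an immutable List with :+ copies the list each time, so on long inputs it may be preferable to prepend and reverse once at the end. A sketch of that variant (the name viaPrepend is made up here):

```scala
val input: List[(Int, String)] =
  List((1, "apple"), (2, "ball"), (3, "cat"), (4, "doll"), (5, "eggs"))

// Prepend with :: (constant time) while folding, then reverse once
// to restore the original order.
val viaPrepend = input.foldLeft(List.empty[Int]) {
  case (acc, (n, str)) => if (str.length == 4) n :: acc else acc
}.reverse
```

For a five-element list the difference is irrelevant; this only matters as the input grows.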
Using a for comprehension as follows,
for ((i,s) <- tuple2 if s.size == 4) yield i
which for the example above delivers
List(2, 4, 5)
Note we pattern match and extract the elements in each tuple and filter by string size. To print a list consider for instance aList.foreach(println).
This will do:
tuple2.filter(_._2.size==4).map(_._1)
In Scala REPL:
scala> val tuple2 : List[(Int,String)] = List((1,"apple"),(2,"ball"),(3,"cat"),(4,"doll"),(5,"eggs"))
tuple2: List[(Int, String)] = List((1,apple), (2,ball), (3,cat), (4,doll), (5,eggs))
scala> tuple2.filter(_._2.size==4).map(_._1)
res261: List[Int] = List(2, 4, 5)

Dropping multiple columns from Spark dataframe by Iterating through the columns from a Scala List of Column names

I have a DataFrame with around 400 columns, and I want to drop 100 of them.
I have created a Scala List of the 100 column names.
I then want to loop over the list and drop one column in each iteration.
Below is the code.
final val dropList: List[String] = List("Col1","Col2",...."Col100")
def drpColsfunc(inputDF: DataFrame): DataFrame = {
for (i <- 0 to dropList.length - 1) {
val returnDF = inputDF.drop(dropList(i))
}
return returnDF
}
val test_df = drpColsfunc(input_dataframe)
test_df.show(5)
If you want to do nothing more complex than drop several named columns, as opposed to selecting them by a particular condition, you can simply do the following:
df.drop("colA", "colB", "colC")
Answer:
val colsToRemove = Seq("colA", "colB", "colC") // etc.
val filteredDF = df.select(df.columns
  .filter(colName => !colsToRemove.contains(colName))
  .map(colName => new Column(colName)): _*)
This should work fine:
// Assuming dropList: List[String] and df: DataFrame are already defined
val test_df = df.drop(dropList: _*)
You can just do,
def dropColumns(inputDF: DataFrame, dropList: List[String]): DataFrame =
dropList.foldLeft(inputDF)((df, col) => df.drop(col))
This returns the DataFrame without the columns listed in dropList.
As an example (of what's happening behind the scene), let me put it this way.
scala> val list = List(0, 1, 2, 3, 4, 5, 6, 7)
list: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7)
scala> val removeThese = List(0, 2, 3)
removeThese: List[Int] = List(0, 2, 3)
scala> removeThese.foldLeft(list)((l, r) => l.filterNot(_ == r))
res2: List[Int] = List(1, 4, 5, 6, 7)
The returned list (in your case, the DataFrame) is the result of the last filter. After each step of the fold, the latest intermediate result is passed as the first argument to the next call of the combining function (_, _) => _.
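To see the intermediate results that foldLeft threads through, scanLeft can be used with the same arguments; it keeps every step instead of only the last one. This is just an illustration on the list example above, with steps and finalResult as made-up names.

```scala
val list = List(0, 1, 2, 3, 4, 5, 6, 7)
val removeThese = List(0, 2, 3)

// scanLeft returns the start value followed by the result after each step.
val steps = removeThese.scanLeft(list)((l, r) => l.filterNot(_ == r))

// foldLeft returns only the final accumulated value.
val finalResult = removeThese.foldLeft(list)((l, r) => l.filterNot(_ == r))
```

Here steps has four entries: the untouched list, then the list after removing 0, after removing 2, and after removing 3. The last entry is the same value foldLeft returns.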
You can use the drop operation to drop multiple columns. If the names of the columns to drop are in a list, pass it with : _* after the list variable, and all the columns in that list will be dropped.
Scala:
val df = Seq(("One","Two","Three"),("One","Two","Three"),("One","Two","Three")).toDF("Name","Name1","Name2")
val columnstoDrop = List("Name","Name1")
val df1 = df.drop(columnstoDrop:_*)
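The : _* ascription is ordinary Scala varargs expansion and is not specific to Spark. A minimal sketch without Spark, where dropAll is a hypothetical stand-in for a varargs method and works on a plain sequence of column names:

```scala
// Hypothetical stand-in for a varargs method; the name and the Seq-based
// "columns" are assumptions made for illustration only.
def dropAll(columns: Seq[String], toDrop: String*): Seq[String] =
  columns.filterNot(c => toDrop.contains(c))

val columns = Seq("Name", "Name1", "Name2")
val columnsToDrop = List("Name", "Name1")

// Passing the names one by one and expanding a list are equivalent calls.
val explicit = dropAll(columns, "Name", "Name1")
val expanded = dropAll(columns, columnsToDrop: _*)
```

Both calls leave only "Name2", which mirrors what the DataFrame example above is expected to do.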
Python:
In Python you can use the * operator to do the same thing.
data = [("One", "Two","Three"), ("One", "Two","Three"), ("One", "Two","Three")]
columns = ["Name","Name1","Name2"]
df = spark.sparkContext.parallelize(data).toDF(columns)
columnstoDrop = ["Name","Name1"]
df1 = df.drop(*columnstoDrop)
Now df1 is the DataFrame with only one column, i.e. Name2.

Spark RDD tuple transformation

I'm trying to transform an RDD of tuples of Strings from this format:
(("abc","xyz","123","2016-02-26T18:31:56"),"15") TO
(("abc","xyz","123"),"2016-02-26T18:31:56","15")
Basically separating out the timestamp string as a separate tuple element. I tried the following, but it's still not clean and correct.
val result = rdd.map(r => (r._1.toString.split(",").toVector.dropRight(1).toString, r._1.toString.split(",").toList.last.toString, r._2))
However, it results in
(Vector(("abc", "xyz", "123"),"2016-02-26T18:31:56"),"15")
The expected output I'm looking for is
(("abc", "xyz", "123"),"2016-02-26T18:31:56","15")
This way I can access the elements using r._1, r._2 (the timestamp string) and r._3 in a separate map operation.
Any hints/pointers will be greatly appreciated.
Vector.toString will include the String 'Vector' in its result. Instead, use Vector.mkString(",").
Example:
scala> val xs = Vector(1,2,3)
xs: scala.collection.immutable.Vector[Int] = Vector(1, 2, 3)
scala> xs.toString
res25: String = Vector(1, 2, 3)
scala> xs.mkString
res26: String = 123
scala> xs.mkString(",")
res27: String = 1,2,3
However, if you want to be able to access (abc,xyz,123) as a Tuple and not as a string, you could also do the following:
val res = rdd.map{
case ((a:String,b:String,c:String,ts:String),d:String) => ((a,b,c),ts,d)
}
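The pattern match can be tried out without a cluster by running it on a plain List, since map with a case block works the same way on Scala collections. A small sketch, with records and reshaped as made-up names:

```scala
// One sample record in the shape from the question.
val records = List((("abc", "xyz", "123", "2016-02-26T18:31:56"), "15"))

// Destructure the nested tuple and rebuild it with the timestamp pulled out.
val reshaped = records.map {
  case ((a, b, c, ts), d) => ((a, b, c), ts, d)
}

val first = reshaped.head
// first._1 is the 3-tuple, first._2 the timestamp, first._3 the trailing value
```

The type annotations in the case pattern are optional when the element type is already known to the compiler, as it is here.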

Why Spark doesn't allow map-side combining with array keys?

I'm using Spark 1.3.1 and I'm curious why Spark doesn't allow using array keys on map-side combining.
Piece of combineByKey function:
if (keyClass.isArray) {
if (mapSideCombine) {
throw new SparkException("Cannot use map-side combining with array keys.")
}
}
Basically, for the same reason the default partitioner cannot partition array keys.
A Scala Array is just a wrapper around a Java array, and its hashCode doesn't depend on its content:
scala> val x = Array(1, 2, 3)
x: Array[Int] = Array(1, 2, 3)
scala> val h = x.hashCode
h: Int = 630226932
scala> x(0) = -1
scala> x.hashCode() == h
res3: Boolean = true
This means that two arrays with exactly the same content are not equal:
scala> x
res4: Array[Int] = Array(-1, 2, 3)
scala> val y = Array(-1, 2, 3)
y: Array[Int] = Array(-1, 2, 3)
scala> y == x
res5: Boolean = false
As a result, Arrays cannot be used as meaningful keys. If you're not convinced, just check what happens when you use an Array as a key for a Scala Map:
scala> Map(Array(1) -> 1, Array(1) -> 2)
res7: scala.collection.immutable.Map[Array[Int],Int] = Map(Array(1) -> 1, Array(1) -> 2)
If you want to use a collection as a key, you should use an immutable data structure like a Vector or a List.
scala> Map(Array(1).toVector -> 1, Array(1).toVector -> 2)
res15: scala.collection.immutable.Map[Vector[Int],Int] = Map(Vector(1) -> 2)
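The effect on grouping can be sketched with plain Scala collections, which is roughly what any key-based combine has to deal with. The names below (pairs, groupsByArray, groupsByVector) are made up for illustration, and this is not Spark's actual combine logic.

```scala
val x = Array(-1, 2, 3)
val y = Array(-1, 2, 3)

val referenceEqual = x == y                 // compares references, not content
val contentEqual = x.sameElements(y)        // compares element by element
val vectorEqual = x.toVector == y.toVector  // Vector has structural equality

// Grouping by key: equal-looking arrays end up in different groups,
// while their Vector counterparts collapse into one.
val pairs = List(Array(1) -> 1, Array(1) -> 2)
val groupsByArray = pairs.groupBy(_._1).size
val groupsByVector = pairs.groupBy(_._1.toVector).size
```

With array keys the two pairs stay in two separate groups, so their values would never be combined; converting the key with toVector gives a single group.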
See also:
SI-1607
How does HashPartitioner work?
A list as a key for PySpark's reduceByKey

Specify subset of elements in Spark RDD (Scala)

My dataset is an RDD[Array[String]] with more than 140 columns. How can I select a subset of columns without hard-coding the column numbers (.map(x => (x(0),x(3),x(6)...))?
This is what I've tried so far (with success):
val peopleTups = people.map(x => x.split(",")).map(i => (i(0),i(1)))
However, I need more than a few columns, and would like to avoid hard-coding them.
This is what I've tried so far (that I think would be better, but has failed):
// Attempt 1
val colIndices = List(0,3,6,10,13)
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Error output from attempt 1:
<console>:28: error: type mismatch;
found : List[Int]
required: Int
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Attempt 2
colIndices map peopleTups.lift
// Attempt 3
colIndices map peopleTups
// Attempt 4
colIndices.map(index => peopleTups.apply(index))
I found this question and tried it, but because I'm looking at an RDD instead of an array, it didn't work: How can I select a non-sequential subset elements from an array using Scala and Spark?
You should map over the RDD instead of the indices.
val list = List.fill(2)(Array.range(1, 6))
// List(Array(1, 2, 3, 4, 5), Array(1, 2, 3, 4, 5))
val rdd = sc.parallelize(list) // RDD[Array[Int]]
val indices = Array(0, 2, 3)
val selectedColumns = rdd.map(array => indices.map(array)) // RDD[Array[Int]]
selectedColumns.collect()
// Array[Array[Int]] = Array(Array(1, 3, 4), Array(1, 3, 4))
What about this?
val data = sc.parallelize(List("a,b,c,d,e", "f,g,h,i,j"))
val indices = List(0,3,4)
data.map(_.split(",")).map(ss => indices.map(ss(_))).collect
This should give
res1: Array[List[String]] = Array(List(a, d, e), List(f, i, j))
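Both answers assume every row has all the requested columns; a short row would throw an ArrayIndexOutOfBoundsException. If that is a concern, lift (which the question already experimented with) can turn a missing column into None. A sketch on a single row, where selectSafe is a hypothetical helper name:

```scala
// Hypothetical helper (the name is an assumption): select columns by index,
// yielding None for indices that fall outside the row.
def selectSafe(fields: Array[String], indices: Seq[Int]): Seq[Option[String]] = {
  // lift turns the indexed lookup into a total function Int => Option[String]
  val lookup = fields.toVector.lift
  indices.map(lookup)
}

val row = "a,b,c".split(",")
val picked = selectSafe(row, Seq(0, 2, 7))
// index 7 does not exist in this row, so the last entry is None
```

On an RDD[Array[String]] the helper could presumably be applied per row inside map, in the same way as indices.map(array) above.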