Specify subset of elements in Spark RDD (Scala)

My dataset is an RDD[Array[String]] with more than 140 columns. How can I select a subset of columns without hard-coding the column numbers (e.g. .map(x => (x(0), x(3), x(6), ...)))?
This is what I've tried so far (with success):
val peopleTups = people.map(x => x.split(",")).map(i => (i(0),i(1)))
However, I need more than a few columns, and would like to avoid hard-coding them.
This is what I've tried so far (that I think would be better, but has failed):
// Attempt 1
val colIndices = List(0, 3, 6, 10, 13)
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Error output from attempt 1:
<console>:28: error: type mismatch;
found : List[Int]
required: Int
val peopleTups = people.map(x => x.split(",")).map(i => i(colIndices))
// Attempt 2
colIndices map peopleTups.lift
// Attempt 3
colIndices map peopleTups
// Attempt 4
colIndices.map(index => peopleTups.apply(index))
I found this question and tried it, but because I'm looking at an RDD instead of an array, it didn't work: How can I select a non-sequential subset of elements from an array using Scala and Spark?

You should map over the RDD instead of the indices.
val list = List.fill(2)(Array.range(1, 6))
// List(Array(1, 2, 3, 4, 5), Array(1, 2, 3, 4, 5))
val rdd = sc.parallelize(list) // RDD[Array[Int]]
val indices = Array(0, 2, 3)
val selectedColumns = rdd.map(array => indices.map(array)) // RDD[Array[Int]]
selectedColumns.collect()
// Array[Array[Int]] = Array(Array(1, 3, 4), Array(1, 3, 4))
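Applied back to the question's people RDD of CSV lines, the same pattern would look like this (a minimal sketch, reusing the question's names):
val colIndices = Array(0, 3, 6, 10, 13)
// split each line into fields, then keep only the wanted columns
val peopleTups = people.map(_.split(",")).map(fields => colIndices.map(fields(_)))
// peopleTups: RDD[Array[String]]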

What about this?
val data = sc.parallelize(List("a,b,c,d,e", "f,g,h,i,j"))
val indices = List(0,3,4)
data.map(_.split(",")).map(ss => indices.map(ss(_))).collect
This should give
res1: Array[List[String]] = Array(List(a, d, e), List(f, i, j))

Related

Sum of values based on key in Scala

I am new to Scala. I have a list of integer tuples:
val list = List((1,2,3),(2,3,4),(1,2,3))
val sum = list.groupBy(_._1).mapValues(_.map(_._2)).sum
val sum2 = list.groupBy(_._1).mapValues(_.map(_._3)).sum
How do I extend this to N values? What I tried above is not a good way to do it; how can I sum N values based on the key?
I have also tried:
val sum = list.groupBy(_._1).values.sum // error
val sum = list.groupBy(_._1).mapvalues(_.map(_._2).sum (_._3).sum) // error
It's easier to convert these tuples to List[Int] with shapeless and then work with them. Your tuples are actually more like lists anyway. As a bonus, you don't need to change your code at all for lists of Tuple4, Tuple5, etc.
import shapeless._, syntax.std.tuple._
val list = List((1,2,3),(2,3,4),(1,2,3))
list.map(_.toList)                          // convert tuples to lists
    .groupBy(_.head)                        // group by the first element of each list
    .mapValues(_.map(_.tail).map(_.sum).sum) // sum the elements of all tails
Result is Map(2 -> 7, 1 -> 10).
val sum = list.groupBy(_._1).map(i => (i._1, i._2.map(j => j._1 + j._2 + j._3).sum))
> sum: scala.collection.immutable.Map[Int,Int] = Map(2 -> 9, 1 -> 12)
Since a tuple can't be type-safely converted to a List, you need to add the elements one by one, as in j._1 + j._2 + j._3. Note that this sum includes the key itself, which is why the result differs from the shapeless answer above.
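If you want the key excluded from the sum (matching the Map(2 -> 7, 1 -> 10) result of the other answers), a small variation of the same idea, still hard-coded for Tuple3:
val sumWithoutKey = list.groupBy(_._1).map { case (k, v) => k -> v.map(j => j._2 + j._3).sum }
// sumWithoutKey: Map[Int,Int] = Map(2 -> 7, 1 -> 10)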
Using the first element of the tuple as the key and the remaining elements as the values, you could do something like this:
val list = List((1,2,3),(2,3,4),(1,2,3))
list: List[(Int, Int, Int)] = List((1, 2, 3), (2, 3, 4), (1, 2, 3))
val sum = list.groupBy(_._1).map { case (k, v) => (k -> v.flatMap(_.productIterator.toList.drop(1).map(_.asInstanceOf[Int])).sum) }
sum: Map[Int, Int] = Map(2 -> 7, 1 -> 10)
I know it's a bit dirty to use asInstanceOf[Int], but .productIterator gives you an Iterator[Any].
This will work for any tuple size.

Dropping multiple columns from a Spark dataframe by iterating through a Scala List of column names

I have a dataframe with around 400 columns, and I want to drop 100 of them per my requirements.
So I have created a Scala List of the 100 column names.
I then want to iterate through a for loop and actually drop one column in each iteration.
Below is the code.
final val dropList: List[String] = List("Col1", "Col2", ..., "Col100")

def drpColsfunc(inputDF: DataFrame): DataFrame = {
  for (i <- 0 to dropList.length - 1) {
    val returnDF = inputDF.drop(dropList(i))
  }
  return returnDF
}
val test_df = drpColsfunc(input_dataframe)
test_df.show(5)
If you just want to do nothing more complex than dropping several named columns, as opposed to selecting them by a particular condition, you can simply do the following:
df.drop("colA", "colB", "colC")
Answer:
import org.apache.spark.sql.Column

val colsToRemove = Seq("colA", "colB", "colC", etc)
val filteredDF = df.select(
  df.columns
    .filter(colName => !colsToRemove.contains(colName))
    .map(colName => new Column(colName)): _*)
This should work fine:
val dropList: List[String]
val df: DataFrame
val test_df = df.drop(dropList: _*)
You can just do:
def dropColumns(inputDF: DataFrame, dropList: List[String]): DataFrame =
  dropList.foldLeft(inputDF)((df, col) => df.drop(col))
It will return you the DataFrame without the columns passed in dropList.
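For example, a minimal usage sketch (the toy DataFrame is made up for illustration, and spark is assumed to be your SparkSession):
import spark.implicits._ // for .toDF on a local Seq

val df = Seq((1, "a", true), (2, "b", false)).toDF("id", "label", "flag")
val pruned = dropColumns(df, List("label", "flag"))
pruned.printSchema() // only "id" remains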
As an example (of what's happening behind the scenes), let me put it this way.
scala> val list = List(0, 1, 2, 3, 4, 5, 6, 7)
list: List[Int] = List(0, 1, 2, 3, 4, 5, 6, 7)
scala> val removeThese = List(0, 2, 3)
removeThese: List[Int] = List(0, 2, 3)
scala> removeThese.foldLeft(list)((l, r) => l.filterNot(_ == r))
res2: List[Int] = List(1, 4, 5, 6, 7)
The returned list (in our case, map it to your DataFrame) is the latest filtered one. After each step, the latest result is passed to the next application of the folding function.
You can use the drop operation to drop multiple columns. If you have the column names you need to drop in a list, you can pass them using : _* after the list variable, and drop will remove all the columns in that list.
Scala:
val df = Seq(("One","Two","Three"),("One","Two","Three"),("One","Two","Three")).toDF("Name","Name1","Name2")
val columnstoDrop = List("Name","Name1")
val df1 = df.drop(columnstoDrop:_*)
Python:
In Python, you can use the * operator to do the same thing.
data = [("One", "Two","Three"), ("One", "Two","Three"), ("One", "Two","Three")]
columns = ["Name","Name1","Name2"]
df = spark.sparkContext.parallelize(data).toDF(columns)
columnstoDrop = ["Name","Name1"]
df1 = df.drop(*columnstoDrop)
Now df1 is the dataframe with only one column, i.e. Name2.

Replacing the values of an RDD with another

I have two data sets like the ones below. Each data set has comma-separated numbers on each line.
Dataset 1
1,2,0,8,0
2,0,9,0,3
Dataset 2
7,5,4,6,3
4,9,2,1,8
I have to replace the zeroes in the first data set with the corresponding values from data set 2.
So the result would look like this:
1,2,4,8,3
2,9,9,1,3
I replaced the values with the code below.
val rdd1 = sc.textFile(dataset1).flatMap(l => l.split(","))
val rdd2 = sc.textFile(dataset2).flatMap(l => l.split(","))
val result = rdd1.zip(rdd2).map( x => if(x._1 == "0") x._2 else x._1)
The output I got is an RDD[String], but I need an RDD[Array[String]], as that format is more suitable for my further transformations.
If you want an RDD[Array[String]], where each array corresponds to a line, don't flatMap the values after splitting; just map them.
scala> val rdd1 = sc.parallelize(List("1,2,0,8,0", "2,0,9,0,3")).map(l => l.split(","))
rdd1: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[1] at map at <console>:27
scala> val rdd2 = sc.parallelize(List("7,5,4,6,3", "4,9,2,1,8")).map(l => l.split(","))
rdd2: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[3] at map at <console>:27
scala> val result = rdd1.zip(rdd2).map{case(arr1, arr2) => arr1.zip(arr2).map{case(v1, v2) => if(v1 == "0") v2 else v1}}
result: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[5] at map at <console>:31
scala> result.collect
res0: Array[Array[String]] = Array(Array(1, 2, 4, 8, 3), Array(2, 9, 9, 1, 3))
Or, maybe less verbosely:
val result = rdd1.zip(rdd2).map(t => t._1.zip(t._2).map(x => if(x._1 == "0") x._2 else x._1))
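Plugged back into the question's original code, a sketch under the same assumptions (dataset1 and dataset2 are the file paths from the question; the only change is map instead of flatMap, and note that zip requires both RDDs to have the same number of partitions and the same number of elements per partition):
val rdd1 = sc.textFile(dataset1).map(_.split(","))
val rdd2 = sc.textFile(dataset2).map(_.split(","))
val result = rdd1.zip(rdd2).map { case (a, b) =>
  a.zip(b).map { case (v1, v2) => if (v1 == "0") v2 else v1 }
}
// result: RDD[Array[String]]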

Selecting multiple arbitrary columns from Scala array using map()

I'm new to Scala (and Spark). I'm trying to read in a csv file and extract multiple arbitrary columns from the data. The following function does this, but with hard-coded column indices:
def readCSV(filename: String, sc: SparkContext): RDD[String] = {
  val input = sc.textFile(filename).map(line => line.split(","))
  val out = input.map(csv => csv(2) + "," + csv(4) + "," + csv(15))
  return out
}
Is there a way to use map with an arbitrary number of column indices passed to the function in an array?
If you have a sequence of indices, you could map over it and return the values:
scala> val m = List(List(1,2,3), List(4,5,6))
m: List[List[Int]] = List(List(1, 2, 3), List(4, 5, 6))
scala> val indices = List(0,2)
indices: List[Int] = List(0, 2)
// For each inner sequence, get the relevant values
// indices.map(inner) is the same as indices.map(i => inner(i))
scala> m.map(inner => indices.map(inner))
res1: List[List[Int]] = List(List(1, 3), List(4, 6))
// If you want to join all of them use .mkString
scala> m.map(inner => indices.map(inner).mkString(","))
res2: List[String] = List(1,3, 4,6) // that's actually a List containing 2 Strings
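Applied to the original readCSV function, a sketch that passes the indices in as a parameter (the indices parameter and the example file name are additions to the question's code):
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def readCSV(filename: String, indices: Seq[Int], sc: SparkContext): RDD[String] = {
  sc.textFile(filename)
    .map(_.split(","))                                // split each line into fields
    .map(fields => indices.map(fields(_)).mkString(",")) // keep only the wanted fields
}

// readCSV("people.csv", Seq(2, 4, 15), sc) reproduces the hard-coded version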

Initializing a 2D (multi-dimensional) array in Scala

It's easy to initialize a 2D array (or, in fact, any multidimensional array) in Java by writing something like this:
int[][] x = new int[][] {
    { 3, 5, 7, },
    { 0, 4, 9, },
    { 1, 8, 6, },
};
It's easy to read, it resembles a 2D matrix, etc, etc.
But how do I do that in Scala?
The best I could come up with looks, well, much less concise:
val x = Array(
  Array(3, 5, 7),
  Array(0, 4, 9),
  Array(1, 8, 6)
)
The problems I see here:
It repeats "Array" over and over again (like there could be anything else besides Array)
It requires omitting the trailing , in every Array invocation
If I screw up and insert something besides Array(...) in the middle of the array, the compiler accepts it, but the type of x silently becomes Array[Any] instead of Array[Array[Int]]:
val x = Array(
  Array(3, 5, 7),
  Array(0, 4), 9, // <= OK with compiler, silently ruins x
  Array(1, 8, 6)
)
There is a guard against this: specifying the type directly. But it looks like even more overhead than in Java:
val x: Array[Array[Int]] = Array(
  Array(3, 5, 7),
  Array(0, 4), 9, // <= this one would trigger a compiler error
  Array(1, 8, 6)
)
This last example spells out Array three times as often as the Java version has to say int[][].
Is there any clear way around this?
Personally I'd suck it up and type out (or cut and paste) "Array" a few times for clarity's sake. Include the type annotation for safety, of course. But if you're really running out of e-ink, a quick and easy hack would be simply to provide an alias for Array, for example:
val > = Array
val x: Array[Array[Int]] = >(
  >(3, 5, 7),
  >(0, 4, 9),
  >(1, 8, 6)
)
You could also provide a type alias for Array if you want to shorten the annotation:
type >[T] = Array[T]
val x: >[>[Int]] = ...
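Since Scala keeps term and type names in separate namespaces, both aliases can coexist; a combined sketch:
val > = Array
type >[T] = Array[T]

val x: >[>[Int]] = >(
  >(3, 5, 7),
  >(0, 4, 9),
  >(1, 8, 6)
)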
I suggest using Scala 2.10 and macros:
object MatrixMacro {

  import language.experimental.macros
  import scala.reflect.macros.Context
  import scala.util.Try

  implicit class MatrixContext(sc: StringContext) {
    def matrix(): Array[Array[Int]] = macro matrixImpl
  }

  def matrixImpl(c: Context)(): c.Expr[Array[Array[Int]]] = {
    import c.universe.{ Try => _, _ }

    val matrix = Try {
      c.prefix.tree match {
        case Apply(_, List(Apply(_, List(Literal(Constant(raw: String)))))) =>
          def toArrayAST(c: List[TermTree]) =
            Apply(Select(Select(Ident("scala"), newTermName("Array")), newTermName("apply")), c)

          val matrix = raw split "\n" map (_.trim) filter (_.nonEmpty) map {
            _ split "," map (_.trim.toInt)
          }

          if (matrix.map(_.length).distinct.size != 1)
            c.abort(c.enclosingPosition, "rows of matrix do not have the same length")

          val matrixAST = matrix map (_ map (i => Literal(Constant(i)))) map (i => toArrayAST(i.toList))
          toArrayAST(matrixAST.toList)
      }
    }

    c.Expr(matrix getOrElse c.abort(c.enclosingPosition, "not a matrix of Int"))
  }
}
Usage with:
scala> import MatrixMacro._
import MatrixMacro._
scala> matrix"1"
res86: Array[Array[Int]] = Array(Array(1))
scala> matrix"1,2,3"
res87: Array[Array[Int]] = Array(Array(1, 2, 3))
scala> matrix"""
| 1, 2, 3
| 4, 5, 6
| 7, 8, 9
| """
res88: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9))
scala> matrix"""
| 1, 2
| 1
| """
<console>:57: error: rows of matrix do not have the same length
matrix"""
^
scala> matrix"a"
<console>:57: error: not a matrix of Int
matrix"a"
^
I don't think you will get it shorter. ;)
If using a mere List of Lists (which in itself cannot guarantee that every sub-list is of the same size) is not a problem for you, and you are only concerned with easy syntax and avoiding errors at creation time, Scala has many ways to create nice syntactic constructs.
One such possibility would be a simple helper:
object Matrix {
  def apply[X](elements: Tuple3[X, X, X]*): List[List[X]] = {
    elements.toList.map(_.productIterator.toList.asInstanceOf[List[X]])
  }
  // Here you might add other overloads for Tuple4, Tuple5, etc. if you need "matrixes" of those sizes
}
val x = Matrix(
  (3, 5, 7),
  (0, 4, 9),
  (1, 8, 6)
)
About your concerns:
It repeats "List" over and over again (like there could be anything else besides List)
Not the case here.
It requires omitting the trailing , in every List invocation
Unfortunately that is still true here; there's not much you can do given Scala's syntactic rules.
If I screw up and insert something besides List(...) in the middle of the array, the compiler accepts it, but the type of x silently becomes List[Any] instead of List[List[Int]]:
val x = List(
  List(3, 5, 7),
  List(0, 4), 9, // <= OK with compiler, silently ruins x
  List(1, 8, 6)
)
The equivalent code now fails to compile:
scala> val x = Matrix(
| (3, 5, 7),
| (0, 4), 9,
| (1, 8, 6)
| )
<console>:10: error: type mismatch;
found : (Int, Int)
required: (?, ?, ?)
(0, 4), 9,
And finally, if you want to explicitly specify the type of the elements (say, to protect against inadvertently mixing Ints and Doubles), you only have to specify Matrix[Int] instead of the ugly List[List[Int]]:
val x = Matrix[Int](
  (3, 5, 7),
  (0, 4, 9),
  (1, 8, 6)
)
EDIT: I see that you replaced List with Array in your question. To use arrays, all you have to do is replace List with Array and toList with toArray in my code above.
Since I also dislike this trailing-comma issue (i.e. I cannot simply exchange the last line with any other), I sometimes use either a fluent API or the constructor-syntax trick to get the syntax I like. An example using the constructor syntax would be:
import scala.collection.mutable.ArrayBuffer

trait Matrix {
  // ... and the beast
  private val buffer = ArrayBuffer[Array[Int]]()
  def >(vals: Int*) = buffer += vals.toArray
  def build: Array[Array[Int]] = buffer.toArray
}
Which allows:
// beauty ...
val m = new Matrix {
  >(1, 2, 3)
  >(4, 5, 6)
  >(7, 8, 9)
} build
Unfortunately, this relies on mutable data although it is only used temporarily during the construction. In cases where I want maximal beauty for the construction syntax I would prefer this solution.
In case build is too long/verbose, you might want to replace it with an empty apply function.
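A sketch of that variant (note the parentheses around the new expression, so the trailing () calls apply instead of being parsed as constructor arguments):
import scala.collection.mutable.ArrayBuffer

trait Matrix {
  private val buffer = ArrayBuffer[Array[Int]]()
  def >(vals: Int*) = buffer += vals.toArray
  def apply(): Array[Array[Int]] = buffer.toArray // empty apply in place of build
}

val m = (new Matrix {
  >(1, 2, 3)
  >(4, 5, 6)
})()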
I don't know if this is the easy way, but I've included some code below for converting nested tuples into '2D' arrays.
First, you need some boilerplate for getting the size of the tuples as well as converting the tuples into an Array[Array[Double]]. The series of steps I used was:
Figure out the number of rows and columns in the tuple
Turn the nested tuple into a one row Array
Reshape the array based on the size of the original tuple.
The code for that is:
object Matrix {

  /**
   * Returns the size of a series of nested tuples.
   */
  def productSize(t: Product): (Int, Int) = {
    val a = t.productArity
    val one = t.productElement(0)
    if (one.isInstanceOf[Product]) {
      val b = one.asInstanceOf[Product].productArity
      (a, b)
    } else {
      (1, a)
    }
  }

  /**
   * Flattens out a nested tuple and returns the contents as an iterator.
   */
  def flattenProduct(t: Product): Iterator[Any] = t.productIterator.flatMap {
    case p: Product => flattenProduct(p)
    case x => Iterator(x)
  }

  /**
   * Convert a nested tuple to a flattened row-oriented array.
   * Usage is:
   * {{{
   * val t = ((1, 2, 3), (4, 5, 6))
   * val a = Matrix.toArray(t)
   * // a: Array[Double] = Array(1, 2, 3, 4, 5, 6)
   * }}}
   *
   * @param t The tuple to convert to an array
   */
  def toArray(t: Product): Array[Double] = flattenProduct(t).map(v =>
    v match {
      case c: Char => c.toDouble
      case b: Byte => b.toDouble
      case sh: Short => sh.toDouble
      case i: Int => i.toDouble
      case l: Long => l.toDouble
      case f: Float => f.toDouble
      case d: Double => d
      case s: String => s.toDouble
      case _ => Double.NaN
    }
  ).toArray[Double]

  def rowArrayTo2DArray[@specialized(Int, Long, Float, Double) A: Numeric](m: Int, n: Int,
      rowArray: Array[A]) = {
    require(rowArray.size == m * n)
    val numeric = implicitly[Numeric[A]]
    val newArray = Array.ofDim[Double](m, n)
    for (i <- 0 until m; j <- 0 until n) {
      val idx = i * n + j
      newArray(i)(j) = numeric.toDouble(rowArray(idx))
    }
    newArray
  }

  /**
   * Factory method for turning tuples into 2D arrays.
   */
  def apply(data: Product): Array[Array[Double]] = {
    def size = productSize(data)
    def array = toArray(data)
    rowArrayTo2DArray(size._1, size._2, array)
  }
}
Now to use this, you could just do the following:
val a = Matrix((1, 2, 3))
// a: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0))
val b = Matrix(((1, 2, 3), (4, 5, 6), (7, 8, 9)))
// b: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0),
// Array(4.0, 5.0, 6.0),
// Array(7.0, 8.0, 9.0))
val c = Matrix((1L, 2F, "3")) // Correctly handles mixed types
// c: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0))
val d = Matrix((1L, 2F, new java.util.Date())) // Non-numeric types convert to NaN
// d: Array[Array[Double]] = Array(Array(1.0, 2.0, NaN))
Alternatively, you could call rowArrayTo2DArray directly, using the size of the array you want and a 1D array of values:
val e = Matrix.rowArrayTo2DArray(1, 3, Array(1, 2, 3))
// e: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0))
val f = Matrix.rowArrayTo2DArray(3, 1, Array(1, 2, 3))
// f: Array[Array[Double]] = Array(Array(1.0), Array(2.0), Array(3.0))
val g = Matrix.rowArrayTo2DArray(3, 3, Array(1, 2, 3, 4, 5, 6, 7, 8, 9))
// g: Array[Array[Double]] = Array(Array(1.0, 2.0, 3.0),
// Array(4.0, 5.0, 6.0),
// Array(7.0, 8.0, 9.0))
Glancing through the answers, I did not find what seems to me the most obvious and simple way to do it: instead of Array, you can use a tuple.
It would look something like this:
scala> val x = {(
| (3,5,7),
| (0,4,9),
| (1,8,6)
| )}
x: ((Int, Int, Int), (Int, Int, Int), (Int, Int, Int)) = ((3,5,7),(0,4,9),(1,8,6))
Seems clean and elegant? I think so. :)
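One caveat worth noting: with a tuple you access elements via the positional accessors rather than (i)(j), and you can't iterate over the rows generically. For example:
x._2._1 // Int = 0 (first element of the second row)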