How to randomly sample a fraction of the rows in a DataFrame? - scala

I am trying to get the DataFrame as a list of records using the collect function, and it is very slow for a DataFrame with 4000+ columns. Are there any faster alternatives? I even tried doing df.persist() before calling .collect(), but that didn't help.
val data = df
  .collect()
  .map(x =>
    x.toSeq.toList.map {
      case null  => ""
      case other => other.toString
    }
  )
  .toList
EDIT (from comments):
So the use case is to get the records from the dataframe and show them as sample data.

Based on your question and comments, it sounds like you're looking for a way to sample columns and rows. Here's a simple way to take N random columns and randomly sample a fraction of the rows in a DataFrame:
import spark.implicits._
import org.apache.spark.sql.functions.col
import scala.util.Random

val df = Seq(
  (1, "a", 10.0, 100L),
  (2, "b", 20.0, 200L),
  (3, "c", 30.0, 300L)
).toDF("c1", "c2", "c3", "c4")

// e.g. take 3 random columns and randomly sample ~70% of the rows
df.
  select(Random.shuffle(df.columns.toSeq).take(3).map(col): _*).
  sample(70.0/100).
  show
// +---+---+---+
// | c1| c2| c4|
// +---+---+---+
// |  1|  a|100|
// |  3|  c|300|
// +---+---+---+
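Note that sample takes a fraction, not an exact row count, so the number of returned rows varies from run to run. If you need a reproducible sample, pass a seed as well (a minimal sketch; the seed value below is arbitrary):
// reproducible ~70% sample, seed chosen arbitrarily
df.sample(withReplacement = false, fraction = 0.7, seed = 42L).show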

You should limit the number of rows you fetch to the driver; collect will pull back everything.
Either use
df.limit(20).collect
or
df.take(20)
Also, it should be faster if you first map each Row to a List[String] and then collect; this way the conversion runs on the executors:
// requires spark.implicits._ in scope so an Encoder for List[String] is available
val data = df
  .map(x =>
    x.toSeq.toList.map {
      case null  => ""
      case other => other.toString
    }
  )
  .take(20)
  .toList

Related

Dynamic dataframe with n columns and m rows

I'm reading data from JSON (dynamic schema) and loading it into a DataFrame.
Example DataFrame:
scala> import spark.implicits._
import spark.implicits._
scala> val DF = Seq(
  (1, "ABC"),
  (2, "DEF"),
  (3, "GHIJ")
).toDF("id", "word")
DF: org.apache.spark.sql.DataFrame = [id: int, word: string]
scala> DF.show
+---+----+
| id|word|
+---+----+
|  1| ABC|
|  2| DEF|
|  3|GHIJ|
+---+----+
Requirement:
The column count and names can be anything. I want to read the rows in a loop and fetch each column one by one, so the values can be processed in subsequent flows. I need both the column name and the value. I'm using Scala.
Python:
for i, j in df.iterrows():
    print(i, j)
I need the same functionality in Scala, where the column name and value are fetched separately. Kindly help.
df.iterrows is not from PySpark but from pandas. In Spark, you can use foreach:
import org.apache.spark.sql.Row

DF.foreach { case Row(id: Int, word: String) => println(id, word) }
Result :
(2,DEF)
(3,GHIJ)
(1,ABC)
If you don't know the number of columns, you cannot use unapply on Row; in that case just do:
DF.foreach(row => println(row))
Result :
[1,ABC]
[2,DEF]
[3,GHIJ]
and operate on the row using its methods such as getAs.
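Since the question asks for both the column name and the value of each field, here is a small sketch of how that could look (my addition, using the Row API and the DF defined above):
// pair each value with its column name using the row's schema
DF.foreach { row =>
  row.schema.fieldNames.zip(row.toSeq).foreach { case (name, value) =>
    println(s"$name = $value")
  }
}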

Merging RDD records to obtain a single Row with multiple conditional counters

As a bit of context, what I'm trying to achieve here is the following: given multiple rows grouped by a certain set of keys, after that first reduce I would like to group them into a single row keyed by, for example, date, carrying each of the previously calculated counters. This may not be clear from the description alone, so here is a simple example of what should happen.
(("Volvo", "T4", "2019-05-01"), 5)
(("Volvo", "T5", "2019-05-01"), 7)
(("Audi", "RS6", "2019-05-01"), 4)
And once those Row objects are merged...
date , volvo_counter , audi_counter
"2019-05-01" , 12 , 4
I reckon this is quite a corner case and that there may be different approaches but I was wondering if there was any solution within the same RDD so there's no need for multiple RDDs divided by counter.
What you want to do is a pivot. You talk about RDDs so I assume that your question is: "how to do a pivot with the RDD API?". As far as I know there is no built-in function in the RDD API that does it. You could do it yourself like this:
// let's create sample data
val rdd = sc.parallelize(Seq(
  (("Volvo", "T4", "2019-05-01"), 5),
  (("Volvo", "T5", "2019-05-01"), 7),
  (("Audi", "RS6", "2019-05-01"), 4)
))
// If the keys are not known in advance, we compute their distinct values
val values = rdd.map(_._1._1).distinct.collect.toSeq
// values: Seq[String] = WrappedArray(Volvo, Audi)
// Finally we make the pivot and use reduceByKey on the sequence
val res = rdd
  .map { case ((make, model, date), counter) =>
    date -> values.map(v => if (make == v) counter else 0)
  }
  .reduceByKey((a, b) => a.indices.map(i => a(i) + b(i)))
// which gives you this
res.collect.head
// (String, Seq[Int]) = (2019-05-01,Vector(12, 4))
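The counters in the resulting Vector follow the order of `values` (here Volvo, Audi), so if you want them labelled as in the desired output you could, for instance, zip them back (a small addition of mine, not part of the original answer):
// label each counter with its make; the order follows `values`
res.map { case (date, counts) => (date, values.zip(counts).toMap) }.collect
// e.g. Array((2019-05-01, Map(Volvo -> 12, Audi -> 4)))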
Note that you can write much simpler code with the SparkSQL API:
// let's first transform the previously created RDD to a dataframe:
val df = rdd.map { case ((a, b, c), d) => (a, b, c, d) }
  .toDF("make", "model", "date", "counter")
// And then it's as simple as that:
import org.apache.spark.sql.functions.sum
df.groupBy("date")
  .pivot("make")
  .agg(sum("counter"))
  .show
+----------+----+-----+
|      date|Audi|Volvo|
+----------+----+-----+
|2019-05-01|   4|   12|
+----------+----+-----+
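If the set of makes is known in advance, you can also pass the values to pivot explicitly, which avoids the extra job that computes the distinct values (the list below is just the one from this example):
df.groupBy("date")
  .pivot("make", Seq("Volvo", "Audi"))
  .agg(sum("counter"))
  .show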
I think it's easier to do with DataFrame:
// case classes assumed by this answer (not shown in the original snippet):
case class Key(model: String, date: String)
case class Record(key: Key, value: Int)

import spark.implicits._
import org.apache.spark.sql.functions.{sum, when}

val data = Seq(
  Record(Key("Volvo", "2019-05-01"), 5),
  Record(Key("Volvo", "2019-05-01"), 7),
  Record(Key("Audi", "2019-05-01"), 4)
)
val rdd = spark.sparkContext.parallelize(data)
val df = rdd.toDF()

val modelsExpr = df
  .select($"key.model".as("model"))
  .distinct()
  .collect()
  .map(r => r.getAs[String]("model"))
  .map(m => sum(when($"key.model" === m, $"value").otherwise(0)).as(s"${m}_counter"))

df
  .groupBy("key.date")
  .agg(modelsExpr.head, modelsExpr.tail: _*)
  .show(false)

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame with a column of Vector values. The vectors are all n-dimensional, i.e. all of the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponding to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join with the old one, but it requires spark context to use createDataFrame. I only want to transform the existing data frame. I also know .withColumn("fi", value) but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then use getItem to extract the individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq((1, linalg.Vectors.dense(1, 0, 1, 1, 0))).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show(false)
//+---+---------------------+
//|id |features             |
//+---+---------------------+
//|1  |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add an ArrayType column
val dfArr = df.withColumn("featuresArr", vecToArray($"features"))
// Array of element names to extract.
// The size of `elements` should equal the vector length; ArrayIndexOutOfBounds is not checked.
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+
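As a side note (my addition, not part of the original answer): on Spark 3.0+ the ML library provides a built-in equivalent of that UDF, so assuming a recent version the conversion step could be written as:
import org.apache.spark.ml.functions.vector_to_array
val dfArr = df.withColumn("featuresArr", vector_to_array($"features"))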

Using String Functions in a DataFrame Join in Scala

I am trying to join two DataFrames with a condition like "Wo" in "Hello World", i.e. dataframe1's column contains dataframe2's col1 value.
In HQL, we can use instr(t1.col1,t2.col1)>0
How can I achieve this same condition with DataFrames in Scala? I tried
df1.join(df2,df1("col1").indexOfSlice(df2("col1")) > 0)
But it throwing me the below error
error: value indexOfSlice is not a member of
org.apache.spark.sql.Column
I just want to achieve the below HQL query using DataFrames.
select t1.*,t2.col1 from t1,t2 where instr(t1.col1,t2.col1)>0
The following solution is tested with Spark 2.2. You'll need to define a UDF, and you can specify the join condition as part of a where filter:
val indexOfSlice_ = (c1: String, c2: String) => c1.indexOfSlice(c2)
val islice = udf(indexOfSlice_)
val df10: DataFrame = Seq(("Hello World", 2), ("Foo", 3)).toDF("c1", "c2")
val df20: DataFrame = Seq(("Wo", 2), ("Bar", 3)).toDF("c3", "c4")
df10.crossJoin(df20).where(islice(df10.col("c1"), df20.col("c3")) > 0).show
// +-----------+---+---+---+
// |         c1| c2| c3| c4|
// +-----------+---+---+---+
// |Hello World|  2| Wo|  2|
// +-----------+---+---+---+
PS: Beware! Using a cross join is an expensive operation, as it yields a Cartesian product.
EDIT: Consider reading this when you want to use this solution.
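As an alternative worth considering (my addition, not part of the answer above): Spark SQL also ships an instr function, so the same condition can be expressed without a custom UDF via expr. The cost caveat still applies, since this remains a non-equi join:
import org.apache.spark.sql.functions.expr
df10.join(df20, expr("instr(c1, c3) > 0")).show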

SumProduct in Spark DataFrame

I want to create essentially a sumproduct across columns in a Spark DataFrame. I have a DataFrame that looks like this:
id val1 val2 val3 val4
123 10 5 7 5
I also have a Map that looks like:
val coefficients = Map("val1" -> 1, "val2" -> 2, "val3" -> 3, "val4" -> 4)
I want to take the value in each column of the DataFrame, multiply it by the corresponding value from the map, and return the result in a new column so essentially:
(10*1) + (5*2) + (7*3) + (5*4) = 61
I tried this:
val myDF1 = myDF.withColumn("mySum", {var a:Double = 0.0; for ((k,v) <- coefficients) a + (col(k).cast(DoubleType)*coefficients(k));a})
but got an error that the "+" method was overloaded. Even if I solved that, I'm not sure this would work. Any ideas? I could always dynamically build a SQL query as a text string and do it that way, but I was hoping for something a little more elegant.
Any ideas are appreciated.
The problem with your code is that you try to add a Column to a Double. cast(DoubleType) affects only the type of the stored values, not the type of the column itself. Since Double doesn't provide a +(x: org.apache.spark.sql.Column): org.apache.spark.sql.Column method, everything fails.
To make it work you can for example do something like this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}
val df = sc.parallelize(Seq(
  (123, 10, 5, 7, 5), (456, 1, 1, 1, 1)
)).toDF("k", "val1", "val2", "val3", "val4")

val coefficients = Map("val1" -> 1, "val2" -> 2, "val3" -> 3, "val4" -> 4)

val dotProduct: Column = coefficients
  // To be explicit you can replace
  //   col(k) * v with col(k) * lit(v)
  // but it is not required here
  // since we use the Column.* method, not Int.*
  .map { case (k, v) => col(k) * v } // * -> Column.*
  .reduce(_ + _)                     // + -> Column.+

df.withColumn("mySum", dotProduct).show
// +---+----+----+----+----+-----+
// |  k|val1|val2|val3|val4|mySum|
// +---+----+----+----+----+-----+
// |123|  10|   5|   7|   5|   61|
// |456|   1|   1|   1|   1|   10|
// +---+----+----+----+----+-----+
It looks like the issue is that you aren't actually doing anything with a in
for ((k, v) <- coefficients) a + ...
You probably meant a += ...
Also, some advice for cleaning up the block of code inside the withColumn call:
You don't need to call coefficients(k) because you've already got its value in v from for((k,v) <- coefficients)
Scala is pretty good at making one-liners, but it's kinda cheating if you have to put semicolons in that one line :P I'd suggest breaking up the sum calculation section into one line per expression.
The sum expression could be rewritten as a fold which avoids using a var (idiomatic Scala usually avoids vars), e.g.
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DoubleType

coefficients.foldLeft(lit(0.0)) {
  case (sumSoFar, (k, v)) => col(k).cast(DoubleType) * v + sumSoFar
}
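For completeness, here is how that expression could be plugged back in (a minimal usage sketch of my own, assuming the myDF and coefficients from the question):
val mySum = coefficients.foldLeft(lit(0.0)) {
  case (sumSoFar, (k, v)) => col(k).cast(DoubleType) * v + sumSoFar
}
myDF.withColumn("mySum", mySum).show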
I'm not sure if this is possible through the DataFrame API since you are only able to work with columns and not any predefined closures (e.g. your parameter map).
I've outlined a way below using the underlying RDD of the DataFrame:
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Initializing your input example.
val df1 = sc.parallelize(Seq((123, 10, 5, 7, 5))).toDF("id", "val1", "val2", "val3", "val4")
// Return column names as an array
val names = df1.columns
// Grab underlying RDD and zip elements with column names
val rdd1 = df1.rdd.map(row => (0 until row.length).map(row.getInt(_)).zip(names))
// Tack on the accumulated total to each existing row
val rdd2 = rdd1.map { seq =>
  Row.fromSeq(seq.map(_._1) :+ seq.map { case (value: Int, name: String) => value * coefficients.getOrElse(name, 0) }.sum)
}
// Create output schema (with total)
val totalSchema = StructType(df1.schema.fields :+ StructField("total", IntegerType))
// Apply schema to create output dataframe
val df2 = sqlContext.createDataFrame(rdd2, totalSchema)
// Show output:
df2.show()
...
+---+----+----+----+----+-----+
| id|val1|val2|val3|val4|total|
+---+----+----+----+----+-----+
|123|  10|   5|   7|   5|   61|
+---+----+----+----+----+-----+