Replicating rows in Spark Dataset N times - scala

When I try to do something like this in Spark:
val replicas = 10
val dsReplicated = ds flatMap (a => 0 until replicas map ((a, _)))
I get the following exception:
java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_1")
- root class: "scala.Tuple2"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:625)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
... 48 elided
I can achieve this using Spark DataFrame with explode function. I'd like to achieve something similar using Datasets.
For the reference, here is the code that replicates rows using the DataFrame API:
val dfReplicated = df.
withColumn("__temporarily__", typedLit((0 until replicas).toArray)).
withColumn("idx", explode($"__temporarily__")).
drop($"__temporarily__")

Here is one way of doing it:
case class Zip(zipcode: String)
case class Person(id: Int,name: String,zipcode: List[Zip])
data: org.apache.spark.sql.Dataset[Person]
data.show()
+---+----+--------------+
| id|name| zipcode|
+---+----+--------------+
| 1| AAA|[[MVP], [RB2]]|
| 2| BBB|[[KFG], [YYU]]|
| 3| CCC|[[JJJ], [7IH]]|
+---+----+--------------+
data.printSchema
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- zipcode: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- zipcode: string (nullable = true)
val df = data.withColumn("ArrayCol",explode($"zipcode"))
df.select($"id",$"name",$"ArrayCol.zipcode").show()
Output:
+---+----+-------+
| id|name|zipcode|
+---+----+-------+
| 1| AAA| MVP|
| 1| AAA| RB2|
| 2| BBB| KFG|
| 2| BBB| YYU|
| 3| CCC| JJJ|
| 3| CCC| 7IH|
+---+----+-------+
Using Dataset:
val resultDS = data.flatMap(x => x.zipcode.map(y => (x.id,x.name,y.zipcode)))
resultDS.show(false)
//resultDS:org.apache.spark.sql.Dataset[(Int, String, String)] =
// [_1: integer, _2: string ... 1 more fields]
//+---+---+---+
//|_1 |_2 |_3 |
//+---+---+---+
//|1 |AAA|MVP|
//|1 |AAA|RB2|
//|2 |BBB|KFG|
//|2 |BBB|YYU|
//|3 |CCC|JJJ|
//|3 |CCC|7IH|
//+---+---+---+

Related

Transpose DataFrame single row to column in Spark with scala

I saw this question here:
Transpose DataFrame Without Aggregation in Spark with scala and I wanted to do exactly the opposite.
I have this Dataframe with a single row, with values that are string, int, bool, array:
+-----+-------+-----+------+-----+
|col1 | col2 |col3 | col4 |col5 |
+-----+-------+-----+------+-----+
|val1 | val2 |val3 | val4 |val5 |
+-----+-------+-----+------+-----+
And I want to transpose it like this:
+-----------+-------+
|Columns | values|
+-----------+-------+
|col1 | val1 |
|col2 | val2 |
|col3 | val3 |
|col4 | val4 |
|col5 | val5 |
+-----------+-------+
I am using Apache Spark 2.4.3 with Scala 2.11
Edit: Values can be of any type (int, double, bool, array), not only strings.
Thought differently with out using arrays_zip (which is available in => Spark 2.4)] and got the below...
It will work for Spark =>2.0 onwards in a simpler way (flatmap , map and explode functions)...
Here map function (used in with column) creates a new map column. The input columns must be grouped as key-value pairs.
Case : String data type in Data :
import org.apache.spark.sql.functions._
val df: DataFrame =Seq((("val1"),("val2"),("val3"),("val4"),("val5"))).toDF("col1","col2","col3","col4","col5")
var columnsAndValues = df.columns.flatMap { c => Array(lit(c), col(c)) }
df.printSchema()
df.withColumn("myMap", map(columnsAndValues:_*)).select(explode($"myMap"))
.toDF("Columns","Values").show(false)
Result :
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col3: string (nullable = true)
|-- col4: string (nullable = true)
|-- col5: string (nullable = true)
+-------+------+
|Columns|Values|
+-------+------+
|col1 |val1 |
|col2 |val2 |
|col3 |val3 |
|col4 |val4 |
|col5 |val5 |
+-------+------+
Case : Mix of data types in Data :
If you have different types convert them to String... remaining steps wont change..
val df1 = df.select(df.columns.map(c => col(c).cast(StringType)): _*)
Full Example :
import org.apache.spark.sql.functions._
import spark.implicits._
import org.apache.spark.sql.Column
val df = Seq(((2), (3), (true), (2.4), ("val"))).toDF("col1", "col2", "col3", "col4", "col5")
df.printSchema()
/**
* convert all columns to to string type since its needed further
*/
val df1 = df.select(df.columns.map(c => col(c).cast(StringType)): _*)
df1.printSchema()
var ColumnsAndValues: Array[Column] = df.columns.flatMap { c => {
Array(lit(c), col(c))
}
}
df1.withColumn("myMap", map(ColumnsAndValues: _*))
.select(explode($"myMap"))
.toDF("Columns", "Values")
.show(false)
Result :
root
|-- col1: integer (nullable = false)
|-- col2: integer (nullable = false)
|-- col3: boolean (nullable = false)
|-- col4: double (nullable = false)
|-- col5: string (nullable = true)
root
|-- col1: string (nullable = false)
|-- col2: string (nullable = false)
|-- col3: string (nullable = false)
|-- col4: string (nullable = false)
|-- col5: string (nullable = true)
+-------+------+
|Columns|Values|
+-------+------+
|col1 |2 |
|col2 |3 |
|col3 |true |
|col4 |2.4 |
|col5 |val |
+-------+------+
From Spark-2.4 Use arrays_zip with array(column_values), array(column_names) then explode to get the result.
Example:
val df=Seq((("val1"),("val2"),("val3"),("val4"),("val5"))).toDF("col1","col2","col3","col4","col5")
val cols=df.columns.map(x => col(s"${x}"))
val str_cols=df.columns.mkString(",")
df.withColumn("new",explode(arrays_zip(array(cols:_*),split(lit(str_cols),",")))).
select("new.*").
toDF("values","Columns").
show()
//+------+-------+
//|values|Columns|
//+------+-------+
//| val1| col1|
//| val2| col2|
//| val3| col3|
//| val4| col4|
//| val5| col5|
//+------+-------+
UPDATE:
val df=Seq(((2),(3),(true),(2.4),("val"))).toDF("col1","col2","col3","col4","col5")
df.printSchema
//root
// |-- col1: integer (nullable = false)
// |-- col2: integer (nullable = false)
// |-- col3: boolean (nullable = false)
// |-- col4: double (nullable = false)
// |-- col5: string (nullable = true)
//cast to string
val cols=df.columns.map(x => col(s"${x}").cast("string").alias(s"${x}"))
val str_cols=df.columns.mkString(",")
df.withColumn("new",explode(arrays_zip(array(cols:_*),split(lit(str_cols),",")))).
select("new.*").
toDF("values","Columns").
show()
//+------+-------+
//|values|Columns|
//+------+-------+
//| 2| col1|
//| 3| col2|
//| true| col3|
//| 2.4| col4|
//| val| col5|
//+------+-------+

Read CSV with last column as array of values (and the values are inside parenthesis and separated by comma) in Spark

I have a CSV file where the last column is inside parenthesis and the values are separated by commas. The number of values is variable in the last column. When I read them to as Dataframe with some column names as follows, I get Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: The number of columns doesn't match. My CSV file looks like this
a1,b1,true,2017-05-16T07:00:41.0000000,2.5,(c1,d1,e1)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,f2,g2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2)
a2,b2,true,2017-05-26T07:00:42.0000000,0.5,(c2,d2,e2,k2,f2)
what I finally want is something like this:
root
|-- MId: string (nullable = true)
|-- PId: string (nullable = true)
|-- IsTeacher: boolean(nullable = true)
|-- STime: datetype(nullable = true)
|-- TotalMinutes: double(nullable = true)
|-- SomeArrayHeader: array<string>(nullable = true)
I have written the following code till now:
val infoDF =
sqlContext.read.format("csv")
.option("header", "false")
.load(inputPath)
.toDF(
"MId",
"PId",
"IsTeacher",
"STime",
"TotalMinutes",
"SomeArrayHeader")
I thought of reading them without giving column names and then cast the columns which are after the 5th columns to array type. But then I am having problems with the parentheses. Is there a way I can do this while reading and telling that fields inside parenthesis are actually one field of type array.
Ok. The solution is only tactical for your case. The below one worked for me
val df = spark.read.option("quote", "(").csv("in/staff.csv").toDF(
"MId",
"PId",
"IsTeacher",
"STime",
"TotalMinutes",
"arr")
df.show()
val df2 = df.withColumn("arr",split(regexp_replace('arr,"[)]",""),","))
df2.printSchema()
df2.show()
Output:
+---+---+---------+--------------------+------------+---------------+
|MId|PId|IsTeacher| STime|TotalMinutes| arr|
+---+---+---------+--------------------+------------+---------------+
| a1| b1| true|2017-05-16T07:00:...| 2.5| c1,d1,e1)|
| a2| b2| true|2017-05-26T07:00:...| 0.5|c2,d2,e2,f2,g2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2,d2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5| c2,d2,e2)|
| a2| b2| true|2017-05-26T07:00:...| 0.5|c2,d2,e2,k2,f2)|
+---+---+---------+--------------------+------------+---------------+
root
|-- MId: string (nullable = true)
|-- PId: string (nullable = true)
|-- IsTeacher: string (nullable = true)
|-- STime: string (nullable = true)
|-- TotalMinutes: string (nullable = true)
|-- arr: array (nullable = true)
| |-- element: string (containsNull = true)
+---+---+---------+--------------------+------------+--------------------+
|MId|PId|IsTeacher| STime|TotalMinutes| arr|
+---+---+---------+--------------------+------------+--------------------+
| a1| b1| true|2017-05-16T07:00:...| 2.5| [c1, d1, e1]|
| a2| b2| true|2017-05-26T07:00:...| 0.5|[c2, d2, e2, f2, g2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2, d2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5| [c2, d2, e2]|
| a2| b2| true|2017-05-26T07:00:...| 0.5|[c2, d2, e2, k2, f2]|
+---+---+---------+--------------------+------------+--------------------+

Spark GroupBy agg collect_list multiple columns

I have a question similar to this but the number of columns to be operated by collect_list is given by a name list. For example:
scala> w.show
+---+-----+----+-----+
|iid|event|date|place|
+---+-----+----+-----+
| A| D1| T0| P1|
| A| D0| T1| P2|
| B| Y1| T0| P3|
| B| Y2| T2| P3|
| C| H1| T0| P5|
| C| H0| T9| P5|
| B| Y0| T1| P2|
| B| H1| T3| P6|
| D| H1| T2| P4|
+---+-----+----+-----+
scala> val combList = List("event", "date", "place")
combList: List[String] = List(event, date, place)
scala> val v = w.groupBy("iid").agg(collect_list(combList(0)), collect_list(combList(1)), collect_list(combList(2)))
v: org.apache.spark.sql.DataFrame = [iid: string, collect_list(event): array<string> ... 2 more fields]
scala> v.show
+---+-------------------+------------------+-------------------+
|iid|collect_list(event)|collect_list(date)|collect_list(place)|
+---+-------------------+------------------+-------------------+
| B| [Y1, Y2, Y0, H1]| [T0, T2, T1, T3]| [P3, P3, P2, P6]|
| D| [H1]| [T2]| [P4]|
| C| [H1, H0]| [T0, T9]| [P5, P5]|
| A| [D1, D0]| [T0, T1]| [P1, P2]|
+---+-------------------+------------------+-------------------+
Is there any way I can apply collect_list to multiple columns inside agg without knowing the number of elements in the combList prior?
You can use collect_list(struct(col1, col2)) AS elements.
Example:
df.select("cd_issuer", "cd_doc", "cd_item", "nm_item").printSchema
val outputDf = spark.sql(s"SELECT cd_issuer, cd_doc, collect_list(struct(cd_item, nm_item)) AS item FROM teste GROUP BY cd_issuer, cd_doc")
outputDf.printSchema
df
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- cd_item: string (nullable = true)
|-- nm_item: string (nullable = true)
outputDf
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cd_item: string (nullable = true)
| | |-- nm_item: string (nullable = true)

Spark Scala replace Dataframe blank records to "0"

I need to replace my Dataframe field's blank records to "0"
Here is my code -->
import sqlContext.implicits._
case class CInspections (business_id:Int, score:String, date:String, type1:String)
val baseDir = "/FileStore/tables/484qrxx21488929011080/"
val raw_inspections = sc.textFile (s"$baseDir/inspections_plus.txt")
val raw_inspectionsmap = raw_inspections.map ( line => line.split ("\t"))
val raw_inspectionsRDD = raw_inspectionsmap.map ( raw_inspections => CInspections (raw_inspections(0).toInt,raw_inspections(1), raw_inspections(2),raw_inspections(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView ("Inspections")
raw_inspectionsDF.printSchema
raw_inspectionsDF.show()
I am using case class and then converting to Dataframe. But I need "score" as Int as I have to perform some operations and sort it.
But if I declare it as score:Int then I am getting error for blank values.
java.lang.NumberFormatException: For input string: "" 
+-----------+-----+--------+--------------------+
|business_id|score| date| type1|
+-----------+-----+--------+--------------------+
| 10| |20140807|Reinspection/Foll...|
| 10| 94|20140729|Routine - Unsched...|
| 10| |20140124|Reinspection/Foll...|
| 10| 92|20140114|Routine - Unsched...|
| 10| 98|20121114|Routine - Unsched...|
| 10| |20120920|Reinspection/Foll...|
| 17| |20140425|Reinspection/Foll...|
+-----------+-----+--------+--------------------+
I need score field as Int because for the below query, it sort as String not Int and giving wrong result
sqlContext.sql("""select raw_inspectionsDF.score from raw_inspectionsDF where score <>"" order by score""").show()
+-----+
|score|
+-----+
| 100|
| 100|
| 100|
+-----+
Empty string can't be converted to Integer, you need to make the Score nullable so that if the field is missing, it is represented as null, you can try the following:
import scala.util.{Try, Success, Failure}
1) Define a customized parse function which returns None, if the string can't be converted to an Int, in your case empty string;
def parseScore(s: String): Option[Int] = {
Try(s.toInt) match {
case Success(x) => Some(x)
case Failure(x) => None
}
}
2) Define the score field in your case class to be an Option[Int] type;
case class CInspections (business_id:Int, score: Option[Int], date:String, type1:String)
val raw_inspections = sc.textFile("test.csv")
val raw_inspectionsmap = raw_inspections.map(line => line.split("\t"))
3) Use the customized parseScore function to parse the score field;
val raw_inspectionsRDD = raw_inspectionsmap.map(raw_inspections =>
CInspections(raw_inspections(0).toInt, parseScore(raw_inspections(1)),
raw_inspections(2),raw_inspections(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView ("Inspections")
raw_inspectionsDF.printSchema
//root
// |-- business_id: integer (nullable = false)
// |-- score: integer (nullable = true)
// |-- date: string (nullable = true)
// |-- type1: string (nullable = true)
raw_inspectionsDF.show()
+-----------+-----+----+-----+
|business_id|score|date|type1|
+-----------+-----+----+-----+
| 1| null| a| b|
| 2| 3| s| k|
+-----------+-----+----+-----+
4) After parsing the file correctly, you can easily replace null value with 0 using na functions fill:
raw_inspectionsDF.na.fill(0).show
+-----------+-----+----+-----+
|business_id|score|date|type1|
+-----------+-----+----+-----+
| 1| 0| a| b|
| 2| 3| s| k|
+-----------+-----+----+-----+

delete constant columns spark having issue with timestamp column

hi guys i did this code that allows to drop columns with constant values.
i start by computing the standard deviation i then drop the ones having standard equal to zero ,but i got this issue when having a column which has a timestamp type what to do
cannot resolve 'stddev_samp(time.1)' due to data type mismatch: argument 1 requires double type, however, 'time.1' is of timestamp type.;;
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
//val df = spark.range(1, 1000).withColumn("X2", lit(0)).toDF("X1","X2")
val df = spark.read.option("inferSchema", "true").option("header", "true").csv("C:/Users/mhattabi/Desktop/dataTestCsvFile/dataTest2.txt")
df.show(5)
//df.columns.map(p=>s"`${p}`")
//val aggs = df.columns.map(c => stddev(c).as(c))
val aggs = df.columns.map(p=>stddev(s"`${p}`").as(p))
val stddevs = df.select(aggs: _*)
val columnsToKeep: Seq[Column] = stddevs.first // Take first row
.toSeq // convert to Seq[Any]
.zip(df.columns) // zip with column names
.collect {
// keep only names where stddev != 0
case (s: Double, c) if s != 0.0 => col(c)
}
df.select(columnsToKeep: _*).show(5,false)
Using stddev
stddev is only defined on numeric columns. If you want to compute the standard deviation of a date column you will need to convert it to a time stamp first:
scala> var myDF = (0 to 10).map(x => (x, scala.util.Random.nextDouble)).toDF("id", "rand_double")
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double]
scala> myDF = myDF.withColumn("Date", current_date())
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double ... 1 more field]
scala> myDF.printSchema
root
|-- id: integer (nullable = false)
|-- rand_double: double (nullable = false)
|-- Date: date (nullable = false)
scala> myDF.show
+---+-------------------+----------+
| id| rand_double| Date|
+---+-------------------+----------+
| 0| 0.3786008989478248|2017-03-21|
| 1| 0.5968932024004612|2017-03-21|
| 2|0.05912760417456575|2017-03-21|
| 3|0.29974600653895667|2017-03-21|
| 4| 0.8448407414817856|2017-03-21|
| 5| 0.2049495659443249|2017-03-21|
| 6| 0.4184846380144779|2017-03-21|
| 7|0.21400484330739022|2017-03-21|
| 8| 0.9558142791013501|2017-03-21|
| 9|0.32530639391058036|2017-03-21|
| 10| 0.5100585655062743|2017-03-21|
+---+-------------------+----------+
scala> myDF = myDF.withColumn("Date", unix_timestamp($"Date"))
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double ... 1 more field]
scala> myDF.printSchema
root
|-- id: integer (nullable = false)
|-- rand_double: double (nullable = false)
|-- Date: long (nullable = true)
scala> myDF.show
+---+-------------------+----------+
| id| rand_double| Date|
+---+-------------------+----------+
| 0| 0.3786008989478248|1490072400|
| 1| 0.5968932024004612|1490072400|
| 2|0.05912760417456575|1490072400|
| 3|0.29974600653895667|1490072400|
| 4| 0.8448407414817856|1490072400|
| 5| 0.2049495659443249|1490072400|
| 6| 0.4184846380144779|1490072400|
| 7|0.21400484330739022|1490072400|
| 8| 0.9558142791013501|1490072400|
| 9|0.32530639391058036|1490072400|
| 10| 0.5100585655062743|1490072400|
+---+-------------------+----------+
At this point all of the columns are numeric so your code will run fine:
scala> :pa
// Entering paste mode (ctrl-D to finish)
val aggs = myDF.columns.map(p=>stddev(s"`${p}`").as(p))
val stddevs = myDF.select(aggs: _*)
val columnsToKeep: Seq[Column] = stddevs.first // Take first row
.toSeq // convert to Seq[Any]
.zip(myDF.columns) // zip with column names
.collect {
// keep only names where stddev != 0
case (s: Double, c) if s != 0.0 => col(c)
}
myDF.select(columnsToKeep: _*).show(false)
// Exiting paste mode, now interpreting.
+---+-------------------+
|id |rand_double |
+---+-------------------+
|0 |0.3786008989478248 |
|1 |0.5968932024004612 |
|2 |0.05912760417456575|
|3 |0.29974600653895667|
|4 |0.8448407414817856 |
|5 |0.2049495659443249 |
|6 |0.4184846380144779 |
|7 |0.21400484330739022|
|8 |0.9558142791013501 |
|9 |0.32530639391058036|
|10 |0.5100585655062743 |
+---+-------------------+
aggs: Array[org.apache.spark.sql.Column] = Array(stddev_samp(id) AS `id`, stddev_samp(rand_double) AS `rand_double`, stddev_samp(Date) AS `Date`)
stddevs: org.apache.spark.sql.DataFrame = [id: double, rand_double: double ... 1 more field]
columnsToKeep: Seq[org.apache.spark.sql.Column] = ArrayBuffer(id, rand_double)
Using countDistinct
All that being said, it would be more general to use countDistinct:
scala> val distCounts = myDF.select(myDF.columns.map(c => countDistinct(c) as c): _*).first.toSeq.zip(myDF.columns)
distCounts: Seq[(Any, String)] = ArrayBuffer((11,id), (11,rand_double), (1,Date))]
scala> distCounts.foldLeft(myDF)((accDF, dc_col) => if (dc_col._1 == 1) accDF.drop(dc_col._2) else accDF).show
+---+-------------------+
| id| rand_double|
+---+-------------------+
| 0| 0.3786008989478248|
| 1| 0.5968932024004612|
| 2|0.05912760417456575|
| 3|0.29974600653895667|
| 4| 0.8448407414817856|
| 5| 0.2049495659443249|
| 6| 0.4184846380144779|
| 7|0.21400484330739022|
| 8| 0.9558142791013501|
| 9|0.32530639391058036|
| 10| 0.5100585655062743|
+---+-------------------+