Creating data frames in Spark using Sets in Scala

Creating data frames in Spark using Sets in Scala - scala

I want to create a dummy data frame in Spark using Scala which looks like this -
+-----------+----+
|channel_set|rate|
+-----------+----+
| [A, D]| 0.0|
| [C]| 0.0|
| [D]| 1.0|
| [B, A]| 0.5|
+-----------+----+
I have tried below code to do that -
val b = Array((Set("A","D"),0.0) , (Set("C"),0.0), (Set("D"),1.0), (Set("B","A"),0.5) )
val dummy_data = sc.parallelize(b).toDF("channel_set", "rate")
But facing an error -
scala> val dummy_data = sc.parallelize(b).toDF("channel_set", "rate")
java.lang.UnsupportedOperationException: No Encoder found for scala.collection.immutable.Set[java.lang.String]
- field (class: "scala.collection.immutable.Set", name: "_1")
- root class: "scala.Tuple2"
Kindly help.

Array is a mutable object and dataframes/datasets should have static schema i.e. fixed dataTypes. So use of Seq or List should work for you as they are immutables.
val df = Seq(
(Array("A","D"),0.0),
(Array("C"),0.0),
(Array("D"),1.0),
(Array("B","A"),0.5)
).toDF("channel_set", "rate")
df.show(false)
You should have dataframe as
+-----------+----+
|channel_set|rate|
+-----------+----+
|[A, D] |0.0 |
|[C] |0.0 |
|[D] |1.0 |
|[B, A] |0.5 |
+-----------+----+

If you look at your error message, it's the Set type that Spark's SQL/DataFrame API doesn't support:
java.lang.UnsupportedOperationException:
No Encoder found for scala.collection.immutable.Set[java.lang.String]
Here's the data types supported by Spark SQL/DataFrame. That said, you can use Set within a UDF, if needed.
In creating a DataFrame, Spark handles Seq, List, Array in a similar fashion. If you do a printSchema and show on the following 3 DataFrames, you'll see that they're identical.
sc.parallelize(Array(
(Array("A","D"),0.0) , (Array("C"),0.0), (Array("D"),1.0), (Array("B","A"),0.5)
)).toDF("channel_set", "rate")
sc.parallelize(List(
(List("A","D"),0.0) , (List("C"),0.0), (List("D"),1.0), (List("B","A"),0.5)
)).toDF("channel_set", "rate")
sc.parallelize(Seq(
(Seq("A","D"),0.0) , (Seq("C"),0.0), (Seq("D"),1.0), (Seq("B","A"),0.5)
)).toDF("channel_set", "rate")
// res.printSchema
// root
// |-- channel_set: array (nullable = true)
// | |-- element: string (containsNull = true)
// |-- rate: double (nullable = false)
// res.show
// +-----------+----+
// |channel_set|rate|
// +-----------+----+
// | [A, D]| 0.0|
// | [C]| 0.0|
// | [D]| 1.0|
// | [B, A]| 0.5|
// +-----------+----+

Related

Scala Spark: Flatten Array of Key/Value structs

I have an input dataframe which contains an array-typed column. Each entry in the array is a struct consisting of a key (one of about four values) and a value. I want to turn this into a dataframe with one column for each possible key, and nulls where that value is not in the array for that row. Keys are never duplicated in any of the arrays, but they may be out of order or missing.
So far the best I've got is
val wantedCols =df.columns
.filter(_ != arrayCol)
.filter(_ != "col")
val flattened = df
.select((wantedCols.map(col(_)) ++ Seq(explode(col(arrayCol)))):_*)
.groupBy(wantedCols.map(col(_)):_*)
.pivot("col.key")
.agg(first("col.value"))
This does exactly what I want, but it's hideous and I have no idea what the ramifactions of grouping on every-column-but-one would be. What's the RIGHT way to do this?
EDIT: Example input/output:
case class testStruct(name : String, number : String)
val dfExampleInput = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))))
.toDF("index", "state", "entries")
.show
+-----+-----+------------------+
|index|state| entries|
+-----+-----+------------------+
| 0| KY| [[A, 45]]|
| 1| OR|[[A, 30], [B, 10]]|
+-----+-----+------------------+
val dfExampleOutput = Seq(
(0, "KY", "45", null),
(1, "OR", "30", "10"))
.toDF("index", "state", "A", "B")
.show
+-----+-----+---+----+
|index|state| A| B|
+-----+-----+---+----+
| 0| KY| 45|null|
| 1| OR| 30| 10|
+-----+-----+---+----+
FURTHER EDIT:
I submitted a solution myself (see below) that handles this well so long as you know the keys in advance (in my case I do.) If finding the keys is an issue, another answer holds code to handle that.

Without groupBy pivot agg first
Please check below code.
scala> val df = Seq((0, "KY", Seq(("A", "45"))),(1, "OR", Seq(("A", "30"),("B", "10")))).toDF("index", "state", "entries").withColumn("entries",$"entries".cast("array<struct<name:string,number:string>>"))
df: org.apache.spark.sql.DataFrame = [index: int, state: string ... 1 more field]
scala> df.printSchema
root
|-- index: integer (nullable = false)
|-- state: string (nullable = true)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- number: string (nullable = true)
scala> df.show(false)
+-----+-----+------------------+
|index|state|entries |
+-----+-----+------------------+
|0 |KY |[[A, 45]] |
|1 |OR |[[A, 30], [B, 10]]|
+-----+-----+------------------+
scala> val finalDFColumns = df.select(explode($"entries").as("entries")).select("entries.*").select("name").distinct.map(_.getAs[String](0)).orderBy($"value".asc).collect.foldLeft(df.limit(0))((cdf,c) => cdf.withColumn(c,lit(null))).columns
finalDFColumns: Array[String] = Array(index, state, entries, A, B)
scala> val finalDF = df.select($"*" +: (0 until max).map(i => $"entries".getItem(i)("number").as(i.toString)): _*)
finalDF: org.apache.spark.sql.DataFrame = [index: int, state: string ... 3 more fields]
scala> finalDF.show(false)
+-----+-----+------------------+---+----+
|index|state|entries |0 |1 |
+-----+-----+------------------+---+----+
|0 |KY |[[A, 45]] |45 |null|
|1 |OR |[[A, 30], [B, 10]]|30 |10 |
+-----+-----+------------------+---+----+
scala> finalDF.printSchema
root
|-- index: integer (nullable = false)
|-- state: string (nullable = true)
|-- entries: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- number: string (nullable = true)
|-- 0: string (nullable = true)
|-- 1: string (nullable = true)
scala> finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf,column) => fdf.withColumnRenamed(column._1,column._2)).show(false)
+-----+-----+------------------+---+----+
|index|state|entries |A |B |
+-----+-----+------------------+---+----+
|0 |KY |[[A, 45]] |45 |null|
|1 |OR |[[A, 30], [B, 10]]|30 |10 |
+-----+-----+------------------+---+----+
scala>
Final Output
scala> finalDF.columns.zip(finalDFColumns).foldLeft(finalDF)((fdf,column) => fdf.withColumnRenamed(column._1,column._2)).drop($"entries").show(false)
+-----+-----+---+----+
|index|state|A |B |
+-----+-----+---+----+
|0 |KY |45 |null|
|1 |OR |30 |10 |
+-----+-----+---+----+

I wouldn't worry too much about grouping by several columns, other than potentially making things confusing. In that vein, if there is a simpler, more maintainable way, go for it. Without example input/output, I'm not sure if this gets you where you're trying to go, but maybe it'll be of use:
Seq(Seq("k1" -> "v1", "k2" -> "v2")).toDS() // some basic input based on my understanding of your description
.select(explode($"value")) // flatten the array
.select("col.*") // de-nest the struct
.groupBy("_2") // one row per distinct value
.pivot("_1") // one column per distinct key
.count // or agg(first) if you want the value in each column
.show
+---+----+----+
| _2| k1| k2|
+---+----+----+
| v2|null| 1|
| v1| 1|null|
+---+----+----+
Based on what you've now said, I get the impression that there are many columns like "state" that aren't required for the aggregation, but need to be in the final result.
For reference, if you didn't need to pivot, you could add a struct column with all such fields nested within, then add it to your aggregation, eg: .agg(first($"myStruct"), first($"number")). The main advantage is only having actual key column(s) referenced in the groubBy. But when using pivot things get a little weird, so we'll set that option aside.
In this use case, the simplest way I could come up with involves splitting your dataframe and joining it back together after the aggregation using some rowkey. In this example I am assuming that "index" is suitable for that purpose:
val mehCols = dfExampleInput.columns.filter(_ != "entries").map(col)
val mehDF = dfExampleInput.select(mehCols:_*)
val aggDF = dfExampleInput
.select($"index", explode($"entries").as("entry"))
.select($"index", $"entry.*")
.groupBy("index")
.pivot("name")
.agg(first($"number"))
scala> mehDF.join(aggDF, Seq("index")).show
+-----+-----+---+----+
|index|state| A| B|
+-----+-----+---+----+
| 0| KY| 45|null|
| 1| OR| 30| 10|
+-----+-----+---+----+
I doubt you would see much of a difference in performance, if any. Maybe at the extremes, eg: very many meh columns, or very many pivot columns, or something like that, or maybe nothing at all. Personally, I would test both with decently-sized input, and if there wasn't a significant difference, use whichever one seemed easier to maintain.

Here is another way that is based on the assumption that there are no duplicates on the entries column i.e Seq(testStruct("A", "30"), testStruct("A", "70"), testStruct("B", "10")) will cause an error. The next solution combines both RDD and Dataframe APIs for the implementation:
import org.apache.spark.sql.functions.explode
import org.apache.spark.sql.types.StructType
case class testStruct(name : String, number : String)
val df = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))),
(2, "FL", Seq(testStruct("A", "30"), testStruct("B", "10"), testStruct("C", "20"))),
(3, "TX", Seq(testStruct("B", "60"), testStruct("A", "19"), testStruct("C", "40")))
)
.toDF("index", "state", "entries")
.cache
// get all possible keys from entries i.e Seq[A, B, C]
val finalCols = df.select(explode($"entries").as("entry"))
.select($"entry".getField("name").as("entry_name"))
.distinct
.collect
.map{_.getAs[String]("entry_name")}
.sorted // Attention: we need to retain the order of the columns
// 1. when generating row values and
// 2. when creating the schema
val rdd = df.rdd.map{ r =>
// transform the entries array into a map i.e Map(A -> 30, B -> 10)
val entriesMap = r.getSeq[Row](2).map{r => (r.getString(0), r.getString(1))}.toMap
// transform finalCols into a map with null value i.e Map(A -> null, B -> null, C -> null)
val finalColsMap = finalCols.map{c => (c, null)}.toMap
// replace null values with those that are present from the current row by merging the two previous maps
// Attention: this should retain the order of finalColsMap
val merged = finalColsMap ++ entriesMap
// concatenate the two first row values ["index", "state"] with the values from merged
val finalValues = Seq(r(0), r(1)) ++ merged.values
Row.fromSeq(finalValues)
}
val extraCols = finalCols.map{c => s"`${c}` STRING"}
val schema = StructType.fromDDL("`index` INT, `state` STRING," + extraCols.mkString(","))
val finalDf = spark.createDataFrame(rdd, schema)
finalDf.show
// +-----+-----+---+----+----+
// |index|state| A| B| C|
// +-----+-----+---+----+----+
// | 0| KY| 45|null|null|
// | 1| OR| 30| 10|null|
// | 2| FL| 30| 10| 20|
// | 3| TX| 19| 60| 40|
// +-----+-----+---+----+----+
Note: the solution requires one extra action to retrieve the unique keys although it doesn't cause any shuffling since it it based on narrow transformations only.

I've worked out a solution myself:
def extractFromArray(colName : String, key : String, numKeys : Int, keyName : String) = {
val indexCols = (0 to numKeys-1).map(col(colName).getItem(_))
indexCols.foldLeft(lit(null))((innerCol : Column, indexCol : Column) =>
when(indexCol.isNotNull && (indexCol.getItem(keyName) === key), indexCol)
.otherwise(innerCol))
}
Example:
case class testStruct(name : String, number : String)
val df = Seq(
(0, "KY", Seq(testStruct("A", "45"))),
(1, "OR", Seq(testStruct("A", "30"), testStruct("B", "10"))),
(2, "FL", Seq(testStruct("A", "30"), testStruct("B", "10"), testStruct("C", "20"))),
(3, "TX", Seq(testStruct("B", "60"), testStruct("A", "19"), testStruct("C", "40")))
)
.toDF("index", "state", "entries")
.withColumn("A", extractFromArray("entries", "B", 3, "name"))
.show
which produces:
+-----+-----+--------------------+-------+
|index|state| entries| A|
+-----+-----+--------------------+-------+
| 0| KY| [[A, 45]]| null|
| 1| OR| [[A, 30], [B, 10]]|[B, 10]|
| 2| FL|[[A, 30], [B, 10]...|[B, 10]|
| 3| TX|[[B, 60], [A, 19]...|[B, 60]|
+-----+-----+--------------------+-------+
This solution is a little different from other answers:
It works on only a single key at a time
It requires the key name and number of keys be known in advance
It produces a column of structs, rather than doing the extra step of extracting specific values
It works as a simple column-to-column operation, rather than requiring transformations on the entire DF
It can be evaluated lazily
The first three issues can be handled by calling code, and leave it somewhat more flexible for cases where you already know the keys or where the structs contain additional values to extract.

How can I split a column containing array of some struct into separate columns?

I have the following scenarios:
case class attribute(key:String,value:String)
case class entity(id:String,attr:List[attribute])
val entities = List(entity("1",List(attribute("name","sasha"),attribute("home","del"))),
entity("2",List(attribute("home","hyd"))))
val df = entities.toDF()
// df.show
+---+--------------------+
| id| attr|
+---+--------------------+
| 1|[[name,sasha], [d...|
| 2| [[home,hyd]]|
+---+--------------------+
//df.printSchema
root
|-- id: string (nullable = true)
|-- attr: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- key: string (nullable = true)
| | |-- value: string (nullable = true)
what I want to produce is
+---+--------------------+-------+
| id| name | home |
+---+--------------------+-------+
| 1| sasha |del |
| 2| null |hyd |
+---+--------------------+-------+
How do I go about this. I looked at quite a few similar questions on stack but couldn't find anything useful.
My main motive is to do groupBy on different attributes, thus want to bring it in the above mentioned format.
I looked into explode functionality. It breaks downs a list in separate rows, I don't want that. I want to create more columns from the array of attribute.
Similar things I found:
Spark - convert Map to a single-row DataFrame
Split 1 column into 3 columns in spark scala
Spark dataframe - Split struct column into 2 columns

That can easily be reduced to PySpark converting a column of type 'map' to multiple columns in a dataframe or How to get keys and values from MapType column in SparkSQL DataFrame. First convert attr to map<string, string>
import org.apache.spark.sql.functions.{explode, map_from_entries, map_keys}
val dfMap = df.withColumn("attr", map_from_entries($"attr"))
then it's just a matter of finding the unique keys
val keys = dfMap.select(explode(map_keys($"attr"))).as[String].distinct.collect
then selecting from the map
val result = dfMap.select($"id" +: keys.map(key => $"attr"(key) as key): _*)
result.show
+---+-----+----+
| id| name|home|
+---+-----+----+
| 1|sasha| del|
| 2| null| hyd|
+---+-----+----+
Less efficient but more concise variant is to explode and pivot
val result = df
.select($"id", explode(map_from_entries($"attr")))
.groupBy($"id")
.pivot($"key")
.agg(first($"value"))
result.show
+---+----+-----+
| id|home| name|
+---+----+-----+
| 1| del|sasha|
| 2| hyd| null|
+---+----+-----+
but in practice I'd advise against it.

delete constant columns spark having issue with timestamp column

hi guys i did this code that allows to drop columns with constant values.
i start by computing the standard deviation i then drop the ones having standard equal to zero ,but i got this issue when having a column which has a timestamp type what to do
cannot resolve 'stddev_samp(time.1)' due to data type mismatch: argument 1 requires double type, however, 'time.1' is of timestamp type.;;
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
//val df = spark.range(1, 1000).withColumn("X2", lit(0)).toDF("X1","X2")
val df = spark.read.option("inferSchema", "true").option("header", "true").csv("C:/Users/mhattabi/Desktop/dataTestCsvFile/dataTest2.txt")
df.show(5)
//df.columns.map(p=>s"`${p}`")
//val aggs = df.columns.map(c => stddev(c).as(c))
val aggs = df.columns.map(p=>stddev(s"`${p}`").as(p))
val stddevs = df.select(aggs: _*)
val columnsToKeep: Seq[Column] = stddevs.first // Take first row
.toSeq // convert to Seq[Any]
.zip(df.columns) // zip with column names
.collect {
// keep only names where stddev != 0
case (s: Double, c) if s != 0.0 => col(c)
}
df.select(columnsToKeep: _*).show(5,false)

Using stddev
stddev is only defined on numeric columns. If you want to compute the standard deviation of a date column you will need to convert it to a time stamp first:
scala> var myDF = (0 to 10).map(x => (x, scala.util.Random.nextDouble)).toDF("id", "rand_double")
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double]
scala> myDF = myDF.withColumn("Date", current_date())
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double ... 1 more field]
scala> myDF.printSchema
root
|-- id: integer (nullable = false)
|-- rand_double: double (nullable = false)
|-- Date: date (nullable = false)
scala> myDF.show
+---+-------------------+----------+
| id| rand_double| Date|
+---+-------------------+----------+
| 0| 0.3786008989478248|2017-03-21|
| 1| 0.5968932024004612|2017-03-21|
| 2|0.05912760417456575|2017-03-21|
| 3|0.29974600653895667|2017-03-21|
| 4| 0.8448407414817856|2017-03-21|
| 5| 0.2049495659443249|2017-03-21|
| 6| 0.4184846380144779|2017-03-21|
| 7|0.21400484330739022|2017-03-21|
| 8| 0.9558142791013501|2017-03-21|
| 9|0.32530639391058036|2017-03-21|
| 10| 0.5100585655062743|2017-03-21|
+---+-------------------+----------+
scala> myDF = myDF.withColumn("Date", unix_timestamp($"Date"))
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double ... 1 more field]
scala> myDF.printSchema
root
|-- id: integer (nullable = false)
|-- rand_double: double (nullable = false)
|-- Date: long (nullable = true)
scala> myDF.show
+---+-------------------+----------+
| id| rand_double| Date|
+---+-------------------+----------+
| 0| 0.3786008989478248|1490072400|
| 1| 0.5968932024004612|1490072400|
| 2|0.05912760417456575|1490072400|
| 3|0.29974600653895667|1490072400|
| 4| 0.8448407414817856|1490072400|
| 5| 0.2049495659443249|1490072400|
| 6| 0.4184846380144779|1490072400|
| 7|0.21400484330739022|1490072400|
| 8| 0.9558142791013501|1490072400|
| 9|0.32530639391058036|1490072400|
| 10| 0.5100585655062743|1490072400|
+---+-------------------+----------+
At this point all of the columns are numeric so your code will run fine:
scala> :pa
// Entering paste mode (ctrl-D to finish)
val aggs = myDF.columns.map(p=>stddev(s"`${p}`").as(p))
val stddevs = myDF.select(aggs: _*)
val columnsToKeep: Seq[Column] = stddevs.first // Take first row
.toSeq // convert to Seq[Any]
.zip(myDF.columns) // zip with column names
.collect {
// keep only names where stddev != 0
case (s: Double, c) if s != 0.0 => col(c)
}
myDF.select(columnsToKeep: _*).show(false)
// Exiting paste mode, now interpreting.
+---+-------------------+
|id |rand_double |
+---+-------------------+
|0 |0.3786008989478248 |
|1 |0.5968932024004612 |
|2 |0.05912760417456575|
|3 |0.29974600653895667|
|4 |0.8448407414817856 |
|5 |0.2049495659443249 |
|6 |0.4184846380144779 |
|7 |0.21400484330739022|
|8 |0.9558142791013501 |
|9 |0.32530639391058036|
|10 |0.5100585655062743 |
+---+-------------------+
aggs: Array[org.apache.spark.sql.Column] = Array(stddev_samp(id) AS `id`, stddev_samp(rand_double) AS `rand_double`, stddev_samp(Date) AS `Date`)
stddevs: org.apache.spark.sql.DataFrame = [id: double, rand_double: double ... 1 more field]
columnsToKeep: Seq[org.apache.spark.sql.Column] = ArrayBuffer(id, rand_double)
Using countDistinct
All that being said, it would be more general to use countDistinct:
scala> val distCounts = myDF.select(myDF.columns.map(c => countDistinct(c) as c): _*).first.toSeq.zip(myDF.columns)
distCounts: Seq[(Any, String)] = ArrayBuffer((11,id), (11,rand_double), (1,Date))]
scala> distCounts.foldLeft(myDF)((accDF, dc_col) => if (dc_col._1 == 1) accDF.drop(dc_col._2) else accDF).show
+---+-------------------+
| id| rand_double|
+---+-------------------+
| 0| 0.3786008989478248|
| 1| 0.5968932024004612|
| 2|0.05912760417456575|
| 3|0.29974600653895667|
| 4| 0.8448407414817856|
| 5| 0.2049495659443249|
| 6| 0.4184846380144779|
| 7|0.21400484330739022|
| 8| 0.9558142791013501|
| 9|0.32530639391058036|
| 10| 0.5100585655062743|
+---+-------------------+

Check equality for two Spark DataFrames in Scala

I'm new to Scala and am having problems writing unit tests.
I'm trying to compare and check equality for two Spark DataFrames in Scala for unit testing, and realized that there is no easy way to check equality for two Spark DataFrames.
The C++ equivalent code would be (assuming that the DataFrames are represented as double arrays in C++):
int expected[10][2];
int result[10][2];
for (int row = 0; row < 10; row++) {
for (int col = 0; col < 2; col++) {
if (expected[row][col] != result[row][col]) return false;
}
}
The actual test would involve testing for equality based on the data types of the columns of the DataFrames (testing with precision tolerance for floats, etc).
It seems like there's not an easy way to iteratively loop over all the elements in the DataFrames using Scala and the other solutions for checking equality of two DataFrames such as df1.except(df2) do not work in my case as I need to be able to provide support for testing equality with tolerance for floats and doubles.
Of course, I could try to round all the elements beforehand and compare the results afterwards, but I would like to see if there are any other solutions that would allow me to iterate through the DataFrames to check for equality.

import org.scalatest.{BeforeAndAfterAll, FeatureSpec, Matchers}
outDf.collect() should contain theSameElementsAs (dfComparable.collect())
# or ( obs order matters ! )
// outDf.except(dfComparable).toDF().count should be(0)
outDf.except(dfComparable).count should be(0)

If you want to check if both the data frames are equal or not for testing purpose, you can make use of subtract() method of data frame (supported in version 1.3 and above)
You can check if diff of both data frames is empty or 0.
e.g. df1.subtract(df2).count() == 0

Assuming that you have a fixed # of col and rows, one solution could be join both Df's by row index (in case you do not have id's for the records), and then iterate direct in the final DF [with all the columns of both DF's].
Something like this:
Schemas
DF1
root
|-- col1: double (nullable = true)
|-- col2: double (nullable = true)
|-- col3: double (nullable = true)
DF2
root
|-- col1: double (nullable = true)
|-- col2: double (nullable = true)
|-- col3: double (nullable = true)
df1
+----------+-----------+------+
| col1| col2| col3|
+----------+-----------+------+
|1.20000001| 1.21| 1.2|
| 2.1111| 2.3| 22.2|
| 3.2|2.330000001| 2.333|
| 2.2444| 2.344|2.3331|
+----------+-----------+------+
df2
+------+-----+------+
| col1| col2| col3|
+------+-----+------+
| 1.2| 1.21| 1.2|
|2.1111| 2.3| 22.2|
| 3.2| 2.33| 2.333|
|2.2444|2.344|2.3331|
+------+-----+------+
Added row index
df1
+----------+-----------+------+---+
| col1| col2| col3|row|
+----------+-----------+------+---+
|1.20000001| 1.21| 1.2| 0|
| 2.1111| 2.3| 22.2| 1|
| 3.2|2.330000001| 2.333| 2|
| 2.2444| 2.344|2.3331| 3|
+----------+-----------+------+---+
df2
+------+-----+------+---+
| col1| col2| col3|row|
+------+-----+------+---+
| 1.2| 1.21| 1.2| 0|
|2.1111| 2.3| 22.2| 1|
| 3.2| 2.33| 2.333| 2|
|2.2444|2.344|2.3331| 3|
+------+-----+------+---+
Combined DF
+---+----------+-----------+------+------+-----+------+
|row| col1| col2| col3| col1| col2| col3|
+---+----------+-----------+------+------+-----+------+
| 0|1.20000001| 1.21| 1.2| 1.2| 1.21| 1.2|
| 1| 2.1111| 2.3| 22.2|2.1111| 2.3| 22.2|
| 2| 3.2|2.330000001| 2.333| 3.2| 2.33| 2.333|
| 3| 2.2444| 2.344|2.3331|2.2444|2.344|2.3331|
+---+----------+-----------+------+------+-----+------+
This is how you can do that:
println("Schemas")
println("DF1")
df1.printSchema()
println("DF2")
df2.printSchema()
println("df1")
df1.show
println("df2")
df2.show
val finaldf1 = df1.withColumn("row", monotonically_increasing_id())
val finaldf2 = df2.withColumn("row", monotonically_increasing_id())
println("Added row index")
println("df1")
finaldf1.show()
println("df2")
finaldf2.show()
val joinedDfs = finaldf1.join(finaldf2, "row")
println("Combined DF")
joinedDfs.show()
val tolerance = 0.001
def isInValidRange(a: Double, b: Double): Boolean ={
Math.abs(a-b)<=tolerance
}
joinedDfs.take(10).foreach(row => {
assert( isInValidRange(row.getDouble(1), row.getDouble(4)) , "Col1 validation. Row %s".format(row.getLong(0)+1))
assert( isInValidRange(row.getDouble(2), row.getDouble(5)) , "Col2 validation. Row %s".format(row.getLong(0)+1))
assert( isInValidRange(row.getDouble(3), row.getDouble(6)) , "Col3 validation. Row %s".format(row.getLong(0)+1))
})
Note: Assert's are not serialized, a workaround is use take() to avoid errors.

Aggregations in JDBCRDD or RDD

I'm brand new in Sacla and Spark, and I'm trying to create a SQL query over SqlServer with Spark using jdbcRDD, and do some transformations on it with mappings and aggregations.
This is what I have, a Table with n String columns and m Number columns.
like
"A", "A1",1,2
"A", "A1",4,3
"A", "A2",3,4
"B", "B1",6,7
...
...
what i'm looking for is create a hierarchival structure grouping the strings and aggregating the numeric columns like
A
|->A1
|->(5,5)
|->A2
|->(3,4)
B
|->B1
|->(6,7)
I was able to create the hierarchie but I'm not able to perform the agregation on the list of numeric values.

If you're loading your data over JDBC I would simply use DataFrames:
import sqlContext.implicits._
import org.apache.spark.sql.functions.sum
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
val options: Map[(String, String)] = ???
val df: DataFrame = sqlContext.read
.format("jdbc")
.options(options)
.load()
.toDF("k1", "k2", "v1", "v2")
df.printSchema
// root
// |-- k1: string (nullable = true)
// |-- k2: string (nullable = true)
// |-- v1: integer (nullable = true)
// |-- v2: integer (nullable = true)
df.show
// +---+---+---+---+
// | k1| k2| v1| v2|
// +---+---+---+---+
// | A| A1| 1| 2|
// | A| A1| 4| 3|
// | A| A2| 3| 4|
// | B| B1| 6| 7|
// +---+---+---+---+
With input like above all you need is a basic aggregation
df
.groupBy($"k1", $"k2")
.agg(sum($"v1").alias("v1"), sum($"v2").alias("v2")).show
// +---+---+---+---+
// | k1| k2| v1| v2|
// +---+---+---+---+
// | A| A1| 5| 5|
// | A| A2| 3| 4|
// | B| B1| 6| 7|
// +---+---+---+---+
If you have RDD like this:
val rdd RDD[(String, String, Int, Int)] = ???
rdd.first
// (String, String, Int, Int) = (A,A1,1,2)
There is no reason to built complex hierarchy. Simple PairRDD should be enough:
val aggregated: RDD[((String, String), breeze.linalg.Vector[Int])] = rdd
.map{case (k1, k2, v1, v2) => ((k1, k2), breeze.linalg.Vector(v1, v2))}
.reduceByKey(_ + _)
aggregated.first
// ((String, String), breeze.linalg.Vector[Int]) = ((A,A2),DenseVector(3, 4))
Keeping hierarchical structure is ineffective but you can group above RDD like this:
aggregated.map{case ((k1, k2), v) => (k1, (k2, v))}.groupByKey

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Creating data frames in Spark using Sets in Scala - scala

Related

Scala Spark: Flatten Array of Key/Value structs

How can I split a column containing array of some struct into separate columns?

delete constant columns spark having issue with timestamp column

Check equality for two Spark DataFrames in Scala

Aggregations in JDBCRDD or RDD

Categories

Resources