spark - mimic pyspark asDict() for scala without using case class - scala

PySpark allows you to create a dictionary when a single row is returned from the dataframe, using the approach below.
t=spark.sql("SET").withColumn("rw",expr("row_number() over(order by key)")).collect()[0].asDict()
print(t)
print(t["key"])
print(t["value"])
print(t["rw"])
print("Printing using for comprehension")
[print(t[i]) for i in t ]
Results:
{'key': 'spark.app.id', 'value': 'local-1594577194330', 'rw': 1}
spark.app.id
local-1594577194330
1
Printing using for comprehension
spark.app.id
local-1594577194330
1
I'm trying the same in Scala Spark. It is possible using the case class approach.
case class download(key:String, value:String,rw:Long)
val t=spark.sql("SET").withColumn("rw",expr("row_number() over(order by key)")).as[download].first
println(t)
println(t.key)
println(t.value)
println(t.rw)
Results:
download(spark.app.id,local-1594580739413,1)
spark.app.id
local-1594580739413
1
In the actual problem, I have 200+ columns and don't want to use the case class approach. I'm trying something like below to avoid the case class option.
val df =spark.sql("SET").withColumn("rw",expr("row_number() over(order by key)"))
(df.columns).zip(df.take(1)(0))
but I'm getting an error.
<console>:28: error: type mismatch;
found : (String, String, Long)
required: Iterator[?]
(df.columns.toIterator).zip(df.take(1)(0))
Is there a way to solve this?

In Scala, Row has a method getValuesMap that converts a row into a Map of column name to column value (Map[String, T]).
Try using it as below-
val df =spark.sql("SET").withColumn("rw",expr("row_number() over(order by key)"))
df.show(false)
df.printSchema()
/**
* +----------------------------+-------------------+---+
* |key |value |rw |
* +----------------------------+-------------------+---+
* |spark.app.id |local-1594644271573|1 |
* |spark.app.name |TestSuite |2 |
* |spark.driver.host |192.168.1.3 |3 |
* |spark.driver.port |58420 |4 |
* |spark.executor.id |driver |5 |
* |spark.master |local[2] |6 |
* |spark.sql.shuffle.partitions|2 |7 |
* +----------------------------+-------------------+---+
*
* root
* |-- key: string (nullable = false)
* |-- value: string (nullable = false)
* |-- rw: integer (nullable = true)
*/
val map = df.head().getValuesMap(df.columns)
println(map)
println(map("key"))
println(map("value"))
println(map("rw"))
println("Printing using for comprehension")
map.foreach(println)
/**
* Map(key -> spark.app.id, value -> local-1594644271573, rw -> 1)
* spark.app.id
* local-1594644271573
* 1
* Printing using for comprehension
* (key,spark.app.id)
* (value,local-1594644271573)
* (rw,1)
*/
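If you then need typed access to individual values, a minimal sketch reusing the df above (the explicit type parameter and casts are just one way to do it):
val typedMap: Map[String, Any] = df.head().getValuesMap[Any](df.columns)
val rw = typedMap("rw").asInstanceOf[Int]          // rw is an integer column here
val keyCol = typedMap("key").asInstanceOf[String]  // key and value are strings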

The problem is that zip is a method on a collection that can only take another collection implementing IterableOnce, and df.take(1)(0) is a Spark SQL Row, which doesn't fall into that category.
Try converting the row to a Seq using its toSeq method.
df.columns.zip(df.take(1)(0).toSeq)
result:
Array((key,spark.app.id), (value,local-1594577194330), (rw,1))
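If you want something closer to PySpark's asDict(), the zipped pairs can be turned into a Map; a minimal sketch reusing the same df:
val t = df.columns.zip(df.take(1)(0).toSeq).toMap
println(t("key"))
println(t("value"))
println(t("rw"))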

Related

Get average length of values of a column (from a hive table) in spark along with data types

Task: Get data types of a table (in hive) and the average length of values of each column.
I'm trying to do the above task in spark using scala.
Firstly I did
val table = spark.sql("desc table")
The output has three columns, col_name, datatype, comment.
And then, I tried to get only the column values as a comma-separated string.
val col_string = table.select("col_name").rdd.map(i => "avg(length(trim("+i(0).toString+")))").collect.mkString(", ")
Now I can use this string in another query to get the average length of all columns, as given below, but the output dataframe has as many columns as the table, and I don't know how to join it with the table dataframe.
val tbl_length = spark.sql("select " + col_string + " from schema.table")
I've looked at transposing the second dataframe, but that doesn't look efficient and is hard for me to grasp as a beginner in Spark and Scala.
Is my method above a good/efficient one? If there is a better way, please suggest it.
Even if there is a better way, can you also please explain how I can join two such datasets of row=>column?
Input table:
col1| col2| col3
Ac| 123| 0
Defg| 23456| 0
Expected output
column_name| data_type| avg_length
col1| String| 3
col2| Int| 4
col3| Int| 1
Try this-
val table = spark.catalog.getTable("df")
val df = spark.sql(s"select * from ${table.name}")
df.show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |abc1|
* |2 |abc2|
* |3 |abc3|
* +---+----+
*/
val aggs = df.columns.map(f => avg(length(trim(col(f)))).as(f))
val values = df.agg(aggs.head, aggs.tail: _*).head.getValuesMap[Double](df.columns).values.toSeq
df.schema.map(sf => (sf.name, sf.dataType)).zip(values).map{ case ((name, dt), value) => (name, dt.simpleString, value)}
.toDF("column_name", "data_type", "avg_length")
.show(false)
/**
* +-----------+---------+----------+
* |column_name|data_type|avg_length|
* +-----------+---------+----------+
* |id |bigint |1.0 |
* |name |string |4.0 |
* +-----------+---------+----------+
*/
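If you specifically want to reuse the desc table output from the question instead of reading types from df.schema, a sketch under that assumption (the table name schema.table is taken from the question; columns are assumed non-null):
import org.apache.spark.sql.functions.{avg, col, length, trim}
import spark.implicits._

val descDf = spark.sql("desc schema.table")        // col_name, data_type, comment
val data = spark.table("schema.table")

val lengthAggs = data.columns.map(c => avg(length(trim(col(c)))).as(c))
val avgLengths = data.agg(lengthAggs.head, lengthAggs.tail: _*)
  .head
  .getValuesMap[Double](data.columns)
  .toSeq
  .toDF("col_name", "avg_length")

// inner join on col_name also drops any extra rows in the desc output
descDf.join(avgLengths, Seq("col_name"))
  .select("col_name", "data_type", "avg_length")
  .show(false)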

Better Alternatives to EXCEPT Spark Scala

I have been told that EXCEPT is a very costly operation and one should always try to avoid using EXCEPT.
My Use Case -
val myFilter = "rollNo='11' AND class='10'"
val rawDataDf = spark.table(<table_name>)
val myFilteredDataframe = rawDataDf.where(myFilter)
val allOthersDataframe = rawDataDf.except(myFilteredDataframe)
But I am confused: in such a use case, what are my alternatives?
Use a left anti join as below-
val df = spark.range(2).withColumn("name", lit("foo"))
df.show(false)
df.printSchema()
/**
* +---+----+
* |id |name|
* +---+----+
* |0 |foo |
* |1 |foo |
* +---+----+
*
* root
* |-- id: long (nullable = false)
* |-- name: string (nullable = false)
*/
val df2 = df.filter("id=0")
df.join(df2, df.columns.toSeq, "leftanti")
.show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |foo |
* +---+----+
*/
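For this particular use case, where the excluded rows are defined by a filter, you can also avoid the join entirely by negating the predicate. A minimal sketch reusing rawDataDf and myFilter from the question; note that rows where the predicate evaluates to NULL are dropped by both filters, which differs from except:
import org.apache.spark.sql.functions.{expr, not}

val myFilteredDataframe = rawDataDf.where(expr(myFilter))
val allOthersDataframe = rawDataDf.where(not(expr(myFilter)))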

How to find the size of a dataframe in pyspark

How can I replicate this code to get the dataframe size in pyspark?
scala> val df = spark.range(10)
scala> print(spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats)
Statistics(sizeInBytes=80.0 B, hints=none)
What I would like to do is get the sizeInBytes value into a variable.
In Spark 2.4 you can do
df = spark.range(10)
df.createOrReplaceTempView('myView')
spark.sql('explain cost select * from myView').show(truncate=False)
|== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(8)), Statistics(sizeInBytes=80.0 B, hints=none)
In Spark 3.0.0-preview2 you can use explain with the cost mode:
df = spark.range(10)
df.explain(mode='cost')
== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(8)), Statistics(sizeInBytes=80.0 B)
See if this helps-
Read the JSON file source and compute stats like size in bytes, number of rows, etc. These stats also help Spark make intelligent decisions while optimizing the execution plan. This code should be the same in PySpark too.
/**
* file content
* spark-test-data.json
* --------------------
* {"id":1,"name":"abc1"}
* {"id":2,"name":"abc2"}
* {"id":3,"name":"abc3"}
*/
val fileName = "spark-test-data.json"
val path = getClass.getResource("/" + fileName).getPath
spark.catalog.createTable("df", path, "json")
.show(false)
/**
* +---+----+
* |id |name|
* +---+----+
* |1 |abc1|
* |2 |abc2|
* |3 |abc3|
* +---+----+
*/
// Collect only statistics that do not require scanning the whole table (that is, size in bytes).
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)
/**
* +----------+---------+-------+
* |col_name |data_type|comment|
* +----------+---------+-------+
* |Statistics|68 bytes | |
* +----------+---------+-------+
*/
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)
/**
* +----------+----------------+-------+
* |col_name |data_type |comment|
* +----------+----------------+-------+
* |Statistics|68 bytes, 3 rows| |
* +----------+----------------+-------+
*/
more info - databricks sql doc
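To get the value into a variable, which is what the question asks for, a small sketch assuming the df table registered above; it reads the statistics string out of DESCRIBE EXTENDED and leaves the parsing to you:
import org.apache.spark.sql.functions.col

val statistics: String = spark.sql("DESCRIBE EXTENDED df")
  .filter(col("col_name") === "Statistics")
  .select("data_type")
  .head
  .getString(0)   // e.g. "68 bytes, 3 rows"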
Typically, you can access the Scala methods through Py4J. I just tried this in the PySpark shell:
>>> spark._jsparkSession.sessionState().executePlan(df._jdf.queryExecution().logical()).optimizedPlan().stats().sizeInBytes()
716

Converting a String which has a list of objects to a dataframe

I have a string such as
str=[{"A":120.0,"B":"0005236"},{"A":10.0,"B":"0005200"},
{"A":12.0,"B":"00042276"},{"A":20.0,"B":"00052000"}]
I am trying to convert it to a dataframe...
+-------+--------------+
|packQty|gtin |
+-------+--------------+
|120.0 |0005236 |
|10.0 |0005200 |
|12.0 |00042276 |
|20.0 |00052000 |
+-------+--------------+
I have created a schema as
val schema=new StructType()
.add("packQty",FloatType)
.add("gtin", StringType)
val df =Seq(str).toDF("testQTY")
val df2=df.withColumn("jsonData",from_json($"testQTY",schema)).select("jsonData.*")
This is returning me a dataframe with only one record...
+-------+--------------+
|packQty|gtin |
+-------+--------------+
|120.0 |0005236|
+-------+--------------+
How do I modify the schema so that I can get all the records?
If it was an array then I could have used the explode() function to get the values, but I am getting this error.
cannot resolve 'explode(`gtins`)' due to data type mismatch: input to
function explode should be array or map type, not string;;
This is how the column is populated:
+----------------------------------------------------------------------+
|gtins                                                                  |
+----------------------------------------------------------------------+
|[{"packQty":120.0,"gtin":"000520"},{"packQty":10.0,"gtin":"0005200"}]  |
+----------------------------------------------------------------------+
Keep your schema inside an ArrayType, i.e. ArrayType(new StructType().add("packQty",FloatType).add("gtin", StringType)). As written, though, this will give you null values because the schema column names do not match the JSON data.
Change the schema from ArrayType(new StructType().add("packQty",FloatType).add("gtin", StringType)) to ArrayType(new StructType().add("A",FloatType).add("B", StringType)), and after parsing the data rename the columns as required.
Please check the code below.
If the column names match in both the schema & the JSON data:
scala> val json = Seq("""[{"A":120.0,"B":"0005236"},{"A":10.0,"B":"0005200"},{"A":12.0,"B":"00042276"},{"A":20.0,"B":"00052000"}]""").toDF("testQTY")
json: org.apache.spark.sql.DataFrame = [testQTY: string]
scala> val schema = ArrayType(StructType(StructField("A",DoubleType,true):: StructField("B",StringType,true) :: Nil))
schema: org.apache.spark.sql.types.ArrayType = ArrayType(StructType(StructField(A,DoubleType,true), StructField(B,StringType,true)),true)
scala> json.withColumn("jsonData",from_json($"testQTY",schema)).select(explode($"jsonData").as("jsonData")).select($"jsonData.A".as("packQty"),$"jsonData.B".as("gtin")).show(false)
+-------+--------+
|packQty|gtin |
+-------+--------+
|120.0 |0005236 |
|10.0 |0005200 |
|12.0 |00042276|
|20.0 |00052000|
+-------+--------+
If the column names do not match between the schema & the JSON data:
scala> val json = Seq("""[{"A":120.0,"B":"0005236"},{"A":10.0,"B":"0005200"},{"A":12.0,"B":"00042276"},{"A":20.0,"B":"00052000"}]""").toDF("testQTY")
json: org.apache.spark.sql.DataFrame = [testQTY: string]
scala> val schema = ArrayType(StructType(StructField("packQty",DoubleType,true):: StructField("gtin",StringType,true) :: Nil)) // Column names are not matched with json & schema.
schema: org.apache.spark.sql.types.ArrayType = ArrayType(StructType(StructField(packQty,DoubleType,true), StructField(gtin,StringType,true)),true)
scala> json.withColumn("jsonData",from_json($"testQTY",schema)).select(explode($"jsonData").as("jsonData")).select($"jsonData.*").show(false)
+-------+----+
|packQty|gtin|
+-------+----+
|null |null|
|null |null|
|null |null|
|null |null|
+-------+----+
An alternative way of parsing the JSON string into a DataFrame, using a Dataset:
scala> val json = Seq("""[{"A":120.0,"B":"0005236"},{"A":10.0,"B":"0005200"},{"A":12.0,"B":"00042276"},{"A":20.0,"B":"00052000"}]""").toDS // Creating DataSet from json string.
json: org.apache.spark.sql.Dataset[String] = [value: string]
scala> val schema = StructType(StructField("A",DoubleType,true):: StructField("B",StringType,true) :: Nil) // Creating schema.
schema: org.apache.spark.sql.types.StructType = StructType(StructField(A,DoubleType,true), StructField(B,StringType,true))
scala> spark.read.schema(schema).json(json).select($"A".as("packQty"),$"B".as("gtin")).show(false)
+-------+--------+
|packQty|gtin |
+-------+--------+
|120.0 |0005236 |
|10.0 |0005200 |
|12.0 |00042276|
|20.0 |00052000|
+-------+--------+
One more option -
val data = """[{"A":120.0,"B":"0005236"},{"A":10.0,"B":"0005200"},{"A":12.0,"B":"00042276"},{"A":20.0,"B":"00052000"}]"""
val df2 = Seq(data).toDF("gtins")
df2.show(false)
df2.printSchema()
/**
* +--------------------------------------------------------------------------------------------------------+
* |gtins |
* +--------------------------------------------------------------------------------------------------------+
* |[{"A":120.0,"B":"0005236"},{"A":10.0,"B":"0005200"},{"A":12.0,"B":"00042276"},{"A":20.0,"B":"00052000"}]|
* +--------------------------------------------------------------------------------------------------------+
*
* root
* |-- gtins: string (nullable = true)
*/
df2.selectExpr("inline_outer(from_json(gtins, 'array<struct<A:double, B:string>>')) as (packQty, gtin)")
.show(false)
/**
* +-------+--------+
* |packQty|gtin |
* +-------+--------+
* |120.0 |0005236 |
* |10.0 |0005200 |
* |12.0 |00042276|
* |20.0 |00052000|
* +-------+--------+
*/

How to aggregate map columns after groupBy?

I need to union two dataframes and combine the columns by keys. The two dataframes have the same schema, for example:
root
|-- id: String (nullable = true)
|-- cMap: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
I want to group by "id" and aggregate the "cMap" together to deduplicate.
I tried the code:
val df = df_a.unionAll(df_b).groupBy("id").agg(collect_list("cMap") as "cMap").
rdd.map(x => {
var map = Map[String,String]()
x.getAs[Seq[Map[String,String]]]("cMap").foreach( y =>
y.foreach( tuple =>
{
val key = tuple._1
val value = tuple._2
if(!map.contains(key))//deduplicate
map += (key -> value)
}))
Row(x.getAs[String]("id"),map)
})
But it seems collect_list cannot be used with map columns:
org.apache.spark.sql.AnalysisException: No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Only primitive type arguments are accepted but map<string,string> was passed as parameter 1..;
Is there another solution for this problem?
You have to use the explode function on the map columns first to destructure the maps into key and value columns, union the resulting datasets, apply distinct to de-duplicate, and only then groupBy with some custom Scala coding to aggregate the maps.
Stop talking and let's do some coding then...
Given the datasets:
scala> a.show(false)
+---+-----------------------+
|id |cMap |
+---+-----------------------+
|one|Map(1 -> one, 2 -> two)|
+---+-----------------------+
scala> a.printSchema
root
|-- id: string (nullable = true)
|-- cMap: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
scala> b.show(false)
+---+-------------+
|id |cMap |
+---+-------------+
|one|Map(1 -> one)|
+---+-------------+
scala> b.printSchema
root
|-- id: string (nullable = true)
|-- cMap: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
You should first use the explode function on the map columns.
explode(e: Column): Column Creates a new row for each element in the given array or map column.
val a_keyValues = a.select('*, explode($"cMap"))
scala> a_keyValues.show(false)
+---+-----------------------+---+-----+
|id |cMap |key|value|
+---+-----------------------+---+-----+
|one|Map(1 -> one, 2 -> two)|1 |one |
|one|Map(1 -> one, 2 -> two)|2 |two |
+---+-----------------------+---+-----+
val b_keyValues = b.select('*, explode($"cMap"))
With the following you have distinct key-value pairs, which is exactly the deduplication you asked for.
val distinctKeyValues = a_keyValues.
union(b_keyValues).
select("id", "key", "value").
distinct // <-- deduplicate
scala> distinctKeyValues.show(false)
+---+---+-----+
|id |key|value|
+---+---+-----+
|one|1 |one |
|one|2 |two |
+---+---+-----+
Time to groupBy and create the final map column.
val result = distinctKeyValues.
withColumn("map", map($"key", $"value")).
groupBy("id").
agg(collect_list("map")).
as[(String, Seq[Map[String, String]])]. // <-- leave Rows for typed pairs
map { case (id, list) => (id, list.reduce(_ ++ _)) }. // <-- collect all entries under one map
toDF("id", "cMap") // <-- give the columns their names
scala> result.show(truncate = false)
+---+-----------------------+
|id |cMap |
+---+-----------------------+
|one|Map(1 -> one, 2 -> two)|
+---+-----------------------+
Please note that as of Spark 2.0.0 unionAll has been deprecated and union is the proper union operator:
(Since version 2.0.0) use union()
Since Spark 3.0, you can:
transform your map to an array of map entries with map_entries
collect those arrays by your id using collect_set
flatten the collected array of arrays using flatten
then rebuild the map from the flattened array using map_from_entries
See the following code snippet, where input is your input dataframe:
import org.apache.spark.sql.functions.{col, collect_set, flatten, map_entries, map_from_entries}
input
.withColumn("cMap", map_entries(col("cMap")))
.groupBy("id")
.agg(map_from_entries(flatten(collect_set("cMap"))).as("cMap"))
Example
Given the following dataframe input:
+---+--------------------+
|id |cMap |
+---+--------------------+
|1 |[k1 -> v1] |
|1 |[k2 -> v2, k3 -> v3]|
|2 |[k4 -> v4] |
|2 |[] |
|3 |[k6 -> v6, k7 -> v7]|
+---+--------------------+
The code snippet above returns the following dataframe:
+---+------------------------------+
|id |cMap |
+---+------------------------------+
|1 |[k1 -> v1, k2 -> v2, k3 -> v3]|
|3 |[k6 -> v6, k7 -> v7] |
|2 |[k4 -> v4] |
+---+------------------------------+
I agree with #Shankar. Your code seems to be flawless.
The only mistake I assume you are making is that you are importing the wrong library.
You should be importing
import org.apache.spark.sql.functions.collect_list
But I guess you are importing
import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList
I hope I am guessing it right.