Remove Nulls in specific Rows in Dataframe and combine rows - scala

I need to do the activity below in Spark DataFrames using Scala.
I have tried some basic filters, isNotNull conditions and others, but no luck.
Input
+----------+----------+----------+
| Amber| Green| Red|
+----------+----------+----------+
| null| null|[AE,AA,CV]|
| null|[AH,EE,CC]| null|
|[DD,DE,QQ]| null| null|
+----------+----------+----------+
Output
+----------+----------+----------+
| Amber| Green| Red|
+----------+----------+----------+
|[DD,DE,QQ]|[AH,EE,CC]|[AE,AA,CV]|
+----------+----------+----------+

If the input dataframe is limited to only
+----------+----------+----------+
| Amber| Green| Red|
+----------+----------+----------+
| null| null|[AE,AA,CV]|
| null|[AH,EE,CC]| null|
|[DD,DE,QQ]| null| null|
+----------+----------+----------+
Then doing the following should get you the desired final dataframe
import org.apache.spark.sql.functions._
df.select(collect_list("Amber")(0).as("Amber"), collect_list("Green")(0).as("Green"), collect_list("Red")(0).as("Red")).show(false)
You should be getting
+------------+------------+------------+
|Amber |Green |Red |
+------------+------------+------------+
|[DD, DE, QQ]|[AH, EE, CC]|[AE, AA, CV]|
+------------+------------+------------+
The built-in collect_list function ignores null values.
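If you prefer not to collect whole arrays, an equivalent sketch (not part of the original answer) uses first with ignoreNulls, which also skips nulls and keeps one value per column:
import org.apache.spark.sql.functions.first

// first(..., ignoreNulls = true) picks the first non-null value per column
df.select(
  first("Amber", ignoreNulls = true).as("Amber"),
  first("Green", ignoreNulls = true).as("Green"),
  first("Red", ignoreNulls = true).as("Red")
).show(false)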

Related

how to reduce dataFrame wisely

I want to transform the following DataFrame structure, where for each id and kpi pair we have a couple of records, one carrying value_left and the second carrying value_right. I want to group the two records into a single record (as you see in the expected result).
I want to reduce the following DataFrame:
+---+---+----------+-----------+
| id|kpi|value_left|value_right|
+---+---+----------+-----------+
| 1|sum| 10| null|
| 1|sum| null| 20|
| 2|avg| 15| null|
| 2|avg| null| 15|
+---+---+----------+-----------+
Expected output dataFrame
+---+---+----------+-----------+
| id|kpi|value_left|value_right|
+---+---+----------+-----------+
| 1|sum| 10| 20|
| 2|avg| 15| 15|
+---+---+----------+-----------+
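No answer is quoted for this related question, but the same ignore-nulls idea applies; a minimal Scala sketch, assuming each (id, kpi) pair has exactly one non-null value on each side:
import org.apache.spark.sql.functions.first

// Collapse the two rows per (id, kpi) by taking the first non-null value on each side
df.groupBy("id", "kpi")
  .agg(
    first("value_left", ignoreNulls = true).as("value_left"),
    first("value_right", ignoreNulls = true).as("value_right")
  )
  .show()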

Reading a tsv file in pyspark

I want to read a TSV file, but it has no header. I am creating my own schema and then trying to read the TSV file, but after applying the schema it shows all column values as null. Below are my code and result.
from pyspark.sql.types import StructType,StructField,StringType,IntegerType
schema = StructType([StructField("id_code", IntegerType()),StructField("description", StringType())])
df = spark.read.csv("C:/Users/HP/Downloads/connection_type.tsv", schema=schema)
df.show();
+-------+-----------+
|id_code|description|
+-------+-----------+
| null| null|
| null| null|
| null| null|
| null| null|
| null| null|
+-------+-----------+
If I read it simply without applying any schema:
df=spark.read.csv("C:/Users/HP/Downloads/connection_type.tsv",sep="/t")
df.show()
+-----------------+
| _c0|
+-----------------+
| 0 Not Specified |
| 1 Modem |
| 2 LAN/Wifi |
| 3 Unknown |
| 4 Mobile Carrier|
+-----------------+
It is not coming out in a proper way. Can anyone please help me with this? My sample file is a .tsv file and it has the records below.
0 Specified
1 Modemwifi
2 LAN/Wifi
3 Unknown
4 Mobile user
Add the sep option; if the file is really tab-separated, this will work.
df = spark.read.option("inferSchema", "true").option("sep", "\t").csv("test.tsv")
df.show()
+---+-----------+
|_c0| _c1|
+---+-----------+
| 0| Specified|
| 1| Modemwifi|
| 2| LAN/Wifi|
| 3| Unknown|
| 4|Mobile user|
+---+-----------+

How do you add/include header row and totals row from a pivot table in pyspark?

I'm working on exporting data via PySpark to Excel. I have a data set
df_raw = spark.createDataFrame([("2015-10", 'U.S.', 500), \
("2018-10", 'Germany', 580), \
("2019-08", 'Japan', 230), \
("2015-12", 'U.S.', 500), \
("2015-11", 'Germany', 580), \
("2015-12", 'Japan', 502), \
("2018-10", 'U.S.', 520), \
("2019-08", 'Canada', 200)]).toDF("ym", "country", "points")
+-------+-------+------+
| ym|country|points|
+-------+-------+------+
|2015-10| U.S.| 500|
|2018-10|Germany| 580|
|2019-08| Japan| 230|
|2015-12| U.S.| 500|
|2015-11|Germany| 580|
|2015-12| Japan| 502|
|2018-10| U.S.| 520|
|2019-08| Canada| 200|
+-------+-------+------+
that I convert to a pivot table
df_pivot = df_raw.groupBy('country').pivot("ym").sum('points')
+-------+-------+-------+-------+-------+-------+
|country|2015-10|2015-11|2015-12|2018-10|2019-08|
+-------+-------+-------+-------+-------+-------+
|Germany| null| 580| null| 580| null|
| U.S.| 500| null| 500| 520| null|
| Canada| null| null| null| null| 200|
| Japan| null| null| 502| null| 230|
+-------+-------+-------+-------+-------+-------+
and I would like to export the table with the header row and a row for grand totals into an Excel spreadsheet via Openpyxl.
I can loop through the dataframe using .collect() and append the records to a worksheet, but that doesn't include the header, and I would like to add a grand total row as well.
Example of the grand total row:
+-------+-------+-------+-------+-------+-------+
|country|2015-10|2015-11|2015-12|2018-10|2019-08|
+-------+-------+-------+-------+-------+-------+
|Germany| null| 580| null| 580| null|
| U.S.| 500| null| 500| 520| null|
| Canada| null| null| null| null| 200|
| Japan| null| null| 502| null| 230|
+-------+-------+-------+-------+-------+-------+
| | 500| 580| 1002| 1100| 430|
+-------+-------+-------+-------+-------+-------+
How do I accomplish this?
Try looking at the rollup function and unioning it afterwards, e.g.
df = df_raw.groupBy('country').pivot("ym").sum('points')
df2 = df.rollup('country').count()
Alternatively, just take the output of your pivot, dynamically select the date columns (on a regex pattern or something) and aggregate them with sum(), and alias back into the column name.
EDIT:
Now I understand exactly what you wanted. I would still use rollup, but combined with some renaming and a union, such as:
from functools import reduce
import pyspark.sql.functions as f

agg_cols = df_pivot.columns[1:]
rollup_df = df_pivot.rollup().sum()
renamed_df = reduce(
    lambda rollup_df, idx: rollup_df.withColumnRenamed(rollup_df.columns[idx], agg_cols[idx]),
    range(len(rollup_df.columns)), rollup_df
)
renamed_df = renamed_df.withColumn('country', f.lit('Total'))
df_pivot.unionByName(renamed_df).show()
Output:
+-------+-------+-------+-------+-------+-------+
|country|2015-10|2015-11|2015-12|2018-10|2019-08|
+-------+-------+-------+-------+-------+-------+
|Germany| null| 580| null| 580| null|
| U.S.| 500| null| 500| 520| null|
| Canada| null| null| null| null| 200|
| Japan| null| null| 502| null| 230|
| Total| 500| 580| 1002| 1100| 430|
+-------+-------+-------+-------+-------+-------+
Tested on PySpark 2.4.3

Not able to get metadata information of the Delta Lake table using Spark

I am trying to get metadata information about a Delta Lake table created using a DataFrame: information on the version and timestamp.
I tried spark.sql("describe deltaSample").show(10,false), but this does not give information related to version and timestamp.
I want to know how many versions exist, along with their timestamp information.
+--------+---------+-------+
|col_name|data_type|comment|
+--------+---------+-------+
|_c0 |string |null |
|_c1 |string |null |
+--------+---------+-------+
Below is the code :
// start spark-shell with the Delta Lake package
spark2-shell --packages io.delta:delta-core_2.11:0.2.0
val data = spark.read.csv("/xyz/deltaLake/deltaLakeSample.csv")
// save data frame
data.write.format("delta").save("/xyz/deltaLake/deltaSample")
// create delta lake table
spark.sql("create table deltaSample using delta location '/xyz/deltaLake/deltaSample'")
import org.apache.spark.sql.functions.{col, when}
val updatedInfo = data.withColumn("_c1", when(col("_c1").equalTo("right"), "updated").otherwise(col("_c1")))
// update delta lake table
updatedInfo.write.format("delta").mode("overwrite").save("/xyz/deltaLake/deltaSample")
spark.read.format("delta").option("versionAsOf", 0).load("/xyz/deltaLake/deltaSample/").show(10,false)
+---+-----+
|_c0|_c1 |
+---+-----+
|rt |right|
|lt |left |
|bk |back |
|frt|front|
+---+-----+
spark.read.format("delta").option("versionAsOf", 1).load("/xyz/deltaLake/deltaSample/").show(10,false)
+---+-------+
|_c0|_c1 |
+---+-------+
|rt |updated|
|lt |left |
|bk |back |
|frt|front |
+---+-------+
// get metadata of the table created. with version, timestamp info.
spark.sql("describe history deltaSample") -- not working
org.apache.spark.sql.AnalysisException: Table or view was not found: history;
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:47)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:733)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.resolveRelation(Analyzer.scala:685)
Expected table display (e.g., with added Version and timestamp columns):
+---+-------+-------+-------------------+
|_c0|_c1    |Version|timestamp          |
+---+-------+-------+-------------------+
|rt |right  |0      |2019-07-22 00:24:00|
|lt |left   |0      |2019-07-22 00:24:00|
|rt |updated|1      |2019-08-22 00:25:60|
|lt |left   |1      |2019-08-22 00:25:60|
+---+-------+-------+-------------------+
The ability to view the history of a Delta Lake table was included in the recently announced 0.3.0 release, per Announcing the Delta Lake 0.3.0 Release.
Currently you can do this using the Scala API; the ability to do this in SQL is on the roadmap. For a Scala API example with 0.3.0:
import io.delta.tables._
val deltaTable = DeltaTable.forPath(spark, pathToTable)
val fullHistoryDF = deltaTable.history() // get the full history of the table.
val lastOperationDF = deltaTable.history(1) // get the last operation.
with the result of the fullHistoryDF being similar to:
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+
|version| timestamp|userId|userName|operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+
| 5|2019-07-29 14:07:47| null| null| DELETE|[predicate -> ["(...|null| null| null| 4| null| false|
| 4|2019-07-29 14:07:41| null| null| UPDATE|[predicate -> (id...|null| null| null| 3| null| false|
| 3|2019-07-29 14:07:29| null| null| DELETE|[predicate -> ["(...|null| null| null| 2| null| false|
| 2|2019-07-29 14:06:56| null| null| UPDATE|[predicate -> (id...|null| null| null| 1| null| false|
| 1|2019-07-29 14:04:31| null| null| DELETE|[predicate -> ["(...|null| null| null| 0| null| false|
| 0|2019-07-29 14:01:40| null| null| WRITE|[mode -> ErrorIfE...|null| null| null| null| null| true|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+

How to find the difference of two dataframes

I am working on some unit-testing Spark code which should be able to generate the difference between two dataframes (raw bucket and curated bucket). Both dataframes (buckets) are the same, and we want to execute this code to capture possible changes after we copy files from raw to curated. I am aware that I can use the except function as follows:
val difference = CuratedDataFrame.union(RawDataFrame).except(CuratedDataFrame.intersect(RawDataFrame))
+-----------+-------+-------------+---------+---------------+
|record |pid |feetype |freq |default |
+-----------+-------+-------------+---------+---------------+
| 1| 45| FAC| Y| T|
| 1| 45| FAC| Y| TTY|
| 1| 47| FAC| R| M|
| 1| 99| FAC| R| M|
+-----------+-------+-------------+---------+---------------+
The except function returns the entire row, but my desired output is as follows:
+-----------+-------+-------------+---------+---------------+
|record |pid |feetype |freq |default |
+-----------+-------+-------------+---------+---------------+
| null|[47,99]| null| null| null |
| null| null| null| null| [T, TTY]|
+-----------+-------+-------------+---------+---------------+
That is, if there is a change in a column it should appear; if there is no change it should be hidden or null.
To do this I am using the following approach:
// cols is the list of column names to compare
val mapDiffs = (name: String) => when($"l.$name" === $"r.$name", null)
  .otherwise(array($"l.$name", $"r.$name")).as(name)
val result = difference.as("l")
  .join(RawDataFrame.as("r"), $"l.primaryKey" === $"r.primaryKey", "inner")
  .select($"l.primaryKey" :: cols.map(mapDiffs): _*)
The above approach requires a primary key in order to join both dataframes and compare them row by row. Neither dataframe has a primary key, so I had to combine some of the columns to construct one:
+-----------+-------+-------------+---------+---------------+----------+
|record |pid |feetype |freq |default |primaryKey|
+-----------+-------+-------------+---------+---------------+----------+
| 1| 40| FAC| A| N| FAC40A|
| 1| 45| FAC| Y| T| FAC45Y|
| 1| 47| FAC| R| M| FAC47R|
+-----------+-------+-------------+---------+---------------+----------+
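For reference, a minimal Scala sketch of how such a composite key could be derived (the choice of feetype, pid and freq is an assumption based on the sample above):
import org.apache.spark.sql.functions.{col, concat}

// Hypothetical composite key built from feetype + pid + freq, e.g. "FAC40A"
val withKey = CuratedDataFrame.withColumn(
  "primaryKey",
  concat(col("feetype"), col("pid").cast("string"), col("freq"))
)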
The problem is that if any changes happen in the target bucket, the primary key changes as a consequence, so comparing both dataframes would be impossible.