Why does spark.read.parquet() run 2 jobs? - scala

I have a parquet file named test.parquet. It contains some integers. When I read it using the following code:
val df = spark.read.parquet("test.parquet")
df.show(false)
+---+
|id |
+---+
|11 |
|12 |
|13 |
|14 |
|15 |
|16 |
|17 |
|18 |
|19 |
+---+
The logs show that 2 jobs were executed: one is the parquet job and the other is the show job. Whereas, when I read the parquet file using the following code:
val df = spark.read.schema(StructType(List(StructField("id",LongType,false)))).parquet("test.parquet")
df.show(false)
+---+
|id |
+---+
|11 |
|12 |
|13 |
|14 |
|15 |
|16 |
|17 |
|18 |
|19 |
+---+
Only one job is executed, i.e., the show job.
So, my questions are:
Why does the first approach execute 2 jobs whereas the second approach executes only one?
And why is the second approach faster than the first one?

Spark reads the file twice:
1- To infer the schema
2- To create the DataFrame
Once the schema has been inferred, the DataFrame is created, which is fast.
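A minimal sketch of the idea (assuming the same local test.parquet): capture the inferred schema once and reuse it, so later reads skip the separate schema-inference job, just like the explicit StructType in the second approach does.
// Sketch only: this first read still pays for schema inference,
// but the schema can then be reused for every subsequent read.
val inferredSchema = spark.read.parquet("test.parquet").schema
val df2 = spark.read.schema(inferredSchema).parquet("test.parquet")
df2.show(false)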

Related

Duplicate data from streams on merge in Delta Tables

I have a source table with, say, the following data:
+----------------+---+--------+-----------------+---------+
|registrationDate| id|custName|            email|eventName|
+----------------+---+--------+-----------------+---------+
|      17-02-2023|  2| Person2|person2@gmail.com|   INSERT|
|      17-02-2023|  1| Person1|person1@gmail.com|   INSERT|
|      17-02-2023|  5| Person5|person5@gmail.com|   INSERT|
|      17-02-2023|  4| Person4|person4@gmail.com|   INSERT|
|      17-02-2023|  3| Person3|person3@gmail.com|   INSERT|
+----------------+---+--------+-----------------+---------+
The above table is stored in my S3 in Delta format.
Now I'm creating a dataframe from Kinesis streams and trying to merge it into my Delta table. Every operation works fine (upserts, deletes), but let's say the dataframe generated from the stream looks something like below:
+---------+---+---------------+----------------+----------------+
|eventName|id |custName       |email           |registrationDate|
+---------+---+---------------+----------------+----------------+
|REMOVE   |1  |null           |null            |null            |
|MODIFY   |2  |ModPerson2     |modemail@mod.com|09-02-2023      |
|MODIFY   |3  |3modifiedperson|modp@mod.com    |09-02-2023      |
|INSERT   |100|Person100      |p100@p.com      |09-02-2023      |
|INSERT   |200|Person200      |p200@p.com      |09-02-2023      |
|REMOVE   |5  |null           |null            |null            |
|REMOVE   |200|null           |null            |null            |
+---------+---+---------------+----------------+----------------+
It is evident from the dataframe above, created from the stream, that I'm inserting a record with ID 200 while also deleting the record with the same ID in the same batch. The records with IDs 1 and 5 are deleted, but not 200.
The merge cannot handle this kind of duplication in the source batch. How do I counter this?
deltaTable = DeltaTable.forPath(spark, 's3://path-to-my-s3')
(deltaTable.alias("first_df").merge(
    updated_data.alias("append_df"),
    "first_df.id = append_df.id")
    .whenMatchedDelete("append_df.eventName='REMOVE'")
    .whenMatchedUpdateAll("first_df.id = append_df.id")
    .whenNotMatchedInsertAll()
    .execute()
)
Resulting Delta table after the merge - the row with ID 200 still remains:
+----------------+---+---------------+-----------------+---------+
|registrationDate| id|       custName|            email|eventName|
+----------------+---+---------------+-----------------+---------+
|      09-02-2023|  2|     ModPerson2| modemail@mod.com|   MODIFY|
|      17-02-2023|  4|        Person4|person4@gmail.com|   INSERT|
|      09-02-2023|100|      Person100|       p100@p.com|   INSERT|
|      09-02-2023|200|      Person200|       p200@p.com|   INSERT|
|            null|200|           null|             null|   REMOVE|
|      09-02-2023|  3|3modifiedperson|     modp@mod.com|   MODIFY|
+----------------+---+---------------+-----------------+---------+
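One common way to sidestep this (a sketch only, shown in Scala even though the merge snippet above is PySpark; the REMOVE-wins priority is an assumption, and a real pipeline would rather order by an arrival-time column from the stream): collapse the micro-batch to a single row per id before merging, so an id that appears with both an INSERT and a REMOVE in the same batch ends up as just the REMOVE.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, when}

// Sketch: keep one event per id, preferring REMOVE over MODIFY over INSERT.
// "updatedData" stands for the micro-batch DataFrame (updated_data above).
val eventPriority = when(col("eventName") === "REMOVE", 3)
  .when(col("eventName") === "MODIFY", 2)
  .otherwise(1)

val dedupedBatch = updatedData
  .withColumn("prio", eventPriority)
  .withColumn("rn", row_number().over(Window.partitionBy("id").orderBy(col("prio").desc)))
  .filter(col("rn") === 1)
  .drop("prio", "rn")

// Use dedupedBatch as the merge source instead of the raw batch.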

Filter DF using the column of another DF (same col in both DF) Spark Scala

I am trying to filter a DataFrame DF1 using a column of another DataFrame DF2; the column is country_id. I want to reduce the rows of the first DataFrame to only the countries that are present in the second DF. An example:
+--------------+------------+-------+
|Date          | country_id | value |
+--------------+------------+-------+
|2015-12-14    |ARG         |5      |
|2015-12-14    |GER         |1      |
|2015-12-14    |RUS         |1      |
|2015-12-14    |CHN         |3      |
|2015-12-14    |USA         |1      |
+--------------+------------+-------+

+--------------+------------+
|USE           | country_id |
+--------------+------------+
| F            |RUS         |
| F            |CHN         |
+--------------+------------+

Expected:
+--------------+------------+-------+
|Date          | country_id | value |
+--------------+------------+-------+
|2015-12-14    |RUS         |1      |
|2015-12-14    |CHN         |3      |
+--------------+------------+-------+
How could I do this? I am new to Spark, so I thought of maybe using intersect? Or would another method be more efficient?
Thanks in advance!
You can use a left semi join:
val DF3 = DF1.join(DF2, Seq("country_id"), "left_semi")
DF3.show
//+----------+----------+-----+
//|country_id| Date|value|
//+----------+----------+-----+
//| RUS|2015-12-14| 1|
//| CHN|2015-12-14| 3|
//+----------+----------+-----+
You can also use an inner join:
val DF3 = DF1.alias("a").join(DF2.alias("b"), Seq("country_id")).select("a.*")

Filtering empty partitions in RDD

Is there a way to filter out empty partitions in an RDD? I have some empty partitions after partitioning, and I can't use them in an action method.
I use Apache Spark with Scala.
This is my sample data
val sc = spark.sparkContext
val myDataFrame = spark.range(20).toDF("mycol").repartition($"mycol")
myDataFrame.show(false)
Output :
+-----+
|mycol|
+-----+
|19 |
|0 |
|7 |
|6 |
|9 |
|17 |
|5 |
|1 |
|10 |
|3 |
|12 |
|8 |
|11 |
|2 |
|4 |
|13 |
|18 |
|14 |
|15 |
|16 |
+-----+
In the above code, when you repartition on the column, 200 partitions are created because spark.sql.shuffle.partitions = 200. Most of them are unused or empty, since the data is just 20 numbers (we are trying to fit 20 rows into 200 partitions, which means most of the partitions are empty :-)).
1) Prepare a long accumulator variable to quickly count the non-empty partitions.
2) Add 1 to the accumulator for every non-empty partition, as in the example below.
val nonEmptyPartitions = sc.longAccumulator("nonEmptyPartitions")
myDataFrame.foreachPartition(partition =>
  if (partition.nonEmpty) nonEmptyPartitions.add(1))
3) Drop the empty partitions (i.e., coalesce down to the non-empty count; coalesce means less shuffle / minimum shuffle).
4) Print the partition counts.
val finalDf = myDataFrame.coalesce(nonEmptyPartitions.value.toInt)
println(s"nonEmptyPart : ${nonEmptyPartitions.value.toInt}")
println(s"df.rdd.partitions.length : ${myDataFrame.rdd.getNumPartitions}")
println(s"finalDf.rdd.partitions.length : ${finalDf.rdd.getNumPartitions}")
Result :
nonEmptyPart : 20
df.rdd.partitions.length : 200
finalDf.rdd.partitions.length : 20
Proof that all the empty partitions were dropped:
myDataFrame.withColumn("partitionId", org.apache.spark.sql.functions.spark_partition_id)
.groupBy("partitionId")
.count
.show
Result: the record count per partition:
+-----------+-----+
|partitionId|count|
+-----------+-----+
|128 |1 |
|190 |1 |
|140 |1 |
|164 |1 |
|5 |1 |
|154 |1 |
|112 |1 |
|107 |1 |
|4 |1 |
|49 |1 |
|69 |1 |
|77 |1 |
|45 |1 |
|121 |1 |
|143 |1 |
|58 |1 |
|11 |1 |
|150 |1 |
|68 |1 |
|116 |1 |
+-----------+-----+
Note :
spark_partition_id is used for demo/debug purposes only, not for production use.
I reduced the 200 partitions (created by the repartition on the column) to 20 non-empty partitions.
Conclusion :
You finally got rid of the extra empty partitions that don't hold any data, and avoided unnecessarily scheduling dummy tasks on empty partitions.
From the little info you provide, I can think of two options. One is to use mapPartitions, catching empty iterators and returning them as-is while working on the non-empty ones:
rdd.mapPartitions { case iter => if(iter.isEmpty) { iter } else { ??? } }
Or you can use repartition, to get rid of the empty partitions.
rdd.repartition(10) // or any proper number
If you don't know the distinct values within the column and wish to avoid having empty partitions, you can use countApproxDistinct() as:
df.repartition(df.rdd.countApproxDistinct().toInt)
If you wish to filter out the existing empty partitions and repartition, you can use the solution suggested by Sasa,
OR:
df.repartition(df.rdd.mapPartitions(part => Iterator(part.length)).collect().count(_ != 0))
However, in the latter case the partitions may or may not contain records grouped by value.

Duplicating the record count in apache spark

This is an extension of this question, Apache Spark group by combining types and sub types.
val sales = Seq(
  ("Warsaw", 2016, "facebook", "share", 100),
  ("Warsaw", 2017, "facebook", "like", 200),
  ("Boston", 2015, "twitter", "share", 50),
  ("Boston", 2016, "facebook", "share", 150),
  ("Toronto", 2017, "twitter", "like", 50)
).toDF("city", "year", "media", "action", "amount")
All good with that solution; however, the expected output should be counted in different categories conditionally.
So, the output should look like this:
+-------+--------+-----+
|city   |media   |count|
+-------+--------+-----+
|Boston |facebook|1    |
|Boston |share1  |2    |
|Boston |share2  |2    |
|Boston |twitter |1    |
|Toronto|twitter |1    |
|Toronto|like    |1    |
|Warsaw |facebook|2    |
|Warsaw |share1  |1    |
|Warsaw |share2  |1    |
|Warsaw |like    |1    |
+-------+--------+-----+
Here, if the action is share, I need it counted in both share1 and share2. When I count it programmatically, I use a case statement and say: when the action is share, share1 = share1 + 1 and share2 = share2 + 1.
But how can I do this in Scala, PySpark, or SQL?
A simple filter and unions should give you your desired output:
val media = sales.groupBy("city", "media").count()
val action = sales.groupBy("city", "action").count().select($"city", $"action".as("media"), $"count")
val share = action.filter($"media" === "share")

media.union(action.filter($"media" =!= "share"))
  .union(share.withColumn("media", lit("share1")))
  .union(share.withColumn("media", lit("share2")))
  .show(false)
which should give you
+-------+--------+-----+
|city |media |count|
+-------+--------+-----+
|Boston |facebook|1 |
|Boston |twitter |1 |
|Toronto|twitter |1 |
|Warsaw |facebook|2 |
|Warsaw |like |1 |
|Toronto|like |1 |
|Boston |share1 |2 |
|Warsaw |share1 |1 |
|Boston |share2 |2 |
|Warsaw |share2 |1 |
+-------+--------+-----+

Spark Dataframe - Implement Oracle NVL Function while joining

I need to implement the NVL function in Spark while joining two dataframes.
Input Dataframes :
ds1.show()
---------------
|key | Code |
---------------
|2 | DST |
|3 | CPT |
|null | DTS |
|5 | KTP |
---------------
ds2.show()
------------------
|key | PremAmt |
------------------
|2 | 300 |
|-1 | -99 |
|5 | 567 |
------------------
I need to implement "LEFT JOIN NVL(DS1.key, -1) = DS2.key".
So I have written it like this, but the NVL or coalesce function is missing, so it returned wrong values.
How do I incorporate "NVL" into Spark dataframes?
// nvl/coalesce is missing here, so it gives the wrong output
ds1.join(ds2, Seq("key"), "left_outer")
-------------------------
|key | Code |PremAmt |
-------------------------
|2 | DST |300 |
|3 | CPT |null |
|null | DTS |null |
|5 | KTP |567 |
-------------------------
Expected Result :
-------------------------
|key | Code |PremAmt |
-------------------------
|2 | DST |300 |
|3 | CPT |null |
|null | DTS |-99 |
|5 | KTP |567 |
-------------------------
I know one complex way.
val df = df1.join(df2, coalesce(df1("key"), lit(-1)) === df2("key"), "left_outer")
You should rename the "key" column of one df, and drop the extra key column after the join.
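A minimal sketch of that rename-and-drop approach (assuming the ds1/ds2 dataframes from the question; the temporary column name key2 is just an illustration):
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Rename ds2's join key so the two "key" columns don't collide,
// join on NVL(ds1.key, -1) = ds2.key, then drop the helper column.
val joined = ds1
  .join(ds2.withColumnRenamed("key", "key2"),
        coalesce(ds1("key"), lit(-1)) === col("key2"),
        "left_outer")
  .drop("key2")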
An implementation of nvl in Scala:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{when, lit}

def nvl(ColIn: Column, ReplaceVal: Any): Column = {
  when(ColIn.isNull, lit(ReplaceVal)).otherwise(ColIn)
}
Now you can use nvl as you would use any other function for data frame manipulation, like
val NewDf = DF.withColumn("MyColNullsReplaced", nvl($"MyCol", "<null>"))
Obviously, ReplaceVal must be of the correct type. The example above assumes $"MyCol" is of type string.
This worked for me:
intermediateDF.select(
  col("event_start_timestamp"),
  col("cobrand_id"),
  col("rule_name"),
  col("table_name"),
  coalesce(col("dimension_field1"), lit(0)),
  coalesce(col("dimension_field2"), lit(0)),
  coalesce(col("dimension_field3"), lit(0)),
  coalesce(col("dimension_field4"), lit(0)),
  coalesce(col("dimension_field5"), lit(0))
)
The answer is to use NVL; this code in Python works:
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr  # needed for the expr() call at the end

spark = SparkSession.builder.master("local[1]").appName("CommonMethods").getOrCreate()
Note: the SparkSession is built in a "chained" fashion, i.e., 3 methods are applied on the same line.
Read the CSV file:
df = spark.read.csv('C:\\tableausuperstore1_all.csv',inferSchema='true',header='true')
df.createOrReplaceTempView("ViewSuperstore")
The ViewSuperstore view can now be used with SQL:
print("*trace1-nvl")
df = spark.sql("select nvl(state,'a') testString, nvl(quantity,0) testInt from ViewSuperstore where state='Florida' and OrderDate>current_date() ")
df.show()
print("*trace2-FINAL")
df.select(expr("nvl(colname,'ZZ')"))