Create a unique runid and append it to the output DataFrame each time we run Spark Scala code - scala

Below is my output DataFrame. I want to add one more column for the runid. Can anyone help?
+--------+-------------------------------+---+
|order_id|Diff |id |
+--------+-------------------------------+---+
|12 |order_status |1 |
|1 |order_customer_id order_status |1 |
|68885 |New row in DataFrame 2 |1 |
|68886 |New row in DataFrame 2 |1 |
|2 |order_customer_id |1 |
|12 |order_status |2 |
|1 |order_customer_id order_status |2 |
|68885 |New row in DataFrame 2 |2 |
|68886 |New row in DataFrame 2 |2 |
|2 |order_customer_id |2 |
+--------+-------------------------------+---+
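One simple way to do this (a sketch, not an accepted answer; outputDF is a placeholder for the DataFrame shown above) is to generate a single UUID on the driver once per run and attach it as a constant column with lit:

import java.util.UUID
import org.apache.spark.sql.functions.lit

// One id per application run, generated on the driver, so every row
// produced by this run carries the same runid value.
val runId = UUID.randomUUID().toString
val outputWithRunId = outputDF.withColumn("runid", lit(runId))
outputWithRunId.show(false)

If a sortable id is preferred, a run timestamp (for example System.currentTimeMillis.toString) could be used in place of the UUID.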

Related

spark sql max function not producing right value

I'm trying to find the max of a column grouped by spark partition id. I'm getting the wrong value when applying the max function though. Here is the code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, max, spark_partition_id}

val partitionCol = uuid()   // uuid() is assumed to be a local helper that generates a unique column name
val localRankCol = "test"
df = df.withColumn(partitionCol, spark_partition_id())
val windowSpec = Window.partitionBy(partitionCol).orderBy(sortExprs: _*)
val rankDF = df.withColumn(localRankCol, dense_rank().over(windowSpec))
val rankRangeDF = rankDF.agg(max(localRankCol))
rankRangeDF.show(false)
sortExprs applies an ascending sort on sales.
The result with some dummy data is (partitionCol is the 5th column):
+--------------+------+-----+---------------------------------+--------------------------------+----+
|title |region|sales|r6bea781150fa46e3a0ed761758a50dea|5683151561af407282380e6cf25f87b5|test|
+--------------+------+-----+---------------------------------+--------------------------------+----+
|Die Hard |US |100.0|1 |0 |1 |
|Rambo |US |100.0|1 |0 |1 |
|Die Hard |AU |200.0|1 |0 |2 |
|House of Cards|EU |400.0|1 |0 |3 |
|Summer Break |US |400.0|1 |0 |3 |
|Rambo |EU |100.0|1 |1 |1 |
|Summer Break |APAC |200.0|1 |1 |2 |
|Rambo |APAC |300.0|1 |1 |3 |
|House of Cards|US |500.0|1 |1 |4 |
+--------------+------+-----+---------------------------------+--------------------------------+----+
+---------+
|max(test)|
+---------+
|5 |
+---------+
"test" column has a max value of 4 but 5 is being returned.

Scala Spark dataframe filter using multiple column based on available value

I need to filter a DataFrame with the below criteria.
I have 2 columns: 4Wheel (Subaru, Toyota, GM, null/empty) and 2Wheel (Yamaha, Harley, Indian, null/empty).
I have to filter on 4Wheel with values (Subaru, Toyota); if 4Wheel contains empty/null, then filter on 2Wheel with values (Yamaha, Harley).
I couldn't find this type of filtering in different examples. I am new to Spark/Scala, so I could not work out how to implement this.
Thanks,
Barun.
You can use the Spark SQL built-in function when to check whether a column is null or empty, and filter accordingly:
import org.apache.spark.sql.functions.{col, when}

dataframe.filter(
  when(col("4Wheel").isNull || col("4Wheel").equalTo(""),
    col("2Wheel").isin("Yamaha", "Harley")
  ).otherwise(
    col("4Wheel").isin("Subaru", "Toyota")
  )
)
So if you have the following input:
+---+------+------+
|id |4Wheel|2Wheel|
+---+------+------+
|1 |Toyota|null |
|2 |Subaru|null |
|3 |GM |null |
|4 |null |Yamaha|
|5 | |Yamaha|
|6 |null |Harley|
|7 | |Harley|
|8 |null |Indian|
|9 | |Indian|
|10 |null |null |
+---+------+------+
You get the following filtered output:
+---+------+------+
|id |4Wheel|2Wheel|
+---+------+------+
|1 |Toyota|null |
|2 |Subaru|null |
|4 |null |Yamaha|
|5 | |Yamaha|
|6 |null |Harley|
|7 | |Harley|
+---+------+------+
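For reference, the same condition can also be written as a plain boolean filter without when (a sketch, equivalent under the same null/empty handling):

import org.apache.spark.sql.functions.{coalesce, col, lit}

// Treat a null 4Wheel as an empty string, then branch on emptiness.
val fourWheelEmpty = coalesce(col("4Wheel"), lit("")) === ""
dataframe.filter(
  (fourWheelEmpty && col("2Wheel").isin("Yamaha", "Harley")) ||
  (!fourWheelEmpty && col("4Wheel").isin("Subaru", "Toyota"))
)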

Replace the values based on condition spark

I have a dataset. I want to replace the result column based on the least value of quantity when grouping by id,date.
id,date,quantity,result
1,2016-01-01,245,1
1,2016-01-01,345,3
1,2016-01-01,123,2
1,2016-01-02,120,5
2,2016-01-01,567,1
2,2016-01-01,568,1
2,2016-01-02,453,1
Here is the expected output: within each (id,date) group, result is replaced by the result of the row with the least quantity. The ordering of rows doesn't matter; any order is fine.
id,date,quantity,result
1,2016-01-01,245,2
1,2016-01-01,345,2
1,2016-01-01,123,2
1,2016-01-02,120,5
2,2016-01-01,567,1
2,2016-01-01,568,1
2,2016-01-02,453,1
Use a Window partitioned by (id, date): keep result only on the row with the minimum quantity, then take the maximum of that column over the window so the value spreads to every row in the group.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy('id', 'date')
df.withColumn('result', f.when(f.col('quantity') == f.min('quantity').over(w), f.col('result'))) \
.withColumn('result', f.max('result').over(w)).show(10, False)
+---+----------+--------+------+
|id |date |quantity|result|
+---+----------+--------+------+
|1 |2016-01-02|120 |5 |
|1 |2016-01-01|245 |2 |
|1 |2016-01-01|345 |2 |
|1 |2016-01-01|123 |2 |
|2 |2016-01-02|453 |1 |
|2 |2016-01-01|567 |1 |
|2 |2016-01-01|568 |1 |
+---+----------+--------+------+
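Since the rest of this page is Scala, here is the same approach translated to the Scala API (a sketch, assuming a DataFrame df with the columns id, date, quantity and result):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max, min, when}

val w = Window.partitionBy("id", "date")
df.withColumn("result", when(col("quantity") === min("quantity").over(w), col("result")))  // keep result only on the min-quantity row
  .withColumn("result", max("result").over(w))                                             // spread that value across the group (max ignores nulls)
  .show(10, false)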

How to convert a List of Lists to a DataFrame in Scala?

I am learning Spark and Scala, and was experimenting in the spark REPL.
When I try to convert a List to a DataFrame, it works as follows:
val convertedDf = Seq(1,2,3,4).toDF("Field1")
However, when I try to convert a list of lists to a DataFrame with two columns (Field1, Field2):
val twoColumnDf = Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3)).toDF("Field1", "Field2")
it fails with the error message:
java.lang.IllegalArgumentException: requirement failed: The number of
columns doesn't match
How to convert such a List of Lists to a DataFrame in Scala?
If you are seeking ways to have each element of each sequence in its own row of the respective column, then the following are the options for you:
zip
zip both sequences and then apply toDF as
val twoColumnDf =Seq(1,2,3,4,5).zip(Seq(5,4,3,2,3)).toDF("Field1", "Field2")
which should give you twoColumnDf as
+------+------+
|Field1|Field2|
+------+------+
|1 |5 |
|2 |4 |
|3 |3 |
|4 |2 |
|5 |3 |
+------+------+
zipped
Another better way is to use zipped as
val threeColumnDf = (Seq(1,2,3,4,5), Seq(5,4,3,2,3), Seq(10,10,10,12,14)).zipped.toList.toDF("Field1", "Field2", "field3")
which should give you
+------+------+------+
|Field1|Field2|field3|
+------+------+------+
|1 |5 |10 |
|2 |4 |10 |
|3 |3 |10 |
|4 |2 |12 |
|5 |3 |14 |
+------+------+------+
But zipped works only for a maximum of three sequences. Thanks for pointing that out, @Shaido.
Note: the number of rows is determined by the shortest sequence present
transpose
transpose combines all the sequences just as zip and zipped do, but returns lists instead of tuples, so a little hacking is needed as
Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3)).transpose.map{case List(a,b) => (a, b)}.toDF("Field1", "Field2")
+------+------+
|Field1|Field2|
+------+------+
|1 |5 |
|2 |4 |
|3 |3 |
|4 |2 |
|5 |3 |
+------+------+
and
Seq(Seq(1,2,3,4,5), Seq(5,4,3,2,3), Seq(10,10,10,12,14)).transpose.map{case List(a,b,c) => (a, b, c)}.toDF("Field1", "Field2", "Field3")
+------+------+------+
|Field1|Field2|Field3|
+------+------+------+
|1 |5 |10 |
|2 |4 |10 |
|3 |3 |10 |
|4 |2 |12 |
|5 |3 |14 |
+------+------+------+
and so on ...
Note: transpose requires all sequences to be of the same length
I hope the answer is helpful
By default, each element is considered to be a row of the DataFrame.
If you want each of the Seqs to be a different column, you need to group them inside a tuple:
val twoColumnDf =Seq((Seq(1,2,3,4,5), Seq(5,4,3,2,3))).toDF("Field1", "Field2")
twoColumnDf.show
+---------------+---------------+
| Field1| Field2|
+---------------+---------------+
|[1, 2, 3, 4, 5]|[5, 4, 3, 2, 3]|
+---------------+---------------+

Data loss after writing in spark

I obtain a resultant DataFrame after performing some computations. Say the DataFrame is result. When I write it to Amazon S3, there are specific cells which are shown blank. The top 5 rows of my result DataFrame are:
+--------+-------+-------+-----+-----+-----+-----+
|var30   |var31  |var32  |var33|var34|var35|var36|
+--------+-------+-------+-----+-----+-----+-----+
|-0.00586|0.13821|0      |     |1    |     |     |
|3.87635 |2.86702|2.51963|8    |11   |2    |14   |
|3.78279 |2.54833|2.45881|     |2    |     |     |
|-0.10092|0      |0      |1    |1    |3    |1    |
|8.08797 |6.14486|5.25718|     |5    |     |     |
+--------+-------+-------+-----+-----+-----+-----+
But when I run the result.show() command, I am able to see the values.
+--------+-------+-------+-----+-----+-----+-----+
|var30   |var31  |var32  |var33|var34|var35|var36|
+--------+-------+-------+-----+-----+-----+-----+
|-0.00586|0.13821|0      |2    |1    |1    |6    |
|3.87635 |2.86702|2.51963|8    |11   |2    |14   |
|3.78279 |2.54833|2.45881|2    |2    |2    |12   |
|-0.10092|0      |0      |1    |1    |3    |1    |
|8.08797 |6.14486|5.25718|20   |5    |5    |34   |
+--------+-------+-------+-----+-----+-----+-----+
Also, the blanks appear in the same cells every time I run it.
Use this to save the data to your S3 path:
result.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("s3n://Yourpath")
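On Spark 2.x and later, the built-in CSV writer can be used instead of the external com.databricks.spark.csv package (a sketch, keeping the same path placeholder):

result.repartition(1).write.option("header", "true").csv("s3n://Yourpath")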
For anyone who might have come across this issue, I can tell you what worked for me.
I was joining one data frame (let's say inputDF) with another (deltaDF) based on some logic and storing the result in an output data frame (outDF). I was hitting the same issue, whereby I could see a record in outDF.show(), but while writing this DataFrame into a Hive table, or after persisting outDF (using outDF.persist(StorageLevel.MEMORY_AND_DISK)), I wasn't able to see that particular record.
SOLUTION: I persisted inputDF (inputDF.persist(StorageLevel.MEMORY_AND_DISK)) before joining it with deltaDF. After that, the outDF.show() output was consistent with the Hive table where outDF was written.
P.S.: I am not sure why this solved the issue. It would be awesome if someone could explain this, but the above worked for me.
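A sketch of the persist-before-join pattern described above (inputDF, deltaDF and outDF are the names from the answer; the join keys and the target table name are hypothetical placeholders):

import org.apache.spark.storage.StorageLevel

// Persist the input before joining so the join, the show and the write
// all read the same materialized copy of inputDF.
inputDF.persist(StorageLevel.MEMORY_AND_DISK)
val outDF = inputDF.join(deltaDF, Seq("id"), "left")   // "id" and the join type are illustrative
outDF.persist(StorageLevel.MEMORY_AND_DISK)
outDF.show(false)
outDF.write.mode("overwrite").saveAsTable("some_hive_table")   // hypothetical table name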