Spark Dataframe - Push a particular Row to the last in a Dataframe - scala

I have been trying to push a particular row in a Spark Dataframe to the end of the Dataframe.
This is what I have tried so far.
Input Dataframe:
+-------------+-------+------------+
|expected_date|count |Downstream |
+-------------+-------+------------+
|2018-08-26 |1 |abc |
|2018-08-26 |6 |Grand Total |
|2018-08-26 |3 |xyy |
|2018-08-26 |2 |xxx |
+-------------+-------+------------+
Code:
df.withColumn("Downstream_Hierarchy", when(col("Downstream") === "Grand Total", 2)
.otherwise(1))
.orderBy(col("Downstream_Hierarchy").asc)
.drop("Downstream_Hierarchy")
Output Dataframe:
+-------------+-------+------------+
|expected_date|count |Downstream |
+-------------+-------+------------+
|2018-08-26 |1 |abc |
|2018-08-26 |3 |xyy |
|2018-08-26 |2 |xxx |
|2018-08-26 |6 |Grand Total |
+-------------+-------+------------+
Is there a simpler way to do this?

Going through your comments: since the end result is needed in HDFS, you can write it as CSV to HDFS twice.
1st time: write the dataframe to HDFS without the "Grand Total" row.
2nd time: write the "Grand Total" row alone with save mode "append".

Dataframe except the required row:
val df1 = df.filter(col("Downstream") =!= "Grand Total" )
Dataframe with only the required row:
val df2 = df.filter(col("Downstream") === "Grand Total" )
Required dataframe:
val df_final = df1.union(df2)
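For the two-write variant described above, a minimal sketch (the HDFS output path is hypothetical):
// Hypothetical output path; adjust to your environment
val outputPath = "hdfs:///data/output/report"
// 1st write: everything except the "Grand Total" row
df1.write.mode("overwrite").csv(outputPath)
// 2nd write: append the "Grand Total" row on its own
df2.write.mode("append").csv(outputPath)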
This might not be the best solution, but it avoids the expensive orderBy operation.

You can try the straightforward steps below.
val lastRowDf = df.filter("Downstream='Grand Total'")
val remainDf = df.filter("Downstream !='Grand Total'")
remainDf.union(lastRowDf).show() // unionAll is deprecated since Spark 2.0

Related

How to plot a graph with data gaps in Zeppelin?

The dataframe was extracted to a temp table to plot the data density per time unit (1 day):
val dailySummariesDf = getDFFromJdbcSource(
    SparkSession.builder().appName("test").master("local").getOrCreate(),
    s"SELECT * FROM values WHERE time > '2020-06-06' and devicename='Voltage' limit 100000000")
  .persist(StorageLevel.MEMORY_ONLY_SER)
  .groupBy($"digital_twin_id", window($"time", "1 day")).count().as("count")
  .withColumn("windowstart", col("window.start"))
  .withColumn("windowstartlong", unix_timestamp(col("window.start")))
  .orderBy("windowstart")
dailySummariesDf.registerTempTable("bank")
Then I plot it with the %sql interpreter:
%sql
select windowstart, count
from bank
and
%sql
select windowstartlong, count
from bank
What I get is shown below (plot omitted). My expectation is to have gaps in this graph, as there were days with no data at all. Instead it is plotted densely, with October days right after August and no gap for September.
How can I force these graphs to display gaps and respect the real X-axis values?
Indeed, grouping a dataset by a window column won't produce any rows for intervals that contained no original rows.
One way to deal with that is to add a bunch of fake rows ("manually fill in the gaps" in the raw dataset) and only then apply the groupBy/window. For your case, that can be done by creating a trivial one-column dataset containing all the dates within the range you're interested in, and then joining it to your original dataset.
Here is my quick attempt:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// Define sample data
val df = Seq(
  ("a", "2021-12-01"),
  ("b", "2021-12-01"),
  ("c", "2021-12-01"),
  ("a", "2021-12-02"),
  ("b", "2021-12-17")
).toDF("c", "d").withColumn("d", to_timestamp($"d"))
// Define a dummy dataframe for the range 12/01/2021 - 12/30/2021
import org.joda.time.DateTime
import org.joda.time.format.DateTimeFormat
val start = DateTime.parse("2021-12-01", DateTimeFormat.forPattern("yyyy-MM-dd")).getMillis / 1000
val end = start + 30 * 24 * 60 * 60
val temp = spark.range(start, end, 24 * 60 * 60).toDF()
  .withColumn("tc", to_timestamp($"id".cast(TimestampType)))
  .drop($"id")
// Fill the gaps in the original dataframe
val nogaps = temp.join(df, temp.col("tc") === df.col("d"), "left")
// Aggregate counts by a tumbling 1-day window
val result = nogaps.groupBy(window($"tc", "1 day", "1 day", "5 hours"))
  .agg(sum(when($"c".isNotNull, 1).otherwise(0)).as("count"))
result.withColumn("windowstart", to_date(col("window.start")))
  .select("windowstart", "count")
  .orderBy("windowstart")
  .show(false)
+-----------+-----+
|windowstart|count|
+-----------+-----+
|2021-12-01 |3 |
|2021-12-02 |1 |
|2021-12-03 |0 |
|2021-12-04 |0 |
|2021-12-05 |0 |
|2021-12-06 |0 |
|2021-12-07 |0 |
|2021-12-08 |0 |
|2021-12-09 |0 |
|2021-12-10 |0 |
|2021-12-11 |0 |
|2021-12-12 |0 |
|2021-12-13 |0 |
|2021-12-14 |0 |
|2021-12-15 |0 |
|2021-12-16 |0 |
|2021-12-17 |1 |
|2021-12-18 |0 |
|2021-12-19 |0 |
|2021-12-20 |0 |
+-----------+-----+
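As a side note, on Spark 2.4+ the helper date range can also be built with the built-in sequence function instead of joda-time. A quick sketch:
// Spark 2.4+: generate one timestamp per day without joda-time
val temp2 = spark.sql(
  "SELECT explode(sequence(to_timestamp('2021-12-01'), to_timestamp('2021-12-30'), interval 1 day)) AS tc")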
For illustration purposes only :)

Align multiple dataframes in pyspark

I have these 4 spark dataframes:
order,device,count_1
101,201,2
102,202,4
order,device,count_2
101,201,10
103,203,100
order,device,count_3
104,204,111
103,203,10
order,device,count_4
101,201,4
104,204,11
I want to create a resultant dataframe as:
order,device,count_1,count_2,count_3,count_4
101,201,2,10,,4
102,202,4,,,
103,203,,100,10,
104,204,,,111,11
Is this a case of UNION, JOIN, or APPEND? How do I get the final resultant df?
You can think of UNION as combining tables by rows, so the number of rows will likely increase. JOIN combines tables by columns. I'm not sure what you mean by APPEND, but in this case, you would want JOIN.
Try:
import spark.implicits._
val df1 = Seq((101,201,2), (102,202,4)).toDF("order", "device", "count_1")
val df2 = Seq((101,201,10), (103,203,100)).toDF("order" ,"device", "count_2")
val df3 = Seq((104,204,111), (103,203,10)).toDF("order" ,"device", "count_3")
val df4 = Seq((101,201,4), (104,204,11)).toDF("order" ,"device", "count_4")
val df12 = df1.join(df2, Seq("order", "device"),"fullouter")
df12.show(false)
val df123 = df12.join(df3, Seq("order", "device"),"fullouter")
df123.show(false)
val df1234 = df123.join(df4, Seq("order", "device"),"fullouter")
df1234.show(false)
returns:
+-----+------+-------+-------+-------+-------+
|order|device|count_1|count_2|count_3|count_4|
+-----+------+-------+-------+-------+-------+
|101 |201 |2 |10 |null |4 |
|102 |202 |4 |null |null |null |
|103 |203 |null |100 |10 |null |
|104 |204 |null |null |111 |11 |
+-----+------+-------+-------+-------+-------+
As you can see, the comments are flawed and the 1st answer is incorrect.
This was done in Scala; it should be easy to do in pyspark.
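If you have many such dataframes, the chained full outer joins can be generalized with reduce. A quick sketch using the dataframes above:
// Fold the pairwise full outer joins over any number of dataframes
val combined = Seq(df1, df2, df3, df4)
  .reduce((a, b) => a.join(b, Seq("order", "device"), "fullouter"))
combined.orderBy("order").show(false)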

How to merge two rows in Spark SQL?

I need to merge rows in the same dataframe based on a key column "id". In the sample dataframe, one row has data for id, name, and age; the other row has id, name, and city. Rows with the same key 'id' have to be merged into a single record in the final dataframe. If there is just one record for an id, it should be shown as well, with null values (Smith and Jake in the example below).
The computation needs to happen on real-time data, so a solution based on Spark native functions would be ideal. I have tried filtering the records based on the age and city columns into separate dataframes and then performing a left join on id, but it's not very efficient. Looking for any alternate suggestions. Thanks in advance!
Sample Dataframe
val inputDF = Seq(
  ("100", "John",  Some(35), None),
  ("100", "John",  None,     Some("Georgia")),
  ("101", "Mike",  Some(25), None),
  ("101", "Mike",  None,     Some("New York")),
  ("103", "Mary",  Some(22), None),
  ("103", "Mary",  None,     Some("Texas")),
  ("104", "Smith", Some(25), None),
  ("105", "Jake",  None,     Some("Florida"))
).toDF("id", "name", "age", "city")
Input Dataframe
+---+-----+----+--------+
|id |name |age |city |
+---+-----+----+--------+
|100|John |35 |null |
|100|John |null|Georgia |
|101|Mike |25 |null |
|101|Mike |null|New York|
|103|Mary |22 |null |
|103|Mary |null|Texas |
|104|Smith|25 |null |
|105|Jake |null|Florida |
+---+-----+----+--------+
Expected Output Dataframe
+---+-----+----+---------+
| id| name| age| city|
+---+-----+----+---------+
|100| John| 35| Georgia|
|101| Mike| 25| New York|
|103| Mary| 22| Texas|
|104|Smith| 25| null|
|105| Jake|null| Florida|
+---+-----+----+---------+
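For reference, the filter-and-join attempt described in the question might look like the sketch below (a full outer join is used so that single-record ids like Smith and Jake survive):
import org.apache.spark.sql.functions.col
// Split by non-null column, then re-join on the key
val ageDF  = inputDF.filter(col("age").isNotNull).select("id", "name", "age")
val cityDF = inputDF.filter(col("city").isNotNull).select("id", "name", "city")
val merged = ageDF.join(cityDF, Seq("id", "name"), "fullouter").orderBy("id")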
Use the first or last standard functions with the ignoreNulls flag on.
first standard function
val q = inputDF
.groupBy("id", "name")
.agg(first("age", ignoreNulls = true) as "age", first("city", ignoreNulls = true) as "city")
.orderBy("id")
last standard function
val q = inputDF
.groupBy("id","name")
.agg(last("age", true) as "age", last("city") as "city")
.orderBy("id")

Populate a "Grouper" column using .withcolumn in scala.spark dataframe

Trying to populate the grouper column like below. In the table below, X signifies the start of a new record, so each X, Y, Z run needs to be grouped together. In MySQL, I would accomplish this like:
select @x := 1;
update table set grouper = if(column_1 = 'X', @x := @x + 1, @x);
I am trying to see if there is a way to do this without using a loop, using .withColumn or something similar.
What I have tried:
var group = 1;
val mydf4 = mydf3.withColumn("grouper", when(col("column_1").equalTo("INS"),group=group+1).otherwise(group))
A simple window function with the row_number() built-in function should get you your desired output:
val df = Seq(
  Tuple1("X"), Tuple1("Y"), Tuple1("Z"),
  Tuple1("X"), Tuple1("Y"), Tuple1("Z")
).toDF("column_1")
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("column_1").orderBy("column_1")
import org.apache.spark.sql.functions._
df.withColumn("grouper", row_number().over(windowSpec))
  .orderBy("grouper", "column_1")
  .show(false)
which should give you
+--------+-------+
|column_1|grouper|
+--------+-------+
|X |1 |
|Y |1 |
|Z |1 |
|X |2 |
|Y |2 |
|Z |2 |
+--------+-------+
Note: the final orderBy is only there to match the expected output, for visualization. On a real cluster, an orderBy like that doesn't make sense.
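As a side note, the MySQL running-counter logic itself can be translated with a windowed sum. A sketch, assuming a hypothetical column seq that preserves the original row order (Spark dataframes have no inherent ordering, so such a column is needed):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// "seq" is a hypothetical ordering column; the counter increments at every "X" row
val w = Window.orderBy("seq")
val grouped = df.withColumn("grouper",
  sum(when(col("column_1") === "X", 1).otherwise(0)).over(w))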

how to access the column index for spark dataframe in scala for calculation

I am new to Scala programming. I have worked with R very extensively, but in Scala it has become tough to work in a loop that extracts specific columns to perform computations on the column values.
Let me explain with the help of an example: I have a final dataframe, arrived at after joining the 2 dataframes, and I need to compute a new column from each pair of matching columns (the exact formulas were shown as images, omitted here; the answer below reproduces the computation as Diff_x1 = (x2 - x1) / x2).
How do I refer to the column index in a for-loop to compute the new column values in a spark dataframe in scala?
Here is one solution:
Input Data:
+---+---+---+---+---+---+---+---+---+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |
+---+---+---+---+---+---+---+---+---+
|24 |74 |74 |21 |66 |65 |100|27 |19 |
+---+---+---+---+---+---+---+---+---+
Zip the 1-suffixed and 2-suffixed columns, dropping the non-matching ones (e1 has no pair):
val oneCols = data.schema.filter(_.name.contains("1")).map(x => x.name).sorted
val twoCols = data.schema.filter(_.name.contains("2")).map(x => x.name).sorted
val cols = oneCols.zip(twoCols)
//cols: Seq[(String, String)] = List((a1,a2), (b1,b2), (c1,c2), (d1,d2))
Use the foldLeft function to dynamically add the columns:
import org.apache.spark.sql.functions._
val result = cols.foldLeft(data) { (data, c) =>
  data.withColumn(s"Diff_${c._1}", (col(c._2) - col(c._1)) / col(c._2))
}
Here is the result:
result.show(false)
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|a1 |b1 |c1 |d1 |e1 |a2 |b2 |c2 |d2 |Diff_a1 |Diff_b1|Diff_c1 |Diff_d1 |
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+
|24 |74 |74 |21 |66 |65 |100|27 |19 |0.6307692307692307|0.26 |-1.7407407407407407|-0.10526315789473684|
+---+---+---+---+---+---+---+---+---+------------------+-------+-------------------+--------------------+