I want to understand the best way to solve date-related problems in Spark SQL. I'm trying to solve a simple problem where I have a file with date ranges like the one below:
startdate,enddate
01/01/2018,30/01/2018
01/02/2018,28/02/2018
01/03/2018,30/03/2018
and another table that has date and counts:
date,counts
03/01/2018,10
25/01/2018,15
05/02/2018,23
17/02/2018,43
Now all I want to find is the sum of counts for each date range, so the expected output is:
startdate,enddate,sum(counts)
01/01/2018,30/01/2018,25
01/02/2018,28/02/2018,66
01/03/2018,30/03/2018,0
Following is the code I have written, but it gives me a Cartesian result set:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("DateBasedCount").master("local").getOrCreate()
import spark.implicits._
val df1 = spark.read.option("header","true").csv("dateRange.txt").toDF("startdate","enddate")
val df2 = spark.read.option("header","true").csv("dateCount").toDF("date","counts")
df1.createOrReplaceTempView("daterange")
df2.createOrReplaceTempView("datecount")
val res = spark.sql("select startdate, enddate, date, counts from daterange left join datecount on date >= startdate and date <= enddate")
res.show()
The output is:
+----------+----------+----------+------+
| startdate|   enddate|      date|counts|
+----------+----------+----------+------+
|01/01/2018|30/01/2018|03/01/2018|    10|
|01/01/2018|30/01/2018|25/01/2018|    15|
|01/01/2018|30/01/2018|05/02/2018|    23|
|01/01/2018|30/01/2018|17/02/2018|    43|
|01/02/2018|28/02/2018|03/01/2018|    10|
|01/02/2018|28/02/2018|25/01/2018|    15|
|01/02/2018|28/02/2018|05/02/2018|    23|
|01/02/2018|28/02/2018|17/02/2018|    43|
|01/03/2018|30/03/2018|03/01/2018|    10|
|01/03/2018|30/03/2018|25/01/2018|    15|
|01/03/2018|30/03/2018|05/02/2018|    23|
|01/03/2018|30/03/2018|17/02/2018|    43|
+----------+----------+----------+------+
Now if I group by startdate and enddate with a sum on counts, I see the following result, which is incorrect:
+----------+----------+-----------+
| startdate|   enddate|sum(counts)|
+----------+----------+-----------+
|01/01/2018|30/01/2018|       91.0|
|01/02/2018|28/02/2018|       91.0|
|01/03/2018|30/03/2018|       91.0|
+----------+----------+-----------+
So how do we handle this, and what is the best way to deal with dates in Spark SQL? Should we build the columns as DateType in the first place, or read them as strings and cast them to dates when necessary?
The problem is that Spark does not automatically interpret your dates as dates; they are just strings. The solution is therefore to convert them into dates:
import org.apache.spark.sql.functions.{to_date, unix_timestamp}

val df1 = spark.read.option("header","true").csv("dateRange.txt")
  .toDF("startdate","enddate")
  .withColumn("startdate", to_date(unix_timestamp($"startdate", "dd/MM/yyyy").cast("timestamp")))
  .withColumn("enddate", to_date(unix_timestamp($"enddate", "dd/MM/yyyy").cast("timestamp")))

val df2 = spark.read.option("header","true").csv("dateCount")
  .toDF("date","counts")
  .withColumn("date", to_date(unix_timestamp($"date", "dd/MM/yyyy").cast("timestamp")))
Then use the same code as before. The output of the SQL command is now:
+----------+----------+----------+------+
| startdate|   enddate|      date|counts|
+----------+----------+----------+------+
|2018-01-01|2018-01-30|2018-01-03|    10|
|2018-01-01|2018-01-30|2018-01-25|    15|
|2018-02-01|2018-02-28|2018-02-05|    23|
|2018-02-01|2018-02-28|2018-02-17|    43|
|2018-03-01|2018-03-30|      null|  null|
+----------+----------+----------+------+
If the last line should be ignored, simply change to an inner join instead.
Using groupBy("startdate", "enddate") with a sum on the joined result will give the wanted output.
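Note that with the left join the March range has no matching rows, so a plain sum yields null rather than the 0 shown in the expected output. A minimal sketch of the final aggregation, wrapping the sum in coalesce (assuming the joined result is called res as above):
import org.apache.spark.sql.functions.{coalesce, lit, sum}

val result = res
  .groupBy("startdate", "enddate")
  .agg(coalesce(sum($"counts".cast("int")), lit(0)).as("sum(counts)"))
result.show()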
Related
I need to complete my dataset with the dates that are missing, in the format YYYY-MM-DD.
In this example, I would like to add a row for each missing date between the dates I have information for, with a value of 0, since I have no data for those dates.
Can someone help me? Thanks!
One approach would be to assemble a time-series dataframe covering the wanted date range with java.time.LocalDate and then perform a left join, as shown below:
import java.time.LocalDate

import org.apache.spark.sql.functions.{coalesce, lit}
import spark.implicits._

val startDate: LocalDate = LocalDate.parse("2020-09-30")
val endDate: LocalDate = LocalDate.parse("2020-10-06")

// one row per day between startDate and endDate, inclusive
val tsDF = Iterator.iterate(startDate)(_.plusDays(1)).
  takeWhile(! _.isAfter(endDate)).
  map(java.sql.Date.valueOf(_)).
  toSeq.
  toDF("date")
val df = Seq(
  ("2020-10-01", 10),
  ("2020-10-03", 10),
  ("2020-10-04", 10),
  ("2020-10-06", 10)
).toDF("date", "value")

tsDF.
  join(df, Seq("date"), "left_outer").
  select($"date", coalesce($"value", lit(0)).as("value")).
  show
// +----------+-----+
// |      date|value|
// +----------+-----+
// |2020-09-30|    0|
// |2020-10-01|   10|
// |2020-10-02|    0|
// |2020-10-03|   10|
// |2020-10-04|   10|
// |2020-10-05|    0|
// |2020-10-06|   10|
// +----------+-----+
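For what it's worth, on Spark 2.4+ the built-in sequence function can generate the same date range without building it on the driver; a sketch, assuming that version:
// Assumes Spark 2.4+, where sequence() supports date ranges
val tsDF2 = spark.sql(
  "select explode(sequence(to_date('2020-09-30'), to_date('2020-10-06'), interval 1 day)) as date")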
Can you give an indication of the size of the data that you are working with?
It is not that simple to achieve without putting all the data onto a single partition and trashing the performance. What I would do to avoid that is associate each date with an id, then use spark.range to generate a dataframe with all these ids, and then join it with the original dataframe. It would go as follows:
import java.sql.Date

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import spark.implicits._

// let's create the sample dataframe
val df = Seq("2020-10-01" -> 10, "2020-10-03" -> 10, "2020-10-06" -> 10)
  .toDF("Date", "Value")
  .withColumn("Date", to_date('Date))

// Then, let's extract the first date and the number of days between the first
// and last dates
val Row(start: Date, diff: Int) = df
  .select(min('Date) as "start", max('Date) as "end")
  .select('start, datediff('end, 'start) as "diff")
  .head
// Finally, we create an id equal to 0 for the first date and to diff for the last.
// By joining with a dataframe containing all the ids between 0 and diff,
// the missing dates get populated.
df
  .withColumn("id", datediff('Date, lit(start)))
  .join(spark.range(diff + 1), Seq("id"), "right")
  .withColumn("start", lit(start))
  .select(expr("date_add(start, id)") as "Date", 'Value)
  .show
+----------+-----+
|      Date|Value|
+----------+-----+
|2020-10-01|   10|
|2020-10-02| null|
|2020-10-03|   10|
|2020-10-04| null|
|2020-10-05| null|
|2020-10-06|   10|
+----------+-----+
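If the missing dates should read 0 rather than null, as the original question asked, the nulls introduced by the right join can be filled at the end; a minimal tweak reusing df, start and diff from above:
df
  .withColumn("id", datediff('Date, lit(start)))
  .join(spark.range(diff + 1), Seq("id"), "right")
  .withColumn("start", lit(start))
  .select(expr("date_add(start, id)") as "Date", coalesce('Value, lit(0)) as "Value")
  .show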
Reference: How do I select the item with the most count in a dataframe and define it as a variable in Scala?
Given the table below, how can I select the nth src_ip and put it in a variable?
+--------------+------------+
|        src_ip|src_ip_count|
+--------------+------------+
|  58.242.83.11|          52|
|58.218.198.160|          33|
|58.218.198.175|          22|
|221.194.47.221|           6|
+--------------+------------+
You can create another column with the row number as
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val tempdf = df.withColumn("row_number", monotonically_increasing_id())
  .withColumn("row_number", row_number().over(Window.orderBy("row_number")))
which should give you tempdf as
+--------------+------------+----------+
|        src_ip|src_ip_count|row_number|
+--------------+------------+----------+
|  58.242.83.11|          52|         1|
|58.218.198.160|          33|         2|
|58.218.198.175|          22|         3|
|221.194.47.221|           6|         4|
+--------------+------------+----------+
Now you can use filter to select the nth row as
.filter($"row_number" === n)
That should be it.
For extracting the IP, let's say your n is 2:
val n = 2
Then the above process would give you
+--------------+------------+----------+
|        src_ip|src_ip_count|row_number|
+--------------+------------+----------+
|58.218.198.160|          33|         2|
+--------------+------------+----------+
Getting the IP address itself is explained in the link you provided in the question, by doing
.head.get(0)
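Putting it together, a small sketch of pulling the nth src_ip into a Scala variable (using the tempdf built above and n = 2; nthIp is just an illustrative name):
val n = 2
val nthIp: String = tempdf
  .filter($"row_number" === n)
  .select("src_ip")
  .head
  .getString(0)
// nthIp: String = 58.218.198.160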
The safest way is to use zipWithIndex on the dataframe converted to an RDD and then convert it back to a dataframe, so that we have an unambiguous row_number column.
val finalDF = df.rdd.zipWithIndex()
  .map(row => (row._1(0).toString, row._1(1).toString, (row._2 + 1).toInt))
  .toDF("src_ip", "src_ip_count", "row_number")
The rest of the steps are the same as explained above.
I have two Spark dataframes, df1 and df2:
+-------+-----+---+
|   name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName|  eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns which are specified in another file.
For example, the column lookup file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
|   name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build the where condition dynamically for the above scenario, because the lookup file is configurable and might have 1 to n field pairs.
You can use the except dataframe method. I'm assuming that the columns to use are given in two lists for simplicity. It's necessary that the order of both lists is correct: the columns at the same position in the lists will be compared, regardless of column name. After except, use join to get back the remaining columns from the first dataframe.
val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
.toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
.toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
.except(df2.select(df2Cols.head, df2Cols.tail: _*))
val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
|   name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
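Since the lookup file is configurable, the two lists above could also be derived from it instead of being hardcoded; a hedged sketch, assuming the lookup CSV has the df1col,df2col header shown earlier (the path colLookup.csv is only an illustration):
// hypothetical path; the file contains the df1col,df2col pairs shown above
val lookup = spark.read.option("header", "true").csv("colLookup.csv").collect()
val df1Cols = lookup.map(_.getString(0).trim).toList
val df2Cols = lookup.map(_.getString(1).trim).toList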
If you're doing this from a SQL query I would remap the column names in the SQL query itself with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that you can diff using something like
How to obtain the difference between two DataFrames?
If you need more columns that wouldn't be used in the diff (e.g. age) you can reselect the data again based on your diff results. This may not be the optimal way of doing it but it would probably work.
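If you prefer to stay in the DataFrame API, another option (assuming Spark 2.0+ and the same df1Cols/df2Cols lists) is a left_anti join with a condition built dynamically from the column pairs; a sketch:
// build the join condition from the configurable column pairs
val joinCond = df1Cols.zip(df2Cols)
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)

// left_anti keeps only the df1 rows that have no match in df2
val nonMatching = df1.join(df2, joinCond, "left_anti")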
How can I merge two dataframes while removing duplicates by comparing columns?
I have two dataframes with the same column names:
a.show()
+-----+----------+--------+
| name|      date|duration|
+-----+----------+--------+
|  bob|2015-01-13|       4|
|alice|2015-04-23|      10|
+-----+----------+--------+
b.show()
+------+----------+--------+
|  name|      date|duration|
+------+----------+--------+
|   bob|2015-01-12|       3|
|alice2|2015-04-13|      10|
+------+----------+--------+
What I am trying to do is merge the two dataframes so that only unique rows are displayed, by applying two conditions:
1. For the same name, the duration will be the sum of the durations.
2. For the same name, the final date will be the latest date.
The final output will be:
final.show()
+------+----------+--------+
|  name|      date|duration|
+------+----------+--------+
|   bob|2015-01-13|       7|
| alice|2015-04-23|      10|
|alice2|2015-04-13|      10|
+------+----------+--------+
I tried the following approach:
// Take the union of the two dataframes
val df = a.unionAll(b)

// group and take the sum
val grouped = df.groupBy("name").agg($"name", sum("duration"))

// join
val j = df.join(grouped, "name").drop("duration").withColumnRenamed("sum(duration)", "duration")
and I got
+------+----------+--------+
|  name|      date|duration|
+------+----------+--------+
|   bob|2015-01-13|       7|
| alice|2015-04-23|      10|
|   bob|2015-01-12|       7|
|alice2|2015-04-23|      10|
+------+----------+--------+
How can I now remove the duplicates by comparing the dates?
Would it be possible by running SQL queries after registering it as a table?
I am a beginner in Spark SQL and I feel like my way of approaching this problem is awkward. Is there a better way to do this kind of data processing?
You can take max("date") in the groupBy(); there is no need to join grouped with df.
// In 1.3.x, in order for the grouping column "name" to show up,
val grouped = df.groupBy("name").agg($"name",sum("duration"), max("date"))
// In 1.4+, grouping column "name" is included automatically.
val grouped = df.groupBy("name").agg(sum("duration"), max("date"))
I have a DataFrame with lookup table data; for each and every hour there will be an entry in this table. How do I calculate the total number of records up to the current hour?
For example, my DF data:
+----+-----+
|hour|count|
+----+-----+
|0.00|   10|
|1.00|    5|
|2.00|   10|
|3.00|   15|
|4.00|   10|
|5.00|   10|
+----+-----+
If I pass "4.00" as input, it should return the total count up to hour 4.
Expected output is:
Total count
50
Sample code I tried:
val df = Seq(("0.00", "10"),
("1.00", "15")).toDF("hour", "reccount")
df.show
df.printSchema
df.registerTempTable("erv")
//sqlContext.sql("select hour,reccount from erv").show
sqlContext.sql("select sum(reccount) over(partition by hour) as running_total from erv").show
But I am getting the error below:
Exception in thread "main" java.lang.RuntimeException: [1.26] failure:
``union'' expected but `(' found
select sum(reccount) over(partition by hour) as running_total from erv
I also tried the Window function below, but it expects a HiveContext to be created, and when I try to create a HiveContext locally it does not work.
window function code:
val wSpec = Window.partitionBy("hour").orderBy("hour").rowsBetween(Long.MinValue, 0)
df.withColumn("cumSum", sum(df("reccount")).over(wSpec)).show()
Not sure why you'd want to use window functions when you can simply filter to the right hours and aggregate:
val upTo = 4.0
val result = input.filter($"hour" <= upTo).agg(sum($"count") as "Total Count")
result.show()
// +-----------+
// |Total Count|
// +-----------+
// |         50|
// +-----------+
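If a true running total per hour is ever needed (what the window attempt was after), a window ordered by hour with no partitioning would do it, at the cost of pulling all rows into a single partition; a sketch assuming the same input dataframe:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val wSpec = Window.orderBy("hour")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
input.withColumn("running_total", sum($"count").over(wSpec)).show()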