Calculate sequences of consecutively increasing dates in Spark (Scala)

I have a Spark DataFrame with a name column and dates, and I would like to find, for each name, all continuous sequences of consecutive dates (day after day) and calculate their durations. The output should contain the name, the start date of the sequence, and the duration of that period (number of days).
How can I do this with Spark functions?
A consecutive sequence of dates example:
2019-03-12
2019-03-13
2019-03-14
2019-03-15
I have put together the following solution, but it only counts the overall number of days per name and does not split them into sequences:
val result = allDataDf
.groupBy($"name")
.agg(count($"date").as("timePeriod"))
.orderBy($"timePeriod".desc)
.head()
I also tried with ranks, but the count column only contains 1s, for some reason:
val names = Window
.partitionBy($"name")
.orderBy($"date")
val result = allDataDf
.select($"name", $"date", rank over names as "rank")
.groupBy($"name", $"date", $"rank")
.agg(count($"*") as "count")
The output looks like this:
+-----------+----------+----+-----+
|stationName| date|rank|count|
+-----------+----------+----+-----+
| NAME|2019-03-24| 1| 1|
| NAME|2019-03-25| 2| 1|
| NAME|2019-03-27| 3| 1|
| NAME|2019-03-28| 4| 1|
| NAME|2019-01-29| 5| 1|
| NAME|2019-03-30| 6| 1|
| NAME|2019-03-31| 7| 1|
| NAME|2019-04-02| 8| 1|
| NAME|2019-04-05| 9| 1|
| NAME|2019-04-07| 10| 1|
+-----------+----------+----+-----+

Finding consecutive dates is fairly easy in SQL. The trick is that within a run of consecutive dates, the date minus its row number is constant, so that difference can serve as a group discriminator. (This also explains why your rank attempt only produced 1s: grouping by name, date and rank makes every group a single row.) You could do it with a query like:
WITH s AS (
SELECT
stationName,
date,
date_add(date, -(row_number() over (partition by stationName order by date))) as discriminator
FROM stations
)
SELECT
stationName,
MIN(date) as start,
COUNT(1) AS duration
FROM s GROUP BY stationName, discriminator
Fortunately, we can use SQL in Spark. Let's check that it works (I used different dates):
val df = Seq(
("NAME1", "2019-03-22"),
("NAME1", "2019-03-23"),
("NAME1", "2019-03-24"),
("NAME1", "2019-03-25"),
("NAME1", "2019-03-27"),
("NAME1", "2019-03-28"),
("NAME2", "2019-03-27"),
("NAME2", "2019-03-28"),
("NAME2", "2019-03-30"),
("NAME2", "2019-03-31"),
("NAME2", "2019-04-04"),
("NAME2", "2019-04-05"),
("NAME2", "2019-04-06")
).toDF("stationName", "date")
.withColumn("date", date_format(col("date"), "yyyy-MM-dd"))
df.createTempView("stations");
val result = spark.sql(
"""
|WITH s AS (
| SELECT
| stationName,
| date,
| date_add(date, -(row_number() over (partition by stationName order by date)) + 1) as discriminator
| FROM stations
|)
|SELECT
| stationName,
| MIN(date) as start,
| COUNT(1) AS duration
|FROM s GROUP BY stationName, discriminator
""".stripMargin)
result.show()
It seems to output the correct dataset:
+-----------+----------+--------+
|stationName| start|duration|
+-----------+----------+--------+
| NAME1|2019-03-22| 4|
| NAME1|2019-03-27| 2|
| NAME2|2019-03-27| 2|
| NAME2|2019-03-30| 2|
| NAME2|2019-04-04| 3|
+-----------+----------+--------+
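For completeness, here is a rough DataFrame-API equivalent of the same discriminator trick (a sketch, not part of the original answer; it assumes the df defined above):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val w = Window.partitionBy($"stationName").orderBy($"date")
val sequences = df
  .withColumn("rn", row_number().over(w))
  // date minus row number is constant within a run of consecutive dates
  .withColumn("discriminator", expr("date_add(date, -rn)"))
  .groupBy($"stationName", $"discriminator")
  .agg(min($"date").as("start"), count(lit(1)).as("duration"))
  .select($"stationName", $"start", $"duration")
sequences.show()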

Related

Create summary of Spark Dataframe

I have a Spark Dataframe which I am trying to summarise in order to find overly long columns:
// Set up test data
// Look for long columns (length >= 3): row 1 is OK, row 2 is bad on col3, row 3 is bad on col2
val df = Seq(
( 1, "a", "bb", "cc", "file1" ),
( 2, "d", "ee", "fff", "file2" ),
( 3, "g", "hhhh", "ii", "file3" )
).
toDF("rowId", "col1", "col2", "col3", "filename")
I can summarise the lengths of the columns and find overly long ones like this:
// Look for long columns (length >= 3): row 1 is OK, row 2 is bad on col3, row 3 is bad on col2
val df2 = df.columns
.map(c => (c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("columnName", "maxLength")
.filter($"maxLength" > 2)
If I try and add the existing filename column to the map I get an error:
val df2 = df.columns
.map(c => ($"filename", c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("fn", "columnName", "maxLength")
.filter($"maxLength" > 2)
I have tried a few variations of the $"filename" syntax. How can I incorporate the filename column into the summary?
The desired output:
+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
|      col2|        4|   file3|
|      col3|        3|   file2|
+----------+---------+--------+
The real dataframes have 300+ columns and millions of rows, so I cannot hard-code column names.
@wBob, does the following achieve your goal?
Group by filename and get the maximum length per column:
val cols = df.columns.dropRight(1) // to remove the filename col
val maxLength = cols.map(c => s"max(length(${c})) as ${c}").mkString(",")
print(maxLength)
df.createOrReplaceTempView("temp")
val df1 = spark
.sql(s"select filename, ${maxLength} from temp group by filename")
df1.show()
With the output:
+--------+-----+----+----+----+
|filename|rowId|col1|col2|col3|
+--------+-----+----+----+----+
| file1| 1| 1| 2| 2|
| file2| 1| 1| 2| 3|
| file3| 1| 1| 4| 2|
+--------+-----+----+----+----+
Use subqueries to get the maximum per column and concatenate the results using union:
df1.createOrReplaceTempView("temp2")
val res = cols.map(col => {
spark.sql(s"select '${col}' as columnName, $col as maxLength, filename from temp2 " +
s"where $col = (select max(${col}) from temp2)")
}).reduce(_ union _)
res.show()
With the result:
+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
| rowId| 1| file1|
| rowId| 1| file2|
| rowId| 1| file3|
| col1| 1| file1|
| col1| 1| file2|
| col1| 1| file3|
| col2| 4| file3|
| col3| 3| file2|
+----------+---------+--------+
Note that there are multiple entries for rowId and col1 since the maximum is not unique.
There is probably a more elegant way to write it, but I am struggling to find one at the moment.
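A possibly tidier sketch (my own variation, not from the answer above): aggregate max over a (length, filename) struct for every column, so the filename of the longest value is carried along with its length in one pass over the data; the per-column union then only touches a one-row summary.
import org.apache.spark.sql.functions._
val cols = df.columns.filter(_ != "filename")
// struct max compares the length field first, so this keeps the longest value's filename
val aggs = cols.map(c => max(struct(length(col(c)).as("maxLength"), col("filename").as("filename"))).as(c))
val summary = df.agg(aggs.head, aggs.tail: _*)
cols.map { c =>
  summary.select(lit(c).as("columnName"), col(s"$c.maxLength"), col(s"$c.filename"))
}.reduce(_ union _)
  .filter($"maxLength" > 2)
  .show()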
Pushed a little further for a better result:
df.select(
col("*"),
array( // make an array of (length, column name, value) structs
(for{ col_name <- df.columns } yield
struct(
length(col(col_name)).as("length"),
lit(col_name).as("col"),
col(col_name).cast("String").as("col_value")
)
).toSeq:_* ).alias("rowInfo")
)
.select(
col("rowId"),
explode( // explode array into rows
expr("filter(rowInfo, x -> x.length >= 3)") //filter the array for the length your interested in
).as("rowInfo")
)
.select(
col("rowId"),
col("rowInfo.*") // turn struct fields into columns
)
.sort("length").show
+-----+------+--------+---------+
|rowId|length| col|col_value|
+-----+------+--------+---------+
| 2| 3| col3| fff|
| 3| 4| col2| hhhh|
| 3| 5|filename| file3|
| 1| 5|filename| file1|
| 2| 5|filename| file2|
+-----+------+--------+---------+
It might be enough to sort your table by total text length. This can be achieved quickly and concisely.
df.select(
col("*"),
length( // take the length
concat( //slap all the columns together
(for( col_name <- df.columns ) yield col(col_name)).toSeq:_*
)
)
.as("length")
)
.sort( //order by total length
col("length").desc
).show()
+-----+----+----+----+--------+------+
|rowId|col1|col2|col3|filename|length|
+-----+----+----+----+--------+------+
| 3| g|hhhh| ii| file3| 13|
| 2| d| ee| fff| file2| 12|
| 1| a| bb| cc| file1| 11|
+-----+----+----+----+--------+------+
Sorting an array[struct] sorts on the first field first and the second field next. This works because we put the length of the string up front. If you re-order the struct fields you'll get different results. You could easily keep more than one result if you wanted, but I think discovering that a row is problematic is likely enough.
df.select(
col("*"),
reverse( // flip to descending so the largest struct comes first
sort_array( // sort_array sorts ascending by default
array( // add all columns lengths to an array
(for( col_name <- df.columns ) yield struct(length(col(col_name)),lit(col_name),col(col_name).cast("String")) ).toSeq:_* )
)
)(0) // grab the row max
.alias("rowMax") )
.sort("rowMax").show
+-----+----+----+----+--------+--------------------+
|rowId|col1|col2|col3|filename| rowMax|
+-----+----+----+----+--------+--------------------+
| 1| a| bb| cc| file1|[5, filename, file1]|
| 2| d| ee| fff| file2|[5, filename, file2]|
| 3| g|hhhh| ii| file3|[5, filename, file3]|
+-----+----+----+----+--------+--------------------+

Spark Dataframe Combine 2 Columns into Single Column, with Additional Identifying Column

I'm trying to split and then combine 2 DataFrame columns into 1, with another column identifying which column it originated from. Here is the code to generate the sample DF
val data = Seq(("1", "in1,in2,in3", null), ("2","in4,in5","ex1,ex2,ex3"), ("3", null, "ex4,ex5"), ("4", null, null))
val df = spark.sparkContext.parallelize(data).toDF("id", "include", "exclude")
This is the sample DF
+---+-----------+-----------+
| id| include| exclude|
+---+-----------+-----------+
| 1|in1,in2,in3| null|
| 2| in4,in5|ex1,ex2,ex3|
| 3| null| ex4,ex5|
| 4| null| null|
+---+-----------+-----------+
which I'm trying to transform into
+---+----+---+
| id|type|col|
+---+----+---+
| 1|incl|in1|
| 1|incl|in2|
| 1|incl|in3|
| 2|incl|in4|
| 2|incl|in5|
| 2|excl|ex1|
| 2|excl|ex2|
| 2|excl|ex3|
| 3|excl|ex4|
| 3|excl|ex5|
+---+----+---+
EDIT: Should mention that the data inside each of the cells in the example DF is just for visualization, and doesn't need to have the form in1,ex1, etc.
I can get it to work with union, as so:
df.select($"id", lit("incl").as("type"), explode(split(col("include"), ",")))
.union(
df.select($"id", lit("excl").as("type"), explode(split(col("exclude"), ",")))
)
but I was wondering if this was possible to do without using union.
The approach I am thinking of is to combine the include and exclude columns, apply the explode function, keep only the non-empty values, and finally derive the type with a case statement.
This might be a long process.
-- assumes the question's df has been registered as a temp view: df.createOrReplaceTempView("df")
WITH cte AS (
  SELECT id, concat_ws(',', include, exclude) AS outputcol FROM df
), ctes AS (
  SELECT id, explode(split(outputcol, ',')) AS finalcol FROM cte
)
SELECT id, CASE WHEN finalcol LIKE 'in%' THEN 'incl' ELSE 'excl' END AS type, finalcol
FROM ctes WHERE finalcol != ''
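If you want to avoid union entirely, a minimal sketch (my addition, not from the original answers) is to unpivot the two columns with the stack generator and explode the split values once, assuming the df from the question:
import org.apache.spark.sql.functions._
val result = df
  // stack emits two rows per id: ('incl', include) and ('excl', exclude)
  .select($"id", expr("stack(2, 'incl', include, 'excl', exclude) as (type, vals)"))
  .where($"vals".isNotNull) // drop the side that was null
  .select($"id", $"type", explode(split($"vals", ",")).as("col"))
result.show()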

assign unique id to dataframe elements

I have a dataframe recording each person's name and address, denoted as
case class Person(name:String, addr:String)
dataframe looks like this,
+----+-----+
|name| addr|
+----+-----+
| u1|addr1|
| u1|addr2|
| u2|addr1|
+----+-----+
but now I need to assign a Long type unique id to each element in this dataframe, which could be denoted as
case class PersonX(name:String, name_id:Long, addr:String, addr_id:Long)
and dataframe looks like this,
+----+-------+-----+------+
|name|name_id| addr|addr_id|
+----+-------+-----+------+
| u1| 1|addr1| 2|
| u1| 1|addr2| 3|
| u2| 4|addr1| 2|
+----+-------+-----+------+
Note that the elements in both columns (name and addr) share the same id space: name_id values must be unique, addr_id values must be unique, and furthermore name_ids and addr_ids must not overlap with each other.
How to achieve this?
The easiest way to assign ids is to use the dense_rank function from Spark SQL.
To make sure the ids of names and addresses don't overlap, you can use a trick:
compute the ids of names and addresses separately,
then shift the address ids by adding the maximum name id.
This way the address ids come after the name ids.
val input = spark.sparkContext.parallelize(List(
Person("u1", "addr1"),
Person("u1", "addr2"),
Person("u2", "addr1")
)).toDF("name", "addr")
input.createOrReplaceTempView("people")
val people = spark.sql(
"""select name,
| dense_rank() over(partition by 1 order by name) as name_id,
| addr,
| dense_rank() over(partition by 1 order by addr) as addr_id
| from people
""".stripMargin)
people.show()
//+----+-------+-----+-------+
//|name|name_id| addr|addr_id|
//+----+-------+-----+-------+
//| u1| 1|addr1| 1|
//| u2| 2|addr1| 1|
//| u1| 1|addr2| 2|
//+----+-------+-----+-------+
val name = people.col("name")
val nameId = people.col("name_id")
val addr = people.col("addr")
val addrId = people.col("addr_id")
val maxNameId = people.select(max(nameId)).first().getInt(0)
val shiftedAddrId = (addrId + maxNameId).as("addr_id")
people.select(name, addr, nameId, shiftedAddrId).as[PersonX].show()
//+----+-----+-------+-------+
//|name| addr|name_id|addr_id|
//+----+-----+-------+-------+
//| u1|addr1| 1| 3|
//| u2|addr1| 2| 3|
//| u1|addr2| 1| 4|
//+----+-----+-------+-------+
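An alternative sketch (my addition, not part of the original answer): build one shared id space by unioning the distinct names and addresses, ranking once, and joining the ids back; the cast to long matches the PersonX definition.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// one id per distinct value, shared across both columns
// (an un-partitioned window pulls everything into a single partition, which is fine for a small dictionary of distinct values)
val ids = input.select($"name".as("value"))
  .union(input.select($"addr".as("value")))
  .distinct()
  .withColumn("id", dense_rank().over(Window.orderBy($"value")).cast("long"))
val nameIds = ids.select($"value".as("name"), $"id".as("name_id"))
val addrIds = ids.select($"value".as("addr"), $"id".as("addr_id"))
input.join(nameIds, "name").join(addrIds, "addr")
  .select($"name", $"name_id", $"addr", $"addr_id")
  .show()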

How to group by gender and join by positions per group?

I have tried numerous approaches to turn the following:
Gender, Age, Value
1, 20, 21
2, 23, 22
1, 26, 23
2, 29, 24
into
Male_Age, Male_Value, Female_Age, Female_Value
20, 21, 23, 22
26, 23, 29, 24
What I need to do is group by gender, and instead of using an aggregate like sum, count or avg, I need to build a List[age] and a List[value] per group. This should be possible because I am using a Dataset, which allows functional operations.
If the number of rows for males and females are not the same, the columns should be filled with nulls.
One approach I tried was to make a new dataframe using the columns of other dataframes, like so:
df
.select(male.select("sex").where('sex === 1).col("sex"),
female.select("sex").where('sex === 2).col("sex"))
However, this bizarrely produces output like so:
sex, sex,
1, 1
2, 2
1, 1
2, 2
I can't see how that is possible.
I also tried using pivot, but it forces me to aggregate after the group by:
df.withColumn("sex2", df.col("sex"))
.groupBy("sex")
.pivot("sex2")
.agg(
sum('value).as("mean"),
stddev('value).as("std. dev") )
.show()
|sex| 1.0_mean| 1.0_std. dev| 2.0_mean| 2.0_std. dev|
|1.0|0.4926065526| 1.8110632697| | |
|2.0| | |0.951250372|1.75060275400785|
The following code does what I need in Oracle SQL, so it should be possible in Spark SQL too, I reckon:
drop table mytable
CREATE TABLE mytable
( gender number(10) NOT NULL,
age number(10) NOT NULL,
value number(10) );
insert into mytable values (1,20,21);
insert into mytable values(2,23,22);
insert into mytable values (1,26,23);
insert into mytable values (2,29,24);
insert into mytable values (1,30,25);
select * from mytable;
SELECT A.VALUE AS MALE,
B.VALUE AS FEMALE
FROM
(select value, rownum RN from mytable where gender = 1) A
FULL OUTER JOIN
(select value, rownum RN from mytable where gender = 2) B
ON A.RN = B.RN
The following should give you the result.
val df = Seq(
(1, 20, 21),
(2, 23, 22),
(1, 26, 23),
(2, 29, 24)
).toDF("Gender", "Age", "Value")
scala> df.show
+------+---+-----+
|Gender|Age|Value|
+------+---+-----+
| 1| 20| 21|
| 2| 23| 22|
| 1| 26| 23|
| 2| 29| 24|
+------+---+-----+
// Gender 1 = Male
// Gender 2 = Female
import org.apache.spark.sql.expressions.Window
// note: ordering by the partition key itself leaves the row order within each gender arbitrary;
// order by age (or another meaningful column) if you need deterministic pairing
val byGender = Window.partitionBy("gender").orderBy("gender")
val males = df
.filter("gender = 1")
.select($"age" as "male_age",
$"value" as "male_value",
row_number() over byGender as "RN")
scala> males.show
+--------+----------+---+
|male_age|male_value| RN|
+--------+----------+---+
| 20| 21| 1|
| 26| 23| 2|
+--------+----------+---+
val females = df
.filter("gender = 2")
.select($"age" as "female_age",
$"value" as "female_value",
row_number() over byGender as "RN")
scala> females.show
+----------+------------+---+
|female_age|female_value| RN|
+----------+------------+---+
| 23| 22| 1|
| 29| 24| 2|
+----------+------------+---+
scala> males.join(females, Seq("RN"), "outer").show
+---+--------+----------+----------+------------+
| RN|male_age|male_value|female_age|female_value|
+---+--------+----------+----------+------------+
| 1| 20| 21| 23| 22|
| 2| 26| 23| 29| 24|
+---+--------+----------+----------+------------+
Given a DataFrame called df with columns gender, age, and value, you can do this:
df.groupBy($"gender")
.agg(collect_list($"age"), collect_list($"value")).rdd.map { row =>
val ages: Seq[Int] = row.getSeq(1)
val values: Seq[Int] = row.getSeq(2)
(row.getInt(0), ages.head, ages.last, values.head, values.last)
}.toDF("gender", "male_age", "female_age", "male_value", "female_value")
This uses the collect_list aggregating function from the very helpful Spark functions library to aggregate the values you want. (There is also a collect_set.)
After that, I don't know of any higher-level DataFrame functions to expand those columnar arrays into individual columns of their own, so I fall back to the lower-level RDD API our ancestors used. I simply expand everything into a Tuple and then turn it back into a DataFrame. The commenters above mention corner cases I have not addressed; using functions like headOption and tailOption might be useful there. But this should be enough to get you moving.
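A hedged follow-up sketch (my addition, not from the answer): you can stay in the DataFrame API by collecting sorted (age, value) structs per gender and aligning them by position with posexplode and a full outer join, which also pads the shorter group with nulls (assuming gender 1 = male and 2 = female):
import org.apache.spark.sql.functions._
val perGender = df.groupBy($"gender")
  .agg(sort_array(collect_list(struct($"age", $"value"))).as("rows"))
  .select($"gender", posexplode($"rows")) // posexplode yields columns pos and col
  .select($"gender", $"pos", $"col.age".as("age"), $"col.value".as("value"))
val males = perGender.filter($"gender" === 1)
  .select($"pos", $"age".as("male_age"), $"value".as("male_value"))
val females = perGender.filter($"gender" === 2)
  .select($"pos", $"age".as("female_age"), $"value".as("female_value"))
males.join(females, Seq("pos"), "full_outer").drop("pos").show()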

Create a new column based on date checking

I have two dataframes in Scala:
df1 =
ID Field1
1 AAA
2 BBB
4 CCC
and
df2 =
PK start_date_time
1 2016-10-11 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.
I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.
The result should be this one:
df1 =
ID Field1 check
1 AAA 1
2 BBB 0
4 CCC 0
In my previous question I had two dataframes and it was suggested to use joining and filtering. However, in this case it won't work. My initial idea was to use a udf, but I am not sure how to make it work for this case.
You can combine join and withColumn for this case, i.e. first join with df2 on the ID column and then use the when/otherwise syntax to set the check column:
import org.apache.spark.sql.functions.{lit, to_date, when}
val df2_date = df2.withColumn("date", to_date(df2("start_date_time"))).withColumn("check", lit(1)).select($"PK".as("ID"), $"date", $"check")
df1.join(df2_date, Seq("ID"), "left").withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0)).drop("date").show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
Or another option: first filter df2, then join it back with df1 on the ID column:
val df2_date = (df2.withColumn("date", to_date(df2("start_date_time"))).
filter($"date" === "2016-10-11").
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
In case you have a date like 2016-OCT-11, you can convert it to a java.sql.Date for comparison as follows:
val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11