Count Rows where a date is within a date range for each ID - pyspark

We have a dataframe (of multiple million rows) consisting of:
Id
start date
end date
date
For each row we take the date value and want to count how many rows exist for that Id where this date lies between start date and end date.
This value should then be stored in a new column ("sum_of_rows").
Here is the table we expect (with sum_of_rows being the column to create):
+---+----------+----------+----------+-----------+
| Id| start| end| date|sum_of_rows|
+---+----------+----------+----------+-----------+
| A|2008-01-02|2010-01-01|2009-01-01| 2|
| A|2005-01-02|2012-01-01| null| null|
| A|2013-01-02|2015-01-01|2014-01-01| 1|
| B|2002-01-02|2019-01-01|2003-01-01| 1|
| B|2015-01-02|2017-01-01|2016-01-01| 2|
+---+----------+----------+----------+-----------+
Example:
Look at the first row: we take the date "2009-01-01", consider all rows whose Id matches this row's Id (A here), and count in how many of them "2009-01-01" falls between start and end (true for rows 1 and 2 in this example).
Code for the original table:
table = spark.createDataFrame(
    [
        ["A", '2008-01-02', '2010-01-01', '2009-01-01'],
        ["A", '2005-01-02', '2012-01-01', None],
        ["A", '2013-01-02', '2015-01-01', '2014-01-01'],
        ["B", '2002-01-02', '2019-01-01', '2003-01-01'],
        ["B", '2015-01-02', '2017-01-01', '2016-01-01']
    ],
    ("Id", "start", "end", "date")
)

This code works, but it creates a "product" join (every row of an Id against every row with the same Id), which is not recommended with large volumes of data.
from pyspark.sql import functions as F

table2 = table.select(
    F.col("id"),
    F.col("start").alias("s"),
    F.col("end").alias("e"),
)
table3 = table.join(
    table2, on="id"
)
table3 = table3.withColumn(
    "one",
    F.when(
        F.col("date").between(F.col("s"), F.col("e")),
        1
    ).otherwise(0)
)
table3.groupBy(
    "Id",
    "start",
    "end",
    "date"
).agg(F.sum("one").alias("sum_of_rows")).show()
+---+----------+----------+----------+-----------+
| Id| start| end| date|sum_of_rows|
+---+----------+----------+----------+-----------+
| B|2002-01-02|2019-01-01|2003-01-01| 1|
| B|2015-01-02|2017-01-01|2016-01-01| 2|
| A|2008-01-02|2010-01-01|2009-01-01| 2|
| A|2013-01-02|2015-01-01|2014-01-01| 1|
| A|2005-01-02|2012-01-01| null| 0|
+---+----------+----------+----------+-----------+
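One way to keep the intermediate data smaller, sketched below (not tested at scale; ranges and counted are just illustrative names), is to push the between condition into the join itself and use a left join, so unmatched pairs are dropped inside the join instead of being carried along as zeros to be summed later; Spark still compares each row against the intervals of its own Id, but nothing else. As in the output above, rows with a null date end up with 0 rather than null:
from pyspark.sql import functions as F

ranges = table.select(
    F.col("Id"),
    F.col("start").alias("s"),
    F.col("end").alias("e"),
)
counted = (
    table.alias("t")
    .join(
        ranges.alias("r"),
        (F.col("t.Id") == F.col("r.Id"))
        & F.col("t.date").between(F.col("r.s"), F.col("r.e")),
        "left",  # keep rows whose date matches no interval (or is null)
    )
    .groupBy(F.col("t.Id"), F.col("t.start"), F.col("t.end"), F.col("t.date"))
    # count() skips nulls, so unmatched rows contribute 0
    .agg(F.count(F.col("r.s")).alias("sum_of_rows"))
)
counted.show()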

Related

Create summary of Spark Dataframe

I have a Spark Dataframe which I am trying to summarise in order to find overly long columns:
// Set up test data
// Look for long columns (>= 3): row 1 is OK, row 2 is bad on col3, row 3 is bad on col2
import org.apache.spark.sql.functions._
import spark.implicits._ // needed outside spark-shell for toDF, $ and .as[String]

val df = Seq(
  (1, "a", "bb", "cc", "file1"),
  (2, "d", "ee", "fff", "file2"),
  (3, "g", "hhhh", "ii", "file3")
).toDF("rowId", "col1", "col2", "col3", "filename")
I can summarise the lengths of the columns and find overly long ones like this:
val df2 = df.columns
.map(c => (c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("columnName", "maxLength")
.filter($"maxLength" > 2)
If I try and add the existing filename column to the map I get an error:
val df2 = df.columns
.map(c => ($"filename", c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("fn", "columnName", "maxLength")
.filter($"maxLength" > 2)
I have tried a few variations of the $"filename" syntax. How can I incorporate the filename column into the summary?
The desired output would be:
columnName  maxLength  filename
col2        4          file3
col3        3          file2
The real dataframes have 300+ columns and millions of rows, so I cannot hard-code the column names.
@wBob, does the following achieve your goal?
Group by filename and get the maximum length per column:
val cols = df.columns.dropRight(1) // to remove the filename col
val maxLength = cols.map(c => s"max(length(${c})) as ${c}").mkString(",")
print(maxLength)
df.createOrReplaceTempView("temp")
val df1 = spark
  .sql(s"select filename, ${maxLength} from temp group by filename")
df1.show()
With the output:
+--------+-----+----+----+----+
|filename|rowId|col1|col2|col3|
+--------+-----+----+----+----+
| file1| 1| 1| 2| 2|
| file2| 1| 1| 2| 3|
| file3| 1| 1| 4| 2|
+--------+-----+----+----+----+
Use subqueries to get the maximum per column and concatenate the results using union:
df1.createOrReplaceTempView("temp2")
val res = cols.map(col => {
spark.sql(s"select '${col}' as columnName, $col as maxLength, filename from temp2 " +
s"where $col = (select max(${col}) from temp2)")
}).reduce(_ union _)
res.show()
With the result:
+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
| rowId| 1| file1|
| rowId| 1| file2|
| rowId| 1| file3|
| col1| 1| file1|
| col1| 1| file2|
| col1| 1| file3|
| col2| 4| file3|
| col3| 3| file2|
+----------+---------+--------+
Note that there are multiple entries for rowId and col1 since the maximum is not unique.
There is probably a more elegant way to write it, but I am struggling to find one at the moment.
Pushed a little further for a better result.
df.select(
    col("*"),
    array( // build an array of (length, column name, column value) structs
      (for { col_name <- df.columns } yield
        struct(
          length(col(col_name)).as("length"),
          lit(col_name).as("col"),
          col(col_name).cast("String").as("col_value")
        )
      ).toSeq: _*
    ).alias("rowInfo")
  )
  .select(
    col("rowId"),
    explode( // explode the array into rows
      expr("filter(rowInfo, x -> x.length >= 3)") // keep only entries with the length you're interested in
    ).as("rowInfo")
  )
  .select(
    col("rowId"),
    col("rowInfo.*") // turn struct fields into columns
  )
  .sort("length").show
+-----+------+--------+---------+
|rowId|length| col|col_value|
+-----+------+--------+---------+
| 2| 3| col3| fff|
| 3| 4| col2| hhhh|
| 3| 5|filename| file3|
| 1| 5|filename| file1|
| 2| 5|filename| file2|
+-----+------+--------+---------+
It might be enough to sort your table by total text length. This can be achieved quickly and concisely.
df.select(
    col("*"),
    length( // take the length
      concat( // slap all the columns together
        (for (col_name <- df.columns) yield col(col_name)).toSeq: _*
      )
    ).as("length")
  )
  .sort( // order by total length
    col("length").desc
  ).show()
+-----+----+----+----+--------+------+
|rowId|col1|col2|col3|filename|length|
+-----+----+----+----+--------+------+
| 3| g|hhhh| ii| file3| 13|
| 2| d| ee| fff| file2| 12|
| 1| a| bb| cc| file1| 11|
+-----+----+----+----+--------+------+
Sorting an array[struct] sorts on the first field first and the second field next. This works because we put the length of the string up front. If you re-order the struct fields you'll get different results. You could easily keep more than one result if you wanted, but I think discovering that a row is problematic is likely enough.
df.select(
    col("*"),
    reverse( // flip the ascending sort so the largest struct comes first
      sort_array( // sort_array sorts ascending by default
        array( // put every column's length, name and value into an array of structs
          (for (col_name <- df.columns) yield
            struct(length(col(col_name)), lit(col_name), col(col_name).cast("String"))
          ).toSeq: _*
        )
      )
    )(0) // grab the row max
      .alias("rowMax")
  )
  .sort("rowMax").show
+-----+----+----+----+--------+--------------------+
|rowId|col1|col2|col3|filename| rowMax|
+-----+----+----+----+--------+--------------------+
| 1| a| bb| cc| file1|[5, filename, file1]|
| 2| d| ee| fff| file2|[5, filename, file2]|
| 3| g|hhhh| ii| file3|[5, filename, file3]|
+-----+----+----+----+--------+--------------------+

Spark Dataframe Combine 2 Columns into Single Column, with Additional Identifying Column

I'm trying to split and then combine 2 DataFrame columns into 1, with another column identifying which column it originated from. Here is the code to generate the sample DF
val data = Seq(("1", "in1,in2,in3", null), ("2","in4,in5","ex1,ex2,ex3"), ("3", null, "ex4,ex5"), ("4", null, null))
val df = spark.sparkContext.parallelize(data).toDF("id", "include", "exclude")
This is the sample DF
+---+-----------+-----------+
| id| include| exclude|
+---+-----------+-----------+
| 1|in1,in2,in3| null|
| 2| in4,in5|ex1,ex2,ex3|
| 3| null| ex4,ex5|
| 4| null| null|
+---+-----------+-----------+
which I'm trying to transform into
+---+----+---+
| id|type|col|
+---+----+---+
| 1|incl|in1|
| 1|incl|in2|
| 1|incl|in3|
| 2|incl|in4|
| 2|incl|in5|
| 2|excl|ex1|
| 2|excl|ex2|
| 2|excl|ex3|
| 3|excl|ex4|
| 3|excl|ex5|
+---+----+---+
EDIT: I should mention that the data inside each of the cells in the example DF is just for visualization and doesn't need to have the form in1, ex1, etc.
I can get it to work with union, as so:
df.select($"id", lit("incl").as("type"), explode(split(col("include"), ",")))
.union(
df.select($"id", lit("excl").as("type"), explode(split(col("exclude"), ",")))
)
but I was wondering if this was possible to do without using union.
The approach I am thinking of is to club the include and exclude columns together, then apply the explode function, then keep only the values that are not null, and finally derive the type with a case statement.
This might be a long process.
WITH cte AS (
  SELECT id, concat_ws(',', include, exclude) AS outputcol  -- concat_ws skips nulls
  FROM src  -- src = the source table or view
),
ctes AS (
  SELECT id, explode(split(outputcol, ',')) AS finalcol FROM cte
)
SELECT id,
       CASE WHEN finalcol LIKE 'in%' THEN 'incl' ELSE 'excl' END AS type,
       finalcol
FROM ctes
WHERE finalcol IS NOT NULL
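For what it's worth, here is a union-free sketch of that "club the columns and explode once" idea, written in PySpark for illustration (the Scala API has the same functions) and assuming the id/include/exclude columns of the sample df; it carries the type label explicitly instead of inferring it from the in/ex prefix, so it also works for the general data mentioned in the EDIT:
from pyspark.sql import functions as F

result = (
    df
    # pack (type, values) pairs into one array so a single explode replaces the union
    .select(
        "id",
        F.explode(
            F.array(
                F.struct(F.lit("incl").alias("type"), F.col("include").alias("vals")),
                F.struct(F.lit("excl").alias("type"), F.col("exclude").alias("vals")),
            )
        ).alias("x"),
    )
    .where(F.col("x.vals").isNotNull())  # drop the null include/exclude cells
    .select(
        "id",
        F.col("x.type").alias("type"),
        F.explode(F.split(F.col("x.vals"), ",")).alias("col"),
    )
)
result.show()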

Use PySpark Dataframe column in another spark sql query

I have a situation where I'm trying to query a table and use the result of that query (a dataframe) as the IN clause of another query.
From the first query I have the dataframe below:
+-----------------+
|key |
+-----------------+
| 10000000000004|
| 10000000000003|
| 10000000000008|
| 10000000000009|
| 10000000000007|
| 10000000000006|
| 10000000000010|
| 10000000000002|
+-----------------+
And now I want to run a query like the one below, using the values of that dataframe dynamically instead of hard-coding them:
spark.sql("""select country from table1 where key in (10000000000004, 10000000000003, 10000000000008, 10000000000009, 10000000000007, 10000000000006, 10000000000010, 10000000000002)""").show()
I tried the following, but it didn't work:
df = spark.sql("""select key from table0 """)
a = df.select("key").collect()
spark.sql("""select country from table1 where key in ({0})""".format(a)).show()
Can somebody help me?
You should use an (inner) join between two data frames to get the countries you would like. See my example:
# Create a list of countries with Id's
countries = [('Netherlands', 1), ('France', 2), ('Germany', 3), ('Belgium', 4)]
# Create a list of Ids
numbers = [(1,), (2,)]
# Create two data frames
df_countries = spark.createDataFrame(countries, ['CountryName', 'Id'])
df_numbers = spark.createDataFrame(numbers, ['Id'])
The data frames look like the following:
df_countries:
+-----------+---+
|CountryName| Id|
+-----------+---+
|Netherlands| 1|
| France| 2|
| Germany| 3|
| Belgium| 4|
+-----------+---+
df_numbers:
+---+
| Id|
+---+
| 1|
| 2|
+---+
You can join them as follows:
df_countries.join(df_numbers, on='Id', how='inner').show()
Resulting in:
+---+-----------+
| Id|CountryName|
+---+-----------+
| 1|Netherlands|
| 2| France|
+---+-----------+
Hope that clears things up!
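Applied to the actual tables from the question (table0 and table1), the same idea might look like the sketch below; a left-semi join behaves like WHERE key IN (SELECT key FROM table0) and never pulls the keys to the driver:
keys = spark.sql("select key from table0")
result = (
    spark.table("table1")
    .join(keys, on="key", how="left_semi")  # acts like: where key in (select key from table0)
    .select("country")
)
result.show()

# A plain subquery in Spark SQL also works, with no collect() needed:
# spark.sql("select country from table1 where key in (select key from table0)").show()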

Calculate sequences of constantly increasing dates Spark

I have a dataframe in Spark with a name column and dates. I would like to find all continuous sequences of constantly increasing dates (day after day) for each name and calculate their durations. The output should contain the name, the start date of the sequence, and the duration of that period (number of days).
How can I do this with Spark functions?
A consecutive sequence of dates example:
2019-03-12
2019-03-13
2019-03-14
2019-03-15
I have come up with the following solution, but it calculates the overall number of days per name and does not split it into sequences:
val result = allDataDf
.groupBy($"name")
.agg(count($"date").as("timePeriod"))
.orderBy($"timePeriod".desc)
.head()
Also, I tried with ranks, but the count column only contains 1s, for some reason:
val names = Window
.partitionBy($"name")
.orderBy($"date")
val result = allDataDf
.select($"name", $"date", rank over names as "rank")
.groupBy($"name", $"date", $"rank")
.agg(count($"*") as "count")
The output looks like this:
+-----------+----------+----+-----+
|stationName| date|rank|count|
+-----------+----------+----+-----+
| NAME|2019-03-24| 1| 1|
| NAME|2019-03-25| 2| 1|
| NAME|2019-03-27| 3| 1|
| NAME|2019-03-28| 4| 1|
| NAME|2019-01-29| 5| 1|
| NAME|2019-03-30| 6| 1|
| NAME|2019-03-31| 7| 1|
| NAME|2019-04-02| 8| 1|
| NAME|2019-04-05| 9| 1|
| NAME|2019-04-07| 10| 1|
+-----------+----------+----+-----+
Finding consecutive dates is fairly easy in SQL. You could do it with a query like:
WITH s AS (
SELECT
stationName,
date,
date_add(date, -(row_number() over (partition by stationName order by date))) as discriminator
FROM stations
)
SELECT
stationName,
MIN(date) as start,
COUNT(1) AS duration
FROM s GROUP BY stationName, discriminator
Fortunately, we can use SQL in Spark. Let's check that it works (I used different dates):
val df = Seq(
("NAME1", "2019-03-22"),
("NAME1", "2019-03-23"),
("NAME1", "2019-03-24"),
("NAME1", "2019-03-25"),
("NAME1", "2019-03-27"),
("NAME1", "2019-03-28"),
("NAME2", "2019-03-27"),
("NAME2", "2019-03-28"),
("NAME2", "2019-03-30"),
("NAME2", "2019-03-31"),
("NAME2", "2019-04-04"),
("NAME2", "2019-04-05"),
("NAME2", "2019-04-06")
).toDF("stationName", "date")
.withColumn("date", date_format(col("date"), "yyyy-MM-dd"))
df.createTempView("stations");
val result = spark.sql(
"""
|WITH s AS (
| SELECT
| stationName,
| date,
| date_add(date, -(row_number() over (partition by stationName order by date)) + 1) as discriminator
| FROM stations
|)
|SELECT
| stationName,
| MIN(date) as start,
| COUNT(1) AS duration
|FROM s GROUP BY stationName, discriminator
""".stripMargin)
result.show()
It seems to output the correct dataset:
+-----------+----------+--------+
|stationName| start|duration|
+-----------+----------+--------+
| NAME1|2019-03-22| 4|
| NAME1|2019-03-27| 2|
| NAME2|2019-03-27| 2|
| NAME2|2019-03-30| 2|
| NAME2|2019-04-04| 3|
+-----------+----------+--------+
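For reference, the same discriminator trick can be written with the DataFrame API instead of a SQL string. Here is a sketch in PySpark (the Scala functions have the same names), assuming a DataFrame with stationName and date columns like the one built above; note that passing a Column as the second argument of date_add needs Spark 3.x, on older versions wrap the whole expression in expr(...):
from pyspark.sql import functions as F, Window

w = Window.partitionBy("stationName").orderBy("date")
result = (
    df
    # subtracting the row number maps every run of consecutive dates to one constant value
    .withColumn("discriminator", F.date_add(F.col("date"), -F.row_number().over(w)))
    .groupBy("stationName", "discriminator")
    .agg(F.min("date").alias("start"), F.count("*").alias("duration"))
    .drop("discriminator")
)
result.orderBy("stationName", "start").show()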

Create a new column based on date checking

I have two dataframes in Scala:
df1 =
ID Field1
1 AAA
2 BBB
4 CCC
and
df2 =
PK start_date_time
1 2016-10-11 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.
I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.
The result should be this one:
df1 =
ID Field1 check
1 AAA 1
2 BBB 0
4 CCC 0
In my previous question I had two dataframes and it was suggested to use joining and filtering. However, in this case that won't work. My initial idea was to use a udf, but I am not sure how to make it work for this case.
You can combine join and withColumn for this case, i.e. first join with df2 on the ID column and then use the when/otherwise syntax to set the check column:
import org.apache.spark.sql.functions.{lit, to_date, when}
import spark.implicits._ // for the $"col" syntax
val df2_date = df2.withColumn("date", to_date(df2("start_date_time"))).withColumn("check", lit(1)).select($"PK".as("ID"), $"date", $"check")
df1.join(df2_date, Seq("ID"), "left").withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0)).drop("date").show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
Another option: first filter df2, and then join it back with df1 on the ID column:
val df2_date = (df2.withColumn("date", to_date(df2("start_date_time"))).
filter($"date" === "2016-10-11").
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
In case you have a date like 2016-OCT-11, you can convert it to a java.sql.Date for comparison as follows:
val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11