Dropping entries of close timestamps - scala

I would like to drop all records which are duplicate entries except for a small difference in the timestamp. The offset could be any amount of time, but for simplicity I will use 2 minutes.
+-------------------+-----+----+
|Date |ColA |ColB|
+-------------------+-----+----+
|2017-07-04 18:50:21|ABC |DEF |
|2017-07-04 18:50:26|ABC |DEF |
|2017-07-04 18:50:21|ABC |KLM |
+-------------------+-----+----+
I would like my dataframe to contain only these rows:
+-------------------+-----+----+
|Date |ColA |ColB|
+-------------------+-----+----+
|2017-07-04 18:50:26|ABC |DEF |
|2017-07-04 18:50:21|ABC |KLM |
+-------------------+-----+----+
I tried something like this but this does not remove duplicates.
val joinedDfNoDuplicates = joinedDFTransformed.as("df1").join(joinedDFTransformed.as("df2"),
  col("df1.ColA") === col("df2.ColA") &&
  col("df1.ColB") === col("df2.ColB") &&
  abs(unix_timestamp(col("df1.Date")) - unix_timestamp(col("df2.Date"))) > offset
)
For now, I am just selecting distinct, or grouping by certain columns and taking the minimum (as in Find minimum for a timestamp through Spark groupBy dataframe), but I would like a more robust solution. The reason is that data outside of that interval may be valid data. Also, the offset could change, so it might be within 5 seconds or 5 minutes depending on requirements.
Somebody mentioned creating a UDF that compares the dates when all other columns are the same, but I am not sure exactly how to do that so that I either filter out the rows or add a flag and then remove those rows. Any help would be greatly appreciated.
A similar SQL question is here: Duplicate entries with different timestamp
Thanks!

I would do it like this:
Add a dummy column with a constant value, and define a Window that orders the dates over that dummy column.
Add a new column containing the date of the previous record (using lag).
Calculate the difference between the date and the previous date.
Filter your records based on the value of the difference.
The code can be something like the following:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._

val w = Window.partitionBy("dummy").orderBy("Date") // step 1: window over the dummy column
df.withColumn("dummy", lit(1)) // step 1: constant dummy column
  .withColumn("previousDate", lag($"Date", 1) over w) // step 2: date of the previous record
  .withColumn("difference", unix_timestamp($"Date") - unix_timestamp($"previousDate")) // step 3: gap in seconds
  .filter($"difference".isNull || $"difference" > offset) // step 4: e.g. keep rows whose gap exceeds the offset (the first row has a null difference)
The above solution is valid if you have pairs of records that might be close in time. If you have more than two such records, you can compare each record to the first record (not the previous one) in the window: instead of lag($"Date", 1), use first($"Date"). In this case the 'difference' column contains the time difference between the current record and the first record in the window.
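A minimal sketch of that variant, reusing the df, w and offset names from above (the exact filter condition is only one possible interpretation of "filter based on the difference"):
df.withColumn("dummy", lit(1))
  .withColumn("firstDate", first($"Date") over w) // first record of the window instead of the previous one
  .withColumn("difference", unix_timestamp($"Date") - unix_timestamp($"firstDate")) // gap from the window's first record, in seconds
  .filter($"difference" === 0 || $"difference" > offset) // e.g. keep the first record and anything far enough from it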

Related

Determine if dates are continuous in a list

I have a dataframe containing the id of some person and the date on which he performed a certain action:
+----+----------+
| id| date|
+----+----------+
| 1|2022-09-01|
| 1|2022-10-01|
| 1|2022-11-01|
| 2|2022-07-01|
| 2|2022-10-01|
| 2|2022-11-01|
| 3|2022-09-01|
| 3|2022-10-01|
| 3|2022-11-01|
+----+----------+
I need to determine whether the person performed the action in every month of a certain period of time (suppose the last 3 months). In this specific example, person number 2 missed months 08 and 09, so the condition was not met. I expect to get the following result:
+----+------------------------------------+------+
| id| dates|3month|
+----+------------------------------------+------+
| 1|[2022-09-01, 2022-10-01, 2022-11-01]| true|
| 2|[2022-07-01, 2022-10-01, 2022-11-01]| false|
| 3|[2022-09-01, 2022-10-01, 2022-11-01]| true|
+----+------------------------------------+------+
I understand that I should group by person ID and get an array of dates that correspond to it.
data.groupBy(col("id")).agg(collect_list("date") as "dates").withColumn("3month", ???)
However, I'm at a loss when it comes to writing a function that would check compliance with the requirement. I have an option using recursion, but it does not suit me due to low performance (there may be more than one thousand dates). I would be very grateful if someone could help me with my problem.
A simple trick is to use a set instead of a list in your aggregation, in order to have distinct values, and then check the size of that set.
Here are some possible solutions:
Solution 1
Assuming you have a list of months of interest on which you want to check, you can perform a preliminary filter on the required months, then aggregate and validate.
import org.apache.spark.sql.{functions => F}
import java.time.LocalDate

val requiredMonths = Seq(
  LocalDate.parse("2022-09-01"),
  LocalDate.parse("2022-10-01"),
  LocalDate.parse("2022-11-01")
)

df
  .filter(F.date_trunc("month", $"date").isInCollection(requiredMonths))
  .groupBy($"id")
  .agg(F.collect_set(F.date_trunc("month", $"date")) as "months")
  .withColumn("is_valid", F.size($"months") === requiredMonths.size)
date_trunc is used to truncate the date column to month.
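For illustration (my own small sketch, assuming spark.implicits._ is in scope for $ and toDF, as in the snippets above): every date within the same month collapses to the same truncated timestamp, which is why collect_set ends up with one entry per month.
Seq("2022-09-01", "2022-09-15", "2022-10-03").toDF("date")
  .select(F.date_trunc("month", $"date") as "month")
  .distinct()
  .show()
// should show two rows, 2022-09-01 00:00:00 and 2022-10-01 00:00:00 (one per month)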
Solution 2
Similar to the previous one, with a preliminary filter, but this time assuming you have a range of months:
import java.time.temporal.ChronoUnit

val firstMonth = LocalDate.parse("2022-09-01")
val lastMonth = LocalDate.parse("2022-11-01")
val requiredNumberOfMonths = ChronoUnit.MONTHS.between(firstMonth, lastMonth) + 1

df
  .withColumn("month", F.date_trunc("month", $"date"))
  .filter($"month" >= firstMonth && $"month" <= lastMonth)
  .groupBy($"id")
  .agg(F.collect_set($"month") as "months")
  .withColumn("is_valid", F.size($"months") === requiredNumberOfMonths)
Solution 3
Both solutions 1 and 2 have a problem: ids that have no intersection at all with the months of interest are completely excluded from the final result.
This is caused by the filter applied before grouping.
Here is a solution, based on solution 2, that does not pre-filter and therefore avoids this problem.
df
  .withColumn("month", F.date_trunc("month", $"date"))
  .groupBy($"id")
  .agg(F.collect_set(F.when($"month" >= firstMonth && $"month" <= lastMonth, $"month")) as "months")
  .withColumn("is_valid", F.size($"months") === requiredNumberOfMonths)
Now the filter is performed using a conditional collect_set.
Solutions 1 and 2 are still worth considering, because the preliminary filter can have advantages, and in some cases excluding those ids may be exactly the expected result.
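If you also want the output to match the shape requested in the question (a "dates" array plus a boolean "3month" column), a possible sketch built on top of solution 3, assuming the same firstMonth, lastMonth and requiredNumberOfMonths values, is:
df
  .withColumn("month", F.date_trunc("month", $"date"))
  .groupBy($"id")
  .agg(
    F.collect_list($"date") as "dates",
    F.collect_set(F.when($"month" >= firstMonth && $"month" <= lastMonth, $"month")) as "months"
  )
  .withColumn("3month", F.size($"months") === requiredNumberOfMonths)
  .drop("months")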

Updating dataframes in a dictionary

I have a spark dataframe like the following,
+-------------------+----------+--------------------+----------------+--------------+
| placekey|naics_code| visits_by_day|date_range_start|date_range_end|
+-------------------+----------+--------------------+----------------+--------------+
|zzy-222#627-wby-z9f| 445120|[41,126,72,96,110...| 2018-12-31| 2019-01-07|
|zzw-223#627-s6k-fzz| 722410|[25,22,92,74,98,5...| 2018-12-31| 2019-01-07|
|223-222#627-s8r-8gk| 722410|[70,82,58,80,106,...| 2018-12-31| 2019-01-07|
| ...| ...| ...| ...| ...|
|22j-222#627-vty-5cq| 722511| [11,5,9,5,4,6,5]| 2019-01-28| 2019-02-04|
+-------------------+----------+--------------------+----------------+--------------+
This dataframe has 9 unique naics_code values, and my goal is to add a few more columns derived from the other columns and partition it by naics_code to create 9 different CSV files. I am trying to partition the dataframe first and then add the columns, because I think this will make the work more efficient, since I can get away with grouping the data by the key (let me know if this is a bad idea). So I created a dictionary of dataframes and tried to add a new column to each of the partitioned dataframes in a loop:
for x in df_dict.values():
    x = x.withColumn('new_col', udf_some_func(x['col1'], x['col2']))
But when I look at the dataframes afterwards, the new column is not there; yet when I call x.show() inside the for loop, it does show the new column. Why is this the case?
Within each iteration of the loop, x is a local variable. Reassigning x will make it point to the new dataframe you created, but the old dataframe in the dict will remain untouched. You probably mean to do something like
for k, v in df_dict.items():
    df_dict[k] = v.withColumn('new_col', udf_some_func(v['col1'], v['col2']))

Average function in pyspark dataframe

I have a dataframe shown below
A user supplies a value, and I want to calculate the average of the second number in the tuple over all rows whose value is at or above that particular value.
Example: let's say the value is 10. I want to take all the rows whose value in the "value" column is greater than or equal to 10 and calculate the average over those rows. In this case, it picks up the first two rows and the output will be as shown below.
Can someone help me with this please?
Another option: you can filter the data frame first and then calculate the average; use the getItem method to access the value1 field in the struct column:
import pyspark.sql.functions as f

(df.filter(df.value >= 10)
   .agg(f.avg(df.tuple.getItem('value1')).alias('Avg'),
        f.lit(10).alias('value'))
   .show())
+------+-----+
| Avg|value|
+------+-----+
|2200.0| 10|
+------+-----+

How to associate dates with a count or an integer [duplicate]

This question already has answers here:
Spark SQL window function with complex condition
I have a DataFrame with columns "id" and "date". The date is of format yyyy-MM-dd. Here is an example:
+---------+----------+
| item_id| ds|
+---------+----------+
| 25867869|2018-05-01|
| 17190474|2018-01-02|
| 19870756|2018-01-02|
|172248680|2018-07-29|
| 41148162|2018-03-01|
+---------+----------+
I want to create a new column in which each date is associated with an integer starting from 1, such that the smallest (earliest) date gets the integer 1, the next (2nd earliest) date gets 2, and so on.
I want my DataFrame to look like this:
+---------+----------+---------+
| item_id| ds| number|
+---------+----------+---------+
| 25867869|2018-05-01| 3|
| 17190474|2018-01-02| 1|
| 19870756|2018-01-02| 1|
|172248680|2018-07-29| 4|
| 41148162|2018-03-01| 2|
+---------+----------+---------+
Explanation:
2018-01-02 comes earliest, hence its number is 1. Since there are 2 rows with the same date, 1 appears twice. After 2018-01-02 the next date is 2018-03-01, hence its number is 2, and so on. How can I create such a column?
This can be achieved by dense_rank in Window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
val win = Window.orderBy(to_date(col("ds"),"yyyy-MM-dd").asc)
val df1 = df.withColumn("number", dense_rank() over win)
df1 will have the column number as you required.
Note: to_date(col("ds"), "yyyy-MM-dd") is mandatory; otherwise the values will be treated as Strings, which does not serve the purpose.
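As a side note (my own sketch, not part of the answer), since the strings are already in the default yyyy-MM-dd format, a plain cast would order them as dates as well:
val winCast = Window.orderBy(col("ds").cast("date").asc) // cast relies on the default yyyy-MM-dd format
val dfCast = df.withColumn("number", dense_rank() over winCast)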
You could make a function that queries the oldest row without a number, something like:
SELECT * FROM tablename WHERE number IS NULL ORDER BY ds ASC
then make another query where you get the greatest number:
SELECT * FROM tablename ORDER BY number DESC
then if both queries have the same date then update the table with the same number:
UPDATE tablename SET number = 'greatest number from first query' WHERE ds = 'the date from first query'
or, if the dates are different, then do the same but add 1 to the number:
UPDATE tablename SET number= 'greatest number from first query' + 1 WHERE ds = 'the date from first query'
To make this work you should first assign the number 1 to the oldest entry.
You should do this in a loop until the first query (which checks whether any number is not yet set) returns no rows.
The first query assumes that the number column starts out all NULL; if that is not the case, you should change the WHERE condition to check for whatever marks the column as empty.

How to divide the value of current row with the following one?

In Spark SQL version 1.6, using DataFrames, is there a way to calculate, for a specific column, the fraction obtained by dividing the current row by the next one, for every row?
For example, if I have a table with one column, like so
Age
100
50
20
4
I'd like the following output
Fraction
2
2.5
5
The last row is dropped because it has no "next row" to be divided by.
Right now I am doing it by ranking the table and joining it with itself, where the rank equals rank+1.
Is there a better way to do this?
Can this be done with a Window function?
A Window function does only part of the trick. The other part can be done by defining a udf function:
def div = udf((age: Double, lag: Double) => lag/age)
First we need to find the lag using a Window function, and then pass that lag and age to the udf function to find the div:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val dataframe = Seq(
  ("A", 100),
  ("A", 50),
  ("A", 20),
  ("A", 4)
).toDF("person", "Age")

val windowSpec = Window.partitionBy("person").orderBy(col("Age").desc)
val newDF = dataframe.withColumn("lag", lag(dataframe("Age"), 1) over windowSpec)
And finally call the udf function:
newDF.filter(newDF("lag").isNotNull).withColumn("div", div(newDF("Age"), newDF("lag"))).drop("Age", "lag").show
Final output would be
+------+---+
|person|div|
+------+---+
| A|2.0|
| A|2.5|
| A|5.0|
+------+---+
Edited
As @Jacek has suggested, a better solution is to use .na.drop instead of .filter(newDF("lag").isNotNull) and to use the / operator, so we don't even need to call the udf function:
newDF.na.drop.withColumn("div", newDF("lag")/newDF("Age")).drop("Age", "lag").show
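A further variant (my own hedged sketch, not from the original answer) is to use lead instead of lag, so each row is paired with the row that follows it in the same descending order; the resulting ratios are the same:
val withNext = dataframe.withColumn("next", lead($"Age", 1) over windowSpec) // value from the following row
withNext.na.drop.withColumn("div", $"Age" / $"next").drop("Age", "next").show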