Structured streaming custom deduplication - scala

I have streaming data coming in from Kafka into a DataFrame.
I want to remove duplicates based on Id and keep the latest record based on timestamp.
Sample data looks like this:
Id Name count timestamp
1 Vikas 20 2018-09-19T10:10:10
2 Vijay 50 2018-09-19T10:10:20
3 Vilas 30 2018-09-19T10:10:30
4 Vishal 10 2018-09-19T10:10:40
1 Vikas 50 2018-09-19T10:10:50
4 Vishal 40 2018-09-19T10:11:00
1 Vikas 10 2018-09-19T10:11:10
3 Vilas 20 2018-09-19T10:11:20
The output I expect is:
Id Name count timestamp
1 Vikas 10 2018-09-19T10:11:10
2 Vijay 50 2018-09-19T10:10:20
3 Vilas 20 2018-09-19T10:11:20
4 Vishal 40 2018-09-19T10:11:00
Older duplicates are removed and only the most recent record per Id is kept, based on the timestamp field.
I am using watermarking on the timestamp field.
I have tried using "df.dropDuplicates", but it keeps the older records intact and discards anything new.
My current code is as follows:
df = df.withWatermark("timestamp", "1 Day").dropDuplicates("Id", "timestamp")
How can we implement a custom dedup method that keeps the latest record as the unique record?
Any help is appreciated.

Sort by the timestamp column in descending order before dropping the duplicates:
df.withWatermark("timestamp", "1 Day")
  .sort($"timestamp".desc)
  .dropDuplicates("Id", "timestamp")
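Two caveats are worth keeping in mind with the snippet above. dropDuplicates("Id", "timestamp") uses the (Id, timestamp) pair as the key, so rows that share an Id but differ in timestamp are all treated as distinct. Also, as far as I know, Structured Streaming only supports sort after an aggregation in complete output mode. The rule the question actually asks for, "one row per Id, the one with the greatest timestamp", can be sketched in plain Python (this is not Spark code, and the function name keep_latest is made up for illustration). In a streaming job, per-key state like this is usually kept with a stateful operator such as mapGroupsWithState; how to wire that up here is an assumption and not something the thread confirms.

```python
from datetime import datetime

# Sample rows from the question: (Id, Name, count, timestamp).
rows = [
    (1, "Vikas", 20, "2018-09-19T10:10:10"),
    (2, "Vijay", 50, "2018-09-19T10:10:20"),
    (3, "Vilas", 30, "2018-09-19T10:10:30"),
    (4, "Vishal", 10, "2018-09-19T10:10:40"),
    (1, "Vikas", 50, "2018-09-19T10:10:50"),
    (4, "Vishal", 40, "2018-09-19T10:11:00"),
    (1, "Vikas", 10, "2018-09-19T10:11:10"),
    (3, "Vilas", 20, "2018-09-19T10:11:20"),
]


def keep_latest(records):
    """Return one record per Id: the one with the greatest timestamp."""
    latest = {}  # Id -> best record seen so far
    for rec in records:
        rec_id = rec[0]
        ts = datetime.fromisoformat(rec[3])
        current = latest.get(rec_id)
        if current is None or ts > datetime.fromisoformat(current[3]):
            latest[rec_id] = rec
    # Sort by Id only to make the output stable and easy to read.
    return [latest[key] for key in sorted(latest)]


result = keep_latest(rows)
for rec in result:
    print(*rec)
```

Run on the sample data, this yields the four rows listed as the expected output in the question.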

Related

Date difference between pairs of dates per ID

I have data with 2 columns, in the following format:
ID Date
1 1/1/2020
1 27/7/2020
1 15/3/2021
2 18/1/2020
3 1/1/2020
3 3/8/2020
3 18/9/2021
2 23/8/2020
2 30/2/2021
Now I would like to create a calculated field in Tableau that finds, per ID, the difference between consecutive dates, in some unit such as days.
For example, for ID 1 the difference between the first two dates is 208 calendar days, and the difference between the second and third dates is 231 days.

A table calc like the following should work if you get the partitioning, addressing and ordering right, for example by setting “Compute Using” to Date.
IF FIRST() < 0 THEN MIN([Date]) - LOOKUP(MIN([Date]), -1) END
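To check what the table calc should return, the same "current date minus previous date, restarting per ID" logic can be reproduced outside Tableau. The following is a small Python sketch, not Tableau code; the function name gaps_per_id is made up for illustration. It uses the rows for IDs 1 and 3 from the question. ID 2 is left out because 30/2/2021 does not parse as a calendar date.

```python
from datetime import datetime

# (ID, Date) pairs from the question, dates written as day/month/year.
data = [
    (1, "1/1/2020"), (1, "27/7/2020"), (1, "15/3/2021"),
    (3, "1/1/2020"), (3, "3/8/2020"), (3, "18/9/2021"),
]


def gaps_per_id(pairs):
    """Map each ID to the day gaps between its consecutive dates."""
    by_id = {}
    for key, text in pairs:
        parsed = datetime.strptime(text, "%d/%m/%Y").date()
        by_id.setdefault(key, []).append(parsed)

    gaps = {}
    for key, dates in by_id.items():
        dates.sort()  # ordering matters, as with "Compute Using"
        gaps[key] = [(later - earlier).days
                     for earlier, later in zip(dates, dates[1:])]
    return gaps


gaps = gaps_per_id(data)
print(gaps[1])  # [208, 231], the values quoted in the question
```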

After performing dropDuplicates() am getting different counts when taking the count

I called dropDuplicates on a DataFrame with the column subset region, store, and id.
The DataFrame contains some other columns such as latitude, longitude, address, Zip, Year, Month...
When I count the derived DataFrame, I get a constant value,
but when I count a selected year, say 2018, I get a different count each time I run df.count().
Could anyone please explain why this is happening?
Df.dropDuplicates("region","store","id")
Df.createOrReplaceTempView("Df")
spark.sql("select * from Df").count()
is constant whenever I run it.
But if I add a where clause on Year or Month, I get varying counts.
E.g.:
spark.sql("select * from Df where Year = 2018").count()
This statement gives a different value on each execution.
Intermediate output
Region store objectnr latitude longitude newid month year uid
Abc 20 4572 46.6383 8.7383 1 4 2018 0
Sgs 21 1425 47.783 6.7282 2 5 2019 1
Efg 26 1277 48.8293 8.2727 3 7 2019 2
Output
Region store objectnr latitude longitude newid month year uid
Abc 20 4572 46.6383 8.7383 1277 4 2018 0
Sgs 21 1425 47.783 6.7282 1425 5 2019 1
Efg 26 1277 48.8293 8.2727 1277 7 2019 2
So here newid gets the value of objectnr.
When newid comes out the same for several rows, I need to assign the latest objectnr to newid, based on the year and month.

The line
Df.dropDuplicates("region","store","id")
creates a new DataFrame; it does not modify the existing one. DataFrames are immutable.
To solve your issue, save the output of the dropDuplicates statement into a new DataFrame, as shown here:
val Df2 = Df.dropDuplicates("region","store","id")
Df2.createOrReplaceTempView("Df2")
spark.sql("select * from Df2").count()
In addition, you may get different counts when applying the filter Year=2018 because the Year column is not one of the three columns you used to drop the duplicates. Apparently you have data in your DataFrame that shares the same values in those three columns but differs in Year. Dropping duplicates is not a deterministic process, as it depends on the ordering of your data, which can vary on every run of your code.
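The ordering effect can be illustrated without Spark. The sketch below is plain Python with made-up rows (the values and the helper names drop_duplicates and count_2018 are hypothetical). It keeps the first row seen per key, which is one plausible model of what happens when the row order is not fixed. The total count stays the same, but the number of surviving 2018 rows depends on which duplicate happens to arrive first.

```python
def drop_duplicates(rows, keys):
    """Keep the first row seen for each combination of the key columns."""
    seen = set()
    kept = []
    for row in rows:
        key = tuple(row[col] for col in keys)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical rows: a and b share (region, store, id) but differ in year.
a = {"region": "Abc", "store": 20, "id": 1, "year": 2018}
b = {"region": "Abc", "store": 20, "id": 1, "year": 2019}
c = {"region": "Sgs", "store": 21, "id": 2, "year": 2019}

keys = ("region", "store", "id")


def count_2018(rows):
    """Count rows with year 2018 that survive deduplication."""
    return sum(1 for r in drop_duplicates(rows, keys) if r["year"] == 2018)


order_1 = [a, b, c]  # the 2018 row arrives first
order_2 = [b, a, c]  # the 2019 row arrives first

print(len(drop_duplicates(order_1, keys)), count_2018(order_1))  # 2 1
print(len(drop_duplicates(order_2, keys)), count_2018(order_2))  # 2 0
```

If a stable result is needed, one option is to decide explicitly which duplicate wins (for example the latest year and month) instead of relying on arrival order.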

How to union in a way that will group "date" into different columns instead of combining everything in the same column

I'm very new to PySpark. Here is the situation: I created a DataFrame for each file (9 in total, each file holding the counts for one month), and I need to union them all into one big df. The thing is, I need it to come out like this, with each month in its own separate column.
name_id | 2020_01 | 2020_02 | 2020_03
1 23 43534 3455
2 12 34534 34534
3 2352 32525 23
However, with my current code, all months end up in the same column. I've been searching the internet for a long time, but could not find anything that solves this (maybe I need groupBy, but I'm not sure how to do that). Below is my code. Thanks!
df1=spark.read.format("parquet").load("dbfs:")
df2=spark.read.format("parquet").load("dbfs:")
df3=spark.read.format("parquet").load("dbfs:")
df4=spark.read.format("parquet").load("dbfs:")
df5=spark.read.format("parquet").load("dbfs:")
df6=spark.read.format("parquet").load("dbfs:")
df7=spark.read.format("parquet").load("dbfs:")
df8=spark.read.format("parquet").load("dbfs:")
df9=spark.read.format("parquet").load("dbfs:")
#union all 9 files
union_all=df1.unionAll(df2).unionAll(df3).unionAll(df4).unionAll(df5).unionAll(df6).unionAll(df7).unionAll(df8).unionAll(df9)
Here is the current output
name_id | count | date
1 23 2020_01
2 12 2020_01
1 43534 2020_02
2 34534 2020_02
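The reshaping being asked for is a pivot from long to wide format: one row per name_id, one column per distinct date value. The sketch below shows that logic in plain Python on the four rows of the current output. It is not PySpark code, and filling a missing month with 0 is an assumption. In PySpark the usual counterpart would be something along the lines of groupBy("name_id").pivot("date").sum("count") applied to the unioned DataFrame, though that has not been run against these files.

```python
# Long-format rows as in the current output: (name_id, count, date).
long_rows = [
    (1, 23, "2020_01"),
    (2, 12, "2020_01"),
    (1, 43534, "2020_02"),
    (2, 34534, "2020_02"),
]


def pivot(rows):
    """Turn (name_id, count, date) rows into one dict of months per name_id."""
    months = sorted({date for _, _, date in rows})
    wide = {}
    for name_id, count, date in rows:
        # Start every name_id with 0 for each month (assumption for gaps).
        row = wide.setdefault(name_id, dict.fromkeys(months, 0))
        row[date] += count
    return months, wide


months, wide = pivot(long_rows)
print("name_id", *months)
for name_id in sorted(wide):
    print(name_id, *(wide[name_id][m] for m in months))
```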

Laravel: generate a list of month of a mysql date field

I want to generate with Laravel/Eloquent a list of months from a date field in a MySQL table.
How can I group the query and generate a list of the months?
The list is intended for a select field for later filtering entries per month.
id user_id garage_id date time
1 1 23 2015-09-11 02:23:00
5 1 17 2015-09-07 03:36:00
6 1 136 2015-08-02 23:22:00
7 1 83 2015-08-13 14:56:00
8 1 3 2015-07-16 14:58:00
9 1 67 2015-07-09 10:51:00
Thanks
Mirko

You can use groupBy() on the DB query, which essentially gives you a list of unique values.
$unique_dates = DB::table('your_table')
    ->where('some_condition', '=', 'if you like')
    ->groupBy('created_at') // or whatever date field
    ->get();
Then, once you've passed this data on to a Laravel Blade template, you could iterate through it like this:
<select>
@foreach ($unique_dates as $date)
<option>{{ date('F', strtotime($date->created_at)) }}</option>
@endforeach
</select>
Using 'F' as the format character gives a full textual representation e.g. January.
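One thing to watch: grouping on the full date column gives one entry per distinct date, so a month with several dates would show up more than once in the select. Grouping on the month (here, year and month together) avoids that. The idea is sketched below in plain Python rather than PHP, using the date values from the question; the function name distinct_months is made up for illustration.

```python
import calendar
from datetime import date

# Values of the date column from the question's table.
dates = [
    "2015-09-11", "2015-09-07",
    "2015-08-02", "2015-08-13",
    "2015-07-16", "2015-07-09",
]


def distinct_months(values):
    """Return month names in first-seen order, one per (year, month)."""
    seen = []
    for text in values:
        d = date.fromisoformat(text)
        key = (d.year, d.month)
        if key not in seen:
            seen.append(key)
    # calendar.month_name[9] is "September" in the default locale.
    return [calendar.month_name[month] for _, month in seen]


months = distinct_months(dates)
print(months)
```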

I found a solution outside of Laravel and Eloquent:
SELECT MONTHNAME(date) as month FROM cp_dbcarsharing GROUP BY MONTH(date)
Now I'll tweak it for my situation ;)
Thanks

crystal report display count of records

I have Crystal Reports generating the following output in my Details section:
Cats Group
Number How Old
________________________
12 0-30 days old
32 0-30 days old
34 31-60 days old
Dogs Group
Number How Old
________________________
22 over 61 days old
123 0-30 days old
But I need the above info in a table format:
Group 0-30 days old 31-60 days old over 61 days old
______________________________________________________
Cats 2 1 0
Dogs 1 0 1
thanks

You need a Cross-Tab:
Open the Cross-Tab Expert
Drag the age group (How Old) into columns
Drag animal type into rows
Drag animal into value and set count as the summarize option
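What the cross-tab computes is a count of detail rows per (animal group, age range) pair, laid out with groups as rows and age ranges as columns. The Python sketch below reproduces that on the detail rows from the question. It is only an illustration of the counting logic, not anything Crystal Reports runs.

```python
from collections import Counter

# Detail rows from the question: (group, number, how old).
details = [
    ("Cats", 12, "0-30 days old"),
    ("Cats", 32, "0-30 days old"),
    ("Cats", 34, "31-60 days old"),
    ("Dogs", 22, "over 61 days old"),
    ("Dogs", 123, "0-30 days old"),
]

# Column order as in the desired table.
columns = ["0-30 days old", "31-60 days old", "over 61 days old"]

# Count rows per (group, age range); missing pairs count as 0.
counts = Counter((group, age) for group, _, age in details)

groups = sorted({group for group, _, _ in details})
table = {group: [counts[(group, col)] for col in columns] for group in groups}

for group in groups:
    print(group, *table[group])
```

On this data the result matches the desired table: Cats 2 1 0 and Dogs 1 0 1.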

We Keep Coding
