Finding missing timestamps between groups

Finding missing timestamps between groups - pyspark

I have a dataframe like below
Timestamp SiteID Count
2020-01-02T05:33:05 1044 5949
2020-01-02T05:50:05 1044 177
2020-01-02T06:00:36 1020 587
2020-01-02T06:01:05 1020 367
I need to generate the missing timestamps per minute grouped by SiteID. The count for the generated timestamp can be 0.
Thanks

Here is my try.
from pyspark.sql.functions import *
df.groupBy('SiteID').agg(collect_list(unix_timestamp('Timestamp')).alias('Timestamp')) \
.withColumn('seq', sequence(col('Timestamp')[0], col('Timestamp')[1], lit(60))) \
.withColumn('seq', explode(array_distinct(array_union('seq', 'Timestamp')))) \
.withColumn('Timestamp', from_unixtime('seq')) \
.drop('seq') \
.join(df, ['SiteId', 'Timestamp'], 'left') \
.fillna(0).show(20, False)
+------+-------------------+-----+
|SiteID|Timestamp |Count|
+------+-------------------+-----+
|1020 |2020-01-02 06:00:36|587 |
|1020 |2020-01-02 06:01:05|367 |
|1044 |2020-01-02 05:33:05|5949 |
|1044 |2020-01-02 05:34:05|0 |
|1044 |2020-01-02 05:35:05|0 |
|1044 |2020-01-02 05:36:05|0 |
|1044 |2020-01-02 05:37:05|0 |
|1044 |2020-01-02 05:38:05|0 |
|1044 |2020-01-02 05:39:05|0 |
|1044 |2020-01-02 05:40:05|0 |
|1044 |2020-01-02 05:41:05|0 |
|1044 |2020-01-02 05:42:05|0 |
|1044 |2020-01-02 05:43:05|0 |
|1044 |2020-01-02 05:44:05|0 |
|1044 |2020-01-02 05:45:05|0 |
|1044 |2020-01-02 05:46:05|0 |
|1044 |2020-01-02 05:47:05|0 |
|1044 |2020-01-02 05:48:05|0 |
|1044 |2020-01-02 05:49:05|0 |
|1044 |2020-01-02 05:50:05|177 |
+------+-------------------+-----+

Related

Flatten all map columns recursively in PySpark dataframe

I have a pyspark dataframe with multiple map columns. I want to flatten all map columns recursively. personal and financial are map type columns. Similarly, we might have more map columns.
Input dataframe:
-------------------------------------------------------------------------------------------------------
| id | name | Gender | personal | financial |
-------------------------------------------------------------------------------------------------------
| 1 | A | M | {age:20,city:Dallas,State:Texas} | {salary:10000,bonus:2000,tax:1500}|
| 2 | B | F | {city:Houston,State:Texas,Zipcode:77001} | {salary:12000,tax:1800} |
| 3 | C | M | {age:22,city:San Jose,Zipcode:940088} | {salary:2000,bonus:500} |
-------------------------------------------------------------------------------------------------------
Output dataframe:
--------------------------------------------------------------------------------------------------------------
| id | name | Gender | age | city | state | Zipcode | salary | bonus | tax |
--------------------------------------------------------------------------------------------------------------
| 1 | A | M | 20 | Dallas | Texas | null | 10000 | 2000 | 1500 |
| 2 | B | F | null | Houston | Texas | 77001 | 12000 | null | 1800 |
| 3 | C | M | 22 | San Jose | null | 940088 | 2000 | 500 | null |
--------------------------------------------------------------------------------------------------------------

use map_concat to merge the map fields and then explode them. exploding a map column creates 2 new columns - key and value. pivot the key column with value as values to get your desired output.
data_sdf. \
withColumn('personal_financial', func.map_concat('personal', 'financial')). \
selectExpr(*[c for c in data_sdf.columns if c not in ['personal', 'financial']],
'explode(personal_financial)'
). \
groupBy([c for c in data_sdf.columns if c not in ['personal', 'financial']]). \
pivot('key'). \
agg(func.first('value')). \
show(truncate=False)
# +---+----+------+-----+-------+----+-----+--------+------+----+
# |id |name|gender|State|Zipcode|age |bonus|city |salary|tax |
# +---+----+------+-----+-------+----+-----+--------+------+----+
# |1 |A |M |Texas|null |20 |2000 |Dallas |10000 |1500|
# |2 |B |F |Texas|77001 |null|null |Houston |12000 |1800|
# |3 |C |M |null |940088 |22 |500 |San Jose|2000 |null|
# +---+----+------+-----+-------+----+-----+--------+------+----+

pyspark dataframe check if string contains substring

i need help to implement below Python logic into Pyspark dataframe.
Python:
df1['isRT'] = df1['main_string'].str.lower().str.contains('|'.join(df2['sub_string'].str.lower()))
df1.show()
+--------+---------------------------+
|id | main_string |
+--------+---------------------------+
| 1 | i am a boy |
| 2 | i am from london |
| 3 | big data hadoop |
| 4 | always be happy |
| 5 | software and hardware |
+--------+---------------------------+
df2.show()
+--------+---------------------------+
|id | sub_string |
+--------+---------------------------+
| 1 | happy |
| 2 | xxxx |
| 3 | i am a boy |
| 4 | yyyy |
| 5 | from london |
+--------+---------------------------+
Final Output:
df1.show()
+--------+---------------------------+--------+
|id | main_string | isRT |
+--------+---------------------------+--------+
| 1 | i am a boy | True |
| 2 | i am from london | True |
| 3 | big data hadoop | False |
| 4 | always be happy | True |
| 5 | software and hardware | False |
+--------+---------------------------+--------+

First construct the substring list substr_list, and then use the rlike function to generate the isRT column.
df3 = df2.select(F.expr('collect_list(lower(sub_string))').alias('substr'))
substr_list = '|'.join(df3.first()[0])
df = df1.withColumn('isRT', F.expr(f'lower(main_string) rlike "{substr_list}"'))
df.show(truncate=False)

For your two dataframes,
df1 = spark.createDataFrame(['i am a boy', 'i am from london', 'big data hadoop', 'always be happy', 'software and hardware'], 'string').toDF('main_string')
df1.show(truncate=False)
df2 = spark.createDataFrame(['happy', 'xxxx', 'i am a boy', 'yyyy', 'from london'], 'string').toDF('sub_string')
df2.show(truncate=False)
+---------------------+
|main_string |
+---------------------+
|i am a boy |
|i am from london |
|big data hadoop |
|always be happy |
|software and hardware|
+---------------------+
+-----------+
|sub_string |
+-----------+
|happy |
|xxxx |
|i am a boy |
|yyyy |
|from london|
+-----------+
you can get the following result with the simple join expression.
from pyspark.sql import functions as f
df1.join(df2, f.col('main_string').contains(f.col('sub_string')), 'left') \
.withColumn('isRT', f.expr('if(sub_string is null, False, True)')) \
.drop('sub_string') \
.show()
+--------------------+-----+
| main_string| isRT|
+--------------------+-----+
| i am a boy| true|
| i am from london| true|
| big data hadoop|false|
| always be happy| true|
|software and hard...|false|
+--------------------+-----+

How to remove rows from pyspark dataframe using pattern matching?

I have a pyspark dataframe read from a CSV file that has a value column which contains hexadecimal values.
| date | part | feature | value" |
|----------|-------|---------|--------------|
| 20190503 | par1 | feat2 | 0x0 |
| 20190503 | par1 | feat3 | 0x01 |
| 20190501 | par2 | feat4 | 0x0f32 |
| 20190501 | par5 | feat9 | 0x00 |
| 20190506 | par8 | feat2 | 0x00f45 |
| 20190507 | par1 | feat6 | 0x0e62300000 |
| 20190501 | par11 | feat3 | 0x000000000 |
| 20190501 | par21 | feat5 | 0x03efff |
| 20190501 | par3 | feat9 | 0x000 |
| 20190501 | par6 | feat5 | 0x000000 |
| 20190506 | par5 | feat8 | 0x034edc45 |
| 20190506 | par8 | feat1 | 0x00000 |
| 20190508 | par3 | feat6 | 0x00000000 |
| 20190503 | par4 | feat3 | 0x0c0deffe21 |
| 20190503 | par6 | feat4 | 0x0000000000 |
| 20190501 | par3 | feat6 | 0x0123fe |
| 20190501 | par7 | feat4 | 0x00000d0 |
The requirement is to remove rows that contain values similar to 0x0, 0x00, 0x000, etc. which evaluate to decimal 0(zero) in the value column. The number of 0's after '0x' varies across the dataframe. Removing through pattern matching is the way I tried, but I wasn't successful.
myFile = sc.textFile("file.txt")
header = myFile.first()
fields = [StructField(field_name, StringType(), True) for field_name in header.split(',')]
myFile_header = myFile.filter(lambda l: "date" in l)
myFile_NoHeader = myFile.subtract(myFile_header)
myFile_df = myFile_NoHeader.map(lambda line: line.split(",")).toDF(schema)
## this is the pattern match I tried
result = myFile_df.withColumn('Test', regexp_extract(col('value'), '(0x)(0\1*\1*)',2 ))
result.show()
The other approach I used was using udf:
def convert_value(x):
return int(x,16)
Using this udf in pyspark give me
ValueError: invalid literal for int() with base 16: value

I don't really understand your regular expression, but when you want to match all strings containing 0x0 (+any number of zeros), then you can use ^0x0+$. Filtering with regular expression can be achieved with rlike and the tilde negates the match.
l = [('20190503', 'par1', 'feat2', '0x0'),
('20190503', 'par1', 'feat3', '0x01'),
('20190501', 'par2', 'feat4', '0x0f32'),
('20190501', 'par5', 'feat9', '0x00'),
('20190506', 'par8', 'feat2', '0x00f45'),
('20190507', 'par1', 'feat6', '0x0e62300000'),
('20190501', 'par11', 'feat3', '0x000000000'),
('20190501', 'par21', 'feat5', '0x03efff'),
('20190501', 'par3', 'feat9', '0x000'),
('20190501', 'par6', 'feat5', '0x000000'),
('20190506', 'par5', 'feat8', '0x034edc45'),
('20190506', 'par8', 'feat1', '0x00000'),
('20190508', 'par3', 'feat6', '0x00000000'),
('20190503', 'par4', 'feat3', '0x0c0deffe21'),
('20190503', 'par6', 'feat4', '0x0000000000'),
('20190501', 'par3', 'feat6', '0x0123fe'),
('20190501', 'par7', 'feat4', '0x00000d0')]
columns = ['date', 'part', 'feature', 'value']
df=spark.createDataFrame(l, columns)
expr = "^0x0+$"
df.filter(~ df["value"].rlike(expr)).show()
Output:
+--------+-----+-------+------------+
| date| part|feature| value|
+--------+-----+-------+------------+
|20190503| par1| feat3| 0x01|
|20190501| par2| feat4| 0x0f32|
|20190506| par8| feat2| 0x00f45|
|20190507| par1| feat6|0x0e62300000|
|20190501|par21| feat5| 0x03efff|
|20190506| par5| feat8| 0x034edc45|
|20190503| par4| feat3|0x0c0deffe21|
|20190501| par3| feat6| 0x0123fe|
|20190501| par7| feat4| 0x00000d0|
+--------+-----+-------+------------+

Get data based on latest date

Based on the dataset below, I'm trying to get the lastest cost based on the latest report date.
For example: When the report date=forecast date (column headers) then pick the values as on that report date which can be achived by this formula
IF [Report Date]=[Forecast Date] THEN [Forecasted Cost] END
but I also want to get the subsequent values as of the lastest report date i.e. 2/15/2019. How do I achieve this?
DESIRED OUTPUT
+------------+-----------+-----------+------------+------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| | 8/15/2018 | 9/15/2018 | 10/15/2018 | 11/15/2018 | 12/15/2018 | 1/15/2019 | 2/15/2019 | 3/15/2019 | 4/15/2019 | 5/15/2019 | 6/15/2019 | 7/15/2019 | 8/15/2019 |
+------------+-----------+-----------+------------+------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| Final Cost | 646.00 | 646.00 | 620.00 | 620.00 | 550.00 | 445.00 | 361.00 | 332.50 | 315.40 | 296.40 | 290.70 | 285.00 | 279.30 |
+------------+-----------+-----------+------------+------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
DATASET
+------+-------------+-----------+-----------+------------+------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| Item | Report Date | 8/15/2018 | 9/15/2018 | 10/15/2018 | 11/15/2018 | 12/15/2018 | 1/15/2019 | 2/15/2019 | 3/15/2019 | 4/15/2019 | 5/15/2019 | 6/15/2019 | 7/15/2019 | 8/15/2019 |
+------+-------------+-----------+-----------+------------+------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
| 4124 | 8/15/2018 | 646.00 | 646.00 | 658.00 | 658.00 | 658.00 | 658.00 | 658.00 | | | | | | |
| 4124 | 9/15/2018 | | 646 | 626 | 626 | 626 | 622 | 622 | 622 | | | | | |
| 4124 | 10/15/2018 | | | 620 | 620 | 620 | 585 | 585 | 585 | 555 | | | | |
| 4124 | 11/15/2018 | | | | 620 | 620 | 610 | 595 | 554.5 | 543.38 | 535.35 | | | |
| 4124 | 12/15/2018 | | | | | 550 | 535 | 505 | 490 | 490 | 490 | 490 | | |
| 4124 | 1/15/2019 | | | | | | 445 | 430 | 420 | 410 | 400 | 390 | 384 | |
| 4124 | 2/15/2019 | | | | | | | 361 | 332.5 | 315.4 | 296.4 | 290.7 | 285 | 279.3 |
+------+-------------+-----------+-----------+------------+------------+------------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+

First of all, you need to transpose your dataset, i.e. to have 4 columns "Item", "Report Date", "Forecast Date" and "Forecast Cost". Then you create a filter "forecast date >= report date" and show values by forecast date.
Now you will have multiple values for each forecast date. if you only want to get the latest value, then you can use table calculation window_min(date diff).

Date shows not continuous

I have lots of data contain daily price, I need to calculate everyday`s change and earn with percentage. I create a view can satisfy most situation, but one day not exist in it.
[Original data]
2012-02-17 | KCN | 533700 | 7.40
2012-02-20 | KCN | 288000 | 7.55
2012-02-21 | KCN | 643800 | 7.48
2012-02-23 | KCN | 5614600 | 6.88
2012-02-24 | KCN | 1809800 | 6.92
2012-02-27 | KCN | 795900 | 6.74
Here I show my Code:
create or replace view a as
select date, Code, Price
from b;
select a."Date", a.Code, a.price as previousprice, b.price as newprice
from a, b
where a.date = (select min(date) from a where date > b.date)
and a.Code = b.Code;
By using these two select, I can pick up most data ignore the weekends and holidays but one day is missing here, then I will show the result of my code:
[Real result from my Code]
2012-02-16 | KCN | 662100 | 7.46 | 7.22 | -0.24 | -24.00
2012-02-17 | KCN | 533700 | 7.22 | 7.40 | 0.18 | 18.00
2012-02-20 | KCN | 288000 | 7.40 | 7.55 | 0.15 | 15.00
2012-02-21 | KCN | 643800 | 7.55 | 7.48 | -0.07 | -7.00
2012-02-24 | KCN | 1809800 | 6.88 | 6.92 | 0.04 | 4.00
2012-02-27 | KCN | 795900 | 6.92 | 6.74 | -0.18 | -18.00
2012-02-28 | KCN | 1101000 | 6.74 | 6.52 | -0.22 | -22.00
2012-02-29 | KCN | 912500 | 6.52 | 6.69 | 0.17 | 17.00
I just don't know where is '2012-02-23'? Is there any logical mistake in my Code?

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Finding missing timestamps between groups - pyspark

I have a dataframe like below Timestamp SiteID Count 2020-01-02T05:33:05 1044 5949 2020-01-02T05:50:05 1044 177 2020-01-02T06:00:36 1020 587 2020-01-02T06:01:05 1020 367 I need to generate the missing timestamps per minute grouped by SiteID. The count for the generated timestamp can be 0. Thanks

Related

Flatten all map columns recursively in PySpark dataframe

pyspark dataframe check if string contains substring

How to remove rows from pyspark dataframe using pattern matching?

Get data based on latest date

Date shows not continuous

Categories

Resources