I have a working PySpark windowing function (Spark 2.0) that takes the last 30 days (86400*30 seconds) and counts the number of times each action in column 'a' happens per ID. The dataset I am applying this function to has multiple records for every day between '2018-01-01' and '2018-04-01'. Because this is a 30-day look back, I don't want to apply the function to data that doesn't have a full 30 days to look back on. For convenience, I want to start my counts on Feb 1st. I can't filter out January, because it is needed for February's counts. I know I can just throw a filter on the new dataframe and drop everything before February, but is there a way to do it without that extra step? It would be nice not to have to perform the calculations, which could save time.
Here's the code:
from pyspark.sql import Window
from pyspark.sql import functions as F
windowsess = Window.partitionBy("id",'a').orderBy('ts').rangeBetween(-86400*30, Window.currentRow)
df4 = df3.withColumn("2h4_ct", F.count(df3.a).over(windowsess))  # count of 'a' per (id, a) over the 30-day window
Here is a mockup of the current dataset. I didn't want to convert the ts column by hand, so I wrote in a placeholder for it.
id,a,timestamp,ts
1,soccer,2018-01-01 10:41:00, <unix_timestamp>
1,soccer,2018-01-13 10:40:00, <unix_timestamp>
1,soccer,2018-01-23 10:39:00, <unix_timestamp>
1,soccer,2018-02-01 10:38:00, <unix_timestamp>
1,soccer,2018-02-03 10:37:00, <unix_timestamp>
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>
With my made-up sample data, I want to return only the following rows:
1,soccer,2018-02-01 10:38:00, <unix_timestamp>,4
1,soccer,2018-02-03 10:37:00, <unix_timestamp>,5
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>,1
Instead I get this:
1,soccer,2018-01-01 10:41:00, <unix_timestamp>,1
1,soccer,2018-01-13 10:40:00, <unix_timestamp>,2
1,soccer,2018-01-23 10:39:00, <unix_timestamp>,3
1,soccer,2018-02-01 10:38:00, <unix_timestamp>,4
1,soccer,2018-02-03 10:37:00, <unix_timestamp>,5
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>,1
What if you use:
df4 = df3.groupby(['id', 'a', 'timestamp']).count()
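For reference, here is a minimal sketch of the filter-afterwards approach the question already mentions: the window is still computed over the January rows, and they are simply dropped at the end (the date literal comes from the example above):
from pyspark.sql import Window
from pyspark.sql import functions as F

# 30-day (86400*30 second) lookback count per (id, a), as in the question
windowsess = Window.partitionBy("id", "a").orderBy("ts").rangeBetween(-86400 * 30, 0)

df4 = (
    df3.withColumn("2h4_ct", F.count(F.col("a")).over(windowsess))
       # keep only rows that have a full 30 days of history behind them
       .filter(F.col("timestamp") >= "2018-02-01")
)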
I am loading Delta tables into an S3 delta lake. The table schema is product_code, date, quantity, crt_dt.
I am getting 6 months of forecast data; for example, if this month is May 2022, I will get quantity data for May, June, July, Aug, Sept, and Oct. The issue I am facing is that the data gets duplicated every month. I want only a single row in the Delta table, based on the most recent crt_dt. Can anyone help me with the solution I should implement?
The data is partitioned by crt_dt.
Thanks!
If you want to keep only the rows with the most recent crt_dt, this code will normally do the trick:
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number
# rank the rows within each product by crt_dt, newest first, and keep only the top row
w3 = Window.partitionBy("product_code").orderBy(col("crt_dt").desc())
df.withColumn("row", row_number().over(w3)) \
    .filter(col("row") == 1).drop("row") \
    .show()
For more details, check this: https://sparkbyexamples.com/pyspark/pyspark-select-first-row-of-each-group/
You have a dataset that you'd like to filter and then write out to a Delta table.
Another poster told you how to filter the data to meet your requirements. Here's how to filter the data and then write it out.
filtered_df = df.withColumn("row", row_number().over(w3)) \
    .filter(col("row") == 1).drop("row")  # don't end the chain with .show(); it returns None, not a dataframe
filtered_df.write.format("delta").mode("append").save("path/to/delta_lake")
You can also do this with SQL if you aren't using the Python API.
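As a rough sketch of the SQL route (the temporary view name and output path below are illustrative, not from the original post):
# Register the raw forecasts as a temp view so the dedup can be expressed in SQL
df.createOrReplaceTempView("forecasts")

deduped = spark.sql("""
    SELECT product_code, date, quantity, crt_dt
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY product_code ORDER BY crt_dt DESC) AS rn
        FROM forecasts
    ) ranked
    WHERE rn = 1
""")

deduped.write.format("delta").mode("append").save("path/to/delta_lake")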
In a Databricks notebook using PySpark, I need to create/add a new timestamp column based on an existing date column, adding hours to it based on an existing hours-bin integer column. This is to support the creation of an event-driven time-series feature set, which in this case requires that the timestamp be limited to date and hour (no minutes, seconds, etc.). I have tried string-based expr(), date_add(), and various formatted-string and cast() combinations, but I get a maddening slew of errors related to column access, parsing issues and the like. What is the simplest way to accomplish this?
In my opinion, unix_timestamp is the simplest method:
from pyspark.sql.functions import col, unix_timestamp
# convert the date to epoch seconds, add the hour offset in seconds,
# and cast the result back to a timestamp
dfResult = dfSource.withColumn("yourNewTimestampColName",
    (unix_timestamp(col("yourExistingDateCol")) +
     (col("yourExistingHoursCol") * 3600)).cast("timestamp"))
Here yourNewTimestampColName is the name of the timestamp column you want to add, yourExistingDateCol is a date column that must be present under that name in the dfSource dataframe, and yourExistingHoursCol is an integer hour column that must also be present under that name in dfSource.
unix_timestamp() converts the date to seconds, so the arithmetic is done in seconds: to add hours, multiply yourExistingHoursCol by 3,600; to add minutes, multiply by 60; to add days, multiply by 3,600*24; and so on.
Executing display(dfResult) should show the structure/content of the dfSource dataframe with a new column named yourNewTimestampColName containing the requested date/hour combination.
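Here is a minimal end-to-end sketch of the same idea; the sample rows are illustrative placeholders, not from the original post:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, unix_timestamp

spark = SparkSession.builder.getOrCreate()

# Illustrative source data: a date column plus an hours-bin integer column
dfSource = spark.createDataFrame(
    [("2022-05-01", 0), ("2022-05-01", 13)],
    ["yourExistingDateCol", "yourExistingHoursCol"],
).withColumn("yourExistingDateCol", col("yourExistingDateCol").cast("date"))

dfResult = dfSource.withColumn(
    "yourNewTimestampColName",
    (unix_timestamp(col("yourExistingDateCol"))
     + col("yourExistingHoursCol") * 3600).cast("timestamp"),
)

dfResult.show(truncate=False)  # expect 2022-05-01 00:00:00 and 2022-05-01 13:00:00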
I have a Timestamp column with 0.5 Hz frequency, which results in millions of rows. I want to reduce this data size by bucketing the timestamps to an hourly granularity, i.e. 24 observations for a particular day.
I have already reduced the data size by filtering the data by year, month, and day, but as it is still very big I want to reduce it further to an hourly basis.
I am working on Databricks and using PySpark.
I used the following command to reduce my data from years to a single day:
df = df.filter(df.Timestamp.between('2019-09-03 00:00:00','2019-09-04 00:00:00'))
I would appreciate your help.
Thanks
You can replace the minutes and seconds part of your datetime using a UDF. It might not be the best solution, but here you go:
import pyspark.sql.functions as F
from pyspark.sql.types import TimestampType
# UDF that zeroes out the minutes, seconds, and microseconds of each timestamp
date_replace_udf = F.udf(lambda date: date.replace(minute=0, second=0, microsecond=0), TimestampType())
df = df.withColumn("Timestamp", date_replace_udf(F.col("Timestamp")))
Another reference: How to truncate the time on a DateTime object in Python?
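If you are on Spark 2.3 or later, a UDF-free sketch of the same idea (assuming the Timestamp column is already a timestamp type) uses date_trunc:
import pyspark.sql.functions as F

# truncate each timestamp down to the start of its hour, without a Python UDF
df = df.withColumn("Timestamp", F.date_trunc("hour", F.col("Timestamp")))
Keeping the truncation in Spark's built-in functions generally performs better than a Python UDF on millions of rows.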
I have a table
t:`date xasc ([]date:100?2018.01.01+til 100;price:100?til 100;acc:100?`a`b)
and would like to have a new column in t which contains the count of entries in t where the date is in the range of the previous 14 days and the account is the same as in acc. For example, if there is a row
date price acc prevdate prevdate1W countprev14
2018.01.10 37 a 2018.01.09 2018.01.03 ?
then countprev14 should contain the number of observations between 2018.01.03 and 2018.01.09 where acc=a
The way I am currently doing it can probably be improved:
f:{[dates;ac;t]count select from t where date>=(dates 0),date<=(dates 1),acc=ac}[;;t]
(f')[(exec date-7 from t),'(exec date-1 from t);exec acc from t]
Thanks for the help
Another method is using a window join (wj1):
https://code.kx.com/q/ref/joins/#wj-wj1-window-join
dates:exec date from t;                          / each row's date
d:(dates-7;dates-1);                             / per-row window: 7 days before through 1 day before
wj1[d;`acc`date;t;(`acc`date xasc t;(count;`i))] / count the matching rows per account within each window
I think you're looking for something like this:
update count14:{c-0^(c:sums 1&x)y bin y-14}[i;date] by acc from t
This uses sums to get the running counts, bin to find the running count from 14 days prior, and then indexes back into the list of running counts to get the count as of that date.
The difference between the count then and the count now gives the count from the latest 14 days.
Note that the lambda here allows us to store the result of sums easily and avoid unnecessary recomputation.
I need to filter my query with different time intervals like this:
...
where
date >= '2011-07-01' and date <='2011-09-30'
and date >='2012-07-01' and date >='2012-09-30'
I suppose such code is not good, because these date conditions conflict with each other. But how do I filter only these two intervals, skipping everything else? Is it even possible? If I query like this, I don't get any results. I tried to use BETWEEN, but it does the same thing.
I worked around this by extracting the quarters from the years and calculating only the third quarter. But then the sums for the other quarters show up as zero, and I can't ignore the rows whose sum column is zero. I tried to filter WHERE price > 0 (the column the sum goes into), but it says that the column does not exist. So I wrapped the whole FROM in a subquery to compute the sum before the WHERE clause, but it still gives me an error that the column does not exist.
Also, if you need to see the query I have now, I can post it; just tell me if it is needed.
I want to do this because I need to compare the third quarter of two different years (maybe I should use another approach).
You're not going to get any results because you can't have a date that's both within 7/1/2011 through 9/30/2011 and also on or after 7/1/2012 and 9/30/2012.
You can have a date that is either between 7/1/2011 and 9/30/2011 or between 7/1/2012 and 9/30/2012.
SELECT col1 FROM table1
WHERE date BETWEEN '7/1/2011' AND '9/30/2011' OR date BETWEEN '7/1/2012' AND '9/30/2012';