Pyspark Dataframes creation in a loop

I am trying to create dynamic dataframes in a loop, advancing the start date by one day on each iteration:
startdate = startdate + 1 day
But my loop runs only once. It cannot run the 2nd iteration because creating the dataframe fails the next time.
Please help me get this loop to run so that each iteration creates one dataframe for its date range.
Example:
while startdate < enddate:
    df = SpDF.filter((SpDF_Main['date'] >= lit(startdate)) & (SpDF_Main['FILTER_DATE'] <= lit(enddate))).drop(col('FILTER_DATE'))
    ----
    ------
    startdate = startdate + 1 day
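
One way such a loop could be written is sketched below. It assumes startdate and enddate are Python date objects, that each iteration should cover exactly one day, and that FILTER_DATE is the date column to filter on. The helper names daily_ranges and frames_by_day are hypothetical and do not come from the question.

```python
from datetime import date, timedelta


def daily_ranges(startdate, enddate):
    """Yield (day_start, day_end) pairs, one per day, while day_start < enddate."""
    current = startdate
    while current < enddate:
        following = current + timedelta(days=1)  # startdate = startdate + 1 day
        yield current, following
        current = following


def frames_by_day(df, startdate, enddate, date_col="FILTER_DATE"):
    """Return {day: DataFrame}, one filtered DataFrame per day.

    Assumes `df` is a PySpark DataFrame with a date-typed column `date_col`.
    """
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    frames = {}
    for day_start, day_end in daily_ranges(startdate, enddate):
        in_day = (F.col(date_col) >= F.lit(day_start)) & (F.col(date_col) < F.lit(day_end))
        frames[day_start] = df.filter(in_day).drop(date_col)
    return frames


# The date arithmetic alone, without Spark:
ranges = list(daily_ranges(date(2022, 1, 30), date(2022, 2, 2)))
```

Storing each result under its own key keeps earlier dataframes from being overwritten on the next pass, and timedelta(days=1) handles month boundaries without manual date arithmetic.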

Related

DataBricks 10.2 pyspark 3.2.0; How Do I Add a New Timestamp Column Based on Another Date and Integer (Hours) Column?

In a Databricks notebook using PySpark, I need to add a new timestamp column based on an existing date column, adding hours to it from an existing integer hours-bin column. This is to support the creation of an event-driven time-series feature set, which in this case requires that the timestamp be limited to date and hour (no minutes, seconds, etc.). I have tried string-based expr(), date_add(), and various formatted-string and cast() combinations, but I get a maddening slew of errors related to column access, parsing issues and the like. What is the simplest way to accomplish this?

In my opinion, unix_timestamp is the simplest method:
from pyspark.sql.functions import col, unix_timestamp

dfResult = dfSource.withColumn("yourNewTimestampColName",
    (unix_timestamp(col("yourExistingDateCol")) +
     (col("yourExistingHoursCol") * 3600)).cast("timestamp"))
Where yourNewTimestampColName represents the name of the timestamp column that you want to add, yourExistingDateCol represents a date column that must be present with this name within the dfSource dataframe and yourExistingHoursCol represents an integer-based hour column that must also be present with this name within the dfSource dataframe.
The unix_timestamp() function converts the date to a number of seconds, so anything added to it is also in seconds: to add hours, multiply yourExistingHoursCol by 3,600; to add minutes, multiply by 60; to add days, multiply by 3,600*24; and so on.
Executing display(dfResult) should show structure/content of the dfSource dataframe with a new column named yourNewTimestampColName containing the date/hour combination requested.
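
The same arithmetic can be checked outside Spark. The following is a minimal plain-Python sketch, assuming UTC throughout (Spark itself uses the session time zone); date_plus_hours is a hypothetical helper name, not part of PySpark.

```python
import calendar
from datetime import date, datetime, timezone


def date_plus_hours(day, hours):
    """Mirror unix_timestamp(date) + hours * 3600, then convert back to a timestamp."""
    epoch_seconds = calendar.timegm(day.timetuple())  # midnight of `day` as epoch seconds (UTC)
    total_seconds = epoch_seconds + hours * 3600      # 3,600 seconds per hour
    return datetime.fromtimestamp(total_seconds, tz=timezone.utc)


result = date_plus_hours(date(2022, 1, 15), 13)
print(result.isoformat())
```

Because the date starts at midnight and only whole hours are added, the resulting timestamp never carries minutes or seconds.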

Azure Data Factory: How to set variable after for each successfully run?

I am working on an Azure copy-data pipeline in which I want to validate the data: the number of input records should match the number of output records.
The pipeline copies incremental data. For this I have the following Set Variable activities and queries:
1: Start time => set to the current UTC time, to record the pipeline start time.
(Utcnow())
2: End time => set to the pipeline end time after the run.
(Utcnow())
3: Total time => I subtract the start time from the end time to calculate the elapsed time.
(@string(div(sub(ticks(Utcnow()),ticks(variables('Start'))),600000000)))
4: Then I check the total inserted records against the Kusto table (the 50m in ago(50m) is the total time calculated by the pipeline).
Kusto query:
5: Output_LK => sample | where ingestion_time() > ago(50m) | summarize rowcount=count() | project rowcount | take 1
6: Get Request:
Sample1
| summarize rowcount=count()
| project rowcount
Pipeline query: @if(equals(activity('Output_LK').output.firstrow.rowcount,activity('Get Request').output.firstrow.rowcount),concat('Total Number Of input RECORDS is ',activity('Get Request').output.firstrow.rowcount,'Total Number of output records is ',activity('Output_LK').output.firstrow.rowcount),concat('Total Number Of input RECORDS is ',activity('Get Request').output.firstrow.rowcount,'Total Number of output records is ',activity('Output_LK').output.firstrow.rowcount))
I get a result, but the third step is executed when the ForEach loop completes,
so my result is not exactly what I expected.

The ago() function is evaluated relative to the time the query executes, so it covers a different time range on each run, which can produce different results.
You can query by the start time and the duration instead, to get the exact time range:
sample
| where ingestion_time() between (startTime .. 50m)
| summarize rowcount=count()
| project rowcount
| take 1
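
The elapsed-time expression in the question divides a tick difference by 600,000,000. A tick is a 100-nanosecond interval, so that divisor corresponds to one minute. The plain-Python sketch below reproduces that arithmetic and derives the fixed range that a start time plus a duration describes. The function names are hypothetical, and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

TICKS_PER_MINUTE = 600_000_000  # 10,000,000 ticks per second * 60 seconds


def ticks(moment):
    """Number of 100-nanosecond intervals since 0001-01-01 00:00:00."""
    return (moment - datetime(1, 1, 1)) // timedelta(microseconds=1) * 10


def elapsed_minutes(start, end):
    """Same arithmetic as div(sub(ticks(end), ticks(start)), 600000000)."""
    return (ticks(end) - ticks(start)) // TICKS_PER_MINUTE


def fixed_window(start, minutes):
    """Bounds of a range that begins at `start` and lasts `minutes` minutes."""
    return start, start + timedelta(minutes=minutes)


start = datetime(2022, 3, 1, 9, 0, 0)   # made-up pipeline start
end = datetime(2022, 3, 1, 9, 50, 30)   # made-up pipeline end
minutes = elapsed_minutes(start, end)
window = fixed_window(start, minutes)
```

Unlike a range measured back from "now", the window computed here stays the same no matter when the check runs. Note that the integer division drops any partial minute.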

Pyspark windows on last 30 days on subset of data

I have a working PySpark window function (Spark 2.0) that looks back over the last 30 days (86400*30 seconds) and counts the number of times each action in column 'a' happens per ID. The dataset I am applying this function to has multiple records for every day between '2018-01-01' and '2018-04-01'. Because this is a 30-day look-back, I don't want to apply the function to data that doesn't have a full 30 days to look back on. For convenience, I want to start my counts on Feb 1st. I can't filter out January, because it is needed for February's counts. I know I can just put a filter on the new dataframe and drop the rows before February, but is there a way to do it without that extra step? It would be nice not to have to perform those calculations, which could save time.
Here's the code:
from pyspark.sql import Window
from pyspark.sql import functions as F
windowsess = Window.partitionBy("id",'a').orderBy('ts').rangeBetween(-86400*30, Window.currentRow)
df4 = df3.withColumn("2h4_ct", F.count(df3.a).over(windowsess))
Mock-up of the current dataset. I didn't want to convert the ts column by hand, so I wrote a placeholder for it.
id,a,timestamp,ts
1,soccer,2018-01-01 10:41:00, <unix_timestamp>
1,soccer,2018-01-13 10:40:00, <unix_timestamp>
1,soccer,2018-01-23 10:39:00, <unix_timestamp>
1,soccer,2018-02-01 10:38:00, <unix_timestamp>
1,soccer,2018-02-03 10:37:00, <unix_timestamp>
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>
With my made-up sample data, I want to return the following rows:
1,soccer,2018-02-01 10:38:00, <unix_timestamp>,4
1,soccer,2018-02-03 10:37:00, <unix_timestamp>,5
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>,1
Instead, I get this:
1,soccer,2018-01-01 10:41:00, <unix_timestamp>,1
1,soccer,2018-01-13 10:40:00, <unix_timestamp>,2
1,soccer,2018-01-23 10:39:00, <unix_timestamp>,3
1,soccer,2018-02-01 10:38:00, <unix_timestamp>,4
1,soccer,2018-02-03 10:37:00, <unix_timestamp>,5
1,leagueoflegends,2018-02-04 10:36:00, <unix_timestamp>,1

What if you use:
df4 = df3.groupby(['id', 'a', 'timestamp']).count()
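
For comparison, here is a plain-Python sketch of the behaviour the question describes, not Spark code: every row stays visible to the 30-day look-back, but counts are only computed and reported for rows on or after a cut-off date. The function name lookback_counts is hypothetical, and the rows are the mock-up rows from the question with real timestamps in place of the placeholder column.

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(days=30)


def lookback_counts(rows, report_from):
    """rows: (id, action, timestamp) tuples.

    For each row at or after `report_from`, count the rows with the same id
    and action whose timestamp lies in [timestamp - 30 days, timestamp].
    """
    out = []
    for rid, action, ts in rows:
        if ts < report_from:
            continue  # still counted by later rows, but not reported itself
        n = sum(
            1
            for other_id, other_action, other_ts in rows
            if other_id == rid and other_action == action and ts - LOOKBACK <= other_ts <= ts
        )
        out.append((rid, action, ts, n))
    return out


rows = [
    (1, "soccer", datetime(2018, 1, 1, 10, 41)),
    (1, "soccer", datetime(2018, 1, 13, 10, 40)),
    (1, "soccer", datetime(2018, 1, 23, 10, 39)),
    (1, "soccer", datetime(2018, 2, 1, 10, 38)),
    (1, "soccer", datetime(2018, 2, 3, 10, 37)),
    (1, "leagueoflegends", datetime(2018, 2, 4, 10, 36)),
]
# Note: 2018-01-01 10:41 is more than 30 days before 2018-02-01 10:38,
# so an exact 30-day range does not count it for the February rows.
result = lookback_counts(rows, report_from=datetime(2018, 2, 1))
```

Only the three February rows are returned, and the January rows are skipped before any counting is done for them.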

Spark Streaming Guarantee Specific Start Window Time

I'm using Spark Structured Streaming to read data from Kinesis. My connection is as follows:
val kinesis = spark
  .readStream
  .format("kinesis")
  .option("streams", streamName)
  .option("endpointUrl", endpointUrl)
  .option("initialPositionInStream", "earliest")
  .option("format", "json")
  .schema(<my-schema>)
  .load
The data comes from several IoT devices, each of which has a unique id. I need to aggregate the data by this id and by a tumbling window over the timestamp field, as follows:
val aggregateData = kinesis
  .groupBy($"uid", window($"timestamp", "15 minute", "15 minute"))
  .agg(...)
The problem I'm encountering is that I need to guarantee that every window starts at a round time (such as 00:00:00, 00:15:00 and so on). I also need a guarantee that only rows covering full 15-minute windows are output to my sink. What I'm currently doing is:
val query = aggregateData
  .writeStream
  .foreach(postgreSQLWriter)
  .outputMode("update")
  .start()
  .awaitTermination()
Here postgreSQLWriter is a StreamWriter I created for inserting each row into a PostgreSQL database. How can I force my windows to be exactly 15 minutes long, with start times at round 15-minute timestamp values, for each device's unique id?

Question 1:
To make windows start at specific times, the window grouping function in Spark takes one more parameter, the start offset.
Specifying it shifts the window boundaries by that amount of time past the hour.
Example:
dataframe.groupBy($"Column1", window($"TimeStamp", "22 minute", "1 minute", "15 minute"))
The syntax above groups by Column1 and creates windows of 22 minutes' duration, with a slide interval of 1 minute and an offset of 15 minutes.
For example, two consecutive windows are:
window1: 8:15 (8:00 plus the 15-minute offset) to 8:37 (8:15 plus 22 minutes)
window2: 8:16 (previous window start plus 1 minute) to 8:38 (22 minutes long again)
Question 2:
To push only the windows that cover the full 15 minutes, create a count column that counts the number of events in each window. Once the count reaches 15 (this assumes one event per minute), push the window to wherever you want using a filter.
Calculating the count:
dataframe.groupBy($"Column1", window($"TimeStamp", "22 minute", "1 minute", "15 minute")).agg(count($"Column2").as("count"))
writeStream with a filter that keeps only the windows whose count is 15:
aggregateddata.filter($"count" === 15).writeStream.format(....).outputMode("complete").start()
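
To see how window boundaries relate to the offset, the bucketing arithmetic for a tumbling window can be sketched in plain Python. As far as I know, Spark measures the start offset relative to 1970-01-01 00:00:00 UTC, so 15-minute tumbling windows with no offset already fall on :00, :15, :30 and :45. The helper tumbling_window is hypothetical and is not a Spark API. It assumes UTC timestamps.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window(ts, size, offset=timedelta(0)):
    """Return (start, end) of the tumbling window of length `size` containing `ts`.

    Window starts fall at EPOCH + offset + k * size for integer k.
    """
    into_window = (ts - EPOCH - offset) % size  # how far ts is past its window start
    start = ts - into_window
    return start, start + size


event = datetime(2022, 5, 1, 8, 7, 30, tzinfo=timezone.utc)
aligned = tumbling_window(event, timedelta(minutes=15))
shifted = tumbling_window(event, timedelta(minutes=15), offset=timedelta(minutes=5))
```

With no offset the event at 08:07:30 lands in the 08:00 to 08:15 window. With a 5-minute offset it lands in the 08:05 to 08:20 window.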

Create a Spark Dataframe on Time

A quick question that I could not find an answer to on Google.
What is the best way to create a Spark DataFrame of timestamps?
Let's say I have a start point, an end point and a 15-minute interval. What would be the best way to do this in Spark?

Try a Spark SQL function, like this: DF.withColumn("regTime", current_timestamp). It can also replace an existing column.
You can also convert a string to a date or timestamp with DF.withColumn("regTime", to_date(spark_temp_user_day("dateTime"))).
Here regTime is a column of DF.
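
For the start point, end point and 15-minute interval described in the question, one option is to build the timestamps on the driver first. The sketch below is plain Python. The helper timestamp_range and the column name ts are hypothetical, and the start and end values are made up for illustration.

```python
from datetime import datetime, timedelta


def timestamp_range(start, end, step=timedelta(minutes=15)):
    """List the timestamps from `start` to `end` (inclusive), spaced by `step`."""
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    stamps = []
    current = start
    while current <= end:
        stamps.append(current)
        current += step
    return stamps


stamps = timestamp_range(datetime(2018, 1, 1, 0, 0), datetime(2018, 1, 1, 1, 0))
rows = [(ts,) for ts in stamps]  # one-column rows, ready for a DataFrame
```

Assuming an active SparkSession named spark, the rows could then be passed to spark.createDataFrame(rows, ["ts"]). This is only practical when the list is small enough to build on the driver.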
