PySpark: How to generate a dataframe composed of a datetime range?

I would like to create a pyspark dataframe composed of a list of datetimes with a specific frequency.
Currently I'm using this approach, which seems quite cumbersome, and I'm pretty sure there are better ways:
import datetime as dt
import numpy as np
import pandas as pd
from pyspark.sql import functions as sf
from pyspark.sql.types import IntegerType
# Define date range
START_DATE = dt.datetime(2019, 8, 15, 20, 30, 0)
END_DATE = dt.datetime(2019, 8, 16, 15, 43, 0)
# Generate date range with pandas
timerange = pd.date_range(start=START_DATE, end=END_DATE, freq='15min')
# Convert to unix timestamps (seconds)
timestamps = [int(x) for x in timerange.values.astype(np.int64) // 10 ** 9]
# Create pyspark dataframe from the above timestamps
(spark.createDataFrame(timestamps, IntegerType())
 .withColumn('value_date', sf.from_unixtime('value'))
 .drop('value')
 .withColumnRenamed('value_date', 'date')
 .show())
which outputs
+-------------------+
| date|
+-------------------+
|2019-08-15 20:30:00|
|2019-08-15 20:45:00|
|2019-08-15 21:00:00|
|2019-08-15 21:15:00|
|2019-08-15 21:30:00|
|2019-08-15 21:45:00|
|2019-08-15 22:00:00|
|2019-08-15 22:15:00|
|2019-08-15 22:30:00|
|2019-08-15 22:45:00|
|2019-08-15 23:00:00|
|2019-08-15 23:15:00|
|2019-08-15 23:30:00|
|2019-08-15 23:45:00|
|2019-08-16 00:00:00|
|2019-08-16 00:15:00|
|2019-08-16 00:30:00|
|2019-08-16 00:45:00|
|2019-08-16 01:00:00|
|2019-08-16 01:15:00|
+-------------------+
Can you suggest a smarter way to achieve this?
Thanks
Edit:
This seems to work
(spark.sql('SELECT sequence({start_date}, {end_date}, 60*15) as timestamp_seq'.format(
start_date=int(START_DATE.timestamp()), end_date=int(END_DATE.timestamp())
)).withColumn('timestamp', sf.explode('timestamp_seq'))
.select(sf.col('timestamp').cast('timestamp').alias('datetime'))).show()
but I'm unable to make it work without first converting the datetimes to unix timestamps.
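For reference, on Spark 2.4+ the sequence function also accepts timestamp bounds with an interval step, which should avoid the epoch-seconds conversion entirely. A minimal sketch (not verified against every version):
from pyspark.sql import functions as sf
(spark.sql(
    "SELECT sequence(to_timestamp('2019-08-15 20:30:00'), "
    "to_timestamp('2019-08-16 15:43:00'), interval 15 minutes) AS ts_seq")
 .withColumn('date', sf.explode('ts_seq'))
 .select('date')
 .show())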

Here's a solution working on Spark 2.4.3 and Python 3.6.8:
>>> from pyspark.sql import functions as F
>>> def generate_dates(spark,range_list,interval=60*60*24,dt_col="date_time_ref"): # TODO: attention to sparkSession
... """
... Create a Spark DataFrame with a single column named dt_col and a range of dates within a specified interval (start and stop included).
... With hourly data, the last timestamp falls at 23:00 of the stop day.
...
... :param spark: SparkSession or sqlContext depending on environment (server vs local)
... :param range_list: array of strings formatted as "2018-01-20" or "2018-01-20 00:00:00"
... :param interval: number of seconds (frequency), output from get_freq()
... :param dt_col: string with date column name. Date column must be TimestampType
...
... :returns: df from range
... """
... start,stop = range_list
... temp_df = spark.createDataFrame([(start, stop)], ("start", "stop"))
... temp_df = temp_df.select([F.col(c).cast("timestamp") for c in ("start", "stop")])
... temp_df = temp_df.withColumn("stop",F.date_add("stop",1).cast("timestamp"))
... temp_df = temp_df.select([F.col(c).cast("long") for c in ("start", "stop")])
... start, stop = temp_df.first()
... return spark.range(start,stop,interval).select(F.col("id").cast("timestamp").alias(dt_col))
...
>>> date_range = ["2018-01-20 00:00:00","2018-01-23 00:00:00"]
>>> generate_dates(spark,date_range)
DataFrame[date_time_ref: timestamp]
>>> generate_dates(spark,date_range).show()
+-------------------+
| date_time_ref|
+-------------------+
|2018-01-20 00:00:00|
|2018-01-21 00:00:00|
|2018-01-22 00:00:00|
|2018-01-23 00:00:00|
+-------------------+
Sincerely, I think your first approach (pd.date_range -> spark.createDataFrame()) is the best one, since it lets pandas handle everything related to DST. Simply don't convert the Python timestamp objects to int; convert them to str instead and then cast the column from StringType to TimestampType.
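For example, a minimal sketch reusing START_DATE and END_DATE from the question (and assuming an active SparkSession named spark):
import pandas as pd
from pyspark.sql import functions as sf
timerange = pd.date_range(start=START_DATE, end=END_DATE, freq='15min')
(spark.createDataFrame([(str(t),) for t in timerange], ['date_str'])
 .select(sf.col('date_str').cast('timestamp').alias('date'))
 .show())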

Related

Reading partitioned parquet records in PySpark

I have a parquet file partitioned by a date field (YYYY-MM-DD).
How can I read the (current date - 1 day) records from the file efficiently in PySpark? Please suggest.
PS: I would not like to read the entire file and then filter the records, as the data volume is huge.
There are multiple ways to go about this:
Suppose this is the input data and you write out the dataframe partitioned on "date" column:
import datetime
from pyspark.sql import functions as f
from pyspark.sql.types import StructType, StructField, DateType, StringType
data = [(datetime.date(2022, 6, 12), "Hello"), (datetime.date(2022, 6, 19), "World")]
schema = StructType([StructField("date", DateType()), StructField("message", StringType())])
df = spark.createDataFrame(data, schema=schema)
df.write.mode('overwrite').partitionBy('date').parquet('./test')
You can read the parquet files associated to a given date with this syntax:
spark.read.parquet('./test/date=2022-06-19').show()
# The catch is that the date column is gonna be omitted from your dataframe
+-------+
|message|
+-------+
| World|
+-------+
# You could try adding the date column with lit syntax.
(spark.read.parquet('./test/date=2022-06-19')
.withColumn('date', f.lit('2022-06-19').cast(DateType()))
.show()
)
# Output
+-------+----------+
|message| date|
+-------+----------+
| World|2022-06-19|
+-------+----------+
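Alternatively (a sketch on the same './test' path), the basePath read option tells Spark where partition discovery starts, so the partition column is kept even when you point at a single partition directory:
(spark.read
 .option('basePath', './test')
 .parquet('./test/date=2022-06-19')
 .show())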
A more efficient solution is to use Delta tables:
df.write.mode('overwrite').partitionBy('date').format('delta').save('./test')
spark.read.format('delta').load('./test').where(f.col('date') == '2022-06-19').show()
The Spark engine uses the _delta_log to optimize the query and reads only the parquet files that apply to it. Also, the output will have all the columns:
+-------+----------+
|message| date|
+-------+----------+
| World|2022-06-19|
+-------+----------+
You can read it by passing a date variable while reading.
This is dynamic code; you don't need to hardcode the date, just append it to the path:
>>> df.show()
+-----+-----------------+-----------+----------+
|Sr_No| User_Id|Transaction| dt|
+-----+-----------------+-----------+----------+
| 1|paytm 111002203#p| 100D|2022-06-29|
| 2|paytm 111002203#p| 50C|2022-06-27|
| 3|paytm 111002203#p| 20C|2022-06-26|
| 4|paytm 111002203#p| 10C|2022-06-25|
| 5| null| 1C|2022-06-24|
+-----+-----------------+-----------+----------+
>>> df.write.partitionBy("dt").mode("append").parquet("/dir1/dir2/sample.parquet")
>>> from datetime import date
>>> from datetime import timedelta
>>> today = date.today()
#Here I am taking the date two days back; for one day back you can do (days=1)
>>> yesterday = today - timedelta(days = 2)
>>> two_days_back=yesterday.strftime('%Y-%m-%d')
>>> path="/dir1/dir2/sample.parquet/dt="+two_days_back
>>> spark.read.parquet(path).show()
+-----+-----------------+-----------+
|Sr_No| User_Id|Transaction|
+-----+-----------------+-----------+
| 2|paytm 111002203#p| 50C|
+-----+-----------------+-----------+

Converting string time to day timestamp

I have just started working with PySpark and need some help converting a column's datatype.
My dataframe has a string column which stores the time of day in AM/PM, and I need to convert it into a datetime for further processing/analysis.
fd = spark.createDataFrame([(['0143A'])], ['dt'])
fd.show()
+-----+
| dt|
+-----+
|0143A|
+-----+
from pyspark.sql.functions import date_format, to_timestamp
#fd.select(date_format('dt','hhmma')).show()
fd.select(to_timestamp('dt','hhmmaa')).show()
+----------------------------+
|to_timestamp(`dt`, 'hhmmaa')|
+----------------------------+
| null|
+----------------------------+
Expected output: 01:43
How can I get the proper datetime format in the above scenario?
Thanks for your help!
If we look at the doc for to_timestamp (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_timestamp) we see that the format must be specified as a SimpleDateFormat (https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
In order to retrieve the time of the day in AM/PM, we must use hhmma. But in SimpleDateFormat, a catches AM or PM, not A or P. So we need to change our string:
import pyspark.sql.functions as F
df = spark.createDataFrame([(['0143A'])], ['dt'])
df2 = df.withColumn('dt', F.concat(F.col('dt'), F.lit('M')))
df3 = df2.withColumn('ts', F.to_timestamp('dt','hhmma'))
df3.show()
+------+-------------------+
| dt| ts|
+------+-------------------+
|0143AM|1970-01-01 01:43:00|
+------+-------------------+
If you want to retrieve it as a string in the format you mentioned, you can use date_format:
df4 = df3.withColumn('time', F.date_format(F.col('ts'), format='HH:mm'))
df4.show()
+------+-------------------+-----+
| dt| ts| time|
+------+-------------------+-----+
|0143AM|1970-01-01 01:43:00|01:43|
+------+-------------------+-----+

Spark 2.0 Timestamp Difference in Milliseconds using Scala

I am using Spark 2.0 and looking for a way to achieve the following in Scala:
I need the timestamp difference in milliseconds between two DataFrame column values.
Value_1 = 06/13/2017 16:44:20.044
Value_2 = 06/13/2017 16:44:21.067
The data type of both is timestamp.
Note: Applying the function unix_timestamp(Column s) on both values and subtracting works, but not up to the millisecond precision, which is the requirement.
Final query would look like this:
SELECT timestamp_diff(Value_2, Value_1) FROM table1
this should return the following output:
1023 milliseconds
where timestamp_diff is the function that would calculate the difference in milliseconds.
One way would be to use Unix epoch time, the number of milliseconds since 1 January 1970. Below is an example using a UDF; it takes two timestamps and returns the difference between them in milliseconds.
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, udf}
val timestamp_diff = udf((startTime: Timestamp, endTime: Timestamp) => {
  (startTime.getTime() - endTime.getTime())
})
val df = // dataframe with two timestamp columns (col1 and col2)
.withColumn("diff", timestamp_diff(col("col2"), col("col1")))
Alternatively, you can register the function to use with SQL commands:
val timestamp_diff = (startTime: Timestamp, endTime: Timestamp) => {
(startTime.getTime() - endTime.getTime())
}
spark.sqlContext.udf.register("timestamp_diff", timestamp_diff)
df.createOrReplaceTempView("table1")
val df2 = spark.sqlContext.sql("SELECT *, timestamp_diff(col2, col1) as diff from table1")
The same for PySpark:
import datetime
def timestamp_diff(time1: datetime.datetime, time2: datetime.datetime):
    return int((time1 - time2).total_seconds() * 1000)
The int cast and the *1000 are only there to output milliseconds.
Example usage:
spark.udf.register("timestamp_diff", timestamp_diff)
df.registerTempTable("table1")
df2 = spark.sql("SELECT *, timestamp_diff(col2, col1) as diff from table1")
It's not an optimal solution since UDFs are usually slow, so you might run into performance issues.
Bit late to the party, but hope it's still useful.
import org.apache.spark.sql.Column
def getUnixTimestamp(col: Column): Column = (col.cast("double") * 1000).cast("long")
df.withColumn("diff", getUnixTimestamp(col("col2")) - getUnixTimestamp(col("col1")))
Of course you can define a separate method for the difference:
def timestampDiff(col1: Column, col2: Column): Column = getUnixTimestamp(col2) - getUnixTimestamp(col1)
df.withColumn("diff", timestampDiff(col("col1"), col("col2")))
To make life easier one can define an overloaded method for Strings with a default diff name:
def timestampDiff(col1: String, col2: String): Column = timestampDiff(col(col1), col(col2)).as("diff")
Now in action:
scala> df.show(false)
+-----------------------+-----------------------+
|min_time |max_time |
+-----------------------+-----------------------+
|1970-01-01 01:00:02.345|1970-01-01 01:00:04.786|
|1970-01-01 01:00:23.857|1970-01-01 01:00:23.999|
|1970-01-01 01:00:02.325|1970-01-01 01:01:07.688|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.444|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.454|
+-----------------------+-----------------------+
scala> df.withColumn("diff", timestampDiff("min_time", "max_time")).show(false)
+-----------------------+-----------------------+-----+
|min_time |max_time |diff |
+-----------------------+-----------------------+-----+
|1970-01-01 01:00:02.345|1970-01-01 01:00:04.786|2441 |
|1970-01-01 01:00:23.857|1970-01-01 01:00:23.999|142 |
|1970-01-01 01:00:02.325|1970-01-01 01:01:07.688|65363|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.444|209 |
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.454|219 |
+-----------------------+-----------------------+-----+
scala> df.select(timestampDiff("min_time", "max_time")).show(false)
+-----+
|diff |
+-----+
|2441 |
|142 |
|65363|
|209 |
|219 |
+-----+
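Since the page is PySpark-focused, the same cast-to-epoch trick translates directly; a minimal sketch, assuming a dataframe df with timestamp columns col1 and col2:
from pyspark.sql import Column
from pyspark.sql import functions as F
def get_unix_millis(c: Column) -> Column:
    # cast the timestamp to fractional epoch seconds, then scale to milliseconds
    return (c.cast("double") * 1000).cast("long")
df.withColumn("diff", get_unix_millis(F.col("col2")) - get_unix_millis(F.col("col1"))).show(truncate=False)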

Is Spark only applying my UDF on records being shown?

I have a feeling Spark is being cleverer than me and reordering what actually runs on the executors (at least compared to the written code).
Suppose I have a very simple spark query in scala as follows.
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
val rawData = sqlContext.sql("FROM mytable SELECT *")
I then create a new column using some functionality in a UDF; this function is not lightweight (at least some of the time) and relies on multiple columns in the data. Roughly speaking my UDF looks similar to this, although the processing is only an example.
def method1(s1:String, s2:String):String = {
List(s1, s2).mkString(" ")
}
val method1UDF = udf(method1 _)
val dataWithCol = rawData
.withColumn("newcol", method1UDF($"c1",$"c2"))
dataWithCol.show(100)
My question actually revolves around the last statement, or at least I think it does.
If my dataset has 1 billion records in it, is Spark actually only applying my withColumn to 100 records, or is it applying it to all 1 billion records and then just returning the first 100?
In Hive I presume the equivalent would be:
SELECT t.c1, t.c2, CONCAT_WS(" ", t.c1, t.c2) as newCol FROM (
SELECT c1, c2 FROM mytable LIMIT 100
) t
Even though in code it looks like I've written the equivalent of the following query:
SELECT * from (
SELECT c1,c2, CONCAT_WS(" ",c1,c2) as newCol FROM mytable
) t limit 100
I suspect it is doing the former since adding a filter on the new column drastically slows down the operation. If I change the last line to:
dataWithCol.filter($"newCol" === "H i").show(100)
This is now having to apply the function to a lot more data (presumably the entire dataset) before it does the limit of 100, similar to the following Hive query:
SELECT * from (
SELECT c1,c2, CONCAT_WS(" ",c1,c2) as newCol FROM mytable
) t where t.newCol == "H i" limit 100
Am I along the right lines with what Spark is doing in the background? Is it optimising my query by only applying the processing on records which will end up being viewed?
If you are not sure, you can always run an experiment:
Spark context available as 'sc' (master = local[*], app id = local-1490732267478).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_121)
Type in expressions to have them evaluated.
Type :help for more information.
scala> :paste
// Entering paste mode (ctrl-D to finish)
val rawData = spark.range(0, 1000000000, 1, 1000)
.toDF("id")
.select(
$"id".cast("string").alias("s1"),
$"id".cast("string").alias("s2"))
val counter = sc.longAccumulator("counter")
def f = udf((s1: String, s2: String) => {
counter.add(1)
s"$s1 $s2"
})
rawData.select(f($"s1", $"s2")).show(10)
// Exiting paste mode, now interpreting.
+-----------+
|UDF(s1, s2)|
+-----------+
| 0 0|
| 1 1|
| 2 2|
| 3 3|
| 4 4|
| 5 5|
| 6 6|
| 7 7|
| 8 8|
| 9 9|
+-----------+
only showing top 10 rows
rawData: org.apache.spark.sql.DataFrame = [s1: string, s2: string]
counter: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(counter), value: 12)
f: org.apache.spark.sql.expressions.UserDefinedFunction
scala> counter.value
res1: Long = 12
As you can see Spark limits the number of records to be processed but it is not exactly precise. You should also remember that these results are version and query dependent.
For example, earlier Spark versions were fairly limited when applying optimizations to UDF calls. Also, an upstream wide transformation may affect this behavior and result in processing more (or even all) records.
Spark applies something known as "lazy execution". This means transformations are only evaluated when an action requires them, so it is actually doing something between the two statements you wrote. The execution planner is clever enough to figure out what needs to be done and what doesn't. To see more detail, browse to localhost:4040 (increment the port by 1 for every context you're running).
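As a lighter-weight check than the Spark UI, you can also print the physical plan; a minimal sketch (shown in PySpark syntax, assuming the dataWithCol dataframe from the question; the same calls exist in Scala):
# Adding an explicit limit mirrors what show(100) does at execution time.
# A CollectLimit / LocalLimit node sitting above the scan in the printed plan
# means Spark only computes enough partitions to satisfy the limit.
dataWithCol.limit(100).explain()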

Extract week day number from string column (datetime stamp) in spark api

I am new to the Spark API. I am trying to extract the weekday number from a column, say col_date (holding a datetime stamp, e.g. '13AUG15:09:40:15'), which is a string, and add another column weekday (integer). I am not able to do it successfully.
The approach below worked for me, using a 'one line' UDF, similar to but different from the other UDF answer:
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dayofweek').getOrCreate()
set up the dataframe:
df = spark.createDataFrame(
[(1, "2018-05-12")
,(2, "2018-05-13")
,(3, "2018-05-14")
,(4, "2018-05-15")
,(5, "2018-05-16")
,(6, "2018-05-17")
,(7, "2018-05-18")
,(8, "2018-05-19")
,(9, "2018-05-20")
], ("id", "date"))
set up the udf:
from pyspark.sql.functions import udf,desc
from datetime import datetime
weekDay = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%w'))
df = df.withColumn('weekDay', weekDay(df['date'])).sort(desc("date"))
results:
df.show()
+---+----------+-------+
| id| date|weekDay|
+---+----------+-------+
| 9|2018-05-20| 0|
| 8|2018-05-19| 6|
| 7|2018-05-18| 5|
| 6|2018-05-17| 4|
| 5|2018-05-16| 3|
| 4|2018-05-15| 2|
| 3|2018-05-14| 1|
| 2|2018-05-13| 0|
| 1|2018-05-12| 6|
+---+----------+-------+
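As a side note, for yyyy-MM-dd strings like these, Spark's built-in dayofweek function (available since Spark 2.3) avoids the UDF entirely, though it numbers days 1 = Sunday through 7 = Saturday rather than the 0 = Sunday of '%w'. A minimal sketch, assuming the df defined above:
from pyspark.sql import functions as F
# dayofweek works on date/timestamp columns, so cast the string first
df.withColumn('weekDay', F.dayofweek(F.to_date('date'))).show()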
Well, this is quite simple.
This simple function does the whole job and returns the weekday as a number (Monday = 1, Sunday = 0):
from datetime import datetime
# get weekdays and daily hours from timestamp
def toWeekDay(x):
# v = datetime.strptime(datetime.fromtimestamp(int(x)).strftime("%Y %m %d %H"), "%Y %m %d %H").strftime('%w') - from unix timestamp
v = datetime.strptime(x, '%d%b%y:%H:%M:%S').strftime('%w')
return v
days = ['13AUG15:09:40:15','27APR16:20:04:35'] # create example dates
days = sc.parallelize(days) # for example purposes - transform python list to RDD so we can do it in a 'Spark [parallel] way'
days.take(2) # to see what's in the RDD
> ['13AUG15:09:40:15', '27APR16:20:04:35']
result = days.map(lambda x: toWeekDay(x)) # apply the function toWeekDay to each element of the RDD
result.take(2) # let's see the results
> ['4', '3']
Please see Python documentation for further details on datetime processing.