How to check that a value is the unix timestamp in Scala? - scala

In the DataFrame df I have a column datetime that contains timestamp values. The problem is that in some rows these are unix timestamps, while in other rows these are yyyyMMddHHmm format.
How can I verify that each given value is unix timestamp and if it's not to convert it into timestamp?
df.withColumn("timestamp", unix_timestamp(col("datetime")))
I assume that when...otherwise should be used, but how to check that a value is the unix timestamp?

You can use when/otherwise along with the date parsing methods. Here is some example code. I differentiated using just the length of the string, but you could also check the result of parsing them.
from pyspark.sql.functions import *
data = [
('201001021011',),
('201101021011',),
('1539721852',),
('1539721853',)
]
df = sc.parallelize(data).toDF(['date'])
df2 = df.withColumn('date',
when(length('date') != 12, from_unixtime('date', 'yyyyMMddHHmm')) \
.otherwise(col('date'))
)
df3 = df2.withColumn('date', to_timestamp('date', 'yyyyMMddHHmm'))
df3.show()
Outputs this:
+-------------------+
| date|
+-------------------+
|2010-01-02 10:11:00|
|2011-01-02 10:11:00|
|2018-10-16 16:30:00|
|2018-10-16 16:30:00|
+-------------------+

If column datetime consists of only Unix-timestamp strings or "yyyyMMddHHmm"-formatted strings, you can differentiate the two string formats based on their length, since the former has 10 digits or less whereas the latter is a fixed 12:
val df = Seq(
(1, "1538384400"),
(2, "1538481600"),
(3, "201809281800"),
(4, "1538548200"),
(5, "201809291530")
).toDF("id", "datetime")
df.withColumn("timestamp",
when(length($"datetime") === 12, unix_timestamp($"datetime", "yyyyMMddHHmm")).
otherwise($"datetime")
)
// +---+------------+----------+
// | id| datetime| timestamp|
// +---+------------+----------+
// | 1| 1538384400|1538384400|
// | 2| 1538481600|1538481600|
// | 3|201809281800|1538182800|
// | 4| 1538548200|1538548200|
// | 5|201809291530|1538260200|
// +---+------------+----------+
In case there are other string formats in column datetime, you can narrow down the conditions for Unix timestamp to a range corresponding to the range of date-time in your dataset. For example, Unix timestamp should be a 10-digit number post 2001-09-09 (and for the next 250+ years) and would start with 10 to 15 up till now:
df.withColumn("timestamp",
when(length($"datetime") === 12, unix_timestamp($"datetime", "yyyyMMddHHmm")).
otherwise(when(regexp_extract($"datetime", "^(1[0-5]\\d{8})$", 1) === $"datetime", $"datetime").
otherwise(null) // Or, additional conditions for other cases
))

Related

reading partitioned parquet record in pyspark

I have a parquet file partitioned by a date field (YYYY-MM-DD).
How to read the (current date-1 day) records from the file efficiently in Pyspark - please suggest.
PS: I would not like to read the entire file and then filter the records as the data volume is huge.
There are multiple ways to go about this:
Suppose this is the input data and you write out the dataframe partitioned on "date" column:
data = [(datetime.date(2022, 6, 12), "Hello"), (datetime.date(2022, 6, 19), "World")]
schema = StructType([StructField("date", DateType()),StructField("message", StringType())])
df = spark.createDataFrame(data, schema=schema)
df.write.mode('overwrite').partitionBy('date').parquet('./test')
You can read the parquet files associated to a given date with this syntax:
spark.read.parquet('./test/date=2022-06-19').show()
# The catch is that the date column is gonna be omitted from your dataframe
+-------+
|message|
+-------+
| World|
+-------+
# You could try adding the date column with lit syntax.
(spark.read.parquet('./test/date=2022-06-19')
.withColumn('date', f.lit('2022-06-19').cast(DateType()))
.show()
)
# Output
+-------+----------+
|message| date|
+-------+----------+
| World|2022-06-19|
+-------+----------+
The more efficient solution is using the delta tables:
df.write.mode('overwrite').partitionBy('date').format('delta').save('/test')
spark.read.format('delta').load('./test').where(f.col('date') == '2022-06-19').show()
The spark engine uses the _delta_log to optimize your query and only reads the parquet files that are applicable to your query. Also, the output will have all the columns:
+-------+----------+
|message| date|
+-------+----------+
| World|2022-06-19|
+-------+----------+
you can read it by passing date variable while reading.
This is dynamic code, you nor need to hardcode date, just append it with path
>>> df.show()
+-----+-----------------+-----------+----------+
|Sr_No| User_Id|Transaction| dt|
+-----+-----------------+-----------+----------+
| 1|paytm 111002203#p| 100D|2022-06-29|
| 2|paytm 111002203#p| 50C|2022-06-27|
| 3|paytm 111002203#p| 20C|2022-06-26|
| 4|paytm 111002203#p| 10C|2022-06-25|
| 5| null| 1C|2022-06-24|
+-----+-----------------+-----------+----------+
>>> df.write.partitionBy("dt").mode("append").parquet("/dir1/dir2/sample.parquet")
>>> from datetime import date
>>> from datetime import timedelta
>>> today = date.today()
#Here i am taking two days back date, for one day back you can do (days=1)
>>> yesterday = today - timedelta(days = 2)
>>> two_days_back=yesterday.strftime('%Y-%m-%d')
>>> path="/di1/dir2/sample.parquet/dt="+two_days_back
>>> spark.read.parquet(path).show()
+-----+-----------------+-----------+
|Sr_No| User_Id|Transaction|
+-----+-----------------+-----------+
| 2|paytm 111002203#p| 50C|
+-----+-----------------+-----------+

Spark dataframe filter a timestamp by just the date part

How can I filter a spark dataframe that has a column of type timestamp but filter out by just the date part. I tried below, but it only matches if time is 00:00:00.
Basically I want the filter to match all rows with date 2020-01-01 (3 rows)
import java.sql.Timestamp
val df = Seq(
(1, Timestamp.valueOf("2020-01-01 23:00:01")),
(2, Timestamp.valueOf("2020-01-01 00:00:00")),
(3, Timestamp.valueOf("2020-01-01 12:54:00")),
(4, Timestamp.valueOf("2019-12-15 09:54:00")),
(5, Timestamp.valueOf("2019-12-09 10:12:43"))
).toDF("someCol","someTimeStamp")
df.filter(df("someTimeStamp") === "2020-01-01").show
+-------+-------------------+
|someCol| someTimeStamp|
+-------+-------------------+
| 2|2020-01-01 00:00:00| // ONLY MATCHED with time 00:00
+-------+-------------------+
Use the to_date function to extract the date from the timestamp:
scala> df.filter(to_date(df("someTimeStamp")) === "2020-01-01").show
+-------+-------------------+
|someCol| someTimeStamp|
+-------+-------------------+
| 1|2020-01-01 23:00:01|
| 2|2020-01-01 00:00:00|
| 3|2020-01-01 12:54:00|
+-------+-------------------+

Timestamp comparison in spark-scala dataframe

I have a field in spark dataframe of type string, and it's value is in format 2019-07-08 00:00. I have to perform a condition on the field like
df.filter(myfield > 2019-07-08 00:00)
Standard comparison operators for String should work, given your date format is in British military form:
val df = Seq(
(1, "2019-07-06 16:00"),
(2, "2019-07-08 09:00"),
(3, "2019-07-11 06:30")
).toDF("id", "date")
df.filter(col("date") > "2019-07-08 00:00").show
// +---+----------------+
// | id| date|
// +---+----------------+
// | 2|2019-07-08 09:00|
// | 3|2019-07-11 06:30|
// +---+----------------+

How to calculate difference between date column and current date?

I am trying to calculate the Date Diff between a column field and current date of the system.
Here is my sample code where I have hard coded the my column field with 20170126.
val currentDate = java.time.LocalDate.now
var datediff = spark.sqlContext.sql("""Select datediff(to_date('$currentDate'),to_date(DATE_FORMAT(CAST(unix_timestamp( cast('20170126' as String), 'yyyyMMdd') AS TIMESTAMP), 'yyyy-MM-dd'))) AS GAP
""")
datediff.show()
Output is like:
+----+
| GAP|
+----+
|null|
+----+
I need to calculate actual Gap Between the two dates but getting NULL.
You have not defined the type and format of "column field" so I assume it's a string in the (not-very-pleasant) format YYYYMMdd.
val records = Seq((0, "20170126")).toDF("id", "date")
scala> records.show
+---+--------+
| id| date|
+---+--------+
| 0|20170126|
+---+--------+
scala> records
.withColumn("year", substring($"date", 0, 4))
.withColumn("month", substring($"date", 5, 2))
.withColumn("day", substring($"date", 7, 2))
.withColumn("d", concat_ws("-", $"year", $"month", $"day"))
.select($"id", $"d" cast "date")
.withColumn("datediff", datediff(current_date(), $"d"))
.show
+---+----------+--------+
| id| d|datediff|
+---+----------+--------+
| 0|2017-01-26| 83|
+---+----------+--------+
PROTIP: Read up on functions object.
Caveats
cast
Please note that I could not convince Spark SQL to cast the column "date" to DateType given the rules in DateTimeUtils.stringToDate:
yyyy,
yyyy-[m]m
yyyy-[m]m-[d]d
yyyy-[m]m-[d]d
yyyy-[m]m-[d]d *
yyyy-[m]m-[d]dT*
date_format
I could not convince date_format to work either so I parsed "date" column myself using substring and concat_ws functions.

Extract week day number from string column (datetime stamp) in spark api

I am new to Spark API. I am trying to extract weekday number from a column say col_date (having datetime stamp e.g '13AUG15:09:40:15') which is string and add another column as weekday(integer). I am not able to do successfully.
the approach below worked for me, using a 'one line' udf - similar but different to above:
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dayofweek').getOrCreate()
set up the dataframe:
df = spark.createDataFrame(
[(1, "2018-05-12")
,(2, "2018-05-13")
,(3, "2018-05-14")
,(4, "2018-05-15")
,(5, "2018-05-16")
,(6, "2018-05-17")
,(7, "2018-05-18")
,(8, "2018-05-19")
,(9, "2018-05-20")
], ("id", "date"))
set up the udf:
from pyspark.sql.functions import udf,desc
from datetime import datetime
weekDay = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%w'))
df = df.withColumn('weekDay', weekDay(df['date'])).sort(desc("date"))
results:
df.show()
+---+----------+-------+
| id| date|weekDay|
+---+----------+-------+
| 9|2018-05-20| 0|
| 8|2018-05-19| 6|
| 7|2018-05-18| 5|
| 6|2018-05-17| 4|
| 5|2018-05-16| 3|
| 4|2018-05-15| 2|
| 3|2018-05-14| 1|
| 2|2018-05-13| 0|
| 1|2018-05-12| 6|
+---+----------+-------+
Well, this is quite simple.
This simple function make all the job and returns weekdays as number (monday = 1):
from time import time
from datetime import datetime
# get weekdays and daily hours from timestamp
def toWeekDay(x):
# v = datetime.strptime(datetime.fromtimestamp(int(x)).strftime("%Y %m %d %H"), "%Y %m %d %H").strftime('%w') - from unix timestamp
v = datetime.strptime(x, '%d%b%y:%H:%M:%S').strftime('%w')
return v
days = ['13AUG15:09:40:15','27APR16:20:04:35'] # create example dates
days = sc.parallelize(days) # for example purposes - transform python list to RDD so we can do it in a 'Spark [parallel] way'
days.take(2) # to see whats in RDD
> ['13AUG15:09:40:15', '27APR16:20:04:35']
result = v.map(lambda x: (toWeekDay(x))) # apply functon toWeekDay on each element of RDD
result.take(2) # lets see results
> ['4', '3']
Please see Python documentation for further details on datetime processing.