Filling missing dates in spark dataframe column - scala

I've a spark data frame with columns - "date" of type timestamp and "quantity" of type long. For each date, I've some value for quantity. The dates are sorted in increasing order. But there are some dates which are missing.
For eg -
Current df -
Date | Quantity
10-09-2016 | 1
11-09-2016 | 2
14-09-2016 | 0
16-09-2016 | 1
17-09-2016 | 0
20-09-2016 | 2
As you can see, the df has some missing dates like 12-09-2016, 13-09-2016 etc. I want to put 0 in the quantity field for those missing dates such that resultant df should look like -
Date | Quantity
10-09-2016 | 1
11-09-2016 | 2
12-09-2016 | 0
13-09-2016 | 0
14-09-2016 | 0
15-09-2016 | 0
16-09-2016 | 1
17-09-2016 | 0
18-09-2016 | 0
19-09-2016 | 0
20-09-2016 | 2
Any help/suggestion regarding this will be appreciated. Thanks in advance.
Note that I am coding in scala.

I have written this answer in a bit verbose way for easy understanding of the code. It can be optimized.
Needed imports
import java.time.format.DateTimeFormatter
import java.time.{LocalDate, LocalDateTime}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{LongType, TimestampType}
UDFs for String to Valid date format
val date_transform = udf((date: String) => {
val dtFormatter = DateTimeFormatter.ofPattern("d-M-y")
val dt = LocalDate.parse(date, dtFormatter)
"%4d-%2d-%2d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
.replaceAll(" ", "0")
})
Below UDF code taken from Iterate over dates range
def fill_dates = udf((start: String, excludedDiff: Int) => {
val dtFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
val fromDt = LocalDateTime.parse(start, dtFormatter)
(1 to (excludedDiff - 1)).map(day => {
val dt = fromDt.plusDays(day)
"%4d-%2d-%2d".format(dt.getYear, dt.getMonthValue, dt.getDayOfMonth)
.replaceAll(" ", "0")
})
})
Setting up sample dataframe (df)
val df = Seq(
("10-09-2016", 1),
("11-09-2016", 2),
("14-09-2016", 0),
("16-09-2016", 1),
("17-09-2016", 0),
("20-09-2016", 2)).toDF("date", "quantity")
.withColumn("date", date_transform($"date").cast(TimestampType))
.withColumn("quantity", $"quantity".cast(LongType))
df.printSchema()
root
|-- date: timestamp (nullable = true)
|-- quantity: long (nullable = false)
df.show()
+-------------------+--------+
| date|quantity|
+-------------------+--------+
|2016-09-10 00:00:00| 1|
|2016-09-11 00:00:00| 2|
|2016-09-14 00:00:00| 0|
|2016-09-16 00:00:00| 1|
|2016-09-17 00:00:00| 0|
|2016-09-20 00:00:00| 2|
+-------------------+--------+
Create a temporary dataframe(tempDf) to union with df:
val w = Window.orderBy($"date")
val tempDf = df.withColumn("diff", datediff(lead($"date", 1).over(w), $"date"))
.filter($"diff" > 1) // Pick date diff more than one day to generate our date
.withColumn("next_dates", fill_dates($"date", $"diff"))
.withColumn("quantity", lit("0"))
.withColumn("date", explode($"next_dates"))
.withColumn("date", $"date".cast(TimestampType))
tempDf.show(false)
+-------------------+--------+----+------------------------+
|date |quantity|diff|next_dates |
+-------------------+--------+----+------------------------+
|2016-09-12 00:00:00|0 |3 |[2016-09-12, 2016-09-13]|
|2016-09-13 00:00:00|0 |3 |[2016-09-12, 2016-09-13]|
|2016-09-15 00:00:00|0 |2 |[2016-09-15] |
|2016-09-18 00:00:00|0 |3 |[2016-09-18, 2016-09-19]|
|2016-09-19 00:00:00|0 |3 |[2016-09-18, 2016-09-19]|
+-------------------+--------+----+------------------------+
Now union two dataframes
val result = df.union(tempDf.select("date", "quantity"))
.orderBy("date")
result.show()
+-------------------+--------+
| date|quantity|
+-------------------+--------+
|2016-09-10 00:00:00| 1|
|2016-09-11 00:00:00| 2|
|2016-09-12 00:00:00| 0|
|2016-09-13 00:00:00| 0|
|2016-09-14 00:00:00| 0|
|2016-09-15 00:00:00| 0|
|2016-09-16 00:00:00| 1|
|2016-09-17 00:00:00| 0|
|2016-09-18 00:00:00| 0|
|2016-09-19 00:00:00| 0|
|2016-09-20 00:00:00| 2|
+-------------------+--------+

Based on the #mrsrinivas excellent answer, here is the PySpark version.
Needed imports
from typing import List
import datetime
from pyspark.sql import DataFrame, Window
from pyspark.sql.functions import col, lit, udf, datediff, lead, explode
from pyspark.sql.types import DateType, ArrayType
UDF to create the range of next dates
def _get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
return [start_date + datetime.timedelta(days=days) for days in range(1, diff)]
Function the create the DateFrame filling the dates (support "grouping" columns):
def _get_fill_dates_df(df: DataFrame, date_column: str, group_columns: List[str], fill_column: str) -> DataFrame:
get_next_dates_udf = udf(_get_next_dates, ArrayType(DateType()))
window = Window.orderBy(*group_columns, date_column)
return df.withColumn("_diff", datediff(lead(date_column, 1).over(window), date_column)) \
.filter(col("_diff") > 1).withColumn("_next_dates", get_next_dates_udf(date_column, "_diff")) \
.withColumn(fill_column, lit("0")).withColumn(date_column, explode("_next_dates")) \
.drop("_diff", "_next_dates")
The usage of the function:
fill_df = _get_fill_dates_df(df, "Date", [], "Quantity")
df = df.union(fill_df)
It assumes that the date column is already in date type.

Here is a slight modification, to use this function with months and enter measure columns (columns that should be set to zero) instead of group columns:
from typing import List
import datetime
from dateutil import relativedelta
import math
import pyspark.sql.functions as f
from pyspark.sql import DataFrame, Window
from pyspark.sql.types import DateType, ArrayType
def fill_time_gaps_date_diff_based(df: pyspark.sql.dataframe.DataFrame, measure_columns: list, date_column: str):
group_columns = [col for col in df.columns if col not in [date_column]+measure_columns]
# save measure sums for qc
qc = df.agg({col: 'sum' for col in measure_columns}).collect()
# convert month to date
convert_int_to_date = f.udf(lambda mth: datetime.datetime(year=math.floor(mth/100), month=mth%100, day=1), DateType())
df = df.withColumn(date_column, convert_int_to_date(date_column))
# sort values
df = df.orderBy(group_columns)
# get_fill_dates_df (instead of months_between also use date_diff for days)
window = Window.orderBy(*group_columns, date_column)
# calculate diff column
fill_df = df.withColumn(
"_diff",
f.months_between(f.lead(date_column, 1).over(window), date_column).cast(IntegerType())
).filter(
f.col("_diff") > 1
)
# generate next dates
def _get_next_dates(start_date: datetime.date, diff: int) -> List[datetime.date]:
return [
start_date + relativedelta.relativedelta(months=months)
for months in range(1, diff)
]
get_next_dates_udf = f.udf(_get_next_dates, ArrayType(DateType()))
fill_df = fill_df.withColumn(
"_next_dates",
get_next_dates_udf(date_column, "_diff")
)
# set measure columns to 0
for col in measure_columns:
fill_df = fill_df.withColumn(col, f.lit(0))
# explode next_dates column
fill_df = fill_df.withColumn(date_column, f.explode('_next_dates'))
# drop unneccessary columns
fill_df = fill_df.drop(
"_diff",
"_next_dates"
)
# union df with fill_df
df = df.union(fill_df)
# qc: should be removed for productive runs
if qc != df.agg({col: 'sum' for col in measure_columns}).collect():
raise ValueError('Sums before and after run do not fit.')
return df
Please note, that I assume that the month is given as Integer in the form YYYYMM. This could easily be adjusted by modifying the "convert month to date" part.

Related

adding time interval to a column in dataframe spark

Below is my dataframe.
import spark.implicits._
val lastRunDtDF = sc.parallelize(Seq(
(1, 2,"2019-07-18 13:34:24")
)).toDF("id", "cnt","run_date")
lastRunDtDF.show
+---+---+-------------------+
| id|cnt| run_date|
+---+---+-------------------+
| 1| 2|2019-07-18 13:34:24|
+---+---+-------------------+
I want to create a new dataframe with a new column as new_run_date by adding 2 minutes to the existing run_date column. sample Output like below.
+---+---+-------------------+-------------------+
| id|cnt| run_date| new_run_date|
+---+---+-------------------+-------------------+
| 1| 2|2019-07-18 13:34:24|2019-07-18 13:36:24|
+---+---+-------------------+-------------------+
I am trying something like below
lastRunDtDF.withColumn("new_run_date",lastRunDtDF("run_date")+"INTERVAL 2 MINUTE")
Looks like its not the right way. Thanks in advance for any help.
Try wrapping INTERVAL 2 MINUTE in expr function.
import org.apache.spark.sql.functions.expr
lastRunDtDF.withColumn("new_run_date",lastRunDtDF("run_date") + expr("INTERVAL 2 MINUTE"))
.show()
Result:
+---+---+-------------------+-------------------+
| id|cnt| run_date| new_run_date|
+---+---+-------------------+-------------------+
| 1| 2|2019-07-18 13:34:24|2019-07-18 13:36:24|
+---+---+-------------------+-------------------+
(or)
By using from_unixtime,unix_timestamp functions:
import org.apache.spark.sql.functions._
lastRunDtDF.selectExpr("*","from_unixtime(unix_timestamp(run_date) + 2*60,
'yyyy-MM-dd HH:mm:ss') as new_run_date")
.show()
Result:
+---+---+-------------------+-------------------+
| id|cnt| run_date| new_run_date|
+---+---+-------------------+-------------------+
| 1| 2|2019-07-18 13:34:24|2019-07-18 13:36:24|
+---+---+-------------------+-------------------+

Add a New column in pyspark Dataframe (alternative of .apply in pandas DF)

I have a pyspark.sql.DataFrame.dataframe df
id col1
1 abc
2 bcd
3 lal
4 bac
i want to add one more column flag in the df such that if id is odd no, flag should be 'odd' , if even 'even'
final output should be
id col1 flag
1 abc odd
2 bcd even
3 lal odd
4 bac even
I tried:
def myfunc(num):
if num % 2 == 0:
flag = 'EVEN'
else:
flag = 'ODD'
return flag
df['new_col'] = df['id'].map(lambda x: myfunc(x))
df['new_col'] = df['id'].apply(lambda x: myfunc(x))
It Gave me error : TypeError: 'Column' object is not callable
How do is use .apply ( as i use in pandas dataframe) in pyspark
pyspark doesn't provide apply, the alternative is to use withColumn function. Use withColumn to perform this operation.
from pyspark.sql import functions as F
df = sqlContext.createDataFrame([
[1,"abc"],
[2,"bcd"],
[3,"lal"],
[4,"bac"]
],
["id","col1"]
)
df.show()
+---+----+
| id|col1|
+---+----+
| 1| abc|
| 2| bcd|
| 3| lal|
| 4| bac|
+---+----+
df.withColumn(
"flag",
F.when(F.col("id")%2 == 0, F.lit("Even")).otherwise(
F.lit("odd"))
).show()
+---+----+----+
| id|col1|flag|
+---+----+----+
| 1| abc| odd|
| 2| bcd|Even|
| 3| lal| odd|
| 4| bac|Even|
+---+----+----+

How can I rank items in a RDD to build a streak?

I have an RDD containing data like this: (downloadId: String, date: LocalDate, downloadCount: Int). The date and download-id are unique and the download-count is for the date.
What I've been trying to accomplish is to get the number of consecutive days (going backwards from the current date) that a download-id was in the top 100 of all download-ids. So if a given download was in the top-100 today, yesterday and the day before, then it's streak would be 3.
In SQL, I guess this could be solved using window functions. I've seen similar questions like this. How to add a running count to rows in a 'streak' of consecutive days
(I'm rather new to Spark but wasn't sure to how to map-reduce an RDD to even begin solving a problem like this.)
Some more information, the dates are the last 30 days and there are approximately unique 4M download-ids per day.
I suggest you work with DataFrames, as they are much easier to use than RDDs. Leo's answer is shorter, but I couldn't find where it was filtering for the top 100 downloads, so I decide to post my answer as well. It does not depend on window functions, but it is bound on the number of days in the past you want to streak by. Since you said you only use the last 30 days' data, that should not be a problem.
As a first Step, I wrote some code to generate a DF similar to what you described. You don't need to run this first block (if you do, reduce the number of rows unless you have a cluster to try it on, it's heavy on memory). You can see how to transform the RDD (theData) into a DF (baseData). You should define a schema for it, like I did.
import java.time.LocalDate
import scala.util.Random
val maxId = 10000
val numRows = 15000000
val lastDate = LocalDate.of(2017, 12, 31)
// Generates the data. As a convenience for working with Dataframes, I converted the dates to epoch days.
val theData = sc.parallelize(1.to(numRows).map{
_ => {
val id = Random.nextInt(maxId)
val nDownloads = Random.nextInt((id / 1000 + 1))
Row(id, lastDate.minusDays(Random.nextInt(30)).toEpochDay, nDownloads)
}
})
//Working with Dataframes is much simples, so I'll generate a DF named baseData from the RDD
val schema = StructType(
StructField("downloadId", IntegerType, false) ::
StructField("date", LongType, false) ::
StructField("downloadCount", IntegerType, false) :: Nil)
val baseData = sparkSession.sqlContext.createDataFrame(theData, schema)
.groupBy($"downloadId", $"date")
.agg(sum($"downloadCount").as("downloadCount"))
.cache()
Now you have the data you want in a DF called baseData. The next step is to restrict it to the top 100 for each day - you should discard the data you don't before doing any additional heavy transformations.
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Row}
def filterOnlyTopN(data: DataFrame, n: Int = 100): DataFrame = {
// For each day in the data, let's find the cutoff # of downloads to make it into the top N
val getTopNCutoff = udf((downloads: Seq[Long]) => {
val reverseSortedDownloads = downloads.sortBy{- _ }
if (reverseSortedDownloads.length >= n)
reverseSortedDownloads.drop(n - 1).head
else
reverseSortedDownloads.last
})
val topNLimitsByDate = data.groupBy($"date").agg(collect_set($"downloadCount").as("downloads"))
.select($"date", getTopNCutoff($"downloads").as("cutoff"))
// And then, let's throw away the records below the top 100
data.join(topNLimitsByDate, Seq("date"))
.filter($"downloadCount" >= $"cutoff")
.drop("cutoff", "downloadCount")
}
val relevantData = filterOnlyTopN(baseData)
Now that you have the relevantData DF with only the data you need, you can calculate the streak for them. I have left the ids with no streaks as streak 0, you can filter those out by using streaks.filter($"streak" > lit(0)).
def getStreak(df: DataFrame, fromDate: Long): DataFrame = {
val calcStreak = udf((dateList: Seq[Long]) => {
if (!dateList.contains(fromDate))
0
else {
val relevantDates = dateList.sortBy{- _ } // Order the dates descending
.dropWhile(_ != fromDate) // And drop everything until we find the starting day we are interested in
if (relevantDates.length == 1) // If there's only one day left, it's a one day streak
1
else // Otherwise, let's count the streak length (this works if no dates are left, too - but not with only 1 day)
relevantDates.sliding(2) // Take days by pairs
.takeWhile{twoDays => twoDays(1) == twoDays(0) - 1} // While the pair is of consecutive days
.length+1 // And the streak will be the number of consecutive pairs + 1 (the initial day of the streak)
}
})
df.groupBy($"downloadId").agg(collect_list($"date").as("dates")).select($"downloadId", calcStreak($"dates").as("streak"))
}
val streaks = getStreak(relevantData, lastDate.toEpochDay)
streaks.show()
+------------+--------+
| downloadId | streak |
+------------+--------+
| 8086 | 0 |
| 9852 | 0 |
| 7253 | 0 |
| 9376 | 0 |
| 7833 | 0 |
| 9465 | 1 |
| 7880 | 0 |
| 9900 | 1 |
| 7993 | 0 |
| 9427 | 1 |
| 8389 | 1 |
| 8638 | 1 |
| 8592 | 1 |
| 6397 | 0 |
| 7754 | 1 |
| 7982 | 0 |
| 7554 | 0 |
| 6357 | 1 |
| 7340 | 0 |
| 6336 | 0 |
+------------+--------+
And there you have the streaks DF with the data you need.
Using a similar approach in the listed PostgreSQL link, you can apply Window function in Spark as well. Spark's DataFrame API doesn't have encoders for java.time.LocalDate, so you'll need to convert it to, say, java.sql.Date.
Here're the steps: First, transfrom the RDD to a DataFrame with supported date format; next, create a UDF to compute the baseDate which requires a date and a per-id chronological row-number (generated using Window function) as parameters. Another Window function is applied to calculate per-id-baseDate row-number, which is the wanted streak value:
import java.time.LocalDate
val rdd = sc.parallelize(Seq(
(1, LocalDate.parse("2017-12-13"), 2),
(1, LocalDate.parse("2017-12-16"), 1),
(1, LocalDate.parse("2017-12-17"), 1),
(1, LocalDate.parse("2017-12-18"), 2),
(1, LocalDate.parse("2017-12-20"), 1),
(1, LocalDate.parse("2017-12-21"), 3),
(2, LocalDate.parse("2017-12-15"), 2),
(2, LocalDate.parse("2017-12-16"), 1),
(2, LocalDate.parse("2017-12-19"), 1),
(2, LocalDate.parse("2017-12-20"), 1),
(2, LocalDate.parse("2017-12-21"), 2),
(2, LocalDate.parse("2017-12-23"), 1)
))
val df = rdd.map{ case (id, date, count) => (id, java.sql.Date.valueOf(date), count) }.
toDF("downloadId", "date", "downloadCount")
def baseDate = udf( (d: java.sql.Date, n: Long) =>
new java.sql.Date(new java.util.Date(d.getTime).getTime - n * 24 * 60 * 60 * 1000)
)
import org.apache.spark.sql.expressions.Window
val dfStreak = df.withColumn("rowNum", row_number.over(
Window.partitionBy($"downloadId").orderBy($"date")
)
).withColumn(
"baseDate", baseDate($"date", $"rowNum")
).select(
$"downloadId", $"date", $"downloadCount", row_number.over(
Window.partitionBy($"downloadId", $"baseDate").orderBy($"date")
).as("streak")
).orderBy($"downloadId", $"date")
dfStreak.show
+----------+----------+-------------+------+
|downloadId| date|downloadCount|streak|
+----------+----------+-------------+------+
| 1|2017-12-13| 2| 1|
| 1|2017-12-16| 1| 1|
| 1|2017-12-17| 1| 2|
| 1|2017-12-18| 2| 3|
| 1|2017-12-20| 1| 1|
| 1|2017-12-21| 3| 2|
| 2|2017-12-15| 2| 1|
| 2|2017-12-16| 1| 2|
| 2|2017-12-19| 1| 1|
| 2|2017-12-20| 1| 2|
| 2|2017-12-21| 2| 3|
| 2|2017-12-23| 1| 1|
+----------+----------+-------------+------+

How to convert UNIX timestamp to the given timezone in Spark?

I use Spark 1.6.2
I have epochs like this:
+--------------+-------------------+-------------------+
|unix_timestamp|UTC |Europe/Helsinki |
+--------------+-------------------+-------------------+
|1491771599 |2017-04-09 20:59:59|2017-04-09 23:59:59|
|1491771600 |2017-04-09 21:00:00|2017-04-10 00:00:00|
|1491771601 |2017-04-09 21:00:01|2017-04-10 00:00:01|
+--------------+-------------------+-------------------+
The default timezone is the following on the Spark machines:
#timezone = DefaultTz: Europe/Prague, SparkUtilTz: Europe/Prague
the output of
logger.info("#timezone = DefaultTz: {}, SparkUtilTz: {}", TimeZone.getDefault.getID, org.apache.spark.sql.catalyst.util.DateTimeUtils.defaultTimeZone.getID)
I want to count the timestamps grouped by date and hour in the given timezone (now it is Europe/Helsinki +3hours).
What I expect:
+----------+---------+-----+
|date |hour |count|
+----------+---------+-----+
|2017-04-09|23 |1 |
|2017-04-10|0 |2 |
+----------+---------+-----+
Code (using from_utc_timestamp):
def getCountsPerTime(sqlContext: SQLContext, inputDF: DataFrame, timeZone: String, aggr: String): DataFrame = {
import sqlContext.implicits._
val onlyTime = inputDF.select(
from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType), timeZone).alias("time")
)
val visitsPerTime =
if (aggr.equalsIgnoreCase("hourly")) {
onlyTime.groupBy(
date_format($"time", "yyyy-MM-dd").alias("date"),
date_format($"time", "H").cast(DataTypes.IntegerType).alias("hour"),
).count()
} else if (aggr.equalsIgnoreCase("daily")) {
onlyTime.groupBy(
date_format($"time", "yyyy-MM-dd").alias("date")
).count()
}
visitsPerTime.show(false)
visitsPerTime
}
What I get:
+----------+---------+-----+
|date |hour |count|
+----------+---------+-----+
|2017-04-09|22 |1 |
|2017-04-09|23 |2 |
+----------+---------+-----+
Trying to wrap it with to_utc_timestamp:
def getCountsPerTime(sqlContext: SQLContext, inputDF: DataFrame, timeZone: String, aggr: String): DataFrame = {
import sqlContext.implicits._
val onlyTime = inputDF.select(
to_utc_timestamp(from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType), timeZone), DateTimeUtils.defaultTimeZone.getID).alias("time")
)
val visitsPerTime = ... //same as above
visitsPerTime.show(false)
visitsPerTime
}
What I get:
+----------+---------+-----+
|tradedate |tradehour|count|
+----------+---------+-----+
|2017-04-09|20 |1 |
|2017-04-09|21 |2 |
+----------+---------+-----+
How to get the expected result?
Your codes are not working for me so I couldn't replicate the last two outputs you got.
But I am going to provide you some hints on how you can achieve the output you expected
I am assuming you already have dataframe as
+--------------+---------------------+---------------------+
|unix_timestamp|UTC |Europe/Helsinki |
+--------------+---------------------+---------------------+
|1491750899 |2017-04-09 20:59:59.0|2017-04-09 23:59:59.0|
|1491750900 |2017-04-09 21:00:00.0|2017-04-10 00:00:00.0|
|1491750901 |2017-04-09 21:00:01.0|2017-04-10 00:00:01.0|
+--------------+---------------------+---------------------+
I got this dataframe by using following code
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val inputDF = Seq(
"2017-04-09 20:59:59",
"2017-04-09 21:00:00",
"2017-04-09 21:00:01"
).toDF("unix_timestamp")
val onlyTime = inputDF.select(
unix_timestamp($"unix_timestamp").alias("unix_timestamp"),
from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType), "UTC").alias("UTC"),
from_utc_timestamp($"unix_timestamp".cast(DataTypes.TimestampType), "Europe/Helsinki").alias("Europe/Helsinki")
)
onlyTime.show(false)
Once you have above dataframe, getting the output dataframe that you desire would require you to split the date, groupby and count as below
onlyTime.select(split($"Europe/Helsinki", " ")(0).as("date"), split(split($"Europe/Helsinki", " ")(1).as("time"), ":")(0).as("hour"))
.groupBy("date", "hour").agg(count("date").as("count"))
.show(false)
The resulting dataframe is
+----------+----+-----+
|date |hour|count|
+----------+----+-----+
|2017-04-09|23 |1 |
|2017-04-10|00 |2 |
+----------+----+-----+
Setting "spark.sql.session.timeZone" before the action seems to be reliable. Using this setting we can be sure that the timestamps that we use afterwards- does actually represent the time in the specified time zone. Without it (if we use from_unixtime or timestamp_seconds) we can't be sure which time zone is represented. Both those functions represent the current system time zone. And if afterwards we used to_utc_timestamp or from_utc_timestamp, we would only get a shift from the current system time zone. UTC does not necessarily come into play with the latter functions. This is why explicitly setting a time zone can be reliable. One thing to keep in mind is that the action(s) must be performed before spark.conf.unset("spark.sql.session.timeZone").
Scala
Input df:
import spark.implicits._
import org.apache.spark.sql.functions._
val inputDF = Seq(1491771599L,1491771600L,1491771601L).toDF("unix_timestamp")
inputDF.show()
// +--------------+
// |unix_timestamp|
// +--------------+
// | 1491771599|
// | 1491771600|
// | 1491771601|
// +--------------+
Result:
spark.conf.set("spark.sql.session.timeZone", "Europe/Helsinki")
val ts = from_unixtime($"unix_timestamp")
val DF = inputDF.groupBy(to_date(ts).alias("date"), hour(ts).alias("hour")).count()
DF.show()
// +----------+----+-----+
// | date|hour|count|
// +----------+----+-----+
// |2017-04-09| 23| 1|
// |2017-04-10| 0| 2|
// +----------+----+-----+
spark.conf.unset("spark.sql.session.timeZone")
PySpark
Input df:
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1491771599,),(1491771600,),(1491771601,)], ['unix_timestamp'])
df.show()
# +--------------+
# |unix_timestamp|
# +--------------+
# | 1491771599|
# | 1491771600|
# | 1491771601|
# +--------------+
Result:
spark.conf.set("spark.sql.session.timeZone", "Europe/Helsinki")
ts = F.from_unixtime('unix_timestamp')
df_agg = df.groupBy(F.to_date(ts).alias('date'), F.hour(ts).alias('hour')).count()
df_agg.show()
# +----------+----+-----+
# | date|hour|count|
# +----------+----+-----+
# |2017-04-09| 23| 1|
# |2017-04-10| 0| 2|
# +----------+----+-----+
spark.conf.unset("spark.sql.session.timeZone")

Extract week day number from string column (datetime stamp) in spark api

I am new to Spark API. I am trying to extract weekday number from a column say col_date (having datetime stamp e.g '13AUG15:09:40:15') which is string and add another column as weekday(integer). I am not able to do successfully.
the approach below worked for me, using a 'one line' udf - similar but different to above:
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dayofweek').getOrCreate()
set up the dataframe:
df = spark.createDataFrame(
[(1, "2018-05-12")
,(2, "2018-05-13")
,(3, "2018-05-14")
,(4, "2018-05-15")
,(5, "2018-05-16")
,(6, "2018-05-17")
,(7, "2018-05-18")
,(8, "2018-05-19")
,(9, "2018-05-20")
], ("id", "date"))
set up the udf:
from pyspark.sql.functions import udf,desc
from datetime import datetime
weekDay = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%w'))
df = df.withColumn('weekDay', weekDay(df['date'])).sort(desc("date"))
results:
df.show()
+---+----------+-------+
| id| date|weekDay|
+---+----------+-------+
| 9|2018-05-20| 0|
| 8|2018-05-19| 6|
| 7|2018-05-18| 5|
| 6|2018-05-17| 4|
| 5|2018-05-16| 3|
| 4|2018-05-15| 2|
| 3|2018-05-14| 1|
| 2|2018-05-13| 0|
| 1|2018-05-12| 6|
+---+----------+-------+
Well, this is quite simple.
This simple function make all the job and returns weekdays as number (monday = 1):
from time import time
from datetime import datetime
# get weekdays and daily hours from timestamp
def toWeekDay(x):
# v = datetime.strptime(datetime.fromtimestamp(int(x)).strftime("%Y %m %d %H"), "%Y %m %d %H").strftime('%w') - from unix timestamp
v = datetime.strptime(x, '%d%b%y:%H:%M:%S').strftime('%w')
return v
days = ['13AUG15:09:40:15','27APR16:20:04:35'] # create example dates
days = sc.parallelize(days) # for example purposes - transform python list to RDD so we can do it in a 'Spark [parallel] way'
days.take(2) # to see whats in RDD
> ['13AUG15:09:40:15', '27APR16:20:04:35']
result = v.map(lambda x: (toWeekDay(x))) # apply functon toWeekDay on each element of RDD
result.take(2) # lets see results
> ['4', '3']
Please see Python documentation for further details on datetime processing.