PySpark Milliseconds of TimeStamp - pyspark

I am trying to get the difference between two timestamp columns, but the milliseconds are gone.
How do I correct this?
import pyspark.sql.functions as F

timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"
data = [
    (1, '2018-07-25 17:15:06.39', '2018-07-25 17:15:06.377'),
    (2, '2018-07-25 11:12:49.317', '2018-07-25 11:12:48.883')
]
df = spark.createDataFrame(data, ['ID', 'max_ts', 'min_ts']) \
    .withColumn('diff', F.unix_timestamp('max_ts', format=timeFmt) - F.unix_timestamp('min_ts', format=timeFmt))
df.show(truncate = False)

That's the intended behavior for unix_timestamp: its docstring clearly states that it only returns seconds, so the milliseconds component is dropped in the calculation.
If you still want that calculation, you can use the substring function to pull out the milliseconds and add them back in before taking the difference. See the example below. Please note that this assumes fully formed data, i.e. the milliseconds field is always fully populated (all 3 digits):
import pyspark.sql.functions as F
timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"
data = [
(1, '2018-07-25 17:15:06.390', '2018-07-25 17:15:06.377'), # note the '390'
(2, '2018-07-25 11:12:49.317', '2018-07-25 11:12:48.883')
]
df = spark.createDataFrame(data, ['ID', 'max_ts', 'min_ts'])\
    .withColumn('max_milli', F.unix_timestamp('max_ts', format=timeFmt) + F.substring('max_ts', -3, 3).cast('float')/1000)\
    .withColumn('min_milli', F.unix_timestamp('min_ts', format=timeFmt) + F.substring('min_ts', -3, 3).cast('float')/1000)\
    .withColumn('diff', (F.col('max_milli') - F.col('min_milli')).cast('float') * 1000)
df.show(truncate=False)
+---+-----------------------+-----------------------+----------------+----------------+---------+
|ID |max_ts |min_ts |max_milli |min_milli |diff |
+---+-----------------------+-----------------------+----------------+----------------+---------+
|1 |2018-07-25 17:15:06.390|2018-07-25 17:15:06.377|1.53255330639E9 |1.532553306377E9|13.000011|
|2 |2018-07-25 11:12:49.317|2018-07-25 11:12:48.883|1.532531569317E9|1.532531568883E9|434.0 |
+---+-----------------------+-----------------------+----------------+----------------+---------+

Assuming you already have a dataframe with columns of timestamp type:
from datetime import datetime
data = [
(1, datetime(2018, 7, 25, 17, 15, 6, 390000), datetime(2018, 7, 25, 17, 15, 6, 377000)),
(2, datetime(2018, 7, 25, 11, 12, 49, 317000), datetime(2018, 7, 25, 11, 12, 48, 883000))
]
df = spark.createDataFrame(data, ['ID', 'max_ts','min_ts'])
df.printSchema()
# root
# |-- ID: long (nullable = true)
# |-- max_ts: timestamp (nullable = true)
# |-- min_ts: timestamp (nullable = true)
You can get the time in seconds by casting the timestamp-type column to a double type, or in milliseconds by multiplying that result by 1000 (and optionally casting to long if you want an integer).
For example
df.select(
F.col('max_ts').cast('double').alias('time_in_seconds'),
(F.col('max_ts').cast('double') * 1000).cast('long').alias('time_in_milliseconds'),
).toPandas()
# time_in_seconds time_in_milliseconds
# 0 1532538906.390 1532538906390
# 1 1532517169.317 1532517169317
Finally, if you want the difference between the two times in milliseconds, you could do:
df.select(
((F.col('max_ts').cast('double') - F.col('min_ts').cast('double')) * 1000).cast('long').alias('diff_in_milliseconds'),
).toPandas()
# diff_in_milliseconds
# 0 13
# 1 434
I'm doing this on PySpark 2.4.2. There is no need to use string concatenation whatsoever.

The answer from Tanjin doesn't work when the values are of type timestamp and the milliseconds are round numbers (like 390, 500). Python cuts the trailing 0, and the timestamp from the example then looks like 2018-07-25 17:15:06.39.
The problem is the hardcoded length in F.substring('max_ts', -3, 3): if the trailing 0 is missing, the substring grabs the wrong characters.
To convert tmpColumn of timestamp type to tmpLongColumn of long type I used this snippet:
timeFmt = "yyyy-MM-dd HH:mm:ss.SSS"
df = df \
.withColumn('tmpLongColumn', F.substring_index('tmpColumn', '.', -1).cast('float')) \
.withColumn('tmpLongColumn', F.when(F.col('tmpLongColumn') < 100, F.col('tmpLongColumn')*10).otherwise(F.col('tmpLongColumn')).cast('long')) \
.withColumn('tmpLongColumn', (F.unix_timestamp('tmpColumn', format=timeFmt)*1000 + F.col('tmpLongColumn'))) \
The first transformation extracts the substring containing the milliseconds. Next, if the value is less than 100, multiply it by 10. Finally, convert the timestamp and add the milliseconds.
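For illustration, here is a sketch that applies the same idea to both columns from the original question (the helper name to_millis is made up, max_ts and min_ts are assumed to be of timestamp type, and the snippet inherits the assumptions above about the milliseconds substring):
import pyspark.sql.functions as F

timeFmt = "yyyy-MM-dd HH:mm:ss.SSS"

def to_millis(col_name):
    # digits after the dot in the string form of the timestamp, e.g. '39' or '377'
    milli = F.substring_index(F.col(col_name).cast('string'), '.', -1).cast('float')
    # compensate for a single truncated trailing zero, as described above;
    # timestamps with exactly zero milliseconds are not handled here (see the next answer)
    milli = F.when(milli < 100, milli * 10).otherwise(milli).cast('long')
    return F.unix_timestamp(col_name, format=timeFmt) * 1000 + milli

df = df.withColumn('diff_ms', to_millis('max_ts') - to_millis('min_ts'))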

Reason: pyspark's to_timestamp parses only up to seconds, while TimestampType is able to hold milliseconds.
The following workaround may work:
If the timestamp pattern contains S, invoke a UDF to get the 'INTERVAL MILLISECONDS' string to use in an expression:
from pyspark.sql.functions import to_timestamp, expr

ts_pattern = "YYYY-MM-dd HH:mm:ss:SSS"
my_col_name = "time_with_ms"

# get the time till seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"], ts_pattern))

if "S" in ts_pattern:
    df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))
To get INTERVAL 256 MILLISECONDS we may use a Java UDF:
df = df.withColumn(my_col_name, df[my_col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))
Inside the UDF getIntervalStringUDF(String timeString, String pattern):
use SimpleDateFormat to parse the date according to pattern,
return the formatted date as a string using the pattern "'INTERVAL 'SSS' MILLISECONDS'",
and return 'INTERVAL 0 MILLISECONDS' on parse/format exceptions.
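As a rough sketch of that helper in Python (the answer describes a Java UDF; the function name and the strptime pattern below are assumptions matching ts_pattern above):
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def get_interval_string(time_string):
    try:
        # parse according to the pattern above ("YYYY-MM-dd HH:mm:ss:SSS");
        # %f accepts 1 to 6 digits, so "317" is read as 317 milliseconds
        parsed = datetime.strptime(time_string, "%Y-%m-%d %H:%M:%S:%f")
        return "INTERVAL {} MILLISECONDS".format(parsed.microsecond // 1000)
    except (ValueError, TypeError):
        return "INTERVAL 0 MILLISECONDS"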
Please refer to: pyspark to_timestamp does not include milliseconds

Unlike #kaichi I did not find that trailing zeros were truncated by the substring_index command, so multiplying the milliseconds by ten is not necessary and can give you the wrong answer; for example, if the milliseconds value is originally 099, it would become 990. Furthermore, you may want to also add handling of timestamps that have zero milliseconds. To handle both of these situations, I have modified #kaichi's answer to give the following as the difference between two timestamps in milliseconds:
import pyspark.sql.functions as f

# tmpColumn holds the name of the timestamp column; timeFmt is as defined above
df = (
    df
    .withColumn('tmpLongColumn', f.substring_index(tmpColumn, '.', -1).cast('long'))
    .withColumn(
        'tmpLongColumn',
        f.when(f.col('tmpLongColumn').isNull(), 0.0)
        .otherwise(f.col('tmpLongColumn')))
    .withColumn(
        tmpColumn,
        (f.unix_timestamp(tmpColumn, format=timeFmt)*1000 + f.col('tmpLongColumn')))
    .drop('tmpLongColumn'))

When you cannot guarantee the exact format of the sub-seconds (length? trailing zeros?), I propose the following little algorithm, which should work for all lengths and formats:
Algorithm
timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"
current_col = "time"
df = df.withColumn("subsecond_string", F.substring_index(current_col, '.', -1))
df = df.withColumn("subsecond_length", F.length(F.col("subsecond_string")))
df = df.withColumn("divisor", F.pow(10,"subsecond_length"))
df = df.withColumn("subseconds", F.col("subsecond_string").cast("int") / F.col("divisor") )
# Putting it all together
df = df.withColumn("timestamp_subsec", F.unix_timestamp(current_col, format=timeFmt) + F.col("subseconds"))
Based on the length of the subsecond-string, an appropriate divisor is calculated (10 to the power of the length of the substring).
Dropping the superfluous columns afterwards should not be a problem.
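As a sketch of how the same steps can be applied to the two string columns from the original question (the helper name timestamp_with_subsec is made up; here everything is folded into one expression, while the step-by-step version with its helper columns is shown in the demonstration below):
import pyspark.sql.functions as F

timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"

def timestamp_with_subsec(col_name):
    # fractional part as a string, e.g. "39", "377" or "02" (assumes it is always present)
    subsec = F.substring_index(F.col(col_name), '.', -1)
    divisor = F.pow(10, F.length(subsec))
    return F.unix_timestamp(col_name, format=timeFmt) + subsec.cast("int") / divisor

df = df.withColumn(
    "diff_ms",
    (timestamp_with_subsec("max_ts") - timestamp_with_subsec("min_ts")) * 1000)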
Demonstration
My exemplary result looks like this:
+----------------------+----------------+----------------+-------+----------+----------------+
|time |subsecond_string|subsecond_length|divisor|subseconds|timestamp_subsec|
+----------------------+----------------+----------------+-------+----------+----------------+
|2019-04-02 14:34:16.02|02 |2 |100.0 |0.02 |1.55420845602E9 |
|2019-04-02 14:34:16.03|03 |2 |100.0 |0.03 |1.55420845603E9 |
|2019-04-02 14:34:16.04|04 |2 |100.0 |0.04 |1.55420845604E9 |
|2019-04-02 14:34:16.05|05 |2 |100.0 |0.05 |1.55420845605E9 |
|2019-04-02 14:34:16.06|06 |2 |100.0 |0.06 |1.55420845606E9 |
|2019-04-02 14:34:16.07|07 |2 |100.0 |0.07 |1.55420845607E9 |
|2019-04-02 14:34:16.08|08 |2 |100.0 |0.08 |1.55420845608E9 |
|2019-04-02 14:34:16.09|09 |2 |100.0 |0.09 |1.55420845609E9 |
|2019-04-02 14:34:16.1 |1 |1 |10.0 |0.1 |1.5542084561E9 |
|2019-04-02 14:34:16.11|11 |2 |100.0 |0.11 |1.55420845611E9 |
|2019-04-02 14:34:16.12|12 |2 |100.0 |0.12 |1.55420845612E9 |
|2019-04-02 14:34:16.13|13 |2 |100.0 |0.13 |1.55420845613E9 |
|2019-04-02 14:34:16.14|14 |2 |100.0 |0.14 |1.55420845614E9 |
|2019-04-02 14:34:16.15|15 |2 |100.0 |0.15 |1.55420845615E9 |
|2019-04-02 14:34:16.16|16 |2 |100.0 |0.16 |1.55420845616E9 |
|2019-04-02 14:34:16.17|17 |2 |100.0 |0.17 |1.55420845617E9 |
|2019-04-02 14:34:16.18|18 |2 |100.0 |0.18 |1.55420845618E9 |
|2019-04-02 14:34:16.19|19 |2 |100.0 |0.19 |1.55420845619E9 |
|2019-04-02 14:34:16.2 |2 |1 |10.0 |0.2 |1.5542084562E9 |
|2019-04-02 14:34:16.21|21 |2 |100.0 |0.21 |1.55420845621E9 |
+----------------------+----------------+----------------+-------+----------+----------------+

Related

Merge multiple spark rows inside dataframe by ID into one row based on update_time

We need to merge multiple rows into a single record based on ID using Pyspark. If there are multiple updates to a column, then we have to select the one from the last update made to it.
Please note, NULL would mean there was no update made to the column in that instance.
So, basically we have to create a single row with the consolidated updates made to the records.
So, for example, if this is the dataframe ...
Looking for a similar answer, but in Pyspark: Merge rows in a spark scala Dataframe
------------------------------------------------------------
| id | column1 | column2 | updated_at |
------------------------------------------------------------
| 123 | update1 | <*no-update*> | 1634228709 |
| 123 | <*no-update*> | 80 | 1634228724 |
| 123 | update2 | <*no-update*> | 1634229000 |
expected output is -
------------------------------------------------------------
| id | column1 | column2 | updated_at |
------------------------------------------------------------
| 123 | update2 | 80 | 1634229000 |
Let's say that our input dataframe is:
+---+-------+----+----------+
|id |col1 |col2|updated_at|
+---+-------+----+----------+
|123|null |null|1634228709|
|123|null |80 |1634228724|
|123|update2|90 |1634229000|
|12 |update1|null|1634221233|
|12 |null |80 |1634228333|
|12 |update2|null|1634221220|
+---+-------+----+----------+
What we want is to convert updated_at to TimestampType, then order by id and updated_at in descending order:
df = df.withColumn("updated_at", F.col("updated_at").cast(TimestampType())).orderBy(
F.col("id"), F.col("updated_at").desc()
)
that gives us:
+---+-------+----+-------------------+
|id |col1 |col2|updated_at |
+---+-------+----+-------------------+
|12 |null |80 |2021-10-14 18:18:53|
|12 |update1|null|2021-10-14 16:20:33|
|12 |update2|null|2021-10-14 16:20:20|
|123|update2|90 |2021-10-14 18:30:00|
|123|null |80 |2021-10-14 18:25:24|
|123|null |null|2021-10-14 18:25:09|
+---+-------+----+-------------------+
Now group by id and take the first non-null value in each column (or null if there is none):
exp = [F.first(x, ignorenulls=True).alias(x) for x in df.columns[1:]]
df = df.groupBy(F.col("id")).agg(*exp)
And the result is:
+---+-------+----+-------------------+
|id |col1 |col2|updated_at |
+---+-------+----+-------------------+
|123|update2|90 |2021-10-14 18:30:00|
|12 |update1|80 |2021-10-14 18:18:53|
+---+-------+----+-------------------+
Here's the full example code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.types import TimestampType
if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()
    data = [
        (123, None, None, 1634228709),
        (123, None, 80, 1634228724),
        (123, "update2", 90, 1634229000),
        (12, "update1", None, 1634221233),
        (12, None, 80, 1634228333),
        (12, "update2", None, 1634221220),
    ]
    columns = ["id", "col1", "col2", "updated_at"]
    df = spark.createDataFrame(data, columns)
    df = df.withColumn("updated_at", F.col("updated_at").cast(TimestampType())).orderBy(
        F.col("id"), F.col("updated_at").desc()
    )
    exp = [F.first(x, ignorenulls=True).alias(x) for x in df.columns[1:]]
    df = df.groupBy(F.col("id")).agg(*exp)

Pyspark: filter last 3 days of data based on regex

I have a dataframe with dates and would like to filter for the last 3 days (not based on current time but the latest time available in the dataset)
+---+----------------------------------------------------------------------------------+----------+
|id |partition |date |
+---+----------------------------------------------------------------------------------+----------+
|1 |/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl|2019-12-01|
|2 |/raw/gsec/qradar/flows/dt=2019-11-30/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-30|
|3 |/raw/gsec/qradar/flows/dt=2019-11-29/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-29|
|4 |/raw/gsec/qradar/flows/dt=2019-11-28/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-28|
|5 |/raw/gsec/qradar/flows/dt=2019-11-27/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-27|
+---+----------------------------------------------------------------------------------+----------+
Should return
+---+----------------------------------------------------------------------------------+----------+
|id |partition |date |
+---+----------------------------------------------------------------------------------+----------+
|1 |/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl|2019-12-01|
|2 |/raw/gsec/qradar/flows/dt=2019-11-30/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-30|
|3 |/raw/gsec/qradar/flows/dt=2019-11-29/hour=00/1585218406613_flows_20191201_00.jsonl|2019-11-29|
+---+----------------------------------------------------------------------------------+----------+
EDIT: I have taken #Lamanus' answer to extract the dates from the partition string:
import pyspark.sql.functions as F

df = sqlContext.createDataFrame([
    (1, '/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl'),
    (2, '/raw/gsec/qradar/flows/dt=2019-11-30/hour=00/1585218406613_flows_20191201_00.jsonl'),
    (3, '/raw/gsec/qradar/flows/dt=2019-11-29/hour=00/1585218406613_flows_20191201_00.jsonl'),
    (4, '/raw/gsec/qradar/flows/dt=2019-11-28/hour=00/1585218406613_flows_20191201_00.jsonl'),
    (5, '/raw/gsec/qradar/flows/dt=2019-11-27/hour=00/1585218406613_flows_20191201_00.jsonl')
], ['id', 'partition'])

df.withColumn('date', F.regexp_extract('partition', '[0-9]{4}-[0-9]{2}-[0-9]{2}', 0)) \
    .show(10, False)
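Building on that, one possible way (a sketch, not taken from the answers below) to keep only the last 3 days relative to the latest date present in the dataset:
import pyspark.sql.functions as F

df2 = df.withColumn('date', F.to_date(F.regexp_extract('partition', '[0-9]{4}-[0-9]{2}-[0-9]{2}', 0)))
max_date = df2.agg(F.max('date').alias('max_date')).collect()[0]['max_date']
df2.filter(F.col('date') >= F.date_sub(F.lit(max_date), 2)).show(10, False)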
For your original purpose, I don't think you need the date-specific folders. Because the folder structure is already partitioned by dt, take them all and do the filter.
df = spark.createDataFrame([('1', '/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl')]).toDF('id', 'value')

from pyspark.sql.functions import *

dates = df.withColumn('date', regexp_extract('value', '[0-9]{4}-[0-9]{2}-[0-9]{2}', 0)) \
    .withColumn('date', explode(sequence(to_date('date'), date_sub('date', 2)))) \
    .select('date').rdd.map(lambda x: str(x[0])).collect()

path = df.withColumn('value', split('value', '/dt')[0]) \
    .select('value').rdd.map(lambda x: str(x[0])).collect()

newDF = spark.read.json(path).filter(col('dt').isin(dates))
Here is my try.
df = spark.createDataFrame([('1', '/raw/gsec/qradar/flows/dt=2019-12-01/hour=00/1585218406613_flows_20191201_00.jsonl')]).toDF('id', 'value')
from pyspark.sql.functions import *
df.withColumn('date', regexp_extract('value', '[0-9]{4}-[0-9]{2}-[0-9]{2}', 0)) \
    .withColumn('date', explode(sequence(to_date('date'), date_sub('date', 2)))) \
    .withColumn('value', concat(lit('.*/'), col('date'), lit('/.*'))).show(10, False)
+---+----------------+----------+
|id |value |date |
+---+----------------+----------+
|1 |.*/2019-12-01/.*|2019-12-01|
|1 |.*/2019-11-30/.*|2019-11-30|
|1 |.*/2019-11-29/.*|2019-11-29|
+---+----------------+----------+

How to split the spark dataframe into 2 using ratio given in terms of months and the unix epoch column?

I wanted to split the spark dataframe into 2 using ratio given in terms of months and the unix epoch column-
sample dataframe is as below-
unixepoch
---------
1539754800
1539754800
1539931200
1539927600
1539927600
1539931200
1539931200
1539931200
1539927600
1540014000
1540014000
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
1540190400
strategy of splitting-
if the total months of data given is, say, 30 months and the splittingRatio is, say, 0.6,
then expected dataframe 1 should have: 30 * 0.6 = 18 months of data
and expected dataframe 2 should have: 30 * 0.4 = 12 months of data
EDIT-1
most of the answers are given by considering the splitting ratio for the number of records, i.e. if total record count = 100 and split ratio = 0.6,
then split1DF ~= 60 records and split2DF ~= 40 records.
To be clear, this is not what I am looking for. Here the splitting ratio is given for months, which can be calculated from the given unix epoch timestamp column in the above sample dataframe.
Suppose the above epoch column is some distribution over 30 months; then I want the first 18 months of epochs in dataframe 1 and the last 12 months of epoch rows in the second dataframe. You can consider this as splitting a dataframe of time series data in spark.
EDIT-2
if the data is given for July 2018 to May 2019 = 10 months of data, then split1 (0.6 = first 6 months) = (July 2018, Jan 2019) and split2 (0.4 = last 4 months) = (Feb 2019, May 2019). There shouldn't be any randomized picking.
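For reference, a compact PySpark sketch of the same strategy (the answers below are in Scala; the column name unixepoch is taken from the sample above, and train_ratio stands for the splitting ratio):
import pyspark.sql.functions as F

train_ratio = 0.6
df2 = df.withColumn("dt", F.to_date(F.from_unixtime("unixepoch")))
min_date = df2.agg(F.min("dt")).collect()[0][0]
df2 = df2.withColumn("months_from_start", F.floor(F.months_between(F.col("dt"), F.lit(min_date))))
total_months = df2.agg(F.max("months_from_start")).collect()[0][0] + 1
train_months = round(total_months * train_ratio)
split1 = df2.filter(F.col("months_from_start") < train_months)
split2 = df2.filter(F.col("months_from_start") >= train_months)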
Use row_number & filter to split the data into two DataFrames.
scala> val totalMonths = 10
totalMonths: Int = 10
scala> val splitRatio = 0.6
splitRatio: Double = 0.6
scala> val condition = (totalMonths * splitRatio).floor + 1
condition: Double = 7.0
scala> epochDF.show(false)
+----------+-----+
|dt |month|
+----------+-----+
|1530383400|7 |
|1533061800|8 |
|1535740200|9 |
|1538332200|10 |
|1541010600|11 |
|1543602600|12 |
|1546281000|1 |
|1548959400|2 |
|1551378600|3 |
|1554057000|4 |
|1556649000|5 |
+----------+-----+
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> epochDF.orderBy($"dt".asc).withColumn("id",row_number().over(Window.orderBy($"dt".asc))).filter($"id" <= condition).show(false)
+----------+-----+---+
|dt |month|id |
+----------+-----+---+
|2018-07-01|7 |1 |
|2018-08-01|8 |2 |
|2018-09-01|9 |3 |
|2018-10-01|10 |4 |
|2018-11-01|11 |5 |
|2018-12-01|12 |6 |
|2019-01-01|1 |7 |
+----------+-----+---+
scala> epochDF.orderBy($"dt".asc).withColumn("id",row_number().over(Window.orderBy($"dt".asc))).filter($"id" > condition).show(false)
+----------+-----+---+
|dt |month|id |
+----------+-----+---+
|2019-02-01|2 |8 |
|2019-03-01|3 |9 |
|2019-04-01|4 |10 |
|2019-05-01|5 |11 |
+----------+-----+---+
I have divided the data based on months, and then on days if the data is given for only 1 month.
I prefer this method since this answer does not depend on a windowing function. The other answer given here uses Window without partitionBy, which seriously degrades performance as the data shuffles to one executor.
1. splitting method given a train ratio in terms of months
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DataTypes

val EPOCH = "epoch"

def splitTrainTest(inputDF: DataFrame,
                   trainRatio: Double): (DataFrame, DataFrame) = {
  require(trainRatio >= 0 && trainRatio <= 0.9, s"trainRatio must be between 0 and 0.9, found : $trainRatio")

  def extractDateCols(tuples: (String, Column)*): DataFrame = {
    tuples.foldLeft(inputDF) {
      case (df, (dateColPrefix, dateColumn)) =>
        df
          .withColumn(s"${dateColPrefix}_month", month(from_unixtime(dateColumn))) // month
          .withColumn(s"${dateColPrefix}_dayofmonth", dayofmonth(from_unixtime(dateColumn))) // dayofmonth
          .withColumn(s"${dateColPrefix}_year", year(from_unixtime(dateColumn))) // year
    }
  }

  val extractDF = extractDateCols((EPOCH, inputDF(EPOCH)))

  // derive min/max(yyyy-MM)
  val yearCol = s"${EPOCH}_year"
  val monthCol = s"${EPOCH}_month"
  val dayCol = s"${EPOCH}_dayofmonth"
  val SPLIT = "split"

  val dateCol = to_date(date_format(
    concat_ws("-", Seq(yearCol, monthCol).map(col): _*), "yyyy-MM-01"))

  val minMaxDF = extractDF.agg(max(dateCol).as("max_date"), min(dateCol).as("min_date"))
  val min_max_date = minMaxDF.head()

  import java.sql.{Date => SqlDate}
  val minDate = min_max_date.getAs[SqlDate]("min_date")
  val maxDate = min_max_date.getAs[SqlDate]("max_date")
  println(s"Min Date Found: $minDate")
  println(s"Max Date Found: $maxDate")

  // Get the total months for which the data exist
  val totalMonths = (maxDate.toLocalDate.getYear - minDate.toLocalDate.getYear) * 12 +
    maxDate.toLocalDate.getMonthValue - minDate.toLocalDate.getMonthValue
  println(s"Total Months of data found for is $totalMonths months")

  // difference starts with 0
  val splitDF = extractDF.withColumn(SPLIT, round(months_between(dateCol, to_date(lit(minDate)))).cast(DataTypes.IntegerType))

  val (trainDF, testDF) = totalMonths match {
    // data is provided for more than a month
    case tm if tm > 0 =>
      val trainMonths = Math.round(totalMonths * trainRatio)
      println(s"Data considered for training is < $trainMonths months")
      println(s"Data considered for testing is >= $trainMonths months")
      (splitDF.filter(col(SPLIT) < trainMonths), splitDF.filter(col(SPLIT) >= trainMonths))

    // data is provided for a month, split based on the total records in terms of days
    case tm if tm == 0 =>
      val splitDF1 = splitDF.withColumn(SPLIT,
        datediff(date_format(
          concat_ws("-", Seq(yearCol, monthCol, dayCol).map(col): _*), "yyyy-MM-dd"), lit(minDate))
      )
      // Get the total days for which the data exist
      val totalDays = splitDF1.select(max(SPLIT).as("total_days")).head.getAs[Int]("total_days")
      if (totalDays <= 1) {
        throw new RuntimeException(s"Insufficient data provided for training, Data found for $totalDays days but " +
          s"$totalDays > 1 required")
      }
      println(s"Total Days of data found is $totalDays days")
      val trainDays = Math.round(totalDays * trainRatio)
      (splitDF1.filter(col(SPLIT) < trainDays), splitDF1.filter(col(SPLIT) >= trainDays))

    // data should be there
    case default => throw new RuntimeException(s"Insufficient data provided for training, Data found for $totalMonths " +
      s"months but $totalMonths >= 1 required")
  }

  (trainDF.cache(), testDF.cache())
}
2. Test using the data from multiple months across years
// call methods
val implicits = sqlContext.sparkSession.implicits
import implicits._
val monthData = sc.parallelize(Seq(
1539754800,
1539754800,
1539931200,
1539927600,
1539927600,
1539931200,
1539931200,
1539931200,
1539927600,
1540449600,
1540449600,
1540536000,
1540536000,
1540536000,
1540424400,
1540424400,
1540618800,
1540618800,
1545979320,
1546062120,
1545892920,
1545892920,
1545892920,
1545201720,
1545892920,
1545892920
)).toDF(EPOCH)
val (split1, split2) = splitTrainTest(monthData, 0.6)
split1.show(false)
split2.show(false)
/**
* Min Date Found: 2018-10-01
* Max Date Found: 2018-12-01
* Total Months of data found for is 2 months
* Data considered for training is < 1 months
* Data considered for testing is >= 1 months
* +----------+-----------+----------------+----------+-----+
* |epoch |epoch_month|epoch_dayofmonth|epoch_year|split|
* +----------+-----------+----------------+----------+-----+
* |1539754800|10 |17 |2018 |0 |
* |1539754800|10 |17 |2018 |0 |
* |1539931200|10 |19 |2018 |0 |
* |1539927600|10 |19 |2018 |0 |
* |1539927600|10 |19 |2018 |0 |
* |1539931200|10 |19 |2018 |0 |
* |1539931200|10 |19 |2018 |0 |
* |1539931200|10 |19 |2018 |0 |
* |1539927600|10 |19 |2018 |0 |
* |1540449600|10 |25 |2018 |0 |
* |1540449600|10 |25 |2018 |0 |
* |1540536000|10 |26 |2018 |0 |
* |1540536000|10 |26 |2018 |0 |
* |1540536000|10 |26 |2018 |0 |
* |1540424400|10 |25 |2018 |0 |
* |1540424400|10 |25 |2018 |0 |
* |1540618800|10 |27 |2018 |0 |
* |1540618800|10 |27 |2018 |0 |
* +----------+-----------+----------------+----------+-----+
*
* +----------+-----------+----------------+----------+-----+
* |epoch |epoch_month|epoch_dayofmonth|epoch_year|split|
* +----------+-----------+----------------+----------+-----+
* |1545979320|12 |28 |2018 |2 |
* |1546062120|12 |29 |2018 |2 |
* |1545892920|12 |27 |2018 |2 |
* |1545892920|12 |27 |2018 |2 |
* |1545892920|12 |27 |2018 |2 |
* |1545201720|12 |19 |2018 |2 |
* |1545892920|12 |27 |2018 |2 |
* |1545892920|12 |27 |2018 |2 |
* +----------+-----------+----------------+----------+-----+
*/
3. Test using one month of data from a year
val oneMonthData = sc.parallelize(Seq(
1589514575, // Friday, May 15, 2020 3:49:35 AM
1589600975, // Saturday, May 16, 2020 3:49:35 AM
1589946575, // Wednesday, May 20, 2020 3:49:35 AM
1590378575, // Monday, May 25, 2020 3:49:35 AM
1590464975, // Tuesday, May 26, 2020 3:49:35 AM
1590470135 // Tuesday, May 26, 2020 5:15:35 AM
)).toDF(EPOCH)
val (split3, split4) = splitTrainTest(oneMonthData, 0.6)
split3.show(false)
split4.show(false)
/**
* Min Date Found: 2020-05-01
* Max Date Found: 2020-05-01
* Total Months of data found for is 0 months
* Total Days of data found is 25 days
* +----------+-----------+----------------+----------+-----+
* |epoch |epoch_month|epoch_dayofmonth|epoch_year|split|
* +----------+-----------+----------------+----------+-----+
* |1589514575|5 |15 |2020 |14 |
* +----------+-----------+----------------+----------+-----+
*
* +----------+-----------+----------------+----------+-----+
* |epoch |epoch_month|epoch_dayofmonth|epoch_year|split|
* +----------+-----------+----------------+----------+-----+
* |1589600975|5 |16 |2020 |15 |
* |1589946575|5 |20 |2020 |19 |
* |1590378575|5 |25 |2020 |24 |
* |1590464975|5 |26 |2020 |25 |
* |1590470135|5 |26 |2020 |25 |
* +----------+-----------+----------------+----------+-----+
*/

Use Window to count lines with if condition in scala 2

I already posted a similar question, but someone gave me a trick to avoid using the "if condition".
Here I am in a similar position and I cannot find any trick to avoid it...
I have a dataframe.
var df = sc.parallelize(Array(
(1, "2017-06-29 10:53:53.0","2017-06-25 14:60:53.0","boulanger.fr"),
(2, "2017-07-05 10:48:57.0","2017-09-05 08:60:53.0","patissier.fr"),
(3, "2017-06-28 10:31:42.0","2017-02-28 20:31:42.0","boulanger.fr"),
(4, "2017-08-21 17:31:12.0","2017-10-21 10:29:12.0","patissier.fr"),
(5, "2017-07-28 11:22:42.0","2017-05-28 11:22:42.0","boulanger.fr"),
(6, "2017-08-23 17:03:43.0","2017-07-23 09:03:43.0","patissier.fr"),
(7, "2017-08-24 16:08:07.0","2017-08-22 16:08:07.0","boulanger.fr"),
(8, "2017-08-31 17:20:43.0","2017-05-22 17:05:43.0","patissier.fr"),
(9, "2017-09-04 14:35:38.0","2017-07-04 07:30:25.0","boulanger.fr"),
(10, "2017-09-07 15:10:34.0","2017-07-29 12:10:34.0","patissier.fr"))).toDF("id", "date1","date2", "mail")
df = df.withColumn("date1", (unix_timestamp($"date1", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))
df = df.withColumn("date2", (unix_timestamp($"date2", "yyyy-MM-dd HH:mm:ss").cast("timestamp")))
df = df.orderBy("date1", "date2")
It looks like:
+---+---------------------+---------------------+------------+
|id |date1 |date2 |mail |
+---+---------------------+---------------------+------------+
|3 |2017-06-28 10:31:42.0|2017-02-28 20:31:42.0|boulanger.fr|
|1 |2017-06-29 10:53:53.0|2017-06-25 15:00:53.0|boulanger.fr|
|2 |2017-07-05 10:48:57.0|2017-09-05 09:00:53.0|patissier.fr|
|5 |2017-07-28 11:22:42.0|2017-05-28 11:22:42.0|boulanger.fr|
|4 |2017-08-21 17:31:12.0|2017-10-21 10:29:12.0|patissier.fr|
|6 |2017-08-23 17:03:43.0|2017-07-23 09:03:43.0|patissier.fr|
|7 |2017-08-24 16:08:07.0|2017-08-22 16:08:07.0|boulanger.fr|
|8 |2017-08-31 17:20:43.0|2017-05-22 17:05:43.0|patissier.fr|
|9 |2017-09-04 14:35:38.0|2017-07-04 07:30:25.0|boulanger.fr|
|10 |2017-09-07 15:10:34.0|2017-07-29 12:10:34.0|patissier.fr|
+---+---------------------+---------------------+------------+
For each id I want to count, among all the other lines, the number of lines with:
a date1 in [my_current_date1 - 60 days, my_current_date1 - 1 day]
a date2 < my_current_date1
the same mail as my current_mail
If I look at line 5, I want to return the number of lines with:
date1 in [2017-05-29 11:22:42.0, 2017-07-27 11:22:42.0]
date2 < 2017-07-28 11:22:42.0
mail = boulanger.fr
--> The result would be 2 (corresponding to id 1 and id 3)
So I would like to do something like:
val w = Window.partitionBy("mail").orderBy(col("date1").cast("long")).rangeBetween(-60*24*60*60,-1*24*60*60)
var df= df.withColumn("all_previous", count("mail") over w)
But this handles condition 1 and condition 3, not the second one... I have to add something to include this second condition comparing date2 to my_date1...
Using a generalized Window spec with last(date1) being the current date1 per Window partition and a sum over 0's and 1's as conditional count, here's how I would incorporate your condition #2 into the counting criteria:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
def days(n: Long): Long = n * 24 * 60 * 60
val w = Window.partitionBy("mail").orderBy($"date1".cast("long"))
val w1 = w.rangeBetween(days(-60), days(0))
val w2 = w.rangeBetween(days(-60), days(-1))
df.withColumn("all_previous", sum(
when($"date2".cast("long") < last($"date1").over(w1).cast("long"), 1).
otherwise(0)
).over(w2)
).na.fill(0).
show
// +---+-------------------+-------------------+------------+------------+
// | id| date1| date2| mail|all_previous|
// +---+-------------------+-------------------+------------+------------+
// | 3|2017-06-28 10:31:42|2017-02-28 20:31:42|boulanger.fr| 0|
// | 1|2017-06-29 10:53:53|2017-06-25 15:00:53|boulanger.fr| 1|
// | 5|2017-07-28 11:22:42|2017-05-28 11:22:42|boulanger.fr| 2|
// | 7|2017-08-24 16:08:07|2017-08-22 16:08:07|boulanger.fr| 3|
// | 9|2017-09-04 14:35:38|2017-07-04 07:30:25|boulanger.fr| 2|
// | 2|2017-07-05 10:48:57|2017-09-05 09:00:53|patissier.fr| 0|
// | 4|2017-08-21 17:31:12|2017-10-21 10:29:12|patissier.fr| 0|
// | 6|2017-08-23 17:03:43|2017-07-23 09:03:43|patissier.fr| 0|
// | 8|2017-08-31 17:20:43|2017-05-22 17:05:43|patissier.fr| 1|
// | 10|2017-09-07 15:10:34|2017-07-29 12:10:34|patissier.fr| 2|
// +---+-------------------+-------------------+------------+------------+
[UPDATE]
This solution is incorrect, even though the result appears to be correct with the sample dataset. In particular, last($"date1").over(w1) did not work the way intended. The answer is being kept to hopefully serve as a lead for a working solution.

How to create pairs of nodes in Spark?

I have the following DataFrame in Spark and Scala:
group nodeId date
1 1 2016-10-12T12:10:00.000Z
1 2 2016-10-12T12:00:00.000Z
1 3 2016-10-12T12:05:00.000Z
2 1 2016-10-12T12:30:00.000Z
2 2 2016-10-12T12:35:00.000Z
I need to group records by group, sort them in ascending order by date and make pairs of sequential nodeId. Also, date should be converted to Unix epoch.
This can be better explained with the expected output:
group nodeId_1 nodeId_2 date
1 2 3 2016-10-12T12:00:00.000Z
1 3 1 2016-10-12T12:05:00.000Z
2 1 2 2016-10-12T12:30:00.000Z
This is what I did so far:
df
.groupBy("group")
.agg($"nodeId",$"date")
.orderBy(asc("date"))
But I don't know how to create pairs of nodeId.
You can benefit from using a Window function with the lead built-in function to create the pairs, and the to_utc_timestamp built-in function to convert the date to an epoch date. Finally, you have to filter out the unpaired rows as you don't require them in the output.
Following is the program for the above explanation. I have used comments for clarity:
import org.apache.spark.sql.expressions._
def windowSpec = Window.partitionBy("group").orderBy("date") //defining window function grouping by group and ordering by date
import org.apache.spark.sql.functions._
df.withColumn("date", to_utc_timestamp(col("date"), "Asia/Kathmandu")) //converting the date to epoch datetime you can choose other timezone as required
.withColumn("nodeId_2", lead("nodeId", 1).over(windowSpec)) //using window for creating pairs
.filter(col("nodeId_2").isNotNull) //filtering out the unpaired rows
.select(col("group"), col("nodeId").as("nodeId_1"), col("nodeId_2"), col("date")) //selecting as required final dataframe
.show(false)
You should get the final dataframe as required
+-----+--------+--------+-------------------+
|group|nodeId_1|nodeId_2|date |
+-----+--------+--------+-------------------+
|1 |2 |3 |2016-10-12 12:00:00|
|1 |3 |1 |2016-10-12 12:05:00|
|2 |1 |2 |2016-10-12 12:30:00|
+-----+--------+--------+-------------------+
I hope the answer is helpful
Note that to get the correct epoch date I have used Asia/Kathmandu as the timezone.
If I understand your requirement correctly, you can use a self-join on group and a < inequality condition on nodeId:
val df = Seq(
(1, 1, "2016-10-12T12:10:00.000Z"),
(1, 2, "2016-10-12T12:00:00.000Z"),
(1, 3, "2016-10-12T12:05:00.000Z"),
(2, 1, "2016-10-12T12:30:00.000Z"),
(2, 2, "2016-10-12T12:35:00.000Z")
).toDF("group", "nodeId", "date")
df.as("df1").join(
df.as("df2"),
$"df1.group" === $"df2.group" && $"df1.nodeId" < $"df2.nodeId"
).select(
$"df1.group", $"df1.nodeId", $"df2.nodeId",
when($"df1.date" < $"df2.date", $"df1.date").otherwise($"df2.date").as("date")
)
// +-----+------+------+------------------------+
// |group|nodeId|nodeId|date |
// +-----+------+------+------------------------+
// |1 |1 |3 |2016-10-12T12:05:00.000Z|
// |1 |1 |2 |2016-10-12T12:00:00.000Z|
// |1 |2 |3 |2016-10-12T12:00:00.000Z|
// |2 |1 |2 |2016-10-12T12:30:00.000Z|
// +-----+------+------+------------------------+