Data processing for time series data in Spark - Scala

Given the sample data below, where
t - time series sample (datetime),
lat - latitude,
long - longitude:

t   lat  long
0   27   28
5   27   28
10  27   28
15  29   49
20  29   49
25  27   28
30  27   28
I want to process the time series data in such a way that, by grouping on the (lat, long) pair, I get the distinct time interval for each consecutive run of that pair. I am doing the processing in Spark, and I want output similar to this:

Lat-long   Interval
(27,28)    (0,10)
(29,49)    (15,20)
(27,28)    (25,30)

I wouldn't have suggested this solution if your data were huge, but since you commented

"I am processing day-wise data which is stored in Cassandra, size of 5-6k rows of records/second"

the following solution proposal should be fine.
Looking at your given dataframe, the schema should be:
root
|-- t: integer (nullable = false)
|-- lat: integer (nullable = false)
|-- long: integer (nullable = false)
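For reference, a minimal sketch that reproduces this sample dataframe (assuming an active SparkSession named spark, which is not shown in the original; the values are taken from the sample above):

import spark.implicits._

val df = Seq(
  (0, 27, 28), (5, 27, 28), (10, 27, 28),
  (15, 29, 49), (20, 29, 49),
  (25, 27, 28), (30, 27, 28)
).toDF("t", "lat", "long")
df.printSchema()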
Your expected output suggests that you need an additional column for grouping the dataframe, which requires collecting the data to the driver:
import scala.collection.mutable.ArrayBuffer

val collectedRDD = df.collect()   // Array[Row] on the driver
var varianceCount, lattitude, longitude = 0
val groupedData = new ArrayBuffer[(Int, Int, Int, Int)]()
for (rdd <- collectedRDD) {
  val t = rdd.getAs[Int]("t")
  val lat = rdd.getAs[Int]("lat")
  val long = rdd.getAs[Int]("long")
  if (lat != lattitude || long != longitude) {
    // the (lat, long) pair changed: start a new group
    varianceCount = varianceCount + 1
    lattitude = lat
    longitude = long
  }
  // every row is appended with the id of the group it belongs to
  groupedData.append((t, lat, long, varianceCount))
}
Then convert the ArrayBuffer to a dataframe and use groupBy with aggregation:
import org.apache.spark.sql.functions._
import spark.implicits._

val finalDF = groupedData
  .toDF("t", "lat", "long", "grouped")
  .groupBy(struct("lat", "long").as("lat-long"), col("grouped"))
  .agg(struct(min("t"), max("t")).as("interval"))
  .drop("grouped")
finalDF should be
+--------+--------+
|lat-long|interval|
+--------+--------+
|[29,49] |[15,20] |
|[27,28] |[0,10] |
|[27,28] |[25,30] |
+--------+--------+
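If the data ever grows beyond what is comfortable to collect, the same grouping column can also be built with window functions instead; a rough sketch (not part of the original approach, and note that an unpartitioned window likewise pulls all rows onto a single partition):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.orderBy("t")
val grouped = df
  .withColumn("changed",
    when(lag("lat", 1).over(w) =!= col("lat") || lag("long", 1).over(w) =!= col("long"), 1)
      .otherwise(0))                                // 1 whenever the (lat, long) pair differs from the previous row
  .withColumn("grouped", sum("changed").over(w))    // running count of changes = group id

The same groupBy and aggregation shown above can then be applied to this dataframe directly.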
I hope the answer is helpful

Related

Convert value string to Date, Scala Spark

I am getting a value from a DataFrame using a max aggregation, so I get a String, and I want to convert it to a Date.
What I am doing is this:
var lastDate = spark.read.parquet("data/users").select("Date").agg(max(col("Date"))).first.get(0).toString
df2 = table_read.filter("Date=" + lastDate)
This way I get a variable of String type, and now I want to convert it to Date type. I have been searching how to do this in other answers, but all I saw was how to do it on DataFrame columns with to_date. How can I do it in this case?
EDIT:
Schema:
root
|-- Date: date (nullable = false)
|-- op: string (nullable = true)
|-- value: string (nullable = true)
Output of spark.read.parquet("data/users").select("Date").agg(max(col("Date"))).show:
+-----------+
|max(Date) |
+-----------+
|2019-11-10 |
+-----------+
Error:
Exception message: cannot resolve '(`Date` = ((2021 - 12) - 14))' due to data type mismatch: differing types in '(`Date` = ((2021 - 12) - 14))' (date and int).; line 1 pos 0;
'Filter (Date#5488 = ((2021 - 12) - 14))
You can use .getDate instead of .get(0).toString, e.g.

var lastDate = spark.read.parquet("data/users").select("Date").agg(max(col("Date"))).first.getDate(0)

To use it in a filter, you can do

df2 = table_read.filter(col("Date") === lastDate)
// or df2 = table_read.filter("Date='" + lastDate + "'")
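Alternatively, if you prefer to keep the String and convert it yourself, a small sketch (the names maxDateStr/maxDate are illustrative, and it assumes the yyyy-MM-dd format shown in your output):

val maxDateStr = spark.read.parquet("data/users").select("Date").agg(max(col("Date"))).first.get(0).toString
val maxDate = java.sql.Date.valueOf(maxDateStr)   // parses "yyyy-MM-dd" into a java.sql.Date
df2 = table_read.filter(col("Date") === maxDate)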

Split dataframe by column values Scala

I need to split a dataframe into multiple dataframes by the timestamp column. I would provide a number of hours that each resulting dataframe should cover, and get back a set of dataframes, each spanning that number of hours.
The signature of the method would look like this:
def splitDataframes(df: DataFrame, hoursNumber: Int): Seq[DataFrame]
How can I achieve that?
The schema of the dataframe looks like this:
root
|-- date_time: integer (nullable = true)
|-- user_id: long (nullable = true)
|-- order_id: string (nullable = true)
|-- description: string (nullable = true)
|-- event_date: date (nullable = true)
|-- event_ts: timestamp (nullable = true)
|-- event_hour: long (nullable = true)
Some of the input df fields:
event_ts, user_id
2020-12-13 08:22:00, 1
2020-12-13 08:51:00, 2
2020-12-13 09:28:00, 1
2020-12-13 10:53:00, 3
2020-12-13 11:05:00, 1
2020-12-13 12:19:00, 2
Some of the output df fields with hoursNumber=2:

df1:
event_ts, user_id
2020-12-13 08:22:00, 1
2020-12-13 08:51:00, 2
2020-12-13 09:28:00, 1

df2:
2020-12-13 10:53:00, 3
2020-12-13 11:05:00, 1

df3:
2020-12-13 12:19:00, 2
Convert the timestamp to a Unix timestamp, and then work out the group id for each row from the time difference to the earliest timestamp.
EDIT: the solution is even simpler if you count the starting time from 00:00:00.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{floor, unix_timestamp}
import spark.implicits._

def splitDataframes(df: DataFrame, hoursNumber: Int): Seq[DataFrame] = {
  // bucket id = number of hoursNumber-wide windows since the epoch (i.e. counted from 00:00:00)
  val df2 = df.withColumn(
    "event_unix_ts",
    unix_timestamp($"event_ts")
  ).withColumn(
    "grouping",
    floor($"event_unix_ts" / (3600 * hoursNumber))
  ).drop("event_unix_ts")
  // one dataframe per distinct bucket
  df2.select("grouping").distinct().collect().map(
    x => df2.filter($"grouping" === x(0)).drop("grouping")
  ).toSeq
}
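A quick usage sketch against the sample input above (df as in the question):

val parts = splitDataframes(df, 2)
parts.foreach(_.show(false))
// with the sample data this yields three dataframes, covering 08:00-10:00, 10:00-12:00 and 12:00-14:00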

PySpark - Get the size of each list in group by

I have a massive pyspark dataframe. I need to group by Person and then collect their Budget items into a list, to perform a further calculation.
As an example,
a = [('Bob', 562,"Food", "12 May 2018"), ('Bob',880,"Food","01 June 2018"), ('Bob',380,'Household'," 16 June 2018"), ('Sue',85,'Household'," 16 July 2018"), ('Sue',963,'Household'," 16 Sept 2018")]
df = spark.createDataFrame(a, ["Person", "Amount","Budget", "Date"])
Group By:
import pyspark.sql.functions as F
df_grouped = df.groupby('person').agg(F.collect_list("Budget").alias("data"))
Schema:
root
|-- person: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: string (containsNull = true)
However, I am getting a memory error when I try to apply a UDF on each person. How can I get the size (in megabytes or gigabytes) of each list (data) for each person?
I have done the following, but I am getting nulls:
import sys
from pyspark.sql.types import DoubleType

size_list_udf = F.udf(lambda data: sys.getsizeof(data)/1000, DoubleType())
df_grouped = df_grouped.withColumn("size", size_list_udf("data"))
df_grouped.show()
Output:
+------+--------------------+----+
|person| data|size|
+------+--------------------+----+
| Sue|[Household, House...|null|
| Bob|[Food, Food, Hous...|null|
+------+--------------------+----+
You just have one minor issue with your code. sys.getsizeof() returns the size of an object in bytes as an integer. You're dividing it by the integer 1000 to get kilobytes; in Python 2 that integer division returns an integer, but you defined your udf to return a DoubleType(), so the mismatch comes back as null. The simple fix is to divide by 1000.0 instead.
import sys
from pyspark.sql.types import DoubleType

size_list_udf = F.udf(lambda data: sys.getsizeof(data)/1000.0, DoubleType())
df_grouped = df_grouped.withColumn("size", size_list_udf("data"))
df_grouped.show(truncate=False)
#+------+-----------------------+-----+
#|person|data |size |
#+------+-----------------------+-----+
#|Sue |[Household, Household] |0.112|
#|Bob |[Food, Food, Household]|0.12 |
#+------+-----------------------+-----+
I have found that in cases where a udf is returning null, the culprit is very frequently a type mismatch.

spark OneHotEncoder - how to exclude user-defined category?

Consider the following spark dataframe:
df.printSchema()
|-- predictor: double (nullable = true)
|-- label: double (nullable = true)
|-- date: string (nullable = true)
df.show(6)
predictor label date
4.23 6.33 20160510
4.77 7.18 20160510
4.09 5.94 20160511
4.23 6.33 20160511
4.77 7.18 20160512
4.09 5.94 20160512
Essentially, my dataframe consists of data with daily frequency. I need to map the column of dates to a column of binary vectors. This is simple to implement using StringIndexer & OneHotEncoder:
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val dateIndexer = new StringIndexer()
  .setInputCol("date")
  .setOutputCol("dateIndex")
  .fit(df)
val indexed = dateIndexer.transform(df)

val encoder = new OneHotEncoder()
  .setInputCol("dateIndex")
  .setOutputCol("date_codeVec")
val encoded = encoder.transform(indexed)
My problem is that OneHotEncoder drops the last category by default. However, I need to drop the category which relates to the first date in my dataframe (20160510 in the above example) because I need to compute a time trend relative to the first date.
How can I achieve this for the above example (note that I have more than 100 dates in my dataframe)?
You can try setting dropLast to false:

val encoder = new OneHotEncoder()
  .setInputCol("dateIndex")
  .setOutputCol("date_codeVec")
  .setDropLast(false)
val encoded = encoder.transform(indexed)
and dropping the chosen level manually, using VectorSlicer:

import org.apache.spark.ml.feature.VectorSlicer

val slicer = new VectorSlicer()
  .setInputCol("date_codeVec")
  .setOutputCol("data_codeVec_selected")
  // keep every label except the smallest one (the first date, since labels are yyyyMMdd strings)
  .setNames(dateIndexer.labels.diff(Seq(dateIndexer.labels.min)))

slicer.transform(encoded)
+---------+-----+--------+---------+-------------+---------------------+
|predictor|label| date|dateIndex| date_codeVec|data_codeVec_selected|
+---------+-----+--------+---------+-------------+---------------------+
| 4.23| 6.33|20160510| 0.0|(3,[0],[1.0])| (2,[],[])|
| 4.77| 7.18|20160510| 0.0|(3,[0],[1.0])| (2,[],[])|
| 4.09| 5.94|20160511| 2.0|(3,[2],[1.0])| (2,[1],[1.0])|
| 4.23| 6.33|20160511| 2.0|(3,[2],[1.0])| (2,[1],[1.0])|
| 4.77| 7.18|20160512| 1.0|(3,[1],[1.0])| (2,[0],[1.0])|
| 4.09| 5.94|20160512| 1.0|(3,[1],[1.0])| (2,[0],[1.0])|
+---------+-----+--------+---------+-------------+---------------------+

Classify data using Apache Spark

I have the following dataset:
|-- created_at: timestamp (nullable = true)
|-- channel_source_id: integer (nullable = true)
|-- movie_id: integer (nullable = true)
I would like to classify each movie_id based on some conditions:

Number of times it has been played:
# Count occurrences by ID
SELECT COUNT(created_at) FROM logs GROUP BY movie_id

The range (created_at) in which the movie has been played:
# Returns distinct movie_ids
SELECT DISTINCT(movie_id) FROM logs
# For each movie_id, retrieve the hours at which it has been played
# With the result, I could apply a filter on the df to extract the intervals
SELECT created_at FROM logs WHERE movie_id = ?

Number of different channel_source_id values that have played the movie:
# Count the number of channels that have played it
SELECT COUNT(DISTINCT(channel_source_id)) FROM logs WHERE movie_id = ? GROUP BY movie_id
I've written a simple table to help me with the classification:
Played 1 to 5 times, range between 00:00:00 - 03:59:59, 1 to 3 different channels >> Movie Type A
Played 6 to 10 times, range between 04:00:00 - 07:59:59, 4 to 5 different channels >> Movie Type B
etc
I'm using Spark to import the file, but I'm lost on how to perform the classification. Could anyone give me a hand on where I should start?
def run() = {
  val sqlContext = new SQLContext(sc)
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .options(Map("header" -> "true", "inferSchema" -> "true"))
    .load("/home/plc/Desktop/movies.csv")
  df.printSchema()
}
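As a rough starting point, the aggregations described above can be expressed with the DataFrame API; a minimal sketch (assuming the CSV has loaded into df with the schema shown, and using the first two rows of the classification table as illustrative thresholds):

import org.apache.spark.sql.functions._

val stats = df.groupBy("movie_id").agg(
  count("created_at").as("play_count"),                    // how many times the movie was played
  min(hour(col("created_at"))).as("first_hour"),           // earliest hour of day it was played
  max(hour(col("created_at"))).as("last_hour"),            // latest hour of day it was played
  countDistinct("channel_source_id").as("channel_count")   // number of different channels
)

// apply the classification table with when/otherwise
val classified = stats.withColumn("movie_type",
  when(col("play_count").between(1, 5) && col("first_hour") >= 0 && col("last_hour") <= 3 && col("channel_count").between(1, 3), "A")
    .when(col("play_count").between(6, 10) && col("first_hour") >= 4 && col("last_hour") <= 7 && col("channel_count").between(4, 5), "B")
    .otherwise("unclassified"))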