Classify data using Apache Spark

Classify data using Apache Spark - scala

I have the following ds:
|-- created_at: timestamp (nullable = true)
|-- channel_source_id: integer (nullable = true)
|-- movie_id: integer (nullable = true)
I would like to classify each movie_id based on the following conditions:
Number of times it has been played;
#Count occurrences by ID
SELECT movie_id, COUNT(created_at) FROM logs GROUP BY movie_id
The time range (created_at) in which the movie has been played;
#Returns distinct movie_ids
SELECT DISTINCT(movie_id) FROM logs
#For each movie_id, I would like to retrieve the hours at which it has been played
#Once I have the result, I could apply a filter on the df to extract the intervals
SELECT created_at FROM logs WHERE movie_id = ?
Number of different channel_source_id values that have played the movie;
#Count the number of channels that have played the movie
SELECT COUNT(DISTINCT(channel_source_id)) FROM logs WHERE movie_id = ? GROUP BY movie_id
I've written a simple table to help me with the classification:
Played 1 to 5 times, range between 00:00:00 - 03:59:59, 1 to 3 different channels >> Movie Type A
Played 6 to 10 times, range between 04:00:00 - 07:59:59, 4 to 5 different channels >> Movie Type B
etc
I'm using Spark to import the file, but I'm lost on how to perform the classification. Could anyone give me a hand on where I should start?
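import org.apache.spark.sql.SQLContext

// sc is the existing SparkContext (e.g. the one provided by spark-shell)
def run() = {
  val sqlContext = new SQLContext(sc)
  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .options(Map("header" -> "true", "inferSchema" -> "true"))
    .load("/home/plc/Desktop/movies.csv")
  df.printSchema()
}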
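To make the rule table concrete, here is a minimal sketch of the classification logic in plain Python, without Spark. The RULES list and the classify function are hypothetical names made up for this illustration, and reading "range" as "every play falls inside the hour window" is an assumption on my part. In Spark the same three statistics would come from a single groupBy on movie_id.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical rule table modelled on the two rows given in the question:
# (min plays, max plays, first hour, last hour, min channels, max channels, label)
RULES = [
    (1, 5, 0, 3, 1, 3, "A"),
    (6, 10, 4, 7, 4, 5, "B"),
]

def classify(logs):
    """logs: iterable of (created_at, channel_source_id, movie_id) tuples."""
    # Collect the three statistics per movie in one pass.
    stats = defaultdict(lambda: {"plays": 0, "hours": set(), "channels": set()})
    for created_at, channel, movie in logs:
        s = stats[movie]
        s["plays"] += 1
        s["hours"].add(created_at.hour)
        s["channels"].add(channel)

    result = {}
    for movie, s in stats.items():
        result[movie] = None  # stays None when no rule matches
        for lo_p, hi_p, lo_h, hi_h, lo_c, hi_c, label in RULES:
            if (lo_p <= s["plays"] <= hi_p
                    and all(lo_h <= h <= hi_h for h in s["hours"])
                    and lo_c <= len(s["channels"]) <= hi_c):
                result[movie] = label
                break
    return result

sample_logs = [
    (datetime(2016, 1, 1, 1, 0, 0), 10, 1),
    (datetime(2016, 1, 1, 2, 30, 0), 11, 1),
    (datetime(2016, 1, 1, 12, 0, 0), 10, 2),
]
print(classify(sample_logs))  # {1: 'A', 2: None}
```

Movie 1 is played twice between 01:00 and 02:59 on two channels, so it matches the first rule; movie 2 is played at noon and matches neither.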

Related

Split dataframe by column values Scala

I need to split a dataframe into multiple dataframes by the timestamp column. I would provide the number of hours each resulting dataframe should cover, and get back a set of dataframes, each containing that many hours of data.
The signature of the method would look like this:
def splitDataframes(df: DataFrame, hoursNumber: Int): Seq[DataFrame]
How can I achieve that?
The schema of the dataframe looks like this:
root
|-- date_time: integer (nullable = true)
|-- user_id: long (nullable = true)
|-- order_id: string (nullable = true)
|-- description: string (nullable = true)
|-- event_date: date (nullable = true)
|-- event_ts: timestamp (nullable = true)
|-- event_hour: long (nullable = true)
Some of the input df fields:
event_ts, user_id
2020-12-13 08:22:00, 1
2020-12-13 08:51:00, 2
2020-12-13 09:28:00, 1
2020-12-13 10:53:00, 3
2020-12-13 11:05:00, 1
2020-12-13 12:19:00, 2
Some of the output df fields with hoursNumber=2:
df1
event_ts, user_id
2020-12-13 08:22:00, 1
2020-12-13 08:51:00, 2
2020-12-13 09:28:00, 1
df2
event_ts, user_id
2020-12-13 10:53:00, 3
2020-12-13 11:05:00, 1
df3
event_ts, user_id
2020-12-13 12:19:00, 2

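Convert the timestamp to a Unix timestamp, and then work out a group id for each row using the time difference from the earliest timestamp.
EDIT: the solution is even simpler if you count the starting time from 00:00:00: dividing the Unix timestamp by the window length in seconds and taking the floor gives the group id directly.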
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{floor, unix_timestamp}

def splitDataframes(df: DataFrame, hoursNumber: Int): Seq[DataFrame] = {
  import df.sparkSession.implicits._ // enables the $"col" syntax

  // Group id: number of whole hoursNumber-hour windows since the Unix epoch
  val df2 = df.withColumn(
    "grouping",
    floor(unix_timestamp($"event_ts") / (3600 * hoursNumber))
  )

  // One filtered dataframe per distinct group id, in chronological order
  df2.select("grouping").distinct().orderBy("grouping").collect().map(
    x => df2.filter($"grouping" === x(0)).drop("grouping")
  ).toSeq
}
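To check the grouping arithmetic against the sample rows from the question, here is a small sketch in plain Python (no Spark). split_by_hours is a hypothetical helper written for this illustration, and it assumes the timestamps are in UTC; Spark's unix_timestamp uses the session time zone instead, so window boundaries may shift there.

```python
from collections import defaultdict
from datetime import datetime, timezone

def split_by_hours(rows, hours_number):
    """rows: list of (event_ts string, user_id); returns groups in time order."""
    buckets = defaultdict(list)
    for event_ts, user_id in rows:
        # Assumption: timestamps are interpreted as UTC.
        ts = datetime.strptime(event_ts, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
        # Same formula as the Scala code: floor(unix_ts / (3600 * hoursNumber))
        bucket = int(ts.timestamp()) // (3600 * hours_number)
        buckets[bucket].append((event_ts, user_id))
    return [buckets[b] for b in sorted(buckets)]

rows = [
    ("2020-12-13 08:22:00", 1),
    ("2020-12-13 08:51:00", 2),
    ("2020-12-13 09:28:00", 1),
    ("2020-12-13 10:53:00", 3),
    ("2020-12-13 11:05:00", 1),
    ("2020-12-13 12:19:00", 2),
]
groups = split_by_hours(rows, 2)
print([len(g) for g in groups])  # [3, 2, 1]
```

With hoursNumber=2 the windows are 08:00-09:59, 10:00-11:59 and 12:00-13:59, which gives the three groups requested in the question.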

How to get data of second data frame for all values of particular columns values matched in first dataframe?

I have two dataframes, as below:
first_df
|-- company_id: string (nullable = true)
|-- max_dd: date (nullable = true)
|-- min_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
second_df
|-- company_id: string (nullable = true)
|-- max_dd: date (nullable = true)
|-- mean: double (nullable = true)
|-- count: long (nullable = false)
I have data for some companies in second_df. I need to get the rows from second_df for those company ids which are listed in first_df.
Which Spark APIs would be useful here?
How can I do it?
Thank you.
Question extension:
If there are no stored records, then first_df would be empty. Hence first_df("mean") and first_df("count") would be null, which makes "acc_new_mean" null. In that case I need to set "new_mean" to second_df("mean"). How can I do that?
I tried it like this, but it is not working.
Any clue how to handle .withColumn("new_mean", ... ) here?
val acc_new_mean = (second_df("mean") + first_df("mean")) / (second_df("count") + first_df("count"))
val acc_new_count = second_df("count") + first_df("count")
val new_df = second_df.join(first_df.withColumnRenamed("company_id", "right_company_id").as("a"),
( $"a.right_company_id" === second_df("company_id") && ( second_df("min_dd") > $"a.max_dd" ) )
, "leftOuter")
.withColumn("new_mean", if(acc_new_mean == null) lit(second_df("mean")) else acc_new_mean )

APPROACH 1 :
If you find it difficult to join the 2 dataframes using the dataframe join API, you could use SQL if you are comfortable with it. For that, you can register your 2 dataframes as temporary views in Spark and then write SQL on top of them. (createOrReplaceTempView replaces registerTempTable, which has been deprecated since Spark 2.0.)
second_df.createOrReplaceTempView("table_second_df")
first_df.createOrReplaceTempView("table_first_df")
val new_df = spark.sql("select distinct s.* from table_second_df s join table_first_df f on s.company_id=f.company_id")
new_df.show()
As you requested, I have added the logic.
Consider your first_df looks like below :
+----------+----------+----------+----+-----+
|company_id| max_dd| min_dd|mean|count|
+----------+----------+----------+----+-----+
| A|2019-04-05|2019-04-01| 10| 100|
| A|2019-04-06|2019-04-02| 20| 200|
| B|2019-04-08|2019-04-01| 30| 300|
| B|2019-04-09|2019-04-02| 40| 400|
+----------+----------+----------+----+-----+
Consider your second_df looks like below :
+----------+----------+----+-----+
|company_id| max_dd|mean|count|
+----------+----------+----+-----+
| A|2019-04-03| 10| 100|
| A|2019-04-02| 20| 200|
+----------+----------+----+-----+
Since company id A is present in the second table, I have taken its latest max_dd record from second_df. Company id B is not in second_df, so I took its latest max_dd record from first_df.
Please find the code below.
first_df.createOrReplaceTempView("table_first_df")
second_df.createOrReplaceTempView("table_second_df")
val new_df = spark.sql("""
  select company_id, max_dd, min_dd, mean, count
  from (
    select distinct s.company_id, s.max_dd, null as min_dd, s.mean, s.count,
           row_number() over (partition by s.company_id order by s.max_dd desc) rno
    from table_second_df s
    join table_first_df f on s.company_id = f.company_id
  )
  where rno = 1
  union
  select company_id, max_dd, min_dd, mean, count
  from (
    select distinct f.*,
           row_number() over (partition by f.company_id order by f.max_dd desc) rno
    from table_first_df f
    left join table_second_df s on s.company_id = f.company_id
    where s.company_id is null
  )
  where rno = 1
""")
new_df.show()
Below is the result :
APPROACH 2 :
Instead of creating temp tables as in Approach 1, you can use the dataframe join API. This is the same logic as Approach 1, expressed with the dataframe API. Please don't forget to import org.apache.spark.sql.expressions.Window, as I have used Window.partitionBy in the code below.
val new_df = second_df.as('s)
  .join(first_df.as('f), $"s.company_id" === $"f.company_id", "inner")
  .drop($"min_dd")
  .withColumn("min_dd", lit(""))
  .select($"s.company_id", $"s.max_dd", $"min_dd", $"s.mean", $"s.count")
  .dropDuplicates
  .withColumn("Rno", row_number().over(Window.partitionBy($"s.company_id").orderBy($"s.max_dd".desc)))
  .filter($"Rno" === 1)
  .drop($"Rno")
  .union(
    first_df.as('f)
      .join(second_df.as('s), $"s.company_id" === $"f.company_id", "left_anti")
      .select($"f.company_id", $"f.max_dd", $"f.min_dd", $"f.mean", $"f.count")
      .dropDuplicates
      .withColumn("Rno", row_number().over(Window.partitionBy($"f.company_id").orderBy($"f.max_dd".desc)))
      .filter($"Rno" === 1)
      .drop($"Rno")
  )
new_df.show()
Below is the result :
Please let me know if you have any questions.
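For reference, the selection rule described above (latest max_dd from second_df for companies present in both tables, otherwise latest max_dd from first_df) can be traced by hand on the sample rows. The sketch below does that in plain Python; latest_per_company is a hypothetical helper written for this illustration, not part of either approach, and it represents the missing min_dd as None.

```python
# Sample rows copied from the tables above.
first_df = [
    ("A", "2019-04-05", "2019-04-01", 10, 100),
    ("A", "2019-04-06", "2019-04-02", 20, 200),
    ("B", "2019-04-08", "2019-04-01", 30, 300),
    ("B", "2019-04-09", "2019-04-02", 40, 400),
]
second_df = [
    ("A", "2019-04-03", 10, 100),
    ("A", "2019-04-02", 20, 200),
]

def latest_per_company(first_rows, second_rows):
    first_ids = {r[0] for r in first_rows}
    second_ids = {r[0] for r in second_rows}
    result = {}
    # Companies present in both tables: keep the latest max_dd row of second_df.
    # ISO-formatted date strings compare correctly as plain strings.
    for cid, max_dd, mean, count in second_rows:
        if cid in first_ids and (cid not in result or max_dd > result[cid][1]):
            result[cid] = (cid, max_dd, None, mean, count)
    # Companies only in first_df: keep their latest max_dd row from first_df.
    for cid, max_dd, min_dd, mean, count in first_rows:
        if cid not in second_ids and (cid not in result or max_dd > result[cid][1]):
            result[cid] = (cid, max_dd, min_dd, mean, count)
    return sorted(result.values(), key=lambda r: r[0])

for row in latest_per_company(first_df, second_df):
    print(row)
# ('A', '2019-04-03', None, 10, 100)
# ('B', '2019-04-09', '2019-04-02', 40, 400)
```

This only mirrors the stated rule on the sample data; the Spark versions may differ in details such as how the empty min_dd is represented.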

val acc_new_mean = // new mean literal
val acc_new_count = // new count literal
val resultDf = computed_df.join(accumulated_results_df.as("a"),
    $"a.company_id" === computed_df("company_id"),
    "leftOuter")
  // when/otherwise is evaluated per row, unlike a Scala if on the Column object
  .withColumn("new_mean", when(acc_new_mean.isNull, computed_df("mean")).otherwise(acc_new_mean))
  .withColumn("new_count", when(acc_new_count.isNull, computed_df("count")).otherwise(acc_new_count))
  .select(
    computed_df("company_id"),
    computed_df("max_dd"),
    col("new_mean").as("mean"),
    col("new_count").as("count")
  )
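The reason the Scala if in the question does not work is that acc_new_mean == null is evaluated once, while the query is being built, and there acc_new_mean is a Column object that is never null. when(...).otherwise(...) defers the null check to each row. Here is a minimal sketch of that per-row fallback in plain Python. merge_stats is a hypothetical helper, and the count-weighted mean is my assumption about what is intended; the question's own expression divides the sum of the two means by the sum of the counts.

```python
def merge_stats(new_mean, new_count, old_mean=None, old_count=None):
    """Combine a freshly computed (mean, count) with stored values, if any."""
    # No stored record for this company: fall back to the new values,
    # which is what when(x.isNull, ...).otherwise(...) does per row.
    if old_mean is None or old_count is None:
        return new_mean, new_count
    total = new_count + old_count
    # Assumption: a count-weighted mean is what is wanted here.
    merged_mean = (new_mean * new_count + old_mean * old_count) / total
    return merged_mean, total

print(merge_stats(10.0, 100))             # (10.0, 100)
print(merge_stats(10.0, 100, 20.0, 300))  # (17.5, 400)
```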

Timestamp formats and time zones in Spark (scala API)

******* UPDATE ********
As suggested in the comments I eliminated the irrelevant part of the code:
My requirements:
Unify number of milliseconds to 3
Transform string to timestamp and keep the value in UTC
Create dataframe:
val df = Seq("2018-09-02T05:05:03.456Z","2018-09-02T04:08:32.1Z","2018-09-02T05:05:45.65Z").toDF("Timestamp")
Here are the results using the spark shell:
************ END UPDATE *********************************
I am having a nice headache trying to deal with time zones and timestamp formats in Spark using scala.
This is a simplification of my script to explain my problem:
import org.apache.spark.sql.functions._
val jsonRDD = sc.wholeTextFiles("file:///data/home2/phernandez/vpp/Test_Message.json")
val jsonDF = spark.read.json(jsonRDD.map(f => f._2))
This is the resulting schema:
root
|-- MeasuredValues: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- MeasuredValue: double (nullable = true)
| | |-- Status: long (nullable = true)
| | |-- Timestamp: string (nullable = true)
Then I just select the Timestamp field as follows
jsonDF.select(explode($"MeasuredValues").as("Values")).select($"Values.Timestamp").show(5,false)
The first thing I want to fix is the number of millisecond digits in every timestamp, unifying it to three.
I applied the date_format as follows
jsonDF.select(explode($"MeasuredValues").as("Values")).select(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")).show(5,false)
Milliseconds format was fixed but timestamp is converted from UTC to local time.
To tackle this issue, I applied the to_utc_timestamp together with my local time zone.
jsonDF.select(explode($"MeasuredValues").as("Values")).select(to_utc_timestamp(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),"Europe/Berlin").as("Timestamp")).show(5,false)
Even worse, the UTC value is not returned, and the milliseconds format is lost.
Any ideas how to deal with this? I would appreciate it 😊
BR. Paul

The cause of the problem is the time format string used for conversion:
yyyy-MM-dd'T'HH:mm:ss.SSS'Z'
As you can see, Z is inside single quotes, which means it is not interpreted as the zone offset marker but as a literal character, like the T in the middle.
So, the format string should be changed to
yyyy-MM-dd'T'HH:mm:ss.SSSX
where X is the ISO 8601 time zone pattern letter of the standard Java date-time formatter (Z being the value for a zero offset).
Now, the source data can be converted to UTC timestamps:
val srcDF = Seq(
("2018-04-10T13:30:34.45Z"),
("2018-04-10T13:45:55.4Z"),
("2018-04-10T14:00:00.234Z"),
("2018-04-10T14:15:04.34Z"),
("2018-04-10T14:30:23.45Z")
).toDF("Timestamp")
val convertedDF = srcDF.select(to_utc_timestamp(date_format($"Timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSX"), "Europe/Berlin").as("converted"))
convertedDF.printSchema()
convertedDF.show(false)
/**
root
|-- converted: timestamp (nullable = true)
+-----------------------+
|converted |
+-----------------------+
|2018-04-10 13:30:34.45 |
|2018-04-10 13:45:55.4 |
|2018-04-10 14:00:00.234|
|2018-04-10 14:15:04.34 |
|2018-04-10 14:30:23.45 |
+-----------------------+
*/
If you need to convert the timestamps back to strings and normalize the values to three fractional digits (padding with trailing zeros), apply another date_format call, similar to the one you already used in the question.
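The difference between a quoted, literal Z and a real offset marker is not specific to Java. As a rough analogue, here is the same distinction with Python's datetime, where a literal Z in the format string yields a naive value and %z parses Z as a zero offset (Python 3.7+). normalize_utc is a hypothetical helper written for this illustration; it also pads the fraction to three digits, as required in the question.

```python
from datetime import datetime, timezone

def normalize_utc(ts: str) -> str:
    """Parse an ISO-like string ending in Z and re-emit it with 3 ms digits."""
    # %z treats the trailing Z as a UTC offset, so the result is timezone-aware.
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")
    utc = parsed.astimezone(timezone.utc)
    millis = utc.microsecond // 1000
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{millis:03d}Z"

# A literal Z in the pattern is just a character: no zone information is kept.
naive = datetime.strptime("2018-09-02T04:08:32.1Z", "%Y-%m-%dT%H:%M:%S.%fZ")
print(naive.tzinfo)  # None

for s in ["2018-09-02T05:05:03.456Z", "2018-09-02T04:08:32.1Z", "2018-09-02T05:05:45.65Z"]:
    print(normalize_utc(s))
# 2018-09-02T05:05:03.456Z
# 2018-09-02T04:08:32.100Z
# 2018-09-02T05:05:45.650Z
```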

PySpark - Get the size of each list in group by

I have a massive pyspark dataframe. I need to group by Person and then collect their Budget items into a list, to perform a further calculation.
As an example,
a = [('Bob', 562,"Food", "12 May 2018"), ('Bob',880,"Food","01 June 2018"), ('Bob',380,'Household'," 16 June 2018"), ('Sue',85,'Household'," 16 July 2018"), ('Sue',963,'Household'," 16 Sept 2018")]
df = spark.createDataFrame(a, ["Person", "Amount","Budget", "Date"])
Group By:
import pyspark.sql.functions as F
df_grouped = df.groupby('person').agg(F.collect_list("Budget").alias("data"))
Schema:
root
|-- person: string (nullable = true)
|-- data: array (nullable = true)
| |-- element: string (containsNull = true)
However, I am getting a memory error when I try to apply a UDF on each person. How can I get the size (in megabytes or gigabytes) of each list (data) for each person?
I have done the following, but I am getting nulls:
import sys
from pyspark.sql.types import DoubleType
size_list_udf = F.udf(lambda data: sys.getsizeof(data)/1000, DoubleType())
df_grouped = df_grouped.withColumn("size",size_list_udf("data") )
df_grouped.show()
Output:
+------+--------------------+----+
|person| data|size|
+------+--------------------+----+
| Sue|[Household, House...|null|
| Bob|[Food, Food, Hous...|null|
+------+--------------------+----+

You just have one minor issue with your code. sys.getsizeof() returns the size of an object in bytes as an integer. You're dividing this by the integer value 1000 to get kilobytes. In Python 2, integer division returns an integer, but you defined your udf to return DoubleType(), so Spark gets a value of the wrong type and yields null. The simple fix is to divide by 1000.0.
import sys
from pyspark.sql.types import DoubleType
size_list_udf = F.udf(lambda data: sys.getsizeof(data)/1000.0, DoubleType())
df_grouped = df_grouped.withColumn("size",size_list_udf("data") )
df_grouped.show(truncate=False)
#+------+-----------------------+-----+
#|person|data |size |
#+------+-----------------------+-----+
#|Sue |[Household, Household] |0.112|
#|Bob |[Food, Food, Household]|0.12 |
#+------+-----------------------+-----+
I have found that in cases where a udf is returning null, the culprit is very frequently a type mismatch.
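One caveat worth checking outside Spark: sys.getsizeof on a list only counts the list object itself, not the strings it refers to, so it understates the memory held by each group. A small sketch in plain Python follows; both function names are hypothetical, and the numbers depend on the Python build, so none are shown.

```python
import sys

def shallow_size_kb(data):
    # Size of the list container only; always a float, matching DoubleType.
    return sys.getsizeof(data) / 1000.0

def deep_size_kb(data):
    # Container plus each element; good enough for a flat list of strings.
    total = sys.getsizeof(data) + sum(sys.getsizeof(item) for item in data)
    return total / 1000.0

budget = ["Food", "Food", "Household"]
print(type(shallow_size_kb(budget)).__name__)  # float
print(deep_size_kb(budget) > shallow_size_kb(budget))  # True
```

Either function can be wrapped in F.udf(..., DoubleType()) in the same way as above. Note that both measure Python objects in the worker process, not the memory Spark uses on the JVM side.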

How to partition Spark RDD when importing Postgres using JDBC?

I am importing a Postgres database into Spark. I know that I can partition on import, but that requires that I have a numeric column (I don't want to use the value column because it's all over the place and doesn't maintain order):
df = spark.read.format('jdbc').options(url=url, dbtable='tableName', properties=properties).load()
df.printSchema()
root
|-- id: string (nullable = false)
|-- timestamp: timestamp (nullable = false)
|-- key: string (nullable = false)
|-- value: double (nullable = false)
Instead, I am converting the dataframe into an rdd (of enumerated tuples) and trying to partition that instead:
rdd = df.rdd.flatMap(lambda x: enumerate(x)).partitionBy(20)
Note that I used 20 because I have 5 workers with one core each in my cluster, and 5*4=20.
Unfortunately, the following command still takes forever to execute:
result = rdd.first()
Therefore I am wondering if my logic above makes sense? Am I doing anything wrong? From the web GUI, it looks like the workers are not being used:

Since you already know you can partition by a numeric column, this is probably what you should do. Here is the trick. First, let's find the minimum and maximum epoch:
url = ...
properties = ...
min_max_query = """(
    SELECT
        CAST(min(extract(epoch FROM timestamp)) AS bigint),
        CAST(max(extract(epoch FROM timestamp)) AS bigint)
    FROM tablename
) tmp"""
min_epoch, max_epoch = spark.read.jdbc(
    url=url, table=min_max_query, properties=properties
).first()
and use it to query the table:
numPartitions = ...
query = """(
    SELECT *, CAST(extract(epoch FROM timestamp) AS bigint) AS epoch
    FROM tablename) AS tmp"""
spark.read.jdbc(
    url=url, table=query,
    lowerBound=min_epoch, upperBound=max_epoch + 1,
    column="epoch", numPartitions=numPartitions, properties=properties
).drop("epoch")
Since this splits data into ranges of the same size it is relatively sensitive to data skew so you should use it with caution.
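To see why equal-width ranges are sensitive to skew, here is a rough model in plain Python of how a numeric range can be cut into one WHERE clause per partition. range_predicates is a hypothetical helper written for this illustration; it approximates the idea and is not Spark's actual implementation.

```python
def range_predicates(column, lower, upper, num_partitions):
    """Cut [lower, upper) into equal-width ranges, one predicate per partition."""
    if num_partitions < 2:
        raise ValueError("this sketch expects at least 2 partitions")
    stride = (upper - lower) // num_partitions
    # Inner boundaries between consecutive partitions.
    bounds = [lower + i * stride for i in range(1, num_partitions)]
    predicates = []
    for i in range(num_partitions):
        if i == 0:
            # The first range is open below and also collects NULLs.
            predicates.append(f"{column} < {bounds[0]} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # The last range is open above.
            predicates.append(f"{column} >= {bounds[-1]}")
        else:
            predicates.append(f"{column} >= {bounds[i - 1]} AND {column} < {bounds[i]}")
    return predicates

for p in range_predicates("epoch", 0, 100, 4):
    print(p)
# epoch < 25 OR epoch IS NULL
# epoch >= 25 AND epoch < 50
# epoch >= 50 AND epoch < 75
# epoch >= 75
```

Every range has the same width regardless of how many rows fall into it, so if most timestamps cluster in one period, one partition ends up with most of the data.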
You could also provide a list of disjoint predicates as the predicates argument.
predicates = [
    "id BETWEEN 'a' AND 'c'",
    "id BETWEEN 'd' AND 'g'",
    ...  # Continue to get full coverage and the desired number of predicates
]
spark.read.jdbc(
    url=url, table="tablename", properties=properties,
    predicates=predicates
)
The latter approach is much more flexible and can address certain issues with non-uniform data distribution but requires more knowledge about the data.
Using partitionBy fetches the data first and then performs a full shuffle to get the desired number of partitions, so it is relatively expensive.
