How to split columns with inconsistent data in spark - scala

I am trying to join two DataFrames and create a new column with the attribute values dynamically (or at least that is the goal).
I have to split the column from formulaTable into additional columns and then join the result with the attribute table.
However, I am not able to split the columns dynamically.
I have two questions, highlighted in the steps below.
Currently the data in my formulaTable looks like this.
val attributeFormulaDF = Seq("A0004*A0003","A0003*A0005").toDF("AttributeFormula")
So the data looks like:
+----------------+
|AttributeFormula|
+----------------+
|A0004*A0003 |
|A0003*A0005 |
+----------------+
The attribute data looks like this.
val attrValTransposedDF = Seq(
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY1_VALUE", "801"),
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY2_VALUE", "802"),
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY3_VALUE", "803"),
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY4_VALUE", "804"),
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY5_VALUE", "805"),
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY6_VALUE", "736"),
("2007", 201801, "DEL", "A0003", "NA", "ATTRIB_DAY7_VALUE", "1007"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY1_VALUE", "901"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY2_VALUE", "902"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY3_VALUE", "903"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY4_VALUE", "904"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY5_VALUE", "905"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY6_VALUE", "936"),
("2007", 201801, "DEL", "A0004", "NA", "ATTRIB_DAY7_VALUE", "9007"))
.toDF("Store_Number", "Attribute_Week_Number", "Department_Code", "Attribute_Code", "Attribute_General_Value", "Day", "Value")
.select("Attribute_Code", "Day", "Value")
So the data looks like:
+--------------+-----------------+-----+
|Attribute_Code|Day |Value|
+--------------+-----------------+-----+
|A0003 |ATTRIB_DAY1_VALUE|801 |
|A0003 |ATTRIB_DAY2_VALUE|802 |
|A0003 |ATTRIB_DAY3_VALUE|803 |
|A0003 |ATTRIB_DAY4_VALUE|804 |
|A0003 |ATTRIB_DAY5_VALUE|805 |
|A0003 |ATTRIB_DAY6_VALUE|736 |
|A0003 |ATTRIB_DAY7_VALUE|1007 |
|A0004 |ATTRIB_DAY1_VALUE|901 |
|A0004 |ATTRIB_DAY2_VALUE|902 |
|A0004 |ATTRIB_DAY3_VALUE|903 |
|A0004 |ATTRIB_DAY4_VALUE|904 |
|A0004 |ATTRIB_DAY5_VALUE|905 |
|A0004 |ATTRIB_DAY6_VALUE|936 |
|A0004 |ATTRIB_DAY7_VALUE|9007 |
+--------------+-----------------+-----+
Now I am splitting it on the * character:
val firstDF = attributeFormulaDF.select("AttributeFormula")
val rowVal = firstDF.first.mkString.split("\\*").length
val columnSeq = (0 until rowVal).map(i => col("temp").getItem(i).as(s"col$i"))
val newDFWithSplitColumn = firstDF.withColumn("temp", split(col("AttributeFormula"), "\\*"))
  .select(col("*") +: columnSeq: _*).drop("temp")
I referred to this Stack Overflow post: Split 1 column into 3 columns in spark scala.
So the split columns look like this:
+----------------+-----+-----+
|AttributeFormula|col0 |col1 |
+----------------+-----+-----+
|A0004*A0003 |A0004|A0003|
|A0003*A0005 |A0003|A0005|
+----------------+-----+-----+
Question 1: if my AttributeFormula (which is just a string) can contain any number of attributes, how will I split it dynamically?
For example:
+-----------------+
|AttributeFormula |
+-----------------+
|A0004 |
|A0004*A0003 |
|A0003*A0005 |
|A0003*A0004 |
|A0003*A0004*A0005|
+-----------------+
So I need something like this:
+-----------------+-----+-----+-----+
|AttributeFormula |col0 |col1 |col2 |
+-----------------+-----+-----+-----+
|A0004            |A0004|null |null |
|A0003*A0005      |A0003|A0005|null |
|A0003*A0004      |A0003|A0004|null |
|A0003*A0004*A0005|A0003|A0004|A0005|
+-----------------+-----+-----+-----+
Then I join the attributeFormula columns with the attribute values to get the formula-values column.
val joinColumnCondition = newDFWithSplitColumn.columns
  .withFilter(_.startsWith("col"))
  .map(col(_) === attrValTransposedDF("Attribute_Code"))

// using zipWithIndex to give each Value column a distinct name and to avoid ambiguous-column errors while joining
val dataFrameList = joinColumnCondition.zipWithIndex.map { i =>
  newDFWithSplitColumn.join(attrValTransposedDF, i._1)
    .withColumnRenamed("Value", s"Value${i._2}")
    .drop("Attribute_Code")
}
val combinedDataFrame = dataFrameList.reduce(_.join(_, Seq("Day","AttributeFormula"),"LEFT"))
val toBeConcatColumn = combinedDataFrame.columns.filter(_.startsWith("Value"))
combinedDataFrame
  .withColumn("AttributeFormulaValues", concat_ws("*", toBeConcatColumn.map(c => col(c)): _*))
  .select("Day", "AttributeFormula", "AttributeFormulaValues")
So my final output looks like this.
+-----------------+----------------+----------------------+
|Day |AttributeFormula|AttributeFormulaValues|
+-----------------+----------------+----------------------+
|ATTRIB_DAY7_VALUE|A0004*A0003 |9007*1007 |
|ATTRIB_DAY6_VALUE|A0004*A0003 |936*736 |
|ATTRIB_DAY5_VALUE|A0004*A0003 |905*805 |
|ATTRIB_DAY4_VALUE|A0004*A0003 |904*804 |
|ATTRIB_DAY3_VALUE|A0004*A0003 |903*803 |
|ATTRIB_DAY2_VALUE|A0004*A0003 |902*802 |
|ATTRIB_DAY1_VALUE|A0004*A0003 |901*801 |
|ATTRIB_DAY7_VALUE|A0003 |1007 |
|ATTRIB_DAY6_VALUE|A0003 |736 |
|ATTRIB_DAY5_VALUE|A0003 |805 |
|ATTRIB_DAY4_VALUE|A0003 |804 |
|ATTRIB_DAY3_VALUE|A0003 |803 |
|ATTRIB_DAY2_VALUE|A0003 |802 |
|ATTRIB_DAY1_VALUE|A0003 |801 |
+-----------------+----------------+----------------------+
This code works fine only if I have a fixed AttributeFormula (i.e. this relates to question 1).
Question 2: how can I avoid using the list of DataFrames and the reduce function?

For Question 1, here is a possible solution.
Given that you have a dataframe with formulas:
val attributeFormulaDF = Seq("A0004*A0003","A0003*A0005", "A0003*A0004*A0005").toDF("formula")
You can split it to form an array:
val splitFormula = attributeFormulaDF.select(col("formula"), split(col("formula"), "\\*").as("split"))
After that, select the maximum array size:
val maxSize = splitFormula.select(max(size(col("split")))).first().getInt(0)
Now the interesting part: based on the max size you can generate the columns, each one associated with the corresponding element of the array:
val enhancedFormula = (0 until maxSize).foldLeft(splitFormula) { (df, i) =>
  df.withColumn(s"col_$i", expr(s"split[$i]"))
}
Here is the output
+-----------------+--------------------+-----+-----+-----+
| formula| split|col_0|col_1|col_2|
+-----------------+--------------------+-----+-----+-----+
| A0004*A0003| [A0004, A0003]|A0004|A0003| null|
| A0003*A0005| [A0003, A0005]|A0003|A0005| null|
|A0003*A0004*A0005|[A0003, A0004, A0...|A0003|A0004|A0005|
+-----------------+--------------------+-----+-----+-----+
I think this can easily be used for question 2 as well; see the sketch below.
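For question 2, a possible direction (a sketch I have not run, reusing the splitFormula DataFrame from above and the attrValTransposedDF from the question): explode each formula into one row per attribute code with posexplode, join the attribute values once, and rebuild the value string per formula and day. This avoids the list of DataFrames and the reduce over joins entirely.
import org.apache.spark.sql.functions._

// One row per (formula, attribute code), keeping the position so the values
// can be re-assembled in formula order.
val exploded = splitFormula
  .select(col("formula"), posexplode(col("split")).as(Seq("pos", "Attribute_Code")))

// Single join against the attribute values, then rebuild the value string per day.
val formulaValuesDF = exploded
  .join(attrValTransposedDF, Seq("Attribute_Code"))
  .groupBy("formula", "Day")
  .agg(sort_array(collect_list(struct(col("pos"), col("Value")))).as("vals"))
  .withColumn("AttributeFormulaValues", concat_ws("*", col("vals.Value")))
  .select(col("Day"), col("formula").as("AttributeFormula"), col("AttributeFormulaValues"))
Note that the inner join drops attribute codes that have no values (e.g. A0005 in the sample data); switch the join type if you need the partial results of the original LEFT reduce.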


Pyspark group by collect list, to_json and pivot

Summary: Combining multiple rows to columns for a user
Input DF:
+---+--------+----+----+----+----+
| Id|   group|  A1|  A2|  B1|  B2|
+---+--------+----+----+----+----+
|  1|   Alpha|   1|   2|null|null|
|  1|AlphaNew|   6|   8|null|null|
|  2|   Alpha|   7|   4|null|null|
|  2|    Beta|null|null|   3|   9|
+---+--------+----+----+----+----+
Note: The group values are dynamic
Expected Output DF:
+---+--------+--------+-----------+-----------+-------+-------+
| Id|Alpha_A1|Alpha_A2|AlphaNew_A1|AlphaNew_A2|Beta_B1|Beta_B2|
+---+--------+--------+-----------+-----------+-------+-------+
|  1|       1|       2|          6|          8|   null|   null|
|  2|       7|       4|       null|       null|      3|      9|
+---+--------+--------+-----------+-----------+-------+-------+
Attempted Solution:
I thought of making a JSON of the non-null columns for each row, then a group by and collect_list of maps. Then I can explode the JSON to get the expected output.
But I am stuck at the nested-JSON stage. Here is my code:
import pyspark.sql.functions as F

vcols = df.columns[2:]
df\
    .withColumn('json', F.to_json(F.struct(*vcols)))\
    .groupby('id')\
    .agg(
        F.to_json(
            F.collect_list(
                F.create_map('group', 'json')
            )
        ).alias('json')
    )
+---+-------------------------------------------------+
| Id|json                                             |
+---+-------------------------------------------------+
|  1|[{Alpha: {A1:1, A2:2}}, {AlphaNew: {A1:6, A2:8}}]|
|  2|[{Alpha: {A1:7, A2:4}}, {Beta: {B1:3, B2:9}}]    |
+---+-------------------------------------------------+
What I am trying to get:
+---+--------------------------------------------------------+
| Id|json                                                    |
+---+--------------------------------------------------------+
|  1|[{Alpha_A1:1, Alpha_A2:2, AlphaNew_A1:6, AlphaNew_A2:8}]|
|  2|[{Alpha_A1:7, Alpha_A2:4, Beta_B1:3, Beta_B2:9}]        |
+---+--------------------------------------------------------+
I'd appreciate any help. I'm also trying to avoid UDFs, as my real DataFrame is quite large.
There's definitely a better way to do this, but I continued your to_json experiment.
Using UDFs:
After you get something like [{Alpha: {A1:1, A2:2}}, {AlphaNew: {A1:6, A2:8}}] you could create a UDF to flatten the dict. But since it's a JSON string you'll have to parse it to dict and then back again to JSON.
After that you would like to explode and pivot the table, but that's not possible with JSON strings, so you have to use F.from_json with a defined schema. That will give you a MapType which you can explode and pivot.
Here's an example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from collections.abc import MutableMapping  # 'from collections import MutableMapping' on Python < 3.3
import json
from pyspark.sql.types import (
    ArrayType,
    IntegerType,
    MapType,
    StringType,
)


def flatten_dict(d, parent_key="", sep="_"):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        if isinstance(v, MutableMapping):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


def flatten_groups(data):
    result = []
    for item in json.loads(data):
        result.append(flatten_dict(item))
    return json.dumps(result)


if __name__ == "__main__":
    spark = SparkSession.builder.master("local").appName("Test").getOrCreate()

    data = [
        (1, "Alpha", 1, 2, None, None),
        (1, "AlphaNew", 6, 8, None, None),
        (2, "Alpha", 7, 4, None, None),
        (2, "Beta", None, None, 3, 9),
    ]
    columns = ["Id", "group", "A1", "A2", "B1", "B2"]
    df = spark.createDataFrame(data, columns)

    vcols = df.columns[2:]
    df = (
        df.withColumn("json", F.struct(*vcols))
        .groupby("id")
        .agg(F.to_json(F.collect_list(F.create_map("group", "json"))).alias("json"))
    )

    # Flatten groups
    flatten_groups_udf = F.udf(lambda x: flatten_groups(x))
    schema = ArrayType(MapType(StringType(), IntegerType()))
    df = df.withColumn("json", F.from_json(flatten_groups_udf(F.col("json")), schema))

    # Explode and pivot
    df = df.select(F.col("id"), F.explode(F.col("json")).alias("json"))
    df = (
        df.select("id", F.explode("json"))
        .groupby("id")
        .pivot("key")
        .agg(F.first("value"))
    )
At the end the dataframe looks like:
+---+-----------+-----------+--------+--------+-------+-------+
|id |AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
|1 |6 |8 |1 |2 |null |null |
|2 |null |null |7 |4 |3 |9 |
+---+-----------+-----------+--------+--------+-------+-------+
Without UDFs:
vcols = df.columns[2:]
df = (
    df.withColumn("json", F.to_json(F.struct(*vcols)))
    .groupby("id")
    .agg(
        F.collect_list(
            F.create_map(
                "group", F.from_json("json", MapType(StringType(), IntegerType()))
            )
        ).alias("json")
    )
)
df = df.withColumn("json", F.explode(F.col("json")).alias("json"))
df = df.select("id", F.explode(F.col("json")).alias("root", "value"))
df = df.select("id", "root", F.explode(F.col("value")).alias("sub", "value"))
df = df.select(
    "id", F.concat(F.col("root"), F.lit("_"), F.col("sub")).alias("name"), "value"
)
df = df.groupBy(F.col("id")).pivot("name").agg(F.first("value"))
Result:
+---+-----------+-----------+--------+--------+-------+-------+
|id |AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
|1 |6 |8 |1 |2 |null |null |
|2 |null |null |7 |4 |3 |9 |
+---+-----------+-----------+--------+--------+-------+-------+
I found a slightly better way than the json approach:
Stack the input dataframe's value columns A1, A2, B1, B2, ... as rows.
The structure then looks like id, group, sub, value, where sub holds the column name (A1, A2, B1, B2) and value holds the associated value.
Filter out the rows that have value as null.
Now we are able to pivot by group. Since the null-value rows are removed, we won't have the initial issue of the pivot creating extra columns.
import pyspark.sql.functions as F

data = [
    (1, "Alpha", 1, 2, None, None),
    (1, "AlphaNew", 6, 8, None, None),
    (2, "Alpha", 7, 4, None, None),
    (2, "Beta", None, None, 3, 9),
]
columns = ["id", "group", "A1", "A2", "B1", "B2"]
df = spark.createDataFrame(data, columns)

# Value columns that need to be stacked
vcols = df.columns[2:]
expr_str = ', '.join([f"'{i}', {i}" for i in vcols])
expr_str = f"stack({len(vcols)}, {expr_str}) as (sub, value)"

df = df\
    .selectExpr("id", "group", expr_str)\
    .filter(F.col("value").isNotNull())\
    .select("id", F.concat("group", F.lit("_"), "sub").alias("group"), "value")\
    .groupBy("id")\
    .pivot("group")\
    .agg(F.first("value"))

df.show()
df.show()
Result:
+---+-----------+-----------+--------+--------+-------+-------+
| id|AlphaNew_A1|AlphaNew_A2|Alpha_A1|Alpha_A2|Beta_B1|Beta_B2|
+---+-----------+-----------+--------+--------+-------+-------+
| 1| 6| 8| 1| 2| null| null|
| 2| null| null| 7| 4| 3| 9|
+---+-----------+-----------+--------+--------+-------+-------+

Spark: Row filter based on Column value

I have millions of rows as dataframe like this:
val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE")).toDF("id", "status")
scala> df.show(false)
+---+--------+
|id |status |
+---+--------+
|id1|ACTIVE |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE |
|id3|INACTIVE|
|id3|INACTIVE|
+---+--------+
Now I want to divide this data into three separate DataFrames, like this:
Only ACTIVE ids (like id2), say activeDF
Only INACTIVE ids (like id3), say inactiveDF
Having both ACTIVE and INACTIVE as status, say bothDF
How can I calculate activeDF and inactiveDF?
I know that bothDF can be calculated like
df.select("id").distinct.except(activeDF).except(inactiveDF)
but this will involve shuffling (as the 'distinct' operation requires one). Is there any better way to calculate bothDF?
Versions:
Spark : 2.2.1
Scala : 2.11
The most elegant solution is to pivot on status
val counts = df
  .groupBy("id")
  .pivot("status", Seq("ACTIVE", "INACTIVE"))
  .count
or equivalent direct agg
val counts = df
  .groupBy("id")
  .agg(
    count(when($"status" === "ACTIVE", true)) as "ACTIVE",
    count(when($"status" === "INACTIVE", true)) as "INACTIVE"
  )
followed by a simple CASE ... WHEN:
val result = counts.withColumn(
  "status",
  when($"ACTIVE" === 0, "INACTIVE")
    .when($"INACTIVE" === 0, "ACTIVE")
    .otherwise("BOTH")
)
result.show
+---+------+--------+--------+
| id|ACTIVE|INACTIVE| status|
+---+------+--------+--------+
|id3| 0| 2|INACTIVE|
|id1| 1| 2| BOTH|
|id2| 1| 0| ACTIVE|
+---+------+--------+--------+
Later you can separate the result with filters or dump it to disk with a source that supports partitionBy (How to split a dataframe into dataframes with same column values?).
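For example, a minimal sketch of that filter step (activeDF, inactiveDF and bothDF are the names from the question):
// Separate the labelled result into the three requested DataFrames.
val activeDF   = result.filter($"status" === "ACTIVE").select("id")
val inactiveDF = result.filter($"status" === "INACTIVE").select("id")
val bothDF     = result.filter($"status" === "BOTH").select("id")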
Just another way: groupBy, collect the statuses as a set, and then if the size of the set is 1 the id is ACTIVE-only or INACTIVE-only, otherwise it has both.
scala> val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE"), ("id4", "ACTIVE"), ("id5", "ACTIVE"), ("id6", "INACTIVE"), ("id7", "ACTIVE"), ("id7", "INACTIVE")).toDF("id", "status")
df: org.apache.spark.sql.DataFrame = [id: string, status: string]
scala> df.show(false)
+---+--------+
|id |status |
+---+--------+
|id1|ACTIVE |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE |
|id3|INACTIVE|
|id3|INACTIVE|
|id4|ACTIVE |
|id5|ACTIVE |
|id6|INACTIVE|
|id7|ACTIVE |
|id7|INACTIVE|
+---+--------+
scala> val allstatusDF = df.groupBy("id").agg(collect_set("status") as "allstatus")
allstatusDF: org.apache.spark.sql.DataFrame = [id: string, allstatus: array<string>]
scala> allstatusDF.show(false)
+---+------------------+
|id |allstatus |
+---+------------------+
|id7|[ACTIVE, INACTIVE]|
|id3|[INACTIVE] |
|id5|[ACTIVE] |
|id6|[INACTIVE] |
|id1|[ACTIVE, INACTIVE]|
|id2|[ACTIVE] |
|id4|[ACTIVE] |
+---+------------------+
scala> allstatusDF.withColumn("status", when(size($"allstatus") === 1, $"allstatus".getItem(0)).otherwise("BOTH")).show(false)
+---+------------------+--------+
|id |allstatus |status |
+---+------------------+--------+
|id7|[ACTIVE, INACTIVE]|BOTH |
|id3|[INACTIVE] |INACTIVE|
|id5|[ACTIVE] |ACTIVE |
|id6|[INACTIVE] |INACTIVE|
|id1|[ACTIVE, INACTIVE]|BOTH |
|id2|[ACTIVE] |ACTIVE |
|id4|[ACTIVE] |ACTIVE |
+---+------------------+--------+

Get the number of null per row in PySpark dataframe

This is probably a duplicate, but somehow I have been searching for a long time already:
I want to get the number of nulls per Row in a Spark dataframe. I.e.
col1 col2 col3
null 1 a
1 2 b
2 3 null
Should in the end be:
col1 col2 col3 number_of_null
null 1 a 1
1 2 b 0
2 3 null 1
In a general fashion, I want to get the number of times a certain string or number appears in a spark dataframe row.
I.e.
col1 col2 col3 number_of_ABC
ABC 1 a 1
1 2 b 0
2 ABC ABC 2
I am using Pyspark 2.3.0 and prefer a solution that does not involve SQL syntax. For some reason, I seem not to be able to google this. :/
EDIT: Assume that I have so many columns that I can't list them all.
EDIT2: I explicitly don't want a pandas solution.
EDIT3: The solution explained with sums or means does not work as it throws errors:
(data type mismatch: differing types in '((`log_time` IS NULL) + 0)' (boolean and int))
...
isnull(log_time#10) + 0) + isnull(log#11))
In Scala:
val df = List(
  ("ABC", "1", "a"),
  ("1", "2", "b"),
  ("2", "ABC", "ABC")
).toDF("col1", "col2", "col3")

val expected = "ABC"
val complexColumn: Column = df.schema.fieldNames
  .map(c => when(col(c) === lit(expected), 1).otherwise(0))
  .reduce((a, b) => a + b)

df.withColumn("countABC", complexColumn).show(false)
Output:
+----+----+----+--------+
|col1|col2|col3|countABC|
+----+----+----+--------+
|ABC |1 |a |1 |
|1 |2 |b |0 |
|2 |ABC |ABC |2 |
+----+----+----+--------+
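The same fold also covers the null-count part of the question; here is a small sketch of my own, using a toy DataFrame that actually contains nulls:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when}

val dfWithNulls = List(
  (Some(1), Some("a")),
  (None,    Some("b")),
  (Some(2), None)
).toDF("col1", "col2")

// 1 for each null cell, summed across all columns of the row
val nullCountColumn: Column = dfWithNulls.schema.fieldNames
  .map(c => when(col(c).isNull, 1).otherwise(0))
  .reduce(_ + _)

dfWithNulls.withColumn("number_of_null", nullCountColumn).show(false)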
As stated in pasha701's answer, I resort to map and reduce. Note that I am working on Spark 1.6.x and Python 2.7
Taking your DataFrame as df (and as is)
import pyspark.sql.functions as func  # aliased as 'func' below

dfvals = [
    (None, "1", "a"),
    ("1", "2", "b"),
    ("2", None, None)
]
df = sqlc.createDataFrame(dfvals, ['col1', 'col2', 'col3'])

new_df = df.withColumn('null_cnt',
                       reduce(lambda x, y: x + y,
                              map(lambda x: func.when(func.isnull(func.col(x)) == 'true', 1).otherwise(0),
                                  df.schema.names)))
Check if the value is Null and assign 1 or 0. Add the result to get the count.
new_df.show()
+----+----+----+--------+
|col1|col2|col3|null_cnt|
+----+----+----+--------+
|null| 1| a| 1|
| 1| 2| b| 0|
| 2|null|null| 2|
+----+----+----+--------+

How to join multiple columns from one DataFrame with another DataFrame

I have two DataFrames, recommendations and movies. The columns rec1-rec3 in recommendations hold movie ids from the movies DataFrame.
val recommendations: DataFrame = List(
  (0, 1, 2, 3),
  (1, 2, 3, 4),
  (2, 1, 3, 4)).toDF("id", "rec1", "rec2", "rec3")

val movies = List(
  (1, "the Lord of the Rings"),
  (2, "Star Wars"),
  (3, "Star Trek"),
  (4, "Pulp Fiction")).toDF("id", "name")
What I want:
+---+---------------------+------------+------------+
| id|                 rec1|        rec2|        rec3|
+---+---------------------+------------+------------+
|  0|the Lord of the Rings|   Star Wars|   Star Trek|
|  1|            Star Wars|   Star Trek|Pulp Fiction|
|  2|the Lord of the Rings|   Star Trek|Pulp Fiction|
+---+---------------------+------------+------------+
We can also use the functions stack() and pivot() to arrive at your expected output, joining the two dataframes only once.
// First rename 'id' column to 'ids' avoid duplicate names further downstream
val moviesRenamed = movies.withColumnRenamed("id", "ids")
recommendations.select($"id", expr("stack(3, 'rec1', rec1, 'rec2', rec2, 'rec3', rec3) as (rec, movie_id)"))
  .where("rec is not null")
  .join(moviesRenamed, col("movie_id") === moviesRenamed.col("ids"))
  .groupBy("id")
  .pivot("rec")
  .agg(first("name"))
  .show()
+---+--------------------+---------+------------+
| id| rec1| rec2| rec3|
+---+--------------------+---------+------------+
| 0|the Lord of the R...|Star Wars| Star Trek|
| 1| Star Wars|Star Trek|Pulp Fiction|
| 2|the Lord of the R...|Star Trek|Pulp Fiction|
+---+--------------------+---------+------------+
I figured it out. You should create aliases for your columns just like in SQL.
val joined = recommendations
  .join(movies.select(col("id").as("id1"), 'name.as("n1")), 'id1 === recommendations.col("rec1"))
  .join(movies.select(col("id").as("id2"), 'name.as("n2")), 'id2 === recommendations.col("rec2"))
  .join(movies.select(col("id").as("id3"), 'name.as("n3")), 'id3 === recommendations.col("rec3"))
  .select('id, 'n1, 'n2, 'n3)
joined.show()
The query results in:
+---+--------------------+---------+------------+
| id| n1| n2| n3|
+---+--------------------+---------+------------+
| 0|the Lord of the R...|Star Wars| Star Trek|
| 1| Star Wars|Star Trek|Pulp Fiction|
| 2|the Lord of the R...|Star Trek|Pulp Fiction|
+---+--------------------+---------+------------+

Spark: create a sessionId based on timestamp

I'd like to do the following transformation. Given a DataFrame that records whether a user is logged in, my aim is to create a sessionId for each record based on the timestamp and a predefined value TIMEOUT = 20.
A session period is defined as: [first record --> first record + TIMEOUT]
For instance, the original DataFrame would look like the following:
scala> val df = sc.parallelize(List(
("user1",0),
("user1",3),
("user1",15),
("user1",22),
("user1",28),
("user1",41),
("user1",45),
("user1",85),
("user1",90)
)).toDF("user_id","timestamp")
df: org.apache.spark.sql.DataFrame = [user_id: string, timestamp: int]
+-------+---------+
|user_id|timestamp|
+-------+---------+
|user1 |0 |
|user1 |3 |
|user1 |15 |
|user1 |22 |
|user1 |28 |
|user1 |41 |
|user1 |45 |
|user1 |85 |
|user1 |90 |
+-------+---------+
The goal is:
+-------+---------+----------+
|user_id|timestamp|session_id|
+-------+---------+----------+
|user1 |0 | 0 |-> first record (session 0: period [0->20])
|user1 |3 | 0 |
|user1 |15 | 0 |
|user1 |22 | 1 |-> 22 not in [0->20]->new session(period 22->42)
|user1 |28 | 1 |
|user1 |41 | 1 |
|user1 |45 | 2 |-> 45 not in [22->42]->newsession(period 45->65)
|user1 |85 | 3 |
|user1 |90 | 3 |
+-------+---------+----------+
Is there any elegant solution to this problem, preferably in Scala?
Thanks in advance!
This may not be an elegant solution, but it works for the given data format.
sc.parallelize(List(
  ("user1", 0),
  ("user1", 3),
  ("user1", 15),
  ("user1", 22),
  ("user1", 28),
  ("user1", 41),
  ("user1", 45),
  ("user1", 85),
  ("user1", 90))).toDF("user_id", "timestamp").map { x =>
  val userId = x.getAs[String]("user_id")
  val timestamp = x.getAs[Int]("timestamp")
  val session = timestamp / 20
  (userId, timestamp, session)
}.toDF("user_id", "timestamp", "session").show()
Result
You can change timestamp / 20 according to your need.
Please see my code. There are two issues here:
1. I don't think the performance is good.
2. I use "user_id" to join; if this doesn't meet your requirement, you can add a new column with the same value to both timeSetFrame and newSessionSec.
import scala.util.control.Breaks

var newSession = ss.sparkContext.parallelize(List(
    ("user1", 0), ("user1", 3), ("user1", 15), ("user1", 22),
    ("user1", 28), ("user1", 41), ("user1", 45), ("user1", 85),
    ("user1", 90))).zipWithIndex().toDF("tmp", "index")

val getUser_id = udf((s: Row) => {
  s.getString(0)
})

val gettimestamp = udf((s: Row) => {
  s.getInt(1)
})

val newSessionSec = newSession.withColumn("user_id", getUser_id($"tmp"))
  .withColumn("timestamp", gettimestamp($"tmp")).drop("tmp") //.show()

val timeSet: Array[Int] = newSessionSec.select("timestamp").collect().map(s => s.getInt(0))
val timeSetFrame = ss.sparkContext.parallelize(Seq(("user1", timeSet))).toDF("user_id", "tset")
val newSessionThird = newSessionSec.join(timeSetFrame, Seq("user_id"), "outer") // .show

val getSessionID = udf((ts: Int, aa: Seq[Int]) => {
  var result = 0
  var begin = 0
  val loop = new Breaks
  loop.breakable {
    for (time <- aa) {
      if (time > (begin + 20)) {
        begin = time
        result += 1
      }
      if (time == ts) {
        loop.break
      }
    }
  }
  result
})

newSessionThird.withColumn("sessionID", getSessionID($"timestamp", $"tset")).drop("tset", "index").show()
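As an alternative sketch (my own, not from the answers above): because each session is anchored at the first record of the session, the assignment can be expressed per user with a typed groupByKey / flatMapGroups, which avoids collecting the timestamps into an array column and joining it back. It assumes spark.implicits._ is in scope and df is the DataFrame from the question.
case class Event(user_id: String, timestamp: Int)
case class SessionedEvent(user_id: String, timestamp: Int, session_id: Int)

val TIMEOUT = 20

val sessioned = df.as[Event]
  .groupByKey(_.user_id)
  .flatMapGroups { (user, events) =>
    // Sort this user's events, then open a new session whenever a timestamp
    // falls after sessionStart + TIMEOUT.
    val sorted = events.toList.sortBy(_.timestamp)
    var sessionId = -1
    var sessionStart = Int.MinValue
    sorted.map { e =>
      if (sessionId < 0 || e.timestamp > sessionStart + TIMEOUT) {
        sessionId += 1
        sessionStart = e.timestamp
      }
      SessionedEvent(e.user_id, e.timestamp, sessionId)
    }
  }

sessioned.show(false)
Unlike the timestamp / 20 answer, the session boundaries here follow the first record of each session, which matches the expected output in the question (e.g. 85 and 90 end up in session 3).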