Get value from first lead row that has a different value - pyspark

I have a list of ids, a sequence number of messages (seq) and a value (e.g. timestamps). Multiple rows can have the same sequence number. There are some other columns with different values in every row, but I excluded them as they are not important.
Within all messages from a deviceId (=partitionBy), I need to sort by sequence_number (=orderBy) and add the 'ts'-value of the next message with a different sequence_number to all messages of the current sequence_number.
I got so far as to retrieve the value of the next row if that row has a different sequence number. But since the "next row with a different sequence number" could potentially be x rows away, I would have to add a specific .when(condition, ...) block for each of those x rows ahead.
I was wondering if there is a better solution which works no matter how "far away" the next row with a different sequence number is. I tried a .otherwise(lead(col("next_value"), 1)), but since I am just building the column, it doesn't work.
My Code & reproducible example:
from pyspark.sql.functions import col, lead, when
from pyspark.sql.window import Window

data = [
    (1, 1, "A"),
    (2, 1, "G"),
    (2, 2, "F"),
    (3, 1, "A"),
    (4, 1, "A"),
    (4, 2, "B"),
    (4, 3, "C"),
    (4, 3, "C"),
    (4, 3, "C"),
    (4, 4, "D")
]
df = spark.createDataFrame(data=data, schema=["id", "seq", "ts"])
df.printSchema()
df.show(10, False)
window = Window \
    .orderBy("id", "seq") \
    .partitionBy("id")

# I could potentially do this 100x if the next lead-value is 100 rows away,
# but I wonder if there isn't a better solution.
is_different_seq1 = lead(col("seq"), 1).over(window) != col("seq")
is_different_seq2 = lead(col("seq"), 2).over(window) != col("seq")

df = df.withColumn("lead_value",
                   when(is_different_seq1,
                        lead(col("ts"), 1).over(window))
                   .when(is_different_seq2,
                        lead(col("ts"), 2).over(window)))
df.printSchema()
df.show(10, False)
Ideal output in column "next_value" for id=4:
id  seq  ts  next_value
4   1    A   B
4   2    B   C
4   3    C   D
4   3    C   D
4   3    C   D
4   4    D   Null

I haven't tried the more complicated cases, so this might still need more adjustment, but I think you can combine it with the last function.
With just the lead function, the result looks like this:
id  seq  ts  lead_value
4   1    A   B
4   2    B   C
4   3    C   C
4   3    C   C
4   3    C   D
4   4    D   Null
You want to overwrite the lead_value of the 3rd and 4th rows with "D", which is the last lead_value within the same id & seq group.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

lead_window = (Window
               .partitionBy("id")
               .orderBy("seq"))
last_window = (Window
               .partitionBy("id", "seq")
               .rowsBetween(0, Window.unboundedFollowing))

df = df.withColumn("next_value",
                   F.last(
                       F.lead(F.col("ts")).over(lead_window)
                   ).over(last_window))
Result:
id  seq  ts  next_value
4   1    A   B
4   2    B   C
4   3    C   D
4   3    C   D
4   3    C   D
4   4    D   Null

I found a solution (horribly slow, however), so if someone comes up with a better one, please add your answer!
I reduce the dataframe to one row per "message" with a distinct, execute the lead(1) there, and join it back to the original dataframe to get the rest of the columns.
df_filtered = df.select("id", "seq", "ts").distinct()
df_filtered = df_filtered.withColumn("lead_value", lead(col("ts"), 1).over(window))
df = df.join(df_filtered, on=["id", "seq", "ts"])

Related

Create item-item Interaction Matrix in PySpark

I have a dataset containing two columns, user_id and item_id. The DataFrame looks like this:
index user_id item_id
0 user1 A
1 user1 B
2 user2 A
3 user3 B
4 user4 C
I'm looking for a way to transform this table into an item-item interaction matrix where we have distinct intersection of common users between items:
A B C
A 2 1 0
B 1 2 0
C 0 0 1
And another item-item interaction matrix where we have distinct union of users between items:
A B C
A 2 3 3
B 3 2 3
C 3 3 1
Step 0. Define the dataframe
import pyspark.sql.functions as F
data = [(0, "user1", "A"),
        (1, "user1", "B"),
        (2, "user2", "A"),
        (3, "user3", "B"),
        (4, "user4", "C")]
df = spark.createDataFrame(data, schema=["index", "user_id", "item_id"])
Step 1. Collect user data for each item in df_collect
df_collect = (df
              .select("user_id", "item_id")
              .groupBy("item_id")
              .agg(F.collect_set("user_id").alias("users")))
Step 2. Cross join df_collect with itself to get all item-item combinations
df_crossjoin = (df_collect
                .crossJoin(df_collect
                           .withColumnRenamed("item_id", "item_y")
                           .withColumnRenamed("users", "users_y")))
Step 3. Find the user union and intersection and their counts
df_ui = (df_crossjoin
         .withColumn("users_union",
                     F.size(F.array_union("users", "users_y")))
         .withColumn("users_intersect",
                     F.size(F.array_intersect("users", "users_y"))))
Step 4. Pivot to get the item-item matrices
df_matrix_union = (df_ui
                   .groupBy("item_id")
                   .pivot("item_y")
                   .agg(F.first("users_union"))
                   .orderBy("item_id"))

df_matrix_intrsct = (df_ui
                     .groupBy("item_id")
                     .pivot("item_y")
                     .agg(F.first("users_intersect"))
                     .orderBy("item_id"))

Spark Dataframe from all combinations of Array column

Assume I have a Spark DataFrame d1 with two columns, elements_1 and elements_2, that contain sets of integers of size k, and value_1, value_2 that contain an integer value. For example, with k = 3:
d1 =
+------------+------------+
| elements_1 | elements_2 |
+------------+------------+
| (1, 4, 3)  | (3, 4, 5)  |
| (2, 1, 3)  | (1, 0, 2)  |
| (4, 3, 1)  | (3, 5, 6)  |
+------------+------------+
I need to create a new column, combinations, that contains, for each pair of sets elements_1 and elements_2, a list of the sets built from all possible combinations of their elements. These sets must have the following properties:
Their size must be k+1
They must contain either the set in elements_1 or the set in elements_2
For example, from (1, 2, 3) and (3, 4, 5) we obtain [(1, 2, 3, 4), (1, 2, 3, 5), (3, 4, 5, 1), (3, 4, 5, 2)]. The list does not contain (1, 2, 5) because its length is not 3+1, and it does not contain (1, 2, 4, 5) because it contains neither of the original sets.
You need to create a custom user-defined function to perform the transformation, create a spark-compatible UserDefinedFunction from it, then apply using withColumn. So really, there are two questions here: (1) how to do the set transformation you described, and (2) how to create a new column in a DataFrame using a user-defined function.
Here's a first shot at the set logic, let me know if it does what you're looking for:
def combo[A](a: Set[A], b: Set[A]): Set[Set[A]] =
  a.diff(b).map(b + _) ++ b.diff(a).map(a + _)
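As a quick check against the example in the question (my own worked example, not part of the original answer; element order inside the sets may differ when printed):
// combo(Set(1,2,3), Set(3,4,5)):
//   a.diff(b) = {1, 2}  ->  adds each to b: {3,4,5,1}, {3,4,5,2}
//   b.diff(a) = {4, 5}  ->  adds each to a: {1,2,3,4}, {1,2,3,5}
combo(Set(1, 2, 3), Set(3, 4, 5))
// Set(Set(3, 4, 5, 1), Set(3, 4, 5, 2), Set(1, 2, 3, 4), Set(1, 2, 3, 5))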
Now create the UDF wrapper. Note that under the hood these sets are all represented by WrappedArrays, so we need to handle this. There's probably a more elegant way to deal with this by defining some implicit conversions, but this should work:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}

val comboWrap: (WrappedArray[Int], WrappedArray[Int]) => Array[Array[Int]] =
  (x, y) => combo(x.toSet, y.toSet).map(_.toArray).toArray
val comboUDF = udf(comboWrap)
Finally, apply it to the DataFrame by creating a new column:
val data = Seq((Set(1, 2, 3), Set(3, 4, 5))).toDF("elements_1", "elements_2")
val result = data.withColumn("result",
  comboUDF(col("elements_1"), col("elements_2")))
result.show

How to filter a few rows in a table using Scala

Using Scala:
I have a emp table as below
id, name, dept, address
1, a, 10, hyd
2, b, 10, blr
3, a, 5, chn
4, d, 2, hyd
5, a, 3, blr
6, b, 2, hyd
Code:
val inputFile = sc.textFile("hdfs:/user/edu/emp.txt")
val inputRdd = inputFile.map(iLine => (iLine.split(",")(0),
                                       iLine.split(",")(1),
                                       iLine.split(",")(3)))
// selecting only a few columns; now I want to pull the complete data of the employees whose address is hyd
Problem: I don't want to print all emp details, I want to print only the details of the employees who are from hyd.
I have loaded this emp dataset into an RDD and split each line on ','. Now I want to print only the hyd-addressed employees.
I think the below solution will help to solve your problem.
val fileName = "/path/stact_test.txt"
val strRdd = sc.textFile(fileName).map { line =>
  val data = line.split(",")
  (data(0), data(1), data(3))
}.filter(rec => rec._3.toLowerCase.trim.equals("hyd"))
After splitting the data, filter on the location using the 3rd item of the tuple RDD.
Output:
(1, a, hyd)
(4, d, hyd)
(6, b, hyd)
You may try to use a DataFrame:
val viewsDF = spark.read.text("hdfs:/user/edu/emp.txt")
val splitedViewsDF = viewsDF.withColumn("id", split($"value", ",").getItem(0))
  .withColumn("name", split($"value", ",").getItem(1))
  .withColumn("address", split($"value", ",").getItem(3))
  .drop($"value")
  .filter($"address" === "hyd")

Iterate through a DataFrame and perform functions on row/column

I have two DataFrames as below:
val df1 = Seq((1, 30), (2, 40), (1, 50)).toDF("col1", "col2")
example:
1 30
2 40
1 50
and
val df2 = Seq((1, 2), (3, 5)).toDF("key1", "key2")
example:
1 2
3 5
What I want to do is to loop through df2, take key2, and see if df2.key2 = df1.col1; if so, I will add another row to df1 to create a new DataFrame. In this example, for df2 row 1 (1, 2), since 2 matches row 2's col1 in df1, I want to add another row (1, 40) to df1.
Given the input above, the expected output is
1 30
2 40
1 50
1 40 // added this new row as a result, as df2.row1.key2 matches df1.row2.col1
     // for df2 (1, 2): as it matches df1 (2, 40) using that join condition, it brings in 40
I understand that we could check if df1.col("col1")===df2.col("key2"), but I don't know how to iterate through df2 to perform that on each row.
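The comparison described above maps naturally onto a join rather than explicit iteration. The following is only a rough sketch of that idea, not code from the post; the names generated and result are mine:
val generated = df1
  .join(df2, df1("col1") === df2("key2"))       // df1 rows whose col1 appears as some df2.key2, e.g. (2, 40) with (1, 2)
  .select(df2("key1").as("col1"), df1("col2"))  // rebuild the row with key1 as the new col1, producing (1, 40)
val result = df1.union(generated)
result.show()
With the sample data this appends exactly the row (1, 40) shown in the expected output.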

How to join and reduce two datasets with arrays?

I need an idea for how to join two datasets with millions of arrays. Each dataset contains Longs numbered 1 to 10,000,000, but grouped differently in each one, e.g. [1, 2], [3, 4] in one and [1], [2, 3], [4] in the other; the output should be [1, 2, 3, 4].
I need some way to join these sets efficiently.
I have tried an approach where I explode and group by multiple times, finally sorting and distincting the arrays. This works on small sets but is very inefficient for large sets because it explodes the number of rows many times over.
Any ideas on how to use another approach, like a reducer or aggregation, to solve this problem more efficiently?
The following is a Scala code example; however, I would need an approach that works in Java as well.
val rdd1 = spark.sparkContext.makeRDD(Array("""{"groupings":[1,2,3]}""", """{"groupings":[4,5,6]}""", """{"groupings":[7,8,9]}""", """{"groupings":[10]}""", """{"groupings":[11]}"""))
val rdd2 = spark.sparkContext.makeRDD(Array("""{"groupings":[1]}""", """{"groupings":[2,3,4]}""", """{"groupings":[7,8]}""", """{"groupings":[9]}""", """{"groupings":[10,11]}"""))
val srdd1 = spark.read.json(rdd1)
val srdd2 = spark.read.json(rdd2)
Dataset 1:
+---------+
|groupings|
+---------+
|[1, 2, 3]|
|[4, 5, 6]|
|[7, 8, 9]|
| [10]|
| [11]|
+---------+
Dataset 2:
+---------+
|groupings|
+---------+
| [1]|
|[2, 3, 4]|
| [7, 8]|
| [9]|
| [10, 11]|
+---------+
Output should be
+------------------+
| groupings|
+------------------+
|[1, 2, 3, 4, 5, 6]|
| [7, 8, 9]|
| [10, 11]|
+------------------+
Update:
This was my original code, which I had problems running. @AyanGuha had me thinking that perhaps it would be simpler to just use a series of joins instead; I am testing that now and will post a solution if it works out.
srdd1.union(srdd2).withColumn("temp", explode(col("groupings")))
  .groupBy("temp")
  .agg(collect_list("groupings").alias("groupings"))
  .withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
  .withColumn("temp", explode(col("groupings")))
  .groupBy("temp")
  .agg(collect_list("groupings").alias("groupings"))
  .withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
  .withColumn("temp", explode(col("groupings")))
  .groupBy("temp")
  .agg(collect_list("groupings").alias("groupings"))
  .withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
  .select(callUDF("sortLongArray", col("groupings")).alias("groupings"))
  .distinct()
What this code showed was that after 3 iterations the data coalesced, ideally then 3 joins would do the same.
Update 2:
Looks like I have a new working version. It still seems inefficient, but I think this will be handled better by Spark.
val ardd1 = spark.sparkContext.makeRDD(Array("""{"groupings":[1,2,3]}""", """{"groupings":[4,5,6]}""", """{"groupings":[7,8,9]}""", """{"groupings":[10]}""", """{"groupings":[11,12]}""", """{"groupings":[13,14]}"""))
val ardd2 = spark.sparkContext.makeRDD(Array("""{"groupings":[1]}""", """{"groupings":[2,3,4]}""", """{"groupings":[7,8]}""", """{"groupings":[9]}""", """{"groupings":[10,11]}""", """{"groupings":[12,13]}""", """{"groupings":[14, 15]}"""))
var srdd1 = spark.read.json(ardd1)
var srdd2 = spark.read.json(ardd2)
val addUDF = udf((x: Seq[Long], y: Seq[Long]) => if(y == null) x else (x ++ y).distinct.sorted)
val encompassUDF = udf((x: Seq[Long], y: Seq[Long]) => if(x.size == y.size) false else (x diff y).size == 0)
val arrayContainsAndDiffUDF = udf((x: Seq[Long], y: Seq[Long]) => (x.intersect(y).size > 0) && (y diff x).size > 0)
var rdd1 = srdd1
var rdd2 = srdd2.withColumnRenamed("groupings", "groupings2")
for (i <- 1 to 3) {
  rdd1 = rdd1.join(rdd2, arrayContainsAndDiffUDF(col("groupings"), col("groupings2")), "left")
    .select(addUDF(col("groupings"), col("groupings2")).alias("groupings"))
    .distinct
    .alias("rdd1")
  rdd2 = rdd1.select(col("groupings").alias("groupings2")).alias("rdd2")
}

rdd1.join(rdd2, encompassUDF(col("groupings"), col("groupings2")), "leftanti")
  .show(10, false)
Outputs:
+------------------------+
|groupings |
+------------------------+
|[10, 11, 12, 13, 14, 15]|
|[1, 2, 3, 4, 5, 6] |
|[7, 8, 9] |
+------------------------+
I will try this at scale and see what I get.
This works on small sets but is very inefficient for large sets because it explodes the number of rows many times over.
I don't think you have other options than exploding the arrays and joining, followed by distinct. Spark is fairly good at such computations and tries to do them using internal binary rows as much as possible. The datasets are compressed, and comparisons are often done at the byte level (outside the JVM).
It's then just a matter of having enough memory to hold all the elements, which may not be that big a deal.
I'd recommend giving your solution a try and checking out the physical plan and the stats. It could in the end turn out to be the only available solution.
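For example, the plans can be inspected with explain before running at scale. This is only an illustration on the small sample frames defined above, not the asker's full job:
import org.apache.spark.sql.functions.{col, explode}

// Print the parsed, analyzed, optimized and physical plans for the explode step.
val exploded = srdd1.union(srdd2).withColumn("element", explode(col("groupings")))
exploded.explain(true)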
Here is an alternate solution using the ARRAY data type that is supported as part of HiveQL. This will at least make your coding simpler [i.e. building out the logic]. The code below assumes that the raw data is in a text file.
Step 1. Create the tables (array_table2 is created the same way)
CREATE TABLE array_table1 (array_col1 array<int>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
Step 2: Load data into both tables
LOAD DATA INPATH '/path/to/file' OVERWRITE INTO TABLE array_table1;
Step 3: Apply sql functions to get results
select distinct(explode(array_col1)) from array_table1 union
select distinct(explode(array_col2)) from array_table2
I am not clear from the example what final output you are looking for. Is it just the union of all distinct numbers, or are they supposed to keep a grouping? In any case, with the tables created you can use a combination of distinct, explode(), left anti join and union to get the expected results.
You may want to optimize this code to filter the final data set again for duplicates.
Hope that helps!
OK I finally figured it out.
First of all, with my array joins I was doing something very wrong, which I overlooked initially.
When joining two arrays on equivalence (e.g. does [1,2,3] equal [1,2,3]?), the arrays can be hashed. I was instead doing an intersection match using a UDF (given x in [1,2,3], is any x in [1, 2, 3, 4, 5]?). That cannot be hashed and therefore requires a plan which checks every row against every row.
So to do this you have to explode both arrays first, then join them.
You can then apply other criteria. For example, I saved time by only joining arrays which were not equal and whose sum was less than the other's.
Example with a self join:
rdd2 = rdd2.withColumn("single", explode(col("grouping")))           // explode the grouping
temp = rdd2.withColumnRenamed("grouping", "grouping2").alias("temp") // alias for the self join
rdd2 = rdd2.join(temp, rdd2.col("single").equalTo(temp.col("single"))      // compare singles, which can be hashed
        .and(col("grouping").notEqual(col("grouping2")))                   // apply further conditions
        .and(callUDF("lessThanArray", col("grouping"), col("grouping2")))  // so that only [1,2,3] [4,5,6] is joined and not the duplicate [4,5,6] [1,2,3], for efficiency
    , "left")                                                              // left join so that the efficiency criteria do not drop rows
I then grouped by the grouping that was joined against, and aggregated the groupings from the self join.
rdd2.groupBy("grouping")
    .agg(callUDF("list_agg", col("grouping2")).alias("grouping2"))                    // list_agg is a UserDefinedAggregateFunction which aggregates lists into a distinct list
    .select(callUDF("addArray", col("grouping"), col("grouping2")).alias("grouping")) // addArray is a UDF which concats and distincts 2 arrays
grouping   grouping2
[1,2,3]    [3,4,5]
[1,2,3]    [2,6,7]
[1,2,3]    [2,8,9]
becomes just
[1,2,3]    [2,3,4,5,6,7,8,9]
and after addArray
[1,2,3,4,5,6,7,8,9]
I then iterated that code 3 times, which seems to make everything coalesce, and threw in a distinct for good measure.
Notice from the original question that I had two datasets. For my specific problem I discovered some assumptions about them: the first set I could assume had no duplicates, as it was a master list; the second set had duplicates, hence I only needed to apply the above code to the second set and then join it with the first. I would assume that if both sets had duplicates, they could be unioned together first, as sketched below.
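A minimal, untested sketch of that assumption (mine, not from the post):
// If both inputs may contain duplicate or overlapping groupings, start from their union
// and run the explode/self-join/aggregate iterations shown above on `combined`.
val combined = srdd1.union(srdd2).distinct()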