I have a dataset containing two columns, user_id and item_id. The DataFrame looks like this:
index user_id item_id
0 user1 A
1 user1 B
2 user2 A
3 user3 B
4 user4 C
I'm looking for a way to transform this table into an item-item interaction matrix in which each cell holds the number of distinct users the two items have in common (intersection):
A B C
A 2 1 0
B 1 2 0
C 0 0 1
And another item-item interaction matrix in which each cell holds the number of distinct users across both items (union):
A B C
A 2 3 3
B 3 2 3
C 3 3 1
Step 0. Define the dataframe
import pyspark.sql.functions as F
data = [(0, "user1", "A"),
(1, "user1", "B"),
(2, "user2", "A"),
(3, "user3", "B"),
(4, "user4", "C")]
df = spark.createDataFrame(data, schema=["index", "user_id", "item_id"])
Step 1. Collect user data for each item in df_collect
df_collect = (df
.select("user_id", "item_id")
.groupBy("item_id")
.agg(F.collect_set("user_id").alias("users")))
Step 2. Cross join df_collect with itself to get all item-item combinations
df_crossjoin = (df_collect
    .crossJoin(df_collect
        .withColumnRenamed("item_id", "item_y")
        .withColumnRenamed("users", "users_y")))
Step 3. Compute the size of the user union and intersection for each item pair
df_ui = (df_crossjoin
    .withColumn("users_union",
                F.size(F.array_union("users", "users_y")))
    .withColumn("users_intersect",
                F.size(F.array_intersect("users", "users_y"))))
Step 4. Pivot to get the item-item matrices
df_matrix_union = (df_ui
.groupBy("item_id")
.pivot("item_y")
.agg(F.first("users_union"))
.orderBy("item_id"))
df_matrix_intrsct = (df_ui
.groupBy("item_id")
.pivot("item_y")
.agg(F.first("users_intersect"))
.orderBy("item_id"))
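To verify, display both matrices; with the sample data from Step 0 they should match the two tables in the question:

# Quick check against the expected matrices above
df_matrix_intrsct.show()
df_matrix_union.show()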
I have a PySpark DataFrame for which I want to calculate summary statistics (the count of all unique categories in each column) and a cross-tabulation with one fixed column, for all string columns.
For example, my df looks like this:
col1  col2  col3
Cat1  XYZ   A
Cat1  XYZ   C
Cat1  ABC   B
Cat2  ABC   A
Cat2  XYZ   B
Cat2  MNO   A
I want something like this
VarName  Category  Count  A  B  C
col1     Cat1      3      1  1  1
col1     Cat2      3      2  1  0
col2     XYZ       3      1  1  1
col2     ABC       2      1  1  0
col2     MNO       1      1  0  0
col3     A         3      3  0  0
col3     B         2      0  2  0
col3     C         1      0  0  1
So, basically, I want a cross-tabulation of each individual column with col3, plus the total count.
I can do it in plain Python using a loop, but looping works somewhat differently in PySpark.
Here are my 2 cents.
Created a sample dataframe
df = spark.createDataFrame(
    [("Cat1", "XYZ", "A"),
     ("Cat1", "XYZ", "C"),
     ("Cat1", "ABC", "B"),
     ("Cat2", "ABC", "A"),
     ("Cat2", "XYZ", "B"),
     ("Cat2", "MNO", "A")],
    schema=["col1", "col2", "col3"])
Used the crosstab function, which calculates the count for each col3 value, then added the total row count as a new column, added a constant column with the source column name, and renamed the category column.
Then performed a union of all these dataframes:
import pyspark.sql.functions as fx

df_union = (
    df.crosstab('col1', 'col3')
      .withColumn('count', fx.expr('A+B+C'))
      .withColumn('VarName', fx.lit('col1'))
      .withColumnRenamed('col1_col3', 'Category')
    .union(df.crosstab('col2', 'col3')
             .withColumn('count', fx.expr('A+B+C'))
             .withColumn('VarName', fx.lit('col2'))
             .withColumnRenamed('col2_col3', 'Category'))
    .union(df.crosstab('col3', 'col3')
             .withColumn('count', fx.expr('A+B+C'))
             .withColumn('VarName', fx.lit('col3'))
             .withColumnRenamed('col3_col3', 'Category')))
Print the dataframe in the desired column order:
df_union.select('VarName','Category','count','A','B','C').show()
With the sample dataframe above, the output matches the expected table shown in the question.
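Since the question mentions doing this with a loop, the same construction can also be written generically by looping over the column names and unioning the per-column crosstabs. A minimal sketch (it assumes, as in the sample data, that col3 takes exactly the values A, B and C):

from functools import reduce
import pyspark.sql.functions as fx

def crosstab_all(df, columns, target="col3"):
    parts = []
    for c in columns:
        ct = (df.crosstab(c, target)
                .withColumn("count", fx.expr("A+B+C"))   # assumes the target values are exactly A, B, C
                .withColumn("VarName", fx.lit(c))
                .withColumnRenamed(f"{c}_{target}", "Category")
                .select("VarName", "Category", "count", "A", "B", "C"))
        parts.append(ct)
    return reduce(lambda a, b: a.union(b), parts)

crosstab_all(df, ["col1", "col2", "col3"]).show()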
I have a list of ids, a sequence number of messages (seq) and a value (e.g. timestamps). Multiple rows can have the same sequence number. There are some other columns with different values in every row, but I excluded them as they are not important.
Within all messages from a deviceId (= partitionBy), I need to sort by sequence number (= orderBy) and add the 'ts' value of the next message with a different sequence number to all messages of the current sequence number.
I got as far as retrieving the value of the next row if that row has a different sequence number. But since the "next row with a different sequence number" could potentially be x rows away, I would have to add a specific .when(condition, ...) block for each of the x rows ahead.
I was wondering if there is a better solution that works no matter how "far away" the next row with a different sequence number is. I tried a .otherwise(lead(col("next_value"), 1)), but since I am just building the column, it doesn't work.
My Code & reproducible example:
data = [
(1, 1, "A"),
(2, 1, "G"),
(2, 2, "F"),
(3, 1, "A"),
(4, 1, "A"),
(4, 2, "B"),
(4, 3, "C"),
(4, 3, "C"),
(4, 3, "C"),
(4, 4, "D")
]
df = spark.createDataFrame(data=data, schema=["id", "seq", "ts"])
df.printSchema()
df.show(10, False)
from pyspark.sql import Window
from pyspark.sql.functions import col, lead, when

window = Window \
    .partitionBy("id") \
    .orderBy("id", "seq")

# I could potentially do this 100x if the next lead-value is 100 rows away,
# but I wonder if there isn't a better solution.
is_different_seq1 = lead(col("seq"), 1).over(window) != col("seq")
is_different_seq2 = lead(col("seq"), 2).over(window) != col("seq")

df = df.withColumn("lead_value",
                   when(is_different_seq1,
                        lead(col("ts"), 1).over(window))
                   .when(is_different_seq2,
                        lead(col("ts"), 2).over(window)))
df.printSchema()
df.show(10, False)
Ideal output in column "next_value" for id=4:
id  seq  ts  next_value
4   1    A   B
4   2    B   C
4   3    C   D
4   3    C   D
4   3    C   D
4   4    D   null
I haven't tried more complicated cases, so this might still need adjustment, but I think you can combine lead with the last function.
With just the lead function, the result looks like this:
id  seq  ts  lead_value
4   1    A   B
4   2    B   C
4   3    C   C
4   3    C   C
4   3    C   D
4   4    D   null
You want to overwrite the lead_value of the 3rd and 4th rows to be "D", which is the last lead_value within the same id & seq group.
from pyspark.sql import Window
import pyspark.sql.functions as F

lead_window = (Window
    .partitionBy("id")    # "id" plays the role of deviceId in the reproducible example
    .orderBy("seq"))

last_window = (Window
    .partitionBy("id", "seq")
    .rowsBetween(0, Window.unboundedFollowing))

df = df.withColumn("next_value", F.last(
    F.lead(F.col("ts")).over(lead_window)
).over(last_window))
Result:
id  seq  ts  next_value
4   1    A   B
4   2    B   C
4   3    C   D
4   3    C   D
4   3    C   D
4   4    D   null
I found a solution (horribly slow, however), so if someone comes up with a better one, please add your answer!
I take one row per "message" with a distinct, compute the lead(1) there, and join the result back to the rest of the columns in the dataframe.
window = Window.partitionBy("id").orderBy("id", "seq")  # same window as in the question

df_filtered = df.select("id", "seq", "ts").distinct()
df_filtered = df_filtered.withColumn("lead_value", lead(col("ts"), 1).over(window))
df = df.join(df_filtered, on=["id", "seq", "ts"])
I have two DataFrames as below:
val df1 = Seq((1, 30), (2, 40), (1, 50)).toDF("col1", "col2")
example:
1 30
2 40
1 50
and
val df2 = Seq((1, 2), (3, 5)).toDF("key1", "key2")
example:
1 2
3 5
What I want to do is loop through df2, take key2, and check whether df2.key2 = df1.col1; if so, add another row to df1 to create a new DataFrame. In this example, for df2 row 1 (1, 2), since 2 matches col1 of row 2 in df1, I want to add a new row (1, 40) to df1.
Given the input above, the expected output is
1 30
2 40
1 50
1 40 //added this new row as result, as df2.row1.key2 matches df1.row2.col1
//for df2 (1, 2), as it matches df1 (2, 40) using that join condition, it brings in 40
I understand that we could check if df1.col("col1")===df2.col("key2"), but I don't know how to iterate through df2 to perform that on each row.
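One way to express this without iterating over df2 row by row is a join followed by a union: join df2 to df1 on df2.key2 = df1.col1 to derive the extra rows, then union them onto df1. A minimal sketch (written in PySpark for illustration; the Scala DataFrame API is analogous, and the values come from the example above):

import pyspark.sql.functions as F

# Sample data from the example above
df1 = spark.createDataFrame([(1, 30), (2, 40), (1, 50)], ["col1", "col2"])
df2 = spark.createDataFrame([(1, 2), (3, 5)], ["key1", "key2"])

# For every df2 row whose key2 matches some df1.col1, derive a new row (key1, col2)
new_rows = (df2.join(df1, df2.key2 == df1.col1)
               .select(F.col("key1").alias("col1"), "col2"))

df1.union(new_rows).show()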
I have a Spark dataframe (input_dataframe_1); the data in this dataframe looks like below:
id value
1 Ab
2 Ai
3 aB
I have another Spark dataframe (input_dataframe_2); the data in this dataframe looks like below:
name value
x ab
y iA
z aB
I want to join both dataframes, and the join condition should be case-insensitive. Below is the join condition I am using:
output = input_dataframe_1.join(input_dataframe_2,['value'])
How can I make join condition case insensitive?
from pyspark.sql.functions import lower
#sample data
input_dataframe_1 = sc.parallelize([(1, 'Ab'), (2, 'Ai'), (3, 'aB')]).toDF(["id", "value"])
input_dataframe_2 = sc.parallelize([('x', 'ab'), ('y', 'iA'), ('z', 'aB')]).toDF(["name", "value"])
output = (input_dataframe_1
          .join(input_dataframe_2,
                lower(input_dataframe_1.value) == lower(input_dataframe_2.value))
          .drop(input_dataframe_2.value))
output.show()
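A variant of the same idea, sketched with the column names above, is to add a lower-cased join key to both dataframes and join on that key by name, which avoids having two ambiguous value columns to drop:

from pyspark.sql.functions import lower

df1_l = input_dataframe_1.withColumn("join_key", lower("value"))
df2_l = (input_dataframe_2
         .withColumnRenamed("value", "value_2")
         .withColumn("join_key", lower("value_2")))

output = df1_l.join(df2_l, on="join_key").drop("join_key")
output.show()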
Assuming you are doing an inner join, find the solution below:
Create input dataframe 1
val inputDF1 = spark.createDataFrame(Seq(("1", "Ab"), ("2", "Ai"), ("3", "aB"))).toDF("id", "value")
Create input dataframe 2
val inputDF2 = spark.createDataFrame(Seq(("x", "ab"), ("y", "iA"), ("z", "aB"))).toDF("name", "value")
Joining both dataframes on lower(value) column
import org.apache.spark.sql.functions.lower

inputDF1.join(inputDF2, lower(inputDF1.col("value")) === lower(inputDF2.col("value"))).show
id  value  name  value
1   Ab     z     aB
1   Ab     x     ab
3   aB     z     aB
3   aB     x     ab
I have 4 RDDs with the same key but different columns, and I want to attach them. I thought of performing a fullOuterJoin because, even if the ids don't match, I want the complete record.
Maybe this is easier with dataframes (taking into account not losing records)? But so far I have the following code:
var final_agg = rdd_agg_1.fullOuterJoin(rdd_agg_2).fullOuterJoin(rdd_agg_3).fullOuterJoin(rdd_agg_4).map {
  case (id, // TODO this case to format the resulting rdd)
}
If I have these RDDs:
rdd1        rdd2        rdd3        rdd4
id field1   id field2   id field3   id field4
1  2        1  2        1  2        2  3
2  5        5  1
So the resulting rdd would have the following form:
rdd
id  field1  field2  field3  field4
1   2       2       2       3
2   5       -       -       -
5   -       1       -       -
EDIT:
This is the RDD that I want to format in the case:
org.apache.spark.rdd.RDD[(String, (Option[(Option[(Option[Int], Option[Int])], Option[Int])], Option[String]))]
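For the dataframe route mentioned above, the equivalent is a chain of full outer joins on the key. A sketch in PySpark for illustration (the question's code is Scala RDDs; the sample values are taken from the tables above, and the real RDDs would first be converted to dataframes, e.g. rdd_agg_1.toDF(["id", "field1"])):

# Sample data from the example above
df1 = spark.createDataFrame([(1, 2), (2, 5)], ["id", "field1"])
df2 = spark.createDataFrame([(1, 2), (5, 1)], ["id", "field2"])
df3 = spark.createDataFrame([(1, 2)], ["id", "field3"])
df4 = spark.createDataFrame([(2, 3)], ["id", "field4"])

# Joining on "id" keeps a single id column; unmatched ids get nulls in the missing fields
result = (df1
          .join(df2, on="id", how="full_outer")
          .join(df3, on="id", how="full_outer")
          .join(df4, on="id", how="full_outer"))
result.show()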