I have two DataFrames as below:
val df1 = Seq((1, 3), (2, 4), (1, 5)).toDF("col1", "col2")
example:
col1 col2
1    3
2    4
1    5
and
val df2 = Seq((1, 2), (3, 5)).toDF("key1", "key2")
example:
key1 key2
1    2
3    5
What I want to do is loop through df2, take key2, and check whether df2.key2 = df1.col1; if so, I add another row to df1 to create a new DataFrame. In this example, for df2 row 1 (1, 2), since key2 = 2 matches col1 of row 2 in df1, I want to add the row (1, 4) to df1.
Given the input above, the expected output is
col1 col2
1    3
2    4
1    5
1    4   // added as a result: df2.row1.key2 matches df1.row2.col1
// for df2 (1, 2), since key2 = 2 matches df1 (2, 4) under that join condition, it brings in 4
I understand that we could check if df1.col("col1")===df2.col("key2"), but I don't know how to iterate through df2 to perform that on each row.
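One way to express this without iterating over df2 row by row (just a sketch of one possible approach) is to join df2 to df1 on the key2 = col1 condition, build the new rows from key1 and the matched col2, and union them onto df1:
// rows to add: for every df2 row whose key2 matches some df1.col1,
// emit (key1, matched col2) as a new (col1, col2) pair
val newRows = df2
  .join(df1, df2("key2") === df1("col1"))
  .select(df2("key1").as("col1"), df1("col2").as("col2"))

val result = df1.union(newRows)
result.show()
// with the example data above, this appends the single row (1, 4)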
I have a list of ids, a sequence number of messages (seq) and a value (e.g. timestamps). Multiple rows can have the same sequence number. There are some other columns with different values in every row, but I excluded them as they are not important.
Within all messages from a deviceId (=partitionBy), I need to sort by sequence_number (=orderBy) and add the 'ts'-value of the next message with a different sequence_number to all messages of the current sequence_number.
I got so far as to retrieve the value of the next row if that row has a different sequence number. But since the "next row with a different sequence number" could potentially be x rows far away, I would have to add specific .when(condition, ...) blocks for x rows ahead.
I was wondering if there is a better solution that works no matter how "far away" the next row with a different sequence number is. I tried .otherwise(lead(col("next_value"), 1)), but since I am just building the column, it doesn't work.
My Code & reproducible example:
from pyspark.sql import Window
from pyspark.sql.functions import col, lead, when

data = [
    (1, 1, "A"),
    (2, 1, "G"),
    (2, 2, "F"),
    (3, 1, "A"),
    (4, 1, "A"),
    (4, 2, "B"),
    (4, 3, "C"),
    (4, 3, "C"),
    (4, 3, "C"),
    (4, 4, "D")
]
df = spark.createDataFrame(data=data, schema=["id", "seq", "ts"])
df.printSchema()
df.show(10, False)
window = Window \
    .orderBy("id", "seq") \
    .partitionBy("id")

# I could potentially do this 100x if the next lead-value is 100 rows away,
# but I wonder if there isn't a better solution.
is_different_seq1 = lead(col("seq"), 1).over(window) != col("seq")
is_different_seq2 = lead(col("seq"), 2).over(window) != col("seq")

df = df.withColumn(
    "lead_value",
    when(is_different_seq1, lead(col("ts"), 1).over(window))
    .when(is_different_seq2, lead(col("ts"), 2).over(window))
)
df.printSchema()
df.show(10, False)
Ideal output in column "next_value" for id=4:
id  seq  ts  next_value
4   1    A   B
4   2    B   C
4   3    C   D
4   3    C   D
4   3    C   D
4   4    D   Null
I haven't tried the more complicated case, so this might still need more adjustment, but I think you can combine it with the last function.
With just the lead function, the result looks like this:
id  seq  ts  lead_value
4   1    A   B
4   2    B   C
4   3    C   C
4   3    C   C
4   3    C   D
4   4    D   Null
You want to overwrite the lead_value of the 3rd and 4th rows with "D", which is the last lead_value within the same id & seq group.
from pyspark.sql import Window
from pyspark.sql import functions as F

# the reproducible example uses "id" for the device column, so partition on "id"
lead_window = (Window
    .partitionBy("id")
    .orderBy("seq"))

last_window = (Window
    .partitionBy("id", "seq")
    .rowsBetween(0, Window.unboundedFollowing))

df = df.withColumn("next_value", F.last(
    F.lead(F.col("ts")).over(lead_window)
).over(last_window))
Result.
id  seq  ts  next_value
4   1    A   B
4   2    B   C
4   3    C   D
4   3    C   D
4   3    C   D
4   4    D   Null
I found a solution (horribly slow, however), so if someone comes up with a better one, please add your answer!
I get one row per "message" with a distinct, compute the lead(1) there, and join it back to the rest of the columns in the original dataframe.
df_filtered = df.select("id", "seq", "ts").distinct()
df_filtered = df_filtered.withColumn("lead_value", lead(col("ts"), 1).over(window))
df = df.join(df_filtered, on=["id", "seq", "ts"])
I want to make a conceptual check of my code. The goal is to calculate, for each (src, dst) pair, the minimum value of the field minTimestamp and the maximum value of the field maxTimestamp in the DataFrame df, and drop all the other values.
For example:
df
src dst minTimestamp maxTimestamp
1 3 1530809948 1530969948
1 3 1540711155 1530809945
1 3 1520005712 1530809940
2 3 1520005712 1530809940
The answer should be the following one:
result:
src dst minTimestamp maxTimestamp
1 3 1520005712 1530969948
2 3 1520005712 1530809940
This is my code:
val cw_min = Window.partitionBy($"src", $"dst").orderBy($"minTimestamp".asc)
val cw_max = Window.partitionBy($"src", $"dst").orderBy($"maxTimestamp".desc)
val result = df
  .withColumn("rn", row_number.over(cw_min)).where($"rn" === 1).drop("rn")
  .withColumn("rn", row_number.over(cw_max)).where($"rn" === 1).drop("rn")
Is it possible to use Window functions sequentially as I did in my code sample? The problem is that I always get the same values for minTimestamp and maxTimestamp.
You can use DataFrame groupBy to aggregate the min and max:
import org.apache.spark.sql.functions._
val df = Seq(
  (1, 3, 1530809948L, 1530969948L),
  (1, 3, 1540711155L, 1530809945L),
  (1, 3, 1520005712L, 1530809940L),
  (2, 3, 1520005712L, 1530809940L)
).toDF("src", "dst", "minTimestamp", "maxTimestamp")

df.groupBy("src", "dst").agg(
  min($"minTimestamp").as("minTimestamp"), max($"maxTimestamp").as("maxTimestamp")
).show
// +---+---+------------+------------+
// |src|dst|minTimestamp|maxTimestamp|
// +---+---+------------+------------+
// | 2| 3| 1520005712| 1530809940|
// | 1| 3| 1520005712| 1530969948|
// +---+---+------------+------------+
Why not use Spark SQL and do
val spark: SparkSession = ???
df.createOrReplaceTempView("myDf")
val df2 = spark.sql("""
  select
    src,
    dst,
    min(minTimestamp) as minTimestamp,
    max(maxTimestamp) as maxTimestamp
  from myDf
  group by src, dst""")
You can also use the API to do the same:
val df2 = df
  .groupBy("src", "dst")
  .agg(min("minTimestamp"), max("maxTimestamp"))
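If the intent was specifically to stay with window functions (as in the original attempt), a minimal sketch along those lines computes min and max over a partition-only window (with no orderBy, the frame spans the whole partition) and then drops the duplicate keys:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{min, max}

val w = Window.partitionBy($"src", $"dst")
val result = df
  .withColumn("minTimestamp", min($"minTimestamp").over(w))
  .withColumn("maxTimestamp", max($"maxTimestamp").over(w))
  .dropDuplicates("src", "dst")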
There are two JSON sources, and the first one always has more columns; it is a superset of the second.
val df1 = spark.read.json(sqoopJson)
val df2 = spark.read.json(kafkaJson)
Except operation:
I'd like to apply the except operation to df1 and df2, but df1 has 10 columns and df2 has only 8.
If I manually drop the 2 extra columns from df1, then except works. But I have 50+ tables/JSON files and need to do the EXCEPT for all 50 sets.
Question:
How do I select from df1 only the (8) columns available in df2 and create a new df3? Then df3 will have df1's data with the limited set of columns, and it will match df2's columns.
For the question: how to select from df1 only the (8) columns available in df2 and create a new df3?
import org.apache.spark.sql.functions.col

// get the column names from df2 as Column objects
val columns = df2.schema.fieldNames.map(col(_))
// select only those columns from df1
val df3 = df1.select(columns :_*)
Hope this helps!
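Since there are 50+ table pairs, one option (a sketch with a hypothetical helper name) is to wrap the column alignment and the except in a small function and apply it to each pair:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// hypothetical helper: align left to right's columns, then run except
def exceptOnCommonColumns(left: DataFrame, right: DataFrame): DataFrame = {
  val columns = right.schema.fieldNames.map(col(_))
  left.select(columns: _*).except(right)
}

// e.g. val diff = exceptOnCommonColumns(df1, df2)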
I have two dataframes, as below. I'm trying to find the intersection of the two dataframes based on either of the two columns, not only on both of them.
So in this case, I want to return dataframe C, which contains df A row 1 (since A row 1 Col1 = B row 1 Col1), A row 2 (since A row 2 Col2 = B row 1 Col2), A row 4 (since B row 2 Col1 = A row 4 Col1), and A row 5. But if I do an intersect of A and B, it only returns row 5 of A, as that is a match on both columns. How do I do this? Many thanks. Let me know if I'm not explaining the question well.
A:
Col1 Col2
1 2
2 3
3 7
5 4
1 3
B:
Col1 Col2
1 3
5 1
C:
Col1 Col2
1 2
2 3
5 4
1 3
With the following data:
val df1 = sc.parallelize(Seq(1->2, 2->3, 3->7, 5->4, 1->3)).toDF("col1", "col2")
val df2 = sc.parallelize(Seq(1->3, 5->1)).toDF("col1", "col2")
Then you can join your datasets with an or condition:
val cols = df1.columns
df1.join(df2, cols.map(c => df1(c) === df2(c)).reduce(_ || _))
  .select(cols.map(df1(_)) :_*)
  .distinct
  .show
+----+----+
|col1|col2|
+----+----+
| 2| 3|
| 1| 2|
| 1| 3|
| 5| 4|
+----+----+
The join condition is generic and would work for any number of columns. The code maps each column to an equality between that column in df1 and the same one in df2: cols.map(c => df1(c) === df2(c)). Then the reduce takes the logical or of all these equalities, which is what you want.
The select is there because otherwise the columns of both dataframes would be kept; here I simply keep the ones from df1. I also added a distinct in case several lines of df2 match a line of df1, or vice versa. Indeed, you may get a cartesian product.
Note that this method does not need any collection to the driver, so it will work regardless of the size of the datasets. Yet, if df2 is small enough to be collected to the driver and broadcast, you would get faster results with a method like this:
import org.apache.spark.sql.functions.{udf, lit}

// to each column name, we map the set of values it takes in df2
// (typed as Int here: udf does not support Any, and the example columns are integers)
val valueMap = df2.rdd
  .flatMap(row => cols.map(name => name -> row.getAs[Int](name)))
  .distinct
  .groupByKey
  .mapValues(_.toSet)
  .collectAsMap

// we create a udf that looks up in valueMap
val filter = udf((name: String, value: Int) => valueMap(name).contains(value))

// finally we apply the filter
df1.where(cols.map(c => filter(lit(c), df1(c))).reduce(_ || _))
  .show
With this method, no shuffling of df1 and no cartesian product. If df2 is small, this is definitely the way to go.
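For completeness, a similar effect can be achieved with Spark's broadcast hint and a left-semi join, which keeps only df1's columns and avoids the need for distinct (a sketch, assuming the same df1, df2 and cols as above):
import org.apache.spark.sql.functions.broadcast

df1.join(broadcast(df2), cols.map(c => df1(c) === df2(c)).reduce(_ || _), "left_semi")
  .show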
You should perform two join operations individually on each of the join columns, and then perform a union of the two resulting Dataframes:
val dfA = List((1,2),(2,3),(3,7),(5,4),(1,3)).toDF("Col1", "Col2")
val dfB = List((1,3),(5,1)).toDF("Col1", "Col2")
val res1 = dfA.join(dfB, dfA.col("Col1")===dfB.col("Col1"))
val res2 = dfA.join(dfB, dfA.col("Col2")===dfB.col("Col2"))
val res = res1.union(res2)
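Note that res as written keeps the columns of both dataframes and can contain duplicate rows. If the goal is exactly dataframe C, one variation (a sketch along the same lines) is to project dfA's columns before the union and then deduplicate:
val resA = dfA.join(dfB, dfA("Col1") === dfB("Col1")).select(dfA("Col1"), dfA("Col2"))
val resB = dfA.join(dfB, dfA("Col2") === dfB("Col2")).select(dfA("Col1"), dfA("Col2"))
val resC = resA.union(resB).distinct()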
Is it possible to divide a DataFrame into two parts using a single filter operation? For example, let's say df has the records below:
UID Col
1 a
2 b
3 c
If I do
df1 = df.filter(UID <=> 2)
can I save the filtered and the non-filtered records into different RDDs in a single operation?
df1 would have the records where uid = 2
df2 would have the records with uid 1 and 3
If you're interested only in saving data you can add an indicator column to the DataFrame:
val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("uid", "col")
val dfWithInd = df.withColumn("ind", $"uid" <=> 2)
and use it as a partition column for the DataFrameWriter with one of the supported formats (as of 1.6, these are Parquet, text, and JSON):
dfWithInd.write.partitionBy("ind").parquet(...)
It will create two separate directories (ind=false, ind=true) on write.
In general though, it is not possible to yield multiple RDDs or DataFrames from a single transformation. See How to split a RDD into two or more RDDs?
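If you actually need the two subsets as separate DataFrames rather than as output directories, the usual pattern (a sketch, not a single-pass solution) is to cache the input and filter it twice:
val cached = df.cache()
val df1 = cached.filter($"uid" <=> 2)
val df2 = cached.filter(!($"uid" <=> 2))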