How to join DataFrames with some similar values and multiple keys (Scala)

I am having trouble producing the following table. The first two tables are my source tables, which I would like to join; the third table shows the result I want.
I tried an outer join on the keys "ID" and "date", but the result does not match this example. The problem is that some def_ values in the two tables share the same date, and I would like to get them in the same row.
I used the following join:
val df_result = df_1.join(df_2, Seq("ID", "date"), "outer")
df_1
+----+-----+-----------+
|ID |def_a| date |
+----+-----+-----------+
| 01| 1| 2019-01-31|
| 02| 1| 2019-12-31|
| 03| 1| 2019-11-30|
| 01| 1| 2019-10-31|
df_2
+----+-----+-----+-----------+
|ID |def_b|def_c|date |
+----+-----+-----+-----------+
| 01| 1| 0| 2017-01-31|
| 02| 1| 1| 2019-12-31|
| 03| 1| 1| 2018-11-30|
| 03| 0| 1| 2019-11-30|
| 01| 1| 1| 2018-09-30|
| 02| 1| 1| 2018-08-31|
| 01| 1| 1| 2018-07-31|
result
+----+-----+-----+-----+-----------+
|ID |def_a|def_b|def_c|date |
+----+-----+-----+-----+-----------+
| 01| 1| 0| 0| 2019-01-31|
| 02| 1| 1| 1| 2019-12-31|
| 03| 1| 0| 1| 2019-11-30|
| 01| 1| 0| 0| 2019-10-31|
| 01| 0| 1| 0| 2017-01-31|
| 03| 0| 1| 1| 2018-11-30|
| 01| 0| 1| 1| 2018-09-30|
| 02| 0| 1| 1| 2018-08-31|
| 01| 0| 1| 1| 2018-07-31|
I would be grateful for any help.

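I hope the following code helps. It collapses the joined result to one row per ID and date, using the column names from the question's tables:
import org.apache.spark.sql.functions.max
// one row per (ID, date); max ignores nulls, so a 1 from either side wins
df_result
  .groupBy("ID", "date")
  .agg(
    max("def_a").as("def_a"),
    max("def_b").as("def_b"),
    max("def_c").as("def_c")
  )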
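An outer join on both keys already puts matching (ID, date) pairs in the same row; what remains are nulls where only one side has a row. A minimal sketch, assuming the goal is simply to show those nulls as 0 as in the expected table (the session setup and the use of `na.fill` are my additions, not part of the question):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("outer-join-fill").getOrCreate()
import spark.implicits._

// Source tables copied from the question
val df_1 = Seq(
  ("01", 1, "2019-01-31"), ("02", 1, "2019-12-31"),
  ("03", 1, "2019-11-30"), ("01", 1, "2019-10-31")
).toDF("ID", "def_a", "date")

val df_2 = Seq(
  ("01", 1, 0, "2017-01-31"), ("02", 1, 1, "2019-12-31"),
  ("03", 1, 1, "2018-11-30"), ("03", 0, 1, "2019-11-30"),
  ("01", 1, 1, "2018-09-30"), ("02", 1, 1, "2018-08-31"),
  ("01", 1, 1, "2018-07-31")
).toDF("ID", "def_b", "def_c", "date")

// Full outer join on both keys, then replace the nulls left by unmatched rows with 0
val df_result = df_1
  .join(df_2, Seq("ID", "date"), "outer")
  .na.fill(0, Seq("def_a", "def_b", "def_c"))

df_result.orderBy("date").show()
```

If either source table can contain several rows for the same ID and date, the groupBy/max step is still needed afterwards.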

Related

Get % of rows that have a unique value by id

I have a pyspark dataframe that looks like this
import pandas as pd
spark.createDataFrame(
pd.DataFrame({'ch_id': [1,1,1,1,1,
2,2,2,2],
'e_id': [0,0,1,2,2,
0,0,1,1],
'seg': ['h','s','s','a','s',
'h','s','s','h']})
).show()
+-----+----+---+
|ch_id|e_id|seg|
+-----+----+---+
| 1| 0| h|
| 1| 0| s|
| 1| 1| s|
| 1| 2| a|
| 1| 2| s|
| 2| 0| h|
| 2| 0| s|
| 2| 1| s|
| 2| 1| h|
+-----+----+---+
For every ch_id, I would like to get the percentage of e_id values for which there is exactly one unique value of seg.
The output would look like this:
+-----+-------+
|ch_id|%_major|
+-----+-------+
| 1| 66.6|
| 2| 0.0|
+-----+-------+
How could I achieve that in PySpark?

How to count change in row values in pyspark

Logic to count the change in the row values of a given column
Input
df22 = spark.createDataFrame(
[(1, 1.0), (1,22.0), (1,22.0), (1,21.0), (1,20.0), (2, 3.0), (2,3.0),
(2, 5.0), (2, 10.0), (2,3.0), (3,11.0), (4, 11.0), (4,15.0), (1,22.0)],
("id", "v"))
+---+----+
| id| v|
+---+----+
| 1| 1.0|
| 1|22.0|
| 1|22.0|
| 1|21.0|
| 1|20.0|
| 2| 3.0|
| 2| 3.0|
| 2| 5.0|
| 2|10.0|
| 2| 3.0|
| 3|11.0|
| 4|11.0|
| 4|15.0|
+---+----+
Expected output
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|22.0| 1|
| 1|22.0| 1|
| 1|21.0| 2|
| 1|20.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 2| 3.0| 3|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+
Any help on this will be greatly appreciated
Thanks in advance
Ramabadran

Before adding an answer, I would like to ask: what have you tried? Please try something yourself first and then ask for support on this platform. Your question is also unclear: you have not said whether you want the change count per 'id' or over the whole DataFrame. Giving only an expected output does not make the question clear.
Now to your question: if I understood it correctly from the sample input and output, you need the change count per 'id'. One way to achieve it is shown below.
#Capture the incremented count using lag() and sum() over below mentioned window
import pyspark.sql.functions as F
from pyspark.sql.window import Window
winSpec=Window.partitionBy('id').orderBy('v') # Your Window for capturing the incremented count
df22.\
withColumn('prev',F.coalesce(F.lag('v').over(winSpec),F.col('v'))).\
withColumn('c',F.sum(F.expr("case when v-prev<>0 then 1 else 0 end")).over(winSpec)).\
drop('prev').\
orderBy('id','v').\
show()
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|20.0| 1|
| 1|21.0| 2|
| 1|22.0| 3|
| 1|22.0| 3|
| 1|22.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+
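Note that ordering the window by 'v' sorts the values within each id, so the counts above do not follow the row order of the question's expected output. A minimal Scala sketch of the same lag-and-running-sum idea, assuming the intended order is available in an explicit column (the `pos` column is hypothetical and is built here with zipWithIndex; the data is the 13 rows shown in the question's input table):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, lag, sum, when}

val spark = SparkSession.builder().master("local[1]").appName("change-count").getOrCreate()
import spark.implicits._

// "pos" (hypothetical) records the original row order, which a DataFrame does not keep by itself
val df22 = Seq(
  (1, 1.0), (1, 22.0), (1, 22.0), (1, 21.0), (1, 20.0),
  (2, 3.0), (2, 3.0), (2, 5.0), (2, 10.0), (2, 3.0),
  (3, 11.0), (4, 11.0), (4, 15.0)
).zipWithIndex.map { case ((id, v), pos) => (id, v, pos) }.toDF("id", "v", "pos")

val w = Window.partitionBy("id").orderBy("pos")

// Previous value within the id; the first row of each id is compared with itself
val prev = coalesce(lag("v", 1).over(w), col("v"))

val result = df22
  .withColumn("changed", when(col("v") =!= prev, 1).otherwise(0))
  .withColumn("c", sum("changed").over(w)) // running total of changes per id
  .drop("changed")
  .orderBy("pos")

result.select("id", "v", "c").show()
```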

How to union 2 dataframe without creating additional rows?

I have 2 DataFrames and I want to apply .filter($"item" === "a") while keeping all the "S/N" numbers.
I tried the following, but using union gave me additional rows. Is there a way to union 2 DataFrames without creating additional rows?
var DF1 = Seq(
("1","a",2),
("2","a",3),
("3","b",3),
("4","b",4),
("5","a",2)).
toDF("S/N","item", "value")
var DF2 = Seq(
("1","a",2),
("2","a",3),
("3","b",3),
("4","b",4),
("5","a",2)).
toDF("S/N","item", "value")
DF2 = DF2.filter($"item"==="a")
var DF3 = DF1.withColumn("item",lit(0)).withColumn("value",lit(0))
DF1.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 3| b| 3|
| 4| b| 4|
| 5| a| 2|
+---+----+-----+
DF2.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 5| a| 2|
+---+----+-----+
DF3.show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| 0| 0|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 0|
| 5| 0| 0|
+---+----+-----+
DF2.union(DF3).show()
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 5| a| 2|
| 1| 0| 0|
| 2| 0| 0|
| 3| 0| 0|
| 4| 0| 0|
| 5| 0| 0|
+---+----+-----+

Left outer join your S/Ns with the filtered DataFrame, then use coalesce to replace the nulls:
val DF3 = DF1.select("S/N")
val DF4 = (DF3.join(DF2, Seq("S/N"), joinType="leftouter")
.withColumn("item", coalesce($"item", lit(0)))
.withColumn("value", coalesce($"value", lit(0))))
DF4.show
+---+----+-----+
|S/N|item|value|
+---+----+-----+
| 1| a| 2|
| 2| a| 3|
| 3| 0| 0|
| 4| 0| 0|
| 5| a| 2|
+---+----+-----+

Row count broken up by a focal value

I have the following DataFrame in Spark using Scala:
val df = List(
("random", 0),
("words", 1),
("in", 1),
("a", 1),
("column", 1),
("are", 0),
("what", 0),
("have", 1),
("been", 1),
("placed", 0),
("here", 1),
("now", 1)
).toDF(Seq("words", "numbers"): _*)
df.show()
+------+-------+
| words|numbers|
+------+-------+
|random| 0|
| words| 1|
| in| 1|
| a| 1|
|column| 1|
| are| 0|
| what| 0|
| have| 1|
| been| 1|
|placed| 0|
| here| 1|
| now| 1|
+------+-------+
I'd like to add a column containing the row count of each group, where a new group starts at every 0 in the numbers column. It would look like this:
+------+-------+-----+
| words|numbers|count|
+------+-------+-----+
|random| 0| 5|
| words| 1| 5|
| in| 1| 5|
| a| 1| 5|
|column| 1| 5|
| are| 0| 1|
| what| 0| 3|
| have| 1| 3|
| been| 1| 3|
|placed| 0| 3|
| here| 1| 3|
| now| 1| 3|
+------+-------+-----+

Here is a method using selectExpr with the SQL window functions sum and count. The running sum of 1-numbers generates a group id that increases by 1 each time a zero is encountered; count then counts the rows per group id.
This might be inefficient, since there is no partition column and the ordering window runs over the whole DataFrame:
df.selectExpr(
"words", "numbers",
"count(*) over(partition by sum(1-numbers) over (order by monotonically_increasing_id())) as count"
).show
+------+-------+-----+
| words|numbers|count|
+------+-------+-----+
|random| 0| 5|
| words| 1| 5|
| in| 1| 5|
| a| 1| 5|
|column| 1| 5|
| are| 0| 1|
| what| 0| 3|
| have| 1| 3|
| been| 1| 3|
|placed| 0| 3|
| here| 1| 3|
| now| 1| 3|
+------+-------+-----+
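The two steps can also be written separately with the DataFrame API. This is only a sketch under one assumption: the row order is carried by an explicit `pos` column (hypothetical, built with zipWithIndex) instead of relying on monotonically_increasing_id():

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit, sum}

val spark = SparkSession.builder().master("local[1]").appName("group-count").getOrCreate()
import spark.implicits._

// Same data as in the question, plus a hypothetical "pos" column holding the row order
val df = Seq(
  ("random", 0), ("words", 1), ("in", 1), ("a", 1), ("column", 1), ("are", 0),
  ("what", 0), ("have", 1), ("been", 1), ("placed", 0), ("here", 1), ("now", 1)
).zipWithIndex.map { case ((w, n), pos) => (w, n, pos) }.toDF("words", "numbers", "pos")

// Step 1: running sum of (1 - numbers) gives a group id that grows by 1 at every 0
val byOrder = Window.orderBy("pos")
val withGroup = df.withColumn("grp", sum(lit(1) - col("numbers")).over(byOrder))

// Step 2: count the rows of each group and attach the count to every row
val result = withGroup
  .withColumn("count", count(lit(1)).over(Window.partitionBy("grp")))
  .orderBy("pos")
  .select("words", "numbers", "count")

result.show()
```

As with the SQL version, the unpartitioned ordering window moves all rows into a single partition, so this is only suitable for small data.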

How to filter out a null value from a Spark DataFrame

I created a dataframe in spark with the following schema:
root
|-- user_id: long (nullable = false)
|-- event_id: long (nullable = false)
|-- invited: integer (nullable = false)
|-- day_diff: long (nullable = true)
|-- interested: integer (nullable = false)
|-- event_owner: long (nullable = false)
|-- friend_id: long (nullable = false)
And the data is shown below:
+----------+----------+-------+--------+----------+-----------+---------+
| user_id| event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
| 4236494| 110357109| 0| -1| 0| 937597069| null|
| 78065188| 498404626| 0| 0| 0| 2904922087| null|
| 282487230|2520855981| 0| 28| 0| 3749735525| null|
| 335269852|1641491432| 0| 2| 0| 1490350911| null|
| 437050836|1238456614| 0| 2| 0| 991277599| null|
| 447244169|2095085551| 0| -1| 0| 1579858878| null|
| 516353916|1076364848| 0| 3| 1| 3597645735| null|
| 528218683|1151525474| 0| 1| 0| 3433080956| null|
| 531967718|3632072502| 0| 1| 0| 3863085861| null|
| 627948360|2823119321| 0| 0| 0| 4092665803| null|
| 811791433|3513954032| 0| 2| 0| 415464198| null|
| 830686203| 99027353| 0| 0| 0| 3549822604| null|
|1008893291|1115453150| 0| 2| 0| 2245155244| null|
|1239364869|2824096896| 0| 2| 1| 2579294650| null|
|1287950172|1076364848| 0| 0| 0| 3597645735| null|
|1345896548|2658555390| 0| 1| 0| 2025118823| null|
|1354205322|2564682277| 0| 3| 0| 2563033185| null|
|1408344828|1255629030| 0| -1| 1| 804901063| null|
|1452633375|1334001859| 0| 4| 0| 1488588320| null|
|1625052108|3297535757| 0| 3| 0| 1972598895| null|
+----------+----------+-------+--------+----------+-----------+---------+
I want to filter out the rows that have null values in the "friend_id" field.
scala> val aaa = test.filter("friend_id is null")
scala> aaa.count
I got res52: Long = 0, which is obviously not right. What is the right way to do it?
One more question: I want to replace the values in the friend_id field, with 0 for null and 1 for any other value. The code I came up with is:
val aaa = train_friend_join.select($"user_id", $"event_id", $"invited", $"day_diff", $"interested", $"event_owner", ($"friend_id" != null)?1:0)
This code doesn't work either. Can anyone tell me how I can fix it? Thanks.

Let's say you have this data setup (so that results are reproducible):
// declaring data types
case class Company(cName: String, cId: String, details: String)
case class Employee(name: String, id: String, email: String, company: Company)
// setting up example data
val e1 = Employee("n1", null, "n1#c1.com", Company("c1", "1", "d1"))
val e2 = Employee("n2", "2", "n2#c1.com", Company("c1", "1", "d1"))
val e3 = Employee("n3", "3", "n3#c1.com", Company("c1", "1", "d1"))
val e4 = Employee("n4", "4", "n4#c2.com", Company("c2", "2", "d2"))
val e5 = Employee("n5", null, "n5#c2.com", Company("c2", "2", "d2"))
val e6 = Employee("n6", "6", "n6#c2.com", Company("c2", "2", "d2"))
val e7 = Employee("n7", "7", "n7#c3.com", Company("c3", "3", "d3"))
val e8 = Employee("n8", "8", "n8#c3.com", Company("c3", "3", "d3"))
val employees = Seq(e1, e2, e3, e4, e5, e6, e7, e8)
val df = sc.parallelize(employees).toDF
Data is:
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1#c1.com|[c1,1,d1]|
| n2| 2|n2#c1.com|[c1,1,d1]|
| n3| 3|n3#c1.com|[c1,1,d1]|
| n4| 4|n4#c2.com|[c2,2,d2]|
| n5|null|n5#c2.com|[c2,2,d2]|
| n6| 6|n6#c2.com|[c2,2,d2]|
| n7| 7|n7#c3.com|[c3,3,d3]|
| n8| 8|n8#c3.com|[c3,3,d3]|
+----+----+---------+---------+
Now, to filter employees with null ids, you can do:
df.filter("id is null").show
which correctly shows the following:
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1#c1.com|[c1,1,d1]|
| n5|null|n5#c2.com|[c2,2,d2]|
+----+----+---------+---------+
Coming to the second part of your question, you can replace the null ids with 0 and all other values with 1 like this:
df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show
This results in:
+----+---+---------+---------+
|name| id| email| company|
+----+---+---------+---------+
| n1| 0|n1#c1.com|[c1,1,d1]|
| n2| 1|n2#c1.com|[c1,1,d1]|
| n3| 1|n3#c1.com|[c1,1,d1]|
| n4| 1|n4#c2.com|[c2,2,d2]|
| n5| 0|n5#c2.com|[c2,2,d2]|
| n6| 1|n6#c2.com|[c2,2,d2]|
| n7| 1|n7#c3.com|[c3,3,d3]|
| n8| 1|n8#c3.com|[c3,3,d3]|
+----+---+---------+---------+

Or like df.filter($"friend_id".isNotNull)

df.where(df.col("friend_id").isNull)

There are two ways to do it: build the filter condition 1) manually or 2) dynamically.
Sample DataFrame:
val df = spark.createDataFrame(Seq(
(0, "a1", "b1", "c1", "d1"),
(1, "a2", "b2", "c2", "d2"),
(2, "a3", "b3", null, "d3"),
(3, "a4", null, "c4", "d4"),
(4, null, "b5", "c5", "d5")
)).toDF("id", "col1", "col2", "col3", "col4")
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 0| a1| b1| c1| d1|
| 1| a2| b2| c2| d2|
| 2| a3| b3|null| d3|
| 3| a4|null| c4| d4|
| 4|null| b5| c5| d5|
+---+----+----+----+----+
1) Creating the filter condition manually, i.e. using the DataFrame where or filter function:
df.filter(col("col1").isNotNull && col("col2").isNotNull).show
or
df.where("col1 is not null and col2 is not null").show
Result:
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 0| a1| b1| c1| d1|
| 1| a2| b2| c2| d2|
| 2| a3| b3|null| d3|
+---+----+----+----+----+
2) Creating the filter condition dynamically: this is useful when no column may contain a null value and there is a large number of columns, which is often the case.
Writing the filter condition by hand in such cases wastes a lot of time. The code below builds the condition over all columns dynamically, using map and reduce on the DataFrame's column names:
val filterCond = df.columns.map(x=>col(x).isNotNull).reduce(_ && _)
How filterCond looks:
filterCond: org.apache.spark.sql.Column = (((((id IS NOT NULL) AND (col1 IS NOT NULL)) AND (col2 IS NOT NULL)) AND (col3 IS NOT NULL)) AND (col4 IS NOT NULL))
Filtering:
val filteredDf = df.filter(filterCond)
Result:
+---+----+----+----+----+
| id|col1|col2|col3|col4|
+---+----+----+----+----+
| 0| a1| b1| c1| d1|
| 1| a2| b2| c2| d2|
+---+----+----+----+----+
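For reference, here is the dynamic variant as one self-contained sketch. The session setup and the names `spark` and `filteredDf` are only illustrative; the data is the sample DataFrame from above:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[1]").appName("not-null-filter").getOrCreate()
import spark.implicits._

val df = Seq[(Int, String, String, String, String)](
  (0, "a1", "b1", "c1", "d1"),
  (1, "a2", "b2", "c2", "d2"),
  (2, "a3", "b3", null, "d3"),
  (3, "a4", null, "c4", "d4"),
  (4, null, "b5", "c5", "d5")
).toDF("id", "col1", "col2", "col3", "col4")

// One IS NOT NULL test per column, combined with AND
val filterCond = df.columns.map(c => col(c).isNotNull).reduce(_ && _)

// Keeps only the rows in which every column is non-null
val filteredDf = df.filter(filterCond)
filteredDf.show()
```

To check only some of the columns, apply the same map and reduce to a subset of df.columns.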

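A good solution for me was to drop the rows with any null values:
// requires: import org.apache.spark.api.java.function.FilterFunction;
Dataset<Row> filtered = df.filter((FilterFunction<Row>) row -> !row.anyNull());
If you are interested in the opposite case (only the rows that contain a null), use row.anyNull() without the negation.
(Spark 2.1.0 using Java API)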

The following lines work well:
test.filter("friend_id is not null")

Following the hint from Michael Kopaniov, the following works:
df.where(df("id").isNotNull).show

Here is a solution for Spark in Java. To select the rows containing nulls, given a Dataset named data, you do:
Dataset<Row> containingNulls = data.where(data.col("COLUMN_NAME").isNull());
To keep only the rows without nulls, you do:
Dataset<Row> withoutNulls = data.where(data.col("COLUMN_NAME").isNotNull());
DataFrames often contain String columns that hold empty strings like "" instead of nulls. To filter out such rows as well, we do:
Dataset<Row> withoutNullsAndEmpty = data.where(data.col("COLUMN_NAME").isNotNull().and(data.col("COLUMN_NAME").notEqual("")));

For the first question, the result is correct: you are filtering out nulls, hence the count is zero.
For the second part (replacing the values), you can do something like the following:
val options = Map("path" -> "...\\ex.csv", "header" -> "true")
val dfNull = spark.sqlContext.load("com.databricks.spark.csv", options)
scala> dfNull.show
+----------+----------+-------+--------+----------+-----------+---------+
| user_id| event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
| 4236494| 110357109| 0| -1| 0| 937597069| null|
| 78065188| 498404626| 0| 0| 0| 2904922087| null|
| 282487230|2520855981| 0| 28| 0| 3749735525| null|
| 335269852|1641491432| 0| 2| 0| 1490350911| null|
| 437050836|1238456614| 0| 2| 0| 991277599| null|
| 447244169|2095085551| 0| -1| 0| 1579858878| a|
| 516353916|1076364848| 0| 3| 1| 3597645735| b|
| 528218683|1151525474| 0| 1| 0| 3433080956| c|
| 531967718|3632072502| 0| 1| 0| 3863085861| null|
| 627948360|2823119321| 0| 0| 0| 4092665803| null|
| 811791433|3513954032| 0| 2| 0| 415464198| null|
| 830686203| 99027353| 0| 0| 0| 3549822604| null|
|1008893291|1115453150| 0| 2| 0| 2245155244| null|
|1239364869|2824096896| 0| 2| 1| 2579294650| d|
|1287950172|1076364848| 0| 0| 0| 3597645735| null|
|1345896548|2658555390| 0| 1| 0| 2025118823| null|
|1354205322|2564682277| 0| 3| 0| 2563033185| null|
|1408344828|1255629030| 0| -1| 1| 804901063| null|
|1452633375|1334001859| 0| 4| 0| 1488588320| null|
|1625052108|3297535757| 0| 3| 0| 1972598895| null|
+----------+----------+-------+--------+----------+-----------+---------+
dfNull.withColumn("friend_idTmp", when($"friend_id".isNull, "0").otherwise("1")).drop($"friend_id").withColumnRenamed("friend_idTmp", "friend_id").show
+----------+----------+-------+--------+----------+-----------+---------+
| user_id| event_id|invited|day_diff|interested|event_owner|friend_id|
+----------+----------+-------+--------+----------+-----------+---------+
| 4236494| 110357109| 0| -1| 0| 937597069| 0|
| 78065188| 498404626| 0| 0| 0| 2904922087| 0|
| 282487230|2520855981| 0| 28| 0| 3749735525| 0|
| 335269852|1641491432| 0| 2| 0| 1490350911| 0|
| 437050836|1238456614| 0| 2| 0| 991277599| 0|
| 447244169|2095085551| 0| -1| 0| 1579858878| 1|
| 516353916|1076364848| 0| 3| 1| 3597645735| 1|
| 528218683|1151525474| 0| 1| 0| 3433080956| 1|
| 531967718|3632072502| 0| 1| 0| 3863085861| 0|
| 627948360|2823119321| 0| 0| 0| 4092665803| 0|
| 811791433|3513954032| 0| 2| 0| 415464198| 0|
| 830686203| 99027353| 0| 0| 0| 3549822604| 0|
|1008893291|1115453150| 0| 2| 0| 2245155244| 0|
|1239364869|2824096896| 0| 2| 1| 2579294650| 1|
|1287950172|1076364848| 0| 0| 0| 3597645735| 0|
|1345896548|2658555390| 0| 1| 0| 2025118823| 0|
|1354205322|2564682277| 0| 3| 0| 2563033185| 0|
|1408344828|1255629030| 0| -1| 1| 804901063| 0|
|1452633375|1334001859| 0| 4| 0| 1488588320| 0|
|1625052108|3297535757| 0| 3| 0| 1972598895| 0|
+----------+----------+-------+--------+----------+-----------+---------+

val df = Seq(
("1001", "1007"),
("1002", null),
("1003", "1005"),
(null, "1006")
).toDF("user_id", "friend_id")
Data is:
+-------+---------+
|user_id|friend_id|
+-------+---------+
| 1001| 1007|
| 1002| null|
| 1003| 1005|
| null| 1006|
+-------+---------+
Drop the rows that contain a null or NaN value in any of the columns listed in the Seq:
df.na.drop(Seq("friend_id"))
.show()
Output:
+-------+---------+
|user_id|friend_id|
+-------+---------+
| 1001| 1007|
| 1003| 1005|
| null| 1006|
+-------+---------+
If you do not specify columns, a row is dropped as soon as any of its columns contains a null or NaN value:
df.na.drop()
.show()
Output:
+-------+---------+
|user_id|friend_id|
+-------+---------+
| 1001| 1007|
| 1003| 1005|
+-------+---------+

Another easy way to filter out null values across multiple columns in a Spark DataFrame. Note that the null checks are AND-connected: a row is dropped only when all of the listed columns are null.
df.filter(" COALESCE(col1, col2, col3, col4, col5, col6) IS NOT NULL")
If you need to filter out rows that contain a null in any column (OR-connected), use:
df.na.drop()
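The difference between the two is easiest to see on a tiny example. A minimal sketch with made-up data and only two columns (the DataFrame and the variable names are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("coalesce-vs-drop").getOrCreate()
import spark.implicits._

// Illustrative data: no nulls, one null, all nulls
val df = Seq[(String, String)](
  ("a", "b"),
  (null, "b"),
  (null, null)
).toDF("col1", "col2")

// Keeps a row if at least one listed column is non-null (drops only the all-null row)
val atLeastOneValue = df.filter("COALESCE(col1, col2) IS NOT NULL")

// Keeps a row only if no column is null
val noNulls = df.na.drop()

atLeastOneValue.show()
noNulls.show()
```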

I used the following code to solve my problem. It works, but it goes a country mile out of the way to get there. Is there a shortcut? Thanks.
def filter_null(field : Any) : Int = field match {
case null => 0
case _ => 1
}
val test = train_event_join.join(
user_friends_pair,
train_event_join("user_id") === user_friends_pair("user_id") &&
train_event_join("event_owner") === user_friends_pair("friend_id"),
"left"
).select(
train_event_join("user_id"),
train_event_join("event_id"),
train_event_join("invited"),
train_event_join("day_diff"),
train_event_join("interested"),
train_event_join("event_owner"),
user_friends_pair("friend_id")
).rdd.map{
line => (
line(0).toString.toLong,
line(1).toString.toLong,
line(2).toString.toLong,
line(3).toString.toLong,
line(4).toString.toLong,
line(5).toString.toLong,
filter_null(line(6))
)
}.toDF("user_id", "event_id", "invited", "day_diff", "interested", "event_owner", "creator_is_friend")
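One possible shortcut is to stay in the DataFrame API and compute the flag with when/isNotNull directly in the select, without going through the RDD. A minimal sketch with made-up data and fewer columns than the real tables (the rows and the camelCase variable names are illustrative only):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

val spark = SparkSession.builder().master("local[1]").appName("friend-flag").getOrCreate()
import spark.implicits._

// Illustrative stand-ins for train_event_join and user_friends_pair
val trainEventJoin = Seq((1L, 10L, 100L), (2L, 20L, 200L))
  .toDF("user_id", "event_id", "event_owner")
val userFriendsPair = Seq((1L, 100L)).toDF("user_id", "friend_id")

val test = trainEventJoin.join(
  userFriendsPair,
  trainEventJoin("user_id") === userFriendsPair("user_id") &&
    trainEventJoin("event_owner") === userFriendsPair("friend_id"),
  "left"
).select(
  trainEventJoin("user_id"),
  trainEventJoin("event_id"),
  trainEventJoin("event_owner"),
  // 1 when the left join found a matching friend row, 0 when friend_id is null
  when(userFriendsPair("friend_id").isNotNull, 1).otherwise(0).as("creator_is_friend")
)

test.orderBy("user_id").show()
```

This keeps the original column types and avoids the toString.toLong round trip.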