PySpark dataframe how to use flatmap - pyspark

I am writing a PySpark program that is comparing two tables, let's say Table1 and Table2
Both tables have an identical structure, but may contain different data.
Let's say Table 1 has the below cols:
key1, key2, col1, col2, col3
The sample data in table 1 is as follows
"a", 1, "x1", "y1", "z1"
"a", 2, "x2", "y2", "z2"
"a", 3, "x3", "y3", "z3"
Similarly, Table 2 has the below cols:
key1, key2, col1, col2, col3
The sample data in table 2 is as follows
"a", 1, "x1", "y1", "z1"
"a", 2, "x21", "y21", "z2"
"a", 3, "x3", "y3", "z31"
The program creates a data frame (let's say df1) that contains below columns
Key1, Key2, a.Col1, a.Col2, a.Col3, b.Col1, b.Col2, b.Col3, column_names
example data:
"a", 2, "x2", "y2", "z2", "x21", "y21", "z2", "col1,col2"
"a", 3, "x3", "y3", "z3", "x3", "y3", "z31", "col3"
The column "column_names" contains columns that have different values between table1 and table2
Using this data frame, I need to create another data frame that contains below structure
key1, key2, field_in_difference, src_value, tgt_value
"a", 2, "col1", "x2", "x21"
"a", 2, "col2", "y2", "y21"
"a", 3, "col3", "z3", "z31"
I am thinking that I need to use flatMap in PySpark.
Can I use flatMap on one of the columns in the dataframe, so that multiple rows are created in the resulting dataframe, while the remaining columns get copied into each new row?
I tried to use the following, but it does not seem to be correct syntax
df2 = df1.withColumn("newcolumn", func.concat_ws(",", flatMap(lambda x: x.split(','))))
But I get an error: NameError: name 'flatMap' is not defined
I am not sure how to specify that the flatMap needs to be done on the column "column_names", while keeping the remaining cols as they are.
I think the approach is to create one row per column in difference as a first step
Then, in the second step, create another df that will transform it into the expected output
I really appreciate the help

flatMap works on RDD, not DataFrame.
I don't quite understand how you want to use flatMap on df1, but I think working directly from Table 1 and Table 2 might be easier. Let's say Table 1 is df_src and Table 2 is df_tgt.
df_src.show()
+----+----+----+----+----+
|key1|key2|col1|col2|col3|
+----+----+----+----+----+
|   a|   1|  x1|  y1|  z1|
|   a|   2|  x2|  y2|  z2|
|   a|   3|  x3|  y3|  z3|
+----+----+----+----+----+
df_tgt.show()
+----+----+----+----+----+
|key1|key2|col1|col2|col3|
+----+----+----+----+----+
|   a|   1|  x1|  y1|  z1|
|   a|   2| x21| y21|  z2|
|   a|   3|  x3|  y3| z31|
+----+----+----+----+----+
You can un-pivot both dataframes using the stack function, join them, and filter the result.
from pyspark.sql.functions import col
# unpivot col1, col2 and col3 of both dataframes. rename key columns as well
df_src = df_src.selectExpr("key1 key1_s", "key2 key2_s", "stack(3, 'col1', col1, 'col2', col2, 'col3', col3) (field_s, src_value)")
df_tgt = df_tgt.selectExpr("key1 key1_t", "key2 key2_t", "stack(3, 'col1', col1, 'col2', col2, 'col3', col3) (field_t, tgt_value)")
# join the dataframes on keys and field, then filter where field values are different
df_res = (df_src
    .join(df_tgt,
          [col('key1_s') == col('key1_t'), col('key2_s') == col('key2_t'), col('field_s') == col('field_t')],
          'inner')
    .filter(col('src_value') != col('tgt_value'))
    .selectExpr('key1_s key1', 'key2_s key2', 'field_s field_in_difference', 'src_value', 'tgt_value')
)
df_res.show()
+----+----+-------------------+---------+---------+
|key1|key2|field_in_difference|src_value|tgt_value|
+----+----+-------------------+---------+---------+
|   a|   2|               col1|       x2|      x21|
|   a|   2|               col2|       y2|      y21|
|   a|   3|               col3|       z3|      z31|
+----+----+-------------------+---------+---------+
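That said, if you really do want to use flatMap as in your question, you could drop down to the RDD API from df1, since flatMap is not available on DataFrames. A rough sketch, assuming the source/target columns in df1 are named a_col1 ... b_col3 (adjust the lookups if your actual names differ):
from pyspark.sql import Row

def expand(row):
    # emit one output row per differing column listed in column_names
    for field in row['column_names'].split(','):
        yield Row(key1=row['key1'],
                  key2=row['key2'],
                  field_in_difference=field,
                  src_value=row['a_' + field],   # assumes source columns are named a_col1, a_col2, a_col3
                  tgt_value=row['b_' + field])   # assumes target columns are named b_col1, b_col2, b_col3

df2 = df1.rdd.flatMap(expand).toDF()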

Related

Spark Dataframe Combine 2 Columns into Single Column, with Additional Identifying Column

I'm trying to split and then combine 2 DataFrame columns into 1, with another column identifying which column it originated from. Here is the code to generate the sample DF
val data = Seq(("1", "in1,in2,in3", null), ("2","in4,in5","ex1,ex2,ex3"), ("3", null, "ex4,ex5"), ("4", null, null))
val df = spark.sparkContext.parallelize(data).toDF("id", "include", "exclude")
This is the sample DF
+---+-----------+-----------+
| id| include| exclude|
+---+-----------+-----------+
| 1|in1,in2,in3| null|
| 2| in4,in5|ex1,ex2,ex3|
| 3| null| ex4,ex5|
| 4| null| null|
+---+-----------+-----------+
which I'm trying to transform into
+---+----+---+
| id|type|col|
+---+----+---+
| 1|incl|in1|
| 1|incl|in2|
| 1|incl|in3|
| 2|incl|in4|
| 2|incl|in5|
| 2|excl|ex1|
| 2|excl|ex2|
| 2|excl|ex3|
| 3|excl|ex4|
| 3|excl|ex5|
+---+----+---+
EDIT: Should mention that the data inside each of the cells in the example DF is just for visualization, and doesn't need to have the form in1,ex1, etc.
I can get it to work with union, as so:
df.select($"id", lit("incl").as("type"), explode(split(col("include"), ",")))
.union(
df.select($"id", lit("excl").as("type"), explode(split(col("exclude"), ",")))
)
but I was wondering if this was possible to do without using union.
The approach that I am thinking of is to club both the include and exclude columns together and then apply the explode function, then fetch only the rows which don't have nulls, and finally use a case statement.
This might be a long process.
-- concat_ws clubs the two columns and skips NULLs
WITH cte AS (SELECT id, concat_ws(',', include, exclude) AS outputcol FROM SQL),
ctes AS (SELECT id, explode(split(outputcol, ',')) AS finalcol FROM cte)
SELECT id,
       CASE WHEN finalcol LIKE 'in%' THEN 'incl' ELSE 'excl' END AS type,
       finalcol
FROM ctes
WHERE finalcol IS NOT NULL AND finalcol != ''
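For what it's worth, here is a rough sketch of the same idea (concat, explode, then classify) using the DataFrame API, shown in PySpark purely for illustration since the question's code is Scala. It assumes a DataFrame df with columns id, include, exclude, and the 'in%' test relies on the prefixes that the EDIT above says are only for visualization:
from pyspark.sql import functions as F

result = (df
    .select("id", F.explode(F.split(F.concat_ws(",", "include", "exclude"), ",")).alias("finalcol"))
    .where(F.col("finalcol") != "")  # concat_ws skips nulls; all-null rows become empty strings
    .withColumn("type", F.when(F.col("finalcol").like("in%"), "incl").otherwise("excl"))
    .select("id", "type", F.col("finalcol").alias("col")))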

How to union DataFrames and add only missing rows?

I have a dataframe df1, which contains below data:
customer_id  product  Val_id  rule_name
1            A        1       rule1
2            B        X       rule1
I have another dataframe df2, which contains below data:
customer_id  product  Val_id  rule_name
1            A        1       rule2
2            B        X       rule2
3            C        y       rule2
The rule_name values in both dataframes are always fixed
I want a new unionized dataframe df3. It should have all customers from dataframe df1 and all other customers from dataframe df2, which are not present in df1. So final df3 should look like:
customer_id  product  Val_id  rule_name
1            A        1       rule1
2            B        X       rule1
3            C        y       rule2
Can anyone please help me out to achieve this outcome. Any help will be appreciated.
Given the following datasets:
val df1 = Seq(
(1, "A", "1", "rule1"),
(2, "B", "X", "rule1")
).toDF("customer_id", "product", "Val_id", "rule_name")
val df2 = Seq(
(1, "A", "1", "rule2"),
(2, "B", "X", "rule2"),
(3, "C", "y", "rule2")
).toDF("customer_id", "product", "Val_id", "rule_name")
And the requirement:
It should have all customers from dataframe df1 and all other customers from dataframe df2, which are not present in df1.
My first solution could be as follows:
val missingCustomers = df2.
join(df1, Seq("customer_id"), "leftanti").
select($"customer_id", df2("product"), df2("Val_id"), df2("rule_name"))
val all = df1.union(missingCustomers)
scala> all.show
+-----------+-------+------+---------+
|customer_id|product|Val_id|rule_name|
+-----------+-------+------+---------+
| 1| A| 1| rule1|
| 2| B| X| rule1|
| 3| C| y| rule2|
+-----------+-------+------+---------+
Another (and perhaps slower) solution could be as follows:
// find missing ids, i.e. ids in df2 that are not in df1
// BE EXTRA CAREFUL: "Downloading" all missing ids to the driver
val missingIds = df2.
select("customer_id").
except(df1.select("customer_id")).
as[Int].
collect
// filter ids in df2 that match missing ids
val missingRows = df2.filter($"customer_id" isin (missingIds: _*))
scala> df1.union(missingRows).show
+-----------+-------+------+---------+
|customer_id|product|Val_id|rule_name|
+-----------+-------+------+---------+
| 1| A| 1| rule1|
| 2| B| X| rule1|
| 3| C| y| rule2|
+-----------+-------+------+---------+

How to group by gender and join by positions per group?

I have tried numerous approaches to turn the following:
Gender, Age, Value
1, 20, 21
2, 23, 22
1, 26, 23
2, 29, 24
into
Male_Age, Male_Value, Female_Age, Female_Value
20, 21, 23, 22
26, 23, 29, 24
What I need to do is group by gender and, instead of using an aggregate like sum, count, or avg, I need to create List[age] and List[value]. This should be possible because I am using a Dataset, which allows functional operations.
If the number of rows for males and females are not the same, the columns should be filled with nulls.
One approach I tried was to make a new dataframe using the columns of other dataframes, like so:
df
.select(male.select("sex").where('sex === 1).col("sex"),
female.select("sex").where('sex === 2).col("sex"))
However, this bizarrely produces output like so:
sex, sex,
1, 1
2, 2
1, 1
2, 2
I can't see how that is possible.
I also tried using pivot, but it forces me to aggregate after the group by:
df.withColumn("sex2", df.col("sex"))
.groupBy("sex")
.pivot("sex2")
.agg(
sum('value').as("mean"),
stddev('value).as("std. dev") )
.show()
|sex|    1.0_mean|1.0_std. dev|   2.0_mean|    2.0_std. dev|
|1.0|0.4926065526|1.8110632697|           |                |
|2.0|            |            |0.951250372|1.75060275400785|
The following code does what I need in Oracle SQL, so it should be possible in Spark SQL too I reckon...
drop table mytable
CREATE TABLE mytable
( gender number(10) NOT NULL,
age number(10) NOT NULL,
value number(10) );
insert into mytable values (1,20,21);
insert into mytable values(2,23,22);
insert into mytable values (1,26,23);
insert into mytable values (2,29,24);
insert into mytable values (1,30,25);
select * from mytable;
SELECT A.VALUE AS MALE,
B.VALUE AS FEMALE
FROM
(select value, rownum RN from mytable where gender = 1) A
FULL OUTER JOIN
(select value, rownum RN from mytable where gender = 2) B
ON A.RN = B.RN
The following should give you the result.
val df = Seq(
(1, 20, 21),
(2, 23, 22),
(1, 26, 23),
(2, 29, 24)
).toDF("Gender", "Age", "Value")
scala> df.show
+------+---+-----+
|Gender|Age|Value|
+------+---+-----+
| 1| 20| 21|
| 2| 23| 22|
| 1| 26| 23|
| 2| 29| 24|
+------+---+-----+
// Gender 1 = Male
// Gender 2 = Female
import org.apache.spark.sql.expressions.Window
val byGender = Window.partitionBy("gender").orderBy("gender")
val males = df
.filter("gender = 1")
.select($"age" as "male_age",
$"value" as "male_value",
row_number() over byGender as "RN")
scala> males.show
+--------+----------+---+
|male_age|male_value| RN|
+--------+----------+---+
| 20| 21| 1|
| 26| 23| 2|
+--------+----------+---+
val females = df
.filter("gender = 2")
.select($"age" as "female_age",
$"value" as "female_value",
row_number() over byGender as "RN")
scala> females.show
+----------+------------+---+
|female_age|female_value| RN|
+----------+------------+---+
| 23| 22| 1|
| 29| 24| 2|
+----------+------------+---+
scala> males.join(females, Seq("RN"), "outer").show
+---+--------+----------+----------+------------+
| RN|male_age|male_value|female_age|female_value|
+---+--------+----------+----------+------------+
| 1| 20| 21| 23| 22|
| 2| 26| 23| 29| 24|
+---+--------+----------+----------+------------+
Given a DataFrame called df with columns gender, age, and value, you can do this:
df.groupBy($"gender")
.agg(collect_list($"age"), collect_list($"value")).rdd.map { row =>
val ages: Seq[Int] = row.getSeq(1)
val values: Seq[Int] = row.getSeq(2)
(row.getInt(0), ages.head, ages.last, values.head, values.last)
}.toDF("gender", "male_age", "female_age", "male_value", "female_value")
This uses the collect_list aggregating function in the very helpful Spark functions library to aggregate the values you want. (As you can see, there is a collect_set as well.)
After that, I don't know of any higher-level DataFrame functions to expand those columnar arrays into individual columns of their own, so I fall back to the lower-level RDD API our ancestors used. I simply expand everything into a Tuple and then turn it back into a DataFrame. The commenters above mention corner cases I have not addressed; using functions like headOption and lastOption might be useful there. But this should be enough to get you moving.

How to filter a dataframe by multiple columns?

I have a problem as below:
I have two dataframes
Dataframe DF1:
ID, Name, age
1 name1 18
2 name2 20
DataFrame DF2:
ID, Name, age
1 name1 18
3 name3 19
I am attempting to filter DF2 to exclude records contained in DF1 (by ID and Name), so that I can get a new DF2 like
ID, Name, age
3 name3 19
and then union these two dataframes to get final result:
ID, Name, age
1 name1 18
2 name2 20
3 name3 19
To do this in T-SQL, I can write a statement like
INSERT INTO DF1
SELECT ID, Name, age FROM DF2 WHERE NOT EXISTS
(SELECT 1 FROM DF1 WHERE DF1.ID = DF2.ID AND DF1.Name = DF2.Name)
But I find that "insert" is not supported in dataframe in sparkSQL.
So my questions are:
How can I filter a dataframe based on multiple column?
How can I union two dataframes together?
I am appreciate for any solution.
UNION followed by DISTINCT
Assuming that the records are unique the simplest way to achieve what you want is to take UNION and follow it by DISTINCT:
val df1 = Seq((1, "name1", 18), (2, "name2", 20)).toDF("ID", "Name", "age")
val df2 = Seq((1, "name1", 18), (3, "name3", 19)).toDF("ID", "Name", "age")
df1.unionAll(df2).distinct.show
// +---+-----+---+
// | ID| Name|age|
// +---+-----+---+
// | 1|name1| 18|
// | 2|name2| 20|
// | 3|name3| 19|
// +---+-----+---+
Characteristics:
has to access df1 only once
shuffles both df1 and df2 independent of the size
EXCEPT followed by UNION
Another approach is to use EXCEPT followed by UNION:
df1.unionAll(df2.except(df1)).show // df2.distinct.except to drop duplicates
// +---+-----+---+
// | ID| Name|age|
// +---+-----+---+
// | 1|name1| 18|
// | 2|name2| 20|
// | 3|name3| 19|
// +---+-----+---+
Properties:
has to access df1 twice
shuffles both frames independent of the size (?)
can be used with three frames (df3.unionAll(df2.except(df1)))
LEFT OUTER JOIN followed by SELECT with filter followed by UNION
Finally if you want only partial match LEFT OUTER JOIN with filter followed by UNION should do the trick:
df2.as("df2")
.join(
df1.select("id", "name").as("df1"),
// join on id and name
$"df1.id" === $"df2.id" && $"df1.name" === $"df2.name",
"leftouter")
// This could be replaced by .na.drop(...)
.where($"df1.id".isNull && $"df1.Name".isNull)
.select($"df2.id", $"df2.name", $"df2.age")
.unionAll(df1)
.show
// +---+-----+---+
// | ID| Name|Age|
// +---+-----+---+
// | 3|name3| 19|
// | 1|name1| 18|
// | 2|name2| 20|
// +---+-----+---+
Properties:
has to access df1 twice
if one of the data frames is small enough to be broadcasted it may not require a shuffle
can be used with three data frames

How to avoid duplicate columns after join?

I have two dataframes with the following columns:
df1.columns
// Array(ts, id, X1, X2)
and
df2.columns
// Array(ts, id, Y1, Y2)
After I do
val df_combined = df1.join(df2, Seq(ts,id))
I end up with the following columns: Array(ts, id, X1, X2, ts, id, Y1, Y2). I would expect the common columns to be dropped. Is there something additional that needs to be done?
The simple answer (from the Databricks FAQ on this matter) is to perform the join where the joined columns are expressed as an array of strings (or one string) instead of a predicate.
Below is an example adapted from the Databricks FAQ but with two join columns in order to answer the original poster's question.
Here is the left dataframe:
val llist = Seq(("bob", "b", "2015-01-13", 4), ("alice", "a", "2015-04-23",10))
val left = llist.toDF("firstname","lastname","date","duration")
left.show()
/*
+---------+--------+----------+--------+
|firstname|lastname| date|duration|
+---------+--------+----------+--------+
| bob| b|2015-01-13| 4|
| alice| a|2015-04-23| 10|
+---------+--------+----------+--------+
*/
Here is the right dataframe:
val right = Seq(("alice", "a", 100),("bob", "b", 23)).toDF("firstname","lastname","upload")
right.show()
/*
+---------+--------+------+
|firstname|lastname|upload|
+---------+--------+------+
| alice| a| 100|
| bob| b| 23|
+---------+--------+------+
*/
Here is an incorrect solution, where the join columns are defined as the predicate left("firstname")===right("firstname") && left("lastname")===right("lastname").
The incorrect result is that the firstname and lastname columns are duplicated in the joined data frame:
left.join(right, left("firstname")===right("firstname") &&
left("lastname")===right("lastname")).show
/*
+---------+--------+----------+--------+---------+--------+------+
|firstname|lastname| date|duration|firstname|lastname|upload|
+---------+--------+----------+--------+---------+--------+------+
| bob| b|2015-01-13| 4| bob| b| 23|
| alice| a|2015-04-23| 10| alice| a| 100|
+---------+--------+----------+--------+---------+--------+------+
*/
The correct solution is to define the join columns as an array of strings Seq("firstname", "lastname"). The output data frame does not have duplicated columns:
left.join(right, Seq("firstname", "lastname")).show
/*
+---------+--------+----------+--------+------+
|firstname|lastname| date|duration|upload|
+---------+--------+----------+--------+------+
| bob| b|2015-01-13| 4| 23|
| alice| a|2015-04-23| 10| 100|
+---------+--------+----------+--------+------+
*/
This is expected behavior. The DataFrame.join method is equivalent to a SQL join like this
SELECT * FROM a JOIN b ON joinExprs
If you want to ignore duplicate columns just drop them or select columns of interest afterwards. If you want to disambiguate you can access these using the parent DataFrames:
val a: DataFrame = ???
val b: DataFrame = ???
val joinExprs: Column = ???
a.join(b, joinExprs).select(a("id"), b("foo"))
// drop equivalent
a.alias("a").join(b.alias("b"), joinExprs).drop(b("id")).drop(a("foo"))
or use aliases:
// As of now, aliases don't work with drop
a.alias("a").join(b.alias("b"), joinExprs).select($"a.id", $"b.foo")
For equi-joins there exists a special shortcut syntax which takes either a sequence of strings:
val usingColumns: Seq[String] = ???
a.join(b, usingColumns)
or a single string
val usingColumn: String = ???
a.join(b, usingColumn)
which keeps only one copy of the columns used in the join condition.
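For reference, the Python API has the same shortcut; a minimal sketch (the DataFrames and column names here are invented for illustration):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
a = spark.createDataFrame([(1, "x")], ["id", "foo"])
b = spark.createDataFrame([(1, "y")], ["id", "bar"])

# joining on a list of column names (or a single name) keeps one copy of "id"
a.join(b, ["id"]).show()
a.join(b, "id").show()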
I have been stuck with this for a while, and only recently I came up with a solution that is quite easy.
Say a is
scala> val a = Seq(("a", 1), ("b", 2)).toDF("key", "vala")
a: org.apache.spark.sql.DataFrame = [key: string, vala: int]
scala> a.show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
and
scala> val b = Seq(("a", 1)).toDF("key", "valb")
b: org.apache.spark.sql.DataFrame = [key: string, valb: int]
scala> b.show
+---+----+
|key|valb|
+---+----+
| a| 1|
+---+----+
and I can do this to select only the value in dataframe a:
scala> a.join(b, a("key") === b("key"), "left").select(a.columns.map(a(_)) : _*).show
+---+----+
|key|vala|
+---+----+
| a| 1|
| b| 2|
+---+----+
You can simply use this
df1.join(df2, Seq("ts","id"),"TYPE-OF-JOIN")
Here TYPE-OF-JOIN can be
left
right
inner
fullouter
For example, I have two dataframes like this:
// df1
word count1
w1 10
w2 15
w3 20
// df2
word count2
w1 100
w2 150
w5 200
If you do fullouter join then the result looks like this
df1.join(df2, Seq("word"),"fullouter").show()
word count1 count2
w1 10 100
w2 15 150
w3 20 null
w5 null 200
try this,
val df_combined = df1.join(df2, df1("ts") === df2("ts") && df1("id") === df2("id")).drop(df2("ts")).drop(df2("id"))
This is normal behavior coming from SQL. What I do for this:
Drop or Rename source columns
Do the join
Drop renamed column if any
Here I am replacing "fullname" column:
Some code in Java:
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data/year=%d/month=%d/day=%d", year, month, day))
.drop("fullname")
.registerTempTable("data_original");
this
.sqlContext
.read()
.parquet(String.format("hdfs:///user/blablacar/data_v2/year=%d/month=%d/day=%d", year, month, day))
.registerTempTable("data_v2");
this
.sqlContext
.sql(etlQuery)
.repartition(1)
.write()
.mode(SaveMode.Overwrite)
.parquet(outputPath);
Where the query is:
SELECT
d.*,
concat_ws('_', product_name, product_module, name) AS fullname
FROM
{table_source} d
LEFT OUTER JOIN
{table_updates} u ON u.id = d.id
This is something you can only do with Spark I believe (dropping a column from a list), and it is very helpful!
Inner join is the default join in Spark. Below is the simple syntax for it.
leftDF.join(rightDF, "Common Col Name")
For other joins you can follow the below syntax
leftDF.join(rightDF, Seq("common columns, comma separated"), "join type")
If the column names are not common then
leftDF.join(rightDF, leftDF.col("x") === rightDF.col("y"), "join type")
Best practice is to make the column names different in both the DFs before joining them, and then drop the renamed column accordingly.
df1.columns = [id, age, income]
df2.columns = [id, age_group]
df1.join(df2, on=df1.id == df2.id, how='inner').write.saveAsTable('table_name')
will return an error for duplicate columns.
Try this instead:
df2_id_renamed = df2.withColumnRenamed('id', 'id_2')
df1.join(df2_id_renamed, on=df1.id == df2_id_renamed.id_2, how='inner').drop('id_2')
If anyone is using spark-SQL and wants to achieve the same thing then you can use USING clause in join query.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
spark.sparkContext.setLogLevel("ERROR")
import spark.implicits._
val df1 = List((1, 4, 3), (5, 2, 4), (7, 4, 5)).toDF("c1", "c2", "C3")
val df2 = List((1, 4, 3), (5, 2, 4), (7, 4, 10)).toDF("c1", "c2", "C4")
df1.createOrReplaceTempView("table1")
df2.createOrReplaceTempView("table2")
spark.sql("select * from table1 inner join table2 using (c1, c2)").show(false)
/*
+---+---+---+---+
|c1 |c2 |C3 |C4 |
+---+---+---+---+
|1 |4 |3 |3 |
|5 |2 |4 |4 |
|7 |4 |5 |10 |
+---+---+---+---+
*/
After I've joined multiple tables together, I run them through a simple function to rename columns in the DF if it encounters duplicates. Alternatively, you could drop these duplicate columns too.
Where Names is a table with columns ['Id', 'Name', 'DateId', 'Description'] and Dates is a table with columns ['Id', 'Date', 'Description'], the columns Id and Description will be duplicated after being joined.
Names = sparkSession.sql("SELECT * FROM Names")
Dates = sparkSession.sql("SELECT * FROM Dates")
NamesAndDates = Names.join(Dates, Names.DateId == Dates.Id, "inner")
NamesAndDates = deDupeDfCols(NamesAndDates, '_')
NamesAndDates.saveAsTable("...", format="parquet", mode="overwrite", path="...")
Where deDupeDfCols is defined as:
def deDupeDfCols(df, separator=''):
    newcols = []
    for col in df.columns:
        if col not in newcols:
            newcols.append(col)
        else:
            for i in range(2, 1000):
                if (col + separator + str(i)) not in newcols:
                    newcols.append(col + separator + str(i))
                    break
    return df.toDF(*newcols)
The resulting data frame will contain columns ['Id', 'Name', 'DateId', 'Description', 'Id2', 'Date', 'Description2'].
Apologies this answer is in Python - I'm not familiar with Scala, but this was the question that came up when I Googled this problem and I'm sure Scala code isn't too different.