Spark Dataframe Combine 2 Columns into Single Column, with Additional Identifying Column - scala

I'm trying to split and then combine 2 DataFrame columns into 1, with another column identifying which column it originated from. Here is the code to generate the sample DF:
import spark.implicits._

val data = Seq(("1", "in1,in2,in3", null), ("2", "in4,in5", "ex1,ex2,ex3"), ("3", null, "ex4,ex5"), ("4", null, null))
val df = spark.sparkContext.parallelize(data).toDF("id", "include", "exclude")
This is the sample DF
+---+-----------+-----------+
| id| include| exclude|
+---+-----------+-----------+
| 1|in1,in2,in3| null|
| 2| in4,in5|ex1,ex2,ex3|
| 3| null| ex4,ex5|
| 4| null| null|
+---+-----------+-----------+
which I'm trying to transform into
+---+----+---+
| id|type|col|
+---+----+---+
| 1|incl|in1|
| 1|incl|in2|
| 1|incl|in3|
| 2|incl|in4|
| 2|incl|in5|
| 2|excl|ex1|
| 2|excl|ex2|
| 2|excl|ex3|
| 3|excl|ex4|
| 3|excl|ex5|
+---+----+---+
EDIT: I should mention that the data inside each of the cells in the example DF is just for illustration, and doesn't need to have the form in1, ex1, etc.
I can get it to work with union, as so:
df.select($"id", lit("incl").as("type"), explode(split(col("include"), ",")))
.union(
df.select($"id", lit("excl").as("type"), explode(split(col("exclude"), ",")))
)
but I was wondering if this was possible to do without using union.

The approach that I am thinking of is to club both the include and exclude columns together, then apply the explode function, then keep only the values that aren't empty, and finally derive the type with a case statement.
This might be a long process (the SQL below assumes df is registered as a temp view called src):
WITH cte AS (
  SELECT id, concat_ws(',', include, exclude) AS outputcol FROM src
),
ctes AS (
  SELECT id, explode(split(outputcol, ',')) AS finalcol FROM cte
)
SELECT id,
       CASE WHEN finalcol LIKE 'in%' THEN 'incl' ELSE 'excl' END AS type,
       finalcol
FROM ctes
WHERE finalcol IS NOT NULL AND finalcol <> ''
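A union-free sketch of the same idea in the DataFrame API (just a sketch using the column names from the question, not necessarily the only way): put each (type, raw values) pair into an array of structs, explode the array once, drop the null sides, then explode the split values.
import org.apache.spark.sql.functions._

df.select($"id", explode(array(
    struct(lit("incl").as("type"), $"include".as("vals")),
    struct(lit("excl").as("type"), $"exclude".as("vals"))
  )).as("t"))
  .where($"t.vals".isNotNull)                          // drop the missing side
  .select($"id", $"t.type".as("type"),
          explode(split($"t.vals", ",")).as("col"))    // one row per value
  .show()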

Related

Ambiguous Column in DataFrame Join - Unable to Alias or Call

Coming to Databricks from a SQL background, I am working with some dataframe samples for basic join transformations, and I am having issues isolating the correct dataframe.column for other transformations after the join.
For DF1, I have three columns: user_id, user_ts, email. For DF2, I have two columns: email, converted.
Below is how I have the logic for the join. This works and returns 5 columns; however, there are two email columns in the schema:
df3 = (df1
.join(df2, df1.email == df2.email, "outer")
)
I am trying to do some basic transformation on the df2 email as part of the dataframe chain, but I receive the error:
"Cannot resolve column name "df2.email" among (user_id, user_ts, email, email, converted)"
df3 = (df1
.join(df2, df1.email == df2.email, "outer")
.na.fill(False,["df2.email"])
)
If I remove the df2 from the fill(), I get the error that the columns are ambiguous.
How can I specify which column I want to transform when it has the same name as another column? In SQL, I would just qualify the column with a table alias, but that doesn't seem to be how PySpark is best used.
Suggestions?
If you want to avoid having both key columns in the join result and get a single combined key column, you can pass a list of key column names as an argument to the join() method.
If you want to retain the key columns from both dataframes, you have to rename one of the columns before doing the transformation; otherwise Spark will throw an ambiguous column error.
df1 = spark.createDataFrame([(1, 'abc#gmail.com'),(2,'def#gmail.com')],["id1", "email"])
df2 = spark.createDataFrame([(1, 'abc#gmail.com'),(2,'ghi#gmail.com')],["id2", "email"])
df1.join(df2,['email'], 'outer').show()
'''
+-------------+----+----+
| email| id1| id2|
+-------------+----+----+
|def#gmail.com| 2|null|
|ghi#gmail.com|null| 2|
|abc#gmail.com| 1| 1|
+-------------+----+----+'''
df1.join(df2,df1['email'] == df2['email'], 'outer').show()
'''
+----+-------------+----+-------------+
| id1| email| id2| email|
+----+-------------+----+-------------+
| 2|def#gmail.com|null| null|
|null| null| 2|ghi#gmail.com|
| 1|abc#gmail.com| 1|abc#gmail.com|
+----+-------------+----+-------------+'''
df1.join(df2,df1['email'] == df2['email'], 'outer') \
.select('id1', 'id2', df1['email'], df2['email'].alias('email2')) \
.na.fill('False','email2').show()
'''
+----+----+-------------+-------------+
| id1| id2| email| email2|
+----+----+-------------+-------------+
| 2|null|def#gmail.com| False|
|null| 2| null|ghi#gmail.com|
| 1| 1|abc#gmail.com|abc#gmail.com|
+----+----+-------------+-------------+ '''

Comparing two Identically structured Dataframes in Spark

val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000)).toDF("id","name","city","credit_score","credit_limit")
val changedDF= Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000)).toDF("id","name","city","creditscore","credit_limit")
So the above two dataframes have the same table structure, and I want to find out the ids for which the values have changed in the other dataframe (changedDF). I tried the except() function in Spark, but it gives me two rows. id is the common column between these two dataframes.
changedDF.except(originalDF).show
+---+------+------+-----------+------------+
| id| name| city|creditscore|credit_limit|
+---+------+------+-----------+------------+
| 4|Joshua|cochin| 612| 85000|
| 2| sunil| noida| 650| 90000|
+---+------+------+-----------+------------+
Whereas I only want the common ids for which there have been any changes, like this:
+---+------+------+-----------+------------+
| id| name| city|creditscore|credit_limit|
+---+------+------+-----------+------------+
| 2| sunil| noida| 650| 90000|
+---+------+------+-----------+------------+
Is there any way to find only the common ids for which the data have changed?
Can anybody tell me an approach I can follow to achieve this?
You can do an inner join of the dataframes; that restricts the result to the common ids.
originalDF.alias("a").join(changedDF.alias("b"), col("a.id") === col("b.id"), "inner")
.select("a.*")
.except(changedDF)
.show
This returns, for the common ids, the originalDF rows whose values have changed:
+---+-----+-----+------------+------------+
| id| name| city|credit_score|credit_limit|
+---+-----+-----+------------+------------+
| 2|sunil|noida| 600| 80000|
+---+-----+-----+------------+------------+
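If you instead want the changed rows with the new values from changedDF (as in the expected output shown in the question), a small variant of the same idea (just a sketch) is to select the right-hand side of the join and subtract originalDF:
// except() resolves columns by position, so the creditscore /
// credit_score name difference does not matter here.
originalDF.alias("a")
  .join(changedDF.alias("b"), col("a.id") === col("b.id"), "inner")
  .select("b.*")
  .except(originalDF)
  .show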

Use PySpark Dataframe column in another spark sql query

I have a situation where I'm trying to query a table and use the result (a dataframe) of that query as the IN clause of another query.
From the first query I have the dataframe below:
+-----------------+
|key |
+-----------------+
| 10000000000004|
| 10000000000003|
| 10000000000008|
| 10000000000009|
| 10000000000007|
| 10000000000006|
| 10000000000010|
| 10000000000002|
+-----------------+
And now I want to run a query like the one below using the values of that dataframe dynamically instead of hard coding the values:
spark.sql("""select country from table1 where key in (10000000000004, 10000000000003, 10000000000008, 10000000000009, 10000000000007, 10000000000006, 10000000000010, 10000000000002)""").show()
I tried the following; however, it didn't work:
df = spark.sql("""select key from table0 """)
a = df.select("key").collect()
spark.sql("""select country from table1 where key in ({0})""".format(a)).show()
Can somebody help me?
You should use an (inner) join between two data frames to get the countries you would like. See my example:
# Create a list of countries with Id's
countries = [('Netherlands', 1), ('France', 2), ('Germany', 3), ('Belgium', 4)]
# Create a list of Ids
numbers = [(1,), (2,)]
# Create two data frames
df_countries = spark.createDataFrame(countries, ['CountryName', 'Id'])
df_numbers = spark.createDataFrame(numbers, ['Id'])
The data frames look like the following:
df_countries:
+-----------+---+
|CountryName| Id|
+-----------+---+
|Netherlands| 1|
| France| 2|
| Germany| 3|
| Belgium| 4|
+-----------+---+
df_numbers:
+---+
| Id|
+---+
| 1|
| 2|
+---+
You can join them as follows:
df_countries.join(df_numbers, on='Id', how='inner').show()
Resulting in:
+---+-----------+
| Id|CountryName|
+---+-----------+
| 1|Netherlands|
| 2| France|
+---+-----------+
Hope that clears things up!

Spark SQL Dataframe API - build filter condition dynamically

I have two Spark dataframes, df1 and df2:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| ramesh| 1212| 29|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
+------+-----+---+-----+
| eName| eNo|age| city|
+------+-----+---+-----+
|aarush|12121| 15|malmo|
|ramesh| 1212| 29|malmo|
+------+-----+---+-----+
I need to get the non-matching records from df1, based on a number of columns which are specified in another file.
For example, the column lookup file is something like below:
df1col,df2col
name,eName
empNo, eNo
Expected output is:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
|shankar|12121| 28|
| suresh| 1111| 30|
| aarush| 0707| 15|
+-------+-----+---+
The question is how to build a where condition dynamically for the above scenario, because the lookup file is configurable and might have 1 to n field pairs.
You can use the except dataframe method. I'm assuming, for simplicity, that the columns to use are in two lists. The order of both lists must be correct: the columns at the same position in each list will be compared, regardless of column name. After except, use a join to get the missing columns back from the first dataframe.
val df1 = Seq(("shankar","12121",28),("ramesh","1212",29),("suresh","1111",30),("aarush","0707",15))
.toDF("name", "empNo", "age")
val df2 = Seq(("aarush", "12121", 15, "malmo"),("ramesh", "1212", 29, "malmo"))
.toDF("eName", "eNo", "age", "city")
val df1Cols = List("name", "empNo")
val df2Cols = List("eName", "eNo")
val tempDf = df1.select(df1Cols.head, df1Cols.tail: _*)
.except(df2.select(df2Cols.head, df2Cols.tail: _*))
val df = df1.join(broadcast(tempDf), df1Cols)
The resulting dataframe will look as wanted:
+-------+-----+---+
| name|empNo|age|
+-------+-----+---+
| aarush| 0707| 15|
| suresh| 1111| 30|
|shankar|12121| 28|
+-------+-----+---+
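If you prefer to build the condition itself dynamically from the lookup pairs (a sketch, assuming the (df1col, df2col) pairs have already been parsed from the lookup file), you can fold the pairs into one join condition and use a left_anti join, which keeps the df1 rows with no match in df2:
// parsed from the lookup file; shown hard-coded here for illustration
val colPairs = Seq(("name", "eName"), ("empNo", "eNo"))
// combine the per-pair equality checks into a single condition
val joinCond = colPairs
  .map { case (c1, c2) => df1(c1) === df2(c2) }
  .reduce(_ && _)
// anti join keeps only the df1 rows that have no matching df2 row
val nonMatching = df1.join(df2, joinCond, "left_anti")
nonMatching.show()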
If you're doing this from a SQL query I would remap the column names in the SQL query itself with something like Changing a SQL column title via query. You could do a simple text replace in the query to normalize them to the df1 or df2 column names.
Once you have that you can diff using something like
How to obtain the difference between two DataFrames?
If you need more columns that wouldn't be used in the diff (e.g. age) you can reselect the data again based on your diff results. This may not be the optimal way of doing it but it would probably work.

How to merge duplicate rows using expressions in Spark Dataframes

How can I merge 2 dataframes, removing duplicates by comparing columns?
I have two dataframes with the same column names:
a.show()
+-----+----------+--------+
| name| date|duration|
+-----+----------+--------+
| bob|2015-01-13| 4|
|alice|2015-04-23| 10|
+-----+----------+--------+
b.show()
+------+----------+--------+
| name| date|duration|
+------+----------+--------+
| bob|2015-01-12| 3|
|alice2|2015-04-13| 10|
+------+----------+--------+
What I am trying to do is merge the 2 dataframes to display only unique rows, applying two conditions:
1. For the same name, the duration will be the sum of the durations.
2. For the same name, the final date will be the latest date.
The final output will be:
final.show()
+------+----------+--------+
|  name|      date|duration|
+------+----------+--------+
|   bob|2015-01-13|       7|
| alice|2015-04-23|      10|
|alice2|2015-04-13|      10|
+------+----------+--------+
I tried the following method:
//Take union of 2 dataframe
val df =a.unionAll(b)
//group and take sum
val grouped =df.groupBy("name").agg($"name",sum("duration"))
//join
val j=df.join(grouped,"name").drop("duration").withColumnRenamed("sum(duration)", "duration")
and I got
+------+----------+--------+
| name| date|duration|
+------+----------+--------+
| bob|2015-01-13| 7|
| alice|2015-04-23| 10|
| bob|2015-01-12| 7|
|alice2|2015-04-23| 10|
+------+----------+--------+
How can I now remove the duplicates by comparing dates?
Will it be possible by running SQL queries after registering it as a table?
I am a beginner in Spark SQL and I feel like my way of approaching this problem is odd. Is there a better way to do this kind of data processing?
You can take max("date") in the groupBy() aggregation as well; there is no need to join grouped with df.
// In 1.3.x, in order for the grouping column "name" to show up,
val grouped = df.groupBy("name").agg($"name",sum("duration"), max("date"))
// In 1.4+, grouping column "name" is included automatically.
val grouped = df.groupBy("name").agg(sum("duration"), max("date"))