Getting the difference of two dataframes in Apache Spark (Scala)

I have two large CSV files that I am reading into dataframes in Spark (Scala).
The first dataframe is something like:
key| col1 | col2 |
-------------------
1 | blue | house |
2 | red | earth |
3 | green| earth |
4 | cyan | home |
The second dataframe is something like:
key| col1 | col2 | col3
-------------------
1 | blue | house | xyz
2 | cyan | earth | xy
3 | green| mars | xy
I want to get the differences, for common keys and common columns (the key acts like a primary key), into a separate dataframe like this:
key| col1         | col2           |
------------------------------------
1  | blue         | house          |
2  | red --> cyan | earth          |
3  | green        | earth --> mars |
Below is my approach so far:
// read the files into dataframes
val src_df = read_df(file1)
val tgt_df = read_df(file2)
// register the dataframes as temp views so they are visible to spark.sql
src_df.createOrReplaceTempView("src_df")
tgt_df.createOrReplaceTempView("tgt_df")
// truncate each dataframe to only the common keys
val src_common = spark.sql(
  """
  select *
  from src_df src
  where src.key IN (
    select tgt.key
    from tgt_df tgt)
  """)
val tgt_common = spark.sql(
  """
  select *
  from tgt_df tgt
  where tgt.key IN (
    select src.key
    from src_df src)
  """)
// merge both the dataframes
val joined_df = src_common.join(tgt_common, src_common("key") === tgt_common("key"), "inner")
I was unsuccessfully trying to do something like this:
joined_df
  .groupBy(key)
  .apply(some_function(?))
I have tried looking at existing solutions posted online, but I couldn't get the desired result.
PS: I'm also hoping the solution will scale to large data.
Thanks

Try the following (it assumes src_df and tgt_df are registered as temp views, as above):
spark.sql(
  """
  select
    s.key,
    if(s.col1 = t.col1, s.col1, s.col1 || ' --> ' || t.col1) as col1,
    if(s.col2 = t.col2, s.col2, s.col2 || ' --> ' || t.col2) as col2
  from src_df s
  inner join tgt_df t on s.key = t.key
  """).show

Related

Execute spark sql query on dataframe column values

I'm trying to get the size of each table in my database.
I first listed all my tables in a dataframe using this command:
df = spark.sql("show tables in db")
And this is my current dataframe:
+---------+
| tabs |
+---------+
|db.tab1 |
|db.tab2 |
|db.tab3 |
|db.tab4 |
|db.tab5 |
+---------+
Then, for each table, I want to get some information such as the row count and the last modification date.
To explain further, what I want to do is something like this (it doesn't work):
df1 = df.withColumn("count", spark.sql('select count(*) from {0}'.format(df.tabs)))
This is the desired result:
+---------+------+
| tabs | count|
+---------+------+
|db.tab1 | 122 |
|db.tab2 | 156 |
|db.tab3 | 235 |
|db.tab4 | 11 |
|db.tab5 | 98 |
+---------+------+
You can try something like the below: get the count for each table, then union the results.
from functools import reduce

# collect the table names as plain strings
tables = [row.tabs for row in df.collect()]
count_dfs = [
    spark.sql(f"select '{table}' as tabs, count(*) as count from {table}")
    for table in tables
]
result_df = reduce(lambda union_dfs, count_df: union_dfs.union(count_df), count_dfs)
result_df.show()

How to get the minimum of three column values in PostgreSQL

The usual function to get the minimum value of a column is min(column), but what I want is the minimum value per row, based on the values of 3 columns. For example, using the following base table:
+------+------+------+
| col1 | col2 | col3 |
+------+------+------+
| 2 | 1 | 3 |
| 10 | 0 | 1 |
| 13 | 12 | 2 |
+------+------+------+
I want to query it as:
+-----------+
| min_value |
+-----------+
| 1 |
| 0 |
| 2 |
+-----------+
I found a solution like the following, but it is for another SQL dialect, not PostgreSQL, and I am not getting it to work in PostgreSQL:
select
(
select min(minCol)
from (values (t.col1), (t.col2), (t.col3)) as minCol(minCol)
) as minCol
from t
I could write something using a CASE statement, but I would like to write a query like the above for PostgreSQL. Is this possible?
You can use least() (and greatest() for the maximum)
select least(col1, col2, col3) as min_value
from the_table

Spark: Iterating Through Dataframe with Operation

I have a dataframe and I want to iterate through every row of it. Some column values have a leading run of three quotation marks, which indicates that they were accidentally chopped off and should all be part of one column. Therefore, I need to loop through all the rows in the dataframe, and if a column has the leading characters, it needs to be joined back onto its proper column.
The following works for a single line and gives the correct result:
val t = df.first.toSeq.toArray.toBuffer
while (t(5).toString.startsWith("\"\"\"")) {
  t(4) = t(4).toString.concat(t(5).toString)
  t.remove(5)
}
However, when I try to go through the whole dataframe it errors out:
df.foreach(z =>
val t = z.toSeq.toArray.toBuffer
while(t(5).toString.startsWith("\"\"\"")){
t(4) = t(4).toString.concat(t(5).toString)
t.remove(5)
}
)
This errors out with this error message: <console>:2: error: illegal start of simple expression.
How do I correct this to make it work correctly? Why is this not correct?
Thanks!
Edit - Example Data (there are other columns in front):
+---+--------+----------+----------+---------+
|id | col4 | col5 | col6 | col7 |
+---+--------+----------+----------+---------+
| 1 | {blah} | service | null | null |
| 2 | { blah | """ blah | """blah} | service |
| 3 | { blah | """blah} | service | null |
+---+--------+----------+----------+---------+
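The syntax error most likely comes from putting a multi-statement function body inside parentheses: after the =>, a val declaration is not a valid simple expression, so foreach needs a brace-delimited body. A sketch of the corrected loop (note that mutating the buffer inside foreach only changes a local copy on the executor; to get the repaired rows back as a dataframe you would need something like a map over df.rdd with a new schema):
df.foreach { z =>
  val t = z.toSeq.toArray.toBuffer
  while (t(5).toString.startsWith("\"\"\"")) {
    t(4) = t(4).toString.concat(t(5).toString)
    t.remove(5)
  }
  // t now holds the stitched-together row values, but only locally on the executor
}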

Make row data into column headers while grouping

I'm trying to group up multiple rows of similar data and convert the differentiating row data into columns on Amazon Redshift. It's easier to explain with an example.
Starting table:
+------+------+------+------+
| Col1 | Col2 | Col3 | Col4 |
+------+------+------+------+
| x    | y    | A    | 123  |
| x    | y    | B    | 456  |
+------+------+------+------+
Desired end result:
+------+------+------+------+
| Col1 | Col2 | A    | B    |
+------+------+------+------+
| x    | y    | 123  | 456  |
+------+------+------+------+
Essentially, I want to group by Col1 and Col2; the entries in Col3 become the new column headers, and the entries in Col4 become the values of those new columns.
Any help super appreciated!
There is no native functionality, but you could do something like:
SELECT
COL1,
COL2,
MAX(CASE WHEN COL3='A' THEN COL4 END) AS A,
MAX(CASE WHEN COL3='B' THEN COL4 END) AS B
FROM table
GROUP BY COL1, COL2
You effectively need to hard-code the column names. It's not possible to automatically define columns based on the data.
This is standard SQL - nothing specific to Amazon Redshift.

Spark/Scala - Select Columns Conditionally From Dataframe

I have two Hive tables, A and B, and their respective dataframes df_a and df_b.
A
+----+----- +-----------+
| id | name | mobile1 |
+----+----- +-----------+
| 1 | Matt | 123456798 |
+----+----- +-----------+
| 2 | John | 123456798 |
+----+----- +-----------+
| 3 | Lena | |
+----+----- +-----------+
B
+----+----- +-----------+
| id | name | mobile2 |
+----+----- +-----------+
| 3 | Lena | 123456798 |
+----+----- +-----------+
I want to perform an operation similar to:
select A.name, nvl(nvl(A.mobile1, B.mobile2), 0) from A left outer join B on A.id = B.id
So far I've come up with:
df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(?)
I can't figure out how to conditionally select either mobile1 or mobile2 or 0 like I did in the Hive query.
Could someone please help me with this? I'm using Spark 1.5.
Use coalesce:
import org.apache.spark.sql.functions._
df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer").select(
coalesce(df_a("mobile1"), df_b("mobile2"), lit(0))
)
It will use mobile1 if it's present; if not, then mobile2; and if mobile2 is not present either, then 0.
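For completeness, here is a sketch of the full select matching the Hive query in the question (same df_a / df_b; the alias mobile is just an illustrative name):
import org.apache.spark.sql.functions._

df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .select(
    df_a("name"),
    // fall back to mobile2, then to 0, when mobile1 is null
    coalesce(df_a("mobile1"), df_b("mobile2"), lit(0)).as("mobile")
  )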
You can also use Spark SQL's nanvl function (keep in mind that nanvl replaces NaN values rather than nulls, so coalesce above is usually the better fit when the columns can be null).
After applying it, the code should look similar to:
df_a.join(df_b, df_a("id") <=> df_b("id"), "left_outer")
  .select(df_a("name"), nanvl(nanvl(df_a("mobile1"), df_b("mobile2")), lit(0)))