Drop function not working after left outer join in pyspark - pyspark

My pyspark version is 2.1.1. I am trying to join two dataframes (left outer) having two columns id and priority. I am creating my dataframes like this:
a = "select 123 as id, 1 as priority"
a_df = spark.sql(a)
b = "select 123 as id, 1 as priority union select 112 as uid, 1 as priority"
b_df = spark.sql(b)
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority)
c_df schema is coming as DataFrame[uid: int, priority: int, uid: int, priority: int]
The drop function is not removing the columns.
But if I try to do:
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(a_df.priority)
Then priority column for a_df gets dropped.
Not sure if there is a version change issue or something else, but it feels very weird that drop function will behave like this.
I know the workaround can be to remove the unwanted columns first, and then do the join. But still not sure why drop function is not working?
Thanks in advance.

Duplicate column names with joins in pyspark lead to unpredictable behavior, and I've read to disambiguate the names before joining. From stackoverflow, Spark Dataframe distinguish columns with duplicated name and Pyspark Join and then column select is showing unexpected output . I'm sorry to say I can't find why pyspark doesn't work as you describe.
But the databricks documentation addresses this problem: https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
From the databricks:
If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This topic and notebook demonstrate how perform a join so that you don’t have duplicated columns.
When you join, instead you can try either using an alias (thats typically what I use), or you can join the columns as an list type or str.
df = left.join(right, ["priority"])

Related

Combine columns from two sources

I have two sources resulting from some transformation in data flow:
I have tried using join, it replicates the data no matter join I select it outputs similar stuff:
I have tried union as well but union either creates null in columns (if done by name) or rows (if done by position)
Shouldnt the join just concat the columns together because the IDs are same in both table.
This is how the desired ouput should look:
I want concat the version column to the first source so that it looks like this:
ID name value version
111 file1 0.1 3
111 file2 0.82 15
111 file3 2.2 2
Both of your source files have only one matching column (ID) and it is not unique.
When you join both sources on the ID column, each row of source1 joins with all the matching rows of source2.
Here, your row1 (111) of source1 joins with all 3 matching rows (111) of source2, hence it results in 9 rows with different version values for each row in source1.
To get only 3 rows as your expected results, you need a unique matching row in each source.
Add window transformation for both sources and get the rowNumber() based on the ID column.
Source1->window1:
Window1 data preview:
Source2->window2:
Window2 data preview:
Add join transformation to join data from window transformations on ID and rank columns.
Join data preview:
Add select transformation to remove the unwanted columns.
Select data preview:
That is expected with a join. For example, when you join tables in SQL, you also supply the target projection as part of the select statement. What you need to do here is add a Select transformation after your Join transformation. In there, you will reduce the projection to just the columns that would like to retain. You'll be able to choose which side (left or right) you would like to keep for the ID column.

How to optimize broadcast join in spark Scala?

I am a new developper at Spark Scala and I want to improve my code by using a broadcast join.
As I understand, a broadcast join can optimise the code if we have a large DataFrame with a small one. It's exactly the case for me. I have a first DF (tab1 in my example) that contains more 3 billions data that I have to join with a second one with only 900 data.
Here is my sql request :
SELECT tab1.id1, regexp_extract(tab2.emp_name, ".*?(\\d+)\\)$", 1) AS city,
topo_2g3g.emp_id AS emp_id, tab1.emp_type
FROM table1 tab1
INNER JOIN table2 tab2
ON (tab1.emp_type = tab2.emp_type AND tab1.start = tab2.code)
And here is my attempt to use a broadcast join :
val tab1 = df1.filter(""" id > 100 """).as("table1")
val tab2 = df2.filter(""" id > 100 """).as("table2")
val result = tab1.join(
broadcast(tab2)
, col("tab1.emp_type") === col("tab2.emp_type") && col("tab1.start") === col("tab2.code")
, "inner")
The problem is that this way is not optimized at all. I mean it contains ALL the columns for the two table, while I don't need all those columns. I just need 3 of them and the last one (with a regex on it), which is not optimal at all. It's like, we generate a very big table first and then we reduce it to a small table. While in SQL, we got directly the small table.
So, after this step :
I have to use withColumn to generate the new column (with the regex)
Apply a filter method to select the 3 colmuns that I. While i got them IMMEDIATELY in sql (with no filter I mean).
Can you help me please to optimize my code and my request ?
Thanks in advance
you select the columns you want before doing the join
df1.select("col1", "col2").filter(""" id > 100 """).as("table1")

Most efficient way to select and process data from a dataframe

I would like to load and process data from a dataframe in Spark using Scala.
The raw SQL Statement looks like this:
INSERT INTO TABLE_1
(
key_attribute,
attribute_1,
attribute_2
)
SELECT
MIN(TABLE_2.key_attribute),
CURRENT_TIMESTAMP as attribute_1,
'Some_String' as attribute_2
FROM TABLE_2
LEFT OUTER JOIN TABLE_1
ON TABLE_2.key_attribute = TABLE_1.key_attribute
WHERE
TABLE_1.key_attribute IS NULL
AND TABLE_2.key_attribute IS NOT NULL
GROUP BY
attribute_1,
attribute_2,
TABLE_2.key_attribute
What I've done so far:
I created a DataFrame from the Select Statement and joined it with the TABLE_2 DataFrame.
val table_1 = spark.sql("Select key_attribute, current_timestamp() as attribute_1, 'Some_String' as attribute_2").toDF();
table_2.join(table_1, Seq("key_attribute"), "left_outer");
Not really much progress because I face to many difficulties:
How do I handle the SELECT with processing data efficiently? Keep everything in seperate DataFrames?
How do I insert the WHERE/GROUP BY clause with attributes from several sources?
Is there any other/better way except Spark SQL?
Few steps in handling are -
First create the dataframe with your raw data
Then save it as temp table.
You can use filter() or "where condition in sparksql" and get the
resultant dataframe
Then as you used - you can make use of jons with datframes. You can
think of dafaframes as a representation of table.
Regarding efficiency, since the processing will be done in parallel, its being taken care. If you want anything more regarding efficiency, please mention it.

SQL with table as becomes ambiguous

Perhaps I'm approaching this all wrong, in which case feel free to point out a better way to solve the overall question, which "How do I use an intermediate table for future queries?"
Let's say I've got tables foo and bar, which join on some baz_id, and I want to use combine this into an intermediate table to be fed into upcoming queries. I know of the WITH .. AS (...) statement, but am running into problems as such:
WITH foobar AS (
SELECT *
FROM foo
INNER JOIN bar ON bar.baz_id = foo.baz_id
)
SELECT
baz_id
-- some other things as well
FROM
foobar
The issue is that (Postgres 9.4) tells me baz_id is ambiguous. I understand this happens because SELECT * includes all the columns in both tables, so baz_id shows up twice; but I'm not sure how to get around it. I was hoping to avoid copying the column names out individually, like
SELECT
foo.var1, foo.var2, foo.var3, ...
bar.other1, bar.other2, bar.other3, ...
FROM foo INNER JOIN bar ...
because there are hundreds of columns in these tables.
Is there some way around this I'm missing, or some altogether different way to approach the question at hand?
WITH foobar AS (
SELECT *
FROM foo
INNER JOIN bar USING(baz_id)
)
SELECT
baz_id
-- some other things as well
FROM
foobar
It leaves only one instance of the baz_id column in the select list.
From the documentation:
The USING clause is a shorthand that allows you to take advantage of the specific situation where both sides of the join use the same name for the joining column(s). It takes a comma-separated list of the shared column names and forms a join condition that includes an equality comparison for each one. For example, joining T1 and T2 with USING (a, b) produces the join condition ON T1.a = T2.a AND T1.b = T2.b.
Furthermore, the output of JOIN USING suppresses redundant columns: there is no need to print both of the matched columns, since they must have equal values. While JOIN ON produces all columns from T1 followed by all columns from T2, JOIN USING produces one output column for each of the listed column pairs (in the listed order), followed by any remaining columns from T1, followed by any remaining columns from T2.

differing column names in self-outer-joins

When writing a self-join in tSQL I can avoid duplicate column names thus:
SELECT FirstEvent.Title AS FirstTitle, SecondEvent.Title AS FirstTitle
FROM ContiguatedEvents AS FirstEvent
LEFT OUTER JOIN ContiguatedEvents AS SecondEvent
ON FirstEvent.logID = SecondEvent.logID
Suppose I want to select all the columns from the self-join, for example into a view. How do I then differentiate the column names without writing each one out in the join statement. I.e. is there anything I can write like this (ish)
SELECT FirstEvent.* AS ???, SecondEvent.* AS ???
FROM ContiguatedEvents AS FirstEvent
LEFT OUTER JOIN ContiguatedEvents AS SecondEvent
ON FirstEvent.logID = SecondEvent.logID
There's no way to automatically introduce aliases for multiple columns, you just have to do it by hand.
One handy hint for quickly getting all of the column names into your query (in management studio) is to drag the Columns folder from the Object Explorer into a query window. It gives you all of the column names.