Pyspark union of two dataframes

I want to take the union of two PySpark dataframes. They have the same columns, but the column order is different.
I tried this:
joined_df = A_df.unionAll(B_df)
But the union is done by column position, so the results get intermixed. Is there a way to do the union based on column names rather than column order? Thanks in advance.

Just reorder the columns in B_df so that it has the same column order as A_df before the union:
A_df.unionAll(B_df.select(*A_df.columns))

Combine columns from two sources

I have two sources resulting from some transformations in a data flow:
I have tried using a join, but it replicates the data; no matter which join type I select, it outputs similar stuff:
I have tried union as well, but union either creates nulls in the columns (if done by name) or in the rows (if done by position).
Shouldn't the join just concatenate the columns together, since the IDs are the same in both tables?
This is how the desired output should look:
I want to concatenate the version column onto the first source so that it looks like this:
ID name value version
111 file1 0.1 3
111 file2 0.82 15
111 file3 2.2 2
Both of your source files have only one matching column (ID) and it is not unique.
When you join both sources on the ID column, each row of source1 joins with all the matching rows of source2.
Here, each row (111) of source1 joins with all 3 matching rows (111) of source2, hence the result has 9 rows, with the different version values repeated for each row of source1.
To get only the 3 rows of your expected result, you need a unique matching row in each source.
Add a window transformation to both sources (Source1 -> window1, Source2 -> window2) and compute rowNumber() based on the ID column.
Add a join transformation to join the data from the two window transformations on the ID and rank columns.
Add a select transformation to remove the unwanted columns.
That is expected with a join. For example, when you join tables in SQL, you also supply the target projection as part of the select statement. What you need to do here is add a Select transformation after your Join transformation. In there, you reduce the projection to just the columns that you would like to retain. You'll be able to choose which side (left or right) you would like to keep for the ID column.
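For reference, here is a rough Spark (Scala) sketch of the same window-then-join idea; this is not Data Flow syntax, and the input names source1/source2 as well as the ordering columns inside the windows are assumptions based on the sample data:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Assumed inputs: source1(ID, name, value) and source2(ID, version).
// Give every row a rank within its ID so that each (ID, rank) pair is unique.
// The chosen ordering decides how rows from the two sources get paired up.
val w1 = Window.partitionBy(col("ID")).orderBy(col("name"))
val w2 = Window.partitionBy(col("ID")).orderBy(col("version"))

val window1 = source1.withColumn("rank", row_number().over(w1))
val window2 = source2.withColumn("rank", row_number().over(w2))

// Joining on both ID and rank matches each source1 row with exactly one
// source2 row instead of all three; drop the helper column afterwards.
val result = window1
  .join(window2, Seq("ID", "rank"), "inner")
  .drop("rank")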

Spark SQL: Generate a row id column with auto-incrementing CONSECUTIVE integers

I have a Databricks notebook written in Scala, and I have a dataframe generated like this:
val df = spark.sql("SELECT ColumnName FROM TableName")
I want to add another column RowID that will automatically populate the rows with integers. I don't want to use the row_number() function. I need CONSECUTIVE integers starting from 1. Is there any other way?
I checked this answer but it does not help me to generate consecutive integers. And monotonically_increasing_id is not working for me. Is this function valid in Databricks? Do we need to import some modules?
Thanks!
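One common approach, sketched here purely as an assumption rather than something from this thread, is to go through the RDD API: zipWithIndex assigns consecutive 0-based indices, which can be shifted to start at 1 and appended as the RowID column:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

val df = spark.sql("SELECT ColumnName FROM TableName")

// zipWithIndex produces consecutive indices starting at 0, so add 1.
val indexedRows = df.rdd.zipWithIndex().map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ (idx + 1))
}

// Extend the original schema with the new RowID column.
val schema = StructType(df.schema.fields :+ StructField("RowID", LongType, nullable = false))
val withRowId = spark.createDataFrame(indexedRows, schema)
Note that zipWithIndex needs an extra job to compute partition offsets, so it is more expensive than monotonically_increasing_id, which does work in Databricks but only guarantees increasing, not consecutive, values.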

How to optimize a broadcast join in Spark Scala?

I am a new Spark Scala developer and I want to improve my code by using a broadcast join.
As I understand it, a broadcast join can optimize the code when we join a large DataFrame with a small one. That is exactly my case: I have a first DataFrame (tab1 in my example) that contains more than 3 billion records, which I have to join with a second one containing only 900 records.
Here is my SQL query:
SELECT tab1.id1, regexp_extract(tab2.emp_name, ".*?(\\d+)\\)$", 1) AS city,
topo_2g3g.emp_id AS emp_id, tab1.emp_type
FROM table1 tab1
INNER JOIN table2 tab2
ON (tab1.emp_type = tab2.emp_type AND tab1.start = tab2.code)
And here is my attempt to use a broadcast join:
val tab1 = df1.filter(""" id > 100 """).as("tab1")
val tab2 = df2.filter(""" id > 100 """).as("tab2")
val result = tab1.join(
  broadcast(tab2),
  col("tab1.emp_type") === col("tab2.emp_type") && col("tab1.start") === col("tab2.code"),
  "inner")
The problem is that this way is not optimized at all: the result contains ALL the columns of the two tables, while I don't need all of them. I just need 3 of them plus the last one (with a regex applied to it). It is as if we generate a very big table first and then reduce it to a small one, whereas in SQL we get the small table directly.
So, after this step:
I have to use withColumn to generate the new column (with the regex)
and apply a select to keep only the 3 columns that I need, while in SQL I got them IMMEDIATELY (with no extra step, I mean).
Can you please help me optimize my code and my query?
Thanks in advance
Select only the columns you need (along with your filter) before doing the join:
df1.filter(""" id > 100 """).select("col1", "col2").as("tab1")
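A fuller sketch of that idea, with column names taken from the SQL in the question (adjust them to your real schema), might look like this: filter and project each side down to just the needed columns, compute the regex column on the small side, broadcast it, and keep only the final projection:
import org.apache.spark.sql.functions.{broadcast, col, regexp_extract}

val left = df1
  .filter(""" id > 100 """)
  .select(col("id1"), col("emp_type"), col("start"))
  .as("tab1")

// The small (900-row) side: compute the regex column here, before the join.
// emp_id is assumed to come from this table (topo_2g3g in the SQL).
val right = df2
  .filter(""" id > 100 """)
  .select(
    regexp_extract(col("emp_name"), ".*?(\\d+)\\)$", 1).as("city"),
    col("emp_id"),
    col("emp_type"),
    col("code"))
  .as("tab2")

val result = left
  .join(
    broadcast(right),
    col("tab1.emp_type") === col("tab2.emp_type") && col("tab1.start") === col("tab2.code"),
    "inner")
  .select(col("tab1.id1"), col("tab2.city"), col("tab2.emp_id"), col("tab1.emp_type"))
This way the broadcasted table stays small and the join already produces the same projection as the SQL query.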

Drop function not working after left outer join in pyspark

My pyspark version is 2.1.1. I am trying to join two dataframes (left outer) having two columns id and priority. I am creating my dataframes like this:
a = "select 123 as id, 1 as priority"
a_df = spark.sql(a)
b = "select 123 as id, 1 as priority union select 112 as uid, 1 as priority"
b_df = spark.sql(b)
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(b_df.priority)
The c_df schema comes out as DataFrame[id: int, priority: int, id: int, priority: int]
The drop function is not removing the column.
But if I try to do:
c_df = a_df.join(b_df, (a_df.id==b_df.id), 'left').drop(a_df.priority)
Then the priority column from a_df gets dropped.
Not sure if there is a version issue or something else, but it feels very weird that the drop function behaves like this.
I know the workaround can be to remove the unwanted columns first and then do the join, but I am still not sure why the drop function is not working.
Thanks in advance.
Duplicate column names with joins in PySpark lead to unpredictable behavior, and I've read that you should disambiguate the names before joining. See, from Stack Overflow, Spark Dataframe distinguish columns with duplicated name and Pyspark Join and then column select is showing unexpected output. I'm sorry to say I can't find why PySpark doesn't work as you describe.
But the databricks documentation addresses this problem: https://docs.databricks.com/spark/latest/faq/join-two-dataframes-duplicated-column.html
From the databricks:
If you perform a join in Spark and don’t specify your join correctly you’ll end up with duplicate column names. This makes it harder to select those columns. This topic and notebook demonstrate how to perform a join so that you don’t have duplicated columns.
When you join, you can instead try either using an alias (that's typically what I use), or passing the join columns as a list or a str:
df = left.join(right, ["priority"])

How to groupBy using multiple columns in Scala collections

records.groupBy(_.column1)
What if I want to include more columns like if I want to group by column1, column2 and column3?
Any hints?
Try
records.groupBy(record => (record.column1, record.column2, record.column3))
This will group by a tuple composed of those 3 columns.
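For example, with a hypothetical Record case class (the field names here are just for illustration), the result is a Map keyed by the 3-tuple:
// Hypothetical record type, only to make the example self-contained.
case class Record(column1: String, column2: String, column3: Int, value: Double)

val records = List(
  Record("a", "x", 1, 10.0),
  Record("a", "x", 1, 2.5),
  Record("b", "y", 2, 7.0))

// Keys are (column1, column2, column3) tuples; values are the matching records.
val grouped: Map[(String, String, Int), List[Record]] =
  records.groupBy(r => (r.column1, r.column2, r.column3))

// grouped(("a", "x", 1)) holds the first two records; grouped(("b", "y", 2)) holds the third.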