PySpark: How to concatenate two dataframes without duplicate rows? - pyspark

I'd like to concatenate two dataframes A and B into a new one without duplicate rows (if a row in B already exists in A, don't add it):
Dataframe A:
A B
0 1 2
1 3 1
Dataframe B:
A B
0 5 6
1 3 1
I wish to merge them such that the final DataFrame is of the following shape:
Final Dataframe:
A B
0 1 2
1 3 1
2 5 6
How can I do this?

pyspark.sql.DataFrame.union and pyspark.sql.DataFrame.unionAll seem to yield the same result with duplicates.
Instead, you can get the desired output by using direct SQL:
dfA.createTempView('dataframea')
dfB.createTempView('dataframeb')
aunionb = spark.sql('select * from dataframea union select * from dataframeb')
Using SQL produces the expected/correct result.

In order to remove any duplicate rows, just use union() followed by distinct(). As mentioned in the documentation:
http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
"union(other)
Return a new DataFrame containing union of rows in this frame and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct."

You just have to drop the duplicates after the union:
df = dfA.union(dfB).dropDuplicates()
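Putting the pieces together, here is a minimal, self-contained sketch (assuming an active SparkSession named spark) that reproduces the expected output from the example above:
dfA = spark.createDataFrame([(1, 2), (3, 1)], ["A", "B"])
dfB = spark.createDataFrame([(5, 6), (3, 1)], ["A", "B"])
# union() keeps duplicates (UNION ALL semantics); distinct() or dropDuplicates()
# removes them, which is equivalent to a SQL-style UNION.
result = dfA.union(dfB).distinct()
result.show()  # rows (1, 2), (3, 1) and (5, 6), in some order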

Related

Does Spark support the below cascaded query?

I have a requirement to run some queries against some tables in a PostgreSQL database to populate a dataframe. The tables are as follows.
table 1 has the below data.
QueryID, WhereClauseID, Enabled
1 1 true
2 2 true
3 3 true
...
table 2 has the below data.
WhereClauseID, WhereClauseString
1 a>b
2 a>c
3 a>b && a<c
...
table 3 has the below data.
a, b, c, value
30, 20, 30, 100
20, 10, 40, 200
...
I want to query in the following way: from table 1, pick up the rows where Enabled is true. Based on the WhereClauseID in each of those rows, pick up the corresponding rows in table 2. Then use the WhereClauseString picked up from table 2 as the where clause of a query against table 3 to get the value. In the end, I want all records in table 3 that meet any of the where clauses enabled in table 1.
I know I can go through table 1 row by row and use the parameterized string to build a SQL query against table 3. But querying row by row is very inefficient, especially if table 1 is big. Is there a better way to organize the query to improve the efficiency? Thanks a lot!
Depending on your use case, for PySpark you might be able to solve this using when expressions.
Here is a suggestion.
import pyspark.sql.functions as F

tbl1 = spark.table("table1")
tbl3 = spark.table("table3")

tbl3 = (
    tbl3
    .withColumn(
        "WhereClauseID",
        # You can do some fancy parsing of your table 2 here if you want this
        # to be built programmatically from its WhereClauseString values.
        F.when(F.col("a") > F.col("b"), 1)
         .when(F.col("a") > F.col("c"), 2)
         .otherwise(-1)
    )
)
tbl1_with_tbl_3 = tbl1.join(tbl3, "WhereClauseID", "left")
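Since the question only wants the where clauses that are enabled in table 1, one option (a sketch using the column names from the question, not part of the original answer) is to filter tbl1 before joining:
# Keep only the enabled where clauses, then keep the table 3 rows that match one of them.
tbl1_enabled = tbl1.filter(F.col("Enabled"))
result = tbl1_enabled.join(tbl3, "WhereClauseID", "inner").select("a", "b", "c", "value")
# Add .dropDuplicates() if a table 3 row can satisfy more than one enabled clause.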

select only those columns from a table that have non-null values in q/kdb

I have a table:
q)t:([] a:1 2 3; b:```; c:`a`b`c)
a b c
-----
1 a
2 b
3 c
From this table I want to select only the columns that have non-null values; in this case column b should be omitted from the output (something similar to the dropna method in pandas).
expected output
a c
---
1 a
2 b
3 c
I tried many things like
select from t where not null cols
but to no avail.
Here is a simple solution that does just what you want:
q)where[all null t]_t
a c
---
1 a
2 b
3 c
all null t gives a dictionary that indicates whether each column's values are all null:
q)all null t
a| 0
b| 1
c| 0
where returns the keys of the dictionary whose value is true:
q)where[all null t]
,`b
Finally, you use _ to drop those columns from table t.
Hopefully this helps
A modification of Sander's solution which handles string columns (or any nested columns):
q)t:([] a:1 2 3; b:```; c:`a`b`c;d:" ";e:("";"";"");f:(();();());g:(1 1;2 2;3 3))
q)t
a b c d e f g
----------------
1 a "" 1 1
2 b "" 2 2
3 c "" 3 3
q)where[{$[type x;all null x;all 0=count each x]}each flip t]_t
a c g
-------
1 a 1 1
2 b 2 2
3 c 3 3
The nature of kdb is column based, meaning that where clauses function on the rows of a given column.
To make a QSQL query produce your desired behaviour, you would need to first examine all your columns, determine which are all null, and then feed that into a functional statement, which would be horribly inefficient.
Given that you need to fully examine all the columns' data regardless (to check whether all the values are null), the following will achieve that:
q)#[flip;;enlist] k!d k:key[d] where not all each null each value d:flip t
a c
---
1 a
2 b
3 c
Here I'm transforming the table into a dictionary, and extracting its values to determine if any columns consist only of nulls (all each null each). I'm then applying that boolean list to the keys of the dictionary (i.e., the column names) through a where statement. We can then reindex into the original dictionary with those keys and create a subset dictionary of non-null columns and convert that back into a table.
I've generalized the final transformation back into a table, out of habit, with an error catch to ensure that the dictionary will be converted into a table even if only a single row is valid (preventing a 'rank error).

pyspark group by sum

I have a pyspark dataframe with 4 columns.
id/ number / value / x
I want to group by columns id and number, and then add a new column with the sum of value per id and number. I want to keep column x without doing anything to it.
df= df.select("id","number","value","x")
.groupBy( 'id', 'number').withColumn("sum_of_value",df.value.sum())
At the end I want a dataframe with 5 columns: id / number / value / x / sum_of_value.
Can anyone help?
The result you are trying to achieve doesn't make sense. Your output dataframe will only have columns that were grouped by or aggregated (summed in this case). x and value would have multiple values when you group by id and number.
You can have a 3-column output (id, number and sum(value)) like this:
df_summed = df.groupBy('id', 'number').sum('value')
Let's say your DataFrame df initially has 3 columns.
df1 = df.groupBy("id","number").count()
Now df1 will contain the columns id, number, and count.
Now you can join df1 and df based on columns "id" and "number" and select whatever columns you would like to select.
Hope it helps.
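Combining the two answers above, a minimal sketch (column names taken from the question) that keeps value and x and attaches the per-group sum by joining the aggregate back onto the original DataFrame:
import pyspark.sql.functions as F
# Aggregate once per (id, number) pair ...
sums = df.groupBy("id", "number").agg(F.sum("value").alias("sum_of_value"))
# ... then join back so every original row keeps its value and x columns.
df_with_sum = df.join(sums, on=["id", "number"], how="left")
# Resulting columns: id, number, value, x, sum_of_value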

Appending multiple samples of a column into dataframe in spark

I have n values in a Spark column. I want to create a Spark dataframe of k columns (where k is the number of samples) and m rows (where m is the sample size). I tried using withColumn, but it is not working. Joining by creating a unique id would be very inefficient for me.
e.g. the Spark column has the following values:
102
320
11
101
2455
124
I want to create 2 samples of fraction 0.5 as columns in a dataframe.
So the sampled dataframe will be something like:
sample1,sample2
320,101
124,2455
2455,11
Say df has a column UNIQUE_ID_D; I need k samples from this column. Here is sample code for k = 2:
var df1 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_1")
var df2 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_2")
df1.withColumn("NEW_UNIQUE_ID", df2.col("ID_2")).show
This won't work since withColumn cannot access df2's columns.
The only way is to join df1 and df2 by adding a sequence column (join column) to both DataFrames.
That is very inefficient for my use case, since if I want to take 100 samples, I need to join 100 times in a loop for a single column, and I need to perform this operation for all columns in the original df.
How could I achieve this?
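For reference, a minimal PySpark sketch of the sequence-column join described above for k = 2 (the seq column and window are illustrative, not from the question); it also shows why k samples would need k joins:
from pyspark.sql import functions as F, Window

# Generate a consecutive row index so the two samples can be zipped by position.
w = Window.orderBy(F.monotonically_increasing_id())

df1 = (df.select("UNIQUE_ID_D").sample(False, 0.1)
         .withColumnRenamed("UNIQUE_ID_D", "ID_1")
         .withColumn("seq", F.row_number().over(w)))
df2 = (df.select("UNIQUE_ID_D").sample(False, 0.1)
         .withColumnRenamed("UNIQUE_ID_D", "ID_2")
         .withColumn("seq", F.row_number().over(w)))

# One join like this is needed for every additional sample column.
samples = df1.join(df2, "seq").drop("seq")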

Remove from dataframe unique rows

I have the following problem: I need to remove rows from a dataframe whose value in column A is unique.
In the example DF1 below, rows 0 and 3 should be removed:
A B C
0 5 100 5
1 1 200 5
2 1 150 4
3 3 500 5
The one solution I have thought of so far is:
groupby(A)
count rows in each group
keep only the groups whose count > 1
save the result into DF2
DF1.intersect(DF2)
Any other ideas? A solution for RDDs would also help, but a DataFrame solution is preferred.
Thanks!
A more condensed syntax (but following the same approach):
df=sqlContext.createDataFrame([[5,100,5],[1,200,5],[1,150,4],[3,500,5]],['A','B','C'])
df.registerTempTable('df') # Making SQL queries possible
df_t=sqlContext.sql('select A,count(B) from df group by A having count(B)>1') # steps 1 to 4 in one statement
df2=df.join(df_t,df.A==df_t.A,'leftsemi') # only keep records that have a matching key
Some people refer to 'leftsemi' as 'left keep'. It keeps records of the left dataframe (df) if the key also exists in df_t.
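For reference, a sketch of the same leftsemi idea using the DataFrame API instead of a temp table (column names as in the question):
import pyspark.sql.functions as F
# A values that occur more than once, i.e. the rows we want to keep.
keys = df.groupBy('A').agg(F.count('B').alias('cnt')).filter('cnt > 1')
df2 = df.join(keys.select('A'), on='A', how='leftsemi')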