Remove from dataframe unique rows - pyspark

I have the following problem: I need to remove rows from a dataframe that have a unique value in column A.
In the example DF1 below, rows 0 and 3 should be removed:
A B C
0 5 100 5
1 1 200 5
2 1 150 4
3 3 500 5
The only solution I have thought of so far is:
groupby(A)
count rows in each group
filter, keeping only groups with count > 1
save the result into DF2
DF1.intersect(DF2)
Any other ideas? A solution for an RDD would also help, but a DataFrame solution is preferred.
Thanks!

A more condensed syntax (but following the same approach):
df=sqlContext.createDataFrame([[5,100,5],[1,200,5],[1,150,4],[3,500,5]],['A','B','C'])
df.registerTempTable('df') # Making SQL queries possible
df_t=sqlContext.sql('select A,count(B) from df group by A having count(B)>1') # steps 1 to 4 in one statement; keep only the keys that occur more than once
df2=df.join(df_t,df.A==df_t.A,'leftsemi') # only keep records that have a matching key
Some people refer to the 'leftsemi' as 'left keep'. It keeps records of dataframe 1 if the key also exists in df_t.
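For completeness, here is a sketch of the same approach using only the DataFrame API (no temp table), assuming a Spark session and the same df as above:
from pyspark.sql import functions as F
# keys of column A that occur more than once
keys = df.groupBy('A').count().filter(F.col('count') > 1).select('A')
# keep only the rows of df whose A value is among those keys
df2 = df.join(keys, on='A', how='leftsemi')
df2.show()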

Related

Crystal Reports - Where clause in formula to calculate sum of a field from second table

I am trying to get a result in my report which, I believe, requires a where clause; using the Select Expert section did not work for me.
I have 2 tables. Let's call them table 1 and table 2.
Table 1 contains unique records.
Table 2 contains multiple records for the same uniqueKey as table 1.
There are 3 fields in table 2 that play a role for each uniqueKey from table 1:
QTY_ORD
QTY_SHIPPED
ITEM_CANCEL
Let's assume that for item #1 from table 1 there are 5 records in table 2. Each record has values for the 3 above-mentioned fields. I need to display, for all the records where ITEM_CANCEL = 0, the SUM of QTY_SHIPPED - QTY_ORD.
It could be that 3 of the records have ITEM_CANCEL = 1 (we can ignore these records), but for the other 2 records where ITEM_CANCEL = 0, I need the SUM of QTY_SHIPPED - the SUM of QTY_ORD.
The current code I have is as follows:
if {current_order1.ITEM_CANCEL} = 0 then
sum({current_order1.QTY_ORD})-sum({current_order1.QTY_SHIPPED}) else
0
but this result gives me the sum of ALL the records, including the ones where ITEM_CANCEL = 1.
If I use ITEM_CANCEL = 0 in the Select Expert, it removes ALL the results that have no value in table 2. I even tried the code without the SUM function, but that returned the result of only 1 of the records in table 2 where ITEM_CANCEL = 0, not the total difference across the 2 records in table 2 that I require.
Any suggestions on this?
Start with a detail-level formula (no SUM):
if {current_order1.ITEM_CANCEL} = 0 then {current_order1.QTY_ORD} - {current_order1.QTY_SHIPPED} ELSE 0
Then, SUM that formula at whatever Group or Report levels you require.

Remove duplicate rows, regardless of new information - PySpark

Say I have a dataframe like so:
ID Media
1 imgix.com/20830dk
2 imgix.com/202398pwe
3 imgix.com/lvw0923dk
4 imgix.com/082kldcm
4 imgix.com/lks032m
4 imgix.com/903248
I'd like to end up with:
ID Media
1 imgix.com/20830dk
2 imgix.com/202398pwe
3 imgix.com/lvw0923dk
4 imgix.com/082kldcm
Even though that causes me to lose 2 links for ID = 4, I don't care. Is there a simple way to do this in python/pyspark?
Group by col('ID')
Use collect_list with agg to aggregate the Media values into a list
Call getItem(0) to extract the first element from the aggregated list
from pyspark.sql.functions import collect_list
df.groupBy('ID').agg(collect_list('Media').getItem(0).alias('Media')).show()
Anton and pault are correct:
df.drop_duplicates(subset=['ID'])
does indeed work
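For reference, a minimal self-contained sketch of the drop_duplicates approach on the sample data above, assuming an active SparkSession named spark:
df = spark.createDataFrame(
    [(1, 'imgix.com/20830dk'), (2, 'imgix.com/202398pwe'),
     (3, 'imgix.com/lvw0923dk'), (4, 'imgix.com/082kldcm'),
     (4, 'imgix.com/lks032m'), (4, 'imgix.com/903248')],
    ['ID', 'Media'])
# keeps one arbitrary row per ID value
df.drop_duplicates(subset=['ID']).show()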

PySpark: How to concatenate two dataframes without duplicate rows?

I'd like to concatenate two dataframes A and B into a new one without duplicate rows (if a row in B already exists in A, don't add it):
Dataframe A:
A B
0 1 2
1 3 1
Dataframe B:
A B
0 5 6
1 3 1
I wish to merge them such that the final DataFrame is of the following shape:
Final Dataframe:
A B
0 1 2
1 3 1
2 5 6
How can I do this?
pyspark.sql.DataFrame.union and pyspark.sql.DataFrame.unionAll seem to yield the same result with duplicates.
Instead, you can get the desired output by using direct SQL:
dfA.createTempView('dataframea')
dfB.createTempView('dataframeb')
aunionb = spark.sql('select * from dataframea union select * from dataframeb')
Using SQL produces the expected/correct result.
In order to remove any duplicate rows, just use union() followed by a distinct().
As mentioned in the documentation:
http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
"union(other)
Return a new DataFrame containing union of rows in this frame and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct."
You just have to drop the duplicates after the union.
df = dfA.union(dfB).dropDuplicates()
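Putting it together on the example data, a minimal sketch (assuming a SparkSession named spark):
dfA = spark.createDataFrame([(1, 2), (3, 1)], ['A', 'B'])
dfB = spark.createDataFrame([(5, 6), (3, 1)], ['A', 'B'])
# union keeps duplicates (UNION ALL semantics); distinct() removes them
dfA.union(dfB).distinct().show()  # rows (1,2), (3,1), (5,6)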

Limiting number of rows in contentResolver.query to get any group of rows

How can I pick any group of rows if I know the first and last ID?
Suppose the table has IDs from 1 to 10 and I want contentResolver.query to return the rows 4 to 8, how can I do that?
I searched for that question, but all I found were solutions that return the first n consecutive rows:
Answer1 Answer2
Use the LIMIT and OFFSET clauses.
Solution:
SELECT * FROM TABLE LIMIT 5 OFFSET 3;

Divide records into groups - quick solution

I need to divide rows (selected via a subselect) of a PostgreSQL table into groups with an UPDATE command; the groups will be identified by an integer value in one of the table's columns. The groups should all be the same size. The source table contains billions of records.
For example, I need to divide 213 selected rows into groups where every group contains 50 records. The result will be:
rows 1 - 50 => 1
rows 51 - 100 => 2
rows 101 - 150 => 3
rows 151 - 200 => 4
rows 201 - 213 => 5
Doing it with some loop (or with PostgreSQL window functions) is no problem, but I need to do it very efficiently and quickly. I can't derive the group from the id via a sequence because there may be gaps in these ids.
I had the idea of using a random integer generator and setting it as the default value for a column, but that is not usable when I need to control the group size.
The query below should display 213 rows with a group number from 0-4. Just add 1 if you want 1-5:
SELECT i, (row_number() OVER () - 1) / 50 AS grp
FROM generate_series(1001,1213) i
ORDER BY i;
create temporary sequence s minvalue 0 start with 0;
select *, nextval('s') / 50 grp
from t;
drop sequence s;
I think it has the potential to be faster than @Richard's row_number version, but the difference may not be relevant depending on the specifics.