Remove duplicates in spark with 90 percent column match

Remove duplicates in spark with 90 percent column match - pyspark

Compare two rows in a dataframe in Spark and to remove the row if 90 percent of the columns matches(if there are 10 columns and if 9 matches). How to do this?
Name Country City Married Salary
Tony India Delhi Yes 30000
Carol USA Chicago Yes 35000
Shuaib France Paris No 25000
Dimitris Spain Madrid No 28000
Richard Italy Milan Yes 32000
Adam Portugal Lisbon Yes 36000
Tony India Delhi Yes 22000 <--
Carol USA Chicago Yes 21000 <--
Shuaib France Paris No 20000 <--
Have to remove the marked rows since 90 percent that 4 out of 5 column values are matching with already existing rows.How to do this in Pyspark Dataframe.TIA

Related

Get records based on column max value - in PySpark

I have cars table with data
country
car
price
Germany
Mercedes
30000
Germany
BMW
20000
Germany
Opel
15000
Japan
Honda
20000
Japan
Toyota
15000
I need get country, car and price from table, with highest price for each country
country
car
price
Germany
Mercedes
30000
Japan
Honda
20000
I saw similar question but solution there is in SQL, i want DSL format of that for PySpark dataframes (link in case for that: Get records based on column max value)

You need row_number and filter to achieve your result like below
df = spark.createDataFrame(
[
("Germany","Mercedes", 30000),
("Germany","BMW", 20000),
("Germany","Opel", 15000),
("Japan","Honda",20000),
("Japan","Toyota",15000)],
("country","car", "price"))
from pyspark.sql.window import *
from pyspark.sql.functions import row_number, desc
df1 = df.withColumn("row_num", row_number().over(Window.partitionBy("country").orderBy(desc("price"))))
df2 = df1.filter(df1.row_num == 1).drop('row_num')

How can i compare 2 tables in postgresql?

i have a table named hotel with 2 columns named : hotel_name , hotel_price
hotel_name | hotel_price
hotel1 | 5
hotel2 | 20
hotel3 | 100
hotel4 | 50
and another table named city that contains the column : city_name , average_prices
city_name | average_prices
paris | 20
london | 30
rome | 75
madrid | 100
I want to find which hotel has a price that's more expensive than average prices in the cities.For example i want in the end to find something like this:
hotel_name | city_name
hotel3 | paris --hotel3 is more expensive than the average price in paris
hotel3 | london --hotel3 is more expensive than the average price in london etc.
hotel3 | rome
hotel4 | paris
hotel4 | london
(I found the hotels that are more expensive than the average prices of the cities)
Any help would be valuable thank you .

A simple join is all that is needed. Typically tables are joined on a defined relationship (PK/FK) but there is nothing requiring that. See fiddle.
select h.hotel_name, c.city_name
from hotels h
join cities c
on h.hotel_price > c.average_prices;
However, while you can get the desired results, it's pretty meaningless. You cannot tell whether a particular hotel is even in a given city.

How to query with "IN" in Q (kdb)?

Let's assume that I have a table in KBD named "Automotive" with following data:
Manufacturer Country Sales Id
Mercedes United States 002
Mercedes Canada 002
Mercedes Germany 003
Mercedes Switzerland 003
Mercedes Japan 004
BMW United States 002
BMW Canada 002
BMW Germany 003
BMW Switzerland 003
BMW Japan 004
How would I structure a query in Q such that I can fetch the records matching United States and Canada without using an OR clause?
In SQL, it would look something like:
SELECT Manufacturer, Country from Automotive WHERE Country IN ('United States', 'Canada')
Thanks in advance for helping this Q beginner!

It's basically the same in kdb. The way you write you query depends on the data type. See below an example where manufacturer is a symbol, and country is a string.
q)tbl:([]manufacturer:`Merc`Merc`BMW`BMW`BMW;country:("United States";"Canada";"United States";"Germany";"Japan");ID:til 5)
q)
q)tbl
manufacturer country ID
-------------------------------
Merc "United States" 0
Merc "Canada" 1
BMW "United States" 2
BMW "Germany" 3
BMW "Japan" 4
q)meta tbl
c | t f a
------------| -----
manufacturer| s
country | C
ID | j
q)select from tbl where manufacturer in `Merc`Ford
manufacturer country ID
-------------------------------
Merc "United States" 0
Merc "Canada" 1
q)
q)select from tbl where country in ("United States";"Canada")
manufacturer country ID
-------------------------------
Merc "United States" 0
Merc "Canada" 1
BMW "United States" 2
Check out how to use Q-sql here: https://code.kx.com/q4m3/9_Queries_q-sql/

How to get runtime value of column and used in the next row

I have below dataframe, based on the visited date I need to create a new column allowed- If the customar visted within a week from last allowd week visit I have to mark allowed as NO (4th row 2020-01-09-2020-01-10 <7 ) and if it is more than 1 week allowed yes (3rd row 2020-01-09-2020-01-01 >7 )
Input DF
Customar visited_date
John 2020-01-01
John 2020-01-05
John 2020-01-09
John 2020-01-10
John 2020-01-17
output DF
Customar visited_date allowed
John 2020-01-01 Yes
John 2020-01-05 No
John 2020-01-09 Yes
John 2020-01-10 No
John 2020-01-17 Yes
I dont know how to calulate the colum value in runtime and used that in subsequent column calculation.

Select from table removing similar rows - PostgreSQL

There is a table with document revisions and authors. Looks like this:
doc_id rev_id rev_date editor title,content so on....
123 1 2016-01-01 03:20 Bill ......
123 2 2016-01-01 03:40 Bill
123 3 2016-01-01 03:50 Bill
123 4 2016-01-01 04:10 Bill
123 5 2016-01-01 08:40 Alice
123 6 2016-01-01 08:41 Alice
123 7 2016-01-01 09:00 Bill
123 8 2016-01-01 10:40 Cate
942 9 2016-01-01 11:10 Alice
942 10 2016-01-01 11:15 Bill
942 15 2016-01-01 11:17 Bill
I need to find out moments when document was transferred to another editor - only first rows of every edition series.
Like so:
doc_id rev_id rev_date editor title,content so on....
123 1 2016-01-01 03:20 Bill ......
123 5 2016-01-01 08:40 Alice
123 7 2016-01-01 09:00 Bill
123 8 2016-01-01 10:40 Cate
942 9 2016-01-01 11:10 Alice
942 10 2016-01-01 11:15 Bill
If I use DISTINCT ON (doc_id, editor) it resorts a table and I see only one per doc and editor, that is incorrect.
Of course I can dump all and filter with shell tools like awk | sort | uniq. But it is not good for big tables.
Window functions like FIRST_ROW do not give much, because I cannot partition by doc_id, editor not to mess all them.
How to do better?
Thank you.

You can use lag() to get the previous value, and then a simple comparison:
select t.*
from (select t.*,
lag(editor) over (partition by doc_id order by rev_date) as prev_editor
from t
) t
where prev_editor is null or prev_editor <> editor;

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Remove duplicates in spark with 90 percent column match - pyspark

Related

Get records based on column max value - in PySpark

How can i compare 2 tables in postgresql?

How to query with "IN" in Q (kdb)?

How to get runtime value of column and used in the next row

Select from table removing similar rows - PostgreSQL

Categories

Resources