Merge two Hive tables (different column counts) - pyspark

I have one Hive table with the schema Name, Contact, Address, Subject:
Name Contact Address Subject
abc 1111 Mumbai maths
egf 2222 nashik science
pqr 3333 delhi history
And another table with the schema Name, Contact:
Name Contact
xyz 4444
mno 2222
Expected Output
Name Contact Address Subject
abc 1111 Mumbai maths
pqr 3333 delhi history
xyz 4444 null null
mno 2222 nashik science
I have tried a join operation but am not able to get the correct output.

Use full join:
select coalesce(t2.name,t1.name) as name,
coalesce(t2.contact, t1.contact) as contact,
t1.address, t1.subject
from table1 t1
full join table2 t2
on t1.contact=t2.contact
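The full join above can be checked against the sample rows with a small script. This is only a sketch, run on SQLite through Python's sqlite3 module rather than on Hive. Because SQLite supports FULL JOIN natively only from version 3.39, the script emulates it with a LEFT JOIN plus a UNION ALL of the unmatched table2 rows. The in-memory database and the INTEGER type for contact are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
conn.executescript("""
CREATE TABLE table1 (name TEXT, contact INTEGER, address TEXT, subject TEXT);
CREATE TABLE table2 (name TEXT, contact INTEGER);
INSERT INTO table1 VALUES
  ('abc', 1111, 'Mumbai', 'maths'),
  ('egf', 2222, 'nashik', 'science'),
  ('pqr', 3333, 'delhi',  'history');
INSERT INTO table2 VALUES ('xyz', 4444), ('mno', 2222);
""")

# Emulated full join: every table1 row (with the table2 name taking priority
# when the contact matches), plus table2 rows that have no match in table1.
query = """
SELECT COALESCE(t2.name, t1.name)       AS name,
       COALESCE(t2.contact, t1.contact) AS contact,
       t1.address                       AS address,
       t1.subject                       AS subject
FROM table1 t1
LEFT JOIN table2 t2 ON t1.contact = t2.contact
UNION ALL
SELECT t2.name, t2.contact, NULL, NULL
FROM table2 t2
WHERE NOT EXISTS (SELECT 1 FROM table1 t1 WHERE t1.contact = t2.contact)
ORDER BY contact
"""
merged = conn.execute(query).fetchall()
for row in merged:
    print(row)
```

On Hive or Spark SQL, the original FULL JOIN form should be usable directly; the emulation is only needed where the engine lacks it.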

Related

PostgreSQL: Count Number of Occurrences in Columns

BACKGROUND
I have three large tables (employee_info, driver_info, school_info) that I have joined together on common attributes using a series of LEFT OUTER JOIN operations. After each join, the resulting number of records increased slightly, indicating that there are duplicate IDs in the data. To try and find all of the duplicates in the IDs, I dumped the ID columns into a temp table like so:
Original Dump of ID Columns
first_name  last_name  employee_id  driver_id  school_id
Mickey      Mouse      1234         abcd       wxyz
Donald      Duck       2423         heca       qwer
Mary        Poppins    1111         acbe       aaaa
Wiley       Cayote     1234         strf       aaaa
Daffy       Duck       1256         acbe       pqrs
Bugs        Bunny      9999         strf       yxwv
Pink        Panther    2222         zzzz       zzaa
Michael     Archangel  0000         rstu       aaaa
In this overly simplified example, you will see that IDs 1234 (employee_id), strf (driver_id), and aaaa (school_id) are each duplicated at least once. I would like to add a count column for each of the ID columns, and populate them with the count for each ID used, like so:
ID Columns with Counts
first_name  last_name  employee_id  employee_id_count  driver_id  driver_id_count  school_id  school_id_count
Mickey      Mouse      1234         2                  abcd       1                wxyz       1
Donald      Duck       2423         1                  heca       1                qwer       1
Mary        Poppins    1111         1                  acbe       1                aaaa       3
Wiley       Cayote     1234         2                  strf       2                aaaa       3
Daffy       Duck       1256         1                  acbe       1                pqrs       1
Bugs        Bunny      9999         1                  strf       2                yxwv       1
Pink        Panther    2222         1                  zzzz       1                zzaa       1
Michael     Archangel  0000         1                  rstu       1                aaaa       3
You can see that IDs 1234 and strf each have 2 in the count, and aaaa has 3. After generating this table, my goal is to pull out all records where any of the counts are greater than 1, like so:
All Records with One or More Duplicate IDs
first_name  last_name  employee_id  employee_id_count  driver_id  driver_id_count  school_id  school_id_count
Mickey      Mouse      1234         2                  abcd       1                wxyz       1
Mary        Poppins    1111         1                  acbe       1                aaaa       3
Wiley       Cayote     1234         2                  strf       2                aaaa       3
Bugs        Bunny      9999         1                  strf       2                yxwv       1
Michael     Archangel  0000         1                  rstu       1                aaaa       3
Real World Perspective
In my real-world work, the JOIN'd table contains 100 columns, 15 different ID fields and over 30,000 records, and the final table came out 28 records larger than the original. This may seem like a small amount, but each of the 28 represents a broken link that we must fix.
Is there a simple way to get the counts populated like in the second table above? I have been wrestling with this for hours already, and have not been able to make this work. I tried some aggregate functions, but they cannot be used in table UPDATE operations.
The COUNT function, when used as an analytic function, can do what you want here, e.g.
WITH cte AS (
SELECT *,
COUNT(employee_id) OVER (PARTITION BY employee_id) employee_id_count,
COUNT(driver_id) OVER (PARTITION BY driver_id) driver_id_count,
COUNT(school_id) OVER (PARTITION BY school_id) school_id_count
FROM yourTable
)
SELECT *
FROM cte
WHERE
employee_id_count > 1 OR
driver_id_count > 1 OR
school_id_count > 1;
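To see the windowed COUNT in action, here is a minimal sketch that loads the eight sample rows into SQLite through Python's sqlite3 module. It assumes SQLite 3.25 or newer for window function support, and the table name id_dump is made up for the example. Note that in the sample data the driver_id acbe also appears twice (Mary and Daffy), so the query flags Daffy Duck as well, which the hand-written tables in the question do not show.

```python
import sqlite3

sample = [
    ("Mickey", "Mouse", "1234", "abcd", "wxyz"),
    ("Donald", "Duck", "2423", "heca", "qwer"),
    ("Mary", "Poppins", "1111", "acbe", "aaaa"),
    ("Wiley", "Cayote", "1234", "strf", "aaaa"),
    ("Daffy", "Duck", "1256", "acbe", "pqrs"),
    ("Bugs", "Bunny", "9999", "strf", "yxwv"),
    ("Pink", "Panther", "2222", "zzzz", "zzaa"),
    ("Michael", "Archangel", "0000", "rstu", "aaaa"),
]

conn = sqlite3.connect(":memory:")
# id_dump is a hypothetical name for the temp table holding the ID columns.
conn.execute(
    "CREATE TABLE id_dump (first_name TEXT, last_name TEXT, "
    "employee_id TEXT, driver_id TEXT, school_id TEXT)"
)
conn.executemany("INSERT INTO id_dump VALUES (?, ?, ?, ?, ?)", sample)

# Each COUNT(...) OVER (PARTITION BY ...) attaches the group size to every row
# without collapsing the rows, so the outer query can filter on the counts.
query = """
WITH cte AS (
    SELECT first_name,
           COUNT(employee_id) OVER (PARTITION BY employee_id) AS employee_id_count,
           COUNT(driver_id)   OVER (PARTITION BY driver_id)   AS driver_id_count,
           COUNT(school_id)   OVER (PARTITION BY school_id)   AS school_id_count
    FROM id_dump
)
SELECT first_name, employee_id_count, driver_id_count, school_id_count
FROM cte
WHERE employee_id_count > 1
   OR driver_id_count > 1
   OR school_id_count > 1
ORDER BY first_name
"""
flagged = conn.execute(query).fetchall()
for row in flagged:
    print(row)
```

With 15 ID fields, the same pattern extends by adding one COUNT ... OVER line and one OR condition per field.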

Problem Displaying Multiple Items From Same Column In One Row

I have three tables: DailyFieldRecord, AB953, and Lookup. The DailyFieldRecord table contains DailyFieldRecordID. The AB953 table contains DailyFieldRecordID, GroupID, LookupID, and PersonID. The Lookup table contains GroupID, Description, and LookupID. I'm trying to display each person's ethnicity, age, and gender in the same row for each DailyFieldRecordID and PersonID. The problem I'm having is that the descriptions of ethnicity, age, and gender are all in the same column of the Lookup table. I've tried different ways, but am only able to get the correct information for one person. Any input would be helpful.
DailyFieldRecord:
DailyFieldRecordID:
1111
AB953:
DailyFieldRecordID: LookupID: GroupID: PersonID:
1111 1260 300 1
1111 1262 200 1
1111 1264 310 1
1111 1258 300 2
1111 1261 200 2
1111 1265 310 2
Lookup:
GroupID: Description: LookupID:
300 white 1260
300 latin 1258
200 17 1262
200 18 1261
310 male 1264
310 female 1265
Select ab.DailyFieldRecordID, lkp.Description as Ethnicity,
lkp2.Description as Age, lkp3.Description as Gender, ab.PersonID
FROM DailyFieldRecord dfr
LEFT JOIN AB953 ab ON ab.DailyFieldRecordID=dfr.DailyFieldRecordID
and ab.GroupID=300 and ab.PersonID=1
LEFT JOIN AB953 ab2 ON ab2.DailyFieldRecordID=dfr.DailyFieldRecordID
and ab2.GroupID=200 and ab2.PersonID=1
LEFT JOIN AB953 ab3 ON ab3.DailyFieldRecordID=dfr.DailyFieldRecordID
and ab3.GroupID=310 and ab3.PersonID=1
LEFT JOIN Lookup lkp ON lkp.LookupID=ab.LookupID
LEFT JOIN Lookup lkp2 ON lkp2.LookupID=ab2.LookupID
LEFT JOIN Lookup lkp3 ON lkp3.LookupID=ab3.LookupID
Current output:
DailyFieldRecordID: Ethnicity: Age: Gender: PersonID:
1111 white 17 male 1
Expected output:
DailyFieldRecordID: Ethnicity: Age: Gender: PersonID:
1111 white 17 male 1
1111 latin 18 female 2
Although I must say this is a very bad DB design, you are getting only the first person because you are using PersonID = 1 in the query. Please try the query below, which removes PersonID = 1:
Select ab.DailyFieldRecordID
,MAX(CASE WHEN lkp.GroupID = 300 THEN lkp.Description END) as Ethnicity
,MAX(CASE WHEN lkp.GroupID = 200 THEN lkp.Description END) as Age
,MAX(CASE WHEN lkp.GroupID = 310 THEN lkp.Description END) as Gender
,ab.PersonID
FROM DailyFieldRecord dfr
LEFT JOIN AB953 ab ON ab.DailyFieldRecordID=dfr.DailyFieldRecordID
-- join on LookupID so each AB953 row matches exactly one description
LEFT JOIN Lookup lkp ON lkp.LookupID=ab.LookupID
GROUP BY ab.DailyFieldRecordID, ab.PersonID
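The conditional aggregation idea can be tried on the sample rows with the sketch below, which uses SQLite through Python's sqlite3 module. Two things here are assumptions rather than part of the original answer: each CASE expression is closed with END, and Lookup is joined on LookupID (joining on GroupID alone would match every description in the group, so MAX would return the same value for both persons). Column types are also guessed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DailyFieldRecord (DailyFieldRecordID INTEGER);
CREATE TABLE AB953 (DailyFieldRecordID INTEGER, LookupID INTEGER,
                    GroupID INTEGER, PersonID INTEGER);
CREATE TABLE Lookup (GroupID INTEGER, Description TEXT, LookupID INTEGER);
INSERT INTO DailyFieldRecord VALUES (1111);
INSERT INTO AB953 VALUES
  (1111, 1260, 300, 1), (1111, 1262, 200, 1), (1111, 1264, 310, 1),
  (1111, 1258, 300, 2), (1111, 1261, 200, 2), (1111, 1265, 310, 2);
INSERT INTO Lookup VALUES
  (300, 'white', 1260), (300, 'latin', 1258),
  (200, '17', 1262),    (200, '18', 1261),
  (310, 'male', 1264),  (310, 'female', 1265);
""")

# One output row per (record, person); each MAX(CASE ...) picks the single
# description belonging to that group and ignores the NULLs from other groups.
query = """
SELECT ab.DailyFieldRecordID,
       MAX(CASE WHEN lkp.GroupID = 300 THEN lkp.Description END) AS Ethnicity,
       MAX(CASE WHEN lkp.GroupID = 200 THEN lkp.Description END) AS Age,
       MAX(CASE WHEN lkp.GroupID = 310 THEN lkp.Description END) AS Gender,
       ab.PersonID
FROM DailyFieldRecord dfr
LEFT JOIN AB953 ab ON ab.DailyFieldRecordID = dfr.DailyFieldRecordID
LEFT JOIN Lookup lkp ON lkp.LookupID = ab.LookupID
GROUP BY ab.DailyFieldRecordID, ab.PersonID
ORDER BY ab.PersonID
"""
people = conn.execute(query).fetchall()
for row in people:
    print(row)
```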

Autoincrement in query

I need to create a query in which each row's value is 8% higher than the previous row's.
Table (let's name it money) contains one row (and two columns), and it looks like
AMOUNT ID
100.00 AAA
I just need to generate data from this table like this (one select from this table, e.g. 6 iterations):
100.00 AAA
108.00 AAA
116.64 AAA
125.97 AAA
136.04 AAA
146.93 AAA
You can do that with a common table expression.
E.g. if your source looks like this:
db2 "create table money(amount decimal(31,2), id varchar(10))"
db2 "insert into money values (100,'AAA')"
You can create the input data with the following query (I will include counter column for clarity):
db2 "with
cte(c1,c2,counter)
as
(select
amount, id, 1
from
money
union all
select
c1*1.08, c2, counter+1
from
cte
where counter < 10)
select * from cte"
C1 C2 COUNTER
--------------------------------- ---------- -----------
100.00 AAA 1
108.00 AAA 2
116.64 AAA 3
125.97 AAA 4
136.04 AAA 5
146.92 AAA 6
158.67 AAA 7
171.36 AAA 8
185.06 AAA 9
199.86 AAA 10
To populate the existing table without repeating the existing row you use e.g. an insert like this:
$ db2 "insert into money
with
cte(c1,c2,counter)
as
(select
amount*1.08, id, 1
from
money
union all
select
c1*1.08, c2, counter+1
from
cte
where counter < 10) select c1,c2 from cte"
$ db2 "select * from money"
AMOUNT ID
--------------------------------- ----------
100.00 AAA
108.00 AAA
116.64 AAA
125.97 AAA
136.04 AAA
146.93 AAA
158.68 AAA
171.38 AAA
185.09 AAA
199.90 AAA
215.89 AAA
11 record(s) selected.
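The same recursive common table expression can be sketched outside DB2. The example below runs it on SQLite through Python's sqlite3 module, stopping at 6 iterations as in the question. It is an illustration under assumptions: the amount column is stored as a floating-point REAL and only rounded for display, whereas the DB2 runs above keep two decimals in a decimal(31,2) column, so the last cent of later rows can differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE money (amount REAL, id TEXT)")
conn.execute("INSERT INTO money VALUES (100.0, 'AAA')")

# Anchor member: the existing row with counter 1.
# Recursive member: multiply the previous amount by 1.08 until counter hits 6.
query = """
WITH RECURSIVE cte(amount, id, counter) AS (
    SELECT amount, id, 1 FROM money
    UNION ALL
    SELECT amount * 1.08, id, counter + 1
    FROM cte
    WHERE counter < 6
)
SELECT amount, id, counter FROM cte ORDER BY counter
"""
series = conn.execute(query).fetchall()
for amount, row_id, counter in series:
    # Round only when printing, so rounding error does not compound.
    print(f"{amount:.2f} {row_id} {counter}")
```

The WHERE clause in the recursive member is what terminates the recursion, so the iteration count is controlled by changing that one constant.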

How to remove everything after a ',' in a Column in PostgreSQL

I have a table with a column containing an address.
I want to remove everything after the , in the string.
How do I go about doing that in PostgreSQL?
I've tried using REPLACE, but that only works on specific strings, which is a problem because each row in the column would have a different address.
SELECT *
FROM address_book
r_name r_address
xxx 123 XYZ st., City, Zipcode
yyy 333 abc road, City, Zipcode
zzz 222 qwe blvd, City, Zipcode
I need column r_address to only return:
123 XYZ st.
333 abc road
222 qwe blvd
Use the split_part function, like so:
SELECT r_name, split_part(r_address, ',', 1) AS street
FROM address_book
Docs: https://www.postgresql.org/docs/current/functions-string.html
Fiddle: http://sqlfiddle.com/#!17/51afe/1
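To make clear what split_part(r_address, ',', 1) returns, here is a rough Python emulation of its behaviour for positive field positions. The helper below is a hypothetical stand-in written for this illustration, not PostgreSQL code, and it assumes a non-empty delimiter.

```python
def split_part(text: str, delimiter: str, n: int) -> str:
    """Return the n-th (1-based) field of text split on delimiter.

    Mirrors the PostgreSQL function for positive n: a position past the
    last field yields an empty string rather than an error.
    """
    if n < 1:
        raise ValueError("this sketch only handles positive field positions")
    fields = text.split(delimiter)
    return fields[n - 1] if n <= len(fields) else ""


addresses = [
    "123 XYZ st., City, Zipcode",
    "333 abc road, City, Zipcode",
    "222 qwe blvd, City, Zipcode",
]
# Field 1 is everything before the first comma.
streets = [split_part(address, ",", 1) for address in addresses]
for street in streets:
    print(street)
```

A row with no comma at all comes back unchanged, since the whole string is then field 1.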

How to select rows where some column has more than 1 distinct value?

I have to find whether there are rows where a Name has more than one distinct Family.
Note: Name and Family can be duplicate.
ID Name Family
1 ABC XYZ
2 DEF XYZ
3 ABC UVW
4 ABC RST
5 DEF RST
6 GHI UVW
The expected Output should be
Name
ABC
DEF
I think you could do this:
SELECT Name, COUNT(DISTINCT Family)
FROM [table]
GROUP BY Name
HAVING COUNT(DISTINCT Family) > 1
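The GROUP BY / HAVING query can be checked against the six sample rows with a short sketch using SQLite through Python's sqlite3 module. The table name people is an assumption, since the original uses the placeholder [table].

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "people" is a hypothetical table name standing in for [table].
conn.execute("CREATE TABLE people (ID INTEGER, Name TEXT, Family TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [
        (1, "ABC", "XYZ"),
        (2, "DEF", "XYZ"),
        (3, "ABC", "UVW"),
        (4, "ABC", "RST"),
        (5, "DEF", "RST"),
        (6, "GHI", "UVW"),
    ],
)

# COUNT(DISTINCT Family) ignores repeated families within a name, and HAVING
# filters on the aggregate after grouping.
query = """
SELECT Name, COUNT(DISTINCT Family) AS family_count
FROM people
GROUP BY Name
HAVING COUNT(DISTINCT Family) > 1
ORDER BY Name
"""
multi_family = conn.execute(query).fetchall()
for row in multi_family:
    print(row)
```

Dropping the count from the SELECT list leaves only the Name column, matching the expected output in the question.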