Join two PySpark dataframes on a combination of columns, keeping only unique rows - pyspark

I have two PySpark dataframes with multiple columns. I want to join the two dataframes on specific columns from each and need the resultant dataframe to have all rows from df1 plus the new unique rows from df2.
Uniqueness is decided by the combination of the required columns; in the scenario below, the combination used is Col2 through Col7.
df1:
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10
94159 New Store Puma Odd East Blue Button Zero Max
421301 New Store Lexi Even West Red Button Zero Max
226024 New Online Puma Odd East Blue Button Zero Max
560035 Old Store Puma Odd East Black Button Zero Max
df2:
Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10
New Store Puma Odd East Blue Button Zero Max
New Store Lexi Even West Red Button Zero Max
New Stock Puma Odd East Blue Button Zero Max
Old Online Puma Odd East Black Button Zero Max
resultant_dataframe:
Col1 Col2 Col3 Col4 Col5 Col6 Col7 Col8 Col9 Col10
94159 New Store Puma Odd East Blue Button Zero Max
421301 New Store Lexi Even West Red Button Zero Max
226024 New Online Puma Odd East Blue Button Zero Max
560035 Old Store Puma Odd East Black Button Zero Max
null New Stock Puma Odd East Blue Button Zero Max
null Old Online Puma Odd East Black Button Zero Max
I am using the below approach:
resultant_dataframe = df1.join(df2, ['Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7'], 'left')
resultant_dataframe = resultant_dataframe.dropDuplicates(subset=['Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7'])
But I am missing something, since not all of the unique rows end up in the resultant dataframe. Is there another way to get this done?

If you do a left join, it will only return the rows from the dataframe on the left, in this case df1.
You should change the join type to full:
resultant_dataframe = df1.join(df2, ['Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7'], 'full')
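For completeness, here is a minimal sketch of the full approach (assuming Col8-Col10 carry the same values in both frames whenever the join keys match; the coalesce fallback for them is illustrative, not part of the original answer):

import pyspark.sql.functions as F

join_cols = ['Col2', 'Col3', 'Col4', 'Col5', 'Col6', 'Col7']

# A full outer join keeps every df1 row and adds the df2 rows that have
# no match on the join columns; Col1 comes out as null for those rows.
joined = df1.join(df2, join_cols, 'full')

# Col8-Col10 exist in both frames but are not join keys, so the join
# keeps two copies of each; prefer df1's copy and fall back to df2's
# for the df2-only rows.
resultant_dataframe = joined.select(
    'Col1', *join_cols,
    *[F.coalesce(df1[c], df2[c]).alias(c) for c in ['Col8', 'Col9', 'Col10']]
).dropDuplicates(subset=join_cols)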


GCP SQL Postgres 11: Spike in storage size due to indexes

We have a Cloud SQL Postgres installation with 16 vCPUs. Suddenly, there is an increase in storage size related to indexes.
The table size is 20 GB, and its 4 indexes are consuming 70 GB+ according to the pgAdmin stats.
The table has bulk delete and insert operations in transactions.
Are there any special flags that need to be added?
PS: I have heard about WAL retention settings but am not sure whether they will have any impact, and I cannot actually SSH to the machine since it is GCP Cloud SQL.
[Update 1]:
After deleting the table and indexes, I re-created them:
Table T1 stats:
Sequential scans 911504
Sequential tuples read 7338381399896
Index scans 173853066
Index tuples fetched 98226419759
Tuples inserted 109307632
Tuples updated 804
Tuples deleted 100410620
Tuples HOT updated 0
Live tuples 8797115
Dead tuples 24503371
Heap blocks read 1829681
Heap blocks hit 287184313804
Index blocks read 11094719
Index blocks hit 4239641531
Toast blocks read 0
Toast blocks hit 0
Toast index blocks read 0
Toast index blocks hit 0
Last vacuum
Last autovacuum 2020-03-31 18:48:45.626151+00
Last analyze 2020-03-31 14:17:20.834182+00
Last autoanalyze 2020-03-31 18:48:53.828135+00
Vacuum counter 0
Autovacuum counter 24
Analyze counter 3
Autoanalyze counter 53
Table size 5726 MB
Toast table size 8192 bytes
Indexes size 21 GB
T1 pg_stat_user_indexes
indexrelname idx_scan idx_tup_read idx_tup_fetch Size
index1 0 0 0 4608 kB
index2 21 103,913,145 0 3417 MB
index3 2,786 1,110,430 135,322 4007 MB
index4 949,981 1,284,602 794,130 4020 MB
index5 7,549,112 1,043,077,414 1,043,060,187 1860 MB
index6 165,334,371 13,962,773,344 12,209,134,553 1692 MB
Table structure (total 14 columns):
Col1 character varying
Col2 character varying
Col3 character varying
Col4 character varying
Col5 character varying
Col6 character varying
Col7 character varying
Col8 timestamp without time zone
Col9 timestamp without time zone
Index definitions (index1 and index4 are unique indexes):
index1: (Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9) where Col8 is null
index2: (Col5, Col6)
index3: (Col1, Col2, Col3, Col5, Col6)
index4: (Col1, Col2, Col3, Col4, Col5, Col6, Col7, Col8, Col9) where Col8 is not null
index5: (Col4, Col5)
index6: (Col1, Col2, Col3, Col4, Col5)
One strange thing I noticed while extracting the stats is that pg_stat_user_indexes lists two additional indexes (index7 and index8) which are not visible under schemaname -> tablename -> indexes.
Thanks for the comments; they were really useful in troubleshooting the issue on Google Cloud.
The main problems were bloat due to frequent deletes/inserts and indexes that needed fine tuning.
Setting the Cloud SQL flags below helped resolve the problem:
autovacuum_analyze_scale_factor, autovacuum_vacuum_scale_factor, autovacuum_max_workers, maintenance_work_mem
PGTune was helpful in finding the optimum values.
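If you want to keep an eye on the dead-tuple buildup that drives this kind of bloat, a small monitoring sketch along these lines can help (Python with psycopg2 is an assumption here, not something from the thread; the connection details are placeholders, e.g. going through the Cloud SQL Auth proxy):

import psycopg2

# Placeholder connection values; Cloud SQL Postgres is typically
# reached through the Cloud SQL Auth proxy or a private IP.
conn = psycopg2.connect(host="127.0.0.1", dbname="mydb",
                        user="postgres", password="...")

with conn, conn.cursor() as cur:
    # Dead-tuple counts and the last autovacuum timestamp are the
    # quickest bloat indicators, matching the stats quoted above.
    cur.execute("""
        SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
        FROM pg_stat_user_tables
        ORDER BY n_dead_tup DESC
        LIMIT 10;
    """)
    for row in cur.fetchall():
        print(row)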

Replace infinity with nulls throughout an entire table in KDB

Example table:
table:([]col1:20 40 30 0w;col2:4?4;col3: 100 200 0w 300)
My solution:
{.[table;(where 0w=table[x];x);:;0n]}'[exec c from meta table where t="f"]
I'm sure there is a way I am not seeing. This just returns a list for each change, which I don't want; I just want the original table returned with the nulls in place.
Thanks in advance!
It would be good to flesh out your question a bit more. Are you always expecting it to be float columns? Will the table have many columns? Will there be string/sym columns mixed in that might complicate things?
If your table has a small number of columns, you could just do an update:
q)show t
col1 col2 col3
--------------
20   1    100
40   2    200
30   2    0w
0w   1    300
q)inftonull:{(x where x=0w):0n;x}
q)update inftonull col1, inftonull col3 from t
col1 col2 col3
--------------
20   2    100
40   1    200
30   0
     3    300
If you think the column names might change, or you have a very large number of columns, you could try a functional update (where you can pass the column names as parameters):
q){![t;();0b;x!inftonull,/:x,:()]}`col1`col3
col1 col2 col3
--------------
20   1    100
40   2    200
30   2
     1    300
If your table is comprised of only numeric data, something like
q)flip{(x where x=.Q.t[type x]$0w):x 0N;x}each flip t
col1 col2 col3
--------------
20   2    100
40   1    200
30   0
     3    300
might work; it tries to account for the fact that the numeric columns can have different types.
If your data is going to contain string/sym columns, the last example won't work.

Randomly select observations separately for each column in SQL

I am interested in generating completely randomized ("damaged") data where observations are selected randomly (with replacement) for each field and then combined. I will need to generate a new dummy id to represent the old id, as I don't want the data to be reconstructable. My goal is to create a simulated column-wise random dataset.
Here is a sample data:
Id Col1 Col2 Col3
11 A 0.01 David
12 B 0.04 Max
13 C 0.05 Tom
14 E 0.06 West
15 C 0.02 Mike
What I am interested in is something like this:
Id2 Col1 Col2 Col3
1 E 0.04 Mike
2 C 0.06 David
3 B 0.02 West
4 A 0.04 Tom
5 C 0.05 Max
I am looking for an organized way of doing this. Here is what I have attempted so far, but I am not keen on repeating it many times over, since I have a lot of columns in the real data.
proc sql;
create table newtable1 as
select monotonic() as id2, col1 from
(select col1 from Table1 order by ranuni(0));
quit;
Using the above code, you generate each randomized column separately and then combine the columns on the new monotonic key.
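Note that ordering by ranuni gives a shuffle of each column, i.e. sampling without replacement. If you truly want sampling with replacement, here is the same column-wise idea sketched in Python with pandas rather than SAS (the language switch is purely illustrative):

import pandas as pd

df = pd.DataFrame({
    "Id":   [11, 12, 13, 14, 15],
    "Col1": ["A", "B", "C", "E", "C"],
    "Col2": [0.01, 0.04, 0.05, 0.06, 0.02],
    "Col3": ["David", "Max", "Tom", "West", "Mike"],
})

# Sample each column independently, with replacement, then attach a
# fresh surrogate id so the original Id cannot be reconstructed.
scrambled = pd.DataFrame({
    col: df[col].sample(n=len(df), replace=True).to_numpy()
    for col in ["Col1", "Col2", "Col3"]
})
scrambled.insert(0, "Id2", range(1, len(df) + 1))
print(scrambled)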

Create a table where one insert is a batch

I want to create a table where one insert is a batch, and if the same rows are inserted again it should throw an error.
Here is a simple example.
This is one insert; if we try to insert these rows again, it should throw an error (it should not insert):
col1 col2 col3 col4(ID)
row1 a 0.1 xyz 1
row2 b 0.2 abc 1
row3 c 0.3 pqr 1
Now I have changed the insert a little bit; this should count as a new insert:
col1 col2 col3 col4(ID)
row1 a 0.1 xyz 2
row2 b 0.211 abc 2
row3 c 0.3 pqr 2
I tried a composite primary key, but I was missing something; I'm seeing this error:
ERROR: duplicate key value violates unique constraint
I want an error only when all three rows are repeated; if anything is changed in these 3 rows, it should be a new insert.
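For reference, here is a minimal sketch of the composite-key attempt described above (Python with psycopg2; the table name and column types are hypothetical). It shows why the attempt falls short of the requirement: uniqueness is enforced per row, so any single repeated row raises the duplicate-key error, rather than the batch being compared as a whole.

import psycopg2

conn = psycopg2.connect("dbname=test")  # placeholder connection string

with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS batches (
            col1 text, col2 numeric, col3 text, col4 int,
            PRIMARY KEY (col1, col2, col3, col4)
        );
    """)
    rows = [("a", 0.1, "xyz", 1), ("b", 0.2, "abc", 1), ("c", 0.3, "pqr", 1)]
    # Re-running this raises "duplicate key value violates unique
    # constraint" on the first repeated row, not only when all three
    # rows repeat together.
    cur.executemany("INSERT INTO batches VALUES (%s, %s, %s, %s);", rows)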

Crystal Reports: How do I repeat a constant number of rows / headers on each new page in a cross-tab?

I have some data that I've staged in my database as such:
RowHeader ColumnHeader Value
Row1 Col1 (1,1)
Row1 Col2 (1,2)
Row1 Col3 (1,3)
Row1 Col4 (1,4)
Row1 Col5 (1,5)
Row2 Col1 (2,1)
Row2 Col2 (2,2)
... ... ...
RowN ColM (N,M)
And, as you might guess, I'm putting this in a cross tab in the following manner:
Columns: ColumnHeader
Rows: RowHeader
Summarized Fields: Max of Value
And this generates the following report:
Col1 Col2 Col3 ... ColM
Row1 (1,1) (1,2) (1,3) ... (1,M)
Row2 (2,1) (2,2) (2,3) ... (2,M)
... ... ... ... ...
RowN (N,1) (N,2) (N,3) ... (N,M)
Now, this report spans multiple pages, and on each page I'd like to always display the data from the first couple of rows and columns (a little like freezing panes in Excel). The number of rows and columns that always need to be displayed is constant. For example, let's say that on each page I want columns 1 to 3 and row 1 to appear:
-- Page 1 --
Col1 Col2 Col3 Col4 Col5
Row1 (1,1) (1,2) (1,3) (1,4) (1,5)
Row2 (2,1) (2,2) (2,3) (2,4) (2,5)
Row3 (3,1) (3,2) (3,3) (3,4) (3,5)
Row4 (4,1) (4,2) (4,3) (4,4) (4,5)
Row5 (5,1) (5,2) (5,3) (5,4) (5,5)
-- Page 2 --
Col1 Col2 Col3 Col6 Col7
Row1 (1,1) (1,2) (1,3) (1,6) (1,7)
Row6 (6,1) (6,2) (6,3) (6,6) (6,7)
Row7 (7,1) (7,2) (7,3) (7,6) (7,7)
Row8 (8,1) (8,2) (8,3) (8,6) (8,7)
Row9 (9,1) (9,2) (9,3) (9,6) (9,7)
-- etc. ---
How can I do this?
Ok ok... you caught me... I'm totally new to using Crystal Reports (what gave it away?). I have a feeling that this cannot be done with the way the data is currently staged, but I am totally open to staging the data in another fashion to make this work. Thanks in advance.
You can achieve that, meaning you are able to create a group which can dispatch your columns.
I mean, if your columns are month/year and you want only 6 per sheet, you create a group with a formula indicating: if your date is in the first 6 months of the year then 'start year', else 'end year'.
You insert your group in the report, then you place your cross-tab in each group... done.
You cannot achieve this with cross-tabs. You can achieve this by staging the data differently (i.e. in the manner it needs to be displayed) and creating a normal report.
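As a sketch of that staging idea (Python with pandas here, purely to illustrate the layout; the page geometry and the frozen-column choice are hypothetical), you would pivot the staged rows into display form and emit the frozen columns alongside each moving window:

import pandas as pd

# Staged rows as in the question: one (RowHeader, ColumnHeader, Value)
# triple per cell.
staged = pd.DataFrame({
    "RowHeader":    ["Row1", "Row1", "Row2", "Row2"],
    "ColumnHeader": ["Col1", "Col2", "Col1", "Col2"],
    "Value":        ["(1,1)", "(1,2)", "(2,1)", "(2,2)"],
})

# Pivot into the display layout, then print one chunk per "page" that
# repeats the frozen columns next to the moving window, mimicking
# Excel-style frozen panes.
wide = staged.pivot(index="RowHeader", columns="ColumnHeader", values="Value")
frozen, per_page = ["Col1"], 1  # hypothetical page geometry
moving = [c for c in wide.columns if c not in frozen]
for i in range(0, len(moving), per_page):
    print(wide[frozen + moving[i:i + per_page]], "\n")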
Morning. As I say, you need to find a link between the columns... I don't know how to repeat the first 3 columns, as long as they're not labels...