I need to compare two files using PySpark

I'm new to PySpark and need to compare two files based on Col1 alone, and populate a new column at the end of file 1 based on the matching condition:
1 - Matching record
0 - Unmatched record
File1:
Col1 Col2 ... ColN
1    abc  ... Xxxx
2    abc  ... Xxxx
3    abc  ... Xxxx
File 2:
Col1 Col2 ... ColN
1    abc  ... Xxxx
2    abc  ... Xxxx
Expected output:
Col1 Col2 ... ColN Newcol
1    abc  ... Xxxx 1
2    abc  ... Xxxx 1
3    abc  ... Xxxx 0
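A minimal sketch of one way to do this, assuming both files are CSVs with headers and that the paths file1.csv and file2.csv are placeholders: read both files, keep the distinct Col1 values from file 2 tagged with 1, left join onto file 1, and fill the non-matches with 0.
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# file paths and the header option are assumptions - adjust to your data
df1 = spark.read.csv("file1.csv", header=True)
df2 = spark.read.csv("file2.csv", header=True)

# keep only the join key from file 2 and tag every distinct key with 1
keys = df2.select("Col1").distinct().withColumn("Newcol", lit(1))

# a left join keeps every row of file 1; rows with no match get a null Newcol, which is filled with 0
result = df1.join(keys, on="Col1", how="left").fillna({"Newcol": 0})
result.show()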

Related

Summary and crosstabulation in Pyspark (DataBricks)

I have a PySpark DataFrame for which I want to calculate summary statistics (the count of all unique categories in each column) and a cross-tabulation with one fixed column for all string columns.
For example, my df is like this:
col1 col2 col3
Cat1 XYZ  A
Cat1 XYZ  C
Cat1 ABC  B
Cat2 ABC  A
Cat2 XYZ  B
Cat2 MNO  A
I want something like this:
VarName Category Count A B C
col1    Cat1     3     1 1 1
col1    Cat2     3     2 0 1
col2    XYZ      3     1 1 1
col2    ABC      2     1 1 0
col2    MNO      1     1 0 0
col3    A        3     3 0 0
col3    B        2     0 2 0
col3    C        1     0 0 1
So basically, I want a cross-tabulation of each individual column against col3, plus the total count.
I can do it in Python using a loop, but looping works somewhat differently in PySpark.
Here are my 2 cents.
Created a sample DataFrame:
df = spark.createDataFrame(
    [("Cat1", "XYZ", "A"),
     ("Cat1", "XYZ", "C"),
     ("Cat1", "ABC", "B"),
     ("Cat2", "ABC", "A"),
     ("Cat2", "XYZ", "B"),
     ("Cat2", "MNO", "A")],
    schema=['col1', 'col2', 'col3'])
Used the crosstab function, which calculates the count for each value of col3; then evaluated the total row count, created a new constant column holding the source column name, and renamed the crosstab key column.
Then performed a union of all these DataFrames:
from pyspark.sql.functions import lit
import pyspark.sql.functions as fx

df_union = \
    df.crosstab('col1', 'col3').withColumn('count', fx.expr("A+B+C")).withColumn('VarName', lit('col1')).withColumnRenamed('col1_col3', 'Category').union(
    df.crosstab('col2', 'col3').withColumn('count', fx.expr("A+B+C")).withColumn('VarName', lit('col2')).withColumnRenamed('col2_col3', 'Category')).union(
    df.crosstab('col3', 'col3').withColumn('count', fx.expr("A+B+C")).withColumn('VarName', lit('col3')).withColumnRenamed('col3_col3', 'Category'))
Printing the DataFrame based on the desired column order:
df_union.select('VarName','Category','count','A','B','C').show()
Please check the sample output for reference.
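If you need the same thing for an arbitrary list of string columns instead of hard-coding col1, col2 and col3, the loop the question mentions can be written directly; the sketch below assumes, as in the sample data, that every crosstab produces the count columns A, B and C.
from functools import reduce
import pyspark.sql.functions as fx

string_cols = ['col1', 'col2', 'col3']  # columns to cross-tabulate against col3

crosstabs = []
for c in string_cols:
    ct = (df.crosstab(c, 'col3')
            .withColumn('count', fx.expr('A+B+C'))
            .withColumn('VarName', fx.lit(c))
            .withColumnRenamed(c + '_col3', 'Category'))
    crosstabs.append(ct)

# union the per-column crosstabs into a single DataFrame
df_union = reduce(lambda a, b: a.union(b), crosstabs)
df_union.select('VarName', 'Category', 'count', 'A', 'B', 'C').show()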

Complex SQL help to get the start time & end time from a datetime column

I need to pivot the datetime column in such a way that, while the Order column value keeps increasing, the lowest value is taken as the start time and the highest value as the end time; once the counter resets, a new row for start and end time should be created.
Sample data
computername currentuser datetime order
abc xyz 7/5/2022 20:04:51 1
abc xyz 7/5/2022 20:04:51 1
abc xyz 7/6/2022 6:45:51 1
abc xyz 7/6/2022 6:45:51 1
abc xyz 7/6/2022 7:06:45 2
abc xyz 7/6/2022 7:06:45 3
abc xyz 7/6/2022 7:07:00 4
abc xyz 7/6/2022 7:59:12 2
abc xyz 7/6/2022 7:59:12 3
abc xyz 7/6/2022 7:59:19 4
abc xyz 7/6/2022 7:59:21 5
abc xyz 7/6/2022 21:28:19 1
abc xyz 7/6/2022 21:28:19 1
abc xyz 7/6/2022 21:28:24 2
abc xyz 7/6/2022 21:28:24 3
abc xyz 7/6/2022 21:28:24 4
Expected Output
computername currentuser starttime endtime
abc xyz 7/5/2022 20:04:51 7/5/2022 20:04:51
abc xyz 7/6/2022 6:45:51 7/6/2022 7:07:00
abc xyz 7/6/2022 7:59:12 7/6/2022 7:59:21
abc xyz 7/6/2022 21:28:19 7/6/2022 21:28:24
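No answer is shown here, but one way to get this grouping, sketched in PySpark rather than SQL (an assumption; the same logic translates to SQL window functions), is to flag every row whose order value does not increase relative to the previous row, turn the running sum of those flags into a group id, and take the min and max datetime per group. Duplicates are dropped first so the repeated rows in the sample don't open extra groups, and datetime is assumed to already be a timestamp type.
from pyspark.sql import Window
import pyspark.sql.functions as F

w = Window.partitionBy("computername", "currentuser").orderBy("datetime", "order")

sessions = (df.dropDuplicates()
              .withColumn("prev_order", F.lag("order").over(w))
              # a new block starts whenever the counter fails to increase
              .withColumn("reset", F.when(F.col("order") <= F.col("prev_order"), 1).otherwise(0))
              .withColumn("grp", F.sum("reset").over(
                  w.rowsBetween(Window.unboundedPreceding, Window.currentRow))))

(sessions.groupBy("computername", "currentuser", "grp")
         .agg(F.min("datetime").alias("starttime"),
              F.max("datetime").alias("endtime"))
         .orderBy("starttime")
         .drop("grp")
         .show(truncate=False))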

Concatenate a df column in Scala

I have a df like this
col1 col2 col3
1 ab file1
1 ab file2
2 bd file3
2 bd file4
3 fe file2
Now I need to concatenate col3 with a ';' delimiter.
The output should be like:
Col1 col2 col3
1 ab file1;file2
2 bd file3;file4
3 fe file2
I have used concat_ws(";",collect_set(col3))
But sometimes in col3 I get file1,file2 and sometimes file2,file1.
How can I get the desired output?
df.sort("col1", "col2", "col3").groupBy("col1", "col2").agg(concat_ws(";", collect_set("col3")))
You need to sort the DataFrame in the order in which the output is required.
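Note that collect_set gives no ordering guarantee even after a prior sort, so if the order still comes out inconsistently, sorting the collected array itself is more reliable. A minimal sketch in PySpark (the question is about Scala, but sort_array and collect_set exist in the Scala API with the same names):
from pyspark.sql.functions import concat_ws, sort_array, collect_set

result = (df.groupBy("col1", "col2")
            .agg(concat_ws(";", sort_array(collect_set("col3"))).alias("col3")))
result.show()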

Oracle: How to group records by certain columns before fetching results

I have a table in Redshift that looks like this:
col1 | col2 | col3 | col4 | col5 | col6
=======================================
123 | AB | SSSS | TTTT | PQR | XYZ
---------------------------------------
123 | AB | SSTT | TSTS | PQR | XYZ
---------------------------------------
123 | AB | PQRS | WXYZ | PQR | XYZ
---------------------------------------
123 | CD | SSTT | TSTS | PQR | XYZ
---------------------------------------
123 | CD | PQRS | WXYZ | PQR | XYZ
---------------------------------------
456 | AB | GGGG | RRRR | OPQ | RST
---------------------------------------
456 | AB | SSTT | TSTS | PQR | XYZ
---------------------------------------
456 | AB | PQRS | WXYZ | PQR | XYZ
I have another table that also has a similar structure and data.
From these tables, I need to select values that don't have 'SSSS' in col3 and 'TTTT' in col4 in either of the tables. I'd also need to group my results by the values in col1 and col2.
Here, I'd like my query to return:
123,CD
456,AB
I don't want 123, AB to be in my results, since one of the rows corresponding to 123, AB has SSSS and TTTT in col3 and col4 respectively. That is, I want to omit items that have SSSS and TTTT in col3 and col4 in either of the two tables that I'm looking up.
I am very new to writing queries to extract information from a database, so please bear with my ignorance. I was told to explore GROUP BY and ORDER BY, but I am not sure I understand their usage well enough yet.
The query I have looks like:
SELECT * from table1 join table2 on
table1.col1 = table2.col1 AND
table1.col2 = table2.col2
WHERE
col3 NOT LIKE 'SSSS' AND
col4 NOT LIKE 'TTTT'
GROUP BY col1,col2
However, this query throws an error: col5 must appear in the GROUP BY clause or be used in an aggregate function;
I'm not sure how to proceed. I'd appreciate any help. Thank you!
It seems you also want DISTINCT results. In this case a solution with MINUS is probably as efficient as any other (and, remember, MINUS automatically also means DISTINCT):
select col1, col2 from table_name -- enter your column and table names here
minus
select col1, col2 from table_name where col3 = 'SSSS' and col4 = 'TTTT'
;
No need to group by anything!
With that said, here is a solution using GROUP BY. Note that the HAVING condition uses a non-trivial aggregate function - it is a COUNT() but what is counted is a CASE to take care of what was required. Note that it is not necessary/required that the aggregate function in the HAVING clause/condition be included in the SELECT list!
select col1, col2
from table_name
group by col1, col2
having count(case when col3 = 'SSSS' and col4 = 'TTTT' then 1 else null end) = 0
;
You should use the EXCEPT operator.
EXCEPT and MINUS are two different versions of the same operator.
Here is the syntax of what your query should look like
SELECT col1, col2 FROM table1
EXCEPT
SELECT col1, col2 FROM table1 WHERE col3 = 'SSSS' AND col4 = 'TTTT';
One important consideration is whether your desired answer requires the AND or the OR operator. Do you want to see the records where col3 = 'SSSS' and col4 has a value different from 'TTTT'?
If the answer is no, you should use the version below:
SELECT col1, col2 FROM table1
EXCEPT
SELECT col1, col2 FROM table1 WHERE col3 = 'SSSS' OR col4 = 'TTTT';
You can learn more about the MINUS and EXCEPT operators in the Amazon Redshift documentation.

PostgreSQL XOR - How to check if only 1 column is filled in?

How can I simulate an XOR function in PostgreSQL? Or at least, I think this is an XOR kind of situation.
Let's say the data is as follows:
id | col1 | col2 | col3
---+------+------+------
1 | 1 | | 4
2 | | 5 | 4
3 | | 8 |
4 | 12 | 5 | 4
5 | | | 4
6 | 1 | |
7 | | 12 |
And I want to return one column for those rows where only one of the columns is filled in (ignore col3 for now).
Let's start with this example of 2 columns:
SELECT
id, COALESCE(col1, col2) AS col
FROM
my_table
WHERE
COALESCE(col1, col2) IS NOT NULL -- at least 1 is filled in
AND
(col1 IS NULL OR col2 IS NULL) -- at least 1 is empty
;
This works nicely and should result in:
id | col
---+----
1 | 1
3 | 8
6 | 1
7 | 12
But now, I would like to include col3 in a similar way. Like this:
id | col
---+----
1 | 1
3 | 8
5 | 4
6 | 1
7 | 12
How can this be done in a more generic way? Does Postgres support such a method?
I'm not able to find anything like it.
Rows with exactly 1 column filled in:
select * from my_table where
 (col1 is not null)::integer
+(col2 is not null)::integer
+(col3 is not null)::integer
=1
Rows with 1 or 2 columns filled in:
select * from my_table where
 (col1 is not null)::integer
+(col2 is not null)::integer
+(col3 is not null)::integer
between 1 and 2
The "case" statement might be your friend here, the "min" aggregated function doesn't affect the result.
select id, min(coalesce(col1,col2,col3))
from my_table
group by 1
having sum(case when col1 is null then 0 else 1 end+
case when col2 is null then 0 else 1 end+
case when col3 is null then 0 else 1 end)=1
[Edit]
Well, I found a better answer without using aggregate functions; it's still based on CASE, but I think it is simpler.
select id, coalesce(col1,col2,col3)
from my_table
where (case when col1 is null then 0 else 1 end+
case when col2 is null then 0 else 1 end+
case when col3 is null then 0 else 1 end)=1
How about
select coalesce(col1, col2, col3)
from my_table
where array_length(array_remove(array[col1, col2, col3], null), 1) = 1
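As an aside, the same count-the-filled-columns idea carries over to PySpark; a minimal sketch, assuming a DataFrame df with columns id, col1, col2 and col3:
from pyspark.sql import functions as F

cols = ["col1", "col2", "col3"]

# sum of 1/0 flags = number of filled-in columns per row
filled = sum(F.col(c).isNotNull().cast("int") for c in cols)

(df.where(filled == 1)
   .select("id", F.coalesce(*[F.col(c) for c in cols]).alias("col"))
   .show())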