Concatenate a DataFrame column in Scala

I have a df like this
col1 col2 col3
1 ab file1
1 ab file2
2 bd file3
2 bd file4
3 fe file2
Now I need to concatenate col3 using ; as the delimiter.
The output should look like this:
col1 col2 col3
1 ab file1;file2
2 bd file3;file4
3 fe file2
I have used concat_ws(";",collect_set(col3)).
But the order varies: sometimes col3 comes out as file1;file2 and sometimes as file2;file1.
How can I get the desired output?

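collect_set does not guarantee the order of its elements, and a sort applied before groupBy is not guaranteed to survive the shuffle. Sort the collected array itself with sort_array before joining it:
import org.apache.spark.sql.functions.{col, collect_set, concat_ws, sort_array}
df.groupBy(col("col1"), col("col2")).agg(concat_ws(";", sort_array(collect_set(col("col3")))).as("col3"))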
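To see why the output order can vary, it may help to model the aggregation outside Spark. The following is a plain-Python sketch, not Spark code: `concat_sorted` is a hypothetical helper, and the rows are the sample data from the question, deliberately shuffled. It collects the distinct col3 values per (col1, col2) group into an unordered set, then sorts them before joining, which is the role sort_array plays in Spark.

```python
from collections import defaultdict

# Sample rows from the question as (col1, col2, col3), in shuffled order.
rows = [
    (1, "ab", "file2"),
    (1, "ab", "file1"),
    (2, "bd", "file4"),
    (2, "bd", "file3"),
    (3, "fe", "file2"),
]


def concat_sorted(rows, sep=";"):
    """Group by (col1, col2), deduplicate col3, and join it in sorted order."""
    groups = defaultdict(set)  # a set has no guaranteed order, like collect_set
    for col1, col2, col3 in rows:
        groups[(col1, col2)].add(col3)
    # Sorting each group's values makes the result independent of input order.
    return {key: sep.join(sorted(values)) for key, values in groups.items()}


result = concat_sorted(rows)
for (col1, col2), col3 in sorted(result.items()):
    print(col1, col2, col3)
```

Reversing or reshuffling `rows` leaves `result` unchanged, because the ordering is imposed after the values are collected rather than before.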

Related

Summary and crosstabulation in Pyspark (DataBricks)

I have a PySpark DataFrame for which I want to calculate summary statistics (the count of each unique category in a column) and a cross-tabulation against one fixed column, for all string columns.
For example, my df looks like this:
col1 col2 col3
Cat1 XYZ A
Cat1 XYZ C
Cat1 ABC B
Cat2 ABC A
Cat2 XYZ B
Cat2 MNO A
I want something like this:
VarNAME Category Count A B C
col1 Cat1 3 1 1 1
col1 Cat2 3 2 0 1
col2 XYZ 3 1 1 1
col2 ABC 2 1 1 0
col2 MNO 1 1 0 0
col3 A 3 3 0 0
col3 B 2 0 2 0
col3 C 1 0 0 1
So basically, I want a cross-tabulation of each individual column against col3, plus the total count.
I can do it in Python using a loop, but looping works somewhat differently in PySpark.

Here are my 2 cents.
Created a sample dataframe
df = spark.createDataFrame(
[("Cat1","XYZ","A"),
("Cat1","XYZ","C"),
("Cat1","ABC","B"),
("Cat2","ABC","A"),
("Cat2","XYZ","B"),
("Cat2","MNO","A")
],schema = ['col1','col2','col3'])
For each column, the crosstab function counts the rows per col3 value. The total row count is then computed as A+B+C, a constant column holding the source column name is added, and the key column is renamed to Category.
Finally, the per-column DataFrames are combined with union.
import pyspark.sql.functions as fx

# Cross-tabulate each column against col3, add the row total, and tag rows with the source column name.
df_union = (
    df.crosstab('col1','col3').withColumn('count',fx.expr("A+B+C")).withColumn('VarName',fx.lit('col1')).withColumnRenamed('col1_col3','Category')
    .union(df.crosstab('col2','col3').withColumn('count',fx.expr("A+B+C")).withColumn('VarName',fx.lit('col2')).withColumnRenamed('col2_col3','Category'))
    .union(df.crosstab('col3','col3').withColumn('count',fx.expr("A+B+C")).withColumn('VarName',fx.lit('col3')).withColumnRenamed('col3_col3','Category'))
)
Printing the data frame based on the column order
df_union.select('VarName','Category','count','A','B','C').show()
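The answer hard-codes the col3 levels in the expression "A+B+C" and repeats the same chain once per column. As a rough illustration of the underlying counting, here is a plain-Python sketch (not PySpark) that derives the levels from the data and loops over the columns. `crosstab_all` is a hypothetical helper, and the rows are the sample DataFrame created above.

```python
from collections import Counter

columns = ["col1", "col2", "col3"]
data = [
    ("Cat1", "XYZ", "A"),
    ("Cat1", "XYZ", "C"),
    ("Cat1", "ABC", "B"),
    ("Cat2", "ABC", "A"),
    ("Cat2", "XYZ", "B"),
    ("Cat2", "MNO", "A"),
]


def crosstab_all(data, columns, fixed="col3"):
    """Cross-tabulate every column against the fixed column, with a total."""
    fixed_idx = columns.index(fixed)
    # Levels of the fixed column become the output columns (A, B, C here).
    levels = sorted({row[fixed_idx] for row in data})
    table = []
    for idx, name in enumerate(columns):
        # Count each (category, level) pair for this column.
        counts = Counter((row[idx], row[fixed_idx]) for row in data)
        for category in sorted({row[idx] for row in data}):
            per_level = [counts[(category, level)] for level in levels]
            table.append((name, category, sum(per_level), *per_level))
    return levels, table


levels, table = crosstab_all(data, columns)
print("VarName", "Category", "count", *levels)
for row in table:
    print(*row)
```

The same loop structure should carry over to PySpark by building one crosstab per column name and folding the results together with union, instead of writing each chain out by hand.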

I need to compare two files using pyspark

I'm new to PySpark and need to compare two files based on Col1 alone, then populate a new column at the end of file 1 based on whether each record matches:
1 - Matching record
0 - Unmatched record
File1:
Col1 Col2 ... ColN
1 abc ... Xxxx
2 abc ... Xxxx
3 abc ... Xxxx
File2:
Col1 Col2 ... ColN
1 abc ... Xxxx
2 abc ... Xxxx
Expected output:
Col1 Col2 ... ColN Newcol
1 abc ... Xxxx 1
2 abc ... Xxxx 1
3 abc ... Xxxx 0
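The requested flag amounts to a membership test on Col1: a row of file 1 gets 1 if its Col1 value also occurs in file 2, and 0 otherwise. Below is a minimal plain-Python sketch of that logic, assuming the files have already been read into lists of dictionaries. `flag_matches` is a hypothetical helper, and only Col1 and Col2 are shown because the remaining columns are elided in the question.

```python
# Rows of the two files, reduced to the columns shown in the question.
file1 = [
    {"Col1": 1, "Col2": "abc"},
    {"Col1": 2, "Col2": "abc"},
    {"Col1": 3, "Col2": "abc"},
]
file2 = [
    {"Col1": 1, "Col2": "abc"},
    {"Col1": 2, "Col2": "abc"},
]


def flag_matches(left, right, key="Col1", flag="Newcol"):
    """Return copies of the left rows with a 1/0 flag for key matches."""
    right_keys = {row[key] for row in right}  # distinct keys of the second file
    return [{**row, flag: 1 if row[key] in right_keys else 0} for row in left]


flagged = flag_matches(file1, file2)
for row in flagged:
    print(row)
```

In PySpark, the same idea could probably be expressed as a left join of file 1 against the distinct Col1 values of file 2, with the new column derived from whether the joined key is null.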

PySpark: combine two dataframes where one is repeated for all distinct rows from the other

I have two pyspark dataframes where they look like the following:
df1:
Col1 Col2
1 A
2 B
and df2:
Col3 Col4
100 200
300 400
My desired outcome of the combining operations would be:
Col1 Col2 Col3 Col4
1 A 100 200
1 A 300 400
2 B 100 200
2 B 300 400
There are no shared columns between the two dataframes, which is what makes this tricky. I couldn't find anything that would do this, so any help is much appreciated.

You can use the crossJoin method.
PS: This is an expensive operation and should be avoided if possible.
df = df1.crossJoin(df2)
df.show()
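A cross join is a Cartesian product: every row of df1 is paired with every row of df2, so the result has len(df1) * len(df2) rows, which is why it gets expensive quickly. A small plain-Python sketch of the same pairing, using the sample rows from the question (the variable names are illustrative only):

```python
from itertools import product

df1_rows = [(1, "A"), (2, "B")]          # Col1, Col2
df2_rows = [(100, 200), (300, 400)]      # Col3, Col4

# Pair every left row with every right row and concatenate the tuples.
combined = [left + right for left, right in product(df1_rows, df2_rows)]

for row in combined:
    print(*row)
print(len(combined), "rows =", len(df1_rows), "x", len(df2_rows))
```

With two inputs of a million rows each, the same product would contain 10^12 rows, so it is worth checking the input sizes before calling crossJoin.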

creating data-frame from pipe & comma delimited file

I am trying to create a data frame from a data feed that has the following format:
ABC,13:10,23| PQR,01:20,2| XYZ,07:30,14
BCD,11:40,13| ABC,05:50,9| RST,17:20,5
Each record is pipe delimited and consists of 3 comma-separated sub-records; records come in batches of 3.
I intend to have each sub-record as a column and each record as one row of the data frame. So the above would result in 3 columns and 9 rows.
col1 col2 col3
ABC 13:10 23
PQR 01:20 2

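from pyspark.sql.functions import col, explode, split, trim
df = spark.read.text("/path/to/data.csv")
# One row per pipe-delimited record, then one column per comma-separated field.
parts = split(trim(col("record")), ",")
records = df.select(explode(split(col("value"), r"\|")).alias("record"))
records.select(parts[0].alias("col1"), parts[1].alias("col2"), parts[2].alias("col3")).show()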
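The parsing happens in two levels: split each line on the pipe to get one item per output row, then split each item on commas to get the three columns. A plain-Python sketch of those two steps on the sample feed follows. `parse_feed` is a hypothetical helper, and converting the third field to an integer is an assumption based on the sample values.

```python
lines = [
    "ABC,13:10,23| PQR,01:20,2| XYZ,07:30,14",
    "BCD,11:40,13| ABC,05:50,9| RST,17:20,5",
]


def parse_feed(lines):
    """Turn pipe-delimited lines into (col1, col2, col3) rows."""
    rows = []
    for line in lines:
        # First level: one item per pipe-delimited chunk.
        for chunk in line.split("|"):
            # Second level: comma-separated fields; strip the space after each pipe.
            col1, col2, col3 = (field.strip() for field in chunk.split(","))
            rows.append((col1, col2, int(col3)))  # int() is an assumption
    return rows


rows = parse_feed(lines)
print("col1 col2 col3")
for row in rows:
    print(*row)
```

The two sample lines produce six rows. The stripping step matters because the feed has a space after each pipe, which would otherwise end up at the start of col1.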

Oracle: How to group records by certain columns before fetching results

I have a table in Redshift that looks like this:
col1 | col2 | col3 | col4 | col5 | col6
=======================================
123 | AB | SSSS | TTTT | PQR | XYZ
---------------------------------------
123 | AB | SSTT | TSTS | PQR | XYZ
---------------------------------------
123 | AB | PQRS | WXYZ | PQR | XYZ
---------------------------------------
123 | CD | SSTT | TSTS | PQR | XYZ
---------------------------------------
123 | CD | PQRS | WXYZ | PQR | XYZ
---------------------------------------
456 | AB | GGGG | RRRR | OPQ | RST
---------------------------------------
456 | AB | SSTT | TSTS | PQR | XYZ
---------------------------------------
456 | AB | PQRS | WXYZ | PQR | XYZ
I have another table that also has a similar structure and data.
From these tables, I need to select values that don't have 'SSSS' in col3 and 'TTTT' in col4 in either of the tables. I'd also need to group my results by the value in col1 and col2.
Here, I'd like my query to return:
123,CD
456,AB
I don't want 123, AB to be in my results, since one of the rows corresponding to 123, AB has SSSS and TTTT in col3 and col4 respectively. i.e, I want to omit items that have SSSS and TTTT in col3 and col4 in either of the two tables that I'm looking up.
I am very new to writing queries to extract information from a database, so please bear with my ignorance. I was told to explore GROUP BY and ORDER BY, but I am not sure I understand their usage well enough yet.
The query I have looks like:
SELECT * from table1 join table2 on
table1.col1 = table2.col1 AND
table1.col2 = table2.col2
WHERE
col3 NOT LIKE 'SSSS' AND
col4 NOT LIKE 'TTTT'
GROUP BY col1,col2
However, this query throws an error: col5 must appear in the GROUP BY clause or be used in an aggregate function;
I'm not sure how to proceed. I'd appreciate any help. Thank you!

It seems you also want DISTINCT results. In this case a solution with MINUS is probably as efficient as any other (and, remember, MINUS automatically also means DISTINCT):
select col1, col2 from table_name -- enter your column and table names here
minus
select col1, col2 from table_name where col3 = 'SSSS' and col4 = 'TTTT'
;
No need to group by anything!
With that said, here is a solution using GROUP BY. Note that the HAVING condition uses a non-trivial aggregate function - it is a COUNT() but what is counted is a CASE to take care of what was required. Note that it is not necessary/required that the aggregate function in the HAVING clause/condition be included in the SELECT list!
select col1, col2
from table_name
group by col1, col2
having count(case when col3 = 'SSSS' and col4 = 'TTTT' then 1 else null end) = 0
;
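To check that the set-difference and GROUP BY/HAVING formulations agree, both can be run against the sample data in an in-memory SQLite database. This is only a sketch under assumptions: SQLite stands in for the real database, it spells the operator EXCEPT rather than MINUS, and `table1` is an assumed table name holding the rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (col1 INT, col2 TEXT, col3 TEXT, col4 TEXT, col5 TEXT, col6 TEXT)"
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?, ?, ?)",
    [
        (123, "AB", "SSSS", "TTTT", "PQR", "XYZ"),
        (123, "AB", "SSTT", "TSTS", "PQR", "XYZ"),
        (123, "AB", "PQRS", "WXYZ", "PQR", "XYZ"),
        (123, "CD", "SSTT", "TSTS", "PQR", "XYZ"),
        (123, "CD", "PQRS", "WXYZ", "PQR", "XYZ"),
        (456, "AB", "GGGG", "RRRR", "OPQ", "RST"),
        (456, "AB", "SSTT", "TSTS", "PQR", "XYZ"),
        (456, "AB", "PQRS", "WXYZ", "PQR", "XYZ"),
    ],
)

# Set difference: all (col1, col2) pairs minus the pairs that have an offending row.
except_rows = conn.execute(
    """
    SELECT col1, col2 FROM table1
    EXCEPT
    SELECT col1, col2 FROM table1 WHERE col3 = 'SSSS' AND col4 = 'TTTT'
    ORDER BY col1, col2
    """
).fetchall()

# Aggregation: keep groups in which no row matches the offending combination.
having_rows = conn.execute(
    """
    SELECT col1, col2 FROM table1
    GROUP BY col1, col2
    HAVING COUNT(CASE WHEN col3 = 'SSSS' AND col4 = 'TTTT' THEN 1 END) = 0
    ORDER BY col1, col2
    """
).fetchall()

print(except_rows)
print(having_rows)
```

Both queries return the pairs (123, CD) and (456, AB) on this data, matching the output requested in the question.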

You should use the EXCEPT operator.
EXCEPT and MINUS are two different versions of the same operator.
Here is the syntax of what your query should look like
SELECT col1, col2 FROM table1
EXCEPT
SELECT col1, col2 FROM table1 WHERE col3 = 'SSSS' AND col4 = 'TTTT';
One important consideration is whether your desired answer requires the AND or the OR operator. Do you want to keep groups that contain records where col3 = 'SSSS' but col4 is something other than 'TTTT'?
If the answer is no, you should use the version below:
SELECT col1, col2 FROM table1
EXCEPT
SELECT col1, col2 FROM table1 WHERE col3 = 'SSSS' OR col4 = 'TTTT';
You can learn more about the MINUS or EXCEPT operator in the Amazon Redshift documentation.
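The sample data in the question cannot distinguish the AND and OR variants, because only one row contains 'SSSS' or 'TTTT' at all. The sketch below therefore adds a hypothetical group (789, EF) whose only row has col3 = 'SSSS' with a different col4. It uses an in-memory SQLite database as a stand-in, and `remaining` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 INT, col2 TEXT, col3 TEXT, col4 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?)",
    [
        (123, "AB", "SSSS", "TTTT"),
        (123, "CD", "SSTT", "TSTS"),
        (789, "EF", "SSSS", "WXYZ"),  # hypothetical row, not in the question
    ],
)


def remaining(operator):
    """Run the EXCEPT query with AND or OR between the two conditions."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR")
    query = f"""
        SELECT col1, col2 FROM table1
        EXCEPT
        SELECT col1, col2 FROM table1 WHERE col3 = 'SSSS' {operator} col4 = 'TTTT'
        ORDER BY col1, col2
    """
    return conn.execute(query).fetchall()


print("AND:", remaining("AND"))
print("OR: ", remaining("OR"))
```

With AND, the group (789, EF) survives because its row matches only one of the two conditions. With OR, it is removed along with (123, AB).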