I have a PySpark dataframe for which I want to calculate summary statistics (the count of each unique category in a column) and a cross-tabulation against one fixed column, for all string columns.
For example, my df looks like this:
col1  col2  col3
Cat1  XYZ   A
Cat1  XYZ   C
Cat1  ABC   B
Cat2  ABC   A
Cat2  XYZ   B
Cat2  MNO   A
I want something like this:
VarNAME  Category  Count  A  B  C
col1     Cat1      3      1  1  1
col1     Cat2      3      2  1  0
col2     XYZ       3      1  1  1
col2     ABC       2      1  1  0
col2     MNO       1      1  0  0
col3     A         3      3  0  0
col3     B         2      0  2  0
col3     C         1      0  0  1
So basically, I want a cross-tabulation of every individual column against col3, plus the total count per category.
I can do this in plain Python with a loop, but looping works somewhat differently in PySpark.
Here are my 2 cents.
First, create a sample dataframe:
df = spark.createDataFrame(
[("Cat1","XYZ","A"),
("Cat1","XYZ","C"),
("Cat1","ABC","B"),
("Cat2","ABC","A"),
("Cat2","XYZ","B"),
("Cat2","MNO","A")
],schema = ['col1','col2','col3'])
Use the crosstab function to count the col3 values for each category, compute the total row count as the sum of those counts, add a constant column holding the name of the source column, and rename the crosstab key column to Category.
Then union the resulting dataframes:
from functools import reduce
import pyspark.sql.functions as fx

def crosstab_with_total(name):
    # Cross-tabulate one column against col3, then add the row total and the variable name.
    return (df.crosstab(name, 'col3')
              .withColumn('count', fx.expr('A + B + C'))
              .withColumn('VarName', fx.lit(name))
              .withColumnRenamed(name + '_col3', 'Category'))

# Stack the per-column results; union matches columns by position.
df_union = reduce(lambda a, b: a.union(b), [crosstab_with_total(c) for c in ['col1', 'col2', 'col3']])
Print the dataframe with the columns in the desired order:
df_union.select('VarName','Category','count','A','B','C').show()
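As a cross-check that does not need a Spark session, the same table can be computed in plain Python on the sample rows. This is only a sketch of the counting logic, not a replacement for the crosstab approach; the function name `crosstab_all` and its arguments are hypothetical names chosen for illustration.

```python
from collections import Counter

# Sample rows copied from the question: (col1, col2, col3).
rows = [
    ("Cat1", "XYZ", "A"),
    ("Cat1", "XYZ", "C"),
    ("Cat1", "ABC", "B"),
    ("Cat2", "ABC", "A"),
    ("Cat2", "XYZ", "B"),
    ("Cat2", "MNO", "A"),
]
columns = ["col1", "col2", "col3"]
target = "col3"  # the fixed column every other column is tabulated against


def crosstab_all(rows, columns, target):
    """Return the target levels and (VarName, Category, Count, level counts...) tuples."""
    t = columns.index(target)
    levels = sorted({row[t] for row in rows})  # distinct values of the target column
    table = []
    for i, name in enumerate(columns):
        # Count how often each (category, target level) pair occurs for this column.
        pairs = Counter((row[i], row[t]) for row in rows)
        for category in sorted({row[i] for row in rows}):
            counts = [pairs[(category, level)] for level in levels]
            table.append((name, category, sum(counts), *counts))
    return levels, table


levels, table = crosstab_all(rows, columns, target)
print("VarName", "Category", "Count", *levels)
for line in table:
    print(*line)
```

Because the target levels are discovered from the data, nothing here depends on the values being exactly A, B and C, unlike the hard-coded `A + B + C` expression in the Spark version.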
I'm new to PySpark and need to compare two files on Col1 alone, then populate a new column at the end of file 1 based on whether each record has a match in file 2:
1 - Matching record
0 - Unmatched record
File 1:
Col1  Col2  ...  ColN
1     abc   ...  Xxxx
2     abc   ...  Xxxx
3     abc   ...  Xxxx
File 2:
Col1  Col2  ...  ColN
1     abc   ...  Xxxx
2     abc   ...  Xxxx
Expected output:
Col1  Col2  ...  ColN  Newcol
1     abc   ...  Xxxx  1
2     abc   ...  Xxxx  1
3     abc   ...  Xxxx  0
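The matching rule can be sketched in plain Python: collect the Col1 values of file 2 into a set, then flag each row of file 1 by membership. In PySpark the same idea is usually expressed as a left join of file 1 against the distinct Col1 values of file 2. The helper name `flag_matches` and the two-column sample rows are assumptions made for illustration; the elided columns from the question are left out.

```python
# Hypothetical in-memory stand-ins for the two files (only Col1 and Col2 shown).
file1 = [
    {"Col1": 1, "Col2": "abc"},
    {"Col1": 2, "Col2": "abc"},
    {"Col1": 3, "Col2": "abc"},
]
file2 = [
    {"Col1": 1, "Col2": "abc"},
    {"Col1": 2, "Col2": "abc"},
]


def flag_matches(left, right, key="Col1", flag="Newcol"):
    """Append a 1/0 flag to each left row: 1 if its key also occurs in right."""
    right_keys = {row[key] for row in right}  # set lookup keeps the check cheap
    return [{**row, flag: 1 if row[key] in right_keys else 0} for row in left]


flagged = flag_matches(file1, file2)
for row in flagged:
    print(row)
```

Only the key column is compared, so rows whose other columns differ between the files are still flagged as matching.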
I have two PySpark dataframes that look like the following:
df1:
Col1 Col2
1 A
2 B
and df2:
Col3 Col4
100 200
300 400
My desired outcome of the combining operations would be:
Col1 Col2 Col3 Col4
1 A 100 200
1 A 300 400
2 B 100 200
2 B 300 400
There are no shared columns between the two dataframes, which is what makes this tricky. I couldn't find anything that does this, so any help is much appreciated.
You can use the crossJoin method.
PS: This is an expensive operation, since the result contains one row for every pair of input rows, and should be avoided if possible.
df = df1.crossJoin(df2)
df.show()
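To see why the cost grows so quickly, here is a plain-Python illustration of what a cross join produces, using the sample rows from the question. It is a sketch of the semantics only, with `itertools.product` standing in for the join; the variable names are hypothetical.

```python
from itertools import product

# Rows of df1 (Col1, Col2) and df2 (Col3, Col4) from the question.
df1_rows = [(1, "A"), (2, "B")]
df2_rows = [(100, 200), (300, 400)]

# A cross join pairs every row on the left with every row on the right.
cross = [left + right for left, right in product(df1_rows, df2_rows)]

for row in cross:
    print(row)

# The result size is the product of the input sizes.
print(len(cross), "rows =", len(df1_rows), "x", len(df2_rows))
```

With two inputs of a million rows each, the same operation would produce 10^12 rows, which is why it is worth checking whether a keyed join would do instead. Row order in the Spark result is not guaranteed to match this listing.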
I have a table bb:
bb:([]key1: 0 1 2 1 7; col1: 1 2 3 4 5; col2: 5 4 3 2 1; col3:("11";"22" ;"33" ;"44"; "55"))
How do I do a relational comparison of strings? Say I want to get the records with col3 less than or equal to "33":
select from bb where col3 <= "33"
Expected result:
key1 col1 col2 col3
0 1 5 11
1 2 4 22
2 3 3 33
If you want col3 to remain of string type, just cast it temporarily within the qSQL query:
q)select from bb where ("J"$col3) <= 33
key1 col1 col2 col3
-------------------
0 1 5 "11"
1 2 4 "22"
2 3 3 "33"
If you are looking for classical string comparison, regardless of whether the string is a number or not, I would propose the following approach:
a. Create functions that behave like common Java Comparators: they return 0 when the strings are equal, -1 when the first string is less than the second, and 1 when the first is greater than the second
.utils.compare: {$[x~y;0;$[x~first asc (x;y);-1;1]]};
.utils.less: {-1=.utils.compare[x;y]};
.utils.lessOrEq: {0>=.utils.compare[x;y]};
.utils.greater: {1=.utils.compare[x;y]};
.utils.greaterOrEq: {0<=.utils.compare[x;y]};
b. Use them in where clause
bb:([]key1: 0 1 2 1 7;
col1: 1 2 3 4 5;
col2: 5 4 3 2 1;
col3:("11";"22" ;"33" ;"44"; "55"));
select from bb where .utils.greaterOrEq["33"]'[col3]
c. As you see below, this works for arbitrary strings
cc:([]key1: 0 1 2 1 7;
col1: 1 2 3 4 5;
col2: 5 4 3 2 1;
col3:("abc" ;"def" ;"tyu"; "55poi"; "gab"));
select from cc where .utils.greaterOrEq["ffff"]'[col3]
.utils.compare could also be written in vector form, though I'm not sure whether it would be more time or memory efficient
.utils.compareVector: {
?[x~'y;0;?[x~'first each asc each(enlist each x),'enlist each y;-1;1]]
};
One way would be to evaluate the strings before the comparison:
q)bb:([]key1: 0 1 2 1 7; col1: 1 2 3 4 5; col2: 5 4 3 2 1; col3:("11";"22" ;"33" ;"44"; "55"))
q)bb
key1 col1 col2 col3
-------------------
0 1 5 "11"
1 2 4 "22"
2 3 3 "33"
1 4 2 "44"
7 5 1 "55"
q)
q)
q)select from bb where 33>=value each col3
key1 col1 col2 col3
-------------------
0 1 5 "11"
1 2 4 "22"
2 3 3 "33"
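In this case, value each evaluates each string to an integer, and the comparison is then performed on those integers. Note that value evaluates the string as q code, so the "J"$ cast is the safer choice for untrusted input.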
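The numeric and the string-based approaches agree on the sample data only because every value has two digits. A short Python illustration of the difference, assuming q orders strings character by character the way Python does; the extra value "9" is hypothetical and not part of the original table.

```python
# col3 values from the question, plus a hypothetical one-digit value "9".
values = ["11", "22", "33", "44", "55", "9"]

# Numeric comparison: convert first, as the cast-based answers do.
numeric = [v for v in values if int(v) <= 33]

# Lexicographic comparison: strings are compared character by character,
# so "9" sorts after "33" because '9' > '3'.
lexicographic = [v for v in values if v <= "33"]

print("numeric:      ", numeric)
print("lexicographic:", lexicographic)
```

If the column can hold numbers of different lengths, the numeric cast is the one that matches the usual meaning of "less than or equal to 33".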
I have the following dataframe
id col1 col2 col3 col4
1 1 10 100 A
1 1 20 101 B
1 1 30 102 C
2 1 10 80 D
2 1 20 90 E
2 1 30 100 F
2 1 40 104 G
So, I want to return a new dataframe that holds, in a single row, the values for the same (col1, col2), and that also has a new column computed from both col3 values, for example
id(1) col1(1) col2(1) col3(1) col4(1) id(2) col1(2) col2(2) col3(2) col4(2) new_column
1 1 10 100 A 2 1 10 80 D (100-80)*100
1 1 20 101 B 2 1 20 90 E (101-90)*100
1 1 30 102 C 2 1 30 100 F (102-100)*100
- - - - - 2 1 40 104 G -
I tried ordering and grouping by (col1, col2), but the grouping returns a RelationalGroupedDataset, on which I cannot do anything apart from aggregation functions. I would appreciate any help. I'm using Scala 2.11. Thanks!
What about joining the df with itself?
Something like:
df.as("left")
.join(df.as("right"), Seq("col1", "col2"), "outer")
.where($"left.id" =!= $"right.id")
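The pairing and the new column can be sketched in plain Python to make the intended result concrete. This is not Spark code: it assumes that each (col1, col2) pair occurs at most once per id and that the two ids to compare are 1 and 2, and the function name `pair_by_key` is hypothetical. It behaves like a full outer join between the id 1 rows and the id 2 rows on (col1, col2).

```python
# Rows from the question: (id, col1, col2, col3, col4).
rows = [
    (1, 1, 10, 100, "A"),
    (1, 1, 20, 101, "B"),
    (1, 1, 30, 102, "C"),
    (2, 1, 10, 80, "D"),
    (2, 1, 20, 90, "E"),
    (2, 1, 30, 100, "F"),
    (2, 1, 40, 104, "G"),
]


def pair_by_key(rows, left_id=1, right_id=2):
    """Pair rows of two ids on (col1, col2) and compute (col3_left - col3_right) * 100."""
    left = {(r[1], r[2]): r for r in rows if r[0] == left_id}
    right = {(r[1], r[2]): r for r in rows if r[0] == right_id}
    paired = []
    for key in sorted(left.keys() | right.keys()):  # union of keys = outer join
        l, r = left.get(key), right.get(key)
        # The new column is only defined when both sides are present.
        new_column = (l[3] - r[3]) * 100 if l and r else None
        paired.append((l, r, new_column))
    return paired


paired = pair_by_key(rows)
for l, r, new_column in paired:
    print(l, r, new_column)
```

Splitting the rows by id before pairing avoids the mirrored duplicates that a plain self-join produces, where each pair shows up once as (1, 2) and once as (2, 1).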
Is there a way to do the following in a more elegant way (i.e. with fewer commands):
import pandas as pd

df_1 = pd.DataFrame({'col1':[1,2,3], 'col2':[10,20,30]})
df_2 = pd.DataFrame({'col3':[100,200,300], 'col4':[1000,2000,3000]})
# Copy each column of df_2 into df_1, one at a time.
for col in ['col3','col4']:
    df_1[col] = df_2[col]
print(df_1)
You can use concat, which aligns the two frames on their index and places their columns side by side:
In [407]: pd.concat([df_1, df_2], axis=1)
Out[407]:
col1 col2 col3 col4
0 1 10 100 1000
1 2 20 200 2000
2 3 30 300 3000