I have two dataframes. I want to delete some records in Data Frame-A based on some common column values in Data Frame-B.
For Example:
Data Frame-A:
A B C D
1 2 3 4
3 4 5 7
4 7 9 6
2 5 7 9
Data Frame-B:
A B C D
1 2 3 7
2 5 7 4
2 9 8 7
Keys: A,B,C columns
Desired Output:
A B C D
3 4 5 7
4 7 9 6
Is there any solution for this?
You are looking for a left anti-join:
df_a.join(df_b, Seq("A","B","C"), "leftanti").show()
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 3| 4| 5| 7|
| 4| 7| 9| 6|
+---+---+---+---+
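The snippet above uses the Scala API. If you are working in PySpark, a minimal equivalent sketch (assuming the same dataframe names df_a and df_b) would be:
# left anti-join keeps only the rows of df_a that have no match in df_b
# on the key columns A, B and C
df_a.join(df_b, ["A", "B", "C"], "left_anti").show()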
I am trying to create new versions of existing columns based on the values of another column. E.g., in the input below I create new columns from 'var1', 'var2', 'var3' for each value that 'split' can take.
Input:
time  student   split  var1  var2  var3
t1    Student1  A      1     3     7
t1    Student1  B      2     5     6
t1    Student1  C      3     1     9
t2    Student1  A      5     3     7
t2    Student1  B      9     6     3
t2    Student1  C      3     5     3
t1    Student2  A      1     2     8
t1    Student2  C      7     4     0
Output:
time  student   splitA_var1  splitA_var2  splitA_var3  splitB_var1  splitB_var2  splitB_var3  splitC_var1  splitC_var2  splitC_var3
t1    Student1  1            3            7            2            5            6            3            1            9
t2    Student1  5            3            7            9            6            3            3            5            3
t1    Student2  1            2            8            null         null         null         7            4            0
This is an easy pivot with multiple aggregations (within agg()); see the example below.
import pyspark.sql.functions as func
data_sdf. \
    withColumn('pivot_col', func.concat(func.lit('split'), 'split')). \
    groupBy('time', 'student'). \
    pivot('pivot_col'). \
    agg(func.first('var1').alias('var1'),
        func.first('var2').alias('var2'),
        func.first('var3').alias('var3')
        ). \
    show()
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
# |time| student|splitA_var1|splitA_var2|splitA_var3|splitB_var1|splitB_var2|splitB_var3|splitC_var1|splitC_var2|splitC_var3|
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
# | t1|Student2| 1| 2| 8| null| null| null| 7| 4| 0|
# | t2|Student1| 5| 3| 7| 9| 6| 3| 3| 5| 3|
# | t1|Student1| 1| 3| 7| 2| 5| 6| 3| 1| 9|
# +----+--------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+-----------+
Spark will name the new columns with the following nomenclature: <pivot column value>_<aggregation alias>.
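For reference, here is a minimal sketch of how the sample input data_sdf used above could be built for testing; the values mirror the question's table, and an active SparkSession named spark is assumed:
# build the question's input table as a dataframe named data_sdf
data_sdf = spark.createDataFrame(
    [('t1', 'Student1', 'A', 1, 3, 7),
     ('t1', 'Student1', 'B', 2, 5, 6),
     ('t1', 'Student1', 'C', 3, 1, 9),
     ('t2', 'Student1', 'A', 5, 3, 7),
     ('t2', 'Student1', 'B', 9, 6, 3),
     ('t2', 'Student1', 'C', 3, 5, 3),
     ('t1', 'Student2', 'A', 1, 2, 8),
     ('t1', 'Student2', 'C', 7, 4, 0)],
    ['time', 'student', 'split', 'var1', 'var2', 'var3']
)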
Suppose I have a lot of dataframes with a similar structure but different columns. I want to combine all of them together; how can I do this in an easier way?
For example, df1, df2, df3 are as follows:
df1
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
df2
id base1 base2 col1
5 4 100 15
6 1 99 18
7 2 89 9
df3
id base1 base2 col1 col2
9 2 77 12 3
10 1 89 16 5
11 2 88 10 7
and I want the combined result to be:
id base1 base2 col1 col2 col3 col4
1 1 100 30 1 2 3
2 2 200 40 2 3 4
3 3 300 20 4 4 5
5 4 100 15 NaN NaN NaN
6 1 99 18 NaN NaN NaN
7 2 89 9 NaN NaN NaN
9 2 77 12 3 NaN NaN
10 1 89 16 5 NaN NaN
11 2 88 10 7 NaN NaN
Currently I use this code:
from pyspark.sql import SparkSession, HiveContext
from pyspark.sql.functions import lit
from pyspark.sql import Row
def customUnion(df1, df2):
    cols1 = df1.columns
    cols2 = df2.columns
    total_cols = sorted(cols1 + list(set(cols2) - set(cols1)))
    def expr(mycols, allcols):
        def processCols(colname):
            if colname in mycols:
                return colname
            else:
                return lit(None).alias(colname)
        cols = map(processCols, allcols)
        return list(cols)
    appended = df1.select(expr(cols1, total_cols)).union(df2.select(expr(cols2, total_cols)))
    return appended

df_comb1 = customUnion(df1, df2)
df_comb2 = customUnion(df_comb1, df3)
However, if I keep creating new dataframes like df4, df5, etc. (100+), my code becomes messy.
Is there a way to code this in an easier way?
Thanks in advance.
You can manage this with a list of data frames and a function, without necessarily needing to statically name each data frame...
dataframes = [df1,df2,df3] # load data frames
Compute the set of all possible columns:
all_cols = {i for lst in [df.columns for df in dataframes] for i in lst}
#{'base1', 'base2', 'col1', 'col2', 'col3', 'col4', 'id'}
A function to add missing columns to a DF:
import pyspark.sql.functions as f  # needed for f.lit below

def add_missing_cols(df, cols):
    # add every column from cols that df does not already have, as a null literal
    v = df
    for col in [c for c in cols if c not in df.columns]:
        v = v.withColumn(col, f.lit(None))
    return v

completed_dfs = [add_missing_cols(df, all_cols) for df in dataframes]

res = completed_dfs[0]
for df in completed_dfs[1:]:
    # unionByName matches columns by name, so the order in which the
    # missing columns were appended does not matter
    res = res.unionByName(df)
res.show()
+---+-----+-----+----+----+----+----+
| id|base1|base2|col1|col2|col3|col4|
+---+-----+-----+----+----+----+----+
| 1| 1| 100| 30| 1| 2| 3|
| 2| 2| 200| 40| 2| 3| 4|
| 3| 3| 300| 20| 4| 4| 5|
| 5| 4| 100| 15|null|null|null|
| 6| 1| 99| 18|null|null|null|
| 7| 2| 89| 9|null|null|null|
| 9| 2| 77| 12| 3|null|null|
| 10| 1| 89| 16| 5|null|null|
| 11| 2| 88| 10| 7|null|null|
+---+-----+-----+----+----+----+----+
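If you would rather not write the union loop by hand, the same fold can be expressed with functools.reduce; a minimal sketch over the completed_dfs list built above:
from functools import reduce

# fold the list of column-aligned dataframes into a single dataframe
res = reduce(lambda left, right: left.unionByName(right), completed_dfs)
res.show()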
How can I calculate the variance across numerous columns in PySpark?
For example, if the pyspark.sql.dataframe table is:
ID A B C
1 12 15 7
2 6 15 2
3 56 25 25
4 36 12 5
and the output needed is:
ID A B C Variance
1 12 15 7 10.9
2 6 15 2 29.6
3 56 25 25 213.6
4 36 12 5 176.2
There is a variance function in PySpark, but it works only column-wise.
Just concatenate the columns you need with the concat_ws function and use a UDF to calculate the variance, as below:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from statistics import pvariance
def calculateVar(row):
    # row is a comma-separated string of the column values
    data = [float(x.strip()) for x in row.split(",")]
    return pvariance(data)

varUDF = udf(calculateVar, FloatType())

df.withColumn('Variance', varUDF(concat_ws(",", df.a, df.b, df.c))).show()
Output:
+---+---+---+---+---------+
| id| a| b| c| Variance|
+---+---+---+---+---------+
| 1| 12| 15| 7|10.888889|
| 2| 6| 15| 2|29.555555|
| 3| 56| 25| 25|213.55556|
| 4| 36| 12| 5|176.22223|
+---+---+---+---+---------+
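As an aside, the same population variance can also be computed without a UDF, from plain column expressions; a minimal sketch, assuming the same lowercase column names as in the output above:
from pyspark.sql.functions import col

cols = ['a', 'b', 'c']
n = len(cols)

# population variance: mean of squared deviations from the per-row mean
row_mean = sum(col(c) for c in cols) / n
row_var = sum((col(c) - row_mean) ** 2 for c in cols) / n

df.withColumn('Variance', row_var).show()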
I want to calculate a running sum over the last one hour for each transaction using Spark-Scala. I have the following dataframe with three fields and want to calculate the fourth field as given below:
Customer TimeStamp Tr Last_1Hr_RunningSum
Cust-1 6/1/2015 6:51:55 1 1
Cust-1 6/1/2015 6:58:34 3 4
Cust-1 6/1/2015 7:20:46 3 7
Cust-1 6/1/2015 7:40:45 4 11
Cust-1 6/1/2015 7:55:34 5 15
Cust-1 6/1/2015 8:20:34 0 12
Cust-1 6/1/2015 8:34:34 3 12
Cust-1 6/1/2015 9:35:34 7 7
Cust-1 6/1/2015 9:45:34 3 10
Cust-2 6/1/2015 16:26:34 2 2
Cust-2 6/1/2015 16:35:34 1 3
Cust-2 6/1/2015 17:39:34 3 3
Cust-2 6/1/2015 17:43:34 5 8
Cust-3 6/1/2015 17:17:34 6 6
Cust-3 6/1/2015 17:21:34 4 10
Cust-3 6/1/2015 17:45:34 2 12
Cust-3 6/1/2015 17:56:34 3 15
Cust-3 6/1/2015 18:21:34 4 13
Cust-3 6/1/2015 19:24:34 1 1
I want to calculate "Last_1Hr_RunningSum" as new field which look back for one hour from each transaction by customer id and take some of "Tr"(Transaction filed).
For example :Cust-1 at 6/1/2015 8:20:34 will look back till 6/1/2015 7:20:46 and take sum of (0+5+4+3) = 12.
Same way for each row I want to look back for one hour and take sum of all Transaction during that one hour.
I tried running sqlContext.sql with nested query but its giving me error. Also Window function and Row Number over partition is not supported by Spark-Scala SQLContext.
How can I get the sum of last one hour from "Tr" using column 'TimeStamp' with Spark-Scala only.
Thanks in advance.
I tried running sqlContext.sql with a nested query but it gives me an error
Did you try using a join?
df.registerTempTable("input")
val result = sqlContext.sql("""
SELECT
FIRST(a.Customer) AS Customer,
FIRST(a.Timestamp) AS Timestamp,
FIRST(a.Tr) AS Tr,
SUM(b.Tr) AS Last_1Hr_RunningSum
FROM input a
JOIN input b ON
a.Customer = b.Customer
AND b.Timestamp BETWEEN (a.Timestamp - 3600000) AND a.Timestamp
GROUP BY a.Customer, a.Timestamp
ORDER BY a.Customer, a.Timestamp
""")
result.show()
Which prints the expected result:
+--------+-------------+---+-------------------+
|Customer| Timestamp| Tr|Last_1Hr_RunningSum|
+--------+-------------+---+-------------------+
| Cust-1|1420519915000| 1| 1.0|
| Cust-1|1420520314000| 3| 4.0|
| Cust-1|1420521646000| 3| 7.0|
| Cust-1|1420522845000| 4| 11.0|
| Cust-1|1420523734000| 5| 15.0|
| Cust-1|1420525234000| 0| 12.0|
| Cust-1|1420526074000| 3| 12.0|
| Cust-1|1420529734000| 7| 7.0|
| Cust-1|1420530334000| 3| 10.0|
| Cust-2|1420554394000| 2| 2.0|
| Cust-2|1420554934000| 1| 3.0|
| Cust-2|1420558774000| 3| 3.0|
| Cust-2|1420559014000| 5| 8.0|
| Cust-3|1420557454000| 6| 6.0|
| Cust-3|1420557694000| 4| 10.0|
| Cust-3|1420559134000| 2| 12.0|
| Cust-3|1420559794000| 3| 15.0|
| Cust-3|1420561294000| 4| 13.0|
| Cust-3|1420565074000| 1| 1.0|
+--------+-------------+---+-------------------+
(This solution assumes the time is given in milliseconds)
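If the TimeStamp column is still a string as in the question's sample, it first needs to be converted to epoch milliseconds for the query above to work. A minimal sketch of that conversion (shown in PySpark for brevity; the Scala DataFrame API is analogous, and the date pattern is an assumption based on the sample data):
from pyspark.sql.functions import col, unix_timestamp

# unix_timestamp parses the string to epoch seconds; multiply by 1000
# to get the epoch milliseconds the query above expects
df = df.withColumn('TimeStamp',
                   unix_timestamp(col('TimeStamp'), 'M/d/yyyy H:mm:ss') * 1000)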