PySpark filter condition

I have two filters: the first checks for an exact match and collects that item, and the second looks for two different matches. The second filter should override the first if it is true.
Filter 1 example:
df = df_vals.where(F.col("x") == "1a").select()
Filter 2 example:
df = df_vals.where(((F.col("x") == "1a") | (F.col("x") == "2b")) | ((F.col("x") == "1a") | (F.col("x") == "2c"))).select(...)  # select the items in the list
So I need advice on how to override the first filter when the second filter is also true in PySpark. Ideally, the first filter will only pass when the second filter isn't true.

For a specific row, if x already has the value 1a then it cannot also be 2b, since it is already assigned 1a, is that right? And are you sure you want to use or in the second filter? If so, Filter 1 is already covered by Filter 2, since you always look for 1a. Can you update your question to include the desired output? Mock data is fine.
Sample dataframe... I added a y column as a reference for the filter later.
+---+---+
| x| y|
+---+---+
| 1a| 1|
| 1a| 2|
| 2b| 3|
| 2c| 4|
| 2b| 5|
| 1a| 6|
| 1a| 7|
| zz| 8| // to be ignored
| zz| 9| // to be ignored
| zz| 10| // to be ignored
| zz| 11| // to be ignored
+---+---+
Applying the filter below
df.where(
((F.col('x') == '1a') | (F.col('x') == '2b')) | \
((F.col('x') == '1a') | ((F.col('x') == '2c')))) \
.show()
will give you this result; rows with zz values in column x are excluded by the filter above.
+---+---+
| x| y|
+---+---+
| 1a| 1|
| 1a| 2|
| 2b| 3|
| 2c| 4|
| 2b| 5|
| 1a| 6|
| 1a| 7|
+---+---+
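If the intent is that the second filter should take precedence whenever both could apply, one way (a minimal sketch built on the mock data above, not part of the original answer) is to encode the precedence with chained when clauses and keep only the rows that matched either filter:
import pyspark.sql.functions as F

# Sketch: "filter 2" (2b/2c) is evaluated first, so it wins when both filters match;
# rows matching neither filter (e.g. x == "zz") get a null label and are dropped.
df_result = (
    df.withColumn(
        "matched_filter",
        F.when(F.col("x").isin("2b", "2c"), F.lit("filter2"))
         .when(F.col("x") == "1a", F.lit("filter1"))
    )
    .where(F.col("matched_filter").isNotNull())
)
df_result.show()
If you only need the surviving rows and not the label, this collapses to df.where(F.col('x').isin('1a', '2b', '2c')), which is equivalent to the filter shown above.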


Pyspark filter where value is in another dataframe

I have two data frames. I need to filter one to only show values that are contained in the other.
table_a:
+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+
table_b:
+---+
|BID|
+---+
| 1 |
| 2 |
+---+
In the end I want to filter table_a down to only the IDs that are in table_b, like this:
+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+
Here is what I'm trying to do
result_table = table_a.filter(table_b.BID.contains(table_a.AID))
But this doesn't seem to be working. It looks like I'm getting ALL values.
NOTE: I can't add any imports other than from pyspark.sql.functions import col
You can join the two tables and specify how = 'left_semi'
A left semi-join returns values from the left side of the relation that has a match with the right.
result_table = table_a.join(table_b, (table_a.AID == table_b.BID), \
how = "left_semi").drop("BID")
result_table.show()
+---+---+
|AID|foo|
+---+---+
| 1|bar|
| 2|bar|
+---+---+
If the second dataframe has duplicates or multiple values and you only want the distinct ones, the approach below can be useful:
Create the dataframes
import pyspark.sql.functions as F
df = spark.createDataFrame([(1,"bar"),(2,"bar"),(3,"bar"),(4,"bar")],["col1","col2"])
df_lookup = spark.createDataFrame([(1,1),(1,2)],[ "id","val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
| 1| bar|
| 2| bar|
| 3| bar|
| 4| bar|
+----+----+
+---+---+
| id|val|
+---+---+
| 1| 1|
| 1| 2|
+---+---+
Get all the unique values of the val column in the second dataframe and collect them into a Python list:
df_lookup_var = df_lookup.groupBy("id").agg(F.collect_set("val").alias("val")).collect()[0][1]
print(df_lookup_var)
df = df.withColumn("case_col", F.when(F.col("col1").isin(df_lookup_var), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))
df.show()
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
| 1| bar| 1|
| 2| bar| 1|
+----+----+--------+
This should work too (it collects the BID values to the driver, so it is only practical when table_b is small):
table_a.where(col("AID").isin([row.BID for row in table_b.select("BID").collect()]))

Spark adding indexes to dataframe and append other dataset that doesn't have index

I have a dataset with userid and index columns.
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
+---------+--------+
I want to append a new data frame to it and add an index to the newly added rows.
The userid is unique and the existing data frame will not have the Dataframe 2 user ids.
+----------+
| userid |
+----------+
| user11|
| user21|
| user41|
| user51|
| user64|
+----------+
The expected output with newly added userid and index
+---------+--------+
| userid | index|
+---------+--------+
| user1| 1|
| user2| 2|
| user3| 3|
| user4| 4|
| user5| 5|
| user6| 6|
| user7| 7|
| user8| 8|
| user9| 9|
| user10| 10|
| user11| 11|
| user21| 12|
| user41| 13|
| user51| 14|
| user64| 15|
+---------+--------+
Is it possible to achieve this by passing the max index value, so the index for the second dataframe starts from that value?
If the userid has some ordering, then you can use the row_number function. Even if it does not, you can add an id using monotonically_increasing_id(). For now I assume that userid can be ordered, so you can do this:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df_merge = df1.select('userid').union(df2.select('userid'))
w=Window.orderBy('userid')
df_result = df_merge.withColumn('indexid',F.row_number().over(w))
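The question also asks whether the new rows can simply start counting from the existing maximum index. A minimal sketch of that (with hypothetical names df_existing for the indexed dataframe and df_new for the one holding only userid) could look like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Current maximum index in the existing dataframe
max_index = df_existing.agg(F.max("index")).collect()[0][0]

# Number the new rows 1..n and shift them past the existing maximum.
# Note: orderBy without partitionBy pulls df_new into a single partition,
# so this is only reasonable when the new dataframe is small.
w_new = Window.orderBy("userid")
df_new_indexed = df_new.withColumn("index", F.row_number().over(w_new) + F.lit(max_index))

df_result = df_existing.select("userid", "index").union(df_new_indexed.select("userid", "index"))
This keeps the new indexes consecutive with the existing maximum (11, 12, ... in the example above).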
EDIT: after discussions in the comments.
#%% Test data and imports
import pyspark.sql.functions as F
from pyspark.sql import Window
df = sqlContext.createDataFrame([('a',100),('ab',50),('ba',300),('ced',60),('d',500)],schema=['userid','index'])
df1 = sqlContext.createDataFrame([('fgh',100),('ff',50),('fe',300),('er',60),('fi',500)],schema=['userid','dummy'])
#%%
#%% Merge the two dataframes, with a null column as the index
df1=df1.withColumn('index', F.lit(None))
df_merge = df.select(df.columns).union(df1.select(df.columns))
#%% Define a window that places the newly added rows last, ordered by userid
#%% The user ids, even though they are random strings, can be ordered
w = Window.orderBy(F.col('index').asc_nulls_last(), F.col('userid'))  # if possible add a partition column here, otherwise all your data will come in one partition; consider salting
#%% Keep the existing index where present; for new rows, take the last non-null
#%% index seen in the window (i.e. the current maximum) plus a running row count
df_final = df_merge.withColumn(
    "index_new",
    F.when(~F.col('index').isNull(), F.col('index'))
     .otherwise(F.last(F.col('index'), ignorenulls=True).over(w) + F.sum(F.lit(1)).over(w))
)
#%% If number of rows in main dataframe is huge, then add an offset in the above line
df_final.show()
+------+-----+---------+
|userid|index|index_new|
+------+-----+---------+
| ab| 50| 50|
| ced| 60| 60|
| a| 100| 100|
| ba| 300| 300|
| d| 500| 500|
| er| null| 506|
| fe| null| 507|
| ff| null| 508|
| fgh| null| 509|
| fi| null| 510|
+------+-----+---------+

Show all pyspark columns after group and agg

I wish to group by a column and then find the max of another column. Lastly, I want to show all the columns based on this condition. However, when I use my code, it only shows two columns and not all of them.
# Normal way of creating dataframe in pyspark
sdataframe_temp = spark.createDataFrame([
(2,2,'0-2'),
(2,23,'22-24')],
['a', 'b', 'c']
)
sdataframe_temp2 = spark.createDataFrame([
(4,6,'4-6'),
(5,7,'6-8')],
['a', 'b', 'c']
)
# Concat two different pyspark dataframe
sdataframe_union_1_2 = sdataframe_temp.union(sdataframe_temp2)
sdataframe_union_1_2_g = sdataframe_union_1_2.groupby('a').agg({'b':'max'})
sdataframe_union_1_2_g.show()
output:
+---+------+
| a|max(b)|
+---+------+
| 5| 7|
| 2| 23|
| 4| 6|
+---+------+
Expected output:
+---+------+-----+
| a|max(b)| c |
+---+------+-----+
| 5| 7|6-8 |
| 2| 23|22-24|
| 4| 6|4-6 |
+---+------+-----+
You can use a Window function to make it work:
Method 1: Using Window function
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window().partitionBy("a").orderBy(F.desc("b"))
(sdataframe_union_1_2
.withColumn('max_val', F.row_number().over(w) == 1)
.where("max_val == True")
.drop("max_val")
.show())
+---+---+-----+
| a| b| c|
+---+---+-----+
| 5| 7| 6-8|
| 2| 23|22-24|
| 4| 6| 4-6|
+---+---+-----+
Explanation
Window functions are useful when we want to attach a new column to the existing set of columns.
In this case, I tell the Window function to partition by column a (partitionBy('a')) and sort column b in descending order (F.desc('b')). This makes the first row in each group the one holding b's max value.
Then we use F.row_number() to filter the max values where row number equals 1.
Finally, we drop the new column since it is not being used after filtering the data frame.
Method 2: Using groupby + inner join
f = sdataframe_union_1_2.groupby('a').agg(F.max('b').alias('b'))
sdataframe_union_1_2.join(f, on=['a','b'], how='inner').show()
+---+---+-----+
| a| b| c|
+---+---+-----+
| 2| 23|22-24|
| 5| 7| 6-8|
| 4| 6| 4-6|
+---+---+-----+
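One difference between the two methods: row_number keeps exactly one row per group even when several rows tie for the maximum of b, while the groupby + join keeps every tied row. If you want the tie-keeping behaviour with a window instead of a join, a small sketch (not from the original answer) is to compare each row against the group maximum:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Keep every row whose b equals the maximum of b within its group of a
w = Window.partitionBy("a")
(sdataframe_union_1_2
 .withColumn("max_b", F.max("b").over(w))
 .where(F.col("b") == F.col("max_b"))
 .drop("max_b")
 .show())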

spark withcolumn create a column duplicating values from existining column

I am having a problem figuring this out. Here is the problem statement:
Let's say I have a dataframe. I want to select the value of column C where column B's value is foo, then create a new column D and repeat that value ("3") for all rows.
+---+----+---+
| A| B| C|
+---+----+---+
| 4|blah| 2|
| 2| | 3|
| 56| foo| 3|
|100|null| 5|
+---+----+---+
want it to become:
+---+----+---+-----+
| A| B| C| D |
+---+----+---+-----+
| 4|blah| 2| 3 |
| 2| | 3| 3 |
| 56| foo| 3| 3 |
|100|null| 5| 3 |
+---+----+---+-----+
You will have to extract the column C value (i.e. 3) for the row where column B is foo:
import org.apache.spark.sql.functions._
val value = df.filter(col("B") === "foo").select("C").first()(0)
Then use that value with withColumn to create a new column D via the lit function:
df.withColumn("D", lit(value)).show(false)
You should get your desired output.
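Since the rest of this page uses PySpark, a rough Python equivalent of the same idea (a sketch, assuming the same dataframe df) would be:
import pyspark.sql.functions as F

# Grab the C value from the first row where B == "foo", then attach it to every row as a literal column
value = df.filter(F.col("B") == "foo").select("C").first()[0]
df.withColumn("D", F.lit(value)).show(truncate=False)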

How to call print from otherwise in spark scala?

I have a dataframe where I'm checking a condition and performing an operation accordingly. I'm using the when and otherwise functions and trying to print the failing rows in the otherwise branch, but nothing is printed. Any help will be appreciated.
joinDF=
+--------+-------------+----------+--------+-------+
| ID | A | B | C | D |
+--------+-------------+----------+--------+-------+
| 9574| F| 005912| 2016022| 10|
| 9576| F| 005912| 2016022| 21|
| 9578| F| 005912| 2016022| 0|
| 9580| F| 005912| 2016022| 19|
| 9582| F| 005912| 2016022| 89|
+--------+-------------+----------+--------+-------+
joinDF
.withColumn("Validate",when(joinDF("D") =!= 0 ,lit(ture)).otherwise(print(joinDF("ID"))))
The print in otherwise runs once on the driver when the column expression is built, not once per row, so it never prints anything per record. What you actually want is quite simple and straightforward:
Seq(("A",1),("B",0)).toDF("key","value")
.withColumn("verdict", when($"value"=!=0, lit("true")).otherwise("false")).show
+---+-----+-------+
|key|value|verdict|
+---+-----+-------+
| A| 1| true|
| B| 0| false|
+---+-----+-------+
You don't need if, else or udfs
With your example:
Seq(("A",1),("B",0)).toDF("ID","D").withColumn("validate",when($"D" =!= 0 ,lit("true")).otherwise($"ID")).show
+---+---+--------+
| ID| D|validate|
+---+---+--------+
| A| 1| true|
| B| 0| B|
+---+---+--------+
If you are trying to print the value when it does not match, then you can create a udf as below:
// udf that returns "true" when D is non-zero; otherwise it prints the ID and returns it
val validate = udf((d: String, id: String) => {
  if (d != "0") "true"
  else {
    print(s"ID = ${id}")
    id
  }
})
// add a new column named Validate by passing columns D and ID (cast to strings) to the udf
joinDF.withColumn("Validate", validate($"D".cast("string"), $"ID".cast("string")))
Note that print inside a udf runs on the executors, so on a cluster the output shows up in the executor logs rather than the driver console.
Hope this helps!