Remove rows from Spark DataFrame that ONLY satisfy two conditions - scala

I am using Scala and Spark. I want to remove from a DataFrame the rows that satisfy ALL of the conditions I am specifying, while keeping rows that satisfy only one of the conditions.
For example: let's say I have this DataFrame
+-------+----+
|country|date|
+-------+----+
| A| 1|
| A| 2|
| A| 3|
| B| 1|
| B| 2|
| B| 3|
+-------+----+
and I want to filter out the rows where the country is A and the date is 1 or 2, so the expected output should be:
+-------+----+
|country|date|
+-------+----+
| A| 3|
| B| 1|
| B| 2|
| B| 3|
+-------+----+
As you can see, I am still keeping country B with dates 1 and 2.
I tried to use filter in the following way
df.filter("country != 'A' and date not in (1,2)")
But this filters out every row with date 1 or 2, which is not what I want.
Thanks.

Your current condition is
df.filter("country != 'A' and date not in (1,2)")
which can be read as "accept any country other than A, and accept any date other than 1 or 2". The two conditions are applied independently.
What you want is:
df.filter("not (country = 'A' and date in (1,2))")
i.e. "Find the rows with country A and date of 1 or 2, and reject them"
or equivalently:
df.filter("country != 'A' or date not in (1,2)")
i.e. "If country isn't A, then accept it regardless of the date. If the country is A, then the date mustn't be 1 or 2"
See De Morgan's laws:
not(A or B) = not A and not B
not (A and B) = not A or not B
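For completeness, here is a minimal sketch of the corrected filter, assuming a SparkSession named spark and the example data from the question:

import spark.implicits._

// rebuild the example DataFrame from the question
val df = Seq(("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2), ("B", 3))
  .toDF("country", "date")

// reject only the rows where both conditions hold; both forms are equivalent
df.filter("not (country = 'A' and date in (1,2))").show()
df.filter("country != 'A' or date not in (1,2)").show()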

Related

Collect most occurring unique values across columns after a groupby in Spark

I have the following dataframe
val input = Seq(("ZZ","a","a","b","b"),
("ZZ","a","b","c","d"),
("YY","b","e",null,"f"),
("YY","b","b",null,"f"),
("XX","j","i","h",null))
.toDF("main","value1","value2","value3","value4")
input.show()
+----+------+------+------+------+
|main|value1|value2|value3|value4|
+----+------+------+------+------+
| ZZ| a| a| b| b|
| ZZ| a| b| c| d|
| YY| b| e| null| f|
| YY| b| b| null| f|
| XX| j| i| h| null|
+----+------+------+------+------+
I need to group by the main column and, for each main value, pick the two most frequently occurring values from the remaining columns.
I did the following
val newdf = input.select('main,array('value1,'value2,'value3,'value4).alias("values"))
val newdf2 = newdf.groupBy('main).agg(collect_set('values).alias("values"))
val newdf3 = newdf2.select('main, flatten($"values").alias("values"))
To get the data in the following form
+----+--------------------+
|main| values|
+----+--------------------+
| ZZ|[a, a, b, b, a, b...|
| YY|[b, e,, f, b, b,, f]|
| XX| [j, i, h,]|
+----+--------------------+
Now I need to pick the two most frequently occurring items from the list as two columns, but I don't know how to do that.
So, in this case the expected output should be
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| YY| b| f|
| XX| j| i|
+----+------+------+
null should not be counted, and the final values should be null only if there are no other values to fill them.
Is this the best way to do it? Is there a better way of doing it?
You can use a UDF to select the two values from the array that occur most often.
input.withColumn("values", array("value1", "value2", "value3", "value4"))
.groupBy("main").agg(flatten(collect_list("values")).as("values"))
.withColumn("max", maxUdf('values)) //(1)
.cache() //(2)
.withColumn("value1", 'max.getItem(0))
.withColumn("value2", 'max.getItem(1))
.drop("values", "max")
.show(false)
with maxUdf being defined as
def getMax[T](array: Seq[T]) = {
array
.filter(_ != null) //remove null values
.groupBy(identity).mapValues(_.length) //count occurrences of each value
.toSeq.sortWith(_._2 > _._2) //sort (3)
.map(_._1).take(2) //return the two (or one) most common values
}
val maxUdf = udf(getMax[String] _)
Remarks:
(1) Using a UDF here means that the whole array, with all entries for a single value of main, has to fit into the memory of one Spark executor.
(2) The cache is required here, or the UDF will be called twice, once for value1 and once for value2.
(3) The sortWith here is stable, but it might be necessary to add some extra logic to handle ties when two elements have the same number of occurrences (like i, j and h for the main value XX); one possible tie-break is sketched below.
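For example, a minimal (untested) variant of getMax with a hypothetical name, assuming the same imports as above, that breaks ties deterministically by sorting on the value itself as a secondary key:

// same idea as getMax, but with a deterministic tie-break:
// sort by descending count first, then by the value itself
def getMaxDeterministic(array: Seq[String]): Seq[String] = {
  array
    .filter(_ != null)
    .groupBy(identity).mapValues(_.length)
    .toSeq
    .sortBy { case (value, count) => (-count, value) } // ties resolved alphabetically
    .map(_._1)
    .take(2)
}
val maxDeterministicUdf = udf(getMaxDeterministic _)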
Here is my attempt without a UDF.
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy('main).orderBy('count.desc)
newdf3.withColumn("values", explode('values)) // one row per (main, value)
.groupBy('main, 'values).agg(count('values).as("count"))
.filter("values is not null") // nulls are not counted
.withColumn("target", concat(lit("value"), lit(row_number().over(w)))) // target = value1, value2, ...
.filter("target < 'value3'") // keep only the two most frequent values per main
.groupBy('main).pivot('target).agg(first('values)).show
+----+------+------+
|main|value1|value2|
+----+------+------+
| ZZ| a| b|
| YY| b| f|
| XX| j| null|
+----+------+------+
The last row has a null value because I modified your dataframe in this way:
+----+--------------------+
|main| values|
+----+--------------------+
| ZZ|[a, a, b, b, a, b...|
| YY|[b, e,, f, b, b,, f]|
| XX| [j,,,]| <- For null test
+----+--------------------+

From many pyspark columns (with certain condition) to one column with all the conditions combined. PYSPARK

I have a Python list of PySpark Columns, each holding a certain condition. I want to end up with just one column that combines all the conditions in the list.
I've tried to use the sum() operation to combine all the columns, but it didn't work (obviously). I've also been checking the documentation https://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html
but nothing seemed to work for me.
I'm doing something like this:
my_condition_list = [col(c).isNotNull() for c in some_of_my_sdf_columns]
That returns a list of different PySpark Columns. I want just one column with all the conditions combined with the | operator, so I can use it in a .filter() or .when() clause.
THANK YOU
PySpark won't accept a list as a where/filter condition; it accepts either a string or a Column expression.
The way you tried won't work as-is; you need to tweak a few things. Below are two approaches for this:
data = [(("ID1", 3, None)), (("ID2", 4, 12)), (("ID3", None, 3))]
df = spark.createDataFrame(data, ["ID", "colA", "colB"])
df.show()
from pyspark.sql import functions as F
way - 1
#below change df_name if you have any other name
df_name = "df"
my_condition_list = ["%s['%s'].isNotNull()"%(df_name, c) for c in df.columns]
print (my_condition_list[0])
"df['ID'].isNotNull()"
print (" & ".join(my_condition_list))
"df['ID'].isNotNull() & df['colA'].isNotNull() & df['colB'].isNotNull()"
print (eval(" & ".join(my_condition_list)))
Column<b'(((ID IS NOT NULL) AND (colA IS NOT NULL)) AND (colB IS NOT NULL))'>
df.filter(eval(" & ".join(my_condition_list))).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID2| 4| 12|
+---+----+----+
df.filter(eval(" | ".join(my_condition_list))).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID1| 3|null|
|ID2| 4| 12|
|ID3|null| 3|
+---+----+----+
way - 2
my_condition_list = ["%s is not null"%c for c in df.columns]
print (my_condition_list[0])
'ID is not null'
print (" and ".join(my_condition_list))
'ID is not null and colA is not null and colB is not null'
df.filter(" and ".join(my_condition_list)).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID2| 4| 12|
+---+----+----+
df.filter(" or ".join(my_condition_list)).show()
+---+----+----+
| ID|colA|colB|
+---+----+----+
|ID1| 3|null|
|ID2| 4| 12|
|ID3|null| 3|
+---+----+----+
Preferred way is way-2

pyspark/dataframe - creating a nested structure

I'm using PySpark with DataFrames and would like to create a nested structure as below.
Before:
Column 1 | Column 2 | Column 3
--------------------------------
A | B | 1
A | B | 2
A | C | 1
After:
Column 1 | Column 4
--------------------------------
A | [B : [1,2]]
A | [C : [1]]
Is this doable?
I don't think you can get that exact output, but you can come close. The problem is the key names for Column 4: in Spark, structs need to have a fixed set of fields known in advance. But let's leave that for later; first, the aggregation:
import pyspark
from pyspark.sql import functions as F
sc = pyspark.SparkContext()
spark = pyspark.sql.SparkSession(sc)
data = [('A', 'B', 1), ('A', 'B', 2), ('A', 'C', 1)]
columns = ['Column1', 'Column2', 'Column3']
data = spark.createDataFrame(data, columns)
data.createOrReplaceTempView("data")
data.show()
# Result
+-------+-------+-------+
|Column1|Column2|Column3|
+-------+-------+-------+
| A| B| 1|
| A| B| 2|
| A| C| 1|
+-------+-------+-------+
nested = spark.sql("SELECT Column1, Column2, STRUCT(COLLECT_LIST(Column3) AS data) AS Column4 FROM data GROUP BY Column1, Column2")
nested.toJSON().collect()
# Result
['{"Column1":"A","Column2":"C","Column4":{"data":[1]}}',
'{"Column1":"A","Column2":"B","Column4":{"data":[1,2]}}']
Which is almost what you want, right? The problem is that if you do not know your key names in advance (that is, the values in Column 2), Spark cannot determine the structure of your data. Also, I am not entirely sure how you can use the value of a column as a key for a struct unless you use a UDF (maybe combined with a PIVOT?):
datatype = 'struct<B:array<bigint>,C:array<bigint>>' # Add any other potential keys here.
@F.udf(datatype)
def replace_struct_name(column2_value, column4_value):
    return {column2_value: column4_value['data']}
nested.withColumn('Column4', replace_struct_name(F.col("Column2"), F.col("Column4"))).toJSON().collect()
# Output
['{"Column1":"A","Column2":"C","Column4":{"C":[1]}}',
'{"Column1":"A","Column2":"B","Column4":{"B":[1,2]}}']
This of course has the drawback that the number of keys must be discrete and known in advance, otherwise other key values will be silently ignored.
First, a reproducible example of your dataframe:
from pyspark.sql import SQLContext
from pyspark.sql import functions as F

js = [{"col1": "A", "col2":"B", "col3":1},{"col1": "A", "col2":"B", "col3":2},{"col1": "A", "col2":"C", "col3":1}]
jsrdd = sc.parallelize(js)
sqlContext = SQLContext(sc)
jsdf = sqlContext.read.json(jsrdd)
jsdf.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
| A| B| 1|
| A| B| 2|
| A| C| 1|
+----+----+----+
Now, lists are not stored as key-value pairs. You can either use a dictionary or simply use collect_list() after doing a groupby on col2.
jsdf.groupby(['col1', 'col2']).agg(F.collect_list('col3')).show()
+----+----+------------------+
|col1|col2|collect_list(col3)|
+----+----+------------------+
| A| C| [1]|
| A| B| [1, 2]|
+----+----+------------------+

How to group by a column on a dataframe and applying single value to columns of all rows grouped?

I have a DataFrame (Scala) and I want to do something like the below on it:
I want to group by column 'a', pick any one of the values of column 'b' within each group, and apply it to all rows of that group, i.e. for a = 1, b should be either x or y or h on all 3 rows, and the rest of the columns should be unaffected.
Any help on this?
You can try this, i.e., create another data frame that contains the a and b columns, where b has one value per a, and then join it back with the original data frame:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.partitionBy($"a").orderBy($"b")
// create the window object so that we can create a column that gives unique row number
// for each unique a
(df.withColumn("rn", row_number.over(w)).where($"rn" === 1).select("a", "b")
// create the row number column for each unique a and choose the first row for each group
// which returns a reduced data frame one row per group
.join(df.select("a", "c"), Seq("a"), "inner").show)
// join the reduced data frame back with the original data frame(a,c columns), then b column
// will have just one value
+---+---+---+
| a| b| c|
+---+---+---+
| 1| h| g|
| 1| h| y|
| 1| h| x|
| 2| c| d|
| 2| c| x|
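As a side note, if any single value of b per group is acceptable, a shorter variant is to take first("b") over a window partitioned by a. This is only a sketch, assuming the same df and implicits as above and a hypothetical window name wFirst:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

// pick the first b (in b's ordering) within each group of a
// and apply it to every row of that group
val wFirst = Window.partitionBy($"a").orderBy($"b")
df.withColumn("b", first($"b").over(wFirst)).show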

Spark Dataframe sliding window over pair of rows

I have an eventlog in csv consisting of three columns timestamp, eventId and userId.
What I would like to do is append a new column nextEventId to the dataframe.
An example eventlog:
eventlog = sqlContext.createDataFrame(Array((20160101, 1, 0),(20160102,3,1),(20160201,4,1),(20160202, 2,0))).toDF("timestamp", "eventId", "userId")
eventlog.show(4)
|timestamp|eventId|userId|
+---------+-------+------+
| 20160101| 1| 0|
| 20160102| 3| 1|
| 20160201| 4| 1|
| 20160202| 2| 0|
+---------+-------+------+
The desired end result would be:
|timestamp|eventId|userId|nextEventId|
+---------+-------+------+-----------+
| 20160101| 1| 0| 2|
| 20160102| 3| 1| 4|
| 20160201| 4| 1| Nil|
| 20160202| 2| 0| Nil|
+---------+-------+------+-----------+
So far I've been messing around with sliding windows but can't figure out how to compare 2 rows...
val w = Window.partitionBy("userId").orderBy(asc("timestamp")) //should be a sliding window over 2 rows...
val nextNodes = second($"eventId").over(w) //should work if there are only 2 rows
What you're looking for is lead (or lag). Using the window you already defined:
import org.apache.spark.sql.functions.lead
eventlog.withColumn("nextEventId", lead("eventId", 1).over(w))
For a true sliding window (like a sliding average) you can use the rowsBetween or rangeBetween clauses of the window definition, but that is not really needed here. Nevertheless, example usage could look like this:
val w2 = Window.partitionBy("userId")
.orderBy(asc("timestamp"))
.rowsBetween(-1, 0)
avg($"foo").over(w2)