Arbitrary number of filters on a DataFrame - Scala

I have a number of filters that I need to apply to a DataFrame in Spark, but I only know at runtime which filters to use. Currently I am adding them as individual filter calls, but that fails if one of the filters is not defined:
myDataFrame
.filter(_filter1)
.filter(_filter2)
.filter(_filter3)...
I can't figure out how to dynamically exclude, for example, _filter2 at runtime if it is not needed.
Should I do it by creating one big filter:
var filter = _filter1
if (_filter2 != null)
  filter = filter.and(_filter2)
...
Or is there a good pattern for this in Spark that I haven't found?

One possible solution is to default all filters to lit(true):
import org.apache.spark.sql.functions._
val df = Seq(1, 2, 3).toDF("x")
val filter_1 = lit(true)
val filter_2 = col("x") > 1
val filter_3 = lit(true)
val filtered = df.filter(filter_1).filter(filter_2).filter(filter_3)
This will keep null out of your code and trivially true predicates will be pruned from the execution plan:
filtered.explain
== Physical Plan ==
*Project [value#1 AS x#3]
+- *Filter (value#1 > 1)
+- LocalTableScan [value#1]
You can of course make it even simpler and use a sequence of predicates:
import org.apache.spark.sql.Column
val preds: Seq[Column] = Seq(lit(true), col("x") > 1, lit(true))
df.where(preds.foldLeft(lit(true))(_ and _))
and, if implemented right, skip placeholders completely.
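For example, one way to skip the placeholders completely is to keep the optional predicates as Option[Column] and only fold the defined ones. A minimal sketch reusing the df and col("x") > 1 example above (the optionalPreds name is just for illustration):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

// None marks a filter that is not needed at runtime
val optionalPreds: Seq[Option[Column]] = Seq(None, Some(col("x") > 1), None)

val combined: Column = optionalPreds
  .flatten                 // drop the missing predicates entirely
  .reduceOption(_ and _)   // AND the remaining ones together
  .getOrElse(lit(true))    // no predicates at all => keep every row

val filtered = df.where(combined)
This way the trivially true placeholder never even enters the expression, instead of relying on the optimizer to prune it.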

First, I would get rid of the null filters:
val filters: List[A => Boolean] = nullableFilters.filter(_ != null)
Then define function to chain filters:
def chainFilters[A](filters:List[A => Boolean])(v:A) = filters.forall(f => f(v))
Now you can simply apply filters to your df:
df.filter(chainFilters(filters))

Why not:
var df = // load
if (_filter2 != null) {
  df = df.filter(_filter2)
}
etc
Alternatively, create a list of filters:
var df = // load
val filters = Seq(filter1, filter2, filter3, ...)
filters.filter(_ != null).foreach(x => df = df.filter(x))
// Sorry if there is a mistake in the code, it's more of an idea - I can't test the code right now
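If you prefer to avoid the mutable df variable, the same idea can be written as a fold over the non-null filters. A small sketch under the same assumptions (toy data, possibly-null Column filters):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import spark.implicits._   // assuming a SparkSession named spark

val df = Seq(1, 2, 3).toDF("x")     // stand-in for your loaded DataFrame
val filter1: Column = col("x") > 0
val filter2: Column = null          // this filter is not defined at runtime
val filter3: Column = col("x") < 3

// thread the DataFrame through a fold, applying only the non-null filters
val result = Seq(filter1, filter2, filter3)
  .filter(_ != null)
  .foldLeft(df)((acc, f) => acc.filter(f))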

Related

Unable to flatten array of DataFrames

I have an array of DataFrames that I obtain by using randomSplit() in this manner:
val folds = df.randomSplit(Array.fill(5)(1.0/5)) //Array[Dataset[Row]]
I'll be iterating over folds using a for loop, where I will be dropping the ith entry inside folds and storing it separately. Then I will be using all the others as another DataFrame, as in my code below:
val df = spark.read.format("csv").load("xyz")
val folds = df.randomSplit(Array.fill(5)(1.0/5))
for (i <- folds.indices) {
  var ts = folds
  val testSet = ts(i)
  ts = ts.drop(i)
  var trainSet = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], testSet.schema)
  for (j <- ts.indices) {
    trainSet = trainSet.union(ts(j))
  }
}
While this does serve my purpose, I was also trying another approach where I would still separate folds into ts and testSet, and then use the flatten function on the remaining entries inside ts to create another DataFrame, using something like this:
val df = spark.read.format("csv").load("xyz")
val folds = df.randomSplit(Array.fill(5)(1.0/5))
for (i <- folds.indices) {
  var ts = folds
  val testSet = ts(i)
  ts = ts.drop(i)
  var trainSet = ts.flatten
}
But on the trainSet line I get the error: No Implicits Found for parameter asTrav: Dataset[Row] => Traversable[U_]. I have also done import spark.implicits._ after initializing the SparkSession.
My end goal with the creation of trainSet after flatten is to retrieve a DataFrame created after joining (union) the other Dataset[Row]s inside ts. I'm not sure where I'm going wrong.
I'm using Spark 2.4.5 with Scala 2.11.12
EDIT 1: Added how I read the Dataframe
I'm not sure what your intention is here, but instead of using mutable variables and flattening you can do a recursive iteration like this:
val folds = df.randomSplit(Array.fill(5)(1.0/5)) //Array[Dataset[Row]]
// start from an empty DataFrame that has the same schema as df
val trainSet = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], df.schema)

def go(items: Array[Dataset[Row]], result: Array[Dataset[Row]]): Array[Dataset[Row]] = items match {
  case arr @ Array(_, _*) =>
    val res = arr.map { t =>
      trainSet.union(t)
    }
    go(arr.tail, result ++ res)
  case Array() => result
}

go(folds, Array.empty)
As far as I can see from your code, testSet isn't actually used inside the loop body anyway.
I have replaced that for loop with a simple reduce:
val trainSet = ts.reduce((a,b) => a.union(b))
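For completeness, a minimal sketch of the whole fold loop without mutable DataFrames, assuming folds is the Array[Dataset[Row]] produced by randomSplit above. Note that Array.drop(i) removes the first i elements, not the i-th one, which is why this sketch selects by index instead:
val folds = df.randomSplit(Array.fill(5)(1.0 / 5))

for (i <- folds.indices) {
  val testSet = folds(i)
  // union every fold except the i-th one into a single training DataFrame
  val trainSet = folds.indices
    .filter(_ != i)
    .map(j => folds(j))
    .reduce((a, b) => a.union(b))
  // ... train / evaluate on (trainSet, testSet) here
}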

Spark-rdd manipulating data

I have a sample data like below:
UserId,ProductId,Category,Action
1,111,Electronics,Browse
2,112,Fashion,Click
3,113,Kids,AddtoCart
4,114,Food,Purchase
5,115,Books,Logout
6,114,Food,Click
7,113,Kids,AddtoCart
8,115,Books,Purchase
9,111,Electronics,Click
10,112,Fashion,Purchase
3,112,Fashion,Click
I need to generate a list of users who are interested in either the "Fashion" category or the "Electronics" category, but not in both. A user is interested if he/she has performed any of these actions (Click / AddtoCart / Purchase). Using Spark/Scala, I have done the following so far:
val rrd1 = sc.textFile("/user/harshit.kacker/datametica_logs.csv")
val rrd2 = rrd1.map(x => {
  val c = x.split(",")
  (c(0).toInt, x)
})
val rrd3 = rrd1.filter(x=> x.split(",")(2) == "Fashion" || x.split(",")(2) == "Electronics")
val rrd4 = rrd3.filter(x=> x.split(",")(3)== "Click" || x.split(",")(3)=="Purchase" || x.split(",")(3)=="AddtoCart")
rrd4.collect.foreach(println)
2,112,Fashion,Click
9,111,Electronics,Click
10,112,Fashion,Purchase
3,112,Fashion,Click
4,111,Electronics,Click
19,112,Fashion,Click
9,112,Fashion,Purchase
2,112,Fashion,Click
2,111,Electronics,Click
1,112,Fashion,Purchase
Now I have to work on the emphasized part, generating the list of users who are interested in either "Fashion" or "Electronics" but not in both categories, and get the desired output:
10,Fashion
3,Fashion
4,Electronics
19,Fashion
1,Fashion
meaning user IDs that have both Fashion and Electronics as a category should be eliminated. How can I achieve this?
Start by parsing the input text file in to tuples:
val srcPath = "/user/harshit.kacker/datametica_logs.csv"
// parse test file in to tuples:
val rdd = spark.sparkContext.textFile(srcPath)
val rows = rdd.map(line => line.split(",")).map(row => (row(0), row(1), row(2), row(3)))
val header = rows.first
// drop the header:
val logs = rows.filter(row => row != header)
Filter the RDD by interest criteria:
val interests = logs.filter(log =>
List("Click", "AddtoCart", "Purchase").contains(log._4)
)
Filter for fashion and electronics separately:
val fashion = interests.filter(row => row._3 == "Fashion")
val electronics = interests.filter(row => row._3 == "Electronics")
Find the common user IDs between fashion and electronics:
val fashionIds = fashion.map(_._1).distinct
val electronicsIds = electronics.map(_._1).distinct
val commonIds = fashionIds.intersection(electronicsIds).collect()
Combine the fashion and electronics rows and filter the ids common between both:
val finalRdd = (fashion ++ electronics)
.filter(log => !commonIds.contains(log._1))
.map(log => (log._1, log._3))
.distinct()
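An alternative sketch that avoids collecting the intersection to the driver: restrict interests to the two categories, group the distinct categories per user, and keep the users that have exactly one (the finalRdd2 name is just for illustration):
val finalRdd2 = interests
  .filter(log => log._3 == "Fashion" || log._3 == "Electronics")
  .map(log => (log._1, log._3))                  // (UserId, Category)
  .distinct()
  .groupByKey()                                  // UserId -> its distinct categories
  .filter { case (_, cats) => cats.size == 1 }   // one category, but not both
  .map { case (user, cats) => (user, cats.head) }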
Edit: Using DataFrame
// using dataframes:
val df = spark.read.option("header", "true").csv(srcPath)
val interestDf = df.where($"Action".isin("Click", "Purchase", "AddtoCart"))
val fashionDf = interestDf.where($"Category" === "Fashion")
val electronicsDf = interestDf.where($"Category" === "Electronics")
val joinDf = electronicsDf.alias("e").join(fashionDf.alias("f"), Seq("UserId"), "outer")
.where($"e.Category".isNull || $"f.Category".isNull)
val finalDf = joinDf.select($"UserId", when($"e.Category".isNull, $"f.Category").otherwise($"e.Category").as("Category")).distinct
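The same exclusion can also be expressed with a groupBy instead of a self-join; a sketch using collect_set on the interestDf above (finalDf2 is just an illustrative name):
import org.apache.spark.sql.functions.{collect_set, explode, size}

val finalDf2 = interestDf
  .where($"Category".isin("Fashion", "Electronics"))
  .groupBy($"UserId")
  .agg(collect_set($"Category").as("Categories"))  // distinct categories per user
  .where(size($"Categories") === 1)                // one of the two, but not both
  .select($"UserId", explode($"Categories").as("Category"))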

Spark scala remove columns containing only null values

Is there a way to remove the columns of a Spark DataFrame that contain only null values?
(I am using scala and Spark 1.6.2)
At the moment I am doing this:
var validCols: List[String] = List()
for (col <- df_filtered.columns) {
  val count = df_filtered
    .select(col)
    .distinct
    .count
  println(col, count)
  if (count >= 2) {
    validCols ++= List(col)
  }
}
to build the list of columns containing at least two distinct values, and then use it in a select().
Thank you !
I had the same problem and I came up with a similar solution in Java. In my opinion there is no other way of doing it at the moment.
for (String column : df.columns()) {
  long count = df.select(column).distinct().count();
  if (count == 1 && df.select(column).first().isNullAt(0)) {
    df = df.drop(column);
  }
}
I'm dropping all columns that contain exactly one distinct value and whose first value is null. This way I can be sure that I don't drop columns where all values are the same but not null.
Here's a Scala example to remove null columns that only queries the data once (faster):
def removeNullColumns(df: DataFrame): DataFrame = {
  var dfNoNulls = df
  val exprs = df.columns.map((_ -> "count")).toMap
  val cnts = df.agg(exprs).first
  for (c <- df.columns) {
    val uses = cnts.getAs[Long]("count(" + c + ")")
    if (uses == 0) {
      dfNoNulls = dfNoNulls.drop(c)
    }
  }
  dfNoNulls
}
A more idiomatic version of @swdev's answer:
private def removeNullColumns(df: DataFrame): DataFrame = {
  val exprs = df.columns.map((_ -> "count")).toMap
  val cnts = df.agg(exprs).first
  df.columns
    .filter(c => cnts.getAs[Long]("count(" + c + ")") == 0)
    .foldLeft(df)((df, col) => df.drop(col))
}
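A variant of the same single-pass idea that avoids the "count(col)" string keys by aliasing each count back to its column name; a sketch (the function name just mirrors the ones above):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count}

private def removeAllNullColumns(df: DataFrame): DataFrame = {
  // count(column) ignores nulls, so a count of 0 means the column contains only nulls
  val counts = df.select(df.columns.map(c => count(col(c)).alias(c)): _*).first()
  df.columns
    .filter(c => counts.getAs[Long](c) == 0L)
    .foldLeft(df)((acc, c) => acc.drop(c))
}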
If the DataFrame is of reasonable size, I write it as JSON and then reload it. The inferred schema will ignore all-null columns and you'll have a lighter DataFrame.
scala snippet:
originalDataFrame.write.json(tempJsonPath)
val lightDataFrame = spark.read.json(tempJsonPath)
Here's @timo-strotmann's solution in PySpark syntax:
for column in df.columns:
    count = df.select(column).distinct().count()
    if count == 1 and df.first()[column] is None:
        df = df.drop(column)

How to skip line in spark rdd map action based on if condition

I have a file and I want to give it to an mllib algorithm. So I am following the example and doing something like:
val data = sc.textFile(my_file)
  .map { line =>
    val parts = line.split(",")
    Vectors.dense(parts.slice(1, parts.length).map(x => x.toDouble).toArray)
  }
This works, except that sometimes I have a missing feature. That is, sometimes one column of a row does not have any data, and I want to throw away such rows.
So I want to do something like map { line => if (containsMissing(line)) skipLine else ... /* same as before */ }
How can I implement this skipLine action?
You can use the filter function to filter out such lines:
val data = sc.textFile(my_file)
  .filter(_.split(",").length == cols)
  .map { line =>
    // your code
  }
Assuming the variable cols holds the number of columns in a valid row.
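If you don't know the column count up front, one way is to take it from the file's first line (a small sketch; this assumes the first line, e.g. a header, has the full set of columns):
// number of columns in a valid row, taken from the file's first line
val cols = sc.textFile(my_file).first().split(",").length
Note that String.split(",") drops trailing empty fields, so rows missing their last value also fail the length check; empty values in the middle of a row need a content check like the flatMap approach below.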
You can use flatMap, Some and None for this:
def missingFeatures(stuff: Array[String]): Boolean = ??? // Determine if features are missing

val data = sc.textFile(my_file)
  .flatMap { line =>
    val parts = line.split(",")
    if (missingFeatures(parts)) None
    else Some(Vectors.dense(parts.slice(1, parts.length).map(x => x.toDouble).toArray))
  }
This way you avoid iterating over the RDD more than once.
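If "missing" can also mean a field that doesn't parse as a number, the same flatMap shape works with scala.util.Try; a sketch (Try silently discards any row where toDouble fails, so use it only if that is what you want):
import scala.util.Try
import org.apache.spark.mllib.linalg.Vectors

val data = sc.textFile(my_file).flatMap { line =>
  val parts = line.split(",")
  // Try(...).toOption is None for rows with empty or non-numeric fields, so flatMap drops them
  Try(Vectors.dense(parts.slice(1, parts.length).map(_.toDouble))).toOption
}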
Java code to skip empty lines / header from Spark RDD:
First the imports:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
Now, the filter keeps rows that have 17 columns and are not the header row, which starts with VendorID.
Function<String, Boolean> isValid = row -> (row.split(",").length == 17 && !(row.startsWith("VendorID")));
JavaRDD<String> taxis = sc.textFile("datasets/trip_yellow_taxi.data")
    .filter(isValid);

Deduping events using hiveContext in Spark with Scala

I am trying to dedupe event records, using the hiveContext in spark with Scala.
Converting the df to an rdd gives a compilation error saying "object Tuple23 is not a member of package scala". There is a known issue that a Scala Tuple can't have 23 or more elements.
Is there any other way to dedupe?
val events = hiveContext.table("default.my_table")
val valid_events = events.select(
events("key1"),events("key2"),events("col3"),events("col4"),events("col5"),
events("col6"),events("col7"),events("col8"),events("col9"),events("col10"),
events("col11"),events("col12"),events("col13"),events("col14"),events("col15"),
events("col16"),events("col17"),events("col18"),events("col19"),events("col20"),
events("col21"),events("col22"),events("col23"),events("col24"),events("col25"),
events("col26"),events("col27"),events("col28"),events("col29"),events("epoch")
)
//events are deduped based on latest epoch time
val valid_events_rdd = valid_events.rdd.map(t => {
((t(0),t(1)),(t(2),t(3),t(4),t(5),t(6),t(7),t(8),t(9),t(10),t(11),t(12),t(13),t(14),t(15),t(16),t(17),t(18),t(19),t(20),t(21),t(22),t(23),t(24),t(25),t(26),t(28),t(29)))
})
// reduce by key so we will only get one record for every primary key
val reducedRDD = valid_events_rdd.reduceByKey((a,b) => if ((a._29).compareTo(b._29) > 0) a else b)
//Get all the fields
reducedRDD.map(r => r._1 + "," + r._2._1 + "," + r._2._2).collect().foreach(println)
Off the top of my head:
use case classes, which no longer have a size limit; just keep in mind that case classes won't work correctly in the Spark REPL,
use Row objects directly and extract only the keys (see the sketch at the end of this answer),
operate directly on a DataFrame,
import org.apache.spark.sql.functions.{col, max}

val maxs = df
  .groupBy(col("key1"), col("key2"))
  .agg(max(col("epoch")).alias("epoch"))
  .as("maxs")

df.as("df")
  .join(maxs,
    col("df.key1") === col("maxs.key1") &&
    col("df.key2") === col("maxs.key2") &&
    col("df.epoch") === col("maxs.epoch"))
  .drop(maxs("epoch"))
  .drop(maxs("key1"))
  .drop(maxs("key2"))
or with a window function:
val w = Window.partitionBy($"key1", $"key2").orderBy($"epoch".desc)
df.withColumn("rn", rowNumber.over(w)).where($"rn" === 1).drop("rn")
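And a sketch of the second option, keeping the data as Row objects and extracting only the key; this assumes epoch can be read as a Long, so adjust getAs to the real column type:
import org.apache.spark.sql.Row

val deduped = valid_events.rdd
  .keyBy(r => (r.getAs[Any]("key1"), r.getAs[Any]("key2")))
  .reduceByKey { (a, b) =>
    // keep the record with the latest epoch for each (key1, key2) pair
    if (a.getAs[Long]("epoch") > b.getAs[Long]("epoch")) a else b
  }
  .values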