Transform scala to Spark - scala

I have to transform the code below into Spark, but I don't understand what exactly Seq does in this code:
val tempFactDF = unionTempDF.join(fact.select("x", "y", "d", "f", "s"),
  Seq("x", "y", "d", "f")).dropDuplicates

Here the join is performed over multiple columns, and Seq("x","y","d","f") simply lists the names of the columns to join on (the columns must exist on both sides).
It is roughly equivalent to the following, except that with the Seq form each join column appears only once in the result, whereas the expression form keeps both copies:
val joiningTable = fact.select("x", "y", "d", "f", "s")
unionTempDF.join(joiningTable, unionTempDF("x") === joiningTable("x") &&
  unionTempDF("y") === joiningTable("y") &&
  unionTempDF("d") === joiningTable("d") &&
  unionTempDF("f") === joiningTable("f"))

Related

isNullOrEmpty function in spark to check column in data frame is null or empty string

How can I check whether the columns of a dataframe are null or empty in Spark?
Example:
type IdentifiedDataFrame = (SourceIdentfier, DataFrame)

def splitRequestIntoDFsWithAndWithoutTransactionId(df: DataFrame): Seq[IdentifiedDataFrame] = {
  Seq(
    (DeltaTableStream(RequestWithTransactionId), df.filter(col(RequestLocationCodeColName).isNull
      && col(ServiceNumberColName).isNull
      && col(DateOfServiceColName).isNull
      && col(TransactionIdColName).isNotNull)),
    (DeltaTableStream(RequestWithoutTransactionId), df.filter(col(RequestLocationCodeColName).isNotNull
      && col(ServiceNumberColName).isNotNull
      && col(DateOfServiceColName).isNotNull))
  )
}
Note: this code only checks for null values in the columns, and I want to check for both null and empty string.
Please help
You can use the isnull function for the null check and a null-safe comparison against the empty string, then combine the per-column conditions and pass the result to filter, as below:
import org.apache.spark.sql.functions.{col, isnull, lit}

val columns = List("column1", "column2")
// true when the column is null or an empty string
val filter = columns.map(c => isnull(col(c)) || (col(c) <=> lit("")))
  .reduce(_ and _)
df.filter(filter)

how to filter isNullOrEmpty in spark scala

In the code below, I am filtering based on isNull, and now I want to filter on both null and empty string. Can anyone please help me with this code:
type IdentifiedDataFrame = (SourceIdentfier, DataFrame)

def splitRequestIntoDFsWithAndWithoutTransactionId(df: DataFrame): Seq[IdentifiedDataFrame] = {
  Seq(
    (DeltaTableStream(RequestWithTransactionId),
      df.filter(
        col(RequestLocationCodeColName).isNull &&
        col(ServiceNumberColName).isNull &&
        col(DateOfServiceColName).isNull &&
        col(TransactionIdColName).isNotNull)),
    (DeltaTableStream(RequestWithoutTransactionId),
      df.filter(
        col(RequestLocationCodeColName).isNotNull &&
        col(ServiceNumberColName).isNotNull &&
        col(DateOfServiceColName).isNotNull)))
}
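One way to extend these filters so that an empty string counts as missing, sketched along the lines of the answer to the previous question (the isNullOrEmpty/isPresent helper names are mine; the column-name constants are the ones from the question):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// "missing" means null or empty string; "present" means neither
def isNullOrEmpty(name: String): Column = col(name).isNull || (col(name) <=> "")
def isPresent(name: String): Column = col(name).isNotNull && !(col(name) <=> "")

// first branch: the three request columns are missing, the transaction id is present
df.filter(
  isNullOrEmpty(RequestLocationCodeColName) &&
  isNullOrEmpty(ServiceNumberColName) &&
  isNullOrEmpty(DateOfServiceColName) &&
  isPresent(TransactionIdColName))

// second branch: the three request columns are present
df.filter(
  isPresent(RequestLocationCodeColName) &&
  isPresent(ServiceNumberColName) &&
  isPresent(DateOfServiceColName))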

Arbitrarily number of filters on dataframe

I have a number of filters that I need to apply to a DataFrame in Spark, but I only know at runtime which filters to use. Currently I am adding them as individual filter calls, but that fails if one of the filters is not defined:
myDataFrame
.filter(_filter1)
.filter(_filter2)
.filter(_filter3)...
I can't really figure out how to dynamically exclude, e.g., _filter2 at runtime if it is not needed.
Should I do it by creating one big filter:
var filter = _filter1
if (_filter2 != null)
filter = filter.and(_filter2)
...
Or is there a good pattern for this in Spark that I haven't found?
One possible solution is to default all filters to lit(true):
import org.apache.spark.sql.functions._
val df = Seq(1, 2, 3).toDF("x")
val filter_1 = lit(true)
val filter_2 = col("x") > 1
val filter_3 = lit(true)
val filtered = df.filter(filter_1).filter(filter_2).filter(filter_3)
This will keep null out of your code and trivially true predicates will be pruned from the execution plan:
filtered.explain
== Physical Plan ==
*Project [value#1 AS x#3]
+- *Filter (value#1 > 1)
+- LocalTableScan [value#1]
You can of course make it even simpler and use a sequence of predicates:
import org.apache.spark.sql.Column
val preds: Seq[Column] = Seq(lit(true), col("x") > 1, lit(true))
df.where(preds.foldLeft(lit(true))(_ and _))
and, if implemented right, skip placeholders completely.
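For instance, a small sketch of that idea, modelling the optional filters as Option[Column] (the names here are mine, not from the question):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

// None stands for "this filter is not needed on this run"
val maybeFilters: Seq[Option[Column]] = Seq(Some(col("x") > 1), None, Some(col("x") < 10))

val combined = maybeFilters.flatten.reduceOption(_ and _).getOrElse(lit(true))
df.where(combined)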
First I would get rid of the null filters (for a DataFrame the element type is Row):
val filters: List[Row => Boolean] = nullableFilters.filter(_ != null)
Then define a function to chain the filters:
def chainFilters[A](filters: List[A => Boolean])(v: A) = filters.forall(f => f(v))
Now you can simply apply them to your df:
df.filter(chainFilters(filters))
Why not:
var df = // load
if (_filter2 != null) {
  df = df.filter(_filter2)
}
etc.
Alternatively, create a list of filters:
var df = // load
val filters = Seq(filter1, filter2, filter3, ...)
filters.filter(_ != null).foreach(x => df = df.filter(x))
// Sorry if there is some mistake in the code, it's more of an idea - currently I can't test it
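The same idea can also be written without mutating df, as a small variation (not from the original answer):
val filtered = filters.filter(_ != null).foldLeft(df)((acc, f) => acc.filter(f))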

Spark scala remove columns containing only null values

Is there a way to remove the columns of a Spark DataFrame that contain only null values?
(I am using scala and Spark 1.6.2)
At the moment I am doing this:
var validCols: List[String] = List()
for (col <- df_filtered.columns) {
  val count = df_filtered
    .select(col)
    .distinct
    .count
  println(col, count)
  if (count >= 2) {
    validCols ++= List(col)
  }
}
to build the list of columns containing at least two distinct values, and then use it in a select().
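The select() step I mention would then look something like this (assuming validCols ends up non-empty):
val df_valid = df_filtered.select(validCols.head, validCols.tail: _*)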
Thank you !
I had the same problem and I came up with a similar solution in Java. In my opinion there is no other way of doing it at the moment.
for (String column : df.columns()) {
  long count = df.select(column).distinct().count();
  if (count == 1 && df.select(column).first().isNullAt(0)) {
    df = df.drop(column);
  }
}
I'm dropping all columns that contain exactly one distinct value and whose first value is null. This way I can be sure that I don't drop columns where all values are the same but not null.
Here's a Scala example that removes the null columns but only queries the data once (faster):
def removeNullColumns(df: DataFrame): DataFrame = {
  var dfNoNulls = df
  val exprs = df.columns.map((_ -> "count")).toMap
  val cnts = df.agg(exprs).first
  for (c <- df.columns) {
    // count(c) counts only non-null values, so 0 means the column is entirely null
    val uses = cnts.getAs[Long]("count(" + c + ")")
    if (uses == 0) {
      dfNoNulls = dfNoNulls.drop(c)
    }
  }
  dfNoNulls
}
A more idiomatic version of @swdev's answer:
private def removeNullColumns(df: DataFrame): DataFrame = {
  val exprs = df.columns.map((_ -> "count")).toMap
  val cnts = df.agg(exprs).first
  df.columns
    .filter(c => cnts.getAs[Long]("count(" + c + ")") == 0)
    .foldLeft(df)((df, col) => df.drop(col))
}
If the dataframe is of reasonable size, I write it as JSON and then reload it. Since the JSON writer omits null fields, the schema inferred on reload won't include the all-null columns, and you end up with a lighter dataframe.
Scala snippet:
originalDataFrame.write.json(tempJsonPath)
val lightDataFrame = spark.read.json(tempJsonPath)
Here's @timo-strotmann's solution in pySpark syntax:
for column in df.columns:
    count = df.select(column).distinct().count()
    if count == 1 and df.first()[column] is None:
        df = df.drop(column)

Deduping events using hiveContext in spark with scala

I am trying to dedupe event records, using the hiveContext in Spark with Scala.
Converting the df to an rdd gives a compilation error saying "object Tuple23 is not a member of package scala". There is a known issue that a Scala tuple can't have more than 22 elements.
Is there any other way to dedupe?
val events = hiveContext.table("default.my_table")
val valid_events = events.select(
events("key1"),events("key2"),events("col3"),events("col4"),events("col5"),
events("col6"),events("col7"),events("col8"),events("col9"),events("col10"),
events("col11"),events("col12"),events("col13"),events("col14"),events("col15"),
events("col16"),events("col17"),events("col18"),events("col19"),events("col20"),
events("col21"),events("col22"),events("col23"),events("col24"),events("col25"),
events("col26"),events("col27"),events("col28"),events("col29"),events("epoch")
)
//events are deduped based on latest epoch time
val valid_events_rdd = valid_events.rdd.map(t => {
((t(0),t(1)),(t(2),t(3),t(4),t(5),t(6),t(7),t(8),t(9),t(10),t(11),t(12),t(13),t(14),t(15),t(16),t(17),t(18),t(19),t(20),t(21),t(22),t(23),t(24),t(25),t(26),t(28),t(29)))
})
// reduce by key so we will only get one record for every primary key
val reducedRDD = valid_events_rdd.reduceByKey((a,b) => if ((a._29).compareTo(b._29) > 0) a else b)
//Get all the fields
reducedRDD.map(r => r._1 + "," + r._2._1 + "," + r._2._2).collect().foreach(println)
Off the top of my head:
use case classes, which no longer have a size limit. Just keep in mind that case classes won't work correctly in the Spark REPL,
use Row objects directly and extract only the keys (see the sketch at the end of this answer),
operate directly on a DataFrame:
import org.apache.spark.sql.functions.{col, max}

val maxs = df
  .groupBy(col("key1"), col("key2"))
  .agg(max(col("epoch")).alias("epoch"))
  .as("maxs")

df.as("df")
  .join(maxs,
    col("df.key1") === col("maxs.key1") &&
    col("df.key2") === col("maxs.key2") &&
    col("df.epoch") === col("maxs.epoch"))
  .drop(maxs("epoch"))
  .drop(maxs("key1"))
  .drop(maxs("key2"))
or with a window function (ordering by epoch descending so that row number 1 is the latest event per key):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber

val w = Window.partitionBy($"key1", $"key2").orderBy($"epoch".desc)
df.withColumn("rn", rowNumber.over(w)).where($"rn" === 1).drop("rn")