Spark dataframe duplicate row based on splitting column value in scala

I have the following code in scala:
val fullCertificateSourceDf = certificateSourceDf
  .withColumn("Stage", when(col("Data.WorkBreakdownUp1Summary").isNotNull && col("Data.WorkBreakdownUp1Summary") =!= "", rtrim(regexp_extract($"Data.WorkBreakdownUp1Summary", "^.*?(?= - *[a-zA-Z])", 0))).otherwise(""))
  .withColumn("SubSystem", when(col("Data.ProcessBreakdownSummaryList").isNotNull && col("Data.ProcessBreakdownSummaryList") =!= "", regexp_extract($"Data.ProcessBreakdownSummaryList", "^.*?(?= - *[a-zA-Z])", 0)).otherwise(""))
  .withColumn("System", when(col("Data.ProcessBreakdownUp1SummaryList").isNotNull && col("Data.ProcessBreakdownUp1SummaryList") =!= "", regexp_extract($"Data.ProcessBreakdownUp1SummaryList", "^.*?(?= - *[a-zA-Z])", 0)).otherwise(""))
  .withColumn("Facility", when(col("Data.ProcessBreakdownUp2Summary").isNotNull && col("Data.ProcessBreakdownUp2Summary") =!= "", regexp_extract($"Data.ProcessBreakdownUp2Summary", "^.*?(?= - *[a-zA-Z])", 0)).otherwise(""))
  .withColumn("Area", when(col("Data.ProcessBreakdownUp3Summary").isNotNull && col("Data.ProcessBreakdownUp3Summary") =!= "", regexp_extract($"Data.ProcessBreakdownUp3Summary", "^.*?(?= - *[a-zA-Z])", 0)).otherwise(""))
  .select(
    "Data.ID",
    "Data.CertificateID",
    "Data.CertificateTag",
    "Data.CertificateDescription",
    "Data.WorkBreakdownUp1Summary",
    "Data.ProcessBreakdownSummaryList",
    "Data.ProcessBreakdownUp1SummaryList",
    "Data.ProcessBreakdownUp2Summary",
    "Data.ProcessBreakdownUp3Summary",
    "Data.ActualStartDate",
    "Data.ActualEndDate",
    "Data.ApprovedDate",
    "Data.CurrentState",
    "DataType",
    "PullDate",
    "PullTime",
    "Stage",
    "System",
    "SubSystem",
    "Facility",
    "Area"
  )
  .filter(col("Stage").isNotNull && length(col("Stage")) > 0)
  .filter(
    (col("SubSystem").isNotNull && length(col("SubSystem")) > 0) ||
    (col("System").isNotNull && length(col("System")) > 0) ||
    (col("Facility").isNotNull && length(col("Facility")) > 0) ||
    (col("Area").isNotNull && length(col("Area")) > 0)
  )
  .select("*")
This dataframe, fullCertificateSourceDf, contains the following data (I have hidden some columns for brevity):
I want the data to look like this:
We are splitting on two columns: ProcessBreakdownSummaryList and ProcessBreakdownUp1SummaryList. Both are comma-separated lists.
Please note that if the values in ProcessBreakdownSummaryList (CS10-100-22-10 - Mine Intake Air Fan Heater System, CS10-100-81-10 - Mine Services Switchgear) and ProcessBreakdownUp1SummaryList (CS10-100-22 - Service Shaft Ventilation, CS10-100-81 - Service Shaft Electrical) are the same, we should only split once.
However, if they are different, as in ProcessBreakdownSummaryList (CS10-100-22-10 - Mine Intake Air Fan Heater System, CS10-100-81-10 - Mine Services Switchgear) and ProcessBreakdownUp1SummaryList (CS10-100-22 - Service Shaft Ventilation, CS10-100-34 - Service Shaft Electrical), it should split again for a third row.
Thank you in advance for your help with this.

You can solve it in many ways; I think the easiest approach for complicated processing like this is to use the Scala API directly. You can read all the columns, including "ProcessBreakdownSummaryList" and "ProcessBreakdownUp1SummaryList", compare their values for being the same or different, and emit multiple rows for a single input row. Then flatMap the output to get a dataframe with all the rows you need.
val fullCertificateSourceDf = // your code
fullCertificateSourceDf.map { row =>
  val id = row.getAs[String]("Data.ID")
  // ... read all the other columns the same way
  val processBreakdownSummaryList = row.getAs[String]("Data.ProcessBreakdownSummaryList")
  val processBreakdownUp1SummaryList = row.getAs[String]("Data.ProcessBreakdownUp1SummaryList")
  // split processBreakdownSummaryList on ","
  // split processBreakdownUp1SummaryList on ","
  // compare them for equality
  // let's say you end up with 4 rows:
  // return those rows as a List of tuples of strings, e.g.
  //   List((id, certificateId, certificateTag, ..., distinct values of processBreakdownUp1SummaryList, ...), (...), ...)
  // all columns (id, certificateId, certificateTag, etc.) are repeated for each distinct value
  // of processBreakdownUp1SummaryList and processBreakdownSummaryList
}.flatMap(identity(_)).toDF("column1", "column2", ...)
Here is an example of splitting one row into multiple rows:
import spark.implicits._ // needed for the Dataset encoders used by map/flatMap

val employees = spark.createDataFrame(Seq(
  ("E1", 100.0, "a,b"),
  ("E2", 200.0, "e,f"),
  ("E3", 300.0, "c,d")
)).toDF("employee", "salary", "clubs")

employees.map { r =>
  val clubs = r.getAs[String]("clubs").split(",")
  for {
    c: String <- clubs
  } yield (r.getAs[String]("employee"), r.getAs[Double]("salary"), c)
}.flatMap(identity(_)).toDF("employee", "salary", "clubs").show(false)
The result looks like
+--------+------+-----+
|employee|salary|clubs|
+--------+------+-----+
|E1      |100.0 |a    |
|E1      |100.0 |b    |
|E2      |200.0 |e    |
|E2      |200.0 |f    |
|E3      |300.0 |c    |
|E3      |300.0 |d    |
+--------+------+-----+
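Applying the same pattern to the two columns in your question, a minimal sketch could look like the one below. This is only an illustration under assumptions: a SparkSession named spark is in scope, the two summary columns have already been flattened to plain strings, only a hypothetical CertificateTag column is carried along (the remaining columns would be handled the same way), and the comma-separated entries are simply paired up by position and de-duplicated, so identical lists split only once while differing lists yield additional rows. Using flatMap directly is equivalent to the map plus flatMap(identity) combination above.
import spark.implicits._ // assumes a SparkSession named `spark`

// Hypothetical, simplified input: one carried-along column plus the two list columns.
val src = Seq(
  ("CERT-1",
   "CS10-100-22-10 - Mine Intake Air Fan Heater System, CS10-100-81-10 - Mine Services Switchgear",
   "CS10-100-22 - Service Shaft Ventilation, CS10-100-81 - Service Shaft Electrical")
).toDF("CertificateTag", "ProcessBreakdownSummaryList", "ProcessBreakdownUp1SummaryList")

val exploded = src.flatMap { r =>
  val tag        = r.getAs[String]("CertificateTag")
  val subSystems = r.getAs[String]("ProcessBreakdownSummaryList").split(",").map(_.trim)
  val systems    = r.getAs[String]("ProcessBreakdownUp1SummaryList").split(",").map(_.trim)
  // Pair the entries by position; zipAll keeps leftovers when one list is longer,
  // and distinct collapses repeated pairs so equal lists are split only once.
  subSystems.zipAll(systems, "", "").distinct.toSeq.map { case (sub, sys) => (tag, sub, sys) }
}.toDF("CertificateTag", "SubSystem", "System")

exploded.show(false)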

Related

How to apply conditional expression filter on spark dataframe where the conditional expression is saved in the dataframe column

I have many DataFrames and each of them can have a separate filter to filter out data. The filters are pre-defined as well. I am planning to create a combined dataframe which will contain the filtering expression as one of the columns. In this combined dataframe, I need to apply the filter that is part of the data row itself. For example:
If I have 3 DataFrames like this
val ausDF = Seq(
  ("australia", "Steve Smith", "batter"),
  ("australia", "David Warner", "batter"),
  ("australia", "Pat Cummins", "bowler")
).toDF("country", "player", "speciality")

val indDF = Seq(
  ("india", "Rohit Sharma", "batsman"),
  ("india", "Virat Kohli", "batsman"),
  ("india", "Jaspreet Bumrah", "bowler")
).toDF("country", "player", "speciality")

val engDF = Seq(
  ("england", "Jos Buttler", "bat"),
  ("england", "Joe Root", "bat"),
  ("england", "James Anderson", "bowl")
).toDF("country", "player", "speciality")
I can do a union to create a combined dataframe like this
val cricketersDF = ausDF.union(indDF).union(engDF)
If there is a filter dataframe like this
val batsmanFilter = Seq(
  ("australia", "speciality == \"batter\""),
  ("india", "speciality == \"batsman\""),
  ("england", "speciality == \"bat\"")
).toDF("country", "filter")
I can then join these 2 DataFrames
val batsmanFilterDF = cricketersDF.join(batsmanFilter, "country")
which gives me a dataframe with filters like this
+---------+---------------+----------+-----------------------+
|country  |player         |speciality|filter                 |
+---------+---------------+----------+-----------------------+
|australia|Steve Smith    |batter    |speciality == "batter" |
|australia|David Warner   |batter    |speciality == "batter" |
|australia|Pat Cummins    |bowler    |speciality == "batter" |
|india    |Rohit Sharma   |batsman   |speciality == "batsman"|
|india    |Virat Kohli    |batsman   |speciality == "batsman"|
|india    |Jaspreet Bumrah|bowler    |speciality == "batsman"|
|england  |Jos Buttler    |bat       |speciality == "bat"    |
|england  |Joe Root       |bat       |speciality == "bat"    |
|england  |James Anderson |bowl      |speciality == "bat"    |
+---------+---------------+----------+-----------------------+
Now, what I want is to apply the filter provided in the filter column to get the required result. Something similar to this
batsmanFilterDF.filter(col("filter"))
However, this gives me the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: filter expression '`filter`' of type string is not a boolean.;;
Filter filter#45: string
So, I wanted to know: is there a way to filter based on a conditional expression, using the value specified in a dataframe column?
AFAIK, there is no way to apply a complex filter contained in a column to a dataframe. If the filters were simple, we could design a trick, but you seem to say that they can be complex.
If the dataframe batsmanFilter is small, you can design and apply the filter from the driver. It would go like this:
val filter = batsmanFilter
  .collect
  .map(row => (row.getAs[String]("country"), row.getAs[String]("filter")))
  .map { case (country, filter) =>
    "((country == \"" + country + "\") and (" + filter + "))"
  }
  .reduce(_ + " or " + _)

cricketersDF.where(filter).show
which yields what you seem to expect:
+---------+------------+----------+
|  country|      player|speciality|
+---------+------------+----------+
|australia| Steve Smith|    batter|
|australia|David Warner|    batter|
|    india|Rohit Sharma|   batsman|
|    india| Virat Kohli|   batsman|
|  england| Jos Buttler|       bat|
|  england|    Joe Root|       bat|
+---------+------------+----------+
The advantage of this approach is that only one filter is applied. Yet, it will only work if the batsmanFilter dataframe is reasonably small. If it is not, we could work something out as well, but we would need to know more about the kinds of filters that can appear.
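For reference, with the sample batsmanFilter above, the string built by that collect/map/reduce chain is a single SQL boolean expression along the following lines (the row order may differ, since collect gives no ordering guarantee), and it can be passed to where directly:
cricketersDF.where(
  """((country == "australia") and (speciality == "batter")) or
    |((country == "india") and (speciality == "batsman")) or
    |((country == "england") and (speciality == "bat"))""".stripMargin
).show()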

How to check whether multiple columns values of a row are not null and then add a true/false resulting column in Spark Scala

Hi how's it going? Here are my two dataframes:
val id_df = Seq(("1","gender"),("2","city"),("3","state"),("4","age")).toDF("id","type")
val main_df = Seq(("male","los angeles","null"),("female","new york","new york")).toDF("1","2","3")
Here's what they look like in tabular form:
and this is what I would like the resultant dataframe to look like:
I want to check for all the ids in id_df, if they exist in main_df's columns, then check whether all the id values for that row are not null. If they're all not null, then we put "true" in the meets condition column for that row, otherwise we put "false". Notice how id number 4 for age isn't in main_df's columns, so we just ignore it.
How would I do this?
Thanks so much and have a great day.
Allow me to start with two short observations:
I believe it is safer to avoid naming columns with bare numbers. Think of the case where we need to evaluate the expression 1 is not null: it is ambiguous whether we mean the column 1 or the literal value 1 (see the short illustration right after these observations).
As far as I am aware, it is not efficient to store and process the target columns through a dataframe. That creates overhead which can easily be avoided by using a single Scala collection, i.e. Seq, Array, Set, etc.
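To make the first observation concrete, here is a tiny, hypothetical illustration (it assumes spark.implicits._ is in scope, as in the other snippets): without backticks the expression parser reads 1 as the integer literal, while backticks force it to be treated as the column name.
import org.apache.spark.sql.functions.expr

// a column literally named "1", containing one null value
val df = Seq((null: String, "los angeles"), ("female", "new york")).toDF("1", "2")

df.select(expr("1 is not null")).show()   // literal 1: true for every row
df.select(expr("`1` is not null")).show() // column `1`: false, then true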
And here is the solution to your problem:
import org.apache.spark.sql.functions.col

val id_df = Seq(
  ("c1", "gender"),
  ("c2", "city"),
  ("c3", "state"),
  ("c4", "age")
).toDF("id", "type")

val main_df = Seq(
  ("male", "los angeles", null),
  ("female", "new york", "new york"),
  ("trans", null, "new york")
).toDF("c1", "c2", "c3")

val targetCols = id_df.collect()
  .map{_.getString(0)}              // get id
  .toSet                            // convert the current sequence to a Set (required for the intersection)
  .intersect(main_df.columns.toSet) // get the columns in common with main_df
  .map(col(_).isNotNull)            // convert c1..cN to col(c[i]).isNotNull
  .reduce(_ && _)                   // apply the AND operator between the items
  // (((c1 IS NOT NULL) AND (c2 IS NOT NULL)) AND (c3 IS NOT NULL))

main_df.withColumn("meets_conditions", targetCols).show(false)

// +------+-----------+--------+----------------+
// |c1    |c2         |c3      |meets_conditions|
// +------+-----------+--------+----------------+
// |male  |los angeles|null    |false           |
// |female|new york   |new york|true            |
// |trans |null       |new york|false           |
// +------+-----------+--------+----------------+

Are these values empty or null and how do I drop these columns?

So I have this dataframe which looks like below:
+----------------+----------+-------------+-----------+---------+-------------+
|_manufacturerRef|_masterRef|_nomenclature|_partNumber|_revision|_serialNumber|
+----------------+----------+-------------+-----------+---------+-------------+
|            #id2|     #id19|             |   zaa01948|         | JTJHA31U2400|
|            #id2|     #id29|             |   zaa22408|         |         null|
|            #id2|     #id45|             |   zaa24981|         |         null|
+----------------+----------+-------------+-----------+---------+-------------+
I want to drop the empty columns, which are _nomenclature and _revision as shown in the dataframe above. I have tried various methods, but none of them drops the columns; no method detects these columns as empty. Also, the columns might possibly be of type Struct as well. I am trying something like this:
val cols = xmldf.columns
cols.foreach(c => {
  var currDF = xmldf.select("`" + c + "`")
  currDF.show()
  val df1 = currDF.filter(currDF("`" + c + "`").isNotNull)
  if (df1.count() == 0 || df1.rdd.isEmpty()) {
    xmldf = xmldf.drop(c)
  }
})
The problem with your code is that the columns _nomenclature and _revision aren't really empty: they contain empty strings, not nulls. Because of that, you can't use isNotNull to check whether a cell is empty; you need the =!= operator instead.
You can also use filter and foldLeft instead of foreach if you want to avoid a mutable var.
val df = List(
  ("#id2", "#id19", "", "zaa01947", "", "JTJHA31U2400"),
  ("#id2", "#id29", "", "zaa22408", "", null)
).toDF("_manufacturerRef", "_masterRef", "_nomenclature", "_partNumber", "_revision", "_serialNumber")

val newDf = df.columns
  .filter(c => df.where(df(c) =!= "").isEmpty) // find the columns containing only empty strings
  .foldLeft(df)(_.drop(_))                     // drop all of the found columns from the dataframe

newDf.show()
And as expected, _nomenclature and _revision are dropped in the result:
+----------------+----------+-----------+-------------+
|_manufacturerRef|_masterRef|_partNumber|_serialNumber|
+----------------+----------+-----------+-------------+
|            #id2|     #id19|   zaa01947| JTJHA31U2400|
|            #id2|     #id29|   zaa22408|         null|
+----------------+----------+-----------+-------------+
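If a column may also mix nulls with empty strings (as _serialNumber in the original data suggests), a slightly stricter check is possible. The following is only a sketch under that assumption; it keeps the same foldLeft/drop pattern and treats a column as empty when it holds nothing but nulls or empty strings (like the snippet above, it assumes string-typed columns; Struct columns would need a different check):
val emptyCols = df.columns.filter { c =>
  // a column is "empty" if no row has a non-null, non-empty value in it
  df.where(df(c).isNotNull && df(c) =!= "").isEmpty
}
val cleanedDf = emptyCols.foldLeft(df)(_.drop(_))

cleanedDf.show()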

How to merge two or more columns into one?

I have a streaming DataFrame over which I want to calculate min and avg for some columns.
Instead of getting separate result columns for min and avg after applying the operations, I want to merge the min and avg output into a single column.
The dataframe looks like this:
+---+---+
|  1|  2|
+---+---+
| 24| 55|
| 20| 51|
+---+---+
I thought I'd use a Scala tuple for it, but that does not seem to work:
val res = List("1","2").map(name => (min(col(name)), avg(col(name))).as(s"result($name)"))
All code used:
val res = List("1", "2").map(name => (min(col(name)), avg(col(name))).as(s"result($name)"))

val groupedByTimeWindowDF1 = processedDf
  .groupBy($"xyz", window($"timestamp", "60 seconds"))
  .agg(res.head, res.tail: _*)
I'm expecting the output after applying the min and avg operations to be:
+-----------+-----------+
|  result(1)|  result(2)|
+-----------+-----------+
|     20, 22|     51, 53|
+-----------+-----------+
How should I write the expression?
Use struct standard function:
struct(colName: String, colNames: String*): Column
struct(cols: Column*): Column
Creates a new struct column that composes multiple input columns.
That gives you the values as well as the names (of the columns).
val res = List("1","2").map(name =>
struct(min(col(name)), avg(col(name))) as s"result($name)")
^^^^^^ HERE
The power of struct shows when you want to reference a single field in the struct: you can use its name (not its index).
q.select("structCol.name")
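Building on that, here is a small sketch of how the struct fields can be named explicitly and then selected with dot syntax. It reuses processedDf and the groupBy from the question, and renames the result columns to result_1/result_2 (an assumption, simply to avoid parentheses in column identifiers):
import org.apache.spark.sql.functions.{avg, col, min, struct, window}

val res = List("1", "2").map { name =>
  // name the fields inside the struct so they can be referenced later
  struct(min(col(name)) as "min", avg(col(name)) as "avg") as s"result_$name"
}

val grouped = processedDf
  .groupBy($"xyz", window($"timestamp", "60 seconds"))
  .agg(res.head, res.tail: _*)

// dot syntax selects a single field out of the struct column
grouped.select($"result_1.min", $"result_1.avg").show()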
What you want to do is merge the values of multiple columns into a single column. For this you can use the array function. In this case it would be:
val res = List("1","2").map(name => array(min(col(name)),avg(col(name))).as(s"result($name)"))
Which will give you:
+------------+------------+
|   result(1)|   result(2)|
+------------+------------+
|[20.0, 22.0]|[51.0, 53.0]|
+------------+------------+

Spark columnar performance

I'm a relative beginner in all things Spark. I have a wide dataframe (1000 columns) to which I want to add columns based on whether a corresponding column has missing values,
so
+----+
|   A|
+----+
|   1|
|null|
|   3|
+----+
becomes
+----+-----+
|   A|A_MIS|
+----+-----+
|   1|    0|
|null|    1|
|   3|    0|
+----+-----+
This is part of a custom ml transformer but the algorithm should be clear.
override def transform(dataset: org.apache.spark.sql.Dataset[_]): org.apache.spark.sql.DataFrame = {
  var ds = dataset
  dataset.columns.foreach(c => {
    if (dataset.filter(col(c).isNull).count() > 0) {
      ds = ds.withColumn(c + "_MIS", when(col(c).isNull, 1).otherwise(0))
    }
  })
  ds.toDF()
}
Loop over the columns; if a column has more than zero nulls, create a new column.
The dataset passed in is cached (using the .cache method) and the relevant config settings are the defaults.
This is running on a single laptop for now, and takes on the order of 40 minutes for the 1000 columns, even with a minimal number of rows.
I thought the problem was due to hitting a database, so I tried with a parquet file instead, with the same result. Looking at the jobs UI, it appears to be doing file scans in order to do the count.
Is there a way I can improve this algorithm to get better performance, or tune the caching in some way? Increasing spark.sql.inMemoryColumnarStorage.batchSize just got me an OOM error.
Remove the condition:
if (dataset.filter(col(c).isNull).count() > 0)
and leave only the internal expression. As written, Spark requires #columns separate data scans.
If you want to prune columns, compute the statistics once, as outlined in Count number of non-NaN entries in each column of Spark dataframe with Pyspark, and use a single drop call.
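As an illustration of the first suggestion, a sketch (assuming the same dataset variable, and using a single select rather than repeated withColumn calls to keep the query plan small for 1000 columns) might look like this:
import org.apache.spark.sql.functions.{col, when}

// Add an indicator for every column in one pass; no per-column count() is needed,
// so the data is not scanned once per column just to decide whether to add it.
val withIndicators = dataset.select(
  col("*") +: dataset.columns.map(c =>
    when(col(c).isNull, 1).otherwise(0).as(c + "_MIS")
  ): _*
)
Columns that turn out to have no nulls can then be removed afterwards with a single drop call if the extra indicators are unwanted.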
Here's the code that fixes the problem.
override def transform(dataset: Dataset[_]): DataFrame = {
  var ds = dataset
  val rowCount = dataset.count()
  // count(c) counts the non-null values of column c, so a single agg over all
  // columns replaces the per-column filter/count scans
  val exprs = dataset.columns.map(count(_))
  val colCounts = dataset.agg(exprs.head, exprs.tail: _*).toDF(dataset.columns: _*).first()
  dataset.columns.foreach(c => {
    // add an indicator only for columns that contain some nulls but are not entirely null
    if (colCounts.getAs[Long](c) > 0 && colCounts.getAs[Long](c) < rowCount) {
      ds = ds.withColumn(c + "_MIS", when(col(c).isNull, 1).otherwise(0))
    }
  })
  ds.toDF()
}