Remove unwanted columns from a dataframe in Scala

I am pretty new to Scala.
I have a dataframe with multiple columns, some of which have null values at random places. I need to find every column that has at least one null value and drop it from the dataframe.
#### Input
| Column 1 | Column 2 | Column 3 | Column 4 | Column 5 |
| --------------| --------------| --------------| --------------| --------------|
|(123)-456-7890 | 123-456-7890 |(123)-456-789 | |(123)-456-7890 |
|(123)-456-7890 | 123-4567890 |(123)-456-7890 |(123)-456-7890 | null |
|(123)-456-7890 | 1234567890 |(123)-456-7890 |(123)-456-7890 | null |
#### Output
| Column 1 | Column 2 |
| --------------| --------------|
|(123)-456-7890 | 123-456-7890 |
|(123)-456-7890 | 123-4567890 |
|(123)-456-7890 | 1234567890 |
Please advise.
Thank you.

I would recommend a 2-step approach:
1. Exclude columns that are not nullable from the check, since they cannot contain nulls
2. Assemble a list of the remaining columns that contain at least one null and drop them all
Creating a sample dataframe with a mix of nullable/non-nullable columns:
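import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
// Needed for toDF outside spark-shell, where it is not imported automatically
import spark.implicits._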
val df0 = Seq(
(Some(1), Some("x"), Some("a"), None),
(Some(2), Some("y"), None, Some(20.0)),
(Some(3), Some("z"), None, Some(30.0))
).toDF("c1", "c2", "c3", "c4")
val newSchema = StructType(df0.schema.map{ field =>
if (field.name == "c1") field.copy(name = "c1_notnull", nullable = false) else field
})
// Revised dataframe with non-nullable `c1`
val df = spark.createDataFrame(df0.rdd, newSchema)
Carrying out steps 1 and 2:
val nullableCols = df.schema.collect{ case StructField(name, _, true, _) => name }
// nullableCols: Seq[String] = List(c2, c3, c4)
val colsWithNulls = nullableCols.filter(c => df.where(col(c).isNull).count > 0)
// colsWithNulls: Seq[String] = List(c3, c4)
df.drop(colsWithNulls: _*).show
// +----------+---+
// |c1_notnull| c2|
// +----------+---+
// | 1| x|
// | 2| y|
// | 3| z|
// +----------+---+
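The filter above runs one count job per nullable column. As a possible alternative, the null counts for all columns can be gathered in a single aggregation. This is a minimal sketch, assuming a local SparkSession and a hypothetical `sample` dataframe that mirrors the one above; it is not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, when}

val spark = SparkSession.builder().master("local[*]").appName("drop-null-columns").getOrCreate()
import spark.implicits._

// Hypothetical sample data: c3 and c4 each contain at least one null
val sample = Seq(
  (1, "x", Option("a"), Option.empty[Double]),
  (2, "y", Option.empty[String], Option(20.0)),
  (3, "z", Option.empty[String], Option(30.0))
).toDF("c1", "c2", "c3", "c4")

// when(...) without otherwise yields null for non-matching rows, and count skips nulls,
// so each aggregated cell holds the number of rows where that column is null
val nullCounts = sample
  .select(sample.columns.map(c => count(when(col(c).isNull, 1)).as(c)): _*)
  .first()

// Names of the columns with at least one null
val colsToDrop = sample.columns.filter(c => nullCounts.getAs[Long](c) > 0)

val cleaned = sample.drop(colsToDrop: _*)
cleaned.show()
```

This scans the data once regardless of the number of columns. Column names containing dots would need backtick quoting in `col(...)`.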

Related

Fetch the partial value from a column having key value pairs and assign it to new column in Spark Dataframe

I have a data frame as below
+----+-----------------------------+
|id | att |
+----+-----------------------------+
| 25 | {"State":"abc","City":"xyz"}|
| 26 | null |
| 27 | {"State":"pqr"} |
+----+-----------------------------+
I want a dataframe with the columns id and City, where City holds the value of the City attribute if the att column has one, and null otherwise
+----+------+
|id | City |
+----+------+
| 25 | xyz |
| 26 | null |
| 27 | null |
+----+------+
Language : Scala
You can use from_json to parse your JSON data into a Map, then access the map item using one of:
- the getItem method of the Column class
- the default accessor, i.e. map("map_key")
- the element_at function
import org.apache.spark.sql.functions.{element_at, from_json}
import org.apache.spark.sql.types.{MapType, StringType}
import sparkSession.implicits._
val df = Seq(
(25, """{"State":"abc","City":"xyz"}"""),
(26, null),
(27, """{"State":"pqr"}""")
).toDF("id", "att")
val schema = MapType(StringType, StringType)
df.select($"id", from_json($"att", schema).getItem("City").as("City"))
//or df.select($"id", from_json($"att", schema)("City").as("City"))
//or df.select($"id", element_at(from_json($"att", schema), "City").as("City"))
// +---+----+
// | id|City|
// +---+----+
// | 25| xyz|
// | 26|null|
// | 27|null|
// +---+----+
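When only a single field is needed, `get_json_object` with a JSON path may be a shorter option than parsing the whole string into a map. A minimal sketch, assuming a local SparkSession; the name `attDf` is chosen here for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

val spark = SparkSession.builder().master("local[*]").appName("json-field").getOrCreate()
import spark.implicits._

// Same sample data as in the answer above
val attDf = Seq(
  (25, """{"State":"abc","City":"xyz"}"""),
  (26, null),
  (27, """{"State":"pqr"}""")
).toDF("id", "att")

// "$.City" is a JSON path; a null input or a missing City field yields null
val cities = attDf.select($"id", get_json_object($"att", "$.City").as("City"))
cities.show()
```

The result is always a string column, so a cast would be needed for numeric fields.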

How to use dataframe inside an udf and parse the data in spark scala

I am new to Scala and Spark. I need to create a new dataframe using a UDF.
I have two dataframes. df1 contains three columns, namely company, id, and name.
df2 contains two columns, namely company and message.
df2 JSON will be like this
{"company": "Honda", "message": ["19:[\"cost 500k\"],[\"colour blue\"]","20:[\"cost 600k\"],[\"colour white\"]"]}
{"company": "BMW", "message": ["19:[\"cost 1500k\"],[\"colour blue\"]"]}
df2 will be like this:
+-------+--------------------+
|company| message|
+-------+--------------------+
| Honda|[19:["cost 500k"]...|
| BMW|[19:["cost 1500k"...|
+-------+--------------------+
|-- company: string (nullable = true)
|-- message: array (nullable = true)
| |-- element: string (containsNull = true)
df1 will be like this:
+----------+---+-------+
|company | id| name|
+----------+---+-------+
| Honda | 19| city |
| Honda | 20| amaze |
| BMW | 19| x1 |
+----------+---+-------+
I want to create a new dataframe by replacing the id in each df2 message with the matching name from df1, for example:
["city:[\"cost 500k\"],[\"colour blue\"]","amaze:[\"cost 600k\"],[\"colour white\"]"]
I tried a UDF that takes message as Seq[String] together with company, but I was not able to select the data from df1 inside it.
I want the output like this:
+-------+----------------------+
|company| message |
+-------+----------------------+
| Honda|[city:["cost 500k"]...|
| BMW|[x1:["cost 1500k"... |
+-------+----------------------+
I tried the following UDF, but I got errors while selecting the name:
def asdf(categories: Seq[String]): String = {
  var data = ""
  for (w <- categories) {
    if (w != null) {
      var id = w.toString().indexOf(":")
      var namea = df1.select("name").where($"id" === 20).map(_.getString(0)).collect()
      var name = namea(0)
      println(name)
      var ids = w.toString().substring(0, id)
      var li = w.toString().replace(ids, name)
      println(li)
      data = data + li
    }
  }
  data
}
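Please check the code below. A DataFrame cannot be used inside a UDF, because the UDF body runs on the executors, so join the two dataframes instead. Note that in this answer df1 holds the messages and df2 holds the id-to-name mapping, which is the reverse of the naming in the question.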
scala> df1.show(false)
+-------+---------------------------------------------------------------------+
|company|message |
+-------+---------------------------------------------------------------------+
|Honda |[19:["cost 500k"],["colour blue"], 20:["cost 600k"],["colour white"]]|
|BMW |[19:["cost 1500k"],["colour blue"]] |
+-------+---------------------------------------------------------------------+
scala> df2.show(false)
+-------+---+-----+
|company|id |name |
+-------+---+-----+
|Honda | 19|city |
|Honda | 20|amaze|
|BMW | 19|x1 |
+-------+---+-----+
val replaceFirst = udf((message: String,id:String,name:String) =>
if(message.contains(s"""${id}:""")) message.replaceFirst(s"""${id}:""",s"${name}:") else ""
)
val jdf =
df1
.withColumn("message",explode($"message"))
.join(df2,df1("company") === df2("company"),"inner")
.withColumn(
"message_data",
replaceFirst($"message",trim($"id"),$"name")
)
.filter($"message_data" =!= "")
scala> jdf.show(false)
+-------+---------------------------------+-------+---+-----+------------------------------------+
|company|message |company|id |name |message_data |
+-------+---------------------------------+-------+---+-----+------------------------------------+
|Honda |19:["cost 500k"],["colour blue"] |Honda | 19|city |city:["cost 500k"],["colour blue"] |
|Honda |20:["cost 600k"],["colour white"]|Honda | 20|amaze|amaze:["cost 600k"],["colour white"]|
|BMW |19:["cost 1500k"],["colour blue"]|BMW | 19|x1 |x1:["cost 1500k"],["colour blue"] |
+-------+---------------------------------+-------+---+-----+------------------------------------+
scala> df1.join(df2,df1("company") === df2("company"),"inner").select(df1("company"),df1("message"),df2("id"),df2("name")).withColumn("message",explode($"message")).withColumn("message",replaceFirst($"message",trim($"id"),$"name")).filter($"message" =!= "").groupBy($"company").agg(collect_list($"message").cast("string").as("message")).show(false)
+-------+--------------------------------------------------------------------------+
|company|message |
+-------+--------------------------------------------------------------------------+
|Honda |[amaze:["cost 600k"],["colour white"], city:["cost 500k"],["colour blue"]]|
|BMW |[x1:["cost 1500k"],["colour blue"]] |
+-------+--------------------------------------------------------------------------+
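The join above matches on company only and then filters out the non-matching rows through the UDF. A possible variation is to extract the numeric prefix first and join on both company and id, which needs no UDF. This is a sketch under the assumption that every message starts with digits followed by a colon; the names `messages` and `names` are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, concat, explode, regexp_extract, regexp_replace}

val spark = SparkSession.builder().master("local[*]").appName("replace-id-prefix").getOrCreate()
import spark.implicits._

val messages = Seq(
  ("Honda", Seq("""19:["cost 500k"],["colour blue"]""", """20:["cost 600k"],["colour white"]""")),
  ("BMW", Seq("""19:["cost 1500k"],["colour blue"]"""))
).toDF("company", "message")

val names = Seq(("Honda", 19, "city"), ("Honda", 20, "amaze"), ("BMW", 19, "x1"))
  .toDF("company", "id", "name")

val exploded = messages
  .withColumn("message", explode($"message"))
  // Digits before the first colon, e.g. "19"; messages without such a prefix get a null id
  .withColumn("id", regexp_extract($"message", "^(\\d+):", 1).cast("int"))

val renamed = exploded
  .join(names, Seq("company", "id"))
  // Strip the numeric prefix and put the name in its place
  .withColumn("message", concat($"name", regexp_replace($"message", "^\\d+", "")))
  .groupBy("company")
  .agg(collect_list($"message").as("message"))

renamed.show(false)
```

As with the original answer, `collect_list` does not guarantee the order of the messages within each company.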

Spark scala create multiple columns from array column

I want to create multiple columns from an array column.
Input dataframe
Car name | details
Toyota | [[year,2000],[price,20000]]
Audi | [[mpg,22]]
Expected dataframe
Car name | year | price | mpg
Toyota | 2000 | 20000 | null
Audi | null | null | 22
You can try this
Let's define the data
scala> val carsDF = Seq(("toyota",Array(("year", 2000), ("price", 100000))), ("Audi", Array(("mpg", 22)))).toDF("car", "details")
carsDF: org.apache.spark.sql.DataFrame = [car: string, details: array<struct<_1:string,_2:int>>]
scala> carsDF.show(false)
+------+-----------------------------+
|car |details |
+------+-----------------------------+
|toyota|[[year,2000], [price,100000]]|
|Audi |[[mpg,22]] |
+------+-----------------------------+
Splitting the data & accessing the values in the data
scala> carsDF.withColumn("split", explode($"details")).withColumn("col", $"split"("_1")).withColumn("val", $"split"("_2")).select("car", "col", "val").show
+------+-----+------+
| car| col| val|
+------+-----+------+
|toyota| year| 2000|
|toyota|price|100000|
| Audi| mpg| 22|
+------+-----+------+
Define the list of columns that are required
scala> val colNames = Seq("mpg", "price", "year", "dummy")
colNames: Seq[String] = List(mpg, price, year, dummy)
Pivoting on the column names defined above gives the required output. Note that avg returns doubles, as the output shows.
Because the pivot values come from this one sequence, adding a new column only requires adding its name there.
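scala> val weDF = carsDF.withColumn("split", explode($"details")).withColumn("col", $"split"("_1")).withColumn("val", $"split"("_2")).select("car", "col", "val")
scala> weDF.groupBy("car").pivot("col", colNames).agg(avg($"val")).show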
+------+----+--------+------+-----+
| car| mpg| price| year|dummy|
+------+----+--------+------+-----+
|toyota|null|100000.0|2000.0| null|
| Audi|22.0| null| null| null|
+------+----+--------+------+-----+
This seems a more elegant and easier way to achieve the output.
You can do it like this:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit, when}
val df: DataFrame = Seq(
("toyota",Array(("year", 2000), ("price", 100000))),
("toyota",Array(("year", 2001)))
).toDF("car", "details")
+------+-------------------------------+
|car |details |
+------+-------------------------------+
|toyota|[[year, 2000], [price, 100000]]|
|toyota|[[year, 2001]] |
+------+-------------------------------+
val newdf = df
.withColumn("year", when(col("details")(0)("_1") === lit("year"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
.withColumn("price", when(col("details")(0)("_1") === lit("price"), col("details")(0)("_2")).otherwise(col("details")(1)("_2")))
.drop("details")
newdf.show()
+------+----+------+
| car|year| price|
+------+----+------+
|toyota|2000|100000|
|toyota|2001| null|
+------+----+------+
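The `when`/`otherwise` version above only inspects the first two array positions. A more general option, assuming Spark 2.4 or later, is to turn the array of pairs into a map with `map_from_entries` and look up each wanted key. This is a minimal sketch; `cars` and `wanted` are illustrative names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.map_from_entries

val spark = SparkSession.builder().master("local[*]").appName("array-to-columns").getOrCreate()
import spark.implicits._

val cars = Seq(
  ("toyota", Seq(("year", 2000), ("price", 100000))),
  ("Audi", Seq(("mpg", 22)))
).toDF("car", "details")

// Keys to turn into columns; extend this list to add more columns
val wanted = Seq("year", "price", "mpg")

// details is array<struct<_1,_2>>, which map_from_entries converts to map<_1,_2>
val asMap = cars.withColumn("m", map_from_entries($"details"))

// A key that is missing from the map yields null
val wide = asMap.select($"car" +: wanted.map(k => $"m".getItem(k).as(k)): _*)
wide.show()
```

The position of a pair inside the array no longer matters. Duplicate keys within one row may be rejected, depending on the Spark version and configuration.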

Scala : Passing elements of a Dataframe from every row and get back the result in separate rows

In my requirement, I have to pass two strings from two columns of my dataframe to a function, get a string back, and store the result in a dataframe.
When I pass the values as strings, the function always returns the same value, so every row is populated with the same value (in my case, PPPPP).
Is there a way to pass the elements of those two columns from every row and get the results in separate rows?
I am ready to modify my function to accept a DataFrame and return a DataFrame, or to accept an array of strings and return an array of strings, but I don't know how to do that as I am new to programming. Can someone please help me?
Thanks.
def myFunction(key: String , value :String ) : String = {
//Do my functions and get back a string value2 and return this value2 string
value2
}
val DF2 = DF1.select (
DF1("col1")
,DF1("col2")
,DF1("col5") )
.withColumn("anyName", lit(myFunction ( DF1("col3").toString() , DF1("col4").toString() )))
/* DF1:
+-----+----+--------+----+----+
|col1 |col2|col3    |col4|col5|
+-----+----+--------+----+----+
|Hello|5   |valueAAA|XXX |123 |
|How  |3   |valueCCC|YYY |111 |
|World|5   |valueDDD|ZZZ |222 |
+-----+----+--------+----+----+
DF2:
+-----+----+----+-------+
|col1 |col2|col5|anyName|
+-----+----+----+-------+
|Hello|5   |123 |PPPPP  |
|How  |3   |111 |PPPPP  |
|World|5   |222 |PPPPP  |
+-----+----+----+-------+
*/
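After you define the function, you need to wrap it with udf(), which is available in org.apache.spark.sql.functions. In your code, myFunction is called only once on the driver, with the string representations of the Column objects rather than the row values, and lit turns that single result into a constant column, which is why every row shows the same value. Check this out: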
scala> val DF1 = Seq(("Hello",5,"valueAAA","XXX",123),
| ("How",3,"valueCCC","YYY",111),
| ("World",5,"valueDDD","ZZZ",222)
| ).toDF("col1","col2","col3","col4","col5")
DF1: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 3 more fields]
scala> val DF2 = DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5") )
DF2: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]
scala> DF2.show(false)
+-----+----+----+
|col1 |col2|col5|
+-----+----+----+
|Hello|5 |123 |
|How |3 |111 |
|World|5 |222 |
+-----+----+----+
scala> DF1.select("*").show(false)
+-----+----+--------+----+----+
|col1 |col2|col3 |col4|col5|
+-----+----+--------+----+----+
|Hello|5 |valueAAA|XXX |123 |
|How |3 |valueCCC|YYY |111 |
|World|5 |valueDDD|ZZZ |222 |
+-----+----+--------+----+----+
scala> def myConcat(a:String,b:String):String=
| return a + "--" + b
myConcat: (a: String, b: String)String
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val myConcatUDF = udf(myConcat(_:String,_:String):String)
myConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))
scala> DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5"), myConcatUDF( DF1("col3"), DF1("col4"))).show()
+-----+----+----+---------------+
| col1|col2|col5|UDF(col3, col4)|
+-----+----+----+---------------+
|Hello| 5| 123| valueAAA--XXX|
| How| 3| 111| valueCCC--YYY|
|World| 5| 222| valueDDD--ZZZ|
+-----+----+----+---------------+
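To get the column name used in the question (anyName) instead of the generated `UDF(col3, col4)`, the UDF can be applied through `withColumn`. A minimal sketch, assuming a local SparkSession; `combine` is a hypothetical stand-in for the asker's `myFunction`, whose real body is not shown.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

val spark = SparkSession.builder().master("local[*]").appName("row-wise-udf").getOrCreate()
import spark.implicits._

val rows = Seq(
  ("Hello", 5, "valueAAA", "XXX", 123),
  ("How", 3, "valueCCC", "YYY", 111),
  ("World", 5, "valueDDD", "ZZZ", 222)
).toDF("col1", "col2", "col3", "col4", "col5")

// Hypothetical stand-in for myFunction: any (String, String) => String works here
val combine = udf((key: String, value: String) => s"$key--$value")

// The UDF receives the col3 and col4 values of each row, so every row gets its own result
val withResult = rows
  .withColumn("anyName", combine(col("col3"), col("col4")))
  .select("col1", "col2", "col5", "anyName")

withResult.show(false)
```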

how to concat multiple columns in spark while getting the column names to be concatenated from another table (different for each row)

I am trying to concat multiple columns in spark using concat function.
For example below is the table for which I have to add new concatenated column
table - **t**
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| b|
+---+----+
and below is the table which has the information about which columns are to be concatenated for given id (for id 1 column id and name needs to be concatenated and for id 2 only id)
table - **r**
+---+-------+
| id| att |
+---+-------+
| 1|id,name|
| 2| id |
+---+-------+
if I join the two tables and do something like below, I am able to concat but not based on the table r (as the new column is having 1,a for first row but for second row it should be 2 only)
t.withColumn("new",concat_ws(",",t.select("att").first.mkString.split(",").map(c => col(c)): _*)).show
+---+----+-------+---+
| id|name| att |new|
+---+----+-------+---+
| 1| a|id,name|1,a|
| 2| b| id |2,b|
+---+----+-------+---+
I have to apply filter before the select in the above query, but I am not sure how to do that in withColumn for each row.
Something like below, if that is possible.
t.withColumn("new",concat_ws(",",t.**filter**("id="+this.id).select("att").first.mkString.split(",").map(c => col(c)): _*)).show
As it will require to filter each row based on the id.
scala> t.filter("id=1").select("att").first.mkString.split(",").map(c => col(c))
res90: Array[org.apache.spark.sql.Column] = Array(id, name)
scala> t.filter("id=2").select("att").first.mkString.split(",").map(c => col(c))
res89: Array[org.apache.spark.sql.Column] = Array(id)
Below is the final required result.
+---+----+-------+---+
| id|name| att |new|
+---+----+-------+---+
| 1| a|id,name|1,a|
| 2| b| id |2 |
+---+----+-------+---+
We can use a UDF.
Requirement for this logic to work:
The column names of table t should be in the same order as they appear in the att column of table r.
scala> input_df_1.show
+---+----+
| id|name|
+---+----+
| 1| a|
| 2| b|
+---+----+
scala> input_df_2.show
+---+-------+
| id| att|
+---+-------+
| 1|id,name|
| 2| id|
+---+-------+
scala> val join_df = input_df_1.join(input_df_2,Seq("id"),"inner")
join_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 1 more field]
scala> val req_cols = input_df_1.columns
req_cols: Array[String] = Array(id, name)
scala> def new_col_udf = udf((cols : Seq[String],row : String,attr : String) => {
| val row_values = row.split(",")
| val attrs = attr.split(",")
| val req_val = attrs.map{at =>
| val index = cols.indexOf(at)
| row_values(index)
| }
| req_val.mkString(",")
| })
new_col_udf: org.apache.spark.sql.expressions.UserDefinedFunction
scala> val intermediate_df = join_df.withColumn("concat_column",concat_ws(",",'id,'name)).withColumn("new_col",new_col_udf(lit(req_cols),'concat_column,'att))
intermediate_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 3 more fields]
scala> val result_df = intermediate_df.select('id,'name,'att,'new_col)
result_df: org.apache.spark.sql.DataFrame = [id: int, name: string ... 2 more fields]
scala> result_df.show
+---+----+-------+-------+
| id|name| att|new_col|
+---+----+-------+-------+
| 1| a|id,name| 1,a|
| 2| b| id| 2|
+---+----+-------+-------+
Hope it answers your question.
This may be done in a UDF:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array, col, udf}
// Column names, captured on the driver and shipped with the UDF closure
val names: Seq[String] = dataFrame.columns.toSeq
// Cast every column to string so that all of them fit into one array column
val cols: Seq[Column] = names.map(c => col(c).cast("string"))
val generateNew = udf((values: Seq[String]) => {
  // Column names listed in this row's att value, e.g. "id,name" (att is assumed non-null)
  val att = values(names.indexOf("att")).split(",").map(_.trim)
  // Keep the values of the listed columns, ignoring names that are not columns
  att.filter(a => names.contains(a)).map(a => values(names.indexOf(a))).mkString(",")
})
val dfColumns = array(cols: _*)
val dNew = dataFrame.withColumn("new", generateNew(dfColumns))
This is just a sketch, but the idea is that you can pass a sequence of items to the user defined function, and select the ones that are needed dynamically.
Note that there are additional types of collection/maps that you can pass - for example How to pass array to UDF
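The same result may also be reachable without a UDF, since `concat_ws` skips null inputs. This is a sketch under the assumption that att holds comma-separated names of columns of t; the values come out in the column order of t rather than the order given in att.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_contains, col, concat_ws, split, trim, when}

val spark = SparkSession.builder().master("local[*]").appName("row-wise-concat").getOrCreate()
import spark.implicits._

val t = Seq((1, "a"), (2, "b")).toDF("id", "name")
val r = Seq((1, "id,name"), (2, "id")).toDF("id", "att")

val joined = t.join(r, Seq("id"))

// Column names listed for this row, split on commas with optional surrounding spaces
val listed = split(trim(col("att")), "\\s*,\\s*")

// For each column of t: its value as a string when att lists it, otherwise null
val pieces = t.columns.map(c => when(array_contains(listed, c), col(c).cast("string")))

// concat_ws ignores the null pieces, so only the listed columns are joined
val result = joined.withColumn("new", concat_ws(",", pieces: _*))
result.show()
```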