Spark DataFrame column content modification - Scala

I have a dataframe, shown below via df.show():
+--------+---------+---------+---------+---------+
| Col11 | Col22 | Expend1 | Expend2 | Expend3 |
+--------+---------+---------+---------+---------+
| Value1 | value1 | 123 | 2264 | 56 |
| Value1 | value2 | 124 | 2255 | 23 |
+--------+---------+---------+---------+---------+
Can I transform the above dataframe into the one below using SQL?
+--------+---------+-------------+---------------+------------+
| Col11 | Col22 | Expend1 | Expend2 | Expend3 |
+--------+---------+-------------+---------------+------------+
| Value1 | value1 | Expend1:123 | Expend2: 2264 | Expend3:56 |
| Value1 | value2 | Expend1:124 | Expend2: 2255 | Expend3:23 |
+--------+---------+-------------+---------------+------------+

You can use foldLeft here:
import spark.implicits._
import org.apache.spark.sql.functions._
val df = spark.sparkContext.parallelize(Seq(
  ("Value1", "value1", "123", "2264", "56"),
  ("Value1", "value2", "124", "2255", "23")
)).toDF("Col11", "Col22", "Expend1", "Expend2", "Expend3")
// Lists your columns for operation
val cols = List("Expend1", "Expend2", "Expend3")
val newDF = cols.foldLeft(df) { (acc, name) =>
  acc.withColumn(name, concat(lit(name + ":"), col(name)))
}
newDF.show()
Output:
+------+------+-----------+------------+----------+
| Col11| Col22| Expend1| Expend2| Expend3|
+------+------+-----------+------------+----------+
|Value1|value1|Expend1:123|Expend2:2264|Expend3:56|
|Value1|value2|Expend1:124|Expend2:2255|Expend3:23|
+------+------+-----------+------------+----------+
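If you prefer building all the columns in one select instead of folding, a rough equivalent (a sketch using the same df and cols as above; newDF2 is just an illustrative name) is:
// Prefix the Expend* columns in a single pass and keep the remaining columns unchanged
val newDF2 = df.select(
  df.columns.map { c =>
    if (cols.contains(c)) concat(lit(c + ":"), col(c)).as(c) else col(c)
  }: _*
)
newDF2.show()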

You can do this with a plain SQL SELECT statement; if you prefer, a UDF works as well, as shown after the SQL sketch below.
Ex -> select Col11, Col22, concat('Expend1:', Expend1) as Expend1, .... from table (note that Spark SQL uses concat rather than + for string concatenation)
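A minimal Spark SQL sketch of that SELECT route (assuming the same dataframe as above, registered under the hypothetical view name expenses):
// Register the dataframe as a temp view so it can be queried with SQL
df.createOrReplaceTempView("expenses")
spark.sql("""
  SELECT Col11, Col22,
         concat('Expend1:', Expend1) AS Expend1,
         concat('Expend2:', Expend2) AS Expend2,
         concat('Expend3:', Expend3) AS Expend3
  FROM expenses
""").show()
And the UDF-based variant: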

val df = Seq(("Value1", "value1", "123", "2264", "56"), ("Value1", "value2", "124", "2255", "23") ).toDF("Col11", "Col22", "Expend1", "Expend2", "Expend3")
val cols = df.columns.filter(!_.startsWith("Col")) // It will only fetch other than col% prefix columns
val getCombineData = udf { (colName:String, colvalue:String) => colName + ":"+ colvalue}
var in = df
for (e <- cols) {
in = in.withColumn(e, getCombineData(lit(e), col(e)) )
}
in.show
// results
+------+------+-----------+------------+----------+
| Col11| Col22| Expend1| Expend2| Expend3|
+------+------+-----------+------------+----------+
|Value1|value1|Expend1:123|Expend2:2264|Expend3:56|
|Value1|value2|Expend1:124|Expend2:2255|Expend3:23|
+------+------+-----------+------------+----------+

Remove unwanted columns from a dataframe in scala

I am pretty new to Scala.
I have a dataframe with multiple columns, some of which contain null values at random places. I need to find every column that has even a single null value and drop it from the dataframe.
#### Input
| Column 1 | Column 2 | Column 3 | Column 4 | Column 5 |
| --------------| --------------| --------------| --------------| --------------|
|(123)-456-7890 | 123-456-7890 |(123)-456-789 | |(123)-456-7890 |
|(123)-456-7890 | 123-4567890 |(123)-456-7890 |(123)-456-7890 | null |
|(123)-456-7890 | 1234567890 |(123)-456-7890 |(123)-456-7890 | null |
#### Output
| Column 1 | Column 2 |
| --------------| --------------|
|(123)-456-7890 | 123-456-7890 |
|(123)-456-7890 | 123-4567890 |
|(123)-456-7890 | 1234567890 |
Please advise.
Thank you.
I would recommend a 2-step approach:
1. Exclude columns that are not nullable from the dataframe
2. Assemble a list of columns that contain at least one null and drop them altogether

Creating a sample dataframe with a mix of nullable/non-nullable columns:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._
import spark.implicits._
val df0 = Seq(
  (Some(1), Some("x"), Some("a"), None),
  (Some(2), Some("y"), None, Some(20.0)),
  (Some(3), Some("z"), None, Some(30.0))
).toDF("c1", "c2", "c3", "c4")
val newSchema = StructType(df0.schema.map { field =>
  if (field.name == "c1") field.copy(name = "c1_notnull", nullable = false) else field
})
// Revised dataframe with non-nullable `c1`
val df = spark.createDataFrame(df0.rdd, newSchema)
Carrying out step 1 & 2:
val nullableCols = df.schema.collect{ case StructField(name, _, true, _) => name }
// nullableCols: Seq[String] = List(c2, c3, c4)
val colsWithNulls = nullableCols.filter(c => df.where(col(c).isNull).count > 0)
// colsWithNulls: Seq[String] = List(c3, c4)
df.drop(colsWithNulls: _*).show
// +----------+---+
// |c1_notnull| c2|
// +----------+---+
// | 1| x|
// | 2| y|
// | 3| z|
// +----------+---+
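As a possible refinement (a sketch, not required for correctness; nullCountRow and colsToDrop are just illustrative names), the null counts for all nullable columns can be gathered in a single aggregation instead of running one count job per column:
import org.apache.spark.sql.functions.{count, when}
// count(when(col.isNull, 1)) counts the null entries of each column in one pass over the data
val nullCountRow = df
  .select(nullableCols.map(c => count(when(col(c).isNull, 1)).as(c)): _*)
  .first()
val colsToDrop = nullableCols.filter(c => nullCountRow.getAs[Long](c) > 0)
df.drop(colsToDrop: _*).show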

How to use a dataframe inside a UDF and parse the data in Spark Scala

I am new to Scala and Spark. I have a requirement to create a new dataframe using a UDF.
I have 2 dataframes: df1 contains 3 columns, namely company, id, and name.
df2 contains 2 columns, namely company and message.
df2's JSON will be like this:
{"company": "Honda", "message": ["19:[\"cost 500k\"],[\"colour blue\"]","20:[\"cost 600k\"],[\"colour white\"]"]}
{"company": "BMW", "message": ["19:[\"cost 1500k\"],[\"colour blue\"]"]}
df2 will be like this:
+-------+--------------------+
|company| message|
+-------+--------------------+
| Honda|[19:["cost 500k"]...|
| BMW|[19:["cost 1500k"...|
+-------+--------------------+
|-- company: string (nullable = true)
|-- message: array (nullable = true)
| |-- element: string (containsNull = true)
df1 will be like this:
+----------+---+-------+
|company | id| name|
+----------+---+-------+
| Honda | 19| city |
| Honda | 20| amaze |
| BMW | 19| x1 |
+----------+---+-------+
I want to create a new dataframe by replacing the id in df2's message with the name from df1:
["city:[\"cost 500k\"],[\"colour blue\"]","amaze:[\"cost 600k\"],[\"colour white\"]"]
I tried a UDF, passing the message as Seq[String] and the company, but I was not able to select the data from df1 inside it.
I want the output like this:
+-------+----------------------+
|company| message |
+-------+----------------------+
| Honda|[city:["cost 500k"]...|
| BMW|[x1:["cost 1500k"... |
+-------+----------------------+
I tried the following UDF, but I got errors while selecting the name:
def asdf(categories: Seq[String]): String = {
  var data = ""
  for (w <- categories) {
    if (w != null) {
      var id = w.toString().indexOf(":")
      var namea = df1.select("name").where($"id" === 20).map(_.getString(0)).collect()
      var name = namea(0)
      println(name)
      var ids = w.toString().substring(0, id)
      var li = w.toString().replace(ids, name)
      println(li)
      data = data + li
    }
  }
  data
}
A DataFrame cannot be referenced from inside a UDF, so the lookup has to be done with a join instead. Please check the below code.
scala> df1.show(false)
+-------+---------------------------------------------------------------------+
|company|message |
+-------+---------------------------------------------------------------------+
|Honda |[19:["cost 500k"],["colour blue"], 20:["cost 600k"],["colour white"]]|
|BMW |[19:["cost 1500k"],["colour blue"]] |
+-------+---------------------------------------------------------------------+
scala> df2.show(false)
+-------+---+-----+
|company|id |name |
+-------+---+-----+
|Honda | 19|city |
|Honda | 20|amaze|
|BMW | 19|x1 |
+-------+---+-----+
val replaceFirst = udf((message: String,id:String,name:String) =>
if(message.contains(s"""${id}:""")) message.replaceFirst(s"""${id}:""",s"${name}:") else ""
)
val jdf = df1
  .withColumn("message", explode($"message"))
  .join(df2, df1("company") === df2("company"), "inner")
  .withColumn(
    "message_data",
    replaceFirst($"message", trim($"id"), $"name")
  )
  .filter($"message_data" =!= "")
scala> jdf.show(false)
+-------+---------------------------------+-------+---+-----+------------------------------------+
|company|message |company|id |name |message_data |
+-------+---------------------------------+-------+---+-----+------------------------------------+
|Honda |19:["cost 500k"],["colour blue"] |Honda | 19|city |city:["cost 500k"],["colour blue"] |
|Honda |20:["cost 600k"],["colour white"]|Honda | 20|amaze|amaze:["cost 600k"],["colour white"]|
|BMW |19:["cost 1500k"],["colour blue"]|BMW | 19|x1 |x1:["cost 1500k"],["colour blue"] |
+-------+---------------------------------+-------+---+-----+------------------------------------+
scala> df1.join(df2, df1("company") === df2("company"), "inner").
| select(df1("company"), df1("message"), df2("id"), df2("name")).
| withColumn("message", explode($"message")).
| withColumn("message", replaceFirst($"message", trim($"id"), $"name")).
| filter($"message" =!= "").
| groupBy($"company").
| agg(collect_list($"message").cast("string").as("message")).
| show(false)
+-------+--------------------------------------------------------------------------+
|company|message |
+-------+--------------------------------------------------------------------------+
|Honda |[amaze:["cost 600k"],["colour white"], city:["cost 500k"],["colour blue"]]|
|BMW |[x1:["cost 1500k"],["colour blue"]] |
+-------+--------------------------------------------------------------------------+
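If you really do want the lookup inside a UDF (as the question title asks), keep in mind that a DataFrame itself cannot be used from within a UDF; a common workaround is to collect the small lookup DataFrame to the driver and broadcast it. A rough sketch under that assumption (idToName, idToNameB and renameIds are hypothetical names; df1/df2 as named in this answer):
// Collect the small (company, id) -> name lookup and broadcast it to the executors
val idToName = df2.collect()
  .map(r => (r.getString(0), r.get(1).toString.trim) -> r.getString(2))
  .toMap
val idToNameB = spark.sparkContext.broadcast(idToName)
// Rewrite each "id:..." entry of the message array to "name:..." using the broadcast map
val renameIds = udf { (company: String, messages: Seq[String]) =>
  messages.map { m =>
    val id = m.takeWhile(_ != ':').trim
    idToNameB.value.get((company, id)).map(_ + m.dropWhile(_ != ':')).getOrElse(m)
  }
}
df1.withColumn("message", renameIds($"company", $"message")).show(false)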

Scala: Passing elements of a DataFrame from every row and getting back the result in separate rows

In my requirement, I have to pass 2 strings from 2 columns of my dataframe, get the result back as a string, and store it back in a dataframe.
Right now, while passing the values as strings, the function always returns the same value, so the same value is populated in all the rows. (In my case PPPPP is populated in all rows.)
Is there a way to pass the elements of those 2 columns from every row and get the result in separate rows?
I am ready to modify my function to accept a DataFrame and return a DataFrame, or to accept an Array[String] and return an Array[String], but I don't know how to do that as I am new to programming. Can someone please help me?
Thanks.
def myFunction(key: String, value: String): String = {
  // Do my functions and get back a string value2 and return this value2 string
  value2
}
val DF2 = DF1.select(
    DF1("col1"),
    DF1("col2"),
    DF1("col5"))
  .withColumn("anyName", lit(myFunction(DF1("col3").toString(), DF1("col4").toString())))
/* DF1:
 * +-----+-----+--------+-----+-----+
 * |col1 |col2 |col3    |col4 |col5 |
 * +-----+-----+--------+-----+-----+
 * |Hello|5    |valueAAA|XXX  |123  |
 * |How  |3    |valueCCC|YYY  |111  |
 * |World|5    |valueDDD|ZZZ  |222  |
 * +-----+-----+--------+-----+-----+
 *
 * DF2:
 * +-----+-----+----+-------+
 * |col1 |col2 |col5|anyName|
 * +-----+-----+----+-------+
 * |Hello|5    |123 |PPPPP  |
 * |How  |3    |111 |PPPPP  |
 * |World|5    |222 |PPPPP  |
 * +-----+-----+----+-------+
 */
After you define the function, you need to register it as a UDF using udf(). The udf() function is available in org.apache.spark.sql.functions. Check this out:
scala> val DF1 = Seq(("Hello",5,"valueAAA","XXX",123),
| ("How",3,"valueCCC","YYY",111),
| ("World",5,"valueDDD","ZZZ",222)
| ).toDF("col1","col2","col3","col4","col5")
DF1: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 3 more fields]
scala> val DF2 = DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5") )
DF2: org.apache.spark.sql.DataFrame = [col1: string, col2: int ... 1 more field]
scala> DF2.show(false)
+-----+----+----+
|col1 |col2|col5|
+-----+----+----+
|Hello|5 |123 |
|How |3 |111 |
|World|5 |222 |
+-----+----+----+
scala> DF1.select("*").show(false)
+-----+----+--------+----+----+
|col1 |col2|col3 |col4|col5|
+-----+----+--------+----+----+
|Hello|5 |valueAAA|XXX |123 |
|How |3 |valueCCC|YYY |111 |
|World|5 |valueDDD|ZZZ |222 |
+-----+----+--------+----+----+
scala> def myConcat(a:String,b:String):String=
| return a + "--" + b
myConcat: (a: String, b: String)String
scala>
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val myConcatUDF = udf(myConcat(_:String,_:String):String)
myConcatUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,StringType,Some(List(StringType, StringType)))
scala> DF1.select ( DF1("col1") ,DF1("col2") ,DF1("col5"), myConcatUDF( DF1("col3"), DF1("col4"))).show()
+-----+----+----+---------------+
| col1|col2|col5|UDF(col3, col4)|
+-----+----+----+---------------+
|Hello| 5| 123| valueAAA--XXX|
| How| 3| 111| valueCCC--YYY|
|World| 5| 222| valueDDD--ZZZ|
+-----+----+----+---------------+
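As a side note, the same UDF can be defined inline with a lambda, and the output column can be given a friendly name with .as(...) instead of the generated UDF(col3, col4) header. A small sketch (myConcatUDF2 and anyName are just illustrative names):
val myConcatUDF2 = udf((a: String, b: String) => a + "--" + b)
DF1.select(DF1("col1"), DF1("col2"), DF1("col5"),
  myConcatUDF2(DF1("col3"), DF1("col4")).as("anyName")).show()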

Spark: How to get the day difference between a dataframe column and a timestamp in Scala

How do I get the date difference (number of days in between) for a dataframe column in Scala?
I have a df: [id: string, itemName: string, eventTimeStamp: timestamp] and a startTime (timestamp string). How do I get a column "Daydifference" - the number of days between startTime and eventTimeStamp?
My code:
Initial df:
+------------+-----------+-------------------------+
| id | itemName | eventTimeStamp |
----------------------------------------------------
| 1 | TV | 2016-09-19T00:00:00Z |
| 1 | Movie | 2016-09-19T00:00:00Z |
| 1 | TV | 2016-09-26T00:00:00Z |
| 2 | TV | 2016-09-18T00:00:00Z |
I need to get the most recent eventTimeStamp based on id and itemName, so I did:
val result = df.groupBy("id", "itemName").agg(max("eventTimeStamp") as "mostRecent")
+------------+-----------+-------------------------+
| id | itemName | mostRecent |
----------------------------------------------------
| 1 | TV | 2016-09-26T00:00:00Z |
| 1 | Movie | 2016-09-19T00:00:00Z |
| 2 | TV | 2016-09-26T00:00:00Z |
Now I need to get the date difference between mostRecent and startTime (2016-09-29T00:00:00Z), so that I can get:
{ id : 1, {"itemMap" : {"TV" : 3, "Movie" : 10 }} }
{ id : 2, {"itemMap" : {"TV" : 3}} }
I tried like this:
val startTime = "2016-09-26T00:00:00Z"
val result = df.groupBy("id", "itemName").agg(datediff(startTime, max("eventTimeStamp")) as Daydifference)
case class Data (itemMap : Map[String, Long]) extends Serializable
result.map{
case r =>
val id = r.getAs[String]("id")
val itemName = r.getAs[String]("itemName")
val Daydifference = r.getAs[Long]("Daydifference")
(id, Map(itemName -> Daydifference ))
}.reduceByKey((x, y) => x ++ y).map{
case (k, v) =>
(k, JacksonUtil.toJson(Data(v)))
}
But I am getting an error on datediff. Can someone tell me how I can achieve this?
When you want to use some constant ("literal") value as a Column in a DataFrame, you should use the lit(...) function. The other error here is trying to pass a String as the start date; to compare it with a timestamp column you can use a java.sql.Date:
val startTime = java.sql.Date.valueOf("2016-09-26")
val result = df.groupBy("id", "itemName")
.agg(datediff(lit(startTime), max("eventTimeStamp")) as "Daydifference")
result.show()
// +---+--------+-------------+
// | id|itemName|Daydifference|
// +---+--------+-------------+
// | 1| Movie| 7|
// | 1| TV| 0|
// | 2| TV| 0|
// +---+--------+-------------+
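If you would rather keep the start time as a plain string (as in the question), a sketch of an alternative is to let Spark parse it with to_date; datediff accepts date/timestamp columns, so the max("eventTimeStamp") side can stay as it is (result2 is just an illustrative name):
import org.apache.spark.sql.functions.{datediff, lit, max, to_date}
val startTimeStr = "2016-09-26" // same date as above, kept as a string
val result2 = df.groupBy("id", "itemName")
  .agg(datediff(to_date(lit(startTimeStr)), max("eventTimeStamp")) as "Daydifference")
result2.show()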

I have a table with a Map as the column data type; how can I explode it to generate 2 columns, one for the key and one for the value?

Hive Table: (Name_Age: Map[String, Int] and ID: Int)
+---------------------------------------------------------++------+
| Name_Age || ID |
+---------------------------------------------------------++------+
|"SUBHAJIT SEN":28,"BINOY MONDAL":26,"SHANTANU DUTTA":35 || 15 |
|"GOBINATHAN SP":35,"HARSH GUPTA":27,"RAHUL ANAND":26 || 16 |
+---------------------------------------------------------++------+
I've exploded the Name_Age column into multiple Rows:
def toUpper(name: Seq[String]) = (name.map(a => a.toUpperCase)).toSeq
sqlContext.udf.register("toUpper",toUpper _)
var df = sqlContext.sql("SELECT toUpper(name) FROM namelist").toDF("Name_Age")
df.explode(df("Name_Age")){case org.apache.spark.sql.Row(arr: Seq[String]) => arr.toSeq.map(v => Tuple1(v))}.drop(df("Name_Age")).withColumnRenamed("_1","Name_Age")
+-------------------+
| Name_Age |
+-------------------+
| [SUBHAJIT SEN,28]|
| [BINOY MONDAL,26]|
|[SHANTANU DUTTA,35]|
| [GOBINATHAN SP,35]|
| [HARSH GUPTA,27]|
| [RAHUL ANAND,26]|
+-------------------+
But I want to explode it and create 2 columns: Name and Age
+-------------------+-------+
| Name | Age |
+-------------------+-------+
| SUBHAJIT SEN | 28 |
| BINOY MONDAL | 26 |
|SHANTANU DUTTA | 35 |
| GOBINATHAN SP | 35 |
| HARSH GUPTA | 27 |
| RAHUL ANAND | 26 |
+-------------------+-------+
Could anyone please help with the explode code modification?
All you need here is to drop the toUpper call and use the explode function:
import org.apache.spark.sql.functions.explode
import spark.implicits._
val df = Seq((Map("foo" -> 1, "bar" -> 2), 1)).toDF("name_age", "id")
val exploded = df.select($"id", explode($"name_age")).toDF("id", "name", "age")
exploded.printSchema
// root
// |-- id: integer (nullable = false)
// |-- name: string (nullable = false)
// |-- age: integer (nullable = false)
You can convert to upper case using built-in functions afterwards:
import org.apache.spark.sql.functions.upper
exploded.withColumn("name", upper($"name"))