Error while running UDF on dataframe columns - scala

I have extracted some data from hive to dataframe, which is in the below shown format.
+-------+----------------+----------------+----------------+----------------+
| NUM_ID|            SIG1|            SIG2|            SIG3|            SIG4|
+-------+----------------+----------------+----------------+----------------+
|XXXXX01|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX02|[{15695604780...|[{15695604780...|[{15695604780...|[{15695604780...|
|XXXXX03|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX04|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX05|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX06|[{15695605340...|[{15695605340...|[{15695605340...|[{15695605340...|
|XXXXX07|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
|XXXXX08|[{15695605310...|[{15695605310...|[{15695605310...|[{15695605310...|
Expanding a single row fully, the four signal columns look like this:
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|
[{1569560537000,3.7825},{1569560481000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|
[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560527000,34.7825}]|
[{1569560535000,34.7825},{1569560479000,34.7825},{1569560487000,34.7825}]
The schema for the above data is
fromHive.printSchema
root
|-- NUM_ID: string (nullable = true)
|-- SIG1: string (nullable = true)
|-- SIG2: string (nullable = true)
|-- SIG3: string (nullable = true)
|-- SIG4: string (nullable = true)
My requirement is to get all the E values from all the signal columns for a particular NUM_ID, create them as a new column E, and put the corresponding signal values in separate columns, as shown below.
+-------+-------------+-------+-------+-------+-------+
| NUM_ID| E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560531000|33.7825|34.7825| null|96.3354|
|XXXXX01|1569560505000| null| null|35.5501| null|
|XXXXX01|1569560531001|73.7825| null| null| null|
|XXXXX02|1569560505000|34.7825| null|35.5501|96.3354|
|XXXXX02|1569560531000|33.7825|34.7825|35.5501|96.3354|
|XXXXX02|1569560505001|73.7825| null| null| null|
|XXXXX02|1569560502000| null| null|35.5501|96.3354|
|XXXXX03|1569560531000|73.7825|   null|   null|   null|
|XXXXX03|1569560505000|34.7825|   null|35.5501|96.3354|
|XXXXX03|1569560509000|   null|34.7825|35.5501|96.3354|
+-------+-------------+-------+-------+-------+-------+
I tried writing UDFs to achieve this as below.
UDF#1:
def UDF_E: UserDefinedFunction = udf((r: Row) => {
  val SigColumn = "SIG1,SIG2,SIG3,SIG4,SIG5,SIG6"
  val colList = SigColumn.split(",").toList
  val rr = "[\\}],[\\{]".r
  var out = ""
  colList.foreach { x =>
    val a = (rr replaceAllIn(r.getAs(x).toString, "|")).replaceAll("\\[\\{", "").replaceAll("\\}\\]", "").replaceAll(""""E":""", "")
    val st = a.split("\\|").map(x => x.split(",")(0)).toSet
    out = out + "," + st.mkString(",")
  }
  val out1 = out.replaceFirst(",", "").split(",").toSet.mkString(",")
  out1
})
UDF#2:
def UDF_V: UserDefinedFunction = udf((E: String, SIG: String) => {
  val Signal = SIG.replaceAll("\\{", "(").replaceAll("\\}", ")").replaceAll("\\[", "").replaceAll("\\]", "").replaceAll(""""E":""", "").replaceAll(""","V":""", "=")
  val SigMap = "(\\w+)=([\\w 0-9 .]+)".r.findAllIn(Signal).matchData.map(i => (i.group(1), i.group(2))).toMap
  var out = ""
  if (SigMap.contains(E)) {
    out = SigMap(E)
  }
  out
})
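As an aside, the heart of UDF#2 is plain string munging and can be checked outside Spark. The sketch below assumes the raw cell text is JSON-like, e.g. `{"E":...,"V":...}` pairs, as the `replaceAll(""""E":""", "")` calls imply; the sample string is made up, since the real Hive values are truncated in the question.

```scala
// Plain-Scala sketch of UDF#2's parsing step (hypothetical input shape).
def parsePairs(sig: String): Map[String, String] = {
  // Capture each "E":<digits>,"V":<number> pair into a Map(E -> V).
  val pair = """"E":(\d+),"V":([\d.]+)""".r
  pair.findAllMatchIn(sig).map(m => m.group(1) -> m.group(2)).toMap
}

// Look up the V for a given E, returning "" when absent,
// which mirrors UDF#2's behaviour.
def lookupV(e: String, sig: String): String =
  parsePairs(sig).getOrElse(e, "")
```

A regex over the raw pairs avoids the chain of `replaceAll` calls entirely, which makes the UDF easier to reason about when a cell is malformed.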
Output of UDF#1 is shown below:
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
|NUM_ID |SIG1 |SIG2 |SIG3 |SIG4 |E |
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560483000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560497000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560475000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560489000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560535000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560531000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560513000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560537000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560491000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560521000|
|XXXXX01|[{1569560531000,3.7825},{1569560475000,3.7812},{1569560483000,3.7812},{1569560491000,34.7875}]|[{1569560537000,3.7825},{1569560531000,34.7825},{1569560489000,34.7825},{1569560497000,34.7825}]|[{1569560505000,34.7825},{1569560513000,34.7825},{1569560521000,34.7825},{1569560531000,34.7825}]|[{1569560535000,34.7825},{1569560531000,34.7825},{1569560483000,34.7825}]|1569560505000|
+-------+----------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------+-------------+
Output of UDF#2 is:
+-------+-------------+-------+-------+-------+-------+
| NUM_ID| E| SIG1_V| SIG2_V| SIG3_V| SIG4_V|
+-------+-------------+-------+-------+-------+-------+
|XXXXX01|1569560475000| 3.7812| | | |
|XXXXX01|1569560483000| 3.7812| | |34.7825|
|XXXXX01|1569560489000| |34.7825| | |
|XXXXX01|1569560491000|34.7875| | | |
|XXXXX01|1569560497000| |34.7825| | |
|XXXXX01|1569560505000| | |34.7825| |
|XXXXX01|1569560513000| | |34.7825| |
|XXXXX01|1569560521000| | |34.7825| |
|XXXXX01|1569560531000| 3.7825|34.7825|34.7825|34.7825|
|XXXXX01|1569560535000| | | |34.7825|
|XXXXX01|1569560537000| | 3.7825| | |
+-------+-------------+-------+-------+-------+-------+
When I run df.show() I get the output above, but when I do a df.count() I get the error below:
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$UDF_E$1: (struct<NUM_ID:string,SIG1:string,SIG2:string,SIG3:string,SIG4:string>) => string)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
Caused by: java.lang.NullPointerException
at $line74.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$UDF_E$1$$anonfun$apply$1.apply(<console>:57
What might be the reason for this error? Any leads to resolve it?

Resolved this issue by adding a null check as below.
def UDF_E: UserDefinedFunction = udf((r: Row) => {
  val SigColumn = "SIG1,SIG2,SIG3,SIG4,SIG5,SIG6"
  val colList = SigColumn.split(",").toList
  val rr = "[\\}],[\\{]".r
  var out = ""
  colList.foreach { x =>
    if (r.getAs(x) != null) {
      val a = (rr replaceAllIn(r.getAs(x).toString, "|")).replaceAll("\\[\\{", "").replaceAll("\\}\\]", "").replaceAll(""""E":""", "")
      val st = a.split("\\|").map(x => x.split(",")(0)).toSet
      out = out + "," + st.mkString(",")
    }
  }
  val out1 = out.replaceFirst(",", "").split(",").toSet.mkString(",")
  out1
})
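An equivalent, slightly more idiomatic guard is `Option(...)`, which maps a null cell to `None`, so null columns are skipped without an explicit `!= null` check. A plain-Scala sketch of the pattern (the sample strings are made up, shaped like the truncated Hive cells):

```scala
// Extract the E values (the first number of each {E,V} pair) from one cell.
def extractEs(cell: String): Set[String] =
  """\{(\d+),""".r.findAllMatchIn(cell).map(_.group(1)).toSet

// One cell per signal column; the middle one is null, as in the failing rows.
val cells: List[String] =
  List("[{100,1.0},{200,2.0}]", null, "[{200,2.5},{300,3.0}]")

// Option(null) == None, so null cells contribute nothing instead of throwing.
val allEs = cells.flatMap(c => Option(c).map(extractEs).getOrElse(Set.empty[String])).toSet
// allEs == Set("100", "200", "300")
```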

Related

How to use dataframe inside an udf and parse the data in spark scala

I am new to Scala and Spark. I have a requirement to create a new dataframe using a UDF.
I have two dataframes: df1 contains 3 columns, namely company, id, and name.
df2 contains 2 columns, namely company and message.
df2 JSON will be like this
{"company": "Honda", "message": ["19:[\"cost 500k\"],[\"colour blue\"]","20:[\"cost 600k\"],[\"colour white\"]"]}
{"company": "BMW", "message": ["19:[\"cost 1500k\"],[\"colour blue\"]"]}
df2 will be like this:
+-------+--------------------+
|company| message|
+-------+--------------------+
| Honda|[19:["cost 500k"]...|
| BMW|[19:["cost 1500k"...|
+-------+--------------------+
|-- company: string (nullable = true)
|-- message: array (nullable = true)
| |-- element: string (containsNull = true)
df1 will be like this:
+----------+---+-------+
|company | id| name|
+----------+---+-------+
| Honda | 19| city |
| Honda | 20| amaze |
| BMW | 19| x1 |
+----------+---+-------+
I want to create a new dataframe by replacing the id in df2's messages with the corresponding name from df1, e.g. for Honda:
["city:[\"cost 500k\"],[\"colour blue\"]","amaze:[\"cost 600k\"],[\"colour white\"]"]
I tried a UDF, passing the message as Seq[String] along with the company, but I was not able to select data from df1 inside the UDF.
I want the output like this:
+-------+----------------------+
|company| message |
+-------+----------------------+
| Honda|[city:["cost 500k"]...|
| BMW|[x1:["cost 1500k"... |
+-------+----------------------+
I tried the following UDF, but I was facing errors while selecting the name:
def asdf(categories: Seq[String]): String = {
  var data = ""
  for (w <- categories) {
    if (w != null) {
      var id = w.toString().indexOf(":")
      var namea = df1.select("name").where($"id" === 20).map(_.getString(0)).collect()
      var name = namea(0)
      println(name)
      var ids = w.toString().substring(0, id)
      var li = w.toString().replace(ids, name)
      println(li)
      data = data + li
    }
  }
  data
}
Please check the code below. A DataFrame cannot be accessed from inside a UDF, so instead explode the message array and join with the lookup dataframe:
scala> df1.show(false)
+-------+---------------------------------------------------------------------+
|company|message |
+-------+---------------------------------------------------------------------+
|Honda |[19:["cost 500k"],["colour blue"], 20:["cost 600k"],["colour white"]]|
|BMW |[19:["cost 1500k"],["colour blue"]] |
+-------+---------------------------------------------------------------------+
scala> df2.show(false)
+-------+---+-----+
|company|id |name |
+-------+---+-----+
|Honda | 19|city |
|Honda | 20|amaze|
|BMW | 19|x1 |
+-------+---+-----+
val replaceFirst = udf((message: String, id: String, name: String) =>
  if (message.contains(s"""${id}:""")) message.replaceFirst(s"""${id}:""", s"${name}:") else ""
)

val jdf = df1
  .withColumn("message", explode($"message"))
  .join(df2, df1("company") === df2("company"), "inner")
  .withColumn("message_data", replaceFirst($"message", trim($"id"), $"name"))
  .filter($"message_data" =!= "")
scala> jdf.show(false)
+-------+---------------------------------+-------+---+-----+------------------------------------+
|company|message |company|id |name |message_data |
+-------+---------------------------------+-------+---+-----+------------------------------------+
|Honda |19:["cost 500k"],["colour blue"] |Honda | 19|city |city:["cost 500k"],["colour blue"] |
|Honda |20:["cost 600k"],["colour white"]|Honda | 20|amaze|amaze:["cost 600k"],["colour white"]|
|BMW |19:["cost 1500k"],["colour blue"]|BMW | 19|x1 |x1:["cost 1500k"],["colour blue"] |
+-------+---------------------------------+-------+---+-----+------------------------------------+
scala> df1.join(df2, df1("company") === df2("company"), "inner")
     |   .select(df1("company"), df1("message"), df2("id"), df2("name"))
     |   .withColumn("message", explode($"message"))
     |   .withColumn("message", replaceFirst($"message", trim($"id"), $"name"))
     |   .filter($"message" =!= "")
     |   .groupBy($"company")
     |   .agg(collect_list($"message").cast("string").as("message"))
     |   .show(false)
+-------+--------------------------------------------------------------------------+
|company|message |
+-------+--------------------------------------------------------------------------+
|Honda |[amaze:["cost 600k"],["colour white"], city:["cost 500k"],["colour blue"]]|
|BMW |[x1:["cost 1500k"],["colour blue"]] |
+-------+--------------------------------------------------------------------------+
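For reference, the logic inside the replaceFirst UDF is pure string manipulation and can be checked without Spark (same names as in the answer above):

```scala
// Replace the leading "<id>:" prefix with "<name>:"; return "" when the id
// does not appear, which is what the later filter(... =!= "") relies on.
def renameId(message: String, id: String, name: String): String =
  if (message.contains(s"$id:")) message.replaceFirst(s"$id:", s"$name:")
  else ""
```

For example, `renameId("""19:["cost 500k"]""", "19", "city")` yields `city:["cost 500k"]`, while a non-matching id yields `""` so the joined row is filtered out.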

Replicating rows in Spark Dataset N times

When I try to do something like this in Spark:
val replicas = 10
val dsReplicated = ds flatMap (a => 0 until replicas map ((a, _)))
I get the following exception:
java.lang.UnsupportedOperationException: No Encoder found for org.apache.spark.sql.Row
- field (class: "org.apache.spark.sql.Row", name: "_1")
- root class: "scala.Tuple2"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:625)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:619)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$10.apply(ScalaReflection.scala:607)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:607)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:438)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
... 48 elided
I can achieve this using Spark DataFrame with explode function. I'd like to achieve something similar using Datasets.
For the reference, here is the code that replicates rows using the DataFrame API:
val dfReplicated = df.
withColumn("__temporarily__", typedLit((0 until replicas).toArray)).
withColumn("idx", explode($"__temporarily__")).
drop($"__temporarily__")
Here is one way of doing it:
case class Zip(zipcode: String)
case class Person(id: Int,name: String,zipcode: List[Zip])
data: org.apache.spark.sql.Dataset[Person]
data.show()
+---+----+--------------+
| id|name| zipcode|
+---+----+--------------+
| 1| AAA|[[MVP], [RB2]]|
| 2| BBB|[[KFG], [YYU]]|
| 3| CCC|[[JJJ], [7IH]]|
+---+----+--------------+
data.printSchema
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- zipcode: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- zipcode: string (nullable = true)
val df = data.withColumn("ArrayCol",explode($"zipcode"))
df.select($"id",$"name",$"ArrayCol.zipcode").show()
Output:
+---+----+-------+
| id|name|zipcode|
+---+----+-------+
| 1| AAA| MVP|
| 1| AAA| RB2|
| 2| BBB| KFG|
| 2| BBB| YYU|
| 3| CCC| JJJ|
| 3| CCC| 7IH|
+---+----+-------+
Using Dataset:
val resultDS = data.flatMap(x => x.zipcode.map(y => (x.id,x.name,y.zipcode)))
resultDS.show(false)
//resultDS:org.apache.spark.sql.Dataset[(Int, String, String)] =
// [_1: integer, _2: string ... 1 more fields]
//+---+---+---+
//|_1 |_2 |_3 |
//+---+---+---+
//|1 |AAA|MVP|
//|1 |AAA|RB2|
//|2 |BBB|KFG|
//|2 |BBB|YYU|
//|3 |CCC|JJJ|
//|3 |CCC|7IH|
//+---+---+---+
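The Dataset flatMap above works because Person is a case class with an implicit encoder; the original failure was only that Tuple2[Row, Int] needs an encoder for Row, which Spark does not provide (case classes and tuples of encodable primitives do have one). The replication pattern itself, shown on a plain Scala collection for reference:

```scala
// Replicate each element `replicas` times via flatMap, exactly as in the
// Dataset version from the question (plain collection, no encoders needed).
val replicas = 3
val rows = List("r1", "r2")
val replicated = rows.flatMap(a => (0 until replicas).map(i => (a, i)))
// 2 rows x 3 replicas = 6 elements, each tagged with its copy index
```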

Spark GroupBy agg collect_list multiple columns

I have a question similar to this but the number of columns to be operated by collect_list is given by a name list. For example:
scala> w.show
+---+-----+----+-----+
|iid|event|date|place|
+---+-----+----+-----+
| A| D1| T0| P1|
| A| D0| T1| P2|
| B| Y1| T0| P3|
| B| Y2| T2| P3|
| C| H1| T0| P5|
| C| H0| T9| P5|
| B| Y0| T1| P2|
| B| H1| T3| P6|
| D| H1| T2| P4|
+---+-----+----+-----+
scala> val combList = List("event", "date", "place")
combList: List[String] = List(event, date, place)
scala> val v = w.groupBy("iid").agg(collect_list(combList(0)), collect_list(combList(1)), collect_list(combList(2)))
v: org.apache.spark.sql.DataFrame = [iid: string, collect_list(event): array<string> ... 2 more fields]
scala> v.show
+---+-------------------+------------------+-------------------+
|iid|collect_list(event)|collect_list(date)|collect_list(place)|
+---+-------------------+------------------+-------------------+
| B| [Y1, Y2, Y0, H1]| [T0, T2, T1, T3]| [P3, P3, P2, P6]|
| D| [H1]| [T2]| [P4]|
| C| [H1, H0]| [T0, T9]| [P5, P5]|
| A| [D1, D0]| [T0, T1]| [P1, P2]|
+---+-------------------+------------------+-------------------+
Is there any way I can apply collect_list to multiple columns inside agg without knowing the number of elements in combList beforehand?
You can use collect_list(struct(col1, col2)) AS elements.
Example:
df.select("cd_issuer", "cd_doc", "cd_item", "nm_item").printSchema
val outputDf = spark.sql(s"SELECT cd_issuer, cd_doc, collect_list(struct(cd_item, nm_item)) AS item FROM teste GROUP BY cd_issuer, cd_doc")
outputDf.printSchema
df schema:
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- cd_item: string (nullable = true)
|-- nm_item: string (nullable = true)
outputDf schema:
|-- cd_issuer: string (nullable = true)
|-- cd_doc: string (nullable = true)
|-- item: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cd_item: string (nullable = true)
| | |-- nm_item: string (nullable = true)
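On the original question of not knowing the list length beforehand: Spark's agg has a varargs overload, agg(expr: Column, exprs: Column*), so you can build the column list dynamically and splat it, e.g. `w.groupBy("iid").agg(aggs.head, aggs.tail: _*)` where `aggs = combList.map(c => collect_list(c))`. The splat idiom itself, sketched in plain Scala with a stand-in function:

```scala
// Stand-in for Spark's agg(expr, exprs: _*): one required argument plus
// varargs, mirroring the real signature's shape.
def agg(first: String, rest: String*): List[String] = first :: rest.toList

val combList = List("event", "date", "place")
val aggs = combList.map(c => s"collect_list($c)")   // one call per column name
val applied = agg(aggs.head, aggs.tail: _*)         // splat the dynamic tail
```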

Spark Scala replace Dataframe blank records to "0"

I need to replace my DataFrame field's blank records with "0".
Here is my code:
import sqlContext.implicits._
case class CInspections (business_id:Int, score:String, date:String, type1:String)
val baseDir = "/FileStore/tables/484qrxx21488929011080/"
val raw_inspections = sc.textFile (s"$baseDir/inspections_plus.txt")
val raw_inspectionsmap = raw_inspections.map ( line => line.split ("\t"))
val raw_inspectionsRDD = raw_inspectionsmap.map ( raw_inspections => CInspections (raw_inspections(0).toInt,raw_inspections(1), raw_inspections(2),raw_inspections(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView ("Inspections")
raw_inspectionsDF.printSchema
raw_inspectionsDF.show()
I am using a case class and then converting to a DataFrame. But I need score as Int because I have to perform some operations on it and sort by it. If I declare it as score: Int, I get an error for blank values:
java.lang.NumberFormatException: For input string: "" 
+-----------+-----+--------+--------------------+
|business_id|score| date| type1|
+-----------+-----+--------+--------------------+
| 10| |20140807|Reinspection/Foll...|
| 10| 94|20140729|Routine - Unsched...|
| 10| |20140124|Reinspection/Foll...|
| 10| 92|20140114|Routine - Unsched...|
| 10| 98|20121114|Routine - Unsched...|
| 10| |20120920|Reinspection/Foll...|
| 17| |20140425|Reinspection/Foll...|
+-----------+-----+--------+--------------------+
I need the score field as Int because the query below sorts it as a String, not an Int, and gives the wrong result:
sqlContext.sql("""select raw_inspectionsDF.score from raw_inspectionsDF where score <>"" order by score""").show()
+-----+
|score|
+-----+
| 100|
| 100|
| 100|
+-----+
An empty string can't be converted to an Integer. You need to make the score nullable, so that a missing field is represented as null. You can try the following:
import scala.util.{Try, Success, Failure}
1) Define a customized parse function which returns None, if the string can't be converted to an Int, in your case empty string;
def parseScore(s: String): Option[Int] = {
  Try(s.toInt) match {
    case Success(x) => Some(x)
    case Failure(x) => None
  }
}
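The parser can be sanity-checked in the REPL without Spark (self-contained copy of parseScore for the check):

```scala
import scala.util.{Failure, Success, Try}

// Same parser as above: empty or non-numeric strings become None,
// which toDF later renders as null.
def parseScore(s: String): Option[Int] =
  Try(s.toInt) match {
    case Success(x) => Some(x)
    case Failure(_) => None
  }
```

For example, parseScore("94") gives Some(94), while parseScore("") gives None.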
2) Define the score field in your case class to be an Option[Int] type;
case class CInspections (business_id:Int, score: Option[Int], date:String, type1:String)
val raw_inspections = sc.textFile("test.csv")
val raw_inspectionsmap = raw_inspections.map(line => line.split("\t"))
3) Use the customized parseScore function to parse the score field;
val raw_inspectionsRDD = raw_inspectionsmap.map(raw_inspections =>
CInspections(raw_inspections(0).toInt, parseScore(raw_inspections(1)),
raw_inspections(2),raw_inspections(3)))
val raw_inspectionsDF = raw_inspectionsRDD.toDF
raw_inspectionsDF.createOrReplaceTempView ("Inspections")
raw_inspectionsDF.printSchema
//root
// |-- business_id: integer (nullable = false)
// |-- score: integer (nullable = true)
// |-- date: string (nullable = true)
// |-- type1: string (nullable = true)
raw_inspectionsDF.show()
+-----------+-----+----+-----+
|business_id|score|date|type1|
+-----------+-----+----+-----+
| 1| null| a| b|
| 2| 3| s| k|
+-----------+-----+----+-----+
4) After parsing the file correctly, you can easily replace null value with 0 using na functions fill:
raw_inspectionsDF.na.fill(0).show
+-----------+-----+----+-----+
|business_id|score|date|type1|
+-----------+-----+----+-----+
| 1| 0| a| b|
| 2| 3| s| k|
+-----------+-----+----+-----+

delete constant columns spark having issue with timestamp column

Hi guys, I wrote this code to drop columns with constant values.
I start by computing the standard deviation of each column and then drop the ones whose standard deviation equals zero, but I get the following issue when a column has timestamp type. What should I do?
cannot resolve 'stddev_samp(time.1)' due to data type mismatch: argument 1 requires double type, however, 'time.1' is of timestamp type.;;
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
//val df = spark.range(1, 1000).withColumn("X2", lit(0)).toDF("X1","X2")
val df = spark.read.option("inferSchema", "true").option("header", "true").csv("C:/Users/mhattabi/Desktop/dataTestCsvFile/dataTest2.txt")
df.show(5)
//df.columns.map(p=>s"`${p}`")
//val aggs = df.columns.map(c => stddev(c).as(c))
val aggs = df.columns.map(p=>stddev(s"`${p}`").as(p))
val stddevs = df.select(aggs: _*)
val columnsToKeep: Seq[Column] = stddevs.first // take the first (only) row
  .toSeq                                       // convert to Seq[Any]
  .zip(df.columns)                             // zip with column names
  .collect {
    // keep only names where stddev != 0
    case (s: Double, c) if s != 0.0 => col(c)
  }
df.select(columnsToKeep: _*).show(5,false)
Using stddev
stddev is only defined on numeric columns. If you want to compute the standard deviation over a date column, you will need to convert it to a numeric value (for example a Unix timestamp) first:
scala> var myDF = (0 to 10).map(x => (x, scala.util.Random.nextDouble)).toDF("id", "rand_double")
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double]
scala> myDF = myDF.withColumn("Date", current_date())
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double ... 1 more field]
scala> myDF.printSchema
root
|-- id: integer (nullable = false)
|-- rand_double: double (nullable = false)
|-- Date: date (nullable = false)
scala> myDF.show
+---+-------------------+----------+
| id| rand_double| Date|
+---+-------------------+----------+
| 0| 0.3786008989478248|2017-03-21|
| 1| 0.5968932024004612|2017-03-21|
| 2|0.05912760417456575|2017-03-21|
| 3|0.29974600653895667|2017-03-21|
| 4| 0.8448407414817856|2017-03-21|
| 5| 0.2049495659443249|2017-03-21|
| 6| 0.4184846380144779|2017-03-21|
| 7|0.21400484330739022|2017-03-21|
| 8| 0.9558142791013501|2017-03-21|
| 9|0.32530639391058036|2017-03-21|
| 10| 0.5100585655062743|2017-03-21|
+---+-------------------+----------+
scala> myDF = myDF.withColumn("Date", unix_timestamp($"Date"))
myDF: org.apache.spark.sql.DataFrame = [id: int, rand_double: double ... 1 more field]
scala> myDF.printSchema
root
|-- id: integer (nullable = false)
|-- rand_double: double (nullable = false)
|-- Date: long (nullable = true)
scala> myDF.show
+---+-------------------+----------+
| id| rand_double| Date|
+---+-------------------+----------+
| 0| 0.3786008989478248|1490072400|
| 1| 0.5968932024004612|1490072400|
| 2|0.05912760417456575|1490072400|
| 3|0.29974600653895667|1490072400|
| 4| 0.8448407414817856|1490072400|
| 5| 0.2049495659443249|1490072400|
| 6| 0.4184846380144779|1490072400|
| 7|0.21400484330739022|1490072400|
| 8| 0.9558142791013501|1490072400|
| 9|0.32530639391058036|1490072400|
| 10| 0.5100585655062743|1490072400|
+---+-------------------+----------+
At this point all of the columns are numeric so your code will run fine:
scala> :pa
// Entering paste mode (ctrl-D to finish)
val aggs = myDF.columns.map(p=>stddev(s"`${p}`").as(p))
val stddevs = myDF.select(aggs: _*)
val columnsToKeep: Seq[Column] = stddevs.first // Take first row
.toSeq // convert to Seq[Any]
.zip(myDF.columns) // zip with column names
.collect {
// keep only names where stddev != 0
case (s: Double, c) if s != 0.0 => col(c)
}
myDF.select(columnsToKeep: _*).show(false)
// Exiting paste mode, now interpreting.
+---+-------------------+
|id |rand_double |
+---+-------------------+
|0 |0.3786008989478248 |
|1 |0.5968932024004612 |
|2 |0.05912760417456575|
|3 |0.29974600653895667|
|4 |0.8448407414817856 |
|5 |0.2049495659443249 |
|6 |0.4184846380144779 |
|7 |0.21400484330739022|
|8 |0.9558142791013501 |
|9 |0.32530639391058036|
|10 |0.5100585655062743 |
+---+-------------------+
aggs: Array[org.apache.spark.sql.Column] = Array(stddev_samp(id) AS `id`, stddev_samp(rand_double) AS `rand_double`, stddev_samp(Date) AS `Date`)
stddevs: org.apache.spark.sql.DataFrame = [id: double, rand_double: double ... 1 more field]
columnsToKeep: Seq[org.apache.spark.sql.Column] = ArrayBuffer(id, rand_double)
Using countDistinct
All that being said, it would be more general to use countDistinct:
scala> val distCounts = myDF.select(myDF.columns.map(c => countDistinct(c) as c): _*).first.toSeq.zip(myDF.columns)
distCounts: Seq[(Any, String)] = ArrayBuffer((11,id), (11,rand_double), (1,Date))
scala> distCounts.foldLeft(myDF)((accDF, dc_col) => if (dc_col._1 == 1) accDF.drop(dc_col._2) else accDF).show
+---+-------------------+
| id| rand_double|
+---+-------------------+
| 0| 0.3786008989478248|
| 1| 0.5968932024004612|
| 2|0.05912760417456575|
| 3|0.29974600653895667|
| 4| 0.8448407414817856|
| 5| 0.2049495659443249|
| 6| 0.4184846380144779|
| 7|0.21400484330739022|
| 8| 0.9558142791013501|
| 9|0.32530639391058036|
| 10| 0.5100585655062743|
+---+-------------------+
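The foldLeft in the last snippet simply drops every column whose distinct count is 1. The same fold over plain (count, name) pairs, mirroring the REPL values above:

```scala
// distCounts mirrors the REPL output: 11 distinct ids, 11 distinct
// rand_doubles, and a single distinct Date (a constant column).
val distCounts = Seq((11, "id"), (11, "rand_double"), (1, "Date"))
val allCols = Seq("id", "rand_double", "Date")

// Drop each column whose distinct count is 1, keep the rest.
val kept = distCounts.foldLeft(allCols) { (acc, dc) =>
  if (dc._1 == 1) acc.filterNot(_ == dc._2) else acc
}
// kept == Seq("id", "rand_double")
```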