I'm analyzing a log file for customer impact analysis using Apache Spark. Each record in the log spans several lines: the timestamp is on one line, the customer's details are on another, and the "Caused by" error is on yet another. I want a single output file in which the fields extracted from each record are combined on one line. Here is my log file:
2018-10-15 05:24:00.102 ERROR 7 --- [DefaultMessageListenerContainer-2] c.l.p.a.c.event.listener.MQListener : The ABC/CDE object received for the xyz event was not valid. e_id=11111111, s_id=111, e_name=ABC
com.xyz.abc.pqr.exception.PNotVException: The r received from C was invalid/lacks mandatory fields. S_id: 123, P_Id: 123456789, R_Number: 12345678
at com.xyz.abc.pqr.mprofile.CPServiceImpl.lambda$bPByC$1(CPServiceImpl.java:240)
at java.util.ArrayList.forEach(ArrayList.java:1249)
at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:681)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:651)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Invalid D because cm: null and pk: null were missing.
at com.xyz.abc.pqr.mp.DD.resolveDetailsFromCDE(DD.java:151)
at com.xyz.abc.pqr.mp.DD.<init>(DD.java:35)
at java.util.ArrayList.forEach(ArrayList.java:1249)
2018-10-15 05:24:25.136 ERROR 7 --- [DefaultMessageListenerContainer-1] c.l.p.a.c.event.listener.MQListener : The ABC/CDE object received for the C1 event was not valid. entity_id=2222222, s_id=3333, event_name=CDE
com.xyz.abc.pqr.PNotVException: The r received from C was invalid/lacks mandatory fields. S_id: 123, P_Id: 123456789, R_Number: 12345678
at com.xyz.abc.pqr.mp.CSImpl.lambda$buildABCByCo$1(CSImpl.java:240)
at java.util.ArrayList.forEach(ArrayList.java:1249)
at com.xyz.abc.pqr.event.handler.DHandler.handle(CDEEventHandler.java:45)
at sun.reflect.GMA.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:333)
at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:197)
at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:115)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: null
You can use the DataFrame API to do this in a few ways. Here is one:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
// Pair every line with its position so the original order can be restored later
val rd = sc.textFile("/FileStore/tables/log.txt").zipWithIndex.map{case (r, i) => Row(r, i)}
val schema = StructType(StructField("logs", StringType, false) :: StructField("id", LongType, false) :: Nil)
val df = spark.sqlContext.createDataFrame(rd, schema)
| logs| id|
|2018-10-15 05:24:...| 0|
| | 1|
|com.xyz.abc.pqr.e...| 2|
| at com.xyz.ab...| 3|
| at java.util....| 4|
| rContainer.do...| 5|
| at org.spring...| 6|
| at org.spring...| 7|
| at java.lang....| 8|
|Caused by: java.l...| 9|
| at com.xyz.ab...| 10|
| at com.xyz.ab...| 11|
| at com.xyz.ab...| 12|
| at java.util....| 13|
| | 14|
|2018-10-15 05:24:...| 15|
| | 16|
|com.xyz.abc.pqr.P...| 17|
val df1 = df.filter($"logs".contains("c.l.p.a.c.event.listener.MQListener")).withColumn("logs",regexp_replace($"logs","ERROR.*","")).sort("id")
| logs| id|
|2018-10-15 05:24:...| 0|
|2018-10-15 05:24:...| 15|
val df2 = df.filter($"logs".contains("PrescriptionNotValidException:")).withColumn("logs",regexp_replace($"logs","(.*?)mandatory fields.","")).sort("id")
| logs| id|
| StoreId: 123, Co...| 2|
| StoreId: 234, Co...| 17|
val df3 = df.filter($"logs".contains("Caused by: java.lang.")).sort("id")
| logs| id|
|Caused by: java.l...| 9|
|Caused by: java.l...| 28|
df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [logs: string, id: bigint]
Finally, zip the collected rows of the three DataFrames by position to get one combined record per error:
res71: Seq[((org.apache.spark.sql.Row, org.apache.spark.sql.Row), org.apache.spark.sql.Row)] = ArrayBuffer((([2018-10-15 05:24:00.102 ],[ StoreId: 123, Co Patient Id: 123456789, Rx Number: 12345678]),[Caused by: java.lang.IllegalArgumentException: Invalid Dispense Object because compound: null and pack: null were missing.]), (([2018-10-15 05:24:25.136 ],[ StoreId: 234, Co Patient Id: 999999, Rx Number: 45555]),[Caused by: java.lang.NullPointerException: null]))
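The per-record extraction can also be sketched on a plain Scala collection, which makes the grouping logic easier to see. This is only a minimal sketch under assumptions: `LogRecords`, `Record`, `parse` and the `|` output separator are names and choices made up for the illustration, the two regular expressions are guesses based on the sample log above, and the lines of one record are assumed to arrive together and in order (something you would have to guarantee yourself before applying the same idea to a distributed dataset).

```scala
object LogRecords {
  // A record starts at a line that begins with a timestamp such as "2018-10-15 05:24:00.102".
  private val Timestamp = """^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) .*""".r
  // Customer details follow the "mandatory fields." marker seen in the sample log.
  private val Details = """.*mandatory fields\.\s*(.*)""".r

  final case class Record(timestamp: String, details: String, cause: String) {
    def toLine: String = s"$timestamp | $details | $cause"
  }

  def parse(lines: Seq[String]): Seq[Record] =
    // Open a new record at each timestamp line, then fill the newest record
    // from the lines that follow it. Only the first "Caused by:" line is kept.
    lines.foldLeft(Vector.empty[Record]) { (acc, line) =>
      line match {
        case Timestamp(ts) =>
          acc :+ Record(ts, "", "")
        case Details(d) if acc.nonEmpty && acc.last.details.isEmpty =>
          acc.init :+ acc.last.copy(details = d)
        case l if l.startsWith("Caused by:") && acc.nonEmpty && acc.last.cause.isEmpty =>
          acc.init :+ acc.last.copy(cause = l)
        case _ =>
          acc // stack frames and other lines are ignored
      }
    }
}
```

Calling `LogRecords.parse(lines).map(_.toLine)` then gives one output line per log record.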
I have a scenario where I have to parse XML and JSON values based on another field.
The Customer_Order table has two fields named response_id and response_output. response_output holds a mix of JSON strings, XML strings, the literal Error, blanks, and nulls.
I need to address the following problem statements:
If response_id=1 and response_output has valid JSON, then pick the JSON
If response_id=1 and response_output does not have valid JSON, then:
If response_id=1 and response_output is an XML value, then nullify
If response_id=1 and response_output is Error, then Error
If response_id=1 and response_output is blank or null, then nullify
If response_id=2 and response_output has valid XML, then pick the XML
If response_id=2 and response_output does not have valid XML, then:
If response_id=2 and response_output is a JSON value, then nullify
If response_id=2 and response_output is Error, then Error
If response_id=2 and response_output is blank or null, then nullify
I am trying to implement the above problem statements using Spark SQL, but my code breaks when it encounters invalid XML or invalid JSON.
Below are the query and the error. Could anyone help me handle this?
spark.sql("""select customer_id, response_id,
CASE WHEN (response_id=2 and response_output!="Error") THEN get_json_object(response_output, '$.Metrics.OrderResponseTime')
WHEN (response_id=1 and response_output!="Error") THEN xpath_string(response_output,'USR_ORD/OrderResponse/USR1OrderTotalTime')
WHEN ((response_id=1 or response_id=2) and response_output="Error") THEN "Error"
ELSE null END as order_time
from Customer_Order""").show()
Below is the error I get when running the above query. How can I handle invalid XML or JSON?
Driver stacktrace:
21/02/05 00:48:06 INFO scheduler.DAGScheduler: Job 5 failed: show at Engine.scala:221, took 1.099890 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 10.0 failed 4 times, most recent failure: Lost task 3.3 in stage 10.0 (TID 83, srwredf2021.analytics1.test.dev.corp, executor 3): java.lang.RuntimeException: Invalid XML document: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1234; XML document structures must start and end within the same entity.
<USR_ORD><OrderResult><ORDTime>2021-02-02 10:34:19</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>321</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult><OrderResponse></USR_ORD>
My Code snippets with data for reference
Data List
val custList = List((100,1,"<USR_ORD><OrderResult><ORDTime>2021-02-02 10:34:19</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>321</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult><OrderResponse></USR_ORD>"),
(200,1,"<USR_ORD><OrderResponse><OrderResult><ORDTime>2021-02-02 21:13:12</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>"),
(300,1,"Error"),
(400,1,""),
(500,1,"""{"OrderResponseTime":"2021-02-02 11:34:19", "OrderResponse":"COMPLETE", "OrderTime":300, "USR1Order":null, "USR1Orderqut":10 }"""),
(600,2,"""{"OrderResponseTime":"2021-02-02 15:14:13", "OrderResponse":"COMPLETE", "OrderTime":300, "USR1Order":null, "USR1Orderqut":10 }"""),
(700,2,"""{"OrderResponseTime":"2021-02-02 12:38:26", "OrderResponse":"COMPLETE", "OrderTime":200, "USR1Order":null "USR1Orderqut":4} """),
(800,2,"""{"OrderResponseTime":"2021-02-02 13:24:19", "OrderResponse":"COMPLETE", "OrderTime":100, "USR1Order":null} "USR1Orderqut":1}"""),
(900,2,"<USR_ORD><OrderResponse><OrderResult><ORDTime>2021-02-02 01:12:49</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>"),
(101,2,"Error"),
(202,2,""))
Loading List to RDD
val rdd = spark.sparkContext.parallelize(custList)
Imposing Schema Column Names
val DF1 = rdd.toDF("customer_id","response_id","response_output")
Creating Table
DF1.createOrReplaceTempView("Customer_Order")
Printing Schema
spark.sql("""select customer_id,response_id,response_output from Customer_Order""").printSchema()
|-- customer_id: integer (nullable = false)
|-- response_id: integer (nullable = false)
|-- response_output: string (nullable = true)
Showing Records
spark.sql("""select customer_id,response_id,response_output from Customer_Order""").show()
|customer_id|response_id|response_output |
|100 |1 |<USR_ORD><OrderResult><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>321</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult><OrderResponse></USR_ORD> |
|200 |1 |<USR_ORD><OrderResponse><OrderResult><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>|
|300 |1 |Error |
|400 |1 | |
|500 |1 |{ "OrderResponse":"COMPLETE", "OrderTime":300, "USR1Order":null, "USR1Orderqut":10 } |
|600 |2 |{ "OrderResponse":"COMPLETE", "OrderTime":300, "USR1Order":null, "USR1Orderqut":10 } |
|700 |2 |{ "OrderResponse":"COMPLETE", "OrderTime":200, "USR1Order":null "USR1Orderqut":4} |
|800 |2 |{ "OrderResponse":"COMPLETE", "OrderTime":100, "USR1Order":null} "USR1Orderqut":1} |
|900 |2 |<USR_ORD><OrderResponse><OrderResult><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>|
|101 |2 |Error |
|202 |2 | |
For JSON, you don't need to validate the input, as get_json_object does not fail if the path doesn't exist or the JSON is not valid.
To avoid the exception when extracting values from XML, you can use a UDF that checks whether the string in response_output can be parsed as XML:
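import scala.util.{Failure, Success, Try}
// Returns true only when the string parses as an XML document
val isParsableXML = (xml: String) => {
  Try(scala.xml.XML.loadString(xml)) match {
    case Success(_) => true
    case Failure(_) => false
  }
}
// Make the check callable from SQL as is_parsable_xml(...)
spark.udf.register("is_parsable_xml", isParsableXML)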
Then use it in your SQL query:
SELECT customer_id,
       response_id,
       CASE
           WHEN (response_id=2
                 AND response_output!='Error') THEN get_json_object(response_output, '$.OrderTime')
           WHEN (response_id=1
                 AND response_output!='Error'
                 AND is_parsable_xml(response_output)) THEN xpath_string(response_output, 'USR_ORD/OrderResponse/USR1OrderTotalTime')
           WHEN ((response_id=1
                  OR response_id=2)
                 AND response_output='Error') THEN 'Error'
       END AS order_time
FROM Customer_Order
//| 100| 1| null|
//| 200| 1| |
//| 300| 1| Error|
//| 400| 1| null|
//| 500| 1| null|
//| 600| 2| 300|
//| 700| 2| null|
//| 800| 2| 100|
//| 900| 2| null|
//| 101| 2| Error|
//| 202| 2| null|
Now you can write your case when logic.
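If scala.xml is not available on your classpath, the same well-formedness check can be sketched with the XML parser that ships with the JDK. This is a hedged variant rather than the method used above: `XmlCheck` and `isWellFormedXml` are hypothetical names, and treating null and blank input as "not XML" is my assumption based on the problem statements.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource
import org.xml.sax.helpers.DefaultHandler
import scala.util.Try

object XmlCheck {
  // True only when the whole string parses as a well-formed XML document.
  // Null and blank strings are rejected up front instead of being handed to the parser.
  def isWellFormedXml(s: String): Boolean =
    s != null && s.trim.nonEmpty && Try {
      val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      // DefaultHandler rethrows fatal parse errors, so a failure surfaces
      // as an exception that Try can catch.
      builder.setErrorHandler(new DefaultHandler)
      builder.parse(new InputSource(new StringReader(s)))
    }.isSuccess
}
```

It can be registered in the same way as before, for example with spark.udf.register("is_parsable_xml", XmlCheck.isWellFormedXml _).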
I created a DataFrame, shown below. I want to apply a map-reduce algorithm to the 'title' column, but I run into problems when I use the reduceByKey function.
|project| title|requests_num|return_size|
| aa|%CE%92%CE%84_%CE%...| 1| 4854|
| aa|%CE%98%CE%B5%CF%8...| 1| 4917|
| aa|%CE%9C%CF%89%CE%A...| 1| 4832|
| aa|%CE%A0%CE%B9%CE%B...| 1| 4828|
| aa|%CE%A3%CE%A4%CE%8...| 1| 4819|
| aa|%D0%A1%D0%BE%D0%B...| 1| 4750|
| aa| 271_a.C| 1| 4675|
| aa|Battaglia_di_Qade...| 1| 4765|
| aa| Category:User_th| 1| 4770|
| aa| Chiron_Elias_Krase| 1| 4694|
| aa|County_Laois/en/Q...| 1| 4752|
| aa| Dassault_rafaele| 2| 9372|
| aa|Dyskusja_wikiproj...| 1| 4824|
| aa| E.Desv| 1| 4662|
| aa|Enclos-apier/fr/E...| 1| 4772|
| aa|File:Wiktionary-l...| 1| 10752|
| aa|Henri_de_Sourdis/...| 1| 4748|
| aa|Incentive_Softwar...| 1| 4777|
| aa|Indonesian_Wikipedia| 1| 4679|
| aa| Main_Page| 5| 266946|
I tried this, but it doesn't work:
dataframe.select("title").map(word => (word,1)).reduceByKey(_+_);
It seems that I should convert the DataFrame to a list first, then use map to generate key-value pairs (word, 1), and finally sum the values per key.
I found a method for converting a DataFrame to a list on Stack Overflow,
for example
val text =dataframe.select("title").map(r=>r(0).asInstanceOf[String]).collect()
but an error occurs:
scala> val text = dataframe.select("title").map(r=>r(0).asInstanceOf[String]).collect()
2018-04-08 21:49:35 WARN NettyRpcEnv:66 - Ignored message: HeartbeatResponse(false)
2018-04-08 21:49:35 WARN Executor:87 - Issue communicating with driver in heartbeater
org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval
at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:92)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:785)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply$mcV$sp(Executor.scala:814)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
at org.apache.spark.executor.Executor$$anon$2$$anonfun$run$1.apply(Executor.scala:814)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
at org.apache.spark.executor.Executor$$anon$2.run(Executor.scala:814)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
... 14 more
java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:280)
at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:276)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:276)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:298)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:297)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2722)
... 16 elided
scala> val text = dataframe.select("title").map(r=>r(0).asInstanceOf[String]).collect()
java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:280)
at org.apache.spark.sql.execution.SparkPlan$$anon$1.next(SparkPlan.scala:276)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.sql.execution.SparkPlan$$anon$1.foreach(SparkPlan.scala:276)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:298)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeCollect$1.apply(SparkPlan.scala:297)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:297)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3272)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
at org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2722)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3253)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3252)
at org.apache.spark.sql.Dataset.collect(Dataset.scala:2722)
... 16 elided
Collecting your DataFrame into a Scala collection pulls all of the data onto the driver, which limits the dataset size you can handle. Instead, you could convert the DataFrame to an RDD and then apply map and reduceByKey as below:
val df = Seq(
("aa", "271_a.C", 1, 4675),
("aa", "271_a.C", 1, 4400),
("aa", "271_a.C", 1, 4600),
("aa", "Chiron_Elias_Krase", 1, 4694),
("aa", "Chiron_Elias_Krase", 1, 4500)
).toDF("project", "title", "request_num", "return_size")
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Row
val rdd = df.rdd.
  map{ case Row(_, title: String, _, _) => (title, 1) }.
  reduceByKey(_ + _)
rdd.collect
// res1: Array[(String, Int)] = Array((Chiron_Elias_Krase,2), (271_a.C,3))
You could also transform your DataFrame directly using groupBy:
df.groupBy($"title").agg(count($"title").as("count")).show
// +------------------+-----+
// | title|count|
// +------------------+-----+
// | 271_a.C| 3|
// |Chiron_Elias_Krase| 2|
// +------------------+-----+
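To see what the map and reduceByKey pair computes, the same counting can be sketched on a local Scala collection. This is only an illustration of the semantics, not Spark code: `TitleCount` and `countTitles` are made-up names, and groupBy on a local Seq stands in for the shuffle that reduceByKey performs across partitions.

```scala
object TitleCount {
  // Local stand-in for rdd.map(title => (title, 1)).reduceByKey(_ + _)
  def countTitles(titles: Seq[String]): Map[String, Int] =
    titles
      .map(title => (title, 1))   // emit a (key, 1) pair per title
      .groupBy(_._1)              // bring equal keys together
      .map { case (title, pairs) => title -> pairs.map(_._2).reduce(_ + _) } // sum per key
}
```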
I'm trying to update the value of a column using another column's value in Scala.
This is the data in my dataframe :
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
| 1| 0| 0| Name| 0|Desc| | 0|
| 2| 2.11| 10000|Juice| 0| XYZ|2016/12/31 : Inco...| 0|
| 3|-0.500|-24.12|Fruit| -255| ABC| 1994-11-21 00:00:00| 0|
| 4| 0.087| 1222|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5| 0.087| 1222|Bread|-22.06| | | 0|
Here, column _c5 contains an incorrect value in row 2 (the value includes the string "Incorrect"). For such rows, I'd like to update the isBadRecord field to 1.
Is there a way to update this field?
You can use the withColumn API together with a function that fits your needs to set isBadRecord to 1 for bad records.
For your case you can write a udf function:
import org.apache.spark.sql.functions.udf
def fillbad = udf((c5 : String) => if(c5.contains("Incorrect")) 1 else 0)
And call it as:
val newDF = dataframe.withColumn("isBadRecord", fillbad(dataframe("_c5")))
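One thing to watch for: if _c5 can contain nulls, calling contains on it would throw a NullPointerException. A null-safe version of the check could look like the sketch below. `BadRecord` and `flag` are hypothetical names, and mapping null to 0 (not bad) is an assumption you may want to change.

```scala
object BadRecord {
  // 1 when the value mentions "Incorrect", 0 otherwise.
  // Option(c5) turns a null into None, so null input yields 0 instead of an exception.
  def flag(c5: String): Int =
    if (Option(c5).exists(_.contains("Incorrect"))) 1 else 0
}
```

It can be wrapped the same way as above, for example udf(BadRecord.flag _).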
Instead of thinking of it as an update, I suggest you think about it as you would in SQL; you could do the following:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.when
val spark: SparkSession = ??? // your spark session
val df: DataFrame = ??? // your dataframe
import spark.implicits._
df.select(
  $"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
  $"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")
Here is a self-contained script that you can copy and paste into your Spark shell to see the result locally:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.when
import org.apache.spark.sql.types._
val schema =
  StructType(List(
    StructField("UniqueRowIdentifier", IntegerType),
    StructField("_c0", DoubleType),
    StructField("_c1", DoubleType),
    StructField("_c2", StringType),
    StructField("_c3", DoubleType),
    StructField("_c4", StringType),
    StructField("_c5", StringType),
    StructField("isBadRecord", IntegerType)))
val contents =
  Seq(
    Row(1, 0.0 , 0.0 , "Name", 0.0, "Desc", "", 0),
    Row(2, 2.11 , 10000.0 , "Juice", 0.0, "XYZ", "2016/12/31 : Incorrect", 0),
    Row(3, -0.5 , -24.12, "Fruit", -255.0, "ABC", "1994-11-21 00:00:00", 0),
    Row(4, 0.087, 1222.0 , "Bread", -22.06, "", "2017-02-14 00:00:00", 0),
    Row(5, 0.087, 1222.0 , "Bread", -22.06, "", "", 0)
  )
val df = spark.createDataFrame(sc.parallelize(contents), schema)
df.show()
val withBadRecords =
  df.select(
    $"UniqueRowIdentifier", $"_c0", $"_c1", $"_c2", $"_c3", $"_c4",
    $"_c5", when($"_c5".contains("Incorrect"), 1).otherwise(0) as "isBadRecord")
withBadRecords.show()
Whose relevant output is the following:
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
| 1| 0.0| 0.0| Name| 0.0|Desc| | 0|
| 2| 2.11|10000.0|Juice| 0.0| XYZ|2016/12/31 : Inco...| 0|
| 3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00| 0|
| 4|0.087| 1222.0|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5|0.087| 1222.0|Bread|-22.06| | | 0|
|UniqueRowIdentifier| _c0| _c1| _c2| _c3| _c4| _c5|isBadRecord|
| 1| 0.0| 0.0| Name| 0.0|Desc| | 0|
| 2| 2.11|10000.0|Juice| 0.0| XYZ|2016/12/31 : Inco...| 1|
| 3| -0.5| -24.12|Fruit|-255.0| ABC| 1994-11-21 00:00:00| 0|
| 4|0.087| 1222.0|Bread|-22.06| | 2017-02-14 00:00:00| 0|
| 5|0.087| 1222.0|Bread|-22.06| | | 0|
The best option is to create a UDF that tries to parse the value as a date.
If it can be parsed, return 0; otherwise return 1.
This works even if the date is badly formatted.
import java.text.SimpleDateFormat
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf
import scala.util.{Failure, Success, Try}
val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._
//create test dataframe
val data = spark.sparkContext.parallelize(Seq(
  (1,"1994-11-21 Xyz"),
  (2,"1994-11-21 00:00:00"),
  (3,"1994-11-21 00:00:00")
)).toDF("id", "date")
// create udf which tries to convert to date format
// returns 0 if success and returns 1 if failure
val check = udf((value: String) => {
  Try(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(value)) match {
    case Success(_) => 0
    case Failure(_) => 1
  }
})
// Add column
data.withColumn("badData", check($"date")).show
Hope this helps!
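Note that SimpleDateFormat is lenient by default, so an out-of-range value such as month 13 is rolled over into a valid date instead of being rejected. If you want those flagged as bad too, a stricter check could be sketched as follows. `DateCheck` and `isBadDate` are hypothetical names, and the pattern is assumed to be the one used above.

```scala
import java.text.SimpleDateFormat
import scala.util.Try

object DateCheck {
  // Returns 0 when the value parses as "yyyy-MM-dd HH:mm:ss", 1 otherwise.
  // A new formatter is created per call because SimpleDateFormat is not thread-safe.
  def isBadDate(value: String): Int = {
    val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
    fmt.setLenient(false) // reject out-of-range fields instead of rolling them over
    if (Try(fmt.parse(value)).isSuccess) 0 else 1
  }
}
```

A null value also ends up as 1 here, because the exception thrown by parse is caught by Try.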
I have an RDD[String] in device,timestamp,on/off format. How do I calculate the amount of time each device is switched on? What is the best way of doing this in Spark?
on means 1 and off means 0
Intermediate step 1
A,((1335953754 - 1335952933),(1335995228 - 1335994294))
B,((1336002622- 1336001513),(1336007462 - 1336006905))
Intermediate step 2
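I'll assume that the RDD[String] can be parsed into an RDD of DeviceLog, where DeviceLog is:
case class DeviceLog(id: String, timestamp: Long, onoff: Int)
The DeviceLog class is pretty straightforward.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext
// initialize contexts
val conf = new SparkConf().setAppName("device-uptime").setMaster("local[*]") // placeholder settings, adjust to your environment
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._
These initialize the Spark context and SQL context that we'll use for DataFrames.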
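Step 1: Build a DataFrame from the parsed logs
val input = List(
  DeviceLog("A", 1335952933L, 1), DeviceLog("A", 1335953754L, 0), DeviceLog("A", 1335994294L, 1), DeviceLog("A", 1335995228L, 0),
  DeviceLog("B", 1336001513L, 1), DeviceLog("B", 1336002622L, 0), DeviceLog("B", 1336006905L, 1), DeviceLog("B", 1336007462L, 0))
val df = input.toDF()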
| id| timestamp|onoff|
| A|1335952933| 1|
| A|1335953754| 0|
| A|1335994294| 1|
| A|1335995228| 0|
| B|1336001513| 1|
| B|1336002622| 0|
| B|1336006905| 1|
| B|1336007462| 0|
Step 2: Partition by device id, order by timestamp and retain pair information (on/off)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lag, sum}
val wSpec = Window.partitionBy("id").orderBy("timestamp")
val df1 = df
  .withColumn("spend", lag("timestamp", 1).over(wSpec))
  .withColumn("one", lag("onoff", 1).over(wSpec))
  .where($"spend".isNotNull)
| id| timestamp|onoff| spend|one|
| A|1335953754| 0|1335952933| 1|
| A|1335994294| 1|1335953754| 0|
| A|1335995228| 0|1335994294| 1|
| B|1336002622| 0|1336001513| 1|
| B|1336006905| 1|1336002622| 0|
| B|1336007462| 0|1336006905| 1|
Step 3: Compute upTime and filter by criteria
val df2 = df1
.withColumn("upTime", $"timestamp" - $"spend")
.withColumn("criteria", $"one" - $"onoff")
.where($"criteria" === 1)
| id| timestamp|onoff| spend|one|upTime|criteria|
| A|1335953754| 0|1335952933| 1| 821| 1|
| A|1335995228| 0|1335994294| 1| 934| 1|
| B|1336002622| 0|1336001513| 1| 1109| 1|
| B|1336007462| 0|1336006905| 1| 557| 1|
Step 4: group by id and sum
val df3 = df2.groupBy($"id").agg(sum("upTime"))
| id|sum(upTime)|
| A| 1755|
| B| 1666|
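The four steps above can be checked against a plain Scala version of the same idea: sort each device's events by timestamp, pair every event with the one that follows it, and add up the gaps that go from on (1) to off (0). This is only a local sketch of the logic, not a replacement for the window-function approach; `UpTime` and `totalUpTime` are names made up for the example.

```scala
object UpTime {
  final case class DeviceLog(id: String, timestamp: Long, onoff: Int)

  def totalUpTime(logs: Seq[DeviceLog]): Map[String, Long] =
    logs.groupBy(_.id).map { case (id, events) =>
      val sorted = events.sortBy(_.timestamp)
      // Pair each event with its successor and keep only on -> off transitions,
      // which mirrors the lag(...) columns and the criteria === 1 filter.
      val up = sorted.zip(sorted.drop(1)).collect {
        case (prev, next) if prev.onoff == 1 && next.onoff == 0 =>
          next.timestamp - prev.timestamp
      }.sum
      id -> up
    }
}
```

With the eight sample rows from Step 1 this gives 1755 for device A and 1666 for device B, matching the table above.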
I am using Scala and Spark and have a simple dataframe.map that produces the required transformation of the data. However, I also need to emit an additional row alongside each modified original. How can I use dataframe.map to produce this?
dataset from:
id, name, age
1, john, 23
2, peter, 32
if age < 25 default to 25.
dataset to:
id, name, age
1, john, 25
1, john, -23
2, peter, 32
Would a 'UnionAll' handle it?
df1 = original dataframe
df2 = transformed df1
df1.unionAll(df2)
EDIT: implementation using unionAll()
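val df1=sqlContext.createDataFrame(Seq( (1,"john",23) , (2,"peter",32) )).
  toDF( "id","name","age")
def udfTransform= udf[Int,Int] { (age) => if (age<25) 25 else age }
// df2 keeps the original rows whose age the transform would change
val df2=df1.withColumn("age2", udfTransform($"age")).
  where("age!=age2").drop("age2")
// transformed rows, plus the untouched originals held in df2
df1.withColumn("age", udfTransform($"age")).
  unionAll(df2).orderBy("id").show()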
| id| name|age|
| 1| john| 25|
| 1| john| 23|
| 2|peter| 32|
Note: the implementation differs a bit from the originally proposed (naive) solution. The devil is always in the detail!
EDIT 2: implementation using nested array and explode
val df1=sx.createDataFrame(Seq( (1,"john",23) , (2,"peter",32) )).
toDF( "id","name","age")
def udfArr= udf[Array[Int],Int] { (age) =>
if (age<25) Array(age,25) else Array(age) }
val df2=df1.withColumn("age", udfArr($"age"))
df2.show()
| id| name| age|
| 1| john|[23, 25]|
| 2|peter| [32]|
df2.withColumn("age",explode($"age") ).show()
| id| name|age|
| 1| john| 23|
| 1| john| 25|
| 2|peter| 32|
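Since the question asks how to get more than one output row per input row out of a map, it may help to note that this is what flatMap is for: each input element can yield zero, one, or several output elements. Below is a minimal sketch on a local Seq that produces the dataset asked for in the question, including the negated original age. `ExtraRows`, `Person` and `expand` are names made up for the illustration; a typed Dataset offers flatMap with the same shape, given an encoder for the case class.

```scala
object ExtraRows {
  final case class Person(id: Int, name: String, age: Int)

  // For ages below 25, emit the defaulted row plus a row carrying the negated
  // original age; every other row passes through unchanged.
  def expand(people: Seq[Person]): Seq[Person] =
    people.flatMap { p =>
      if (p.age < 25) Seq(p.copy(age = 25), p.copy(age = -p.age))
      else Seq(p)
    }
}
```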