Pyspark: Failed to execute user defined function($anonfun$1: (double) => double) - pyspark

I have a column that I'm converting from string to double, but I get the error below.
An error occurred while calling o2564.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 619.0 failed 4 times, most recent failure: Lost task 0.3 in stage 619.0 org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (double) => double)
train_with_summary.select('cd_val').show(10)
+------+
|cd_val|
+------+
|     1|
|     9|
|     9|
|     0|
|     1|
|     3|
|     3|
|     0|
|     1|
|     2|
+------+
bucket_cols = ['cd_val']
for bucket_col in bucket_cols:
    train_with_summary = train_with_summary.withColumn(bucket_col, train_with_summary[bucket_col].cast(DoubleType()))
    bucketizer = Bucketizer(splits=[-float("inf"), 4, 9, 14, 19], inputCol=bucket_col, outputCol=bucket_col + "_buckets")
    train_with_summary = bucketizer.setHandleInvalid("keep").transform(train_with_summary)
    print(bucket_col)
    print(train_with_summary.select([bucket_col, bucket_col + '_buckets']).show(10))
The error occurs at the last line, and there are no null values in the column.

Figured it out myself: the error happens because the code tries to cast a column that is already of double type.
Since I ran the code twice, the initial run had already converted the column.
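If the failure comes from an accidental second run like this, one way to make the loop safe to re-run is to only cast columns that are still strings and to drop any bucket column left over from a previous run. A minimal sketch, assuming the same train_with_summary dataframe and splits as above (the dtype check and the drop are additions for illustration, not part of the original code):
from pyspark.ml.feature import Bucketizer
from pyspark.sql.types import DoubleType

bucket_cols = ['cd_val']
for bucket_col in bucket_cols:
    # Cast only if the column is not already double, so a second run is a no-op
    if dict(train_with_summary.dtypes)[bucket_col] != 'double':
        train_with_summary = train_with_summary.withColumn(
            bucket_col, train_with_summary[bucket_col].cast(DoubleType()))

    # Drop a stale output column from an earlier run before bucketizing again
    out_col = bucket_col + "_buckets"
    if out_col in train_with_summary.columns:
        train_with_summary = train_with_summary.drop(out_col)

    bucketizer = Bucketizer(splits=[-float("inf"), 4, 9, 14, 19],
                            inputCol=bucket_col, outputCol=out_col)
    train_with_summary = bucketizer.setHandleInvalid("keep").transform(train_with_summary)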

Related

How to count the last 30 day occurrence & transpose a column's row value to new columns in pyspark

I am trying to get the count of occurrences of the status column for each 'name', 'id' & 'branch' combination in the last 30 days using PySpark.
For simplicity, let's assume the current day is 19/07/2021.
Input dataframe
id  name  branch  status  eventDate
1   a     main    failed  18/07/2021
1   a     main    error   15/07/2021
2   b     main    failed  16/07/2021
3   c     main    snooze  12/07/2021
4   d     main    failed  18/01/2021
2   b     main    failed  18/07/2021
expected output
id  name  branch  failed  error  snooze
1   a     main    1       1      0
2   b     main    2       0      0
3   c     main    0       0      1
4   d     main    0       0      0
I tried the following code.
from pyspark.sql import functions as F

df = df.withColumn("eventAgeinDays", F.datediff(F.current_timestamp(), F.col("eventDate")))
df = df.groupBy('id', 'branch', 'name', 'status')\
       .agg(
           F.sum(
               F.when(F.col("eventAgeinDays") <= 30, 1).otherwise(0)
           ).alias("Last30dayFailure")
       )
df = df.groupBy('id', 'branch', 'name', 'status').pivot('status').agg(F.collect_list('Last30dayFailure'))
The code kind of gives me the output, but I get arrays in the output since I am using F.collect_list()
my partially correct output
id  name  branch  failed  error  snooze
1   a     main    [1]     [1]    []
2   b     main    [2]     []     []
3   c     main    []      []     [1]
4   d     main    []      []     []
Could you please suggest a more elegant way of creating my expected output? Or let me know how to fix my code?
Instead of using collect_list, which creates a list, use first as the aggregation method. (The reason we can use first is that you already have an aggregation grouped by id, branch, name and status, so you are sure there is at most one value for each unique combination.)
(df.groupBy('id', 'branch', 'name')
   .pivot('status')
   .agg(F.first('Last30dayFailure'))
   .fillna(0)
   .show())
+---+------+----+-----+------+------+
| id|branch|name|error|failed|snooze|
+---+------+----+-----+------+------+
| 1| main| a| 1| 1| 0|
| 4| main| d| 0| 0| 0|
| 3| main| c| 0| 0| 1|
| 2| main| b| 0| 2| 0|
+---+------+----+-----+------+------+
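As a variant, the 30-day condition can also be folded directly into the pivot aggregation, so the whole thing becomes a single groupBy/pivot pass. A sketch under the same assumptions as the question (eventDate must already be a date/timestamp column; if it is still a dd/MM/yyyy string, convert it first with F.to_date(F.col('eventDate'), 'dd/MM/yyyy')):
from pyspark.sql import functions as F

result = (df
          .withColumn("eventAgeinDays", F.datediff(F.current_timestamp(), F.col("eventDate")))
          .groupBy('id', 'name', 'branch')
          # listing the pivot values fixes the column order and skips an extra distinct scan
          .pivot('status', ['failed', 'error', 'snooze'])
          .agg(F.sum(F.when(F.col("eventAgeinDays") <= 30, 1).otherwise(0)))
          .fillna(0))
result.show()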

How to handle Invalid XML String and Invalid JSON String in Dataframe / Spark SQL/ Spark Scala

I have a scenario where I have to parse XML and JSON values based on another field.
The Customer_Order table has two fields named response_id and response_output. response_output holds a combination of JSON strings, XML strings, Error, blanks, and nulls.
I need to address the problem statements below:
If response_id=1 and response_output has valid JSON, then pick the JSON logic
If response_id=1 and response_output does not have valid JSON, then nullify
If response_id=1 and response_output is an XML value, then nullify
If response_id=1 and response_output is Error, then Error
If response_id=1 and response_output is blank or null, then nullify
If response_id=2 and response_output has valid XML, then pick the XML logic
If response_id=2 and response_output does not have valid XML, then nullify
If response_id=2 and response_output is a JSON value, then nullify
If response_id=2 and response_output is Error, then Error
If response_id=2 and response_output is blank or null, then nullify
I am trying to achieve the above problem statements using Spark SQL, but my code breaks when it encounters invalid XML or invalid JSON.
My query and the error are below; could anyone help me handle this?
spark.sql("""select
customer_id,
response_id,
CASE WHEN (response_id=2 and response_output!="Error") THEN get_json_object(response_output, '$.Metrics.OrderResponseTime')
WHEN (response_id=1 and response_output!="Error") THEN xpath_string(response_output,'USR_ORD/OrderResponse/USR1OrderTotalTime')
WHEN ((response_id=1 or response_id=2) and response_output="Error") THEN "Error"
ELSE null END as order_time
from Customer_Order""").show()
Below is the error I get when running the above query. How can I handle the invalid XML or JSON?
Driver stacktrace:
21/02/05 00:48:06 INFO scheduler.DAGScheduler: Job 5 failed: show at Engine.scala:221, took 1.099890 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 10.0 failed 4 times, most recent failure: Lost task 3.3 in stage 10.0 (TID 83, srwredf2021.analytics1.test.dev.corp, executor 3): java.lang.RuntimeException: Invalid XML document: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1234; XML document structures must start and end within the same entity.
<USR_ORD><OrderResult><ORDTime>2021-02-02 10:34:19</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>321</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult><OrderResponse></USR_ORD>
My Code snippets with data for reference
Data List
val custList = List((100,1,"<USR_ORD><OrderResult><ORDTime>2021-02-02 10:34:19</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>321</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult><OrderResponse></USR_ORD>"),
(200,1,"<USR_ORD><OrderResponse><OrderResult><ORDTime>2021-02-02 21:13:12</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>"),
(300,1,"Error"),
(400,1,""),
(500,1,"""{"OrderResponseTime":"2021-02-02 11:34:19", "OrderResponse":"COMPLETE", "OrderTime":300, "USR1Order":null, "USR1Orderqut":10 }"""),
(600,2,"""{"OrderResponseTime":"2021-02-02 15:14:13", "OrderResponse":"COMPLETE", "OrderTime":300, "USR1Order":null, "USR1Orderqut":10 }"""),
(700,2,"""{"OrderResponseTime":"2021-02-02 12:38:26", "OrderResponse":"COMPLETE", "OrderTime":200, "USR1Order":null "USR1Orderqut":4} """),
(800,2,"""{"OrderResponseTime":"2021-02-02 13:24:19", "OrderResponse":"COMPLETE", "OrderTime":100, "USR1Order":null} "USR1Orderqut":1}"""),
(900,2,"<USR_ORD><OrderResponse><OrderResult><ORDTime>2021-02-02 01:12:49</ORDTime><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>"),
(101,2,"Error"),
(202,2,""));
Loading List to RDD
val rdd = spark.sparkContext.parallelize(custList)
Imposing Schema Column Names
val DF1 = rdd.toDF("customer_id","response_id","response_output")
Creating Table
DF1.createOrReplaceTempView("Customer_Order")
Printing Schema
spark.sql("""select customer_id,response_id,response_output from Customer_Order""").printSchema()
root
|-- customer_id: integer (nullable = false)
|-- response_id: integer (nullable = false)
|-- response_output: string (nullable = true)
Showing Records
spark.sql("""select customer_id,response_id,response_output from Customer_Order""").show()
+-----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|customer_id|response_id|response_output |
+-----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|100 |1 |<USR_ORD><OrderResult><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>321</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult><OrderResponse></USR_ORD> |
|200 |1 |<USR_ORD><OrderResponse><OrderResult><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>|
|300 |1 |Error |
|400 |1 | |
|500 |1 |{ "OrderResponse":"COMPLETE", "OrderTime":300, "USR1Order":null, "USR1Orderqut":10 } |
|600 |2 |{ "OrderResponse":"COMPLETE", "OrderTime":300, "USR1Order":null, "USR1Orderqut":10 } |
|700 |2 |{ "OrderResponse":"COMPLETE", "OrderTime":200, "USR1Order":null "USR1Orderqut":4} |
|800 |2 |{ "OrderResponse":"COMPLETE", "OrderTime":100, "USR1Order":null} "USR1Orderqut":1} |
|900 |2 |<USR_ORD><OrderResponse><OrderResult><ORDStatus>COMPLETE</ORDStatus><ORDValue><USR1OrderTotalTime>221</USR1OrderTotalTime><USR1OrderKYC>{ND}</USR1OrderKYC><USR1OrderLoc>{ND}</USR1OrderLoc><USR1Orderqnt>10</USR1Orderqnt><USR1Orderxyz>0</USR1Orderxyz><USR1OrderD>{ND}</USR1OrderD></ORDValue></OrderResult></OrderResponse></USR_ORD>|
|101 |2 |Error |
|202 |2 | |
+-----------+-----------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
For JSON you don't need to validate it, as get_json_object will not fail if the path doesn't exist or the JSON is not valid.
To avoid an exception when extracting values from the XML, you can use a UDF to check whether the response_output string can be parsed as XML:
import scala.util.{Failure, Success, Try}

val isParsableXML = (xml: String) => {
  Try(scala.xml.XML.loadString(xml)) match {
    case Success(_) => true
    case Failure(_) => false
  }
}

spark.udf.register("is_parsable_xml", isParsableXML)
Then use it in your SQL query:
spark.sql("""
SELECT customer_id,
response_id,
CASE
WHEN (response_id=2
AND response_output!='Error') THEN get_json_object(response_output, '$.OrderTime')
WHEN (response_id=1
AND response_output!='Error'
AND is_parsable_xml(response_output)) THEN xpath_string(response_output, 'USR_ORD/OrderResponse/USR1OrderTotalTime')
WHEN ((response_id=1
OR response_id=2)
AND response_output='Error') THEN 'Error'
ELSE NULL
END AS order_time
FROM Customer_Order
""").show()
//+-----------+-----------+----------+
//|customer_id|response_id|order_time|
//+-----------+-----------+----------+
//| 100| 1| null|
//| 200| 1| |
//| 300| 1| Error|
//| 400| 1| null|
//| 500| 1| null|
//| 600| 2| 300|
//| 700| 2| null|
//| 800| 2| 100|
//| 900| 2| null|
//| 101| 2| Error|
//| 202| 2| null|
//+-----------+-----------+----------+
Now you can write your case when logic.
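If you are working from PySpark rather than Scala (this page is tagged pyspark), roughly the same check can be registered as a Python UDF using the standard-library XML parser. This is a sketch, not a drop-in from the answer above:
import xml.etree.ElementTree as ET
from pyspark.sql.types import BooleanType

def is_parsable_xml(s):
    # None, blanks, JSON strings and "Error" all fail to parse and return False
    if not s:
        return False
    try:
        ET.fromstring(s)
        return True
    except ET.ParseError:
        return False

spark.udf.register("is_parsable_xml", is_parsable_xml, BooleanType())
The SQL query stays exactly the same, since it only refers to the registered name is_parsable_xml.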

Histogram - doing it in a parallel way

+----+----+--------+
| Id | M1 | trx |
+----+----+--------+
| 1 | M1 | 11.35 |
| 2 | M1 | 3.4 |
| 3 | M1 | 10.45 |
| 2 | M1 | 3.95 |
| 3 | M1 | 20.95 |
| 2 | M2 | 25.55 |
| 1 | M2 | 9.95 |
| 2 | M2 | 11.95 |
| 1 | M2 | 9.65 |
| 1 | M2 | 14.54 |
+----+----+--------+
With the above dataframe, I should be able to generate a histogram as below using the code shown here.
A similar question is here.
val (Range,counts) = df
.select(col("trx"))
.rdd.map(r => r.getDouble(0))
.histogram(10)
// Range: Array[Double] = Array(3.4, 5.615, 7.83, 10.045, 12.26, 14.475, 16.69, 18.905, 21.12, 23.335, 25.55)
// counts: Array[Long] = Array(2, 0, 2, 3, 0, 1, 0, 1, 0, 1)
But the issue here is: how can I create the histograms in parallel, split by column 'M1'? That means I need two histogram outputs, one for each of the column values M1 and M2.
First, you need to know that histogram generates two separate sequential jobs. One to detect the minimum and maximum of your data, one to compute the actual histogram. You can check this using the Spark UI.
We can follow the same scheme to build histograms on as many columns as you wish, with only two jobs. Yet, we cannot use the histogram function which is only meant to handle one collection of doubles. We need to implement it by ourselves. The first job is dead simple.
val Row(min_trx : Double, max_trx : Double) = df.select(min('trx), max('trx)).head
Then we compute the ranges of the histogram locally. Note that I use the same ranges for all the columns; this makes it easy to compare the results between columns (by plotting them on the same figure). Having different ranges per column would only be a small modification of this code, though.
val hist_size = 10
val hist_step = (max_trx - min_trx) / hist_size
val hist_ranges = (1 until hist_size)
  .scanLeft(min_trx)((a, _) => a + hist_step) :+ max_trx
// I add max_trx manually to avoid rounding errors that would exclude the value
That was the first part. Then, we can use a UDF to determine in what range each value ends up, and compute all the histograms in parallel with spark.
val range_index = udf((x : Double) => hist_ranges.lastIndexWhere(x >= _))
val hist_df = df
  .withColumn("rangeIndex", range_index('trx))
  .groupBy("M1", "rangeIndex")
  .count()
// And voilà, all the data you need is there.
hist_df.show()
+---+----------+-----+
| M1|rangeIndex|count|
+---+----------+-----+
| M2| 2| 2|
| M1| 0| 2|
| M2| 5| 1|
| M1| 3| 2|
| M2| 3| 1|
| M1| 7| 1|
| M2| 10| 1|
+---+----------+-----+
As a bonus, you can shape the data to use it locally (within the driver), either using the RDD API or by collecting the dataframe and modifying it in Scala.
Here is one way to do it with Spark, since this is a question about Spark ;-)
val hist_map = hist_df.rdd
  .map(row => row.getAs[String]("M1") ->
    (row.getAs[Int]("rangeIndex"), row.getAs[Long]("count")))
  .groupByKey
  .mapValues(_.toMap)
  .mapValues(hists => (1 to hist_size)
    .map(i => hists.getOrElse(i, 0L)).toArray)
  .collectAsMap
EDIT: how to build one range per column value:
Instead of computing the min and max of M1, we compute it for each value of the column with groupBy.
val min_max_map = df.groupBy("M1")
  .agg(min('trx), max('trx))
  .rdd.map(row => row.getAs[String]("M1") ->
    (row.getAs[Double]("min(trx)"), row.getAs[Double]("max(trx)")))
  .collectAsMap // maps each column value to a tuple (min, max)
Then we adapt the UDF so that it uses this map and we are done.
// for clarity, let's define a function that generates histogram ranges
def generate_ranges(min_trx: Double, max_trx: Double, hist_size: Int) = {
  val hist_step = (max_trx - min_trx) / hist_size
  (1 until hist_size).scanLeft(min_trx)((a, _) => a + hist_step) :+ max_trx
}
// and use it to generate one range per column value
val range_map = min_max_map.keys
  .map(key => key ->
    generate_ranges(min_max_map(key)._1, min_max_map(key)._2, hist_size))
  .toMap
val range_index = udf((x: Double, m1: String) =>
  range_map(m1).lastIndexWhere(x >= _))
Finally, just replace range_index('trx) by range_index('trx, 'M1) and you will have one range per column value.
The way I do histograms with Spark is as follows:
val binEdes = 0.0 to 25.0 by 5.0
val bins = binEdes.init.zip(binEdes.tail).toDF("bin_from","bin_to")
df
  .join(bins, $"trx" >= $"bin_from" and $"trx" < $"bin_to", "right")
  .groupBy($"bin_from", $"bin_to")
  .agg(
    count($"trx").as("count")
    // add more, e.g. sum($"trx")
  )
  .orderBy($"bin_from", $"bin_to")
  .show()
gives:
+--------+------+-----+
|bin_from|bin_to|count|
+--------+------+-----+
| 0.0| 5.0| 2|
| 5.0| 10.0| 2|
| 10.0| 15.0| 4|
| 15.0| 20.0| 0|
| 20.0| 25.0| 1|
+--------+------+-----+
Now if you have more dimensions, just add that to the groupBy-clause
df
  .join(bins, $"trx" >= $"bin_from" and $"trx" < $"bin_to", "right")
  .groupBy($"M1", $"bin_from", $"bin_to")
  .agg(
    count($"trx").as("count")
  )
  .orderBy($"M1", $"bin_from", $"bin_to")
  .show()
gives:
+----+--------+------+-----+
| M1|bin_from|bin_to|count|
+----+--------+------+-----+
|null| 15.0| 20.0| 0|
| M1| 0.0| 5.0| 2|
| M1| 10.0| 15.0| 2|
| M1| 20.0| 25.0| 1|
| M2| 5.0| 10.0| 2|
| M2| 10.0| 15.0| 2|
+----+--------+------+-----+
You may tweak the code a bit to get the output you want, but this should get you started. You could also do the UDAF approach I posted here : Spark custom aggregation : collect_list+UDF vs UDAF
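Since the question is tagged pyspark, here is roughly the same bin-join idea sketched in PySpark, assuming a dataframe named df with columns M1 and trx and the same 0.0 to 25.0 bin edges as above (a translation for illustration, not the answer's own code):
from pyspark.sql import functions as F

edges = [0.0, 5.0, 10.0, 15.0, 20.0, 25.0]
bins = spark.createDataFrame(list(zip(edges[:-1], edges[1:])), ["bin_from", "bin_to"])

(df.join(bins, (F.col("trx") >= F.col("bin_from")) & (F.col("trx") < F.col("bin_to")), "right")
   .groupBy("M1", "bin_from", "bin_to")
   .agg(F.count("trx").alias("count"))
   .orderBy("M1", "bin_from", "bin_to")
   .show())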
I think it's not easily possible using RDDs, because histogram is only available on DoubleRDD, i.e. RDDs of Double. If you really need to use the RDD API, you can do it in parallel by firing parallel jobs; this can be done using Scala's parallel collections:
import scala.collection.parallel.immutable.ParSeq

val List((rangeM1, histM1), (rangeM2, histM2)) = ParSeq("M1", "M2")
  .map(c => df.where($"M1" === c)
    .select(col("trx"))
    .rdd.map(r => r.getDouble(0))
    .histogram(10)
  ).toList
println(rangeM1.toSeq,histM1.toSeq)
println(rangeM2.toSeq,histM2.toSeq)
gives:
(WrappedArray(3.4, 5.155, 6.91, 8.665000000000001, 10.42, 12.175, 13.930000000000001, 15.685, 17.44, 19.195, 20.95),WrappedArray(2, 0, 0, 0, 2, 0, 0, 0, 0, 1))
(WrappedArray(9.65, 11.24, 12.83, 14.420000000000002, 16.01, 17.6, 19.19, 20.78, 22.37, 23.96, 25.55),WrappedArray(2, 1, 0, 1, 0, 0, 0, 0, 0, 1))
Note that the bins differ here for M1 and M2

Apply UDF function to Spark window where the input paramter is a list of all column values in range

I would like to build a moving average on each row in a window, let's say over the preceding 10 rows. BUT if there are fewer than 10 rows available, I would like to insert a 0 in the resulting row -> new column.
So what I am trying to achieve is using a UDF in an aggregate window with an input parameter of List() (or whatever superclass) that holds the values of all available rows.
Here's a code example that doesn't work:
val w = Window.partitionBy("id").rowsBetween(-10, +0)
dfRetail2.withColumn("test", udftestf(dfRetail2("salesMth")).over(w))
Expected output: List(1,2,3,4) if no more rows are available, and take this as the input parameter for the UDF. The UDF should return a calculated value, or 0 if fewer than 10 rows are available.
The above code terminates with: Expression 'UDF(salesMth#152L)' not supported within a window function.;;
You can use Spark's built-in Window functions along with when/otherwise for your specific condition without the need for a UDF/UDAF. For simplicity, the sliding-window size is reduced to 4 in the following example with dummy data:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import spark.implicits._
val df = (1 to 2).flatMap(i => Seq.tabulate(8)(j => (i, i * 10.0 + j))).
  toDF("id", "amount")

val slidingWin = 4
val winSpec = Window.partitionBy($"id").rowsBetween(-(slidingWin - 1), 0)

df.
  withColumn("slidingCount", count($"amount").over(winSpec)).
  withColumn("slidingAvg", when($"slidingCount" < slidingWin, 0.0).
    otherwise(avg($"amount").over(winSpec))
  ).show
// +---+------+------------+----------+
// | id|amount|slidingCount|slidingAvg|
// +---+------+------------+----------+
// | 1| 10.0| 1| 0.0|
// | 1| 11.0| 2| 0.0|
// | 1| 12.0| 3| 0.0|
// | 1| 13.0| 4| 11.5|
// | 1| 14.0| 4| 12.5|
// | 1| 15.0| 4| 13.5|
// | 1| 16.0| 4| 14.5|
// | 1| 17.0| 4| 15.5|
// | 2| 20.0| 1| 0.0|
// | 2| 21.0| 2| 0.0|
// | 2| 22.0| 3| 0.0|
// | 2| 23.0| 4| 21.5|
// | 2| 24.0| 4| 22.5|
// | 2| 25.0| 4| 23.5|
// | 2| 26.0| 4| 24.5|
// | 2| 27.0| 4| 25.5|
// +---+------+------------+----------+
Per remark in the comments section, I'm including a solution via UDF below as an alternative:
def movingAvg(n: Int) = udf{ (ls: Seq[Double]) =>
  val (avg, count) = ls.takeRight(n).foldLeft((0.0, 1)){
    case ((a, i), next) => (a + (next - a) / i, i + 1)
  }
  if (count <= n) 0.0 else avg // Expand/Modify this for specific requirement
}

// To apply the UDF:
df.
  withColumn("average", movingAvg(slidingWin)(collect_list($"amount").over(winSpec))).
  show
Note that unlike sum or count, collect_list ignores rowsBetween() and generates partitioned data that can potentially be very large to be passed to the UDF (hence the need for takeRight()). If the computed Window sum and count are sufficient for what's needed for your specific requirement, consider passing them to the UDF instead.
In general, especially if the data at hand is already in DataFrame format, it'd perform and scale better by using built-in DataFrame API to take advantage of Spark's execution engine optimization than using user-defined UDF/UDAF. You might be interested in reading this article re: advantages of DataFrame/Dataset API over UDF/UDAF.
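For reference on a pyspark-tagged page, the built-in (no-UDF) approach above translates almost line for line to PySpark; a sketch with the same dummy data and sliding window of 4:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = spark.createDataFrame(
    [(i, i * 10.0 + j) for i in (1, 2) for j in range(8)],
    ["id", "amount"])

sliding_win = 4
win_spec = Window.partitionBy("id").rowsBetween(-(sliding_win - 1), 0)

(df.withColumn("slidingCount", F.count("amount").over(win_spec))
   .withColumn("slidingAvg",
               F.when(F.col("slidingCount") < sliding_win, 0.0)
                .otherwise(F.avg("amount").over(win_spec)))
   .show())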

How to extract a previous and next line sentence in spark?

I'm analyzing a log file for customer impact analysis by using Apache spark. I have the log file that contains the time stamp in one line, customer's details in another line and the error caused by in another line, I want the output in one file which will combine all the extracted record to one line. Here is my log file below:
2018-10-15 05:24:00.102 ERROR 7 --- [DefaultMessageListenerContainer-2] c.l.p.a.c.event.listener.MQListener : The ABC/CDE object received for the xyz event was not valid. e_id=11111111, s_id=111, e_name=ABC
com.xyz.abc.pqr.exception.PNotVException: The r received from C was invalid/lacks mandatory fields. S_id: 123, P_Id: 123456789, R_Number: 12345678
at com.xyz.abc.pqr.mprofile.CPServiceImpl.lambda$bPByC$1(CPServiceImpl.java:240)
at java.util.ArrayList.forEach(ArrayList.java:1249)
rContainer.doInvokeListener(AbstractMessageListenerContainer.java:721)
at org.springframework.jms.listener.AbstractMessageListenerContainer.invokeListener(AbstractMessageListenerContainer.java:681)
at org.springframework.jms.listener.AbstractMessageListenerContainer.doExecuteListener(AbstractMessageListenerContainer.java:651)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Invalid D because cm: null and pk: null were missing.
at com.xyz.abc.pqr.mp.DD.resolveDetailsFromCDE(DD.java:151)
at com.xyz.abc.pqr.mp.DD.<init>(DD.java:35)
at java.util.ArrayList.forEach(ArrayList.java:1249)
2018-10-15 05:24:25.136 ERROR 7 --- [DefaultMessageListenerContainer-1] c.l.p.a.c.event.listener.MQListener : The ABC/CDE object received for the C1 event was not valid. entity_id=2222222, s_id=3333, event_name=CDE
com.xyz.abc.pqr.PNotVException: The r received from C was invalid/lacks mandatory fields. S_id: 123, P_Id: 123456789, R_Number: 12345678
at com.xyz.abc.pqr.mp.CSImpl.lambda$buildABCByCo$1(CSImpl.java:240)
at java.util.ArrayList.forEach(ArrayList.java:1249)
at com.xyz.abc.pqr.event.handler.DHandler.handle(CDEEventHandler.java:45)
at sun.reflect.GMA.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:333)
at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.doInvoke(InvocableHandlerMethod.java:197)
at org.springframework.messaging.handler.invocation.InvocableHandlerMethod.invoke(InvocableHandlerMethod.java:115)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException: null
You can use the DataFrame API to do this in a few ways. Here is one
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val rd = sc.textFile("/FileStore/tables/log.txt").zipWithIndex.map{ case (r, i) => Row(r, i) }
val schema = StructType(StructField("logs", StringType, false) :: StructField("id", LongType, false) :: Nil)
val df = spark.sqlContext.createDataFrame(rd, schema)
df.show
+--------------------+---+
| logs| id|
+--------------------+---+
|2018-10-15 05:24:...| 0|
| | 1|
|com.xyz.abc.pqr.e...| 2|
| at com.xyz.ab...| 3|
| at java.util....| 4|
| rContainer.do...| 5|
| at org.spring...| 6|
| at org.spring...| 7|
| at java.lang....| 8|
|Caused by: java.l...| 9|
| at com.xyz.ab...| 10|
| at com.xyz.ab...| 11|
| at com.xyz.ab...| 12|
| at java.util....| 13|
| | 14|
|2018-10-15 05:24:...| 15|
| | 16|
|com.xyz.abc.pqr.P...| 17|
val df1 = df.filter($"logs".contains("c.l.p.a.c.event.listener.MQListener")).withColumn("logs",regexp_replace($"logs","ERROR.*","")).sort("id")
df1.show
+--------------------+---+
| logs| id|
+--------------------+---+
|2018-10-15 05:24:...| 0|
|2018-10-15 05:24:...| 15|
+--------------------+---+
val df2 = df.filter($"logs".contains("PrescriptionNotValidException:")).withColumn("logs",regexp_replace($"logs","(.*?)mandatory fields.","")).sort("id")
df2.show
+--------------------+---+
| logs| id|
+--------------------+---+
| StoreId: 123, Co...| 2|
| StoreId: 234, Co...| 17|
+--------------------+---+
val df3 = df.filter($"logs".contains("Caused by: java.lang.")).sort("id")
df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [logs: string, id: bigint]
df3.show
+--------------------+---+
|                logs| id|
+--------------------+---+
|Caused by: java.l...|  9|
|Caused by: java.l...| 28|
+--------------------+---+
df1.select("logs").collect.toSeq.zip(df2.select("logs").collect.toSeq).zip(df3.select("logs").collect.toSeq)
res71: Seq[((org.apache.spark.sql.Row, org.apache.spark.sql.Row), org.apache.spark.sql.Row)] = ArrayBuffer((([2018-10-15 05:24:00.102 ],[ StoreId: 123, Co Patient Id: 123456789, Rx Number: 12345678]),[Caused by: java.lang.IllegalArgumentException: Invalid Dispense Object because compound: null and pack: null were missing.]), (([2018-10-15 05:24:25.136 ],[ StoreId: 234, Co Patient Id: 999999, Rx Number: 45555]),[Caused by: java.lang.NullPointerException: null]))
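The same zipWithIndex approach also works from PySpark if that is your environment; a minimal sketch, assuming the same /FileStore/tables/log.txt path and the filter strings used above:
from pyspark.sql import functions as F

# Pair every log line with its line number so related lines can be matched up later
rdd = sc.textFile("/FileStore/tables/log.txt").zipWithIndex()
logs = spark.createDataFrame(rdd, ["logs", "id"])

# Timestamped listener lines with the ERROR tail stripped off
df1 = (logs.filter(F.col("logs").contains("c.l.p.a.c.event.listener.MQListener"))
           .withColumn("logs", F.regexp_replace("logs", "ERROR.*", ""))
           .orderBy("id"))

# Root-cause lines
df3 = logs.filter(F.col("logs").contains("Caused by: java.lang.")).orderBy("id")

df1.show()
df3.show()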