Split string in Dataframe using Scala on Spark - scala

I have a logfile with 100+ columns, of which I only need two: '_raw' and '_time'. So I loaded the logfile as a "csv" DataFrame.
Step 1:
scala> val log = spark.read.format("csv").option("inferSchema", "true").option("header", "true").load("soa_prod_diag_10_jan.csv")
log: org.apache.spark.sql.DataFrame = [ARRAffinity: string, CoordinatorNonSecureURL: string ... 126 more fields]
Step 2:
I registered the DF as temp table
log.createOrReplaceTempView("logs")
Step 3: I extracted my two required columns '_raw' and '_time'
scala> val sqlDF = spark.sql("select _raw, _time from logs")
sqlDF: org.apache.spark.sql.DataFrame = [_raw: string, _time: string]
scala> sqlDF.show(1, false)
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|_raw |_time|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
|[2019-01-10T23:59:59.998-06:00] [xx_yyy_zz_sss_ra10] [ERROR] [OSB-473003] [oracle.osb.statistics.statistics] [tid: [ACTIVE].ExecuteThread: '28' for queue: 'weblogic.kernel.Default (self-tuning)'] [userId: <anonymous>] [ecid: 92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b,0] [partition-name: DOMAIN] [tenant-name: GLOBAL] Aggregation Server Not Available. Failed to get remote aggregator[[|null |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----+
only showing top 1 row
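As a side note (my addition, just an equivalent of Steps 2-3, assuming the log DataFrame from Step 1), the two columns can also be selected without registering a temp table:
val sqlDF = log.select("_raw", "_time")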
My requirement:
I need to split the string in the '_raw' column to produce
[2019-01-10T23:59:59.998-06:00] [xx_yyy_zz_sss_ra10] [ERROR] [OSB-473003] [oracle.osb.statistics.statistics] [ecid: 92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b] with column names a, b, c, d, e, f respectively
I also need to remove all null values from both '_raw' and '_time'.
Your answers will be appreciated :)

You can use the split function to split _raw by space. This returns an array, and you can then extract the values from that array. Note that splitting on spaces only works cleanly for the leading bracketed fields, because later fields such as [tid: ...] contain spaces themselves; for those you can use the regexp_extract function instead. Both ways are shown below. I hope it is helpful.
//Creating Test Data
import spark.implicits._ // for $ and toDF (already available in spark-shell)
import org.apache.spark.sql.functions._ // for split and regexp_extract
val df = Seq("[2019-01-10T23:59:59.998-06:00] [xx_yyy_zz_sss_ra10] [ERROR] [OSB-473003] [oracle.osb.statistics.statistics] [tid: [ACTIVE].ExecuteThread: '28' for queue: 'weblogic.kernel.Default (self-tuning)'] [userId: <anonymous>] [ecid: 92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b,0] [partition-name: DOMAIN] [tenant-name: GLOBAL] Aggregation Server Not Available. Failed to get remote aggregator[[")
.toDF("_raw")
val splitDF = df.withColumn("split_raw_arr", split($"_raw", " "))
.withColumn("A", $"split_raw_arr"(0))
.withColumn("B", $"split_raw_arr"(1))
.withColumn("C", $"split_raw_arr"(2))
.withColumn("D", $"split_raw_arr"(3))
.withColumn("E", $"split_raw_arr"(4))
.drop("_raw", "split_raw_arr")
splitDF.show(false)
+-------------------------------+--------------------+-------+------------+----------------------------------+
|A |B |C |D |E |
+-------------------------------+--------------------+-------+------------+----------------------------------+
|[2019-01-10T23:59:59.998-06:00]|[xx_yyy_zz_sss_ra10]|[ERROR]|[OSB-473003]|[oracle.osb.statistics.statistics]|
+-------------------------------+--------------------+-------+------------+----------------------------------+
//Alternative: extract each bracketed field (and the ecid) with regexp_extract
val extractedDF = df
.withColumn("a", regexp_extract($"_raw", "\\[(.*?)\\]",1))
.withColumn("b", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\]",2))
.withColumn("c", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",3))
.withColumn("d", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",4))
.withColumn("e", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",5))
.withColumn("f", regexp_extract($"_raw", "(?<=ecid: )(.*?)(?=,)",1))
.drop("_raw")
extractedDF.show(false)
+-----------------------------+------------------+-----+----------+--------------------------------+---------------------------------------------+
|a |b |c |d |e |f |
+-----------------------------+------------------+-----+----------+--------------------------------+---------------------------------------------+
|2019-01-10T23:59:59.998-06:00|xx_yyy_zz_sss_ra10|ERROR|OSB-473003|oracle.osb.statistics.statistics|92b39a8b-8234-4d19-9ac7-4908dc79c5ed-0000bd0b|
+-----------------------------+------------------+-----+----------+--------------------------------+---------------------------------------------+
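The question also asks to remove null values from '_raw' and '_time'; that part is not covered above. A minimal sketch, assuming the sqlDF from the question:
val cleanedDF = sqlDF.na.drop(Seq("_raw", "_time")) // drop rows where _raw or _time is null
cleanedDF.show(1, false)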

Related

How to create a dataframe from Array[Strings]?

I used rdd.collect() to create an Array, and now I want to use this Array[String] to create a DataFrame. My test file is in the following format (separated by a pipe |):
TimeStamp
IdC
Name
FileName
Start-0f-fields
column01
column02
column03
column04
column05
column06
column07
column08
column010
column11
End-of-fields
Start-of-data
G0002B|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002A|0|13|IS|LS|Xys|Xyz|12|23|45|
G0002x|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002C|0|13|IS|LS|Xys|Xyz|12|23|48|
End-of-data
The column names are between the Start-0f-fields and End-of-fields lines.
I want to store the pipe-separated ("|") values in different columns of a DataFrame, like the example below:
column01 column02 column03 column04 column05 column06 column07 column08 column010 column11
G0002C 0 13 IS LS Xys Xyz 12 23 48
G0002x 0 13 LS MS Xys Xyz 14 300 400
my code :
val rdd = sc.textFile("the above text file")
val columns = rdd.collect.slice(5,16).mkString(",") // it will hold columnnames
val data = rdd.collect.slice(5,16)
val rdd1 = sc.parallelize(rdd.collect())
val df = rdd1.toDf(columns)
But this is not giving me the desired DataFrame shown above.
Could you try this?
import spark.implicits._ // Add to use `toDS()` and `toDF()`
val rdd = sc.textFile("the above text file")
val columns = rdd.collect.slice(5,15) // the column-name lines; `.mkString(",")` is not needed
val dataDS = rdd.collect.slice(17,21) // the data lines between Start-of-data and End-of-data
.map(_.trim()) // to remove whitespaces
.map(s => s.substring(0, s.length - 1)) // to remove last pipe '|'
.toSeq
.toDS
val df = spark.read
.option("header", false)
.option("delimiter", "|")
.csv(dataDS)
.toDF(columns: _*)
df.show(false)
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|column01|column02|column03|column04|column05|column06|column07|column08|column010|column11|
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|G0002B |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002A |0 |13 |IS |LS |Xys |Xyz |12 |23 |45 |
|G0002x |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002C |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
Calling the spark.read...csv() method without a schema can take a long time on huge data because of schema inference.
In that case, you can specify a schema like below.
/*
column01 STRING,
column02 STRING,
column03 STRING,
...
*/
val schema = columns
.map(c => s"$c STRING")
.mkString(",\n")
val df = spark.read
.option("header", false)
.option("delimiter", "|")
.schema(schema) // no schema inference occurs
.csv(dataDS)
// .toDF(columns: _*) => unnecessary when schema is specified
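If your Spark version does not accept a DDL string in schema() (the string overload is relatively recent; this is an assumption about your environment), you can build the equivalent StructType explicitly:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Same effect as the DDL string above: every column as a nullable string
val structSchema = StructType(columns.map(c => StructField(c, StringType, nullable = true)))

val dfWithStructSchema = spark.read
  .option("header", false)
  .option("delimiter", "|")
  .schema(structSchema)
  .csv(dataDS)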
If the number of columns and the column names are fixed, then you can do it as below:
val columns = rdd.collect.slice(5,15).mkString(",") // it will hold the column names
val data = rdd.collect.slice(17,21)
val d = data.mkString("\n").split('\n').toSeq.toDF()
import org.apache.spark.sql.functions._
val dd = d.withColumn("columnX", split($"value", "\\|"))
  .withColumn("column1", $"columnX".getItem(0))
  .withColumn("column2", $"columnX".getItem(1))
  .withColumn("column3", $"columnX".getItem(2))
  .withColumn("column4", $"columnX".getItem(3))
  .withColumn("column5", $"columnX".getItem(4))
  .withColumn("column6", $"columnX".getItem(5))
  .withColumn("column8", $"columnX".getItem(7))
  .withColumn("column10", $"columnX".getItem(8))
  .withColumn("column11", $"columnX".getItem(9))
  .drop("columnX", "value")
display(dd)
you can see the output as below:

Losing entries when inner-joining data to a left-joined DataFrame in Spark Structured Streaming

I'm trying to join data with a DataFrame that in turn resulted from a left join. While in batch processing this works as expected, in stream processing some entries are lost...
Below I created a minimal example of "sessions" that have "start" and "end" events and, optionally, some "metadata".
The script generates two outputs: sessionStartsWithMetadata results from "start" events that are left-joined with the "metadata" events, based on sessionId. A "left join" is used, since we want an output event even when no corresponding metadata exists.
Additionally, a DataFrame endedSessionsWithMetadata is created by joining "end" events to the previously created DataFrame. Here an "inner join" is used, since we only want output once a session has definitely ended.
This code can be executed in spark-shell:
import java.sql.Timestamp
import org.apache.spark.sql.execution.streaming.{MemoryStream, StreamingQueryWrapper}
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, expr, lit}
import spark.implicits._
implicit val sqlContext: SQLContext = spark.sqlContext
// Main data processing, regardless whether batch or stream processing
def process(
sessionStartEvents: DataFrame,
sessionOptionalMetadataEvents: DataFrame,
sessionEndEvents: DataFrame
): (DataFrame, DataFrame) = {
val sessionStartsWithMetadata: DataFrame = sessionStartEvents
.join(
sessionOptionalMetadataEvents,
sessionStartEvents("sessionId") === sessionOptionalMetadataEvents("sessionId") &&
sessionStartEvents("sessionStartTimestamp").between(
sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").minus(expr(s"INTERVAL 1 seconds")),
sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp").plus(expr(s"INTERVAL 1 seconds"))
),
"left" // metadata is optional
)
.select(
sessionStartEvents("sessionId"),
sessionStartEvents("sessionStartTimestamp"),
sessionOptionalMetadataEvents("sessionOptionalMetadataTimestamp")
)
val endedSessionsWithMetadata = sessionStartsWithMetadata.join(
sessionEndEvents,
sessionStartsWithMetadata("sessionId") === sessionEndEvents("sessionId") &&
sessionStartsWithMetadata("sessionStartTimestamp").between(
sessionEndEvents("sessionEndTimestamp").minus(expr(s"INTERVAL 10 seconds")),
sessionEndEvents("sessionEndTimestamp")
)
)
(sessionStartsWithMetadata, endedSessionsWithMetadata)
}
def streamProcessing(
sessionStartData: Seq[(Timestamp, Int)],
sessionOptionalMetadata: Seq[(Timestamp, Int)],
sessionEndData: Seq[(Timestamp, Int)]
): (StreamingQuery, StreamingQuery) = {
val sessionStartEventsStream: MemoryStream[(Timestamp, Int)] = MemoryStream[(Timestamp, Int)]
sessionStartEventsStream.addData(sessionStartData)
val sessionStartEvents: DataFrame = sessionStartEventsStream
.toDS()
.toDF("sessionStartTimestamp", "sessionId")
.withWatermark("sessionStartTimestamp", "1 second")
val sessionOptionalMetadataEventsStream: MemoryStream[(Timestamp, Int)] = MemoryStream[(Timestamp, Int)]
sessionOptionalMetadataEventsStream.addData(sessionOptionalMetadata)
val sessionOptionalMetadataEvents: DataFrame = sessionOptionalMetadataEventsStream
.toDS()
.toDF("sessionOptionalMetadataTimestamp", "sessionId")
.withWatermark("sessionOptionalMetadataTimestamp", "1 second")
val sessionEndEventsStream: MemoryStream[(Timestamp, Int)] = MemoryStream[(Timestamp, Int)]
sessionEndEventsStream.addData(sessionEndData)
val sessionEndEvents: DataFrame = sessionEndEventsStream
.toDS()
.toDF("sessionEndTimestamp", "sessionId")
.withWatermark("sessionEndTimestamp", "1 second")
val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
process(sessionStartEvents, sessionOptionalMetadataEvents, sessionEndEvents)
val sessionStartsWithMetadataQuery = sessionStartsWithMetadata
.select(lit("sessionStartsWithMetadata"), col("*")) // Add label to see which query's output it is
.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.option("numRows", "1000")
.start()
val endedSessionsWithMetadataQuery = endedSessionsWithMetadata
.select(lit("endedSessionsWithMetadata"), col("*")) // Add label to see which query's output it is
.writeStream
.outputMode("append")
.format("console")
.option("truncate", "false")
.option("numRows", "1000")
.start()
(sessionStartsWithMetadataQuery, endedSessionsWithMetadataQuery)
}
def batchProcessing(
sessionStartData: Seq[(Timestamp, Int)],
sessionOptionalMetadata: Seq[(Timestamp, Int)],
sessionEndData: Seq[(Timestamp, Int)]
): Unit = {
val sessionStartEvents = spark.createDataset(sessionStartData).toDF("sessionStartTimestamp", "sessionId")
val sessionOptionalMetadataEvents = spark.createDataset(sessionOptionalMetadata).toDF("sessionOptionalMetadataTimestamp", "sessionId")
val sessionEndEvents = spark.createDataset(sessionEndData).toDF("sessionEndTimestamp", "sessionId")
val (sessionStartsWithMetadata, endedSessionsWithMetadata) =
process(sessionStartEvents, sessionOptionalMetadataEvents, sessionEndEvents)
println("sessionStartsWithMetadata")
sessionStartsWithMetadata.show(100, truncate = false)
println("endedSessionsWithMetadata")
endedSessionsWithMetadata.show(100, truncate = false)
}
// Data is represented as tuples of (eventTime, sessionId)...
val sessionStartData = Vector(
(new Timestamp(1), 0),
(new Timestamp(2000), 1),
(new Timestamp(2000), 2),
(new Timestamp(20000), 10)
)
val sessionOptionalMetadata = Vector(
(new Timestamp(1), 0),
// session `1` has no metadata
(new Timestamp(2000), 2),
(new Timestamp(20000), 10)
)
val sessionEndData = Vector(
(new Timestamp(10000), 0),
(new Timestamp(11000), 1),
(new Timestamp(12000), 2),
(new Timestamp(30000), 10)
)
batchProcessing(sessionStartData, sessionOptionalMetadata, sessionEndData)
val (sessionStartsWithMetadataQuery, endedSessionsWithMetadataQuery) =
streamProcessing(sessionStartData, sessionOptionalMetadata, sessionEndData)
In the example, the session with ID 1 has no metadata, so the respective metadata column is null.
The main functionality of joining the data is implemented in def process(…), which is called using both batch data and stream data.
In the batch version the output is as expected:
sessionStartsWithMetadata
+---------+-----------------------+--------------------------------+
|sessionId|sessionStartTimestamp |sessionOptionalMetadataTimestamp|
+---------+-----------------------+--------------------------------+
|0 |1970-01-01 01:00:00.001|1970-01-01 01:00:00.001 |
|1 |1970-01-01 01:00:02 |null | ← has no metadata ✔
|2 |1970-01-01 01:00:02 |1970-01-01 01:00:02 |
|10 |1970-01-01 01:00:20 |1970-01-01 01:00:20 |
+---------+-----------------------+--------------------------------+
endedSessionsWithMetadata
+---------+-----------------------+--------------------------------+-------------------+---------+
|sessionId|sessionStartTimestamp |sessionOptionalMetadataTimestamp|sessionEndTimestamp|sessionId|
+---------+-----------------------+--------------------------------+-------------------+---------+
|0 |1970-01-01 01:00:00.001|1970-01-01 01:00:00.001 |1970-01-01 01:00:10|0 |
|1 |1970-01-01 01:00:02 |null |1970-01-01 01:00:11|1 | ← has no metadata ✔
|2 |1970-01-01 01:00:02 |1970-01-01 01:00:02 |1970-01-01 01:00:12|2 |
|10 |1970-01-01 01:00:20 |1970-01-01 01:00:20 |1970-01-01 01:00:30|10 |
+---------+-----------------------+--------------------------------+-------------------+---------+
But when the same processing is run as stream processing, the output of endedSessionsWithMetadata does not contain the entry for session 1, which has no metadata:
-------------------------------------------
Batch: 0 ("start event")
-------------------------------------------
+-------------------------+---------+-----------------------+--------------------------------+
|sessionStartsWithMetadata|sessionId|sessionStartTimestamp |sessionOptionalMetadataTimestamp|
+-------------------------+---------+-----------------------+--------------------------------+
|sessionStartsWithMetadata|10 |1970-01-01 01:00:20 |1970-01-01 01:00:20 |
|sessionStartsWithMetadata|2 |1970-01-01 01:00:02 |1970-01-01 01:00:02 |
|sessionStartsWithMetadata|0 |1970-01-01 01:00:00.001|1970-01-01 01:00:00.001 |
+-------------------------+---------+-----------------------+--------------------------------+
-------------------------------------------
Batch: 0 ("end event")
-------------------------------------------
+-------------------------+---------+-----------------------+--------------------------------+-------------------+---------+
|endedSessionsWithMetadata|sessionId|sessionStartTimestamp |sessionOptionalMetadataTimestamp|sessionEndTimestamp|sessionId|
+-------------------------+---------+-----------------------+--------------------------------+-------------------+---------+
|endedSessionsWithMetadata|10 |1970-01-01 01:00:20 |1970-01-01 01:00:20 |1970-01-01 01:00:30|10 |
|endedSessionsWithMetadata|2 |1970-01-01 01:00:02 |1970-01-01 01:00:02 |1970-01-01 01:00:12|2 |
|endedSessionsWithMetadata|0 |1970-01-01 01:00:00.001|1970-01-01 01:00:00.001 |1970-01-01 01:00:10|0 |
+-------------------------+---------+-----------------------+--------------------------------+-------------------+---------+
-------------------------------------------
Batch: 1 ("start event")
-------------------------------------------
+-------------------------+---------+---------------------+--------------------------------+
|sessionStartsWithMetadata|sessionId|sessionStartTimestamp|sessionOptionalMetadataTimestamp|
+-------------------------+---------+---------------------+--------------------------------+
|sessionStartsWithMetadata|1 |1970-01-01 01:00:02 |null | ← has no metadata ✔
+-------------------------+---------+---------------------+--------------------------------+
-------------------------------------------
Batch: 1 ("end event")
-------------------------------------------
+-------------------------+---------+---------------------+--------------------------------+-------------------+---------+
|endedSessionsWithMetadata|sessionId|sessionStartTimestamp|sessionOptionalMetadataTimestamp|sessionEndTimestamp|sessionId|
+-------------------------+---------+---------------------+--------------------------------+-------------------+---------+
+-------------------------+---------+---------------------+--------------------------------+-------------------+---------+
↳ ✘ here I would have expected a line with sessionId=1, that has "start" and "end" information, but no "metadata" ✘
Can anybody explain why in stream processing the "end" event with no "metadata" (sessionId=1) is not there? What do I need to do to make it appear in the output?
Thanks a lot!
After considerable testing, looking around, and re-reading the manual: it must be a bug in Spark.
I also note this post in circulation: https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a#%3Cdev.spark.apache.org%3E
Whilst global vs. chained stream-stream joins are understood, this points, in my opinion, to an issue with this type of processing.
I ran it on Databricks with Spark 3.x to no avail.

How to create new columns in dataframe using Spark Scala based on different string patterns

Step 1: I created a DataFrame df with two columns 'COLUMN A' and 'COLUMN B' of type string.
Step 2: I created new columns from 'COLUMN B' based on their index positions.
My requirement: I need one more column, a6, created NOT by index position but by matching anything that contains xxx, yyy, or zzz in the string.
val extractedDF = df
.withColumn("a1", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\]",2))
.withColumn("a2", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",3))
.withColumn("a3", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",5))
.withColumn("a4", regexp_extract($"_raw", "(?<=uvwx: )(.*?)(?=,)",1))
.withColumn("a5", regexp_extract($"_raw", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",13))
Please help me!!
You can use regexp_replace() and provide xxx|yyy|zzz as an alternation:
scala> val df = Seq(("abcdef"),("axxx"),("byyypp"),("czzzr")).toDF("_raw")
df: org.apache.spark.sql.DataFrame = [_raw: string]
scala> df.show(false)
+------+
|_raw |
+------+
|abcdef|
|axxx |
|byyypp|
|czzzr |
+------+
scala> df.withColumn("a6",regexp_replace($"_raw",""".*(xxx|yyy|zzz).*""","OK")===lit("OK")).show(false)
+------+-----+
|_raw |a6 |
+------+-----+
|abcdef|false|
|axxx |true |
|byyypp|true |
|czzzr |true |
+------+-----+
scala>
If you want to extract the match, then
scala> df.withColumn("a6",regexp_extract($"_raw",""".*(xxx|yyy|zzz).*""",1)).show(false)
+------+---+
|_raw |a6 |
+------+---+
|abcdef| |
|axxx |xxx|
|byyypp|yyy|
|czzzr |zzz|
+------+---+
scala>
EDIT1:
scala> val df = Seq((""" [2019-03-18T02:13:20.988-05:00] [svc4_prod2_bpel_ms14] [NOTIFICATION] [] [oracle.soa.mediator.serviceEngine] [tid: [ACTIVE].ExecuteThread: '57' for queue: 'weblogic.kernel.Default (self-tuning)'] [userId: <anonymous>] [ecid: 7e05e8d3-8d20-475f-a414-cb3295151c3e-0054c6b8,1:84559] [APP: soa-infra] [partition-name: DOMAIN] [tenant-name: GLOBAL] [oracle.soa.tracking.FlowId: 14436421] [oracle.soa.tracking.InstanceId: 363460793] [oracle.soa.tracking.SCAEntityId: 50139] [composite_name: DFOLOutputRouting] """)).toDF("_raw")
df: org.apache.spark.sql.DataFrame = [_raw: string]
scala> df.withColumn("a6",regexp_extract($"_raw",""".*(composite_name|compositename|composites|componentDN):\s+(\S+)\]""",2)).select("a6").show(false)
+-----------------+
|a6 |
+-----------------+
|DFOLOutputRouting|
+-----------------+
scala>
EDIT2
scala> val df = Seq((""" [2019-03-18T02:13:20.988-05:00] [svc4_prod2_bpel_ms14] [NOTIFICATION] [] [oracle.soa.mediator.serviceEngine] [tid: [ACTIVE].ExecuteThread: '57' for queue: 'weblogic.kernel.Default (self-tuning)'] [userId: <anonymous>] [ecid: 7e05e8d3-8d20-475f-a414-cb3295151c3e-0054c6b8,1:84559] [APP: soa-infra] [partition-name: DOMAIN] [tenant-name: GLOBAL] [oracle.soa.tracking.FlowId: 14436421] [oracle.soa.tracking.InstanceId: 363460793] [oracle.soa.tracking.SCAEntityId: 50139] [composite_name: DFOLOutputRouting!3.20.0202.190103.1116_19] """)).toDF("_raw")
df: org.apache.spark.sql.DataFrame = [_raw: string]
scala> df.withColumn("a6",regexp_extract($"_raw",""".*(composite_name|compositename|composites|componentDN):\s+([a-zA-Z]+)""",2)).select("a6").show(false)
+-----------------+
|a6 |
+-----------------+
|DFOLOutputRouting|
+-----------------+
scala>
I guess you only want a boolean indicating whether the string contains one of the patterns above; you can make use of the code below:
df.withColumn("a6", col("colName").contains("yyy") || col("colName").contains("xxx") || col("colName").contains("zzz"))

change data capture in spark

I have got a requirement, but I am confused about how to do it.
I have two dataframes. The first time, I got the below data in file1:
file1
prodid, lastupdatedate, indicator
00001,,A
00002,01-25-1981,A
00003,01-26-1982,A
00004,12-20-1985,A
the output should be
0001,1900-01-01, 2400-01-01, A
0002,1981-01-25, 2400-01-01, A
0003,1982-01-26, 2400-01-01, A
0004,1985-12-20, 2400-01-01, A
The second time, I got another file, file2:
prodid, lastupdatedate, indicator
00002,01-25-2018,U
00004,01-25-2018,U
00006,01-25-2018,A
00008,01-25-2018,A
I want the end result to be like:
00001,1900-01-01,2400-01-01,A
00002,1981-01-25,2018-01-25,I
00002,2018-01-25,2400-01-01,A
00003,1982-01-26,2400-01-01,A
00004,1985-12-20,2018-01-25,I
00004,2018-01-25,2400-01-01,A
00006,2018-01-25,2400-01-01,A
00008,2018-01-25,2400-01-01,A
So whatever updates are in the second file, that date should go in the second column, the default date (2400-01-01) should go in the third column, along with the relevant indicator. The default indicator is A.
I have started like this
val spark=SparkSession.builder()
.master("local")
.appName("creating data frame for csv")
.getOrCreate()
import spark.implicits._
val df = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("d:/prod.txt")
val df1 = spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv("d:/prod1.txt")
val newdf = df.na.fill("01-01-1900",Seq("lastupdatedate"))
if((df1("indicator")=='U') && (df1("prodid")== newdf("prodid"))){
val df3 = df1.except(newdf)
}
You should join them on prodid and use the when function to shape the dataframes into the expected output. Then filter the updated rows, turn them into the second ("A") rows, and merge them back. (I have included comments explaining each part of the code.)
import org.apache.spark.sql.functions._
//filling empty lastupdatedate and changing the date to the expected format
val newdf = df.na.fill("01-01-1900",Seq("lastupdatedate"))
.withColumn("lastupdatedate", date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))
//changing the date to the expected format of the second dataframe
val newdf1 = df1.withColumn("lastupdatedate", date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))
//joining both dataframes and updating columns according to your needs
val tempdf = newdf.as("table1").join(newdf1.as("table2"),Seq("prodid"), "outer")
.select(col("prodid"),
when(col("table1.lastupdatedate").isNotNull, col("table1.lastupdatedate")).otherwise(col("table2.lastupdatedate")).as("lastupdatedate"),
when(col("table1.indicator").isNotNull, when(col("table2.lastupdatedate").isNotNull, col("table2.lastupdatedate")).otherwise(lit("2400-01-01"))).otherwise(lit("2400-01-01")).as("defaultdate"),
when(col("table2.indicator").isNull, col("table1.indicator")).otherwise(when(col("table2.indicator") === "U", lit("I")).otherwise(col("table2.indicator"))).as("indicator"))
//filtering tempdf for duplication
val filtereddf = tempdf.filter(col("indicator") === "I")
.withColumn("lastupdatedate", col("defaultdate"))
.withColumn("defaultdate", lit("2400-01-01"))
.withColumn("indicator", lit("A"))
//finally merging both dataframes
tempdf.union(filtereddf).sort("prodid", "lastupdatedate").show(false)
which should give you
+------+--------------+-----------+---------+
|prodid|lastupdatedate|defaultdate|indicator|
+------+--------------+-----------+---------+
|1 |1900-01-01 |2400-01-01 |A |
|2 |1981-01-25 |2018-01-25 |I |
|2 |2018-01-25 |2400-01-01 |A |
|3 |1982-01-26 |2400-01-01 |A |
|4 |1985-12-20 |2018-01-25 |I |
|4 |2018-01-25 |2400-01-01 |A |
|6 |2018-01-25 |2400-01-01 |A |
|8 |2018-01-25 |2400-01-01 |A |
+------+--------------+-----------+---------+
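A side note (my addition, not part of the original answer): because the files are read with inferSchema, prodid is inferred as an integer and the leading zeros in "00001" are lost, which is why the result shows 1, 2, ... If you need to keep them, read the files without schema inference so every column stays a string, for example:
val dfAsStrings = spark.read
  .option("header", "true") // no inferSchema: all columns stay strings, so "00001" keeps its zeros
  .csv("d:/prod.txt")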

Cannot merge two DataFrames in Scala Spark

I've been trying to append one DataFrame to another in Scala Spark. The append operation in this case simply adds a new column of the same size alongside the existing column - no key matching is involved. Both DataFrames have the same shape (5 rows and 1 column).
scala> val coefficients = lrModel.coefficients.toArray.toSeq.toDF("coefficients")
coefficients: org.apache.spark.sql.DataFrame = [coefficients: double]
scala> coefficients.show()
+--------------------+
| coefficients|
+--------------------+
| -59525.0697785032|
| 6957.836000531959|
| 314.2998010755629|
|-0.37884289844065666|
| -1758.154438149325|
+--------------------+
scala> val tvalues = trainingSummary.tValues.toArray.drop(1).toSeq.toDF("t-values")
tvalues: org.apache.spark.sql.DataFrame = [t-values: double]
scala> tvalues.show()
+-------------------+
| t-values|
+-------------------+
| 1.8267249911295418|
| 100.35507390273406|
| -8.768588605222108|
|-0.4656738230173362|
| 10.48091833711012|
+-------------------+
The join() call runs and I can even print the schema, but when I try to display all values of the new DataFrame I get this error:
scala> val outputModelDF1 = coefficients.join(tvalues)
outputModelDF1: org.apache.spark.sql.DataFrame = [coefficients: double, t-values: double]
scala> outputModelDF1.printSchema()
root
|-- coefficients: double (nullable = false)
|-- t-values: double (nullable = false)
scala> outputModelDF1.show()
org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
Project [value#359 AS coefficients#361]
+- LocalRelation [value#359]
and
Project [value#368 AS t-values#370]
+- LocalRelation [value#368]
Join condition is missing or trivial.
Use the CROSS JOIN syntax to allow cartesian products between these relations.;
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1080)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts$$anonfun$apply$20.applyOrElse(Optimizer.scala:1077)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:267)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:266)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:306)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:304)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:272)
at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:256)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1077)
at org.apache.spark.sql.catalyst.optimizer.CheckCartesianProducts.apply(Optimizer.scala:1062)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:85)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:82)
at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:82)
at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:74)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:74)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:78)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:89)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:89)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2832)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2153)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2366)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:245)
at org.apache.spark.sql.Dataset.show(Dataset.scala:644)
at org.apache.spark.sql.Dataset.show(Dataset.scala:603)
at org.apache.spark.sql.Dataset.show(Dataset.scala:612)
... 52 elided
Any idea how to deal with it and how to simply merge these two DFs together?
UPDATE 1
I should have stated the desired format of the output that I want to achieve. Please see below:
+--------------------+--------------------+
| coefficients| t-values|
+--------------------+--------------------+
| -59525.0697785032| 1.8267249911295418|
| 6957.836000531959| 100.35507390273406|
| 314.2998010755629| -8.768588605222108|
|-0.37884289844065666| -0.4656738230173362|
| -1758.154438149325| 10.48091833711012|
+--------------------+--------------------+
UPDATE 2
Unfortunately, the following approach using withColumn() didn't work.
scala> val outputModelDF1 = coefficients.withColumn("t-values", tvalues("t-values"))
org.apache.spark.sql.AnalysisException: resolved attribute(s) t-values#119 missing from coefficients#113 in operator !Project [coefficients#113, t-values#119 AS t-values#130];;
!Project [coefficients#113, t-values#119 AS t-values#130]
+- Project [value#111 AS coefficients#113]
+- LocalRelation [value#111]
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:39)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:347)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:127)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:78)
at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:91)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:52)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:66)
at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$withPlan(Dataset.scala:2872)
at org.apache.spark.sql.Dataset.select(Dataset.scala:1153)
at org.apache.spark.sql.Dataset.withColumn(Dataset.scala:1908)
... 52 elided
One approach would be to create key columns in the dataframes for the join using monotonicallyIncreasingId:
val df1 = Seq(
(-59525.0697785032), (6957.836000531959), (314.2998010755629), (-0.37884289844065666), (-1758.154438149325)
).toDF("coefficients")
val df2 = Seq(
(1.8267249911295418), (100.35507390273406), (-8.768588605222108), (-0.4656738230173362), (10.48091833711012)
).toDF("t-values")
val df1R = df1.withColumn("rowid", monotonicallyIncreasingId)
val df2R = df2.withColumn("rowid", monotonicallyIncreasingId)
val dfJoined = df1R.join(df2R, Seq("rowid"))
dfJoined.show
+-----+--------------------+-------------------+
|rowid| coefficients| t-values|
+-----+--------------------+-------------------+
| 0| -59525.0697785032| 1.8267249911295418|
| 1| 6957.836000531959| 100.35507390273406|
| 2| 314.2998010755629| -8.768588605222108|
| 3|-0.37884289844065666|-0.4656738230173362|
| 4| -1758.154438149325| 10.48091833711012|
+-----+--------------------+-------------------+
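One caveat worth noting (my addition, not part of the answer above): monotonicallyIncreasingId only guarantees increasing, unique ids, not consecutive ones, and the values encode the partition id, so if the two DataFrames are partitioned differently the corresponding rows may not receive the same rowid. A sketch of an alternative that assigns consecutive indices with zipWithIndex:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Append a consecutive, 0-based "rowid" column to a DataFrame
def withRowIndex(df: DataFrame): DataFrame = {
  val rowsWithIndex = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
  val schema = StructType(df.schema.fields :+ StructField("rowid", LongType, nullable = false))
  spark.createDataFrame(rowsWithIndex, schema)
}

val dfJoined2 = withRowIndex(df1).join(withRowIndex(df2), Seq("rowid")).drop("rowid")
dfJoined2.show()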