Change data capture in Spark - Scala

I have got a requirement, but I am confused about how to do it.
I have two dataframes. The first time I got the data below:
file1
prodid, lastupdatedate, indicator
00001,,A
00002,01-25-1981,A
00003,01-26-1982,A
00004,12-20-1985,A
The output should be:
00001,1900-01-01,2400-01-01,A
00002,1981-01-25,2400-01-01,A
00003,1982-01-26,2400-01-01,A
00004,1985-12-20,2400-01-01,A
The second time I got another file (file2):
prodid, lastupdatedate, indicator
00002,01-25-2018,U
00004,01-25-2018,U
00006,01-25-2018,A
00008,01-25-2018,A
I want the end result to be like:
00001,1900-01-01,2400-01-01,A
00002,1981-01-25,2018-01-25,I
00002,2018-01-25,2400-01-01,A
00003,1982-01-26,2400-01-01,A
00004,1985-12-20,2018-01-25,I
00004,2018-01-25,2400-01-01,A
00006,2018-01-25,2400-01-01,A
00008,2018-01-25,2400-01-01,A
So whatever updates are in the second file, that date should go in the second column, the default date (2400-01-01) should go in the third column, along with the relevant indicator. The default indicator is A.
I have started like this:
val spark = SparkSession.builder()
  .master("local")
  .appName("creating data frame for csv")
  .getOrCreate()
import spark.implicits._

val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("d:/prod.txt")
val df1 = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("d:/prod1.txt")

val newdf = df.na.fill("01-01-1900", Seq("lastupdatedate"))
if ((df1("indicator") == 'U') && (df1("prodid") == newdf("prodid"))) {
  val df3 = df1.except(newdf)
}

You should join them on prodid and use the when function to manipulate the dataframes into the expected output. Then filter the updated rows, derive the second rows from them, and merge them back. (I have included comments explaining each part of the code.)
import org.apache.spark.sql.functions._

// fill empty lastupdatedate and change the date to the expected format
val newdf = df.na.fill("01-01-1900", Seq("lastupdatedate"))
  .withColumn("lastupdatedate", date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))

// change the date of the second dataframe to the expected format
val newdf1 = df1.withColumn("lastupdatedate", date_format(unix_timestamp(trim(col("lastupdatedate")), "MM-dd-yyyy").cast("timestamp"), "yyyy-MM-dd"))

// join both dataframes and update columns according to your needs
val tempdf = newdf.as("table1").join(newdf1.as("table2"), Seq("prodid"), "outer")
  .select(
    col("prodid"),
    when(col("table1.lastupdatedate").isNotNull, col("table1.lastupdatedate")).otherwise(col("table2.lastupdatedate")).as("lastupdatedate"),
    when(col("table1.indicator").isNotNull, when(col("table2.lastupdatedate").isNotNull, col("table2.lastupdatedate")).otherwise(lit("2400-01-01"))).otherwise(lit("2400-01-01")).as("defaultdate"),
    when(col("table2.indicator").isNull, col("table1.indicator")).otherwise(when(col("table2.indicator") === "U", lit("I")).otherwise(col("table2.indicator"))).as("indicator"))

// build the second ("A") rows for every updated product
val filtereddf = tempdf.filter(col("indicator") === "I")
  .withColumn("lastupdatedate", col("defaultdate"))
  .withColumn("defaultdate", lit("2400-01-01"))
  .withColumn("indicator", lit("A"))

// finally merge both dataframes
tempdf.union(filtereddf).sort("prodid", "lastupdatedate").show(false)
which should give you
+------+--------------+-----------+---------+
|prodid|lastupdatedate|defaultdate|indicator|
+------+--------------+-----------+---------+
|1 |1900-01-01 |2400-01-01 |A |
|2 |1981-01-25 |2018-01-25 |I |
|2 |2018-01-25 |2400-01-01 |A |
|3 |1982-01-26 |2400-01-01 |A |
|4 |1985-12-20 |2018-01-25 |I |
|4 |2018-01-25 |2400-01-01 |A |
|6 |2018-01-25 |2400-01-01 |A |
|8 |2018-01-25 |2400-01-01 |A |
+------+--------------+-----------+---------+
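If the merged result needs to serve as the baseline for the next incoming delta file, a minimal sketch of writing it back out could look like this (the output path is an assumption, not from the question):
// Persist the merged snapshot so the next delta file can be applied against it;
// the path "d:/prod_merged" is only an example.
val mergeddf = tempdf.union(filtereddf).sort("prodid", "lastupdatedate")
mergeddf.write
  .option("header", "true")
  .mode("overwrite")
  .csv("d:/prod_merged")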

Spark Structured Streaming joins csv file stream and rate stream too much time per batch

I have rate and CSV file streams, joining on the rate value and the CSV file id:
def readFromCSVFile(path: String)(implicit spark: SparkSession): DataFrame = {
  val schema = StructType(
    StructField("id", LongType, nullable = false) ::
    StructField("value1", LongType, nullable = false) ::
    StructField("another", DoubleType, nullable = false) :: Nil)
  val spark: SparkSession = SparkSession
    .builder
    .master("local[1]")
    .config(new SparkConf().setIfMissing("spark.master", "local[1]")
      .set("spark.eventLog.dir", "file:///tmp/spark-events"))
    .getOrCreate()
  spark
    .readStream
    .format("csv")
    .option("header", value = true)
    .schema(schema)
    .option("delimiter", ",")
    .option("maxFilesPerTrigger", 1)
    //.option("inferSchema", value = true)
    .load(path)
}
val rate = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1)
  .option("numPartitions", 10)
  .load()
  .withWatermark("timestamp", "1 seconds")

val cvsStream = readFromCSVFile(tmpPath.toString)

val cvsStream2 = cvsStream.as("csv").join(rate.as("counter"))
  .where("csv.id == counter.value")
  .withWatermark("timestamp", "1 seconds")

cvsStream2
  .writeStream
  .trigger(Trigger.ProcessingTime(10))
  .format("console")
  .option("truncate", "false")
  .queryName("kafkaDataGenerator")
  .start().awaitTermination(300000)
The CSV file is 6 lines long, but processing one batch takes about 100 s:
2021-10-15 23:21:29 WARN ProcessingTimeExecutor:69 - Current batch is falling behind. The trigger interval is 10 milliseconds, but spent 92217 milliseconds
-------------------------------------------
Batch: 1
-------------------------------------------
+---+------+-------+-----------------------+-----+
|id |value1|another|timestamp |value|
+---+------+-------+-----------------------+-----+
|6 |2 |3.0 |2021-10-15 20:20:02.507|6 |
|5 |2 |2.0 |2021-10-15 20:20:01.507|5 |
|1 |1 |1.0 |2021-10-15 20:19:57.507|1 |
|3 |1 |3.0 |2021-10-15 20:19:59.507|3 |
|2 |1 |2.0 |2021-10-15 20:19:58.507|2 |
|4 |2 |1.0 |2021-10-15 20:20:00.507|4 |
+---+------+-------+-----------------------+-----+
How can I optimize the join operation to process this batch faster? It shouldn't require so much computation, so it looks like there is some kind of hidden watermarking or something else making the batch wait for about 100 s. What kind of options/properties can be applied?
I would suggest that you don't have enough data to look into performance yet. Why don't you crank the data up to 500,000 and see if you have an issue? Right now I'm concerned that you aren't running enough data to exercise the performance of your system effectively and the startup costs aren't being appropriately amortized by the volume of data.
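For instance, a quick way to generate a larger test CSV; the row count, column values, and output path here are just assumptions for benchmarking:
// Generate ~500,000 rows matching the csv schema used above and write them out,
// so the stream has enough data to amortize startup costs.
import spark.implicits._

(1L to 500000L).map(i => (i, i % 10, i.toDouble / 3)).toDF("id", "value1", "another")
  .repartition(10)
  .write
  .option("header", "true")
  .mode("overwrite")
  .csv("/tmp/benchmark_csv")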
What dramatically improved the performance was using spark.read instead of spark.readStream, as below, and persisting the DataFrame in memory:
import org.apache.spark.storage.StorageLevel

val dataFrameToBeReturned = spark.read
  .format("csv")
  .schema(schema)
  .option("delimiter", ";")
  .option("maxFilesPerTrigger", 1)
  .csv("hdfs://" + hdfsLocation + homeZeppelinPrefix + "/generator/" + shortPath)
  .persist(StorageLevel.MEMORY_ONLY_SER)
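For reference, a minimal sketch of how the persisted static frame might then be joined with the rate stream (a stream-static inner join); the column names follow the snippets above, everything else is an assumption:
// The cached batch DataFrame is joined to the streaming rate source;
// no watermark is needed on the static side of a stream-static inner join.
val joined = rate.join(dataFrameToBeReturned, rate("value") === dataFrameToBeReturned("id"))

joined.writeStream
  .format("console")
  .option("truncate", "false")
  .start()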

How to create a dataframe from Array[Strings]?

I used rdd.collect() to create an Array and now I want to use this Array[Strings] to create a DataFrame. My test file is in the following format (separated by a pipe |).
TimeStamp
IdC
Name
FileName
Start-0f-fields
column01
column02
column03
column04
column05
column06
column07
column08
column010
column11
End-of-fields
Start-of-data
G0002B|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002A|0|13|IS|LS|Xys|Xyz|12|23|45|
G0002x|0|13|IS|LS|Xys|Xyz|12|23|48|
G0002C|0|13|IS|LS|Xys|Xyz|12|23|48|
End-of-data
document
The column names are between Start-of-fields and End-of-fields.
I want to store the pipe-separated ("|") values in different columns of a DataFrame,
like the example below:
column01 column02 column03 column04 column05 column06 column07 column08 column010 column11
G0002C 0 13 IS LS Xys Xyz 12 23 48
G0002x 0 13 LS MS Xys Xyz 14 300 400
My code:
val rdd = sc.textFile("the above text file")
val columns = rdd.collect.slice(5,16).mkString(",") // it will hold columnnames
val data = rdd.collect.slice(5,16)
val rdd1 = sc.parallelize(rdd.collect())
val df = rdd1.toDf(columns)
but this is not giving me the desired dataframe shown above.
Could you try this?
import spark.implicits._ // Add to use `toDS()` and `toDF()`
val rdd = sc.textFile("the above text file")
val columns = rdd.collect.slice(5, 15) // the column names; `.mkString(",")` is not needed
val dataDS = rdd.collect.slice(17, 21) // the data lines between Start-of-data and End-of-data
.map(_.trim()) // to remove whitespaces
.map(s => s.substring(0, s.length - 1)) // to remove last pipe '|'
.toSeq
.toDS
val df = spark.read
.option("header", false)
.option("delimiter", "|")
.csv(dataDS)
.toDF(columns: _*)
df.show(false)
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|column01|column02|column03|column04|column05|column06|column07|column08|column010|column11|
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
|G0002B |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002A |0 |13 |IS |LS |Xys |Xyz |12 |23 |45 |
|G0002x |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
|G0002C |0 |13 |IS |LS |Xys |Xyz |12 |23 |48 |
+--------+--------+--------+--------+--------+--------+--------+--------+---------+--------+
Calling the spark.read...csv() method without a schema can take a long time on huge data because of schema inference.
In that case, you can specify the schema like below.
/*
column01 STRING,
column02 STRING,
column03 STRING,
...
*/
val schema = columns
.map(c => s"$c STRING")
.mkString(",\n")
val df = spark.read
.option("header", false)
.option("delimiter", "|")
.schema(schema) // no schema inference occurs
.csv(dataDS)
// .toDF(columns: _*) => unnecessary when schema is specified
If the number of columns and the column names are fixed, then you can do it as below:
val columns = rdd.collect.slice(5,15).mkString(",") // it will hold columnnames
val data = rdd.collect.slice(17,21)
val d = data.mkString("\n").split('\n').toSeq.toDF()
import org.apache.spark.sql.functions._
val dd = d
  .withColumn("columnX", split($"value", "\\|"))
  .withColumn("column1", $"columnX".getItem(0))
  .withColumn("column2", $"columnX".getItem(1))
  .withColumn("column3", $"columnX".getItem(2))
  .withColumn("column4", $"columnX".getItem(3))
  .withColumn("column5", $"columnX".getItem(4))
  .withColumn("column6", $"columnX".getItem(5))
  .withColumn("column7", $"columnX".getItem(6))
  .withColumn("column8", $"columnX".getItem(7))
  .withColumn("column10", $"columnX".getItem(8))
  .withColumn("column11", $"columnX".getItem(9))
  .drop("columnX", "value")
display(dd)
The output then has each pipe-separated value in its own column.

Add new rows in the Spark DataFrame using scala

I have a dataframe like:
Name_Index City_Index
2.0 1.0
0.0 2.0
1.0 0.0
I have a new list of values:
List(1.0, 1.0)
I want to add these values as a new row in the dataframe, with all previous rows dropped.
My code:
val spark = SparkSession.builder
.master("local[*]")
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.getOrCreate()
var data = spark.read.option("header", "true")
.option("inferSchema", "true")
.csv("src/main/resources/student.csv")
val someDF = Seq(
(1.0,1.0)
).toDF("Name_Index","City_Index")
data=data.union(someDF).show()
It shows output like:
Name_Index City_Index
2.0 1.0
0.0 2.0
1.0 0.0
1.0 1.0
But the output should be like this, so that all the previous rows are dropped and the new values are added:
Name_Index City_Index
1.0 1.0
You can achieve this using the limit & union functions. Check below.
scala> val df = Seq((2.0,1.0),(0.0,2.0),(1.0,0.0)).toDF("name_index","city_index")
df: org.apache.spark.sql.DataFrame = [name_index: double, city_index: double]
scala> df.show(false)
+----------+----------+
|name_index|city_index|
+----------+----------+
|2.0 |1.0 |
|0.0 |2.0 |
|1.0 |0.0 |
+----------+----------+
scala> val ndf = Seq((1.0,1.0)).toDF("name_index","city_index")
ndf: org.apache.spark.sql.DataFrame = [name_index: double, city_index: double]
scala> ndf.show
+----------+----------+
|name_index|city_index|
+----------+----------+
| 1.0| 1.0|
+----------+----------+
scala> df.limit(0).union(ndf).show(false) // this is not a good approach; you can directly call ndf.show
+----------+----------+
|name_index|city_index|
+----------+----------+
|1.0 |1.0 |
+----------+----------+
Change the last line to:
data = data.except(data).union(someDF)
data.show()
You could try this approach:
data = data.filter(_ => false).union(someDF)
output
+----------+----------+
|Name_Index|City_Index|
+----------+----------+
|1.0 |1.0 |
+----------+----------+
I hope it gives you some insights.
Regards.
As far as I can see, you only need the list of columns from the source DataFrame.
If your sequence has the same column order as the source DataFrame, you can re-use the schema without actually querying the source DataFrame. Performance-wise, it will be faster.
val srcDf = Seq((2.0,1.0),(0.0,2.0),(1.0,0.0)).toDF("name_index","city_index")
val dstDf = Seq((1.0, 1.0)).toDF(srcDf.columns:_*)
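If you also want to reuse the column types rather than just the names, a sketch that builds on srcDf.schema directly (assuming the new rows match that schema) could look like this:
import org.apache.spark.sql.Row

// Build the new DataFrame against srcDf's schema without scanning srcDf's data
val dstDf2 = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1.0, 1.0))),
  srcDf.schema)
dstDf2.show(false)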


How to load and process multiple csv files from a DBFS directory with Spark

I want to run the following code on each file that I read from DBFS (Databricks FileSystem). I tested it on all files that are in a folder, but I want to make similar calculations for each file in the folder, one by one:
// a-e are calculated fields
val df2=Seq(("total",a,b,c,d,e)).toDF("file","total","count1","count2","count3","count4")
//schema is now an empty dataframe
val final1 = schema.union(df2)
Is that possible? I guess reading it from dbfs should be done differently as well, from what I do now:
val df1 = spark
  .read
  .format("csv")
  .option("header", "true")
  .option("delimiter", ",")
  .option("inferSchema", "true")
  .load("dbfs:/Reports/*.csv")
  .select("lot of ids")
Thank you a lot in advance for the ideas :)
As discussed, you have 3 options here.
In my example I am using the next 3 datasets:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1 |100 |200 |
|2 |300 |400 |
+----+----+----+
+----+----+----+
|col1|col2|col3|
+----+----+----+
|3 |60 |80 |
|4 |12 |100 |
|5 |20 |10 |
+----+----+----+
+----+----+----+
|col1|col2|col3|
+----+----+----+
|7 |20 |40 |
|8 |30 |40 |
+----+----+----+
First you create your schema (it is faster to define the schema explicitly instead of inferring it):
import org.apache.spark.sql.types._

val df_schema =
  StructType(
    List(
      StructField("col1", IntegerType, true),
      StructField("col2", IntegerType, true),
      StructField("col3", IntegerType, true)))
Option 1:
Load all CSVs at once with:
val df1 = spark
  .read
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .schema(df_schema)
  .csv("file:///C:/data/*.csv")
Then apply your logic to the whole dataset grouping by the file name.
Precondition: you must find a way to append the file name to each row (see the sketch below).
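A minimal sketch of that, using input_file_name() plus a grouped aggregation; the count/sum aggregations are placeholders, not your actual calculations:
import org.apache.spark.sql.functions._

// Tag every row with the file it came from, then run the per-file calculations
// as a grouped aggregation; the aggregates below are only examples.
val perFile = df1
  .withColumn("file", input_file_name())
  .groupBy("file")
  .agg(
    count(lit(1)).as("total"),
    sum("col2").as("count1"))
perFile.show(false)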
Option 2:
Load the csv files from the directory, then iterate over the files and create a dataframe for each csv. Inside the loop, apply your logic to each csv. Finally, at the end of each iteration, append (union) the result into a 2nd dataframe which will store your accumulated results.
Attention: please be aware that a large number of files might cause a very big DAG and subsequently a huge execution plan. To avoid this you can persist the intermediate results or call collect. In the example below I assumed that persist or collect gets executed every bufferSize iterations. You can adjust or even remove this logic according to the number of csv files.
This is a sample code for the 2nd option:
import java.io.File
import org.apache.spark.sql.Row
import spark.implicits._

val dir = "C:\\data_csv\\"
val csvFiles = new File(dir).listFiles.filter(_.getName.endsWith(".csv"))
val bufferSize = 10
var indx = 0

// create an empty df which will hold the accumulated results
var bigDf = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], df_schema)

csvFiles.foreach{ path =>
  var tmp_df = spark
    .read
    .option("header", "false")
    .option("delimiter", ",")
    .option("inferSchema", "false")
    .schema(df_schema)
    .csv(path.getPath)

  // execute your custom logic/calculations with tmp_df

  if ((indx + 1) % bufferSize == 0) {
    // If the buffer size is reached then
    // 1. call bigDf.persist() or bigDf.collect()
    // 2. if you use collect(), load the results into bigDf again
  }

  bigDf = bigDf.union(tmp_df)
  indx = indx + 1
}

bigDf.show(false)
This should output:
+----+----+----+
|col1|col2|col3|
+----+----+----+
|1 |100 |200 |
|2 |300 |400 |
|3 |60 |80 |
|4 |12 |100 |
|5 |20 |10 |
|7 |20 |40 |
|8 |30 |40 |
+----+----+----+
Option 3:
The last option is to use the built-in spark.sparkContext.wholeTextFiles.
This is the code to load all csv files into a RDD:
val data = spark.sparkContext.wholeTextFiles("file:///C:/data_csv/*.csv")
val df = spark.createDataFrame(data)
df.show(false)
And the output:
+--------------------------+--------------------------+
|_1 |_2 |
+--------------------------+--------------------------+
|file:/C:/data_csv/csv1.csv|1,100,200 |
| |2,300,400 |
|file:/C:/data_csv/csv2.csv|3,60,80 |
| |4,12,100 |
| |5,20,10 |
|file:/C:/data_csv/csv3.csv|7,20,40 |
| |8,30,40 |
+--------------------------+--------------------------+
spark.sparkContext.wholeTextFiles returns a key/value RDD in which the key is the file path and the value is the file content.
This requires extra code to extract the content of _2, which is the content of each csv. In my opinion this would add overhead regarding the performance and the maintainability of the program, therefore I would avoid it.
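For completeness, a rough sketch (not part of the original answer) of that extra parsing step, splitting each file's content into lines and then into typed columns matching df_schema:
import spark.implicits._

// Explode each (path, content) pair into its lines, then split each line into columns
val parsedDf = data
  .flatMap { case (_, content) =>
    content.split("\n").map(_.trim).filter(_.nonEmpty)
  }
  .map(_.split(","))
  .map(a => (a(0).toInt, a(1).toInt, a(2).toInt))
  .toDF("col1", "col2", "col3")

parsedDf.show(false)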
Let me know if you need further clarifications
I am adding to the answer that @Alexandros Biratsis provided.
One can use the first approach as below by appending the file name as a separate column within the same dataframe that holds the data from all the files.
val df1 = spark
  .read
  .option("header", "false")
  .option("delimiter", ",")
  .option("inferSchema", "false")
  .schema(df_schema)
  .csv("file:///C:/data/*.csv")
  .withColumn("FileName", input_file_name())
Here input_file_name() is a function that adds the name of the source file to each row within the DataFrame. It is a built-in function in Spark.
To use this function you need the import below:
import org.apache.spark.sql.functions._
One can find the documentation for the function at https://spark.apache.org/docs/1.6.0/api/java/org/apache/spark/sql/functions.html
I would advise against the second approach suggested by @Alexandros Biratsis of unioning and persisting the temporary dataframes. It works for a small number of files, but as the number of files grows it becomes too slow, sometimes it times out, and the driver shuts down unexpectedly.
I would like to thank Alexandros for the answer, as it gave me a way to move forward with the problem.

Spark Dataframe - Push a particular Row to the last in a Dataframe

I have been trying to push a particular row in a Spark DataFrame to the end of the DataFrame.
This is what I have tried so far.
Input Dataframe:
+-------------+-------+------------+
|expected_date|count |Downstream |
+-------------+-------+------------+
|2018-08-26 |1 |abc |
|2018-08-26 |6 |Grand Total |
|2018-08-26 |3 |xyy |
|2018-08-26 |2 |xxx |
+-------------+-------+------------+
Code:
df.withColumn("Downstream_Hierarchy", when(col("Downstream") === "Grand Total", 2)
.otherwise(1))
.orderBy(col("Downstream_Hierarchy").asc)
.drop("Downstream_Hierarchy")
Output Dataframe:
+-------------+-------+------------+
|expected_date|count |Downstream |
+-------------+-------+------------+
|2018-08-26 |1 |abc |
|2018-08-26 |3 |xyy |
|2018-08-26 |2 |xxx |
|2018-08-26 |6 |Grand Total |
+-------------+-------+------------+
Is there a simpler way to do this?
Going through your comments: since the end result is needed in HDFS, you can write it as csv to HDFS twice.
The 1st time, write the dataframe to HDFS without the "Grand Total" row.
The 2nd time, write the "Grand Total" row alone with save mode "append" (a sketch of the two writes follows the code below).
DataFrame except the required row:
val df1 = df.filter(col("Downstream") =!= "Grand Total")
DataFrame with the required row:
val df2 = df.filter(col("Downstream") === "Grand Total")
Required DataFrame:
val df_final = df1.union(df2)
Might not be the best solution, but it avoids the expensive orderBy operation.
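A rough sketch of the two writes for the HDFS case, reusing df1 and df2 from above (the output path is an assumption):
// Write everything except "Grand Total" first, then append the "Grand Total" row;
// the appended row lands in its own part file, written last. The path is only an example.
val outPath = "hdfs:///data/output/report_csv"

df1.write.mode("overwrite").csv(outPath)
df2.write.mode("append").csv(outPath)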
You can try the straightforward steps below.
val lastRowDf = df.filter("Downstream='Grand Total'")
val remainDf = df.filter("Downstream !='Grand Total'")
remainDf.union(lastRowDf).show