Make “Partitioned” Staging Committer work on Databricks - scala

Currently, I have a working Spark ETL application that reads data from an S3 bucket, applies transformations, and writes the results back to another S3 bucket in Parquet format. Depending on the input data volume, the job can take up to a few hours to complete.
The actual processing takes ~20-25 minutes; for the rest of the time the job waits while the output files are copied within S3.
I want to use the modern Partitioned Staging Committer (https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/committers.html#The_.E2.80.9CPartitioned.E2.80.9D_Staging_Committer) to speed up the job and avoid wasting cluster time on the S3 copy/rename phase. I've implemented a simple POC job:
import org.apache.spark.sql.{SaveMode, SparkSession}
import scala.io.Source

// scalastyle:off
object PartitionedStagingCommitterPoc {

  private val spark = SparkSession.builder()
    .appName("PartitionedStagingCommitterPoc")
    .config("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.mergeSchema", "false")
    .config("spark.sql.parquet.filterPushdown", "true")
    .config("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .config("spark.sql.hive.metastorePartitionPruning", "true")
    .config("spark.hadoop.parquet.enable.summary-metadata", "false")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
    .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", "true")
    .config("spark.hadoop.mapreduce.fileoutputcommitter.marksuccessfuljobs", "false")
    .config("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
    .config("spark.hadoop.fs.s3a.committer.magic.enabled", "false")
    .config("spark.hadoop.fs.s3a.committer.staging.unique-filenames", "true")
    .config("spark.hadoop.fs.s3a.committer.staging.abort.pending.uploads", "true")
    .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "replace")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    import spark.implicits._

    // Load the sample CSV that is bundled as a resource in the uber jar
    val resourceStream = Thread.currentThread().getContextClassLoader.getResourceAsStream("input.csv")
    val linesDataset = Source.fromInputStream(resourceStream).getLines().toSeq.toDS()

    val dataFrame = spark.read
      .option("header", value = true)
      .option("inferSchema", value = true)
      .csv(linesDataset)

    dataFrame.write
      .mode(SaveMode.Append)
      .partitionBy("date", "country_code")
      .parquet("s3a://my-sandbox/data")
  }
}
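For reference, the committer glue classes above (PathOutputCommitProtocol and BindingParquetOutputCommitter) come, as far as I know, from Spark's hadoop-cloud module, so the uber jar bundles roughly the following dependencies (a sketch; the version numbers are placeholders and have to match the cluster's Spark and Hadoop):
// build.sbt fragment (sketch; artifact versions are placeholders)
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"          % "3.x.y" % "provided",
  "org.apache.spark" %% "spark-hadoop-cloud" % "3.x.y",  // PathOutputCommitProtocol, BindingParquetOutputCommitter
  "org.apache.hadoop" %  "hadoop-aws"        % "3.x.y"   // S3AFileSystem and the S3A committers
)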
My POC application works fine when I run it on my local workstation, but it fails with an exception when I run the same code on Databricks as a spark-submit job with an uber jar:
Caused by: org.apache.hadoop.fs.s3a.commit.PathCommitException: `s3a://my-sandbox/data': Filesystem not supported by this committer
at org.apache.hadoop.fs.s3a.commit.AbstractS3ACommitterFactory.createOutputCommitter(AbstractS3ACommitterFactory.java:51)
at org.apache.hadoop.mapreduce.lib.output.BindingPathOutputCommitter.<init>(BindingPathOutputCommitter.java:87)
at org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter.<init>(BindingParquetOutputCommitter.scala:44)
... 76 more
Unfortunately, the exception message is not descriptive, and I cannot work out what's wrong. Since the same jar works without issues when I submit it to my local Spark installation, I assume that some implicit pre-configuration in the Databricks runtime causes the problem. Any help would be much appreciated.
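While debugging, my plan is to check which FileSystem implementation the Databricks runtime actually binds to the s3a scheme, since (as far as I can tell from the stack trace) the S3A committer factory rejects anything that is not a plain S3AFileSystem. A minimal diagnostic sketch, reusing my bucket path:
// Hedged diagnostic sketch: print the FileSystem class behind the s3a scheme.
// My assumption is that the committer factory only accepts S3AFileSystem here.
import java.net.URI
import org.apache.hadoop.fs.FileSystem

val fs = FileSystem.get(new URI("s3a://my-sandbox/data"), spark.sparkContext.hadoopConfiguration)
println(s"s3a is backed by: ${fs.getClass.getName}")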

Related

.csv not a SequenceFile error on Select Hive Query

I am quite a newbie to Spark and Scala ;)
Code summary: read data from CSV files --> perform a simple inner join on the 2 files --> write the result to a Hive table --> submit the job on the cluster.
Can you please help me identify what went wrong? The code is not really complex, and the job itself runs fine on the cluster.
However, when I try to query the data written to the Hive table, I run into the following issue:
hive> select * from Customers limit 10;
Failed with exception java.io.IOException:java.io.IOException: hdfs://m01.itversity.com:9000/user/itv000666/warehouse/updatedcustomers.db/customers/part-00000-348a54cf-aa0c-45b4-ac49-3a881ae39702_00000.c000 .csv not a SequenceFile
import org.apache.log4j.Logger
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

object LapeyreSparkDemo extends App {

  // Getting Spark ready
  val sparkConf = new SparkConf()
  sparkConf.set("spark.app.name", "Spark for Lapeyre")

  // Creating the Spark session
  val spark = SparkSession.builder()
    .config(sparkConf)
    .enableHiveSupport()
    .config("spark.sql.warehouse.dir", "/user/itv000666/warehouse")
    .getOrCreate()
  Logger.getLogger(getClass.getName).info("Spark Session Created Successfully")

  // Reading
  Logger.getLogger(getClass.getName).info("Data loading in DF started")
  val ordersSchema = "orderid Int, customerName String, orderDate String, custId Int, orderStatus String, age String, amount Int"

  val orders2019Df = spark.read
    .format("csv")
    .option("header", true)
    .schema(ordersSchema)
    .option("path", "/user/itv0006666/lapeyrePoc/orders2019.csv")
    .load

  val newOrder = orders2019Df.withColumnRenamed("custId", "oldCustId")
    .withColumnRenamed("customername", "oldCustomerName")

  val orders2020Df = spark.read
    .format("csv")
    .option("header", true)
    .schema(ordersSchema)
    .option("path", "/user/itv000666/lapeyrePoc/orders2020.csv")
    .load
  Logger.getLogger(getClass.getName).info("Data loading in DF complete")

  // Processing
  Logger.getLogger(getClass.getName).info("Processing Started")
  val joinCondition = newOrder.col("oldCustId") === orders2020Df.col("custId")
  val joinType = "inner"
  val joinData = newOrder.join(orders2020Df, joinCondition, joinType)
    .select("custId", "customername")

  // Writing
  spark.sql("create database if not exists updatedCustomers")
  joinData.write
    .format("csv")
    .mode(SaveMode.Overwrite)
    .bucketBy(4, "custId")
    .sortBy("custId")
    .saveAsTable("updatedCustomers.Customers")

  // Stopping the Spark session
  spark.stop()
}
Please let me know in case more information is required.
Thanks in advance.
This is the culprit:
joinData.write
  .format("csv")
Use this instead, and it works:
joinData.write
  .format("hive")
Since the data is written to a Hive table (stored as ORC), the format should be "hive" and not "csv".
Also, do not forget to enable Hive support while creating the Spark session.
Note that in Spark 2, bucketBy and sortBy are not supported on this write path; maybe they are in Spark 3.
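For clarity, a minimal sketch of the corrected write (the table name comes from the question; the ORC storage format is assumed from the existing Hive table definition):
// Hedged sketch of the fix: the "hive" data source respects the table's own
// storage format (ORC here); bucketBy/sortBy are dropped since they are not
// supported on this write path in Spark 2.
joinData.write
  .format("hive")
  .mode(SaveMode.Overwrite)
  .saveAsTable("updatedCustomers.Customers")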

Regarding org.apache.spark.sql.AnalysisException error when creating a jar file using Scala

I have the following simple Scala class, which I will later modify to fit some machine learning models.
I need to create a jar file out of it, as I am going to run these models on amazon-emr. I am a beginner in this process, so I first tested whether I can successfully import the following CSV file and write it to another file by building a jar from the Scala class mentioned below.
The CSV file looks like this, and it includes a Date column as one of the variables.
+-------------------+-------------+-------+---------+-----+
| Date| x1 | y | x2 | x3 |
+-------------------+-------------+-------+---------+-----+
|0010-01-01 00:00:00|0.099636562E8|6405.29| 57.06|21.55|
|0010-03-31 00:00:00|0.016645123E8|5885.41| 53.54|21.89|
|0010-03-30 00:00:00|0.044308936E8|6260.95|57.080002|20.93|
|0010-03-27 00:00:00|0.124928214E8|6698.46|65.540001|23.44|
|0010-03-26 00:00:00|0.570222885E7|6768.49| 61.0|24.65|
|0010-03-25 00:00:00|0.086162414E8|6502.16|63.950001|25.24|
Data set link : https://drive.google.com/open?id=18E6nf4_lK46kl_zwYJ1CIuBOTPMriGgE
I created a jar file out of this using IntelliJ IDEA, and it was built successfully.
import org.apache.spark.sql.SparkSession

object jar1 {

  def main(args: Array[String]): Unit = {

    val sc: SparkSession = SparkSession.builder()
      .appName("SparkByExample")
      .getOrCreate()

    val data = sc.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load(args(0))

    data.write.format("text").save(args(1))
  }
}
After that, I uploaded this jar file along with the CSV file mentioned above to amazon-s3 and tried to run it on an amazon-emr cluster.
But it failed, and I got the following error message:
ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: Text data source does not support timestamp data type.;
I am sure this error has something to do with the Date variable in the data set, but I don't know how to fix it.
Can anyone help me figure this out?
Update:
I tried the same CSV file mentioned earlier without the date column. In this case I am getting this error:
ERROR Client: Application diagnostics message: User class threw exception: org.apache.spark.sql.AnalysisException: Text data source does not support double data type.;
Thank you
As I noticed later, you are trying to write to a text file. Spark's .format("text") doesn't support any type other than String/Text, so to achieve your goal you first need to convert all the columns to String and then store them:
df.rdd.map(_.toString().replace("[","").replace("]", "")).saveAsTextFile("textfilename")
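An equivalent DataFrame-based sketch (my assumption of what you want: every column cast to String and joined into a single text column, since the text source accepts only one string column):
// Hedged sketch: cast every column to String and collapse each row into one
// comma-separated value before writing with the text data source.
import org.apache.spark.sql.functions.{col, concat_ws}

val asText = df.select(concat_ws(",", df.columns.map(c => col(c).cast("string")): _*).as("value"))
asText.write.text("textfilename")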
If possible, you could consider other file-based options to store the data, which let you keep the benefit of typed columns, for example CSV or JSON.
Here is a working code example for CSV, based on your csv file:
val spark = SparkSession.builder
  .appName("Simple Application")
  .config("spark.master", "local")
  .getOrCreate()

import spark.implicits._
import spark.sqlContext.implicits._

val df = spark.read
  .format("csv")
  .option("delimiter", ",")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .load("datat.csv")

df.printSchema()
df.show()

df.write
  .format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("delimiter", "\t")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .option("escape", "\\")
  .save("another")
There is no need for a custom encoder/decoder.

Spark Listener execute hook on onJobComplete on Executors?

I have a simple Spark job which reads CSV data from S3, transforms it, partitions it by date and saves it to the local file system.
I have a CSV file on S3 with the content below.
sample input: japan, 01-01-2020, weather, provider, device
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}

case class WeatherReport(country: String, date: String, event: String, provider: String, device: String)

object SampleSpark extends App {

  val conf = new SparkConf()
    .setAppName("processing")
    .setIfMissing("spark.master", "local[*]")
    .setIfMissing("spark.driver.host", "localhost")

  // a SparkSession is needed for toDF(); build it from the existing conf
  val spark = SparkSession.builder().config(conf).getOrCreate()
  import spark.implicits._

  val baseRdd = spark.sparkContext.textFile("s3a://mybucket/sample/*.csv")

  val weatherDataFrame = baseRdd
    .filter(_.trim.nonEmpty)
    .map { line =>
      val f = line.split(",").map(_.trim)
      WeatherReport(f(0), f(1), f(2), f(3), f(4))
    }
    .toDF()

  weatherDataFrame.write.partitionBy("date")
    .mode(SaveMode.Append)
    .format("com.databricks.spark.csv")
    .save("outputDirectory")
}
The files get saved under "outputDirectory/date=01-01-2020/part-" with more than one part file.
I want to merge the part files, remove the "date=" prefix so the layout looks like "outputDirectory/01-01-2020/output.csv", and copy the result to S3.
How can I do this?
I thought of using a SparkListener like the one below, but I guess it will only run on the driver, whereas the files would be present on the executors.
sparkContext.addListener(new SparkListener {
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = {
    renameDirectory()
    mergePartFilesToSingleFiles()
    uploadFileToS3()
  }
})
Is there a way to run a post-job-completion hook on the executors and the driver that would sync all of their local files to S3?
You can run post-execution hooks on executors by registering a TaskCompletionListener:
// call this from the code that runs on the executor, such as your WeatherReport mapper
val taskContext = TaskContext.get
taskContext.addTaskCompletionListener(customTaskCompletionListener)
Reference:
https://spark.apache.org/docs/latest/api/java/org/apache/spark/TaskContext.html#addTaskCompletionListener-scala.Function1-
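For illustration, a minimal sketch of registering the listener from inside a partition-level operation so it runs on the executor (my assumption of the wiring, reusing baseRdd from the question; the typed addTaskCompletionListener overload assumes Spark 2.4+):
// Hedged sketch: the listener registered inside mapPartitions runs on the
// executor when the task that processed this partition finishes.
// Note that mapPartitions is lazy, so the listener is only registered once
// the partition is actually computed by an action.
import org.apache.spark.TaskContext

val processed = baseRdd.mapPartitions { iter =>
  TaskContext.get().addTaskCompletionListener[Unit] { _ =>
    // placeholder: e.g. sync this executor's local output files to S3
  }
  iter
}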

Spark: java.io.FileNotFoundException: File does not exist in copyMerge

I am trying to merge all the Spark output part files in a directory and create a single file, in Scala.
Here is my code:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.functions.regexp_extract

def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  // the "true" flag deletes the source files once they are merged into the new output
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), true, hadoopConfig, null)
}
And then, as the last step, I am writing the data frame output like below.
dfMainOutputFinalWithoutNull.repartition(10).write.partitionBy("DataPartition", "StatementTypeCode")
  .format("csv")
  .option("nullValue", "")
  .option("header", "true")
  .option("codec", "gzip")
  .mode("overwrite")
  .save(outputfile)

merge(mergeFindGlob, mergedFileName)

dfMainOutputFinalWithoutNull.unpersist()
When I run this, I get the exception below:
java.io.FileNotFoundException: File does not exist: hdfs:/user/zeppelin/FinancialLineItem/temp_FinancialLineItem
at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
This is how my output currently looks. Instead of a folder of part files, I want to merge all the files inside the folder and create a single file.
There is a copyMerge API in Hadoop 2:
https://hadoop.apache.org/docs/r2.7.1/api/src-html/org/apache/hadoop/fs/FileUtil.html#line.382
Unfortunately, it is deprecated and will be removed in Hadoop 3.0.
Here's a re-implementation of copyMerge (in PySpark, though) that I had to write, as we couldn't find a better solution:
https://github.com/Tagar/stuff/blob/master/copyMerge.py
Hope it helps somebody else too.
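For a Scala/Hadoop-API variant, here is a rough sketch of the same idea (a sketch only; part-file ordering, filtering and error handling are assumptions you may want to tighten):
// Hedged sketch of a copyMerge replacement for Hadoop 3, where
// FileUtil.copyMerge no longer exists: concatenate every part file
// under srcDir into a single destination file.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.IOUtils

def copyMergeCompat(fs: FileSystem, srcDir: Path, dstFile: Path,
                    deleteSource: Boolean, conf: Configuration): Unit = {
  val out = fs.create(dstFile)
  try {
    fs.listStatus(srcDir)
      .filter(_.isFile)
      .sortBy(_.getPath.getName)   // keep part files in name order
      .foreach { status =>
        val in = fs.open(status.getPath)
        try IOUtils.copyBytes(in, out, conf, false)
        finally in.close()
      }
  } finally {
    out.close()
  }
  if (deleteSource) fs.delete(srcDir, true)
}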

Using Scala and SparkSql and importing CSV file with header [duplicate]

This question already has answers here:
Spark - load CSV file as DataFrame?
I'm very new to Spark and Scala (like two hours new). I'm trying to play with a CSV data file, but I cannot do it as I'm not sure how to deal with the header row. I have searched the internet for a way to load it or skip it, but I don't really know how to do that.
I'm pasting the code that I'm using; please help me.
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession

object TaxiCaseOne {

  case class NycTaxiData(Vendor_Id: String, PickUpdate: String, Droptime: String, PassengerCount: Int, Distance: Double, PickupLong: String, PickupLat: String, RateCode: Int, Flag: String, DropLong: String, DropLat: String, PaymentMode: String, Fare: Double, SurCharge: Double, Tax: Double, TripAmount: Double, Tolls: Double, TotalAmount: Double)

  def mapper(line: String): NycTaxiData = {
    val fields = line.split(',')
    NycTaxiData(fields(0), fields(1), fields(2), fields(3).toInt, fields(4).toDouble, fields(5), fields(6), fields(7).toInt, fields(8), fields(9), fields(10), fields(11), fields(12).toDouble, fields(13).toDouble, fields(14).toDouble, fields(15).toDouble, fields(16).toDouble, fields(17).toDouble)
  }

  def main(args: Array[String]): Unit = {
    // Set the log level to only print errors
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Use the new SparkSession interface in Spark 2.0
    val spark = SparkSession
      .builder
      .appName("SparkSQL")
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
      .getOrCreate()

    val lines = spark.sparkContext.textFile("../nyc.csv")
    val data = lines.map(mapper)

    // Infer the schema, and register the Dataset as a table.
    import spark.implicits._
    val schemaData = data.toDS
    schemaData.printSchema()
    schemaData.createOrReplaceTempView("data")

    // SQL can be run over DataFrames that have been registered as a table
    val vendor = spark.sql("SELECT * FROM data WHERE Vendor_Id == 'CMT'")
    val results = vendor.collect()
    results.foreach(println)

    spark.stop()
  }
}
If you have a CSV file, you should use Spark's CSV reader to load it rather than textFile:
val spark = SparkSession.builder().appName("test val spark = SparkSession
.builder
.appName("SparkSQL")
.master("local[*]")
.config("spark.sql.warehouse.dir", "file:///C:/temp") // Necessary to work around a Windows bug in Spark 2.0.0; omit if you're not on Windows.
.getOrCreate()
val df = spark.read
.format("csv")
.option("header", "true") //This identifies first line as header
.csv("../nyc.csv")
You need the spark-core and spark-sql dependencies for this to work.
Hope this helps!
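If you still want to run SQL on it afterwards, the rest of your original flow carries over unchanged; a short sketch (the Vendor_Id column name is taken from your case class):
// Hedged sketch: register the DataFrame read above and filter it with SQL,
// as in the original code.
df.createOrReplaceTempView("data")
val vendor = spark.sql("SELECT * FROM data WHERE Vendor_Id = 'CMT'")
vendor.show()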