Spark: overwrite partitioned folders - scala

I have a workflow on Spark 3.1 that writes a dataframe to S3 at the end, partitioned by year, month, day, and hour. I expect the files in each "folder" in S3 to be overwritten, but they are always appended. Any idea as to what might be the problem?
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
df
.write
.mode(SaveMode.Overwrite)
.partitionBy("year", "month", "day", "hour")
.json(outputPath)

It seems this is a bug in Spark 3.1. Downgrading to Spark 3.0.1 helps.

I suggest this version:
df
.write
.mode("overwrite")
.partitionBy("year", "month", "day", "hour")
.json(outputPath)
or this one:
df
.write
.mode(SaveMode.Overwrite)
.partitionBy("year", "month", "day", "hour")
.json(outputPath)
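If setting the session conf is not picked up (it has to be set before the write runs), the same mode can also be passed as a per-write option, which takes precedence over the session setting; a sketch based on the code above:
df
.write
.mode(SaveMode.Overwrite)
.option("partitionOverwriteMode", "dynamic")
.partitionBy("year", "month", "day", "hour")
.json(outputPath)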
For older versions of Spark, you can use the following to overwrite the output directory with the RDD contents:
val sparkConf = new SparkConf()
sparkConf.set("spark.hadoop.validateOutputSpecs", "false")
val sparkContext = new SparkContext(sparkConf)
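A typical follow-up use, as a sketch (rdd here is a hypothetical placeholder for whatever RDD you are saving, and outputPath is the same as above):
// With validateOutputSpecs disabled, the usual "output directory already exists"
// check is skipped, so saving into an existing path no longer fails.
rdd.saveAsTextFile(outputPath)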

Related

Spark streaming sourceArchiveDir does not move file to archive dir

How do I move source CSV files into an archive directory using "sourceArchiveDir" and "cleanSource=archive"? I am running the code below, but it does not move the source file. The stream processing itself works fine, i.e. it prints the source file content to the console.
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
val inputPath =
"/<here is an absolute path to my project dir>/data/input/spark_full_delta/2021-06-21"
spark
.readStream
.format("csv")
.schema(jsonSchema)
.option("pathGlobFilter","customers_*2021-06-21.csv")
.option(
"sourceArchiveDir",
"/<here is an absolute path to my project dir>/data/archive")
.option("cleanSource", "archive")
.option("latestFirst","false")
.option("spark.sql.streaming.fileSource.cleaner.numThreads", "2")
.option("header", "true")
.load(inputPath)
.withColumn("date", lit("2021-06-21"))
.writeStream
.outputMode(OutputMode.Append)
.trigger(Trigger.ProcessingTime("5 seconds"))
.format("console")
.start()
StructSchema for reference:
scala> jsonSchema
res54: org.apache.spark.sql.types.StructType = StructType(
StructField(customerId,IntegerType,true),
StructField(name,StringType,true),
StructField(country,StringType,true),
StructField(date,DateType,false))
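For reference, a schema equivalent to the printed StructType could be declared like this (a sketch; the field names and types are taken from the output above):
import org.apache.spark.sql.types._

val jsonSchema = StructType(Seq(
  StructField("customerId", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("country", StringType, nullable = true),
  StructField("date", DateType, nullable = false)))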
Documentation reference: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#creating-streaming-dataframes-and-streaming-datasets. Scroll down to the table of sources and their options.
File source archiving also depends on several internal Spark options, which you can change (though you do not have to) for debugging purposes, to speed up the archiving of source files:
spark
.readStream
.format("csv")
.schema(jsonSchema)
// Number of log files after which all the previous files
// are compacted into the next log file.
.option("spark.sql.streaming.fileSource.log.compactInterval","1")
// How long in milliseconds a file is guaranteed to
// be visible for all readers.
.option("spark.sql.streaming.fileSource.log.cleanupDelay","1")
.option(
"sourceArchiveDir",
"/<here is an absolute path to my project dir>/data/archive")
.option("cleanSource", "archive")
...
Then try adding more files to the source path. Spark should move files already seen in a previous micro-batch to "sourceArchiveDir".
Please note that both options (compactInterval, cleanupDelay) are internal Spark options, which may change in the future without notice. Default values as of Spark 3.2.0-SNAPSHOT:
spark.sql.streaming.fileSource.log.compactInterval: 10
spark.sql.streaming.fileSource.log.cleanupDelay: 10 minutes
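If you prefer to set these at the session level instead of as per-source options, a sketch using the same internal config keys (the same caveat about them changing applies; set them before starting the stream):
// Internal Spark SQL configs; subject to change between releases.
spark.conf.set("spark.sql.streaming.fileSource.log.compactInterval", "1")
spark.conf.set("spark.sql.streaming.fileSource.log.cleanupDelay", "1")
spark.conf.set("spark.sql.streaming.fileSource.cleaner.numThreads", "2")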

How to parallelize spark.read.parquet()?

My Spark job reads a folder with parquet data partitioned by the column partition:
val spark = SparkSession
.builder()
.appName("Prepare Id Mapping")
.getOrCreate()
import spark.implicits._
spark.read
.parquet(sourceDir)
.filter($"field" === "ss_id" and $"int_value".isNotNull)
.select($"int_value".as("ss_id"), $"partition".as("date"), $"ct_id")
.coalesce(1)
.write
.partitionBy("date")
.parquet(idMappingDir)
I've noticed that only one task is created, so it's very slow. There are a lot of subfolders like partition=2019-01-07 inside the source folder, and each subfolder contains a lot of files with the extension .snappy.parquet. I submit the job with --num-executors 2 --executor-cores 4, and RAM is not an issue. I tried reading from both S3 and the local filesystem. I tried adding .repartition(nPartitions) and removing .coalesce(1) and .partitionBy("date"), but the result was the same.
Could you suggest how I can get Spark to read these parquet files in parallel?
Well, I've figured out the correct code:
val spark = SparkSession
.builder()
.appName("Prepare Id Mapping")
.getOrCreate()
import spark.implicits._
spark.read
.option("mergeSchema", "true")
.parquet(sourceDir)
.filter($"field" === "ss_id" and $"int_value".isNotNull)
.select($"int_value".as("ss_id"), $"partition".as("date"), $"ct_id")
.write
.partitionBy("date")
.parquet(idMappingDir)
Hope this will save someone time in the future.
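As a quick sanity check, a sketch that confirms the scan is split across tasks by looking at the partition count of the loaded dataframe:
// With many parquet files and no coalesce(1), this should be well above 1.
val inputDf = spark.read.option("mergeSchema", "true").parquet(sourceDir)
println(inputDf.rdd.getNumPartitions)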

How to overwrite a partition in apache spark 2.3 while still writing to parquet with insertInto method

I saw this example code that overwrites a partition through Spark 2.3 really nicely:
dfPartition.coalesce(coalesceNum).write.mode("overwrite").format("parquet").insertInto(tblName)
My issue is that even after adding .format("parquet"), the output does not appear to be written as parquet; the files end in .c000 instead.
The compaction and overwriting of the partition are working, but not, apparently, the writing as parquet.
Full code here:
val sparkSession = SparkSession.builder //.master("local[2]")
.config("spark.hadoop.parquet.enable.summary-metadata", "false")
.config("hive.exec.dynamic.partition", "true")
.config("hive.exec.dynamic.partition.mode", "nonstrict")
.config("parquet.compression", "snappy")
.enableHiveSupport() //can just comment out hive support
.getOrCreate
sparkSession.sparkContext.setLogLevel("ERROR")
println("Created hive Context")
val currentUtcDateTime = new DateTime(DateTimeZone.UTC)
//to compact yesterdays partition
val partitionDtKey = currentUtcDateTime.minusHours(24).toString("yyyyMMdd").toLong
val dfPartition = sparkSession.sql(s"select * from $tblName where $columnPartition=$hardCodedPartition")
if (!dfPartition.take(1).isEmpty) {
sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
dfPartition.coalesce(coalesceNum).write.format("parquet").mode("overwrite").insertInto(tblName)
sparkSession.sql(s"msck repair table $tblName")
Helpers.executeQuery("refresh " + tblName, "impala", resultRequired = false)
}
else {
"echo invalid partition"
}
Here is the question where I got the suggestion to use this code: Overwrite specific partitions in spark dataframe write method.
What I like about this method is not having to list the partition columns, which is really nice. I can easily use it in many cases.
Using Scala 2.11, CDH 5.12, Spark 2.3.
Any suggestions?
The extension .c000 relates to the task that wrote the file, not to the actual file format. The file could be parquet and still end with .c000, or .snappy, or .zip... To find the actual file format, run this command:
hadoop dfs -cat /tmp/filename.c000 | head
where /tmp/filename.c000 is the HDFS path to your file. You will see some strange symbols, and you should see parquet mentioned somewhere (parquet files start and end with the magic bytes PAR1) if it is actually a parquet file.
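If you prefer to check programmatically, a sketch that reads the first four bytes through the Hadoop FileSystem API and compares them with the parquet magic "PAR1" (the path below is a hypothetical placeholder):
import java.nio.charset.StandardCharsets
import org.apache.hadoop.fs.{FileSystem, Path}

val path = new Path("/tmp/filename.c000")  // hypothetical path to one output file
val fs = FileSystem.get(sparkSession.sparkContext.hadoopConfiguration)
val in = fs.open(path)
val magic = new Array[Byte](4)
in.readFully(magic)  // parquet files begin (and end) with the 4-byte magic "PAR1"
in.close()
println(if (new String(magic, StandardCharsets.US_ASCII) == "PAR1") "parquet" else "not parquet")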

Reading CSV using Spark from Zeppelin in EMR

I have a simple CSV file in S3 which I have read many times using Spark on EMR.
Now I want to use Zeppelin so I can do some analysis.
My code is very simple:
val path="s3://somewhere/some.csv"
val df=
_spark
.read
.format("csv")
.option("delimiter", "\t")
.option("header", false)
.option("mode", ParseModes.DROP_MALFORMED_MODE)
.option("nullValue", "NULL")
.option("charset", "UTF-8")
.option("treatEmptyValuesAsNulls", "true")
.option("inferSchema", "true")
.load(path)
But when I try to collect the dataframe
df.collect
I get an error
java.io.InvalidClassException:
org.apache.commons.lang3.time.FastDateFormat; local class
incompatible: stream classdesc serialVersionUID = 1, local class
serialVersionUID = 2
which is caused by the different versions of commons-lang3 that Zeppelin and Spark use.
reference:
http://apache-zeppelin-users-incubating-mailing-list.75479.x6.nabble.com/InvalidClassException-using-Zeppelin-master-and-spark-2-1-on-a-standalone-spark-cluster-td4900.html
I have used many different EMR versions, from 5.3.1 to 5.7.0.
I have tried adding
commons-lang3-3.4.jar
via --jars in Spark, but with no luck.
Has anyone had the same error?
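One quick way to see which commons-lang3 jar the Spark interpreter actually loaded, as a small diagnostic sketch using the class named in the error:
// Prints the location (jar) that FastDateFormat was loaded from in the current JVM.
println(classOf[org.apache.commons.lang3.time.FastDateFormat]
  .getProtectionDomain.getCodeSource.getLocation)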

spark error: spark.read.format("org.apache.spark.csv")

I am getting the following error after running the command from spark-shell:
scala> val df1 = spark.read.format("org.apache.spark.csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("/user/mailtosudiptabiswas7917/src_files/movies_data_srcfile_sess06_01.csv")
<console>:21: error: not found: value spark
val df1 = spark.read.format("org.apache.spark.csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("/user/mailtosudiptabiswas7917/src_files/movies_data_srcfile_sess06_01.csv")
Do I need to import something explicitly?
Please help with the complete command set.
Thanks.
It seems like you are using an old version of Spark. You need to use Spark 2.x or higher and import the implicits as
import spark.implicits._
And then
val df1 = spark.read.format("csv").option("inferSchema", true).option("header",true).option("delimiter", ",").csv("path")
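If you are on Spark 2.x but the spark value is still not defined (for example, outside spark-shell in a standalone application), a minimal sketch for creating the session yourself:
import org.apache.spark.sql.SparkSession

// spark-shell normally provides this automatically; creating it by hand is only
// needed outside the shell or on setups where it is missing.
val spark = SparkSession.builder()
  .appName("csv-read-example")  // hypothetical app name
  .getOrCreate()
import spark.implicits._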
You aren't even getting a SparkSession. It seems you are using an older version of Spark, so you should use the SQLContext, and you also need to include the external Databricks CSV library when you start the spark shell:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
and then from within the spark shell...
import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("inferSchema", "true") // Automatically infer data types
.load("cars.csv")
You can see more info about it here