How to overwrite a partition in apache spark 2.3 while still writing to parquet with insertInto method - scala

I saw this example code to overwrite a partition through Spark 2.3 really nicely:
dfPartition.coalesce(coalesceNum).write.mode("overwrite").format("parquet").insertInto(tblName)
My issue is that even after adding .format("parquet") it is not being written as parquet, but rather as .c000 files.
The compaction and overwriting of the partition is working, but not the writing as parquet.
Full code here:
val sparkSession = SparkSession.builder //.master("local[2]")
  .config("spark.hadoop.parquet.enable.summary-metadata", "false")
  .config("hive.exec.dynamic.partition", "true")
  .config("hive.exec.dynamic.partition.mode", "nonstrict")
  .config("parquet.compression", "snappy")
  .enableHiveSupport() //can just comment out hive support
  .getOrCreate

sparkSession.sparkContext.setLogLevel("ERROR")
println("Created hive Context")

val currentUtcDateTime = new DateTime(DateTimeZone.UTC)
//to compact yesterdays partition
val partitionDtKey = currentUtcDateTime.minusHours(24).toString("yyyyMMdd").toLong

val dfPartition = sparkSession.sql(s"select * from $tblName where $columnPartition=$hardCodedPartition")

if (!dfPartition.take(1).isEmpty) {
  sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
  dfPartition.coalesce(coalesceNum).write.format("parquet").mode("overwrite").insertInto(tblName)
  sparkSession.sql(s"msck repair table $tblName")
  Helpers.executeQuery("refresh " + tblName, "impala", resultRequired = false)
} else {
  "echo invalid partition"
}
Here is the question where I got the suggestion to use this code: Overwrite specific partitions in spark dataframe write method.
What I like about this method is not having to list the partition columns, which is really nice. I can easily use it in many cases.
Using Scala 2.11, CDH 5.12, Spark 2.3.
Any suggestions?

The extension .c000 relates to the executor that wrote the file, not to the actual file format. The file could be parquet and end with .c000, or .snappy, or .zip... To find out the actual file format, run this command:
hadoop dfs -cat /tmp/filename.c000 | head
where /tmp/filename.c000 is the HDFS path to your file. You will see some strange symbols, but if it is actually a parquet file you should see PAR1 (the parquet magic bytes) right at the start.
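As an additional sanity check from Spark itself, you can simply read the file back as parquet; a minimal sketch, reusing the sparkSession from the question and a placeholder file path:

// spark.read.parquet only succeeds if the file really is parquet,
// whatever its extension says; the path below is a placeholder.
val verifyDf = sparkSession.read.parquet("/tmp/filename.c000")
verifyDf.printSchema()
verifyDf.show(5, truncate = false)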

Related

Spark streaming sourceArchiveDir does not move file to archive dir

How do I move source CSV files into an archive directory using "sourceArchiveDir" and "cleanSource=archive"? I am running the code below, but it does not move the source files; however, the stream processing itself is working fine, i.e. it prints the source file content to the console.
import org.apache.spark.sql.streaming.{OutputMode, Trigger}

val inputPath =
  "/<here is an absolute path to my project dir>/data/input/spark_full_delta/2021-06-21"

spark
  .readStream
  .format("csv")
  .schema(jsonSchema)
  .option("pathGlobFilter", "customers_*2021-06-21.csv")
  .option(
    "sourceArchiveDir",
    "/<here is an absolute path to my project dir>/data/archive")
  .option("cleanSource", "archive")
  .option("latestFirst", "false")
  .option("spark.sql.streaming.fileSource.cleaner.numThreads", "2")
  .option("header", "true")
  .load(inputPath)
  .withColumn("date", lit("2021-06-21"))
  .writeStream
  .outputMode(OutputMode.Append)
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .format("console")
  .start()
StructSchema for reference:
scala> jsonSchema
res54: org.apache.spark.sql.types.StructType = StructType(
StructField(customerId,IntegerType,true),
StructField(name,StringType,true),
StructField(country,StringType,true),
StructField(date,DateType,false))
Documentation reference: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#creating-streaming-dataframes-and-streaming-datasets. Scroll down to the table of sources with their options.
File source archiving is based on several more internal Spark options, which you can try to change (but you do not have to) for debugging purposes, to speed up the archiving of the source files:
spark
  .readStream
  .format("csv")
  .schema(jsonSchema)
  // Number of log files after which all the previous files
  // are compacted into the next log file.
  .option("spark.sql.streaming.fileSource.log.compactInterval", "1")
  // How long in milliseconds a file is guaranteed to
  // be visible for all readers.
  .option("spark.sql.streaming.fileSource.log.cleanupDelay", "1")
  .option(
    "sourceArchiveDir",
    "/<here is an absolute path to my project dir>/data/archive")
  .option("cleanSource", "archive")
  ...
Then try to add more files to the source path. Spark should move files already seen in previous micro-batches to the "sourceArchiveDir".
Please note that both options (compactInterval, cleanupDelay) are internal Spark options, which may change in the future without notice. Default values as of Spark 3.2.0-SNAPSHOT:
spark.sql.streaming.fileSource.log.compactInterval: 10
spark.sql.streaming.fileSource.log.cleanupDelay: 10 minutes
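Since these look like regular session-level SQL configurations, an alternative sketch (an assumption on my part, not something the answer above shows) is to set them once on the session conf before starting the stream, reusing the spark session, jsonSchema, and inputPath from the question:

// Apply the internal file-source log settings once on the session
// instead of per-reader; these aggressive values are for debugging only.
spark.conf.set("spark.sql.streaming.fileSource.log.compactInterval", "1")
spark.conf.set("spark.sql.streaming.fileSource.log.cleanupDelay", "1")

val archivingQuery = spark
  .readStream
  .format("csv")
  .schema(jsonSchema)
  .option("header", "true")
  .option("sourceArchiveDir", "/<here is an absolute path to my project dir>/data/archive")
  .option("cleanSource", "archive")
  .load(inputPath)
  .writeStream
  .format("console")
  .start()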

sparksql output dataframe has no records

I have this Spark SQL code:
import org.apache.spark.{SparkConf, SparkContext}

object MyTest extends App {
  val conf = new SparkConf().setAppName("GTPCP KPIs")
  val sc = new SparkContext(conf)
  val hContext = new org.apache.spark.sql.hive.HiveContext(sc)

  val outputDF = hContext.sql("Select field1, field2 from prddb.cust_data")
  println("records selected: " + outputDF.count() + "\n")

  outputDF.write.mode("append").saveAsTable("devdb.vs_test")
  //outputDF.show()
}
The problem is that if I run the query
Select field1, field2 from prddb.cust_data
in hive it gives me around 1.5 million records.
However, through Spark SQL I am not getting any output in the devdb.vs_test table.
The println statement prints 0.
I am using Spark 1.5.0.
Any help here will be appreciated!!
It looks like your Spark session doesn't have Hive connectivity.
You need to link hive-site.xml with the Spark conf, or copy hive-site.xml into the Spark conf directory. Spark is not able to find your Hive metastore (a Derby database by default), so we have to link the Hive conf to the Spark conf directory.
Finally, to connect Spark SQL to an existing Hive installation, you must copy your hive-site.xml file to Spark’s configuration directory ($SPARK_HOME/conf). If you don’t have an existing Hive installation, Spark SQL will still run.
Sudo to the root user and then copy hive-site.xml to the Spark conf directory:
sudo -u root cp /etc/hive/conf/hive-site.xml /etc/spark/conf
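After copying the file, a quick way to confirm the connectivity (a hedged check of my own, not part of the original answer) is to list what the metastore actually exposes from the same HiveContext:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("Hive connectivity check")
val sc = new SparkContext(conf)
val hContext = new org.apache.spark.sql.hive.HiveContext(sc)

// With hive-site.xml picked up, these should list your real databases and
// tables rather than just "default" in an empty local Derby metastore.
hContext.sql("show databases").show()
hContext.sql("show tables in prddb").show()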

How to continuously monitor a directory by using Spark Structured Streaming

I want Spark to continuously monitor a directory and read the CSV files with spark.readStream as soon as a file appears in that directory.
Please don't include a Spark Streaming (DStream) solution. I am looking for a way to do it using Spark Structured Streaming.
Here is the complete solution for this use case:
If you are running in standalone mode, you can increase the driver memory:
bin/spark-shell --driver-memory 4G
There is no need to set the executor memory, as in standalone mode the executor runs within the driver.
Completing the solution of @T.Gaweda, find the solution below:
import org.apache.spark.sql.types.StructType

val userSchema = new StructType().add("name", "string").add("age", "integer")

val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)        // Specify schema of the csv files
  .csv("/path/to/directory") // Equivalent to format("csv").load("/path/to/directory")

csvDF.writeStream.format("console").option("truncate", "false").start()
Now Spark will continuously monitor the specified directory, and as soon as you add any CSV file to the directory, your DataFrame operations on "csvDF" will be executed on that file.
Note: If you want Spark to infer the schema, you first have to set the following configuration:
spark.sqlContext.setConf("spark.sql.streaming.schemaInference", "true")
where spark is your SparkSession.
As written in the official documentation, you should use the "file" source:
File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
Code example taken from documentation:
// Read all the csv files written atomically in a directory
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)        // Specify schema of the csv files
  .csv("/path/to/directory") // Equivalent to format("csv").load("/path/to/directory")
If you don't specify a trigger, Spark will read new files as soon as possible.
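For completeness, a minimal sketch of attaching a sink with an explicit processing-time trigger to the csvDF above (the 10-second interval and checkpoint path are illustrative assumptions, not from the documentation):

import org.apache.spark.sql.streaming.Trigger

// Poll the directory every 10 seconds instead of "as soon as possible";
// the checkpoint location is a hypothetical path.
val query = csvDF.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("10 seconds"))
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()

query.awaitTermination()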

How to read a file using sparkstreaming and write to a simple file using Scala?

I'm trying to read a file using a Scala Spark Streaming program. The file is stored in a directory on my local machine, and I am trying to write it out as a new file on my local machine as well. But whenever I write my stream and store it as parquet, I end up with blank folders.
This is my code:
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

Logger.getLogger("org").setLevel(Level.ERROR)

val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("StreamAFile")
  .config("spark.sql.warehouse.dir", "file:///C:/temp")
  .getOrCreate()

import spark.implicits._

val schemaforfile = new StructType().add("SrNo",IntegerType).add("Name",StringType).add("Age",IntegerType).add("Friends",IntegerType)

val file = spark.readStream.schema(schemaforfile).csv("C:\\SparkScala\\fakefriends.csv")

file.writeStream.format("parquet").start("C:\\Users\\roswal01\\Desktop\\streamed")

spark.stop()
Is there anything missing in my code, or anything where I've gone wrong?
I also tried reading this file from an HDFS location, but the same code ends up not creating any output folders on my HDFS either.
You've got a mistake here:
val file = spark.readStream.schema(schemaforfile).csv("C:\\SparkScala\\fakefriends.csv")
The csv() function should take a directory path as an argument. It will scan this directory and read all new files as they are moved into it.
For checkpointing, you should add:
.option("checkpointLocation", "path/to/HDFS/dir")
For example:
val query = file.writeStream.format("parquet")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .start("C:\\Users\\roswal01\\Desktop\\streamed")

query.awaitTermination()
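Putting the two fixes together, a minimal sketch of the corrected read and write, assuming the CSV files are dropped into a C:\SparkScala\input directory (the input and checkpoint directory names are illustrative assumptions):

// Point the stream at a directory, not a single file
val file = spark.readStream
  .schema(schemaforfile)
  .csv("C:\\SparkScala\\input")

val query = file.writeStream
  .format("parquet")
  .option("checkpointLocation", "C:\\SparkScala\\checkpoint") // hypothetical local checkpoint dir
  .start("C:\\Users\\roswal01\\Desktop\\streamed")

// Block here instead of calling spark.stop() right away, otherwise
// the application exits before any micro-batch gets a chance to run
query.awaitTermination()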

Not able to load file from HDFS in spark Dataframe

I have a CSV file stored in a local Windows HDFS (hdfs://localhost:54310), under the path /tmp/home/.
I would like to load this file from HDFS into a Spark DataFrame. So I tried this
val spark = SparkSession.builder.master(masterName).appName(appName).getOrCreate()
and then
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
import sparkSession.implicits._
spark.sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load(path)
.show()
But it fails at runtime with the exception stack trace below:
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/test/sampleApp/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:205)
at org.apache.hadoop.fs.Path.<init>(Path.java:171)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.makeQualifiedPath(SessionCatalog.scala:114)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createDatabase(SessionCatalog.scala:145)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.<init>(SessionCatalog.scala:89)
at org.apache.spark.sql.internal.SessionState.catalog$lzycompute(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState.catalog(SessionState.scala:95)
at org.apache.spark.sql.internal.SessionState$$anon$1.<init>(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:112)
at org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:111)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:49)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
at org.apache.spark.sql.SparkSession.baseRelationToDataFrame(SparkSession.scala:382)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:143)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:132)
C:/test/sampleApp/ is the path where my sample project lies. But I have specified the HDFS path.
Additionally, this works perfectly fine with a plain RDD:
val path = "hdfs://localhost:54310/tmp/home/mycsv.csv"
val sc = SparkContext.getOrCreate()
val rdd = sc.textFile(path)
println(rdd.first()) //prints first row of CSV file
I found and tried this as well but no luck :(
Am I missing something? Why is Spark looking at my local file system and not HDFS?
I am using Spark 2.0 on hadoop-hdfs 2.7.2 with Scala 2.11.
EDIT: Just one additional piece of info: I tried downgrading to Spark 1.6.2 and was able to make it work, so I think this is a bug in Spark 2.0.
Just to close the loop: this seems to be an issue in Spark 2.0, and a ticket has been raised:
https://issues.apache.org/jira/browse/SPARK-15899
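Until you can move to a release with the fix, a commonly suggested workaround (a hedged sketch of my own, not something stated in the ticket itself) is to point spark.sql.warehouse.dir at an explicit, well-formed file URI when building the session; the C:/tmp path below is just an illustrative assumption:

import org.apache.spark.sql.SparkSession

// Give the warehouse dir a valid URI so Spark does not build the
// malformed "file:C:/..." path seen in the stack trace above.
val spark = SparkSession.builder
  .master(masterName)
  .appName(appName)
  .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")
  .getOrCreate()

spark.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("hdfs://localhost:54310/tmp/home/mycsv.csv")
  .show()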