_spark_metadata causing problems - Scala

I am using Spark with Scala, and I have a directory containing multiple files.
In this directory I have Parquet files generated by Spark as well as other files generated by Spark Streaming.
Spark Streaming also generates a _spark_metadata directory.
The problem I am facing is that when I read the directory with Spark (sparksession.read.load), it reads only the data generated by Spark Streaming, as if the other data did not exist.
Does anyone know how to resolve this issue? I think there should be a property to force Spark to ignore the _spark_metadata directory.
Thank you for your help.

I have the same problem (Spark 2.4.0), and the only workaround I am aware of is to load the files using a mask/pattern, something like this:
sparksession.read.format("parquet").load("/path/*.parquet")
As far as I know, there is no way to ignore this directory: if it exists, Spark will take it into account.
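For what it's worth, a small sketch of that workaround with a quick sanity check (the directory name is a placeholder, and sparksession is the SparkSession from the question):

// The glob bypasses the _spark_metadata log, so every Parquet file in the
// directory gets listed, not only the ones recorded by the streaming query.
val df = sparksession.read
  .format("parquet")
  .load("/data/mixed-output/*.parquet")

// Quick check: inputFiles should now include the batch-written files as well.
println(df.inputFiles.length)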

Related

I want to overcome some limitations I faced while using Kafka Spooldir Connector

I have set up Kafka's spooldir connector on a Unix machine and it seems to work well. I would like to know if a few things can be done with spooldir:
I want to create multiple directories inside the file path that spooldir scans, create files of the provided format inside them, and scan those too. How do I accomplish this?
I do not want the source files to be moved to different directories after completion/error. I tried providing the same path for source, target, and error, but the connector would not accept the value. Is there any way around these?

FileWatcher on a directory

I have a Spark/Scala application, and my requirement here is to look for a file in a directory, process it, and finally clean up that directory.
Isn't it possible to do this within the Spark application itself, i.e.:
- Watch for a file in a directory
- When it finds the file, continue the process
- Clean up the directory before ending the app
- Repeat the above for the next new run, and so on...
We currently do this file-watching process using an external application, so in order to remove the dependency on that third-party application we would like to do it within our Spark/Scala application itself.
Is there a feasible solution for a file watcher using just Scala/Spark?
Please guide me.
What about file streams in Spark Streaming?
https://spark.apache.org/docs/latest/streaming-programming-guide.html#file-streams
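Along the lines of the file-stream approach in that guide, a minimal sketch of a directory watcher (the path and batch interval are placeholders; the cleanup step is not handled by the file stream itself and would need separate logic, e.g. Hadoop FileSystem calls):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DirectoryWatcher {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("directory-watcher")
    val ssc  = new StreamingContext(conf, Seconds(30))

    // Picks up files that appear in the directory after the stream starts.
    val lines = ssc.textFileStream("/data/inbox")
    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Process the newly arrived file(s) here.
        println(s"Received ${rdd.count()} new lines")
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}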

Spark SQL: set avroSchema option using SQL API

I'm trying to read Avro data from Spark SQL using the SQL API.
Example:
CREATE TEMPORARY TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "/tmp/episodes.avro")
Is it possible to set the avroSchema option (an .avsc file) as in the Scala API?
Example:
spark
.read
.format("com.databricks.spark.avro")
.option("avroSchema", new Schema.Parser().parse(new File("user.avsc")).toString)
.load("/tmp/episodes.avro").show()
I think my answer might be helpful to someone working on a local machine or to learners new to PySpark. If you are working in the PyCharm IDE, it doesn't offer a way to include Scala or Java dependencies, and Spark Avro doesn't come bundled with Apache Spark, so we need to add the dependency through Spark's configuration.
Go to the directory where Spark is installed, open conf/spark-defaults.conf, and add the line below at the bottom of the file:
spark.jars.packages org.apache.spark:spark-avro_2.11:2.4.5
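As for the original question about avroSchema: the key-value pairs in the OPTIONS clause are handed to the data source much like .option(...) calls, so it should be possible to inline the schema as a JSON string. A hedged, untested sketch (spark is the SparkSession from the question's snippet, and the two-field schema is made up for illustration):

// Inline the Avro schema as a string in the OPTIONS clause.
spark.sql(
  """CREATE TEMPORARY VIEW episodes
    |USING com.databricks.spark.avro
    |OPTIONS (
    |  path "/tmp/episodes.avro",
    |  avroSchema '{"type":"record","name":"episodes","fields":[{"name":"title","type":"string"},{"name":"air_date","type":"string"}]}'
    |)""".stripMargin)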

Upload a zip file using the --archives option of spark-submit on YARN

I have a directory with some model files, and my application has to access these model files on the local file system for certain reasons.
I know, of course, that the --files option of spark-submit can upload files to the working directory of each executor, and it does work.
However, I want to keep the directory structure of my files, so I came up with the --archives option, which is described as:
YARN-only:
......
--archives ARCHIVES Comma separated list of archives to be extracted into the working directory of each executor.
......
But when I actually used it to upload models.zip, I found that YARN just put it there without extraction, like what it does with --files. Have I misunderstood "to be extracted", or misused this option?
Found the answer myself.
YARN does extract the archive, but it adds an extra folder with the same name as the archive. To make it clear: if I put models/model1 and models/model2 in models.zip, then I have to access my models as models.zip/models/model1 and models.zip/models/model2.
Moreover, we can make this cleaner using the # syntax.
The --files and --archives options support specifying file names with a #, similar to Hadoop. For example, you can specify --files localtest.txt#appSees.txt; this will upload the file you have locally named localtest.txt into HDFS, but it will be linked to by the name appSees.txt, and your application should use the name appSees.txt to reference it when running on YARN.
Edit:
This answer was tested on Spark 2.0.0, and I'm not sure about the behavior in other versions.
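Putting the two together, a sketch of what the submit command could look like (the application jar, class, and ml_models alias are placeholder names):

spark-submit --master yarn --deploy-mode cluster \
  --archives models.zip#ml_models \
  --class com.example.MyApp my-app.jar

With that alias, each executor would see the extracted content under the link name, so the files from the example above would be read as ml_models/models/model1 and ml_models/models/model2.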

Talend issue while copying local files to HDFS

Hi, I want to know how to copy files from a source file system (the local file system) to HDFS using Talend, and, if a source file has already been copied to HDFS, how to skip or ignore that file so it is not copied to HDFS again.
Thanks
Venkat
To copy files from the local file system to HDFS, you can use the tHDFSPut component if you have Talend for Big Data. If you use Talend for Data Integration, you can use the tSystem component with the right command.
To avoid duplicate copies, you can create a table in an RDBMS and keep track of all copied files. Each time the job starts copying a file, it should check whether that file already exists in the table.
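A minimal sketch of such a tracking table and the checks around each copy (table and column names are hypothetical; the exact DDL depends on your RDBMS):

CREATE TABLE copied_files (
  file_path VARCHAR(1024) PRIMARY KEY,
  copied_at TIMESTAMP
);

-- Before copying: skip the file if it is already recorded.
SELECT COUNT(*) FROM copied_files WHERE file_path = '/data/in/file1.csv';

-- After a successful copy: record the file.
INSERT INTO copied_files (file_path, copied_at) VALUES ('/data/in/file1.csv', CURRENT_TIMESTAMP);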