Spark SQL: set avro Schema option using SQL API - scala

I'm trying to read Avro data in Spark SQL using the SQL API.
Example:
CREATE TEMPORARY TABLE episodes
USING com.databricks.spark.avro
OPTIONS (path "/tmp/episodes.avro")
Is it possible to set the avroSchema (.avsc file) option, as in the Scala API?
Example:
spark
.read
.format("com.databricks.spark.avro")
.option("avroSchema", new Schema.Parser().parse(new File("user.avsc")).toString)
.load("/tmp/episodes.avro").show()

I think my answer might be helpful to someone working on a local machine, or to learners new to PySpark. If you are working in the PyCharm IDE, it doesn't offer a way to include Scala or Java dependencies, and Spark Avro support doesn't come bundled with Apache Spark, so we need to configure it ourselves.
Go to the directory where Spark is installed, open conf/spark-defaults.conf, and add the line below at the bottom of the file.
> spark.jars.packages org.apache.spark:spark-avro_2.11:2.4.5
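If editing spark-defaults.conf is not convenient, the same property can also be set in code before the first SparkSession is created. A minimal Scala sketch, reusing the package coordinates from the line above:
import org.apache.spark.sql.SparkSession

// spark.jars.packages must be set before the SparkSession/SparkContext exists,
// otherwise the package is never resolved and downloaded.
val spark = SparkSession.builder()
  .appName("avro-example")
  .config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.5")
  .getOrCreate()

// org.apache.spark:spark-avro registers the short format name "avro".
spark.read.format("avro").load("/tmp/episodes.avro").show()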

Related

AWS MSK. Kafka Connect - plugin class loading

I am using Kafka Connect in MSK.
I have defined a plugin that points to a zip file in s3 - this works fine.
I have implemented SMT and uploaded the SMT jar into the same bucket and folder as the zip file of the plugin.
I define a new connector and this time I add the SMT using transforms.
I get an error message that the Class com.x.y.z.MySMT could not be found.
I verified that the jar is valid and contains the SMT.
Where should I put the SMT jar in order to make Kafka Connect load it?
Pushing the SMT jar into the zip (under /lib) solved the class not found issue.
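For reference, a minimal sketch of the connector configuration keys involved (the alias mySmt is illustrative; the class is the one from the question). The class can only be resolved when its jar sits inside the plugin's own zip, e.g. under lib/, next to the connector jars:
transforms=mySmt
transforms.mySmt.type=com.x.y.z.MySMT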

Apache Beam - Is it possible to write to an Excel file?

I would like to create an Apache Beam pipeline for Dataflow that gets data from a database, transforms it, and uploads the result to GCP as styled Excel files (xls and xlsx) that I could then share.
The Apache POI library allows me to create styled Excel files, but I am failing to integrate it into the Beam pipeline because writing a workbook isn't really a per-element operation on a PCollection.
Does anyone have an idea how I could do this without having to fall back to writing CSV?
Thanks

_spark_metadata causing problems

I am using Spark with Scala and I have a directory containing multiple files.
In this directory I have Parquet files generated by batch Spark jobs and other files generated by Spark Streaming.
Spark Streaming also generates a _spark_metadata directory.
The problem I am facing is that when I read the directory with Spark (sparksession.read.load), it reads only the data generated by Spark Streaming, as if the other data did not exist.
Does someone know how to resolve this issue? I think there should be a property to force Spark to ignore the _spark_metadata directory.
Thank you for your help
I have the same problem (Spark 2.4.0), and the only workaround I am aware of is to load the files using a mask/pattern, something like this:
sparksession.read.format("parquet").load("/path/*.parquet")
As far as I know there is no way to ignore this directory. If it exists, Spark will consider it.
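A small Scala sketch of that workaround; DataFrameReader.load also accepts several paths, so if the streaming output happened to live in its own subdirectory (a hypothetical layout) both masks could be combined in one call, and the _spark_metadata log would not be consulted:
// Masks/globs bypass the _spark_metadata streaming commit log entirely.
val df = sparksession.read
  .format("parquet")
  .load("/path/*.parquet", "/path/streaming/*.parquet") // second location is hypothetical
df.show()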

Google Spreadsheets Spark library

I am using the https://github.com/potix2/spark-google-spreadsheets library for reading a spreadsheet file in Spark. It works perfectly on my local machine.
val df = sqlContext.read.
format("com.github.potix2.spark.google.spreadsheets").
option("serviceAccountId", "xxxxxx#developer.gserviceaccount.com").
option("credentialPath", "/path/to/credentail.p12").
load("<spreadsheetId>/worksheet1")
I created a new assembly jar that includes all the credentials and used that jar for reading the file. But I am facing an issue with reading the credentialPath file. I tried using
getClass.getResourceAsStream("/resources/Aircraft/allAircraft.txt")
But the library only supports absolute paths. Please help me resolve this issue.
You can use the --files argument of spark-submit or SparkContext.addFile() to distribute a credential file. If you want to get the local path of the credential file on a worker node, call SparkFiles.get("credential filename").
import org.apache.spark.SparkFiles
// you can also use `spark-submit --files=credential.p12`
sqlContext.sparkContext.addFile("credential.p12")
val credentialPath = SparkFiles.get("credential.p12")
val df = sqlContext.read.
format("com.github.potix2.spark.google.spreadsheets").
option("serviceAccountId", "xxxxxx#developer.gserviceaccount.com").
option("credentialPath", credentialPath).
load("<spreadsheetId>/worksheet1")
Use SBT and try the Typesafe Config library.
Here is a simple approach which reads some information from a config file placed in the resources folder (a sketch follows below).
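The sketch assumes an application.conf under src/main/resources with the illustrative keys google.serviceAccountId and google.credentialPath, plus the com.typesafe:config artifact on the classpath:
import com.typesafe.config.ConfigFactory

// ConfigFactory.load() picks up application.conf from the classpath,
// including from inside an assembly jar built with sbt-assembly.
val conf = ConfigFactory.load()
val serviceAccountId = conf.getString("google.serviceAccountId")
val credentialPath = conf.getString("google.credentialPath")

val df = sqlContext.read.
  format("com.github.potix2.spark.google.spreadsheets").
  option("serviceAccountId", serviceAccountId).
  option("credentialPath", credentialPath).
  load("<spreadsheetId>/worksheet1")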
Then you can assemble a jar file using the sbt-assembly plugin.
If you're working in the Databricks environment, you can upload the credentials file.
Setting the GOOGLE_APPLICATION_CREDENTIALS environment variable, as described here, does not get you around this requirement because it's a link to the file path, not the actual credentials. See here for more details about getting the right credentials and using the library.

Upload zip file using --archives option of spark-submit on yarn

I have a directory with some model files, and my application has to access these model files on the local file system for some reason.
Of course I know that the --files option of spark-submit can upload files to the working directory of each executor, and it does work.
However, I want to keep the directory structure of my files, so I turned to the --archives option, which is documented as:
YARN-only:
......
--archives ARCHIVES Comma separated list of archives to be extracted into the working directory of each executor.
......
But when I actually use it to upload models.zip, I found that YARN just puts it there without extraction, just like it does with --files. Have I misunderstood "to be extracted", or have I misused this option?
Found the answer myself.
YARN does extract the archive, but adds an extra folder with the same name as the archive. To make it clear, if I put models/model1 and models/model2 in models.zip, then I have to access my models via models.zip/models/model1 and models.zip/models/model2.
Moreover, we can make the paths cleaner using the # syntax.
The --files and --archives options support specifying file names with the # similar to Hadoop. For example you can specify: --files localtest.txt#appSees.txt and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name appSees.txt, and your application should use the name as appSees.txt to reference it when running on YARN.
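A short Scala sketch of the resulting layout, assuming the job is submitted with --archives models.zip#ml (the link name ml is illustrative) and models.zip contains models/model1 and models/model2 as above; the link name replaces "models.zip" in the access paths:
import scala.io.Source

// Inside a task running on an executor, the archive is exposed under ./ml,
// so the files sit at ml/models/model1 and ml/models/model2.
val model1Text = Source.fromFile("ml/models/model1").mkString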
Edit:
This answer was tested on Spark 2.0.0 and I'm not sure about the behavior in other versions.