Spark Databricks local file API - Scala

I'm trying to build a summary/report of the processing done in Spark on Databricks.
I came across the piece of code below, which lets me write data to DBFS as well as to ADLS (through a mount point). The problem is that when I package the code into a jar and run it as a Databricks job, I get a FileNotFoundException, so now I'm wondering how to write data into storage without using notebooks.
import java.io.File
import java.io.PrintWriter
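// Works in a notebook: the /dbfs/ prefix is the local file API, which exposes DBFS (and its ADLS mounts) as a local path on the driver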
val writer = new PrintWriter(new File("/dbfs/mnt/data/out-01/test-01"))
writer.write("Hello Developer, Welcome to Programming.")
writer.write("Hello Developer, Welcome to Programming 2.")
writer.close()
I came across DBUtils from Databricks, but I haven't seen any sample code or documentation that I can use.
Any help will be appreciated.

If your notebook mounted ADLS with a mount command (like the sketch at the end of this answer), then yes, you can write data directly to DBFS through that mount in the current session of your Databricks workspace.
So I think the DBFS mount-point code is missing from the code you packaged into the jar file.
Please refer to the official documents below to see how to access ADLS Gen1 and Gen2 directly from your code.
Azure Data Lake Storage Gen1
Azure Data Lake Storage Gen2
Meanwhile, even without the Databricks utilities, you can change your code to use the ADLS SDK or REST APIs to write data without going through DBFS, and still run it in Databricks.
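For reference, here is a minimal sketch of the mount-plus-write pattern described above, using the Databricks dbutils API with an ADLS Gen2 service principal. All IDs, secret-scope names, container and account names are placeholders. In a notebook dbutils is predefined; when compiling into a jar, it can be obtained from the dbutils-api library as shown in the first comment.
// When compiling into a jar, add the dbutils-api dependency and obtain dbutils:
// val dbutils = com.databricks.dbutils_v1.DBUtilsHolder.dbutils

// OAuth configuration for the service principal (all values are placeholders).
val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" ->
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-id>",
  "fs.azure.account.oauth2.client.secret" ->
    dbutils.secrets.get(scope = "<secret-scope>", key = "<secret-key>"),
  "fs.azure.account.oauth2.client.endpoint" ->
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

// Mount the container at /mnt/data (skip this if the mount already exists on the cluster).
dbutils.fs.mount(
  source = "abfss://<container>@<storage-account>.dfs.core.windows.net/",
  mountPoint = "/mnt/data",
  extraConfigs = configs)

// dbutils.fs can also replace the PrintWriter call from the question.
dbutils.fs.put("/mnt/data/out-01/test-01", "Hello Developer, Welcome to Programming.", true)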

Related

How to get Talend to wait for a file to land in S3

I have a file that lands in AWS S3 several times a day. I am using Talend as my ETL tool to populate a warehouse in Snowflake and need it to watch for the file to trigger my job. I've tried tWaitForFile but can't seem to get it to connect to S3. Has anyone done this before?
You can check the link below on automating a pipeline that uses S3 and Lambda to trigger the Talend job when files land.
Automate S3 File Push

What's the difference between using Data Export Service and Export to Data Lake, regarding dataverse replication?

I know Data Export Service has a SQL storage target, whereas Export to Data Lake targets Gen2, but seeing that Dataverse (aka Common Data Service) is structured data, I can't see why you'd use the Export to Data Lake option in Power Apps, as Gen2 is for unstructured and semi-structured data!
Am I missing something here? Could they both be used e.g. Gen2 to store images data?
Data Export Service is the v1 option, used to replicate Dynamics CRM Online data to Azure SQL Database or SQL Server on Azure IaaS in near real time.
Export to Data Lake is effectively v2, serving the same replication purpose with a new trick :) snapshots are the advantage here.
There is a v3 coming, very similar to v2 but additionally with Azure Synapse linkage.
These changes are happening very fast, and I'm not sure how the community is going to adapt.

Azure Databricks - Running a Spark Jar from Gen2 Data Lake Storage

I am trying to run a spark-submit from Azure Databricks. Currently I can create a job, with the jar uploaded within the Databricks workspace, and run it.
My queries are:
Is there a way to access a jar residing on Gen2 Data Lake storage and do a spark-submit from the Databricks workspace, or even from Azure ADF? (Because the communication between the workspace and Gen2 storage is protected by "fs.azure.account.key".)
Is there a way to do a spark-submit from a Databricks notebook?
Is there a way to access a jar residing on Gen2 Data Lake storage and do a spark-submit from the Databricks workspace, or even from Azure ADF? (Because the communication between the workspace and Gen2 storage is protected by "fs.azure.account.key".)
Unfortunately, you cannot access a jar residing on an Azure Storage account such as ADLS Gen2/Gen1.
Note: The --jars, --py-files, --files arguments support DBFS and S3 paths.
Typically, the Jar libraries are stored under dbfs:/FileStore/jars.
You need to upload the libraries to DBFS and pass them as parameters in the Jar activity.
For more details, refer to "Transform data by running a Jar activity in Azure Databricks using ADF".
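One way to stage the jar, assuming the storage account is already mounted (the /mnt/data path and jar names below are placeholders), is to copy it from the mount into DBFS from a notebook:
// Copy the application jar from an ADLS mount into DBFS so that a Jar activity
// (or spark-submit) can reference it via a dbfs:/ path. Both paths are placeholders.
dbutils.fs.cp(
  "dbfs:/mnt/data/jars/my-app.jar",
  "dbfs:/FileStore/jars/my-app.jar")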
Is there a way to do a spark-submit from a Databricks notebook?
To answer the second question, you may refer to the job types below:
Reference: SparkSubmit and "Create a job"
Hope this helps.
Finally I figured out how to run this:
You can run a Databricks jar from ADF and attach it to an existing cluster, which will have the ADLS key configured in the cluster.
It is not possible to do a spark-submit from a notebook. But you can create a Spark job in Jobs, or you can use the Databricks Runs Submit API to do a spark-submit.
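To make that concrete, here is a minimal sketch of calling the Jobs Runs Submit endpoint with a spark_submit_task. The workspace URL, token handling, cluster spec, main class and jar path are all placeholder assumptions; the jar must live on DBFS, as noted above.
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object RunsSubmitSketch {
  def main(args: Array[String]): Unit = {
    // Placeholders: workspace URL and a personal access token read from the environment.
    val workspaceUrl = "https://<databricks-instance>"
    val token = sys.env("DATABRICKS_TOKEN")

    // Runs Submit body with a spark_submit_task, roughly equivalent to
    //   spark-submit --class com.example.Main dbfs:/FileStore/jars/my-app.jar
    // The cluster spec values (spark_version, node_type_id, num_workers) are examples only.
    val payload =
      """{
        |  "run_name": "spark-submit-from-api",
        |  "new_cluster": {
        |    "spark_version": "7.3.x-scala2.12",
        |    "node_type_id": "Standard_DS3_v2",
        |    "num_workers": 2
        |  },
        |  "spark_submit_task": {
        |    "parameters": ["--class", "com.example.Main", "dbfs:/FileStore/jars/my-app.jar"]
        |  }
        |}""".stripMargin

    val conn = new URL(s"$workspaceUrl/api/2.0/jobs/runs/submit")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Authorization", s"Bearer $token")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(payload.getBytes(StandardCharsets.UTF_8))
    println(s"Runs Submit responded with HTTP ${conn.getResponseCode}")
  }
}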

Google Cloud Dataflow to Cloud Storage

The reference architecture above indicates the existence of a Cloud Storage sink from Cloud Dataflow; however, the Beam API, which seems to be the current default Dataflow API, has no Cloud Storage I/O connector listed.
Can anyone clarify whether one exists, and if not, what the alternative is for bringing data from Dataflow to Cloud Storage?
Beam does support reading from and writing to GCS. You simply use the TextIO classes.
https://beam.apache.org/documentation/sdks/javadoc/0.2.0-incubating/org/apache/beam/sdk/io/TextIO.html
To read a PCollection from one or more text files, use TextIO.Read. You can instantiate a transform using TextIO.Read.from(String) to specify the path of the file(s) to read from (e.g., a local filename or filename pattern if running locally, or a Google Cloud Storage filename or filename pattern of the form "gs://<bucket>/<filepath>").
You can use TextIO, AvroIO or any other connector that reads from/writes to files to interact with GCS. Beam treats any file path that starts with "gs://" as a GCS path. Beam does this using the pluggable FileSystem [1] interface.
[1] https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/storage/GcsFileSystem.java
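As a minimal sketch with a recent Beam Java SDK (called here from Scala; the bucket paths are placeholders), reading from and writing to GCS is just a matter of using gs:// paths with TextIO:
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.options.PipelineOptionsFactory

object GcsCopy {
  def main(args: Array[String]): Unit = {
    val options = PipelineOptionsFactory.fromArgs(args: _*).create()
    val p = Pipeline.create(options)

    // Paths beginning with "gs://" are routed to GcsFileSystem automatically.
    p.apply("ReadFromGcs", TextIO.read().from("gs://my-bucket/input/*.txt"))
      .apply("WriteToGcs", TextIO.write().to("gs://my-bucket/output/result"))

    p.run().waitUntilFinish()
  }
}
Pass --runner=DataflowRunner (plus --project and --tempLocation) to execute this on Cloud Dataflow instead of locally.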

Triggering a Dataflow job when new files are added to Cloud Storage

I'd like to trigger a Dataflow job when new files are added to a Storage bucket in order to process and add new data into a BigQuery table. I see that Cloud Functions can be triggered by changes in the bucket, but I haven't found a way to start a Dataflow job using the gcloud node.js library.
Is there a way to do this using Cloud Functions or is there an alternative way of achieving the desired result (inserting new data to BigQuery when files are added to a Storage bucket)?
This is supported in Apache Beam starting with 2.2. See Watching for new files matching a filepattern in Apache Beam.
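A minimal sketch of that feature follows; the bucket pattern, poll interval and downstream steps are assumptions, and a real pipeline would parse the lines and write them to BigQuery with BigQueryIO.
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.io.TextIO
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Watch
import org.joda.time.Duration

object WatchBucket {
  def main(args: Array[String]): Unit = {
    val p = Pipeline.create(PipelineOptionsFactory.fromArgs(args: _*).create())

    // Continuously match the file pattern; each new object that lands in the
    // bucket is read as it appears, which makes this an unbounded (streaming) source.
    val lines = p.apply(
      "WatchGcsPrefix",
      TextIO.read()
        .from("gs://my-bucket/incoming/*.csv")            // placeholder pattern
        .watchForNewFiles(Duration.standardSeconds(30),   // poll interval
                          Watch.Growth.never[String]()))  // never stop watching

    // From here, parse `lines` and write to BigQuery with BigQueryIO.writeTableRows()
    // (omitted to keep the sketch short).

    p.run()
  }
}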
Maybe this post will help with triggering Dataflow pipelines from App Engine or Cloud Functions:
https://cloud.google.com/blog/big-data/2016/04/scheduling-dataflow-pipelines-using-app-engine-cron-service-or-cloud-functions