Azure Databricks Spark streaming with Auto Loader - PySpark

My source is Azure Data Factory, which copies files to containerA --> FolderA, FolderB, FolderC. I am using the syntax below with Auto Loader; I need the stream to read files as they arrive in any of these folders.
I have mounted the container from the storage account:
source = "abfss://containerA#storageaccount.dfs.core.windows.net/",
mount_point = "/mnt/containerA/",
extra_configs = configs)
Streaming code:
from pyspark.sql.functions import input_file_name

df1 = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "json") \
    .option("cloudFiles.useNotifications", "true") \
    .option("cloudFiles.subscriptionId", "xxxx-xxxx-xxxx-xxxx-xxx") \
    .option("cloudFiles.tenantId", "xxxx-1cxxxx98-xxxx-xxxx-xxxx") \
    .option("cloudFiles.clientId", "xxxx-xx-46d8-xx-xxx") \
    .option("cloudFiles.clientSecret", "xxxxxxxxxx") \
    .option("cloudFiles.resourceGroup", "xxxx-xxx") \
    .schema(Userdefineschema) \
    .load("/mnt/containerA/") \
    .withColumn("rawFilePath", input_file_name())
The syntax above always creates a new queue. Is there any way to give the queue a name myself?
The issue: when I start my stream and ADF copies data to FolderA, the streaming query runs fine. But when ADF starts copying data to FolderB, the streaming query does not pick up the records in FolderB within the same streaming session. If I stop the streaming cell and start it again, it picks up the data from both FolderA and FolderB. My objective is to use Auto Loader so the stream processes files automatically whenever they arrive in any of the folders.
Kindly advise; I am new to Spark streaming.
Thanks, Anuj Gupta

Please try using the following option to perform a nested folder file lookup:
.option("recursiveFileLookup", "true")

Unable to load multiple json files with pyspark

I am fairly new to PySpark and am trying to load data from a folder which contains multiple JSON files. However, the load fails. Here is the code that I am using:
spark = SparkSession.builder.master("local[1]") \
.appName('SparkByExamples.com') \
.getOrCreate()
spark.read.json('file_directory/*')
I am getting this error:
Exception in thread "globPath-ForkJoinPool-1-worker-57" java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
I tried setting the path variables for Hadoop and Spark as well, but it did not help.
However, if I load a single file from the directory, it loads perfectly.
Can someone please tell me what is going wrong in this case?
I can successfully read all the CSV files under a directory without adding the asterisk.
I think you should try:
spark.read.json('file_directory/')
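For what it is worth, that UnsatisfiedLinkError on NativeIO$Windows usually points to missing Hadoop native binaries (winutils.exe / hadoop.dll) on Windows rather than to the glob itself. Assuming HADOOP_HOME is set up correctly, a minimal sketch like the one below (file_directory/ is the placeholder path from the question) should load every JSON file in the folder:
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]") \
    .appName("ReadJsonDir") \
    .getOrCreate()

# Reads every JSON file directly under file_directory/ into one DataFrame.
# Add .option("multiline", "true") if each file holds a single pretty-printed
# JSON document instead of one JSON record per line.
df = spark.read.json("file_directory/")
df.printSchema()
print(df.count())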

How to use Aspera Hot Folder feature to download large files from COS

I have been using the Aspera Hot Folder feature to download large files (100 GB+) from a vendor server (A). I am trying to do similar downloads from another vendor (B) who has large files on IBM COS. I have HMAC credentials for IBM COS and am able to connect using the AWS CLI (B), but I could not establish the same connection using the Aspera desktop application. Any ideas on how to set up watch folder/hot folder-like functionality to download large files from COS?
Thanks
It is actually not possible using the Aspera client provided with COS.
But if you are not afraid of using the command line, it is possible with:
https://github.com/IBM/aspera-cli
see:
https://www.rubydoc.info/gems/aspera-cli#hot-folder
You will have to use the "cos" plugin (instead of "server" as in the doc).
Example:
ascli cos node \
--bucket=my_bucket \
--endpoint="https://s3.eu-de.cloud-object-storage.appdomain.cloud" \
--apikey=DsqdsqdSQDSQddqsDQS --crn=crn:v1:bluemix:public:cloud-object-storage:global:a/656435423542ababa5454:ffffffff-5029-abcd-af65-ebc6d2b46b45:: \
upload source_hot \
--to-folder=/Upload/target_hot \
--lock-port=12345 \
--ts=#json:'{"EX_ascp_args":["--remove-after-transfer","--remove-empty-directories","--exclude-newer-than=-8","--src-base","source_hot"]}'

How to download multiple objects from IBM Cloud Object Storage?

I am trying to use IBM Cloud Object Storage to store images uploaded to my site by users. I have this functionality working just fine.
However, based on the documentation here (link) it appears as though only one object can be downloaded from a bucket at a time.
Is there any way a list of objects could all be downloaded from the bucket? Is there a different approach to requesting multiple objects from a COS bucket?
Via the REST API, no, you can only download a single object at a time. But most tools (like the AWS CLI or the Minio Client) allow downloading all objects that share a prefix (e.g. foo/bar and foo/bas). The IBM forks of the S3 libraries are also now integrated with Aspera and can transfer large directories all at once. What are you trying to do?
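If a library is preferable to a CLI, here is a minimal sketch of the prefix-based approach using boto3 against COS with HMAC credentials (the ibm-cos-sdk fork exposes the same interface). The endpoint, bucket, prefix and credentials below are placeholders.
import os
import boto3

cos = boto3.client(
    "s3",
    endpoint_url="https://s3.us-east.cloud-object-storage.appdomain.cloud",
    aws_access_key_id="<HMAC access key>",
    aws_secret_access_key="<HMAC secret key>",
)

bucket, prefix, dest = "yourBucketName", "images/", "downloads"
paginator = cos.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if key.endswith("/"):  # skip folder placeholder objects
            continue
        target = os.path.join(dest, key)
        os.makedirs(os.path.dirname(target), exist_ok=True)
        cos.download_file(bucket, key, target)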
According to S3 spec (https://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html), you can only download one object at a time.
There are various tools which may help to download multiple objects at a time from COS. I used the AWS CLI tool to download and upload objects from/to COS.
So install the aws-cli tool and configure it by supplying your access_key_id and secret_access_key.
Recursively copying S3 objects to a local directory: when passed the --recursive parameter, the following cp command recursively copies all objects under a specified prefix and bucket to a specified directory.
C:\Users\Shashank>aws s3 cp s3://yourBucketName . --recursive
for example:
C:\Users\Shashank>aws --endpoint-url http://s3.us-east.cloud-object-storage.appdomain.cloud s3 cp s3://yourBucketName D:\s3\ --recursive
In my case the endpoint is based on the us-east region, and I am copying objects into the D:\s3 directory.
Recursively copying local files to S3: when passed the --recursive parameter, the following cp command recursively copies all files under a specified directory to a specified bucket.
C:\Users\Shashank>aws s3 cp myDir s3://yourBucketName/ --recursive
for example:
C:\Users\Shashank>aws --endpoint-url http://s3.us-east.cloud-object-storage.appdomain.cloud s3 cp D:\s3 s3://yourBucketName/ --recursive
Here I am copying objects from the D:\s3 directory to COS.
For more reference, you can see the link here.
I hope it works for you.

Read file created in HDFS with Livy

I am using Livy to run the wordcount example by creating a jar file, which works perfectly fine and writes its output to HDFS. Now I want to get the result back to my HTML page. I am using Spark with Scala, sbt, HDFS and Livy.
The GET /batches REST API only shows the log and state.
How do I get the output results?
Or how can I read a file in HDFS using the Livy REST API? Please help me out with this.
Thanks in advance.
If you check the status of the batches using curl, you will get the status of the Livy batch job, which will show as Finished (if the Spark driver has launched successfully).
To read the output:
1. You can SSH (using paramiko) to the machine where HDFS is running, run hdfs dfs -ls / to check the output, and perform your desired tasks; see the sketch after this list.
2. Using the Livy REST API, you need to write a script which does step 1; that script can be called through a curl command to fetch the output from HDFS, but in this case Livy will launch a separate Spark driver and the output will come in the STDOUT of the driver logs.
curl -vvv -u : :/batches -X POST --data '{"file": "http://"}' -H "Content-Type: application/json"
The first option is the sure way of getting the output, though I am not 100% sure how the second approach will behave.
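As a rough illustration of the first option, a paramiko sketch might look like the following; the host, credentials and HDFS output path are placeholders and depend on your cluster.
import paramiko

ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh.connect("namenode.example.com", username="hdfs-user", password="***")

# Print the output files written to HDFS by the Livy batch job.
stdin, stdout, stderr = ssh.exec_command("hdfs dfs -cat /user/output/wordcount/part-*")
print(stdout.read().decode())
ssh.close()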
You can use WebHDFS in your REST call. Get WebHDFS enabled first by your admin.
Use the WebHDFS URL
Create an HttpURLConnection object
Set the request method to GET
Then use a BufferedReader on the connection's InputStream.
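The steps above describe Java's HttpURLConnection; the same WebHDFS OPEN call can be sketched in Python with requests, assuming placeholder values for the namenode host/port (9870 on Hadoop 3.x, 50070 on 2.x) and the output file path:
import requests

url = "http://namenode.example.com:9870/webhdfs/v1/user/output/wordcount/part-00000"
resp = requests.get(url, params={"op": "OPEN", "user.name": "livy"})
resp.raise_for_status()
print(resp.text)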

How do I query Spark JobServer and find where it stores my Jars?

I am trying to follow this documentation:
https://github.com/spark-jobserver/spark-jobserver#dependency-jars
Option 2 Listed in the docs says:
The dependent-jar-uris can also be used in job configuration param when submitting a job. On an ad-hoc context this has the same effect as dependent-jar-uris context configuration param. On a persistent context the jars will be loaded for the current job and then for every job that will be executed on the persistent context.
curl -d "" 'localhost:8090/contexts/test-context?num-cpu-cores=4&memory-per-node=512m'
OK
curl 'localhost:8090/jobs?appName=test&classPath=spark.jobserver.WordCountExample&context=test-context&sync=true' -d '{ dependent-jar-uris = ["file:///myjars/deps01.jar", "file:///myjars/deps02.jar"], input.string = "a b c a b see" }'
The jars /myjars/deps01.jar & /myjars/deps02.jar (present only on the SJS node) will be loaded and made available for the Spark driver & executors.
Is "file:///myjars/" directory the SJS node's JAR directory or some custom directory?
I have a client on a Windows box and a Spark JobServer on a Linux box. Next, I upload a JAR to SJS node. SJS node puts that Jar somewhere. Then, when I call to start a Job and set the 'dependent-jar-uris', the SJS node will find my previously uploaded JAR and run the job:
"dependent-jar-uris" set to "file:///tmp/spark-jobserver/filedao/data/simpleJobxxxxxx.jar"
This works fine, but I had to manually go searching around the SJS node to find this location (e.g. file:///tmp/spark-jobserver/filedao/data/simpleJobxxxxxx.jar) and then add it into my future requests to start the job.
Instead, how do I make a REST call from the client to get the path where Spark JobServer puts my jars when I upload them, so that I can set the file:/// path correctly in my 'dependent-jar-uris' property dynamically?
I don't think jars uploaded using "POST /jars" can be used in dependent-jar-uris. Since you are uploading the jars yourself, you already know the local path; just use that.
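In other words, keep the dependency jars at a path you control on the SJS node (for example by copying them there yourself) and pass that path in dependent-jar-uris. A minimal sketch of the docs' job-submission call from Python, with the SJS host as a placeholder and the jar paths taken from the docs' example, might look like this:
import requests

payload = '{ dependent-jar-uris = ["file:///myjars/deps01.jar", "file:///myjars/deps02.jar"], input.string = "a b c a b see" }'
resp = requests.post(
    "http://sjs-host:8090/jobs",
    params={
        "appName": "test",
        "classPath": "spark.jobserver.WordCountExample",
        "context": "test-context",
        "sync": "true",
    },
    data=payload,
)
print(resp.status_code, resp.text)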