Databricks python/pyspark code to find the age of the blob in azure container - pyspark

Looking for Databricks Python/PySpark code to copy Azure blobs older than 30 days from one container to another.

The copy code is simple as follows.
dbutils.fs.cp("/mnt/xxx/file_A", "/mnt/yyy/file_A", True)
The difficult part is checking blob modification time. According to the doc, the modification time will only get returned by using dbutils.fs.ls command on Databricks Runtime 10.2 or above. You may check the Runtime version using the command below.
spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
The returned value is the Databricks Runtime version followed by the Scala version.
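As a rough illustration of the version check, the sketch below parses the returned string and compares it against 10.2. The helper name runtime_at_least is hypothetical, and the example strings assume the value looks like "10.4.x-scala2.12"; adjust the pattern if your cluster reports something different.

```python
import re

def runtime_at_least(version_string, minimum=(10, 2)):
    """Return True if the leading major.minor in version_string is >= minimum.

    Assumes a value shaped like "10.4.x-scala2.12" (an assumption, not
    something guaranteed by this thread).
    """
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"Unrecognized runtime version: {version_string!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= minimum
```

In a notebook this could be called as runtime_at_least(spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")).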
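If you are on a supported runtime version, you can do something like the following. Note that modificationTime is reported in milliseconds since the epoch, while time.time() returns seconds, so the current time is converted to milliseconds before comparing:
import time
ts_now_ms = time.time() * 1000  # modificationTime is in milliseconds
max_age_ms = 30 * 86400 * 1000  # 30 days in milliseconds
for file in dbutils.fs.ls('/mnt/xxx'):
    if ts_now_ms - file.modificationTime > max_age_ms:
        dbutils.fs.cp(f'/mnt/xxx/{file.name}', f'/mnt/yyy/{file.name}', True)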
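If you want to test the age filter outside a cluster, one option is to separate the selection from the copy. The sketch below is only an illustration: FileEntry is a hypothetical stand-in for the entries returned by dbutils.fs.ls, and older_than is a made-up helper name. It relies on modificationTime being epoch milliseconds.

```python
import time
from collections import namedtuple

# Hypothetical stand-in for the entries returned by dbutils.fs.ls
FileEntry = namedtuple("FileEntry", ["path", "name", "size", "modificationTime"])

def older_than(entries, days, now_s=None):
    """Return the entries whose modificationTime (ms) is more than `days` old."""
    now_ms = (time.time() if now_s is None else now_s) * 1000
    cutoff_ms = now_ms - days * 86400 * 1000
    return [entry for entry in entries if entry.modificationTime < cutoff_ms]
```

On Databricks the result can then be passed to the copy step, for example: for f in older_than(dbutils.fs.ls('/mnt/xxx'), 30): dbutils.fs.cp(f.path, f'/mnt/yyy/{f.name}', True).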

Related

AWS Glue job throwing Null pointer exception when writing df

I am trying to write a job that reads data from S3 and writes it to a BQ database (using the connector). The same script works correctly for other tables, but for one of the tables the write fails.
It works on the first run, but after the first load the incremental runs throw this null pointer exception. I have bookmarks enabled to fetch new data added in S3 and write it to the BQ database.
I am already handling the new data check: if there are files to process the job proceeds, otherwise it aborts.
In the job logs the df and its count are printed and everything seems to be working, but the job fails as soon as it runs the write df command.
I am not sure what the cause is. I also tried making the nullability of source and target match by setting the nullable property of the source to True, same as the target, but it still fails.
I am unable to understand the null pointer exception being thrown.
Error: Caused by: java.lang.NullPointerException at com.google.cloud.bigquery.connector.common.BigQueryClient.loadDataIntoTable(BigQueryClient.java:532) at com.google.cloud.spark.bigquery.BigQueryWriteHelper.loadDataToBigQuery(BigQueryWriteHelper.scala:87) at com.google.cloud.spark.bigquery.BigQueryWriteHelper.writeDataFrameToBigQuery(BigQueryWriteHelper.scala:66) ... 42 more
The BQ connector provided by AWS had a bug. I contacted the AWS team, and they suggested using the previous version of the connector, which resolved the issue.

How to run a Databricks notebook with MLflow in an Azure Data Factory pipeline?

My colleagues and I are facing an issue when trying to run my databricks notebook in Azure Data Factory and the error is coming from MLFlow.
The command that is failing is the following:
# Take the parent notebook path to use as path for the experiment
context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
nb_base_path = context['extraContext']['notebook_path'][:-len("00_training_and_validation")]
experiment_path = nb_base_path + 'trainings'
mlflow.set_experiment(experiment_path)
experiment = mlflow.get_experiment_by_name(experiment_path)
experiment_id = experiment.experiment_id
run = mlflow.start_run(experiment_id=experiment_id, run_name=f"run_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}")
And the error that is thrown is:
An exception was thrown from a UDF: 'mlflow.exceptions.RestException: INVALID_PARAMETER_VALUE: No experiment ID was specified. An experiment ID must be specified in Databricks Jobs and when logging to the MLflow server from outside the Databricks workspace. If using the Python fluent API, you can set an active experiment under which to create runs by calling mlflow.set_experiment("/path/to/experiment/in/workspace") at the start of your program.', from , line 32.
The pipeline just runs the notebook from ADF, it does not have any other step and the cluster we are using is type 7.3 ML.
Could you please help us?
Thank you in advance!
I think you need to set the artifact URI and specify the experiment ID explicitly (for example, if the artifact directory contains more than one experiment ID).
Reference: https://www.mlflow.org/docs/latest/tracking.html#how-runs-and-artifacts-are-recorded
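A minimal sketch of resolving the experiment ID explicitly before starting the run is shown below. The helper name resolve_experiment_id and its tracking parameter are hypothetical; the idea is to pass the mlflow module itself (or an MlflowClient), both of which expose get_experiment_by_name and create_experiment. Whether this removes the error in an ADF-triggered job is untested here.

```python
def resolve_experiment_id(tracking, experiment_path, artifact_location=None):
    """Return the ID of the experiment at experiment_path, creating it if missing.

    `tracking` is expected to be the mlflow module or an MlflowClient instance.
    """
    experiment = tracking.get_experiment_by_name(experiment_path)
    if experiment is not None:
        return experiment.experiment_id
    # create_experiment returns the new experiment's ID
    return tracking.create_experiment(
        experiment_path, artifact_location=artifact_location
    )
```

In the notebook this could be used as experiment_id = resolve_experiment_id(mlflow, experiment_path), followed by mlflow.start_run(experiment_id=experiment_id). This avoids dereferencing a None result when the experiment does not exist yet.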

SparkFiles - path not found

Please, can you help me with this question below? The image with the error is available in the question.
I use Azure databricks for data engineering. Running the same code in databricks community runs without error, but in Azure returns the error that path was not found. Has anyone been through this situation?
I'm using sparkfiles.
cnae = 'https://servicodados.ibge.gov.br/api/v2/cnae/subclasses'
from pyspark import SparkFiles
spark.sparkContext.addFile(cnae)
cnaeDF = spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("file://"+SparkFiles.get("subclasses"))
[Screenshot: rendered error message]
It seems to be a bug in runtime 10: spark.sparkContext.addFile(cnae) adds the file to local storage:
/local_disk0/spark-f1411c54-0a2e-4138-a0ed-c2e6bbfe5ca4/userFiles-7616de8f-3e03-493c-89e6-50fa1f7324ca/subclasses
but SparkFiles.get("subclasses") wants to read it from DBFS storage (I tried adding it in every way I could think of).
However, when this magic command is run:
%sh
cp -r /local_disk0/spark-f1411c54-0a2e-4138-a0ed-c2e6bbfe5ca4 /dbfs/local_disk0/
then it is possible to read it without problem
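The same workaround can be sketched in Python rather than with %sh, copying only the one file instead of the whole Spark directory. This is an untested sketch: stage_for_dbfs is a hypothetical helper, and the /dbfs/tmp/cnae target directory is an assumption that relies on the /dbfs FUSE mount being available on the driver.

```python
import os
import shutil

def stage_for_dbfs(local_path, target_dir):
    """Copy a driver-local file into target_dir and return the new path."""
    os.makedirs(target_dir, exist_ok=True)
    target = os.path.join(target_dir, os.path.basename(local_path))
    shutil.copyfile(local_path, target)
    return target
```

On Databricks this might be called as stage_for_dbfs(SparkFiles.get("subclasses"), "/dbfs/tmp/cnae"), after which the file would be read with spark.read.option("multiLine", True).json("dbfs:/tmp/cnae/subclasses").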

Unexpected close tag in aws cdk deploy

I am trying to create a CDK construct for my Python scripts. In the stack I have added S3 and Lambda.
When I try to execute cdk deploy, it either exits after 0% progress or gives the error in the title.
When I tried with S3 only it worked fine, but when I added the Lambda it gave me the error.
file_feed_lambda = _lambda.Function(
    self, id='MyLambdaHandler001',
    runtime=_lambda.Runtime.PYTHON_3_7,
    code=_lambda.Code.asset('lambda'),
    handler='lambda_function.lambda_handler',
)
bucket = s3.Bucket(self,
                   "FeedBucket-01")
Note : cdk diff and cdk synth are working properly
Apparently the error being shown was misleading. I updated Node and the CDK to their latest versions.
After the update I received a meaningful error, which was a socket timeout.
After setting the proxy, it worked for me.

Spark Scala S3 storage: permission denied

I've read a lot of topics on the Internet about how to get Spark working with S3, but still nothing works properly.
I've downloaded Spark 2.3.2 with Hadoop 2.7 and above.
I've copied only some libraries from Hadoop 2.7.7 (which matches the Spark/Hadoop version) to the Spark jars folder:
hadoop-aws-2.7.7.jar
hadoop-auth-2.7.7.jar
aws-java-sdk-1.7.4.jar
Still, I can't use either S3N or S3A to get my file read by Spark.
For S3A I get this exception:
sc.hadoopConfiguration.set("fs.s3a.access.key","myaccesskey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecretkey")
val file = sc.textFile("s3a://my.domain:8080/test_bucket/test_file.txt")
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 403, AWS Service: Amazon S3, AWS Request ID: AE203E7293ZZA3ED, AWS Error Code: null, AWS Error Message: Forbidden
Using this piece of Python, and some more code, I can list my buckets, list my files, download files, read files from my computer and get file url.
This code gives me the following file url:
https://my.domain:8080/test_bucket/test_file.txt?Signature=%2Fg3jv96Hdmq2450VTrl4M%2Be%2FI%3D&Expires=1539595614&AWSAccessKeyId=myaccesskey
What should I install, set up, or download to get Spark able to read and write from my S3 server?
Edit 3:
Using the debug tool suggested in a comment, here's the result.
It seems the issue is related to the request signature, though I'm not sure what that means.
First you will need to download the hadoop-aws and aws-java-sdk JARs that match your Spark/Hadoop release and add them to the jars folder inside the Spark folder.
Then you will need to specify the endpoint of the server you will use, and enable path-style access if your S3 server does not support DNS-based (virtual-hosted-style) bucket addressing:
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
// I had to change the signature version because I have an old S3 API implementation:
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
Here's my final code:
sc.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc.hadoopConfiguration.set("fs.s3a.access.key","mykey")
sc.hadoopConfiguration.set("fs.s3a.secret.key","mysecret")
sc.hadoopConfiguration.set("fs.s3a.endpoint","my.domain:8080")
sc.hadoopConfiguration.set("fs.s3a.connection.ssl.enabled","true")
sc.hadoopConfiguration.set("fs.s3a.path.style.access","true")
sc.hadoopConfiguration.set("fs.s3a.signing-algorithm","S3SignerType")
// Create the RDD only after all S3A settings are in place
val tmp = sc.textFile("s3a://test_bucket/test_file.txt")
tmp.count()
I would recommend putting most of the settings inside spark-defaults.conf:
spark.hadoop.fs.s3a.impl org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.path.style.access true
spark.hadoop.fs.s3a.endpoint my.domain:8080
spark.hadoop.fs.s3a.connection.ssl.enabled true
spark.hadoop.fs.s3a.signing-algorithm S3SignerType
One of the issues I had was setting spark.hadoop.fs.s3a.connection.timeout to 10: prior to Hadoop 3 this value is in milliseconds, and it gives you a very long timeout; the error message would appear 1.5 minutes after the attempt to read a file.
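To make the relationship between the two styles explicit: each fs.s3a.* key set through sc.hadoopConfiguration corresponds to the same key with a spark.hadoop. prefix in spark-defaults.conf. The small Python sketch below renders such lines from a dictionary; to_spark_defaults is a hypothetical helper and the endpoint is the placeholder used in this thread.

```python
def to_spark_defaults(hadoop_settings):
    """Render Hadoop settings as spark-defaults.conf lines.

    Each Hadoop key is prefixed with "spark.hadoop." so that Spark copies
    it into the Hadoop configuration at startup.
    """
    lines = []
    for key, value in hadoop_settings.items():
        lines.append(f"spark.hadoop.{key} {value}")
    return "\n".join(lines)

# Placeholder values taken from the thread, not real endpoints
settings = {
    "fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "fs.s3a.path.style.access": "true",
    "fs.s3a.endpoint": "my.domain:8080",
}
print(to_spark_defaults(settings))
```

Credentials are left out of the dictionary on purpose, since the answer only recommends moving most of the settings into the shared file.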
PS:
Special thanks to Steve Loughran.
Thank you a lot for the precious help.