How to run a Databricks notebook with MLflow in an Azure Data Factory pipeline? - azure-data-factory

My colleagues and I are facing an issue when trying to run our Databricks notebook in Azure Data Factory, and the error is coming from MLflow.
The command that is failing is the following:
import json
from datetime import datetime

import mlflow

# Take the parent notebook path to use as the base path for the experiment
context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
nb_base_path = context['extraContext']['notebook_path'][:-len("00_training_and_validation")]
experiment_path = nb_base_path + 'trainings'
mlflow.set_experiment(experiment_path)
experiment = mlflow.get_experiment_by_name(experiment_path)
experiment_id = experiment.experiment_id
run = mlflow.start_run(experiment_id=experiment_id, run_name=f"run_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}")
And the error it throws is:
An exception was thrown from a UDF: 'mlflow.exceptions.RestException: INVALID_PARAMETER_VALUE: No experiment ID was specified. An experiment ID must be specified in Databricks Jobs and when logging to the MLflow server from outside the Databricks workspace. If using the Python fluent API, you can set an active experiment under which to create runs by calling mlflow.set_experiment("/path/to/experiment/in/workspace") at the start of your program.', from , line 32.
The pipeline just runs the notebook from ADF; it has no other steps, and the cluster we are using is a 7.3 ML runtime.
Could you please help us?
Thank you in advance!

I think you need to set the artifact URI and specify the experiment ID explicitly (for example, if the artifact directory contains more than one experiment ID).
Reference: https://www.mlflow.org/docs/latest/tracking.html#how-runs-and-artifacts-are-recorded
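For example, when the notebook runs as a Databricks Job (which is how ADF triggers it), you can resolve or create the experiment yourself and hand its ID to start_run explicitly. A rough sketch, where the experiment path is a placeholder you would adapt to your workspace:

import mlflow

# Hypothetical workspace path; replace with your own experiment location.
experiment_path = "/Shared/trainings"

# Create the experiment if it does not exist yet, otherwise reuse its ID,
# so the job never relies on an implicit active experiment.
experiment = mlflow.get_experiment_by_name(experiment_path)
if experiment is None:
    experiment_id = mlflow.create_experiment(experiment_path)
else:
    experiment_id = experiment.experiment_id

# Pass the ID explicitly when starting the run.
with mlflow.start_run(experiment_id=experiment_id, run_name="adf_triggered_run"):
    mlflow.log_param("triggered_by", "azure_data_factory")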

Related

ADF Copy data from Azure Databricks Delta Lake to Azure SQL Server

I'm trying to use the Copy data activity to extract information from an Azure Databricks Delta Lake, but I've noticed that it doesn't pass the information directly from the Delta Lake to the SQL Server I need; it must first pass it to an Azure Blob storage. When running it, it throws the following error:
ErrorCode=AzureDatabricksCommandError,Hit an error when running the command in Azure Databricks. Error details: Failure to initialize configurationInvalid configuration value detected for fs.azure.account.key Caused by: Invalid configuration value detected for fs.azure.account.key
Looking for information, I found a possible solution, but it didn't work:
Invalid configuration value detected for fs.azure.account.key copy activity fails
Does anyone have any idea how the hell to pass information from an azure databricks delta lake table to a table in Sql Server??
These are some images of the structure that I have in ADF. In the image I get a message telling me that I must have a Storage Account to continue.
(Configuration and failed-execution screenshots were attached here.)
Thank you very much
The solution for this problem was the following:
Correct the way the Storage Access Key configuration was being defined:
In the instruction:
spark.hadoop.fs.azure.account.key.<storageaccountname>.blob.core.windows.net
the following change must be made:
spark.hadoop.fs.azure.account.key.<storageaccountname>.dfs.core.windows.net
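If you also need the same access from your own notebooks (as opposed to the ADF copy activity, which reads the cluster Spark config), the equivalent setting can be applied at session level. A small sketch, where <storageaccountname> and the secret scope/key names are placeholders:

# Session-level equivalent of the corrected cluster Spark config above.
# The account key is read from a (hypothetical) Databricks secret scope
# rather than being hard-coded in the notebook.
spark.conf.set(
    "fs.azure.account.key.<storageaccountname>.dfs.core.windows.net",
    dbutils.secrets.get(scope="my-scope", key="storage-account-key"),
)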
Does anyone have any idea how the hell to pass information from an azure databricks delta lake table to a table in Sql Server??
To achieve the above scenario, follow the steps below:
First, go to your Databricks cluster, edit it, and under Advanced options >> Spark >> Spark config, add the code below if you are using Blob storage.
spark.hadoop.fs.azure.account.key.<storageaccountname>.blob.core.windows.net <Accesskey>
spark.databricks.delta.optimizeWrite.enabled true
spark.databricks.delta.autoCompact.enabled true
After that, since you are using SQL Database as a sink:
Enable staging, use the same Blob storage account's linked service as the staging account linked service, and give a storage path from your Blob storage.
Then debug it. Make sure you complete the prerequisites from the official document.
(Screenshots of my sample input and the resulting output in SQL were attached here.)
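If ADF staging keeps getting in the way, another route (not part of the answer above) is to write the Delta table to SQL Server directly from a Databricks notebook using Spark's JDBC writer. A sketch, where the server, database, table, path, and secret names are all placeholders, and which assumes the SQL Server JDBC driver bundled with the Databricks runtime:

# Read the Delta table and append it straight into an Azure SQL / SQL Server table.
jdbc_url = (
    "jdbc:sqlserver://<server-name>.database.windows.net:1433;"
    "database=<database-name>"
)

(spark.read.format("delta")
    .load("/mnt/xxx/my_delta_table")                # hypothetical Delta table path
    .write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.my_target_table")       # hypothetical target table
    .option("user", dbutils.secrets.get("my-scope", "sql-user"))
    .option("password", dbutils.secrets.get("my-scope", "sql-password"))
    .mode("append")
    .save())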

I can't trigger SageMaker TrainJob from ECS; getting this exception: botocore.exceptions.NoCredentialsError: Unable to locate credentials

I'm trying to trigger a SageMaker training job from my Flask app, which will be hosted on ECS Fargate.
When I run the same code as a separate Python script it runs fine, but when I try to do the same in the Flask script, it gives the above-mentioned exception. Below is a snippet of the code:
from sagemaker.estimator import Estimator

role = 'AmazonSageMaker-ExecutioRole-20221019T223514'
estimator = Estimator(role=role,
                      instance_count=1,
                      instance_type='ml.g4dn.xlarge',
                      image_uri=img,
                      hyperparameters=parms)
estimator.fit()
The exception occurs at estimator.fit()
Any help would be appreciated!
I have added the respective policies of ECS and Sagemaker to the role, but still no luck!
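One thing worth checking first (a diagnostic sketch, not from the original post): from inside the running container, ask STS who you are. On Fargate the task role is served through the container credentials endpoint, so this call should succeed without any keys configured if the task role is wired up correctly; if it also raises NoCredentialsError, the problem is the task definition rather than SageMaker.

import boto3

# Print the identity boto3 resolves inside the container; on ECS Fargate this
# should be the task role if it is attached to the task definition correctly.
sts = boto3.client("sts")
print(sts.get_caller_identity())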

Azure Data Factory CICD error: The document creation or update failed because of invalid reference

All, when running a build pipeline using Azure DevOps with an ARM template, the process consistently fails when trying to deploy a dataset or a reference to a dataset, with this error:
ARM Template deployment: Resource Group scope (AzureResourceManagerTemplateDeployment)
BadRequest: The document creation or update failed because of invalid reference 'dataset_1'.
I've tried renaming the dataset and also recreating it to see if that would help.
I then deleted the dataset_1.json file from the repo and still get the same message, so I think it's some reference to this dataset and not the dataset itself. I've looked through all the other files for references to it, but they all look fine.
Any ideas on how to troubleshoot this?
thanks
try this
It looks like you have created the 'myTestLinkedService' linked service and tested the connection, but haven't published it yet, and you are trying to reference that linked service in the new dataset you are creating using PowerShell.
In order to reference any Data Factory entity from PowerShell, please make sure those entities are published first. Try publishing the linked service from the portal first, and then run your PowerShell script to create the new dataset/activity.
I think I found the issue. When I went into the detailed logs, I found that in addition to this error there was an error message about an invalid SQL connection string, so I thought it might be related, since the dataset in question uses an Azure SQL Database linked service.
I adjusted the connection string and this seems to have solved the issue.

Databricks python/pyspark code to find the age of the blob in azure container

Looking for Databricks Python/PySpark code to copy Azure blobs older than 30 days from one container to another container.
The copy code is simple, as follows:
dbutils.fs.cp("/mnt/xxx/file_A", "/mnt/yyy/file_A", True)
The difficult part is checking the blob modification time. According to the docs, the modification time is only returned by the dbutils.fs.ls command on Databricks Runtime 10.2 or above. You may check the Runtime version using the command below.
spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
The returned value will be the Databricks Runtime version followed by the Scala version.
If you get lucky with the version, you can do something like:
import time

# dbutils.fs.ls returns modificationTime in milliseconds since the epoch,
# so compare everything in milliseconds
ts_now_ms = time.time() * 1000
for file in dbutils.fs.ls('/mnt/xxx'):
    # copy files last modified more than 30 days ago
    if ts_now_ms - file.modificationTime > 30 * 86400 * 1000:
        dbutils.fs.cp(f'/mnt/xxx/{file.name}', f'/mnt/yyy/{file.name}', True)

Integration with Dataproc + Datalab + Source Code repos

Has anyone been able to integrate Dataproc, Datalab, and Source code repos? As many of us have seen, when you call an init action to install Datalab, it does not create the source code repo. I am trying to achieve a full end-to-end solution where a user logs into a Datalab notebook, interacts with Dataproc through PySpark, and checks the notebooks in to the source code repo. I have not been able to do this with the init action, as I pointed out earlier. I also tried installing Dataproc and then Datalab as a separate install (this time it creates the source repo); however, I can't run any Spark code on this Datalab notebook. Can someone please give me some pointers on how to achieve this? Any and all help is appreciated.
Code in Datalab
from pyspark.sql import HiveContext
hc=HiveContext(sc)
hc.sql("""show databases""").show()
hc.sql("""CREATE EXTERNAL TABLE IF NOT EXISTS INVOICES
(SubmissionDate DATE, TransactionAmount DOUBLE, TransactionType STRING)
STORED AS PARQUET
LOCATION 'gs://my-exercise-project-2019016-ds-team/datasets/invoices'""")
hc.sql("""select * from invoices limit 10""").show()
Error
Py4JJavaError: An error occurred while calling o55.sql.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2395)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3208)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3240)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:121)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3291)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3259)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:470)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:356)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$or
Unfortunately, it takes some pre-work to be able to create the datalab-notebooks repository in Cloud Source Repositories from an init action.
The reason is that creating the repository requires the service account for the VM to have the "source.repos.create" IAM permission on the project, which is not true by default.
You can either grant that permission to the service account, and then create the repository via gcloud source repos create datalab-notebooks, or manually create the repository before creating the cluster.
Then, to clone the repository inside of your startup script, add the following lines:
mkdir -p ${HOME}/datalab
gcloud source repos clone datalab-notebooks ${HOME}/datalab/notebooks
If you are modifying the canned init action for Datalab, then I would suggest adding these lines here