I have a Python 3.5 notebook in Databricks. I have a requirement to execute Databricks notebook cells conditionally, but I didn't see any such functionality out of the box.
I tried creating a Python egg with the code below and installed it on the Databricks cluster.
def skip(line, cell=None):
    '''Skips execution of the current line/cell if line evaluates to True.'''
    if eval(line):
        return
    get_ipython().ex(cell)

def load_ipython_extension(shell):
    '''Registers the skip magic when the extension loads.'''
    shell.register_magic_function(skip, 'line_cell')

def unload_ipython_extension(shell):
    '''Unregisters the skip magic when the extension unloads.'''
    del shell.magics_manager.magics['cell']['skip']
But when I try to load it as an extension with
%load_ext skip_cell
it throws an error saying "The module is not an IPython module". Any help or suggestion is appreciated. Thanks.
Databricks notebooks are not based on Jupyter/IPython, which is why you are seeing that error.
If you are trying to build conditional workflows I would recommend combining the Notebook Workflows functionality with the Databricks REST API. This will allow you to control the flow of your program based on conditional statements and results of other processes.
Think of a notebook as a function that can be parameterized to accept and return exit values.
For an example, see the official documentation here.
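To make the notebook-as-function idea concrete, here is a minimal sketch of the pattern; the notebook names, the argument, and the exit values are hypothetical:

# Child notebook ("child-notebook"): return a result string to the caller.
some_condition = True  # placeholder for the real check
dbutils.notebook.exit("success" if some_condition else "failure")

# Parent notebook: run the child with arguments and branch on its exit value.
returned = dbutils.notebook.run("child-notebook", 60, {"argument": "data"})
if returned == "success":
    dbutils.notebook.run("next-notebook", 60)  # hypothetical follow-up notebook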
There is now support for running notebooks conditionally:
if <condition>:
    dbutils.notebook.run("notebook-name", 60, {"argument": "data", "argument2": "data2", ...})
More details here
https://docs.databricks.com/notebooks/notebook-workflows.html#example
Looking for Databricks Python/PySpark code to copy Azure blobs older than 30 days from one container to another
The copy code itself is simple:
dbutils.fs.cp("/mnt/xxx/file_A", "/mnt/yyy/file_A", True)
The difficult part is checking the blob modification time. According to the docs, the modification time is only returned by the dbutils.fs.ls command on Databricks Runtime 10.2 or above. You can check the runtime version with the command below.
spark.conf.get("spark.databricks.clusterUsageTags.sparkVersion")
The returned value contains the Databricks Runtime version followed by the Scala version, e.g. 10.4.x-scala2.12.
If your runtime version is recent enough, you can do something like the following. Note that modificationTime is reported in milliseconds since the epoch, while time.time() returns seconds:
import time

ts_now = time.time()
for file in dbutils.fs.ls('/mnt/xxx'):
    # modificationTime is in milliseconds since the epoch, so convert to seconds
    if ts_now - file.modificationTime / 1000 > 30 * 86400:
        dbutils.fs.cp(f'/mnt/xxx/{file.name}', f'/mnt/yyy/{file.name}', True)
My colleagues and I are facing an issue when trying to run a Databricks notebook from Azure Data Factory; the error comes from MLflow.
The command that is failing is the following:
import json
from datetime import datetime

import mlflow

# Take the parent notebook path to use as the base path for the experiment
context = json.loads(dbutils.notebook.entry_point.getDbutils().notebook().getContext().toJson())
nb_base_path = context['extraContext']['notebook_path'][:-len("00_training_and_validation")]
experiment_path = nb_base_path + 'trainings'

mlflow.set_experiment(experiment_path)
experiment = mlflow.get_experiment_by_name(experiment_path)
experiment_id = experiment.experiment_id
run = mlflow.start_run(experiment_id=experiment_id, run_name=f"run_{datetime.now().strftime('%Y-%m-%d_%H-%M-%S')}")
And the error that is throwing is:
An exception was thrown from a UDF: 'mlflow.exceptions.RestException: INVALID_PARAMETER_VALUE: No experiment ID was specified. An experiment ID must be specified in Databricks Jobs and when logging to the MLflow server from outside the Databricks workspace. If using the Python fluent API, you can set an active experiment under which to create runs by calling mlflow.set_experiment("/path/to/experiment/in/workspace") at the start of your program.', from , line 32.
The pipeline just runs the notebook from ADF; it does not have any other steps, and the cluster we are using runs Databricks Runtime 7.3 ML.
Could you please help us?
Thank you in advance!
I think you need to set the artifact URI and specify the experiment ID explicitly (especially if the artifact directory contains more than one experiment ID).
Reference: https://www.mlflow.org/docs/latest/tracking.html#how-runs-and-artifacts-are-recorded
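As a rough sketch of that idea (the workspace path and the logged parameter are placeholders, not taken from the question):

import mlflow

# Pin the tracking server and the experiment explicitly, so the run does not
# depend on notebook-context defaults. "/Shared/trainings" is a placeholder.
mlflow.set_tracking_uri("databricks")
mlflow.set_experiment("/Shared/trainings")

experiment = mlflow.get_experiment_by_name("/Shared/trainings")
with mlflow.start_run(experiment_id=experiment.experiment_id):
    mlflow.log_param("example_param", 1)  # placeholder logging call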
Please, can you help me with the question below? The image with the error is included below.
I use Azure Databricks for data engineering. Running the same code in Databricks Community Edition works without error, but in Azure it returns an error saying the path was not found. Has anyone been through this situation?
I'm using SparkFiles.
from pyspark import SparkFiles

cnae = 'https://servicodados.ibge.gov.br/api/v2/cnae/subclasses'
spark.sparkContext.addFile(cnae)
cnaeDF = spark.read.option("multiLine", True).option("mode", "PERMISSIVE").json("file://" + SparkFiles.get("subclasses"))
[Screenshot: rendered error message stating that the path was not found]
It seems like a bug on Runtime 10, as spark.sparkContext.addFile(cnae) adds the file to local storage:
/local_disk0/spark-f1411c54-0a2e-4138-a0ed-c2e6bbfe5ca4/userFiles-7616de8f-3e03-493c-89e6-50fa1f7324ca/subclasses
but SparkFiles.get("subclasses") wants to read it from DBFS storage (I tried adding it in all possible ways)...
But when this magic command is run:
%sh
cp -r /local_disk0/spark-f1411c54-0a2e-4138-a0ed-c2e6bbfe5ca4 /dbfs/local_disk0/
then it is possible to read the file without a problem.
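A minimal sketch of the same workaround done from Python instead of %sh, assuming the driver can still read the SparkFiles local path (the dbfs:/tmp destination is a placeholder):

from pyspark import SparkFiles

# Copy the downloaded file from the driver's local disk to DBFS and read it
# from there instead. "dbfs:/tmp/subclasses" is a placeholder destination.
local_path = SparkFiles.get("subclasses")
dbutils.fs.cp("file:" + local_path, "dbfs:/tmp/subclasses")
cnaeDF = spark.read.option("multiLine", True).json("dbfs:/tmp/subclasses")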
Lately I have been trying out Apache Spark. My question is specifically about triggering Spark jobs. Here I had posted a question on understanding Spark jobs. After getting my hands dirty with jobs, I moved on to my requirement.
I have a REST endpoint where I expose an API to trigger jobs; I used Spring 4.0 for the REST implementation. Going forward, I thought of implementing Jobs as a Service in Spring, where I would submit jobs programmatically: when the endpoint is triggered, I would trigger the job with the given parameters.
I now have a few design options.
Similar to the job written below, I would need to maintain several jobs, invoked by an abstract class (maybe a JobScheduler):
/* Can this code be abstracted from the application and written as
   a separate job? My understanding is that the application code
   itself has to embed the addJars, which the SparkContext
   internally takes care of. */
SparkConf sparkConf = new SparkConf().setAppName("MyApp")
        .setJars(new String[] { "/path/to/jar/submit/cluster" })
        .setMaster("/url/of/master/node");
sparkConf.setSparkHome("/path/to/spark/");
sparkConf.set("spark.scheduler.mode", "FAIR");

JavaSparkContext sc = new JavaSparkContext(sparkConf);
sc.setLocalProperty("spark.scheduler.pool", "test");

// Application with algorithm and transformations
Extending the above point: have multiple versions of jobs handled by the service.
Or else use a Spark Job Server to do this.
Firstly, I would like to know what the best solution is in this case, both execution-wise and scaling-wise.
Note: I am using a standalone Spark cluster.
Kindly help.
It turns out Spark has a hidden REST API to submit a job, check status, and kill it.
Check out full example here: http://arturmkrtchyan.com/apache-spark-hidden-rest-api
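As a rough sketch of what a submission looks like against that API (the host, port, jar path, class name, and Spark version below are assumptions to adapt; the field names follow the linked post):

import json
import requests  # third-party HTTP client

# Submit an application to the standalone master's REST endpoint (port 6066).
payload = {
    "action": "CreateSubmissionRequest",
    "appResource": "file:/path/to/my-app.jar",        # placeholder jar path
    "mainClass": "com.example.MyApp",                 # placeholder main class
    "appArgs": ["arg1"],
    "clientSparkVersion": "1.5.0",                    # match your cluster version
    "environmentVariables": {"SPARK_ENV_LOADED": "1"},
    "sparkProperties": {
        "spark.app.name": "MyApp",
        "spark.master": "spark://spark-master:7077",  # placeholder master URL
        "spark.jars": "file:/path/to/my-app.jar",
    },
}
resp = requests.post("http://spark-master:6066/v1/submissions/create",
                     data=json.dumps(payload),
                     headers={"Content-Type": "application/json"})
print(resp.json())  # includes a submissionId usable for status and kill calls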
Just use the Spark JobServer
https://github.com/spark-jobserver/spark-jobserver
There are a lot of things to consider when making a service, and the Spark JobServer has most of them covered already. If you find things that aren't good enough, it should be easy to make a request and add code to their system rather than reinventing it from scratch. See the sketch below for a feel of its REST workflow.
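This is a rough sketch against the JobServer REST API as described in its README; the port, app name, jar, and class path are assumptions:

import requests  # third-party HTTP client

base = "http://localhost:8090"  # default JobServer port; adjust as needed

# Upload the application jar under an app name.
with open("my-app.jar", "rb") as jar:  # placeholder jar file
    requests.post(f"{base}/jars/myapp", data=jar.read())

# Trigger a job from a class inside that jar; the response carries a job ID
# that can be polled for status.
resp = requests.post(f"{base}/jobs",
                     params={"appName": "myapp",
                             "classPath": "com.example.MyJob"})  # placeholder
print(resp.json())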
Livy is an open source REST interface for interacting with Apache Spark from anywhere. It supports executing snippets of code or programs in a Spark context that runs locally or in Apache Hadoop YARN.
Here is a good client that you might find helpful: https://github.com/ywilkof/spark-jobs-rest-client
Edit: this answer was given in 2015. There are options like Livy available now.
I had this requirement as well, and I could do it using the Livy server, as the contributor Josemy mentioned. Following are the steps I took; I hope it helps somebody:
Download livy zip from https://livy.apache.org/download/
Follow instructions: https://livy.apache.org/get-started/
Upload the zip to a client.
Unzip the file
Check for the following two environment variables; if they don't exist, create them with the right paths:
export SPARK_HOME=/opt/spark
export HADOOP_CONF_DIR=/opt/hadoop/etc/hadoop
Open port 8998 on the client
Update $LIVY_HOME/conf/livy.conf with the master details and anything else needed
Note: templates are available in $LIVY_HOME/conf
E.g. livy.file.local-dir-whitelist = /home/folder-where-the-jar-will-be-kept/
Run the server
$LIVY_HOME/bin/livy-server start
Stop the server
$LIVY_HOME/bin/livy-server stop
UI: <client-ip>:8998/ui/
Submitting a job: POST http://<your client ip goes here>:8998/batches
{
  "className": "<your class name goes here, with package name>",
  "file": "your jar location",
  "args": ["arg1", "arg2", "arg3"]
}
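A small sketch of the same submission from Python, assuming the defaults above; the host, jar location, and class name are placeholders:

import requests  # third-party HTTP client

livy = "http://localhost:8998"  # placeholder; use your client's address

# Submit a batch job, mirroring the POST body shown above.
resp = requests.post(f"{livy}/batches", json={
    "className": "com.example.MyJob",   # placeholder class name
    "file": "/home/jars/my-app.jar",    # placeholder jar location
    "args": ["arg1", "arg2", "arg3"],
})
batch_id = resp.json()["id"]

# Poll the batch state; Livy reports states such as running, success, dead.
state = requests.get(f"{livy}/batches/{batch_id}/state").json()["state"]
print(batch_id, state)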
I am trying to run a workflow on a Hortonworks cluster using Oozie.
Getting the following error:
Error: Invalid workflow-app, org.xml.sax.SAXParseException: cvc-complex-type.2.4.c: The matching wildcard is strict, but no declaration can be found for element 'hive'.
Does anyone know the reason?
At least a sample Hive workflow.xml that can be run on the Hortonworks distribution would be helpful.
This has to do with the first line of your workflow:
<workflow-app name="${workflowName}" xmlns="uri:oozie:workflow:0.4">
Specifically: uri:oozie:workflow:0.4
The xmlns value tells Oozie which XML schema to follow. I am assuming you used an online resource to build an action, which may be in a newer schema than the one you specified.
There are several versions:
- uri:oozie:workflow:0.1
- uri:oozie:workflow:0.2
- uri:oozie:workflow:0.2.5
- uri:oozie:workflow:0.3
- uri:oozie:workflow:0.4
See: Oozie Workflow Schemes
Usually, though, setting yours to the example above (0.4) will work for newer workflows.
Actions also have schemas, so it is important to look at what functions are available in each version.
The Hive action currently goes up to 0.5, I believe, although I use 0.4 with this line:
<hive xmlns="uri:oozie:hive-action:0.4">
If this does not help, please update the question with your workflow for further help.
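Since a sample was requested, here is a minimal Hive workflow.xml sketch using the 0.4 schemas discussed above; the workflow name and the Hive script name are placeholders:

<workflow-app name="sample-hive-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="hive-node"/>
    <action name="hive-node">
        <hive xmlns="uri:oozie:hive-action:0.4">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- script.q is a placeholder Hive script in the workflow directory -->
            <script>script.q</script>
        </hive>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Hive action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>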