Connecting QGIS to JUPYTER - qgis

I currently use the QGIS graphical modeler to automate a workflow.
After designing the model, I found that the generated Python script declares classes and functions but contains no code that actually executes the model. How can I run the model from a Jupyter notebook?
(If this is feasible, we could read variables through input() in the Jupyter environment and extract results from the QGIS model for each variable.)
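One possible approach, sketched under several assumptions: QGIS 3 with its Python bindings importable from the notebook kernel, and the model saved in the profile's models folder so that Processing registers it under an id such as "model:my_model". The model id, the parameter name "distance", the helper names, and both default paths are hypothetical placeholders, not values taken from the question.

```python
import sys


def parse_distance(raw):
    """Convert text typed at an input() prompt into a float parameter."""
    return float(raw.strip().replace(",", "."))


def start_qgis(prefix="/usr", plugins_dir="/usr/share/qgis/python/plugins"):
    """Start a headless QGIS application and initialise Processing.

    Both default paths are assumptions; they differ per platform and install.
    QGIS imports are deferred so parse_distance works without QGIS installed.
    """
    from qgis.core import QgsApplication

    QgsApplication.setPrefixPath(prefix, True)
    qgs = QgsApplication([], False)  # False: do not start the GUI
    qgs.initQgis()

    sys.path.append(plugins_dir)  # makes the bundled 'processing' plugin importable
    from processing.core.Processing import Processing

    Processing.initialize()
    return qgs  # keep a reference; call qgs.exitQgis() when finished


def run_model(distance):
    """Run the (hypothetical) model once and return its result dictionary."""
    import processing

    params = {"distance": distance, "OUTPUT": "TEMPORARY_OUTPUT"}
    return processing.run("model:my_model", params)
```

In a notebook this could be used as `qgs = start_qgis()` once, followed by `run_model(parse_distance(input("Distance: ")))` for each value to try.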

Related

Missing Python file for models using pyCommunicator

My experience with Python is limited, but I just started looking at the new Python models included in AnyLogic's examples. I am looking at the first one, Passing Data Types. The model runs correctly, with the set, modify and get functions working as expected. My question: is there a Python file somewhere that the communicator is working with? I only see the .alp file in the folder.
Thanks
I'm also just beginning to use this, but as I understand it, the idea of the Python helper here (among other things) is that you can run Python commands with AnyLogic, so you don't actually need a Python file. It does, however, use the Python installed on your computer to run the scripts; if you don't have Python installed, your model won't work.

Can I use Jupyter lab to interact with databricks spark cluster using Scala?

Can I use Jupyter lab to connect to a databricks spark cluster that is hosted remotely?
There are KB articles about databricks connect, which allows a scala or java client-process to control a spark cluster. Here is an example:
https://docs.databricks.com/dev-tools/databricks-connect.html
While that KB article covers a lot of scenarios, it doesn't explain how to use Jupyter notebooks to interact with a databricks cluster using the Scala programming language. I'm familiar with scala programming, but not Python.
Yes, it appears to be possible although it is not well documented. These steps worked for me on windows. I used databricks v.7.1 with scala 2.12.10.
Step 1. Install anaconda : https://repo.anaconda.com/
Step 2. Because python seems to be the language of choice for notebooks,
you will need to manually install and configure a scala kernel.
I was able to get things working with the almond kernel : https://almond.sh/
When you install almond, be careful to pick a version of scala
that corresponds to the DBR runtime you will be connected to in the remote cluster.
Step 3. Now follow the databricks-connect docs to get a scala program to
compile and connect to the remote cluster via the intellij / sbt environment.
The documentation can be found here. https://docs.databricks.com/dev-tools/databricks-connect.html
This is a fully supported and fairly conventional approach that can be used to develop custom modules.
Step 4. Once you have created a working scala process, you will be familiar with sbt. The build.sbt is used for referencing the "databricks-connect" distribution. The distribution will be in a location like so:
unmanagedBase := new java.io.File("C:\\Users\\minime\\AppData\\Local\\Programs\\Python\\Python37\\Lib\\site-packages\\pyspark\\jars")
While it is straightforward for intellij / sbt to compile those dependencies into your program, it will take a bit more work to do the equivalent thing in the almond/jupyter kernel.
Before you go back to your jupyter notebook, run your new scala process and allow it to create a spark session. Then, before the process dies, use "process explorer" to find the related java.exe, show handles in the lower view/pane, and copy all the handles into notepad (Ctrl+A in process explorer, Ctrl+V in notepad). This gives you the subset of modules from the databricks distribution that are actually being loaded into your process at runtime.
Step 5. Now that you have the relevant modules, you need to configure your almond scala kernel to load them into memory. Create a new jupyter notebook and select the scala kernel and use code like the following to load all your modules:
interp.load.cp(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath( "C:/Users/minime/AppData/Local/Programs/Python/Python37/Lib/site-packages/pyspark/jars/whatever001-1.1.1.jar")))
interp.load.cp(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath( "C:/Users/minime/AppData/Local/Programs/Python/Python37/Lib/site-packages/pyspark/jars/whatever002-1.1.1.jar")))
interp.load.cp(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath( "C:/Users/minime/AppData/Local/Programs/Python/Python37/Lib/site-packages/pyspark/jars/whatever003-1.1.1.jar")))
...
Please note that there are lots and lots of jars in the distribution (maybe 100!?).
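Because the list of jars is long, the load statements can be generated rather than typed by hand. Below is a minimal sketch in Python, assuming the text pasted into notepad contains the full Windows path of each loaded jar somewhere on its own line (the exact layout of the handle dump is an assumption, and both function names are hypothetical). It prints statements in the same form as the ones above, ready to paste into a notebook cell.

```python
import re

# A Windows path that ends in ".jar"; the dump layout around it is ignored.
JAR_PATH = re.compile(r"[A-Za-z]:\\[^\r\n\"]*?\.jar\b", re.IGNORECASE)


def jar_paths(handle_dump):
    """Return the unique .jar paths found in the pasted text, in first-seen order."""
    seen = []
    for match in JAR_PATH.finditer(handle_dump):
        path = match.group(0)
        if path not in seen:
            seen.append(path)
    return seen


def load_statements(paths):
    """Render one interp.load.cp(...) line per jar, using forward slashes."""
    template = (
        "interp.load.cp(ammonite.ops.Path("
        'java.nio.file.FileSystems.getDefault().getPath("{}")))'
    )
    return [template.format(p.replace("\\", "/")) for p in paths]


if __name__ == "__main__":
    # Hypothetical sample of pasted handle text.
    sample = "File\tC:\\Users\\minime\\jars\\whatever001-1.1.1.jar\n"
    print("\n".join(load_statements(jar_paths(sample))))
```

The generated order simply follows the dump, so the load-order caveat mentioned further down still applies.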
You may wish to load other libraries directly from maven (assuming they are compatible with scala 2.12.10 and your databricks-connect distribution)
// Microsoft JDBC
interp.load.ivy("com.microsoft.sqlserver" % "mssql-jdbc" % "8.2.1.jre8")
// Other libraries
interp.load.ivy("joda-time" % "joda-time" % "2.10.5")
interp.load.ivy("org.scalaj" %% "scalaj-http" % "2.3.0")
interp.load.ivy("org.json4s" %% "json4s-native" % "3.5.3")
interp.load.ivy("com.microsoft.azure" % "msal4j" % "1.6.1")
// Other libraries
interp.load.ivy("org.apache.hadoop" % "hadoop-azure" % "3.2.1")
Fair warning... when loading libraries into the almond kernel, it is sometimes important to load them in a specific order. My examples above aren't intended to tell you what order to load them via interp.load.
Step 6. If everything went as planned, you should now be able to create a spark session running in a jupyter notebook using code that is similar to the stuff you were writing in "Step 3" above.
import org.apache.spark.sql._
val p_SparkSession = SparkSession.builder()
  .appName("APP_" + java.util.UUID.randomUUID().toString)
  .master("local")
  .config("spark.cores.max","4")
  .getOrCreate()
Your almond kernel is now connected to the remote cluster, via the databricks-connect distribution. Everything works as long as you don't need to serialize any functions or data types out to the remote cluster. In that case you will probably get a variety of serialization errors and null pointer exceptions. Here is an example:
java.lang.NullPointerException
com.databricks.service.SparkServiceClassSync$.checkSynced(SparkServiceClassSync.scala:244)
org.apache.spark.sql.util.SparkServiceObjectOutputStream.writeReplaceClassDescriptor(SparkServiceObjectOutputStream.scala:82)
...
org.apache.spark.sql.util.ProtoSerializer.serializePlan(ProtoSerializer.scala:377)
com.databricks.service.SparkServiceRPCClientStub.$anonfun$executePlan$1(SparkServiceRPCClientStub.scala:193)
This answer will be the first of several. I'm hoping that there are other scala/spark/databricks experts who can help work out the remaining kinks in this configuration, so that any of the functions and data types that are declared in my notebooks can be used by the remote cluster as well!
In my first answer I pointed out that the primary challenge in using scala notebooks (in Jupyter lab with almond) is that we are missing the functionality to serialize any functions or data types, and send them out to the remote cluster that is being hosted by databricks.
I should point out that there are two workarounds that I use regularly when I encounter this limitation.
1. I revert to using the "spark-shell". It is a standard component of the databricks-connect distribution. I can then load the relevant parts of my scala code using :load and :paste commands. For some happy reason the "spark-shell" is fully capable of serializing functions and data types in order to dynamically send them to the remote cluster. This is something that the almond kernel is not able to do for us within the context of the Jupyter notebooks.
2. The other workaround is to .collect() the dataframes back to the driver (within the memory of the jupyter notebook kernel). Once they are collected, I can perform additional transformations on them, even with the help of "original" functions and "original" data types that are only found within my jupyter notebook. In this case I won't get the performance benefits of distributed processing. But while the code is still under development, I'm typically not working with very large datasets, so it doesn't make much of a difference whether the driver or the workers are running my functions.
Hope this is clear. I'm hoping that Databricks may eventually see the benefit of allowing scala programmers to develop code remotely, in jupyter lab. I think they need to be the ones to select one of the scala kernels and do the heavy lifting to support this scenario. As of now they probably believe the notebook experience in their own portal is sufficient for the needs of all scala programmers.
To add on to David's first answer, I did this additional step:
Step 5.5. Programmatically add the databricks jar dependencies to the scala kernel.
Using the directory you get from databricks-connect get-jar-dir, I used the following code:
import $ivy.`com.lihaoyi::os-lib:0.2.7`
// Add every jar in the databricks-connect jar directory to the kernel classpath.
def importJars(): Unit = {
  val myJars = os.list(os.Path("/Users/me/miniconda3/envs/dbx-p40/lib/python3.7/site-packages/pyspark/jars/"))
  for (j <- myJars) {
    interp.load.cp(ammonite.ops.Path(java.nio.file.FileSystems.getDefault().getPath(j.toString)))
  }
}
importJars()

How to run magic commands in Jupyter deployed on Watson Studio?

I'm trying to analyze my datasets on DB2 on Cloud in a Jupyter notebook created in Watson Studio. Using the "%sql" magic to connect to DB2 does not work out of the box; it reports that there is no such module. According to an IBM tutorial, it is required to run the "%run db2re.ipynb" command in a Jupyter cell before connecting to DB2. But when I run this cell nothing happens and the "%sql" magic is still not available. Any advice is appreciated.
In general, there are two ways of accessing libraries in Watson Studio:
- Install or import a library, then reference it. Note that you need to specify the --user option.
- First save your own scripts, then import them.
There are also the built-in line and cell magics.
With that, I think I got it to work the following way:
1st cell, download db2re.ipynb to your environment:
%%sh
wget https://raw.githubusercontent.com/DB2-Samples/Db2re/master/db2re.ipynb
2nd cell, install necessary library:
!pip install --user qgrid
3rd cell, run the db2re.ipynb notebook extension:
%run db2re.ipynb
Thereafter, I was able to run a %sql command.

Setting Specific Python in Zeppelin Interpreter

What do I need to do beyond setting "zeppelin.pyspark.python" to make a Zeppelin interpreter use a specific Python executable?
Background:
I'm using Apache Zeppelin connected to a Spark+Mesos cluster. The cluster's worked fine for several years. Zeppelin is new and works fine in general.
But I'm unable to import numpy within functions applied to an RDD in pyspark. When I use Python subprocess to locate the Python executable, it shows that the code is being run in the system's Python, not in the virtualenv it needs to be in.
So I've seen a few questions on this issue that say the fix is to set "zeppelin.pyspark.python" to point to the correct python. I've done that and restarted the interpreter a few times. But it is still using the system Python.
Is there something additional I need to do? This is using Zeppelin 0.7.
On an older, custom snapshot build of Zeppelin I've been using on an EMR cluster, I set the following two properties to use a specific virtualenv:
"zeppelin.pyspark.python": "/path/to/bin/python",
"spark.executorEnv.PYSPARK_PYTHON": "/path/to/bin/python"
With your virtualenv activated, start Python and find the path of its executable:
(my_venv)$ python
>>> import sys
>>> sys.executable
Then open http://localhost:8080/#/interpreters, search for 'python', and set `zeppelin.python` to the output of `sys.executable`.
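To confirm which interpreter is actually in use after changing the setting, a small check along these lines may help. It is only a sketch for Python 3; the function name is hypothetical, and the commented-out executor check assumes a live SparkContext named `sc`, as a %pyspark paragraph normally provides.

```python
import importlib.util
import sys


def interpreter_report(module_name="numpy"):
    """Describe the Python process that executes this function."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        # True only if this interpreter can locate the module.
        "module_found": importlib.util.find_spec(module_name) is not None,
    }


# Driver side: the interpreter running the notebook paragraph.
print(interpreter_report())

# Executor side (assumes a SparkContext named `sc`):
# sc.parallelize(range(4), 4) \
#   .map(lambda _: interpreter_report()["executable"]) \
#   .distinct().collect()
```

If the driver reports the virtualenv but the executors still report the system Python, that would suggest the executor-side setting (such as "spark.executorEnv.PYSPARK_PYTHON" mentioned above) is the one not taking effect.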

Running external commands in IPython

I'd like to run a new command from IPython configuration and capture its output. Basically, I'd like to access the equivalent of !command via normal functions. I know I can just use subprocess, but since IPython already provides this functionality, I guess there must be a properly made wrapper included somewhere in the API.
Apparently, such a wrapper can be called via ip.IP.getoutput("command").
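In current IPython releases the shell object is usually obtained with get_ipython(), and it exposes a getoutput method. Here is a hedged sketch that uses it when an IPython shell is active and otherwise falls back to subprocess; the function name run_and_capture is hypothetical.

```python
import subprocess


def run_and_capture(command):
    """Run a shell command and return its standard output as a list of lines."""
    try:
        from IPython import get_ipython

        shell = get_ipython()  # None when not running inside IPython
    except ImportError:
        shell = None

    if shell is not None:
        # Same machinery that backs the `!command` syntax.
        return list(shell.getoutput(command))

    # Fallback for plain Python: capture stdout and split it into lines.
    completed = subprocess.run(
        command, shell=True, capture_output=True, text=True
    )
    return completed.stdout.splitlines()


print(run_and_capture("echo hello"))
```

The fallback keeps the helper usable from configuration code that may be imported outside an interactive session.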