I recently started working with SparkMagic and was able to create notebooks on the PySpark kernel, which work fine.
Now, when I use SparkMagic on the IPython kernel, there are steps that have to be done manually (execute %manage_spark, create an endpoint, create a session).
I would like to know if there's a way to do these steps programmatically.
I have Apache Airflow running on an EC2 instance (Ubuntu). Everything is running fine.
The DB is SQLite and the executor is the Sequential Executor (the default). But now I would like to run some DAGs that need to run at the same time, every hour and every 2 minutes.
My question is: how can I upgrade my current setup to the Celery executor and a Postgres DB to get the advantage of parallel execution?
Will it work if I install and set up Postgres, RabbitMQ and Celery, and make the necessary changes in the airflow.cfg configuration file?
Or do I need to reinstall everything from scratch (including Airflow)?
Please guide me on this.
Thanks
You can, indeed, install Postgres/RabbitMQ/Celery, then update your configuration file (airflow.cfg), initialise the database, and restart the Airflow services.
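A rough sketch of what that boils down to (key names shown for an Airflow 1.10-style setup, with placeholder credentials; on Airflow 2.x the equivalent commands are airflow db init and airflow celery worker):

# Relevant airflow.cfg keys (placeholder values):
#   [core]   executor = CeleryExecutor
#   [core]   sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost:5432/airflow
#   [celery] broker_url = amqp://guest:guest@localhost:5672//
#   [celery] result_backend = db+postgresql://airflow:airflow@localhost:5432/airflow
airflow initdb          # create the Airflow tables in the new Postgres metadata DB
airflow webserver -D    # restart the services as daemons
airflow scheduler -D
airflow worker -D       # the Celery worker that gives you parallel task execution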
However, there is a side note: if required, you'd also have to migrate data from SQLite to Postgres. Most importantly, the database contains your connections and variables. It's possible to export variables beforehand and import them again using the Airflow CLI (see this answer, and the Airflow documentation).
It's also possible to import your connections using the CLI, as described in this Airflow guide (or the documentation).
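As a sketch of that export/import round trip (subcommand syntax shown for Airflow 2.x; on 1.10 it is airflow variables -e / -i, and airflow connections import only exists from 2.2 on):

# on the old SQLite-backed setup
airflow variables export /tmp/variables.json
airflow connections export /tmp/connections.json
# after pointing sql_alchemy_conn at Postgres and initialising the DB
airflow variables import /tmp/variables.json
airflow connections import /tmp/connections.json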
If you've just switched to the new database setup and you see something is missing, you can still easily switch back to the SQLite setup by reverting the changes to airflow.cfg.
I have a Django project, developed in PyCharm, that connects to a PostgreSQL database, and I want to enable PostgreSQL history logging.
The PSQL_HISTORY env variable is set to /home/user/apps/postgres/logs/.pycharm_log, but when I start the project in PyCharm and update some data via the Django Admin (which certainly hits the database), nothing gets logged and the file is not created at all.
Is there a way to make PyCharm and PSQL_HISTORY work together as I expected?
psql is the name of a specific client tool; why would a completely different tool use psql's configuration options? If you want to log every statement sent to the server, you could configure that on the server side with log_statement = all.
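For example, a minimal way to turn that on and reload the server configuration (assuming you can run psql as a superuser; you could equally edit postgresql.conf by hand):

psql -U postgres -c "ALTER SYSTEM SET log_statement = 'all';"
psql -U postgres -c "SELECT pg_reload_conf();"
# statements then appear in the PostgreSQL server log, not in a client-side history file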
I am using the PySpark kernel inside JupyterHub and want to connect to Hive LLAP from Spark. I am able to create a Spark session, but when I try to execute
from pyspark_llap import HiveWarehouseSession
it fails with an error saying no module named pyspark_llap was found.
I am able to run the same import in the Python kernel, and it executes successfully.
Kindly suggest what configuration is needed to import HiveWarehouseSession from pyspark_llap inside the PySpark kernel.
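Not an authoritative fix, but a common cause is that the Python kernel sees pyspark_llap in its site-packages while the PySpark kernel only sees what its underlying spark-submit was given. A hedged sketch, assuming the kernel honours PYSPARK_SUBMIT_ARGS (the paths are placeholders for your Hive Warehouse Connector installation; with JupyterHub you may need to put this in the kernel's kernel.json env block instead):

# placeholder paths -- point these at the real HWC jar and Python zip on your cluster
export PYSPARK_SUBMIT_ARGS="--jars /path/to/hive-warehouse-connector-assembly-<version>.jar \
  --py-files /path/to/pyspark_hwc-<version>.zip pyspark-shell"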
I have a Spark cluster whose master is at 192.168.0.60:7077.
I have been using Jupyter Notebook to write some PySpark scripts.
I now want to move on to Scala.
I don't know the Scala world.
I am trying to use Apache Toree.
I installed it, downloaded the Scala kernels, and ran it to the point of opening a Scala notebook. Up to there everything seems OK :-/
But I can't find the Spark context, and there are errors in the Jupyter server logs:
[I 16:20:35.953 NotebookApp] Kernel started: afb8cb27-c0a2-425c-b8b1-3874329eb6a6
Starting Spark Kernel with SPARK_HOME=/Users/romain/spark
Error: Master must start with yarn, spark, mesos, or local
Run with --help for usage help or --verbose for debug output
[I 16:20:38.956 NotebookApp] KernelRestarter: restarting kernel (1/5)
As I don't know Scala, I am not sure what the issue is here.
It could be:
I need a Spark kernel (according to https://github.com/ibm-et/spark-kernel/wiki/Getting-Started-with-the-Spark-Kernel)
I need to add an option on the server (the error message says 'Master must start with yarn, spark, mesos, or local')
or something else :-/
I just wanted to migrate from Python to Scala, and I have spent a few hours lost just on starting up the Jupyter IDE :-/
It looks like you are using Spark in standalone deploy mode. As Tzach suggested in his comment, the following should work:
SPARK_OPTS='--master=spark://192.168.0.60:7077' jupyter notebook
SPARK_OPTS expects the usual spark-submit parameter list.
If that does not help, you would need to check the SPARK_MASTER_PORT value in conf/spark-env.sh (7077 is the default).
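If setting SPARK_OPTS at launch time is awkward, another option (assuming a Toree version whose installer accepts these flags) is to bake the master into the kernel spec when installing it:

jupyter toree install --spark_home=$SPARK_HOME --spark_opts='--master spark://192.168.0.60:7077'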
I have an IPython notebook server running on a remote machine, i.e.
ipython notebook --profile=nbserver
which I access from my local machine. Further, I SSH to the remote server from my machine and start an IPython console (terminal) on that server. I have found the following command to work well:
ipython console --existing \
~/.config/ipython/profile_nbserver/security/kernel-*.json
Now I am connected to the same remote kernel from two different clients (let's call them browser and terminal). Everything works well, except for one annoying detail:
1) In the browser, I type a=1
2) In the terminal, I type b=2
3) In both clients I can see both commands using %history. But when I want to cycle through the history (in the terminal) using Up, it only shows the commands that were typed in the terminal (i.e. b=2). Similarly, I am unable to use a + PageDown in the terminal to go back in history and find the command starting with a.
From what I understand, my two clients are using two separate history.sqlite files. But why does %history show all commands?
Question:
Is there any way to configure both clients to use a single history.sqlite?
I find that having easy access to history is absolutely crucial. Moreover, I see the terminal and the browser as complementary: they both have trade-offs and are best used in combination.
You can set where the history gets loaded either by passing the option on the command line:
ipython --HistoryManager.hist_file=$HOME/ipython_hist.sqlite
or within the ipython config files:
import os
c.HistoryManager.hist_file=os.path.expanduser("~/ipython_hist.sqlite")
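Applied to the setup in the question, and assuming the console client honours the same HistoryManager option as the plain ipython terminal does, that could look like:

ipython console --existing ~/.config/ipython/profile_nbserver/security/kernel-*.json \
    --HistoryManager.hist_file=$HOME/ipython_hist.sqlite
# put the same hist_file line into profile_nbserver's ipython_config.py so the
# notebook server's kernels write to that shared file as well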