zeppelin-0.7.3 Interpreter pyspark not found - pyspark

I get the below error when I use pyspark via Zeppelin.
The python & spark interpreters work and all environment variables are set correctly.
print os.environ['PYTHONPATH']
/x01/spark_u/spark/python:/x01/spark_u/spark/python/lib/py4j-0.10.4-src.zip:/x01/spark_u/spark/python:/x01/spark_u/spark/python/lib/py4j-0.10.4-src.zip:/x01/spark_u/spark/python/lib/py4j-0.10.4-src.zip:/x01/spark_u/spark/python/lib/pyspark.zip:/x01/spark_u/spark/python:/x01/spark_u/spark/python/pyspark:/x01/spark_u/zeppelin/interpreter/python/py4j-0.9.2/src:/x01/spark_u/zeppelin/interpreter/lib/python
zeppelin-env.sh is set with the variables below:
export PYSPARK_PYTHON=/usr/local/bin/python2
export PYTHONPATH=${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.10.4-src.zip:${PYTHONPATH}
export SPARK_YARN_USER_ENV="PYTHONPATH=${PYTHONPATH}"
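For reference, a quick sanity-check sketch, assuming it is run in a Zeppelin python paragraph, to confirm the zips listed in the PYTHONPATH above are actually importable:
import os

print(os.environ.get('SPARK_HOME'))
print(os.environ.get('PYTHONPATH'))

# Both of these should import cleanly if the pyspark.zip and py4j zip
# from the PYTHONPATH above are visible to the interpreter process.
import py4j
import pyspark
print(pyspark.__version__)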
See the log excerpt below:
INFO [2017-11-01 12:30:42,972] ({pool-2-thread-4} RemoteInterpreter.java[init]:221) - Create remote interpreter org.apache.zeppelin.spark.PySparkInterpreter
org.apache.zeppelin.interpreter.InterpreterException: paragraph_1509038605940_-1717438251's Interpreter pyspark not found
Thank you in advance

I found a workaround for the above issue. The "interpreter not found" error does not occur when I create the note inside a directory; it only happens when I use notes at the top level. Additionally, I found that this issue does not occur in version 0.7.2.

Related

Add Jar file to Jupyter notebook: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver

I have a pyspark script that was run using this bash script:
Now I am running the pyspark script in a Jupyter notebook. I added the Teradata jar like this:
But when I later tried to use spark.read.jdbc to run a query against Teradata, I got this error:
How can I solve this issue?
Try this.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /jar/path/ pyspark-shell'
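For context, a fuller sketch of the same idea, with hypothetical jar path, host, and table names; PYSPARK_SUBMIT_ARGS must be set before the SparkSession (and its underlying SparkContext) is created:
import os

# Point --jars at the Teradata driver before Spark starts (path is hypothetical)
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /path/to/terajdbc4.jar pyspark-shell'

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("teradata-read").getOrCreate()

# Hypothetical connection details; the driver class name comes from the error above
df = spark.read.jdbc(
    url="jdbc:teradata://teradata-host/DATABASE=mydb",
    table="my_table",
    properties={"user": "myuser",
                "password": "mypassword",
                "driver": "com.teradata.jdbc.TeraDriver"})
df.show(5)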

Spark reading file from local gives InvalidInputException

Using Spark 2.2.0 installed by Homebrew on OSX High Sierra. I got into spark-shell and tried to read a local file like so:
val lines = sc.textFile("file:///Users/username/Documents/Notes/sampleFile")
val llist = lines.collect()
This gives me:
org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/Users/bsj625/Documents/Notes/sampleFile
I've tried a bunch of variations, file:/ and file://. I also tried running spark-shell in local mode like so:
spark-shell --master local
But I'm still getting the same error. Are there any environment variables I need to set? Any help appreciated.
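For comparison, a minimal pyspark sketch of the same read (the path is the hypothetical one from the question); in local mode the file only needs to exist on the machine running the driver, and the URI is "file://" plus an absolute path:
import os
from pyspark import SparkContext

# Hypothetical path from the question; check it exists before involving Spark
path = "/Users/username/Documents/Notes/sampleFile"
print(os.path.exists(path))

sc = SparkContext("local", "read-local-file")
lines = sc.textFile("file://" + path)   # "file://" + "/Users/..." -> file:///Users/...
print(lines.collect()[:5])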

ipython notebook is not updating when I change my code

So, I ran into a weird issue using an IPython notebook and I'm not sure what to do. Normally, when I run a part of the code and there is an error, I trace it back, fix it, and re-run the code. I did the same thing here, but even after making changes to the code, it looks like nothing is changing!
Here is the example... I am using Python 3.5, so xrange is gone. This caused an error to be thrown:
XXXX
24 XXXX
25 XXXX
---> 26 for t in xrange(0,len(data),1):
27
28 XXXX
NameError: name 'xrange' is not defined
but after changing my code (you can see the difference below, on line 26), the same error pops up!
XXXX
24 XXXX
25 XXXX
---> 26 for t in range(0,len(data),1):
27
28 XXX
NameError: name 'xrange' is not defined
Any ideas on why this would be happening?
As Thomas K said, you're probably making the change in an external file that was imported earlier. There is a very useful extension in the IPython notebook for such cases, called autoreload. With autoreload, whenever you modify an external file you do not have to import it again, because the extension takes care of it for you. For more information check: ipython autoreload.
Whenever you use external files with IPython, use autoreload. It reloads the external files every time before executing any code in IPython.
Add this in the first cell of the notebook:
%load_ext autoreload
%autoreload 2
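As a rough illustration of the workflow (the module name helpers and function process below are hypothetical), assuming the two magic lines above sit in the first cell:
# helpers.py lives next to the notebook and defines process(data)
import helpers

# Use it, then edit helpers.process on disk and simply re-run this cell;
# with %autoreload 2 the new definition is picked up without re-importing.
print(helpers.process([1, 2, 3]))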
For me this was due to one of the following:
Cause 1: imported module not updated
Solution:
import importlib
importlib.reload(your_module)
Cause 2: other
Solution: restart the kernel (in Jupyter Notebook: Kernel > Restart).
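A minimal usage sketch for the first cause (your_module and some_function stand in for whatever you actually edited):
import importlib
import your_module                # the module whose source file you just edited

importlib.reload(your_module)     # re-executes the module's source in place
your_module.some_function()       # now runs the updated code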
I had the same problem. I tried the Jupyter autoreload magic but it didn't work. Finally, I solved it this way:
in the first cell, add
import My_Functions as my
import importlib
importlib.reload(my)
But note that if the module is imported this way:
from My_Functions import *
I couldn't reload it properly.
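One possible workaround for that star-import case, sketched with the My_Functions name used above: reload the module object itself, then re-run the star import so the notebook's names are rebound:
import importlib
import My_Functions

importlib.reload(My_Functions)    # reload the edited source
from My_Functions import *        # re-run the star import to rebind the names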
I have the same issue sometimes. I think it has to do with memory - if I have a bunch of dataframes hanging around it seems to cause issues. If I restart the kernel using the Kernel > Restart option the problem goes away.
I have the same problem sometimes. Restarting the kernel didn't work for me. Instead, I run the cell (Ctrl+Enter) two or three times; then the result is displayed according to the updated code. I hope it helps.
Insert a new empty cell with the + option, then go to Kernel and choose Restart & Run All.
Then fill in the newly inserted cell and run Kernel > Restart & Run All again.
It works for me.

How to start a Spark Shell using pyspark in Windows?

I am a beginner with Spark and I am trying to follow the instructions here on how to initialize the Spark shell from Python using cmd: http://spark.apache.org/docs/latest/quick-start.html
But when I run the following in cmd:
C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4\>c:\Python27\python bin\pyspark
then I receive the following error message:
File "bin\pyspark", line 21
export SPARK_HOME="$(cd ="$(cd "`dirname "$0"`"/..; pwd)"
SyntaxError: invalid syntax
What am I doing wrong here?
P.S. When in cmd I try just C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4>bin\pyspark
then I receive "'python' is not recognized as an internal or external command, operable program or batch file".
You need to have Python available in the system path; you can add it with setx:
setx path "%path%;C:\Python27"
I'm a fairly new Spark user (as of today, really). I am using Spark 1.6.0 on Windows 10 and Windows 7 machines. The following worked for me:
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Put PySpark and its bundled Py4J on the Python path, then start the PySpark shell
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))  # Python 2 (Spark 1.6 era)
Using the code above, I was able to launch Spark in an IPython notebook and in my Enthought Canopy Python IDE. Before this, I was only able to launch pyspark through a cmd prompt. The code above will only work if you have your environment variables set correctly for Python and Spark (pyspark).
I run this set of path settings whenever I start pyspark in IPython:
import os
import sys

# Restart the notebook with: ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
# For R: Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')
os.environ['SPARK_HOME'] = "G:/Spark/spark-1.5.1-bin-hadoop2.6"
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/bin")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip")

from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext

# sc.stop()  # if you wish to stop an existing context
sc = SparkContext("local", "Simple App")
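As a quick smoke test once the block above has run (the CSV path is hypothetical, and the spark-csv format comes from the --packages line in the comments):
sqlContext = SQLContext(sc)

# Should print 100 if the context is wired up correctly
print(sc.parallelize(range(100)).count())

# Hypothetical CSV read via the spark-csv package mentioned above
df = sqlContext.read.format("com.databricks.spark.csv") \
    .option("header", "true") \
    .load("G:/data/sample.csv")
df.printSchema()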
With the reference and help of the user "maxymoo" I was able to find a way to set a PERMANENT path in Windows 7 as well. The instructions are here:
http://geekswithblogs.net/renso/archive/2009/10/21/how-to-set-the-windows-path-in-windows-7.aspx
Simply set the path in System -> Environment Variables -> Path:
R path on my system: C:\Program Files\R\R-3.2.3\bin
Python path on my system: c:\python27
Spark path on my system: c:\spark-2
The paths must be separated by ";" and there must be no spaces between them.

Running the ruby plugin foo.rb for logstash-1.5.0beta1 version

I am trying to run the Ruby plugin foo.rb as described at this link: http://logstash.net/docs/1.3.3/extending/example-add-a-new-filter.
As specified there, I created the Ruby file at /logstash-1.5.0.beta1/lib/logstash/filters/foo.rb.
The command for running the conf file is given as:
% logstash --pluginpath . -f example.conf
What exactly should I write in place of 'pluginpath'?
When I specify the path of foo.rb, it gives me the following error:
Clamp::UsageError: Unrecognised option '--lib/logstash/filters/foo.rb'
signal_usage_error at /home/administrator/Softwares/logstash-1.5.0.beta1/vendor/bundle/jruby/1.9/gems/clamp-0.6.3/lib/clamp/command.rb:103
find_option at /home/administrator/Softwares/logstash-1.5.0.beta1/vendor/bundle/jruby/1.9/gems/clamp-0.6.3/lib/clamp/option/parsing.rb:62
parse_options at /home/administrator/Softwares/logstash-1.5.0.beta1/vendor/bundle/jruby/1.9/gems/clamp-0.6.3/lib/clamp/option/parsing.rb:28
parse at /home/administrator/Softwares/logstash-1.5.0.beta1/vendor/bundle/jruby/1.9/gems/clamp-0.6.3/lib/clamp/command.rb:52
run at /home/administrator/Softwares/logstash-1.5.0.beta1/lib/logstash/runner.rb:155
call at org/jruby/RubyProc.java:271
run at /home/administrator/Softwares/logstash-1.5.0.beta1/lib/logstash/runner.rb:171
call at org/jruby/RubyProc.java:271
initialize at /home/administrator/Softwares/logstash-1.5.0.beta1/vendor/bundle/jruby/1.9/gems/stud-0.0.18/lib/stud/task.rb:12
What should I do? Thanks in advance! :)