I'm using Spark 2.2.0 installed via Homebrew on macOS High Sierra. I started spark-shell and tried to read a local file like so:
val lines = sc.textFile("file:///Users/username/Documents/Notes/sampleFile")
val llist = lines.collect()
This gives me:
org.apache.hadoop.mapred.InvalidInputException:
Input path does not exist: file:/Users/bsj625/Documents/Notes/sampleFile
I've tried a bunch of variations, file:/ and file://. I also tried running spark-shell in local mode like so:
spark-shell --master local
But I'm still getting the same error. Are there any environment variables I need to set? Any help appreciated.
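For reference, a quick sanity check from the same spark-shell session (plain java.nio calls, using the placeholder path from above) at least confirms whether the path is visible to the machine running the driver:
import java.nio.file.{Files, Paths}

// should print true if the local file really exists on this machine
println(Files.exists(Paths.get("/Users/username/Documents/Notes/sampleFile")))

// only then read it through the file:// URI and pull a few lines back
val lines = sc.textFile("file:///Users/username/Documents/Notes/sampleFile")
lines.take(5).foreach(println)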
I'm a newbie with Spark. In multiple examples on the net, I see something like:
conf.set("es.nodes", "from.escluster.com")
val sc = new SparkContext(conf)
but when I try to do such a thing in the interactive shell, I get the error:
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243)
So how am I supposed to set a conf parameter and be sure that sc "integrates" it?
Two options:
Call sc.stop() before running this code.
Add --conf with your additional configuration as a spark-shell argument.
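For the first option, a minimal sketch inside spark-shell (using the placeholder host from the question, and assuming the elasticsearch-hadoop connector is already on the classpath) looks roughly like this:
import org.apache.spark.{SparkConf, SparkContext}

// stop the context the shell created for you, then build a new one with the extra setting
sc.stop()
val conf = new SparkConf()
  .setAppName("es-example")
  .set("es.nodes", "from.escluster.com")
val sc = new SparkContext(conf)  // shadows the old sc in the REPL

For the second option, something like spark-shell --conf spark.es.nodes=from.escluster.com should work: spark-shell only forwards keys starting with "spark.", and elasticsearch-hadoop resolves its es.* settings when they are passed with that prefix.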
I am a beginner with Spark and am trying to follow the instructions here on how to initialize the Spark shell from Python using cmd: http://spark.apache.org/docs/latest/quick-start.html
But when I run in cmd the following:
C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4\>c:\Python27\python bin\pyspark
then I receive the following error message:
File "bin\pyspark", line 21
export SPARK_HOME="$(cd "`dirname "$0"`"/..; pwd)"
SyntaxError: invalid syntax
What am I doing wrong here?
P.S. When in cmd I try just C:\Users\Alex\Desktop\spark-1.4.1-bin-hadoop2.4>bin\pyspark
then I receive "'python' is not recognized as an internal or external command, operable program or batch file".
You need to have Python available on the system path; you can add it with setx:
setx path "%path%;C:\Python27"
I'm a fairly new Spark user (as of today, really). I am using Spark 1.6.0 on Windows 10 and Windows 7 machines. The following worked for me:
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# put the pyspark sources and the bundled py4j on the Python path
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.9-src.zip'))

# launch the interactive shell (execfile is fine here since this is Python 2)
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
Using the code above, I was able to launch Spark in an IPython notebook and in my Enthought Canopy Python IDE. Before this, I was only able to launch pyspark through a cmd prompt. The code above will only work if you have your environment variables set correctly for Python and Spark (pyspark).
I run this set of path settings whenever I start pyspark in IPython:
import os
import sys
# Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"') for R
# Restart Spark with: ipython notebook --profile=pyspark --packages com.databricks:spark-csv_2.10:1.0.3
os.environ['SPARK_HOME']="G:/Spark/spark-1.5.1-bin-hadoop2.6"
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/bin")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/sql")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/pyspark/mllib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip")
sys.path.append("G:/Spark/spark-1.5.1-bin-hadoop2.6/python/lib/pyspark.zip")
from pyspark import SparkContext
from pyspark import SparkConf
from pyspark.sql import SQLContext
# sc.stop()  # if you wish to stop an existing context
sc = SparkContext("local", "Simple App")
With the reference and help of the user "maxymoo", I was able to find a way to set a permanent path in Windows 7 as well. The instructions are here:
http://geekswithblogs.net/renso/archive/2009/10/21/how-to-set-the-windows-path-in-windows-7.aspx
Simply set the path under System -> Environment Variables -> Path.
R path on my system: C:\Program Files\R\R-3.2.3\bin
Python path on my system: c:\python27
Spark path on my system: c:\spark-2
The paths must be separated by ";" and there must be no spaces between them.
I am trying to run the Ruby plugin foo.rb as described at this link: http://logstash.net/docs/1.3.3/extending/example-add-a-new-filter
As specified there, I have created the Ruby file at /logstash-1.5.0.beta1/lib/logstash/filters/foo.rb
The command for running the conf file is given as:
% logstash --pluginpath . -f example.conf
What exactly should I write in place of 'pluginpath'?
When I specify the path of foo.rb, I get the following errors:
Clamp::UsageError: Unrecognised option '--lib/logstash/filters/foo.rb'
signal_usage_error at /home/administrator/Softwares/logstash-1.5.0.beta1/vendor/bundle/jruby/1.9/gems/clamp-0.6.3/lib/clamp/command.rb:103
find_option at /home/administrator/Softwares/logstash-1.5.0.beta1/vendor/bundle/jruby/1.9/gems/clamp-0.6.3/lib/clamp/option/parsing.rb:62
parse_options at /home/administrator/Softwares/logstash-1.5.0.beta1/vendor/bundle/jruby/1.9/gems/clamp-0.6.3/lib/clamp/option/parsing.rb:28
parse at /home/administrator/Softwares/logstash-1.5.0.beta1/vendor/bundle/jruby/1.9/gems/clamp-0.6.3/lib/clamp/command.rb:52
run at /home/administrator/Softwares/logstash-1.5.0.beta1/lib/logstash/runner.rb:155
call at org/jruby/RubyProc.java:271
run at /home/administrator/Softwares/logstash-1.5.0.beta1/lib/logstash/runner.rb:171
call at org/jruby/RubyProc.java:271
initialize at /home/administrator/Softwares/logstash-1.5.0.beta1/vendor/bundle/jruby/1.9/gems/stud-0.0.18/lib/stud/task.rb:12
What should I do? Thanks in advance! :)
I would like to change the following shell script to Scala (just for fun); however, the script must keep running and listen for changes to the *.mkd files. If any file is changed, then the script should re-generate the affected doc. File IO has always been my Achilles heel...
#!/bin/sh
for file in *.mkd
do
pandoc --number-sections $file -o "${file%%.*}.pdf"
done
Any ideas around a good approach to this will be appreciated.
The following code, taken from my answer to "Watch for project files", can also watch a directory and execute a specific command:
#!/usr/bin/env scala
import java.nio.file._
import scala.collection.JavaConversions._
import scala.sys.process._
val file = Paths.get(args(0))
val cmd = args(1)
val watcher = FileSystems.getDefault.newWatchService
file.register(
watcher,
StandardWatchEventKinds.ENTRY_CREATE,
StandardWatchEventKinds.ENTRY_MODIFY,
StandardWatchEventKinds.ENTRY_DELETE
)
def exec = cmd run true  // start the command in the background; true connects its stdin to ours
@scala.annotation.tailrec
def watch(proc: Process): Unit = {
val key = watcher.take
val events = key.pollEvents
val newProc =
if (!events.isEmpty) {
proc.destroy()
exec
} else proc
if (key.reset) watch(newProc)
else println("aborted")
}
watch(exec)
Usage:
watchr.scala markdownFolder/ "echo \"Something changed!\""
Extensions have to be made to the script to inject file names into the command. As of now, this snippet should just be regarded as a building block for the actual answer.
Modifying the script to incorporate the *.mkd wildcard would be non-trivial, as you'd have to manually search for the files and register a watch on all of them. Re-using the script above and placing all the files in one directory has the added advantage of picking up new files when they are created; a rough sketch of that approach follows.
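For the concrete pandoc case, a minimal sketch (untested, assuming pandoc is on the PATH and the script is launched inside the folder containing the .mkd files) could filter the events by extension and rebuild only the file that changed:
#!/usr/bin/env scala
import java.nio.file._
import scala.collection.JavaConversions._
import scala.sys.process._

val dir = Paths.get(".")
val watcher = FileSystems.getDefault.newWatchService
dir.register(
  watcher,
  StandardWatchEventKinds.ENTRY_CREATE,
  StandardWatchEventKinds.ENTRY_MODIFY
)

while (true) {
  val key = watcher.take
  key.pollEvents.foreach { event =>
    event.context match {
      case changed: Path if changed.toString.endsWith(".mkd") =>
        // rebuild just the touched file, mirroring the original bash loop
        val pdf = changed.toString.stripSuffix(".mkd") + ".pdf"
        Seq("pandoc", "--number-sections", changed.toString, "-o", pdf).!
      case _ => // ignore other files and overflow events
    }
  }
  if (!key.reset) sys.exit(1)  // the watched directory is no longer accessible
}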
As you can see, it gets pretty big and messy pretty quickly just relying on the Scala and Java APIs; you would be better off relying on alternative libraries, or just sticking to bash and using inotify.