How to run a shell file in Apache Spark using Scala

I need to execute a shell file at the end of my Spark code in Scala. I used count and groupBy functions in my code. I should mention that my code works perfectly without the last line of code:
import sys.process._
// linux commands
val procresult = "./Users/saeedtkh/Desktop/seprator.sh".!!
Could you please help me fix it?

You must use the sys.process._ package from the Scala standard library and its ! DSL:
import sys.process._
val result = "ls -al".!
Or do the same with scala.sys.process.Process:
import scala.sys.process._
Process("cat data.txt").!

Related

Download file from Databricks (Scala)

I've used the following piece of code in Databricks to divide romania-latest.osm.pbf into romania-latest.osm.pbf.node.parquet and romania-latest.osm.pbf.way.parquet. Now I want to download these files to my local computer so that I can use them in IntelliJ, but I can't seem to find where they're located or how to get them. I'm using the Community Edition of Databricks. This is done in Scala.
import sys.process._
"wget https://github.com/adrianulbona/osm-parquetizer/releases/download/v1.0.0/osm-parquetizer-1.0.0.jar -P /tmp/osm" !!
import sys.process._
"wget http://download.geofabrik.de/europe/monaco-latest.osm.pbf -P /tmp/osm" !!
import sys.process._
"java -jar /tmp/osm/osm-parquetizer-1.0.0.jar /tmp/osm/monaco-latest.osm.pbf" !!
I've searched on Google for a solution but nothing seems to work.

Scala REPL startup command line

When I start the Scala REPL, is there a way to pass Scala code or source files on the command line so that they run automatically? I would like to have some imports done automatically at startup (from the command line in a bash script) for user convenience.
Instead of the Scala REPL, use Ammonite. It is much better than the Scala REPL.
The best part about Ammonite is that you can create a file called predef.sc under the ~/.ammonite folder. There you can specify dependencies, such as interp.load.ivy("joda-time" % "joda-time" % jodaVersion), as well as imports, such as import scala.concurrent.ExecutionContext.Implicits.global.
Now, each time you start the Ammonite REPL, all of those dependencies and imports are already present. How cool is that?
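For example, a minimal ~/.ammonite/predef.sc along those lines could look like this (a sketch; the joda-time version string is just a placeholder):
// ~/.ammonite/predef.sc — evaluated on every Ammonite REPL startup
interp.load.ivy("joda-time" % "joda-time" % "2.10.6")
import scala.concurrent.ExecutionContext.Implicits.global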
If your project has a build.sbt file, then you can add something like this to that file:
initialCommands in console :=
"""
import java.net._
import java.time._
import java.time.format._
import java.time.temporal._
import javax.mail._
import javax.mail.internet._
"""
Not sure how clear that is, but it is simply a block string using triple quotes.
Saves a lot of time having that collection of basic classes (whatever classes make sense for your work) handy!

Importing spark.implicits._ inside a Jupyter notebook

In order to make use of $"my_column" constructs within Spark SQL, we need to:
import spark.implicits._
This, however, is not working, as far as I can tell, inside a Jupyter notebook cell; the result is:
Name: Compile Error
Message: <console>:49: error: stable identifier required, but this.$line7$read.spark.implicits found.
import spark.implicits._
^
I have seen notebooks in the past for which that did work, but they may have been Zeppelin. Is there a way to get this working in Jupyter?
Here is a hack that works:
val spark2: SparkSession = spark
import spark2.implicits._
So now the spark2 reference is "stable", apparently: the import needs a stable identifier, and assigning the session to a plain val provides one.
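With that alias in place, the $"my_column" syntax from the question compiles in a notebook cell, for example (a sketch; toDF and $ both come from spark2.implicits._):
val df = Seq(("a", 1), ("b", 2)).toDF("my_column", "n")
df.select($"my_column").show()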

Error when running pyspark

I tried to run pyspark from the terminal. In my terminal, I run snotebook and it automatically loads Jupyter. After that, when I select Python 3, this error appears in the terminal:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file
/Users/simon/spark-1.6.0-bin-hadoop2.6/python/pyspark/shell.py
Here's my .bash_profile setting:
export PATH="/Users/simon/anaconda/bin:$PATH"
export SPARK_HOME=~/spark-1.6.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_HOME/bin/pyspark'
Please let me know if you have any ideas, thanks.
You need to add the line below to your configuration (e.g. your .bash_profile):
PYSPARK_DRIVER_PYTHON=ipython
or
PYSPARK_DRIVER_PYTHON=ipython3
Hope it helps.
In my case, I was using a virtual environment and forgot to install Jupyter, so it was using some version that it found in the $PATH. Installing it inside the environment fixed this issue.
Spark now includes PySpark as part of the install, so remove the PySpark library unless you really need it.
Remove the old Spark and install the latest version.
Install the findspark library (via pip).
In Jupyter, import and use findspark:
import findspark
findspark.init()
Quick PySpark / Python 3 Check
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext()
print(sc)
sc.stop()

Spark context cannot resolve in MLUtils.loadLibSVMFile with IntelliJ

I am trying to run the multilayer perceptron classifier example here: https://spark.apache.org/docs/1.5.2/ml-ann.html. It seems to work well in spark-shell, but not in an IDE like IntelliJ or Eclipse. The problem comes from
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt").toDF()
The IDE complains that it cannot resolve the symbol sc (SparkContext), even though the library paths have been configured correctly. If anyone can help me, thanks!
Actually, there is no such value as sc by default; it is created and imported on spark-shell startup. In any ordinary Scala/Java/Python code you have to create it yourself.
I recently wrote another (admittedly low-quality) answer; the part about sbt and library dependencies in it applies here as well.
Then you can use something like the following code as a template to start from:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}

object Spark extends App {
  val config = new SparkConf().setAppName("odo").setMaster("local[2]").set("spark.driver.host", "localhost")
  val sc = new SparkContext(config)
  val sqlc = new SQLContext(sc)
  import sqlc.implicits._
  // your code follows here
}
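Where the "your code follows" comment sits, the MLUtils call from the question should then compile against the sc defined above, for example (a sketch; the data path is the one from the question, resolved relative to the project root):
import org.apache.spark.mllib.util.MLUtils
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt").toDF()
data.show(5)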
Then you can just run it with Ctrl+Shift+F10.