To use $"my_column" constructs within Spark SQL, we need to:
import spark.implicits._
However, as far as I can tell, this does not work inside a Jupyter notebook cell; the result is:
Name: Compile Error
Message: <console>:49: error: stable identifier required, but this.$line7$read.spark.implicits found.
import spark.implicits._
^
I have seen notebooks in the past for which that did work, but they may have been Zeppelin. Is there a way to get this working in Jupyter?
Here is a hack that works:
val spark2: SparkSession = spark
import spark2.implicits._
So now the spark2 reference is apparently "stable": a plain val qualifies as a stable identifier, while the notebook's wrapper around spark does not.
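For reference, a minimal sketch of the full pattern in a notebook cell; the DataFrame and its my_column column are made up here just to exercise the $-syntax:
import org.apache.spark.sql.SparkSession

// Alias the notebook-provided `spark` behind a val; a val is a stable
// identifier, so its implicits can be imported.
val spark2: SparkSession = spark
import spark2.implicits._

// Hypothetical data, only to show that $"..." now resolves.
val df = Seq(("a", 1), ("b", 2)).toDF("my_column", "value")
df.select($"my_column").show()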
I am trying to run a PySpark unit test in Visual Studio Code on my local Windows machine. When I debug the test, it gets stuck at the line where I create a SparkSession. It doesn't show any error or failure; the status bar just shows "Running Tests". Once it works, I can refactor my test to create the SparkSession as part of a test fixture, but presently my test gets stuck at SparkSession creation.
Do I have to install or configure something on my local machine for the SparkSession to work?
I tried a simple test with assert 'a' == 'b', and I can debug and run it successfully, so I assume my pytest configuration is correct. The issue I am facing is with creating the SparkSession.
# test code
from pyspark.sql import SparkSession, Row, DataFrame
import pytest

def test_poc():
    spark_session = SparkSession.builder.master('local[2]').getOrCreate()  # this line never returns when debugging the test
    spark_session.createDataFrame(data, schema)  # data and schema not shown here
Thanks
What I did to make it work was:
Create a .env file in the root of the project
Add the following content to the created file:
SPARK_LOCAL_IP=127.0.0.1
JAVA_HOME=<java_path>/jdk/zulu#1.8.192/Contents/Home
SPARK_HOME=<spark_path>/spark-3.0.1-bin-hadoop2.7
PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.9-src.zip:$PYTHONPATH
Go to the .vscode folder in the root, expand it, and open settings.json. Add the following line (replace <workspace_path> with your actual workspace path):
"python.envFile": "<workspace_path>/.env"
After refreshing the Testing section in Visual Studio Code, the setup should succeed.
Note: I use pyenv to set up my Python version, so I had to make sure that VS Code was using the correct Python version with all the expected dependencies installed.
Solution inspired by "py4j.protocol.Py4JError: org.apache.spark.api.python.PythonUtils.getEncryptionEnabled does not exist in the JVM" and https://github.com/microsoft/vscode-python/issues/6594
I need to execute a shell file at the end of my Spark code in Scala. I used count and groupBy functions in my code. I should mention that my code works perfectly without the last line:
import sys.process._
// Linux commands
val procresult = "./Users/saeedtkh/Desktop/seprator.sh".!!
Could you please help me fix it?
You must import the sys.process._ package from the Scala standard library and use its DSL with !:
import sys.process._
val result = "ls -al".!
Or do the same with scala.sys.process.Process:
import scala.sys.process._
Process("cat data.txt")!
I tried to run PySpark via the terminal. From my terminal, I run snotebook and it automatically loads Jupyter. After that, when I select Python 3, this error comes out in the terminal:
[IPKernelApp] WARNING | Unknown error in handling PYTHONSTARTUP file
/Users/simon/spark-1.6.0-bin-hadoop2.6/python/pyspark/shell.py
Here's my .bash_profile setting:
export PATH="/Users/simon/anaconda/bin:$PATH"
export SPARK_HOME=~/spark-1.6.0-bin-hadoop2.6
export PATH=$PATH:$SPARK_HOME/bin
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
export PYSPARK_PYTHON=python3
alias snotebook='$SPARK_HOME/bin/pyspark'
Please let me know if you have any ideas, thanks.
You need to set the line below in your .bash_profile (replacing the jupyter value):
export PYSPARK_DRIVER_PYTHON=ipython
or
export PYSPARK_DRIVER_PYTHON=ipython3
Hope it will help.
In my case, I was using a virtual environment and forgot to install Jupyter, so it was using some version that it found in the $PATH. Installing it inside the environment fixed this issue.
Spark now includes PySpark as part of the install, so remove the PySpark library unless you really need it.
Remove the old Spark and install the latest version.
Install the findspark library with pip.
In Jupyter, import and use findspark:
import findspark
findspark.init()
Quick PySpark / Python 3 Check
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext()
print(sc)
sc.stop()
I am trying to run the multilayer perceptron classifier example from https://spark.apache.org/docs/1.5.2/ml-ann.html. It works well in spark-shell, but not in IDEs like IntelliJ and Eclipse. The problem comes from
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt").toDF()
The IDE reports that it cannot resolve the symbol sc (the SparkContext), even though the library path has been configured correctly. If anyone can help me, thanks!
Actually, there is no such value as sc by default; it is defined on spark-shell startup. In any ordinary Scala/Java/Python code you have to create it manually.
I recently wrote another, admittedly low-quality, answer; you can reuse the part about sbt and libraries from it.
Then you can use something like the following code as a template to start:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkContext, SparkConf}

object Spark extends App {
  val config = new SparkConf().setAppName("odo").setMaster("local[2]").set("spark.driver.host", "localhost")
  val sc = new SparkContext(config)
  val sqlc = new SQLContext(sc)
  import sqlc.implicits._
  // your code follows here
}
Then you can just run it with Ctrl+Shift+F10.
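To tie this back to the question, a minimal sketch of loading the example data with the manually created sc; it assumes the Spark 1.x MLlib API from the linked docs and belongs inside the object body, after the imports above:
import org.apache.spark.mllib.util.MLUtils

// loadLibSVMFile returns an RDD[LabeledPoint]; toDF() works because of
// the sqlc.implicits._ import in the template above.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_multiclass_classification_data.txt").toDF()
data.show(5)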
I'm trying to use QTree and QTreeSemigroup in the algebird package but am unable to import them in the spark-shell.
I tried both:
spark-shell --jars ~/jars/algebird-core_2.10-0.1.11.jar
SPARK_CLASSPATH="~/jars/algebird-core_2.10-0.1.11.jar" spark-shell --jars ~/jars/algebird-core_2.10-0.1.11.jar
and I'm able to successfully import algebird like this:
import com.twitter.algebird._
But when I try to import QTree, I get an error saying they are not members of the algebird package:
scala> import com.twitter.algebird.QTree
<console>:22: error: object QTree is not a member of package com.twitter.algebird
import com.twitter.algebird.QTree
^
scala> import com.twitter.algebird.QTreeSemigroup
<console>:22: error: object QTreeSemigroup is not a member of package com.twitter.algebird
import com.twitter.algebird.QTreeSemigroup
What gives?! Anyone seen this before?
Figured it out: I was using an old version of algebird that predates QTree. It works with the latest version.
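For example, launching with a newer build along these lines should make the import work; the exact version and Scala suffix here are assumptions, so match them to the Scala version your Spark build uses:
spark-shell --jars ~/jars/algebird-core_2.10-0.12.0.jar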