scipy UDF error when migrating to Spark 1.6

I started migrating from Spark 1.5 (Python) to Spark 1.6, and for some reason the following commands no longer work:
from scipy.stats import binom
from pyspark.sql.types import FloatType
BCDF = lambda Ps : binom.cdf(Ps[0],Ps[1],Ps[2])
sqlContext.udf.register('bcdf', BCDF, FloatType())
It yields the error:
no module named _tkinter
I tested that my scipy function still works on its own; everything is as expected on that front.
Has anyone experienced a similar issue?
Best

For some reason, importing stats rather than binom directly did the trick for me:
from scipy import stats
from pyspark.sql.types import FloatType
BCDF = lambda Ps : stats.binom.cdf(Ps[0],Ps[1],Ps[2])
sqlContext.udf.register('bcdf', BCDF, FloatType())
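Once registered, the UDF is called from SQL with a single struct (or array) argument so the lambda can index it as Ps[0], Ps[1], Ps[2]. A minimal sketch, with made-up table and column names:
# Hypothetical data; k = successes, n = trials, p = probability of success.
df = sqlContext.createDataFrame([(3, 10, 0.5)], ['k', 'n', 'p'])
df.registerTempTable('trials')
# struct(k, n, p) arrives in Python as a Row, which supports positional indexing.
sqlContext.sql("SELECT k, n, p, bcdf(struct(k, n, p)) AS cdf FROM trials").show()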

Related

Pyspark ModuleNotFound when importing custom package

Context: I'm running a script on Azure Databricks, and I'm using imports to pull functions in from a given file.
Let's say we have something like this in a file called "new_file":
from old_file import x
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.types import *
spark = SparkSession.builder.appName('workflow').config(
"spark.driver.memory", "32g").getOrCreate()
The imported function "x" takes as an argument a string that is read as a pyspark dataframe, as such:
new_df_spark = spark.read.parquet(new_file)
new_df = ps.DataFrame(new_df_spark)  # ps is presumably pyspark.pandas, imported elsewhere
new_df is then passed as an argument to a function that calls the function x.
I then get an error like
ModuleNotFoundError: No module named "old_file"
Does this mean I can't use imports? Or do I need to install old_file on the cluster for this to work? If so, how would that work, and will the package update if I change old_file again?
Thanks
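One common approach is to ship the module to the cluster with addPyFile so that "from old_file import x" resolves on the driver and executors alike. A minimal sketch, assuming old_file.py sits next to the notebook (the path below is only illustrative):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('workflow').getOrCreate()

# Make old_file.py available to the driver and executors.
# The DBFS path is an assumption; point it at wherever old_file.py actually lives.
spark.sparkContext.addPyFile('/dbfs/path/to/old_file.py')

from old_file import x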

Scala with filter column

I get the error:
error: not found: value col
when I issue the command in a Databricks notebook; I don't get the error when running it from a spark-shell:
altitiduDF.select("height", "terrain").filter(col("height") >= 11000...
^
I tried importing the following before my query, but it did not help:
import org.apache.spark.sql.Column
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
Where can I find what I need to import to use the col function?
Found that I need to import
import org.apache.spark.sql.functions.{col}
to use the col function in Databricks.

Error in Pycharm when linking to pyspark: name 'spark' is not defined

When I run the example code in cmd, everything is ok.
>>> import pyspark
>>> l = [('Alice', 1)]
>>> spark.createDataFrame(l).collect()
[Row(_1='Alice', _2=1)]
But when I execute the code in pycharm, I get an error.
spark.createDataFrame(l).collect()
NameError: name 'spark' is not defined
Maybe something is wrong with how I linked PyCharm to pyspark.
(Screenshots: Environment Variable, Project Structure, Project Interpreter.)
When you start pyspark from the command line, a SparkSession object and a SparkContext are available to you as spark and sc respectively.
To use them in PyCharm, you need to create these variables yourself first:
from pyspark.sql import SparkSession

# Recreate what the pyspark shell normally provides for you
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
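With the session created this way, the snippet from the question should behave the same in PyCharm as it did on the command line:
l = [('Alice', 1)]
spark.createDataFrame(l).collect()  # [Row(_1='Alice', _2=1)]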
EDIT:
Please have a look at: Failed to locate the winutils binary in the hadoop binary path

No module named 'pyspark' in Zeppelin

I am new to Spark and just started using it. I am trying to import SparkSession from pyspark, but it throws the error "No module named 'pyspark'". Please see my code below.
# Import our SparkSession so we can use it
from pyspark.sql import SparkSession
# Create our SparkSession, this can take a couple minutes locally
spark = SparkSession.builder.appName("basics").getOrCreate()
Error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-2-6ce0f5f13dc0> in <module>
1 # Import our SparkSession so we can use it
----> 2 from pyspark.sql import SparkSession
3 # Create our SparkSession, this can take a couple minutes locally
4 spark = SparkSession.builder.appName("basics").getOrCreate()
ModuleNotFoundError: No module named 'pyspark'
I am in my conda env and I tried pip install pyspark, but I already have it.
If you are using Zepl, it has its own specific way of importing. This makes sense: since it runs in the cloud, it needs its own syntax, which also distinguishes its directives from plain Python. For instance, %spark.pyspark.
%spark.pyspark
from pyspark.sql import SparkSession
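Putting it together, a full paragraph would look like this, reusing the builder call from the question:
%spark.pyspark
from pyspark.sql import SparkSession

# Create (or reuse) the session inside the %spark.pyspark interpreter
spark = SparkSession.builder.appName("basics").getOrCreate()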

Spark Shell Import Fine, But Throws Error When Referencing Classes

I am a beginner in Apache Spark, so please excuse me if this is quite trivial.
Basically, I was running the following imports in spark-shell:
import org.apache.spark.sql.{DataFrame, Row, SQLContext, DataFrameReader}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.hadoop.hive.ql.io.orc.{OrcInputFormat,OrcStruct};
import org.apa‌​che.hadoop.io.NullWritable;
...
val rdd = sc.hadoopFile(path,
classOf[org.apache.hadoop.hive.ql.io.orc.OrcInputFor‌​mat],
classOf[NullWritable],
classOf[OrcStruct],
1)
The import statements up till OrcInputFormat work fine, with the exception that:
error: object apa‌​che is not a member of package org
import org.apa‌​che.hadoop.io.NullWritable;
It does not make sense, given that the import statement before it goes through without any issue.
In addition, when referencing OrcInputFormat, I was told:
error: type OrcInputFor‌​mat is not a member of package org.apache.hadoop.hive.ql.io.orc
It seems strange for the import of OrcInputFormat to work (I assume it works, since no error is thrown), yet the above error message then turns up. Basically, I am trying to read ORC files from S3.
I am also trying to figure out what I have done wrong, and why this happens.
What I have done:
I have tried running spark-shell with the --jars option and importing hadoop-common-2.6.0.jar (my current version of Spark is 1.6.1, compiled with Hadoop 2.6).
val df = sqlContext.read.format("orc").load(PathToS3), as referred to in Read ORC files directly from Spark shell. I have tried variations of s3, s3n, and s3a, without any success.
You have 2 non-printing characters between org.apa and che in the last import, most certainly due to a copy-paste:
import org.apa‌​che.hadoop.io.NullWritable;
Just rewrite the last import statement and it will work. Also, you don't need those semicolons.
You have the same problem with OrcInputFormat:
error: type OrcInputFor‌​mat is not member of package org.apache.hadoop.hive.ql.io.orc
That's funny: in the mobile version of Stack Overflow we can clearly see those non-printing characters.
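If you want to check a pasted line for hidden characters yourself, something along these lines will expose them (the string below fabricates two zero-width characters for illustration; the exact code points in the original post may differ):
# Flag characters that render invisibly but break the Scala parser.
line = "import org.apa\u200c\u200bche.hadoop.io.NullWritable;"
hidden = [(i, hex(ord(c))) for i, c in enumerate(line) if not c.isprintable()]
print(hidden)  # [(14, '0x200c'), (15, '0x200b')] -- zero-width characters hiding inside 'apache'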