Spark Shell Imports Fine, But Throws Error When Referencing Classes - scala

I am a beginner in Apache Spark, so please excuse me if this is quite trivial.
Basically, I was running the following imports in spark-shell:
import org.apache.spark.sql.{DataFrame, Row, SQLContext, DataFrameReader}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.hadoop.hive.ql.io.orc.{OrcInputFormat,OrcStruct};
import org.apa‌​che.hadoop.io.NullWritable;
...
val rdd = sc.hadoopFile(path,
classOf[org.apache.hadoop.hive.ql.io.orc.OrcInputFor‌​mat],
classOf[NullWritable],
classOf[OrcStruct],
1)
The import statements up to and including OrcInputFormat work fine, with the exception that:
error: object apa‌​che is not a member of package org
import org.apa‌​che.hadoop.io.NullWritable;
This does not make sense, given that the import statement right before it goes through without any issue.
In addition, when referencing OrcInputFormat, I was told:
error: type OrcInputFor‌​mat is not a member of package org.apache.hadoop.hive.ql.io.orc
It seems strange for the import of OrcInputFormat to work (I assume it does, since no error is thrown) and then for the above error message to turn up. Basically, I am trying to read ORC files from S3.
I am also trying to figure out what I have done wrong and why this happens.
What I have done:
I have tried running spark-shell with the --jars option and importing hadoop-common-2.6.0.jar (my current version of Spark is 1.6.1, compiled with Hadoop 2.6).
I have tried val df = sqlContext.read.format("orc").load(PathToS3), as suggested in "Read ORC files directly from Spark shell". I have tried variations of S3, S3n, and S3a, without any success.

You have 2 non-printing characters between org.apa and che in the last import, most certainly due to a copy-paste:
import org.apa‌​che.hadoop.io.NullWritable;
Just rewrite the last import statement and it will work. Also, you don't need those semicolons.
You have the same problem with OrcInputFormat:
error: type OrcInputFor‌​mat is not member of package org.apache.hadoop.hive.ql.io.orc
Funnily enough, in the mobile version of Stack Overflow you can clearly see those non-printing characters.
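If you want to verify this yourself, a quick check is to print the code point of every character of the pasted text (a minimal sketch for the Scala REPL; the string below is only a placeholder for whatever you copied):
val suspect = "org.apache.hadoop.io.NullWritable"  // paste the copied text here
suspect.foreach(c => println(f"'$c' -> U+${c.toInt}%04X"))
// Zero-width characters such as U+200B or U+200C will show up in the output
// even though they are invisible in the editor.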

Related

Pyspark ModuleNotFound when importing custom package

Context: I'm running a script on Azure Databricks, and I'm using imports to bring in functions from a given file.
Let's say we have something like this in a file called "new_file":
from old_file import x
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.types import *
spark = SparkSession.builder.appName('workflow').config(
"spark.driver.memory", "32g").getOrCreate()
The imported function "x" will take as an argument a string that was read as a PySpark DataFrame, like so:
new_df_spark = spark.read.parquet(new_file)
new_df = ps.DataFrame(new_df_spark)
new_df is then passed as an argument to a function that calls the function x.
I then get an error like:
ModuleNotFoundError: No module named "old_file"
Does this mean I can't use imports? Or do I need to install old_file on the cluster for this to work? If so, how would that work, and will the package update if I change old_file again?
Thanks

Unable to call Notebook when using scala code in Databricks

I am in a situation where I can successfully run the snippet below in Azure Databricks from a separate cell (Cmd).
%run ./HSCModule
But I run into issues when I include that piece of code alongside other Scala code that imports the packages below; I then get the following error.
import java.io.{File, FileInputStream}
import java.text.SimpleDateFormat
import java.util{Calendar, Properties}
import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._
import scala.util._
ERROR: :168: error: ';' expected but '.' found.
%run ./HSCModule
FYI: I have also tried dbutils.notebook.run and am still facing the same issues.
You can't mix magic commands such as %run, %pip, etc. with Scala/Python code in the same cell. The documentation says:
%run must be in a cell by itself, because it runs the entire notebook inline.
So you need to put this magic command into a separate cell.
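For illustration only, the split could look roughly like this (the cell boundary is what matters; the module name and imports are the ones from the question):
Cmd 1 - nothing but the magic command:
%run ./HSCModule
Cmd 2 - the Scala code, in its own cell:
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._
// ... the rest of the imports and logic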

How to fix spark.read.format("parquet") error

I'm running Scala code on Azure Databricks without problems. Now I want to move this code from the Azure notebook to Eclipse.
I installed the Databricks connection following the Microsoft documentation and passed the Databricks data connection test.
I also installed SBT and imported it into my project in Eclipse.
I created a Scala object in Eclipse and also imported all the jar files as external files.
package Student

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.SparkSession
import java.util.Properties
//import com.databricks.dbutils_v1.DBUtilsHolder.dbutils

object Test {
  def isTypeSame(df: DataFrame, name: String, coltype: String) = (df.schema(name).dataType.toString == coltype)

  def main(args: Array[String]) {
    var Result = true
    val Borrowers = List(("col1", "StringType"), ("col2", "StringType"), ("col3", "DecimalType(38,18)"))
    val dfPcllcus22 = spark.read.format("parquet").load("/mnt/slraw/ServiceCenter=*******.parquet")
    if (Result == false) println("Test Fail, Please check") else println("Test Pass")
  }
}
When I run this code in Eclipse, it says it cannot find the main class. But if I comment out the line val dfPcllcus22 = spark.read.format("parquet").load("/mnt/slraw/ServiceCenter=*******.parquet"), the test passes.
So it seems spark.read.format cannot be recognized.
I'm new to Scala and Databricks.
I have been researching this for several days but still cannot solve it. If anyone can help, I would really appreciate it.
The environment is a bit complicated for me; if more information is required, please let me know.
A SparkSession is needed to run your code in Eclipse; since your provided code does not have this line for SparkSession creation, it leads to an error:
val spark = SparkSession.builder.appName("SparkDBFSParquet").master("local[*]").getOrCreate()
Please add this line and run the code; it should work.
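For context, here is a minimal sketch (using the app name from the answer and the masked path from the question) of where that line sits relative to the read:
import org.apache.spark.sql.SparkSession

object Test {
  def main(args: Array[String]): Unit = {
    // Outside a notebook there is no predefined `spark` value, so create the
    // SparkSession explicitly before calling spark.read.
    val spark = SparkSession.builder
      .appName("SparkDBFSParquet")
      .master("local[*]")
      .getOrCreate()

    val dfPcllcus22 = spark.read.format("parquet")
      .load("/mnt/slraw/ServiceCenter=*******.parquet")
    dfPcllcus22.printSchema()

    spark.stop()
  }
}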

No module named 'pyspark' in Zeppelin

I am new to Spark and have just started using it. I am trying to import SparkSession from pyspark, but it throws the error 'No module named 'pyspark''. Please see my code below.
# Import our SparkSession so we can use it
from pyspark.sql import SparkSession
# Create our SparkSession, this can take a couple minutes locally
spark = SparkSession.builder.appName("basics").getOrCreate()
Error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-2-6ce0f5f13dc0> in <module>
1 # Import our SparkSession so we can use it
----> 2 from pyspark.sql import SparkSession
3 # Create our SparkSession, this can take a couple minutes locally
4 spark = SparkSession.builder.appName("basics").getOrCreate()
ModuleNotFoundError: No module named 'pyspark'
I am in my conda env, and I tried pip install pyspark, but I already have it.
If you are using Zepl, it has its own specific way of importing. This makes sense: it needs its own syntax since it runs in the cloud, and this distinguishes its specific syntax from plain Python. For instance, %spark.pyspark:
%spark.pyspark
from pyspark.sql import SparkSession

scipy UDF error when migrating to Spark 1.6

I started migrating from Spark 1.5 (Python) to Spark 1.6, and for some reason the following commands do not work anymore:
from scipy.stats import binom
from pyspark.sql.types import FloatType
BCDF = lambda Ps : binom.cdf(Ps[0],Ps[1],Ps[2])
sqlContext.udf.register('bcdf', BCDF, FloatType())
It yields the error:
no module named _tkinter
I tested that my scipy function was still working; everything is as expected on that front.
Has anyone experienced a similar issue?
Best
For some reason, importing stats rather than binom directly did the trick for me:
from scipy import stats
from pyspark.sql.types import FloatType
BCDF = lambda Ps : stats.binom.cdf(Ps[0],Ps[1],Ps[2])
sqlContext.udf.register('bcdf', BCDF, FloatType())