Scala with filter column

I get the following error when I run a command in a Databricks notebook; the same command runs without error in spark-shell:
error: not found: value col
altitiduDF.select("height", "terrain").filter(col("height") >= 11000...
^
I tried adding the following imports before the query, but they did not help:
import org.apache.spark.sql.Column
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.Row
Where can I find what I need to import to use the col function?

I found that the col function has to be imported explicitly in Databricks:
import org.apache.spark.sql.functions.col
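For anyone hitting the same thing, here is a minimal sketch of the query with that import in place. The sample rows and the altitudeDF name are made up for illustration, and the builder line is only there so the snippet also runs outside a notebook (in Databricks the spark session already exists and getOrCreate() simply returns it). As far as I know, spark-shell pre-imports org.apache.spark.sql.functions._ at startup, which would explain why the command worked there without an explicit import.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: local master, used only when no session exists yet.
// In a Databricks notebook getOrCreate() returns the existing `spark`.
val spark = SparkSession.builder()
  .appName("col-import-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data standing in for the DataFrame in the question.
val altitudeDF = Seq((12000, "mountain"), (300, "plain")).toDF("height", "terrain")

// col("height") builds a Column; it only resolves once functions.col is imported.
val high = altitudeDF.select("height", "terrain").filter(col("height") >= 11000)
high.show()
```

If you would rather not import col, altitudeDF("height") >= 11000 or $"height" >= 11000 (with spark.implicits._ in scope) should select the same rows.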

Related

Pyspark ModuleNotFound when importing custom package

Context: I'm running a script on Azure Databricks that imports functions from another file.
Let's say we have something like this in a file called "new_file":
from old_file import x
from pyspark.sql import SparkSession
from pyspark.context import SparkContext
from pyspark.sql.types import *
spark = SparkSession.builder.appName('workflow').config(
"spark.driver.memory", "32g").getOrCreate()
The imported function "x" will take as its argument a string that was read as a PySpark DataFrame, like so:
new_df_spark = spark.read.parquet(new_file)
new_df = ps.DataFrame(new_df_spark)
new_df is then passed as an argument to a function that calls the function x.
I then get an error like this:
ModuleNotFoundError: No module named "old_file"
Does this mean I can't use imports? Or do I need to install old_file on the cluster for this to work? If so, how would that work, and will the package update if I change old_file again?
Thanks

Unable to call Notebook when using scala code in Databricks

I can successfully run the snippet below in Azure Databricks when it is in a separate cell (CMD).
%run ./HSCModule
But I run into an error when I put that line in the same cell as other Scala code that imports the packages below:
import java.io.{File, FileInputStream}
import java.text.SimpleDateFormat
import java.util.{Calendar, Properties}
import org.apache.spark.SparkException
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._
import scala.util._
The error is:
:168: error: ';' expected but '.' found.
%run ./HSCModule
FYI: I have also tried dbutils.notebook.run and still face the same issue.
You can't mix magic commands such as %run or %pip with Scala/Python code in the same cell. The documentation says:
%run must be in a cell by itself, because it runs the entire notebook inline.
So you need to put this magic command into a separate cell.

Scala: Import scala.io.StdIn.readLine -- error importing

I am getting an error when I import readline as follows:
import scala.io.StdIn.{readline, readInt}
error: value readline is not a member of object scala.io.StdIn
import scala.io.StdIn.{readline, readInt}
Scala code runner version 2.12.1
If I don't import this, I get a deprecated message:
warning: there was one deprecation warning (since 2.11.0); re-run with -deprecation for details
one warning found
I get no errors if I use the full path to the function:
var x = scala.io.StdIn.readLine.toInt
Let me know if you can help me resolve the import. Thanks.
A very small oversight:
import scala.io.StdIn.{readLine, readInt}
readLine has an upper-case L.
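A small sketch of the corrected import in use. The readPerson helper and the canned "Ada" / 36 input are made up for illustration; Console.withIn just feeds that input so the snippet runs without a terminal. The deprecation warning you saw without the import most likely comes from the bare readLine resolving to Predef.readLine, which has been deprecated since 2.11.0 in favour of the scala.io.StdIn version.

```scala
import scala.io.StdIn.{readLine, readInt}
import java.io.StringReader

// Hypothetical helper: reads one line as a name, the next line as an Int.
def readPerson(): (String, Int) = {
  val name = readLine()
  val age = readInt()
  (name, age)
}

// StdIn reads from Console.in, so withIn can substitute canned input.
val person = Console.withIn(new StringReader("Ada\n36\n")) { readPerson() }
println(person)
```

In a real script you would drop the Console.withIn wrapper and call readPerson() directly so it reads from the keyboard.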

1: error: ';' expected but 'import' found

I am running this code in Zeppelin and getting the following error message:
from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext(appName="PythonSQL")
hive_context = HiveContext(sc)
bank = hive_context.table("default.invites_orc")
bank.show()
bank.registerTempTable("bank_temp")
hive_context.sql("select * from bank_temp").show()
sc.stop()
:1: error: ';' expected but 'import' found.
from pyspark import SparkContext
^
The Spark interpreter group currently has 4 interpreters, as listed here:
https://zeppelin.incubator.apache.org/docs/0.5.0-incubating/interpreter/spark.html
The default interpreter is %spark (Scala), which is why your Python code fails with a Scala compile error. The default is chosen by the order in which the interpreters are listed in the zeppelin.interpreters property of the zeppelin-site.xml config file.
The current order in your zeppelin-site.xml (zeppelin.interpreters property) will be:
org.apache.zeppelin.spark.SparkInterpreter,org.apache.zeppelin.spark.PySparkInterpreter
Change it to:
org.apache.zeppelin.spark.PySparkInterpreter, org.apache.zeppelin.spark.SparkInterpreter
and restart Zeppelin (zeppelin-daemon.sh restart).
This makes %pyspark the default interpreter.
Alternatively, start the paragraph with the interpreter directive, like this:
%pyspark
from pyspark import SparkContext

Spark Shell Import Fine, But Throws Error When Referencing Classes

I am a beginner in Apache Spark, so please excuse me if this is quite trivial.
Basically, I was running the following import in spark-shell:
import org.apache.spark.sql.{DataFrame, Row, SQLContext, DataFrameReader}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql._
import org.apache.hadoop.hive.ql.io.orc.{OrcInputFormat,OrcStruct};
import org.apa‌​che.hadoop.io.NullWritable;
...
val rdd = sc.hadoopFile(path,
classOf[org.apache.hadoop.hive.ql.io.orc.OrcInputFor‌​mat],
classOf[NullWritable],
classOf[OrcStruct],
1)
The import statements up to and including OrcInputFormat work fine, but the last one fails:
error: object apa‌​che is not a member of package org
import org.apa‌​che.hadoop.io.NullWritable;
This does not make sense to me, since the import statement before it goes through without any issue.
In addition, when referencing OrcInputFormat, I was told:
error: type OrcInputFor‌​mat is not a member of package org.apache.hadoop.hive.ql.io.orc
It seems strange that the import for OrcInputFormat works (I assume it does, since no error is thrown) and yet the error above turns up. Basically, I am trying to read ORC files from S3.
I would also like to understand what I have done wrong and why this happens.
What I have done:
I have tried running spark-shell with the --jars option to load hadoop-common-2.6.0.jar (my current version of Spark is 1.6.1, compiled with Hadoop 2.6).
I have tried val df = sqlContext.read.format("orc").load(PathToS3), as suggested in "Read ORC files directly from Spark shell", with the S3, S3n, and S3a variants, all without success.
You have 2 non-printing characters between org.apa and che in the last import, almost certainly from a copy-paste:
import org.apa‌​che.hadoop.io.NullWritable;
Just retype the last import statement by hand and it will work. Also, you don't need those semicolons.
You have the same problem with OrcInputFormat:
error: type OrcInputFor‌​mat is not member of package org.apache.hadoop.hive.ql.io.orc
Funnily enough, the mobile version of Stack Overflow shows those non-printing characters clearly.
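If you want to check a pasted line yourself, something along these lines should surface the hidden characters. This is only a sketch: the isInvisible and invisibleChars helpers are names I made up, and the sample string assumes the two stray characters are U+200C and U+200B (a common copy-paste artifact), which may not be exactly what is in your source.

```scala
// True for Unicode "format" characters (zero-width space, joiners, ...)
// and ISO control characters. Note that tab and newline count as controls,
// so apply this to one line of source at a time.
def isInvisible(c: Char): Boolean =
  Character.getType(c) == Character.FORMAT.toInt || Character.isISOControl(c)

// Returns (index, code point) for every invisible character in the string.
def invisibleChars(s: String): Seq[(Int, String)] =
  s.zipWithIndex.collect { case (c, i) if isInvisible(c) => (i, f"U+${c.toInt}%04X") }

// Assumed sample: the import from the question with two zero-width characters.
val pasted = "import org.apa\u200c\u200bche.hadoop.io.NullWritable"
println(invisibleChars(pasted))

// Stripping them gives back an import the compiler can resolve.
val cleaned = pasted.filterNot(isInvisible)
println(cleaned)
```

Retyping the line by hand works just as well; the check is mainly useful when the same snippet has been pasted in many places.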