Apache spark error: not found: value sqlContext - scala

I am trying to set up Spark on Windows 10. Initially I faced this error while starting, and the solution in the link helped. However, I am still not able to run import sqlContext.sql, as it still throws an error:
----------------------------------------------------------------
Fri Mar 24 12:07:05 IST 2017:
Booting Derby version The Apache Software Foundation - Apache Derby - 10.12.1.1 - (1704137): instance a816c00e-015a-ff08-6530-00000ac1cba8
on database directory C:\metastore_db with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@37606fee
Loaded from file:/F:/Soft/spark/spark-2.1.0-bin-hadoop2.7/bin/../jars/derby-10.12.1.1.jar
java.vendor=Oracle Corporation
java.runtime.version=1.8.0_101-b13
user.dir=C:\
os.name=Windows 10
os.arch=amd64
os.version=10.0
derby.system.home=null
Database Class Loader started - derby.database.classpath=''
17/03/24 12:07:09 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.128.18.22:4040
Spark context available as 'sc' (master = local[*], app id = local-1490337421381).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_101)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import sqlContext.sql
<console>:23: error: not found: value sqlContext
import sqlContext.sql
^

Spark context available as 'sc' (master = local[*], app id = local-1490337421381).
Spark session available as 'spark'.
In Spark 2.0.x, the entry point to Spark is SparkSession, and it is available in the Spark shell as spark, so try it this way:
spark.sqlContext.sql(...)
You can also create your own SQLContext like this:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
The first option is my choice, as the Spark shell has already created one for you, so make use of it.
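As a minimal sketch of both options inside the spark-shell (the people dataframe and view name are hypothetical, only there to give the query something to hit):
// assumes the spark-shell, where 'spark' and 'sc' already exist
val df = spark.range(5).toDF("id")      // hypothetical data, just for the demo
df.createOrReplaceTempView("people")    // hypothetical view name

// option 1: reuse the SQLContext carried by the shell's SparkSession
spark.sqlContext.sql("SELECT * FROM people").show()

// option 2: build your own SQLContext from the SparkContext (still works, though deprecated in 2.x)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql("SELECT * FROM people").show()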

Since you are using Spark 2.1, you'll have to use the SparkSession object. You can get a reference to the SparkContext from the SparkSession object:
val sSession = org.apache.spark.sql.SparkSession.builder().getOrCreate()
val sContext = sSession.sparkContext
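For completeness, outside the shell (for example in a standalone application) you build the session yourself; a minimal sketch, where the app name and master are only illustrative:
import org.apache.spark.sql.SparkSession

object SessionDemo {
  def main(args: Array[String]): Unit = {
    val sSession = SparkSession.builder()
      .appName("session-demo")   // hypothetical app name
      .master("local[*]")        // hypothetical master for a local run
      .getOrCreate()

    val sContext = sSession.sparkContext   // the underlying SparkContext
    val sqlContext = sSession.sqlContext   // the SQLContext, if legacy code needs it

    sSession.stop()
  }
}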

If you are on Cloudera and have this issue, the solution from this GitHub ticket worked for me (https://github.com/cloudera/clusterdock/issues/30):
The root user (who you're running as when you start spark-shell) has no user directory in HDFS. If you create one (sudo -u hdfs hdfs dfs -mkdir /user/root followed by sudo -u hdfs hdfs dfs -chown root:root /user/root), this should be fixed.
I.e. create a home directory in HDFS for the user running spark-shell. This fixed it for me.

And don't forget the imports!
import org.apache.spark.sql.{SparkSession, types}

You have to create a SQLContext in order to execute SQL statements. In Spark 2.0, you can get the SQLContext easily from the SparkSession, like below.
val sqlContext = spark.sqlContext
sqlContext.sql("SELECT * FROM sometable")
Alternatively, you could also execute SQL statements using SparkSession like below.
spark.sql("SELECT * FROM sometable")

Related

Unable to create dataframe using SQLContext object in spark2.2

I am using Spark version 2.2 on Microsoft Windows 7. I want to load a csv file into a variable to perform SQL-related actions later on, but I am unable to do so. I referred to the accepted answer from this link, but it was of no use. I followed the steps below for creating the SparkContext object and SQLContext object:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val sc=SparkContext.getOrCreate() // Creating spark context object
val sqlContext = new org.apache.spark.sql.SQLContext(sc) // Creating SQL object for query related tasks
The objects are created successfully, but when I execute the code below it throws an error which can't be posted here.
val df = sqlContext.read.format("csv").option("header", "true").load("D://ResourceData.csv")
And when I try something like df.show(2), it says that df was not found. I tried the Databricks solution for loading CSV from the attached link. It downloads the packages but doesn't load the csv file. So how can I rectify my problem? Thanks in advance :)
I solved my problem of loading a local file into a dataframe using the 1.6 version in the Cloudera VM, with the help of the code below:
1) sudo spark-shell --jars /usr/lib/spark/lib/spark-csv_2.10-1.5.0.jar,/usr/lib/spark/lib/commons-csv-1.5.jar,/usr/lib/spark/lib/univocity-parsers-1.5.1.jar
2) val df1 = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("treatEmptyValuesAsNulls", "true" ).option("parserLib", "univocity").load("file:///home/cloudera/Desktop/ResourceData.csv")
NOTE: sc and sqlContext variables are automatically created
But there are many improvements in the latest version, i.e. 2.2.1, which I am unable to use because metastore_db doesn't get created on Windows 7. I'll post a new question regarding that.
With reference to your comment that you are able to access the SparkSession variable, follow the steps below to process your csv file using Spark SQL.
Spark SQL is a Spark module for structured data processing.
There are mainly two abstractions - Dataset and DataFrame:
A Dataset is a distributed collection of data.
A DataFrame is a Dataset organized into named columns.
In the Scala API, DataFrame is simply a type alias of Dataset[Row].
With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
You have a csv file and you can simply create a dataframe by doing one of the following:
From your spark-shell using the SparkSession variable spark:
val df = spark.read
.format("csv")
.option("header", "true")
.load("sample.csv")
After reading the file into a dataframe, you can register it as a temporary view.
df.createOrReplaceTempView("foo")
SQL statements can be run using the sql method provided by Spark:
val fooDF = spark.sql("SELECT name, age FROM foo WHERE age BETWEEN 13 AND 19")
You can also query that file directly with SQL:
val df = spark.sql("SELECT * FROM csv.`file:///path to the file/`")
Make sure that you run Spark in local mode when you load data from a local path, or else you will get an error. The error occurs when you have already set the HADOOP_CONF_DIR environment variable, which makes Spark expect "hdfs://..." paths rather than "file://".
Set your spark.sql.warehouse.dir (default: ${system:user.dir}/spark-warehouse):
.config("spark.sql.warehouse.dir", "file:///C:/path/to/my/")
It is the default location of the Hive warehouse directory (using Derby) with managed databases and tables. Once you set the warehouse directory, Spark will be able to locate your files, and you can load the csv.
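Putting the two together, a minimal sketch of a session built for a local run on Windows (the app name, warehouse path and csv path are placeholders):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("local-csv-demo")    // hypothetical app name
  .master("local[*]")           // force local mode
  .config("spark.sql.warehouse.dir", "file:///C:/tmp/spark-warehouse")  // placeholder path
  .getOrCreate()

val df = spark.read
  .format("csv")
  .option("header", "true")
  .load("file:///C:/data/ResourceData.csv")   // local file URI, placeholder path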
Reference : Spark SQL Programming Guide
Spark version 2.2.0 has built-in support for csv.
In your spark-shell run the following code
val df= spark.read
.option("header","true")
.csv("D:/abc.csv")
df: org.apache.spark.sql.DataFrame = [Team_Id: string, Team_Name: string ... 1 more field]
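If you also want Spark to guess the column types instead of reading everything as strings, a small variation (inferSchema is a standard option of the csv reader; the path is still a placeholder):
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")   // let Spark infer column types from the data
  .csv("D:/abc.csv")

df.printSchema()   // inspect the inferred types
df.show(2)         // the call from the question, now against an existing df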

spark cassandra intelliJ scala

Environment:
My Laptop:
IntelliJ IDEA 2017.1.4
JRE: 1.8.0_112-release-736-b21 x86_64
JVM: OpenJDK 64-Bit Server VM by JetBrains s.r.o
Mac OS X 10.12.2
IP Address: 192.168.3.203
AWS Environment:
Linux ip-172-31-16-45 x86_64 x86_64 GNU/Linux
DSE Version : 5.1.0
IP Address: 172.31.16.45
PUBLIC IP: 35.35.42.42
Executed the following commands:
[ec2-user@ip-172-31-16-45 ~]$ dse cassandra -k
and then
[ec2-user@ip-172-31-16-45 ~]$ dse spark
The log file is at /home/ec2-user/.spark-shell.log
warning: there was one deprecation warning; re-run with -deprecation for details
New Spark Session
WARN 2017-09-13 13:18:26,907 org.apache.spark.SparkContext: Use an existing SparkContext, some configuration may not take effect.
Extracting Spark Context
Extracting SqlContext
Spark context Web UI available at http://172.31.16.45:4040
Spark context available as 'sc' (master = dse://?, app id = app-20170913131826-0001).
Spark session available as 'spark'.
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
Spark Version: 2.0.2.6 ( One I see on Web Browser when I open http://35.35.42.42:4040/executors/)
Both Spark and Cassandra database are running on 35.35.42.42(public IP) or (Local IP: 172.31.16.45)
When I continue coding from the Scala environment on AWS, my code runs fine. What I need is to run the same code from IntelliJ. Is that possible? If not, what is the process?
Now I opened IntelliJ on my laptop and tried running below code:
// In .setMaster I tried all - .setMaster ("all combinations of local IP, public IP, spark, dse etc")
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import com.datastax.spark.connector._
object helloworld {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Simple Application").setMaster("dse://35.35.42.42:4040")
    // val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://172.31.16.45:4040")
    // val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://35.35.42.42:7077")
    // val conf = new SparkConf().setAppName("Simple Application").setMaster("spark://172.31.16.45:7077")
    val sc = new SparkContext(conf)
    // val rdd = sc.cassandraTable("tdata", "map")
  }
}
Error:
org.apache.spark.SparkException: Could not parse Master URL: 'dse://35.35.42.42:4040'

Unable to connect to remote Spark Master - ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message

I posted this question some time ago, but it turned out that I was using my local resources instead of the remote ones.
I have a remote machine configured with Spark 2.1.1, Cassandra 3.0.9 and Scala 2.11.6.
Cassandra is configured at localhost:9032 and spark master at localhost:7077.
Spark master is set to 127.0.0.1 and its port to 7077.
I'm able to connect to Cassandra remotely but unable to do the same thing with Spark.
When connecting to the remote spark master, I get the following error:
ERROR TransportRequestHandler: Error while invoking RpcHandler#receive() for one-way message.
Here are my settings in code:
val configuration = new SparkConf(true)
  .setAppName("myApp")
  .setMaster("spark://xx.xxx.xxx.xxx:7077")
  .set("spark.cassandra.connection.host", "xx.xxx.xxx.xxx")
  .set("spark.cassandra.connection.port", "9042") // SparkConf.set takes String values
  .set("spark.cassandra.input.consistency.level", "ONE")
  .set("spark.driver.allowMultipleContexts", "true")
val sparkSession = SparkSession
  .builder()
  .appName("myAppEx")
  .config(configuration)
  .enableHiveSupport()
  .getOrCreate()
I don't understand why Cassandra works just fine while Spark does not.
What's causing this? How can I solve it?
I'm answering this question in order to help other people who are struggling with the same problem.
It turned out that it was caused by a mismatch between IntelliJ IDEA's Scala version and the server's one.
The server had Scala ~ 2.11.6, while the IDE was using Scala ~ 2.11.8.
In order to make sure of using the very same version, it was necessary to change the IDE's Scala version with the following steps:
File > Project Structure > + > Scala SDK > Find and select the server's Scala version > Download it if you don't already have it installed > OK > Apply > OK
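If the project is built with sbt, the same idea can be pinned down in build.sbt; a minimal sketch, where the Spark version is only illustrative and the key line is scalaVersion, matched to the server:
// build.sbt (sketch)
scalaVersion := "2.11.6"   // match the server's Scala version exactly

// illustrative dependency; use the Spark version your cluster actually runs
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.1" % "provided"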
Could this be a typo? The error message reports 8990, but in your connection config you have port 8090 for Spark.

Spark-Scala with Cassandra

I am a beginner with Spark, Scala and Cassandra, and I am working on ETL programming.
My project's ETL POCs require Spark, Scala and Cassandra. I configured Cassandra on my Ubuntu system in /usr/local/Cassandra/* and after that installed Spark and Scala. Now I am using the Scala editor to start my work; I simply loaded a file into a landing location, but after that I am trying to connect with Cassandra from Scala and cannot find any help on how to connect and process the data into the destination database.
Can anyone tell me whether this is the correct way, or where I am going wrong? Please help me understand how to achieve this process with the above combination.
Thanks in advance!
Add spark-cassandra-connector to your pom or sbt by following its instructions (a sample sbt line is sketched below), then work this way.
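For sbt, the dependency might look like this (the version is only an assumption; pick the connector release that matches your Spark and Cassandra versions):
// build.sbt (sketch)
libraryDependencies += "com.datastax.spark" %% "spark-cassandra-connector" % "2.0.7"   // assumed version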
Import this in your file
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.sql.cassandra._
spark scala file
object SparkCassandraConnector {
  def main(args: Array[String]) {
    val conf = new SparkConf(true)
      .setAppName("UpdateCassandra")
      .setMaster("spark://spark:7077") // spark server
      .set("spark.cassandra.input.split.size_in_mb", "64") // this setting is in MB, not bytes
      .set("spark.cassandra.connection.host", "192.168.3.167") // cassandra host
      .set("spark.cassandra.auth.username", "cassandra")
      .set("spark.cassandra.auth.password", "cassandra")

    // connecting with cassandra for spark and sql query
    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()

    // Load data from node publish table
    val df = spark
      .read
      .cassandraFormat("table_name", "keyspace_name")
      .load()
  }
}
This will work for Spark 2.2 and Cassandra 2; you can perform this easily with spark-cassandra-connector.
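As a follow-up, writing a DataFrame back to Cassandra goes through the same cassandraFormat helper on the writer side; a minimal sketch, assuming the same spark session and the placeholder table/keyspace names from above:
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SaveMode

// append the rows of df to an existing Cassandra table (placeholder names)
df.write
  .cassandraFormat("table_name", "keyspace_name")
  .mode(SaveMode.Append)
  .save()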

SparkSession returns nothing with a HiveServer2 connection through JDBC

I have an issue with reading data from a remote HiveServer2 using JDBC and SparkSession in Apache Zeppelin.
Here is the code.
%spark
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
val prop = new java.util.Properties
prop.setProperty("user","hive")
prop.setProperty("password","hive")
prop.setProperty("driver", "org.apache.hive.jdbc.HiveDriver")
val test = spark.read.jdbc("jdbc:hive2://xxx.xxx.xxx.xxx:10000/", "tests.hello_world", prop)
test.select("*").show()
When I run this, I get no errors but no data either; I just retrieve all the column names of the table, like this:
+--------------+
|hello_world.hw|
+--------------+
+--------------+
Instead of this :
+--------------+
|hello_world.hw|
+--------------+
+ data_here +
+--------------+
I'm running all of this on:
Scala 2.11.8,
OpenJDK 8,
Zeppelin 0.7.0,
Spark 2.1.0 ( bde/spark ),
Hive 2.1.1 ( bde/hive )
I run this setup in Docker, where each of those has its own container, but they are connected on the same network.
Furthermore, it just works when I use the Spark beeline to connect to my remote Hive.
Did I forget something?
Any help would be appreciated.
Thanks in advance.
EDIT:
I've found a workaround, which is sharing a Docker volume or Docker data-container between Spark and Hive, more precisely the Hive warehouse folder, and configuring spark-defaults.conf. Then you can access Hive through SparkSession without JDBC. Here is how to do it, step by step:
Share the Hive warehouse folder between Spark and Hive
Configure spark-defaults.conf like this:
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.memory Xg
spark.driver.cores X
spark.executor.memory Xg
spark.executor.cores X
spark.sql.warehouse.dir file:///your/path/here
Replace 'X' with your values.
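Once the warehouse folder is shared and spark-defaults.conf points at it, a minimal sketch of reading the same table through the session directly, without JDBC (the database/table names come from the question; enableHiveSupport is needed when you build the session yourself rather than rely on Zeppelin's, and this assumes the shared warehouse makes the table visible to Spark's catalog):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-without-jdbc")   // hypothetical app name
  .enableHiveSupport()            // read Hive tables through the shared warehouse
  .getOrCreate()

val test = spark.sql("SELECT * FROM tests.hello_world")
test.show()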
Hope it helps.