I am trying to run simple PySpark test code that prints points using the magellan library, as in the GitHub repository, but sc is undefined.
If I run it from the command line with the proposed command $SPARK_HOME/bin/spark-submit --packages harsha2010:magellan:1.0.2-s_2.10, everything works because sc is imported automatically, but if I run it as a standalone application from Eclipse, sc is not recognized.
I have tried every combination I could think of to initialize it, including this piece of code:
from pyspark import SparkConf,SparkContext
from magellan.types import Point
from pyspark.sql import Row, SQLContext
#from magellan-master.python.magellan.context import sc
sc = SparkContext(appName="MyGeoFencing")
#sql = SQLContext(sc)
#from magellan.context import sc
#from magellan.context import sc
#from magellan.context import SQLContext
PointRecord = Row("id", "point")
#sparkConf = SparkConf().setAppName("MyGeoFencing")
#sc = SparkContext(conf=sparkConf)
#sql = SQLContext(sc)
sqlCont = SQLContext(sc)
points = sqlCont.parallelize([
(0, Point(-1.0, -1.0)),
(1, Point(-1.0, 1.0)),
(2, Point(1.0, -1.0))]).map(lambda x: PointRecord(*x)).toDF()
points.show()
The problem here is that sqlCont has no parallelize method.
I also tried importing sc directly from magellan.context, but that does not work either.
I have the same problem when I use Scala.
Do you have any idea how this should work?
Thanks!
This works for me:
sc = spark.sparkContext
from pyspark.sql import SQLContext
from pyspark.sql.types import *
sqlContext = SQLContext(sc)
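A minimal sketch of why this helps, assuming Spark 2.x or later with pyspark importable from the IDE: parallelize is a method of SparkContext, not of SQLContext, so the RDD has to be built from sc. The function name build_points_df and the plain (id, x, y) tuples are my own placeholders; the tuples stand in for the magellan Point objects so the snippet does not depend on that package.

```python
# Hypothetical sample data: plain (id, x, y) tuples stand in for magellan Point objects.
RAW_POINTS = [(0, -1.0, -1.0), (1, -1.0, 1.0), (2, 1.0, -1.0)]


def build_points_df(spark, raw_points=RAW_POINTS):
    """Turn (id, x, y) tuples into a DataFrame using an existing SparkSession."""
    # Imported here so that merely loading this file does not require pyspark.
    from pyspark.sql import Row

    PointRecord = Row("id", "x", "y")
    sc = spark.sparkContext  # parallelize is defined on SparkContext
    rdd = sc.parallelize(raw_points).map(lambda p: PointRecord(*p))
    return spark.createDataFrame(rdd)
```

In a standalone script nothing creates spark or sc for you, so you would build the session first, for example with spark = SparkSession.builder.appName("MyGeoFencing").getOrCreate(), and then call build_points_df(spark).show().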
I followed the solutions here, but I am still getting the "cannot resolve symbol SQLContext" error. ".implicits._" cannot be resolved either. What could be the reason?
Spark/Scala versions I use:
Scala 2.12.13
Spark 3.0.1 (without bundled Hadoop)
Here is my related code part:
import org.apache.log4j.LogManager
import org.apache.spark.{SparkConf, SparkContext}
object Count {
def main(args: Array[String]) {
...
...
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
}}
You didn't import SQLContext at all:
import org.apache.spark.sql.SQLContext
You should probably not use SQLContext anymore in the first place though:
As of Spark 2.0, this is replaced by SparkSession. However, we are keeping the class here for backward compatibility.
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/SQLContext.html
To get a SparkSession from an existing SparkContext, see "How to create SparkSession from existing SparkContext", and then import sparkSession.implicits._.
I would like to know the PySpark equivalent of the following Scala code. I am using Databricks. I need the same output as shown below.
To create a new Spark session and output the session id (SparkSession#123d0e8):
val new_spark = spark.newSession()
**Output**
new_spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession#123d0e8
To view the SparkContext and output the SparkContext id (SparkContext#2dsdas33):
new_spark.sparkContext
**Output**
org.apache.spark.SparkContext = org.apache.spark.SparkContext#2dsdas33
It's very similar. If you already have a session and want to open another one, you can use
my_session = spark.newSession()
print(my_session)
This will produce the new session object I think you are trying to create
<pyspark.sql.session.SparkSession object at 0x7fc3bae3f550>
Here spark is a session object that is already running, because you are using a Databricks notebook.
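To also cover the second half of the question (viewing the SparkContext), here is a rough sketch. It assumes spark is the session that the notebook already provides; the helper name describe_new_session is made up for illustration.

```python
def describe_new_session(spark):
    """Open a sibling session and collect the objects the question asks about."""
    new_spark = spark.newSession()  # new session object derived from the running one
    return {
        "session": new_spark,
        "context": new_spark.sparkContext,  # the context behind the new session
    }
```

In a notebook you could then run info = describe_new_session(spark) and print info["session"] and info["context"]. The printed values are Python object representations, so they will not look exactly like the Scala output.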
A SparkSession can be created as described at http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html:
>>> from pyspark.sql import SparkSession
>>> from pyspark.conf import SparkConf
>>> spark = SparkSession.builder.config(conf=SparkConf()).getOrCreate()
or
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('FirstSparkApp').getOrCreate()
I am new to Spark and I would like to read a CSV file into a DataFrame.
Spark 1.3.0 / Scala 2.3.0
This is what I have so far:
# Start Scala with CSV Package Module
spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
# Import Spark Classes
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import sqlCtx ._
# Create SparkConf
val conf = new SparkConf().setAppName("local").setMaster("master")
val sc = new SparkContext(conf)
# Create SQLContext
val sqlCtx = new SQLContext(sc)
# Create SparkSession and use it for all purposes:
val session = SparkSession.builder().appName("local").master("master").getOrCreate()
# Read CSV-File and turn it into Dataframe.
val df_fc = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")
However at SparkSession.builder() it gives the following error:
^
How can I fix this error?
SparkSession is only available from Spark 2.0 onward. In Spark 2 there is no need to create a SparkContext yourself; SparkSession is the single entry point to everything.
Since you are using version 1.x, try the following with the SQLContext you already created:
val df_fc = sqlCtx.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")
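The version rule above can be written down as a tiny helper. This is only an illustration in plain Python, not part of any Spark API; the function name entry_point_for is invented here.

```python
def entry_point_for(spark_version):
    """Return the name of the SQL entry point class for a Spark version string."""
    major = int(spark_version.split(".")[0])
    # SparkSession was introduced in Spark 2.0; 1.x code goes through SQLContext.
    return "SparkSession" if major >= 2 else "SQLContext"


print(entry_point_for("1.3.0"))  # SQLContext
print(entry_point_for("2.0.0"))  # SparkSession
```

The same rule explains the error in the question: on a 1.x installation the SparkSession class does not exist, so only the SQLContext route can work.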
I am using spark 1.4.0
When I try to import spark.implicits using the command import spark.implicits._, this error appears:
<console>:19: error: not found: value spark
import spark.implicits._
^
Can anyone help me resolve this problem?
This is because SparkSession is only available from Spark 2.0 onward, and the value spark in the Spark REPL is an object of type SparkSession.
In Spark 1.4 use
import sqlContext.implicits._
The value sqlContext is created automatically in the Spark REPL for Spark 1.x.
For completeness: outside the REPL, you first have to create a sqlContext yourself:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setMaster("local").setAppName("my app")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
I am using Spark 1.6.1 and Scala 2.10.5. I am trying to read a CSV file through com.databricks.
When launching the spark-shell, I use the following command:
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0 --driver-class-path path to/sqljdbc4.jar
Below is the whole code:
import java.util.Properties
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
val conf = new SparkConf().setAppName("test").setMaster("local").set("spark.driver.allowMultipleContexts", "true");
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = SQLContext.read().format("com.databricks.spark.csv").option("inferScheme","true").option("header","true").load("path_to/data.csv");
I am getting the error below:
error: value read is not a member of object org.apache.spark.sql.SQLContext
The "^" in the error message points at "SQLContext.read().format".
I tried the suggestions available on Stack Overflow and other sites, but nothing seems to work.
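SQLContext (capitalized) refers to the companion object of the class, which is roughly where static members would live, and it has no read method.
read is an instance method, so call it on the sqlContext value you created. Also note that the spark-csv option is spelled inferSchema, not inferScheme.
So the code should be:
val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema","true").option("header","true").load("path_to/data.csv")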