Unable to create SparkContext object using Apache Spark 2.2 version - scala

I use MS Windows 7.
Initially, I tried a Scala program in Spark 1.6 and it worked fine (there, the SparkContext object is available as sc automatically).
When I tried Spark 2.2, sc was not available automatically, so I created one with the following steps:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val sc = new SparkConf().setAppName("myname").setMaster("mast")
new SparkContext(sc)
Now when I try to execute the parallelize method below, it gives me an error:
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
Error:
Value parallelize is not a member of org.apache.spark.SparkConf
I followed these steps using only the official documentation. Can anybody explain where I went wrong? Thanks in advance. :)

If spark-shell doesn't show this line on start:
Spark context available as 'sc' (master = local[*], app id = local-XXX).
run:
import org.apache.spark.SparkContext
val sc = SparkContext.getOrCreate()

The issue is that you created sc with type SparkConf, not SparkContext (both have the same initials).
To use the parallelize method in Spark 2.x or any other version, sc must be a SparkContext, not a SparkConf. The correct code looks like this:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
val sparkConf = new SparkConf().setAppName("myname").setMaster("local[*]") // "mast" is not a valid master URL; local[*] runs locally on all cores
val sc = new SparkContext(sparkConf)
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
This will give you the desired result.

Prefer SparkSession, as it is the entry point for Spark from version 2 onwards. You could try something like:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder()
  .master("local")
  .appName("spark session example")
  .getOrCreate()
val sc = spark.sparkContext
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)

There seems to be some problem with version 2.2.0 of Apache Spark. I replaced it with version 2.2.1, which is the latest one, and I now get the sc and spark variables automatically when I start spark-shell via cmd in Windows 7. I hope this helps someone.
I executed the code below, which creates an RDD, and it works perfectly. No need to import any packages.
val dataOne = sc.parallelize(1 to 10)
dataOne.collect() // returns the numbers 1 to 10 as an array

Your code should look like this:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
val conf = new SparkConf()
conf.setMaster("local[*]")
conf.setAppName("myname")
val sc = new SparkContext(conf)
NOTE: the master URL should be local[*].
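To see why "mast" is rejected while local[*] works, here is a rough sketch of the master URL forms Spark accepts. The names MasterUrlCheck and looksLikeMasterUrl are made up for illustration, and the patterns only approximate Spark's own parsing; the snippet needs no Spark dependency.

```scala
// Hypothetical helper, not part of Spark: a simplified approximation
// of the master URL forms that Spark recognises.
object MasterUrlCheck {
  // Matches local, local[4] and local[*]
  private val Local = """local(\[(\d+|\*)\])?""".r
  // Matches spark://host:port, mesos://... and k8s://...
  private val Cluster = """(spark|mesos|k8s)://.+""".r

  def looksLikeMasterUrl(master: String): Boolean = master match {
    case Local(_*)   => true
    case Cluster(_*) => true
    case "yarn"      => true
    case _           => false
  }
}
```

With this sketch, MasterUrlCheck.looksLikeMasterUrl("local[*]") is true, while "mast" and "master" are both rejected because they match none of the known forms.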

Related

CSV format is not loading in spark-shell

Using spark 1.6
I tried following code:
val diamonds = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/got_own/com_sep_fil.csv")
which caused the error
error: not found: value spark
In the Spark 1.6 shell you get sc of type SparkContext, not spark of type SparkSession. If you want that functionality, you need to instantiate a SQLContext:
import org.apache.spark.sql._
val spark = new SQLContext(sc)
sqlContext is the SQLContext object that the shell provides implicitly; it can be used to load a CSV file. Use com.databricks.spark.csv to specify the CSV file format:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data.csv")
You need to initialize an instance using SQLContext (Spark version < 2.0) or SparkSession (Spark version >= 2.0) to use the methods provided by Spark.
To initialize a spark instance for Spark version < 2.0, use:
import org.apache.spark.sql._
val spark = new SQLContext(sc)
To initialize a spark instance for Spark version >= 2.0, use:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("SparkSessionExample").master("local").getOrCreate()
To read the CSV using Spark 1.6 and the Databricks spark-csv package:
val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load("data.csv")
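The version split described above can be sketched as a small check on the version string, for example the one returned by sc.version in the shell. EntryPoint and its methods are hypothetical names used only for this illustration, and the sketch assumes the string starts with the major version number.

```scala
// Hypothetical helper, not part of Spark: pick the entry point to use
// from a Spark version string such as "1.6.0" or "2.2.1".
object EntryPoint {
  // Assumes the string starts with digits; otherwise toInt throws.
  def majorVersion(version: String): Int =
    version.trim.takeWhile(_.isDigit).toInt

  // SparkSession exists from Spark 2.0 onwards; earlier versions use SQLContext.
  def name(version: String): String =
    if (majorVersion(version) >= 2) "SparkSession" else "SQLContext"
}
```

For the Spark 1.6 shell in the question, EntryPoint.name("1.6.0") gives "SQLContext", which matches the "not found: value spark" error.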

Spark Error: java.io.NotSerializableException: scala.runtime.LazyRef

I am new to Spark; can you please help with this?
The simple pipeline below, which does a logistic regression, produces an exception:
The code:
package pipeline.tutorial.com
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.tuning.ParamGridBuilder
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.tuning.TrainValidationSplit
object PipelineDemo {
def main(args: Array[String]) {
Logger.getLogger("org").setLevel(Level.ERROR)
val conf = new SparkConf()
conf.set("spark.master", "local")
conf.set("spark.app.name", "PipelineDemo")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().appName("PipelineDemo").getOrCreate()
val df = spark.read.json("C:/Spark-The-Definitive-Guide-master/data/simple-ml")
val rForm = new RFormula()
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val stages = Array(rForm, lr)
val pipeline = new Pipeline().setStages(stages)
val params = new ParamGridBuilder().addGrid(rForm.formula, Array(
"lab ~ . + color:value1",
"lab ~ . + color:value1 + color:value2")).addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0)).addGrid(lr.regParam, Array(0.1, 2.0)).build()
val evaluator = new BinaryClassificationEvaluator()
.setMetricName("areaUnderROC")
.setRawPredictionCol("prediction")
.setLabelCol("label")
val tvs = new TrainValidationSplit()
.setTrainRatio(0.75)
.setEstimatorParamMaps(params)
.setEstimator(pipeline)
.setEvaluator(evaluator)
val Array(train, test) = df.randomSplit(Array(0.7, 0.3))
val model = tvs.fit(train)
val rs = model.transform(test)
rs.select("features", "label", "prediction").show()
}
}
// end code.
The code runs fine from the spark-shell.
When I write it as a Spark application (using the Eclipse Scala IDE), it gives the error:
Caused by: java.io.NotSerializableException: scala.runtime.LazyRef
Thanks.
I solved it by removing the Scala library from the build path: right-click the Scala library container > Build Path > Remove from Build Path.
I am not sure about the root cause, though.
This error can be resolved by changing the Scala version in your project to 2.12.8 or higher. Scala 2.12.8 works and is very stable. You can change this in your project structure (in IntelliJ, press Ctrl+Alt+Shift+S). Go to Global Libraries, remove the old Scala version with the - symbol, and add the new Scala version (2.12.8 or higher) with the + symbol.
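Both fixes change which Scala library the project runs against, which suggests (as an assumption; the thread does not confirm the root cause) a mismatch between the project's Scala version and the one the Spark jars were built for. A minimal sketch for checking which Scala library is actually on the classpath; ScalaVersionCheck is a made-up name:

```scala
// Reports the Scala library version that is actually on the classpath.
object ScalaVersionCheck {
  // For example "2.11.12" or "2.12.8"; the value depends on the build path.
  def libraryVersion: String = scala.util.Properties.versionNumberString

  // "2.12.8" -> "2.12": the binary version, as used in artifact suffixes like _2.12
  def binaryVersion(full: String): String =
    full.split('.').take(2).mkString(".")

  def main(args: Array[String]): Unit =
    println(s"Scala library: $libraryVersion (binary version ${binaryVersion(libraryVersion)})")
}
```

Running it from the IDE and comparing the binary version with the suffix of the Spark artifacts on the build path shows whether the two agree.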

How to fix 22: error: not found: value SparkSession in Scala?

I am new to Spark and I would like to read a CSV file into a DataFrame.
Spark 1.3.0 / Scala 2.3.0
This is what I have so far:
# Start Scala with CSV Package Module
spark-shell --packages com.databricks:spark-csv_2.10:1.3.0
# Import Spark Classes
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import sqlCtx ._
# Create SparkConf
val conf = new SparkConf().setAppName("local").setMaster("master")
val sc = new SparkContext(conf)
# Create SQLContext
val sqlCtx = new SQLContext(sc)
# Create SparkSession and use it for all purposes:
val session = SparkSession.builder().appName("local").master("master").getOrCreate()
# Read CSV-File and turn it into Dataframe.
val df_fc = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")
However, at SparkSession.builder() it gives the following error:
error: not found: value SparkSession
How can I fix this error?
SparkSession is only available from Spark 2 onwards. In Spark 2 there is no need to create a SparkContext yourself; SparkSession itself provides the gateway to everything.
Try the following, as you are using version 1.x:
val df_fc = sqlCtx.read.format("com.databricks.spark.csv").option("header", "true").load("/home/Desktop/test.csv")

Spark dataframe join is failing if key column contains a period(".") in the end

I get the exception below when I join two DataFrames in Spark (version 1.5, Scala 2.10).
Exception in thread "main" org.apache.spark.sql.AnalysisException: syntax error in attribute name: col1.;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:118)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:182)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:653)
at com.nielsen.buy.integration.commons.Demo$.main(Demo.scala:62)
at com.nielsen.buy.integration.commons.Demo.main(Demo.scala)
The code works fine if the column name in the DataFrame does not contain a period. Please help me out.
This is the code that I am using:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import com.google.gson.Gson
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row
object Demo
{
lazy val sc: SparkContext = {
val conf = new SparkConf().setMaster("local")
.setAppName("demooo")
.set("spark.driver.allowMultipleContexts", "true")
new SparkContext(conf)
}
sc.setLogLevel("ERROR")
lazy val sqlcontext=new SQLContext(sc)
val data=List(Row("a","b"),Row("v","b"))
val dataRdd=sc.parallelize(data)
val schema = new StructType(Array(StructField("col.1",StringType,true),StructField("col2",StringType,true)))
val df1=sqlcontext.createDataFrame(dataRdd, schema)
val data2=List(Row("a","b"),Row("v","b"))
val dataRdd2=sc.parallelize(data2)
val schema2 = new StructType(Array(StructField("col3",StringType,true),StructField("col4",StringType,true)))
val df2=sqlcontext.createDataFrame(dataRdd2, schema2)
val val1="col.1"
val df3= df1.join(df2,df1.col(val1).equalTo(df2.col("col3")),"outer").show
}
In general, a period is used to access members of a struct field.
The Spark version you are using (1.5) is relatively old. Several such issues were fixed in later versions, so upgrading might just solve the issue.
That said, before the join you can simply use withColumnRenamed to rename the column to something that does not contain a period.
So you basically do something like this:
val dfTmp = df1.withColumnRenamed(val1, "JOIN_COL")
val df3= dfTmp.join(df2,dfTmp.col("JOIN_COL").equalTo(df2.col("col3")),"outer").withColumnRenamed("JOIN_COL", val1)
df3.show
By the way, show returns Unit, so you probably meant df3 to be the expression without it, calling df3.show separately.
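As an alternative to renaming, Spark lets you quote a column name in backticks so that a period in it is not read as struct field access. Whether this resolves the error on 1.5 specifically has not been verified here. A minimal sketch; ColumnNames.quoted is a hypothetical helper, not a Spark API:

```scala
// Hypothetical helper, not part of Spark: wrap a column name in backticks
// so that a period in it is treated as part of the name.
object ColumnNames {
  def quoted(name: String): String = {
    // Assumption: names that already contain a backtick are out of scope.
    require(!name.contains("`"), s"unsupported column name: $name")
    "`" + name + "`"
  }
}
```

The join condition would then use df1.col(ColumnNames.quoted(val1)) instead of df1.col(val1).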

Importing Spark libraries using Intellij IDEA

I would like to use spark SQL in an Intellij IDEA SBT project.
Even though I have imported the library the code does not seem to import it.
Spark Core seems to be working however.
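You can't create a DataFrame directly from a Scala List[A]. You first need to create an RDD[A] and then transform that into a DataFrame. You also need a SQLContext:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext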
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("test")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val test = sc.parallelize(List(1,2,3,4)).toDF
For reference, this is what the Spark 2.0 boilerplate with Spark SQL should look like:
import org.apache.spark.sql.SparkSession
object Test {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.master("local")
.appName("some name")
.getOrCreate()
// brings toDF and the other implicit conversions into scope
import spark.implicits._
}
}