sparkml setParallelism for crossvalidator - scala

I am trying to set up cross-validation using Spark ML, but I am getting an error saying:
"value setParallelism is not a member of org.apache.spark.ml.tuning.CrossValidator"
I am following the tutorial on the Spark page. I am new to this, so any help is appreciated. Below is my code snippet:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
// Tokenizer
val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
// HashingTF
val hash_tf = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
// ML models
val l_regression = new LogisticRegression().setMaxIter(100).setRegParam(0.15)
// Pipeline
val pipe = new Pipeline().setStages(Array(tokenizer, hash_tf, l_regression))
val paramGrid = new ParamGridBuilder()
  .addGrid(hash_tf.numFeatures, Array(10, 100, 1000))
  .addGrid(l_regression.regParam, Array(0.1, 0.01, 0.001))
  .build()
val c_validator = new CrossValidator()
  .setEstimator(pipe)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
  .setParallelism(2)

setParallelism is available only in Spark 2.3 or later, so you must be using an earlier version. From the CrossValidator API docs:
(expert-only) Parameter setters
(...)
def setParallelism(value: Int): CrossValidator.this.type
Set the maximum level of parallelism to evaluate models in parallel. Default is 1 for serial evaluation
Annotations @Since( "2.3.0" )
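To confirm which version you are running, you can evaluate spark.version (or sc.version on Spark 1.x) in the shell. The sketch below is only an illustration of comparing such a version string against the 2.3 threshold; SparkVersionCheck, majorMinor and supportsParallelism are hypothetical names, not part of Spark.

```scala
object SparkVersionCheck {
  // Parses "major.minor[.patch...]" into (major, minor); None if the string is malformed.
  def majorMinor(version: String): Option[(Int, Int)] =
    version.split("\\.").toList match {
      case major :: minor :: _ =>
        scala.util.Try((major.toInt, minor.takeWhile(_.isDigit).toInt)).toOption
      case _ => None
    }

  // CrossValidator.setParallelism is annotated @Since("2.3.0"),
  // so any version from 2.3 onwards should have it.
  def supportsParallelism(version: String): Boolean =
    majorMinor(version).exists { case (major, minor) =>
      major > 2 || (major == 2 && minor >= 3)
    }
}
```

In the shell this could be used as SparkVersionCheck.supportsParallelism(spark.version). Note that the check is only diagnostic: on an older Spark the call to setParallelism does not compile at all, so it has to be removed or the Spark dependency upgraded.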

Related

Spark Error: java.io.NotSerializableException: scala.runtime.LazyRef

I am new to Spark; can you please help with this?
The simple pipeline below, which runs a logistic regression, produces an exception:
The code:
package pipeline.tutorial.com
import org.apache.log4j.Level
import org.apache.log4j.Logger
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.tuning.ParamGridBuilder
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.tuning.TrainValidationSplit
object PipelineDemo {
def main(args: Array[String]) {
Logger.getLogger("org").setLevel(Level.ERROR)
val conf = new SparkConf()
conf.set("spark.master", "local")
conf.set("spark.app.name", "PipelineDemo")
val sc = new SparkContext(conf)
val spark = SparkSession.builder().appName("PipelineDemo").getOrCreate()
val df = spark.read.json("C:/Spark-The-Definitive-Guide-master/data/simple-ml")
val rForm = new RFormula()
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val stages = Array(rForm, lr)
val pipeline = new Pipeline().setStages(stages)
val params = new ParamGridBuilder()
  .addGrid(rForm.formula, Array(
    "lab ~ . + color:value1",
    "lab ~ . + color:value1 + color:value2"))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .addGrid(lr.regParam, Array(0.1, 2.0))
  .build()
val evaluator = new BinaryClassificationEvaluator()
.setMetricName("areaUnderROC")
.setRawPredictionCol("prediction")
.setLabelCol("label")
val tvs = new TrainValidationSplit()
.setTrainRatio(0.75)
.setEstimatorParamMaps(params)
.setEstimator(pipeline)
.setEvaluator(evaluator)
val Array(train, test) = df.randomSplit(Array(0.7, 0.3))
val model = tvs.fit(train)
val rs = model.transform(test)
rs.select("features", "label", "prediction").show()
}
}
// end code.
The code runs fine from the spark-shell, but when I write it as a Spark application (using the Eclipse Scala IDE) it gives this error:
Caused by: java.io.NotSerializableException: scala.runtime.LazyRef
Thanks.
I solved it by removing the Scala library from the build path: right-click the Scala library container > Build Path > Remove from Build Path.
I am not sure about the root cause, though.
This error can be resolved by changing the Scala version in your project to 2.12.8 or higher; Scala 2.12.8 works and is very stable. You can change it in your project structure (in IntelliJ, press Ctrl+Alt+Shift+S): go to Global Libraries, remove the old Scala version with the - button, and add the new Scala version (2.12.8 or higher) with the + button.
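Both fixes above change which Scala library the IDE project runs with, which suggests (as an assumption; neither answer confirms the root cause) a mismatch between the project's Scala version and the one the Spark jars were built for. A minimal sketch for printing the Scala version actually on the runtime classpath, so it can be compared with the "Using Scala version ..." line in the spark-shell banner; ScalaVersionCheck and its members are hypothetical names:

```scala
object ScalaVersionCheck {
  // Version of the Scala library on the runtime classpath, e.g. "2.12.8".
  def runtimeScalaVersion: String = scala.util.Properties.versionNumberString

  // Binary-compatibility series, e.g. "2.12" for "2.12.8".
  def binaryVersion(full: String): String =
    full.split("\\.").take(2).mkString(".")

  def main(args: Array[String]): Unit =
    println(s"Scala runtime: $runtimeScalaVersion (binary ${binaryVersion(runtimeScalaVersion)})")
}
```

If the binary series printed from the IDE differs from the one reported by spark-shell, aligning the two is a reasonable first thing to try.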

SQLContext.getOrCreate is not a value

I am getting the error "SQLContext.getOrCreate is not a value of object org.apache.spark.SQLContext". This is my code:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.sql.functions
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types
import org.apache.spark.SparkContext
import java.io.Serializable
case class Sensor(id:String,date:String,temp:String,press:String)
object consum {
def main(args: Array[String]) {
val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
val ssc = new StreamingContext(sparkConf, Seconds(2))
val sc=new SparkContext(sparkConf)
val lines = KafkaUtils.createStream(ssc, "localhost:2181", "spark-streaming-consumer-group", Map("hello" -> 5))
def parseSensor(str:String): Sensor={
val p=str.split(",")
Sensor(p(0),p(1),p(2),p(3))
}
val data=lines.map(_._2).map(parseSensor)
val sqlcontext=new SQLContext(sc)
import sqlcontext.implicits._
data.foreachRDD { rdd=>
val sensedata=sqlcontext.getOrCreate(rdd.sparkContext)
}
I have tried with SQLContext.getOrCreate as well but same error.
In newer Spark versions, getOrCreate is normally called on SparkSession.builder rather than on SparkContext or SQLContext.
It returns a SparkSession instance, and the sparkContext and sqlContext instances are then obtained from that SparkSession.
I hope the explanation is clear.
Updated
The explanation above applies to newer versions of Spark. In the blog the OP is referencing, the author uses Spark 1.6, and the API doc for 1.6.3 clearly states, for SQLContext.getOrCreate:
Get the singleton SQLContext if it exists or create a new one using the given SparkContext
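For the Spark 1.6 API quoted above, a minimal sketch of how the call could look inside foreachRDD. It assumes Spark 1.5 or later on the classpath; SensorFrames and showSensors are hypothetical names, and Sensor mirrors the case class in the question. The point it illustrates is that getOrCreate is called on the SQLContext companion object rather than on an instance such as sqlcontext.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Same shape as the case class in the question.
case class Sensor(id: String, date: String, temp: String, press: String)

object SensorFrames {
  // Hypothetical helper, intended to be called as: data.foreachRDD(rdd => SensorFrames.showSensors(rdd))
  def showSensors(rdd: RDD[Sensor]): Unit = {
    // getOrCreate is defined on the SQLContext companion object (Spark 1.5+),
    // so it is invoked on the type name, not on an existing SQLContext value.
    val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
    import sqlContext.implicits._ // brings toDF() into scope for RDDs of case classes
    rdd.toDF().show()
  }
}
```

On Spark 2.x the same method still exists but is deprecated in favour of SparkSession.builder.getOrCreate().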

How to cast a variable of type MLlib vector type to ML vector type? [duplicate]

I am trying to create a LDA model on a JSON file.
Creating a spark context with the JSON file :
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder
.master("local")
.appName("my-spark-app")
.config("spark.some.config.option", "config-value")
.getOrCreate()
val df = spark.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt")
Displaying the df should show the DataFrame
display(df)
Tokenize the text
import org.apache.spark.ml.feature.RegexTokenizer
// Set params for RegexTokenizer
val tokenizer = new RegexTokenizer()
.setPattern("[\\W_]+")
.setMinTokenLength(4) // Filter away tokens with length < 4
.setInputCol("text")
.setOutputCol("tokens")
// Tokenize document
val tokenized_df = tokenizer.transform(df)
This should be displaying the tokenized_df
display(tokenized_df)
Get the stopwords
%sh wget http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words -O /tmp/stopwords
Optional: copying the stopwords to the tmp folder
%fs cp file:/tmp/stopwords dbfs:/tmp/stopwords
Collecting all the stopwords
val stopwords = sc.textFile("/tmp/stopwords").collect()
Filtering out the stopwords
import org.apache.spark.ml.feature.StopWordsRemover
// Set params for StopWordsRemover
val remover = new StopWordsRemover()
.setStopWords(stopwords) // This parameter is optional
.setInputCol("tokens")
.setOutputCol("filtered")
// Create new DF with Stopwords removed
val filtered_df = remover.transform(tokenized_df)
Displaying the filtered df should verify the stopwords got removed
display(filtered_df)
Vectorizing the frequency of occurrence of words
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.ml.feature.CountVectorizer
// Set params for CountVectorizer
val vectorizer = new CountVectorizer()
.setInputCol("filtered")
.setOutputCol("features")
.fit(filtered_df)
Verify the vectorizer
vectorizer.transform(filtered_df)
.select("id", "text","features","filtered").show()
After this I am seeing an issue when fitting LDA on the output of this vectorizer. I believe the issue is that CountVectorizer produces sparse vectors but LDA requires dense vectors. I am still trying to figure it out.
Here is the code where the map fails to match, followed by the exception.
import org.apache.spark.mllib.linalg.Vector
val ldaDF = countVectors.map {
case Row(id: String, countVector: Vector) => (id, countVector)
}
display(ldaDF)
Exception :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4083.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4083.0 (TID 15331, 10.209.240.17): scala.MatchError: [0,(1252,[13,17,18,20,30,37,45,50,51,53,63,64,96,101,108,125,174,189,214,221,224,227,238,268,291,309,328,357,362,437,441,455,492,493,511,528,561,613,619,674,764,823,839,980,1098,1143],[1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,2.0,1.0,5.0,1.0,2.0,2.0,1.0,4.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Here is a working LDA sample that does not throw any error:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
val a = Vectors.dense(Array(1.0,2.0,3.0))
val b = Vectors.dense(Array(3.0,4.0,5.0))
val df = Seq((1L,a),(2L,b),(2L,a)).toDF
val ldaDF = df.map { case Row(id: Long, countVector: Vector) => (id, countVector) }
val model = new LDA().setK(3).run(ldaDF.javaRDD)
display(df)
The only difference is that the second snippet uses dense vectors.
This has nothing to do with sparsity. Since Spark 2.0.0, ML Transformers no longer generate o.a.s.mllib.linalg.VectorUDT but o.a.s.ml.linalg.VectorUDT, and the values are mapped locally to subclasses of o.a.s.ml.linalg.Vector. These are not compatible with the old MLlib API, which has been moving towards deprecation since Spark 2.0.0.
You can convert to the "old" vector type using Vectors.fromML:
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.ml.linalg.{Vectors => NewVectors}
OldVectors.fromML(NewVectors.dense(1.0, 2.0, 3.0))
OldVectors.fromML(NewVectors.sparse(5, Seq(0 -> 1.0, 2 -> 2.0, 4 -> 3.0)))
but it makes more sense to use the ML implementation of LDA if you already use ML transformers.
For convenience you can use implicit conversions:
import scala.language.implicitConversions
object VectorConversions {
  import org.apache.spark.mllib.{linalg => mllib}
  import org.apache.spark.ml.{linalg => ml}
  // Old MLlib vector -> new ML vector
  implicit def toNewVector(v: mllib.Vector): ml.Vector = v.asML
  // New ML vector -> old MLlib vector
  implicit def toOldVector(v: ml.Vector): mllib.Vector = mllib.Vectors.fromML(v)
}
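The suggestion above to use the ML implementation of LDA could look roughly like the sketch below, which avoids the vector conversion entirely. It assumes Spark 2.x and a DataFrame that already has an ML vector column named "features" (for example the output of vectorizer.transform(filtered_df) in the question); MlLdaSketch and fitTopics are hypothetical names, and the iteration count is an arbitrary placeholder.

```scala
import org.apache.spark.ml.clustering.{LDA, LDAModel}
import org.apache.spark.sql.DataFrame

object MlLdaSketch {
  // `vectorized` is assumed to contain an ML vector column called "features".
  def fitTopics(vectorized: DataFrame, k: Int): LDAModel =
    new LDA()
      .setK(k)                    // number of topics
      .setMaxIter(10)             // placeholder value, tune as needed
      .setFeaturesCol("features") // column produced by CountVectorizer
      .fit(vectorized)
}
```

A call such as MlLdaSketch.fitTopics(vectorizer.transform(filtered_df), 3).describeTopics(5).show() would then list the top terms per topic.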
I changed:
val ldaDF = countVectors.map {
case Row(id: String, countVector: Vector) => (id, countVector)
}
to:
val ldaDF = countVectors.map { case Row(docId: String, features: MLVector) =>
(docId.toLong, Vectors.fromML(features)) }
And it worked like a charm! It is in line with what @zero323 has written.
List of imports:
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
The solution is very simple: change the Vector import as shown below.
//import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.linalg.Vector

Spark dataframe join is failing if key column contains a period(".") in the end

I am getting the exception below when I join two DataFrames in Spark (version 1.5, Scala 2.10).
Exception in thread "main" org.apache.spark.sql.AnalysisException: syntax error in attribute name: col1.;
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:99)
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:118)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:182)
at org.apache.spark.sql.DataFrame.resolve(DataFrame.scala:158)
at org.apache.spark.sql.DataFrame.col(DataFrame.scala:653)
at com.nielsen.buy.integration.commons.Demo$.main(Demo.scala:62)
at com.nielsen.buy.integration.commons.Demo.main(Demo.scala)
The code works fine if the column name in the DataFrame does not contain a period. Please help me out.
Here is the code that I am using.
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
import com.google.gson.Gson
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.Row
object Demo
{
lazy val sc: SparkContext = {
val conf = new SparkConf().setMaster("local")
.setAppName("demooo")
.set("spark.driver.allowMultipleContexts", "true")
new SparkContext(conf)
}
sc.setLogLevel("ERROR")
lazy val sqlcontext=new SQLContext(sc)
val data=List(Row("a","b"),Row("v","b"))
val dataRdd=sc.parallelize(data)
val schema = new StructType(Array(StructField("col.1",StringType,true),StructField("col2",StringType,true)))
val df1=sqlcontext.createDataFrame(dataRdd, schema)
val data2=List(Row("a","b"),Row("v","b"))
val dataRdd2=sc.parallelize(data2)
val schema2 = new StructType(Array(StructField("col3",StringType,true),StructField("col4",StringType,true)))
val df2=sqlcontext.createDataFrame(dataRdd2, schema2)
val val1="col.1"
val df3= df1.join(df2,df1.col(val1).equalTo(df2.col("col3")),"outer").show
}
In general, a period is used to access members of a struct field.
The Spark version you are using (1.5) is relatively old. Several such issues were fixed in later versions, so upgrading might solve the issue.
That said, you can simply use withColumnRenamed to rename the column to something without a period before the join.
So you basically do something like this:
val dfTmp = df1.withColumnRenamed(val1, "JOIN_COL")
val df3= dfTmp.join(df2,dfTmp.col("JOIN_COL").equalTo(df2.col("col3")),"outer").withColumnRenamed("JOIN_COL", val1)
df3.show
By the way, show returns Unit, so you probably meant to assign the join expression to df3 without it and call df3.show separately.
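As an alternative to renaming, newer Spark versions generally let you refer to a column whose name contains a period by wrapping the name in backticks. I have not verified this on 1.5, so treat it as an assumption there. A minimal sketch of a helper that builds such a quoted name; ColumnNames and quoted are hypothetical names, not part of Spark:

```scala
object ColumnNames {
  // Wraps a column name in backticks so that a period is read as part of the
  // name instead of as a struct-field accessor.
  // Assumes the name itself contains no backtick; such names are rejected.
  def quoted(name: String): String = {
    require(!name.contains("`"), s"unsupported column name: $name")
    "`" + name + "`"
  }
}
```

With it, the join condition could be written as df1.col(ColumnNames.quoted(val1)).equalTo(df2.col("col3")).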

Error with RDD[Vector] in function parameter

I am trying to define a function in Scala so that I can call it repeatedly with Spark.
Here is my code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.rdd._
val assembler = new VectorAssembler()
.setInputCols(Array("feature1", "feature2", "feature3"))
.setOutputCol("features")
val assembled = assembler.transform(df)
// measures the average distance to centroid, for a model built with a given k.
def clusteringScore(data: RDD[Vector],k:Int) = {
val kmeans = new KMeans()
.setK(k)
.setFeaturesCol("features")
.setPredictionCol("prediction")
val model = kmeans.fit(data)
val WSSSE = model.computeCost(data)
println(s"Within Set Sum of Squared Errors = $WSSSE")
}
(5 to 40 by 5).map(k => (k, clusteringScore(assembled, k))).
foreach(println)
With this code I get this error:
type Vector takes type parameters
I don't know what this error means.
Your imports do not include a Vector type, so Scala resolves Vector to the standard collections' Vector (which takes a type parameter, e.g. Vector[Int]) instead of Spark's Vector, which is a different type that you should import like this:
import org.apache.spark.mllib.linalg.Vector
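The name resolution described above can be reproduced without Spark. In the sketch below, the linalg object is a hypothetical stand-in for a library vector type (it is not Spark's class); it only illustrates that an unqualified Vector means the standard collection unless another Vector is imported.

```scala
// Hypothetical stand-in for a library vector type (not Spark's class).
object linalg {
  trait Vector { def toArray: Array[Double] }
  final case class DenseVector(values: Array[Double]) extends Vector {
    def toArray: Array[Double] = values
  }
}

object VectorNames {
  // With no import, `Vector` is scala.collection.immutable.Vector[A],
  // so it has to be written with a type argument. Writing `v: Vector` here
  // is what produces "type Vector takes type parameters".
  def sumStd(v: Vector[Double]): Double = v.sum

  // An explicit import (renamed here to keep both visible) selects the library type.
  import linalg.{Vector => LibVector}
  def sumLib(v: LibVector): Double = v.toArray.sum
}
```

Renaming on import, as with LibVector here, is one way to keep both types usable in the same file.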