Error with RDD[Vector] in function parameter - scala

I am trying to define a function in Scala that I can call repeatedly with Spark. Here is my code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.feature.VectorIndexer
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.rdd._
val assembler = new VectorAssembler()
  .setInputCols(Array("feature1", "feature2", "feature3"))
  .setOutputCol("features")
val assembled = assembler.transform(df)
// measures the average distance to centroid, for a model built with a given k
def clusteringScore(data: RDD[Vector], k: Int) = {
  val kmeans = new KMeans()
    .setK(k)
    .setFeaturesCol("features")
    .setPredictionCol("prediction")
  val model = kmeans.fit(data)
  val WSSSE = model.computeCost(data)
  println(s"Within Set Sum of Squared Errors = $WSSSE")
}
(5 to 40 by 5).map(k => (k, clusteringScore(assembled, k))).foreach(println)
With this code I get this error:
type Vector takes type parameters
I don't know what this error means...

You are probably picking up the Scala standard collections' Vector (the one that takes a type parameter, e.g. Vector[Int]) instead of the Spark Vector: your imports bring in org.apache.spark.mllib.linalg.Vectors but not Vector itself, which is a different type and should be imported like this:
import org.apache.spark.mllib.linalg.Vector
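With that import the type error goes away. Note also that the DataFrame-based org.apache.spark.ml.clustering.KMeans you imported is fitted on a DataFrame, not on an RDD[Vector], so a minimal sketch of the function (assuming a Spark 2.x build where KMeansModel.computeCost is still available) would look like this:
import org.apache.spark.sql.DataFrame
import org.apache.spark.ml.clustering.KMeans

def clusteringScore(data: DataFrame, k: Int): Double = {
  val kmeans = new KMeans()
    .setK(k)
    .setFeaturesCol("features")
    .setPredictionCol("prediction")
  val model = kmeans.fit(data)
  // Within-set sum of squared errors for this k
  val WSSSE = model.computeCost(data)
  println(s"Within Set Sum of Squared Errors = $WSSSE")
  WSSSE
}

(5 to 40 by 5).map(k => (k, clusteringScore(assembled, k))).foreach(println)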

Related

select the first element after sorting column and convert it to list in scala

What is the most efficient way to sort one column in a data frame, convert it to a list, and assign the first element to a variable in Scala? I tried the following:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, first, regexp_replace}
import org.apache.spark.sql.functions._
println(CONFIG.getString("spark.appName"))
val conf = new SparkConf()
.setAppName(CONFIG.getString("spark.appName"))
.setMaster(CONFIG.getString("spark.master"))
val spark: SparkSession = SparkSession.builder().config(conf).getOrCreate()
val df = spark.read.format("com.databricks.spark.csv").option("delimiter", ",").load("file.csv")
val dfb=df.sort(desc("_c0"))
val list=df.select(df("_c0")).distinct
but I'm still not able to save the first element as a variable.
Use select, orderBy, map & head.
This assumes column _c0 is of type string; if it has a different type, adjust the type parameter in _.getAs[<your column datatype>].
Check the code below.
scala> import spark.implicits._
import spark.implicits._
scala> val first = df
.select($"_c0")
.orderBy($"_c0".desc)
.map(_.getAs[String](0))
.head
Or
scala> import spark.implicits._
import spark.implicits._
scala> val first = df
.select($"_c0")
.orderBy($"_c0".desc)
.head
.getAs[String](0)
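For completeness, the same value can also be obtained with first(), which returns the top Row directly (a sketch, under the same assumption that _c0 is a string column):
scala> val first = df
.select($"_c0")
.orderBy($"_c0".desc)
.first()        // first Row after sorting
.getString(0)   // extract column 0 as a String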

sparkml setParallelism for crossvalidator

So I am trying to set up cross-validation using Spark ML, but I am getting a run-time error saying that
"value setParallelism is not a member of org.apache.spark.ml.tuning.CrossValidator"
I am currently following the Spark documentation tutorial. I am new to this, so any help is appreciated. Below is my code snippet:
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
// Tokenizer
val tokenizer = new Tokenizer().setInputCol("tweet").setOutputCol("words")
// HashingTF
val hash_tf = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
// ML models
val l_regression = new LogisticRegression().setMaxIter(100).setRegParam(0.15)
// Pipeline
val pipe = new Pipeline().setStages(Array(tokenizer, hash_tf, l_regression))
val paramGrid = new ParamGridBuilder()
.addGrid(hash_tf.numFeatures, Array(10,100,1000))
.addGrid(l_regression.regParam, Array(0.1,0.01,0.001))
.build()
val c_validator = new CrossValidator()
.setEstimator(pipe)
.setEvaluator(new BinaryClassificationEvaluator)
.setEstimatorParamMaps(paramGrid)
.setNumFolds(3)
.setParallelism(2)
setParallelism is available only in Spark 2.3 or later. You must be using an earlier version:
(expert-only) Parameter setters
(...)
def setParallelism(value: Int): CrossValidator.this.type
Set the maximum level of parallelism to evaluate models in parallel. Default is 1 for serial evaluation
Annotations #Since( "2.3.0" )
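If upgrading is not an option, the minimal workaround is to drop the call and accept serial evaluation (a sketch, assuming a pre-2.3 cluster); you can confirm the running version first:
// Check which Spark version you are actually running.
println(sc.version)  // or spark.version when using a SparkSession

// On Spark < 2.3, omit setParallelism; models are then evaluated serially.
val c_validator = new CrossValidator()
  .setEstimator(pipe)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)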

How to provide multiple columns to setInputCol()

I am very new to Spark machine learning. I want to pass multiple columns to features: in the code below I only pass the Date column, but now I want to pass both the Userid and Date columns. I tried to use a Vector, but it only supports the Double data type, while in my case I have Int and String.
I would be thankful if anyone could provide a suggestion, solution, or code example that fulfills my requirement.
Code:
case class LabeledDocument(Userid: Double, Date: String, label: Double)
val training = spark.read.option("inferSchema", true).csv("/root/Predictiondata3.csv").toDF("Userid","Date","label").toDF().as[LabeledDocument]
import scala.beans.BeanInfo
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.{Row, SQLContext}
val tokenizer = new Tokenizer().setInputCol("Date").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression().setMaxIter(100).setRegParam(0.001).setElasticNetParam(0.0001)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training.toDF())
case class Document(Userid: Integer, Date: String)
val test = sc.parallelize(Seq(Document(4, "04-Jan-18"),Document(5, "01-Jan-17"),Document(2, "03-Jan-17")))
model.transform(test.toDF()).show()
Input Data with Columns
Userid,Date,SwipeIntime
1, 1-Jan-2017,9.30
1, 2-Jan-2017,9.35
1, 3-Jan-2017,9.45
1, 4-Jan-2017,9.26
2, 1-Jan-2017,9.37
2, 2-Jan-2017,9.35
2, 3-Jan-2017,9.45
2, 4-Jan-2017,9.46
I found a solution and was able to do it:
import scala.beans.BeanInfo
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType,StructField,StringType}
case class LabeledDocument(Userid: Double, Date: String, label: Double)
val trainingData = spark.read.option("inferSchema", true).csv("/root/Predictiondata10.csv").toDF("Userid","Date","label").toDF().as[LabeledDocument]
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.feature.VectorAssembler
val DateIndexer = new StringIndexer().setInputCol("Date").setOutputCol("DateCat")
val indexed = DateIndexer.fit(trainingData).transform(trainingData)
val assembler = new VectorAssembler().setInputCols(Array("DateCat", "Userid")).setOutputCol("rawfeatures")
val output = assembler.transform(indexed)
val rows = output.select("Userid","Date","label","DateCat","rawfeatures").collect()
val asTuple=rows.map(a=>(a.getInt(0),a.getString(1),a.getDouble(2),a.getDouble(3),a(4).toString()))
val r2 = sc.parallelize(asTuple).toDF("Userid","Date","label","DateCat","rawfeatures")
val Array(training, testData) = r2.randomSplit(Array(0.7, 0.3))
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
val tokenizer = new Tokenizer().setInputCol("rawfeatures").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
import org.apache.spark.ml.regression.LinearRegression
val lr = new LinearRegression().setMaxIter(100).setRegParam(0.001).setElasticNetParam(0.0001)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training.toDF())
model.transform(testData.toDF()).show()
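For reference, a shorter route usually lets VectorAssembler build the feature vector directly from the numeric columns and skips the Tokenizer/HashingTF round trip. This is only a sketch, assuming the same trainingData Dataset as above (Userid and label numeric, Date indexed into DateCat):
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.regression.LinearRegression

// Index the string Date column, assemble it together with Userid into a
// single "features" vector, and feed that straight into the regressor.
val dateIndexer = new StringIndexer().setInputCol("Date").setOutputCol("DateCat")
val assembler = new VectorAssembler()
  .setInputCols(Array("DateCat", "Userid"))
  .setOutputCol("features")
val lr = new LinearRegression().setMaxIter(100).setRegParam(0.001)

val pipeline = new Pipeline().setStages(Array(dateIndexer, assembler, lr))
val model = pipeline.fit(trainingData.toDF())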

How to cast a variable of type MLlib vector type to ML vector type? [duplicate]

I am trying to create an LDA model on a JSON file.
Creating a Spark session and reading the JSON file:
import org.apache.spark.sql.SparkSession
val sparkSession = SparkSession.builder
.master("local")
.appName("my-spark-app")
.config("spark.some.config.option", "config-value")
.getOrCreate()
val df = sparkSession.read.json("dbfs:/mnt/JSON6/JSON/sampleDoc.txt")
Displaying the df should show the DataFrame
display(df)
Tokenize the text
import org.apache.spark.ml.feature.RegexTokenizer
// Set params for RegexTokenizer
val tokenizer = new RegexTokenizer()
.setPattern("[\\W_]+")
.setMinTokenLength(4) // Filter away tokens with length < 4
.setInputCol("text")
.setOutputCol("tokens")
// Tokenize document
val tokenized_df = tokenizer.transform(df)
This should display the tokenized_df
display(tokenized_df)
Get the stopwords
%sh wget http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words -O /tmp/stopwords
Optional: copying the stopwords to the tmp folder
%fs cp file:/tmp/stopwords dbfs:/tmp/stopwords
Collecting all the stopwords
val stopwords = sc.textFile("/tmp/stopwords").collect()
Filtering out the stopwords
import org.apache.spark.ml.feature.StopWordsRemover
// Set params for StopWordsRemover
val remover = new StopWordsRemover()
.setStopWords(stopwords) // This parameter is optional
.setInputCol("tokens")
.setOutputCol("filtered")
// Create new DF with Stopwords removed
val filtered_df = remover.transform(tokenized_df)
Displaying the filtered df should verify the stopwords got removed
display(filtered_df)
Vectorizing the frequency of occurrence of words
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.ml.feature.CountVectorizer
// Set params for CountVectorizer
val vectorizer = new CountVectorizer()
.setInputCol("filtered")
.setOutputCol("features")
.fit(filtered_df)
Verify the vectorizer
vectorizer.transform(filtered_df)
.select("id", "text","features","filtered").show()
After this I am seeing an issue when fitting this vectorizer in LDA. The issue, I believe, is that CountVectorizer gives a sparse vector but LDA requires a dense vector. I'm still trying to figure it out.
Here is the code where the map is not able to convert:
import org.apache.spark.mllib.linalg.Vector
val ldaDF = countVectors.map {
case Row(id: String, countVector: Vector) => (id, countVector)
}
display(ldaDF)
Exception :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4083.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4083.0 (TID 15331, 10.209.240.17): scala.MatchError: [0,(1252,[13,17,18,20,30,37,45,50,51,53,63,64,96,101,108,125,174,189,214,221,224,227,238,268,291,309,328,357,362,437,441,455,492,493,511,528,561,613,619,674,764,823,839,980,1098,1143],[1.0,1.0,2.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,3.0,1.0,2.0,1.0,5.0,1.0,2.0,2.0,1.0,4.0,1.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,2.0,1.0,1.0,1.0])] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
There is a working sample for LDA which is not throwing any issue
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
val a = Vectors.dense(Array(1.0,2.0,3.0))
val b = Vectors.dense(Array(3.0,4.0,5.0))
val df = Seq((1L,a),(2L,b),(2L,a)).toDF
val ldaDF = df.map { case Row(id: Long, countVector: Vector) => (id, countVector) }
val model = new LDA().setK(3).run(ldaDF.javaRDD)
display(df)
The only difference is that in the second snippet we are using dense vectors.
This has nothing to do with sparsity. Since Spark 2.0.0, ML transformers no longer generate o.a.s.mllib.linalg.VectorUDT but o.a.s.ml.linalg.VectorUDT, and the values are mapped locally to subclasses of o.a.s.ml.linalg.Vector. These are not compatible with the old MLlib API, which is moving towards deprecation in Spark 2.0.0.
You can convert ML vectors back to the "old" MLlib vectors using Vectors.fromML:
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.ml.linalg.{Vectors => NewVectors}
OldVectors.fromML(NewVectors.dense(1.0, 2.0, 3.0))
OldVectors.fromML(NewVectors.sparse(5, Seq(0 -> 1.0, 2 -> 2.0, 4 -> 3.0)))
but it makes more sense to use the ML implementation of LDA if you already use ML transformers.
For convenience you can use implicit conversions:
import scala.language.implicitConversions
object VectorConversions {
import org.apache.spark.mllib.{linalg => mllib}
import org.apache.spark.ml.{linalg => ml}
implicit def toNewVector(v: mllib.Vector) = v.asML
implicit def toOldVector(v: ml.Vector) = mllib.Vectors.fromML(v)
}
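A quick usage sketch, assuming the VectorConversions object above is in scope (the values are just placeholders):
import VectorConversions._
import org.apache.spark.ml.linalg.{Vectors => MLVectors}
import org.apache.spark.mllib.linalg.{Vector => OldVector}

// toOldVector is applied implicitly wherever an MLlib Vector is expected.
val asOld: OldVector = MLVectors.dense(1.0, 2.0, 3.0)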
I changed:
val ldaDF = countVectors.map {
case Row(id: String, countVector: Vector) => (id, countVector)
}
to:
val ldaDF = countVectors.map { case Row(docId: String, features: MLVector) =>
(docId.toLong, Vectors.fromML(features)) }
And it worked like a charm! It is aligned with what @zero323 has written.
List of imports:
import org.apache.spark.ml.feature.{CountVectorizer, RegexTokenizer, StopWordsRemover}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
The solution is very simple; see below:
//import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.ml.linalg.Vector

Count calls of UDF in Spark

Using Spark 1.6.1, I want to count the number of times a UDF is called. I want to do this because I have a very expensive UDF (~1 sec per call) and I suspect the UDF is being called more often than there are records in my dataframe, making my Spark job slower than necessary.
Although I could not reproduce this situation, I came up with a simple example showing that the number of calls to the UDF seems to be different (here: fewer) than the number of rows. How can that be?
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf
object Demo extends App {
val conf = new SparkConf().setMaster("local[4]").setAppName("Demo")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val callCounter = sc.accumulator(0)
val df= sc.parallelize(1 to 10000,numSlices = 100).toDF("value")
println(df.count) // gives 10000
val myudf = udf((d:Int) => {callCounter.add(1);d})
val res = df.withColumn("result",myudf($"value")).cache
println(res.select($"result").collect().size) // gives 10000
println(callCounter.value) // gives 9941
}
If using an accumulator is not the right way to count the calls to the UDF, how else could I do it?
Note: in my actual Spark job, I get a call count that is about 1.7 times higher than the actual number of records.
Spark applications should define a main() method instead of extending scala.App. Subclasses of scala.App may not work correctly.
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf
object Demo {
def main(args: Array[String]): Unit = {
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[4]")
val sc = new SparkContext(conf)
// [...]
}
}
This should solve your problem.
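As an aside, accumulators updated inside transformations (such as a UDF applied in withColumn) may be applied more than once when tasks are retried or partitions are recomputed; Spark only guarantees exactly-once updates for accumulators used in actions. A hedged sketch of an exact row count using that guarantee:
// Count rows inside an action (foreach) instead of inside the UDF itself;
// accumulator updates performed in actions are applied exactly once per task.
val rowCounter = sc.accumulator(0)
res.select($"result").rdd.foreach(_ => rowCounter.add(1))
println(rowCounter.value)  // matches the number of rows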