NullPointerException in org.apache.spark.ml.feature.Tokenizer - scala

I want to compute TF-IDF features separately on the title and description fields and then combine those features with a VectorAssembler so that the final classifier can operate on the combined features.
It works fine if I use a single serial flow that is simply
titleTokenizer -> titleHashingTF -> VectorAssembler
But I need both like so:
titleTokenizer       -> titleHashingTF       \
                                              -> VectorAssembler
descriptionTokenizer -> descriptionHashingTF /
Code here:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer, StringIndexer, VectorAssembler}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.log4j.{Level, Logger}
object SimplePipeline {
  def main(args: Array[String]) {
    // setup boilerplate
    val conf = new SparkConf()
      .setAppName("Pipeline example")
    val sc = new SparkContext(conf)
    val spark = SparkSession
      .builder()
      .appName("Session for SimplePipeline")
      .getOrCreate()

    val all_df = spark.read.json("file:///Users/me/data.json")
    val numLabels = all_df.count()

    // split into training and testing
    val Array(training, testing) = all_df.randomSplit(Array(0.75, 0.25))
    val nTraining = training.count()
    val nTesting = testing.count()
    println(s"Loaded $nTraining training labels...")
    println(s"Loaded $nTesting testing labels...")

    // convert string labels to integers
    val indexer = new StringIndexer()
      .setInputCol("rating")
      .setOutputCol("label")

    // tokenize our string inputs
    val titleTokenizer = new Tokenizer()
      .setInputCol("title")
      .setOutputCol("title_words")
    val descriptionTokenizer = new Tokenizer()
      .setInputCol("description")
      .setOutputCol("description_words")

    // count term frequencies
    val titleHashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(titleTokenizer.getOutputCol)
      .setOutputCol("title_tfs")
    val descriptionHashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(descriptionTokenizer.getOutputCol)
      .setOutputCol("description_tfs")

    // combine features together
    val assembler = new VectorAssembler()
      .setInputCols(Array(titleHashingTF.getOutputCol, descriptionHashingTF.getOutputCol))
      .setOutputCol("features")

    // set params for our model
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)

    // pipeline that combines all stages
    val stages = Array(indexer, titleTokenizer, titleHashingTF, descriptionTokenizer, descriptionHashingTF, assembler, lr)
    val pipeline = new Pipeline().setStages(stages)

    // Fit the pipeline to training documents.
    val model = pipeline.fit(training)
    // Make predictions.
    val predictions = model.transform(testing)
    // Select example rows to display.
    predictions.select("label", "rawPrediction", "prediction").show()

    sc.stop()
  }
}
and my data file is simply a line-break separated file of JSON objects:
{"title" : "xxxxxx", "description" : "yyyyy" .... }
{"title" : "zzzzzz", "description" : "zxzxzx" .... }
The error I get is very long and difficult to understand, but the important part (I think) is a java.lang.NullPointerException:
ERROR Executor: Exception in task 0.0 in stage 9.0 (TID 12)
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$1: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:215)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:957)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:888)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:694)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)
at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)
... 23 more
How should I be properly crafting my Pipeline to do this?
(Also I'm completely new to Scala)

The problem here is that you don't validate the data and some of the values are NULL. It is pretty easy to reproduce this:
val df = Seq((1, Some("abcd bcde cdef")), (2, None)).toDF("id", "description")
val tokenizer = new Tokenizer().setInputCol("description")
tokenizer.transform(df).foreach(_ => ())
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$createTransformFunc$1: (string) => array<string>)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1072)
...
Caused by: java.lang.NullPointerException
at org.apache.spark.ml.feature.Tokenizer$$anonfun$createTransformFunc$1.apply(Tokenizer.scala:39)
...
You can, for example, drop the rows with nulls:
tokenizer.transform(df.na.drop(Array("description")))
or replace these with empty strings:
tokenizer.transform(df.na.fill(Map("description" -> "")))
whichever makes more sense in your application.
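Applied to the pipeline in the question, a minimal sketch (assuming empty strings are an acceptable stand-in for missing titles and descriptions) would be to clean both splits before fitting:
// Sketch: replace null title/description values with empty strings so the
// Tokenizers never see a null input (column names as in the question).
val cleanTraining = training.na.fill(Map("title" -> "", "description" -> ""))
val cleanTesting = testing.na.fill(Map("title" -> "", "description" -> ""))

val model = pipeline.fit(cleanTraining)
val predictions = model.transform(cleanTesting)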

Related

How to create a PolygonRDD from H3 boundary?

I'm using Apache Spark with Apache Sedona (previously called GeoSpark), and I'm trying to do the following:
Take a DataFrame containing latitude and longitude in each row (it comes from an arbitrary source, it neither is a PointRDD nor comes from a specific file format) and transform it into a DataFrame with the H3 index of each point.
Take that DataFrame and create a PolygonRDD containing the H3 cell boundaries of each distinct H3 index.
This is what I have so far:
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.sedona.core.spatialRDD.PolygonRDD
import org.apache.sedona.sql.utils.SedonaSQLRegistrator
import org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
import org.apache.sedona.viz.sql.utils.SedonaVizRegistrator
import org.locationtech.jts.geom.{Polygon, GeometryFactory, Coordinate}
import com.uber.h3core.H3Core
import com.uber.h3core.util.GeoCoord
object Main {
  def main(args: Array[String]) {
    val sparkSession: SparkSession = SparkSession
      .builder()
      .config("spark.serializer", classOf[KryoSerializer].getName)
      .config("spark.kryo.registrator", classOf[SedonaVizKryoRegistrator].getName)
      .master("local[*]")
      .appName("Sedona-Analysis")
      .getOrCreate()

    import sparkSession.implicits._

    SedonaSQLRegistrator.registerAll(sparkSession)
    SedonaVizRegistrator.registerAll(sparkSession)

    val df = Seq(
      (-8.01681, -34.92618),
      (-25.59306, -49.39895),
      (-7.17897, -34.86518),
      (-20.24521, -42.14273),
      (-20.24628, -42.14785),
      (-27.01641, -50.94109),
      (-19.72987, -47.94319)
    ).toDF("latitude", "longitude")

    val core: H3Core = H3Core.newInstance()
    val geoFactory = new GeometryFactory()

    val geoToH3 = udf((lat: Double, lng: Double, res: Int) => core.geoToH3(lat, lng, res))

    val trdd = df
      .select(geoToH3($"latitude", $"longitude", lit(7)).as("h3index"))
      .distinct()
      .rdd
      .map(row => {
        val h3 = row.getAs[Long](0)
        val lboundary = core.h3ToGeoBoundary(h3)
        val aboundary = lboundary.toArray(Array.ofDim[GeoCoord](lboundary.size))
        val poly = geoFactory.createPolygon(
          aboundary.map((c: GeoCoord) => new Coordinate(c.lat, c.lng))
        )
        poly.setUserData(h3)
        poly
      })

    val polyRDD = new PolygonRDD(trdd)
    polyRDD.rawSpatialRDD.foreach(println)

    sparkSession.stop()
  }
}
However, after running sbt assembly and submitting the output jar to spark-submit, I get this error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2362)
at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:396)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.map(RDD.scala:395)
at Main$.main(Main.scala:44)
at Main.main(Main.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: com.uber.h3core.H3Core
Serialization stack:
- object not serializable (class: com.uber.h3core.H3Core, value: com.uber.h3core.H3Core#3407ded1)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 2)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class Main$, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic Main$.$anonfun$main$2:(Lcom/uber/h3core/H3Core;Lorg/locationtech/jts/geom/GeometryFactory;Lorg/apache/spark/sql/Row;)Lorg/locationtech/jts/geom/Polygon;, instantiatedMethodType=(Lorg/apache/spark/sql/Row;)Lorg/locationtech/jts/geom/Polygon;, numCaptured=2])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class Main$$$Lambda$1710/0x0000000840d7f040, Main$$$Lambda$1710/0x0000000840d7f040#4853f592)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
... 22 more
What is the proper way to achieve what I'm trying to do?
So, basically just adding the Serializable trait to an object containing the H3Core was enough. Also, I had to adjust the Coordinate array to begin and end with the same point.
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.sedona.core.spatialRDD.PolygonRDD
import org.apache.sedona.sql.utils.SedonaSQLRegistrator
import org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
import org.apache.sedona.viz.sql.utils.SedonaVizRegistrator
import org.locationtech.jts.geom.{Polygon, GeometryFactory, Coordinate}
import com.uber.h3core.H3Core
import com.uber.h3core.util.GeoCoord
object H3 extends Serializable {
  val core = H3Core.newInstance()
  val geoFactory = new GeometryFactory()
}

object Main {
  def main(args: Array[String]) {
    val sparkSession: SparkSession = SparkSession
      .builder()
      .config("spark.serializer", classOf[KryoSerializer].getName)
      .config("spark.kryo.registrator", classOf[SedonaVizKryoRegistrator].getName)
      .master("local[*]")
      .appName("Sedona-Analysis")
      .getOrCreate()

    import sparkSession.implicits._

    SedonaSQLRegistrator.registerAll(sparkSession)
    SedonaVizRegistrator.registerAll(sparkSession)

    val df = Seq(
      (-8.01681, -34.92618),
      (-25.59306, -49.39895),
      (-7.17897, -34.86518),
      (-20.24521, -42.14273),
      (-20.24628, -42.14785),
      (-27.01641, -50.94109),
      (-19.72987, -47.94319)
    ).toDF("latitude", "longitude")

    val geoToH3 = udf((lat: Double, lng: Double, res: Int) => H3.core.geoToH3(lat, lng, res))

    val trdd = df
      .select(geoToH3($"latitude", $"longitude", lit(7)).as("h3index"))
      .distinct()
      .rdd
      .map(row => {
        val h3 = row.getAs[Long](0)
        val lboundary = H3.core.h3ToGeoBoundary(h3)
        val aboundary = lboundary.toArray(Array.ofDim[GeoCoord](lboundary.size))
        val poly = H3.geoFactory.createPolygon({
          val ps = aboundary.map((c: GeoCoord) => new Coordinate(c.lat, c.lng))
          ps :+ ps(0)
        })
        poly.setUserData(h3)
        poly
      })

    val polyRDD = new PolygonRDD(trdd)
    polyRDD.rawSpatialRDD.foreach(println)

    sparkSession.stop()
  }
}
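An alternative for the RDD step is to construct the non-serializable H3Core (and the GeometryFactory) once per partition on the executors with mapPartitions, so nothing has to be serialized from the driver at all. A rough sketch, keeping the geoToH3 UDF and df from the code above:
// Sketch: build H3Core/GeometryFactory per partition instead of capturing
// driver-side instances in the task closure.
val trddAlt = df
  .select(geoToH3($"latitude", $"longitude", lit(7)).as("h3index"))
  .distinct()
  .rdd
  .mapPartitions { rows =>
    val core = H3Core.newInstance()        // created on the executor
    val geoFactory = new GeometryFactory()
    rows.map { row =>
      val h3 = row.getAs[Long](0)
      val lboundary = core.h3ToGeoBoundary(h3)
      val ring = lboundary.toArray(Array.ofDim[GeoCoord](lboundary.size))
        .map((c: GeoCoord) => new Coordinate(c.lat, c.lng))
      val poly = geoFactory.createPolygon(ring :+ ring(0))  // close the ring
      poly.setUserData(h3)
      poly
    }
  }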

type mismatch; found : org.apache.spark.sql.DataFrame required: org.apache.spark.rdd.RDD

I am new to Scala and MLlib, and I have been getting the following error. Please let me know if anyone has been able to resolve something similar.
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
.
.
.
val conf = new SparkConf().setMaster("local").setAppName("SampleApp")
val sContext = new SparkContext(conf)
val sc = SparkSession.builder().master("local").appName("SampleApp").getOrCreate()
val sampleData = sc.read.json("input/sampleData.json")
val clusters = KMeans.train(sampleData, 10, 10)
val WSSSE = clusters.computeCost(sampleData)
clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
val sameModel = KMeansModel.load(sContext, "target/org/apache/spark/KMeansExample/KMeansModel")
The KMeans.train line above gives this error:
type mismatch; found : org.apache.spark.sql.DataFrame (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried:
import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans().setK(20)
val model = kmeans.fit(sampleData)
val predictions = model.transform(sampleData)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
This gives the error:
Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
Available fields: address, attributes, business_id
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:266)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:58)
at org.apache.spark.ml.util.SchemaUtils$.validateVectorCompatibleColumn(SchemaUtils.scala:119)
at org.apache.spark.ml.clustering.KMeansParams$class.validateAndTransformSchema(KMeans.scala:96)
at org.apache.spark.ml.clustering.KMeans.validateAndTransformSchema(KMeans.scala:285)
at org.apache.spark.ml.clustering.KMeans.transformSchema(KMeans.scala:382)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:341)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:340)
at org.apache.spark.ml.util.Instrumentation$$anonfun$11.apply(Instrumentation.scala:183)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:183)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:340)
I have been referring to https://spark.apache.org/docs/latest/ml-clustering.html and https://spark.apache.org/docs/latest/mllib-clustering.html
Edit
Using setFeaturesCol()
import org.apache.spark.ml.clustering.KMeans
val assembler = new VectorAssembler()
.setInputCols(Array("is_open", "review_count", "stars"))
.setOutputCol("features")
val output = assembler.transform(sampleData).select("features")
val kmeans = new KMeans().setK(20).setFeaturesCol("features")
val model = kmeans.fit(output)
val predictions = model.transform(sampleData)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
This gives a different error still:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.util.Utils$.getSimpleName(Ljava/lang/Class;)Ljava/lang/String;
at org.apache.spark.ml.util.Instrumentation.logPipelineStage(Instrumentation.scala:52)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:350)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:340)
at org.apache.spark.ml.util.Instrumentation$$anonfun$11.apply(Instrumentation.scala:183)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:183)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:340)
Thanks.
Use a Spark ML Pipeline:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

val assembler = new VectorAssembler()
  .setInputCols(Array("feature1", "feature2", "feature3"))
  .setOutputCol("assembled_features")

val scaler = new StandardScaler()
  .setInputCol("assembled_features")
  .setOutputCol("features")
  .setWithStd(true)
  .setWithMean(false)

val kmeans = new KMeans().setK(2).setSeed(1L)

// create the pipeline
val pipeline = new Pipeline()
  .setStages(Array(assembler, scaler, kmeans))

// Fit the model
val clusterModel = pipeline.fit(train)
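After fitting, scoring and evaluation work the same way you already tried. A sketch, where test is a hypothetical held-out DataFrame with the same input columns as train:
// Sketch: apply the fitted pipeline model and evaluate the clustering.
// `test` is a hypothetical held-out DataFrame with the same columns as `train`.
import org.apache.spark.ml.evaluation.ClusteringEvaluator

val predictions = clusterModel.transform(test)

val evaluator = new ClusteringEvaluator()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")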

Spark: java.lang.IllegalArgumentException: requirement failed kmeans (mllib)

I am trying to do a clustering application with k-means.
My dataset is:
https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014#
I do not have much experience with Spark (I have been working with it for only a few months). The error occurs when I call KMeans.train, which takes as inputs a vector RDD, the number of clusters, and the number of iterations.
I am running locally; is it possible that my machine cannot handle so much data?
The main code is:
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import scala.collection._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.Row
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
object Preprocesado {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.appName("Preprocesado").getOrCreate()
    import spark.implicits._
    val sc = spark.sparkContext

    val datos = spark.read.format("csv").option("sep", ";").option("inferSchema", "true").option("header", "true").load("input.csv")
    var df = datos.select("data", "MT_001").withColumn("data", to_date($"data").cast("string")).withColumn("data", concat(lit("MT_001 "), $"data"))
    val col = datos.columns

    for (a <- 2 to col.size - 1) {
      var user = col(a)
      println(user)
      var df_$a = datos.select("data", col(a)).withColumn("data", to_date($"data").cast("string")).withColumn("data", concat(lit(user), lit(" "), $"data"))
      df = df.unionAll(df_$a)
    }

    val rd = df.withColumnRenamed("MT_001", "values")
    val df2 = rd.groupBy("data").agg(collect_list("values"))

    val convertUDF = udf((array: Seq[Double]) => {
      Vectors.dense(array.toArray)
    })
    val withVector = df2.withColumn("collect_list(values)", convertUDF($"collect_list(values)"))

    val items: Array[Double] = new Array[Double](96)
    val vecToRemove = Vectors.dense(items)

    def vectors_unequal(vec1: Vector) = udf((vec2: Vector) => !vec1.equals(vec2))

    val filtered = withVector.filter(vectors_unequal(vecToRemove)($"collect_list(values)"))

    val Array(a, b) = filtered.randomSplit(Array(0.7, 0.3))
    val trainingData = a.select("collect_list(values)").rdd.map { x: Row => x.getAs[Vector](0) }
    val testData = b.select("collect_list(values)").rdd.map { x: Row => x.getAs[Vector](0) }
    trainingData.cache()
    testData.cache()

    val numClusters = 4
    val numIterations = 20
    val clusters = KMeans.train(trainingData, numClusters, numIterations)

    clusters.predict(testData).coalesce(1, true).saveAsTextFile("output")

    spark.stop()
  }
}
When I compile, there are no errors.
Then I submit with:
spark-submit \
--class "spark.Preprocesado.Preprocesado" \
--master local[4] \
--executor-memory 7g \
--driver-memory 6g \
target/scala-2.11/preprocesado_2.11-1.0.jar
The problem is in the clustering:
This is the error:
18/05/20 16:45:48 ERROR Executor: Exception in task 10.0 in stage 7.0 (TID 6347)
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:212)
at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:486)
at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:589)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:563)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:557)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:557)
at org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:580)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$initKMeansParallel$2.apply(KMeans.scala:371)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$initKMeansParallel$2.apply(KMeans.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
How can I solve this error?
Thank you
I think you are generating your DataFrame df and consequently df2 in the wrong way.
Maybe you are trying to do this:
case class Data(values: Double, data: String)
var df = spark.emptyDataset[Data]
df = datos.columns.filter(_.startsWith("MT")).foldLeft(df)((df, c) => {
  val values = col(c).cast("double").as("values")
  val data = concat(lit(c), lit(" "), to_date($"_c0").cast("string")).as("data")
  df.union(datos.select(values, data).as[Data])
})
val df2 = df.groupBy("data").agg(collect_list("values"))
As I see it, you only need two columns, data and values, but in the for loop you are generating a DataFrame with 140256 columns (one for each attribute), and maybe this is the source of your problems.
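If you want to confirm this, the require that fails in MLUtils.fastSquaredDistance is typically tripped by vectors of different sizes, which is what mismatched collect_list lengths produce. A quick sanity check against the trainingData RDD from the question (a sketch; 96 is the dimension implied by the Array[Double](96) in your code):
// Sketch: mllib's KMeans needs every input vector to have the same dimension.
// Inspect the distinct sizes produced by collect_list before training.
val sizes = trainingData.map(_.size).distinct().collect()
println(s"Distinct vector sizes in training data: ${sizes.mkString(", ")}")

// Keep only vectors of the expected dimension before training.
val expectedDim = 96
val fixedTrainingData = trainingData.filter(_.size == expectedDim)
val clusters = KMeans.train(fixedTrainingData, numClusters, numIterations)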
PS: sorry for my English!

Spark NaiveBayesTextClassification

I'm trying to create a text classifier Spark (1.6.2) app, but I don't know what I'm doing wrong. This is my code:
import org.apache.spark.ml.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.mllib
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
/**
* Created by kebodev on 2016.11.29..
*/
object PredTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("IktatoSparkRunner")
      .set("spark.executor.memory", "2gb")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val trainData = sqlContext.read.json("src/main/resources/tst.json")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val wordsData = tokenizer.transform(trainData)

    val hashingTF = new HashingTF()
      .setInputCol("words").setOutputCol("features").setNumFeatures(20)
    val featurizedData = hashingTF.transform(wordsData)

    val model = NaiveBayes.train(featurizedData)
  }
}
The NaiveBayes object doesn't have a train method; what should I import?
If I try it this way:
val naBa = new NaiveBayes()
naBa.fit(featurizedData)
I get this exception:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column label must be of type DoubleType but was actually StringType.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:56)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:40)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.ProbabilisticClassifier.validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:68)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:89)
at PredTest$.main(PredTest.scala:37)
at PredTest.main(PredTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
This is what my JSON file looks like:
{"text":"any text","label":"6.0"}
I'm really a noob on this topic. Can anyone help me figure out how to create a model, and then how to predict a new value?
Thank you!
Labels and Feature Vectors only contain Doubles. Your label column contains a String.
See your stacktrace:
Column label must be of type DoubleType but was actually StringType.
You can use a StringIndexer to convert it appropriately (or, since your labels are numeric strings like "6.0", simply cast the column to double). See http://spark.apache.org/docs/latest/ml-features.html#stringindexer for further details.
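A minimal sketch of how that could look with the columns in your JSON ("text" and the string "label"), putting a StringIndexer in front of the classifier and reusing the tokenizer and hashingTF from your code:
// Sketch: index the string label column, then train ml's NaiveBayes on the
// hashed text features (column names taken from the question's JSON).
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer

val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

val nb = new NaiveBayes()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, labelIndexer, nb))

val model = pipeline.fit(trainData)
val predictions = model.transform(trainData)  // use a held-out set in practice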

Spark Streaming into HBase with filtering logic

I have been trying to understand how Spark Streaming and HBase connect, but have not been successful. What I am trying to do is, given a Spark stream, process it and store the results in an HBase table. So far this is what I have:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HBaseAdmin,HTable,Put,Get}
import org.apache.hadoop.hbase.util.Bytes
def blah(row: Array[String]) {
  val hConf = new HBaseConfiguration()
  val hTable = new HTable(hConf, "table")
  val thePut = new Put(Bytes.toBytes(row(0)))
  thePut.add(Bytes.toBytes("cf"), Bytes.toBytes(row(0)), Bytes.toBytes(row(0)))
  hTable.put(thePut)
}
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.map(_.split(","))
val store = words.foreachRDD(rdd => rdd.foreach(blah))
ssc.start()
I am currently running the above code in spark-shell. I am not sure what I am doing wrong.
I get the following error in the shell:
14/09/03 16:21:03 ERROR scheduler.JobScheduler: Error running job streaming job 1409786463000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I also double-checked the HBase table, just in case, and nothing new is written there.
I am running nc -lk 9999 in another terminal to feed data into the spark-shell for testing.
With help from users on the Spark user group, I was able to figure out how to get this to work. It looks like I needed to wrap my streaming, mapping, and foreach calls in a serializable object:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HBaseAdmin,HTable,Put,Get}
import org.apache.hadoop.hbase.util.Bytes
object Blaher {
  def blah(row: Array[String]) {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "table")
    val thePut = new Put(Bytes.toBytes(row(0)))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes(row(0)), Bytes.toBytes(row(0)))
    hTable.put(thePut)
  }
}

object TheMain extends Serializable {
  def run() {
    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.map(_.split(","))
    val store = words.foreachRDD(rdd => rdd.foreach(Blaher.blah))
    ssc.start()
  }
}
TheMain.run()
This seems to be a typical antipattern.
See the "Design Patterns for using foreachRDD" section at http://spark.apache.org/docs/latest/streaming-programming-guide.html for the correct pattern.