Why do I get a type error in model.predictOnValues when I try the official Streaming KMeans Clustering example of Apache Spark? - scala

I'm trying the Streaming Clustering example code at the end of the official guide, but I get a type error. Here is my code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.clustering.StreamingKMeans
object Kmeans {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("kmeans")
    val ssc = new StreamingContext(conf, Seconds(3))
    val trainingData = ssc.textFileStream("training").map(Vectors.parse)
    val testData = ssc.textFileStream("test").map(LabeledPoint.parse)
    val numDimensions = 3
    val numClusters = 2
    val model = new StreamingKMeans()
      .setK(numClusters)
      .setDecayFactor(1.0)
      .setRandomCenters(numDimensions, 0.0)
    model.trainOn(trainingData)
    model.predictOnValues(testData).print()
    ssc.start()
    ssc.awaitTermination()
  }
}
But when I run
sbt package
I get the following error:
[error] found : org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.regression.LabeledPoint]
[error] required: org.apache.spark.streaming.dstream.DStream[(?, org.apache.spark.mllib.linalg.Vector)]
[error] model.predictOnValues(testData).print()
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed

predictOnValues expects a keyed stream of vectors, so you need to map testData: DStream[LabeledPoint] to a DStream[(K, Vector)]:
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
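For instance, a minimal sketch of the fix in context, keeping the label as the key (the key type is up to you, since predictOnValues is generic in K):
// map each LabeledPoint to a (key, vector) pair; using the label as the key means the
// printed (key, clusterIndex) pairs can be compared against the true labels
val keyedTestData = testData.map(lp => (lp.label, lp.features)) // DStream[(Double, Vector)]
model.predictOnValues(keyedTestData).print()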
You can find the complete example here: StreamingKMeansExample.scala

Related

not found: value fromRegistries

I am trying to use https://doc.akka.io/docs/alpakka/current/mongodb.html as follows:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import cats.data.Chain
import com.mongodb.reactivestreams.client.MongoClients
import org.mongodb.scala.bson.codecs.DEFAULT_CODEC_REGISTRY
import org.mongodb.scala.bson.codecs.Macros._
object Main extends App {
  implicit val system = ActorSystem()
  implicit val mat = ActorMaterializer()
  val preFailure = MsgPreFailure("Hello", Chain("Foo", "Too"))
  val codecRegistry = fromRegistries(fromProviders(classOf[MsgPreFailure]), DEFAULT_CODEC_REGISTRY)
  private val client = MongoClients.create("mongodb://localhost:27017")
  private val db = client.getDatabase("MongoSourceSpec")
  private val preFailureColl = db
    .getCollection("numbers", classOf[MsgPreFailure])
    .withCodecRegistry(codecRegistry)
}
and the compiler complains:
[error] /home/developer/scala/trymongo/src/main/scala/Main.scala:15:23: not found: value fromRegistries
[error] val codecRegistry = fromRegistries(fromProviders(classOf[MsgPreFailure]), DEFAULT_CODEC_REGISTRY)
[error] ^
[error] /home/developer/scala/trymongo/src/main/scala/Main.scala:15:38: not found: value fromProviders
[error] val codecRegistry = fromRegistries(fromProviders(classOf[MsgPreFailure]), DEFAULT_CODEC_REGISTRY)
[error] ^
What am I missing? The project can be found here: https://gitlab.com/playscala/trymongo
I think you might need to import it. Try importing:
import org.bson.codecs.configuration.CodecRegistries.{fromProviders, fromRegistries}
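As a minimal sketch of how the pieces fit together (the fields of MsgPreFailure are made up here, since only its name appears in the question):
import org.bson.codecs.configuration.CodecRegistries.{fromProviders, fromRegistries}
import org.mongodb.scala.bson.codecs.DEFAULT_CODEC_REGISTRY
import org.mongodb.scala.bson.codecs.Macros._

// hypothetical shape of the case class; the Macros import lets classOf[MsgPreFailure]
// be converted implicitly into a CodecProvider for fromProviders
case class MsgPreFailure(message: String, reasons: Seq[String])

// fromRegistries merges the macro-generated codec with the driver's default registry
val codecRegistry =
  fromRegistries(fromProviders(classOf[MsgPreFailure]), DEFAULT_CODEC_REGISTRY)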

H2O implicit conversion throws compilation error

The code below throws an error when assigning the H2OFrame; most likely something is wrong with the implicit conversion. The error is:
type mismatch;
 found   : org.apache.spark.h2o.RDD[Int] (which expands to) org.apache.spark.rdd.RDD[Int]
 required: org.apache.spark.h2o.H2OFrame (which expands to) water.fvec.H2OFrame
and the code:
import org.apache.spark.h2o._
import org.apache.spark._
import org.apache.spark.SparkContext._
object App1 extends App {
  val conf = new SparkConf()
  conf.setAppName("Test")
  conf.setMaster("local[1]")
  conf.set("spark.executor.memory", "1g")
  val sc = new SparkContext(conf)
  val rawData = sc.textFile("c:\\spark\\data.csv")
  val data = rawData.map(line => line.split(',').map(_.toDouble))
  val response: RDD[Int] = data.map(row => row(0).toInt)
  val h2oResponse: H2OFrame = response // <-- this line throws the error
  sc.stop
}
All you are missing is the h2oContext's implicits:
import h2oContext.implicits._
val h2oResponse: H2OFrame = response.toDF()
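A minimal sketch of the relevant part, assuming Sparkling Water is on the classpath and sc is the SparkContext from the question (obtaining the context via H2OContext.getOrCreate is an assumption and may differ by version):
import org.apache.spark.h2o._

val h2oContext = H2OContext.getOrCreate(sc)  // create the H2OContext before using its implicits
import h2oContext.implicits._                // brings the RDD -> H2OFrame conversions into scope

val h2oResponse: H2OFrame = response         // the implicit conversion now applies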

Error while running the spark scala code to do bulk load

I am using the following code in the REPL to create HFiles and bulk load them into HBase. I used the same code with spark-submit and it worked fine with no errors, but when I run it in the REPL it throws an error.
import org.apache.spark._
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.client.{ConnectionFactory, HTable}
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StringType
import scala.collection.mutable.ArrayBuffer
import org.apache.hadoop.hbase.KeyValue
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
val cdt = "dt".getBytes
val ctemp="temp".getBytes
val ctemp_min="temp_min".getBytes
val ctemp_max="temp_max".getBytes
val cpressure="pressure".getBytes
val csea_level="sea_level".getBytes
val cgrnd_level="grnd_level".getBytes
val chumidity="humidity".getBytes
val ctemp_kf="temp_kf".getBytes
val cid="id".getBytes
val cweather_main="weather_main".getBytes
val cweather_description="weather_description".getBytes
val cweather_icon="weather_icon".getBytes
val cclouds_all="clouds_all".getBytes
val cwind_speed="wind_speed".getBytes
val cwind_deg="wind_deg".getBytes
val csys_pod="sys_pod".getBytes
val cdt_txt="dt_txt".getBytes
val crain="rain".getBytes
val COLUMN_FAMILY = "data".getBytes
val cols = ArrayBuffer(cdt,ctemp,ctemp_min,ctemp_max,cpressure,csea_level,cgrnd_level,chumidity,ctemp_kf,cid,cweather_main,cweather_description,cweather_icon,cclouds_all,cwind_speed,cwind_deg,csys_pod,cdt_txt,crain)
val rowKey = new ImmutableBytesWritable()
val conf = HBaseConfiguration.create()
val ZOOKEEPER_QUORUM = "address"
conf.set("hbase.zookeeper.quorum", ZOOKEEPER_QUORUM);
val connection = ConnectionFactory.createConnection(conf)
val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferschema","true").load("Hbasedata/Weatherdata.csv")
val rdd = df.flatMap(x => { //Error when i run this
  rowKey.set(x(0).toString.getBytes)
  for (i <- 0 to cols.length - 1) yield {
    val index = x.fieldIndex(new String(cols(i)))
    val value = if (x.isNullAt(index)) "".getBytes else x(index).toString.getBytes
    (rowKey, new KeyValue(rowKey.get, COLUMN_FAMILY, cols(i), value))
  }
})
It throws the following error:
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2067)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:333)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:332)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:332)
at org.apache.spark.sql.DataFrame.flatMap(DataFrame.scala:1418)
The error is thrown when I try to create the RDD. I have used the same code with spark-submit and it was working fine.
The issue is in
val rowKey = new ImmutableBytesWritable()
ImmutableBytesWritable is not serializable and it is defined outside the flatMap function, so it gets captured in the closure; please check the full exception stack trace.
You can move that statement inside the flatMap function, at least as a check.
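A minimal sketch of that change, building the ImmutableBytesWritable per row inside the closure instead of capturing the driver-side instance:
val rdd = df.flatMap { x =>
  // created inside the closure, so nothing non-serializable is captured from the driver;
  // this also avoids reusing a single mutable key object across all emitted tuples
  val rowKey = new ImmutableBytesWritable(x(0).toString.getBytes)
  for (i <- 0 until cols.length) yield {
    val index = x.fieldIndex(new String(cols(i)))
    val value = if (x.isNullAt(index)) "".getBytes else x(index).toString.getBytes
    (rowKey, new KeyValue(rowKey.get, COLUMN_FAMILY, cols(i), value))
  }
}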

SparkConf not found running spark neo4j connector example

I execute the spark neo4j example code like so:
spark-shell --conf spark.neo4j.bolt.password=TestNeo4j --packages neo4j-contrib:neo4j-spark-connector:1.0.0-RC1,graphframes:graphframes:0.1.0-spark1.6 -i neo4jspark.scala
My Scala file:
import org.neo4j.spark._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.SparkConf
val conf = new SparkConf.setMaster("local").setAppName("neo4jspark")
val sc = new SparkContext(conf)
val neo = Neo4j(sc)
val rdd = neo.cypher("MATCH (p:POINT) RETURN p").loadRowRdd
rdd.count
The error:
Loading neo4jspark.scala...
import org.neo4j.spark._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.SparkConf
<console>:38: error: not found: value SparkConf
val conf = new SparkConf.setMaster("local").setAppName("neo4jspark")
^
<console>:38: error: not found: value conf
val sc = new SparkContext(conf)
^
<console>:39: error: not found: value Neo4j
val neo = Neo4j(sc)
^
<console>:38: error: not found: value neo
val rdd = neo.cypher("MATCH (p:POINT) RETURN p").loadRowRdd
^
<console>:39: error: object count is not a member of package org.apache.spark.streaming.rdd
rdd.count
^
I am importing SparkConf, so I have no idea why it says there is no such value. Am I missing something (simple, I hope)?
EDIT: This seems to be a version error:
I ran it with this startup command:
spark-shell --conf spark.neo4j.bolt.password=TestNeo4j --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2,graphframes:graphframes:0.2.0-spark2.0-s_2.11 -i neo4jspark.scala
I still get the conf error, but it does run. I now just need to figure out why anything I do with the returned RDD breaks :D This is on Mac; I just tested the same versions of everything on my Windows machine and it breaks as shown above, mostly because it couldn't resolve the import org.neo4j.spark._. Here is the error:
<console>:23: error: object neo4j is not a member of package org
import org.neo4j.spark._
^
No idea what is different about Windows compared to the Mac I am developing on :/

Can't find temp table in Zeppelin

I receive an error when I try to do a select over my temp table. Can somebody help me, please?
object StreamingLinReg extends java.lang.Object {
  val conf = new SparkConf(true)
    .set("spark.cassandra.connection.host", "127.0.0.1").setAppName("Streaming Liniar Regression")
    .set("spark.cassandra.connection.port", "9042")
    .set("spark.driver.allowMultipleContexts", "true")
    .set("spark.streaming.receiver.writeAheadLog.enable", "true")
  val sc = new SparkContext(conf)
  val ssc = new StreamingContext(sc, Seconds(1))
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._
  val trainingData = ssc.cassandraTable[String]("features", "consumodata").select("consumo", "consumo_mensal", "soma_pf", "tempo_gasto").map(LabeledPoint.parse)
  trainingData.toDF.registerTempTable("training")
  val dstream = new ConstantInputDStream(ssc, trainingData)
  val numFeatures = 100
  val model = new StreamingLinearRegressionWithSGD()
    .setInitialWeights(Vectors.zeros(numFeatures))
    .setNumIterations(1)
    .setStepSize(0.1)
    .setMiniBatchFraction(1.0)
  model.trainOn(dstream)
  model.predictOnValues(dstream.map(lp => (lp.label, lp.features))).foreachRDD { rdd =>
    val metrics = new RegressionMetrics(rdd)
    val MSE = metrics.meanSquaredError //Squared error
    val RMSE = metrics.rootMeanSquaredError //Squared error
    val MAE = metrics.meanAbsoluteError //Mean absolute error
    val Rsquared = metrics.r2
    //val Explained variance = metrics.explainedVariance
    rdd.toDF.registerTempTable("liniarRegressionModel")
  }
}
ssc.start()
ssc.awaitTermination()
//}
}
%sql
select * from liniarRegressionModel limit 10
When I do the select over the temporary table I get an error message. I run the first paragraph first and then execute the select over the temp table.
org.apache.spark.sql.AnalysisException: Table not found: liniarRegressionModel; line 1 pos 14
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.getTable(Analyzer.scala:305)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:314)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$9.applyOrElse(Analyzer.scala:309)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:56)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
  at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply(LogicalPlan.scala:54)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
My output after executing the code:
import java.lang.Object
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.StreamingContext._
import com.datastax.spark.connector.streaming._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.mllib.evaluation.RegressionMetrics
defined module StreamingLinReg
FINISHED
Took 15 seconds