I'm trying the Streaming Clustering example code at the end of the official guide, but I get a type error. Here is my code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.clustering.StreamingKMeans
object Kmeans {
def main(args: Array[String]) {
val conf = new SparkConf().setMaster("local[2]").setAppName("kmeans")
val ssc = new StreamingContext(conf, Seconds(3))
val trainingData = ssc.textFileStream("training").map(Vectors.parse)
val testData = ssc.textFileStream("test").map(LabeledPoint.parse)
val numDimensions = 3
val numClusters = 2
val model = new StreamingKMeans()
.setK(numClusters)
.setDecayFactor(1.0)
.setRandomCenters(numDimensions, 0.0)
model.trainOn(trainingData)
model.predictOnValues(testData).print()
ssc.start()
ssc.awaitTermination()
}
}
But when I run
sbt package
I get the following error:
[error] found : org.apache.spark.streaming.dstream.DStream[org.apache.spark.mllib.regression.LabeledPoint]
[error] required: org.apache.spark.streaming.dstream.DStream[(?, org.apache.spark.mllib.linalg.Vector)]
[error] model.predictOnValues(testData).print()
[error] ^
[error] one error found
[error] (compile:compile) Compilation failed
You need to map testData: DStream[LabeledPoint] to a DStream[(K, Vector)]:
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
You can find the complete example here: StreamingKMeansExample.scala
Related
I am trying to use https://doc.akka.io/docs/alpakka/current/mongodb.html as the following:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import cats.data.Chain
import com.mongodb.reactivestreams.client.MongoClients
import org.mongodb.scala.bson.codecs.DEFAULT_CODEC_REGISTRY
import org.mongodb.scala.bson.codecs.Macros._
object Main extends App {
implicit val system = ActorSystem()
implicit val mat = ActorMaterializer()
val preFailure = MsgPreFailure("Hello", Chain("Foo", "Too"))
val codecRegistry = fromRegistries(fromProviders(classOf[MsgPreFailure]), DEFAULT_CODEC_REGISTRY)
private val client = MongoClients.create("mongodb://localhost:27017")
private val db = client.getDatabase("MongoSourceSpec")
private val preFailureColl = db
.getCollection("numbers", classOf[MsgPreFailure])
.withCodecRegistry(codecRegistry)
}
and the compiler complains:
[error] /home/developer/scala/trymongo/src/main/scala/Main.scala:15:23: not found: value fromRegistries
[error] val codecRegistry = fromRegistries(fromProviders(classOf[MsgPreFailure]), DEFAULT_CODEC_REGISTRY)
[error] ^
[error] /home/developer/scala/trymongo/src/main/scala/Main.scala:15:38: not found: value fromProviders
[error] val codecRegistry = fromRegistries(fromProviders(classOf[MsgPreFailure]), DEFAULT_CODEC_REGISTRY)
[error] ^
What am I missing? The project can be find here https://gitlab.com/playscala/trymongo
I think you might need to import it, Try importing :
import org.bson.codecs.configuration.CodecRegistries.{fromProviders, fromRegistries}
The code below is throwing an error when assigning the H2OFrame, most likely something is wrong in the implicit conversion. The error is:
type mismatch; found : org.apache.spark.h2o.RDD[Int] (which expands
to) org.apache.spark.rdd.RDD[Int] required:
org.apache.spark.h2o.H2OFrame (which expands to) water.fvec.H2OFrame
and the code:
import org.apache.spark.h2o._
import org.apache.spark._
import org.apache.spark.SparkContext._
object App1 extends App{
val conf = new SparkConf()
conf.setAppName("Test")
conf.setMaster("local[1]")
conf.set("spark.executor.memory","1g");
val sc = new SparkContext(conf)
val rawData = sc.textFile("c:\\spark\\data.csv")
val data = rawData.map(line => line.split(',').map(_.toDouble))
val response: RDD[Int] = data.map(row => row(0).toInt)
val h2oResponse: H2OFrame = response // <-- this line throws the error
sc.stop
}
All you are missing is h2oContext's implicits as
import h2oContext.implicits._
val h2oResponse: H2OFrame = response.toDF()
I am using the following code in REPL to create hfiles and to the bulk load into hbase.I used the same code and done the spark-submit it was working fine with no errors but when i run it in REPL it is throwing the error
import org.apache.spark._
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.client.{ConnectionFactory, HTable}
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.StringType
import scala.collection.mutable.ArrayBuffer
import org.apache.hadoop.hbase.KeyValue
import org.apache.spark.rdd.RDD.rddToPairRDDFunctions
val cdt = "dt".getBytes
val ctemp="temp".getBytes
val ctemp_min="temp_min".getBytes
val ctemp_max="temp_max".getBytes
val cpressure="pressure".getBytes
val csea_level="sea_level".getBytes
val cgrnd_level="grnd_level".getBytes
val chumidity="humidity".getBytes
val ctemp_kf="temp_kf".getBytes
val cid="id".getBytes
val cweather_main="weather_main".getBytes
val cweather_description="weather_description".getBytes
val cweather_icon="weather_icon".getBytes
val cclouds_all="clouds_all".getBytes
val cwind_speed="wind_speed".getBytes
val cwind_deg="wind_deg".getBytes
val csys_pod="sys_pod".getBytes
val cdt_txt="dt_txt".getBytes
val crain="rain".getBytes
val COLUMN_FAMILY = "data".getBytes
val cols = ArrayBuffer(cdt,ctemp,ctemp_min,ctemp_max,cpressure,csea_level,cgrnd_level,chumidity,ctemp_kf,cid,cweather_main,cweather_description,cweather_icon,cclouds_all,cwind_speed,cwind_deg,csys_pod,cdt_txt,crain)
val rowKey = new ImmutableBytesWritable()
val conf = HBaseConfiguration.create()
val ZOOKEEPER_QUORUM = "address"
conf.set("hbase.zookeeper.quorum", ZOOKEEPER_QUORUM);
val connection = ConnectionFactory.createConnection(conf)
val df = sqlContext.read.format("com.databricks.spark.csv").option("header","true").option("inferschema","true").load("Hbasedata/Weatherdata.csv")
val rdd = df.flatMap(x => { //Error when i run this
rowKey.set(x(0).toString.getBytes)
for(i <- 0 to cols.length - 1) yield {
val index = x.fieldIndex(new String(cols(i)))
val value = if (x.isNullAt(index)) "".getBytes else x(index).toString.getBytes
(rowKey,new KeyValue(rowKey.get, COLUMN_FAMILY, cols(i), value))
}
})
It is throwing the following error
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2067)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:333)
at org.apache.spark.rdd.RDD$$anonfun$flatMap$1.apply(RDD.scala:332)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.flatMap(RDD.scala:332)
at org.apache.spark.sql.DataFrame.flatMap(DataFrame.scala:1418)
The error is thrown when i tried to create the rdd.I have used the same code in spark-submit it was working fine.
Issue in
val rowKey = new ImmutableBytesWritable()
ImmutableBytesWritable is not serializable, and located outside "flatMap" function. Please check exception full stack trace.
You can move mentioned statement inside "flatMap" function, at least for check.
I execute the spark neo4j example code like so:
spark-shell --conf spark.neo4j.bolt.password=TestNeo4j --packages neo4j-contrib:neo4j-spark-connector:1.0.0-RC1,graphframes:graphframes:0.1.0-spark1.6 -i neo4jspark.scala
My Scalafile:
import org.neo4j.spark._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.SparkConf
val conf = new SparkConf.setMaster("local").setAppName("neo4jspark")
val sc = new SparkContext(conf)
val neo = Neo4j(sc)
val rdd = neo.cypher("MATCH (p:POINT) RETURN p").loadRowRdd
rdd.count
The error:
Loading neo4jspark.scala...
import org.neo4j.spark._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.SparkConf
<console>:38: error: not found: value SparkConf
val conf = new SparkConf.setMaster("local").setAppName("neo4jspark")
^
<console>:38: error: not found: value conf
val sc = new SparkContext(conf)
^
<console>:39: error: not found: value Neo4j
val neo = Neo4j(sc)
^
<console>:38: error: not found: value neo
val rdd = neo.cypher("MATCH (p:POINT) RETURN p").loadRowRdd
^
<console>:39: error: object count is not a member of package org.apache.spark.streaming.rdd
rdd.count
^
I am importing the SparkConf, no idea whay it says there is no value for that, am I missing something (simple I hope)??
EDIT: This seems to be a version error:
I ran it with this start up:
spark-shell --conf spark.neo4j.bolt.password=TestNeo4j --packages neo4j-contrib:neo4j-spark-connector:2.0.0-M2,graphframes:graphframes:0.2.0-spark2.0-s_2.11 -i neo4jspark.scala
Still get the conf error but it does run. I just need to now figure out why anything i do with the returned RDD breaks :D This is on mac, just tested same versions of everything on my windows and it breaks like it showed her mostly because it couldn't get the import.org.neo4j.spark._ here is the error:
<console>:23: error: object neo4j is not a member of package org
import org.neo4j.spark._
^
No idea what is different about windows than the mac I am deving on :/
enter image description hereI receive a error when try do select over my temp table. Somebody can help me please?
object StreamingLinReg extends java.lang.Object{
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "127.0.0.1").setAppName("Streaming Liniar Regression")
.set("spark.cassandra.connection.port", "9042")
.set("spark.driver.allowMultipleContexts", "true")
.set("spark.streaming.receiver.writeAheadLog.enable", "true")
val sc = new SparkContext(conf);
val ssc = new StreamingContext(sc, Seconds(1));
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
import sqlContext.implicits._
val trainingData = ssc.cassandraTable[String]("features","consumodata").select("consumo", "consumo_mensal", "soma_pf", "tempo_gasto").map(LabeledPoint.parse)
trainingData.toDF.registerTempTable("training")
val dstream = new ConstantInputDStream(ssc, trainingData)
val numFeatures = 100
val model = new StreamingLinearRegressionWithSGD()
.setInitialWeights(Vectors.zeros(numFeatures))
.setNumIterations(1)
.setStepSize(0.1)
.setMiniBatchFraction(1.0)
model.trainOn(dstream)
model.predictOnValues(dstream.map(lp => (lp.label, lp.features))).foreachRDD { rdd =>
val metrics = new RegressionMetrics(rdd)
val MSE = metrics.meanSquaredError //Squared error
val RMSE = metrics.rootMeanSquaredError //Squared error
val MAE = metrics.meanAbsoluteError //Mean absolute error
val Rsquared = metrics.r2
//val Explained variance = metrics.explainedVariance
rdd.toDF.registerTempTable("liniarRegressionModel")
}
}
ssc.start()
ssc.awaitTermination()
//}
}
%sql
select * from liniarRegressionModel limit 10
when I do select the temporary table I get an error message.I run first paragraph after execute the select over temp table.
org.apache.spark.sql.AnalysisException: Table not found: liniarRegressionModel; line 1 pos 14 at org.apache.spark.sql.catalyst.analysis.
package$AnalysisErrorAt.failAnalysis (package.scala:42) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
$.getTable (Analyzer.scala:305) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
$$anonfun$apply$9.applyOrElse
(Analyzer.scala:314) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
$$anonfun$apply$9.applyOrElse(Analyzer.scala:309) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:57) at org.apache.spark.sql.catalyst.trees.CurrentOrigin
$.withOrigin(TreeNode.scala:69) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators
(LogicalPlan.scala:56) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
$$anonfun$1.apply(LogicalPlan.scala:54) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$1.apply
(LogicalPlan.scala:54) at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply
(TreeNode.scala:281) at scala.collection.Iterator
$$anon$11.next(Iterator.scala:328) at scala.collection.Iterator$
class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach
(Iterator.scala:1157) at scala.collection.generic.Growable $class.$plus$plus$eq(Growable.scala:48) at scala.collection.mutable.ArrayBuffer.
$plus$plus$eq(ArrayBuffer.scala:103) at scala.collection.mutable.ArrayBuffer.
$plus$plus$eq(ArrayBuffer.scala:47) at scala.collection.TraversableOnce$class.to
(TraversableOnce.scala:273) at scala.collection.AbstractIterator.to
(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toBuffer
(TraversableOnce.scala:265) at scala.collection.AbstractIterator.toBuffer
(Iterator.scala:1157) at scala.collection.TraversableOnce$class.toArray
(TraversableOnce.scala:252) at scala.collection.AbstractIterator.toArray
(Iterator.scala:1157)
My output after execute the code
import java.lang.Object
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql
import org.apache.spark.streaming._
import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.StreamingContext._
import com.datastax.spark.connector.streaming._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ConstantInputDStream
import org.apache.spark.mllib.evaluation.RegressionMetrics
defined module StreamingLinReg
FINISHED
Took 15 seconds