I'm trying to implement this GraphX example:
import org.apache.spark._
import org.apache.spark.graphx._
val conf = new SparkConf().setAppName("GraphX Example")
val sc = new SparkContext(conf)
// Create an RDD of vertices
val verticesRDD = sc.parallelize(Seq((-1L, "nowhere"), (1L, "yahou"), (2L, "sanae"), (3L, "hanane"), (4L, "said"), (5L, "halima")))
// Create an RDD of edges
val edgesRDD = sc.parallelize(Seq(Edge(1L, 3L, "commenter"), Edge(1L, 3L, "suivre"), Edge(2L, 3L, "commenter"), Edge(2L, 5L, "connecter"), Edge(4L, 2L, "connecter")))
// Create the graph with the default vertex
val graph = Graph(verticesRDD, edgesRDD, "nowhere")
graph.vertices.collect.foreach(println)
graph.edges.collect.foreach(println)
val numVertices = graph.numVertices
val numEdges = graph.numEdges
println(s"Number of vertices: $numVertices")
println(s"Number of edges: $numEdges")
but numVertices always comes back as 0.
Nothing in the code looks wrong to me.
PS: In my example I expect the result to be 6.
The issue turned out to be these two lines:
val conf = new SparkConf().setAppName("GraphX Example")
val sc = new SparkContext(conf)
When I run them twice in a spark-shell session, the context shuts down automatically.
The solution was to restart my machine and re-run the script without these two lines, and it worked. Thank you all.
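For anyone hitting the same thing: spark-shell already provides a SparkContext as sc, so a script meant to run there can reuse it instead of constructing a second one. A minimal sketch, assuming Spark 2.x or later; SparkContext.getOrCreate returns the running context if one exists, and the setIfMissing call is only there so the snippet also runs outside the shell:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Reuse the context that spark-shell already started instead of calling
// new SparkContext(conf) a second time.
val conf = new SparkConf()
  .setAppName("GraphX Example")
  .setIfMissing("spark.master", "local[*]") // assumption: local run when no master is set
val sc = SparkContext.getOrCreate(conf)

// Same vertices and edges as in the question
val verticesRDD = sc.parallelize(Seq(
  (-1L, "nowhere"), (1L, "yahou"), (2L, "sanae"),
  (3L, "hanane"), (4L, "said"), (5L, "halima")))
val edgesRDD = sc.parallelize(Seq(
  Edge(1L, 3L, "commenter"), Edge(1L, 3L, "suivre"), Edge(2L, 3L, "commenter"),
  Edge(2L, 5L, "connecter"), Edge(4L, 2L, "connecter")))

val graph = Graph(verticesRDD, edgesRDD, "nowhere")
println(s"Number of vertices: ${graph.numVertices}")
println(s"Number of edges: ${graph.numEdges}")
```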
Related
I am new to Scala and MLlib and I have been getting the following error. Please let me know if anyone has been able to resolve something similar.
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vectors
.
.
.
val conf = new SparkConf().setMaster("local").setAppName("SampleApp")
val sContext = new SparkContext(conf)
val sc = SparkSession.builder().master("local").appName("SampleApp").getOrCreate()
val sampleData = sc.read.json("input/sampleData.json")
val clusters = KMeans.train(sampleData, 10, 10)
val WSSSE = clusters.computeCost(sampleData)
clusters.save(sc, "target/org/apache/spark/KMeansExample/KMeansModel")
val sameModel = KMeansModel.load(sContext, "target/org/apache/spark/KMeansExample/KMeansModel")
The code above fails with this error:
type mismatch; found : org.apache.spark.sql.DataFrame (which expands to) org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] required: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector]
So I tried:
import org.apache.spark.ml.clustering.KMeans
val kmeans = new KMeans().setK(20)
val model = kmeans.fit(sampleData)
val predictions = model.transform(sampleData)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
This gives the error:
Exception in thread "main" java.lang.IllegalArgumentException: Field "features" does not exist.
Available fields: address, attributes, business_id
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:267)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:266)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:58)
at org.apache.spark.ml.util.SchemaUtils$.validateVectorCompatibleColumn(SchemaUtils.scala:119)
at org.apache.spark.ml.clustering.KMeansParams$class.validateAndTransformSchema(KMeans.scala:96)
at org.apache.spark.ml.clustering.KMeans.validateAndTransformSchema(KMeans.scala:285)
at org.apache.spark.ml.clustering.KMeans.transformSchema(KMeans.scala:382)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:341)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:340)
at org.apache.spark.ml.util.Instrumentation$$anonfun$11.apply(Instrumentation.scala:183)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:183)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:340)
I have been referring to https://spark.apache.org/docs/latest/ml-clustering.html and https://spark.apache.org/docs/latest/mllib-clustering.html
Edit
Using setFeaturesCol()
import org.apache.spark.ml.clustering.KMeans
val assembler = new VectorAssembler()
.setInputCols(Array("is_open", "review_count", "stars"))
.setOutputCol("features")
val output = assembler.transform(sampleData).select("features")
val kmeans = new KMeans().setK(20).setFeaturesCol("features")
val model = kmeans.fit(output)
val predictions = model.transform(sampleData)
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
This still fails, but with a different error:
Exception in thread "main" java.lang.NoSuchMethodError: org.apache.spark.util.Utils$.getSimpleName(Ljava/lang/Class;)Ljava/lang/String;
at org.apache.spark.ml.util.Instrumentation.logPipelineStage(Instrumentation.scala:52)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:350)
at org.apache.spark.ml.clustering.KMeans$$anonfun$fit$1.apply(KMeans.scala:340)
at org.apache.spark.ml.util.Instrumentation$$anonfun$11.apply(Instrumentation.scala:183)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.ml.util.Instrumentation$.instrumented(Instrumentation.scala:183)
at org.apache.spark.ml.clustering.KMeans.fit(KMeans.scala:340)
Thanks.
Use a Scala ML Pipeline:
val assembler = new VectorAssembler()
.setInputCols(Array("feature1", "feature2", "feature3"))
.setOutputCol("assembled_features")
val scaler = new StandardScaler()
.setInputCol("assembled_features")
.setOutputCol("features")
.setWithStd(true)
.setWithMean(false)
val kmeans = new KMeans().setK(2).setSeed(1L)
// create the pipeline
val pipeline = new Pipeline()
.setStages(Array(assembler, scaler, kmeans))
// Fit the model
val clusterModel = pipeline.fit(train)
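To make the connection to the "features" error explicit: the ml KMeans expects a single vector column, which VectorAssembler produces from numeric columns. The following is a rough end-to-end sketch, assuming Spark 2.3 or later (for ClusteringEvaluator). The column names is_open, review_count and stars come from the question; the rows are made up for illustration, and the scaler stage is left out to keep it short.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("KMeansSketch").getOrCreate()
import spark.implicits._

// Made-up rows standing in for the JSON input from the question
val data = Seq(
  (1.0, 10.0, 4.5),
  (1.0, 12.0, 4.0),
  (0.0, 3.0, 2.0),
  (0.0, 2.0, 1.5)
).toDF("is_open", "review_count", "stars")

// Pack the numeric columns into the single vector column KMeans reads
val assembler = new VectorAssembler()
  .setInputCols(Array("is_open", "review_count", "stars"))
  .setOutputCol("features")
val kmeans = new KMeans().setK(2).setSeed(1L)

val model = new Pipeline().setStages(Array(assembler, kmeans)).fit(data)

// Transform with the fitted pipeline so that "features" and "prediction" both exist
val predictions = model.transform(data)
val silhouette = new ClusteringEvaluator().evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")
```

Because the assembler is a pipeline stage, the same fitted model can be applied to the raw input; there is no separate DataFrame holding only the features column to keep in sync.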
I have to query HBase and then work with the data in Spark and Scala.
My problem is that my current solution loads ALL the data from my HBase table and only then filters it. That is inefficient because it takes too much memory, so I would like to apply the filter directly during the read. How can I do that?
def HbaseSparkQuery(table: String, gatewayINPUT: String, sparkContext: SparkContext): DataFrame = {
val sqlContext = new SQLContext(sparkContext)
import sqlContext.implicits._
val conf = HBaseConfiguration.create()
val tableName = table
conf.set("hbase.zookeeper.quorum", "localhost")
conf.set("hbase.master", "localhost:60000")
conf.set(TableInputFormat.INPUT_TABLE, tableName)
val hBaseRDD = sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
val DATAFRAME = hBaseRDD.map(x => {
(Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"))),
Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("eventTime"))),
Bytes.toString(x._2.getValue(Bytes.toBytes("node"), Bytes.toBytes("imei"))),
Bytes.toString(x._2.getValue(Bytes.toBytes("measure"), Bytes.toBytes("rssi"))))
}).toDF()
.withColumnRenamed("_1", "GatewayIMEA")
.withColumnRenamed("_2", "EventTime")
.withColumnRenamed("_3", "ap")
.withColumnRenamed("_4", "RSSI")
.filter($"GatewayIMEA" === gatewayINPUT)
DATAFRAME
}
As you can see in my code, the filter is applied after the DataFrame is created, that is, after the HBase data has already been loaded.
Thank you in advance for your answers.
Here is the solution I found:
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.filter._
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil
object HbaseConnector {
def main(args: Array[String]): Unit = {
// System.setProperty("hadoop.home.dir", "/usr/local/hadoop")
val sparkConf = new SparkConf().setAppName("CoverageAlgPipeline").setMaster("local[*]")
val sparkContext = new SparkContext(sparkConf)
val sqlContext = new SQLContext(sparkContext)
import sqlContext.implicits._
val spark = org.apache.spark.sql.SparkSession.builder
.master("local")
.appName("Coverage Algorithm")
.getOrCreate
val GatewayIMEA = "123"
val TABLE_NAME = "TABLE"
val conf = HBaseConfiguration.create()
conf.set("hbase.zookeeper.quorum", "localhost")
conf.set("hbase.master", "localhost:60000")
conf.set(TableInputFormat.INPUT_TABLE, TABLE_NAME)
val connection = ConnectionFactory.createConnection(conf)
val table = connection.getTable(TableName.valueOf(TABLE_NAME))
val scan = new Scan
val GatewayIDFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes(String.valueOf(GatewayIMEA)))
scan.setFilter(GatewayIDFilter)
conf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))
val hBaseRDD = sparkContext.newAPIHadoopRDD(conf, classOf[TableInputFormat], classOf[ImmutableBytesWritable], classOf[Result])
val DATAFRAME = hBaseRDD.map(x => {
(Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"))),
Bytes.toString(x._2.getValue(Bytes.toBytes("header"), Bytes.toBytes("eventTime"))),
Bytes.toString(x._2.getValue(Bytes.toBytes("node"), Bytes.toBytes("imei"))),
Bytes.toString(x._2.getValue(Bytes.toBytes("measure"), Bytes.toBytes("Measure"))))
}).toDF()
.withColumnRenamed("_1", "GatewayIMEA")
.withColumnRenamed("_2", "EventTime")
.withColumnRenamed("_3", "ap")
.withColumnRenamed("_4", "measure")
DATAFRAME.show()
}
}
In short: set the input table, build the filter, attach it to the scan, pass the serialized scan to the configuration so that only matching rows are read into the RDD, and then (optionally) convert the RDD to a DataFrame.
To apply multiple filters:
val timestampFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("eventTime"), CompareFilter.CompareOp.GREATER, Bytes.toBytes(String.valueOf(dateOfDayTimestamp)))
val GatewayIDFilter = new SingleColumnValueFilter(Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"), CompareFilter.CompareOp.EQUAL, Bytes.toBytes(String.valueOf(GatewayIMEA)))
val filters = new FilterList(GatewayIDFilter, timestampFilter)
scan.setFilter(filters)
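A slightly more explicit variant of the same idea, as a sketch assuming the HBase 1.x/2.x client API used above (CompareFilter.CompareOp). The two input values are hypothetical placeholders. Two details are worth spelling out: FilterList can be given its operator explicitly, and SingleColumnValueFilter lets rows through when the column is missing unless setFilterIfMissing(true) is called.

```scala
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.filter.{CompareFilter, FilterList, SingleColumnValueFilter}
import org.apache.hadoop.hbase.util.Bytes

// Hypothetical inputs, for illustration only
val gatewayIMEA = "123"
val dateOfDayTimestamp = "1500000000000"

val gatewayFilter = new SingleColumnValueFilter(
  Bytes.toBytes("header"), Bytes.toBytes("gatewayIMEA"),
  CompareFilter.CompareOp.EQUAL, Bytes.toBytes(gatewayIMEA))
// Drop rows that do not have the column at all
gatewayFilter.setFilterIfMissing(true)

val timestampFilter = new SingleColumnValueFilter(
  Bytes.toBytes("header"), Bytes.toBytes("eventTime"),
  CompareFilter.CompareOp.GREATER, Bytes.toBytes(dateOfDayTimestamp))
timestampFilter.setFilterIfMissing(true)

// MUST_PASS_ALL = logical AND; MUST_PASS_ONE would be a logical OR
val filters = new FilterList(FilterList.Operator.MUST_PASS_ALL, gatewayFilter, timestampFilter)

val scan = new Scan()
scan.setFilter(filters)
```

Note that the values are compared as raw bytes, so a GREATER comparison on a timestamp stored as a string is lexicographic; it matches numeric order only when all stored values have the same number of digits.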
You can use a Spark-HBase connector with predicate pushdown, e.g. https://spark-packages.org/package/Huawei-Spark/Spark-SQL-on-HBase
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark._
import org.apache.spark.sql.types._
import org.apache.spark.sql._
object fixedLength {
def main(args:Array[String]) {
def getRow(x : String) : Row={
val columnArray = new Array[String](4)
columnArray(0)=x.substring(0,3)
columnArray(1)=x.substring(3,13)
columnArray(2)=x.substring(13,18)
columnArray(3)=x.substring(18,22)
Row.fromSeq(columnArray)
}
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()
val conf = new SparkConf().setAppName("FixedLength").setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true");
val sc = new SparkContext(conf)
val fruits = sc.textFile("in/fruits.txt")
val schemaString = "id,fruitName,isAvailable,unitPrice";
val fields = schemaString.split(",").map( field => StructField(field,StringType,nullable=true))
val schema = StructType(fields)
val df = spark.createDataFrame(fruits.map { x => getRow(x)} , schema)
df.show() // Error
println("End of the program")
}
}
I'm getting an error at the df.show() call.
My file content is:
56 apple TRUE 0.56
45 pear FALSE1.34
34 raspberry TRUE 2.43
34 plum TRUE 1.31
53 cherry TRUE 1.4
23 orange FALSE2.34
56 persimmon FALSE23.2
ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: org.apache.spark.util.SerializableConfiguration cannot be cast to [B
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:81)
Can you please help?
You are creating the RDD the old way, with SparkContext(conf):
val conf = new SparkConf().setAppName("FixedLength").setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true");
val sc = new SparkContext(conf)
val fruits = sc.textFile("in/fruits.txt")
whereas you are creating the DataFrame the new way, using SparkSession:
val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()
val df = spark.createDataFrame(fruits.map { x => getRow(x)} , schema)
Ultimately, you are mixing an RDD created with the old SparkContext functions with a DataFrame created using the new SparkSession.
I would suggest using only one of the two approaches.
I guess that's the reason for the issue.
Update
Doing the following should work for you:
def getRow(x : String) : Row={
val columnArray = new Array[String](4)
columnArray(0)=x.substring(0,3)
columnArray(1)=x.substring(3,13)
columnArray(2)=x.substring(13,18)
columnArray(3)=x.substring(18,22)
Row.fromSeq(columnArray)
}
Logger.getLogger("org").setLevel(Level.ERROR)
val spark = SparkSession.builder().master("local").appName("ReadingCSV").getOrCreate()
val fruits = spark.sparkContext.textFile("in/fruits.txt")
val schemaString = "id,fruitName,isAvailable,unitPrice";
val fields = schemaString.split(",").map( field => StructField(field,StringType,nullable=true))
val schema = StructType(fields)
val df = spark.createDataFrame(fruits.map { x => getRow(x)} , schema)
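Separately from the context issue, getRow uses substring with fixed offsets, which throws a StringIndexOutOfBoundsException on any line shorter than 22 characters. Here is a small sketch of a more tolerant parser in plain Scala, using the same column boundaries; parseFixedWidth and the sample lines are illustrative and not from the original post:

```scala
// Column boundaries (start, end) copied from getRow above
val boundaries = Seq((0, 3), (3, 13), (13, 18), (18, 22))

// slice never throws: out-of-range positions just yield a shorter (or empty) string
def parseFixedWidth(line: String): Seq[String] =
  boundaries.map { case (start, end) => line.slice(start, end).trim }

// Build a sample line with explicit padding so the widths are unambiguous
val sample = "56".padTo(3, ' ') + "apple".padTo(10, ' ') + "TRUE".padTo(5, ' ') + "0.56"

println(parseFixedWidth(sample))      // fields with the padding trimmed
println(parseFixedWidth("53 cherry")) // short line: missing fields come back empty
```

Inside getRow, Row.fromSeq(parseFixedWidth(x)) could then replace the manual array filling.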
I am new to Spark and trying a basic classifier in Scala.
I'm trying to get the accuracy, but when using MulticlassClassificationEvaluator it gives the error below:
Caused by: java.lang.IllegalArgumentException: Field "label" does not exist.
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)
at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
at scala.collection.AbstractMap.getOrElse(Map.scala:59)
at org.apache.spark.sql.types.StructType.apply(StructType.scala:227)
at org.apache.spark.ml.util.SchemaUtils$.checkNumericType(SchemaUtils.scala:71)
at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:76)
at com.classifier.classifier_app.App$.<init>(App.scala:90)
at com.classifier.classifier_app.App$.<clinit>(App.scala)
The code is as below:
val conf = new SparkConf().setMaster("local[*]").setAppName("Classifier")
val sc = new SparkContext(conf)
val spark = SparkSession
.builder()
.appName("Email Classifier")
.config("spark.some.config.option", "some-value")
.getOrCreate()
import spark.implicits._
val spamInput = "TRAIN_00000_0.eml" //files to train model
val normalInput = "TRAIN_00002_1.eml"
val spamData = spark.read.textFile(spamInput)
val normalData = spark.read.textFile(normalInput)
case class Feature(index: Int, value: String)
val indexer = new StringIndexer()
.setInputCol("value")
.setOutputCol("label")
val regexTokenizer = new RegexTokenizer()
.setInputCol("value")
.setOutputCol("cleared")
.setPattern("\\w+").setGaps(false)
val remover = new StopWordsRemover()
.setInputCol("cleared")
.setOutputCol("filtered")
val hashingTF = new HashingTF()
.setInputCol("filtered").setOutputCol("features")
.setNumFeatures(100)
val nb = new NaiveBayes()
val indexedSpam = spamData.map(x=>Feature(0, x))
val indexedNormal = normalData.map(x=>Feature(1, x))
val trainingData = indexedSpam.union(indexedNormal)
val pipeline = new Pipeline().setStages(Array (indexer, regexTokenizer, remover, hashingTF, nb))
val model = pipeline.fit(trainingData)
model.write.overwrite().save("myNaiveBayesModel")
val spamTest = spark.read.textFile("TEST_00009_0.eml")
val normalTest = spark.read.textFile("TEST_00000_1.eml")
val sameModel = PipelineModel.load("myNaiveBayesModel")
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("label")
.setPredictionCol("prediction")
.setMetricName("accuracy")
Console.println("Spam Test")
val predictionSpam = sameModel.transform(spamTest).select("prediction")
predictionSpam.foreach(println(_))
val accuracy = evaluator.evaluate(predictionSpam)
println("Accuracy Spam: " + accuracy)
Console.println("Normal Test")
val predictionNorm = sameModel.transform(normalTest).select("prediction")
predictionNorm.foreach(println(_))
val accuracyNorm = evaluator.evaluate(predictionNorm)
println("Accuracy Normal: " + accuracyNorm)
The error occurs when initializing the MulticlassClassificationEvaluator. How should the column names be specified? Any help is appreciated.
The error is in this line:
val predictionSpam = sameModel.transform(spamTest).select("prediction")
Your DataFrame contains only the prediction column and no label column, because select("prediction") drops every other column before the evaluator sees the data.
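A minimal sketch of the fix, assuming Spark 2.x or later: keep the label column next to the prediction column when selecting. The DataFrame named transformed is a made-up stand-in for the output of sameModel.transform(spamTest), so that the snippet runs on its own.

```scala
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("EvaluatorSketch").getOrCreate()
import spark.implicits._

// Made-up stand-in for sameModel.transform(spamTest): (label, prediction, value)
val transformed = Seq(
  (0.0, 0.0, "a"),
  (0.0, 1.0, "b"),
  (1.0, 1.0, "c"),
  (1.0, 1.0, "d")
).toDF("label", "prediction", "value")

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")

// Select both columns the evaluator reads, not just "prediction"
val predictionAndLabel = transformed.select("prediction", "label")
val accuracy = evaluator.evaluate(predictionAndLabel)
println("Accuracy: " + accuracy)
```

Alternatively, skip the select entirely and pass the transformed DataFrame to the evaluator; extra columns are ignored.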
My goal is to count triangles in multiple subgraphs from a common full graph. The subgraph is defined by a constant set of nodes + a node from an RDD[Long]. I'm new to spark/graphx, so this may be an improper use of map. The code I'm sharing will reproduce my error.
To start, I have a subgraph of a full graph declared as shown below
import org.apache.spark.rdd._
import org.apache.spark.graphx._
val nodes: RDD[(VertexId, String)] = sc.parallelize(Array((3L, "3"), (7L, "7"), (5L, "5"), (2L, "2"),(4L,"4")))
val vertices: RDD[Edge[String]] = sc.parallelize(Array(Edge(3L, 7L, "a"), Edge(3L, 5L, "b"), Edge(2L, 5L, "c"), Edge(5L, 7L, "d"), Edge(2L, 7L, "e"),Edge(4L,5L,"f")))
val graph: Graph[String,String] = Graph(nodes, vertices, "z")
val baseNodes: Array[Long] = Array(2L,5L,7L)
val subgraph = graph.subgraph(vpred = (vid,attr)=> baseNodes contains vid)
Then I declare an RDD[Long] of other nodes from the graph.
val testNodes: RDD[Long] = sc.parallelize(Array(3L,4L))
I want to add each testNode to the subgraph and count the triangles present at testNode.
val triangles: RDD[(Long,Int)] = testNodes.map{ newNode =>
val newNodes: Array[Long] = baseNodes :+ newNode
val newSubgraph = graph.subgraph(vpred = (vid,attr)=> newNodes contains vid)
(newNode,findTriangles(7L,newSubgraph))
}
triangles.foreach(x=>x.toString)
My findTriangles works fine if I call it outside of the map function.
def findTriangles(id:Long,subgraph:Graph[String,String]): Int = {
val triCounts = subgraph.triangleCount().vertices
val count:Int = triCounts.filter{case(item,count)=> {item.toInt == id}}.map{case(item,count)=>count}.first
count
}
val triangles = findTriangles(7L,subgraph) //1
But when I run my map function to calculate triangles, I get a NullPointerException. I think the problem is in using my graph val inside the mapping function. Is that the issue? Is there a way to work around this?
I think that the issue should be the baseNodes variable. Variables that are declared locally, such as baseNodes in your example, are only visible in the Spark driver, not in the executors that actually execute transformations and actions. To avoid the NullPointerException, you need to parallelize any variable that you'll need in the transformations (like map) that are executed on the executors. As an alternative, if the variable you have is read-only, you can broadcast that variable to executors using the broadcast construct in Spark. In your case, it seems that baseNodes doesn't get modified within the map operation, so it's a good candidate to be broadcast instead of parallelized.
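Another likely factor, independent of baseNodes: graph is itself backed by RDDs, and Spark does not support using an RDD (or a Graph) inside a transformation such as testNodes.map, which typically surfaces as a NullPointerException on the executors. A minimal sketch under that assumption: keep the few test nodes on the driver, loop over them there, and broadcast baseNodes as suggested above. The name trianglesAt and the local SparkSession setup are illustrative choices, not from the original post, and the count is taken at the newly added node, as the question's text describes.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("TriangleSketch").getOrCreate()
val sc = spark.sparkContext

// Same graph as in the question
val nodes: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (3L, "3"), (7L, "7"), (5L, "5"), (2L, "2"), (4L, "4")))
val edges: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(3L, 7L, "a"), Edge(3L, 5L, "b"), Edge(2L, 5L, "c"),
  Edge(5L, 7L, "d"), Edge(2L, 7L, "e"), Edge(4L, 5L, "f")))
val graph: Graph[String, String] = Graph(nodes, edges, "z")

// Read-only lookup data: broadcast it once to the executors
val baseNodesBc = sc.broadcast(Set(2L, 5L, 7L))

// Plain driver-side collection instead of an RDD[Long]
val testNodes: Seq[VertexId] = Seq(3L, 4L)

// Number of triangles that pass through vertex `id` in graph `g`
def trianglesAt(id: VertexId, g: Graph[String, String]): Int =
  g.triangleCount().vertices
    .filter { case (vid, _) => vid == id }
    .map { case (_, count) => count }
    .first()

// This map runs on the driver; each iteration launches ordinary Spark jobs
val triangles: Seq[(VertexId, Int)] = testNodes.map { newNode =>
  val sub = graph.subgraph(vpred = (vid, _) => baseNodesBc.value.contains(vid) || vid == newNode)
  (newNode, trianglesAt(newNode, sub))
}
triangles.foreach(println)
```

If the list of test nodes were too large to loop over on the driver, the problem would need to be restructured (for example around a single triangleCount on a suitably built graph) rather than nesting graph operations inside a map.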