How to construct a graph in GraphX - Scala

I am new to Scala and GraphX and am having problems converting a TSV file to a graph.
I have a flat tab-separated file like the one below:
n1 P1 n2
n3 P1 n4
n2 P2 n3
n3 P2 n1
n1 P3 n4
n3 P3 n2
where n1, n2, n3, n4 are the nodes of the graph and P1, P2, P3 are the properties which should form the edges between the nodes.
How can I construct a graph from the above file in Spark GraphX?
Example code would be very helpful.

Here is some code for you (of course you should build it into a jar file using sbt):
package vinnie.pooh

import org.apache.spark.SparkContext._
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object Main {
  def main(args: Array[String]) {
    if (args.length != 1) {
      System.err.println("Should be one parameter: <path/to/edges>")
      System.exit(1)
    }

    val conf = new SparkConf()
      .setAppName("Load graph")
      .setSparkHome(System.getenv("SPARK_HOME"))
      .setJars(SparkContext.jarOfClass(this.getClass).toList)
    val sc = new SparkContext(conf)

    // Each line looks like "<srcId> <property> <dstId>"; the middle field
    // becomes the edge attribute.
    val edges: RDD[Edge[String]] =
      sc.textFile(args(0)).map { line =>
        val fields = line.split(" ")
        Edge(fields(0).toLong, fields(2).toLong, fields(1))
      }

    // Vertices are created automatically from the edge endpoints and get
    // the given default attribute.
    val graph: Graph[Any, String] = Graph.fromEdges(edges, "defaultProperty")

    println("num edges = " + graph.numEdges)
    println("num vertices = " + graph.numVertices)
  }
}
and I have edge.txt:
1 Prop12 2
2 Prop24 4
4 Prop45 5
5 Prop52 2
6 Prop65 7
and then, for example, you can launch it locally:
$SPARK_HOME>./bin/spark-submit --class vinnie.pooh.Main --master local[2] ~/justBuiltJar.jar ~/edge.txt
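Note that the code above splits on single spaces and expects numeric vertex IDs, while the original file is tab-separated and uses string node names (n1, n2, ...). A minimal sketch for that case, assuming it is acceptable to derive a numeric VertexId by hashing the node name (the helper and the file path below are mine):

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Hypothetical helper: derive a numeric VertexId from a string node name.
// hashCode collisions are possible, so this is only a sketch.
def toVertexId(name: String): VertexId = name.hashCode.toLong

val lines = sc.textFile("path/to/nodes.tsv")   // assumed path

// Keep the original node name as the vertex attribute.
val vertices: RDD[(VertexId, String)] = lines.flatMap { line =>
  val f = line.split("\t")
  Seq((toVertexId(f(0)), f(0)), (toVertexId(f(2)), f(2)))
}.distinct()

// The middle column (P1, P2, ...) becomes the edge attribute.
val edges: RDD[Edge[String]] = lines.map { line =>
  val f = line.split("\t")
  Edge(toVertexId(f(0)), toVertexId(f(2)), f(1))
}

val graph: Graph[String, String] = Graph(vertices, edges)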

Related

Unable to "explode" spark vector from PCA

I am trying to scatter plot the 2 features resulting from the PCA in the Spark ML library.
To be more precise, I am trying to convert the result into something like this:
 id |  X  |  Y
----+-----+-----
  1 | 0.1 | 0.1
  2 | 0.2 | 0.2
  3 | 0.4 | 0.4
  4 | 0.3 | 0.3
...
from something like this:
 id | pca
----+------------
  1 | [0.1,0.1]
  2 | [0.2,0.2]
  3 | [0.4,0.4]
  4 | [0.3,0.3]
...
But it seems that Spark vectors aren't iterable, or something like that. I don't understand what is going on. If someone knows the answer, that would be great.
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.feature.VectorAssembler

val convertToVector = udf((array: Array[Double]) => {
  Vectors.dense(array.toArray)
})

val convertToDouble = udf((array: Array[Float]) => {
  array.map(_.toDouble).toArray
})

val ds = model.userFactors.withColumn("features", convertToDouble($"features"))
val userMatrixDs = ds.withColumn("features", convertToVector($"features"))
//val df3 = assembler.transform(df2)

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pca")
  .setK(2)
  .fit(userMatrixDs)

// Project vectors to the linear space spanned by the top 2 principal
// components, keeping the label
val result = pca.transform(userMatrixDs).select("id", "pca")
result.show()

result.select(
    result.id,
    result.col("pca")[0].as("eigenVector1"),
    result.col("pca")[1].as("eigenVector2")
  )
  .show()
Welcome to StackOverflow. Take a look at this example:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}
import spark.implicits._

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(1, 1.0, 2.0))),
  StructType(
    List(
      StructField("id", IntegerType),
      StructField("one", DoubleType),
      StructField("two", DoubleType)
    )
  ))

// Assemble the numeric columns into a single Vector column.
val assembler =
  new VectorAssembler()
    .setInputCols(Array("one", "two"))
    .setOutputCol("vector")

val df0 = assembler.transform(df)

// Convert to a typed Dataset and pull the components out of the Vector.
df0
  .select("id", "vector")
  .as[(Int, Vector)]
  .map { case (id, vector) =>
    val arr = vector.toArray
    (id, arr(0), arr(1))
  }
  .select($"_1".as("id"), $"_2".as("pca_x"), $"_3".as("pca_y"))
First I create a Vector column with VectorAssembler and then extract the values by transforming it into a Dataset[(Int, Vector)]. With map you can easily manipulate each row.
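If you prefer to stay in the untyped DataFrame API, another option (a sketch, assuming Spark 2.x and the id/pca columns from the question) is a small UDF that pulls a single component out of the ml Vector:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{lit, udf}

// Extract element i from an ml Vector column.
val vectorElement = udf((v: Vector, i: Int) => v(i))

result
  .select(
    $"id",
    vectorElement($"pca", lit(0)).as("eigenVector1"),
    vectorElement($"pca", lit(1)).as("eigenVector2"))
  .show()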

Apply PCA on specific columns with Apache Spark

I am trying to apply PCA on a dataset that contains a header and several fields.
Here is the code I used. Any help with selecting the specific columns on which to apply PCA would be appreciated.
import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val inputMatrix = sc.textFile("C:/Users/mhattabi/Desktop/Realase of 01_06_2017/TopDrive_WithoutConstant.csv").map { line =>
  val values = line.split(",").map(_.toDouble)
  Vectors.dense(values)
}

val mat: RowMatrix = new RowMatrix(inputMatrix)
val pc: Matrix = mat.computePrincipalComponents(4)

// Project the rows to the linear space spanned by the top 4 principal components.
val projected: RowMatrix = mat.multiply(pc)
Updated version: I tried this:
import org.apache.spark.ml.feature.{PCA, RFormula}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val dataframe = spark.read.format("com.databricks.spark.csv")

val columnsToUse: Seq[String] = Array("Col0", "Col1", "Col2", "Col3", "Col4").toSeq
val k: Int = 2

val df = spark.read.format("csv")
  .options(Map("header" -> "true", "inferSchema" -> "true"))
  .load("C:/Users/mhattabi/Desktop/donnee/cassandraTest_1.csv")

val rf = new RFormula().setFormula(s"~ ${columnsToUse.mkString(" + ")}")
val pca = new PCA().setInputCol("features").setOutputCol("pcaFeatures").setK(k)

val featurized = rf.fit(df).transform(df)

// principal components
val principalComponent = pca.fit(featurized).transform(featurized)
principalComponent.select("pcaFeatures").show(4, false)
+-----------------------------------------+
|pcaFeatures |
+-----------------------------------------+
|[-0.536798281241379,0.495499034754084] |
|[-0.32969328815797916,0.5672811417154808]|
|[-1.32283465170085,0.5982789033642704] |
|[-0.6199718696225502,0.3173072633712586] |
+-----------------------------------------+
I got this for the principal components. The question is: I want to save this in a CSV file and add a header. Any help would be appreciated.
Thanks a lot
You can use RFormula in this case:
import org.apache.spark.ml.feature.{RFormula, PCA}
val columnsToUse: Seq[String] = ???
val k: Int = ???
val df = spark.read.format("csv").options(Map("header" -> "true", "inferSchema" -> "true")).load("/tmp/foo.csv")
val rf = new RFormula().setFormula(s"~ ${columnsToUse.mkString(" + ")}")
val pca = new PCA().setInputCol("features").setK(k)
val featurized = rf.fit(df).transform(df)
val projected = pca.fit(featurized).transform(featurized)
If you get java.lang.NumberFormatException: For input string: "DateTime",
it means that your input file contains a value DateTime that you then try to convert to Double.
It is probably somewhere in the header of your input file.
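For the follow-up about saving the principal components to a CSV file with a header: the csv writer cannot serialize a vector column directly, so one approach (a sketch based on the principalComponent DataFrame from the question; the output path is an assumption) is to expand pcaFeatures into plain Double columns first:

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, lit, udf}

// Extract element i from the pcaFeatures vector.
val element = udf((v: Vector, i: Int) => v(i))

principalComponent
  .select(
    element(col("pcaFeatures"), lit(0)).as("pc1"),
    element(col("pcaFeatures"), lit(1)).as("pc2"))
  .write
  .option("header", "true")
  .csv("/tmp/pca_output")   // assumed output path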

Cosine Similarity via DIMSUM in Spark

I have a very simple piece of code to try cosine similarity:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
val rows= Array(((1,2,3,4,5),(1,2,3,4,5),(1,2,4,5,8),(3,4,1,2,7),(7,7,7,7,7)))
val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)
I run this code on Amazon AWS, which has Spark 1.5; however, I get the following message for the last two lines:
"Error: value columnSimilarities is not a member of org.apache.spark.rdd.RDD[(Int, Int)]"
Could you please help to resolve this issue?
I found the answer: I needed to convert the matrix to an RDD. Here is the correct code:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}
import org.apache.spark.rdd._
import org.apache.spark.mllib.linalg._

// Turn a local column-major Matrix into an RDD of row Vectors.
def matrixToRDD(m: Matrix): RDD[Vector] = {
  val columns = m.toArray.grouped(m.numRows)
  val rows = columns.toSeq.transpose // Skip this if you want a column-major RDD.
  val vectors = rows.map(row => new DenseVector(row.toArray))
  sc.parallelize(vectors)
}

val dm: Matrix = Matrices.dense(5, 5, Array(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 4, 5, 8, 3, 4, 1, 2, 7, 7, 7, 7, 7, 7))
val rows = matrixToRDD(dm)
val mat = new RowMatrix(rows)

val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)

println("Pairwise similarities are: " + simsPerfect.entries.collect.mkString(", "))
println("Estimated pairwise similarities are: " + simsEstimate.entries.collect.mkString(", "))
Cheers
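As an aside, the root cause of the original error is that RowMatrix expects an RDD[Vector], not a plain Scala Array of tuples. If you do not need to go through a local Matrix at all, a shorter sketch (mirroring the row values from the question) builds the RDD directly:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Each row of the matrix as a dense mllib Vector.
val rows = sc.parallelize(Seq(
  Vectors.dense(1, 2, 3, 4, 5),
  Vectors.dense(1, 2, 3, 4, 5),
  Vectors.dense(1, 2, 4, 5, 8),
  Vectors.dense(3, 4, 1, 2, 7),
  Vectors.dense(7, 7, 7, 7, 7)))

val mat = new RowMatrix(rows)
val simsPerfect = mat.columnSimilarities()
val simsEstimate = mat.columnSimilarities(0.8)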

Spark Multiclass Classification Example

Do you know where I can find examples of multiclass classification in Spark? I spent a lot of time searching in books and on the web, and so far I only know that it has been possible since the latest version, according to the documentation.
ML
(Recommended in Spark 2.0+)
We'll use the same data as in the MLlib section below. There are two basic options. If an Estimator supports multiclass classification out of the box (for example random forest), you can use it directly:
val trainRawDf = trainRaw.toDF

import org.apache.spark.ml.feature.{Tokenizer, CountVectorizer, StringIndexer}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier

val transformers = Array(
  new StringIndexer().setInputCol("group").setOutputCol("label"),
  new Tokenizer().setInputCol("text").setOutputCol("tokens"),
  new CountVectorizer().setInputCol("tokens").setOutputCol("features")
)

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = new Pipeline().setStages(transformers :+ rf).fit(trainRawDf)

model.transform(trainRawDf)
If a model supports only binary classification (logistic regression) and extends o.a.s.ml.classification.Classifier, you can use the one-vs-rest strategy:
import org.apache.spark.ml.classification.OneVsRest
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

val ovr = new OneVsRest().setClassifier(lr)

val ovrModel = new Pipeline().setStages(transformers :+ ovr).fit(trainRawDf)
MLlib
According to the official documentation at this moment (MLlib 1.6.0), the following methods support multiclass classification:
logistic regression,
decision trees,
random forests,
naive Bayes
At least some of the examples use multiclass classification:
Naive Bayes example - 3 classes
Logistic regression - 10 classes for the classifier, although only 2 in the example data
The general framework, ignoring method-specific arguments, is pretty much the same as for all the other methods in MLlib. You have to pre-process your input to create either a data frame with columns representing label and features:
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
or an RDD[LabeledPoint].
Spark provides a broad range of useful tools designed to facilitate this process, including feature extractors, feature transformers, and pipelines.
You'll find a rather naive example of using Random Forest below.
First, let's import the required packages and create dummy data:
import sqlContext.implicits._
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.linalg.{Vectors, Vector}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD

case class LabeledRecord(group: String, text: String)

val trainRaw = sc.parallelize(
  LabeledRecord("foo", "foo v a y b foo") ::
  LabeledRecord("bar", "x bar y bar v") ::
  LabeledRecord("bar", "x a y bar z") ::
  LabeledRecord("foobar", "foo v b bar z") ::
  LabeledRecord("foo", "foo x") ::
  LabeledRecord("foobar", "z y x foo a b bar v") ::
  Nil
)
Now let's define the required transformers and process the train Dataset:
// Tokenizer to process text fields
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

// HashingTF to convert tokens to the feature vector
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(10)

// Indexer to convert String labels to Double
val indexer = new StringIndexer()
  .setInputCol("group")
  .setOutputCol("label")
  .fit(trainRaw.toDF)

def transform(rdd: RDD[LabeledRecord]) = {
  val tokenized = tokenizer.transform(rdd.toDF)
  val hashed = hashingTF.transform(tokenized)
  val indexed = indexer.transform(hashed)

  indexed
    .select($"label", $"features")
    .map { case Row(label: Double, features: Vector) =>
      LabeledPoint(label, features) }
}

val train: RDD[LabeledPoint] = transform(trainRaw)
Please note that the indexer is "fitted" on the train data. This simply means that the categorical values used as the labels are converted to doubles. To use the classifier on new data, you have to transform that data first using this indexer.
Next we can train the RF model:
val numClasses = 3
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 10
val featureSubsetStrategy = "auto"
val impurity = "gini"
val maxDepth = 4
val maxBins = 16

val model = RandomForest.trainClassifier(
  train, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity,
  maxDepth, maxBins
)
and finally test it:
val testRaw = sc.parallelize(
  LabeledRecord("foo", "foo foo z z z") ::
  LabeledRecord("bar", "z bar y y v") ::
  LabeledRecord("bar", "a a bar a z") ::
  LabeledRecord("foobar", "foo v b bar z") ::
  LabeledRecord("foobar", "a foo a bar") ::
  Nil
)

val test: RDD[LabeledPoint] = transform(testRaw)

val predsAndLabs = test.map(lp => (model.predict(lp.features), lp.label))
val metrics = new MulticlassMetrics(predsAndLabs)

metrics.precision
metrics.recall
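Because the StringIndexer mapped the group strings to doubles, model.predict returns those doubles as well. A small sketch (an addition to the answer above, using the fitted indexer's labels array) maps predictions back to the original group names:

// Labels in the order assigned by the fitted StringIndexer.
val labels: Array[String] = indexer.labels

// Predicted group name for each test record.
val predictedGroups = test.map(lp => labels(model.predict(lp.features).toInt))
predictedGroups.collect().foreach(println)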
Are you using Spark 1.6 rather than Spark 2.1?
I think the problem is that in Spark 2.1 the transform method returns a Dataset, which can be implicitly converted to a typed RDD, whereas prior to that it returned a DataFrame of Rows.
As a diagnostic, try specifying the return type of the transform function as RDD[LabeledPoint] and see if you get the same error.
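A sketch of that diagnostic, assuming the code is indeed running on Spark 2.x (the function name here is mine, and spark.implicits._ is assumed to be in scope): select(...) now returns a Dataset, so an explicit .rdd is needed, and the ml HashingTF emits org.apache.spark.ml.linalg.Vector, which has to be converted back to an mllib Vector for LabeledPoint:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

def transformToLabeledPoints(rdd: RDD[LabeledRecord]): RDD[LabeledPoint] = {
  val tokenized = tokenizer.transform(rdd.toDF)
  val hashed = hashingTF.transform(tokenized)
  val indexed = indexer.transform(hashed)

  indexed
    .select($"label", $"features")
    .rdd                                   // Dataset[Row] -> RDD[Row] in Spark 2.x
    .map { case Row(label: Double, features: org.apache.spark.ml.linalg.Vector) =>
      LabeledPoint(label, Vectors.fromML(features)) }
}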

TaskSchedulerImpl: Initial job has not accepted any resources. (Error in Spark)

I'm trying to run the SparkPi example on my standalone-mode cluster.
package org.apache.spark.examples

import scala.math.random
import org.apache.spark._

/** Computes an approximation to pi */
object SparkPi {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkPi")
      .setMaster("spark://192.168.17.129:7077")
      .set("spark.driver.allowMultipleContexts", "true")
    val spark = new SparkContext(conf)

    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow

    val count = spark.parallelize(1 until n, slices).map { i =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)

    println("Pi is roughly " + 4.0 * count / n)
    spark.stop()
  }
}
Note: I made a little change in these lines:
val conf = new SparkConf().setAppName("SparkPi")
  .setMaster("spark://192.168.17.129:7077")
  .set("spark.driver.allowMultipleContexts", "true")
Problem: I'm using spark-shell (Scala interface) to run this code. When I try this code, I receive this error repeatedly:
15/02/09 06:39:23 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
Note: I can see my workers in my Master's WebUI and I can also see a new job in the Running Applications section, but the application never finishes and I keep seeing the error.
What is the problem?
Thanks
If you want to run this from the spark shell, then start the shell with the argument --master spark://192.168.17.129:7077 and enter the following code:
import scala.math.random
import org.apache.spark._

val slices = 10
val n = math.min(100000L * slices, Int.MaxValue).toInt // avoid overflow

val count = sc.parallelize(1 until n, slices).map { i =>
  val x = random * 2 - 1
  val y = random * 2 - 1
  if (x * x + y * y < 1) 1 else 0
}.reduce(_ + _)

println("Pi is roughly " + 4.0 * count / n)
Otherwise, compile the code into a jar and run it with spark-submit. But remove setMaster from the code and pass it as the --master argument to the spark-submit script. Also remove the allowMultipleContexts setting from the code; you only need one SparkContext.
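For the spark-submit route, the invocation would look roughly like this (the jar path and the slices argument are assumptions; the class name comes from the code above):

$SPARK_HOME>./bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://192.168.17.129:7077 ~/yourSparkPi.jar 10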