GraphFrames and BFS - Scala

I'm having some trouble understanding BFS on GraphFrames. I'm trying to get the "father of all" - the one that has no parent in the graph.
See, I have this DataFrame:
val df = sqlContext.createDataFrame(List(
("153030152492012801800", ""),
("153030152492012801845", ""),
("153030152492013801220","153030152492012801845"),
("153030152492013800151","153030152492012801845"),
("153030152492014800546","153030152492012801845"),
("153030152492013800497", "153030152492013800151"),
("153030152492013801860", "153030152492013800151"),
("153030152492014800038", "153030152492013801860"),
("153030152492014801015", "153030152492014800038"),
("153030152492014801235", "153030152492014801015")
)).toDF("id", "parent")
And I build my Graph as:
val df_vertices = df.selectExpr("id", "parent")
val df_edges = df.withColumnRenamed("id", "src").withColumnRenamed("parent", "dst").withColumn("relation", lit("parent"))
val g = GraphFrame(df_vertices, df_edges)
Finally, I run BFS as:
g.bfs.fromExpr("id = 153030152492013800151").toExpr("parent = ''").run().show(false)
But I'm getting this:
+----------------------------------------------+------------------------------------------------------+-------------------------+
|from |e0 |to |
+----------------------------------------------+------------------------------------------------------+-------------------------+
|[153030152492013801220, 153030152492012801845]|[153030152492013801220, 153030152492012801845, parent]|[153030152492012801845, ]|
|[153030152492013800151, 153030152492012801845]|[153030152492013800151, 153030152492012801845, parent]|[153030152492012801845, ]|
+----------------------------------------------+------------------------------------------------------+-------------------------+
The question is: why am I getting the first row, the one that starts with "153030152492013801220"? That vertex is not on the path from "153030152492013800151", so why this result?
Thank you all!

I figured out my error. The problem is that "id" is a string, so I had to quote the value when calling BFS:
g.bfs.fromExpr("id = '153030152492013800151'").toExpr("parent = ''").run().show(false)
I'm posting this answer because GraphFrames doesn't raise any error in this case. Just to warn you about that.
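For context, here is what I believe happened (my assumption, not something GraphFrames reports): with the unquoted literal, the string id column is compared against a number, Spark casts both sides to a floating-point type, and these 21-digit ids are too long to be represented exactly, so several of them compare as equal. A quick way to check, against the df defined above:
// Assumption: the numeric comparison collapses distinct 21-digit ids
df.filter("id = 153030152492013800151").show(false)   // may match more than one row
// The quoted form compares strings and matches exactly one row
df.filter("id = '153030152492013800151'").show(false)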

collect on a dataframe spark

I wrote this:
df.select(col("colname")).distinct().collect.map(_.toString()).toList
The result is:
List("[2019-06-24]", "[2019-06-22]", "[2019-06-23]")
Whereas I want to get:
List("2019-06-24", "2019-06-22", "2019-06-23")
How can I change this, please?
You need to change .map(_.toString()) to .map(_.getAs[String]("colname")). With .map(_.toString()) you are calling org.apache.spark.sql.Row.toString, which is why the output looks like List("[2019-06-24]", "[2019-06-22]", "[2019-06-23]"). The correct way is:
val list = df.select("colname").distinct().collect().map(_.getAs[String]("colname")).toList
Output will be:
List("2019-06-24", "2019-06-22", "2019-06-23")
Sample data:
val df=sc.parallelize(Seq(("2019-06-24"),( "2019-06-22"),("2019-06-23"))).toDF("cn")
Now select the column, then map to get the value at index 0, then add quotes and convert it to a String.
df.select("cn").collect().map(x => x(0)).map(x => s""""$x"""".toString)
//res36: Array[String] = Array("2019-06-24", "2019-06-22", "2019-06-23")
(or)
df.select("cn").collect().map(x => x(0)).map(x => s""""$x"""".toString).toList
//res37: List[String] = List("2019-06-24", "2019-06-22", "2019-06-23")

Mixed Content XML parsing using DataFrame

I have an XML document that has mixed content, and I am using a custom schema with a DataFrame to parse it. The issue is that the schema only picks up the text for "Measure".
The XML looks like this:
<QData>
<Measure> some text here
<Answer>Answer1</Answer>
<Question>Question1</Question>
</Measure>
<Measure> some text here
<Answer>Answer1</Answer>
<Question>Question1</Question>
</Measure>
</QData>
My schema is as follows:
def getCustomSchema(): StructType = StructType(Array(
  StructField("QData",
    StructType(Array(
      StructField("Measure",
        StructType(Array(
          StructField("Answer", StringType, true),
          StructField("Question", StringType, true)
        )), true)
    )), true)
))
When I try to access the data in Measure I am only getting "some text here" and it fails when I try to get info from Answer. I am also just getting one Measure.
EDIT: This is how I am trying to access the data
val result = sc.read.format("com.databricks.spark.xml").option("attributePrefix", "attr_").schema(getCustomSchema)
.load(filename.toString)
val qDfTemp = result.mapPartitions { partition =>
  val mapper = new QDMapper()
  partition.map(row => mapper(row)).flatMap(list => list)
}.toDF()
case class QDMapper(){
def apply(row: Row):List[QData]={
val qDList = new ListBuffer[QData]()
val qualData = row.getAs[Row]("QData") //When I print as list I get the first Measure text and that is it
val measure = qualData.getAs[Row]("Measure") //This fails
}
}
You can use the row tag as a root tag and access the other elements:
df_schema = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='<xml_tag_name>').load(schema_path)
Please visit https://github.com/harshaltaware/Pyspark/blob/main/Spark-data-parsing/xmlparsing.py for a brief code example.
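Building on the rowTag idea, here is a hedged sketch for the mixed-content part of the question: spark-xml exposes an element's own text under its valueTag column ("_VALUE" by default), so declaring that field next to Answer and Question should capture the "some text here" portion. The rowTag value and the spark session name are assumptions on my part.
import org.apache.spark.sql.types._

// Sketch: treat each <Measure> as a row and keep its free text in _VALUE
val measureSchema = StructType(Array(
  StructField("_VALUE", StringType, true),   // the mixed-content text of <Measure>
  StructField("Answer", StringType, true),
  StructField("Question", StringType, true)
))

val measures = spark.read.format("com.databricks.spark.xml")
  .option("rowTag", "Measure")               // assumption: one row per Measure element
  .schema(measureSchema)
  .load(filename.toString)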

Spark collect()/count() never finishes while show() runs fast

I'm running Spark locally on my Mac and there is a weird issue. Basically, I can output any number of rows using the show() method of the DataFrame; however, when I try to use count() or collect(), even on pretty small amounts of data, Spark gets stuck on that stage and never finishes the job. I'm using Gradle for building and running.
When I run
./gradlew clean run
The program gets stuck at
> Building 83% > :run
What could cause this problem?
Here is the code.
val moviesRatingsDF = MongoSpark.load(sc).toDF().select("movieId", "userId","rating")
val movieRatingsDF = moviesRatingsDF
.groupBy("movieId")
.pivot("userId")
.max("rating")
.na.fill(0)
val ratingColumns = movieRatingsDF.columns.drop(1) // drop the name column
val movieRatingsDS:Dataset[MovieRatingsVector] = movieRatingsDF
.select( col("movieId").as("movie_id"), array(ratingColumns.map(x => col(x)): _*).as("ratings") )
.as[MovieRatingsVector]
val moviePairs = movieRatingsDS.withColumnRenamed("ratings", "ratings1")
.withColumnRenamed("movie_id", "movie_id1")
.crossJoin(movieRatingsDS.withColumnRenamed("ratings", "ratings2").withColumnRenamed("movie_id", "movie_id2"))
.filter(col("movie_id1") < col("movie_id2"))
val movieSimilarities = moviePairs.map(row => {
val ratings1 = sc.parallelize(row.getAs[Seq[Double]]("ratings1"))
val ratings2 = sc.parallelize(row.getAs[Seq[Double]]("ratings2"))
val corr:Double = Statistics.corr(ratings1, ratings2)
MovieSimilarity(row.getAs[Long]("movie_id1"), row.getAs[Long]("movie_id2"), corr)
}).cache()
val collectedData = movieSimilarities.collect()
println(collectedData.length)
log.warn("I'm done") //never gets here
close
Spark does lazy evaluation and only materializes an RDD/DataFrame when an action is called.
To answer your question:
1. collect and count are two different actions; if you are not persisting the data, the RDD/DataFrame is re-evaluated for each of them, hence it takes more time than anticipated.
2. show is a single action, and it only brings back a limited number of rows (20 by default), so it finishes quickly.
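To illustrate the point about actions and persistence (a sketch reusing moviesRatingsDF from the question; not part of the original answer):
import org.apache.spark.sql.functions.col

val highRatings = moviesRatingsDF.filter(col("rating") > 3) // lazy: only builds a plan
highRatings.count()          // action #1: evaluates the plan
highRatings.collect()        // action #2: evaluates the plan again unless the data is persisted
highRatings.cache().count()  // after cache(), later actions reuse the materialized result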

How to extract best parameters from a CrossValidatorModel

I want to find the parameters from ParamGridBuilder that produce the best model in CrossValidator, in Spark 1.4.x.
In the Pipeline example in the Spark documentation, they add different parameters (numFeatures, regParam) using ParamGridBuilder in the Pipeline. Then the following line of code produces the best model:
val cvModel = crossval.fit(training.toDF)
Now, I want to know what are the parameters (numFeatures, regParam) from ParamGridBuilder that produces the best model.
I already used the following commands without success:
cvModel.bestModel.extractParamMap().toString()
cvModel.params.toList.mkString("(", ",", ")")
cvModel.estimatorParamMaps.toString()
cvModel.explainParams()
cvModel.getEstimatorParamMaps.mkString("(", ",", ")")
cvModel.toString()
Any help?
Thanks in advance,
One method to get a proper ParamMap object is to use CrossValidatorModel.avgMetrics: Array[Double] to find the argmax ParamMap:
implicit class BestParamMapCrossValidatorModel(cvModel: CrossValidatorModel) {
def bestEstimatorParamMap: ParamMap = {
cvModel.getEstimatorParamMaps
.zip(cvModel.avgMetrics)
.maxBy(_._2)
._1
}
}
When run on the CrossValidatorModel trained in the Pipeline example you cited, this gives:
scala> println(cvModel.bestEstimatorParamMap)
{
hashingTF_2b0b8ccaeeec-numFeatures: 100,
logreg_950a13184247-regParam: 0.1
}
val bestPipelineModel = cvModel.bestModel.asInstanceOf[PipelineModel]
val stages = bestPipelineModel.stages
val hashingStage = stages(1).asInstanceOf[HashingTF]
println("numFeatures = " + hashingStage.getNumFeatures)
val lrStage = stages(2).asInstanceOf[LogisticRegressionModel]
println("regParam = " + lrStage.getRegParam)
source
To print everything in paramMap, you actually don't have to call parent:
cvModel.bestModel.extractParamMap()
To answer OP's question, to get a single best parameter, for example regParam:
cvModel.bestModel.extractParamMap().apply(cvModel.bestModel.getParam("regParam"))
This is how you get the chosen parameters:
println(cvModel.bestModel.getMaxIter)
println(cvModel.bestModel.getRegParam)
This Java code should work:
cvModel.bestModel().parent().extractParamMap()
You can translate it to Scala code. The parent() method returns an estimator, from which you can get the best params.
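A minimal Scala equivalent of that snippet (a sketch; it assumes cvModel is the trained CrossValidatorModel from the question):
// The parent of the best model is the estimator that produced it,
// and its ParamMap holds the winning parameter values
val bestParams = cvModel.bestModel.parent.extractParamMap()
println(bestParams)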
This is the ParamGridBuilder()
paraGrid = ParamGridBuilder().addGrid(
hashingTF.numFeatures, [10, 100, 1000]
).addGrid(
lr.regParam, [0.1, 0.01, 0.001]
).build()
There are 3 stages in the pipeline. It seems we can access the parameters as follows:
for stage in cv_model.bestModel.stages:
print 'stages: {}'.format(stage)
print stage.params
print '\n'
stage: Tokenizer_46ffb9fac5968c6c152b
[Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='inputCol', doc='input column name'), Param(parent='Tokenizer_46ffb9fac5968c6c152b', name='outputCol', doc='output column name')]
stage: HashingTF_40e1af3ba73764848d43
[Param(parent='HashingTF_40e1af3ba73764848d43', name='inputCol', doc='input column name'), Param(parent='HashingTF_40e1af3ba73764848d43', name='numFeatures', doc='number of features'), Param(parent='HashingTF_40e1af3ba73764848d43', name='outputCol', doc='output column name')]
stage: LogisticRegression_451b8c8dbef84ecab7a9
[]
However, there are no parameters in the last stage, LogisticRegression.
We can also get the weights and intercept from the LogisticRegression model like this:
cv_model.bestModel.stages[1].getNumFeatures()
10
cv_model.bestModel.stages[2].intercept
1.5791827733883774
cv_model.bestModel.stages[2].weights
DenseVector([-2.5361, -0.9541, 0.4124, 4.2108, 4.4707, 4.9451, -0.3045, 5.4348, -0.1977, -1.8361])
Full exploration:
http://kuanliang.github.io/2016-06-07-SparkML-pipeline/
I am working with Spark Scala 1.6.x, and here is a full example of how I can set up and fit a CrossValidator and then return the value of the parameter used to get the best model (assuming that training.toDF gives a DataFrame ready to be used):
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
// Instantiate a LogisticRegression object
val lr = new LogisticRegression()
// Instantiate a ParamGrid with different values for the 'RegParam' parameter of the logistic regression
val paramGrid = new ParamGridBuilder().addGrid(lr.regParam, Array(0.0001, 0.001, 0.01, 0.1, 0.25, 0.5, 0.75, 1)).build()
// Setting and fitting the CrossValidator on the training set, using 'MultiClassClassificationEvaluator' as evaluator
val crossVal = new CrossValidator().setEstimator(lr).setEvaluator(new MulticlassClassificationEvaluator).setEstimatorParamMaps(paramGrid)
val cvModel = crossVal.fit(training.toDF)
// Getting the value of the 'RegParam' used to get the best model
val bestModel = cvModel.bestModel // Getting the best model
val paramReference = bestModel.getParam("regParam") // Getting the reference of the parameter you want (only the reference, not the value)
val paramValue = bestModel.get(paramReference) // Getting the value of this parameter
print(paramValue) // In my case : 0.001
You can do the same for any parameter or any other type of model.
If you are using Java, see this debug output:
bestModel.parent().extractParamMap()
Building on the solution of @macfeliga, a one-liner that works for pipelines:
cvModel.bestModel.asInstanceOf[PipelineModel]
.stages.foreach(stage => println(stage.extractParamMap))
This SO thread kinda answers the question.
In a nutshell, you need to cast each object to its supposed-to-be class.
For the case of CrossValidatorModel, the following is what I did:
import org.apache.spark.ml.tuning.CrossValidatorModel
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.RandomForestRegressionModel
// Load CV model from S3
val inputModelPath = "s3://path/to/my/random-forest-regression-cv"
val reloadedCvModel = CrossValidatorModel.load(inputModelPath)
// To get the parameters of the best model
(
reloadedCvModel.bestModel
.asInstanceOf[PipelineModel]
.stages(1)
.asInstanceOf[RandomForestRegressionModel]
.extractParamMap()
)
In the example, my pipeline has two stages (a VectorIndexer and a RandomForestRegressor), so the stage index is 1 for my model.
For me, @orangeHIX's solution is perfect:
val cvModel = cv.fit(training)
val cvMejorModelo = cvModel.bestModel.asInstanceOf[ALSModel]
cvMejorModelo.parent.extractParamMap()
res86: org.apache.spark.ml.param.ParamMap =
{
als_08eb64db650d-alpha: 0.05,
als_08eb64db650d-checkpointInterval: 10,
als_08eb64db650d-coldStartStrategy: drop,
als_08eb64db650d-finalStorageLevel: MEMORY_AND_DISK,
als_08eb64db650d-implicitPrefs: false,
als_08eb64db650d-intermediateStorageLevel: MEMORY_AND_DISK,
als_08eb64db650d-itemCol: product,
als_08eb64db650d-maxIter: 10,
als_08eb64db650d-nonnegative: false,
als_08eb64db650d-numItemBlocks: 10,
als_08eb64db650d-numUserBlocks: 10,
als_08eb64db650d-predictionCol: prediction,
als_08eb64db650d-rank: 1,
als_08eb64db650d-ratingCol: rating,
als_08eb64db650d-regParam: 0.1,
als_08eb64db650d-seed: 1994790107,
als_08eb64db650d-userCol: user
}

String category names with Wisp scala plotting library

I am trying to plot bars for String categories with Wisp. In other words, I have a string (key) and a count (value) in my REPL, and I want a bar chart of the count versus the key.
I don't know if something easy exists. I went as far as the following hack:
val plot = bar(topWords.map(_._2).toList)
val axisType: com.quantifind.charts.highcharts.AxisType.Type = "category"
val newPlot = plot.copy(xAxis = plot.xAxis.map {
axisArray => axisArray.map { _.copy(axisType = Option(axisType),
categories = Option(topWords.map(_._1))) }
})
but I don't know if it works because I can't find a way to visualize newPlot. Or maybe adding a method to the Wisp source implementing the above is the way to go?
Thanks for any help.
PS : I don't have the reputation to create the wisp tag, but I would have...
Update: in wisp 0.0.5 or later this will be supported directly:
column(List(("alpha", 1), ("omega", 5), ("zeta", 8)))
====
Thanks for trying out Wisp (I am the author). I think a problem you may have encountered is that by naming your variable plot, you overrode access to the plot method defined by Highcharts, which allows you to plot any Highcharts object. (Now that I see it, naming your plot "plot" is an unfortunately natural thing to do!)
I'm filing this as a github issue, as category names are very common.
This works for me. I used column for vertical bars:
import com.quantifind.charts.Highcharts._
val topWords = Array(("alpha", 14), ("beta", 23), ("omega", 18))
val numberedColumns = column(topWords.map(_._2).toList)
val axisType: com.quantifind.charts.highcharts.AxisType.Type = "category"
val namedColumns = numberedColumns.copy(xAxis = numberedColumns.xAxis.map {
axisArray => axisArray.map { _.copy(axisType = Option(axisType),
categories = Option(topWords.map(_._1))) }
})
plot(namedColumns)
producing a column chart with the category names along the x axis.
(You could also consider creating one bar at a time, and use the legend to name them)