The Spark version is 2.3.1.
Spark MLlib provides a BinaryClassificationEvaluator (BinaryClassificationEvaluator.scala) class to evaluate the algorithm, and it can also be used for grid search. But it only supports two metrics:
val metric = $(metricName) match {
  case "areaUnderROC" => metrics.areaUnderROC()
  case "areaUnderPR" => metrics.areaUnderPR()
  // what I want to do:
  case "areaUnderXX" => myCustomMetric()
}
I tried to add more, but BinaryClassificationEvaluator has some members that are private, so I can't just extend it. Here is the code that can't be accessed outside the package:
SchemaUtils.checkColumnTypes(schema, $(rawPredictionCol), Seq(DoubleType, new VectorUDT))
SchemaUtils.checkNumericType(schema, $(labelCol))
This code does some type checking, so if I removed it, that would be a workaround. But it seems unsafe and ugly. So, is there another way to do it? Any help would be appreciated!
You can use MulticlassMetrics. It provides more metrics. As an example using a DataFrame with your labels and predictions:
+-----+----------+
|label|prediction|
+-----+----------+
|  1.0|       0.0|
|  0.0|       0.0|
+-----+----------+
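For reference, a minimal sketch of how such a DataFrame could be built (the values and the labelName variable are assumptions, just to make the snippets below self-contained; it assumes a SparkSession named spark):

import spark.implicits._

// hypothetical label/prediction pairs; labelName is the label column name used below
val labelName = "label"
val df = Seq((1.0, 0.0), (0.0, 0.0), (1.0, 1.0), (0.0, 1.0))
  .toDF(labelName, "prediction")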
Select the prediction and label columns and convert them to an RDD of (Double, Double) pairs:
import org.apache.spark.mllib.evaluation.MulticlassMetrics

val metrics = df.select("prediction", labelName)
  .as[(Double, Double)]
  .rdd

val multim = new MulticlassMetrics(metrics)
val labels = multim.labels
val accuracy = multim.accuracy

println("Summary Statistics")
println(s"Accuracy = $accuracy")

// Precision by label
labels.foreach { l =>
  println(s"Precision($l) = " + multim.precision(l))
}

// Recall by label
labels.foreach { l =>
  println(s"Recall($l) = " + multim.recall(l))
}

// False positive rate by label
labels.foreach { l =>
  println(s"FPR($l) = " + multim.falsePositiveRate(l))
}
There are more metrics; you can see them at https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html
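A few of the additional members on the same MulticlassMetrics instance, as a short sketch reusing multim from above:

// weighted aggregates and the full confusion matrix
println(s"Weighted precision = ${multim.weightedPrecision}")
println(s"Weighted recall = ${multim.weightedRecall}")
println(s"Weighted F1 score = ${multim.weightedFMeasure}")
println("Confusion matrix:")
println(multim.confusionMatrix)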
Depending on what kind of metrics you need, you may want to work with the DataFrame directly. For example, if you want to compute your confusion matrix, you can pivot over the "prediction" column like this:
df.groupBy("label").
pivot("prediction", range)
.count()
.na.fill(0.0)
.orderBy("label)
.show()
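Note that range in the pivot call is not defined in the snippet; a minimal sketch of one way to build it, assuming it should hold the distinct prediction values:

// the concrete values the "prediction" column gets pivoted into
val range = df.select("prediction").distinct()
  .collect()
  .map(_.getDouble(0))
  .sorted
  .toSeq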
Related
I put together some custom classes that I would like to use as native data types in Spark SQL. I see UDTs just got opened up to the public, but they are quite hard to figure out. Is there any way I'd be able to do this?
Example
case class IPv4(ipAddress: String){
// IPv4 converted to a number
val addrL: Long = IPv4ToLong(ipAddress)
}
// Will read in a bunch of random IPs in the form {"ipAddress": "60.80.39.27"}
val IPv4DF: DataFrame = spark.read.json(path)
IPv4DF.createOrReplaceTempView("IPv4")
spark.sql(
"""SELECT *
FROM IPv4
WHERE ipAddress.addrL > 100000"""
)
You can construct a Dataset and filter using the case class addrL attribute:
case class IPv4(ipAddress: String){
// IPv4 converted to a number
val addrL: Long = IPv4ToLong(ipAddress)
}
val ds = Seq("60.80.39.27").toDF("ipAddress").as[IPv4]
ds.filter(_.addrL > 100000).show
+-----------+
| ipAddress|
+-----------+
|60.80.39.27|
+-----------+
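If you also need the numeric form from SQL, as in the query in the question, a hedged sketch (assuming IPv4ToLong is available as a plain Scala function) would be to register it as a UDF and filter on it directly:

// expose the conversion to Spark SQL instead of relying on the case class field
spark.udf.register("ipv4ToLong", (ip: String) => IPv4ToLong(ip))
IPv4DF.createOrReplaceTempView("IPv4")
spark.sql("SELECT * FROM IPv4 WHERE ipv4ToLong(ipAddress) > 100000").show()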
I am currently working with Apache Flink's SVM-Class to predict some text data.
The class provides a predict function that takes a DataSet[Vector] as input and gives me a DataSet[Prediction] as a result. So far so good.
My problem is that I don't have the context of which prediction belongs to which text, and I can't pass the text through the predict() function to have it available afterwards.
Code:
val tweets: DataSet[(SparseVector, String)] =
  source.flatMap(new SelectEnglishTweetWithCreatedAtFlatMapper)
    .map(tweet => featureVectorService.transform(tweet._2))
model.predict(tweets).print
result example:
(SparseVector((462,8.73165920153676), (10844,8.508515650222549), (15656,2.931052542245018)),-1.0)
Is there a way to keep other data next to the prediction so that I have everything together? Without the context, the prediction doesn't help me.
Or maybe there is a way to predict just one vector instead of a DataSet, so that I could call the function inside the map function above.
The SVM predictor expects a subtype of Vector as input. Hence there are two options to solve this problem:
Create a subtype of Vector which carries the tweet text as a tag. It will then be passed through the predictor. This approach has the advantage that no additional operation is needed. However, you need to define new classes and utilities to represent the different vector types with tags:
val env = ExecutionEnvironment.getExecutionEnvironment
val input = env.fromElements("foobar", "barfo", "test")
val vectorizedInput = input.map(word => {
  val value = word.chars().sum().toDouble
  new DenseVectorWithTag(Array(value), word)
})
val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here
val predictionResult: DataSet[(DenseVectorWithTag, Double)] = svm.predict(vectorizedInput)
class DenseVectorWithTag(override val data: Array[Double], tag: String)
  extends DenseVector(data) {
  override def toString: String = "(" + super.toString + ", " + tag + ")"
}
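A possible follow-up to this first option (not part of the original answer): because toString is overridden, printing predictionResult already shows the text next to its score; to access the tag programmatically, the class would have to expose it as a val. A minimal sketch under that assumption:

// assumes the class is changed to:
// class DenseVectorWithTag(override val data: Array[Double], val tag: String) ...
import org.apache.flink.api.scala._

val textWithScore: DataSet[(String, Double)] =
  predictionResult.map { case (vector, score) => (vector.tag, score) }
textWithScore.print()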
Join the prediction DataSet with the input DataSet on the vectorized representation of the tweets. This approach has the advantage that we don't need to introduce new classes. The price we pay for this is an additional join operation which might be expensive:
val input = env.fromElements("foobar", "barfo", "test")
val vectorizedInput = input.map(word => {
val value = word.chars().sum()
(DenseVector(value), word)
})
val svm = SVM().setBlocks(env.getParallelism)
val weights = env.fromElements(DenseVector(1.0))
svm.weightsOption = Option(weights) // skipping the training here
val predictionResult = svm.predict(vectorizedInput.map(a => a._1))
val inputWithPrediction: DataSet[(String, Double)] = vectorizedInput
.join(predictionResult)
.where(0)
.equalTo(0)
.apply((t, p) => (t._2, p._2))
I am using Spark 2.2
I feel like I have something odd going on here. The basic premise is that:
- I have a set of KIE/Drools rules running through a Dataset of profile objects
- I am then trying to show/collect-print the resulting output
- I then cast the output as a tuple to flatMap it later
Code below
implicit val mapEncoder = Encoders.kryo[java.util.HashMap[String, Any]]
implicit val recommendationEncoder = Encoders.kryo[Recommendation]
val mapper = new ObjectMapper()
val kieOuts = uberDs.map(profile => {
val map = mapper.convertValue(profile, classOf[java.util.HashMap[String, Any]])
val profile = Profile(map)
// setup the kie session
val ks = KieServices.Factory.get
val kContainer = ks.getKieClasspathContainer
val kSession = kContainer.newKieSession() //TODO: stateful session, how to do stateless?
// insert profile object into kie session
val kCmds = ks.getCommands
val cmds = new java.util.ArrayList[Command[_]]()
cmds.add(kCmds.newInsert(profile))
cmds.add(kCmds.newFireAllRules("outFired"))
// fire kie rules
val results = kSession.execute(kCmds.newBatchExecution(cmds))
val fired = results.getValue("outFired").toString.toInt
// collect the inserted recommendation objects and create uid string
import scala.collection.JavaConversions._
var gresults = kSession.getObjects
gresults = gresults.drop(1) // drop the inserted profile object which also gets collected
val recommendations = scala.collection.mutable.ListBuffer[Recommendation]()
gresults.toList.foreach(reco => {
val recommendation = reco.asInstanceOf[Recommendation]
recommendations += recommendation
})
kSession.dispose
val uIds = StringBuilder.newBuilder
if(recommendations.size > 0) {
recommendations.foreach(recommendation => {
uIds.append(recommendation.getOfferId + "_" + recommendation.getScore)
uIds.append(";")
})
uIds.deleteCharAt(uIds.size - 1)
}
new ORecommendation(profile.getAttributes().get("cId").toString.toLong, fired, uIds.toString)
})
println("======================Output#1======================")
kieOuts.show(1000, false)
println("======================Output#2======================")
kieOuts.collect.foreach(println)
// separating cId and each uId into individual rows
val kieOutsDs = kieOuts.as[(Long, Int, String)]
println("======================Output#3======================")
kieOutsDs.show(1000, false)
(I have sanitized/shortened the IDs below; they are much bigger but have a similar format)
What I am seeing as outputs:
Output#1 has a set of uIds (as a String) come up:
+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|cId |rulesFired | eligibleUIds |
|842 | 17|123-25_2.0;12345678-48_9.0;28a-ad_5.0;123-56_10.0;123-27_2.0;123-32_3.0;c6d-e5_5.0;123-26_2.0;123-51_10.0;8e8-c1_5.0;123-24_2.0;df8-ad_5.0;123-36_5.0;123-16_2.0;123-34_3.0|
+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Output#2 has mostly a similar set of uIds show up (usually off by 1 element):
ORecommendation(842,17,123-36_5.0;123-24_2.0;8e8-c1_5.0;df8-ad_5.0;28a-ad_5.0;660-73_5.0;123-34_3.0;123-48_9.0;123-16_2.0;123-51_10.0;123-26_2.0;c6d-e5_5.0;123-25_2.0;123-56_10.0;123-32_3.0)
Output#3 is the same as Output#1:
+----+-----------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|842 | 17 |123-32_3.0;8e8-c1_5.0;123-51_10.0;123-48_9.0;28a-ad_5.0;c6d-e5_5.0;123-27_2.0;123-16_2.0;123-24_2.0;123-56_10.0;123-34_3.0;123-36_5.0;123-6_2.0;123-25_2.0;660-73_5.0|
Every time I run it, the difference between Output#1 and Output#2 is 1 element, but it is never the same element (in the above example, Output#1 has 123-27_2.0 but Output#2 has 660-73_5.0).
Should they not be the same? I am still new to Scala/Spark and feel like I am missing something very fundamental
I think I figured this out: adding cache to kieOuts at least got me identical outputs between show and collect.
I will be looking at why KIE gives me a different output for every run of the same input, but that is a different issue.
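For reference, a minimal sketch of the fix described above:

// cache() marks the Dataset for persistence; the first action materializes it,
// so show() and collect() both read the same computed rows instead of re-running the map
val cachedKieOuts = kieOuts.cache()
cachedKieOuts.show(1000, false)
cachedKieOuts.collect.foreach(println)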
I have records like the ones below. I would like to convert a single record into two records, with the values EXTERNAL and INTERNAL, if the 3rd attribute is All.
Input dataset:
Surender,cts,INTERNAL
Raja,cts,EXTERNAL
Ajay,tcs,All
Expected output:
Surender,cts,INTERNAL
Raja,cts,EXTERNAL
Ajay,tcs,INTERNAL
Ajay,tcs,EXTERNAL
My Spark code:
case class Customer(name: String, organisation: String, campaign_type: String)

val custRDD = sc.textFile("/user/cloudera/input_files/customer.txt")

val mapRDD = custRDD.map(record => record.split(","))
  .map(arr => (arr(0), arr(1), arr(2)))
  .map(tuple => {
    val name = tuple._1.trim
    val organisation = tuple._2.trim
    val campaign_type = tuple._3.trim.toUpperCase
    Customer(name, organisation, campaign_type)
  })
mapRDD.toDF().registerTempTable("customer_processed")
sqlContext.sql("SELECT * FROM customer_processed").show
Could someone help me fix this issue?
Since it's Scala...
If you want to write more idiomatic Scala code (perhaps trading away some performance due to the lack of optimizations), you can use the flatMap operator (signature shown with the implicit parameter removed):
flatMap[U](func: (T) ⇒ TraversableOnce[U]): Dataset[U]
Returns a new Dataset by first applying a function to all elements of this Dataset, and then flattening the results.
NOTE: flatMap is equivalent to the explode function, but you don't have to register a UDF (as in the other answer).
A solution could be as follows:
// I don't care about the names of the columns since we use Scala
// as you did when you tried to write the code
scala> input.show
+--------+---+--------+
| _c0|_c1| _c2|
+--------+---+--------+
|Surender|cts|INTERNAL|
| Raja|cts|EXTERNAL|
| Ajay|tcs| All|
+--------+---+--------+
val result = input.
  as[(String, String, String)].
  flatMap { case r @ (name, org, campaign) =>
    if ("all".equalsIgnoreCase(campaign)) {
      Seq("INTERNAL", "EXTERNAL").map { cname =>
        (name, org, cname)
      }
    } else Seq(r)
  }
scala> result.show
+--------+---+--------+
| _1| _2| _3|
+--------+---+--------+
|Surender|cts|INTERNAL|
| Raja|cts|EXTERNAL|
| Ajay|tcs|INTERNAL|
| Ajay|tcs|EXTERNAL|
+--------+---+--------+
Comparing the performance of the two queries, i.e. the flatMap-based vs. the explode-based one, I think the explode-based query may be slightly faster and better optimized, as some of the code is under Spark's control (using logical operators before they get mapped to their physical counterparts). With flatMap, the entire optimization is your responsibility as a Scala developer.
In the web UI's query-plan diagram, the flatMap-based code is the red-bounded area, and the warning signs mark the very expensive DeserializeToObject and SerializeFromObject operators.
What's interesting is the number of Spark jobs per query and their durations: the explode-based query takes 2 Spark jobs and 200 ms, while the flatMap-based one takes only 1 Spark job and 43 ms.
That surprises me a lot and suggests that the flatMap-based query could actually be faster (!)
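One way to inspect this yourself is to compare the physical plans; a small sketch, assuming the flatMap-based result above and the explode-based DataFrame from the other answer (step2 below) are both in scope:

// extended explain prints the analyzed, optimized and physical plans
result.explain(true) // flatMap-based: DeserializeToObject / SerializeFromObject show up here
step2.explain(true)  // explode-based: no object (de)serialization operators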
You can use a UDF to transform the campaign_type column into a Seq of campaign types, and then explode it:
val campaignType_ : (String => Seq[String]) = {
case s if s == "ALL" => Seq("EXTERNAL", "INTERNAL")
case s => Seq(s)
}
val campaignType = udf(campaignType_)
val df = Seq(("Surender", "cts", "INTERNAL"),
("Raja", "cts", "EXTERNAL"),
("Ajay", "tcs", "ALL"))
.toDF("name", "organisation", "campaign_type")
val step1 = df.withColumn("campaign_type", campaignType($"campaign_type"))
step1.show
// +--------+------------+--------------------+
// | name|organisation| campaign_type|
// +--------+------------+--------------------+
// |Surender| cts| [INTERNAL]|
// | Raja| cts| [EXTERNAL]|
// | Ajay| tcs|[EXTERNAL, INTERNAL]|
// +--------+------------+--------------------+
val step2 = step1.select($"name", $"organisation", explode($"campaign_type"))
step2.show
// +--------+------------+--------+
// | name|organisation| col|
// +--------+------------+--------+
// |Surender| cts|INTERNAL|
// | Raja| cts|EXTERNAL|
// | Ajay| tcs|EXTERNAL|
// | Ajay| tcs|INTERNAL|
// +--------+------------+--------+
EDIT:
You don't actually need a UDF; you can use a when().otherwise expression on step1 instead, as follows:
val step1 = df.withColumn("campaign_type",
  when(col("campaign_type") === "ALL", array("EXTERNAL", "INTERNAL"))
    .otherwise(array(col("campaign_type"))))
I have a file and I want to feed it to an MLlib algorithm. So I am following the example and doing something like:
val data = sc.textFile(my_file)
  .map { line =>
    val parts = line.split(",")
    Vectors.dense(parts.slice(1, parts.length).map(x => x.toDouble).toArray)
  }
and this works, except that sometimes I have a missing feature. That is, sometimes one column of some row does not have any data, and I want to throw away rows like that.
So I want to do something like this: map { line => if (containsMissing(line)) skipLine else { ... // same as before } }
How can I do this skipLine action?
You can use the filter function to filter out such lines:
val data = sc.textFile(my_file)
  .filter(_.split(",").length == cols)
  .map { line =>
    // your code
  }
Assuming variable cols holds number of columns in a valid row.
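If the column count isn't known up front, a small sketch that derives it, assuming the first line of the file is a complete row:

// take the width of the first line as the expected number of columns
val cols = sc.textFile(my_file).first().split(",").length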
You can use flatMap, Some and None for this:
def missingFeatures(parts: Array[String]): Boolean = ??? // determine if a feature is missing

val data = sc.textFile(my_file)
  .flatMap { line =>
    val parts = line.split(",")
    if (missingFeatures(parts)) None
    else Some(Vectors.dense(parts.slice(1, parts.length).map(x => x.toDouble).toArray))
  }
This way you avoid mapping over the RDD more than once.
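For completeness, one possible (assumed) implementation of missingFeatures that treats empty or whitespace-only fields as missing:

// a row is considered to have missing features if any field is blank after trimming
def missingFeatures(parts: Array[String]): Boolean =
  parts.exists(_.trim.isEmpty)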
Java code to skip empty lines / header from Spark RDD:
First the imports:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
Now, the filter keeps rows that have all 17 columns and are not the header row (which starts with VendorID):
Function<String, Boolean> isValid = row -> (row.split(",").length == 17 && !(row.startsWith("VendorID")));

JavaRDD<String> taxis = sc.textFile("datasets/trip_yellow_taxi.data")
    .filter(isValid);
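For comparison with the Scala snippets above, a rough Scala equivalent of the same filter (same assumptions: 17 columns, header row starts with VendorID):

// keep well-formed rows and drop the header
val taxis = sc.textFile("datasets/trip_yellow_taxi.data")
  .filter(row => row.split(",").length == 17 && !row.startsWith("VendorID"))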