FlinkML: Joining DataSets of LabeledVector does not work - scala

I am currently trying to join two DataSets (part of the Flink 0.10-SNAPSHOT API). Both DataSets have the same form:
predictions:
6.932018685453303E155 DenseVector(0.0, 1.4, 1437.0)
org:
2.0 DenseVector(0.0, 1.4, 1437.0)
general form:
LabeledVector(Double, DenseVector(Double,Double,Double))
What I want to create is a new DataSet[(Double,Double)] containing only the labels of the two DataSets i.e.:
join:
6.932018685453303E155 2.0
Therefore I tried the following command:
val join = org.join(predictions).where(0).equalTo(0){
(l, r) => (l.label, r.label)
}
But as a result 'join' is empty. Am I missing something?

You are joining on the label field (index 0) of the LabeledVector type, i.e., building all pairs of elements with matching labels. Your example indicates that you want to join on the vector field instead.
However, joining on the vector field, for example by calling:
org.join(predictions).where("vector").equalTo("vector"){
(l, r) => (l.label, r.label)
}
will not work, because DenseVector, the type of the vector field, is not recognized as a key type by Flink (just like all kinds of arrays).
Till describes how to compare prediction and label values in a comment below.
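If the goal is simply to pair up the labels of identical vectors, one possible workaround (a rough sketch only, not tested against the 0.10-SNAPSHOT API, and assuming the org and predictions DataSets from the question) is to first map each LabeledVector to a join-friendly key, for example the vector's string representation, and then join on that key:
import org.apache.flink.api.scala._

// Sketch: derive a joinable key from the vector (its string representation here)
// and join on that key instead of on the DenseVector itself.
val orgKeyed = org.map(lv => (lv.vector.toString, lv.label))
val predKeyed = predictions.map(lv => (lv.vector.toString, lv.label))

val labelPairs: DataSet[(Double, Double)] =
  predKeyed.join(orgKeyed).where(0).equalTo(0) {
    (l, r) => (l._2, r._2) // (prediction label, original label)
  }
Keying on the string representation is fragile because it relies on identical formatting of the Doubles, but it illustrates the idea of joining on a derived key.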

Related

Pyspark label points aggregation

I am performing a binary classification using LabeledPoint. I then attempt to sum() the number of points labeled with 1.0 to verify the classification.
I have labelled an RDD as follows
lp_RDD = RDD.map(lambda x: LabeledPoint(1 if (flag in x[0]) else 0,x[1]))
I thought perhaps I could get a count of how many have been labelled with 1 using:
cnt = lp_RDD.map(lambda x: x[0]).sum()
But I get the following error :
'LabeledPoint' object does not support indexing
I have verified the labeled RDD as correct by printing the entire RDD and then doing a search for the string "LabeledPoint(1.0". I was simply wondering if there is a shortcut by doing a sum.
LabeledPoint has a label member which can be used to find the count or sum. Please try:
cnt = lp_RDD.map(lambda x: x.label).sum()

How does Scala Slick determine which rows to update in this query

I was asked how Scala Slick determines which rows need to be updated given this code:
def updateFromLegacy(criteria: CertificateGenerationState, fieldA: CertificateGenerationState, fieldB: Option[CertificateNotification]) = {
val a: Query[CertificateStatuses, CertificateStatus, Seq] = CertificateStatuses.table.filter(status => status.certificateState === criteria)
val b: Query[(Column[CertificateGenerationState], Column[Option[CertificateNotification]]), (CertificateGenerationState, Option[CertificateNotification]), Seq] = a.map(statusToUpdate => (statusToUpdate.certificateState, statusToUpdate.notification))
val c: (CertificateGenerationState, Option[CertificateNotification]) = (fieldA, fieldB)
b.update(c)
}
The above code is (as I see it):
a) looking for all rows that have "criteria" for "certificateState"
b) a query for said columns is created
c) a tuple with the values I want to update to is created
Then the query is used to find the rows to which the tuple needs to be applied.
Background
I wonder where Slick keeps track of the IDs of the rows to update.
What I would like to find out:
What is happening behind the covers?
What is Seq in "val a: Query[CertificateStatuses, CertificateStatus, Seq]"
Can someone maybe point out the slick source where the moving parts are located?
OK - I reformatted your code a little bit to make it easier to read here and divided it into chunks. Let's go through it one by one:
val a: Query[CertificateStatuses, CertificateStatus, Seq] =
CertificateStatuses.table
.filter(status => status.certificateState === criteria)
Above is a query that translates roughly to something along these lines:
SELECT * -- Slick would list all your columns here but it's essentially the same thing
FROM certificate_statuses
WHERE certificate_state = $criteria
Below, this query is mapped; that is, a SQL projection is applied to it:
val b: Query[
  (Column[CertificateGenerationState], Column[Option[CertificateNotification]]),
  (CertificateGenerationState, Option[CertificateNotification]),
  Seq] = a.map(statusToUpdate =>
    (statusToUpdate.certificateState, statusToUpdate.notification))
So instead of * you will have this:
SELECT certificate_state, notification
FROM certificate_statuses
WHERE certificate_state = $criteria
And last part is reusing this constructed query to perform update:
val c: (CertificateGenerationState, Option[CertificateNotification]) =
(fieldA, fieldB)
b.update(c)
Translates to:
UPDATE certificate_statuses
SET certificate_state = $fieldA, notification = $fieldB
WHERE certificate_state = $criteria
I understand that the last step may be a little bit less straightforward than the others, but that's essentially how you do updates with Slick (here - although it's in the monadic version).
As for your questions:
What is happening behind the covers?
This is actually outside of my area of expertise. That being said, it's a relatively straightforward piece of code, and I guess the update transformation may be of some interest. I provided a link to the relevant piece of the Slick sources at the end of this answer.
What is Seq in "val a:Query[CertificateStatuses, CertificateStatus, Seq]"
It's the collection type. Query specifies 3 type parameters:
mixed type - the Slick representation of the table (or of a column - Rep)
unpacked type - the type you get after executing the query
collection type - the collection type in which the above unpacked values are placed for you as the result of the query.
So, to take your example:
CertificateStatuses - this is your Slick table definition
CertificateStatus - this is your case class
Seq - this is how your results will be retrieved (it will basically be a Seq[CertificateStatus])
I have it explained here: http://slides.com/pdolega/slick-101#/47 (and 3 next slides or so)
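To make those three type parameters concrete, here is a minimal, hypothetical sketch in Slick 3 syntax (the snippet in the question appears to use the older Column-based API, and the enum-like types are simplified to String here):
import slick.jdbc.H2Profile.api._

// Hypothetical, simplified model: the enum-like types are replaced by plain Strings
case class CertificateStatus(id: Long, certificateState: String, notification: Option[String])

// Mixed type: the Slick (Rep-based) representation of the table
class CertificateStatuses(tag: Tag) extends Table[CertificateStatus](tag, "certificate_statuses") {
  def id               = column[Long]("id", O.PrimaryKey, O.AutoInc)
  def certificateState = column[String]("certificate_state")
  def notification     = column[Option[String]]("notification")
  def * = (id, certificateState, notification) <> (CertificateStatus.tupled, CertificateStatus.unapply)
}

val table = TableQuery[CertificateStatuses]

// Mixed type = CertificateStatuses, unpacked type = CertificateStatus,
// collection type = Seq: running this query yields a Seq[CertificateStatus]
val a: Query[CertificateStatuses, CertificateStatus, Seq] =
  table.filter(_.certificateState === "SOME_CRITERIA")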
Can someone maybe point out the slick source where the moving parts are located?
I think this part may be of interest - it shows how query is converted in update statement: https://github.com/slick/slick/blob/51e14f2756ed29b8c92a24b0ae24f2acd0b85c6f/slick/src/main/scala/slick/jdbc/JdbcActionComponent.scala#L320
It may also be worth emphasizing this:
I wonder were slick keeps track of the Ids of the rows to update.
It doesn't. Look at the generated SQL. You can see it by adding the following configuration to your logging (but you also have it in this answer):
<logger name="slick.jdbc.JdbcBackend.statement" level="DEBUG" />
(I assumed logback above).

Is it possible to return a map of key values using gremlin scala

Currently I have two Gremlin queries which fetch two different values, and I am populating them into a map.
Scenario: A->B, A->C, A->D
My queries are below:
graph.V().has(ID,A).out().label().toList()
Fetch the list of outE labels of A.
Result : List(B,C,D)
graph.traversal().V().has("ID",A).outE("interference").as("x").otherV().has("ID",B).select("x").values("value").headOption()
Given A and B, get the edge property value (A->B)
Return: 10
Is it possible to combine both these queries to get a return like Map[(B,10)(C,11)(D,12)]?
I am facing some performance issues when I have two queries; it's taking more time.
There is probably a better way to do this but I managed to get something with the following traversal:
gremlin> graph.traversal().V().has("ID","A").outE("interference").as("x").otherV().has("ID").label().as("y").select("x").by("value").as("z").select("y", "z").select(values);
==>[B,1]
==>[C,2]
I would wait for more answers though as I suspect there is a better traversal out there.
Below is working in Scala:
val b = StepLabel[Edge]()
val y = StepLabel[Label]()
val z = StepLabel[Integer]()
graph.traversal().V().has("ID", A).outE("interference").as(b)
  .otherV().label().as(y)
  .select(b).values("name").as(z)
  .select((y, z)).toMap[String, Integer]
This will return Map[String,Int]

Sum of elements based on grouping an element using scala?

I have the following Scala list:
List((192.168.1.1,8590298237), (192.168.1.1,8590122837), (192.168.1.1,4016236988),
(192.168.1.1,1018539117), (192.168.1.1,2733649135), (192.168.1.2,16755417009),
(192.168.1.1,3315423529), (192.168.1.2,1523080027), (192.168.1.1,1982762904),
(192.168.1.2,6148851261), (192.168.1.1,1070935897), (192.168.1.2,276531515092),
(192.168.1.1,17180030107), (192.168.1.1,8352532280), (192.168.1.3,8590120563),
(192.168.1.3,24651063), (192.168.1.3,4431959144), (192.168.1.3,8232349877),
(192.168.1.2,17493253102), (192.168.1.2,4073818556), (192.168.1.2,42951186251))
I want the following output:
List((192.168.1.1, sum of all values of 192.168.1.1),
(192.168.1.2, sum of all values of 192.168.1.2),
(192.168.1.3, sum of all values of 192.168.1.3))
How do I get the sum of the second elements from the list by grouping on the first element using Scala?
Here you can use the groupBy function in Scala. Your input data does, however, have some issues: the IP numbers (or whatever they are) must be Strings and the numbers must be Longs. Here is an example of the groupBy function:
val data = ??? // Your list
val sumList = data.groupBy(_._1).map(x => (x._1, x._2.map(_._2).sum)).toList
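For example, with a (hypothetical) properly typed subset of the data it produces the per-IP sums (groupBy gives no ordering guarantee, so the result order may differ):
// Hypothetical, properly typed sample of the input
val data: List[(String, Long)] = List(
  ("192.168.1.1", 8590298237L), ("192.168.1.1", 8590122837L),
  ("192.168.1.2", 16755417009L), ("192.168.1.2", 1523080027L),
  ("192.168.1.3", 8590120563L), ("192.168.1.3", 24651063L))

val sumList = data.groupBy(_._1).map(x => (x._1, x._2.map(_._2).sum)).toList
// => e.g. List((192.168.1.1,17180421074), (192.168.1.2,18278497036), (192.168.1.3,8614771626))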
If the answer is correct, accept it, or comment and I'll explain some more.

Calculate sums of even/odd pairs on Hadoop?

I want to create a parallel scanLeft (which computes prefix sums for an associative operator) function for Hadoop (Scalding in particular; see below for how this is done).
Given a sequence of numbers in an HDFS file (one per line), I want to calculate a new sequence with the sums of consecutive even/odd pairs. For example:
input sequence:
0,1,2,3,4,5,6,7,8,9,10
output sequence:
0+1, 2+3, 4+5, 6+7, 8+9, 10
i.e.
1,5,9,13,17,10
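For illustration, outside of Hadoop the desired pairwise sums are easy to express in plain Scala (grouped(2) pairs up the elements and leaves the trailing 10 as a singleton group):
// Local, non-Hadoop illustration of the desired output
val input = List(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
val pairSums = input.grouped(2).map(_.sum).toList
// => List(1, 5, 9, 13, 17, 10)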
I think that in order to do this I need to write InputFormat and InputSplit classes for Hadoop, but I don't know how to do this.
See section 3.3 here. Below is an example algorithm in Scala:
// for simplicity assume input length is a power of 2
def scanadd(input: IndexedSeq[Int]): IndexedSeq[Int] =
  if (input.length == 1)
    input
  else {
    // calculate a new collapsed sequence which is the sum of sequential even/odd pairs
    val collapsed = IndexedSeq.tabulate(input.length / 2)(i => input(2 * i) + input(2 * i + 1))
    // recursively scan the collapsed values
    val scancollapse = scanadd(collapsed)
    // now we can use the scan of the collapsed seq to calculate the full sequence
    val output = IndexedSeq.tabulate(input.length) { i =>
      // an odd index sits at the end of a pair, so we can read the value straight
      // from the collapsed scan; an even index (other than 0) looks just before it
      // and adds the value at the current index
      if (i % 2 == 1) scancollapse(i / 2)
      else if (i == 0) input(0)
      else scancollapse(i / 2 - 1) + input(i)
    }
    output
  }
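As a quick sanity check of the sketch above, with a power-of-two input length:
// inclusive prefix sums of IndexedSeq(1, 2, 3, 4)
scanadd(IndexedSeq(1, 2, 3, 4)) // => Vector(1, 3, 6, 10)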
I understand that this might need a fair bit of optimization for it to work nicely with Hadoop. Translating it directly would, I think, lead to pretty inefficient Hadoop code. For example, obviously in Hadoop you can't use an IndexedSeq. I would appreciate hearing about any specific problems you see. I think it can probably be made to work well, though.
Superfluous. You meant this code?
val vv = (0 to 1000000).grouped(2).toVector
vv.par.foldLeft((0L, 0L, false))((a, v) =>
if (a._3) (a._1, a._2 + v.sum, !a._3) else (a._1 + v.sum, a._2, !a._3))
This was the best tutorial I found for writing an InputFormat and RecordReader. I ended up reading the whole split as one ArrayWritable record.