OrientDB: how to update a field with data from another vertex

I'm working with some documents, trying to implement the TF-IDF method to search for similarities between documents.
At one point, I need to calculate the term frequency (TF).
I have two vertex classes in this relation:
Documento ---> DocWord
The DocWord vertex has the following fields:
int frequence
double tf
double idf
Documento has:
int wordCount
I need to update the tf field of every DocWord with:
frequence / Documento.wordCount
The query I'm trying to run is:
update DocWord set tf = frequence/in("Documento_docwords").wordCount[0];
but this fails.

Try wrapping the right-hand side in parentheses so that it is parsed as a single expression:
update DocWord set tf = (frequence / in("Documento_docwords").wordCount[0])
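For reference, the tf being computed here is the standard raw-count term frequency, using the question's frequence field as the term count and wordCount as the document length:

\mathrm{tf}(t, d) = \frac{f_{t,d}}{\mathrm{wordCount}(d)}, \qquad \text{e.g. } \frac{5}{100} = 0.05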

Related

How to predict the outcome variables using a saved pipeline when the data set does not contain the actual outcome?

I have a data set that contains the following columns: outcome (this is the outcome that we want to predict), and raw (a column that consists of text). I want to develop an ML model that will predict the outcome from the raw column. I have trained an ML model in Databricks using the following pipeline:
from pyspark.ml import Pipeline
from pyspark.ml.classification import NaiveBayes
from pyspark.ml.feature import RegexTokenizer, CountVectorizer, StringIndexer, IndexToString

regexTokenizer = RegexTokenizer(inputCol="raw", outputCol="words", pattern="\\W")
countVec = CountVectorizer(inputCol="words", outputCol="features")
indexer = StringIndexer(inputCol="outcome", outputCol="label").setHandleInvalid("skip").fit(trainDF)
inverter = IndexToString(inputCol="prediction", outputCol="prediction_label", labels=indexer.labels)
nb = NaiveBayes(labelCol="label", featuresCol="features", smoothing=1.0, modelType="multinomial")
pipeline = Pipeline(stages=[regexTokenizer, indexer, countVec, nb, inverter])
model = pipeline.fit(trainDF)
model.write().overwrite().save("/FileStore/project")
In another notebook, I load the model and try to predict the values for a new data set. This data set does not contain the outcome variable ("outcome" in this case):
from pyspark.ml import PipelineModel

model = PipelineModel.load("/FileStore/project")
score_output_df = model.transform(score_this)
When I try to predict the values for the new data set, I get an error message that the column "outcome" cannot be found. I suspect that this is because some stages in the pipeline transform this column (the indexer and inverter stages convert the outcome column to numbers and back to string labels).
My question is this: how can I load a saved model and use it to predict values when the original pipeline contains stages that have this column as an input?
Instead of using
model.write().overwrite().save("/FileStore/project")
write it like this:
model.write().overwrite().save("/FileStore/project/model.sav")
and then load it with:
model = PipelineModel.load("/FileStore/project/model.sav")
score_output_df = model.transform(score_this)
I have found a solution to the problem and will post it here so that anyone who faces the same problem can benefit from it. The solution was simply to extract the stages that I want to use for prediction and assign them back to the model, like so:
model = PipelineModel.load("/FileStore/project")
stages1 = [model.stages[0], model.stages[2], model.stages[3], model.stages[4]]
model.stages = stages1
score_output_df = model.transform(score_this)
In this code, I exclude the second step ([1]) because it contains the indexer. Once I do this, I can predict values when the "outcome" column is not available.
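As a variation on the same idea, here is a minimal sketch (assuming the pipeline layout from the question) that drops the label-dependent stage by type rather than by position, so it keeps working even if the stage order changes:

from pyspark.ml import PipelineModel
from pyspark.ml.feature import StringIndexerModel

model = PipelineModel.load("/FileStore/project")
# keep every stage except the fitted StringIndexerModel,
# which is the only stage that needs the "outcome" column
model.stages = [s for s in model.stages if not isinstance(s, StringIndexerModel)]
score_output_df = model.transform(score_this)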

Pyspark label points aggregation

I am performing a binary classification using LabeledPoint. I then attempt to sum() the number of points labeled 1.0 to verify the classification.
I have labelled an RDD as follows:
lp_RDD = RDD.map(lambda x: LabeledPoint(1 if (flag in x[0]) else 0,x[1]))
I thought perhaps I could get a count of how many have been labelled with 1 using:
cnt = lp_RDD.map(lambda x: x[0]).sum()
But I get the following error :
'LabeledPoint' object does not support indexing
I have verified that the labeled RDD is correct by printing the entire RDD and then searching for the string "LabeledPoint(1.0". I was simply wondering whether there is a shortcut to get the sum.
LabeledPoint has a label member which can be used to find the count or sum. Please try:
cnt = lp_RDD.map(lambda x: x.label).sum()
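For completeness, a minimal self-contained sketch (with made-up data) showing the .label access pattern:

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext("local", "label-sum")
# three toy points: two labeled 1.0, one labeled 0.0
lp_RDD = sc.parallelize([
    LabeledPoint(1.0, [0.5, 1.0]),
    LabeledPoint(0.0, [0.1, 0.2]),
    LabeledPoint(1.0, [0.9, 0.3]),
])
print(lp_RDD.map(lambda p: p.label).sum())  # 2.0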

Can't delete/remove multiple property keys on Vertex Titan 1.0 Tinkerpop 3

Very basic question,
I just upgraded Titan from 0.5.4 to Titan 1.0 Hadoop 1 / TP3 version 3.0.1.
I encounter a problem with deleting the values of a property key with Cardinality.LIST/SET.
Maybe it is due to upgrade process or just my TP3 misunderstanding.
// ----- CODE ------:
tg = TitanFactory.open(c);
TitanManagement mg = tg.openManagement();
// create KEY (Cardinality.LIST) and commit changes
mg.makePropertyKey("myList").dataType(String.class).cardinality(Cardinality.LIST).make();
mg.commit();
//add vertex with multi properties
Vertex v = tg.addVertex();
v.property("myList", "role1");
v.property("myList", "role2");
v.property("myList", "role3");
v.property("myList", "role4");
v.property("myList", "role4");
Now, I want to delete all the values "role1", "role2", ....
// iterate over all values and try to remove the values
List<String> values = IteratorUtils.toList(v.values("myList"));
for (String val : values) {
v.property("myList", val).remove();
}
tg.tx().commit();
//---------------- THE EXPECTED RESULT ----------:
Empty vertex properties
But unfortunately the result isn't empty:
System.out.println("Values After Delete" + IteratorUtils.toList(v.values("myList")));
//------------------- OUTPUT --------------:
After the delete, the values are still there:
15:19:59,780 INFO ThriftKeyspaceImpl:745 - Detected partitioner org.apache.cassandra.dht.Murmur3Partitioner for keyspace titan
15:19:59,784 INFO Values After Delete [role1, role2, role3, role4, role4]
Any ideas?
You're not executing graph traversals with the higher-level Gremlin API; you're currently mutating the graph with the lower-level graph API. Doing for loops in Gremlin is often an antipattern.
According to the TinkerPop 3.0.1 Drop Step documentation, you should be able to do the following from the Gremlin console:
v = g.addV().next()
g.V(v).property("myList", "role1")
g.V(v).property("myList", "role2")
// ...
g.V(v).properties('myList').drop()
property(key, value) will set the value of the property on the vertex (javadoc). What you should do instead is get the VertexProperty objects (javadoc) and remove them:
// Vertex.properties() returns an Iterator of VertexProperty
Iterator<VertexProperty<Object>> it = v.properties("myList");
while (it.hasNext()) {
    it.next().remove();
}
@jbmusso offered a solid solution using the GraphTraversal instead.

Printing the calculated distance in SQLAlchemy

I am using Flask-SQLAlchemy with Postgres, Postgis and GEOAlchemy. I am able to sort entries in a table according to a point submitted by the user. I wonder how I could also return the calculated distance...
This is how I sort the items:
results = Event.query.order_by(func.ST_Distance(Event.address_gps, coordinates_point)).paginate(page, 10).items
for result in results:
result_dict = result.to_dict()
return result_dict
The items are sorted according to the user's position (coordinates_point). I would like to add an entry to each result in result_dict containing the distance that the item was ordered by. How do I do that? What does func.ST_Distance return?
I tried to add this in the for loop above:
current_distance = func.ST_Distance(Event.address_gps, coordinates_point)
result_dict['distance'] = current_distance
But that did not seem to work.
You can use column labels:
query = Event.query.with_entities(Event, func.ST_Distance(Event.address_gps, coordinates_point).label('distance')).order_by('distance')
results = query.paginate(page, 10).items
for result in results:
event = result.Event
distance = result.distance
result_dict = event.to_dict()
result_dict['distance'] = distance
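As for what func.ST_Distance returns: it builds a SQL expression object, not a number; the distance is only computed by the database when the query executes, which is why assigning it inside the Python loop had no effect. A quick way to see this (reusing the names from the question):

from sqlalchemy import func

# a SQL expression clause, evaluated server-side by PostGIS, not a float
expr = func.ST_Distance(Event.address_gps, coordinates_point)
print(type(expr))  # a sqlalchemy Function clause element
print(str(expr))   # renders roughly as: ST_Distance(event.address_gps, :param)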

NDepend - Query to Reduce Matrix

I'm using NDepend to write a query to extract a subset of my assemblies and their dependent assemblies into a Dependency Matrix.
I would like to further reduce the size of the matrix to show only dependent assemblies that have small or medium coupling (the ones that would be relatively easy to decouple). Therefore I only want to show the assemblies that have < 20 method usages.
How do I update this query to show this?
let agentAssemblies = Assemblies.WithNameLike("Agent")
let assembliesUsedByAgents = Assemblies.ExceptThirdParty().UsedByAny(agentAssemblies)
from a in agentAssemblies.Union(assembliesUsedByAgents)
select a
You can refine the query this way:
let agentAssemblies = Assemblies.WithNameLike("Agent")
let assembliesUsedByAgents = Assemblies.ExceptThirdParty().UsedByAny(agentAssemblies)
from a in assembliesUsedByAgents
let methodsUsedFromAgentAssemblies = a.ChildMethods.UsedByAny(agentAssemblies)
where methodsUsedFromAgentAssemblies.Count() < 20
let agentAssembliesMethodsUsingMe = agentAssemblies.ChildMethods().UsingAny(methodsUsedFromAgentAssemblies)
select new {
a,
methodsUsedFromAgentAssemblies,
agentAssembliesMethodsUsingMe
}
From the code query result you can visualise both methodsUsedFromAgentAssemblies and agentAssembliesMethodsUsingMe...
...and by right-clicking the method sets, you can export both sets to the Dependency Matrix to get a clear understanding of which method is calling which method.