Titan DB ignoring index

I have a graph with a couple of indices. They're two composite indices with label constraints (both are exactly the same, just on different properties/labels).
One definitely seems to work but the other doesn't. I've done the following profile() to double-check.
One is called KeyOnNode: property uid and label node:
gremlin> g.V().hasLabel("node").has("uid", "xxxxxxxx").profile().cap(...)
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
TitanGraphStep([~label.eq(node), uid.eq(dammit_... 1 1 2.565 96.84
optimization 1.383
backend-query 1 0.231
SideEffectCapStep([~metrics]) 1 1 0.083 3.16
>TOTAL - - 2.648 -
The above is perfectly acceptable and works well. I'm assuming the magic line is backend-query.
The other is called NameOnSuperNode: property name and label supernode:
gremlin> g.V().hasLabel("supernode").has("name", "xxxxxxxx").profile().cap(...)
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
TitanGraphStep([~label.eq(supernode), name.eq(n... 1 1 5763.163 100.00
optimization 2.261
scan 0.000
SideEffectCapStep([~metrics]) 1 1 0.073 0.00
>TOTAL - - 5763.236 -
Here the query takes an outrageous amount of time and we have a scan line. I originally wondered whether the index wasn't committed through the management system, but alas the following seems to work just fine:
gremlin> m = graphT.openManagement();
==>com.thinkaurelius.titan.graphdb.database.management.ManagementSystem#73c1c105
gremlin> index = m.getGraphIndex("NameOnSuperNode")
==>NameOnSuperNode
gremlin> index.getFieldKeys()
==>name
gremlin> import static com.thinkaurelius.titan.graphdb.types.TypeDefinitionCategory.*
==>null
gremlin> sv = m.getSchemaVertex(index)
==>NameOnSuperNode
gremlin> rel = sv.getRelated(INDEX_SCHEMA_CONSTRAINT, Direction.OUT)
==>com.thinkaurelius.titan.graphdb.types.SchemaSource$Entry#26b2b8e2
gremlin> sse = rel.iterator().next()
==>com.thinkaurelius.titan.graphdb.types.SchemaSource$Entry#2d39a135
gremlin> sse.getSchemaType()
==>supernode
I can't just reset the db at this point. Any help pinpointing what the issue could be would be amazing; I'm hitting a wall here.
Is this a sign that I need to reindex?
INFO: Titan DB 1.1 (TP 3.1.1)
Cheers
UPDATE: I've found that the index in question is not in the REGISTERED state:
gremlin> :> m = graphT.openManagement(); index = m.getGraphIndex("NameOnSuperNode"); pkey = index.getFieldKeys()[0]; index.getIndexStatus(pkey)
==>INSTALLED
How do I get it to register? I've tried m.updateIndex(index, SchemaAction.REGISTER_INDEX).get(); m.commit(); graphT.tx().commit(); but it doesn't seem to do anything
UPDATE 2: I've tried registering the index in order to reindex, with the following:
gremlin> m = graphT.openManagement();
index = m.getGraphIndex("NameOnSuperNode") ;
import static com.thinkaurelius.titan.graphdb.types.TypeDefinitionCategory.*;
import com.thinkaurelius.titan.graphdb.database.management.ManagementSystem;
m.updateIndex(index, SchemaAction.REGISTER_INDEX).get();
ManagementSystem.awaitGraphIndexStatus(graphT, "NameOnSuperNode").status(SchemaStatus.REGISTERED).timeout(20, java.time.temporal.ChronoUnit.MINUTES).call();
m.commit();
graphT.tx().commit()
But this isn't working. I still have my index in the INSTALLED status and I'm still getting a timeout. I've checked that there were no open transactions. Anyone have an idea? FYI the graph is running on a single server and has ~100K vertices and ~130k edges.

So there are a few things that can be happening here:
If both of those indices you describe were not created in the same transaction (and the problem index in question was created after the name propertyKey was already defined) then you should issue a reindex, as per the Titan docs:
The name of a graph index must be unique. Graph indexes built against
newly defined property keys, i.e. property keys that are defined in
the same management transaction as the index, are immediately
available. Graph indexes built against property keys that are already
in use require the execution of a reindex procedure to ensure that the
index contains all previously added elements. Until the reindex
procedure has completed, the index will not be available. It is
encouraged to define graph indexes in the same transaction as the
initial schema.
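For reference, a minimal sketch of that register-then-reindex cycle for the NameOnSuperNode index, assuming the graph is bound to graphT as in your console session (the await calls must run with no management or user transaction left open):
import com.thinkaurelius.titan.core.schema.SchemaAction
import com.thinkaurelius.titan.core.schema.SchemaStatus
import com.thinkaurelius.titan.graphdb.database.management.ManagementSystem

// ask every instance to register the index, then commit and wait
m = graphT.openManagement()
m.updateIndex(m.getGraphIndex("NameOnSuperNode"), SchemaAction.REGISTER_INDEX).get()
m.commit()
graphT.tx().commit()
ManagementSystem.awaitGraphIndexStatus(graphT, "NameOnSuperNode").status(SchemaStatus.REGISTERED).call()

// once REGISTERED, reindex the existing data and wait for the index to become ENABLED
m = graphT.openManagement()
m.updateIndex(m.getGraphIndex("NameOnSuperNode"), SchemaAction.REINDEX).get()
m.commit()
ManagementSystem.awaitGraphIndexStatus(graphT, "NameOnSuperNode").status(SchemaStatus.ENABLED).call()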
The process of moving the index from INSTALLED to REGISTERED may be timing out, in which case you want to use mgmt.awaitGraphIndexStatus(). You can even specify the amount of time you are willing to wait here.
Make sure there are no open transactions on your graph or the index status will indeed not change, as described here.
This is clearly not the case for you, but there is a bug in Titan (fixed in JanusGraph via this PR) such that if you create an index against a newly created propertyKey as well as a previously used propertyKey, the index will get stuck in the REGISTERED state.
Indexes will not move to REGISTERED unless every Titan/JanusGraph node in the cluster acknowledges the index creation. If your indexes are getting stuck in the INSTALLED state, there is a chance that the other nodes in the system are not acknowledging the index's existence. This can be due to issues with another server in the cluster, backfill in the messaging queue Titan/JanusGraph instances use to talk to each other, or, most unexpectedly, the existence of phantom instances. These can occur whenever your server is killed through a non-normal JVM shutdown process, i.e. a kill -9 of the server because it is stuck in stop-the-world garbage collection. If you suspect backfill is the problem, the comments in this class offer good insight into customizable configuration options that may help fix the problem. To check for the existence of phantom nodes, use this function, and then this function to kill the phantom instances.
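A hedged sketch of that last check from the Gremlin console (the instance id below is a placeholder; getOpenInstances() and forceCloseInstance() are the management calls meant by "this function" above):
m = graphT.openManagement()
// lists every instance the cluster believes is alive; the current one is typically suffixed with "(current)"
m.getOpenInstances()
// force-close any id belonging to a server that no longer exists, then commit
m.forceCloseInstance("<stale-instance-id>")
m.commit()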

I think you are missing some configuration for your graph.
If your backend is Cassandra, you must also configure an Elasticsearch index backend.
If your backend is HBase, you must configure caching.
Read more in the link below:
https://docs.janusgraph.org/0.2.0/configuration.html
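For illustration only, a minimal properties sketch of that Cassandra + Elasticsearch pairing (hostnames are placeholders, not taken from this setup; note that an index backend is only required for mixed indexes, while composite indexes like the ones above are served by the storage backend itself):
# storage backend (assumed Cassandra on localhost)
storage.backend=cassandrathrift
storage.hostname=127.0.0.1
# external index backend (assumed Elasticsearch on localhost)
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1
index.search.elasticsearch.client-only=true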

Related

Race condition in amplify datastore

When updating an object, how can I handle a race condition?
final object = (await Amplify.DataStore.query(Object.classType, where: Object.ID.eq('aa'))).first;
await Amplify.DataStore.save(object.copyWith(count: object.count + 1));
user A: executes the first statement
user B: executes the first statement
user A: executes the second statement
user B: executes the second statement
=> count is only incremented by 1 (one update is lost)
Apparently the way to resolve this is to either:
1 - Use conflict resolution, available from DataStore 0.5.0.
One of your users (whichever is slowest) gets sent back the rejected version plus the latest version from the server; you get both objects back to resolve the discrepancies locally and retry the update.
2 - Use a custom resolver
here..
and check ADD expressions
You save versions locally and your VTL is configured to provide additive values to the pipeline instead of set values.
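As a rough illustration of the ADD-expression idea (a hypothetical AppSync DynamoDB request mapping template, not the one behind the link above; the id argument and count field are assumed):
{
  "version": "2017-02-28",
  "operation": "UpdateItem",
  "key": {
    "id": $util.dynamodb.toDynamoDBJson($ctx.args.input.id)
  },
  "update": {
    ## ADD is additive, so concurrent increments stack instead of the last writer overwriting the count
    "expression": "ADD #count :inc",
    "expressionNames": { "#count": "count" },
    "expressionValues": { ":inc": { "N": 1 } }
  }
}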
This nice article might also help in understanding that.
Neither really worked for me; one of my devices could be offline for days at a time, and I would need multiple updates to objects to be performed in order, not just the last current version of the local object.
What really confuses me is that there is no immediate way to just increment values and keep all the incremented objects' updates in the outbox, instead of just the latest object, then apply them in order when a connection is made.
I basically wrote to a separate table to do just that to solve my problem, but of course with more tables and rows come more reads and writes, and therefore more expense.
Have a look at my attempts here; if you want the full code, let me know.
And then, I guess, hope for an update to Amplify that includes increment logic to update values atomically out of the box, to avoid these common race conditions.
Here is some more context

GCP datastore sudden extreme data inconsistency (NDB 1.8.0)

I have a six-month-old Python 3.8 standard GAE project in the europe-west3 region, along with Firestore in Datastore mode.
With or without Redis as a global cache, I have never had any inconsistency issues. An immediate fetch after a put (insert) - the redirect took about 1 second - yielded fresh results, up until last week. I have done some benchmarking and it now takes around 30 seconds for a put to show up in a global query. It actually behaves much like the Datastore emulator with the consistency parameter set to 0.05.
I have read a lot about Datastore and its eventual consistency here, but as the document says, this is true for the "old" version. The new Firestore in Datastore mode should ensure strong consistency, as per this part:
Eventual consistency, all Datastore queries become strongly consistent.
Am I interpreting this claim wrong?
I have also created a fresh project (same region) with only the essential NDB initialization, and I still see the extreme "lag".
I'm running out of ideas as to what could cause this new behavior. Could it be that the Warsaw datacenter just started and this is causing the issues?
Abstract code with google-cloud-ndb==1.8.0
import time
from google.cloud import ndb

class X(ndb.Model):
    foo = ndb.StringProperty()

# explicit context so the snippet runs standalone (on GAE the middleware provides it)
client = ndb.Client()
with client.context():
    x = X(foo="a")
    x.put()
    time.sleep(5)
    for y in X.query():  # returns 0 results
        print(y)
If I get the entity by its key, it's there and fresh. It even shows up instantly in the Datastore admin console.
This was also filed as https://github.com/googleapis/python-ndb/issues/666. It turns out Cloud NDB before 1.9.0 was explicitly requesting eventually consistent queries.
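If that is the cause, upgrading the client library should be enough; a one-line sketch, assuming a pip-managed environment (version bound taken from the statement above):
pip install --upgrade "google-cloud-ndb>=1.9.0"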

Variable-length path query runs forever but executes immediately when edge is bidirectional

Problem
We have a graph in which locations are connected by services. Services that have common key-values are connected by service paths. We want to find all the service paths we can use to get from Location A to Location Z.
The following query matches services that go directly from A to Z and service paths that take one hop to get from A to Z:
MATCH p=(origin:location)-[:route]->(:service)-[:service_path*0..1]->
(:service)-[:route]->(destination:location)
WHERE origin.name='A' AND destination.name='Z'
RETURN p;
and runs fine.
But if we expand the search to service paths that may take two hops between A and Z:
MATCH p=(origin:location)-[:route]->(:service)-[:service_path*0..2]->
(:service)-[:route]->(destination:location)
WHERE origin.name='A' AND destination.name='Z'
RETURN p;
the query runs forever.
It never times out or crashes the server - it just runs continuously.
However, if we make the variable-length part of the query bidirectional:
MATCH p=(origin:location)-[:route]->(:service)-[:service_path*0..2]-
(:service)-[:route]->(destination:location)
WHERE origin.name='A' AND destination.name='Z'
RETURN p;
The same query that ran forever now executes instantaneously (~30ms on a dev database with default Postgres config).
More info
Behavior in different versions
This problem occurs in AgensGraph 2.2dev (cloned from GitHub). In Agens 2.1.0, the first query, -[:service_path*0..1]->, still works fine, but the broken query, -[:service_path*0..2]->, AND the version that works in 2.2dev, -[:service_path*0..2]-, result in an error:
ERROR: btree index keys must be ordered by attribute
This leads us to believe that the problem is related to this commit, which was included as a bug fix in Agens 2.1.1
fix: Make re-scanning inner scan work
VLE threw "btree index keys must be ordered by attribute" because
re-scanning inner scan is not properly done in some cases. A regression
test is added to check it.
Duplicate paths returned, endlessly
In AgensBrowser v1.0, we are able to get results out by using LIMIT at the end of the broken query. The query always returns the maximum number of rows, but the resulting graph is very sparse. Direct paths and paths with one hop show up, but only one longer path appears.
In the result set, the shorter paths are returned in individual records as expected, but the first occurrence of a two-hop path is duplicated for the rest of the rows.
If we return some collection along with the paths, like RETURN p, nodes(p) LIMIT 100, the query again runs infinitely.
(Interestingly, this row duplication also occurs for a slightly different query in which we used the bidirectional fix, but the entire expected graph was returned. This may deserve its own post.)
The EXPLAIN plans are identical for the one-hop and two-hop queries
We could not compare EXPLAIN ANALYZE (because you can't ANALYZE a query that never finishes running) but the query plans were exactly identical between all of the queries - those that ran and those that didn't.
Increased logging revealed nothing
We set the logging level for Postgres to DEBUG5, the highest level, and ran the infinitely-running query. The logs showed nothing amiss.
Is this a bug or a problem with our data model?

Can you calculate active users using time series

My Atomist client exposes metrics on commands that are run. Each command is a metric with a username element as well as a status element.
I've been scraping this data for months without resetting the counts.
My requirement is to show the number of active users over a time period, i.e. 1h, 1d, 7d and 30d, in Grafana.
The original query was:
count(count({Username=~".+"}) by (Username))
This is an issue because I don't clear the metrics, so it's always a count since inception.
I then tried this:
count(
  max_over_time(help_command{job="Application Name",Username=~".+"}[1w])
  -
  max_over_time(help_command{job="Application Name",Username=~".+"}[1w] offset 1w)
  > 0
)
which works, but only for one command; I have about 50 other commands that need to be added to that count.
I tried:
{__name__=~".+_command",job="app name"}[1w] offset 1w
but this is obviously very expensive (it times out in the browser) and has issues with max_over_time, which doesn't support it.
Any help? Am I using the metric in the wrong way? Is there a better way to query? My only option at the moment is to repeat the working count format above for each command.
Thanks in advance.
To start, I will point out a number of issues with your approach.
First, the Prometheus documentation recommends against using arbitrarily large sets of values for labels (as your usernames are). As you can see (based on your experience with the query timing out) they're not entirely wrong to advise against it.
Second, Prometheus may not be the right tool for analytics (such as active users). Partly due to the above, partly because it is inherently limited by the fact that it samples the metrics (which does not appear to be an issue in your case, but may turn out to be).
Third, you collect separate metrics per command (i.e. help_command, foo_command) instead of a single metric with the command name as a label (i.e. command_usage{command="help"}, command_usage{command="foo"}).
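For illustration, with a single metric laid out like that, counting active users over e.g. the last 7 days becomes one query (command_usage and the job value are assumed names, not something you already export):
count(
  count by (Username) (
    increase(command_usage{job="Application Name"}[7d]) > 0
  )
)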
To get back to your question though, you don't need max_over_time; you can simply write your query as:
count by(__name__)(
  (
    {__name__=~".+_command",job="Application Name"}
    -
    {__name__=~".+_command",job="Application Name"} offset 1w
  ) > 0
)
This only works, though, because you say that whatever exports the counts never resets them. If that is simply because the exporter has never restarted, and when it does the counts will drop to zero, then you'd need to use increase instead of subtraction, and you'd run into the exact same performance issues as with max_over_time.
count by(__name__)(
  increase({__name__=~".+_command",job="Application Name"}[1w]) > 0
)

Why is index not created after teardown if some connections persist?

I set up and tear down my MongoDB database during functional tests.
One of my models makes use of GridFS, and I am going to run that test (which also calls setup and teardown). Suppose we started out with a clean, empty database called test_repoapi:
python serve.py testing.ini
nosetests -a 'write-file'
The second time I run the test, I am getting this:
OperationFailure: command SON([('filemd5', ObjectId('518ec7d84b8aa41dec957d3c')), ('root', u'fs')]) failed: need an index on { files_id : 1 , n : 1 }
If we look at the client:
> use test_repoapi
switched to db test_repoapi
> show collections
fs.chunks
system.indexes
users
Here is the log: http://pastebin.com/1adX4svG
There are three kinds of timestamps:
(1) the top one is from when I first launched the web app
(2) anything before 23:06:27 is from the first iteration
(3) everything else is from the second iteration
As you can see, I did issue commands to drop the database. Two possible explanations:
(1) the web app holds two active connections to the database, and
(2) some kind of "lock" prevents the index from being fully created. Also note that fs.files was not recreated.
The workaround is to stop the web app, start again, and run the test; then the error will not appear.
By the way, I am using Mongoengine as my ODM in my web app.
Any thoughts on this?
We used to have a similar issue with mongoengine failing to recreate indexes after drop_collection() during tests, because it failed to realise that dropping a collection also drops its indexes. But that was happening with normal collections and a rather ancient version of mongoengine (a call to QuerySet._reset_already_indexed() fixed it for us, but we haven't needed that since 0.6).
Maybe this is another case of mongoengine internally keeping track of which indexes have been created and it's just failing to realize the database/collection vanished and those indexes must be recreated? FWIW using drop_collection() between tests is working for us and that includes GridFS.
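If dropping the whole database out from under the running web app is the trigger, one possible workaround (a sketch, assuming a mongoengine connection set up the usual way; not tested against your app) is to drop the GridFS collections explicitly between tests, so the driver recreates them, and the required { files_id: 1, n: 1 } index, on the next write:
from mongoengine.connection import get_db

def teardown_gridfs():
    # Drop the GridFS collections directly; the next GridFS write will
    # recreate fs.files / fs.chunks together with their indexes,
    # without having to restart the web app.
    db = get_db()
    db.drop_collection('fs.files')
    db.drop_collection('fs.chunks')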