Titan Warning: Query requires iterating over all vertices

Below I am defining a cdate index and then adding some data:
baseGraph.makeKey("cdate").dataType(Long.class).indexed(Vertex.class).make();
for (int i = 0; i < 20; i++) {
    Vertex page = g.addVertex("P0" + i);
    page.setProperty("cdate", new Date().getTime());
    page.setProperty("pName", "pName-P0" + i);
    Edge e = g.addEdge(null, user, page, "created");
    e.setProperty("time", i);
    Thread.sleep(2000);
}
for (int i = 20; i < 25; i++) {
    Vertex page = g.addVertex("P0" + i);
    page.setProperty("cdate", new Date().getTime());
    page.setProperty("pName", "pName-P0" + i);
    Edge e = g.addEdge(null, user, page, "notcreated");
    e.setProperty("time", i);
    Thread.sleep(2000);
}
g.commit();
Now when I run the following query:
Iterable<Vertex> vertices = g.query().interval("cdate", 0, time)
        .orderBy("cdate", Order.DESC).limit(5).vertices();
It gives the output in the correct order, but it shows:
WARN com.thinkaurelius.titan.graphdb.transaction.StandardTitanTx -
Query requires iterating over all vertices [(cdate >= 0 AND cdate < 1392198350796)].
For better performance, use indexes
But I have already defined cdate as an index (see the first line above).

In your type definition for cdate you are using Titan's standard index (by not specifying any other index). Titan's standard index only supports equality comparisons (i.e. no range queries).
To get support for range queries, you need an indexing backend that is registered with Titan and then explicitly referenced in the type definition.
Check out the documentation in Chapter 8, "Indexing for Better Performance":
Titan supports two different kinds of indexing to speed up query processing: graph indexes and vertex-centric indexes. Most graph queries start the traversal from a list of vertices or edges that are identified by their properties. Graph indexes make these global retrieval operations efficient on large graphs. Vertex-centric indexes speed up the actual traversal through the graph, in particular when traversing through vertices with many incident edges.
Bottom line: Titan supports multiple types of indexes and it automatically picks the most suitable index to answer a particular query. In your case, there is none that supports range queries, hence the warning and slow performance. The documentation above outlines how to register additional indexes that provide the support you need.
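For illustration, a minimal sketch of what that could look like with the same Titan 0.4-style API used in the question. The index name "search" and the Elasticsearch configuration keys are assumptions and must match whatever backend your graph is actually configured with:

// In the graph configuration (e.g. the properties file), assuming Elasticsearch:
// storage.index.search.backend=elasticsearch
// storage.index.search.hostname=127.0.0.1

// Reference that backend by name instead of using the standard index:
baseGraph.makeKey("cdate")
         .dataType(Long.class)
         .indexed("search", Vertex.class) // external index, supports range queries
         .make();

With the key registered against the indexing backend, the interval query above can be answered by the index rather than a full vertex scan, and the warning should disappear.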

Related

Multi Column Indexes with Order By and OR clause

I have the below query to fetch a list of tickets.
EXPLAIN select * from ticket_type
where ticket_type.event_id='89898'
and ticket_type.active=true
and (ticket_type.is_unlimited = true OR ticket_type.number_of_sold_tickets < ticket_type.number_of_tickets)
order by ticket_type.ticket_type_order
I have created the below indexes, but they are not working.
Index on (ticket_type_order,event_id,is_unlimited,active)
Index on (ticket_type_order,event_id,active,number_of_sold_tickets,number_of_tickets).
The perfect index for this query would be
CREATE INDEX ON ticket_type (event_id, ticket_type_order)
WHERE active AND (is_unlimited OR number_of_sold_tickets < number_of_tickets);
Of course, a partial index like that might only be useful for this specific query.
If the WHERE conditions from the index definition are not very selective, or a somewhat slower execution is also acceptable, you can omit parts of the WHERE clause or the whole clause. That makes the index more widely useful.
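For example, a possible relaxed variant (the index name is illustrative):

CREATE INDEX ticket_type_event_order_idx
    ON ticket_type (event_id, ticket_type_order)
    WHERE active;

This still supports the ORDER BY for a given event_id while remaining usable for any other query on active tickets.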
What is the size of the table and of the usual query result? The server is usually smart enough to disable indexes if it expects to return more than half of the table.
An index makes no sense if the result is rather small. If the server has, let's say, 1000 records left after several filtration steps, it stops using indexes: it is cheaper to finish the query using the CPU than to load an index from disk. As a result, indexes are never applied to small tables.
ORDER BY is applied at the very end of query processing. The first field in the index should be one of the fields from the WHERE filter.
Boolean fields are seldom useful in an index, since they have only two possible values. Indexes should be created on fields with many distinct values.
Avoid OR filtering. That is easy in your case: put a very big number into number_of_tickets if the tickets are unlimited.
The better index in your case would be just event_id. If the database server supports functional indexes, you can try to add number_of_tickets - number_of_sold_tickets and rewrite the condition as WHERE number_of_tickets - number_of_sold_tickets > 0.
UPDATE: PostgreSQL calls this an "index on expression":
https://www.postgresql.org/docs/current/indexes-expressional.html
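A minimal sketch of that approach, assuming PostgreSQL (the index name is illustrative):

-- Index on an expression; note the extra parentheses around the expression:
CREATE INDEX ticket_type_remaining_idx
    ON ticket_type (event_id, (number_of_tickets - number_of_sold_tickets));

-- Rewritten query that can use it:
SELECT *
FROM ticket_type
WHERE event_id = '89898'
  AND active
  AND number_of_tickets - number_of_sold_tickets > 0
ORDER BY ticket_type_order;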

Multiple indexes vs single index on multiple columns in postgresql

I could not find a conclusive answer in the existing posts on this topic.
I have certain data at 100 locations for the past 10 years. The table has about 800 million rows. I primarily need to generate yearly statistics for each location. Sometimes I need to generate monthly and hourly variation statistics as well. I'm wondering if I should create two indexes - one for location and another for year - or one index on both location and year. My primary key is currently a serial number (probably I could use location and timestamp as the primary key instead).
Thanks.
Regardless of how many indices you have created on a relation, only one of them will be used in a certain query (which one depends on the query, statistics, etc.). So in your case you wouldn't get a cumulative advantage from creating two single-column indices. To get the most performance from an index, I would suggest using a composite index on (location, timestamp).
Note that queries like ... WHERE timestamp BETWEEN smth AND smth will not use the index above, while queries like ... WHERE location = 'smth' or ... WHERE location = 'smth' AND timestamp BETWEEN smth AND smth will. That is because the first attribute in the index is crucial for searching and sorting.
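A sketch of that composite index and of queries that can and cannot use it, with hypothetical table and column names:

-- Composite index with location leading:
CREATE INDEX readings_location_time_idx ON readings (location, recorded_at);

-- Can use the index (leading column is constrained):
SELECT * FROM readings WHERE location = 'L042';
SELECT * FROM readings
WHERE location = 'L042'
  AND recorded_at BETWEEN '2019-01-01' AND '2019-12-31';

-- Cannot use the index efficiently (leading column is not constrained):
SELECT * FROM readings
WHERE recorded_at BETWEEN '2019-01-01' AND '2019-12-31';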
Don't forget to perform
ANALYZE;
after index creation in order to collect statistics.
Update:
As @MondKin mentioned in the comments, certain queries can actually use several indexes on the same relation, for example a query with OR clauses like a = 123 OR b = 456 (assuming there are indexes on both columns). In this case Postgres performs bitmap index scans for both indexes, builds a union of the resulting bitmaps and uses it for a bitmap heap scan. Under certain conditions the same scheme may be used for AND queries, but instead of a union there would be an intersection.
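A hypothetical illustration of that bitmap combination (table and index names are made up, and the exact plan depends on table size and statistics):

CREATE TABLE t (a int, b int);
CREATE INDEX t_a_idx ON t (a);
CREATE INDEX t_b_idx ON t (b);

EXPLAIN SELECT * FROM t WHERE a = 123 OR b = 456;
-- On a sufficiently large, analyzed table the plan typically looks like:
--   Bitmap Heap Scan on t
--     ->  BitmapOr
--           ->  Bitmap Index Scan on t_a_idx
--           ->  Bitmap Index Scan on t_b_idx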
There is no rule of thumb for situations like these; I suggest you experiment in a copy of your production DB to see what works best for you: a single multi-column index or two single-column indexes.
One nice feature of Postgres is that you can have multiple indexes and use them in the same query. Check this chapter of the docs:
... PostgreSQL has the ability to combine multiple indexes ... to handle cases that cannot be implemented by single index scans ....
... Sometimes multicolumn indexes are best, but sometimes it's better to create separate indexes and rely on the index-combination feature ...
You can even experiment creating both the individual and combined indexes, and checking how big each one is and determine if it's worth having them at the same time.
Some things that you can also experiment with:
If your table is too large, consider partitioning it. It looks like you could partition either by location or by date. Partitioning splits your table's data into smaller tables, reducing the number of places a query needs to look (see the sketches after this list).
If your data is laid out according to a date (like a transaction date), check out BRIN indexes.
If multiple queries will be processing your data in a similar fashion (like aggregating all transactions over the same period), check out materialized views so you only need to do those costly aggregations once.
About the order of the columns in your multi-column index: put the column on which you will have an equality operation first, and the column on which you have a range (>= or <=) operation later.
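Hedged sketches of the partitioning and BRIN suggestions above, assuming PostgreSQL 10+ and hypothetical names:

-- Declarative range partitioning by date:
CREATE TABLE readings (
    location    text        NOT NULL,
    recorded_at timestamptz NOT NULL,
    value       double precision
) PARTITION BY RANGE (recorded_at);

CREATE TABLE readings_2019 PARTITION OF readings
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');

-- BRIN index; effective when rows are physically stored in date order:
CREATE INDEX readings_brin_idx ON readings USING brin (recorded_at);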
An index on (location, timestamp) should work better than two separate indexes for your case. Note that the order of the columns is important.

Titan: How to efficiently get the maximum value of a Long property?

So if I want to retrieve the vertex that has the maximum value of a Long property, I should run:
graph.traversal().V().has("type","myType").values("myProperty").max().next()
This is really slow, as it has to load all vertices to find the maximum value. Is there any faster way?
Would any indexing help? I believe composite indexes won't, but is there a way to do it using a mixed index with an Elasticsearch backend?
Using Titan to create a mixed index on a numeric value will result in Elasticsearch indexing the property correctly. Somewhat similarly to you, we want to know all our vertices ordered by a property DEGREE from max to min, so we currently do the following for the property DEGREE:
TitanGraph graph = TitanFactory.open("titan-cassandra-es.properties");
TitanManagement management = graph.openManagement();
PropertyKey degreeKey = management.makePropertyKey("DEGREE").dataType(Long.class).make();
management.buildIndex("byDegree", Vertex.class)
          .addKey(degreeKey)
          .buildMixedIndex("search");
management.commit();
We are currently having issues getting Titan to traverse this quickly (for some reason it can create the index but struggles to use it for certain queries), but we can query Elasticsearch directly:
curl -XGET 'localhost:9200/titan/byDegree/_search?size=80' -d '
{
  "sort" : [
    { "DEGREE" : { "order" : "desc" } }
  ],
  "query" : { "match_all" : {} }
}'
The answer is returned extremely quickly, so for now we create the index with Titan but query Elasticsearch directly.
Short answer: Elasticsearch can do what is needed with numeric ranges very easily; the problem, on our side at least, seems to be getting Titan to use these indices fully. However, the traversal you are trying to execute is simpler than ours (you just want the max), so you may not encounter these issues and may be able to stick with Titan traversals entirely.
Edit:
I have recently confirmed that Elasticsearch and Titan can fulfill your needs (as they do mine). Just be wary of how you create your indices: Titan will be able to execute your query quickly as long as you create your mixed index with the type key set to a String match, not a Text match.
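A sketch of such a setup, assuming a Titan 1.0-style management API and an index backend configured under the name "search"; the index name "byTypeAndProperty" is illustrative, and the "type"/"myProperty" keys come from the question:

// Requires com.thinkaurelius.titan.core.schema.Mapping and
// org.apache.tinkerpop.gremlin.process.traversal.Order.
TitanGraph graph = TitanFactory.open("titan-cassandra-es.properties");
TitanManagement mgmt = graph.openManagement();
PropertyKey type = mgmt.makePropertyKey("type").dataType(String.class).make();
PropertyKey myProperty = mgmt.makePropertyKey("myProperty").dataType(Long.class).make();
mgmt.buildIndex("byTypeAndProperty", Vertex.class)
    .addKey(type, Mapping.STRING.asParameter()) // String match, not Text match
    .addKey(myProperty)
    .buildMixedIndex("search");
mgmt.commit();

// The max can then come from an index-backed order().limit(1) instead of a full scan:
Long max = graph.traversal().V().has("type", "myType")
    .order().by("myProperty", Order.decr)
    .limit(1)
    .<Long>values("myProperty").next();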

Using Mongo: should we create an index tailored to each type of high-volume query?

We have two types of high-volume queries. One looks for docs involving 5 attributes: a date (lte), a value stored in an array, a value stored in a second array, one integer (gte), and one float (gte).
The second includes these five attributes plus two more.
Should we create two compound indices, one for each query? Assume each attribute has a high cardinality.
If we do, because each query involves multiple arrays, it doesn't seem like we can create such an index, because of Mongo's restriction on compound-indexing more than one array field (parallel arrays). How do people structure their Mongo databases in this case?
We're using MongoMapper.
Thanks!
In an index, the value of the additional fields after the first range used by the query drops significantly.
Conceptually, I find it best to think of the additional fields in the index as pruning ever smaller sub-trees of the query. The first range chops off a large branch, the second a smaller one, the third smaller still, etc. My general rule of thumb is that only the first range from the query is of value in the index.
The caveat to that rule is that additional fields in the index can be useful to aid in sorting returned results.
For the first query I would create an index on the two array values and then whichever of the ranges will exclude the most documents. The date field is unlikely to provide high exclusion unless you can close the range (lte and gte). The integer and float are hard to judge without knowing the domain.
If the second query's two additional attributes also use ranges and do not have a significantly higher exclusion value, then I would just work with the one index.
Rob.
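One caveat worth making explicit: MongoDB refuses a compound index over two array fields from the same document (the "parallel arrays" restriction the question alludes to), so at most one of the two array values can actually go into the index. A hypothetical mongo-shell sketch with made-up collection and field names (arrayA for one array value, amount for whichever range excludes the most documents):

// At most one array field per compound index; pair it with the single
// most selective range field, per the reasoning above:
db.docs.createIndex({ arrayA: 1, amount: 1 })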

Postgresql compound index for spatial + time parameters

We have a table with millions of rows containing PostGIS geometries. The query we want to perform is: what are the most recent entries that fall within a bounding geometry? The problem with this query is that we'll often have a large number of items matching the bounding box (which has a radius of around 5 km), and Postgres will then have to re-check all the items returned by the bounding-box lookup to get their timestamps, then sort and return the latest N.
It feels like what we need is a (compound?) index that takes into account both the GiST spatial index and the timestamp. Is such a thing possible? I've tried several combinations in the CREATE INDEX step and nothing has worked so far.
I'd rather make two indexes: one spatial, and a second on the timestamp column. PostgreSQL can combine indexes quite nicely, and it doesn't need to 're-check' the found rows. It can use one index to get the rows within the geometry and sort them using the other index.
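A minimal sketch of that two-index approach, with assumed table and column names:

-- Spatial (GiST) index plus a plain B-tree index on the timestamp:
CREATE INDEX entries_geom_gist ON entries USING gist (geom);
CREATE INDEX entries_created_at_idx ON entries (created_at);

-- Most recent N entries inside a bounding geometry:
SELECT *
FROM entries
WHERE geom && ST_MakeEnvelope(-122.5, 37.7, -122.3, 37.9, 4326)
ORDER BY created_at DESC
LIMIT 10;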