Inconsistent results in filtered queries in Weaviate: HNSW graph traversal + filtering

I have set up a Weaviate DB which holds about 12M vectors, along with some metadata for each one.
I am getting inconsistent/unexpected results when I perform a filtered search,
i.e. filter on a metadata field and then perform an ANN search. (I have effectively turned brute-force search off by passing a very small number, 500, to the "flatSearchCutoff" parameter.)
Each vector has a 'user' field attached to it which is a 'string' type; there is also a 'status_int' field which takes values in the range [0, 4], and a unique identifier which is another 'string' field.
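For context, a rough sketch of how such a class can be defined with that cutoff via the v3 Python client; the exact property list is an assumption based on the fields described here, and the endpoint is a placeholder:

import weaviate

client = weaviate.Client("http://localhost:8080")  # placeholder endpoint

resource_class = {
    "class": "Resource",
    "properties": [
        {"name": "resource_identifier", "dataType": ["string"]},
        {"name": "resource_uri", "dataType": ["string"]},
        {"name": "user", "dataType": ["string"]},
        {"name": "status_int", "dataType": ["int"]},
    ],
    # flatSearchCutoff: filters matching fewer objects than this fall back to a
    # flat (brute-force) search instead of HNSW; 0 disables the flat path entirely.
    "vectorIndexConfig": {"flatSearchCutoff": 500},
}
client.schema.create_class(resource_class)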
First, my use case requires querying the DB for a specific entry and retrieving its vector representation by using the unique identifier field, which I do with the following:
import numpy as np

where_filter = {
    "path": ["resource_identifier"],
    "operator": "Equal",
    "valueString": identifier_var
}

result = client.query.get("Resource", ["resource_identifier"])\
    .with_where(where_filter)\
    .with_additional(['vector'])\
    .with_limit(1)\
    .do()

feature_vector = np.array(result['data']['Get']['Resource'][0]['_additional']['vector'], dtype=np.float32)
nearVector = {"vector": feature_vector, "certainty": SIMILARITY_THRESHOLD}
This works really well and is blazing fast.
Second, I want to search for vectors near the one I just retrieved, while making sure that the results fit my criteria.
I am applying the following filter using the metadata fields:
filter_ = {
    'operator': 'And',
    'operands': [
        {
            'path': 'user',
            'operator': 'Equal',
            'valueString': user_var
        },
        {
            'path': 'status_int',
            'operator': 'GreaterThanEqual',
            'valueInt': status_var
        }
    ]
}

result = client.query.get("Resource", ["resource_uri", "user", "_additional{certainty}", 'status_int'])\
    .with_where(filter_)\
    .with_near_vector(nearVector)\
    .with_limit(RETRIEVE_RECORDS_LIMIT)\
    .do()
While the filtering process does not throw any errors, the results look weird. I set the similarity threshold to 0.75 so that I get plenty of results; however, the results only include really close matches to the query vector, even at a very small similarity threshold, which I found really odd.
More specifically, I query the DB as follows:
When I query for a specific user and status_int >= 0, I am stuck with 2 identical results of similarity 1.
However, there should be a lot more results, since the filter covers 3295 objects.
When I query for the same user and status_int >= 1, I get 1 resource as a result, again with similarity = 1, which is one of the 2 results given above (this filter encompasses 2578 objects).
HOWEVER, when I query for the same user and status_int >= 2, I get a lot more results! With no exact matches (as it should be), but with similarity of 0.85 and below (this filter encompasses 1900 resources).
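As a side note, those per-filter object counts can be double-checked straight from the client with an Aggregate query (a small sketch against the v3 Python client, reusing filter_ from the snippet above):

count_result = client.query.aggregate("Resource")\
    .with_where(filter_)\
    .with_meta_count()\
    .do()
# Number of objects matched by the filter alone (no vector search involved).
print(count_result['data']['Aggregate']['Resource'][0]['meta']['count'])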
So my question is: isn't this weird, or is this intended behaviour and I've misunderstood how Weaviate and HNSW work?
In my mind the path through the HNSW graph is the same across these queries; it's just that results are retrieved based on the filters passed.
Shouldn't the first 2 queries return the perfect matches AS WELL AS the less relevant matches of the third query?
Really confused on this one :(

Related

Need help querying distinct combinations of nested fields

Desired result
I am trying to query my collection and obtain every unique combination of a batch and entry code. I don't care about anything other than these fields; the parent objects do not matter to me.
What I have tried
I tried running:
db.accountant_ledgers.aggregate([
    { "$group": {
        "_id": {
            entryCode: "$actions.entry.entryCode",
            batchCode: "$actions.entry.batchCode"
        }
    }}
]);
Problem
I get unexpected results when I run that query. I'm looking for a list of every unique combination of batch and entry codes, but instead I get a list of arrays? Perhaps these are the results I'm looking for, but I have no idea how to read them if they are.
Theory
I think perhaps this could have to do with the fact that these fields are nested. Each object has several actions, each action has several entries. I believe that the result from that query is just the aggregated entry and batch codes found in each object. I don't know how long the list of results is, but I'd guess it's the same number as the total number of objects in my collection (~90 million).
EDIT: I found out that there are only 182 results from my query, which is clearly significantly smaller than 90 million. My new theory is that it has found all unique objects, with the criteria for "uniqueness" being the list of the batch and entry codes that appear in their actions, which makes sense. There should be a lot of repetition in the collection.
Question
How can I achieve the result I'm looking for? I'm expecting something like:
FEE, MG
EXN, WT
ACH, 9C
...etc
Notes
I apologize if this is a bad question, I'm not sure how else to frame it. Let me know if I can improve my question at all.
Picture below shows the results of the query.
EDIT FOR ADDITIONAL INFORMATION
I can't share any sample documents, but the general structure of the data is shown (crudely) in the below image. Each Entity has several Actions, each Action has one Entry and each Entry has one Batch code and one Entry code.
You are getting a list of documents (each is a map or a hash), not a list of arrays.
The GUI you are using is trying to show you the contents of each document at the top level, which may be what is confusing.
If you run the query in mongo shell you should see a list of documents.
It looks like your inputs are documents where entry code and batch code are arrays; if so:
Edit your question to include sample documents you are querying as text
You could use $unwind to flatten those arrays before using $group.
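A minimal sketch of that pipeline with pymongo, assuming actions is an array of subdocuments as described in the edit (unwind entry as well only if it is itself an array); the collection and connection details are placeholders:

from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["accountant_ledgers"]

pipeline = [
    {"$unwind": "$actions"},  # one document per action
    {"$group": {"_id": {
        "entryCode": "$actions.entry.entryCode",
        "batchCode": "$actions.entry.batchCode",
    }}},
]

for doc in coll.aggregate(pipeline, allowDiskUse=True):
    print(doc["_id"].get("entryCode"), doc["_id"].get("batchCode"))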

What is the correct way to index in MongoDB when a big combination of fields exists

Consider that I have a search panel that includes multiple options, like in the picture below:
I'm working with Mongo and created a compound index on 3-4 properties with a specific order.
But when I run different combinations of searches I see a different order in the execution plan (explain()) every time. Sometimes I see it do a collection scan (bad), and sometimes it fits right to the index (IXSCAN).
The selective fields that should be handled by Mongo indexes are: brand, Types, Status, Warehouse, Carries, Search (only by id).
My question is:
Do I have to create all combinations of all fields with different orders? That could be 10-20 compound indexes. Or 1-3 big compound indexes? But again that will not solve the ordering problem.
What is the best strategy to deal with a big variety of field combinations?
I use the same query structure with different combinations of pairs.
// Example Query.
// fields could be different every time according to user select (and order) !!
db.getCollection("orders").find({
    '$and': [
        { 'status': { '$in': ['XXX', 'YYY'] } },
        { 'searchId': { '$in': ['3859447'] } },
        { 'origin.brand': { '$in': ['aaaa', 'bbbb', 'cccc', 'ddd', 'eee', 'bundle'] } },
        { '$or': [
            { 'origin.carries': 'YYY' },
            { 'origin.carries': 'ZZZ' },
            { 'origin.carries': 'WWWW' }
        ] }
    ]
}).sort({ "timestamp": 1 })
// My compound index is:
{ status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1, timestamp: 1 }
but that is only 1 combination... it could be plenty, like:
a. {status: 1}
b. {status: 1, searchId: -1}
c. {status: 1, searchId: -1, "origin.brand": 1}
d. {status: 1, searchId: -1, "origin.brand": 1, "origin.carries": 1}
........
Additionally, what will happen with write/read performance? I think writes will slow down relative to reads...
The query patterns are:
1. find(...) with '$and'/'$or' + sort
2. Aggregation with match/sort
thanks
Generally, indexes are only useful if they are over a selective field. This means the number of documents that have a particular value is small relative to the overall number of documents.
What "small" means varies on the data set and the query. A 1% selectivity is pretty safe when deciding whether an index makes sense. If an particular value exists in, say, 10% of documents, performing a table scan may be more efficient than using an index over the respective field.
With that in mind, some of your fields will be selective and some will not be. For example, I suspect filtering by "OK" will not be very selective. You can eliminate non-selective fields from indexing considerations - if someone wants all orders which are "OK" with no other conditions they'll end up doing a table scan. If someone wants orders which are "OK" and have other conditions, whatever index is applicable to other conditions will be used.
Now that you are left with selective (or at least somewhat selective) fields, consider what queries are both popular and selective. For example, perhaps brand+type would be such a combination. You could add compound indexes that match popular queries which you expect to be selective.
Now, what happens if someone filters by brand only? This could be selective or not depending on the data. If you already have a compound index on brand+type, you'd leave it up to the database to determine whether a brand only query is more efficient to fulfill via the brand+type index or via a collection scan.
Continue in this manner with other popular queries and fields.
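For example, a targeted compound index for a popular, selective query shape could be created like this (a pymongo sketch; brand + type is just an illustrative pairing and the field names are assumptions):

from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017")["mydb"]["orders"]

# Hypothetical index for queries that always filter on brand and type.
orders.create_index([("origin.brand", 1), ("type", 1)])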
So you have subdocuments, ranged queries, and sorting by 1 field only.
That eliminates most of the possible permutations, assuming there are no other surprises.
D. SM already covered selectivity - you should really listen to what the man says and at least upvote.
The other thing to consider is the order of the fields in the compound index:
fields that have direct match like $eq
fields you sort on
fields with ranged queries: $in, $lt, $or etc
These are common rules for all b-trees. Now, things that are specific to Mongo:
A compound index can include at most one multikey field, i.e. a field inside an array of subdocuments like "origin.brand". Again, I assume origins are embedded docs, so the document's shape is like this:
{
    _id: ...,
    status: ...,
    timestamp: ...,
    origin: [
        { brand: ..., carries: ... },
        { brand: ..., carries: ... },
        { brand: ..., carries: ... }
    ]
}
For your query the best index would be:
{
    searchId: 1,
    timestamp: 1,
    status: 1,              /** only if it is selective enough **/
    "origin.carries": 1     /** or brand, depending on data **/
}
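Creating that index with pymongo would look roughly like this (same caveats as above about selectivity and the carries/brand choice):

from pymongo import MongoClient

orders = MongoClient("mongodb://localhost:27017")["mydb"]["orders"]

# Equality field first, then the sort field, then the less selective / multikey part.
orders.create_index([
    ("searchId", 1),
    ("timestamp", 1),
    ("status", 1),           # only if it is selective enough
    ("origin.carries", 1),   # or origin.brand, depending on data
])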
Regarding the number of indexes - it depends on data size. Ensure all indexes fit into RAM, otherwise it will be really slow.
Last but not least - indexing is not a one-off job but a lifestyle. Data changes over time, and so do queries. If you care about performance and have finite resources you should keep an eye on the database. Check slow queries to add new indexes, collect stats from users' queries to remove unused indexes and free up some room. Basically, apply common sense.
I noticed this one-year-old topic because I am more or less struggling with a similar issue: users can request queries with an unpredictable set of fields, which makes it nearly impossible to decide (or change) how indexes should be defined.
Even worse: the user should indicate some value (or range) for the fields that make up the sharding key, otherwise we cannot help MongoDB limit its search to only a few shards (or chunks, for that matter).
When the user needs the liberty to search on other fields that are not necessarily the ones which make up the sharding key, then we're stuck with a full-database search. Our database is some tens of TB in size...
Indexes should fit in RAM? This can only be achieved with small databases, meaning some hundreds of GB max. How about my 37 TB database? Indexes won't fit in RAM.
So I am trying out a POC inspired by the UNIX filesystem structures, where we have inodes pointing to data blocks:
we have a cluster with 108 shards, each containing 100 chunks
at insert time, we take some fields of which we know they yield a good cardinality of the data, and we compute the sharding key from those fields; the document goes into the main collection (call it "Main_col") on that computed shard, so with a certain chunk number (which equals our computed sharding-key value)
from the original document, we take a few 'crucial' fields (the list of such fields can evolve as your needs change) and store a small extra document in another collection (call these "Crucial_col_A", "Crucial_col_B", etc., one for each such field): that document contains the value of this crucial field, plus an array with the chunk numbers where the original full documents have been stored in the 'big' collection "Main_col"; consider this a 'pointer' to the chunks in collection "Main_col" where this full document exists. These "Crucial_col_X" collections are sharded on the value of the 'crucial' field.
when we insert another document that has the same value for some 'crucial' field "A", then that array of chunk numbers in "Crucial_col_A" will be updated (with a 'merge') to contain the (different or same) chunk number of this next full document from "Main_col"
a user can now define queries with criteria for at least one of those 'crucial' fields, plus (optionally) any other criteria on other fields in the documents; the first criterion for the crucial field (say field "B") will run very quickly (because the collection is sharded on the value of "B") and return the small document from "Crucial_col_B", in which we have the array of chunk numbers in "Main_col" where any document exists that has field "B" equal to the given criterion. Then we run a second set of parallel queries, one for each shardkey-value = chunk-number (or one per shard, to be decided) that we find in the array from before. We combine the results of those parallel subqueries, and then apply further filtering if the user gave additional criteria.
Thus this involves 2 query steps: first in the "Crucial_col_X" collection to obtain the array with the chunk numbers where the full documents exist, and then a second query on those specific chunks in "Main_col".
The first query is done with a precise value for the 'crucial' field, so the exact shard/chunk is known; thus this query goes very fast.
The second (set of) queries is done with precise values for the sharding keys (= the chunk numbers), so these are also expected to go very fast.
This way of working would eliminate the burden of defining many index combinations.
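To illustrate the two query steps, here is a very rough pymongo sketch; the collection names follow the description above, while the field names (crucial_B, chunks, chunk_no) are purely hypothetical:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

def find_by_crucial_field(value, extra_filter=None):
    # Step 1: targeted query on the small pointer collection,
    # which is sharded on the crucial field itself.
    pointer = db["Crucial_col_B"].find_one({"crucial_B": value})
    if pointer is None:
        return []

    # Step 2: one query per chunk number found, each routed to a single chunk,
    # then apply any additional user criteria.
    query = {"crucial_B": value, **(extra_filter or {})}
    results = []
    for chunk_no in pointer["chunks"]:
        results.extend(db["Main_col"].find({**query, "chunk_no": chunk_no}))
    return results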

MongoDB skip & limit when querying two collections

Let's say I have two collections, A and B, and a single document in A is related to N documents in B. For example, the schemas could look like this:
Collection A:
{
  id: (int),
  propA1: (int),
  propA2: (boolean)
}
Collection B:
{
  idA: (int),   # id for document in Collection A
  propB1: (int),
  propB2: (...),
  ...
  propBN: (...)
}
I want to return properties propB2-BN and propA2 from my API, and only return information where (for example) propA2 = true, propB6 = 42, and propB1 = propA1.
This is normally fairly simple - I query Collection B to find documents where propB6 = 42, collect the idA values from the result, query Collection A with those values, and filter the results with the Collection A documents from the query.
However, adding skip and limit parameters to this seems impossible to do while keeping the behavior users would expect. Naively applying skip and limit to the first query means that, since filtering occurs after the query, fewer than limit documents could be returned. Worse, in some cases no documents would be returned even though there are actually still documents in the collection to be read. For example, if the limit was 10 and the first 10 Collection B documents returned pointed to documents in Collection A where propA2 = false, the function would return nothing. Then the user would assume there's nothing left to read, which may not be the case.
A slightly less naive solution is to simply check if the return count is < limit, and if so, repeat the queries until the return count = limit. The problem here is that skip/limit queries where the user would expect exclusive sets of documents returned could actually return the same documents.
I want to apply skip and limit at the mongo query level, not at the API level, because the results of querying collection B could be very large.
MapReduce and the aggregation framework appear to only work on a single collection, so they don't appear to be alternatives.
This seems like something that'd come up a lot in Mongo use - any ideas/hints would be appreciated.
Note that these posts ask similar sounding questions but don't actually address the issues raised here.
Sounds like you already have a solution (2).
You cannot optimize skip/limit on the first query; depending on the search you can perhaps do it on the second query.
You will need a loop around it either way, like you wrote.
I suppose the .skip will always be costly for you, since you will need to get all the results and then throw them away to simulate the skip, to give the user consistent behavior.
All the logic would have to go into your loop - unless you can match in a clever way to the second query (depending on requirements).
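For what it's worth, a rough Python (pymongo) sketch of that loop, using the field names from the question; it only shows the repeat-until-full mechanism and inherits the pagination caveats already discussed:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

def fetch_filtered(limit, batch_size=100):
    results, scanned = [], 0
    while len(results) < limit:
        batch = list(db.B.find({"propB6": 42}).skip(scanned).limit(batch_size))
        if not batch:
            break  # collection B exhausted
        scanned += len(batch)
        # Fetch the related A documents and keep only matching pairs.
        a_by_id = {a["id"]: a for a in db.A.find(
            {"id": {"$in": [b["idA"] for b in batch]}, "propA2": True})}
        for b in batch:
            a = a_by_id.get(b["idA"])
            if a is not None and a["propA1"] == b["propB1"]:
                results.append(b)
                if len(results) == limit:
                    break
    return results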
Out of curiosity: Given the time passed, you should have a solution by now?!

Mongodb: Skip collection values from between (not a normal pagination)

I have browsed through various examples but have failed to find what I am looking for. What I want is to search for a specific document by _id and skip multiple elements within its array using one query. Or some alternative which is fast enough for my case.
The following query would skip the first comment and return the second:
db.posts.find( { "_id" : 1 }, { comments: { $slice: [ 1, 1 ] } } )
That would skip element 0, return element 1, and leave the rest out of the result.
But what if there were, say, 10000 comments and I wanted to use the same pattern, but return the array values like this:
skip 0, return 1, skip 2, return 3, skip 4, return 5
So that would return a comments array of size 5000, because half of them are skipped. Is this possible? I used a large number like 10000 because I fear that using multiple queries to accomplish this would not be wise performance-wise (example shown here: multiple queries to accomplish something similar). Thnx!
I went through several resources and concluded that this is currently impossible to do with one query. Instead, I accepted that there are only two options to overcome this problem:
1.) Make a loop of some sort and run several slice queries while increasing the position of the slice, similar to the resource I linked:
var skip = NUMBER_OF_ITEMS * (PAGE_NUMBER - 1)
db.companies.find({}, {comments: {$slice: [skip, NUMBER_OF_ITEMS]}})
However, depending on the type of data, I would not want to run 5000 individual queries to get only half of the array contents, so I decided to use option 2.), which seems relatively fast and performance-friendly to me.
2.) Make a single query by _id for the row you want and, before returning results to the client or some other part of your code, skip the unwanted array items with a for loop, then return the results. I did this on the Java side since I talk to Mongo via Morphia. I also ran explain() on the query and saw that returning a single document with an array of 10000 items while specifying the _id criteria is so fast that speed wasn't really an issue; I bet that slice-skipping would only be slower.
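The same option 2 in Python (pymongo) is essentially a one-line slice once the document is fetched; a sketch assuming the posts/comments shape from the question and a placeholder connection:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]

# Fetch the single post by _id, then drop every other comment client-side.
doc = db.posts.find_one({"_id": 1}, {"comments": 1})
kept = doc["comments"][::2]   # elements 0, 2, 4, ...; use [1::2] for 1, 3, 5, ...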

Data model for geo spatial data and multiple queries

I have a MongoDB object, let's call it place, which contains geo information; look at the example:
{
    "_id": "234235425e3g33424",
    "geo": {
        "lon": 12.23456,
        "lat": 34.23322
    },
    "some_field": "value"
}
Each place has a list of features associated with it:
{
    "_id": "2334sgfgsr435d",
    "place_id": "234235425e3g33424",
    "feature_field": "some_value"
}
As you see, features are linked to places via the place_id field. Now I would like to find the list of features connected with the nearest places. But I would also like to add search conditions on place.some_field and feature.feature_field. And, importantly, I would like to limit the results.
Now I am using the following approach:
I query on places with conditions on geo and some_field
I query on features with conditions on feature_field and place_id (limited to the ones found in 1.)
I limit the results in my application code
My question is: is there a better approach to such a task? Right now I cannot use Mongo's limit() function: if I apply it to places I can end up with too few results, since I still need to make the second query, and I cannot limit() the second query because its results come back in arbitrary order and I would like to sort them by distance.
I know I can put the data into one document, but I presume that the list of features will be long and I could exceed the BSON size limit.
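For reference, a rough pymongo sketch of that two-query approach, assuming a legacy "2d" index on geo (longitude first) and the field names from the question; the collection names, over-fetch size and final limit are assumptions:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["mydb"]
db.places.create_index([("geo", "2d")])

# 1. nearest places matching the place-level condition (over-fetch on purpose)
places = list(db.places.find(
    {"geo": {"$near": [12.23456, 34.23322]}, "some_field": "value"}
).limit(200))
place_ids = [p["_id"] for p in places]

# 2. features for those places, filtered on the feature-level condition
features = db.features.find(
    {"place_id": {"$in": place_ids}, "feature_field": "some_value"}
)

# 3. restore distance order (places came back nearest-first) and apply the final limit
order = {pid: i for i, pid in enumerate(place_ids)}
nearest_features = sorted(features, key=lambda f: order[f["place_id"]])[:50]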
Running out of 16 MB for just the features seems unlikely... but it's possible. I don't think you realize how much 16 MB is, so do the maths before assuming anything!
In any case, with MongoDB you cannot do a query with fields from two collections. A query always deals with one specific collection only. I have done a very similar thing to what you have here though, which I've described in an article: http://derickrethans.nl/indexing-free-tags.html - have a look at that for some more inspiration.