Grouping fields within a Lucene.Net document - lucene.net

Within our application, we've been working with Lucene.Net to index large amounts of data. The fields themselves are configurable, so the name and type of the fields can change with each rebuild. Within each document, we can have multiple fields with the same name and a varying number of numeric and text fields. Since we've put a lot of work into the current development, changing to a different search engine is a no-go at the moment.
For the most part it works like a charm, but there is one difficulty we can't seem to get around.
Suppose we want to index document "X" containing:
Row A - Field1: 4 + Field2: a
Row B - Field1: 8 + Field2: b
The index we would make would contain 4 fields:
Document X:
Field1: 4 (Numeric)
Field2: a (Text)
Field1: 8 (Numeric)
Field2: b (Text)
(The row Ids are not important)
Doing a search for Field1:[3 TO 6] AND Field2:b would have a hit on this document.
However, the link between the fields represented by the row (linking 4 and 'a') is gone.
We can concatenate the values like 4_a, but that would break our numeric search and would require the clients to know which fields are concatenated to get proper results. It would also complicate our analyzer setup, since we can assign a different analyzer to each field (mostly for language purposes).
Alternatively, we can create a separate document for each row with the same key and apply a distinct to the search results, but that doesn't sound like the way to go, does it? It would seriously multiply the number of documents, as we would create between 20 and 100 documents for each document we create now. I haven't tested this for performance or usability, as the current implementation doesn't allow me to try it out very easily :-)
Does anyone know how I can force a link between certain fields within Lucene.Net, but still keep a way to search for each field individually?

I personally don't see why an increased number of documents would affect performance. At least in the Java version of Lucene, the bulk of the memory is used for the term cache, which is per term and has no relationship to document count (provided the term count doesn't change). I can't comment on usability, though, as that is specific to your app.
The main point is that once you group the rows into documents, you lose the row relationship info. You can fix that by adding extra fields (something like rowInfoA:4_a, rowInfoB:8_b), but this seems too cumbersome and will actually require far more memory. Yes, you can choose to only store these auxiliary fields without indexing them, but given the information provided I would still prefer the 1:1 row:document mapping.

One kludge is to add another field for linkages:
Document X:
Field1: 4 (Numeric)
Field2: a (Text)
Field1: 8 (Numeric)
Field2: b (Text)
Link: 4_a
Link: 8_b
Another kludge is to add a field like MyDocument:X and index each row separately, with each row containing a MyDocument field for its Document. This would let you filter by document later in your process.
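A minimal sketch of the row-per-document idea, using plain Python data structures as a stand-in for the index. This is not the Lucene.Net API; the `parent` key and the two helper functions are hypothetical names chosen for illustration. It shows why the flattened document gives a false hit for Field1:[3 TO 6] AND Field2:b, and how one entry per row plus a distinct on the parent key avoids it:

```python
# In-memory stand-in for an index; this is not Lucene.Net code.

# Flattened layout: all row values end up in one multi-valued document.
flat_doc = {"id": "X", "Field1": [4, 8], "Field2": ["a", "b"]}

# Row-per-document layout: each row carries the key of its parent document.
# "parent" is a hypothetical field name (the answer calls it MyDocument).
row_docs = [
    {"parent": "X", "Field1": 4, "Field2": "a"},
    {"parent": "X", "Field1": 8, "Field2": "b"},
]


def flat_matches(doc, low, high, text):
    """Field1:[low TO high] AND Field2:text against multi-valued fields."""
    in_range = any(low <= v <= high for v in doc["Field1"])
    has_text = text in doc["Field2"]
    return in_range and has_text


def row_matches(rows, low, high, text):
    """Same query evaluated per row, followed by a distinct on the parent key."""
    hits = {
        r["parent"]
        for r in rows
        if low <= r["Field1"] <= high and r["Field2"] == text
    }
    return sorted(hits)


print(flat_matches(flat_doc, 3, 6, "b"))  # True: 4 and 'b' come from different rows
print(row_matches(row_docs, 3, 6, "b"))   # []: no single row satisfies both clauses
print(row_matches(row_docs, 3, 6, "a"))   # ['X']
```

Each field stays individually searchable in the row layout; the cost is the larger document count discussed above.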

Related

There is a doc in Firestore that has an ID_number field when created. If I want to create another doc, how do I make its ID equal to the previous ID+1?

I'm using Flutter and Firestore, and I want to create documents with an assigned ID field inside them, but I don't know how to make the new document's ID field equal to the last document's ID field + 1.
For example, if I have this field inside a document, I want the next document's correlativeNumber to be this one + 1 = 87.
The best option that you have is to store the last correlativeNumber into a document. Each time a new document is added, increment that number by 1. In this way, you can always know which number was used previously.
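The counter-document idea can be sketched as follows. This is an in-memory stand-in, not Firestore or Flutter code: `CounterStore` is a hypothetical class, and the lock plays the role that a transaction would play in a real database, so that the read, the increment and the write happen as one step.

```python
import threading


class CounterStore:
    """In-memory stand-in for a single 'counter' document (hypothetical)."""

    def __init__(self, last_number=0):
        self._last_number = last_number
        self._lock = threading.Lock()  # stands in for a database transaction

    def next_number(self):
        # Read, increment, and write back as one atomic step so that two
        # concurrent writers can never receive the same number.
        with self._lock:
            self._last_number += 1
            return self._last_number


# Assume the last stored correlativeNumber was 86, as in the question.
counter = CounterStore(last_number=86)
new_doc = {"name": "example", "correlativeNumber": counter.next_number()}
print(new_doc["correlativeNumber"])  # 87
```

Without the atomic step, two clients creating documents at the same time could both read 86 and both write 87.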
But there is something you should take into consideration. When it comes to document IDs, according to the official documentation:
Do not use monotonically increasing document IDs such as:
Customer1, Customer2, Customer3, ...
Product 1, Product 2, Product 3, ...
Such sequential IDs can lead to hotspots that impact latency.
So it's best to use Firestore's built-in identifiers, which are by definition completely unique.

Difference between wildcard search and individual text search

Is there a difference between a wildcard text index like $** and text indexes that I create for each of the fields in the collection?
I do see a small difference in response time when I create text indexes on individual fields: the individual indexes respond faster. I am not able to post an example now, but will try to.
A wildcard text search will index every field that contains string data for each document in the collection (https://docs.mongodb.com/manual/core/index-text/#wildcard-text-indexes).
Because a wildcard text index increases the number of fields indexed, queries against it take longer to run than queries against a text index that targets specific fields.
Since you can only have one text index per collection (https://docs.mongodb.com/manual/core/index-text/#create-text-index), it's worth considering beforehand which fields you plan on querying against.

Extensive filtering

Example:
{
shortName: "KITT",
longName: "Knight Industries Two Thousand",
fromZeroToSixty: 2,
year: 1982,
manufacturer: "Pontiac",
/* 25 more fields */
}
I need to be able to query by at least 20 fields, which means that only 10 fields are left unindexed.
There are 3 fields (all numeric) that could be used for sorting (in both directions).
This leaves me wondering how sites with lots of searchable fields do it, e.g. real estate or car sale sites where you can filter by every small detail and choose between several sort options.
How could I pull this off with MongoDB? How should I index that kind of collection?
I'm aware that there are databases specifically made for searching, but there must be general rules of thumb for doing this (even if less performant) in any database. I'm sure not everybody uses Elasticsearch or similar.
---
Optional reading:
My reasoning is that the index could be huge, but the index order matters. You'd always make sure that the fields that return the fewest results come first and the most generic fields come last in the index. However, what if the user chooses only generic fields? Should I include non-generic fields in the query anyway? How do I handle sorting in both directions? Or does index intersection save the day, and I should just add 20 different indexes?
A text index is your friend.
Read up on it here: https://docs.mongodb.com/v3.2/core/index-text/
In short, it's a way to tell mongodb that you want full text search over a specific field, multiple fields, or all fields (yay!)
To allow text indexing of all fields, use the special symbol $** and define it as type 'text':
db.collection.createIndex( { "$**": "text" } )
You can also configure it for case insensitivity, diacritic insensitivity, and more.
To perform text searches using the index, use the $text query helper, see: https://docs.mongodb.com/v3.2/reference/operator/query/text/#op._S_text
Update:
To allow the user to select specific fields to search on, it's possible to use weights when creating the text index: https://docs.mongodb.com/v3.2/core/index-text/#specify-weights
If you carefully select your fields' weights, for example using different prime numbers only, and then add the $meta text score to your results you may be able to figure out from the "textScore" which field was matched on this query, and so filter out the results that didn't get a hit from a selected search field.
Read more here: https://docs.mongodb.com/v3.2/tutorial/control-results-of-text-search/

MongoDB: comparing two big collections

I want to compare two very big collections. The main goal of the operation is to know which elements have changed or been deleted.
Collections 1 and 2 have the same structure and more than 3 million records each.
Example:
record 1 {id:'7865456465465',name:'tototo', info:'tototo'}
So I want to know which elements have changed and which elements are not present in collection 2.
What is the best solution to do this?
1) Define what equality of 2 documents means. For me it would be: both documents should contain all fields with exact same values given their ids are unique. Note that mongo does not guarantee field order, and if you update a field it might move to the end of the document which is fine.
2) I would use some framework that can connect to mongo and fetch data at the same time converting it to a map-like data structure or even JSON. For instance I would go with Scala + Lift record (db.coll.findAll()) + Lift JSON. Lift JSON library has Diff function that will give you a diff of 2 JSON docs.
3) Finally I would sort both collections by ids, open db cursors, iterate and compare.
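Step 3 can be sketched as a lockstep walk over two id-sorted streams. This is a minimal sketch using plain Python lists in place of database cursors; `diff_sorted` is a hypothetical helper, and it assumes both inputs are sorted by `id` using the same ordering that Python's `<` applies to the id values.

```python
def diff_sorted(cursor1, cursor2, key="id"):
    """Walk two id-sorted iterables in lockstep.

    Returns (changed_ids, missing_in_2_ids), both relative to cursor1.
    """
    changed, missing = [], []
    it2 = iter(cursor2)
    doc2 = next(it2, None)
    for doc1 in cursor1:
        # Skip documents that exist only in collection 2.
        while doc2 is not None and doc2[key] < doc1[key]:
            doc2 = next(it2, None)
        if doc2 is None or doc2[key] != doc1[key]:
            missing.append(doc1[key])
        else:
            # Dict equality ignores field order, matching step 1.
            if doc1 != doc2:
                changed.append(doc1[key])
            doc2 = next(it2, None)
    return changed, missing


coll1 = [
    {"id": "1", "name": "a", "info": "x"},
    {"id": "2", "name": "b", "info": "y"},
    {"id": "3", "name": "c", "info": "z"},
]
coll2 = [
    {"id": "1", "info": "x", "name": "a"},        # same content, other field order
    {"id": "3", "name": "c", "info": "changed"},  # modified
    {"id": "4", "name": "d", "info": "w"},        # only in collection 2
]

print(diff_sorted(coll1, coll2))  # (['3'], ['2'])
```

Because each stream is read once and only one document per side is held at a time, memory use stays flat regardless of collection size.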
If the schema is flat, as it is in your case, you can use a free tool (dataq.io) to compare the data in two tables.
Disclaimer : I am the founder of this product.

The fastest way to show Documents with certain property first in MongoDB

I have collections with a huge number of Documents on which I need to run custom searches with various queries.
Each Document has a boolean property. Let's call it "isInTop".
I need to show Documents which have this property set first in all queries.
Yes, I can easily sort on this field like:
.sort( { isInTop: -1 } );
And create a proper index with the field "isInTop" as the last field in it. But this will be slow, as indexes in MongoDB work best with unique fields.
So is there a solution to show Documents with the field "isInTop" at the top of each query?
I see two solutions here.
First: give the Documents which need to be on top an _id from the "future". As you know, ObjectId contains a timestamp. So I can create an ObjectId with a timestamp from the future and use natural order.
Second: create a separate collection for Documents which need to be on top, and run queries against it first.
Are there any other solutions for this problem? Which will be faster?
UPDATE
I solved this issue by sorting on a custom field which represents rank.
Using the _id field trick you mention has the problem that at some point in time you will reach the special time, and you can't change the _id field (without inserting a new document and removing the old one).
Creating a special collection which just holds the ones you care about is probably the best option. It gives you the ability to logically (and to some extent, physically) separate the documents.
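A rough sketch of the two-collection approach, with plain Python lists standing in for the collections. The `search` helper and the sample data are hypothetical, and the sketch assumes each document lives in exactly one of the two collections: the query runs against the "top" collection first, and the regular collection only fills whatever room is left.

```python
def search(top_docs, regular_docs, predicate, limit):
    """Return up to `limit` matches, taking matches from the top collection first."""
    results = [d for d in top_docs if predicate(d)][:limit]
    if len(results) < limit:
        remaining = limit - len(results)
        results += [d for d in regular_docs if predicate(d)][:remaining]
    return results


top = [{"_id": 1, "color": "red"}, {"_id": 2, "color": "blue"}]
regular = [
    {"_id": 3, "color": "red"},
    {"_id": 4, "color": "red"},
    {"_id": 5, "color": "blue"},
]

page = search(top, regular, lambda d: d["color"] == "red", limit=2)
print([d["_id"] for d in page])  # [1, 3]
```

When the top collection alone fills the page, the second query is skipped entirely.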
Newly introduced in mongodb there is also support for a "sparse" index which may fulfill your needs as well. You could only set the "isInTop" field when you want it to be special, and then create a sparse index on it which would not have the problems you would normally have with a single indexed boolean field (in btrees).