Sphinx: too many string attributes - sphinx

When indexing, Sphinx reports this error:
too many string attributes (current index format allows up to 4 GB).
How can I fix this problem?
The database has 80 million rows.

Use fewer attributes, or split your index into parts.

Related

Postgres full text search against arbitrary data - possible or not?

I was hoping to get some advice or guidance around a problem I'm having.
We currently store event data in Postgres 12 (AWS RDS) - this data could contain anything. To reduce the amount of data (a lot of keys, for example, are common across all events) we flatten this data and store it across 3 tables -
event_keys - the key names from events
event_values - the values from events
event_key_values - a lookup table, containing the event_id, and key_id and value_id.
We first insert the key and value (or return the existing id), and finally store the ids in the event_key_values table. So 2 simple events such as
[
  {
    "event_id": 1,
    "name": "updates",
    "meta": {
      "id": 1,
      "value": "some random value"
    }
  },
  {
    "event_id": 2,
    "meta": {
      "id": 2,
      "value": "some random value"
    }
  }
]
would become
event_keys
id  key
1   name
2   meta.id
3   meta.value

event_values
id  value
1   updates
2   1
3   some random value
4   2

event_key_values
event_id  key_id  value_id
1         1       1
1         2       2
1         3       3
2         2       4
2         3       3

All values are converted to text before storing, and a GIN index has been added to the event_keys and event_values tables.
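The flattening and get-or-create steps described above can be sketched in a few lines of Python. This is only an in-memory illustration under assumptions: the helper names (`flatten`, `get_or_create_id`, `store_events`) are hypothetical, plain dicts stand in for the three tables, and ids are handed out sequentially as in the example.

```python
from typing import Any, Dict, List, Tuple


def flatten(obj: Dict[str, Any], prefix: str = "") -> Dict[str, str]:
    """Flatten nested dicts into dotted keys; every value is converted to text."""
    flat: Dict[str, str] = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = str(value)
    return flat


def get_or_create_id(table: Dict[str, int], text: str) -> int:
    """Return the existing id for text, or assign the next sequential id."""
    if text not in table:
        table[text] = len(table) + 1
    return table[text]


def store_events(
    events: List[Dict[str, Any]],
) -> Tuple[Dict[str, int], Dict[str, int], List[Tuple[int, int, int]]]:
    """Build event_keys, event_values and event_key_values from raw events."""
    event_keys: Dict[str, int] = {}
    event_values: Dict[str, int] = {}
    event_key_values: List[Tuple[int, int, int]] = []
    for event in events:
        # event_id identifies the row; everything else is flattened.
        body = {k: v for k, v in event.items() if k != "event_id"}
        for key, value in flatten(body).items():
            key_id = get_or_create_id(event_keys, key)
            value_id = get_or_create_id(event_values, value)
            event_key_values.append((event["event_id"], key_id, value_id))
    return event_keys, event_values, event_key_values


events = [
    {"event_id": 1, "name": "updates",
     "meta": {"id": 1, "value": "some random value"}},
    {"event_id": 2, "meta": {"id": 2, "value": "some random value"}},
]
event_keys, event_values, event_key_values = store_events(events)
```

Run against the two sample events, this reproduces the three tables shown above, including the shared value id for "some random value".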
When attempting to search this data, we are able to retrieve results, however once we start hitting 1 million or more rows (we are expecting billions!) it can take anywhere from 10 seconds to minutes to find data. The key-values could have multiple search operations applied to them - equality, contains (case-sensitive and case-insensitive) and regex. To complicate things a bit more, the user can also search against all events, or a filtered selection (so only search against the last 10 days, events belonging to a certain application etc).
Some things I have noticed from testing
searching with multiple WHERE conditions on the same key, e.g. meta.id, the GIN index is used. However, a WHERE condition with multiple keys does not hit the index.
searching with multiple WHERE conditions on both the event_keys and event_values tables does not hit the GIN index.
using 'raw' SQL gives the same results - we use Jooq in this project and this was to rule out any issues caused by its SQL generation.
I have tried a few things
denormalising the data and storing everything in one table - however this resulted in the database (200 GB disk) becoming filled within a few hours, with the index taking up more space than the data.
storing the key-values as a JSONB value against an event_id, the JSON blob containing the flattened key-value pairs as a map - this had the same issues as above, with the index taking up 1.5 times the space of the data.
building a document from the available key-values using concatenation using both a sub-query and CTE - from testing with a few million rows this takes forever, even when attempting to tune some parameters such as work_mem!
From reading solutions and examples here, it seems full text search provides the most benefits and performance when applied against known columns e.g. a table with first_name, last_name and a GIN index against these two columns, but I hope I am wrong. I don't believe the JOINs across tables are the issue, or that event_values being stored in TOAST storage due to their size is the issue (I have tried with truncated test values, all of the same length, 128 chars, and the results still take 60+ seconds).
From running EXPLAIN ANALYSE it appears no matter how I tweak the queries or tables, most of the time is spent searching the tables sequentially.
Am I simply spending time trying to make Postgres and full text search suit a problem it may never work for (or at least never with acceptable performance)? Or should I look at other solutions? One possible advantage of the data is that it is 'immutable' and never updated once persisted, so syncing the data to something like Elasticsearch and running search queries against that first might be a solution.
I would really like to use Postgres for this as I've seen it is possible, and read several articles where fantastic performance has been achieved - but maybe this data just isn't suited?
Edit:
Due to the size of the values (some of these could be large JSON blobs, several hundred KB each), the GIN index on event_values is based on the MD5 hash - for equality checks the index is used, but, as expected, never for searching. For event_keys the GIN index is against the key_name column. Users can search against key names, values or both, for example "List all event keys beginning with 'meta.hashes'"
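Why a hash-based index helps equality checks but not contains searches can be shown with a toy Python sketch. It is not Postgres: the dict `hash_index` is a hypothetical stand-in for an index over the MD5 of each value, and the function names are made up for illustration.

```python
import hashlib
from typing import List


def md5_hex(text: str) -> str:
    """Hex MD5 digest of a text value."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


values = ["some random value", "updates"]

# Stand-in for an index keyed on the MD5 of each stored value.
hash_index = {md5_hex(v): v for v in values}


def equals_via_index(needle: str) -> bool:
    """Equality can use the index: hash the needle and look it up."""
    return md5_hex(needle) in hash_index


def contains_via_scan(needle: str) -> List[str]:
    """A substring shares nothing with the full value's hash,
    so a contains search has to read every value."""
    return [v for v in values if needle in v]
```

Looking up the full value "updates" finds its hash, while looking up the fragment "random" finds nothing even though one stored value contains it; only the scan returns that row.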

Should I add an index on very small MongoDB objects?

I have a collection with ~100k documents, each a very small object - 3 integer fields and 3 dateTime fields.
The queries will be on 3 fields - around 3/sec.
How much benefit do I get from adding indexes on those fields?
Is it something that needs careful consideration, or should I just add indexes for all heavily used queries?
Thanks

MongoDb java driver projection performance

I have probably run into a problem using MongoDB like this. I have 860000 documents in a collection, and 500 collections like it. Each document has 3 fields: the first and second are arrays containing 10 elements each, and the third is an Int64 that holds currentTimeMillis. When I query 1000 documents from one collection, it takes ~2500 ms. But when I execute the same query fetching only the first element of each array field (using the $slice operator), it takes ~2000 ms. This looks weird. MongoDB is on a remote host, so I transfer approximately 10 times less data over the network, yet it takes almost the same amount of time. Any thoughts?
The problem turns into this :) When I query 1000 documents using collection.find(whereQuery), it takes ~2400 ms. But when I query 13 documents using the same code, it takes ~1500 ms. The data is about 100 times smaller, but the time is not even halved. Am I missing something?

mongodb index strategy for range query with different fields

Almost all my documents include 2 fields: a start timestamp and a final timestamp. In each query, I need to retrieve the elements that fall within a selected period of time, so start should be after one selected timestamp and final should be before another.
The query looks like
db.collection.find({start:{$gt:DateTime(...)}, final:{$lt:DateTime(...)}})
So what is the best indexing strategy for that scenario?
By the way, which is better for performance: storing dates as datetimes or as Unix timestamps, which are just long values?
To add a little more to baloo's answer.
On the time-stamp vs. long issue: generally the MongoDB server will not see a difference. The BSON encoding length is the same (64 bits). You may see a performance difference on the client side depending on the driver's encoding. As an example, on the Java side, the 10gen driver renders a time-stamp as a Date, which is a lot heavier than a Long. There are drivers that try to avoid that overhead.
The other issue is that you will see a performance improvement if you close the range for the first field of the index. So if you use the index suggested by baloo:
db.collection.ensureIndex({start: 1, final: 1})
The query will perform (potentially much) better if it is:
db.collection.find({start:{$gt:DateTime(...),$lt:DateTime(...)},
final:{$lt:DateTime(...)}})
Conceptually, if you think of the index as a tree, the closed range limits both sides of the tree instead of just one side. Without the closed range the server has to "check" all of the entries with a start greater than the time stamp provided, since it does not know of the relation between start and final.
You may even find that the query performs no better with the compound index than with a single-field index like:
db.collection.ensureIndex({start: 1})
Most of the savings come from pruning on the first field. The exception is when the query is covered by the index, or when the ordering/sort for the results can be derived from the index.
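The effect of closing the range on the first index field can be illustrated with a small simulation. This is a sketch, not MongoDB itself: a sorted Python list of (start, final) tuples stands in for the compound index, the function name `scan` is hypothetical, and it assumes every document has start <= final, so adding an upper bound on start does not change the result set.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional, Tuple

Entry = Tuple[int, int]  # (start, final), sorted like a {start: 1, final: 1} index


def scan(index: List[Entry], lo_start: int, hi_final: int,
         hi_start: Optional[int] = None) -> Tuple[List[Entry], int]:
    """Return (matching entries, number of index entries examined)."""
    # Skip everything with start <= lo_start.
    first = bisect_right(index, (lo_start, float("inf")))
    # With a closed range we can also stop at the first start >= hi_start.
    if hi_start is None:
        last = len(index)
    else:
        last = bisect_left(index, (hi_start, float("-inf")))
    examined = index[first:last]
    matches = [entry for entry in examined if entry[1] < hi_final]
    return matches, len(examined)


# 100 synthetic documents: start = 0, 10, ..., 990 and final = start + 5.
index = sorted((s, s + 5) for s in range(0, 1000, 10))

open_matches, open_examined = scan(index, lo_start=100, hi_final=200)
closed_matches, closed_examined = scan(index, lo_start=100, hi_final=200,
                                       hi_start=200)
```

Both calls return the same 9 documents, but the open range examines 89 index entries while the closed range examines only 9.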
You can use a compound index to index multiple fields.
db.collection.ensureIndex({start: 1, final: 1})
Compare different queries and indexes using explain() to get the most out of your database.

Using Mongo: should we create an index tailored to each type of high-volume query?

We have two types of high-volume queries. One looks for docs involving 5 attributes: a date (lte), a value stored in an array, a value stored in a second array, one integer (gte), and one float (gte).
The second includes these five attributes plus two more.
Should we create two compound indices, one for each query? Assume each attribute has a high cardinality.
If we do, because each query involves multiple arrays, it doesn't seem like we can create an index because of Mongo's restriction. How do people structure their Mongo databases in this case?
We're using MongoMapper.
Thanks!
For indexes, after the first range in the query the value of the additional index fields drops significantly.
Conceptually, I find it best to think of the additional fields in the index as pruning ever smaller sub-trees from the query. The first range chops off a large branch, the second a smaller one, the third a smaller one still, etc. My general rule of thumb is that only the first range from the query is of value in the index.
The caveat to that rule is that additional fields in the index can be useful to aid sorting returned results.
For the first query I would create an index on the two array values and then whichever of the ranges will exclude the most documents. The date field is unlikely to provide high exclusion unless you can close the range (lte and gte). For the integer and float it is hard to tell without knowing the domain.
If the second query's two additional attributes also use ranges in the query and do not have a significantly higher exclusion value then I would just work with the one index.
Rob.
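One rough way to decide which range "will exclude the most documents" is to measure it on a sample of the data. The sketch below assumes a sample is available as a list of Python dicts; the field names (`date`, `count`, `score`), the thresholds, and the helper names are all hypothetical.

```python
from typing import Any, Callable, Dict, List

Doc = Dict[str, Any]
Predicate = Callable[[Doc], bool]


def exclusion_rate(docs: List[Doc], predicate: Predicate) -> float:
    """Fraction of sampled documents that the range condition rejects."""
    kept = sum(1 for doc in docs if predicate(doc))
    return 1.0 - kept / len(docs)


def most_selective(docs: List[Doc], predicates: Dict[str, Predicate]) -> str:
    """Name of the range condition that excludes the most documents."""
    return max(predicates,
               key=lambda name: exclusion_rate(docs, predicates[name]))


# Synthetic sample standing in for real documents.
sample = [{"date": i, "count": i % 10, "score": i / 100} for i in range(100)]

ranges: Dict[str, Predicate] = {
    "date": lambda d: d["date"] <= 90,    # open-ended lte, excludes little
    "count": lambda d: d["count"] >= 8,   # integer gte
    "score": lambda d: d["score"] >= 0.5, # float gte
}

best = most_selective(sample, ranges)
```

On this synthetic sample the integer range excludes 80% of documents, the float range 50% and the open-ended date range only 9%, so `count` would be the range field to put in the index.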