I was hoping to get some advice or guidance around a problem I'm having.
We currently store event data in Postgres 12 (AWS RDS) - this data could contain anything. To reduce the amount of data stored (a lot of keys, for example, are common across all events) we flatten this data and store it across 3 tables -
event_keys - the key names from events
event_values - the values from events
event_key_values - a lookup table, containing the event_id, and key_id and value_id.
We first insert each key and value (or return the existing id if it is already present), and finally store the ids in the event_key_values table. So 2 simple events such as
[
  {
    "event_id": 1,
    "name": "updates",
    "meta": {
      "id": 1,
      "value": "some random value"
    }
  },
  {
    "event_id": 2,
    "meta": {
      "id": 2,
      "value": "some random value"
    }
  }
]
would become
event_keys
id key
1 name
2 meta.id
3 meta.value
event_values
id value
1 updates
2 1
3 some random value
4 2
event_key_values
event_id key_id value_id
1 1 1
1 2 2
1 3 3
2 2 4
2 3 3
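As a sanity check, the flattening and id-assignment described above can be sketched in plain Python. This is only an illustration of the scheme, not the actual implementation - the dicts stand in for the three tables, and all names are made up:

```python
def flatten(obj, prefix=""):
    """Flatten a nested dict into dotted key/value pairs, values stored as text."""
    pairs = {}
    for k, v in obj.items():
        key = f"{prefix}.{k}" if prefix else k
        if isinstance(v, dict):
            pairs.update(flatten(v, key))
        else:
            pairs[key] = str(v)
    return pairs

def intern(table, value):
    """Return the existing id for value, inserting it if new (ids start at 1)."""
    if value not in table:
        table[value] = len(table) + 1
    return table[value]

# Stand-ins for the event_keys, event_values and event_key_values tables.
event_keys, event_values, event_key_values = {}, {}, []

events = [
    {"event_id": 1, "name": "updates", "meta": {"id": 1, "value": "some random value"}},
    {"event_id": 2, "meta": {"id": 2, "value": "some random value"}},
]
for event in events:
    event_id = event.pop("event_id")
    for key, value in flatten(event).items():
        event_key_values.append(
            (event_id, intern(event_keys, key), intern(event_values, value))
        )
```

Running this reproduces the three example tables above: meta.id and meta.value are shared between events, and only the new value "2" gets a fresh value id for event 2.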
All values are converted to text before storing, and a GIN index has been added to the event_keys and event_values tables.
When attempting to search this data we are able to retrieve results, however once we start hitting 1 million or more rows (we are expecting billions!) a search can take anywhere from 10 seconds to minutes. The key-values can have multiple search operations applied to them: equality, contains (case-sensitive and case-insensitive) and regex. To complicate things a bit more, the user can also search against all events, or a filtered selection (only the last 10 days, events belonging to a certain application, etc.).
Some things I have noticed from testing:
Searching with multiple WHERE conditions on the same key, e.g. meta.id, uses the GIN index. However, a WHERE condition spanning multiple keys does not hit the index.
Searching with WHERE conditions on both the event_keys and event_values tables does not hit the GIN index either.
Using 'raw' SQL makes no difference - we use jOOQ in this project, and this was to rule out any issues caused by its SQL generation.
I have tried a few things:
Denormalising the data and storing everything in one table - however, this resulted in the database (200 GB disk) filling up within a few hours, with the index taking up more space than the data.
Storing the key-values as a JSONB value against an event_id, with the JSON blob containing the flattened key-value pairs as a map - this had the same issue as above, with the index taking up 1.5 times the space of the data.
Building a document from the available key-values by concatenation, using both a sub-query and a CTE - from testing with a few million rows this takes forever, even when attempting to tune parameters such as work_mem!
From reading solutions and examples here, it seems full text search provides the most benefit and performance when applied against known columns, e.g. a table with first_name and last_name and a GIN index against those two columns - but I hope I am wrong. I don't believe the JOINs across tables are an issue, or that event_values needing TOAST storage due to their size is an issue (I have tried with truncated test values, all of the same length, 128 chars, and the results still take 60+ seconds).
From running EXPLAIN ANALYSE it appears that no matter how I tweak the queries or tables, most of the time is spent scanning the tables sequentially.
Am I simply spending time trying to make Postgres and full text search fit a problem they may never work for (or at least never deliver acceptable performance for)? Or should I look at other solutions? One possible advantage of the data is that it is 'immutable' and never updated once persisted, so syncing it to something like Elasticsearch and running search queries against that first might be a solution.
I would really like to use Postgres for this as I've seen it is possible, and have read several articles where fantastic performance has been achieved - but maybe this data just isn't suited to it?
Edit:
Due to the size of the values (some of these could be large JSON blobs, several hundred KBs), the GIN index on event_values is based on the MD5 hash of the value - for equality checks the index is used, but, as expected, never for searching within values. For event_keys the GIN index is against the key_name column. Users can search against key names, values or both, for example "List all event keys beginning with 'meta.hashes'".
Related
I have a document structure which looks something like this:
{
...
"groupedFieldKey": "groupedFieldVal",
"otherFieldKey": "otherFieldVal",
"filterFieldKey": "filterFieldVal"
...
}
I am trying to fetch all documents which are unique with respect to groupedFieldKey. I also want to fetch otherFieldKey from ANY of these documents. otherFieldKey has minor changes from one document to another, but I am comfortable with getting ANY of these values.
SELECT DISTINCT groupedFieldKey, otherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal";
This query fetches all the documents because of the minor variations.
SELECT groupedFieldKey, maxOtherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal"
GROUP BY groupedFieldKey
LETTING maxOtherFieldKey = MAX(otherFieldKey);
This query works as expected but takes a long time due to the GROUP BY step. As this query is used to show products in the UI, this is not desirable behaviour. I have tried applying indexes, but they have not given fast results.
Actual details of the records:
Number of records = 100,000
Size per record = Approx 10 KB
Time taken to load the first 10 records: 3s
Is there a better way to do this? A way of getting DISTINCT only on particular fields will be good.
EDIT 1:
You can follow this discussion thread in Couchbase forum: https://forums.couchbase.com/t/getting-distinct-on-the-basis-of-a-field-with-other-fields/26458
GROUP BY must materialize all the documents. You can try a covering index:
CREATE INDEX ix1 ON bucket(filterFieldKey, groupedFieldKey, otherFieldKey);
I have a table with a column that is an array of hstore values. In JSON notation, the data for this field looks like:
[
{ type: "Pickup", id: "49593034" },
{ type: "User", id: "5903" },
...
]
The number of hstores in the array for a single record might be as high as 10,000, although it may be less. This table only receives inserts; no data is removed or updated once a record has been added.
There are currently about 90K records in this table and it will generally grow by a few hundred each day. The values in later records will likely have different IDs from the values in earlier records.
I need to be able to search on this column. For example, I might want to know all records where this column includes 'type=>User,id=>5903'::hstore. I found that if I create a GIN index on this column and query with the && operator, it can search this data very quickly.
I had a problem where the bloat on this index had gotten very high. PG stopped using the index and switched to a table scan, causing queries to be very slow. I fixed this by running REINDEX on the index.
I'm finding the bloat comes back after a while. While it hasn't switched to table scans again (even with high bloat), I worry it will, so I have been preemptively reindexing when I see it get high again. Whenever I do this I have to take a function off-line, as reindexing blocks usage of that index according to the documentation.
I am using Heroku's PG service and they provide a tool to query for bloat. The query it executes can be found here.
After doing the REINDEX the index size is about 1GB. When it bloats it grows to about 11GB according to Heroku's query.
My questions are:
What exactly is causing the bloat? I understand that with tables, a DELETE or UPDATE will cause bloat. Is the addition of data forcing it to rebalance a tree or something similar, leading to dead pages in the index?
Is the data from Heroku's query accurate? I've read that some bloat queries are written for btrees and not GIN indexes, so perhaps it's pointing to a non-existent problem. The only time it actually stopped using the index was when I first did a mass insert to populate the table. After that it has bloated but kept using the index. Maybe I no longer have an issue and am just preemptively reindexing for no reason.
Is there something I should change in my schema to make this maintenance free?
I was thinking of refactoring this so that the data in this column is stored in a separate table. This separate table would store the type, the id and the ID of my main table as separate columns, and I would create a btree index on the type and id fields. Would this give me a generally maintenance-free lookup? I'm thinking it would also be faster, since the id could be stored as a true number. Currently it is stored as a string, since hstore values are always strings.
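The proposed refactor can be sketched in plain Python as a hypothetical model (the dict stands in for the side table's btree index on (type, id); all names and data are illustrative):

```python
# Records keyed by main-table id; each has a list of type/id tags,
# mirroring the current array-of-hstores column (made-up sample data).
records = {
    1: [{"type": "Pickup", "id": "49593034"}, {"type": "User", "id": "5903"}],
    2: [{"type": "User", "id": "5903"}],
}

# Stand-in for the side table's btree index on (type, id):
# each (type, id) pair maps to the set of main-table ids containing it.
# Note the id becomes a true number, as the refactor suggests.
index = {}
for main_id, tags in records.items():
    for tag in tags:
        index.setdefault((tag["type"], int(tag["id"])), set()).add(main_id)

# "All records where this column includes type=>User,id=>5903"
# becomes a single exact-match lookup instead of a GIN containment scan.
matches = index.get(("User", 5903), set())
```

Since the side table is insert-only and the lookups are pure equality matches, a btree should indeed stay essentially maintenance-free compared to the GIN index.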
I've created a Redis key/value index this way:
set 7:12:321 '{"some":"JSON"}'
The key is delimited by a colon separator, each part of the key represents a hierarchic index.
get 7:12:321 means that I know the exact hierarchy and want only one single item
scan 7:12:* means that I want every item under id 7 in the first level of hierarchy and id 12 in the second layer of hierarchy.
Problem is: if I want the JSON values, I have to first scan (~50,000 entries in a few ms) and then get every key returned by the scan one by one (800ms).
This is not very efficient, and this is the only answer I found on Stack Overflow when searching for "scanning Redis values".
1/ Is there another way of scanning Redis to get values or key/value pairs, and not only keys? I tried hscan as follows:
hset myindex 7:12:321 '{"some":"JSON"}'
hscan myindex 0 MATCH 7:12:*
But it destroys the performance (almost 4s for the 50,000 entries).
2/ Is there another data structure in Redis I could use in the same way, but which could "scan for values" (hset?)
3/ Should I go with another data storage solution (PostgreSQL ltree for instance ?) to suit my use case with huge performance ?
I must be missing something really obvious, because this sounds like a common use case.
Thanks for your answers.
Optimization for your current solution
Instead of getting every key returned by scan one by one, you should use mget to batch-get key-value pairs, or use a pipeline to reduce RTT.
Efficiency problem of your current solution
The scan command iterates over all keys in the database, even if the number of keys that match the pattern is small. Performance decreases as the number of keys increases.
Another solution
Since the hierarchic index is an integer, you can encode the hierarchic indexes into a number, and use that number as the score of a sorted set. In this way, instead of searching by pattern, you can search by score range, which is very fast with a sorted set. Take the following as an example.
Say the first (right-most) hierarchic index is less than 1000 and the second index is less than 100. Then you can encode the index (e.g. 7:12:321) into a score (321 + 12 * 1000 + 7 * 100 * 1000 = 712321). Then add the score and the value to a sorted set: zadd myindex 712321 '{"some":"JSON"}'.
When you want to search for keys that match 7:12:*, just use the zrangebyscore command to get data with a score between 712000 and 712999: zrangebyscore myindex 712000 712999 withscores.
In this way, you can get the key (decoded from the returned score) and the value together. It should also be faster than the scan solution.
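The encode/decode arithmetic above can be sketched in plain Python; the radixes 1000 and 100 are the bounds assumed in the example, and the function names are made up:

```python
# Radixes assumed from the example: right-most index < 1000, middle index < 100.
LEVEL3_RADIX = 1000
LEVEL2_RADIX = 100

def encode(a, b, c):
    """Encode a hierarchical key a:b:c into a single sorted-set score."""
    assert b < LEVEL2_RADIX and c < LEVEL3_RADIX
    return a * LEVEL2_RADIX * LEVEL3_RADIX + b * LEVEL3_RADIX + c

def decode(score):
    """Recover (a, b, c) from a returned score."""
    score, c = divmod(score, LEVEL3_RADIX)
    a, b = divmod(score, LEVEL2_RADIX)
    return a, b, c

def prefix_range(a, b):
    """Score range covering every key matching a:b:* (for zrangebyscore)."""
    lo = encode(a, b, 0)
    return lo, lo + LEVEL3_RADIX - 1
```

For the example key 7:12:321 this yields the score 712321, and prefix_range(7, 12) yields the (712000, 712999) bounds used in the zrangebyscore call.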
UPDATE
The solution has a small problem: members of a sorted set must be unique, so you cannot have 2 keys with the same value (i.e. the same JSON string).
// insert OK
zadd myindex 712321 '{"the_same":"JSON"}'
// failed to insert, members should be unique
zadd myindex 712322 '{"the_same":"JSON"}'
In order to solve this problem, you can combine the key with the json string to make it unique:
zadd myindex 712321 '7:12:321-{"the_same":"JSON"}'
zadd myindex 712322 '7:12:322-{"the_same":"JSON"}'
You could consider using a Sorted Set and lexicographical ranges, as long as you only need to perform prefix searches. For more information about this, and about indexing in general, refer to http://redis.io/topics/indexes
Updated with an example:
Consider the following -
$ redis-cli
127.0.0.1:6379> ZADD anotherindex 0 '7:12:321:{"some":"JSON"}'
(integer) 1
127.0.0.1:6379> ZRANGEBYLEX anotherindex [7:12: [7:12:\xff
1) "7:12:321:{\"some\":\"JSON\"}"
Now go and read about this so you 1) understand what it does and 2) know how to avoid possible pitfalls :)
I could not reach any conclusive answers reading some of the existing posts on this topic.
I have certain data from 100 locations for the past 10 years. The table has about 800 million rows. I primarily need to generate yearly statistics for each location. Sometimes I need to generate monthly-variation and hourly-variation statistics as well. I'm wondering if I should create two indexes - one on location and another on year - or one index on both location and year. My primary key is currently a serial number (though I could probably use location and timestamp as the primary key).
Thanks.
Regardless of how many indices you have created on a relation, usually only one of them will be used by a given query (which one depends on the query, statistics, etc.). So in your case you wouldn't get a cumulative advantage from creating two single-column indices. To get the most performance from an index I would suggest using a composite index on (location, timestamp).
Note that queries like ... WHERE timestamp BETWEEN smth AND smth will not use the index above, while queries like ... WHERE location = 'smth' or ... WHERE location = 'smth' AND timestamp BETWEEN smth AND smth will. That's because the first attribute in the index is crucial for searching and sorting.
Don't forget to perform
ANALYZE;
after index creation in order to collect statistics.
Update:
As @MondKin mentioned in the comments, certain queries can actually use several indexes on the same relation - for example, a query with OR clauses like a = 123 OR b = 456 (assuming there are indexes on both columns). In this case Postgres performs bitmap index scans for both indexes, builds a union of the resulting bitmaps and uses it for a bitmap heap scan. Under certain conditions the same scheme may be used for AND queries, but with an intersection instead of a union.
There is no rule of thumb for situations like these; I suggest you experiment on a copy of your production DB to see what works best for you: a single multi-column index or 2 single-column indexes.
One nice feature of Postgres is you can have multiple indexes and use them in the same query. Check this chapter of the docs:
... PostgreSQL has the ability to combine multiple indexes ... to handle cases that cannot be implemented by single index scans ....
... Sometimes multicolumn indexes are best, but sometimes it's better to create separate indexes and rely on the index-combination feature ...
You can even experiment creating both the individual and combined indexes, and checking how big each one is and determine if it's worth having them at the same time.
Some things that you can also experiment with:
If your table is too large, consider partitioning it. It looks like you could partition either by location or by date. Partitioning splits your table's data in smaller tables, reducing the amount of places where a query needs to look.
If your data is laid out according to a date (like transaction date) check BRIN indexes.
If multiple queries will be processing your data in a similar fashion (like aggregating all transactions over the same period), check materialized views so you only need to do those costly aggregations once.
About the order in which to put your multi-column index, put first the column on which you will have an equality operation, and later the column in which you have a range, >= or <= operation.
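The equality-then-range rule can be sketched with a sorted list standing in for a btree on (location, timestamp) - this is only an illustration of why the column order matters, with made-up data:

```python
import bisect

# A sorted list of (location, timestamp) tuples stands in for a btree index
# on (location, timestamp); timestamps here are illustrative integers.
index = sorted([
    ("berlin", 3), ("berlin", 7), ("berlin", 9),
    ("oslo", 1), ("oslo", 5),
])

def range_scan(location, t_lo, t_hi):
    """location = X AND timestamp BETWEEN t_lo AND t_hi reads one
    contiguous slice of the index - the fast case."""
    lo = bisect.bisect_left(index, (location, t_lo))
    hi = bisect.bisect_right(index, (location, t_hi))
    return index[lo:hi]
```

With the equality column first, all matching entries are adjacent and one slice suffices; with (timestamp, location) the matches for one location would be scattered across the whole range of timestamps.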
An index on (location, timestamp) should work better than 2 separate indexes for your case. Note that the order of the columns is important.
I am new to MongoDB and planning to use map-reduce to compute a large amount of data.
As you know, we have a map function to match the criteria and then emit the required data for a given field. In my map function I have multiple emits; as of now I have 50 fields emitted from a single document. That means a single document in the collection explodes into 40 documents in the temp table, so if I have 1 million documents to be processed there will be 1 million * 40 documents in the temp table by the end of the map function.
The next step is to sort on this collection. (I haven't used sort param of map will it help?)
I thought of splitting the map function into two, but then there's one more problem: while performing the map function, if I run into an exception I want to skip the entire document's data (i.e. not emit anything from that document), and if I split the function I won't be able to.
On mongodb.org I found a comment which said: "When I run the MR job with sort, it takes 1.5 days to reach 23% at the first stage of MR. When I run the MR job without sort, it takes about 24-36 hours for the whole job. Also, turning off jsMode sped up my MR twice (before I turned off sorting)."
Will enabling sort help? Or will turning OFF jsMode help? I am using Mongo 2.0.5.
Any suggestion?
Thanks in advance. G
The next step is to sort on this collection. (I haven't used sort param of map will it help?)
I don't know what you mean - MRs don't have sort params, only the incoming query has a sort param. The sort param of the incoming query only sorts the data going in. Unless you are looking for some specific behaviour that avoids sorting the final output by using an incoming sort, you don't normally need to sort.
How are you looking to use this MR? Obviously it won't be in real time, or you would just kill your servers, so I'm going to guess it is a background process that runs and formats data the way you want. I would suggest looking into incremental MRs, so that you do delta updates throughout the day to limit the amount of resources used at any given time.
So if I have 1 million documents to be processed it will 1million * 40 documents in temp table by end of map function.
Are you emitting multiple times under the same key? If not, then the temp table should have only one entry per key, with documents of the format:
{
  _id: emitted_id,
  value: [ /* each of the docs that you emit */ ]
}
This is shown: http://kylebanker.com/blog/2009/12/mongodb-map-reduce-basics/
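The grouping behaviour described above can be sketched in plain Python - a toy map/group/reduce with made-up documents and a sum as an arbitrary reduce, just to show that the intermediate collection has one entry per distinct key rather than one per emit:

```python
from collections import defaultdict

def map_doc(doc):
    """Emit one (key, value) pair per field, like the multiple emits
    described in the question."""
    for field, value in doc.items():
        yield field, value

docs = [{"a": 1, "b": 2}, {"a": 3, "b": 4}]

# Group phase: all emitted values are collected under each key, so
# 4 emits collapse into 2 intermediate entries (one per distinct key).
grouped = defaultdict(list)
for doc in docs:
    for key, value in map_doc(doc):
        grouped[key].append(value)

# Reduce phase: one reduce call per key over its collected values.
reduced = {key: sum(values) for key, values in grouped.items()}
```

So the temp-table row count is bounded by the number of distinct keys, not by documents-times-emits.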
or Will turning OFF jsmode help? i am using mongo 2.0.5
Turning off jsMode is unlikely to do anything significant, and results from it have varied.