Get Redis values while scanning

I've created a Redis key/value index this way:
set 7:12:321 '{"some":"JSON"}'
The key uses a colon as separator; each part of the key represents a level in a hierarchical index.
get 7:12:321 means that I know the exact hierarchy and want only a single item.
scan 7:12:* means that I want every item under id 7 at the first level of the hierarchy and id 12 at the second level.
The problem is: if I want the JSON values, I have to scan first (~50,000 entries in a few ms) and then get every key returned by the scan one by one (800 ms).
This is not very efficient, and it is the only approach I found on Stack Overflow when searching for "scanning Redis values".
1/ Is there another way of scanning Redis to get values or key/value pairs, and not only keys? I tried hscan as follows:
hset myindex 7:12:321 '{"some":"JSON"}'
hscan myindex MATCH 7:12:*
 
But it destroys performance (almost 4 s for the 50,000 entries).
2/ Is there another data structure in Redis I could use in the same way, but which could "scan for values" (a hash, maybe)?
3/ Should I go with another data storage solution (PostgreSQL ltree, for instance) that suits my use case with much better performance?
I must be missing something really obvious, because this sounds like a common use case.
Thanks for your answers.

Optimization for your current solution
Instead of getting every key returned by scan one by one, you should use mget to fetch the values in batches, or use pipelining to reduce RTT.
Efficiency problem of your current solution
The scan command iterates over all keys in the database, even if the number of keys matching the pattern is small. Performance therefore degrades as the total number of keys grows.
Another solution
Since the hierarchical index is made of integers, you can encode the levels into a single number and use that number as the score of a sorted set. Then, instead of searching by pattern, you can search by score range, which is very fast with a sorted set. Take the following as an example.
Say the first (right-most) level of the hierarchy is always less than 1000 and the second level is less than 100. Then you can encode an index such as 7:12:321 into a score (321 + 12 * 1000 + 7 * 100 * 1000 = 712321), and add the score and the value to a sorted set: zadd myindex 712321 '{"some":"JSON"}'.
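The encoding can be sketched like this; a minimal example assuming the bounds above (the last level < 1000, the middle level < 100), with illustrative function names that are not part of Redis:

```python
# Sketch of the score encoding described above, assuming the last level is
# always < 1000 and the middle level < 100. Names are illustrative.

def encode_score(a, b, c):
    """Encode hierarchy a:b:c into a single sortable integer score."""
    assert b < 100 and c < 1000, "levels must fit their reserved ranges"
    return c + b * 1000 + a * 100 * 1000

def decode_score(score):
    """Recover the a:b:c hierarchy from a score."""
    return score // (100 * 1000), (score // 1000) % 100, score % 1000

print(encode_score(7, 12, 321))                          # 712321, as above
print(decode_score(712321))                              # (7, 12, 321)
print(encode_score(7, 12, 0), encode_score(7, 12, 999))  # bounds for 7:12:*
```

The last line yields 712000 and 712999, exactly the score range queried below.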
When you want to search for keys that match 7:12:*, just use the zrangebyscore command to get entries with a score between 712000 and 712999: zrangebyscore myindex 712000 712999 withscores.
This way you get the key (decoded from the returned score) and the value together. It should also be faster than the scan solution.
UPDATE
This solution has one small problem: members of a sorted set must be unique, so you cannot store 2 keys with the same value (i.e. the same JSON string).
// inserts OK
zadd myindex 712321 '{"the_same":"JSON"}'
// does not add a second entry -- it only updates the existing member's score
zadd myindex 712322 '{"the_same":"JSON"}'
To solve this, you can combine the key with the JSON string to make the member unique:
zadd myindex 712321 '7:12:321-{"the_same":"JSON"}'
zadd myindex 712322 '7:12:322-{"the_same":"JSON"}'
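A hedged sketch of this uniqueness trick: the key is prepended to the JSON payload with a '-' separator so identical JSON strings still yield distinct members. These helpers are illustrative, not Redis commands:

```python
# Illustrative helpers for the combined "key-json" member format above.

def make_member(key, json_str):
    return f"{key}-{json_str}"

def split_member(member):
    # Split on the first '-' only, so any '-' inside the JSON stays intact.
    key, _, json_str = member.partition('-')
    return key, json_str

member = make_member('7:12:322', '{"the_same":"JSON"}')
print(member)                # 7:12:322-{"the_same":"JSON"}
print(split_member(member))  # ('7:12:322', '{"the_same":"JSON"}')
```

Note that the separator must be a character that cannot appear in the key part itself, which holds here since the key is only digits and colons.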

You could consider using a sorted set and lexicographical ranges, as long as you only need to perform prefix searches. For more information about this and indexing in general, refer to http://redis.io/topics/indexes
Updated with an example:
Consider the following -
$ redis-cli
127.0.0.1:6379> ZADD anotherindex 0 '7:12:321:{"some":"JSON"}'
(integer) 1
127.0.0.1:6379> ZRANGEBYLEX anotherindex [7:12: [7:12:\xff
1) "7:12:321:{\"some\":\"JSON\"}"
Now go and read about this so you 1) understand what it does and 2) know how to avoid possible pitfalls :)
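To see what the ZRANGEBYLEX bounds above actually do, here is a sketch that simulates the lexicographical prefix query on a plain sorted list (valid for ASCII-range members, as here): every member starting with the prefix sorts between the prefix itself and prefix + '\xff'.

```python
import bisect

# Simulating the ZRANGEBYLEX prefix query on a sorted list of members.
members = sorted([
    '7:12:321:{"some":"JSON"}',
    '7:12:400:{"other":"JSON"}',
    '7:13:1:{"not":"matched"}',
])

def prefix_range(members, prefix):
    lo = bisect.bisect_left(members, prefix)           # [prefix  bound
    hi = bisect.bisect_right(members, prefix + '\xff') # [prefix\xff bound
    return members[lo:hi]

print(prefix_range(members, '7:12:'))  # only the two 7:12:* members
```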

Multi Column Indexes with Order By and OR clause

I have the below query to fetch a list of tickets.
EXPLAIN select * from ticket_type
where ticket_type.event_id='89898'
and ticket_type.active=true
and (ticket_type.is_unlimited = true OR ticket_type.number_of_sold_tickets < ticket_type.number_of_tickets)
order by ticket_type.ticket_type_order
I have created the indexes below, but they are not being used.
Index on (ticket_type_order,event_id,is_unlimited,active)
Index on (ticket_type_order,event_id,active,number_of_sold_tickets,number_of_tickets).
The perfect index for this query would be
CREATE INDEX ON ticket_type (event_id, ticket_type_order)
WHERE active AND (is_unlimited OR number_of_sold_tickets < number_of_tickets);
Of course, a partial index like that might only be useful for this specific query.
If the WHERE conditions from the index definition are not very selective, or if a somewhat slower execution is also acceptable, you can omit parts of the WHERE clause, or the whole clause. That makes the index more widely useful.
What is the size of the table and of the usual query result? The server is usually smart enough to skip an index if it expects the query to return more than half of the table.
An index also makes no sense if the result is rather small. If the server is left with, say, 1000 records after several filtering steps, it stops using indexes: it is cheaper to finish the query using the CPU than to load another index from disk. As a result, indexes are never applied to small tables.
ORDER BY is applied at the very end of query processing, so the first field in the index should be one of the fields from the WHERE filter.
Boolean fields are seldom useful in an index: they have only two possible values. Indexes should be created on fields with many distinct values.
Avoid OR filtering. That is easy in your case: put a very big number into number_of_tickets if the tickets are unlimited.
A better index in your case would be just event_id. If the database server supports functional indexes, you can also try adding number_of_tickets - number_of_sold_tickets and rewriting the condition as WHERE number_of_tickets - number_of_sold_tickets > 0.
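The OR-elimination suggested above can be sanity-checked with a small sketch; the sentinel value standing in for "unlimited" is an assumption:

```python
# Encode "unlimited" as a huge number_of_tickets so the OR collapses into a
# single comparison that an expression index can serve. The sentinel value
# is an assumption for illustration.

UNLIMITED = 10**9

def sellable_with_or(is_unlimited, sold, total):
    return is_unlimited or sold < total

def sellable_rewritten(sold, total):
    # total - sold > 0 is the indexable expression form
    return total - sold > 0

cases = [(False, 5, 10), (False, 10, 10), (True, 999, 0)]
for is_unlimited, sold, total in cases:
    encoded_total = UNLIMITED if is_unlimited else total
    assert sellable_with_or(is_unlimited, sold, total) == sellable_rewritten(sold, encoded_total)
print("both predicates agree on all cases")
```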
UPDATE: PostgreSQL calls this an "index on expression":
https://www.postgresql.org/docs/current/indexes-expressional.html

Postgres full text search against arbitrary data - possible or not?

I was hoping to get some advice or guidance around a problem I'm having.
We currently store event data in Postgres 12 (AWS RDS); this data could contain anything. To reduce the amount of data (a lot of keys, for example, are common across all events), we flatten it and store it across 3 tables -
event_keys - the key names from events
event_values - the values from events
event_key_values - a lookup table containing the event_id, key_id and value_id.
We first insert the key and the value (or return the existing ids), and finally store the ids in the event_key_values table. So 2 simple events such as
[
  {
    "event_id": 1,
    "name": "updates",
    "meta": {
      "id": 1,
      "value": "some random value"
    }
  },
  {
    "event_id": 2,
    "meta": {
      "id": 2,
      "value": "some random value"
    }
  }
]
would become
event_keys
id key
1 name
2 meta.id
3 meta.value
event_values
id value
1 updates
2 1
3 some random value
4 2
event_key_values
event_id key_id value_id
1 1 1
1 2 2
1 3 3
2 2 4
2 3 3
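The flattening step described above can be sketched as follows; flatten() is a hypothetical helper, not the poster's actual code:

```python
# Nested event JSON becomes dotted key paths mapped to text values,
# matching the event_keys / event_values tables above.

def flatten(obj, prefix=""):
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = str(value)  # all values are stored as text
    return flat

event = {"name": "updates", "meta": {"id": 1, "value": "some random value"}}
print(flatten(event))
# {'name': 'updates', 'meta.id': '1', 'meta.value': 'some random value'}
```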
All values are converted to text before storing, and a GIN index has been added to the event_key and event_values tables.
When attempting to search this data, we are able to retrieve results; however, once we hit 1 million or more rows (we are expecting billions!) this can take anywhere from 10 seconds to minutes to find data. The key-values can have multiple search operations applied to them - equality, contains (case-sensitive and case-insensitive) and regex. To complicate things a bit more, the user can also search against all events, or a filtered selection (only the last 10 days, events belonging to a certain application, etc.).
Some things I have noticed from testing
searching with multiple WHERE conditions on the same key, e.g. meta.id, uses the GIN index; however, a WHERE condition spanning multiple keys does not hit the index.
searching with multiple WHERE conditions across both the event_keys and event_values tables does not hit the GIN index either.
the same happens with 'raw' SQL - we use Jooq in this project, and this ruled out any issues caused by its SQL generation.
I have tried a few things
denormalising the data and storing everything in one table - however, this resulted in the database (200 GB disk) filling up within a few hours, with the index taking up more space than the data.
storing the key-values as a JSONB value against an event_id, the JSON blob containing the flattened key-value pairs as a map - this had the same issue, with the index taking up 1.5 times as much space as the data.
building a document from the available key-values by concatenation, using both a sub-query and a CTE - from testing with a few million rows, this takes forever, even when tuning parameters such as work_mem!
From reading solutions and examples here, it seems full text search provides the most benefit and performance when applied to known columns, e.g. a table with first_name and last_name and a GIN index on those two columns - but I hope I am wrong. I don't believe the JOINs across tables are the issue, or that event_values being pushed to TOAST storage because of their size is the issue (I have tried with truncated test values, all of the same length, 128 chars, and the results still take 60+ seconds).
From running EXPLAIN ANALYSE it appears no matter how I tweak the queries or tables, most of the time is spent searching the tables sequentially.
Am I simply spending time trying to make Postgres and full text search suit a problem it may never work (or at least have acceptable performance) for? Or should I look at other solutions e.g. One possible advantage of the data is it is 'immutable' and never updated once persisted, so something syncing the data to something like Elasticsearch and running search queries against it first might be a solution.
I would really like to use Postgres for this as I've seen it is possible, and read several articles where fantastic performance has been achieved - but maybe this data just isn't suited?
Edit:
Due to the size of the values (some could be large JSON blobs of several hundred KB), the GIN index on event_values is based on an MD5 hash of the value - for equality checks the index is used, but it is never used for searching, as expected. For event_keys the GIN index is on the key_name column. Users can search against key names, values or both, for example "list all event keys beginning with 'meta.hashes'".
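The MD5-based equality index described in the edit can be sketched like this (the helper names and in-memory dict are illustrative stand-ins, not the actual schema):

```python
import hashlib

# Large values are indexed by their MD5 digest, so equality compares short
# hashes instead of multi-hundred-KB blobs. A substring or regex search
# cannot use such an index, which matches the behaviour described above.

def md5_key(value: str) -> str:
    return hashlib.md5(value.encode("utf-8")).hexdigest()

index = {}  # digest -> list of event ids, a stand-in for the real index

def add(event_id, value):
    index.setdefault(md5_key(value), []).append(event_id)

def find_equal(value):
    return index.get(md5_key(value), [])

add(1, "some random value")
print(find_equal("some random value"))  # [1]
print(find_equal("some random"))        # [] -- a prefix can't use the hash
```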

Possible to retrieve multiple random, non-sequential documents from MongoDB?

I'd like to retrieve a random set of documents from a MongoDB database. So far after lots of Googling, I've only seen ways to retrieve one random document OR a set of documents starting at a random skip position but where the documents are still sequential.
I've tried mongoose-simple-random, and unfortunately it doesn't retrieve a "true" random set. What it does is skip to a random position and then retrieve n documents from that position.
Instead, I'd like to retrieve a random set like MySQL does, using one query (or a minimal number of queries), and I need this list to be random every time. I need this to be efficient - roughly on par with such a query in MySQL. I want to reproduce the following, but in MongoDB:
SELECT * FROM products ORDER BY rand() LIMIT 50;
Is this possible? I'm using Mongoose, but an example with any adapter -- or even a straight MongoDB query -- is cool.
I've seen one method: add a field to each document, generate a random value for it, and use {rand: {$gte: rand()}} in each query we want randomized. But my concern is that two queries could theoretically return the same set.
You can do it in two requests, but in an efficient way:
Your first request just gets the list of all "_id" values of the documents in your collection. Be sure to use a Mongo projection: db.products.find({}, { '_id' : 1 }).
From that list of "_id" values, just pick N at random.
Then do a second query using the $in operator.
What is especially important is that the first query is fully covered by an index (because it is on "_id"). This index is likely to be fully in memory (otherwise you would probably have performance problems already), so only the index is read while running the first query, and it's incredibly fast.
Although the second query means reading the actual documents, the index will help a lot.
If you can do things this way, you should try it.
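The two-step approach above can be sketched like this, simulated on a plain dict standing in for the collection; with a real driver, step 1 would be the projection find({}, {'_id': 1}) and step 3 a find({'_id': {'$in': picked}}):

```python
import random

# In-memory stand-in for the products collection.
collection = {i: {"_id": i, "name": f"product-{i}"} for i in range(1000)}

def random_docs(n):
    ids = list(collection.keys())           # step 1: fetch only the _ids
    picked = random.sample(ids, n)          # step 2: pick N distinct ids
    return [collection[i] for i in picked]  # step 3: $in-style lookup

docs = random_docs(50)
print(len(docs), len({d["_id"] for d in docs}))  # 50 distinct documents
```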
I don't think MySQL's ORDER BY rand() is particularly efficient: as I understand it, it essentially assigns a random number to each row, sorts the table on this random-number column, and returns the top N results.
If you're willing to accept some overhead on your inserts to the collection, you can reduce the problem to generating N random integers in a range. Add a counter field to each document: each document will be assigned a unique positive integer, sequentially. It doesn't matter which document gets which number, as long as the assignment is unique and the numbers are sequential, and you either don't delete documents or you complicate the counter scheme to handle holes.
You can do this by making your inserts two-step. In a separate counter collection, keep a document with the first number that hasn't been used for the counter. When an insert occurs, first findAndModify the counter document to retrieve the next counter value and increment it atomically. Then insert the new document with that counter value.
To find N random values, find the max counter value, generate N distinct random numbers in the range defined by the max counter, then use $in to retrieve the documents. Most languages have random libraries that will handle generating the N random integers in a range.
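The counter scheme can be sketched as follows; all names here are illustrative, and in MongoDB the counter would live in its own collection updated via findAndModify:

```python
import random

# A separate counter hands out sequential integers at insert time, so
# sampling reduces to drawing N distinct integers below the counter.

next_counter = 0
docs_by_counter = {}

def insert(doc):
    global next_counter
    doc["counter"] = next_counter   # atomic findAndModify in the real scheme
    docs_by_counter[next_counter] = doc
    next_counter += 1

def random_sample(n):
    picks = random.sample(range(next_counter), n)  # N distinct integers
    return [docs_by_counter[p] for p in picks]     # $in on the counter field

for i in range(100):
    insert({"name": f"doc-{i}"})
sample = random_sample(10)
print(len(sample), len({d["counter"] for d in sample}))  # 10 distinct docs
```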

compound Index or single index in mongodb

I have a query like this that gets called 90% of the time:
db.xyz.find({ "ws.wz.eId": 665, "ws.ce1.id": 665 })
and another one like this that gets called 10% of the time:
db.xyz.find({ "ws.wz.eId": 111, "ws.ce2.id": 111 })
You can see that, within each query, the two ids have the same value.
Now I'm wondering if I should just create a single index on "ws.wz.eId", or create two compound indexes: one on {"ws.wz.eId", "ws.ce1.id"} and another on {"ws.wz.eId", "ws.ce2.id"}.
It seems to me that the single index is the best choice; however, I might be wrong, so I would like to know if there is value in creating the compound indexes, or any other type.
As muratgu already pointed out, the best way to reason about performance is to stop reasoning and start measuring instead.
However, since measurements can be quite tricky, here's some theory:
You might want to consider one compound index {"ws.wz.eId", "ws.ce1.id"} because that can be used for the 90% case and, for the ten percent case, is equivalent to just having an index on ws.wz.eId.
When you do this, the first query can be answered from the index alone; the second query will first find all candidates with a matching ws.wz.eId (fast, the index is present) and then scan and match those candidates to filter out the documents that don't match the ws.ce2.id criterion. Whether that is expensive or not depends on the number of documents with the same ws.wz.eId that must be scanned, so it depends very much on your data.
An important factor is the selectivity of the key. For example, if there are a million documents with the same ws.wz.eId and only one of them has the ws.ce2.id you're looking for, you might need the second index, or want to reverse the query.
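The prefix behaviour described above can be sketched with sorted tuples standing in for the compound index; the data here is made up for illustration:

```python
import bisect

# A compound index on (ws.wz.eId, ws.ce1.id) is a sorted list of tuples, and
# eId is the leading field, so a lookup on eId alone is a range scan over a
# contiguous slice -- this is why the compound index also covers the
# eId-only case.

index = sorted([(665, 1), (665, 9), (111, 3), (700, 2)])

def find_by_eid(eid):
    lo = bisect.bisect_left(index, (eid,))
    hi = bisect.bisect_left(index, (eid + 1,))
    return index[lo:hi]

print(find_by_eid(665))  # [(665, 1), (665, 9)] -- served by the index prefix
```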

Using Mongo: should we create an index tailored to each type of high-volume query?

We have two types of high-volume queries. One looks for docs involving 5 attributes: a date (lte), a value stored in an array, a value stored in a second array, one integer (gte), and one float (gte).
The second includes these five attributes plus two more.
Should we create two compound indices, one for each query? Assume each attribute has a high cardinality.
If we do, because each query involves multiple arrays, it doesn't seem like we can create such an index, because of Mongo's restriction that a compound index can include at most one array field. How do people structure their Mongo databases in this case?
We're using MongoMapper.
Thanks!
In indexes built for queries with ranges, the value of the additional index fields drops significantly after the first range in the query.
Conceptually, I find it best to think of the additional fields in the index as pruning ever smaller sub-trees from the query. The first range chops off a large branch, the second a smaller one, the third smaller still, etc. My general rule of thumb is that only the first range from the query adds value to the index.
The caveat to that rule is that additional fields in the index can be useful to help sort the returned results.
For the first query I would create an index on the two array values and then whichever of the ranges will exclude the most documents. The date field is unlikely to provide high exclusion unless you can close the range (lte and gte). Whether the integer or the float excludes more is hard to tell without knowing the domain.
If the second query's two additional attributes are also ranges in the query and do not have a significantly higher exclusion value, then I would just work with the one index.
Rob.