I am researching the possibility of using the secondary index feature in Cassandra via Aquiles. I know that for the primary index (the row key), I must use OrderPreservingPartitioner in order to query. At first I thought that with secondary indexes there is no such limitation, but I noticed that a start key is part of GetIndexedSlicesCommand. Does that imply that under RandomPartitioner this command is unusable?
You don't need OrderPreservingPartitioner to query by row key; it's only needed if you want to get a meaningful range of rows by their key, like 'all rows with a key between 5 and 9'. (Note that you can and should almost always use RandomPartitioner instead.)
The start key for get_indexed_slices behaves the same way that it does for get_range_slices. That is, it's not very meaningful for examining a range of rows between two keys when using RandomPartitioner, but it is useful for paging through a lot of rows. There's even a FAQ entry on the topic. Basically, if you're going to get a ton of results from a call to get_indexed_slices, you don't want to fetch them all at once, you want to get a chunk (of 10, 100, or 1000, depending on size) at a time, and then set the start_key to the last key you saw in the previous chunk to get the next chunk.
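To make that paging pattern concrete, here is a minimal sketch in Python; the fetch_indexed_slices callable is hypothetical and stands in for GetIndexedSlicesCommand in Aquiles (or get_indexed_slices in other clients), taking the index expressions, a start_key, and a count.

# Hypothetical sketch of paging through get_indexed_slices-style results.
# `fetch_indexed_slices(expressions, start_key, count)` is a placeholder for
# whatever your client exposes; it should return a list of (row_key, columns) pairs.
def page_indexed_slices(fetch_indexed_slices, expressions, page_size=1000):
    start_key = ""  # empty start key: begin at the first matching row
    while True:
        rows = fetch_indexed_slices(expressions, start_key=start_key, count=page_size)
        if not rows:
            break
        for row_key, columns in rows:
            yield row_key, columns
        if len(rows) < page_size:
            break  # that was the last chunk
        # Resume from the last key seen; if start_key is inclusive in your client,
        # skip the duplicated first row of the next chunk.
        start_key = rows[-1][0]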
I am designing a sharded database. Often two columns are used: the first identifies the logical shard and the second uniquely identifies a row within that shard. Instead of that, I am planning to have just one column with a long datatype as the primary key. To keep keys unique across all servers, I plan to have each server's bigserial generate a non-overlapping range:
server | PK starts from  | PK ends at
1      | 1               | 9,999,999,999
2      | 10,000,000,000  | 19,999,999,999
3      | 20,000,000,000  | 29,999,999,999
4      | 30,000,000,000  | 39,999,999,999
5      | 40,000,000,000  | 49,999,999,999
and so on.
In the future I should be able to:
Split a large server into two or more smaller servers
Join two or more small servers into one big server
Move some rows from server A to server B for better utilization of resources
I will also have a lookup table that contains information about each range and its target server.
I would like to learn about drawbacks of this approach.
I recommend that you create the primary key column on server 1 like this:
CREATE TABLE ... (
    id bigint GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1000),
    ...
);
On the second server you use START WITH 2, and so on.
Adding a new shard is easy: just use a new start value. Splitting or joining shards is trivial (as far as the primary keys are concerned...), because a new value can never conflict with old values, and each shard generates different values.
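As a side note (not part of the answer above), the interleaved sequences also make it trivial to tell which server generated a given id, since the scheme is effectively modulus-based; a minimal sketch, assuming server numbers 1 through 999 and a hypothetical originating_server helper:

# With START WITH <n> INCREMENT BY 1000, server n generates n, n + 1000, n + 2000, ...
# so id modulo 1000 recovers the originating server (for server numbers 1..999).
def originating_server(row_id: int) -> int:
    return row_id % 1000

assert originating_server(1) == 1      # server 1: 1, 1001, 2001, ...
assert originating_server(1002) == 2   # server 2: 2, 1002, 2002, ...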
The two most common types of sharding keys are basically:
Based on a deterministic expression, like the modulus-based method suggested in LaurenzAlbe's answer.
Based on a lookup table, like the method you describe in your post.
A drawback of the latter type is that your app has to check the lookup table frequently (perhaps even on every query), because the ranges could change. The ranges stored in the lookup table are a good candidate for caching, to avoid frequent SQL queries; replace the cached ranges whenever you change them, which I assume will be infrequent. With a cache, it's a pretty modest drawback.
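As a rough sketch of that caching idea (the shard_ranges table, its column names, and the ShardRouter helper are assumptions; conn is any DB-API connection, e.g. psycopg2):

import bisect

class ShardRouter:
    """Caches the lookup table's ranges in memory; call refresh() after changing them."""

    def __init__(self, conn):
        self.conn = conn
        self.refresh()

    def refresh(self):
        with self.conn.cursor() as cur:
            cur.execute("SELECT range_start, range_end, server_id "
                        "FROM shard_ranges ORDER BY range_start")
            self.ranges = cur.fetchall()
        self.starts = [r[0] for r in self.ranges]

    def server_for(self, pk):
        # Find the last range starting at or before pk, then check its upper bound.
        i = bisect.bisect_right(self.starts, pk) - 1
        if i < 0 or pk > self.ranges[i][1]:
            raise LookupError(f"no shard range covers key {pk}")
        return self.ranges[i][2]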
I worked on a system like this, where we had a schema for each of our customers, distributed over 8 shards. At the start of each session, the app queried the lookup table to find out which shard the respective customer's data was stored on. Once a year or so, we would move some customer schemas around to new shards, because naturally they tend to grow. This included updating the lookup table. It wasn't a bad solution.
I suspect that you will eventually have multiple non-contiguous ranges per server, because there are hotspots of data or traffic, and it makes sense to split the ranges if you want to move the least amount of data.
server | PK starts from  | PK ends at
1      | 1               | 9,999,999,999
2      | 10,000,000,000  | 19,999,999,999
3      | 20,000,000,000  | 29,999,999,999
4      | 30,000,000,000  | 32,999,999,999
3      | 33,000,000,000  | 39,999,999,999
5      | 40,000,000,000  | 49,999,999,999
If you anticipate moving subsets of data from time to time, this can be a better design than the expression-based type of sharding. If you use an expression, and need to move data between shards, you must either make a more complex expression, or else rebalance a lot of data.
In PostgreSQL I have created a table partitioned by the received column (see here). Let's use a toy example:
CREATE TABLE measurement (
    received  timestamp without time zone PRIMARY KEY,
    city_id   int not null,
    peaktemp  int,
    unitsales int
);
I have created one partition for each month for several years (measurement_y2012m01 ... measurement_y2016m03).
I have noticed that PostgreSQL is not aware of the order of the partitions, so for a query like the one below:
select * from measurement where ... order by received desc limit 1000;
PostgreSQL performs an index scan over all partitions, even though it is very likely that the first 1000 results are located in the latest partition (or the first two or three).
Do you have an idea how to take advantage of the partitions for such a query? I want to emphasize that the WHERE clause may vary, so I don't want to hardcode it.
The first idea is to iterate over the partitions in the proper order until 1000 records have been fetched or all partitions have been visited. But how can this be implemented in a flexible way? I want to avoid implementing the aforementioned iteration in the application, but I don't mind if the app needs to call a stored procedure.
Thanks in advance for your help!
Grzegorz
If you really don't know how many partitions you have to scan to get your desired 1000 rows, you could build up your result set in a stored procedure and fetch results by iterating over the partitions until your limit condition is satisfied.
Starting with the most recent partition would be a wise thing to do.
select * from measurement_y2016m03 where ... order by received desc limit 1000;
You could store the intermediate result set in a record, count its rows, and change the limit dynamically for the next scanned partition. For example, if you fetch 870 rows from the first partition, you build a second query with LIMIT 130, count again after that, and continue with the next partition if the total still doesn't satisfy your 1000-row condition.
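That loop could live in a PL/pgSQL function; purely to illustrate the control flow (including the dynamically shrinking limit), here is a client-side sketch in Python with a psycopg2-style driver, where fetch_latest and the partition-name list are placeholders:

def fetch_latest(conn, partitions, where_sql, params, limit=1000):
    """Scan partitions newest-first until `limit` rows are collected.

    `partitions`: partition table names, most recent first,
                  e.g. ["measurement_y2016m03", "measurement_y2016m02", ...]
    `where_sql`:  a WHERE fragment such as "city_id = %s"
    `params`:     list of parameters for that fragment
    """
    rows = []
    with conn.cursor() as cur:
        for part in partitions:            # trusted partition names, not user input
            remaining = limit - len(rows)
            if remaining <= 0:
                break
            cur.execute(
                f"SELECT * FROM {part} WHERE {where_sql} "
                f"ORDER BY received DESC LIMIT %s",
                params + [remaining],
            )
            rows.extend(cur.fetchall())
    return rows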
Why doesn't Postgres know when to stop during planning?
The planner is unaware of how many partitions are needed to satisfy your LIMIT clause. It therefore has to order the entire set, appending the results from each partition, and then apply the limit (unless the condition is already satisfied at run time). The only way to do this in a single SQL statement would be to restrict the lookup to just a few partitions, but that may not be possible in your case. Also, increasing the work_mem setting may speed things up if you're hitting disk during the lookups.
Key note
Also, keep in mind that when you set up your partitioning, the most frequently accessed partitions should be checked first. This speeds up your inserts, because Postgres checks the conditions one by one and stops at the first one that is satisfied.
Instead of iterating over the partitions, you could guess at the range of received that will satisfy your query and expand it until you get the desired number of rows. Adding the range to the WHERE clause will exclude the unnecessary partitions (assuming you have CHECK constraints on the partitions so that constraint exclusion can kick in).
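A minimal sketch of that expanding-window idea, again with a psycopg2-style connection; it assumes CHECK constraints on received and constraint exclusion, and the helper name and starting window size are arbitrary:

from datetime import datetime, timedelta

def fetch_latest_by_range(conn, where_sql, params, limit=1000):
    """Start with a narrow `received` window and double it until `limit` rows are found."""
    window = timedelta(days=31)
    with conn.cursor() as cur:
        while True:
            cutoff = datetime.now() - window
            cur.execute(
                f"SELECT * FROM measurement WHERE received >= %s AND {where_sql} "
                f"ORDER BY received DESC LIMIT %s",
                [cutoff] + params + [limit],
            )
            rows = cur.fetchall()
            # Enough rows, or the window already reaches far beyond the oldest partition.
            if len(rows) >= limit or window > timedelta(days=3650):
                return rows
            window *= 2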
Edit
Correct, that's what I meant (could've phrased it better).
Simplicity seems like a pretty reasonable advantage. I don't see the performance being different, either way. This might actually be a little more efficient if you guess reasonably close to the desired range most of the time, but probably won't make a significant difference.
It's also a little more flexible, since you're not relying on the particular partitioning scheme in your query code.
I have a large number of records to iterate over (coming from an external data source) and then insert into a MongoDB database.
I do not want to allow duplicates. How can this be done in a way that will not affect performance?
The number of records is around 2 million.
I can think of two fairly straightforward ways to do this in mongodb, although a lot depends upon your use case.
One, you can use the upsert:true option to update, using whatever you define as your unique key as the query for the update. If the document does not exist it will be inserted, otherwise it will be updated.
http://docs.mongodb.org/manual/reference/method/db.collection.update/
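A minimal pymongo sketch of the upsert approach; the database, collection, and field names, as well as the external_records() iterator, are assumptions:

from pymongo import MongoClient

def external_records():
    """Placeholder for the real external data source (~2 million dicts)."""
    yield {"external_id": 1, "value": "example"}

coll = MongoClient()["mydb"]["records"]

for rec in external_records():
    coll.update_one(
        {"external_id": rec["external_id"]},   # whatever you define as the unique key
        {"$set": rec},
        upsert=True,                           # insert if missing, update otherwise
    )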
Two, you could just create a unique index on that key and then insert, ignoring the error generated. Exactly how to do this will depend somewhat on the language and driver used, along with the version of MongoDB. This has the potential to be faster when performing batch inserts, but YMMV.
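And a minimal pymongo sketch of the unique-index approach; ordered=False lets insert_many continue past duplicate-key errors, and the names and sample batch are placeholders:

from pymongo import ASCENDING, MongoClient
from pymongo.errors import BulkWriteError

coll = MongoClient()["mydb"]["records"]
coll.create_index([("external_id", ASCENDING)], unique=True)

batch = [{"external_id": 1}, {"external_id": 1}, {"external_id": 2}]  # placeholder data

try:
    # ordered=False keeps inserting the rest of the batch after a duplicate is hit.
    coll.insert_many(batch, ordered=False)
except BulkWriteError:
    pass  # duplicates were skipped; everything else was inserted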
2 million is not a huge number and will not hurt performance; splitting your records' fields into different collections would be good enough.
I suggest creating a unique index on your unique key before inserting into MongoDB.
The unique index will filter out redundant data (you will lose some duplicate records), and you can ignore the resulting errors.
I'm trying to determine the best way to deal with a composite primary key in MongoDB. The main key for interacting with the data in this system is made up of 2 UUIDs. The combination of UUIDs is guaranteed to be unique, but neither of the individual UUIDs is.
I see a couple of ways of managing this:
Use an object for the primary key that is made up of 2 values (as suggested here)
Use a standard auto-generated mongo object id as the primary key, store my key in two separate fields, and then create a composite index on those two fields
Make the primary key a hash of the 2 uuids
Some other awesome solution that I currently am unaware of
What are the performance implications of these approaches?
For option 1, I'm worried about the insert performance due to having non sequential keys. I know this can kill traditional RDBMS systems and I've seen indications that this could be true in MongoDB as well.
For option 2, it seems a little odd to have a primary key that would never be used by the system. Also, it seems that query performance might not be as good as in option 1. In a traditional RDBMS a clustered index gives the best query results. How relevant is this in MongoDB?
For option 3, this would create one single id field, but again it wouldn't be sequential when inserting. Are there any other pros/cons to this approach?
For option 4, well... what is option 4?
Also, there's some discussion of possibly using CouchDB instead of MongoDB at some point in the future. Would using CouchDB suggest a different solution?
MORE INFO: some background about the problem can be found here
You should go with option 1.
The main reason is that you say you are worried about performance: using the _id index, which is always there and already unique, saves you from having to maintain a second unique index.
For option 1, I'm worried about the insert performance due to having non sequential keys. I know this can kill traditional RDBMS systems and I've seen indications that this could be true in MongoDB as well.
Your other options do not avoid this problem, they just shift it from the _id index to the secondary unique index - but now you have two indexes: one that's right-balanced and the other that's random access.
There is only one reason to question option 1 and that is if you plan to access the documents by just one or just the other UUID value. As long as you are always providing both values and (this part is very important) you always order them the same way in all your queries, then the _id index will be efficiently serving its full purpose.
As an elaboration on why you have to make sure you always order the two UUID values the same way: when comparing subdocuments, { a:1, b:2 } is not equal to { b:2, a:1 } - you could have a collection where two documents had those values for _id. So if you store _id with field a first, you must always keep that order in all of your documents and queries.
The other caution is that an index on _id:1 will be usable for this query:
db.collection.find({_id:{a:1,b:2}})
but it will not be usable for this query:
db.collection.find({"_id.a":1, "_id.b":2})
I have an option 4 for you:
Use the automatic _id field and add 2 single-field indexes for the two UUIDs, instead of a single compound index.
The _id index will be sequential (although that's less important in MongoDB), easily shardable, and you can let MongoDB manage it.
The 2 UUID indexes let you make any kind of query you need (with the first one, with the second, or with both in any order), and they take up less space than one compound index.
In case you use both indexes (and other ones as well) in the same query, MongoDB will intersect them (new in v2.6), as if you were using a compound index.
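A minimal pymongo sketch of this option; the collection and field names and the example values are placeholders:

import uuid

from pymongo import ASCENDING, MongoClient

coll = MongoClient()["mydb"]["things"]
coll.create_index([("uuid_a", ASCENDING)])
coll.create_index([("uuid_b", ASCENDING)])

u1, u2 = str(uuid.uuid4()), str(uuid.uuid4())   # placeholder values
coll.insert_one({"uuid_a": u1, "uuid_b": u2, "payload": "example"})

# Each of these can be served by one of the single-field indexes; the last one
# may additionally use index intersection (MongoDB 2.6+):
list(coll.find({"uuid_a": u1}))
list(coll.find({"uuid_b": u2}))
list(coll.find({"uuid_a": u1, "uuid_b": u2}))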
I'd go for option 2, and here is why:
Having two separate fields, instead of one built from both UUIDs as suggested in option 1, leaves you the flexibility to create other combinations of indexes to support future query requests, or to adapt if it turns out that the cardinality of one key is higher than the other.
Having non-sequential keys can help you avoid hotspots while inserting in a sharded environment, so it's not such a bad option. Sharding is, in my opinion, the best way to scale inserts and updates on collections, since write locking is at the database level (prior to 2.6) or the collection level (in 2.6).
I would've gone with option 2. You can still make an index that handles both the UUID fields, and performance should be the same as a compound primary key, except it'll be much easier to work with.
Also, in my experience, I've never regretted giving something a unique ID, even if it wasn't strictly required. Perhaps that's an unpopular opinion though.
I'm trying to retrieve all of the keys from a DynamoDB table in an optimized way. There are millions of keys.
In Cassandra I would probably create a single row with a column for every key, which would eliminate the need to do a full table scan. DynamoDB's 64KB limit per item would seemingly preclude this option, though.
Is there a quick way for me to get back all of the keys?
Thanks.
I believe the DynamoDB analogue would be to use composite keys: have a hash key of "allmykeys" and a range attribute holding each of the original keys being tracked: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/DataModel.html#DataModelPrimaryKey
I suspect this will scale poorly to billions of entries, but should work adequately for a few million.
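A rough boto3 sketch of that design, where the table name key_index and the attribute names pk and original_key are assumptions: every tracked key is written as an item under the single hash key "allmykeys", so a paged Query (rather than a full table Scan) returns them all.

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("key_index")   # assumed table name

def all_tracked_keys():
    """Yield every original key stored under the "allmykeys" hash key, page by page."""
    kwargs = {"KeyConditionExpression": Key("pk").eq("allmykeys")}
    while True:
        page = table.query(**kwargs)
        for item in page["Items"]:
            yield item["original_key"]            # the range attribute
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]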
Finally, again as with Cassandra, the most straightforward solution is to use map/reduce to get the keys: http://docs.amazonwebservices.com/amazondynamodb/latest/developerguide/EMRforDynamoDB.html