Redis scan issue for multiple replica configuration - scala

I am using AWS ElastiCache Redis (cluster mode off) with 1 primary and 2 replicas, and I am trying to fetch keys for a given key pattern using the Redis SCAN command (scan(cursor, count, matchGlob)), but it gives inconsistent results, i.e. it does not return the complete set of keys (retrieved key count < expected key count).
It works perfectly fine with 1 primary and 1 replica, but the issue appears as soon as I increase the replica count beyond 1.
I have an intuition about what might be going wrong but can't confirm it. SCAN starts with cursor value 0, returns up to n (the given count) matching keys along with the next cursor value, which must be passed to the next SCAN call, and so on until the cursor becomes 0 again, signalling the end of the iteration; along the way we collect all the keys. But when we ask a replica to scan keys, the first iteration may go to one replica and the second iteration to another, which could give us redundant or missing keys. This is what I suspect needs to be avoided to make it work (though I don't know if this is actually the cause).
A few more details:
Redis engine version - 6.2.6
Shards - 1
Number of nodes - 3 (1 primary, 2 replicas)
Cluster mode - disabled
Here is the Scala code for scanning the keys (I am using the etaty library v1.9.0 for Redis):
def scan(pattern: String): Seq[String] = {
  val CHUNKSIZE = 10000
  val buffer = ListBuffer[String]()
  var index = 0
  do {
    // Block on each SCAN call; cursor.index is the cursor for the next iteration
    val cursor = synchronized {
      Await.result(
        replicasClient.scan(index, Some(CHUNKSIZE), Some(pattern)),
        1.minute
      )
    }
    buffer.addAll(cursor.data)
    index = cursor.index
  } while (index > 0)
  buffer.toSeq
}
I looked at a few documents explaining how SCAN works, but all of them covered either the single-replica case or the cluster-mode-enabled case; none covered multiple replicas with cluster mode disabled.
Highlights:
During the scan iteration, the Redis key collection remains fixed; it doesn't change. The collection is updated throughout the day, except during a specific time window in which the scanning is performed.

As your keys may change between one SCAN iteration and the next, there is no guarantee you will get the expected number of keys, even within a single stand-alone Redis instance.
Quoting the official documentation:
The SCAN family of commands only offer limited guarantees about the
returned elements since the collection that we incrementally iterate
can change during the iteration process.
If you absolutely need a snapshot of the keys in a given Redis instance, you could use the KEYS command instead, but beware of the negative performance implications (see the official documentation for details; basically, Redis blocks until all the keys are enumerated, an O(N) operation in the number of keys).
As an alternative to the above, I would suggest reviewing your application logic so that it stores the number of keys you want to monitor elsewhere, for example in a dedicated key which you can INCR / DECR and GET when needed: this way, you could completely avoid scanning your keys in the first place.
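To illustrate the counter pattern suggested above, here is a minimal in-memory sketch in Python. The dict operations stand in for the Redis keyspace; with a real client (redis-py is an assumption here, not something the question uses) the writes would map to SET/DEL and the counter updates to INCR/DECR on a dedicated key.

```python
store = {}     # stands in for the Redis keyspace
counters = {}  # stands in for dedicated counter keys, one per key prefix

def add_key(prefix: str, key: str, value: str) -> None:
    full = f"{prefix}:{key}"
    if full not in store:
        counters[prefix] = counters.get(prefix, 0) + 1  # INCR on the counter key
    store[full] = value                                  # SET the data key

def remove_key(prefix: str, key: str) -> None:
    full = f"{prefix}:{key}"
    if full in store:
        del store[full]                                  # DEL the data key
        counters[prefix] -= 1                            # DECR on the counter key

def count_keys(prefix: str) -> int:
    return counters.get(prefix, 0)                       # GET the counter key

add_key("user", "1", "alice")
add_key("user", "2", "bob")
remove_key("user", "1")
print(count_keys("user"))  # 1, without any SCAN over the keyspace
```

The key point is that the count is maintained at write time, so reading it is O(1) and is unaffected by replication lag across multiple replicas.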
Update: given that your keys do not change during the SCAN iteration, the reason why you are getting a different number of keys from your replica(s) can be that Redis uses asynchronous replication by default, and your replica(s) may not yet contain the whole set of keys present on your primary at the moment of your SCAN.
To overcome this limitation, you could execute the WAIT command to make sure your primary has synchronized with (at least) n replicas.
As an alternative which I would opt for, you could just iterate over your keys on your primary Redis instance, without querying your replica(s).

Related

What are the drawbacks of per server long range as primary keys for sharding database?

I am designing a sharded database. We often use two columns: one for the logical shard and one to uniquely identify a row within the shard. Instead of that, I plan to have just one column of type long as the primary key. To keep keys unique across servers, I plan to use a bigserial that generates non-overlapping ranges.
server | PK starts from  | PK ends at
1      | 1               | 9,999,999,999
2      | 10,000,000,000  | 19,999,999,999
3      | 20,000,000,000  | 29,999,999,999
4      | 30,000,000,000  | 39,999,999,999
5      | 40,000,000,000  | 49,999,999,999
and so on.
In the future I should be able to:
Split a large server into two or more small servers
Join two or more small servers into one big server
Move some rows from server A to server B for better utilization of resources
I will also have a lookup table which will contain information on ranges and target servers.
I would like to learn about drawbacks of this approach.
I recommend that you create the primary key column on server 1 like this:
CREATE TABLE ... (
id bigint GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1000),
...
);
On the second server you use START WITH 2, and so on.
Adding a new shard is easy: just use a new start value. Splitting or joining shards is trivial (as far as the primary keys are concerned...), because new values can never conflict with old values, and each shard generates different values.
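A quick sketch of why the interleaved sequences above never collide: shard i generates i, i+1000, i+2000, ... (START WITH i INCREMENT BY 1000), so any two shards with distinct start values in 1..1000 produce disjoint id sets.

```python
from itertools import islice

def shard_ids(start: int, step: int = 1000):
    """Yield the ids a shard would generate with START WITH start INCREMENT BY step."""
    n = start
    while True:
        yield n
        n += step

shard1 = set(islice(shard_ids(1), 10_000))
shard2 = set(islice(shard_ids(2), 10_000))
print(shard1 & shard2)  # set() - no collisions, and a new shard just picks a new start
```

Equivalently, id mod 1000 identifies the generating shard, which is what makes splitting and joining painless from the key's point of view.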
The two most common types of sharding keys are basically:
Based on a deterministic expression, like the modulus-based method suggested in #LaurenzAlbe's answer.
Based on a lookup table, like the method you describe in your post.
A drawback of the latter type is that your app has to check the lookup table frequently (perhaps even on every query), because the ranges could change. The ranges stored in the lookup table might be a good thing to put in cache, to avoid frequent SQL queries. Then replace the cached ranges when you change them. I assume this will be infrequent. With a cache, it's a pretty modest drawback.
I worked on a system like this, where we had a schema for each of our customers, distributed over 8 shards. At the start of each session, the app queried the lookup table to find out which shard the respective customer's data was stored on. Once a year or so, we would move some customer schemas around to new shards, because naturally they tend to grow. This included updating the lookup table. It wasn't a bad solution.
I expect that you will eventually have multiple non-contiguous ranges per server, because there are hotspots of data or traffic, and it makes sense to split the ranges if you want to move the least amount of data.
server | PK starts from  | PK ends at
1      | 1               | 9,999,999,999
2      | 10,000,000,000  | 19,999,999,999
3      | 20,000,000,000  | 29,999,999,999
4      | 30,000,000,000  | 32,999,999,999
3      | 33,000,000,000  | 39,999,999,999
5      | 40,000,000,000  | 49,999,999,999
If you anticipate moving subsets of data from time to time, this can be a better design than the expression-based type of sharding. If you use an expression, and need to move data between shards, you must either make a more complex expression, or else rebalance a lot of data.
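The lookup-table routing can be sketched as follows, mirroring the non-contiguous ranges in the table above (server 3 owning a second range after part of server 4's range was moved). In practice this list would be loaded from the lookup table and cached, then refreshed whenever ranges are rebalanced.

```python
# (lo, hi, server): which server owns each inclusive PK range
RANGES = [
    (1,              9_999_999_999,  1),
    (10_000_000_000, 19_999_999_999, 2),
    (20_000_000_000, 29_999_999_999, 3),
    (30_000_000_000, 32_999_999_999, 4),
    (33_000_000_000, 39_999_999_999, 3),
    (40_000_000_000, 49_999_999_999, 5),
]

def server_for(pk: int) -> int:
    """Route a primary key to the shard that owns it."""
    for lo, hi, server in RANGES:
        if lo <= pk <= hi:
            return server
    raise KeyError(f"no shard owns pk {pk}")

print(server_for(35_000_000_000))  # 3 - routed to the moved range
```

A sorted list with binary search (or an interval tree) would be the natural optimization once the number of ranges grows, but a linear scan over a cached list is fine for a handful of shards.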

What happens when Index creation in MongoDB which is running in background fails

There are existing collections in MongoDB that need to be programmatically updated with new indexes.
So there is an admin web API in my ASP.NET application that, when invoked, calls the create-index API in MongoDB. To avoid impact from the index-building process, it is performed in the background.
It is not known whether the existing data is valid under the index definition, because MongoDB imposes a 1024-byte index key size limit, and the values of the indexed fields in some existing documents may add up to more than 1024 bytes.
So the question is: what happens when the index build fails for this reason?
Also, how can I programmatically (C# driver) find the status of the index build operation at a later point in time?
According to the MongoDB Documentation
MongoDB will not create an index on a collection if the index entry for an existing document exceeds the index key limit. Previous versions of MongoDB would create the index but not index such documents.
So this means that, background or foreground, an index key that is too long will cause the creation to fail. However, no matter how you create the index, the session issuing the create-index command will block. This means that if the index build fails, you should be notified by an exception thrown while await-ing the task returned by the Indexes.CreateManyAsync() method.
Since you are unsure whether the data will be affected by the maximum key length, I strongly suggest you test this in a pre-production environment before attempting it in production. Since production is (I assume) active, the pre-production environment won't match the data exactly (writes still happening), but it will reduce the possibility of finding a failed index build in production.
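As part of that pre-production test, you could sweep the collection and flag documents whose indexed fields approach the limit. A hedged sketch (field names are hypothetical, and the UTF-8 byte count is only an approximation of the actual BSON index key encoding):

```python
INDEXED_FIELDS = ["title", "description"]  # hypothetical compound-index fields
LIMIT = 1024                               # MongoDB's index key size limit in bytes

def oversized(doc: dict) -> bool:
    """Approximate the combined indexed-field size and compare to the limit."""
    total = sum(len(str(doc.get(f, "")).encode("utf-8")) for f in INDEXED_FIELDS)
    return total > LIMIT

docs = [
    {"title": "ok", "description": "short"},
    {"title": "x" * 800, "description": "y" * 400},  # 1200 bytes combined
]
print([oversized(d) for d in docs])  # [False, True]
```

Running such a sweep (via a cursor over the real collection) before issuing the create-index command tells you up front whether the build is doomed to fail.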
Additionally, even if the index can be built, future writes that exceed that key length will be rejected. This can be avoided by setting the failIndexKeyTooLong server parameter to false. However, this has its own set of caveats. Specifically:
Setting failIndexKeyTooLong to false is a temporary workaround, not a permanent solution to the problem of oversized index keys. With failIndexKeyTooLong set to false, queries can return incomplete results if they use indexes that skip over documents whose indexed fields exceed the Index Key Length Limit.
I strongly suggest you read and understand those docs before implementing that particular parameter.
In general, it is considered by many to be bad practice to build an index at run-time. If the collection is already empty, this is not a big deal, however on a collection with a large amount of data, this can cause the create command to block for quite some time. This is especially true on a busy mongod when creating the index in the background.
If you are building this index on a Replica Set or Sharded Cluster, I strongly recommend you take a look at the documentation specific to those use cases before implementing the build in code.

Aggregation-framework: optimization

I have a document structure like this
{
id,
companyid,
fieldA1,
valueA1,
fieldA2,
valueA2,
.....
fieldB15,
valueB15,
fieldF150
valueF150
}
my job is to multiply fieldA1*valueA1, fieldA2*valueA2, ... and sum them up into new fields: A_sum = sum(A fields * A values), B_sum = sum(B fields * B values), C_sum, etc.
then in the next step I have to generate final_sum = (A_sum*A_val + B_sum*B_val + ...)
I have modeled this with the aggregation framework, using 3 projections for the three steps of calculation. At this point it takes about 100 s for 750,000 docs; I have an index only on _id, which is a GUID. CPU is at 15%.
I tried grouping in order to force parallel ops and load the CPU more, but it seems to take longer.
What else can I do to make it faster, i.e. to use more CPU and more parallelism?
I don't need $match as I have to process all docs.
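For clarity, the per-document computation described above can be sketched in plain Python (the field layout and the A_val/B_val weights are assumptions based on the question, not the actual pipeline):

```python
doc = {
    "fieldA1": 2, "valueA1": 3,
    "fieldA2": 4, "valueA2": 5,
    "fieldB1": 1, "valueB1": 7,
}
weights = {"A": 10, "B": 1}  # hypothetical A_val / B_val

def group_sums(doc: dict) -> dict:
    """Compute A_sum, B_sum, ... = sum(field * matching value) per letter group."""
    sums = {}
    for k, v in doc.items():
        if k.startswith("field"):
            suffix = k[len("field"):]   # e.g. "A1"
            group = suffix[0]           # e.g. "A"
            sums[group] = sums.get(group, 0) + v * doc["value" + suffix]
    return sums

sums = group_sums(doc)          # {"A": 2*3 + 4*5, "B": 1*7} == {"A": 26, "B": 7}
final_sum = sum(sums[g] * weights[g] for g in sums)
print(final_sum)  # 26*10 + 7*1 = 267
```

This is exactly the kind of embarrassingly parallel per-document work that the answers below exploit: each document's sums are independent, so the collection can be processed in parallel chunks.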
You might get it done using sharding, as the scanning of the documents would be done in parallel.
Simply measure the time your aggregation needs now, and calculate the number of shards you need using
((t/100)+1)*s
where t is the time the aggregation took in seconds and s is the number of existing shards (1 if you have a standalone or replica set), rounded up, of course. The 1 is added to make sure the overhead of running an aggregation in a sharded environment is offset by the additional shard.
My only solution was to split the collection into smaller collections (same total space after all) and run the computation per smaller collection (via a C# console app) using the parallel task library, which let me raise CPU usage to 70%.
That reduced the time from approx. 395 s at 15% CPU (script via Robomongo, all docs) to 25-28 s at 65-70% CPU (C# console app with parallelism).
Using grouping did not help in my case.
Sharding is not an option right now.

MongoDB very slow deletes

I've got a small replica set of three mongod servers (16GB RAM each, at least 4 CPU cores and real HDDs) and one dedicated arbiter. The replicated data has about 100,000,000 records currently. Nearly all of this data is in one collection with an index on _id (the auto-generated Mongo ID) and date, which is a native Mongo date field. Periodically I delete old records from this collection using the date index, something like this (from the mongo shell):
db.repo.remove({"date" : {"$lt" : new Date(1362096000000)}})
This does work, but it runs very, very slowly. One of my nodes has slower I/O than the other two, having just a single SATA drive. When this node is primary, the deletes run at about 5-10 documents/sec. By using rs.stepDown() I have demoted this slower primary and forced an election to get a primary with better I/O. On that server, I am getting about 100 docs/sec.
My main question is, should I be concerned? I don't have the numbers from before I introduced replication, but I know the delete was much faster. I'm wondering if the replica set sync is causing I/O wait, or if there is some other cause. I would be totally happy with temporarily disabling sync and index updates until the delete statement finishes, but I don't know of any way to do that currently. For some reason, when I disable two of the three nodes, leaving just one node and the arbiter, the remaining node is demoted and writes are impossible (isn't the arbiter supposed to solve that?).
To give you some indication of the general performance, if I drop and recreate the date index, it takes about 15 minutes to scan all 100M docs.
This is happening because even though
db.repo.remove({"date" : {"$lt" : new Date(1362096000000)}})
looks like a single command it's actually operating on many documents - as many as satisfy this query.
When you use replication, every change operation has to be written to a special collection in the local database called oplog.rs - oplog for short.
The oplog has to have an entry for each deleted document and every one of those entries needs to be applied to the oplog on each secondary before it can also delete the same record.
One thing I can suggest that you consider is TTL indexes - they will "automatically" delete documents based on expiration date/value you set - this way you won't have one massive delete and instead will be able to spread the load more over time.
Another suggestion that may not fit your case, but was the optimal solution for me:
Drop the indexes from the collection
Iterate over all entries of the collection and store the ids of the records to delete in an in-memory array
Each time the array grows big enough (for me it was 10K records), remove those records by id
Rebuild the indexes
It is the fastest way, but it requires stopping the system, which was acceptable in my case.
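The batching logic in the steps above can be sketched as follows. The delete callback is a stand-in for the driver call (e.g. a remove with {_id: {$in: batch}}); the chunking itself is what keeps each oplog-replicated operation bounded.

```python
BATCH_SIZE = 10_000

def delete_in_batches(ids_to_delete, delete_batch, batch_size=BATCH_SIZE):
    """Accumulate ids and fire delete_batch(batch) every batch_size ids."""
    batch = []
    deleted = 0
    for _id in ids_to_delete:
        batch.append(_id)
        if len(batch) >= batch_size:
            delete_batch(batch)
            deleted += len(batch)
            batch = []
    if batch:                      # flush the final partial batch
        delete_batch(batch)
        deleted += len(batch)
    return deleted

calls = []
total = delete_in_batches(range(25_000), calls.append, batch_size=10_000)
print(total, [len(b) for b in calls])  # 25000 [10000, 10000, 5000]
```

Between batches you can also sleep briefly or check replication lag, which gives the secondaries time to catch up instead of one massive delete saturating the oplog.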

secondary indexes in Cassandra under RandomPartitioner

I am researching the possibility of using the secondary index feature in Cassandra using Aquiles. I know that for the primary index (key), I must use OrderPreservingPartitioner in order to query by range. At first I thought that secondary indexes had no such limitation, but I noticed that a start key is part of GetIndexedSlicesCommand. Does that imply that under RandomPartitioner this command is unusable?
You don't need OrderPreservingPartitioner to query by row key; it's only needed if you want to get a meaningful range of rows by their key, like 'all rows with a key between 5 and 9'. (Note that you can and should almost always use RandomPartitioner instead.)
The start key for get_indexed_slices behaves the same way that it does for get_range_slices. That is, it's not very meaningful for examining a range of rows between two keys when using RandomPartitioner, but it is useful for paging through a lot of rows. There's even a FAQ entry on the topic. Basically, if you're going to get a ton of results from a call to get_indexed_slices, you don't want to fetch them all at once, you want to get a chunk (of 10, 100, or 1000, depending on size) at a time, and then set the start_key to the last key you saw in the previous chunk to get the next chunk.
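The paging pattern described above can be sketched like this, with the fetch function standing in for a get_indexed_slices/get_range_slices call. One caveat worth noting: the real Thrift API treats the start key as inclusive, so each subsequent chunk begins with the last row you already saw and you should drop that first result; for simplicity this mock treats the start key as exclusive.

```python
ROWS = {k: f"row-{k}" for k in range(1, 101)}   # mock token-ordered rows

def fetch_chunk(start_key, count):
    """Stand-in for get_indexed_slices: return up to count keys after start_key."""
    keys = sorted(ROWS)
    if start_key is not None:                    # resume after the last seen key
        keys = [k for k in keys if k > start_key]
    return keys[:count]

seen = []
start_key = None
while True:
    chunk = fetch_chunk(start_key, 10)
    if not chunk:
        break
    seen.extend(chunk)
    start_key = chunk[-1]                        # last key of the previous chunk

print(len(seen))  # 100 - all rows paged through in chunks of 10
```

The chunk size (10, 100, or 1000) is a throughput/memory trade-off, exactly as the answer describes.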