Citus data - How to query data from a single shard in a query? - postgresql

We are evaluating Citus for large-scale data use cases in our organization. While analyzing it, I am trying to see whether the following can be achieved with Citus:
We want to create a distributed table (customers) with customer_id being the shard/distribution key (customer_id is a UUID generated at the application end)
While we can use regular SQL queries for all CRUD operations on these entities, we also need to query the table periodically (a periodic task) to select multiple entries based on some filter criteria, fetch the result set into the application, update a few columns, and write the rows back (a read-and-update operation).
Our application is a horizontally scalable microservice with multiple instances of the service running in parallel
So we want to split the periodic task into multiple sub-tasks that run on multiple instances of the service and execute in parallel.
So I am looking for a way for a sub-task to query results from a specific shard, so that each sub-task is responsible for fetching and updating the data on one shard only. This would let us run the periodic task in parallel without worrying about conflicts, as each sub-task operates on one shard.
I am not able to find anything in the documentation about how to achieve this. Is this possible with Citus?

Citus (by default) distributes data across the shards using the hash value of the distribution column, which is customer_id in your case.
To achieve this, you might need to store a (customer_id -> shard_id) mapping in your application, assign sub-tasks to shards, and send queries from the sub-tasks using this mapping.
One hacky solution you might consider: add a dummy column (I will name it shard_id) and make it the distribution column, so that your application knows which rows should be fetched/updated by which sub-task. In other words, each sub-task fetches/updates the rows with a particular value of the shard_id column, and all of those rows are located on the same shard, because they share the same distribution column value. This way you can control which customer_ids end up on the same shard, and which ones form a separate shard, by assigning them the shard_id you want.
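A minimal sketch of the application side of this dummy-column approach (the shard count and the modulus assignment are assumptions for illustration, not anything Citus provides):

```python
import uuid

# Assumed number of distinct shard_id values; pick it to match your
# sub-task/worker count.
NUM_SHARDS = 8

def assign_shard_id(customer_id: uuid.UUID, num_shards: int = NUM_SHARDS) -> int:
    """Deterministically map a customer_id to a dummy shard_id column value.

    Rows sharing a shard_id land on the same Citus shard, because shard_id
    is the distribution column.
    """
    return customer_id.int % num_shards

# Each periodic sub-task is then responsible for exactly one shard_id, e.g.:
#   SELECT ... FROM customers WHERE shard_id = %s AND <filter criteria>
cid = uuid.uuid4()
assert 0 <= assign_shard_id(cid) < NUM_SHARDS
# The mapping is stable, so the read and the later write-back hit the same shard.
assert assign_shard_id(cid) == assign_shard_id(cid)
```

The application writes the computed shard_id on insert and filters on it in the periodic sub-tasks.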
Also, I would suggest taking a look at "tenant isolation", which is mentioned in the latest blog post: https://www.citusdata.com/blog/2022/09/19/citus-11-1-shards-postgres-tables-without-interruption/#isolate-tenant
It isolates a tenant (all data with the same customer_id, in your case) into a single shard. Maybe that works for you at some point.

Related

What are the drawbacks of per server long range as primary keys for sharding database?

I am designing a sharded database. We often use two columns: the first identifies the logical shard, and the second uniquely identifies a row within that shard. Instead of that, I am planning to have just one column of type bigint as the primary key. To keep the key unique across servers, I plan to use bigserial sequences that generate non-overlapping ranges.
server | PK starts from | PK ends at
------ | -------------- | --------------
1      | 1              | 9,999,999,999
2      | 10,000,000,000 | 19,999,999,999
3      | 20,000,000,000 | 29,999,999,999
4      | 30,000,000,000 | 39,999,999,999
5      | 40,000,000,000 | 49,999,999,999
and so on.
In future I should be able to:
- Split a large server into two or more small servers
- Join two or more small servers into one big server
- Move some rows from server A to server B for better utilization of resources
I will also have a lookup table which will contain information on the ranges and their target servers.
I would like to learn about drawbacks of this approach.
I recommend that you create the primary key column on server 1 like this:
CREATE TABLE ... (
    id bigint GENERATED ALWAYS AS IDENTITY (START WITH 1 INCREMENT BY 1000),
    ...
);
On the second server you use START WITH 2, and so on.
Adding a new shard is easy: just use a new start value. Splitting or joining shards is trivial (as far as the primary keys are concerned...), because new values can never conflict with old values, and each shard generates different values.
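The interleaving can be sketched in plain Python, simulating the identity sequences instead of talking to PostgreSQL:

```python
from itertools import islice

def identity_sequence(start: int, increment: int = 1000):
    """Simulate GENERATED ALWAYS AS IDENTITY (START WITH start INCREMENT BY increment)."""
    value = start
    while True:
        yield value
        value += increment

# Server N starts at N; with INCREMENT BY 1000, up to 1000 servers can
# generate keys forever without ever colliding.
server1 = set(islice(identity_sequence(1), 100))
server2 = set(islice(identity_sequence(2), 100))
assert server1.isdisjoint(server2)
```

Because the sequences are interleaved rather than ranged, moving rows between shards never creates a key conflict.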
The two most common types of sharding keys are basically:
Based on a deterministic expression, like the method suggested in Laurenz Albe's answer.
Based on a lookup table, like the method you describe in your post.
A drawback of the latter type is that your app has to check the lookup table frequently (perhaps even on every query), because the ranges could change. The ranges stored in the lookup table are a good candidate for caching, to avoid frequent SQL queries; replace the cached ranges whenever you change them. I assume this will be infrequent, so with a cache it's a pretty modest drawback.
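A minimal sketch of such a cache, with assumed (start, end, server) tuples standing in for the lookup table rows:

```python
class ShardLookup:
    """In-app cache of the (range -> server) lookup table.

    `ranges` is a list of (start_pk, end_pk, server) tuples, assumed to be
    loaded from the lookup table and refreshed whenever the ranges change.
    """

    def __init__(self, ranges):
        self.ranges = sorted(ranges)

    def server_for(self, pk: int) -> int:
        for start, end, server in self.ranges:
            if start <= pk <= end:
                return server
        raise KeyError(f"no shard owns primary key {pk}")

    def refresh(self, ranges):
        # Called after an (infrequent) rebalance updates the lookup table.
        self.ranges = sorted(ranges)

lookup = ShardLookup([(1, 9_999_999_999, 1),
                      (10_000_000_000, 19_999_999_999, 2)])
assert lookup.server_for(42) == 1
assert lookup.server_for(10_000_000_005) == 2
```

For many ranges a binary search over the sorted starts would be the natural optimization; the linear scan keeps the sketch short.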
I worked on a system like this, where we had a schema for each of our customers, distributed over 8 shards. At the start of each session, the app queried the lookup table to find out which shard the respective customer's data was stored on. Once a year or so, we would move some customer schemas around to new shards, because naturally they tend to grow. This included updating the lookup table. It wasn't a bad solution.
I expect that you will eventually have multiple non-contiguous ranges per server, because there are hotspots of data or traffic, and it makes sense to split the ranges so you move the least amount of data.
server | PK starts from | PK ends at
------ | -------------- | --------------
1      | 1              | 9,999,999,999
2      | 10,000,000,000 | 19,999,999,999
3      | 20,000,000,000 | 29,999,999,999
4      | 30,000,000,000 | 32,999,999,999
3      | 33,000,000,000 | 39,999,999,999
5      | 40,000,000,000 | 49,999,999,999
If you anticipate moving subsets of data from time to time, this can be a better design than the expression-based type of sharding. If you use an expression, and need to move data between shards, you must either make a more complex expression, or else rebalance a lot of data.
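A small illustration of that rebalancing cost (the modulus expression here is an assumption chosen for demonstration):

```python
def modulus_shard(pk: int, num_shards: int) -> int:
    """Expression-based routing: the shard is a pure function of the key."""
    return pk % num_shards

# Changing the shard count changes the owner of most keys, which is why
# expression-based sharding forces a large rebalance (or a more complex
# expression, such as consistent hashing) whenever shards are added.
moved = sum(1 for pk in range(10_000)
            if modulus_shard(pk, 8) != modulus_shard(pk, 9))
assert moved > 8_000  # the vast majority of keys change shards
```

A lookup table, by contrast, lets you move only the ranges you choose.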

MongoDB large one-time query load on production system

I'm having a MongoDB database, holding tens of millions of documents.
Let's say I want to query a single value out of each document (the target key, nested under the 0 key, under the references key),
so it's a third-level nested key, and only where referenceType equals "CopiedFrom" (the references level doesn't exist in all documents).
There are ~10M documents that match this condition, and this is a one-time query.
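For concreteness, the condition might be expressed roughly like this; the field names come from the description above, and the dot-notation filter is a sketch, not a verified match for the actual schema:

```python
# Rough MongoDB filter/projection for the described one-time extract:
query = {"references.0.referenceType": "CopiedFrom"}
projection = {"references.0.target": 1}

# The same condition as a plain-Python predicate/extractor, for clarity:
def extract_target(doc: dict):
    refs = doc.get("references")
    if refs and refs[0].get("referenceType") == "CopiedFrom":
        return refs[0].get("target")
    return None  # references missing, or referenceType doesn't match

doc = {"references": [{"referenceType": "CopiedFrom", "target": "abc123"}]}
assert extract_target(doc) == "abc123"
assert extract_target({"other": 1}) is None
```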
The DBA in my org tells me this database is transactional (and not for reporting) and serves many clients in production, hence, a query like i'm asking will put great load on the system and will compromise production response times.
I don't have much experience with MongoDB and cannot evaluate this claim (besides the fact that it seems absurd to have historical data you cannot effectively access).
Is he right, or is he exaggerating?
Knowing this can help me address his claim and get the data I need.
Thanks!
Your use case is addressed by adding dedicated hidden nodes to the replica set for analytics queries. See here for example.
The DBA is generally correct in that an expensive analytical query is unsuitable for executing against servers that serve transactional workloads.

Partitioning records in a collection in MongoDB

I have a use case where a set of records in a collection needs to be deleted after a specified interval of time.
For example: records older than 10 hours should be deleted every 10th hour.
We have tried deletion based on _id but found it to be slow.
Is there a way to partition the records in a collection and drop a partition as and when required in MongoDB?
MongoDB does not currently support partitions, there is a JIRA ticket to add this as a feature (SERVER-2097).
One solution is to leverage multiple time-based collections, cycling through collections in a similar way as you would with partitions. Typically you would do this when you usually only query one or a few of these time-based collections at a time. If you often need to read across multiple collections, you could add some wrapper code to simplify that.
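A sketch of the cycling idea, assuming a hypothetical events_* naming scheme and a 10-hour bucket to match the example above:

```python
from datetime import datetime, timezone

def collection_for(ts: datetime, hours_per_bucket: int = 10) -> str:
    """Name of the time-based collection a record with timestamp ts belongs to.

    Collections are cycled like partitions: once a bucket falls outside the
    retention window, the whole collection is dropped (a cheap metadata
    operation) instead of deleting individual documents (slow).
    """
    epoch_hours = int(ts.timestamp()) // 3600
    bucket = epoch_hours // hours_per_bucket
    return f"events_{bucket}"

t1 = datetime(2023, 1, 1, 0, tzinfo=timezone.utc)
t2 = datetime(2023, 1, 1, 10, tzinfo=timezone.utc)
assert collection_for(t1) == collection_for(t1)   # stable mapping
assert collection_for(t1) != collection_for(t2)   # 10 hours later: new bucket
```

Inserts go to the current bucket's collection; the cleanup job just drops collections whose bucket number is too old.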
There are also TTL indexes, which leverage a background thread in the mongod server to handle the deletes for you.
Your deletes by _id may have been slow for a number of reasons, and that probably warrants more investigation beyond your original question.

Sorting Cassandra using individual components of Composite Keys

I want to store a list of users in a Cassandra column family (wide rows).
The columns in the CF will have composite keys of the pattern id:updated_time:name:score.
After inserting all the users, I need to query users in a different sort order each time.
For example, if I sort by updated_time, I should be able to fetch the 10 most recently updated users.
And if I sort by score, I should be able to fetch the top 10 users by score.
Does Cassandra support this?
Kindly help me in this regard.
I need to query users in a different sort order each time...
Does Cassandra support this?
It does not. Unlike an RDBMS, you cannot make arbitrary queries and expect reasonable performance. Instead you must design your data model so that the queries you anticipate will be efficient:
The best way to approach data modeling for Cassandra is to start with your queries and work backwards from there. Think about the actions your application needs to perform, how you want to access the data, and then design column families to support those access patterns.
So rather than having one column family (table) for your data, you might want several with cross references between them. That is, you might have to denormalise your data.
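As a toy illustration of that denormalisation, with plain Python standing in for two query-specific column families (the field names are assumptions based on the composite key in the question):

```python
# On every insert, the application writes each user into BOTH structures,
# each clustered in the order one particular query needs.
users = [
    {"id": 1, "updated_time": 300, "name": "alice", "score": 50},
    {"id": 2, "updated_time": 100, "name": "bob",   "score": 90},
    {"id": 3, "updated_time": 200, "name": "carol", "score": 70},
]

# "users_by_updated_time": clustering order updated_time DESC
recent = sorted(users, key=lambda u: u["updated_time"], reverse=True)
# "users_by_score": clustering order score DESC
top = sorted(users, key=lambda u: u["score"], reverse=True)

# Each query now reads a prefix of one pre-sorted table: no re-sorting needed.
assert [u["name"] for u in recent[:2]] == ["alice", "carol"]
assert [u["name"] for u in top[:2]] == ["bob", "carol"]
```

In Cassandra the sorting happens at write time via the clustering key, so "fetch the top 10" becomes a cheap slice of one row.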

Is it possible to have triggers (like SQL triggers) in DynamoDB?

I would like to have an action triggered every time an item is created or updated in a DynamoDB table. I have been going through the docs but cannot find anything like this. Is it possible?
Thanks.
This is not possible. DynamoDB doesn't let you run any code server-side. The only thing which might count as server-side actions as part of an update are conditional updates, but those can't trigger changes to other items.
The new update supports triggers.
https://aws.amazon.com/blogs/aws/dynamodb-update-triggers-streams-lambda-cross-region-replication-app/
Now you can use DynamoDB Streams.
A stream consists of stream records. Each stream record represents a single data modification in the DynamoDB table to which the stream belongs. Each stream record is assigned a sequence number, reflecting the order in which the record was published to the stream.
Stream records are organized into groups, or shards. Each shard acts as a container for multiple stream records, and contains information required for accessing and iterating through these records. The stream records within a shard are removed automatically after 24 hours.
The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
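A sketch of the per-shard processing described above, with hand-written records mimicking the stream record shape (sequence numbers order the records within a shard):

```python
# Records as they might appear in one stream shard; within a shard, changes
# to a single key arrive in order, so replaying them tracks the item exactly.
shard_records = [
    {"SequenceNumber": "100", "Keys": {"id": "a"}, "EventName": "INSERT"},
    {"SequenceNumber": "200", "Keys": {"id": "a"}, "EventName": "MODIFY"},
    {"SequenceNumber": "300", "Keys": {"id": "a"}, "EventName": "REMOVE"},
]

def replay(records):
    """Process one shard's records in sequence-number order."""
    history = []
    for rec in sorted(records, key=lambda r: int(r["SequenceNumber"])):
        history.append(rec["EventName"])  # trigger logic would run here
    return history

assert replay(shard_records) == ["INSERT", "MODIFY", "REMOVE"]
```

In practice this loop is usually a Lambda function subscribed to the stream, invoked once per batch of shard records.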
Check out http://zapier.com/help/dynamodb; it might be what you are looking for.