Aggregate elements returned from a filtering processor in Spring Batch

My use case starts by reading primary keys from the database; a filtering processor checks whether each id has already been processed, and if it hasn't, a second processor fetches the remaining columns for that id from the database.
I'm trying to reduce the round trips to the database. Is it possible to aggregate the ids so I can execute a batch read?

One way is to use a database table (e.g. PROCESSED) to store the ids that have already been processed, or to use a status column on the existing table that you're reading. Your reader can then use something similar to "WHERE NOT EXISTS (SELECT 1 FROM PROCESSED p WHERE p.ID = x.ID)" to avoid selecting already processed items.
For this solution you need to consider how job restarts behave, and make sure to call saveState(false) on the reader, since you're maintaining the state of already processed items yourself.
See https://docs.spring.io/spring-batch/docs/current/reference/html/readersAndWriters.html#process-indicator
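A minimal sketch of such a reader, assuming a hypothetical ITEMS table whose handled ids are tracked in a PROCESSED table (both names are placeholders, not anything from the original question):

    import javax.sql.DataSource;
    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.jdbc.core.SingleColumnRowMapper;

    @Configuration
    public class UnprocessedIdReaderConfig {

        // Reads only ids that are not yet recorded in the PROCESSED table.
        // saveState(false) because the PROCESSED table, not the execution
        // context, is what tracks progress across restarts.
        @Bean
        public JdbcCursorItemReader<Long> unprocessedIdReader(DataSource dataSource) {
            JdbcCursorItemReader<Long> reader = new JdbcCursorItemReader<>();
            reader.setDataSource(dataSource);
            reader.setSql("SELECT x.ID FROM ITEMS x "
                    + "WHERE NOT EXISTS (SELECT 1 FROM PROCESSED p WHERE p.ID = x.ID)");
            reader.setRowMapper(new SingleColumnRowMapper<>(Long.class));
            reader.setSaveState(false);
            return reader;
        }
    }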

Related

Citus data - How to query data from a single shard in a query?

We are evaluating Citus data for the large-scale data use cases in our organization. While analyzing, I am trying to see if there is a way to achieve the following with Citus data:
We want to create a distributed table (customers) with customer_id being the shard/distribution key (customer_id is a UUID generated at the application end)
While we can use regular SQL queries for all the CRUD operations on these entities, we also need to query the table periodically (a periodic task) to select multiple entries based on some filter criteria, fetch the result set into the application, update a few columns, and write them back (a read-and-update operation).
Our application is a horizontally scalable microservice with multiple instances of the service running in parallel.
So we want to split the periodic task into multiple sub-tasks that run on multiple instances of the service, executing in parallel.
I am therefore looking for a way to query results from a specific shard within a sub-task, so that each sub-task is responsible for fetching and updating the data on one shard only. This would let us run the periodic task in parallel without worrying about conflicts, as each sub-task operates on one shard.
I am not able to find anything from the documentation on how we can achieve this. Is this possible with Citus data?
Citus (by default) distributes data across the shards using the hash value of the distribution column, which is customer_id in your case.
To achieve this, you might need to store a (customer_id - shard_id) mapping in your application, assign sub-tasks to shards, and have each sub-task send its queries using this mapping.
One hacky solution that you might consider: you can add a dummy column (I will name it shard_id) and make it the distribution column, so that your application knows which rows should be fetched/updated by which sub-task. In other words, each sub-task will fetch/update the rows with a particular value of the shard_id column, and all of those rows will be located on the same shard, because they have the same distribution column value. In this way you can control which customer_ids end up on the same shard, and which ones form a separate shard, by assigning them the shard_id you want.
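A rough sketch of that idea over plain JDBC against the Citus coordinator; the table layout, column names, connection details and shard_id assignment are all made up for illustration (create_distributed_table is the standard Citus function for declaring the distribution column):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class ShardedPeriodicTask {

        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://coordinator:5432/app", "app", "secret")) {

                // One-time setup: shard_id is the dummy distribution column the
                // answer describes; the application decides which customers
                // share a shard_id value.
                try (Statement ddl = conn.createStatement()) {
                    ddl.execute("CREATE TABLE IF NOT EXISTS customers ("
                            + "customer_id uuid, shard_id int, status text, "
                            + "PRIMARY KEY (shard_id, customer_id))");
                    ddl.execute("SELECT create_distributed_table('customers', 'shard_id')");
                }

                // Each sub-task is assigned one shard_id value, so its reads and
                // updates are routed to a single shard and never overlap with
                // another sub-task's rows.
                int myShardId = 7; // value assigned to this service instance
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT customer_id FROM customers "
                        + "WHERE shard_id = ? AND status = 'PENDING'")) {
                    ps.setInt(1, myShardId);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            System.out.println("to process: " + rs.getObject("customer_id"));
                        }
                    }
                }
            }
        }
    }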
I would also suggest taking a look at "tenant isolation", which is mentioned in the latest blog post: https://www.citusdata.com/blog/2022/09/19/citus-11-1-shards-postgres-tables-without-interruption/#isolate-tenant
It basically isolates a tenant (all data with the same customer_id in your case) into a single shard. Maybe it works for you at some point.

Handling DB query "IN" with a list of values exceeding DB capacity

I am querying a CosmosDB with a huge list of ids, and I get an exception saying I have exceeded the permissible limit of 256 characters.
What is the best way to handle such huge queries?
The only way I can think of is to split the list and execute in batches.
Any other suggestions?
If you're querying data this way then your model is likely not optimal. I would look to remodel your data so that you can query on another property shared by the items you are looking for (ideally within the same partition as well).
Note that this could also be achieved by using Change Feed to copy the data into another container with a different partition key and a new property that groups the data together. Whether you do this will depend on how often you run this query and whether this is cheaper than running it in multiple batches.
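If you do stay with the batching approach mentioned in the question, a generic client-side sketch of splitting the id list into chunks might look like the following; the chunk size and the query string are placeholders, not actual Cosmos DB limits or API calls:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;

    public class ChunkedInQuery {

        // Split a large id list into fixed-size chunks so each IN clause stays
        // under whatever limit the backing store enforces.
        static <T> List<List<T>> chunk(List<T> ids, int size) {
            List<List<T>> chunks = new ArrayList<>();
            for (int i = 0; i < ids.size(); i += size) {
                chunks.add(ids.subList(i, Math.min(i + size, ids.size())));
            }
            return chunks;
        }

        public static void main(String[] args) {
            List<String> ids = List.of("a", "b", "c", "d", "e");
            for (List<String> batch : chunk(ids, 2)) {
                String inList = batch.stream()
                        .map(id -> "'" + id + "'")
                        .collect(Collectors.joining(", "));
                // In real code, pass the ids as query parameters instead of
                // string concatenation; this only shows the shape of each batch.
                System.out.println("SELECT * FROM c WHERE c.id IN (" + inList + ")");
            }
        }
    }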

EclipseLink batch fetch vs join fetch

When should I use "eclipselink.join-fetch", and when should I use "eclipselink.batch" (batch type = IN)?
Are there any limitations for join fetch, such as the number of tables being fetched?
The answer is always specific to your query, the specific use case, and the database, so there is no hard rule on when to use one over the other, or whether to use either at all. You cannot determine what to use unless you are serious about performance and willing to test both under production load conditions - just like any query performance tweaking.
Join-fetch is just what it says, causing all the data to be brought back in the one query. If your goal is to reduce the number of SQL statements, it is perfect. But it comes at a cost, as inner/outer joins, cartesian joins, etc. can increase the amount of data being sent across and the work the database has to do.
Batch fetching is one extra query (1+1), and can be done a number of ways. IN collects all the foreign key values and puts them into one statement (more than one if you have >1000 on Oracle). JOIN is similar to a fetch join, as it uses the criteria from the original query to select over the relationship, but it won't return as much data, as it only fetches the required rows. EXISTS is very similar, using a subquery to filter.
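For reference, both behaviours are driven by query hints. A sketch using JPA follows, where the Order entity and its o.lines relationship are made-up names; the hint strings themselves ("eclipselink.join-fetch", "eclipselink.batch", "eclipselink.batch.type") are the EclipseLink ones the question refers to:

    import java.util.List;
    import javax.persistence.EntityManager;
    import javax.persistence.TypedQuery;

    public class FetchStrategies {

        // Join fetch: one SQL statement, the relationship is joined into it.
        static List<Order> withJoinFetch(EntityManager em) {
            TypedQuery<Order> q = em.createQuery("SELECT o FROM Order o", Order.class);
            q.setHint("eclipselink.join-fetch", "o.lines");
            return q.getResultList();
        }

        // Batch fetch (IN): a second query that loads the related rows for the
        // parents already selected, collecting their keys into an IN clause.
        static List<Order> withBatchFetch(EntityManager em) {
            TypedQuery<Order> q = em.createQuery("SELECT o FROM Order o", Order.class);
            q.setHint("eclipselink.batch", "o.lines");
            q.setHint("eclipselink.batch.type", "IN");
            return q.getResultList();
        }
    }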

Handle join of KStream with KTable when key is missing from KTable

I recently started experimenting with Kafka Streams. I have a scenario where I need to join a KStream with a KTable. It may be the case that the KTable does not contain some of the keys. In that case I get a NullPointerException.
Specifically, I was getting:
stream-thread [StreamThread-1] Streams application error during processing:
java.lang.NullPointerException
I don't know how I can handle that. I cannot somehow filter out the records of the stream that do not correspond to a table entry.
update
Looking a bit further I found that I can query the underlying store to find whether a key exists through the ReadOnlyKeyValueStore interface.
In this case my question is: would that be the best way to go, i.e. filtering the stream to be joined based on whether a key exists in the local store?
My second question would be: since I care about leveraging the Global State Store introduced in version 0.10.2 in a later phase, should I expect to be able to query the Global State Store in the same manner?
update
The previous update is not accurate, since it's not possible to query the state store from inside the topology.
final update
After understanding the join semantics a bit better, I was able to solve the issue just by simplifying the valueJoiner to only return the results, instead of performing actions on the joined values, and adding an extra filtering step after the join to filter out null values.
The solution to my problem came from understanding the join semantics a bit better.
As in database joins (although I am not saying that KStream joins follow the db join concepts precisely), the left join operation results in rows with null values wherever the right-side keys are missing.
So eventually the only thing I had to do was to decouple my valueJoiner from the subsequent calculations / operations (I needed to perform some calculations on fields of the joined records and return a newly constructed object) and have it only return an array of the joined values. Then I could filter out the records that resulted in null values by checking those arrays.
Based on Matthias J. Sax's suggestion, I used version 0.10.2 (which is compatible with broker version 0.10.1) instead of 0.10.1, and replaced the whole leftJoin logic with an inner join, which removes the need for filtering out null values.
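A sketch of the shape this ends up taking; it uses the current StreamsBuilder API rather than the 0.10.x KStreamBuilder from the question, and the topic names, value types and Pair holder are placeholders. The valueJoiner only pairs the two sides, a filter drops records with no table entry, and switching leftJoin to join (inner join) removes the need for the filter:

    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class JoinTopology {

        // Minimal holder so the valueJoiner does nothing but pair the two sides.
        static class Pair {
            final String left;
            final String right;
            Pair(String left, String right) { this.left = left; this.right = right; }
        }

        static KStream<String, Pair> build(StreamsBuilder builder) {
            KStream<String, String> stream = builder.stream("events");
            KTable<String, String> table = builder.table("reference-data");

            return stream
                    // leftJoin emits a record even when the key is missing from
                    // the table; in that case the right side of the pair is null.
                    .leftJoin(table, Pair::new)
                    // Drop the records that found no table entry, then perform
                    // any further calculations downstream (e.g. in a mapValues).
                    .filter((key, pair) -> pair.right != null);

            // Alternative: stream.join(table, Pair::new) performs an inner
            // join, which never emits a null right side, so no filter is needed.
        }
    }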

Is it possible to have triggers (like SQL triggers) in DynamoDB?

I would like to have an action triggered every time an item is created or updated on a DynamoDB. I have been going through the doc, but cannot find anything like this. Is it possible?
Thanks.
This is not possible. DynamoDB doesn't let you run any code server-side. The only thing which might count as server-side actions as part of an update are conditional updates, but those can't trigger changes to other items.
The new update supports triggers.
https://aws.amazon.com/blogs/aws/dynamodb-update-triggers-streams-lambda-cross-region-replication-app/
Now you can use DynamoDB Streams.
A stream consists of stream records. Each stream record represents a single data modification in the DynamoDB table to which the stream belongs. Each stream record is assigned a sequence number, reflecting the order in which the record was published to the stream.
Stream records are organized into groups, or shards. Each shard acts as a container for multiple stream records, and contains information required for accessing and iterating through these records. The stream records within a shard are removed automatically after 24 hours.
The relative ordering of a sequence of changes made to a single primary key will be preserved within a shard. Further, a given key will be present in at most one of a set of sibling shards that are active at a given point in time. As a result, your code can simply process the stream records within a shard in order to accurately track changes to an item.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html
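For illustration, a stream consumer wired up as a Lambda handler might look roughly like this; it assumes the aws-lambda-java-events dependency and a table with its stream enabled, and the class name is made up:

    import com.amazonaws.services.lambda.runtime.Context;
    import com.amazonaws.services.lambda.runtime.RequestHandler;
    import com.amazonaws.services.lambda.runtime.events.DynamodbEvent;
    import com.amazonaws.services.lambda.runtime.events.DynamodbEvent.DynamodbStreamRecord;

    public class TableTriggerHandler implements RequestHandler<DynamodbEvent, String> {

        @Override
        public String handleRequest(DynamodbEvent event, Context context) {
            for (DynamodbStreamRecord record : event.getRecords()) {
                // eventName is INSERT, MODIFY or REMOVE; the stream record
                // carries the item keys (and, depending on the stream view
                // type, the old/new item images).
                context.getLogger().log(record.getEventName() + " "
                        + record.getDynamodb().getKeys());
            }
            return "processed " + event.getRecords().size() + " records";
        }
    }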
Check out http://zapier.com/help/dynamodb; it might be what you are looking for.