Non-Distributed Data / Spring Batch Partitioner - spring-batch

Take a case where I am reading data from a database with conditions (millions of rows), doing some business processing on the data, and then updating it.
I am using a ColumnRangePartitioner on the Id column, taking the min and max Ids to process and creating partitions of size (max - min) / gridSize.
Now imagine I have the Ids 1, 22, 23, 24 and 30 with gridSize = 3. With that logic I will have 3 partitions:
partition1 processing Id 1
partition2 processing 0 rows
partition3 processing Ids 22, 23, 24 and 30
With millions of rows, parallel processing like this isn't useful, and trying to retrieve all the data in a single request to implement evenly distributed partitioning takes forever.
What's the best solution?

The ColumnRangePartitioner shown in the examples requires an evenly distributed column to be effective (as you have noted). Instead, you can typically add a row number to your query and partition on that, since it will be a contiguous sequence over the results.
An example of the SQL would look something like this (for MySQL):
SELECT F.*,
       @rownum := @rownum + 1 AS rank
FROM FOO F,
     (SELECT @rownum := 0) r;
With that, the rank column is a sequence regenerated each time you run the query. From that value, you can partition the dataset. Since it isn't persistent, you'd need to do some gymnastics to get the right ids, but the basic logic of your Partitioner implementation would look something like the steps below (a code sketch follows the list):
Run count query to find out how many records there are in your data set.
Run a query using above technique to figure out what the db id is for the start and the end of each partition range. This will give you the ids to filter by per partition.
Create a partition for each pair (start/end) using the actual db ids.
Set up your ItemReader to read only the items within the range of db ids provided to each partition.
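For illustration, here is a minimal Partitioner sketch along those lines, assuming MySQL, a table FOO with a numeric id column, and Spring's JdbcTemplate. All names are placeholders, and for brevity the row-number query is re-run per boundary rather than materialized once:

import java.util.HashMap;
import java.util.Map;
import javax.sql.DataSource;
import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.jdbc.core.JdbcTemplate;

public class RowNumberRangePartitioner implements Partitioner {

    private final JdbcTemplate jdbcTemplate;

    public RowNumberRangePartitioner(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        // 1. Count query: how many rows are in the data set
        long total = jdbcTemplate.queryForObject("SELECT COUNT(*) FROM FOO", Long.class);
        long targetSize = (total + gridSize - 1) / gridSize;

        // Row-number query: returns the real db id sitting at a given row position
        String boundarySql =
            "SELECT t.id FROM ("
          + "  SELECT F.id, @rownum := @rownum + 1 AS rn"
          + "  FROM FOO F, (SELECT @rownum := 0) r ORDER BY F.id"
          + ") t WHERE t.rn = ?";

        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (int i = 0; i < gridSize; i++) {
            long startRow = i * targetSize + 1;
            long endRow = Math.min((i + 1) * targetSize, total);
            if (startRow > total) {
                break;
            }
            // 2. Resolve the actual db ids at the start/end row positions
            long minId = jdbcTemplate.queryForObject(boundarySql, Long.class, startRow);
            long maxId = jdbcTemplate.queryForObject(boundarySql, Long.class, endRow);

            // 3. One partition per (minId, maxId) pair; the ItemReader filters on these
            ExecutionContext context = new ExecutionContext();
            context.putLong("minId", minId);
            context.putLong("maxId", maxId);
            partitions.put("partition" + i, context);
        }
        return partitions;
    }
}

The ItemReader for each partition (step 4) can then bind those values late, e.g. a step-scoped reader referencing #{stepExecutionContext['minId']} and #{stepExecutionContext['maxId']} in its WHERE clause.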

Related

Aggregate on Redshift SUPER type

Context
I'm trying to find the best way to represent and aggregate a high-cardinality column in Redshift. The source is event-based and looks something like this:
user  timestamp            event_type
1     2021-01-01 12:00:00  foo
1     2021-01-01 15:00:00  bar
2     2021-01-01 16:00:00  foo
2     2021-01-01 19:00:00  foo
Where:
the number of users is very large
a single user can have very large numbers of events, but is unlikely to have many different event types
the number of different event_type values is very large, and constantly growing
I want to aggregate this data into a much smaller dataset with a single record (document) per user. These documents will then be exported. The aggregations of interest are things like:
Number of events
Most recent event time
But also:
Number of events for each event_type
It is this latter case that I am finding difficult.
Solutions I've considered
The simple "columnar-DB-friendly" approach to this problem would be to have an aggregate column for each event type:
user  nb_events  ...  nb_foo  nb_bar
1     2          ...  1       1
2     2          ...  2       0
But I don't think this is an appropriate solution here, since the event_type field is dynamic and may have hundreds or thousands of values (and Redshift has an upper limit of 1,600 columns). Moreover, there may be multiple types of aggregation on this event_type field (not just count).
A second approach would be to keep the data in its vertical form, where there is not one row per user but rather one row per (user, event_type). However, this really just postpones the issue - at some point the data still needs to be aggregated into a single record per user to achieve the target document structure, and the problem of column explosion still exists.
A much more natural (I think) representation of this data is as a sparse array/document/SUPER:
user  nb_events  ...  count_by_event_type (SUPER)
1     2          ...  {"foo": 1, "bar": 1}
2     2          ...  {"foo": 2}
This also pretty much exactly matches the intended SUPER use case described by the AWS docs:
When you need to store a relatively small set of key-value pairs, you might save space by storing the data in JSON format. Because JSON strings can be stored in a single column, using JSON might be more efficient than storing your data in tabular format. For example, suppose you have a sparse table, where you need to have many columns to fully represent all possible attributes, but most of the column values are NULL for any given row or any given column. By using JSON for storage, you might be able to store the data for a row in key:value pairs in a single JSON string and eliminate the sparsely-populated table columns.
So this is the approach I've been trying to implement. But I haven't quite been able to achieve what I'm hoping to, mostly due to difficulties populating and aggregating the SUPER column. These are described below:
Questions
Q1:
How can I insert into this kind of SUPER column from another SELECT query? The Redshift docs only really discuss SUPER columns in the context of the initial data load (e.g. using json_parse), but never the case where this data is generated from another Redshift query. I understand that this is because the preferred approach is to load SUPER data and convert it to columnar data as soon as possible.
Q2:
How can I re-aggregate this kind of SUPER column, while retaining the SUPER structure? Until now, I've discussed a simplified example which only aggregates by user. In reality, there are other dimensions of aggregation, and some analyses of this table will need to re-aggregate the values shown in the table above. By analogy, the desired output might look something like (aggregating over all users):
nb_events  ...  count_by_event_type (SUPER)
4          ...  {"foo": 3, "bar": 1}
I can get close to achieving this re-aggregation with a query like (where the listagg of key-value string pairs is a stand-in for the SUPER type construction that I don't know how to do):
select
    sum(nb_events) nb_events,
    (
        select listagg(s)
        from (
            select
                k::text || ':' || sum(v)::text as s
            from my_aggregated_table inner_query,
                 unpivot inner_query.count_by_event_type as v at k
            group by k
        ) a
    ) count_by_event_type
from my_aggregated_table outer_query
But Redshift doesn't support this kind of correlated query:
[0A000] ERROR: This type of correlated subquery pattern is not supported yet
Q3:
Are there any alternative approaches to consider? Normally I'd handle this kind of problem with Spark, which I find much more flexible for these kinds of problems. But if possible it would be great to stick with Redshift, since that's where the source data is.

Postgres extended statistics with partitioning

I am using Postgres 13 and have created a table with columns A, B and C. The table is partitioned by A, with 2 possible values. Partition 1 contains 100 possible values each for B and C, whereas partition 2 has 100 completely different values for B and 1 different value for C. I have set the statistics target for both columns to the maximum so that this definitely doesn't cause any issues.
If I group by B and C on either partition, Postgres estimates the number of groups correctly. However, if I run the query against the base table where I really want it, it estimates what I assume is no functional dependency between A, B and C, i.e. (p1B + p2B) * (p1C + p2C) = 200 * 101, as opposed to the reality of p1B * p1C + p2B * p2C = 10,000 + 100.
I guess I was half expecting it to sum the underlying partitions rather than use the full count of 200 B's and 101 C's that the base table can see. Moreover, if I also add A into the group by then the estimate erroneously doubles further still, as it then thinks that this set will also be duplicated for each value of A.
This all made me think that I need an extended statistic to tell it that A influences either B or C or both. However, if I create one on the partitioned base table and analyze, the value in pg_statistic_ext_data.stxdndistinct is null. Whereas if I set it on the partitions themselves, it does appear to work, though it isn't particularly useful because the estimation is already correct at that level. How do I get Postgres to estimate against the base table correctly without having to run the query against all of the partitions and unioning them together?
You can define extended statistics on a partitioned table, but PostgreSQL doesn't collect any data in that case. You'll have to create extended statistics on all partitions individually.
You can confirm that by querying the collected data after an ANALYZE:
SELECT s.stxrelid::regclass AS table_name,
s.stxname AS statistics_name,
d.stxdndistinct AS ndistinct,
d.stxddependencies AS dependencies
FROM pg_statistic_ext AS s
JOIN pg_statistic_ext_data AS d
ON d.stxoid = s.oid;
There is certainly room for improvement here; perhaps don't allow defining extended statistics on a partitioned table in the first place.
I found that I just had to turn enable_partitionwise_aggregate on to get this to estimate correctly.

Azure Data Explorer partitioning strategy

I have a table in Azure Data Explorer that collects data from IoT sensors. In the near future it will collect millions of records each day. So to get the best query performance I am looking into setting a partitioning policy: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/partitioningpolicy
My table has 5 important columns: TenantId, DeviceId, SensorId, Value, Timestamp
The combination of (TenantId, DeviceId, VariableId) makes a sensor globally unique, and almost all queries will contain a part that says TenantId = 'xx' and DeviceId = 'xx' and VariableId = 'xx'. All these columns are of type string and have high cardinality (10,000+ Tenants, 1,000+ DeviceIds, 10,000+ VariableIds).
Two questions:
Would it be wise to apply partitioning on this table based on one or more of the string columns? It complies with the advice in the documentation that says:
The majority of queries use equality filters (==, in()).
The majority of queries aggregate/join on a specific string-typed column of large-dimension (cardinality of 10M or higher) such as an application_ID, a tenant_ID, or a user_ID.
But later on the page, for MaxPartitionCount, they say that it should be no higher than 1024 and lower than the cardinality of the column. As I have high-cardinality columns this does not comply, so I am a bit confused.
Would it be best to concat the string columns before ingestion and partition on the new column? Or only on TenantId for example?
almost all queries will contain a part that says TenantId = 'xx' and DeviceId = 'xx' and VariableId = 'xx'.
Given this (and assuming you don't frequently join on any of these 3 columns), you could extend your data set with a new column that is the concatenation of these 3 (e.g. strcat_delim("_", TenantId, DeviceId, VariableId)).
You can do this either before ingestion into Kusto (better), or at ingestion time using an update policy.
Then, set that new column as the hash partition key in the table's data partitioning policy.
for the MaxPartitionCount they say that it should be not higher than 1024 and lower than the cardinality of the column. As I have high-cardinality columns this does not comply, so I am a bit confused.
Let's assume you have a cluster with 20 nodes, a column C with cardinality 10,000,000, and you want to set it as the table's hash partition key.
Following the guidelines in the documentation regarding MaxPartitionCount:
Supported values are in the range (1,1024]. -> MaxPartitionCount should be larger than 1 and lower than or equal to 1024.
The value is expected to be larger than the number of nodes in the cluster -> MaxPartitionCount should be larger than 20.
The value is expected to be smaller than the cardinality of the column -> MaxPartitionCount should be lower than 10,000,000.
We recommend that you start with a value of 256.
Adjust the value as needed, based on the above considerations, or based on the benefit in query performance vs. the overhead of partitioning the data post-ingestion.
As I don't see any conflicting information here (256 > 1, 256 <= 1024, 256 > 20, 256 < 10M), you may want to clarify where the confusion is coming from.

How to implement application level pagination over ScalarDB

This question is part Cassandra and part ScalarDB. I am using ScalarDB, which provides ACID support on top of Cassandra. The library seems to be working well! Unfortunately, ScalarDB doesn't support pagination, so I have to implement it in the application.
Consider this scenario, in which P is the primary key, C is the clustering key and E is other data within the partition:
Partition => { P,C1,E1
P,C2,E1
P,C2,E2
P,C2,E3
P,C2,E4
P,C3,E1
...
P,Cm,En
}
In ScalarDB, I can specify the start and end values of keys, so I suppose ScalarDB will get data only from the specified rows. I can also limit the number of entries fetched.
https://scalar-labs.github.io/scalardb/javadoc/com/scalar/db/api/Scan.html
Say I want to get entries E3 and E4 from P,C2. For smaller values, I can specify start and end clustering keys as C2, set the fetch limit to, say, 4, and ignore E1 and E2. But if there are several hundred records, this method will not scale.
For example, say P,C1 has 10 records and P,C2 has 100 records, and I want to implement pagination of 20 records per query. Then to implement this, I'll have to:
Query 1 - Scan - the primary key will be P, the clustering start will be C1, and the clustering end will be Cn, as I don't know how many records there are.
get P,C1. This will give 10 records.
get P,C2. This will give me 20 records. I'll ignore the last 10 and combine P,C1's 10 with P,C2's first 10 and return the result.
I'll also have to remember that the last clustering key queried was C2 and that 10 records were fetched from it.
Query 2 (for the next pagination request) - Scan - the primary key will be P, the clustering start will be C2, and the clustering end will be Cn, as I don't know how many records there are.
Now I'll fetch P,C2 and get 20 records, ignore the first 10 (as they were sent last time), take the remaining 10, do another fetch using the same Scan, and take the first 10 from that.
Is this how it should be done, or is there a better way? My concern with the above implementation is that every time I'll have to fetch loads of records and dump them. For example, say I want to get records 70-90 from P,C2; then I'll still query up to record 60 and dump the results!
The partition key and the clustering keys together compose the primary key, so your example above doesn't look right.
Let's assume the following data structure.
P, C1, ...
P, C2, ...
P, C3, ...
...
Anyway, I think one way could be as follows, assuming the page size is 2 (a code sketch follows the steps).
Scan with start (P, C1) inclusive, ascending, with limit 2. Store the results in R1.
Get the last record of R1 -> (P, C2).
Scan with start at the previous last record (P, C2), not inclusive, ascending, with limit 2.
...
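A rough Java sketch of those steps, assuming the Scan builder methods from the linked Javadoc (withStart, withOrdering, withLimit); the column names p and c, the class name, and the transaction handle are illustrative and may need adapting to your ScalarDB version:

import com.scalar.db.api.DistributedTransaction;
import com.scalar.db.api.Result;
import com.scalar.db.api.Scan;
import com.scalar.db.io.Key;
import com.scalar.db.io.TextValue;
import java.util.List;
import java.util.Optional;

public class PagedScanner {

    // Returns one page of `pageSize` records from partition `partition`,
    // starting just after `lastClusteringKey` (or from the beginning if empty).
    public List<Result> nextPage(DistributedTransaction tx, String partition,
                                 Optional<Key> lastClusteringKey, int pageSize) throws Exception {
        Scan scan = new Scan(new Key(new TextValue("p", partition)))
            .withOrdering(new Scan.Ordering("c", Scan.Ordering.Order.ASC))
            .withLimit(pageSize);
        // Exclusive start: rows already returned in the previous page are neither
        // skipped nor re-read.
        lastClusteringKey.ifPresent(k -> scan.withStart(k, false));
        return tx.scan(scan);
    }
}

The caller keeps the clustering key of the last record of each page and passes it back for the next page, which matches the "start not inclusive" step above, so no records are fetched and dumped.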

Java 8 EntityListIterator consumes too much time to process the query

I have a table "items" in PostgreSQL, with information like id, name, desc, config, etc.
It contains 1.6 million records.
I need to make a query to get all results, like "select id, name, description from items".
What is the proper pattern for iterating over large result sets?
I used EntityListIterator:
EntityListIterator iterator = EntityQuery.use(delegator)
.select("id", "name", "description")
.from("items")
.cursorScrollInsensitive()
.queryIterator();
int total = iterator.getResultsSizeAfterPartialList();
List<GenericValue> items = iterator.getPartialList(start+1, length);
iterator.close();
the start here is 0 and the length is 10.
I implemented this so I can do pagination with Datatables.
The problem with this is that I have millions of records and it takes like 20 seconds to complete.
What can I do to improve the performance?
If you are implementing pagination, you shouldn't load all 1.6 million records in memory at once. Use order by id in your query and filter on id ranges (0 to 10, 10 to 20, etc.) in the where clause, keeping a counter of the last id you have traversed, as sketched below.
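For example, a plain-JDBC sketch of that keyset-style pagination (not OFBiz-specific; the connection URL and class name are placeholders, the table and column names come from the question):

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

public class ItemPager {

    private final String url; // e.g. "jdbc:postgresql://localhost/mydb" (placeholder)

    public ItemPager(String url) {
        this.url = url;
    }

    // Fetch one page of `pageSize` rows whose id is greater than `lastSeenId`.
    // The caller remembers the last id of the page it received and passes it back
    // for the next page, so no rows are skipped or re-read.
    public List<String[]> nextPage(long lastSeenId, int pageSize) throws SQLException {
        String sql = "SELECT id, name, description FROM items WHERE id > ? ORDER BY id LIMIT ?";
        List<String[]> page = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, lastSeenId);
            ps.setInt(2, pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    page.add(new String[] {
                        rs.getString("id"), rs.getString("name"), rs.getString("description")
                    });
                }
            }
        }
        return page;
    }
}

Because each query seeks directly to id > lastSeenId instead of scrolling a cursor over the whole table, every page costs roughly the same regardless of how deep you paginate (given an index on id).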
If you really want to pull all records into memory, then just load the first few pages' records (e.g. from id = 1 to id = 100), return them to the client, and then use something like CompletableFuture to asynchronously retrieve the rest of the records in the background.
Another approach is to run multiple small queries in separate threads, depending on how many parallel reads your database supports, and then merge the results.
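If you go that route, a rough sketch with plain JDBC and CompletableFuture might look like this (the connection URL, the thread count, and the fixed id ranges are all assumptions for the example):

import java.sql.*;
import java.util.*;
import java.util.concurrent.*;
import java.util.stream.Collectors;

public class ParallelItemLoader {

    private static final String URL = "jdbc:postgresql://localhost/mydb"; // placeholder

    // Load one id range on the calling thread.
    public static List<String> loadRange(long fromId, long toId) {
        String sql = "SELECT id, name, description FROM items WHERE id > ? AND id <= ? ORDER BY id";
        List<String> rows = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(URL);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, fromId);
            ps.setLong(2, toId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(rs.getLong("id") + ";" + rs.getString("name") + ";" + rs.getString("description"));
                }
            }
        } catch (SQLException e) {
            throw new RuntimeException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4); // one thread per range
        long[][] ranges = { {0, 400_000}, {400_000, 800_000}, {800_000, 1_200_000}, {1_200_000, 1_600_000} };

        // Fire one query per id range, then merge the partial results in range order.
        List<CompletableFuture<List<String>>> futures = Arrays.stream(ranges)
            .map(r -> CompletableFuture.supplyAsync(() -> loadRange(r[0], r[1]), pool))
            .collect(Collectors.toList());
        List<String> all = futures.stream()
            .flatMap(f -> f.join().stream())
            .collect(Collectors.toList());

        pool.shutdown();
        System.out.println("Loaded " + all.size() + " rows");
    }
}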
What about CopyManager? You could fetch your data as a text/CSV output stream; it may be faster to retrieve it that way.
CopyManager cm = new CopyManager((BaseConnection) conn);
String sql = "COPY (SELECT id, name, description FROM items) TO STDOUT WITH DELIMITER ';'";
cm.copyOut(sql, new BufferedWriter(new FileWriter("C:/export_transaction.csv")));