This question is part Cassandra and part ScalarDB. I am using ScalarDB, which provides ACID support on top of Cassandra. The library seems to be working well! Unfortunately, ScalarDB doesn't support pagination, so I have to implement it in the application.
Consider this scenario, in which P is the primary key, C is the clustering key, and E is other data within the partition:
Partition => { P,C1,E1
P,C2,E1
P,C2,E2
P,C2,E3
P,C2,E4
P,C3,E1
...
P,Cm,En
}
In ScalarDB, I can specify start and end values for the keys, so I suppose ScalarDB will read only the specified rows. I can also limit the number of entries fetched.
https://scalar-labs.github.io/scalardb/javadoc/com/scalar/db/api/Scan.html
Say I want to get entries E3 and E4 from P,C2. For small partitions, I can specify both the start and end clustering key as C2, set the fetch limit to, say, 4, and ignore E1 and E2. But if there are several hundred records, this method will not scale.
For example, say P,C1 has 10 records, P,C2 has 100 records, and I want pagination of 20 records per query. To implement this, I'll have to:
Query 1 – Scan – the primary key will be P, the clustering start will be C1, and the clustering end will be Cn, as I don't know how many records there are.
Get P,C1. This will give 10 records.
Get P,C2. This will give me 20 records. I'll ignore the last 10, combine P,C1's 10 with P,C2's first 10, and return the result.
I'll also have to record that the last clustering key queried was C2 and that 10 records were already returned from it.
Query 2 (for the next pagination request) – Scan – the primary key will be P, the clustering start will be C2, and the clustering end will be Cn, as I don't know how many records there are.
Now I'll fetch P,C2 and get 20 records, ignore the first 10 (as they were sent last time), take the remaining 10, do another fetch using the same Scan, and take the first 10 from that.
Is this how it should be done, or is there a better way? My concern with the above implementation is that every time I'll have to fetch loads of records and throw most of them away. For example, if I want to get records 70-90 from P,C2, I'll still have to query up to record 60 and discard the result!
Partition keys and clustering keys together compose a primary key, so your example above doesn't look right.
Let's assume the following data structure:
P, C1, ...
P, C2, ...
P, C3, ...
...
Anyway, I think one way could be as follows, assuming the page size is 2:
Scan with start (P, C1) inclusive, ascending order, and limit 2. Store the results in R1.
Get the last record of R1 -> (P, C2).
Scan with start at the previous last record (P, C2), not inclusive, ascending order, limit 2.
...
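At the Cassandra level, this keyset-style pagination corresponds roughly to the CQL below. This is only a sketch: the table name tbl and the column names p and c are illustrative, and in ScalarDB you would express the same thing through Scan's start, ordering, and limit settings.
-- Page 1: start at the first clustering key, ascending, limited to the page size.
SELECT * FROM tbl WHERE p = 'P' AND c >= 'C1' ORDER BY c ASC LIMIT 2;
-- Page 2: restart strictly after the last clustering key returned by page 1,
-- so earlier rows never have to be read and discarded.
SELECT * FROM tbl WHERE p = 'P' AND c > 'C2' ORDER BY c ASC LIMIT 2;
Because each page restarts at the last clustering key seen, the cost of a page stays proportional to the page size rather than the offset, which addresses the concern about fetching and dumping earlier records.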
I have a task that is too complicated for me, and I hope someone can help me :)
I have two different structures containing data about products:
1.
products with product_id, brand_id (products that I have)
products_sku with product_id, sku_id, vendor_code (SKUs for products)
products_avail with sku_id, scheme_id, quantity (availability for each product SKU and scheme)
2.
products_external with product_id_e, brand_id_e, vendor_code_e, sku_id_e
products_avail_external with product_id_e, quantity_e, scheme_id_e
Each SKU is identified by a (brand_id, vendor_code) pair, so one product from (2) corresponds to one SKU from (1). I can also have several availability entries for different schemes. The availability records can number in the tens of millions.
The sku_id_e field in (2) is updated by a cron task, so if it is defined (i.e. not zero), I can find the corresponding record in (1).
I need to get all records from (2) with sku_id_e defined, group them by (sku_id_e, scheme_id_e), and make a set of records in (1), so that one record will contain the SUM(quantity) of all records with a given (sku_id_e, scheme_id_e).
I could do an UPSERT, but in that case I would waste sequence numbers (which is a problem in my case because the request will be executed relatively frequently and on a massive number of records).
I could use something like ON EMPTY or NOT EXISTS, but it is too complicated for me to combine into one request.
I could just select both datasets and do the matching programmatically, but that is definitely not the best solution.
Can you help me write SQL that will update the records in (1), or insert them if they do not exist, without wasting sequence values?
Thank you in advance!
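For reference, one way to express the NOT EXISTS variant mentioned above is an update followed by an insert of only the missing keys, so that no sequence values are consumed for rows that already exist. This is only a sketch: it assumes PostgreSQL (implied by the sequence concern) and that products_avail, keyed by (sku_id, scheme_id), is the target in (1).
WITH agg AS (
    -- aggregate the external availability per (sku_id_e, scheme_id_e), only where sku_id_e is defined
    SELECT e.sku_id_e        AS sku_id,
           a.scheme_id_e     AS scheme_id,
           SUM(a.quantity_e) AS quantity
    FROM products_external e
    JOIN products_avail_external a ON a.product_id_e = e.product_id_e
    WHERE e.sku_id_e <> 0
    GROUP BY e.sku_id_e, a.scheme_id_e
), upd AS (
    -- update rows that already exist; RETURNING tells the insert which keys were handled
    UPDATE products_avail p
    SET quantity = agg.quantity
    FROM agg
    WHERE p.sku_id = agg.sku_id AND p.scheme_id = agg.scheme_id
    RETURNING p.sku_id, p.scheme_id
)
INSERT INTO products_avail (sku_id, scheme_id, quantity)
SELECT agg.sku_id, agg.scheme_id, agg.quantity
FROM agg
WHERE NOT EXISTS (
    SELECT 1 FROM upd
    WHERE upd.sku_id = agg.sku_id AND upd.scheme_id = agg.scheme_id
);
Unlike ON CONFLICT, this pattern is not safe against concurrent runs inserting the same (sku_id, scheme_id) at the same time; if that can happen, it needs a unique constraint plus a retry, or the job has to be serialized.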
Context
I'm trying to find the best way to represent and aggregate a high-cardinality column in Redshift. The source is event-based and looks something like this:
user | timestamp           | event_type
1    | 2021-01-01 12:00:00 | foo
1    | 2021-01-01 15:00:00 | bar
2    | 2021-01-01 16:00:00 | foo
2    | 2021-01-01 19:00:00 | foo
Where:
the number of users is very large
a single user can have very large numbers of events, but is unlikely to have many different event types
the number of different event_type values is very large, and constantly growing
I want to aggregate this data into a much smaller dataset with a single record (document) per user. These documents will then be exported. The aggregations of interest are things like:
Number of events
Most recent event time
But also:
Number of events for each event_type
It is this latter case that I am finding difficult.
Solutions I've considered
The simple "columnar-DB-friendy" approach to this problem would simply be to have an aggregate column for each event type:
user | nb_events | ... | nb_foo | nb_bar
1    | 2         | ... | 1      | 1
2    | 2         | ... | 2      | 0
But I don't think this is an appropriate solution here, since the event_type field is dynamic and may have hundreds or thousands of values (and Redshift has an upper limit of 1,600 columns). Moreover, there may be multiple types of aggregation on this event_type field (not just count).
A second approach would be to keep the data in its vertical form, where there is not one row per user but rather one row per (user, event_type). However, this really just postpones the issue - at some point the data still needs to be aggregated into a single record per user to achieve the target document structure, and the problem of column explosion still exists.
A much more natural (I think) representation of this data is as a sparse array/document/SUPER:
user | nb_events | ... | count_by_event_type (SUPER)
1    | 2         | ... | {"foo": 1, "bar": 1}
2    | 2         | ... | {"foo": 2}
This also pretty much exactly matches the intended SUPER use case described by the AWS docs:
When you need to store a relatively small set of key-value pairs, you might save space by storing the data in JSON format. Because JSON strings can be stored in a single column, using JSON might be more efficient than storing your data in tabular format. For example, suppose you have a sparse table, where you need to have many columns to fully represent all possible attributes, but most of the column values are NULL for any given row or any given column. By using JSON for storage, you might be able to store the data for a row in key:value pairs in a single JSON string and eliminate the sparsely-populated table columns.
So this is the approach I've been trying to implement. But I haven't quite been able to achieve what I'm hoping to, mostly due to difficulties populating and aggregating the SUPER column. These are described below:
Questions
Q1:
How can I insert into this kind of SUPER column from another SELECT query? The Redshift docs only really discuss SUPER columns in the context of initial data loads (e.g. using json_parse), never the case where the data is generated by another Redshift query. I understand that this is because the preferred approach is to load SUPER data and convert it to columnar data as soon as possible.
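For what it's worth, one way to build the SUPER value from a query is to assemble a JSON string with LISTAGG and convert it with JSON_PARSE. This is a sketch only: the events and user_docs names are illustrative, it assumes the event_type values need no JSON escaping, and it assumes your Redshift version accepts JSON_PARSE over a computed string.
INSERT INTO user_docs (user_id, nb_events, count_by_event_type)
SELECT user_id,
       SUM(cnt) AS nb_events,
       -- LISTAGG output is limited to 64K characters, so extremely wide documents may not fit
       JSON_PARSE(
           '{' || LISTAGG('"' || event_type || '":' || cnt::varchar, ',')
                  WITHIN GROUP (ORDER BY event_type) || '}'
       ) AS count_by_event_type
FROM (
    SELECT "user" AS user_id, event_type, COUNT(*) AS cnt
    FROM events
    GROUP BY "user", event_type
) per_type
GROUP BY user_id;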
Q2:
How can I re-aggregate this kind of SUPER column, while retaining the SUPER structure? Until now, I've discussed a simplified example which only aggregates by user. In reality, there are other dimensions of aggregation, and some analyses of this table will need to re-aggregate the values shown in the table above. By analogy, the desired output might look something like (aggregating over all users):
nb_events | ... | count_by_event_type (SUPER)
4         | ... | {"foo": 3, "bar": 1}
I can get close to achieving this re-aggregation with a query like (where the listagg of key-value string pairs is a stand-in for the SUPER type construction that I don't know how to do):
select
sum(nb_events) nb_events,
(
select listagg(s)
from (
select
k::text || ':' || sum(v)::text as s
from my_aggregated_table inner_query,
unpivot inner_query.count_by_event_type as v at k
group by k
) a
) count_by_event_type
from my_aggregated_table outer_query
But Redshift doesn't support this kind of correlated query:
[0A000] ERROR: This type of correlated subquery pattern is not supported yet
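One possible workaround, under the same assumptions as above (illustrative names, LISTAGG-based SUPER construction), might be to unpivot once at the top level instead of inside a correlated subquery, re-aggregate per key, and then rebuild the document:
WITH per_key AS (
    -- flatten the SUPER map into (key, value) rows and re-aggregate per key
    SELECT k::varchar AS event_type, SUM(v::int) AS cnt
    FROM my_aggregated_table t, UNPIVOT t.count_by_event_type AS v AT k
    GROUP BY k
), totals AS (
    SELECT SUM(nb_events) AS nb_events FROM my_aggregated_table
)
SELECT t.nb_events,
       JSON_PARSE(
           '{' || LISTAGG('"' || p.event_type || '":' || p.cnt::varchar, ',')
                  WITHIN GROUP (ORDER BY p.event_type) || '}'
       ) AS count_by_event_type
FROM per_key p, totals t
GROUP BY t.nb_events;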
Q3:
Are there any alternative approaches to consider? Normally I'd handle this kind of problem with Spark, which I find much more flexible here. But if possible it would be great to stick with Redshift, since that's where the source data is.
I am using Postgres 13 and have created a table with columns A, B, and C. The table is partitioned on A, which has 2 possible values. Partition 1 contains 100 possible values each for B and C, whereas partition 2 has 100 completely different values for B and 1 different value for C. I have set the statistics target for both columns to the maximum so that this definitely isn't the cause of the issue.
If I group by B and C on either partition, Postgres estimates the number of groups correctly. However, if I run the query against the base table, where I really want it, it assumes no functional dependency between A, B, and C and estimates (p1B + p2B) * (p1C + p2C) = 200 * 101 groups, as opposed to the reality of p1B * p1C + p2B * p2C = 10000 + 100.
I was half expecting it to sum the estimates for the underlying partitions rather than use the full counts of 200 B values and 101 C values that the base table can see. Moreover, if I also add A to the GROUP BY, the estimate erroneously doubles again, as it then thinks that this set will also be duplicated for each value of A.
This all made me think that I need extended statistics to tell it that A influences B or C or both. However, if I create one on the partitioned base table and ANALYZE, the value of stxdndistinct in pg_statistic_ext_data is NULL, whereas if I create it on the partitions themselves, it does get populated, though that isn't particularly useful because the estimate is already correct at that level. How do I get Postgres to estimate against the base table correctly without having to run the query against all of the partitions and UNION them together?
You can define extended statistics on a partitioned table, but PostgreSQL doesn't collect any data in that case. You'll have to create extended statistics on all partitions individually.
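For example (a sketch with illustrative names, assuming the partitions are called t_p1 and t_p2 and the columns of interest are b and c):
-- one ndistinct statistics object per partition, then re-analyze
CREATE STATISTICS t_p1_b_c_ndistinct (ndistinct) ON b, c FROM t_p1;
CREATE STATISTICS t_p2_b_c_ndistinct (ndistinct) ON b, c FROM t_p2;
ANALYZE t_p1, t_p2;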
You can confirm that by querying the collected data after an ANALYZE:
SELECT s.stxrelid::regclass AS table_name,
s.stxname AS statistics_name,
d.stxdndistinct AS ndistinct,
d.stxddependencies AS dependencies
FROM pg_statistic_ext AS s
JOIN pg_statistic_ext_data AS d
ON d.stxoid = s.oid;
There is certainly room for improvement here; perhaps don't allow defining extended statistics on a partitioned table in the first place.
I found that I just had to turn enable_partitionwise_aggregate on to get this to estimate correctly.
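For reference, the parameter is off by default and can be enabled per session (or made permanent in postgresql.conf):
SET enable_partitionwise_aggregate = on;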
Take a case where I am reading data from a database with conditions (millions of rows), doing some business logic on the data, and then updating it.
I am using a ColumnRangePartitioner on the Id column, taking the min and max Ids to process and creating partitions of size (max - min) / gridSize.
Now imagine I have the Ids 1, 22, 23, 24, 30 with gridSize = 3; with that logic I will have 3 partitions:
partition1 processing Id 1
partition2 processing 0 Rows
partition3 processing 22, 23, 24 and 30
With millions of rows, parallel processing like this isn't useful, and trying to retrieve all the data in a single request to implement distributed partitioning takes forever.
What's the best solution?
The ColumnRangePartitioner shown in the examples requires an evenly distributed column to be effective (as you have noted). Instead, you can typically add a row number to your query and partition on that, since it will be a contiguous sequence over the results.
An example of the SQL would look something like this (for MySQL):
SELECT F.*,
       @rownum := @rownum + 1 AS rank
FROM FOO F,
     (SELECT @rownum := 0) r;
With that, the column rank would be a sequence autogenerated each time you run the query. From that value, you could partition the dataset. Since this isn't persistent, you'd need to do some gymnastics to get the right ids, but the basic logic of your Partitioner implementation would look something like this:
Run a count query to find out how many records there are in your data set.
Run a query using the above technique to figure out which db id falls at the start and end of each partition range (see the sketch after this list). This will give you the ids to filter by per partition.
Create a partition for each (start, end) pair using the actual db ids.
Set up your ItemReader to read only the items within the range of db ids provided.
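For step 2, the boundary query could look roughly like the following. This is a sketch only: it assumes MySQL, a table FOO with an id primary key, and a partition size of 1,000,000; rn is just an illustrative alias.
SELECT rn, id
FROM (
    SELECT F.id,
           @rownum := @rownum + 1 AS rn
    FROM FOO F,
         (SELECT @rownum := 0) r
    ORDER BY F.id
) ranked
-- keep the first row and every 1,000,000th row: these ids bound the partition ranges
WHERE rn = 1 OR MOD(rn, 1000000) = 0;
Each consecutive pair of returned ids then becomes the (minValue, maxValue) range that the Partitioner puts into one partition's ExecutionContext. On MySQL 8+, where assigning user variables in a SELECT is deprecated, ROW_NUMBER() OVER (ORDER BY id) is the cleaner way to compute rn.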
I am moving from MySQL to HBase due to increasing data volume.
I am designing the rowkey for an efficient access pattern.
I want to achieve 3 goals:
Get all results for an email address
Get all results for an email address + item_type
Get all results for a particular email address + item_id
I have 4 attributes to choose from
user email
reverse timestamp
item_type
item_id
What should my rowkey look like to get rows efficiently?
Thanks
Assuming your main access is by email, you can have your main table key be
email + reverse timestamp + item_id (assuming item_id gives you uniqueness)
You can have an additional "index" table with email + item_type + reverse timestamp + item_id and email + item_id as keys that map back to the first table (so retrieving by these is a two-step process).
Maybe you are already headed in the right direction with concatenated row keys; in any case, the following comes to mind from your post:
The partitioning key likely consists of your reverse timestamp plus the most frequently queried natural key - would that be the email? Let us suppose so: then choose the prefix based on which of the two (reverse timestamp vs. email) gives the more balanced, non-skewed distribution of your data. That keeps your region servers happier.
Choose based on the better-balanced distribution of records:
reverse timestamp plus the most frequently queried natural key
e.g. reversetimestamp-email
or email-reversetimestamp
In that manner you will avoid hot spotting on your region servers.
Good performance on additional (secondary) indexes is not "baked into" HBase yet: there is a design doc for it (look under SecondaryIndexing in the HBase wiki).
But you can build your own in a couple of ways:
a) use a coprocessor to write the item_type as the rowkey of a separate table, with a column containing the original fact-table rowkey (user_email-reversetimestamp, or vice versa)
b) if disk space is not an issue and/or the rows are small, just duplicate the entire row in the second (and, for the item_id case, third) table.