Azure Data Explorer partitioning strategy - aggregate

I have a table in Azure Data Explorer that collects data from IoT sensors. In the near future it will collect millions of records each day. So to get the best query performance I am looking into setting a partitioning policy: https://learn.microsoft.com/en-us/azure/data-explorer/kusto/management/partitioningpolicy
My table has 5 important columns: TenantId, DeviceId, SensorId, Value, Timestamp
The combination of (TenantId, DeviceId, VariableId) makes a sensor globally unique, and almost all queries will contain a part that says TenantId = 'xx' and DeviceId = 'xx' and VariableId = 'xx'. All these columns are of type string and have high cardinality (10,000+ Tenants, 1,000+ DeviceIds, 10,000+ VariableIds).
Two questions:
Would it be wise to apply partitioning on this table based on one or more of the string columns? It complies with the advice in the documentation that says:
The majority of queries use equality filters (==, in()).
The majority of queries aggregate/join on a specific string-typed column of large-dimension (cardinality of 10M or higher) such as an application_ID, a tenant_ID, or a user_ID.
But later on the page, for MaxPartitionCount, they say that it should not be higher than 1024 and should be lower than the cardinality of the column. As I have high-cardinality columns this does not comply, so I am a bit confused.
Would it be best to concat the string columns before ingestion and partition on the new column? Or only on TenantId for example?

almost all queries will contain a part that says TenantId = 'xx' and DeviceId = 'xx' and VariableId = 'xx'.
Given this (and assuming you don't frequently join on any of these 3 columns), you could extend your data set with a new column that is the concatenation of these 3 (e.g. strcat_delim("_", TenantId, DeviceId, VariableId)).
You can do this either before ingestion into Kusto (better), or by using an update policy at ingestion time.
Then, set that new column as the hash partition key in the table's data partitioning policy.
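A minimal sketch of what that could look like, assuming a raw ingestion table named SensorRaw and a target table named Sensors (both names are hypothetical, and the update policy is only needed if you don't add the column before ingestion):

// Hypothetical update policy function that derives the concatenated key at ingestion time
.create-or-alter function ExpandSensorKey() {
    SensorRaw
    | extend SensorKey = strcat_delim("_", TenantId, DeviceId, VariableId)
}

.alter table Sensors policy update @'[{"IsEnabled": true, "Source": "SensorRaw", "Query": "ExpandSensorKey()", "IsTransactional": false}]'

// Hash-partition the target table on the new column; 256 is the documented starting value for MaxPartitionCount
.alter table Sensors policy partitioning @'{"PartitionKeys": [{"ColumnName": "SensorKey", "Kind": "Hash", "Properties": {"Function": "XxHash64", "MaxPartitionCount": 256, "Seed": 1, "PartitionAssignmentMode": "Uniform"}}]}'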
for MaxPartitionCount they say that it should not be higher than 1024 and should be lower than the cardinality of the column. As I have high-cardinality columns this does not comply, so I am a bit confused.
Let's assume you have a cluster with 20 nodes, a column C with cardinality 10,000,000, and you want to set it as the table's hash partition key.
Following the guidelines in the documentation regarding MaxPartitionCount:
Supported values are in the range (1,1024]. -> MaxPartitionCount should be larger than 1 and lower than or equal to 1024.
The value is expected to be larger than the number of nodes in the cluster -> MaxPartitionCount should be larger than 20.
The value is expected to be smaller than the cardinality of the column -> MaxPartitionCount should be lower than 10,000,000.
We recommend that you start with a value of 256.
Adjust the value as needed, based on the above considerations, or based on the benefit in query performance vs. the overhead of partitioning the data post-ingestion.
As I don't see any conflicting information here (256 > 1, 256 <= 1024, 256 > 20, 256 < 10M) - you may want to clarify where the confusion is coming from.

Related

Snowflake: clustering on datetime key stored in variant field does not work / do partition pruning

We are ingesting data into Snowflake via the kafka connector.
To increase data read performance / scan fewer partitions, we decided to add a clustering key on a key / combination of keys stored in the RECORD_CONTENT variant field.
The data in the RECORD_CONTENT field looks like this:
{
  "jsonSrc": {
    "Integerfield": 1,
    "SourceDateTime": "2020-06-30 05:33:08:345",
    *REST_OF_THE_KEY_VALUE_PAIRS*
  }
}
Now, the issue is that clustering on a datetime col like SourceDateTime does NOT work:
CLUSTER BY (to_date(RECORD_CONTENT:jsonSrc:loadDts::datetime))
...while clustering on a field like Integerfield DOES work:
CLUSTER BY (RECORD_CONTENT:jsonSrc:Integerfield::int )
Not working means: when using a filter on RECORD_CONTENT:jsonSrc:loadDts::datetime, it has no effect on the partitions scanned, while filtering on RECORD_CONTENT:jsonSrc:Integerfield::int does perform partition pruning.
What is wrong here? Is this a bug?
Note that:
There is enough data to do meaningful clustering on RECORD_CONTENT:jsonSrc:loadDts::datetime
I validated that clustering on RECORD_CONTENT:jsonSrc:loadDts::datetime works by making a copy of the raw table with RECORD_CONTENT:jsonSrc:loadDts::datetime in a separate column loadDtsCol and then adding a similar clustering key on that column: to_date(loadDtsCol).
For better pruning and less storage consumption, we recommend flattening your object and key data into separate relational columns if your semi-structured data includes:
Dates and timestamps, especially non-ISO 8601 dates and timestamps, as string values
Numbers within strings
Arrays
Non-native values such as dates and timestamps are stored as strings when loaded into a VARIANT column, so operations on these values could be slower and also consume more space than when stored in a relational column with the corresponding data type.
See this link: https://docs.snowflake.com/en/user-guide/semistructured-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure
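A minimal sketch of that flattening, assuming the source table is called RAW_EVENTS (a hypothetical name) and that loadDts parses cleanly as a timestamp; it materializes the value into a typed column and clusters the new table on its date, essentially what the validation copy described in the question already does:

-- Materialize loadDts into a typed column and cluster on it
CREATE OR REPLACE TABLE RAW_EVENTS_FLAT
  CLUSTER BY (TO_DATE(LOAD_DTS))
AS
SELECT
    RECORD_CONTENT,
    RECORD_CONTENT:jsonSrc:loadDts::timestamp_ntz AS LOAD_DTS
FROM RAW_EVENTS;

-- Filters on the typed column can now prune micro-partitions
SELECT COUNT(*)
FROM RAW_EVENTS_FLAT
WHERE TO_DATE(LOAD_DTS) = '2020-06-30';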

AWS RDS postgresql performance

We have around 90 million rows in a new PostgreSQL table in an RDS instance. It contains 2 numbers, start_num and end_num (bigint, mostly finance related), and details related to those numbers. The PK is on (start_num, end_num) and the table is CLUSTERed on it. The query will always be a range query: the input is a number, and the output is the range in which that number falls, along with the details. For example, there is a row with start_num = 112233443322 and end_num = 112233543322. The input comes in as 112233443645, so the row containing (112233443322, 112233543322) needs to be returned.
select start_num, end_num from ipinfo.ipv4 where input_value between start_num and end_num;
This always results in a sequential scan, and the PK is not used. I have tried creating separate indexes on start_num and on end_num desc, but there is not much change in query time. We are looking for a response time of less than 300 ms. Now I am wondering whether that is even possible in PostgreSQL for range queries on large data sets, or whether this is due to PostgreSQL running on AWS RDS.
Looking forward to some advice on steps to improve the performance.
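One common approach for this kind of containment lookup (not part of the original post, so purely a hypothetical sketch): store each (start_num, end_num) interval as a range type and index it with GiST, since a B-tree on (start_num, end_num) cannot efficiently serve a "between start_num and end_num" predicate.

-- Hypothetical sketch: add a range column, backfill it, and index it with GiST
ALTER TABLE ipinfo.ipv4 ADD COLUMN num_range int8range;
UPDATE ipinfo.ipv4 SET num_range = int8range(start_num, end_num, '[]');
CREATE INDEX ipv4_num_range_gist ON ipinfo.ipv4 USING gist (num_range);

-- Containment query that can use the GiST index
SELECT start_num, end_num
FROM ipinfo.ipv4
WHERE num_range @> 112233443645::bigint;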

Columnpartitioner for composite primary key tables?

In my case, I have to load huge data from one table to another (Tera to SQL Server). Using JdbcCursorItemReader, on average it takes 30 minutes to load 200,000 records, since the table has 40 columns. So I am planning to use the partition technique.
Below are the challenges
The table has a composite primary key (2 columns).
And one of the key columns contains negative values.
Is it possible to use the column-partition technique in this case?
I see that the column-partition technique uses a single primary key column and finds its max and min values. In my case, with a composite primary key, even if I figure out some way to determine the max, min, and grid size, will the framework support handling the composite primary key for partitioning?
A couple things to note here:
JdbcCursorItemReader is not thread safe so it typically isn't used in partitioning scenarios. Instead the JdbcPagingItemReader is used.
Your logic to partition the data is purely up to you. While doing it via values in a column is useful, it doesn't apply to all use cases (like this one). In this specific use case, you may want to partition by ROW_NUMBER() or something similar, or add a column to partition by.
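A rough SQL sketch of that ROW_NUMBER() idea, with SOURCE_TABLE, KEY_COL1 and KEY_COL2 as hypothetical placeholders for the real table and composite key columns; each partition reads a contiguous band of row numbers, so the composite key and its negative values no longer matter:

-- Each partition's reader is bound to its own :fromRow / :toRow band
SELECT *
FROM (
    SELECT t.*,
           ROW_NUMBER() OVER (ORDER BY KEY_COL1, KEY_COL2) AS RN
    FROM SOURCE_TABLE t
) numbered
WHERE RN BETWEEN :fromRow AND :toRow;

The partitioner then only has to compute the total row count and split it into equal :fromRow / :toRow bands.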

Cassandra CQL request

I have a little problem with Cassandra performance when I use a select query with a condition, for example:
SELECT name from Perso where age = 18
It takes too much time, and when the table reached 1M rows I got a TimedOutException.
Can I use pagination in this case? If yes, how do I use it with a condition in the request?
Cassandra is quick at where clauses if there is low cardinality (i.e. number of rows) in the data, and is notoriously slow when there is a high cardinality.
The Cassandra docs suggest using one column family to store data and one or more other column families to act as indexes for that data.
So for example for your issue you could have two column families - one for Person and another index column family to map an age to a list of names. You can query this second table using the age as the key, and have the list of names returned to you. You can then use the individual returned names to query whatever data you want in the Person column family.
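In modern CQL terms, a hypothetical sketch of that index-table approach (table and column names are assumed, not from the original post):

-- Main table, keyed by a person id
CREATE TABLE perso (
    person_id uuid PRIMARY KEY,
    name text,
    age int
);

-- "Index" table mapping an age to the names with that age
CREATE TABLE perso_by_age (
    age int,
    name text,
    PRIMARY KEY (age, name)
);

-- Querying by age now reads a single partition instead of scanning the whole table
SELECT name FROM perso_by_age WHERE age = 18;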

hbase rowkey design

I am moving from mysql to hbase due to increasing data.
I am designing rowkey for efficient access pattern.
I want to achieve 3 goals.
Get all results for an email address
Get all results for an email address + item_type
Get all results for a particular email address + item_id
I have 4 attributes to choose from
user email
reverse timestamp
item_type
item_id
What should my rowkey look like to get rows efficiently?
Thanks
Assuming your main access is by email, you can have your main table key as
email + reverse time + item_id (assuming item_id gives you uniqueness)
You can have an additional "index" table with email + item_type + reverse time + item_id and email + item_id as keys that map to the first table (so retrieving by these is a two-step process)
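A small HBase shell sketch of what such keys could look like (the table name 'items', column family 'd', and all sample values are hypothetical); the reverse timestamp is written fixed-width so that keys for one email sort newest-first:

put 'items', 'alice@example.com|9223370399999999999|item001', 'd:item_type', 'book'
put 'items', 'alice@example.com|9223370399999999123|item002', 'd:item_type', 'movie'

# all items for one email address, newest first, via a prefix scan
scan 'items', {ROWPREFIXFILTER => 'alice@example.com|'}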
Maybe you are already headed in the right direction as far as concatenated row keys go; in any case, the following comes to mind from your post:
The partitioning key likely consists of your reverse timestamp plus the most frequently queried natural key - would that be the email? Let us suppose so: then choose the prefix based on which of the two (reverse timestamp vs email) provides the most balanced / non-skewed distribution of your data. That makes your region servers happier.
Choose based on better balanced distribution of records:
reverse timestamp plus most frequently queried natural key
e.g. reversetimestamp-email
or email-reversetimestamp
In that manner you will avoid hot spotting on your region servers.
To obtain good performance on the additional (secondary) indexes: that is not "baked into" HBase yet; there is a design doc for it (look under SecondaryIndexing in the wiki).
But you can build your own a couple of ways:
a) use a coprocessor to write the item_type as the rowkey of a separate table, with a column containing the original (user_email-reverse timestamp, or vice versa) fact table rowkey
b) if disk space is not an issue and/or the rows are small, just go ahead and duplicate the entire row in the second (and third, for the item_id case) table.