I have over 800 items in AWS DynamoDB.
I configured id as the partition key; its format is a four-digit number stored as a String
(e.g. 0001, 0050, 0800).
But when I read the items back, they are not in the order I expected.
How can I get them back sorted by id?
DynamoDB does not sort the items it returns to you by their partition key. If you want items to be sorted by field_x, you need to define field_x as a sort key.
However, as long as you are only going to store small amounts of data (800 items is certainly considered "small", or even "tiny", from DynamoDB's point of view), you can just do the sorting on your side.
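For example, here is a minimal client-side sketch using boto3 (the table name is a placeholder). Because the ids are zero-padded, fixed-width strings, a plain lexicographic sort matches numeric order:

import boto3

table = boto3.resource('dynamodb').Table('MyTable')  # hypothetical table name

# Scan the whole table, following pagination for result sets over 1 MB.
items = []
response = table.scan()
items.extend(response['Items'])
while 'LastEvaluatedKey' in response:
    response = table.scan(ExclusiveStartKey=response['LastEvaluatedKey'])
    items.extend(response['Items'])

# Zero-padded ids ("0001" < "0050" < "0800") sort correctly as strings.
items.sort(key=lambda item: item['id'])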
We are ingesting data into Snowflake via the Kafka connector.
To increase read performance / scan fewer partitions, we decided to add a clustering key on a key / combination of keys stored in the RECORD_CONTENT variant field.
The data in the RECORD_CONTENT field looks like this:
{
  "jsonSrc": {
    "Integerfield": 1,
    "SourceDateTime": "2020-06-30 05:33:08:345",
    *REST_OF_THE_KEY_VALUE_PAIRS*
  }
}
Now, the issue is that clustering on a datetime column like SourceDateTime does NOT work:
CLUSTER BY (to_date(RECORD_CONTENT:jsonSrc:loadDts::datetime))
...while clustering on a field like Integerfield DOES work:
CLUSTER BY (RECORD_CONTENT:jsonSrc:Integerfield::int)
"Not working" means: a filter on RECORD_CONTENT:jsonSrc:loadDts::datetime has no effect on the partitions scanned, while a filter on RECORD_CONTENT:jsonSrc:Integerfield::int does perform partition pruning.
What is wrong here? Is this a bug?
Note that:
There is enough data to do meaningful clustering on RECORD_CONTENT:jsonSrc:loadDts::datetime
I validated that clustering on RECORD_CONTENT:jsonSrc:loadDts::datetime works by making a copy of the raw table with RECORD_CONTENT:jsonSrc:loadDts::datetime in a separate column loadDtsCol, and then adding a similar clustering key on that column: to_date(loadDtsCol).
For better pruning and less storage consumption, we recommend flattening your object and key data into separate relational columns if your semi-structured data includes:
Dates and timestamps, especially non-ISO 8601 dates and timestamps, as string values
Numbers within strings
Arrays
Non-native values such as dates and timestamps are stored as strings
when loaded into a VARIANT column, so operations on these values could
be slower and also consume more space than when stored in a relational
column with the corresponding data type.
See this link: https://docs.snowflake.com/en/user-guide/semistructured-considerations.html#storing-semi-structured-data-in-a-variant-column-vs-flattening-the-nested-structure
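Following that recommendation, here is a minimal sketch of the workaround using the Snowflake Python connector (the connection parameters and the table names raw_events / events_flattened are placeholders): materialize loadDts into a typed column, then cluster on it.

import snowflake.connector

# All connection parameters below are placeholders.
conn = snowflake.connector.connect(
    account='...', user='...', password='...',
    warehouse='...', database='...', schema='...')
cur = conn.cursor()

# Materialize the string timestamp into a typed relational column...
cur.execute("""
    CREATE OR REPLACE TABLE events_flattened AS
    SELECT RECORD_CONTENT,
           RECORD_CONTENT:jsonSrc:loadDts::timestamp AS load_dts
    FROM raw_events
""")

# ...and cluster on the typed column, so that pruning behaves as expected.
cur.execute("ALTER TABLE events_flattened CLUSTER BY (TO_DATE(load_dts))")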
I have an Apache Phoenix table with a composite rowkey (key1, key2).
key1 = sequence number (unique)
key2 = date with timestamp (non-unique)
Now when I search with key1 alone, results come back very quickly, even with 10 million records.
But when I use only key2, it slows down.
My question is: how does a composite rowkey work in Phoenix? And what is the correct way to scan/filter based on individual keys that are part of the composite rowkey?
Since key1 is a sequence I don't know its value, so if I have to filter using only key2, which is a timestamp, what is the best way of doing it?
Just in case other people come across this:
Phoenix is scanning HBase where keys are sorted in lexicographic order like in this simplified example:
100_2021:04:01
200_2021:03:01
300_2021:02:01
where the key starts with a 3-digit sequence number (100, 200, 300) followed by a simplified date.
As you can see, the initial portion (the sequence number) is ascending even though the dates might be descending in this example. The order here is important. If you want to find all entries from '2021:02:01', Phoenix still has to scan the entire cluster, because the sequence number could be anything. So you don't want a query that is effectively "*_date"; instead, always lead with the known part of the key and, at most, leave the end open.
Depending on your case you probably want to put the date first and the sequence number at the end. Then you can look up all items for a specific date. To avoid hot-spotting you might want a salt or something else at the start of your key, though (see the sketch below).
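A minimal sketch of that layout using the phoenixdb Python client against a Phoenix Query Server (the URL and the table name events are made up):

import datetime
import phoenixdb

# The Query Server URL is a placeholder.
conn = phoenixdb.connect('http://localhost:8765/', autocommit=True)
cur = conn.cursor()

# Lead the primary key with the timestamp so date-range scans can prune;
# salting spreads the monotonically increasing dates across region servers.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        event_ts TIMESTAMP NOT NULL,
        seq_id BIGINT NOT NULL
        CONSTRAINT pk PRIMARY KEY (event_ts, seq_id)
    ) SALT_BUCKETS = 8
""")

# This becomes a range scan over the leading key part, not a full scan.
cur.execute(
    "SELECT seq_id FROM events WHERE event_ts >= ? AND event_ts < ?",
    (datetime.datetime(2021, 2, 1), datetime.datetime(2021, 2, 2)),
)
rows = cur.fetchall()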
We have around 500k keys maintained by one memcache server. Around 499k keys are stored in one Slab [it is always Slab #8].
The key names have this format: BarData:Currency[0099]YYYY-MM-DD_HH:MM:SS
Currency is one of 23 different expressions [$EURUSD, $GBPUSD, ...]
The [] hold a 4-digit number which cycles through 0001, 0003, 0005, 0010, 0015, 0030, 0060, 0090 and 0120.
The datetime parts are very similar, because the data is saved over an ascending, continuous date range.
Does this affect performance when accessing the memcache keys, and should we consider changing the key names in order to spread them over more Slabs, or can we leave it as it is?
According to this answer https://stackoverflow.com/a/10139350, memcache stores items of equal size in the same Slab. In my case, hashing the key name would not change the Slab, because all items have the same size.
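For intuition, here is a rough model of how memcached picks a slab class purely from item size, using the default growth factor of 1.25; the base chunk size below is an assumption, so check the output of "stats slabs" on your server for the real numbers:

def slab_class(item_size, base_chunk=96, growth_factor=1.25):
    """Return the slab class an item of item_size bytes falls into."""
    # base_chunk=96 is an assumption; real servers report their chunk
    # sizes via the "stats slabs" command.
    chunk, cls = base_chunk, 1
    while chunk < item_size:
        chunk = int(chunk * growth_factor)
        cls += 1
    return cls

# Key names never enter the calculation: two items with the same total
# size (key + value + overhead) always land in the same slab class.
print(slab_class(450), slab_class(460))  # near-equal sizes, same class

Note that slab choice affects memory allocation, not lookup speed: reads go through the hash table regardless of which slab holds the item.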
I am moving from mysql to hbase due to increasing data.
I am designing rowkey for efficient access pattern.
I want to achieve 3 goals.
Get all results for an email address
Get all results for an email address + item_type
Get all results for a particular email address + item_id
I have 4 attributes to choose from
user email
reverse timestamp
item_type
item_id
What should my rowkey look like to get rows efficiently?
Thanks
Assuming your main access is by email you can have your main table key as
email + reverse time + item_id (assuming item_id gives you uniqueness)
You can have an additional "index" table with email+item_type+reverse time+item_id and email+item_id as keys that map back to the first table (so retrieving by these is a two-step process).
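A small sketch of that main-table layout in Python with the happybase client (the host, table, and column names are made up); the fixed-width reverse timestamp keeps rows newest-first within each email prefix:

import time
import happybase

connection = happybase.Connection('hbase-host')  # hypothetical host
table = connection.table('user_items')           # hypothetical table

def make_rowkey(email, ts_seconds, item_id):
    # Reverse the timestamp so newer rows sort first under the same email.
    reverse_ts = (2**63 - 1) - int(ts_seconds * 1000)
    return f'{email}|{reverse_ts:019d}|{item_id}'.encode()

table.put(make_rowkey('a@b.com', time.time(), 'item42'),
          {b'd:item_type': b'book'})

# "All results for an email address" is then a plain prefix scan.
for key, data in table.scan(row_prefix=b'a@b.com|'):
    print(key, data)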
Maybe you are already headed in the right direction as far as concatenated row keys go; in any case, the following comes to mind from your post:
The partitioning key likely consists of your reverse timestamp plus the most frequently queried natural key - would that be the email? Let us suppose so: then choose the prefix based on which of the two (reverse timestamp vs. email) provides the more balanced / less skewed distribution of your data. That makes your region servers happier.
Choose based on better balanced distribution of records:
reverse timestamp plus most frequently queried natural key
e.g. reversetimestamp-email
or email-reversetimestamp
In that manner you will avoid hot spotting on your region servers.
Good performance on the additional (secondary) indexes is not "baked into" HBase yet: there is a design doc for it (look under SecondaryIndexing in the wiki).
But you can build your own in a couple of ways:
a) use a coprocessor to write the item_type as the rowkey to a separate table, with a column containing the original (user_email-reverse timestamp, or vice versa) fact table rowkey
b) if disk space is not an issue and/or the rows are small, just go ahead and duplicate the entire row in the second (and third, for the item_id case) table, as sketched below.
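A bare-bones illustration of option (b) with happybase (all host, table, and key values here are invented):

import happybase

connection = happybase.Connection('hbase-host')   # hypothetical host

main_key = b'a@b.com|9223370000000000000|item42'  # email|reverse_ts|item_id
row = {b'd:item_type': b'book', b'd:payload': b'...'}

# Write the fact row, then duplicate it under keys that serve the other
# two access patterns (email+item_type and email+item_id).
connection.table('user_items').put(main_key, row)
connection.table('user_items_by_type').put(b'a@b.com|book|item42', row)
connection.table('user_items_by_id').put(b'a@b.com|item42', row)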
(Not sure what it's called... model... schema... super model?)
I have 'n' (uniquely ID'd) sensors in 'm' (uniquely ID'd) homes. Each of these fires 0 to 'k' times per day (in blocks of 1-5). This data is currently stored in MySQL, with a table for each 'home' and a structure of:
time stamp
sensor id
firing count
I'm having trouble wrapping my mind around a 'nosql' model of this data that would allow me to find counts of firings by home, time, or sensor.
...Or maybe this isn't the right kind of data to push to NoSQL? Our current server is bogging down under the load (hundreds of millions of rows x hundreds of homes). I'm very interested in finding a data store that allows the scalability of Cassandra.
It depends. Think "Query first" approach:
identify the queries
model the data
So, while you might have a Column Family which is your physical model, you will also have one or more which provide the data as it is queried. And, you can further take advantage of Cassandra features, such as:
Column Names can contain data. You don't have to store a value; each of the names could be a timestamp, for example
It is well suited to storing thousands of columns for each key, and the columns remain sorted and can be accessed in forward or reverse order; so, to continue the above example, you can easily get a list of all timestamps for a sensor
Composite data types allow you to combine multiple bits of data into keys, names, or values, e.g. combine house id and sensor id
Counter Columns provide a simple value increment, even for the initial value, so recording a firing is always just a write operation
Indexes can be defined on static column names, which in effect provides a reverse Column Family with the key as the result; just be careful of bucket size (e.g. you might not want values down to the millisecond)
To store firing count by sensor and house:
House_Sensors <-Column family
house_id <-Key
sensor_id <-Column name
firing_count <-Column value
Data represented in JSON-ish notation
House_Sensors = {
house_1 : {
sensor_1: 3436,
sensor_2: 46,
sensor_3: 99,
...
},
house_2 : {
sensor_7: 0,
sensor_8: 444,
...
},
...
}
You may want to define another column family, with sensor_id as the key, to store the firing timestamps.
Think about what queries you need when designing the schema and denormalize as needed. Repeating data is fine; Cassandra inserts are very fast.
The timestamp of the firing is not stored in the House_Sensors column family. Create a new column family for that, with sensor_id as the key.
This way you can use the House_Sensors family to query firing counts and which sensors belong to each house, and the other column family to query the firing timestamps.
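In modern CQL terms, a minimal sketch of the House_Sensors idea with the Python cassandra-driver (the contact point and keyspace name are placeholders):

from cassandra.cluster import Cluster

# Contact point and keyspace are placeholders.
session = Cluster(['127.0.0.1']).connect('sensors')

# house_id is the partition key and sensor_id a clustering column, so all
# of a house's sensors live sorted inside a single partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS house_sensors (
        house_id text,
        sensor_id text,
        firing_count counter,
        PRIMARY KEY (house_id, sensor_id)
    )
""")

# Recording a firing is always just a write: counters need no prior read.
session.execute(
    "UPDATE house_sensors SET firing_count = firing_count + %s "
    "WHERE house_id = %s AND sensor_id = %s",
    (3, 'house_1', 'sensor_1'),
)

# Firing counts for every sensor in a house is a single-partition query.
for row in session.execute(
        "SELECT sensor_id, firing_count FROM house_sensors "
        "WHERE house_id = %s", ('house_1',)):
    print(row.sensor_id, row.firing_count)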