dynamodb access pattern for timeseries data - nosql

The AWS documentation for DynamoDB describes best practices for storing time-series data (https://docs.aws.amazon.com/en_pv/amazondynamodb/latest/developerguide/bp-time-series.html).
What is the best practice for accessing this data when you want the following access pattern:
Get rows between two timestamps that don't fall on a whole unit like an hour or a week, e.g. from 2019-11-03 22:01:50 to 2019-11-04 04:10:35. You cannot do a range query on the hash key, and inserting a dummy hash key sounds like a bad idea. If I use 2019-11-03 as the hash key, then I have to query 2019-11-03 first and then 2019-11-04, which also sounds like a bad solution.

I wouldn't call multiple queries a "bad" solution...
In fact, since you should be able to run them in parallel, the overall response time might be less than if you could do it with a single query.
You don't provide any estimate of the number of reads/writes per second, or whether you have other access requirements, so it's hard to say what would work best.
I will point out that the AWS time-series example supports a very high volume of data flowing into the table.
If your volume isn't that high, you might be able to use a hash key of YYYY-MM instead of YYYY-MM-DD. The sort key would then be DD HH:MM:SS.xxxxxx
You'd still need multiple queries if you query across months...
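To illustrate the multi-query approach, here is a minimal boto3 sketch (the table name, key names, and timestamp format are hypothetical, not from the AWS doc): one partition per day, the full timestamp as the sort key, and one query per day bucket, run in parallel and merged client side.

import concurrent.futures
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("sensor_events")  # hypothetical table

def query_day(day, start_ts, end_ts):
    # Hash key "day" = one partition per day, sort key "ts" = full timestamp string.
    resp = table.query(
        KeyConditionExpression=Key("day").eq(day) & Key("ts").between(start_ts, end_ts)
    )
    return resp["Items"]  # pagination (LastEvaluatedKey) omitted for brevity

# 2019-11-03 22:01:50 .. 2019-11-04 04:10:35 spans two day partitions.
jobs = [
    ("2019-11-03", "2019-11-03 22:01:50", "2019-11-03 23:59:59"),
    ("2019-11-04", "2019-11-04 00:00:00", "2019-11-04 04:10:35"),
]
with concurrent.futures.ThreadPoolExecutor() as pool:
    items = [item for chunk in pool.map(lambda j: query_day(*j), jobs) for item in chunk]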

Related

How to store aggregated data in kdb+

I'm facing an architecture question: what strategy should I choose for storing aggregated data?
I know that in some time-series DBs, like RRDtool, it is normal to have several DB layers storing 1H, 1W, 1M and 1Y aggregated data.
Is it normal practice to use the same strategy for kdb+: several HDBs with date/month/year/int (for week and other) partitions, with a rule on the gateway for routing queries to the appropriate source?
As an alternative, I have in mind storing all the data in a single HDB in tables like tablenameagg, but that doesn't look as clean to me as several HDBs.
What points should I take into account when deciding?
It's hard to give a general answer, as requirements differ for everyone, but I can say that in my experience the normal practice is to have a single date-partitioned HDB, as this can accommodate the widest range of historical datasets. In terms of increasing granularity of aggregation:
Full tick data - works best as date-partitioned with `p# on sym
Minutely aggregated data - still works well as date-partitioned with `p# on either sym or minute, `g# on either minute or sym
Hourly aggregated data - could be either date-partitioned or splayed depending on volume. Again you can have some combination of attributes on the sym and/or the aggregated time unit (in this case hour)
Weekly aggregated data - given how much this would compress the data you're now likely looking at a splayed table in this date-partitioned database. Use attributes as above.
Monthly/Yearly aggregated data - certainly could be splayed and possibly even flat given how small these tables would be. Attributes almost unnecessary in the flat case.
Maintaining many different HDBs with different partition styles would seem like overkill to me. But again, it all depends on the situation, the volumes of data involved, and the expected usage pattern of the data.

Search Engine Database (Cassandra) & Best Practice

I'm currently storing rankings in MongoDB (+ Node.js as the API). It's now at 10 million records, which is okay for now, but the dataset will grow drastically in the near future.
At this point I see two options:
MongoDB Sharding
Change Database
The queries performed on the database will not be text searches, but for example:
domain, keyword, language, start date, end date
keyword, language, start date, end date
A rank contains:
1. domain
2. url
3. keyword
4. keyword language
5. position
6. date (unix)
The requirement is to be able to query and analyze the data without caching. For example, get all data for domain x between dates y and z and analyze it.
I'm noticing a performance decrease lately, so I'm looking into other databases. The one that seems to fit the job best is Cassandra. I did some testing and it looked promising; performance is good. Using Amazon EC2 + Cassandra seems a good solution, since it's easily scalable.
Since I'm no expert on Cassandra, I would like to know if Cassandra is the way to go. Secondly, what would be the best practice / database model?
Make a collection for (simplified):
domains (domain_id, name)
keywords (keyword_id, name, language)
rank (domain_id, keyword_id, position, url, unix)
Or put all in one row:
domain, keyword, language, position, url, unix
Any tips, insights would be greatly appreciated.
Cassandra relies heavily on query-driven modelling. It's very restrictive in how you can query, but it is possible to fit an awful lot of requirements within those capabilities. For any large-scale database, knowing your queries is important, but with Cassandra it's almost vital.
Cassandra has the notion of primary keys. Each primary key consists of one or more keys (read: columns). The first column (which may be a composite) is referred to as the partition key. Cassandra keeps all "rows" for a partition in the same place (on disk, in memory, etc.), and a partition is the unit of replication, etc.
Additional keys in the primary key are called clustering keys. Data within a partition are ordered according to successive clustering keys. For instance, if your primary key is (a, b, c, d) then data will be partitioned by hashing a, and within a partition, data will be ordered by b, c and d.
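As a minimal sketch of that (a, b, c, d) primary key using the DataStax Python driver (the contact point, keyspace, and column types are made up):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")  # assumed contact point and keyspace
session.execute("""
    CREATE TABLE IF NOT EXISTS tbl (
        a text, b int, c int, d int, v text,
        PRIMARY KEY ((a), b, c, d)  -- a = partition key; b, c, d = clustering keys
    )
""")
# All rows sharing the same a live in one partition, stored ordered by b, then c, then d.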
For efficient querying, you must hit one (or very few) partitions. So your query must have a partition key. This MUST be exact equality (no starts with, contains, etc.). Then you need to filter down to your targets. This can get interesting too:
Your query can specify exact equality conditions for successive clustering keys, and a range (or equality) for the last key in your query. So, in the previous example, this is allowed:
select * from tbl where a=a1 and b=b1 and c > c1;
This is not:
select * from tbl where a=a1 and b>20 and c=c1;
[You can use allow filtering for this]
or
select * from tbl where a=a1 and c > 20;
Once you understand the data storage model, this makes sense. One of the reasons Cassandra is so fast for queries is that it pinpoints data in a range and splats it out. If it needed to pick and choose, it'd be slower. You can always grab data and filter client side.
You can also have secondary indexes on columns. These would allow you to filter on exact equality on non-key columns. Be warned, never use a query with a secondary index without specifying a partition key. You'll be doing a cluster query which will time out in real usage. (The exception is if you're using Spark and locality is being honoured, but that's a different thing altogether).
In general, it's good to limit partition sizes to less than 100 MB, or at most a few hundred MB. Any larger and you'll have problems. Usually, a need for larger partitions suggests a bad data model.
Quite often you'll need to denormalise data into multiple tables to satisfy all your queries quickly. If your model lets you answer all your needs with the fewest possible tables, that's a really good model. Often that won't be possible, though, and denormalisation will be necessary. For your question, whether or not everything goes in one row depends on whether you can still query it that way and keep partition sizes under roughly 100 MB.
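As a concrete, purely hypothetical sketch for the ranking data in the question: partition by (domain, keyword, language) and cluster by date, so "all data for domain x between dates y and z" becomes a single-partition range scan. A second table partitioned by (keyword, language) could serve the query that omits the domain.

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("rankings")  # assumed keyspace
session.execute("""
    CREATE TABLE IF NOT EXISTS rank_by_domain_keyword (
        domain text, keyword text, language text,
        ts timestamp, position int, url text,
        PRIMARY KEY ((domain, keyword, language), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")
rows = session.execute(
    "SELECT ts, position, url FROM rank_by_domain_keyword "
    "WHERE domain = %s AND keyword = %s AND language = %s AND ts >= %s AND ts < %s",
    ("example.com", "nosql", "en", datetime(2015, 1, 1), datetime(2015, 2, 1)),
)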
For OLTP, Cassandra will be awesome IF you can build a data model that works the way Cassandra does. Quite often OLAP requirements won't be satisfied by this. The current tool of choice for OLAP with Cassandra data is the DataStax Spark connector + Apache Spark. It's quite simple to use, and really powerful.
That's quite a brain dump, but it should give you some idea of the things you might need to learn if you intend to use Cassandra for a real-world project. I'm not trying to put you off Cassandra; it's an awesome data store. But you have to learn what it's doing to harness its power. It works very differently from Mongo, and you should expect a shift in mindset when switching. It's most definitely NOT like switching from MySQL to SQL Server.

Looking for an architecture that supports streaming counting, sketching and large set intersections

I wonder if the Stack Overflow community could help me by suggesting a technology (e.g. HBase, Riak, Cassandra, etc.) that would solve my problem. I have a large dataset, on the order of tens of terabytes, which we would like to update and query in real time. Our dataset is a pixel stream which contains a user ID and one or more features (usually around 10). The total number of possible features is in the millions.
We are imagining our data model would look like:
FEATUREID_TO_USER_TABLE:
Feature id -> {UserID Hash, UserID Hash, ...}
FEATUREID_TO_COUNTER_TABLE:
feature id -> { Hour since epoch -> HyperLogLog byte blob }
We would like to keep a set of user IDs sorted by the hash of the user ID. We would also like to keep at most ~200k IDs per FEATUREID_TO_USER_TABLE entry, evicting existing IDs when a new ID has a lower hash value.
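That "keep only the lowest-hash user IDs per feature, up to a cap" behaviour is essentially a bottom-k sketch, and the eviction logic is independent of whichever store ends up holding it. A minimal in-memory Python sketch (the 200k cap and hash-based ordering come from the question; the hash choice and names are made up):

import hashlib
import heapq

K = 200_000  # cap per feature entry, from the question

def user_hash(user_id: str) -> int:
    # Any well-mixed hash works; SHA-1 truncated to 64 bits is just an example.
    return int.from_bytes(hashlib.sha1(user_id.encode()).digest()[:8], "big")

class BottomKUserSet:
    """Keeps the K user IDs with the lowest hash values seen so far."""
    def __init__(self, k: int = K):
        self.k = k
        self.heap = []        # max-heap via negated hashes: root holds the largest retained hash
        self.members = set()

    def add(self, user_id: str) -> None:
        if user_id in self.members:
            return
        h = user_hash(user_id)
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, (-h, user_id))
            self.members.add(user_id)
        elif h < -self.heap[0][0]:  # new ID hashes lower than the current worst: evict the worst
            _, evicted = heapq.heapreplace(self.heap, (-h, user_id))
            self.members.discard(evicted)
            self.members.add(user_id)

The per-hour HyperLogLog blobs in FEATUREID_TO_COUNTER_TABLE could be maintained client side in a similar way (e.g. with a library such as datasketch) and written back as bytes.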
We would like the store to support the following operations (not necessarily expressed in SQL):
select FeatureID, count(FeatureID) from FEATUREID_TO_USER_TABLE where UserID in
(select UserID from FEATUREID_TO_USER_TABLE where FeatureID = 1234)
group by FeatureID;
And
update FEATUREID_TO_COUNTER_TABLE set HyperLogLog = NewBinaryValue where FEATUREID_TO_COUNTER_TABLE.id = 567
We believe the easiest way to shard this data across machines is by User ID.
Thanks for any ideas,
Mark
Cassandra is a great choice for persisting the data, but you'll want something else for processing it in real-time. I recommend you check out Storm, as it gives you real-time streaming data processing with relative ease. It's an open source framework that handles concurrency and parallelization for you. It's written on the JVM, but has language bindings for a variety of non-JVM languages as well.
I am not sure I understand your whole description though so I am shooting in the dark a bit on context.
Is there any way to partition your data so you can query into a partition? This helps a lot with scalability and querying as you scale. You typically don't want to query into too large a table, so instead you query into a partition.
E.g. PlayOrm has partitioning capabilities on Cassandra, so you can query into one partition.
While PlayOrm also has join queries, it does not do sub-selects at this time; typically clients just make a first call into the NoSQL store, aggregate the results, and make a second query, and that is still very fast (probably about as fast as a single call, since even Cassandra would have to make two calls internally to the other servers anyway).
Hmm, the more I read your post, the less sure I am that you need SQL there, as you may be able to do everything by primary key, but I'm not 100% sure. That SQL is confusing: it seems to grab all the user IDs in the row and then just count them, since the same table appears in both the select and the sub-select?
As far as sharding your data goes, you don't need to do anything, since Cassandra does that automatically.

Cassandra column key auto increment

I am trying to understand Cassandra and how to structure my column families (CFs), but it's quite hard since I am used to relational databases.
For example, if I create a simple users CF and try to insert a new row, how can I get an auto-incrementing key like in MySQL?
I've seen a lot of examples where you would just use the username instead of a unique ID, and that would make some sense, but what if I want to allow duplicate usernames?
Also, how can I do searches, given that from what I understand Cassandra does not support the > operator? Something like select * from users where something > something2 would not work.
And probably the most important question: what about grouping? Would I need to retrieve all the data and then filter it in whatever language I'm using? I think that would slow down my system a lot.
So basically I need a brief explanation of how to get started with Cassandra.
Your questions are quite general, but let me take a stab at it. First, you need to model your data in terms of your queries. With an RDBMS, you model your data in some normalized form, then optimize later for your specific queries. You cannot do this with Cassandra; you must write your data the way you intend to read it. Often this means writing it more than one way. In general, it helps to completely shed your RDBMS thinking if you want to work effectively with Cassandra.
Regarding keys:
They are used in Cassandra as the unit of distribution across the ring. So your key will get hashed and assigned an "owner" in the ring. Use the RandomPartitioner to guarantee even distribution
Presuming you use RandomPartitioner (you should), keys are not sorted. This means you cannot ask for a range of keys. You can, however, ask for a list of keys in a single query.
Keys are relevant in some models and not in others. If your model requires query-by-key, you can use any unique value that your application is aware of (such as a UUID). Sometimes keys are sentinel values, such as a Unix epoch representing the start of the day. This allows you to hand Cassandra a bunch of known keys, then get a range of data sorted by column (see below).
Regarding query predicates:
You can get ranges of data presuming you model it correctly to answer your queries.
Since columns are written in sorted order, you can query a range from column A to column n with a slice query (which is very fast). You can also use composite columns to abstract this mechanism a bit.
You can use secondary indexes on columns where you have low cardinality--this gives you query-by-value functionality.
You can create your own indexes where the data is sorted the way you need it.
Regarding grouping:
I presume you're referring to creating aggregates. If you need your data in real-time, you'll want to use some external mechanism (like Storm) to track data and constantly update your relevant aggregates into a CF. If you are creating aggregates as part of a batch process, Cassandra has excellent integration with Hadoop, allowing you to write map/reduce jobs in Pig, Hive, or directly in your language of choice.
To your first question:
can i make incremental key like in mysql
No, not really -- not natively in Cassandra. For how to create auto-increment IDs in Cassandra, you could check here for more information: http://srinathsview.blogspot.ch/2012/04/generating-distributed-sequence-number.html
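The usual substitute is to have the client generate a unique (and, if useful, roughly time-ordered) key itself, e.g. a TimeUUID. A quick sketch using only Python's standard library:

import uuid

# A version-1 ("time") UUID encodes a timestamp plus the node, so it is unique
# across clients without any coordination -- the common stand-in for auto-increment.
row_key = uuid.uuid1()

# If you don't care about time ordering, a random UUID works just as well.
alt_key = uuid.uuid4()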
Your second question is more about how you store and model your Cassandra data.
Check out Stack Overflow's search option. Lots of interesting questions!
Switching from MySQL to Cassandra - Pros/Cons?
Cassandra Data Model
Cassandra/NoSQL newbie: the right way to model?
Apache Cassandra schema design
Knowledge sources for Apache Cassandra
Most importantly, When NOT to use Cassandra?
You may want to check out PlayOrm. While I agree you need to break out of RDBMS thinking, sometimes having your primary key be the userid is just the wrong choice, and sometimes it is the right choice (it depends on your requirements).
PlayOrm is a mix of NoSQL and relational concepts, as you need both, and you can do Scalable-SQL with joins and everything. You just need to partition the tables you believe will grow into the billions/trillions of rows, and then you can query into those partitions. Even with CQL, you need to partition your tables. What can you partition by? Time is good for some use cases. Others can be partitioned by client, as each client is really a mini-database in your NoSQL cluster.
As far as keys go, PlayOrm generates unique "cluster" keys of the form hostname-uniqueIdInThatHost, basically like a TimeUUID except quite a bit shorter and more readable, since we use hostnames like a1, a2, a3, etc. in our cluster.

realtime querying/aggregating millions of records - hadoop? hbase? cassandra?

I have a solution that can be parallelized, but I don't (yet) have experience with Hadoop/NoSQL, and I'm not sure which option is best for my needs. In theory, if I had unlimited CPUs, my results should return instantaneously. So, any help would be appreciated. Thanks!
Here's what I have:
1000s of datasets
dataset keys:
all datasets have the same keys
1 million keys (this may later be 10 or 20 million)
dataset columns:
each dataset has the same columns
10 to 20 columns
most columns hold numerical values that we need to aggregate (avg, stddev, plus statistics calculated in R)
a few columns are "type_id" columns, since in a particular query we may want to include only certain type_ids
web application
user can choose which datasets they are interested in (anywhere from 15 to 1000)
application needs to present: key, and aggregated results (avg, stddev) of each column
updates of data:
an entire dataset can be added, dropped, or replaced/updated
would be cool to be able to add columns. But, if required, can just replace the entire dataset.
never add rows/keys to a dataset - so don't need a system with lots of fast writes
infrastructure:
currently two machines with 24 cores each
eventually, want ability to also run this on amazon
I can't precompute my aggregated values, but since each key is independent, this should be easily scalable. Currently, I have this data in a postgres database, where each dataset is in its own partition.
partitions are nice, since can easily add/drop/replace partitions
database is nice for filtering based on type_id
databases aren't easy for writing parallel queries
databases are good for structured data, and my data is not structured
As a proof of concept I tried out hadoop:
created a tab separated file per dataset for a particular type_id
uploaded to hdfs
map: retrieved a value/column for each key
reduce: computed average and standard deviation
From my crude proof of concept, I can see this will scale nicely, but I can also see that Hadoop/HDFS has latency, and I've read that it's generally not used for real-time querying (even though I'm OK with returning results to users within 5 seconds).
Any suggestion on how I should approach this? I was thinking of trying HBase next to get a feel for that. Should I instead look at Hive? Cassandra? Voldemort?
thanks!
Hive and Pig don't seem like they would help you. Essentially each of them compiles down to one or more map/reduce jobs, so the response cannot be within 5 seconds.
HBase may work, although your infrastructure is a bit small for optimal performance. I don't understand why you can't pre-compute summary statistics for each column. You should look up computing running averages so that you don't have to do heavyweight reduces.
check out http://en.wikipedia.org/wiki/Standard_deviation
stddev(X) = sqrt(E[X^2] - (E[X])^2)
This implies that you can get the stddev of AB (the datasets A and B pooled together) from running sums alone:
E[AB^2] = (sum(A^2) + sum(B^2)) / (|A| + |B|), and likewise E[AB] = (sum(A) + sum(B)) / (|A| + |B|), so stddev(AB) = sqrt(E[AB^2] - (E[AB])^2).
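In other words, each dataset only needs to expose (count, sum, sum of squares) per key and column, and combining any selection of datasets is then cheap. A small sketch with made-up numbers:

import math

# Precomputed per-dataset running sums for one key/column:
# (count, sum of values, sum of squared values)
datasets = {
    "ds1": (3, 12.0, 50.0),  # values 3, 4, 5
    "ds2": (2, 9.0, 41.0),   # values 4, 5
}

def combine(selected):
    n = sum(datasets[d][0] for d in selected)
    s = sum(datasets[d][1] for d in selected)
    sq = sum(datasets[d][2] for d in selected)
    mean = s / n
    stddev = math.sqrt(sq / n - mean * mean)  # sqrt(E[X^2] - (E[X])^2)
    return mean, stddev

print(combine(["ds1", "ds2"]))  # mean and stddev over the pooled values 3, 4, 5, 4, 5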
Since your data seems to be pretty much homogeneous, I would definitely take a look at Google BigQuery - You can ingest and analyze the data without a MapReduce step (on your part), and the RESTful API will help you create a web application based on your queries. In fact, depending on how you want to design your application, you could create a fairly 'real time' application.
It is a serious problem without an immediate good solution in the open-source space. In the commercial space, MPP databases like Greenplum/Netezza should do.
Ideally you would need Google's Dremel (the engine behind BigQuery). We are developing an open-source clone, but it will take some time...
Regardless of the engine used, I think the solution should include holding the whole dataset in memory; that should give you an idea of what size of cluster you need.
If I understand you correctly, and you only need to aggregate on single columns at a time, you can store your data differently for better results.
In HBase that would look something like:
a table per data column in today's setup, plus another single table for the filtering fields (type_ids)
a row for each key in today's setup - you may want to think about how to incorporate your filter fields into the row key for efficient filtering, otherwise you'd have to do a two-phase read
a column for each dataset (table) in today's setup (i.e. a few thousand columns)
HBase doesn't mind if you add new columns, and it is sparse in the sense that it doesn't store data for columns that don't exist.
When you read a row you'd get all the relevant values, from which you can compute the avg etc. quite easily.
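A rough sketch of reading such a layout with the happybase client (the table, column family, and row-key names are hypothetical, and it assumes an HBase Thrift gateway is running):

import statistics
import happybase

conn = happybase.Connection("localhost")  # HBase Thrift gateway (assumed)
table = conn.table("metric_price")        # one HBase table per original data column

# One row per key; one column per dataset, e.g. d:dataset_0042 -> numeric value as bytes.
row = table.row(b"key_000001", columns=[b"d"])
values = [float(v) for v in row.values()]
print(statistics.mean(values), statistics.pstdev(values))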
You might want to use a plain old database for this. It doesn't sound like you have a transactional system, so you can probably use just one or two large tables. SQL has problems when you need to join over large data, but since your data set doesn't sound like it needs joins, you should be fine. You can set up indexes to find the data sets and then do the math either in SQL or in the app.