Is DynamoDB a wide-column store? - nosql

Sources indicate that DynamoDB is a key/value store, document store, and/or wide-column store:
At the core, DynamoDB is a key/value store.
If the value stored is a document, DynamoDB provides some support for working with the underlying document. Even Amazon agrees. So far, so good.
However, I've seen some claims that DynamoDB is actually a wide-column store (1, 2, 3, etc.). This seems odd to me, since as I understand it, a wide-column store would technically require a different data storage model.
Is it appropriate to consider DynamoDB to be a wide-column store?

In "How do you call the data model of DynamoDB and Cassandra?" I asked a similar question. I noted that Cassandra and DynamoDB, which have very similar data models, are sometimes called a "wide-column store" because of their sort key feature:
In DynamoDB (and in Cassandra), items are stored inside a partition contiguously, sorted by the so-called "sort key". To locate an item, you specify its partition key and then, inside that partition, its sort key. This is exactly the two-dimensional key-value store described in Wikipedia's definition of a wide-column store: https://en.wikipedia.org/wiki/Wide-column_store
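As a concrete illustration of that two-dimensional lookup, here is a minimal boto3 sketch; the table name and the key names (device_id as the partition key, reading_time as the sort key) are made up for the example:

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("SensorReadings")  # hypothetical table

# Look up a single partition by its partition key, then slice within that
# partition by a sort-key range; results come back ordered by the sort key.
response = table.query(
    KeyConditionExpression=Key("device_id").eq("sensor-42")
    & Key("reading_time").between("2024-01-01T00:00:00", "2024-01-31T23:59:59")
)
for item in response["Items"]:
    print(item)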
The historical evolution of a wide-column store into a DynamoDB-like one is easier to understand in the context of Cassandra, whose data model is more or less the same as DynamoDB's. Cassandra started its life as a real "wide column store": each row (called a "partition") had an unlimited number of unrelated columns. Later, CQL was introduced, which added the concept of a "clustering key" (Cassandra's equivalent of DynamoDB's sort key), and each partition was no longer a very long list of unrelated columns; instead it became a very long (and sorted) list of separate items. I explained this evolution in my answer
https://stackoverflow.com/a/47127723/8891224 comparing Cassandra's data model to Google Bigtable, which was the quintessential wide-column store.

How does Wikipedia define a wide-column store?
https://en.wikipedia.org/wiki/Wide-column_store opens with:
A wide-column store (or extensible record store) is a type of NoSQL
database. It uses tables, rows, and columns, but unlike a
relational database, the names and format of the columns can vary from
row to row in the same table. A wide-column store can be interpreted
as a two-dimensional key–value store.
DynamoDB has tables, rows (called items), and columns (called attributes). The names and format can vary from row to row (except for the primary key).
I think most wide-column stores define their table's schema centrally while DynamoDB lets each item define its own schema.
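As a quick illustration (a boto3 sketch with made-up table and attribute names), two items in the same table can carry entirely different non-key attributes:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("Products")  # hypothetical table whose only declared key is "sku"

# Two items in the same table with completely different non-key attributes;
# only the key attributes are fixed by the table definition.
table.put_item(Item={"sku": "book-1", "title": "Dune", "pages": 412})
table.put_item(Item={"sku": "shirt-7", "size": "M", "color": "blue"})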
A simple key-value store would only let you look up by a key value. DynamoDB gives you a lot more choices.
At the end of the day this nomenclature is just our collective attempt to group things into similar buckets. There are naturally going to be some fuzzy edges.

To add to the great answer by Nadav: be careful about treating DynamoDB as a wide-column datastore.
Of course you can use wide-column patterns with DynamoDB, for instance with key range queries (though the sort key must be designed carefully; nothing prevents you from making mistakes), but there is a hard limit: the item size of a row is capped at 400 KB. That is plenty for most cases, but very tight if you want to store, say, hundreds of columns of data, which is generally what you want to do with a wide-column datastore. Working around the limit is painful, to put it simply: you end up adding other tables and joins to compensate.
If you are really interested in using a wide-column datastore on AWS, I would personally use Amazon Keyspaces for that; it doesn't have DynamoDB's limits. It will require you to design a database schema, but if you have that many columns, I see that as a plus. CQL is also nicer to work with than the DynamoDB query API.

Redshift Performance of Flat Tables Vs Dimension and Facts

I am trying to create a dimensional model on top of flat OLTP tables (not in 3NF).
Some people think a dimensional model is not required because most of the data for the report is already present in a single table. But that table contains more than we need, something like 300 columns. Should I still split the flat table into dimensions and facts, or just use the flat tables directly in the reports?
You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with. If you want answers you can actually use, I'd suggest being more specific.
The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.
Firstly you need to understand how Redshift differs from regular relational databases:
1) It is a Massively Parallel Processing (MPP) system, which consists of one or more nodes that the data is distributed across, and each node typically does a portion of the work required to answer each query. Therefore the way data is distributed across the nodes becomes important; the aim is usually to have the data distributed fairly evenly so that each node does about the same amount of work for each query.
2) Data is stored in a columnar format. This is completely different from the row-based format of SQL Server or Oracle. In a columnar database, data is stored in a way that makes large aggregation-type queries much more efficient. This type of storage partially negates the reason for dimension tables, because storing repeating data (attributes) in rows is relatively efficient.
Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that regularly get updated then it will be better to put those in another, smaller table rather than update one large table. However if the data is largely append-only (unchanging) then there's no benefit in creating dimensions. Queries grouping and aggregating the data will be efficient over a single table.
JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id) - if they aren't Redshift will have to physically copy data around the nodes to be able to run the query. So if you have to have dimensions, then you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), then any other dimensions may need to be distributed as ALL (copied to every node).
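Sketched in Python with psycopg2 (cluster, table, and column names are all invented; Redshift speaks the PostgreSQL wire protocol, so psycopg2 can issue the DDL), the idea might look roughly like this:

import psycopg2

# Hypothetical cluster and credentials.
conn = psycopg2.connect(host="example-cluster.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="admin", password="secret")
cur = conn.cursor()

# Distribute the fact table and its largest dimension on the same key so that
# joins on user_id stay node-local.
cur.execute("""
    CREATE TABLE fact_sales (
        sale_id   BIGINT,
        user_id   BIGINT,
        sale_date DATE,
        amount    DECIMAL(12,2)
    )
    DISTSTYLE KEY
    DISTKEY (user_id)
    SORTKEY (sale_date);
""")
cur.execute("""
    CREATE TABLE dim_user (
        user_id     BIGINT,
        country_id  INT,
        signup_date DATE
    )
    DISTSTYLE KEY
    DISTKEY (user_id);
""")

# Copy a small secondary dimension to every node so it never needs redistribution.
cur.execute("""
    CREATE TABLE dim_country (
        country_id INT,
        name       VARCHAR(64)
    )
    DISTSTYLE ALL;
""")
conn.commit()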
My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).
When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:
It is easier to query
It avoids JOINs that can be confusing and error-prone for casual users
Queries run faster (especially for Data Warehouse systems that use columnar data storage)
This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.
Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.
A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figuring out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.
Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.

jsonb and primary/foreign keys: which performs better in PostgreSQL?

I'm looking at using PostgreSQL's jsonb column type for a new backend project that will mainly serve as a RESTful JSON API. I believe that PostgreSQL's jsonb will be a good fit for this project, as it will give me JSON objects without the need for conversion on the backend.
However, I have read that the jsonb data type slows down as keys are added, and my schema will have need of using primary keys and foreign key references.
I was wondering if having primary keys/foreign keys in their own columns (in the standard relational database way) and then having a jsonb column for the rest of the data would be beneficial, or would this cause problems (whether now or down the road)?
In short, would:
create table car (id int, manufacturer_id int, data jsonb)
perform better or worse than:
create table car (data jsonb)
Especially when looking up foreign keys frequently?
Would there be downsides to the first one, from a performance or a schema perspective?
All values involved in a PRIMARY KEY or FOREIGN KEY constraint must be stored as separate columns (best in normalized form). Constraints and references do not work for values nested inside a json / jsonb column.
As for the rest of the data: it depends. Having them inside a jsonb (or json) value carries the well-known advantages and disadvantages of storing unstructured document-type data.
For attributes that are present for all or most rows, it is typically better (faster, cleaner, smaller storage) to store them as separate columns. Especially simpler and cheaper to update. Easier indexing and other queries, too. The new jsonb has amazing index capabilities, but indexing dedicated columns is still simpler / faster.
For rarely used or dynamically appearing attributes, or if you want to store and retrieve JSON values without much handling inside the DB, look to jsonb.
For basic EAV structures with mainly character data, without nesting and no connection to JSON I would consider hstore. There are also the xml (more complex and verbose) and json data types (mostly superseded by jsonb), which are losing ground.
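As a rough sketch of that hybrid approach, using psycopg2 (all table, column, and connection names are invented): keys and constraints live in ordinary columns, the loosely structured remainder goes into jsonb, and a GIN index serves the document queries.

import psycopg2

conn = psycopg2.connect(dbname="carsdb", user="app")  # hypothetical connection
cur = conn.cursor()

# Keys and constraints as ordinary columns; everything loosely structured in jsonb.
cur.execute("""
    CREATE TABLE manufacturer (id int PRIMARY KEY, name text);
    CREATE TABLE car (
        id              serial PRIMARY KEY,
        manufacturer_id int REFERENCES manufacturer(id),
        data            jsonb
    );
    CREATE INDEX car_manufacturer_idx ON car (manufacturer_id);
    CREATE INDEX car_data_gin ON car USING gin (data);
""")

# The foreign-key filter uses a plain indexed column; the jsonb predicate uses
# the containment operator @>, which the GIN index can serve.
cur.execute(
    "SELECT id, data->>'model' FROM car "
    "WHERE manufacturer_id = %s AND data @> %s::jsonb",
    (42, '{"color": "red"}'),
)
rows = cur.fetchall()
conn.commit()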
Which performs better? It depends on usage. It is the same question as comparing SQL (relational) and NoSQL (key-value or document) databases: for some use cases a NoSQL database performs very well, for others it does not.
The relational concept (a normalized schema) is optimized for typical OLTP usage: roughly 70% reads / 30% writes, many concurrent users, lots of updates, report calculation, and some ad hoc queries. It is a fairly general-purpose approach with very wide applicability (record keeping, accounting, process support, ...), and it is usually not too bad anywhere.
Specialized databases (document, key-value, graph) can clearly be significantly better (an order of magnitude faster) on their specialized use cases, but their applicability is much narrower. Once you step outside the use case they are optimized for, performance can be poor.
Another question is database size, i.e. the number of records. On production databases the performance difference can become significant at hundreds of thousands of rows; for smaller databases the impact may be negligible.
Postgres is a relational database, and my preference is to use a normalized schema for all the important data in the database. Used well, it is terribly fast. The non-relational types (hstore, json, xml, jsonb) are perfect for fuzzy data and are significantly better than an EAV schema (which performs poorly on larger data sets).
If you have to make an important decision, build a prototype, fill it with the data volume you expect (say, three years' worth), and measure the speed of the queries that matter for your system. Be aware that the hardware used, the current load, and the software versions all have a strong impact on such benchmarks.

Why would using a nosql/document/MongoDB as a relational database be inferior?

I have recently been introduced to MongoDB and I've come to like it a lot (compared to the MySQL I used for all my projects).
However, in certain situations, storing my data as documents that "link" to each other with simple IDs makes more sense (to reduce duplicated data).
For example, I may have Country and User documents, where a user's location is actually an ID to a Country (since a Country document includes more data, hence duplicating Country data in each user makes no sense).
What I am curious about is.. why would MongoDB be inferior compared to using a proper relationship database?
Is it because I can save queries by doing joins (as opposed to making two separate queries with MongoDB)?
That's a good question!
I would say there is definitely nothing wrong with using a NoSQL database for the type of data you have described. For simple use cases it will work perfectly well.
The only point is that relational databases were designed long ago to store and query well-structured data with properly defined relations. So for a large amount of well-structured data, the performance and features they provide will be a lot better than what a NoSQL database offers, simply because they are more mature: this is their ball game.
NoSQL databases, on the other hand, were designed to handle very large amounts of unstructured data and have out-of-the-box support for scaling across a distributed environment, which is a completely different ball game.
They basically treat data differently and hence have different strategies and execution plans for fetching a given piece of data.
MongoDB was designed from the ground up to be scalable over multiple servers. When a MongoDB database gets too slow or too big for a single server, you can add additional servers by making the larger collections "sharded". That means that the collection is divided between different servers and each one is responsible for managing a different part of the collection.
The reason why MongoDB doesn't do JOINs is that it is impossible to have JOINs perform well when one or both collections are sharded over multiple nodes. A JOIN requires comparing each entry of table/collection A with each entry of table/collection B. There are shortcuts for this when all the data is on one server, but when the data is distributed over multiple servers, large amounts of data need to be compared and synchronized between them. This would require a lot of network traffic and make the operation very slow and expensive.
Is it correct that you have only two tables, country and user? If so, it seems to me the only data duplicated is a foreign key, which is not a big deal. If there is more duplication than that, then I would question the DB design itself.
In concept, you can do it in NOSQL but why? Just because NOSQL is new? OK, then do it to learn but remember, "if it ain't broke, don't fix it." Apparently the application is already running on relational. If the data is stored in separate documents in MongoDB and you want to interrelate them, you will need to use a link, which will be more work than a join and be slower. You will have to store a link, which would be no better than storing the foreign key. Alternatively, you can embed one document in another in MongoDB, which might even increase duplication.
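For illustration, here is a small pymongo sketch of the two options, with all database and field names invented:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
db = client.myapp

# Reference pattern: the user stores only the country's _id, much like a foreign key.
country_id = db.countries.insert_one({"name": "Japan", "iso": "JP"}).inserted_id
db.users.insert_one({"name": "Alice", "country_id": country_id})

# "Joining" then means a second query (or a $lookup aggregation on a single server).
alice = db.users.find_one({"name": "Alice"})
alice_country = db.countries.find_one({"_id": alice["country_id"]})

# Embedding pattern: no second query, but the country data is duplicated per user.
db.users.insert_one({"name": "Bob", "country": {"name": "Japan", "iso": "JP"}})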
If it is currently running on MySQL then it is not running on distributed servers, so Mongo's use of distributed servers is irrelevant. You would have to add servers to take advantage of that. If the tables are properly indexed in relational, it will not have to search through large amounts of data.
However, this is not a complex application and you can use either. If the data is stored in an MPP environment with relational, it will run very well and will not need to search through large amounts of data at all. There are two requirements, however, when choosing a partitioning key in MPP: 1. pick one that will achieve an even distribution of data; and 2. pick a key that allows collocation of data. I recommend you use the same key as the partitioning key (shard key) in both tables.
As much as I love MongoDB, I don't see the value in moving your app.

Cassandra column key auto increment

I am trying to understand Cassandra and how to structure my column families (CF) but it's quite hard since I am used to relational databases.
For example if I create simple users CF and I try to insert new row, how can I make an incremental key like in MySQL?
I saw a lot of examples where you would just put the username instead of a unique ID, and that makes some sense, but what if I want to allow duplicate usernames?
Also, how can I do searches when, from what I understand, Cassandra does not support > operators, so something like select * from users where something > something2 would not work?
And probably the most important question: what about grouping? Would I need to retrieve all the data and then filter it with whatever language I am using? I think that would slow down my system a lot.
So basically I need a brief explanation of how to get started with Cassandra.
Your questions are quite general, but let me take a stab at it. First, you need to model your data in terms of your queries. With an RDBMS, you model your data in some normalized form, then optimize later for your specific queries. You cannot do this with Cassandra; you must write your data the way you intend to read it. Often this means writing it more than one way. In general, it helps to completely shed your RDBMS thinking if you want to work effectively with Cassandra.
Regarding keys:
They are used in Cassandra as the unit of distribution across the ring. So your key will get hashed and assigned an "owner" in the ring. Use the RandomPartitioner to guarantee even distribution
Presuming you use RandomPartitioner (you should), keys are not sorted. This means you cannot ask for a range of keys. You can, however, ask for a list of keys in a single query.
Keys are relevant in some models and not in others. If your model requires query-by-key, you can use any unique value that your application is aware of (such as a UUID). Sometimes keys are sentinel values, such as a Unix epoch representing the start of the day. This allows you to hand Cassandra a bunch of known keys, then get a range of data sorted by column (see below).
Regarding query predicates:
You can get ranges of data presuming you model it correctly to answer your queries.
Since columns are written in sorted order, you can query a range from column A to column n with a slice query (which is very fast). You can also use composite columns to abstract this mechanism a bit.
You can use secondary indexes on columns where you have low cardinality--this gives you query-by-value functionality.
You can create your own indexes where the data is sorted the way you need it.
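In modern CQL terms, the slice over sorted columns corresponds to a range predicate on a clustering column. A rough sketch with the DataStax Python driver, where the keyspace, table, and column names are invented:

import datetime
import uuid
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])     # hypothetical contact point
session = cluster.connect("demo")    # hypothetical keyspace

# Rows for one user are stored contiguously, sorted by the clustering column
# event_time, so a range predicate within the partition is a fast slice.
session.execute("""
    CREATE TABLE IF NOT EXISTS user_events (
        user_id    uuid,
        event_time timestamp,
        payload    text,
        PRIMARY KEY (user_id, event_time)
    )
""")

some_user = uuid.uuid4()
rows = session.execute(
    "SELECT event_time, payload FROM user_events "
    "WHERE user_id = %s AND event_time >= %s AND event_time < %s",
    (some_user, datetime.datetime(2024, 1, 1), datetime.datetime(2024, 2, 1)),
)
for row in rows:
    print(row.event_time, row.payload)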
Regarding grouping:
I presume you're referring to creating aggregates. If you need your data in real-time, you'll want to use some external mechanism (like Storm) to track data and constantly update your relevant aggregates into a CF. If you are creating aggregates as part of a batch process, Cassandra has excellent integration with Hadoop, allowing you to write map/reduce jobs in Pig, Hive, or directly in your language of choice.
To your first question:
can i make incremental key like in mysql
No, not really -- there is nothing native to Cassandra. See "How to create auto increment IDs in Cassandra", and you could check here for more information: http://srinathsview.blogspot.ch/2012/04/generating-distributed-sequence-number.html
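The usual workaround is to generate unique identifiers on the client instead of asking the database for a sequence, for example with UUIDs; a tiny Python sketch:

import uuid

# Instead of an auto-increment integer, generate a globally unique id client-side.
user_id = uuid.uuid4()    # random UUID, no coordination between nodes required
event_id = uuid.uuid1()   # time-based UUID, roughly ordered by creation time
print(user_id, event_id)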
Your second question is more about how you store and model your Cassandra data.
Check out stackoverflow's search option. Lots of interesting questions!
Switching from MySQL to Cassandra - Pros/Cons?
Cassandra Data Model
Cassandra/NoSQL newbie: the right way to model?
Apache Cassandra schema design
Knowledge sources for Apache Cassandra
Most importantly, When NOT to use Cassandra?
You may want to check out PlayOrm. While I agree you need to break out of RDBMS thinking, sometimes having your primary key be the user id is simply the wrong choice, and sometimes it is the right one (it depends on your requirements).
PlayOrm is a mix of NoSQL and relational concepts, since you need both, and you can do scalable SQL with joins and everything. You just need to partition the tables you believe will grow into billions or trillions of rows, and you can then query within those partitions. Even with CQL, you need to partition your tables. What can you partition by? Time works for some use cases; others can be partitioned by client, since each client is really a mini-database inside your NoSQL cluster.
As far as keys go, PlayOrm generates unique "cluster" keys of the form hostname-uniqueIdInThatHost, basically like a TimeUUID except quite a bit shorter and more readable, since we use short hostnames such as a1, a2, a3 in our cluster.

What NoSQL DB to use for sparse Time Series like data?

I'm planning a side project where I will be dealing with Time Series like data and would like to give one of those shiny new NoSQL DBs a try and am looking for a recommendation.
For a (growing) set of symbols I will have a list of (time,value) tuples (increasing over time).
Not all symbols will be updated; some symbols may be updated while others may not, and completely new symbols may be added.
The database should therefore allow:
Add Symbols with initial one-element (tuple) list. E.g. A: [(2012-04-14 10:23, 50)]
Update Symbols with a new tuple. (Append that tuple to the list of that symbol).
Read the data for a given symbol. (Ideally even let me specify the time frame for which the data should be returned)
The create and update operations should, if possible, be atomic. If reading multiple symbols at once is possible, that would be interesting.
Performance is not critical. Updates/Creates will happen roughly once every few hours.
I believe literally all the major NoSQL databases will support those requirements, especially if you don't actually have a large volume of data (which raises the question: why NoSQL?).
That said, I recently had to design and work with a NoSQL database for time series data, so I can give some input on that design, which can then be extrapolated to the others.
Our chosen database was Cassandra, and our design was as follows:
A single keyspace for all 'symbols'
Each symbol was a new row
Each time entry was a new column for that relevant row
Each value (can be more than a single value) was the value part of the time entry
This lets you achieve everything you asked for, most notably reading the data for a single symbol, with a range restriction if necessary (column range calls). Although you said performance wasn't critical, it was for us, and this was quite performant too: all data for any single symbol is by definition sorted (by column name) and always stored on the same node (no cross-node communication for simple queries). Finally, this design translates well to other NoSQL databases that have dynamic columns.
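Expressed in CQL via the Python driver, that row-per-symbol, entry-per-timestamp layout looks roughly like the sketch below; the keyspace, table, and column names are invented:

import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])      # hypothetical contact point
session = cluster.connect("tsdata")   # hypothetical keyspace

# One partition per symbol; each (time, value) tuple becomes a clustered row,
# kept sorted by time and stored together on the same replicas.
session.execute("""
    CREATE TABLE IF NOT EXISTS observations (
        symbol text,
        ts     timestamp,
        value  double,
        PRIMARY KEY (symbol, ts)
    )
""")

# Append a new tuple for a symbol (the first insert implicitly creates the partition).
session.execute(
    "INSERT INTO observations (symbol, ts, value) VALUES (%s, %s, %s)",
    ("A", datetime.datetime(2012, 4, 14, 10, 23), 50.0),
)

# Read one symbol's data, optionally restricted to a time frame.
rows = session.execute(
    "SELECT ts, value FROM observations WHERE symbol = %s AND ts >= %s AND ts < %s",
    ("A", datetime.datetime(2012, 4, 1), datetime.datetime(2012, 5, 1)),
)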
Further to this, here's some information on using MongoDB (and capped collections if necessary) for a time series store: MongoDB as a Time Series Database
Finally, here's a discussion of SQL vs NoSQL for time series: https://dba.stackexchange.com/questions/7634/timeseries-sql-or-nosql
I can add to that discussion the following:
The learning curve for NoSQL will be higher; you don't get the added flexibility and functionality for free in terms of 'soft costs'. Who will be supporting this database operationally?
If you expect this functionality to grow in the future (either as more fields added to each time entry, or much greater capacity in terms of the number of symbols or the size of a symbol's time series), then definitely go with NoSQL. The flexibility benefit is huge, and the scalability you get (with the above design) on both the 'per symbol' and 'number of symbols' axes is almost unbounded (I say almost: the maximum number of columns per row is in the billions, and the maximum number of rows per keyspace is unbounded, I believe).
Have a look at opentsdb.org, an open-source time series database that uses HBase. They have been smart about how they store the time series. It is well documented here: http://opentsdb.net/misc/opentsdb-hbasecon.pdf