jsonb and primary/foreign keys: which performs better in PostgreSQL?

I'm looking at using PostgreSQL's jsonb column type for a new backend project that will mainly serve as a RESTful JSON API. I believe jsonb will be a good fit for this project, as it will give me JSON objects without any need for conversion on the backend.
However, I have read that the jsonb data type slows down as keys are added, and my schema will need primary keys and foreign key references.
I was wondering if having primary keys/foreign keys in their own columns (in the standard relational database way) and then having a jsonb column for the rest of the data would be beneficial, or would this cause problems (whether now or down the road)?
In short, would:
table car(id int, manufacturer_id int, data jsonb)
perform better or worse than:
table car(data jsonb)
Especially when looking up foreign keys frequently?
Would there be downsides to the first one, from a performance or a schema perspective?

All values involved in a PRIMARY KEY or FOREIGN KEY constraint must be stored as separate columns (best in normalized form). Constraints and references do not work for values nested inside a json / jsonb column.
As for the rest of the data: it depends. Having them inside a jsonb (or json) value carries the well-known advantages and disadvantages of storing unstructured document-type data.
For attributes that are present in all or most rows, it is typically better (faster, cleaner, smaller storage) to store them as separate columns. They are also simpler and cheaper to update, and easier to index and query. The new jsonb has amazing index capabilities, but indexing dedicated columns is still simpler and faster (see the sketch below).
For rarely used or dynamically appearing attributes, or if you want to store and retrieve JSON values without much handling inside the DB, look to jsonb.
For basic EAV structures with mainly character data, without nesting and no connection to JSON, I would consider hstore. There are also the xml type (more complex and verbose) and the json type (mostly superseded by jsonb); both are losing ground.
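A minimal sketch of that hybrid layout, following the car table from the question (the manufacturer table and the index names are assumptions for illustration):

    CREATE TABLE manufacturer (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );

    CREATE TABLE car (
        id              serial PRIMARY KEY,
        manufacturer_id int NOT NULL REFERENCES manufacturer(id),
        data            jsonb   -- everything without relational meaning
    );

    -- plain btree index on the FK column keeps joins and lookups fast
    CREATE INDEX car_manufacturer_id_idx ON car (manufacturer_id);

    -- GIN index for containment queries on the document part
    CREATE INDEX car_data_gin_idx ON car USING gin (data);

The constraints and fast joins come from the dedicated columns; the jsonb column stays free-form.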

Which performs better? It depends on usage. It is the same question as comparing SQL (relational) and NoSQL (key-value or document) databases: for some use cases a NoSQL database performs very well, for others it does not.
The relational model (normalized schema) is optimized for typical OLTP usage: roughly 70% reads / 30% writes, many concurrent users, lots of updates, report calculation, some ad hoc queries. The relational model is a general-purpose one with very wide applicability (record keeping, accounting, process support, ...); usually it is not too bad anywhere.
Clearly, specialized databases (document, key-value, graph) can be significantly better (an order of magnitude faster) on their specialized use cases, but their applicability is significantly narrower. Once you step outside the optimized use case, performance can be bad.
Another question is database size and record counts. On production databases the performance difference can become significant at hundreds of thousands of rows; for some smaller databases the impact may not be noticeable.
Postgres is a relational database, and my preference is to use a normalized schema for all the important data in the database. Used well, it is terribly fast. The non-relational types (hstore, json, xml, jsonb) are perfect for fuzzy data, and they are significantly better than an EAV schema (which performs worse on bigger data).
If you have to make an important decision, build a prototype, fill it with the data volume you expect (say, after 3 years), and check the speed of the queries that matter for your system. Caution: the hardware used, the current load, and the software versions strongly affect such benchmarks.
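For example, a rough prototype run might look like this (the volumes and the JSON payload are invented; it assumes the car and manufacturer tables sketched above):

    -- synthetic data: 100 manufacturers, 10 million cars
    INSERT INTO manufacturer (name)
    SELECT 'maker_' || g FROM generate_series(1, 100) g;

    INSERT INTO car (manufacturer_id, data)
    SELECT (random() * 99)::int + 1,
           ('{"color": "red", "doors": ' || ((random() * 3)::int + 2) || '}')::jsonb
    FROM generate_series(1, 10000000);

    -- measure a query that matters for your system
    EXPLAIN ANALYZE
    SELECT * FROM car WHERE manufacturer_id = 42;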

Related

Is DynamoDB a wide-column store?

Sources indicate that DynamoDB is a key/value store, document store, and/or wide-column store:
At the core, DynamoDB is a key/value store.
If the value stored is a document, DynamoDB provides some support for working with the underlying document. Even Amazon agrees. So far, so good.
However, I've seen some claims that DynamoDB is actually a wide-column store (1, 2, 3, etc.). This seems odd to me, since as I understand it, a wide-column store would technically require a different data storage model.
Is it appropriate to consider DynamoDB to be a wide-column store?
In How do you call the data model of DynamoDB and Cassandra? I asked a similar question. I noted that both Cassandra and DynamoDB, which have very similar data models, are sometimes called "wide-column stores" because of their sort key feature:
In DynamoDB (and in Cassandra), items are stored inside a partition contiguously, sorted by the so-called "sort key". To locate an item, you need to specify its partition key, and inside that partition, specify its sort key. This is exactly the two-dimensional key-value store described in Wikipedia's definition of wide-column store https://en.wikipedia.org/wiki/Wide-column_store.
The historic evolution of a wide-column store into a DynamoDB-like one is easier to understand in the context of Cassandra, whose data model is more or less the same as DynamoDB's. Cassandra started its life as a real wide-column store: each row (called a "partition") had an unlimited number of unrelated columns. Later, CQL was introduced, which added the concept of a "clustering key" (Cassandra's equivalent of DynamoDB's sort key), and each partition was no longer a very long list of unrelated columns; instead it became a very long (and sorted) list of separate items. I explained this evolution in my answer https://stackoverflow.com/a/47127723/8891224 comparing Cassandra's data model to Google Bigtable, which was the quintessential wide-column store.
How does Wikipedia define a wide-column store?
https://en.wikipedia.org/wiki/Wide-column_store opens with:
A wide-column store (or extensible record store) is a type of NoSQL database. It uses tables, rows, and columns, but unlike a relational database, the names and format of the columns can vary from row to row in the same table. A wide-column store can be interpreted as a two-dimensional key–value store.
DynamoDB has tables, rows (called items), and columns (called attributes). The names and format can vary from row to row (except for the primary key).
I think most wide-column stores define their table's schema centrally while DynamoDB lets each item define its own schema.
A simple key-value store would only let you look up by a key value. DynamoDB gives you a lot more choices.
At the end of the day this nomenclature is just our collective attempt to group things into similar buckets. There's naturally going to be some fuzzy edges.
To add to the great answer by Nadav: be careful about treating DynamoDB as a wide-column datastore...
You can of course use wide-column-datastore patterns with DynamoDB, for instance key range queries (though the sort key must be designed carefully; nothing prevents you from making mistakes). But there is a hard limit: the item size of a row is capped at 400KB. That is plenty for most cases, but very tight if you want to store, say, hundreds of columns of data, and that is generally what you want to do with a wide-column datastore. Working around the limit is, to put it simply, hell: you end up adding other tables and joins to compensate.
If you are really interested in using a columnar datastore on AWS, I would personally use AWS Keyspaces for that; it doesn't have DynamoDB's limits. It will require you to design a database schema, but if you have that many columns, I see that as a plus. CQL is also nicer than the DynamoDB query API.

Redshift Performance of Flat Tables Vs Dimension and Facts

I am trying to create a dimensional model on top of flat OLTP tables (not in 3NF).
Some people think a dimensional model is not required, because most of the data for the report is present in a single table. But that table contains more than we need, over 300 columns. Should I still split the flat table into dimensions and facts, or just use the flat tables directly in the reports?
You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with - if you want answers that you're going to be able to use then I'd suggest being more specific.
The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.
Firstly you need to understand how Redshift differs from regular relational databases:
1) It is a Massively Parallel Processing (MPP) system, consisting of one or more nodes that the data is distributed across; each node typically does a portion of the work required to answer each query. Therefore the way data is distributed across the nodes becomes important: the aim is usually to distribute the data fairly evenly, so that each node does about the same amount of work for each query.
2) Data is stored in a columnar format, which is completely different from the row-based format of SQL Server or Oracle. In a columnar database, data is stored in a way that makes large aggregation-type queries much more efficient. This type of storage partially negates the need for dimension tables, because storing repeating data (attributes) in rows is relatively efficient.
Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that regularly get updated then it will be better to put those in another, smaller table rather than update one large table. However if the data is largely append-only (unchanging) then there's no benefit in creating dimensions. Queries grouping and aggregating the data will be efficient over a single table.
JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id) - if they aren't Redshift will have to physically copy data around the nodes to be able to run the query. So if you have to have dimensions, then you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), then any other dimensions may need to be distributed as ALL (copied to every node).
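A hedged sketch of what that looks like in Redshift DDL (all table and column names are invented for illustration):

    -- fact and its largest dimension share the same DISTKEY,
    -- so joining on user_id needs no data redistribution
    CREATE TABLE fact_orders (
        order_id   bigint,
        user_id    bigint DISTKEY,
        order_date date   SORTKEY,
        amount     decimal(12,2)
    );

    CREATE TABLE dim_users (
        user_id bigint DISTKEY,
        name    varchar(100)
    );

    -- small dimensions can be copied to every node instead
    CREATE TABLE dim_products (
        product_id int,
        name       varchar(100)
    ) DISTSTYLE ALL;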
My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).
When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:
It is easier to query
It avoids JOINs that can be confusing and error-prone for casual users
Queries run faster (especially for Data Warehouse systems that use columnar data storage)
This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.
Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.
A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figuring out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.
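A hypothetical nightly recalculation of such a flag (the table and column names are invented):

    -- refresh the convenience flag once per day
    UPDATE customer_report
    SET customer_is_active = (last_service_use >= CURRENT_DATE - 30);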
Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.

"Big" data in a JSONB column

I have a table with a metadata column (JSONB). Sometimes I run queries on this column. Example:
select * from "a" where metadata->'b'->'c' is not null
This column has always held just small JSON objects, <1KB. But for some records (less than 0.5%), it can be >500KB, because some sub-sub-properties contain a lot of data.
Today I only have ~1000 records and everything works fine. But I expect many more records soon, and I don't know whether having some big values (I'm not talking about "Big Data", of course!) will have a global impact on performance. Is 500KB "big" for Postgres, and is it "hard" to parse? Maybe my question is too vague; I can edit it if required. In other words:
Does having a few (<0.5%) bigger entries in a JSONB column noticeably affect the global performance of JSON queries?
Side note: assuming the "big" data is in metadata->'c'->'d', I don't run any queries against that particular property. Queries are always done on the "small/common" properties, but the "big" properties still exist.
It is a theoretical question, so I hope a generic answer will satisfy.
If you need numbers, I recommend that you run some performance tests. It shouldn't be too hard to generate some large jsonb objects for testing.
As long as the data are jsonb and not json, operations like metadata->'b'->'c' will be fast. Where you could lose time is when a large jsonb value is loaded and uncompressed from the TOAST table (“detoasted”).
You can avoid that problem by creating an index on the expression. Then an index scan does not have to detoast the jsonb and hence will be fast, no matter how big the jsonb is.
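For the query in the question, such an expression index might look like this (the index name is arbitrary; note the extra parentheses required around the expression):

    -- an index scan can then answer
    -- SELECT * FROM "a" WHERE metadata->'b'->'c' IS NOT NULL
    -- without detoasting the large jsonb values
    CREATE INDEX a_metadata_b_c_idx ON "a" ((metadata -> 'b' -> 'c'));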
So I think that you will not encounter problems with performance as long as you can index for the queries.
You should pay attention to the total size of your database. If the expected size is very large, backups will become a challenge. It helps to design and plan how to discard old and unused data; that is an aspect commonly ignored during application design that tends to cause headaches years later.

When to use composite types and arrays and when to normalize a database?

Is there any guideline on when to normalize a database or just use composite types and arrays?
When using arrays and composite types, I can use just a single table. I can also normalize the database and use a couple of tables and joins.
How do you decide which option is best?
Most of the time, stick to normalization. Among other things, keeping your database fairly well normalized helps with lock granularity. For example, if you have a "parent" object with two arrays in it, you cannot have transactions that are simultaneously adding/updating/modifying members of the arrays. If they're regular side tables, you can. (You can still SELECT ... FOR UPDATE the parent row before updating child objects if you want the serialized behaviour, though).
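A sketch of the two layouts with made-up names; in the embedded variant, concurrent writers contend for the same parent row, while in the normalized variant they lock independent child rows:

    -- embedded: updating either array rewrites (and locks) the parent row
    CREATE TABLE parent_embedded (
        id     serial PRIMARY KEY,
        tags   text[],
        scores int[]
    );

    -- normalized: children are separate rows with their own row locks
    CREATE TABLE parent (
        id serial PRIMARY KEY
    );
    CREATE TABLE child_tag (
        parent_id int NOT NULL REFERENCES parent(id),
        tag       text NOT NULL
    );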
Updating an array to add/replace/delete a value is expensive, as PostgreSQL must rewrite the whole tuple the array is in as an MVCC update. (It has a few TOAST tricks up its sleeve that can help, but not tons). Ditto composite types embedded in rows.
Big wide rows full of arrays and composites mean slower table scans, meaning slower fetches for commonly used values.
IIRC you can't define a foreign key into a field of a composite type, so you'll find yourself working around that or giving up on referential integrity where it would be good to have. Ditto arrays (there was work to get foreign keys to arrays working, but I don't think it ever got committed).
Many client drivers (PgJDBC, psqlODBC, psycopg2, etc etc etc) have incomplete to nonexistent support for arrays and composites, so you'll often land up expanding them into tuples for client driver interaction anyway. Some things, like arrays of composite types, are really quite painful to work with.
Most ORMs, including common ones like Hibernate, totally suck at using anything beyond the most utterly simplistic lowest-common-denominator SQL features. Sooner or later, someone's going to want to point one of those at your data model, at which point much wailing and gnashing of teeth will ensue. OTOH, don't accommodate garbage ORMs to the point where you avoid features that would greatly improve the data model and solve real-world problems; for example, if you have the choice of storing native hstore fields or using an EAV schema, consider just using hstore (or better, in 9.4, json with hstore features).
(Perversely, this means that people who have the most "object oriented" programs often have the most purely relational databases because their tools suck).
Things like report generation tools will similarly struggle with composites and arrays, so you'll often land up creating views to present a normalized appearance for the DB anyway. Then ON INSERT OR UPDATE OR DELETE ... DO INSTEAD triggers on the views to enable writes. At which point it gets ugly.
Personally I recommend keeping composites for times when it's logical to model something as a "type". Consider, say, if your data model required you to track timestamps in their original time zone. There's no built-in type for this (no, that's not what "timestamp with time zone" does, despite the name, thanks SQL committee), so you might create a composite type that stored (timestamp without time zone, tzname) and use that consistently in your data model.
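A minimal sketch of such a type (the names are illustrative):

    -- a timestamp plus the time zone it was recorded in
    CREATE TYPE ts_with_origin AS (
        ts     timestamp without time zone,
        tzname text
    );

    CREATE TABLE event (
        id          serial PRIMARY KEY,
        occurred_at ts_with_origin
    );

    -- composite values are written with row syntax
    INSERT INTO event (occurred_at)
    VALUES (ROW('2014-01-01 12:00:00', 'Europe/Vienna')::ts_with_origin);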
Similarly, I tend to use arrays in queries a lot, but not in the data model much. They're useful when you want to intentionally denormalize something for performance, but that's often done in a materialized view or similar. Even if it's a change to the main data model, it's the sort of thing you should be doing based on proper performance review, not just "optimizing" stuff you don't know is slow yet.
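For example, a denormalizing materialized view over a hypothetical normalized schema, keeping the side tables as the source of truth:

    CREATE TABLE post (
        id    serial PRIMARY KEY,
        title text
    );
    CREATE TABLE post_tag (
        post_id int REFERENCES post(id),
        tag     text
    );

    -- denormalized, array-valued view for read-heavy paths (9.3+)
    CREATE MATERIALIZED VIEW post_with_tags AS
    SELECT p.id, p.title, array_agg(t.tag ORDER BY t.tag) AS tags
    FROM post p
    LEFT JOIN post_tag t ON t.post_id = p.id
    GROUP BY p.id, p.title;

    -- refresh after batch changes to the underlying tables
    REFRESH MATERIALIZED VIEW post_with_tags;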

Cassandra column key auto increment

I am trying to understand Cassandra and how to structure my column families (CF), but it's quite hard since I am used to relational databases.
For example, if I create a simple users CF and I try to insert a new row, how can I make an incremental key like in MySQL?
I saw a lot of examples where you would just use the username instead of a unique ID, and that makes some sense, but what if I want to allow users to have duplicate usernames?
Also, how can I do searches, given that from what I understand Cassandra does not support > operators? Something like select * from users where something > something2 would not work.
And probably the most important question: what about grouping? Would I need to retrieve all data and then filter it with whatever language I am using? I think that would slow my system down a lot.
So basically I need a brief explanation of how to get started with Cassandra.
Your questions are quite general, but let me take a stab at it. First, you need to model your data in terms of your queries. With an RDBMS, you model your data in some normalized form, then optimize later for your specific queries. You cannot do this with Cassandra; you must write your data the way you intend to read it. Often this means writing it more than one way. In general, it helps to completely shed your RDBMS thinking if you want to work effectively with Cassandra.
Regarding keys:
They are used in Cassandra as the unit of distribution across the ring: your key gets hashed and assigned an "owner" in the ring. Use the RandomPartitioner to guarantee even distribution.
Presuming you use RandomPartitioner (you should), keys are not sorted. This means you cannot ask for a range of keys. You can, however, ask for a list of keys in a single query.
Keys are relevant in some models and not in others. If your model requires query-by-key, you can use any unique value that your application is aware of (such as a UUID). Sometimes keys are sentinel values, such as a Unix epoch representing the start of the day. This allows you to hand Cassandra a bunch of known keys, then get a range of data sorted by column (see below).
Regarding query predicates:
You can get ranges of data presuming you model it correctly to answer your queries.
Since columns are written in sorted order, you can query a range from column A to column n with a slice query (which is very fast); see the sketch after this list. You can also use composite columns to abstract this mechanism a bit.
You can use secondary indexes on columns where you have low cardinality--this gives you query-by-value functionality.
You can create your own indexes where the data is sorted the way you need it.
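A CQL sketch of that pattern (the schema is invented for illustration): the partition key locates the partition, and the clustering column gives you a fast, sorted range (slice) within it:

    -- one partition per sensor; rows inside it are sorted by event_time
    CREATE TABLE readings (
        sensor_id  uuid,
        event_time timestamp,
        value      double,
        PRIMARY KEY (sensor_id, event_time)
    );

    -- a fast slice: a range over the clustering column within one partition
    SELECT * FROM readings
    WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
      AND event_time >= '2013-01-01' AND event_time < '2013-01-02';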
Regarding grouping:
I presume you're referring to creating aggregates. If you need your data in real-time, you'll want to use some external mechanism (like Storm) to track data and constantly update your relevant aggregates into a CF. If you are creating aggregates as part of a batch process, Cassandra has excellent integration with Hadoop, allowing you to write map/reduce jobs in Pig, Hive, or directly in your language of choice.
To your first question:
can i make incremental key like in mysql
No, not really; auto-increment keys are not native to Cassandra. See How to create auto increment IDs in Cassandra, and you could check here for more information: http://srinathsview.blogspot.ch/2012/04/generating-distributed-sequence-number.html
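A common substitute is a UUID, generated by the client or (in newer Cassandra versions) server-side with uuid(); a hypothetical CQL example:

    CREATE TABLE users (
        user_id  uuid PRIMARY KEY,
        username text
    );

    -- uuid() generates a random version-4 UUID on the server
    INSERT INTO users (user_id, username) VALUES (uuid(), 'alice');

Since the key is no longer the username, duplicate usernames become possible.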
Your second question is more about how you store and model your Cassandra data.
Check out stackoverflow's search option. Lots of interesting questions!
Switching from MySQL to Cassandra - Pros/Cons?
Cassandra Data Model
Cassandra/NoSQL newbie: the right way to model?
Apache Cassandra schema design
Knowledge sources for Apache Cassandra
Most importantly, When NOT to use Cassandra?
You may want to check out PlayOrm. While I agree you need to break out of RDBMS thinking, sometimes having your primary key be the userid is just the wrong choice; sometimes it is the right choice (it depends on your requirements).
PlayOrm is a mix of NoSQL and relational concepts, as you need both, and you can do Scalable-SQL with joins and everything. You just need to partition the tables you believe will grow into the billions/trillions of rows, and you can query into those partitions. Even with CQL, you need to partition your tables. What can you partition by? Time is good for some use cases; others can be partitioned by client, as each client is really a mini-database in your NoSQL cluster.
As far as keys go, PlayOrm generates unique "cluster" keys of the form hostname-uniqueIdInThatHost, basically like a TimeUUID except quite a bit shorter and more readable, as we use hostnames of a1, a2, a3, etc. in our cluster.