JSON or relational tables for complex user profiles - postgresql

I am trying to design a Postgres database to hold a variety of information about users, and I see two obvious ways to go about it, specifically for the various many-to-many relations.
1) Store the basic user data in a user_info table. In separate tables, store the many-to-many relations like what schools someone attended, places they worked at, and so on. There will be a large number of such tables (it is easy to add things like what places someone visited, what books they've read, etc., and I expect this to grow into a rather long list of tables).
2) In the main user_info table, store a JSON blob (properly organized, of course) with all this additional info.
Which of these two options should I choose? Naturally, read performance is the priority. I know that JSON is generally slower than ordinary relational tables, but I am unsure whether looking up info from many different tables (as in option 1) will be slower than fetching a single JSON blob and displaying it in the browser. As a further note, Postgres's JSONB format actually has good indexing options.
Update:
Following some comments that a graph DB is what should be used: I should clarify that the question is not about the choice of technology (RDBMS vs. graph DB), but about the choice of data type given the technology (RDBMS).

NoSQL is great when you don't know what data you're going to store or how it's going to be used, or when the data fits well with a list/hash model. Relational databases are great when you have a lot of certainty about the data, how it will be used, and when it fits the relational model. I would suggest a hybrid approach, especially given PostgreSQL 9.2's JSON performance improvements.
Make traditional relationships for things you know are solid.
Make use of JSON for data that you want to capture but aren't sure you need.
For simple lists, make use of PostgreSQL arrays or JSON rather than join tables.
Abstract this all behind model classes.
As you gain more knowledge about the data, change how it's stored.
For example, make tables for People, Schools, Work and Places, with join tables between them. Fields like People.name and Places.address are normal columns. For things like a "list of a person's pets", store an array of TEXT or a JSON field until you feel you need a Pets table. Any extra information you don't immediately know what to do with, like "how big is a school's endowment", goes into a JSON metadata column.
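A minimal sketch of that hybrid layout (table and column names here are just placeholders):

CREATE TABLE people (
    person_id serial PRIMARY KEY,
    name      text NOT NULL,
    pets      text[],   -- simple list; no Pets table until you need one
    metadata  json      -- catch-all for data you aren't sure about yet
);

CREATE TABLE schools (
    school_id serial PRIMARY KEY,
    name      text NOT NULL,
    metadata  json      -- e.g. {"endowment": 50000000}
);

-- a solid, well-understood relationship gets a traditional join table
CREATE TABLE person_school (
    person_id integer REFERENCES people,
    school_id integer REFERENCES schools,
    PRIMARY KEY (person_id, school_id)
);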
Using model classes allows you to refactor your database without worrying about every piece of code that touches the database. Just be sure that all code which makes assumptions about the table structure goes into model methods.

Related

PostgreSQL: JSON column or one-to-many table for config options

We currently have a table which stores information about users. Some of the columns hold information such as user ID, name etc., but many other columns (booleans, integers and varchars etc) hold configuration options for each user.
This has over time resulted in the table becoming quite wide, and I think the time has come to migrate this to something new, so I want to move all the "option"-related columns into a separate data structure.
The typical way of doing this, from my experience, would be to have a new table which would simply have option_id and option_name, and a second new table which would contain user_id, option_id, option_value, for example.
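For concreteness, a sketch of that two-table layout (names are illustrative, and it assumes the existing user table is called users):

CREATE TABLE option (
    option_id   serial PRIMARY KEY,
    option_name varchar(100) NOT NULL
);

CREATE TABLE user_option (
    user_id      integer NOT NULL REFERENCES users,
    option_id    integer NOT NULL REFERENCES option,
    option_value varchar(255),
    PRIMARY KEY (user_id, option_id)
);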
However, a colleague suggested using the new jsonb column type as an alternative, but I don't know if I like the idea of storing relational data in a non-relational way. From a Java point of view, it's pretty much the same as far as I can tell - it'll just be turned into a POJO and then cached on the object.
I should mention that the number of users will be quite low, only going into the thousands, but the number of columns could and will go into the hundreds.
Does anyone have advice on the best way forward here?
Technically, you have already de-normalized your database structure by adding columns to a table that are irrelevant to some of the entities stored therein.
Using JSON is just another way to de-normalize, cramming a bunch of values into a single row-column field. The excellent binary support for JSON in Postgres (the jsonb data type) then lets you index elements within those JSON documents, as a way to quickly access those embedded values. This is quite screwy from a relational point of view, but is handy for some situations.
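For example (a sketch; the column and key names are made up), a GIN index lets Postgres find rows by values embedded in the document:

-- a jsonb column holding the per-user options
ALTER TABLE users ADD COLUMN options jsonb;

-- index the whole document
CREATE INDEX users_options_idx ON users USING GIN (options);

-- a containment query that can use the index
SELECT user_id FROM users WHERE options @> '{"newsletter": true}';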
Either approach is commonly done for this kind of problem, and is not necessarily bad. In general, de-normalizing is often a pay-now-or-pay-later kind of solution. But for something like user preferences, there may not be a pay-later penalty, as there often is with most business-oriented problem domains.
Nevertheless, you should consider a normalized database structure.
By the way, this kind of table-structure Question might be better asked in the sister site, http://DBA.StackExchange.com/.
I suggest searching Stack Overflow, that DBA site, and the wider Internet for discussions of database design for storing user preferences. Like this.

SaaS system with dynamic data model in production

I want to design a product which allows customers to create their own websites. A customer will be able to maintain his website's data model on the fly, run queries on it, and display the output on an HTML page. I doubt a traditional RDBMS is the right choice, for two reasons: with every customer the amount of data will grow, and the RDBMS might reach its limits even if scaled; and since the data model is highly dynamic, doing many DDL queries will slow down the whole system.
I'm trying to figure out which database/data-storage system might be the best option for such a system. Recently I have read a lot about NoSQL solutions like Cassandra and MongoDB, and they look promising in terms of performance, but they come with a flaw: the data is not relational, so it has to be denormalized.
I don't know what the impact of denormalizing a dynamic, customer-defined data model will be, because the customer models and inserts data first (in a relational way) and then runs queries afterwards. The denormalization has to happen automatically, which leads to another problem: can I create one table for each query, even if some queries might be similar? There might be a lot of redundant data after a while.
Does creating/updating tables on the fly have any impact?
Every time the customer changes data, the same data has to be changed in all tables which hold a copy of the same entity (e.g. the name of an employee has to be changed in "team member" and also in "project task"). Are those updates costly?
Is it possible to nest data with unlimited depth like {"team": {"members": [{"name": "Ben"}]}}?
There might be even better or other approaches; I'm happy for any hints.
Adding clarification to the requirements
My question actually is: how can I use a NoSQL DB like Cassandra to maintain relational data, and will that solution still perform better than an RDBMS?
The customer thinks relationally (because, in my opinion, data is always relational) no matter what DBMS is used. And this service is not about letting the customer choose the underlying data storage; there can only be one.
A customer can define his own relational data model using a management frontend provided by the application. The data model may be changed at any time by the customer. In an RDBMS, running DDL on a production system is not a good idea. On top of the data schema, the customer can add named queries and use them as a data source on any web page he creates.
An example would be a query for news, given the name "news"; in a web page it would be used like <ul><li query="news"><h1>[news.title]</h1></li></ul>, which would execute the query, iterate through the data, and repeat the <li> on each iteration. That is the simplest example, though.
In more complex examples, using SQL might mean extensive use of subqueries, which perform badly. In NoSQL there seems to be the option to denormalize first, preparing a table with the data needed by the query, and then just query that table. Any change to the data involved would lead to an update of that table. That means for every query the customer creates, the system will automatically create and maintain a table and its data, so there will be a lot of data redundancy. Benchmarks say Cassandra is fast at writing, so that might be an option.
Let me put my 2 cents in.
Giving users the ability to have their own data models is not really SaaS.
In the pure SaaS paradigm, each user has the same functionality and data model. A user can add his own objects, but not new classes of objects.
So scaling in this paradigm is a rather obvious solution (though, frankly, it may not be so trivial). You can get a cloud DB with built-in multi-tenant support (Azure, for example), you can use Amazon's RDS and add more instances as the number of users grows, you can use sharding (for instance, partitioning by user) if the database supports it, etc.
But a custom data model for each user is more like IaaS (infrastructure). It is a lower-level thing, where you just say: "Ok, guys, you may build any data model you want, whatever."
And I believe that if you move the responsibility for data model creation to the user, you should also move the responsibility for database selection, as IaaS does. So the user would say: "Ok, I need a key-value database here" and you provide him with a Cassandra table, for example. If he wants an RDBMS, you provide that as well.
Otherwise, you have to consider not just the data model itself but also the data strategy your customer needs. Some customers may need key-value storage (backed by some NoSQL DB), others may need an RDBMS. How would you know?
For instance, consider the entity from your example: {"team": {"members": [{"name": "Ben"}]}}. One user might use this model only for queries like "get the members of the team" and "add a member to the team". Another user may need to query frequently for stats (average team player age, games played). These two scenarios could demand different database types: the first suits a key-value lookup, the second an RDBMS. How would you guess the database type and structure, given that key-value stores are modeled around their queries?
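To make the contrast concrete, here are the two access patterns as SQL against a hypothetical team_member table; the first is a simple key lookup, the second an aggregate that a pure key-value store handles poorly:

-- lookup-style: "get the members of the team"
SELECT name FROM team_member WHERE team_id = 42;

-- analytical-style: "average team player age"
SELECT team_id, avg(age) FROM team_member GROUP BY team_id;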
Technically, you could even try to guess the database type from the user's data model and queries, but you would need to place some restrictions on users' creativity. Otherwise it becomes a very nontrivial task.
And about scaling: as each model is unique, you need to add database instances as the user base grows. Of course, you can host multiple users in a single database instance in different schemas, and you will need to determine the number of users per instance by experiment or performance testing.
You might also look at document-oriented databases, but I think you need to review your concept and make some changes. Maybe you already have some obvious restrictions in mind, but I just didn't get that from your post.

How to create HBase columns / table for related but separated entities

I saw a video tutorial on HBase where data got stored in a table like this:
EmployeeName - Height - ProjectInfo
------------------------------------
Jdoe - 5'7" - ProjA-TeamLead, ProjB-Contributor
What happens when some business requirement comes up and the name of ProjA has to be changed to ProjX?
Wouldn't there be a separate table where Project information is stored?
In a relational database, yes: you'd have a project table, and the employee table would refer to it via a foreign key and only store the immutable project id (rather than the name). Then when you want to query it (in a relational database), you'd do a JOIN like:
SELECT
    employee.name,
    employee.height,
    project.name,
    employee_project_role.role_name
FROM employee
INNER JOIN employee_project_role
    ON employee_project_role.employee_id = employee.employee_id
INNER JOIN project
    ON employee_project_role.project_id = project.project_id;
This isn't how things are done in HBase (and other NoSQL databases); the reason is that since these databases are geared towards extremely large data sets, and distributed over many machines, the actual algorithms to transparently execute complex joins like this become a lot harder to pull off in ways that perform well. Thus, HBase doesn't even have built-in joins.
Instead, the general approach with systems like this is that you denormalize your data, and store things in a single table. So in this case, there might be one row per employee, and denormalized into that row is all of the employee's project role info (probably in separate columns -- the contents of a row in HBase is actually a key/value map, so you can represent repeating things like all of their different roles easily).
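As an illustration (a sketch, not taken from the tutorial), the denormalized employee row might be laid out as a key/value map, with one column per project role:

Row key: Jdoe
  info:height    -> 5'7"
  projects:ProjA -> TeamLead
  projects:ProjB -> Contributor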
You're absolutely right, though: if you change the name of the project, you'd need to change the data that's stored for every employee. In this respect, the relational model is "cleaner". But if you're dealing with petabytes of data or trillions of rows, the "clean" abstraction of a relational database becomes a lot messier, because you end up having to shard it all manually. The point of systems like HBase is to pay these costs up front in the design process, and not just assume the relational database will magically solve problems like this for you at scale. (Because it won't.)
That said: if you don't expect to have at least terabytes of data (that's a million MB, remember), just do it in a relational database. It'll be much easier.
I think going through this presentation will give you some perspective:
http://ianvarley.com/coding/HBaseSchema_HBaseCon2012.pdf
And for a more programmatic perspective, have a look at:
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

SQL table structure

I am starting a new project that will handle surveys and reviews. At this point I am trying to figure out the best SQL table structure for storing and handling such information.
Basically, a survey will contain ratings, text reviews, and additional optional information clients can share. I am thinking of either storing each piece of information in a separate column, or merging all this data and storing it as XML in one column.
I am not sure what would be a better solution, but I have the following issues on my mind:
- would a possible increase in the amount of information collected be a problem with a single XML column
- would a single XML column have any serious performance impact when extracting and handling information from it
If you ever have a reason to query a single piece of info, or to update it alone, then don't store that data in XML; store it as a separate column instead.
It is rare, IMO, that storing XML (or any other composite data type) in a DB is a good idea. Although there are always exceptions.
Well, to keep this simple, you have two choices: dynamic or static surveys.
Dynamic surveys would look like this:
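A minimal sketch of such a dynamic (EAV-style) layout, with illustrative names:

CREATE TABLE survey (
    survey_id serial PRIMARY KEY,
    title     varchar(200)
);

CREATE TABLE question (
    question_id   serial PRIMARY KEY,
    survey_id     integer REFERENCES survey,
    question_text varchar(500),
    sort_order    integer
);

CREATE TABLE answer (
    answer_id    serial PRIMARY KEY,
    question_id  integer REFERENCES question,
    user_id      integer,
    answer_value varchar(4000)  -- everything stored as text
);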
Not only would reporting be more complicated, but so would the UI. The number of questions is unknown and you would eventually need logic to handle order, grouping, and data types.
Static surveys would look more like this:
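And a static sketch, with one column per known question:

CREATE TABLE survey_response (
    response_id     serial PRIMARY KEY,
    user_id         integer,
    rating          integer,        -- e.g. 1-5
    review_text     varchar(4000),
    would_recommend boolean
);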
Although you certainly give up some flexibility, the solution (including reports) is considerably simpler. You need not handle order, grouping, or data types (at least dynamically).
I like to argue that "Simplicity is the best Design" in almost everything.
Since I cannot know your requirements in detail, I cannot say which is the better fit. But I can tell you this: the dynamic solution is often built when the static one is sufficient.
Best of luck!
If you don't want to fight with a relational database that expects relational data, you probably want reasonably normalized data. I don't see what advantage XML would give you in your case. If multiple values are entered in the survey, you probably want another table for survey entries with a foreign key to the survey.
If this is going to be a relatively extensive application you might think about a table for survey definition, a table for survey question, a table for survey response, and a table for survey question response. If the survey data can be multiple types, you might need a table for each kind of question that might be asked, though in some cases a column might do.
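A hedged sketch of those four tables (all names illustrative):

CREATE TABLE survey_definition (
    survey_id serial PRIMARY KEY,
    name      varchar(200)
);

CREATE TABLE survey_question (
    question_id   serial PRIMARY KEY,
    survey_id     integer REFERENCES survey_definition,
    question_text varchar(500)
);

CREATE TABLE survey_response (
    response_id   serial PRIMARY KEY,
    survey_id     integer REFERENCES survey_definition,
    respondent_id integer
);

CREATE TABLE survey_question_response (
    response_id integer REFERENCES survey_response,
    question_id integer REFERENCES survey_question,
    answer      varchar(4000),
    PRIMARY KEY (response_id, question_id)
);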
EDIT - I think you would at least have one row per answer to a question. If the answer is complex (doesn't correspond to just one instance of a simple data type) it might actually be multiple rows (though denormalizing into multiple columns is probably O.K. if the number of columns is small and fixed). If an answer to one question needs to be stored in multiple rows, you would almost certainly end up with one table that represents the answer, and has one row per answer, plus another table that represents pieces of the answer, and has one row per piece.
If the reason you are considering XML is that the answers are going to be of very different types (for example, a review with a rating, a title, a header, a body, and a comments section for one question; a list of hyperlinks for another question, etc.) then the answer table might actually have to be several tables, so that you can model the data for each type of question. That would be a pretty complicated case though.
Hopefully one row per response, in a single table, would be sufficient.
To piggyback off of Flimzy's answer: you want to simply store the data in the database, not a specific format (i.e. XML). You might have a requirement for XML at the moment, but tomorrow it might be CSV or a fixed-width DAT file. Also, if you store just the data, you can use the "power" of the database to search specific columns of information and then return it as XML, if desired.

What db fits me?

I am currently using MySQL. I am finding that my schema is getting incredibly complicated, so I am looking for a new DB that will suit my needs.
Let's assume I am building a news aggregator (which collects news from multiple websites). I then run an algorithm to determine whether two news items from different sites are actually referring to the same topic, and use it to cluster news together. The relationship is depicted below:
cluster
 \--news1
    \--word1
    \--word2
 \--news2
    \--word3
 \--news3
    \--word1
    \--word3
And then I will apply some magic and determine the importance of each word. Summing the importance of every word gives me the importance of a news article, and summing the importance of every news article gives me the importance of a cluster.
Note that above clusters there are also subgroups (split by region, etc.) and categories (sports, etc.), whose importance I also have to determine for a particular day.
I have used views in the past to do this, but I realized that views are very slow. So I will normally insert into an actual table and index it for better performance. As you can see, this leads to multiple derived tables like (cluster, importance), (news, importance), (words, importance), etc., which can get pretty messy.
Also, the "importance" metric will change. It has become increasingly difficult to alter the tables and update the data (for which I am using TRUNCATE TABLE and then inserting from scratch).
I am currently looking into something schemaless like MongoDB. I do not need distributedness. I would very much like something reasonably fast (which can be indexed) and a lot more flexible than a traditional RDBMS.
NEW
As requested by various people, here is how I use this database (these are not actual SQL queries; I hope everyone here can understand them):
TABLE word ( word_id, news_id, word )
TABLE news ( news_id, date, site .. )
TABLE clusters ( cluster_id, cluster_leader, cluster_name, ... )
TABLE mapping_clusters_news( cluster_id, news_id)
TABLE word_importance (word_id, score)
TABLE news_importance (news_id, score)
TABLE cluster_importance( cluster_id, score)
TABLE group_importance( cluster_id, score)
You might notice that TABLE word has an extra news_id column. This corresponds to the word_importance table, because the same word can have a different importance in different articles (if you are familiar with tf-idf, this is basically something like that).
All the "importance" tables calculate the importance of each entity by averaging the importance of all the sub-entities below it. This means each cluster's importance is determined by all the news inside it, each news item's importance is determined by all the words inside it, and so on.
TYPICAL USAGE:
1) SELECT clusters FROM db THAT HAS word1, word2, word3, .. ORDER BY cluster_importance_score
2) SELECT words FROM db BELONGING TO THE CLUSTER cluster_id=5 ORDER BY word_importance score.
3) SELECT groups ordered by importance score.
As you can see, I am deriving a lot of scores at each layer, and someone has been telling me to use a materialized view for this purpose (which PostgreSQL supports). However, as you can see, this simple schema already consists of 8 tables (my actual DB consists of 26 tables of crap like that, which adds so many additional layers of complexity for maintenance).
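For reference, a materialized view would replace that TRUNCATE/INSERT cycle for one such derived table (a sketch; materialized views shipped in PostgreSQL 9.3):

CREATE MATERIALIZED VIEW news_importance_mv AS
SELECT w.news_id, AVG(wi.score) AS score
FROM word w
JOIN word_importance wi ON wi.word_id = w.word_id
GROUP BY w.news_id;

-- re-run whenever the underlying scores change
REFRESH MATERIALIZED VIEW news_importance_mv;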
NOTE THIS IS NOT ABOUT FULL-TEXT SEARCH.
When the schema is getting complicated, a graph database can be a good alternative. As I understand your domain, you have lots of entities related to other entities in different ways. Would it make sense to you to model this as a graph/network of entities? As food for thought I whipped up an example using Neo4j:
Example domain model (Neo4j): http://github.com/neo4j-examples/domain-models/raw/master/news-analysis.png
In a graphdb you can set properties on both nodes and relationships, which could be useful in your case (for instance the number of times a word is used in a news entry could be added to the relationship to that word). BTW, I added an extra is_related relationship between two news items, as I thought that could be interesting as well.
How about db4o?
ORM means "Object-relational mapper". Not using a relational database wouldn't make much sense. I'll pretend you meant "I want to be able to serialize objects".
I don't understand why distributedness is not required. Could you elaborate on that?
Personally, I would recommend Cassandra. It still has reasonably close ties to (by which I mean it is easy to integrate with) Hadoop, which you will probably eventually want for your processing. As an added bonus, there's Telephus, so Cassandra supports Twisted beautifully. Cassandra's method of conflict resolution (currently timestamps, soon-ish vector clocks) might work for your changing metric, as long as you don't mind getting the old value for as long as the metric hasn't been recalculated. Otherwise, you might move up a level and simply store multiple versions of the data with different versions of the metric. That way, if you decide a metric is a bad idea, you don't have to recompute.
Cassandra, unfortunately, does not yet have something that serializes/deserializes objects very well. However, for the thin wrappers you would be writing (essentially structs with a few methods), would writing a fromCassandra @classmethod really be that big a deal?
PostgreSQL may be "schema-based", but it kind of feels like you're throwing the baby out with the bathwater. If you don't need a distributed DB or a particularly schemaless design (which, offhand, it doesn't sound like you do, though you appear to think you do), then I'm not sure why you would want MongoDB. Postgres has lots of indexing options, and it sounds like its built-in full-text search would be good for you. If you're used to MySQL, where altering tables (you mentioned issues there) can be a nightmare, it's mostly better in Postgres. I'm a fan of both Postgres and MongoDB - it just doesn't sound like there's a good reason to move away from a relational DB for data that certainly sounds relational in nature.
In a word, YES, you should probably be looking at something else: Cassandra, Hadoop, MongoDB, something.
MongoDB is basically going to reduce your sample schema to "clusters" and "news", with everything else basically being contained in those two.
The good news:
This will make it easy to modify fields.
Map-reduce operations are a natural fit for the type of work that you're doing. You perform a map-reduce and then save the data back to the "news" item and all will be well.
The bad news:
It's easy to lose track of the structure of your data with something like Mongo. Hadoop and Hive typically force your schema a little more. But in any case, you'll need to write down some form of schema, or you'll just drown.
If you plan to do this for some non-trivial amount of data, then you're going to want "horizontal" scalability. MongoDB is "ok" for this, Hadoop is definitely a "leader" for this.