How to create HBase columns / table for related but separated entities - nosql

I saw video tutorial on HBase, where data got stored in a table like this:
EmployeeName - Height - ProjectInfo
------------------------------------
Jdoe - 5'7" - ProjA-TeamLead, ProjB-Contributor
What happens when some Business requirements comes up that name of ProjA has to be changed to ProjX ?
Wouldn't there be a separate table where Project information is stored?

In a relational database, yes: you'd have a project table, and the employee table would refer to it via a foreign key and only store the immutable project id (rather than the name). Then when you want to query it (in a relational database), you'd do a JOIN like:
SELECT
employee.name,
employee.height,
project.name,
employee_project_role.role_name
FROM
employee
INNER JOIN employee_project_role
ON employee_project_role.employee_id = employee.employee_id
INNER JOIN project
ON employee_project_role.project_id = project.project_id
This isn't how things are done in HBase (and other NoSQL databases); the reason is that since these databases are geared towards extremely large data sets, and distributed over many machines, the actual algorithms to transparently execute complex joins like this become a lot harder to pull off in ways that perform well. Thus, HBase doesn't even have built-in joins.
Instead, the general approach with systems like this is that you denormalize your data, and store things in a single table. So in this case, there might be one row per employee, and denormalized into that row is all of the employee's project role info (probably in separate columns -- the contents of a row in HBase is actually a key/value map, so you can represent repeating things like all of their different roles easily).
You're absolutely right, though: if you change the name of the project, that means you'd need to change the data that's stored for every employee. In this respect, the relational model is "cleaner". But if you're dealing with Petabytes of data or trillions of rows, the "clean" abstraction of a relational database becomes a lot messier, because you end up having to shard it all manually. The point of systems like HBase is to pay these costs up front in the design process, and not just assume the relational database will magically solve problems like this for you at scale. (Because it won't).
That said: if you don't expect to have at least Terabtyes of data (that's a million MB, remember), just do it in a relational database. It'll be much easier.

I think going through this presentation will give you some perspective:
http://ianvarley.com/coding/HBaseSchema_HBaseCon2012.pdf
And for a more programetical representation, have a look at:
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

Related

Redshift Performance of Flat Tables Vs Dimension and Facts

I am trying to create dimensional model on a flat OLTP tables (not in 3NF).
There are people who are thinking dimensional model table is not required because most of the data for the report present single table. But that table contains more than what we need like 300 columns. Should I still separate flat table into dimensions and facts or just use the flat tables directly in the reports.
You've asked a generic question about database modelling for data warehouses, which is going to get you generic answers that may not apply to the database platform you're working with - if you want answers that you're going to be able to use then I'd suggest being more specific.
The question tags indicate you're using Amazon Redshift, and the answer for that database is different from traditional relational databases like SQL Server and Oracle.
Firstly you need to understand how Redshift differs from regular relational databases:
1) It is a Massively Parallel Processing (MPP) system, which consists of one or more nodes that the data is distributed across and each node typically does a portion of the work required to answer each query. There for the way data is distributed across the nodes becomes important, the aim is usually to have the data distributed in a fairly even manner so that each node does about equal amounts of work for each query.
2) Data is stored in a columnar format. This is completely different from the row-based format of SQL Server or Oracle. In a columnar database data is stored in a way that makes large aggregation type queries much more efficient. This type of storage partially negates the reason for dimension tables, because storing repeating data (attibutes) in rows is relatively efficient.
Redshift tables are typically distributed across the nodes using the values of one column (the distribution key). Alternatively they can be randomly but evenly distributed or Redshift can make a full copy of the data on each node (typically only done with very small tables).
So when deciding whether to create dimensions you need to think about whether this is actually going to bring much benefit. If there are columns in the data that regularly get updated then it will be better to put those in another, smaller table rather than update one large table. However if the data is largely append-only (unchanging) then there's no benefit in creating dimensions. Queries grouping and aggregating the data will be efficient over a single table.
JOINs can become very expensive on Redshift unless both tables are distributed on the same value (e.g. a user id) - if they aren't Redshift will have to physically copy data around the nodes to be able to run the query. So if you have to have dimensions, then you'll want to distribute the largest dimension table on the same key as the fact table (remembering that each table can only be distributed on one column), then any other dimensions may need to be distributed as ALL (copied to every node).
My advice would be to stick with a single table unless you have a pressing need to create dimensions (e.g. if there are columns being frequently updated).
When creating tables purely for reporting purposes (as is typical in a Data Warehouse), it is customary to create wide, flat tables with non-normalized data because:
It is easier to query
It avoids JOINs that can be confusing and error-prone for causal users
Queries run faster (especially for Data Warehouse systems that use columnar data storage)
This data format is great for reporting, but is not suitable for normal data storage for applications — a database being used for OLTP should use normalized tables.
Do not be worried about having a large number of columns — this is quite normal for a Data Warehouse. However, 300 columns does sound rather large and suggests that they aren't necessarily being used wisely. So, you might want to check whether they are required.
A great example of many columns is to have flags that make it easy to write WHERE clauses, such as WHERE customer_is_active rather than having to join to another table and figuring out whether they have used the service in the past 30 days. These columns would need to be recalculated daily, but are very convenient for querying data.
Bottom line: You should put ease of use above performance when using Data Warehousing. Then, figure out how to optimize access by using a Data Warehousing system such as Amazon Redshift that is designed to handle this type of data very efficiently.

Guideline when designing a database in Postgresql

I am designing a database in Postgresql and I would like to have some expert advices before refactorizing my work.
The database naturally contains different parts that I plan to separate into schemas in order to have a mangling of object names that reflect logical organization of them. About 20 tables are for scientific purposes and 20 others are technical and 20 furthers are about administrative tasks.
Is that a good idea or am I misleading myself into a management overhead that I will regret later?
The database contains 3 tables that are huge. By huge, I mean there is more than 60 millions of rows in it and they might grow a little bit. I think I will create special tablespace for that tables. I would like to do it, in order to separate logically the place where data are stored because the rest of the database should be backuped in a different way than that three tables.
Further more one those 3 tables contains binary data that are not heavy but weight a bit when multiplying by the amount of rows and also this table grows faster than the 2 others. Then I will periodically purge it after backuping the table.
Is it a good idea to have more than one tablespace in a database? If so, is there any precaution to be taken when proceeding this way?
Thank you in advance for your advices.
Choosing good names & grouping database stuffs is always a wise choice, and such overheads are not usually considerable.
About separating tablespace of a single database, it also should not cause any special problem, I've a similar database (but in mysql) that has a large file table and I had to move all of it's content to another server for some optimization reasons and i had no problem with it till now.
There is a very more important matter in RDBMS designing and that's CORRECT TABLE INDEXING. I think choosing good indexes is most critical phase of designing a relational database and you'll see it's effect soon (when you begin to write JOIN queries!).
In general, designing and implementing database is an experimental job that depends to your situation and expertness, so you can't seek for a solid instruction.

JSON or relational tables for complex user profiles

I am trying to design a Postgres database for holding a variety of information about users and see two obvious ways to go about it - specifically, the different many-many relations.
Store the basic user data in a user_info table. In separate tables, store the many-many relations like what schools someone attended, places they worked at, and so on. There will be a large number of such tables, (it is easy to add things like what places someone visited, what books they've read, etc. etc. I expect this to grow to a rather large list of tables).
In the main user_info table, store a JSON blob (properly organized of course) with all this additional info.
Which of these two options should I choose? Naturally, read performance is more important. I know that JSON is generally slower than ordinary relational tables but I am unsure if looking up info from a lot of different tables (as in option 1) will be slower than getting a single json blob and displaying it in the browser. As a further note, the JSONB format, in Postgres, actually has good indexing options.
Update:
Following some comments that a graphdb is what needs to be used: I should clarify the question is not about the choice of technology (rdbms vs graph db). But about the choice of data type given the technology (rdbms).
NoSQL is great for when you don't know what data you're going to store or how it's going to be used, or it fits well with the list/hash model. Relational databases are great for when you have a lot of certainty about the data, how it will be used, and when it fits into the relational model. I would suggest a hybrid approach, especially given PostgreSQL 9.2's JSON performance improvements.
Make traditional relationships for things you know are solid.
Make use of JSON for data that you want to capture but aren't sure you need.
For simple lists, make use of PostgreSQL arrays or JSON rather than join tables.
Abstract this all behind model classes.
As you gain more knowledge about the data, change how its stored.
For example, make tables for People, Schools, Work and Places and join tables between them. Fields like People.name and Places.address are normal columns. Things like "list of a person's pets" store it as an array of TEXT or a JSON field until you feel you need a Pets table. Any extra information you don't immediately know what you're going to do with like "how big is a school's endowment" put into a JSON metadata column.
Using model classes allows you to refactor your database without worrying about every piece of code that touches the database. Just be sure that all code which makes assumptions about the table structure goes into model methods.

Cassandra column key auto increment

I am trying to understand Cassandra and how to structure my column families (CF) but it's quite hard since I am used to relational databases.
For example if I create simple users CF and I try to insert new row, how can I make an incremental key like in MySQL?
I saw a lot of examples where you would just put the username instead of unique ID and that would make a little sense, but what if I want users to have duplicated usernames?
Also how can I make searches when from what I understand cassandra does not suport > operators, so something like select * from users where something > something2 would not work.
And probably the most important question what about grouping? Would I need to retrieve all data and then filter it with whatever language I am using? I think that would slow down my system a lot.
So basically I need some brief explanation how to get started with Cassanda.
Your questions are quite general, but let me take a stab at it. First, you need to model your data in terms of your queries. With an RDBMS, you model your data in some normalized form, then optimize later for your specific queries. You cannot do this with Cassandra; you must write your data the way you intend to read it. Often this means writing it more than one way. In general, it helps to completely shed your RDBMS thinking if you want to work effectively with Cassandra.
Regarding keys:
They are used in Cassandra as the unit of distribution across the ring. So your key will get hashed and assigned an "owner" in the ring. Use the RandomPartitioner to guarantee even distribution
Presuming you use RandomPartitioner (you should), keys are not sorted. This means you cannot ask for a range of keys. You can, however, ask for a list of keys in a single query.
Keys are relevant in some models and not in others. If your model requires query-by-key, you can use any unique value that your application is aware of (such as a UUID). Sometimes keys are sentinel values, such as a Unix epoch representing the start of the day. This allows you to hand Cassandra a bunch of known keys, then get a range of data sorted by column (see below).
Regarding query predicates:
You can get ranges of data presuming you model it correctly to answer your queries.
Since columns are written in sorted order, you can query a range from column A to column n with a slice query (which is very fast). You can also use composite columns to abstract this mechanism a bit.
You can use secondary indexes on columns where you have low cardinality--this gives you query-by-value functionality.
You can create your own indexes where the data is sorted the way you need it.
Regarding grouping:
I presume you're referring to creating aggregates. If you need your data in real-time, you'll want to use some external mechanism (like Storm) to track data and constantly update your relevant aggregates into a CF. If you are creating aggregates as part of a batch process, Cassandra has excellent integration with Hadoop, allowing you to write map/reduce jobs in Pig, Hive, or directly in your language of choice.
To your first question:
can i make incremental key like in mysql
No, not really -- not native to Cassandra. How to create auto increment IDs in Cassandra -- You could check here for more information: http://srinathsview.blogspot.ch/2012/04/generating-distributed-sequence-number.html
Your second question is more about how you store and model your Cassandra data.
Check out stackoverflow's search option. Lots of interesting questions!
Switching from MySQL to Cassandra - Pros/Cons?
Cassandra Data Model
Cassandra/NoSQL newbie: the right way to model?
Apache Cassandra schema design
Knowledge sources for Apache Cassandra
Most importantly, When NOT to use Cassandra?
You may want to check out PlayOrm. While I agree you need to break out of RDBMS thinking sometimes having your primary key as userid is just the wrong choice. Sometimes it is the right choice(depends on your requirements).
PlayOrm is a mix of noSQL and relational concepts as you need both and you can do Scalable-SQL with joins and everything. You just need to partition the tables you believe will grow into the billions/trillions of rows and you can query into those partitions. Even with CQL, you need to partition your tables. What can you partition by? time is good for some use-cases. Others can be partitioned by clients as each client is really a mini-database in your noSQL cluster.
As far as keys go, PlayOrm generates unique "cluster" keys which is hostname-uniqueidinThatHost, basically like a TimeUUID except quite a bit shorter and more readable as we use hostnames in our cluster of a1, a2, a3, etc. etc.

What db fits me?

I am currently using mysql. I am finding that my schema is getting incredibly complicated. I seek to find a new db that will suit my needs:
Let's assume I am building a news aggregrator (which collects news from multiple website). I then run algorithms to determine if two news from different sites are actually referring to the same topic. I run this algorithm to cluster news together. The relationship is depicted below:
cluster
\--news1
\--word1
\--word2
\--news2
\--word3
\--news3
\--word1
\--word3
And then I will apply some magic and determine the importance of each word. Summing all the importance of each word gives me the importance of a news article. Summing the importance of each news article gives me the importance of a cluster.
Note that above cluster there are also subgroups( like split by region etc), and categories (like sports, etc) which I have to determine the importance of that in a particular day per se.
I have used views in the past to do so, but I realized that views are very slow. So i will normally do an insert into an actual table and index them for better performance. As you can see this leads to multiple tables derived like (cluster, importance), (news, importance), (words, importance) etc which can get pretty messy.
Also the "importance" metric will change. It has become increasingly difficult to alter tables, update data (which I am using TRUNCATE TABLE) and then inserting from null.
I am currently looking into something schemaless like Mongodb. I do not need distributedness. I would very much want something that is reasonably fast (which can be indexed) and something that is a lot more flexible that traditional RDMBS.
NEW
As requested by various people, I will post my usage to this database (they are not actual SQL queries since I hope everyone here could understand)
TABLE word ( word_id, news_id, word )
TABLE news ( news_id, date, site .. )
TABLE clusters ( cluster_id, cluster_leader, cluster_name, ... )
TABLE mapping_clusters_news( cluster_id, news_id)
TABLE word_importance (word_id, score)
TABLE news_importance (news_id, score)
TABLE cluster_importance( cluster_id, score)
TABLE group_importance( cluster_id, score)
You might notice that TABLE_word has an extra news_id column. This is to correspond to TABLE_word_importance column because the same word can have different importance in different articles (if you are familiar with tfidf, this is basically something like that).
All the "importance" table now calculates the importance of each entity by averaging the importance of all the sub-entities below it. This means that Each cluster's importance is determined by all the news inside it, each news's importance is determined by all the words inside it etc.
TYPICAL USAGE:
1) SELECT clusters FROM db THAT HAS word1, word2, word3, .. ORDER BY cluster_importance_score
2) SELECT words FROM db BELONGING TO THE CLUSTER cluster_id=5 ODER BY word_importance score.
3) SELECT groups ordered by importance score.
As you can see, I am deriving a lot of scores from each layer, and someone have been telling me to use a materialized view for this purpose (which postgresql supports it). However, as you can see, this simple schema already consists of 8 tables (my actual db consists of 26 tables of crap like that, which is adding so much additional layers of complexity for maintainance).
NOTE THIS IS NOT ABOUT FULL-TEXT SEARCH.
When the schema is getting complicated, a graph database can be a good alternative. As I understand your domain, you have lots of entities related to other entities in different ways. Would it make sense to you to model this as a graph/network of entities? As food for thought I whipped up an example using Neo4j:
news-analysis-example http://github.com/neo4j-examples/domain-models/raw/master/news-analysis.png
In a graphdb you can set properties on both nodes and relationships, which could be useful in your case (for instance the number of times a word is used in a news entry could be added to the relationship to that word). BTW, I added an extra is_related relationship between two news items, as I thought that could be interesting as well.
How about db4o? db4o
ORM means "Object-relational mapper". Not using a relational database wouldn't make much sense. I'll pretend you meant "I want to be able to serialize objects".
I don't understand why distributedness is not required. Could you elaborate on that?
Personally, I would reccomend Cassandra. It still has reasonably close ties to (by which I mean easy to integrate with) Hadoop, which you will probably eventually want for your processing. As an added bonus, there's Telephus, so Cassandra supports Twisted beautifully. Cassandra's method of conflict resolution (currently timestamps, soon-ish vector clocks) might work for your changing metric as long as you don't mind getting the old value for as long as the metric hasn't been recalculated. Otherwise, you might move up a level and simply store multiple versions of the data with different versions of the metric. That way, if you decide a metric is a bad idea, you don't have to recompute.
Cassandra, unfortunately, does not have something that serializes/deserializes objects very well yet. However, for the thin wrappers you would be writing (essentially structs with a few methods), would writing a fromCassandra #classmethod really be that big a deal?
Postgresql may be "schema based" but it kind of feels like you're throwing the baby out with the bathwater. If you don't need a distributed db or a particularly schema-less design (which it doesn't sound like offhand you do, but you appear to think you do) then I'm not sure why you would want mongodb. Postgres has lots of indexing options and it sounds like its built in full text searching would be good for you. If you're used to MySQL and altering tables (you mentioned issues there) can be a nightmare, mostly its better in Postgres. I'm a fan on Postgres and MongoDB - it just don't sound like there's a good reason to move away from a relational db for data that certainly sounds relational in nature.
In a word, YES, you should probably be looking at something else: Cassandra, Hadoop, MongoDB, something.
MongoDB is basically going to reduce your sample schema to "clusters" and "news", with everything else basically being contained in those two.
The good news:
This will make it easy to modify fields.
Map-reduce operations are a natural fit for the type of work that you're doing. You perform a map-reduce and then save the data back to the "news" item and all will be well.
The bad news:
It's easy to lose track of the structure of data with something like Mongo. Hadoop and Hive typically force your schema little more. But in any case, you'll need to write down some form of schema or just drown.
If you plan to do this for some non-trivial amount of data, then you're going to want "horizontal" scalability. MongoDB is "ok" for this, Hadoop is definitely a "leader" for this.