NoSQL, HBase, Cassandra: conceptualizing the db

Suppose I have a 1-to-N relationship, for example
Student, College.
Student attributes:
Name, Surname, CollegeFKey.
College attributes:
CollegeKey, Other, Other.
Suppose that I have a program which reads students and exams from a plain text file, and that this file contains duplicated colleges and duplicated students,
as in a denormalized table:
CollegeId,Other,Other,Name,Surname,CollegeFkey
e.g.
1,x,y,Mike,M,1
1,x,y,R,P,1
...
...
...
As you can see, in this case I always have to check that I have not already inserted key 1 into the College table of my normalized db.
How can I solve this in HBase or Cassandra? I mean, if I have 10000 or more tables and rows, I don't want to check every primary key, and then every FK, to see whether it was already inserted.
Can I use a NoSQL db to work directly with the denormalized data?
Can you link me to an example that solves this problem?

You can use Cassandra (http://wiki.apache.org/cassandra/) with a high-level language client (I use Hector for Java, https://github.com/rantav/hector). In Cassandra you would define a College ColumnFamily, and in each College row you write Student columns that contain the information about the students.
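To see why this layout avoids the duplicate check: writing to a row key and column that already exist overwrites them instead of creating a second copy. The sketch below only imitates that behaviour with plain Python dictionaries; it does not talk to Cassandra or Hector, and the function name load_rows and the "Name:Surname" column naming are assumptions made for illustration.

```python
import csv
import io

def load_rows(lines):
    """Fold denormalized rows into {college_key: {"attrs": ..., "students": ...}}.

    Writing the same key twice simply overwrites it, which stands in for
    the upsert behaviour of a column-family write (in-memory assumption).
    """
    colleges = {}
    for college_id, other1, other2, name, surname, _fk in csv.reader(lines):
        row = colleges.setdefault(college_id, {"attrs": {}, "students": {}})
        row["attrs"] = {"other1": other1, "other2": other2}
        # One column per student, named after the student (hypothetical scheme).
        row["students"][f"{name}:{surname}"] = ""
    return colleges

# The third line repeats the first one on purpose.
sample = io.StringIO("1,x,y,Mike,M,1\n1,x,y,R,P,1\n1,x,y,Mike,M,1\n")
colleges = load_rows(sample)
print(len(colleges), sorted(colleges["1"]["students"]))
```

No "was this key already inserted?" lookup is needed: the repeated college and the repeated student collapse into the existing entries.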

Related

Postgres db design Normalize tables or Use Array Columns

Newbie trying to figure out the best way to design a Postgres db for the following use case scenario.
There is an Account table for the business customers and there is a contacts table with a column relationship.
account.pk_id, ….
contacts.pk_id, contacts.fk_accountid …
Thousands of different businesses in the Accounts table will be storing millions of contacts each in the Contacts table.
Each contact record will over time belong to between 1 and 100 different categories, lists and products.
If I use a classic sql master/child relationship I potentially end up with millions and millions of rows in tables such as contacts_categories, contacts_lists and contacts_products which would reference from Categories, Lists & Products tables.
Alternatively, I could store the related keys (UUIDs) for categories, lists and products in three character varying[] array columns in the contact row. This would eliminate the need for the contacts_categories, contacts_lists and contacts_products tables, which would be quite large.
With tools like SELECT unnest(), array_append() and the array index options it seems like a smart solution, but I am curious to know whether it is better to stick to normalized relations, with more tables and higher row counts, for performance and/or storage, memory and cost.
Has anybody tried this before?
Too many people have tried that, and it is a bad idea. Many of your queries, particularly joins, will become complicated and slow. Besides, you won't be able to have foreign key constraints to guarantee data integrity.
Relational databases are good at coping with millions of rows in a table. Keep your schema normalized.
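A minimal sketch of the normalized layout, using Python's built-in sqlite3 module as a stand-in for Postgres (an assumption made so the example runs anywhere). The table names come from the question; the column names contact_id and category_id and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE contacts   (pk_id INTEGER PRIMARY KEY, name  TEXT);
CREATE TABLE categories (pk_id INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE contacts_categories (
    contact_id  INTEGER NOT NULL REFERENCES contacts(pk_id),
    category_id INTEGER NOT NULL REFERENCES categories(pk_id),
    PRIMARY KEY (contact_id, category_id)
);
""")
conn.execute("INSERT INTO contacts VALUES (1, 'Ann')")
conn.execute("INSERT INTO categories VALUES (10, 'newsletter')")
conn.execute("INSERT INTO contacts_categories VALUES (1, 10)")

# A link to a category that does not exist is rejected by the constraint.
try:
    conn.execute("INSERT INTO contacts_categories VALUES (1, 999)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("""
    SELECT c.name, g.label
    FROM contacts c
    JOIN contacts_categories cc ON cc.contact_id = c.pk_id
    JOIN categories g ON g.pk_id = cc.category_id
""").fetchall()
print(rejected, rows)
```

With an array column holding the category keys, the database could not reject the dangling reference the way the junction table does here.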

Reference foreign keys using SSIS-Lookup

I am asking for help on the following topic. I am trying to create an ETL process with two Excel data sources (S1 ~300 rows and S2 ~7000 rows). S1 contains project information and employee details and S2 contains the amount of hours, which each employee worked in which project at a timestamp.
I want to insert the amount of hours which each employee worked in each project at a timestamp into the fact table, referencing the existing primary keys in the dimension tables. If an entry is not already present in the dimension tables, I want to add a new entry first and use the newly generated id. The destination table structure looks as follows (Data Warehouse, Star Schema): Destination Table Structure
In SSIS, I first created three Data Flow tasks for filling the Dimension Tables (project, employee and time) with distinct values (using group by, as S1 and S2 contain a lot of duplicate rows), and a fourth Data Flow task (see image below) to insert the FactTable data, and this is where I'm running into problems:
Data Flow Task FactTable
I am using three Lookup functions to retrieve the foreign keys project_id, employee_id and time_id from the Dimension tables (using project name, employee number and timestamp). If the id is found, it is passed on all the way to Merge Join 1; if not, a new Dimension entry is created (let's say project) and the generated project_id is passed on instead. The same goes for employee and time respectively.
There are two issues with this:
1) The "amount of hours" (passed by Multicast four, see image above) is not matched in the final result (No Match)
2) The number of rows being inserted keeps increasing forever (Endless Join, I believe due to the Merge Joins).
What I've tried:
I have used one UNION instead of three Merge Joins before, but this resulted in each foreign key ending up in a separate row instead of being merged together.
I used Merge (instead of Merge Join) and combined the join and sort conditions in what I feel were all possible ways.
I understand that this scenario might be confusing for everybody else, but thank you for taking the time to look at it! Any help is greatly appreciated.
Solved it
For anybody having similar issues:
Separating the Data Flows that fill the Dimension Tables from those that fill the Fact Table will do the trick.
It's a clean solution and easier to debug.
Also: don't run the Lookup functions in parallel, but rather one after the other, passing the attributes along. This saves unnecessary Merges as well.
To sum up:
Four Data Flow Tasks, three for filling dimension tables ONLY and one for filling the fact table ONLY.
Loading Multiple Tables using SSIS keeping foreign key relationships
The answer posted by onupdatecascade is basically it.
Good luck!
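The same ordering can be sketched outside SSIS in plain Python: the dimensions are filled first, then each fact row picks up its three foreign keys one lookup after another, so nothing has to be merged back together. The helper names get_or_create and load, the in-memory dictionaries and the sample rows are assumptions for illustration, not SSIS components.

```python
def get_or_create(dimension, natural_key):
    """Return the surrogate id for natural_key, adding a new entry if missing."""
    if natural_key not in dimension:
        dimension[natural_key] = len(dimension) + 1
    return dimension[natural_key]

def load(source_rows):
    projects, employees, times = {}, {}, {}
    # Step 1: fill the dimension tables only (duplicates collapse by key).
    for project, employee, timestamp, _hours in source_rows:
        get_or_create(projects, project)
        get_or_create(employees, employee)
        get_or_create(times, timestamp)
    # Step 2: fill the fact table only. The lookups run sequentially on the
    # same row, so the hours value travels with the row and is never lost.
    facts = []
    for project, employee, timestamp, hours in source_rows:
        project_id = projects[project]
        employee_id = employees[employee]
        time_id = times[timestamp]
        facts.append((project_id, employee_id, time_id, hours))
    return projects, employees, times, facts

rows = [
    ("P1", "E1", "2020-01-01", 8),
    ("P1", "E2", "2020-01-01", 4),
    ("P2", "E1", "2020-01-02", 6),
]
projects, employees, times, facts = load(rows)
print(facts)
```

Each source row yields exactly one fact row, which is the property the parallel lookups followed by Merge Joins failed to keep.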

MATLAB- Joining tables w/ overlapping data using key variable WHERE neither table contains all data points from the other one

I am working on combining 2 tables with different types of patient information using the PID (Patient Identity) feature present in both tables. Usually the function "join" (https://www.mathworks.com/help/matlab/ref/table.join.html) does the trick when one of the tables has information on all the patients from the other one. But in my case, both tables have certain values of PID (or information for new patients) that aren't present in the other one. How do I create a new table, using patient info from both tables, that only contains info on the patients present in both tables?
I could probably write some long, clunky code to do this manually, but I was wondering if there's a function (or a few functions) that can do the task more efficiently. Thank you
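The solution is to use either innerjoin or outerjoin. innerjoin keeps only the rows whose key (here PID) appears in both tables, which is what you are asking for; outerjoin keeps the rows from both tables and fills in missing values where a PID is present in only one of them.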
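The difference between the two joins can be illustrated with a small sketch in plain Python (not MATLAB). It only demonstrates the semantics on dictionaries keyed by PID; the function names and the sample patient data are made up for the example.

```python
def inner_join(left, right):
    """Keep only the PIDs present in both tables."""
    return {pid: {**left[pid], **right[pid]}
            for pid in left.keys() & right.keys()}

def outer_join(left, right):
    """Keep every PID; fields missing on one side are simply absent."""
    return {pid: {**left.get(pid, {}), **right.get(pid, {})}
            for pid in left.keys() | right.keys()}

table_a = {1: {"age": 40}, 2: {"age": 55}}
table_b = {2: {"bp": 120}, 3: {"bp": 135}}

both = inner_join(table_a, table_b)
everyone = outer_join(table_a, table_b)
print(both, sorted(everyone))
```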

Suggest a database for key with multiple values , highly scalable

We have data with key-multipleValues. Each key can have around 500 values (each value will be around 200-300 chars) and the number of such keys will be around 10 million. Major operation is to check for a value given a key.
I've been using MySQL for a long time, where I've got 2 options: one row for each key-value pair, or one row for each key with all its values in a text field. But neither seems efficient to me, as the first model has a lot of rows and redundancy, and in the second model the text field will become very large.
I am considering using a NoSQL database for this purpose. I've used MongoDB before and I don't think it is suitable for my current case; a key-value or column-family NoSQL db would be better. It need not be distributed. If you have used Riak, Redis, Cassandra etc., please share your thoughts.
Thanks
From your description, it seems some sort of key-value store will suit you better than a relational DB.
The data itself seems to be non-relational, so why store it in relational storage? It seems valid to use something like Cassandra.
I think a typical data structure for storing this data would be a column family, with the key as the row key and the values as columns.
MyDATA: (ColumnFamily)
RowKey=>Key
Column1=>val1
Column2=>val2
...
...
ColumnN=>valN
The data would look like (JSON notation):
MyDATA (CF){
[
{key1:[{val1-1:'', timestamp}, {val1-2:'', timestamp}, .., {val1-500:'', timestamp}]},
{key2:[{val2-1:'', timestamp}, {val2-2:'', timestamp}, .., {val2-500:'', timestamp}]},
...
...
]
}
Hopefully this helps.
Try the direct, normalized approach: One table with this schema:
id (primary key)
key
value
You have one row for every key->value relation
Add an index for each column, and lookup should be reasonably efficient. Have you profiled any of this to exhibit a bottleneck?
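A minimal sketch of this normalized table, using Python's built-in sqlite3 module instead of MySQL so that it runs without a server (an assumption). The columns are named item_key and item_value rather than key and value because KEY is a reserved word in MySQL, and a single composite index on both columns is used here instead of one index per column, since the main operation is checking a value for a given key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE kv (
    id         INTEGER PRIMARY KEY,
    item_key   TEXT NOT NULL,
    item_value TEXT NOT NULL
);
-- One index covering both columns answers "does key K have value V?"
CREATE UNIQUE INDEX kv_key_value ON kv (item_key, item_value);
""")
conn.executemany(
    "INSERT INTO kv (item_key, item_value) VALUES (?, ?)",
    [("key1", "val1-1"), ("key1", "val1-2"), ("key2", "val2-1")],
)

def has_value(conn, key, value):
    """Return True if the (key, value) pair is stored."""
    row = conn.execute(
        "SELECT 1 FROM kv WHERE item_key = ? AND item_value = ? LIMIT 1",
        (key, value),
    ).fetchone()
    return row is not None

print(has_value(conn, "key1", "val1-2"), has_value(conn, "key2", "val1-2"))
```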
This does map straightforwardly to Cassandra. Row key will be your model key, and your model values will be column names (yes, names) in Cassandra. You can leave the Cassandra column value empty, or add metadata there such as timestamp if that would be useful.
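The "values as column names" idea can be sketched with an in-memory stand-in: each row key maps to a dictionary whose keys are the model values, so checking a value for a key is a single lookup. The class WideRowStore and its methods are hypothetical and do not use any Cassandra client.

```python
import time
from collections import defaultdict

class WideRowStore:
    """In-memory imitation of one column family: row key -> {column name: metadata}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def add(self, row_key, value, timestamp=None):
        # The model value becomes the column NAME; the column value only
        # carries metadata (here a timestamp).
        self._rows[row_key][value] = timestamp if timestamp is not None else time.time()

    def contains(self, row_key, value):
        # .get() avoids creating an empty row for unknown keys.
        return value in self._rows.get(row_key, {})

store = WideRowStore()
store.add("key1", "val1-1")
store.add("key1", "val1-2")
print(store.contains("key1", "val1-2"), store.contains("key1", "nope"))
```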
I don't think this is beyond the scale of MySQL on a single machine. You'll need to tune inserts or it'll take forever to load. You might also consider compressing your values using COMPRESS() or in your app directly. Might save you 50% or so.
Redis is basically an in-memory database, so it's probably out. Riak might be a decent choice or HBase or Cassandra.

how to design Hbase schema?

Suppose that I have this RDBMS table (Entity-attribute-value model):
col1: entityID
col2: attributeName
col3: value
and I want to use HBase due to scaling issues.
I know that the only way to access an HBase table is by its primary key (cursor): you can get a cursor for a specific key and iterate over the rows one by one.
The issue is that, in my case, I want to be able to iterate on all 3 columns.
for example :
for a given entityID I want to get all its attributes and values
for a given attributeName and value I want to get all the entityIDs
...
So one idea I had is to build one HBase table that will hold the data (table DATA, with entityID as the primary index), and 2 "index" tables, one with attributeName as the primary key and the other with value.
Each index table will hold a list of pointers (entityIDs) into the DATA table.
Is it a reasonable approach, or is it an 'abuse' of HBase concepts?
In this blog the author says:
HBase allows get operations by primary key and scans (think: cursor) over row ranges. (If you have both scale and need of secondary indexes, don’t worry - Lucene to the rescue! But that’s another post.)
Do you know how Lucene can help?
-- Yonatan
Secondary indexes would indeed be useful for many potential applications of HBase, and I believe the developers are in fact looking at it. Check out http://www.mail-archive.com/hbase-dev#hadoop.apache.org/msg04801.html.
In the meantime though, if your application data storage can be modelled as a star schema (see http://en.wikipedia.org/wiki/Star_schema), you might like to check out the solution that Hypertable proposes for secondary index-type needs: http://markmail.org/message/rphm4q6cbar2ycgp
I recommend having two different flat tables: one for looking up attributes+values given entityID, and one for looking up the entityID given attributes+values.
Table 1 would look like this:
entityID1 {
attribute1: value1;
attribute2: value2;
...
}
and Table 2:
attribute1_value1 {
entityID1;
}
attribute2_value2 {
entityID1;
}
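The two-table layout can be sketched with plain Python dictionaries standing in for the HBase tables (an assumption; no HBase client is used). The put helper and the sample data are hypothetical. Note that the "attribute_value" row key becomes ambiguous if attribute names can themselves contain "_", so a real schema would need a safer separator.

```python
from collections import defaultdict

table1 = defaultdict(dict)  # entityID -> {attribute: value}
table2 = defaultdict(set)   # "attribute_value" -> {entityID, ...}

def put(entity_id, attribute, value):
    """Write to both tables so the two lookups stay consistent."""
    old = table1[entity_id].get(attribute)
    if old is not None:
        # Drop the stale reverse entry when an attribute is overwritten.
        table2[f"{attribute}_{old}"].discard(entity_id)
    table1[entity_id][attribute] = value
    table2[f"{attribute}_{value}"].add(entity_id)

put("e1", "color", "red")
put("e2", "color", "red")
put("e1", "size", "L")

print(table1["e1"])                 # attributes and values for an entityID
print(sorted(table2["color_red"]))  # entityIDs for an attribute and value
```

The cost of this design is that every write touches both tables, and the application is responsible for keeping them in step.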