how to design Hbase schema? - rdbms

suppose that I have this RDBM table (Entity-attribute-value_model):
col1: entityID
col2: attributeName
col3: value
and I want to use HBase due to scaling issues.
I know that the only way to access Hbase table is using a primary key (cursor). you can get a cursor for a specific key, and iterate the rows one-by-one .
The issue is, that in my case, I want to be able to iterate on all 3 columns.
for example :
for a given an entityID I want to get all its attriutes and values
for a give attributeName and value I want to all the entitiIDS
...
so one idea I had is to build one Hbase table that will hold the data (table DATA, with entityID as primary index), and 2 "index" tables one with attributeName as a primary key, and the other one with value
each index table will hold a list of pointers (entityIDs) for the DATA table.
Is it a reasonable approach ? or is is an 'abuse' of Hbase concepts ?
In this blog the author say:
HBase allows get operations by primary
key and scans (think: cursor) over row
ranges. (If you have both scale and
need of secondary indexes, don’t worry
- Lucene to the rescue! But that’s another post.)
Do you know how Lucene can help ?
-- Yonatan

Secondary indexes would indeed be useful for many potential applications of HBase, and I believe the developers are in fact looking at it. Checkout http://www.mail-archive.com/hbase-dev#hadoop.apache.org/msg04801.html.
In the mean time though, if your application data storage can be modelled as a star schema (see http://en.wikipedia.org/wiki/Star_schema) you might like to checkout the solution that Hypertable proposes for secondary index-type needs http://markmail.org/message/rphm4q6cbar2ycgp

I recommend having two different flat tables: one for looking up attributes+values given entityID, and one for looking up the entityID given attributes+values.
Table 1 would look like this:
entityID1 {
attribute1: value1;
attribute2: value2;
...
}
and Table 2:
attribute1_value1 {
entityID1;
}
attribute2_value2 {
entityID1;
}

Related

Sphinx centralize multiple tables into a single index

I do have multiple tables (MySQL) and I want to have a single index for them.
Each table has the primary key of int autoincrement type.
The structure of collected data is the same for each table (so no conflict), but as the IDs collide so it seems that I have to query each index separately (unless you can give me a hint of how to avoid ID collision)
Question is: If I query each index separately does it means that the weight of returned results are comparable between indexes?
unless you can give me a hint of how to avoid ID collision
See for example
http://sphinxsearch.com/forum/view.html?id=13078
You can just arrange for the ids to be offset differently. The 'sphinx document id' doesnt have to match the real primary key, but having a simple mapping makes the application simpler.
You have a choice between one-index, one-source (using a single sql query to union all the tables together. one-index, many-source. (a source per table, all making one index) or many-indexes (one index per table, each with own source). Which ever way will give the same query results.
If I query each index separately does it means that the weight of returned results are comparable between indexes?
Pretty much. The difference should be negiblibe that doesnt matter whic way round you do it.

Amazon DynamoDB table design and querying

We are considering DynamoDB for an expectedly large dataset. I come from a strong SQL background so the No-SQL way of thinking is new to me.
I have a problem and design, but ran into what appears to be a dead end.
The documentation says to make sure your Hash keys are widely distributed to aid in performance, okay that makes sense.
I am going to be recording various datapoints/actions for users. It makes sense to me that the hash key should be the user-id, and my range key can be the action(s) performed.
Now, if I want all the actions user #1 performs, I can easily query that.
But, if I want all the USERS who performed action X, I cannot do that without a table scan. From the Query documentation:
A Query operation directly accesses items from a table using the table primary key, or from an index using the index key. You must provide a specific hash key value.
So it would seem I am limited to getting data from a specific user, unless I am willing to do a table scan, which is slower and consumes many capacity units.
My question is, I think, ultimately a design question. Maybe I am missing something when it comes to No-SQL? Should my hash key be something else? Or is it simply that my requirements do not fit in with No-SQL (and more specifically, DynamoDB)?
It is almost as if the hash key is a kind of grouping with DynamoDB. I considered changing the hash key to the actions we are intending to put into place, but then I am not widely distributing my keys...
The DynamoDb way to meet your requirement to allow both types of queries is to store the data in two tables, one with hash key user-id and range key action-id, and one with hash key action-id and range key user-id.
And you should think about if you need all the data in both tables, or if one can be a summary table. For example, say you have a limited number of possible actions. Instead of putting the full record of every action in the user-keyed table, you might want a table with only one row for each user: a hash key of user - id, and a second column that is multiply valued and is a list of any action-id that the user has performed at least once.
You must create a Global Secondary Index (GSI). What this does is it creates a second pair of hash and range keys which differ from the original keys. You can then query the same table by also including an index name in your parameters.
Example in JS:
var table = tablename;
var index = actionId-username-gsi;
var action = actionId;
var params = {
TableName : table,
IndexName : index,
KeyConditionExpression : 'actionId = :v_actionId',
ExpressionAttributeValues : {
':v_actionId': { N : action }
},
ProjectionExpression : 'actionId, username'
};
ddb.query(params, err) {
if(err) {
// Oh well
} else {
// Do something
}
};
This will query the actionId-username-gsi index and look for any actionId hashes with the value provided. Using ProjectionExpression will return only the specified attributes' values for each item, lowering throughput if that ever becomes a concern. I hope this helps answer your question.
node.js aws amazon-dynamodb nosql
I guess the global secondary indexes option is better, as you get a single table.
Creating two tables will create redundancy and additional work to maintain consistency when doing any CUD (Create, Update, Delete) operation on any one table.

Are there any alternatives to HBASE in particular with regards to key range scans?

The most relevant feature that I appreciate in HBASE is the key range scan, where you can combine your keys under a higher level key with lower level ones, which allows you to obtain a hierarchy of data related to the higher level keys.
For example:
CUSTOMER ID = C100
DEPARTMENT ID = D100
USER ID = U100
The key for the above example would be
C100D100U100K01 : "my data for k01"
C100D100U100K02 : "my data for k02"
C100D100U100K03 : "my data for k03"
...
With the above, you would be able to fetch all of the data related to your customer ID by performing a range scan on C100* or if more details where needed, by department such as C100D100U100*, and so on.
Are there any alternatives to HBASE with this regard in the NOSQL spectrum of solutions ?
Any hierarchical key-value store would work. There's a (short) list on Wikipedia : Hierarchical key-value store.
The one I know best is GT.M, where your sample data could look like this :
customer("C100","D100","U100","K01")="my data for k01"
customer("C100","D100","U100","K02")="my data for k02"
customer("C100","D100","U100","K03")="my data for k03"
So customer("C100") would gives you access to all the data of a single customer, customer("C100","D100") would gives you access to all the data for a single department for a single customer, etc.
Couchbase has similar functionality if you use views (an index). You can create a view on all the keys, and do range queries over them. As far as I know, you can only wildcard over the end of a key but not the beginning, e.g.:
AAABBBCCCDDD* // yes
*BBBCCCDDDEEE // no
AAA*CCCDDDEEE // no
This is because it sorts the keys, and when you query you're getting a sub-range. However, you can get around this by creating views that sort by a different order.
More info: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views.html
Riak has secondary indexes that would allow querying data by matching the index or by range scan. The results from secondary indexes can be used as an input for Riak's MapReduce. Check this for more details: riak secondary indexes

Postgres hstore for time series

I am new to postgres and am experimenting with the hstore extension.Looking for some guidance. I need to support basic reporting on timeseries data for various products that we sell. I have a large amount data in the format "Timestamp, Value" for each product. This data is available in a csv fle for each product.
I am thinking of using hstore to store this data in the key value format. Assuming that all the timeseries data for a single product can be stored in a single hstore object. I need to be able to query this data by specific times, say what was the value of a product at a given time? Also need to run simple queries like retrieving the times where the product costed more than $100.
I'm planning to have a table with a product id column and an hstore column. But I am not very clear on how to make this work:
The hstore column needs to be loaded from thousands of timestamp,value records that exist in a csv. The hstore should be appended whenever we get a new csv.
The table needs to store the productId and corresponding Timeseries data.
Can you please advise if using hstore would be helpful ? If yes then how can I load data from csv as explained above. Also, if there could be any impact on the performance on inserts/updates in the hstore, as data grows please share your experiences.
I do think you should start with a simple, normalised schema first, especially since you are new to PostgreSQL. Something like:
CREATE TABLE product_data
(
product TEXT, -- I'm making an assumption about the types of your columns
time TIMESTAMP,
value DOUBLE PRECISION,
PRIMARY KEY (product, time);
);
I would definitely keep hstore and similar options in mind, if and when your data becomes large enough that efficiency is more important and simplicity. But note that all options have an efficiency tradeoff.
Do you know how much data you're going to support? Number of products, number of distinct timestamps for each product?
What other queries do you want to run? A query for the times where a single product cost more than $100 would benefit from an index on (product, value), if the product has many distinct timestamps.
Other options
hstore is most useful if you want to store a table set of arbitrary key-value pairs in a row. You could use it here, with a row for each product, and each distinct timestamp for that product being a key in the product's table. The downsides are that keys and values in hstore are text, whereas your keys are timestamps, and your values are numbers of some kind. So there will be a certain reduction in type checking, and a certain increase in type casting cost required. Another possible downside is that some queries on the hstore might not use indexes very efficiently. The above table can use simple btree indexes for range queries (say you want to pull out the values between two dates for a product). But hstore indexes are much more limited; you can use a gist or gin index on an hstore column to find all the rows that feature a certain key.
Another option (which I've played with and use experimentally for some of my databases) is arrays. Basically, each product will have an array of values, and each timestamp maps to an index in the array. This is easy if the timestamps are perfectly regular. For example, if all your products had a value every hour for every day, you could use a table like this:
CREATE TABLE product_data
(
product TEXT,
day DATE,
values DOUBLE PRECISION[], -- An array from 0 to 23.
PRIMARY KEY (product, day);
);
You can construct views and indexes to make querying this table moderate easy. (I wrote a blog post on this technique at http://ejrh.wordpress.com/2011/03/20/vector-denormalisation-in-postgresql/.)
But my advice is still: start with a simple table, then explore ways to improve efficiency when you know you're going to need them.

Suggest a database for key with multiple values , highly scalable

We have data with key-multipleValues. Each key can have around 500 values (each value will be around 200-300 chars) and the number of such keys will be around 10 million. Major operation is to check for a value given a key.
I've been using mysql for long time where i've got 2 options: one row for each keyvalue, one row for each key with all values in a text field.But these does not seem efficient to me as the first model has lot of rows,redundancies and second model text field will become very large .
I am considering using nosql database for this purpose, i've used mongodb before and i dont think it is suitable for my current case. keyvalue based or column family based nosql db would be better.It need not be distributed.Someone who used riak,redis,cassandra etc pls share your thoughts.
Thanks
From your description, it seems some sort of Key-value store will be better for you comparing relational DB.
The data itself seem to be a non-relational, why store in a relational storage? It seems valid to use something like Cassandra.
I think a typical data-structure for this data to store will be a column family, with Key as Row-key and Columns as value.
MyDATA: (ColumnFamily)
RowKey=>Key
Column1=>val1
Column2=>val2
...
...
ColumnN=valN
The data would look like (JSON notation):
MyDATA (CF){
[
{key1:[{val1-1:'', timestamp}, {val1-2:'', timestamp}, .., {val1-500:'', timestamp}]},
{key2:[{val2-1:'', timestamp}, {val2-2:'', timestamp}, .., {val2-500:'', timestamp}]},
...
...
]
}
Hopefully this helps.
Try the direct, normalized approach: One table with this schema:
id (primary key)
key
value
You have one row for every key->value relation
Add an index for each column, and lookup should be reasonably efficient. Have you profiled any of this to exhibit a bottleneck?
This does map straightforwardly to Cassandra. Row key will be your model key, and your model values will be column names (yes, names) in Cassandra. You can leave the Cassandra column value empty, or add metadata there such as timestamp if that would be useful.
I don't think this is beyond the scale of MySQL on a single machine. You'll need to tune inserts or it'll take forever to load. You might also consider compressing your values using COMPRESS() or in your app directly. Might save you 50% or so.
Redis is basically an in-memory database, so it's probably out. Riak might be a decent choice or HBase or Cassandra.