UPSERTing with TitanDB

I'm taking my first steps as a TitanDB user. That being said, I'd like to know how to make an upsert / conditionally insert a vertex inside a TitanTransaction (in the style of "get or create").
I have a unique index on the vertex/property I want to create/lookup.

Here's a one-liner "getOrCreate" for Titan 1.0 and TinkerPop 3:
getOrCreate = { id ->
    g.V().has('userId', id).tryNext().orElseGet { g.addV('userId', id).next() }
}
This is taken from the new TinkerPop "Getting Started" tutorial. Here is the same code translated to Java:
public Vertex getOrCreate(Object id) {
    // Note: Java string literals need double quotes; the Groovy-style single quotes won't compile
    return g.V().has("userId", id).tryNext().orElseGet(() -> g.addV("userId", id).next());
}
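For illustration, here's a minimal usage sketch (my own, not from the tutorial; the id value is made up, and graph is assumed to be the Titan graph instance behind g):

// The second call finds the vertex the first call created.
Vertex v1 = getOrCreate("u-123");
Vertex v2 = getOrCreate("u-123");
assert v1.equals(v2); // same vertex both times
graph.tx().commit();  // commit the enclosing transaction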

Roughly speaking, every Cassandra insert is an "upsert". If you look at the Titan representation of vertices and edges in a Cassandra-like model, you'll find that vertices and edges each get their own rows. This means a blind write of an edge will give you the behavior you're looking for: what you write is what wins. Doing this with a vertex isn't supported directly by Titan.
But I don't think this is what you're looking for. If you're looking to enforce uniqueness, why not use the unique() modifier on a Titan composite index? (From the documentation):
mgmt.buildIndex('byNameUnique', Vertex.class).addKey(name).unique().buildCompositeIndex()
With a Cassandra storage backend, you need to enable consistency locking. As with any database, managing the overhead of uniqueness comes at a cost which you need to consider when writing your data. This way if you insert a vertex which violates your uniqueness requirement, the transaction will fail.
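For illustration, here's roughly what that looks like end to end against the Titan 1.0 management API (a sketch of my own; the property and index names are made up):

import com.thinkaurelius.titan.core.PropertyKey;
import com.thinkaurelius.titan.core.schema.ConsistencyModifier;
import com.thinkaurelius.titan.core.schema.TitanGraphIndex;
import com.thinkaurelius.titan.core.schema.TitanManagement;
import org.apache.tinkerpop.gremlin.structure.Vertex;

// Build a unique composite index on "userId" and enable locking so that
// concurrent transactions on an eventually-consistent backend such as
// Cassandra cannot both claim the same value.
TitanManagement mgmt = graph.openManagement();
PropertyKey userId = mgmt.makePropertyKey("userId").dataType(String.class).make();
TitanGraphIndex byUserId = mgmt.buildIndex("byUserIdUnique", Vertex.class)
        .addKey(userId).unique().buildCompositeIndex();
mgmt.setConsistency(byUserId, ConsistencyModifier.LOCK); // consistency locking
mgmt.commit();

With this in place, a transaction that tries to add a second vertex with an existing userId fails on commit instead of silently creating a duplicate.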

Related

Make an index unique after creation in Titan

If I create an index according to the docs (http://s3.thinkaurelius.com/docs/titan/0.5.4/indexes.html) without making it unique, is it possible to make it unique afterwards? I have not added any vertices or edges to the graph, just created the index.
Something like:
index = mgmt.getGraphIndex('name')
index.unique()
I am using the Gremlin console to make these changes.
Is it possible to do this somehow?
This is a documented limitation of Titan. See http://s3.thinkaurelius.com/docs/titan/0.5.0/limitations.html, section 14.2.1, "Unable to Drop Indices".
Since no vertices or edges have been added to the graph, try the Gremlin commands below:
g.V.remove() or g.V.each{g.removeVertex(it)}
g.commit()
Then try to create the indexes again with .unique().
If you are still unable to re-create the indices, try cleaning the storage backend. In the case of Cassandra: DROP KEYSPACE titan;
This definitely works; I tried it in Titan 0.4 and it worked.
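For reference, re-creating the index with uniqueness would look something like this against the Titan 0.5 management API (a sketch of my own; the index name is made up):

// Titan 0.5: open the management system and rebuild the index as unique
TitanManagement mgmt = g.getManagementSystem();
PropertyKey name = mgmt.getPropertyKey("name");
mgmt.buildIndex("byNameUnique", Vertex.class).addKey(name).unique().buildCompositeIndex();
mgmt.commit();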

Amazon DynamoDB table design and querying

We are considering DynamoDB for an expectedly large dataset. I come from a strong SQL background so the No-SQL way of thinking is new to me.
I have a problem and a design, but I've run into what appears to be a dead end.
The documentation says to make sure your hash keys are widely distributed to aid performance; okay, that makes sense.
I am going to be recording various datapoints/actions for users. It makes sense to me that the hash key should be the user-id, and my range key can be the action(s) performed.
Now, if I want all the actions user #1 performs, I can easily query that.
But, if I want all the USERS who performed action X, I cannot do that without a table scan. From the Query documentation:
A Query operation directly accesses items from a table using the table primary key, or from an index using the index key. You must provide a specific hash key value.
So it would seem I am limited to getting data from a specific user, unless I am willing to do a table scan, which is slower and consumes many capacity units.
My question is, I think, ultimately a design question. Maybe I am missing something when it comes to No-SQL? Should my hash key be something else? Or is it simply that my requirements do not fit in with No-SQL (and more specifically, DynamoDB)?
It is almost as if the hash key is a kind of grouping with DynamoDB. I considered changing the hash key to the actions we are intending to put into place, but then I am not widely distributing my keys...
The DynamoDB way to meet your requirement of allowing both types of queries is to store the data in two tables: one with hash key user-id and range key action-id, and one with hash key action-id and range key user-id.
You should also think about whether you need all the data in both tables, or whether one can be a summary table. For example, say you have a limited number of possible actions. Instead of putting the full record of every action in the user-keyed table, you might want a table with only one row per user: a hash key of user-id, and a second attribute that is a set listing every action-id the user has performed at least once.
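To make the two-table layout concrete, here is a sketch using the AWS SDK for Java (my own illustration; the table and attribute names are made up):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
import com.amazonaws.services.dynamodbv2.model.*;

AmazonDynamoDBClient client = new AmazonDynamoDBClient();

// Table 1: all actions for a given user (hash = userId, range = actionId).
client.createTable(new CreateTableRequest()
    .withTableName("ActionsByUser")
    .withAttributeDefinitions(
        new AttributeDefinition("userId", ScalarAttributeType.S),
        new AttributeDefinition("actionId", ScalarAttributeType.S))
    .withKeySchema(
        new KeySchemaElement("userId", KeyType.HASH),
        new KeySchemaElement("actionId", KeyType.RANGE))
    .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L)));

// Table 2: all users for a given action (the same keys, mirrored).
client.createTable(new CreateTableRequest()
    .withTableName("UsersByAction")
    .withAttributeDefinitions(
        new AttributeDefinition("actionId", ScalarAttributeType.S),
        new AttributeDefinition("userId", ScalarAttributeType.S))
    .withKeySchema(
        new KeySchemaElement("actionId", KeyType.HASH),
        new KeySchemaElement("userId", KeyType.RANGE))
    .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L)));

Every write then goes to both tables (or to a summary row, as described above).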
You can also create a Global Secondary Index (GSI). What this does is create a second pair of hash and range keys which differ from the original keys. You can then query the same table by also including the index name in your parameters.
Example in JS:
var table = 'tablename';
var index = 'actionId-username-gsi';
var action = actionId.toString(); // DynamoDB numeric values are passed to the SDK as strings
var params = {
    TableName : table,
    IndexName : index,
    KeyConditionExpression : 'actionId = :v_actionId',
    ExpressionAttributeValues : {
        ':v_actionId': { N : action }
    },
    ProjectionExpression : 'actionId, username'
};
ddb.query(params, function(err, data) {
    if (err) {
        // Oh well
    } else {
        // Do something with data.Items
    }
});
This will query the actionId-username-gsi index and return any items whose actionId hash key matches the value provided. Using ProjectionExpression returns only the specified attributes for each item, reducing the amount of data transferred if that ever becomes a concern. I hope this helps answer your question.
I guess the global secondary indexes option is better, as you get a single table.
Creating two tables introduces redundancy and additional work to maintain consistency when doing any CUD (Create, Update, Delete) operation, since every change must be applied to both tables.

Are there any alternatives to HBase, in particular with regard to key range scans?

The feature I appreciate most in HBase is the key range scan, where you can compose keys from higher-level and lower-level components, which lets you retrieve a hierarchy of data related to the higher-level keys.
For example:
CUSTOMER ID = C100
DEPARTMENT ID = D100
USER ID = U100
The keys for the above example would be
C100D100U100K01 : "my data for k01"
C100D100U100K02 : "my data for k02"
C100D100U100K03 : "my data for k03"
...
With the above, you would be able to fetch all of the data related to a customer by performing a range scan on C100*, or, if more detail were needed, narrow it down by department (C100D100*) or user (C100D100U100*), and so on.
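For reference, this kind of prefix scan looks roughly like the following in the HBase Java client (a sketch; the table name is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable table = new HTable(conf, "customerdata");

// Range scan over every row whose key starts with C100D100,
// i.e. everything under department D100 of customer C100.
Scan scan = new Scan(Bytes.toBytes("C100D100"));
scan.setFilter(new PrefixFilter(Bytes.toBytes("C100D100")));
for (Result row : table.getScanner(scan)) {
    // process row ...
}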
Are there any alternatives to HBase in this regard within the NoSQL spectrum of solutions?
Any hierarchical key-value store would work. There's a (short) list on Wikipedia: Hierarchical key-value store.
The one I know best is GT.M, where your sample data could look like this:
customer("C100","D100","U100","K01")="my data for k01"
customer("C100","D100","U100","K02")="my data for k02"
customer("C100","D100","U100","K03")="my data for k03"
So customer("C100") would give you access to all the data of a single customer, customer("C100","D100") would give you access to all the data for a single department of a single customer, etc.
Couchbase has similar functionality if you use views (an index). You can create a view on all the keys, and do range queries over them. As far as I know, you can only wildcard over the end of a key but not the beginning, e.g.:
AAABBBCCCDDD* // yes
*BBBCCCDDDEEE // no
AAA*CCCDDDEEE // no
This is because it sorts the keys, and when you query you're getting a sub-range. However, you can get around this by creating views that sort the keys in a different order.
More info: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views.html
Riak has secondary indexes that would allow querying data by matching the index or by range scan. The results from secondary indexes can be used as an input for Riak's MapReduce. Check this for more details: riak secondary indexes

Does Cassandra support conditional queries?

I'm thinking of switching to Cassandra from my current SQL-esque solution (SimpleDB), mainly due to speed, cost, and the built-in caching feature of Cassandra. However, I'm stuck on the idea of indexing. I've gathered that in Cassandra you have to manually create indexes in order to execute complex queries. But what if you have data like the following, a row with a simple supercolumn:
row1 {value1="5", value2="7", value3="9"}
And you need to execute dynamic queries like "give me all the rows with value1 between x and y, and value2 between z and q", etc. Is this possible? Or if you have queries like this, is it a bad idea to use Cassandra?
Cassandra 0.7.x supports secondary indexes that let you make queries like the one above.
The following blog post describes the concept:
http://www.riptano.com/blog/whats-new-cassandra-07-secondary-indexes
Secondary indices were introduced in 0.7. However, to use an indexed_slice_query, you need at least one equals expression. For example, you can do value1 = x and value2 < y, but not a query made up of range expressions alone.
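As a sketch of what that constraint looks like in client code, here is roughly how such a query is built with the Hector client (my own illustration, assuming Hector's IndexedSlicesQuery API; the column family and values are made up):

import me.prettyprint.cassandra.model.IndexedSlicesQuery;
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

StringSerializer ss = StringSerializer.get();
IndexedSlicesQuery<String, String, String> query =
        HFactory.createIndexedSlicesQuery(keyspace, ss, ss, ss);
query.setColumnFamily("rows");
query.addEqualsExpression("value1", "5"); // at least one equality is required
query.addLteExpression("value2", "7");    // range expressions may then be added
query.setRange("", "", false, 100);       // columns to return per row
query.execute();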
See Cassandra API

How to design an HBase schema?

Suppose that I have this RDBMS table (entity-attribute-value model):
col1: entityID
col2: attributeName
col3: value
and I want to use HBase due to scaling issues.
I know that the only way to access an HBase table is via the primary key (a cursor): you can get a cursor for a specific key and iterate the rows one by one.
The issue is that, in my case, I want to be able to query on all three columns.
For example:
for a given entityID, I want to get all its attributes and values
for a given attributeName and value, I want all the entityIDs
...
So one idea I had is to build one HBase table that will hold the data (table DATA, with entityID as the primary index), and two "index" tables: one with attributeName as the primary key, and the other with value.
Each index table will hold a list of pointers (entityIDs) into the DATA table.
Is that a reasonable approach, or is it an 'abuse' of HBase concepts?
In this blog post the author says:
HBase allows get operations by primary key and scans (think: cursor) over row ranges. (If you have both scale and need of secondary indexes, don't worry - Lucene to the rescue! But that's another post.)
Do you know how Lucene can help ?
-- Yonatan
Secondary indexes would indeed be useful for many potential applications of HBase, and I believe the developers are in fact looking at it. Check out http://www.mail-archive.com/hbase-dev@hadoop.apache.org/msg04801.html.
In the meantime, though, if your application's data storage can be modelled as a star schema (see http://en.wikipedia.org/wiki/Star_schema), you might like to check out the solution that Hypertable proposes for secondary-index-type needs: http://markmail.org/message/rphm4q6cbar2ycgp
I recommend having two different flat tables: one for looking up attributes+values given entityID, and one for looking up the entityID given attributes+values.
Table 1 would look like this:
entityID1 {
attribute1: value1;
attribute2: value2;
...
}
and Table 2:
attribute1_value1 {
entityID1;
}
attribute2_value2 {
entityID1;
}
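As a sketch of how writes would keep the two tables in step, using the classic HBase client API (my own illustration; the table, family, and key names are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

Configuration conf = HBaseConfiguration.create();
HTable data = new HTable(conf, "table1");   // entityID -> attributes
HTable index = new HTable(conf, "table2");  // attribute_value -> entityIDs

// Write the attribute into the entity's row in Table 1.
Put dataPut = new Put(Bytes.toBytes("entityID1"));
dataPut.add(Bytes.toBytes("attrs"), Bytes.toBytes("attribute1"), Bytes.toBytes("value1"));
data.put(dataPut);

// Mirror it into Table 2: the row key is attribute_value, and each column
// name is an entityID pointing back at Table 1.
Put indexPut = new Put(Bytes.toBytes("attribute1_value1"));
indexPut.add(Bytes.toBytes("ids"), Bytes.toBytes("entityID1"), Bytes.toBytes(""));
index.put(indexPut);

Both puts must happen for every insert; a failure between them leaves the index stale, which is the usual cost of maintaining your own secondary index.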