How to use an HBase coprocessor to implement GROUP BY?

I recently learned about HBase coprocessors and used an endpoint to accumulate one column of an HBase table. For example, the table is named "pendings" and its family is "asset"; I sum all the values of "asset:amount". The table has other columns, such as "asset:customer_name". What I want to do now is sum the values of "asset:amount" grouped by "asset:customer_name". But I could not find an API for GROUP BY. Do you know how to implement GROUP BY, or which HBase API to use for it?

You should use an endpoint to do this work.
There is a sum example in this article: https://blogs.apache.org/hbase/entry/coprocessor_introduction.
What you basically need to add is a new key, "MyKey", formed by appending the customer name to your row key. Keep a variable holding the last seen MyKey; when the current MyKey differs from the previous one, emit the previous one along with its sum and replace the previous MyKey with the current one.
Make sure to perform the final aggregation on the client side, as is done in the example at that URL, because the rows of one customer may sit at the edges of two different regions.
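As a rough illustration of why the client-side step matters, here is a minimal sketch in plain Python. It does not use the HBase API at all: the regions are simulated as lists of rows, the function names are hypothetical, and it groups with a dictionary rather than the last-seen-key scan described above.

```python
from collections import defaultdict

def region_partial_sums(rows):
    """Simulate what one region's endpoint could return: a sum per customer."""
    sums = defaultdict(int)
    for row in rows:
        sums[row["asset:customer_name"]] += row["asset:amount"]
    return dict(sums)

def merge_partials(partials):
    """Client-side step: combine the partial sums returned by every region."""
    totals = defaultdict(int)
    for partial in partials:
        for customer, amount in partial.items():
            totals[customer] += amount
    return dict(totals)

# Two hypothetical regions; "bob" has rows in both of them.
region_a = [
    {"asset:customer_name": "alice", "asset:amount": 10},
    {"asset:customer_name": "bob", "asset:amount": 5},
]
region_b = [
    {"asset:customer_name": "bob", "asset:amount": 7},
    {"asset:customer_name": "carol", "asset:amount": 3},
]

totals = merge_partials([region_partial_sums(region_a),
                         region_partial_sums(region_b)])
print(totals)
```

Without the merge, "bob" would show up twice, once per region, each with only part of his total.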

An endpoint coprocessor can do this. First define the related (reduce) protocol interface that extends CoprocessorProtocol, then write an implementation of it, and finally code the client-side logic.

Related

Good URL syntax for a GET request with a composite key

Let's take the following resource in my REST API:
GET `http://api/v1/user/users/{id}`
In normal circumstances I would use this like so:
GET `http://api/v1/user/users/aabc`
Where aabc is the user id.
There are times, however, when I have had to design my REST API in a way that some extra information is passed with the ID. For example:
GET `http://api/v1/user/users/customer:1`
Where customer:1 denotes I am using an id from the customer domain to lookup the user and that id is 1.
I now have a scenario where the identifier is more than one key (a composite key). For example:
GET `http://api/v1/user/users/customer:1;type:agent`
My question: in the above URL, what should I use as the separator between customer:1 and type:agent?
According to https://www.ietf.org/rfc/rfc3986.txt I believe that the semi-colon is not allowed.
You should either:
Use parameters:
GET `http://api/v1/user/users?customer=1`
Or use a new URL:
GET `http://api/v1/user/users/customer/1`
But use Standards like this
("Paths tend to be cached, parameters tend to not be, as a general rule.")
Instead of trying to create a general structure for accessing records via multiple keys at once, I would suggest trying to think of this on more of a case-by-case basis.
To take your example, one way to interpret it is that you have multiple customers, and those customers each may have multiple user accounts. A natural hierarchy for this would be:
/customer/x/user/y
Often an elegant decision like this can be made, that not only solves the problem but also documents your data-model in a way that someone can easily see that users belong to customers via a 1-to-many relationship.
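A minimal sketch of building such a hierarchical path, assuming the ids are opaque strings that may contain reserved characters. The helper name `user_path` is hypothetical; `urllib.parse.quote` is from the Python standard library.

```python
from urllib.parse import quote

def user_path(customer_id, user_id):
    """Build /customer/x/user/y, percent-encoding each id as one path segment."""
    # safe="" means "/", ":" and ";" inside an id are encoded too,
    # so an id can never be mistaken for extra path structure.
    customer = quote(str(customer_id), safe="")
    user = quote(str(user_id), safe="")
    return "/customer/{}/user/{}".format(customer, user)

print(user_path(1, "aabc"))
print(user_path("a/b", "x;y"))
```

Encoding each segment separately keeps the separator question out of the URL design: the hierarchy carries the structure, and the ids are just values.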

Designing REST API

Currently, I have this resource:
GET /orders/{orderNumber}/{provisionId}/{taxYear}/docs
This returns the given order's documents. An Order is identified by three numbers: orderNumber, provisionId and taxYear. That is the primary key in the database.
I think this is a bad resource design and I want to change it, instead of using a different path param for each part of the composite primary key.
Is there a standard way to model this kind of resource? I don't know how to manage entities that have a composite id.
I have thought of doing this:
GET /orders/{orderNumber,provisionId,taxYear}/docs
This would be one path param for the order identifier, and the server would split it to obtain each part.
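To make the idea concrete, the server-side split could look roughly like this sketch. The names `OrderKey` and `parse_order_key` are hypothetical, and it assumes all three parts are integers separated by commas.

```python
from typing import NamedTuple

class OrderKey(NamedTuple):
    order_number: int
    provision_id: int
    tax_year: int

def parse_order_key(segment):
    """Split a path segment such as "1234,1054,2015" into its three parts."""
    parts = segment.split(",")
    if len(parts) != 3:
        raise ValueError("expected orderNumber,provisionId,taxYear")
    # int() raises ValueError as well if a part is not numeric.
    return OrderKey(*(int(part) for part in parts))

key = parse_order_key("1234,1054,2015")
print(key)
```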
Another choice I have thought of is query params:
GET /orders/docs?orderNumber=1234&provisionId=1054&taxYear=2015
But I think the last one wouldn't be semantically correct in a REST architecture, since in this case the query params are required and are not "search filter" params.
Is there any standard to do this? Which is the better choice?
Thanks

Auto generation of Sequence Numbers for nodes in Cypher REST API - Neo4J

I have a requirement to auto-generate sequence numbers when inserting nodes into a Neo4j db. This sequence # will act as an id for the node and can be used to generate external URLs to access that node directly from the UI.
This is similar to the auto-generated sequence property in MySQL. How can we do this in Neo4j via Cypher? I did some research and found these links
Generating friendly id sequence in Neo4j
http://neo4j.com/api_docs//1.9.M05/org/neo4j/graphdb/Transaction.html
However, these links are useful when I'm doing this programmatically in transactional mode; in my case it's all done through the Cypher REST API.
Pls advise.
Thanks,
Deepesh
You can use MERGE to mimic sequences:
MERGE (s:Sequence {name:'mysequenceName'})
ON CREATE SET s.current = 0
ON MATCH SET s.current = s.current + 1
WITH s.current as sequenceCounter
MATCH .... <-- your statement continues here
If your unique ID does not need to be numeric nor sequential, you can just generate and use a GUID whenever you want to create a node. You have to do this programmatically, and you should pass the value as a parameter, but there should be good libraries for GUID generation in all languages and for all platforms.
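A minimal sketch of the GUID approach, using Python's standard `uuid` module. The `Item` label, the property names and the helper name are hypothetical, and the placeholder style is an assumption: `$id` is used here, while older Neo4j versions write parameters as `{id}`.

```python
import uuid

def build_create_statement(name):
    """Return a Cypher statement plus its parameters, with a fresh GUID as id."""
    params = {
        "id": str(uuid.uuid4()),  # random version-4 UUID, generated client-side
        "name": name,
    }
    # The id is passed as a parameter, never concatenated into the query text.
    query = "CREATE (n:Item {id: $id, name: $name}) RETURN n.id"
    return query, params

query, params = build_create_statement("example")
print(query)
print(params["id"])
```

The query and parameters would then be sent through whichever Cypher endpoint or client you already use.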

How to get a list of aggregates using JOliver's CommonDomain and EventStore?

The repository in the CommonDomain only exposes the "GetById()". So what to do if my Handler needs a list of Customers for example?
Taking your question at face value, if you needed to perform operations on multiple aggregates, you would just provide the IDs of each aggregate in your command (which the client would obtain from the query side), and then get each aggregate from the repository.
However, looking at one of your comments in response to another answer I see what you are actually referring to is set based validation.
This very question has raised quite a lot of debate about how to do this, and Greg Young has written a blog post on it.
The classic question is 'how do I check that the username hasn't already been used when processing my 'CreateUserCommand'. I believe the suggested approach is to assume that the client has already done this check by asking the query side before issuing the command. When the user aggregate is created the UserCreatedEvent will be raised and handled by the query side. Here, the insert query will fail (either because of a check or unique constraint in the DB), and a compensating command would be issued, which would delete the newly created aggregate and perhaps email the user telling them the username is already taken.
The main point is, you assume that the client has done the check. I know this approach is difficult to grasp at first - but it's the nature of eventual consistency.
Also you might want to read this other question which is similar, and contains some wise words from Udi Dahan.
In the classic event sourcing model, queries like get all customers would be carried out by a separate query handler which listens to all events in the domain and builds a query model to satisfy the relevant questions.
If you need to query customers by last name, for instance, you could listen to all customer created and customer name change events and just update one table of last-name to customer-id pairs. You could hold other information relevant to the UI that is showing the data, or you could simply hold IDs and go to the repository for the relevant customers in order to work further with them.
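A minimal sketch of such a query-side projection, written in plain Python rather than against CommonDomain or EventStore. The event shapes (`type`, `customer_id`, `last_name`) and the class name are assumptions made for illustration only.

```python
from collections import defaultdict

class LastNameIndex:
    """Read model: maps a last name to the ids of customers who have it."""

    def __init__(self):
        self.ids_by_last_name = defaultdict(set)
        self.last_name_by_id = {}

    def handle(self, event):
        # Only the two event types that affect this view are processed.
        if event["type"] not in ("CustomerCreated", "CustomerNameChanged"):
            return
        customer_id = event["customer_id"]
        old_name = self.last_name_by_id.get(customer_id)
        if old_name is not None:
            # Remove the stale pair before recording the new name.
            self.ids_by_last_name[old_name].discard(customer_id)
        self.ids_by_last_name[event["last_name"]].add(customer_id)
        self.last_name_by_id[customer_id] = event["last_name"]

    def find(self, last_name):
        return sorted(self.ids_by_last_name.get(last_name, ()))

index = LastNameIndex()
for event in [
    {"type": "CustomerCreated", "customer_id": "c1", "last_name": "Smith"},
    {"type": "CustomerCreated", "customer_id": "c2", "last_name": "Smith"},
    {"type": "CustomerNameChanged", "customer_id": "c1", "last_name": "Jones"},
]:
    index.handle(event)

print(index.find("Smith"), index.find("Jones"))
```

The ids returned by `find` are what a client would put into its commands, or use to load the relevant aggregates from the repository.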
You don't need a list of customers in your handler. Each aggregate MUST be processed in its own transaction. If you want to show this list to the user, just build an appropriate view.
Your command needs to contain the id of the aggregate root it should operate on.
This id will be looked up by the client sending the command using a view in your readmodel. This view will be populated with data from the events that your AR emits.

Cassandra get_range_slices

I am new to Cassandra and I am having some difficulties fetching data.
I looked into the function:
list<KeySlice> get_range_slices(column_parent, predicate, range, consistency_level)
But, I do not understand what the column_parent is supposed to be.
Does anybody have any idea?
Thanx,
Granit
column_parent is basically used to indicate the ColumnFamily (though in rare cases it can indicate a supercolumn). In Java you would put new ColumnParent("Posts") there. But there should be one more parameter, for the namespace, in a get_range_slices query, so I guess you are not using Thrift directly but a client API. In that case you should check your client's documentation.
Edit:
The definition of ColumnParent in the Cassandra API:
The ColumnParent is the path to the parent of a particular set of Columns. It is used when selecting groups of columns from the same ColumnFamily. In directory structure terms, imagine ColumnParent as ColumnPath + '/../'.
Frail is correct, but the real answer is "don't use raw Thrift, use one of the clients from http://wiki.apache.org/cassandra/ClientOptions instead."