I'm developing a mechanism for Cassandra using Hector.
What I need at the moment is to know the hash values of the keys, so I can look up which node each key is stored on (by looking at the tokens of each node) and ask that node directly for the value. What I understood is that, depending on the partitioner Cassandra uses, the values are stored differently from one partitioner to another. So, are the hash values of all keys stored in any table? If not, how could I implement a generic class so that, once I read from the system keyspace which partitioner Cassandra is using, this class could be an instance of it, without having to modify the code depending on the partitioner? I would need it to call the getToken method to calculate the hash value for a given key.
Hector's CqlQuery is poorly supported and buggy. You should use the native Java CQL driver instead: https://github.com/datastax/java-driver
You could just reuse the partitioners defined in Cassandra: https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/dht and then use the token ranges to do the routing.
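For illustration, here is a minimal sketch (not tied to a specific Cassandra version) of how the partitioner class name read from the system keyspace could be instantiated reflectively, so the code does not need to change when the cluster is configured with a different partitioner; the sample key, the "instance" field lookup, and the constructor fallback are assumptions:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

import org.apache.cassandra.dht.IPartitioner;
import org.apache.cassandra.dht.Token;

public class PartitionerLoader {

    // partitionerClassName is the value read from the system keyspace,
    // e.g. "org.apache.cassandra.dht.Murmur3Partitioner"
    public static IPartitioner load(String partitionerClassName) throws Exception {
        Class<?> clazz = Class.forName(partitionerClassName);
        try {
            // most partitioners expose a static singleton field named "instance"
            return (IPartitioner) clazz.getField("instance").get(null);
        } catch (NoSuchFieldException e) {
            // fall back to a no-arg constructor
            return (IPartitioner) clazz.getDeclaredConstructor().newInstance();
        }
    }

    public static void main(String[] args) throws Exception {
        IPartitioner partitioner = load("org.apache.cassandra.dht.Murmur3Partitioner");
        Token token = partitioner.getToken(
                ByteBuffer.wrap("myRowKey".getBytes(StandardCharsets.UTF_8)));
        System.out.println("Token for key: " + token);
    }
}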
The CQL driver offers token-aware routing out of the box. I would use that instead of trying to reinvent the wheel in Hector, especially since Hector uses the legacy Thrift API instead of CQL.
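As a rough sketch (assuming the 3.x DataStax java-driver, where this wrapping is in fact the default load balancing policy), enabling token-aware routing explicitly looks like this; the contact point and keyspace name are placeholders:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class TokenAwareExample {
    public static void main(String[] args) {
        // the driver hashes the partition key itself and routes each request
        // to a replica that owns the corresponding token
        Cluster cluster = Cluster.builder()
                .addContactPoint("127.0.0.1")
                .withLoadBalancingPolicy(
                        new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                .build();
        try (Session session = cluster.connect("myKeyspace")) {
            // run queries as usual; routing happens transparently
        } finally {
            cluster.close();
        }
    }
}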
Finally, after testing different implementations, I found a way to get the partitioner using the following code:
CqlQuery<String, String, String> cqlQuery = new CqlQuery<String, String, String>(
        ksp, StringSerializer.get(), StringSerializer.get(), StringSerializer.get());
cqlQuery.setQuery("select partitioner from local");
QueryResult<CqlRows<String, String, String>> result = cqlQuery.execute();
CqlRows<String, String, String> rows = result.get();
for (int i = 0; i < rows.getCount(); i++) {
    RowImpl<String, String, String> row =
            (RowImpl<String, String, String>) rows.getList().get(i);
    List<HColumn<String, String>> columns = row.getColumnSlice().getColumns();
    for (HColumn<String, String> c : columns) {
        System.out.println(c.getValue());
    }
}
I'm new to Cassandra and Scala. I'm working on a Kafka consumer (written in Scala) that has to update a field of a row in Cassandra with data it receives.
And so far, no problem.
In this row, one field is a String list, and when I do the update this field must not change, so I have to assign the same String list to itself.
UPDATE keyspaceName.tableName
SET fieldToChange = newValue
WHERE id = idValue
AND fieldA = '${currentRow.getString("fieldA")}'
AND fieldB = ${currentRow.getInt("fieldB")}
...
AND fieldX = ${currentRow.getList("fieldX", classOf[String]).toString}
...
But I receive the following exception:
com.datastax.driver.core.exceptions.SyntaxError: line 19:49 no viable alternative at input ']' (... 482 AND fieldX = [[listStringItem1]]...)
So far I haven't found anything on the web that could help me.
The problem is that Scala's string representation of the list doesn't match Cassandra's representation of the list, so it generates errors.
Instead of constructing the CQL statement directly in your code, it's better to use a PreparedStatement and bind variables to it:
first, it will speed up execution, as Cassandra won't parse every statement separately;
second, it will be easier to bind variables, as you won't need to care about the corresponding string representation.
But be very careful with Scala: the Java driver expects Java lists, sets, maps, and base types such as ints, etc. You may look at the java-driver-scala-extras package, but you'll need to compile it yourself, as it's not available on Maven Central.
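To make that concrete, here is a hedged sketch with the DataStax Java driver (3.x), shown in Java for brevity; the keyspace, table, and column names are placeholders loosely based on the question, and whether the list column belongs in the SET clause depends on the real schema. The point is only that plain Java values, including the java.util.List, are bound directly and the driver handles the CQL encoding:

import java.util.Arrays;
import java.util.List;

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class PreparedUpdateSketch {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect()) {

            // prepared once: Cassandra parses the statement a single time,
            // later executions only send the bound values
            PreparedStatement ps = session.prepare(
                "UPDATE keyspaceName.tableName SET fieldToChange = ?, fieldX = ? WHERE id = ?");

            // bind plain Java objects; no manual string formatting of the list
            List<String> unchangedList = Arrays.asList("listStringItem1", "listStringItem2");
            session.execute(ps.bind("newValue", unchangedList, 42L));
        }
    }
}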
I will have C* tables that will be very wide. To prevent them from becoming too wide, I have come across a strategy that could suit me well. It was presented in this video:
Bucket Your Partitions Wisely
The good thing about this strategy is that there is no need for a "look-up table" (it is fast); the bad part is that one needs to know the maximum number of buckets and could eventually end up with no more buckets to use (not scalable). I know my maximum bucket size, so I will try this.
By calculating a hash from the table's primary keys, this hash can be used as a bucket part together with the rest of the primary key.
I have come up with the following method to be sure (I think?) that the hash will always be the same for a specific primary key.
Using Guava Hashing:
public static String bucket(List<String> primKeyParts, int maxBuckets) {
    StringBuilder combinedHashString = new StringBuilder();
    primKeyParts.forEach(part -> {
        combinedHashString.append(
                String.valueOf(
                        Hashing.consistentHash(
                                Hashing.sha512().hashBytes(part.getBytes()), maxBuckets)));
    });
    return combinedHashString.toString();
}
The reason I use SHA-512 is to be able to handle strings with a max of 256 characters (512 bits); otherwise the result will never be the same (as it seems, according to my tests).
I am far from being a hashing guru, hence I'm asking the following questions.
Requirement: between different JVM executions on different nodes/machines, the result must always be the same for a given Cassandra primary key.
1) Can I rely on the mentioned method to do the job?
2) Is there a better solution for hashing large strings so that they always produce the same result for a given string?
3) Do I always need to hash from a string, or could there be a better way of doing this for a C* primary key that always produces the same result?
Please note, I don't want to discuss data modeling for a specific table; I just want a bucketing strategy.
EDIT:
I elaborated further and came up with this, so the length of the string can be arbitrary. What do you say about this one?
public static int murmur3_128_bucket(int maxBuckets, String... primKeyParts) {
    List<HashCode> hashCodes = new ArrayList<>();
    for (String part : primKeyParts) {
        hashCodes.add(Hashing.murmur3_128().hashString(part, StandardCharsets.UTF_8));
    }
    return Hashing.consistentHash(Hashing.combineOrdered(hashCodes), maxBuckets);
}
I currently use a similar solution in production. So for your method I would change to:
public static int bucket(List<String> primKeyParts, int maxBuckets) {
    String keyParts = String.join("", primKeyParts);
    return Hashing.consistentHash(
            Hashing.murmur3_32().hashString(keyParts, Charsets.UTF_8),
            maxBuckets);
}
So the differences:
Send all the PK parts into the hash function at once.
We actually set the max buckets as a code constant, since consistent hashing only holds if the max buckets stay the same.
We use the Murmur3 hash since we want it to be fast, not cryptographically strong.
For your direct questions: 1) Yes, the method should do the job. 2) I think with the tweaks above you should be set. 3) The assumption is that you need the whole PK?
I'm not sure you need to use the whole primary key, since the expectation is that the partition part of your primary key is going to be the same for many things, which is why you are bucketing. You could just hash the bits that will provide you with good buckets to use in your partition key. In our case, we just hash some of the clustering key parts of the PK to generate the bucket id we use as part of the partition key.
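For completeness, a hypothetical usage sketch of the bucket method above (the key parts, the bucket count of 32, and the table layout in the comment are all made up):

// derive the bucket from a couple of clustering-key parts; maxBuckets stays a
// fixed code constant because consistent hashing only holds while it is unchanged
int bucketId = bucket(Arrays.asList("customer42", "2017-06"), 32);

// the bucket id then becomes one component of the partition key, e.g.:
// PRIMARY KEY ((customer_id, bucket_id), event_time)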
I have grouped all my customers in a JavaPairRDD<Long, Iterable<ProductBean>> by their customerId (of Long type). This means every customerId has a List of ProductBean.
Now I want to save all ProductBean objects to the DB irrespective of customerId. I got all the values by using the method:
JavaRDD<Iterable<ProductBean>> values = custGroupRDD.values();
Now I want to convert JavaRDD<Iterable<ProductBean>> to JavaPairRDD<Object, BSONObject> so that I can save it to Mongo. Remember, every BSONObject is made from a single ProductBean.
I am not getting any idea of how to do this in Spark, I mean which Spark transformation should be used for this job. I think this task is some kind of flattening of all the values out of the Iterable. Please let me know how this is possible.
Any hint in Scala or Python is also OK.
You can use the flatMapValues function:
JavaPairRDD<Long, ProductBean> result = custGroupRDD.flatMapValues(v -> v);
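If the goal is the (Object, BSONObject) shape mentioned in the question, a rough follow-up sketch could look like this (BasicBSONObject from the bson library, the "product" field name, and the toString() serialization of ProductBean are assumptions; the exact flatMapValues signature also varies between Spark versions):

import org.apache.spark.api.java.JavaPairRDD;
import org.bson.BSONObject;
import org.bson.BasicBSONObject;
import scala.Tuple2;

JavaPairRDD<Object, BSONObject> mongoReady = custGroupRDD
        .flatMapValues(v -> v)                        // one (customerId, ProductBean) pair per bean
        .mapToPair(pair -> {
            BSONObject doc = new BasicBSONObject();
            doc.put("customerId", pair._1());
            doc.put("product", pair._2().toString()); // placeholder serialization
            return new Tuple2<Object, BSONObject>(null, doc);
        });
// the result can then be written out with the MongoDB Hadoop connector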
I have the following column family:
CREATE COLUMN FAMILY messages with comparator=DateType and key_validation_class=UTF8Type and default_validation_class=UTF8Type;
Now I'm using the Cassandra Thrift API to store new data:
TTransport tr = this.dataSource.getConnection();
TProtocol proto = new TBinaryProtocol(tr);
Cassandra.Client client = new Cassandra.Client(proto);
long timestamp = System.currentTimeMillis();
client.set_keyspace("myTestSPace");
ColumnParent parent = new ColumnParent("messages");
Column messageColumn = new Column();
String time = String.valueOf(timestamp);
messageColumn.setName(time.getBytes());
messageColumn.setValue(toByteBuffer(msg.getText()));
messageColumn.setTimestamp(timestamp);
client.insert(toByteBuffer(msg.getAuthor()), parent, messageColumn, ConsistencyLevel.ONE);
But I'm getting an exception:
InvalidRequestException(why:Expected 8 or 0 byte long for date (16))
at org.apache.cassandra.thrift.Cassandra$insert_result.read(Cassandra.java:15198)
at org.apache.cassandra.thrift.Cassandra$Client.recv_insert(Cassandra.java:858)
at org.apache.cassandra.thrift.Cassandra$Client.insert(Cassandra.java:830)
at com.vanilla.cassandra.DaoService.addMessage(DaoService.java:57)
How do I do this correctly?
It appears that you're using the raw Thrift interface. For reasons like the one you've encountered and many, many more, I strongly suggest that you use an existing high-level client like Hector, Astyanax, or a CQL client.
The root cause of your issue is that you have to pack different datatypes into a binary format. The higher level clients manage this automatically.
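For reference, a minimal sketch of what the raw Thrift call would need if the comparator stays DateType: the column name has to be the 8-byte encoding of a millisecond timestamp, not the string form of the number (continuing the variable names from the question's snippet):

// DateType expects exactly 8 bytes: encode the millisecond timestamp as a long
ByteBuffer columnName = ByteBuffer.allocate(8).putLong(0, timestamp);
messageColumn.setName(columnName);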
How do I list all row keys in an hbase table?
I need to do this using PHP with a REST interface.
If you are listing all of the keys in an HBase table, then you are using the wrong tool. HBase is for large data systems where it is impractical to list all of the keys.
What may be more sensible is to start at a given key and list the next N keys (for values of N less than 10K). There are nice Java interfaces for doing this type of thing with a scan, setting a start key and/or an end key; see the sketch below.
Most HBase functionality is exposed via the Thrift interface. I would suggest looking there.
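Here is the kind of scan described above, as a small sketch with the Java HBase client (the table name, start key, and the N limit are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanNextKeys {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "tablename");

        Scan scan = new Scan(Bytes.toBytes("startKey")); // begin at this row key
        scan.setCaching(1000);                           // fetch rows in batches

        int n = 0;
        ResultScanner scanner = table.getScanner(scan);
        for (Result r : scanner) {
            System.out.println(Bytes.toString(r.getRow()));
            if (++n >= 10000) break;                     // only list the next N keys
        }
        scanner.close();
        table.close();
    }
}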
I have found a way:
http://localhost:8080/tablename/* will return XML data and I can preg-match it to get the rows.
Inviting better suggestions.
This...
http://localhost:8080/tablename/*/columnfamily:columnid
...will return all values in your table relative to that column in that table, sort of like applying a column filter in the scanner.
Also, if you're looking for multiple columns - separate them with a comma.
So: /tablename/*/columnfamily:columnid,columnfamily:columnid2
I don't know what the REST interface is like, but you probably want to filter some data out before it reaches the client to avoid large RPC responses. You can do this by adding server-side filters to your scan:
Scan s = new Scan();
FilterList fl = new FilterList();
// return only the first KeyValue of each row, then skip to the next row
fl.addFilter(new FirstKeyOnlyFilter());
// only return the key, don't return the value
fl.addFilter(new KeyOnlyFilter());
s.setFilter(fl);

HTable myTable = new HTable(HBaseConfiguration.create(), "tablename"); // your table
ResultScanner rs = myTable.getScanner(s);
for (Result row = rs.next(); row != null; row = rs.next()) {
    // row.getRow() is the row key
}
rs.close();
http://svn.apache.org/repos/asf/hbase/branches/0.90/src/main/java/org/apache/hadoop/hbase/filter/