I am using ScalarDB which provides ACID functionality on Cassandra. Does ScalarDB support pagination?
If a partition has say 100 records, can I query 10 records at a time with each query starting from where the previous one left?
No, Scalar DB doesn't support pagination.
It has to be done in an application side.
Related
I'm new to Mongodb.
I'd like to understand if it's possible or not to achieve full isolation for reads and updates.
For example, I have the following flow:
Count documents from a collection based on some filter.
Based on the result, update another document.
So basically I want to understand how to prevent a concurrent write operation to enter between 1 & 2 .
BTW I'm using the java driver 3.12
Thanks
Shira
Latest versions of mongoDB has transactions features, maybe you can check that.
https://docs.mongodb.com/manual/core/transactions/
How is aggregation achieved with dynamodb? Mongodb and couchbase have map reduce support.
Lets say we are building a tech blog where users can post articles. And say articles can be tagged.
user
{
id : 1235,
name : "John",
...
}
article
{
id : 789,
title: "dynamodb use cases",
author : 12345 //userid
tags : ["dynamodb","aws","nosql","document database"]
}
In the user interface we want to show for the current user tags and the respective count.
How to achieve the following aggregation?
{
userid : 12,
tag_stats:{
"dynamodb" : 3,
"nosql" : 8
}
}
We will provide this data through a rest api and it will be frequently called. Like this information is shown in the app main page.
I can think of extracting all documents and doing aggregation at the application level. But I feel my read capacity units will be exhausted
Can use tools like EMR, redshift, bigquery, aws lambda. But I think these are for datawarehousing purpose.
I would like to know other and better ways of achieving the same.
How are people achieving dynamic simple queries like these having chosen dynamodb as primary data store considering cost and response time.
Long story short: Dynamo does not support this. It's not build for this use-case. It's intended for quick data access with low-latency. It simply does not support any aggregating functionality.
You have three main options:
Export DynamoDB data to Redshift or EMR Hive. Then you can execute SQL queries on a stale data. The benefit of this approach is that it consumes RCUs just once, but you will stick with outdated data.
Use DynamoDB connector for Hive and directly query DynamoDB. Again you can write arbitrary SQL queries, but in this case it will access data in DynamoDB directly. The downside is that it will consume read capacity on every query you do.
Maintain aggregated data in a separate table using DynamoDB streams. For example you can have a table UserId as a partition key and a nested map with tags and counts as an attribute. On every update in your original data DynamoDB streams will execute a Lambda function or some code on your hosts to update aggregate table. This is the most cost efficient method, but you will need to implement additional code for each new query.
Of course you can extract data at the application level and aggregate it there, but I would not recommend to do it. Unless you have a small table you will need to think about throttling, using just part of provisioned capacity (you want to consume, say, 20% of your RCUs for aggregation and not 100%), and how to distribute your work among multiple workers.
Both Redshift and Hive already know how to do this. Redshift relies on multiple worker nodes when it executes a query, while Hive is based on top of Map-Reduce. Also, both Redshift and Hive can use predefined percentage of your RCUs throughput.
Dynamodb is pure key/value storage and does not support aggregation out of the box.
If you really want to do aggregation using DynamoDB here some hints.
For you particular case lets have table named articles.
To do aggregation we need an extra table user-stats holding userId and tag_starts.
Enabled DynamoDB streams on table articles
Create a new lambda function user-stats-aggregate which is subscribed to articles DynamoDB stream and received OLD_NEW_IMAGES on every create/update/delete operation over articles table.
Lambda will perform following logic
If there is no old image, get current tags and increase by 1 every occurrence in the db for this user. (Keep in mind there could be the case there is no initial record in user-stats this user)
If there is old image see if tag was added or removed and apply change +1 or -1 depending on the case for each affected tag for received user.
Stand an API service retrieving these user stats.
Usually aggregation in DynamoDB could be done using DynamoDB streams , lambdas for doing aggregation and extra tables keeping aggregated results with different granularity.(minutes, hours, days, years ...)
This brings near realtime aggregation without need to do it on the fly per every request, you query on aggregated data.
Basic aggregation can be done using scan() and query() in lambda.
I am using Couchbase with Spring Data and wish to implement bulkGet of Couchbase. Please let me know the following:
Is it possible via Spring Data?
If yes, can you share an example?
Is findAll (using _all view) comparable to bulkGet in terms of performance?
Can I fetch the _id along with the Couchbase document?
Environment:- Couchbase 4.0, Spring Data 2.0.0.RELEASE, Java 8.
Thanks in Advance!
I assume you are asking about a bulk get in the context of repositories.
First, there is currently no complete support of a "bulkGet" in Spring Data Couchbase. Most of the implementation is based on the SDK synchronous API, and bulk get is something usually done using the asynchronous API, using RxJava.
Note that there is no actual "bulkGet" operation at the protocol level in Couchbase, it's just the SDK issuing multiple single Get and batching them together.
To answer your second question, the above is important. The bulk get pattern discussed in the Couchbase Java SDK documentation (here) gives a slight performance boost because unlike in synchronous mode, we don't wait for the retrieval of one item to get the next.
The findAll() and findAll(Iterable) methods in Spring Data Couchbase both operate on top of a view, which allows to only retrieve documents that match the entity type of your repository but introduces a level of indirection that can lower performance compared to a pure sequence of key/value gets.
So the closest you could get to a bulk operation like that in Spring Data Couchbase would be to know all the IDs you're interested in and then perform a findOne per ID.
In the near term, the code behind the findAll(Iterable) signature could maybe be improved by applying a bulk get pattern on all provided IDs, but that would mean forgetting about the type checking induced by the view, so I'm not sure...
I have been studying NoSQL and Hadoop for Data Warehousing however I never worked with this technologies before and I would like to inquire if this following is possible to check if I got my understanding of this technologies right.
If I have my data stored in MongoDB, can I use Hadoop with Hive to make Hiveql queries directly to MongoDB and store the output of those queries as views back in MongoDB again, instead of the HDFS?
Also If I understand correctly most of the NoSQL databases don't support joins and aggregates, but it's possible to make them through map-reduce. If HiveQL queries are map-reduce jobs when I do a join in HiveQL would it already be automatically "joining" the MongoDB data in map-reduce for me, with no need to be worried about the lack of support for joins and aggregates in MongoDB?
MongoDB does have very good support for Aggregation kind of functions. There are no joins of-course. The way MongoDB Schema is usually designed is such that you would typically not need a join.
HiveQL operates on 'Tables' in HDFS. That's the default behavior.
But you have a MongoDB-Hadoop Connector: http://docs.mongodb.org/ecosystem/tools/hadoop/
which will let you query MongoDB data from within Hadoop.
To use Map Reduce you can do that with MongoDB itself (without Hadoop).
See this: http://docs.mongodb.org/manual/core/map-reduce/
What I would like to do is to make a query against my Cassandra "table" and get not only the current matching data but any future data that's added.
I have an application where data is constantly added to the "table" and I have many "clients" that are interested in getting this data.
So the initial result of the query would be the current data that matches the client's query and then I would like ongoing data to be received as they are added. Each client may be making a different query.
I would prefer to have a callback registered with a query so that I receive the data w/o having to poll.
Is this even possible with Cassandra?
Thank you.
P.S. From my reading, it seems MongoDB does support this feature.
You can't do this in Cassandra at present, but the new triggers feature coming in Cassandra 2.0 may do what you need. It's only going to be experimental when 2.0 comes out (soon).
MongoDB does indeed have a feature that might fit the bill. It's called a "tailable cursor" and can only be used on a capped collection, i.e. a collection that works like a ring buffer and "forgets" old data. After the tailable cursor has exhausted the entire collection the next read attempt will block until new data becomes available.
You can convert this into a callback pattern easily by implementing a reader thread with which the rest of the application can register its callbacks.