How is aggregation achieved with DynamoDB? MongoDB and Couchbase have map-reduce support.
Let's say we are building a tech blog where users can post articles, and articles can be tagged.
user
{
id : 1235,
name : "John",
...
}
article
{
id : 789,
title: "dynamodb use cases",
author : 12345, // userid
tags : ["dynamodb","aws","nosql","document database"]
}
In the user interface we want to show, for the current user, each tag and its respective count.
How to achieve the following aggregation?
{
userid : 12,
tag_stats:{
"dynamodb" : 3,
"nosql" : 8
}
}
We will provide this data through a REST API and it will be called frequently, since this information is shown on the app's main page.
I can think of extracting all the documents and doing the aggregation at the application level, but I am afraid my read capacity units will be exhausted.
I could use tools like EMR, Redshift, BigQuery, or AWS Lambda, but I think these are meant for data warehousing.
I would like to know of other, better ways of achieving this.
How are people handling simple dynamic queries like these after choosing DynamoDB as their primary data store, considering cost and response time?
Long story short: DynamoDB does not support this. It's not built for this use case; it's intended for quick, low-latency data access. It simply does not provide any aggregation functionality.
You have three main options:
Export DynamoDB data to Redshift or EMR Hive. Then you can execute SQL queries on stale data. The benefit of this approach is that it consumes RCUs just once, but you will be working with outdated data.
Use the DynamoDB connector for Hive and query DynamoDB directly. Again you can write arbitrary SQL queries, but in this case they access data in DynamoDB directly. The downside is that every query consumes read capacity.
Maintain aggregated data in a separate table using DynamoDB Streams. For example, you can have a table with UserId as the partition key and a nested map of tags and counts as an attribute. On every update to your original data, DynamoDB Streams will trigger a Lambda function or some code on your hosts to update the aggregate table. This is the most cost-efficient method, but you will need to implement additional code for each new query.
Of course you can extract the data at the application level and aggregate it there, but I would not recommend it. Unless you have a small table, you will need to think about throttling, using just part of the provisioned capacity (you want to consume, say, 20% of your RCUs for aggregation, not 100%), and how to distribute the work among multiple workers.
Both Redshift and Hive already know how to do this. Redshift relies on multiple worker nodes when it executes a query, while Hive is built on top of MapReduce. Also, both Redshift and Hive can use a predefined percentage of your RCU throughput.
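If you do decide to aggregate at the application level despite the caveats above, here is a minimal sketch of a self-throttled Scan with Python/boto3 (the table name, provisioned capacity, the 20% target, and the attribute names are assumptions):

import time
import boto3

TABLE_NAME = "article"      # assumed table name
PROVISIONED_RCU = 100       # assumed provisioned read capacity
TARGET_RATIO = 0.2          # aim to consume ~20% of RCUs for aggregation

dynamodb = boto3.client("dynamodb")

def aggregate_tags():
    """Scan the whole table and count tags per user, throttling ourselves."""
    counts = {}   # {user_id: {tag: count}}
    kwargs = {"TableName": TABLE_NAME, "ReturnConsumedCapacity": "TOTAL"}
    while True:
        page = dynamodb.scan(**kwargs)
        for item in page["Items"]:
            user = item["author"]["N"]
            # Assuming tags is stored as a string set (SS); adjust if it is a list (L).
            for tag in item.get("tags", {}).get("SS", []):
                counts.setdefault(user, {}).setdefault(tag, 0)
                counts[user][tag] += 1
        # Sleep so average consumption stays near TARGET_RATIO of the provisioned RCUs.
        consumed = page["ConsumedCapacity"]["CapacityUnits"]
        time.sleep(consumed / (PROVISIONED_RCU * TARGET_RATIO))
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return counts

Even with the throttling, every full aggregation still reads the whole table, which is why the streams-based option is usually the better fit for a frequently called endpoint.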
DynamoDB is a pure key-value store and does not support aggregation out of the box.
If you really want to do aggregation with DynamoDB, here are some hints.
For your particular case, let's have a table named articles.
To do the aggregation we need an extra table user-stats holding userId and tag_stats.
Enable DynamoDB Streams on the articles table.
Create a new Lambda function user-stats-aggregate that is subscribed to the articles DynamoDB stream and receives NEW_AND_OLD_IMAGES on every create/update/delete operation on the articles table.
The Lambda will perform the following logic (a minimal sketch of such a handler appears after these steps):
If there is no old image, take the current tags and increase the count for each of them by 1 for this user. (Keep in mind there may be no initial record in user-stats for this user.)
If there is an old image, check which tags were added or removed and apply a +1 or -1 change for each affected tag for that user.
Stand up an API service that retrieves these user stats.
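A minimal sketch of such a handler in Python (table and attribute names like user-stats, userId, tag, cnt, author and tags are assumptions based on the example above; for simplicity it keeps one item per (userId, tag) pair instead of a nested tag_stats map, since the ADD action is easiest to use on a top-level number attribute):

import boto3

stats_table = boto3.resource("dynamodb").Table("user-stats")  # assumed aggregate table

def handler(event, context):
    for record in event["Records"]:
        images = record["dynamodb"]
        old = images.get("OldImage", {})
        new = images.get("NewImage", {})

        # Stream images arrive in DynamoDB JSON; assuming tags is a string set (SS).
        old_tags = set(old.get("tags", {}).get("SS", []))
        new_tags = set(new.get("tags", {}).get("SS", []))
        user_id = (new or old)["author"]["N"]

        deltas = {tag: 1 for tag in new_tags - old_tags}
        deltas.update({tag: -1 for tag in old_tags - new_tags})

        for tag, delta in deltas.items():
            # ADD creates the item/attribute if it does not exist yet, so the
            # "no initial record in user-stats" case is handled automatically.
            stats_table.update_item(
                Key={"userId": user_id, "tag": tag},
                UpdateExpression="ADD cnt :d",
                ExpressionAttributeValues={":d": delta},
            )

The API service can then Query user-stats by userId and assemble the tag_stats map from the returned items.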
Usually, aggregation in DynamoDB can be done using DynamoDB Streams, Lambdas that perform the aggregation, and extra tables keeping aggregated results at different granularities (minutes, hours, days, years, ...).
This gives you near-real-time aggregation without having to compute it on the fly for every request; you query the pre-aggregated data instead.
Basic aggregation can also be done using scan() and query() inside a Lambda.
I have been evaluating migration of our datastore from MongoDB to DynamoDB, since it is a well established AWS service.
However, I am not sure the DynamoDB data model is robust enough to support our use cases. I understand that DynamoDB added document support in 2014, but the examples I have seen do not seem to address queries that work across documents and do not specify a value for the partition key.
For instance if I have a document containing employee info,
{
"name": "John Doe",
"department": "sales",
"date_of_joining": "2017-01-21"
}
and I need to make a query like "give me all the employees who joined after 01-01-2016", then I can't do it with this schema.
I might be able to make this query by creating a secondary index that has a randomly generated partition key (say 0-99) and a sort key on "date_of_joining", then querying all the partitions with a condition on "date_of_joining". But this is too complex a way to do a simple query; doing something like this in MongoDB is quite straightforward.
Can someone help me understand whether there is a better way to do such queries in DynamoDB, and whether DynamoDB is really suited for such use cases?
Actually, the partition key of a GSI need not be unique. You can have date_of_joining as the partition key of a GSI.
However, when you query the partition key, you cannot use greater-than on the partition key field; only equality is supported for the partition key. I am not sure why you wanted a random number as the GSI partition key and date_of_joining as the sort key. Even with that design, I don't think you will be able to use the DynamoDB Query API to get the expected result. You may end up using the DynamoDB Scan API, which is a costly operation in DynamoDB.
GSI:
date_of_joining as the partition key
Supported in the Query API:
KeyConditionExpression : 'date_of_joining = :doj'
(If you have multiple items with the same DOJ, the query on the GSI will return multiple items.)
Not supported in the Query API:
KeyConditionExpression : 'date_of_joining > :doj'
Conclusion:
You need to use DynamoDB Scan. If you are going to use Scan, then the GSI may not be required; you can directly scan the main table using a FilterExpression (a sketch follows this list).
FilterExpression : 'date_of_joining > :doj'
Disadvantages:
Costly
Not efficient
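A minimal sketch of that Scan with boto3 (the table name "employee" and the pagination loop are assumptions on my part):

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("employee")  # assumed table name

def employees_joined_after(doj):
    items = []
    kwargs = {"FilterExpression": Attr("date_of_joining").gt(doj)}
    while True:
        # Scan must be paginated: each call returns at most 1 MB of data.
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            break
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
    return items

# ISO-formatted dates compare correctly as strings.
recent = employees_joined_after("2016-01-01")

Note that this still reads every item in the table, which is exactly the "costly, not efficient" point above.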
You might decide to support your range queries with an indexing backend. For example, you could stream your table updates in DynamoDB to AWS ElasticSearch with a Lambda function, and then query ES for records matching the range of join dates you choose.
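A hedged sketch of that pipeline with the elasticsearch Python client (7.x-style API; the endpoint, index name, and the unsigned connection are assumptions, and in practice you would sign requests to the Amazon ES domain, e.g. with requests-aws4auth):

from elasticsearch import Elasticsearch

es = Elasticsearch(["https://my-es-domain.example.com:443"])  # assumed endpoint

def handler(event, context):
    # Index every new or updated record from the DynamoDB stream into ES.
    for record in event["Records"]:
        new = record["dynamodb"].get("NewImage")
        if not new:
            continue
        doc = {
            "name": new["name"]["S"],
            "department": new["department"]["S"],
            "date_of_joining": new["date_of_joining"]["S"],
        }
        # Using "name" as the document id is only for illustration.
        es.index(index="employee", id=new["name"]["S"], body=doc)

The range query then runs against ES instead of DynamoDB:

es.search(index="employee",
          body={"query": {"range": {"date_of_joining": {"gt": "2016-01-01"}}}})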
After a painful day trying to figure out whether we should go with DynamoDB to store JSON documents (vs. Mongo) and reading through almost all the AWS documentation and online examples, I have decided to ask my questions here.
Ours is a Spring Boot Java application and we are using the aws-dynamodb SDK plugin. Our application has to manage a couple of thousand JSON documents and be able to retrieve them based on various conditions.
For example, imagine this is the JSON document:
{
"serial":"123123",
"feed":{
"ABC":{
"queue": "ABC",
"active": true
},
"XYZ" : {
"queue":"XYZ",
"active": false
}
}
}
These are the questions I have
Can I store this whole JSON document as a String attribute in a DynamoDB table and still be able to retrieve records based on the values of certain attributes inside the JSON, and if so, how?
For example, I would like to get all the items that have the feed ABC active.
How scalable is this solution?
I know I can do this very easily in Mongo but just couldn't get it working in DynamoDB.
First, if you aren't using DynamoDBMapper for talking to DynamoDB, you should consider using it, instead of low-level APIs, as it provides a more convenient higher-level abstraction.
Now, answers to your questions:
Instead of storing it as a String, consider using a Map. More information on supported data types can be found here. As for searching, there are two ways: Query (in which you need to provide the primary keys of the records you need) and Scan. For your example (i.e. 'all the items that have the feed ABC active'), you'd have to do a Scan, as you don't know the primary keys.
DynamoDB is highly scalable. Querying is efficient, but it looks like you'll be Scanning more. The latter has its limitations, as it literally goes through every record, but it should work fine for you since you'll only have a couple of thousand records. Do performance testing first, though.
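The answer above assumes DynamoDBMapper in Java; for brevity, here is the same idea sketched with Python/boto3 (the table name "device" and the "serial" partition key are assumptions), storing the document as a native Map and scanning on a nested attribute:

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("device")  # assumed table name

# Store the document from the question as a Map attribute rather than a JSON string.
table.put_item(Item={
    "serial": "123123",
    "feed": {
        "ABC": {"queue": "ABC", "active": True},
        "XYZ": {"queue": "XYZ", "active": False},
    },
})

# "All the items that have the feed ABC active": a Scan with a filter on the
# nested attribute, since the primary keys are not known up front.
response = table.scan(FilterExpression=Attr("feed.ABC.active").eq(True))
items = response["Items"]  # paginate with LastEvaluatedKey for larger tables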
What is the best way to query data in Azure Service Fabric? Is there something on top of the Reliable Dictionary? For example: map / reduce?
Example: a Customer object is stored in a reliable collection. The key is the CustomerID. Now I would like to find all customers whose surname starts with "A", who come from "Sydney", and who have ordered something within the last month.
What is the best way to implement this kind of query within Azure Service Fabric? What would the performance look like? Let's assume there are several hundred thousand customers in the collection.
Reliable Collections are IEnumerable (or support creating enumerators), similar to single-machine .NET collections, which means you can query them with LINQ. For example:
IReliableDictionary<int, Customer> myDictionary =
await this.StateManager.GetOrAddAsync<IReliableDictionary<int, Customer>>("mydictionary");
return from item in myDictionary
where item.Value.Name.StartsWith("A")
orderby item.Key
select new ResultView(...);
Performance depends on what kind of query you're doing, but one of the major benefits of Reliable Collections is that they are local to the code doing the querying, meaning you're reading directly from memory rather than making a network request to external storage.
With a very large data set you'll end up partitioning the service that contains the dictionary (a built-in feature of Service Fabric). In that case, you will run the same query in each partition and then aggregate the results. Typically it's good to also set up another dictionary that can act as an index for common queries.
I'm currently planning the development of a service that should handle a fair amount of requests and do some logging for each request.
Each log entry will have the following form:
{event: "EVENTTYPE", userid: "UID", itemid: "ITEMID", timestamp: DATETIME}
I expect that a lot of writing will be done, while reading and analysis will only be done once per hour.
A requirement in the data analysis is that I have to be able to do the following query:
Are both events, A and B, on item (ITEMID) logged for user (UID)? (Maybe even tell if event A came before event B based on their timestamps)
I have thought about MongoDB as my storage solution.
Can the above query be (properly) carried out by the MongoDB aggregation framework?
In the future I might add on to the analysis step, with a relation from ITEMID to ITEM.Categories (I have a collection of items, and each item has a series of categories). Possibly it would be interesting to know how many times event A occurred on items, grouped by each item's category, during the last 30 days. Will MongoDB then be a good fit for my requirements?
Some information about the data I'll be working with:
I expect to be logging on the order of 10,000 events a day on average.
I haven't decided yet, whether the data should be stored indefinitely.
Is MongoDB a proper fit for my requirements? Is there another NoSQL database that will handle my requirements better? Is NoSQL even usable in this case or am I better off sticking with relational databases?
If my requirement for the frequency of analysis changes, say from once an hour to real time, I believe Redis would serve my purpose better than MongoDB; is that correctly understood?
Are both events, A and B, on item (ITEMID) logged for user (UID)? (Maybe even tell if event A came before event B based on their timestamps)
Can the above query be (properly) carried out by the MongoDB aggregation framework?
Yes, absolutely. You can use the $group operator to aggregate events by ITEMID and UID; you can filter results before the grouping via $match to limit them to a specific time period (or with any other filter); and you can push the times (first, last) of each type of event into the document that $group creates. Then you can use $project to create a field indicating which came first, if you wish.
All of the capabilities of aggregation framework are well outlined here:
http://docs.mongodb.org/manual/core/aggregation-pipeline/
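As a hedged sketch with pymongo (the field names come from the log format above; the database and collection names are assumptions):

from datetime import datetime, timedelta
from pymongo import MongoClient

events = MongoClient()["mydb"]["events"]  # assumed database/collection names
since = datetime.utcnow() - timedelta(days=30)

pipeline = [
    # Keep only the two events of interest within a recent time window.
    {"$match": {"event": {"$in": ["A", "B"]}, "timestamp": {"$gte": since}}},
    # One document per (userid, itemid) with the first time each event was seen
    # ($min ignores the nulls produced for the other event type).
    {"$group": {
        "_id": {"userid": "$userid", "itemid": "$itemid"},
        "first_A": {"$min": {"$cond": [{"$eq": ["$event", "A"]}, "$timestamp", None]}},
        "first_B": {"$min": {"$cond": [{"$eq": ["$event", "B"]}, "$timestamp", None]}},
    }},
    # Keep only pairs where both events occurred, and flag whether A came before B.
    {"$match": {"first_A": {"$ne": None}, "first_B": {"$ne": None}}},
    {"$project": {"a_before_b": {"$lt": ["$first_A", "$first_B"]}}},
]
for doc in events.aggregate(pipeline):
    print(doc["_id"], doc["a_before_b"])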
In the future I might add on to the analysis step, with a relation from ITEMID to ITEM.Categories (I have a collection of items, and each item has a series of categories). Possibly it would be interesting to know how many times event A occurred on items, grouped by each item's category, during the last 30 days. Will MongoDB then be a good fit for my requirements?
Yes. Aggregation in MongoDB allows you to $unwind arrays so that you can group things by category, if you wish. All of the things you've described are easy to accomplish with the aggregation framework.
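For the category breakdown, one possible sketch (an assumption on my part: the items live in an "items" collection keyed by itemid, and $lookup from MongoDB 3.2+ is available to pull in their categories before unwinding):

from datetime import datetime, timedelta
from pymongo import MongoClient

db = MongoClient()["mydb"]  # assumed database name
since = datetime.utcnow() - timedelta(days=30)

pipeline = [
    {"$match": {"event": "A", "timestamp": {"$gte": since}}},
    # Join each event to its item document to get the categories array.
    {"$lookup": {"from": "items", "localField": "itemid",
                 "foreignField": "_id", "as": "item"}},
    {"$unwind": "$item"},
    {"$unwind": "$item.categories"},
    # Count event A occurrences per category over the last 30 days.
    {"$group": {"_id": "$item.categories", "count": {"$sum": 1}}},
]
for doc in db["events"].aggregate(pipeline):
    print(doc["_id"], doc["count"])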
Whether or not MongoDB is the right choice for your application is outside the scope of this site, but the requirements you've listed in this question can be implemented in MongoDB.
I have this schema for support of in-site messaging:
When I send a message to another member, the message is saved to the Message table, a record is added to the MessageSent table, and a record per recipient is added to the MessageInbox table. MessageCount is used to keep track of the number of messages in the inbox/sent folders and is maintained by insert/delete triggers on MessageInbox/MessageSent; this way I always know how many messages a member has without running an expensive "select count(*)" query.
Also, when I query member's messages, I join to Member table to get member's FirstName/LastName.
Now, I will be moving the application to MongoDB, and I'm not quite sure what the collection schema should be. Because there are no joins available in MongoDB, I have to completely denormalize it, so I would have MessageInbox, MessageDraft and MessageSent collections with full message information, right?
Then I'm not sure about following:
What if a user changes his FirstName/LastName? It will be stored denormalized as the sender in some messages and as part of the Recipients in others; how do I update it in an optimal way?
How do I get message counts? There will be tons of requests at the same time, so it has to perform well.
Any ideas, comments and suggestions are highly appreciated!
I can offer you some insight as to what I have done to simulate JOINs in MongoDB.
In cases like this, I store the ID of a corresponding user (or multiple users) in a given object, such as your message object in the messages collection.
(I'm not suggesting this be your schema, just using it as an example of my approach.)
{
_id: "msg1234",
from: "user1234",
to: "user5678",
subject: "This is the subject",
body: "This is the body"
}
I would query the database to get all the messages I need; then, in my application, I would iterate over the results and build an array of user IDs. I would de-duplicate this array and then query the database a second time, using the $in operator to find every user in the given array.
Then in my application, I would join the results back to the object.
It requires two queries to the database (or potentially more if you want to join other collections), but this illustrates something that many people have been advocating for a long time: do your JOINs in your application layer. Let the database spend its time querying data, not processing it. You can probably scale your application servers more quickly and cheaply than your database anyway.
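A minimal pymongo sketch of this two-query "application-level join" (collection and field names follow the example message above and are assumptions):

from pymongo import MongoClient

db = MongoClient()["mydb"]  # assumed database name

def inbox_with_users(recipient_id):
    # First query: the messages themselves.
    messages = list(db["messages"].find({"to": recipient_id}))

    # Collect the unique user IDs referenced by those messages.
    user_ids = {m["from"] for m in messages} | {m["to"] for m in messages}

    # Second query: fetch all referenced users in one round trip with $in.
    users = {u["_id"]: u for u in db["users"].find({"_id": {"$in": list(user_ids)}})}

    # "Join" in the application layer by attaching the user documents.
    for m in messages:
        m["from_user"] = users.get(m["from"])
        m["to_user"] = users.get(m["to"])
    return messages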
I am using this pattern to create real-time activity feeds in my application and it works flawlessly and fast. I prefer this to denormalizing things that could change, like user information, because when writing to the database MongoDB may need to rewrite the entire object if the new data doesn't fit in the old data's place. If I needed to rewrite hundreds (or thousands) of activity items in my database, it would be a disaster.
Additionally, writes in MongoDB are blocking, so if a scenario like the one I've just described were to happen, all reads and writes would be blocked until the write operation completed. I believe this is scheduled to be addressed in some capacity in the 2.x series, but it's still not going to be perfect.
Indexed queries, on the other hand, are super fast, even if you need to do two of them to get the data.