What is the best way to query data in Azure Service Fabric? - azure-service-fabric

What is the best way to query data in Azure Service Fabric? Is there something on top of the Reliable Dictionary? For example: map / reduce?
Example: A customer object is stored in a reliable collection. The key is the CustomerID. Now I'd like to find all customers whose surname starts with "A", who come from "Sydney", and who have ordered something within the last month.
What is the best way to implement this query functionality within Azure Service Fabric? What would the performance look like? Let's assume there are several hundred thousand customers in the list.

Reliable Collections are IEnumerable (or support creating enumerators), similar to single-machine .NET collections, which means you can query them with LINQ. For example:
IReliableDictionary<int, Customer> myDictionary =
    await this.StateManager.GetOrAddAsync<IReliableDictionary<int, Customer>>("mydictionary");

return from item in myDictionary
       where item.Value.Name.StartsWith("A")
       orderby item.Key
       select new ResultView(...);
Performance depends on what kind of query you're doing, but one of the major benefits of Reliable Collections is that they are local to the code doing the querying, meaning you're reading directly from memory rather than making a network request to external storage.
With a very large data set you'll end up partitioning the service that contains the dictionary (a built-in feature in Service Fabric). In that case, you will have the same query in each partition but then you'll need to aggregate the results. Typically in this case it's good to set up another dictionary that can act as an index for common queries.

Related

MongoDB aggregation from few operations

Every user in our system (like Facebook and Twitter) has an option to add other users to his predefined lists such as "Favorites", "Follow", "Blocked", "Closed Friends". Then we want to allow him to search the lists, filter, and see cumulative data from all of the above lists. For example:
UserA {
    IsFollow: 1,
    IsFavorite: 0,
    ...
    IsBlocked: 0
}
We also want to keep some additional information when a user adds another user to one of the above lists, such as addingDate.
Option One - manage different collections like "Favorites", "Follow", "Blocked", "Closed Friends".
Option Two - manage one collection like "Relations" and keep all the data in that collection, without the need to use $lookup...
Option Three - use Option One but also create a flat collection with all the relevant data from each collection (RabbitMQ, transactional updates, etc.).
Since I'm new to MongoDB (I'm migrating the system from MS SQL), I'm wondering what the best approach is for a high-scale system.
I would suggest you go with option 2, where all the keys will be present in one document.
MongoDB recommends a schema design where all the data is embedded in a single document. They claim that this leads to fewer read/write operations against the DB and faster CRUD operations compared to the relational-mapping approach.
But, there is a catch here. The data should be embedded in a single document only if the relations are One-to-One, One-to-Few, or One-to-Many.
DO NOT GO WITH THE DOCUMENT-EMBEDDING APPROACH IF YOUR DATA-MAPPING RELATION IS One-to-Squillions. I recommend reading this article.
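To make option 2 concrete, here is a minimal sketch with the Node.js mongodb driver; the users collection, the relations/addingDate field names, and the addFollow helper are illustrative assumptions, not something from the question:

import { MongoClient, ObjectId } from "mongodb";

// Hypothetical shape for option 2: one "users" collection where every relation
// to another user is embedded as a small sub-document.
interface UserDoc {
  _id: ObjectId;
  name: string;
  relations: Array<{
    userId: ObjectId;   // the other user
    isFollow: boolean;
    isFavorite: boolean;
    isBlocked: boolean;
    addingDate: Date;   // extra metadata kept per relation
  }>;
}

async function addFollow(client: MongoClient, me: ObjectId, other: ObjectId) {
  const users = client.db("app").collection<UserDoc>("users");

  // Embed a new relation directly in the user's own document.
  await users.updateOne(
    { _id: me },
    {
      $push: {
        relations: {
          userId: other,
          isFollow: true,
          isFavorite: false,
          isBlocked: false,
          addingDate: new Date(),
        },
      },
    }
  );
}

Searching and filtering then stays inside a single document (or an aggregation over it), with no $lookup between collections.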
The reason I am not recommending Option 1 (a separate collection per list) is that you will have to make more requests to the DB for each and every collection linkage. Although the $lookup stage is fast, it is not as efficient as the embedding approach.
As far as Option 3 goes, it's a viable approach (if you use transactions properly and effectively), but it adds complexity on the coding side.
I have personally used both the Option 1 and Option 2 approaches, and Option 1 has always driven the AWS EC2 instance running MongoDB to higher CPU and RAM usage. As far as Option 2 goes, I have a collection that has almost 1000 array elements (with the key indexed) and 15K keys in each record (I am not joking), and MongoDB has had no issues processing it. Just make sure that you use a projection on the returned documents everywhere.
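For the projection point, a small hedged example (same hypothetical users/relations shape as above):

import { MongoClient, ObjectId } from "mongodb";

// Return only the fields the UI needs instead of the whole multi-thousand-key document.
async function loadRelationSummary(client: MongoClient, userId: ObjectId) {
  return client.db("app").collection("users").findOne(
    { _id: userId },
    { projection: { name: 1, "relations.userId": 1, "relations.isFollow": 1 } }
  );
}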
So, go for Option 2 as the standard approach and Option 3 for One-to-Squillions relation mapping.
For referencing two or more collections, make sure you use the MongoDB-generated ObjectId instead of your own custom reference, since I have seen a minor performance impact when multi-document relation mapping uses anything other than ObjectId (even if that particular key is indexed).
Hope this helps. Reach out to me if you have additional queries.

How to access relational data in DynamoDB (Keys vs Queries)

What is the best way to access relational data in DynamoDB? Is it best to store information redundantly and update it as needed, or to query every time you need access to relational data?
For example, say I have a Team table that has a Members field which stores an array of member ids that also includes each member's name.
Team
    Members: [
        {member1: "John"},
        {member2: "Sam"},
        {member3: "Pam"}
    ]
Whenever my application needs access to a team's members, all I have to do is call get on the team's Members field. Then, if I need additional team member data, I can further call get on the entire member record.
One of the concerns I have with this method is having to update the names inside of the members array each time a person updates their name. Not to mention all of the other places the user's name could also be stored.
My main question, then, is whether I should instead query for these records every time, rather than storing this data inside of the team record.
For example, should I instead query for all members who belong to team 123?
Would querying be more expensive to continually do, or would having to update all of this dependent data be more expensive?
I know that the query route is less of a headache than trying to prevent data anomalies, but querying may also require multiple calls to get the same data.
DynamoDB write operations are more expensive than read ops, and become even more expensive when you add secondary indexes to the table (or partition).
Also, synchronizing values across different items may incur additional cost if the sync operation must be done in a transaction to ensure atomicity.
It mainly depends on your application needs and your budget, i.e. identify approximately how many requests per use case you will have during a period (e.g. a month), then try to optimize the cost of the use cases that generate most of your traffic, regardless of their type (read or write).
For some read use cases you may accept eventual consistency of the read model and use a cache (e.g. S3 files); it's still an option...
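If you go with the query-every-time route, a minimal sketch with the AWS SDK v3 Document Client could look like the following; the Members table name, the teamId-index GSI, and the attribute names are assumptions for illustration only:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, QueryCommand } from "@aws-sdk/lib-dynamodb";

const client = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// "Query every time" variant: fetch all members of a team through a
// (hypothetical) GSI keyed on teamId instead of duplicating names in Team.
async function getTeamMembers(teamId: string) {
  const result = await client.send(new QueryCommand({
    TableName: "Members",        // assumed table name
    IndexName: "teamId-index",   // assumed GSI whose partition key is teamId
    KeyConditionExpression: "teamId = :t",
    ExpressionAttributeValues: { ":t": teamId },
  }));
  return result.Items ?? [];
}

Each Query like this consumes RCUs on every call, whereas the redundant-storage variant pays with extra writes (and possibly transactions) whenever a member's name changes.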

Firestore pagination of multiple queries

In my case there are 10 fields and all of them need to be searched with "or"; that is why I'm using multiple queries and filtering the common items on the client side using Promise.all().
The problem is that I would like to implement pagination. I don't want to get all the results of each query, which has too high a "read" cost. But I can't use .limit() on each query, because what I want is to limit the final result.
For example, I would like to get the first 50 common results in the 10 queries' results, if I do limit(50) to each query, the final result might be less than 50.
Anyone has ideas about pagination for multiple queries?
I believe that the best way for you to achieve that is using query cursors, so you can better manage the data that you retrieve from your searches.
I would recommend you take a look at the links below to find more information, including a question answered by the community that seems similar to your case.
Paginate data with query cursors
Multi query and pagination with Firestore
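As a rough illustration of what a query cursor looks like with the Firebase Admin SDK (the products collection, the category and createdAt fields, and the page size are placeholders for whatever your 10 filters actually are):

import { initializeApp } from "firebase-admin/app";
import { getFirestore, DocumentSnapshot } from "firebase-admin/firestore";

initializeApp();
const db = getFirestore();

// Fetch one page of a single filtered query; pass the last document of the
// previous page to get the next one. Each "or" branch needs its own cursor.
async function fetchPage(category: string, pageSize: number, after?: DocumentSnapshot) {
  let query = db.collection("products")   // assumed collection name
    .where("category", "==", category)    // one of the 10 filters
    .orderBy("createdAt")                 // cursors require an ordering
    .limit(pageSize);

  if (after) {
    query = query.startAfter(after);
  }

  const snap = await query.get();
  return {
    docs: snap.docs.map(d => ({ id: d.id, ...d.data() })),
    last: snap.docs[snap.docs.length - 1], // cursor for the next call
  };
}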
Let me know if the information helped you!
Not sure if it's relevant, but I think I'm having a similar problem and have come up with 4 approaches that might be a workaround.
Instead of making 10 queries, fetch all the products matching a single selection filter, e.g. category (in my case a customer can only set a single category field), and do all the filtering on the client side. With this approach the app still reads lots of documents at once, but it can at least reuse them during the session and filter with more flexibility than Firestore's strict rules allow.
Run the multiple queries in a server environment, such as Cloud Functions with Node.js, and return only the first 50 documents that match all the filters. With this approach the client only receives the wanted data, nothing more, but the server still reads a lot.
This is actually your approach combined with the accepted answer.
Create automatically maintained documents in Firestore with the help of Cloud Functions, e.g. Colors: {red: [product1ID, product2ID, ...], ...}, storing just the document IDs. Depending on the filters, get the corresponding documents on the server side with Cloud Functions, create a cross product (intersection) of the matching arrays (AND logic), and push the first 50 elements of it to the client side. Knowing which products to display, the client then handles fetching them with the client-side library.
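A rough sketch of that last idea, assuming pre-built index documents such as colors/red and sizes/small whose productIds arrays are kept up to date by Cloud Functions (all collection and field names here are made up for the example):

import { initializeApp } from "firebase-admin/app";
import { getFirestore } from "firebase-admin/firestore";

initializeApp();
const db = getFirestore();

// AND-combine two pre-aggregated ID lists and fetch only the first `limit`
// matching product documents, instead of reading every candidate document.
async function queryByIndexes(color: string, size: string, limit = 50) {
  const [colorDoc, sizeDoc] = await Promise.all([
    db.collection("colors").doc(color).get(),   // e.g. { productIds: [...] }
    db.collection("sizes").doc(size).get(),
  ]);

  const colorIds: string[] = colorDoc.get("productIds") ?? [];
  const sizeIds = new Set<string>(sizeDoc.get("productIds") ?? []);

  // Intersection = products matching both filters; keep only the first page.
  const matching = colorIds.filter(id => sizeIds.has(id)).slice(0, limit);

  const snaps = await Promise.all(
    matching.map(id => db.collection("products").doc(id).get())
  );
  return snaps.filter(s => s.exists).map(s => ({ id: s.id, ...s.data() }));
}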
Hope these help. Here is my original post: Firestore multiple `in` queries with `and` logic, query structure.

How to do basic aggregation with DynamoDB?

How is aggregation achieved with DynamoDB? MongoDB and Couchbase have map-reduce support.
Let's say we are building a tech blog where users can post articles, and articles can be tagged.
user
{
    id : 1235,
    name : "John",
    ...
}

article
{
    id : 789,
    title : "dynamodb use cases",
    author : 12345, // userid
    tags : ["dynamodb", "aws", "nosql", "document database"]
}
In the user interface we want to show, for the current user, the tags and their respective counts.
How can the following aggregation be achieved?
{
    userid : 12,
    tag_stats : {
        "dynamodb" : 3,
        "nosql" : 8
    }
}
We will provide this data through a REST API and it will be called frequently, since this information is shown on the app's main page.
I can think of extracting all the documents and doing the aggregation at the application level, but I feel my read capacity units will be exhausted.
I could use tools like EMR, Redshift, BigQuery, or AWS Lambda, but I think these are for data-warehousing purposes.
I would like to know other and better ways of achieving the same.
How are people achieving dynamic, simple queries like these, having chosen DynamoDB as the primary data store, considering cost and response time?
Long story short: DynamoDB does not support this. It's not built for this use case; it's intended for quick data access with low latency. It simply does not support any aggregation functionality.
You have three main options:
Export DynamoDB data to Redshift or EMR Hive. Then you can execute SQL queries on stale data. The benefit of this approach is that it consumes RCUs just once, but you will be stuck with outdated data.
Use the DynamoDB connector for Hive and query DynamoDB directly. Again you can write arbitrary SQL queries, but in this case the data in DynamoDB is accessed directly. The downside is that it consumes read capacity on every query you run.
Maintain aggregated data in a separate table using DynamoDB Streams. For example, you can have a table with UserId as the partition key and a nested map of tags and counts as an attribute. On every update to your original data, DynamoDB Streams will execute a Lambda function or some code on your hosts to update the aggregate table. This is the most cost-efficient method, but you will need to implement additional code for each new query.
Of course you can extract data at the application level and aggregate it there, but I would not recommend doing that. Unless you have a small table, you will need to think about throttling, using only part of the provisioned capacity (you want to consume, say, 20% of your RCUs for aggregation, not 100%), and how to distribute your work among multiple workers.
Both Redshift and Hive already know how to do this: Redshift relies on multiple worker nodes when it executes a query, while Hive is built on top of MapReduce. Also, both Redshift and Hive can use a predefined percentage of your RCU throughput.
DynamoDB is pure key/value storage and does not support aggregation out of the box.
If you really want to do aggregation using DynamoDB, here are some hints.
For your particular case, let's have a table named articles.
To do the aggregation we need an extra table, user-stats, holding userId and tag_stats.
Enable DynamoDB Streams on the articles table.
Create a new Lambda function, user-stats-aggregate, which is subscribed to the articles DynamoDB stream and receives NEW_AND_OLD_IMAGES on every create/update/delete operation over the articles table.
The Lambda will perform the following logic:
If there is no old image, take the current tags and increment each occurrence by 1 in the DB for this user (keep in mind there may be no initial record in user-stats for this user yet).
If there is an old image, check whether each tag was added or removed and apply +1 or -1 accordingly for each affected tag of the received user.
Stand up an API service that retrieves these user stats.
Usually, aggregation in DynamoDB is done using DynamoDB Streams, Lambdas that do the aggregation, and extra tables keeping the aggregated results at different granularities (minutes, hours, days, years, ...).
This gives near-real-time aggregation without the need to compute it on the fly for every request; you query the aggregated data instead.
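A trimmed-down sketch of such a stream-triggered Lambda, assuming an articles stream with NEW_AND_OLD_IMAGES and a user-stats table; note that it keys the stats by userId + tag instead of the nested map described above, so each counter update is a single atomic ADD (all names and the attribute layout are illustrative only):

import { DynamoDBStreamEvent } from "aws-lambda";
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, UpdateCommand } from "@aws-sdk/lib-dynamodb";
import { unmarshall } from "@aws-sdk/util-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Subscribed to the "articles" table stream; keeps per-user tag counts in a
// separate "user-stats" table so reads never have to scan articles.
export async function handler(event: DynamoDBStreamEvent): Promise<void> {
  for (const record of event.Records) {
    const oldImage = record.dynamodb?.OldImage ? unmarshall(record.dynamodb.OldImage as any) : undefined;
    const newImage = record.dynamodb?.NewImage ? unmarshall(record.dynamodb.NewImage as any) : undefined;

    const userId = (newImage ?? oldImage)?.author;
    if (!userId) continue;

    const oldTags = new Set<string>(oldImage?.tags ?? []);
    const newTags = new Set<string>(newImage?.tags ?? []);

    // +1 for tags that appeared on the article, -1 for tags that were removed.
    const deltas = new Map<string, number>();
    newTags.forEach(t => { if (!oldTags.has(t)) deltas.set(t, 1); });
    oldTags.forEach(t => { if (!newTags.has(t)) deltas.set(t, -1); });

    for (const [tag, delta] of deltas) {
      await doc.send(new UpdateCommand({
        TableName: "user-stats",
        Key: { userId, tag },                           // assumed PK userId, SK tag
        UpdateExpression: "ADD #c :d",                  // ADD creates the counter if missing
        ExpressionAttributeNames: { "#c": "tagCount" },
        ExpressionAttributeValues: { ":d": delta },
      }));
    }
  }
}

The REST endpoint then only needs a cheap Query on user-stats by userId to assemble the tag_stats map from the question.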
Basic aggregation can be done using scan() and query() in lambda.

What would be the best practice with multiple collections in mongodb

I need to build a SaaS application. Part of it is on MySQL (containing customer information, data, etc.); the other part is on MongoDB, as it uses great amounts of fairly raw, unrelated data which is subject to map-reduce and aggregation. It would be highly inefficient to use just one of the two, so I have to use both. The scenario, in summary: I have a table in the MySQL DB with customer accounts, under which I have the actual users; the relevant column is customer_id. Now, each customer has particularly formatted data on the MongoDB side, so I would need a set of collections for each customer, for example 31filters, 31data, 31logs, where 31 is the customer id, and all these collections are to reside in the same MongoDB database. Is this an acceptable approach, or would it be better to have a separate MongoDB database for each customer? What would be the best approach in terms of scalability?
Thank you