Monitor MongoDB Atlas data transfer costs

I have a MongoDB Atlas cluster that serves many customers. Each customer has its own database on the cluster.
I would like to reduce my application's impact on MongoDB data transfer costs, which have been increasing for the last few days, but the billing info provided by Atlas does not break down prices per database. Therefore, I have no way of knowing which customers are costly, or which queries transfer the most data.
Moreover, looking at the daily prices alongside a few queries, I cannot correlate the insertion of resources in my application with the prices. For example, let's say my resources are Cats: one day it costs $5 of data transfer with 5,000 Cats inserted in total across the databases, but the next day it costs $13 with only 1,500 Cats inserted.
Do you know of tools, or something in the Atlas dashboard I might have missed, that could help me better track costs per customer, or even a cost per Cat (in my example), so that I can build a pricing model for my customers?
Thank you

You are most likely going to need separate projects and deployments.
A MongoDB client instance is generally capable of using any database on the server (subject to authorization rules and the APIs provided in the language in question). Getting a breakdown of data transfer by database would therefore require the server to track bytes transferred per operation and then aggregate those counts; as far as I know, this isn't a feature that currently exists.
The most practical way of tracking this today is probably to write a layer on top of the driver, on the client side, that looks at the data actually received.
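As a rough illustration of such a client-side layer, here is a minimal sketch using PyMongo's command monitoring API. Treating the re-encoded reply size as "bytes received" is only an approximation (it ignores compression and outbound payloads), and the connection string and class name are placeholders:

    from collections import defaultdict

    import bson
    from pymongo import MongoClient, monitoring


    class TransferAccountant(monitoring.CommandListener):
        """Accumulates approximate response bytes per database."""

        def __init__(self):
            self.bytes_by_db = defaultdict(int)
            self._db_for_request = {}

        def started(self, event):
            # Remember which database issued this request.
            self._db_for_request[event.request_id] = event.database_name

        def succeeded(self, event):
            db = self._db_for_request.pop(event.request_id, "unknown")
            # Approximate the server reply size by re-encoding it as BSON.
            self.bytes_by_db[db] += len(bson.encode(event.reply))

        def failed(self, event):
            self._db_for_request.pop(event.request_id, None)


    accountant = TransferAccountant()
    client = MongoClient("mongodb+srv://...", event_listeners=[accountant])
    # ...run the normal workload, then inspect accountant.bytes_by_db per customer.

Since each customer has its own database, the per-database totals map directly to per-customer totals, which is enough to spot the expensive tenants and queries even if the numbers don't match the Atlas bill exactly.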

Related

Exposing Data from BigQuery to Mobile/Web Apps Via Firestore

I am looking for an easy way (of course with good performance) to expose data in my BigQuery table to web applications.
The current solution in production uses a Cloud Function and Firestore (in native mode) to expose the data from BigQuery. The implementation works like this: as soon as the data is written to the final BigQuery table, we trigger Cloud Functions (500 records per commit) to update the data in our final Firestore table. The data in the Firestore table is then exposed to the app/web client.
And, to avoid the timeout issues associated with Cloud Functions, we divide the entire dataset into batches, and each Cloud Function instance handles a single batch of records only.
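For reference, the batched-write step described above might look roughly like this with the google-cloud-firestore Python client; the collection name and use of a record "id" field are made up, and the 500-record cap mirrors the "500 records per commit" figure above:

    from google.cloud import firestore

    db = firestore.Client()

    def write_batch(records):
        """Commit up to 500 records in a single Firestore WriteBatch."""
        batch = db.batch()
        for record in records[:500]:  # one small batch per function invocation
            doc_ref = db.collection("analytics_results").document(record["id"])
            batch.set(doc_ref, record)
        batch.commit()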
But soon after going live, we were hit with scalability issues on the writes, since we were triggering the Cloud Function instances sequentially.
A simple way to improve performance could be to do parallel writes from inside the Cloud Function, but according to the Firestore documentation, doing more than 1,000 writes/sec against a collection can reduce performance. So the performance gains we would get with this approach could be minimal. In our case, we have only one collection.
Does anyone here have experience dealing with high-volume writes and reads against Firestore? Firestore in Datastore mode can be used for high-volume writes, but what about the read latency?
Also, I am thinking of using Bigtable for this purpose (eventual consistency would be fine for us), but using Bigtable might add additional layers to expose the data, maybe through a web service.
We are expecting the data size to be only in the GB range.
PS: I don't need the offline capabilities offered by Firestore; the reason for choosing Firestore was ease of development only.
Based on the information you shared, Firestore does not seem like an appropriate choice of product for the amount of data you will be adding at once; additionally, the cost of this might be heavier than the alternatives if we are talking about TBs of data, which I assume is the case.
Generally speaking, Firestore is not recommended for very data-intensive apps or apps with too many writes, for pricing reasons, as reads are considerably cheaper than writes.
Personally, I would choose Bigtable for this case, for the following reasons (see the short sketch after this list):
It supports apps with high throughput.
It is easily scalable, without loss of performance or instance downtime while scaling.
If kept in the same zone or region as BigQuery, there is no additional cost to migrate the data from BigQuery to Bigtable.
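As a hypothetical sketch of writing and reading the exposed data with the google-cloud-bigtable Python client (the project, instance, table, column family and row-key layout are all invented for illustration):

    import json

    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("analytics-instance").table("results")

    # Write one record; the row key encodes the lookup pattern (e.g. id + date).
    row = table.direct_row(b"customer#42#2024-01-01")
    row.set_cell("metrics", "payload", json.dumps({"clicks": 10}).encode())
    row.commit()

    # Read it back by key, e.g. from a small web service in front of Bigtable.
    result = table.read_row(b"customer#42#2024-01-01")
    cell = result.cells["metrics"][b"payload"][0]
    print(json.loads(cell.value))

The extra "layer" you mention is essentially that small web service, since Bigtable has no client SDK you would want to expose directly to browsers or mobile apps.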

AWS database solution for storing non-relational data

What's the best AWS database for the requirements below?
I need to store around 50,000-100,000 entries in the database.
Each entry would have a String as the key and a JSON array as the value.
I should be able to retrieve the JSON array using the key.
The size of the JSON data is around 20-30 KB.
I expect around 10,000-40,000 reads per hour.
Around 50,000-100,000 writes per week.
I have to consider the cost as well.
Ease of integration/development.
I am a bit confused between MongoDB, DynamoDB and PostgreSQL. Please share your thoughts on this.
DynamoDB:-
DynamoDB is a fully managed proprietary NoSQL database service that supports key-value and document data structures. For the typical use case you have described in the OP, it would serve the purpose.
DynamoDB can handle more than 10 trillion requests per day and support peaks of more than 20 million requests per second.
DynamoDB has good AWS SDKs for all operations, and the read and write capacity units can be configured for the table.
DynamoDB tables using on-demand capacity mode automatically adapt to your application's traffic volume. On-demand capacity mode instantly accommodates up to double the previous peak traffic on a table. For example, if your application's traffic pattern varies between 25,000 and 50,000 strongly consistent reads per second, where 50,000 reads per second is the previous traffic peak, on-demand capacity mode instantly accommodates sustained traffic of up to 100,000 reads per second. If your application sustains traffic of 100,000 reads per second, that peak becomes your new previous peak, enabling subsequent traffic to reach up to 200,000 reads per second.
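To make the string-key / JSON-array access pattern from the question concrete, a rough boto3 sketch could look like this (table and attribute names are hypothetical; the table is created in on-demand mode):

    import json

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Create a table billed in on-demand (PAY_PER_REQUEST) capacity mode.
    dynamodb.create_table(
        TableName="entries",
        AttributeDefinitions=[{"AttributeName": "entry_key", "AttributeType": "S"}],
        KeySchema=[{"AttributeName": "entry_key", "KeyType": "HASH"}],
        BillingMode="PAY_PER_REQUEST",
    )
    dynamodb.get_waiter("table_exists").wait(TableName="entries")

    # Store the JSON array as a string attribute and fetch it back by key.
    dynamodb.put_item(
        TableName="entries",
        Item={
            "entry_key": {"S": "customer-42"},
            "payload": {"S": json.dumps([{"sku": "A1", "qty": 2}])},
        },
    )
    item = dynamodb.get_item(
        TableName="entries",
        Key={"entry_key": {"S": "customer-42"}},
    )["Item"]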
One point to note is that it doesn't allow you to query the table based on non-key attributes. This means that if you don't know the hash key of an item, you may need to do a full table scan to get the data. However, there is a secondary index option which you can explore to get around the problem. You may want to list all the query access patterns of your use case before you design the table, so that you can make an informed decision.
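And a hedged sketch of querying by a non-key attribute through a global secondary index (the index name and attributes are hypothetical and would have to be defined on the table first):

    import boto3

    dynamodb = boto3.client("dynamodb")

    # Query via a global secondary index instead of scanning the whole table.
    resp = dynamodb.query(
        TableName="entries",
        IndexName="category-index",
        KeyConditionExpression="category = :c",
        ExpressionAttributeValues={":c": {"S": "toys"}},
    )
    for item in resp["Items"]:
        print(item["entry_key"]["S"])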
MongoDB:-
MongoDB is not a fully managed service on AWS. However, you can set up the database yourself using AWS services such as EC2, VPC, IAM, EBS, etc., which requires some AWS cloud experience. The other option is to use the MongoDB Atlas service.
MongoDB is more flexible in terms of querying, and it has a powerful aggregation framework. There are also lots of tools available to query the database directly and explore the data, much as you would with SQL.
In terms of a Java API, Spring Data MongoDB can be used to perform typical database operations, and there are lots of open source frameworks available in various languages as well (for example, Mongoose for Node.js).
MongoDB has support for many programming languages, and the APIs are mature as well.
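As a hypothetical sketch of the same key/JSON-array use case against MongoDB (shown with PyMongo for brevity; database, collection and field names are invented):

    from pymongo import MongoClient

    client = MongoClient("mongodb+srv://...")
    col = client["appdb"]["entries"]
    col.create_index("entry_key", unique=True)

    col.insert_one({"entry_key": "customer-42",
                    "items": [{"sku": "A1", "qty": 2}]})
    doc = col.find_one({"entry_key": "customer-42"})

    # Unlike a pure key-value lookup, secondary attributes can be queried and
    # aggregated directly, e.g. the total quantity per SKU across all entries:
    pipeline = [
        {"$unwind": "$items"},
        {"$group": {"_id": "$items.sku", "total_qty": {"$sum": "$items.qty"}}},
    ]
    totals = list(col.aggregate(pipeline))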
PostgreSQL:-
PostgreSQL is available as a fully managed database on AWS through Amazon RDS.
PostgreSQL has become the preferred open source relational database for many enterprise developers and start-ups, powering leading geospatial and mobile applications. Amazon RDS makes it easy to set up, operate, and scale PostgreSQL deployments in the cloud.
I don't think I need to write much about this database and its API. It is a very mature database and has good APIs.
Points to consider:-
Query Access Pattern
Easy setup
Database maintenance
API and frameworks
Community support

Amazon Redshift for SaaS application

I am currently testing Redshift for a SaaS near-realtime analytics application.
Query performance is fine on a 100M-row dataset.
However, the concurrency limit of 15 queries per cluster will become a problem when more users are using the application at the same time.
I cannot cache all aggregated results, since we allow users to customize the filters on each query (ad-hoc querying).
The requirements for the application are:
queries must return results within 10 s
ad-hoc queries with filters on more than 100 columns
from 1 to 50 clients connected at the same time to the application
dataset growing at a rate of 10M rows/day
typical queries are SELECTs with the aggregate functions COUNT and AVG, and 1 or 2 joins
Is Redshift not correct for this use case? What other technologies would you consider for those requirements?
This question was also posted on the Redshift Forum. https://forums.aws.amazon.com/thread.jspa?messageID=498430&#498430
I'm cross-posting my answer for others who find this question via Google. :)
In the old days we would have used an OLAP product for this, something like Essbase or Analysis Services. If you want to look into OLAP, there is a very nice open source implementation called Mondrian that can run over a variety of databases (including Redshift, AFAIK). Also check out Saiku for an OSS browser-based OLAP query tool.
I think you should test the behaviour of Redshift with more than 15 concurrent queries. I suspect it will not be noticeable to users, as the queries will simply queue for a second or two.
If you prove that Redshift won't work, you could test Vertica's free 3-node edition. It's a bit more mature than Redshift (i.e. it will handle more concurrent users) and much more flexible about data loading.
Hadoop/Impala is overly complex for a dataset of your size, in my opinion. It is also not designed for a large number of concurrent queries or short-duration queries.
Shark/Spark is designed for the case where your data is arriving quickly and you have a limited set of metrics that you can pre-calculate. Again, this does not seem to match your requirements.
Good luck.
Redshift is very sensitive to the keys used in joins and in GROUP BY/ORDER BY. There are no dynamic indexes, so you usually define your structure to suit the tasks.
What you need to ensure is that your joins match the structure 100%. Look at the explain plans: you should not have any redistribution or broadcasting, and no leader-node activities (such as sorting). This sounds like the most critical requirement, considering the number of queries you are going to have.
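As a hedged illustration of what "matching the structure" means (table, column and connection details are invented), give the joined tables the same distribution key and then check the explain plan for redistribution steps, e.g. through any PostgreSQL driver such as psycopg2:

    import psycopg2

    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="analytics",
                            user="analyst", password="...")
    conn.autocommit = True
    cur = conn.cursor()

    # Declaring a DISTKEY and SORTKEY up front; the assumed "customers" table
    # would share the same DISTKEY (customer_id) so the join is co-located.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS events (
            customer_id BIGINT,
            event_date  DATE,
            amount      DOUBLE PRECISION
        ) DISTKEY (customer_id) SORTKEY (event_date);
    """)

    # Inspect the plan: steps such as DS_BCAST_INNER or DS_DIST_* indicate
    # broadcasting/redistribution, which is what you want to avoid.
    cur.execute("""
        EXPLAIN
        SELECT c.segment, COUNT(*), AVG(e.amount)
        FROM events e JOIN customers c ON c.customer_id = e.customer_id
        GROUP BY c.segment;
    """)
    for (line,) in cur.fetchall():
        print(line)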
The requirement to be able to filter/aggregate on arbitrary columns (100 of them) can be a problem as well. If the structure (dist keys, sort keys) doesn't match the columns most of the time, you won't be able to take advantage of Redshift's optimisations. These are scalability problems, however: you can increase the number of nodes to match your performance needs, though you might be surprised by the cost of the optimal solution.
This may not be a serious problem if the number of projected columns is small; otherwise Redshift will have to hold large amounts of data in memory (and eventually spill to disk) while sorting or aggregating (even in a distributed manner), and that can again impact performance.
Beyond scaling, you can always implement sharding or mirroring to overcome some queue/connection limits, or contact AWS support to have some limits lifted.
You should also consider pre-aggregation. Redshift can scan billions of rows in seconds as long as it does not need to do transformations like reordering, and it can store petabytes of data, so it's OK if you store data in excess.
In summary, nothing in the definition you provided makes your use case unsuitable. It might require some work, and the details depend on the exact usage patterns.

What are the pros and cons of a Relational DB vs Mongo vs Flat file behind a CDN

Let's say that I have an ecommerce website with millions of products and millions of pageviews a day, mostly on product detail pages.
Let's say that I currently have all my data in a relational DB, the good old way.
What would be the pros and cons of keeping the data in the relational DB for doing queries, aggregating and filtering products and all that, but using flat JSON files for the product details?
So, having one file per product, with all details serialized to JSON. These files would be placed behind a high-performance CDN, geographically distributed and all that. When the user goes to
www.mysite.com/prods/00123
the server (or even the client) would load a template file for the layout, and then fill it with the data it reads from something like cdn.mysite.com/prods/00123.json
So I basically don't need to do queries in this case; I jump straight to the file named after the product ID. I guess it should be very fast, and yet I would delegate the scalability/caching/geographic distribution to a strong external partner (a CDN like Akamai, Amazon, etc.) instead of building my own (expensive and hard to maintain) distributed DB servers.
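For what it's worth, the publish step I have in mind could be as simple as this toy sketch (paths and field names are just illustrative): serialize each product to prods/<id>.json and upload the files to whatever origin the CDN fronts.

    import json
    from pathlib import Path

    def publish_product(product: dict, out_dir: str = "prods") -> Path:
        """Write one product to prods/<zero-padded id>.json for the CDN origin."""
        path = Path(out_dir) / f"{product['id']:05d}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(product))
        return path

    publish_product({"id": 123, "name": "Example product", "price": 9.99})
    # The CDN then serves this as e.g. cdn.mysite.com/prods/00123.json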
I look forward to your suggestions/feedback, especially if it comes from real-world experience :)
Thanks!
As per your requirements,
It is better to store the product descriptions in a schema-free database like MongoDB, since your products can have very different fields, with wide variation in the number of attributes (and corresponding fields). Also, such information is written far less often than it is read. MongoDB has collection-level write locks, which deter write-heavy applications if you want to do consistent writes; however, reads in MongoDB are very fast because you don't have to do joins or fetch field values from an EAV schema table. Needless to say, based on your data volume, sharding and replication need to be set up in a production environment.
It is better than storing in a flat file, since MongoDB's read performance is very good thanks to memory-mapped files, and you get replication/sharding as well.
However, if the filesystem (or the filesystem network) provides the security, speed and accessibility provided by the database, then storing the data in the filesystem is not a bad idea. The traditional DB vs. flat-file argument does not hold if the flat files are configured to be served in an efficient manner.
However, you should not store information like the shopping cart, checkout transactions, etc. in MongoDB, since you don't have ACID transactions, and frequent writes and updates "with consistency" are not MongoDB's cup of tea.
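To illustrate the "very different fields per product" point, here is a minimal sketch (field names invented) with two product documents and an attribute filter that would otherwise need an EAV join in a relational schema:

    from pymongo import MongoClient

    products = MongoClient("mongodb://...")["shop"]["products"]

    # Two products with completely different attribute sets in one collection.
    products.insert_many([
        {"_id": "00123", "name": "4K TV", "screen_inches": 55, "hdmi_ports": 3},
        {"_id": "00124", "name": "Running shoe", "size_eu": 42, "color": "red"},
    ])

    # Filter directly on an attribute, no joins or EAV lookups required.
    tvs = list(products.find({"screen_inches": {"$gte": 50}}))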

Can MongoDB handle TBs of data?

Will MongoDB handle several TB of data? I've read posts saying that Mongo does well with < 1 TB of data and that for larger sets I should go with HBase. Is that true?
I need to store and later process several TB of text data.
These may be of interest to you:
Wordnik: data set in the >3TB range
Craigslist: a sharded cluster designed to support 10TB of data.
You'll find some additional case studies on 10gen's website, although not all of them provide specific numbers on data set sizes. There are also some older discussions on Stack Overflow about this very question (see here for a blurb about a user with 12TB of data from March 2010), and you'll likely find more case studies scattered among presentations on Speaker Deck or Slideshare. In short, MongoDB can certainly handle that amount of data (people are using it to that effect today), but you'll want to heed best practices, which is where existing presentations can come in handy.
MongoDB
Tens of thousands of organizations use MongoDB to build high-performance systems at scale. Over a third of the Fortune 100 and many of the most successful and innovative web companies rely on MongoDB. They've grown from single server deployments to clusters with over 1,000 nodes, delivering millions of operations per second on over 100 billion documents and petabytes of data.
Scalability is not just about speed. It's about 3 different metrics, which often work together:
Cluster Scale. Distributing the database across 100+ nodes, often in multiple data centers
Performance Scale. Sustaining 100,000+ database read and writes per second while maintaining strict latency SLAs
Data Scale. Storing 1 billion+ documents in the database
There are many examples of MongoDB users who are pushing the limits to scalability. Here are a few, organized around each scaling dimension.
You can find a reference, MongoDB: Bringing Online Big Data to Business Intelligence & Analytics, in this article.