we store apple app data in a database (http://www.apple.com/itunes/affiliates/resources/documentation/itunes-enterprise-partner-feed.html).
we want to optimize for one type of query: find all apps that meet some criteria. criteria: (1) avg rating of app; (2) number of app ratings; (3) devices supported by app; (4) countries where app is sold; (5) current price of app; and (6) date when app went free. the query should be as fast as possible. example query: "find all apps with > 600 ratings, averages 5 stars, supports iPads and iPhones, is sold in the US, and dropped their price to $0.00 two days ago."
based on the apple schema, there is price information for every country. assuming apple supports 100 countries, each app will have 100 prices -- one for each country. we also need to store the historical prices for each app, meaning an app with 10 price changes will have 1000 prices (assuming 100 countries).
three questions:
1) how do you advise we store the price data in mongo to make queries fast? right now, we're thinking of storing prices as an array of objects. each object consists of three elements: (1) date; (2) country; (3) price.
2) if we store price data as objects in an array, what do we need to do to make searches against price data very fast. again, the common price search is something like, "find all apps that dropped their price to $0.00 2 days ago in the USA store."
3) any gotchas we should be mindful of in storing the data?
Personally, I would have a separate collection for the daily price data -- 1 record per day per app (the compound natural key), with that day's set of 100 numbers for that app. This way the records will never need to grow or relocate -- that's a big win. With proper indexes, most any query against this collection can be made to perform well. Keep the field names small for more efficient storage.
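For concreteness, here is a rough sketch of what such a record and its index might look like with the Node.js driver; the collection name, field names, and values are placeholders, not your actual schema:

```typescript
import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
const db = client.db("appstore");
const dailyPrices = db.collection("daily_prices");

// One record per app per day, holding that day's full by-country price vector.
// Short field names (d, p) keep per-document storage overhead down.
await dailyPrices.insertOne({
  app: 12345,                 // app id
  d: new Date("2023-01-13"),  // snapshot date
  p: { US: 0.0, GB: 0.79, DE: 0.99 /* ...one entry per supported country */ },
});

// Unique compound index on the natural key (app, date); the record never grows,
// so it never needs to relocate.
await dailyPrices.createIndex({ app: 1, d: -1 }, { unique: true });

// A date-leading index additionally supports "what changed on day X" scans.
await dailyPrices.createIndex({ d: 1 });
```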
I would keep a separate collection for the app "master data" -- 1 record per app. In those records you can memoize the most recent date the app went free, a snapshot of the most recent by-country price vector, and similar snapshot values of any other "summary" data that may form the selection criteria for an app search. Aggregations to compute and record such values, should they become costly, can then be performed in the background at convenient times.
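Again only a sketch with made-up field names, but a master record along these lines lets the example query from your question run against a single indexed collection:

```typescript
import { MongoClient } from "mongodb";

interface AppDoc {
  _id: number;                      // app id
  avgRating: number;
  ratingCount: number;
  devices: string[];
  countries: string[];
  price: Record<string, number>;    // snapshot of the latest by-country prices
  wentFree: Record<string, Date>;   // most recent date the app went free, per country
}

const client = new MongoClient("mongodb://localhost:27017");
const apps = client.db("appstore").collection<AppDoc>("apps");

// Memoized summary record, refreshed by a background job.
await apps.updateOne(
  { _id: 12345 },
  {
    $set: {
      avgRating: 5,
      ratingCount: 812,
      devices: ["iphone", "ipad"],
      countries: ["US", "GB"],
      price: { US: 0, GB: 0 },
      wentFree: { US: new Date("2023-01-13") },
    },
  },
  { upsert: true }
);

// The example query from the question, answered from this one collection:
// > 600 ratings, 5-star average, supports iPad and iPhone, sold in the US,
// and went free in the US two days ago.
const twoDaysAgo = new Date(Date.now() - 2 * 24 * 60 * 60 * 1000);
const hits = await apps.find({
  ratingCount: { $gt: 600 },
  avgRating: 5,
  devices: { $all: ["ipad", "iphone"] },
  countries: "US",
  "price.US": 0,
  "wentFree.US": { $gte: twoDaysAgo },
}).toArray();
```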
Hope that's a help! Great that you're asking these questions up front. :)
Related
My application is used for creating production budgets for complex projects (construction, media productions etc.)
The structure of the budget is as follows:
The budget contains "sections",
the sections contain "accounts",
the accounts contain "subaccounts",
and the subaccounts contain line items.
Line items have a number of fields (units, rate, currency, tax, etc.) and a calculated total.
Certain fields in line items may have alphanumeric codes that represent numeric values, which a user can use instead of a hard-coded number; e.g. the user can enter "=build-weeks" and define it with a formula that evaluates to, say, 7, which is then used in the calculation of a total.
Line items bubble up their totals, so a subaccount's total equals the sum of its line items,
an account's total equals the sum of its subaccounts,
a section's total equals the sum of its account totals,
and the budget total is the sum of the section totals (a small sketch of this roll-up follows below).
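To make that roll-up concrete, here is roughly the calculation I mean; the types, field names, and the named-value lookup are simplified and ignore currency and tax:

```typescript
// Hypothetical named values, e.g. { "build-weeks": 7 }.
type NamedValues = Record<string, number>;

// Simplified line item: units may be a number or a code like "=build-weeks".
interface LineItem { units: number | string; rate: number; }

const resolve = (v: number | string, names: NamedValues): number =>
  typeof v === "number" ? v : names[v.replace(/^=/, "")] ?? 0;

const lineTotal = (li: LineItem, names: NamedValues): number =>
  resolve(li.units, names) * li.rate;

const sum = (xs: number[]): number => xs.reduce((a, b) => a + b, 0);

interface Subaccount { lineItems: LineItem[] }
interface Account { subaccounts: Subaccount[] }
interface Section { accounts: Account[] }
interface Budget { sections: Section[] }

// Totals bubble up: line items -> subaccount -> account -> section -> budget.
const subaccountTotal = (s: Subaccount, n: NamedValues) => sum(s.lineItems.map(li => lineTotal(li, n)));
const accountTotal = (a: Account, n: NamedValues) => sum(a.subaccounts.map(s => subaccountTotal(s, n)));
const sectionTotal = (s: Section, n: NamedValues) => sum(s.accounts.map(a => accountTotal(a, n)));
const budgetTotal = (b: Budget, n: NamedValues) => sum(b.sections.map(s => sectionTotal(s, n)));
```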
The question is how to aggregate this data into documents comprising the budget.
Budgets may be fairly long, say 5,000 line items or more in total. Single accounts may have hundreds of line items.
Users will most likely look at all of the line items for a given account, so it occurred to me
to make individual documents for sections, accounts and subaccounts, and make line items a map within a subaccount.
The main concern I have with this approach is that when the user changes, say, the exchange rate of the currency of a line item, or changes the calculated value of a named value like "build-weeks", I will have to retrieve all the individual line items containing that currency or named value, recalculate the totals, and then bubble up the changes through the hierarchy.
This seems not that complicated if each line item is its own document: I can just search the collection for the presence of the code in question, recalculate the line item, and maybe use a cloud function to bubble up the changes.
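As a rough sketch of that variant (the collection-group name, the "codes" field, and the simplified recalculation are placeholders I am imagining, not an existing schema):

```typescript
import { initializeApp } from "firebase-admin/app";
import { getFirestore } from "firebase-admin/firestore";

initializeApp(); // assumes default credentials are configured
const db = getFirestore();

// Hypothetical layout: one document per line item, stored in "lineItems"
// subcollections, each listing the named values it references in a "codes"
// array field. (A collection-group index on "codes" must be enabled.)
const affected = await db
  .collectionGroup("lineItems")
  .where("codes", "array-contains", "build-weeks")
  .get();

// Recalculate each affected line item; a Cloud Function triggered by these
// writes could then bubble the new totals up through the hierarchy.
// (Batches are limited to 500 writes, so a real job would chunk them.)
const batch = db.batch();
affected.forEach((doc) => {
  const { units, rate } = doc.data(); // simplified: assumes both are already numbers
  batch.update(doc.ref, { total: units * rate });
});
await batch.commit();
```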
But if all the line items are contained in an array of maps within each subaccount document,
it seems like it will be quite tedious to find and change them when necessary.
On the other hand, keeping these documents so small seems like a lot of document reads when somebody is reviewing a budget or, say, printing it. If somebody just clicks on a bunch of accounts, it might be hundreds of reads per click to retrieve all the line items, and hundreds or a thousand writes when somebody changes the value of an often-used named value like "build-weeks".
Or perhaps using Firestore to do these cascading calculations is the wrong approach? Should I just load a single complex budget document into my app, do all the calculations and updates on the client, and then write back the entire budget as a single document when the user presses "save budget"?
Does anybody have any thoughts on the obvious "right" answer to this? Or does it just depend on what I want to optimize for: Firestore costs, responsiveness of the app, or complexity of the code?
From my standpoint, there is no obvious answer to your problem, and indeed it does depend on what you want to optimize for.
However, there are a few points that you need to consider in your decision:
1) Documents in Firestore have a limit of 1 MiB per document;
2) Documents in Firestore have a limit of 20,000 fields;
3) Queries are shallow, so you don't get data from subcollections in the same query.
For considerations 1 and 2, this means that if you choose to design your database as one big document containing everything, then even though you said your app will have lots of data, I doubt it will exceed the limits mentioned; still, do keep them in mind. Also, consider how necessary it is to get all the data at once, as this could create performance and battery/data-usage issues for users (if you are making a mobile app).
For consideration 3, it means that you would have to make many reads if you choose to split the data for your sections into separate documents; this will mean more cost to you but better performance for users.
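To illustrate consideration 3, here is a minimal sketch with a made-up budgets/sections/accounts/subaccounts/lineItems layout; fetching a document never pulls in its subcollections, and every returned subcollection document is a separately billed read:

```typescript
import { initializeApp } from "firebase-admin/app";
import { getFirestore } from "firebase-admin/firestore";

initializeApp(); // assumes default credentials are configured
const db = getFirestore();

// Reading a subaccount document returns only its own fields...
const subaccount = await db
  .doc("budgets/b1/sections/s1/accounts/a1/subaccounts/sa1")
  .get();

// ...its line items live in a subcollection and need a separate query, where
// every returned document counts as one billed read.
const lineItems = await db
  .collection("budgets/b1/sections/s1/accounts/a1/subaccounts/sa1/lineItems")
  .get();

console.log(subaccount.data(), lineItems.size);
```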
To make the right call on this problem, I suggest that you talk to possible users of your solution and understand the problem you are trying to fix and what they expect of the app. Also, it might be interesting to take a look at the How to Structure Your Data and Maps, Arrays and Subcollections videos, as they explain in a more visual way how Firestore behaves, and they could help you anticipate problems that the approach you choose might cause.
Hope I was able to help with these considerations.
I'd like to know what counts as a read on Cloud Firestore.
For example, the app loads an object from a collection along with all of its fields. Does it cost 1 read for the object, or does it cost 10 reads since there are 10 fields in this object (name, image link, description, uuid, createDate, price1, price2, price3, etc.)?
If the answer is 10 (which I suppose it is), there is a possibility to reduce reads by deleting the fields I don't need when using my app (createDate and uuid, for example).
Are there any problems with doing that?
Also, can I group some of the fields together? Let's say price (a string) = "price1/price2/price3", and then in my app I say price1 is the first number of price, price2 is the one in the middle, and so on.
Will this reduce the reads for the prices by a factor of 3?
Thank you very much for these explanations.
Firestore pricing is based on document (object) reads: https://cloud.google.com/firestore/pricing with a minimum charge of one document for every query, even if there are no results.
Since documents contain the key/value pair fields (https://cloud.google.com/firestore/docs/data-model) you should only get charged per document, not per field.
Of course, other costs may come into play, as the documentation notes that larger documents can be slower to retrieve (a cost of latency) and of course larger documents will use more network bandwidth, which can incur a cost in some cases.
There is other guidance on the pricing page about how to reduce costs for large result sets, via the use of cursors, but the costs are still based on documents.
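As a rough illustration of that cursor guidance (collection and field names here are placeholders), paging keeps each query's read count bounded:

```typescript
import { initializeApp } from "firebase-admin/app";
import { getFirestore } from "firebase-admin/firestore";

initializeApp(); // assumes default credentials are configured
const db = getFirestore();

// First page: only 25 documents are read (and billed), however large the collection is.
const firstPage = await db
  .collection("items")     // placeholder collection
  .orderBy("createdAt")    // placeholder field
  .limit(25)
  .get();

// Next page: resume after the last document of the previous page.
// (Guard for an empty first page omitted for brevity.)
const last = firstPage.docs[firstPage.docs.length - 1];
const nextPage = await db
  .collection("items")
  .orderBy("createdAt")
  .startAfter(last)
  .limit(25)
  .get();

console.log(firstPage.size, nextPage.size);
```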
I have an application that runs a MongoDB database. This database will store 5 million documents per user. The web application that uses this database will display 5 thousand of these documents on a page at any given time, a selection of the 5 million. Each document must have a sort order (or rank) such that the web page can allow the user to sort their 5 thousand records by dragging items up/down as they see fit (a sortable list).
I have read articles about how Trello uses a double-precision floating-point number to change the value of the sorted item in a list, but this seems to allow only 50-odd worst-case sorts, so it will not accommodate the large number of items in the user's list. My question is: how do I do this?
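For reference, this is my understanding of the approach those articles describe, and why it tops out around 50 re-sorts in the worst case (a plain sketch, not Trello's actual code):

```typescript
// Midpoint-rank idea: each item stores a floating-point rank, and dropping an
// item between two neighbours assigns it the midpoint of their ranks.
const between = (lower: number, upper: number): number => (lower + upper) / 2;

// Worst case: items keep being dropped into the same gap. Every insertion
// halves the gap, and a 64-bit double has only ~52 bits of mantissa, so the
// midpoint eventually collides with a neighbour.
let lo = 1;
let hi = 2;
let steps = 0;
let mid = between(lo, hi);
while (mid > lo && mid < hi) {
  hi = mid; // simulate always inserting just above `lo`
  steps += 1;
  mid = between(lo, hi);
}
console.log(`gap exhausted after ${steps} insertions`); // ~52 with doubles
```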
In our application we have to display the most popular articles, but the same strategy could be required for trending, hot, relevant, etc. data.
We have 2 collections [Articles] and [Comments] where articles have multiple comments.
We want to display most popular [Articles].
Popularity is computed with some complex formula, but let's assume that the formula sums an article's total number of views and its total [Comments] count. We assume that when the formula computes the popularity of one article, it also takes all other [Articles] into account so that the article gets a rank among them.
As you can see users are constantly increasing views and adding more comments. Every day different articles can be among the most popular ones.
The problem is as follows: how do we display up-to-date data without spamming the database with queries?
I was thinking about a scheduled cron job (in our backend app) that would update [Article] popularity, e.g. every hour, and then save it in the article itself. This way, when users visit the page, nothing would have to be counted and we could just work on the saved data.
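Roughly what I have in mind for that job, using the simplified views-plus-comments formula (collection and field names such as "articleId" are just illustrative):

```typescript
import { MongoClient } from "mongodb";

const db = new MongoClient("mongodb://localhost:27017").db("cms");

// Comment totals per article, computed once per run instead of per page view.
const commentCounts = await db
  .collection("comments")
  .aggregate<{ _id: unknown; count: number }>([
    { $group: { _id: "$articleId", count: { $sum: 1 } } },
  ])
  .toArray();

const counts = new Map<string, number>();
for (const c of commentCounts) counts.set(String(c._id), c.count);

// Store the computed popularity on each article (a real job might use bulkWrite).
for await (const article of db.collection("articles").find({}, { projection: { views: 1 } })) {
  const popularity = (article.views ?? 0) + (counts.get(String(article._id)) ?? 0);
  await db.collection("articles").updateOne({ _id: article._id }, { $set: { popularity } });
}

// Serving the "most popular" page is then a cheap indexed read.
await db.collection("articles").createIndex({ popularity: -1 });
const topTen = await db
  .collection("articles")
  .find()
  .sort({ popularity: -1 })
  .limit(10)
  .toArray();
```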
There might also be a possibility to build a query that is fast enough to count popularity on demand, but I don't know if that's feasible.
What would be the best strategy: compute the data in the background and keep it up to date, build an advanced query, or something different?
I am new to MongoDB and I have difficulties implementing a solution in it.
Consider a case where I have two collections, Client and Sales, with the following designs:
Client
==========
id
full name
mobile
gender
region
emp_status
occupation
religion
Sales
===========
id
client_id //this would be a DBRef
trans_date //date time value
products //an array of subdocuments, one per product sold, in the form {product_code, description, units, unit_price, amount}
total sales
Now there is a requirement to develop another collection for analytical queries where the following questions can be answered
What is the distribution of sales by gender, region and emp_status?
What are the most purchased products for clients in a particular region?
I considered implementing a very denormalized design: a flat, wide collection combining the properties of the Sales and Client collections, so that I can use map-reduce to answer the questions.
In an RDBMS, an aggregation backed by a join would answer these questions, but I am at a loss as to how to make Map-Reduce or Aggregation help out.
Questions:
How do I implement Map-Reduce to map across 2 collections?
Is it possible to chain MapReduce operations?
Regards.
MongoDB does not do JOINs - period!
MapReduce always runs on a single collection. You cannot have a single MapReduce job that selects from more than one collection. The same applies to aggregation.
When you want to do some data mining (not MongoDB's strongest suit), you could create a denormalized collection of all Sales with the corresponding Client object embedded. You will have to write a little program or script (sketched below) which iterates over all clients and
finds all Sales documents for the client,
merges the relevant fields from Client into each document, and
inserts the resulting document into the new collection.
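A rough sketch of such a script; the target collection name and the flattened field names are arbitrary choices, so adjust to taste:

```typescript
import { MongoClient } from "mongodb";

const db = new MongoClient("mongodb://localhost:27017").db("crm");

// One-off denormalization pass over all clients.
for await (const clientDoc of db.collection("clients").find()) {
  // 1. find all Sales documents for the client
  //    (the question stores client_id as a DBRef; in that case match on
  //     { "client_id.$id": clientDoc._id } instead of the plain id below)
  const sales = db.collection("sales").find({ client_id: clientDoc._id });

  for await (const sale of sales) {
    // 2. merge the relevant fields from Client into each document
    const { _id, ...saleFields } = sale; // drop the old _id so the copy gets a fresh one

    // 3. insert the resulting document into the new collection
    await db.collection("sales_flat").insertOne({
      ...saleFields,
      client_gender: clientDoc.gender,
      client_region: clientDoc.region,
      client_emp_status: clientDoc.emp_status,
    });
  }
}
```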
When your Client document is small and doesn't change often, you might consider always embedding it into each Sales document. This means that you will have redundant data, which looks very evil from the viewpoint of a seasoned RDB veteran. But remember that MongoDB is not a relational database, so you should not apply all RDBMS dogmas without reflection. The "no redundancy" rule of database normalization is only practicable when JOINs are relatively inexpensive and painless, which isn't the case with MongoDB.
Besides, sometimes you might want redundancy to ensure data persistence. When you want to know the historical development of sales by region, you want to know the region where the customer resided when they bought the product, not where they reside now. When each Sale only references the current Client document, that information is lost. Sure, you could solve this with separate Address documents that have date ranges, but that would make things even more complicated.
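With the client embedded in each Sales document (hypothetical field layout below), both of the original questions become ordinary single-collection aggregations:

```typescript
import { MongoClient } from "mongodb";

const db = new MongoClient("mongodb://localhost:27017").db("crm");
const sales = db.collection("sales");

// Assumes each sale embeds a client subdocument such as
// { client: { gender, region, emp_status }, products: [...], total_sales: ... }.

// 1. Distribution of sales by gender, region and emp_status:
const distribution = await sales.aggregate([
  {
    $group: {
      _id: { gender: "$client.gender", region: "$client.region", emp_status: "$client.emp_status" },
      salesTotal: { $sum: "$total_sales" },
      count: { $sum: 1 },
    },
  },
]).toArray();

// 2. Most purchased products for clients in a particular region:
const topProducts = await sales.aggregate([
  { $match: { "client.region": "some-region" } }, // placeholder region value
  { $unwind: "$products" },
  { $group: { _id: "$products.product_code", unitsSold: { $sum: "$products.units" } } },
  { $sort: { unitsSold: -1 } },
  { $limit: 10 },
]).toArray();

console.log(distribution, topProducts);
```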
Another option would be to embed an array of Sales in each Client. However, MongoDB doesn't like documents which grow over time, so when your clients tend to return often, this might result in sub-par write-performance.