On our Facebook Application's Diagnostics page, there is a message for two different queries saying that our "Calls Access Too Much Data". Does anyone know exactly what this means?
Specifically, does this mean that our average request size per call is too large, or that we are accessing too much data in aggregate? We make a great effort to batch queries together as much as possible (both for Graph API and FQL queries), so that we can limit our total number of queries per user. However, if these two calls are really "accessing too much data", does that mean we should be less aggressive about batching these queries?
It looks like some of your queries are trying to pull too much information in a single call. If you break them down into smaller batches, you should see these errors decrease.
I'll look into getting the documentation around the limits made a little clearer.
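For what it's worth, "smaller batches" usually just means splitting one large request into several. A minimal sketch (the chunk size of 20 is an assumption, not a documented Facebook limit, and fetch_statuses is a hypothetical helper):

```python
# Hedged sketch: split one large batched request into smaller ones.
# The chunk size is an assumption, not a documented Facebook limit.
def chunked(ids, size=20):
    """Yield successive slices of ids so each API call asks for less data."""
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

# Instead of one call covering every friend ID, issue one call per chunk:
# for chunk in chunked(friend_ids):
#     fetch_statuses(chunk)   # hypothetical helper wrapping your Graph/FQL call
```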
I want to ask about lists and pagination in APIs.
I want to build a long list on the home screen. That means this request will get a lot of traffic, since it's the main screen, and I want to build it in a way that handles the traffic well.
After searching for how to implement it, I have a few questions.
Can I depend on PostgreSQL for pagination, or do I need to use a search engine like Solr?
If I depend on the database and users start visiting the app, this request is going to submit a lot of queries to the database. Is this going to kill the database?
Also, I'm using Redis to cache some data, and that will handle some of the traffic, but the problem with the home screen is that the response is too large and I can't cache all of it under one key in Redis.
Can anyone explain to me the best way to implement pagination for this request? The only thing I want is pagination; I'm not looking to implement full-text search, but to handle the traffic. I read that a search engine would handle it so the database isn't affected or killed.
Thanks a lot :D
You can do this seamlessly with the pagination features PostgreSQL already provides; it has all the functions and capabilities you need (LIMIT, OFFSET, FETCH).
But let me give you a recommendation.
There are several types of pagination.
The first type is one where the count of pages must be known in advance. This technique is outdated and is not recommended, because it requires knowing the count of records in the table, and counting the records is a very slow operation, especially on large tables.
The second type is one where the number of pages is not known in advance. Data for the next page is fetched in parts, only when needed. This is what Google, LinkedIn, and other big companies use. In this case there is no need to count the rows of any table.
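As a rough illustration of this second type (keyset pagination), here is a minimal sketch using psycopg2; the posts table, its indexed id column, and the page size are assumptions for the example, not anything prescribed by PostgreSQL:

```python
# Keyset pagination sketch: the client sends the last id it has seen,
# so no OFFSET and no COUNT are ever needed.
import psycopg2

PAGE_SIZE = 20  # assumed page size

def fetch_page(conn, last_seen_id=None):
    """Return the next page of posts, newest first."""
    with conn.cursor() as cur:
        if last_seen_id is None:
            cur.execute(
                "SELECT id, title FROM posts ORDER BY id DESC LIMIT %s",
                (PAGE_SIZE,),
            )
        else:
            cur.execute(
                "SELECT id, title FROM posts WHERE id < %s "
                "ORDER BY id DESC LIMIT %s",
                (last_seen_id, PAGE_SIZE),
            )
        return cur.fetchall()

# Usage, assuming an open connection `conn`:
# first_page = fetch_page(conn)
# next_page = fetch_page(conn, last_seen_id=first_page[-1][0])
```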
I know that most of today's REST APIs paginate their responses to web calls.
But I don't see any insight on the web about how to select the ideal size of a batch returned by an API call: should it be 10, 100, 1000? In short: what factors should we base the size of an API response on?
Some people state that it should be based on the number of elements displayed by the UI. I don't agree with this, as not all APIs are directly linked to a UI, and, in any case, modern REST APIs let the caller choose the number of items in the output batch with a configurable parameter, up to a certain limit.
So, how could we define the value for this "maximum number of elements returned by an HTTP request"?
Should it be based on the payload size? Internal architecture of the API? Should it be based on performance measurement?
Any insight on this? I'm not really looking for an explicit figure, but rather for some techniques that could help find the answer. Ideally, I'd like to know the process followed by some successful APIs, but I cannot find any.
My general approach to this is to:
Try to avoid paging
Make REST clients resilient against paging changes by using next and previous links, instead of a ?page= attribute. A REST client knows another page is available strictly by the presence of the link.
I don't use hard rules or measurements to figure out when paging is needed. Paging is generally painful, so my first approach would be to try to figure out what requirements drives the need for paging, and if that requirement can be removed.
Once I've determined it's not possible to remove this requirement another way, I would set the page cut-off as large as is reasonable, to reduce the likelihood that clients need to make additional requests.
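A minimal sketch of what the next-link approach might look like on the server side (the field names and cursor layout are assumptions, not a standard):

```python
# Hedged sketch: wrap a page of items with hypermedia links instead of a
# ?page= number, so clients only follow links the server provides.
def build_page(base_url, items, page_size, cursor=None):
    """Return a response body with the items and their paging links."""
    links = {"self": f"{base_url}?cursor={cursor}" if cursor else base_url}
    if len(items) == page_size:
        # A full page suggests more data; expose the next page as a link
        # keyed on the last item's id (assumed field) rather than a page number.
        links["next"] = f"{base_url}?cursor={items[-1]['id']}"
    return {"items": items, "_links": links}
```

A client then checks for the presence of the next link rather than computing page numbers itself.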
If it's a closed API (used only by clients you control), pick whatever the UI wants. It's trivial to change. If clients can choose from several options, you can include a pageSize parameter. Or, better..
If it's an open API (used by clients you don't control), then let clients control what size paging they want. Rather than support a pageNumber parameter, support offset and limit. Offset is how many records to skip before starting to return records, and limit is the maximum number of records to return. If the client is not happy with how their request performs, they can adjust the parameters to suit their needs. Any upper limit your API has should be driven by what your service can handle. It's neither possible nor desirable for you to try to figure out the Magic Maximum Page Size that makes all clients happy and no clients sad.
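For illustration, a hedged sketch of how a server might read and clamp those parameters (the names and the 500-row cap are assumptions, driven only by what the service can handle):

```python
# Hedged sketch: parse offset/limit query parameters and clamp them to
# bounds the service can tolerate. The cap is an assumption for the example.
MAX_LIMIT = 500
DEFAULT_LIMIT = 50

def parse_paging(params):
    """Return (offset, limit) clamped to safe bounds."""
    offset = max(int(params.get("offset", 0)), 0)
    limit = int(params.get("limit", DEFAULT_LIMIT))
    limit = min(max(limit, 1), MAX_LIMIT)
    return offset, limit
```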
Also, please note that none of this has anything to do with ReST, which is silent when it comes to paging.
I usually make a rough performance measurement by hand. I want as few requests as possible, but I do not want to risk timeouts.
I am going to build a page that is designed to be "viewed" a lot, but far fewer users will "write" to the database. For example, only 1 in 100 users may post news on my site, and the rest will just read it.
In the above case, the SAME QUERY will be performed 100 times when users visit my homepage, while the actual database changes very little. Actually, 99 of those queries are a waste of computing power. Are there any methods that can cache the result of the first query and, when the same query is detected within a short time, deliver the cached result?
I use MongoDB and Tornado. However, some posts say that the MongoDB does not do caching.
Serving a static, cached HTML page with something like Nginx is not preferred, because I want Tornado to render a personalized page each time.
I use MongoDB and Tornado. However, some posts say that the MongoDB does not do caching.
I don't know who said that, but MongoDB does have a way to cache queries; in fact, it relies on the OS's LRU page cache, since it does not do memory management itself.
As long as your working set fits into that LRU cache without the OS having to page it out or swap constantly, you should be reading this query from memory most of the time. So yes, MongoDB can cache, but technically it doesn't; the OS does.
Actually, 99 of those queries are a waste of computing power.
Caching mechanisms to solve these kinds of problems are much the same across most technologies, whether MongoDB or SQL. Of course, this only matters if it actually is a problem; if you ask me, you are probably micro-optimising unless you get Facebook, Google, or YouTube levels of traffic.
Caching is a huge subject in itself, ranging from caching query results in pre-aggregated MongoDB collections, Memcached, Redis, etc., to caching HTML and other web resources so that the server does as little work as possible.
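If you did go the Redis route mentioned above, a minimal sketch might look like this (the collection name, cache key, and 30-second TTL are assumptions for the example):

```python
# Hedged sketch: serve the homepage query from Redis when possible,
# falling back to MongoDB and refreshing the cache on a miss.
import json

import redis
from pymongo import MongoClient

r = redis.Redis()
db = MongoClient()["mydb"]

def latest_news(limit=20):
    """Return the newest news documents, cached for a short TTL."""
    cached = r.get("latest_news")
    if cached is not None:
        return json.loads(cached)
    docs = list(
        db.news.find({}, {"_id": 0}).sort("created_at", -1).limit(limit)
    )
    r.setex("latest_news", 30, json.dumps(docs, default=str))
    return docs
```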
Personally, as I said, your scenario sounds as though you are thinking about the wasted computing power the wrong way. Even if you were to cache this query in another collection or technology, you would probably spend about the same amount of power and resources retrieving the result from it as if you just didn't bother. However, that assumption comes down to you having the right indexes, schema, set-up, etc.
I recommend you read some links on good schema design and index creation:
http://docs.mongodb.org/manual/core/indexes/
https://docs.mongodb.com/manual/core/data-model-operations/#large-number-of-collections
Serving a static, cached HTML page with something like Nginx is not preferred, because I want Tornado to render a personalized page each time.
Yeah, I think that by worrying about query caching you are prematurely optimising, especially when you don't want to take off what would be 90% of the load on your server each time: serving the page itself.
I would focus on your schema and indexes and then worry about caching if you really need it.
The author of the Motor (MOngo + TORnado) package gives an example of caching his list of categories here: http://emptysquare.net/blog/refactoring-tornado-code-with-gen-engine/
Basically, he defines a global list of categories and queries the database to fill it in; then, whenever he needs the categories in his pages, he checks the list: if it exists, he uses it; if not, he queries again and fills it in. He has it set up to invalidate the list whenever he inserts into the database, but depending on your usage you could create a global timeout variable to keep track of when you need to re-query next. If you're doing something complicated, this could get out of hand, but if it's just a list of the most recent posts or something, I think it would be fine.
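A rough sketch of that global-cache-with-timeout idea, assuming Motor's asyncio client (the names, collection, and 60-second timeout are illustrative, not taken from the linked post):

```python
# Hedged sketch: keep the category list in a module-level variable and
# re-query MongoDB only when the cached copy is older than the timeout.
import time

from motor.motor_asyncio import AsyncIOMotorClient

db = AsyncIOMotorClient()["mydb"]

_categories = None      # cached list of category documents
_cached_at = 0.0        # when the cache was last filled
CATEGORY_TTL = 60       # seconds before we re-query (assumed value)

async def get_categories():
    """Return the cached categories, refreshing from MongoDB when stale."""
    global _categories, _cached_at
    if _categories is None or time.time() - _cached_at > CATEGORY_TTL:
        _categories = await db.categories.find().to_list(length=None)
        _cached_at = time.time()
    return _categories
```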
I'm looking to re-code an application to better handle spikes in tweets. I'm moving to Heroku and MongoDB (either MongoLab or MongoHQ) for the database solution.
During certain news events, tweet volume might spike to 15,000 per second. Typically, with each tweet, I parse it and store various pieces of data, such as user data. My idea is to store the raw tweets in a separate collection and have a separate process grab raw tweets and parse them. The goal here is that when there is a massive spike in tweets, my application isn't trying to parse all of them immediately, but is essentially backlogging the raw tweets in another collection. As the volume slows, the process can work through the backlog over time.
My question is threefold:
Can MongoDB handle this type of volume with regards to inserts into a collection at a rate of 15,000 tweets per second?
Any idea on the better setup: MongoHQ or MongoLab?
Any feedback on the overall setup?
Thanks!
The write volume it will handle depends on lots of factors - hardware, indexes, size of each document, etc. Your best bet is to test it in the environment you're planning to use. If the demands of the write load exceed the capacity of a single mongo server, you can always use multiple shards.
They are very similar, but there are some differences in pricing and the actual site design has a bunch of differences. There's a thread of discussion about it here: https://webmasters.stackexchange.com/questions/20782/mongodb-hosting-mongolab-vs-mongohq-vs-mongomachine
Overall it seems to make sense. It sounds like you will probably want to flesh out some details about how you will process the backlog: will you be polling it by querying periodically, deleting tweets from the backlog as they are processed, etc.?
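For example, a hedged sketch of one way that backlog process could poll and drain the raw collection (collection and field names are assumptions):

```python
# Hedged sketch: a worker that claims one raw tweet at a time, parses it,
# and stores the result. find_one_and_delete is atomic, so several workers
# can run side by side without double-processing a document.
import time

from pymongo import MongoClient

db = MongoClient()["tweets_app"]

def drain_backlog(parse):
    """Poll the raw_tweets collection, parse each document, store the result."""
    while True:
        raw = db.raw_tweets.find_one_and_delete({}, sort=[("_id", 1)])
        if raw is None:
            time.sleep(1)      # backlog is empty; wait before polling again
            continue
        parsed = parse(raw)    # extract user data, entities, etc.
        db.parsed_tweets.insert_one(parsed)
```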
Completely agree on the need to test this. In general, mongo can handle that many writes, but in practice it depends on the size of your set up, other operations, indexes, etc.
I had to do a similar approach for collecting tons of metrics data. I used a lightweight event-machine process to accept incoming requests in parallel, and store them in a simple format, then another process would take those requests and send them up to a central server. The main goal was to make sure no data was lost if the central server was down, but it also allowed me to put in some throttling logic so that the spikes in data wouldn't overwhelm the system.
I'd be interested to see how this works out for you price-wise, vs. a VPS like Linode. (I'm a huge Heroku fan, but with certain architectures it can get pricey quickly.)
I'd like to get every status update for every friend. Given I have say 500 friends, each with 200 statuses, this could be 100,000 statuses. How would you approach this from the query point of view?
What query would you write? Would Facebook allow this much data to come through in a single go? If not is there a best practice paging or offsetting solution?
Would Facebook allow this much data to come through in a single go?
No. Facebook will throw an exception for too much data. There is also an automated system in place that will block time-consuming requests, and it will block your app if it makes too many queries too frequently on a single table - API Throttling Warnings.
If not is there a best practice paging or offsetting solution?
You can do paging in FQL and when querying connections in the Graph API; it is the best practice.
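For the Graph API side, a minimal sketch of following the paging links it returns (the endpoint, the page size, and the use of the requests library are illustrative; an access token is assumed):

```python
# Hedged sketch: walk a paginated Graph API connection by following the
# paging.next URL in each response instead of computing offsets.
import requests

def fetch_all_statuses(access_token, page_size=100):
    """Yield statuses page by page until no next link is returned."""
    url = "https://graph.facebook.com/me/statuses"
    params = {"access_token": access_token, "limit": page_size}
    while url:
        data = requests.get(url, params=params).json()
        for status in data.get("data", []):
            yield status
        # The next-page URL already carries the token and cursor parameters.
        url = data.get("paging", {}).get("next")
        params = {}
```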
From their policy:
If you exceed, or plan to exceed, any of the following thresholds please contact us as you may be subject to additional terms: (>5M MAU) or (>100M API calls per day) or (>50M impressions per day).
http://developers.facebook.com/policy/
It means that 100k is not such a big deal. However, it depends. You may have to consider:
Do you REALLY need every status?
Can't they be downloaded later?
Do you need these posts/stories from every friend?