What is the optimal way to do server side paging in expressjs with mongoose - mongodb

I'm currently doing a project with my own MEAN stack.
Now in a new project I'm creating I've got a collection that I'm paging with Express on the server side, returning one page of results every time (e.g. 10 results out of the total 2000) and the total number of rows found for the query the user performed (e.g. 193 for UserID 3).
Although this works fine, I'm afraid that this will create an enormous load on the server since a user can easily pull 50-60 pages a session with 10, 20, 50 or even 100 results each.
My question to you guys is: if I have say 1000 concurrent users paging every few seconds like this, will MongoDB be able to cope with this? If not, what might be my alternatives here?
Also, is there any way I can simulate such concurrent read tests on my app/MongoDB?
Please take into account that I must do server-side paging because the app will be quite dynamic and information can change very often.

If you're planning on only using a single webserver, you could cache the result set belonging to a certain page in memory. If you're planning on using multiple webservers, caching in-memory would lead to different result sets across servers, so in that case I'd recommend storing your cache either in MongoDB or in Redis.
A certain result set would be stored under a certain key in your cache. Your key would probably be composed of something like entityName + filterOptions + offset + resultsLimit. So for example you're loading movies with title=titanic, skipping the first 100, so offset=100 and loading only 50 per page so limit=50, which would all be concatenated into a single key.
When a request comes in, you would first try to load the result set from the cache. If the result set is inside the cache, you'll return that to the client. If it's not in the cache, you'd query the database for the latest result set, put that in the cache and return it to the client.
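The lookup flow described above can be sketched roughly as follows. This is a minimal single-server, in-memory illustration, not a definitive implementation: the names PageQuery, buildCacheKey, PageCache and getPage are made up for this example, loadFromDb stands in for whatever mongoose query you actually run, and the short expiry time is an added assumption so that frequently changing data does not stay stale for long.

```typescript
// Illustrative cache-aside sketch; all names here are hypothetical.
interface PageQuery {
  entityName: string;
  filter: Record<string, string | number>;
  offset: number;
  limit: number;
}

// Build the key from entityName + filterOptions + offset + resultsLimit.
// Filter keys are sorted so {a, b} and {b, a} map to the same key.
function buildCacheKey(q: PageQuery): string {
  const filterPart = Object.keys(q.filter)
    .sort()
    .map((k) => `${k}=${q.filter[k]}`)
    .join("&");
  return [q.entityName, filterPart, q.offset, q.limit].join("|");
}

class PageCache<T> {
  private readonly entries = new Map<string, { value: T; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs; // assumption: entries expire after a fixed time
  }

  get(key: string, now: number = Date.now()): T | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}

// Try the cache first; on a miss, query the database, store the result, return it.
async function getPage<T>(
  cache: PageCache<T>,
  query: PageQuery,
  loadFromDb: (q: PageQuery) => Promise<T>, // placeholder for the real query
): Promise<T> {
  const key = buildCacheKey(query);
  const cached = cache.get(key);
  if (cached !== undefined) return cached;
  const fresh = await loadFromDb(query);
  cache.set(key, fresh);
  return fresh;
}
```

For the movies example, buildCacheKey({ entityName: "movies", filter: { title: "titanic" }, offset: 100, limit: 50 }) yields "movies|title=titanic|100|50". With multiple webservers the Map would be replaced by a shared store such as Redis, but the key scheme and the miss-then-fill flow stay the same.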
Whether or not you could pull it off with 1000 concurrent users depends a lot on your hardware, the data you are loading, how you're loading it and the efficiency of your implementation. There's one way to find out, and that's testing.
Of course by using the asynchronous capabilities of Node.js you can achieve the best scalability, so every call that can be executed async, such as database calls, should definitely be executed asynchronously.
You could load test your application for free from your local computer using Apache JMeter or let it be tested using for example Azure.

Firestore ignore limit (on flutter)

I have a simple collection, and to test, I have created 10k documents in this collection.
After that, I run a simple query with limit(5):
Firestore.instance.collection(myCollection).orderBy(myOrderBy).limit(5).getDocuments();
And I see this in my console:
W/CursorWindow(21291): Window is full: requested allocation 253420 bytes, free space 68329 bytes, window size 2097152 bytes
I/zygote64(21291): Background concurrent copying GC freed 535155(13MB) AllocSpace objects, 5(1240KB) LOS objects, 50% free, 17MB/35MB, paused 60us total 102.836ms
When I go to my Firebase dashboard I see that I have 10k reads.
So I conclude that my query returns 5 results, but that it reads the entire collection, which can quickly decrease performance and increase the price.
I looked for a solution and found this:
Firestore.instance.settings(persistenceEnabled: false);
It seems to be working, but I have trouble understanding why.
Does Firestore load the entire collection by default so that it can serve queries offline?
Would changing the Firestore settings when the application launches be enough, or am I likely to run into surprises?
And if I disable persistence, I assume that a write made while the user is offline will no longer be persisted once they are back online. Is a compromise possible?
Thanks,
Firestore's offline storage behaves as a cache, persisting any documents it has recently seen. It does not pre-load documents you haven't told it to load with a query/read operation, so in the query you show that would be at most 5 documents for each time you execute the query.
Did you add the 10K documents from the same client where you are running the query by any chance? If so, the local cache of that client may/would contain all those documents, since the client added them. You'll want to uninstall/reinstall the client to wipe the cache in that case, to get a more realistic experience of what your users would get.
The fact that you see 10K reads in your usage tab is a separate issue, not explained by the code you shared. One thing to keep in mind is that documents loaded in the console are also charged as reads.

MongoDB documents of calculated values for a dashboard vs re-retrieving on each web page view?

If I have a page in a web app that displays some dashboard type statistics about documents in my database (counts, docs created per hour, per day etc), is it best to pre-calculate this data and store it in a separate document (and update as needed), or assuming the collections have appropriate indexes, would it be appropriate to execute queries to retrieve these statistics on every load of the page?
It's not necessary that the data has to be exactly up to date on every page hit/load, so that's why I was thinking to maintain the data I need to display in a separate document that can be retrieved on page hit (or even cached and only re-retrieved every 5 minutes or similar).
That's pretty broad, and I have the feeling you have already identified the key points. Generally speaking, you should consider these questions:
Do you need to allow users to apply filters? Complex filters usually make pre-aggregation impossible.
Related: Is it likely that the exact same data is ever queried again? If not, pre-aggregation might need to happen on different levels of granularity (e.g. by creating day / week / month totals and summing these, instead of individual events).
What is the ratio of reads to writes on the data? If the number of writes is small, it might be OK to keep counters in real-time, instead of using read-caching.
What are your performance requirements for cached and uncached queries? Getting fast cached queries is trivial, but comes at the cost of stale data. Making uncached queries faster is more tricky and usually requires something like the multi-level approach discussed before - it often doesn't help if old data comes super fast, but new queries take minutes.
Caching works especially well if the data can't be changed later (or is seldom changed), and the queries remain the same with a certain chance of recurring. A nice example is Facebook's profiles, where past years are apparently cached for every visitor-profile combination. First accesses are slow, however...
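To make the multi-level pre-aggregation idea a bit more concrete, here is a rough in-memory sketch: events are counted into hourly buckets as they arrive, hourly buckets are rolled up into daily totals, and a dashboard query sums daily totals instead of scanning individual events. The function names and the string bucket keys are assumptions made for this illustration; in MongoDB the buckets would typically live in their own documents.

```typescript
// Illustrative sketch; names and key formats are hypothetical.
// Hourly buckets are keyed by UTC hour, e.g. "2024-01-05T13".
type HourlyCounts = Map<string, number>;

// Increment the bucket for the hour the event falls into.
function recordEvent(hourly: HourlyCounts, timestamp: Date): void {
  const hourKey = timestamp.toISOString().slice(0, 13); // "YYYY-MM-DDTHH"
  hourly.set(hourKey, (hourly.get(hourKey) ?? 0) + 1);
}

// Roll hourly buckets up into daily totals keyed by "YYYY-MM-DD".
function rollUpToDays(hourly: HourlyCounts): Map<string, number> {
  const daily = new Map<string, number>();
  hourly.forEach((count, hourKey) => {
    const dayKey = hourKey.slice(0, 10);
    daily.set(dayKey, (daily.get(dayKey) ?? 0) + count);
  });
  return daily;
}

// Answer a date-range query by summing pre-computed daily totals
// (both bounds inclusive; ISO date strings compare correctly as text).
function totalForRange(daily: Map<string, number>, fromDay: string, toDay: string): number {
  let total = 0;
  daily.forEach((count, day) => {
    if (day >= fromDay && day <= toDay) total += count;
  });
  return total;
}
```

The trade-off is the one described above: reads become cheap sums over a handful of buckets, but arbitrary filters that were not anticipated when the buckets were designed still need a query against the raw documents.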

Lucene searches are slow via AzureDirectory

I'm having trouble understanding the complexities of Lucene. Any help would be appreciated.
We're using a Windows Azure blob to store our Lucene index, with Lucene.Net and AzureDirectory. A WorkerRole contains the only IndexWriter, and it adds 20,000 or more records a day, and changes a small number (fewer than 100) of the existing documents. A WebRole on a different box is set up to take two snapshots of the index (into another AzureDirectory), alternating between the two, and telling the WebService which directory to use as it becomes available.
The WebService has two IndexSearchers that alternate, reloading as the next snapshot is ready--one IndexSearcher is supposed to handle all client requests at a time (until the newer snapshot is ready). The IndexSearcher sometimes takes a long time (minutes) to instantiate, and other times it's very fast (a few seconds). Since the directory is physically on disk already (not using the blob at this stage), we expected it to be a fast operation, so this is one confusing point.
We're currently up around 8 million records. The Lucene search used to be so fast (it was great), but now it's very slow. To try to improve this, we've started to IndexWriter.Optimize the index once a day after we back it up--some resources online indicated that Optimize is not required for often-changing indexes, but other resources indicate that optimization is required, so we're not sure.
The big problem is that whenever our web site has more traffic than a single user, we're getting timeouts on the Lucene search. We're trying to figure out if there's a bottleneck at the IndexSearcher object. It's supposed to be thread-safe, but it seems like something is blocking the requests so that only a single search is performed at a time. The box is an Azure VM, set to a Medium size so it has lots of resources available.
Thanks for whatever insight you can provide. Obviously, I can provide more detail if you have any further questions, but I think this is a good start.
I have much larger indexes and have not run into these issues (~100 million records).
Put the indexes in memory if you can (8 million records sounds like it should fit into memory, depending on the number of analyzed fields etc.). You can use a RAMDirectory as the cache directory.
IndexSearcher is thread-safe and supposed to be re-used, but I am not sure if that is the reality. In Lucene 3.5 (Java version) they have a SearcherManager class that manages multiple threads for you.
http://java.dzone.com/news/lucenes-searchermanager
Also, a non-Lucene point: if you are on an extra-large+ VM, make sure you are taking advantage of all of the cores. Especially if you have a Web API/ASP.NET front-end for it, those calls should all be asynchronous.

CacheManager GetData performance issue

We are using Enterprise Library 5 and the CacheManager that it provides in our web application. Everything seems to be working fine up to the point where we start a heavy load test on the application.
We are caching records from the database using a key based on their ID. We do not always request a single item from the cache; sometimes we need to get a list of items. For this we have a LINQ query that does a Select(e => CacheManager.GetData(id_from_list)) and returns the list of items from the cache. Most of the time this works fine, but under heavy load the GetData method becomes a bottleneck because the cache manager locks on both read and write operations. Basically, only one thread can read data from the cache at a time. We did create several cache managers based on the type of the items. This allows several threads to get data from different cache managers, but the issue remains when heavy load hits the application (one bottleneck per cache manager). It did improve the application up to a point, but not enough.
Has anyone else encountered the same problem, and did you find a way to overcome it?
NOTE: We tried to actually cache lists of items and compose the key from the ids of the items in the list. This actually solved the problem and CacheManager.GetData is not a bottleneck anymore ... BUT ... obviously this is not a good solution, as we could have each item thousands of times in the cache in a lot of lists.
You may consider adapting the CacheManager to use a read/write lock (which I think is much more suitable for this situation) instead of the exclusive locking that it uses now.
http://msdn.microsoft.com/en-us/library/system.threading.readerwriterlock.aspx
Basically, a read/write lock is appropriate when multiple reader threads need simultaneous access to the data, and only the occurrence of a write will cause incoming readers to block.
These have other problems when put under load, however, such as write starvation. Depending on the read/write lock implementation a write will always wait for all reads to finish first - with a constant stream of reads a write will never have a chance to happen.
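To illustrate the semantics (many concurrent readers, exclusive writers) and one common way to avoid write starvation, here is a small promise-based sketch in TypeScript. It is a conceptual illustration only, not Enterprise Library code and not the .NET ReaderWriterLock API; the class and member names are invented for this example. It uses writer preference: once a writer is waiting, newly arriving readers queue behind it instead of joining the active readers.

```typescript
// Illustrative writer-preferring read/write lock; all names are hypothetical.
class ReadWriteLock {
  private activeReaders = 0;
  private writerActive = false;
  private readonly waitingReaders: Array<() => void> = [];
  private readonly waitingWriters: Array<() => void> = [];

  get readerCount(): number { return this.activeReaders; }
  get isWriting(): boolean { return this.writerActive; }

  // Readers enter immediately unless a writer is active or already waiting.
  acquireRead(): Promise<void> {
    if (!this.writerActive && this.waitingWriters.length === 0) {
      this.activeReaders++;
      return Promise.resolve();
    }
    return new Promise<void>((resolve) => this.waitingReaders.push(resolve));
  }

  releaseRead(): void {
    this.activeReaders--;
    if (this.activeReaders === 0) this.wakeNext();
  }

  // A writer needs exclusive access: no active readers and no active writer.
  acquireWrite(): Promise<void> {
    if (!this.writerActive && this.activeReaders === 0) {
      this.writerActive = true;
      return Promise.resolve();
    }
    return new Promise<void>((resolve) => this.waitingWriters.push(resolve));
  }

  releaseWrite(): void {
    this.writerActive = false;
    this.wakeNext();
  }

  // Hand the lock to the next waiting writer if there is one;
  // otherwise admit every queued reader at once.
  private wakeNext(): void {
    const writer = this.waitingWriters.shift();
    if (writer !== undefined) {
      this.writerActive = true;
      writer();
      return;
    }
    while (this.waitingReaders.length > 0) {
      this.activeReaders++;
      this.waitingReaders.shift()!();
    }
  }
}
```

Writer preference moves the starvation risk to the other side: under a constant stream of writes, readers can wait indefinitely. For a cache that is read far more often than it is written, that is usually the more acceptable trade-off.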

Entity Framework Code First - Reducing round trips with .Load() and .Local

I'm setting up a new application using Entity Framework Code First and I'm looking at ways to reduce the number of round trips to the SQL Server as much as possible.
When I first read about the .Local property here I got excited about the possibility of bringing down entire object graphs early in my processing pipeline and then using .Local later without ever having to worry about incurring the cost of extra round trips.
Now that I'm playing around with it I'm wondering if there is any way to pull down all the data I need for a single request in one round trip. Say, for example, I have a web page that has a few lists on it: news, events and discussions. Is there a way that I can pull the records from their 3 unrelated source tables into the DbContext in one single round trip? Do you all out there on the interweb think it's perfectly fine when a single page makes 20 round trips to the db server? I suppose with a proper caching mechanism in place this issue could be mitigated.
I did run across a couple of cracks at returning multiple results from EF queries in one round trip but I'm not sure the complexity and maturity of these kinds of solutions is worth the payoff.
In general in terms of composing datasets to be passed to MVC controllers do you think that it's best to simply make a separate query for each set of records you need and then worry about much of the performance later in the caching layer using either the EF Caching Provider or asp.net caching?
It is completely OK to make several DB calls if you need them. If you are afraid of multiple round trips, you can either write a stored procedure that returns multiple result sets (this doesn't work with default EF features) or execute your queries asynchronously (run multiple independent queries at the same time). Loading unrelated data with a single LINQ query is not possible.
Just one more note. If you decide to use the asynchronous approach, make sure that you use a separate context instance in each asynchronous execution. Asynchronous execution uses a separate thread, and the context is not thread safe.
I think you are doing a lot of work for little gain if you don't already have a performance problem. Yes, pay attention to what you are doing and don't make unnecessary calls. The actual connection and across the wire overhead for each query is usually really low so don't worry about it.
Remember "Premature optimization is the root of all evil".
My rule of thumb is that executing a call for each collection of objects you want to retrieve is ok. Executing a call for each row you want to retrieve is bad. If your web page requires 20 collections then 20 calls is ok.
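That being said, reducing this to one call would not be difficult if you use the Translate method. Something like this would work:
// GetADataReader is a placeholder for running the SQL batch and getting back a data reader.
// context is an ObjectContext; from a DbContext, use ((IObjectContextAdapter)dbContext).ObjectContext.
var reader = GetADataReader(sql);
// Materialize each result set with ToList() before advancing the reader.
var firstCollection = context.Translate<whatever1>(reader).ToList();
reader.NextResult();
var secondCollection = context.Translate<whatever2>(reader).ToList();
// etc.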
The big down side to doing this is that if you place your sql into a stored proc then your stored procs become very specific to your web pages instead of being more general purpose. This isn't the end of the world as long as you have good access to your database. Otherwise you could just define your sql in code.