Why is saving data from an API to CSV faster than uploading it to a MongoDB database - mongodb

My question is about two procedures I used to collect trade data from the US Census Bureau API, particularly their performance and code logic. I have already collected the data, but along the way I wrote two different ways of requesting and saving it, and my questions are about those.
Summary of my final questions comes at the bottom.
First way: npm request and mongodb to save the data
I limited concurrency with tiny-async-pool (which caps how many calls of a given function run at once) so that I would not request too much at once, hit a timeout, or overload my database with queries. The bottleneck was the database: the API requests returned fairly quickly (1-15 seconds depending on body size), but saving each array item to its own mongodb document ranged from 100 ms to 700 ms. The returned data was a nested array of anywhere from a few hundred to over one hundred thousand items, each an array of at most 10 values. To avoid redoing the same queries after an error, I also checked my database before each request to see whether that query had already completed. In the end I abandoned this method because it was very error prone and susceptible to timeouts when the data was very large (even with the timeout set to 10 minutes in the request options).
Second way: npm request and save data to csv
I used the same approach as the first method for the requests and concurrency, but saved each query to its own csv file. To avoid redoing successful queries after an error, I checked whether the file already existed and, if so, skipped that query. This approach was error free: I ran it and after a few hours had all the data saved. Writing to csv was insanely fast, much more so than using mongodb.
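A minimal sketch of the second procedure, written in Python rather than the JavaScript the question used. `fetch_rows` is a hypothetical stand-in for the API request, and the file naming, the `.part` suffix, and the thread pool (in place of tiny-async-pool) are assumptions made for illustration:

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fetch_rows(query):
    """Hypothetical stand-in for the API request; returns a nested array."""
    return [["year", "value"], [query, "1"], [query, "2"]]


def save_query(query, out_dir):
    """Fetch one query and write it to its own CSV file, unless already done."""
    target = out_dir / f"{query}.csv"
    if target.exists():
        return "skipped"
    rows = fetch_rows(query)
    # Write to a temporary name first so a crash never leaves a partial
    # file that the existence check would mistake for a finished query.
    partial = target.with_name(target.name + ".part")
    with partial.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    partial.replace(target)
    return "saved"


def collect(queries, out_dir, concurrency=4):
    """Run save_query over all queries with at most `concurrency` in flight."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda q: save_query(q, out_dir), queries))


out_dir = Path(tempfile.mkdtemp())
first_run = collect(["2015", "2016", "2017"], out_dir)
second_run = collect(["2015", "2016", "2017", "2018"], out_dir)
```

On the second run only the new query is fetched; the three existing files are skipped, which is the restart behaviour described above.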
Final summary and questions
My end goal was to get the data in the easiest manner possible. I used javascript because that's where I learned api requests and async operations, even though I will do most of my data analysis with python and pandas. I first tried the database method mostly because I thought it was the right way and I wanted to improve my database CRUD skills. After countless hours of refactoring code and trying new techniques, I still could not get it to work properly. I resorted to the csv method, which was a) much less code to write, b) fewer checks, c) faster, and d) more reliable.
My final questions are these:
Why was the csv approach better than the database approach? Any counter arguments or different approaches you would have used?
How do you handle bottlenecks and concurrency in your applications with regards to APIs and database operations? Do your techniques vary in production environments from personal use cases (in my case I just needed the data and a few hours of waiting was fine)?
Would you have used a different programming language or different package/module for this data collection procedure?

Related

Why does paginating with offset using PSQL make sense?

I've been looking into pagination (paginating by timestamp) with a PSQL dbms. My current approach is to build a B-tree index to greatly reduce the cost of finding the start of the next chunk. But everywhere I look, in tutorials and in NPM modules like express-paginate (https://www.npmjs.com/package/express-paginate), people seem to fetch chunks using offset one way or another, or fetch all the data anyway and simply send it in chunks, which to me doesn't seem to be the full optimization that pagination is meant for.
I can see that they're still making an optimization by lazy loading and streaming the chunks (thus saving bandwidth and any download/processing time on the client side), but offset on psql still requires scanning the previous rows. In the worst case, where a user wants to view all the data, doesn't this approach have a very high server cost? If you have, say, n chunks, you're accessing the first chunk n times, the second chunk n-1 times, the third chunk n-2 times, etc. I understand that this is really in terms of IOs, so it's not that expensive, but it still bothers me.
Am I missing something very obvious here? I feel like I am, because there seem to be a lot of more established and experienced engineers using this approach. I'm guessing there is some part of the equation or mechanism that I'm just missing from my understanding.
No, you understand this quite well.
The reason why so many people and tools still advocate pagination with OFFSET and LIMIT (or FETCH FIRST n ROWS ONLY, to use the standard's language) is that they don't know a lot about databases. It is easy to understand LIMIT and OFFSET even if the word “index” has no other meaning to you than “the last pages in a book”.
There is another reason: to implement keyset pagination, you must have an ORDER BY clause in your query, that ORDER BY clause has to contain a unique column, and you have to create an index that supports that ordering.
Moreover, your database has to be able to handle conditions like
... WHERE (name, id) > ('last_found', 42)
and support a multi-column index scan for them.
Since many tools strive to support several database systems, they are likely to go for the simple but inefficient method that works with every query on most database systems.
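A small sketch of keyset pagination with the row-value condition from above. It uses SQLite through Python's sqlite3 module as a stand-in for PostgreSQL (row values need SQLite 3.15 or later); the `person` table, its data, and the helper names are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO person (id, name) VALUES (?, ?)",
    [(i, f"name{i % 4}") for i in range(1, 11)],
)
# The index matches the ORDER BY, which lets the database seek to the
# start of a page instead of scanning and discarding earlier rows.
conn.execute("CREATE INDEX person_name_id ON person (name, id)")


def page_after(last_key, page_size):
    """Return the next page ordered by (name, id), starting after last_key."""
    if last_key is None:
        sql = "SELECT name, id FROM person ORDER BY name, id LIMIT ?"
        params = (page_size,)
    else:
        sql = (
            "SELECT name, id FROM person "
            "WHERE (name, id) > (?, ?) ORDER BY name, id LIMIT ?"
        )
        params = (*last_key, page_size)
    return conn.execute(sql, params).fetchall()


def all_pages(page_size):
    """Walk the whole table page by page without ever using OFFSET."""
    pages, last_key = [], None
    while True:
        page = page_after(last_key, page_size)
        if not page:
            return pages
        pages.append(page)
        last_key = page[-1]  # (name, id) of the last row seen


pages = all_pages(page_size=3)
```

The unique `id` column breaks ties between equal names, which is why the ORDER BY has to include it: without it, rows sharing a name at a page boundary could be skipped or repeated.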

MarkLogic REST interface to send data to Qlik Sense

I need to present ~10 million XML documents to Qlik Sense using MarkLogic REST interface with the intention of analyzing raw data on Qlik.
I'm unable to send that bulk data using simple cts:search.
A template view with an SQL call like the one below does not help, as it is not recognized by Qlik Sense.
xdmp:to-json(xdmp:sql('select * from SC1.V1'))
Is there a better way to achieve this?
I understand it is not usual to load such huge data to Qlik, but what limitations should I consider?
You are unlikely to be able to transfer that volume of data into or out of ANY system in a single 'transaction' (or request). And if you could, you wouldn't want to, because when it fails, it's likely to fail forever as you have to start all over.
You should 'batch' up the documents into manageable chunks. 100MB or '1 minute' is a reasonable upper bound; as size and time increase, the probability of problems goes up (way up) due to timeouts, memory, temp space, transient internet and network problems, etc.
A simple strategy that often works well is to first produce a 'list' of what to extract (document URIs, primary keys, ...), save that, and then work your way through the list in batches, retrying as needed. Depending on the destination, local storage, etc., you can either combine the lot to send on to the recipient or, generally better, send the target data in batches as well.
This approach has good transactional characteristics: you effectively 'freeze' the set of data when you make the list, but can take your time collecting and sending it. Depending on the setup, you may be able to do so in parallel.
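The list-then-batch strategy can be sketched as follows. This is an illustration in Python, not MarkLogic or Qlik code: `flaky_send` is a hypothetical sender that fails once to exercise the retry, and the batch size, attempt limit, and URI format are assumptions:

```python
import time


def make_batches(keys, batch_size):
    """Split the frozen key list into fixed-size batches."""
    return [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]


def transfer_all(keys, send_batch, batch_size=1000, max_attempts=3, delay=0.0):
    """Send every batch, retrying a failed batch instead of starting over."""
    done = []
    for batch in make_batches(keys, batch_size):
        for attempt in range(1, max_attempts + 1):
            try:
                send_batch(batch)
                done.extend(batch)
                break
            except IOError:
                if attempt == max_attempts:
                    raise
                time.sleep(delay)
    return done


# Hypothetical sender that fails once on its second call, to exercise the retry.
calls = {"count": 0}
received = []


def flaky_send(batch):
    calls["count"] += 1
    if calls["count"] == 2:
        raise IOError("transient network problem")
    received.extend(batch)


uris = [f"/docs/{n}.xml" for n in range(10)]  # the saved list of what to extract
sent = transfer_all(uris, flaky_send, batch_size=4)
```

Only the failed batch is resent; the batches that already succeeded are not touched again.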

When's the time to create dedicated collections in MongoDB to avoid difficult queries?

I am asking a question that I assume does not have a simple black and white answer, but the principle I'm asking about is clear.
Sample situation:
Let's say I have a collection of 1 million books, and I always want to pull the top 100 rated.
Let's assume that I need to perform an aggregate function every time I perform this query which makes it a little expensive.
It seems reasonable that, instead of running the query for every request (100-1000 a second), I would create a dedicated collection that stores only the top 100 books and is updated every minute or so. Instead of running a difficult query 100 times every second, I would run it once a minute and otherwise pull from a small collection that holds only those 100 books and requires no query (just get everything).
That is the principle I am questioning.
Should I create a dedicated collection for EVERY query that is often used?
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
Are there any guidelines for best practice in those types of situations?
Is there a point where, if a query runs so often and the data doesn't change very often, I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
Lastly,
Is there a way in MongoDB to cache results?
If so, how can I tell it to fetch the cached result, and when to regenerate the cache?
Thank you all.
Before getting to collection specifics, one has to differentiate between "real-time" data and data that does not require immediate presentation. The rules for "real-time" systems are obviously quite different.
Now to your example, starting from the end: the cache of query results. The answer is not specific to MongoDB. Data architects often use Redis or memcached (or other cache systems) to hold all types of information. This, obviously, is a function of how much memory is available to your system and the DB. You do not want to cripple the DB by giving your cache too much of the available memory, and you do not want your cache to be useless by giving it too little.
In the book case, with its top 100, since it is certainly not a real-time endeavor, it would make sense to cache the query result and serve requests from that cache. You could update the cache from a cron job or based on an update flag (which you create to tell your program that the top 100 has changed), and the system would then run the aggregation in the background.
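A rough sketch of the update-flag variant, using plain Python objects instead of MongoDB. The class and function names are hypothetical, and for simplicity the recomputation here happens on the next read after the flag is set rather than in a background job:

```python
class TopBooksCache:
    """Holds the top-N result and recomputes it only when flagged as stale."""

    def __init__(self, compute):
        self._compute = compute  # the expensive query, e.g. an aggregation
        self._value = None
        self._stale = True
        self.recomputations = 0

    def mark_stale(self):
        """The 'update flag': call this when ratings change."""
        self._stale = True

    def get(self):
        if self._stale:
            self._value = self._compute()
            self._stale = False
            self.recomputations += 1
        return self._value


ratings = {"A": 4.1, "B": 4.9, "C": 3.2, "D": 4.5}


def top_two():
    """Hypothetical stand-in for the expensive top-100 query."""
    return sorted(ratings, key=ratings.get, reverse=True)[:2]


cache = TopBooksCache(top_two)
before = [cache.get() for _ in range(1000)][-1]  # 1000 reads, one computation
ratings["C"] = 5.0
cache.mark_stale()
after = cache.get()
```

A thousand reads cost one computation; the second computation happens only after the flag is set.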
Now to the first few points:
Should I create a dedicated collection for EVERY query that is often used?
Yes and no. It depends on the amount of data that has to be searched to aggregate your response. It also depends on your memory limitations and, let me add, on the whole server setup in terms of speed, cores, and memory. In my humble opinion a cache is much better, as it avoids reading from the data all the time.
Should I do it only for complicated ones?
How do I gauge which is complicated enough and which is simple enough to leave as is?
I don't think anyone can really give a black and white answer to that question for your system. Is a complicated query just an aggregation? Or is it an $unwind followed by a whole slew of $group and other stages? This is really up to the dataset and how much information must actually be read, sifted, and manipulated. It will affect your IO and, yes, again, the memory.
Is there a point where if a query runs so often and the data doesn't change very often that I should keep the data in the server's memory for direct access? Even if it's a lot of data? How much is too much?
See the answers above; this is directly connected to your other questions.
Finally:
Are there any guidelines for best practice in those types of situations?
The best you can do here is to time the procedures in your code, monitor memory usage and limits, look at the IO, study actual reads and writes on the collections.
Hope this helps.
Use a cache to store objects. For example, in Redis, use Redis Lists:
Redis Lists are simply lists of strings, sorted by insertion order
Then set the expiry to either a timeout or a specific time.
Now, whenever you have a miss in Redis, run the query in MongoDB and re-populate your cache. Since the cache resides in memory, your fetches will be extremely fast compared to dedicated collections in MongoDB.
In addition, you don't need a dedicated machine; just deploy it on your application machine.
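The miss-then-repopulate flow can be sketched without a Redis server. `ExpiringCache` below is an in-process stand-in written for illustration, not the Redis client API, and `run_query` is a hypothetical placeholder for the MongoDB query; the clock is faked so the example does not need to sleep:

```python
import time


class ExpiringCache:
    """In-process stand-in for a cache with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None or entry[0] <= self._clock():
            return None  # miss, or entry has expired
        return entry[1]

    def set(self, key, value, ttl):
        self._entries[key] = (self._clock() + ttl, value)


def cached_query(cache, key, run_query, ttl):
    """Cache-aside: serve from the cache, fall back to the database on a miss."""
    value = cache.get(key)
    if value is None:
        value = run_query()
        cache.set(key, value, ttl)
    return value


now = [0.0]  # fake clock value, advanced by hand below
query_runs = []


def run_query():
    """Hypothetical stand-in for the MongoDB query."""
    query_runs.append(now[0])
    return ["book-1", "book-2"]


cache = ExpiringCache(clock=lambda: now[0])
cached_query(cache, "top-books", run_query, ttl=60)  # miss: runs the query
cached_query(cache, "top-books", run_query, ttl=60)  # hit: served from cache
now[0] = 61.0
result = cached_query(cache, "top-books", run_query, ttl=60)  # expired: runs again
```

The expiry bounds how stale a served result can be, so choosing the timeout is a trade-off between freshness and how often the database query runs.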

MongoDB documents of calculated values for a dashboard vs re-retrieving on each web page view?

If I have a page in a web app that displays some dashboard type statistics about documents in my database (counts, docs created per hour, per day etc), is it best to pre-calculate this data and store it in a separate document (and update as needed), or assuming the collections have appropriate indexes, would it be appropriate to execute queries to retrieve these statistics on every load of the page?
The data does not have to be exactly up to date on every page hit/load, which is why I was thinking of maintaining the data I need to display in a separate document that can be retrieved on page hit (or even cached and only re-retrieved every 5 minutes or similar).
That's pretty broad, and I have the feeling you have already identified the key points. Generally speaking, you should consider these questions:
Do you need to allow users to apply filters? Complex filters usually make pre-aggregation impossible.
Related: Is it likely that the exact same data is ever queried again? If not, pre-aggregation might need to happen on different levels of granularity (e.g. by creating day / week / month totals and summing these, instead of individual events).
What is the ratio of reads to writes on the data? If the number of writes is small, it might be OK to keep counters in real-time, instead of using read-caching.
What are your performance requirements for cached and uncached queries? Getting fast cached queries is trivial, but comes at the cost of stale data. Making uncached queries faster is more tricky and usually requires something like the multi-level approach discussed before - it often doesn't help if old data comes super fast, but new queries take minutes.
Caching works especially well if the data can't be changed later (or is seldom changed), and the queries remain the same with a certain chance of recurring. A nice example is Facebook's profiles, where past years are apparently cached for every visitor-profile combination. First accesses are slow, however...
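To illustrate the point about pre-aggregating at a finer granularity and summing, here is a small sketch in plain Python. The event dates and the `count_between` helper are made up for the example; in practice the daily totals would live in their own documents:

```python
from collections import Counter
from datetime import date

# Hypothetical raw events: one creation date per document.
created = [
    date(2024, 1, 30), date(2024, 1, 30), date(2024, 1, 31),
    date(2024, 2, 1), date(2024, 2, 1), date(2024, 2, 1),
]

# Pre-aggregation step: done once (or incrementally), not on every page view.
daily_totals = Counter(created)


def count_between(start, end):
    """Answer a range query from the daily totals instead of the raw events."""
    return sum(n for day, n in daily_totals.items() if start <= day <= end)


january = count_between(date(2024, 1, 1), date(2024, 1, 31))
last_three_days = count_between(date(2024, 1, 30), date(2024, 2, 1))
```

Because the totals are stored per day, ranges that were never requested before (such as one crossing a month boundary) can still be answered from the summaries.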

How to handle large mongodb collection

We have a collection that is potentially going to be very large. It is used to store bill-related data, so it is often used for reporting/analytics purposes.
Please let me know the best approach to handle this large collection.
1) Can I split off and archive the old data (say, a 12-month period)? The old data is still required for analytic reports: I want to query it to show the sales comparison for the past 2 years.
2) Can I have a new collection for the old data (12 months)? Then for every 12 months I have to create a new collection. For report generation I have to access all of these documents. Will this cause a performance problem?
3) Can I go for sharding?
There are many variables to account for, the clearest being what hardware you use, how the data is structured, and how it is queried. A distributed network ought to be able to chew through your data faster than a single machine, but before diving into that solution I recommend generating an absurd amount of mock data comparable to what you are expecting, and then testing various approaches. Seriously. Create a bunch of data, and try to break things. It's fun! Soon enough you'll know more about what your problem requires than any website could tell you.
As for direct responses:
1) Perhaps, before archiving the data, appropriate stats summaries can be generated (or updated). Those summaries/simplifications can be used for sale comparisons without reloading all of the archived data they represent.
2) This strikes me as sensible. By splitting up the sales data, you have more control over how much data needs to be accessed. After all, a user won't always wish to see 3 years of data, they may only wish to see last week's.
3) Move to sharding when you actually need it. As is stated on the MongoDB site:
Converting an unsharded database to a sharded cluster is easy and seamless, so there is little advantage in configuring sharding while your data set is small.
You'll know it's time when your memory-map approaches the server's RAM limit. MongoDB supports reading and writing to databases too large to keep in memory, but I'm sure you already know that is SLOW.