I have a CouchDB instance running on AWS (an m4.large EC2 instance on CentOS) that is in excess of 75GB and constantly growing. I have run into problems when modifying and indexing views on this database; re-indexing now takes almost 2 days.
What are the optimization strategies available to me to make sure that:
1. Re-indexing after a map-reduce view change can be done more efficiently
2. Fetching from a map-reduce view can be done faster (with a custom reduce function in place)
I've read the suggestions on the CouchDB guide, but they're targeted more towards optimizing inserts.
Since you have such a large database, it does take time to re-index views when one is changed; it will not be near-instant like it is with a smaller database. With that said, here is a suggestion for #1.
Whenever a design document is updated, CouchDB re-indexes all views within that document, so giving each view its own design document should speed up re-indexing. With a database this large it will still take time to scan every document, but now only the changed view is rebuilt rather than all of them.
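As a minimal sketch of what that looks like in practice (assuming Python with the requests library, and hypothetical database and view names), each view gets its own design document, so changing one map function no longer forces the other view to rebuild:

```python
import requests

COUCH = "http://localhost:5984"  # assumed CouchDB endpoint
DB = "mydb"                      # hypothetical database name

# One design document per view, so editing "by_type" does not
# force "by_created" to be re-indexed as well.
design_docs = {
    "_design/by_type": {
        "views": {
            "by_type": {
                "map": "function (doc) { if (doc.type) { emit(doc.type, 1); } }",
                "reduce": "_count"
            }
        }
    },
    "_design/by_created": {
        "views": {
            "by_created": {
                "map": "function (doc) { if (doc.created_at) { emit(doc.created_at, null); } }"
            }
        }
    },
}

for doc_id, body in design_docs.items():
    # Note: updating an existing design document would also require its current _rev.
    resp = requests.put(f"{COUCH}/{DB}/{doc_id}", json=body)
    print(doc_id, resp.status_code, resp.json())
```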
Edit: Links
CouchDB Views Intro -> This is the overview of the CouchDB views documentation. I've read and re-read this page multiple times and always find something new each time. I suggest reading it over a few times just to be sure.
CouchDB One vs Multiple Design Documents -> Same page, but this link takes you to the section relevant to my answer above. Please give it a read-through and I hope it helps.
I do not know how to address #2, sorry.
So, I don't have a specific issue here other than a general lack of knowledge, and I am hoping the big brains on here can give me a nudge in the right direction or refer me to an online resource that could help. Here is the general problem I am trying to solve.
I have a Mongo database that holds a handful of collections where I store data retrieved from some software we use for our day-to-day operations. We grab this data from the API and store it in Mongo to build up a historical source of data (the API is limited in the timeframe of data that can be retrieved).
The window for the historical data from the API is 7 days.
The data in question has a unique id for each item we pull from the API, which allows us to grab a record, store it, and modify it as required if it changes over time. This worked just fine for our needs until we started to notice a few discrepancies between the data we stored in Mongo and what we would get out of the software when we ran "reports." After digging into it, it turns out there are a few edge cases where a record is deleted from the software, but if we had already grabbed that record through the API, it would remain in our Mongo database.
I'm looking for any advice on how to handle this situation. Ideally, I suppose, we would like to remove these deleted records from our MongoDB in order to match what is in the software, but I'm having trouble dreaming up the process to make this happen. Apparently this is one of many gaping holes in my entirely self-taught knowledge of this stuff.
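The only process I've been able to imagine so far is something like the sketch below, where I pull the full list of ids the API still reports for its 7-day window and delete anything in Mongo from that same window whose id is no longer returned. The collection, field, and API-call names here are just placeholders, and I'm not sure this is even the right direction:

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
records = client["operations"]["records"]          # hypothetical db/collection names

def fetch_active_ids_from_api():
    """Placeholder for the vendor API call; returns the unique ids
    of every record the API still reports for its 7-day window."""
    raise NotImplementedError

window_start = datetime.now(timezone.utc) - timedelta(days=7)
active_ids = set(fetch_active_ids_from_api())

# Only records inside the API's 7-day window can be checked; anything older
# can no longer be verified against the source system.
result = records.delete_many({
    "record_date": {"$gte": window_start},
    "api_id": {"$nin": list(active_ids)},
})
print(f"Removed {result.deleted_count} records that were deleted upstream")
```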
Thanks for any assistance.
Our website is built on Magento 2.1.7, and we recently enabled a Layered Navigation plugin. It is a very good feature for customers and we all like it, except for the page loading speed.
The obvious difference we can feel is that the number of cached pages is now much lower than before. For example, when you open a category page for the first time, it is extremely slow, and then if you open it on another device, it is as fast as normal. In other words, it seems the cache cannot cover all our frequently visited pages.
I have been working on it by setting up more attribute sets to eliminate useless product attributes and by reducing the number of filterable attributes, but we are still not pleased with the performance.
My questions are:
Is there any other way I can improve it by settings or coding?
If I upgrade the hosting plan, which aspect is more important?
Right now we are using the built-in cache on a server with 24GB RAM shared by 13 stores. The next plan is 24GB shared by 4 stores, plus a 250GB CDN. Or do I just need to upgrade to a Varnish cache?
Thanks
From my understanding, the slowness is caused by the complexity of the MySQL queries (layered navigation adds more joins on the EAV tables).
For this reason, I believe Varnish won't help much here, since every layered-navigation page still has to be built on its first visit before it can be cached.
From reading here, it seems that more RAM should help (so the next plan that you've mentioned might help).
Another thing that might help (depending on the Layered Navigation implementation) is enabling the flat catalog. This creates a row in a flat index table for each product, potentially reducing the number of joins layered navigation has to perform.
I am working on a product with which a user can create his/her mobile site. Since this is a mobile-site creation platform, there are lots of sites created in the application. I need to keep all the visitor data in the database so that the product can show analytics to each user for his/her site.
When there were fewer sites, everything worked fine. But now the data is growing fast, as there are lots of requests on the server. I use MongoDB as the NoSQL DBMS to keep all the data. In a collection named "analytics", I insert a document with the site id so the data can be shown to that site's user. As the data has grown large, showing analytics to users has become slow, and disk space usage keeps growing.
What is the best way to model this kind of big data?
Should I create a separate collection per site and store each site's data there?
Should I also split collections by date?
What should the data-cleanup procedure be? What are the best practices adopted by other leaders in the industry?
Please help
I would strongly suggest reading through the MongoDB optimization strategies at http://docs.mongodb.org/manual/administration/optimization/. That page covers various ways to identify slow queries and operations, along with suggestions to improve them, which should hopefully help you resolve the performance issues.
If you haven't already, I would also suggest taking a look at the various use cases at http://docs.mongodb.org/ecosystem/use-cases/, how they are modeled for those scenarios, and whether any of them resembles what you are trying to achieve.
After following the optimization strategies and making appropriate changes, if you still have performance issues, I would suggest posting the following information for further suggestions:
What is your current state in terms of performance and what is the planned target state?
What does your system look like, i.e. its hardware/software characteristics?
Once you have the needed performance characteristics, the following questions may help you achieve your targets:
What are the common query patterns and which ones are slow?
Potentially look at adding indexes that can enhance query performance (see the sketch after this list)
Potentially look at schema refactoring based on access patterns
Potentially look at schema refactoring for rolling up / aggregating analytics data based on how it will be used
Are writes also slow and is that a concern as well?
Potentially plan for sharding, which provides both write and read scaling. Sharding is a whole topic in itself, and I would suggest reading about it at http://docs.mongodb.org/manual/sharding/
How big is the data, and how is it growing or expected to grow?
Potentially this would give further insight into what could be suggested
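To make the indexing and roll-up points above concrete, here is a minimal pymongo sketch; the collection and field names are assumptions since I don't know your schema. It adds a compound index on site id plus timestamp for the raw events and maintains a pre-aggregated per-site, per-day counter document, so the dashboard reads one small document instead of scanning raw events:

```python
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client["analytics_db"]                        # hypothetical database name

# Compound index so "all events for site X in a date range" is an index scan.
db.analytics.create_index([("site_id", ASCENDING), ("ts", ASCENDING)])

def record_visit(site_id, page, ts=None):
    ts = ts or datetime.now(timezone.utc)
    # 1. Raw event, useful for detailed drill-down (can be pruned later).
    db.analytics.insert_one({"site_id": site_id, "page": page, "ts": ts})
    # 2. Pre-aggregated daily roll-up: one document per site per day,
    #    incremented in place, so the dashboard query is a single lookup.
    day = ts.strftime("%Y-%m-%d")
    db.daily_stats.update_one(
        {"site_id": site_id, "day": day},
        {"$inc": {"visits": 1, f"pages.{page}": 1}},
        upsert=True,
    )

record_visit(site_id=42, page="home")
```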
I'm implementing a web scraper that needs to scrape and store about 15GB+ of HTML files a day. The amount of daily data will likely grow as well.
I intend to store the scraped data for as long as possible, but would also like to keep the full HTML file of every page for at least a month.
My first implementation wrote the HTML files directly to disk, but that quickly ran into inode limit problems.
The next thing I tried was using Couchbase 2.0 as a key/value store, but the Couchbase server would start to return Temp_OOM errors after 5-8 hours of web scraping writes. Restarting the Couchbase server is the only route for recovery.
Would MongoDB be a good solution? This article makes me worry, but it does sound like their requirements are beyond what I need.
I've also looked a bit into Cassandra and HDFS, but I'm not sure if those solutions are overkill for my problem.
As for querying the data, as long as I can get the specific page data for a URL and a date, it will be good. The data is mostly write once, read once, and then stored for possible reads in the future.
Any advice pertaining to storing such a large amount of HTML files would be helpful.
Assuming 50kB per HTML page, 15GB daily gives us 300,000+ pages per day, or roughly 10 million per month.
MongoDB will definitely work well with this data volume. As for its limitations, everything depends on how you plan to read and analyze the data. Given that amount of data, you may want to take advantage of the map/reduce features.
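As a minimal sketch of what that could look like with pymongo (the names and the one-month retention value are assumptions): one document per fetch, a compound index on (url, fetched_at) to serve the "page for this URL and date" lookups, and a TTL index so the raw HTML expires after roughly a month:

```python
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
pages = client["scraper"]["pages"]                 # hypothetical db/collection names

# Lookup pattern from the question: "give me the page for this URL and date".
pages.create_index([("url", ASCENDING), ("fetched_at", ASCENDING)])
# TTL index: MongoDB removes documents roughly 30 days after fetched_at.
pages.create_index("fetched_at", expireAfterSeconds=30 * 24 * 3600)

def store_page(url, html):
    pages.insert_one({
        "url": url,
        "fetched_at": datetime.now(timezone.utc),
        "html": html,  # ~50 kB per page, well under MongoDB's 16 MB document limit
    })

def get_page(url, day_start, day_end):
    return pages.find_one({
        "url": url,
        "fetched_at": {"$gte": day_start, "$lt": day_end},
    })
```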
However, if your problem may scale further, you may want to consider other options. It is worth noting that Google's search engine uses BigTable as storage for HTML data, so in that sense Cassandra could be a good fit for your use case. Cassandra offers excellent, persistent write/read performance and scales horizontally well beyond your data volume.
I'm not sure what deployment scenario you used with Couchbase to get those errors; more investigation may be required to find out what is causing the problem. You need to trace the errors back to their source because, given the requirements described above, it should work fine and shouldn't stop after 5 hours (unless you have a storage problem).
I suggest you give MongoDB a try; it is very powerful, well suited to what you need, and shouldn't have trouble with the requirements you mentioned above.
You could use HDFS, but you don't really need it when MongoDB (or even Cassandra) can handle this.
I want to build a web app similar to reddit.com, with multiple levels of comments and lots of reads and writes. I was wondering whether NoSQL, and MongoDB in particular, is the right tool for this?
Comments are a natural fit for a NoSQL database, no doubt: you avoid multiple self-joins, which means your system can scale out.
With MongoDB you can store the whole hierarchy within one document. Some people will say there are problems with atomic updates, but I don't think that's an issue here because you can load and save back the entire comment tree. In any case, you can easily redesign your system later to support atomic updates and avoid concurrency issues.
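For example, a single MongoDB document can hold a post's whole comment tree; here is a small pymongo sketch (field names are just illustrative) of the load-modify-save-back approach described above:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
threads = client["forum"]["threads"]               # hypothetical db/collection names

# The whole multi-level comment tree lives inside one post document,
# so rendering a page is a single read with no joins.
threads.insert_one({
    "_id": "post-1",
    "title": "Show HN: my reddit clone",
    "comments": [
        {
            "author": "alice",
            "text": "Nice work!",
            "replies": [
                {"author": "bob", "text": "Agreed.", "replies": []},
            ],
        },
    ],
})

# Naive update strategy: load the tree, modify it in application code,
# then write the whole document back.
doc = threads.find_one({"_id": "post-1"})
doc["comments"][0]["replies"].append({"author": "carol", "text": "+1", "replies": []})
threads.replace_one({"_id": "post-1"}, doc)
```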
Reddit itself uses Cassandra. If you want something "similar to reddit.com," maybe you should look at their source -- https://github.com/reddit/reddit/wiki.
Here's what David King (ketralnis) said earlier this year about the Cassandra 0.7 release: "Running any large website is a constant race between scaling your user base and scaling your infrastructure to support it. Our traffic more than tripled this year, and the transparent scalability afforded to us by Apache Cassandra is in large part what allowed us to do it on our limited resources. Cassandra v0.7 represents the real-life operations lessons learned from installations like ours and provides further features like column expiration that allow us to scale even more of our infrastructure."
However, Rick Branson notes that Reddit doesn't take full advantage of Cassandra's features, so if you were to start from scratch, you'd want to do some things differently.