How to collect statistics about data usage in an enterprise application?

For optimization purposes, I would like to collect statistics about data usage in an enterprise Java application. In practice, I would like to know which database tables, and moreover, which individual records, are accessed most frequently.
What I thought I could do is write an aspect that records all data access and asynchronously writes the results to a database, but I feel I would be reinventing the wheel by doing so. So, are there any existing open-source frameworks that already tackle this problem, or is it somehow possible to squeeze this information directly from MySQL?

This might be useful - have you seen the UserTableMonitoring project?
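If an aspect feels like overkill, the MySQL-side route is to enable the general query log (`SET GLOBAL general_log = 'ON'`) and tally table names from the logged statements offline. Here is a minimal sketch of that tallying step; the sample log lines and the regex are simplified assumptions, and real SQL would need a more careful parser:

```python
import re
from collections import Counter

# Hypothetical lines in the rough shape of MySQL's general query log.
LOG_LINES = [
    "2023-01-01T10:00:00 42 Query SELECT * FROM orders WHERE id = 7",
    "2023-01-01T10:00:01 42 Query SELECT name FROM customers WHERE id = 3",
    "2023-01-01T10:00:02 43 Query UPDATE orders SET total = 9 WHERE id = 7",
]

# Naive table-name extractor: the word following FROM/UPDATE/INTO.
TABLE_RE = re.compile(r"\b(?:FROM|UPDATE|INTO)\s+`?(\w+)`?", re.IGNORECASE)

def table_counts(lines):
    """Tally how often each table name appears in logged statements."""
    counts = Counter()
    for line in lines:
        counts.update(m.group(1).lower() for m in TABLE_RE.finditer(line))
    return counts

print(table_counts(LOG_LINES))  # e.g. Counter({'orders': 2, 'customers': 1})
```

Per-record (not just per-table) frequencies would still need interception in the application, since the log only gives you statement text.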

Related

Best practice for modeling analytics data

I am working on a product with which a user can create his/her own mobile site. As this is a mobile-site creation platform, many sites are created in the application. I need to keep all the visitor data in the database so that the product can show each user the analytics for his/her site.
When there were fewer sites, everything worked fine. But now the data is growing fast, as there are many requests on the server. I use MongoDB as the NoSQL DBMS to keep all the data. Into a collection named "analytics" I insert one document per visit, tagged with the site id, so that the data can be shown to the user. As the data has grown large, showing a user's analytics has become slow, and disk usage keeps growing.
What is the best way to model this kind of big data?
Should I create a collection per site and store each site's data separately?
Should I also split collections by date?
What should the data-cleaning procedure be, and what are the best practices adopted by leaders in the industry?
Please help
I would strongly suggest reading through the MongoDB optimization strategies at http://docs.mongodb.org/manual/administration/optimization/ . There you will find various ways to identify slow-performing queries/ops and suggestions for improving them. Hopefully that helps you resolve the slow queries and performance issues.
If you haven't already seen, I would also suggest taking a look at various use cases at http://docs.mongodb.org/ecosystem/use-cases/ , how they are modeled for those scenarios and if there is any that resembles what you are trying to achieve.
After following the optimization strategies and making appropriate changes, if you still have performance issues, I would suggest posting the following information for further suggestions:
What is your current state in terms of performance and what is the planned target state?
How does your system look i.e. hardware / software characteristics?
Once you have the needed performance characteristics, the following questions may help you reach your targets:
What are the common query patterns and which ones are slow?
Potentially look at adding indexes that can enhance query performance
Potentially look at refactoring the schema based on access patterns
Potentially look at refactoring the schema to roll up / aggregate analytics data based on how it will be used
Are writes also slow, and is that a concern as well?
Potentially plan for sharding, which provides write as well as read scaling. Sharding is a topic in itself, and I would suggest reading about it at http://docs.mongodb.org/manual/sharding/
How big is the data, and how is it growing or intended to grow?
Potentially this would give further insight into what could be suggested
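The roll-up suggestion can be sketched in memory. With pymongo, each visit would become a single upsert with `$inc` on a per-hour counter in a one-document-per-site-per-day schema; the field names below are illustrative, and plain dicts stand in for the collection:

```python
from collections import defaultdict
from datetime import datetime

# One "document" per (site, day); with pymongo this would be
# update_one({...}, {"$inc": {...}}, upsert=True) on a real collection.
daily_docs = {}

def record_visit(site_id, ts):
    """Increment the daily total and the per-hour counter for one visit."""
    key = (site_id, ts.strftime("%Y-%m-%d"))
    doc = daily_docs.setdefault(key, {"total": 0, "hourly": defaultdict(int)})
    doc["total"] += 1
    doc["hourly"][ts.hour] += 1

record_visit("site-1", datetime(2023, 5, 1, 9, 30))
record_visit("site-1", datetime(2023, 5, 1, 9, 45))
record_visit("site-1", datetime(2023, 5, 1, 17, 0))

doc = daily_docs[("site-1", "2023-05-01")]
print(doc["total"], dict(doc["hourly"]))  # 3 {9: 2, 17: 1}
```

The win is that showing a month of analytics reads ~30 small documents per site instead of scanning one raw row per visit.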

What would be the best distributed storage solution for a heavy use web scraper/crawler?

I'm implementing a web scraper that needs to scrape and store about 15 GB+ of HTML files a day, and the amount of daily data will likely grow.
I intend to store the scraped data as long as possible, and would also like to keep the full HTML file of every page for at least a month.
My first implementation wrote the HTML files directly to disk, but that quickly ran into inode-limit problems.
The next thing I tried was Couchbase 2.0 as a key/value store, but the Couchbase server would start returning Temp_OOM errors after 5-8 hours of scraping writes, and restarting the server was the only route to recovery.
Would MongoDB be a good solution? This article makes me worry, but it does sound like their requirements are beyond what I need.
I've also looked a bit into Cassandra and HDFS, but I'm not sure whether those solutions are overkill for my problem.
As for querying the data, as long as I can get the page data for a specific URL and date, that's good enough. The data is mostly write once, read once, then stored for possible future reads.
Any advice on storing such a large amount of HTML files would be helpful.
Assuming 50 kB per HTML page, 15 GB daily gives 300,000+ pages per day, about 10 million per month.
MongoDB will definitely work well with this data volume. Concerning its limitations, it all depends on how you plan to read and analyze the data, and with that amount of data you could take advantage of its map/reduce features.
However, if your problem may scale further, you may want to consider other options. It is worth noting that Google's search engine uses BigTable to store HTML data; in that sense, Cassandra could be a good fit for your use case. It offers excellent, persistent write/read performance and scales horizontally well beyond your data volume.
I'm not sure what deployment scenario led to those errors when you used Couchbase; more investigation may be needed to find the cause. Trace the errors back to their source, because, given the requirements described above, such a store should work fine and shouldn't fall over after 5 hours (unless you have a storage problem).
I suggest you give MongoDB a try; it is very powerful, well matched to what you need, and shouldn't struggle with the requirements you mention.
You could use HDFS, but you don't really need it when MongoDB (or even Cassandra) can do the job.
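Whatever store you pick, date-bucketed keys make both the "URL + date" lookup and the monthly expiry cheap: expiring a month of pages becomes dropping whole buckets (collections or column families) instead of per-row deletes. A sketch of such a key scheme; the layout is an assumption, not a feature of any particular store:

```python
import hashlib
from datetime import date

def page_key(url, day):
    """Build a lookup key of the form <YYYY-MM-DD>/<sha1(url)>.
    Hashing the URL gives fixed-length keys; the date prefix groups
    a day's pages into one droppable bucket."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"{day.isoformat()}/{digest}"

k = page_key("http://example.com/", date(2013, 4, 2))
print(k)  # '2013-04-02/<40 hex chars>'
```

Retrieval is then a single key lookup given the URL and the date, which matches the write-once/read-once access pattern described above.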

Is there any GIS extension for Apache Cassandra?

I want to use Cassandra for my web application because it will manage a lot of information. The problem is that it will also handle a lot of geographical data, so I need a GIS (http://en.wikipedia.org/wiki/Geographic_information_system) extension for Cassandra to capture, store, manipulate, analyze, manage, and present all kinds of geographical data.
Something like PostGIS for PostgreSQL. Does it already exist? Is there something similar? Any suggestions?
Thanks in advance for your help :)
Well, one of our clients at PlayOrm (a client on top of Cassandra with its own command-line client) is heavily into GIS, so we are going to be adding features for storing GIS data, though I think some already exist. I'm meeting with someone next week about this, so in the meantime you may want to check out PlayOrm.
Data will be read from Cassandra and displayed on one of the largest monitors I have seen, with some big huge machine(s) backing it and tons of graphics cards... pretty cool setup.
Currently PlayOrm does joins and scalable SQL, but it is very likely we will add spatial queries to the mix if we continue to work with GIS data.
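I'm not aware of a drop-in PostGIS equivalent for Cassandra. A common workaround is to encode coordinates as a geohash and put the hash (or a prefix of it) in the row key, so nearby points cluster together and "points near X" becomes a prefix/range scan. A minimal encoder, assuming the standard geohash base-32 alphabet:

```python
# Standard geohash base-32 alphabet (no a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=9):
    """Encode a lat/lon pair as a geohash string; the longer the shared
    prefix of two hashes, the closer the two points are."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    result, ch, bit_count, even = [], 0, 0, True
    while len(result) < precision:
        if even:  # even bits refine longitude
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                ch, lon_lo = (ch << 1) | 1, mid
            else:
                ch, lon_hi = ch << 1, mid
        else:     # odd bits refine latitude
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                ch, lat_lo = (ch << 1) | 1, mid
            else:
                ch, lat_hi = ch << 1, mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            result.append(BASE32[ch])
            ch, bit_count = 0, 0
    return "".join(result)

print(geohash(57.64911, 10.40744, 11))  # 'u4pruydqqvj'
```

A precision of 6-7 characters corresponds to a roughly neighborhood-sized cell; real proximity queries also need to check the cell's neighbors, since close points can straddle a cell boundary.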

When to use multiple DBMS

When is it a good idea to use more than one DBMS? What are the possible repercussions, and how do you decide when to do so?
I'm currently building an application which runs an analysis on our users' websites and stores it. This allows me to analyze all the data and give them analytics.
Since the data collected from each site is static and varies greatly from site to site, CouchDB seemed like a great fit. But to build this system I'd also need a user-account system, which Couch is quite bad at (reserving names, emails, etc. all have problems).
My first thought was to use MySQL to handle the user accounts and CouchDB for the massive amounts of data. Essentially, trying to use a hammer for a nail and a screwdriver for a screw.
Is this a time when more than one DBMS is a good idea?
I don't see anything wrong with using MySQL for user accounts and CouchDB for crawled information.
For the users, you might even consider something simpler, like GDBM.
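The glue between the two stores is just a shared identifier carried in every document, since no cross-database join exists; the names below are illustrative:

```python
def make_crawl_doc(mysql_user_id, site_url, payload):
    """Build a CouchDB-style document that carries the relational
    user id, so application code can join the two stores."""
    return {
        "type": "crawl_result",
        "user_id": mysql_user_id,  # foreign key into the MySQL users table
        "site": site_url,
        "data": payload,
    }

doc = make_crawl_doc(42, "http://example.com", {"pages": 10})
print(doc["user_id"])  # 42
```

The main repercussion of running two DBMSs is operational (two systems to back up, monitor, and keep consistent), not code complexity.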

Moving content revision history to a different database?

I keep a content revision history for a certain content type, stored in MongoDB. But since the data is not frequently accessed, I don't really need it there taking up memory; I'd rather put it in a slower, disk-based database.
Which database should I use? I'm looking for something really cheap with cloud hosting available, and I don't need speed. I'm looking at SimpleDB, but it doesn't seem very popular. An RDBMS doesn't seem easy to work with, since my data is structured as documents. What are my options?
Thanks
Depends on how often you want to look at this old data:
Why don't you mongodump it to your local disk and mongorestore it when you want it back?
Documentation here
OR
Set up a local mongo instance and clone the database using the information here
Based on your questions and comments, you might not find the perfect solution. You want free or dirt cheap storage, and you want to have your data available online.
There is only one solution I can see feasible:
Stick with MongoDB. SimpleDB does not allow you to store documents, only key-value pairs.
You could create a separate collection for your history and use a cloud service with a free tier; for example, http://MongoLab.com gives you a 240 MB free tier.
If you exceed the free tier, you can look at discarding the oldest data, moving it to offline storage, or starting to pay for what you use.
If your data grows a lot, you will have to decide whether to pay for it, keep it available online or offline, or discard it.
If you are dealing with a lot of large objects (BLOBs or CLOBs), you can also store the 'non-indexed' data separately from the database. This keeps the database both cheap and fast, and the large objects can be retrieved from any cheap storage when needed.
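The "move the oldest data to offline storage" step can be sketched as follows; real code would use pymongo's find/insert_many/delete_many against two collections, but plain lists show the shape (field names are illustrative):

```python
from datetime import datetime

# Stand-ins for the "hot" revisions collection and the cheap archive.
hot = [
    {"_id": 1, "saved": datetime(2012, 1, 1), "body": "v1"},
    {"_id": 2, "saved": datetime(2013, 6, 1), "body": "v2"},
]
archive = []

def archive_older_than(cutoff):
    """Move every revision saved before `cutoff` from hot to archive."""
    global hot
    stale = [d for d in hot if d["saved"] < cutoff]
    archive.extend(stale)  # write to archive first, then delete from hot
    hot = [d for d in hot if d["saved"] >= cutoff]
    return len(stale)

moved = archive_older_than(datetime(2013, 1, 1))
print(moved, len(hot))  # 1 1
```

Run on a schedule, this keeps the hot collection (and therefore its memory footprint) bounded while nothing is ever discarded.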
Cloudant.com is pretty cool for hosting your DB in the cloud, and it uses BigCouch, which is a NoSQL thing. I'm using it for a social site I have in the works, since CouchDB (BigCouch) has an open-ended structure and you talk to it via JSON. It's pretty awesome stuff, though it's weird to move from SQL to map/reduce; once you do, it's worth it. I did some research because I was a .NET guy for a long time but am moving to Linux and Node.js, partly out of boredom and partly for the love of JavaScript. These things just fit together: Node.js is all JavaScript on the back end and talks seamlessly to CouchDB, and the whole thing scales like crazy.