Statistics (& money transfers) with mongoDB - mongodb

1) My first questions is regarding the best solution to store statistics with mongoDB
If i want to store large amounts of statistics (lets say visitors on a specific site - down to hourly), a noSQL DB like mongoDB seems to work very fine. But how do I structure those tables to get the most out of mongoDB?
I'd increase the visitor amount for that specific object id (for example SITE_MONTH_DAY_YEAR_SOMEOTHERFANCYPARAMETER) by one every time a user visits the page. But if the database gets big (>10g), doesnt that slow down (like it would on mysql) because it has to search for the object_id and update it? Is the data always accurate when I update it (afaik mongoDB does not have any table locking?)
Wouldnt it be faster just INSERTING one row for every visitor? (and more accurate) On the other hand, reading the statistics would be much faster with my first solution, wouldnt it? (especially in terms of "grouping" by site/date/[...]).
2) For every visitor counted I'd like to make a money transfer between two users. It is crucial that those transfers are always accurate. How would you achieve that?
I was thinking about a hourly cron that picks the amount of vistiors from the mongoDB.statistics for the last hour and updates the users balance. I'd prefer doing this directly/live while counting the user - but what happens if thousands of visitors are calling the script simultaneously, is there any risk of getting wrong balances?

Related

Create tables using a predefined schema in REST API call in Springboot

There is a scenario where I need to add entry for every user in a table. There will be around 5-10 records per user and the approximate number of users are approximately 1000. So, if I add the data of every user each day in a single table, the table becomes very heavy and the Read/Write operations in the table would take some time to return the data(which would be mostly for a particular user)
The tech stack for back-end is Spring-boot and PostgreSQL.
Is there any way to create a new table for every user dynamically from the java code and is it really a good way to manage the data, or should all the data should be in a single table.
I'm concerned about the performance of the queries once there are many records in case of a single table holding data for every user.
The model will contain the similar things like userName, userData, time, etc.
Thank you for you time!
Creating one table per user is not a good practice. Based on the information you provided, 10000 rows are created per day. Any RDBMS will be able to perfectly handle this amount of data without any performance issues.
By making use of indexing and partitioning, you will be able to address any potential performance issues.
PS: It is always recommended to define a retention period for data you want to keep in operational database. I am not sure about your use-case, but if possible define a retention period and move older data out of operational table into a backup storage.

MongoDB documents of calulated values for a dashboard vs re-retrieving on each web page view?

If I have a page in a web app that displays some dashboard type statistics about documents in my database (counts, docs created per hour, per day etc), is it best to pre-calculate this data and store it in a separate document (and update as needed), or assuming the collections have appropriate indexes, would it be appropriate to execute queries to retrieve these statistics on every load of the page?
It's not necessary that the data has to be exactly up to date on every page hit/load, so that's why I was thinking to maintain the data I need to display in a separate document that can be retrieved on page hit (or even cached and only re-retrieved every 5 minutes or similar).
That's pretty broad, and I have the feeling you have already identified the key points. Generally speaking, you should consider these questions:
Do you need to allow users to apply filters? Complex filters usually make pre-aggregation impossible.
Related: Is it likely that the exact same data is ever queried again? If not, pre-aggregation might need to happen on different levels of granularity (e.g. by creating day / week / month totals and summing these, instead of individual events).
What is the relation of reads vs. writes on the data? If the number of writes is small, it might be OK to keep counters in real-time, instead of using read-caching.
What are your performance requirements for cached and uncached queries? Getting fast cached queries is trivial, but comes at the cost of stale data. Making uncached queries faster is more tricky and usually requires something like the multi-level approach discussed before - it often doesn't help if old data comes super fast, but new queries take minutes.
Caching works especially well if the data can't be changed later (or is seldomly changed), and the queries remain the same with a certain chance of re-occuring. A nice example are facebook's profiles, where past years are apparently cached for every visitor-profile combination. First accesses are slow, however...

MongoDB High Avg. Flush Time - Write Heavy

I'm using MongoDB with approximately 4 million documents and around 5-6GB database size. The machine has 10GB of RAM, and free only reports around 3.7GB in use. The database is used for a video game related ladder (rankings) website, separated by region.
It's a fairly write heavy operation, but still gets a significant number of reads as well. We use an updater which queries an outside source every hour or two. This updater then processes the records and updates documents on the database. The updater only processes one region at a time (see previous paragraph), so approximately 33% of the database is updated.
When the updater runs, and for the duration that it runs, the average flush time spikes up to around 35-40 seconds, and we experience general slowdowns with other queries. The updater is RAN on a SEPARATE MACHINE and only queries MongoDB at the end, when all the data has been retrieved and processed from the third party.
Some people have suggested slowing down the number of updates, or only updating players who have changed, but the problem comes down to rankings. Since we support ties between players, we need to pre-calculate the ranks - so if only a few users have actually changed ranks, we still need to update the rest of the users ranks accordingly. At least, that was the case with MySQL - I'm not sure if there is a good solution with MongoDB for ranking ~800K->1.2 million documents while supporting ties.
My question is: how can we improve the flush and slowdown we're experiencing? Why is it spiking so high? Would disabling journaling (to take some load off the i/o) help, as data loss isn't something I'm worried about as the database is updated frequently regardless?
Server status: http://pastebin.com/w1ETfPWs
You are using the wrong tool for the job. MongoDB isn't designed for ranking large ladders in real time, at least not quickly.
Use something like Redis, Redis have something called a "Sorted List" designed just for this job, with it you can have 100 millions entries and still fetch the 5000000th to 5001000th at sub millisecond speed.
From the official site (Redis - Sorted sets):
Sorted sets
With sorted sets you can add, remove, or update elements in a very fast way (in a time proportional to the logarithm of the number of elements). Since elements are taken in order and not ordered afterwards, you can also get ranges by score or by rank (position) in a very fast way. Accessing the middle of a sorted set is also very fast, so you can use Sorted Sets as a smart list of non repeating elements where you can quickly access everything you need: elements in order, fast existence test, fast access to elements in the middle!
In short with sorted sets you can do a lot of tasks with great performance that are really hard to model in other kind of databases.
With Sorted Sets you can:
Take a leader board in a massive online game, where every time a new score is submitted you update it using ZADD. You can easily take the top users using ZRANGE, you can also, given an user name, return its rank in the listing using ZRANK. Using ZRANK and ZRANGE together you can show users with a score similar to a given user. All very quickly.
Sorted Sets are often used in order to index data that is stored inside Redis. For instance if you have many hashes representing users, you can use a sorted set with elements having the age of the user as the score and the ID of the user as the value. So using ZRANGEBYSCORE it will be trivial and fast to retrieve all the users with a given interval of ages.
Sorted Sets are probably the most advanced Redis data types, so take some time to check the full list of Sorted Set commands to discover what you can do with Redis!
Without seeing any disk statistics, I am of the opinion that you are saturating your disks.
This can be checked with iostat -xmt 2, and checking the %util column.
Please don't disable journalling - you will only cause more issues later down the line when your machine crashes.
Separating collections will have no effect. Separating databases may, but if you're IO bound, this will do nothing to help you.
Options
If I am correct, and your disks are saturated, adding more disks in a RAID 10 configuration will vastly help performance and durability - more so if you separate the journal off to an SSD.
Assuming that this machine is a single server, you can setup a replicaset and send your read queries there. This should help you a fair bit, but not as much as the disks.

Calculating price drop Apps or Apps gonna free - App Store

I am working on a Website which is displaying all the apps from the App Store. I am getting AppStore data by their EPF Data Feeds through EPF Importer. In that database I get the pricing of each App for every store. There are dozen of rows in that set of data whose table structure is like:
application_price
The retail price of an application.
Name Key Description
export_date The date this application was exported, in milliseconds since the UNIX Epoch.
application_id Y Foreign key to the application table.
retail_price Retail price of the application, or null if the application is not available.
currency_code The ISO3A currency code.
storefront_id Y Foreign key to the storefront table.
This is the table I get now my problem is that I am not getting any way out that how I can calculate the price reduction of apps and the new free apps from this particular dataset. Can any one have idea how can I calculate it?
Any idea or answer will be highly appreciated.
I tried to store previous data and the current data and then tried to match it. Problem is the table is itself too large and comparing is causing JOIN operation which makes the query execution time to more than a hour which I cannot afford. there are approx 60, 000, 000 rows in the table
With these fields you can't directly determine price drops or new application. You'll have to insert these in your own database, and determine the differences from there. In a relational database like MySQL this isn't too complex:
To determine which applications are new, you can add your own column "first_seen", and then query your database to show all objects where the first_seen column is no longer then a day away.
To calculate price drops you'll have to calculate the difference between the retail_price of the current import, and the previous import.
Since you've edited your question, my edited answer:
It seems like you're having storage/performance issues, and you know what you want to achieve. To solve this you'll have to start measuring and debugging: with datasets this large you'll have to make sure you have the correct indexes. Profiling your queries should helping in finding out if they do.
And probably, your environment is "write once a day", and read "many times a minute". (I'm guessing you're creating a website). So you could speed up the frontend by processing the differences (price drops and new application) on import, rather than when displaying on the website.
If you still are unable to solve this, I suggest you open a more specific question, detailing your DBMS, queries, etc, so the real database administrators will be able to help you. 60 million rows are a lot, but with the correct indexes it should be no real trouble for a normal database system.
Compare the table with one you've downloaded the previous day, and note the differences.
Added:
For only 60 million items, and on a contemporary PC, you should be able to store a sorted array of the store id numbers and previous prices in memory, and do an array lookup faster than the data is arriving from the network feed. Mark any differences found and double-check them against the DB in post-processing.
Actually I also trying to play with these data, and I think best approach for you base on data from Apple.
You have 2 type of data : full and incremental (updated data daily). So within new data from incremental (not really big as full) you can compare only which record updated and insert them into another table to determine pricing has changed.
So you have a list of records (app, song, video...) updated daily with price has change, just get data from new table you created instead of compare or join them from various tables.
Cheers

PostgreSql and retrieving real time business statistics leads to too long queries : a solution?

We have a national application & the users would like to have accurate business statistics regarding some tables.
We are using tomcat, Spring Ws & hibernate on top of that.
We have thought of many solutions :
plain old query for each user request. The problem is those tables contains millions of records. Every query will take many seconds at least. Solution never used.
the actual solution used: create trigger. But it is painful to create & difficult to maintain (no OO, no cool EDI, no real debug). The only helping part is the possibility to create Junit Test on a higher level to verify the expected result. And for each different statistic on a table we have to create an other trigger for this table.
Using the quartz framework to consolidate data after X minutes.
I have learned that databases are not designedfor these heavy and complicated queries.
A separate data warehouse optimize for reading only queries will be better. (OLAP??)
But I don't have any clue where to start with postGresql. (pentaho is the solution or just a part?)
How could we extract data from the production database ? Using some extractor ?
And when ?Every night ?
If it is periodically - How will we manage to maintain near real time statistics if the data are just dumped on our datawarehouse one time per day ?
"I have learn that databases are NOT DESIGNED for these heavy and complicated queries."
Well you need to unlearn that. A database was designed for just these type of queries. I would blame bad design of the software you are using before I would blame the core technology.
I seems i have been misunderstood.
For those who think that a classic database is design for even processing real-time statistic with queries on billions datas, they might need to read articles on the origin of OLAP & why some people bother to design products around if the answer for performance was just a design question.
"I would blame bad design of the software you are using before I would blame the core technology."
By the way, im not using any software (or pgadmin counts ?). I have two basic tables, you cant make it more simple,and the problem comes when you have billions datas to retreve for statistics.
For those who think it is just a design problm, im glad to hear their clever answer (no trigger i know this one) to a simple problem :
Imagine you have 2 tables: employees & phones. An employee may have 0 to N phones.
Now let say that you have 10 000 000 employees & 30 000 000 phones.
You final users want to know in real time :
1- the average number of phones per user
2-the avegarde age of user who have more than 3 phones
3-the averagae numbers of phones for employees who are in the company for more than 10 years
You have potentially 100 users that want those real time statistics at anytime.
Of course, any queries dont have to take more than 1/4 sec.
Incrementally summarize the data..?
The frequency depends on your requirements, and in extreme cases you may need more hardware, but this is very unlikely.
Bulk load new data
Calculate new status [delta] using new data and existing status
Merge/update status
Insert new data into permanent table (if necessary)
NOTIFY wegotsnewdata
Commit
StarShip3000 is correct, btw.