MongoDB - anonymizing 600k records - Scala

I am trying to anonymize a large data set of about 600k records (removing sensitive information like email, etc.) so that it can be used for some performance tests.
I am using Scala (Casbah) with Mongo. The actual script is pretty simple and straightforward. When I run the script, the entire process starts off pretty fast - parsing 1000 records every 2-3 seconds, but it slows down tremendously and starts crawling very slowly.
I know this is pretty vague without many details, but any idea why this is happening, and any hints on how I could speed it up?

It turned out to be an issue with the driver and not with Mongo. When I tried the same inserts using the mongo shell, it went through without breaking a sweat.
UPDATE
So, I tried both approaches: inserting into the existing collection and dumping the results into a new collection. The first approach was faster for me. Of course, one should never assume this is always true and must benchmark before choosing the first approach over the second. In both cases, Mongo was very, very fast (meaning it was not taking hours to get this done). There was a problem with the Java interface I was using to connect to Mongo, which was more of a stupid mistake on my part.
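For reference, the two approaches look roughly like this in the mongo shell (the collection and field names here are placeholders, and the replacement values are obviously fake):
// "users" and "email" are placeholder names for this sketch.
// Approach 1: anonymize in place with update + $set.
var i = 0;
db.users.find({}, { _id: 1 }).forEach(function (doc) {
    db.users.update(
        { _id: doc._id },
        { $set: { email: "user" + (i++) + "@example.com" } }
    );
});

// Approach 2: dump scrubbed copies into a new collection.
var j = 0;
db.users.find().forEach(function (doc) {
    doc.email = "user" + (j++) + "@example.com";
    db.users_anonymized.insert(doc);
});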

Related

Time-consuming queries to PostgreSQL Database are duplicated

After restoring the DB server from a snapshot, something strange started happening with our database. Basically, all time-consuming queries seem to be duplicated, at least as pg_stat_activity shows it.
These lines are almost equal except for their PIDs and client addresses.
Usually I'd think that's just a mistake by the dev team (multiple identical queries issued at the same time in code, a cron misconfiguration, etc.), but one of those time-consuming selects comes from Power BI, which I believe to be quite reliable in terms of loading data.
Has anybody ever stumbled upon this problem?
Turned out that's the way pg_stat_activity shows parallel workers processing a single query. You can make sure that's the case by checking the backend_type of these records.

storing huge amounts of data in mongo

I am working on a front end system for a radius server.
The radius server will pass updates to the system every 180 seconds, which means that with about 15,000 clients I would get around 7,200,000 entries per day... which is a lot.
I am trying to understand what the best possible way to store and retrieve this data will be. Obviously, as time goes on, this will become substantial. Will MongoDB handle this? A typical document is not much, something like this:
{
    id: 1,
    radiusId: "uniqueId",
    start: "2017-01-01 14:23:23",
    upload: 102323,
    download: 1231556
}
However, there will be MANY of these records. I guess this is similar to the way SNMP NMS servers handle data, which, as far as I know, they use RRD for.
Currently in my testing I just push every document into a single collection. So I am asking:
A) Is Mongo the right tool for the job, and
B) Is there a better/more preferred/more optimal way to store the data?
EDIT:
OK, so just in case someone comes across this and needs some help:
I ran it in Mongo for a while and was really not satisfied with the performance. We can chalk this up to the hardware I was running on, perhaps my level of knowledge, or the framework I was using. However, I found a solution that works very well for me. InfluxDB pretty much handles all of this right out of the box; it's a time-series database, which is effectively the data I am trying to store (https://github.com/influxdata/influxdb). Performance for me has been like night and day. Again, it could all be my fault, just updating this.
EDIT 2:
So after a while I think I figured out why I never got the performance I was after with Mongo. I am using sails.js as the framework, and it was searching by id using a regex, which obviously has a huge performance hit. I will eventually try to migrate back to Mongo instead of InfluxDB and see if it's better.
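For illustration, here is the difference in the mongo shell (collection name and value are placeholders): a case-insensitive regex, similar to what an ORM may generate, cannot use the index efficiently, while an exact match can.
// "records" is a placeholder collection name.
db.records.createIndex({ radiusId: 1 });

// A case-insensitive regex lookup cannot use the index efficiently,
// so each query degenerates into a scan.
db.records.find({ radiusId: /^someUniqueId$/i });

// An exact match on the same field uses the index directly.
db.records.find({ radiusId: "someUniqueId" });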
15,000 clients updating every 180 seconds = ~83 insertions / sec. That's not a huge load even for a moderately sized DB server, especially given the very small size of the records you're inserting.
I think MongoDB will do fine with that load (also, to be honest, almost any modern SQL DB would probably be able to keep up as well). IMHO, the key points to consider are these:
Hardware: make sure you have enough RAM. This will primarily depend on how many indexes you define and how many queries you're running. If this is primarily a log that will rarely be read, then you won't need much RAM for your working set (although you'll need enough for your indexes). But if you're also running queries, then you'll need considerably more resources.
If you are running extensive queries, consider setting up a replica set. That way, your primary can be reserved for writing data, ensuring reliability, while your secondaries can be configured to serve your queries without affecting write reliability.
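As a minimal sketch (collection name is a placeholder; recent shells), a read can be directed at a secondary with a read preference so the primary stays free for writes:
// "records" is a placeholder collection name.
db.records.find({ radiusId: "someUniqueId" }).readPref("secondaryPreferred");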
Regarding the data structure, I think that's fine, but it'll really depend on what type of queries you wish to run against it. For example, if most queries use the radiusId to reference another table and pull in a bunch of data for each record, then you might want to consider denormalizing some of that data. But again, that really depends on the queries you run.
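For example, if the common query were "traffic for one radiusId over a time range", a compound index matching that pattern would be a reasonable starting point (collection name is a placeholder; field names come from the sample document above, assuming start is stored as a real date):
// "records" is a placeholder collection name.
db.records.createIndex({ radiusId: 1, start: 1 });

// A range query that the index above can satisfy.
db.records.find({
    radiusId: "someUniqueId",
    start: { $gte: ISODate("2017-01-01T00:00:00Z"), $lt: ISODate("2017-01-02T00:00:00Z") }
});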
If you're really concerned about managing the write load reliably, consider using the Mongo front-end only to manage the writes, and then dumping the data to a data warehouse backend to run queries on. You can partially do this by running a replica set like I mentioned above, but the disadvantage of a replica set is that you can't restructure the data. The data in each member of the replica set is exactly the same (hence the name, replica set :-) Oftentimes, the best structure for writing data (normalized, small records) isn't the best structure for reading data (denormalized, large records with all the info and joins you need already done). If you're running a bunch of complex queries referencing a bunch of other tables, using a true data warehouse for the querying part might be better.
As your write load increases, you may consider sharding. I'm assuming the RadiusId points to each specific server among a pool of Radius servers. You could potentially shard on that key, which would split the writes based on which server is sending the data. Thus, as you increase your radius servers, you can increase your mongo servers proportionally to maintain write reliability. However, I don't think you need to do this right away as I bet one reasonably provisioned server should be able to manage the load you've specified.
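A rough sketch of what that would look like from the shell, assuming a database named radius and a collection named records:
// Database and collection names are placeholders.
sh.enableSharding("radius");
// Split writes across shards by the radius server sending the data.
sh.shardCollection("radius.records", { radiusId: 1 });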
Anyway, those are my preliminary suggestions.

Using find() with arguments that match nothing results in the RAM never getting "dropped"

I'm currently setting up MongoDB, and so far I have a DB with 25 million objects and some indexes. When I use find() with 4 arguments and I know that one of the arguments is "wrong", i.e. the result of the query will be nothing found, what happens is that MongoDB takes all my RAM (8 GB) and the time to return "nothing found" varies a lot.
BUT the biggest problem is that once it's done, it never lets go of the RAM, and I need to restart MongoDB to get the RAM back.
I've tried many other tests, such as adding huge numbers of objects (1 million+), using find() for something that I know exists, and other operations, and they work fine.
I also want to note that this works fine if I have a smaller DB, like 1 million objects.
I will keep testing this, but if anyone knows something right away that could help, that would be great.
EDIT: Tested some more with the 1M DB. The first failed find took about 4 seconds and used some RAM; the subsequent failures took only 0.6 seconds and didn't seem to affect the RAM. Does MongoDB pull the DB into RAM when trying to find a document, and with a very huge DB it can't fit it all in the cache?
By the way, on the 25M DB it first took about 60 seconds to not find it, but the second try took 174 seconds, and after that find it had taken all 8 GB of RAM.
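One way to see what such a "finds nothing" query is actually doing is to run it through explain() in a recent shell (collection and field names below are placeholders) and check whether the winning plan walks an index or scans the whole collection:
// Placeholder collection and field names.
db.objects.find({ a: 1, b: 2, c: 3, d: "does-not-exist" }).explain("executionStats");
// An IXSCAN stage in winningPlan means an index was used; a COLLSCAN stage
// means all 25 million documents were read (and pulled through the cache),
// which would match the behaviour described above.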

Mongo database extremely slow until I restart

I just inherited an application from another developer, and I've been asked to fix some latency issues that users have been experiencing. The problem is that any page that makes db calls to mongo takes several minutes to load in the browser.
When I restart mongo, however, everything speeds up again, and the application functions normally. I see several cron jobs that run throughout the day, and I believe one of these may be causing mongo to slow down.
Unfortunately, I have no experience with mongo (only mysql), and I really don't have any idea of what I'm looking for in terms of things that could be making mongo run so slowly.
Anyway, I was hoping someone could suggest some potential causes of the latency so I can approach this problem better. I have looked in the mongo logs, and the only thing I see that could be of concern is a message that says:
warning: can't find plugin [asc]
I know this may point to an indexing problem, but are there any other obvious things I should be investigating?
From what I read at https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/pqPvMq7cSBw it looks like one of your queries was declared as
db.a2.find().sort({a:"asc"})
rather than
db.a2.find().sort({a:1})
In MongoDB you need to declare your sort order with either 1 or -1; there are no asc or desc constants for sorting. So I would recommend that you check whether any of your queries are written incorrectly. You can check which queries are running through the log files (with the correct profiling settings): http://docs.mongodb.org/manual/tutorial/manage-the-database-profiler/. You may also use mongotop (http://docs.mongodb.org/manual/reference/mongotop/) to see where the most time reading/writing data is spent for your collections.
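For example, a quick way to turn the profiler on from the shell and look at the slowest recent operations (the 100 ms threshold is just an example value):
// Log operations slower than 100 ms to the system.profile collection.
db.setProfilingLevel(1, 100);

// Show the most recent slow operations, newest first.
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty();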

Can CouchDB handle 15 million records daily?

I'm relatively new to NoSQL databases and I have to evaluate different NoSQL-Solutions for a monitoring tool.
The situation is the following:
One datum is only about 100 bytes, but there are really a lot of them. During a day we get about 15 million records... so I'm currently testing with 900 million records (about 15 GB as a SQL insert script).
My question is: does CouchDB fit my needs? I need to do range queries (on the date the records were created) and sum up some of the columns according to groups defined by "secondary indexes" stored in the datum.
I know that MapReduce is probably the best solution to calculate that, but is CouchDB's JavaScript able to do this in an acceptable time?
I already tried MongoDB, but its really poor MapReduce did a crappy job... I also read about HBase and Cassandra. But maybe CouchDB is also a good possibility.
I hope I gave you all the needed information... Thank you for your help!
andy
Frankly, at this time, unless you have very good hardware, Apache CouchDB may run into problems. Map/reduce will probably be fine. CouchDB's incremental map/reduce is ideal for your requirements.
As a developer, you will love it! Unfortunately as a sysadmin, you may notice more disk usage and i/o than expected.
I suggest trying it. Being HTTP and JavaScript, it's easy to do a feasibility test. Just remember, the initial view build will take a long time (let's assume for the sake of argument that it takes longer than in every other competing database). But that time will never be spent again: map/reduce runs only once per document (actually, once per document update).
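As a rough sketch of what that incremental map/reduce could look like for the numbers above (the field names group, created, and value are assumptions, since the actual document layout wasn't shown), a design document could key a view by [group, date] and use the built-in _sum reduce:
// Hypothetical design document; "group", "created" and "value" are placeholder field names.
var designDoc = {
    _id: "_design/traffic",
    views: {
        sum_by_group_and_day: {
            map: "function (doc) { emit([doc.group, doc.created], doc.value); }",
            reduce: "_sum"
        }
    }
};
Querying that view with startkey=["groupA", "2013-01-01"], endkey=["groupA", "2013-01-31"] and group_level=1 returns one summed row for that group over the date range, and on later requests CouchDB only reprocesses documents that changed since the view was last read.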