Find out HDD usage of binary data in a field - mongodb

I have a collection with a field image that contains images (BinData). I want to find out what percentage of the database is used by the images. What is the most efficient way to calculate the total size of all images?
I want to avoid fetching all images from the DB server, so I came up with this code:
mapper = Code("""
function() {
    var n = 0;
    if (this.image) {
        n = this.image.length();
    }
    emit('sizes', n);
}
""")
reducer = Code("""
function(key, sizes) {
    var total = 0;
    for (var i = 0; i < sizes.length; i++) {
        total += sizes[i];
    }
    return total;
}
""")
result = db.files.map_reduce(mapper, reducer, "image_sizes")
During execution the memory usage of mongodb gets quite high; it looks as if all the data is loaded into memory. How can this be optimized? Also, does it make sense to call this.image.length() to find out how many bytes the images occupy on the hard drive?

You cannot avoid loading all the data into memory. MongoDB treats a document as its atomic unit, and by querying all documents you pull every one of them into memory.
As an alternative, it might help to simply check how many bytes the collection takes up as a whole, but that of course only works if images are the only thing stored in the collection. In the shell, you can do this with:
db.files.stats()
which has a storageSize field that shows approximately how much storage your images need. This is not nearly as accurate as going through all the images, though.
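If you want the result as a percentage of the whole database rather than a raw byte count, the same stats are available from pymongo; a rough sketch (the database name mydb is a placeholder, and it assumes the files collection holds little besides the images):
from pymongo import MongoClient

client = MongoClient()                  # assumes a locally running mongod
db = client["mydb"]                     # placeholder database name

files_stats = db.command("collstats", "files")   # per-collection statistics
db_stats = db.command("dbstats")                 # whole-database statistics

pct = 100.0 * files_stats["storageSize"] / db_stats["storageSize"]
print("files takes up roughly %.1f%% of the database's storage" % pct)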

Related

Spark update cached dataset

I have a sorted dataset, which is updated (filtered) inside a loop according to the value of the head of the dataset.
If I cache the dataset every n (e.g., 50) cycles, I have good performance.
However, after a certain number of cycles the cache no longer seems to work, and the program slows down (I guess because the memory assigned to caching fills up).
I am asking whether, and how, it is possible to keep only the updated dataset in the cache, so that memory does not fill up and performance stays good.
Please find below an example of my code:
dataset = dataset.sort(/* sort condition */)
dataset.cache()
var head = dataset.head(1)
var count = 0
while (head.nonEmpty) {
  count += 1
  /* custom operation with the head */
  dataset = dataset.filter(/* filter condition based on the head of the dataset */)
  if (count % 50 == 0) {
    dataset.cache()
  }
  head = dataset.head(1)
}
cache alone won't help you here. With each iteration the lineage and the execution plan grow, and that is not something caching alone can address.
You should at least break the lineage:
if (count % 50 == 0) {
  dataset = dataset.checkpoint()
  dataset.cache()
}
although personally I would also write data to a distributed storage and read it back:
if (count % 50 == 0) {
  dataset.write.parquet(s"/some/path/$count")
  dataset = spark.read.parquet(s"/some/path/$count")
}
It might not be acceptable depending on your deployment, but in many cases it behaves more predictably than caching and checkpointing.
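Putting the pieces together, a rough PySpark sketch of the same loop with lineage truncation (the input path, checkpoint directory, and the column used for sorting and filtering are placeholders; head(1) returning an empty list plays the role of head.nonEmpty):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/some/checkpoint/dir")   # placeholder path

dataset = spark.read.parquet("/some/input/path")   # placeholder source
dataset = dataset.sort("value")                    # placeholder sort column
dataset.cache()

head = dataset.head(1)
count = 0
while head:
    count += 1
    # custom operation with the head would go here
    dataset = dataset.filter(dataset["value"] > head[0]["value"])   # placeholder filter
    if count % 50 == 0:
        # checkpoint() materializes the data and truncates the lineage;
        # reassigning the result is what keeps later plans short
        dataset = dataset.checkpoint()
        dataset.cache()
    head = dataset.head(1)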
Try unpersisting the dataset before caching it again; this way you remove the old copy of the dataset from memory and keep only the latest one, avoiding multiple copies in memory. Below is a sample, but you have to put dataset.unpersist() in the correct location based on your code logic:
if (count % 50 == 0) {
  dataset.unpersist()
}
dataset = dataset.filter(/* filter condition based on the head of the dataset */)
if (count % 50 == 0) {
  dataset.cache()
}

MongoDB Map/Reduce very poor performance

I am executing Map/Reduce in MongoDB and it is unexpectedly, extremely slow. I am using a very wide document with 700 fields, and the reduce function performs a simple sum after multiplying each value in the document by another value.
Performance is very poor: 20 minutes for only 20K docs, and it grows linearly with the number of docs. Looking at mongotop, 99% of the time the process is busy writing into the temp collection. Any idea what it is doing and how to optimize it? Is anything missing?
I have used all the proper practices: sorted by all emitted keys (Key1, Key2, Key3)
and created a compound index on them with ensureIndex({Key1: 1, Key2: 1, Key3: 1})
map = function() {
    var key = {
        key1: this.Key1,
        key2: this.Key2,
        key3: this.Key3
    };
    var values = this;
    emit(key, values);
}

reduce = function(key, values) {
    // initialize all values to 0
    var ret = {
        val1: 0,
        val2: 0,
        // ...
        val700: 0
    };
    values.forEach(function(v) {
        ret.val1 += v.val1 * v.quantity;
        ret.val2 += v.val2 * v.quantity;
        // ...
        ret.val700 += v.val700 * v.quantity;
    });
    return ret;
}

db.runCommand({mapReduce: coll, map: map, reduce: reduce, out: coucoll, sort: {Key1: 1, Key2: 1, Key3: 1}})

Random Sampling from Mongo

I have a mongo collection of documents. There is one field in every document which is 0 or 1. I need to randomly sample 1000 records from the database and count how many of them have that field set to 1. I need to do this sampling 1000 times. How do I do it?
For people coming to this answer now: you should use the $sample aggregation stage, new in MongoDB 3.2.
https://docs.mongodb.org/manual/reference/operator/aggregation/sample/
db.collection_of_things.aggregate(
    [ { $sample: { size: 15 } } ]
)
Then add another step to count up the 0s and 1s using $group to get the count; the MongoDB docs have examples of $group.
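A possible pymongo sketch of the combined pipeline (the database/collection names and the 0/1 field name theflag are placeholders):
from pymongo import MongoClient

coll = MongoClient()["mydb"]["collection_of_things"]   # placeholder names

pipeline = [
    {"$sample": {"size": 1000}},                 # draw 1000 random documents
    {"$group": {"_id": "$theflag",               # placeholder name of the 0/1 field
                "count": {"$sum": 1}}},          # count docs per value
]
for bucket in coll.aggregate(pipeline):
    print(bucket)    # one result per value, e.g. {'_id': 1, 'count': ...}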
For MongoDB 3.0 and before, I use an old trick from SQL days (which I think Wikipedia uses for its random page feature). I store a random number between 0 and 1 in every object I need to randomize; let's call that field "r". You then add an index on "r":
db.coll.ensureIndex({r: 1});
Now to get random x objects, you use:
var startVal = Math.random();
db.coll.find({r: {$gt: startVal}}).sort({r: 1}).limit(x);
This gives you random objects in a single find query. Depending on your needs this may be overkill, but if you are going to be doing lots of sampling over time, it is a very efficient way to do it without putting extra load on your backend.
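The same trick in pymongo might look roughly like this (the database, collection, and field names are placeholders):
import random
from pymongo import MongoClient

coll = MongoClient()["mydb"]["coll"]      # placeholder names
coll.create_index("r")

# store a random number alongside each document when inserting it
coll.insert_one({"thefield": 1, "r": random.random()})

# later: fetch x roughly random documents in a single indexed query
x = 1000
start_val = random.random()
sample = list(coll.find({"r": {"$gt": start_val}}).sort("r", 1).limit(x))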
Here's an example in the mongo shell, assuming a collection named collname and a value of interest in thefield:
var total = db.collname.count();
var count = 0;
var numSamples = 1000;
for (var i = 0; i < numSamples; i++) {
    var random = Math.floor(Math.random() * total);
    var doc = db.collname.find().skip(random).limit(1).next();
    if (doc.thefield) {
        count += (doc.thefield == 1);
    }
}
I was going to edit my comment on @Stennie's answer with this, but you could also use a separate auto-incrementing ID index here as an alternative, if you were skipping over HUGE numbers of records (talking huge here).
I wrote an answer to another question a lot like this one, where someone was trying to find the nth record of a collection:
php mongodb find nth entry in collection
The second half of my answer basically describes one potential method by which you could approach this problem. You would still need to loop 1000 times to get the random row of course.
If you are using mongoengine, you can use a SequenceField to generate an incremental counter.
class User(db.DynamicDocument):
    counter = db.SequenceField(collection_name="user.counters")
Then, to fetch a random list of, say, 100 users, do the following:
def get_random_users(number_requested):
    users_to_fetch = random.sample(range(1, User.objects.count() + 1), min(number_requested, User.objects.count()))
    return User.objects(counter__in=users_to_fetch)
where you would call
get_random_users(100)

EF4: Object Context consuming too much memory

I have a reporting tool that runs against an MS SQL Server using EF4. The bulk of the report involves looping over around 5000 rows and then pulling numerous other rows for each of them.
I pull the initial rows through one data context. The code that pulls the related rows uses another data context, wrapped in a using statement. It would appear, though, that the memory consumed by the second data context is never freed, and usage shoots up to 1.5GB before an out-of-memory exception is thrown.
Here is a snippet of the code so you can get the idea:
var outlets = (from o in db.tblOutlets
               where o.OutletType == 3
                     && o.tblCalls.Count() > number && o.BelongsToUser.HasValue && o.tblUser.Active == true
               select new { outlet = o, callcount = o.tblCalls.Count() }).OrderByDescending(p => p.callcount);

var outletcount = outlets.Count();
//var outletcount = 0;
//var average = outlets.Average(p => p.callcount);

foreach (var outlet in outlets)
{
    using (relenster_v2Entities db_2 = new relenster_v2Entities())
    {
        //loop over calls and add history
        //check the last time the history table was added to for this call
        var lastEntry = (from h in db_2.tblOutletDistributionHistories
                         where h.OutletID == outlet.outlet.OutletID
                         orderby h.VisitDate descending
                         select h).FirstOrDefault();

        DateTime? beginLooking = null;
I had hoped that by using a second data context, memory could be released after each iteration. It would appear it is not (or the GC is not running in time).
With the input from @adrift I altered the code so that the changes were saved after each iteration of the loop, rather than all at the end. It would appear that there is a limit (in my case, anyway) of around 150,000 pending writes that the data context can happily hold before consuming too much memory.
By allowing it to write the changes after each iteration, it seems to manage memory more effectively; although it still seemed to use just as much memory, it didn't throw an exception.

iPhone context: How do I extract palette information from an image?

Hi everybody: I want to take a picture and retrieve the main color by analyzing its palette (I think this should be the easiest way), but I don't really know where to start.
Assuming you have a raw 24-bit RGB image and you want to find the number of times each color appears, there are really two simple ways of doing this:
My favourite is to create an array of ints with one entry for each possible color, then just index into that array and increment; that method does use about 64 MB of memory, though.
Another method is to create a linked list, adding to it each time a new color is encountered; the structs stored in the list hold the color and the number of times it has been encountered. The accumulating is slow because you have to search the whole list for each pixel, but it is quicker to sort by the number of times used, as only colors actually used are in the list (which also makes it better suited to small images).
I like a compromise between the two:
Use, say, red and green to index into an array of linked lists. That way the array is only 256K (assuming it just stores a pointer to each list), and the lists to search are relatively short because each one only holds the blue variants of one red/green combination. If you are only interested in the SINGLE most used color, I would store it in a "max color" variable and compare against it each time I iterate over a pixel and increment a color; that way you don't have to go through the whole structure searching for the most used color at the end.
struct Pixel
{
    byte R, G, B;
};

const int noPixels = 1024*768; // this is whatever the number of pixels you have is
Pixel pixels[noPixels];        // this is your raw array of pixels

unsigned int mostUsedCount = 0;
Pixel mostUsedColor;

struct ColorNode
{
    ColorNode* next;
    unsigned int count;
    byte B;
};

// one bucket (a linked list of blue values) per red/green combination
ColorNode** RG = new ColorNode*[256*256];
memset(RG, 0, sizeof(ColorNode*)*256*256);

for(int i = 0; i < noPixels; i++)
{
    int idx = pixels[i].R + pixels[i].G*256;
    // look for an existing node with this blue value
    ColorNode* t;
    for(t = RG[idx]; t; t = t->next)
    {
        if(t->B == pixels[i].B)
        {
            break;
        }
    }
    // not found: prepend a new node for this colour
    if(!t)
    {
        t = new ColorNode;
        t->next = RG[idx];
        RG[idx] = t;
        t->B = pixels[i].B;
        t->count = 0;
    }
    t->count++;
    if(t->count > mostUsedCount)
    {
        mostUsedCount = t->count;
        mostUsedColor = pixels[i];
    }
}
You might consider using a binary tree, or a tree of some kind, rather than just searching through a list like that, but I'm not too knowledgeable on that type of thing...
Oh yeah... I forgot about memory management.
You could just go through the whole array and delete all the nodes that need deleting, but that would be tedious.
If possible, allocate all the memory you could possibly need at the beginning; that would be 256K + sizeof(ColorNode) * noPixels, roughly 256K plus 2 to 3 times the size of your raw image data.
That way you can just hand out nodes from the pool with a simple stack method, and then delete everything in one fell swoop.
Another thing you could do is add every allocated node to a second linked list as well; this adds a little work to the iteration and an extra field to ColorNode, but it saves you from having to iterate through the entire array to find the lists to delete.
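For comparison, the simple exact-counting approach from the start of this answer can be sketched in a few lines of Python with a built-in hash map, assuming the pixels are available as (R, G, B) tuples (e.g. from PIL's Image.getdata()); this is just an illustration of the idea, not iPhone code:
from collections import Counter

def most_used_color(pixels):
    """pixels: an iterable of (R, G, B) tuples."""
    counts = Counter(pixels)            # exact count of every colour
    return counts.most_common(1)[0]     # ((R, G, B), number of occurrences)

# tiny usage example
print(most_used_color([(255, 0, 0), (255, 0, 0), (0, 0, 255)]))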