Storing a string that can vary a lot, from very long to very short :: Fragmentation - mongodb

OK, so every player in my game has a document in my players collection, and each player has one string that is a serialized hash of their game state. So this string can be way long or way short, and it varies a lot from player to player.
I had somebody who doesn't have a ton of Mongo experience tell me that I should pad every single string in the collection so that they are all the same length. So, like, add tons of zeros at the end of all the short and medium game state strings.
So A) is this a good idea?
B) I'm not even totally sure how to find out the longest game state length, so I'm not sure how far to pad them, and what if later on game states exceed my padding length?
My friend said he had a mongo collection keep blowing up because of fragmentation and when he implemented padding all of his issues went away.
Oh, I doubt it matters, but my code is in PHP and obviously uses the PHP PECL mongo driver.
Thanks for any thoughts or input!!!!!
-dave

MongoDB allocates space for documents at creation time. If the size of a document increases, the document will need to be moved to a new location to accommodate the larger size. The original space is not released to the operating system; instead, MongoDB will eventually reuse it. Until this happens, the database may appear over-allocated, or what is sometimes called fragmented.
So, what probably happened to your friend:
documents were inserted
when fields were updated, their sizes sometimes increased, and the documents therefore grew
documents were moved as they grew, and the database became over-allocated (what your friend called fragmented)
And by padding the fields in the documents your friend was able to ensure documents never grew in size and therefore his database never became over-allocated.
The padding approach is valid but it also adds complexity to the application. Typically padding is performed for fields that will eventually be created, rather than fixing the size of the values themselves, but the idea is the same. In your case it doesn't sound like padding is a great option because you cannot predict the field size.
Instead, you might consider using usePowerOf2Sizes: http://docs.mongodb.org/manual/reference/command/collMod/
This configuration will automatically pad the space allocated for documents and will increase the chances that space is reused efficiently by MongoDB, at the cost of a slightly larger database.
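As a minimal sketch of enabling that option, here is how the collMod command could be issued from Python/pymongo (the same command can be sent from the PHP driver); the database and collection names are assumptions for illustration:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["game"]

# collMod is a database command; usePowerOf2Sizes makes record allocations
# round up to powers of two, so freed space is easier to reuse when
# documents grow and have to move.
result = db.command("collMod", "players", usePowerOf2Sizes=True)
print(result)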

So A) is this a good idea?
Depends. If the game documents were frequently updated in such a manner that they moved on disk a lot, then you might find that padding does help. However, considering that the entire works of Shakespeare can fit into a 4 MB document with some room left, I doubt very much that any string you have will cause a heavy amount of fragmentation; in fact, I will be quite surprised if it does.
The problem that could, in theory, occur is that you get a lot of spaces within your freelists and deleted buckets that cannot be reused, causing fragmentation to occur.
Not only that, but the IO of document movement on disk can be a killer if it becomes persistent.
B) I'm not even totally sure how to find out the longest game state length, so I'm not sure how far to pad them, and what if later on game states exceed my padding length?
Then the idea is useless; in fact, the idea is useless 90% of the time anyway, and you would be better off using power-of-2 size allocation on your documents if this were to be a problem: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes
Using this option would be a far more optimal approach to solving fragmentation issues.
My friend said he had a mongo collection keep blowing up because of fragmentation and when he implemented padding all of his issues went away.
A friend of a friend, of a cousin, of a niece of mine said something similar too...you would be better off testing this for yourself.
I would bet that the bigger problem he had was with indexes and the queries he performed. It is extremely rare for string lengths to cause such a heavy amount of IO from document movement on disk that you would actually need artificial padding.

From your question I understand those strings are just blobs, i.e. they are not structured in a way that allows db queries/filtering on their contents. If this is the case, store them in files, and store the file names in the Mongo document.
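A minimal sketch of that idea, assuming Python/pymongo and made-up directory, collection, and field names (GridFS would be another option if you want MongoDB itself to hold the bytes):

import hashlib
import os
from pymongo import MongoClient

STATE_DIR = "/var/game/states"          # hypothetical location for the blobs
db = MongoClient()["game"]

def save_player_state(player_id, state_blob):
    # Name the file after a hash of the player id so paths stay predictable.
    filename = hashlib.sha1(str(player_id).encode()).hexdigest() + ".state"
    with open(os.path.join(STATE_DIR, filename), "w") as f:
        f.write(state_blob)             # the variable-length serialized state
    # The document stays small and fixed-size, so it never has to move.
    db.players.update_one({"_id": player_id},
                          {"$set": {"state_file": filename}},
                          upsert=True)

def load_player_state(player_id):
    doc = db.players.find_one({"_id": player_id}, {"state_file": 1})
    with open(os.path.join(STATE_DIR, doc["state_file"])) as f:
        return f.read()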

Related

Is it a good idea to have a big set/list as column in `Scylla DB`?

Is it a good idea to have a table in ScyllaDB with a column of type set containing a couple of thousand elements, e.g. 5000 elements?
In the Scylla documentation it's stated that:
Collections are meant for storing/denormalizing a relatively small amount of data. They work well for things like “the phone numbers of a given user”, “labels applied to an email”, etc. But when items are expected to grow unbounded (“all messages sent by a user”, “events registered by a sensor”…), then collections are not appropriate, and a specific table (with clustering columns) should be used. ~ [source]
My column is much bigger than "the phone numbers of a given user", but much smaller than "all messages sent by a user" (the set column is going to be 'frozen', if that matters), so I am confused about what to do.
If your set is frozen, you can be a little more relaxed about it. This is because ScyllaDB will not have to break it into components and re-create it as often as it does with non-frozen sets.
So if you're sure the frozen set won't be larger than a megabyte or so, it will be fine. For simple read/write queries it will be treated as a blob.
The main downside of having a large individual cell - a frozen set, a string, or even an unfrozen set - is that the CQL API does not give you an efficient way to read or write only part of that cell. For example, every time you want to access your set, Scylla will need to read it entirely into memory. This takes time and effort. Even worse, it also increases the latency of other requests, because Scylla's scheduling is cooperative and does not switch tasks in the middle of handling a single cell, which is assumed to be fairly small.
Whether 5,000 elements specifically is too much also depends on the size of each element - 5,000 elements of 10 bytes each total 50K, but at 100 bytes each they total 500K. A 500K cell will certainly increase tail latency noticeably, but this may or may not be important for your application. If you can't think of a data model that doesn't involve large collections, then you can definitely try the one you thought of and check whether the performance is acceptable to you or not.
In any case, if your use case involves unbounded collections - i.e., 5,000 elements is not a hard limit but some sort of average - and in some rows you actually have a million elements, you're in for a world of pain :-( You can start to see huge latencies (as one single million-cell row delays many other requests waiting in line) and, in extreme cases, even allocation failures. So you will somehow need to avoid this problem. Avoiding it isn't always easy - Scylla doesn't have a feature that prevents your 5,000-element set from growing into a million-element set (see https://github.com/scylladb/scylladb/issues/10070).
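A sketch of the two modeling options discussed above, issued as CQL through the Python driver (which speaks to ScyllaDB's CQL port); the keyspace, table, and column names are made up for illustration:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo_ks")

# Option 1: a frozen set stored as a single cell. Fine while the whole set
# stays small (well under a megabyte); it is read and written as one blob.
session.execute("""
    CREATE TABLE IF NOT EXISTS items_frozen (
        id uuid PRIMARY KEY,
        labels frozen<set<text>>
    )
""")

# Option 2: a clustering column instead of a collection. Each element becomes
# its own row, so the data can grow without creating one huge cell, and you
# can read or delete individual elements.
session.execute("""
    CREATE TABLE IF NOT EXISTS items_clustered (
        id uuid,
        label text,
        PRIMARY KEY (id, label)
    )
""")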

Is shortening MongoDB property names worthwhile?

In the MongoDB docs the author mentions that it's a good idea to shorten property names:
Use shorter field names.
and in an old blog post from How To Node (it is offline by now - April 2022 edit):
....oft-reported issue with mongoDB is the
size of the data on the disk... each and every record stores all the field-names
.... This means that it can often be
more space-efficient to have properties such as 't', or 'b' rather
than 'title' or 'body', however for fear of confusion I would avoid
this unless truly required!
I am aware of solutions for how to do it. I am more interested in when this is truly required.
To quote Donald Knuth:
Premature optimization is the root of all evil (or at least most of
it) in programming.
Build your application however seems most sensible, maintainable and logical. Then, if you have performance or storage issues, deal with those that have the greatest impact until either performance is satisfactory or the law of diminishing returns means there's no point in optimising further.
If you are uncertain of the impact of particular design decisions (like long property names), create a prototype to test various hypotheses (like "will shorter property names save much space"). Don't expect the outcome of testing to be conclusive, however it may teach you things you didn't expect to learn.
Keep the priority for meaningful names above the priority for short names unless your own situation and testing provides a specific reason to alter those priorities.
As mentioned in the comments of SERVER-863, if you're using MongoDB 3.0+ with the WiredTiger storage engine and snappy compression enabled, long field names become even less of an issue, as the compression effectively takes care of the shortening for you.
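As a small sketch of checking that, the collStats output reports the WiredTiger creation string (which includes the block compressor) plus the uncompressed and on-disk sizes; the database/collection names and the explicit zlib choice here are just assumptions for illustration, since snappy is usually the default:

from pymongo import MongoClient

db = MongoClient()["blog"]

# Explicitly request a block compressor when creating a collection.
db.create_collection("articles",
                     storageEngine={"wiredTiger":
                                    {"configString": "block_compressor=zlib"}})

# "size" is the uncompressed data size; "storageSize" is what it takes on disk.
stats = db.command("collstats", "articles")
print(stats["wiredTiger"]["creationString"])
print(stats["size"], stats["storageSize"])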
Bottom line up front: keep it as compact as it stays meaningful.
I don't think this ever truly requires shortening to one-letter names. Anyway, you should shorten them as much as possible while you still feel comfortable with it. Let's say you have a user's name: {FirstName, MiddleName, LastName}. You may be good to go with even name: {first, middle, last}. If you feel comfortable, you may be fine with name: {f, m, l}.
You should use short names, as long names consume disk space and memory and thus may somewhat slow down your application (fewer objects held in memory, slower lookup times due to the bigger size, and longer query times as seeking over the data takes longer).
Good schema documentation may tell the developer that t stands for town and not for title. Depending on your stack, you may even be able to shield developers from working with these shortcuts through some helper utils that map the names.
Finally, I would say that there's no guideline for when and how much you should shorten your schema names. It highly depends on your environment and requirements. But you're good to keep it compact if you can supply good documentation explaining everything and/or utils to ease the life of developers and admins. In any case, admins are likely to interact directly with MongoDB, so good documentation shouldn't be missing.
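A hypothetical helper-util of the kind mentioned above, in Python: application code keeps readable names while the documents store the short ones. The mapping itself is made up for illustration.

FIELD_MAP = {"title": "t", "body": "b", "town": "tw"}
REVERSE_MAP = {v: k for k, v in FIELD_MAP.items()}

def to_short(doc):
    # Translate readable field names to their stored short forms.
    return {FIELD_MAP.get(k, k): v for k, v in doc.items()}

def to_long(doc):
    # Translate stored short names back to readable ones.
    return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}

# Usage sketch:
# db.posts.insert_one(to_short({"title": "Hi", "body": "...", "town": "Oslo"}))
# post = to_long(db.posts.find_one())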
I performed a little benchmark: I uploaded 252 rows of data from an Excel sheet into two collections, testShortNames and testLongNames, as follows:
Long Names:
{
    "_id": ObjectId("6007a81ea42c4818e5408e9c"),
    "countryNameMaster": "Andorra",
    "countryCapitalNameMaster": "Andorra la Vella",
    "areaInSquareKilometers": 468,
    "countryPopulationNumber": NumberInt("77006"),
    "continentAbbreviationCode": "EU",
    "currencyNameMaster": "Euro"
}
Short Names:
{
    "_id": ObjectId("6007a81fa42c4818e5408e9d"),
    "name": "Andorra",
    "capital": "Andorra la Vella",
    "area": 468,
    "pop": NumberInt("77006"),
    "continent": "EU",
    "currency": "Euro"
}
I then got the stats for each, saved in disk files, then did a "diff" on the two files:
pprint.pprint(db.command("collstats", dbCollectionNameLongNames))
The stats show two variables of interest: size and storageSize.
My reading showed that storageSize is the amount of disk space used after compression, and size is basically the uncompressed size. We see that the storageSize is identical for both collections. Apparently the WiredTiger engine compresses field names quite well.
I then ran a program to retrieve all data from each collection, and checked the response time.
Even though it was a sub-second query, the long names consistently took about 7 times longer. It will, of course, take longer to send the longer names across from the database server to the client program.
-------LongNames-------
Server Start DateTime=2021-01-20 08:44:38
Server End DateTime=2021-01-20 08:44:39
StartTimeMs= 606964546 EndTimeM= 606965328
ElapsedTime MilliSeconds= 782
-------ShortNames-------
Server Start DateTime=2021-01-20 08:44:39
Server End DateTime=2021-01-20 08:44:39
StartTimeMs= 606965328 EndTimeM= 606965421
ElapsedTime MilliSeconds= 93
In Python, I just did the following (I had to actually loop through the items to force the reads, otherwise the query returns only the cursor):
results = dbCollectionLongNames.find(query)
for result in results:
    pass
Adding my 2 cents on this..
Long-named attributes (or, "AbnormallyLongNameAttributes") can be avoided while designing the data model. In my previous organisation we tested a short-named-attributes strategy, using organisation-defined 4-5 letter encoded strings, e.g.:
First Name = FSTNM,
Last Name = LSTNM,
Monthly Profit Loss Percentage = MTPCT,
Year on Year Sales Projection = YOYSP, and so on..)
While we observed an improvement in query performance - largely due to the reduction in the size of data being transferred over the network and (since we used Java with MongoDB) the reduction in the length of "keys" in the MongoDB document / Java Map heap space - the overall improvement in performance was less than 15%.
In my personal opinion, this was a micro-optimization that came at the additional cost (and a huge headache) of maintaining/designing an additional system for managing a data attribute dictionary for each of the data models. This system was required to provide organisation-wide transparency while debugging the application or answering client queries.
If you find yourself in a position where an up-to-20% increase in performance from this strategy is lucrative to you, maybe it is time to scale up your MongoDB servers, choose some other data modelling/querying strategy, or choose a different database altogether.
If you are storing verbose XML, trying to ameliorate that with custom names could be very important. A user comment on the SERVER-863 ticket said, in his case: "I'm storing externally-defined XML objects, with verbose naming: the fieldnames are, perhaps, 70% of the total record size. So fieldname tokenization could be a giant win, both in terms of I/O and memory efficiency."
Collection with smaller field names - InsertCompress
Collection with bigger field names - InsertNormal
I performed this on our Mongo sharded cluster and the analysis shows:
There is around a 10-15% gain with shorter names while saving, and it seems to be purely based on network latency. I used bulk inserts with multiple threads, so with single inserts it could save more.
My average document size for InsertCompress is 280 B and for InsertNormal it is 350 B, and I inserted 25 million records. InsertNormal shows 8.1 GB and InsertCompress shows 6.6 GB. This is the data size.
Surprisingly, the index size shows as 2.2 GB for the InsertCompress collection and 2 GB for the InsertNormal collection.
Again, the storage size is 2.2 GB for the InsertCompress collection, while for InsertNormal it is around 1.6 GB.
Overall, apart from network latency, there is nothing gained for storage, so it is not worth putting effort in this direction just to save storage. Only if you have much bigger documents, and smaller field names would save a lot of data, should you consider it.
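A rough sketch of the kind of comparison described above - timing bulk inserts of documents that differ only in field-name length. The collection names follow the answer (InsertNormal / InsertCompress); the document shapes, counts, and database name are made up:

import time
from pymongo import MongoClient

db = MongoClient()["bench"]

def timed_bulk_insert(coll, docs, batch=1000):
    start = time.time()
    for i in range(0, len(docs), batch):
        coll.insert_many(docs[i:i + batch])
    return time.time() - start

long_docs = [{"countryNameMaster": "Andorra", "countryPopulationNumber": i}
             for i in range(100000)]
short_docs = [{"name": "Andorra", "pop": i} for i in range(100000)]

print("InsertNormal  :", timed_bulk_insert(db["InsertNormal"], long_docs))
print("InsertCompress:", timed_bulk_insert(db["InsertCompress"], short_docs))
print(db.command("collstats", "InsertNormal")["storageSize"],
      db.command("collstats", "InsertCompress")["storageSize"])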

How to deal with relationships while using mongodb

I know, think in a "denormalized way" or the "NoSQL way", but tell me about this simple use case.
db.users
db.comments
Some user posts a comment, and I want to fetch some user data while fetching the comment.
Say I want to show dynamic data, like "userlevel", and static data, like "username".
With the static data I will never have problems, but what about the dynamic data?
userlevel is in the users collection; I need the denormalized data duplicated into comments to achieve read performance, but I also need the userlevel kept updated.
Is this achievable in some way?
EDIT:
Just found an answer from Brendan McAdams, a guy from 10gen, who is obviously way more authoritative than me, and he recommends embedding documents.
older text:
The first one is to manually include in each comment the ObjectID of the user it belongs to:
comment: {
    text: "...",
    date: "...",
    user: ObjectId("4b866f08234ae01d21d89604"),
    votes: 7
}
The second, and more clever, way is to use DBRefs.
We add extra I/O to our disk, losing performance, am I right? (I'm not sure how this works internally.) Therefore we need to avoid linking if possible, right?
Yes - there would be one more query, but the driver will do it for you - you can think of it as a kind of syntactic sugar. Does it affect performance? Actually, that depends too :) One of the reasons Mongo is so freaking fast is that it uses memory-mapped files
and tries its best to keep all of the working set (plus indexes) directly in RAM. Every 60 seconds (by default) it syncs the RAM snapshot with the disk-based file.
When I say working set, I mean the things you are working with: you can have three collections - foo, bar, baz - but if you are currently working only with foo and bar, they ought to be loaded into RAM, while baz stays abandoned on disk. Moreover, memory-mapped files allow us to load only part of a collection. So if you're building something like Engadget or TechCrunch, there is a high probability that the working set would be the comments for the last few days, and older pages would be revisited far less frequently (their comments would be brought into memory on demand), so it doesn't affect performance significantly.
So, to recap: as long as you keep the working set in memory (you may think of it as read/write caching), fetching those things is super fast and one more query won't be a problem. If you are working with slices of data that don't fit into memory, there will be speed degradation, but I don't know your circumstances - it could be acceptable. So in both cases I tend to choose to use linking.
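A minimal sketch of the manual-reference ("linking") approach in Python/pymongo: the comment stores the user's _id, and reading dynamic fields such as userlevel costs one extra query. The database, collection, and field names are illustrative:

from pymongo import MongoClient

db = MongoClient()["forum"]

def add_comment(user_id, text):
    db.comments.insert_one({
        "text": text,
        "user": user_id,          # reference instead of embedded user data
        "votes": 0,
    })

def comment_with_user(comment_id):
    comment = db.comments.find_one({"_id": comment_id})
    # The second query fetches the always-fresh dynamic data (e.g. userlevel).
    user = db.users.find_one({"_id": comment["user"]},
                             {"username": 1, "userlevel": 1})
    comment["user"] = user
    return comment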

Memory Efficient and Speedy iPhone/Android Dictionary Storage/Access

I'm having trouble with memory on older-generation iPhones (iPod Touch 1st gen, 2nd gen, etc.). This is due to the amount of memory allocated when I load and store a 170k-word dictionary.
This is the code (very simple):
string[] words = dictionaryRef.text.Split("\n"[0]);
_words = new List<string>(words);
This allocates around 12 MB of storage on start; the iPhone has around 43 MB available, I think. So with that + textures + sounds + the OS, it tends to break.
Speed-wise, accessing it using a binary search is fine. But the problem is storing it in memory more efficiently (and loading it more efficiently).
The text.Split appears to take up a lot of heap memory.
Any advice?
You can't count too much on how much memory these pre-3.0 devices have available on startup. 43 MB is rather optimistic. Is your app just checking to see if the word is in the list or not? You might want to roll your own hash table instead of using a binary search. I'd search some of the literature and stack overflow to look for efficient ways to store a large dictionary with the particular word sizes you have. A google search on hash table might give you a better implementation.
Use SQLite. It will use less memory and be faster. Create an index on your words column and voila, you have binary search, without having the whole dictionary loaded in memory.
First, if dictionaryRef.text is a string (and it looks like it is), then you already have something huge being allocated (2 bytes per character). Check this; it may well account for a large amount (near half) of the total memory being allocated. You should think about caching it (the database idea is a good one, but a file could do, using File.ReadAllLines in future executions).
Next, you can try to do a bit better than Mono's Split method. It creates a List and then turns it into an array (calling ToArray) at the end - and you then create a new List from that. Since your requirement (splitting only on '\n') is fairly basic, I suggest you roll your own Split method (or copy/paste/reduce the one from Mono) and avoid the temporary memory allocations.
In any case, take a lot of (memory) measurements, since allocations, even more so for strings, often occur where we don't look ;-)
I would have to agree with Morningstar that using a SQLite backend for your word storage sounds like the best solution to what you are trying to do.
However, if you insist on using a word list, here's a suggestion:
It looks to me like dictionaryRef.text is constructed by reading a text file in its entirety (File.ReadAllText() or some such).
Instead of doing that, why not use TextReader.ReadLine() to read 1 word at a time from the file into a List, thus avoiding the need to use String.Split() and using tons of temporary storage space?
Ultimately that seems to be what you want anyway... and ReadLine() will "split" on \n for you.

mongoDB vs relational databases when data can't fit into memory?

First of all, I apologize for my potentially shallow understanding of NoSQL architecture (and databases in general) so try to bear with me.
I'm thinking of using mongoDB to store resources associated with a UUID. The resources can be things such as large image files (tens of megabytes), so it makes sense to store them as files and store just links in my database along with the associated metadata. There's also the added flexibility of decoupling the actual location of the resource files, so I can use a different third party to store the files if I need to.
Now, one document which describes a resource would be about 1 kB. At first I expect a couple hundred thousand resource documents, which would equal some hundreds of megabytes in database size, easily fitting into server memory. But in the future I might have to scale this to the order of tens of MILLIONS of documents. This would be tens of gigabytes, which I can't squeeze into server memory anymore.
Only the index could still fit in memory, being around a gigabyte or two. But if I understand correctly, I'd have to read from disk every time I did a lookup on a UUID. Is there a substantial speed benefit from mongoDB over a traditional relational database in such a situation?
BONUS QUESTION: is there an existing, established way of doing what I'm trying to achieve? :)
MongoDB doesn't suddenly become slow the second the entire database no longer fits into physical memory. MongoDB currently uses a storage engine based on memory-mapped files. This means data that is accessed often will usually be in memory (OS-managed, but assume an LRU scheme or something similar).
As such, it may not slow down at all at that point, or only slightly; it really depends on your data access patterns. It is a similar story with indexes: if you (right-)balance your index appropriately and your use case allows it, you can have a huge index with only a fraction of it in physical memory and still have very decent performance, with the majority of index hits happening in physical memory.
Because you're talking about UUIDs, this might all be a bit hard to achieve, since there's no guarantee that the same limited group of users is generating the vast majority of throughput. In those cases sharding really is the most appropriate way to maintain quality of service.
This would be tens of gigabytes, which I can't squeeze into server memory anymore.
That's why MongoDB gives you sharding to partition your data across multiple mongod instances (or replica sets).
In addition to considering sharding, or maybe even before that, you should also try to use covered indexes as much as possible, especially if they fit your use cases.
This way you do not HAVE to load entire documents into memory. Your indexes can help out.
http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields#RetrievingaSubsetofFields-CoveredIndexes
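A minimal sketch of a covered query for the UUID-lookup case, in Python/pymongo: if the index contains every field the query returns, MongoDB can answer it from the index alone. The database, collection, and field names are assumptions for illustration:

from pymongo import MongoClient

db = MongoClient()["resources_db"]

# Compound index on the lookup key plus the fields we want back.
db.resources.create_index([("uuid", 1), ("url", 1)])

# Project only indexed fields and exclude _id so the query stays covered.
doc = db.resources.find_one(
    {"uuid": "123e4567-e89b-12d3-a456-426614174000"},
    {"_id": 0, "uuid": 1, "url": 1},
)
print(doc)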
If you have to display your entire document all the time based on the id, then the general rule of thumb is to attempt to keep the working set in memory.
http://blog.boxedice.com/2010/12/13/mongodb-monitoring-keep-in-it-ram/
This is one of the resources that talks about that. There is a video on mongodb's site too that speaks about this.
By attempting to size the RAM so that the working set is in memory, and also looking at sharding, you will not have to do all of this right away; you can always add sharding later. This will improve the scalability of your app over time.
Again, these are not absolute statements; they are general guidelines. You should think through your usage patterns and make sure they are relevant to what you are doing.
Personally, I have not had the need to fit everything in RAM.