It's a simple question with apparently a multitude of answers.
Findings have ranged anywhere from:
a. 22 bytes as per Basho's documentation:
http://docs.basho.com/riak/latest/references/appendices/Bitcask-Capacity-Planning/
b. ~450 bytes over here:
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-August/005178.html
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-May/004292.html
c. And anecdotal reports of overhead anywhere in the range of 45 to 200 bytes.
Why isn't there a straight answer to this? I understand it's an intricate problem - one of the mailing list entries above makes it clear! - but is even coming up with a consistent ballpark so difficult? Why isn't Basho's documentation clear about this?
I have another set of problems related to how I am to structure my logic based on the key overhead (storing lots of small values versus "collecting" them in larger structures), but I guess that's another question.
The static overhead is stated on our capacity planner as 22 bytes because that's the size of the C struct. As noted on that page, the capacity planner is simply providing a rough estimate for sizing.
The old mailing list post by Nico that you link to is probably the most complete accounting of Bitcask internals you will find, and it is accurate. Factoring in the 8 bytes for a pointer to the entry and the 13 bytes of Erlang overhead on the bucket/key pair, you arrive at 43 bytes on a 64-bit system.
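Putting those numbers together, a back-of-the-envelope RAM estimate might look like the sketch below (a rough illustration only; the function name and the 36-byte example are assumptions, and the 43-byte figure is the per-key overhead described above):

```python
def bitcask_ram_estimate(num_keys, avg_bucket_key_bytes, per_key_overhead=43):
    """Rough per-node RAM estimate for Bitcask's in-memory key directory.

    per_key_overhead = 22-byte C struct + 8-byte entry pointer
    + 13 bytes of Erlang overhead (64-bit system).
    """
    return num_keys * (per_key_overhead + avg_bucket_key_bytes)

# e.g. 100 million keys with 36 bytes of bucket+key data each:
print(bitcask_ram_estimate(100_000_000, 36) / 1e9, "GB")  # -> 7.9 GB
```

For a cluster-wide figure you would also multiply by the replication factor, since each replica holds its own copy of the key directory.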
As for there not being a straight answer ... actually asking us (via email, the mailing list, IRC, carrier pigeon, etc) will always produce an actual answer.
Bitcask requires all keys to be held in memory. As far as I can see the overhead referenced in a) is the one to be used when estimating the total amount of RAM bitcask will require across the cluster due to this requirement.
When writing data to disk, Riak stores the actual value together with various metadata, e.g. the vector clock. The post mentioning 450 bytes listed in b) appears to be an estimate of the storage overhead on disk and would therefore probably apply also to other backends.
Nico's post seems to contain a good and accurate explanation.
Is it a good idea to have a table in ScyllaDB with a set column containing a couple of thousand elements, e.g. 5000?
In the Scylla documentation it's stated that:
Collections are meant for storing/denormalizing a relatively small amount of data. They work well for things like “the phone numbers of a given user”, “labels applied to an email”, etc. But when items are expected to grow unbounded (“all messages sent by a user”, “events registered by a sensor”…), then collections are not appropriate, and a specific table (with clustering columns) should be used. ~ [source]
My column is much bigger than "the phone numbers of a given user", but much smaller than "all messages sent by a user" (the column set is going to be 'frozen', if that matters), so I am not sure what to do.
If your set is frozen, you can be a little more relaxed about it. This is because ScyllaDB will not have to break it into components and re-create it as often as it does with non-frozen sets.
So if you're sure the frozen set won't be larger than a megabyte or so, it will be fine. For simple read/write queries it will be treated as a blob.
The main downside of having a large individual cell - a frozen set, a string, or even an unfrozen set - is that the CQL API does not give you an efficient way to read or write only part of that cell. For example, every time you want to access your set, Scylla will need to read it entirely into memory. This takes time and effort. Even worse, it also increases the latency of other requests, because Scylla's scheduling is cooperative and does not switch tasks in the middle of handling a single cell, since cells are assumed to be fairly small.
Whether 5,000 elements specifically is too much also depends on the size of each element - 5,000 elements of 10 bytes each total 50K, but at 100 bytes each they total 500K. A 500K cell will certainly increase tail latency noticeably, but this may or may not be important for your application. If you can't think of a data model that doesn't involve large collections, then you can definitely try the one you thought of and check whether the performance is acceptable to you.
In any case, if your use case involves unbounded collections - i.e., 5,000 elements is not a hard limit but some sort of average - then in some rows you may actually have a million elements, and you're in for a world of pain :-( You can start to see huge latencies (as one single 1-million-cell row delays many other requests waiting in line) and in extreme cases even allocation failures. So you will somehow need to avoid this problem. Avoiding it isn't always easy - Scylla doesn't have a feature that prevents your 5,000-element set from growing into a million-element set (see https://github.com/scylladb/scylladb/issues/10070).
Background and Use Case
I have around 30 GB of data that never changes: specifically, every dictionary of every language.
A client requests the definition of a word, and I simply respond with it.
On every request I have to conduct an algorithmic search of my choice so I don't have to loop through the more than two hundred million words stored in my .txt file.
If I open the txt file and read it so I can search for the word, it would take forever due to the size of the file (even if that file is broken down into smaller files, it is not feasible, nor is it what I want to do).
I came across the concept of mmap, mentioned to me as a possible solution to my problem by a very kind gentleman on discord.
Problem
As I was learning about mmap, I came across the fact that mmap does not store the data in RAM but rather in virtual memory. Regardless of which it is, my server or Docker instances may have no more than 64 GB of RAM, and that chunk of data taking 30 of them is quite painful and makes me feel like there needs to be a better alternative. Even in the worst case, if my server or Docker container does not have enough RAM for the data stored via mmap, then it is not feasible - unless I am wrong about how this works, which is why I am asking this question.
Questions
Is there a better solution for my use case than mmap?
Will accessing such a large amount of data through mmap (so I don't have to open and read the file every time) allocate RAM equal to the portion of the file I am accessing?
Lastly, if I was wrong about anything I have written so far, please do correct me, as I am still learning a lot about mmap.
Requirements For My Specific Use Case
I may get a request from one client that has tens of words that I have to look up, so I need to be able to retrieve lots of data from the txt file effectively.
The response to the client has to be as quick as possible - the quicker the better; ideally less than three seconds, or if that's impossible, then as quick as it can be.
I'm trying to assess whether Kafka could be used to scale out our current solution.
I can identify partitions easily. Currently, the requirement is for 1500 partitions, each having 1-2 events per second, but in the future it might go as high as 10000 partitions.
But there is one part of our solution that I don't know how to solve in Kafka.
The problem is that each message contains a string and I want to assign a unique ID to each string across the whole topic. So same strings have the same ID while different strings have different IDs. The IDs don't need to be sequential, nor do they need to be always-growing.
The IDs will then be used downstream as unique keys to identify those strings. The strings can be hundreds of characters long, so I don't think they would make efficient keys.
More advanced usage would be where messages might have different "kinds" of strings, so there would be multiple unique sequences of IDs. And messages will contain only some of those kinds depending on the type of the message.
Another advanced usage would be where the values are not strings but structures, and whether two structures are the same is determined by a more elaborate rule, e.g.: if PropA is equal, the structures are equal; if not, they are equal if PropB is equal.
To illustrate the problem: each partition is a computer in a network. Each event is an action on that computer. Events need to be ordered per computer so that events that change the state of the computer (e.g. user logged in) can affect other types of events; ordering is critical for that. E.g. the user opened an application, a file was written, a flash drive was inserted, etc. And I need each application, file, flash drive, and many others to have unique identifiers across all computers. These are then used to calculate statistics downstream. And sometimes an event can have multiple of those, e.g. an operation on a specific file on a specific flash drive.
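One technique that satisfies the stated requirements (same string gives the same ID, no sequencing needed) without any coordination between partitions is a deterministic content hash. A sketch, where the `kind` salt is a hypothetical extension covering the multiple-ID-sequence case from the question:

```python
import hashlib

def string_id(value: str, kind: str = "default") -> int:
    # Deterministic: the same (kind, value) pair always yields the same ID,
    # with no coordination between producers or partitions required.
    # `kind` gives each kind of string its own independent ID space.
    digest = hashlib.sha256(f"{kind}\x00{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")  # truncate to a 64-bit ID
```

With 64-bit IDs, collisions only become a realistic concern once you approach billions of distinct strings (the birthday bound); if that is not acceptable, keep the full digest as the key instead of truncating.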
There is a very nice post about Kafka and blockchain. It is a collective effort, and I think it could solve your ID scalability issue. For the solution, refer to the "Blockchain: reasons." part. All credit goes to the respective authors.
The idea is simple yet efficient:
Data is hash-based, with a link to the previous block
The data may very well be the same hashes, with links to the respective blocks of each type
A custom blockchain solution means you are in control of data encoding/decoding
Each hash chain is self-contained, and may essentially represent one of your entities (hdd/ram/cpu/word/app etc.)
Each hash chain may be a message itself
Bonus: statistics and analytics may very well be stored in the blockchain, with good support for compression and replication. Consumers are pretty cheap in that context (scalability).
Pros:
The unique identifier issue is solved
All records are linked and, thanks to Kafka and blockchain, strictly ordered
The data is extendable
Kafka's properties still apply
Cons:
Encryption/Decryption is CPU intensive
Growing level of hash calculation complexity
Problem: without the problem's context, it's hard to approximate the limitations that would need to be addressed further. However, assuming the calculated solution has a finite nature, you should have no issues scaling it in a regular way.
Bottom line:
Without knowing the requirements in terms of speed/cost/quality, it's hard to give a better, well-backed answer with a working example. A CPU cloud extension may be comparatively cheap; data storage depends on how long and how much data you want to store, replayability, etc. It's a good chunk of work. Prototype? The concept is in the referenced article.
Forgive me if this question is silly, but I'm starting to learn about consistent hashing. After reading Tom White's blog post on it and realizing that most default hash functions are NOT well mixed, I had a thought about ensuring that an arbitrary hash function is minimally well mixed.
My thought is best explained using an example like this:
Bucket 1: 11000110
Bucket 2: 11001110
Bucket 3: 11010110
Bucket 4: 11011110
Under a standard hash ring implementation for consistent caching across these buckets, you would get terrible performance, and nearly every entry would be lumped into Bucket 1. However, if we use bits 4 & 5 as the MSBs in each case, then these buckets are suddenly excellently mixed, and assigning a new object to a cache becomes trivial and only requires examining 2 bits.
In my mind this concept could very easily be extended to building distributed networks across multiple nodes. In my particular case I would be using this to determine which cache to place a given piece of data into. The increased placement speed isn't a real concern, but ensuring that my caches are well mixed is, and I was considering just choosing a few bits that are optimally mixed for my given caches. Any information indexed later would be indexed on the basis of the same bits.
In my naive mind this is a much simpler solution than introducing virtual nodes or building a better hash function. That said, I can't see any mention of an approach like this and I'm concerned that in my hashing ignorance I'm doing something wrong here and I might be introducing unintended consequences.
Is this approach safe? Should I use it? Has this approach been used before and are there any established algorithms for determining the minimum unique group of bits?
In the MongoDB docs the author mentions that it's a good idea to shorten property names:
Use shorter field names.
and in an old blog post from How To Node (offline as of April 2022):
....oft-reported issue with mongoDB is the
size of the data on the disk... each and every record stores all the field-names
.... This means that it can often be
more space-efficient to have properties such as 't', or 'b' rather
than 'title' or 'body', however for fear of confusion I would avoid
this unless truly required!
I am aware of solutions of how to do it. I am more interested in when is this truly required?
To quote Donald Knuth:
Premature optimization is the root of all evil (or at least most of
it) in programming.
Build your application however seems most sensible, maintainable and logical. Then, if you have performance or storage issues, deal with those that have the greatest impact until either performance is satisfactory or the law of diminishing returns means there's no point in optimising further.
If you are uncertain of the impact of particular design decisions (like long property names), create a prototype to test various hypotheses (like "will shorter property names save much space"). Don't expect the outcome of testing to be conclusive, however it may teach you things you didn't expect to learn.
Keep the priority for meaningful names above the priority for short names unless your own situation and testing provides a specific reason to alter those priorities.
As mentioned in the comments of SERVER-863, if you're using MongoDB 3.0+ with the WiredTiger storage option with snappy compression enabled, long field names become even less of an issue as the compression effectively takes care of the shortening for you.
Bottom line up front: keep it as compact as it can be while still staying meaningful.
I don't think this is ever truly required to be shortened to one-letter names. Anyway, you should shorten them as much as you feel comfortable with. Let's say you have a user's name: {FirstName, MiddleName, LastName}. You may be good to go with name: {first, middle, last}. If you feel comfortable, you may be fine with name: {f, m, l}.
You should use short names, as long names consume disk space and memory, and thus may somewhat slow down your application (fewer objects can be held in memory, and lookups are slower due to the bigger size and longer query times, since seeking over data takes longer).
Good schema documentation can tell a developer that t stands for town and not for title. Depending on your stack you may even be able to shield developers from working with these shortcuts through helper utils that map the names.
Finally, I would say that there's no guideline for when and how much you should shorten your schema names. It depends highly on your environment and requirements. But you're fine keeping it compact if you can supply good documentation explaining everything and/or utils to ease the lives of developers and admins. In any case, admins are likely to interact directly with MongoDB, so good documentation shouldn't be missing.
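As a sketch of the helper-utils idea (the field names here are hypothetical, not from the original post), a simple bidirectional map between short stored keys and readable keys might look like:

```python
# Hypothetical field-name map: short names on disk, long names in code.
FIELD_MAP = {"t": "town", "f": "first", "m": "middle", "l": "last"}
REVERSE_MAP = {v: k for k, v in FIELD_MAP.items()}

def expand(doc):
    # Translate a stored document's short keys to readable ones.
    return {FIELD_MAP.get(k, k): v for k, v in doc.items()}

def compact(doc):
    # Translate readable keys back to short keys before saving.
    return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}
```

Wrapping your data-access layer with something like this keeps application code readable while the documents on disk stay compact.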
I performed a little benchmark: I uploaded 252 rows of data from an Excel file into two collections, testShortNames and testLongNames, as follows:
Long Names:
{
    "_id": ObjectId("6007a81ea42c4818e5408e9c"),
    "countryNameMaster": "Andorra",
    "countryCapitalNameMaster": "Andorra la Vella",
    "areaInSquareKilometers": 468,
    "countryPopulationNumber": NumberInt("77006"),
    "continentAbbreviationCode": "EU",
    "currencyNameMaster": "Euro"
}
Short Names:
{
    "_id": ObjectId("6007a81fa42c4818e5408e9d"),
    "name": "Andorra",
    "capital": "Andorra la Vella",
    "area": 468,
    "pop": NumberInt("77006"),
    "continent": "EU",
    "currency": "Euro"
}
I then got the stats for each, saved in disk files, then did a "diff" on the two files:
pprint.pprint(db.command("collstats", dbCollectionNameLongNames))
The stats showed two variables of interest: size and storageSize.
My reading showed that storageSize is the amount of disk space used after compression, and size is basically the uncompressed size. So we see the storageSize is identical. Apparently the WiredTiger engine compresses field names quite well.
I then ran a program to retrieve all data from each collection, and checked the response time.
Even though it was a sub-second query, the long names consistently took about 7 times longer. It of course will take longer to send the longer names across from the database server to the client program.
-------LongNames-------
Server Start DateTime=2021-01-20 08:44:38
Server End DateTime=2021-01-20 08:44:39
StartTimeMs= 606964546 EndTimeM= 606965328
ElapsedTime MilliSeconds= 782
-------ShortNames-------
Server Start DateTime=2021-01-20 08:44:39
Server End DateTime=2021-01-20 08:44:39
StartTimeMs= 606965328 EndTimeM= 606965421
ElapsedTime MilliSeconds= 93
In Python, I just did the following (I had to actually loop through the items to force the reads, otherwise the query returns only the cursor):
results = dbCollectionLongNames.find(query)
for result in results:
    pass
Adding my 2 cents on this..
Long-named attributes (or "AbnormallyLongNameAttributes") can be avoided while designing the data model. In my previous organisation we tested a short-named-attributes strategy, using organisation-defined 4-5 letter encoded strings, e.g.:
First Name = FSTNM,
Last Name = LSTNM,
Monthly Profit Loss Percentage = MTPCT,
Year on Year Sales Projection = YOYSP, and so on..)
We observed an improvement in query performance, largely due to the reduction in the size of data transferred over the network and (since we used Java with MongoDB) the reduced length of the keys in the MongoDB document/Java Map heap space. However, the overall improvement in performance was less than 15%.
In my personal opinion, this was a micro-optimization that came at the additional cost (and a huge headache) of maintaining/designing an additional system to manage a data attribute dictionary for each of the data models. This system was required to have organisation-wide transparency when debugging the application or answering client queries.
If you find yourself in a position where an up-to-20% increase in performance from this strategy is lucrative to you, maybe it is time to scale up your MongoDB servers, choose some other data modelling/querying strategy, or choose a different database altogether.
If you are using verbose XML, trying to ameliorate that with custom names could be very important. A user comment on the SERVER-863 ticket said, in his case: "I'm storing externally-defined XML objects, with verbose naming: the fieldnames are, perhaps, 70% of the total record size. So fieldname tokenization could be a giant win, both in terms of I/O and memory efficiency."
Collection with smaller name - InsertCompress
Collection with bigger name - InsertNormal
I performed this on our sharded Mongo cluster, and the analysis shows:
There is around a 10-15% gain with shorter names while saving, and it seems to be purely down to network latency. I used bulk inserts with multiple threads, so single inserts might save more.
My average data size for InsertCompress is 280 B and for InsertNormal 350 B, and I inserted 25 million records. InsertNormal shows 8.1 GB and InsertCompress shows 6.6 GB. This is the data size.
Surprisingly, the index data size is 2.2 GB for the InsertCompress collection and 2 GB for the InsertNormal collection.
Again, the storage size is 2.2 GB for the InsertCompress collection, while for InsertNormal it's around 1.6 GB.
Overall, apart from network latency, there is nothing gained in storage, so it's not worth putting effort in this direction just to save storage. Consider it only if you have much bigger documents where smaller field names would save a lot of data.