mmap about mongodb

mmap about mongodb - mongodb

A common confusion point is that “MongoDB is an in-memory DB”. Mongo maps to virtual memory: doesn’t need to fit in RAM. Now he’s showing a diagram of virtual address space: kernel space, stack, heap, program text, etc. Program text is mmap’ed just like Mongo data files. Showing where the mapped data files live in the virtual address space."
What is the program text ,and what does "Program text is mmap’ed just like Mongo data files. Showing where the mapped data files live in the virtual address space" mean ?
thank you !

Memory mapping is what the operating system does (Windows does it, I'm pretty sure linux does it too) when it loads your binaries. The binaries themselves could be called "program text". Hence, all .exe and .dll files are essentially memory-mapped. The message here is: if you can trust the OS to do memory-mapping for it's core purpose of allowing other binaries to execute, you can also trust it to map data files of your database, which is what MongoDB does.
All this has nothing to do with 'in-memory db', because memory-mapping is "just" a fancy way to coordinate file access through the OS.
It also explains that both binaries and data reside in the same memory, which is what I recall to be one of the most important contributions by Konrad Zuse to the early days of computing: Programs and data don't sit in different physical memories, because there is no fundamental difference between them.

Related

OS: how does kernel virtual memory help in making swap pages of the page table easier?

Upon reading this chapter from "Operating Systems: Three Easy Pieces" book, I'm confused of this excerpt:
If in contrast the kernel were located entirely in physical memory, it would be quite hard to do things like swap pages of the page table to disk;
I've been trying to make sense of it for days but still can't get as to how kernel virtual memory helps in making swap pages of the page table easier. Wouldn't it be the same if the kernel would live completely in physical memory as the pages of different page tables' processes would end up in physical memory in the end anyway (and thus be swapped to disk if needed)? How is it different when page tables reside in kernel virtual memory vs. in kernel-owned physical memory?

Let's assume the kernel needs to access a string of characters from user-space (e.g. maybe a file name passed to an open() system call).
With the kernel in virtual memory; the kernel would have to check that the virtual address of the string is sane (e.g. not an address for kernel's own data) and also guard against a different thread in user-space modifying the data the pointer points to (so that the string can't change while the kernel is using it, possibly after the kernel checked that the string itself is valid but before the kernel uses the string). In theory (and likely in practice) this could all be hidden by a "check virtual address range" function. Apart from those things; kernel can just use the string like normal. If the data is in swap space, then kernel can just let its own page fault handler fetch the data when kernel attempts to access it the data and not worry about it.
With kernel in physical memory; the kernel would still have to do those things (check the virtual address is sane, guard against another thread modifying the data the pointer points to). In addition; it would have to convert the virtual address to a physical addresses itself, and ensure the data is actually in RAM itself. In theory this could still be hidden by a "check virtual address range" function that also converts the virtual address into physical address(es).
However, the "contiguous in virtual memory" data (the string) may not be contiguous in physical memory (e.g. first half of the string in one page with the second half of the string in a different page with a completely unrelated physical address). This means that kernel would have to deal with this problem too (e.g. for a string, even things like strlen() can't work), and it can't be hidden in a "check (and convert) virtual address range" function to make it easier for the rest of the kernel.
To deal with the "not contiguous in physical memory" problem there's mostly only 2 possibilities:
a) A set of "get char/short/int.. at offset N using this list of physical addresses" functions; or
b) Refuse to support it in the kernel; which mostly just shifts unwanted burden to user-space (e.g. the open() function in a C library copying the file name string to a new page if the original string crossed a page boundary before calling the kernel's open() system call).

In an ideal world (which is what we live in now with kernels that are hundreds of megabytes in size running on machines with gigabytes of physical RAM) the kernel would never swap even parts of itself. But in the old days, when physical memory was a constraint, the less of the kernel in physical memory, the more the application could be in physical memory. The more the application is in physical memory, the fewer page faults in user space.
THe linux kernel has been worked over fairly extensively to keep it compact. Case in point: kernel modules. You can load a module using insmod or modprobe, and that module will become resident, but if nothing uses is, after a while it will get swapped out, and that's no big deal because nothing is using it.

What is meant by "CPU generates logical address space"?

As far as I read from the book(correct me I am wrong) after a compiler puts the compiled code in the storage, the CPU creates logical addresses and those logical addresses are mapped to physical memory through MMU(Memory Management Unit).Also I know that CPU directly cannot access anything other than the physical memory.
Then how does CPU produce logical addresses for the process in the first place?

It sounds like you have a bit of confusion about what things do.
The operating system defines the logical address space by setting up page tables that logical pages to physical page frames. The operating system loads hardware registers of the CPU so that it knows about the page tables it has defined.
This use of page tables to define logical address spaces is an integral part of a modern CPU. In some systems, the only use of physical addresses is within page tables.
The compiler generates an object code file that describes the instructions and data used and created.
The linker combines object code into an executable file that defines how the program will be loaded into memory.
The loader reads the instructions in an executable file and sets up the logical addresses space to run the program. The loader calls system routines that set up the page tables the define the logical addresses space.
For example, where the executable has read-only data, the loader will call OS routines to creates read-only pages in the logical address space and map them to the data in the executable file.

MongoDB: Can different databases be placed on separate drives?

I am working on an application in which there is a pretty dramatic difference in usage patterns between "hot" data and other data. We have selected MongoDB as our data repository, and in most ways it seems to be a fantastic match for the kind of application we're building.
Here's the problem. There will be a central document repository, which must be searched and accessed fairly often: it's size is about 2 GB now, and will grow to 4GB in the next couple years. To increase performance, we will be placing that DB on a server-class mirrored SSD array, and given the total size of the data, don't imagine that memory will become a problem.
The system will also be keeping record versions, audit trail, customer interactions, notification records, and the like. that will be referenced only rarely, and which could grow quite large in size. We would like to place this on more traditional spinning disks, as it would be accessed rarely (we're guessing that a typical record might be accessed four or five times per year, and will be needed only to satisfy research and customer service inquiries), and could grow quite large, as well.
I haven't found any reference material that indicates whether MongoDB would allow us to place different databases on different disks (were're running mongod under Windows, but that doesn't have to be the case when we go into production.
Sorry about all the detail here, but these are primary factors we have to think about as we plan for deployment. Given Mongo's proclivity to grab all available memory, and that it'll be running on a machine that maxes out at 24GB memory, we're trying to work out the best production configuration for our database(s).
So here are what our options seem to be:
Single instance of Mongo with multiple databases This seems to have the advantage of simplicity, but I still haven't found any definitive answer on how to split databases to different physical drives on the machine.
Two instances of Mongo, one for the "hot" data, and the other for the archival stuff. I'm not sure how well Mongo will handle two instances of mongod contending for resources, but we were thinking that, since the 32-bit version of the server is limited to 2GB of memory, we could use that for the archival stuff without having it overwhelm the resources of the machine. For the "hot" data, we could then easily configure a 64-bit instance of the database engine to use an SSD array, and given the relatively small size of our data, the whole DB and indexes could be directly memory mapped without page faults.
Two instances of Mongo in two separate virtual machines Would could use VMWare, or something similar, to create two Linux machines which could host Mongo separately. While it might up the administrative burden a bit, this seems to me to provide the most fine-grained control of system resource usage, while still leaving the Windows Server host enough memory to run IIS and it's own processes.
But all this is speculation, as none of us have ever done significant MongoDB deployments before, so we don't have a great experience base to draw upon.
My actual question is whether there are options to have two databases in the same mongod server instance utilize entirely separate drives. But any insight into the advantages and drawbacks of our three identified deployment options would be welcome as well.

That's actually a pretty easy thing to do when using Linux:
Activate the directoryPerDB config option
Create the databases you need.
Shut down the instance.
Copy over the data from the individual database directories to the different block devices (disks, RAID arrays, Logical volumes, iSCSI targets and alike).
Mount the respective block devices to their according positions beyond the dbpath directory (don't forget to add the according lines to /etc/fstab!)
Restart mongod.
Edit: As a side note, I would like to add that you should not use Windows as OS for a production MongoDB. The available filesystems NTFS and ReFS perform horribly when compared to ext4 or XFS (the latter being the suggested filesystem for production, see the MongoDB production notes for details ). For this reason alone, I would suggest Linux. Another reason is the RAM used by rather unnecessary subsystems of Windows, like the GUI.

Questions on Program execution flow

I was studying operating system concepts from galvin's sixth edition and i have some questions about the flow of execution of a program. A figure explains the processing of the user program as:
We get an executable binary file when we reach linkage editor point. As the book says, "The program must be brought into memory and placed within a process for it to be executed" Now some of my stupid questions are:
Before the program is loaded into the memory, the binary executable file generated by the linkage editor is stored in the hard disk. The address where the binary executable file is stored in the hard disk is the logical address as generated by the CPU ?
If the previous answer is yes, Why CPU has to generate the logical address ? I mean the executable file is stored somewhere in the hard disk which pertains to an address, why does CPU has to separately do the stuff ? CPU's main aim is processing after all!
Why does the executable file needs to be in the physical memory i.e ram and can not be executed in the hard disk? Is it due to speed issues ?
I know i am being stupid in asking these questions, but trust me, I can't find the answers! :|

1) The logical address where the binary file is stored in the hard disk is determined by the file system, the Operating System component that is aimed to manage files in the disks.
2) & 3) The Hard Disk is not a) fast enough b) does not support word addressing. The hard disks are addressed in sectors blocks. Usually the sector size is 512 bytes. The CPU need to be able to address each machine word in a program to execute it. So, the program is stored in the hard disk, that retains its content even being powered off (in contrast to the RAM that losts its content when it is powered off). Then the program is loaded into RAM to be executed. After program finished and possibly stored the result of its execution in the hard disk, the memory is freed for running another programs. The Compiler and the Linkage Editor in your sample are also programs. They are kept in the hard disk. The compiler get its input (the source text of your program) from the file in the hard disk. Then it stores the object file. The linkage editor, or linker for short does the same: it reads the object file and necessary library files and then produces a file with a binary representation of your program.

system hardware related

hi every body there
first of all thanks for your support and help .Now i want two know how to calculate the load on a computer system. i heard about load sensor and other utility which mostly provide the option of finding the temperature of hard disk or system , like top, system monitor but i don't want to use them.
all i want to know simply load on CPU like CPU is 40% free, load on memory means total usages of memory or amount of free memory ,free disk space etc.
i don't need soft ware tool for finding these thing .
all i want to know is there any program in c or in other language or script which can find out these thing one by one or simultaneously?
are there any commands to find this?
or can anybody explain how system monitor work?
waiting for your help

Depends on your operating system. On Unix/Linux, there is the directory /proc which contains a lot of files/directories with system information such as /proc/loadvg or /proc/cpuinfo
These aren't real files on your disk, but virtual files containing system information. But you can just open them and read them with standard file functions, so it works with any programming language, from C to Python.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse