I'm very new to the Parallel::ForkManager module in Perl. It seems well regarded, so I assume it supports what I need and I just haven't figured out how yet.
What I need is for each child process to write some updates into a global hash, using a key computed in that child process.
However, when I declare a hash outside the for loop and expect it to be updated after the loop, the hash turns out to stay empty.
That is, the update inside the loop succeeds (I can print the value there), but outside the loop it is gone.
Does anybody know how to write a piece of code that does what I want?
This isn't really a Perl-specific problem, but a matter of understanding Unix-style processes. When you fork a new process, none of the memory is shared by default between processes. There are a few ways you can achieve what you want, depending on what you need.
One easy way would be to use something like BerkeleyDB to tie a hash to a file on disk. The tied hash can be initialized before you fork and then each child process would have access to it. BerkeleyDB files are designed to be safe to access from multiple processes simultaneously.
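As a rough sketch of how that could look (the file name and keys here are made up, and for heavy concurrent writes you would normally also configure a BerkeleyDB::Env with locking):

    use strict;
    use warnings;
    use BerkeleyDB;

    # Tie the hash to a file before forking; both processes then read and
    # write the same on-disk data.
    my %map;
    tie %map, 'BerkeleyDB::Hash',
        -Filename => 'shared_map.db',
        -Flags    => DB_CREATE
        or die "Cannot open shared_map.db: $! $BerkeleyDB::Error";

    my $pid = fork() // die "fork failed: $!";
    if ($pid == 0) {
        $map{"child-$$"} = 42;    # written to the same file the parent sees
        exit 0;
    }
    waitpid($pid, 0);
    print "$_ => $map{$_}\n" for keys %map;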
A more involved method would be to use some form of inter-process communication. For all the gory details, see the perlipc manpage, which covers the several IPC mechanisms supported by Perl.
A final approach, if your Perl supports it, is to use threads and share variables between them.
Each fork call generates a brand new process, so updates to a hash variable in a child process are not visible in the parent (and changes to the parent after the fork call are not visible in the child).
You could use threads (see also threads::shared) so that a change written in one thread is visible in another thread.
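For instance, a minimal sketch along these lines, assuming your Perl was built with ithreads support (the keys and values are just placeholders):

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my %map :shared;                      # every thread sees the same hash
    my @workers = map {
        threads->create(sub {
            my $key = shift;
            $map{$key} = $key * 2;        # visible to the parent thread too
        }, $_);
    } 1 .. 4;
    $_->join for @workers;

    print "$_ => $map{$_}\n" for sort keys %map;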
Another option is to use interprocess communication to pass messages between parent and child processes. The Forks::Super module (of which I am the author) can make this less of a headache.
Or your child processes could write some output to files. When the parent process reaps them, it could load the data from those files and update its global hash map accordingly.
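A minimal sketch of that file-based approach, using Storable for serialization (file names and the per-child computation are made up for illustration):

    use strict;
    use warnings;
    use Storable qw(store retrieve);

    my %global;
    my @pids;
    for my $task (1 .. 4) {
        my $pid = fork() // die "fork failed: $!";
        if ($pid == 0) {
            my %result = ( "key-$task" => $task * 10 );   # whatever the child computes
            store \%result, "result.$$";                  # one file per child, named by pid
            exit 0;
        }
        push @pids, $pid;
    }
    for my $pid (@pids) {
        waitpid($pid, 0);                         # reap the child...
        my $result = retrieve("result.$pid");     # ...then load what it wrote
        %global = (%global, %$result);
        unlink "result.$pid";
    }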
Read the "RETRIEVING DATASTRUCTURES from child processes" section from man Parallel::ForkManager. There are callbacks, child's data can be sent and parent can retrieve them and populate data structures.
Most POSIX named objects (or all?) have unlink functions, e.g.:
shm_unlink
mq_unlink
They all have in common that they remove the object's name from the system, causing subsequent opens to fail or to create a new object.
Why is it designed like this? I know this is connected to the "everything is a file" philosophy, but why not delete the file on close? Would you do the same if you were designing a new interface?
I think this has a big drawback. Say we have a server process and several client processes. If any process unlinks the object (by mistake), no new clients will find the server. (This can be prevented with user permissions on the corresponding file, but still...)
Would it not be better if the name were reference counted and removed automatically when the last object is closed? Why would you want to keep it open?
Because they are low-level tools meant for use when performance matters. Deleting the object whenever it is no longer in use, only to create it again on the next use, carries a (slight) performance penalty compared with keeping it alive.
I once used a named semaphore to synchronize access to a spool with various producers and consumers. An init module, run as part of the boot process, created the named semaphore, and all the other processes knew that the well-known semaphore should exist.
If you want a more programmer-friendly interface that creates the object on demand and destroys it when it is no longer used, you can build a higher-level library and encapsulate the creation/unlink operations in it. But if the system call did it for you, it would be impossible to build a user-level library that avoids it.
Would it not be better if the name were reference counted and removed automatically when the last object is closed?
No.
Because unlink() can fail, and because automatically unlinking a resource that can be shared between processes just because every process has closed it simply doesn't fit the paradigm of a shared resource.
You don't demolish a roller coaster just because there's no one waiting in line to ride it again at that moment in time.
Say we have a parent process with some arbitrary amount of data in memory, and we use fork to spawn a child process. I understand that in order for the OS to perform copy on write, the page containing the data being modified has its read-only bit set, and the OS uses the exception raised when the child tries to modify the data to copy the entire page to another area of memory, so that the child gets its own copy.
What I don't understand is this: if that section of memory is marked read-only, then the parent process, to whom the data originally belonged, would not be able to modify the data either. So how can this whole scheme work? Does the parent lose ownership of its data, so that copy on write has to be performed even when the parent itself tries to modify the data?
Right, if either process writes a COW page, it triggers a page fault.
In the page fault handler, if the page is supposed to be writeable, it allocates a new physical page and does a memcpy(newpage, shared_page, pagesize), then updates the page table of whichever process faulted to map the newpage to that virtual address. Then returns to user-space for the store instruction to re-run.
This is a win for something like fork, because the child typically makes an execve system call right away, after touching only a page or so (of stack memory). execve destroys all memory mappings for that process, effectively replacing it with a new process. The parent once again has the only copy of every page. (Except pages that were already copy-on-write, e.g. memory allocated with mmap is typically COW-mapped to a single physical page of zeros, so reads can hit in L1d cache.)
A smart optimization would be for fork to actually copy the page containing the top of the stack, but still do lazy COW for all the other pages, on the assumption that the child process will normally execve right away and thus drop its references to all the other pages. It still costs a TLB invalidation in the parent to temporarily flip all the pages to read-only and back, though.
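The page-fault machinery itself is invisible to user code, but a small sketch like the following (in Perl, matching the earlier question; the big array is just a stand-in for "a lot of data") shows the observable effect: after fork, a write in one process is never seen by the other.

    use strict;
    use warnings;

    my @big = (0) x 1_000_000;        # one physical copy at the time of the fork
    my $pid = fork() // die "fork failed: $!";
    if ($pid == 0) {
        $big[0] = 42;                 # child's write faults; the kernel copies the affected page(s)
        print "child sees $big[0]\n";             # 42
        exit 0;
    }
    waitpid($pid, 0);
    print "parent still sees $big[0]\n";          # 0: the parent's copy is untouched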
Some UNIX implementations share the program text between the two since that cannot be modified. Alternatively, the child may share all of the parent's memory, but in that case the memory is shared copy-on-write, which means that whenever either of the two wants to modify part of the memory, that chunk of memory is explicitly copied first to make sure the modification occurs in a private memory area.
Excerpted from: Modern Operating Systems (4th Edition), Tanenbaum
From what I understand, the main q thread monitors its socket descriptors for requests and responds to them.
I want to use a while loop in my main thread that will run for an indefinite period of time. That would mean I could no longer hopen the process's port and run queries against it.
Is there any way to manually check for requests within the while loop?
Thanks.
Are you sure you need to use a while loop? Is there any chance you could, for instance, instead use the timer functionality of KDB+?
This could allow you to run a piece of code periodically instead of looping over it continually. Depending on your use case, this may be more appropriate as it would allow you to repeatedly run a piece of code (e.g. that could be polling something periodically), without using the main thread constantly.
KDB+ is by default single-threaded, which makes it tricky to do what you want to do. There might be something you can do with slave threads.
If you're interested in the timer functionality but the built-in timer is too limited for your needs, there is a more advanced set of timer functionality available for free from AquaQ Analytics (disclaimer: I work for AquaQ). It is distributed as part of the TorQ KDB framework; the specific script you'd be interested in is timer.q, which is documented here. You may be able to use this script without the full TorQ framework, though you might need some of the other "common" code from TorQ to provide functions used within timer.q.
What is the reasoning behind maintaining doubly linked lists of PCBs (process control blocks) in an OS for scheduling? I have seen this mentioned multiple times for real-time operating systems.
I would ideally go for a circular singly linked list, so that you could do round robin and get back to the first task after visiting them all. You could also sort it by priority...
But why a doubly linked list?
You have made the assumption that you would always want to start at the head and work towards the end of the list. This may not be true. Say you are swapping one process out (e.g. it is pending on a semaphore). You already have the current process control block, so it makes sense to start from the information you have rather than iterate through the entire list.
Because the PCB has a reference to both the previous and the next node, you can cut that node out of the running list and move it to a pended list, or from a pended list back into a ready-to-run list, and so on, without having to iterate all the way through.
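As a hedged illustration of why the back pointer matters, here is a minimal Perl sketch (all names invented) of a circular doubly linked list with a sentinel node, where a PCB can be detached in O(1) given only a reference to it:

    use strict;
    use warnings;

    sub new_list {                     # a sentinel node linked to itself
        my $sentinel = {};
        $sentinel->{prev} = $sentinel->{next} = $sentinel;
        return $sentinel;
    }

    sub insert_after {                 # O(1) insertion after any node
        my ($where, $node) = @_;
        $node->{prev} = $where;
        $node->{next} = $where->{next};
        $where->{next}{prev} = $node;
        $where->{next} = $node;
    }

    sub detach {                       # O(1) removal given only the node itself
        my ($node) = @_;
        $node->{prev}{next} = $node->{next};
        $node->{next}{prev} = $node->{prev};
        $node->{prev} = $node->{next} = undef;
        return $node;
    }

    # Move the PCB we already hold from the run list onto the pended list:
    my ($run_list, $pended_list) = (new_list(), new_list());
    my $pcb = { pid => 42 };
    insert_after($run_list, $pcb);
    insert_after($pended_list, detach($pcb));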
Suppose I have a key-value database and I need to build a queue on top of it. How could I achieve this without bad performance?
One idea might be to store the queue inside an array and simply store the array under a fixed key. This is quite a simple implementation, but very slow, since for every read or write access the complete array must be loaded/saved.
I could also implement a linked list with random keys, plus one fixed key which acts as a starting point for element 1. Depending on whether I prefer fast reads or fast writes, I could let the fixed element point to the first or the last entry in the queue (so I have to traverse it forwards/backwards).
Or, to take that further, I could have two fixed pointers: one for the first item, one for the last.
Any other suggestions on how to do this effectively?
Fundamentally, a key-value structure is very similar to raw memory storage, where the physical address plays the role of the key. So any type of data structure can certainly be modeled on top of key-value storage, including a linked list.
A linked list is a list of nodes, each holding the index of the previous and/or the following node. A node can itself be viewed as a small key-value structure, so by adding a prefix to the key, the information in a node can be stored as separate entries in a flat table of key-value pairs.
Going further, a numeric suffix on the key can even do away with the redundant pointer information. Such a list might look something like this:
pilot-last-index: 5
pilot-0: Rei Ayanami
pilot-1: Shinji Ikari
pilot-2: Soryu Asuka Langley
pilot-3: Touji Suzuhara
pilot-5: Makinami Mari
The corresponding algorithm is easy enough to imagine. If you can run a daemon thread to maintain these keys, pilot-5 could be renamed to pilot-4 in the example above. Even in situations where an additional thread is not allowed, the queue itself still behaves correctly; there is just some overhead due to the gap in the sequence.
Which of the two approaches to apply comes down to a trade-off between storage space and CPU time.
Thread safety is indeed a problem, but an old, well-understood one. Just as the JDK provides classes implementing the ConcurrentMap interface, atomic operations on key-value data are widely available. Some key-value middleware, such as memcached, offers similar features that let you update a key or value separately and thread-safely. But these are concerns for the algorithm rather than for the key-value structure itself.
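To make the approach above concrete, here is a hedged Perl sketch in which a plain hash stands in for the key-value store and the key names ("head", "tail", "<id>-value", "<id>-next") are invented; it flattens a singly linked queue into composed keys:

    use strict;
    use warnings;

    my %kv = ( head => undef, tail => undef, seq => 0 );

    sub kv_enqueue {
        my ($value) = @_;
        my $id = 'node-' . $kv{seq}++;
        $kv{"$id-value"} = $value;
        $kv{"$id-next"}  = undef;
        if (defined $kv{tail}) { $kv{"$kv{tail}-next"} = $id }
        else                   { $kv{head} = $id }
        $kv{tail} = $id;
    }

    sub kv_dequeue {
        my $id = $kv{head};
        return undef unless defined $id;
        my $value = delete $kv{"$id-value"};
        $kv{head} = delete $kv{"$id-next"};
        $kv{tail} = undef unless defined $kv{head};
        return $value;
    }

    kv_enqueue($_) for 'Rei Ayanami', 'Shinji Ikari';
    print kv_dequeue(), "\n";    # Rei Ayanami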
I think it depends on the kind of queue you want to implement, and no solution will be perfect because a key-value store is not the right data structure for this kind of task. There will always be some kind of hack involved.
For a simple first-in-first-out queue you could use a couple of key-value entries like the following:
{
    oldestIndex: 5,
    newestIndex: 10
}
In this example there would be 6 items in the queue (5, 6, 7, 8, 9, 10). Items 0 to 4 are already done, and there is no item 11 yet. A producer would increment newestIndex and save its item under the key 11. A consumer would take the item under key 5 and then increment oldestIndex. A rough sketch of this follows.
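Here is a hedged Perl sketch of that scheme, again with a plain hash standing in for the key-value store and invented key names; it assumes a single producer and a single consumer:

    use strict;
    use warnings;

    my %kv = ( oldestIndex => 0, newestIndex => -1 );

    sub enqueue {
        my ($item) = @_;
        my $i = ++$kv{newestIndex};       # claim the next slot
        $kv{"item-$i"} = $item;
    }

    sub dequeue {
        return undef if $kv{oldestIndex} > $kv{newestIndex};   # queue is empty
        my $i = $kv{oldestIndex}++;
        return delete $kv{"item-$i"};
    }

    enqueue($_) for qw(a b c);
    print dequeue(), "\n";                # a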
Note that this approach can lead to problems if you have multiple consumers/producers, and if the queue is never empty you can't reset the indexes.
But the multithreading problem is also true for linked lists etc.