What is the distinction between spooling and device queue? - operating-system

I know that each device has its own device queue. Also i see that "Spooling is a type of buffering. The most common spooling application is print spooling, which places a task (or 'print job') into a queue for extended or later processing."
Does that mean the device queue is spooling? I am confused.

Spooling is referred to as buffered queueing. A spooler is a logical device that takes input from a quicker process (such as a word processor or image application) and holds that data in a buffered queue for a slower process (printing or some other slow process). The spool allows the quicker application to not have to wait on the slower process to finish before moving on to other tasks.
Spoolers can also be used in reverse so that a slower process is an input and the spool manages the input and holds it in the buffered queue along with a queue of actions until enough data is present to take the actions on the data. This is known as batch processing. These reverse spools allow for the data processing actions to be spooled along with the incoming data to be processed when the correct data is present and allows the application that requested the spooling to continue on to other tasks.

Device queue is more specific term which take into account about a specific device.you cannot talk about device queue without having an actual device whereas spooling is a generic term that corresponds to usage of disk space as buffer irrespective of the type of device.

Related

Do memory instructions pass through the load-store queue and issue queue in the microarchitecture

What is the difference between the issue queue and lsq queue for
memory instructions? Do memory instructions pass through both queues, or do they only pass
through the lsq queue.
If they pass through both queues what is their order?
I'm assuming you use the arm-like nomenclature here so the issue queue is what Intel calls RS (reservation station) and by issue you mean sending a uop ready for execution.
The answer is that memory instructions need to pass both. All instructions need to be issued (except the ones that can be eliminated without execution, for example register moves, zero idioms, nops, etc..). Let's rephrase - all instructions that need to go through an ALU need to go through the issue process first. Memory instructions will simply use that step to calculate their addresses.
This is true for loads, for stores there is usually an internal split into store-address and store-data, so the store-address will behave like a load in that sense and calculate its address during that step.
There is usually a dedicated execution port for that and dedicated execution units because the address calculation usually follows one of few specific addressing modes (each architecture has a different set of these), but aside from that the execution needs to follow the same rules like any other operation in the CPU - it needs to have its sources ready and read from the register file or bypassed from an in flight operation, it needs to get arbitrated when the execution port is free and prioritized by the same aging rules, so it makes sense that it uses the common path.
Once the memory operation has finished execution, it will be sent to the LSU (load-store unit, or the DCU, data-cache unit on Intel) and perform the actual memory access using the generated address. The LSU pipe will take care of the address translation, TLB lookups, the page walk if needed (though this is sometimes done in a dedicated unit), the address range and property checks, the cache lookup (if cacheable) and sending a miss to the next cache level or memory if needed. It may also trigger prefetches as part of the process.
For a load, when the LSU pipe has completed (which may require multiple passes and wakeups if the data was not available in the L1), the LSU will signal the issue queue again in order to wakeup anyone who was depended on the result.
For a store, store-address may fetch the line to the cache in advance as an optimization but the actual next step is usually to wakeup after retirement (since stores may not be dispatched to memory while speculative, unless you have some tricks to handle that).
It's also worth to mention that some CPUs try to optimize loads that can forward the data directly from prior stores instead of fetching it from the cache/memory. This can include forwarding (very common) or memory renaming (less common). The former is usually handled by the LSU internally, but the latter can be done much earlier and without the LSU (though the LSU pipe is usually still activated to validate the result).

Can a mutex be used instead of a critical section when using stream buffers in FreeRTOS?

I am looking into using a stream buffer in FreeRTOS to transfer CAN frames from multiple tasks to an ISR, which puts them into a CAN transmit buffer as soon as it's ready. The manual here explains that a stream buffer should only be used by one task/isr and read by one task/isr, and if not, then a critical section is required.
Can a mutex be used in place of a critical section for this scenario? Would it make more sense to use?
First, if you are sending short discrete frames, you may want to consider a message buffer instead of a stream buffer.
Yes you could use a mutex.
If sending from multiple tasks the main thing to consider is what happens when the stream buffer becomes full. If you were using a different FreeRTOS object (other than a message buffer, message buffers being built on stream buffers) then multiple tasks attempting to write to the same instance of an object that was full would all block on their attempt to write to the object, and be automatically unblocked when space in the object became available - the highest priority waiting task would be the first to be unblocked no matter the order in which the tasks entered the blocked state. However, with stream/message buffers you can only have one task blocked attempting to write to a full buffer - and if the buffer was protected by a muted - all other tasks would instead block on the mutex. That could mean that a low priority task was blocked on the stream/message buffer while a higher priority task was blocked on the mutex - a kind of priority inversion.

What methods are available on unix for pub sub IPC?

There are various options for IPC.
Over a network:
for client-server, can use TCP
for pub sub, can use UDP multicast
Locally:
for client-server, can use unix domain sockets
for pub sub, can use ???
I suppose what I'd be interested in is some kind of file descriptor that supports many readers (subscribers) and many writers (publishers) simultaneously. Is this usage pattern feasible/efficient on unix?
After much googling I haven't found a whole lot in the way of ipc multicast, so I have decided to write a program pubsub that takes as arguments a publisher address and a subscriber address, listens and accepts connections on these 2 addresses, and then for each payload received on a publisher connection write it to each of the subscriber connections. It wouldn't surprise me if this is inefficient or reinventing the wheel but I have not come across a better solution.
I was looking for solutions to a similar problem and found /dev/fanout. Fanout is a kernel module that replicates its input out to all processes reading from it. You can think of it as IPC Broadcast mechanism. Works well for small data payloads according to the author. Multiple processes can write to the device and multiple processes can read from it. I am not sure of atomicity of writes though. Small writes from multiple processes should occur atomically as with FIFOs, etc.
More about Fanout:
http://compgroups.net/comp.linux.development.system/-dev-fanout-a-one-to-many-multi/2869739
http://www.linuxtoys.org/fanout/fanout.html
There are Posix message queues too. As man mq_overview puts it:
POSIX message queues allow processes to exchange data in the form of messages. This API is distinct from that provided by
System V message queues (msgget(2), msgsnd(2), msgrcv(2), etc.), but provides similar functionality.
Message queues are created and opened using mq_open(3); this function returns a message queue descriptor (mqd_t), which is
used to refer to the open message queue in later calls. Each message queue is identified by a name of the form /somename;
that is, a null-terminated string of up to NAME_MAX (i.e., 255) characters consisting of an initial slash, followed by one
or more characters, none of which are slashes. Two processes can operate on the same queue by passing the same name to
mq_open(3).
Messages are transferred to and from a queue using mq_send(3) and mq_receive(3). When a process has finished using the queue, it closes it using mq_close(3), and when the queue is no longer required, it can be deleted using mq_unlink(3).
Queue attributes can be retrieved and (in some cases) modified using mq_getattr(3) and mq_setattr(3). A process can request asynchronous notification of the arrival of a message on a previously empty queue using mq_notify(3).
A message queue descriptor is a reference to an open message queue description (cf. open(2)). After a fork(2), a child inherits copies of its parent's message queue descriptors, and these descriptors refer to the same open message queue descriptions as the corresponding descriptors in the parent. Corresponding descriptors in the two processes share the flags (mq_flags) that are associated with the open message queue description.
Each message has an associated priority, and messages are always delivered to the receiving process highest priority first.
Message priorities range from 0 (low) to sysconf(_SC_MQ_PRIO_MAX) - 1 (high). On Linux, sysconf(_SC_MQ_PRIO_MAX) returns 32768, but POSIX.1 requires only that an implementation support at least priorities in the range 0 to 31; some implementations provide only this range.
A more friendly introduction by Michael Kerrisk is available here: http://man7.org/conf/lca2013/IPC_Overview-LCA-2013-printable.pdf

MSI/MESI: How can we get "read miss" in shared state?

In The Cache Memory Book by Jim Handy (excerpt is below), the author has the table description of MESI protocol. The table looks very unclear to me, and unfortunately the text does not help.
The first question (in green on the picture):
Is this right? -- a data block is in the cache of a CPU,
and it is in the shared state, but when the CPU reads it,
the CPU gets read miss.
How is this possible?
The second question (in purple):
Who and when create all these messages "Read miss", etc.?
(afaik, the system bus just translates messages of others)
And finally, the third question (not on the pic):
Do all these cache coherency protocols
(MSI,MESI,MOESI,Firefly,Dragon...)
maintain sequential consistency memory model?
Are there protocols that maintain other consistency models?
For the first question, while there is a data block (in shared state) in the cache, it is the wrong block (i.e., the tag does not match), so there is a cache miss. On a cache miss, the cache still needs to writeback data to memory if it is in the modified state.
For the second question, each bus is providing address and request information (read or write) to the cache; the system bus is providing this from a remote processor. The hit or miss is whether the address provided is a hit in the cache. Requests on the system bus are filtered by the remote processor's cache.
On a read by processor 1 (P1), if the P1's cache has a hit, no signal needs to be sent to the system bus. On a write by P1, if the P1's cache has a hit and the state is exclusive or modified, then no signals need to be sent to the system bus, but if state in P1's cache is shared, then the other (P0) cache must told (via the system bus) that P1 is performing a write and that any entry in P0's cache for that address must be invalidated.
(Note that "shared" state does not necessarily mean that the block of memory is present in some other cache. Most snoop-based protocols allow silent [no system bus communication] dropping of cache blocks in shared state.)
(By the way, the MESI protocol given by that book is a bit unusual. It is more common for Exclusive state to be entered by a read miss that found no other caches with the block [this allows silent update to modified on a later write] and for writes not to use write through to memory on a hit that was in shared state.)
For the third question, cache coherence protocols only address coherence not consistency. The consistency model that the hardware provides determines what kinds of activity can be buffered and completed out of order (in a way that is visible to software). E.g., it helps performance if a write miss does not force any following read hits to wait until the write notification has been seen by all remote caches, but such can violate sequential consistency.
Here is an example how sequential consistency can prevent buffering of writes:
P0 reads "B_old" at location B (cache hit), writes "A_new" to location A (a cache miss), then reads location B.
P1 reads "A_old" at location A (cache hit), writes "B_new" to location B (a cache miss), then reads location A.
If writes are buffered and reads are allowed to proceed without waiting for the preceding write to be recognized by the other cache, the P0's second read of B can read "B_old" (since it will still be a cache hit) and P1's second read of A can read "A_old" (since P0's write has not yet been seen). There is no way to construct a sequential ordering of these memory accesses, so sequential consistency would be violated.
However, if P0's second read of B waited until P0's write to A was recognized by P1 (and P1's second read of A waited until P1's write to B was recognized by P0), then a sequential ordering would be possible.
If P1 saw P0's write to A before its write to B, then sequential ordering is possible, including:
P0.read "B_old", P1.read "A_old", P0.write "A_new", P1.write "B_new", P0.read "B_new", P1.read "A_new"
P0.read "B_old", P1.read "A_old", P0.write "A_new", P0.read "B_old", P1.write "B_new", P1.read "A_new"
Note that hardware can speculate that ordering requirements are not violated and reorder the completion of memory accesses; however, it needs to be able to handle incorrect speculation (e.g., by restoration to a checkpoint). (A classic early paper on this is "Is SC + ILP = RC?" [PDF].)
There is a one-to-many mapping between cache lines and blocks of main memory. A cache line that is shared may not be the block of memory your program is looking for. What actually happened is that the block of memory your program wants, and the block of memory present in the cache, though different, have the same cache line index. Consecutive blocks of memory are mapped to consecutive cache lines, till the cache lines finish, and the subsequent blocks of memory are mapped to cache lines starting with the first cache line all over again.
I found an intuitive explanation here: https://medium.com/breaktheloop/direct-mapping-map-cache-and-main-memory-d5e4c1cbf73e

Implement a good performing "to-send" queue with TCP

In order not to flood the remote endpoint my server app will have to implement a "to-send" queue of packets I wish to send.
I use Windows Winsock, I/O Completion Ports.
So, I know that when my code calls "socket->send(.....)" my custom "send()" function will check to see if a data is already "on the wire" (towards that socket).
If a data is indeed on the wire it will simply queue the data to be sent later.
If no data is on the wire it will call WSASend() to really send the data.
So far everything is nice.
Now, the size of the data I'm going to send is unpredictable, so I break it into smaller chunks (say 64 bytes) in order not to waste memory for small packets, and queue/send these small chunks.
When a "write-done" completion status is given by IOCP regarding the packet I've sent, I send the next packet in the queue.
That's the problem; The speed is awfully low.
I'm actually getting, and it's on a local connection (127.0.0.1) speeds like 200kb/s.
So, I know I'll have to call WSASend() with seveal chunks (array of WSABUF objects), and that will give much better performance, but, how much will I send at once?
Is there a recommended size of bytes? I'm sure the answer is specific to my needs, yet I'm also sure there is some "general" point to start with.
Is there any other, better, way to do this?
Of course you only need to resort to providing your own queue if you are trying to send data faster than the peer can process it (either due to link speed or the speed that the peer can read and process the data). Then you only need to resort to your own data queue if you want to control the amount of system resources being used. If you only have a few connections then it is likely that this is all unnecessary, if you have 1000s then it's something that you need to be concerned about. The main thing to realise here is that if you use ANY of the asynchronous network send APIs on Windows, managed or unmanaged, then you are handing control over the lifetime of your send buffers to the receiving application and the network. See here for more details.
And once you have decided that you DO need to bother with this you then don't always need to bother, if the peer can process the data faster than you can produce it then there's no need to slow things down by queuing on the sender. You'll see that you need to queue data because your write completions will begin to take longer as the overlapped writes that you issue cannot complete due to the TCP stack being unable to send any more data due to flow control issues (see http://www.tcpipguide.com/free/t_TCPWindowSizeAdjustmentandFlowControl.htm). At this point you are potentially using an unconstrained amount of limited system resources (both non-paged pool memory and the number of memory pages that can be locked are limited and (as far as I know) both are used by pending socket writes)...
Anyway, enough of that... I assume you already have achieved good throughput before you added your send queue? To achieve maximum performance you probably need to set the TCP window size to something larger than the default (see http://msdn.microsoft.com/en-us/library/ms819736.aspx) and post multiple overlapped writes on the connection.
Assuming you already HAVE good throughput then you need to allow a number of pending overlapped writes before you start queuing, this maximises the amount of data that is ready to be sent. Once you have your magic number of pending writes outstanding you can start to queue the data and then send it based on subsequent completions. Of course, as soon as you have ANY data queued all further data must be queued. Make the number configurable and profile to see what works best as a trade off between speed and resources used (i.e. number of concurrent connections that you can maintain).
I tend to queue the whole data buffer that is due to be sent as a single entry in a queue of data buffers, since you're using IOCP it's likely that these data buffers are already reference counted to make it easy to release then when the completions occur and not before and so the queuing process is made simpler as you simply hold a reference to the send buffer whilst the data is in the queue and release it once you've issued a send.
Personally I wouldn't optimise by using scatter/gather writes with multiple WSABUFs until you have the base working and you know that doing so actually improves performance, I doubt that it will if you have enough data already pending; but as always, measure and you will know.
64 bytes is too small.
You may have already seen this but I wrote about the subject here: http://www.lenholgate.com/blog/2008/03/bug-in-timer-queue-code.html though it's possibly too vague for you.