I ran truss on a process and got the lines below. I want to know the definition of kaio, but there is no manual entry for it. Where can I find the definition?
/1: kaio(AIOWRITE, 259, 0x3805B2A00, 8704, 0x099C9E000755D3C0) = 0
/1: kaio(AIOWRITE, 259, 0x380CF9200, 14336, 0x099CC0000755D5B8) = 0
/1: kaio(AIOWRITE, 259, 0x381573600, 8704, 0x099CF8000755D7B0) = 0
/1: kaio(AIOWRITE, 259, 0x381ACA600, 8192, 0x099D1A000755D9A8) = 0
/1: kaio(AIOWAIT, 0xFFFFFFFF7FFFD620) = 4418032576
/1: timeout: 600.000000 sec
/1: kaio(AIOWAIT, 0xFFFFFFFF7FFFD620) = 4418033080
/1: timeout: 600.000000 sec
/1: kaio(AIOWAIT, 0xFFFFFFFF7FFFD620) = 4418033584
/1: timeout: 600.000000 sec
From an article about it:
What kaio does, as the name implies, is implement async I/O inside the kernel rather than in user-land via user threads. The I/O queue is created and managed in the operating system. The basic sequence of events is as follows: When an application calls aioread(3) or aiowrite(3), the corresponding library routine is entered. Once entered, the library first tries to process the request via kaio. A kaio initialization routine is executed, which creates a "cleanup" thread, which is intended to ensure that there are no remaining memory segments that have been allocated but not freed during the async I/O process. Once that's complete, kaio is called, at which point a test is made to determine if kaio is supported for the requested I/O.
Kaio is implemented as a loadable kernel module, /kernel/sys/kaio, and is loaded the first time an async I/O is called. You can determine if the module is loaded or not with modinfo(1M):
fawlty> modinfo | grep kaio
105 608c4000 2efd 178 1 kaio (kernel Async I/O)
fawlty>
I found the answer:
It is defined in the file /usr/include/sys/syscall.h:
#define SYS_kaio 178
/*
* subcodes:
* aioread(...) :: kaio(AIOREAD, ...)
* aiowrite(...) :: kaio(AIOWRITE, ...)
* aiowait(...) :: kaio(AIOWAIT, ...)
* aiocancel(...) :: kaio(AIOCANCEL, ...)
* aionotify() :: kaio(AIONOTIFY)
* aioinit() :: kaio(AIOINIT)
* aiostart() :: kaio(AIOSTART)
* see
*/
Right now it seems that on every clock tick the running process is preempted and forced to yield the processor. I have investigated the code base thoroughly, and the only code relevant to process preemption is below (in trap.c):
// Force process to give up CPU on clock tick.
// If interrupts were on while locks held, would need to check nlock.
if(myproc() && myproc()->state == RUNNING && tf->trapno == T_IRQ0+IRQ_TIMER)
  yield();
I guess the timing is specified in T_IRQ0 + IRQ_TIMER, but I can't figure out how these two can be modified. They are defined in trap.h:
#define T_IRQ0 32 // IRQ 0 corresponds to int T_IRQ
#define IRQ_TIMER 0
I wonder how I can change the default RR scheduling time slice (right now 1 clock tick; for example, make it 10 clock ticks)?
If you want a process to run longer than the others, you can give it more timeslices without changing the timeslice duration.
To do so, add extra_slice and current_slice fields to struct proc and modify the timer trap handler this way:
if(myproc() && myproc()->state == RUNNING &&
   tf->trapno == T_IRQ0+IRQ_TIMER)
{
  int current = myproc()->current_slice;
  if(current)
    myproc()->current_slice = current - 1;
  else
    yield();
}
Then you just have to create a syscall to set extra_slice and modify the scheduler function to reset current_slice to extra_slice each time the process is scheduled:
// Switch to chosen process. It is the process's job
// to release ptable.lock and then reacquire it
// before jumping back to us.
c->proc = p;
switchuvm(p);
p->state = RUNNING;
p->current_slice = p->extra_slice;
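To see what this does to CPU shares, here is a rough simulation of the two changes above (the trap handler and the scheduler reset). It is only a sketch in Python, not xv6 code: the Proc class, the simulate function and the run queue are made up for illustration, and only extra_slice and current_slice correspond to the fields described above.

```python
from collections import deque


class Proc:
    """Hypothetical stand-in for the two new fields in struct proc."""

    def __init__(self, name, extra_slice=0):
        self.name = name
        self.extra_slice = extra_slice  # extra ticks granted per turn
        self.current_slice = 0          # extra ticks left in this turn
        self.ticks_run = 0


def simulate(procs, total_ticks):
    """Round-robin over procs, one loop iteration per timer tick."""
    run_queue = deque(procs)
    running = None
    for _ in range(total_ticks):
        if running is None:
            # Scheduler: pick the next process and refill its budget.
            running = run_queue.popleft()
            running.current_slice = running.extra_slice
        running.ticks_run += 1
        # Timer trap handler: use up an extra slice, or yield.
        if running.current_slice:
            running.current_slice -= 1
        else:
            run_queue.append(running)
            running = None
    return {p.name: p.ticks_run for p in procs}


usage = simulate([Proc("a"), Proc("b", extra_slice=2)], 40)
print(usage)
```

With extra_slice = 2, process b gets three ticks per turn against one for a, so over 40 ticks the split is 30 to 10.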
To change the length of a tick itself, look at the lapic.c file:
lapicinit(void)
{
....
// The timer repeatedly counts down at bus frequency
// from lapic[TICR] and then issues an interrupt.
// If xv6 cared more about precise timekeeping,
// TICR would be calibrated using an external time source.
lapicw(TDCR, X1);
lapicw(TIMER, PERIODIC | (T_IRQ0 + IRQ_TIMER));
lapicw(TICR, 10000000);
So, if you want the timer interrupts to be spaced further apart, increase the TICR value:
lapicw(TICR, 10000000); //10 000 000
can become
lapicw(TICR, 100000000); //100 000 000
Warning: TICR is a 32-bit unsigned counter, so do not go over 4 294 967 295 (0xFFFFFFFF).
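As a quick sanity check of that limit, the arithmetic can be sketched like this. The helper scaled_ticr is hypothetical (it is not part of xv6); it just multiplies the default value of 10000000 by a factor and refuses results that no longer fit in 32 bits.

```python
TICR_MAX = 0xFFFFFFFF      # largest value a 32-bit unsigned counter can hold
DEFAULT_TICR = 10_000_000  # value used in lapicinit() above


def scaled_ticr(factor, base=DEFAULT_TICR):
    """Return base * factor, or raise if it does not fit in TICR."""
    value = base * factor
    if not 0 < value <= TICR_MAX:
        raise ValueError("TICR value %d does not fit in 32 bits" % value)
    return value


print(scaled_ticr(10))           # ticks ten times further apart
print(TICR_MAX // DEFAULT_TICR)  # largest whole factor that still fits
```

So with the default base, the interval can be stretched by a factor of at most 429 before the counter overflows.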
I have the following qsort example to try out callbacks in luajit. However it has a memory leak (luajit: not enough memory when executing) which is not obvious to me.
Can somebody give me some hints on how to create a proper callback example?
local ffi = require("ffi")
-- ===============================================================================
ffi.cdef[[
void qsort(void *base, size_t nel, size_t width, int (*compar)(const void *, const void *));
]]
function compare(a, b)
  return a[0] - b[0]
end
-- ===============================================================================
-- Explicitly convert to a callback via cast
local callback = ffi.cast("int (*)(const char *, const char *)", compare)
local data = "efghabcd"
local size = 8
local loopSize = 1000 * 1000 * 100.
local bytes = ffi.new("char[15]")
-- ===============================================================================
for i=1,loopSize do
  ffi.copy(bytes, data, size)
  ffi.C.qsort(bytes, size, 1, callback)
end
Platform: OSX 10.8
luajit: 2.0.1
The problem appears to be that Lua never gets a chance to perform a full garbage collection cycle inside the tight loop. As hinted in the comments, you can correct this by calling collectgarbage() yourself inside the loop.
Note that calling collectgarbage() on every iteration will affect the running time of whatever you're benchmarking. To minimize this, set a threshold that limits how often collectgarbage() gets called:
-- collectgarbage'count' reports kilobytes, so this threshold is 1 MiB
local memthreshold = 2 ^ 20 / 1024
local start = os.clock()
for i = 1, loopSize do
  ffi.copy(bytes, data, size)
  ffi.C.qsort(bytes, size, 1, callback)
  if collectgarbage'count' > memthreshold then
    collectgarbage()
  end
end
local elapse = os.clock() - start
print("elapsed:", elapse..'s')
As I understand it the idea of a pool in gevent is to limit the total number of concurrent requests at any time, to a database or an API or similar.
Say I have code like this where I am spawning more greenlets than I have room for in the Pool:
import gevent.pool
pool = gevent.pool.Pool(50)
jobs = []
for number in xrange(300):
    jobs.append(pool.spawn(do_something, number))
total_result = [x.get() for x in jobs]
What is the actual behavior when trying to spawn the 51st request? When is the 51st request handled?
The Pool class uses a semaphore to count active greenlets. The semaphore is initialized with the pool size in the constructor:
class Pool(Group):
    def __init__(self, size=None, greenlet_class=None):
        if size is not None and size < 1:
            raise ValueError('Invalid size for pool (positive integer or None required): %r' % (size, ))
        Group.__init__(self)
        self.size = size
        if greenlet_class is not None:
            self.greenlet_class = greenlet_class
        if size is None:
            self._semaphore = DummySemaphore()
        else:
            self._semaphore = Semaphore(size)
Every time spawn() is called, it tries to acquire the semaphore:
def spawn(self, *args, **kwargs):
    self._semaphore.acquire()
    try:
        greenlet = self.greenlet_class.spawn(*args, **kwargs)
        self.add(greenlet)
    except:
        self._semaphore.release()
        raise
    return greenlet
If the pool is full, the calling greenlet will therefore block in the _semaphore.acquire() call. The semaphore is released whenever any of the greenlets finishes execution:
def discard(self, greenlet):
    Group.discard(self, greenlet)
    self._semaphore.release()
So in your case, I'd expect the 51st request to be handled (or started, to be precise) as soon as any of the first 50 requests is done.
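If you want to watch this blocking behaviour without gevent installed, the same acquire-on-spawn, release-on-finish pattern can be sketched with standard-library threads. To be clear, TinyPool below is a made-up stand-in using threading, not gevent's Pool, and the pool size of 5 and the 20 jobs are arbitrary numbers for the demo.

```python
import threading
import time


class TinyPool:
    """Hypothetical thread-based analogue of the semaphore logic above."""

    def __init__(self, size):
        self._semaphore = threading.Semaphore(size)

    def spawn(self, func, *args):
        # Blocks the caller while `size` workers are still running.
        self._semaphore.acquire()

        def run():
            try:
                func(*args)
            finally:
                # Plays the role of discard(): free a slot when done.
                self._semaphore.release()

        worker = threading.Thread(target=run)
        worker.start()
        return worker


active = 0
peak = 0
lock = threading.Lock()


def do_something(number):
    """Record how many workers are running at the same time."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)
    with lock:
        active -= 1


pool = TinyPool(5)
jobs = [pool.spawn(do_something, n) for n in range(20)]
for job in jobs:
    job.join()
print("peak concurrency:", peak)
```

The sixth spawn() call does not return until one of the first five workers has finished, so the recorded peak never exceeds the pool size.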
I traced an Oracle process and found that it first opens the file /etc/netconfig as file handle 11, then duplicates it as 256 by calling fcntl with the parameter F_DUPFD, and then closes the original file handle 11. Later it reads using file handle 256. So what is the point of duplicating the file handle? Why not just work with the original file handle?
12931: 0.0006 open("/etc/netconfig", O_RDONLY|O_LARGEFILE) = 11
12931: 0.0002 fcntl(11, F_DUPFD, 0x00000100) = 256
12931: 0.0001 close(11) = 0
12931: 0.0002 read(256, " # p r a g m a i d e n".., 1024) = 1024
12931: 0.0003 read(256, " t s t p i _ c".., 1024) = 215
12931: 0.0002 read(256, 0x106957054, 1024) = 0
12931: 0.0001 lseek(256, 0, SEEK_SET) = 0
12931: 0.0002 read(256, " # p r a g m a i d e n".., 1024) = 1024
12931: 0.0003 read(256, " t s t p i _ c".., 1024) = 215
12931: 0.0003 read(256, 0x106957054, 1024) = 0
12931: 0.0001 close(256) = 0
On some systems, like Solaris, standard I/O with FILE only works with file descriptors 0-255 because its implementation of the FILE structure stores the descriptor in an 8-bit integer instead of an int. If a program uses a lot of file descriptors, it's useful to keep file descriptors 3-255 free for stdio by moving other files out of that range with fcntl(fd, F_DUPFD, 256). Otherwise, functions like fopen(), freopen() and fdopen() will fail once the low 256 descriptors are all in use.
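The open / F_DUPFD / close sequence from the trace can be reproduced in a few lines. This is only a sketch in Python of the general technique, not what Oracle or the Solaris libraries actually run: move_fd_high is a hypothetical helper name, and /dev/null is used just to have something to open.

```python
import fcntl
import os


def move_fd_high(fd, minimum=256):
    """Duplicate fd onto the lowest free descriptor >= minimum and
    close the original. Falls back to fd itself if the call fails."""
    try:
        high_fd = fcntl.fcntl(fd, fcntl.F_DUPFD, minimum)
    except OSError:
        # For example EINVAL when `minimum` is above the descriptor limit.
        return fd
    os.close(fd)
    return high_fd


fd = move_fd_high(os.open(os.devnull, os.O_RDONLY))
data = os.read(fd, 16)  # reading /dev/null returns an empty result
os.close(fd)
```

The low descriptor is released straight away, so it stays available for code that can only cope with small descriptor numbers.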
As an aside, they're file descriptors rather than file handles. The latter are a C feature used with fopen and its brethren while descriptors are more UNIXy, for use with open et al.
Interesting. The only reason that comes to mind is that some other piece of code has a specific need for the file descriptor to be 256. I suspect only Oracle would know the bizarre reasons for that. In any case, you're not guaranteed to get 256; you get the first available file descriptor greater than or equal to that number.
From a bit of investigation (I don't know every little thing about the innards of UNIX off the top of my head), there are attributes that belong to a group of duplicated descriptors such as file position and access mode. There are other attributes that belong to a single file descriptor, even when duplicated, such as the close-on-exec flag in GNULib.
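The split between shared and per-descriptor attributes is easy to check. Here is a small sketch in Python, assuming a POSIX system; the temporary file and the variable names are just for the demo. It shows that a duplicate shares the file position with the original, while the close-on-exec setting (exposed in Python as the "inheritable" flag) can differ between the two.

```python
import os
import tempfile

with tempfile.TemporaryFile() as f:
    f.write(b"0123456789")
    f.flush()
    fd = f.fileno()
    dup_fd = os.dup(fd)

    # File position belongs to the shared open file, so moving it
    # through one descriptor is visible through the other.
    os.lseek(fd, 4, os.SEEK_SET)
    shared_offset = os.lseek(dup_fd, 0, os.SEEK_CUR)

    # Close-on-exec belongs to each descriptor on its own: clearing it
    # on the duplicate leaves the original untouched.
    os.set_inheritable(dup_fd, True)
    flags = (os.get_inheritable(fd), os.get_inheritable(dup_fd))

    os.close(dup_fd)

print(shared_offset, flags)
```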
Doing a duplicate (either with dup, dup2 or your fcntl) could be a way to create two descriptors, one with different file descriptor attributes, but I can't see that being the case in your question since the first descriptor is closed anyway. As you say, why not just use the low descriptor?
Interestingly enough, if you google for netconfig f_dupfd, you will see similar traces where the fcntl fails and the process continues to read that file with the low descriptor. So my thought on the matter is that this is an attempt to preserve low file descriptors as much as possible. For example:
4327: open("/etc/netconfig", O_RDONLY|O_LARGEFILE) = 4
4327: fcntl(4, F_DUPFD, 0x00000100) Err#22 EINVAL
4327: read(4, " # p r a g m a i d e n".., 1024) = 1024
4327: read(4, " t s t p i _ c".., 1024) = 215
4327: read(4, 0x00296B80, 1024) = 0
4327: lseek(4, 0, SEEK_SET) = 0
4327: read(4, " # p r a g m a i d e n".., 1024) = 1024
4327: read(4, " t s t p i _ c".., 1024) = 215
4327: read(4, 0x00296B80, 1024) = 0
4327: close(4) = 0
Maybe the software has a byte array of file descriptors somewhere that's limited so it attempts to move other files above the 255-limit.
But really, that's just guesswork on my part (although I'd like to think it's relatively intelligent guesswork). Also keep in mind that it may not be Oracle itself doing this. The netconfig stuff is used in a lot of places so it may be some underlying library doing that, especially in light of the fact that most of the afore-mentioned web hits weren't Oracle-specific (ftp, remsh and so on).
Here is another example when a technique of reserving low-numbered file descriptors is needed.
Assume that a process opens a large number of file descriptors, e.g. it accepts more than 1024 simultaneous socket connections. At the same time, the process also uses a third-party library that opens socket connections and uses select() to see if sockets are ready for reading or writing. Additionally, the third-party library was compiled with __FD_SETSIZE set to 1024 (the default value).
If the library opens a socket when all file descriptors below 1024 are in use, then it will get a descriptor that select() and the associated FD_* macros cannot cope with. This will result in the process crashing or in undefined behaviour.