Meaning of the benchmark variables in snakemake - psutil

I included a benchmark directive to some of the rules in my snakemake workflow, and the resulting files have the following header:
s h:m:s max_rss max_vms max_uss max_pss io_in io_out mean_load
The only documentation I've found mentions a "benchmark txt file (which will contain a tab-separated table of run times and memory usage in MiB)".
I can guess that columns 1 and 2 are two different ways of displaying the time taken to execute the rule (in seconds, and converted to hours, minutes and seconds).
io_in and io_out are likely related to disk read and write activity, but in what units are they measured?
What are the others? Is this documented somewhere?
Edit: Looking at the source code
I've found the following piece of code in /snakemake/benchmark.py, that might well be where the benchmark data come from:
def _update_record(self):
    """Perform the actual measurement"""
    # Memory measurements
    rss, vms, uss, pss = 0, 0, 0, 0
    # I/O measurements
    io_in, io_out = 0, 0
    # CPU seconds
    cpu_seconds = 0
    # Iterate over process and all children
    try:
        main = psutil.Process(self.pid)
        this_time = time.time()
        for proc in chain((main,), main.children(recursive=True)):
            meminfo = proc.memory_full_info()
            rss += meminfo.rss
            vms += meminfo.vms
            uss += meminfo.uss
            pss += meminfo.pss
            ioinfo = proc.io_counters()
            io_in += ioinfo.read_bytes
            io_out += ioinfo.write_bytes
            if self.bench_record.prev_time:
                cpu_seconds += proc.cpu_percent() / 100 * (
                    this_time - self.bench_record.prev_time)
        self.bench_record.prev_time = this_time
        if not self.bench_record.first_time:
            self.bench_record.prev_time = this_time
        rss /= 1024 * 1024
        vms /= 1024 * 1024
        uss /= 1024 * 1024
        pss /= 1024 * 1024
        io_in /= 1024 * 1024
        io_out /= 1024 * 1024
    except psutil.Error as e:
        return
    # Update benchmark record's RSS and VMS
    self.bench_record.max_rss = max(self.bench_record.max_rss or 0, rss)
    self.bench_record.max_vms = max(self.bench_record.max_vms or 0, vms)
    self.bench_record.max_uss = max(self.bench_record.max_uss or 0, uss)
    self.bench_record.max_pss = max(self.bench_record.max_pss or 0, pss)
    self.bench_record.io_in = io_in
    self.bench_record.io_out = io_out
    self.bench_record.cpu_seconds += cpu_seconds
So this seems to come from functionality provided by psutil.

I will just leave this here for future reference.
Reading through the snakemake >= 6.0.0 benchmark module and psutil's memory_info(), memory_full_info(), io_counters() and cpu_times(), as previously suggested:
| colname | type (unit) | description |
| --- | --- | --- |
| s | float (seconds) | Running time in seconds |
| h:m:s | string (-) | Running time in hours:minutes:seconds format |
| max_rss | float (MB) | Maximum "Resident Set Size": the non-swapped physical memory the process has used |
| max_vms | float (MB) | Maximum "Virtual Memory Size": the total amount of virtual memory used by the process |
| max_uss | float (MB) | "Unique Set Size": the memory which is unique to a process and which would be freed if the process were terminated right now |
| max_pss | float (MB) | "Proportional Set Size": the amount of memory shared with other processes, accounted in a way that the amount is divided evenly between the processes that share it (Linux only) |
| io_in | float (MB) | The number of MB read (cumulative) |
| io_out | float (MB) | The number of MB written (cumulative) |
| mean_load | float (-) | CPU usage over time, divided by the total running time (first row) |
| cpu_time | float (-) | CPU time summed for user and system |
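To see where these numbers come from, here is a minimal psutil sketch. It is not Snakemake's own code: psutil.Process() with no argument inspects the current process, whereas Snakemake passes the benchmarked job's PID, and io_counters()/pss are not available on every platform.

import psutil

proc = psutil.Process()  # current process; Snakemake uses the job's PID

mem = proc.memory_full_info()            # rss, vms (and uss/pss where supported), in bytes
print("rss (MB):", mem.rss / 1024 / 1024)
print("vms (MB):", mem.vms / 1024 / 1024)

try:
    io = proc.io_counters()              # cumulative read_bytes / write_bytes
    print("io_in (MB):", io.read_bytes / 1024 / 1024)
    print("io_out (MB):", io.write_bytes / 1024 / 1024)
except (AttributeError, psutil.AccessDenied):
    pass                                 # e.g. macOS has no io_counters()

cpu = proc.cpu_times()                   # user + system CPU seconds -> cpu_time column
print("cpu_time (s):", cpu.user + cpu.system)

Snakemake samples these values repeatedly and keeps the maximum of the memory readings, which is why those columns are prefixed with max_.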

Benchmarking in snakemake could certainly be better documented, but psutil is documented here:
get_memory_info()
Return a tuple representing RSS (Resident Set Size) and VMS (Virtual Memory Size) in bytes.
On UNIX RSS and VMS are the same values shown by ps.
On Windows RSS and VMS refer to "Mem Usage" and "VM Size" columns of taskmgr.exe.
psutil.disk_io_counters(perdisk=False)
Return system disk I/O statistics as a namedtuple including the following attributes:
read_count: number of reads
write_count: number of writes
read_bytes: number of bytes read
write_bytes: number of bytes written
read_time: time spent reading from disk (in milliseconds)
write_time: time spent writing to disk (in milliseconds)
The code you found confirms that all the memory usage and IO counts are reported in MB (bytes divided by 1024 * 1024, i.e. MiB strictly speaking).
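Since the benchmark file is just a tab-separated table with the header shown above, it is easy to load for inspection, e.g. with pandas. The path below is hypothetical; use whatever path your rule's benchmark directive writes to.

import pandas as pd

# Hypothetical benchmark output written by a rule's benchmark directive.
bench = pd.read_csv("benchmarks/my_rule.txt", sep="\t")
print(bench[["s", "max_rss", "io_in", "io_out", "mean_load"]])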

Related

MATLAB: out of memory despite a suitable PC configuration

I'm stuck on a problem that I will describe in detail below. For 3 days I have tried a lot of different things, and none of them worked.
If anyone has an idea of what to do, please let me know!
Here is my error message:
the call to "ft_selectdata" took 0 seconds
preprocessing
Out of memory. Type "help memory" for your options.
Error in ft_preproc_dftfilter (line 187)
tmp = exp(2*1i*pi*freqs(:)*time); % complex sin and cos
Error in ft_preproc_dftfilter (line 144)
filt = ft_preproc_dftfilter(filt, Fs, Fl(i), 'dftreplace', dftreplace, 'dftbandwidth',
dftbandwidth(i), 'dftneighbourwidth', dftneighbourwidth(i)); % enumerate all options
Error in preproc (line 464)
dat = ft_preproc_dftfilter(dat, fsample, cfg.dftfreq, optarg{:});
Error in ft_preprocessing (line 375)
[dataout.trial{i}, dataout.label, dataout.time{i}, cfg] = preproc(data.trial{i}, data.label,
data.time{i}, cfg, begpadding, endpadding);
Error in EEG_Prosocial_script (line 101)
data_intpl = ft_preprocessing(cfg, allData_preprosses);
187 tmp = exp(2*1i*pi*freqs(:)*time); % complex sin and cos
Here is some information from MATLAB about my computer and about the characteristics of the computation:
Maximum possible array: 7406 MB (7.766e+09 bytes)*
Memory available for all arrays: 7406 MB (7.766e+09 bytes) *
Memory used by MATLAB: 4195 MB (4.398e+09 bytes)
Physical Memory (RAM): 12206 MB (1.280e+10 bytes)
Limited by System Memory (physical + swap file) available.
K>> whos
  Name         Size                Bytes  Class     Attributes
  freqs        7753x1              62024  double
  li           1x1                    16  double    complex
  time         1x1984512        15876096  double
Here is the configuration of the computer that failed to run the script (Alienware Aurora R4):
RAM: 4 GB free / 12 GB @ 1.6 GHz --> 2x (4 GB 1600 MHz) + 2x (2 GB 1600 MHz)
Intel Core i7-3820, 4 cores / 8 threads, 3.7 GHz, 1 CPU
NVIDIA GeForce GTX 690, 2 GB
RAM: Kingston KVT8FP HYC
Hard disk: Kingston SSD, 250 GB, SATA 3
This code works on this other computer (Dell Inspiron 14-500):
RAM: 4 GB DDR4 2666 MHz (4 GB x 1)
Intel Core i5-8265U, 8th generation (6 MB, 3.9 GHz)
Intel UHD Graphics 620
Hard disk: SATA 2.5", 500 GB, 5400 rpm
Thank you
Kind regards,
By doing freqs(:)*time you are trying to create a 7753x1984512 array, and you don't have the memory for that (you would need ~123 gigabytes, i.e. 1.2309e+11 bytes, whereas your computer has ~7e+09 available).
See for example:
f = rand(7,1);
t = rand(1,19);
size(f(:)*t)

ans =
     7    19
What you probably want to do is loop over the time elements (or over chunks of them) instead of forming the full matrix at once.
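For reference, a quick back-of-the-envelope check of that size, sketched here in Python since the arithmetic is the same in any language (8 bytes per double, before exp() even produces a complex result):

# Size of the outer product freqs(:) * time from the error above
rows, cols, bytes_per_double = 7753, 1_984_512, 8
total_bytes = rows * cols * bytes_per_double
print(total_bytes)                # 123087372288, i.e. ~1.23e+11 bytes
print(total_bytes / 1e9, "GB")    # ~123 GB, versus the ~7.4 GB MATLAB reports as available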

haproxy stats: qtime,ctime,rtime,ttime?

Running a web app behind HAProxy 1.6.3-1ubuntu0.1, I'm getting haproxy stats qtime,ctime,rtime,ttime values of 0,0,0,2704.
From the docs (https://www.haproxy.org/download/1.6/doc/management.txt):
58. qtime [..BS]: the average queue time in ms over the 1024 last requests
59. ctime [..BS]: the average connect time in ms over the 1024 last requests
60. rtime [..BS]: the average response time in ms over the 1024 last requests
(0 for TCP)
61. ttime [..BS]: the average total session time in ms over the 1024 last requests
I'm expecting response times in the 0-10 ms range. A ttime of 2704 milliseconds seems unrealistically high. Is it possible the units are off and this is 2704 microseconds rather than 2704 milliseconds?
Secondly, it seems suspicious that ttime isn't even close to qtime+ctime+rtime. Is total response time not the sum of the time to queue, connect, and respond? What is the other time, that is included in total but not queue/connect/response? Why can my response times be <1ms, but my total response times be ~2704 ms?
Here is my full csv stats:
$ curl "http://localhost:9000/haproxy_stats;csv"
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime,
http-in,FRONTEND,,,4707,18646,50000,5284057,209236612829,42137321877,0,0,997514,,,,,OPEN,,,,,,,,,1,2,0,,,,0,4,0,2068,,,,0,578425742,0,997712,22764,1858,,1561,3922,579448076,,,0,0,0,0,,,,,,,,
servers,server1,0,0,0,4337,20000,578546476,209231794363,41950395095,,0,,22861,1754,95914,0,no check,1,1,0,,,,,,1,3,1,,578450562,,2,1561,,6773,,,,0,578425742,0,198,0,0,0,,,,29,1751,,,,,0,,,0,0,0,2704,
servers,BACKEND,0,0,0,5919,5000,578450562,209231794363,41950395095,0,0,,22861,1754,95914,0,UP,1,1,0,,0,320458,0,,1,3,0,,578450562,,1,1561,,3922,,,,0,578425742,0,198,22764,1858,,,,,29,1751,0,0,0,0,0,,,0,0,0,2704,
stats,FRONTEND,,,2,5,2000,5588,639269,8045341,0,0,29,,,,,OPEN,,,,,,,,,1,4,0,,,,0,1,0,5,,,,0,5374,0,29,196,0,,1,5,5600,,,0,0,0,0,,,,,,,,
stats,BACKEND,0,0,0,1,200,196,639269,8045341,0,0,,196,0,0,0,UP,0,0,0,,0,320458,0,,1,4,0,,0,,1,0,,5,,,,0,0,0,0,196,0,,,,,0,0,0,0,0,0,0,,,0,0,0,0,
In HAProxy >= 2 you now get two values, "n / n", which are the max within a sliding window and the average for that window. The max value remains the max across all sample windows until a higher value is found. On 1.8 you only get the average.
Example of HAProxy 2 vs 1.8 (note these proxies are used very differently and with dramatically different loads).
So it looks like the average response times, at least since the last reboot, are 66 ms and 275 ms.
The average is computed as:
(data time + cumulative HTTP connections - 1) / cumulative HTTP connections
This might not be a perfect analysis so if anyone has improvements it'd be appreciated. This is meant to show how I came to the answer above so you can use it to gather more insight into the other counters you asked about. Most of this information was gathered from reading stats.c. The counters you asked about are defined here.
unsigned int q_time, c_time, d_time, t_time; /* sums of conn_time, queue_time, data_time, total_time */
unsigned int qtime_max, ctime_max, dtime_max, ttime_max; /* maximum of conn_time, queue_time, data_time, total_time observed */
The stats page values are built from this code:
if (strcmp(field_str(stats, ST_F_MODE), "http") == 0)
    chunk_appendf(out, "<tr><th>- Responses time:</th><td>%s / %s</td><td>ms</td></tr>",
                  U2H(stats[ST_F_RT_MAX].u.u32), U2H(stats[ST_F_RTIME].u.u32));
chunk_appendf(out, "<tr><th>- Total time:</th><td>%s / %s</td><td>ms</td></tr>",
              U2H(stats[ST_F_TT_MAX].u.u32), U2H(stats[ST_F_TTIME].u.u32));
You asked about all the counters, but I'll focus on one. As can be seen in the snippet above for "Responses time:", ST_F_RT_MAX and ST_F_RTIME are the values displayed on the stats page as n (rtime_max) / n (rtime) respectively. These are defined as follows:
[ST_F_RT_MAX] = { .name = "rtime_max", .desc = "Maximum observed time spent waiting for a server response, in milliseconds (backend/server)" },
[ST_F_RTIME] = { .name = "rtime", .desc = "Time spent waiting for a server response, in milliseconds, averaged over the 1024 last requests (backend/server)" },
These set a "metric" value (among other things) in a case statement further down in the code:
case ST_F_RT_MAX:
    metric = mkf_u32(FN_MAX, sv->counters.dtime_max);
    break;
case ST_F_RTIME:
    metric = mkf_u32(FN_AVG, swrate_avg(sv->counters.d_time, srv_samples_window));
    break;
These metric values give us a good look at what the stats page is telling us. The first value in "Responses time: 0 / 0", ST_F_RT_MAX, is the maximum time spent waiting. The second value, ST_F_RTIME, is the average time taken per connection. These are the max and average taken within a window of time, i.e. however long it takes for you to get 1024 connections.
For example, "Responses time: 10000 / 20" means:
max time spent waiting (max value ever reached, including HTTP keepalive time) over the last 1024 connections: 10 seconds
average time over the last 1024 connections: 20 ms
So for all intents and purposes
rtime_max = dtime_max
rtime = swrate_avg(d_time, srv_samples_window)
Which raises the question: what are dtime_max, d_time and srv_samples_window? These are the data-time counters and the sampling window; I couldn't actually figure out how these time values are being set, but at face value it's "some time" for the last 1024 connections. As pointed out here, keepalive times are included in the max totals, which is why the numbers are high.
Now that we know ST_F_RT_MAX is a max value and ST_F_RTIME is an average, an average of what?
/* compute time values for later use */
if (selected_field == NULL || *selected_field == ST_F_QTIME ||
    *selected_field == ST_F_CTIME || *selected_field == ST_F_RTIME ||
    *selected_field == ST_F_TTIME) {
    srv_samples_counter = (px->mode == PR_MODE_HTTP) ? sv->counters.p.http.cum_req : sv->counters.cum_lbconn;
    if (srv_samples_counter < TIME_STATS_SAMPLES && srv_samples_counter > 0)
        srv_samples_window = srv_samples_counter;
}
The TIME_STATS_SAMPLES value is defined as:
#define TIME_STATS_SAMPLES 512
unsigned int srv_samples_window = TIME_STATS_SAMPLES;
In HTTP mode, srv_samples_counter is sv->counters.p.http.cum_req. http.cum_req is defined as ST_F_REQ_TOT.
[ST_F_REQ_TOT] = { .name = "req_tot", .desc = "Total number of HTTP requests processed by this object since the worker process started" },
For example, if the value of http.cum_req is 10, then srv_samples_counter will be 10. The sample count appears to be the number of successful requests for a given sample window for a given backend's server. d_time (data time) is passed as "sum" and is computed as some non-negative value, or it's counted as an error. I thought I found the code for how d_time is created, but I wasn't sure, so I haven't included it.
/* Returns the average sample value for the sum <sum> over a sliding window of
* <n> samples. Better if <n> is a power of two. It must be the same <n> as the
* one used above in all additions.
*/
static inline unsigned int swrate_avg(unsigned int sum, unsigned int n)
{
    return (sum + n - 1) / n;
}
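To make the averaging concrete, here is the same formula transcribed to Python. The window sum below is a hypothetical number chosen only so the result matches the 2704 ms ttime seen in the CSV above:

TIME_STATS_SAMPLES = 512  # haproxy's default sliding-window size

def swrate_avg(window_sum, n):
    # Average of a sliding-window sum over n samples, as in haproxy's swrate_avg().
    return (window_sum + n - 1) // n

# Hypothetical: if the windowed sum of total session time (t_time) were
# 1,384,448 ms over a full 512-sample window, the reported ttime would be:
print(swrate_avg(1_384_448, TIME_STATS_SAMPLES))  # -> 2704 (ms)

Since ttime covers the whole session, a large value relative to qtime+ctime+rtime is consistent with the earlier point about keepalive time being included in the totals.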

MongoDB Concurrency Bottleneck

Too Long; Didn't Read
The question is about a concurrency bottleneck I am experiencing on MongoDB. If I make one query, it takes 1 unit of time to return; if I make 2 concurrent queries, both take 2 units of time to return; generally, if I make n concurrent queries, all of them take n units of time to return. My question is about what can be done to improve Mongo's response times when faced with concurrent queries.
The Setup
I have a m3.medium instance on AWS running a MongoDB 2.6.7 server. A m3.medium has 1 vCPU (1 core of a Xeon E5-2670 v2), 3.75GB and a 4GB SSD.
I have a database with a single collection named user_products. A document in this collection has the following structure:
{ user: <int>, product: <int> }
There are 1000 users and 1000 products and there's a document for every user-product pair, totaling a million documents.
The collection has an index { user: 1, product: 1 } and my results below are all indexOnly.
The Test
The test was executed from the same machine where MongoDB is running. I am using the benchRun function provided with Mongo. During the tests, no other accesses to MongoDB were being made and the tests only comprise read operations.
For each test, a number of concurrent clients is simulated, each of them making a single query as many times as possible until the test is over. Each test runs for 10 seconds. The concurrency is tested in powers of 2, from 1 to 128 simultaneous clients.
The command to run the tests:
mongo bench.js
Here's the full script (bench.js):
var
    seconds = 10,
    limit = 1000,
    USER_COUNT = 1000,
    concurrency,
    savedTime,
    res,
    timediff,
    ops,
    results,
    docsPerSecond,
    latencyRatio,
    currentLatency,
    previousLatency;

ops = [
    {
        op: "find",
        ns: "test_user_products.user_products",
        query: {
            user: { "#RAND_INT": [0, USER_COUNT - 1] }
        },
        limit: limit,
        fields: { _id: 0, user: 1, product: 1 }
    }
];

for (concurrency = 1; concurrency <= 128; concurrency *= 2) {
    savedTime = new Date();
    res = benchRun({
        parallel: concurrency,
        host: "localhost",
        seconds: seconds,
        ops: ops
    });
    timediff = new Date() - savedTime;
    docsPerSecond = res.query * limit;
    currentLatency = res.queryLatencyAverageMicros / 1000;
    if (previousLatency) {
        latencyRatio = currentLatency / previousLatency;
    }
    results = [
        savedTime.getFullYear() + '-' + (savedTime.getMonth() + 1).toFixed(2) + '-' + savedTime.getDate().toFixed(2),
        savedTime.getHours().toFixed(2) + ':' + savedTime.getMinutes().toFixed(2),
        concurrency,
        res.query,
        currentLatency,
        timediff / 1000,
        seconds,
        docsPerSecond,
        latencyRatio
    ];
    previousLatency = currentLatency;
    print(results.join('\t'));
}
Results
Results are always looking like this (some columns of the output were omitted to facilitate understanding):
concurrency queries/sec avg latency (ms) latency ratio
1 459.6 2.153609008 -
2 460.4 4.319577324 2.005738882
4 457.7 8.670418178 2.007237636
8 455.3 17.4266174 2.00989353
16 450.6 35.55693474 2.040380754
32 429 74.50149883 2.09527338
64 419.2 153.7325095 2.063482104
128 403.1 325.2151235 2.115460969
If only 1 client is active, it is capable of doing about 460 queries per second over the 10 second test. The average response time for a query is about 2 ms.
When 2 clients are concurrently sending queries, the query throughput stays at about 460 queries per second, showing that Mongo hasn't increased its response throughput. The average latency, on the other hand, doubled.
For 4 clients, the pattern continues. Same query throughput, average latency doubles in relation to 2 clients running. The column latency ratio is the ratio between the current and previous test's average latency. See that it always shows the latency doubling.
Update: More CPU Power
I decided to test with different instance types, varying the number of vCPUs and the amount of available RAM. The purpose is to see what happens when you add more CPU power. Instance types tested:
Type vCPUs RAM(GB)
m3.medium 1 3.75
m3.large 2 7.5
m3.xlarge 4 15
m3.2xlarge 8 30
Here are the results:
m3.medium
concurrency queries/sec avg latency (ms) latency ratio
1 459.6 2.153609008 -
2 460.4 4.319577324 2.005738882
4 457.7 8.670418178 2.007237636
8 455.3 17.4266174 2.00989353
16 450.6 35.55693474 2.040380754
32 429 74.50149883 2.09527338
64 419.2 153.7325095 2.063482104
128 403.1 325.2151235 2.115460969
m3.large
concurrency queries/sec avg latency (ms) latency ratio
1 855.5 1.15582069 -
2 947 2.093453854 1.811227185
4 961 4.13864589 1.976946318
8 958.5 8.306435055 2.007041742
16 954.8 16.72530889 2.013536347
32 936.3 34.17121062 2.043083977
64 927.9 69.09198599 2.021935563
128 896.2 143.3052382 2.074122435
m3.xlarge
concurrency queries/sec avg latency (ms) latency ratio
1 807.5 1.226082735 -
2 1529.9 1.294211452 1.055566166
4 1810.5 2.191730848 1.693487447
8 1816.5 4.368602642 1.993220402
16 1805.3 8.791969257 2.01253581
32 1770 17.97939718 2.044979532
64 1759.2 36.2891598 2.018374668
128 1720.7 74.56586511 2.054769676
m3.2xlarge
concurrency queries/sec avg latency (ms) latency ratio
1 836.6 1.185045183 -
2 1585.3 1.250742872 1.055438974
4 2786.4 1.422254414 1.13712774
8 3524.3 2.250554777 1.58238551
16 3536.1 4.489283844 1.994745425
32 3490.7 9.121144097 2.031759277
64 3527 18.14225682 1.989033023
128 3492.9 36.9044113 2.034168718
Starting with the xlarge type, we begin to see it finally handling 2 concurrent queries while keeping the query latency virtually the same (1.29 ms). It doesn't last too long, though, and for 4 clients it again doubles the average latency.
With the 2xlarge type, Mongo is able to keep handling up to 4 concurrent clients without raising the average latency too much. After that, it starts to double again.
The question is: what could be done to improve Mongo's response times with respect to the concurrent queries being made? I expected to see a rise in the query throughput, and I did not expect to see the average latency double. It clearly shows Mongo is not able to parallelize the queries that are arriving.
There's some kind of bottleneck somewhere limiting Mongo, but it certainly doesn't help to keep adding more CPU power, since the cost would be prohibitive. I don't think memory is an issue here, since my entire test database fits in RAM easily. Is there something else I could try?
You're using a server with 1 core and you're using benchRun. From the benchRun page:
This benchRun command is designed as a QA baseline performance measurement tool; it is not designed to be a "benchmark".
The scaling of the latency with the concurrency numbers is suspiciously exact. Are you sure the calculation is correct? I could believe that the ops/sec/runner was staying the same, with the latency/op also staying the same, as the number of runners grew - and then if you added all the latencies, you would see results like yours.
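One way to sanity-check that interpretation is to predict the average latency from the measured throughput alone: if the server is simply time-slicing a fixed total throughput, latency should be roughly concurrency / throughput. A small sketch using the m3.medium numbers copied from the question:

# (concurrency, measured queries/sec, measured avg latency in ms) from the m3.medium table
measurements = [
    (1, 459.6, 2.15),
    (2, 460.4, 4.32),
    (8, 455.3, 17.43),
    (64, 419.2, 153.73),
    (128, 403.1, 325.22),
]

for clients, qps, measured_ms in measurements:
    predicted_ms = clients / qps * 1000   # expected latency if total throughput is fixed
    print(f"{clients:4d} clients: predicted {predicted_ms:7.1f} ms, measured {measured_ms:7.1f} ms")

The close agreement suggests the reported latency is simply concurrency divided by a flat total throughput on a single core, rather than evidence of some additional bottleneck.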

luajit qsort callback example memory leak

I have the following qsort example to try out callbacks in luajit. However it has a memory leak (luajit: not enough memory when executing) which is not obvious to me.
Can somebody give me some hints on how to create a proper callback example?
local ffi = require("ffi")
-- ===============================================================================
ffi.cdef[[
void qsort(void *base, size_t nel, size_t width, int (*compar)(const void *, const void *));
]]

function compare(a, b)
    return a[0] - b[0]
end
-- ===============================================================================
-- Explicitly convert to a callback via cast
local callback = ffi.cast("int (*)(const char *, const char *)", compare)
local data = "efghabcd"
local size = 8
local loopSize = 1000 * 1000 * 100
local bytes = ffi.new("char[15]")
-- ===============================================================================
for i = 1, loopSize do
    ffi.copy(bytes, data, size)
    ffi.C.qsort(bytes, size, 1, callback)
end
Platform: OSX 10.8
luajit: 2.0.1
The problem appears to be that Lua never gets a chance to perform a full garbage-collection cycle inside the tight loop. As hinted in the comments, you can correct this by calling collectgarbage() yourself inside the loop.
Note that calling collectgarbage() on every iteration will impact the running time of whatever you're benchmarking. To minimize this, you should set a threshold to limit how often collectgarbage() gets called:
local memthreshold = 2 ^ 20 / 1024
local start = os.clock()
for i = 1, loopSize do
    ffi.copy(bytes, data, size)
    ffi.C.qsort(bytes, size, 1, callback)
    if collectgarbage'count' > memthreshold then
        collectgarbage()
    end
end
local elapse = os.clock() - start
print("elapsed:", elapse..'s')

what's the purpose of fcntl with parameter F_DUPFD

I traced an oracle process, and find it first open a file /etc/netconfig as file handle 11, and then duplicate it as 256 by calling fcntl with parameter F_DUPFD, and then close the original file handle 11. Later it read using file handle 256. So what's the point to duplicate the file handle? Why not just work on the original file handle?
12931: 0.0006 open("/etc/netconfig", O_RDONLY|O_LARGEFILE) = 11
12931: 0.0002 fcntl(11, F_DUPFD, 0x00000100) = 256
12931: 0.0001 close(11) = 0
12931: 0.0002 read(256, " # p r a g m a i d e n".., 1024) = 1024
12931: 0.0003 read(256, " t s t p i _ c".., 1024) = 215
12931: 0.0002 read(256, 0x106957054, 1024) = 0
12931: 0.0001 lseek(256, 0, SEEK_SET) = 0
12931: 0.0002 read(256, " # p r a g m a i d e n".., 1024) = 1024
12931: 0.0003 read(256, " t s t p i _ c".., 1024) = 215
12931: 0.0003 read(256, 0x106957054, 1024) = 0
12931: 0.0001 close(256) = 0
On some systems, like Solaris, standard I/O with FILE only works with file descriptors 0-255, because the implementation of the FILE structure stores the descriptor in an 8-bit integer instead of an int. If a program uses a lot of file descriptors, it's useful to keep descriptors 3-255 free by moving long-lived ones above that range with fcntl(fd, F_DUPFD, 256). Otherwise, functions like fopen(), freopen() and fdopen() will fail once 256 files are open.
As an aside, they're file descriptors rather than file handles. The latter are a C feature used with fopen and its brethren while descriptors are more UNIXy, for use with open et al.
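For what it's worth, the relocation trick is easy to reproduce from Python's fcntl module on a Unix system (a small sketch, not what Oracle actually does internally):

import fcntl
import os

# Path taken from the question's trace; substitute any readable file on systems
# that have no /etc/netconfig.
low_fd = os.open("/etc/netconfig", os.O_RDONLY)

# Duplicate onto the lowest free descriptor >= 256, then drop the low one,
# keeping descriptors below 256 free for code that can only handle small fds.
high_fd = fcntl.fcntl(low_fd, fcntl.F_DUPFD, 256)
os.close(low_fd)

print("reading via fd", high_fd, "->", len(os.read(high_fd, 1024)), "bytes")
os.close(high_fd)

If the process's descriptor limit is below 257, the F_DUPFD call fails with EINVAL, which matches the Err#22 trace quoted in another answer below.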
Interesting. The only reason that comes to mind is that some other piece of code has a specific need for the file descriptor to be 256. I suspect only Oracle would know the bizarre reasons for that. In any case, you're not guaranteed to get 256; you get the first available file descriptor greater than or equal to that number.
From a bit of investigation (I don't know every little thing about the innards of UNIX off the top of my head), there are attributes that belong to a group of duplicated descriptors such as file position and access mode. There are other attributes that belong to a single file descriptor, even when duplicated, such as the close-on-exec flag in GNULib.
Doing a duplicate (either with dup, dup2 or your fcntl) could be a way to create two descriptors, one with different file descriptor attributes, but I can't see that being the case in your question since the first descriptor is closed anyway. As you say, why not just use the low descriptor?
Interestingly enough, if you google for netconfig f_dupfd, you will see similar traces where the fcntl fails and it continues to read that file with the low descriptor so my thoughts on the matter are that this is an attempt to preserve low file descriptors as much as possible. For example:
4327: open("/etc/netconfig", O_RDONLY|O_LARGEFILE) = 4
4327: fcntl(4, F_DUPFD, 0x00000100) Err#22 EINVAL
4327: read(4, " # p r a g m a i d e n".., 1024) = 1024
4327: read(4, " t s t p i _ c".., 1024) = 215
4327: read(4, 0x00296B80, 1024) = 0
4327: lseek(4, 0, SEEK_SET) = 0
4327: read(4, " # p r a g m a i d e n".., 1024) = 1024
4327: read(4, " t s t p i _ c".., 1024) = 215
4327: read(4, 0x00296B80, 1024) = 0
4327: close(4) = 0
Maybe the software has a byte-sized (and therefore limited) array of file descriptors somewhere, so it attempts to move other files above the 255 limit.
But really, that's just guesswork on my part (although I'd like to think it's relatively intelligent guesswork). Also keep in mind that it may not be Oracle itself doing this. The netconfig stuff is used in a lot of places so it may be some underlying library doing that, especially in light of the fact that most of the afore-mentioned web hits weren't Oracle-specific (ftp, remsh and so on).
Here is another example of when the technique of reserving low-numbered file descriptors is needed.
Assume that a process opens a large number of file descriptors, e.g. it accepts more than 1024 simultaneous socket connections. At the same time, the process also uses a third-party library that opens socket connections and uses select() to see if the sockets are ready for reading or writing. Additionally, the third-party library was compiled with __FD_SETSIZE set to 1024 (the default value).
If the library opens a socket when all file descriptors below 1024 are in use, it will get a descriptor that select() and the associated FD_* macros cannot cope with. This will result in the process crashing or in undefined behaviour.