client backend VS background writer - postgresql

I know the difference between background writer, checkpointer, walwriter, and parallel_worker:
wal writer vs bg writer vs checkpointer
client_backend vs parallel_worker
But I am confused about the client backend. I used opensnoop and trace and found that the client backend usually writes (pwrite(2)) something to both the data files and the WAL files.
PID COMM FD ERR PATH
18365 (client backend) postgres 51 0 base/16384/16403
18365 (client backend) postgres 52 0 pg_wal/0000000100000001000000E1
Question
What is the client backend responsible for?
Is there at most one client backend per connection?

The client backend is the server process that communicates with the client and performs SQL statements on the client's behalf. There is one client backend process for each database session.
However, the client backend process normally should not have to write to the data files: the backend writes table data only to shared buffers (the cache), and the "checkpointer" and "background writer" processes do most of the actual writing to disk. The design idea here is that the client backend should not be bothered with these potentially slow tasks, but should be free to process SQL statements as quickly as possible. WAL is written by the backend process when a transaction is committed; otherwise, the WAL writer process takes care of that. So it is normal that you see these writes by a backend.
If the background writer is not fast enough to cope with the workload, it can happen that the backend process has to do some of the writing. This should not happen regularly, but only during spikes in data modification activity. Often, that is an indication that you can improve the performance of your SQL statements if you increase the size of shared buffers.
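If you want to check whether backends are being forced to write data files themselves, the cumulative counter buffers_backend in pg_stat_bgwriter counts buffers written directly by backends, and you can compare it with the checkpointer's and background writer's counters. A minimal sketch of such a check, assuming the postgresql-simple library and a local server (neither is part of the question):

{-# LANGUAGE OverloadedStrings #-}
import Database.PostgreSQL.Simple

-- rough sketch: if buffers_backend grows quickly relative to buffers_clean and
-- buffers_checkpoint, backends are writing dirty buffers themselves instead of
-- leaving that work to the background writer and checkpointer
main :: IO ()
main = do
  conn <- connectPostgreSQL "dbname=postgres"
  stats <- query_ conn "SELECT buffers_checkpoint, buffers_clean, buffers_backend FROM pg_stat_bgwriter" :: IO [(Int, Int, Int)]
  print stats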

Related

Are the changes of a write transaction in ClientA immediately visible to a ClientB read, started after COMMIT?

We are observing some behaviours/errors in some of our workflows, related to the consistency and visibility of a Postgres write transaction followed by a read. One of our developers offered an explanation, but I could not find any search results documenting the proposed reasoning.
Given a single Postgres 10.3 host, the following operations take place:
ClientA performs a successful write transaction
After the COMMIT, an external notification is emitted
ClientB reacts to external notification and performs a read, only to find that the UPDATE transaction changes are not visible
The explanation that was proposed is that two postgres client connections on different threads don't have a guaranteed view snapshot and may not immediately observe the write transaction update after the commit. But from what I have read, I would expect that after the COMMIT has succeeded, a read operation then starting in response should see the effects of that write.
My specific question is: given two database client connections on different threads, is it possible for there to be a race condition in which one client does not see the effects of a write transaction, even though its read starts AFTER the other client has committed? (There are no overlapping transactions.)
Every bit of documentation I have found thus far only refers to concerns about overlapping/concurrent transactions and the MVCC/transaction isolation topics. Nothing about a synchronised serial operation between two different client connections.
Edit: Some extra details about the configuration.
ClientA and ClientB would be different threads accessing postgres through a connection pool. Clients may both be in the same connection pool on the same application server, or it may be ClientA/ApplicationA and ClientB/ApplicationB.
When ClientB reacts, it will access the existing Application server connection pool to make a new read.
No, that cannot happen, unless the reading transaction started earlier and is running at the REPEATABLE READ or SERIALIZABLE isolation level.
There is also the possibility that the reading transaction does not connect to the same server as the writing transaction, but to a streaming replication standby server with hot_standby enabled. Then this can easily happen, even with synchronous replication (unless you set synchronous_commit = remote_apply).
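To make the first caveat concrete, here is a small sketch (assuming the postgresql-simple library and a pre-existing table t(id int, val int), neither of which appears in the question) of the one in-database case where a read issued after the commit can still return old data: the reading transaction took its snapshot earlier, at REPEATABLE READ.

{-# LANGUAGE OverloadedStrings #-}
import Database.PostgreSQL.Simple
import Database.PostgreSQL.Simple.Transaction (beginLevel, commit, IsolationLevel (RepeatableRead))

main :: IO ()
main = do
  writer <- connectPostgreSQL "dbname=test"
  reader <- connectPostgreSQL "dbname=test"
  beginLevel RepeatableRead reader
  _   <- query_ reader "SELECT val FROM t WHERE id = 1" :: IO [Only Int]  -- snapshot is taken here
  _   <- execute_ writer "UPDATE t SET val = val + 1 WHERE id = 1"        -- autocommit: committed now
  old <- query_ reader "SELECT val FROM t WHERE id = 1" :: IO [Only Int]  -- still sees the old value
  commit reader
  new <- query_ reader "SELECT val FROM t WHERE id = 1" :: IO [Only Int]  -- a fresh snapshot sees the update
  print (old, new)

A reader at the default READ COMMITTED level, or one that starts its transaction after the commit, always sees the committed change.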

How does an application server handle multiple requests to save data into a table

I have created a web application in JSF, and it has a button.
When the button is clicked, the request goes to the server side and executes the function below to save the data in a table; I am using MyBatis for this.
public void save(A a) {
    SqlSession session = null;
    try {
        session = SqlConnection.getInstance().openSession();
        TestMapper testmap = session.getMapper(TestMapper.class);
        testmap.insert(a);
        session.commit();
    } catch (Exception e) {
        e.printStackTrace(); // at least log the failure instead of silently swallowing it
    } finally {
        if (session != null) { // openSession() may have failed before session was assigned
            session.close();
        }
    }
}
Now I have deployed this application on an application server, JBoss (WildFly).
As per my understanding, when multiple users try to access the application by hitting the URL, the application server creates a thread for each user request.
For example, if 4 clients make requests, then 4 threads will be created: t1, t2, t3 and t4.
If all 4 users hit the save button at the same time, how will the save method be executed? If t1 enters the method and executes the insert statement to insert data into the table, do t2, t3 and t4 wait, or do all 4 threads execute the insert and insert data simultaneously?
To provide some context, I will first describe two possible approaches to handling requests. In this case the protocol is HTTP, but the approaches do not depend on the protocol; the important point is that requests come from the network, and that executing them requires some IO (access to the filesystem, a database, or network calls to other systems). Note that the following description is somewhat simplified.
These two approaches are:
synchronous
asynchronous
In general, processing a typical HTTP request that involves DB access requires at least four IO operations:
the request handler needs to read the request data from the client socket
the request handler needs to write the request to the socket connected to the DB
the request handler needs to read the response from the DB socket
the request handler needs to write the response to the client socket
Let's see how this is done for both cases.
Synchronous
In this approach the server has a pool (think a collection) of threads that are ready to serve a request.
When a request comes in, the server borrows a thread from the pool and executes the request handler in that thread.
When the request handler needs to do an IO operation, it initiates the IO operation and then waits for its completion. By wait I mean that the thread's execution is blocked until the IO operation completes and the data (for example, the response with the results of the SQL query) is available.
In this case, concurrency, that is, processing requests for multiple clients simultaneously, is achieved by having some number of threads in the pool. IO operations are much slower than the CPU, so most of the time a thread processing some request is blocked on an IO operation, and the CPU cores can execute stages of request processing for other clients.
Note that because of the slowness of IO operations, the thread pool used for handling HTTP requests is usually fairly large. The documentation for the synchronous request processing subsystem used in WildFly suggests about 10 threads per CPU core as a reasonable value.
Asynchronous
In this case the IO is handled differently. There is a small number of threads handling IO. They all work the same way and I'll describe one of them.
Such a thread runs a loop which basically waits for events, and every time an event happens it calls the handler for that event.
The first such event is a new request. When request processing starts, the request handler is invoked from the loop that is run by one of the IO threads. The first thing the request handler does is try to read the request from the client socket, so it initiates an IO operation on the client socket and returns control to the caller. That means the thread is released and can process another event.
Another event happens when the IO operation that reads from the client socket has some data available. The loop then invokes the handler at the point where it previously returned control after initiating the IO; the handler resumes at the next step, processes the input data (for example, parses the HTTP parameters) and initiates a new IO operation (in this case a request to the DB socket). And again the handler releases the thread so it can handle other events (such as the completion of IO operations that are part of other clients' request processing).
Given that IO operations are slow compared to the speed of the CPU itself, one thread handling IO can process a lot of requests concurrently.
Note that it is important that the request handler code never uses any blocking operation (like blocking IO), because that would steal the IO thread and would not allow other requests to proceed.
JSF and Mybatis
In the case of JSF and MyBatis, the synchronous approach is used. JSF uses a servlet to handle requests from the UI, and servlets are handled by the synchronous processors in WildFly. JDBC, which MyBatis uses to communicate with the DB, also uses synchronous IO, so threads are used to execute requests concurrently.
Congestions
All of the above is written with the assumption that there are no other sources of congestion. By congestion here I mean a limitation on the ability of a certain component of the system to execute things in parallel.
For example, imagine a situation where the database is configured to allow only one client connection at a time (this is not a reasonable configuration, and I'm using it only to demonstrate the idea). In this case, even if multiple threads can execute the code of the save method in parallel, all but one will be blocked at the moment they try to open a connection to the database.
Another similar example is if you are using an SQLite database. It only allows one client to write to the DB at a time. So at the point when thread A tries to execute its insert, it will be blocked if there is another thread B that is already executing an insert. Only after the commit executed by thread B will thread A be able to proceed with its insert. How long A waits depends on the time it takes B to execute its request and on the number of other threads waiting to do a write operation to the same DB.
In practice, if you are using an RDBMS that scales better (like PostgreSQL, MySQL or Oracle), you will not hit this problem with a small number of connections. But it may become a problem when there is a large number of concurrent requests and the DB limits the number of client connections, or a connection pool is used to limit the number of connections on the application side. In this case, if there are already many connections to the database, new clients will wait until existing requests are finished and connections are closed.

Operating system computations for throughput

Can I get help with this? I can't seem to understand the question.
"In this problem you are to compare reading a file using a single-threaded file server with a multi- threaded file server. It takes 16 msec to get a request for work, dispatch it, and do the rest of the necessary processing, assuming the data are in the block cache. If a disk operation is needed (assume a spinning disk drive with 1 head), as is the case one-fourth of the time, an additional 32 msec is required."
Can I get help with this?
I don't think so (I don't think there's enough information in the question for anyone to be able to understand it).
Example 1
The file server is single-threaded and handles asynchronous requests, and the "16 msec" is primarily "request delivery latency" (time between a process sending a request and the file server receiving the request). A process sends a single request asking to read from 1000 files; the file server receives this request, "immediately" sends back 750 replies (for file data that was cached) and sends a single request asking something (file system code, disk driver?) to fetch the remaining 250 things; then the file server "immediately" goes back to waiting for more requests while the reply for the earlier 250 things is still outstanding. In this case you can say that throughput for the single-threaded file server is virtually infinite (e.g. infinite throughput for "file data cache hit", which is the only thing that matters, because you can make more requests while waiting for slow disk IO).
Example 2
The file server has 8 threads and handles synchronous requests. A single-threaded process sends 1 request (to read from 1 file) and then has to wait for the reply; the request is given to one of the file server's threads (it doesn't matter which), and that thread takes an average of "16 + 32*0.25 = 24 msec" to handle the request before the process can make its next request; and the process does this in a loop because it wants to read 1000 files. In this case throughput is "1/0.024 = 41.67 requests per second", which is extremely bad (primarily because the single-threaded process can't send requests fast enough to keep all threads of the multi-threaded server busy).
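As a quick sanity check of the arithmetic above (a throwaway sketch, using only the numbers already stated in the question):

-- reproduce the averages used in Example 2
main :: IO ()
main = do
  let avgMs = 16 + 0.25 * 32 :: Double   -- 75% served from cache, 25% need an extra 32 msec disk read
  print avgMs                            -- 24.0 msec per request
  print (1000 / avgMs)                   -- ~41.67 requests per second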
Example 3
The file server has 8 threads and handles synchronous requests. A process with 1000 threads sends 1 request (to read from 1 file) from each of its threads. In this case we need to know how many CPUs there are (and how the scheduler works) to determine anything about throughput. E.g. if there are only 2 CPUs, then you're not going to get 8 file server threads running in parallel at the same time.

Haskell database connections

Please look at this scotty app (it's taken directly from this old answer from 2014):
import Web.Scotty
import Database.MongoDB
import qualified Data.Text.Lazy as T
import Control.Monad.IO.Class
runQuery :: Pipe -> Query -> IO [Document]
runQuery pipe query = access pipe master "nutrition" (find query >>= rest)
main = do
  pipe <- connect $ host "127.0.0.1"
  scotty 3000 $ do
    get "/" $ do
      res <- liftIO $ runQuery pipe (select [] "stock_foods")
      text $ T.pack $ show res
You see how the database connection (pipe) is created only once when the web app launches. Subsequently, thousands if not millions of visitors will hit the "/" route simultaneously and read from the database using the same connection (pipe).
I have questions about how to properly use Database.MongoDB:
Is this the proper way of setting things up? As opposed to creating a database connection for every visit to "/". In this latter case, we could have millions of connections at once. Is that discouraged? What are the advantages and drawbacks of such an approach?
In the app above, what happens if the database connection is lost for some reason and needs to be created again? How would you recover from that?
What about authentication with the auth function? Should the auth function only be called once after creating the pipe, or should it be called on every hit to "/"?
Some say that I'm supposed to use a pool (Data.Pool). It looks like that would only help limit the number of visitors using the same database connection simultaneously. But why would I want to do that? Doesn't the MongoDB connection have a built-in support for simultaneous usages?
Even if you create a connection per client, you won't be able to create very many of them: you will hit the ulimit, and once you do, the client that hit it will get a runtime error.
The reason it doesn't make sense is that the mongodb server would spend too much time polling all those connections, and it only has as many meaningful workers as your DB server has CPUs.
One connection is not a bad idea, because mongodb is designed to accept several requests on one connection and wait for the responses. So it will utilize as much of the resources as your mongodb can offer, with only one limitation: you have only one pipe for writing, and if it closes accidentally you will need to recreate this pipe yourself.
So, it makes more sense to have a pool of connections. It doesn't need to be big. I had an app which authenticates users and gives them tokens. With 2500 concurrent users per second it only had 3-4 concurrent connections to the database.
Here are the benefits connection pool gives you:
If you hit the pool's connection limit you will wait for the next available connection instead of getting a runtime error. So your app will wait a little bit instead of rejecting your client.
The pool will recreate connections for you. You can configure the pool to close excess connections and to create more, up to a certain limit, as you need them. If your connection breaks while you read from it or write to it, you just take another connection from the pool. If you don't return that broken connection to the pool, the pool will create another connection for you.
If the database connection is closed, then the mongodb listener on that connection will exit, printing an error message on your terminal, and your app will receive an IO error. In order to handle this error you will need to create another connection and try again. Once you start handling this situation you will see that it is easier to use a DB pool, because eventually your solution will resemble a connection pool very much.
I do auth once as part of opening a connection. If you need to auth another user later you can always do it.
Yes, mongodb handles simultaneous usage, but like I said it gives you only one pipe to write to, and that soon becomes a bottleneck. If you create at least as many connections as your mongodb server can afford threads to handle them (the CPU count), then they will be going at full speed.
If I missed something feel free to ask for clarifications.
Thank you for your question.
What you really want is a database connection pool. Take a look at the code from this other answer.
Instead of auth, you can use withMongoDBPool if your MongoDB server is in secure mode.
Is this the proper way of setting things up? As opposed to creating a database connection for every visit to "/". In this latter case, we could have millions of connections at once. Is that discouraged? What are the advantages and drawbacks of such an approach?
You do not want to open one connection and then use it. The HTTP server you are using, which underpins Scotty, is called Warp. Warp has a multi-core, multi-green-thread design. You are allowed to share the same connection across all threads, since Database.MongoDB says outright that connections are thread-safe, but what will happen is that when one thread is blocked waiting for a response (the MongoDB protocol follows a simple request-response design) all threads in your web service will block. This is unfortunate.
We can instead create a connection on every request. This trivially solves the problem of one thread's blocking another but leads to its own share of problems. The overhead of setting up a TCP connection, while not substantial, is also not zero. Recall that every time we want to open or close a socket we have to jump from user space to the kernel, wait for the kernel to update its internal data structures, and then jump back (a context switch). We also have to deal with the TCP handshake and goodbyes. We would also, under high load, run out of file descriptors or memory.
It would be nice if we had a solution somewhere in between. The solution should be
Thread-safe
Let us max-bound the number of connections so we don't exhaust the finite resources of the operating system
Quick
Share connections across threads under normal load
Create new connections as we experience increased load
Allow us to clean up resources (like closing a handle) as connections are deleted under reduced load
Hopefully already written and battle-tested by other production systems
It is exactly this problem that resource-pool tackles.
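As a rough sketch of what that looks like for the app above (assuming the classic createPool signature from resource-pool; the stripe count, 60-second idle time and maximum of 10 connections are arbitrary placeholders, not recommendations):

{-# LANGUAGE OverloadedStrings #-}
import Web.Scotty
import Database.MongoDB
import qualified Data.Text.Lazy as T
import Control.Monad.IO.Class
import Data.Pool (createPool, withResource)

-- same app as in the question, but pipes come from a pool: at most 10 open at a
-- time, and idle pipes are closed after 60 seconds
main = do
  pool <- createPool (connect $ host "127.0.0.1") close 1 60 10
  scotty 3000 $
    get "/" $ do
      res <- liftIO $ withResource pool $ \pipe ->
        access pipe master "nutrition" (find (select [] "stock_foods") >>= rest)
      text $ T.pack $ show res

Under light load the same few pipes get reused across requests; under heavier load the pool opens more, up to the limit, and requests beyond that simply wait for a pipe to come free.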
Some say that I'm supposed to use a pool (Data.Pool). It looks like that would only help limit the number of visitors using the same database connection simultaneously. But why would I want to do that? Doesn't the MongoDB connection have a built-in support for simultaneous usages?
It is unclear what you mean by simultaneous usages. There is one interpretation I can guess at: you mean something like HTTP/2, which has pipelining built into the protocol.
standard picture of pipelining http://research.worksap.com/wp-content/uploads/2015/08/pipeline.png
Above we see the client making multiple requests to the server, without waiting for a response, and then the client can receive responses back in some order. (Time flows from the top to the bottom.) This MongoDB does not have. This is a fairly complicated protocol design that is not that much better than just asking your clients to use connection pools. And MongoDB is not alone here: the simple request-and-response design is something that Postgres, MySQL, SQL Server, and most other databases have settled on.
And: it is true that a connection pool limits the load you can take as a web service before all threads are blocked and your user just sees a loading bar. But this problem would exist in any of the three scenarios (connection pooling, one shared connection, one connection per request)! The computer has finite resources, and at some point something will collapse under sufficient load. Connection pooling's advantage is that it scales gracefully right up until the point it cannot. The correct solution to handling more traffic is to increase the number of computers; we should not avoid pooling simply because of this problem.
In the app above, what happens if the database connection is lost for some reason and needs to be created again? How would you recover from that?
I believe these kinds of what-if's are outside the scope of Stack Overflow and deserve no better answer than "try it and see." Buuuuuuut given that the server terminates the connection, I can take a stab at what might happen: assuming Warp forks a green thread for each request (which I think it does), each thread will experience an unchecked IOException as it tries to write to the closed TCP connection. Warp would catch this exception and serve it as an HTTP 500, hopefully writing something useful to the logs also. Assuming a single-connection model like you have now, you could either do something clever (but high in lines of code) where you "reboot" your main function and set up a second connection. Something I do for hobby projects: should anything odd occur, like a dropped connection, I ask my supervisor process (like systemd) to watch the logs and restart the web service. Though clearly not a great solution for a production, money-makin' website, it works well enough for small apps.
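If you do go with the pool, a crude version of "try again with a fresh connection" can be sketched like this (relying on the fact that Data.Pool's withResource destroys a resource whose use threw an exception instead of returning it to the pool; catching SomeException is deliberately blunt here):

{-# LANGUAGE OverloadedStrings, ScopedTypeVariables #-}
import Control.Exception (SomeException, catch)
import Data.Pool (Pool, withResource)
import Database.MongoDB

-- a broken pipe makes the first attempt throw; the pool drops that pipe, and the
-- single retry below runs with a fresh one
runQueryRetrying :: Pool Pipe -> Query -> IO [Document]
runQueryRetrying pool q = attempt `catch` \(_ :: SomeException) -> attempt
  where
    attempt = withResource pool $ \pipe ->
      access pipe master "nutrition" (find q >>= rest)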
What about authentication with the auth function? Should the auth function only be called once after creating the pipe, or should it be called on every hit to "/"?
It should be called once after creating the connection. MongoDB authentication is per-connection. You can see an example here of how the db.auth() command mutates the MongoDB server's data structures corresponding to the current client connection.
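With Database.MongoDB that per-connection authentication looks roughly like this (the user name, password and auth database below are placeholders):

{-# LANGUAGE OverloadedStrings #-}
import Database.MongoDB

-- authenticate once, right after opening the pipe, then reuse the pipe
main :: IO ()
main = do
  pipe <- connect $ host "127.0.0.1"
  ok   <- access pipe master "admin" (auth "myUser" "myPassword")
  if ok
    then putStrLn "authenticated; keep using this pipe for the rest of the session"
    else putStrLn "authentication failed"
  close pipe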

How do I trap a client-side signal and terminate a postgresql query on the backend?

A postgres client is running a possibly long-running query. The client receives a SIGTERM. Ideally, I'd like the client to trap the SIGTERM, clean up its query by sending a pg_cancel_backend(pid) and then quit.
This would seem to be a common concern with a client-server database architecture, but I can't find any solutions. Instead, the query will continue until completion, consuming unnecessary resources.
How can the client determine with certainty the PID of its backend query? (I don't mean interactively view the backend queries and choosing the one that appears to match the query that was issued.) For example, are PIDs associated with persistent database connections?
How does this work when using psql when ctrl-C is pressed?
You can use pg_backend_pid() to get the PID of the current connection.
Details here.
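One way to wire the whole thing together is sketched below (assuming the postgresql-simple and unix libraries; "big_table" is just a placeholder for a long-running query, none of this is given in the question): remember your own backend PID at connect time, and on SIGTERM cancel the running query from a second connection before exiting.

{-# LANGUAGE OverloadedStrings #-}
import Database.PostgreSQL.Simple
import System.Posix.Signals (Handler (Catch), installHandler, sigTERM)

main :: IO ()
main = do
  conn <- connectPostgreSQL "dbname=test"
  [Only pid] <- query_ conn "SELECT pg_backend_pid()" :: IO [Only Int]
  _ <- installHandler sigTERM (Catch (cancelBackend pid)) Nothing
  -- once the handler fires, this call fails with a "canceling statement due to
  -- user request" error, which the program can catch and then exit cleanly
  _ <- query_ conn "SELECT count(*) FROM big_table" :: IO [Only Int]
  return ()

cancelBackend :: Int -> IO ()
cancelBackend pid = do
  admin <- connectPostgreSQL "dbname=test"  -- a separate connection just for the cancel
  _ <- query admin "SELECT pg_cancel_backend(?)" (Only pid) :: IO [Only Bool]
  close admin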