Some basic questions on RDBMSes

I've skimmed through Date and Silberschatz but can't seem to find answers to these specific questions of mine.
1. If two database users issue a query -- say, 'select * from AVERYBIGTABLE;' -- where would the results of the query get stored, in general (i.e., independent of the size of the result set)?
a. In the OS-managed physical/virtual memory of the DBMS server?
b. In a DBMS-managed temporary file?
2. Is the query result set maintained per connection?
3. If the query result set is indeed maintained per connection, what happens when connection pooling is in effect (implemented by a layer of code above the DBMS)? Wouldn't the result set then be maintained per query instead of per connection?
4. If the database is changing in real time while its users concurrently issue SELECT queries, what happens to queries that have already been executed but not yet (fully) consumed by their issuers? For example, assume the result set has 50,000 rows and a user is currently iterating at the 100th, when, in parallel, another user executes an insert/delete such that re-issuing the earlier query would return more or fewer than 50,000 rows.
5. On the other hand, in the case of a database that does not change in real time, if two users issue identical queries, each with an identical but VERY LARGE result set, would the DBMS maintain two identical copies of the result set, or a single shared copy?
Many thanks in advance.

Some of this may be specific to Oracle.
The full results of the query do not need to be copied; each user gets a cursor (like a pointer) that keeps track of which rows have been retrieved and which rows still need to be fetched. The database caches as much of the data as it can as it reads it out of the tables. It's the same principle as two users holding read-only file handles on the same file.
The cursors are maintained per connection; the data for the next row may or may not already be in memory.
Connections are, for the most part, single-threaded: only one client can use a connection at a time. If the same query is executed twice on the same connection, the cursor position is reset.
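As a rough illustration of that cursor model (PostgreSQL-style syntax here, since the exact commands differ per DBMS), each client pulls rows in batches and the server only tracks that client's position:
BEGIN;
DECLARE big_cur CURSOR FOR SELECT * FROM AVERYBIGTABLE;
FETCH 100 FROM big_cur;   -- this client's first 100 rows
FETCH 100 FROM big_cur;   -- the next 100; only the cursor position advances
CLOSE big_cur;
COMMIT;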
If a cursor is open on a table that is being updated, the old rows are copied into a separate space (undo, in Oracle's case), which is maintained for the life of the cursor, or at least until the database runs out of space to maintain it (Oracle then raises a "snapshot too old" error).
The database will never duplicate the data stored in the cache; in Oracle's case, with cursor sharing, there would be a single cached cursor, and each client cursor would only have to maintain its own position within it.
Oracle Database Concepts:
See chapter 8, Memory, for questions 1, 2 and 5.
See chapter 13, Data Concurrency and Consistency, for questions 3 and 4.

The reason you don't find this in Date etc. is that these details vary between DBMS products; there is nothing in relational model theory about pooling connections to the database or about how result sets from a query are maintained (caching and so on). The only point that is partially covered is question 4, where the isolation level comes into play (e.g. read uncommitted), but this only applies until the result set has been produced.
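For question 4, a minimal sketch of where that comes into play (standard-ish SQL; the available levels and the default differ between products):
BEGIN;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;   -- or READ UNCOMMITTED, REPEATABLE READ, SERIALIZABLE
SELECT COUNT(*) FROM AVERYBIGTABLE;               -- whether concurrent, uncommitted inserts/deletes are visible depends on the level chosen
COMMIT;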

Related

cursors: atomic MOVE then FETCH?

the intent
Basically, pagination: have a CURSOR (created using DECLARE outside any function here, but that can be changed if need be) concurrently addressed to retrieve batches of rows, which implies moving the cursor position in order to fetch more than one row (FETCH count seems to be the only way to fetch more than one row).
the context
During a more global transaction (i.e. using one connection), I want to retrieve a range of rows through a cursor. To do so, I:
MOVE the cursor to the desired position (e.g. MOVE 42 FROM "mycursor")
then FETCH the amount of rows (e.g. FETCH FORWARD 10 FROM "mycursor")
However, this transaction is used by many workers (horizontally scaled), each receiving a set of "coordinates" for the cursor, like LIMIT and OFFSET: the index to MOVE to, and the number of rows to FETCH. These workers use the DB connection through HTTP calls to a single DB API which handles the pool of connections and the transactions' liveness.
Because of this concurrent access to the transaction/connection, I need to ensure atomic execution of the couple "MOVE then FETCH".
the setup
NodeJS workers consuming ranges of rows through a DB API
NodeJS DB API based on pg (latest)
PostgreSQL v10 (can be upgraded if required, all documentation links here are from v12 - latest)
the tries
WITH (MOVE 42 FROM "mycursor") FETCH 10 FROM "mycursor" produces a syntax error; apparently WITH doesn't accept MOVE.
MOVE 42 FROM "mycursor"; FETCH 10 FROM "mycursor": since I'm inside a transaction I suppose this could work, but I'm using Node's pg, which apparently doesn't handle several statements in the same call to query() (no error, but no result yielded; I didn't dig into this much as it looks like a hack).
I'm not confident a function would guarantee atomicity (it doesn't seem to be what PARALLEL UNSAFE is about), and as I'm going to have high concurrency, I'd really love some explicitly written assurances about atomicity... (a rough sketch of the kind of function I mean follows)
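For illustration, here is roughly the kind of function I have in mind (a speculative sketch: the function name is made up, and whether it actually gives the atomicity I need is exactly my question):
CREATE OR REPLACE FUNCTION move_then_fetch(c refcursor, skip integer, take integer)
RETURNS SETOF record
LANGUAGE plpgsql
AS $$
DECLARE
    r record;
BEGIN
    MOVE ABSOLUTE skip FROM c;   -- reposition the already-declared cursor (portal)
    FOR i IN 1 .. take LOOP
        FETCH c INTO r;          -- pull the next row
        EXIT WHEN NOT FOUND;
        RETURN NEXT r;           -- emit it as part of the requested page
    END LOOP;
END;
$$;
-- callers must supply a column definition list for SETOF record, e.g.:
-- SELECT * FROM move_then_fetch('mycursor', 42, 10) AS t(id integer, label text);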
the reason
I'd prefer not to rely on LIMIT/OFFSET, as it would require an ORDER BY clause to ensure pagination consistency (as per the docs; Ctrl-F for "unpredictable"), unless (scrollable, without hold) cursors prove to be way more resource-consuming. "Way more" because that cost has to be weighed against the INSENSITIVE behavior of cursors, which would allow me not to hold a lock on the underlying table during the whole process. If pagination in this context turns out not to be feasible with cursors, I'll fall back to this solution, unless you have something better to suggest!
the human side
Hello, and thanks in advance for the help! :)
You can share connections, but you cannot share transactions. What you are asking for is impossible in this context. Well-behaved tools do not share a connection while a transaction is in progress on it.
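For what it's worth, a minimal sketch of the LIMIT/OFFSET fallback mentioned in the question, where each worker runs a single self-contained statement on its own connection (table and column names are made up):
SELECT *
FROM mytable
ORDER BY id          -- a stable, unique ordering keeps the pages consistent
LIMIT 10 OFFSET 42;  -- this worker's "coordinates": 10 rows, skipping the first 42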

How to cache duplicate queries in PostgreSQL

I have some expensive queries (each takes around 90 seconds). The good news is that my queries do not change much; as a result, most of them are duplicates. I am looking for a way to cache the query results in PostgreSQL. I have searched for an answer but could not find one (some answers are outdated and some are not clear).
I use an application which is connected to Postgres directly.
The query is a simple SQL query which returns thousands of rows:
SELECT * FROM Foo WHERE field_a<100
Is there any way to cache a query result for at least a couple of hours?
It is possible to cache expensive queries in Postgres using a technique called a "materialized view"; however, given how simple your query is, I'm not sure this will give you much gain.
You may be better off caching this information directly in your application, in memory, or, if possible, caching a further-processed set of data rather than the raw rows.
ref:
https://www.postgresql.org/docs/current/rules-materializedviews.html
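A minimal sketch of the materialized-view approach, assuming a copy that is up to a few hours stale is acceptable (the view name is arbitrary):
-- build the cache once; this runs the expensive query and stores its result
CREATE MATERIALIZED VIEW foo_cache AS
SELECT * FROM Foo WHERE field_a < 100;

-- subsequent reads hit the stored result instead of re-running the query
SELECT * FROM foo_cache;

-- re-run the underlying query on whatever schedule suits you (e.g. every few hours)
REFRESH MATERIALIZED VIEW foo_cache;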
Depending on what your application looks like, a TEMPORARY TABLE might work for you. It is only visible to the connection that created it and it is automatically dropped when the database session is closed.
CREATE TEMPORARY TABLE tempfoo AS
SELECT * FROM Foo WHERE field_a<100;
The downside to this approach is that you get a snapshot of Foo when you create tempfoo. You will not see any new data that gets added to Foo when you look at tempfoo.
Another approach: if you have access to the database, you may be able to significantly speed up your queries by adding an index on field_a.
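A minimal example of such an index (the index name is arbitrary):
CREATE INDEX idx_foo_field_a ON Foo (field_a);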

postgres many tables vs one huge table

I am using a PostgreSQL database.
My application manages many objects of the same type.
For each object my application performs intense DB writing: each object has a row inserted into the DB at least once every 30 seconds. I also need to retrieve the data by object ID.
My question is: how is it best to design the database? Use one huge table for all the objects (slower inserts), or one table per object (more complicated retrievals)?
Tables are meant to hold a huge number of objects of the same type. So your second option, one table per object, doesn't seem right. But of course, more information is needed.
My tip: start with one table. If you run into problems - mainly performance - try to split it up. It's not that hard.
Logically, you should use one table.
However, the so-called "write amplification" problem exhibited by PostgreSQL seems to have been one of the main reasons why Uber switched from PostgreSQL to MySQL. Quote:
"For tables with a large number of secondary indexes, these
superfluous steps can cause enormous inefficiencies. For instance, if
we have a table with a dozen indexes defined on it, an update to a
field that is only covered by a single index must be propagated into
all 12 indexes to reflect the ctid for the new row."
Whether this is a problem for your workload, only measurement can tell - I'd recommend starting with one table, measuring performance, and then switching to multi-table (or partitioning, or perhaps switching the DBMS altogether) only if the measurements justify it.
A single table is probably the best solution if you are certain that all objects will continue to have the same attributes.
INSERT does not get significantly slower as the table grows – it is the number of indexes that slows down data modification.
I'd rather be worried about data growth. Do you have a design for getting rid of old data? Big DELETEs can be painful; sometimes partitioning helps.
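To make the partitioning idea concrete, a rough sketch under assumed table and column names (declarative partitioning needs PostgreSQL 10+, and the index on the partitioned parent needs 11+):
-- one logical table for all objects, partitioned by time so old data can be dropped cheaply
CREATE TABLE object_events (
    object_id  bigint      NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    payload    text
) PARTITION BY RANGE (created_at);

CREATE TABLE object_events_2024_01 PARTITION OF object_events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- retrieval by object id stays a plain indexed lookup
CREATE INDEX ON object_events (object_id, created_at);

-- removing a month of old data is a cheap metadata operation, not a big DELETE
DROP TABLE object_events_2024_01;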

Multiple connections to PostgreSQL with huge number of INSERTs

This question was prompted by this one: How to speed up insertion performance in PostgreSQL
So, I have a Java application which is doing a lot of INSERTs (approx. a billion) into a PostgreSQL database. It opens several JDBC connections to the same DB to do these inserts in parallel. As I read in the answer to the question mentioned above:
"INSERT or COPY in parallel from several connections. How many depends on your hardware's disk subsystem; as a rule of thumb, you want one connection per physical hard drive if using direct attached storage."
But in my case I have only one disk storage for my DB.
So, my question is: does it really make sense to open several connections in this case? Could it reduce performance instead of increasing it, because of competing I/O operations?
To clarify, here is a picture of the actual PostgreSQL process load:
Since you mentioned INSERT in a Java application, I'd assume (as you're using plain JDBC) that COPY is not what you're looking for. Without using an API like JPA or a framework such as Spring Data, may I introduce addBatch() and executeBatch(), in case you haven't heard of these:
/* the whole nine yards */
Connection c = ...;  // obtain a JDBC connection (elided here)
PreparedStatement ps = c.prepareStatement(
    "INSERT INTO table1 (columnInt2, columnVarchar) VALUES (?, ?)");
Then read data in a loop:
ps.setShort(1, someShortValue);
ps.setString(2, someStringValue);
ps.addBatch(); // one row at a time from human's perspective
When data of all rows are prepared:
ps.executeBatch();
May I also recommend:
Connection pooling, which saves you a lot of resources; check out Commons DBCP, c3p0 and BoneCP.
When doing multiple CUD (create, update, delete) operations, think about transactions, so you can roll back in case any row goes wrong (see the sketch below).
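A minimal SQL-level sketch of that last point, reusing the hypothetical table1 from the snippet above; the whole batch either commits or is rolled back as one unit:
BEGIN;
INSERT INTO table1 (columnInt2, columnVarchar) VALUES (1, 'first row');
INSERT INTO table1 (columnInt2, columnVarchar) VALUES (2, 'second row');
-- ... the rest of the batch ...
COMMIT;   -- or ROLLBACK; if anything went wrong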

IBM DB2 select query for millions of rows

I am new to DB2. I want to select around 2 million rows with a single query, such that
it selects and displays the first 5,000 rows, then in a background process selects the next 5,000 rows, and keeps going until the end of all the data. Help me out with how to write such a query or which function to use.
Sounds like you want what's known as blocking. However, this isn't actually handled (not the way you're thinking of) at the database level - it's handled at the application level. You'd need to specify your platform and programming language for us to help there. Although if you're expecting somebody to actually read 2 million rows, it's going to take a while... At one row a second, that's 23 straight days.
The reason that SQL doesn't really perform this 'natively' is that it's (sort of) less efficient. Also, SQL is (by design) set up to operate over the entire set of data, both conceptually and syntactically.
You can use one of the newer features, which brings Oracle/MySQL-style paging (LIMIT/OFFSET) to DB2: https://www.ibm.com/developerworks/mydeveloperworks/blogs/SQLTips4DB2LUW/entry/limit_offset?lang=en
At the same time, you can influence the optimizer by specifying OPTIMIZE FOR n ROWS and FETCH FIRST n ROWS ONLY. If you are only going to read, it is better to add the FOR READ ONLY clause to the query; this improves concurrency, and the cursor will not be updatable. Also, choose a suitable isolation level; in this case you could use "uncommitted read" (WITH UR). A prior LOCK TABLE can also help.
Do not forget common practices such as an index or clustering index and retrieving only the necessary columns, and always analyze the access plan via the Explain facility.
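Putting those pieces together, a hedged sketch of fetching one 5,000-row page (table and column names are made up; the OFFSET syntax assumes a Db2 version, or the compatibility setting from the link above, that accepts it):
SELECT id, col1, col2        -- only the columns you actually need
FROM mytable
ORDER BY id                  -- a stable order keeps the pages consistent
OFFSET 5000 ROWS             -- skip the pages already shown
FETCH FIRST 5000 ROWS ONLY   -- the next page of 5000 rows
FOR READ ONLY                -- read-only cursor, better concurrency
WITH UR                      -- uncommitted-read isolation
OPTIMIZE FOR 5000 ROWS;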