This question is related to this one: How to speed up insertion performance in PostgreSQL
So, I have a Java application which is doing a lot of (approx. a billion) INSERTs into a PostgreSQL database. It opens several JDBC connections to the same DB to do these inserts in parallel. As I read in the question-answer mentioned above:
INSERT or COPY in parallel from several connections. How many depends
on your hardware's disk subsystem; as a rule of thumb, you want one
connection per physical hard drive if using direct attached storage.
But in my case I have only one disk for my DB storage.
So, my question is: does it really make sense to open several connections in this case? Could it reduce performance instead of giving the desired increase, due to competition for I/O?
To clarify, here is a picture of the actual PostgreSQL process load:
Since you mentioned INSERT in a Java application, I'd assume (using plain JDBC) COPY is not what you're looking for. Without an API like JPA or a framework such as Spring Data, may I introduce addBatch() and executeBatch() in case you haven't heard of them:
/*
 * the whole nine yards
 */
Connection c = ...;
PreparedStatement ps = c.prepareStatement("INSERT INTO table1 (columnInt2, columnVarchar) VALUES (?, ?)");
Then read data in a loop:
ps.setShort(1, someShortValue);
ps.setString(2, someStringValue);
ps.addBatch(); // one row at a time from human's perspective
When the data for all rows has been prepared:
ps.executeBatch();
May I also recommend:
Connection pooling, which saves you a lot of resources; check out Commons DBCP, c3p0, and BoneCP.
When doing multiple CUD (create, update, delete) operations, think about transactions so you can roll back in case any row goes wrong (a sketch combining batching with a transaction follows below).
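A minimal sketch putting batching and an explicit transaction together; the table, the Row type, the rows collection, the dataSource, and BATCH_SIZE are placeholders for illustration:
final int BATCH_SIZE = 1000;                              // hypothetical chunk size
String sql = "INSERT INTO table1 (columnInt2, columnVarchar) VALUES (?, ?)";
try (Connection c = dataSource.getConnection()) {         // dataSource: your pool (DBCP/c3p0/BoneCP)
    c.setAutoCommit(false);                               // one explicit transaction
    try (PreparedStatement ps = c.prepareStatement(sql)) {
        int count = 0;
        for (Row row : rows) {                            // 'rows' stands for your input data
            ps.setShort(1, row.getShortValue());
            ps.setString(2, row.getStringValue());
            ps.addBatch();
            if (++count % BATCH_SIZE == 0) {
                ps.executeBatch();                        // send one chunk to the server
            }
        }
        ps.executeBatch();                                // flush the remaining rows
        c.commit();
    } catch (SQLException e) {
        c.rollback();                                     // any bad row aborts the whole batch
        throw e;
    }
}
Sending the batch in chunks keeps memory use bounded; the best chunk size depends on row width and network latency, so it is worth measuring.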
Related
I am dealing with a great number of inserts per second into a Postgres DB (and a lot of reads too).
A few days ago I heard about Redis and started to think about sending all these INSERTs to Redis first, to avoid a lot of open/insert/close cycles in Postgres every second.
Then, after some short period, I could group that data from Redis into one INSERT statement and run it in Postgres, with only one connection open.
The system stores GPS data, and an online map reads it in real time.
Any suggestions for that scenario? Thanks!
I do not know how important it is in your case to have the data available to your users in near real time. But from what you list above, I do not see anything that cannot be solved by configuration/replication in PostgreSQL.
You have a lot of writes to your database: before moving to a different technology, keep in mind that PostgreSQL is battle-tested, and I am sure you can get more out of it by tuning its configuration to handle more writes. link
You have a lot of reads from your database: master-slave replication can direct all your read traffic to the slaves, and you can scale horizontally as much as you need (a minimal sketch of such a read/write split follows below).
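For illustration, a bare-bones version of that read/write split on the application side, assuming plain JDBC; the hostnames, credentials, table, and the deviceId/lat/lon variables are all hypothetical:
// import java.sql.*;
// Writes go to the primary, reads go to a replica (in real code both would come from a pool).
Connection primary = DriverManager.getConnection("jdbc:postgresql://db-primary:5432/gps", "app", "secret");
Connection replica = DriverManager.getConnection("jdbc:postgresql://db-replica-1:5432/gps", "app", "secret");

try (PreparedStatement write = primary.prepareStatement(
        "INSERT INTO gps_points (device_id, lat, lon, recorded_at) VALUES (?, ?, ?, now())")) {
    write.setLong(1, deviceId);
    write.setDouble(2, lat);
    write.setDouble(3, lon);
    write.executeUpdate();
}

try (PreparedStatement read = replica.prepareStatement(
        "SELECT lat, lon FROM gps_points WHERE device_id = ? ORDER BY recorded_at DESC LIMIT 1")) {
    read.setLong(1, deviceId);
    try (ResultSet rs = read.executeQuery()) {
        if (rs.next()) {
            // feed the online map from the replica
        }
    }
}
If I remember correctly, newer PostgreSQL JDBC drivers can also do this routing via a multi-host URL with a targetServerType parameter, but the explicit two-connection version keeps the idea visible.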
I'm planning a product that will process updates from multiple data feeds. Input data is guesstimated to be a 100 Mbps stream in total, containing 100-byte messages. These messages contain several data fields that need to be checked for correlation with the existing data set within the application. If an input message correlates with an existing data record, then the input message will update the existing data record; if not, it will create a new record. It is assumed that data are updated every 3 seconds on average.
The correlation process is assumed to be a bottleneck, and thus I intend to make our product able to run balanced across multiple processes if needed (most likely on separate hardware or VMs), somewhat in the vicinity of space-based architecture. I'd then like shared storage between my processes so that all existing data records are visible to all the running processes. The shared storage will have to fetch possible candidates for correlation through a query/search based on some attributes (e.g. elevation). It will have to offer configurable warm redundancy and the possibility to store snapshots every 5 minutes for logging.
Everything seems to be pointing towards MongoDB, but I'd like a confirmation from you that MongoDB will meet my needs. So do you think it is a go?
-Thank you
NB: I am not considering a relational database because we want to keep all coding in our application, instead of having to write 'stored procedures'/'functions' in a separate environment to optimize the performance of our system. Further, the data is diverse and I don't want to try to normalize it into a schema.
Yes, MongoDB will meet your needs. I think the following aspects of your description are particularly relevant in your DB selection decision:
1. An update happens every 3 seconds
MongoDB has a database-level write lock (usually short-lived) that blocks read operations. This means that you will want to ensure that you have enough memory to fit your working set, and then you will generally not run into any write-lock issues. Note that bulk inserts will hold the write lock for longer.
If you are sharding, you will want to consider shard keys that allow for write scaling, i.e. distribute writes across different shards.
2. Shared storage for multiple processes
This is a pretty common scenario; in fact, many MongoDB deployments are expected to be accessed from multiple processes concurrently. Unlike the write lock, the read lock does not block other reads.
3. Warm redundancy
Supported through MongoDB replication. If you'd like to read from secondary server(s) you will need to set the Read Preference to secondaryPreferred in your driver.
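For illustration, a minimal sketch with the MongoDB Java driver; the connection string, database name ("feeds"), collection name ("tracks"), and query are all made up, and the read preference can equally be set per client or per database:
// import com.mongodb.ReadPreference;
// import com.mongodb.client.*;
// import org.bson.Document;
MongoClient client = MongoClients.create("mongodb://host1,host2/?replicaSet=rs0");

// Route reads from this collection to a secondary when one is available.
MongoCollection<Document> tracks = client
        .getDatabase("feeds")
        .getCollection("tracks")
        .withReadPreference(ReadPreference.secondaryPreferred());

Document candidate = tracks.find(new Document("elevation", 1200)).first();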
I am building a tool for data extraction and transformation. The typical use case is transactionally processing lots of data.
The numbers: about 10 sec to 5 min duration, 200-10,000 rows updated (the long duration is caused not by the database itself but by outside services used during the transaction).
There are two types of agents that access the database: multiple read agents, and only one write agent (so there are never multiple concurrent writes).
During the transaction:
Read agents should be able to read the database and see it in its current (committed) state.
The write agent should be able to read the database (it does both reads and writes during the transaction) and see it in the new (not yet committed) state.
Is PostgreSQL a good choice for that type of load? I know it uses MVCC - so it should be ok in general, but is it ok to use long and big transactions extensively?
What other open-source transactional databases may be a good choice (I am not limited to SQL)?
P.S.
I do not know whether sharding may affect performance. The database will be sharded. For every shard there will be multiple readers and only one writer, but multiple different shards can be written to at the same time.
I know that it's better not to use outside services during transaction, but in that case - it's the goal. The database used as a reliable and consistent index for some heavy, huge, slow and eventually-consistent data processing tool.
Huge disclaimer: as always, only a real-life test can tell you the truth.
But I think PostgreSQL will not let you down if you use a recent version (at least 9.1, better 9.2) and tune it properly.
I have a somewhat similar load on my server, but with a slightly worse R/W ratio: about 10:1. Transactions range from a few milliseconds up to 1 hour (and sometimes even more), and one transaction can insert or update up to 100k rows. The total number of concurrent writers with long transactions can reach 10 or more.
So far so good - I don't really have any serious issues, performance is great (certainly not worse than I expected).
What really helps is that my hot working data set almost fits into available memory.
So, give it a try, it should work great for your load.
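As a minimal sketch of the access pattern you describe (the table and column names are invented), MVCC gives you exactly the two views you want: the write agent sees its own uncommitted changes inside its transaction, while read agents on other connections keep seeing the last committed state:
// import java.sql.*;
// Write agent: one long transaction that both reads and writes.
try (Connection writer = DriverManager.getConnection(url, user, password)) {
    writer.setAutoCommit(false);                       // everything below is one transaction
    try (Statement st = writer.createStatement()) {
        st.executeUpdate("UPDATE items SET state = 'processing' WHERE batch_id = 42");

        // The write agent already sees its own uncommitted UPDATE here ...
        try (ResultSet rs = st.executeQuery(
                "SELECT count(*) FROM items WHERE state = 'processing' AND batch_id = 42")) {
            rs.next();
            long inProgress = rs.getLong(1);           // reflects this transaction's own changes
            // ... call the slow outside services, then write their results back ...
        }
        writer.commit();                               // only now do the read agents see the new state
    } catch (SQLException e) {
        writer.rollback();
        throw e;
    }
}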
Have a look at this link. Maximum transaction size in PostgreSQL
Basically there can be some technical limits on the software side to how large your transaction can be.
Background:
I have a PostgreSQL (v8.3) database that is heavily optimized for OLTP.
I need to extract data from it on a semi-real-time basis (someone is bound to ask what semi-real-time means, and the answer is: as frequently as I reasonably can, but I will be pragmatic; as a benchmark, let's say we are hoping for every 15 minutes) and feed it into a data warehouse.
How much data? At peak times we are talking approx. 80-100k rows per minute hitting the OLTP side; off-peak this will drop significantly to 15-20k. The most frequently updated rows are ~64 bytes each, but there are various tables etc., so the data is quite diverse and can range up to 4000 bytes per row. The OLTP side is active 24x5.5.
Best Solution?
From what I can piece together the most practical solution is as follows:
Create a TRIGGER to write all DML activity to a rotating CSV log file
Perform whatever transformations are required
Use the native DW data pump tool to efficiently pump the transformed CSV into the DW
Why this approach?
TRIGGERs allow selected tables to be targeted rather than being system-wide, the output is configurable (i.e. into a CSV), and they are relatively easy to write and deploy. Slony uses a similar approach and the overhead is acceptable
CSV is easy and fast to transform
Easy to pump CSV into the DW
Alternatives considered ....
Using native logging (http://www.postgresql.org/docs/8.3/static/runtime-config-logging.html). The problem with this is that it looked very verbose relative to what I needed and was a little trickier to parse and transform. However, it could be faster, as I presume there is less overhead compared to a TRIGGER. Certainly it would make the administration easier as it is system-wide, but again, I don't need some of the tables (some are used for persistent storage of JMS messages which I do not want to log)
Querying the data directly via an ETL tool such as Talend and pumping it into the DW ... the problem is the OLTP schema would need to be tweaked to support this, and that has many negative side effects
Using a tweaked/hacked SLONY - SLONY does a good job of logging and migrating changes to a slave so the conceptual framework is there but the proposed solution just seems easier and cleaner
Using the WAL
Has anyone done this before? Want to share your thoughts?
Assuming that your tables of interest have (or can be augmented with) a unique, indexed, sequential key, then you will get much, much better value out of simply issuing SELECT ... FROM table ... WHERE key > :last_max_key with output to a file, where last_max_key is the last key value from the last extraction (0 if it is the first extraction). This incremental, decoupled approach avoids introducing trigger latency into the insertion datapath (be it custom triggers or modified Slony), and depending on your setup could scale better with the number of CPUs etc. (However, if you also have to track UPDATEs, and the sequential key was added by you, then your UPDATE statements should SET the key column to NULL so it gets a new value and gets picked up by the next extraction. You would not be able to track DELETEs without a trigger.) Is this what you had in mind when you mentioned Talend?
I would not use the logging facility unless you cannot implement the solution above; logging most likely involves locking overhead to ensure log lines are written sequentially and do not overlap/overwrite each other when multiple backends write to the log (check the Postgres source.) The locking overhead may not be catastrophic, but you can do without it if you can use the incremental SELECT alternative. Moreover, statement logging would drown out any useful WARNING or ERROR messages, and the parsing itself will not be instantaneous.
Unless you are willing to parse WALs (including transaction state tracking, and being ready to rewrite the code every time you upgrade Postgres) I would not necessarily use the WALs either -- that is, unless you have the extra hardware available, in which case you could ship WALs to another machine for extraction (on the second machine you can use triggers shamelessly -- or even statement logging -- since whatever happens there does not affect INSERT/UPDATE/DELETE performance on the primary machine.) Note that performance-wise (on the primary machine), unless you can write the logs to a SAN, you'd get a comparable performance hit (in terms of thrashing the filesystem cache, mostly) from shipping WALs to a different machine as from running the incremental SELECT.
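A minimal sketch of that incremental extraction from Java; the table, key column, output path, and the two helpers for persisting the high-water mark are all invented:
// import java.io.*; import java.sql.*;
long lastMaxKey = readLastMaxKey();                     // hypothetical helper; returns 0 on the first run

try (Connection c = DriverManager.getConnection(url, user, password);
     PreparedStatement ps = c.prepareStatement(
             "SELECT id, payload FROM oltp_table WHERE id > ? ORDER BY id");
     PrintWriter out = new PrintWriter(new FileWriter("/tmp/extract_" + lastMaxKey + ".csv"))) {

    ps.setLong(1, lastMaxKey);
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            lastMaxKey = rs.getLong("id");
            out.println(lastMaxKey + "," + rs.getString("payload"));   // naive CSV, no quoting/escaping
        }
    }
}
storeLastMaxKey(lastMaxKey);                            // hypothetical helper; remember the high-water mark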
If you can come up with a 'checksum table' that contains only the IDs and a 'checksum' per row, you can quickly select not only the new records but also the changed and deleted ones.
The checksum could be any checksum function you like, e.g. CRC32.
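A rough sketch of that comparison in Java; the checksum table layout, the source columns, and the way rows are serialized before hashing are all assumptions:
// import java.nio.charset.StandardCharsets;
// import java.sql.*; import java.util.*; import java.util.zip.CRC32;

// 1. Load the previously stored checksums: id -> crc ('c' is an open Connection).
Map<Long, Long> previous = new HashMap<>();
try (Statement st = c.createStatement();
     ResultSet rs = st.executeQuery("SELECT id, crc FROM checksum_table")) {
    while (rs.next()) {
        previous.put(rs.getLong("id"), rs.getLong("crc"));
    }
}

// 2. Recompute checksums over the current rows and diff against the stored ones.
try (Statement st = c.createStatement();
     ResultSet rs = st.executeQuery("SELECT id, col1, col2 FROM oltp_table")) {
    while (rs.next()) {
        long id = rs.getLong("id");
        CRC32 crc = new CRC32();
        crc.update((rs.getString("col1") + "|" + rs.getString("col2")).getBytes(StandardCharsets.UTF_8));
        Long old = previous.remove(id);
        if (old == null) {
            // new record
        } else if (old.longValue() != crc.getValue()) {
            // changed record
        }
    }
}
// 3. Whatever is still left in 'previous' was deleted since the last run.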
The new ON CONFLICT clause in PostgreSQL has changed the way I do many updates. I pull the new data (based on a row_update_timestamp) into a temp table, then in one SQL statement INSERT into the target table with ON CONFLICT ... DO UPDATE. If your target table is partitioned then you need to jump through a couple of hoops (i.e. hit the partition table directly). The ETL can happen as you load the temp table (most likely) or in the ON CONFLICT SQL (if trivial). Compared to other "UPSERT" approaches (update, then insert if zero rows, etc.) this shows a huge speed improvement. In our particular DW environment we don't need/want to accommodate DELETEs. Check out the ON CONFLICT docs - it gives Oracle's MERGE a run for its money!
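A stripped-down sketch of that pattern; the table and column names are invented, and it assumes dim_customer has a primary key (or unique constraint) on id:
// import java.sql.*;
c.setAutoCommit(false);                                   // keep the temp table alive until the commit
try (Statement st = c.createStatement()) {
    st.execute("CREATE TEMP TABLE stage (LIKE dim_customer INCLUDING DEFAULTS) ON COMMIT DROP");

    // ... pull the rows with a newer row_update_timestamp into 'stage' here,
    //     e.g. with COPY or batched INSERTs ...

    // One statement upserts the whole staging set into the target.
    st.execute(
        "INSERT INTO dim_customer (id, name, row_update_timestamp) " +
        "SELECT id, name, row_update_timestamp FROM stage " +
        "ON CONFLICT (id) DO UPDATE " +
        "SET name = EXCLUDED.name, row_update_timestamp = EXCLUDED.row_update_timestamp");
    c.commit();
}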
I've skimmed through Date and Silberschatz but can't seem to find answers to these specific questions of mine.
1. If 2 database users issue a query -- say, 'select * from AVERYBIGTABLE;' -- where would the results of the query get stored in general, i.e., independent of the size of the result set?
a. In the OS-managed physical/virtual memory of the DBMS server?
b. In a DBMS-managed temporary file?
2. Is the query result set maintained per connection?
3. If the query result set is indeed maintained per connection, then what if there's connection pooling in effect (by a layer of code sitting above the DBMS)? Won't, then, the result set be maintained per query (instead of per connection)?
4. If the database is changing in real time while its users concurrently issue select queries, what happens to the queries that have already been executed but not yet (fully) 'consumed' by the query issuers? For example, assume the result set has 50,000 rows; the user is currently iterating at the 100th, when in parallel another user executes an insert/delete such that it would lead to more/fewer than 50,000 rows if the earlier query were to be re-issued by any user of the DBMS?
5. On the other hand, in the case of a database that does not change in real time, if 2 users issue identical queries, each with identical but VERY LARGE result sets, would the DBMS maintain 2 identical copies of the result set, or would it have a single shared copy?
Many thanks in advance.
Some of this may be specific to Oracle.
The full results of the query do not need to be copied: each user gets a cursor (like a pointer) that keeps track of which rows have been retrieved and which rows still need to be fetched. The database will cache as much of the data as it can as it reads the data out of the tables. It is the same principle as two users having read-only file handles on the same file.
The cursors are maintained per connection, the data for the next row may or may not already be in memory.
Connections are for the most part single-threaded; only one client can use a connection at a time. If the same query is executed twice on the same connection then the cursor position is reset.
If a cursor is open on a table that is being updated, then the old rows are copied into a separate space (undo, in Oracle) and are maintained for the life of the cursor, or at least until it runs out of space to maintain them (Oracle will give a "snapshot too old" error).
The database will never duplicate the data stored in the cache; in Oracle's case, with cursor sharing, there would be a single cached cursor and each client cursor would only have to maintain its position in the cached cursor.
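From the client side this is also why a JDBC ResultSet does not have to hold all 50,000 rows at once; a minimal sketch (the fetch size, table name, and the process() helper are arbitrary):
// import java.sql.*;
// The driver fetches rows from the server in chunks as you iterate, instead of
// copying the whole result set to the client up front. Exact behavior is
// driver-specific; e.g. the PostgreSQL driver only streams like this when
// autocommit is off, while the Oracle driver streams by default.
conn.setAutoCommit(false);
try (Statement st = conn.createStatement()) {
    st.setFetchSize(1000);                         // arbitrary chunk size
    try (ResultSet rs = st.executeQuery("SELECT * FROM AVERYBIGTABLE")) {
        while (rs.next()) {
            process(rs);                           // placeholder for your row handling
        }
    }
}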
Oracle Database Concepts
See chapter 8, Memory, for questions 1, 2 and 5.
See chapter 13, Data Concurrency and Consistency, for questions 3 and 4.
The reason you don't find this in Date etc. is that these details can change between DBMS products; there is nothing in relational model theory about pooling connections to the database or how to maintain the result sets from a query (caching etc.). The only point that is partially covered is 4, where the read level (e.g. read uncommitted) comes into play, but this only applies until the result set has been produced.