Batch insert in PostgreSQL extremely slow (F#)

The code is in F#, but it's generic enough that it'll make sense to anyone not familiar with the language.
I have the following schema:
CREATE TABLE IF NOT EXISTS trades_buffer (
    instrument varchar NOT NULL,
    ts timestamp without time zone NOT NULL,
    price decimal NOT NULL,
    volume decimal NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_instrument ON trades_buffer(instrument);
CREATE INDEX IF NOT EXISTS idx_ts ON trades_buffer(ts);
My batches are made of 500 to 3000 records at once. To get an idea of the performance, I'm developing on a 2019 MBP (i7 CPU), running PostgreSQL in a Docker container.
Currently it takes between 20 and 80 seconds to insert a batch; the time doesn't scale linearly with the batch size, but larger batches do take longer.
I'm using this lib https://github.com/Zaid-Ajaj/Npgsql.FSharp as a wrapper around Npgsql.
This is my insertion code:
let insertTradesAsync (trades: TradeData list) : Async<Result<int, Exception>> =
    async {
        try
            let data =
                trades
                |> List.map (fun t ->
                    [
                        "@instrument", Sql.text t.Instrument.Ticker
                        "@ts", Sql.timestamp t.Timestamp.DateTime
                        "@price", Sql.decimal t.Price
                        "@volume", Sql.decimal (t.Quantity * t.Price)
                    ]
                )
            let! result =
                connectionString
                |> Sql.connect
                |> Sql.executeTransactionAsync [ "INSERT INTO trades_buffer (instrument, ts, price, volume) VALUES (@instrument, @ts, @price, @volume)", data ]
                |> Async.AwaitTask
            return Ok (List.sum result)
        with ex ->
            return Error ex
    }
I checked that the connection step is extremely fast (<1ms).
pgAdmin seems to show that PostgreSQL is mostly idle.
I ran profiling on the code and none of it seems to take any time.
It's as if the time spent was in the driver, between my code and the database itself.
Since I'm a newbie with PostgreSQL, I could also be doing something horribly wrong :D
Edit:
I have tried a few things:
use the TimescaleDB extension, which is made for time series
move the data from a docker volume to a local folder
run the code on a ridiculously large PostgreSQL AWS instance
and the results are the same.
What I know now:
no high CPU usage
no high ram usage
no hotspot in the profile on my code
pgAdmin shows the db is mostly idle
having an index, or not, has no impact
local or remote database gives the same results
So the issue is either:
how I interact with the DB
in the driver I'm using
Update 2:
The non-async version of the connector performs significantly better.
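As a general illustration of why batching matters (this sketch is not Npgsql-specific): the win comes from reusing one prepared statement inside one transaction, instead of paying a round trip and a commit per row. Here is a minimal sketch of that idea using Python's stdlib sqlite3 and a table mirroring the schema above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trades_buffer ("
    "instrument TEXT NOT NULL, ts TEXT NOT NULL, "
    "price REAL NOT NULL, volume REAL NOT NULL)"
)

# A synthetic batch in the 500-3000 row range from the question.
rows = [
    ("BTCUSD", f"2023-01-01T00:00:{i % 60:02d}", 100.0 + i, 0.5)
    for i in range(3000)
]

# One transaction, one prepared statement reused for every row:
# the whole batch costs a single commit instead of 3000.
with conn:
    conn.executemany(
        "INSERT INTO trades_buffer (instrument, ts, price, volume) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )

count = conn.execute("SELECT COUNT(*) FROM trades_buffer").fetchone()[0]
print(count)  # 3000
```

The same shape applies to PostgreSQL: a single multi-row statement, a batched prepared statement, or COPY all avoid the per-row round-trip cost that dominates when inserting one row at a time.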

Related

SQLite slower than expected when querying

I have a fairly large (3,000,000 rows) SQLite database. It consists of one table.
The table has an integer id column, a text-based tag column, a timestamp column saved as an int, and 15 double number columns.
I have a unique index on the tag and timestamp columns, since I always look entries up using both.
I need to run though the database and do quite a few calculations. Mainly calling a bunch of select statements.
The complexity of the select statements is really simple.
I am using the GRDB library.
Here is an example query.
do {
    try dbQueue.read { db in
        let request = try DataObject
            .filter(Columns.tag == tag)
            .filter(Columns.dateNoDash == date)
            .fetchOne(db)
    }
} catch { Log.msg("Was unable to query database. Error: \(error)") }
When I trace the queries my program generates (using EXPLAIN QUERY PLAN), I can see that the index is being used.
I have to loop over a lot of queries, so I benchmarked a segment of the queries. I am finding that 600 queries roughly take 28 seconds. I am running the program on a 10-core iMac Pro. This seems slow. I was always under the impression that SQLite was faster.
The other code in the loop basically adds certain numbers together and possibly computes an average, so nothing complex or computationally expensive.
I tried to speed things up by adding the following configuration to the database connection.
var config = Configuration()
config.prepareDatabase { db in
    try db.execute(sql: "PRAGMA journal_mode = MEMORY")
    try db.execute(sql: "PRAGMA synchronous = OFF")
    try db.execute(sql: "PRAGMA locking_mode = EXCLUSIVE")
    try db.execute(sql: "PRAGMA temp_store = MEMORY")
    try db.execute(sql: "PRAGMA cache_size = 2048000")
}
let dbQueue = try DatabaseQueue(path: path, configuration: config)
Is there anything I can do to speed things up? Is GRDB slowing things down? Am I doing anything wrong? Should I be using a different database like MySQL or something?
Thanks for any tips/input
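One thing worth checking is whether each of the 600 lookups pays its own per-query overhead. Running them back to back on one connection lets the driver reuse the compiled statement. As a cross-language sketch of the idea (Python's stdlib sqlite3 here, with made-up table and column names modelled on the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
with conn:  # set up test data in one committed transaction
    conn.execute(
        "CREATE TABLE data_object (tag TEXT, dateNoDash INTEGER, value REAL)"
    )
    conn.executemany(
        "INSERT INTO data_object VALUES (?, ?, ?)",
        [("a", 20230100 + i, float(i)) for i in range(600)],
    )
    conn.execute(
        "CREATE UNIQUE INDEX idx_tag_date ON data_object(tag, dateNoDash)"
    )

# Run all 600 point lookups on one cursor with the identical SQL text,
# so the compiled statement is reused from the connection's statement
# cache rather than re-prepared per query.
cur = conn.cursor()
results = []
for i in range(600):
    row = cur.execute(
        "SELECT value FROM data_object WHERE tag = ? AND dateNoDash = ?",
        ("a", 20230100 + i),
    ).fetchone()
    results.append(row[0])

print(len(results))  # 600
```

In GRDB terms, the analogous move is batching the whole loop inside a single `dbQueue.read { ... }` block instead of opening a read access per query, so statement caching and snapshot setup are paid once.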

Can I force an Eloquent model created_at to use unix_timestamp() for DB server time to combat time drift

I like storing my timestamps as epoch values, so in my model I use:
protected $dateFormat = 'U';
and in my migrations I use:
$table->unsignedInteger('created_at');
$table->unsignedInteger('updated_at');
On a single server with both PHP and the DB on, this gets the result I want.
But this uses the local clock. In a scaled install with multiple 'front end' app servers all connecting to a single 'back end' database server, if any of the front-end server clocks begins to drift, I'm going to see entries in the DB where the (auto-incrementing) id shows one order of insertion but created_at shows a different order. This is not what I want.
With MySQL as the DB I'm trying to find a way to have created_at use the MySQL unix_timestamp() function so that the clock on the database server is used.
Is this even possible?

How to optimise this ef core query?

I'm using EF Core 3.0 code-first with an MSSQL database. I have a big table with ~5 million records. I have indexes on ProfileId, EventId and UnitId. This query takes ~25-30 seconds to execute. Is that normal, or is there a way to optimize it?
await (from x in _dbContext.EventTable
       where x.EventId == request.EventId
       group x by new { x.ProfileId, x.UnitId } into grouped
       select new
       {
           ProfileId = grouped.Key.ProfileId,
           UnitId = grouped.Key.UnitId,
           Sum = grouped.Sum(a => a.Count * a.Price)
       }).AsNoTracking().ToListAsync();
I tried looping through profileIds, adding another WHERE clause and removing ProfileId from the grouping key, but it worked out slower.
Capture the SQL being executed with a profiling tool (SSMS has one, or Express Profiler), then run that within SSMS with the execution plan enabled. This may highlight an indexing improvement. If the execution time in SSMS roughly correlates with what you're seeing in EF, then the only real avenue of improvement will be hardware on the SQL box. You are running a query that will touch 5m rows any way you look at it.
Operations like this are not that uncommon, just not something a user would expect to sit and wait for. This is more of a reporting-type request, so when faced with requirements like this I would look at letting users queue up a request and receive a notification when the operation completes and the results are ready. That setup can also prevent users from repeatedly re-submitting the same request ("not sure if I clicked" spam) and ensure that too many requests from multiple users aren't kicked off simultaneously. Ideally this would run against a read-only reporting replica rather than the read-write production DB, to avoid locks slowing or interfering with regular operations.
Try removing ToListAsync(), or replace it with AsQueryable(); calling ToList slows performance down.
await (from x in _dbContext.EventTable
       where x.EventId == request.EventId
       group x by new { x.ProfileId, x.UnitId } into grouped
       select new
       {
           ProfileId = grouped.Key.ProfileId,
           UnitId = grouped.Key.UnitId,
           Sum = grouped.Sum(a => a.Count * a.Price)
       });
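To make the first answer's advice concrete: the LINQ above should translate to a single grouped aggregate in SQL, which is the statement worth capturing and inspecting in SSMS. A sketch of that equivalent query, run here against a tiny made-up table in Python's stdlib sqlite3 (table and column names assumed to match the model):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE event_table '
    '(EventId INT, ProfileId INT, UnitId INT, "Count" INT, Price REAL)'
)
# 12 synthetic rows: ProfileId cycles 0-2, UnitId cycles 0-1,
# so each (ProfileId, UnitId) pair appears exactly twice.
conn.executemany(
    "INSERT INTO event_table VALUES (?, ?, ?, ?, ?)",
    [(1, i % 3, i % 2, 2, 1.5) for i in range(12)],
)

# Roughly the SQL the LINQ query translates to: one filtered,
# grouped aggregate executed entirely on the database side.
rows = conn.execute(
    'SELECT ProfileId, UnitId, SUM("Count" * Price) '
    "FROM event_table WHERE EventId = ? "
    "GROUP BY ProfileId, UnitId",
    (1,),
).fetchall()
print(len(rows))  # 6 groups, each summing 2 * (2 * 1.5) = 6.0
```

If the plan for this statement shows a scan despite the EventId index, a covering index on (EventId, ProfileId, UnitId) including Count and Price is the usual next thing to try.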

Why are subsequent queries so much slower?

I'm trying to sort out a very weird behavior.
I'm working with:
JBoss AS 7.1.1
EJB 3.0
JPA
XA DataSource
Oracle 11g
In one of the system's functionalities, the user can see the status of each Store. For each Store I fire a query to sum up all the files that have been processed. The query is something like this:
SELECT
    SUM(CASE file.type WHEN 'TYPE_1' THEN 1 ELSE 0 END),
    SUM(CASE file.type WHEN 'TYPE_2' THEN 1 ELSE 0 END),
    SUM(CASE file.type WHEN 'TYPE_3' THEN 1 ELSE 0 END)
FROM
    File file
WHERE
    file.type IN ('TYPE_1', 'TYPE_2', 'TYPE_3')
    AND file.status = 'RECEIVED'
    AND file.store.id = :storeId
The thing is, the user can select which of the stores he wants to check, and that's where things get weird.
When I check the first store, the result comes blazing fast, but all subsequent queries take significantly more time. Let me exemplify:
User checks store 15 (Blazing fast result) - About 200 ms
User checks store 2 (Very slow result) - About 8000 ms
Now pay attention to this part, it's very important.
User logs out, and logs in again.
User checks store 2 (the one that took 8000ms), and now the result is blazing fast.
This is very odd, the same store that took a while before, is now loading pretty fast.
Whenever I try the queries on SQLDeveloper the results come pretty fast as well.
I annotated my EJB with #TransactionAttribute(TransactionAttributeType.NOT_SUPPORTED) but I didn't get any difference on the execution time.
I created a standalone project to run the queries using JDBC and the result was fast again, which leaves me thinking it may be some configuration on my DataSource, persistence.xml or anything like that.
Does anyone have any clues why this happens?
A few things:
When the user checks store 2 the second time, the Oracle optimizer is probably using its cache, and that is why it is blazing fast.
How many File records does store 2 have? Try a GROUP BY to see whether this File table needs fresh statistics; if, for example, store 2 has dramatically more File records than the other stores, then try this:
begin
    dbms_stats.gather_table_stats(user, 'file', estimate_percent => 100);
end;
This will ensure the table's statistics are accurate.
You can also optimize the query: you don't have to perform "sum" three times, you can do something like this:
select f.type, count(*)
from File f
where f.store.id = :storeId
  and f.type IN ('TYPE_1', 'TYPE_2', 'TYPE_3')
group by f.type
You may also be running into a cardinality feedback problem; have a look at this blog:
http://orcasoracle.squarespace.com/oracle-rdbms/2012/12/18/when-a-query-runs-slower-on-second-execution-a-possible-side.html
/KR

Is it correct to scan a table in MySQL using "SELECT * .. LIMIT start, count" without an ORDER BY clause?

Suppose table X has 100 tuples.
Will the following approach to scanning X generate all the tuples in table X, in MySQL?
for start in [0, 10, 20, ..., 90]:
    print results of "select * from X LIMIT start, 10;"
I ask because I've been using PostgreSQL, which clearly documents that this approach need not work, but there seems to be no such info for MySQL. If it won't work, is there a way to return results in a fixed ordering without knowing any other info about the table (like what the primary key fields are)?
I need to scan each tuple in a table in an application, and I want a way to do it without using too much memory in the application (so simply doing a "select * from X" is out).
No, that isn't a safe assumption. Without an ORDER BY clause, there is no guarantee that your query will return rows in the same order each time, so LIMIT-based paging can skip or repeat rows. If the table is properly indexed, adding an ORDER BY (on the index) shouldn't be too expensive.
Edit: results without an ORDER BY will sometimes come back in clustered-index order, but I wouldn't put any money on that!
If you are using the InnoDB or MyISAM table types, a better approach is to use the HANDLER interface. Only MySQL supports this, but it does what you want:
http://dev.mysql.com/doc/refman/5.0/en/handler.html
Also, the MySQL API supports two modes of retrieving data from the server:
store result: in this mode, as soon as a query is executed, the API retrieves the entire result set before returning to the user code. This can use a lot of client memory to buffer results, but minimises the use of resources on the server.
use result: in this mode, the API pulls results row by row, returning control to the user code more frequently. This minimises memory use on the client, but can hold locks on the server for longer.
Most of the MySQL APIs for various languages support this in one form or another. It is usually an argument supplied when creating the connection, and/or a separate call that can switch an existing connection into that mode.
So, in answer to your question - I would do the following:
set the connection to "use result" mode;
run "select * from X" and process the rows as they arrive, so the full result set is never buffered on the client.
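Another portable option, when the table has any indexed unique column, is keyset pagination: ORDER BY that column and resume each page after the last key seen, which gives a fixed ordering and bounded memory without OFFSET rescans. A sketch of the loop in Python's stdlib sqlite3, assuming a hypothetical INTEGER PRIMARY KEY id (the same SQL shape works in MySQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO X (payload) VALUES (?)",
    [(f"row{i}",) for i in range(100)],
)

seen = []
last_id = 0
while True:
    # Keyset pagination: order by the indexed key and resume after the
    # last id seen, so every page is deterministic and index-driven.
    page = conn.execute(
        "SELECT id, payload FROM X WHERE id > ? ORDER BY id LIMIT 10",
        (last_id,),
    ).fetchall()
    if not page:
        break
    seen.extend(page)
    last_id = page[-1][0]

print(len(seen))  # 100, each row exactly once
```

Unlike LIMIT start, 10 without ORDER BY, this visits every row exactly once even if other sessions are inserting concurrently ahead of the cursor position.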