SQLite slower than expected when querying - Swift

I have a fairly large (3,000,000 rows) SQLite database. It consists of a single table.
The table has an integer id column, a text tag column, a timestamp column stored as an integer, and 15 double columns.
I have a unique index on the tag and timestamp columns, since I always look entries up using both.
I need to run through the database and do quite a few calculations, mainly by issuing a lot of select statements.
The select statements themselves are very simple.
I am using the GRDB library.
Here is an example query.
do {
    try dbQueue.read { db in
        let request = try DataObject
            .filter(Columns.tag == tag)
            .filter(Columns.dateNoDash == date)
            .fetchOne(db)
    }
} catch {
    Log.msg("Was unable to query database. Error: \(error)")
}
When I trace the queries my program generates and run EXPLAIN QUERY PLAN on them, I can see that the index is being used.
I have to loop over a lot of queries, so I benchmarked a segment of them: 600 queries take roughly 28 seconds, about 47 ms per query. I am running the program on a 10-core iMac Pro. This seems slow; I was always under the impression that SQLite was faster.
The other code in the loop basically adds certain numbers together and possibly computes an average, so nothing complex or computationally expensive.
I tried to speed things up by adding the following configuration to the database connection.
var config = Configuration()
config.prepareDatabase { db in
    try db.execute(sql: "PRAGMA journal_mode = MEMORY")
    try db.execute(sql: "PRAGMA synchronous = OFF")
    try db.execute(sql: "PRAGMA locking_mode = EXCLUSIVE")
    try db.execute(sql: "PRAGMA temp_store = MEMORY")
    try db.execute(sql: "PRAGMA cache_size = 2048000")
}
let dbQueue = try DatabaseQueue(path: path, configuration: config)
Is there anything I can do to speed things up? Is GRDB slowing things down? Am I doing anything wrong? Should I be using a different database like MySQL or something?
Thanks for any tips/input
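One thing worth checking, offered as a hedged suggestion since the post doesn't show how the loop is structured: if every lookup opens its own read and prepares its own statement, most of that ~47 ms is setup rather than the lookup itself. Running the whole loop inside one transaction with one reused prepared statement usually buys far more than PRAGMAs; in GRDB that means keeping the loop inside a single dbQueue.read block. The sketch below shows the pattern in C# with Microsoft.Data.Sqlite, to match the other examples in this digest; the table and column names are made up.

using Microsoft.Data.Sqlite;

// Hypothetical table/columns; only the pattern matters: one transaction,
// one prepared statement reused for every lookup.
var lookups = new (string tag, long ts)[] { ("sensor-a", 20200101), ("sensor-b", 20200102) };

using var conn = new SqliteConnection("Data Source=data.sqlite");
conn.Open();

using var tx = conn.BeginTransaction();
using var cmd = conn.CreateCommand();
cmd.Transaction = tx;
cmd.CommandText = "SELECT value1 FROM readings WHERE tag = $tag AND ts = $ts";
var pTag = cmd.Parameters.Add("$tag", SqliteType.Text);
var pTs = cmd.Parameters.Add("$ts", SqliteType.Integer);

foreach (var (tag, ts) in lookups)
{
    pTag.Value = tag;
    pTs.Value = ts;
    var value = cmd.ExecuteScalar(); // statement is compiled once, bound and executed per lookup
    // ... accumulate sums / averages here ...
}
tx.Commit();

If the lookups can instead be expressed as one query (for example, all rows for a tag over a date range), a single SELECT that returns everything the calculation needs will beat even this.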

Related

Batch insert in PostgreSQL extremely slow (F#)

The code is in F#, but it's generic enough that it will make sense even to readers not familiar with the language.
I have the following schema:
CREATE TABLE IF NOT EXISTS trades_buffer (
    instrument varchar NOT NULL,
    ts timestamp without time zone NOT NULL,
    price decimal NOT NULL,
    volume decimal NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_instrument ON trades_buffer(instrument);
CREATE INDEX IF NOT EXISTS idx_ts ON trades_buffer(ts);
My batches contain 500 to 3,000 records each. To get an idea of the performance, I'm developing on a 2019 MBP (i7 CPU), running PostgreSQL in a Docker container.
Currently a batch takes between 20 and 80 seconds to insert; the time doesn't scale linearly with the size, but it does grow with it.
I'm using this lib https://github.com/Zaid-Ajaj/Npgsql.FSharp as a wrapper around Npgsql.
This is my insertion code:
let insertTradesAsync (trades: TradeData list) : Async<Result<int, Exception>> =
    async {
        try
            let data =
                trades
                |> List.map (fun t ->
                    [
                        "@instrument", Sql.text t.Instrument.Ticker
                        "@ts", Sql.timestamp t.Timestamp.DateTime
                        "@price", Sql.decimal t.Price
                        "@volume", Sql.decimal (t.Quantity * t.Price)
                    ]
                )
            let! result =
                connectionString
                |> Sql.connect
                |> Sql.executeTransactionAsync
                    [ "INSERT INTO trades_buffer (instrument, ts, price, volume) VALUES (@instrument, @ts, @price, @volume)", data ]
                |> Async.AwaitTask
            return Ok (List.sum result)
        with ex ->
            return Error ex
    }
I checked that the connection step is extremely fast (<1ms).
pgAdmin seems to show that PostgreSQL is mostly idle.
I profiled the code and none of it seems to take any time.
It's as if the time were spent in the driver, between my code and the database itself.
Since I'm a newbie with PostgreSQL, I could also be doing something horribly wrong :D
Edit:
I have tried a few things:
use the TimescaleDB extension, made for time series
move the data from a docker volume to a local folder
run the code on a ridiculously large PostgreSQL AWS instance
and the results are the same.
What I know now:
no high CPU usage
no high RAM usage
no hotspot in the profile on my code
pgAdmin shows the db is mostly idle
having an index, or not, has no impact
local or remote database gives the same results
So the issue is either:
in how I interact with the DB
in the driver I'm using
Update 2:
The non-async version of the connector performs significantly better.
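Worth noting next to that update: the approach above still executes the INSERT once per parameter set, even inside one transaction. PostgreSQL's purpose-built bulk path is the COPY protocol, which Npgsql exposes directly. This is a different technique from the one in the post, sketched below in C# against the trades_buffer schema above, with a placeholder connection string and placeholder rows.

using Npgsql;
using NpgsqlTypes;

// Placeholder connection string and rows mirroring trades_buffer(instrument, ts, price, volume).
var connectionString = "Host=localhost;Database=trades;Username=postgres;Password=postgres";
var rows = new (string instrument, DateTime ts, decimal price, decimal volume)[]
{
    ("BTCUSD", new DateTime(2021, 1, 1, 12, 0, 0), 42000.5m, 0.25m),
};

using var conn = new NpgsqlConnection(connectionString);
conn.Open();

// COPY ... FROM STDIN streams all rows in one COPY operation instead of one INSERT per row.
using var writer = conn.BeginBinaryImport(
    "COPY trades_buffer (instrument, ts, price, volume) FROM STDIN (FORMAT BINARY)");
foreach (var (instrument, ts, price, volume) in rows)
{
    writer.StartRow();
    writer.Write(instrument, NpgsqlDbType.Text);
    writer.Write(ts, NpgsqlDbType.Timestamp);
    writer.Write(price, NpgsqlDbType.Numeric);
    writer.Write(volume, NpgsqlDbType.Numeric);
}
writer.Complete();

Whether the async/sync difference or the per-row execution is the real bottleneck is something only profiling the post's own setup can confirm.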

How to optimise this EF Core query?

I'm using EF Core 3.0 code first with an MSSQL database. I have a big table with ~5 million records. I have indexes on ProfileId, EventId and UnitId. This query takes ~25-30 seconds to execute. Is this normal, or is there a way to optimize it?
await (from x in _dbContext.EventTable
       where x.EventId == request.EventId
       group x by new { x.ProfileId, x.UnitId } into grouped
       select new
       {
           ProfileId = grouped.Key.ProfileId,
           UnitId = grouped.Key.UnitId,
           Sum = grouped.Sum(a => a.Count * a.Price)
       }).AsNoTracking().ToListAsync();
I tried looping through ProfileIds, adding another WHERE clause and removing ProfileId from the grouping key, but it was slower.
Capture the SQL being executed with a profiling tool (SSMS has one, or Express Profiler), then run it within SSMS with the execution plan enabled. This may highlight an indexing improvement. If the execution time in SSMS roughly correlates to what you're seeing in EF, then the only real avenue of improvement will be hardware on the SQL box. You are running a query that will touch 5M rows any way you look at it.
Operations like this are not that uncommon, just not something a user would expect to sit and wait for. This is more of a reporting-type request, so when faced with requirements like this I would look at options for users to queue up a request and receive a notification when the operation completes, so they can fetch the results. This would be set up to prevent users from repeatedly re-requesting ("not sure if I clicked" type spam), with additional care to ensure too many requests from multiple users aren't kicked off simultaneously. Ideally this would be a candidate to run off a read-only reporting replica rather than the read-write production DB, to avoid locks slowing or interfering with regular operations.
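On the indexing point above: if the captured plan shows the EventId index feeding a long stream of key lookups, one improvement worth testing is a covering index that includes the grouped and summed columns, so the aggregation can be served from the index alone. A minimal sketch as an EF Core migration; the index name, and the assumption that the entity maps straight onto an EventTable table with these column names, are mine.

using Microsoft.EntityFrameworkCore.Migrations;

public partial class AddEventTableCoveringIndex : Migration
{
    protected override void Up(MigrationBuilder migrationBuilder)
    {
        // Covers the WHERE on EventId and the GROUP BY / SUM columns.
        migrationBuilder.Sql(@"
CREATE NONCLUSTERED INDEX IX_EventTable_EventId_Covering
ON EventTable (EventId)
INCLUDE (ProfileId, UnitId, [Count], Price);");
    }

    protected override void Down(MigrationBuilder migrationBuilder)
    {
        migrationBuilder.Sql("DROP INDEX IX_EventTable_EventId_Covering ON EventTable;");
    }
}

If the plan instead shows a full scan with a hash aggregate, the index won't change much, and the reporting-replica/queue approach above is the more realistic path.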
Try removing ToListAsync(), or replacing it with AsQueryableAsync(). Adding ToList slows performance down.
await (from x in _dbContext.EventTable
       where x.EventId == request.EventId
       group x by new { x.ProfileId, x.UnitId } into grouped
       select new
       {
           ProfileId = grouped.Key.ProfileId,
           UnitId = grouped.Key.UnitId,
           Sum = grouped.Sum(a => a.Count * a.Price)
       });

Postgres multi-update with multiple `where` cases

Excuse what seems like it could be a duplicate. I'm familiar with multiple updates in Postgres... but I can't seem to figure out a way around this one...
I have a photos table with the following columns: id (primary key), url, sort_order, and owner_user_id.
We would like our interface to allow the user to reorder their existing photos in a collection view. When a drag-reorder interaction completes, I can send a POST body to our API like the following:
req.body.photos = [{id: 345, order: 1}, {id: 911, order: 2}, ...<etc>]
I can then run the following query in a loop, once per item in the array.
photos.forEach(function (item) {
    db.runQuery('update photos set sort_order=$1 where id=$2 and owner_user_id=$3', [item.order, item.id, currentUserId])
})
It's generally frowned upon to run database queries inside loops, so if there's any way this can be done with one query, that would be fantastic.
Much thanks in advance.
Running a select query inside a loop is definitely questionable, but I don't think multiple updates are necessarily frowned upon if the data you are updating doesn't natively reside in the database. Doing them as separate transactions, however, might be.
My recommendation would be to wrap all known updates in a single transaction. This is not only kinder to the database (compile once, execute many, commit once), but this is an ACID approach to what I believe you are trying to do. If, for some reason, one of your updates fails, they will all fail. This prevents you from having two photos with an order of "1."
I didn't recognize your language, but here is an example of what this might look like in C#:
NpgsqlConnection conn = new NpgsqlConnection(connectionString);
conn.Open();
NpgsqlTransaction trans = conn.BeginTransaction();
NpgsqlCommand cmd = new NpgsqlCommand(
    "update photos set sort_order=:SORT where id=:ID", conn, trans);
cmd.Parameters.Add(new NpgsqlParameter("SORT", NpgsqlDbType.Integer));
cmd.Parameters.Add(new NpgsqlParameter("ID", NpgsqlDbType.Integer));
foreach (var photo in photos)
{
    cmd.Parameters[0].Value = photo.SortOrder;
    cmd.Parameters[1].Value = photo.Id;
    cmd.ExecuteNonQuery();
}
trans.Commit();
I think in Perl, for example, it would be even simpler -- turn off DBI AutoCommit and commit after the inserts.
CAVEAT: Of course, add error trapping -- I was just illustrating what it might look like.
Also, I changed your update SQL. If "Id" is the primary key, I don't think you need the additional owner_user_id=$3 clause to make it work.

C# Comparing lists of data from two separate databases using LINQ to Entities

I have 2 SQL Server databases, hosted on two different servers. I need to extract data from the first database, which is going to be a list of integers. Then I need to compare this list against data in multiple tables in the second database. Depending on some conditions, I need to update or insert some records in the second database.
My solution:
(WCF Service/Entity Framework using LINQ to Entities)
Get the list of integers from the 1st DB; this takes less than a second and returns 20,942 records.
I use the list of integers to compare against a table in the second DB using the following query:
List<int> pastDueAccts; // Assuming this is the list from Step #1
var matchedAccts = from acct in context.AmAccounts
                   where pastDueAccts.Contains(acct.ARNumber)
                   select acct;
The above query takes so long that it gives a timeout error, even though the AmAccount table only has ~400 records.
After I get these matchedAccts, I need to update or insert records in a separate table in the second DB.
Can someone help me do step #2 more efficiently? I think the Contains function makes it slow. I tried brute force too, using a foreach loop that extracts one record at a time and does the comparison; it still takes too long and gives a timeout error. The database server shows only 30% of its memory in use.
Profile the SQL being sent to the database using SQL Profiler: capture the statement and run it in SSMS. You should be able to see the overhead imposed by Entity Framework at this point. Can you paste the SQL statement emitted in step #2 into your question?
The query itself is going to have all 20,942 integers in it.
If your AmAccount table will always have a low number of records like that, you could just return the entire list of ARNumbers, compare them to the list, then be specific about which records to return:
List<int> pastDueAccts; // Assuming this is the list from Step #1
List<int> amAcctNumbers = (from acct in context.AmAccounts
                           select acct.ARNumber).ToList();

// Get a list of integers that are in both lists
var pastDueAmAcctNumbers = pastDueAccts.Intersect(amAcctNumbers).ToList();
var pastDueAmAccts = from acct in context.AmAccounts
                     where pastDueAmAcctNumbers.Contains(acct.ARNumber)
                     select acct;
You'll still have to worry about how many ids you are supplying to that query, and you might end up needing to retrieve them in batches.
UPDATE
Hopefully somebody has a better answer than this, but with so many records and doing this purely in EF, you could try batching it like I stated earlier:
// Suggest disabling auto detect changes,
// otherwise you will probably have some serious memory issues
// with 2MM+ records.
context.Configuration.AutoDetectChangesEnabled = false;

List<int> pastDueAccts; // Assuming this is the list from Step #1
const int batchSize = 100;
for (int i = 0; i < pastDueAccts.Count; i += batchSize)
{
    var batch = pastDueAccts.GetRange(i, Math.Min(batchSize, pastDueAccts.Count - i));
    var pastDueAmAccts = (from acct in context.AmAccounts
                          where batch.Contains(acct.ARNumber)
                          select acct).ToList(); // materialize and process this batch here
}

EF CF slow on generating insert statements

I have a project that pulls data from a service (returning XML), which is deserialized into objects/entities.
I'm using EF Code First and testing works fine until it comes to a big chunk of data: not too big, only 150K records. I used SQL Profiler to check the SQL statements and they are really fast, but there is a huge slowdown in generating the insert statements.
Simply put, the data model is simple: class Client has five child object sets and one many-to-many relationship.
IDs for the model are provided by the service, so I cleaned up duplicate instances of the same entity (same ID).
var clientList = service.GetAllClients(); // returns IEnumerable<Client>, about 10K clients
var filteredList = Client.RemoveDuplicateInstancesSameEntity(clientList); // returns IEnumerable<Client>
int cur = 0;
int batch = 100;
while (true)
{
    logger.Trace("POINT A: get next batch");
    var importSegment = filteredList.Skip(cur).Take(batch).OrderBy(x => x.Id);
    if (!importSegment.Any())
        break;
    logger.Trace("POINT B: saving to DB");
    foreach (var c in importSegment)
        repository.addClient(c);
    logger.Trace("POINT C: calling persist");
    repository.persist();
    cur = cur + batch;
}
The logic is simple: break the work into batches to speed up the process. Each batch of 100 clients creates about 1,000 insert statements (for child records and one many-to-many table).
I'm using the profiler and logging to analyze this. The log shows POINT B as the last step every time, but I don't see any insert statements in the profiler yet; then two minutes later I see all the insert statements, then POINT B for the next batch, and two minutes again.
Did I do anything wrong, or is there a setting or anything I can do to improve this?
Inserting 1K records seems fast. The database is wiped when the process starts, so there are no records in there. It doesn't seem to be an issue with SQL slowness, but with EF generating the insert statements?
The project works, but it is slow. I want to speed it up and understand more about EF when it comes to big chunks of data. Or is this normal?
The first 100 are fast, and then it gets slower and slower. It seems like the issue is at POINT B. Is the repo/DbContext holding too much data to handle it in a timely manner?
The repo inherits from DbContext, and addClient is simply:
dbcontext.Client.Add(client)
Thank you very much.
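The symptom described above, each batch slower than the last with the log stuck after POINT B, is consistent with the change tracker accumulating every inserted entity across batches and re-scanning them on each Add. A common mitigation, sketched here as an assumption rather than a confirmed fix: materialise the filtered list once, disable AutoDetectChanges, and use a fresh context per batch. ClientContext and its Clients set are hypothetical names standing in for the post's repository.

var clients = filteredList.ToList(); // materialise once; Skip/Take over a lazy IEnumerable re-enumerates each pass
const int batchSize = 100;

for (int i = 0; i < clients.Count; i += batchSize)
{
    var segment = clients.Skip(i).Take(batchSize);

    // A fresh context per batch keeps the change tracker small.
    using (var ctx = new ClientContext()) // hypothetical DbContext
    {
        ctx.Configuration.AutoDetectChangesEnabled = false;
        foreach (var client in segment)
            ctx.Clients.Add(client);
        ctx.SaveChanges();
    }
}

Whether this matches the actual cause can only be confirmed by timing the Add loop and SaveChanges separately in the post's own code.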