Postgres multi-update with multiple `where` cases - postgresql

Excuse what seems like it could be a duplicate. I'm familiar with multiple updates in Postgres... but I can't seem to figure out a way around this one...
I have a photos table with the following columns: id (primary key), url, sort_order, and owner_user_id.
We would like our interface to let users reorder their existing photos in a collection view. When a drag-reorder interaction completes, I send a POST body to our API with the following:
req.body.photos = [{id: 345, order: 1}, {id: 911, order: 2}, ...<etc>]
I can then run the following query in a loop, once per item in the array.
photos.forEach(function (item) {
  db.runQuery('update photos set sort_order=$1 where id=$2 and owner_user_id=$3', [item.order, item.id, currentUserId])
})
It's generally frowned upon to run database queries inside loops, so if there's any way this can be done with one query, that would be fantastic.
Much thanks in advance.

Running a select query inside of a loop is definitely questionable, but I don't think multiple updates is necessarily frowned upon if the data you are updating doesn't natively reside on the database. To do these as separate transactions, however, might be.
My recommendation would be to wrap all known updates in a single transaction. This is not only kinder to the database (compile once, execute many, commit once), but this is an ACID approach to what I believe you are trying to do. If, for some reason, one of your updates fails, they will all fail. This prevents you from having two photos with an order of "1."
I didn't recognize your language, but here is an example of what this might look like in C#:
using Npgsql;
using NpgsqlTypes;

NpgsqlConnection conn = new NpgsqlConnection(connectionString);
conn.Open();
NpgsqlTransaction trans = conn.BeginTransaction();
// Create the command once; only the parameter values change per photo.
NpgsqlCommand cmd = new NpgsqlCommand("update photos set sort_order=:SORT where id=:ID",
    conn, trans);
cmd.Parameters.Add(new NpgsqlParameter("SORT", NpgsqlDbType.Integer));
cmd.Parameters.Add(new NpgsqlParameter("ID", NpgsqlDbType.Integer));
foreach (var photo in photos)
{
    cmd.Parameters[0].Value = photo.SortOrder;
    cmd.Parameters[1].Value = photo.Id;
    cmd.ExecuteNonQuery();
}
// A single commit makes all the updates take effect together.
trans.Commit();
I think in Perl, for example, it would be even simpler -- turn off DBI AutoCommit and commit after the updates.
CAVEAT: Of course, add error trapping -- I was just illustrating what it might look like.
Also, I changed your update SQL. If "Id" is the primary key, I don't think you need the additional owner_user_id=$3 clause to make it work.
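For the single-query version the question asks for, one option that the answer above does not cover is Postgres's UPDATE ... FROM with a VALUES list. The sketch below, in Python, only builds the SQL text and the parameter list; it does not talk to a database. `build_reorder_query` is a hypothetical helper name, the `$n` placeholder style is assumed from the question's driver, and the `::int` casts assume both columns are integers.

```python
def build_reorder_query(photos, owner_user_id):
    """Build one UPDATE statement that reorders many photos at once.

    `photos` is a list of dicts like {"id": 345, "order": 1}.
    Returns (sql, params) using $1, $2, ... placeholders.
    """
    if not photos:
        raise ValueError("photos must not be empty")

    rows = []
    params = []
    for item in photos:
        n = len(params)
        # Cast each placeholder so the VALUES columns are typed as integers.
        rows.append(f"(${n + 1}::int, ${n + 2}::int)")
        params.extend([item["id"], item["order"]])

    # The owner id is the last parameter; it keeps the per-user check
    # from the question's original query.
    params.append(owner_user_id)
    sql = (
        "UPDATE photos AS p SET sort_order = v.sort_order "
        f"FROM (VALUES {', '.join(rows)}) AS v(id, sort_order) "
        f"WHERE p.id = v.id AND p.owner_user_id = ${len(params)}"
    )
    return sql, params


sql, params = build_reorder_query(
    [{"id": 345, "order": 1}, {"id": 911, "order": 2}], owner_user_id=7
)
print(sql)
print(params)
```

Because this is a single statement, it either applies every new sort_order or none of them, which gives the same all-or-nothing behaviour as the explicit transaction.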

Related

SQLAlchemy - Bulk-insert with multiple cascade & back_populates relationships

I have tried to optimize our insertions to the database, which is currently the bottleneck slowing down our pipeline. I decided to start by speeding up our data_generator used for testing, since I thought it would be an easy place to start. All the tables are empty at first; they are then populated and used in various tests.
Currently, we do pretty much all insertions with Session.add(entry) or in some cases bulked entries with add_all(entries), which does not improve the speed that much.
The goal was to do more insertions at once and spend less time communicating back and forth with the database. I tried various bulk insert methods (bulk_save_objects, bulk_insert_mappings, and ORM and Core methods with INSERT INTO, COPY, IMPORT, ...), but I got nothing to work properly: foreign key constraint violations, duplicate keys, or tables not getting populated.
I will show an example of a table that would previously be added with add_all() in a run_transaction.
class News(NewsBase):
    __tablename__ = 'news'
    news_id = Column(UUID(as_uuid=True), primary_key=True, nullable=False)
    url_visit_count = Column('url_visit_count', Integer, default=0)
    # One to many
    sab_news = relationship("sab_news", back_populates="news")
    sent_news = relationship("SenNews", back_populates="news")
    scope_news = relationship("ScopeNews", back_populates="news")
    news_content = relationship("NewsContent", back_populates="news")
    # One to one
    other_news = relationship("other_news", uselist=False, back_populates="news")
    # Many to many
    companies = relationship('CompanyNews', back_populates='news', cascade="all, delete")
    aggregating_news_sources = relationship("AggregatingNewsSource", secondary=NewsAggregatingNewsSource,
                                            back_populates="news")

    def __init__(self, title, language, news_url, publish_time):
        self.news_id = uuid4()
        super().__init__(title, language, news_url, publish_time)
We have many tables built like this, some with more relations, and my conclusion now is that having many different relationships that use back_populates and update each other does not allow for fast bulk insertions. Am I wrong?
One of my current solutions, which decreased our execution time from 120s to 15s for a regular data_generator run for testing, looks like this:
def write_news_to_db(news, news_types, news_sources, company_news):
    write_bulk_in_chunks(news_types)
    write_bulk_in_chunks(news_sources)

    def write_news(session):
        enable_batch_inserting(session)
        session.add_all(news)

    def write_company_news(session):
        session.add_all(company_news)

    engine = create_engine(
        get_connection_string("name"),
        echo=False,
        executemany_mode="values")
    run_transaction(create_session(engine=engine), lambda s: write_news(s))
    run_transaction(create_session(), lambda s: write_company_news(s))
I used the sqlalchemy_batch_inserts library (on GitHub) together with the Psycopg2 Fast Execution Helpers, with executemany_mode="values" set.
I did this by creating a new engine just for these insertions. It did work, but this in itself seems like bad practice. The new engine connects to the same database.
Anyway, this does seem to work, but it is still not the execution speed I want, especially when we are initially working with empty tables.
Ideally, I would rather not use this hacky solution, and I would avoid bulk insertions altogether, since SQLAlchemy recommends against them precisely to avoid the problems I have faced.
But how does one construct queries to properly do bulk_insertions in cases of complex Tables like these - should we re-design our tables or is it possible?
Using Multi-row insertions within the run_transaction with ORM or CORE would be ideal, but I haven't been able to do it.
Any recommendations or help would be much appreciated!
TL;DR: Bulk insertion with multiple relationships, back_populates, cascade. How is it supposed to be done?
CockroachDB supports bulk insertions using multi-row insert for existing tables as well as IMPORT statements for new tables - https://www.cockroachlabs.com/docs/stable/insert.html. Have you considered using these options directly?

How to optimise this ef core query?

I'm using EF Core 3.0 code first with an MSSQL database. I have a big table with ~5 million records. I have indexes on ProfileId, EventId and UnitId. This query takes ~25-30 seconds to execute. Is this normal, or is there a way to optimize it?
await (from x in _dbContext.EventTable
       where x.EventId == request.EventId
       group x by new { x.ProfileId, x.UnitId } into grouped
       select new
       {
           ProfileId = grouped.Key.ProfileId,
           UnitId = grouped.Key.UnitId,
           Sum = grouped.Sum(a => a.Count * a.Price)
       }).AsNoTracking().ToListAsync();
I tried to loop through profileIds, adding another WHERE clause and removing ProfileId from the grouping parameter, but it was slower.
Capture the SQL being executed with a profiling tool (SSMS has one, or Express Profiler), then run that within SSMS with the execution plan enabled. This may highlight an indexing improvement. If the execution time in SSMS roughly correlates to what you're seeing in EF, then the only real avenue of improvement will be hardware on the SQL box. You are running a query that will touch 5m rows any way you look at it.
Operations like this are not that uncommon, just not something that a user would expect to sit and wait for. This is more of a reporting-type request so when faced with requirements like this I would look at options to have users queue up a request where they can receive a notification when the operation completes to fetch the results. This would be set up to prevent users from repeatedly requesting updates ("not sure if I clicked" type spams) or also considerations to ensure too many requests from multiple users aren't kicked off simultaneously. Ideally this would be a candidate to run off a read-only reporting replica rather than the read-write production DB to avoid locks slowing/interfering with regular operations.
Try removing ToListAsync(), or replacing it with AsQueryableAsync(). Adding ToList slows performance down.
await (from x in _dbContext.EventTable
where x.EventId == request.EventId
group x by new { x.ProfileId, x.UnitId } into grouped
select new
{
ProfileId = grouped.Key.ProfileId,
UnitId = grouped.Key.UnitId,
Sum = grouped.Sum(a => a.Count * a.Price)
});

Using NEsper to read LogFiles for reporting purposes

We are evaluating NEsper. Our focus is to monitor data quality in an enterprise context. In an application we are going to log every change on a lot of fields - for example in an "order". So we have fields like
Consignee name
Consignee street
Orderdate
...and many more fields. As you can imagine, the log files are going to grow big.
Because the data is sent by different customers and is imported in the application, we want to analyze how many (and which) fields are updated from "no value" to "a value" (just as an example).
I tried to build a test case with just the fields
order reference
fieldname
fieldvalue
For my test cases I added two statements with context-information. The first one should just count the changes in general per order:
epService.EPAdministrator.CreateEPL("create context RefContext partition by Ref from LogEvent");
var userChanges = epService.EPAdministrator.CreateEPL("context RefContext select count(*) as x, context.key1 as Ref from LogEvent");
The second statement should count updates from "no value" to "a value":
epService.EPAdministrator.CreateEPL("create context FieldAndRefContext partition by Ref,Fieldname from LogEvent");
var countOfDataInput = epService.EPAdministrator.CreateEPL("context FieldAndRefContext SELECT context.key1 as Ref, context.key2 as Fieldname,count(*) as x from pattern[every (a=LogEvent(Value = '') -> b=LogEvent(Value != ''))]");
To read the test-logfile I use the csvInputAdapter:
CSVInputAdapterSpec csvSpec = new CSVInputAdapterSpec(ais, "LogEvent");
csvInputAdapter = new CSVInputAdapter(epService.Container, epService, csvSpec);
csvInputAdapter.Start();
I do not want to use the update listener, because I am interested only in the result of all events (probably this is not possible and this is my failure).
So after reading the CSV (csvInputAdapter.Start() returns) I read all events that are stored in the statement's NewEvents stream.
Using 10 entries in the CSV file, everything works fine. Using 1 million lines, it takes way too long. I tried without any EPL statement (so just the CSV import) and it took about 5 seconds. With the first statement (not the complex pattern statement) I always stop after 20 minutes, so I am not sure how long it would take.
Then I changed the EPL of the first statement: I introduced a group by instead of the context.
select Ref,count(*) as x from LogEvent group by Ref
Now it is really fast - but I do not have any results in my NewEvents Stream after the CSVInputAdapter comes back...
My questions:
1) Is the way I want to use NEsper a supported use case or is this the root cause of my failure?
2) If this is a valid use case: Where is my mistake? How can I get the results I want in a performant way?
3) Why are there no NewEvents in my EPL-statement when using "group by" instead of "context"?
To 1), yes
To 2) this is valid, but your EPL design is probably a little inefficient. You would want to understand how patterns work: they use filter indexes and index entries, which are more expensive to create but are extremely fast at discarding unneeded events.
Read:
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#processingmodel_indexes_filterindexes and also
http://esper.espertech.com/release-7.1.0/esper-reference/html_single/index.html#pattern-walkthrough
Try the "previous" perhaps. Measure performance for each statement separately.
Also I don't think the CSV adapter is optimized for processing a large file. I think CSV may not stream.
To 3) check your code? Don't use CSV file for large stuff. Make sure a listener is attached.

T-SQL - Trying to query something across all databases on my server

I've got an environment where my server is hosting a variable number of databases, all of which utilize the same table structures/schemas. I need to pull a sum of customers that meet a certain series of constraints with say, the user table. I also need to show which database I am showing the sum for.
I already know all I need to get the sum in a db by db query, but what I'm really looking to do is have one script that hits all of the non-system DBs currently on my server to grab this info.
Please forgive my ignorance in this, just starting out.
Update-
So, to clarify things somewhat; I'm using MS SQL 2014. I know how to pull a listing of the dbs I want to hit by using:
SELECT name
FROM sys.databases
WHERE name not in ('master', 'model', 'msdb', 'tempdb')
AND state = 0
And for the purposes of gathering the data I need from each, let's just say I've got something like:
select count(u.userid)
from users u
join UserAttributes ua on u.userid = ua.userid
where ua.status = 2
New Update:
So, I went ahead and added the stored procedure sp_foreachdb as suggested by @Philip Kelley, and I'm now running into a problem when trying to run it (admittedly, I can tell I'm closer to a solution). This is what I'm using to call the SP:
USE [master]
GO
DECLARE @return_value int
EXEC @return_value = [dbo].[sp_foreachdb]
    @command = N'select count(userid) as number from ?..users',
    @print_dbname = 1,
    @user_only = 1
SELECT 'Return Value' = @return_value
GO
This provides a nice and clean output showing a count, but what I'd like to see is the db name in addition to the count, something like this:
|[DB_NAME]|[COUNT]|
But for each DB
Is this even possible?
Source Code: https://codereview.stackexchange.com/questions/113063/executing-dynamic-sql-programmatically
Example Usage:
declare @options int = (
    select a.ExcludeSystemDatabases
    from dbo.ForEachDatabaseOptions() as a
);
execute dbo.usp_ForEachDatabase
    @Command = N'print Db_Name();'
    , @Options = @options;
@Command can be anything you want but obviously it needs to be a query that every single database can understand. @Options currently has 3 built-in settings but can be expanded however you see fit.
I wrote this to mimic/expand upon the master.sys.sp_MSforeachdb procedure but it could still use a little bit of polish (especially around the "logic" that replaces ? with the current database name).
Enumerate the databases from schema / sysdatabases. At least in situations without replication, excluding db_ids 1 to 4 as system databases should be reasonably robust:
SELECT [name] FROM master.dbo.sysdatabases WHERE dbid NOT IN (1,2,3,4)
Other methods exist, see here: Get list of databases from SQL Server and here: SQL Server: How to tell if a database is a system database?
Then loop over the result set of the first query with a cursor, store each database name in a sysname variable, and prefix the query or stored procedure call with it to construct a series of statements like this:
SELECT column FROM databasename.schema.Viewname WHERE ...
and call that using the string execute function
EXECUTE('SELECT ... FROM '+@fully_qualified_table_name+' WHERE ...')
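The asker's follow-up (the database name next to each count) could also be handled by generating one UNION ALL statement from the database list. Here is a minimal sketch in Python that only builds the T-SQL text. It assumes the names come from a query like the sys.databases one in the question and that every database has a dbo.users table; `quote_ident` and `build_cross_db_count` are hypothetical helpers, not part of any library.

```python
def quote_ident(name):
    """Bracket-quote a SQL Server identifier, doubling any closing bracket."""
    return "[" + name.replace("]", "]]") + "]"


def build_cross_db_count(db_names):
    """Return one T-SQL statement yielding (db_name, user_count) per database."""
    if not db_names:
        raise ValueError("db_names must not be empty")
    parts = []
    for name in db_names:
        # Escape single quotes so the name is safe inside a string literal.
        literal = name.replace("'", "''")
        parts.append(
            f"SELECT N'{literal}' AS db_name, COUNT(*) AS user_count "
            f"FROM {quote_ident(name)}.dbo.users"
        )
    # One result set with a row per database.
    return "\nUNION ALL\n".join(parts)


print(build_cross_db_count(["Shop1", "Shop2"]))
```

The generated text can then be run as a single batch, which returns one row per database instead of one result set per database.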
There’s the undocumented system procedure, sp_msForEachDB, as found in the master database. Many pundits on the internet recommend not using this, as under obscure fringe cases it can be unreliable and somehow skip random databases. Count me as one of them; this caused me serious grief a few months back.
You can write your own routine to provide this kind of functionality. This is a common task, however, and many people have already done it and posted their code online… so why re-invent the wheel?
@kittoes0124 posted a link to “usp_ForEachDatabase”. This probably works, though pro forma I hate any stored procedures that begin with usp_. I ended up with Aaron Bertrand’s utility, which can be found at http://www.mssqltips.com/sqlservertip/2201/making-a-more-reliable-and-flexible-spmsforeachdb/.
Install a version of this routine, figure out how it works, plug in your script, and go!

iPhone Web App database SQLite and MySQL

I am making a planner application for the iphone that can work online to store tasks in a mysql server. However, when I attempt to synchronise the two databases I have a problem. The thing seems to be that I can't insert more than one set of values at once into the iPhone database:
INSERT INTO planner (title, duedate, submitdate, subject, info) VALUES ('Poster', '21092010', '28092010', 'chemistry', 'elements poster'), ('Essay', '22092010', '25092010', 'english', 'essay on shakespeare')
This does not work. There is no error or anything like that, it simply does nothing, it sometimes puts the first one in, but not the other. Perhaps I am going about this the wrong way, so to give the situation:
I have an array with a list of these properties, call them 1, 2, 3, 4 and 5, I need all of the array putting into the local database.
People on this site seem to be able to do this so I hope you can help,
Thanks,
Tom Ludlow
The SQLite INSERT syntax only supports single-row inserts (multi-row VALUES lists were only added in SQLite 3.7.11). This should not be a problem.
Why? Because you should be using parameterized queries, not concatenating a giant string together and hoping that you've done all the "escaping" properly so that there are no SQL injection vulnerabilities. Additionally, sticking everything into the statement increases parsing overheads (you've spent all that effort escaping things, and now SQLite has to spend some more effort to un-escape things).
The suggested way to use a statement is something like this:
sqlite3_exec(db, "BEGIN", NULL, NULL, NULL);
sqlite3_prepare_v2(db, "INSERT INTO planner (title,duedate,submitdate,subject,info) VALUES (?,?,?,?,?)", -1, &stmt, NULL);
For each row you want to insert,
sqlite3_bind() the five parameters (bound parameters are 1-based, so 1, 2, 3, 4, 5).
sqlite3_step(). It should return SQLITE_DONE.
sqlite3_reset() (so you can reuse the statement) and sqlite3_clear_bindings() (for good measure)
sqlite3_finalize() to destroy the statement.
sqlite3_exec(db, "COMMIT", NULL, NULL, NULL);
I've wrapped the inserts in a transaction to increase performance (outside of a transaction, each INSERT happens in its own transaction, which I've found to be significantly slower...).
For an Objective-C wrapper around sqlite, you might try FMDB (it has a reasonably nice wrapper around sqlite3_bind_*(), except it uses SQLITE_STATIC when it should probably be using SQLITE_TRANSIENT or retaining/copying its arguments).
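The prepare/bind/step/commit steps in this answer can be sketched with Python's built-in sqlite3 module, which is built on the same SQLite C library. This is an illustration only, not the iPhone code itself: it assumes an in-memory database, and the table definition (all TEXT columns) is a guess based on the column names in the question.

```python
import sqlite3

rows = [
    ("Poster", "21092010", "28092010", "chemistry", "elements poster"),
    ("Essay", "22092010", "25092010", "english", "essay on shakespeare"),
]

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the local database
conn.execute(
    "CREATE TABLE planner (title TEXT, duedate TEXT, submitdate TEXT, "
    "subject TEXT, info TEXT)"
)

# `with conn` commits on success and rolls back if an exception is raised,
# so all rows are inserted in a single transaction.
with conn:
    # One parameterized single-row INSERT, executed once per row.
    conn.executemany(
        "INSERT INTO planner (title, duedate, submitdate, subject, info) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )

count = conn.execute("SELECT COUNT(*) FROM planner").fetchone()[0]
print(count)
```

The values never become part of the SQL text, so no escaping is needed, and the statement is reused for every row.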
Have you tried to split your inserts so you only insert a single row at a time?
tc hints at this in his answer, though using native code.
Try looking at this example with two inserts:
/* Substitute with your openDatabase call */
var db = openDatabase('yourdb', '1.0', 'Planner DB', 2 * 1024 * 1024);
db.transaction(function (tx) {
  tx.executeSql('INSERT INTO planner (title, duedate, submitdate, subject, info) VALUES ("Poster", "21092010", "28092010", "chemistry", "elements poster")');
  tx.executeSql('INSERT INTO planner (title, duedate, submitdate, subject, info) VALUES ("Essay", "22092010", "25092010", "english", "essay on shakespeare")');
});
/Mogens