SQLAlchemy - Bulk insert with multiple cascade & back_populates relationships - PostgreSQL

I have been trying to optimize our insertions into the database, which is currently the bottleneck slowing down our pipeline. I decided to start by speeding up our data_generator used for testing: all the tables are empty at first and are then populated and used in various tests, so it seemed like an easy place to start.
Currently, we do pretty much all insertions with Session.add(entry) or, in some cases, batched entries with add_all(entries), which does not improve the speed much.
The goal was to insert more rows at once and spend less time communicating back and forth with the database, so I tried various bulk-insert methods (bulk_save_objects, bulk_insert_mappings, and ORM/Core approaches with INSERT INTO, COPY, IMPORT ...), but I could not get anything to work properly: foreign key constraints, duplicate keys, or tables not getting populated at all.
Here is an example of a table that would previously be added with add_all() inside a run_transaction.
class News(NewsBase):
    __tablename__ = 'news'

    news_id = Column(UUID(as_uuid=True), primary_key=True, nullable=False)
    url_visit_count = Column('url_visit_count', Integer, default=0)

    # One to many
    sab_news = relationship("sab_news", back_populates="news")
    sent_news = relationship("SenNews", back_populates="news")
    scope_news = relationship("ScopeNews", back_populates="news")
    news_content = relationship("NewsContent", back_populates="news")

    # One to one
    other_news = relationship("other_news", uselist=False, back_populates="news")

    # Many to many
    companies = relationship('CompanyNews', back_populates='news', cascade="all, delete")
    aggregating_news_sources = relationship("AggregatingNewsSource",
                                            secondary=NewsAggregatingNewsSource,
                                            back_populates="news")

    def __init__(self, title, language, news_url, publish_time):
        self.news_id = uuid4()
        super().__init__(title, language, news_url, publish_time)
We have many tables built like this, some with more relations, and my conclusion so far is that having many different relationships that back_populate and update each other does not allow for fast bulk insertions. Am I wrong?
One of my current solutions, which decreased our execution time from 120 s to 15 s for a regular test data_generator, looks like this:
def write_news_to_db(news, news_types, news_sources, company_news):
    write_bulk_in_chunks(news_types)
    write_bulk_in_chunks(news_sources)

    def write_news(session):
        enable_batch_inserting(session)
        session.add_all(news)

    def write_company_news(session):
        session.add_all(company_news)

    engine = create_engine(
        get_connection_string("name"),
        echo=False,
        executemany_mode="values")

    run_transaction(create_session(engine=engine), lambda s: write_news(s))
    run_transaction(create_session(), lambda s: write_company_news(s))
I used the sqlalchemy_batch_inserts library (GitHub) together with the psycopg2 fast execution helpers, setting executemany_mode="values".
I did this by creating a new engine just for these insertions. It did work, but creating a separate engine for this by itself seems like bad practice, even though it points at the same database.
Anyway, this does seem to work, but it is still not the execution speed I want - especially when we are initially working with empty tables.
Ideally, I would like to avoid this hacky solution, and also avoid the bulk-insert methods, since SQLAlchemy recommends against them - precisely because of the problems I have faced.
But how does one construct queries to properly bulk-insert in the case of complex tables like these - should we redesign our tables, or is it possible as they are?
Using multi-row insertions within run_transaction with the ORM or Core would be ideal, but I haven't been able to get it working.
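To make it more concrete, this is roughly the shape I'm hoping for: generate the UUIDs up front in Python so that dependent rows can reference their parents, then insert each table with one multi-row statement inside run_transaction. This is only a sketch - News/CompanyNews are the models above, run_transaction/create_session/engine are our existing helpers, and the column names in the dicts are illustrative, not our real schema:
from datetime import datetime
from uuid import uuid4

def bulk_write_news(session, news_rows, company_news_rows):
    # Parents first, then children, so the FK from company_news to news
    # is satisfied within the same transaction.
    session.execute(News.__table__.insert(), news_rows)
    session.execute(CompanyNews.__table__.insert(), company_news_rows)

news_rows = [
    {"news_id": uuid4(), "title": f"title {i}", "language": "en",
     "news_url": f"https://example.com/{i}", "publish_time": datetime(2021, 1, 1)}
    for i in range(1000)
]
company_news_rows = [
    {"news_id": row["news_id"], "company_id": uuid4()}   # company_id is made up
    for row in news_rows
]

run_transaction(create_session(engine=engine),
                lambda s: bulk_write_news(s, news_rows, company_news_rows))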
Any recommendations or help would be much appreciated!
TL;DR: Bulk insertion with multiple relationships, back_populates and cascade - how is it supposed to be done?

CockroachDB supports bulk insertions using multi-row insert for existing tables as well as IMPORT statements for new tables - https://www.cockroachlabs.com/docs/stable/insert.html. Have you considered using these options directly?
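For illustration, a multi-row INSERT can also be sent directly from Python with psycopg2's execute_values helper, without going through the ORM at all. This is only a sketch - the connection string, table and column names are placeholders:
import uuid
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("postgresql://user:pass@localhost:26257/mydb")  # placeholder
rows = [
    (str(uuid.uuid4()), f"Title {i}", "en", f"https://example.com/{i}", "2021-01-01")
    for i in range(1000)
]
with conn, conn.cursor() as cur:
    # execute_values expands the single %s into one multi-row VALUES list,
    # so the rows go over the wire in batches of page_size rows per INSERT.
    execute_values(
        cur,
        "INSERT INTO news (news_id, title, language, news_url, publish_time) VALUES %s",
        rows,
        page_size=1000,
    )
conn.close()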

Related

Design question for a table with too many joins OR polymorphic relations in Postgres 11.7

I've been given a table that I'm not sure how to design. I'm hoping for some design suggestions, or pointers in the right direction. The table is called edge and is meant to store some event traces, and IDs that link out to a host of possible lookup tables. Leaving out everything but IDs, here's what the table contains, all UUIDs:
ID
InvID
OrgID
FacilityID
FromAssemblyID
FromAssociatedTo
FromAssociatedToID
FromClinicID
FromFacilityDepartmentID
FromFacilityID
FromFacilityLocationID
FromScanAtFacilityID
FromScanID
FromSCaseID
FromSterilizerLoadID
FromWasherLoadID
FromWebUserID
ToAssemblyID
ToAssociatedTo
ToAssociatedToID
ToClinicID
ToFacilityDepartmentID
ToFacilityID
ToFacilityLocationID
ToNodeDTS
ToScanAtFacilityID
ToScanID
ToSCaseID
ToSterilizerLoadID
ToUserName
ToWasherLoadID
ToWebUserID
That's an overwhelming number of IDs to possibly join on. I remember reading that the Postgres planner kind of gives up when you've got a dozen+ joins, the idea being that there are so many permutations to explore that the planning time could quickly overwhelm the query time. If you boil it down, the "from" and "to" links are only ever going to have one key value across all of those fields. So, implemented as polymorphic/promiscuous relations, it would be something like this:
ID
InvID
OrgID
FacilityID
FromID
FromType
ToID
ToType
ToWebUserID
This table is going to be ginormous, so speed is/will be a consideration.
I encouraged the author not to use a polymorphic design, although the appeal is obvious. (I like Karwin's SQL Antipatterns book.) But now, confronted with nearly three dozen IDs, I'm a bit stumped.
Is there a common solution to this kind of problem? Namely, where you've got a central table like this with connections to a wide variety of possible tables? I don't have a Data Warehousing background, but this looks somewhat like that. (The author of this table has read Kimball's books, but not done any Data Warehouse implementations either.)
Important: We're using JOIN to do lookups on related values that might change, we're not using it to change the size of the result set. Just pretend it would always be LEFT JOIN.
With that in mind, what I've thought of is to skip joining on the From and To IDs, and instead use custom function calls to look up required values from the related tables, like (pseudo-code):
GetUserName(uuid) : citext
...and so on for other values of interest in this and other tables...
The function would return '' when the UUID is the all-zeros placeholder.
I appreciate that this isn't the crispest question in the history of SO; what I'm hoping for is pointers in a fruitful direction.
This smacks of “premature optimization” (which is a source of evil) based on something that you “remember reading”, so maybe some enlightenment about join optimization will help.
One rule of thumb that I follow in questions like this is to model things so that your queries become simple and natural. Experience shows that that often leads to good performance.
I assume that the table you show is the fact table of a star schema, and the foreign keys point to the many dimension tables, so that your query will look like
SELECT ...
FROM fact
JOIN dim1 ON fact.dim1_id = dim1.id
JOIN dim2 ON fact.dim2_id = dim2.id
JOIN dim3 ON fact.dim3_id = dim3.id
...
WHERE dim1.col1 = ...
AND dim2.col2 BETWEEN ... AND ...
AND dim3.col3 < ...
...
Now PostgreSQL will by default only consider all join permutations of the first eight tables (join_collapse_limit), and the rest of the tables are just joined in the order in which they appear in the query.
Moreover, if the number of tables reaches the threshold of 12 (geqo_threshold), the genetic query optimizer takes over, a component that simulates evolution by mutation and survival of the fittest with randomly chosen execution plans (really!) and consequently doesn't always come up with the same execution plan for the same query.
So my advice would be to write the queries in a way that the first seven dimension tables are the ones with the biggest chance of reducing the number of result rows (based on the WHERE conditions). You can also increase join_collapse_limit, because if your queries take a long time to run anyway, you can easily afford to let the planner spend more time thinking about the best plan.
Then you'd set geqo = off to disable the genetic query optimizer.
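For illustration, a rough sketch (assuming psycopg2) of applying those two settings on the session that runs the query - the values and the query itself are only examples:
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection info
with conn, conn.cursor() as cur:
    # Let the planner consider join orders across more than the default 8 tables ...
    cur.execute("SET join_collapse_limit = 16")
    # ... and keep the exhaustive planner instead of the genetic optimizer.
    cur.execute("SET geqo = off")
    # Then run the big star-schema join on this same session, e.g.:
    cur.execute("""
        SELECT count(*)
        FROM fact
        JOIN dim1 ON fact.dim1_id = dim1.id
        JOIN dim2 ON fact.dim2_id = dim2.id
        WHERE dim1.col1 = %s
    """, ("some value",))
    print(cur.fetchone())
conn.close()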
If you design your queries according to these principles, you should be able to get good execution plans without messing up the data model.

Entity framework generates inefficient SQL

I wonder why Entity Framework generates such an inefficient SQL query. In my code I expected the WHERE to act upon the INCLUDE:
db.Employment.Where(x => x.Active).Include(x => x.Employee).Where(x => x.Employee.UserID == UserID)
but I ended up with a double SQL JOIN:
SELECT [x].[ID], [x].[Active], [x].[CurrencyID], [x].[DepartmentID], [x].[EmplEnd], [x].[EmplStart], [x].[EmployeeID], [x].[HolidayGroupID], [x].[HourlyCost], [x].[JobTitle], [x].[ManagerID], [x].[WorkScheduleGroupID], [e].[ID], [e].[Active], [e].[Address], [e].[BirthDate], [e].[CitizenshipID], [e].[City], [e].[CountryID], [e].[Email], [e].[FirstName], [e].[Gender], [e].[LastName], [e].[Note], [e].[Phone], [e].[PostalCode], [e].[TaxNumber], [e].[UserID]
FROM [Employment] AS [x]
INNER JOIN [Employee] AS [x.Employee] ON [x].[EmployeeID] = [x.Employee].[ID]
INNER JOIN [Employee] AS [e] ON [x].[EmployeeID] = [e].[ID]
WHERE ([x].[Active] = 1) AND ([x.Employee].[UserID] = @__UserID_0)
I found out that this query will create better SQL:
db.Employment.Where(x => x.Active).Where(x => x.Employee.UserID == UserID)
SELECT [x].[ID], [x].[Active], [x].[CurrencyID], [x].[DepartmentID], [x].[EmplEnd], [x].[EmplStart], [x].[EmployeeID], [x].[HolidayGroupID], [x].[HourlyCost], [x].[JobTitle], [x].[ManagerID], [x].[WorkScheduleGroupID]
FROM [Employment] AS [x]
INNER JOIN [Employee] AS [x.Employee] ON [x].[EmployeeID] = [x.Employee].[ID]
WHERE ([x].[Active] = 1) AND ([x.Employee].[UserID] = @__UserID_0)
However, the problem here is that the referenced entities are not retrieved from the DB.
Why don't the two pieces of code produce the same SQL?
The SQL is different because the statements are different.
Entity Framework does produce inefficient T-SQL; it always has. By abstracting away the subtleties that are necessary for well-performing SQL and replacing them with "belt and braces" alternatives that nearly always work, you sacrifice performance for utility.
If you need good performance, write the SQL yourself. Dapper works well for me. You can't realistically expect a "one size fits all" solution to come up with the best code for your specific situation. You can do this across the board or just where you need to.
Unless you have high volume or specific performance requirements, get on with it and use whatever you find easiest. If you need to tune your queries to your database, you are going to have to learn the details of your database engine and implement the queries yourself. If you are expecting the next iteration of Entity Framework to be the magic bullet that allows you fast, efficient SQL data access with minimal knowledge, good luck.
P.S.
Off-topic, but NoSQL probably isn't the answer either; it's just a different class of database.

Postgres multi-update with multiple `where` cases

Excuse what seems like it could be a duplicate. I'm familiar with multiple updates in Postgres... but I can't seem to figure out a way around this one...
I have a photos table with the following columns: id (primary key), url, sort_order, and owner_user_id.
We would like our interface to allow the user to reorder their existing photos in a collection view, in which case, when a drag-reorder interaction is complete, I am able to send a POST body to our API with the following:
req.body.photos = [{id: 345, order: 1}, {id: 911, order: 2}, ...]
I can then turn around and run the following query in a loop for each item in the array.
photos.forEach(function (item) {
    db.runQuery(
        'update photos set sort_order=$1 where id=$2 and owner_user_id=$3',
        [item.order, item.id, currentUserId]
    )
})
It's generally frowned upon to run database queries inside loops, so if there's any way this can be done with one query, that would be fantastic.
Much thanks in advance.
Running a select query inside of a loop is definitely questionable, but I don't think multiple updates are necessarily frowned upon if the data you are updating doesn't natively reside in the database. Doing these as separate transactions, however, might be.
My recommendation would be to wrap all known updates in a single transaction. This is not only kinder to the database (compile once, execute many, commit once), but this is an ACID approach to what I believe you are trying to do. If, for some reason, one of your updates fails, they will all fail. This prevents you from having two photos with an order of "1."
I didn't recognize your language, but here is an example of what this might look like in C#:
NpgsqlConnection conn = new NpgsqlConnection(connectionString);
conn.Open();
NpgsqlTransaction trans = conn.BeginTransaction();
NpgsqlCommand cmd = new NpgsqlCommand(
    "update photos set sort_order=@SORT where id=@ID", conn, trans);
cmd.Parameters.Add(new NpgsqlParameter("SORT", NpgsqlDbType.Integer));
cmd.Parameters.Add(new NpgsqlParameter("ID", NpgsqlDbType.Integer));
foreach (var photo in photos)
{
    cmd.Parameters[0].Value = photo.SortOrder;
    cmd.Parameters[1].Value = photo.Id;
    cmd.ExecuteNonQuery();
}
trans.Commit();
I think in Perl, for example, it would be even simpler -- turn off DBI AutoCommit and commit after the inserts.
CAVEAT: Of course, add error trapping -- I was just illustrating what it might look like.
Also, I changed your update SQL: if id is the primary key, I don't think you need the additional owner_user_id=$3 clause to make it work.
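And a sketch of the same idea in Python with psycopg2, in case that is closer to your stack (untested; connection string and sample data are placeholders):
import psycopg2

photos = [{"id": 345, "order": 1}, {"id": 911, "order": 2}]  # from the request body

conn = psycopg2.connect("dbname=mydb user=me")
try:
    with conn:  # one transaction: commits on success, rolls back on any error
        with conn.cursor() as cur:
            for photo in photos:
                cur.execute(
                    "UPDATE photos SET sort_order = %s WHERE id = %s",
                    (photo["order"], photo["id"]),
                )
finally:
    conn.close()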

How to tell if record has changed in Postgres

I have a bit of an "upsert" type of question... but, I want to throw it out there because it's a little bit different than any that I've read on stackoverflow.
Basic problem.
I'm working on moving from MySQL to PostgreSQL 9.1.5 (hosted on Heroku). As part of that, I need to import multiple CSV files every day. Some of the data is sales information and is almost guaranteed to be new and need inserting. But other parts of the data are almost guaranteed to be the same. For example, the CSV files (note: plural) will have POS (point of sale) information in them. This rarely changes (and most likely only via additions). Then there is product information. There are about 10,000 products (the vast majority will be unchanged, but it's possible to have both additions and updates).
The final item (but is important), is that I have a requirement to be able to provide an audit trail/information for any given item. For example, if I add a new POS record, I need to be able to trace that back to the file it was found in. If I change a UPC code or description of a product, then I need to be able to trace it back to the import (and file) where the change came from.
Solution that I'm contemplating.
Since the data is provided to me via CSV, I'm working around the idea that COPY will be the best/fastest way. The structure of the data in the files is not exactly what I have in the database (i.e. the final destination), so I'm copying it into tables in a staging schema that match the CSV (note: one schema per data source). The tables in the staging schemas have before-insert row triggers. These triggers can decide what to do with the data (insert, update or ignore).
For the tables that are most likely to contain new data, the trigger will try to insert first. If the record is already there, it will return NULL (and stop the insert into the staging table). For tables that rarely change, it will query the table and see if the record is found. If it is, I need a way to see if any of the fields have changed (because, remember, I need to show that the record was modified by import x from file y). I obviously could just boilerplate out the code and test each column, but I was looking for something a little more "eloquent" and more maintainable than that.
In a way, what I'm doing is combining an importing system with an audit-trail system. So, in researching audit trails, I reviewed a wiki.postgresql.org article on the topic. It seems like hstore might be a nice way of getting the changes (and being able to easily ignore some columns in the table that aren't important - e.g. "last_modified").
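Roughly the kind of thing I've been playing with to see which columns differ between a staging row and the live row - a sketch only, assuming the hstore extension is installed; the table and column names are made up:
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT s.product_id,
               (hstore(s) - hstore(p)) - ARRAY['last_modified'] AS changed_cols
        FROM   staging.products s
        JOIN   public.products  p USING (product_id)
        WHERE  (hstore(s) - hstore(p)) - ARRAY['last_modified'] <> ''::hstore
    """)
    for product_id, changed in cur.fetchall():
        # Without psycopg2.extras.register_hstore() this arrives as hstore
        # text, e.g. '"upc"=>"012345"'.
        print(product_id, changed)
conn.close()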
I'm about 90% sure it will all work... I've created some testing tables etc and played around with it.
My question?
Is there a better, more maintainable way of accomplishing this task of finding the maybe 3 records out of 10K that require a change to the database? I could certainly write a Python script (or something else) that reads the file and tries to figure out what to do with each record, but that feels horribly inefficient and would lead to lots of round trips.
A few final things:
I don't have control over the input files. I would love it if they only sent me the deltas, but they don't and it's completely outside of my control or influence.
The system is growing and new data sources are likely to be added, which will greatly increase the amount of data being processed (so I'm trying to keep things efficient).
I know this is not a nice, simple SO question (like "how to sort a list in Python"), but I believe one of the great things about SO is that you can ask hard questions and people will share their thoughts on the best way to solve them.
I have lots of similar operations. What I do is COPY to temporary staging tables:
CREATE TEMP TABLE target_tmp AS
SELECT * FROM target_tbl LIMIT 0; -- only copy structure, no data
COPY target_tmp FROM '/path/to/target.csv';
For performance, run ANALYZE - temp. tables are not analyzed by autovacuum!
ANALYZE target_tmp;
Also for performance, maybe even create an index or two on the temp table, or add a primary key if the data allows for that.
ALTER TABLE target_tmp ADD CONSTRAINT target_tmp_pkey PRIMARY KEY (target_id);
You don't need the performance stuff for small imports.
Then use the full scope of SQL commands to digest the new data.
For instance, if the primary key of the target table is target_id ..
Maybe DELETE what isn't there any more?
DELETE FROM target_tbl t
WHERE  NOT EXISTS (
    SELECT 1 FROM target_tmp t1
    WHERE  t1.target_id = t.target_id
);
Then UPDATE what's already there:
UPDATE target_tbl t
SET col1 = t1.col1
FROM target_tmp t1
WHERE t.target_id = t1.target_id
To avoid empty UPDATEs, simply add:
...
AND t.col1 IS DISTINCT FROM t1.col1; -- repeat for relevant columns
Or, if the whole row is relevant:
...
AND t IS DISTINCT FROM t1; -- check the whole row
Then INSERT what's new:
INSERT INTO target_tbl(target_id, col1)
SELECT t1.target_id, t1.col1
FROM target_tmp t1
LEFT JOIN target_tbl t USING (target_id)
WHERE t.target_id IS NULL;
Clean up if your session goes on (temp tables are dropped at end of session automatically):
DROP TABLE target_tmp;
Or use ON COMMIT DROP or similar with CREATE TEMP TABLE.
Code untested, but should work in any modern version of PostgreSQL except for typos.
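If you drive this from an application, a rough psycopg2 sketch of the same flow might look like the following (untested; it reuses the table/column names above and streams the CSV through the client with COPY ... FROM STDIN):
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder
with conn, conn.cursor() as cur:
    cur.execute("CREATE TEMP TABLE target_tmp AS SELECT * FROM target_tbl LIMIT 0")
    with open("/path/to/target.csv") as f:
        # COPY ... FROM STDIN also works when the server can't see the file.
        cur.copy_expert("COPY target_tmp FROM STDIN WITH (FORMAT csv)", f)
    cur.execute("ANALYZE target_tmp")
    # The optional DELETE step from above would go here.
    cur.execute("""
        UPDATE target_tbl t
        SET    col1 = t1.col1
        FROM   target_tmp t1
        WHERE  t.target_id = t1.target_id
        AND    t.col1 IS DISTINCT FROM t1.col1
    """)
    cur.execute("""
        INSERT INTO target_tbl (target_id, col1)
        SELECT t1.target_id, t1.col1
        FROM   target_tmp t1
        LEFT   JOIN target_tbl t USING (target_id)
        WHERE  t.target_id IS NULL
    """)
    cur.execute("DROP TABLE target_tmp")
conn.close()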

Postgres: updating not-changed rows

Say, I have a following query:
UPDATE table_name
SET column_name1 = column_value1, ..., column_nameN = column_valueN
WHERE id = M
The thing is that column_value1, ..., column_valueN have not changed. Will this query really be executed, and how does its performance compare to an update with actually changed data? What if I have about 50 such queries per page with unchanged data?
You need to help PostgreSQL here by specifying only the changed columns and rows. It will go ahead and perform the update on whatever you specify, without checking whether the data has changed.
p.s. This is where ORM comes in handy.
EDIT: You may also be interested in How can I speed up update/replace operations in PostgreSQL?, where the OP went through a lot of trouble to speed up UPDATE performance, when the best performance can be achieved by updating changed data only.
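If you can't easily restrict the statements on the application side, one option is to let the statement itself skip unchanged rows. A rough sketch (assuming psycopg2; the table/column names mirror the question, the values are placeholders):
import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder
with conn, conn.cursor() as cur:
    cur.execute(
        """
        UPDATE table_name
        SET    column_name1 = %(v1)s,
               column_name2 = %(v2)s
        WHERE  id = %(id)s
        AND    (column_name1, column_name2) IS DISTINCT FROM (%(v1)s, %(v2)s)
        """,
        {"id": 42, "v1": "new value 1", "v2": "new value 2"},
    )
    # rowcount is 0 when the row already holds these values, so no new row
    # version is written for unchanged data.
    print(cur.rowcount)
conn.close()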