Insert into subselect slow - postgresql

I am trying to fill a table "SAMPLE" that requires ids from three other tables.
The table "SAMPLE" that needs to be filled holds the following columns:
id (integer, not null, pk)
code (text, not null)
subsystem_id (integer, fk)
system_id (integer, not null, fk)
manufacturer_id (integer, fk)
The current query looks like this:
insert into SAMPLE(system_id, manufacturer_id, code, subsystem_id)
values ((select id from system where initial = 'P'), (select id from manufacturer where name = 'nameXY'), 'P0001', (select id from subsystem where code = 'NAME PATTERN'));
It is ridiculously slow, inserting 8k rows in around a minute.
I'm not sure if this is a really bad query problem or if my postgres configuration is heavily messed up.
For clarification, more table information:
subsystem:
This table holds fixed values (9) with a basic pattern I can access easily.
system
This table holds fixed values (4) that can be identified using the "initial" attribute
manufacturer
This table holds the name of a manufacturer.
The "SAMPLE" table will be the only connection between those tables so I'm not sure if I can use joins.
I'm pretty sure 8k values should be a gigantic joke to insert for a database so I'm really confused.
My specs:
Win 7 x86_64
8GB RAM
intel i5 3470S (QUAD) 2,9 GHZ
Postgres is v9.3
I didn't see any peak during my query so I suspect something is up with my configuration. If you need information about it, let me know.
Note: It is possible that I have codes or names that can not be found in the subsystem or manufacturer tables. Instead of adding nothing, I want to add a NULL value to the cell then.

8000 inserts per minute is roughly 133 per second, or about 7.5 ms per statement.
This is to be expected if the INSERTs happen in a loop, each statement in its own transaction.
Each transaction commits to disk and waits for the disk to confirm that the data is written in durable storage. This is known to be slow.
Add a transaction around the loop with BEGIN and END and it will run at normal speed.
Ideally you wouldn't even have a loop but a more complex query that does a single INSERT to create all the rows from their sources, if possible.
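As a minimal sketch of the first suggestion (assuming the client currently sends each INSERT in autocommit mode and the statements themselves stay unchanged), wrapping the whole batch in one transaction means a single commit at the end instead of one per row:

```sql
BEGIN;

INSERT INTO sample (system_id, manufacturer_id, code, subsystem_id)
VALUES ((SELECT id FROM system WHERE initial = 'P'),
        (SELECT id FROM manufacturer WHERE name = 'nameXY'),
        'P0001',
        (SELECT id FROM subsystem WHERE code = 'NAME PATTERN'));

-- ... the remaining INSERT statements of the loop, unchanged ...

COMMIT;  -- one durable flush to disk for the whole batch
```

If any statement fails, the whole batch is rolled back, so the loop should either succeed completely or be retried as a unit.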

I could not test this because I have neither PostgreSQL installed nor a database with a similar structure, but maybe it would be faster to get the insert data from a single statement:
INSERT INTO Sample (system_id, manufacturer_id, code, subsystem_id)
SELECT s.id AS system_id,
m.id AS manufacturer_id,
'P0001' AS code,
ss.id AS subsystem_id
FROM system s
JOIN manufacturer m
ON m.name = 'nameXY'
JOIN subsystem ss
ON ss.code = 'NAME PATTERN'
WHERE s.initial = 'P'
I hope this works.
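Note that the inner joins above return no row at all when the manufacturer or subsystem is missing, whereas the question asks for NULL in that case. Here is a hedged sketch of a variant that loads many rows in one statement and keeps unmatched lookups as NULL. The VALUES list, its column aliases, and the second data row are illustrative assumptions. It also presumes that initial, name, and code are unique in their tables, since a duplicate would multiply rows rather than raise an error the way the scalar subselects do:

```sql
INSERT INTO sample (system_id, manufacturer_id, code, subsystem_id)
SELECT s.id, m.id, v.code, ss.id
FROM (VALUES
        ('P', 'nameXY', 'P0001', 'NAME PATTERN'),
        ('P', 'nameZZ', 'P0002', 'OTHER PATTERN')   -- hypothetical row
     ) AS v (system_initial, manufacturer_name, code, subsystem_code)
JOIN system s            ON s.initial = v.system_initial     -- system_id is NOT NULL
LEFT JOIN manufacturer m ON m.name    = v.manufacturer_name  -- NULL if not found
LEFT JOIN subsystem ss   ON ss.code   = v.subsystem_code;    -- NULL if not found
```

Because this is a single statement, it also runs in a single transaction, so the per-row commit cost disappears.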

Related

Optimized Postgresql like and group by clause

Database: PostgreSQL 12.11 (Ubuntu 12.11-0ubuntu0.20.04.1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, 64-bit
RAM : 8 GB
Processor : i7 4510U (dual core 2 Ghz)
I would like to optimize the query below:
select a.gender from "employees" as a
where lower( gender ) LIKE 'f%' group by gender
limit 20
Total number of records in table : 2,088,290 rows
Index
CREATE INDEX ix_employees_gender_lower ON public.employees USING btree (lower((gender)::text) varchar_pattern_ops)
query execution plan
https://explain.dalibo.com/plan/h7e
Please use gdrive link to download and restore the sql to database for above query
SQL TABLE with data
I tried adding an index, but to no avail. I am also not able to understand the EXPLAIN ANALYZE output, so any pointers on that would be welcome as well.
It sounds like you need an index skip-scan. PostgreSQL currently doesn't implement those automatically but you can emulate it with a recursive CTE. People are working on adding this to the planner so it will be chosen automatically, but even if they succeed it would probably not work with your case-folding LIKE condition. I couldn't see how to integrate the case-folding LIKE condition into the recursive CTE, but if you return all distinct genders preserving case, you can then filter that small list quickly without needing to use an index.
WITH RECURSIVE t AS (
SELECT min(gender) AS gender FROM employees
UNION ALL
SELECT (SELECT min(gender) FROM employees WHERE gender > t.gender)
FROM t WHERE t.gender IS NOT NULL
)
SELECT gender FROM t WHERE gender IS NOT NULL and lower(gender) like 'f%';
This took less than 2 ms for me, but it does require you add a plain index on gender, which you don't seem to have already.
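The plain index this relies on could be created as follows (the index name is an assumption). With it, each min(gender) lookup in the recursive CTE can be answered from the index instead of scanning the table:

```sql
-- Hypothetical index name; a plain b-tree on the raw column.
CREATE INDEX ix_employees_gender ON public.employees (gender);
```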
So apparently, refactoring the base query to the one below
select gender from (
select a.gender from "employees" as a
where lower(a.gender::text) LIKE 'f%'
limit 40) b
group by b.gender
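brought the execution time from 5 seconds to 16 ms. Note that this is not equivalent to the original query: the LIMIT is now applied before the GROUP BY, so only the first 40 matching rows are examined, and distinct values that first appear later in the table can be missed.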
Even the bad plan is far better for me than it appears to be for you (4s with completely cold cache, 0.4s upon repeat execution), and my hardware is far from excellent.
If the time is going to random page reads, you could greatly reduce that by creating an index suited for index-only scans and making sure the table is well vacuumed.
CREATE INDEX ix_employees_gender_lower2 ON public.employees USING btree (lower((gender)::text) varchar_pattern_ops, gender)
That reduces the timing to 0.3s, regardless of cache warmth.
But I don't see the point of running this query even once, much less often enough to care if it takes 22s.
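To check whether the index-only-scan suggestion above takes effect, a sketch along these lines can help. Index-only scans rely on the visibility map, which VACUUM maintains, so vacuum first and then inspect the plan. The plan is expected to show an "Index Only Scan" node with a low "Heap Fetches" count, but the planner is free to choose otherwise:

```sql
-- Refresh the visibility map and the planner statistics.
VACUUM (ANALYZE) public.employees;

-- Run the original query and show the actual plan and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT a.gender
FROM employees AS a
WHERE lower(gender) LIKE 'f%'
GROUP BY gender
LIMIT 20;
```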

Is it OK to store transactional primary key on data warehouse dimension table to relate between fact-dim?

I have a data source (a Postgres transactional system) like this (simplified; the actual tables have more fields than this):
Then I need to create an ETL pipeline, where the required report is something like this:
order number (from sales_order_header)
item name (from sales_order_lines)
batch shift start & end (from receiving_batches)
delivered quantity, approved received quantity, rejected received quantity (from receiving_inventories)
My design for fact-dim tables is this (simplified).
What I don't know about, is the optimal ETL design.
Let's focus on how to insert the fact, and relationship between fact with dim_sales_orders
If I have staging tables like these:
The ETL runs daily. After 22:00, there will be no more receiving, so I can run the ETL at 23:00.
Then I can just fetch data from sales_order_header and sales_order_lines, so at 23:00 the script can run something like this:
INSERT
INTO
staging_sales_orders (
SELECT
order_number,
item_name
FROM
sales_order_header soh,
sales_order_lines sol
WHERE
soh.sales_order_id = sol.sales_order_header_id
and date_trunc('day', sol.created_timestamp) = date_trunc('day', now())
);
And the fact table load can run at 23:30, with this query:
SELECT
soh.order_number,
rb.batch_shift_start,
rb.batch_shift_end,
sol.item_name,
ri.delivered_quantity,
ri.approved_received_quantity,
ri.rejected_received_quantity
FROM
receiving_batches rb,
receiving_inventories ri,
sales_order_lines sol,
sales_order_header soh
WHERE
rb.batch_id = ri.batch_id
AND ri.sales_order_line_id = sol.sales_order_line_id
AND sol.sales_order_header_id = soh.sales_order_id
AND date_trunc('day', sol.created_timestamp) = date_trunc('day', now())
But how do I optimally load the data, particularly into the fact table?
My approach:
1. Select from staging_sales_orders and insert the rows into dim_sales_orders, using an auto-increment primary key.
2. Before inserting into fact_receiving_inventories, I need to know the dim_sales_order_id, so I select:
SELECT
dim_sales_order_id
FROM
dim_sales_orders dso
WHERE
order_number = staging_row.order_number
AND item_name = staging_row.item_name
3. Then insert into the fact table.
Now what I doubt is point 2 (selecting from the existing dim). Here I select based on two varchar columns, which is likely to be a performance hit. Since the source is in normalized form, I'm thinking of modifying the staging tables by adding sales_order_line_id to both of them. Then, during point 2 above, I can just do:
SELECT
dim_sales_order_id
FROM
dim_sales_orders dso
WHERE
sales_order_line_id = staging_row.sales_order_line_id
But as a consequence, I will need to add sales_order_line_id to dim_sales_orders, which I don't commonly find in tutorials. Adding a transactional table's PK is technically possible, since I can access the data source. But is it good DW fact-dim design to add such a transactional field (especially since it is a PK)?
Or is there any other approach, rather than selecting the existing dim based on two varchars?
How do I optimally select the dimension id for fact tables?
Thanks
It is practically mandatory to include the source PK/BK (primary key / business key) in a dimension.
The standard process is to load your dims and then load your facts. For the fact loads you translate the source data to the appropriate dim SKs (surrogate keys) with lookups to the dims using the PK/BK.
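The lookup step can be done set-based rather than row by row. The following is a sketch only: staging_receiving_inventories, the index name, and the column lists of the fact and staging tables are assumptions, and it presumes sales_order_line_id has been added to dim_sales_orders as the business key, with exactly one dimension row per source line:

```sql
-- Index the business key so the lookup join is cheap (hypothetical index name).
CREATE UNIQUE INDEX ux_dim_sales_orders_line_id
    ON dim_sales_orders (sales_order_line_id);

-- Translate the source key to the dimension's surrogate key while loading.
INSERT INTO fact_receiving_inventories
       (dim_sales_order_id,
        delivered_quantity,
        approved_received_quantity,
        rejected_received_quantity)
SELECT dso.dim_sales_order_id,
       sri.delivered_quantity,
       sri.approved_received_quantity,
       sri.rejected_received_quantity
FROM   staging_receiving_inventories sri          -- assumed staging table
JOIN   dim_sales_orders dso
       ON dso.sales_order_line_id = sri.sales_order_line_id;
```

If the dimension keeps history (several rows per source key), the unique index would not apply and the join would need an additional condition selecting the current row.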

Most efficient way to DECODE multiple columns -- DB2

I am fairly new to DB2 (and SQL in general) and I am having trouble finding an efficient method to DECODE columns
Currently, the database has a number of tables most of which have a significant number of their columns as numbers, these numbers correspond to a table with the real values. We are talking 9,500 different values (e.g '502=yes' or '1413= Graduate Student')
In any situation, I would just do WHERE clause and show where they are equal, but since there are 20-30 columns that need to be decoded per table, I can't really do this (that I know of).
Is there a way to effectively just display the corresponding value from the other table?
Example:
SELECT TEST_ID, DECODE(TEST_STATUS, 5111, 'Approved', 5112, 'In Progress') TEST_STATUS
FROM TEST_TABLE
The above works fine, but I have to manually look up the numbers and review them to build the statements. As I mentioned, some tables have 20-30 columns that would need this, and some need DECODE statements with 12-15 conditions.
Is there anything that would allow me to do something simpler like:
SELECT TEST_ID, DECODE(TEST_STATUS = *TableWithCodeValues*) TEST_STATUS
FROM TEST_TABLE
EDIT: Also, to be more clear, I know I can do a ton of INNER JOINS, but I wasn't sure if there was a more efficient way than that.
From a logical point of view, I would consider splitting the lookup table into several domain/dimension tables. Not sure if that is possible to do for you, so I'll leave that part.
As mentioned in my comment I would stay away from using DECODE as described in your post. I would start by doing it as usual joins:
SELECT a.TEST_STATUS
, b.TEST_STATUS_DESCRIPTION
, a.ANOTHER_STATUS
, c.ANOTHER_STATUS_DESCRIPTION
, ...
FROM TEST_TABLE as a
JOIN TEST_STATUS_TABLE as b
ON a.TEST_STATUS = b.TEST_STATUS
JOIN ANOTHER_STATUS_TABLE as c
ON a.ANOTHER_STATUS = c.ANOTHER_STATUS
JOIN ...
If things are too slow there are a couple of things you can try:
Create a statistical view that can help determine cardinalities from the joins (may help the optimizer creating a better plan):
https://www.ibm.com/support/knowledgecenter/sl/SSEPGG_9.7.0/com.ibm.db2.luw.admin.perf.doc/doc/c0021713.html
If your license permits, you can experiment with Materialized Query Tables (MQT). Note that there is a penalty for modifications of the base tables, so if you have more of an OLTP workload, this is probably not a good idea:
https://www.ibm.com/developerworks/data/library/techarticle/dm-0509melnyk/index.html
A third option if your lookup table is fairly static is to cache the lookup table in the application. Read the TEST_TABLE from the database, and lookup descriptions in the application. Further improvements may be to add triggers that invalidate the cache when lookup table is modified.
If you don't want to do all these joins, you could create your own LOOKUP function.
create or replace function lookup(IN_ID INTEGER)
returns varchar(32)
deterministic reads sql data
begin atomic
declare OUT_TEXT varchar(32);--
set OUT_TEXT=(select text from test.lookup where id=IN_ID);--
return OUT_TEXT;--
end;
With a table TEST.LOOKUP like
create table test.lookup(id integer, text varchar(32))
containing some id/text pairs, this will return the text value corresponding to an id, or NULL if the id is not found.
With the roughly 10k id/text pairs you mention and an index on the ID field, this shouldn't be a performance issue, as that amount of data should easily be cached in the corresponding bufferpool.
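A short usage sketch, assuming the function was created in a schema that is on the current function path and that TEST_TABLE has the TEST_STATUS and ANOTHER_STATUS code columns used elsewhere in this thread (the index name is hypothetical):

```sql
-- Index the lookup key so each call is a single index probe.
CREATE UNIQUE INDEX test.lookup_id_ix ON test.lookup (id);

-- Decode as many code columns as needed without writing one join per column.
SELECT TEST_ID,
       lookup(TEST_STATUS)    AS TEST_STATUS,
       lookup(ANOTHER_STATUS) AS ANOTHER_STATUS
FROM TEST_TABLE;
```

The function is invoked once per decoded column per row, so for large result sets it is worth comparing its timing against the join version.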

Slow SQL Server 2008 R2 performance?

I'm using SQL Server 2008 R2 on my development machine (not a server box).
I have a table with 12.5 million records. It has 126 columns, half of which are int. Most columns in most rows are NULL. I've also tested with an EAV design which seems 3-4 times faster to return the same records (but that means pivoting data to make it presentable in a table).
I have a website that paginates the data. When the user tries to go to the last page of records (last 25 records), the resulting query is something like this:
select * from (
select
A.Id, part_id as PartObjectId,
Year_formatted 'year', Make_formatted 'Make',
Model_formatted 'Model',
row_number() over ( order by A.id ) as RowNum
FROM vehicles A
) as innerQuery where innerQuery.RowNum between 775176 and 775200
... but this takes nearly 3 minutes to run. That seems excessive? Is there a better way to structure this query? In the browser front-end I'm using jqGrid to display the data. The user can navigate to the next, previous, first, or last page. They can also filter and order data (example: show all records whose Make is "Bugatti").
vehicles.Id is int and is the primary key (clustered ASC). part_id is int, Make and Model are varchar(100) and typically only contain 20 - 30 characters.
Table vehicles is updated ~100 times per day in individual transactions, and 20 - 30 users use the webpage to view, search, and edit/add vehicles 8 hours/day. It gets read from and updated a lot.
Would it be wise to shard the vehicles table into multiple tables only containing say 3 million records each? Would that have much impact on performance?
I see lots of videos and websites talking about people having tables with 100+ million rows that are read from and updated often without issue.
Note that the performance issues I observe are on my own development computer. The database has a dedicated 16GB of RAM. I'm not using SSD or even SCSI for that matter. So I know hardware would help, but 3 minutes to retrieve the last 25 records seems a bit excessive no?
Though I'm running these tests on SQL Server 2008 R2, I could also use 2012 if there is much to be gained from doing so.
Yes, there is a better way, even on older releases of MS SQL, but it is involved. First, this process should be done in a stored procedure. The stored procedure should take as two of its input parameters the page requested (@page) and the page size (number of records per page, @pgSiz).
In the stored procedure,
Create a table variable and put into it a sorted list of the integer primary keys for all the records, with a rowNo column that is itself an indexed, integer primary key for the table variable:
Declare @PKs table
(rowNo integer primary key Identity not null,
vehicleId integer not null)
Insert @PKs (vehicleId)
Select Id from Vehicles
Order By Id --[Replace with the sort criteria you want pages sorted by]
--[Try to only include columns that are in an index]
Then, based on which page and page size (@page, @pgSiz) the user requested, the stored proc selects the actual data for that page by joining to this table variable:
Select [The data columns you want]
From @PKs p join Vehicles v
on v.Id = p.vehicleId
Where rowNo between @page*@pgSiz+1 and (@page+1)*@pgSiz
order by rowNo -- if you want to sort page of records on server
assuming @page is 0-based. Also, the stored proc will need some input argument validation to ensure that the @page and @pgSiz values are reasonable (that they do not take the code past the end of the records).
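Put together, the steps above might look like the following sketch. The procedure name is hypothetical, the column list is taken from the query in the question, and the validation shown is only the minimal check on the inputs. It does not guard against pages beyond the last row, which simply return an empty result:

```sql
CREATE PROCEDURE dbo.GetVehiclePage   -- hypothetical name
    @page  int,   -- 0-based page index
    @pgSiz int    -- rows per page
AS
BEGIN
    SET NOCOUNT ON;

    -- Minimal input validation: reject negative pages and empty page sizes.
    IF @page < 0 OR @pgSiz < 1
        RETURN;

    -- Sorted list of primary keys, numbered by the identity column.
    DECLARE @PKs TABLE (
        rowNo     int IDENTITY(1,1) PRIMARY KEY,
        vehicleId int NOT NULL
    );

    INSERT @PKs (vehicleId)
    SELECT Id
    FROM vehicles
    ORDER BY Id;   -- replace with the sort order the grid requests

    -- Fetch only the rows of the requested page.
    SELECT v.Id,
           v.part_id         AS PartObjectId,
           v.Year_formatted  AS [year],
           v.Make_formatted  AS Make,
           v.Model_formatted AS Model
    FROM @PKs p
    JOIN vehicles v ON v.Id = p.vehicleId
    WHERE p.rowNo BETWEEN @page * @pgSiz + 1 AND (@page + 1) * @pgSiz
    ORDER BY p.rowNo;
END
```

A call such as EXEC dbo.GetVehiclePage @page = 0, @pgSiz = 25 would then return the first 25 rows in Id order.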

Postgres using an index for one table but not another

I have three tables in my app, call them tableA, tableB, and tableC. tableA has fields for tableB_id and tableC_id, with indexes on both. tableB has a field foo with an index, and tableC has a field bar with an index.
When I do the following query:
select *
from tableA
left outer join tableB on tableB.id = tableA.tableB_id
where lower(tableB.foo) = lower(my_input)
it is really slow (~1 second).
When I do the following query:
select *
from tableA
left outer join tableC on tableC.id = tableA.tableC_id
where lower(tableC.bar) = lower(my_input)
it is really fast (~20 ms).
From what I can tell, the tables are about the same size.
Any ideas as to the huge performance difference between the two queries?
UPDATES
Table sizes:
tableA: 2061392 rows
tableB: 175339 rows
tableC: 1888912 rows
postgresql-performance tag info
Postgres version - 9.3.5
Full text of the queries are above.
Explain plans - tableB tableC
Relevant info from tables:
tableA
tableB_id, integer, no modifiers, storage plain
"index_tableA_on_tableB_id" btree (tableB_id)
tableC_id, integer, no modifiers, storage plain
"index_tableA_on_tableC_id" btree (tableC_id)
tableB
id, integer, not null default nextval('tableB_id_seq'::regclass), storage plain
"tableB_pkey" PRIMARY_KEY, btree (id)
foo, character varying(255), no modifiers, storage extended
"index_tableB_on_lower_foo_tableD" UNIQUE, btree (lower(foo::text), tableD_id)
tableD is a separate table that is otherwise irrelevant
tableC
id, integer, not null default nextval('tableC_id_seq'::regclass), storage plain
"tableC_pkey" PRIMARY_KEY, btree (id)
bar, character varying(255), no modifiers, storage extended
"index_tableC_on_tableB_id_and_bar" UNIQUE, btree (tableB_id, bar)
"index_tableC_on_lower_bar" btree (lower(bar::text))
Hardware:
OS X 10.10.2
CPU: 1.4 GHz Intel Core i5
Memory: 8 GB 1600 MHz DDR3
Graphics: Intel HD Graphics 5000 1536 MB
Solution
Looks like running vacuum and then analyze on all three tables fixed the issue. After running the commands, the slow query started using "index_patients_on_foo_tableD".
The other thing is that you query your indexed columns through lower(). A plain index on the column cannot be used for such a predicate; only an index on the expression itself can.
If you will always query the column as lower() then your column should be indexed as lower(column_name) as in:
create index idx_1 on tableb(lower(foo));
Also, have you looked at the execution plan? This will answer all your questions if you can see how it is querying the tables.
Honestly, there are many factors to this. The best solution is to study up on INDEXES, specifically in Postgres so you can see how they work. It is a bit of holistic subject, you can't really answer all your problems with a minimal understanding of how they work.
For instance, Postgres has an initial "let's look at these tables and see how we should query them" step before the query runs. It considers how big each of the tables is, what indexes exist, etc., and then figures out how the query should run. THEN it executes it. Oftentimes, this is what goes wrong: the engine incorrectly determines how to execute the query.
Many of these calculations are based on summarized table statistics. You can refresh the statistics for any table by doing:
vacuum [table_name];
(this helps to prevent bloating from dead rows)
and then:
analyze [table_name];
I haven't always seen this work, but oftentimes it helps.
Anyway, your best bet is to:
a) Study up on Postgres indexes (a SIMPLE write up, not something ridiculously complex)
b) Study up the execution plan of the query
c) Using your understanding of Postgres indexes and how the query plan is executing, you cannot help but solve the exact problem.
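As a sketch of steps b) and c) for the slow query in the question: refresh the statistics, then look at the actual plan. The literal 'some input' is a placeholder for my_input, and the table names are the placeholders used in the question:

```sql
-- Reclaim dead rows and refresh planner statistics in one pass.
VACUUM (ANALYZE) tableA;
VACUUM (ANALYZE) tableB;

-- Show the plan the planner chose, with real timings and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM tableA
LEFT OUTER JOIN tableB ON tableB.id = tableA.tableB_id
WHERE lower(tableB.foo) = lower('some input');
```

In the output, a "Seq Scan" on tableB where an index scan was expected, or a large gap between estimated and actual row counts, usually points at missing or stale statistics or at an index that does not match the expression in the WHERE clause.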
For starters, your LEFT JOIN is counteracted by the predicate on the left-joined table (tableB) and is forced to act like an [INNER] JOIN. Replace with:
SELECT *
FROM tableA a
JOIN tableB b ON b.id = a.tableB_id
WHERE lower(b.foo) = lower(my_input);
Or, if you actually want the LEFT JOIN to include all rows from tableA:
SELECT *
FROM tableA a
LEFT JOIN tableB b ON b.id = a.tableB_id
AND lower(b.foo) = lower(my_input);
I think you want the first one.
An index on (lower(foo::text)) like you posted is syntactically invalid. You better post the verbatim output from \d tbl in psql like I commented repeatedly. A shorthand syntax for a cast (foo::text) in an index definition needs more parentheses, or use the standard syntax: cast(foo AS text):
Create index on first 3 characters (area code) of phone field?
But that's also unnecessary. You can just use the data type (character varying(255)) of foo. Of course, the data type character varying(255) rarely makes sense in Postgres to begin with. The odd limitation to 255 characters is derived from limitations in other RDBMS which do not apply in Postgres. Details:
Refactor foreign key to fields
Be that as it may. The perfect index for this kind of query would be a multicolumn index on B - if (and only if) you get index-only scans out of this:
CREATE INDEX "tableB_lower_foo_id" ON tableB (lower(foo), id);
You can then drop the mostly superseded index "index_tableB_on_lower_foo". Same for tableC.
The rest is covered by the (more important!) indices in table A on tableB_id and tableC_id.
If there are multiple rows in tableA per tableB_id / tableC_id, then either one of these competing commands can swing the performance to favor the respective query by physically clustering related rows together:
CLUSTER tableA USING "index_tableA_on_tableB_id";
CLUSTER tableA USING "index_tableA_on_tableC_id";
You can't have both. It's either B or C. CLUSTER also does everything a VACUUM FULL would do. But be sure to read the details first:
Optimize Postgres timestamp query range
And don't use mixed case identifiers, sometimes quoted, sometimes not. This is very confusing and is bound to lead to errors. Use legal, lower-case identifiers exclusively - then it doesn't matter if you double-quote them or not.