How to decrease INSERT query time for 20k rows in PostgreSQL

I have written this query:
INSERT INTO "public"."users" ("userId", "name", "age") VALUES ('1234', 'nop', '11'),....
I am trying to insert 20k rows with it, but it takes 80+ seconds.
What could be the reason behind this?
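Typical culprits for a slow bulk insert are per-row autocommitted statements, many indexes or triggers on the table, and foreign-key checks; which one applies here cannot be told from the question alone. For comparison, a sketch of two common bulk-load patterns, reusing the table and columns above (the sample values and the CSV path are illustrative only):
-- One multi-row INSERT inside a single explicit transaction avoids per-statement commit overhead.
BEGIN;
INSERT INTO "public"."users" ("userId", "name", "age")
VALUES ('1', 'a', '11'), ('2', 'b', '12');  -- ... the remaining rows
COMMIT;
-- COPY is usually the fastest way to load many rows; the file path is a placeholder.
COPY "public"."users" ("userId", "name", "age")
FROM '/tmp/users.csv' WITH (FORMAT csv);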

Related

Postgres - distinct query slow over 5 million rows

I am trying to do a SELECT DISTINCT on a table with 5 million rows, which takes approximately 2 minutes. My intent is to get the time down to seconds.
Query: select distinct accounttype from t_fin_do where country_id='abc'
I tried a composite index, but the estimated cost just went up.
You can try to create a partial and covering index for this query with:
create index on t_fin_do(country_id, accounttype) where country_id='abc';
Depending on the selectivity of country_id this could be faster than a table seq scan.
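To check whether the planner actually picks the new partial index up, the same query can be run under EXPLAIN ANALYZE before and after creating it (the exact plan will depend on the data):
explain analyze
select distinct accounttype from t_fin_do where country_id='abc';
-- An index-only scan on the new index in the resulting plan means it is being used;
-- a sequential scan means the planner still considers the full table scan cheaper.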

Postgres query optimization: merge index

I am experienced in query tuning in Oracle, but in Postgres I am unable to improve performance.
Problem statement: I need to aggregate rows from one Postgres table that has a large number of columns (110) and 175 million rows for a one-month range. Apart from the aggregation, the query has a very simple where clause:
where time between '2019-03-15' and '2019-04-15'
and org_name in ('xxx','yyy'.. 15 elements)
There are individual btree indexes on the table for "time" (idx_time) and "org_name" (idx_org_name), but no composite index.
I tried creating a new index on ('org_name', 'time'), but my manager does not want to change anything.
How can I make it run faster? It takes 15 minutes now (with a smaller set of org_name values it takes 6 minutes). Most of the time is spent on data access from the table.
Is parallel execution possible?
thanks, Jay
QUERY EXPLAIN ANALYZE :
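The composite index mentioned in the question, together with the standard parallel-query setting, would look roughly like this. This is only a sketch: the table name big_table is a placeholder, since the real name is not shown above, and PostgreSQL 9.6 or later is assumed for parallel aggregation.
-- Composite index matching the WHERE clause (placeholder table name).
create index idx_org_time on big_table (org_name, "time");
-- Allow more parallel workers per query; the planner will parallelize the scan
-- and the aggregate if it estimates a benefit.
set max_parallel_workers_per_gather = 4;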

Select subset of an array in jsonb postgres

I have a table in Postgres 9.6.3:
CREATE TABLE public."Records"
(
"Id" uuid NOT NULL,
"Json" jsonb,
CONSTRAINT "PK_Records" PRIMARY KEY ("Id")
)
Inside my "Json" column I store arrays like so:
[
{"a":"b0","c":0,"z":true},
{"a":"b1","c":1,"z":false},
{"a":"b2","c":2,"z":true}
]
There can be some 10 million entries in each array, and there can be some 5 million records in the table.
I want to get the JSON out, paged, e.g. skip 1 record and return 2 records.
I can do it like so:
select string_agg(txt, ',') as x FROM
(select jsonb_array_elements_text("Json") as txt
FROM "Records" where "Id" = 'de70aadc-70e8-4c77-bd4b-af75ed36897e' -- some id here
limit 50 offset 5000 -- paging parameters
) s;
However, the query takes almost a second (between 780 and 900 msec) to run on my dev laptop, which has quite decent hardware (MacBook Pro 2017). Note: the timing is for the data sizes specified above; the 3-record sample data of course returns faster.
Adding a GIN index like so: CREATE INDEX records_gin ON "Records" USING gin ("Json"); didn't actually do anything for the query performance (I suppose because I am not querying by the contents of the array).
Is there any way to make this work faster?
It would be faster if you normalized your data and stored the array elements in a second table. Then you could use keyset pagination to page through the data.
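A sketch of what that could look like; the table and column names below are illustrative, not from the question. Each array element becomes one row in a child table keyed by the record id and the element's position, and a page is fetched by seeking past the last position already returned instead of using OFFSET.
-- Hypothetical normalized layout: one row per array element.
CREATE TABLE "RecordElements" (
    "RecordId" uuid NOT NULL REFERENCES "Records"("Id"),
    "Pos"      integer NOT NULL,   -- position of the element within the original array
    "Element"  jsonb NOT NULL,
    PRIMARY KEY ("RecordId", "Pos")
);
-- Keyset pagination: instead of OFFSET 5000, remember the last "Pos" returned
-- (5000 here) and seek straight to it through the primary-key index.
SELECT "Element"
FROM "RecordElements"
WHERE "RecordId" = 'de70aadc-70e8-4c77-bd4b-af75ed36897e'
  AND "Pos" > 5000
ORDER BY "Pos"
LIMIT 50;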

Postgres: truncate / load causes basic queries to take seconds

I have a Postgres table that on a nightly basis gets truncated, and then reloaded in a bulk insert (a million or so records).
This table is behaving very strangely: basic queries such as "SELECT * from mytable LIMIT 10" are taking 40+ seconds. Records are narrow, just a couple integer columns.
I'm perplexed; I would very much appreciate your advice.
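One easy thing to check after a nightly bulk load (an assumption about the cause, since no plan is shown): a freshly reloaded table has no up-to-date planner statistics until autoanalyze catches up, and if rows are ever removed with DELETE instead of TRUNCATE, the table can carry dead tuples that even a LIMIT 10 scan has to step over. Running a vacuum with analyze right after the load addresses both; the table name is taken from the question.
-- Run (or schedule) immediately after the nightly reload.
VACUUM ANALYZE mytable;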

Optimizing removal of SQL duplicates using ROW_NUMBER

I'm attempting to remove redundant rows from an SQL table, [InfoBucket], with columns:
[ID] (varchar(16)), [column1], ... [columnN], [Speed] (bigint)
([column1...N] are datatypes ranging from integers to varchar() objects of varying lengths.)
There are rows in the table that have the same value in the [ID] and some [column1...N] columns.
I'm taking all these duplicates and deleting all but the row that has the greatest [Speed].
There are approximately 400 million rows in the [InfoBucket].
To split the work into manageable chunks, I have another table, [UniqueIDs], with one column:
[ID] (varchar(16))
and which is populated like so:
begin
insert into [UniqueIDs]
select distinct [ID] from [InfoBucket]
end
go
There are approximately 15 million rows in [UniqueIDs].
I have been using Martin Smith's excellent answer to a similar question:
My procedure currently looks like this:
begin
declare @numIDs int
set @numIDs = 10000
;with toRemove as
(
select ROW_NUMBER() over (partition by
[ID],
[column1],
...
[columnN]
order by [Speed] desc) as 'RowNum'
from [InfoBucket]
where [ID] in
(
select top (@numIDs) [ID] from [UniqueIDs] order by [ID]
)
)
delete toRemove
where RowNum > 1
option (maxdop 1)
;
;with IDsToRemove as
(
select top (@numIDs) [ID] from [UniqueIDs] order by [ID]
)
delete IDsToRemove
option (maxdop 1)
end
go
There are nonclustered indexes on [ID] in both [InfoBucket] and [UniqueIDs], and the "partition by ..." in the over clause only includes the columns that need to be compared.
Now, my problem is that it takes a little over six minutes for this procedure to run. Adjusting the value of @numIDs changes the running time in a linear fashion (i.e. when @numIDs has a value of 1,000 the procedure runs for approximately 36 seconds (6 min. / 10), and when @numIDs has a value of 1,000,000 it runs for approximately 10 hours (6 min. * 100)); this means that removing all duplicates in [InfoBucket] would take days.
I've tried adding a uniqueidentifier column, [UI_ID], to [InfoBucket] and creating a clustered index on it (so [InfoBucket] had one clustered index on [UI_ID] and one nonclustered index on [ID]), but that actually increased the running time.
Is there any way I can further optimize this?
The key is to find the sweet spot for deleting the rows. Play with @numIDs to find the fastest increment, and then just let it churn.
It's 400 million rows; the whole process is not going to complete in minutes, maybe not even hours, it's going to take time. As long as the table does not fill faster than you can remove dupes, you are OK.
Find the sweet spot, then schedule it to run often and off peak. Then check the process from time to time to make sure the sweet spot stays sweet.
The only other thing I can think of is to identify the dupes separately from deleting them. This will save some time, especially if you can find the dupes in one SQL statement, put that data into yet another table (e.g. DupeIdsToDelete), and then run a delete loop against those IDs, as sketched below.
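A sketch of that idea in T-SQL; the staging table name follows the DupeIdsToDelete suggestion above, the duplicate key is assumed to be [ID] plus [column1...N], and a row-identifying column such as the [UI_ID] mentioned in the question is assumed to exist so individual rows can be targeted.
-- Stage the duplicate row ids once, without deleting anything yet.
select [UI_ID]
into DupeIdsToDelete
from (
    select [UI_ID],
           row_number() over (partition by [ID], [column1] -- ...and the other compared columns
                              order by [Speed] desc) as RowNum
    from [InfoBucket]
) d
where d.RowNum > 1;
-- Then delete in manageable batches against the staged ids.
delete top (10000) ib
from [InfoBucket] ib
join DupeIdsToDelete dd on dd.[UI_ID] = ib.[UI_ID];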