How to update 400,000 records in batches - PostgreSQL

I have the following table named business_extra:
business_id  address  neighbourhood
====================================
1
2
3
...
(400,000 records)
The address column contains null values, so I want to fill it in using another table.
I have written the following query:
update b2g_mns_v2.txn_business_extra a
set mappable_address=b.mappable_address
from b2g_mns_v2.temp_business b
where b.import_reference_id=a.import_reference_id
but got the error message:
out of shared memory

Update only the rows that still need a value; since PostgreSQL does not allow LIMIT directly on UPDATE, restrict each batch through a subquery:
update b2g_mns_v2.txn_business_extra a
set mappable_address = b.mappable_address
from b2g_mns_v2.temp_business b
where b.import_reference_id = a.import_reference_id
  and a.import_reference_id in (
      select import_reference_id
      from b2g_mns_v2.txn_business_extra
      where mappable_address is null
      limit 10000
  );
Do this a few times (batches of 10,000) until no rows are left to update.
As a_horse_with_no_name mentioned, it's better to make sure your query is OK by checking its execution plan first.
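If you would rather drive all the batches from the server in one go, here is a minimal sketch (not from the original answer) that loops until nothing is left to update. It assumes PostgreSQL 11+ so that COMMIT is allowed inside a DO block, and reuses the table and column names from the question:

do $$
declare
    rows_updated bigint;
begin
    loop
        update b2g_mns_v2.txn_business_extra a
        set mappable_address = b.mappable_address
        from b2g_mns_v2.temp_business b
        where b.import_reference_id = a.import_reference_id
          and a.import_reference_id in (
              -- pick only rows that still need a value and have a match,
              -- so an empty batch really means we are finished
              select e.import_reference_id
              from b2g_mns_v2.txn_business_extra e
              join b2g_mns_v2.temp_business t using (import_reference_id)
              where e.mappable_address is null
              limit 10000
          );
        get diagnostics rows_updated = row_count;
        exit when rows_updated = 0;
        commit;  -- keeps each batch in its own transaction (PostgreSQL 11+)
    end loop;
end $$;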

Related

Count(*) returns a value of 0 (even though it should be at least 1+)

I have a table of users in S3 that I'm running some queries against. In particular, I'm trying to get a count of records for a particular user ID. I start by querying the entire table as:
Select *
From table
Limit 100
That works just fine and returns results. I then copy one of the user ID's that I get from that result and run this query:
Select count(id)
From table
Where id = 'abc123'
Since I copied the user ID directly from the table, I should get a count of at least 1, as I know there is at least one record for that ID. However, Hive returns a result of 0.
I have tried ANALYZE TABLE to compute statistics and then re-ran my query, but I still got a count of 0.
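For reference, the statistics step above amounts to something like this (the table name here is a placeholder, not from the question):

-- Hive
ANALYZE TABLE users COMPUTE STATISTICS;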
So I then tried the following query, but it timed out and wouldn't show any results. It reportedly took the query 0.001 seconds to run, but then it just sat there "loading..." the table until I eventually got a message saying "Operation timed out."
Select *
From table
Where id = 'abc123'
Limit 100
Any thoughts on why this may be happening or how to fix it?
Thanks!

Postgres select query with offset for large table taking too much time to process

To process a table having 3 million rows, I am using the following query in psql:
select id, trans_id, name
from omx.customer
where user_token is null
order by id, trans_id
limit 1000 offset 200000000
It's taking more than 3 minutes to fetch the data. How can I improve the performance?
The problem you have is that, to know which 1000 records to fetch, the database actually has to fetch and discard all of the 200000000 records before them.
The main strategy to combat this problem is to use a where clause instead of the offset.
If you know the previous 1000 rows (because this is some kind of iteratively used query), you can instead take the id and trans_id from the last row of that set and fetch the 1000 rows following it (keyset pagination); a sketch is shown below.
If the figure of 200000000 doesn't need to be exact and you can make a good guess of where to start then that might be an avenue to attack the problem.
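A minimal sketch of the keyset approach, assuming the last row of the previous page supplied the placeholder values :last_id and :last_trans_id:

select id, trans_id, name
from omx.customer
where user_token is null
  and (id, trans_id) > (:last_id, :last_trans_id)
order by id, trans_id
limit 1000;

With an index on (id, trans_id) - ideally a partial index with "where user_token is null" - this jumps straight to the right position instead of reading and discarding 200 million rows first.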

Optimal use of LIKE on indexed column

I have a large table (about 1 million rows, 7 columns including the primary key). The table contains two columns (i.e. symbol_01 and symbol_02) that are indexed and used for querying. This table contains rows such as:
id  symbol_01  symbol_02  value_01  value_02
--  ---------  ---------  --------  --------
1   aaa        bbb        12        15
2   bbb        aaa        12        15
3   ccc        ddd        20        50
4   ddd        ccc        20        50
As per the example, rows 1 and 2 are identical except that symbol_01 and symbol_02 are swapped, but they have the same values for value_01 and value_02. The same is true of rows 3 and 4. This is the case for the entire table: there are essentially two rows for each combination of symbol_01+symbol_02.
I need to figure out a better way of handling this to get rid of the duplication. So far the solution I am considering is to just have one column called symbol which would be a combination of the two symbols, so the table would be as follows:
id  symbol      value_01  value_02
--  ----------  --------  --------
1   ,aaa,bbb,   12        15
2   ,ccc,ddd,   20        50
This would cut the number of rows in half. As a side note, every value in the symbol column will be unique. Results always need to be queried for using both symbols, so I would do:
select value_01, value_02
from my_table
where symbol like '%,aaa,%' and symbol like '%,bbb,%'
This would work, but my question is around performance. This is still going to be a big table (and will get bigger soon). So my question is: is this the best solution for this scenario, given that symbol will be indexed, every symbol combination will be unique, and I will need to use LIKE to query results?
Is there a better way to do this? I'm not sure how good LIKE is for performance, but I don't see an alternative.
There's no high-performance solution, because your problem is shoehorning multiple values into one column.
Create a child table (with a foreign key to your current/main table) to separately hold all the individual values you want to search on, index that column and your query will be simple and fast.
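A minimal sketch of that child-table idea (all names here are illustrative, not from the question):

create table my_table_symbol (
    my_table_id bigint not null references my_table (id),
    symbol      text   not null
);
create index my_table_symbol_symbol_idx on my_table_symbol (symbol);

-- rows that carry both symbols, in either order
select t.value_01, t.value_02
from my_table t
join my_table_symbol s1 on s1.my_table_id = t.id and s1.symbol = 'aaa'
join my_table_symbol s2 on s2.my_table_id = t.id and s2.symbol = 'bbb';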
With this index:
create index symbol_index on t (
    least(symbol_01, symbol_02),
    greatest(symbol_01, symbol_02)
)
The query would be:
select *
from t
where least(symbol_01, symbol_02) = least('aaa', 'bbb')
  and greatest(symbol_01, symbol_02) = greatest('aaa', 'bbb')
Or simply delete the duplicates:
delete from t
using (
    select distinct on (
        greatest(symbol_01, symbol_02),
        least(symbol_01, symbol_02),
        value_01, value_02
    ) id
    from t
    order by
        greatest(symbol_01, symbol_02),
        least(symbol_01, symbol_02),
        value_01, value_02
) s
where t.id = s.id
Depending on the column semantics, it might be better to normalize the table as suggested by @Bohemian.

Postgres Crosstab Dynamic Number of Columns

In Postgres 9.4, I have a table like this:
id  extra_col  days  value
--  ---------  ----  -----
1   rev        0     4
1   rev        30    5
2   cost       60    6
I want this pivoted result:
id  extra_col  0  30  60
--  ---------  -  --  --
1   rev        4  5
2   cost              6
This is simple enough with a crosstab, but I want the following specifications:
- The days column will be dynamic: sometimes increments of 1, 2, 3 (days), sometimes 0, 30, 60 (accounting months), and sometimes 360, 720 (accounting years).
- The range of days will be dynamic (e.g., 0..500 days versus 1..10 days).
- The first two columns are static (id and extra_col).
- The return type for all the dynamic columns will remain the same (in this example, integer).
Here are the solutions I've explored, none of which work for me for the following reasons:
- Automatically creating pivot table column names in PostgreSQL - requires two trips to the database.
- Using crosstab_hash - is not dynamic.
From all the solutions I've explored, it seems the only one that allows this to occur in one trip to the database requires that the same query be run three times. Is there a way to store the query as a CTE within the crosstab function?
SELECT *
FROM
CROSSTAB(
--QUERY--,
$$--RUN QUERY AGAIN TO GET NUMBER OF COLUMNS--$$
)
as ct (
--RUN QUERY AGAIN AND CREATE STRING OF COLUMNS WITH TYPE--
)
Every solution based on built-in functionality needs to know the number of output columns in advance; the PostgreSQL planner requires it. There is a workaround based on cursors - it is the only way to get a truly dynamic result from Postgres.
The full example is relatively long and unreadable (SQL really doesn't support cross tabulation natively), so I will not rewrite the code from the blog here: http://okbob.blogspot.cz/2008/08/using-cursors-for-generating-cross.html. A stripped-down sketch of the idea follows.
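This is not the blog's code, only a minimal sketch of the cursor idea, assuming a table my_table(id, extra_col, days, value) matching the sample data (the function and cursor names are made up):

create or replace function pivot_days(c refcursor) returns refcursor as $$
declare
    cols text;
begin
    -- build one output column per distinct "days" value,
    -- e.g. max(value) filter (where days = 30) as "30"
    select string_agg(
               format('max(value) filter (where days = %s) as %I', days, days::text),
               ', ' order by days)
      into cols
      from (select distinct days from my_table) d;

    -- open a cursor over the dynamically built pivot query; the caller
    -- fetches from the cursor, so the column list never has to be
    -- spelled out in the calling SQL
    open c for execute format(
        'select id, extra_col, %s from my_table group by id, extra_col order by id',
        cols);
    return c;
end;
$$ language plpgsql;

-- usage (the FETCH must happen in the same transaction):
begin;
select pivot_days('ct');
fetch all in "ct";
commit;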

How to duplicate partition content?

I'm trying to set up a testing environment for performance testing. Currently we have a table with 8 million records and we want to duplicate these records across 30 days.
In other words:
- Table 1
  -- Partition1 (8 million records)
  -- Partition2 (0 records)
  ...
  -- Partition30 (0 records)
Now I want to take the 8 million records in Partition1 and duplicate them across the rest of the partitions. The only difference between the copies is a column that contains a DATE; this column should shift by 1 day in each copy.
Partition1(DATE)
Partition2(DATE+1)
Partition3(DATE+2)
And so on.
The last restrictions are that there are 2 indexes on the original table that must be preserved in the copies, and the Oracle DB is 10g.
How can I duplicate this content?
Thanks!
It seems to me to be as simple as running as efficient an insert as possible.
Probably if you cross-join the existing data to a list of integers, 1 .. 29, then you can generate the new dates you need.
insert /*+ append */ into ...
with list_of_numbers as (
    select rownum day_add
    from dual
    connect by level <= 29
)
select date_col + day_add, ...
from ..., list_of_numbers;
You might want to set NOLOGGING on the table, since this is test data.
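For instance (the table name is a placeholder; NOLOGGING only skips redo for direct-path operations such as the APPEND insert above):

alter table table1 nologging;
-- ... run the insert /*+ append */ shown above ...
alter table table1 logging;  -- switch logging back on afterwards if desired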