How to Export large Table with 1.2 billion rows from Postgresql

How to Export large Table with 1.2 billion rows from Postgresql - postgresql

I am trying to export a large table using
SELECT name1, name2, name3, name4
From table1
GROUP BY 1, 2, 3, 4
However, after waiting for 1 hour, it returns OUT OF Memory DB connection needs to reset. I tried using the COPY table to csv file, but it returns needs to be superuser with STDIN/STDOUT. I am new to Postgresql.
How can I export this table without running out of memory?
Thanks in advance.

I think your best bet would be to chunk it up, I don't see how it's going to process 1.2 billion lines without freaking out.
Have a script that does off 10000 at a time or something and saves the starting index for the next cycle.

Please read error messages carefully and convey them to us verbatim. COPY says you need to be a superuser unless you use STDIN/STDOUT, not with them.
You should use a client-specific method to do this. If your client is psql that would be \copy. If your client is something else, you should tell us what that is.

Related

See all rows and download the result to csv in pgAdmin4

I have data 66596470 which i got from calculation in query not from table. but in pgAdmin4 I can only see 1000 of 66596470. How can I see all rows or download them to csv? Thank you...

This is not PostgreSQL you are talking about, but a GUI client that limits the number of result rows. You can probably configure the unnamed tool to display more than 1000 rows, but if you tried to actually see all 66 million rows, that would take a lot of time, both for the client software and you as a reader, and the client would probably go out of memory.
If you want to retrieve (rather than see) the complete result set, use psql's \copy command to write the result set to a file. FOr example:
\copy (SELECT ...) TO 'filename' (FORMAT 'csv')

How do I modify this PostgreSQL table?

I'm using PostgreSQL in my SBC, running Parabola (Arch and ALARM based). I was reading this tutorial so I can use PostgreSQL databases and make Postfix work with them. However, in this step I have to fill up some info in the tables I've created, right? How do I do that? Would be better if I can use pgAdmin3 or phpPgAdmin for this, but if I have to do in CLI, there's no problem.
Thanks.
EDIT: I'm a complete noob with PostgreSQL. I'm trying to learn about it, so if you need more info please tell me how to get it.

Short example, and you can run these from pgadmin3 etc as well as the command line.
create table tester (a int, b text);
insert into tester select generate_series(1,1000),'sometext';
select * from tester;
If you need random text, the easy way is to load something like a Shakespeare novel, and use random and substring to grab random sections from that one row in that one table.

T-SQL - Trying to query something across all databases on my server

I've got an environment where my server is hosting a variable number of databases, all of which utilize the same table structures/schemas. I need to pull a sum of customers that meet a certain series of constraints with say, the user table. I also need to show which database I am showing the sum for.
I already know all I need to get the sum in a db by db query, but what I'm really looking to do is have one script that hits all of the non-system DBs currently on my server to grab this info.
Please forgive my ignorance in this, just starting out.
Update-
So, to clarify things somewhat; I'm using MS SQL 2014. I know how to pull a listing of the dbs I want to hit by using:
SELECT name
FROM sys.databases
WHERE name not in ('master', 'model', 'msdb', 'tempdb')
AND state = 0
And for the purposes of gathering the data I need from each, let's just say I've got something like:
select count(u.userid)
from users n
join UserAttributes ua on u.userid = ua.userid
where ua.status = 2
New Update:
So, I went ahead and added the ps sp_foreachdb as suggested by #Philip Kelley, and I'm now running into a problem when trying to run this (admittedly, I can tell I'm closer to a solution). So, this is what I'm using to call the sp:
USE [master]
GO
DECLARE #return_value int
EXEC #return_value = [dbo].[sp_foreachdb]
#command = N'select count(userid) as number from ?..users',
#print_dbname = 1,
#user_only = 1
SELECT 'Return Value' = #return_value
GO
This provides a nice and clean output showing a count, but what I'd like to see is the db name in addition to the count, something like this:
|[DB_NAME]|[COUNT]|
But for each DB
Is this even possible?

Source Code: https://codereview.stackexchange.com/questions/113063/executing-dynamic-sql-programmatically
Example Usage:
declare #options int = (
select a.ExcludeSystemDatabases
from dbo.ForEachDatabaseOptions() as a
);
execute dbo.usp_ForEachDatabase
#Command = N'print Db_Name();'
, #Options = #options;
#Command can be anything you want but obviously it needs to be a query that every single database can understand. #Options currently has 3 built-in settings but can be expanded however you see fit.
I wrote this to mimic/expand upon the master.sys.sp_MSforeachdb procedure but it could still use a little bit of polish (especially around the "logic" that replaces ? with the current database name).

Enumerate the databases from schema / sysdatabases. At least in situations without replication, excluding db_ids 1 to 4 as system databases should be reasonably robust:
SELECT [name] FROM master.dbo.sysdatabases WHERE dbid NOT IN (1,2,3,4)
Other methods exist, see here: Get list of databases from SQL Server and here: SQL Server: How to tell if a database is a system database?
Then prefix the query or stored procedure call with the database name, and in a cursor loop over the resultset of the first query, store that in a sysname variable to construct a series of statements like that:
SELECT column FROM databasename.schema.Viewname WHERE ...
and call that using the string execute function
EXECUTE('SELECT ... FROM '+##fully_qualified_table_name+' WHERE ...')

There’s the undocumented sytem procedure, sp_msForEachDB, as found in the master database. Many pundits on the internet recommend not using this, as under obscure fringe cases it can be unreliable and somehow skip random databases. Count me as one of them, this caused me serious grief a few months back.
You can write your own routine to provide this kind of functionality. This is a common task, however, and many people have already done it and posted their code online… so why re-invent the wheel?
#kittoes0124 posted a link to “usp_ForEachDatabse”. This probably works, though pro forma I hate any stored procedures that beings with usp_. I ended up with Aaron Bertrand’s utility, which can be found at http://www.mssqltips.com/sqlservertip/2201/making-a-more-reliable-and-flexible-spmsforeachdb/.
Install a version of this routine, figure out how it works, plug in your script, and go!

How to tell if record has changed in Postgres

I have a bit of an "upsert" type of question... but, I want to throw it out there because it's a little bit different than any that I've read on stackoverflow.
Basic problem.
I'm working on moving from mysql to PostgreSQL 9.1.5 (hosted on Heroku). As a part of that, I need to import multiple CSV files everyday. Some of the data is sales information and is almost guaranteed to be new and need to be inserted. But, other parts of the data is almost guaranteed to be the same. For example, the csv files (note plural) will have POS (point of sale) information in them. This rarely changes (and is most likely only via additions). Then there is product information. There are about 10,000 products (vast majority will be unchanged, but it's possible to have both additions and updates).
The final item (but is important), is that I have a requirement to be able to provide an audit trail/information for any given item. For example, if I add a new POS record, I need to be able to trace that back to the file it was found in. If I change a UPC code or description of a product, then I need to be able to trace it back to the import (and file) where the change came from.
Solution that I'm contemplating.
Since the data is provided to me via CSV, then I'm working around the idea that COPY will be the best/fastest way. The structure of the data in the files is not exactly what I have in the database (i.e. final destination). So, I'm copying them into tables in the staging schema that match the CSV (note: one schema per datasource). The tables in the staging schemas will have a before insert row triggers. These triggers can decide what to do with the data (insert, update or ignore).
For the tables that are most likely to contain new data, then it will try to insert first. If the record is already there, then it will return NULL (and stop the insert into the staging table). For tables that rarely change, then it will query the table and see if the record is found. If it is, then I need a way to see if any of the fields are changed. (because remember, I need to show that the record was modified by import x from file y) I obviously can just boiler plate out the code and test each column. But, was looking for something a little more "eloquent" and more maintainable than that.
In a way, what I'm kind of doing is combining a importing system with an audit trail system. So, in researching audit trails, I reviewed the following wiki.postgresql.org article. It seems like the hstore might be a nice way of getting changes (and being able to easily ignore some columns in the table that aren't important - e.g. "last_modified")
I'm about 90% sure it will all work... I've created some testing tables etc and played around with it.
My question?
Is a better, more maintainable way of accomplishing this task of finding the maybe 3 records out of 10K that require a change to the database. I could certainly write a python script (or something else) that reads the file and tries to figure out what to do with each record, but that feels horribly inefficient and will lead to lots of round trips.
A few final things:
I don't have control over the input files. I would love it if they only sent me the deltas, but they don't and it's completely outside of my control or influence.
he system is grow and new data sources are likely to be added that will greatly increase the amount of data being processed (so, I'm trying to keep things efficient)
I know this is not nice, simple SO question (like "how to sort a list in python") but I believe one of the great things about SO is that you can ask hard questions and people will share their thoughts about how they think the best way to solve it is.

I have lots of similar operations. What I do is COPY to temporary staging tables:
CREATE TEMP TABLE target_tmp AS
SELECT * FROM target_tbl LIMIT 0; -- only copy structure, no data
COPY target_tmp FROM '/path/to/target.csv';
For performance, run ANALYZE - temp. tables are not analyzed by autovacuum!
ANALYZE target_tmp;
Also for performance, maybe even create an index or two on the temp table, or add a primary key if the data allows for that.
ALTER TABLE ADD CONSTRAINT target_tmp_pkey PRIMARY KEY(target_id);
You don't need the performance stuff for small imports.
Then use the full scope of SQL commands to digest the new data.
For instance, if the primary key of the target table is target_id ..
Maybe DELETE what isn't there any more?
DELETE FROM target_tbl t
WHERE NOT EXISTS (
SELECT 1 FROM target_tmp t1
WHERE t1.target_id = t.target_id
);
Then UPDATE what's already there:
UPDATE target_tbl t
SET col1 = t1.col1
FROM target_tmp t1
WHERE t.target_id = t1.target_id
To avoid empty UPDATEs, simply add:
...
AND col1 IS DISTINCT FROM t1.col1; -- repeat for relevant columns
Or, if the whole row is relevant:
...
AND t IS DISTINCT FROM t1; -- check the whole row
Then INSERT what's new:
INSERT INTO target_tbl(target_id, col1)
SELECT t1.target_id, t1.col1
FROM target_tmp t1
LEFT JOIN target_tbl t USING (target_id)
WHERE t.target_id IS NULL;
Clean up if your session goes on (temp tables are dropped at end of session automatically):
DROP TABLE target_tmp;
Or use ON COMMIT DROP or similar with CREATE TEMP TABLE.
Code untested, but should work in any modern version of PostgreSQL except for typos.

What is a "batch", and why is GO used?

I have read and read over MSDN, etc. Ok, so it signals the end of a batch.
What defines a batch? I don't see why I need go when I'm pasting in a bunch of scripts to be run all at the same time.
I've never understood GO. Can anyone explain this better and when I need to use it (after how many or what type of transactions)?
For example why would I need GO after each update here:
UPDATE [Country]
SET [CountryCode] = 'IL'
WHERE code = 'IL'
GO
UPDATE [Country]
SET [CountryCode] = 'PT'
WHERE code = 'PT'

GO is not properly a TSQL command.
Instead it's a command to the specific client program which connects to an SQL server (Sybase or Microsoft's - not sure about what Oracle does), signalling to the client program that the set of commands that were input into it up till the "go" need to be sent to the server to be executed.
Why/when do you need it?
GO in MS SQL server has a "count" parameter - so you can use it as a "repeat N times" shortcut.
Extremely large updates might fill up the SQL server's log. To avoid that, they might need to be separated into smaller batches via go.
In your example, if updating for a set of country codes has such a volume that it will run out of log space, the solution is to separate each country code into a separate transaction - which can be done by separating them on the client with go.
Some SQL statements MUST be separated by GO from the following ones in order to work.
For example, you can't drop a table and re-create the same-named table in a single transaction, at least in Sybase (ditto for creating procedures/triggers):
> drop table tempdb.guest.x1
> create table tempdb.guest.x1 (a int)
> go
Msg 2714, Level 16, State 1
Server 'SYBDEV', Line 2
There is already an object named 'x1' in the database.
> drop table tempdb.guest.x1
> go
> create table tempdb.guest.x1 (a int)
> go
>

GO is not a statement, it's a batch separator.
The blocks separated by GO are sent by the client to the server for processing and the client waits for their results.
For instance, if you write
DELETE FROM a
DELETE FROM b
DELETE FROM c
, this will be sent to the server as a single 3-line query.
If you write
DELETE FROM a
GO
DELETE FROM b
GO
DELETE FROM c
, this will be sent to the server as 3 one-line queries.
GO itself does not go to the server (no pun intended). It's a pure client-side reserved word and is only recognized by SSMS and osql.
If you will use a custom query tool to send it over the connection, the server won't even recognize it and issue an error.

Many command need to be in their own batch, like CREATE PROCEDURE
Or, if you add a column to a table, then it should be in its own batch.
If you try to SELECT the new column in the same batch it fails because at parse/compile time the column does not exist.
GO is used by the SQL tools to work this out from one script: it is not a SQL keyword and is not recognised by the engine.
These are 2 concrete examples of day to day usage of batches.
Edit: In your example, you don't need GO...
Edit 2, example. You can't drop, create and permission in one batch... not least, where is the end of the stored procedure?
IF OBJECT_ID ('dbo.uspDoStuff') IS NOT NULL
DROP PROCEDURE dbo.uspDoStuff
GO
CREATE PROCEDURE dbo.uspDoStuff
AS
SELECT Something From ATable
GO
GRANT EXECUTE ON dbo.uspDoStuff TO RoleSomeOne
GO

Sometimes there is a need to execute the same command or set of commands over and over again. This may be to insert or update test data or it may be to put a load on your server for performance testing. Whatever the need the easiest way to do this is to setup a while loop and execute your code, but in SQL 2005 there is an even easier way to do this.
Let's say you want to create a test table and load it with 1000 records. You could issue the following command and it will run the same command 1000 times:
CREATE TABLE dbo.TEST (ID INT IDENTITY (1,1), ROWID uniqueidentifier)
GO
INSERT INTO dbo.TEST (ROWID) VALUES (NEWID())
GO 1000
source:
http://www.mssqltips.com/tip.asp?tip=1216
Other than that it marks the "end" of an SQL block (e.g. in a stored procedure)... Meaning you're on a "clean" state again... e.G: Parameters used in the statement before the code are reset (not defined anymore)

As everyone already said, "GO" is not part of T-SQL. "GO" is a batch separator in SSMS, a client application used to submit queries to the database. This means that declared variables and table variables will not persist from code before the "GO" to code following it.
In fact, GO is simply the default word used by SSMS. This can be changed in the options if you want. For a bit of fun, change the option on someone else's system to use "SELECT" as a batch seperator instead of "GO". Forgive my cruel chuckle.

It is used to split logical blocks. Your code is interpreted into sql command line and this indicate next block of code.
But it could be used as recursive statement with specific number.
Try:
exec sp_who2
go 2
Some statement have to be delimited by GO:
use DB
create view thisViewCreationWillFail

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to Export large Table with 1.2 billion rows from Postgresql - postgresql

I think your best bet would be to chunk it up, I don't see how it's going to process 1.2 billion lines without freaking out. Have a script that does off 10000 at a time or something and saves the starting index for the next cycle.

Related

See all rows and download the result to csv in pgAdmin4

How do I modify this PostgreSQL table?

T-SQL - Trying to query something across all databases on my server

How to tell if record has changed in Postgres

What is a "batch", and why is GO used?

Categories

Resources