How to COPY CSV file into table resolving foreign key values into ids - postgresql

I'm an expert at MSSQL but a beginner with PostgreSQL, so all my assumptions are likely wrong.
Boss is asking me to get a 300 MB CSV file into PostgreSQL 12 (a quarter million rows and 100+ columns). The file has usernames in 20 foreign key columns that would need to be looked up and converted to id int values before being inserted into a table. The COPY command doesn't seem to handle joining a CSV to other tables before inserting. Am I going in the wrong direction? I want to test locally but ultimately am only allowed to give the CSV to a DBA for importing into a Docker instance on a server. If only I could use pgAdmin and directly insert the rows!
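Locally I was picturing something along these lines: COPY into a staging table, then resolve the usernames with joins on the way into the real table. I don't know if that's idiomatic PostgreSQL, and all table and column names here are made up:

CREATE TEMP TABLE staging (
    some_value    text,
    owner_name    text,     -- username that has to become an id
    approver_name text      -- ... plus the other ~100 columns, all as text
);

COPY staging FROM '/path/to/file.csv' WITH (FORMAT csv, HEADER);

INSERT INTO target_table (some_value, owner_id, approver_id)
SELECT s.some_value, u1.id, u2.id
FROM staging s
JOIN users u1 ON u1.username = s.owner_name
JOIN users u2 ON u2.username = s.approver_name;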

Related

Basic questions about Cloud SQL

I'm trying to populate a cloud sql database using a cloud storage bucket, but I'm getting some errors. The csv has the headers (or column names) as the first row and does not have all the columns (some columns in the database can be null, so I'm only loading the data I need for now).
The database is in postgresql and this is the first database in GCP I'm trying to configure and I'm a little bit confused.
Does it matter if the csv file has the column names?
Does the order of the columns matter in the csv file? (I guess it does if not all of them are present in the csv.)
The PK of the table is a serial number, which I'm not including in the csv file. Do I also need to include the PK? I mean, because it's a serial number it should be "auto-assigned", right?
Sorry for the noob questions and thanks in advance :)
This is all covered by the COPY documentation.
It matters in that you will have to specify the HEADER option so that the first line is skipped:
[...] on input, the first line is ignored.
The order matters, and if the CSV file does not contain all the columns in the same order as the table, you have to specify them with COPY:
COPY mytable (col12, col2, col4, ...) FROM '/dir/afile' WITH (...);
Same as above: if you omit a table column from the column list, it will be filled with the default value, which in this case is the autogenerated number.
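For example, with a hypothetical table like this, the serial id is simply left out of the column list and gets generated on its own:

CREATE TABLE mytable (
    id    serial PRIMARY KEY,
    col12 text,
    col2  text,
    col4  integer
);

COPY mytable (col12, col2, col4) FROM '/dir/afile' WITH (FORMAT csv, HEADER);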

copy csv postgres ignore rows that violate constraints

I have a .csv file with ~300,000 rows, some of which violate certain constraints I set in my postgres database. Is there a way to copy my .csv file into the database and have postgres filter out the rows that violate the constraints? I do not want these rows to show up in the database.
If this is not possible, is there any other way to solve this problem?
What I'm doing right now is
COPY blocksequences FROM '/tmp/blocksequences.csv' CSV HEADER;
And I get
'ERROR: new row for relation "blocksequences" violates check constraint "blocksequences_partid3_check"
DETAIL: Failing row contains (M001-M049-S186, M001, null, M049, S186).
CONTEXT: COPY blocksequences, line 680: "M001-M049-S186,M001,,M049,S186"
The reason for the error: the column that contains M049 is not allowed to have that string entered. Many other rows have similar violations.
I read a little about "exception when check violation -- do nothing"; am I on the right track here? It seems like it might only be a MySQL thing.
Usually this is done in this way:
create a temporary table with the same structure as the destination one but without constraints,
copy data to the temporary table with COPY command,
copy the rows that do fulfill the constraints from the temp table to the destination one, using an INSERT command with conditions in the WHERE clause based on the table constraints,
drop the temporary table.
When dealing with really large CSV files or very limited server resources, use the file_fdw extension instead of temporary tables. It's a much more efficient way, but it requires server access to the CSV file (while copying to a temporary table can be done over the network).
In Postgres 12 you can use the WHERE clause in COPY FROM.
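For illustration, a sketch of both variants against your table. The column name and the filter condition below are guesses based on the constraint name in the error, so adapt them to the real check constraint:

-- temp table route
CREATE TEMP TABLE blocksequences_staging (LIKE blocksequences);   -- LIKE copies the columns but not the check constraints

COPY blocksequences_staging FROM '/tmp/blocksequences.csv' WITH (FORMAT csv, HEADER);

INSERT INTO blocksequences
SELECT *
FROM blocksequences_staging
WHERE partid3 IS DISTINCT FROM 'M049';   -- mirror the real check constraint here

DROP TABLE blocksequences_staging;

-- Postgres 12 and later: filter directly while copying
COPY blocksequences FROM '/tmp/blocksequences.csv' WITH (FORMAT csv, HEADER)
WHERE partid3 IS DISTINCT FROM 'M049';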

SQL Server Express 2008 10GB Size Limit

I am approaching the 10 GB limit that Express has on the primary database file.
The main problem appears to be some fixed length char(500) columns that are never near that length.
I have two tables with about 2 million rows between them. These two tables add up to about 8 GB of data with the remainder being spread over another 20 tables or so. These two tables each have 2 char(500) columns.
I am testing a way to convert these columns to varchar(500) and reclaim the space taken up by the trailing spaces.
I tried this:
Alter Table Test_MAILBACKUP_RECIPIENTS
Alter Column SMTP_address varchar(500)
GO
Alter Table Test_MAILBACKUP_RECIPIENTS
Alter Column EXDN_address varchar(500)
This quickly changed the column type but obviously didn’t recover the space.
The only way I can see to do this successfully is to:
Create a new table in tempdb with the varchar(500) columns,
Copy the information into the temp table trimming off the trailing spaces,
Drop the real table,
Recreate the real table with the new varchar(500) columns,
Copy the information back.
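Something like this is what I have in mind (the id column name below is made up; only the two address columns are real):

SELECT RecipientID,
       CAST(RTRIM(SMTP_address) AS varchar(500)) AS SMTP_address,
       CAST(RTRIM(EXDN_address) AS varchar(500)) AS EXDN_address
INTO   #Recipients_tmp                    -- lands in tempdb
FROM   Test_MAILBACKUP_RECIPIENTS;

DROP TABLE Test_MAILBACKUP_RECIPIENTS;

-- recreate Test_MAILBACKUP_RECIPIENTS with the varchar(500) columns, then:
SET IDENTITY_INSERT Test_MAILBACKUP_RECIPIENTS ON;

INSERT INTO Test_MAILBACKUP_RECIPIENTS (RecipientID, SMTP_address, EXDN_address)
SELECT RecipientID, SMTP_address, EXDN_address
FROM   #Recipients_tmp;

SET IDENTITY_INSERT Test_MAILBACKUP_RECIPIENTS OFF;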
I'm open to other ideas here, as I'll have to take my application offline while this process completes.
Another thing I’m curious about is the primary key identity column.
This table has a Primary Key field set as an identity.
I know I have to use Set Identity_Insert on to allow the records to be inserted into the table and turn it off when I’m finished.
How will recreating the table affect new records being inserted into it after I'm finished? Or is this just "Microsoft Magic" that I don't need to worry about?
The problem with your initial approach was that you converted the columns to varchar but didn't trim the existing whitespace (which is kept after the conversion). After changing the data type of the columns, you should do:
update Test_MAILBACKUP_RECIPIENTS set
SMTP_address=rtrim(SMTP_address), EXDN_address=rtrim(EXDN_address)
This will eliminate all trailing spaces from your table, but note that the actual disk size will stay the same, as SQL Server doesn't automatically shrink database files; it just marks that space as unused and available for other data.
You can use this script from another question to see the actual space used by data in the DB files:
Get size of all tables in database
Usually shrinking a database is not recommended, but when there is a big difference between the used space and the file size you can do it with dbcc shrinkdatabase:
dbcc shrinkdatabase (YourDatabase, 10) -- leaving 10% of free space for new data
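For a quick per-table check (instead of the whole database), sp_spaceused also works:

EXEC sp_spaceused 'Test_MAILBACKUP_RECIPIENTS';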
OK I did a SQL backup, disabled the application and tried my script anyway.
I was shocked that it ran in under 2 minutes on my slow old server.
I re-enabled my application and it still works. (Yay)
Looking at the reported size of the table now, it went from 1.4 GB to 126 MB! So at least that has bought me some time.
(Before and after screenshots of the table properties, with the Data size in KB circled.)
My next problem is the MailBackup table which also has two char(500) columns.
It is shown as 6.7GB.
I can't use the same approach as this table contains a FILESTREAM column which has around 190 GB of data, and tempdb does not support FILESTREAM as far as I know.
Looks like this might be worth a new question.

exporting data from one table to another in different database using DOS command

I have to transfer data from an old database to a new database where the table names and column names are different.
Can this be done with a DOS command, or is there another solution?
The new database is PostgreSQL and the old one is MySQL.
My concern is that the table and column names are different, though the number of columns is the same.
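Would something along these lines be a reasonable path? All table and column names here are made up, and I assume the CSV column order just has to match the column list in the COPY:

-- on the MySQL side
SELECT old_col1, old_col2, old_col3
INTO OUTFILE '/tmp/old_table.csv'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM old_table;

-- on the PostgreSQL side
COPY new_table (new_col1, new_col2, new_col3)
FROM '/tmp/old_table.csv' WITH (FORMAT csv);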
Thank you
I do not know the PostgreSQL part, but for SQL Server you can use sqlcmd.exe to export data in text format, with or without column names.
Please check
http://msdn.microsoft.com/en-us/library/ms162773.aspx

Using MySQLdump for sub-set of database migration

I've been using both mysql and mysqldump to teach myself how to get data out of one database, but I consider myself a moderate MySQL newbie, just for the record.
What I'd like to do is get a sub-set of one database into a brand new database on a different server, so I need to create both the db/table creation sql as well as populating the data records.
Let's say my original db has 10 tables with 100 rows each. My new database will have 4 of those tables (all original columns), but a further-refined dataset of 40 rows each. Those 40 rows are isolated with some not-so-short SELECT statements, one for each table.
I'd like to produce .sql file(s) that I can call from mysql to load/insert my exported data. How can I generate those sql files? I have HEARD that you can call a select statement from mysqldump, but haven't seen relevant examples with select statements as long as mine.
Right now I can produce SQL output that is just the result set with column names, but no insert code, etc.
Assistance is GREATLY appreciated.
You will probably have to use mysqldump to dump your tables one at a time, using the where clause:
-w, --where='where-condition'
Dump only selected records. Note that quotes are mandatory:
"--where=user='jimf'" "-wuserid>1" "-wuserid<1"
For example:
mysqldump database1 table1 --where='rowid<10'
See docs: http://linux.die.net/man/1/mysqldump
mysqldump each table with a where clause like Dennis said above, one table at a time, and merge the scripts with cat:
cat customer.db order.db list.db price.db > my_new_db.db