Restore a PostgreSQL 9.6 database from an old complete dump and/or an up-to-date base directory and other rescued files - postgresql

I'm trying to restore/rescue a database from the pieces that I have:
I have all the recent files in PGDATA/base (/var/lib/postgresql/9.6/main/base/), but not the complete /var/lib/postgresql/9.6/main/ directory.
I have all the files from an old (and not much different) backup dump, which I restored into a new installation of PostgreSQL 9.6.
I have a lot of files rescued from the hard drive (with ddrescue), including thousands of files without a name (they sit in the lost+found directory with a "#" followed by a number instead), so, for instance:
I have the pg_class file
I have the pg_clog directory with the 0000 file
Edit:
I probably have the contents of pg_xlog, but I don't have the original file names. I have 5 files of 16777216 bytes each:
#288294 (date 2019-04-01)
#288287 (date 2019-05-14)
#288293 (date 2019-07-02)
#261307 (date 2019-11-27)
#270185 (date 2020-01-28)
Also, my old dump is from 2019-04-23, so could the first one match it?
So my next step is to try to read those files with pg_xlogdump and/or to rename them with proper WAL file names (starting with 00000001000000000000000A, ordered by date, and put them into the new pg_xlog directory, since that is how I saw the system naming them; could that work?). I also realized that the last one has the date of the day the hard drive crashed, so I do have the last segment.
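A minimal sketch of that renaming idea (the rescue path and the target segment name are hypothetical, and it assumes the rescued 16 MB files really are WAL segments):
# copy one rescued file under a guessed WAL segment name and inspect it with pg_xlogdump
mkdir -p /tmp/wal_check
cp '/mnt/rescue/lost+found/#288294' /tmp/wal_check/00000001000000000000000A
pg_xlogdump /tmp/wal_check/00000001000000000000000A | head -n 20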
The PGDATA/base directory I rescued from the (damaged) hard drive contains directories 1, 12406, 12407 and 37972, each with a lot of files inside. I checked with pg_filedump -fi that my up-to-date data is stored in the files under directory 37972.
The same (but older) data is stored in the files under PGDATA/base/16387 in the restored dump.
I tried to copy the files directly from one directory to the other, mixing the updated data over the old database, but it doesn't work. After solving permission errors I can enter the "Frankenstein" database this way:
postgres@host:~$ postgres --single -P -D /var/lib/postgresql/9.6/main/ dbname
And when I try to do something, like a reindex, I get this error:
PostgreSQL stand-alone backend 9.6.16
backend> reindex system dbname;
ERROR: could not access status of transaction 136889
DETAIL: Could not read from file "pg_subtrans/0002" at offset 16384: Success.
CONTEXT: while checking uniqueness of tuple (1,7) in relation "pg_toast_2619"
STATEMENT: reindex system dbname;
Certainly the pg_subtrans/0002 file is part of the "Frankenstein" and not the good one (I haven't found the good one yet, at least not under that name), so I tried two things: first copying another file that seemed similar, and then generating 8192 zero bytes with dd into that file. In both cases I get the same error (and when the file doesn't exist I get DETAIL: Could not open file "pg_subtrans/0002": No such file or directory.). Anyway, I have no idea what that file should contain. Do you think I could get that data from some other file? Or could I find the missing file with some tool? Also, pg_filedump shows me that the other file in that directory, pg_subtrans/0000, is empty.
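For reference, this is roughly the dd step described above, extended to a whole 256 kB SLRU segment (32 pages of 8192 bytes); the error at offset 16384 suggests the file is simply too short, but whether a fully zero-filled segment gets past it is only an assumption:
# zero-fill pg_subtrans/0002 up to a full 256 kB segment (32 x 8192-byte pages)
dd if=/dev/zero of=/var/lib/postgresql/9.6/main/pg_subtrans/0002 bs=8192 count=32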
Extra note: I found this useful blog post that talks about restoring from just rescued files using pg_filedump, pg_class's file, reindex system and other tools, but it is hard for me to understand how to adapt it to my concrete and easier problem (I think my problem is easier because I have a dump): https://www.commandprompt.com/blog/recovering_a_lost-and-found_database/

Finally we completely restored the database from the PGDATA/base/37972 directory, in four steps:
Checking and "greping" with pg_filedump -fi which file correspond
to each table. To "greping" easier we made a script.
#!/bin/bash
for filename in ./*; do
    echo "$filename"
    pg_filedump -fi "$filename" | grep "$1"
done
NOTE: Only useful with small strings.
Executing the great tool pg_filedump -D. -D is a new option (available from pg_filedump version ≥ 10) that decodes tuples using a given comma-separated list of types.
Since we know the types (we designed the database), we "just" need to give the comma-separated list of types matching the table's columns. I wrote "just" because in some cases it can be a little complicated. One of our tables needed this kind of command:
pg_filedump -D text,text,text,text,text,text,text,text,timestamp,text,text,text,text,int,text,text,int,text,int,text,text,text,text,text,text,text,text,text,text,int,int,int,int,int,int,int,int,text,int,int 38246 | grep COPY > restored_table1.txt
From the pg_filedump -D documentation:
Supported types:
bigint
bigserial
bool
char
charN -- char(n)
date
float
float4
float8
int
json
macaddr
name
oid
real
serial
smallint
smallserial
text
time
timestamp
timetz
uuid
varchar
varcharN -- varchar(n)
xid
xml
~ -- ignores all attributes left in a tuple
All those text columns were actually of type character varying(255), but varcharN didn't work for us, so after further tests we finally changed it to text.
Our timestamp was of type timestamp with time zone, but timetz didn't work for us, so after further tests we finally changed it to timestamp and opted to lose the time zone data.
These changes worked perfectly for this table.
Other tables were much easier:
pg_filedump -D int,date,int,text 38183 | grep COPY > restored_table2.txt
As we get just "raw" data we have to re-format to CSV format. So we made a python program for format from pg_filedump -D output to CSV.
We inserted each CSV into PostgreSQL (after creating each empty table again):
COPY scheme."table2"(id_comentari,id_observacio,text,data,id_usuari,text_old)
FROM '<path>/table2.csv' DELIMITER '|' CSV HEADER;
I hope this will help other people :)

That is doomed. Without the information in pg_xlog and (particularly) pg_clog, you cannot get the information back.
A knowledgeable forensics expert might be able to salvage some of your data, but it is not a straightforward process.

Related

Is it possible to omit specific tables from an existing dump file when using psql to import data?

I have a dump file and I want to import it, but it contains a log table with millions of records, which I need to exclude when executing psql < dump_file. Note: I cannot use pg_restore.
Edit:
Since the best option is to edit the file manually, any suggestions for removing 650K lines from a 690K-line file on Windows?
The correct way is to fix whatever problem is preventing you from using pg_restore (I guess you have already taken the dump in the wrong format?).
The quick and dirty way is to use a program to exclude what you don't want. I'd use perl, because that is what I would use. sed or awk have similar features, and I'm sure there are ways to do it in every other language you would care to look at.
perl -ne 'print unless /^COPY public.pgbench_accounts/../^\\\.$/' dump.file | psql
This excludes every line from the one that starts with COPY public.pgbench_accounts up to and including the next following \.
Of course you would replace public.pgbench_accounts with your table's name, making sure to quote it properly if that is needed.
It might get confused if your database contains a row whose first column starts with the text "COPY public.pgbench_accounts"...
Then you have to edit the file manually.
A crude alternative might be: create a table with the same name as the log table, but with an incompatible definition or no permissions for the importing user. Then restoring that table will fail. If you ignore these errors, you have reached your goal.
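A sketch of that crude alternative (the table and column names here are hypothetical; the only requirement is that the definition clashes with the dump):
-- pre-create the log table with an incompatible definition so its restore fails
CREATE TABLE public.log_table (only_one_column int);
-- then run the import and ignore the errors for that table:
-- psql -d target_db -f dump_file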

How to create a partitioned table from a Google Cloud Storage bucket?

Every weekend I add a few files to a google bucket and then run something from the command line to "update" a table with the new data.
By "update" I mean that I delete the table and then remake it by using all the files in the bucket, including the new files.
I do everything by using python to execute the following command in the Windows command line:
bq mk --table --project_id=hippo_fence-5412 mouse_data.partition_test gs://mybucket/mouse_data/* measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
This table is getting massive (>200 GB) and it would be much cheaper for the lab to use partitioned tables.
I've tried to partition the table in a few ways, including what is recommended by the official docs, but I can't make it work.
The most recent command I tried was just inserting --time_partitioning_type=DAY like:
bq mk --table --project_id=hippo_fence-5412 --time_partitioning_type=DAY mouse_data.partition_test gs://mybucket/mouse_data/* measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
but that didn't work, giving me the error:
FATAL Flags parsing error: Unknown command line flag 'time_partitioning_type'
How can I make this work?
For the old data, a possible solution would be to create an empty partitioned table and then import each bucket file into the desired day partition. Unfortunately it didn't work with wildcards when I tested it.
1. Create the partitioned table
bq mk --table --time_partitioning_type=DAY [MY_PROJECT]:[MY_DATASET].[PROD_TABLE] measurement_date:TIMESTAMP,symbol:STRING,height:FLOAT,weight:FLOAT,age:FLOAT,response_time:FLOAT
2. Import each file into the desired day partition. Here is an example for a file from 22 February 2018.
bq load [MY_PROJECT]:[MY_DATASET].[PROD_TABLE]$20180222 gs://MY-BUCKET/my_file.csv
3. Process the current uploads normally and they will automatically end up in the partition for the day of the upload.
bq load [MY_PROJECT]:[MY_DATASET].[PROD_TABLE] gs://MY-BUCKET/files*
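If there are many historical files, step 2 can be scripted; a hedged sketch, assuming each file name ends with its date as YYYYMMDD (the bucket layout is hypothetical):
# load every daily file into its matching day partition
for f in $(gsutil ls 'gs://MY-BUCKET/mouse_data/*.csv'); do
    day=$(basename "$f" .csv | grep -o '[0-9]\{8\}$')
    bq load "[MY_PROJECT]:[MY_DATASET].[PROD_TABLE]\$${day}" "$f"
done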

boolean field in redshift copy

I am producing a comma-separated file in S3 that needs to be copied to a staging table in a redshift database using the postgres COPY command.
It has one boolean field. With every sensible way I can think of to represent the boolean value in the file, redshift copy complains, usually with "Unknown boolean format".
I'm going to give up and change the staging table field to a smallint so that I can proceed with the copy and translate the value on the load from staging to the final redshift table, but I'm curious if anyone knows the correct incantation.
A zero or one works just fine for us.
Check your loads carefully, it may well be another issue that's 'pushing' invalid data into your boolean column.
For instance, we had all kinds of crazy characters embedded in our data that would cause errors like that. I eventually settled on using the ASCII US (unit separator) character as the record separator.
Check to make sure you're excluding the headers during the COPY command.
I ran into the same problem, but adding the ignoreheader 1 option (ignores 1 header line during import) solved the issue.
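A hedged sketch combining the suggestions above (the table, bucket and role names are placeholders): load 0/1 boolean values from a CSV and skip the header row:
COPY staging.my_table
FROM 's3://my-bucket/my-file.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
CSV
IGNOREHEADER 1;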

String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence:

Team,
I am using Redshift (version 8.0.2). While loading data using the COPY command, I get the error: "String contains invalid or unsupported UTF8 codepoints, Bad UTF8 hex sequence: bf (error 3)".
It seems COPY is trying to load the UTF-8 byte "bf" into a VARCHAR field. As per Amazon Redshift, error code 3 is defined as follows:
error code 3:
The UTF-8 single-byte character is out of range. The starting byte must not be 254, 255
or any character between 128 and 191 (inclusive).
Amazon recommends this as the solution: replace the character with a valid UTF-8 code sequence or remove the character.
Could you please help me with how to replace the character with a valid UTF-8 code?
When I checked the database properties in pgAdmin, it shows the encoding as UTF-8.
Please guide me on how to replace the character in the input delimited file.
Thanks...
I've run into this issue in RedShift while loading TPC-DS datasets for experiments.
Here is the documentation and forum chatter I found via AWS: https://forums.aws.amazon.com/ann.jspa?annID=2090
And here are the explicit commands you can use to solve data conversion errors: http://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-conversion.html#copy-acceptinvchars
You can explicitly replace the invalid UTF-8 characters or disregard them altogether during the COPY phase by stating ACCEPTINVCHARS.
Try this:
copy table from 's3://my-bucket/my-path'
credentials 'aws_iam_role=<your role arn>'
ACCEPTINVCHARS
delimiter '|' region 'us-region-1';
Warnings:
Load into table 'table' completed, 500000 record(s) loaded successfully.
Load into table 'table' completed, 4510 record(s) were loaded with replacements made for ACCEPTINVCHARS. Check 'stl_replacements' system table for details.
0 rows affected
COPY executed successfully
Execution time: 33.51s
Sounds like the encoding of your file might not be UTF-8. You might try this technique that we use sometimes:
cat myfile.tsv | iconv -c -f ISO-8859-1 -t utf8 > myfile_utf8.tsv
Many people loading CSVs into databases get their files from someone using Excel, or have access to Excel themselves. If so, this problem is quickly solved by either:
Saving the file out of Excel using Save As and selecting the CSV UTF-8 (Comma delimited) (*.csv) format, i.e. by requesting/training those giving you the files to use this export format. Note that many people export to CSV using the default CSV (Comma delimited) (*.csv) format, and there is a difference.
Or loading the CSV into Excel yourself and then immediately saving it back out in the UTF-8 CSV format.
Of course this wouldn't work for files unusable by Excel, i.e. larger than 1 million rows, etc. Then I would use the iconv suggestion by mike_pdb.
I noticed that an Athena external table is able to parse data that the Redshift COPY command is unable to handle. We can use the alternative approach below when encountering "String contains invalid or unsupported UTF8 codepoints. Bad UTF8 hex sequence: 8b (error 3)".
Follow the steps below if you want to load data into Redshift database db2, table table2.
Have a Glue crawler IAM role ready which has access to S3.
Run the crawler.
Validate the database and table created in Athena by the Glue crawler, say external database db1_ext with table table1_ext.
Log in to Redshift and link it to the Glue Catalog by creating a Redshift external schema (db1_schema) using the command below.
CREATE EXTERNAL SCHEMA db1_schema
FROM DATA CATALOG
DATABASE 'db1_ext'
IAM_ROLE 'arn:aws:iam:::role/my-redshift-cluster-role';
Load from the external table:
INSERT INTO db2.table2 (SELECT * FROM db1_schema.table1_ext)

How should I open a PostgreSQL dump file and add actual data to it?

I have a pretty basic database. I need to drop a good-sized users list into the db. I have the dump file, and I need to convert it to a .pg file and then somehow load this data into it.
The data I need to add are in CSV format.
I assume you already have a .pg file, which I assume is a database dump in the "custom" format.
PostgreSQL can load data in CSV format using the COPY statement. So the absolute simplest thing to do is just add your data to the database this way.
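For example, a minimal sketch (the table and column names and the file path are hypothetical):
COPY users (name, email)
FROM '/path/to/users.csv'
WITH (FORMAT csv, HEADER true);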
If you really must edit your dump, and the file is in the "custom" format, there is unfortunately no way to edit the file manually. However, you can use pg_restore to create a plain SQL backup from the custom format and edit that instead. pg_restore with no -d argument will generate an SQL script for insertion.
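Something along these lines (the file names are placeholders):
# turn the custom-format dump into an editable plain SQL script
pg_restore -f plain_dump.sql your_dump.pg
# after editing, load it with psql
psql -d your_database -f plain_dump.sql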
As suggested by Daniel, the simplest solution is to keep your data in CSV format and just import it into Postgres as is.
If you're trying to merge this CSV data into a 3rd-party Postgres dump file, then you'll need to first convert the data into SQL insert statements.
One possible unix solution:
awk -F, '{printf "INSERT INTO my_tab VALUES (\047%s\047, \047%s\047, \047%s\047);\n",$1,$2,$3}' data.csv