How to export full-text files with SQL? - postgresql

There are an easy way to import/export full-text fields as files?
that solve the problem of "load as multiple lines". Trying with SQL's COPY I can only to transform full-file into full-table, not into a single text field, because each line from COPY is a raw.
that solve the save-back problem, to save the full XML file in the filesystem, without changes in bynary representation (preserving SHA1), and without other exernal procedures (as Unix sed use).
The main problem is on export, so this is the title of this page.
PS: the "proof of same file" in the the round trip — import, export back and compare with original — can be obtained by sha1sum demonstration; see examples below. So, a natural demand is also to check same SHA1 by SQL, avoiding to export on simple check tasks.
All examples
Import a full text into a full-table (is not what I need), and test that can export as the same text. PS: I need to import one file into one field and one row.
Transform full table into one file (is not what I need) and test that can export as same text.PS: I need one row (of one field) into one file.
Calculate the hash by SQL, the SHA1 of the field. Must be the same when compare ... Else it is not a solution for me.
The folowing examples show each problem and a non-elegant workaround.
1. Import
CREATE TABLE ttmp (x text);
COPY ttmp FROM '/tmp/test.xml' ( FORMAT text ); -- breaking lines lines
COPY (SELECT x FROM ttmp) TO '/tmp/test_back.xml' (format TEXT);
Checking that original and "back" have exactly the same content:
sha1sum /tmp/test*.*
570b13fb01d38e04ebf7ac1f73dfad0e1d02b027 /tmp/test_back.xml
570b13fb01d38e04ebf7ac1f73dfad0e1d02b027 /tmp/test.xml
PS: seems perfect, but the problem here is the use of many rows. A real import-solution can import a file into a one-row (and one field). A real export-solution is a SQL function that produce test_back.xml from a single row (of a single field).
2. Transform full table into one file
Use it to store XML:
CREATE TABLE xtmp (x xml);
INSERT INTO xtmp (x)
SELECT array_to_string(array_agg(x),E'\n')::xml FROM ttmp
;
COPY (select x::text from xtmp) TO '/tmp/test_back2-bad.xml' ( FORMAT text );
... But not works as we can check by sha1sum /tmp/test*.xml, not produce the same result for test_back2-bad.xml.
So do also a translation from \n to chr(10), using an external tool (perl, sed or any other) perl -p -e 's/\\n/\n/g' /tmp/test_back2-bad.xml > /tmp/test_back2-good.xml
Ok, now test_back2-good.xml have the same hash ("570b13fb..." in my example) tham original.
Use of Perl is a workaround, how to do without it?
3. The SHA1 of the field
SELECT encode(digest(x::text::bytea, 'sha1'), 'hex') FROM xtmp;
Not solved, is not the same hash tham original (the "570b13fb..." in my example)... Perhaps the ::text enforced internal representation with \n symbols, so a solution will be direct cast to bytea, but it is an invalid cast. The other workaround also not is a solution,
SELECT encode(digest( replace(x::text,'\n',E'\n')::bytea, 'sha1' ), 'hex')
FROM xtmp
... I try CREATE TABLE btmp (x bytea) and COPY btmp FROM '/tmp/test.xml' ( FORMAT binary ), but error ("unknown COPY file signature").

COPY isn't designed for this. It's meant to deal with table-structured data, so it can't work without some way of dividing rows and columns; there will always be some characters which COPY FROM interprets as separators, and for which COPY TO will insert some escape sequence if it finds one in your data. This isn't great if you're looking for a general file I/O facility.
In fact, database servers aren't designed for general file I/O. For one thing, anything which interacts directly with the server's file system will require a superuser role. If at all possible, you should just query the table as usual, and deal with the file I/O on the client side.
That said, there are a few alternatives:
The built-in pg_read_file() function, and pg_file_write() from the adminpack module, provide the most direct interface to the file system, but they're both restricted to the cluster's data directory (and I wouldn't recommend storing random user-created files in there).
lo_import() and lo_export() are the only built-in functions I know of which deal directly with file I/O and which have unrestricted access to the server's file system (within the constraints imposed by the host OS), but the Large Object interface is not particularly user-friendly....
If you install the untrusted variant of a procedural language like Perl (plperlu) or Python (plpythonu), you can write wrapper functions for that language's native I/O routines.
There isn't much you can't accomplish via COPY TO PROGRAM if you're determined enough - for one, you could COPY (SELECT 1) TO PROGRAM 'mv <source_file> <target_file>' to work around the limitations of pg_file_write() - though this blurs the line between SQL and external tools somewhat (and whoever inherits your codebase will likely not be impressed...).

You can use plpythonu f.open(), f.write(), f.close() within a postgres function to write to a file.
Language extension would need to be installed.,
https://www.postgresql.org/docs/8.3/static/plpython.html
Working example from the mailing list.
https://www.postgresql.org/message-id/flat/20041106125209.55697.qmail%40web51806.mail.yahoo.com#20041106125209.55697.qmail#web51806.mail.yahoo.com
for example plpythonu
CREATE FUNCTION makefile(p_file text, p_content text) RETURNS text AS $$
o=open(args[0],"w")
o.write(args[1])
o.close()
return "ok"
$$ LANGUAGE PLpythonU;
PS: for safe implementation see this example.
Preparing
There are a not-so-obvious procedure to use PLpython extension. Supposing an UBUNTU server:
On SQL check SELECT version().
On terminal check sudo apt install postgresql-plpython listed versions.
Install the correct version, eg. sudo apt install postgresql-plpython-9.6.
Back to SQL do CREATE EXTENSION plpythonu.
Testing
The /tmp is default, to create or use other folder, eg. /tmp/sandbox, use sudo chown postgres.postgres /tmp/sandbox.
Suppose the tables of the question's examples. SQL script, repeating some lines:
DROP TABLE IF EXISTS ttmp;
DROP TABLE IF EXISTS xtmp;
CREATE TABLE ttmp (x text);
COPY ttmp FROM '/tmp/sandbox/original.xml' ( FORMAT text );
COPY (SELECT x FROM ttmp) TO '/tmp/sandbox/test1-good.xml' (format TEXT);
CREATE TABLE xtmp (x xml);
INSERT INTO xtmp (x)
SELECT array_to_string(array_agg(x),E'\n')::xml FROM ttmp
;
COPY (select x::text from xtmp)
TO '/tmp/sandbox/test2-bad.xml' ( FORMAT text );
SELECT makefile('/tmp/sandbox/test3-good.xml',x::text) FROM xtmp;
The sha1sum *.xml output of my XML original file:
4947.. original.xml
4947.. test1-good.xml
949f.. test2-bad.xml
4947.. test3-good.xml

Related

make tsvector tokenize by space only

I need to create a tsvector that does not split its content by hyphens but ideally only by whitespace.
select to_tsvector('simple','7073-03-001-01 7072-05-003-06')
creates
'-001':3 '-003':7 '-01':4 '-03':2 '-05':6 '-06':8 '7072':5 '7073':1
where I rather want
'7072-05-003-06':2 '7073-03-001-01':1
is this possible somehow?
There is a simple example of a parser called test_parser which seems to do what you want. It was last in the documents in 9.4, after that it was moved to only be documented in the source tree. These test extensions aren't always installed, so you might need to take special steps (depending on how you installed PostgreSQL and what your OS is and whether you are really using an EOL version) to get it.
create extension test_parser ;
create text search configuration test ( parser = testparser);
ALTER TEXT SEARCH CONFIGURATION test ADD MAPPING FOR word WITH simple;
SELECT * FROM to_tsvector('test', '7073-03-001-01 7072-05-003-06');
to_tsvector
---------------------------------------
'7072-05-003-06':2 '7073-03-001-01':1

postgresql writing a variable to a file or append a select output to a file avoid "copy to"

Regarding postgresql, I need to write to a file the result of a loop that does a number of selects.
The "copy to" does not seem to allow to append to a file and moreover is coubersome because does not allow to set delimiters as empty strings and I am obliged to set delimeters to space and tab.
Any other idea for writing a text variable of pgsql to a file, or for appending during "copy to" to the file.
Another problem is that the file name cannot be a variable of pgsql, is there a solution to that ?
Postgres provides for file writing (below is an excerpt from my code).
perform pg_catalog.pg_file_unlink(fn); -- I delete the file
perform pg_catalog.pg_file_write(fn,r,false); -- I accumulated the string to write into r.
These functions are available with the adminpack that must be installed as e.g. described in adminpack installation.
The above functions solve both the problem of writing to a file with the name given by an expression and the problem of writing an expression to the file.
In my case giving only a file name the file gets written in: /var/lib/postgresql/9.5/main (I use ubuntu).

ERROR: missing data for column when using \copy in psql

I'm trying to import a .txt file into PostgreSQL. The txt file has 6 columns:
Laboratory_Name Laboratory_ID Facility ZIP_Code City State
And 213 rows.
I'm trying to use \copy to put the contents of this file into a table called doe2 in PostgreSQL using this command:
\copy DOE2 FROM '/users/nathangroom/desktop/DOE_inventory5.txt' (DELIMITER(' '))
It gives me this error:
missing data for column "facility"
I've looked all around for what to do when encountering this error and nothing has helped. Has anyone else encountered this?
Three possible causes:
One or more lines of your file has only 4 or fewer space characters (your delimiter).
One or more space characters have been escaped (inadvertently). Maybe with a backslash at the end of an unquoted value. For the (default) text format you are using, the manual explains:
Backslash characters (\) can be used in the COPY data to quote data
characters that might otherwise be taken as row or column delimiters.
Output from COPY TO or pg_dump would not exhibit any of these faults when reading from a table with matching layout. But maybe your file has been edited or is from a different, faulty source?
You are not using the file you think you are using. The \copy meta-command of the psql command-line interface is a wrapper for COPY and reads files local to the client. If your file lives on the server, use the SQL command COPY instead.
Check the file carefully. In my case, a blank line at the end of the file caused the ERROR: missing data for column. Deleted it, and worked fine.
Printing the blank lines might reveal something interesting:
cat -e $filename
I had a similar error. check the version of pg_dump that was used in exporting the data and the version of the database you are want to insert it into. make sure they are same. Also, if copy export fails then export the data by insert

How to use BCP to dump query (cdc function ) retrieved data to text file

Im trying to use BCP to dump data from CDC function into a .dat file. Im using the following query (which works in Server 2008 R2):
USE LEESWIJZER
DECLARE #begin_time datetime
, #end_time datetime
, #from_lsn binary(10)
, #to_lsn binary(10)
SET #end_time = '2013-07-05 12:00:00.000';
SELECT #to_lsn = sys.fn_cdc_map_time_to_lsn('largest less than or equal', #end_time);
SELECT #from_lsn = sys.fn_cdc_get_min_lsn('dbo_LWR_CONTRIBUTIES')
SELECT sys.fn_cdc_map_lsn_to_time(__$start_lsn) AS ChangeDTS
, *
FROM cdc.fn_cdc_get_net_changes_dbo_LWR_CONTRIBUTIES (#from_lsn, #to_LSN, 'all')
(edited for readability, used in BCP as single string)
my BCP string is:
BCP "Query above" queryout "C:\temp\LWRCONTRIBUTIES.dat" -w -t ";|" -r \n -T -S {server\\instance} -o "C:\temp\LWRCONTRIBUTIES.log"
As you can see I want a resulting .dat file in unicode, and a log file. I'm guessing the "ChangeDTS" column added to the function outcome is causing my problem. Error message reads: "[Microsoft][SQL Native Client]Host-file columns may be skipped only when copying into the Server".
It may be resolved using a format file, but since this code needs to run daily, likely more than once a day, and the tables are subject to change, I'm reluctant to constantly adjust my format files (there are 100's of tables needing the same procedure).
Furthermore, this is run on a clients database, who wont like me creating views in their database.
Anybody got any idea how I can create a text file (.dat) with a selected number of columns from a cdc function?
Found the answer, regardless of which version of bcp used, bcp cant handle declarations, it seems. If i edit those out, works like a charm.
However, according to someone on a different forum, BCP should be to handle declarations of variables. So happy it works for me now, but still confused why it does now and didnt before.

Disabling the PostgreSQL 8.4 tsvector parser's `file` token type

I have some documents that contain sequences such as radio/tested that I would like to return hits in queries like
select * from doc
where to_tsvector('english',body) ## to_tsvector('english','radio')
Unfortunately, the default parser takes radio/tested as a file token (despite being in a Windows environment), so it doesn't match the above query. When I run ts_debug on it, that's when I see that it's being recognized as a file, and the lexeme ends up being radio/tested rather than the two lexemes radio and test.
Is there any way to configure the parser not to look for file tokens? I tried
ALTER TEXT SEARCH CONFIGURATION public.english
DROP MAPPING FOR file;
...but it didn't change the output of ts_debug. If there's some way of disabling file, or at least having it recognize both file and all the words that it thinks make up the directory names along the way, or if there's a way to get it to treat slashes as hyphens or spaces (without the performance hit of regexp_replaceing them myself) that would be really helpful.
I think the only way to do what you want is to create your own parser :-( Copy wparser_def.c to a new file, remove from the parse tables (actionTPS_Base and the ones following it) the entries that relate to files (TPS_InFileFirst, TPS_InFileNext etc), and you should be set. I think the main difficulty is making the module conform to PostgreSQL's C idiom (PG_FUNCTION_INFO_V1 and so on). Have a look at contrib/test_parser/ for an example.