Using the ASCII 31 field separator character as a PostgreSQL COPY delimiter

We are exporting data from Postgres 9.3 into a text file for ingestion by Spark.
We would like to use the ASCII 31 field separator character as a delimiter instead of \t so that we don't have to worry about escaping issues.
We can do so in a shell script like this:
#!/bin/bash
DELIMITER=$'\x1F'
echo "copy ( select * from table limit 1) to STDOUT WITH DELIMITER '${DELIMITER}'" | (psql ...) > /tmp/ascii31
But we're wondering, is it possible to specify a non-printable glyph as a delimiter in "pure" postgres?
edit: we attempted to use the postgres escaping convention per http://www.postgresql.org/docs/9.3/static/sql-syntax-lexical.html
warehouse=> copy ( select * from table limit 1) to STDOUT WITH DELIMITER '\x1f';
and received
ERROR: COPY delimiter must be a single one-byte character

Try prepending E before the sequence you're trying to use as a delimiter, i.e. E'\x1f' instead of '\x1f'. Without the E, PostgreSQL reads '\x1f' as four separate characters rather than as a hexadecimal escape sequence, hence the error message.
See the PostgreSQL manual on "String Constants with C-Style Escapes" for more information.
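Applied to the command from the question, that would be:
warehouse=> copy ( select * from table limit 1) to STDOUT WITH DELIMITER E'\x1f';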

From my testing, both of the following work:
echo "copy (select 1 a, 2 b) to stdout with delimiter u&'\\001f'"| psql;
echo "copy (select 1 a, 2 b) to stdout with delimiter e'\\x1f'"| psql;

I've extracted a small file from Actian Matrix (a fork of Amazon Redshift, both derivatives of postgres), using this notation for ASCII character code 30, "Record Separator".
unload ('SELECT btrim(class_cd) as class_cd, btrim(class_desc) as class_desc
FROM transport.stg.us_fmcsa_carrier_classes')
to '/tmp/us_fmcsa_carrier_classes_mk4.txt'
delimiter as '\036' leader;
This is an example of how this file looks in VI:
C^^Private Property
D^^Private Passenger Business
E^^Private Passenger Non-Business
I then moved this file over to a machine hosting PostgreSQL 9.5 via sftp, and used the following copy command, which seems to work well:
copy fmcsa.carrier_classes
from '/tmp/us_fmcsa_carrier_classes_mk4.txt'
delimiter u&'\001E';
Each derivative of postgres, and postgres itself seems to prefer a slightly different notation. Too bad we don't have a single standard!

Related

How to get the output of a SQL query with additional special characters

Long story...
I am trying to generate a crosstab query dynamically and run it as a psql script.
To achieve this, I want the last line of the SQL to be generated and appended to the top portion of the SQL.
The last line of the SQL is like this: "as final_result(symbol character varying,"431" numeric,"432" numeric,"433" numeric);"
The "431", "432", etc. are generated dynamically, as these are the pivot columns and they change from time to time.
So I wrote a query to output the text as follows:
psql -c "select distinct '"'||runweek||'" numeric ,' from calendar where runweek between current_runweek()-2 and current_runweek() order by 1;" -U USER -d DBNAME > /tmp/gengen.lst
While the SQL produces the output, when I run it as a script it fails because of the special characters (', ").
How should I get it working? My plan was to loop through the "lst" file, build the pivot string, append that to the top portion of the SQL, and execute the script. (I'm new to Postgres and don't know dynamic SQL generation and execution, but I am comfortable with UNIX scripting.)
If there is a recommendation for getting the output as ("431" numeric, "432" numeric, ...etc) in a single step, it would be greatly appreciated.
Since you're using double quotes around the argument, double quotes inside the argument must be escaped with a backslash:
psql -c "select distinct '\"'||runweek||'\" numeric ,' from calendar where runweek between current_runweek()-2 and current_runweek() order by 1;"
Heredoc can also be used instead of -c. It accepts multi-line formatting, which makes the whole thing more readable.
(psql [arguments] <<EOF
select distinct '"'||runweek||'" numeric ,'
from calendar
where runweek between current_runweek()-2 and current_runweek()
order by 1;
EOF
) > output
By using quote_ident, which is specifically meant to produce a quoted identifier from a text value, you don't even need to add the double quotes yourself. The query could be:
select string_agg( quote_ident(runweek::text), ',' order by runweek)
from calendar
where runweek between current_runweek()-2 and current_runweek();
which also solves the problem that your original query has a stray ',' at the end, whereas this form does not.
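To get the whole pivot list in one step from the shell, the aggregate can be run through psql with -A -t (unaligned, tuples-only output) so only the raw string lands in the file. A sketch reusing the calendar table and current_runweek() from the question:
psql -A -t -U USER -d DBNAME -c "select '(' || string_agg(quote_ident(runweek::text) || ' numeric', ', ' order by runweek) || ')' from calendar where runweek between current_runweek()-2 and current_runweek();" > /tmp/gengen.lst
This writes something like ("431" numeric, "432" numeric, "433" numeric) in a single step, ready to be appended to the top portion of the SQL.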

How to NULL out a 'no-break space' character in Redshift Copy Command?

My COPY command keeps receiving the following error:
Missing newline: Unexpected character 0x73 found at location 4194303
I ran it through the following function to check for non-ASCII characters:
def return_non_ascii_codes(input: str):
    for char in input:
        if ord(char) > 127:
            yield ord(char)
That turned up a number of characters with code 160. Looking this up in a Unicode chart, it is the no-break space character: http://www.fileformat.info/info/unicode/char/00a0/index.htm
I want to NULL these characters out in my COPY command, but am unsure of what the correct character sequence/format I should use.
The COPY command is as follows:
COPY xxx
FROM 's3://xxx/cleansed.csv'
WITH CREDENTIALS 'aws_access_key_id=xxx;aws_secret_access_key=xxx'
-- GZIP
ESCAPE
FILLRECORD
TRIMBLANKS
TRUNCATECOLUMNS
DELIMITER '|'
BLANKSASNULL
REMOVEQUOTES
ACCEPTINVCHARS
TIMEFORMAT 'auto'
DATEFORMAT 'auto';
EDIT:
I used Python to find the characters, but Python does not do any of the actual processing in my pipeline. I do a COPY TO STDOUT command from our PostgreSQL databases, and then upload those files directly to S3 for copy to Redshift. So it needs to be handled in one of those two places.
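Given that pipeline, one option is to strip the character between the COPY TO STDOUT and the S3 upload. A sketch, assuming the export is UTF-8 and GNU sed is available (the no-break space is the two-byte UTF-8 sequence 0xC2 0xA0; my_table and the output file name are placeholders):
# strip UTF-8 no-break spaces from the export before it reaches S3
psql -c "copy my_table to stdout with (delimiter '|')" | sed 's/\xc2\xa0//g' > cleansed.csv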
Here are the two fields from the destination table:
id BIGINT,
quiz_data VARCHAR(65535)
UPDATE 1:
I ran the file through a function to cleanse all non-ASCII characters, like so:
def return_ascii_chars(input: str):
    return (char for char in input if ord(char) < 127)

with open(file, 'r') as inf, open(outfile, 'w') as outf:
    for line in inf:
        print(list(return_non_ascii_codes(line)))  # list(), otherwise this prints a generator object
        outf.write(''.join(return_ascii_chars(line)))
and then tried to COPY to Redshift. Still getting the following:
Missing newline: Unexpected character 0x20 found at location 4194303
I've double-checked that the cleansed file doesn't have any non-ASCII characters...
You can use the ACCEPTINVCHARS parameter in your COPY command. It's pretty easy and straightforward:
COPY table1 FROM 's3://my_bucket' CREDENTIALS '' ACCEPTINVCHARS
If I’ve made a bad assumption please comment and I’ll refocus my answer.
yourvariable.replace(unichr(160), "")
(unichr is Python 2; on Python 3, use chr(160).)

DB2 LOAD from delimited files - escaping " in fields doesn't work

I bet it's totally simple and I just don't see it, but I don't get it.
I execute the following command in DB2 command line processor:
DB2 LOAD FROM "DB_ACC_PASS_REGEXP.del" OF DEL METHOD P (1, 2, 3, 4, 5) MESSAGES "DB_ACC_PASS_REGEXP.del.msg" INSERT INTO DB_ACC_PASS_REGEXP (APP_ID,APREGEXP,EXPLAIN_TEXT,ID,OPT_KZ) NONRECOVERABLE INDEXING MODE REBUILD
This loads the data specified in the following file into the database:
1,"[a-z]",,1,0
1,"[A-Z]",,2,0
1,"[0-9]",,3,0
1,"[!|\"|§|$|%|&|/|(|)|=|?|`|´|*|+|~|'|#|-|_|.|:|,|;|µ|<|>| |°|^]",,4,0
^
Here is the Problem
The problem is that only 3 of these 4 inserts are accepted. The last one is rejected because DB2 LOAD doesn't notice the escape character before the double quotation mark.
If I change the last line to:
1,"[!|x|§|$|%|&|/|(|)|=|?|`|´|*|+|~|'|#|-|_|.|:|,|;|µ|<|>| |°|^]",,4,0
^
Here is the changed character
there is no problem.
WHY doesn't the escape character "\" work??
edit
Okay... I just tried it the Oracle way now and that works. I escape " with another ", so my line looks like:
1,"[!|""|§|$|%|&|/|(|)|=|?|`|´|*|+|~|'|#|-|_|.|:|,|;|µ|<|>| |°|^]",,4,0
But that's only a workaround. It doesn't explain why IBM documents the backslash as an escape character (http://pic.dhe.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.cmd.doc%2Fdoc%2Fr0008305.html).
Using LOAD with ASCII / delimited files requires tuning the file type modifiers (see Table 6 and Table 8 of the documentation page you linked). I am not quite sure, but I can't remember backslash working as an escape character in DB2.
You can either use a character delimiter other than double quotes with the chardel option, or force no character delimiter with the nochardel option.
BUT ...
In your case you need special characters for the regular expressions, so you will always need to escape " with "" and ' with ''. I think there is no other way to get this working.
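For reference, the modifiers look like this (a sketch based on the question's command; nochardel disables string delimiters entirely, chardel picks a different delimiter character):
DB2 LOAD FROM "DB_ACC_PASS_REGEXP.del" OF DEL MODIFIED BY nochardel METHOD P (1, 2, 3, 4, 5) MESSAGES "DB_ACC_PASS_REGEXP.del.msg" INSERT INTO DB_ACC_PASS_REGEXP (APP_ID,APREGEXP,EXPLAIN_TEXT,ID,OPT_KZ) NONRECOVERABLE INDEXING MODE REBUILD
Neither modifier helps here, though: with nochardel the commas inside the regexp field would be read as column delimiters, and any substitute chardel character also appears in the data.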

How to treat a comma as text in output?

There's one column that contains commas. When I output my query to CSV, these commas break the CSV format. What I've been doing to avoid this is a simple
replace(A."Sales Rep",',','')
Is there a better way of doing this, so that I can keep the commas in the final output without breaking the CSV file?
Thanks!
You can use the COPY command to get PostgreSQL to build the CSV for you:
COPY -- copy data between a file and a table
Something like one of these:
copy your_table to 'filename' csv
copy your_table to 'filename' csv force quote *
copy your_table to stdout csv force quote *
copy your_table to stdout csv force quote * header
...
You have to be a superuser to copy to a filename, though. If you're inside psql, you can use the \copy command:
Performs a frontend (client) copy. This is an operation that runs an SQL COPY command, but instead of the server reading or writing the specified file, psql reads or writes the file and routes the data between the server and the local file system.
The syntax is pretty much the same:
\copy your_table to 'filename.csv' csv force quote * header
...
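As a quick illustration with made-up data, CSV mode quotes any field that contains the delimiter, so embedded commas survive intact:
\copy (select 1 as id, 'Smith, John' as sales_rep) to stdout csv header
id,sales_rep
1,"Smith, John"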
Quote the fields with "
a,this has a , in it,b
would become
a,"this has a, in it",b
and if the fields have BOTH a , and a ", double the quotes:
a,this has a " and , in it,b
becomes
a,"this has a "" and , in it",b

ERROR: COPY delimiter must be a single one-byte character

I want to load data from a flat file with the delimiter "~,~" into a PostgreSQL table. I tried it as below, but it looks like there is a restriction on the delimiter. If the COPY statement doesn't allow multiple characters for the delimiter, is there any alternative way to do this?
metadb=# \COPY public.CME_DATA_STAGE_TRANS FROM 'E:\Infor\Outbound_Marketing\7.2.1\EM\metadata\pgtrans.log' WITH DELIMITER AS '~,~'
ERROR: COPY delimiter must be a single one-byte character
\copy: ERROR: COPY delimiter must be a single one-byte character
If you are using Vertica, you could use E'\t' or U&'\0009':
To indicate a non-printing delimiter character (such as a tab),
specify the character in extended string syntax (E'...'). If your
database has StandardConformingStrings enabled, use a Unicode string
literal (U&'...'). For example, use either E'\t' or U&'\0009' to
specify tab as the delimiter.
Unfortunately there is no way to load a flat file with the multi-character delimiter ~,~ in Postgres, unless you want to modify the source code (and recompile, of course) yourself:
/* Only single-byte delimiter strings are supported. */
if (strlen(cstate->delim) != 1)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("COPY delimiter must be a single one-byte character")));
What you want is to preprocess your input file with some external tool; for example, sed might be the best companion on a GNU/Linux platform:
sed 's/~,~/\t/g' inputFile
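The preprocessed stream can also be piped straight into psql, so no intermediate file is needed. A sketch using the table and database names from the question, assuming GNU sed and no literal tabs in the data (tab is the default delimiter for COPY's text format):
sed 's/~,~/\t/g' pgtrans.log | psql -d metadb -c "\copy public.CME_DATA_STAGE_TRANS from stdin"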
The obvious thing to do is what all the other answers advise: edit the import file. I would do that, too.
However, as a proof of concept, here are two ways to accomplish this without additional tools.
1) General solution
CREATE OR REPLACE FUNCTION f_import_file(OUT my_count integer)
  RETURNS integer AS
$BODY$
DECLARE
   myfile   text;                         -- read the file into this var.
   datafile text := '\path\to\file.txt';  -- !pg_read_file only accepts relative paths in the database dir!
BEGIN
   myfile := pg_read_file(datafile, 0, 100000000);  -- arbitrary 100 MB max.
   INSERT INTO public.my_tbl
   SELECT ('(' || regexp_split_to_table(replace(myfile, '~,~', ','), E'\n') || ')')::public.my_tbl;
   -- !depending on the file format, you might need additional quotes to create a valid row literal.
   GET DIAGNOSTICS my_count = ROW_COUNT;
END;
$BODY$
LANGUAGE plpgsql VOLATILE;
This uses a number of pretty advanced features. If anybody is actually interested and needs an explanation, leave a comment to this post and I will elaborate.
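For completeness, calling it is simply:
SELECT * FROM f_import_file();
which returns the number of inserted rows in my_count.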
2) Special case
If you can guarantee that '~' is only ever present as part of the delimiter '~,~', then you can go ahead with a plain COPY in this special case: just treat the ',' in '~,~' as an additional column. A line like 1~,~2~,~3 splits on '~' into five fields (1, ',', 2, ',', 3), so the stray commas land in junk columns that are dropped afterwards.
Say, your table looks like this:
CREATE TABLE foo (a int, b int, c int);
Then you can (in one transaction):
CREATE TEMP TABLE foo_tmp (
   a int, tmp1 "char"
 , b int, tmp2 "char"
 , c int
) ON COMMIT DROP;
COPY foo_tmp FROM '\path\to\file.txt' WITH DELIMITER AS '~';
ALTER TABLE foo_tmp DROP COLUMN tmp1;
ALTER TABLE foo_tmp DROP COLUMN tmp2;
INSERT INTO foo SELECT * FROM foo_tmp;
Not quite sure if you're looking for a PostgreSQL solution or just a general one.
If it were me, I would open up a copy of vim (or gvim) and run the command :%s/~,~/~/g
That replaces all "~,~" with "~".
You can use a single-character delimiter: open Notepad, press Ctrl+H, and replace ~,~ with something that will not interfere, like |.