I am trying to use COPY in Go to load hundreds of thousands of lines from text files into a Postgres database. Sometimes it fails because lines contain special (non-ASCII) characters. If I replace the non-ASCII characters, it works fine.
Is there a simple/easy way to store disallowed characters in a text field or some other kind of field? Or a Postgres function that validates text, so a transaction with invalid characters can be avoided?
I would recommend using db.Exec().
col1Val := `string 1 with special characters`
col2Val := `string 2 with special characters`

sqlStr := `INSERT INTO table (col1, col2) VALUES ($1, $2)`
_, err := db.Exec(sqlStr, col1Val, col2Val)
if err != nil {
    panic(err)
}
db.Exec escapes special characters by itself and prevents SQL injection.
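Building on that, a rough sketch of loading many lines this way, one INSERT per line inside a single transaction, might look like the following. The connection string, file name, table, and column names are hypothetical, and the lib/pq driver is just one possible choice:

package main

import (
    "bufio"
    "database/sql"
    "log"
    "os"

    _ "github.com/lib/pq" // one possible Postgres driver
)

func main() {
    // Hypothetical connection string; adjust to your environment.
    db, err := sql.Open("postgres", "postgres://user:pass@localhost/mydb?sslmode=disable")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Hypothetical input file.
    file, err := os.Open("input.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    tx, err := db.Begin()
    if err != nil {
        log.Fatal(err)
    }
    // Hypothetical target table with a single text column.
    stmt, err := tx.Prepare(`INSERT INTO lines (content) VALUES ($1)`)
    if err != nil {
        log.Fatal(err)
    }

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        // The value is passed as a parameter, separate from the SQL text,
        // so quotes and backslashes in the line need no manual escaping.
        if _, err := stmt.Exec(scanner.Text()); err != nil {
            log.Fatal(err)
        }
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
    if err := tx.Commit(); err != nil {
        log.Fatal(err)
    }
}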
I have a requirement where I need to identify if a string has any special/junk characters, excluding Arabic, alphanumeric characters, and spaces. I have tried the query below, but it's not detecting special characters:
select count(*) from table
where not regexp_like (column1,UNISTR('[\0600-\06FF\0750-\077F\0870-\089F\08A0-\08FF\FB50-\FDFF\FE70-\FEFF\0030-\0039\0041-\005A\0061-\007A]'));
The column has the following value: 'طًيAa1##$'
You have NOT REGEXP_LIKE(column, allowed_characters).
This means that any string with at least one allowed character will return TRUE from the regular expression, and so will be excluded by the WHERE clause.
You want REGEXP_LIKE(column, disallowed_characters).
This will identify any strings that have at least one disallowed character.
You can accomplish this with ^ inside the bracket expression (^ meaning 'not any of these characters'):
select count(*) from table
where regexp_like (Column1, UNISTR('[^\0600-\06FF\0750-\077F\0870-\089F\08A0-\08FF\FB50-\FDFF\FE70-\FEFF\0030-\0039\0041-\005A\0061-\007A]'));
Demo: https://dbfiddle.uk/Rq1Zzopk
I inserted a bunch of rows with a text field like content='...\n...\n...'.
I didn't use E in front, like content=E'...\n...\n...', so now \n is not actually displayed as a newline - it's printed as text.
How do I fix this, i.e. how do I change every row's content field from '...' to E'...'?
The syntax variant E'string' makes Postgres interpret the given string as a POSIX escape string. \n encoding a newline is only one of many interpreted escape sequences (even if the most common one). See:
Insert text with single quotes in PostgreSQL
To "re-evaluate" your Posix escape string, you could use a simple function with dynamic SQL like this:
CREATE OR REPLACE FUNCTION f_eval_posix_escapes(INOUT _string text)
  LANGUAGE plpgsql AS
$func$
BEGIN
    EXECUTE 'SELECT E''' || _string || '''' INTO _string;
END
$func$;
WARNING 1: This is inherently unsafe! We have to evaluate input strings dynamically without quoting and escaping, which allows SQL injection. Only use this in a safe environment.
WARNING 2: Don't apply it repeatedly, or it will misinterpret strings that contain genuine \ characters, etc.
WARNING 3: This simple function is imperfect as it cannot cope with nested single quotes properly. If you have some of those, consider instead:
Unescape a string with escaped newlines and carriage returns
Apply:
UPDATE tbl
SET content = f_eval_posix_escapes(content)
WHERE content IS DISTINCT FROM f_eval_posix_escapes(content);
db<>fiddle here
Note the added WHERE clause to skip updates that would not change anything. See:
How do I (or can I) SELECT DISTINCT on multiple columns?
Use REPLACE in an update query. Something like this: (I'm on mobile, so please ignore any typo or syntax error)
UPDATE table
SET column = REPLACE(column, '\n', E'\n')
I'm trying to read a plain text file that contains names like this: "CASTAÑEDA"
The code is basically like this:
file, err := os.Open("C:/Files/file.txt")
defer file.Close()
if err != nil {
log.Fatal(err)
}
scanner := bufio.NewScanner(file)
for scanner.Scan() {
fmt.Println(scanner.Text())
}
Then, when "CASTAÑEDA" is read it prints "CASTA�EDA"
There's any way to handle that characters when reading with bufio?
Thanks.
Your file is, most probably, not UTF-8. Because of that (Go expects all strings to be UTF-8), your console output looks mangled. In your case I would advise using the packages golang.org/x/text/encoding/charmap and golang.org/x/text/transform to convert the file's data to UTF-8. Judging by your file path, you are on Windows, so your character encoding might be Windows-1252 (e.g. if you edited the file with notepad.exe).
Try something like this:
package main

import (
    "bufio"
    "fmt"
    "log"
    "os"

    "golang.org/x/text/encoding/charmap"
    "golang.org/x/text/transform"
)

func main() {
    file, err := os.Open("C:/temp/file.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    // Insert your encoding here.
    dec := transform.NewReader(file, charmap.Windows1252.NewDecoder())
    scanner := bufio.NewScanner(dec)
    for scanner.Scan() {
        fmt.Println(scanner.Text())
    }
}
You can find more encodings in the package golang.org/x/text/encoding/charmap, which you can plug into my example as you like.
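For example, if the file turned out to be ISO 8859-1 instead of Windows-1252 (just a guess to illustrate the swap), only the decoder line would change:

dec := transform.NewReader(file, charmap.ISO8859_1.NewDecoder())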
The issue you're encountering is that your input is likely not UTF-8 (which is what bufio and most of the Go language/stdlib expect). Instead, your input probably uses some extended-ASCII codepage, which is why the unaccented characters pass through cleanly (UTF-8 is also a superset of 7-bit ASCII), but the 'Ñ' does not come through intact.
In this situation, the bit representation of the accented character is not valid UTF-8, so the Unicode replacement character (U+FFFD) is produced. You've got a few options:
Convert your input files to UTF-8 before passing them to Go. There are many utilities that can do this, and editors often have this feature.
Try using golang.org/x/text/encoding/charmap together with NewReader from golang.org/x/text/transform to transform your input to UTF-8, and pass the resulting Reader to bufio.NewScanner.
Change the line in the loop to os.Stdout.Write(scanner.Bytes()); fmt.Println(). This avoids the bytes being interpreted as UTF-8 beyond newline splitting, and writing them directly to os.Stdout avoids any (mis)interpretation of the contents (see the sketch below).
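Here is what that last option could look like, reusing the file path from the question; everything used is standard library:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"
)

func main() {
    file, err := os.Open("C:/Files/file.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    scanner := bufio.NewScanner(file)
    for scanner.Scan() {
        // Write the raw bytes of each line without converting them to a Go
        // string for printing, so their original values are preserved.
        os.Stdout.Write(scanner.Bytes())
        fmt.Println()
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}

Whether the accented character then displays correctly depends on the terminal using the same codepage as the file.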
I want to load the data from a flat file with the delimiter "~,~" into a PostgreSQL table. I have tried it as below, but it looks like there is a restriction on the delimiter. If the COPY statement doesn't allow multiple characters for the delimiter, is there any alternative way to do this?
metadb=# \COPY public.CME_DATA_STAGE_TRANS FROM 'E:\Infor\Outbound_Marketing\7.2.1\EM\metadata\pgtrans.log' WITH DELIMITER AS '~,~'
ERROR: COPY delimiter must be a single one-byte character
\copy: ERROR: COPY delimiter must be a single one-byte character
If you are using Vertica, you could use E'\t' or U&'\0009':
To indicate a non-printing delimiter character (such as a tab), specify the character in extended string syntax (E'...'). If your database has StandardConformingStrings enabled, use a Unicode string literal (U&'...'). For example, use either E'\t' or U&'\0009' to specify tab as the delimiter.
Unfortunately, there is no way to load a flat file with a multi-character delimiter like ~,~ in Postgres, unless you want to modify the source code (and recompile, of course) yourself:
/* Only single-byte delimiter strings are supported. */
if (strlen(cstate->delim) != 1)
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("COPY delimiter must be a single one-byte character")));
What you want is to preprocess your input file with some external tool. For example, sed might be the best companion on a GNU/Linux platform:
sed 's/~,~/\t/g' inputFile
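If you would rather do the preprocessing in Go instead of sed, a rough sketch (with hypothetical file names, and assuming a tab never occurs in the data itself) could look like this:

package main

import (
    "bytes"
    "log"
    "os"
)

func main() {
    // Hypothetical file names; adjust to your environment.
    data, err := os.ReadFile("pgtrans.log")
    if err != nil {
        log.Fatal(err)
    }

    // Replace the multi-character delimiter with a tab, which COPY accepts
    // as a single one-byte delimiter.
    out := bytes.ReplaceAll(data, []byte("~,~"), []byte("\t"))

    if err := os.WriteFile("pgtrans_tab.log", out, 0644); err != nil {
        log.Fatal(err)
    }
}

The rewritten file can then be loaded with COPY ... WITH DELIMITER E'\t'.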
The obvious thing to do is what all other answers advised: edit the import file. I would do that, too.
However, as a proof of concept, here are two ways to accomplish this without additional tools.
1) General solution
CREATE OR REPLACE FUNCTION f_import_file(OUT my_count integer)
  RETURNS integer AS
$BODY$
DECLARE
    myfile   text;                          -- read file content into that var.
    datafile text := '\path\to\file.txt';   -- pg_read_file() only accepts a path relative to the database dir!
BEGIN
    myfile := pg_read_file(datafile, 0, 100000000);  -- arbitrary 100 MB max.

    INSERT INTO public.my_tbl
    SELECT ('(' || regexp_split_to_table(replace(myfile, '~,~', ','), E'\n') || ')')::public.my_tbl;
    -- depending on the file format, you might need additional quotes to form a valid row literal.

    GET DIAGNOSTICS my_count = ROW_COUNT;
END;
$BODY$
LANGUAGE plpgsql VOLATILE;
This uses a number of pretty advanced features. If anybody is actually interested and needs an explanation, leave a comment to this post and I will elaborate.
2) Special case
If you can guarantee that '~' is only present in the delimiter '~,~', then you can go ahead with a plain COPY in this special case. Just treat the ',' in '~,~' as an additional column.
Say, your table looks like this:
CREATE TABLE foo (a int, b int, c int);
Then you can (in one transaction):
CREATE TEMP TABLE foo_tmp (
    a int, tmp1 "char"
  , b int, tmp2 "char"
  , c int
) ON COMMIT DROP;

COPY foo_tmp FROM '\path\to\file.txt' WITH DELIMITER AS '~';

ALTER TABLE foo_tmp DROP COLUMN tmp1;
ALTER TABLE foo_tmp DROP COLUMN tmp2;

INSERT INTO foo SELECT * FROM foo_tmp;
Not quite sure if you're looking for a PostgreSQL solution or just a general one.
If it were me, I would open up a copy of vim (or gvim) and run the command :%s/~,~/~/g
That replaces all "~,~" with "~".
You can use a single-character delimiter: open Notepad, press Ctrl+H, and replace ~,~ with something that will not interfere, like |.
I have a trim function that applies LTRIM and RTRIM:
CREATE FUNCTION dbo.TRIM(@string VARCHAR(MAX))
RETURNS VARCHAR(MAX)
BEGIN
    RETURN LTRIM(RTRIM(@string))
END
GO
I do the following query:
SELECT distinct dbo.trim([subject]) as subject
FROM [DISTR]
The result has rows like:
"A"
"A "
"B"
...
I thought that those characters maybe weren't spaces, but when I got the ASCII code, it returned 32, which is the code for a space.
My only guess is that I have to change the collation of the database to SQL_Latin1_General_CP1_CI_AI.
Can that be the problem? Any ideas?
Thanks
Maybe your field contains more than spaces. Remember that " " could be a space, a tab, or one of many other "blank" characters. It's possible to detect them using ASCII, or by building a CLR implementation of trim that uses regular expressions.