literal newline/carriage return found in data exported with pandas to csv - postgresql

I have the following text file, which has CRLF at the end of each line and a relatively small number of bad rows (b'Skipping line 55000: expected 14 fields, saw 15\n'):
0.0;0.7;John;0.29
1.0;0.23;Mike;0.55
0.0;0.72;;Jane;;-3.4
0.0;0.98;Gil;0.68
0.0;0.48;;;0
1.0;0.34;Karl;0.73
0.0;0.44;James;0.06
1.0;0.4;Kiki;0.74
0.0;0.18;Albert;0.18
1.0;0.53;Mark;0.53
I import the file with pandas (Python 3.5.2 on Windows 10) as follows:
with open(r'E:\DATA\my_file.txt', 'rb') as f:
    df = pd.read_csv(f, sep=';', encoding='CP1252', error_bad_lines=False)  # skip the bad rows
df now looks like this (the bad rows seem to be empty):
0.0;0.7;John;0.29
1.0;0.23;Mike;0.55
0.0;0.98;Gil;0.68
1.0;0.34;Karl;0.73
0.0;0.44;James;0.06
1.0;0.4;Kiki;0.74
0.0;0.18;Albert;0.18
1.0;0.53;Mark;0.53
Then I export the DataFrame to CSV as follows:
with open(r'E:\DATA\csv_file.csv', 'w', newline='\n') as outfile:
    df.to_csv(outfile, sep=';', index=False, line_terminator='\r')
csv_file.csv looks like this (the empty rows seem to be removed):
0.0;0.7;John;0.29
1.0;0.23;Mike;0.55
0.0;0.98;Gil;0.68
1.0;0.34;Karl;0.73
0.0;0.44;James;0.06
1.0;0.4;Kiki;0.74
0.0;0.18;Albert;0.18
1.0;0.53;Mark;0.53
Unfortunately when I import the file into postgres with the following code:
set client_encoding to 'WIN1252';
COPY my_table FROM 'E:\DATA\csv_file.csv' (DELIMITER(';'));
I get the following error:
ERROR: literal newline found in data
HINT: Use "\n" to represent newline.
CONTEXT: COPY my_table, line 25408: ""
When I open the csv_file in Notepad++, I see that it has "CR" at the end of each row up to line 25407, while line 25408 and a few others have "CRLF" at the end of the line.
I tried a few things I read on this site, like opening the file in binary mode, but nothing helped.
Can anyone explain to me what is going on here and how I can solve it? Thanks

UPDATE 2: it just works fine:
In [194]: pd.read_csv(r'D:\download\onetest.txt', sep=';')
Out[194]:
COL1 COL2 COL2.1 COL3 COL4 COL5 COL6 COL7 COL8 COL9 COL10
0 23 21.0 UP 15/08/1986 BOBO NaN 1071001 268-Z DON 1620 NaN
1 012R 65.0 UP 15/10/1986 ESTO NaN 15065108 066-B DON 8415 NaN
2 234 8.0 EIJFTERF 17/12/1989 KING NaN 15571508 0776-V UP 6329 NaN
UPDATE: if your file(s) are small enough to fit into memory, you can try this:
import io

data = []
with open(r'E:\DATA\my_file.txt', encoding='cp1252') as f:
    for line in f:
        data.append(line.rstrip())

# io.StringIO expects a single string, so join the cleaned lines first
df = pd.read_csv(io.StringIO('\n'.join(data)), sep=';', error_bad_lines=False)
df.to_csv(r'E:\DATA\csv_file.csv', sep=';', index=False)
OLD answer:
You are using '\r' as the line terminator, but PostgreSQL's COPY command expects '\n', so try the following:
df = pd.read_csv(r'E:\DATA\my_file.txt', sep=';', encoding='CP1252', error_bad_lines=False)
df.to_csv(r'E:\DATA\csv_file.csv', sep=';', index = False)
in PostgreSQL:
set client_encoding to 'WIN1252';
COPY my_table FROM 'E:\DATA\csv_file.csv' (DELIMITER(';'));
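Before running COPY it can help to confirm which terminators the exported file actually contains. A minimal sketch (the path is just a stand-in for your file):

```python
# Count the line-ending styles in a file, so you can tell whether COPY
# will still see bare carriage returns.
def count_line_endings(path):
    with open(path, 'rb') as f:
        data = f.read()
    crlf = data.count(b'\r\n')
    return {'CRLF': crlf,
            'CR': data.count(b'\r') - crlf,   # bare carriage returns
            'LF': data.count(b'\n') - crlf}   # bare line feeds
```

If the 'CR' count is non-zero, the file still contains terminators that COPY rejects.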

Related

How does SAS decide which file to select when using wildcards

I have a SAS code that looks something like this:
DATA WORK.MY_IMPORT_&stamp;
  INFILE "M:\YPATH\myfile_150*.csv"
    delimiter = ';' MISSOVER DSD lrecl = 1000000 firstobs = 2 ignoredoseof;
  [...]
RUN;
Now, at M:\YPATH I have several files named myfile_150.YYYYMMDD. The code works the way it is supposed to by always importing the latest file. I am wondering how SAS decides which file to choose, since the wildcard * can be replaced by anything. Does it sort the files in descending order and choose the first one?
On my system, SAS 9.4 TS1M4, SAS is reading ALL files that satisfy the wildcard.
I created 3 files (file_A.csv, file_B.csv, and file_C.csv). Each contain 1 record ('A', 'B', and 'C' respectively).
data test;
  infile "c:\temp\file_*.csv"
    delimiter = ';' MISSOVER DSD lrecl = 1000000 ignoredoseof;
  format char $1.;
  input char $;
run;
(Note I dropped the firstobs option from your code.)
The resulting TEST data set contains 3 observations, 'A', 'B', and 'C'.
This is the order of files returned when issuing
dir c:\temp\file_*.csv
SAS is using the default behavior of the OS and reading the files in that order.
25 data test;
26 infile "c:\temp\file_*.csv"
27 delimiter = ';' MISSOVER DSD lrecl = 1000000 ignoredoseof;
28 format char $1.;
29 input char $;
30 run;
NOTE: The infile "c:\temp\file_*.csv" is:
Filename=c:\temp\file_A.csv,
File List=c:\temp\file_*.csv,RECFM=V,
LRECL=1000000
NOTE: The infile "c:\temp\file_*.csv" is:
Filename=c:\temp\file_B.csv,
File List=c:\temp\file_*.csv,RECFM=V,
LRECL=1000000
NOTE: The infile "c:\temp\file_*.csv" is:
Filename=c:\temp\file_C.csv,
File List=c:\temp\file_*.csv,RECFM=V,
LRECL=1000000
NOTE: 1 record was read from the infile "c:\temp\file_*.csv".
The minimum record length was 1.
The maximum record length was 1.
NOTE: 1 record was read from the infile "c:\temp\file_*.csv".
The minimum record length was 1.
The maximum record length was 1.
NOTE: 1 record was read from the infile "c:\temp\file_*.csv".
The minimum record length was 1.
The maximum record length was 1.
NOTE: The data set WORK.TEST has 3 observations and 1 variables.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
cpu time 0.00 seconds
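For comparison, the same wildcard read can be sketched in Python (file names hypothetical): glob.glob likewise returns matches in OS-dependent order, which is why an explicit sort is a good idea when order matters.

```python
# Rough Python equivalent of the wildcard INFILE: read every file that
# matches the pattern and concatenate their records.
import glob

def read_all(pattern):
    records = []
    for path in sorted(glob.glob(pattern)):  # sorted() makes the order deterministic
        with open(path) as f:
            records.extend(line.rstrip('\n') for line in f)
    return records
```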

COPY NULL values in Postgresql using psycopg2 copy_from()

This seems like a rather popular question, but none of the answers around here helped me solve the issue... I have a PostgreSQL 9.5 table on my OS X machine:
CREATE TABLE test (col1 TEXT, col2 INT)
The following function uses the psycopg2 copy_from() command:
from io import BytesIO

def test_copy(conn, curs, data):
    cpy = BytesIO()
    for row in data:
        cpy.write('\t'.join([str(x) for x in row]) + '\n')
    print cpy
    cpy.seek(0)
    curs.copy_from(cpy, 'test')

test_copy(connection, cursor, [('a', None), ('b', None)])
And will result in this error:
ERROR: invalid input syntax for integer: "None"
CONTEXT: COPY test, line 1, column col2: "None"
STATEMENT: COPY test FROM stdin WITH DELIMITER AS ' ' NULL AS '\N'
I also tried curs.copy_from(cpy, 'test', null='') and curs.copy_from(cpy, 'test', null='NULL'). Any suggestions are greatly appreciated.
OK, after more trial & error I found the solution:
copy_from(cpy, 'test', null='None')
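A sketch of an alternative that sidesteps the str() conversion entirely: emit COPY's default null marker \N for None values, so no null= argument is needed at all. This is Python 3 (io.StringIO rather than the question's BytesIO); the table layout mirrors the question.

```python
# Build a tab-separated buffer for copy_from(), writing \N where a value
# is None so PostgreSQL stores a real NULL.
import io

def rows_to_copy_buffer(rows):
    buf = io.StringIO()
    for row in rows:
        buf.write('\t'.join('\\N' if x is None else str(x) for x in row) + '\n')
    buf.seek(0)
    return buf

# usage (cursor and table name as in the question):
# curs.copy_from(rows_to_copy_buffer([('a', None), ('b', None)]), 'test')
```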

Postgres copy data & evaluate expression

Is it possible for a COPY command to evaluate expressions upon insertion?
For example consider the following table
create table test1 ( a int, b int)
and we have a file to import
5 , case when b = 1 then 100 else 101
25 , case when b = 1 then 100 else 101
145, case when b = 1 then 100 else 101
The following command will fail
COPY test1 FROM 'file' USING DELIMITERS ',';
with the following error
ERROR: invalid input syntax for integer
which means that it cannot evaluate the CASE expression. Is there any workaround?
The COPY command only copies data (obviously) and does not evaluate SQL code, as explained in the documentation: http://www.postgresql.org/docs/9.3/static/sql-copy.html
As far as I know there are no workarounds to make COPY evaluate SQL code.
You must preprocess your csv file and convert it to a standard SQL script with INSERT statements in this form:
INSERT INTO your_table VALUES(145, CASE WHEN 1 = 1 THEN 100 ELSE 101 END);
Then execute the SQL script with the client you are using. E.g. with psql you would use the -f option:
psql -d your_database -f your_sql_script
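A minimal preprocessing sketch along those lines, assuming the input rows look like the sample above. The helper name and the default table are made up; note the raw lines lack the closing END, which the generated INSERT appends.

```python
# Hypothetical helper: turn one raw line ("a-value , case-expression")
# into an INSERT statement so PostgreSQL evaluates the CASE at run time.
def line_to_insert(line, table='test1'):
    a, expr = (part.strip() for part in line.split(',', 1))
    return f"INSERT INTO {table} VALUES ({a}, {expr} END);"
```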

how to deal with missings when importing csv to postgres?

I would like to import a csv file which has multiple occurrences of missing values. I recoded them to NULL and tried to import the file. I suppose the attributes which include the NULLs are character values, and transforming them to numeric is a bit complicated. Therefore I would like to import my whole table as:
\copy player_allstar FROM '/Users/Desktop/Rdaten/Data/player_allstar.csv' DELIMITER ';' CSV WITH NULL AS 'NULL' ';' HEADER
There must be a syntax error. But I tried different combinations and always get:
ERROR: syntax error at or near "WITH NULL"
LINE 1: COPY player_allstar FROM STDIN DELIMITER ';' CSV WITH NULL ...
I also tried:
\copy player_allstar FROM '/Users/Desktop/Rdaten/Data/player_allstar.csv' WITH(FORMAT CSV, DELIMITER ';', NULL 'NULL', HEADER);
and get:
ERROR: invalid input syntax for integer: "NULL"
CONTEXT: COPY player_allstar, line 2, column dreb: "NULL"
I suppose it is caused by preprocessing with R. The table came with NAs, so I changed them with:
data[data==NA] <- "NULL"
I'm not aware of a different way of changing them to NULL. I think this produces strings. Is there a different way to preprocess and keep the NAs (as NULLs in postgres, of course)?
Sample:
pts dreb oreb reb asts stl
11 NULL NULL 8 3 NULL
4 5 3 8 2 1
3 NULL NULL 1 1 NULL
data type is integer
Given /tmp/sample.csv:
pts;dreb;oreb;reb;asts;stl
11;NULL;NULL;8;3;NULL
4;5;3;8;2;1
3;NULL;NULL;1;1;NULL
then with a table like:
CREATE TABLE player_allstar (pts integer, dreb integer, oreb integer, reb integer, asts integer, stl integer);
it works for me:
\copy player_allstar FROM '/tmp/sample.csv' WITH (FORMAT CSV, DELIMITER ';', NULL 'NULL', HEADER);
Your syntax is fine, the problem seem to be in the formatting of your data. Using your syntax I was able to load data with NULLs successfully:
mydb=# create table test(a int, b text);
CREATE TABLE
mydb=# \copy test from stdin WITH(FORMAT CSV, DELIMITER ';', NULL 'NULL', HEADER);
Enter data to be copied followed by a newline.
End with a backslash and a period on a line by itself.
>> col a header;col b header
>> 1;one
>> NULL;NULL
>> 3;NULL
>> NULL;four
>> \.
mydb=# select * from test;
a | b
---+------
1 | one
|
3 |
| four
(4 rows)
mydb=# select * from test where a is null;
a | b
---+------
|
| four
(2 rows)
In your case you can substitute NULL 'NA' in the COPY command, if the original value is 'NA'.
You should make sure that there are no spaces around your data values. For example, if your NULL is represented as NA in your data and fields are delimited with semicolons:
1;NA <-- good
1 ; NA <-- bad
1<tab>NA <-- bad
etc.
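A small cleaning pass along these lines can be sketched in Python, assuming semicolon-delimited data with NA markers (file paths and the function name are placeholders):

```python
# Strip whitespace around each field and rewrite 'NA' or empty fields as
# NULL, so that COPY ... NULL 'NULL' matches them.
import csv

def clean_csv(src_path, dst_path):
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        reader = csv.reader(src, delimiter=';')
        writer = csv.writer(dst, delimiter=';', lineterminator='\n')
        for row in reader:
            writer.writerow('NULL' if field.strip() in ('', 'NA') else field.strip()
                            for field in row)
```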

How do I get fprintf to preserve space for an empty numerical value?

In MATLAB, I'm using fprintf to print a list of numerical values under column headings, like so:
fprintf('%s %s %s %s\n', 'col1', 'col2', 'col3', 'col4')
for i = 1:length(myVar)
    fprintf('%8.4g %8.4g %8.4g %8.4g\n', myVar{i,1}, myVar{i,2}, myVar{i,3}, myVar{i,4})
end
This results in something like this:
col1 col2 col3 col4
   123.5    234.6    345.7    456.8
However, when one of the cells is empty (e.g. myVar{i,3} == []), space is not preserved:
col1 col2 col3 col4
   123.5    234.6    456.8
How do I preserve space in my format for a numerical value that may be empty?
One option is to use the functions CELLFUN and NUM2STR to change each cell to a string first, then print each cell as a string using FPRINTF:
fprintf('%8s %8s %8s %8s\n', 'col1', 'col2', 'col3', 'col4');
for i = 1:size(myVar,1)
    temp = cellfun(@(x) num2str(x,'%8.4g'), myVar(i,:), 'UniformOutput', false);
    fprintf('%8s %8s %8s %8s\n', temp{:});
end
This should give you output like:
    col1     col2     col3     col4
   123.5    234.6             456.8
Notice also that I added eights to your first FPRINTF call to fix the formatting of the column labels and changed length(myVar) to size(myVar,1) since you are looping over the rows of myVar.
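The same pad-as-string trick can be shown in Python for comparison (values and function name are made up):

```python
# Convert every value to a string first, so an empty (None) cell still
# occupies its full column width when the row is printed.
def fmt_row(values, width=8):
    cells = ['' if v is None else format(v, '.4g') for v in values]
    return ' '.join(f'{c:>{width}}' for c in cells)

# fmt_row([123.5, 234.6, None, 456.8]) keeps the gap for the missing column
```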