Postgres: Inconsistent row numbers reported by failed COPY error

I have a file foo that I'd like to copy into a table.
\copy stage.from_csv FROM '/path/foo.dat' CSV;
If foo has an error such as a column mismatch or a bad type, the reported error is normal:
CONTEXT: COPY from_csv, line 5, column report_year: "aa"
However, if the error is caused by an extraneous quotation mark, the reported line number is always one greater than the number of lines in the file.
CONTEXT: COPY from_csv, line 11: "02,2004,"05","123","09228","00","SUSX","PR",30,,..."
The source file has 10 lines, and I placed the error in line 5. If you examine the information in the CONTEXT message, it contains the line 5 data, so I know Postgres can identify the row itself. However, it cannot identify the row by number. I have tried this with a few different file lengths and the reported line number behaves the same way.
Does anyone know the cause and/or how to work around this?

That is because the error manifests only at the end of the file.
Look at this CSV file:
"google","growing",1,2
"google","growing",2,3
"google","falling","3,4
"google","growing",4,5
"yahoo","growing",1,2
There is a bug in the third line, an extra " has been added before the 3.
Now to parse a CSV file, you first read until you hit the next line break.
But be careful, line breaks that are within double quotes don't count!
Because of the extra quote, the parser is still inside a quoted string at every following line break, so the line continues until the end of the file.
Now that we have read our line, we can continue parsing it and notice that the number of quotes is unbalanced. Hence the error message:
ERROR: unterminated CSV quoted field
CONTEXT: COPY stock, line 6: ""google","falling","3,4
"google","growing",4,5
"yahoo","growing",1,2
"
In a nutshell: the error does not occur until we have reached the end of file.
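You can reproduce the mechanics outside Postgres. Below is a minimal Python sketch (the standard csv module, not Postgres itself, on a simplified stand-in for the data): once a stray quote opens a field that never closes, every following line break is treated as data inside that field, so the rest of the file collapses into one logical record.
import csv
import io

# 'b' opens a quoted field that is never closed, so the line breaks
# after it are field content, not record separators.
data = 'a,"b\nc,d\ne,f\n'
for row in csv.reader(io.StringIO(data)):
    print(row)
# Prints a single record whose second field contains the rest of the file.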

Related

mongoimport on csv: bare " in non-quoted-field

Running mongoimport, I get the error:
Failed: read error on entry #2: line 3, column 25: bare " in non-quoted-field
This is line 3:
1,0xcrypton,"Hyderabad, India","'Hyderabad', ' India'",17.38405,78.45636
There are plenty of other questions about this error, but they're all related to double quotes or escaped quotes. There's none of that here. In fact, to be on the safe side I deleted all double quotes from the csv. What's going on?
Edit: I also tried removing the entire rest of the csv so it's just this line and the header line. Still getting the error.
Well, this happened because I created the csv using pandas to_csv with the encoding='utf16' option. Apparently doing so breaks the quotation marks: UTF-16 output starts with a byte-order mark and uses two bytes per character, while mongoimport expects UTF-8, so the quote characters are no longer recognized during parsing.
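A minimal sketch of the fix, assuming a pandas DataFrame similar to the one behind the CSV above (the column names here are hypothetical):
import pandas as pd

# Hypothetical stand-in for the asker's data.
df = pd.DataFrame({
    "id": [1],
    "user": ["0xcrypton"],
    "location": ["Hyderabad, India"],
    "lat": [17.38405],
    "lon": [78.45636],
})

# encoding='utf16' writes a BOM and two bytes per character, which
# mongoimport (a UTF-8 tool) reads as stray bytes around the quotes.
# Writing UTF-8 keeps the quotation marks parseable.
df.to_csv("users.csv", index=False, encoding="utf-8")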

USQL Escape Quotes

I am new to Azure Data Lake Analytics. I am trying to load a CSV whose string columns are double quoted, and there are quotes inside a column on some random rows.
For example
ID, BookName
1, "Life of Pi"
2, "Story about "Mr X""
When I try loading, it fails on the second record, throwing an error message.
1. I wonder if there is a way to fix this in the CSV file? Unfortunately we cannot extract a new one from the source, as these are log files.
2. Is it possible to have ADLA ignore the bad rows and proceed with the rest of the records?
Execution failed with error '1_SV1_Extract Error:
{"diagnosticCode":195887146,"severity":"Error","component":"RUNTIME","source":"User",
"errorId":"E_RUNTIME_USER_EXTRACT_ROW_ERROR",
"message":"Error occurred while extracting row after processing 9045 record(s) in the vertex' input split. Column index: 9, column name: 'instancename'.",
"description":"","resolution":"","helpLink":"","details":"","internalDiagnostics":"",
"innerError":{"diagnosticCode":195887144,"severity":"Error","component":"RUNTIME","source":"User",
"errorId":"E_RUNTIME_USER_EXTRACT_EXTRACT_INVALID_CHARACTER_AFTER_QUOTED_FIELD",
"message":"Invalid character following the ending quote character in a quoted field.",
"description":"Invalid character is detected following the ending quote character in a quoted field. A column delimiter, row delimiter or EOF is expected.\nThis error can occur if double-quotes within the field are not correctly escaped as two double-quotes.",
"resolution":"Column should be fully surrounded with double-quotes and double-quotes within the field escaped as two double-quotes."}}'
As per the error message, if you are importing a quoted CSV which has quotes within some of the columns, then these need to be escaped as two double-quotes. In your particular example, your second row needs to be:
2, "Story about ""Mr X"""
So one option is to fix up the original file on output. If you are not able to do this, then you can import all the columns as one column, use a regular expression to fix up the quotes, and output the file again, e.g.:
// Import each record as a single column, using a delimiter that does
// not occur in the data so the whole line lands in one field
@input =
    EXTRACT oneCol string
    FROM "/input/input132.csv"
    USING Extractors.Text('|', quoting : false);

// Fix up the quotes using RegEx: double any quote not adjacent to a comma
@output =
    SELECT Regex.Replace(oneCol, "([^,])\"([^,])", "$1\"\"$2") AS cleanCol
    FROM @input;

OUTPUT @output
TO "/output/output.csv"
USING Outputters.Csv(quoting : false);
The file will now import successfully.
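To see what that Regex.Replace does, here is the same substitution as a Python sketch. It relies on field-opening and field-closing quotes sitting directly next to a comma (or the line boundary), so only the inner quotes get doubled; treat it as an illustration, not a general CSV repair:
import re

line = '2,"Story about "Mr X"",3'
# Double any quote that has a non-comma character on both sides,
# i.e. a quote embedded inside a field rather than delimiting it.
fixed = re.sub(r'([^,])"([^,])', r'\1""\2', line)
print(fixed)  # 2,"Story about ""Mr X""",3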

How to load a csv file in MATLAB by skipping erroneous rows?

I am using the csvread command to read a large CSV file:
M=csvread('myfile.csv');
But there are a few rows in it which, apparently, do not allow MATLAB to load the file because they contain text (or other garbage). For example, line numbers 45372, 117573, etc. So, how do I skip them when loading the file?
When "few" really means few, you can use a try-catch structure in a while loop. You also need to know the number of columns.
The following code expects this error message:
Mismatch between file and format string.
Trouble reading number from file (row <row#>, field <field#>) ==> <faulty line content>
The code tries to load the whole CSV at once. If that fails, it analyses the error message, identifies the faulty line, loads the file up to the line before it, and appends the result to the data variable. The next run starts loading one line below the faulty line. The iteration terminates when the remaining file is successfully read to the end.
StartRow = <number of ignored rows (header)>;
Ncols = <width>;
data = zeros(0, Ncols);
csvFault = true;
while csvFault
    try
        % Try to read the file from StartRow to the end-of-file.
        Temp = csvread('YourFile.csv', StartRow, 0);
        csvFault = false; % Executed only when csvread is successful
    catch msg
        %% An error occurred: find the faulty line from the message
        faultyRow = regexp(msg.message, ' ', 'split'); % Split error message into cells
        faultyRow = faultyRow{12};                     % The 12th cell contains the row identifier
        faultyRow = str2double(faultyRow(1:end-1));    % The identifier is a number followed by one character
        %% Read the data from the segment before the faulty line
        Temp = csvread('YourFile.csv', StartRow, 0, [StartRow, 0, faultyRow-1, Ncols-1]);
        StartRow = faultyRow + 1; % Next pass starts below the faulty line
    end
    data = [data; Temp];
end

"Uninitialized value" error when running word_align.pl script

I'm trying to run the word_align.pl script provided by CMUSphinx. I write the command as follows:
perl word_align.pl actualtext.txt batchOutputText.txt
But the terminal gives me the following errors:
Use of uninitialized value $ref_uttid in hash element at word_align.pl line 60, line 1.
Use of uninitialized value $ref_uttid in concatenation (.) or string at word_align.pl line 61, line 1.
UttID is not ignored but it could not found in any entries of the hypothesis file on line3 1 UTTID
I am not quite familiar with Perl and I can't figure out what the problem is here, though I followed the instructions provided by CMUSphinx to run the script.
You can find the script here
Edit: here is the link to the reference file
The answer is in this error message:
UttID is not ignored but it could not found in any entries of the hypothesis file on line3 1 UTTID
The reference file that you are passing is malformed; specifically, its first line isn't formatted as it should be.
More precisely, each line of the reference file requires a UTT ID: a unique string in parentheses like (output00000). It must be unique because it is used as a hash key. A simple digit like (1) won't work, as it will be mistaken for an alternative pronunciation.
The first line of your file must be different from that. You suggest
<s> text </s> (file12)
which actually works fine (I have tested it), and $ref_uttid comes out as FILE12. If you tell us what is actually in your file then I am sure we could help you better.
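For illustration, this is the shape the two files need. The text and IDs below are made up; what matters is that every line ends with a unique utterance ID in parentheses, and that the same IDs appear in both files:
actualtext.txt (reference):
<s> hello world </s> (utt_001)
<s> good morning </s> (utt_002)

batchOutputText.txt (hypothesis):
hello word (utt_001)
good morning (utt_002)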

Easytrieve A010 invalid file reference

I am getting the error on this line of my Easytrieve program:
JOB INPUT NULL MASTER-FILE
GET DATAPRM <~~~~~~~ LINE 59
DO WHILE NO EOF DATAPRM
...
GET DATAPRM
END-DO
..
59******A010 INVALID FILE REFERENCE - DATAPRM
..
I have a DLBL like this:
//DLBL DATAPRM, 'DATAPRM.SAM'
I am trying to populate the master file with data using the input file DATAPRM (card). The records are being read (I assume so, since my counter is moving), but unfortunately, before the program terminates, the error occurs. Maybe at EOF?
You have no STOP in your program. Not just in the code that you have shown, but anywhere. Or if you do, it is conditional and the condition was not met.
Easytrieve Plus does an "automatic cycle". Usually with the file named on the JOB statement, but when NULL is specified, it just cycles from the last statement in the JOB through to the JOB again.
After you get to EOF in your DO, you need to STOP when you have finished everything else. What is happening now is: you get EOF, drop out of the DO, cycle to the top again (the JOB), and then it does a GET after EOF, hence ******A010 INVALID FILE REFERENCE - DATAPRM.
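A sketch of the shape this answer describes, with STOP added after the loop (statement spelling, e.g. NOT EOF, may need adjusting to your Easytrieve Plus level):
JOB INPUT NULL MASTER-FILE
  GET DATAPRM
  DO WHILE NOT EOF DATAPRM
*   ... process the record, write the master file ...
    GET DATAPRM
  END-DO
* The loop is done at EOF: STOP ends the activity here, otherwise
* the automatic cycle returns to the JOB and issues a GET after EOF
  STOP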