Reading .txt file in MATLAB with issue in formatting - matlab

I am using the MATLAB 2021b function readtable to read the following text file:
ISSUERID|FISCAL_YEAR|FIELD_ID|VALUE|PUBLISHED_DATE|SOURCE|DATA_TYPE|ADDITIONAL_INFO
IID000000002137286||DIVERSITY_DISCLOSURE_ETHNICITY_SOURCE|"https://www.cubesmart.com/about-us/corporate-responsibility/\""||{}||{}
The separator is the | (bar) character. Aenter code heres you can see, at the end of the "https://www.cubesmart.com/about-us/corporate-responsibility/\"" field value, there is the following \" character, which messes up the reading. I am trying to use the options 'Whitespace' to ignore it but for some reason it does not work. The code I am running is:
T_equ = readtable(file_name, 'FileType', 'text', 'Delimiter', {'|'}, 'Whitespace', '\"');
where file_name is just the path to the .txt file.
The results of the import is an empty table. I understand this results if the character \" would be read as a special character but from my understanding the 'Whitespace', '\"' pair/value argument should force the readtable function to ignore it. What am I missing here?

Related

reading data from csv files with `textscan` in MATLAB

[Edited:] I have a file data2007a.csv and I copied and pasted (using TextEdit in MacBook) the first consecutive few lines to a new file datatest1.csv for testing:
Nomenclature,ReporterISO3,ProductCode,ReporterName,PartnerISO3,PartnerName,Year,TradeFlowName,TradeFlowCode,TradeValue in 1000 USD
S3,ABW,0,Aruba,ANT,Netherlands Antilles,2007,Export,6,448.91
S3,ABW,0,Aruba,ATG,Antigua and Barbuda,2007,Export,6,0.312
S3,ABW,0,Aruba,CHN,China,2007,Export,6,24.715
S3,ABW,0,Aruba,COL,Colombia,2007,Export,6,95.885
S3,ABW,0,Aruba,DOM,Dominican Republic,2007,Export,6,11.432
I wanted to use textscan to read it into MATLAB with only columns 2,3,5 (starting from the second row) and I wrote the following code
clc,clear all
fid = fopen('datatest1.csv');
data = textscan(fid,'%*s %s %d %*s %s %*[^\n]',...
'Delimiter',',',...
'HeaderLines',1);
fclose(fid);
But I ended up with only the second row of columns 2,3 and 5:
I then keep the first row in data2007a.csv and selected several others to saved as datatest2.csv:
Nomenclature,ReporterISO3,ProductCode,ReporterName,PartnerISO3,PartnerName,Year,TradeFlowName,TradeFlowCode,TradeValue in 1000 USD
S3,ABW,1,Aruba,USA,United States,2007,Export,6,1.392
S3,ABW,1,Aruba,VEN,Venezuela,2007,Export,6,5633.157
S3,ABW,2,Aruba,ANT,Netherlands Antilles,2007,Export,6,310.734
S3,ABW,2,Aruba,USA,United States,2007,Export,6,342.42
S3,ABW,2,Aruba,VEN,Venezuela,2007,Export,6,63.722
S3,AGO,0,Angola,DEU,Germany,2007,Export,6,105.334
S3,AGO,0,Angola,ESP,Spain,2007,Export,6,8533.125
And I wrote:
clc,clear all
fid = fopen('datatest2.csv');
data = textscan(fid,'%*s %s %d %*s %s %*[^\n]',...
'Delimiter',',',...
'HeaderLines',1);
fclose(fid);
data{1}
It gives exactly what I wanted:
When I use the same code for my original data file data2007a.csv, it goes as in the first case.
What is going wrong and how can I fix it?
[Added:] If one replicates my experiments1, one can find that both cases work and the problem does not exist! I really don't know what is going on.
1 For "replicate" I mean copy-and-paste the data given above and save it as two new files, say, datatest4a.csv and datatest4b.csv. I used visdiff('datatest1.csv', 'datatest4a.csv') to compare two files and it returned:
Given how you fixed it, I think this is an end-of-line character issue. This sometimes comes up when moving text files between Windows and Unix based systems, as they use different conventions.
When you add %*[^\n] to the end of a textscan format, as you have here. it means to skip everything to the end of line. But if it expects a specific end of line character, and can't find one, it will skip everything to the end of the file. This would explain why you get one row correctly read and then nothing else.
If you don't specify what the end of line character is, Matlab appears to default to... something... in this not very clear specification in the help:
The default end-of-line sequence is \n, \r, or \r\n, depending on the contents of your file.
One way to try and cure this without having to create a new file would be to add this 'EndOfLine', '\r\n' to your textscan call:
If you specify '\r\n', then textscan treats any of \r, \n, and the
combination of the two (\r\n) as end-of-line characters.
This will hopefully handle most standard(ish) EOL conventions. It is likely that copy-pasting and saving with a different bit of software than was originally used to create the file changed the end of line characters such that Matlab was able to recognise them.

How to remove double quotation marks around numbers in a CSV file in Matlab

I am using csvread syntax m =csvread('reserve2.csv',7,3,[7,3,9,4]) from Matlab to read comma seperated values from a CSV file. Unfortunately the numbers in the specified rows and columns in the CSV file are listed with double quotation marks around them and I get the following error:
Error using dlmread (line 143)
Mismatch between file and format string.
Trouble reading 'Numeric' field from file (row number 1, field number 4) ==> "0","568"\n
Error in csvread (line 49)
m=dlmread(filename, ',', r, c, rng);
How can I invoke csvread such that it can read the values even when they are in double quotation marks? Or how can I write a code to get rid of the quotation marks in the CSV file?
There are several ways of reading in this file.
MATLAB's TEXTSCAN function can parse text and ignore delimiters enclosed within double quotation marks (" "). Use the TEXTSCAN function with the '%q' format type to identify strings demarcated by double quotation marks. For example:
str = 'a,A,"a,apple"';
out=textscan(str,'%s%s%q', 'delimiter',',')
This command will generate a cell array 'out' that contains 'a', 'A' and 'a,apple'.
If you are on a Windows platform and have Microsoft Excel installed, you can use the following syntax with XLSREAD to read your data into two cell arrays:
[num_data text_data] = xlsread(filename);
After executing this command, the data will be copied to 2 different arrays that treat the data differently:
"num_data" - containing numeric data only; strings and empty fields will be converted to NaN
"text_data" - containing all data read in as strings. The text between two double quotes will be parsed as a single string
Create a custom function to parse the file using multiple FREAD or FGETL commands.
For more information on these functions, refer to the documentation by executing the following at the MATLAB command prompt:
doc textscan
doc xlsread
doc fread
doc fgetl

Removing quotes that seem to be nonexistent in MatLab

I have an application that translates the .csv files I currently have into the correct format I need. But, the files I that I do have seem to have '"' double quotes around them, as seen in this image, which will not work with the program. As a result, I'm using this command to remove them:
for m = 1:currentsize(1)
for n = 1:currentsize(2)
replacement{m,n} = strrep(current{m,n}, '"', '');
end
end
I'm not entirely sure that this works though, as it spits this back at me as it runs:
Warning: Inputs must be character arrays or cell arrays of strings.
When I open the file in matlab, it seems to only have the single quotes around it, which is normal for any other file. However, when I open it in notepad++, it seems to have '"' double quotes around everything. Is there a way to remove these double quotes in any other way? My code doesn't seem to do anything as seen here:
After using xlswrite to write the replacement cell-array to a .csv file, one appears corrupted. Any idea why?
So, my questions are:
Is there any way to remove the quotes in a more efficient manner or without rewriting to a csv?
and
What exactly is causing the corruption in the xlswrite function? The variable replacement seems perfectly normal.
Thanks in advance!
Regarding the "corrupted" file. That's not a corrupted file, that's an xls file (not xlsx). You could verify this opening the text file in a hex editor to compare the signature. This happens when you chose no file extension or ask excel to write a file which can't be encoded into csv. I assume it's some character which causes the problems, try to isolate the critical line writing only parts of the cell.
Regarding your warning, not having the actual input I could only guess what's wrong. Again, I can only give some advices to debug the problem yourself. Run this code:
lastwarn('')
for m = 1:currentsize(1)
for n = 1:currentsize(2)
replacement{m,n} = strrep(current{m,n}, '"', '');
if ~isempty(lastwarn)
keyboard
lastwarn('')
end
end
end
This will launch the debugger when a warning is raised, allowing you to check the content of current{m,n}
It is mentionned in the documentation of strrep that :
The strrep function does not find empty strings for replacement. That is, when origStr and oldSubstr both contain the empty string (''), strrep does not replace '' with the contents of newSubstr.
You can use this user function substr like this :
for m = 1:currentsize(1)
for n = 1:currentsize(2)
replacement{m,n} = substr(current{m,n}, 2, -1);
end
end
Please use xlswrite function like this :
[status,message] = xlswrite(___)
and check the status if it is zero, that means the function did not succeed and an error message is generated in message as string.

Matlab: Printing literally using fprintf

I am writing a Matlab function that does some file manipulation as follows. I take the input file, read it line by line and write it to an output file. If the line contains a keyword, I do some further processing before writing the output. My problem is this: If the input line contains an escape character, it messes up the fprintf that I use to write the output. For example, if the input line contains a % sign, the rest of the line does not show up in the output. So, my question is: is there a way to force fprintf to ignore all escape sequences and print the literal string? Thanks in advance for the help.
Sample code below:
fptr_read = fopen('read_file.txt','r'); fptr_write = fopen('write_file.txt','w');
while(~feof(fptr_read))
current_line = fgetl(fptr_read);
fprintf(fptr_write,current_line);
end
If current_line looks like 'Gain is 5% larger', it will get written as 'Gain is 5'. I want the line reproduced verbatim without manually having to check for presence of escape characters.
I had a similar problem once, and solved it like this:
fprintf(fp,'%s',stringWithEscapesIWant)

MATLAB: How do you insert a line of text at the beginning of a file?

I have a file full of ascii data. How would I append a string to the first line of the file? I cannot find that sort of functionality using fopen (it seems to only append at the end and nothing else.)
The following is a pure MATLAB solution:
% write first line
dlmwrite('output.txt', 'string 1st line', 'delimiter', '')
% append rest of file
dlmwrite('output.txt', fileread('input.txt'), '-append', 'delimiter', '')
% overwrite on original file
movefile('output.txt', 'input.txt')
Option 1:
I would suggest calling some system commands from within MATLAB. One possibility on Windows is to write your new line of text to its own file and then use the DOS for command to concatenate the two files. Here's what the call would look like in MATLAB:
!for %f in ("file1.txt", "file2.txt") do type "%f" >> "new.txt"
I used the ! (bang) operator to invoke the command from within MATLAB. The command above sequentially pipes the contents of "file1.txt" and "file2.txt" to the file "new.txt". Keep in mind that you will probably have to end the first file with a new line character to get things to append correctly.
Another alternative to the above command would be:
!for %f in ("file2.txt") do type "%f" >> "file1.txt"
which appends the contents of "file2.txt" to "file1.txt", resulting in "file1.txt" containing the concatenated text instead of creating a new file.
If you have your file names in strings, you can create the command as a string and use the SYSTEM command instead of the ! operator. For example:
a = 'file1.txt';
b = 'file2.txt';
system(['for %f in ("' b '") do type "%f" >> "' a '"']);
Option 2:
One MATLAB only solution, in addition to Amro's, is:
dlmwrite('file.txt',['first line' 13 10 fileread('file.txt')],'delimiter','');
This uses FILEREAD to read the text file contents into a string, concatenates the new line you want to add (along with the ASCII codes for a carriage return and a line feed/new line), then overwrites the original file using DLMWRITE.
I get the feeling Option #1 might perform faster than this pure MATLAB solution for huge text files, but I don't know that for sure. ;)
How about using the frewind(fid) function to take the pointer to the beginning of the file?
I had a similar requirement and tried frewind() followed by the necessary fprintf() statement.
But, warning: It will overwrite on whichever is the 1st line. Since in my case, I was the one writing the file, I put a dummy data at the starting of the file and then at the end, let that be overwritten after the operations specified above.
BTW, even I am facing one problem with this solution, that, depending on the length(/size) of the dummy data and actual data, the program either leaves part of the dummy data on the same line, or bring my new data to the 2nd line..
Any tip in this regards is highly appreciated.