I have a large tab-delimited file (10000 rows, 15000 columns) and would like to import it into MATLAB.
I've tried to import it using the textscan function in the following way:
function [C_text, C_data] = ReadDataFile(filename, header, attributesCount, delimiter, ...
                                          attributeFormats, attributeFormatCount)
    AttributeTypes = SetAttributeTypeMatrix(attributeFormats, attributeFormatCount);
    fid = fopen(filename);
    if(header == 1)
        % read column headers
        C_text = textscan(fid, '%s', attributesCount, 'delimiter', delimiter);
        C_data = textscan(fid, AttributeTypes{1, 1}, 'headerlines', 1);
    else
        C_text = '';
        C_data = textscan(fid, AttributeTypes{1, 1});
    end
    fclose(fid);
AttributeTypes{1, 1} is a string which describes the variable type for each column (in this case there are 14740 float and 260 string variables, so the value of AttributeTypes{1, 1} is '%f%f...%f%s%s...%s', where %f is repeated 14740 times and %s 260 times).
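As an aside, a format string like that can also be built programmatically rather than typed out; a small sketch, assuming the counts above (14740 floats followed by 260 strings):
nFloats  = 14740;
nStrings = 260;
fmt = [repmat('%f', 1, nFloats), repmat('%s', 1, nStrings)];   % should match the AttributeTypes{1, 1} string described above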
When I try to execute
>> [header, data] = ReadDataFile('data/orange_large_train.data.chunk1', 1, 15000, '\t', types, size);
header array seems to be correct (column names have been read correctly).
data is a 1 x 15000 cell array (it looks as though only the first row has been imported instead of all 10000) and I don't know what is causing this behavior.
I guess the problem is caused by this line:
C_data = textscan(fid, AttributeTypes{1, 1});
but I don't know what could be wrong, because there is a similar example described in the help reference.
I would be very thankful for any suggestion on how to fix the issue and read all 10000 rows.
I believe all your data are there. If you look inside data, every cell should contain a whole column (10000x1). You can extract the i-th cell as an array with data{i}.
You would probably want to separate the double and string data. I don't know what attributeFormats is; you can probably use that array, but you can also use AttributeTypes{1, 1}.
isdouble = strfind(AttributeTypes{1, 1}(2:2:end),'f');
data_double = cell2mat(data(isdouble));
To combine string data into one cell array of strings you can do:
isstring = strfind(AttributeTypes{1, 1}(2:2:end),'s');
data_string = horzcat(data{isstring});
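If the counts above are right, data_double should come out as a 10000-by-14740 double matrix and data_string as a 10000-by-260 cell array of strings.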
I have a data file that includes comma-delimited data that I am trying to read into Octave. Most of the data is fine, but some of it includes numbers enclosed in double quotes, with commas between the quotes. Here's a sample section of data:
.123,4.2,"4,123",700,12pie
.34,4.23,602,701,23dj
.4345,4.6,"3,623,234",700,134nfg
.951,68.5,45,699,4lkj
I've been using textscan to read the data (since there's a mix of numbers and strings), specifying comma delimiters, and that works most of the time, but occasionally the file contains these bigger integers in quotes scattered through that column. I was able to get around one of these quoted numbers earlier in the data file because I knew where it would be, but it wasn't pretty:
sclose = textscan(fid, '%n %n', 1, 'delimiter', ',');
junk = fgetl(fid, 1);
junk = textscan(fid, '%s', 1, 'delimiter', '"');
junk = fgetl(fid, 1);
sopen = textscan(fid, '%n %s', 1, 'delimiter', ',');
I don't care about the data in that column, but because it changes size and sometimes contains these quoted numbers with extra commas that I want to ignore, I'm struggling with how to read/skip it. Any suggestions on how to handle it?
Here's my current (ugly) approach: it reads the column as a string, then uses strfind to check for a " within the string. If one is present, it reads another comma-delimited string and repeats the check until the closing " is found, then resumes reading the data.
fid = fopen('sample.txt', 'r');
for k=1:4
expdata1(k, :) = textscan(fid, '%n %n %s', 1, 'delimiter', ','); #read first 3 data pts
qcheck = char(expdata1(k,3));
idx = strfind(qcheck, '"'); #look for "
dloc = ftell(fid);
for l=1:4
if isempty(idx) #if no " present, continue reading data
break
endif
dloc = ftell(fid); #save location so can return to next data point
expdata1(k, 3) = textscan(fid, '%s', 1, 'delimiter', ','); #if " present, read next comma segment and check for "
qcheck = char(expdata1(k,3));
idx = strfind(qcheck, '"');
endfor
fseek(fid, dloc);
expdata2(k, :) = textscan(fid, '%n %s', 1, 'delimiter', ',');
endfor
fclose(fid);
There's gotta be a better way...
I see this has a MATLAB tag on it; are you using MATLAB's textscan or Octave's?
If you are in MATLAB, I would suggest using either readmatrix or readtable.
Also note that the format specifier for a quoted string is %q. This should be applicable to both languages, even for textscan.
Putting your sample data in data.csv, the following is possible:
>> readtable("data.csv", 'Format','%f%f%q%d%s')
ans =
4×5 table
Var1 Var2 Var3 Var4 Var5
______ ____ _____________ ____ __________
0.123 4.2 {'4,123' } 700 {'12pie' }
0.34 4.23 {'602' } 701 {'23dj' }
0.4345 4.6 {'3,623,234'} 700 {'134nfg'}
0.951 68.5 {'45' } 699 {'4lkj' }
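For completeness, the same data can also be read with textscan and %q; a minimal sketch, assuming the sample is saved as data.csv:
fid = fopen('data.csv', 'r');
C = textscan(fid, '%f %f %q %d %s', 'Delimiter', ',');  % %q keeps the commas inside the quotes
fclose(fid);
C{3} should then contain {'4,123'; '602'; '3,623,234'; '45'}.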
How can I turn strings like 1-14 into 01_014 (and 2-2 into 02_002)?
I can do something like this:
testpoint_number = '5-16';
temp = textscan(testpoint_number, '%s', 'delimiter', '-');
temp = temp{1};
first_part = temp{1};
second_part = temp{2};
output_prefix = strcat('0',first_part);
parsed_testpoint_number = strcat(output_prefix, '_',second_part);
parsed_testpoint_number
But I feel this is very tedious, and I don't know how to handle the second part (16 to 016)
As you are handling integer numbers, I would suggest changing the textscan format to %d (integers). With that, you can use the formatting power of the *printf commands (e.g. sprintf).
*printf allows you to specify the width of the integer. With %02d, a 2-character-wide integer is printed, padded with zeros.
The textscan returns a {1x1} cell, which contains a 2x1 array of integers. *printf can handle this itself, so you just have to supply the argument temp{1}:
temp = textscan(testpoint_number, '%d', 'delimiter', '-');
parsed_testpoint_number = sprintf('%02d_%03d',temp{1});
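For the example input '5-16', this should yield '05_016'; a quick sanity check with plain numbers:
sprintf('%02d_%03d', 5, 16)   % returns '05_016'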
Your textscan approach is probably the most intuitive way to do this, but from there what I would recommend is converting the scanned first_part and second_part into numeric format, giving you integers.
You can then sprintf these into your target string using the correct C-style format specifiers to indicate your zero-padded field width, e.g.:
temp = textscan(testpoint_number, '%d', 'delimiter', '-');
parsed_testpoint_number = sprintf('%02d_%03d', temp{1});
Take a look at the C sprintf() documentation for an explanation of the string formatting options.
I have a cell array that has both strings and numbers. I want to load all the elements of the cell array. To do that, I used the following method:
load(filename);
This command is loading only the strings and excluding the columns that have numbers. Basically, since my file does not have a .mat extension, it is treated as an ASCII file and only text is loaded.
I tried importdata(filename), but that gives me a 1x1 struct. I need the elements to be imported into another cell array of the same dimensions.
Is there a way to load all the values?
load is used to import .mat-files containing workspace variables. Since your file is not an actual .mat-file, you need to use a different method.
Let's assume you have the file filename with tab-delimited data:
str1 1
str2 2
str3 3
str4 4
To get a cell array where the first column is a string (using %s) and the second a double (using %f), you can use textscan. Check out the result; maybe it's already what you're looking for.
filename = 'data';
F = fopen(filename, 'r');
data = textscan(F, '%s %f', 'Delimiter', '\t');
fclose(F);
If not, you can create a cell-array CA where the first column is a string (using cellstr) and the second one is a double (using num2cell).
CA = cell(size(data{1},1),2);
CA(:,1) = cellstr(data{1});
CA(:,2) = num2cell(data{2});
Result:
CA =
'str1' [1]
'str2' [2]
'str3' [3]
'str4' [4]
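As a minor variant (not part of the original answer): since the %s column returned by textscan is already a cell array of strings, the two conversions can be collapsed into a single line:
CA = [data{1}, num2cell(data{2})];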
Loading a well-formatted, delimited text file in MATLAB is relatively simple, but I'm struggling with a text file that I have to read in. Sadly, I cannot change the structure of the source file, so I have to deal with what I have.
The basic file structure is:
123 180 (two integers, white space delimited)
1.5674e-8
.
.
(floating point numbers in column 1, column 2 empty)
.
.
100 4501 (another two integers)
5.3456e-4 (followed by even more floating point numbers)
.
.
.
.
45 String (An integer in column 1, string in column 2)
.
.
.
A simple
[data1,data2]=textread('filename.txt','%f %s', ...
'emptyvalue', NaN)
does not work.
How can I properly filter the input data? All examples I found online and in the MATLAB help so far deal with well-structured data, so I am a bit lost as to where to start.
As I have to read a whole bunch of those files (>100), I'd rather not iterate through every single line in every file. I hope there is a much faster approach.
EDIT:
I made a sample file available here: test.txt (google drive)
I've looked at the text file you supplied and tried to draw a few general conclusions -
When there are two integers on a line, the second integer corresponds to the number of rows following.
You always have (two integers (A, B) followed by "B" floats), repeated twice.
After that you have some free-form text (or at least, I couldn't deduce anything useful about the format after that).
This is a messy format so I doubt there are going to be any nice solutions. Some useful general principles are:
Use fgetl when you need to read a single line (it reads up to the next newline character)
Use textscan when it's possible to read multiple lines at once - it is much faster than reading a line at a time. It has many options for how to parse, which it is worth getting to know (I recommend typing doc textscan and reading the entire thing).
If in doubt, just read the lines in as strings and then analyse them in MATLAB.
With that in mind, here is a simple parser for your files. It will probably need some modifications as you are able to infer more about the structure of the files, but it is reasonably fast on the ~700-line test file you gave.
I've just given the variables dummy names like "a", "b", "floats" etc. You should change them to something more specific to your needs.
function output = readTestFile(filename)
fid = fopen(filename, 'r');
% Read the first line
line = '';
while isempty(line)
line = fgetl(fid);
end
nums = textscan(line, '%d %d', 'CollectOutput', 1);
a = nums{1}(1);
b = nums{1}(2);
% Read 'b' of the next lines:
contents = textscan(fid, '%f', b);
floats1 = contents{1};
% Read the next line:
line = '';
while isempty(line)
line = fgetl(fid);
end
nums = textscan(line, '%d %d', 'CollectOutput', 1);
c = nums{1}(1);
d = nums{1}(2);
% Read 'd' of the next lines:
contents = textscan(fid, '%f', d);
floats2 = contents{1};
% Read the rest:
rest = textscan(fid, '%s', 'Delimiter', '\n');
output.a = a;
output.b = b;
output.c = c;
output.d = d;
output.floats1 = floats1;
output.floats2 = floats2;
output.rest = rest{1};
end
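Usage would then be along these lines, assuming the sample file test.txt from the question:
out = readTestFile('test.txt');   % out.floats1, out.floats2, out.rest, ...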
You can read in the file line by line using the lower-level functions, then parse each line manually.
You open the file handle like in C
fid = fopen(filename);
Then you can read a line using fgetl
line = fgetl(fid);
Tokenizing each line on spaces is probably the best first pass, storing each piece in a cell array (because a matrix doesn't support ragged rows):
colnum = 1;
rem = line;   % the remainder of the line still to be tokenized
while ~isempty(rem)
    [token, rem] = strtok(rem, ' ');
    entries{linenum, colnum} = token;
    colnum = colnum + 1;
end
Then you can wrap all of that inside another while loop to iterate over the lines:
linenum = 1;
while ~feof(fid)
% fgetl, strtok, index bookkeeping as above
end
It's up to you whether it's best to parse the file as you read it or read it into a cell array first and then go over it afterwards.
Your cell entries are all going to be strings (char arrays), so you will need to use str2num to convert them to numbers. It does a good job of working out the format so that might be all you need.
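Putting those pieces together, a minimal end-to-end sketch (a rough outline under the assumptions above, not a polished implementation) could look like:
fid = fopen(filename, 'r');
entries = {};                          % cell array, since rows can have different lengths
linenum = 1;
while ~feof(fid)
    line = fgetl(fid);
    rem = line;
    colnum = 1;
    while ~isempty(rem)
        [token, rem] = strtok(rem, ' ');
        entries{linenum, colnum} = token;
        colnum = colnum + 1;
    end
    linenum = linenum + 1;
end
fclose(fid);
% afterwards, convert the numeric-looking tokens, e.g. with str2num(entries{1, 1})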
I have a file in the following format:
**400**,**100**::400,descendsFrom,**76**::0
**400**,**119**::400,descendsFrom,**35**::0
**400**,**4**::400,descendsFrom,**45**::0
...
...
Now I need to read only the parts shown in bold. I've written the following formatspec:
formatspec = '%d,%d::%*d,%*s,%d::%*d\n';
data = textscan(fileID, formatspec);
It doesn't seem to work. Can someone tell me what's wrong?
I also need to know how to avoid using a delimiter, and how to express exactly the way my file is written, as in the case above.
EDITED
A possible problem is the %s part of the formatspec variable. Because %s matches an arbitrary string, the descendsFrom,76::0 part of the line gets absorbed into that string. So with the formatspec '%d,%d::%d,%s,%d::%d\n' you will get the following cells from the first line:
400 100 400 'descendsFrom,76::0'
To solve this problem you have two possibilities:
formatspec = '%d,%d::%d,descendsFrom,%d::%d\n'
OR
formatspec = '%d,%d::%d,%12s,%d::%d\n'
In the first case, the 'descendsFrom' string has to be contained in each row (as in your example). In the second case, the string can change, but its length must be 12.
Your delimiter is ','; you should first split on it and then maybe run a regex. Here is how I would go about it:
fileID = fopen('file.csv');
D = textscan(fileID,'%s %s %s %s ','Delimiter',','); %read everything as strings
column1 = regexprep(D{1},'\*','')   % * is escaped, since it is a regex quantifier
column2 = regexprep(D{2},{'\*',':'},{'',''})
column3 = D{3}
column4 = regexprep(D{4},{'\*',':'},{'',''})
This should generate your 4 columns, which you can then combine.
I believe the Delimiter can only be one symbol. The more efficient way is to run regexprep directly on your entire line, which would give:
test = '**400**,**4**::400,descendsFrom,**45**::0'
test = regexprep(test,{'\*',':'},{'',''})
>> test = 400,4400,descendsFrom,450
You can use multiple delimiters in textscan; they need to be supplied as a cell array of strings. You don't need the end-of-line character in the format, and you need to set 'MultipleDelimsAsOne'. I don't have MATLAB to hand, but something along these lines should work:
formatspec = '%d %d %*d %*s %d %*d';
data = textscan(fileID, formatspec,'Delimiter',{',',':'},'MultipleDelimsAsOne',1);
If you want to return it as a matrix of numbers rather than a cell array, try also adding the option 'CollectOutput',1.
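For instance (a sketch, assuming the format above parses your file as intended):
data = textscan(fileID, formatspec, 'Delimiter', {',',':'}, ...
                'MultipleDelimsAsOne', 1, 'CollectOutput', 1);
M = data{1};   % the three %d fields are collected into a single N-by-3 integer matrix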