How to avoid the repeated paragraphs of long txt files being ignored by importdata in MATLAB - matlab

I am trying to import all the doubles from a txt file, which has this form:
#25x1 string
#9999x2 double
.
.
.
#(repeat ten times)
However, when I try to use the Import Wizard, only the first
25x1 string
9999x2 double
is successfully loaded; the other 9 are simply ignored.
How can I import all the data? (Does importdata have a maximum length or something?)
Thanks

It has nothing to do with maximum length; importdata is just not set up for the sort of data file you describe. From the help file:
For ASCII files and spreadsheets, importdata expects
to find numeric data in a rectangular form (that is, like a matrix).
Text headers can appear above or to the left of the numeric data,
as follows:
Column headers or file description text at the top of the file, above
the numeric data. Row headers to the left of the numeric data.
So what is happening is that the first section of your file, which does match the format importdata expects, is being read, and the rest ignored. Instead of importdata, you'll need to use textscan, in particular, this style:
C = textscan(fileID,formatSpec,N)
fileID is returned from fopen. formatSpec tells textscan what to expect, and N how many times to repeat it. As long as fileID remains open, repeated calls to textscan continue to read the file from wherever the last read action stopped, rather than going back to the start of the file. So we can do this:
fileID = fopen('myfile.txt');
repeats = 10;
C = cell(repeats,2); % preallocate: one row per repeated block
for n = 1:repeats
% read one string, 25 times (add 'Delimiter','\n' if the header lines contain spaces)
C{n,1} = textscan(fileID,'%s',25);
% read two floats, 9999 times
C{n,2} = textscan(fileID,'%f %f',9999);
end
fclose(fileID); % release the file once all blocks are read
You can then extract your numerical data out of the cell array (if you need it in one block you may want to try using 'CollectOutput',1 as an option).
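As a minimal sketch of that extraction step, assuming the header strings contain no spaces, here is the same loop run on a small temporary file. The file name, block sizes (2 blocks of 3 strings and 4 rows) and values are made up for illustration and stand in for the real data:

```matlab
% Build a small stand-in file: 2 blocks, each 3 header words + 4 rows of 2 numbers
% (file name, block sizes and values are illustrative assumptions)
tmpFile = [tempname '.txt'];
fid = fopen(tmpFile,'w');
for b = 1:2
    fprintf(fid,'hdrA\nhdrB\nhdrC\n');
    fprintf(fid,'%g %g\n',[1 2; 3 4; 5 6; 7 8].' + 10*b);
end
fclose(fid);

repeats = 2;
headers = cell(repeats,1);
blocks  = cell(repeats,1);
fid = fopen(tmpFile,'r');
for n = 1:repeats
    h = textscan(fid,'%s',3);                          % 1x1 cell holding a 3x1 cellstr
    d = textscan(fid,'%f %f',4,'CollectOutput',true);  % 1x1 cell holding a 4x2 double
    headers{n} = h{1};
    blocks{n}  = d{1};
end
fclose(fid);
delete(tmpFile);

allData = vertcat(blocks{:});   % stack every block into one (repeats*4)x2 matrix
```

With 'CollectOutput' set, each block comes back as a single two-column matrix instead of two separate column vectors, so stacking the blocks is a single vertcat call.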

Related

How to read a number from text file via Matlab

I have 1000 text files and want to read a number from each file.
The format of each text file is:
af;laskjdf;lkasjda123241234123
$sakdfja;lskfj12352135qadsfasfa
falskdfjqwr1351
##alskgja;lksjgklajs23523,
asdfa#####1217653asl123654fjaksj
asdkjf23s#q23asjfklj
asko3
I need to read the number ("1217653") after "#####" in each txt file.
The number immediately follows the "#####" in every text file.
"#####" and the number following it appear only once in each file.
clc
clear
MyFolderInfo = dir('yourpath/folder');
fidin = fopen(file_name,'r','n','utf-8');
while ~feof(fidin)
tline=fgetl(fidin);
disp(tline)
end
fclose(fidin);
It is not finished yet. I am stuck on the problem that it cannot read past the blank line.
This is another approach, using the function regexp. It gives a more flexible way of parsing files and does not require reading the full file in one go. The difference from the already given example is basically that I read the file line by line, but since the example in the question uses this approach I believe it is worth answering. This will return all occurrences of "#####NUMBER".
function test()
h = fopen('myfile.txt');
str = fgetl(h);
k = 1;
res = {};
while ischar(str) % fgetl returns '' for an empty line and numeric -1 at EOF
res{k} = regexp(str,'#####\d+','match');
k = k+1;
str = fgetl(h);
end
fclose(h);
for k=1:length(res)
disp(res{k});
end
end
EDIT
Using the expression '#####(\d+)' and the argument 'tokens' instead of 'match' will return only the digits after the "#####", as a string. The intent of this post was also, apart from showing another way to read the file, to show how to use regexp with a simple example. Both alternatives can be used with suitable conversion.
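To make the difference between the two options concrete, here is a small sketch on one sample line taken from the question. The variable names are illustrative only:

```matlab
txt = 'asdfa#####1217653asl123654fjaksj';  % sample line from the question

m = regexp(txt,'#####\d+','match','once');      % char vector: '#####1217653'
t = regexp(txt,'#####(\d+)','tokens','once');   % 1x1 cell: {'1217653'}

numFromMatch  = str2double(strrep(m,'#',''));   % strip the marker, then convert
numFromTokens = str2double(t{1});               % token is already digits only
```

Both routes end at the same number; 'tokens' just saves the step of removing the marker.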
Assuming the following:
All files are ASCII files.
The number you are looking to extract is directly following #####.
The number you are looking for is a natural number.
##### followed by a number only occurs once per file.
You can use this code snippet inside a for loop to extract each number:
regx='#####(\d+)';
str=fileread(fileName);
num=str2double(regexp(str,regx,'tokens','once'));
Example of for loop
This code will iterate through ALL files in yourpath/folder and save the numbers into num.
regx='#####(\d+)'; % Create regex
folderDir='yourpath/folder';
files=dir(folderDir); % List everything in folderDir (ls output differs between platforms)
files=files(~[files.isdir]); % Drop folders, including . and ..
num=zeros(1,length(files)); % Pre allocate
for i=1:length(files) % Iterate through files
str=fileread(fullfile(folderDir,files(i).name)); % Extract str from file
num(i)=str2double(regexp(str,regx,'tokens','once')); % extract number using regex
end
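If some files might not contain the marker at all, regexp returns an empty result for them, and that cannot be stored in num(i) as-is. A hedged sketch of one way to guard against this, using made-up strings in place of file contents and NaN as an assumed "not found" value:

```matlab
regx = '#####(\d+)';
samples = {'asdfa#####1217653asl', 'no marker in this text'};  % made-up inputs
num = nan(1,numel(samples));          % NaN marks "no number found"
for i = 1:numel(samples)
    tok = regexp(samples{i},regx,'tokens','once');
    if ~isempty(tok)                  % only convert when the marker was found
        num(i) = str2double(tok{1});
    end
end
```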
If you want to extract more ''advanced'' numbers e.g. Integers or Real numbers, or handle several occurrences of #####NUMBER in a file you will need to update your question with a better representation of your text files.

How to import large dataset and put it in a single column

I want to import a large data set (multiple columns) using the following code. I want to get everything in a single column instead of a single row (multiple columns). So I tried a transpose operation, but it still doesn't work properly.
clc
clear all
close all
dataX_Real = fopen('dataX_Real_in.txt');dataX_Real=dataX_Real';
I will really appreciate your support and suggestions. Thank You
The sample files can be found using the following link.
When using fopen, all you are doing is opening up the file. You aren't reading in the data. What is returned from fopen is actually a file identifier (an integer handle) that gives you access to the contents of the file. It doesn't actually read in the contents itself. You would need to use things like fread or fscanf to read in the content from the text data.
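A small sketch of that distinction, using a throwaway temporary file with made-up values rather than the file from the question:

```matlab
% Write a tiny stand-in file (name and values are illustrative assumptions)
tmpFile = [tempname '.txt'];
fid = fopen(tmpFile,'w');
fprintf(fid,'%g %g %g\n',[1.5 2 3]);
fclose(fid);

fid  = fopen(tmpFile,'r');   % fid is just an integer identifier, not the data
vals = fscanf(fid,'%f');     % this call does the reading; result is a column vector
fclose(fid);
delete(tmpFile);
```

Transposing fid, as in the question, only transposes that single integer; the numbers never leave the file until a read function is called.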
However, I would recommend you use dlmread instead, as this doesn't require a fopen call to open your file. This will open up the file, read the contents and store it into a variable in one function call:
dataX_Real = dlmread('dataX_Real_in.txt');
By doing the above and using your text file, I get 44825 elements. Here are the first 10 entries of your data:
>> format long;
>> dataX_Real(1:10)
ans =
Columns 1 through 4
-0.307224970000000 0.135961950000000 -1.072544100000000 0.114566020000000
Columns 5 through 8
0.499754310000000 -0.340369000000000 0.470609910000000 1.107567700000000
Columns 9 through 10
-0.295783020000000 -0.089266816000000
Seems to match up with what we see in your text file! However, you said you wanted it as a single column. This by default reads the values in on a row basis, so here you can certainly transpose:
dataX_Real = dataX_Real.';
Displaying the first 10 elements, we get:
>> dataX_Real = dataX_Real.';
>> dataX_Real(1:10)
ans =
-0.307224970000000
0.135961950000000
-1.072544100000000
0.114566020000000
0.499754310000000
-0.340369000000000
0.470609910000000
1.107567700000000
-0.295783020000000
-0.089266816000000
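The transpose works here because the data came in as a single row. If a file were read as several rows, a transpose would still give a matrix, so a reshape is needed instead. A minimal sketch with a made-up 2x3 matrix:

```matlab
A = [1 2 3; 4 5 6];            % made-up 2x3 stand-in for a multi-row file
colMajor = A(:);               % [1;4;2;5;3;6], walks down each column
rowMajor = reshape(A.',[],1);  % [1;2;3;4;5;6], same order as the rows in the text file
```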

Reading large amount of data stored in lines from csv

I need to read in a lot of data (~10^6 data points) from a *.csv-file.
the data is stored in lines
I can't know how many data points per line and how many lines are there before I read it in
the amount of data points per line can be different for each line
So the *.csv-file could look like this:
x Header
x1,x2
y Header
y1,y2,y3, ...
z Header
z1,z2
...
Right now I read in every line as string and split it at every comma. This is what my code looks like:
index = 1;
headerLine = textscan(csvFileHandle,'%s',1,'Delimiter','\n');
while ~isempty(headerLine{1})
dummy = textscan(csvFileHandle,'%s',1,'Delimiter','\n', ...
'BufSize',2^31 - 1);
rawData(index) = textscan(dummy{1}{1},'%f','Delimiter',',');
headerLine = textscan(csvFileHandle,'%s',1,'Delimiter','\n');
index = index + 1;
end
It's working, but it's pretty slow. Most of the time is used while splitting the string with textscan. (~95%).
I preallocated rawData with sample data, but it brought next to nothing for the speed.
Is there a better way than mine to read in something like this?
If not, is there a faster way to split this string?
First suggestion: to read a single line as a string when looping over a file, just use fgetl (returns a nice single string so no faffing with cell arrays).
Also, you might consider (if possible), reading everything in a single go rather than making repeating reads from file:
output = textscan(fid, '%*s%s','Delimiter','\n'); % skips headers with *
If the file is so big that you can't do everything at once, try to read in blocks (e.g. tackle 1000 lines at a time, parsing data as you go).
For converting the string, there are the options of str2num or strsplit+str2double, but the only thing I can think of that might be slightly quicker than textscan is sscanf. Since sscanf doesn't accept the delimiter as a separate input, put it in the format string (true, the last value isn't followed by a comma, but sscanf can handle that).
dataLines = output{1}; % textscan returns a 1x1 cell holding the cell array of lines
data = cell(size(dataLines));
for n = 1:numel(dataLines)
data{n} = sscanf(dataLines{n},'%f,');
end
Tests with a limited batch of test data suggest sscanf is a bit quicker (but this might depend on machine/version/data sizes).
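For reference, a short sketch of two of the conversion options on one made-up line, showing that the trailing value without a comma is still picked up:

```matlab
str = '1.5,2,3.25';                          % made-up line of comma-separated values
viaSscanf = sscanf(str,'%f,');               % column vector [1.5; 2; 3.25]
viaSplit  = str2double(strsplit(str,','));   % row vector [1.5 2 3.25]
```

Note the different orientations: sscanf returns a column, while the strsplit route returns a row.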

How to import column of numbers from mixed string numerical text file

Variations of this question have already been asked several times, for example here. However, I can't seem to get this to work for my data.
I have a text file with 3 columns. First and third columns are floating point numbers. Middle column is strings. I'm only interested in getting the first column really.
Here's what I tried:
filename=fopen('heartbeatn1nn.txt');
A = textscan(filename,'%f','HeaderLines',0);
fclose(filename);
When I do this A comes out as just a single number--the first element in the column. How do I get the whole column? I've also tried this with the '.tsv' file extension, same result.
Also tried:
filename=fopen('heartbeatn1nn.txt');
formatSpec='%f';sizeA=[1 Inf];
A = fscanf(filename,formatSpec,sizeA);
fclose(filename);
with same result.
Could the file size be a problem? Not sure how many rows but probably quite a few since file size is 1.7M.
Assuming the columns in your text file are separated by single whitespace characters, your format specification should look like this:
A = textscan(filename,'%f %s %f');
A now contains the complete file content. To obtain the first column:
first_col = A{:,1};
Alternatively, you can tell textscan to skip the unneeded fields with the * option:
first_col = cell2mat( textscan(filename, '%f %*s %*f') );
This returns only the first column.
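A self-contained sketch of the skip syntax, run on a made-up three-row string instead of the actual file (textscan also accepts a character vector as input):

```matlab
txt = sprintf('0.81 N 1.0\n0.79 N 2.0\n0.83 A 3.0\n');  % made-up rows
C = textscan(txt,'%f %*s %*f');   % starred fields are parsed but not returned
first_col = C{1};                 % numeric column vector, one entry per row
```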

Matlab : Read a file name in string format from a .csv file

I am having a .csv file which contains let's say 50 rows.
At the beginning of each row I have a file name in the following format 001_02_03.bmp followed by values separated by commas. Something like this :
001_02_03.bmp,20,30,45,10,40,20
Can someone tell me how can I read the first column from the data?
I know how to obtain the data from the second column onward. I am using the csvread function like this X = csvread('filename.csv', 0, 1);. If I try to read the first column in the same manner it outputs an error, saying the csvread does not support string format.
Use textscan, i.e.:
fid1 = fopen(csvFileName);
X = textscan(fid1, '%s%f%f%f%f%f%f', 'Delimiter', ',');
fclose(fid1);
FirstCol = X{1, 1};
A little more detail? csvread only works with purely numeric data, so you can't use it to get in data with a .bmp, or underscores for that matter. Thus we use textscan. The funny looking format string input to textscan is just saying that the columns are, in order, of type string %s, then the next 6 columns are of type double %f%f%f%f%f%f (or you might choose to alter this to reflect an integer datatype - I personally rarely bother unless the quantity of data is huge or floating point precision is a problem).
Note, if you just wanted to get the first column and ignore the rest, you can replace the format string with '%s%*[^\n]'. A final point: if your csv file has a header line, you can skip it using the HeaderLines optional input to textscan.
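A hedged sketch of the first-column-only variant, run on two made-up rows held in a string rather than a real csv file:

```matlab
txt = sprintf('001_02_03.bmp,20,30,45\n002_02_03.bmp,1,2,3\n');  % made-up rows
C = textscan(txt,'%s%*[^\n]','Delimiter',',');  % keep field 1, skip the rest of each line
names = C{1};   % cell array with one file name per row
```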