Textscan and Regexp Cell Data Layout - matlab

I'm trying to make some logic work with some legacy MATLAB code. I figured the easiest thing was to make the data look the same as what the code is expecting.
I'm reading the relevant data from a CSV file, it's pretty simple -- but the format for the IDs changed from a simple number to an ID of the form [YY,ZZZZ].
As an example, the 'previous' CSV data looked like:
1,Simple,Data
2,More,Data-Dash-Data
3,Even,More
4,Really,More
The 'new' CSV data looks like:
[01,0001],Simple,Data
[02,1001],More,Data-Dash-Data
[03,9876],Even,More
[04,1234],Really,More
Previously, to read in the data, this logic was used:
fid = fopen(fileName);
data = textscan(fid,'%s%s%s%*s','Delimiter',',');
When this was done against the 'previous' CSV data, it returned data that look like this:
data =
1×3 cell array
{4×1 cell} {4×1 cell} {4×1 cell}
The cells then look like:
K>> data{:}
ans =
4×1 cell array
'1'
'2'
'3'
'4'
ans =
4×1 cell array
'Simple'
'More'
'Even'
'Really'
ans =
4×1 cell array
'Data'
'Data-Dash-Data'
'More'
'More'
So to handle the ID of the form [YY,ZZZZ], I had to modify the 'textscan' logic to handle the new ID format that we're using. To do that, I'm using a regexp function:
fid = fopen(fileName);
rawData = textscan(fid,'%s','Delimiter','\n');
data = regexp(rawData{1},'[ \-\/\w]*([\[][^\)\]]*[\]])?', 'match')
This then, after it reads in the data, gives me data that is formatted like this:
K>> data
data =
4×1 cell array
{1×3 cell}
{1×3 cell}
{1×3 cell}
{1×3 cell}
K>> data{:}
ans =
1×3 cell array
'[01,0001]' 'Simple' 'Data'
ans =
1×3 cell array
'[02,1001]' 'More' 'Data-Dash-Data'
ans =
1×3 cell array
'[03,9876]' 'Even' 'More'
ans =
1×3 cell array
'[04,1234]' 'Really' 'More'
So you can see that it has the correct data in it, but the data is laid out differently, which breaks the legacy code. So my question is: how can I make the 'new' data be laid out the way it came out of the 'textscan' logic, like this:
data =
1×3 cell array
{4×1 cell} {4×1 cell} {4×1 cell}

You can use fileread to read in the raw file directly. Afterwards, you can use a regex that splits either on a comma that is not followed by a digit, or on a carriage return/line feed pair. Every third element of the result then belongs to the same column.
c = regexp(fileread(fileName),',(?!\d)|\r\n','split');
formattedData = {c(1:3:end)',c(2:3:end)',c(3:3:end)'};
>> formattedData
formattedData =
1×3 cell array
{4×1 cell} {4×1 cell} {4×1 cell}
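The split-and-regroup idea can be tried without a file. This is a minimal sketch, assuming the sample rows from the question; the variable names are illustrative, and the pattern is loosened to `\r?\n` (an assumption) so that LF-only files also split. The `strtrim` call guards against a trailing line break, which would otherwise add an empty element and misalign the columns.

```matlab
% In-memory stand-in for fileread(fileName), using the question's sample rows.
records = {'[01,0001],Simple,Data', '[02,1001],More,Data-Dash-Data', ...
           '[03,9876],Even,More',   '[04,1234],Really,More'};
txt = strjoin(records, char([13 10]));    % CRLF between rows

% Split on a comma not followed by a digit, or on a line break.
c = regexp(strtrim(txt), ',(?!\d)|\r?\n', 'split');   % 1x12 cell

% Regroup: 3 fields per record, one row per record.
c = reshape(c, 3, []).';                  % 4x3 cell
formattedData = {c(:,1), c(:,2), c(:,3)}; % 1x3 cell of 4x1 cells
```

Note that the lookahead only protects the comma inside the bracketed ID; a second or third field that starts with a digit would not be split off correctly.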

Related

Matlab : Access timeseries stored in cell array

I have multiple mat files, each containing an n*1 cell array.
Each of the cells in these arrays contains a 1*1 timeseries (size 6*n):
Content of cell 1:1 {1x1 timeseries}:
I am able to see the data using the workspace, but when I try to print the data of a cell I only get {1x1 timeseries} or an error like "Dot indexing is not supported for variables of this type." when I do cell.Data.
Do you know how I can access the data inside a cell? My goal is to extract the data of each cell to generate a big CSV file.
With a cell array, () indexing returns a smaller cell array, while {} indexing returns the contents of the cell. For example:
>> cellarray={[1,2,3],[4,5,6],[7,8]}
cellarray =
1×3 cell array
{1×3 double} {1×3 double} {1×2 double}
>> cellarray(2)
ans =
1×1 cell array
{1×3 double}
>> cellarray{2}
ans =
4 5 6
>> cellarray{1}(2)
ans =
2
I think in your case you want to use something like cellarray{1}.Data: the braces return the timeseries object itself, and dot indexing then works on it.
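Applied to timeseries objects, the brace indexing described above might look like the following. This is a minimal sketch under assumptions: the two small timeseries objects stand in for the contents of a loaded mat file, and all variable names are illustrative.

```matlab
% Two small timeseries objects stand in for the cells loaded from a mat file.
c = {timeseries([10; 20; 30]); timeseries([40; 50; 60])};

ts = c{1};            % braces return the timeseries object itself
values = ts.Data;     % dot indexing now works: 3x1 double

% Collect [time, value] rows from every cell into one numeric matrix.
rowsPerCell = cellfun(@(t) [t.Time, t.Data], c, 'UniformOutput', false);
allRows = vertcat(rowsPerCell{:});   % 6x2 double
```

A matrix like allRows can then be written out in one call, for instance with writematrix in recent releases or csvwrite in older ones.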

splitting a character array into a cell array and matrix

I have a text file.
In the file is approx 20,000 rows of data. Each row has one column & contains 256 characters (which are all numbers).
I need to split each row into a cell array or matrix. So each 8 characters are "one piece" of information. I want to split the first 3 characters into a cell array and the next 5 characters into a double, then same again for the next 8 characters.
example
1653256719812345
myCellArray (1 x 2) myDoubleArray (1 x 2)
[165, 198] [32567, 12345]
What is the best way to do this?
Use textscan.
fid = fopen('MyFileName.txt');
data = textscan(fid, '%3d%5d', 'Delimiter', '');
fclose(fid);
testing:
% Test with string of 256 random digits that all happen to be 1:8 repeated 32 times
x = '1234567812345678123456781234567812345678123456781234567812345678123456781234567812345678123456781234567812345678123456781234567812345678123456781234567812345678123456781234567812345678123456781234567812345678123456781234567812345678123456781234567812345678';
>> y = textscan(x, '%3d%5d', 'Delimiter', '')
y =
[32x1 int32] [32x1 int32]
>> y{1}
ans =
123
123
123
123
...
I don't know the exact format of your files, so you may have to do this line-by-line within a loop (in which case you would get each line using fgetl and then replace fid in the textscan statement with the output from fgetl).
In general, whenever you find yourself having to read in data that was produced by FORTRAN code (fixed field width text files), textscan's 'Delimiter', '' and 'Whitespace', '' parameters are your friend.
Use regexp. If the file data.txt contains
1653256719812345
1563256719812345
1233256719812345
1463256719812345
Then the following MATLAB statements will read the numbers.
>> txt = fileread('data.txt') % Read entire file in txt
>> out = regexp(txt,'(\d{3})(\d{5})(\d{3})(\d{5})','tokens') % Match regex capturing groups
out =
{1x4 cell} {1x4 cell} {1x4 cell} {1x4 cell}
Each cell in out is a row from the file containing the parsed numbers as strings. You can use str2double to convert the numbers to a numeric data type in MATLAB:
>> nums = cellfun(@str2double,out,'uni',0)
nums =
[1x4 double] [1x4 double] [1x4 double] [1x4 double]
Iterate over your rows one by one and run something like the following code.
k=int2str(1653256719812345);
>> myCellArray{1}=k(1:3)
myCellArray =
'165'
>> mydoublearray(1)=str2num(k(4:8))
mydoublearray =
32567
If there's some formulaic pattern you should incorporate that instead of manually hard coding it.
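Since every 8-character piece has the same 3 + 5 layout, the pattern can be expressed with reshape rather than hard-coded index ranges. This is a minimal sketch, assuming one row of the file is already in a char vector whose length is a multiple of 8; the variable names are illustrative.

```matlab
% One row of the file (the 16-character example from the question).
rowText = '1653256719812345';

% One 8-character piece per row of a char matrix.
blocks = reshape(rowText, 8, []).';            % 2x8 char

% First 3 characters of each piece as strings, next 5 as doubles.
ids  = cellstr(blocks(:, 1:3)).';              % 1x2 cell
nums = str2double(cellstr(blocks(:, 4:8))).';  % 1x2 double
```

For a full 256-character row the same code yields 32 pieces without any change.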

Read string from txt file and use string for loop

I'm trying to read a txt file and then loop through all the strings in it. Unfortunately, I'm not getting it to work.
fid = fopen(fullfile(source_dir, '1.txt'),'r')
read_current_item_cells = textscan(fid,'%s')
read_current_item = cell2mat(read_current_item_cells);
for i=1:length(read_current_item)
current_stock = read_current_item(i,1);
current_url = sprintf('http:/www.', current_item)
.....
I basically try to convert the cell arrays to a matrix as textscan outputs cell arrays. However now I get the message
Error using cell2mat (line 53) Cannot support cell arrays containing cell arrays or objects.
Any help is very much appreciated
That is the normal behaviour of textscan. It returns a cell array where each element of it is another cell OR array (depending on the specifier) containing the values corresponding to each format specifier in the format string you have passed to the function. For example, if 1.txt contains
appl 12
msft 23
running your code returns
>> read_current_item_cells
read_current_item_cells =
{4x1 cell}
>> read_current_item_cells{1}
ans =
'appl'
'12'
'msft'
'23'
which itself is another cell array:
>> iscell(read_current_item_cells{1})
ans =
1
and its elements can be accessed using
>> read_current_item_cells{1}{1}
ans =
appl
Now if you change the format from '%s' to '%s %d' you get
>> read_current_item_cells
read_current_item_cells =
{2x1 cell} [2x1 int32]
>> read_current_item_cells{1}
ans =
'appl'
'msft'
>> read_current_item_cells{2}
ans =
12
23
But the interesting part is that
>> iscell(read_current_item_cells{1})
ans =
1
>> iscell(read_current_item_cells{2})
ans =
0
That means the cell element corresponding to %s is turned into a cell array, while the one corresponding to %d is left as an array. Now since I do not know the exact format of the rows in your file, I guess you have one cell array with one element which in turn is another cell array containing all the elements in the table.
What can happen is that the data gets wrapped into a cell array of cell arrays, and to access the stored strings you need to index past the first array with
read_current_item_cells = read_current_item_cells{1};
Converting with cell2mat will not work if your strings are not equal in length, in which case you can use strvcat, which pads the shorter strings with trailing blanks:
read_current_item = strvcat(read_current_item_cells{:});
Then you should be able to loop through the char array:
for ii=1:size(read_current_item,1)
current_stock = read_current_item(ii,:);
current_url = sprintf('http:/www.', current_stock)
.....
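An alternative that avoids the char-matrix conversion (and its padding blanks) is to loop over the inner cell array directly with brace indexing. This is a minimal sketch: the in-memory text stands in for 1.txt, and the 'example.com/%s' pattern is a hypothetical placeholder, not the URL from the question.

```matlab
% In-memory stand-in for the file; textscan also accepts a char vector.
scanned = textscan(sprintf('appl\nmsft\ngoog'), '%s');
tickers = scanned{1};                 % 3x1 cell array of char vectors

urls = cell(size(tickers));
for ii = 1:numel(tickers)
    % tickers{ii} is a plain char vector, so it can be passed to sprintf.
    % 'example.com/%s' is a hypothetical pattern used only for illustration.
    urls{ii} = sprintf('example.com/%s', tickers{ii});
end
```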

creating a cell array within a cell array using cellfun

If I create a cell array using:
clear all
data = {rand(1,5),rand(1,4),rand(1,4),rand(1,6)};
a = cell(1,length(data));
How is it then possible to create a cell array in each cell of a that is the same length as the corresponding cell in data? I know this can easily be done using a loop, but how would it be possible using the cellfun function?
Do you want something like that?
data = {rand(1,5),rand(1,4),rand(1,4),rand(1,6)};
a2=cellfun(@(x) cell(size(x)),data,'UniformOutput',0)
a2 =
{1x5 cell} {1x4 cell} {1x4 cell} {1x6 cell}
You could also accomplish this by using CELLFUN to just get the sizes of each cell, create all the cells you need, then divide them up using MAT2CELL:
>> cellSizes = cellfun('size',data,2);
>> a = mat2cell(cell(1,sum(cellSizes)),1,cellSizes)
a =
{1x5 cell} {1x4 cell} {1x4 cell} {1x6 cell}
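A quick way to convince yourself that the two approaches produce the same layout is to compare the inner lengths. This is a minimal sketch; the variable names are illustrative, and @numel is used here in place of the legacy 'size' string syntax.

```matlab
data = {rand(1,5), rand(1,4), rand(1,4), rand(1,6)};

% Approach 1: build each empty inner cell array directly.
viaCellfun = cellfun(@(x) cell(size(x)), data, 'UniformOutput', false);

% Approach 2: build one long empty cell array and cut it into pieces.
cellSizes   = cellfun(@numel, data);
viaMat2cell = mat2cell(cell(1, sum(cellSizes)), 1, cellSizes);

% Both results have inner cells whose lengths match those in data.
sameLayout = isequal(cellfun(@numel, viaCellfun), ...
                     cellfun(@numel, viaMat2cell), cellSizes);
```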

how to sort a cell array which includes arrays?

I would like to sort a cell array according to one of the columns. The array is created by textscan function. There are some answers on web, but I still cannot get it working. I have the following array:
>> DATA
DATA =
{303427x1 cell} {303427x1 cell} {303427x1 cell} {303427x1 cell} [303427x1 uint32] [303427x1 double] [303427x1 uint32] [303427x1 int32] {303427x1 cell}
The important column is the sixth one, which holds times converted by the datenum function. I want to sort the whole of DATA by this column while keeping the rows of the other columns aligned with it.
The normal sort or sortrows doesn't work for me. Could you help me, please?
I take it you want to sort within each cell of DATA, correct? The datenum function produces serial time stamps since "year zero," so they can be sorted normally.
times = DATA{6};
[~,idx] = sort(times,'ascend');
for i=1:length(DATA)
DATA{i} = DATA{i}(idx);
end
You can avoid the for-loop in @reve_etrange's answer by using CELLFUN after you get the sorting index idx.
DATA = cellfun(@(x) x(idx), DATA, 'UniformOutput', false);
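The same index-then-reorder idea on a tiny example makes the effect easier to see. This is a minimal sketch, assuming a two-column stand-in for DATA (one cell column of labels, one numeric column of times); the values and names are illustrative.

```matlab
% Small stand-in for the textscan output: a label column and a time column.
DATA = {{'late'; 'early'; 'mid'}, [3; 1; 2]};

% Sort index taken from the time column only.
[~, idx] = sort(DATA{2}, 'ascend');

% Apply the same row order to every column so rows stay aligned.
DATA = cellfun(@(x) x(idx), DATA, 'UniformOutput', false);
```

Because x(idx) works for both cell and numeric columns, the mixed column types in the question's DATA do not need special handling.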