Grouping file names in clusters - matlab

I am using this line to list all the images in a folder:
imagefiles = dir('Images\*.jpg');
Suppose I have the names: a1.jpg,a11.jpg,b13.JPG,b5.JPG,c1.jpg.
How do I group together all images whose names differ only in the trailing number (at most 2 characters)? For the given example, group together all a and all b, with a third group for c.
By grouping I mean forming some kind of data structure or ordering that will let me access each group separately for later processing.
I am assuming the file type is always 'jpg' and the numbers will always be positive and smaller than 100. The code should not be case sensitive regarding the file type, that is, both jpg and JPG may appear. (I don't know regular expressions but will be happy to learn from a good link as well.)

You could capture the initial non-number part of the file name using regexp, group them with unique and put them in a struct.
% Some test data
files = {'a11','a1','b2','a32','ca3','b45','c1','ca2'};
files = strcat(files, '.jpg');
% Capture and group
tag = regexp(files,'^\D+','match','once');
[unTag, ~, unIdx] = unique(tag);
for idx = 1:length(unTag)
fileGroups.(unTag{idx}) = files(unIdx == idx);
end
% The result
>> fileGroups =
a: {'a11.jpg' 'a1.jpg' 'a32.jpg'}
b: {'b2.jpg' 'b45.jpg'}
c: {'c1.jpg'}
ca: {'ca3.jpg' 'ca2.jpg'}
Depending on what your file names look like, you might have to switch to a more detailed regular expression. You could use \D+(?=\d+\.(JPG|jpg)) to capture the non-digit characters that come right before a number and the .jpg extension.
So if your file names are something like:
>> files
'dummyStr_a11.jpg'
'dummyStr_a1.jpg'
'dummyStr_b2.jpg'
'dummyStr_a32.jpg'
'dummyStr_ca3.jpg'
'dummyStr_b45.jpg'
'dummyStr_c1.jpg'
'dummyStr_ca2.jpg'
Capture with something like
tag = regexp(files,'[a-z]+(?=\d+\.(JPG|jpg))','match','once');
>> tag =
'a' 'a' 'b' 'a' 'ca' 'b' 'c' 'ca'
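For the file names in the question, where the extension may be jpg or JPG, a minimal sketch along the same lines could pass the 'ignorecase' option to regexp and then walk the groups with fieldnames. The file list is typed in by hand here rather than read with dir, and the sketch assumes every prefix is a valid struct field name (letters only).

```matlab
% File names from the question (typed in by hand for illustration)
files = {'a1.jpg', 'a11.jpg', 'b13.JPG', 'b5.JPG', 'c1.jpg'};

% Leading non-digit part, required to be followed by digits and .jpg/.JPG
tag = regexp(files, '^\D+(?=\d+\.jpg$)', 'match', 'once', 'ignorecase');

% One struct field per unique prefix
[unTag, ~, unIdx] = unique(tag);
fileGroups = struct();
for idx = 1:numel(unTag)
    fileGroups.(unTag{idx}) = files(unIdx == idx);
end

% Access each group separately for later processing
groupNames = fieldnames(fileGroups);
for g = 1:numel(groupNames)
    members = fileGroups.(groupNames{g});
    fprintf('%s: %d file(s)\n', groupNames{g}, numel(members));
end
```

If a prefix could start with a digit or contain characters such as '-', a containers.Map keyed by the prefix would probably be a safer container than a struct.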

Related

dicom header personal information conversion to a .txt file

I have a series of DICOM images which I want to anonymize. I found a few Matlab codes and some programs which do the job, but none of them export a .txt file of the removed personal information. I was wondering if there is a function which can also save the removed personal information of a DICOM image in .txt format for future use. Also, I am trying to create a table which maps each new image ID to the subject's real name (subject's real name = personal-information-removed image ID).
Any thoughts?
Thanks for considering my request!
I'm guessing you only want to output to your text file the fields that are changed by anonymization (modified, removed, or added). You may want to set some dicomanon options to reduce the number of changes, in particular passing the arguments 'WritePrivate', true to ensure private extensions are kept.
First, you can perform the anonymization, saving structures of pre- and post-anonymization metadata using dicominfo:
preAnonData = dicominfo('input_file.dcm');
dicomanon('input_file.dcm', 'output_file.dcm', 'WritePrivate', true);
postAnonData = dicominfo('output_file.dcm');
Then you can use fieldnames and setdiff to find the fields that are removed or added by anonymization, and add them to the post-anonymization or pre-anonymization data, respectively, with a nan value as a placeholder:
preFields = fieldnames(preAnonData);
postFields = fieldnames(postAnonData);
removedFields = setdiff(preFields, postFields);
for iField = 1:numel(removedFields)
postAnonData.(removedFields{iField}) = nan;
end
addedFields = setdiff(postFields, preFields);
for iField = 1:numel(addedFields)
preAnonData.(addedFields{iField}) = nan;
end
It will also be helpful to use orderfields so that both data structures have the same ordering for their field names:
postAnonData = orderfields(postAnonData, preAnonData);
Finally, now that each structure has the same fields in the same order we can use struct2cell to convert their field data to a cell array and use cellfun and isequal to find any fields that have been modified by the anonymization:
allFields = fieldnames(preAnonData);
preAnonCell = struct2cell(preAnonData);
postAnonCell = struct2cell(postAnonData);
index = ~cellfun(@isequal, preAnonCell, postAnonCell);
modFields = allFields(index);
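To see these steps without any DICOM files at hand, here is a small sketch that runs the same padding, ordering, and comparison on two toy structures. The field values and the AddedNote field are made up for illustration and are not what dicomanon actually produces.

```matlab
% Toy stand-ins for the dicominfo output (hypothetical values, not real DICOM data)
preData  = struct('PatientID', '12345', 'Modality', 'MR', 'PatientAge', '045Y');
postData = struct('PatientID', '', 'Modality', 'MR', 'AddedNote', 'anonymized');

% Pad each structure with the fields it is missing, using nan as a placeholder
removedFields = setdiff(fieldnames(preData), fieldnames(postData));
for k = 1:numel(removedFields)
    postData.(removedFields{k}) = nan;
end
addedFields = setdiff(fieldnames(postData), fieldnames(preData));
for k = 1:numel(addedFields)
    preData.(addedFields{k}) = nan;
end

% Put the fields in the same order, then compare value by value
postData = orderfields(postData, preData);
allFields = fieldnames(preData);
changed = ~cellfun(@isequal, struct2cell(preData), struct2cell(postData));
modFields = allFields(changed);
disp(modFields);
```

One caveat: isequal treats NaN as different from NaN, so a field that holds NaN both before and after would be reported as modified. Using isequaln instead should avoid that.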
Now you can create a table of the changes like so:
T = table(modFields, preAnonCell(index), postAnonCell(index), ...
'VariableNames', {'Field', 'PreAnon', 'PostAnon'});
And you could use writetable to easily output the table data to a text file:
writetable(T, 'anonymized_data.txt');
Note, however, that if any of the fields in the table contain vectors or structures of data, the formatting of your output file may look a little funky (i.e. lots of columns, most of them empty, except for those few fields).
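One possible way around the formatting problem is to turn every value into a single line of text before building the table. The sketch below does that for a few hypothetical values standing in for preAnonCell(index); the conversion rules (mat2str for numbers, a list of field names for structures) are just one assumption about what is useful to keep.

```matlab
% Hypothetical values standing in for preAnonCell(index)
vals = {'Doe^John'; [0.5; 0.5]; struct('FamilyName', 'Doe'); nan};

txt = cell(size(vals));
for k = 1:numel(vals)
    v = vals{k};
    if ischar(v)
        txt{k} = v;                      % text is kept as it is
    elseif isnumeric(v) || islogical(v)
        txt{k} = mat2str(v);             % vectors/matrices become one string
    elseif isstruct(v)
        % keep only the field names of nested structures
        txt{k} = ['struct(' strjoin(fieldnames(v)', ',') ')'];
    else
        txt{k} = class(v);               % fall back to the class name
    end
end
disp(txt);
```

The resulting cell array of character vectors can be passed to table in place of the raw values, so that writetable produces one column per variable.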
One way to do this is to store the tags before and after anonymisation and use these to write your text file. In Matlab, dicominfo() will read the tags into a structure:
% Get tags before anonymization
tags_before = dicominfo(file_in);
% Anonymize
dicomanon(file_in, file_out); % Need to set tags values where required
% Get tags after anonymization
tags_after = dicominfo(file_out);
% Do something with the two structures
disp(['Patient ID:', tags_before.PatientID ' -> ' tags_after.PatientID]);
disp(['Date of Birth:', tags_before.PatientBirthDate ' -> ' tags_after.PatientBirthDate]);
disp(['Family Name:', tags_before.PatientName.FamilyName ' -> ' tags_after.PatientName.FamilyName]);
You can then write out the before/after fields into a text file. You'd need to modify dicomanon() to choose your own values for the removed fields, since by default they are set to empty.
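For the table linking each subject's real name to the new image ID, a minimal sketch might look like the following. The names, the ANON001-style ID scheme, and the output file name anon_key.txt are all assumptions made for illustration; in practice the names would be collected from the pre-anonymization tags of each file.

```matlab
% Hypothetical subject names; in practice these could be built from
% the PatientName tag read before anonymization
realNames = {'Doe^John'; 'Roe^Jane'};

% Assumed ID scheme: ANON001, ANON002, ...
anonIDs = cell(size(realNames));
for k = 1:numel(realNames)
    anonIDs{k} = sprintf('ANON%03d', k);
end

% Two-column key table, written as tab-delimited text
keyTable = table(realNames, anonIDs, 'VariableNames', {'RealName', 'AnonID'});
writetable(keyTable, 'anon_key.txt', 'Delimiter', '\t');
```

Since this file undoes the anonymization, it should be stored separately from the anonymized images.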

Matlab - How can I find the lowest common directory of an arbitrary group of files?

I have a cell array of full file names and I want to find the lowest common directory where it makes sense to store accumulated data and what not.
Here is an example hierarchy of test data:
C:\Test\Run1\data1
C:\Test\Run1\data2
C:\Test\Run1\data3
C:\Test\Run2\data1
C:\Test\Run2\data2
.
.
.
In Matlab, the paths are stored in a cell array as follows (each run shares a row):
filePaths = {...
'C:\Test\Run1\data1','C:\Test\Run1\data2','C:\Test\Run1\data3'; ...
'C:\Test\Run2\data1','C:\Test\Run2\data2','C:\Test\Run2\data3'};
I want to write a routine that outputs the common path C:\Test\Run1 so that I can store relevant plots in a new directory there.
C:\Test\Run1\Accumulation_Plots
C:\Test\Run2\Accumulation_Plots
.
.
.
Previously, I was only concerned with two files in an x-by-2 cell, so the regimen below worked; however, strcmp lost its appeal since I can't (AFAIK) index the whole cell at once.
d = 1;
while strcmp(filePaths{1}(1:d),filePaths{2}(1:d))
d = d + 1;
end
common_directory = filePaths{1}(1:d-1);
mkdir(common_directory,'Accumulation_Plots');
As suggested by @nekomatic, I'm posting my comment as an answer.
filePaths = {...
'C:\Test\Run1\data1','C:\Test\Run1\data2','C:\Test\Run1\data3'; ...
'C:\Test\Run2\data1','C:\Test\Run2\data2','C:\Test\Run2\data3'};
% Sort the file paths
temp = sort(filePaths(:));
% Take the first and the last one, and split by '\'
first = strsplit(temp{1}, '\');
last = strsplit(temp{end}, '\');
% Compare them up to the depth of the smallest. Find the 'first N matching values'
sizeMin = min(numel(first), numel(last));
N = find(~[cellfun(@strcmp, first(1:sizeMin), last(1:sizeMin)) 0], 1, 'first') - 1;
% Get the smallest common path
commonPath = strjoin(first(1:N), '\');
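Run on the whole cell array, the code above returns C:\Test. Since the question stores each run in its own row and wants one plot folder per run, a sketch of the same sort-and-compare idea applied row by row could look like this. The variable names commonDirs and rowPaths are made up here, and the sketch assumes every row holds at least two different files.

```matlab
filePaths = {...
    'C:\Test\Run1\data1','C:\Test\Run1\data2','C:\Test\Run1\data3'; ...
    'C:\Test\Run2\data1','C:\Test\Run2\data2','C:\Test\Run2\data3'};

commonDirs = cell(size(filePaths, 1), 1);
for r = 1:size(filePaths, 1)
    % Sort the paths of this row; the first and last differ the most
    rowPaths = sort(filePaths(r, :));
    first = strsplit(rowPaths{1}, '\');
    last = strsplit(rowPaths{end}, '\');

    % Count the leading path parts that match
    nMin = min(numel(first), numel(last));
    isSame = strcmp(first(1:nMin), last(1:nMin));
    N = find([~isSame true], 1, 'first') - 1;

    commonDirs{r} = strjoin(first(1:N), '\');
end
disp(commonDirs);
```

Each entry of commonDirs can then be passed to mkdir together with 'Accumulation_Plots', as in the question.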
You just need to compare the first d characters of any path in the array - e.g. path 1 - with the first d characters of the other paths. The longest common base path can't be longer than path 1 and it can't be shorter than the shortest common base path between path 1 and any other path.
There must be several ways you could do that, but a concise one is using strfind to match the strings and cellfun with isempty to check which ones didn't match:
% filePaths should contain at least two paths
filePaths = {...
'C:\Test\Run1\data1','C:\Test\Run1\data2','C:\Test\Run1\data3'; ...
'C:\Test\Run2\data1','C:\Test\Run2\data2','C:\Test\Run2\data3'};
path1 = filePaths{1};
filePaths = filePaths(2:end);
% find longest common left-anchored substring
d = 1;
while ~any(cellfun(@isempty, strfind(filePaths, path1(1:d))))
d = d + 1;
end
% find common base path from substring
[common_directory, ~, ~] = fileparts(path1(1:d));
Your code leaves d one past the length of the longest common left-anchored substring between the paths; that substring might be longer than the common base path (here it is C:\Test\Run), so fileparts is used to extract the actual base path from it.
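A variant of the same character-by-character idea, sketched with strncmp so that the comparison is always anchored at the start of each path and the loop cannot run past the end of the shortest path. This is an alternative under those assumptions, not the code from the answer above; cutting at the last backslash plays the role of fileparts.

```matlab
filePaths = {...
    'C:\Test\Run1\data1','C:\Test\Run1\data2','C:\Test\Run1\data3'; ...
    'C:\Test\Run2\data1','C:\Test\Run2\data2','C:\Test\Run2\data3'};
paths = filePaths(:);

% The common prefix cannot be longer than the shortest path
maxLen = min(cellfun(@numel, paths));

% Grow d while the first d+1 characters agree across all paths
d = 0;
while d < maxLen && all(strncmp(paths{1}, paths, d + 1))
    d = d + 1;
end
prefix = paths{1}(1:d);

% Cut the prefix back to the last path separator
lastSep = find(prefix == '\', 1, 'last');
commonDir = prefix(1:lastSep - 1);
disp(commonDir);
```

If the paths share no separator at all, lastSep is empty and commonDir comes out as an empty character vector.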

Matlab | How to load/use files with consecutive names (abc1, abc2, abc3) and then pass on to the next series (cba1, cba2, cba3)?

I have a folder containing a series of data with file names like this:
abc1
abc2
abc3
bca1
bca2
bca3
bca4
bca5
cba1
... etc
My goal is to load all the relevant files for each file name, so all the "abc" files, and plot them in one graph. Then move on to the next file name, and do the same, and so forth. Is there a way to do this?
This is what I currently have to load and run through all the files, grab the data in them and get their name (without the .mat extension) to be able to save the graph with the same filename.
dirName = 'C:\DataDirectory';
files = dir( fullfile(dirName,'*.mat') );
files = {files.name}';
data = cell(numel(files),1);
for i=1:numel(files)
fname = fullfile(dirName,files{i});
disp(fname);
files{i} = files{i}(1:length(files{i})-4);
disp(files{i});
[Rest of script]
end
You already found out about the cool features of dir, and have a cell array files, which contains all file names, e.g.
files =
'37abc1.mat'
'37abc2.mat'
'50bca1.mat'
'50bca2.mat'
'1cba1.mat'
'1cba2.mat'
The main task now is to find all prefixes (37abc, 50bca, 1cba, ...) that are present in files. This can be done using a regular expression (regexp). The pattern can look like this:
'([\d]*[\D]*)[\d]*\.mat'
i.e. take any number of digits ([\d]*), then any number of non-digit characters ([\D]*), and keep those as a token (by putting them in parentheses). Next come any number of digits ([\d]*), followed by the literal text .mat (the dot is escaped so that it does not match an arbitrary character).
We call the regexp function with that pattern:
pre = regexp(files,'([\d]*[\D]*)[\d]*\.mat','tokens');
resulting in a cell array (one cell for each entry in files), where each cell contains another cell array with the prefix of that file. To convert this to a simple not-nested cell array, we call
pre = [pre{:}];
pre = [pre{:}];
resulting in
pre =
'37abc' '37abc' '50bca' '50bca' '1cba' '1cba'
To remove duplicate entries, we use the unique function:
pre = unique(pre);
pre =
'1cba' '37abc' '50bca'
which leaves us with all the prefixes that are present (unique returns them in sorted order). Now you can loop through each of these prefixes and apply your stuff. Everything put together is:
% Find all files
dirName = 'C:\DataDirectory';
files = dir( fullfile(dirName,'*.mat') );
files = {files.name}';
% Find unique prefixes
pre = regexp(files,'([\d]*[\D]*)[\d]*\.mat','tokens');
pre = [pre{:}]; pre = [pre{:}];
pre = unique(pre);
% Loop through prefixes
for ii=1:numel(pre)
% Get files with this prefix
curFiles = dir(fullfile(dirName,[pre{ii},'*.mat']));
curFiles = {curFiles.name}';
% Loop through all files with this prefix
for jj=1:numel(curFiles)
% Here the magic happens
end
end
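As a possible variation, the third output of unique can be used to pick out the files of each prefix directly, without a second call to dir. The sketch below works on a hand-typed file list instead of a real folder, and the names tok, grp, and groups are made up for illustration.

```matlab
% Hand-typed file list standing in for the output of dir
files = {'37abc1.mat'; '37abc2.mat'; '50bca1.mat'; ...
         '50bca2.mat'; '1cba1.mat'; '1cba2.mat'};

% With 'once', each cell of tok holds a 1x1 cell containing the prefix
tok = regexp(files, '^(\d*\D*)\d*\.mat$', 'tokens', 'once');
pre = cellfun(@(t) t{1}, tok, 'UniformOutput', false);

% grp(k) is the index into pre (after unique) of the prefix of files{k}
[pre, ~, grp] = unique(pre);

groups = cell(numel(pre), 1);
for ii = 1:numel(pre)
    groups{ii} = files(grp == ii);   % all files sharing prefix pre{ii}
end
```

This also sidesteps a corner case of the wildcard approach, where a pattern such as abc*.mat would pick up files of a longer prefix like abcd as well.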
Sorry, I misunderstood your question, I found this solution:
file = dir('*.mat')
matching = regexp({file.name}, '^[a-zA-Z_]+[0-9]+\.mat$', 'match', 'once'); %// Match once on file name, must be a series of A-Z a-z chars followed by numbers.
matching = matching(~cellfun('isempty', matching));
str = unique(regexp(matching, '^[a-zA-Z_]*', 'match', 'once'));
str = str(~cellfun('isempty', str));
group = cell(size(str));
for is = 1:length(str)
ismatch = strncmp(str{is}, matching, length(str{is}));
group{is} = matching(ismatch);
end
Answer came from this source: Matlab Central

Creating variables with numbers from filename

I have a folder full of xls files, named data_00001 through data_10000. Each file has a dozen or so identically named tabs full of RV's. I am interested in reading all files and tabs and creating histograms of the RV's.
Is there a way to read in the last 5 digits of the file names and attach them to each tab name (which I saved as a variable)?
I used regexp to extract the number as a string and converted it to a double, and I used a for loop to save variable X{1,k}. How can I incorporate the saved double into this variable?
Are you looking for something like this?
filenames = {'data_00001','data_10000'}; % cell array, one name per cell
nums = regexp(filenames, '[0-9]+', 'match', 'once'); % digits of each name, as text
tag = 'TAG';
for i=1:numel(nums)
eval(['A_' tag '_' nums{i} ' = zeros(1)']);
end
It creates matrices (zero in this case) with variable names
A_TAG_00001 = 0
A_TAG_10000 = 0
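If separate workspace variables are not strictly required, a struct with dynamic field names gives the same naming without eval. This is a sketch of that alternative, with results as a made-up container name.

```matlab
filenames = {'data_00001', 'data_10000'};
nums = regexp(filenames, '\d+$', 'match', 'once');   % trailing digits as text
tag = 'TAG';

results = struct();
for i = 1:numel(nums)
    fieldName = ['A_' tag '_' nums{i}];   % e.g. A_TAG_00001
    results.(fieldName) = zeros(1);       % placeholder for the real data
end
disp(fieldnames(results));
```

The data for one file is then reached with results.A_TAG_00001, or with results.(fieldName) inside a loop.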

Matlab: finding/writing data from multiple files by column header text

I have an array which I read into Matlab with importdata. It has 5 header lines
file = 'aoao.csv';
s = importdata(file,',', 5);
Matlab automatically treats the last line as the column header. I can then call up the column number that I want with
s.data(:,n); %n is desired column number
I want to be able to load many similar files at once, and then pull out the columns in the different files which have the same column header name (which are not necessarily at the same column number). I want to be able to write and export all of these columns together into a new matrix, preferably with each column labelled with its file name.
What should I do?
samp = 'len-c.mp3'; %# define desired sample/column header name
file = dir('*.csv');
Make sure the folder containing the files is MATLAB's current folder. dir then returns a struct array describing each matching file.
for i=1:length(file)
dataSets(i) = importdata(file(i).name,',', 5);
end
This imports the data from each of the files (comma delimited, 5 header lines) and stores it in a struct array called 'dataSets'.
for k = 1:numel(dataSets)
for i=1:length(dataSets(k).colheaders)
TF = strcmp(dataSets(k).colheaders{i},samp); %compares strings for match
if TF %if match is true
group(:,k) = dataSets(k).data(:,i); %save matching column to 'group'
end
end
end
This retrieves the data from the named column header within each file.
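To label each column with its file name and export the result, one option is to wrap the matrix in a table. The sketch below uses two small hand-made structs in place of the importdata output (same data and colheaders fields), with made-up file names and an assumed output file name. It also assumes every file has the same number of rows.

```matlab
% Hand-made stand-ins for the importdata output of two files
dataSets(1).colheaders = {'time', 'len-c.mp3'};
dataSets(1).data = [1 10; 2 20];
dataSets(2).colheaders = {'len-c.mp3', 'time'};
dataSets(2).data = [30 1; 40 2];
names = {'run1.csv', 'run2.csv'};   % hypothetical file names
samp = 'len-c.mp3';                 % desired column header

% Collect the matching column of each file; files without it stay NaN
group = nan(size(dataSets(1).data, 1), numel(dataSets));
for k = 1:numel(dataSets)
    col = find(strcmp(dataSets(k).colheaders, samp), 1);
    if ~isempty(col)
        group(:, k) = dataSets(k).data(:, col);
    end
end

% Label each column with a cleaned-up version of its file name and export
T = array2table(group, 'VariableNames', matlab.lang.makeValidName(names));
writetable(T, 'len_c_columns.csv');
```

matlab.lang.makeValidName is used because file names usually contain characters, such as the dot, that are not allowed in table variable names on older releases.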