Finding Duplicate string values in two cell array 22124x1 - matlab

I have a 22124x1 cell array that contains duplicate values. I want to know how many times each value is duplicated, and at which indices.
The first cell array contains these values: Datacell =
'221853_s_at'
'221971_x_at'
'221971_x_at'
'221971_x_at'
'221971_x_at'
'222031_at'
'222031_at'
'31637_s_at'
'37796_at'
'38340_at'
The symbol cell array contains:
'OR1D4 '
' OR1D5'
' UTP14C'
'GTF2H2 '
'ZNF324B '
' LOC644504'
'JMJD7 '
'ZNF324B '
'JMJD7-PLA2G4B'
' OR2A5 '
'OR1D4 '
For example, I want the output from cell 1 to look like this:
ID duplicated index
'221853_s_at' 1 1
'221971_x_at' 4 {2:5,1}
I tried to use unique, but it does not work. Any help will be highly appreciated.

Generating the indices in a visually pleasing manner isn't necessarily a trivial exercise. It's made simpler if you assume d is sorted.
An alternative utilizing accumarray:
d = {'221853_s_at'; '221971_x_at'; '221971_x_at'; '221971_x_at'; '221971_x_at'; ...
'222031_at'; '222031_at'; '31637_s_at'; '37796_at'; '38340_at' ...
};
d = sort(d); % Sort to make indices easier
% Find unique strings and their locations
[uniquestrings, ~, stringbin] = unique(d);
counts = accumarray(stringbin, 1);
repeatidx = find(counts - 1 > 0);
repeatedstrings = uniquestrings(repeatidx);
repeatcounts = counts(repeatidx) - 1;
% Find where each run of identical strings starts and ends
startidx = find([true; diff(stringbin) > 0]);
endidx = [startidx(2:end) - 1; numel(stringbin)]; % avoids indexing past the end when the last string repeats
repeatstart = startidx(repeatidx);
repeatend = endidx(repeatidx);
% Generate table, requires R2013b or newer
t = table(repeatedstrings, repeatcounts, repeatstart, repeatend, ...
'VariableNames', {'ID', 'Duplicated', 'StringStart', 'StringEnd'} ...
);
Which yields:
t =
ID Duplicated StringStart StringEnd
_____________ __________ ___________ _________
'221971_x_at' 3 2 5
'222031_at' 1 6 7
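If you also want the actual index lists (the {2:5,1}-style column from the question), a minimal sketch building on the variables above uses accumarray's cell-output form; the variable names idxlists and duplicateidx are my own:
% Collect, for each unique string, the list of its indices in d
idxlists = accumarray(stringbin, (1:numel(d)).', [], @(x){sort(x)});
duplicateidx = idxlists(repeatidx); % index lists for the duplicated strings only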

d = { '221853_s_at'
'221971_x_at'
'221971_x_at'
'221971_x_at'
'221971_x_at'
'222031_at'
'222031_at'
'31637_s_at'
'37796_at'
'38340_at'};
[ids,ia,ic]=unique(d);
ids holds the unique strings.
ia holds, for each unique string, the index of one instance of it within d.
ic holds, for each element of d, the index of the corresponding entry in ids.
ncnt = hist(ic,1:numel(ids)) - 1; % minus 1 since you only want duplicates
ncnt =
0 3 1 0 0 0
Gets you the number of duplicates for
ids =
'221853_s_at'
'221971_x_at'
'222031_at'
'31637_s_at'
'37796_at'
'38340_at'
ic is the lookup table for the indices; use find or logical indexing.
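For example, to recover the indices of the second unique string ('221971_x_at') from ic:
idx = find(ic == 2); % returns [2; 3; 4; 5], the positions of '221971_x_at' in d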

Related

Table sort by month

I have a table in MATLAB with attributes in the first three columns and data from the fourth column onwards. I was trying to sort the entire table based on the first three columns. However, one of the columns (Column C) contains months ('January', 'February' ...etc). The sortrows function would only let me choose 'ascend' or 'descend' but not a custom option to sort by month. Any help would be greatly appreciated. Below is the code I used.
sortrows(Table, {'Column A','Column B','Column C'} , {'ascend' , 'ascend' , '???' } )
As @AnonSubmitter85 suggested, the best thing you can do is to convert your month names to numeric values from 1 (January) to 12 (December) as follows:
c = {
7 1 'February';
1 0 'April';
2 1 'December';
2 1 'January';
5 1 'January';
};
t = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
t.ColumnC = month(datenum(t.ColumnC,'mmmm'));
This will facilitate the access to a standard sorting criterion for your ColumnC too (in this example, ascending):
t = sortrows(t,{'ColumnA' 'ColumnB' 'ColumnC'},{'ascend', 'ascend', 'ascend'});
If, for any reason that is unknown to us, you are forced to keep your months as literals, you can use a workaround: sort a clone of the table using the approach described above, and then apply the resulting indices to the original:
c = {
7 1 'February';
1 0 'April';
2 1 'December';
2 1 'January';
5 1 'January';
};
t_original = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
t_clone = t_original;
t_clone.ColumnC = month(datenum(t_clone.ColumnC,'mmmm'));
[~,idx] = sortrows(t_clone,{'ColumnA' 'ColumnB' 'ColumnC'},{'ascend', 'ascend', 'ascend'});
t_original = t_original(idx,:);
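As an alternative sketch (my own suggestion, assuming R2013b or newer and the table t from the first snippet), an ordinal categorical lets sortrows honor a custom month order directly, with no numeric conversion:
monthOrder = {'January','February','March','April','May','June', ...
    'July','August','September','October','November','December'};
t.ColumnC = categorical(t.ColumnC, monthOrder, 'Ordinal', true);
t = sortrows(t, {'ColumnA' 'ColumnB' 'ColumnC'}); % ColumnC now sorts January..December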

remove duplicates in a table (rexx language)

I have a question about removing duplicates in a table in the REXX language; I am working on NetPhantom applications that use REXX.
I need a sample of how to remove the duplicates in a table.
I do have some thoughts on how to do it, like using two loops over the two tables A and B, but I am not familiar with this.
My situation is:
rc = PanlistInsertData('A',0,SAMPLE)
TABLE A (this table having duplicate data)
123
1
1234
12
123
1234
I need to filter out those duplicates data into TABLE B like this:
123
1234
1
12
You can use lookup stem variables to test whether you have already found a value.
This should work (note I have not tested it, so there could be syntax errors):
no = 0
yes = 1
lookup. = no /* initialize the stem to no, not strictly needed */
j = 0
do i = 1 to in.0
  v = in.i
  if lookup.v <> yes then do
    j = j + 1
    out.j = v
    lookup.v = yes
  end
end
out.0 = j
You can eliminate the duplicates as follows:
If the current InStem element is the first element, move it to OutStem; otherwise check all the OutStem elements for the current InStem element.
If the element is found, iterate to the next InStem element; otherwise add the InStem element to OutStem.
Code snippet:
/* Input Stem     - InStem.
   Output Stem    - OutStem.
   Array Counters - I, J, K */
J = 1
DO I = 1 TO InStem.0
  IF I = 1 THEN
    OutStem.I = InStem.I
  ELSE
    DO K = 1 TO J
      IF (InStem.I \= OutStem.K) & (K = J) THEN
        DO
          J = J + 1
          OutStem.J = InStem.I
        END
      ELSE
        DO
          IF (InStem.I == OutStem.K) THEN
            ITERATE I
        END
    END
END
OutStem.0 = J
Hope this helps.

flatten matlab table by key

I have a large table whose entries are
KEY_A, KEY_B, VAL
where KEY_A and KEY_B come from finite sets of keys. For argument's sake, we'll have 4 different KEY_A values and 4 different KEY_B values (the example below also includes a fifth KEY_B value to exercise the missing-value case). An example table:
KEY_A KEY_B KEY_C
_____ _____ _________
1 1 0.45054
1 2 0.083821
1 3 0.22898
1 4 0.91334
2 1 0.15238
2 2 0.82582
2 3 0.53834
2 4 0.99613
3 1 0.078176
3 2 0.44268
3 3 0.10665
3 4 0.9619
4 1 0.0046342
4 2 0.77491
4 3 0.8173
4 4 0.86869
4 5 1
I want to elegantly flatten the table into
KEY_A KEY_B_1 KEY_B_2 KEY_B_3 KEY_B_4 KEY_B_5
_____ _________ ________ _______ _______ _______
1 0.45054 0.083821 0.22898 0.91334 -1
2 0.15238 0.82582 0.53834 0.99613 -1
3 0.078176 0.44268 0.10665 0.9619 -1
4 0.0046342 0.77491 0.8173 0.86869 1
I'd like to be able to handle missing B values (set them to a default like -1), but I think if I get an elegant way to do this to start then such things will fall into place.
The actual table has millions of records, so I do want to use a vectorized call.
The line I've got (which doesn't handle the extra KEY_B value of 5) is:
cell2mat(arrayfun(@(x)[x,testtable{testtable.KEY_A==x,3}'],unique(testtable{:,1}),'UniformOutput',false))
But:
it doesn't output a table
if there are missing keys in the table, it doesn't handle that
I would think that this isn't that uncommon an activity... has anyone done something like this before?
If the input table is T, then you could try this for the given case -
KEY_B_ =-1.*ones(max(T.KEY_A),max(T.KEY_B))
KEY_B_(sub2ind(size(KEY_B_),T.KEY_A,T.KEY_B)) = T.KEY_C
T1 = array2table(KEY_B_)
Output for the edited input -
T1 =
KEY_B_1 KEY_B_2 KEY_B_3 KEY_B_4 KEY_B_5
_________ ________ _______ _______ _______
0.45054 0.083821 0.22898 0.91334 -1
0.15238 0.82582 0.53834 0.99613 -1
0.078176 0.44268 0.10665 0.9619 -1
0.0046342 0.77491 0.8173 0.86869 1
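A small follow-up sketch (my own addition, reusing T and T1 from above) prepends a KEY_A column so the result matches the desired output exactly:
T1 = [table((1:max(T.KEY_A)).', 'VariableNames', {'KEY_A'}), T1];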
Edit by MadScienceDreams: This answer led me to write the following function, which will smash together pretty much any table based on the input keys. Enjoy!
function [ OT ] = flatten_table( T,primary_keys,secondary_keys,value_key,default_value )
%FLATTEN_TABLE Flatten table T by primary/secondary key columns.
%   Each unique primary key becomes a row and each unique secondary key a
%   column; cells with no matching entry are filled with default_value.
if nargin < 5
default_value = {NaN};
end
if ~iscell(default_value)
default_value={default_value};
end
if ~iscell(primary_keys)
primary_keys={primary_keys};
end
if ~iscell(secondary_keys)
secondary_keys={secondary_keys};
end
if ~iscell(value_key)
value_key={value_key};
end
primary_key_values = unique(T(:,primary_keys));
num_primary = size(primary_key_values,1);
[primary_key_map,primary_key_map] = ismember(T(:,primary_keys),primary_key_values); % keep only ismember's second output (the index)
secondary_key_values = unique(T(:,secondary_keys));
num_secondary = size(secondary_key_values,1);
[secondary_key_map,secondary_key_map] = ismember(T(:,secondary_keys),secondary_key_values); % keep only ismember's second output (the index)
%out =-1.*ones(max(T.KEY_A),max(T.KEY_B))
try
values = num2cell(T{:,value_key},2);
catch
values = num2cell(table2cell(T(:,value_key)),2);
end
if (~iscell(values))
values=num2cell(values);
end
OT=repmat(default_value,num_primary,num_secondary);
OT(sub2ind(size(OT),primary_key_map,secondary_key_map)) = values;
label_array = num2cell(cellfun(@(x,y)[x '_' mat2str(y)],...
repmat(secondary_keys,size(secondary_key_values,1),1),...
table2cell(secondary_key_values),'UniformOutput',false),1);
label_array = strcat(label_array{:});
OT = [primary_key_values,cell2table(OT,'VariableNames',label_array)];
end
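A usage sketch against the question's table (assuming T is the table from the question and -1 is the desired default):
OT = flatten_table(T, 'KEY_A', 'KEY_B', 'KEY_C', -1);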

Compute the Frequency of bigrams in Matlab

I am trying to compute and plot the distribution of bigram frequencies.
First I generated all possible bigrams, which gives 1296 bigrams.
Then I extract the bigrams from a given file and save them in words1.
My question is: how do I compute the frequency of these 1296 bigrams for the file a.txt?
If some bigrams do not appear in the file at all, their frequencies should be zero.
a.txt is any text file.
clear
clc
%************create bigrams 1296 ***************************************
chars ='1234567890abcdefghijklmnopqrstuvwxyz';
chars1 ='1234567890abcdefghijklmnopqrstuvwxyz';
bigram='';
for i=1:36
for j=1:36
bigram = sprintf('%s%s%s',bigram,chars(i),chars1(j));
end
end
temp1 = regexp(bigram, sprintf('\\w{1,%d}', 1), 'match');
temp2 = cellfun(@(x,y) [x '' y],temp1(1:end-1)', temp1(2:end)','un',0);
bigrams = temp2;
bigrams = unique(bigrams);
bigrams = rot90(bigrams);
bigram = char(bigrams(1:end));
all_bigrams_len = length(bigrams);
clear temp temp1 temp2 i j chars1 chars;
%****** 1. Cleaning Data ******************************
collection = fileread('e:\a.txt');
collection = regexprep(collection,'<.*?>','');
collection = lower(collection);
collection = regexprep(collection,'\W','');
collection = strtrim(regexprep(collection,'\s*',''));
%*******************************************************
temp = regexp(collection, sprintf('\\w{1,%d}', 1), 'match');
temp2 = cellfun(@(x,y) [x '' y],temp(1:end-1)', temp(2:end)','un',0);
words1 = rot90(temp2);
%*******************************************************
words1_len = length(words1);
vocab1 = unique(words1);
vocab_len1 = length(vocab1);
[vocab1,void1,index1] = unique(words1);
frequencies1 = hist(index1,vocab_len1);
I. Character counting problem for a string
A bsxfun-based solution for counting characters:
counts = sum(bsxfun(@eq,[string1-0]',65:90))
Output -
counts =
2 0 0 0 0 2 0 1 0 0 ....
If you would like a tabulated output of counts against each letter:
out = [cellstr(['A':'Z']') num2cell(counts)']
Output -
out =
'A' [2]
'B' [0]
'C' [0]
'D' [0]
'E' [0]
'F' [2]
'G' [0]
'H' [1]
'I' [0]
....
Please note that this was a case-sensitive counting for upper-case letters.
For lower-case letter counting, use this edit of the earlier code:
counts = sum(bsxfun(@eq,[string1-0]',97:122))
For case-insensitive counting, use this:
counts = sum(bsxfun(@eq,[upper(string1)-0]',65:90))
II. Bigram counting case
Let us suppose that you have all the possible bigrams saved in a 1D cell array bigrams1 and the incoming bigrams from the file are saved into another cell array words1. Let us also assume certain values in them for demonstration -
bigrams1 = {
'ar';
'de';
'c3';
'd1';
'ry';
't1';
'p1'}
words1 = {
'de';
'c3';
'd1';
'r9';
'yy';
'de';
'ry';
'de';
'dd';
'd1'}
Now, you can get the counts of the bigrams from words1 that are present in bigrams1 with this code -
[~,~,ind] = unique(vertcat(bigrams1,words1));
bigrams_lb = ind(1:numel(bigrams1)); %// label bigrams1
words1_lb = ind(numel(bigrams1)+1:end); %// label words1
counts = sum(bsxfun(@eq,bigrams_lb,words1_lb'),2)
out = [bigrams1 num2cell(counts)]
The output on code run is -
out =
'ar' [0]
'de' [3]
'c3' [1]
'd1' [2]
'ry' [1]
't1' [0]
'p1' [0]
The result shows that the first element 'ar' from the list of all possible bigrams has no occurrences in words1; the second element 'de' has three occurrences in words1, and so on.
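A memory-friendlier sketch of the same counting (my own variant, reusing ind, bigrams_lb and words1_lb from above) bins the labels with histc instead of building the full bsxfun matrix:
allcounts = histc(words1_lb, 1:max(ind)); % occurrences of every combined label
counts = allcounts(bigrams_lb);           % keep only the labels belonging to bigrams1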
Similar to Dennis' solution, you can just use histc():
string1 = 'ASHRAFF'
histc(string1,'ABCDEFGHIJKLMNOPQRSTUVWXYZ')
This checks the number of entries in the bins defined by the string 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', which is hopefully the alphabet (just wrote it fast, so no guarantee). The result is:
Columns 1 through 21
2 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0
Columns 22 through 26
0 0 0 0 0
Just a little modification of my solution:
string1 = 'ASHRAFF'
alphabet1='A':'Z'; %%// as stated by Oleg Komarov
data=histc(string1,alphabet1);
results=cell(2,26);
for k=1:26
results{1,k}= alphabet1(k);
results{2,k}= data(k);
end
If you look at results now, you can easily check whether it works or not :D
This answer creates all bigrams, loads the file, does a little cleanup, and then uses a combination of unique and histc to count the rows.
Generate all Bigrams
Note the order here is important, as unique will sort the array; created this way it is presorted, so the output matches expectation:
[y,x] = ndgrid(['0':'9','a':'z']);
allBigrams = [x(:),y(:)];
Read The File
This removes capitalisation and just pulls out any 0-9 or a-z character, then creates a column vector of these:
fileText = lower(fileread('d:\loremipsum.txt'));
cleanText = regexp(fileText,'([a-z0-9])','tokens');
cleanText = cell2mat(vertcat(cleanText{:}));
Create bigrams from the file by shifting by one and concatenating:
fileBigrams = [cleanText(1:end-1),cleanText(2:end)];
Get Counts
The set of all bigrams is appended to the bigrams from the file (so that a value is created for every possible bigram). Then a value in {1,2,...,1296} is assigned to each unique row using unique's third output. Counts are then created with histc, with the bins equal to the set of values from unique's output; 1 is subtracted from each bin to remove the complete set of bigrams we added:
[~,~,c] = unique([fileBigrams;allBigrams],'rows');
counts = histc(c,1:1296)-1;
Display
To view counts against the text:
[allBigrams, counts+'0']
or for something potentially more useful...
[sortedCounts,sortInd] = sort(counts,'descend');
[allBigrams(sortInd,:), sortedCounts+'0']
ans =
or9
at8
re8
in7
ol7
te7
do6 ...
Did not look into the entire code fragment, but from the example at the top of your question, I think you are looking to make a histogram:
string1 = 'ASHRAFF'
nr = histc(string1,'A':'Z')
Will give you:
2 0 0 0 0 2 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
(Got a working solution with hist, but as @The Minion shows, histc is easier to use here.)
Note that this solution only deals with upper case letters.
You may want to do something like the following if you want to put lower-case letters in their correct bin:
string1 = 'ASHRAFF'
nr = histc(upper(string1),'A':'Z')
Or if you want them to be shown separately:
string1 = 'ASHRaFf'
nr = histc(upper(string1),['a':'z' 'A':'Z'])
% Map the observed frequencies onto the full list of 1296 bigrams
bi_freq1 = zeros(1,all_bigrams_len);
for k = 1:vocab_len1
    for i = 1:all_bigrams_len
        if strcmp(vocab1{k}, bigrams{i})
            bi_freq1(i) = frequencies1(k);
        end
    end
end
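A vectorized sketch of the same mapping (assuming vocab1, bigrams and frequencies1 as defined in the question) avoids the double loop:
bi_freq1 = zeros(1, all_bigrams_len);
[tf, loc] = ismember(vocab1, bigrams); % position of each observed bigram in the full list
bi_freq1(loc(tf)) = frequencies1(tf);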

Matlab - preprocess CSV file

I have a CSV file in a format similar to the following one:
title1
index columnA1 columnA2 columnA3
1 2 3 6
2 23 23 1
3 2 3 45
4 2 2 101
title2
index columnB1 columnB2 columnB3
1 23 53 6
2 22 13 1
3 5 4 43
4 8 6 102
I want to build a function readCustomCSV which receives a CSV file in the format illustrated above and a row index i, and returns an output file with (for, let's say, i = 3) the following content:
title1
index columnA1 columnA2 columnA3
3 2 3 45
title2
index columnB1 columnB2 columnB3
3 5 4 43
Do you know how to use the csvread function to obtain this type of functionality?
It confuses me that there are two types of sections. I was thinking of treating the whole thing as a string, splitting it into two .csv files, and then reading the corresponding line from each.
Try using the function below. I assumed that all tables have an equal number of columns/rows. The code can definitely be shortened / improved / extended ;)
function multi_table_csvread (row_index)
filename_INPUT = 'multi_table.csv' ;
filename_OUTPUT = 'selected_row.csv' ;
fIN = fopen(filename_INPUT,'r');
nextLine = fgetl(fIN);
tableIndex = 0;
tableLine = 0;
csvTable = {};
% start reading the csv file, line by line
while ischar(nextLine) % fgetl returns -1 (not a char) at end of file
    lineStr = strtrim(strsplit(nextLine,',')) ;
    % remove empty cells
    lineStr(cellfun('isempty',lineStr)) = [] ;
    tableLine = tableLine + 1 ;
    % a line with a single element starts a new table
    if numel(lineStr) == 1
        tableIndex = tableIndex + 1;
        tableLine = 1;
        csvTable{tableIndex,tableLine} = lineStr ;
    else
        lineStr = add_comas(lineStr) ;
        csvTable{tableIndex,tableLine} = lineStr ;
    end
    nextLine = fgetl(fIN);
end
fclose(fIN);
fOUT = fopen(filename_OUTPUT,'w');
if row_index > size(csvTable,2) - 2
    error('The row index exceeds the maximum number of rows!')
end
for k = 1 : size(csvTable,1)
    title = csvTable{k,1};
    columnHeaders = csvTable{k,2};
    selected_row = csvTable{k,row_index+2};
    fprintf(fOUT,'%s\n',title{:});
    fprintf(fOUT,'%s',columnHeaders{:});
    fprintf(fOUT,'\n');
    fprintf(fOUT,'%s',selected_row{:});
    fprintf(fOUT,'\n');
end
fclose(fOUT);
function line_with_comas = add_comas(this_line)
% re-insert the commas that strsplit removed
for ii = 1 : length(this_line)-1
    this_line{ii} = strcat(this_line{ii},',') ;
end
line_with_comas = this_line ;
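A minimal usage sketch (assuming multi_table.csv, in the question's format, is in the current folder):
multi_table_csvread(3) % writes selected_row.csv containing row 3 of each table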