Detecting Capital Letter Strings in a Larger String - matlab

Is there a nice and clean way to find strings of capital letters of size 2-4 in length within a larger string in MATLAB? For example, let's say I have a string...
stringy = 'I imagine I could FLY';
Is there a nice way to just extract the FLY portion of the string? Currently I'm using the upper() function to identify all the characters in the string that are upper case like this...
for count = 1:length(stringy)
    if upper(stringy(count)) == stringy(count)
        isupper(count) = 1;
    else
        isupper(count) = 0;
    end
end
And then, I'm just going through the binary vector and identifying when there are 2-4 1's in a row.
This method is working... but I'm wondering if there is a cleaner way
to be doing this... thanks!!!

You can use regular expressions for this. The regular expression [A-Z]{2,4} will search for 2-4 capital letters in a string.
The corresponding matlab function is called regexp.
regexp(string,pattern) returns the starting indices of all the places in string where it matches pattern.
For your pattern I have two suggestions:
\<[A-Z]{2,4}\>. This searches for whole words that consist of 2-4 capital letters (so it doesn't grab TOUCH below):
stringy = 'I imagine I could FLY and TOUCH THE SKY';
regexp(stringy,'\<[A-Z]{2,4}\>') % returns 19, 33, 37 ('FLY','THE','SKY')
(Edit: Matlab uses \< and \> for word boundaries not the standard \b).
If you have strings where case can be mixed within a word and you want to extract those, try (?<![A-Z])[A-Z]{2,4}(?![A-Z]) (which means "2-4 capital letters that aren't surrounded by capital letters"):
stringy = 'I image I could FLYandTouchTHEsky';
% returns 17 and 28 ('FLY', 'THE')
regexp(stringy,'(?<![A-Z])[A-Z]{2,4}(?![A-Z])')
% note '\<[A-Z]{2,4}\>' wouldn't match anything here since it looks for
% *whole words* that consist of 2-4 capital letters only.
% 'FLYandTouchTHEsky' doesn't satisfy this.
Pick the regex based on what behaviour you want to occur.
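In either case, if you want the matched text itself rather than the indices, regexp's 'match' option returns the substrings directly:
% 'match' returns the text of each match instead of its starting index
stringy = 'I imagine I could FLY and TOUCH THE SKY';
regexp(stringy,'\<[A-Z]{2,4}\>','match')   % returns {'FLY','THE','SKY'}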

Related

notepad++ how to use regex to find lines that have two capital letters in a row?

how to use regular expression to find strings containing two capital letters in a row?
^([A-Z\s]+)$
^.*[A-Z]{2}.*$ matches as follows
^ Beginning of the line
.* Any char for any number of times
[A-Z]{2} Two consecutive capital letters
.* Any char for any number of times
$ End of line
Find a live example here:
https://regex101.com/r/m2hPbh/1
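If you want to run the same check from MATLAB (the language of the parent question), a quick sketch applies regexp to a cell array of lines:
lines = {'no caps here', 'two CAps in a row', 'ALL CAPS'};
hasTwoCaps = ~cellfun(@isempty, regexp(lines, '^.*[A-Z]{2}.*$', 'once'))
% hasTwoCaps = [false true true]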
([A-Z][A-Z][a-z0-9]*) would find every word that contains 2 capital letters in a row

Format Matlab data to only have digits after the decimal place

I used dlmwrite to output some data in the following form:
-1.7693255974E+00,-9.7742420654E-04, 2.1528647648E-04,-1.4866241234E+00
What I really want is the following format:
-.1769325597E+00, -.9774242065E-04, .2152864764E-04, -.1486624123E+00
A space is required before each number, followed by a sign if the number is negative, and the number format is comma-delimited, in exponential form to 10 significant digits.
Just in case Matlab is not able to write to this format (-.1769325597E+00), what is it called specifically so that I can research other means of solving my problem?
Although this feels morally wrong, one can use regular expressions to move the decimal point. This is what the function
myFormat = @(x) regexprep(sprintf('%.9e', 10*x), '(\d)\.', '\.$1');
does. The input value is multiplied by 10 prior to formatting, to account for the point being moved. Example: myFormat(-pi^7) returns -.3020293228e+04.
The above works for individual numbers. The following version is also able to format arrays, providing comma separators. The second regexprep removes the trailing comma.
myArrayFormat = @(x) regexprep(regexprep(sprintf('%.9e, ', 10*x), '(\d)\.', '\.$1'), ', $', '');
Example: myArrayFormat(1000*rand(1,5)-500) returned
-.2239749230e+03, .1797026769e+03, .1550980040e+03, -.3373882648e+03, -.3810023184e+03
For individual numbers, myArrayFormat works identically to myFormat.
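If the goal is to reproduce the dlmwrite-style output file in this format, one possible sketch (the file name is just an example) is to write each row through myArrayFormat with a low-level fprintf:
data = [-1.7693255974 -9.7742420654e-04 2.1528647648e-04 -1.4866241234];
fid = fopen('output.txt', 'w');                        % 'output.txt' is an example name
for r = 1:size(data, 1)
    fprintf(fid, '%s\n', myArrayFormat(data(r, :)));   % one formatted line per row
end
fclose(fid);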

Generate a random String in MatLab

I am trying to generate an array of strings from a long predefined array of chars, as follows. If I have the following long string:
s= 'aardvaqrkaardwolfaajronabackabacusabvaftabalongeabandonabandzonedaba'
I want to create a group of random strings based on the following rules: each string should be between 4 and 12 chars long, and should start or end with one of the following chars {j,q,v,f,x,g,b,d,z}.
So here is a solution which gives all strings that fulfill the following rules:
the starting and ending char has to be from the string:
start_end_char= 'jqvfxgbdz';
The length has to be between 4 and 8 chars long
The string has to be sequentially correct, meaning the resulting strings have to appear in exactly the same way in the "long" string.
So what am I doing?
First of all I find all the positions where the predefined starting and ending chars appear in the main string (careful I used s2 instead of s as string name).
Then I get a sorted list of those points (sorted_list).
Next I get, for each element, a list of indices that are acceptable ending points (following rules 1 and 2 stated above). These are saved in helper, which has to be a cell datatype because the strings have different lengths.
Last but not least, I construct all those strings and save them in resulting_strings, which also has to be a cell datatype.
s2 = 'aardvaqrkaardwolfaajronabackabacusabvaftabalongeabandonabandzonedaba';
start_end_char = 'jqvfxgbdz';
length_start = length(start_end_char);
%% finding all positions of possible starting/ending points
position_char = cell(1, length_start);
for k = 1:length_start
    position_char{k} = find(s2 == start_end_char(k));
end
list_of_start_end_points = [];
%% getting an array with all starting/ending points in the given array
for k = 1:length_start
    list_of_start_end_points = horzcat(list_of_start_end_points, position_char{k});
end
sorted_list = sort(list_of_start_end_points);
%% getting possible combinations
helper = cell(1, length(sorted_list));
length_helper = 0;   % total number of strings that will be generated
for k = 1:length(sorted_list)
    helper{k} = find(and(sorted_list - sorted_list(k) >= 4, sorted_list - sorted_list(k) <= 8));
    length_helper = length_helper + length(helper{k});
end
resulting_strings = cell(1, length_helper);
l = 1;
for k = 1:length(sorted_list)
    for m = 1:length(helper{k})
        resulting_strings{1, l} = s2(sorted_list(k):sorted_list(helper{k}(m)));
        l = l + 1;
    end
end
This solution uses quite a few loops. While the first two loops are negligible (their number of iterations equals the number of acceptable start/end letters), the later two loops can be quite time consuming if the original string is much longer. So maybe someone will find a vectorized solution for the later loops.
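As a small step in that direction, the two position-finding loops at least can be collapsed into one vectorized call (a sketch; the combination loops would remain):
% mark every character of s2 that is one of the allowed start/end characters
sorted_list = find(ismember(s2, start_end_char));   % find already returns ascending indices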

how to find all the possible longest common subsequence from the same position

I am trying to find all the possible longest common subsequences from the same positions of multiple fixed-length strings (there are 700 strings in total, and each string has 25 characters). The longest common subsequence must contain at least 3 characters and belong to at least 3 strings. So if I have:
String test1 = "abcdeug";
String test2 = "abxdopq";
String test3 = "abydnpq";
String test4 = "hzsdwpq";
I need the answer to be:
String[] Answer = ["abd", "dpq"];
My one problem is this needs to be as fast as possible. I have tried to find the answer with a suffix tree, but the suffix tree method gives ["ab","pq"]: a suffix tree can only find contiguous substrings common to multiple strings. The usual longest-common-subsequence algorithm cannot solve this problem either.
Does anyone have any idea on how to solve this with low time cost?
Thanks
I suggest you cast this into a well known computational problem before you try to use any algorithm that sounds like it might do what you want.
Here is my suggestion: convert this into a graph problem. For each position in the string you create a set of nodes, one for each unique letter at that position amongst all the strings in your collection (so up to 700 nodes if all 700 strings differ in the same position).
Once you have created all the nodes for each position in the string, you go through your set of strings looking at how often two positions share more than 3 equal connections. In your example we would look first at positions 1 and 2 and see that three strings contain "a" in position 1 and "b" in position 2, so we add a directed edge between the node "a" in the first set of nodes of the graph and "b" in the second group of nodes (continue doing this for all pairs of positions and all combinations of letters in those two positions). You do this for each combination of positions until you have added all necessary links.
Once you have your final graph, you must look for the longest path; I recommend looking at the wikipedia article here: Longest Path. In our case we will have a directed acyclic graph and you can solve it in linear time! The preprocessing should be quadratic in the number of string positions since I imagine your alphabet is of fixed size.
P.S: You sent me an email about the biclustering algorithm I am working on; it is not yet published but will be available sometime this year (fingers crossed). Thanks for your interest though :)
You may try to use hashing.
Each string has at most 25 characters. It means that it has 2^25 subsequences. You take each string, calculate all 2^25 hashes. Then you join all the hashes for all strings and calculate which of them are contained at least 3 times.
In order to get the lengths of those subsequences, you need to store not only hashes, but pairs <hash, subsequence_pointer> where subsequence_pointer determines the subsequence of that hash (the easiest way is to enumerate all hashes of all strings and store the hash number).
Based on the algo, the program in the worst case (700 strings, 25 characters each) will run for a few minutes.
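For very short strings the counting idea can be sketched directly; the toy MATLAB code below is only an illustration (variable names are mine, not part of the answer above), and it keys each subsequence by its positions as well as its letters so that the question's "same position" requirement is respected:
strings = {'abcdeug','abxdopq','abydnpq','hzsdwpq'};
counts  = containers.Map('KeyType','char','ValueType','double');
n = numel(strings{1});                        % all strings have the same length
for s = 1:numel(strings)
    str = strings{s};
    for mask = 1:2^n-1                        % every non-empty subset of positions
        picked = logical(bitget(mask, 1:n));  % which positions this mask selects
        if nnz(picked) < 3, continue, end     % need at least 3 characters
        key = str;
        key(~picked) = '.';                   % e.g. 'ab.d...' encodes positions + letters
        if isKey(counts, key)
            counts(key) = counts(key) + 1;
        else
            counts(key) = 1;
        end
    end
end
allKeys = keys(counts);
shared  = allKeys(cellfun(@(k) counts(k) >= 3, allKeys));   % kept by >= 3 strings
lens    = cellfun(@(k) nnz(k ~= '.'), shared);
longest = shared(lens == max(lens));
cellfun(@(k) k(k ~= '.'), longest, 'UniformOutput', false)  % -> 'abd' and 'dpq' (in some order)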

Set of unambiguous looking letters & numbers for user input

Is there an existing subset of the alphanumerics that is easier to read? In particular, is there a subset that has fewer characters that are visually ambiguous, and by removing (or equating) certain characters we reduce human error?
I know "visually ambiguous" is somewhat waffly of an expression, but it is fairly evident that D, O and 0 are all similar, and 1 and I are also similar. I would like to maximize the size of the set of alpha-numerics, but minimize the number of characters that are likely to be misinterpreted.
The only precedent I am aware of for such a set is the Canada Postal code system that removes the letters D, F, I, O, Q, and U, and that subset was created to aid the postal system's OCR process.
My initial thought is to use only capital letters and numbers as follows:
A
B = 8
C = G
D = 0 = O = Q
E = F
H
I = J = L = T = 1 = 7
K = X
M
N
P
R
S = 5
U = V = Y
W
Z = 2
3
4
6
9
This problem may be difficult to separate from the given type face. The distinctiveness of the characters in the chosen typeface could significantly affect the potential visual ambiguity of any two characters, but I expect that in most modern typefaces the above characters that are equated will have a similar enough appearance to warrant equating them.
I would be grateful for thoughts on the above – are the above equations suitable, or perhaps are there more characters that should be equated? Would lowercase characters be more suitable?
I needed a replacement for hexadecimal (base 16) for similar reasons (e.g. for encoding a key, etc.), the best I could come up with is the following set of 16 characters, which can be used as a replacement for hexadecimal:
0 1 2 3 4 5 6 7 8 9 A B C D E F Hexadecimal
H M N 3 4 P 6 7 R 9 T W C X Y F Replacement
In the replacement set, we consider the following:
All characters used have major distinguishing features that would only be omitted in a truly awful font.
Vowels A E I O U omitted to avoid accidentally spelling words.
Sets of characters that could potentially be very similar or identical in some fonts are avoided completely (none of the characters in any set are used at all):
0 O D Q
1 I L J
8 B
5 S
2 Z
By avoiding these characters completely, the hope is that the user will enter the correct characters, rather than trying to correct mis-entered characters.
For sets of less similar but potentially confusing characters, we only use one character in each set, hopefully the most distinctive:
Y U V
Here Y is used, since it always has the lower vertical section, and a serif in serif fonts
C G
Here C is used, since it seems less likely that a C would be entered as G, than vice versa
X K
Here X is used, since it is more consistent in most fonts
F E
Here F is used, since it is not a vowel
In the case of these similar sets, entry of any character in the set could be automatically converted to the one that is actually used (the first one listed in each set). Note that E must not be automatically converted to F if hexadecimal input might be used (see below).
Note that there are still similar-sounding letters in the replacement set, this is pretty much unavoidable. When reading aloud, a phonetic alphabet should be used.
Where characters that are also present in standard hexadecimal are used in the replacement set, they are used for the same base-16 value. In theory mixed input of hexadecimal and replacement characters could be supported, provided E is not automatically converted to F.
Since this is just a character replacement, it should be easy to convert to/from hexadecimal.
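For example, the conversion could be sketched as a simple character lookup (MATLAB; the function names hexToRep/repToHex are just illustrative):
hexChars = '0123456789ABCDEF';
repChars = 'HMN34P67R9TWCXYF';            % replacement set, same order as the hex digits
hexToRep = @(s) repChars(arrayfun(@(c) find(hexChars == c), upper(s)));
repToHex = @(s) hexChars(arrayfun(@(c) find(repChars == c), upper(s)));
hexToRep('C0FFEE')                        % -> 'CHFFYY'
repToHex('CHFFYY')                        % -> 'C0FFEE'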
Upper case seems best for the "canonical" form for output, although lower case also looks reasonable, except for "h" and "n", which should still be relatively clear in most fonts:
h m n 3 4 p 6 7 r 9 t w c x y f
Input can of course be case-insensitive.
There are several similar systems for base 32, see http://en.wikipedia.org/wiki/Base32 However these obviously need to introduce more similar-looking characters, in return for an additional 25% more information per character.
Apparently the following set was also used for Windows product keys in base 24, but again has more similar-looking characters:
B C D F G H J K M P Q R T V W X Y 2 3 4 6 7 8 9
My set of 23 unambiguous characters is:
c,d,e,f,h,j,k,m,n,p,r,t,v,w,x,y,2,3,4,5,6,8,9
I needed a set of unambiguous characters for user input, and I couldn't find anywhere that others have already produced a character set and set of rules that fit my criteria.
My requirements:
No capitals: this is supposed to be used in URIs, and typed by people who might not have a lot of typing experience, for whom even the shift key can slow them down and cause uncertainty. I also want someone to be able to say "all lowercase" to reduce uncertainty, so I want to avoid capital letters.
Few or no vowels: an easy way to avoid creating foul language or surprising words is to simply omit most vowels. I think keeping "e" and "y" is ok.
Resolve ambiguity consistently: I'm open to using some ambiguous characters, so long as I only use one character from each group (e.g., out of lowercase s, uppercase S, and five, I might only use five); that way, on the backend, I can just replace any of these ambiguous characters with the one correct character from their group. So, the input string "3Sh" would be replaced with "35h" before I look up its match in my database.
Only needed to create tokens: I don't need to encode information like base64 or base32 do, so the exact number of characters in my set doesn't really matter, besides my wanting it to be as large as possible. It only needs to be useful for producing random UUID-type id tokens.
Strongly prefer non-ambiguity: I think it's much more costly for someone to enter a token and have something go wrong than it is for someone to have to type out a longer token. There's a tradeoff, of course, but I want to strongly prefer non-ambiguity over brevity.
The confusable groups of characters I identified:
A/4
b/6/G
8/B
c/C
f/F
9/g/q
i/I/1/l/7 - just too ambiguous to use; note that european "1" can look a lot like many people's "7"
k/K
o/O/0 - just too ambiguous to use
p/P
s/S/5
v/V
w/W
x/X
y/Y
z/Z/2
Unambiguous characters:
I think this leaves only 9 totally unambiguous lowercase/numeric chars, with no vowels:
d,e,h,j,m,n,r,t,3
Adding back in one character from each of those ambiguous groups (and trying to prefer the character that looks most distinct, while avoiding uppercase), there are 23 characters:
c,d,e,f,h,j,k,m,n,p,r,t,v,w,x,y,2,3,4,5,6,8,9
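As an illustration only (a MATLAB sketch; charset, token and normalise are my names), tokens can be generated straight from this set, and ambiguous input folded back onto it as in the "3Sh" → "35h" example above:
charset = 'cdefhjkmnprtvwxy2345689';          % the 23 characters listed above
token = charset(randi(numel(charset), 1, 8))  % a random 8-character token
% fold ambiguous input onto the set: uppercase G and B belong to the 6 and 8
% groups, everything else is lowercased and mapped to its group's survivor
normalise = @(s) regexprep(lower(regexprep(s, {'G','B'}, {'6','8'})), ...
    {'a','b','g','q','s','z'}, {'4','6','9','9','5','2'});
normalise('3Sh')                              % -> '35h'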
Analysis:
Using the rule of thumb that a UUID with a numerical equivalent range of N possibilities is sufficient to avoid collisions for sqrt(N) instances:
an 8-digit UUID using this character set should be sufficient to avoid collisions for about 300,000 instances
a 16-digit UUID using this character set should be sufficient to avoid collisions for about 80 billion instances.
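These two estimates can be checked with quick arithmetic (sqrt(N) is the birthday-bound rule of thumb used above):
sqrt(23^8)    % ~2.8e5,  i.e. roughly 300,000 instances
sqrt(23^16)   % ~7.8e10, i.e. roughly 80 billion instances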
Mainly drawing inspiration from this UX thread, mentioned by @rwb:
Several programs use similar things. The list in your post seems to be very similar to those used in these programs, and I think it should be enough for most purposes. You can always add redundancy (error-correction) to "forgive" minor mistakes; this will require you to space out your codes (see Hamming distance), though.
There are no references as to the particular method used in deriving the lists, except trial and error with humans (which is great for non-OCR: your users are humans).
It may make sense to use character grouping (say, groups of 5) to increase context ("first character in the second of 5 groups")
Ambiguity can be eliminated by using complete nouns (from a dictionary with few look-alikes; word-edit-distance may be useful here) instead of characters. People may confuse "1" with "i", but few will confuse "one" with "ice".
Another option is to make your code into a (fake) word that can be read out loud. A markov model may help you there.
If you have the option to use only capitals, I created this set based on characters which users commonly mistyped; however, this wholly depends on the font they read the text in.
Characters to use: A C D E F G H J K L M N P Q R T U V W X Y 3 4 6 7 9
Characters to avoid:
B similar to 8
I similar to 1
O similar to 0
S similar to 5
Z similar to 2
What you seek is an unambiguous, efficient human-computer code. What I recommend is to encode the entire data with literal (meaningful) words, nouns in particular.
I have been developing software to do just that - and most efficiently. I call it WCode. Technically it's just Base-1024 encoding - wherein you use words instead of symbols.
Here are the links:
Presentation: https://docs.google.com/presentation/d/1sYiXCWIYAWpKAahrGFZ2p5zJX8uMxPccu-oaGOajrGA/edit
Documentation: https://docs.google.com/folder/d/0B0pxLafSqCjKOWhYSFFGOHd1a2c/edit
Project: https://github.com/San13/WCode (Please wait while I get around to uploading...)
This is a general problem in OCR. For end-to-end solutions where the OCR encoding is controlled, specialised fonts have been developed to solve the "visual ambiguity" issue you mention.
See: http://en.wikipedia.org/wiki/OCR-A_font
As additional information: you may want to know about Base32 encoding, wherein the symbol for the digit '1' is not used, as users may confuse it with the symbol for the letter 'l'.
Unambiguous looking letters for humans are also unambiguous for optical character recognition (OCR). By removing all pairs of letters that are confusing for OCR, one obtains:
!+2345679:BCDEGHKLQSUZadehiopqstu
See https://www.monperrus.net/martin/store-data-paper
It depends how large you want your set to be. For example, just the set {0, 1} will probably work well. Similarly the set of digits only. But probably you want a set that's roughly half the size of the original set of characters.
I have not done this, but here's a suggestion. Pick a font, pick an initial set of characters, and write some code to do the following. Draw each character to fit into an n-by-n square of black and white pixels, for n = 1 through (say) 10. Cut away any all-white rows and columns from the edge, since we're only interested in the black area. That gives you a list of 10 codes for each character. Measure the distance between any two characters by how many of these codes differ. Estimate what distance is acceptable for your application. Then do a brute-force search for a set of characters which are that far apart.
Basically, use a script to simulate squinting at the characters and see which ones you can still tell apart.
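A rough core-MATLAB sketch of that squint test (assuming an interactive session where figures and getframe are available; the character list and grid size are arbitrary):
chars = 'O0DQC1IL';
k = 8;                                            % "squint" resolution
grids = false(k, k, numel(chars));
fig = figure('Color', 'w');
for c = 1:numel(chars)
    clf(fig);
    axes('Parent', fig, 'Visible', 'off');
    text(0.5, 0.5, chars(c), 'FontSize', 100, 'HorizontalAlignment', 'center');
    frame = getframe(fig);
    bw = mean(double(frame.cdata), 3) < 128;      % black glyph pixels
    bw = bw(any(bw, 2), any(bw, 1));              % trim the white margins
    [nr, nc] = size(bw);
    grids(:, :, c) = bw(round(linspace(1, nr, k)), round(linspace(1, nc, k)));
end
close(fig);
dist = zeros(numel(chars));                       % pairwise "differing cell" counts
for a = 1:numel(chars)
    for b = 1:numel(chars)
        dist(a, b) = nnz(grids(:, :, a) ~= grids(:, :, b));
    end
end
disp(dist)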
Here's some python I wrote to encode and decode integers using the system of characters described above.
def base20encode(i):
    """Convert integer into base20 string of unambiguous characters."""
    if not isinstance(i, int):
        raise TypeError('This function must be called on an integer.')
    chars, s = '012345689ACEHKMNPRUW', ''
    while i > 0:
        i, remainder = divmod(i, 20)
        s = chars[remainder] + s
    return s

def base20decode(s):
    """Convert string to unambiguous chars and then return the integer from the resultant base20."""
    if not isinstance(s, str):
        raise TypeError('This function must be called on a string.')
    # Fold look-alike characters onto the ones actually used by the encoding.
    # (Note: X is folded onto K, since K is the member of that pair that
    # appears in the character set below.)
    s = s.translate(str.maketrans('BGDOQFIJLT7XSVYZ', '8C000E11111K5UU2'))
    chars, i, exponent = '012345689ACEHKMNPRUW', 0, 1
    for number in s[::-1]:
        i += chars.index(number) * exponent
        exponent *= 20
    return i

base20decode(base20encode(10))
Base58: 123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz (the Base58 alphabet, which omits 0, O, I and l).