RE Pattern for finding sets in a string - python-3.7

I'm looking for the best pattern to use with re.findall to find all non-empty sets in my input string.
For example,
['{2}', '{5, 65, 75909}', '{26, 90, 4590984}']
is the desired output for the input string below:
'kka{2}343{}lds{5, 65, 75909}892,{26, 90, 4590984}'
I've searched the net but didn't find anything. Does anybody know what the best pattern is?
Also, my attempts at using (,\s)? in patterns resulted in a list of 2-tuples containing only the last number of each set along with an empty string. Why do I get tuples containing empty strings? What in a pattern can lead to empty strings?
Thanks in advance.

This code might help you. It looks for every set that has at least one number in it and extracts it.
import re
pattern = re.compile(r"\{\d+[, \d]*\}")  # raw string; braces escaped so they are matched literally
string = "'kka{2}343{}lds{5, 65, 75909}892,{26, 90, 4590984}'"
print(pattern.findall(string))
The result is ['{2}', '{5, 65, 75909}', '{26, 90, 4590984}']
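On the second part of the question (where the empty strings come from), here is a small sketch. It assumes the earlier attempts wrapped parts of the pattern in capturing parentheses; the exact pattern the asker used is not shown, so the first pattern below is only illustrative. When a pattern has more than one capturing group, re.findall returns a tuple of the groups for each match, and a group that took no part in the match (such as an optional (,\s)? that matched nothing) is reported as ''. A non-capturing group (?:...) avoids this and also makes the pattern stricter about the comma-separated layout.

```python
import re

text = 'kka{2}343{}lds{5, 65, 75909}892,{26, 90, 4590984}'

# Two capturing groups: findall returns one tuple per match, and the
# optional group (,\s)? that matched nothing is reported as ''.
with_groups = re.findall(r"\{(\d+)(,\s)?\}", "{2}")

# Non-capturing group (?:...): no groups to report, so findall returns
# the whole match as a plain string.
strict = re.findall(r"\{\d+(?:,\s*\d+)*\}", text)

print(with_groups)
print(strict)
```

The stricter pattern requires digits after every comma, so it rejects things like '{1,,2}' that the character-class version would accept.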

Just for completeness: this can be done without any modules, too:
instr = 'kka{2}343{}lds{5, 65, 75909}892,{26, 90, 4590984}'
pos = 0
output = []
while pos < len(instr):
    start = instr.find("{", pos)  # next opening bracket in the unprocessed part of the string
    end = instr.find("}", start + 1) if start != -1 else -1  # its closing bracket
    if end == -1:  # no complete pair of brackets left, so stop instead of looping forever
        break
    pos = end + 1  # update pos so we don't process the same pair of brackets each loop
    if start + 1 == end:  # opening bracket { is directly before a closing bracket }, so the set is empty
        continue
    output.append(instr[start:end + 1])
print(output)

Related

Convert the position of a character in a string to account for "gaps" (i.e., non alphanumeric characters in the string)

In a nutshell
I have a string that looks something like this ...
---MNTSDSEEDACNERTALVQSESPSLPSYTRQTDPQHGTTEPKRAGHT--------LARGGVAAPRERD
And I have a list of positions and corresponding characters that looks something like this...
position character
10 A
12 N
53 V
54 A
This position/character key doesn't account for hyphen (-) characters in the string. So for example, in the given string the first letter M is in position 1, the N in position 2, the T in position 3, etc. The T preceding the second chunk of hyphens is position 47, and the L after that hyphen chunk is position 48.
I need to convert the list of positions and corresponding characters so that the position accounts for hyphen characters. Something like this...
position character
13 A
15 N
64 V
65 A
I think there should be a simple enough way to do this, but I am fairly new so I am probably missing something obvious, sorry about that! I am doing this as part of a bigger script, so if anyone has a way to accomplish this using Perl, that would be amazing. Thank you so much in advance, and please let me know if I can clarify anything or provide more information!
What I tried
At first, I took a substring of characters equal to the position value, counted the number of hyphens in that substring, and added the hyphen count onto the original position. So for the first position/character in my list, take the first 10 characters, and then there are 3 hyphens in that substring, so 10+3 = 13 which gives the correct position. This works for most of my positions, but fails when the original position falls within a bunch of hyphens like for positions 53 and 54.
I also tried grabbing the character by taking out the hyphens and then using the original position value like this...
my @array = ($string =~ /\w/g);
my $character = $array[$position];
which worked great, but then I was having a hard time using this to convert the position to include the hyphens because there are too many matching characters to match the character I grabbed here back to the original string with hyphens and find the position in that (this may have been a dumb thing to try from the start).
The actual character seems not to be relevant. It's enough to count the non-hyphens:
use strict;
use warnings;
use Data::Dumper;
my $s = '---MNTSDSEEDACNERTALVQSESPSLPSYTRQTDPQHGTTEPKRAGHT--------LARGGVAAPRERD';
my @positions = (10,12,53,54);
my @transformed = ();
my $start = 0;
for my $loc(@positions){
    my $dist = $loc - $start;
    while ($dist){
        $dist-- if($s =~ m/[^-]/g);
    }
    my $pos = pos($s);
    push @transformed, $pos;
    $start = $loc;
}
print Dumper \@transformed;
prints:
$VAR1 = [
13,
15,
64,
65
];
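The same idea (count only the non-hyphen characters) can be sketched in Python for comparison. This is a minimal sketch, not a port of the Perl above: the function name gapped_positions is made up for illustration, and it assumes positions are 1-based and that '-' is the only gap character.

```python
def gapped_positions(aligned, positions, gap="-"):
    """Map 1-based ungapped positions to 1-based columns in the gapped string."""
    # Column (1-based) of every non-gap character, in order of appearance.
    columns = [i for i, ch in enumerate(aligned, start=1) if ch != gap]
    return [columns[p - 1] for p in positions]

s = '---MNTSDSEEDACNERTALVQSESPSLPSYTRQTDPQHGTTEPKRAGHT--------LARGGVAAPRERD'
converted = gapped_positions(s, [10, 12, 53, 54])
print(converted)  # [13, 15, 64, 65]
```

Building the column list once means any number of positions can be converted with simple lookups, and positions that fall right after a run of hyphens need no special handling.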

How to subtract strings that are not consecutive python

What I mean is: if I have a string "apwswe" and another string "appegwisbnwe", and I "subtract" the first from the second, i.e. "appegwisbnwe" - "apwswe", I want to get "pegibn". Is there a way to do this? By the way, "pegibn" consists of the characters that the two strings don't have in "common" with each other.
Not exactly a thing of beauty, but this will get you there:
subtrahend = "apwswe"
minuend = list("appegwisbnwe")
for char in subtrahend:
    if minuend.count(char):
        minuend.remove(char)
difference = "".join(minuend)
print(difference)
Output:
pgibne
Possible alternatives to rhurwitz's solution:
import collections
input = "appegwisbnwe"
for char, occurrences in collections.Counter("apwswe").items():  # .items() yields (char, count) pairs
    input = input.replace(char, '', occurrences)
This is quite simple and can be implemented as a straightforward functools.reduce expression, but it rewrites the input string once for each distinct character in the filter.
A possibly more efficient alternative, since it runs in O(len(input) + len(filter)) rather than O(len(input) * len(uniq(filter))):
import collections
input = "appegwisbnwe"
filter = collections.Counter("apwswe")
output = ''
for c in input:
    if filter[c]:
        filter[c] -= 1  # this occurrence is "subtracted"
    else:
        output += c
print(output)
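Both approaches above remove the first matching occurrence of each character and yield "pgibne", whereas the question asks for "pegibn". One reading that produces the requested result is to treat the shorter string as a subsequence and remove its characters in order. The sketch below assumes that interpretation; the name subtract_subsequence is made up for illustration.

```python
def subtract_subsequence(minuend, subtrahend):
    """Remove the characters of subtrahend from minuend, matching them in order."""
    kept = []
    j = 0  # index of the next subtrahend character still to be matched
    for ch in minuend:
        if j < len(subtrahend) and ch == subtrahend[j]:
            j += 1  # matched in order: drop this character
        else:
            kept.append(ch)
    return "".join(kept)

result = subtract_subsequence("appegwisbnwe", "apwswe")
print(result)  # pegibn
```

This is a single pass over the longer string. If the shorter string is not a subsequence of the longer one, the unmatched tail is simply ignored.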

Extracting and joining exons from multiple sequence alignments

Using my (fairly) basic coding skills, I have put together a script that will parse an aligned multi-fasta file (a multiple sequence alignment) and extract all the data between two specified columns.
use Bio::SimpleAlign;
use Bio::AlignIO;
$str = Bio::AlignIO->new(-file => $inputfilename, -format => 'fasta');
$aln = $str->next_aln();
$mini = $aln->slice($array[0], $array[1]);
$out = Bio::AlignIO->new(-file => $array[3], -format => 'fasta');
$out->write_aln($mini);
The problem I have is that I want to be able to slice multiple regions from the same alignment and then join these regions prior to writing to an outfile. The complication is that I want to supply a file with a list of co-ordinates where each line contains two or more co-ordinates between which data should be extracted and joined.
Here is an example co-ordinate file
ORF1, 10, 50, exon1 # The above line should produce a slice between columns 10-50 and write to an outfile
ORF2, 70, 140, exon1
ORF2, 190, 270, exon2
ORF2, 500, 800, exon3 # Data should be extracted between the ranges specified here and in the above two lines and then joined (side by side) to produce the outfile.
ORF3, 1200, 1210, exon1
etc etc
And here is an (small) example of an aligned fasta file
\>Sample1
ATGGCGACCGTGCACTACTCCCGCCGACCTGGGACCCCGCCGGTCACCCTCACGTCGTCC
CCCAGCATGGATGACGTTGCGACCCCCATCCCCTACCTACCCACATACGCCGAGGCCGTG
GCAGACGCGCCCCCCCCTTACAGAAGCCGCGAGAGTCTGGTGTTCTCCCCGCCTCTTTTT
CCTCACGTGGAGAATGGCACCACCCAACAGTCTTACGATTGCCTAGACTGCGCTTATGAT
GGAATCCACAGACTTCAGCTGGCTTTTCTAAGAATTCGCAAATGCTGTGTACCGGCTTTT
TTAATTCTTTTTGGTATTCTCACCCTTACTGCTGTCGTGGTCGCCATTGTTGCCGTTTTT
CCCGAGGAACCTCCCAACTCAACTACATGA
\>Sample2
ATGGCGACCGTGCACTACTCCCGCCGACCTGGGACCCCGCCGGTCACCCTCACGTCGTCC
CCCAGCATGGATGACGTTGCGACCCCCATCCCCTACCTACCCACATACGCCGAGGCCGTG
GCAGACGCGCCCCCCCCTTACAGAAGCCGCGAGAGTCTGGTGTTCTCCCCGCCTCTTTTT
CCTCACGTGGAGAATGGCACCACCCAACAGTCTTACGATTGCCTAGACTGCGCTTATGAT
GGAATCCACAGACTTCAGCTGGCTTTTCTAAGAATTCGCAAATGCTGTGTACCGGCTTTT
TTAATTCTTTTTGGTATTCTCACCCTTACTGCTGTCGTGGTCGCCATTGTTGCCGTTTTT
CCCGAGGAACCTCCCAACTCAACTACATGA
I think there should be a fairly simple way to solve this problem, potentially using the information in the first column, paired with the exon number, but I can't for the life of me figure out how this can be done.
Can anyone help me out?
The aligned fasta file you posted -- at least as it appears on the stackoverflow web page -- did not parse. According to https://en.wikipedia.org/wiki/FASTA_format, the description lines should begin with a >, not with \>.
Be sure to run all Perl programs with use strict; use warnings;. This will facilitate debugging.
You have not populated @array. Consequently, you can expect to get errors like these:
Use of uninitialized value $start in pattern match (m//) at perl-5.24.0/lib/site_perl/5.24.0/Bio/SimpleAlign.pm line 1086, <GEN0> line 16.
Use of uninitialized value $start in concatenation (.) or string at perl-5.24.0/lib/site_perl/5.24.0/Bio/SimpleAlign.pm line 1086, <GEN0> line 16.
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Slice start has to be a positive integer, not []
STACK: Error::throw
STACK: Bio::Root::Root::throw perl-5.24.0/lib/site_perl/5.24.0/Bio/Root/Root.pm:444
STACK: Bio::SimpleAlign::slice perl-5.24.0/lib/site_perl/5.24.0/Bio/SimpleAlign.pm:1086
STACK: fasta.pl:26
Once you assign plausible values, e.g.,
@array = (1,17);
... you will get more plausible results:
$ perl fasta.pl
>Sample1/1-17
ATGGCGACCGTGCACTA
>Sample2/1-17
ATGGCGACCGTGCACTA
HTH!
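For the part of the question about joining several regions, here is a language-neutral sketch in plain Python rather than BioPerl. It assumes the co-ordinate file has the four comma-separated fields shown in the question (name, start, end, exon label), that trailing # comments should be ignored, and that columns are 1-based and inclusive, as the 1-17 slice above suggests. The function names are made up for illustration.

```python
def parse_coords(lines):
    """Group (start, end) ranges by the name in the first column, in file order."""
    regions = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        name, start, end, _exon = [field.strip() for field in line.split(",")]
        regions.setdefault(name, []).append((int(start), int(end)))
    return regions

def join_slices(seq, ranges):
    """Concatenate 1-based, inclusive column ranges of one aligned sequence."""
    return "".join(seq[start - 1:end] for start, end in ranges)

coords = parse_coords([
    "ORF1, 10, 50, exon1 # single region",
    "ORF2, 70, 140, exon1",
    "ORF2, 190, 270, exon2",
])
print(coords)
print(join_slices("ABCDEFGHIJ", [(1, 3), (8, 10)]))  # ABCHIJ
```

Applying join_slices to every sequence in the alignment with the ranges of one ORF gives the joined exons for that ORF, which could then be written out as one FASTA record per sample.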

How to search a text document by position

I need to search a text file that is about 30 lines. I need it to search row by row and grab numbers based on their position in the text file that will remain constant throughout the text file. Currently I need to get the first 2 numbers and then the last 4 numbers of each row. My code now:
FileToOpen = fopen('textfile.txt');
if FileToOpen == -1
    disp('Error')
    return;
end
while true
    msg = fgetl(FileToOpen)
    if ~ischar(msg)  % fgetl returns the number -1 at end of file
        break;
    end
end
I would like to use the fgetl command if possible, as I somewhat know that command, but if there is an easier way, that would be more than welcome.
This looks like a good start. You should be able to use msg - '0' to get the values of the digits. In ASCII, the digit characters are adjacent and in order (0, 1, 2, ..., 9), so subtracting '0' subtracts the ASCII code of '0' from every character in msg. You then get the digits as
tmp = msg - '0';
idx = find(tmp>=0 & tmp < 10); % Get the position in the row
val = tmp(idx); % Or tmp(tmp>=0 & tmp < 10) with logical indexing.
I agree that fgetl is probably the best choice for text without a specific structure. However, if your text does have a special structure, you can exploit it and use more efficient algorithms.
If you were actually after the absolute position of the digits in the text, you can keep a running total, msgLength = msgLength + length(msg), on every iteration and use it to calculate the absolute positions.
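If the numbers really sit at fixed columns, slicing each row by position may be all that is needed. The sketch below shows that idea in Python for comparison with the MATLAB above. It assumes "the first 2 numbers and the last 4 numbers" means the first two and last four characters of each row and that those characters are digits; the sample rows are invented.

```python
import io

def first_two_last_four(line):
    """Return the first 2 and last 4 characters of a row as integers."""
    line = line.rstrip("\r\n")
    return int(line[:2]), int(line[-4:])

# Stand-in for the real file; open("textfile.txt") would be iterated the same way.
sample = io.StringIO("12 some text 3456\n98 other text 0007\n")
pairs = [first_two_last_four(row) for row in sample if row.strip()]
print(pairs)  # [(12, 3456), (98, 7)]
```

Slicing by position avoids scanning every character, but it raises a ValueError if a row does not have digits at those columns, so rows with a different layout need to be filtered out first.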

How to get rid of the punctuation? and check the spelling error

eliminate punctuation
words split when meeting new line and space, then store in array
check the text file got error or not with the function of checkSpelling.m file
sum up the total number of error in that article
no suggestion is assumed to be no error, then return -1
sum of error>20, return 1
sum of error<=20, return -1
I would like to check the spelling of a certain paragraph, but I have trouble getting rid of the punctuation. There may be other causes as well; it returns the error below:
My data2 file is :
checkSpelling.m
function suggestion = checkSpelling(word)
h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
    suggestion = []; %return empty if spelled correctly
else
    %If incorrect and there are suggestions, return them in a cell array
    if h.GetSpellingSuggestions(word).count > 0
        count = h.GetSpellingSuggestions(word).count;
        for i = 1:count
            suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
        end
    else
        %If incorrect but there are no suggestions, return this:
        suggestion = 'no suggestion';
    end
end
%Quit Word to release the server
h.Quit
f19.m
for i = 1:1
    data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r')
    CharData = fread(data2, '*char')'; %read text file and store data in CharData
    fclose(data2);
    word_punctuation=regexprep(CharData,'[`~!@#$%^&*()-_=+[{]}\|;:\''<,>.?/','')
    word_newLine = regexp(word_punctuation, '\n', 'split')
    word = regexp(word_newLine, ' ', 'split')
    [sizeData b] = size(word)
    suggestion = cellfun(@checkSpelling, word, 'UniformOutput', 0)
    A19(i)=sum(~cellfun(@isempty,suggestion))
    feature19(A19(i)>=20)=1
    feature19(A19(i)<20)=-1
end
Replace your regexprep call with
word_punctuation=regexprep(CharData,'\W','\n');
Here \W matches all non-alphanumeric characters (including spaces), which get replaced with a newline.
Then
word = regexp(word_punctuation, '\n', 'split');
As you can see, you don't need to split by space (see above). But you can remove the empty cells:
word(cellfun(@isempty,word)) = [];
Everything worked for me. However, I have to say that your checkSpelling function is very slow. On every call it has to create an ActiveX server object, add a new document, and delete the object after the check is done. Consider rewriting the function to accept a cell array of strings.
UPDATE
The only problem I see is that this also removes the quote ' character (I'm, don't, etc.). You can temporarily substitute it with an underscore (yes, it's considered alphanumeric) or any sequence of unused characters. Or you can use a list of all non-alphanumeric characters to be removed in square brackets instead of \W.
UPDATE 2
Another solution to the 1st UPDATE:
word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');
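The tokenising step from UPDATE 2 can be sketched in Python to show what the character class keeps and drops. This is an illustration of the same regular expression idea, not a replacement for the MATLAB code. The label function follows the rule listed in the question (more than 20 errors gives 1, otherwise -1), and both function names are made up.

```python
import re

def split_words(text):
    """Split on anything that is not a letter, digit, apostrophe or underscore."""
    return [w for w in re.split(r"[^A-Za-z0-9'_]+", text) if w]

def label(error_count, threshold=20):
    """Return 1 when the error count exceeds the threshold, else -1."""
    return 1 if error_count > threshold else -1

words = split_words("I'm here, don't go.\nBye!")
print(words)  # ["I'm", 'here', "don't", 'go', 'Bye']
print(label(21), label(20))  # 1 -1
```

Splitting on runs of unwanted characters (the trailing +) means consecutive punctuation and whitespace produce at most one empty string at each end, which the list comprehension filters out.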