How to detect a 4-digit number using regexp - MATLAB

How do I get the year (4 digits) from the text of a source file? I can only detect the day (29), not the year (1997). There is something wrong in my regexp.
age = regexp(CharData,'(\d{1,4})','match','once')
For example,
Registered On
March 29, 1997
Desired output: 1997
Error output: 29
for i = 1:2
data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F22_TR\f22_TR_pdata_',int2str(i),''),'r')
CharData = fread(data2, '*char'); %read text file and store data in CharData
fclose(data2);
age = regexp(CharData,'(\d{4})','match','once')
end
file : f22_TR_pdata_1 --> Registered On
June 24, 1997
file : f22_TR_pdata_2 --> Registered On
March 29, 1997
Age: 1997

To only grab four digits
age = regexp(CharData,'(\d{4})','match','once')
\d{1,4} matches a run of one to four digits, so 1, 29, 123 and 4444 all match. With the 'once' option, regexp returns only the first match, and in "March 29, 1997" the first run of digits is 29.
\d{4} matches exactly four consecutive digits, so 1997, 2001 and 1800 all match, while 29 is skipped.
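A minimal sketch of the difference, using the sample text from the question as an in-memory string (the variable names loose, exact and isolated are only for illustration). The last pattern is an optional extra, not part of the answer above: its lookarounds require that the four digits are not part of a longer number.

```matlab
% Sample text from the question, built in memory instead of read from a file
CharData = sprintf('Registered On\nMarch 29, 1997');

% One to four digits: the first run of digits wins, which is the day
loose = regexp(CharData, '\d{1,4}', 'match', 'once');

% Exactly four digits: the day is too short, so the year is returned
exact = regexp(CharData, '\d{4}', 'match', 'once');

% Optional: require that no digit precedes or follows the four digits,
% so that part of a longer number such as 123456 is not picked up
isolated = regexp('ID 123456, 1997', '(?<!\d)\d{4}(?!\d)', 'match', 'once');

disp({loose, exact, isolated})
```

If the file can contain other four-digit numbers before the date, the pattern would need more context (for example the month name), which is beyond this sketch.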

Related

Extract date from string with other numbers in R

I need to extract the year from this text:
Mellisoni 2014 Malbec (Columbia Valley (WA))
Okapi 2013 Estate Cabernet Sauvignon (Napa Valley)
Podere dal Nespoli 2015 Prugneto Sangiovese (Romagna)
Simonnet-Febvre 2015 Chablis
Lagler 2012 1000 Eimerberg Smaragd Neuburger (Wachau)
I use this code:
vino<-mutate(vino, year1=sub("^.*([0-9]{4}).*", "\\1", vino$title))
It works for most rows, but for the last one it extracts 1000 instead of 2012. How can I fix this when the title contains other numbers?

How to convert output of Emboss:Palindrome into gff/bed file (perl)

I am sorry to ask this kind of stupid question, but I could not figure it out by myself. I learned Perl a while ago and I am a little lost.
I want to convert this kind of output :
Palindromes of: seq1
Sequence length is: 24
Start at position: 1
End at position: 24
Minimum length of Palindromes is: 6
Maximum length of Palindromes is: 12
Maximum gap between elements is: 6
Number of mismatches allowed in Palindrome: 0
Palindromes:
1 aaaaaaaaaaa 11
|||||||||||
24 ttttttttttt 14
Palindromes of: seq2
Sequence length is: 15
Start at position: 1
End at position: 15
Minimum length of Palindromes is: 6
Maximum length of Palindromes is: 12
Maximum gap between elements is: 6
Number of mismatches allowed in Palindrome: 0
Palindromes:
1 aaaaaac 7
|||||||
15 ttttttg 9
Into a gff or bed file :
seq1 1 24
seq2 1 15
I found a perl module to do it : https://metacpan.org/pod/Bio::Tools::GFF
This is my little script :
#!/usr/bin/perl
use strict;
use warnings 'all';
use Bio::Tools::EMBOSS::Palindrome;
use Bio::Tools::GFF;
my $filename = "truc.pal";
# a simple script to turn palindrome output into GFF3
my $parser = Bio::Tools::EMBOSS::Palindrome->new(-file => $filename);
my $out = Bio::Tools::GFF->new(-gff_version => 3,
-file => ">$filename.gff");
while( my $seq = $parser->next_seq ) {
for my $feat ( $seq->get_SeqFeatures ) {
$out->write_feature($feat);
}
}
This is the result :
##gff-version 3
seq1 palindrome similarity 14 24 . - 1 allowed_mismatches=0;end=24;maximum gap=6;maximum_length=12;minimum_length=6;seqlength=24;start=1
seq2 palindrome similarity 9 15 . - 1 allowed_mismatches=0;end=15;maximum gap=6;maximum_length=12;minimum_length=6;seqlength=15;start=1
The issue is: I want the result to span from the start to the end of the palindrome, with the specific gap positions in the last column.
Example of what I want:
##gff-version 3
seq1 palindrome similarity 1 24 . - 1 mismatches=0;gap_positions=11-14;gap_size=3
seq2 palindrome similarity 1 15 . - 1 mismatches=0;gap_positions=7-9;gap_size=2
Thank you in advance.

Reading CSV file with text using textread

I am reading a CSV file in MATLAB using the textread function and storing the values in cells of string and float types.
[string1, string2, values] = textread('/path/xyz.csv', '%s %s %f', 'headerlines', 1);
Data has three columns. Two of them I believe are of string type and one is float.
Sample Data
#timestamp host value
March 5th 2019, 13:41:54.879 tscompute1 0.399
March 5th 2019, 13:41:54.879 tscompute1 0.599
March 5th 2019, 13:41:54.879 tscompute1 0
March 5th 2019, 13:41:54.879 tscompute1 0.2
March 5th 2019, 13:41:54.879 tscompute1 0
March 5th 2019, 13:41:54.879 tscompute1 0
March 5th 2019, 13:41:54.879 tscompute1 0
March 5th 2019, 13:41:54.879 tscompute1 0
March 5th 2019, 13:41:54.879 tscompute1 0
March 5th 2019, 13:41:54.879 tscompute1 100
March 5th 2019, 13:41:54.879 tscompute1 0.4
There is no execution error, but the values read are not as expected. Please find the sample output below.
Values stored in string1 look as follows
'"March'
','
'"March'
','
'"March'
','
'"March'
','
Values stored in string2 look as follows
'5th'
'13:41:54.879",tscompute1,0.399'
'5th'
'13:41:54.879",tscompute1,0.599'
'5th'
'13:41:54.879",tscompute1,0'
'5th'
'13:41:54.879",tscompute1,0.2'
Values stored in values look as follows
2019
0
2019
0
2019
0
2019
0
Your text seems to have inconsistent delimiters: the date is separated from the time by a comma, while the time, the name "tscompute1" and the number are separated by whitespace.
The simplest approach is to read every line as six whitespace-separated elements, five of them strings and the sixth a number.
[s1, s2, s3, s4, s5, values] = textread('/path/xyz.csv', '%s %s %s %s %s %f', 'headerlines', 1);
That lets you recover the date (concatenate the strings in s1 to s3 and remove the trailing comma from s3), the time (s4), the name (s5) and the value.
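The reassembly step could be sketched as follows. To keep the example self-contained it swaps textread on a file for textscan on an in-memory string, and it assumes the lines look exactly like the whitespace-separated sample shown in the question; the names txt, C, dates, times, hosts and values are only illustrative.

```matlab
% Two sample lines in the whitespace-separated layout shown in the question
txt = sprintf('%s\n', ...
    'March 5th 2019, 13:41:54.879 tscompute1 0.399', ...
    'March 5th 2019, 13:41:54.879 tscompute1 0.599');

% Five whitespace-separated strings followed by one number per line
C = textscan(txt, '%s %s %s %s %s %f');

% Drop the trailing comma from the year, then join month, day and year
yearStr = regexprep(C{3}, ',$', '');
dates   = strcat(C{1}, {' '}, C{2}, {' '}, yearStr);

times  = C{4};   % e.g. '13:41:54.879'
hosts  = C{5};   % e.g. 'tscompute1'
values = C{6};   % numeric column

disp(dates)
disp(values)
```

The spaces are passed as cell arrays ({' '}) because strcat strips trailing whitespace from plain character inputs but keeps it for cell inputs. If the real file is quoted and comma-separated, as the output in the question suggests, the format string would need to change accordingly.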

reading specific lines from .txt file

I have a large text file with lots of information inside. I need to access only the line that starts with '***'. That line contains 17 numbers separated by spaces.
Example of the file is,
Msg count = 2629
max send msg count = 34
avg send msg count = 10.27
imbalance send msg count = 3.31
------------------------------
max recv msg count = 35
avg recv msg count = 10.27
imbalance recv msg count = 3.41
***1.100020 306852 1381937 11045 5398.19 2.05 10465 5398.19 1.94 2629 34 10.27 3.31 35 10.27 3.41 0.000000
[INFO] +++ Sat Sep 24 15:15:33 2016
+++ (test.c:816) stat1 end
Is there a way to do this?
Try this code:
infilename = 'nameoffile.txt'; % name of your file
m = memmapfile(infilename); % map the file into memory as uint8 bytes
instrings = strsplit(char(m.Data.'),'\n','CollapseDelimiters',true).'; % one cell per line
checkstr = '***';
% find the indices of all lines starting with checkstr
ind = find(strncmpi(instrings,checkstr,length(checkstr)));
if isempty(ind)
fprintf('No lines start with %s\n',checkstr)
else
% first line starting with checkstr (no semicolon, so it is displayed)
firstLine = instrings{ind(1)}
end
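Once the line is found, its 17 numbers still have to be converted from text. A minimal sketch, assuming the line has the same layout as the example in the question (the variable names starLine and nums are only illustrative):

```matlab
% The '***' line from the example file, as it would come out of the search
starLine = ['***1.100020 306852 1381937 11045 5398.19 2.05 10465 ' ...
            '5398.19 1.94 2629 34 10.27 3.31 35 10.27 3.41 0.000000'];

% Skip the three leading asterisks and parse every number on the line;
% sscanf reapplies '%f' until the input is exhausted and returns a column
nums = sscanf(starLine(4:end), '%f');

fprintf('Parsed %d numbers, first = %g, last = %g\n', ...
        numel(nums), nums(1), nums(end));
```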

MATLAB Cut 3D array of daily data into monthly segments

I have a 1437x159x1253 large matrix (let's call it A) of daily sea ice data for a little over 2 years. I need to write code that takes the daily data from each month and does mean(A, 3) on it. So basically, 1253 is the t in days. If I start from January, I need to do mean(A,3) of the first 31 days, then the mean(A,3) of February, the next 28 or 29 days. Because the month lengths vary between 31 and 30 days (and 28 or 29 for February), I don't know how to write code to do this. I can do it manually, but that would take a while.
Thanks!
You can initialize an array containing the number of days in each month, Mon = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31], and use a boolean leap-year check to set Mon(2) = 29 when needed. The day counts let you index each month, using a loop like:
index = 1;
monthly = zeros(size(A,1), size(A,2), 12); % one mean map per month
for i = 1:12
monthly(:,:,i) = mean(A(:,:,index:(index+Mon(i)-1)), 3);
index = index + Mon(i); % starting location of the next month
end
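Since the data covers more than one year, the same idea can be extended to walk through consecutive months. The sketch below is one possible way, not the only one: it uses eomday to get each month's length (which handles leap years), and it assumes the series starts on the first day of a month. The start year and month are hypothetical, and a small random array stands in for the real 1437x159x1253 data.

```matlab
% Small stand-in for the real array: 90 days of 4x3 maps
A = rand(4, 3, 90);

% Assumed start of the series (hypothetical; replace with the real one)
y = 2016;
m = 1;

nDays   = size(A, 3);
monthly = [];          % grows by one slice per month
index   = 1;           % first day of the current month within A
k       = 0;           % number of months processed so far

while index <= nDays
    len  = eomday(y, m);                   % days in this month, leap-aware
    last = min(index + len - 1, nDays);    % the final month may be partial
    k = k + 1;
    monthly(:, :, k) = mean(A(:, :, index:last), 3);
    index = last + 1;

    % advance to the next calendar month
    m = m + 1;
    if m > 12
        m = 1;
        y = y + 1;
    end
end

fprintf('Computed %d monthly means\n', k);
```

For the full-size array it would be worth preallocating monthly instead of growing it inside the loop.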