Matlab: parsing large segmented data with empty strings - matlab

I have a complex data text file to parse, my first problem is some of the strings values are missing (such as row 5 column 4 shown in Data below, I tried using treatAsEmpty with 8 blank spaces but it didn't work it keeps moving the B from the 5th row over and not registering the rest of the row [To be honest I don't need that column, if you can show me how to ignore it that would solve this problem]).
textscan(fileName .'%4d %4d %4d %8s \t %1s %2d \b %2s %7s %5d %*[^\n]','delimiter','\r','treatAsEmpty',' ','EmptyValue',-Inf);
Data:
0439 0444 0441 S09E44SF A 13 ES 3.7E-04 10230
0727 0736 0732 S27W23SF A 29 ES 1.2E-03 10226
0937 0945 0942 S29W16SF A 23 ES 8.8E-04 10226
2000 2016 2008 S28W27SF C 23 ES 1.8E-03 10226
2134 2217 2153 B 27 ES 4.8E-02 10229
0032 0042 0037 S25W27SF C 45 ES 2.1E-03 10226
0142 0147 0145 S09E35SF C 14 ES 4.1E-04 10230
0536 0555 0541 S09E33SF C 16 ES 1.6E-03 10230
0214 0312 0252 N23W422F A 11 ES 2.3E-02 10223
My second problem is, the blank space that is row 6 and row 10. I need to get rows 1-5 in cells (1x9), rows 7-9 in cells (2x9), row 11 in cells (3x9), etc.

Related

Get matrix from one text file

i need get matrix from attached text.
this file:
stk.v.11.0
WrittenBy STK_v11.2.0
BEGIN Attitude
NumberOfAttitudePoints 1441
BlockingFactor 20
InterpolationOrder 1
CentralBody Earth
ScenarioEpoch 1 Sep 2019 07:30:00.000000
Epoch in JDate format: 2458727.81250000000000
Epoch in YYDDD format: 19244.31250000000000
Time of first point: 1 Sep 2019 07:30:00.000000000 UTCG = 2458727.81250000000000 JDate = 19244.31250000000000 YYDDD
CoordinateAxes Fixed
AttitudeTimeQuaternions
0.00000000000000e+00 -6.31921733422396e-01 -3.17293118154861e-01 -3.46458799928973e-01 6.16414065342263e-01
6.00000000000000e+01 -6.43812442721383e-01 -3.36172942616126e-01 -3.26612615262581e-01 6.04828480481266e-01
1.20000000000000e+02 -6.55079207620880e-01 -3.54631351046725e-01 -3.06354155948717e-01 5.92667670562959e-01
1.80000000000000e+02 -6.65707851270427e-01 -3.72650894083426e-01 -2.85703135087044e-01 5.79946623834615e-01
i need seprate this 4 row of matrix and asset it to one matrix variable...
in fact i have bigger matrix than that (that near 1400 rows) i must extract from text file and store it in one variable of matrix...
Use readmatrix and specify the number of lines to skip to get to the matrix data. If the number is fixed at 25 as in your example, then:
>> a = readmatrix('textfile.txt','NumHeaderLines',25)
a =
0 -0.6319 -0.3173 -0.3465 0.6164
60.0000 -0.6438 -0.3362 -0.3266 0.6048
120.0000 -0.6551 -0.3546 -0.3064 0.5927
180.0000 -0.6657 -0.3727 -0.2857 0.5799

Text file processing in Matlab

I have a text output from a program with a set format. I need to parse ~200 of them to extract an information. I tried in MATLAB with 'textscan' but did not work. Following is the input:
MOTIFS SUMMARY:
1) TTATAGCCGC (GCGGCTATAA) 1.986
2) AAACCGCCTC (GAGGCGGTTT) 1.865
DETAILED RESULTS:
1) TTATAGCCGC (GCGGCTATAA) 1.986
Matrix: MAT1 TTATAGCCGC
A 0.1249 0.177 0.7364 0.1189 0.7072 0.1149 0.09858 0.1096
C 0.0899 0.07379 0.1136 0.1298 0.08662 0.1293 0.7528 0.721
G 0.06828 0.1284 0.07195 0.1031 0.1352 0.6708 0.05556 0.0713
T 0.7169 0.6209 0.07802 0.6482 0.07096 0.08492 0.09305 0.09804
OCCURRENCES:
>GENE_1 1 TTATAGCCGC 1 561 +
>GENE_2 24 TAATAGCCGC 0.928699 762 -
>GENE_3 10 ATATAGCCGC 0.904905 185 -
>GENE_1 7 TTATAGCAGC 0.901785 726 +
**********
2) AAACCGCCTC (GAGGCGGTTT) 1.865
Matrix: MAT2 AAACCGCCTC
A 0.653 0.7401 0.7763 0.1323 0.09619 0.09134 0.07033 0.1383
C 0.1163 0.07075 0.09441 0.749 0.6347 0.1132 0.6559 0.6982
G 0.09136 0.09402 0.07385 0.04209 0.1799 0.7332 0.1241 0.07568
T 0.1393 0.09518 0.05541 0.07659 0.08921 0.06234 0.1497 0.08786
OCCURRENCES:
>GENE_1 21 AAACCGCCTC 1 963 +
>GENE_2 14 AAACGGCCTC 0.928198 212 +
>GENE_2 8 AAACCGTCTC 0.92009 170 +
>GENE_4 3 TAACCGCCTC 0.918883 370 +
**********
I am trying to count the unique() occurrence under each motif and add it to the MOTIF SUMMARY and a final average of them. My expected output is:
MOTIFS SUMMARY:
1) TTATAGCCGC (GCGGCTATAA) 1.986 3
2) AAACCGCCTC (GAGGCGGTTT) 1.865 3
AVERAGE OCCURRENCE: 3
For motif 1, unique occurrence is 3 (GENE_1, GENE_2, GENE_3). Similarly for motif 2, it is again 3 (GENE_1, GENE_2, GENE_4)
How can I use OCCURRENCES and ****** as blocks ? so that, I can regexp GENE_x to store it and count.
Kindly help.
Thanks,
AP
You better try to change the original text file so that it will be legal matlab m file code, then just use 'eval' function to run it .
Most of the job will be to find where to insert '=' and '[' ']' and '%' for ignore parts.
If all files are identical in format than it will be easy.

downsampling rate with movement data (first point equal from the original matrix)

I was wondering if the procedure applied trying to download the sample rate was the appropriate as follows the instruction: y = downsample(x,n)
downsamp_rate = 40;
downsampled_data = downsample(X,downsamp_rate);
.. because my doubt relays in why the first column from both matrices is exactly the same (the original matrix and the sample donwloaded)maintaining the same data....
then the other data have already transformed to a lower sample rate.
Thank you so much!
Best!
edited: Sample data. I pasted the data but I can upload de .mat files.
Original data.
column 1 column 2 column 3
-0,593600000000000 -0,592699999999996 -0,591899999999995
2,42180000000000 2,41010000000000 2,40360000000000
1,78550000000000 1,79020000000000 1,79530000000000
-1,30590000000000 -1,31520000000000 -1,31530000000000
-0,707800000000003 -0,712699999999999 -0,727700000000003
-0,986500000000001 -0,996000000000002 -1,00460000000000
-0,989699999999999 -0,989699999999999 -0,989699999999999
1,23500000000000 1,22970000000000 1,21880000000000
0,122899999999998 0,127899999999997 0,128899999999998
0,938300000000003 0,937500000000002 0,936200000000004
0,248600000000004 0,248500000000002 0,248700000000002
-0,381499999999996 -0,393199999999999 -0,393699999999997
0,294099999999997 0,279299999999999 0,271299999999997
-0,223200000000001 -0,223699999999999 -0,227299999999997
0,0879999999999992 0,117300000000004 0,122500000000003
-0,167899999999999 -0,170999999999999 -0,174800000000003
-0,687499999999996 -0,697199999999998 -0,701600000000002
-0,681700000000002 -0,682200000000000 -0,683000000000000
1,19659999999999 1,19670000000000 1,19490000000000
-0,565500000000008 -0,565199999999999 -0,557400000000008
Downsampled data
column 1 column 2 column 3
-0,593600000000000 0,821900000000003 0,936300000000001
2,42180000000000 1,14610000000000 -0,255400000000000
1,78550000000000 2,86550000000000 3,66890000000000
-1,30590000000000 7,01950000000000 12,9564000000000
-0,707800000000003 3,05920000000000 0,852999999999998
-0,986500000000001 -0,372200000000000 -0,951000000000002
-0,989699999999999 -0,988000000000000 -1,21730000000000
1,23500000000000 5,79700000000000 3,40880000000000
0,122899999999998 5,32230000000000 5,19260000000000
0,938300000000003 4,88130000000000 7,55900000000000
0,248600000000004 4,79290000000000 2,96620000000000
-0,381499999999996 -0,400000000000000 0,641500000000000
0,294099999999997 -0,131400000000004 -1,20040000000000
-0,223200000000001 1,49610000000000 1,59030000000000
0,0879999999999992 0,418700000000000 -0,0114999999999976
-0,167899999999999 0,0149999999999983 -0,857500000000000
-0,687499999999996 -0,593100000000002 0,119700000000000
-0,681700000000002 -0,170000000000003 0,126799999999999
1,19659999999999 1,17670000000000 1,15780000000000
-0,565500000000008 8,89019999999999 6,58569999999999
A possible for your output is a periodic input signal with a period length of downsamp_rate-1. To give a short demonstration:
>> X=repmat(1:39,1,10);
>> downsampled_data = downsample(X,downsamp_rate);
>> downsampled_data
downsampled_data =
Columns 1 through 9
1 2 3 4 5 6 7 8 9
Column 10
10
Thus, take a look at your rows 40,41,42. I assume the first value is identical to your row 1,2,3

Creating open-high-low-close (ohlc) bars from tick data in Matlab

I have a CSV file 'XPQ12.csv' of futures tick data in the following form:
20090312 30:14.0 717.25 1 E
20090312 30:15.0 718.47 1 E
20090312 30:17.0 717.25 1 E
20090312 30:32.0 718.42 1 E
20090312 30:49.0 715.32 1 E
20090312 30:58.0 717.57 1 E
20090312 31:06.0 716.65 3 E
20090312 31:12.0 718.35 2 E
20090312 31:45.0 721.14 1 E
20090312 31:52.0 719.24 1 E
20090312 32:11.0 717.02 6 E
20090312 32:29.0 717.14 1 E
20090312 32:35.0 717.34 1 E
20090312 32:55.0 717.26 1 E
(The first column is the yearmonthdate, the second column is the minute:second:tenthofsecond, the third column is the price, the fourth column is the number of contracts traded, and the fifth indicates if the trade was electronic or in a pit). In my actual data set, I may have thousands of price quotes within any given minute.
I read the file using the following code:
fid = fopen('C:\Program Files\MATLAB\R2013a\XPQ12.csv','r');
[c] = fscanf(fid, '%d,%d:%d.%d,%f,%d,%c')
Which outputs:
20090312
30
14
0
717.25
1
69
20090312
30
15
0
718.47
3
69
.
.
.
(the 69s are the matlab representation for E I believe)
Now I want to cut this up into one minute ohlc bars, so that for each minute, I record what the first, highest, lowest, and last price was within that minute. I'd really like to know the best way to go about this.
My original idea was to store the sequence of minutes in a vector d, and while working through the data, each time the number at the end of d changed I would record the corresponding price as an open, record the previous price as a close for the last bar, and find the largest and smallest prices within each open and close.
c(2) is the first minute, so I said:
d(1)=c(2);
and then noting that I'd always be counting by 7 before getting to the next minute, I said:
Nrows = numel(textread('XPQ12.csv','%1c%*[^\n]')); % counts rows in file
for i=1:Nrows
if mod(i-2,7)== 0;
d(end+1)=c(i);
end
end
which should fill up d with all the minutes:
30
30
30
30
30
30
31
31
31
31
32
32
32
32
in the case of the example data. I'm kind of lost what to do from here, or if what I'm doing is on the right track.
From where you are:
Minutes = c(2:7:end);
MinuteValues=unique(Minutes);
Prices = c(5:7:end);
if (length(Prices)>length(Minutes))
Prices=Prices(1:length(Minutes));
elseif (length(Prices)<length(Minutes))
Minutes=Minutes(1:length(Prices));
OverflowValues=1+find(Minutes(2:end)==0 & Minutes(1:end-1)==59);
for v=length(OverflowValues):-1:1
Minutes(OverflowValues(v):end)=Minutes(OverflowValues(v):end)+60;
end
Highs=zeros(1,length(MinuteValues));
Lows=zeros(1,length(MinuteValues));
First=zeros(1,length(MinuteValues));
Last=zeros(1,length(MinuteValues));
for v=1:length(MinuteValues)
Highs(v) = max(Prices(Minutes==MinuteValues(v)));
Lows(v) = min(Prices(Minutes==MinuteValues(v)));
First(v) = Prices(find(Minutes==MinuteVales(v),1,'first'));
Last(v) = Prices(find(Minutes==MinuteVales(v),1,'last'));
end
Using textread would make this easier for you, as mentioned.
(If you are lost at this stage, I wouldn't find accumarray as mentioned in the comments is the best place to start!)
By the way, this is assuming that minutes increases above 60 and you don't have hours in there somewhere. Otherwise this won't work at all.

How can I searching for different variants of bioinformatics motifs in string, using Perl?

I have a program output with one tandem repeat in different variants. Is it possible to search (in a string) for the motif and to tell the program to find all variants with maximum "3" mismatches/insertions/deletions?
I will take a crack at this with the very limited information supplied.
First, a short friendly editorial:
<editorial>
Please learn how to ask a good question and how to be precise.
At a minimum, please:
Refrain from domain specific jargon such as "motif" and "tandem repeat" and "base pairs" without providing links or precise definitions;
Say what the goal is and what you have done so far;
Important: Provide clear examples of input and desired output.
It is not helpful to potential helpers on SO have to have to play 20 questions in comments to try and understand your question! I spent more time trying to figure out what you were asking than answering it.
</editorial>
The following program generates a string of 2 character pairs 5,428 pairs long in an array of 1,000 elements long. I realize it is more likely that you will be reading these from a file, but this is just an example. Obviously you would replace the random strings with your actual data from whatever source.
I do not know if 'AT','CG','TC','CA','TG','GC','GG' that I used are legitimate base pair combinations or not. (I slept through biology...) Just edit the map block pairs to legitimate pairs and change the 7 to the number of pairs if you want to generate legitimate random strings for testing.
If the substring at the offset point is 3 differences or less, the array element (a scalar value) is stored in an anonymous array in the value part of a hash. The key part of the hash is the substring that is a near match. Rather than array elements, the values could be file names, Perl data references or other relevant references you want to associate with your motif.
While I have just looked at character by character differences between the strings, you can put any specific logic that you need to look at by replacing the line foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); } with the comparison logic that works for your problem. I do not know how mismatches/insertions/deletions are represented in your string, so I leave that as an exercise to the reader. Perhaps Algorithm::Diff or String::Diff from CPAN?
It is easy to modify this program to have keyboard input for $target and $offset or have the string searched beginning to end rather than several strings at a fixed offset. Once again: it was not really clear what your goal is...
use strict; use warnings;
my #bps;
push(#bps,join('',map { ('AT','CG','TC','CA','TG','GC','GG')[rand 7] }
0..5428)) for(1..1_000);
my $len=length($bps[0]);
my $s_count= scalar #bps;
print "$s_count random strings generated $len characters long\n" ;
my $target="CGTCGCACAG";
my $offset=832;
my $nlen=length $target;
my %HoA;
my $diffs=0;
my #a2=split(//, $target);
substr($bps[-1], $offset, $nlen)=$target; #guarantee 1 match
substr($bps[-2], $offset, $nlen)="CATGGCACGG"; #anja example
foreach my $i (0..$#bps) {
my $cand=substr($bps[$i], $offset, $nlen);
my #a1=split(//, $cand);
$diffs=0;
foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); }
next if $diffs > 3;
push (#{$HoA{$cand}}, $i);
}
foreach my $hit (keys %HoA) {
my #a1=split(//, $hit);
$diffs=0;
my $ds="";
foreach my $j (0..$#a1) {
if($a1[$j] eq $a2[$j]) {
$ds.=" ";
} else {
$diffs++;
$ds.=$a1[$j];
}
}
print "Target: $target\n",
"Candidate: $hit\n",
"Differences: $ds $diffs differences\n",
"Array element: ";
foreach (#{$HoA{$hit}}) {
print "$_ " ;
}
print "\n\n";
}
Output:
1000 random strings generated 10858 characters long
Target: CGTCGCACAG
Candidate: CGTCGCACAG
Differences: 0 differences
Array element: 999
Target: CGTCGCACAG
Candidate: CGTCGCCGCG
Differences: CGC 3 differences
Array element: 696
Target: CGTCGCACAG
Candidate: CGTCGCCGAT
Differences: CG T 3 differences
Array element: 851
Target: CGTCGCACAG
Candidate: CGTCGCATGG
Differences: TG 2 differences
Array element: 986
Target: CGTCGCACAG
Candidate: CATGGCACGG
Differences: A G G 3 differences
Array element: 998
..several cut out..
Target: CGTCGCACAG
Candidate: CGTCGCTCCA
Differences: T CA 3 differences
Array element: 568 926
I believe that there are routines for this sort of thing in BioPerl.
In any case, you might get better answers if you asked this over at BioStar, the bioinformatics stack exchange.
When I was in my first couple years of learning perl, I wrote what I now consider to be a very inefficient (but functional) tandem repeat finder (which used to be available on my old job's company website) called tandyman. I wrote a fuzzy version of it a couple years later called cottonTandy. If I were to re-write it today, I would use hashes for a global search (given the allowed mistakes) and utilize pattern matching for a local search.
Here's an example of how you use it:
#!/usr/bin/perl
use Tandyman;
$sequence = "ATGCATCGTAGCGTTCAGTCGGCATCTATCTGACGTACTCTTACTGCATGAGTCTAGCTGTACTACGTACGAGCTGAGCAGCGTACgTG";
my $tandy = Tandyman->new(\$sequence,'n'); #Can't believe I coded it to take a scalar reference! Prob. fresh out of a cpp class when I wrote it.
$tandy->SetParams(4,2,3,3,4);
#The parameters are, in order:
# repeat unit size
# min number of repeat units to require a hit
# allowed mistakes per unit (an upper bound for "mistake concentration")
# allowed mistakes per window (a lower bound for "mistake concentration")
# number of units in a "window"
while(#repeat_info = $tandy->FindRepeat())
{print(join("\t",#repeat_info),"\n")}
The output of this test looks like this (and takes a horrendous 11 seconds to run):
25 32 TCTA 2 0.87 TCTA TCTG
58 72 CGTA 4 0.81 CTGTA CTA CGTA CGA
82 89 CGTA 2 0.87 CGTA CGTG
45 51 TGCA 2 0.87 TGCA TGA
65 72 ACGA 2 0.87 ACGT ACGA
23 29 CTAT 2 0.87 CAT CTAT
36 45 TACT 3 0.83 TACT CT TACT
24 31 ATCT 2 1 ATCT ATCT
51 59 AGCT 2 0.87 AGTCT AGCT
33 39 ACGT 2 0.87 ACGT ACT
62 72 ACGT 3 0.83 ACT ACGT ACGA
80 88 ACGT 2 0.87 AGCGT ACGT
81 88 GCGT 2 0.87 GCGT ACGT
63 70 CTAC 2 0.87 CTAC GTAC
32 38 GTAC 2 0.87 GAC GTAC
60 74 GTAC 4 0.81 GTAC TAC GTAC GAGC
23 30 CATC 2 0.87 CATC TATC
71 82 GAGC 3 0.83 GAGC TGAGC AGC
1 7 ATGC 2 0.87 ATGC ATC
54 60 CTAG 2 0.87 CTAG CTG
15 22 TCAG 2 0.87 TCAG TCGG
70 81 CGAG 3 0.83 CGAG CTGAG CAG
44 50 CATG 2 0.87 CTG CATG
25 32 TCTG 2 0.87 TCTA TCTG
82 89 CGTG 2 0.87 CGTA CGTG
55 73 TACG 5 0.75 TAGCTG TAC TACG TACG AG
69 83 AGCG 4 0.81 ACG AGCTG AGC AGCG
15 22 TCGG 2 0.87 TCAG TCGG
As you can see, it allows indels and SNPs. The columns are, in order:
Start position
Stop position
Consensus sequence
The number of units found
A quality metric out of 1
The repeat units separated by spaces
Note, that it's easy to supply parameters (as you can see from the output above) that will output junk/insignificant "repeats", but if you know how to supply good params, it can find what you set it upon finding.
Unfortunately, the package is not publicly available. I never bothered to make it available since it's so slow and not amenable to even prokaryotic-sized genome searches (though it would be workable for individual genes). In my novice coding days, I had started to add a feature to take a "state" as input so that I could run it on sections of a sequence in parallel and I never finished that once I learned hashes would make it so much faster. By that point, I had moved on to other projects. But if it would suit your needs, message me, I can email you a copy.
It's just shy of 1000 lines of code, but it has lots of bells & whistles, such as the allowance of IUPAC ambiguity codes (BDHVRYKMSWN). It works for both amino acids and nucleic acids. It filters out internal repeats (e.g. does not report TTTT or ATAT as 4nt consensuses).