Regular expression for MS SQL - filtering

I want to group these numbers by their first two digits into the following categories. I am confused because some of the values repeat.
Thank you for your help.
210690, 391910, 392490, 880390, 847321, 940290, 300420, 300410, 901890, 901890, 030269, 080530, 630399
1-5
6-14
15
16-24
25-27
28-38
39-40
41-43
44-46
47-49
50-63
64-67
68-70
71
72-83
84-85
86-89
90-92
93
94-96
97
98-99
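A minimal sketch of the bucketing logic, written in Python rather than T-SQL only to keep the illustration self-contained. The ranges are copied from the list above, reading the third entry as 15 and the eighth as 41-43, which is an assumption; the names `RANGES` and `category` are hypothetical. In SQL Server the same mapping could probably be written as a CASE expression over LEFT(code, 2), without any regular expression.

```python
from bisect import bisect_right

# (low, high) ranges for the first two digits, taken from the question.
# Assumption: the entries "5" and "41-41" are read as 15 and 41-43.
RANGES = [(1, 5), (6, 14), (15, 15), (16, 24), (25, 27), (28, 38), (39, 40),
          (41, 43), (44, 46), (47, 49), (50, 63), (64, 67), (68, 70), (71, 71),
          (72, 83), (84, 85), (86, 89), (90, 92), (93, 93), (94, 96), (97, 97),
          (98, 99)]
LOWS = [low for low, _ in RANGES]


def category(code: str) -> str:
    """Return the range label for the first two digits of a code."""
    prefix = int(code.strip()[:2])
    i = bisect_right(LOWS, prefix) - 1      # last range starting at or below prefix
    if i < 0 or prefix > RANGES[i][1]:
        return "unknown"
    low, high = RANGES[i]
    return str(low) if low == high else f"{low}-{high}"


codes = ["210690", "391910", "880390", "030269", "080530", "630399"]
labels = {code: category(code) for code in codes}
```

Repeated codes such as 901890 simply map to the same label twice, so duplicates do not need special handling.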

How to get the difference between two columns using Talend

I have two Excel sheets that I want to compare using a Talend job.
First Excel sheet, named Compare_Me_1:
PN          STT  Designation
AY73101000  20   RC0402FR-0743K2L
AY73101000  22   RK73H1ETTP4322F
AY73101000  22   ERJ-2RKF4322X
Ac2566      70   CRCW040243K2FKED
Second Excel sheet, named Compare_Me_2:
PN          STT  Designation
AY73101000  20   RC0402FR-0743K2L
AY73101000  22   RK73H1ETTP4322F
AY73101000  21   ERJ-2RKF4322X
Ac2566      70   CRCW040243K2FKED
What I want to achieve is this output:
PN1         STT1  STT2  STT_OK_Ko  Designation1      Designation2      Designation_Ok_Ko
AY73101000  20    20    ok         RC0402FR-0743K2L  RC0402FR-0743K2L  ok
AY73101000  22    22    ok         RK73H1ETTP4322F   RK73H1ETTP4322F   ok
AY73101000  22    21    ko         ERJ-2RKF4322X     ERJ-2RKF4322X     ok
Ac2566      70    70    ok         CRCW040243K2FKED  CRCW040243K2FKED  ok
To achieve this, I developed a Talend job.
In my tMap, I linked PN with a left outer join and the All Matches match model.
To get, for example, STT_OK_Ko, I used the code below to compare my two inputs:
(!Relational.ISNULL(row14.STT) && !Relational.ISNULL(row13.STT) &&
row14.STT.equals(row13.STT) ) ||
(Relational.ISNULL(row14.STT) && Relational.ISNULL(row13.STT))
?"ok":"ko"
Is this the correct way to achieve my output? If not, please recommend another method.
Any suggestion is welcome.
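For reference, the null-safe comparison in the expression above can be sketched outside Talend as follows. This is an illustration in Python, not Talend code: `ok_ko` and `compare` are hypothetical names, and pairing the rows by position assumes both sheets list their rows in the same order. That assumption matters because PN is not unique in the sample data, so a join on PN alone that returns all matches would pair each AY73101000 row with every AY73101000 row on the other side.

```python
from typing import Optional


def ok_ko(left: Optional[str], right: Optional[str]) -> str:
    """Return "ok" when both values are equal or both are missing, else "ko"."""
    if left is None and right is None:
        return "ok"
    if left is None or right is None:
        return "ko"
    return "ok" if left == right else "ko"


sheet1 = [
    ("AY73101000", "20", "RC0402FR-0743K2L"),
    ("AY73101000", "22", "RK73H1ETTP4322F"),
    ("AY73101000", "22", "ERJ-2RKF4322X"),
    ("Ac2566", "70", "CRCW040243K2FKED"),
]
sheet2 = [
    ("AY73101000", "20", "RC0402FR-0743K2L"),
    ("AY73101000", "22", "RK73H1ETTP4322F"),
    ("AY73101000", "21", "ERJ-2RKF4322X"),
    ("Ac2566", "70", "CRCW040243K2FKED"),
]


def compare(rows1, rows2):
    """Pair rows by position (an assumption) and flag each compared column."""
    result = []
    for (pn1, stt1, des1), (_pn2, stt2, des2) in zip(rows1, rows2):
        result.append((pn1, stt1, stt2, ok_ko(stt1, stt2),
                       des1, des2, ok_ko(des1, des2)))
    return result


report = compare(sheet1, sheet2)
```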
You probably need to follow the long steps below :

delete rows with character in cell array

I need some basic help. I have a cell array:
TITLE 13122423
NAME Bob
PROVIDER James
and many more rows with text...
234 456 234 345
324 346 234 345
344 454 462 435
and many MANY (>4000) more with only numbers
text
text
and more text and mixed entries
Now what I want is to delete all the rows where the first column contains a letter, and end up with only those rows containing numbers (rows 44 - 46 in this example).
I tried to use
rawdataTruncated(strncmp(rawdataTruncated(:, 1), 'A', 1), :) = [];
but then I need to go through the whole alphabet, right?
Given data of the form:
C = {'FIRSTX' '350.0000' '' '' ; ...
'350.0000' '0.226885' '254.409' '0.755055'; ...
'349.9500' '0.214335' '254.41' '0.755073'; ...
'250.0000' 'LASTX' '' '' };
You can remove any row that has character strings containing letters using isstrprop, cellfun, and any like so:
index = ~any(cellfun(@any, isstrprop(C, 'alpha')), 2);
C = C(index, :)
C =
2×4 cell array
'350.0000' '0.226885' '254.409' '0.755055'
'349.9500' '0.214335' '254.41' '0.755073'
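The same filtering idea, sketched in Python for comparison: a row is kept only if none of its cells contains an alphabetic character. The names `rows`, `has_letters` and `numeric_rows` are hypothetical. As with the `isstrprop(C, 'alpha')` test, a number written in exponent notation such as 1e5 would count as containing a letter.

```python
rows = [
    ["FIRSTX", "350.0000", "", ""],
    ["350.0000", "0.226885", "254.409", "0.755055"],
    ["349.9500", "0.214335", "254.41", "0.755073"],
    ["250.0000", "LASTX", "", ""],
]


def has_letters(cell: str) -> bool:
    """True if any character in the cell is alphabetic."""
    return any(ch.isalpha() for ch in cell)


# Keep only rows in which no cell contains a letter.
numeric_rows = [row for row in rows if not any(has_letters(c) for c in row)]
```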

Behavior of contradicting soft constraints

I have a testcase in which the behavior seems wrong. I see that in all generations the num_of_red_shoes is high, while I would expect a more even distribution. What is the cause of this behavior and how can it be fixed?
<'
struct closet {
kind:[SMALL,BIG];
num_of_shoes:uint;
num_of_red_shoes:uint;
num_of_black_shoes:uint;
num_of_yellow_shoes:uint;
keep soft num_of_red_shoes < 10;
keep soft num_of_black_shoes < 10;
keep soft num_of_yellow_shoes < 10;
keep num_of_yellow_shoes + num_of_black_shoes + num_of_red_shoes == num_of_shoes;
when BIG closet {
keep num_of_shoes in [50..100];
};
};
extend sys {
closets[100]:list of BIG closet;
};
'>
Generation results:
item type kind num_of_sh* num_of_re* num_of_bl* num_of_ye*
---------------------------------------------------------------------------
0. BIG closet BIG 78 73 1 4
1. BIG closet BIG 67 50 8 9
2. BIG closet BIG 73 68 0 5
3. BIG closet BIG 73 66 3 4
4. BIG closet BIG 51 50 0 1
5. BIG closet BIG 78 76 1 1
6. BIG closet BIG 55 43 7 5
7. BIG closet BIG 88 87 1 0
8. BIG closet BIG 99 84 6 9
9. BIG closet BIG 92 92 0 0
10. BIG closet BIG 63 55 3 5
11. BIG closet BIG 59 50 9 0
12. BIG closet BIG 51 44 2 5
13. BIG closet BIG 82 76 1 5
14. BIG closet BIG 81 74 2 5
15. BIG closet BIG 97 93 2 2
16. BIG closet BIG 54 41 8 5
17. BIG closet BIG 55 44 5 6
18. BIG closet BIG 70 55 9 6
19. BIG closet BIG 63 57 1 5
When there are contradicting soft constraints, Specman does not randomize the softs which are enforced, but rather gives priority to the constraints which were written last. Since the soft on red shoes was first in the test, it is the one which is always overridden.
If the softs are known to be mutually exclusive (which is not the case here) you could use a simple flag to randomly choose which soft should hold. e.g. the code would look like this:
flag:uint[0..2];
keep soft read_only(flag==0) => num_of_red_shoes < 10;
keep soft read_only(flag==1) => num_of_black_shoes < 10;
keep soft read_only(flag==2) => num_of_yellow_shoes < 10;
However, since it is not known in advance how many softs are expected to hold here (two or even all three may be satisfied), a more complex solution is needed. Here is code that performs this randomization:
struct closet {
kind:[SMALL,BIG];
num_of_shoes:uint;
num_of_red_shoes:uint;
num_of_black_shoes:uint;
num_of_yellow_shoes:uint;
//replaces the original soft constraints (if a flag is true, the corresponding
//right-side implication will be enforced)
soft_flags[3]:list of bool;
keep for each in soft_flags {
soft it == TRUE;
};
//this list is used to shuffle the flags so their enforcement will be done
//with even distribution
soft_indices:list of uint;
keep soft_indices.is_a_permutation({0;1;2});
keep soft_flags[soft_indices[0]] => num_of_red_shoes < 10;
keep soft_flags[soft_indices[1]] => num_of_black_shoes < 10;
keep soft_flags[soft_indices[2]] => num_of_yellow_shoes < 10;
keep num_of_yellow_shoes + num_of_black_shoes + num_of_red_shoes == num_of_shoes;
};
I'm not with Cadence, so I can't give you a definite answer. I think the solver will try to break as few constraints as possible and it just chooses the first one it finds (in your case the one for red shoes). Try changing the order and see if this changes (if the black constraint is first, I'd think you'll always get more black shoes).
As a solution, you could just remove the soft constraints when you have a big closet:
when BIG closet {
keep num_of_red_shoes.reset_soft();
keep num_of_black_shoes.reset_soft();
keep num_of_yellow_shoes.reset_soft();
keep num_of_shoes in [50..100];
};
If you want to randomly choose which one of them to disable (sometimes more than 10 red shoes, sometimes more than 10 black shoes, etc.), then you'll need a helper field:
when BIG closet {
more_shoes : [ RED, BLACK, YELLOW ];
keep more_shoes == RED => num_of_red_shoes.reset_soft();
keep more_shoes == BLACK => num_of_black_shoes.reset_soft();
keep more_shoes == YELLOW => num_of_yellow_shoes.reset_soft();
keep num_of_shoes in [50..100];
};
It depends on what you mean by "a more even distribution".
There is no way to satisfy all of your hard and soft constraints for a BIG closet. Therefore Specman attempts to find a solution by ignoring some of your soft constraints. The IntelliGen constraint solver doesn't ignore all of the soft constraints, but attempts to find a solution while still using a subset. This is explained in the "Specman Generation User Guide" (sn_igenuser.pdf):
"[S]oft constraints that are loaded later are considered to have a higher priority than soft constraints loaded previously."
In this case that means that Specman discards the soft constraint on red shoes and since it can find a solution still obeying the other soft constraints it does not discard them.
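The priority rule quoted above can be illustrated with a toy model. This is a sketch in Python of the documented rule only, not of how IntelliGen actually solves constraints: the names `feasible` and `surviving_softs` are hypothetical, and the model assumes a field without a soft constraint can take any value needed to reach the total.

```python
SOFT_LIMIT = 9    # "x < 10" on a uint means x <= 9
MIN_TOTAL = 50    # lower bound of num_of_shoes for a BIG closet


def feasible(kept, fields):
    """Can the fields still sum to at least MIN_TOTAL with these softs kept?"""
    if any(field not in kept for field in fields):
        return True                      # an unconstrained field absorbs the rest
    return SOFT_LIMIT * len(fields) >= MIN_TOTAL


def surviving_softs(load_order):
    """Visit softs from last loaded to first; keep each one that still fits."""
    kept = set()
    for field in reversed(load_order):   # later-loaded soft = higher priority
        if feasible(kept | {field}, load_order):
            kept.add(field)
    return [field for field in load_order if field in kept]


result = surviving_softs(["red", "black", "yellow"])
```

With the load order red, black, yellow, the softs on yellow and black are kept and the one on red is dropped, which matches the generation results in the question. Putting black first in the load order would drop the soft on black instead.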
If you combine all of your soft constraints into one, then you will probably get the result you were hoping for:
keep soft ((num_of_red_shoes < 10) and (num_of_black_shoes < 10) and
(num_of_yellow_shoes < 10));
There are advantages to giving later constraints priority: This means that using AOP you can add new soft constraints and they will get the highest priority.
For more distributed values, I would suggest the following.
I'm sure you can follow the intended logic too.
var1, var2 : uint;
keep var1 in [0..30];
keep var2 in [0..30];
when BIG closet {
keep num_of_shoes in [50..100];
keep num_of_yellow_shoes == (num_of_shoes/3) - 15 + var1;
keep num_of_black_shoes == (num_of_shoes - num_of_yellow_shoes)/2 - 15 + var2;
keep num_of_red_shoes == num_of_shoes - (num_of_yellow_shoes + num_of_black_shoes);
keep gen (var1, var2) before (num_of_shoes);
keep gen (num_of_shoes) before (num_of_yellow_shoes, num_of_black_shoes, num_of_red_shoes);
keep gen (num_of_yellow_shoes) before (num_of_black_shoes, num_of_red_shoes);
keep gen (num_of_black_shoes) before (num_of_red_shoes);
};

Example of 'a subroutine may have several entry and exit points'

I'm reading an article on non-structured programming, and it says:
Unlike a procedure, a subroutine may have several entry and exit points, and a direct jump into or out of subroutine is (theoretically) allowed
I can't understand it. Could anyone give me a code sample of:
a subroutine may have several entry and exit points
a direct jump into or out of subroutine
Thanks
10 A = 1
20 GOSUB 100
30 A = 2
40 GOSUB 110
50 A = 3
60 GOTO 130
70 END
100 PRINT A
110 PRINT "HELLO"
120 IF A = 1 THEN RETURN
130 PRINT "THERE"
140 IF A = 3 THEN GOTO 70
150 RETURN
The subroutine has three entry points (lines 100, 110, and 130) and three exit points (lines 120, 140, and 150). There is a direct jump into line 130 (from line 60) and a direct jump out (at line 140).
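To see the control flow concretely, here is a small trace of the listing, sketched in Python. It is a hand-written simulation of this one program (not a BASIC interpreter); `run_program` is a hypothetical name, and PRINT is modelled as appending a plain string, ignoring any padding a real BASIC dialect might add around numbers.

```python
def run_program():
    """Simulate the BASIC listing; return the printed lines and the GOSUB stack."""
    lines = [10, 20, 30, 40, 50, 60, 70, 100, 110, 120, 130, 140, 150]
    following = dict(zip(lines, lines[1:]))   # default fall-through order
    output, return_stack = [], []
    a, pc = 0, 10
    while pc != 70:                           # line 70 is END
        target = following.get(pc)            # next line unless we jump
        if pc == 10:
            a = 1
        elif pc == 20:                        # GOSUB 100: first entry point
            return_stack.append(30)
            target = 100
        elif pc == 30:
            a = 2
        elif pc == 40:                        # GOSUB 110: second entry point
            return_stack.append(50)
            target = 110
        elif pc == 50:
            a = 3
        elif pc == 60:                        # GOTO 130: direct jump in
            target = 130
        elif pc == 100:
            output.append(str(a))
        elif pc == 110:
            output.append("HELLO")
        elif pc == 120 and a == 1:            # first exit point
            target = return_stack.pop()
        elif pc == 130:
            output.append("THERE")
        elif pc == 140 and a == 3:            # direct jump out
            target = 70
        elif pc == 150:                       # last exit point
            target = return_stack.pop()
        pc = target
    return output, return_stack


printed, leftover_stack = run_program()
```

The third pass enters at line 130 through GOTO, so nothing is pushed on the return stack, and it has to leave through the GOTO at line 140 rather than through RETURN.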

How can I search for different variants of bioinformatics motifs in a string, using Perl?

I have a program output with one tandem repeat in different variants. Is it possible to search (in a string) for the motif and to tell the program to find all variants with maximum "3" mismatches/insertions/deletions?
I will take a crack at this with the very limited information supplied.
First, a short friendly editorial:
<editorial>
Please learn how to ask a good question and how to be precise.
At a minimum, please:
Refrain from domain specific jargon such as "motif" and "tandem repeat" and "base pairs" without providing links or precise definitions;
Say what the goal is and what you have done so far;
Important: Provide clear examples of input and desired output.
It is not helpful when potential helpers on SO have to play 20 questions in comments to try and understand your question! I spent more time trying to figure out what you were asking than answering it.
</editorial>
The following program generates an array of 1,000 random strings, each made of 5,429 two-character pairs (10,858 characters). I realize it is more likely that you will be reading these from a file, but this is just an example. Obviously you would replace the random strings with your actual data from whatever source.
I do not know if 'AT','CG','TC','CA','TG','GC','GG' that I used are legitimate base pair combinations or not. (I slept through biology...) Just edit the map block pairs to legitimate pairs and change the 7 to the number of pairs if you want to generate legitimate random strings for testing.
If the substring at the offset point is 3 differences or less, the array element (a scalar value) is stored in an anonymous array in the value part of a hash. The key part of the hash is the substring that is a near match. Rather than array elements, the values could be file names, Perl data references or other relevant references you want to associate with your motif.
While I have just looked at character by character differences between the strings, you can put any specific logic that you need to look at by replacing the line foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); } with the comparison logic that works for your problem. I do not know how mismatches/insertions/deletions are represented in your string, so I leave that as an exercise to the reader. Perhaps Algorithm::Diff or String::Diff from CPAN?
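As one possible shape for that replacement comparison logic, here is a sketch in Python (chosen only for illustration; the program in this answer is Perl). `hamming` counts character-by-character differences, as the program does, and `edit_distance` is the standard Levenshtein distance, which also counts insertions and deletions. Both function names are hypothetical, and whether Levenshtein distance is the right measure for your motifs is an assumption.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: substitutions, insertions and deletions cost 1."""
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, x in enumerate(a, start=1):
        cur = [i]                        # distance from a[:i] to ""
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,               # delete x
                           cur[j - 1] + 1,            # insert y
                           prev[j - 1] + (x != y)))   # match or substitute
        prev = cur
    return prev[-1]


target = "CGTCGCACAG"
candidate = "CATGGCACGG"
is_hit = edit_distance(target, candidate) <= 3
```

A candidate would then count as a hit when its distance to the target is 3 or less, in the same place where the program tests the number of differences.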
It is easy to modify this program to have keyboard input for $target and $offset or have the string searched beginning to end rather than several strings at a fixed offset. Once again: it was not really clear what your goal is...
use strict; use warnings;
my @bps;
push(@bps,join('',map { ('AT','CG','TC','CA','TG','GC','GG')[rand 7] }
    0..5428)) for(1..1_000);
my $len=length($bps[0]);
my $s_count= scalar @bps;
print "$s_count random strings generated $len characters long\n" ;
my $target="CGTCGCACAG";
my $offset=832;
my $nlen=length $target;
my %HoA;
my $diffs=0;
my @a2=split(//, $target);
substr($bps[-1], $offset, $nlen)=$target; #guarantee 1 match
substr($bps[-2], $offset, $nlen)="CATGGCACGG"; #anja example
foreach my $i (0..$#bps) {
    my $cand=substr($bps[$i], $offset, $nlen);
    my @a1=split(//, $cand);
    $diffs=0;
    foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); }
    next if $diffs > 3;
    push (@{$HoA{$cand}}, $i);
}
foreach my $hit (keys %HoA) {
    my @a1=split(//, $hit);
    $diffs=0;
    my $ds="";
    foreach my $j (0..$#a1) {
        if($a1[$j] eq $a2[$j]) {
            $ds.=" ";
        } else {
            $diffs++;
            $ds.=$a1[$j];
        }
    }
    print "Target: $target\n",
        "Candidate: $hit\n",
        "Differences: $ds $diffs differences\n",
        "Array element: ";
    foreach (@{$HoA{$hit}}) {
        print "$_ " ;
    }
    print "\n\n";
}
Output:
1000 random strings generated 10858 characters long
Target: CGTCGCACAG
Candidate: CGTCGCACAG
Differences: 0 differences
Array element: 999
Target: CGTCGCACAG
Candidate: CGTCGCCGCG
Differences: CGC 3 differences
Array element: 696
Target: CGTCGCACAG
Candidate: CGTCGCCGAT
Differences: CG T 3 differences
Array element: 851
Target: CGTCGCACAG
Candidate: CGTCGCATGG
Differences: TG 2 differences
Array element: 986
Target: CGTCGCACAG
Candidate: CATGGCACGG
Differences: A G G 3 differences
Array element: 998
..several cut out..
Target: CGTCGCACAG
Candidate: CGTCGCTCCA
Differences: T CA 3 differences
Array element: 568 926
I believe that there are routines for this sort of thing in BioPerl.
In any case, you might get better answers if you asked this over at BioStar, the bioinformatics stack exchange.
When I was in my first couple years of learning perl, I wrote what I now consider to be a very inefficient (but functional) tandem repeat finder (which used to be available on my old job's company website) called tandyman. I wrote a fuzzy version of it a couple years later called cottonTandy. If I were to re-write it today, I would use hashes for a global search (given the allowed mistakes) and utilize pattern matching for a local search.
Here's an example of how you use it:
#!/usr/bin/perl
use Tandyman;
$sequence = "ATGCATCGTAGCGTTCAGTCGGCATCTATCTGACGTACTCTTACTGCATGAGTCTAGCTGTACTACGTACGAGCTGAGCAGCGTACgTG";
my $tandy = Tandyman->new(\$sequence,'n'); #Can't believe I coded it to take a scalar reference! Prob. fresh out of a cpp class when I wrote it.
$tandy->SetParams(4,2,3,3,4);
#The parameters are, in order:
# repeat unit size
# min number of repeat units to require a hit
# allowed mistakes per unit (an upper bound for "mistake concentration")
# allowed mistakes per window (a lower bound for "mistake concentration")
# number of units in a "window"
while(@repeat_info = $tandy->FindRepeat())
{print(join("\t",@repeat_info),"\n")}
The output of this test looks like this (and takes a horrendous 11 seconds to run):
25 32 TCTA 2 0.87 TCTA TCTG
58 72 CGTA 4 0.81 CTGTA CTA CGTA CGA
82 89 CGTA 2 0.87 CGTA CGTG
45 51 TGCA 2 0.87 TGCA TGA
65 72 ACGA 2 0.87 ACGT ACGA
23 29 CTAT 2 0.87 CAT CTAT
36 45 TACT 3 0.83 TACT CT TACT
24 31 ATCT 2 1 ATCT ATCT
51 59 AGCT 2 0.87 AGTCT AGCT
33 39 ACGT 2 0.87 ACGT ACT
62 72 ACGT 3 0.83 ACT ACGT ACGA
80 88 ACGT 2 0.87 AGCGT ACGT
81 88 GCGT 2 0.87 GCGT ACGT
63 70 CTAC 2 0.87 CTAC GTAC
32 38 GTAC 2 0.87 GAC GTAC
60 74 GTAC 4 0.81 GTAC TAC GTAC GAGC
23 30 CATC 2 0.87 CATC TATC
71 82 GAGC 3 0.83 GAGC TGAGC AGC
1 7 ATGC 2 0.87 ATGC ATC
54 60 CTAG 2 0.87 CTAG CTG
15 22 TCAG 2 0.87 TCAG TCGG
70 81 CGAG 3 0.83 CGAG CTGAG CAG
44 50 CATG 2 0.87 CTG CATG
25 32 TCTG 2 0.87 TCTA TCTG
82 89 CGTG 2 0.87 CGTA CGTG
55 73 TACG 5 0.75 TAGCTG TAC TACG TACG AG
69 83 AGCG 4 0.81 ACG AGCTG AGC AGCG
15 22 TCGG 2 0.87 TCAG TCGG
As you can see, it allows indels and SNPs. The columns are, in order:
Start position
Stop position
Consensus sequence
The number of units found
A quality metric out of 1
The repeat units separated by spaces
Note that it's easy to supply parameters (as you can see from the output above) that will output junk/insignificant "repeats", but if you know how to supply good parameters, it can find what you are looking for.
Unfortunately, the package is not publicly available. I never bothered to make it available since it's so slow and not amenable to even prokaryotic-sized genome searches (though it would be workable for individual genes). In my novice coding days, I had started to add a feature to take a "state" as input so that I could run it on sections of a sequence in parallel and I never finished that once I learned hashes would make it so much faster. By that point, I had moved on to other projects. But if it would suit your needs, message me, I can email you a copy.
It's just shy of 1000 lines of code, but it has lots of bells & whistles, such as the allowance of IUPAC ambiguity codes (BDHVRYKMSWN). It works for both amino acids and nucleic acids. It filters out internal repeats (e.g. does not report TTTT or ATAT as 4nt consensuses).