How to get the difference between two columns using Talend

I have two Excel sheets that I want to compare using a Talend job.
The first Excel file is named Compare_Me_1:

PN         | STT | Designation
AY73101000 | 20  | RC0402FR-0743K2L
AY73101000 | 22  | RK73H1ETTP4322F
AY73101000 | 22  | ERJ-2RKF4322X
Ac2566     | 70  | CRCW040243K2FKED

The second Excel file is named Compare_Me_2:

PN         | STT | Designation
AY73101000 | 20  | RC0402FR-0743K2L
AY73101000 | 22  | RK73H1ETTP4322F
AY73101000 | 21  | ERJ-2RKF4322X
Ac2566     | 70  | CRCW040243K2FKED
What I want to achieve is this output:

PN1        | STT1 | STT2 | STT_OK_Ko | Designation1     | Designation2     | Designation_Ok_Ko
AY73101000 | 20   | 20   | ok        | RC0402FR-0743K2L | RC0402FR-0743K2L | ok
AY73101000 | 22   | 22   | ok        | RK73H1ETTP4322F  | RK73H1ETTP4322F  | ok
AY73101000 | 22   | 21   | ko        | ERJ-2RKF4322X    | ERJ-2RKF4322X    | ok
Ac2566     | 70   | 70   | ok        | CRCW040243K2FKED | CRCW040243K2FKED | ok
To achieve this, I developed a Talend job that looks like the one below:
In my tMap I joined the two inputs on PN with a left outer join and the "All matches" match model.
To compute STT_OK_Ko, for example, I used the expression below to compare my two inputs:

(!Relational.ISNULL(row14.STT) && !Relational.ISNULL(row13.STT) && row14.STT.equals(row13.STT))
|| (Relational.ISNULL(row14.STT) && Relational.ISNULL(row13.STT))
? "ok" : "ko"

Is this the correct way to achieve my output? If not, please recommend another method.
Any suggestion is welcome.
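For reference, the expression above boils down to a null-safe equality check, and the same logic could live in a reusable Talend user routine. A minimal sketch (the class name, method name, and the row13/row14 references are only illustrative, not part of the original job):

// Illustrative Talend user routine; CompareUtils and okKo are made-up names.
// java.util.Objects.equals(a, b) performs the same null-safe test.
public class CompareUtils {

    // "ok" when both values are equal or both are null, "ko" otherwise.
    public static String okKo(Object a, Object b) {
        boolean bothNull  = (a == null && b == null);
        boolean bothEqual = (a != null && a.equals(b));
        return (bothNull || bothEqual) ? "ok" : "ko";
    }
}

With such a routine, the tMap expression shrinks to something like CompareUtils.okKo(row13.STT, row14.STT), which is easier to repeat for Designation and any further columns.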

You probably need to follow the long steps below:

Related

Regular expression for MS SQL

I want to arrange these numbers, by their first two digits, into the following categories. I am confused because the pattern repeats itself.
Thank you for your help.
210690, 391910, 392490, 880390, 847321, 940290, 300420, 300410, 901890, 901890, 030269,080530, 630399
1-5
6-14
15
16-24
25-27
28-38
39-40
41-43
44-46
47-49
50-63
64-67
68-70
71
72-83
84-85
86-89
90-92
93
94-96
97
98-99
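As an illustration only, and in Java rather than T-SQL or a regular expression (the class and method names are made up for this sketch), the first two digits of each code can be mapped onto the ranges listed above like this:

public class ChapterBuckets {

    // Chapter ranges copied from the list above, as {lower, upper} pairs.
    private static final int[][] RANGES = {
        {1, 5}, {6, 14}, {15, 15}, {16, 24}, {25, 27}, {28, 38}, {39, 40},
        {41, 43}, {44, 46}, {47, 49}, {50, 63}, {64, 67}, {68, 70}, {71, 71},
        {72, 83}, {84, 85}, {86, 89}, {90, 92}, {93, 93}, {94, 96}, {97, 97}, {98, 99}
    };

    // Map a code such as "210690" to the range its first two digits fall into.
    static String bucket(String code) {
        int chapter = Integer.parseInt(code.substring(0, 2));
        for (int[] r : RANGES) {
            if (chapter >= r[0] && chapter <= r[1]) {
                return r[0] == r[1] ? String.valueOf(r[0]) : r[0] + "-" + r[1];
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        String[] codes = {"210690", "391910", "392490", "880390", "847321", "940290",
                          "300420", "300410", "901890", "901890", "030269", "080530", "630399"};
        for (String code : codes) {
            System.out.println(code + " -> " + bucket(code));
        }
    }
}

For the sample codes above this prints, for example, 210690 -> 16-24 and 030269 -> 1-5; the same bucketing could be expressed in SQL with a CASE expression over the first two characters of each code.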

Calculate routes via GET - Polyline incorrect

I am testing the Routing API (https://developer.here.com/documentation/routing-api/api-reference-swagger.html) and everything seems to work as expected except the polyline.
Here is my service invocation:
curl --location --request GET 'https://router.hereapi.com/v8/routes?transportMode=car&origin=3.4844257,-76.5256287&destination=3.478483,-76.517984&routingMode=fast&return=elevation,polyline,actions,instructions,summary&apiKey=1234'
The response includes the following field:
"polyline": "B2Fkt10Grm4-xE8yTU4mBnBnG8pBAzP0tBAjD4IA_E8LAvMokBT3IkXTzF8QTnG8QTvMsiBTrJjDA3_BjXoB_sB_OAriB3NAzKjDAnLsTAnLkhBnBzKsdA_J8fTnLkcT3DwHAvC8GArJoQAnL8QA7LsTA7V0jBT_M8UA",
As you know the polyline field is encoded.
Following the documentation, I decoded it using the library/code suggested here:
https://github.com/heremaps/flexible-polyline/tree/master/java
The result of decoding the field is not correct: the returned list of points (lat, long, elevation) does not match the actual location. In this example the coordinates are in Colombia, but after decoding, the result is a list of points in the middle of the Atlantic Ocean.
Furthermore, to rule out a library issue, I also tried decoding the polyline with other decoders:
https://developers.google.com/maps/documentation/utilities/polylineutility
https://open-polyline-decoder.60devs.com/
And the result is the same.
So the problem seems to be on the HERE API side (Routing API v8).
Any ideas? Maybe I am invoking the API incorrectly.
The decoder at https://github.com/heremaps/flexible-polyline/tree/master/java works correctly. Please see the code below, run with your encoded string:
// decode(...) and LatLngZ come from the flexible-polyline library's
// PolylineEncoderDecoder class (assumed to be imported here).
private void testSOLatLngDecoding() {
    List<LatLngZ> computed = decode("B2Fkt10Grm4-xE8yTU4mBnBnG8pBAzP0tBAjD4IA_E8LAvMokBT3IkXTzF8QTnG8QTvMsiBTrJjDA3_BjXoB_sB_OAriB3NAzKjDAnLsTAnLkhBnBzKsdA_J8fTnLkcT3DwHAvC8GArJoQAnL8QA7LsTA7V0jBT_M8UA");
    List<String> seqCrds = new ArrayList<>();
    for (int i = 0; i < computed.size(); ++i) {
        LatLngZ c = computed.get(i);
        List<String> crds = new ArrayList<>();
        crds.add(String.valueOf(c.lat));
        crds.add(String.valueOf(c.lng));
        crds.add(String.valueOf(c.z));
        seqCrds.add(String.join(",", crds));
        //assertEquals(computed.get(i), pairs.get(i));
    }
    System.out.println(String.join(",", seqCrds));
}
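A quick additional sanity check (sketch only: checkEndpoints and closeTo are made-up names, and this assumes the same imports and the same flexible-polyline decode/LatLngZ helpers as the method above) is to compare the first and last decoded points against the origin and destination sent in the request:

// Sketch: verify the decoded shape starts near the requested origin and ends
// near the requested destination (coordinates taken from the request in the question).
private static boolean closeTo(double a, double b) {
    return Math.abs(a - b) < 1e-3;   // roughly 100 m of tolerance
}

private void checkEndpoints(String encodedPolyline) {
    List<LatLngZ> pts = decode(encodedPolyline);
    LatLngZ first = pts.get(0);
    LatLngZ last  = pts.get(pts.size() - 1);
    System.out.println("origin ok:      " + (closeTo(first.lat, 3.4844257) && closeTo(first.lng, -76.5256287)));
    System.out.println("destination ok: " + (closeTo(last.lat,  3.478483)  && closeTo(last.lng,  -76.517984)));
}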
We also have a JavaScript tool, "Encode/Decode to/from Flexible polyline", at https://demo.support.here.com/examples/v3.1/enc_dec_flexible_polyline; please see its result as well.
Please double-check the code on your side.
I would suggest removing your API key from the GET request you posted in your question.
The polyline you provided decodes into the following, which is in Colombia. So I think you need to check the decoder you are using and/or what is being fed into it.
Index | Lat      | Lon        | Elev
0     | 3.48437  | -76.52567  | 1003
1     | 3.48438  | -76.52505  | 1001
2     | 3.48428  | -76.52438  | 1001
3     | 3.48403  | -76.52365  | 1001
4     | 3.48398  | -76.52351  | 1001
5     | 3.4839   | -76.52332  | 1001
6     | 3.4837   | -76.52274  | 1000
7     | 3.48356  | -76.52237  | 999
8     | 3.48347  | -76.5221   | 998
9     | 3.48337  | -76.52183  | 997
10    | 3.48317  | -76.52128  | 996
11    | 3.48302  | -76.52133  | 996
12    | 3.482    | -76.5217   | 998
13    | 3.48128  | -76.52194  | 998
14    | 3.48073  | -76.52216  | 998
15    | 3.48056  | -76.52221  | 998
16    | 3.48038  | -76.5219   | 998
17    | 3.4802   | -76.52137  | 996
18    | 3.48003  | -76.5209   | 996
19    | 3.47987  | -76.52039  | 995
20    | 3.47969  | -76.51994  | 994
21    | 3.47963  | -76.51982  | 994
22    | 3.47959  | -76.51971  | 994
23    | 3.47944  | -76.51945  | 994
24    | 3.47926  | -76.51918  | 994
25    | 3.47907  | -76.51887  | 994
26    | 3.47872  | -76.5183   | 993
27    | 3.478512 | -76.517966 | 993

How to read a file containing numbers in Octave using textscan

I am trying to import data from a text file named xMat.txt, which has data in the following format:
200 space-separated elements per line, and about 767 lines.
This is how xMat.txt looks:
386.0 386.0 388.0 394.0 402.0 413.0 ... .0 800.0 799.0 796
801.0 799.0 799.0 802.0 802.0 80 ... 399.0 397.0 394.0 391
.
.
.
This is my file - for reference.
When I try to read the file using
file = fopen('xMat.txt','r')
c = textscan(file,'%f');
I get the output as:
> c = { [1,1] =
> 386
> 386
> 388
> 394
> 402
> 413
> 427
> 442
> 458
> 473
> 487
> 499
> 509
> 517
> 524 ... in column format
What I need is a matrix of size (767X200). How can I do this?
I wouldn't use textscan in this case because your text file is purely numeric. Your text file contains 767 rows of 200 numbers per row where each number is delimited by a space. You couldn't get it to be any better suited for use with dlmread (MATLAB doc, Octave doc). dlmread can do this for you in one go:
c = dlmread('xMat.txt');
c will contain a 767 x 200 array for you that contains the data stored in the text file xMat.txt. Hopefully you can dump textscan in this case because what you're really after is trying to read your data into Octave... and dlmread does the job for you quite nicely.

How to merge two XML files (Web.config)?

I have two XML files. If I use Microsoft's XmlDiffPatch it deletes elements, but I don't want to delete nodes, just replace values or add nodes. How can I do that? For example,
web1.config looks like this:

<PartPriceInfo xmlns:ns2="http://www.aa.com">
  <ns2:Subaru model="Outback">
    <ns2:Muffler> 600 </ns2:Muffler>
    <ns2:Bumper> 150 </ns2:Bumper>
    <ns2:Floormat> 75 </ns2:Floormat>
    <ns2:WindShieldWipers> 25 </ns2:WindShieldWipers>
  </ns2:Subaru>
</PartPriceInfo>
and web2.config looks like this:

<PartPriceInfo xmlns:ns2="http://www.aa.com">
  <ns2:Subaru model="Outback">
    <ns2:Muffler> 700 </ns2:Muffler>
    <ns2:Bumper> 150 </ns2:Bumper>
  </ns2:Subaru>
</PartPriceInfo>
I want the result to look like this:

<PartPriceInfo xmlns:ns2="http://www.aa.com">
  <ns2:Subaru model="Outback">
    <ns2:Muffler> 700 </ns2:Muffler>
    <ns2:Bumper> 150 </ns2:Bumper>
    <ns2:Floormat> 75 </ns2:Floormat>
    <ns2:WindShieldWipers> 25 </ns2:WindShieldWipers>
  </ns2:Subaru>
</PartPriceInfo>
Finally I found the Microsoft XML Diff and Patch Tool, and it did the job well. Good luck, everyone.
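Purely as an illustration of the "replace values or add nodes, never delete" merge described in the question (this is not the Microsoft XML Diff and Patch Tool; the class name, file names, and merge policy details are assumptions for this sketch), a minimal Java DOM version could look like this:

import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Illustrative sketch: merge web1.config into web2.config by adding elements
// that are missing in web2.config and keeping web2.config's values when an
// element exists in both. Nothing is ever deleted.
public class AdditiveXmlMerge {

    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder db = dbf.newDocumentBuilder();

        Document source = db.parse(new File("web1.config")); // has Floormat, WindShieldWipers
        Document target = db.parse(new File("web2.config")); // has the newer Muffler value

        merge(source.getDocumentElement(), target.getDocumentElement(), target);

        // Write the merged result back out.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.transform(new DOMSource(target), new StreamResult(new File("web_merged.config")));
    }

    // Copy children of src into tgt when tgt has no child with the same qualified name.
    static void merge(Element src, Element tgt, Document targetDoc) {
        NodeList children = src.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) continue;
            Element srcChild = (Element) child;
            Element tgtChild = findChild(tgt, srcChild.getNamespaceURI(), srcChild.getLocalName());
            if (tgtChild == null) {
                // Missing in the target: import and append it (the "add nodes" case).
                tgt.appendChild(targetDoc.importNode(srcChild, true));
            } else {
                // Present in both: keep the target's value, but recurse for nested elements.
                merge(srcChild, tgtChild, targetDoc);
            }
        }
    }

    static Element findChild(Element parent, String nsUri, String localName) {
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && localName.equals(n.getLocalName())
                    && (nsUri == null ? n.getNamespaceURI() == null : nsUri.equals(n.getNamespaceURI()))) {
                return (Element) n;
            }
        }
        return null;
    }
}

Run against the two example files above, this keeps web2.config's Muffler value of 700 and appends the Floormat and WindShieldWipers elements from web1.config, matching the desired result.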

How can I search for different variants of bioinformatics motifs in a string, using Perl?

I have program output containing one tandem repeat in different variants. Is it possible to search a string for the motif and tell the program to find all variants with a maximum of 3 mismatches/insertions/deletions?
I will take a crack at this with the very limited information supplied.
First, a short friendly editorial:
<editorial>
Please learn how to ask a good question and how to be precise.
At a minimum, please:
Refrain from domain specific jargon such as "motif" and "tandem repeat" and "base pairs" without providing links or precise definitions;
Say what the goal is and what you have done so far;
Important: Provide clear examples of input and desired output.
It is not helpful when potential helpers on SO have to play 20 questions in the comments to try to understand your question! I spent more time trying to figure out what you were asking than answering it.
</editorial>
The following program generates an array of 1,000 strings, each built from two-character pairs and about 5,428 pairs long. I realize it is more likely that you will be reading these from a file, but this is just an example. Obviously you would replace the random strings with your actual data from whatever source.
I do not know if 'AT','CG','TC','CA','TG','GC','GG' that I used are legitimate base pair combinations or not. (I slept through biology...) Just edit the map block pairs to legitimate pairs and change the 7 to the number of pairs if you want to generate legitimate random strings for testing.
If the substring at the offset point is 3 differences or less, the array element (a scalar value) is stored in an anonymous array in the value part of a hash. The key part of the hash is the substring that is a near match. Rather than array elements, the values could be file names, Perl data references or other relevant references you want to associate with your motif.
While I have just looked at character by character differences between the strings, you can put any specific logic that you need to look at by replacing the line foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); } with the comparison logic that works for your problem. I do not know how mismatches/insertions/deletions are represented in your string, so I leave that as an exercise to the reader. Perhaps Algorithm::Diff or String::Diff from CPAN?
It is easy to modify this program to have keyboard input for $target and $offset or have the string searched beginning to end rather than several strings at a fixed offset. Once again: it was not really clear what your goal is...
use strict; use warnings;

my @bps;
push(@bps, join('', map { ('AT','CG','TC','CA','TG','GC','GG')[rand 7] } 0..5428)) for (1..1_000);

my $len     = length($bps[0]);
my $s_count = scalar @bps;
print "$s_count random strings generated $len characters long\n";

my $target = "CGTCGCACAG";
my $offset = 832;
my $nlen   = length $target;
my %HoA;
my $diffs  = 0;
my @a2     = split(//, $target);

substr($bps[-1], $offset, $nlen) = $target;       # guarantee 1 match
substr($bps[-2], $offset, $nlen) = "CATGGCACGG";  # anja example

foreach my $i (0..$#bps) {
    my $cand = substr($bps[$i], $offset, $nlen);
    my @a1   = split(//, $cand);
    $diffs   = 0;
    foreach my $j (0..$#a1) { $diffs++ unless ($a1[$j] eq $a2[$j]); }
    next if $diffs > 3;
    push(@{$HoA{$cand}}, $i);
}

foreach my $hit (keys %HoA) {
    my @a1 = split(//, $hit);
    $diffs = 0;
    my $ds = "";
    foreach my $j (0..$#a1) {
        if ($a1[$j] eq $a2[$j]) {
            $ds .= " ";
        } else {
            $diffs++;
            $ds .= $a1[$j];
        }
    }
    print "Target:        $target\n",
          "Candidate:     $hit\n",
          "Differences:   $ds $diffs differences\n",
          "Array element: ";
    foreach (@{$HoA{$hit}}) {
        print "$_ ";
    }
    print "\n\n";
}
Output:
1000 random strings generated 10858 characters long
Target: CGTCGCACAG
Candidate: CGTCGCACAG
Differences: 0 differences
Array element: 999
Target: CGTCGCACAG
Candidate: CGTCGCCGCG
Differences: CGC 3 differences
Array element: 696
Target: CGTCGCACAG
Candidate: CGTCGCCGAT
Differences: CG T 3 differences
Array element: 851
Target: CGTCGCACAG
Candidate: CGTCGCATGG
Differences: TG 2 differences
Array element: 986
Target: CGTCGCACAG
Candidate: CATGGCACGG
Differences: A G G 3 differences
Array element: 998
..several cut out..
Target: CGTCGCACAG
Candidate: CGTCGCTCCA
Differences: T CA 3 differences
Array element: 568 926
I believe that there are routines for this sort of thing in BioPerl.
In any case, you might get better answers if you asked this over at BioStar, the bioinformatics stack exchange.
When I was in my first couple years of learning perl, I wrote what I now consider to be a very inefficient (but functional) tandem repeat finder (which used to be available on my old job's company website) called tandyman. I wrote a fuzzy version of it a couple years later called cottonTandy. If I were to re-write it today, I would use hashes for a global search (given the allowed mistakes) and utilize pattern matching for a local search.
Here's an example of how you use it:
#!/usr/bin/perl
use Tandyman;

$sequence = "ATGCATCGTAGCGTTCAGTCGGCATCTATCTGACGTACTCTTACTGCATGAGTCTAGCTGTACTACGTACGAGCTGAGCAGCGTACgTG";
my $tandy = Tandyman->new(\$sequence, 'n');  # Can't believe I coded it to take a scalar reference! Prob. fresh out of a cpp class when I wrote it.

$tandy->SetParams(4, 2, 3, 3, 4);
# The parameters are, in order:
#   repeat unit size
#   min number of repeat units to require a hit
#   allowed mistakes per unit (an upper bound for "mistake concentration")
#   allowed mistakes per window (a lower bound for "mistake concentration")
#   number of units in a "window"

while (@repeat_info = $tandy->FindRepeat())
    { print(join("\t", @repeat_info), "\n") }
The output of this test looks like this (and takes a horrendous 11 seconds to run):
25 32 TCTA 2 0.87 TCTA TCTG
58 72 CGTA 4 0.81 CTGTA CTA CGTA CGA
82 89 CGTA 2 0.87 CGTA CGTG
45 51 TGCA 2 0.87 TGCA TGA
65 72 ACGA 2 0.87 ACGT ACGA
23 29 CTAT 2 0.87 CAT CTAT
36 45 TACT 3 0.83 TACT CT TACT
24 31 ATCT 2 1 ATCT ATCT
51 59 AGCT 2 0.87 AGTCT AGCT
33 39 ACGT 2 0.87 ACGT ACT
62 72 ACGT 3 0.83 ACT ACGT ACGA
80 88 ACGT 2 0.87 AGCGT ACGT
81 88 GCGT 2 0.87 GCGT ACGT
63 70 CTAC 2 0.87 CTAC GTAC
32 38 GTAC 2 0.87 GAC GTAC
60 74 GTAC 4 0.81 GTAC TAC GTAC GAGC
23 30 CATC 2 0.87 CATC TATC
71 82 GAGC 3 0.83 GAGC TGAGC AGC
1 7 ATGC 2 0.87 ATGC ATC
54 60 CTAG 2 0.87 CTAG CTG
15 22 TCAG 2 0.87 TCAG TCGG
70 81 CGAG 3 0.83 CGAG CTGAG CAG
44 50 CATG 2 0.87 CTG CATG
25 32 TCTG 2 0.87 TCTA TCTG
82 89 CGTG 2 0.87 CGTA CGTG
55 73 TACG 5 0.75 TAGCTG TAC TACG TACG AG
69 83 AGCG 4 0.81 ACG AGCTG AGC AGCG
15 22 TCGG 2 0.87 TCAG TCGG
As you can see, it allows indels and SNPs. The columns are, in order:
Start position
Stop position
Consensus sequence
The number of units found
A quality metric out of 1
The repeat units separated by spaces
Note that it's easy to supply parameters (as you can see from the output above) that will produce junk/insignificant "repeats", but if you know how to supply good params, it can find what you ask it to find.
Unfortunately, the package is not publicly available. I never bothered to make it available since it's so slow and not amenable to even prokaryotic-sized genome searches (though it would be workable for individual genes). In my novice coding days, I had started to add a feature to take a "state" as input so that I could run it on sections of a sequence in parallel and I never finished that once I learned hashes would make it so much faster. By that point, I had moved on to other projects. But if it would suit your needs, message me, I can email you a copy.
It's just shy of 1000 lines of code, but it has lots of bells & whistles, such as the allowance of IUPAC ambiguity codes (BDHVRYKMSWN). It works for both amino acids and nucleic acids. It filters out internal repeats (e.g. does not report TTTT or ATAT as 4nt consensuses).