How to split a string with | (pipe) as the delimiter in Perl

When I use split with | as the delimiter, it doesn't give me the expected output. Is there a solution?
use warnings;
my $exclude_list = "1213:sutrust.com,sutrust1.com,sutrust3.com|1321:line.com";
my @exclude_client = split(/|/, $exclude_list);
print "Printing excluse @exclude_client \n";
Output:
Printing excluse 1 2 1 3 : s u t r u s t . c o m , s u t r u s t 1 . c o m , s u t r u s t 3 . c o m | 1 3 2 1 : l i n e . c o m
Expected output:
Printing excluse 1213:sutrust.com,sutrust1.com,sutrust3.com 1321:line.com

You didn't use the | character as the delimiter; you used the regular expression | as the delimiter. In a regex, | means alternation, so /|/ reads as "nothing or nothing". That pattern matches the empty string at every position, so the result is a split between every character. Escape the |.
split(/\|/, $exclude_list)
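As a minimal sketch of the same fix (the sub names split_on_pipe and split_on_literal are made up for illustration, they are not from the question), the following shows the escaped pipe, plus quotemeta for the case where the delimiter arrives in a variable:

```perl
use strict;
use warnings;

# Split a string on a literal pipe character.
sub split_on_pipe {
    my ($text) = @_;
    # \| matches a literal pipe; the character class [|] would work equally well.
    return split /\|/, $text;
}

# Split on an arbitrary literal delimiter held in a variable.
sub split_on_literal {
    my ($delimiter, $text) = @_;
    my $pattern = quotemeta $delimiter;   # backslash-escapes regex metacharacters
    return split /$pattern/, $text;
}

my $exclude_list = "1213:sutrust.com,sutrust1.com,sutrust3.com|1321:line.com";
print "$_\n" for split_on_pipe($exclude_list);
```

quotemeta is probably the safer habit whenever the delimiter is not a hard-coded literal, since it escapes any metacharacter, not just the pipe.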

The pipe character is a regular expression metacharacter (alternation), and the first argument to split is a pattern, so the pipe needs to be escaped.
my @exclude_client = split(/\|/, $exclude_list);

You need to escape the pipe with a backslash (\|):
#!/usr/bin/perl -w
use strict;
my $exclude_list = "1213:sutrust.com,sutrust1.com,sutrust3.com|1321:line.com";
my @exclude_client = split(/\|/, $exclude_list);
foreach (@exclude_client) {
    print "Printing exclude $_ ";
}
Outputs:
Printing exclude 1213:sutrust.com,sutrust1.com,sutrust3.com Printing exclude 1321:line.com


How to parse rows in my txt file properly using perl

I hope to parse a txt file that looks like this:
A a, b, c
B e
C f, g
The format I hope to get is:
A a
A b
A c
B e
C f
C g
I tried this:
perl -ane '@s=split(/\,/, $F[1]); foreach $k (@s){print "$F[0] $k\n";}' txt.txt
but it only works when there's no space after commas. In the original file, there is a space after each comma. What should I do?
$ perl -lane 'print "$F[0] $_" for map { tr/,//rd } @F[1..$#F]' input.txt
A a
A b
A c
B e
C f
C g
Use auto-split mode on whitespace like normal, and for each element of an array slice of @F from the second field to the last one, remove any commas (I used tr//d, the more usual s/// works too, of course) and print it with the first field prepended.
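Alternatively, don't use -a because it splits too much. Split off only the first field yourself (note that -n is still needed to read the input line by line):
perl -nle'@F = split(" ", $_, 2); print "$F[0] $_" for split(/,\s*/, $F[1])' txt.txt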
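If a script is easier to maintain than a one-liner, the same idea could be sketched like this. The sub name expand_row is hypothetical, and the sample lines are the ones from the question:

```perl
use strict;
use warnings;

# Expand one "KEY v1, v2, v3" line into a list of "KEY v" strings.
sub expand_row {
    my ($line) = @_;
    chomp $line;
    # Limit 2: split off the first field only, keep the rest intact.
    my ($key, $rest) = split ' ', $line, 2;
    return () unless defined $rest;
    $rest =~ s/\s+\z//;                       # drop trailing whitespace
    # Commas with optional surrounding whitespace separate the values.
    return map { "$key $_" } split /\s*,\s*/, $rest;
}

print "$_\n" for map { expand_row($_) } "A a, b, c", "B e", "C f, g";
```

Splitting on /\s*,\s*/ means it should not matter whether a space follows each comma or not.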

Perl script to check another array values depending on current array index

I'm working on a Perl assignment that has three arrays, @array_A, @array_B and @array_C, with some values in them. I grep for the string "CAT" in @array_A and fetch its indices too:
my @index = grep { $@array_A[$_] =~ 'CAT' } 0..$#array_A;
print "Index : @index\n";
Output: Index : 2 5
I have to take this as input, check the values of the other two arrays at indices 2 and 5, and print them to a file.
The trick is that the position of the string "CAT" varies (the indices might be 5, 7 and 9).
I'm not quite getting the logic here and am looking for some help with it.
Here's an overly verbose example of how to extract the values you want, so as to show what's happening, while hopefully leaving some room for you to investigate further. Note that it's idiomatic Perl to use regex delimiters when using =~, e.g. $name =~ /steve/.
use warnings;
use strict;
my @a1 = qw(AT SAT CAT BAT MAT CAT SLAT);
my @a2 = qw(a b c d e f g);
my @a3 = qw(1 2 3 4 5 6 7);
# note the difference in the next line... no @ symbol...
my @indexes = grep { $a1[$_] =~ /CAT/ } 0..$#a1;
for my $index (@indexes){
    my $a2_value = $a2[$index];
    my $a3_value = $a3[$index];
    print "a1 index: $index\n" .
          "a2 value: $a2_value\n" .
          "a3 value: $a3_value\n" .
          "\n";
}
Output:
a1 index: 2
a2 value: c
a3 value: 3
a1 index: 5
a2 value: f
a3 value: 6
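Once the verbose version makes sense, a more compact sketch is possible with an array slice, which fetches several indices at once. The sub name values_at_matches is invented for this example; the arrays are the ones used above:

```perl
use strict;
use warnings;

# Return the elements of @$values at every index where @$keys matches $pattern.
sub values_at_matches {
    my ($pattern, $keys, $values) = @_;
    my @indexes = grep { $keys->[$_] =~ $pattern } 0 .. $#{$keys};
    return @{$values}[@indexes];   # array slice through a reference
}

my @a1 = qw(AT SAT CAT BAT MAT CAT SLAT);
my @a2 = qw(a b c d e f g);
my @a3 = qw(1 2 3 4 5 6 7);

print "a2: @{[ values_at_matches(qr/CAT/, \@a1, \@a2) ]}\n";   # a2: c f
print "a3: @{[ values_at_matches(qr/CAT/, \@a1, \@a3) ]}\n";   # a3: 3 6
```

Because the indices are computed by grep each time, it does not matter where "CAT" appears in the first array.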

how to remove special character "," from a string using perl

Hi, I have some data like below:
S_ METHOD m0 : 47|8@0- (1,0) [0|0] ""
S_ CTRL m1 : 15|8@0- (0.01,-200) [0|0] ""
From the above two lines I am trying to extract the values that are in parentheses (). I have written a Perl script:
my @temp_signal = split(":",$line);
my @signal= split(" ",@temp_signal[0]);
my @Factor_temp1 = split (" ",@temp_signal[1]);
my @factor_temp = split ('\(',@Factor_temp1[1]);
my @factor = chop(@factor_temp);
my @offset = split (",",@factor_temp);
print OUTFILE1 "@offset[0]\n";
print OUTFILE1 "$signal[1]\n";
But when I try to print @offset[1] and @offset[0], it prints some other value that does not even exist in the line. How can I get the values as:
1 0
0.01 -200
You can use a regular expression match to extract what's inside parentheses separated by a comma:
if ( my @numbers = $line =~ /\((.*),(.*)\)/) {
    print "$numbers[0] $numbers[1]\n";
}
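The greedy .* works for the two sample lines. If a line could ever contain more than one parenthesized group, a stricter pattern may be safer. Here is a sketch under that assumption; the sub name factor_and_offset is made up, and the sample lines are modelled on the question's data:

```perl
use strict;
use warnings;

# Return (factor, offset) from the first "(x,y)" group in a line, or an empty list.
sub factor_and_offset {
    my ($line) = @_;
    # [^,()]+? cannot run past a comma or a parenthesis, unlike the greedy .*
    if ($line =~ /\(\s*([^,()]+?)\s*,\s*([^,()]+?)\s*\)/) {
        return ($1, $2);
    }
    return;
}

for my $line ('S_ METHOD m0 : 47|8@0- (1,0) [0|0] ""',
              'S_ CTRL m1 : 15|8@0- (0.01,-200) [0|0] ""') {
    my ($factor, $offset) = factor_and_offset($line);
    print "$factor $offset\n";
}
```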

What do the non-printable characters in the Perl symbol table represent?

I just learned that in Perl, the symbol table for a given module is stored in a hash that matches the module name, so, for example, the symbol table for the fictional module Foo::Bar would be %Foo::Bar::. The default symbol table is stored in %main::. Just for the sake of curiosity, I decided that I wanted to see what was in %main::, so I iterated through each key/value pair in the hash, printing them out as I went:
#! /usr/bin/perl
use v5.14;
use strict;
use warnings;
my $foo;
my $bar;
my %hash;
while( my ( $key, $value ) = each %:: ) {
say "Key: '$key' Value '$value'";
}
The output looked like this:
Key: 'version::' Value '*main::version::'
Key: '/' Value '*main::/'
Key: '' Value '*main::'
Key: 'stderr' Value '*main::stderr'
Key: '_<perl.c' Value '*main::_<perl.c'
Key: ',' Value '*main::,'
Key: '2' Value '*main::2'
...
I was expecting to see the STDOUT and STDERR file handles, and perhaps @INC and %ENV... what I wasn't expecting to see was non-ascii characters ... what the code block above doesn't show is that the third line of the output actually had a glyph indicating a non-printable character.
I ran the script and piped it as follows:
perl /tmp/asdf.pl | grep '[^[:print:]]' | while read line
do
echo $line
od -c <<< $line
echo
done
The output looked like this:
Key: '' Value '*main::'
0000000 K e y : ' 026 ' V a l u e '
0000020 * m a i n : : 026 ' \n
0000032
Key: 'ARNING_BITS' Value '*main::ARNING_BITS'
0000000 K e y : ' 027 A R N I N G _ B I
0000020 T S ' V a l u e ' * m a i n
0000040 : : 027 A R N I N G _ B I T S ' \n
0000060
Key: '' Value '*main::'
0000000 K e y : ' 022 ' V a l u e '
0000020 * m a i n : : 022 ' \n
0000032
Key: 'E_TRIE_MAXBUF' Value '*main::E_TRIE_MAXBUF'
0000000 K e y : ' 022 E _ T R I E _ M A
0000020 X B U F ' V a l u e ' * m a
0000040 i n : : 022 E _ T R I E _ M A X B
0000060 U F ' \n
0000064
Key: ' Value '*main:'
0000000 K e y : ' \b ' V a l u e '
0000020 * m a i n : : \b ' \n
0000032
Key: '' Value '*main::'
0000000 K e y : ' 030 ' V a l u e '
0000020 * m a i n : : 030 ' \n
0000032
So what are non-printable characters doing in the Perl symbol table? What are they symbols for?
Guru is on the right track: specifically, the answer is to be found in perlvar, which says:
"Perl variable names may also be a sequence of digits or a single punctuation or control character. These names are all reserved for special uses by Perl; for example, the all-digits names are used to hold data captured by backreferences after a regular expression match. Perl has a special syntax for the single-control-character names: It understands ^X (caret X) to mean the control-X character. For example, the notation $^W (dollar-sign caret W) is the scalar variable whose name is the single character control-W. This is better than typing a literal control-W into your program.
Since Perl 5.6, Perl variable names may be alphanumeric strings that begin with control characters (or better yet, a caret). These variables must be written in the form ${^Foo}; the braces are not optional. ${^Foo} denotes the scalar variable whose name is a control-F followed by two o's. These variables are reserved for future special uses by Perl, except for the ones that begin with ^_ (control-underscore or caret-underscore). No control-character name that begins with ^_ will acquire a special meaning in any future version of Perl; such names may therefore be used safely in programs. $^_ itself, however, is reserved."
If you want to print those names in a readable way, you could add a line like this to your code:
$key = '^' . ($key ^ '@') if $key =~ /^[\0-\x1f]/;
If the first character of $key is a control character, this will replace it with a caret followed by the corresponding letter (^A for control-A, ^B for control-B, etc.).
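The same conversion can be spelled out with ord and chr, which may be easier to follow than the string XOR. This is only a sketch, and the sub name readable_name is invented for illustration:

```perl
use strict;
use warnings;

# Convert a leading control character to caret notation,
# e.g. "\x17ARNING_BITS" becomes "^WARNING_BITS".
sub readable_name {
    my ($name) = @_;
    return $name unless $name =~ /^[\0-\x1f]/;
    my $first = substr $name, 0, 1;
    # Flipping bit 0x40 maps control-A (1) to "A" (65), and so on.
    return '^' . chr(ord($first) ^ 64) . substr($name, 1);
}

# List the control-character names in the main symbol table in readable form.
print readable_name($_), "\n" for sort grep { /^[\0-\x1f]/ } keys %main::;
```

The exact list printed depends on the Perl version and on which modules are loaded, so no sample output is given here.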
Perl has special variables such as $", $, , $/ , $\ and so on. All of these are part of the symbol table, which is what you are seeing. You should be able to see @INC and %ENV in the symbol table as well.

Reformatting separated char to couples

Input:
rs001 A C T G C G T T
rs002 C C T T G G A A
out1:
rs001 AC TG CG TT
rs002 CC TT GG AA
out2 :
rs001 1 1 1 2
rs002 2 2 2 2
OK, so basically I want to replace any pair of identical nucleotides (AA, CC, TT or GG) with 2 and any pair of different nucleotides (AT, TA, CG, etc.) with 1, taking into account that the input should be converted first to out1 and then to out2. Also, we have many fields (around 200 columns) in each row, so loops are needed here.
This is what I tried:
cat input | awk '{ for (x = 2; x <= NF; x = x+2) print $x$(x+1) }'
The results are weird. Can anyone please tell me why I can't get out1? What mistakes did I make in the awk loop?
Thanks in advance
For the first,
sed 's/ \([ACGT]\) / \1/g' input >out1
This will remove the space after every other nucleotide. It matches a nucleotide with a space on both sides; the next match will pick up where the previous one ended.
For the second,
sed 's/\([ACGT]\)\1/2/g;s/[ACGT][ACGT]/1/g' out1 >out2
This replaces two adjacent identical letters with 2, then any remaining adjacent two letters with 1.
This assumes you have Linux; other sed dialects may require minor modifications.
awk '{
    out1 = out2 = $1
    for (i=2;i<=NF;i+=2) {
        out1 = out1 FS $i $(i+1)
        out2 = out2 FS ($i == $(i+1) ? 2 : 1)
    }
    print out1 > "out1"
    print out2 > "out2"
}' input
Here's how you fix your awk script to get output 1:
awk '{ printf "%s ", $1; for (x = 2; x <= NF; x = x + 2) {printf "%s%s ", $x, $(x+1)} printf "\n"}' input
print adds a newline at the end by default, so you'll have to use printf with format strings to control exactly where the newlines go.
(Also added printf "%s ", $1; at the start to print the header at the start of each line)
Edit: Triplee's solution looks much more elegant than mine - you should ditch awk and go with his =)
This might work for you (GNU sed):
sed -re 's/ (.) / \1/g;w out1' -e 's/([ACTG])\1/2/g;s/[ACTG]./1/g' file >out2
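For completeness, here is a rough Perl sketch of the same two-step conversion. The sub name pair_and_score is hypothetical, and it assumes whitespace-separated fields with the ID in the first column, as in the sample input:

```perl
use strict;
use warnings;

# Turn "rs001 A C T G" into the out1 form "rs001 AC TG"
# and the out2 form "rs001 1 1".
sub pair_and_score {
    my ($line) = @_;
    my ($id, @bases) = split ' ', $line;
    my (@pairs, @scores);
    # Take the bases two at a time until none are left.
    while (my ($x, $y) = splice @bases, 0, 2) {
        $y //= '';                        # tolerate an odd number of bases
        push @pairs,  "$x$y";
        push @scores, $x eq $y ? 2 : 1;   # 2 = identical pair, 1 = different
    }
    return ("$id @pairs", "$id @scores");
}

for my $line ("rs001 A C T G C G T T", "rs002 C C T T G G A A") {
    my ($out1, $out2) = pair_and_score($line);
    print "$out1\t$out2\n";
}
```

The loop handles any even number of columns, so 200 fields per row should need no changes.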