How to get substring of the line enclosed in double quotes - perl

I have an input string:
ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6
If I use the split function, I get unexpected output.
my ($field1, $field2, $field3, $field4) = "";
while (<DATAFILE>) {
$row = $_;
$row =~ s/\r?\n$//;
($field1, $field2, $field3, $field4) = split(/,/, $row);
}
The output I am getting is:
field1 :: ACC000121
field2 :: 2290
field3 :: "01009900
field4 :: 01009901
Expected output:
field1 = ACC000121
field2 = 2290
field3 = 01009900,01009901,01009902,01009903,01009904
field4 = 4
field5 = 5
field6 = 6
I am quite weak in Perl. Please help me

If you have CSV data, you really want to use Text::CSV to parse it. As you've discovered, parsing CSV data is usually not as trivial as just splitting on commas, and Text::CSV can handle all the edge cases for you.
use strict;
use warnings;
use Data::Dump;
use Text::CSV;
my $csv = Text::CSV->new;
while (<DATA>) {
$csv->parse($_);
my @fields = $csv->fields;
dd(\@fields);
}
__DATA__
ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6
Output:
[
"ACC000121",
2290,
"01009900,01009901,01009902,01009903,01009904",
4,
5,
6,
]
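If installing Text::CSV is not an option, a rough sketch with the core Text::ParseWords module can cope with simple quoted fields like the one in the question. This is a swapped-in technique, not what the answer above uses. It assumes the input has no doubled quotes ("") and no embedded newlines, and note that parse_line also treats single quotes as quoting characters. The name split_quoted_line is made up for illustration.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: split one comma-separated line, keeping commas
# that sit inside double quotes. Assumes no "" escapes and no
# multi-line fields.
sub split_quoted_line {
    my ($line) = @_;
    $line =~ s/\r?\n$//;                 # drop the line ending, as in the question
    return parse_line(',', 0, $line);    # 0 = strip the enclosing quotes
}

my @fields = split_quoted_line(
    qq{ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6\n}
);
print "field", $_ + 1, " = $fields[$_]\n" for 0 .. $#fields;
```

For anything beyond this simple shape, Text::CSV remains the safer choice.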

I agree with Matt Jacob's answer — you should parse CSV with Text::CSV unless you've got a very good reason not to do so.
If you're going to deal with it using regular expressions, I think you'll do better with m// than split. For example, this seems to cover most single line CSV data variants, though it does not remove the quotes around a quoted field as Text::CSV would — that requires a separate post-processing step.
use strict;
use warnings;
sub splitter
{
my($row) = @_;
my @fields;
my $i = 0;
while ($row =~ m/((?=,)|[^",][^,]*|"([^"]|"")*")(?:,|$)/g)
{
print "Found [$1]\n";
$fields[$i++] = $1;
}
for (my $j = 0; $j < @fields; $j++)
{
print "$j = [$fields[$j]]\n";
}
}
my $row;
$row = q'ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6';
print "Row 1: $row\n";
splitter($row);
$row = q'ACC000121,",",2290,"01009900,""aux data"",01009902,01009903,01009904",,5"abc",6,""';
print "Row 2: $row\n";
splitter($row);
Obviously, that has a fair amount of diagnostic code in it. The output (from Perl 5.22.0 on Mac OS X 10.11.1) is:
Row 1: ACC000121,2290,"01009900,01009901,01009902,01009903,01009904",4,5,6
Found [ACC000121]
Found [2290]
Found ["01009900,01009901,01009902,01009903,01009904"]
Found [4]
Found [5]
Found [6]
0 = [ACC000121]
1 = [2290]
2 = ["01009900,01009901,01009902,01009903,01009904"]
3 = [4]
4 = [5]
5 = [6]
Row 2: ACC000121,",",2290,"01009900,""aux data"",01009902,01009903,01009904",,5"abc",6,""
Found [ACC000121]
Found [","]
Found [2290]
Found ["01009900,""aux data"",01009902,01009903,01009904"]
Found []
Found [5"abc"]
Found [6]
Found [""]
0 = [ACC000121]
1 = [","]
2 = [2290]
3 = ["01009900,""aux data"",01009902,01009903,01009904"]
4 = []
5 = [5"abc"]
6 = [6]
7 = [""]
In the Perl code, the match is:
m/((?=,)|[^",][^,]*|"([^"]|"")*")(?:,|$)/
This captures (in $1) one of three alternatives: an empty field (a position immediately followed by a comma); a character that is neither a double quote nor a comma, followed by zero or more non-commas; or a double quote, followed by zero or more occurrences of either a non-quote character or two consecutive double quotes, followed by a closing double quote. After the field, it expects either a comma or the end of the string.
Handling multi-line fields requires a little more work. Removing the escaping double quotes also requires a little more work.
Using Text::CSV is much simpler and much less error prone (and it can handle more variants than this can).
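The post-processing step mentioned above (removing the enclosing quotes and the escaping double quotes) might look roughly like this. It is only a sketch: unquote_field is a hypothetical name, and it assumes the field was captured whole by the regex shown earlier.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the enclosing quotes from a captured field
# and collapse each doubled quote ("") into a single quote.
# Fields that do not start and end with a quote are returned unchanged.
sub unquote_field {
    my ($field) = @_;
    if ($field =~ /^"(.*)"$/s) {
        $field = $1;
        $field =~ s/""/"/g;
    }
    return $field;
}

print "[", unquote_field('"01009900,""aux data"",01009902"'), "]\n";
print "[", unquote_field('""'), "]\n";
print "[", unquote_field('5"abc"'), "]\n";
```

It could be applied to each element of @fields after the matching loop in splitter.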

Regular Expression Matching Perl for first case of pattern

I have multiple variables that have strings in the following format:
some_text_here__what__i__want_here__andthen_some 
I want to be able to assign to a variable the what__i__want_here portion of the first variable. In other words, everything after the FIRST double underscore. There may be double underscores in the rest of the string but I only want to take the text after the FIRST pair of underscores.
Ex.
If I have $var = "some_text_here__what__i__want_here__andthen_some", I would like to assign to a new variable only the second part like $var2 = "what__i__want_here__andthen_some"
I'm not very good at matching so I'm not quite sure how to do it so it just takes everything after the first double underscore.
my $text = 'some_text_here__what__i__want_here';
# .*? # Match a minimal number of characters - see "man perlre"
# /s # Make . match also newline - see "man perlre"
my ($var) = $text =~ /^.*?__(.*)$/s;
# $var is not defined when there is no __ in the string
print "var=${var}\n" if defined($var);
You might consider this an example of where split's third parameter is useful. The third parameter to split constrains how many elements to return. Here is an example:
my @examples = (
'some_text_here__what__i_want_here',
'__keep_this__part',
'nothing_found_here',
'nothing_after__',
);
foreach my $string (@examples) {
my $want = (split /__/, $string, 2)[1];
print "$string => ", (defined $want ? $want : ''), "\n";
}
The output will look like this:
some_text_here__what__i_want_here => what__i_want_here
__keep_this__part => keep_this__part
nothing_found_here =>
nothing_after__ =>
This line is a little dense:
my $want = (split /__/, $string, 2)[1];
Let's break that down:
my ($prefix, $want) = split /__/, $string, 2;
The 2 parameter tells split that no matter how many times the pattern /__/ could match, we only want to split one time, the first time it's found. So as another example:
my (@parts) = split /#/, "foo#bar#baz#buzz", 3;
The @parts array will receive these elements: 'foo', 'bar', 'baz#buzz', because we told it to stop splitting after the second split, so that we get a total maximum of three elements in our result.
Back to your case, we set 2 as the maximum number of elements. We then go one step further by eliminating the need for my ($throwaway, $want) = .... We can tell Perl we only care about the second element in the list of things returned by split, by providing an index.
my $want = ('a', 'b', 'c', 'd')[2]; # c, the element at offset 2 in the list.
my $want = (split /__/, $string, 2)[1]; # The element at offset 1 in the list
# of two elements returned by split.
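The split-with-limit idea can be wrapped in a small reusable function. This is a sketch only: the name after_first_sep is invented here, and \Q...\E is added so that the separator is treated as literal text rather than as a pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: return everything after the first occurrence of
# $sep in $string, or undef when $sep does not occur at all.
sub after_first_sep {
    my ($string, $sep) = @_;
    # A limit of 2 means: split at most once, at the first match.
    my (undef, $rest) = split /\Q$sep\E/, $string, 2;
    return $rest;
}

my $var  = 'some_text_here__what__i__want_here__andthen_some';
my $var2 = after_first_sep($var, '__');
print defined $var2 ? "$var2\n" : "(no separator found)\n";
```

Returning undef for the no-match case lets the caller tell "nothing found" apart from an empty remainder.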
You can use brackets to capture and then reorder the string; the first set of brackets () becomes $1 in the replacement part of the substitution, the second becomes $2, and so on:
my $string = "some_text_here__what__i__want_here";
(my $newstring = $string) =~ s/(some_text_here)(__)(what__i__want_here)/$3$2$1/;
print $newstring;
OUTPUT
what__i__want_here__some_text_here

Why below code is not working as expected?

I have the code below to match a particular keyword from a file. Please note that the keyword is present in that file (verified).
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long;
my $fname="sample.txt";
my @o_msg_rx;
my $tempStr='=?UTF-8?B?U2Now4PCtm5l?=\, Ma ';
push @o_msg_rx, $tempStr;
foreach my $rx_temp (@o_msg_rx) {
print "rx_temp = $rx_temp\n";
}
my @msg_arr;
open MM, '<', $fname;
chomp(@msg_arr = (<MM>));
close MM;
my (%o_msg_rx, %msg_anti_rx);
foreach my $rx (@o_msg_rx){
($rx =~ s/^!// ? $msg_anti_rx{$rx} : $o_msg_rx{$rx}) = 0 if $rx;
print "rx = \t$rx\n";
print "o_msg_rx_rx = \t$o_msg_rx{$rx}\n";
}
if(@msg_arr) {
foreach my $rx (keys %o_msg_rx) {
$o_msg_rx{$rx} = 1 if grep (/$rx/i, @msg_arr);
}
}
my $regex_ok = (! scalar grep (! $o_msg_rx{$_}, keys %o_msg_rx));
print "regex_ok = $regex_ok\n";
I am attaching a few lines from the file for clarification.
# Step: 23 14:48:52
#
# var: expect-count='1'
# var: msg-rx=""=?UTF-8?B?U2Now4PCtm5l?=\, Maik ""
# etc etc etc
Do you intend for $tempStr to be interpreted as a regular expression? If so, then you should know that the ? is a regular expression operator and will not literally match a ? in the target string.
Also, it has a space after Ma but your sample file has Maik so that part won't match.
These changes will produce a different result:
my $tempStr='=?UTF-8?B?U2Now4PCtm5l?=\, Ma'; # remove the extra space
grep (/\Q$rx/i, @msg_arr); # Add \Q to match the literal string $tempStr in regexp
Or you could make $tempStr a real regexp from the start:
my $tempStr=qr/=\?UTF-8\?B\?U2Now4PCtm5l\?=\\, Ma/;
Or you could leave it as a string but put it in regexp syntax (needs an extra doubling of the double backslash, very ugly):
my $tempStr='=\?UTF-8\?B\?U2Now4PCtm5l\?=\\\\, Ma';
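To see the effect of \Q in isolation, here is a small sketch that compares a plain interpolated match with a quoted one against the sample line from the question. The function name count_literal_matches is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines that contain $needle as literal
# text. \Q...\E escapes regex metacharacters such as ? and \ in $needle.
sub count_literal_matches {
    my ($needle, @lines) = @_;
    return scalar grep { /\Q$needle\E/i } @lines;
}

my @msg_arr = ('# var: msg-rx=""=?UTF-8?B?U2Now4PCtm5l?=\, Maik ""');
my $tempStr = '=?UTF-8?B?U2Now4PCtm5l?=\, Ma';

# Without \Q, each ? is a quantifier, so the literal ? characters in the
# line are never matched.
my $plain = grep { /$tempStr/i } @msg_arr;

print "plain match count:  $plain\n";
print "quoted match count: ", count_literal_matches($tempStr, @msg_arr), "\n";
```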

Perl - Converting special characters to corresponding numerical values in quality score

I am trying to convert the following set of characters into their corresponding values for a quality score that accompanies a fasta file:
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
They should have the values 0-93. So when I input a fastq file that uses these symbols I want to output the numerical values for each in a quality score file.
I have tried putting them into an array using split // and then making into a hash where each key is the symbol and the value is its position in the array:
for (my $i = 0; $i<length(@qual); $i++) {
print "i is $i, elem is $qual[$i]\n";
$hash{$qual[$i]} = $i;
I have tried hard coding the hash:
my %hash = {"!"=>"0", "\""=>"1", "#"=>"2", "\$"=>"3"...
With and without escapes for the special characters that require them but cannot seem to get this to work.
This merely outputs:
.
.
.
i is 0, elem is !
i is 1, elem is "
i is 0, elem is !
i is 1, elem is "
i is 0, elem is !
i is 1, elem is "
" 1
Use of uninitialized value $hash{"HASH(0x100804ed0)"} in concatenation (.) or string at convert_fastq.pl line 24, <> line 40.
HASH(0x100804ed0)
! 0
Does anyone have any ideas? I appreciate the help.
Perhaps subtracting 33 from the character's ord to yield the value you want would be helpful:
use strict;
use warnings;
my $string = q{!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~};
for ( split //, $string ) {
print "$_ = ", ord($_) - 33, "\n";
}
Partial output:
! = 0
" = 1
# = 2
$ = 3
% = 4
& = 5
' = 6
( = 7
) = 8
* = 9
+ = 10
...
This way, you don't need to build a hash with character/value pairs, but just use $val = ord ($char) - 33; to get the value.
{ ... }
is similar to
do { my %anon; %anon = ( ... ); \%anon }
So when you did
my %hash = { ... };
you assigned a single item to the hash (a reference to a hash) rather than a list of key-values as you should. Perl warned you about that with the following:
Reference found where even-sized list expected
(Why didn't you mention this?!)
You should be using
my %decode_map = ( ... );
For example,
my %decode_map;
{
my $encoded = q{!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~};
my @encoded = split //, $encoded;
$decode_map{$encoded[$_]} = $_ for 0..$#encoded;
}
Since those are simply the non-whitespace printable ASCII characters in order, you could build the map with
my %decode_map = map { chr($_) => $_ - 0x21 } 0x21..0x7E;
Which means you could avoid building the hash at all, replacing
my %decode_map = map { chr($_) => $_ - 0x21 } 0x21..0x7E;
die if !exists($decode_map{$c});
my $num = $decode_map{$c};
with just
die if ord($c) < 0x21 || ord($c) > 0x7E;
my $num = ord($c) - 0x21;
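Applied to a whole quality string, the ord-based approach might look like the following sketch. The function name decode_quality and the sample string are invented for illustration; the offset of 0x21 (33) is the one used in the answers above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a quality string into a list of numeric
# scores by subtracting 0x21 (33) from each character code.
sub decode_quality {
    my ($qual) = @_;
    return map {
        my $code = ord($_);
        die "Invalid quality character: $_\n"
            if $code < 0x21 || $code > 0x7E;
        $code - 0x21;
    } split //, $qual;
}

print join(' ', decode_quality('!"#I~')), "\n";    # prints: 0 1 2 40 93
```

Remember to chomp each input line first, since a trailing newline falls outside the valid range and would trigger the die.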
From a language-agnostic point of view: Use an array with 256 entries, one for each possible byte value. You can then store 0 at ['!'], 1 at ['"'] and so on. When parsing the input, you can look up the index of a char in that array directly. For careful error handling, you could store -1 at all invalid chars and check for that while parsing the file.

Better way to extract elements from a line using perl?

I want to extract some elements from each line of a file.
Below is the line:
# 1150 Reading location 09ef38 data = 00b5eda4
I would like to extract the address 09ef38 and the data 00b5eda4 from this line.
The way I use is the simple one like below:
while($line = <INFILE>) {
if ($line =~ /\#\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*(\S+)\s*=\s*(\S+)/) {
$time = $1;
$address = $4;
$data = $6;
printf(OUTFILE "%s,%s,%s \n",$time,$address,$data);
}
}
Is there a better way to do this, something easier and cleaner?
Thanks a lot!
TCGG
Another option is to split the string on whitespace:
my ($time, $addr, $data) = (split / +/, $line)[1, 4, 7];
You could use matching with a list on the LHS, something like this:
echo '# 1150 Reading location 09ef38 data = 00b5eda4' |
perl -ne '
$,="\n";
($time, $addr, $data) = /#\s+(\w+).*?location\s+(\w+).*?data\s*=\s*(\w+)/;
print $time, $addr, $data'
Output:
1150
09ef38
00b5eda4
In Python, an appropriate regex would look like:
'[0-9]+[a-zA-Z ]*([0-9]+[a-z]+[0-9]+)[a-zA-Z ]*= ([0-9a-zA-Z]+)'
But I don't know exactly how to write it in Perl. You can search for it. If you need any explanation of this regexp, I can edit this post with a more precise description.
I find it convenient to just split by one or more whitespaces of any kind, using \s+. This way you won't have any problems if the input string has any tab characters in it instead of spaces.
while($line = <INFILE>)
{
my ($time, $addr, $data) = (split /\s+/, $line)[1, 4, 7];
}
When splitting on ANY kind of whitespace, note that the newline at the end of the line also acts as a separator. That is usually harmless: split discards trailing empty fields by default, so the last element simply comes back without its newline. The one thing to watch for is a line that starts with whitespace, which produces an empty first element and shifts the indexes.
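If the positional indexes feel fragile, the regex approach from the earlier answer can be wrapped in a function that returns an empty list for lines that do not match. A minimal sketch, with parse_reading as a hypothetical name and the sample line taken from the question:

```perl
use strict;
use warnings;

# Hypothetical helper: pull the time, address and data out of one line.
# Returns an empty list when the line does not have the expected shape.
sub parse_reading {
    my ($line) = @_;
    if ($line =~ /^#\s+(\w+).*?location\s+(\w+).*?data\s*=\s*(\w+)/) {
        return ($1, $2, $3);
    }
    return;
}

my $line = "# 1150 Reading location 09ef38 data = 00b5eda4\n";
if (my ($time, $addr, $data) = parse_reading($line)) {
    print join(',', $time, $addr, $data), "\n";    # prints: 1150,09ef38,00b5eda4
}
```

Anchoring on the words location and data means extra columns elsewhere in the line do not shift the result.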

Parsing files that use synonyms

If I had a text file with the following:
Today (is|will be) a (great|good|nice) day.
Is there a simple way I can generate a random output like:
Today is a great day.
Today will be a nice day.
Using Perl or UNIX utils?
Closures are fun:
#!/usr/bin/perl
use strict;
use warnings;
my @gens = map { make_generator($_, qr~\|~) } (
'Today (is|will be) a (great|good|nice) day.',
'The returns this (month|quarter|year) will be (1%|5%|10%).',
'Must escape %% signs here, but not here (%|#).'
);
for ( 1 .. 5 ) {
print $_->(), "\n" for @gens;
}
sub make_generator {
my ($tmpl, $sep) = @_;
my @lists;
while ( $tmpl =~ s{\( ( [^)]+ ) \)}{%s}x ) {
push @lists, [ split $sep, $1 ];
}
return sub {
sprintf $tmpl, map { $_->[rand @$_] } @lists
};
}
Output:
C:\Temp> h
Today will be a great day.
The returns this month will be 1%.
Must escape % signs here, but not here #.
Today will be a great day.
The returns this year will be 5%.
Must escape % signs here, but not here #.
Today will be a good day.
The returns this quarter will be 10%.
Must escape % signs here, but not here %.
Today is a good day.
The returns this month will be 1%.
Must escape % signs here, but not here %.
Today is a great day.
The returns this quarter will be 5%.
Must escape % signs here, but not here #.
Code:
#!/usr/bin/perl
use strict;
use warnings;
my $template = 'Today (is|will be) a (great|good|nice) day.';
for (1..10) {
print pick_one($template), "\n";
}
exit;
sub pick_one {
my ($template) = @_;
$template =~ s{\(([^)]+)\)}{get_random_part($1)}ge;
return $template;
}
sub get_random_part {
my $string = shift;
my @parts = split /\|/, $string;
return $parts[rand @parts];
}
Logic:
Define template of output (my $template = ...)
Enter loop to print random output many times (for ...)
Call pick_one to do the work
Find all "(...)" substrings, and replace them with random part ($template =~ s...)
Print generated string
Getting random part is simple:
receive extracted substring (my $string = shift)
split it using | character (my @parts = ...)
return random part (return $parts[...)
That's basically all. Instead of using a function you could put the same logic in s{}{}, but it would be a bit less readable:
$template =~ s{\( ( [^)]+ ) \)}
{ my @parts = split /\|/, $1;
$parts[rand @parts];
}gex;
Sounds like you may be looking for Regexp::Genex. From the module's synopsis:
#!/usr/bin/perl -l
use Regexp::Genex qw(:all);
$regex = shift || "a(b|c)d{2,4}?";
print "Trying: $regex";
print for strings($regex);
# abdd
# abddd
# abdddd
# acdd
# acddd
# acdddd
Use a regex to match each parenthetical (and the text inside it).
Use a string split operation (pipe delimiter) on the text inside of the matched parenthetical to get each of the options.
Pick one randomly.
Return it as the replacement for that capture.
Smells like a recursive algorithm
Edit: misread and thought you wanted all possibilities
#!/usr/bin/python
import re, random
def expand(line, all):
    # find the first parenthesized group, if any
    result = re.search(r'\([^\)]+\)', line)
    if result:
        variants = result.group(0)[1:-1].split("|")
        for v in variants:
            expand(line[:result.start()] + v + line[result.end():], all)
    else:
        all.append(line)
    return all
line = "Today (is|will be) a (great|good|nice) day."
all = expand(line, [])
# choose a random possibility at the end:
print(random.choice(all))
A similar construct that produces a single random line:
def expand_rnd(line):
    # replace the first group with one random variant, then recurse
    result = re.search(r'\([^\)]+\)', line)
    if result:
        variants = result.group(0)[1:-1].split("|")
        choice = random.choice(variants)
        return expand_rnd(
            line[:result.start()] + choice + line[result.end():])
    else:
        return line
Will fail however on nested constructs
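One possible way around the nesting limitation, sketched in Perl to match the rest of the thread, is to resolve the innermost groups first and repeat until no parentheses remain. This is an illustration under stated assumptions rather than a tested solution from any of the answers: expand_nested is a made-up name, and the input is assumed to have balanced parentheses.

```perl
use strict;
use warnings;

# Hypothetical helper: pick one random variant per (a|b|c) group,
# including nested groups. An innermost group contains no parentheses,
# so it is replaced first; the loop repeats until no group is left.
sub expand_nested {
    my ($template) = @_;
    1 while $template =~ s{\(([^()]*)\)}{
        my @parts = split /\|/, $1, -1;     # -1 keeps empty alternatives
        @parts ? $parts[rand @parts] : '';  # an empty group becomes ''
    }e;
    return $template;
}

print expand_nested('Today (is|(might|will) be) a (great|good|nice) day.'), "\n";
```

Each substitution removes one pair of parentheses, so the loop always terminates on balanced input.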