Extracting substring in perl

Extracting substring in perl - perl

I need to extract a substring from a given string in my perl program.
The string is of the form:
<PrefixString>_<MyString>_<SuffixString>.pdf
Example: abcd_ThisIsWhatIWant_xyz.pdf
I need to extract "ThisIsWhatIWant"
Can anyone help me please?
Thanks!
This is what I am trying through a subroutine:
sub extractString{
my ($fileName) = #_;
my $offset = 2;
my $delimeter = '_';
my $fileNameLen = index($fileName, $delimeter, $offset);
my $extractedFileName = substr($fileName, 8, $fileNameLen-1);
return $extractedFileName;
}

You can either use split or a regular expression. This short program shows both alternatives.
use strict;
use warnings;
my $filename = 'abcd_ThisIsWhatIWant_xyz.pdf';
my ($substring1) = $filename =~ /_([^_]*)_/;
print $substring1, "\n";
my $substring2 = (split /_/, $filename)[1];
print $substring2, "\n";
output
ThisIsWhatIWant
ThisIsWhatIWant

Related

Perl int to array

I have a number stored in a Perl variable and I want to 'pass/convert/store' its digits in the different positions of an array. An example for a better sight:
I have, let's say, this number stored:
$hello = 429384
And I need a new array with the digits stored in it, so:
$hello2[0] = 4
$hello2[1] = 2
$hello2[2] = 9
Etc..
I can probably make it with a couple of loops, but I want to know if there is an efficient and fast way to do it. Thx in advance!

my #hello = split //, $hello;
In Perl if you use number in a string operator, the conversion is done automatically

$hello = 429384;
#hello = split //, $hello;
print $hello[0];

Using only Regex and without using any inbuilt function:
#!/usr/bin/perl
use strict;
use warnings;
my $string=429384;
my #numbers = $string =~ /./g; # dot matches a single character at a time
#and returns it
print "#numbers \n";

this is significantly faster than the regexp way:
$string = '1234567890';
$_-=48 for #digits = unpack 'C*',$string;
benchmark:
use Time::HiRes;
$string = '1234567890';
$start_time = [Time::HiRes::gettimeofday()];
for (1.. 100000){
$_-=48 for #digits= unpack 'C*',$string;
}
$diff = Time::HiRes::tv_interval($start_time);
print "\n\n$diff\n";
$start_time = [Time::HiRes::gettimeofday()];
for (1.. 100000){
#digits = split //, $string;
}
$diff = Time::HiRes::tv_interval($start_time);
print "\n\n$diff\n";
output:
0.265814
0.314735

perl: how to make compact name from a numbered sequence

[perl 5.8.8]
I have a sequence of names of things like:
names='foobar1304,foobar1305,foobar1306,foobar1307'
where the names differ only by a contiguous string of digits somewhere in the name. The strings of digits in any sequence are all of the same length, and the digit strings form a continuous numeric sequence with no skips, e.g. 003,004,005.
I want a compact representation like:
compact_name='foobar1304-7'
(The compact form is just a name, so it's exact form is negotiable.)
There will usually only be <10 things, though some sets might span a decade, e.g.
'foobaz2205-11'
Is there some concise way to do this in perl? I'm not a big perl hacker, so be a little gentle...
Bonus points for handling embedded sequences like:
names='foobar33-pqq,foobar34-pqq,foobar35-pqq'
The ideal script would neatly fall back to 'firstname2301-lastname9922' in case it can't identify a sequence in the names.

I am not sure I got your specification, but it works somehow:
#!/usr/bin/perl
use warnings;
use strict;
use Test::More;
sub compact {
my $string = shift;
my ($name, $value) = split /=/, $string;
$name =~ s/s$// or die "Cannot create compact name for $name.\n"; #/ SO hilite bug
$name = 'compact_' . $name;
$value =~ s/^'|'$//g; #/ SO hilite bug
my #values = split /,/, $value; #/ SO hilite bug
my ($prefix, $first, $suffix) = $values[0] =~ /^(.+?)([0-9]+)(.*)$/;
my $last = $first + $#values;
my $same = 0;
$same++ while substr($first, 0, $same) eq substr($last, 0, $same);
$last = substr $last, $same - 1;
for my $i ($first .. $first + $#values) {
$values[$i - $first] eq ($prefix . $i . $suffix)
or die "Invalid sequence at $values[$i-$first].\n";
}
return "$name='$prefix$first-$last$suffix'";
}
is( compact("names='foobar1304,foobar1305,foobar1306,foobar1307'"),
"compact_name='foobar1304-7'");
is( compact("names='foobaz2205,foobaz2206,foobaz2207,foobaz2208,foobaz2209,foobaz2210,foobaz2211'"),
"compact_name='foobaz2205-11'");
is( compact("names='foobar33-pqq,foobar34-pqq,foobar35-pqq'"),
"compact_name='foobar33-5-pqq'");
done_testing();

Someone sure will post an more elegant solution, but the following
use strict;
use warnings;
my $names='foobar1308-xy,foobar1309-xy,foobar1310-xy,foobar1311-xy';
my #names = split /,/,$names;
my $pfx = lcp(#names);
my #nums = map { m/$pfx(\d*)/; $1 } #names;
my $first=shift #nums;
my $last = pop #nums;
my $suf=$names[0];
$suf =~ s/$pfx\d*//;
print "$pfx\{$first-$last}$suf\n";
#https://gist.github.com/3309172
sub lcp {
my $match = shift;
substr($match, (($match ^ $_) =~ /^\0*/, $+[0])) = '' for #_;
$match;
}
prints:
foobar13{08-11}-xy

How to match Chinese character in the web pages whose charset is big5 (Perl)?

I am doing a match between Chinese words, for example, "语言中心“ and a mount of web files (php, html, htm, etc).
However, somehow I get the following error:
Malformed UTF-8 character (1 byte, need 2, after start byte 0xdf) in regexp compilation at ../Final_FindOnlyNoReplace_CLE_Chinese.pl line 89, <INFILE> line 12.
Can anyone help?
Here is my code.
#!/usr/bin/env perl
use Encode qw/encode decode/;
use utf8;
use strict;
use Cwd;
use LWP::UserAgent;
my($path) = #_;
## append a trailing / if it's not there
$path .= '/' if($path !~ /\/$/);
use File::Glob ':glob';
my #all_files = bsd_glob($path."*");
for my $eachFile (#all_files) {
open(INFILE, "<$eachFile") || die ("Could not open '$eachFile'\n");
my(#inlines) = <INFILE>;
my($line, $find);
my $outkey = 1;
foreach $line (#inlines) {
$find = &find($line);
if ($find ne 'false') {
chomp($line);
print "\tline$outkey : $line\n";
}
$outkey ++;
}
}
#subroutine
sub find {
my $m = encode("utf8", decode("big5", #_));
my $html = LWP::UserAgent->new
->get($m)
->decoded_content;
my $str_chinese = '語言中心';
if ($m =~ /$str_chinese/) {
$m; ##if match, return the whole line.
}
}

You aren't searching in $html you've retrieved and decoded, but in URL instead: $m =~ /$str_chinese/, which, I guess, is not what you intend.
Also, you're comparing result of find function with exact string "false," which will never work. Change if ($find ne 'false') to if (defined($find)) and add explicit returns for success and failure to find for clarity.
Finally, you script seems to fail because you point it to directory that have some other Perl scripts amongst other files. They're most likely in UTF-8, so when your script tries to read them as big5 data, it falis on decoding. Just change your glob to cover data files only.

#!/usr/bin/env perl
use utf8;
use strictures;
use LWP::UserAgent qw();
use Path::Class::Rule qw();
use URI::file qw();
my $start_directory = q(.);
my $search_text = qr'語言中心';
my $next = Path::Class::Rule->new->name(qw(*.php *.htm*))->iter($start_directory);
my #matching_lines;
while (my $file = $next->()) {
for my $line (split /\R/, LWP::UserAgent
->new
->get(URI::file->new_abs($file))
->decoded_content
) {
push #matching_lines, $line if $line =~ $search_text;
}
}
# #matching_lines is (
# '<title>Untitled 語言中心 Document</title>',
# 'abc 語言中心 cde',
# '天天向上語言中心他'
# )

How to extract multiple columns from a CSV file using Perl

I'm pretty new with Perl and was hoping if anyone could help me with this issue. I need to extract two columns from a CSV file embedded commas. This is how the format looks like:
"ID","URL","DATE","XXID","DATE-LONGFORMAT"
I need to extract the DATE column, the XXID column, and the column immediately after XXID. Note, each line doesn't necessarily follow the same number of columns.
The XXID column contains a 2 letter prefix and doesn't always starts with the same letter. It can pretty much be any letter of the aplhabet. The length is always the same.
Finally, once these three columns are extracted, I need to sort on the XXID column and get a count on duplicates.

I published a module called Tie::Array::CSV which lets Perl interact with your CSV as a native Perl nested array. If you use this, you can take your search logic and apply it just as if your data were already in an array of array-references. Take a look!
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp;
use Tie::Array::CSV;
use List::MoreUtils qw/first_index/;
use Data::Dumper;
# this builds a temporary file from DATA
# normally you would just make $file the filename
my $file = File::Temp->new;
print $file <DATA>;
#########
tie my #csv, 'Tie::Array::CSV', $file;
#find column from data in first row
my $colnum = first_index { /^\w.{6}$/ } #{$csv[0]};
print "Using column: $colnum\n";
#extract that column
my #column = map { $csv[$_][$colnum] } (0..$#csv);
#build a hash of repetitions
my %reps;
$reps{$_}++ for #column;
print Dumper \%reps;

Here's a sample script using the Text::CSV module to parse your csv data. Consult the documentation for the module to find the proper settings for your data.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({ binary => 1 });
while (my $row = $csv->getline(*DATA)) {
print "Date: $row->[2]\n";
print "Col#1: $row->[3]\n";
print "Col#2: $row->[4]\n";
}

You definitely want to use a CPAN library for parsing CSV, as you will never account for all the quirks of the format.
Please see: How can I parse quoted CSV in Perl with a regex?
Please see: How do I efficiently parse a CSV file in Perl?
However, here is a very naive and non-idiomatic solution for that particular string you provided:
use strict;
use warnings;
my $string = '"ID","URL","DATE","XXID","DATE-LONGFORMAT"';
my #words = ();
my $word = "";
my $quotec = '"';
my $quoted = 0;
foreach my $c (split //, $string)
{
if ($quoted)
{
if ($c eq $quotec)
{
$quoted = 0;
push #words, $word;
$word = "";
}
else
{
$word .= $c;
}
}
elsif ($c eq $quotec)
{
$quoted = 1;
}
}
for (my $i = 0; $i < scalar #words; ++$i)
{
print "column " . ($i + 1) . " = $words[$i]\n";
}

finding the substring present in string and also count the number of occurrences

Could anyone tel me what is the mistake? As the program is for finding the substrings in a given string and count there number of occurrences for those substrings. but the substring must check the occurrences for every three alphabets.
for eg: String: AGAUUUAGA (i.e. for AGA, UUU, AGA)
output: AGA-2
UUU-1
print"Enter the mRNA Sequence\n";
$count=0;
$count1=0;
$seq=<>;
chomp($seq);
$p='';
$ln=length($seq);
$j=$ln/3;
for($i=0,$k=0;$i<$ln,$k<$j;$k++) {
$fra[$k]=substr($seq,$i,3);
$i=$i+3;
if({$fra[$k]} eq AGA) {
$count++;
print"The number of AGA is $count";
} elseif({$fra[$k]} eq UUU) {
$count1++;
print" The number of UUU is $count1";
}
}

This is a Perl FAQ:
perldoc -q count
This code will count the occurrences of your 2 strings:
use warnings;
use strict;
my $seq = 'AGAUUUAGA';
my $aga_cnt = () = $seq =~ /AGA/g;
my $uuu_cnt = () = $seq =~ /UUU/g;
print "The number of AGA is $aga_cnt\n";
print "The number of UUU is $uuu_cnt\n";
__END__
The number of AGA is 2
The number of UUU is 1
If you use strict and warnings, you will get many messages pointing out errors in your code.
Here is another approach which is more scalable:
use warnings;
use strict;
use Data::Dumper;
my $seq = 'AGAUUUAGA';
my %counts;
for my $key (qw(AGA UUU)) {
$counts{$key} = () = $seq =~ /$key/g;
}
print Dumper(\%counts);
__END__
$VAR1 = {
'AGA' => 2,
'UUU' => 1
};

Have a try with this, that avoids overlaps:
#!/usr/bin/perl
use strict;
use warnings;
use 5.10.1;
use Data::Dumper;
my $str = q!AGAUUUAGAGAAGAG!;
my #list = $str =~ /(...)/g;
my ($AGA, $UUU);
foreach(#list) {
$AGA++ if $_ eq 'AGA';
$UUU++ if $_ eq 'UUU';
}
say "number of AGA is $AGA and number of UUU is $UUU";
output:
number of AGA is 2 and number of UUU is 1

This is an example of how quickly you can get things done in Perl. Grouping the strands together as a alternation is one way to make sure there is no overlap. Also a hash is a great way to count occurrences of they key.
$values{$_}++ foreach $seq =~ /(AGA|UUU)/g;
print "AGA-$values{AGA} UUU-$values{UUU}\n";
However, I generally want to generalize it to something like this, thinking that this might not be the only time you have to do something like this.
use strict;
use warnings;
use English qw<$LIST_SEPARATOR>;
my %values;
my #spans = qw<AGA UUU>;
my $split_regex
= do { local $LIST_SEPARATOR = '|';
qr/(#spans)/
}
;
$values{$_}++ foreach $seq =~ /$split_regex/g;
print join( ' ', map { "$_-$values{$_}" } #spans ), "\n";

Your not clear on how many "AGA" the string "AGAGAGA" contains.
If 2,
my $aga = () = $seq =~ /AGA/g;
my $uuu = () = $seq =~ /UUU/g;
If 3,
my $aga = () = $seq =~ /A(?=GA)/g;
my $uuu = () = $seq =~ /U(?=UU)/g;

If I understand you correctly (and certainly that is questionable; almost every answer so far is interpreting your question differently than every other answer):
my %substring;
$substring{$1}++ while $seq =~ /(...)/;
print "There are $substring{UUU} UUU's and $substring{AGA} AGA's\n";

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Extracting substring in perl - perl

Related

Perl int to array

perl: how to make compact name from a numbered sequence

How to match Chinese character in the web pages whose charset is big5 (Perl)?

How to extract multiple columns from a CSV file using Perl

finding the substring present in string and also count the number of occurrences

Categories

Resources