Perl - finding multiple substrings in a variable? - perl

I'm very new to Perl and I've been using this site for a few years looking for answers to questions I've had regarding other projects I've been working on but I can't seem to find the answer to this one..
My script receives a HTTP response with data that I need to strip down to parse the correct information.
So far, I have this for the substring..
my $output = substr ($content, index($content, 'Down'), index($content, '>'));
This seems to be doing what it's supposed to be.. finding the word 'Down' and then substringing it up until it finds a >.
However, the string 'Down' can appear many times within the response and this stops looking after it finds the first example of this.
How would I go about getting the substring to be recursive?
Thanks for your help :)

One way like this:
my $x="aa Down aaa > bb Down bbb > cc Down ccc >";
while ($x =~/(Down[^>]+>)/g){
print $1;
}

Another solution,without iteration and just storing what is down.
use Data::Dumper;
my $x="aa Down aaa > bb Down bbb > cc Down ccc >";
my #downs = $x =~ m!Down([^>]+)>!gis;
print Dumper(\#downs);

Related

Error in Linkdatagen : Use of uninitiated value $chr in concatenation (.) or string

Hi I was trying to use linkdatagen, which is a perl based tool. It requires a vcf file (using mpileup from SAMtools) and a hapmap annotation file (provided). I have followed the instructions but the moment I use the perl script provided, I get this error.
The codes I used are:
samtools mpileup -d10000 -q13 -Q13 -gf hg19.fa -l annotHapMap2U.txt samplex.bam | bcftools view -cg -t0.5 - > samplex.HM.vcf
Perl vcf2linkdatagen.pl -variantCaller mpileup -annotfile annotHapMap2U.txt -pop CEU -mindepth 10 -missingness 0 samplex.HM.vcf > samplex.brlmm
Use of uninitiated value $chr in concatenation (.) or string at vcf2linkdatagentest.pl line 487, <IN> line 1.... it goes on and on.. I have mailed the authors, and haven't heard from them yet. Can anyone here please help me? What am I doing wrong?
The perl script is :
http://bioinf.wehi.edu.au/software/linkdatagen/vcf2linkdatagen.pl
The HapMap file can be downloaded from the website mentioned below.
http://bioinf.wehi.edu.au/software/linkdatagen/
Thanks so much
Ignoring lines starting with #, vcf2linkdatagen.pl expects the first field of the first line of the VCF to contain something of the form "chrsomething", and your file doesn't meet that expectation. Examples from a comment in the code:
chr1 888659 . T C 226 . DP=26;AF1=1;CI95=1,1;DP4=0,0,9,17;MQ=49;FQ=-81 GT:PL:GQ 1/1:234,78,0:99
chr1 990380 . C . 44.4 . DP=13;AF1=7.924e-09;CI95=1.5,0;DP4=3,10,0,0;MQ=49;FQ=-42 PL 0
The warning means that a variable used in a string is not initialized (undefined). It is an indication that something might be wrong. The line in question can be traced to this statement
my $chr = $1 if ($tmp[0] =~ /chr([\S]+)/);
It is bad practice to use postfix if statements on a my statement.
As ikegami notes a workaround for this might be
my ($chr) = $tmp[0] =~ /chr([\S])/;
But since the match failed, it will likely return the same error. The only way to solve is to know more about the purpose of this variable, if the error should be fatal or not. The author has not handled this case, so we do not know.
If you want to know more about the problem, you might add a debug line such as this:
warn "'chr' value not found in the string '$tmp[0]'" unless defined $chr;
Typically, an error like this occurs when someone gives input to a program that the author did not expect. So if you see which lines give this warning, you might find out what to do about it.

Convert hex values to characters

I have strings like: "Film-DVD \x{bb}Once / The Swell Season (Collector's Edition\x{ab} John Carney" which are the result of Data::Dumper.
Now I want the hex-values \x{bb}, \x{ab} to be replaced with corresponding characters » and «.
I already tried:
$a =~ s/\\x\{(.{2})\}/chr(hex($1))/eg;
But this returns me "Film-DVD �Once / The Swell Season (Collector's Edition� John Carney"
Do you have any ideas what i could do?
The code you posted is correct.
The problem appears to be that you forgot to tell Perl to encode your output. This is normally done using
use open ':std', ':encoding(UTF-8)';

To find the matched and unmatched values in perl

I am a newbie to programming and I hope someone can explain this to me:
So I have two text files i.e. Scan1.txt and Scan2.txt that are stored in my computer. Scan1.txt contains:
Tom
white
black
mark
john
ben
Scan2.txt contains:
bob
ben
white
gary
tom
black
patrick
I have to extract the matched values of these two files and the unmatched values and print them separately. I somehow found the solution for this which works fine. But can someone please explain how exactly the match happens here. Looks like somehow just this line:
$hash{$matchline}++ in the code does the matching and increments the value of hash when the match is found. I understand the logic but I do not understand how this match actually happens. Can someone help me understand this?
Thank you in advance!
Here is the code:
open (F1, "Scan1.txt");
open (F2, "Scan2.txt");
%hash=();
while ($matchline= <F1> ){
$hash{$matchline}=1;
}
close F1;
while( $matchline= <F2> ){
$hash{$matchline}++;
}
close F2;
foreach $matchline (keys %hash){
if ($hash{$matchline} == 1){
chomp($matchline);
push(#unmatched, $matchline);
}
else{
chomp($matchline);
push (#matched, $matchline);
}
}
print "Matched Entries are >>\n";
print "```````````````````````\n";
print join ("\n", #matched) . "\n";
print "```````````````````````\n";
print "Unmatched Entries are >>\n";
print "```````````````````````\n";
print join ("\n", #unmatched) . "\n";
print "```````````````````````\n";
The code you mention above will give you a false result if a given word exists more than one time in the second file and not exists in the first.
this line:
$hash{$matchline}++
increments a different counter for each different word.
in the first loop it sets to 1 for the words in the first file.
so if a word exists in each file the counter will be at least 2.
the $hash itself is a set of counters.
A more generalized version of your problem is that of computing the set union or intersection between two sets. This link gives a very good treatment of the problem in general.
In your case, the set is nothing but the list of values from each file. The logic is, if a certain value was present in both files then $hash{matchline} == 2, because the value will be incremented in both the while loops. However, if the line was present in only one of the files, the value of $hash{matchline} == 1, since only one while loop will increment the value and not the other.
Also, Lajos Veres raises a very important point: if a certain word, say "Tom" is present twice in the same file, then the algorithm will fail. It is a subtle detail, which can be resolved in many ways- removing duplicates beforehand, using two hashes, etc.
Hope this helps.

creating a hash with regex matches in perl

Lets say i have a file like below:
And i want to store all the decimal numbers in a hash.
hello world 10 20
world 10 10 10 10 hello 20
hello 30 20 10 world 10
i was looking at this
and this worked fine:
> perl -lne 'push #a,/\d+/g;END{print "#a"}' temp
10 20 10 10 10 10 20 30 20 10 10
Then what i need was to count number of occurrences of each regex.
for this i think it would be better to store all the matches in a hash and assign an incrementing value for each and every key.
so i tried :
perl -lne '$a{$1}++ for ($_=~/(\d+)/g);END{foreach(keys %a){print "$_.$a{$_}"}}' temp
which gives me an output of:
> perl -lne '$a{$1}++ for ($_=~/(\d+)/g);END{foreach(keys %a){print "$_.$a{$_}"}}' temp
10.4
20.7
Can anybody correct me whereever i was wrong?
the output i expect is:
10.7
20.3
30.1
although i can do this in awk,i would like to do it only in perl
Also order of the output is not a concern for me.
$a{$1}++ for ($_=~/(\d+)/g);
This should be
$a{$_}++ for ($_=~/(\d+)/g);
and can be simplified to
$a{$_}++ for /\d+/g;
The reason for this is that /\d+/g creates a list of matches, which is then iterated over by for. The current element is in $_. I imagine $1 would contain whatever was left in there by the last match, but it's definitely not what you want to use in this case.
Another option would be this:
$a{$1}++ while ($_=~/(\d+)/g);
This does what I think you expected your code to do: loop over each successful match as the matches happen. Thus the $1 will be what you think it is.
Just to be clear about the difference:
The single argument for loop in Perl means "do something for each element of a list":
for (#array)
{
#do something to each array element
}
So in your code, a list of matches was built first, and only after the whole list of matches was found did you have the opportunity to do something with the results. $1 got reset on each match as the list was being built, but by the time your code was run, it was set to the last match on that line. That is why your results didn't make sense.
On the other hand, a while loop means "check if this condition is true each time, and keep going until the condition is false". Therefore, the code in a while loop will be executed on each match of a regex, and $1 has the value for that match.
Another time this difference is important in Perl is file processing. for (<FILE>) { ... } reads the entire file into memory first, which is wasteful. It is recommended to use while (<FILE>) instead, because then you go through the file line by line and keep only the information you want.

How do I parse this file and store it in a table?

I have to parse a file and store it in a table. I was asked to use a hash to implement this. Give me simple means to do that, only in Perl.
-----------------------------------------------------------------------
L1234| Archana20 | 2010-02-12 17:41:01 -0700 (Mon, 19 Apr 2010) | 1 line
PD:21534 / lserve<->Progress good
------------------------------------------------------------------------
L1235 | Archana20 | 2010-04-12 12:54:41 -0700 (Fri, 16 Apr 2010) | 1 line
PD:21534 / Module<->Dir,requires completion
------------------------------------------------------------------------
L1236 | Archana20 | 2010-02-12 17:39:43 -0700 (Wed, 14 Apr 2010) | 1 line
PD:21534 / General Page problem fixed
------------------------------------------------------------------------
L1237 | Archana20 | 2010-03-13 07:29:53 -0700 (Tue, 13 Apr 2010) | 1 line
gTr:SLC-163 / immediate fix required
------------------------------------------------------------------------
L1238 | Archana20 | 2010-02-12 13:00:44 -0700 (Mon, 12 Apr 2010) | 1 line
PD:21534 / Loc Information Page
------------------------------------------------------------------------
I want to read this file and I want to perform a split or whatever to extract the following fields in a table:
the id that starts with L should be the first field in a table
Archana20 must be in the second field
timestamp must be in the third field
PD must be in the fourth field
Type (content preceding / must be in the last field)
My questions are:
How to ignore the --------… (separator line) in this file?
How to extract the above?
How to split since the file has two delimiters (|, /)?
How to implement it using a hash and what is the need for this?
Please provide some simple means so that I can understand since I am a beginner to Perl.
My questions are:
How to ignore the --------… (separator line) in this file?
How to extract the above?
How to split since the file has two delimiters (|, /)?
How to implement it using a hash and what is the need for this?
You will probably be working through the file line by line in a loop. Take a look at perldoc -f next. You can use regular expressions or a simpler match in this case, to make sure that you only skip appropriate lines.
You need to split first and then handle each field as needed after, I would guess.
Split on the primary delimiter (which appears to be ' | ' - more on that in a minute), then split the final field on its secondary delimiter afterwards.
I'm not sure if you are asking whether you need a hash or not. If so, you need to pick which item will provide the best set of (unique) keys. We can't do that for you since we don't know your data, but the first field (at a glance) looks about right. As for how to get something like this into a more complex data structure, you will want to look at perldoc perldsc eventually, though it might only confuse you right now.
One other thing, your data above looks like it has a semi-important typo in the first line. In that line only, there is no space between the first field and its delimiter. Everywhere else it's ' | '. I mention this only because it can matter for split. I nearly edited this, but maybe the data itself is irregular, though I doubt it.
I don't know how much of a beginner you are to Perl, but if you are completely new to it, you should think about a book (online tutorials vary widely and many are terribly out of date). A reasonably good introductory book is freely available online: Beginning Perl. Another good option is Learning Perl and Intermediate Perl (they really go together).
When you say This is not a homework...to mean this will be a start to assess me in perl I assume you mean that this is perhaps the first assignment you have at a new job or something, in which case It seems that if we just give you the answer it will actually harm you later since they will assume you know more about Perl than you do.
However, I will point you in the right direction.
A. Don't use split, use regular expressions. You can learn about them by googling "perl regex"
B. Google "perl hash" to learn about perl hashes. The first result is very good.
Now to your questions:
regular expressions will help you ignore lines you don't want
regular expressions with extract items. Look up "capture variables"
Don't split, use regex
See point B above.
If this file is line based then you can do a line by line based read in a while loop. Then skip those lines that aren't formatted how you wish.
After that, you can either use regex as indicated in the other answer. I'd use that to split it up and get an array and build a hash of lists for the record. Either after that (or before) clean up each record by trimming whitespace etc. If you use regex, then use the capture expressions to add to your list in that fashion. Its up to you.
The hash key is the first column, the list contains everything else. If you are just doing a direct insert, you can get away with a list of lists and just put everything in that instead.
The key for the hash would allow you to look at particular records for fast lookup. But if you don't need that, then an array would be fine.
You can try this one,
Points need to know:
read the file line by line
By using regular expression, removing '----' lines.
after that use split function to populate Hashes of array .
#!/usr/bin/perl
use strict;
use warning;
my $test_file = 'test.txt';
open(IN, '<' ,"$test_file") or die $!;
my (%seen, $id, $name, $timestamp, $PD, $type);
while(<IN>){
chomp;
my $line = $_;
if($line =~ m/^-/){ #removing '---' lines
# print "$line:hello\n";
}else{
if ($line =~ /\|/){
($id , $name, $timestamp) = split /\|/, $line, 4;
} else{
($PD, $type) = split /\//, $line , 3;
}
$seen{$id}= [$name, $timestamp, $PD, $type]; //use Hashes of array
}
}
for my $test(sort keys %seen){
my $test1 = $seen{$test};
print "$test:#{$test1}\n";
}
close(IN);