Creating a hashmap using perl split function - perl

I am attempting to create a hashmap from a text file. The way the text file is set up is as follows.
(integer)<-- varying white space --> (string value)
. . .
. . .
. . .
(integer)<-- varying white space --> (string value)
eg:
5 this is a test
23 this is another test
123 this is the final test
What I want to do is assign the key to the integer, and then the entire string following to the value. I was trying something along the lines of
%myHashMap;
while(my $info = <$fh>){
chomp($info);
my ($int, $string) = split/ /,$info;
$myHashMap{$int} = $string;
}
This doesn't work though because I have spaces in the string. Is there a way to clear the initial white space, grab the integer, assign it to $int, then clear white space till you get to the string, then take the remainder of the text on that line and place it in my $string value?

You could replace
split / /, $info # Fields are separated by a space.
with
split / +/, $info # Fields are separated by spaces.
or the more general
split /\s+/, $info # Fields are separated by whitespace.
but you'd still be faced with the problem of the leading spaces. To ignore those, use
split ' ', $info
This special case splits on whitespace, ignoring leading whitespace.
Don't forget to tell Perl that you expect at most two fields!
$ perl -E'say "[$_]" for split(" ", " 1 abc def ghi", 2)'
[1]
[abc def ghi]
The other option would be to use the following:
$info =~ /^\s*(\S+)\s+(\S.*)/
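A minimal sketch of how that match could be used to fill the hash. The sub name parse_lines and the sample lines are assumptions made for illustration, not part of the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: build a hash from lines of the form
# "<integer><whitespace><rest of line>".
sub parse_lines {
    my @lines = @_;
    my %hash;
    for my $info (@lines) {
        chomp $info;
        # Skip leading whitespace, capture the first field, then capture
        # everything from the next non-space character to the end of line.
        if ( $info =~ /^\s*(\S+)\s+(\S.*)/ ) {
            $hash{$1} = $2;
        }
    }
    return %hash;
}

my %myHashMap = parse_lines(
    "5 this is a test\n",
    "   23    this is another test\n",
    "123\tthis is the final test\n",
);
print "$_ => $myHashMap{$_}\n" for sort { $a <=> $b } keys %myHashMap;
```

Lines that do not contain two fields simply fail the match and are skipped.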

You just need to split each line of text on whitespace into two fields.
This example program assumes that the input file is passed as a parameter on the command line. I have used Data::Dump only to show the resulting hash structure.
use strict;
use warnings 'all';
my %data;
while ( <> ) {
s/\s*\z//;
my ($key, $val) = split ' ', $_, 2;
next unless defined $val; # Ensure that there were two fields
$data{$key} = $val;
}
use Data::Dump;
dd \%data;
output
{
5 => "this is a test",
23 => "this is another test",
123 => "this is the final test",
}

First, strip the leading whitespace:
$info =~ s/^\s+//;
Second, there can be more than one space between the integer and the string, so split on one or more spaces, and limit the split to two fields so that the spaces inside the string are kept:
split / +/, $info, 2;
The code is
use strict;
use warnings;
my %myHashMap;
while(my $info = <$fh>){
chomp($info);
$info =~ s/^\s+//g;
my ($int, $string) = split / +/, $info, 2;
$myHashMap{$int} = $string;
}

Related

Perl script that parses CSV file excluding the contents enclosed in []

Hi there, I am struggling with a Perl script that parses an eight-column CSV line into another CSV line using the split command, but I want to exclude all the text enclosed in square brackets []. The line looks like:
128.39.120.51,0,49788,6,SYN,[8192:127:1:52:M1460,N,W2,N,N,S:.:Windows:XP/2000 (RFC1323+, w+, tstamp-):link:ethernet/modem],1,1399385680
I used the following script, but when I print $fields[7] it gives me N, one of the fields inside the [] above. I want $fields[7] to be 1399385680, which is the last field in the line above. The script I tried was:
while (my $line = <LOG>) {
chomp $line;
my @fields=grep { !/^[\[.*\]]$/ } split ",", $line;
my $timestamp=$fields[7];
print "$fields[7]";
}
Thanks for your time. I will appreciate your help.
Always include use strict; and use warnings; at the top of EVERY perl script.
Your "csv" file isn't proper csv. So the only thing I can suggest is to remove the contents in the brackets before you split:
use strict;
use warnings;
while (<DATA>) {
chomp;
s/\[.*?\]//g;
my @fields = split ',', $_;
my $timestamp = $fields[7];
print "$timestamp\n";
}
__DATA__
128.39.120.51,0,49788,6,SYN,[8192:127:1:52:M1460,N,W2,N,N,S:.:Windows:XP/2000 (RFC1323+, w+, tstamp-):link:ethernet/modem],1,1399385680
Outputs:
1399385680
Obviously it is possible to also capture the contents of the bracketed fields, but you didn't say that was a requirement or goal.
Update
If you want to capture the bracket delimited field, one method would be to use a regex for capturing instead.
Note, this current regex requires that each field has a value.
chomp;
my @fields = $_ =~ /(\[.*?\]|[^,]+)(?:,|$)/g;
my $timestamp = $fields[7];
print "$timestamp";
Well, if you want to actually ignore the text between square brackets, you might as well get rid of it:
while ( my $line = <LOG> ) {
chomp $line;
$line =~ s,\[.*?\],,; # Delete all text between square brackets
my @fields = split ",", $line;
my $timestamp = $fields[7];
print $fields[7], "\n";
}

replace a string of characters with the line number

I have a text file that has approximately 3,000 lines. 99% of the time I need all 3,000 lines. However, periodically I will grep out the lines I need and direct the output to another text file to use.
The only problem I have in doing so is that embedded in the text file is a six-character string of digits that indicates the line number. In order to use the file, this area needs to be correctly renumbered. (I don't need to re-sort the data, but I need to replace the current six characters with the new line number, and it must be padded with zeros!) Unfortunately, each row is one long string of data with no field separators!
For example, my first three rows might look something like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
The six characters at positions 17-22 (Immediately following the "ZZ"), need be renumbered based on the current row number...so the above needs to look like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS
Any ideas would be greatly appreciated!
Thanks,
KSL.
Here's the solution I came up with in Perl. It assumes that the numbering is always 6 digits after the ZZ sequence.
In convert.pl:
use strict;
use warnings;
my $i = 1; # or the value you want to start numbering
while (<STDIN>) {
my $replace = sprintf("%06d", $i++);
$_ =~ s/ZZ\d{6}/ZZ$replace/g;
print $_;
}
In data.dat:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
To run:
cat data.dat | perl convert.pl
Output
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS
If I were solving this, I would write a simple Python script that filters the lines the way grep does and keeps an internal counter.
As a hint, you can read each line into a string and access the number with variablename[16:22] (Python slices are zero-based and exclude the end index, so 16:22 covers positions 17-22).
Python strings also have a replace method; use it, or rebuild the line from its slices, to substitute the zero-padded counter you create.
I hope this helps.
To do this in awk:
awk '{print substr($0,1,16) sprintf("%06d", NR) substr($0,23)}'
or
gawk 'match($0,/^(.*ZZ)[0-9]{6}(.*)/,a) {print a[1] sprintf("%06d",NR) a[2]}'
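The same position-based idea can be sketched in Perl with substr used as an lvalue. The helper name renumber and the length guard are assumptions added for illustration; the offsets follow the question's statement that the number sits at positions 17-22.

```perl
use strict;
use warnings;

# Hypothetical helper: overwrite the six characters at positions 17-22
# (offset 16, length 6) with a zero-padded sequence number.
sub renumber {
    my ( $line, $n ) = @_;
    return $line if length($line) < 22;    # too short: leave untouched
    substr( $line, 16, 6 ) = sprintf '%06d', $n;
    return $line;
}

my @rows = (
    '20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD',
    '20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS',
);
my $n = 1;
print renumber( $_, $n++ ), "\n" for @rows;
```

Unlike the ZZ-based substitution, this does not check what the six characters contain, so it only suits files where the layout is known to be fixed.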
This is exactly the type of thing where unpack is useful.
#!/usr/bin/env perl
use v5.10.0;
use strict;
use warnings;
while( my $line = <> ){
chomp $line;
my @elem = unpack 'A16 A6 A*', $line;
$elem[1] = sprintf '%06d', $.;
# $. is the line number for the last used file handle
say @elem;
}
Actually looking at the lines, it looks like there is date information stored in the first 14 characters.
Assuming that at some point you might want to parse the lines for some reason you can use the following as an example of how you could use unpack to split up the lines.
#!/usr/bin/env perl
use v5.10.0; # say()
use strict;
use warnings;
use DateTime;
my @date_elem = qw'
year month day
hour minute second
';
my @elem_names = ( @date_elem, qw'
ZZ
line_number
random_data
');
while( my $line = <> ){
chomp $line;
my %data;
@data{ @elem_names } = unpack 'A4 (A2)6 A6 A*', $line;
# choose either this:
$data{line_number} = sprintf '%06d', $.;
say @data{@elem_names};
# or this:
$data{line_number} = $.;
printf '%04d' . ('%02d'x5) . "%2s%06d%s\n", @data{ @elem_names };
# the choice will affect the contents of %data
# this just shows the contents of %data
for( @elem_names ){
printf qq'%12s: "%s"\n', $_, $data{$_};
}
# you can create a DateTime object with the date elements
my $dt = DateTime->new(
(map{ $_, $data{$_} } @date_elem),
time_zone => 'floating',
);
say $dt;
print "\n";
}
Although it would be better to use a regular expression, so that you could throw out bogus data.
use v5.14; # /a modifier
...
my $rdate = join '', map{"(\\d{$_})"} 4, (2)x5;
my $rx = qr/$rdate (ZZ) (\d{6}) (.*)/xa; # qr'...' would not interpolate $rdate
while( my $line = <> ){
chomp $line;
my %data;
unless( @data{ @elem_names } = $line =~ $rx ){
die qq'unable to parse line "$line" ($.)';
}
...
It would be better still to use named capture groups, which were added in 5.10.
...
my $rx = qr'
(?<year> \d{4} ) (?<month> \d{2} ) (?<day> \d{2} )
(?<hour> \d{2} ) (?<minute> \d{2} ) (?<second> \d{2} )
ZZ
(?<line_number> \d{6} )
(?<random_data> .* )
'xa;
while( my $line = <> ){
chomp $line;
unless( $line =~ $rx ){
die qq'unable to parse line "$line" ($.)';
}
my %data = %+;
# for compatibility with previous examples
$data{ZZ} = 'ZZ';
...

How to split this string 'gi|216ATGCTGATGCTGTG' into this format 'gi|216 ATGCTGTGCTGATGCTG' in Perl?

I am parsing the fasta alignment file which contains
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
I want to split this string into this:
gi|216 CCAACGAAATGATCGCCACACAA
gi|21- GCTGGTTCAGCGACCAAAAGTAGC
For first string, I use
$aar=split("\d",$string);
But that didn't work. What should I do?
So you're parsing some genetic data and each line has a gi| prefix followed by a sequence of numbers and hyphens followed by the nucleotide sequence? If so, you could do something like this:
my ($number, $nucleotides);
if($string =~ /^gi\|([\d-]+)([ACGT]+)$/) {
$number = $1;
$nucleotides = $2;
}
else {
# Broken data?
}
That assumes that you've already stripped off leading and trailing whitespace. If you do that, you should get $number = '216' and $nucleotides = 'CCAACGAAATGATCGCCACACAA' for the first one and $number = '21-' and $nucleotides = 'GCTGGTTCAGCGACCAAAAGTAGC' for the second one.
Looks like BioPerl has some stuff for dealing with fasta data so you might want to use BioPerl's tools rather than rolling your own.
Here's how I'd go about doing that.
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
while ( my $line = <DATA> ) {
my @strings =
grep {m{\A \S+ \z}xms} # no whitespace tokens
split /\A ( \w+ \| [\d-]+ )( [ACTG]+ ) /xms, # capture left & right
$line;
print Dumper( \@strings );
}
__DATA__
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
If you just want to add a space (can't really tell from your question), use substitution. To put a space in front of any grouping of ACTG:
$string =~ s/([ACTG]+)/ $1/;
or to add a tab after any grouping of digits and dashes:
$string =~ s/([\d-]+)/$1\t/;
note that this will substitute on $string in place.
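As a sketch that ties the substitution to the "gi|" prefix, so the space only lands between the identifier and the sequence, something like the following could work. The helper name split_fasta_id is an assumption for illustration, and the pattern assumes only A, C, G and T appear in the sequence, as in the sample data.

```perl
use strict;
use warnings;

# Hypothetical helper: insert one space between the "gi|<digits/dashes>"
# identifier and the nucleotide sequence; returns undef if the line
# does not have that shape.
sub split_fasta_id {
    my ($string) = @_;
    return $string =~ /^(gi\|[\d-]+)([ACGT]+)$/ ? "$1 $2" : undef;
}

for my $s ( 'gi|216CCAACGAAATGATCGCCACACAA', 'gi|21-GCTGGTTCAGCGACCAAAAGTAGC' ) {
    my $out = split_fasta_id($s);
    print defined $out ? "$out\n" : "unparsable: $s\n";
}
```

Returning undef for lines that do not match makes broken data visible instead of silently passing it through.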

How to Split on three different delimiters then ucfirst each result[]?

I am trying to figure out how to split a string that has three possible delimiters (or none) without a million lines of code, while keeping the code legible to a guy like me.
Many possible combinations in the string.
this-is_the.string
this.is.the.string
this-is_the_string
thisisthestring
There are no spaces in the string and none of these characters:
~`!@#$%^&*()+=\][{}|';:"/?>,<.
The string is already stripped of all but:
0-9
a-Z
-
_
.
There are also no sequential dots, dashes or underscores.
I would like the result to be displayed like Result:
This Is The String
I am really having a difficult time trying to get this going.
I believe I will need to use a hash and I just have not grasped the concept even after hours of trial and error.
I am bewildered at the fact I could possibly split a string on multiple delimiters where the delimiters could be in any order AND/OR three different types (or none at all) AND maintain the order of the result!
Any possibilities?
Split the string into words, capitalise the words, then join the words while inserting spaces between them.
It can be coded quite succinctly:
my $clean = join ' ', map ucfirst lc, split /[_.-]+/, $string;
If you just want to print out the result, you can use
use feature qw( say );
say join ' ', map ucfirst lc, split /[_.-]+/, $string;
or
print join ' ', map ucfirst lc, split /[_.-]+/, $string;
print "\n";
It is simple to use a global regular expression to gather all sequences of characters that are not a dot, dash, or underscore.
After that, lc will lower-case each string and ucfirst will capitalise it. Stringifying an array will insert spaces between the elements.
for ( qw/ this-is_the.string this.is.the.string this-is_the_string / ) {
my @string = map {ucfirst lc } /[^-_.]+/g;
print "@string\n";
}
output
This Is The String
This Is The String
This Is The String
" the delimiters could be anywhere AND/OR three different types (or none at all)" ... you need a delimiter to split a string, you can define multiple delimiters with a regular expression to the split function
my @parts = split(/[-_\.]/, $string);
print ucfirst "$_ " foreach @parts;
print "\n";
Here's a solution that will work for all but your last test case. It's extremely hard to split a string without delimiters, you'd need to have a list of possible words, and even then it would be prone to error.
#!/usr/bin/perl
use strict;
use warnings;
my @strings = qw(
this-is_the.string
this.is.the.string
this-is_the_string
thisisthestring
);
foreach my $string (@strings) {
print join(q{ }, map {ucfirst($_)} split(m{[_.-]}smx,$string)) . qq{\n};
}
And here's an alternative for the loop that splits everything into separate statements to make it easier to read:
foreach my $string (@strings) {
my @words = split m{[_.-]}smx, $string;
my @upper_case_words = map {ucfirst($_)} @words;
my $string_with_spaces = join q{ }, #upper_case_words;
print $string_with_spaces . qq{\n};
}
And to prove that just because you can, doesn't mean you should :P
$string =~ s{([A-Za-z]+)([_.-]*)?}{ucfirst(lc("$1")).($2?' ':'')}ge;
For all but last possibility:
use strict;
use warnings;
my $file;
my $newline;
open $file, "<", "testfile";
while (<$file>) {
chomp;
$newline = join ' ', map ucfirst lc, split /[-_\.]/, $_;
print $newline . "\n";
}

LWP::Simple - how to implement a loop into it [with live demo]

Good evening, dear community!
I want to process multiple web pages, much like a web spider/crawler might. I have some bits, but now I need some improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50
This page has more than 6000 results! How do I get all of them?
I use the module LWP::Simple and I need some improved arguments that I can use in order to get all 6150 records.
Attempt: Here are the first 5 page URLs:
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200
We can see that the "s" attribute in the URL starts at 0 for page 1, then increases by 50
for each page there after. We can use this information to create a loop:
my $i_first = "0";
my $i_last = "6100";
my $i_interval = "50";
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
#process pageurl
}
tadmc (a very, very supportive user) has created a great script that outputs CSV-formatted results. I have built this loop into the code. (Note: I guess that something has gone wrong! See the musings below, with the code snippets and the error messages.)
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $i_first = "0";
my $i_last = "6100";
my $i_interval = "50";
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
#process pageurl
}
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d; # strip the carriage returns
$html =~ s/ / /g; # expand the spaces
my $te = new HTML::TableExtract();
$te->parse($html);
my @cols = qw(
rownum
number
name
phone
type
website
);
my @fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {
trim leading/trailing whitespace from base fields
s/^s+//, s/\s+$// for @$row;
load the fields into the hash using a "hash slice"
my %h;
@h{@cols} = @$row;
derive some fields from base fields, again using a hash slice
@h{qw/name street postal town/} = split /n+/, $h{name};
@h{qw/phone fax/} = split /n+/, $h{phone};
trim leading/trailing whitespace from derived fields
s/^s+//, s/\s+$// for @h{qw/name street postal town/};
$csv->combine(@h{@fields});
print $csv->string, "\n";
}
}
There have been some issues. I have made a mistake; I guess that the error is here:
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
#process pageurl
}
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d; # strip the carriage returns
$html =~ s/ / /g; # expand the spaces
I have written some of the code twice. I need to leave out one part, this one here:
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d; # strip the carriage returns
$html =~ s/ / /g; # expand the spaces
see the results in the command line:
martin@suse-linux:~> cd perl
martin@suse-linux:~/perl> perl bavaria_all_.pl
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
syntax error at bavaria_all_.pl line 59, near "/,"
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 59.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Substitution replacement not terminated at bavaria_all_.pl line 63.
martin@suse-linux:~/perl>
what do you think!?
look forward to hear from you
By the way, here is the code created by tadmc, without any improved spider logic. It runs very nicely, without any issue: it outputs nicely formatted CSV!
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/\r//d; # strip the carriage returns
$html =~ s/&nbsp;/ /g; # expand the spaces
my $te = new HTML::TableExtract();
$te->parse($html);
my @cols = qw(
rownum
number
name
phone
type
website
);
my @fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {
# trim leading/trailing whitespace from base fields
s/^\s+//, s/\s+$// for @$row;
# load the fields into the hash using a "hash slice"
my %h;
@h{@cols} = @$row;
# derive some fields from base fields, again using a hash slice
@h{qw/name street postal town/} = split /\n+/, $h{name};
@h{qw/phone fax/} = split /\n+/, $h{phone};
# trim leading/trailing whitespace from derived fields
s/^\s+//, s/\s+$// for @h{qw/name street postal town/};
$csv->combine(@h{@fields});
print $csv->string, "\n";
}
}
Note: the code above runs nicely; it outputs CSV-formatted data.
A different approach to achieve paging is to extract all URLs from the page and detect the pager URLs.
...
for (@urls) {
if (is_pager_url($_) and not exists $seen{$_}) {
push @pager_url, $_;
$seen{$_}++;
}
}
...
sub is_pager_url {
my ($url) = @_;
return 1 if $url =~ m{schulsuche.asp\?q=e\&a=\d+\&s=\d+};
}
This way you don't have to deal with incrementing counters or establishing the total number of pages. It will also work for different values of a and s. By keeping a %seen hash, you can cheaply avoid differentiating between prev and next pages.
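The filtering step above could be sketched as a self-contained program like this. How the URLs are extracted from the page is left out, as in the answer; the helper name unique_pager_urls and the sample URL list (including the non-pager link) are assumptions for illustration.

```perl
use strict;
use warnings;

# Return true if the URL looks like one of the result-pager links.
sub is_pager_url {
    my ($url) = @_;
    return $url =~ m{schulsuche\.asp\?q=e&a=\d+&s=\d+} ? 1 : 0;
}

# Hypothetical helper: keep each pager URL once, in first-seen order.
sub unique_pager_urls {
    my @urls = @_;
    my ( %seen, @pager_url );
    for my $url (@urls) {
        next unless is_pager_url($url);
        next if $seen{$url}++;    # already collected
        push @pager_url, $url;
    }
    return @pager_url;
}

# Sample input; the second entry is a made-up non-pager link.
my @found = unique_pager_urls(
    'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50',
    'http://192.68.214.70/km/asps/impressum.asp',
    'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50',
    'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100',
);
print "$_\n" for @found;
```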
Excellent! I was waiting for you to figure out how to get the multiple pages on your own!
1) put my code inside of the page-getting loop (move the "}" way down to the end).
2) $html = get $pageurl; # change this to use your new URL
3) put my backslash back where I had it: tr/\r//d;
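Put together, the three steps might look roughly like the sketch below. The sub names page_urls and process_pages are assumptions for illustration, and the fetcher is passed in as a code ref (for example \&LWP::Simple::get) so that the sketch itself does not touch the network. The table extraction and CSV code from the script would go inside the callback.

```perl
use strict;
use warnings;

# Build the list of page URLs: s=0, 50, 100, ... up to and including $last.
sub page_urls {
    my ( $first, $last, $interval ) = @_;
    my @urls;
    for ( my $i = $first; $i <= $last; $i += $interval ) {
        push @urls,
            "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
    }
    return @urls;
}

# Hypothetical driver: fetch every page and hand the cleaned HTML to a
# callback. $fetch is any code ref that takes a URL and returns the page
# body, or undef on failure.
sub process_pages {
    my ( $fetch, $handle_html, @urls ) = @_;
    for my $pageurl (@urls) {
        my $html = $fetch->($pageurl);
        next unless defined $html;    # skip pages that failed to download
        $html =~ tr/\r//d;            # strip the carriage returns
        $handle_html->($html);        # table extraction and CSV output go here
    }
    return;
}

print scalar( my @all = page_urls( 0, 6100, 50 ) ), " pages to fetch\n";
```

Keeping the per-page work in a callback means the whole body of the original script ends up inside the loop, which is what step 1 asks for.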