Transform data to array with Perl - perl

How do I transform my data to an array with Perl?
Here is my data:
my $data =
"203.174.38.128203.174.38.129203.174.38.1" .
"30203.174.38.131203.174.38.132203.174.38" .
".133203.174.38.134173.174.38.135203.174." .
"38.136203.174.38.137203.174.38.142";
And I want to transform it to be array like this
my #array= (
"203.174.38.128",
"203.174.38.129",
"203.174.38.130",
"203.174.38.131",
"203.174.38.132",
"203.174.38.133",
"203.174.38.134",
"173.174.38.135",
"203.174.38.136",
"203.174.38.137",
"203.174.38.142"
);
Anyone know how to do that with Perl?

If the first part of IP logged is always 203, it's kinda easy:
my #arr = split /(?<=\d)(?=203\.)/, $data;
In the example given it's not, but the first part is always 3-digit, and the second part is always 174, so it's enough to do...
my #arr = split /(?<=\d)(?=\d{3}\.174\.)/, $data;
... to get the correct result.
But please understand that it's close to impossible to give a more generic (and bulletproof) solution here - when these 'marker' parts are... too dynamic. For example, take this string...
11.11.11.22222.11.11.11
The question is, where to split it? Should it be 11.11.11.22; 222.11.11.11? Or 11.11.11.222; 22.11.11.11? Both are quite valid IPs, if you ask me. And it could get even worse, with trying to split '2222' part (can be '2; 222', '22; 22' and even '222; 2').
You can, for example, make a rule: "split each sequence of > 3 digits followed by a dot sign so that the second part of this split would always start from 3 digits":
my #arr = split /(?<=\d)(?=\d{3}\.)/, $data;
... but this will obviously fail to work properly in the ambiguous cases mentioned earlier IF there are IPs with two- or even one-digit first octet in your datastring.

If you write a regex that will match any valid value for one of the numbers in the quartet then you can just search for them all and recombine them in sets of four. This
/2[0-5][0-5]|1\d\d|[1-9]\d|\d/
matches 200-255 or 100-199 or 10-99 or 0-9, and a program to use it is shown below.
There is no way to know which option to take if there is more than one way to split the string, and this solution assigns the longest value to the first of the two ip addresses. For instance, 1.1.1.1234.1.1.1 will split as 1.1.1.123 and 4.1.1.1
use strict;
use warnings;
use feature 'say';
my $data =
"203.174.38.128203.174.38.129203.174.38.1" .
"30203.174.38.131203.174.38.132203.174.38" .
".133203.174.38.134173.174.38.135203.174." .
"38.136203.174.38.137203.174.38.142";
my $byte = qr/2[0-5][0-5]|1\d\d|\d\d|\d/;
my #bytes = $data =~ /($byte)/g;
my #addresses;
push #addresses, join('.', splice(#bytes, 0, 4)) while #bytes;
say for #addresses;
output
203.174.38.128
203.174.38.129
203.174.38.130
203.174.38.131
203.174.38.132
203.174.38.133
203.174.38.134
173.174.38.135
203.174.38.136
203.174.38.137
203.174.38.142

Using your sample, it looks like you have 3 digits for the first and last node. That would prompt using this pattern:
/(\d{3}\.\d{1,3}\.\d{1,3}\.\d{3})/
Add that with a /g switch and it will pull every one.
However, if you have a larger and divergent set of data than what you show for your sample, somebody should have separated the ips before dumping them into this string. If they are separate data points, they should have some separation.

Related

perl to hardcode a static value in a field

I am still learning perl and have all most got a program written. My question, as simple as it may be, is if I want to hardcode a string to a field would the below do that? Thank you :).
$out[45]="VUS";
In the other lines I use the below to define the values that are passed into the `$[out], but the one in question is hardcoded and the others come from a split.
my #vals = split/\t/; # this splits the line at tabs
my #mutations=split/,/,$vals[9]; # splits on comma to create an array of mutations
my ($gene,$transcript,$exon,$coding,$aa);
for (#mutations)
{
($gene,$transcript,$exon,$coding,$aa) = split/\:/; # this takes col AB and splits it at colons
grep {$transcript eq $_} keys %nms or next;
}
my #out=($.,#colsleft,$_,#colsright);
$out[2]=$gene;
$out[3]=$nms{$transcript};
$out[4]=$transcript;
$out[15]=$coding;
$out[17]=$aa;
Your line of code: $out[45]="VUS"; is correct in that it is defining that 46th element of the array #out to the string, "VUS". I am trying to understand from your code, however why you would want to do that? Usually, it is better practice to not hardcode if at all possible. You want to make it your goal to make your program as dynamic as possible.

Perl $1 giving uninitialized value error

I am trying to extract a part of a string and put it into a new variable. The string I am looking at is:
maker-scaffold_26653|ref0016423-snap-gene-0.1
(inside a $gene_name variable)
and the thing I want to match is:
scaffold_26653|ref0016423
I'm using the following piece of code:
my $gene_name;
my $scaffold_name;
if ($gene_name =~ m/scaffold_[0-9]+\|ref[0-9]+/) {
$scaffold_name = $1;
print "$scaffold_name\n";
}
I'm getting the following error when trying to execute:
Use of uninitialized value $scaffold_name in concatenation (.) or string
I know that the pattern is right, because if I use $' instead of $1 I get
-snap-gene-0.1
I'm at a bit of a loss: why will $1 not work here?
If you want to use a value from the matching you have to make () arround the character in regex
To expand on Jens' answer, () in a regex signifies an anonymous capture group. The content matched in a capture group is stored in $1-9+ from left to right, so for example,
/(..):(..):(..)/
on an HH:MM:SS time string will store hours, minutes, and seconds in $1, $2, $3 respectively. Naturally this begins to become unwieldy and is not self-documenting, so you can assign the results to a list instead:
my ($hours, $mins, $secs) = $time =~ m/(..):(..):(..)/;
So your example could bypass the use of $ variables by doing direct assignment:
my ($scaffold_name) = $gene_name =~ m/(scaffold_[0-9]+[|]ref[0-9]+)/;
# $scaffold_name now contains 'scaffold_26653|ref0016423'
You can even get rid of the ugly =~ binding by using for as a topicalizer:
my $scaffold_name;
for ($gene_name) {
($scaffold_name) = m/(scaffold_\d+[|]ref\d+)/;
print $scaffold_name;
}
If things start to get more complex, I prefer to use named capture groups (introduced in Perl v5.10.0):
$gene_name =~ m{
(?<scaffold_name> # ?<name> creates a named capture group
scaffold_\d+? # 'scaffold' and its trailing digits
[|] # Literal pipe symbol
ref\d+ # 'ref' and its trailing digits
)
}xms; # The x flag lets us write more readable regexes
print $+{scaffold_name}, "\n";
The results of named capture groups are stored in the magic hash %+. Access is done just like any other hash lookup, with the capture groups as the keys. %+ is locally scoped in the same way the $ are, so it can be used as a drop-in replacement for them in most situations.
It's overkill for this particular example, but as regexes start to get larger and more complicated, this saves you the trouble of either having to scroll all the way back up and count anonymous capture groups from left to right to find which of those darn $ variables is holding the capture you wanted, or scan across a long list assignment to find where to add a new variable to hold a capture that got inserted in the middle.
My personal rule of thumb is to assign the results of anonymous captured to descriptively named lexically scoped variables for 3 or less captures, then switch to using named captures, comments, and indentation in regexes when more are necessary.

Perl get array count so can start foreach loop at a certain array element

I have a file that I am reading in. I'm using perl to reformat the date. It is a comma seperated file. In one of the files, I know that element.0 is a zipcode and element.1 is a counter. Each row can have 1-n number of cities. I need to know the number of elements from element.3 to the end of the line so that I can reformat them properly. I was wanting to use a foreach loop starting at element.3 to format the other elements into a single string.
Any help would be appreciated. Basically I am trying to read in a csv file and create a cpp file that can then be compiled on another platform as a plug-in for that platform.
Best Regards
Michael Gould
you can do something like this to get the fields from a line:
my #fields = split /,/, $line;
To access all elements from 3 to the end, do this:
foreach my $city (#fields[3..$#fields])
{
#do stuff
}
(Note, based on your question I assume you are using zero-based indexing. Thus "element 3" is the 4th element).
Alternatively, consider Text::CSV to read your CSV file, especially if you have things like escaped delimiters.
Well if your line is being read into an array, you can get the number of elements in the array by evaluating it in scalar context, for example
my $elems = #line;
or to be really sure
my $elems = scalar(#line);
Although in that case the scalar is redundant, it's handy for forcing scalar context where it would otherwise be list context. You can also find the index of the last element of the array with $#line.
After that, if you want to get everything from element 3 onwards you can use an array slice:
my #threeonwards = #line[3 .. $#line];

(3 lines) from bash to perl?

I have these three lines in bash that work really nicely. I want to add them to some existing perl script but I have never used perl before ....
could somebody rewrite them for me? I tried to use them as they are and it didn't work
note that $SSH_CLIENT is a run-time parameter you get if you type set in bash (linux)
users[210]=radek #where 210 is tha last octet from my mac's IP
octet=($SSH_CLIENT) # split the value on spaces
somevariable=$users[${octet[0]##*.}] # extract the last octet from the ip address
These might work for you. I noted my assumptions with each line.
my %users = ( 210 => 'radek' );
I assume that you wanted a sparse array. Hashes are the standard implementation of sparse arrays in Perl.
my #octet = split ' ', $ENV{SSH_CLIENT}; # split the value on spaces
I assume that you still wanted to use the environment variable SSH_CLIENT
my ( $some_var ) = $octet[0] =~ /\.(\d+)$/;
You want the last set of digits from the '.' to the end.
The parens around the variable put the assignment into list context.
In list context, a match creates a list of all the "captured" sequences.
Assigning to a scalar in a list context, means that only the number of scalars in the expression are assigned from the list.
As for your question in the comments, you can get the variable out of the hash, by:
$db = $users{ $some_var };
# OR--this one's kind of clunky...
$db = $users{ [ $octet[0] =~ /\.(\d+)$/ ]->[0] };
Say you have already gotten your IP in a string,
$macip = "10.10.10.123";
#s = split /\./ , $macip;
print $s[-1]; #get last octet
If you don't know Perl and you are required to use it for work, you will have to learn it. Surely you are not going to come to SO and ask every time you need it in Perl right?

What's the simplest way of adding one to a binary string in Perl?

I have a variable that contains a 4 byte, network-order IPv4 address (this was created using pack and the integer representation). I have another variable, also a 4 byte network-order, subnet. I'm trying to add them together and add one to get the first IP in the subnet.
To get the ASCII representation, I can do inet_ntoa($ip&$netmask) to get the base address, but it's an error to do inet_ntoa((($ip&$netmask)+1); I get a message like:
Argument "\n\r&\0" isn't numeric in addition (+) at test.pm line 95.
So what's happening, the best as I can tell, is it's looking at the 4 bytes, and seeing that the 4 bytes don't represent a numeric string, and then refusing to add 1.
Another way of putting it: What I want it to do is add 1 to the least significant byte, which I know is the 4th byte? That is, I want to take the string \n\r&\0 and end up with the string \n\r&\1. What's the simplest way of doing that?
Is there a way to do this without having to unpack and re-pack the variable?
What's happening is that you make a byte string with $ip&$netmask, and then try to treat it as a number. This is not going to work, as such. What you have to feed to inet_ntoa is.
pack("N", unpack("N", $ip&$netmask) + 1)
I don't think there is a simpler way to do it.
Confusing integers and strings. Perhaps the following code will help:
use Socket;
$ip = pack("C4", 192,168,250,66); # why not inet_aton("192.168.250.66")
$netmask = pack("C4", 255,255,255,0);
$ipi = unpack("N", $ip);
$netmaski = unpack("N", $netmask);
$ip1 = pack("N", ($ipi&$netmaski)+1);
print inet_ntoa($ip1), "\n";
Which outputs:
192.168.250.1