I have a perl script that I was cleaning up, it works very well, but I was wondering if anyone knows of a good way to combine 2 splits into one command.
I have a .csv file, like this small example:
Move_VALIDATE,020212191ABC01,SNSNT---01CAB101A-1-1-4-20,circuit 402339-1,interface 1/1/0
Move_VALIDATE,030323202ABC01,SNSNT01CAB101A-1-1-4-20,circuit 303559-1,interface 2/2/0
Section in script with the two splits:
foreach my $line (#file_array){
my $CHECK_ID = (split /,/, $line) [2];
my #SPLIT_ID = (split /-/, $CHECK_ID);
my $FINAL_ID = ($CHECK_ID =~ /---/) ? "$SPLIT_ID[0]---$SPLIT_ID[3]" : "$SPLIT_ID[0]";
print "$FINAL_ID\n";
}
Output:
SNSNT---01CAB101A
SNSNT01CAB101A
First, if you have long lines with many fields, it pays to limit how many fields you are asking split to return. In this case, you need the third field, so you want to limit split to four fields.
Second, it is easier to remove the part you don't need at the end of the field.
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <DATA>) {
(my $id = (split /,/, $line, 4)[2]) =~ s/(?:-[0-9]{1,2})+\z//;
print "$id\n";
}
__DATA__
Move_VALIDATE,020212191ABC01,SNSNT---01CAB101A-1-1-4-20,circuit 402339-1,interface 1/1/0
Move_VALIDATE,030323202ABC01,SNSNT01CAB101A-1-1-4-20,circuit 303559-1,interface 2/2/0
$ ./klklj.pl
SNSNT---01CAB101A
SNSNT01CAB101A
You can use one split for commas as you have it and use word-boundary assertion for the second one, like:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
chomp;
printf qq|%s\n|, (split /\b-\b/, (split /,/, $_)[2])[0];
}
__DATA__
Move_VALIDATE,020212191ABC01,SNSNT---01CAB101A-1-1-4-20,circuit 402339-1,interface 1/1/0
Move_VALIDATE,030323202ABC01,SNSNT01CAB101A-1-1-4-20,circuit 303559-1,interface 2/2/0
Run it like:
perl script.pl
That yields:
SNSNT---01CAB101A
SNSNT01CAB101A
Related
First of I have to apologize for editing my initial post. But after I provide my code I did the question fuzzy.
So, I have this an array (#start_cod) containing lines separated by /n as follows:
print #start_cod;
tatatattataattatatttat
cacacacaacaccacaac
aaaaaaaaaaaaaaa
I need to remove the newlines and add ">text" ONLY at the beginning of the array as follow:
>text
tatatattataattatatttatcacacacaacaccacaacaaaaaaaaaaaaaaa
I tried:
s/\s+\z// for #start_cod;
print ">text#start_cod";
I tried also with chomp
chomp #start_cod;
print ">text#start_cod";
and
my #start_cod = split("\n",$start_cod);
$start_cod = join("",#start_cod);
print ">text$start_cod";
but I get
aaaaaaaaaaaaaaaaaaa>textcacacacacaacaccacaac>textaattatatattataattatatttat
Any suggestions on how to handle this in Perl Programming?
Here is my code which works 100%.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my %alliloux =();
$/="\n>";
while (<>) {
s/>//g;
my ($onoma, #seq) = split (/\n/, $_);
my ($sp, $head) = split (/\./, $onoma);
push #{ $alliloux{$sp} }, join "\n", ">$onoma", #seq;
}
foreach my $sp (keys %alliloux) {
chomp $sp;
my ($head, $dna) = split(/\t/, $sp);
my #start_cod = substr($dna, 3);
say #start_cod;
Input file:
>name aaaaaaaaaaaaaaaaaa
>name2 acacacacacaacaccacaac
>namex aattatatattataattatatttat
output after Perl run
tatatattataattatatttat
cacacacaacaccacaac
aaaaaaaaaaaaaaa
Desired output:
>text
tatatattataattatatttatcacacacaacaccacaacaaaaaaaaaaaaaaa
If I understand your question correctly, this should do what you want:
use strict;
use warnings;
my #start_cod = (
'aaaaaaaaaaaaaaaaaa',
'acacacacacaacaccacaac',
'aattatatattataattatatttat',
);
print ">text\n", #start_cod, "\n";
The print first prints ">text" and a newline once, then you get the #start_cod items on a line, and the last "\n" makes sure you have a newline after the last element.
Output:
>text
aaaaaaaaaaaaaaaaaaacacacacacaacaccacaacaattatatattataattatatttat
You might want to see Read FASTA into Hash. It's the same problem and very close to the code I wrote before I read it. Also, there are modules on CPAN that can handle FASTA.
I think you want to combine the sequences that start with the same name, disregarding the numbers. The sequences shouldn't have interior whitespace. In your code, you are constantly adding whitespace. You even join on a newline. So, you go to the doctor and say "My arm hurts when I do this", and the doctor says "So don't do that". :)
When you run into these sort of problems, check the results of your operations at each step to see if you get what you expect. Here's a much simplified version of a program that I think does what you want. I've removed most of the data structure because they are complicating your process.
In short, read a line and remove the newline at the end. That's one source of your newlines. Then, extract the sequence and concatenate that to the previous sequence. When you join with newlines, you are adding newlines. So, don't do that:
use v5.14;
use warnings;
use Data::Dumper;
my %alliloux = ();
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
# now split on whitespace, but only up to two parts.
# There's no array here.
my( $name, $seq ) = split /\s+/, $_, 2;
# remove the numbers at the end to get the prefix of the
# name.
my $prefix = $name =~ s/\d+\z//r;
# append the current sequence for this prefix to what we
# have already seen.f
$alliloux{$prefix} .= $seq;
}
say Dumper( \%alliloux );
foreach my $base ( keys %alliloux ) {
say ">text $alliloux{$base}";
}
__DATA__
>name aaa
>name2 cccc
>name99 aattaatt
You don't need the intermediate array. You can build up your string as you go. You don't need to have all the parts before you do that.
Now, to figure out where you might be going wrong, do a little at once. Ensure that you've extracted the right thing. It's handle to put characters around the variables you interpolate so you can see whitespace at the beginning or end:
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
my( $name, $seq ) = split /\s+/, $_, 2;
say "Name: <$name>";
say "Seq: <$seq>"
}
Then, add another step, and ensure that works:
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
my( $name, $seq ) = split /\s+/, $_, 2;
say "Name: <$name>";
say "Seq: <$seq>"
my $prefix = $name =~ s/\d+\z//r;
say "Prefix: <$prefix>";
}
Repeat this process for each step. Then, when you come with a question, you've pinpointed the point where things diverge. Here's the same technique in your program:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
s/>//g;
my ($onoma, #seq) = split (/\n/, $_);
say "Onoma: <$onoma>";
}
__DATA__
>name aaa
>name2 cccc
>name99 aattaatt
The output shows that you never had anything in #seq. You are splitting on a newline, but unless you've changed the default line ending, you'll only get a newline at the end:
Onoma: <name aaa>
Onoma: <name2 cccc>
Onoma: <name99 aattaatt>
Now there's nothing in #seq, so a line like join "\n", ">$onoma", #seq; is really just join "\n", ">$onoma". You could have seen that with a little checking.
The description lacks clarity of the problem.
By looking at the desired output the following code comes to mind. Please see if it does what you was looking for.
Even looking at your code it is not clear what you try to do -- some part of the code does not make much sense.
use strict;
use warnings;
use feature 'say';
my #start_cod;
while( <DATA> ) {
chomp;
next unless />\s?name.?\s+(.*)/;
push #start_cod, $1;
}
print ">text\n " . join('',#start_cod);
__DATA__
>name aaaaaaaaaaaaaaaaaa
>name2 acacacacacaacaccacaac
.
.
.
> namex aattatatattataattatatttat
how to remove last single line available in file using perl.
I have my data like below.
"A",1,-2,-1,-4,
"B",3,-5,-2.-5,
how to remove the last line... I am summing all the numbers but receiving a null value at the end.
Tried using chomp but did not work.
Here is the code currently being used:
while (<data>) {
chomp(my #row = (split ',' , $_ , -1);
say sum #row[1 .. $#row];
}
Try this (shell one-liner) :
perl -lne '!eof() and print' file
or as part of a script :
while (defined($_ = readline ARGV)) {
print $_ unless eof();
}
You should be using Text::CSV or Text::CSV_XS for handling comma separated value files. Those modules are available on CPAN. That type of solution would look like this:
use Text::CSV;
use List::Util qw(sum);
my $csv = Text::CSV->new({binary => 1})
or die "Cannot use CSV: " . Text::CSV->error_diag;
while(my $row = $csv->getline($fh)) {
next unless ($row->[0] || '') =~ m/\w/; # Reject rows that don't start with an identifier.
my $sum = sum(#$row[1..$#$row]);
print "$sum\n";
}
If you are stuck with a solution that doesn't use a proper CSV parser, then at least you'll need to add this to your existing while loop, immediately after your chomp:
next unless scalar(#row) && length $row[0]; # Skip empty rows.
The point to this line is to detect when a row is empty -- has no elements, or elements were empty after the chomp.
I suspect this is an X/Y question. You think you want to avoid processing the final (empty?) line in your input when actually you should be ensuring that all of your input data is in the format you expect.
There are a number of things you can do to check the validity of your data.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use List::Util 'sum';
use Scalar::Util 'looks_like_number';
while (<DATA>) {
# Chomp the input before splitting it.
chomp;
# Remove the -1 from your call to split().
# This automatically removes any empty trailing fields.
my #row = split /,/;
# Skip lines that are empty.
# 1/ Ensure there is data in #row.
# 2/ Ensure at least one element in #row contains
# non-whitespace data.
next unless #row and grep { /\S/ } #row;
# Ensure that all of the data you pass to sum()
# looks like numbers.
say sum grep { looks_like_number $_ } #row[1 .. $#row];
}
__DATA__
"A",1.2,-1.5,4.2,1.4,
"B",2.6,-.50,-1.6,0.3,-1.3,
I've written the following script because I need to do some cleanup in some files. I have a specific number of hex characters that needs to be changed into another set of hex characters (ie null to space, see below). I've written the following script, my problem is that it only replaces the first occurence and nothing else.
I've tried the /g just like a regular sed pattern but it doesnt work. Is there a way to do this and replace all matches?
(The reason i havent used a $line =~ s/... is because I think its neater and more maintainable that way, and this script will need to be accessed and run on occasion by others who may need to edit the hex values to be replaced). Another reason is because i need to change from 10+ hex values to an equivalent amount, so a huge one liner would be hard to read. Thank you in advance.
#!/usr/bin/perl
use strict;
use warnings;
my $filebase = shift || "testreplace.txt";
my $filefilter = shift || "testf";
open my $fh1, '>', 'testreplaceout';
# Iterate over file and read lines
open my $file1, '<', $filebase;
while (my $line = <$file1>)
{
chomp($line);
for ($line) {
s/\x00/\x20/g;
s/\x31/\x32/g;
}
print {$fh1} "$line \n";
}
/g will do what you want. If it doesn't seem to be working, add some debugging:
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
And in your loop:
print Dumper($line);
for ($line) {
s/\x00/\x20/g;
s/\x31/\x32/g;
}
print Dumper($line);
Using tr with paired delimiters instead can be very readable/maintainable:
$line =~ tr[\x00\x31]
[\x20\x32];
Also, consider adding use autodie;
tr/// is probably your best bet here (since you are dealing with constant single character replacements). The following is a more generic solution.
my %replacements = (
'foo' => 'bar',
'bar' => 'baz',
);
my $pat = join '|', map quotemeta, keys(%replacement);
s/($pat)/$replacements{$1}/g;
Update: read comments for caveats of this answer.
Here's one way that'll allow you to keep your list of regex search/replaces at the top of your script nice and clean for ease of viewing and modification:
use warnings;
use strict;
my #re_list = (
['a', 'x'],
['b', 'y'],
);
while (my $line = <DATA>){
for my $re (#re_list){
$line =~ s/$re->[0]/$re->[1]/g;
}
print $line;
}
__DATA__
aaabbbccc
bbbcccddd
ababababa
Output:
xxxyyyccc
yyycccddd
xyxyxyxyx
So, i have a file to read like this
Some.Text~~~Some big text with spaces and numbers and something~~~Some.Text2~~~Again some big test, etc~~~Text~~~Big text~~~And so on
What I want is if $x matches with Some.Text for example, how can I get a variable with "Some big text with spaces and numbers and something" or if it matches with "Some.Text2" to get "Again some big test, etc".
open FILE, "<cats.txt" or die $!;
while (<FILE>) {
chomp;
my #values = split('~~~', $_);
foreach my $val (#values) {
print "$val\n" if ($val eq $x)
}
exit 0;
}
close FILE;
And from now on I don't know what to do. I just managed to print "Some.text" if it matches with my variable.
splice can be used to remove elements from #values in pairs:
while(my ($matcher, $printer) = splice(#values, 0, 2)) {
print $printer if $matcher eq $x;
}
Alternatively, if you need to leave #values intact you can use a c style loop:
for (my $i=0; $i<#values; $i+=2) {
print $values[$i+1] if $values[$i] eq $x;
}
Your best option is perhaps not to split, but to use a regex, like this:
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
while (/Some.Text2?~~~(.+?)~~~/g) {
say $1;
}
}
__DATA__
Some.Text~~~Some big text with spaces and numbers and something~~~Some.Text2~~~Again some big test, etc~~~Text~~~Big text~~~And so on
Output:
Some big text with spaces and numbers and something
Again some big test, etc
I'm reading this textfile to get ONLY the words in it and ignore all kind of whitespaces:
hello
now
do you see this.sadslkd.das,msdlsa but
i hoohoh
And this is my Perl code:
#!usr/bin/perl -w
require 5.004;
open F1, './text.txt';
while ($line = <F1>) {
#print $line;
#arr = split /\s+/, $line;
foreach $w (#arr) {
if ($w !~ /^\s+$/) {
print $w."\n";
}
}
#print #arr;
}
close F1;
And this is the output:
hello
now
do
you
see
this.sadslkd.das,msdlsa
but
i
hoohoh
The output is showing two newlines but I am expecting the output to be just words. What should I do to just get words?
You should always use strict and use warnings (in preference to the -w command-line qualifier) at the top of every Perl program, and declare each variable at its first point of use using my. That way Perl will tell you about simple errors that you may otherwise overlook.
You should also use lexical file handles with the three-parameter form of open, and check the status to make sure it succeeded. There is little point in explicitly closing an input file unless you expect your program to run for an appreciable time, as Perl will close all files for you on exit.
Do you really need to require Perl v5.4? That version is fifteen years old, and if there is anything older than that installed then you have a museum!
Your program would be better like this:
use strict;
use warnings;
open my $fh, '<', './text.txt' or die $!;
while (my $line = <$fh>) {
my #arr = split /\s+/, $line;
foreach my $w (#arr) {
if ($w !~ /^\s+$/) {
print $w."\n";
}
}
}
Note: my apologies. The warnings pragma and lexical file handles were introduced only in v5.6 so that part of my answer is irrelevant. The latest version of Perl is v5.16 and you really should upgrade
As Birei has pointed out, the problem is that, when the line has leading whitespace, there is a empty field before the first separator. Imagine if your data was comma-separated, then you would want Perl to report a leading empty field if the line started with a comma.
To extract all the non-space characters you can use a regular expression that does exactly that
my #arr = $line =~ /\S+/g;
and this can be emulated by using the default parameter for split which is a single quoted space (not a regular expression)
my #arr = $line =~ split ' ', $line;
In this case split behaves like the awk utility and discards any leading empty fields as you expected.
This is even simpler if you let Perl use the $_ variable in the read loop, as all of the parameters for split can be defaulted:
while (<F1>) {
my #arr = split;
foreach my $w (#arr) {
print "$w\n" if $w !~ /^\s+$/;
}
}
This line is the problem:
#arr=split(/\s+/,$line);
\s+ does a match just before the leading spaces. Use ' ' instead.
#arr=split(' ',$line);
I believe that in this line:
if(!($w =~ /^\s+$/))
You wanted to ask if there's nothing in this row - don't print it.
But the "+" in the REGEX actually force it to have at least 1 space.
If you change the "\s+" to "\s*", you'll see that it's working. because * is 0 occurrences or more ...