How to remove array's newlines and add an element at the beginning of it in Perl? - perl

First of I have to apologize for editing my initial post. But after I provide my code I did the question fuzzy.
So, I have this an array (#start_cod) containing lines separated by /n as follows:
print #start_cod;
tatatattataattatatttat
cacacacaacaccacaac
aaaaaaaaaaaaaaa
I need to remove the newlines and add ">text" ONLY at the beginning of the array as follow:
>text
tatatattataattatatttatcacacacaacaccacaacaaaaaaaaaaaaaaa
I tried:
s/\s+\z// for #start_cod;
print ">text#start_cod";
I tried also with chomp
chomp #start_cod;
print ">text#start_cod";
and
my #start_cod = split("\n",$start_cod);
$start_cod = join("",#start_cod);
print ">text$start_cod";
but I get
aaaaaaaaaaaaaaaaaaa>textcacacacacaacaccacaac>textaattatatattataattatatttat
Any suggestions on how to handle this in Perl Programming?
Here is my code which works 100%.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my %alliloux =();
$/="\n>";
while (<>) {
s/>//g;
my ($onoma, #seq) = split (/\n/, $_);
my ($sp, $head) = split (/\./, $onoma);
push #{ $alliloux{$sp} }, join "\n", ">$onoma", #seq;
}
foreach my $sp (keys %alliloux) {
chomp $sp;
my ($head, $dna) = split(/\t/, $sp);
my #start_cod = substr($dna, 3);
say #start_cod;
Input file:
>name aaaaaaaaaaaaaaaaaa
>name2 acacacacacaacaccacaac
>namex aattatatattataattatatttat
output after Perl run
tatatattataattatatttat
cacacacaacaccacaac
aaaaaaaaaaaaaaa
Desired output:
>text
tatatattataattatatttatcacacacaacaccacaacaaaaaaaaaaaaaaa

If I understand your question correctly, this should do what you want:
use strict;
use warnings;
my #start_cod = (
'aaaaaaaaaaaaaaaaaa',
'acacacacacaacaccacaac',
'aattatatattataattatatttat',
);
print ">text\n", #start_cod, "\n";
The print first prints ">text" and a newline once, then you get the #start_cod items on a line, and the last "\n" makes sure you have a newline after the last element.
Output:
>text
aaaaaaaaaaaaaaaaaaacacacacacaacaccacaacaattatatattataattatatttat

You might want to see Read FASTA into Hash. It's the same problem and very close to the code I wrote before I read it. Also, there are modules on CPAN that can handle FASTA.
I think you want to combine the sequences that start with the same name, disregarding the numbers. The sequences shouldn't have interior whitespace. In your code, you are constantly adding whitespace. You even join on a newline. So, you go to the doctor and say "My arm hurts when I do this", and the doctor says "So don't do that". :)
When you run into these sort of problems, check the results of your operations at each step to see if you get what you expect. Here's a much simplified version of a program that I think does what you want. I've removed most of the data structure because they are complicating your process.
In short, read a line and remove the newline at the end. That's one source of your newlines. Then, extract the sequence and concatenate that to the previous sequence. When you join with newlines, you are adding newlines. So, don't do that:
use v5.14;
use warnings;
use Data::Dumper;
my %alliloux = ();
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
# now split on whitespace, but only up to two parts.
# There's no array here.
my( $name, $seq ) = split /\s+/, $_, 2;
# remove the numbers at the end to get the prefix of the
# name.
my $prefix = $name =~ s/\d+\z//r;
# append the current sequence for this prefix to what we
# have already seen.f
$alliloux{$prefix} .= $seq;
}
say Dumper( \%alliloux );
foreach my $base ( keys %alliloux ) {
say ">text $alliloux{$base}";
}
__DATA__
>name aaa
>name2 cccc
>name99 aattaatt
You don't need the intermediate array. You can build up your string as you go. You don't need to have all the parts before you do that.
Now, to figure out where you might be going wrong, do a little at once. Ensure that you've extracted the right thing. It's handle to put characters around the variables you interpolate so you can see whitespace at the beginning or end:
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
my( $name, $seq ) = split /\s+/, $_, 2;
say "Name: <$name>";
say "Seq: <$seq>"
}
Then, add another step, and ensure that works:
while (<DATA>) {
chomp; # get rid of that newline!
s/>//g;
my( $name, $seq ) = split /\s+/, $_, 2;
say "Name: <$name>";
say "Seq: <$seq>"
my $prefix = $name =~ s/\d+\z//r;
say "Prefix: <$prefix>";
}
Repeat this process for each step. Then, when you come with a question, you've pinpointed the point where things diverge. Here's the same technique in your program:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
while (<DATA>) {
s/>//g;
my ($onoma, #seq) = split (/\n/, $_);
say "Onoma: <$onoma>";
}
__DATA__
>name aaa
>name2 cccc
>name99 aattaatt
The output shows that you never had anything in #seq. You are splitting on a newline, but unless you've changed the default line ending, you'll only get a newline at the end:
Onoma: <name aaa>
Onoma: <name2 cccc>
Onoma: <name99 aattaatt>
Now there's nothing in #seq, so a line like join "\n", ">$onoma", #seq; is really just join "\n", ">$onoma". You could have seen that with a little checking.

The description lacks clarity of the problem.
By looking at the desired output the following code comes to mind. Please see if it does what you was looking for.
Even looking at your code it is not clear what you try to do -- some part of the code does not make much sense.
use strict;
use warnings;
use feature 'say';
my #start_cod;
while( <DATA> ) {
chomp;
next unless />\s?name.?\s+(.*)/;
push #start_cod, $1;
}
print ">text\n " . join('',#start_cod);
__DATA__
>name aaaaaaaaaaaaaaaaaa
>name2 acacacacacaacaccacaac
.
.
.
> namex aattatatattataattatatttat

Related

Replace multiple hex values

I've written the following script because I need to do some cleanup in some files. I have a specific number of hex characters that needs to be changed into another set of hex characters (ie null to space, see below). I've written the following script, my problem is that it only replaces the first occurence and nothing else.
I've tried the /g just like a regular sed pattern but it doesnt work. Is there a way to do this and replace all matches?
(The reason i havent used a $line =~ s/... is because I think its neater and more maintainable that way, and this script will need to be accessed and run on occasion by others who may need to edit the hex values to be replaced). Another reason is because i need to change from 10+ hex values to an equivalent amount, so a huge one liner would be hard to read. Thank you in advance.
#!/usr/bin/perl
use strict;
use warnings;
my $filebase = shift || "testreplace.txt";
my $filefilter = shift || "testf";
open my $fh1, '>', 'testreplaceout';
# Iterate over file and read lines
open my $file1, '<', $filebase;
while (my $line = <$file1>)
{
chomp($line);
for ($line) {
s/\x00/\x20/g;
s/\x31/\x32/g;
}
print {$fh1} "$line \n";
}
/g will do what you want. If it doesn't seem to be working, add some debugging:
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
And in your loop:
print Dumper($line);
for ($line) {
s/\x00/\x20/g;
s/\x31/\x32/g;
}
print Dumper($line);
Using tr with paired delimiters instead can be very readable/maintainable:
$line =~ tr[\x00\x31]
[\x20\x32];
Also, consider adding use autodie;
tr/// is probably your best bet here (since you are dealing with constant single character replacements). The following is a more generic solution.
my %replacements = (
'foo' => 'bar',
'bar' => 'baz',
);
my $pat = join '|', map quotemeta, keys(%replacement);
s/($pat)/$replacements{$1}/g;
Update: read comments for caveats of this answer.
Here's one way that'll allow you to keep your list of regex search/replaces at the top of your script nice and clean for ease of viewing and modification:
use warnings;
use strict;
my #re_list = (
['a', 'x'],
['b', 'y'],
);
while (my $line = <DATA>){
for my $re (#re_list){
$line =~ s/$re->[0]/$re->[1]/g;
}
print $line;
}
__DATA__
aaabbbccc
bbbcccddd
ababababa
Output:
xxxyyyccc
yyycccddd
xyxyxyxyx

Split, insert and join

Here's I want to archive. I want to split a one-liner comma-separated and insert #domain.com then join it back as comma-separated.
The one-liner contains something like:
username1,username2,username3
and I want to be something like:
username1#domain.com,username2#domain.com,username3#domain.com
So my Perl script that I tried which doesn't not work properly:
my $var ='username1,username2,username3';
my #tkens = split /,/, $var;
my #user;
foreach my $tken (#tkens) {
push (#user, "$tken\#domain.com");
}
my $to = join(',',#user);
Is there any shortcut on this in Perl and please post sample please. Thanks
Split, transform, stitch:
my $var ='username1,username2,username3';
print join ",", map { "$_\#domain.com" } split(",", $var);
# ==> username1#domain.com,username2#domain.com,username3#domain.com
You could also use a regular expression substitution:
#!/usr/bin/perl
use strict;
use warnings;
my $var = "username1,username2,username3";
# Replace every comma (and the end of the string) with a comma and #domain.com
$var =~ s/$|,/\#domain.com,/g;
# Remove extra comma after last item
chop $var;
print "$var\n";
You already have good answers. Here I am just telling why your script is not working. I didn't see any print or say line in your code, so not sure how you are trying to print something. No need of last line in your program. You can simply suffix #domain.com with each value, push to an array and print it with join.
#!/usr/bin/perl
use strict;
use warnings;
my $var = 'username1,username2,username3';
my #tkens = split ',', $var;
my #user;
foreach my $tken (#tkens)
{
push #user, $tken."\#domain.com"; # `.` after `$tken` for concatenation
}
print join(',', #user), "\n"
Output:
username1#domain.com,username2#domain.com,username3#domain.com

Perl - combine two split commands into one

I have a perl script that I was cleaning up, it works very well, but I was wondering if anyone knows of a good way to combine 2 splits into one command.
I have a .csv file, like this small example:
Move_VALIDATE,020212191ABC01,SNSNT---01CAB101A-1-1-4-20,circuit 402339-1,interface 1/1/0
Move_VALIDATE,030323202ABC01,SNSNT01CAB101A-1-1-4-20,circuit 303559-1,interface 2/2/0
Section in script with the two splits:
foreach my $line (#file_array){
my $CHECK_ID = (split /,/, $line) [2];
my #SPLIT_ID = (split /-/, $CHECK_ID);
my $FINAL_ID = ($CHECK_ID =~ /---/) ? "$SPLIT_ID[0]---$SPLIT_ID[3]" : "$SPLIT_ID[0]";
print "$FINAL_ID\n";
}
Output:
SNSNT---01CAB101A
SNSNT01CAB101A
First, if you have long lines with many fields, it pays to limit how many fields you are asking split to return. In this case, you need the third field, so you want to limit split to four fields.
Second, it is easier to remove the part you don't need at the end of the field.
#!/usr/bin/env perl
use strict;
use warnings;
while (my $line = <DATA>) {
(my $id = (split /,/, $line, 4)[2]) =~ s/(?:-[0-9]{1,2})+\z//;
print "$id\n";
}
__DATA__
Move_VALIDATE,020212191ABC01,SNSNT---01CAB101A-1-1-4-20,circuit 402339-1,interface 1/1/0
Move_VALIDATE,030323202ABC01,SNSNT01CAB101A-1-1-4-20,circuit 303559-1,interface 2/2/0
$ ./klklj.pl
SNSNT---01CAB101A
SNSNT01CAB101A
You can use one split for commas as you have it and use word-boundary assertion for the second one, like:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
chomp;
printf qq|%s\n|, (split /\b-\b/, (split /,/, $_)[2])[0];
}
__DATA__
Move_VALIDATE,020212191ABC01,SNSNT---01CAB101A-1-1-4-20,circuit 402339-1,interface 1/1/0
Move_VALIDATE,030323202ABC01,SNSNT01CAB101A-1-1-4-20,circuit 303559-1,interface 2/2/0
Run it like:
perl script.pl
That yields:
SNSNT---01CAB101A
SNSNT01CAB101A

Why is my Perl code not omitting newlines?

I'm reading this textfile to get ONLY the words in it and ignore all kind of whitespaces:
hello
now
do you see this.sadslkd.das,msdlsa but
i hoohoh
And this is my Perl code:
#!usr/bin/perl -w
require 5.004;
open F1, './text.txt';
while ($line = <F1>) {
#print $line;
#arr = split /\s+/, $line;
foreach $w (#arr) {
if ($w !~ /^\s+$/) {
print $w."\n";
}
}
#print #arr;
}
close F1;
And this is the output:
hello
now
do
you
see
this.sadslkd.das,msdlsa
but
i
hoohoh
The output is showing two newlines but I am expecting the output to be just words. What should I do to just get words?
You should always use strict and use warnings (in preference to the -w command-line qualifier) at the top of every Perl program, and declare each variable at its first point of use using my. That way Perl will tell you about simple errors that you may otherwise overlook.
You should also use lexical file handles with the three-parameter form of open, and check the status to make sure it succeeded. There is little point in explicitly closing an input file unless you expect your program to run for an appreciable time, as Perl will close all files for you on exit.
Do you really need to require Perl v5.4? That version is fifteen years old, and if there is anything older than that installed then you have a museum!
Your program would be better like this:
use strict;
use warnings;
open my $fh, '<', './text.txt' or die $!;
while (my $line = <$fh>) {
my #arr = split /\s+/, $line;
foreach my $w (#arr) {
if ($w !~ /^\s+$/) {
print $w."\n";
}
}
}
Note: my apologies. The warnings pragma and lexical file handles were introduced only in v5.6 so that part of my answer is irrelevant. The latest version of Perl is v5.16 and you really should upgrade
As Birei has pointed out, the problem is that, when the line has leading whitespace, there is a empty field before the first separator. Imagine if your data was comma-separated, then you would want Perl to report a leading empty field if the line started with a comma.
To extract all the non-space characters you can use a regular expression that does exactly that
my #arr = $line =~ /\S+/g;
and this can be emulated by using the default parameter for split which is a single quoted space (not a regular expression)
my #arr = $line =~ split ' ', $line;
In this case split behaves like the awk utility and discards any leading empty fields as you expected.
This is even simpler if you let Perl use the $_ variable in the read loop, as all of the parameters for split can be defaulted:
while (<F1>) {
my #arr = split;
foreach my $w (#arr) {
print "$w\n" if $w !~ /^\s+$/;
}
}
This line is the problem:
#arr=split(/\s+/,$line);
\s+ does a match just before the leading spaces. Use ' ' instead.
#arr=split(' ',$line);
I believe that in this line:
if(!($w =~ /^\s+$/))
You wanted to ask if there's nothing in this row - don't print it.
But the "+" in the REGEX actually force it to have at least 1 space.
If you change the "\s+" to "\s*", you'll see that it's working. because * is 0 occurrences or more ...

With Perl, how do I read records from a file with two possible record separators?

Here is what I am trying to do:
I want to read a text file into an array of strings. I want the string to terminate when the file reads in a certain character (mainly ; or |).
For example, the following text
Would you; please
hand me| my coat?
would be put away like this:
$string[0] = 'Would you;';
$string[1] = ' please hand me|';
$string[2] = ' my coat?';
Could I get some help on something like this?
This will do it. The trick to using split while preserving the token you're splitting on is to use a zero-width lookback match: split(/(?<=[;|])/, ...).
Note: mctylr's answer (currently the top rated) isn't actually correct -- it will split fields on newlines, b/c it only works on a single line of the file at a time.
gbacon's answer using the input record separator ($/) is quite clever--it's both space and time efficient--but I don't think I'd want to see it in production code. Putting one split token in the record separator and the other in the split strikes me as a little too unobvious (you have to fight that with Perl ...) which will make it hard to maintain. I'm also not sure why he's deleting multiple newlines (which I don't think you asked for?) and why he's doing that only for the end of '|'-terminated records.
# open file for reading, die with error message if it fails
open(my $fh, '<', 'data.txt') || die $!;
# set file reading to slurp (whole file) mode (note that this affects all
# file reads in this block)
local $/ = undef;
my $string = <$fh>;
# convert all newlines into spaces, not specified but as per example output
$string =~ s/\n/ /g;
# split string on ; or |, using a zero-width lookback match (?<=) to preserve char
my (#strings) = split(/(?<=[;|])/, $string);
One way is to inject another character, like \n, whenever your special character is found, then split on the \n:
use warnings;
use strict;
use Data::Dumper;
while (<DATA>) {
chomp;
s/([;|])/$1\n/g;
my #string = split /\n/;
print Dumper(\#string);
}
__DATA__
Would you; please hand me| my coat?
Prints out:
$VAR1 = [
'Would you;',
' please hand me|',
' my coat?'
];
UPDATE: The original question posed by James showed the input text on a single line, as shown in __DATA__ above. Because the question was poorly formatted, others edited the question, breaking the 1 line into 2. Only James knows whether 1 or 2 lines was intended.
I prefer #toolic's answer because it deals with multiple separators very easily.
However, if you wanted to overly complicate things, you could always try:
#!/usr/bin/perl
use strict; use warnings;
my #contents = ('');
while ( my $line = <DATA> ) {
last unless $line =~ /\S/;
$line =~ s{$/}{ };
if ( $line =~ /^([^|;]+[|;])(.+)$/ ) {
$contents[-1] .= $1;
push #contents, $2;
}
else {
$contents[-1] .= $1;
}
}
print "[$_]\n" for #contents;
__DATA__
Would you; please
hand me| my coat?
Something along the lines of
$text = <INPUTFILE>;
#string = split(/[;!]/, $text);
should do the trick more or less.
Edit: I've changed "/;!/" to "/[;!]/".
Let Perl do half the work for you by setting $/ (the input record separator) to vertical bar, and then extract semicolon-separated fields:
#!/usr/bin/perl
use warnings;
use strict;
my #string;
*ARGV = *DATA;
$/ = "|";
while (<>) {
s/\n+$//;
s/\n/ /g;
push #string => $1 while s/^(.*;)//;
push #string => $_;
}
for (my $i = 0; $i < #string; ++$i) {
print "\$string[$i] = '$string[$i]';\n";
}
__DATA__
Would you; please
hand me| my coat?
Output:
$string[0] = 'Would you;';
$string[1] = ' please hand me|';
$string[2] = ' my coat?';