perl: printing new line instead in a single line - perl

I am new to perl.
when i m trying to print the values of the array along with one variable in the while loop,
the variable is printing in the new line.
while($line=<FH>)
{
chomp($line);
$tem = grep(/gooty/,$line);
if($tem==1)
{
$Date=$date;
#array=split(/\|/,$line);
$sth = "INSERT INTO TABLE VALUES $array[1],$array[2],$date \n";
}
}
print "$sth \n";
the output:
INSERT INTO TABLE VALUES alan ,777
,2012-07-31
instead i want the output as :
INSERT INTO TABLE VALUES alan ,777,2012-07-31
in single line

This is a common problem for new perl programmers. Say
while (defined($line = <FH>))
{
chomp $line; # Eliminate terminating newline if there
...
If the results are still not right, you may be trying to read a text file with MSDOS/Windows line endings using a version of Perl (like Cygwin) that doesn't handle them correctly. This can cause chomp to malfunction. You can work around the problem using this instead:
$line =~ s/[\r\n]+$//;
This cleans all end-of-line characters from the end of the line, no matter how many there are.,
Additional notes on your code: You'll save lots of trouble for yourself with use strict; and use warnings;, which will require variable declarations with my and our. You don't need to call grep. Just say if ($line =~ /gooty/) {. If there is any chance of extra whitespace in your data, a better split pattern is \s+\|\s+. This will consume whitespace around the vertical bar field separators. In that case you also want to use
$line =~ s/\s+$//;
instead of chomp $line. This will clean all whitespace from the end of line, which includes end-of-line characters.

You have a newline at the end of $line. chomp it, before splitting it, to get the desired output.
chomp $line;
perldoc -f chomp

I assume that you don't want your elements of #array enclosed by whitespace characters. Then we should trim them before printing them.
my $line = <FH>;
my $date = '2012-07-31'; # or whatever
if($line =~ /gooty/)
{
my #array = split /[|]/, $line;
foreach (#array) {
s/^\s+//; # removes leading whitespaces
s/\s+$//; # removes trailing whitespaces
}
print "INSERT INTO TABLE VALUES $array[1],$array[2],$date \n";
}
This should print the desired output.
But I cannot be sure until you show us the input you gave your code that produced the unexpected output. (Or could it be that you modify your $sth between the loop and the print statement? I see you appended two newlines?)
Btw: use strict; use warnings!

The cleanest approach, instead of using chomp, is to remove all trailing whitespace from the end of the line
Start your loop with
$line =~ s/\s+\z//;

Related

Perl, Why it has a extra white space in output

input:
output:
but my out put have one extra white space on last two line.
my output:
my code:
#content = <FILE>;
foreach $line (#content){
if($line =~ /^#(\d+)/){
$number = $1;
$line =~ s/^#(\d+)/$content[$number-1]/;
}
print "$line";
}
Any help will be appreciated.
Here's a version of your code with sample input data. If you want people to help you with problems like this, then it's a good idea to make it as easy as possible for them. Posting images of your input data does not make it easy. Also, it's a good development trick to store sample data in the DATA filehandle so that the code and data are together in the same file.
#!/usr/bin/perl
use strict;
use warnings;
my #content = <DATA>;
foreach my $line (#content){
if($line =~ /^#(\d+)/){
my $number = $1;
$line =~ s/^#(\d+)/$content[$number-1]/;
}
print "$line";
}
__DATA__
line A
line B
line C
#7
line D
#2
line E
I've also added use strict and use warnings to your code. In this case, they don't really help, but you should get into the habit of always including them in your Perl programs.
Your problem is here:
$line =~ s/^#(\d+)/$content[$number-1]/;
Each of the lines in your #content array will include a newline character at the end. But in this line you're replacing the # symbol and the following digit with a complete other line from the array. You're not replacing the original newline and you're adding another newline (from the replacement string) so the line ends up containing two newlines.
The easiest fix is to add the newline to the pattern you are matching.
$line =~ s/^#(\d+)\n/$content[$number-1]/;
Note that an experienced Perl programmer would write your code like this:
#!/usr/bin/perl
use strict;
use warnings;
my #content = <DATA>;
for (#content){
s/^#(\d+)\n/$content[$1 - 1]/;
print;
}

Perl script that parses CSV file excluding the contents enclosed in []

Hi there I am struggling with perl script that parses a an eight column CSV line into another CSV line using the split command. But i want to exclude all the text enclosed by square brackets []. The line looks like :
128.39.120.51,0,49788,6,SYN,[8192:127:1:52:M1460,N,W2,N,N,S:.:Windows:XP/2000 (RFC1323+, w+, tstamp-):link:ethernet/modem],1,1399385680
I used the following script but when i print $fields[7] it gives me N. one of the fields inside [] above.but by print "$fields[7]" i want it to be 1399385680 which is the last field in the above line. the script i tried was.
while (my $line = <LOG>) {
chomp $line;
my #fields=grep { !/^[\[.*\]]$/ } split ",", $line;
my $timestamp=$fields[7];
print "$fields[7]";
}
Thanks for your time. I will appreciate your help.
Always include use strict; and use warnings; at the top of EVERY perl script.
Your "csv" file isn't proper csv. So the only thing I can suggest is to remove the contents in the brackets before you split:
use strict;
use warnings;
while (<DATA>) {
chomp;
s/\[.*?\]//g;
my #fields = split ',', $_;
my $timestamp = $fields[7];
print "$timestamp\n";
}
__DATA__
128.39.120.51,0,49788,6,SYN,[8192:127:1:52:M1460,N,W2,N,N,S:.:Windows:XP/2000 (RFC1323+, w+, tstamp-):link:ethernet/modem],1,1399385680
Outputs:
1399385680
Obviously it is possible to also capture the contents of the bracketed fields, but you didn't say that was a requirement or goal.
Update
If you want to capture the bracket delimited field, one method would be to use a regex for capturing instead.
Note, this current regex requires that each field has a value.
chomp;
my #fields = $_ =~ /(\[.*?\]|[^,]+)(?:,|$)/g;
my $timestamp = $fields[7];
print "$timestamp";
Well, if you want to actually ignore the text between square brackets, you might as well get rid of it:
while ( my $line = <LOG> ) {
chomp $line;
$line =~ s,\[.*?\],,; # Delete all text between square brackets
my #fields = split ",", $line;
my $timestamp = $fields[7];
print $fields[7], "\n";
}

Why is my Perl code not omitting newlines?

I'm reading this textfile to get ONLY the words in it and ignore all kind of whitespaces:
hello
now
do you see this.sadslkd.das,msdlsa but
i hoohoh
And this is my Perl code:
#!usr/bin/perl -w
require 5.004;
open F1, './text.txt';
while ($line = <F1>) {
#print $line;
#arr = split /\s+/, $line;
foreach $w (#arr) {
if ($w !~ /^\s+$/) {
print $w."\n";
}
}
#print #arr;
}
close F1;
And this is the output:
hello
now
do
you
see
this.sadslkd.das,msdlsa
but
i
hoohoh
The output is showing two newlines but I am expecting the output to be just words. What should I do to just get words?
You should always use strict and use warnings (in preference to the -w command-line qualifier) at the top of every Perl program, and declare each variable at its first point of use using my. That way Perl will tell you about simple errors that you may otherwise overlook.
You should also use lexical file handles with the three-parameter form of open, and check the status to make sure it succeeded. There is little point in explicitly closing an input file unless you expect your program to run for an appreciable time, as Perl will close all files for you on exit.
Do you really need to require Perl v5.4? That version is fifteen years old, and if there is anything older than that installed then you have a museum!
Your program would be better like this:
use strict;
use warnings;
open my $fh, '<', './text.txt' or die $!;
while (my $line = <$fh>) {
my #arr = split /\s+/, $line;
foreach my $w (#arr) {
if ($w !~ /^\s+$/) {
print $w."\n";
}
}
}
Note: my apologies. The warnings pragma and lexical file handles were introduced only in v5.6 so that part of my answer is irrelevant. The latest version of Perl is v5.16 and you really should upgrade
As Birei has pointed out, the problem is that, when the line has leading whitespace, there is a empty field before the first separator. Imagine if your data was comma-separated, then you would want Perl to report a leading empty field if the line started with a comma.
To extract all the non-space characters you can use a regular expression that does exactly that
my #arr = $line =~ /\S+/g;
and this can be emulated by using the default parameter for split which is a single quoted space (not a regular expression)
my #arr = $line =~ split ' ', $line;
In this case split behaves like the awk utility and discards any leading empty fields as you expected.
This is even simpler if you let Perl use the $_ variable in the read loop, as all of the parameters for split can be defaulted:
while (<F1>) {
my #arr = split;
foreach my $w (#arr) {
print "$w\n" if $w !~ /^\s+$/;
}
}
This line is the problem:
#arr=split(/\s+/,$line);
\s+ does a match just before the leading spaces. Use ' ' instead.
#arr=split(' ',$line);
I believe that in this line:
if(!($w =~ /^\s+$/))
You wanted to ask if there's nothing in this row - don't print it.
But the "+" in the REGEX actually force it to have at least 1 space.
If you change the "\s+" to "\s*", you'll see that it's working. because * is 0 occurrences or more ...

Selectively joining elements of an array into fewer elements of a new array

I'm having some trouble manipulating an array of DNA sequence data that is in .fasta format. What I would specifically like to do is take a file that has a few thousand sequences and adjoin sequence data for each sequence in the file onto a single line in the file. [Fasta format is as such: A sequence ID starts with > after which everything on that line is a description. On the next line(s) the sequence corresponding to this ID is present. And this can continue indefinitely until the next line that begins with >, which is the id of the next sequence in the file] So, in my particular file most of my sequences are on multiple lines, so what I would like to do is essentially remove the newlines, but only the new lines between sequence data, not between sequence data and sequence ID lines (that start with >).
I'm doing this because I want to be able to attain sequence lengths of each sequence (through length, I believe is the easiest way), and then get an average sequence length of all the sequences in the whole file.
Here's my script so far, that doesnt seem to want to work:
#!/usr/bin/perl -w
##Subroutine
sub get_file_data1 {
my($filename) = $_[0];
my #filedata = ();
unless( open(GET_FILE_DATA, $filename)) {
print STDERR "Cannot open file \"$filename\"\n\n";
exit;
}
#filedata = <GET_FILE_DATA>;
close GET_FILE_DATA;
return #filedata;
}
##Opening files
my $fsafile = $ARGV[0];
my #filedata = &get_file_data1($fsafile);
##Procedure
my #count;
my #ids;
my $seq;
foreach $seq (#filedata){
if ($seq =~ /^>/) {push #ids, $seq;
push #count, "\n";
}
else {push #count, $seq;
}
}
foreach my $line (#count) {
if ($line =~ /^[AGTCagtc]/){
$line =~ s/^([AGTCagtc]*)\n/$1/;
}
}
##Make a text file to have a look
open FILE3, "> unbrokenseq.txt" or die "Cannot open output.txt: $!";
foreach (#count)
{
print FILE3 "$_\n"; # Print each entry in our array to the file
}
close FILE3;
__END__
##Creating array of lengths
my $number;
my #numberarray;
foreach $number (#count) {
push #numberarray, length($number);
}
print #numberarray;
__END__
use List::Util qw(sum);
sub mean {
return sum(#numberarray)/#numberarray;
}
There's something wrong with the second foreach line of the Procedure section and I can't seem to figure out what it is. Note that the code after the END lines I haven't even tried yet because I cant seem to get the code in the procedure step to do what I want. Any idea how I can get a nice array with elements of unbroken sequence (I've chosen to just remove the sequence ID lines from the new array..)? When I can then get an array of lengths, after which I can then average?
Finally I should unfortunately admit that I cannot get Bio::Perl working on my computer, I have tried for hours but the errors are beyond my skill to fix. Ill be talking to someone who can hopefully help me with my Bio::perl issues. But for now I'm just going to have to press on without it.
Thanks! Sorry for the length of this post, I appreciate the help.
Andrew
The problem with your second loop is that you are not actually changing anything in #count because $line contains a copy of the values in #count.
But, if all you want to do in the second loop is to remove the newline character at the end, use the chomp function. with this you wouldn't need your second loop. (And it would also be faster than using the regex.)
# remove newlines for all array elements before doing anything else with it
chomp #filedata;
# .. or you can do it in your first loop
foreach $seq (#filedata){
chomp $seq;
if ($seq =~ /^>/) {
...
}
An additional tip: Using get_file_data1 to read the entire file into an array might be slow if your files are large. In that case it would be better to iterate through the file as you go:
open my $FILE_DATA, $filename or die "Cannot open file \"$filename\"\n";
while (my $line = <$FILE_DATA>) {
chomp $line;
# process the record as in your Procedure section
...
}
close $FILE_DATA;
Your regex captures specifically to $1 but you are printing $_ to the file. The result being most likely not what you intended.
Be careful with the '*' or 'greedy' modifier to your character groups in s///. You usually want the '+' instead. '*' will also match lines containing none of your characters.
A Search expression with a 'g' modifier can also count characters. Like this:
$perl -e '$a="aggaacaat"; $b = $a =~ s/[a]//g; print $b; '
5
Pretty cool huh! Alternately, in your code, you could just call length() against $1.
I was taken aback to see the escaped '/n' in your regex. While it works fine, the common 'end-of-line' search term is '$'. This is more portable and doesn't mess up your character counts.

Neatest way to remove linebreaks in Perl

I'm maintaining a script that can get its input from various sources, and works on it per line. Depending on the actual source used, linebreaks might be Unix-style, Windows-style or even, for some aggregated input, mixed(!).
When reading from a file it goes something like this:
#lines = <IN>;
process(\#lines);
...
sub process {
#lines = shift;
foreach my $line (#{$lines}) {
chomp $line;
#Handle line by line
}
}
So, what I need to do is replace the chomp with something that removes either Unix-style or Windows-style linebreaks.
I'm coming up with way too many ways of solving this, one of the usual drawbacks of Perl :)
What's your opinion on the neatest way to chomp off generic linebreaks? What would be the most efficient?
Edit: A small clarification - the method 'process' gets a list of lines from somewhere, not nessecarily read from a file. Each line might have
No trailing linebreaks
Unix-style linebreaks
Windows-style linebreaks
Just Carriage-Return (when original data has Windows-style linebreaks and is read with $/ = '\n')
An aggregated set where lines have different styles
After digging a bit through the perlre docs a bit, I'll present my best suggestion so far that seems to work pretty good. Perl 5.10 added the \R character class as a generalized linebreak:
$line =~ s/\R//g;
It's the same as:
(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
I'll keep this question open a while yet, just to see if there's more nifty ways waiting to be suggested.
Whenever I go through input and want to remove or replace characters I run it through little subroutines like this one.
sub clean {
my $text = shift;
$text =~ s/\n//g;
$text =~ s/\r//g;
return $text;
}
It may not be fancy but this method has been working flawless for me for years.
$line =~ s/[\r\n]+//g;
Reading perlport I'd suggest something like
$line =~ s/\015?\012?$//;
to be safe for whatever platform you're on and whatever linefeed style you may be processing because what's in \r and \n may differ through different Perl flavours.
Note from 2017: File::Slurp is not recommended due to design mistakes and unmaintained errors. Use File::Slurper or Path::Tiny instead.
extending on your answer
use File::Slurp ();
my $value = File::Slurp::slurp($filename);
$value =~ s/\R*//g;
File::Slurp abstracts away the File IO stuff and just returns a string for you.
NOTE
Important to note the addition of /g , without it, given a multi-line string, it will only replace the first offending character.
Also, the removal of $, which is redundant for this purpose, as we want to strip all line breaks, not just line-breaks before whatever is meant by $ on this OS.
In a multi-line string, $ matches the end of the string and that would be problematic ).
Point 3 means that point 2 is made with the assumption that you'd also want to use /m otherwise '$' would be basically meaningless for anything practical in a string with >1 lines, or, doing single line processing, an OS which actually understands $ and manages to find the \R* that proceed the $
Examples
while( my $line = <$foo> ){
$line =~ $regex;
}
Given the above notation, an OS which does not understand whatever your files '\n' or '\r' delimiters, in the default scenario with the OS's default delimiter set for $/ will result in reading your whole file as one contiguous string ( unless your string has the $OS's delimiters in it, where it will delimit by that )
So in this case all of these regex are useless:
/\R*$// : Will only erase the last sequence of \R in the file
/\R*// : Will only erase the first sequence of \R in the file
/\012?\015?// : When will only erase the first 012\015 , \012 , or \015 sequence, \015\012 will result in either \012 or \015 being emitted.
/\R*$// : If there happens to be no byte sequences of '\015$OSDELIMITER' in the file, then then NO linebreaks will be removed except for the OS's own ones.
It would appear nobody gets what I'm talking about, so here is example code, that is tested to NOT remove line feeds. Run it, you'll see that it leaves the linefeeds in.
#!/usr/bin/perl
use strict;
use warnings;
my $fn = 'TestFile.txt';
my $LF = "\012";
my $CR = "\015";
my $UnixNL = $LF;
my $DOSNL = $CR . $LF;
my $MacNL = $CR;
sub generate {
my $filename = shift;
my $lineDelimiter = shift;
open my $fh, '>', $filename;
for ( 0 .. 10 )
{
print $fh "{0}";
print $fh join "", map { chr( int( rand(26) + 60 ) ) } 0 .. 20;
print $fh "{1}";
print $fh $lineDelimiter->();
print $fh "{2}";
}
close $fh;
}
sub parse {
my $filename = shift;
my $osDelimiter = shift;
my $message = shift;
print "Parsing $message File $filename : \n";
local $/ = $osDelimiter;
open my $fh, '<', $filename;
while ( my $line = <$fh> )
{
$line =~ s/\R*$//;
print ">|" . $line . "|<";
}
print "Done.\n\n";
}
my #all = ( $DOSNL,$MacNL,$UnixNL);
generate 'Windows.txt' , sub { $DOSNL };
generate 'Mac.txt' , sub { $MacNL };
generate 'Unix.txt', sub { $UnixNL };
generate 'Mixed.txt', sub {
return #all[ int(rand(2)) ];
};
for my $os ( ["$MacNL", "On Mac"], ["$DOSNL", "On Windows"], ["$UnixNL", "On Unix"]){
for ( qw( Windows Mac Unix Mixed ) ){
parse $_ . ".txt", #{ $os };
}
}
For the CLEARLY Unprocessed output, see here: http://pastebin.com/f2c063d74
Note there are certain combinations that of course work, but they are likely the ones you yourself naĆ­vely tested.
Note that in this output, all results must be of the form >|$string|<>|$string|< with NO LINE FEEDS to be considered valid output.
and $string is of the general form {0}$data{1}$delimiter{2} where in all output sources, there should be either :
Nothing between {1} and {2}
only |<>| between {1} and {2}
In your example, you can just go:
chomp(#lines);
Or:
$_=join("", #lines);
s/[\r\n]+//g;
Or:
#lines = split /[\r\n]+/, join("", #lines);
Using these directly on a file:
perl -e '$_=join("",<>); s/[\r\n]+//g; print' <a.txt |less
perl -e 'chomp(#a=<>);print #a' <a.txt |less
To extend Ted Cambron's answer above and something that hasn't been addressed here: If you remove all line breaks indiscriminately from a chunk of entered text, you will end up with paragraphs running into each other without spaces when you output that text later. This is what I use:
sub cleanLines{
my $text = shift;
$text =~ s/\r/ /; #replace \r with space
$text =~ s/\n/ /; #replace \n with space
$text =~ s/ / /g; #replace double-spaces with single space
return $text;
}
The last substitution uses the g 'greedy' modifier so it continues to find double-spaces until it replaces them all. (Effectively substituting anything more that single space)