Include file data in Perl array - perl

I am trying to pass all the data from a file into a Perl array and then I am trying to use a foreach loop to process every string in the array. The problem is that the foreach instead of printing each individual line is printing the entire array.I am using the following script.
while (<FILE>) {
$_ =~ s/(\)|\()//g;
push #array, $_;
}
foreach $n(#array) {
print "$n\n";
}
Say for example the data in the array is #array=qw(He goes to the school everyday)
the array is getting printed properly but the foreach loop instead of printing every element on different line is printing the entire array.

After reading your comments, I am guessing that your problem is that your source file does not contain any newlines: I.e. the entire file is just one line. Some text editors just wrap the text without actually adding any line break characters.
There is no "solution" to that problem; You have to add line breaks where you want them. You could write a script to do it, but I doubt it would make much sense. It all depends on what you want to do with this text.
Here's my code suggestions for your snippet.
chomp(#array = <FILE>);
s/[()]//g for #array;
print "$_\n" for #array;
or
#array = <FILE>;
s/[()]//g for #array;
print #array;
Note that if you have a file from another filesystem, you may get \r characters left over at the end of your strings after chomp, causing the output to look corrupted, overwriting itself.
Additional notes:
(\)|\() is better written as a character class: [()].
#array = <FILE> will read the entire file into the array. No need
to loop.
As shown in my examples, print can be assigned a list of items
(e.g. an array) as arguments. And you can have a postfix loop to
print sequentially.
With a (postfix) loop, all the loop elements are aliased to $_,
which is a handy way to do substitutions on the array.

Since the entire file is just one line.You can split the string on basis of whitespace and print every element of array in new line
use strict;
use warnings;
open(FILE,'YOURFILE' ) || die ("could not open");
my $line= <FILE>;
my #array = split ' ',$line;
foreach my $n(#array)
{
print "$n\n";
}
close(FILE);
Input File
In recent years many risk factors for the development of breast cancer that .....
Output
In
recent
years
many
risk
factors
for
the
development
of
breast
cancer
that
.....

Related

search a paragraph in reverse order without array in perl

i have a log file where the errors will be mentioned as "ERROR" in the beginning of the line and next line will have the detailed text about the error. I would like to search for "ERROR" in the reverse order so i can find the last error and print the next line or copy is the line to a variable.
In shell i can try the below command which will help me to achieve the same. Can some one give me a equivalent perl code.
grep -A2 ERROR sapinst.log | tail -2
As the log file will be huge (~5000+ lines), so I don't want to store it in an array.
Your file size is rather small and perl is pretty quick, so I wouldn't worry about reverse order that much. This little program reads lines of input from the files you specify on the command line (or standard input if you specify none), skips lines until it finds ERROR, then prints that and the next line:
#!perl
while( <> ) {
next unless /ERROR/;
print;
print scalar <>;
}
From there you can use tail if you like. This Perl does the same as the grep you posted (although since you already have that solution I wonder why you want a different one).
If you don't want to use tail, keep track of the two lines you'll output and replace them when you find a new set:
my( $error_line, $next_line );
while( <> ) {
next unless /ERROR/;
$error_line = $_;
$next_line = scalar <>;
}
print $error_line, $next_line;
If you have a recent enough perl, you can use the safer double diamond line input operator:
use v5.22;
my( $error_line, $next_line );
while( <<>> ) {
next unless /ERROR/;
$error_line = $_;
$next_line = scalar <<>>;
}
print $error_line, $next_line;
You can use File::ReadBackwards, but you'll have to do the same task by remembering every line then checking if the previous line had ERROR. For you data sizes, the benefit probably isn't apparent. If the simple solution isn't fast enough, it's time to get fancier (but not before then).

Perl: string formatting in tab delimited file

I have no background in programming whatsoever, so I would appreciate it if you would explain how and why any code you recommend should be written the way it is.
I have a data matrix 2,000+ samples, and need to do the following manipulate the format in one column.
I would also like to manipulate the format of one of the columns so that it is easier to merge with my other matrix. For example, one column is known as sample number (column #16). The format is currently similar to ABCD-A1-A0SD-01A-11D-A10Y-09, yet I would like to change it to be formatted to the following ABCD-A1-A0SD-01A. This will allow me to have it in the right format so that I can merge it with another matrix. I seem to not be able to find any information on how to proceed with this step.
The sample input should look like this:
ABCD-A1-A0SD-01A-11D-A10Y-09
ABCD-A1-A0SD-01A-11D-A10Y-09
ABCD-A1-A0SE-01A-11D-A10Y-09
ABCD-A1-A0SE-01A-11D-A10Y-09
ABCD-A1-A0SF-01A-11D-A10Y-09
ABCD-A1-A0SH-01A-11D-A10Y-09
ABCD-A1-A0SI-01A-11D-A10Y-09
I want the last three extensions removed. The output sample should look like this:
ABCD-A1-A0SD-01A
ABCD-A1-A0SD-01A
ABCD-A1-A0SE-01A
ABCD-A1-A0SE-01A
ABCD-A1-A0SF-01A
ABCD-A1-A0SH-01A
ABCD-A1-A0SI-01A
Finally, the matrix that I want to merge with has a different layout, in other words the number of columns and rows are different. This is a issue when I tackle the next step which is merging the two matrices together. The original matrix has about 52 columns and 2,000+ rows, whereas the merging matrix only has 15 column and 467 rows.
Each row of the original matrix has mutational information for a patient. This means that the same patient with the same ID might appear many times. The second matrix contains the patient information, so no patients are repeated in that matrix. When merging the matrix, I want to make sure that every patient mutation (each row) is matched with its corresponding information from the merging matrix.
My sample code:
#!/usr/bin/perl
use strict;
use warnings;
my $file = 'sorted_samples_2.txt';
open(INFILE, $file) or die "Can't open file: $!\n";
open(my $outfile, '>', 'sorted_samples_changed.txt');
foreach my $line (<INFILE>) {
print "The input line is $line\n";
my #columns = split('\t', $line);
($columns[15]) = $columns[15]=~/:((\w\w\w\w-\w\d-\w|\w\w-\d\d\w)+)$/;
printf $outfile "#columns/n";
}
Issues: The code deletes the header and deleted the string in column 16.
A few issues about your code:
Good job on include use strict; and use warnings;. Keep doing that
Anytime you're doing file or directory processing, include use autodie; as well.
Always use lexical file handles $infh instead of globs INFILE.
Use the 3 parameter form of open.
Always process a file line by line using a while loop. Using a for loop loads the entire file into memory
Don't forget to chomp your input from a file.
Use the line number variable $. if you want special logic for your header
The first parameter of split is a pattern. Use /\t/. The only exception to this is ' ' which has special meaning. Currently your introducing a bug by using a single quoted string.
When altering a value with a regex, try to focus on what you DO want instead of what you DON'T. In this case it looks like you want 4 groups separated by dashes, and then truncate the rest. Focus on matching those groups.
Don't use printf when you mean print.
The following applies these fixes to your script:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $infile = 'sorted_samples_2.txt';
my $outfile = 'sorted_samples_changed.txt';
open my $infh, '<', $infile;
open my $outfh, '>', $outfile;
while (my $line = <$infh>) {
chomp $line;
my #columns = split /\t/, $line;
if ($. > 1) {
$columns[15] =~ s/^(\w{4}-\w\d-\w{4}-\w{3}).*/$1/
or warn "Unable to fix column at line $.";
}
print $outfh join("\t", #columns), "\n";
}
You need to define scope for your variables with 'my' in declaration itself when you use 'use strict'.
In your case, you should use my #sort = sort {....} in first line and
you should have an array reference $t defined somewhere to de-reference it in second line. You don't have #array declared anywhere in this code, that is the reason you got all those errors. Make sure you understand what you are doing before you do it.

While loop and diamond operator in Perl

I am trying to input a text file to Perl program and reverse its order of lines i.e. last line will become first, second last will become second etc. I am using following code
#!C:\Perl64\bin
$k = 0;
while (<>){
print "the value of i is $i";
#array[k] = $_;
++$k;
}
print "the array is #array";
But for some reason, my array is only printing the last line of the text file.
Any suggestions?
Typically, rather than keep a separate array index, perl programs use the push operator to push a string onto an array. One way to do this in your program:
push #array, $_;
If you really want to do it by array index, then you need to use the following syntax:
$array[$k] = $_;
Notice the $ rather than # in front. This tells perl that you're dealing with a single element from the array, not multiple elements. #array gives you the entire array, while $array[$k] gives you a single element. (There is a more advanced topic called "slices," but let's not get into that here. I will say that #array[$k] gives you a slice, and that isn't what you want here.)
If you really just want to slurp the entire file into an array, you can do that in one step:
#array = ( <> );
That will read the entire file into #array in one step.
You might have noticed I omitted/ignored your print statement. I'm not sure what it's doing printing out a variable named $i, since it didn't seem connected at all to the rest of the code. I reasoned it was debug code you had added, and not really relevant to the task at hand.
Anyway, that should get your input into #array. Now reversing the array... There are many ways you could do this in perl, but I'll let you discover those yourself.
Instead of:
#array[k] = $_;
you want:
$array[$k] = $_;
To reference the scalar variable $k, you need the $ on the front. Without that it is interpreted as the literal string 'k', which when used as an array index would be interpreted as 0 (since a non-numeric string will be interpreted as 0 in a numeric context).
So, each time around the loop you are setting the first element to the line read in (overwriting the value set in the previous iteration).
A few other tips:
#array[ ] is actually the syntax for an array slice rather than a single element. It works in this case because you are assigning to a slice of 1. The usual syntax for accessing a single element would be $array[ ].
I recommend placing 'use strict;' at the top of your script - you would have gotten an error pointing out the incorrect reference to $k
Instead of using an index variable, you could push the values onto the end of the array, eg:
while (<>) {
push #array, $_;
}
Accept input until it finds the word end
Solution1
#!/usr/bin/perl
while(<>) {
last if $_=~/end/i;
push #array,$_;
}
for (my $i=scalar(#array);$i>=0;$i--){
print pop #array;
}
Solution2
while(<>){
last if $_=~/end/i;
push #array,$_;
}
print reverse(#array);

A Perl script to process a CSV file, aggregating properties spread over multiple records

Sorry for the vague question, I'm struggling to think how to better word it!
I have a CSV file that looks a little like this, only a lot bigger:
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
The values in the first column are a ID numbers and the second column could be described as a property (for want of a better word...). The ID number 550672 has properties 1,2,3,4. Can anyone point me towards how I can begin solving how to produce strings such as that for all the ID numbers? My ideal output would be a new csv file which looks something like:
550672,1;2;3;4
656372,1;2
766153,1;4
etc.
I am very much a Perl baby (only 3 days old!) so would really appreciate direction rather than an outright solution, I'm determined to learn this stuff even if it takes me the rest of my days! I have tried to investigate it myself as best as I can, although I think I've been encumbered by not really knowing what to really search for. I am able to read in and parse CSV files (I even got so far as removing duplicate values!) but that is really where it drops off for me. Any help would be greatly appreciated!
I think it is best if I offer you a working program rather than a few hints. Hints can only take you so far, and if you take the time to understand this code it will give you a good learning experience
It is best to use Text::CSV whenever you are processing CSV data as all the debugging has already been done for you
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new;
open my $fh, '<', 'data.txt' or die $!;
my %data;
while (my $line = <$fh>) {
$csv->parse($line) or die "Invalid data line";
my ($key, $val) = $csv->fields;
push #{ $data{$key} }, $val
}
for my $id (sort keys %data) {
printf "%s,%s\n", $id, join ';', #{ $data{$id} };
}
output
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
Firstly props for seeking an approach not a solution.
As you've probably already found with perl, There Is More Than One Way To Do It.
The approach I would take would be;
use strict; # will save you big time in the long run
my %ids # Use a hash table with the id as the key to accumulate the properties
open a file handle on csv or die
while (read another line from the file handle){
split line into ID and property variable # google the split function
append new property to existing properties for this id in the hash table # If it doesn't exist already, it will be created
}
foreach my $key (keys %ids) {
deduplicate properties
print/display/do whatever you need to do with the result
}
This approach means you will need to iterate over the whole set twice (once in memory), so depending on the size of the dataset that may be a problem.
A more sophisticated approach would be to use a hashtable of hashtables to do the de duplication in the intial step, but depending on how quickly you want/need to get it working, that may not be worthwhile in the first instance.
Check out
this question
for a discussion on how to do the deduplication.
Well, open the file as stdin in perl, assume each row is of two columns, then iterate over all lines using left column as hash identifier, and gathering right column into an array pointed by a hash key. At the end of input file you'll get a hash of arrays, so iterate over it, printing a hash key and assigned array elements separated by ";" or any other sign you wish.
and here you go
dtpwmbp:~ pwadas$ cat input.txt
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
dtpwmbp:~ pwadas$ cat bb2.pl
#!/opt/local/bin/perl
my %hash;
while (<>)
{
chomp;
my($key, $value) = split /,/;
push #{$hash{$key}} , $value ;
}
foreach my $key (sort keys %hash)
{
print $key . "," . join(";", #{$hash{$key}} ) . "\n" ;
}
dtpwmbp:~ pwadas$ cat input.txt | perl -f bb2.pl
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
dtpwmbp:~ pwadas$
perl -F"," -ane 'chomp($F[1]);$X{$F[0]}=$X{$F[0]}.";".$F[1];if(eof){for(keys %X){$X{$_}=~s/;//;print $_.",".$X{$_}."\n"}}'
Another (not perl) way which incidentally is shorter and more elegant:
#!/opt/local/bin/gawk -f
BEGIN {FS=OFS=",";}
NF > 0 { IDs[$1]=IDs[$1] ";" $2; }
END { for (i in IDs) print i, substr(IDs[i], 2); }
The first line (after specifying the interpreter) sets the input FIELD SEPARATOR and the OUTPUT FIELD SEPARATOR to the comma. The second line checks of we have more than zero fields and if you do it makes the ID ($1) number the key and $2 the value. You do this for all lines.
The END statement will print these pairs out in an unspecified order. If you want to sort them you have to option of asorti gnu awk function or connecting the output of this snippet with a pipe to sort -t, -k1n,1n.

How do I print on a single line all content between certain start- and stop-lines?

while(<FILE>)
{
chomp $_;
$line[$i]=$_;
++$i;
}
for($j=0;$j<$i;++$j)
{
if($line[$j]=~/Syn_Name/)
{
do
{
print OUT $line[$j],"\n";
++$j;
}
until($line[$j]=~/^\s*$/)
}
}
This is my code I am trying to print data between Syn_Name and a blank line.
My code extracts the chunk that I need.
But the data between the chunk is printed line by line. I want the data for each chunk to get printed on a single line.
Simplification of your code. Using the flip-flop operator to control the print. Note that printing the final line will not add a newline (unless the line contained more than one newline). At best, it prints the empty string. At worst, it prints whitespace.
You do not need a transition array for the lines, you can use a while loop. In case you want to store the lines anyway, I added a commented line with how that is best done.
#chomp(my #line = <FILE>);
while (<FILE>) {
chomp;
if(/Syn_Name/ .. /^\s*$/) {
print OUT;
print "\n" if /^\s*$/;
}
}
Contents
Idiomatic Perl
Make errors easier to fix
Warnings about common programming errors
Don't execute unless variable names are consistent
Developing this habit will save you lots of time
Perl's range operator
Working demos
Print chomped lines immediately
Join lines with spaces
One more edge case
Idiomatic Perl
You seem to have a background with the C family of languages. This is fine because it gets the job done, but you can let Perl handle the machinery for you, namely
chomp defaults to $_ (also true with many other Perl operators)
push adds an element to the end of an array
to simplify your first loop:
while (<FILE>)
{
chomp;
push #line, $_;
}
Now you don't have update $i to keep track of how many lines you've already added to the array.
On the second loop, instead of using a C-style for loop, use a foreach loop:
The foreach loop iterates over a normal list value and sets the variable VAR to be each element of the list in turn …
The foreach keyword is actually a synonym for the for keyword, so you can use foreach for readability or for for brevity. (Or because the Bourne shell is more familiar to you than csh, so writing for comes more naturally.) If VAR is omitted, $_ is set to each value.
This way, Perl handles the bookkeeping for you.
for (#line)
{
# $_ is the current element of #line
...
}
Make errors easier to fix
Sometimes Perl can be too accommodating. Say in the second loop you made an easy typographical error:
for (#lines)
Running your program now produces no output at all, even if the input contains Syn_Name chunks.
A human can look at the code and see that you probably intended to process the array you just created and pluralized the name of the array by mistake. Perl, being eager to help, creates a new empty #lines array, which leaves your foreach loop with nothing to do.
You may delete the spurious s at the end of the array's name but still have a program produces no output! For example, you may have an unhandled combination of inputs that doesn't open the OUT filehandle.
Perl has a couple of easy ways to spare you these (and more!) kinds of frustration from dealing with silent failures.
Warnings about common programming errors
You can turn on an enormous list of warnings that help diagnose common programming problems. With my imagined buggy version of your code, Perl could have told you
Name "main::lines" used only once: possible typo at ./synname line 16.
and after fixing the typo in the array name
print() on unopened filehandle OUT at ./synname line 20, <FILE> line 8.
print() on unopened filehandle OUT at ./synname line 20, <FILE> line 8.
print() on unopened filehandle OUT at ./synname line 20, <FILE> line 8.
print() on unopened filehandle OUT at ./synname line 20, <FILE> line 8.
print() on unopened filehandle OUT at ./synname line 20, <FILE> line 8.
Right away, you see valuable information that may be difficult or at least tedious to spot unaided:
variable names are inconsistent, and
the program is trying to produce output but needs a little more plumbing.
Don't execute unless variable names are consistent
Notice that even with the potential problems above, Perl tried to execute anyway. With some classes of problems such as the variable-naming inconsistency, you may prefer that Perl not execute your program but stop and make you fix it first. You can tell Perl to be strict about variables:
This generates a compile-time error if you access a variable that wasn't declared via our or use vars, localized via my, or wasn't fully qualified.
The tradeoff is you have to be explicit about which variables you intend to be part of your program instead of allowing them to conveniently spring to life upon first use. Before the first loop, you would declare
my #line;
to express your intent. Then with the bug of a mistakenly pluralized array name, Perl fails with
Global symbol "#lines" requires explicit package name at ./synname line 16.
Execution of ./synname aborted due to compilation errors.
and you know exactly which line contains the error.
Developing this habit will save you lots of time
I begin almost every non-trivial Perl program I write with
#! /usr/bin/env perl
use strict;
use warnings;
The first is the shebang line, an ordinary comment as far as Perl is concerned. The use lines enable the strict pragma and the warnings pragma.
Not wanting to be a strict-zombie, as Mark Dominus chided, I'll point out that use strict; as above with no option makes Perl strict in dealing with three error-prone areas:
strict vars, as described above;
strict refs, disallows use of symbolic references; and
strict subs, requires the programmer to be more careful in referring to subroutines.
This is a highly useful default. See the strict pragma's documentation for more details.
Perl's range operator
The perlop documentation describes .., Perl's range operator, that can help you greatly simplify the logic in your second loop:
In scalar context, .. returns a boolean value. The operator is bistable, like a flip-flop, and emulates the line-range (comma) operator of sed, awk, and various editors. Each .. operator maintains its own boolean state, even across calls to a subroutine that contains it. It is false as long as its left operand is false. Once the left operand is true, the range operator stays true until the right operand is true, AFTER which the range operator becomes false again. It doesn't become false till the next time the range operator is evaluated.
In your question, you wrote that you want “data between Syn_Name and a blank line,” which in Perl is spelled
/Syn_Name/ .. /^\s*$/
In your case, you also want to do something special at the end of the range, and .. provides for that case too, ibid.
The final sequence number in a range has the string "E0" appended to it, which doesn't affect its numeric value, but gives you something to search for if you want to exclude the endpoint.
Assigning the value returned from .. (which I usually do to a scalar named $inside or $is_inside) allows you to check whether you're at the end, e.g.,
my $is_inside = /Syn_Name/ .. /^\s*$/;
if ($is_inside =~ /E0$/) {
...
}
Writing it this way also avoids duplicating the code for your terminating condition (the right-hand operand of ..). This way if you need to change the logic, you change it in only one place. When you have to remember, you'll forget sometimes and create bugs.
Working demos
See below for code you can copy-and-paste to get working programs. For demo purposes, they read input from the built-in DATA filehandle and write output to STDOUT. Writing it this way means you can transfer my code into yours with little or no modification.
Print chomped lines immediately
As defined in your question, there's no need for one loop to collect the lines in a temporary array and then another loop to process the array. Consider the following code
#! /usr/bin/env perl
use strict;
use warnings;
# for demo only
*FILE = *DATA;
*OUT = *STDOUT;
while (<FILE>)
{
chomp;
if (my $is_inside = /Syn_Name/ .. /^\s*$/) {
my $is_last = $is_inside =~ /E0$/;
print OUT $_, $is_last ? "\n" : ();
}
}
__DATA__
ERROR IF PRESENT IN OUTPUT!
Syn_Name
foo
bar
baz
ERROR IF PRESENT IN OUTPUT!
whose output is
Syn_Namefoobarbaz
We always print the current line, stored in $_. When we're at the end of the range, that is, when $is_last is true, we also print a newline. When $is_last is false, the empty list in the other branch of the ternary operator is the result—meaning we print $_ only, no newline.
Join lines with spaces
You didn't show us an example input, so I wonder whether you really want to butt the lines together rather than joining them with spaces. If you want the latter behavior, then the program becomes
#! /usr/bin/env perl
use strict;
use warnings;
# for demo only
*FILE = *DATA;
*OUT = *STDOUT;
my #lines;
while (<FILE>)
{
chomp;
if (my $is_inside = /Syn_Name/ .. /^\s*$/) {
push #lines, $_;
if ($is_inside =~ /E0$/) {
print OUT join(" ", #lines), "\n";
#lines = ();
}
}
}
__DATA__
ERROR IF PRESENT IN OUTPUT!
Syn_Name
foo
bar
baz
ERROR IF PRESENT IN OUTPUT!
This code accumulates in #lines only those lines within a Syn_Name chunk, prints the chunk, and clears out #lines when we see the terminator. The output is now
Syn_Name foo bar baz
One more edge case
Finally, what happens if we see Syn_Name at the end of the file but without a terminating blank line? That may be impossible with your data, but in case you need to handle it, you'll want to use Perl's eof operator.
eof FILEHANDLE
eof
Returns 1 if the next read on FILEHANDLE will return end of file or if FILEHANDLE is not open … An eof without an argument uses the last file read.
So we terminate on either a blank line or end of file.
#! /usr/bin/env perl
use strict;
use warnings;
# for demo only
*FILE = *DATA;
*OUT = *STDOUT;
my #lines;
while (<FILE>)
{
s/\s+$//;
#if (my $is_inside = /Syn_Name/ .. /^\s*$/) {
if (my $is_inside = /Syn_Name/ .. /^\s*$/ || eof) {
push #lines, $_;
if ($is_inside =~ /E0$/) {
print OUT join(" ", #lines), "\n";
#lines = ();
}
}
}
__DATA__
ERROR IF PRESENT IN OUTPUT!
Syn_Name
foo
bar
YOU CANT SEE ME!
Syn_Name
quux
potrzebie
Output:
Syn_Name foo bar
Syn_Name quux potrzebie
Here instead of chomp, the code removes any trailing invisible whitespace at the ends of lines. This will make sure spacing between joined lines is uniform even if the input is a little sloppy.
Without the eof check, the program does not print the latter line, which you can see by commenting out the active conditional and uncommenting the other.
Another simplified version:
foreach (grep {chomp; /Syn_Name/ .. /^\s*$/ } <FILE>) {
print OUT;
print OUT "\n" if /^\s*$/;
}