Split string into variables and use output as element - perl

Well.. I'm stuck again. I've read quite a few topics with similar problems but haven't found a solution for mine. I have a ;-delimited CSV file, and the string in the 8th column ($elements[7]) looks like this: "aaaa;bb;cccc;ddddd;eeee;fffff;gg;". What I'm trying to do is split that string on ; and capture the pieces in variables, then use those variables in the main CSV file, each in its own column.
So now the file is like:
3d;2f;7j;8k;4s;2b;5g;"aaaa;bb;cccc;ddddd;eeee;fffff;gg;";4g;1a;5g;2g;7h;3d;2f;7j
3c;9k;5l;4g;1a;5g;3d;"aaaa;bb;cccc;ddddd;eeee;fffff;gg;";4g;1a;5g;2g;7h;3d;2f;7j
4g;1a;5g;2g;7h;3d;8k;"aaaa;bb;cccc;ddddd;eeee;fffff;gg;";3d;2f;7j;8k;4s;2b;4g;1a
And i want it like:
3d;2f;7j;8k;4s;2b;5g;4g;1a;5g;2g;7h;3d;2f;7j;aaaa;bb;cccc;ddddd;eeee;fffff;gg
3c;9k;5l;4g;1a;5g;3d;4g;1a;5g;2g;7h;3d;2f;7j;aaaa;bb;cccc;ddddd;eeee;fffff;gg;
4g;1a;5g;2g;7h;3d;8k;3d;2f;7j;8k;4s;2b;4g;1a;aaaa;bb;cccc;ddddd;eeee;fffff;gg;
This is the code I've been trying it with. I know.. it's terrible! But I'm hoping someone can help me?
use strict;
use warnings;
my $inputfile = shift || die "Give files\n";
my $outputfile = shift || die "Give output\n";
open my $INFILE, '<', $inputfile or die "In use / Not found :$!\n";
open my $OUTFILE, '>', $outputfile or die "In use :$!\n";
while (<$INFILE>) {
s/"//g;
my @elements = split /;/, $_;
my ($varA, $varB, $varC, $varD, $varE, $varF, $varG, $varH) = split (';', $elements[10]);
$elements[16] = $varA;
$elements[17] = $varB;
$elements[18] = $varC;
$elements[19] = $varD;
$elements[20] = $varE;
$elements[21] = $varF;
$elements[22] = $varG;
$elements[23] = $varH;
my $output_line = join(";", @elements);
print $OUTFILE $output_line;
}
close $INFILE;
close $OUTFILE;
exit 0;
I'm confused about the my statement as well, it shouldn't be possible right? I mean the $vars are in a closed part so it shouldn't be possible to write them to $elements?
EDIT
This is how i adjusted the code with TLP's suggestions:
use strict;
use warnings;
use Text::CSV;
my $inputfile = shift || die "Give files\n";
my $outputfile = shift || die "Give output\n";
open my $INFILE, '<', $inputfile or die "In use / Not found :$!\n";
open my $OUTFILE, '>', $outputfile or die "In use :$!\n";
my $csv = Text::CSV->new({ # create a csv object
sep_char => ";", # delimiter
eol => "\n", # adds newline to print
});
while (my $row = $csv->getline($INFILE)) { # $row is an array ref
my $line = splice(@$row, 10, 1); # remove 8th line
$csv->parse($line); # parse the line
push @$row, $csv->fields(); # push newly parsed fields onto main array
$csv->print($OUTFILE, $row);
}
close $INFILE;
close $OUTFILE;
exit 0;

You should use a CSV module, e.g. Text::CSV to parse your data. Here's a brief example on how it can be done. You can replace the file handles I used below with your own.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({ # create a csv object
sep_char => ";", # delimiter
eol => "\n", # adds newline to print
});
while (my $row = $csv->getline(*DATA)) { # $row is an array ref
my $line = splice(@$row, 7, 1); # remove the 8th column (the quoted string)
$csv->parse($line); # parse that string as its own CSV line
push @$row, $csv->fields(); # push newly parsed fields onto main array
$csv->print(*STDOUT, $row);
}
__DATA__
3d;2f;7j;8k;4s;2b;5g;"aaaa;bb;cccc;ddddd;eeee;fffff;gg;";4g;1a;5g;2g;7h;3d;2f;7j
3c;9k;5l;4g;1a;5g;3d;"aaaa;bb;cccc;ddddd;eeee;fffff;gg;";4g;1a;5g;2g;7h;3d;2f;7j
4g;1a;5g;2g;7h;3d;8k;"aaaa;bb;cccc;ddddd;eeee;fffff;gg;";3d;2f;7j;8k;4s;2b;4g;1a
Output:
3d;2f;7j;8k;4s;2b;5g;4g;1a;5g;2g;7h;3d;2f;7j;aaaa;bb;cccc;ddddd;eeee;fffff;gg;
3c;9k;5l;4g;1a;5g;3d;4g;1a;5g;2g;7h;3d;2f;7j;aaaa;bb;cccc;ddddd;eeee;fffff;gg;
4g;1a;5g;2g;7h;3d;8k;3d;2f;7j;8k;4s;2b;4g;1a;aaaa;bb;cccc;ddddd;eeee;fffff;gg;
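If installing a CPAN module is not an option, the same rearrangement can be sketched with the core Text::ParseWords module. This is only a minimal sketch, not part of the answer above: move_quoted_field is a hypothetical helper name, and it assumes the fields never contain backslashes or apostrophes, which parse_line treats specially. Unlike the Text::CSV version, split drops the trailing empty sub-field, so no trailing ; is produced.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: take the quoted field at position $index out of the
# row, split it on ';' and append the pieces to the end of the row.
sub move_quoted_field {
    my ($line, $index) = @_;
    chomp $line;
    my @fields = parse_line(';', 0, $line);   # 0 = strip the surrounding quotes
    my $quoted = splice @fields, $index, 1;   # remove the quoted column
    push @fields, split /;/, $quoted;         # split drops trailing empty pieces
    return join ';', @fields;
}

print move_quoted_field('3d;2f;"aaaa;bb;cccc;";4g;1a', 2), "\n";
```

For the sample data in the question, the call would be move_quoted_field($_, 7) for each input line.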

Related

perl- extract duplicate sequences from a multi-fasta file

I have a big fasta file input.fasta which contains many duplicate sequences. I want to enter a header name and extract all the sequences with the matching header. I know this could easily be done with awk/sed/grep but I need Perl code.
input.fasta
>OGH38127_some_organism
PAAALGFSHLARQEDSALTPKHYTWTAPGEGDVRAPCPVLNTLANHEFLPHNGKNITVDK
AITALGDAMNISPALATTFFTGGLKTNPTPNATWFDLDMLHKHNVLEHDGSLSRRDMHFD
TSNKFDAATFANFLSYFDANATVLGVNETADARARHAYDMSKMNPEFTITSSMLPIMVGE
SVMMMLVWGSVEEPGAQRDYFEYFFRNERLPVELGWTPGETEIGVPVVTAMITAMVAASP
TDVP
>ABC14110_some_different_org_name
WWVAPGPGDSRGPCPGLNTLANHGYLPHDGKGITLSILADAMLDGFNIARSDALLLFTQ
AIRTSPQYPATNSFNLHDLGRDQLNRHNVLEHDASLSRADDFFGSNHIFNETVFDESRAY
AMLANSKIARQINSKAFNPQYKFTSKTEQFSLGEIAAPIIAFGNSTSGEVNRTLVEYFFM
NERLPIELGWKKSEDGIALDDILRVTQMISKAASLITPSALSWTAETLTP
>OGH38127_some_organism
LPWSRPGPGAVRAPCPMLNTLANHGFLPHDGKNISEARTVQALGRALNIEKELSQFLFEK
ALTTNPHTNATTFSLNDLSRHNLLEHDASLSRQDAYFGDNHDFNQTIFDETRSYWPHPVI
DIQAAALSRQARVNTSIAKNPTYNMSELGLDFSYGETAAYILILGDKDFGKVNRSWVEYL
FENERLPVELGWTRHNETITSDDLNTMLEKVVN
.
.
.
I have tried with the following script but it is not giving any output.
script.pl
#!/perl/bin/perl -w
use strict;
use warnings;
print "Enter a fasta header to search for:\n";
my $head = <>;
my $file = "input.fasta";
open (READ, "$file") || die "Cannot open $file: $!.\n";
my %seqs;
my $header;
while (my $line = <READ>){
chomp $line;
$line =~ s/^>(.*)\n//;
if ($line =~ m/$head/){
$header = $1;
}
}
close (READ);
open( my $out , ">", "out.fasta" ) or die $!;
my @count_seq = keys %seqs;
foreach (@count_seq){
print $out $header, "\n";
print $out $seqs{$header}, "\n";
}
exit;
Please help me correct this script.
Thanks!
If you use the Bioperl module Bio::SeqIO to handle the parsing of the fasta files, it becomes really simple:
#!/usr/bin/perl
use warnings;
use strict;
use Bio::SeqIO;
my ($file, $name) = @ARGV;
my $in = Bio::SeqIO->new(-file => $file, -format => "fasta");
my $out = Bio::SeqIO->new(-fh => \*STDOUT, -format => "fasta");
while (my $s = $in->next_seq) {
$out->write_seq($s) if $s->display_id eq $name;
}
run with perl grep_fasta.pl input.fasta OGH38127_some_organism
There's no need to store the sequences in memory, you can print them directly when reading the file. Use a flag variable ($inside in the example) that tells you whether you're reading the desired sequence or not.
#! /usr/bin/perl
use warnings;
use strict;
my ($file, $header) = @ARGV;
my $inside;
open my $in, '<', $file or die $!;
while (<$in>) {
$inside = $1 eq $header if /^>(.*)/;
print if $inside;
}
Run as
perl script.pl file.fasta OGH38127_some_organism > output.fasta

Print variable after closing the file in Perl

Below code works fine but I want $ip to be printed after closing the file.
use strict;
use warnings;
use POSIX;
my $file = "/tmp/example";
open(FILE, "<$file") or die $!;
while ( <FILE> ) {
my $lines = $_;
if ( $lines =~ m/address/ ) {
my ($string, $ip) = (split ' ', $lines);
print "IP address is: $ip\n";
}
}
close(FILE);
sample data in /tmp/example file
$ cat /tmp/example
country us
ip_address 192.168.1.1
server dell
This solution looks for the first line that contains ip_address followed by some horizontal whitespace and a sequence of digits and dots.
Wrapping the search in a block limits the scope of the lexical variable $fh. When the block ends, the variable is destroyed and, because it holds a file handle, the file is closed automatically.
Note that I've used autodie to avoid the need to explicitly check the status of the open call.
The loop stops reading the file as soon as it finds the first occurrence of ip_address.
use strict;
use warnings 'all';
use autodie;
my $file = '/tmp/example';
my $ip;
{
open my $fh, '<', $file;
while ( <$fh> ) {
if ( /ip_address\h+([\d.]+)/ ) {
$ip = $1;
last;
}
}
}
print $ip // 'undef', "\n";
output
192.168.1.1
Store all ips in an array and you'll then have it for later processing.
The shown code can also be simplified a lot. This assumes a four-number ip and data like that shown in the sample
use warnings;
use strict;
use feature 'say';
my $file = '/tmp/example';
open my $fh, '<', $file or die "Can't open $file: $!";
my @ips;
while (<$fh>) {
if (my ($ip) = /ip_address\s*(\d+\.\d+\.\d+\.\d+)/) {
push @ips, $ip;
}
}
close $fh;
say for @ips;
Or, once you open the file, process all lines with a map
my @ips = map { /ip_address\s*(\d+\.\d+\.\d+\.\d+)/ } <$fh>;
The filehandle is here read in a list context, imposed by map, so all lines from the file are returned. The block in map applies to each in turn, and map returns a flattened list with results.
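A self-contained way to try the map form is to read from an in-memory filehandle instead of /tmp/example. The sketch below assumes sample data like that in the question, with a second ip_address line added purely for illustration. A line that does not match contributes an empty list, so it simply disappears from the result.

```perl
use strict;
use warnings;

# Illustrative data only; the second ip_address line is not in the question.
my $data = "country us\nip_address 192.168.1.1\nserver dell\nip_address 10.0.0.7\n";

# Opening a reference to a scalar gives a filehandle that reads from memory.
open my $fh, '<', \$data or die "Can't open in-memory file: $!";

# map forces list context: <$fh> returns every line, and the match returns
# its capture for matching lines or an empty list for the others.
my @ips = map { /ip_address\s*(\d+\.\d+\.\d+\.\d+)/ } <$fh>;
close $fh;

print "$_\n" for @ips;
```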
Some notes
Use the three-argument form of open with a lexical filehandle. It keeps the mode separate from the filename, and the handle is closed when it goes out of scope.
Don't assign $_ to a variable. To work with a lexical, use while (my $line = <$fh>).
You can use split, but here a regex is more direct, and it lets you assign the match to a variable scoped to the if. If there is no match, the if fails and nothing goes onto the array.
use warnings;
use strict;
my $file = "test";
my ( $string,$ip);
open(my $FH, "<", $file) or die $!;
while (my $lines = <$FH>) {
if ($lines =~ m/address/){
($string, $ip) = (split ' ', $lines);
}
}
print "IP address is: $ip\n";
This gives you the output you need. However, if several lines in the input file match, each match overwrites $ip, so only the last IP address is printed.

Write If Statement Variable to New File

I am trying to send a variable that is defined in an if statement, $abc, to a new file. The code seems correct, but I know that it is not working because the file is not being created.
Data File Sample:
bos,control,x1,x2,29AUG2016,y1,y2,76.4
bos,control,x2,x3,30AUG2016,y2,y3,78.9
bos,control,x3,x4,01SEP2016,y3,y4,72.5
bos,control,x4,x5,02SEP2016,y4,y5,80.5
Perl Code:
#!/usr/bin/perl
use strict;
use warnings 'all';
use POSIX qw(strftime); #Pull in date
my $currdate = strftime( "%Y%m%d", localtime ); #Date in YYYYMMDD format
my $modded = strftime( "%d%b%Y", localtime ); #Date in DDMONYYYY format
my $newdate = uc $modded; #converts lowercase to uppercase
my $filename = '/home/.../.../text_file'; #Define full file path before opening
open(FILE, '<', $filename) or die "Uh, where's the file again?\n"; #Open file else give up and relay snarky error
while(<FILE>) #Open While Loop
{
chomp;
my @fields = split(',' , $_); #Identify columns
my $site = $fields[0];
my $var1 = $fields[1];
my $var2 = $fields[4];
my $var3 = $fields[7];
my $abc = print "$var1,$var2,$var3\n" if ($var1 =~ "control" && $var2 =~ "$newdate");
open my $abc, '>', '/home/.../.../newfile.txt';
close $abc;
}
close FILE;
In your code you have a few odd things that are likely mistakes.
my $abc = print "$var1,$var2,$var3\n" if ($var1 =~ "c01" && $var2 =~ "$newdate");
print returns true (1) on success. So you will print out the string to STDOUT, and then assign 1 to a new lexical variable $abc. $abc is now 1.
All of that only happens if that condition is met. Don't combine a my declaration with a statement modifier such as if: perlsyn documents the behavior as undefined. So if the condition is false, your $abc might be undef. Or something else. Who knows?
open my $abc, '>', '/home/.../.../newfile.txt';
close $abc;
You are opening a new filehandle called $abc. The my redeclares the variable in the same scope; with warnings enabled, Perl reports that this "my" variable masks the earlier declaration. It also replaces your old $abc with a new file handle object.
You don't write anything to the file
... are weird foldernames, but that's probably just obfuscation for your example
I think what you actually want to do is this:
use strict;
use warnings 'all';
# ...
open my $fh, '<', $filename or die $!;
while ( my $line = <$fh> ) {
chomp $line;
my @fields = split( ',', $line );
my $site = $fields[0];
my $var1 = $fields[1];
my $var2 = $fields[4];
my $var3 = $fields[7];
open my $fh_out, '>', '/home/.../.../newfile.txt';
print $fh_out "$var1,$var2,$var3\n" if ( $var1 =~ "c01" && $var2 =~ "$newdate" );
close $fh_out;
}
close $fh;
You don't need the $abc variable in between at all. You can just print to your new file handle $fh_out that's open for writing.
Note that opening newfile.txt with '>' inside the loop truncates it on every line of $filename, so after the loop the file holds at most the output for the last line read.
Your current code:
Prints the string
Assigns the result of printing it to a variable
Immediately overwrites that variable with a file handle (assuming open succeeded)
Closes that file handle without using it
Your logic should look more like this:
if ( $var1 =~ "c01" && $var2 =~ "$newdate" ) {
my $abc = "$var1,$var2,$var3\n";
open (my $file, '>', '/home/.../.../newfile.txt') || die("Could not open file: " . $!);
print $file $abc;
close $file;
}
You have a number of problems with your code. In addition to what others have mentioned
You create a new output file every time you find a matching input line. That will leave the file containing only the last printed string
Your test checks whether the text in the second column contains c01, but all of the lines in your sample input have control in the second column, so nothing will be printed
I'm guessing that you want to test for string equality, in which case you need eq instead of =~ which does a regular expression pattern match
I think it should look something more like this
use strict;
use warnings 'all';
use POSIX 'strftime';
my $currdate = uc strftime '%d%b%Y', localtime;
my ($input, $output) = qw/ data.txt newfile.txt /;
open my $fh, '<', $input or die qq{Unable to open "$input" for input: $!};
open my $out_fh, '>', $output or die qq{Unable to open "$output" for output: $!};
while ( <$fh> ) {
chomp;
my @fields = split /,/;
my ($site, $var1, $var2, $var3) = @fields[0,1,4,7];
next unless $var1 eq 'c01' and $var2 eq $currdate;
print $out_fh "$var1,$var2,$var3\n";
}
close $out_fh or die $!;
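The difference between =~ and eq mentioned in this answer can be checked with a few sample values. This is a minimal sketch; contains_c01 and equals_c01 are hypothetical helper names used only for illustration.

```perl
use strict;
use warnings;

# Pattern match: true if 'c01' appears anywhere inside the value.
sub contains_c01 { return $_[0] =~ /c01/ ? 1 : 0 }

# String equality: true only if the whole value is exactly 'c01'.
sub equals_c01 { return $_[0] eq 'c01' ? 1 : 0 }

for my $value ('control', 'c01', 'abc012') {
    printf "%-8s contains=%d equals=%d\n",
        $value, contains_c01($value), equals_c01($value);
}
```

A value such as abc012 passes the pattern match but fails the equality test, which is why eq is the safer choice when the column must hold exactly one known string.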

Retrieve first row from CSV as headers using Text::CSV

I feel like I'm missing something rather obvious, but can't find any answers in the documentation. Still new to OOP with Perl, but I'm using Text::CSV to parse a CSV for later use.
How would I go about extracting the first row and pushing the values to array @headers?
Here's what I have so far:
#!/usr/bin/perl
use warnings;
use diagnostics;
use strict;
use Fcntl ':flock';
use Text::CSV;
my $csv = Text::CSV->new({ sep_char => ',' });
my $file = "sample.csv";
my @headers; # Column names
open(my $data, '<:encoding(utf8)', $file) or die "Could not open '$file' $!\n";
while (my $line = <$data>) {
chomp $line;
if ($csv->parse($line)) {
my $r = 0; # Increment row counter
my $field_count = $csv->fields(); # Number of fields in row
# While data exists...
while (my $fields = $csv->getline( $data )) {
# Parse row into columns
print "Row ".$r.": \n";
# If row zero, process headers
if($r==0) {
# Add value to @columns array
push(@headers,$fields->[$c]);
} else {
# do stuff with records...
}
}
$r++
}
close $data;
You'd think that there would be a way to reference the existing fields in the first row.
Pretty much straight from the documentation, for example.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;
my $csv = Text::CSV_XS->new ({ binary => 1, eol => $/ });
my $file = 'o33.txt';
open my $io, "<", $file or die "$file: $!";
my $header = $csv->getline ($io);
print join("-", @$header), "\n\n";
while (my $row = $csv->getline ($io)) {
print join("-", @$row), "\n";
}
__END__
***contents of o33.txt
lastname,firstname,age,gender,phone
mcgee,bobby,27,M,555-555-5555
kincaid,marl,67,M,555-666-6666
hofhazards,duke,22,M,555-696-6969
Prints:
lastname-firstname-age-gender-phone
mcgee-bobby-27-M-555-555-5555
kincaid-marl-67-M-555-666-6666
hofhazards-duke-22-M-555-696-6969
Update: Thinking about your problem, it may be that you want to address the data by its column name. For that, you might be able to use something (also from the docs), like this:
$csv->column_names ($csv->getline ($io));
while (my $href = $csv->getline_hr ($io)) {
print "lastname is: ", $href->{lastname},
" and gender is: ", $href->{gender}, "\n"
}
Note: You can use Text::CSV instead of Text::CSV_XS. Text::CSV is a wrapper that uses Text::CSV_XS when it is installed and falls back to the pure-Perl Text::CSV_PP otherwise.
Thought I'd post my results for others.
#!/usr/bin/perl
use warnings;
use diagnostics;
use strict;
use Text::CSV;
sub read_csv {
my $csv = Text::CSV->new({ sep_char => ',' });
my $file = shift;
open(my $data, '<:encoding(utf8)', $file) or die "Could not open '$file' $!\n";
# Process Row Zero
my $header = $csv->getline ($data);
my $field_count = $csv->fields();
# Read the rest of the file; getline() reads and parses one row per call
my $r = 0;
# While data exists...
while (my $fields = $csv->getline( $data )) {
# Parse row into columns
print Display->H2; # Display is not defined in this snippet
print "Row ".$r.": ".@$fields." columns. \n";
# Print column values
for(my $c=0; $c<@$fields; $c++) {
print $header->[$c]." : ".$fields->[$c]."\n";
}
$r++;
}
close $data;
}
Cheers

Adding header of the first file to all the other split files in Perl

I need to add the header of the first main file to all the split files, i.e. I am able to get the header for the 1st split file but I need it for all of them. Here I am splitting a DAT file. Below is what I have done so far:
#!/usr/bin/perl -w
my $chunksize = 25000000; # 25MB
my $filenumber = 0;
my $infile = "Test.dat";
my $outsize = 0;
my $eof = 0;
my $line = $_;
open INFILE, $infile;
open OUTFILE, ">outfile_".$filenumber.".dat";
while (<INFILE>) {
chomp;
if ($outsize > $chunksize) {
close OUTFILE;
$outsize = 0;
$filenumber++;
open (OUTFILE, ">outfile_".$filenumber.".dat")
or die "Can't open outfile_".$filenumber.".dat";
}
print OUTFILE "$_\n";
$outsize += length;
}
close INFILE;
You should always use warnings (in preference to the command-line -w) and use strict. That way many simple errors that you may otherwise have overlooked will be flagged
Use the three-parameter form of open with lexical filehandles
Check the result of all open calls and flag errors containing the value of $! in a die string
Define constant values with the use constant pragma rather than as Perl variables
The number of bytes printed to a filehandle can be evaluated using the tell function, so there is no need to keep your own count
To solve your specific problem, you should read and remember the first line of your input file, and print it to new output files every time they are opened
It is easier to keep track of the output files if you open them when you have new data to write and no open file, and close them when they are full or if you have reached the end of the input data
This program demonstrates the ideas and does what is required
use strict;
use warnings;
use constant INFILE => 'Test.dat';
use constant CHUNKSIZE => 25_000_000; # 25MB
open my $infh, '<', INFILE or die $!;
my $header = <$infh>;
my $outfh;
my $filenumber = 0;
while (my $line = <$infh>) {
unless ($outfh) {
my $outfile = "outfile_$filenumber.dat";
open $outfh, '>', $outfile or die "Can't open '$outfile': $!";
print { $outfh } $header;
$filenumber++;
}
print { $outfh } $line;
if (tell $outfh > CHUNKSIZE or eof $infh) {
close $outfh or die $!;
undef $outfh;
}
}
You need to store the header from the input file and print it every time a new file is opened:
use strict;
use warnings;
use autodie;
# initializations ...
open my $in, '<', $infile;
open my $out, '>', "outfile_${filenumber}.dat";
my $header = <$in>; # Save the header...
chomp $header; # ... not strictly necessary
print $out $header, "\n"; # The first file needs the header too
while ( <$in> ) {
chomp; # Not strictly necessary
if ( $outsize > $chunksize) {
close $out;
$outsize = 0;
$filenumber++;
open $out, '>', "outfile_${filenumber}.dat";
print $out $header, "\n"; # Prints header at beginning of file
# Newline needed if $header chomped
}
print $out $_, "\n"; # Newline needed if $_ chomped
$outsize += length;
}
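The header-repeating logic can also be tried without touching the file system. The sketch below is an illustration only: chunks_with_header is a hypothetical helper that works on a string and returns the chunks as a list, using the same "start a new chunk once the size limit has been exceeded" test as the loops above.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into chunks, each starting with the
# header line. A new chunk begins once more than $limit bytes of data
# lines have been added to the current one.
sub chunks_with_header {
    my ($text, $limit) = @_;
    my ($header, @lines) = split /^/, $text;   # split into lines, keeping newlines
    my @chunks;
    my $size = 0;
    for my $line (@lines) {
        if (!@chunks or $size > $limit) {
            push @chunks, $header;             # every new chunk starts with the header
            $size = 0;
        }
        $chunks[-1] .= $line;
        $size += length $line;
    }
    return @chunks;
}

my @chunks = chunks_with_header("id,value\na,1\nb,2\nc,3\n", 4);
print "--- chunk $_ ---\n$chunks[$_]" for 0 .. $#chunks;
```

With real files, each element of the returned list corresponds to what would be written to one outfile_N.dat.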