Multi-column minimal representation discovery in Perl

I have a CSV file with about 20 columns, and each column has more than one distinct value. Each row after the header row is an individual data sample. I want to narrow the list down programmatically so that I have the smallest number of data samples while each permutation of the column data is still represented.
Example data
SERIAL,ACTIVE,COLOR,CLASS,SEASON,SEATS
.0xb468d47cc9749fb862990426ff79aafb,T,GREEN,BETA,SUMMER,3
.0x847129b35bad62f5837eec30dc07a8a4,T,VIOLET,DELTA,SUMMER,1
.0x14b8df88fd6d6547e387f4caa99e52fd,F,ORANGE,ALPHA,SUMMER,4
.0x0a07fb97224caf79ea73d3fdd5495b8f,T,YELLOW,DELTA,WINTER,1
.0x7d747e689bb27b60198283d7b86db409,F,READ,DELTA,SPRING,2
.0x8247524df49bd19c4c316ee070a2dd4a,T,BLUE,GAMA,WINTER,2
.0x4103ed42af6e8e463708a6c629907fb5,T,YELLOW,ALPHA,SPRING,5
.0xc38deea7f02fbfbcdde1d3718d6decb4,T,YELLOW,DELTA,FALL,5
.0xa3d562edcf64e151d7de08ff8f8e0a94,F,VIOLET,DELTA,SUMMER,3
.0x9da58b3b05603325c24629f700c25c97,T,YELLOW,OMEGA,SPRING,4
.0xef0c0e75083229d654c9b111e3af8726,T,BLUE,GAMA,FALL,1
.0xa9022c8713f0aba2a8e1d20475a3104a,T,YELLOW,BETA,SUMMER,2
.0x5bb5f73e6030730610866cee80cfc2fb,F,ORANGE,BETA,FALL,5
.0xc202e5b43dd65525754fdc52b89e7375,T,BLUE,OMEGA,SUMMER,3
.0xfac9145af33a74aedae7cc0442426432,F,READ,BETA,SPRING,1
.0x457949648053f710b4f2d55cb237a91d,T,GREEN,BETA,SPRING,3
.0xed94d4df300f10f5c4dc5d3ac76cf9e5,F,VIOLET,ALPHA,WINTER,15
.0x870130135beed4cbbe06478e368b40b3,F,YELLOW,ALPHA,SPRING,3
.0x3b6f17841edb9651e732e3ffbacbe14a,T,GREEN,OMEGA,SUMMER,3
.0xfb30e054466b9e4cf944c8e48ff74c93,F,VIOLET,DELTA,SUMMER,8
.0xf741ddc71b4a667585acaa35b67dc6c9,F,BLUE,BETA,FALL,4
.0x60257ad6c299e466086cc6e5bb0a9a33,F,VIOLET,OMEGA,SPRING,1
.0xa5d208bfee5a27a7619ba07dcbdaeea0,T,GREEN,OMEGA,FALL,1
.0x53bc78fa8863e53e8c9fb11c5f6d2320,F,GREEN,GAMA,SPRING,2
.0x5a01253ce5cb0a6aa5213f34f0b35416,T,READ,BETA,WINTER,3
.0xaed9a979ba9f6fbf39895b610dde80f4,T,ORANGE,DELTA,WINTER,1
.0xe7769918e36671af77b5d3d59ea15cfe,T,ORANGE,OMEGA,FALL,4
.0x9e5327a1583332e4c56d29c356dbc5d2,T,INDEGO,ALPHA,WINTER,5
.0x79c5c70732ff04b4d00e81ac3a07c3b7,T,READ,OMEGA,FALL,5
.0x55f54d3c9cd2552e286364894aeef62a,F,READ,GAMA,SPRING,15

Use a hash to determine whether you have seen a particular column combination before, and then use that to determine whether to print a particular line.
Here is a rather basic example to demonstrate the idea:
filter.pl
#!/usr/bin/env perl
use warnings;
use strict;
die "usage: $0 file col1,col2,col3, ... coln\n" unless @ARGV;
my ($file, $columns) = @ARGV;
-f $file or die "$file does not exist!";
defined $columns or die "need to pass in columns!";
my @columns;
for my $col ( split /,/, $columns ) {
die "Invalid column id $col" unless $col >= 1; # 1-based
push @columns, $col - 1; # 0-based
}
scalar @columns or die "No columns!";
open my $fh, "<", $file or die "Unable to open $file : $!";
my %uniq;
while (<$fh>) {
chomp();
next if $. == 1; # Skip Header
my (@data) = split /,/, $_; # Use Text::CSV for any non-trivial csv file
my $key = join '|', @data[ @columns ]; # key will look like 'foo|bar|baz'
if (not defined $uniq{ $key } ) {
print $_ . "\n"; # Print the whole line with the first unique set of columns
$uniq{ $key } = 1; # Now we have seen this combo
}
}
data.csv
SERIAL,TRUTH,IN,PARALLEL
123,TRUE,YES,5
124,TRUE,YES,5
125,TRUE,YES,3
126,TRUE,NO,5
127,FALSE,YES,1
128,FALSE,YES,3
129,FALSE,NO,7
Output
perl filter.pl data.csv 2,3
123,TRUE,YES,5
126,TRUE,NO,5
127,FALSE,YES,1
129,FALSE,NO,7

Related

Perl: Read columns and convert to array

I am new to Perl and am trying to read a file with columns and create arrays from it.
I have a file with the following columns.
file.txt
A 15
A 20
A 33
B 20
B 45
C 32
C 78
I want to create an array for each unique item in the first column, holding the corresponding values from the second column.
e.g.:
@A = (15,20,33)
@B = (20,45)
@C = (32,78)
I tried the following code, which only prints the two columns:
use strict;
use warnings;
my $filename = $ARGV[0];
open(FILE, $filename) or die "Could not open file '$filename' $!";
my %seen;
while (<FILE>)
{
chomp;
my $line = $_;
my @elements = split (" ", $line);
my $row_name = join "\t", @elements[0,1];
print $row_name . "\n" if ! $seen{$row_name}++;
}
close FILE;
Thanks
Firstly some general Perl advice. These days, we like to use lexical variables as filehandles and pass three arguments to open().
open(my $fh, '<', $filename) or die "Could not open file '$filename' $!";
And then...
while (<$fh>) { ... }
But, given that you have your filename in $ARGV[0], another tip is to use an empty file input operator (<>) which will return data from the files named in @ARGV without you having to open them. So you can remove your open() line completely and replace the while with:
while (<>) { ... }
Second piece of advice - don't store this data in individual arrays. Far better to store it in a more complex data structure. I'd suggest a hash where the key is the letter and the value is an array containing all of the numbers matching that letter. This is surprisingly easy to build:
use strict;
use warnings;
use feature 'say';
my %data; # I'd give this a better name if I knew what your data was
while (<>) {
chomp;
my ($letter, $number) = split; # splits $_ on whitespace by default
push @{ $data{$letter} }, $number;
}
# Walk the hash to see what we've got
for (sort keys %data) {
say "$_ : @{ $data{$_ } }";
}
Change the loop to be something like:
while (my $line = <FILE>)
{
chomp($line);
my @elements = split (" ", $line);
push(@{$seen{$elements[0]}}, $elements[1]);
}
This will create/append a list of each item as it is found, and result in a hash where the keys are the left items, and the values are lists of the right items. You can then process or reassign the values as you wish.
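As a small illustration of processing that structure afterwards, here is a minimal sketch. It keeps the sample data in an in-memory list (an assumption, so the snippet runs on its own), and the names @a_values and @summary are only illustrative.

```perl
use strict;
use warnings;

# Sample data from the question, kept in memory so the sketch is self-contained.
my @lines = ("A 15", "A 20", "A 33", "B 20", "B 45", "C 32", "C 78");

my %seen;
for my $line (@lines) {
    my ($name, $value) = split " ", $line;
    push @{ $seen{$name} }, $value;    # autovivifies the array reference
}

# Copy one list into a plain array if an individual array is really needed.
my @a_values = @{ $seen{A} };

# Build a printable summary, one entry per key.
my @summary = map { "$_ = (" . join(",", @{ $seen{$_} }) . ")" } sort keys %seen;
print "$_\n" for @summary;
```

Each value in %seen is an array reference, so @{ $seen{A} } gives the whole list and $seen{A}[0] a single element.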

Split my output into multiple files

I have the following list in a CSV file, and my goal is to split this list into directories named YYYY-Month based on the date in each row.
NAME99;2018/06/13;12:27:30
NAME01;2018/06/13;13:03:59
NAME00;2018/06/15;11:33:01
NAME98;2018/06/15;12:22:00
NAME34;2018/06/15;16:58:45
NAME17;2018/06/18;15:51:10
NAME72;2018/06/19;10:06:37
NAME70;2018/06/19;12:44:03
NAME77;2018/06/19;16:36:55
NAME25;2018/06/11;16:32:57
NAME24;2018/06/11;16:32:57
NAME23;2018/06/11;16:37:15
NAME01;2018/06/11;16:37:15
NAME02;2018/06/11;16:37:15
NAME01;2018/06/11;16:37:18
NAME02;2018/06/05;09:51:17
NAME00;2018/06/13;15:04:29
NAME07;2018/06/19;10:02:26
NAME08;2018/06/26;16:03:57
NAME09;2018/06/26;16:03:57
NAME02;2018/06/27;16:58:12
NAME03;2018/07/03;07:47:21
NAME21;2018/07/03;10:53:00
NAMEXX;2018/07/05;03:13:01
NAME21;2018/07/05;15:39:00
NAME01;2018/07/05;16:00:14
NAME00;2018/07/08;11:50:10
NAME07;2018/07/09;14:46:00
What is the smartest method to achieve this result without having to create a list of static routes, in which to carry out the append?
Currently my program writes this list to a directory called YYYY-Month only on the basis of localtime but does not do anything on each line.
Perl
#!/usr/bin/perl
use strict;
use warnings 'all';
use feature qw(say);
use File::Path qw<mkpath>;
use File::Spec;
use File::Copy;
use POSIX qw<strftime>;
my $OUTPUT_FILE = 'output.csv';
my $OUTFILE = 'splitted_output.csv';
# Output to file
open( GL_INPUT, $OUTPUT_FILE ) or die $!;
$/ = "\n\n"; # input record separator
while ( <GL_INPUT> ) {
chomp;
my @lines = split /\n/;
my $i = 0;
foreach my $lines ( @lines ) {
# Encapsulate Date/Time
my ( $name, $y, $m, $d, $time ) =
$lines[$i] =~ /\A(\w+);(\d+)\/(\d+)\/(\d+);(\d+:\d+:\d+)/;
# Generate Directory YYYY-Month - #2009-January
my $dir = File::Spec->catfile( $BASE_LOG_DIRECTORY, "$y-$m" ) ;
unless ( -e $dir ) {
mkpath $dir;
}
my $log_file_path = File::Spec->catfile( $dir, $OUTFILE );
open( OUTPUT, '>>', $log_file_path ) or die $!;
# Here I append value into files
print OUTPUT join ';', "$y/$m/$d", $time, "$name\n";
$i++;
}
}
close( GL_INPUT );
close( OUTPUT );
There is no reason to care about the actual date, or to use date functions at all here. You want to split up your data based on a partial value of one of the columns in the data. That just happens to be the date.
NAME08;2018/06/26;16:03:57 # This goes to 2018-06/
NAME09;2018/06/26;16:03:57 #
NAME02;2018/06/27;16:58:12 #
NAME03;2018/07/03;07:47:21 # This goes to 2018-07/
NAME21;2018/07/03;10:53:00 #
NAMEXX;2018/07/05;03:13:01 #
NAME21;2018/07/05;15:39:00 #
The easiest way to do this is to iterate your input data, then stick it into a hash with keys for each year-month combination. But you're talking about log files, and they might be large, so that's inefficient.
We should work with different file handles instead.
use strict;
use warnings;
my %months = ( '06' => 'June', '07' => 'July' ); # keys must match the two-digit month in the data
my %handles;
while (my $row = <DATA>) {
# no chomp, we don't actually care about reading the whole row
my (undef, $dir) = split /;/, $row; # discard name and everything after date
# create the YYYY-Month key, dropping the day
$dir =~ s[^(....)/(..).*][$1-$months{$2}];
# open a new handle for this year/month if we don't have it yet
unless (exists $handles{$dir}) {
# create the directory (skipped here) ...
open my $fh, '>', "$dir/filename.csv" or die $!;
$handles{$dir} = $fh;
}
# write out the line to the correct directory
print { $handles{$dir} } $row;
}
__DATA__
NAME08;2018/06/26;16:03:57
NAME09;2018/06/26;16:03:57
NAME02;2018/06/27;16:58:12
NAME03;2018/07/03;07:47:21
NAME21;2018/07/03;10:53:00
NAMEXX;2018/07/05;03:13:01
NAME21;2018/07/05;15:39:00
I've skipped the part about creating the directory as you already know how to do this.
This code will also work if your rows of data are not sequential. It's not the most efficient, as the number of open handles grows with the number of distinct months in your data, but as long as you don't have hundreds of them open at the same time that does not really matter.
Things of note:
You don't need chomp because you don't care about working with the last field.
You don't need to assign all of the values after split because you don't care about them.
You can discard values by assigning them to undef.
Always use three-argument open and lexical file handles.
The {} in print { ... } $row are needed to tell Perl that this is the handle we are printing to. See http://perldoc.perl.org/functions/print.html.
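The directory step that was skipped could look roughly like the following. This is a sketch rather than part of the answer's code: it uses make_path from the core File::Path module, writes under a temporary base directory created with File::Temp (an assumption, so that running it leaves nothing behind), uses numeric YYYY-MM directory names for brevity, and the file name split.csv is hypothetical.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;
use File::Temp qw(tempdir);

# Assumption: a throwaway base directory; replace with your real log directory.
my $base = tempdir( CLEANUP => 1 );

my @rows = (
    "NAME08;2018/06/26;16:03:57\n",
    "NAME03;2018/07/03;07:47:21\n",
    "NAME02;2018/06/27;16:58:12\n",
);

my %handles;
for my $row (@rows) {
    my (undef, $date) = split /;/, $row;
    my ($year, $month) = $date =~ m{^(\d{4})/(\d{2})} or next;    # skip malformed rows
    my $dir = File::Spec->catdir( $base, "$year-$month" );

    unless ( exists $handles{$dir} ) {
        make_path($dir);    # no error if the directory already exists
        my $path = File::Spec->catfile( $dir, 'split.csv' );    # hypothetical name
        open my $fh, '>>', $path or die "Cannot open $path: $!";
        $handles{$dir} = $fh;
    }
    print { $handles{$dir} } $row;
}

# Close explicitly so that write errors are reported.
for my $dir ( sort keys %handles ) {
    close $handles{$dir} or die "Cannot close file in $dir: $!";
}
```

Opening in append mode ('>>') means a second run adds to existing files instead of truncating them; whether that is wanted depends on how the split is rerun.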

Count the number of items derived from split without putting into an array

I am looking to spare the use of an array for memory's sake, but still get the number of items derived from the split function for each pass of a while loop.
The ultimate goal is to filter the output files according to the number of their sequences, which could be deduced from the number of rows in the file, the number of ">" characters that appear, the number of line breaks, etc.
Below is my code:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
open(INFILE, "<", "Clustered_Barcodes.txt") or die $!;
my %hash = (
"TTTATGC" => "TATAGCGCTTTATGCTAGCTAGC",
"TTTATGG" => "TAGCTAGCTTTATGGGCTAGCTA",
"TTTATCC" => "GCTAGCTATTTATCCGCTAGCTA",
"TTTATCG" => "AGTCATGCTTTATCGCGATCGAT",
"TTTATAA" => "TAGCTAGCTTTATAATAGCTAGC",
"TTTATAA" => "ATCGATCGTTTATAACGATCGAT",
"TTTATAT" => "TCGATCGATTTATATTAGCTAGC",
"TTTATAT" => "TAGCTAGCTTTATATGCTAGCTA",
"TTTATTA" => "GCTAGCTATTTATTATAGCTAGC",
"CTTGTAA" => "ATCGATCGCTTGTAACGATTAGC",
);
while(my $line = <INFILE>){
chomp $line;
open my $out, '>', "Clustered_Barcode_$..txt" or die $!;
foreach my $sequence (split /\t/, $line){
if (exists $hash{$sequence}){
print $out ">$sequence\n$hash{$sequence}\n";
}
}
}
The input file, "Clustered_Barcodes.txt" when opened, looks like the following:
TTTATGC TTTATGG TTTATCC TTTATCG
TTTATAA TTTATAA TTTATAT TTTATAT TTTATTA
CTTGTAA
There will be three output files from the code, "Clustered_Barcode_1.txt", "Clustered_Barcode_2.txt", and "Clustered_Barcode_3.txt". An example of what the output files would look like could be the 3rd and final file, which would look like the following:
>CTTGTAA
ATCGATCGCTTGTAACGATTAGC
I need some way to modify my code to identify the number of rows, ">" characters, or sequences that appear in the file and work that into the name of the file. The new name for the above sequence could be something like "Clustered_Barcode_Number_3_1_Sequence.txt"
PS- I made the hash in the above code manually in attempt to make things simpler. If you want to see the original code, here it is. The input file format is something like:
>TAGCTAGC
GCTAAGCGATGCTACGGCTATTAGCTAGCCGGTA
Here is the code for setting up the hash:
my $dir = ("~/Documents/Sequences");
open(INFILE, "<", "~/Documents/Clustered_Barcodes.txt") or die $!;
my %hash = ();
my @ArrayofFiles = glob "$dir/*"; #put all files from the specified directory into an array
#print join("\n", @ArrayofFiles), "\n"; #this is a diagnostic test print statement
foreach my $file (@ArrayofFiles){ #make hash of barcodes and sequences
open (my $sequence, $file) or die "can't open file: $!";
while (my $line = <$sequence>) {
if ($line !~/^>/){
my $seq = $line;
$seq =~ s/\R//g;
#print $seq;
$seq =~ m/(CATCAT|TACTAC)([TAGC]{16})([TAGC]+)([TAGC]{16})(CATCAT|TACTAC)/;
$hash{$2} = $3;
}
}
}
while(<INFILE>){
etc
You can use regex to get the count:
my $delimiter = "\t";
my $line = "zyz\tpqr\tabc\txyz";
my $count = () = $line =~ /$delimiter/g; # $count is now 3
print $count;
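The empty list in the middle forces the match into list context, and assigning that list to a scalar yields the number of matches. Since a non-empty line without trailing delimiters has one more field than it has delimiters, the field count can be derived without building an array. A minimal sketch, where count_fields is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: number of tab-separated fields in a line,
# computed from the delimiter count instead of building an array.
sub count_fields {
    my ($line) = @_;
    return 0 if $line eq '';
    my $tabs = () = $line =~ /\t/g;    # list assignment in scalar context = match count
    return $tabs + 1;
}

# The three lines of the sample input, assuming they are tab-separated.
my @counts = map { count_fields($_) } (
    "TTTATGC\tTTTATGG\tTTTATCC\tTTTATCG",
    "TTTATAA\tTTTATAA\tTTTATAT\tTTTATAT\tTTTATTA",
    "CTTGTAA",
);
print "@counts\n";
```

For a single-character delimiter, `my $tabs = ($line =~ tr/\t//);` counts the same thing without the regex engine.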
Your hash structure is not right for your problem, as you have multiple entries for the same IDs. For example, the key TTTATAA has two entries in your %hash, and the second silently overwrites the first.
To solve this, use a hash of arrays.
Change your hash creation code from
$hash{$2} = $3;
to
push(@{$hash{$2}}, $3);
Now change your code in the while loop
while(my $line = <INFILE>){
chomp $line;
open my $out, '>', "Clustered_Barcode_$..txt" or die $!;
my %id_list;
foreach my $sequence (split /\t/, $line){
$id_list{$sequence}=1;
}
foreach my $sequence(keys %id_list)
{
foreach my $val (@{$hash{$sequence}})
{
print $out ">$sequence\n$val\n";
}
}
}
I have assumed that:
The first digit in the output file name is the input file line number
The second digit in the output file name is the input file column number
That the input hash is a hash of arrays to cover the case of several sequences "matching" the one barcode as mentioned in the comments
When a barcode has a match in the hash, the output file will list all the sequences in the array, one per line.
The simplest way to do this that I can see is to build the output file using a temporary filename and then rename it when you have all the data. According to the Perl Cookbook, the easiest way to create temporary files is with the module File::Temp.
The key to this solution is to move through the list of barcodes that appear on a line by column index rather than the usual Perl way of simply iterating over the list itself. To get the actual barcodes, the column number $col is used to index back into @barcodes, which is created by splitting the line on whitespace. (Note that splitting on a single space is special-cased by Perl to emulate the behaviour of one of its predecessors, awk: leading whitespace is removed and the split is on runs of whitespace, not a single space.)
This way we have the column number (indexed from 1) and the line number we can get from the perl special variable, $. We can then use these to rename the file using the builtin, rename().
use warnings;
use strict;
use diagnostics;
use feature 'say'; # needed for say() below
use File::Temp qw(tempfile);
open(INFILE, "<", "Clustered_Barcodes.txt") or die $!;
my %hash = (
"TTTATGC" => [ "TATAGCGCTTTATGCTAGCTAGC" ],
"TTTATGG" => [ "TAGCTAGCTTTATGGGCTAGCTA" ],
"TTTATCC" => [ "GCTAGCTATTTATCCGCTAGCTA" ],
"TTTATCG" => [ "AGTCATGCTTTATCGCGATCGAT" ],
"TTTATAA" => [ "TAGCTAGCTTTATAATAGCTAGC", "ATCGATCGTTTATAACGATCGAT" ],
"TTTATAT" => [ "TCGATCGATTTATATTAGCTAGC", "TAGCTAGCTTTATATGCTAGCTA" ],
"TTTATTA" => [ "GCTAGCTATTTATTATAGCTAGC" ],
"CTTGTAA" => [ "ATCGATCGCTTGTAACGATTAGC" ]
);
my $cbn = "Clustered_Barcode_Number";
my $trailer = "Sequence.txt";
while (my $line = <INFILE>) {
chomp $line ;
my $line_num = $. ;
my @barcodes = split " ", $line ;
for my $col ( 1 .. @barcodes ) {
my $barcode = $barcodes[ $col - 1 ]; # arrays indexed from 0
# skip this one if its not in the hash
next unless exists $hash{$barcode} ;
my @sequences = @{ $hash{$barcode} } ;
# Have a hit - create temp file and output sequences
my ($out, $temp_filename) = tempfile();
say $out ">$barcode" ;
say $out $_ for (@sequences) ;
close $out ;
# Rename based on input line and column
my $new_name = join "_", $cbn, $line_num, $col, $trailer ;
rename ($temp_filename, $new_name) or
warn "Couldn't rename $temp_filename to $new_name: $!\n" ;
}
}
close INFILE;
All of the barcodes in your sample input data have a match in the hash, so when I run this, I get 4 files for line 1, 5 for line 2 and 1 for line 3.
Clustered_Barcode_Number_1_1_Sequence.txt
Clustered_Barcode_Number_1_2_Sequence.txt
Clustered_Barcode_Number_1_3_Sequence.txt
Clustered_Barcode_Number_1_4_Sequence.txt
Clustered_Barcode_Number_2_1_Sequence.txt
Clustered_Barcode_Number_2_2_Sequence.txt
Clustered_Barcode_Number_2_3_Sequence.txt
Clustered_Barcode_Number_2_4_Sequence.txt
Clustered_Barcode_Number_2_5_Sequence.txt
Clustered_Barcode_Number_3_1_Sequence.txt
Clustered_Barcode_Number_1_2_Sequence.txt for example has:
>TTTATGG
TAGCTAGCTTTATGGGCTAGCTA
and Clustered_Barcode_Number_2_5_Sequence.txt has:
>TTTATTA
GCTAGCTATTTATTATAGCTAGC
Clustered_Barcode_Number_2_3_Sequence.txt - which matched a hash key with two sequences - had the following;
>TTTATAT
TCGATCGATTTATATTAGCTAGC
TAGCTAGCTTTATATGCTAGCTA
I was speculating here about what you wanted when a supplied barcode had two matches. Hope that helps.

replace 4th column from the last and also pick unique value from 3rd column at the same time

I have two pipe-delimited files.
First file:
It has around 10 columns, but I am interested only in the first two, which are used to update a column value in the second file.
First file detail:
1|alpha|s3.3|4|6|7|8|9
2|beta|s3.3|4|6|7|8|9
20|charlie|s3.3|4|6|7|8|9
6|romeo|s3.3|4|6|7|8|9
Second file detail:
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**1**|a10|a11|a12
a1|a2|**ray**|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|**kate**|a3|a4|a5|a6|a7|a8|**20**|a10|a11|a12
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**6**|a10|a11|a12
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**45**|a10|a11|a12
My requirement is to find the unique values in the 3rd column and also to replace the 4th column from the last. The 4th column from the last may or may not contain a number. If it does, that number also appears in the first field of the first file, and I need to replace it (in the second file) with the corresponding value from the second column of the first file.
expected output:
unique string : ray kate bob
a1|a2|bob|a3|a4|a5|a6|a7|a8|**alpha**|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|**charlie**|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|**romeo**|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
I am able to pick the unique string using below command
awk -F'|' '{a[$3]++}END{for(i in a){print i}}' filename
I don't want to read the second file twice (first to pick the unique strings and then again to replace the 4th column from the last), because the file is huge: around 500 MB, and there are many such files.
Currently I am using Perl (the Text::CSV module) to read the first file (which is small) and load the first two columns into a hash, with the first column as key and the second as value. I then read the second file and replace the n-4 column with the hash value. But this is time-consuming, as Text::CSV parsing seems to be slow.
Any awk/Perl solution keeping speed in mind would be really helpful :)
Note: Ignore the ** asterisks around the text; they are only there for highlighting and are not part of the data.
UPDATE : Code
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ sep_char => '|' });
my $file = $ARGV[0] or die "Need to get CSV file on the command line\n";
open(my $data, '<', $file) or die "Could not open '$file' $!\n";
while (my $line = <$data>) {
chomp $line;
if ($csv->parse($line)) {
my @fields = $csv->fields();
$hash{$fields[0]}=$fields[1];
} else {
warn "Line could not be parsed: $line\n";
}
}
close($data);
$csv = Text::CSV->new({ sep_char => '|' , blank_is_undef => 1 , eol => "\n"});
my $file2 = $ARGV[1] or die "Need to get CSV file on the command line\n";
open ( my $fh,'>','/tmp/outputfile') or die "Could not open file $!\n";
open(my $data2, '<', $file2) or die "Could not open '$file' $!\n";
while (my $line = <$data2>) {
chomp $line;
if ($csv->parse($line)) {
my @fields = $csv->fields();
if (defined ($fields[-4]) && looks_like_number($fields[-4]))
{
$fields[-4]=$hash{$fields[-4]};
}
$csv->print($fh,\@fields);
} else {
warn "Line could not be parsed: $line\n";
}
}
close($data2);
close($fh);
Here's an option that doesn't use Text::CSV:
use strict;
use warnings;
@ARGV == 3 or die 'Usage: perl firstFile secondFile outFile';
my ( %hash, %seen );
local $" = '|';
while (<>) {
my ( $key, $val ) = split /\|/, $_, 3;
$hash{$key} = $val;
last if eof;
}
open my $outFH, '>', pop or die $!;
while (<>) {
my @F = split /\|/;
$seen{ $F[2] } = undef;
$F[-4] = $hash{ $F[-4] } if exists $hash{ $F[-4] };
print $outFH "@F";
}
close $outFH;
print 'unique string : ', join( ' ', reverse sort keys %seen ), "\n";
Command-line usage: perl firstFile secondFile outFile
Contents of outFile from your datasets (asterisks removed):
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
STDOUT:
unique string : ray kate bob
Hope this helps!
Use getline instead of parse, it is much faster. The following is a more idiomatic way of performing this task. Note that you can reuse the same Text::CSV object for multiple files.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Text::CSV;
my $csv = Text::CSV->new({
auto_diag => 1,
binary => 1,
blank_is_undef => 1,
eol => $/,
sep_char => '|'
}) or die "Can't use CSV: " . Text::CSV->error_diag;
open my $map_fh, '<', 'map.csv' or die "map.csv: $!";
my %mapping;
while (my $row = $csv->getline($map_fh)) {
$mapping{ $row->[0] } = $row->[1];
}
close $map_fh;
open my $in_fh, '<', 'input.csv' or die "input.csv: $!";
open my $out_fh, '>', 'output.csv' or die "output.csv: $!";
my %seen;
while (my $row = $csv->getline($in_fh)) {
$seen{ $row->[2] } = 1;
my $key = $row->[-4];
$row->[-4] = $mapping{$key} if defined $key and exists $mapping{$key};
$csv->print($out_fh, $row);
}
close $in_fh;
close $out_fh;
say join ',', keys %seen;
map.csv
1|alpha|s3.3|4|6|7|8|9
2|beta|s3.3|4|6|7|8|9
20|charlie|s3.3|4|6|7|8|9
6|romeo|s3.3|4|6|7|8|9
input.csv
a1|a2|bob|a3|a4|a5|a6|a7|a8|1|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|20|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|6|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
output.csv
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
STDOUT
kate,bob,ray
This awk should work.
$ awk '
BEGIN { FS = OFS = "|" }
NR==FNR { a[$1] = $2; next }
{ !unique[$3]++ }
{ $(NF-3) = (a[$(NF-3)]) ? a[$(NF-3)] : $(NF-3) }1
END {
for(n in unique) print n > "unique.txt"
}' file1 file2 > output.txt
Explanation:
We set the input and output field separators to |.
We iterate through the first file, creating an array that stores column one as the key and column two as the value.
Once the first file is loaded in memory, we create another array while reading the second file. This array stores the unique values from column three of the second file.
While reading the file, we check whether the fourth value from the last is present in our array from the first file. If it is, we replace it with the value from the array. If not, we leave the existing value as is.
In the END block we iterate through our unique array and print it to a file called unique.txt. This holds all the unique entries seen in column three of the second file.
The entire output of the second file is redirected to output.txt, which now has the modified fourth column from the last.
$ cat output.txt
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
$ cat unique.txt
kate
bob
ray

Reading the next line in the file and keeping counts separate

Another question for everyone. To reiterate I am very new to the Perl process and I apologize in advance for making silly mistakes
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have set up the counters and can read each line of DNA sequence, but at the moment the script keeps a running summation of the totals across all lines. I want it to read each sequence, print the base content after that sequence has been read, and then move on to the next one, with individual base counts for each line.
This is what I have so far.
#!/usr/bin/perl
#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";
#establishing the intial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;
#initial loop created to read through each line
while ( my $line = <INFILE> ) {
chomp $line;
# reads file until the ">" character is encounterd and prints the line
if ($line =~ /^>/){
print OUTFILE "Gene: $line\n";
}
# otherwise count the content of the next line.
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
my @array = split //, $line;
my $array= (@array);
# reset the counts of each variable
$G = ();
$C = ();
$A = ();
$T = ();
foreach $array (@array){
#if statements asses which base is present and makes a running total of the bases.
if ($array eq 'G'){
++$G;
}
elsif ( $array eq 'C' ) {
++$C; }
elsif ( $array eq 'A' ) {
++$A; }
elsif ( $array eq 'T' ) {
++$T; }
}
# all is printed to the outfile
print OUTFILE "G:$G\n";
print OUTFILE "C:$C\n";
print OUTFILE "A:$A\n";
print OUTFILE "T:$T\n";
print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
}
}
#close the outfile and the infile
close OUTFILE;
close INFILE;
Again I feel like I am on the right path, I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the final counts printed out. My percent values are wrong and give me the wrong value. I feel like the total is being calculated then that new value is incorporated into the total.
Several things:
1. Use a hash instead of declaring a separate variable for each base.
2. An assignment such as $G = (0); works, but it is not the idiomatic way to assign a scalar. The parentheses do not create an array; the expression inside them is simply evaluated and its value assigned to $G. The correct way is $G = 0.
my %seen;
# count how often each header line (">name") occurs
$seen{$_}++ for map { /^>\s*(.+)/ ? $1 : () } <INFILE>;
foreach my $gene (sort keys %seen) {
print "$gene: $seen{$gene}\n";
}
Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;
my %counts;
while (<>) {
if (/^>/) {
# print counts for the prev gene if there are counts:
print_counts(\%counts) if keys %counts;
%counts = (); # reset the counts
print $_; # print the Fasta header
} else {
chomp;
$counts{$_}++ for split //;
}
}
print_counts(\%counts) if keys %counts; # print counts for last gene
sub print_counts {
my ($counts) = @_;
print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
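The counts hash also makes the percentages from the question straightforward. The sketch below is only an assumption about what the desired report looks like, and gc_percent is a hypothetical helper. Note that it adds the counts with plain +. The += in the question's print statements modifies the counters while they are being summed, which is why the totals there came out wrong.

```perl
use strict;
use warnings;

# Hypothetical helper: GC percentage from a hash reference of base counts.
sub gc_percent {
    my ($counts) = @_;
    my $total = 0;
    $total += $counts->{$_} || 0 for qw(A C G T);
    return 0 unless $total;    # avoid division by zero for empty sequences
    my $gc = ( $counts->{G} || 0 ) + ( $counts->{C} || 0 );
    return 100 * $gc / $total;
}

# The sample sequence from the question.
my $sequence = "ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA";

my %counts;
$counts{$_}++ for split //, $sequence;

printf "Total length: %d base pairs\n", length $sequence;
printf "GC content: %.1f%%\n", gc_percent( \%counts );
```

In the script above, this call would go inside print_counts or right next to it, so that the percentage is reported once per gene before the counts are reset.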
Style comments:
When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.
Your indentation is a bit weird. It is very unusual and visually confusing to put closing braces on the same line with another statement like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
$C++;
}
This statement my $array= (@array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach-loops, like for my $base (@array) { ... }.