Parsing CSV files and matching

Parsing CSV files and matching - perl

I have 10 folders and in each folder, i have two files (CSV,comma delimeted) in the following formats.
File 1:
Ensembl Gene ID,Ensembl Transcript ID,Exon Chr Start (bp),Exon Chr End (bp),Exon Rank in Transcript, Transcript count,Gene End (bp) ,Gene Start (bp),Strand
ENSG00000271782,ENST00000607815,50902700,50902978,1,1,50902978,50902700,-1
ENSG00000232753,ENST00000424955,103817769,103817825,1,1,103828355,103817769,1
ENSG00000232753,ENST00000424955,103827995,103828355,2,1,103828355,103817769,1
ENSG00000225767,ENST00000424664,50927141 50927168,1,1,50936822,50927141,1
File 2:
number,Start pos,End Pos
1,41035,41048
3,36738,36751
3,38169,38182
3,40264,40277
I am trying to match the second file to firstfile
The number in colum1 of second file is the key record number in first file.
Extract the last 3 colums from first file
where the output needed is :
1,ENSG00000271782,41035,41048,50902978,50902700,-1
3,ENSG00000225767,36738,36751,50936822,50927141,1
3,ENSG00000225767,38169,38182,50936822,50927141,1
3,ENSG00000225767,40264,40277,50936822,50927141,1
I have started reading from second using TexT::CSV, but need help.
use strict;
use warnings;
use lib 'C:/Perl/lib';
use Text::CSV;
my $file1 = "infile1";
open my $fh, "<", $file1 or die "$file1: $!";
my $file2 = "infile2"
open my $fh2, "<", $file2 or die "$file2: $!";
my $csv = Text::CSV->new ({
binary => 1,
auto_diag => 1,
});
while (my $row = $csv->getline ($fh2)) {
print "#$row\n"; # I am stuck in extraction ? do I need to put another while loop for fh1
}
close $fh1;
close $fh2;

Since there are no commas enclosed within double-quotes, you can just split on the commas instead of using Text::CSV (which is an excellent module). Given this, the following produces the output you want:
use strict;
use warnings;
use autodie;
my ( $num, %hash ) = 0;
my ( $file1, $file2 ) = qw/inFile1 inFile2/;
open my $fh1, '<', $file1;
while (<$fh1>) {
next if $. == 1;
chomp;
my #fields = split /,/;
$num++ if !$hash{ $fields[0] }++;
push #{ $hash{$num} }, [ #fields[ 0, 6 .. 8 ] ];
}
close $fh1;
open my $fh2, '<', $file2;
while (<$fh2>) {
next if $. == 1;
chomp;
my #fields = split /,/;
if ( my #arr = #{ $hash{ $fields[0] }->[0] } ) {
splice #arr, 1, 0, #fields[ 1, 2 ];
print join( ',', $fields[0], #arr ), "\n";
}
}
close $fh2;
This uses a hash to: 1) keep track of seen Gene IDs, and 2) build a hash of array of arrays (HoAoA). The count--your "key record"--is incremented on unique Gene IDs, so #1 keeps track of these IDs to insure that $num is incremented only if the Gene ID hasn't yet appeared. Number 2 (HoAoA) is used because there are multiple instances of the same Gene ID, but only the values from the first instance are used in the printing. (I did note, however, that the second skips #2, which is the multiple-instance Gene ID.) Perhaps you only needed a hash of arrays (HoA), but it works well the way it is--or you can just modify it, as needed. That is, if you aren't going to use the multiple Gene ID info, the code could be simplified.
Output on your datasets:
1,ENSG00000271782,41035,41048,50902978,50902700,-1
3,ENSG00000225767,36738,36751,50936822,50927141,1
3,ENSG00000225767,38169,38182,50936822,50927141,1
3,ENSG00000225767,40264,40277,50936822,50927141,1
Hope this helps!

The interesting part of this problem is that you need logic that read file 1 until it's ahead of file 2 and logic that reads file 2 until it's ahead of file 1 and logic to know how to act when one is behind the other, and when they are in balance.
You'll need to track unique gene ensemble ids, and their ordinal position in the list. So that when you read the second line of file2, you'll know how to skip the second and third line of file1, but also know not to skip and more in file1 when you've read the third and fourth line in file 1.
Or you can read file1 into memory, and create an array of arrays of lines, so that e.g.
file1arr[1] = [ $line1 ]
file1arr[2] = [ $line2, $line3 ]
file1arr[3] = [ $line4 ]
so when you loop over file 2, all the lines from file1 are in a neat little arrayref at the array index corresponding to the number column of file2.
Then it's just an exercise in iterating of the array of file1 lines, splitting them and building your output lines.

Related

Perl: Read columns and convert to array

I am new to perl, trying to read a file with columns and creating an array.
I am having a file with following columns.
file.txt
A 15
A 20
A 33
B 20
B 45
C 32
C 78
I wanted to create an array for each unique item present in A with its values assigned from second column.
eg:
#A = (15,20,33)
#B = (20,45)
#C = (32,78)
Tried following code, only for printing 2 columns
use strict;
use warnings;
my $filename = $ARGV[0];
open(FILE, $filename) or die "Could not open file '$filename' $!";
my %seen;
while (<FILE>)
{
chomp;
my $line = $_;
my #elements = split (" ", $line);
my $row_name = join "\t", #elements[0,1];
print $row_name . "\n" if ! $seen{$row_name}++;
}
close FILE;
Thanks

Firstly some general Perl advice. These days, we like to use lexical variables as filehandles and pass three arguments to open().
open(my $fh, '<', $filename) or die "Could not open file '$filename' $!";
And then...
while (<$fh>) { ... }
But, given that you have your filename in $ARGV[0], another tip is to use an empty file input operator (<>) which will return data from the files named in #ARGV without you having to open them. So you can remove your open() line completely and replace the while with:
while (<>) { ... }
Second piece of advice - don't store this data in individual arrays. Far better to store it in a more complex data structure. I'd suggest a hash where the key is the letter and the value is an array containing all of the numbers matching that letter. This is surprisingly easy to build:
use strict;
use warnings;
use feature 'say';
my %data; # I'd give this a better name if I knew what your data was
while (<>) {
chomp;
my ($letter, $number) = split; # splits $_ on whitespace by default
push #{ $data{$letter} }, $number;
}
# Walk the hash to see what we've got
for (sort keys %data) {
say "$_ : #{ $data{$_ } }";
}

Change the loop to be something like:
while (my $line = <FILE>)
{
chomp($line);
my #elements = split (" ", $line);
push(#{$seen{$elements[0]}}, $elements[1]);
}
This will create/append a list of each item as it is found, and result in a hash where the keys are the left items, and the values are lists of the right items. You can then process or reassign the values as you wish.

perl read file from specified line thru the end

I'm new with perl. I'm trying to read a large comma separate file, split and grab only some columns. I could create it with some internet help, but I'm struggling to change to code to start reading from a specific line thru the end of the file.
my need is open file start reading on line 12, split ',' grab column 0,2,10,11 and concatenate those needed columns with '\t'.
here is my code
#!/usr/bin/perl
my $filename = 'file_to_read.csv';
open(FILER, $filename) or die "Could not read $filename.";
open(FILEW, ">$filename.txt") || die "couldn't create the file\n";
while(<FILER>) {
chomp;
my #fields = split(',', $_);
print FILEW "$fields[0]\t$fields[3]\t$fields[10]\t$fields[11]\n";
}
close FILER;
close FILEW;
here is the file example:
[Header]
GSGT Version: X
Processing Date:12/01/2010 7:20 PM
Content:
Num SNPs:
Total SNPs:
Num Samples:
Total Samples:
Sample:
[Data]
SNP Name,Chromosome,Pos,GC Score,Theta,R,X,Y,X Raw,Y Raw,B Allele Freq,Log R Ratio,Allele1 - TOP,Allele2 - TOP
1:10001102-G-T,1,10001102,0.4159,0.007,0.477,0.472,0.005,6281,126,0.0000,-0.2581,A,A
1:100011159-T-G,1,100011159,0.4259,0.972,0.859,0.036,0.822,807,3648,0.9942,-0.0304,C,C
1:10002775-GA,1,10002775,0.4234,0.977,1.271,0.043,1.228,809,5140,0.9892,0.0111,G,G

Rather than skipping until a specific line number, which may vary from file to file, it is best to keep track of the current section of the file marked by [Header], [Data] etc.
This solution keeps a state variable $section which is updated to the current section name every time a [Section] label is encountered in the file. Everything from the Data section is summarised and printed
A similar thing could be done with the column headers, using names instead of numbers to select the fields to be output, but I have chosen to keep the complexity down instead
use strict;
use warnings 'all';
use feature 'say';
my $filename = 'file_to_read.csv';
open my $fh, '<', $filename or die qq{Unable to open "$filename" for input: $!};
my $section = "";
while ( <$fh> ) {
next unless /\S/; # Skip empty lines
if ( $section eq 'Data' ) { # Skip unless we're in the [Data] section
chomp;
my #fields = split /,/;
say join ',', #fields[0,3,10,11];
}
elsif ( /\[(\w+)\]/ ) {
$section = $1;
}
}
output
SNP Name,GC Score,B Allele Freq,Log R Ratio
1:10001102-G-T,0.4159,0.0000,-0.2581
1:100011159-T-G,0.4259,0.9942,-0.0304
1:10002775-GA,0.4234,0.9892,0.0111

please assign a variable to count the lines processed like my $line_count = 0;
and inside the beginning of while loop increment the varialbe $line_count++;
and skip if the line count is below 12 ie , next if $line_count > 12;

Count the number of items derived from split without putting into an array

I am looking to spare the use of an array for memory's sake, but still get the number of items derived from the split function for each pass of a while loop.
The ultimate goal is to filter the output files according to the number of their sequences, which could either be deduced by the number of rows the file has, or the number of carrots that appear, or the number of line breaks, etc.
Below is my code:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
open(INFILE, "<", "Clustered_Barcodes.txt") or die $!;
my %hash = (
"TTTATGC" => "TATAGCGCTTTATGCTAGCTAGC",
"TTTATGG" => "TAGCTAGCTTTATGGGCTAGCTA",
"TTTATCC" => "GCTAGCTATTTATCCGCTAGCTA",
"TTTATCG" => "AGTCATGCTTTATCGCGATCGAT",
"TTTATAA" => "TAGCTAGCTTTATAATAGCTAGC",
"TTTATAA" => "ATCGATCGTTTATAACGATCGAT",
"TTTATAT" => "TCGATCGATTTATATTAGCTAGC",
"TTTATAT" => "TAGCTAGCTTTATATGCTAGCTA",
"TTTATTA" => "GCTAGCTATTTATTATAGCTAGC",
"CTTGTAA" => "ATCGATCGCTTGTAACGATTAGC",
);
while(my $line = <INFILE>){
chomp $line;
open my $out, '>', "Clustered_Barcode_$..txt" or die $!;
foreach my $sequence (split /\t/, $line){
if (exists $hash{$sequence}){
print $out ">$sequence\n$hash{$sequence}\n";
}
}
}
The input file, "Clustered_Barcodes.txt" when opened, looks like the following:
TTTATGC TTTATGG TTTATCC TTTATCG
TTTATAA TTTATAA TTTATAT TTTATAT TTTATTA
CTTGTAA
There will be three output files from the code, "Clustered_Barcode_1.txt", "Clustered_Barcode_2.txt", and "Clustered_Barcode_3.txt". An example of what the output files would look like could be the 3rd and final file, which would look like the following:
>CTTGTAA
ATCGATCGCTTGTAACGATTAGC
I need some way to modify my code to identify the number of rows, carrots, or sequences that appear in the file and work that into the title of the file. The new title for the above sequence could be something like "Clustered_Barcode_Number_3_1_Sequence.txt"
PS- I made the hash in the above code manually in attempt to make things simpler. If you want to see the original code, here it is. The input file format is something like:
>TAGCTAGC
GCTAAGCGATGCTACGGCTATTAGCTAGCCGGTA
Here is the code for setting up the hash:
my $dir = ("~/Documents/Sequences");
open(INFILE, "<", "~/Documents/Clustered_Barcodes.txt") or die $!;
my %hash = ();
my #ArrayofFiles = glob "$dir/*"; #put all files from the specified directory into an array
#print join("\n", #ArrayofFiles), "\n"; #this is a diagnostic test print statement
foreach my $file (#ArrayofFiles){ #make hash of barcodes and sequences
open (my $sequence, $file) or die "can't open file: $!";
while (my $line = <$sequence>) {
if ($line !~/^>/){
my $seq = $line;
$seq =~ s/\R//g;
#print $seq;
$seq =~ m/(CATCAT|TACTAC)([TAGC]{16})([TAGC]+)([TAGC]{16})(CATCAT|TACTAC)/;
$hash{$2} = $3;
}
}
}
while(<INFILE>){
etc

You can use regex to get the count:
my $delimiter = "\t";
my $line = "zyz pqr abc xyz";
my $count = () = $line =~ /$delimiter/g; # $count is now 3
print $count;

Your hash structure is not right for your problem as you have multiple entries for same ids. for example TTTATAA hash id has 2 entries in your %hash.
To solve this, use hash of array to create the hash.
Change your hash creation code in
$hash{$2} = $3;
to
push(#{$hash{$2}}, $3);
Now change your code in the while loop
while(my $line = <INFILE>){
chomp $line;
open my $out, '>', "Clustered_Barcode_$..txt" or die $!;
my %id_list;
foreach my $sequence (split /\t/, $line){
$id_list{$sequence}=1;
}
foreach my $sequence(keys %id_list)
{
foreach my $val (#{$hash{$sequence}})
{
print $out ">$sequence\n$val\n";
}
}
}

I have assummed that;
The first digit in the output file name is the input file line number
The second digit in the output file name is the input file column number
That the input hash is a hash of arrays to cover the case of several sequences "matching" the one barcode as mentioned in the comments
When a barcode has a match in the hash, that the output file will lists all the sequences in the array, one per line.
The simplest way to do this that I can see is to build the output file using a temporary filename and the rename it when you have all the data. According to the perl cookbook, the easiest way to create temporary files is with the module File::Temp.
The key to this solution is to move through the list of barcodes that appear on a line by column index rather than the usual perl way of simply iterating over the list itself. To get the actual barcodes, the column number $col is used to index back into #barcodes which is created by splitting the line on whitespace. (Note that splitting on a single space is special cased by perl to emulate the behaviour of one of its predecessors, awk (leading whitespace is removed and the split is on whitespace, not a single space)).
This way we have the column number (indexed from 1) and the line number we can get from the perl special variable, $. We can then use these to rename the file using the builtin, rename().
use warnings;
use strict;
use diagnostics;
use File::Temp qw(tempfile);
open(INFILE, "<", "Clustered_Barcodes.txt") or die $!;
my %hash = (
"TTTATGC" => [ "TATAGCGCTTTATGCTAGCTAGC" ],
"TTTATGG" => [ "TAGCTAGCTTTATGGGCTAGCTA" ],
"TTTATCC" => [ "GCTAGCTATTTATCCGCTAGCTA" ],
"TTTATCG" => [ "AGTCATGCTTTATCGCGATCGAT" ],
"TTTATAA" => [ "TAGCTAGCTTTATAATAGCTAGC", "ATCGATCGTTTATAACGATCGAT" ],
"TTTATAT" => [ "TCGATCGATTTATATTAGCTAGC", "TAGCTAGCTTTATATGCTAGCTA" ],
"TTTATTA" => [ "GCTAGCTATTTATTATAGCTAGC" ],
"CTTGTAA" => [ "ATCGATCGCTTGTAACGATTAGC" ]
);
my $cbn = "Clustered_Barcode_Number";
my $trailer = "Sequence.txt";
while (my $line = <INFILE>) {
chomp $line ;
my $line_num = $. ;
my #barcodes = split " ", $line ;
for my $col ( 1 .. #barcodes ) {
my $barcode = $barcodes[ $col - 1 ]; # arrays indexed from 0
# skip this one if its not in the hash
next unless exists $hash{$barcode} ;
my #sequences = #{ $hash{$barcode} } ;
# Have a hit - create temp file and output sequences
my ($out, $temp_filename) = tempfile();
say $out ">$barcode" ;
say $out $_ for (#sequences) ;
close $out ;
# Rename based on input line and column
my $new_name = join "_", $cbn, $line_num, $col, $trailer ;
rename ($temp_filename, $new_name) or
warn "Couldn't rename $temp_filename to $new_name: $!\n" ;
}
}
close INFILE
All of the barcodes in your sample input data have a match in the hash, so when I run this, I get 4 files for line 1, 5 for line 2 and 1 for line 3.
Clustered_Barcode_Number_1_1_Sequence.txt
Clustered_Barcode_Number_1_2_Sequence.txt
Clustered_Barcode_Number_1_3_Sequence.txt
Clustered_Barcode_Number_1_4_Sequence.txt
Clustered_Barcode_Number_2_1_Sequence.txt
Clustered_Barcode_Number_2_2_Sequence.txt
Clustered_Barcode_Number_2_3_Sequence.txt
Clustered_Barcode_Number_2_4_Sequence.txt
Clustered_Barcode_Number_2_5_Sequence.txt
Clustered_Barcode_Number_3_1_Sequence.txt
Clustered_Barcode_Number_1_2_Sequence.txt for example has:
>TTTATGG
TAGCTAGCTTTATGGGCTAGCTA
and Clustered_Barcode_Number_2_5_Sequence.txt has:
>TTTATTA
GCTAGCTATTTATTATAGCTAGC
Clustered_Barcode_Number_2_3_Sequence.txt - which matched a hash key with two sequences - had the following;
>TTTATAT
TCGATCGATTTATATTAGCTAGC
TAGCTAGCTTTATATGCTAGCTA
I was speculating here about what you wanted when a supplied barcode had two matches. Hope that helps.

replace 4th column from the last and also pick unique value from 3rd column at the same time

I have two files both of them are delimited by pipe.
First file:
has may be around 10 columns but i am interested in first two columns which would useful in updating the column value of the second file.
first file detail:
1|alpha|s3.3|4|6|7|8|9
2|beta|s3.3|4|6|7|8|9
20|charlie|s3.3|4|6|7|8|9
6|romeo|s3.3|4|6|7|8|9
Second file detail:
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**1**|a10|a11|a12
a1|a2|**ray**|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|**kate**|a3|a4|a5|a6|a7|a8|**20**|a10|a11|a12
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**6**|a10|a11|a12
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**45**|a10|a11|a12
My requirement here is to find unique values from 3rd column and also replace the 4th column from the last . The 4th column from the last may/may not have numeric number . This number would be appearing in the first field of first file as well. I need replace (second file )this number with the corresponding value that appears in the second column of the first file.
expected output:
unique string : ray kate bob
a1|a2|bob|a3|a4|a5|a6|a7|a8|**alpha**|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|**charlie**|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|**romeo**|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
I am able to pick the unique string using below command
awk -F'|' '{a[$3]++}END{for(i in a){print i}}' filename
I would dont want to read the second file twice , first to pick the unique string and second time to replace 4th column from the last as the file size is huge. It would be around 500mb and there are many such files.
Currently i am using perl (Text::CSV) module to read the first file ( this file is of small size ) and load the first two columns into a hash , considering first column as key and second as value. then read the second file and replace the n-4 column with hash value. But this seems to be time consuming as Text::CSV parsing seems to be slow.
Any awk/perl solution keeping speed in mind would be really helpful :)
Note: Ignore the ** asterix around the text , they are just to highlight they are not part of the data.
UPDATE : Code
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Utils;
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ sep_char => '|' });
my $file = $ARGV[0] or die "Need to get CSV file on the command line\n";
open(my $data, '<', $file) or die "Could not open '$file' $!\n";
while (my $line = <$data>) {
chomp $line;
if ($csv->parse($line)) {
my #fields = $csv->fields();
$hash{$field[0]}=$field[1];
} else {
warn "Line could not be parsed: $line\n";
}
}
close($data);
my $csv = Text::CSV->new({ sep_char => '|' , blank_is_undef => 1 , eol => "\n"});
my $file2 = $ARGV[1] or die "Need to get CSV file on the command line\n";
open ( my $fh,'>','/tmp/outputfile') or die "Could not open file $!\n";
open(my $data2, '<', $file2) or die "Could not open '$file' $!\n";
while (my $line = <$data2>) {
chomp $line;
if ($csv->parse($line)) {
my #fields = $csv->fields();
if (defined ($field[-4]) && looks_like_number($field[-4]))
{
$field[-4]=$hash{$field[-4]};
}
$csv->print($fh,\#fields);
} else {
warn "Line could not be parsed: $line\n";
}
}
close($data2);
close($fh);

Here's an option that doesn't use Text::CSV:
use strict;
use warnings;
#ARGV == 3 or die 'Usage: perl firstFile secondFile outFile';
my ( %hash, %seen );
local $" = '|';
while (<>) {
my ( $key, $val ) = split /\|/, $_, 3;
$hash{$key} = $val;
last if eof;
}
open my $outFH, '>', pop or die $!;
while (<>) {
my #F = split /\|/;
$seen{ $F[2] } = undef;
$F[-4] = $hash{ $F[-4] } if exists $hash{ $F[-4] };
print $outFH "#F";
}
close $outFH;
print 'unique string : ', join( ' ', reverse sort keys %seen ), "\n";
Command-line usage: perl firstFile secondFile outFile
Contents of outFile from your datasets (asterisks removed):
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
STDOUT:
unique string : ray kate bob
Hope this helps!

Use getline instead of parse, it is much faster. The following is a more idiomatic way of performing this task. Note that you can reuse the same Text::CSV object for multiple files.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Text::CSV;
my $csv = Text::CSV->new({
auto_diag => 1,
binary => 1,
blank_is_undef => 1,
eol => $/,
sep_char => '|'
}) or die "Can't use CSV: " . Text::CSV->error_diag;
open my $map_fh, '<', 'map.csv' or die "map.csv: $!";
my %mapping;
while (my $row = $csv->getline($map_fh)) {
$mapping{ $row->[0] } = $row->[1];
}
close $map_fh;
open my $in_fh, '<', 'input.csv' or die "input.csv: $!";
open my $out_fh, '>', 'output.csv' or die "output.csv: $!";
my %seen;
while (my $row = $csv->getline($in_fh)) {
$seen{ $row->[2] } = 1;
my $key = $row->[-4];
$row->[-4] = $mapping{$key} if defined $key and exists $mapping{$key};
$csv->print($out_fh, $row);
}
close $in_fh;
close $out_fh;
say join ',', keys %seen;
map.csv
1|alpha|s3.3|4|6|7|8|9
2|beta|s3.3|4|6|7|8|9
20|charlie|s3.3|4|6|7|8|9
6|romeo|s3.3|4|6|7|8|9
input.csv
a1|a2|bob|a3|a4|a5|a6|a7|a8|1|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|20|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|6|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
output.csv
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
STDOUT
kate,bob,ray

This awk should work.
$ awk '
BEGIN { FS = OFS = "|" }
NR==FNR { a[$1] = $2; next }
{ !unique[$3]++ }
{ $(NF-3) = (a[$(NF-3)]) ? a[$(NF-3)] : $(NF-3) }1
END {
for(n in unique) print n > "unique.txt"
}' file1 file2 > output.txt
Explanation:
We set the input and output field separators to |.
We iterate through first file creating an array storing column one as key and assigning column two as the value
Once the first file is loaded in memory, we create another array by reading the second file. This array stores the unique values from column three of second file.
While reading the file, we look at the forth value from last to be present in our array from first file. If it is we replace it with the value from array. If not then we leave the existing value as is.
In the END block we iterate through our unique array and print it to a file called unique.txt. This holds all the unique entries seen on column three of second file.
The entire output of the second file is redirected to output.txt which now has the modified forth column from last.
$ cat output.txt
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
$ cat unique.txt
kate
bob
ray

parse a huge text file in perl

I have a text file which is tab separated. They can be quite big upto 1 GB. I will have variable number of columns depending on the number of sample in them. Each sample have eight columns.For example, sampleA : ID1, id2, MIN_A, AVG_A, MAX_A,AR1_A,AR2_A,AR_A,AR_5. Of which the ID1, and id2 are the common to all the samples. What I want to achieve is split the whole file in to chunks of files depending on the number of samples.
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,3535,4545,5656,5656,7675,67567,57758,875,8678,578,57856785,85587,574,56745,567356,675489,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853,457328,3457385,567438,5678934,56845,567348,58567,548948,58649,5839,546847,458274,758345,4572384,4758475,47487
This is how my model file looks, I want to have them as :
File A :
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A
12,134,3535,4545,5656,5656,7675,67567,57758,875
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853
File B:
ID1, ID2,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B
12,134,8678,578,57856785,85587,574,56745,567356,675489
454385,3457,457328,3457385,567438,5678934,56845,567348,58567,548948
File C:
ID1, ID2,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,58649,5839,546847,458274,758345,4572384,4758475,47487.
Is there any easy way of doing this than going thorough an array?
How I have worked out my logic is counting the (number of headers - 2) and dividing them by 8 will give me the number of Samples in the file. And then going through each element in an array and to parse them . Seems to be a tedious way of doing this. I would be happy to know any simpler way of handling this.
Thanks
Sipra

#!/bin/env perl
use strict;
use warnings;
# open three output filehandles
my %fh;
for (qw[A B C]) {
open $fh{$_}, '>', "file$_" or die $!;
}
# open input
open my $in, '<', 'somefile' or die $!;
# read the header line. there are no doubt ways to parse this to
# work out what the rest of the program should do.
<$in>;
while (<$in>) {
chomp;
my #data = split /,/;
print $fh{A} join(',', #data[0 .. 9]), "\n";
print $fh{B} join(',', #data[0, 1, 10 .. 17]), "\n";
print $fh{C} join(',', #data[0, 1, 18 .. $#data]), "\n";
}
Update: I got bored and made it cleverer, so it automatically handles any number of 8-column records in a file. Unfortunately, I don't have time to explain it or add comments.
#!/usr/bin/env perl
use strict;
use warnings;
# open input
open my $in, '<', 'somefile' or die $!;
chomp(my $head = <$in>);
my #cols = split/,/, $head;
die 'Invalid number of records - ' . #cols . "\n"
if (#cols -2) % 8;
my #files;
my $name = 'A';
foreach (1 .. (#cols - 2) / 8) {
my %desc;
$desc{start_col} = (($_ - 1) * 8) + 2;
$desc{end_col} = $desc{start_col} + 7;
open $desc{fh}, '>', 'file' . $name++ or die $!;
print {$desc{fh}} join(',', #cols[0,1],
#cols[$desc{start_col} .. $desc{end_col}]),
"\n";
push #files, \%desc;
}
while (<$in>) {
chomp;
my #data = split /,/;
foreach my $f (#files) {
print {$f->{fh}} join(',', #data[0,1],
#data[$f->{start_col} .. $f->{end_col}]),
"\n";
}
}

This is independent to the number of samples. I'm not confident on the output file name though because you might reach more than 26 samples. Just replace how the output file name works if that's the case. :)
use strict;
use warnings;
use File::Slurp;
use Text::CSV_XS;
use Carp qw( croak );
#I'm lazy
my #source_file = read_file('source_file.csv');
# you metion yours is tab separated
# just add the {sep_char => "\t"} inside new
my $csv = Text::CSV_XS->new()
or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
my $output_file;
#read each row
while ( my $raw_line = shift #source_file ) {
$csv->parse($raw_line);
my #fields = $csv->fields();
#get the first 2 ids
my #ids = splice #fields, 0, 2;
my $group = 0;
while (#fields) {
#get the first 8 columns
my #columns = splice #fields, 0, 8;
#if you want to change the separator of the output replace ',' with "\t"
push #{ $output_file->[$group] }, (join ',', #ids, #columns), $/;
$group++;
}
}
#for filename purposes
my $letter = 65;
foreach my $data (#$output_file) {
my $output_filename = sprintf( 'SAMPLE_%c.csv', $letter );
write_file( $output_filename, #$data );
$letter++;
}
#if you reach more than 26 samples then you might want to use numbers instead
#my $sample_number = 1;
#foreach my $data (#$output_file) {
# my $output_filename = sprintf( 'sample_%s.csv', $sample_number );
# write_file( $output_filename, #$data );
# $sample_number++;
#}

Here is a one liner to print the first sample, you can write a shell script to write the data for different samples into different files
perl -F, -lane 'print "#F[0..1] #F[2..9]"' <INPUT_FILE_NAME>

You said tab separated, but your example shows it being comma separated. I take it that's a limitation in putting your sample data in Markdown?
I guess you're a bit concerned about memory, so you want to open the multiple files and write them as you parse your big file.
I would say to try Text::CSV::Simple. However, I believe it reads the entire file into memory which might be a problem for a file this size.
It's pretty easy to read a line, and put that line into a list. The issue is mapping the fields in that list to the names of the fields themselves.
If you read in a file with a while loop, you're not reading the whole file into memory at once. If you read in each line, parse that line, then write that line to the various output files, you're not taking up a lot of memory. There's a cache, but I believe it's emptied after a \n is written to the file.
The trick is to open the input file, then read in the first line. You want to create some sort of field mapping structure, so you can figure out which fields to write to each of the output files.
I would have a list of all the files you need to write to. This way, you can go through the list for each file. Each item in the list should contain the information you need for writing to that file.
First, you need a filehandle, so you know which file you're writing to. Second, you need a list of the field numbers you've got to write to that particular output file.
I see some sort of processing loop like this:
while (my $line = <$input_fh>) { #Line from the input file.
chomp $line;
my #input_line_array = split /\t/, $line;
my $fileHandle;
foreach my $output_file (#outputFileList) { #List of output files.
$fileHandle = $output_file->{FILE_HANDLE};
my #fieldsToWrite;
foreach my $fieldNumber (#{$output_file->{FIELD_LIST}}) {
push $fieldsToWrite, $input_line_array[$field];
}
say $file_handle join "\t", #fieldsToWrite;
}
}
I'm reading in one line of the input file into $line and dividing that up into fields which I am putting in the #input_line_array. Now that I have the line, I have to figure out which fields get written to each of the output files.
I have a list called #outputFileList that is a list of all the output files I want to write to. $outputFileList[$fileNumber]->{FILE_HANDLE} contains the file handle for my output file $fileNumber. $ouputFileList[$fileNumber]->{FIELD_LIST} is a list of fields I want to write to output file $fileNumber. This is indexed to the fields in #input_line_array. So if
$outputFileList[$fileNumber]->{FIELD_LIST} = [0, 1, 2, 4, 6, 8];
Means that I want to write the following fields to my output file: $input_line_array[0], $input_line_array[1], $input_line_array[2], $input_line_array[4], $input_line_array[6], and $input_line_array[8] to my output file $outputFileList->[$fileNumber]->{FILE_HANDLE} in that order as a tab separated list.
I hope this is making some sense.
The initial problem is reading in the first line of <$input_fh> and parsing it into the needed complex structure. However, now that you have an idea on how this structure needs to be stored, parsing that first line shouldn't be too much of an issue.
Although I didn't use object oriented code in this example (I'm pulling this stuff out of my a... I mean... brain as I write this post). I would definitely use an object oriented code approach with this. It will actually make things much faster by removing errors.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse