Using Perl hashes to handle tab-delimited files - perl

I have two files:
file_1 has three columns (Marker(SNP), Chromosome, and position)
file_2 has three columns (Chromosome, peak_start, and peak_end).
All columns are numeric except for the SNP column.
The files are arranged as shown in the screenshots. file_1 has several hundred SNPs as rows while file_2 has 61 peaks. Each peak is marked by a peak_start and peak_end. There can be any of the 23 chromosomes in either file and file_2 has several peaks per chromosome.
I want to find if the position of the SNP in file_1 falls within the peak_start and peak_end in file_2 for each matching chromosome. If it does, I want to show which SNP falls in which peak (preferably write output to a tab-delimited file).
I would prefer to split the file, and use hashes where the chromosome is the key. I have found only a few questions remotely similar to this, but I could not understand well the suggested solutions.
Here is the example of my code. It is only meant to illustrate my question and so far doesn't do anything so think of it as "pseudocode".
#!usr/bin/perl
use strict;
use warnings;
my (%peaks, %X81_05);
my #array;
# Open file or die
unless (open (FIRST_SAMPLE, "X81_05.txt")) {
die "Could not open X81_05.txt";
}
# Split the tab-delimited file into respective fields
while (<FIRST_SAMPLE>) {
chomp $_;
next if (m/Chromosome/); # Skip the header
#array = split("\t", $_);
($chr1, $pos, $sample) = #array;
$X81_05{'$array[0]'} = (
'position' =>'$array[1]'
)
}
close (FIRST_SAMPLE);
# Open file using file handle
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
my ($chr, $peak_start, $peak_end);
while (<PEAKS>) {
chomp $_;
next if (m/Chromosome/); # Skip header
($chr, $peak_start, $peak_end) = split(/\t/);
$peaks{$chr}{'peak_start'} = $peak_start;
$peaks{$chr}{'peak_end'} = $peak_end;
}
close (PEAKS);
for my $chr1 (keys %X81_05) {
my $val = $X81_05{$chr1}{'position'};
for my $chr (keys %peaks) {
my $min = $peaks{$chr}{'peak_start'};
my $max = $peaks{$chr}{'peak_end'};
if (($val > $min) and ($val < $max)) {
#print $val, " ", "lies between"," ", $min, " ", "and", " ", $max, "\n";
}
else {
#print $val, " ", "does not lie between"," ", $min, " ", "and", " ", $max, "\n";
}
}
}
More awesome code:
http://i.stack.imgur.com/fzwRQ.png
http://i.stack.imgur.com/2ryyI.png

A couple of program hints in Perl:
You can do this:
open (PEAKS, "peaks.txt")
or die "Couldn't open peaks.txt";
Instead of this:
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
It's more standard Perl, and it's a bit easier to read.
Talking about Standard Perl, you should use the 3 argument open form, and use scalars for file handles:
open (my $peaks_fh, "<", "peaks.txt")
or die "Couldn't open peaks.txt";
This way, if your file's name just happens to start with a | or >, it will still work. Using scalars variables (variables that start with a $) makes it easier to pass file handles between functions.
Anyway, just to make sure I understand you correctly: You said "I would prefer ... use hashes where the chromosome is the key."
Now, I have 23 pairs of chromosomes, but each of those chromosomes might have thousands of SNPs on it. If you key by chromosome this way, you can only store a single SNP per chromosome. Is this what you want? I notice your data is showing all the same chromosome. That means you can't key by chromosome. I'm ignoring that for now, and using my own data.
I've also noticed a difference in what you said the files contained, and how your program uses them:
You said: "file 1 has 3 columns (SNP, Chromosome, and position)" , yet your code is:
($chr1, $pos, $sample) = #array;
Which I assume is Chromosome, Position, and SNP. Which way is the file arranged?
You've got to clarify exactly what you're asking for.
Anyway, here's the tested version that prints out in tab delimited format. This is in a bit more modern Perl format. Notice that I only have a single hash by chromosome (as you specified). I read the peaks.txt in first. If I find in my position file a chromosome that doesn't exist in my peaks.txt file, I simply ignore it. Otherwise, I'll add in the additional hashes for POSITION and SNP:
I do a final loop that prints everything out (tab delimitated) as you specified, but you didn't specify a format. Change it if you have to.
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie; #No need to check for file open failure
use constant {
PEAKS_FILE => "peak.txt",
POSITION_FILE => "X81_05.txt",
};
open ( my $peak_fh, "<", PEAKS_FILE );
my %chromosome_hash;
while ( my $line = <$peak_fh> ) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ( $chromosome, $peak_start, $peak_end ) = split ( "\t", $line );
$chromosome_hash{$chromosome}->{PEAK_START} = $peak_start;
$chromosome_hash{$chromosome}->{PEAK_END} = $peak_end;
}
close $peak_fh;
open ( my $position_fh, "<", POSITION_FILE );
while ( my $line = <$position_fh> ) {
chomp $line;
my ( $chromosome, $position, $snp ) = split ( "\t", $line );
next unless exists $chromosome_hash{$chromosome};
if ( $position >= $chromosome_hash{$chromosome}->{PEAK_START}
and $position <= $chromosome_hash{$chromosome}->{PEAK_END} ) {
$chromosome_hash{$chromosome}->{SNP} = $snp;
$chromosome_hash{$chromosome}->{POSITION} = $position;
}
}
close $position_fh;
#
# Now Print
#
say join ("\t", qw(Chromosome, SNP, POSITION, PEAK-START, PEAK-END) );
foreach my $chromosome ( sort keys %chromosome_hash ) {
next unless exists $chromosome_hash{$chromosome}->{SNP};
say join ("\t",
$chromosome,
$chromosome_hash{$chromosome}->{SNP},
$chromosome_hash{$chromosome}->{POSITION},
$chromosome_hash{$chromosome}->{PEAK_START},
$chromosome_hash{$chromosome}->{PEAK_END},
);
}
A few things:
Leave spaces around parentheses on both sides. It makes it easier to read.
I use parentheses when others don't. The current style is not to use them unless you have to. I tend to use them for all functions that take more than a single argument. For example, I could have said open my $peak_fh, "<", PEAKS_FILE;, but I think parameters start to get lost when you have three parameters on a function.
Notice I use use autodie;. This causes the program to quit if it can't open a file. That's why I don't even have to test whether or not the file opened.
I would have preferred to use object oriented Perl to hide the structure of the hash of hashes. This prevents errors such as thinking that the start peek is stored in START_PEEK rather than PEAK_START. Perl won't detect these type of miskeyed errors. Therefore, I prefer to use objects whenever I am doing arrays of arrays or hashes of hashes.

You only need one for loop because you are expecting to find some of the SNPs in the second lot. Hence, loop through your %X81_05 hash and check if any matches one in %peak. Something like:
for my $chr1 (keys %X81_05)
{
if (defined $peaks{$chr1})
{
if ( $X81_05{$chr1}{'position'} > $peaks{$chr1}{'peak_start'}
&& $X81_05{$chr1}{'position'} < $peaks{$chr1}{'peak_end'})
{
print YOUROUTPUTFILEHANDLE $chr1 . "\t"
. $peaks{$chr1}{'peak_start'} . "\t"
. $peaks{$chr1}{'peak_end'};
}
else
{
print YOUROUTPUTFILEHANDLE $chr1
. "\tDoes not fall between "
. $peaks{$chr1}{'peak_start'} . " and "
. $peaks{$chr1}{'peak_end'};
}
}
}
Note: I Have not tested the code.
Looking at the screenshots that you have added, this is not going to work.

The points raised by #David are good; try to incorporate those in your programs. (I have borrowed most of the code from #David's post.)
One thing I didn't understand is that why load both peak values and position in hash, as loading one would suffice. As each chromosome has more than one record, use HoA. My solution is based on that. You might need to change the cols and their positions.
use strict;
use warnings;
our $Sep = "\t";
open (my $peak_fh, "<", "data/file2");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ($chromosome) = (split($Sep, $line))[0];
push #{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromo
}
close $peak_fh;
open (my $position_fh, "<", "data/file1");
while (my $line = <$position_fh>) {
chomp $line;
my ($chromosome, $snp, $position) = split ($Sep, $line);
next unless exists $chromosome_hash{$chromosome};
foreach my $peak_line (#{$chromosome_hash{$chromosome}}) {
my ($start,$end) = (split($Sep, $line))[1,2];
if ($position >= $start and $position <= $end) {
print "MATCH REQUIRED-DETAILS...$line-$peak_line\n";
}
else {
print "NO MATCH REQUIRED-DETAILS...$line-$peak_line\n";
}
}
}
close $position_fh;

I used #tuxuday and #David's code to solve this problem. Here is the final code that did what I wanted. I have not only learned a lot, but I have been able to solve my problem successfully! Kudos guys!
use strict;
use warnings;
use feature qw(say);
# Read in peaks and sample files from command line
my $usage = "Usage: $0 <peaks_file> <sample_file>";
my $peaks = shift #ARGV or die "$usage \n";
my $sample = shift #ARGV or die "$usage \n";
our $Sep = "\t";
open (my $peak_fh, "<", "$peaks");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ($chromosome) = (split($Sep, $line))[0];
push #{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromosome
}
close $peak_fh;
open (my $position_fh, "<", "$sample");
while (my $line = <$position_fh>) {
chomp $line;
next if $line =~ /Marker/; #Skip Header
my ($snp, $chromosome, $position) = split ($Sep, $line);
# Check if chromosome in peaks_file matches chromosome in sample_file
next unless exists $chromosome_hash{$chromosome};
foreach my $peak_line (#{$chromosome_hash{$chromosome}}) {
my ($start,$end,$peak_no) = (split( $Sep, $peak_line ))[1,2,3];
if ( $position >= $start and $position <= $end) {
# Print output
say join ("\t",
$snp,
$chromosome,
$position,
$start,
$end,
$peak_no,
);
}
else {
next; # Go to next chromosome
}
}
}
close $position_fh;

Related

Parsing file based on column ID: perl

I have a tab delineated file with repeated values in the first column. The single, but repeated values in the first column correspond to multiple values in the second column. It looks something like this:
AAAAAAAAAA1 m081216|101|123
AAAAAAAAAA1 m081216|100|1987
AAAAAAAAAA1 m081216|927|463729
BBBBBBBBBB2 m081216|254|260489
BBBBBBBBBB2 m081216|475|1234
BBBBBBBBBB2 m081216|987|240
CCCCCCCCCC3 m081216|433|1000
CCCCCCCCCC3 m081216|902|366
CCCCCCCCCC3 m081216|724|193
For every type of sequence in the first column, I am trying to print to a file with just the sequences that correspond to it. The name of the file should include the repeated sequence in the first column and the number of sequences that correspond to it in the second column. In the above example I would therefore have 3 files of 3 sequences each. The first file would be named something like "AAAAAAAAAA1.3.txt" and look like the following when opened:
m081216|101|123
m081216|100|1987
m081216|927|463729
I have seen other similar questions, but they have been answered with using a hash. I don't think I can't use a hash because I need to keep the number of relationships between columns. Maybe there is a way to use a hash of hashes? I am not sure.
Here is my code so far.
use warnings;
use strict;
use List::MoreUtils 'true';
open(IN, "<", "/path/to/in_file") or die $!;
my #array;
my $queryID;
while(<IN>){
chomp;
my $OutputLine = $_;
processOutputLine($OutputLine);
}
sub processOutputLine {
my ($OutputLine) = #_;
my #Columns = split("\t", $OutputLine);
my ($queryID, $target) = #Columns;
push(#array, $target, "\n") unless grep{$queryID eq $_} #array;
my $delineator = "\n";
my $count = true { /$delineator/g } #array;
open(OUT, ">", "/path/to/out_$..$queryID.$count.txt") or die $!;
foreach(#array){
print OUT #array;
}
}
I would still recommend a hash. However, you store all sequences related to the same id in an anonymous array which is the value for that ID key. It's really two lines of code.
use warnings;
use strict;
use feature qw(say);
my $filename = 'rep_seqs.txt'; # input file name
open my $in_fh, '<', $filename or die "Can't open $filename: $!";
my %seqs;
foreach my $line (<$in_fh>) {
chomp $line;
my ($id, $seq) = split /\t/, $line;
push #{$seqs{$id}}, $seq;
}
close $in_fh;
my $out_fh;
for (sort keys %seqs) {
my $outfile = $_ . '_' . scalar #{$seqs{$_}} . '.txt';
open $out_fh, '>', $outfile or do {
warn "Can't open $outfile: $!";
next;
};
say $out_fh $_ for #{$seqs{$_}};
}
close $out_fh;
With your input I get the desired files, named AA..._count.txt, with their corresponding three lines each. If items separated by | should be split you can do that while writing it out, for example.
Comments
The anonymous array for a key $seqs{$id} is created once we push, if not there already
If there are issues with tabs (converted to spaces?), use ' '. See the comment.
A filehandle is closed and re-opened on every open, so no need to close every time
The default pattern for split is ' ', also triggering specific behavior -- it matches "any contiguous whitespace", and also omits leading whitespace. (The pattern / / matches a single space, turning off this special behavior of ' '.) See a more precise description on the split page. Thus it is advisable to use ' ' when splitting on unspecified number of spaces, since in the case of split this is a bit idiomatic, is perhaps the most common use, and is its default. Thanks to Borodin for prompting this comment and update (the original post had the equivalent /\s+/).
Note that in this case, since ' ' is the default along with $_, we can shorten it a little
for (<$in_fh>) {
chomp;
my ($id, $seq) = split;
push #{$seqs{$id}}, $seq;
}

Modifying CSV file and Preserving Order

The question that follows is a made up simplified example of a more complex problem that I'm trying to solve. I would like to preserve the structure of the code, especially the use of the %hash to store the outcomes for each patient but I do not need to read the data file into memory (but I cannot find a way of reading my csv data file line by line from the end.)
My sample data is made up of events that occur to patients. A patient can be added to the study (Event=B) or he can die (Event=D) or exit the study(Event=F.) Death and Exit are the only two possible outcomes for each patient.
For each event I have the date of occurrence (in hours from given point in time), the unique ID number of each patient, the event and the Outcome (a field set to 0 for every patient.)
I'm trying to write a code that will change the input file by putting next to each addition of a new patient, what is his eventual outcome (death or exit.)
In order to do so, I read the file from the end, and whenever I encounter a death or exit of a patient, I populate a hash that matches patient ID with outcome. When I encounter an event telling me that a new patient has been added to the study, I then match his ID with those in the hash and change the value of "Outcome" from 0 to either D or F.
I have been able to write a code that reads the file from bottom and then creates a new modified file with the updated value for Outcome. The problem is that since I read the input file from bottom to top and print each line after reading it, the output file is in reversed order and I do not know how to change this. Also, ideally I don't want to create a new file bu I would like to simply modify the input one. However, I have failed with every attempt to do so.
Sample data:
Data,PatientNumber,Event,Outcome
25201027,562962838335407,B,0
25201028,562962838335408,B,0
25201100,562962838335407,D,0
25201128,562962838335408,F,0
My code:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open (my $fh_input, "<", "mini_test2.csv")
or die "cannot open > mini_test2.csv: $!";
my #lines = <$fh_input>;
close $fh_input;
open (my $fh_output, ">>", "Revised_mini_test2.csv")
or die "cannot open > Revised_mini_test2.csv: $!";
my $length = scalar(#lines);
my %outcome;
my #input_variables;
for (my $i = 1; $i < #lines; $i++){
chomp($lines[$length-$i]);
#input_variables=split(/,/, $lines[$length - $i]);
if ($input_variables[2] eq "D" || $input_variables[2] eq "F"){
$outcome{$input_variables[1]} = $input_variables[2];
my $line = join(",", #input_variables);
print $fh_output $line . "\n";
}
elsif($input_variables[2] eq "B") {
$input_variables[3]=$outcome{$input_variables[1]};
my $line = join(",", #input_variables);
print $fh_output $line . "\n";
}
else{
# necessary since the actual data has many more possible "Events"
my $line = join(",", #input_variables);
print $fh_output $line . "\n";
}
}
close $fh_output;
EDIT: desired output should be
Data,PatientNumber,Event,Outcome
25201027,562962838335407,B,D
25201028,562962838335408,B,F
25201100,562962838335407,D,0
25201128,562962838335408,F,0
Also, an additional complication is that the unique patient ID after the exit of a patient gets re-used. This means that I cannot do a 1st pass and store the outcome for each patient and a 2nd one to update the values of Outcome.
EDIT 2: let me clarify that when I say that each patient has a "unique ID" I mean that there cannot be in the study, at the same time, two patients with the same ID. However, if a patient exits the study, his ID gets re-used.
Update
I have just read your additional information that patient numbers are re-used once they exit the study. Why you would design a system like that I don't know, but there it is
It becomes far harder to write something straightforward without reading the file into an array, so that's what I have done here
use strict;
use warnings;
use 5.010;
use autodie;
open my $fh, '<', 'mini_test2.csv';
my #data;
while ( <$fh> ) {
chomp;
push #data, [ split /,/ ];
}
my %outcome;
for ( my $i = $#data; $i > 0; --$i ) {
my ($patient_number, $event) = #{$data[$i]}[1,2];
if ( $event =~ /[DF]/ ) {
$outcome{$patient_number} = $event;
}
elsif ( $event =~ /[B]/ ) {
$data[$i][3] = delete $outcome{$patient_number} // 0;
}
}
print join(',', #$_), "\n" for #data;
output
Data,PatientNumber,Event,Outcome
25201027,562962838335407,B,D
25201028,562962838335408,B,F
25201100,562962838335407,D,0
25201128,562962838335408,F,0
There are a few ways to approach this. I have chosen to take two passes through the file, first accumulating the outcome for each patient in a hash, and then replacing all the outcome fields in the B records
use strict;
use warnings;
use 5.010;
use autodie;
use Fcntl ':seek';
my %outcome;
open my $fh, '<', 'mini_test2.csv';
<$fh>; # Drop header
while ( <$fh> ) {
chomp;
my #fields = split /,/;
my ($patient_number, $event) = #fields[1,2];
if ( $event =~ /[DF]/ ) {
$outcome{$patient_number} = $event;
}
}
seek $fh, 0, SEEK_SET; # Rewind
print scalar <$fh>; # Copy header
while ( <$fh> ) {
chomp;
my #fields = split /,/;
my ($patient_number, $event) = #fields[1,2];
if ( $event !~ /[DF]/ ) {
$fields[3] = $outcome{$patient_number} // 0;
}
print join(',', #fields), "\n";
}
output
Data,PatientNumber,Event,Outcome
25201027,562962838335407,B,D
25201028,562962838335408,B,F
25201100,562962838335407,D,0
25201128,562962838335408,F,0
What we can do is instead of printing out the line at each stage, we'll write it back to the array of lines. Then we can just print them out at the end.
for (my $i=$#lines; i>=0; i--)
{
chomp $lines[$i];
#input_variables = split /,/, $lines[$i];
if ($input_variables[2] eq "D" || $input_variables[2] eq "F")
{
$outcome{$input_variables[1]}=$input_variables[2];
}else
{
$input_variables[3]=$outcome{$input_variables[1]};
}
$line[$i] = join ",", #input_variables;
}
$, = "\n"; #Make list seperator for printing a newline.
print $fh_output #lines;
As for the second question of modifying the original file. It is possible to open a file for both reading and writing using modes "+<", "+>", or "+>>". Don't do this! It is error prone as you must replace data character by character.
The standard way to "modify" an existing file is to rename it, read from the renamed file, write to a new file with the original name, and delete the temp file.
my $file_name = "mini_test2.csv";
my $tmp_file_name = $file_name . ".tmp";
rename $file_name, $tmp_file_name;
open (my $fh_input, "<", $tmp_file_name)
or die "cannot open > $tmp_file_name: $!";
open (my $fh_output, ">>", $file_name)
or die "cannot open > $file_name: $!";
#Your code to process the data.
close $fh_input;
close $fh_output;
#delete the temp file
unlink $tmp_file_name;
But, in your case, you slurp all of the data into memory right away. Just open for writing that clobbers existing files
open (my $fh_output, ">", "mini_test2.csv")
or die "cannot open > mini_test2.csv: $!";

Reading the next line in the file and keeping counts separate

Another question for everyone. To reiterate I am very new to the Perl process and I apologize in advance for making silly mistakes
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have established the counter and to read each line of DNA sequence but at the moment it is do a running summation of the total across all lines. I want it to read each sequence, print the content after the sequence read then move onto the next one. Having individual base counts for each line.
This is what I have so far.
#!/usr/bin/perl
#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";
#establishing the intial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;
#initial loop created to read through each line
while ( my $line = <INFILE> ) {
chomp $line;
# reads file until the ">" character is encounterd and prints the line
if ($line =~ /^>/){
print OUTFILE "Gene: $line\n";
}
# otherwise count the content of the next line.
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
my #array = split //, $line;
my $array= (#array);
# reset the counts of each variable
$G = ();
$C = ();
$A = ();
$T = ();
foreach $array (#array){
#if statements asses which base is present and makes a running total of the bases.
if ($array eq 'G'){
++$G;
}
elsif ( $array eq 'C' ) {
++$C; }
elsif ( $array eq 'A' ) {
++$A; }
elsif ( $array eq 'T' ) {
++$T; }
}
# all is printed to the outfile
print OUTFILE "G:$G\n";
print OUTFILE "C:$C\n";
print OUTFILE "A:$A\n";
print OUTFILE "T:$T\n";
print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
}
}
#close the outfile and the infile
close OUTFILE;
close INFILE;
Again I feel like I am on the right path, I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the final counts printed out. My percent values are wrong and give me the wrong value. I feel like the total is being calculated then that new value is incorporated into the total.
Several things:
1. use hash instead of declaring each element.
2. assignment such as $G = (0); is indeed working, but it is not the right way to assign scalar. What you did is declaring an array, which in scalar context $G = is returning the first array item. The correct way is $G = 0.
my %seen;
$seen{/^([A-Z])/}++ for (grep {/^\>/} <INFILE>);
foreach $gene (keys %seen) {
print "$gene: $seen{$gene}\n";
}
Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;
my %counts;
while (<>) {
if (/^>/) {
# print counts for the prev gene if there are counts:
print_counts(\%counts) if keys %counts;
%counts = (); # reset the counts
print $_; # print the Fasta header
} else {
chomp;
$counts{$_}++ for split //;
}
}
print_counts(\%counts) if keys %counts; # print counts for last gene
sub print_counts {
my ($counts) = #_;
print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
Style comments:
When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.
Your intendation is a bit weird. It is very unusual and visually confusing to put closing braces on the same line with another statement like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
$C++;
}
This statement my $array= (#array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach-loops, like for my $base (#array) { ... }.

Merging two files based on first column and returns multiple values for each key

I am fairly new to Perl so hopefully this has a quick solution.
I have been trying to combine two files based on a key. The problem is there are multiple values instead of the one it is returning. Is there a way to loop through the hash to get the 1-10 more values it could be getting?
Example:
File Input 1:
12345|AA|BB|CC
23456|DD|EE|FF
File Input2:
12345|A|B|C
12345|D|E|F
12345|G|H|I
23456|J|K|L
23456|M|N|O
32342|P|Q|R
The reason I put those last one in is because the second file has a lot of values I don’t want but file 1 I want all values. The result I want is something like this:
WANTED OUTPUT:
12345|AA|BB|CC|A|B|C
12345|AA|BB|CC|D|E|F
12345|AA|BB|CC|G|H|I
23456|DD|EE|FF|J|K|L
23456|DD|EE|FF|M|N|O
Attached is the code I am currently using. It gives an output like so:
OUTPUT I AM GETTING:
12345|AA|BB|CC|A|B|C
23456|DD|EE|FF|J|K|L
My code so far:
#use strict;
#use warnings;
open file1, "<FILE1.txt";
open file2, "<FILE2.txt";
while(<file2>){
my($line) = $_;
chomp $line;
my($key, $value1, $value2, $value3) = $line =~ /(.+)\|(.+)\|(.+)\|(.+)/;
$value4 = "$value1|$value2|$value3";
$file2Hash{$key} = $value4;
}
while(<file1>){
my ($line) = $_;
chomp $line;
my($key, $value1, $value2, $value3) = $line =~/(.+)\|(.+)\|(.+)\|(.+)/;
if (exists $file2Hash{$key}) {
print $line."|".$file2Hash{$key}."\n";
}
else {
print $line."\n";
}
}
Thank you for any help you may provide,
Your overall idea is sound. However in file2, if you encounter a key you have already defined, you overwrite it with a new value. To work around that, we store an array(-ref) inside our hash.
So in your first loop, we do:
push #{$file2Hash{$key}}, $value4;
The #{...} is just array dereferencing syntax.
In your second loop, we do:
if (exists $file2Hash{$key}){
foreach my $second_value (#{$file2Hash{$key}}) {
print "$line|$second_value\n";
}
} else {
print $line."\n";
}
Beyond that, you might want to declare %file2Hash with my so you can reactivate strict.
Keys in a hash must be unique. If keys in file1 are unique, use file1 to create the hash. If keys are not unique in either file, you have to use a more complicated data structure: hash of arrays, i.e. store several values at each unique key.
I assume that each key in FILE1.txt is unique and that each unique key has at least one corresponding line in FILE2.txt.
Your approach is then quite close to what you need, you should just use FILE1.txt to create the hash from (as already mentioned here).
The following should work:
#!/usr/bin/perl
use strict;
use warnings;
my %file1hash;
open file1, "<", "FILE1.txt" or die "$!\n";
while (<file1>) {
my ($key, $rest) = split /\|/, $_, 2;
chomp $rest;
$file1hash{$key} = $rest;
}
close file1;
open file2, "<", "FILE2.txt" or die "$!\n";
while (<file2>) {
my ($key, $rest) = split /\|/, $_, 2;
if (exists $file1hash{$key}) {
chomp $rest;
printf "%s|%s|%s\n", $key, $file1hash{$key}, $rest;
}
}
close file2;
exit 0;

Perl merging 2 csv files line by line with a primary key

Edit: solution added.
Hi, I currently have some working albeit slow code.
It merges 2 CSV files line by line using a primary key.
For example, if file 1 has the line:
"one,two,,four,42"
and file 2 has this line;
"one,,three,,42"
where in 0 indexed $position = 4 has the primary key = 42;
then the sub: merge_file($file1,$file2,$outputfile,$position);
will output a file with the line:
"one,two,three,four,42";
Every primary key is unique in each file, and a key might exist in one file but not in the other (and vice versa)
There are about 1 million lines in each file.
Going through every line in the first file, I am using a hash to store the primary key, and storing the line number as the value. The line number corresponds to an array[line num] which stores every line in the first file.
Then I go through every line in the second file, and check if the primary key is in the hash, and if it is, get the line from the file1array and then add the columns I need from the first array to the second array, and then concat. to the end. Then delete the hash, and then at the very end, dump the entire thing to file. (I am using a SSD so I want to minimise file writes.)
It is probably best explained with a code:
sub merge_file2{
my ($file1,$file2,$out,$position) = ($_[0],$_[1],$_[2],$_[3]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my %line_for;
my #file1array;
open FILE1, "<$file1";
print "$file1 opened\n";
while (<FILE1>){
chomp;
$line_for{read_csv_string($_,$position)}=$.; #reads csv line at current position (of key)
$file1array[$.] = $_; #store line in file1array.
}
close FILE1;
print "$file2 opened - merging..\n";
open FILE2, "<", $file2;
my #from1to2 = qw( 2 4 8 17 18 19); #which columns from file 1 to be added into cols. of file 2.
while (<FILE2>){
print "$.\n" if ($.%1000) == 0;
chomp;
my #array1 = ();
my #array2 = ();
my #array2 = split /,/, $_; #split 2nd csv line by commas
my #array1 = split /,/, $file1array[$line_for{$array2[$position]}];
# ^ ^ ^
# prev line lookup line in 1st file,lookup hash, pos of key
#my #output = &merge_string(\#array1,\#array2); #merge 2 csv strings (old fn.)
foreach(#from1to2){
$array2[$_] = $array1[$_];
}
my $outstring = join ",", #array2;
$OUTSTRING.=$outstring."\n";
delete $line_for{$array2[$position]};
}
close FILE2;
print "adding rest of lines\n";
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
print "writing file $out\n\n\n";
write_line($out,$OUTSTRING);
}
The first while is fine, takes less than 1 minute, however the second while loop takes about 1 hour to run, and I am wondering if I have taken the right approach. I think it is possible for a lot of speedup? :) Thanks in advance.
Solution:
sub merge_file3{
my ($file1,$file2,$out,$position,$hsize) = ($_[0],$_[1],$_[2],$_[3],$_[4]);
print "merging: \n$file1 and \n$file2, to: \n$out\n";
my $OUTSTRING = undef;
my $header;
my (#file1,#file2);
open FILE1, "<$file1" or die;
while (<FILE1>){
if ($.==1){
$header = $_;
next;
}
print "$.\n" if ($.%100000) == 0;
chomp;
push #file1, [split ',', $_];
}
close FILE1;
open FILE2, "<$file2" or die;
while (<FILE2>){
next if $.==1;
print "$.\n" if ($.%100000) == 0;
chomp;
push #file2, [split ',', $_];
}
close FILE2;
print "sorting files\n";
my #sortedf1 = sort {$a->[$position] <=> $b->[$position]} #file1;
my #sortedf2 = sort {$a->[$position] <=> $b->[$position]} #file2;
print "sorted\n";
#file1 = undef;
#file2 = undef;
#foreach my $line (#file1){print "\t [ #$line ],\n"; }
my ($i,$j) = (0,0);
while ($i < $#sortedf1 and $j < $#sortedf2){
my $key1 = $sortedf1[$i][$position];
my $key2 = $sortedf2[$j][$position];
if ($key1 eq $key2){
foreach(0..$hsize){ #header size.
$sortedf2[$j][$_] = $sortedf1[$i][$_] if $sortedf1[$i][$_] ne undef;
}
$i++;
$j++;
}
elsif ( $key1 < $key2){
push(#sortedf2,[#{$sortedf1[$i]}]);
$i++;
}
elsif ( $key1 > $key2){
$j++;
}
}
#foreach my $line (#sortedf2){print "\t [ #$line ],\n"; }
print "outputting to file\n";
open OUT, ">$out";
print OUT $header;
foreach(#sortedf2){
print OUT (join ",", #{$_})."\n";
}
close OUT;
}
Thanks everyone, the solution is posted above. It now takes about 1 minute to merge the whole thing! :)
Two techniques come to mind.
Read the data from the CSV files into two tables in a DBMS (SQLite would work just fine), and then use the DB to do a join and write the data back out to CSV. The database will use indexes to optimize the join.
First, sort each file by primary key (using perl or unix sort), then do a linear scan over each file in parallel (read a record from each file; if the keys are equal then output a joined row and advance both files; if the keys are unequal then advance the file with the lesser key and try again). This step is O(n + m) time instead of O(n * m), and O(1) memory.
What's killing the performance is this code, which is concatenating millions of times.
$OUTSTRING.=$outstring."\n";
....
foreach my $key (sort { $a <=> $b } keys %line_for){
$OUTSTRING.= $file1array[$line_for{$key}]."\n";
}
If you want to write to the output file only once, accumulate your results in an array, and then print them at the very end, using join. Or, even better perhaps, include the newlines in the results and write the array directly.
To see how concatenation does not scale when crunching big data, experiment with this demo script. When you run it in concat mode, things start slowing down considerably after a couple hundred thousand concatenations -- I gave up and killed the script. By contrast, simply printing an array of a million lines took less than a than a minute on my machine.
# Usage: perl demo.pl 50 999999 concat|join|direct
use strict;
use warnings;
my ($line_len, $n_lines, $method) = #ARGV;
my #data = map { '_' x $line_len . "\n" } 1 .. $n_lines;
open my $fh, '>', 'output.txt' or die $!;
if ($method eq 'concat'){ # Dog slow. Gets slower as #data gets big.
my $outstring;
for my $i (0 .. $#data){
print STDERR $i, "\n" if $i % 1000 == 0;
$outstring .= $data[$i];
}
print $fh $outstring;
}
elsif ($method eq 'join'){ # Fast
print $fh join('', #data);
}
else { # Fast
print $fh #data;
}
If you want merge you should really merge. First of all you have to sort your data by key and than merge! You will beat even MySQL in performance. I have a lot of experience with it.
You can write something along those lines:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV_XS;
use autodie;
use constant KEYPOS => 4;
die "Insufficient number of parameters" if #ARGV < 2;
my $csv = Text::CSV_XS->new( { eol => $/ } );
my $sortpos = KEYPOS + 1;
open my $file1, "sort -n -k$sortpos -t, $ARGV[0] |";
open my $file2, "sort -n -k$sortpos -t, $ARGV[1] |";
my $row1 = $csv->getline($file1);
my $row2 = $csv->getline($file2);
while ( $row1 and $row2 ) {
my $row;
if ( $row1->[KEYPOS] == $row2->[KEYPOS] ) { # merge rows
$row = [ map { $row1->[$_] || $row2->[$_] } 0 .. $#$row1 ];
$row1 = $csv->getline($file1);
$row2 = $csv->getline($file2);
}
elsif ( $row1->[KEYPOS] < $row2->[KEYPOS] ) {
$row = $row1;
$row1 = $csv->getline($file1);
}
else {
$row = $row2;
$row2 = $csv->getline($file2);
}
$csv->print( *STDOUT, $row );
}
# flush possible tail
while ( $row1 ) {
$csv->print( *STDOUT, $row1 );
$row1 = $csv->getline($file1);
}
while ( $row2 ) {
$csv->print( *STDOUT, $row2 );
$row2 = $csv->getline($file1);
}
close $file1;
close $file2;
Redirect output to file and measure.
If you like more sanity around sort arguments you can replace file opening part with
(open my $file1, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[0]);
(open my $file2, '-|') || exec('sort', '-n', "-k$sortpos", '-t,', $ARGV[1]);
I can't see anything that strikes me as obviously slow, but I would make these changes:
First, I'd eliminate the #file1array variable. You don't need it; just store the line itself in the hash:
while (<FILE1>){
chomp;
$line_for{read_csv_string($_,$position)}=$_;
}
Secondly, although this shouldn't really make much of a difference with perl, I wouldn't add to $OUTSTRING all the time. Instead, keep an array of output lines and push onto it each time. If for some reason you still need to call write_line with a massive string you can always use join('', #OUTLINES) at the end.
If write_line doesn't use syswrite or something low-level like that, but rather uses print or other stdio-based calls, then you aren't saving any disk writes by building up the output file in memory. Therefore, you might as well not build your output up in memory at all, and instead just write it out as you create it. Of course if you are using syswrite, forget this.
Since nothing is obviously slow, try throwing Devel::SmallProf at your code. I've found that to be the best perl profiler for producing those "Oh! That's the slow line!" insights.
Assuming around 20 bytes lines each of your file would amount to about 20 MB, which isn't too big.
Since you are using hash your time complexity doesn't seem to be a problem.
In your second loop, you are printing to the console for each line, this bit is slow. Try removing that should help a lot.
You can also avoid the delete in the second loop.
Reading multiple lines at a time should also help. But not too much I think, there is always going to be a read ahead behind the scenes.
I'd store each record in a hash whose keys are the primary keys. A given primary key's value is a reference to an array of CSV values, where undef represents an unknown value.
use 5.10.0; # for // ("defined-or")
use Carp;
use Text::CSV;
sub merge_csv {
my($path,$record) = #_;
open my $fh, "<", $path or croak "$0: open $path: $!";
my $csv = Text::CSV->new;
local $_;
while (<$fh>) {
if ($csv->parse($_)) {
my #f = map length($_) ? $_ : undef, $csv->fields;
next unless #f >= 1;
my $primary = pop #f;
if ($record->{$primary}) {
$record->{$primary}[$_] //= $f[$_]
for 0 .. $#{ $record->{$primary} };
}
else {
$record->{$primary} = \#f;
}
}
else {
warn "$0: $path:$.: parse failed; skipping...\n";
next;
}
}
}
Your main program will resemble
my %rec;
merge_csv $_, \%rec for qw/ file1 file2 /;
The Data::Dumper module shows that the resulting hash given the simple inputs from your question is
$VAR1 = {
'42' => [
'one',
'two',
'three',
'four'
]
};