Record separator within a record separator - perl

How can I make use of a record separator, and then simultaneously use a sub-record separator? Perhaps that isn't the best way to think about what I am trying to do. Here is my goal:
I want to process one tab-delimited item at a time within each row. For every line (row) of tab-separated items, I need to print the results for all of its items into a separate file. The following examples should help clarify.
My input file will be something like the following. It will be called "Clustered_Barcodes.txt"
TTTATGC TTTATGG TTTATCC TTTATCG
TTTATAA TTTATAA TTTATAT TTTATAT TTTATTA
CTTGTAA
My perl code looks like the following:
#!/usr/bin/perl
use warnings;
use strict;
open(INFILE, "<", "Clustered_Barcodes.txt") or die $!;
my %hash = (
"TTTATGC" => "TATAGCGCTTTATGCTAGCTAGC",
"TTTATGG" => "TAGCTAGCTTTATGGGCTAGCTA",
"TTTATCC" => "GCTAGCTATTTATCCGCTAGCTA",
"TTTATCG" => "TAGCTAGCTTTATCGCGTACGTA",
"TTTATAA" => "TAGCTAGCTTTATAATAGCTAGC",
"TTTATAA" => "ATCGATCGTTTATAACGATCGAT",
"TTTATAT" => "TCGATCGATTTATATTAGCTAGC",
"TTTATAT" => "TAGCTAGCTTTATATGCTAGCTA",
"TTTATTA" => "GCTAGCTATTTATTATAGCTAGC",
"CTTGTAA" => "ATCGATCGCTTGTAACGATTAGC",
);
while(<INFILE>) {
$/ = "\n";
my @lines = <INFILE>;
open my $out, '>', "Clustered_Barcode_$..fasta" or die $!;
foreach my $sequence (@lines){
if (exists $hash{$sequence}){
print $out ">$sequence\n$hash{$sequence}\n";
}
}
}
My desired output would be three different files.
The first file will be called "Clustered_Barcode_1.fasta" and will look like:
>TTTATGC
TATAGCGCTTTATGCTAGCTAGC
>TTTATGG
TAGCTAGCTTTATGGGCTAGCTA
>TTTATCC
GCTAGCTATTTATCCGCTAGCTA
>TTTATCG
TAGCTAGCTTTATCGCGTACGTA
Note that this is formatted so that each key is preceded by a '>' character, and the longer associated sequence (the value) is on the next line. This file includes all the sequences in the first line of Clustered_Barcodes.txt.
My third file should be named "Clustered_Barcode_3.fasta" and look like the following:
>CTTGTAA
ATCGATCGCTTGTAACGATTAGC
When I run my code, it only takes the second and third lines of sequences in the input file. How can I start with the first line (by getting rid of the \n requirement for a record separator)? How can I then process one item at a time and print each line's worth of results into one file? Also, if there is a way to incorporate the number of sequences into the file name, that would be great. It would help me to later organize the files by size. For example, the name could be something like "Clustered_Barcodes_1_File_3_Sequences.fasta".
Thank you all.

OK, so here's one way to do it:
#!/usr/bin/perl
use strict;
use warnings;
Standard preamble.
my %hash = (
"TTTATGC" => "TATAGCGCTTTATGCTAGCTAGC",
"TTTATGG" => "TAGCTAGCTTTATGGGCTAGCTA",
"TTTATCC" => "GCTAGCTATTTATCCGCTAGCTA",
"TTTATCG" => "TAGCTAGCTTTATCGCGTACGTA",
"TTTATAA" => "TAGCTAGCTTTATAATAGCTAGC",
"TTTATAA" => "ATCGATCGTTTATAACGATCGAT",
"TTTATAT" => "TCGATCGATTTATATTAGCTAGC",
"TTTATAT" => "TAGCTAGCTTTATATGCTAGCTA",
"TTTATTA" => "GCTAGCTATTTATTATAGCTAGC",
"CTTGTAA" => "ATCGATCGCTTGTAACGATTAGC",
);
Set up the hash of sequences.
my $infile = 'Clustered_Barcodes.txt';
open my $infh, '<', $infile or die "$0: $infile: $!\n";
Open file for reading.
chomp(my @rows = readline $infh);
my $row_count = @rows;
Slurp all lines into memory in order to get the number of rows, which this version uses as the count in the file name. If the input file is very large, this approach may not work, because you can run out of memory (how soon depends on how much RAM you have).
my $i = 1;
for my $row (@rows) {
Loop over the lines.
my @fields = split /\t/, $row;
Split each line into fields separated by tabs.
my $outfile = "Clustered_Barcodes_${i}_File_${row_count}_Sequences.fasta";
$i++;
open my $outfh, '>', $outfile or die "$0: $outfile: $!\n";
Open current output file and increment counter.
for my $field (@fields) {
print $outfh ">$field\n$hash{$field}\n" if exists $hash{$field};
}
Write each field (and its mapping) to outfile.
}
And we're done. The main difference to your original code is using split /\t/ and foreach to loop over fields within a line.
We can do it without slurping, too:
while (my $row = readline $infh) {
chomp $row;
Loop over the lines, one by one. This replaces the 4 lines from chomp(my @rows = readline $infh); to for my $row (@rows) {.
But now we've lost the $i and $row_count variables, so we have to change the initialization of $outfile:
my $outfile = "Clustered_Barcodes_$..fasta";
That should be all the changes you need. (You can get $row_count back in this scenario by reading $infh twice (the first time just for counting, then seeking back to the start); this is left as an exercise for the reader.)
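The question also asked for the number of sequences on each line to appear in the file name, whereas $row_count above is the total number of rows. Here is a minimal sketch of one way to get a per-line count, assuming the count should be the number of barcodes on the line that have an entry in %hash. The helper names records_for_row and outfile_name are hypothetical, and the rows are inlined so the sketch runs without an input file; it prints the file name it would use rather than opening it.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the FASTA records for one tab-separated row,
# keeping only barcodes that have an entry in the lookup hash.
sub records_for_row {
    my ($row, $lookup) = @_;
    return map  { ">$_\n$lookup->{$_}\n" }
           grep { exists $lookup->{$_} }
           split /\t/, $row;
}

# Hypothetical helper: build a file name from the line number and record count.
sub outfile_name {
    my ($line_no, $count) = @_;
    return "Clustered_Barcodes_${line_no}_File_${count}_Sequences.fasta";
}

my %hash = (
    TTTATGC => 'TATAGCGCTTTATGCTAGCTAGC',
    TTTATGG => 'TAGCTAGCTTTATGGGCTAGCTA',
    CTTGTAA => 'ATCGATCGCTTGTAACGATTAGC',
);

# Inlined sample rows (assumption: real rows would come from readline).
my @rows = ("TTTATGC\tTTTATGG", "CTTGTAA");

my $line_no = 0;
for my $row (@rows) {
    $line_no++;
    my @records = records_for_row($row, \%hash);
    next unless @records;    # nothing on this row matched the hash
    my $outfile = outfile_name($line_no, scalar @records);
    # In the real script, open $outfile for writing and print @records to it.
    print "$outfile\n", @records;
}
```

Because the records are collected before the file name is built, the count is known without a second pass over the input.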

There's no need to read in the whole file that I see here. You just need to loop over the contents of each line:
while(my $line = <INFILE>) {
chomp $line;
open my $out, '>', "Clustered_Barcode_$..fasta" or die $!;
foreach my $sequence ( split /\t/, $line ){
if (exists $hash{$sequence}){
print $out ">$sequence\n$hash{$sequence}\n";
}
}
}

Parsing file based on column ID: perl

I have a tab-delimited file with repeated values in the first column. Each repeated value in the first column corresponds to multiple values in the second column. It looks something like this:
AAAAAAAAAA1 m081216|101|123
AAAAAAAAAA1 m081216|100|1987
AAAAAAAAAA1 m081216|927|463729
BBBBBBBBBB2 m081216|254|260489
BBBBBBBBBB2 m081216|475|1234
BBBBBBBBBB2 m081216|987|240
CCCCCCCCCC3 m081216|433|1000
CCCCCCCCCC3 m081216|902|366
CCCCCCCCCC3 m081216|724|193
For every type of sequence in the first column, I am trying to print to a file with just the sequences that correspond to it. The name of the file should include the repeated sequence in the first column and the number of sequences that correspond to it in the second column. In the above example I would therefore have 3 files of 3 sequences each. The first file would be named something like "AAAAAAAAAA1.3.txt" and look like the following when opened:
m081216|101|123
m081216|100|1987
m081216|927|463729
I have seen other similar questions, but they have been answered using a hash. I don't think I can use a hash, because I need to keep the number of relationships between columns. Maybe there is a way to use a hash of hashes? I am not sure.
Here is my code so far.
use warnings;
use strict;
use List::MoreUtils 'true';
open(IN, "<", "/path/to/in_file") or die $!;
my @array;
my $queryID;
while(<IN>){
chomp;
my $OutputLine = $_;
processOutputLine($OutputLine);
}
sub processOutputLine {
my ($OutputLine) = @_;
my @Columns = split("\t", $OutputLine);
my ($queryID, $target) = @Columns;
push(@array, $target, "\n") unless grep{$queryID eq $_} @array;
my $delineator = "\n";
my $count = true { /$delineator/g } @array;
open(OUT, ">", "/path/to/out_$..$queryID.$count.txt") or die $!;
foreach(@array){
print OUT @array;
}
}

I would still recommend a hash. Store all sequences related to the same ID in an anonymous array, which is the value for that ID key. It's really two lines of code.
use warnings;
use strict;
use feature qw(say);
my $filename = 'rep_seqs.txt'; # input file name
open my $in_fh, '<', $filename or die "Can't open $filename: $!";
my %seqs;
foreach my $line (<$in_fh>) {
chomp $line;
my ($id, $seq) = split /\t/, $line;
push @{$seqs{$id}}, $seq;
}
close $in_fh;
my $out_fh;
for (sort keys %seqs) {
my $outfile = $_ . '_' . scalar @{$seqs{$_}} . '.txt';
open $out_fh, '>', $outfile or do {
warn "Can't open $outfile: $!";
next;
};
say $out_fh $_ for @{$seqs{$_}};
}
close $out_fh;
With your input I get the desired files, named AA..._count.txt, with their corresponding three lines each. If items separated by | should be split you can do that while writing it out, for example.
Comments
The anonymous array for a key $seqs{$id} is created once we push, if not there already
If there are issues with tabs (converted to spaces?), use ' '. See the comment.
A filehandle is closed and re-opened on every open, so no need to close every time
The default pattern for split is ' ', also triggering specific behavior -- it matches "any contiguous whitespace", and also omits leading whitespace. (The pattern / / matches a single space, turning off this special behavior of ' '.) See a more precise description on the split page. Thus it is advisable to use ' ' when splitting on unspecified number of spaces, since in the case of split this is a bit idiomatic, is perhaps the most common use, and is its default. Thanks to Borodin for prompting this comment and update (the original post had the equivalent /\s+/).
Note that in this case, since ' ' is the default along with $_, we can shorten it a little
for (<$in_fh>) {
chomp;
my ($id, $seq) = split;
push @{$seqs{$id}}, $seq;
}
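The difference between the split patterns mentioned in the comments can be seen on a small example. This is only an illustrative sketch; the sample line, with its mix of leading spaces and a tab, is made up for the purpose.

```perl
use strict;
use warnings;

# Made-up sample: two leading spaces, then an ID, then space-tab-space, then a value.
my $line = "  AAAAAAAAAA1 \t m081216|101|123";

my @awk_mode = split ' ',  $line;   # runs of any whitespace; leading whitespace is skipped
my @single   = split / /,  $line;   # one space per split; empty leading fields are kept
my @tabs     = split /\t/, $line;   # tabs only; surrounding spaces stay in the fields

printf "%-12s %d field(s)\n", "split ' '",   scalar @awk_mode;
printf "%-12s %d field(s)\n", "split / /",   scalar @single;
printf "%-12s %d field(s)\n", "split /\\t/", scalar @tabs;
```

With this input, only split ' ' returns exactly the ID and the value with no stray whitespace, which is why it is the safer choice when tabs may have been converted to spaces.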

Count the number of items derived from split without putting into an array

I am looking to spare the use of an array for memory's sake, but still get the number of items derived from the split function for each pass of a while loop.
The ultimate goal is to filter the output files according to the number of their sequences, which could be deduced from the number of rows the file has, the number of '>' characters that appear, the number of line breaks, etc.
Below is my code:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
open(INFILE, "<", "Clustered_Barcodes.txt") or die $!;
my %hash = (
"TTTATGC" => "TATAGCGCTTTATGCTAGCTAGC",
"TTTATGG" => "TAGCTAGCTTTATGGGCTAGCTA",
"TTTATCC" => "GCTAGCTATTTATCCGCTAGCTA",
"TTTATCG" => "AGTCATGCTTTATCGCGATCGAT",
"TTTATAA" => "TAGCTAGCTTTATAATAGCTAGC",
"TTTATAA" => "ATCGATCGTTTATAACGATCGAT",
"TTTATAT" => "TCGATCGATTTATATTAGCTAGC",
"TTTATAT" => "TAGCTAGCTTTATATGCTAGCTA",
"TTTATTA" => "GCTAGCTATTTATTATAGCTAGC",
"CTTGTAA" => "ATCGATCGCTTGTAACGATTAGC",
);
while(my $line = <INFILE>){
chomp $line;
open my $out, '>', "Clustered_Barcode_$..txt" or die $!;
foreach my $sequence (split /\t/, $line){
if (exists $hash{$sequence}){
print $out ">$sequence\n$hash{$sequence}\n";
}
}
}
The input file, "Clustered_Barcodes.txt" when opened, looks like the following:
TTTATGC TTTATGG TTTATCC TTTATCG
TTTATAA TTTATAA TTTATAT TTTATAT TTTATTA
CTTGTAA
There will be three output files from the code, "Clustered_Barcode_1.txt", "Clustered_Barcode_2.txt", and "Clustered_Barcode_3.txt". An example of what the output files would look like could be the 3rd and final file, which would look like the following:
>CTTGTAA
ATCGATCGCTTGTAACGATTAGC
I need some way to modify my code to identify the number of rows, '>' characters, or sequences that appear in the file and work that into the name of the file. The new name for the above file could be something like "Clustered_Barcode_Number_3_1_Sequence.txt".
PS- I made the hash in the above code manually in attempt to make things simpler. If you want to see the original code, here it is. The input file format is something like:
>TAGCTAGC
GCTAAGCGATGCTACGGCTATTAGCTAGCCGGTA
Here is the code for setting up the hash:
my $dir = ("~/Documents/Sequences");
open(INFILE, "<", "~/Documents/Clustered_Barcodes.txt") or die $!;
my %hash = ();
my @ArrayofFiles = glob "$dir/*"; #put all files from the specified directory into an array
#print join("\n", @ArrayofFiles), "\n"; #this is a diagnostic test print statement
foreach my $file (@ArrayofFiles){ #make hash of barcodes and sequences
open (my $sequence, $file) or die "can't open file: $!";
while (my $line = <$sequence>) {
if ($line !~/^>/){
my $seq = $line;
$seq =~ s/\R//g;
#print $seq;
$seq =~ m/(CATCAT|TACTAC)([TAGC]{16})([TAGC]+)([TAGC]{16})(CATCAT|TACTAC)/;
$hash{$2} = $3;
}
}
}
while(<INFILE>){
etc

You can use a regex to count the delimiters; the number of fields is one more than that:
my $delimiter = "\t";
my $line = "zyz\tpqr\tabc\txyz";
my $count = () = $line =~ /$delimiter/g; # $count is now 3 (delimiters), so there are 4 fields
print $count;
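An alternative that avoids both the array and the regex engine is tr///, which returns the number of characters it matched. This is a small sketch; the helper name field_count is hypothetical, and it counts every tab-separated field on the line, not only the ones that have an entry in %hash.

```perl
use strict;
use warnings;

# Hypothetical helper: number of tab-separated fields on a line,
# counted without building a list of the fields.
sub field_count {
    my ($line) = @_;
    return 0 if $line eq '';
    # tr/// returns how many characters it matched; with an empty
    # replacement list and no /d flag the string is left unchanged.
    return ($line =~ tr/\t//) + 1;
}

print field_count("zyz\tpqr\tabc\txyz"), "\n";   # prints 4
```

If the count should only include barcodes that are found in the hash, counting has to happen while the fields are being looked up, for example by incrementing a counter inside the loop.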

Your hash structure is not right for your problem, because you have multiple entries for the same ID. For example, the key TTTATAA has two entries in your %hash, and the second one overwrites the first.
To solve this, use a hash of arrays.
Change your hash creation code from
$hash{$2} = $3;
to
push(@{$hash{$2}}, $3);
Now change the code in your while loop:
while(my $line = <INFILE>){
chomp $line;
open my $out, '>', "Clustered_Barcode_$..txt" or die $!;
my %id_list;
foreach my $sequence (split /\t/, $line){
$id_list{$sequence}=1;
}
foreach my $sequence(keys %id_list)
{
foreach my $val (@{$hash{$sequence}})
{
print $out ">$sequence\n$val\n";
}
}
}

I have assumed that:
The first digit in the output file name is the input file line number.
The second digit in the output file name is the input file column number.
The input hash is a hash of arrays, to cover the case of several sequences matching the one barcode, as mentioned in the comments.
When a barcode has a match in the hash, the output file lists all the sequences in the array, one per line.
The simplest way to do this that I can see is to build the output file using a temporary filename and then rename it when you have all the data. According to the Perl Cookbook, the easiest way to create temporary files is with the module File::Temp.
The key to this solution is to move through the list of barcodes that appear on a line by column index rather than the usual Perl way of simply iterating over the list itself. To get the actual barcodes, the column number $col is used to index back into @barcodes, which is created by splitting the line on whitespace. (Note that splitting on a single space is special-cased by Perl to emulate the behaviour of one of its predecessors, awk: leading whitespace is removed and the split is on runs of whitespace, not a single space.)
This way we have the column number (indexed from 1) and the line number, which we can get from the Perl special variable $. and we can then use these to rename the file with the builtin rename().
use warnings;
use strict;
use diagnostics;
use feature qw(say); # needed for the say calls below
use File::Temp qw(tempfile);
open(INFILE, "<", "Clustered_Barcodes.txt") or die $!;
my %hash = (
"TTTATGC" => [ "TATAGCGCTTTATGCTAGCTAGC" ],
"TTTATGG" => [ "TAGCTAGCTTTATGGGCTAGCTA" ],
"TTTATCC" => [ "GCTAGCTATTTATCCGCTAGCTA" ],
"TTTATCG" => [ "AGTCATGCTTTATCGCGATCGAT" ],
"TTTATAA" => [ "TAGCTAGCTTTATAATAGCTAGC", "ATCGATCGTTTATAACGATCGAT" ],
"TTTATAT" => [ "TCGATCGATTTATATTAGCTAGC", "TAGCTAGCTTTATATGCTAGCTA" ],
"TTTATTA" => [ "GCTAGCTATTTATTATAGCTAGC" ],
"CTTGTAA" => [ "ATCGATCGCTTGTAACGATTAGC" ]
);
my $cbn = "Clustered_Barcode_Number";
my $trailer = "Sequence.txt";
while (my $line = <INFILE>) {
chomp $line ;
my $line_num = $. ;
my @barcodes = split " ", $line ;
for my $col ( 1 .. @barcodes ) {
my $barcode = $barcodes[ $col - 1 ]; # arrays indexed from 0
# skip this one if it's not in the hash
next unless exists $hash{$barcode} ;
my @sequences = @{ $hash{$barcode} } ;
# Have a hit - create temp file and output sequences
my ($out, $temp_filename) = tempfile();
say $out ">$barcode" ;
say $out $_ for (@sequences) ;
close $out ;
# Rename based on input line and column
my $new_name = join "_", $cbn, $line_num, $col, $trailer ;
rename ($temp_filename, $new_name) or
warn "Couldn't rename $temp_filename to $new_name: $!\n" ;
}
}
close INFILE;
All of the barcodes in your sample input data have a match in the hash, so when I run this, I get 4 files for line 1, 5 for line 2 and 1 for line 3.
Clustered_Barcode_Number_1_1_Sequence.txt
Clustered_Barcode_Number_1_2_Sequence.txt
Clustered_Barcode_Number_1_3_Sequence.txt
Clustered_Barcode_Number_1_4_Sequence.txt
Clustered_Barcode_Number_2_1_Sequence.txt
Clustered_Barcode_Number_2_2_Sequence.txt
Clustered_Barcode_Number_2_3_Sequence.txt
Clustered_Barcode_Number_2_4_Sequence.txt
Clustered_Barcode_Number_2_5_Sequence.txt
Clustered_Barcode_Number_3_1_Sequence.txt
Clustered_Barcode_Number_1_2_Sequence.txt for example has:
>TTTATGG
TAGCTAGCTTTATGGGCTAGCTA
and Clustered_Barcode_Number_2_5_Sequence.txt has:
>TTTATTA
GCTAGCTATTTATTATAGCTAGC
Clustered_Barcode_Number_2_3_Sequence.txt - which matched a hash key with two sequences - had the following;
>TTTATAT
TCGATCGATTTATATTAGCTAGC
TAGCTAGCTTTATATGCTAGCTA
I was speculating here about what you wanted when a supplied barcode had two matches. Hope that helps.
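If the second number in the file name is instead meant to be the number of sequences in the file (the other reading of the question), the same temp-file-then-rename idea still applies. The following is a sketch under that assumption. The helper name write_then_rename is hypothetical, and the temporary file is created with tempfile(DIR => ...) in the destination directory so that rename does not have to cross filesystems.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile tempdir);

# Hypothetical helper: write lines to a temporary file inside $dir,
# then rename it to $final_name in the same directory.
sub write_then_rename {
    my ($dir, $final_name, @lines) = @_;
    my ($fh, $temp_name) = tempfile(DIR => $dir);
    print {$fh} "$_\n" for @lines;
    close $fh or die "close $temp_name: $!";
    my $target = "$dir/$final_name";
    rename $temp_name, $target or die "rename $temp_name -> $target: $!";
    return $target;
}

my $dir   = tempdir(CLEANUP => 1);   # scratch directory, used only for this demo
my @lines = ('>TTTATAT', 'TCGATCGATTTATATTAGCTAGC', 'TAGCTAGCTTTATATGCTAGCTA');
my $count = @lines - 1;              # sequences only; the '>' header line is not counted
my $path  = write_then_rename($dir, "Clustered_Barcode_Number_2_${count}_Sequence.txt", @lines);
print "$path\n";
```

Since the lines are already in memory here, the count is known before the file is created, so the temporary file is not strictly necessary; it matters more when the output is produced incrementally.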

Modifying CSV file and Preserving Order

The question that follows is a made up simplified example of a more complex problem that I'm trying to solve. I would like to preserve the structure of the code, especially the use of the %hash to store the outcomes for each patient but I do not need to read the data file into memory (but I cannot find a way of reading my csv data file line by line from the end.)
My sample data is made up of events that occur to patients. A patient can be added to the study (Event=B) or he can die (Event=D) or exit the study(Event=F.) Death and Exit are the only two possible outcomes for each patient.
For each event I have the date of occurrence (in hours from given point in time), the unique ID number of each patient, the event and the Outcome (a field set to 0 for every patient.)
I'm trying to write a code that will change the input file by putting next to each addition of a new patient, what is his eventual outcome (death or exit.)
In order to do so, I read the file from the end, and whenever I encounter a death or exit of a patient, I populate a hash that matches patient ID with outcome. When I encounter an event telling me that a new patient has been added to the study, I then match his ID with those in the hash and change the value of "Outcome" from 0 to either D or F.
I have been able to write code that reads the file from the bottom and then creates a new modified file with the updated value for Outcome. The problem is that since I read the input file from bottom to top and print each line after reading it, the output file is in reversed order and I do not know how to change this. Also, ideally I don't want to create a new file but would like to simply modify the input one. However, I have failed with every attempt to do so.
Sample data:
Data,PatientNumber,Event,Outcome
25201027,562962838335407,B,0
25201028,562962838335408,B,0
25201100,562962838335407,D,0
25201128,562962838335408,F,0
My code:
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open (my $fh_input, "<", "mini_test2.csv")
or die "cannot open > mini_test2.csv: $!";
my @lines = <$fh_input>;
close $fh_input;
open (my $fh_output, ">>", "Revised_mini_test2.csv")
or die "cannot open > Revised_mini_test2.csv: $!";
my $length = scalar(@lines);
my %outcome;
my @input_variables;
for (my $i = 1; $i < @lines; $i++){
chomp($lines[$length-$i]);
@input_variables=split(/,/, $lines[$length - $i]);
if ($input_variables[2] eq "D" || $input_variables[2] eq "F"){
$outcome{$input_variables[1]} = $input_variables[2];
my $line = join(",", @input_variables);
print $fh_output $line . "\n";
}
elsif($input_variables[2] eq "B") {
$input_variables[3]=$outcome{$input_variables[1]};
my $line = join(",", @input_variables);
print $fh_output $line . "\n";
}
else{
# necessary since the actual data has many more possible "Events"
my $line = join(",", @input_variables);
print $fh_output $line . "\n";
}
}
close $fh_output;
EDIT: desired output should be
Data,PatientNumber,Event,Outcome
25201027,562962838335407,B,D
25201028,562962838335408,B,F
25201100,562962838335407,D,0
25201128,562962838335408,F,0
Also, an additional complication is that the unique patient ID after the exit of a patient gets re-used. This means that I cannot do a 1st pass and store the outcome for each patient and a 2nd one to update the values of Outcome.
EDIT 2: let me clarify that when I say that each patient has a "unique ID" I mean that there cannot be in the study, at the same time, two patients with the same ID. However, if a patient exits the study, his ID gets re-used.

Update
I have just read your additional information that patient numbers are re-used once they exit the study. Why you would design a system like that I don't know, but there it is
It becomes far harder to write something straightforward without reading the file into an array, so that's what I have done here
use strict;
use warnings;
use 5.010;
use autodie;
open my $fh, '<', 'mini_test2.csv';
my @data;
while ( <$fh> ) {
chomp;
push @data, [ split /,/ ];
}
my %outcome;
for ( my $i = $#data; $i > 0; --$i ) {
my ($patient_number, $event) = @{$data[$i]}[1,2];
if ( $event =~ /[DF]/ ) {
$outcome{$patient_number} = $event;
}
elsif ( $event =~ /[B]/ ) {
$data[$i][3] = delete $outcome{$patient_number} // 0;
}
}
print join(',', @$_), "\n" for @data;
output
Data,PatientNumber,Event,Outcome
25201027,562962838335407,B,D
25201028,562962838335408,B,F
25201100,562962838335407,D,0
25201128,562962838335408,F,0
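There are a few ways to approach this. In my original answer, written before the note about re-used patient numbers, I chose to take two passes through the file: first accumulating the outcome for each patient in a hash, and then replacing the outcome fields in the B records. It only works if patient numbers are never re-used.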
use strict;
use warnings;
use 5.010;
use autodie;
use Fcntl ':seek';
my %outcome;
open my $fh, '<', 'mini_test2.csv';
<$fh>; # Drop header
while ( <$fh> ) {
chomp;
my @fields = split /,/;
my ($patient_number, $event) = @fields[1,2];
if ( $event =~ /[DF]/ ) {
$outcome{$patient_number} = $event;
}
}
seek $fh, 0, SEEK_SET; # Rewind
print scalar <$fh>; # Copy header
while ( <$fh> ) {
chomp;
my @fields = split /,/;
my ($patient_number, $event) = @fields[1,2];
if ( $event !~ /[DF]/ ) {
$fields[3] = $outcome{$patient_number} // 0;
}
print join(',', @fields), "\n";
}
output
Data,PatientNumber,Event,Outcome
25201027,562962838335407,B,D
25201028,562962838335408,B,F
25201100,562962838335407,D,0
25201128,562962838335408,F,0

What we can do is instead of printing out the line at each stage, we'll write it back to the array of lines. Then we can just print them out at the end.
chomp @lines;
for (my $i = $#lines; $i > 0; $i--) # stop before index 0, which is the header row
{
@input_variables = split /,/, $lines[$i];
if ($input_variables[2] eq "D" || $input_variables[2] eq "F")
{
$outcome{$input_variables[1]} = $input_variables[2];
}
elsif ($input_variables[2] eq "B")
{
# fall back to 0 if no outcome was seen for this patient (// needs Perl 5.10+)
$input_variables[3] = $outcome{$input_variables[1]} // 0;
}
$lines[$i] = join ",", @input_variables;
}
# The lines were chomped above, so add the newline back when printing.
print $fh_output "$_\n" for @lines;
As for the second question of modifying the original file. It is possible to open a file for both reading and writing using modes "+<", "+>", or "+>>". Don't do this! It is error prone as you must replace data character by character.
The standard way to "modify" an existing file is to rename it, read from the renamed file, write to a new file with the original name, and delete the temp file.
my $file_name = "mini_test2.csv";
my $tmp_file_name = $file_name . ".tmp";
rename $file_name, $tmp_file_name
or die "cannot rename $file_name to $tmp_file_name: $!";
open (my $fh_input, "<", $tmp_file_name)
or die "cannot open < $tmp_file_name: $!";
open (my $fh_output, ">", $file_name)
or die "cannot open > $file_name: $!";
#Your code to process the data.
close $fh_input;
close $fh_output;
#delete the temp file
unlink $tmp_file_name;
But in your case you slurp all of the data into memory right away, so you can simply reopen the same file for writing, which clobbers the existing contents:
open (my $fh_output, ">", "mini_test2.csv")
or die "cannot open > mini_test2.csv: $!";

Perl merging columns in two text files

I am a beginner with Perl and I want to merge the content of two text files.
I have read some similar questions and answers on this forum, but I still cannot resolve my issues
The first file has the original ID and the recoded ID of each individual (in the first and fourth columns)
The second file has the recoded ID and some information on some of the individuals (in the first and second columns).
I want to create an output file with the original, recoded and information of these individuals.
This is the perl script I have created so far, which is not working.
If anyone could help it would be very much appreciated.
use warnings;
use strict;
use diagnostics;
use vars qw( @fields1 $recoded $original $IDF @fields2);
my %columns1;
open (FILE1, "<file1.txt") || die "$!\n Couldn't open file1.txt\n";
while ($_ = <FILE1>)
{
chomp;
@fields1=split /\s+/, $_;
my $recoded = $fields1[0];
my $original = $fields1[3];
my %columns1 = (
$recoded => $original
);
};
open (FILE2, "<file2.txt") || die "$!\n Couldnt open file2.txt \n";
for ($_ = <FILE2>)
{
chomp;
@fields2=split /\s+/, $_;
my $IDF= $fields2[0];
my $F=$fields2[1];
my %columns2 = (
$F => $IDF
);
};
close FILE1;
close FILE2;
open (FILE3, ">output.txt") ||die "output problem\n";
for (keys %columns1) {
if (exists ($columns2{$_}){
print FILE3 "$_ $columns1{$_}\n"
};
}
close FILE3;

One problem is with scoping. In your first loop, you have a my in front of %columns1, which makes it local to the loop body, so it is no longer in scope when you exit the loop. The %columns1 declared outside the loop therefore never has any values set (which I suspect is what you want). For the assignment, it is easier to write $columns1{$recoded} = $original; which assigns the value to that key of the hash.
In the second loop you need to declare %columns2 outside of the loop and use the same style of assignment.
For the third loop, you just need to add $columns2{$_} to the front of the printed string so that the value stored for that key in %columns2 is printed first.

Scope:
The problem is with scope of the hash variables you have defined. The scope of the variable is limited to the loop inside which the variable has been defined.
In your code, %columns1 and %columns2 are used outside the loops, so they must be declared outside the loops.
Compilation error : braces not closed properly
Also, in the "if exists" part, the open-and-closed braces symmetry is affected.
Here is your code with the required corrections made:
use warnings;
use strict;
use diagnostics;
my (%columns1, %columns2);
open (FILE1, "<file1.txt") || die "$!\n Couldn't open file1.txt\n";
while (<FILE1>)
{
chomp;
my @fields1 = split /\s+/, $_;
my $recoded = $fields1[0];
my $original = $fields1[3];
$columns1{$recoded} = $original; # add one entry; assigning a list would replace the whole hash
}
open (FILE2, "<file2.txt") || die "$!\n Couldn't open file2.txt\n";
while (<FILE2>) # "for ($_ = <FILE2>)" would read only the first line
{
chomp;
my @fields2 = split /\s+/, $_;
my $IDF = $fields2[0];
my $F = $fields2[1];
$columns2{$IDF} = $F; # key on the ID so it can be looked up below
}
close FILE1;
close FILE2;
open (FILE3, ">output.txt") || die "output problem\n";
for (keys %columns1) {
# original ID, recoded ID, then the information from the second file
print FILE3 "$columns1{$_} $_ $columns2{$_}\n" if exists $columns2{$_};
}
close FILE3;

replace 4th column from the last and also pick unique value from 3rd column at the same time

I have two files both of them are delimited by pipe.
First file:
It has around 10 columns, but I am only interested in the first two, which are used to update a column in the second file.
first file detail:
1|alpha|s3.3|4|6|7|8|9
2|beta|s3.3|4|6|7|8|9
20|charlie|s3.3|4|6|7|8|9
6|romeo|s3.3|4|6|7|8|9
Second file detail:
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**1**|a10|a11|a12
a1|a2|**ray**|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|**kate**|a3|a4|a5|a6|a7|a8|**20**|a10|a11|a12
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**6**|a10|a11|a12
a1|a2|**bob**|a3|a4|a5|a6|a7|a8|**45**|a10|a11|a12
My requirement here is to find the unique values in the 3rd column and also to replace the 4th column from the last. The 4th column from the last may or may not contain a number. If that number also appears in the first field of the first file, I need to replace it (in the second file) with the corresponding value from the second column of the first file.
expected output:
unique string : ray kate bob
a1|a2|bob|a3|a4|a5|a6|a7|a8|**alpha**|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|**charlie**|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|**romeo**|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
I am able to pick the unique string using below command
awk -F'|' '{a[$3]++}END{for(i in a){print i}}' filename
I don't want to read the second file twice (once to pick the unique strings and again to replace the 4th column from the last), because the file is huge: around 500 MB, and there are many such files.
Currently I am using the Perl Text::CSV module to read the first file (which is small) and load the first two columns into a hash, with the first column as key and the second as value. I then read the second file and replace the 4th column from the last with the hash value. But this is time-consuming, as Text::CSV parsing seems to be slow.
Any awk/Perl solution keeping speed in mind would be really helpful :)
Note: Ignore the ** asterisks around the text; they are just for highlighting and are not part of the data.
UPDATE : Code
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
use Text::CSV;
my %hash;
my $csv = Text::CSV->new({ sep_char => '|' });
my $file = $ARGV[0] or die "Need to get CSV file on the command line\n";
open(my $data, '<', $file) or die "Could not open '$file' $!\n";
while (my $line = <$data>) {
chomp $line;
if ($csv->parse($line)) {
my @fields = $csv->fields();
$hash{$fields[0]}=$fields[1];
} else {
warn "Line could not be parsed: $line\n";
}
}
close($data);
$csv = Text::CSV->new({ sep_char => '|' , blank_is_undef => 1 , eol => "\n"});
my $file2 = $ARGV[1] or die "Need to get CSV file on the command line\n";
open ( my $fh,'>','/tmp/outputfile') or die "Could not open file $!\n";
open(my $data2, '<', $file2) or die "Could not open '$file' $!\n";
while (my $line = <$data2>) {
chomp $line;
if ($csv->parse($line)) {
my @fields = $csv->fields();
if (defined ($fields[-4]) && looks_like_number($fields[-4]))
{
# leave numbers that have no mapping (such as 45) unchanged
$fields[-4]=$hash{$fields[-4]} if exists $hash{$fields[-4]};
}
$csv->print($fh,\@fields);
} else {
warn "Line could not be parsed: $line\n";
}
}
close($data2);
close($fh);

Here's an option that doesn't use Text::CSV:
use strict;
use warnings;
@ARGV == 3 or die 'Usage: perl firstFile secondFile outFile';
my ( %hash, %seen );
local $" = '|';
while (<>) {
my ( $key, $val ) = split /\|/, $_, 3;
$hash{$key} = $val;
last if eof;
}
open my $outFH, '>', pop or die $!;
while (<>) {
my @F = split /\|/;
$seen{ $F[2] } = undef;
$F[-4] = $hash{ $F[-4] } if exists $hash{ $F[-4] };
print $outFH "@F";
}
close $outFH;
print 'unique string : ', join( ' ', reverse sort keys %seen ), "\n";
Command-line usage: perl firstFile secondFile outFile
Contents of outFile from your datasets (asterisks removed):
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
STDOUT:
unique string : ray kate bob
Hope this helps!

Use getline instead of parse, it is much faster. The following is a more idiomatic way of performing this task. Note that you can reuse the same Text::CSV object for multiple files.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use Text::CSV;
my $csv = Text::CSV->new({
auto_diag => 1,
binary => 1,
blank_is_undef => 1,
eol => $/,
sep_char => '|'
}) or die "Can't use CSV: " . Text::CSV->error_diag;
open my $map_fh, '<', 'map.csv' or die "map.csv: $!";
my %mapping;
while (my $row = $csv->getline($map_fh)) {
$mapping{ $row->[0] } = $row->[1];
}
close $map_fh;
open my $in_fh, '<', 'input.csv' or die "input.csv: $!";
open my $out_fh, '>', 'output.csv' or die "output.csv: $!";
my %seen;
while (my $row = $csv->getline($in_fh)) {
$seen{ $row->[2] } = 1;
my $key = $row->[-4];
$row->[-4] = $mapping{$key} if defined $key and exists $mapping{$key};
$csv->print($out_fh, $row);
}
close $in_fh;
close $out_fh;
say join ',', keys %seen;
map.csv
1|alpha|s3.3|4|6|7|8|9
2|beta|s3.3|4|6|7|8|9
20|charlie|s3.3|4|6|7|8|9
6|romeo|s3.3|4|6|7|8|9
input.csv
a1|a2|bob|a3|a4|a5|a6|a7|a8|1|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|20|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|6|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
output.csv
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
STDOUT
kate,bob,ray

This awk should work.
$ awk '
BEGIN { FS = OFS = "|" }
NR==FNR { a[$1] = $2; next }
{ !unique[$3]++ }
{ $(NF-3) = (a[$(NF-3)]) ? a[$(NF-3)] : $(NF-3) }1
END {
for(n in unique) print n > "unique.txt"
}' file1 file2 > output.txt
Explanation:
We set the input and output field separators to |.
We iterate through the first file, creating an array that stores column one as the key and column two as the value.
Once the first file is loaded in memory, we create another array while reading the second file. This array stores the unique values from column three of the second file.
While reading the second file, we check whether the fourth value from the last is present in our array from the first file. If it is, we replace it with the value from the array. If not, we leave the existing value as is.
In the END block we iterate through our unique array and print it to a file called unique.txt. This holds all the unique entries seen in column three of the second file.
The entire output of the second file is redirected to output.txt, which now has the modified fourth column from the last.
$ cat output.txt
a1|a2|bob|a3|a4|a5|a6|a7|a8|alpha|a10|a11|a12
a1|a2|ray|a3|a4|a5|a6|a7|a8||a10|a11|a12
a1|a2|kate|a3|a4|a5|a6|a7|a8|charlie|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|romeo|a10|a11|a12
a1|a2|bob|a3|a4|a5|a6|a7|a8|45|a10|a11|a12
$ cat unique.txt
kate
bob
ray
