I've unpacked it...how can I count the columns? - perl

I am using unpack to parse some text files w/ some columns. Each text file is different and has a different number of columns. How can I count the columns so I don't get errors? Right now I am using 0..5 but if the text file has 3 columns then I get an error: "Use of uninitialized value in substitution...". Thx!
open (PARSE,"<$temp") or die $!;
my @template = map {'A'.length} <PARSE> =~ /(\S+\s*)/g;
next unless @template;
$template[-1] = 'A*';# set the last segment to be slurpy
my $template = "@template";
my @data;
while (<PARSE>) {
push @data, [unpack $template, $_]
}
for my $dat (@data){ # for each row
for(0..5){ # for each column in that row
$dat->[$_]=~s/^\s+//g;
$dat->[$_]=~s/\s+$//g;
print $dat->[$_].',';
}
print "\n";
}

With languages like Perl, Python, Ruby, etc., you rarely need to stoop to the level of subscripts when iterating over an array:
for my $cell (@$dat){
# Work with $cell rather than $dat->[$_].
...
}
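For the original question, the same idea removes the need to know the column count at all. Here is a minimal sketch, assuming each row is an array reference as built by the push line in the question; the sample rows and the variable names are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample rows standing in for the unpacked data;
# note that the rows have different numbers of columns.
my @data = (
    [ ' a ', 'b  ', ' c' ],
    [ 'd', ' e ', 'f', ' g ', 'h' ],
);

my @lines;
for my $row (@data) {
    # scalar(@$row) is the number of columns in this particular row.
    my $column_count = scalar @$row;

    # Trim a copy of every cell; no index range is hard-coded.
    my @cells = map { my $c = $_; $c =~ s/^\s+|\s+$//g; $c } @$row;

    push @lines, join(',', @cells) . " ($column_count columns)";
}
print "$_\n" for @lines;
```

Because the loop walks whatever is in the row, a three-column file never touches an undefined fourth element, so the "uninitialized value" warning should not appear.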

Probably easier and cleaner to use Tie::File so that you don't have to read everything into memory, but here's one way that uses the @data list you set up:
my $dataFirstLine = $data[0];
chomp($dataFirstLine);
my @dataColumns = split("\t", $dataFirstLine); # assumes delimiter is tab, replace with escaped delimiter of choice
my $dataColumnCount = scalar @dataColumns;
print "number of columns: $dataColumnCount\n";

Related

Data value of array not printing properly

I have written a script which collects marks of students and print the one who scored above 50.
Script is below:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my @array = (
'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
');
print Dumper(\@array);
my $class = "3";
foreach my $each_value (@array) {
print "EACH: $each_value\n";
my ($name, $score ) = split (/,/, $each_value);
if ($score lt 50) {
next;
} else {
print "$name, \"GOOD SCORE\", $score, $class";
}
}
Here I wanted to print data of STUDENT1, since his score is greater than 50.
So output should be:
STUDENT1, "GOOD SCORE", 90, 3
But its printing output like this:
STUDENT1, "GOOD SCORE", 90
STUDENT2, 3
Something goes wrong between 90 and STUDENT2: the two records are not being separated.
I know this is because I am not splitting the data on the newline character, since the array @array holds only a single element.
How can I split that element on newlines, so that inside the for loop I can split each line again on the comma (,) to get the values into $name and $score?
The @array actually comes in as an argument to this script, so I have to modify the script itself in order to parse the right values.
As you already know your "array" only has one "element" with a string with the actual records in it, so it essentially is more a scalar than an array.
And as you suspect, you can split this scalar just as you already did with the newline as a separator instead of a comma. You can then put a foreach around the result of split() to iterate over the records.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $records = 'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
';
my $class = "3";
foreach my $record (split("\n", $records)) {
my ($name, $score) = split(',', $record);
if ($score >= 50) {
print("$name, \"GOOD SCORE\", $score, $class\n");
}
}
As a small note, lt is a string comparison operator. The numeric comparisons use symbols, such as <.
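To make the difference concrete, here is a small sketch. The values are chosen for illustration and are not taken from the question:

```perl
use strict;
use warnings;

# lt compares character by character, so "100" sorts before "50"
# because "1" comes before "5".
my $string_result = ('100' lt '50') ? 1 : 0;

# < compares the values as numbers.
my $numeric_result = (100 < 50) ? 1 : 0;

print "100 lt 50: $string_result\n";
print "100 <  50: $numeric_result\n";
```

When a score has the same number of digits as 50, the string comparison happens to agree with the numeric one, which is why the bug is easy to miss; a score of 100 would be treated as below 50.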
Although you have an array, you only have a single string value in it:
my @array = (
'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
');
That's not a big deal. Dave Cross has already shown you how you can break that up into multiple values, but there's another way I like to handle multi-line strings. You can open a filehandle on a reference to the string, then read lines from the string as you would a file:
my $string = 'STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
';
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
...
}
One of the things to consider while programming is how many times you are duplicating the data. If you have it in a big string then split it into an array, you've now stored the data twice. That might be fine and it's usually expedient. You can't always avoid it, but you should have some tools in your toolbox that let you avoid it.
And, here's a chance to use indented here docs:
use v5.26;
my $string = <<~"HERE";
STUDENT1,90
STUDENT2,40
STUDENT3,30
STUDENT4,30
HERE
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
...
}
For your particular problem, I think you have a single string where the lines are separated by the '|' character. You don't show how you call this program or get the data, though.
You can choose any line ending you like by setting the value for the input record separator, $/. Set it to a pipe and this works:
use v5.10;
my $string = 'STUDENT1,90|STUDENT2,40|STUDENT3,30|STUDENT4,30';
{
local $/ = '|'; # input record separator
open my $string_fh, '<', \$string;
while( <$string_fh> ) {
chomp;
say "Got $_";
}
}
Now the structure of your program isn't too far away from taking the data from standard input or a file. That gives you a lot of flexibility.
The @array contains only one element. The for loop itself works correctly, so you can fix the problem without changing anything in the for block, just by making each record its own array element:
my @array = (
'STUDENT1,90',
'STUDENT2,40',
'STUDENT3,30',
'STUDENT4,30');
Otherwise, you can split the single element on the newline character (\n) and iterate over the resulting lines.
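Since the array arrives as an argument and cannot be rewritten by hand, the splitting can be done inside the script. This is only a sketch: the array literal below stands in for whatever the caller passes, and @records and @good are names invented for the example:

```perl
use strict;
use warnings;

# Stand-in for the array passed to the script: one element
# holding several newline-separated records.
my @array = ("STUDENT1,90\nSTUDENT2,40\nSTUDENT3,30\nSTUDENT4,30\n");

# Split every element on newlines and drop empty pieces, giving
# one record per element.
my @records = grep { length } map { split /\n/ } @array;

my $class = 3;
my @good;
for my $record (@records) {
    my ($name, $score) = split /,/, $record;
    next if $score < 50;    # numeric comparison
    push @good, qq{$name, "GOOD SCORE", $score, $class};
}
print "$_\n" for @good;
```

This also copes with the case where the caller already passes one record per element, because splitting a single-line element simply returns that line.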

Create a table by merging many files

This seemed like such an easy task, yet I am boggled.
I have text files, each named after a type of tissue (e.g. cortex.txt, heart.txt)
Each file contains two columns, and the column headers are gene_name and expression_value
Each file contains around 30K to 40K rows
I need to merge the files into one file with 29 columns, with headers
genename, tissue1, tissue2, tissue3, etc. to tissue28
So that each row contains one gene and its expression value in the 28 tissues
The following code creates an array containing a list of every gene name in every file:
my @list_of_genes;
foreach my $input_file ( @input_files ) {
print $input_file, "\n";
open ( IN, "outfiles/$input_file");
while ( <IN> ) {
if ( $_ =~ m/^(\w+\|ENSMUSG\w+)\t/) {
# check if the gene is already in the gene list
my $count = grep { $_ eq $1 } @list_of_genes;
# if not in list, add to the list
if ( $count == 0 ) {
push (@list_of_genes, $1);
}
}
}
close IN;
}
The next bit of code I was hoping would work, but the regex only recognises the first gene name.
Note: I am only testing it on one test file called "tissue1.txt".
The idea is to create an array of all the file names, and then take each gene name in turn and search through each file to extract each value and write it to the outfile in order along the row.
foreach my $gene (@list_of_genes) {
# print the gene name in the first column
print OUT $gene, "\t";
# use the gene name to search the first element of the @input_file array and print to the second column
open (IN, "outfiles/tissue1.txt");
while ( <IN> ) {
if ($_ =~ m/^$gene\t(.+)\n/i ) {
print OUT $1;
}
}
print OUT "\n";
}
EDIT 1:
Thank you Borodin. The output of your code is indeed a list of every gene name with all of its expression values in each tissue.
e.g. Bcl20|ENSMUSG00000000317,0.815796340254127,0.815796340245643
This is great, and much better than I managed, thank you. Two additional things are needed.
1) If a gene name is not found in a .txt file then a value of 0 should be recorded
e.g. Ht4|ENSMUSG00000000031,4.75878049632381, 0
2) I need a comma separated header row so that the tissue from which each value comes remains associated with the value (basically a table) - the tissue is the name of the text file
e.g. From 2 files heart.txt and liver.txt the first row should be:
genename|id,heart,liver
where genename|id is always the first header
That's a lot of code to implement the simple idiom of using a hash to enforce uniqueness!
It's looking like you want an array of expression values for each different ENSMUSG string in all *.txt files in your outfiles directory.
If the files you need are the only ones in the outfiles directory, then the solution looks like this. I've used autodie to check the return status of all Perl IO operations (chdir, open, print etc.) and checked only that the $gene value contains |ENSMUSG. You may not need even this check if your input data is well-behaved.
Please forgive me if this is bugged, as I have no access to a Perl compiler at present. I have checked it by sight and it looks fine.
use strict;
use warnings 'all';
use autodie;
chdir '/path/to/outfiles';
my %data;
while ( my $file = glob '*.txt' ) {
open my $fh, '<', $file;
while ( <$fh> ) {
my ($gene, $value) = split;
next unless $gene =~ /\|ENSMUSG/;
push @{ $data{$gene} }, $value;
}
}
print join(',', $_, @{ $data{$_} }), "\n" for keys %data;
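The two extra requirements from EDIT 1 (a 0 for genes missing from a tissue, and a header row naming the tissues) need the values keyed by tissue as well as by gene. The following is a sketch only: build_table and the %expr layout are invented for this example, and the sample values are made up. Filling %expr from the real files would follow the reading loop above, with the tissue name taken from the file name.

```perl
use strict;
use warnings;

# Build CSV lines from a hash of the form
#   $expr{$tissue}{$gene} = $value
# Genes missing from a tissue get 0. Tissues and genes are sorted
# so that the column and row order is stable.
sub build_table {
    my ($expr) = @_;
    my @tissues = sort keys %$expr;

    # Collect every gene name seen in any tissue.
    my %genes;
    for my $tissue (@tissues) {
        $genes{$_} = 1 for keys %{ $expr->{$tissue} };
    }

    my @lines = join ',', 'genename|id', @tissues;    # header row
    for my $gene (sort keys %genes) {
        push @lines, join ',', $gene,
            map { $expr->{$_}{$gene} // 0 } @tissues;
    }
    return @lines;
}

# Made-up values, as if read from heart.txt and liver.txt.
my %expr = (
    heart => { 'Ht4|ENSMUSG00000000031' => 4.75, 'Bcl20|ENSMUSG00000000317' => 0.81 },
    liver => { 'Bcl20|ENSMUSG00000000317' => 0.82 },
);
print "$_\n" for build_table(\%expr);
```

Keying by tissue first keeps each value tied to its column, which the array-per-gene layout cannot do once a gene is absent from one file.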

identify and insert the missing rows

An array is populated from a tab delimited text (5 column) file that sometimes is missing rows. I need to identify and insert the missing rows. Inserting a string "blank row found" is sufficient.
Here is an example of data from file:
chr1:11174372 MTOR 42939 42939 7
chr1:65310459 JAK1 1948 1948 3
I’ve created an array of elements that identifies the second column of each row that should be present in the file, in the order each row should be present. However, I'm not sure how to continue from here, since I'm unable to install any Perl modules on the server (e.g. Arrays::Utils).
Is comparing arrays the correct way of approaching this problem? Perhaps there is a straightforward solution, that doesn’t require installation of any CPAN modules? Thanks for your help.
#!perl
use strict;
use warnings;
use File::Basename;
#use Arrays::Utils;
opendir my $dir, "/data/test_all_runs" or die "Cannot open directory: $!";
my @run_folder = readdir $dir;
closedir $dir;
my $run_folder = pop @run_folder; print "The folder is".$run_folder."\n";
my $home="/data/";
my $CNV_file = $home."test_all_runs/".$run_folder."/CNV.txt";
my @CNVarray;
open(TXT2, "$CNV_file");
while (<TXT2>){
push (@CNVarray, $_);
}
close(TXT2);
foreach (@CNVarray){
chop($_);
}
my @array1 = map { $_->[1] } @CNVarray;
my @array2 = qw(MTOR JAK1 NRAS DDR2 MYCN ALK IDH1 ERBB4 RAF1 CTNNB1 PIK3CA DCUN1D1 FGFR3 PDGFRA KIT APC FGFR4 ROS1 ESR1 EGFR CDK6 MET SMO BRAF FGFR1 MYC JAK2 GNAQ RET FGFR2 HRAS CCND1 BIRC2 KRAS ERBB3 CDK4 AKT1 MAP2K1 IDH2 NF1 ERBB2 BRCA1 GNA11 MAP2K2 JAK3 AR MED12);
my %array1_hash;
my %array2_hash;
# Create a hash entry for each element in @array1
for my $element ( @array1 ) {
$array1_hash{$element} = @array1;
}
# Same for @array2: This time, use map instead of a loop
map { $array_2{$_} = 1 } @array2;
for my $entry ( @array2 ) {
if ( not $array1_hash{$entry} ) {
return 1; #Entry in @array2 but not @array1: Differ
}else {
return 0; #Arrays contain the same elements
}
#if ( keys %array_hash1 != keys %array_hash2 ) {
#return 1; #Arrays differ
}
Note: The best version is reached at the end. It is only a few lines of code.
If I get it right, you have a separate reference list of key-words that need to be in the second field in a row, with rows in that order. One way to find skipped rows is to iterate through both lists.
That approach can be picky and error prone but here it can be made easier by removing the front element from the reference list each time. Then you always need to compare the current line against the first element in the reference list. Here is the basic logic, with the better version further below.
use warnings;
use strict;
open my $cnv_fh, '<', $CNV_file or die "Can't open $CNV_file: $!";
my @CNVarray = <$cnv_fh>;
close $cnv_fh;
chomp(@CNVarray);
my @ref_list = qw(MTOR JAK1 ...);
foreach my $line (@CNVarray)
{
if ( (split /\t/, $line)[1] eq $ref_list[0] ) { # good row
shift @ref_list;
print $line, "\n";
}
else {
shift @ref_list;
print "blank row found\n";
while ( (split /\t/, $line)[1] ne $ref_list[0] ) {
# multiple missing rows? keep going through the reference list
shift @ref_list;
print "blank row found\n";
}
# the reference list has caught up with this row: consume it and print the row
shift @ref_list;
print $line, "\n";
}
}
# We are done with the array, but are there more reference items?
print "blank row found\n" for @ref_list;
The while loop is needed since multiple rows can be missing (in a row), so we need to get to the place in the reference list that does match the current row. A few notes on the code.
The filehandle read <...> in the list context returns a list with all lines from the resource.
The chop in the original code removes the last character, probably not what you want. It is the chomp that removes the new line (or really $/).
Tested against the reference list qw(AA BB CC DD EE) with the input file (note spaces not tabs)
1 AA first
2 BB more
5 EE last
To test with this, change /\t/ to /\s/ (which will then work for tabs as well). It prints
1 AA first
2 BB more
blank row found
blank row found
5 EE last
With further elements added to the @ref_list (FF etc) further blank ... lines are printed.
The code above can be simplified. Lines are also collected in an array, then printed to a new file.
use warnings;
use strict;
open my $cnv_fh, '<', $CNV_file or die "Can't open $CNV_file: $!";
my @CNVarray = <$cnv_fh>;
close $cnv_fh;
chomp(@CNVarray);
my @ref_list = qw(MTOR JAK1 ...);
my @new_lines;
foreach my $line (@CNVarray)
{
while ( (split /\t/, $line)[1] ne $ref_list[0] ) {
shift @ref_list;
push @new_lines, 'blank row found';
print "blank row found\n";
}
shift @ref_list;
push @new_lines, $line;
print $line, "\n";
}
# There may be more items remaining on the reference list
for (@ref_list) {
push @new_lines, 'blank row found';
print "blank row found\n"
}
my $filled_file = 'skipped_rows_added.txt';
open my $out_fh, '>', $filled_file or die "Can't open $filled_file: $!";
print $out_fh "$_\n" for @new_lines;
close $out_fh;
This behaves the same way with the test input above. It can be simplified further still:
foreach my $line (@CNVarray)
{
while ( (split /\t/, $line)[1] ne shift @ref_list ) {
print "blank row found\n";
}
print $line, "\n";
}
The shift returns the removed element, which is exactly what needs to be tested against.
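One caveat with the loops above: if a row's key is not in the remaining reference list (a typo, or rows out of order), the while loop never finds a match and keeps shifting an empty list. A guarded variant might look like the sketch below. The name fill_missing is invented here, it returns the lines instead of printing them, and it splits on /\s+/ so that it runs against the space-separated test input shown earlier; use /\t/ for the real file.

```perl
use strict;
use warnings;

# Return the input lines with 'blank row found' inserted wherever a
# reference key is skipped. Dies on a row whose key is not in the
# remaining reference list instead of looping forever.
sub fill_missing {
    my ($lines, $ref_list) = @_;
    my @ref = @$ref_list;    # work on a copy of the reference list
    my @out;
    for my $line (@$lines) {
        my $key = (split /\s+/, $line)[1];
        die "unexpected or out-of-order row: $line\n"
            unless defined $key && grep { $_ eq $key } @ref;
        # Every reference item shifted off before the match is a missing row.
        push @out, 'blank row found' while shift(@ref) ne $key;
        push @out, $line;
    }
    # Reference items left over after the last line are missing too.
    push @out, ('blank row found') x @ref;
    return @out;
}

my @filled = fill_missing(
    [ '1 AA first', '2 BB more', '5 EE last' ],
    [ qw(AA BB CC DD EE) ],
);
print "$_\n" for @filled;
```

The grep makes each row cost a scan of the reference list, which is negligible for a list of a few dozen gene names.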
A note on split syntax, following the code update ("\t" changed to /\t/).
When invoked as split /$patt/, $str, the $patt is used as a regular expression, with a few very minor differences. So with /\s/ the string is split on white space as understood in regex, thus including the tab, for example.
With double quotes "..." used instead of /.../, what is inside is interpolated first which may result in surprises, in particular with escapes. (Unless it is used as m"..." in which case it is merely a regex with " being the delimiter.)
In the above code for the tab one can use /\t/, or "\t", or '\t' (or /\s/ which includes yet other types of space). The "\t" was changed to /\t/, which is better in my opinion, being clearer (it is a regex, no questions asked). Thanks to Borodin for the early edit and for the comment.
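A quick way to convince yourself that the three spellings behave alike for a tab is to run them side by side. The sample row below is adapted from the question's data and the variable names are only for this sketch:

```perl
use strict;
use warnings;

my $row = "chr1:11174372\tMTOR\t42939";

# All three forms split on a tab: split treats its first argument
# as a pattern, and \t inside a pattern matches a tab.
my @with_regex  = split /\t/, $row;
my @with_double = split "\t", $row;    # string already holds a real tab
my @with_single = split '\t', $row;    # backslash-t, compiled as a pattern

print scalar(@with_regex), " fields: @with_regex\n";
print "same result\n"
    if "@with_regex" eq "@with_double" && "@with_double" eq "@with_single";
```

The one string that split does treat specially is a single space, ' ', which splits on runs of whitespace after skipping leading whitespace.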
I would write this as follows.
The input file is read into a hash, keyed by the value of the second column. Then the hash is read back and printed in the specified sequence of keys.
Most of the code is finding the input file and setting up the sequence of keys. The core of the program is only three lines of code.
use strict;
use warnings 'all';
use File::Spec::Functions 'catfile';
my $home = '/data';
my @run_folder = grep -f, glob catfile($home, 'test_all_runs', '*', 'CNV.txt');
die "No CNV file found" unless @run_folder;
my $cnv_file = $run_folder[-1];
print "The file is $cnv_file\n\n";
my @sequence = qw/
MTOR JAK1 NRAS DDR2 MYCN ALK
IDH1 ERBB4 RAF1 CTNNB1 PIK3CA DCUN1D1
FGFR3 PDGFRA KIT APC FGFR4 ROS1
ESR1 EGFR CDK6 MET SMO BRAF
FGFR1 MYC JAK2 GNAQ RET FGFR2
HRAS CCND1 BIRC2 KRAS ERBB3 CDK4
AKT1 MAP2K1 IDH2 NF1 ERBB2 BRCA1
GNA11 MAP2K2 JAK3 AR MED12
/;
open my $fh, '<', $cnv_file or die qq{Unable to open "$cnv_file" for input: $!};
my %data;
$data{ (split)[1] } = $_ while <$fh>;
print $data{$_} // "no data for $_\n" for @sequence;
output
The file is /data/test_all_runs/XXX/CNV.txt
chr1:11174372 MTOR 42939 42939 7
chr1:65310459 JAK1 1948 1948 3
no data for NRAS
no data for DDR2
no data for MYCN
no data for ALK
no data for IDH1
no data for ERBB4
no data for RAF1
no data for CTNNB1
no data for PIK3CA
no data for DCUN1D1
no data for FGFR3
no data for PDGFRA
no data for KIT
no data for APC
no data for FGFR4
no data for ROS1
no data for ESR1
no data for EGFR
no data for CDK6
no data for MET
no data for SMO
no data for BRAF
no data for FGFR1
no data for MYC
no data for JAK2
no data for GNAQ
no data for RET
no data for FGFR2
no data for HRAS
no data for CCND1
no data for BIRC2
no data for KRAS
no data for ERBB3
no data for CDK4
no data for AKT1
no data for MAP2K1
no data for IDH2
no data for NF1
no data for ERBB2
no data for BRCA1
no data for GNA11
no data for MAP2K2
no data for JAK3
no data for AR
no data for MED12

Read a CSV file with uneven commas but fixed number of columns

I want to be able to read this CSV file into an array of arrays or hashes for manipulation. How can I go about it?
For example my file contains the following (the first line is the header):
Name,Age,Items,Available
John,29,laptop,mouse,Yes
Jane,28,desktop,keyboard,mouse,yes
Doe,56,tablet,keyboard,trackpad,touchpen,Yes
First column is name, second is Age, third is Items, But items can contain more than one thing separated by commas, and last column is Person availability.
How can I accurately read this?
Well-formed CSV quotes fields that contain a comma as part of the value. If your CSV is well-formed use the Text::CSV module:
use Text::CSV;
my $csv = Text::CSV->new();
while (my $row = $csv->getline(\*DATA)) {
my $name = $row->[0];
my $age = $row->[1];
my @items = split /,/, $row->[2];
my $available = $row->[3];
print "$name/$age/@items/$available\n";
}
__DATA__
Name,Age,Items,Available
John,29,"laptop,mouse",Yes
Jane,28,"desktop,keyboard,mouse",yes
Doe,56,"tablet,keyboard,trackpad,touchpen",Yes
Output:
Name/Age/Items/Available
John/29/laptop mouse/Yes
Jane/28/desktop keyboard mouse/yes
Doe/56/tablet keyboard trackpad touchpen/Yes
If your CSV is not well-formed you'll need to implement a custom parse based on knowledge of your data. Assuming that the Items column is the only multi-valued field you can split on a comma and then remove the fields with a known position. Whatever is left is the items.
while (my $line = <DATA>) {
chomp $line;
my @record = split /,/, $line;
my $name = shift @record;
my $age = shift @record;
my $available = pop @record;
my @items = @record;
print "$name/$age/@items/$available\n";
}
__DATA__
Name,Age,Items,Available
John,29,laptop,mouse,Yes
Jane,28,desktop,keyboard,mouse,yes
Doe,56,tablet,keyboard,trackpad,touchpen,Yes
Alternately, you could use array slicing to get the same result:
my ($name, $age, $available, @items) = @record[0, 1, -1, 2 .. @record - 2];
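The question also asks for hashes. Under the same assumption (Items is the only column that can contain extra commas), the shift/pop approach extends to an array of hash references keyed by the header names. This is a sketch, and parse_uneven_csv is a name made up for it:

```perl
use strict;
use warnings;

# Parse lines where only the third column may hold extra commas.
# Returns a list of hash references keyed by the header names.
sub parse_uneven_csv {
    my @lines = @_;
    chomp @lines;
    my @header = split /,/, shift @lines;    # Name, Age, Items, Available
    my @rows;
    for my $line (@lines) {
        my @fields = split /,/, $line;
        my %row;
        $row{ $header[0] }  = shift @fields;    # first column
        $row{ $header[1] }  = shift @fields;    # second column
        $row{ $header[-1] } = pop @fields;      # last column
        $row{ $header[2] }  = [@fields];        # whatever is left
        push @rows, \%row;
    }
    return @rows;
}

my @people = parse_uneven_csv(
    "Name,Age,Items,Available\n",
    "John,29,laptop,mouse,Yes\n",
    "Doe,56,tablet,keyboard,trackpad,touchpen,Yes\n",
);
print "$_->{Name}: @{ $_->{Items} } ($_->{Available})\n" for @people;
```

Storing Items as an array reference keeps the individual items separate, so nothing has to be re-split later.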
Since your data is, in reality, a properly-formatted CSV file, you can use the standard tools to read and store it
Here's the data I'm now assuming that you have
Name,Age,Items,Available
John,29,"laptop,mouse",Yes
Jane,28,"desktop,keyboard,mouse",yes
Doe,56,"tablet,keyboard,trackpad,touch pen",Yes
Solution
Like my original answer, this code uses Text::CSV to parse each line of input. But instead of having to reformat it, each row may be pushed directly onto array @data
Also as before, it conforms to the standard of reading from STDIN. But this time I have used Data::Dump to reveal the in-memory data structure that has been built. If you run it on the command line you should use
$ perl unpack_csv.pl text.csv
use strict;
use warnings 'all';
use Text::CSV;
my $csv = Text::CSV->new;
my @data;
while ( <> ) {
$csv->parse($_);
my @row = $csv->fields;
push @data, \@row;
}
use Data::Dump;
dd \@data;
Update
I now realise that the OP's file may well contain properly-formatted CSV data, which makes this answer superfluous
However the question has not been changed to show the real data, so I am leaving this answer here in case the question's subject line and content entices people with a problem that this will solve
I recommend that you use an intermediate program to format your CSV file properly. Once you have a standard-format file, the resulting output can then be processed using Perl with Text::CSV, Excel, or anything similar
This program uses Text::CSV to read your input data and write the Items column enclosed in quotes if necessary
It works by using Text::CSV->parse to split each line into fields, and then reserving the first two and final fields for new fields 1, 2 and 4. Whatever is left is joined with a comma , and used for field 3. The four resulting values are passed back to Text::CSV->combine and printed
It conforms to the standard of reading from STDIN and writing to STDOUT, so if you run it on the command line you should use
$ perl reformat_csv.pl text.csv > new_text.csv
use strict;
use warnings 'all';
use Text::CSV;
my $csv = Text::CSV->new;
while ( <> ) {
$csv->parse($_);
my @row = $csv->fields;
my $f1 = shift @row;
my $f2 = shift @row;
my $f4 = pop @row;
my $f3 = join ',', @row;
$csv->combine($f1, $f2, $f3, $f4);
print $csv->string, "\n";
}
output
Name,Age,Items,Available
John,29,"laptop,mouse",Yes
Jane,28,"desktop,keyboard,mouse",yes
Doe,56,"tablet,keyboard,trackpad,touchpen",Yes

In a file/array, search for hash key, and replace it with the hash value, do this for all hash keys/values

I've searched around the site and surprisingly I can't seem to find something that will work for my particular problem. So I figured I'd post it and see how some of you more experienced programmers can address this problem.
I have a spreadsheet like text file (many lines with tab delimited columns), that I would like to search through for certain labels (ex scaffold1253.1_size81005.6.32799_7496) and replace them with more simplified labels (ex scaffold1253.1a). These labels are only in the first column of the text file. I've already written the script such that I have a hash with the old labels as keys corresponding to the new labels as their respective values. This hash has about 26000 lines. So essentially I'd like to take the hash keys 1 by 1, search for them in the text file, and replace them with their respective hash values.
I have a pretty good server available, so if it's too complicated to make it first-column specific to speed up the process, then that's OK.
This is what I have so far:
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
@gtfarray = <FASTAFILE2>;
#print @gtfarray;
my %hash;
while (<>)
{
chomp;
my ($key, $val) = split /\t/;
$hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
#print %hash;
while (my ($find, $replace) = each %hash) {
foreach (@gtfarray){
$_ =~ s/$find/$replace/g;
push @newgtf, $_;
}
}
print @newgtf;
This code doesn't seem to work as it doesn't complete. I'm pretty sure it's a problem with the foreach loop structure. Sorry I don't know of any other way to do this. Does anyone have a better way to run through this file and conduct the replacement?
Any input would be greatly appreciated!
Thanks,
Andrew
@DVK
Here is the full script with your mods that runs into syntax errors with your while loop, any idea why it's not accepting it? Thanks again!
use warnings;
$gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open(FASTAFILE2, $gtf);
my %hash;
while (<>){
chomp;
my ($key, $val) = split /\t/;
$hash{$key} .= exists $hash{$key} ? ",$val" : $val;
}
while $line (<FASTAFILE2>){
my @fields = split(/\t/, $line);
# If you only care about first column, don't need the foreach loop below;
# just do the loop insides on $fields[0]
foreach my $field (@fields) {
$field = $hash{$field} if exists $hash{$field};
print $outfile "$field\t"; # Small bug - will print trailing \t
}
print $outfile "\n"
}
__END__
Here is the syntax error:
perl gtf_mod2.pl <./Hc_genome/header_file.txt
syntax error at gtf_mod2.pl line 14, near "while $line "
syntax error at gtf_mod2.pl line 23, near "}"
Execution of gtf_mod2.pl aborted due to compilation errors.
You exhaust your file the first time through your loop using the initial $find and $replace key/value pair.
There are two potential solutions:
Open the file for reading during each iteration of your while loop (expensive)
Move the foreach loop to the outside of the while and iterate the hash each time (less expensive)
example:
REPLACE:
for my $line (@gtfarray) {
keys %hash; # reset the each() iterator, which the early exit below leaves part-way through the hash
while(my ($find, $replace) = each %hash) {
if($line =~ s/$find/$replace/g) {
push @newgtf, $line;
next REPLACE; # skip to next iteration
}
}
# if there was no replacement, push the old line
push @newgtf, $line
}
How big is the file that you are replacing the first column in?
If it's >50,000 lines, you are better off doing the reverse:
Iterate through hash file once, and store that hash in memory
Iterate through main file once, and for every line, for every column, find that value in the memorized hash, replace with hash value if found, and write.
In other words, remove the first @gtfarray = <FASTAFILE2>; and replace your last while loop with:
while (my $line = <FASTAFILE2>) {
chomp $line; # strip the newline so the last field can match a hash key
my @fields = split(/\t/, $line);
# If you only care about first column, don't need the foreach loop below;
# just do the loop insides on $fields[0]
foreach my $field (@fields) {
$field = $hash{$field} if exists $hash{$field};
print $outfile "$field\t"; # Small bug - will print trailing \t
}
print $outfile "\n";
}
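If only the first column matters, as the comment in the loop suggests, the per-line work reduces to one hash lookup. The following is a sketch rather than a drop-in replacement: relabel_first_column is an invented name, and the lookup table and input line are made up for illustration.

```perl
use strict;
use warnings;

# Replace the first tab-separated field of a line when it is a key
# of the lookup hash; leave every other field untouched.
sub relabel_first_column {
    my ($line, $labels) = @_;
    chomp $line;
    return $line unless length $line;    # nothing to do for an empty line
    # A limit of 2 splits off the first field and keeps the rest intact.
    my ($first, $rest) = split /\t/, $line, 2;
    $first = $labels->{$first} if exists $labels->{$first};
    return defined $rest ? "$first\t$rest" : $first;
}

# Made-up lookup table and input line.
my %labels = ('scaffold1253.1_size81005.6.32799_7496' => 'scaffold1253.1a');
my $line   = "scaffold1253.1_size81005.6.32799_7496\tAUGUSTUS\tgene\n";

my $relabeled = relabel_first_column($line, \%labels);
print "$relabeled\n";
```

Splitting with a limit of 2 also avoids the trailing-tab problem, because the remaining columns are never taken apart and rejoined.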
NOTE: I'm making an assumption that the fields contain FULL contents of your hash keys (e.g. your data file would contain a field with "scaffold1253.1_size81005.6.32799_7496" but NOT a field with "XYZscaffold1253.1_size81005.6.32799_7496___IOU").
If that assumption is wrong and you really DO need to run a regex because your scaffold strings may be contained in longer strings, there may still be a better solution aside from running O(N*M) regexes: if your scaffold strings are all of a certain well defined format (e.g. "scaffoldNNNNN.NNN_sizeNNNNN.NNN.NNNN_NNNN"), what you need to do then is:
For each line of data file, run a single regex finding that pattern, with the entire pattern inside a capture group parenthesis:
@matches = ($line =~ m/(scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+)/g );
Then, look up every value of @matches array in the hash. If found, run ONLY the matches as a s/// regex.
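The find-then-look-up steps can also be folded into a single substitution with the /e modifier, which evaluates the replacement as code. This is a sketch under the same assumption about the label format; relabel_line and $label_re are names invented here, and the data is made up:

```perl
use strict;
use warnings;

# Pattern for the assumed label format; adjust it to the real data.
my $label_re = qr/scaffold\d+\.\d+_size\d+\.\d+\.\d+_\d+/;

# Replace every label in the line that has an entry in the hash;
# labels without an entry are put back unchanged.
sub relabel_line {
    my ($line, $labels) = @_;
    $line =~ s/($label_re)/exists $labels->{$1} ? $labels->{$1} : $1/ge;
    return $line;
}

my %labels = ('scaffold1253.1_size81005.6.32799_7496' => 'scaffold1253.1a');
my $input  = "XYZscaffold1253.1_size81005.6.32799_7496___IOU\tgene";
my $output = relabel_line($input, \%labels);
print "$output\n";
```

Each line is scanned once with one fixed pattern, so the cost no longer grows with the number of hash keys.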
Looking at your previous post, wouldn't it be simpler to create the shortened 'id' while reading the file? Then you would have no need of the other file where you get your hash.
Here is the (untested) code below. (would need to direct the print statements to an output file on the command line or open a file for writing in your script).
#!/usr/bin/perl
use strict;
use warnings;
my $gtf = './Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf';
open my $FASTAFILE2, "<", $gtf or die "Unable to open '$gtf' for reading. $!";
my %seen;
while (<$FASTAFILE2>) {
chomp;
my ($id, $val) = split /\t/, $_, 2;
# copy $id to $prefix and
# remove everything after '.1' in $prefix
(my $prefix = $id) =~ s/\.1\K.*//;
if ($seen{$id}) {
++$seen{$id};
}
else {
$seen{$id} = 'a';
}
print "$prefix$seen{$id}\t$val\n";
}
close $FASTAFILE2 or die "Unable to close '$gtf' from reading. $!";
Could it be a job for Tie::File? Assuming, that is, the data file could be operated on as an array.
use Tie::File;
my $file = "./Hc_genome/Hc_rztk_1+2+8+9.augustus.gtf";
tie @lines, 'Tie::File', $file or die ;
for (@lines) {
s/Oldlabel/NewLabel/g; # Change this to fit
}
untie @lines ;
Tie::File does a bunch of tricks to keep the "in place " changes to the file memory efficient.