parse a huge text file in perl

parse a huge text file in perl - perl

I have a text file which is tab separated. They can be quite big upto 1 GB. I will have variable number of columns depending on the number of sample in them. Each sample have eight columns.For example, sampleA : ID1, id2, MIN_A, AVG_A, MAX_A,AR1_A,AR2_A,AR_A,AR_5. Of which the ID1, and id2 are the common to all the samples. What I want to achieve is split the whole file in to chunks of files depending on the number of samples.
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,3535,4545,5656,5656,7675,67567,57758,875,8678,578,57856785,85587,574,56745,567356,675489,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853,457328,3457385,567438,5678934,56845,567348,58567,548948,58649,5839,546847,458274,758345,4572384,4758475,47487
This is how my model file looks, I want to have them as :
File A :
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A
12,134,3535,4545,5656,5656,7675,67567,57758,875
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853
File B:
ID1, ID2,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B
12,134,8678,578,57856785,85587,574,56745,567356,675489
454385,3457,457328,3457385,567438,5678934,56845,567348,58567,548948
File C:
ID1, ID2,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,58649,5839,546847,458274,758345,4572384,4758475,47487.
Is there any easy way of doing this than going thorough an array?
How I have worked out my logic is counting the (number of headers - 2) and dividing them by 8 will give me the number of Samples in the file. And then going through each element in an array and to parse them . Seems to be a tedious way of doing this. I would be happy to know any simpler way of handling this.
Thanks
Sipra

#!/bin/env perl
use strict;
use warnings;
# open three output filehandles
my %fh;
for (qw[A B C]) {
open $fh{$_}, '>', "file$_" or die $!;
}
# open input
open my $in, '<', 'somefile' or die $!;
# read the header line. there are no doubt ways to parse this to
# work out what the rest of the program should do.
<$in>;
while (<$in>) {
chomp;
my #data = split /,/;
print $fh{A} join(',', #data[0 .. 9]), "\n";
print $fh{B} join(',', #data[0, 1, 10 .. 17]), "\n";
print $fh{C} join(',', #data[0, 1, 18 .. $#data]), "\n";
}
Update: I got bored and made it cleverer, so it automatically handles any number of 8-column records in a file. Unfortunately, I don't have time to explain it or add comments.
#!/usr/bin/env perl
use strict;
use warnings;
# open input
open my $in, '<', 'somefile' or die $!;
chomp(my $head = <$in>);
my #cols = split/,/, $head;
die 'Invalid number of records - ' . #cols . "\n"
if (#cols -2) % 8;
my #files;
my $name = 'A';
foreach (1 .. (#cols - 2) / 8) {
my %desc;
$desc{start_col} = (($_ - 1) * 8) + 2;
$desc{end_col} = $desc{start_col} + 7;
open $desc{fh}, '>', 'file' . $name++ or die $!;
print {$desc{fh}} join(',', #cols[0,1],
#cols[$desc{start_col} .. $desc{end_col}]),
"\n";
push #files, \%desc;
}
while (<$in>) {
chomp;
my #data = split /,/;
foreach my $f (#files) {
print {$f->{fh}} join(',', #data[0,1],
#data[$f->{start_col} .. $f->{end_col}]),
"\n";
}
}

This is independent to the number of samples. I'm not confident on the output file name though because you might reach more than 26 samples. Just replace how the output file name works if that's the case. :)
use strict;
use warnings;
use File::Slurp;
use Text::CSV_XS;
use Carp qw( croak );
#I'm lazy
my #source_file = read_file('source_file.csv');
# you metion yours is tab separated
# just add the {sep_char => "\t"} inside new
my $csv = Text::CSV_XS->new()
or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
my $output_file;
#read each row
while ( my $raw_line = shift #source_file ) {
$csv->parse($raw_line);
my #fields = $csv->fields();
#get the first 2 ids
my #ids = splice #fields, 0, 2;
my $group = 0;
while (#fields) {
#get the first 8 columns
my #columns = splice #fields, 0, 8;
#if you want to change the separator of the output replace ',' with "\t"
push #{ $output_file->[$group] }, (join ',', #ids, #columns), $/;
$group++;
}
}
#for filename purposes
my $letter = 65;
foreach my $data (#$output_file) {
my $output_filename = sprintf( 'SAMPLE_%c.csv', $letter );
write_file( $output_filename, #$data );
$letter++;
}
#if you reach more than 26 samples then you might want to use numbers instead
#my $sample_number = 1;
#foreach my $data (#$output_file) {
# my $output_filename = sprintf( 'sample_%s.csv', $sample_number );
# write_file( $output_filename, #$data );
# $sample_number++;
#}

Here is a one liner to print the first sample, you can write a shell script to write the data for different samples into different files
perl -F, -lane 'print "#F[0..1] #F[2..9]"' <INPUT_FILE_NAME>

You said tab separated, but your example shows it being comma separated. I take it that's a limitation in putting your sample data in Markdown?
I guess you're a bit concerned about memory, so you want to open the multiple files and write them as you parse your big file.
I would say to try Text::CSV::Simple. However, I believe it reads the entire file into memory which might be a problem for a file this size.
It's pretty easy to read a line, and put that line into a list. The issue is mapping the fields in that list to the names of the fields themselves.
If you read in a file with a while loop, you're not reading the whole file into memory at once. If you read in each line, parse that line, then write that line to the various output files, you're not taking up a lot of memory. There's a cache, but I believe it's emptied after a \n is written to the file.
The trick is to open the input file, then read in the first line. You want to create some sort of field mapping structure, so you can figure out which fields to write to each of the output files.
I would have a list of all the files you need to write to. This way, you can go through the list for each file. Each item in the list should contain the information you need for writing to that file.
First, you need a filehandle, so you know which file you're writing to. Second, you need a list of the field numbers you've got to write to that particular output file.
I see some sort of processing loop like this:
while (my $line = <$input_fh>) { #Line from the input file.
chomp $line;
my #input_line_array = split /\t/, $line;
my $fileHandle;
foreach my $output_file (#outputFileList) { #List of output files.
$fileHandle = $output_file->{FILE_HANDLE};
my #fieldsToWrite;
foreach my $fieldNumber (#{$output_file->{FIELD_LIST}}) {
push $fieldsToWrite, $input_line_array[$field];
}
say $file_handle join "\t", #fieldsToWrite;
}
}
I'm reading in one line of the input file into $line and dividing that up into fields which I am putting in the #input_line_array. Now that I have the line, I have to figure out which fields get written to each of the output files.
I have a list called #outputFileList that is a list of all the output files I want to write to. $outputFileList[$fileNumber]->{FILE_HANDLE} contains the file handle for my output file $fileNumber. $ouputFileList[$fileNumber]->{FIELD_LIST} is a list of fields I want to write to output file $fileNumber. This is indexed to the fields in #input_line_array. So if
$outputFileList[$fileNumber]->{FIELD_LIST} = [0, 1, 2, 4, 6, 8];
Means that I want to write the following fields to my output file: $input_line_array[0], $input_line_array[1], $input_line_array[2], $input_line_array[4], $input_line_array[6], and $input_line_array[8] to my output file $outputFileList->[$fileNumber]->{FILE_HANDLE} in that order as a tab separated list.
I hope this is making some sense.
The initial problem is reading in the first line of <$input_fh> and parsing it into the needed complex structure. However, now that you have an idea on how this structure needs to be stored, parsing that first line shouldn't be too much of an issue.
Although I didn't use object oriented code in this example (I'm pulling this stuff out of my a... I mean... brain as I write this post). I would definitely use an object oriented code approach with this. It will actually make things much faster by removing errors.

Related

identify and insert the missing rows

An array is populated from a tab delimited text (5 column) file that sometimes is missing rows. I need to identify and insert the missing rows. Inserting a string "blank row found" is sufficient.
Here is an example of data from file:
chr1:11174372 MTOR 42939 42939 7
chr1:65310459 JAK1 1948 1948 3
I’ve created an array of elements that identifies the second column of each row that should be present in the file, in the order each row should be present. However, I'm not sure how to continue from here, since I'm unable to install any Perl modules on the server (e.g. Arrays::Utils).
Is comparing arrays the correct way of approaching this problem? Perhaps there is a straightforward solution, that doesn’t require installation of any CPAN modules? Thanks for your help.
#!perl
use strict;
use warnings;
use File::Basename;
#use Arrays::Utils;
opendir my $dir, "/data/test_all_runs" or die "Cannot open directory: $!";
my #run_folder = readdir $dir;
closedir $dir;
my $run_folder = pop #run_folder; print "The folder is".$run_folder."\n";
my $home="/data/";
my $CNV_file = $home."test_all_runs/".$run_folder."/CNV.txt";
my #CNVarray;
open(TXT2, "$CNV_file");
while (<TXT2>){
push (#CNVarray, $_);
}
close(TXT2);
foreach (#CNVarray){
chop($_);
}
my #array1 = map { $_->[1] } #CNVarray;
my #array2 = qw(MTOR JAK1 NRAS DDR2 MYCN ALK IDH1 ERBB4 RAF1 CTNNB1 PIK3CA DCUN1D1 FGFR3 PDGFRA KIT APC FGFR4 ROS1 ESR1 EGFR CDK6 MET SMO BRAF FGFR1 MYC JAK2 GNAQ RET FGFR2 HRAS CCND1 BIRC2 KRAS ERBB3 CDK4 AKT1 MAP2K1 IDH2 NF1 ERBB2 BRCA1 GNA11 MAP2K2 JAK3 AR MED12);
my %array1_hash;
my %array2_hash;
# Create a hash entry for each element in #array1
for my $element ( #array1 ) {
$array1_hash{$element} = #array1;
}
# Same for #array2: This time, use map instead of a loop
map { $array_2{$_} = 1 } #array2;
for my $entry ( #array2 ) {
if ( not $array1_hash{$entry} ) {
return 1; #Entry in #array2 but not #array1: Differ
}else {
return 0; #Arrays contain the same elements
}
#if ( keys %array_hash1 != keys %array_hash2 ) {
#return 1; #Arrays differ
}

Note The best version is reached at the end. It is a few lines of code.
If I get it right, you have a separate reference list of key-words that need to be in the second field in a row, with rows in that order. One way to find skipped rows is to iterate through both lists.
That approach can be picky and error prone but here it can be made easier by removing the front element from the reference list each time. Then you always need to compare the current line against the first element in the reference list. Here is the basic logic, with the better version further below.
use warnings;
use strict;
open my $cnv_fh, '<', $CNV_file or die "Can't open $CNV_file: $!";
my #CNVarray = <$cnv_fh>;
close $cnv_fh;
# chomp(#CNVarray);
my #ref_list = qw(MTOR JAK1 ...);
foreach my $line (#CNVarray)
{
if ( (split /\t/, $line)[1] eq $ref_list[0] ) { # good row
shift #ref_list;
print $line, "\n";
}
else {
shift #ref_list;
print "blank row found\n";
while ( (split /\t/, $line)[1] ne $ref_list[0] ) {
# multiple missing rows? keep going through the reference list
shift #ref_list;
print "blank row found\n";
}
}
# We are done with the array, but are there more reference items?
print "blank row found\n" for #ref_list;
The while loop is needed since multiple rows can be missing (in a row), so we need to get to the place in the reference list that does match the current row. A few notes on the code.
The filehandle read <...> in the list context returns a list with all lines from the resource.
The chop in the original code removes the last character, probably not what you want. It is the chomp that removes the new line (or really $/).
Tested against the reference list qw(AA BB CC DD EE) with the input file (note spaces not tabs)
1 AA first
2 BB more
5 EE last
To test with this, change /\t/ to /\s/ (what will then work for tabs as well). It prints
1 AA first
2 BB more
blank row found
blank row found
5 EE last
With further elements added to the #ref_list (FF etc) further blank ... lines are printed.
The code above can be simplified. Lines are also collected in an array, then printed to a new file.
use warnings;
use strict;
open my $cnv_fh, '<', $CNV_file or die "Can't open $CNV_file: $!";
my #CNVarray = <$cnv_fh>;
close $cnv_fh;
chomp(#CNVarray);
my #ref_list = qw(MTOR JAK1 ...);
my #new_lines;
foreach my $line (#CNVarray)
{
while ( (split /\t/, $line)[1] ne $ref_list[0] ) {
shift #ref_list;
push #new_lines, 'blank row found';
print "blank row found\n";
}
shift #ref_list;
push #new_lines, $line;
print $line, "\n";
}
# There may be more items remaining on the reference list
for (#ref_list) {
push #new_lines, 'blank row found';
print "blank row found\n"
}
my $filled_file = 'skipped_rows_added.txt';
open my $out_fh, '>', $filled_file or die "Can't open $filled_file: $!";
print $out_fh "$_\n" for #new_lines;
close $out_fh;
This behaves the same way with the test input above. It can be simplified further yet
foreach my $line (#CNVarray)
{
while ( (split /\t/, $line)[1] ne shift #ref_list ) {
print "blank row found\n";
}
print $line, "\n";
}
The shift returns the removed element, which is what need be tested against.
A note on split syntax, following the code update ("\t" changed to /\t/).
When invoked as split /$patt/, $str, the $patt is used as a regular expression, with a few very minor differences. So with /\s/ the string is split on white space as understood in regex, thus including the tab, for example.
With double quotes "..." used instead of /.../, what is inside is interpolated first which may result in surprises, in particular with escapes. (Unless it is used as m"..." in which case it is merely a regex with " being the delimiter.)
In the above code for the tab one can use /\t/, or "\t", or '\t' (or /\s/ which includes yet other types of space). The "\t" was changed to /\t/, which is better in my opinion, being clearer (it is a regex, no questions asked). Thanks to Borodin for the early edit and for the comment.

I would write this
The input file is read into a hash, keyed by the value of the second column. Then the hash is read back and printed in the specified sequence of keys
Most of the code is finding the input file and setting up the sequence of keys. The core of the program is only three lines of code
use strict;
use warnings 'all';
use File::Spec::Functions 'catfile';
my $home = '/data';
my #run_folder = grep -f, glob catfile($home, 'test_all_runs', '*', 'CNV.txt');
die "No CNV file found" unless #run_folder;
my $cnv_file = $run_folder[-1];
print "The file is $cnv_file\n\n";
my #sequence = qw/
MTOR JAK1 NRAS DDR2 MYCN ALK
IDH1 ERBB4 RAF1 CTNNB1 PIK3CA DCUN1D1
FGFR3 PDGFRA KIT APC FGFR4 ROS1
ESR1 EGFR CDK6 MET SMO BRAF
FGFR1 MYC JAK2 GNAQ RET FGFR2
HRAS CCND1 BIRC2 KRAS ERBB3 CDK4
AKT1 MAP2K1 IDH2 NF1 ERBB2 BRCA1
GNA11 MAP2K2 JAK3 AR MED12
/;
open my $fh, '<', $cnv_file or die qq{Unable to open "$cnv_file" for input: $!};
my %data;
$data{ (split)[1] } = $_ while <$fh>;
print $data{$_} // "no data for $_\n" for #sequence;
output
The file is /data/test_all_runs/XXX/CNV.txt
chr1:11174372 MTOR 42939 42939 7
chr1:65310459 JAK1 1948 1948 3
no data for NRAS
no data for DDR2
no data for MYCN
no data for ALK
no data for IDH1
no data for ERBB4
no data for RAF1
no data for CTNNB1
no data for PIK3CA
no data for DCUN1D1
no data for FGFR3
no data for PDGFRA
no data for KIT
no data for APC
no data for FGFR4
no data for ROS1
no data for ESR1
no data for EGFR
no data for CDK6
no data for MET
no data for SMO
no data for BRAF
no data for FGFR1
no data for MYC
no data for JAK2
no data for GNAQ
no data for RET
no data for FGFR2
no data for HRAS
no data for CCND1
no data for BIRC2
no data for KRAS
no data for ERBB3
no data for CDK4
no data for AKT1
no data for MAP2K1
no data for IDH2
no data for NF1
no data for ERBB2
no data for BRCA1
no data for GNA11
no data for MAP2K2
no data for JAK3
no data for AR
no data for MED12

Parsing CSV files and matching

I have 10 folders and in each folder, i have two files (CSV,comma delimeted) in the following formats.
File 1:
Ensembl Gene ID,Ensembl Transcript ID,Exon Chr Start (bp),Exon Chr End (bp),Exon Rank in Transcript, Transcript count,Gene End (bp) ,Gene Start (bp),Strand
ENSG00000271782,ENST00000607815,50902700,50902978,1,1,50902978,50902700,-1
ENSG00000232753,ENST00000424955,103817769,103817825,1,1,103828355,103817769,1
ENSG00000232753,ENST00000424955,103827995,103828355,2,1,103828355,103817769,1
ENSG00000225767,ENST00000424664,50927141 50927168,1,1,50936822,50927141,1
File 2:
number,Start pos,End Pos
1,41035,41048
3,36738,36751
3,38169,38182
3,40264,40277
I am trying to match the second file to firstfile
The number in colum1 of second file is the key record number in first file.
Extract the last 3 colums from first file
where the output needed is :
1,ENSG00000271782,41035,41048,50902978,50902700,-1
3,ENSG00000225767,36738,36751,50936822,50927141,1
3,ENSG00000225767,38169,38182,50936822,50927141,1
3,ENSG00000225767,40264,40277,50936822,50927141,1
I have started reading from second using TexT::CSV, but need help.
use strict;
use warnings;
use lib 'C:/Perl/lib';
use Text::CSV;
my $file1 = "infile1";
open my $fh, "<", $file1 or die "$file1: $!";
my $file2 = "infile2"
open my $fh2, "<", $file2 or die "$file2: $!";
my $csv = Text::CSV->new ({
binary => 1,
auto_diag => 1,
});
while (my $row = $csv->getline ($fh2)) {
print "#$row\n"; # I am stuck in extraction ? do I need to put another while loop for fh1
}
close $fh1;
close $fh2;

Since there are no commas enclosed within double-quotes, you can just split on the commas instead of using Text::CSV (which is an excellent module). Given this, the following produces the output you want:
use strict;
use warnings;
use autodie;
my ( $num, %hash ) = 0;
my ( $file1, $file2 ) = qw/inFile1 inFile2/;
open my $fh1, '<', $file1;
while (<$fh1>) {
next if $. == 1;
chomp;
my #fields = split /,/;
$num++ if !$hash{ $fields[0] }++;
push #{ $hash{$num} }, [ #fields[ 0, 6 .. 8 ] ];
}
close $fh1;
open my $fh2, '<', $file2;
while (<$fh2>) {
next if $. == 1;
chomp;
my #fields = split /,/;
if ( my #arr = #{ $hash{ $fields[0] }->[0] } ) {
splice #arr, 1, 0, #fields[ 1, 2 ];
print join( ',', $fields[0], #arr ), "\n";
}
}
close $fh2;
This uses a hash to: 1) keep track of seen Gene IDs, and 2) build a hash of array of arrays (HoAoA). The count--your "key record"--is incremented on unique Gene IDs, so #1 keeps track of these IDs to insure that $num is incremented only if the Gene ID hasn't yet appeared. Number 2 (HoAoA) is used because there are multiple instances of the same Gene ID, but only the values from the first instance are used in the printing. (I did note, however, that the second skips #2, which is the multiple-instance Gene ID.) Perhaps you only needed a hash of arrays (HoA), but it works well the way it is--or you can just modify it, as needed. That is, if you aren't going to use the multiple Gene ID info, the code could be simplified.
Output on your datasets:
1,ENSG00000271782,41035,41048,50902978,50902700,-1
3,ENSG00000225767,36738,36751,50936822,50927141,1
3,ENSG00000225767,38169,38182,50936822,50927141,1
3,ENSG00000225767,40264,40277,50936822,50927141,1
Hope this helps!

The interesting part of this problem is that you need logic that read file 1 until it's ahead of file 2 and logic that reads file 2 until it's ahead of file 1 and logic to know how to act when one is behind the other, and when they are in balance.
You'll need to track unique gene ensemble ids, and their ordinal position in the list. So that when you read the second line of file2, you'll know how to skip the second and third line of file1, but also know not to skip and more in file1 when you've read the third and fourth line in file 1.
Or you can read file1 into memory, and create an array of arrays of lines, so that e.g.
file1arr[1] = [ $line1 ]
file1arr[2] = [ $line2, $line3 ]
file1arr[3] = [ $line4 ]
so when you loop over file 2, all the lines from file1 are in a neat little arrayref at the array index corresponding to the number column of file2.
Then it's just an exercise in iterating of the array of file1 lines, splitting them and building your output lines.

nested while loop comparing two files, outside loop stops after one iteration, need to compare every line of files

Firstly I apologise if my formatting here is incorrect, I am very new to writing scripts (3 days) and this is my first post on this site.
I have two files which are tab separated, File a contains 14 columns, and File b contains 8 columns.
One column in File b has a numeric value which correlates to a range of numbers generated by two numeric fields from File a.
For every line in File a, I need to, search through the File b and print a combination of data from fields on both files. There will be multiple matches for each line of File a due to a numeric range being accepted.
The code that I have created does exactly what I want it to do but only for the first line of File a, and doesn't continue the loop. I have looked all over the internet and I believe it may be something to do with the fact that both files read from standard input. I have tried to correct this problem but I can't seem to get anything to work
My current understanding is that by changing one file to read from a different file descriptor my loop may work... with something such as >$3 but I don't really understand this very well despite my research. Or possibly using the grep function which I am also struggling with.
Here is the outline of the code I am using now:
use strict;
use warnings;
print "which file read from?\n";
my $filea = <STDIN>;
chomp $filea;
{
unless (open ( FILEA, $filea) {
print "cannot open, do you want to try again? y/n?\n?";
my $attempt = <STDIN>;
chomp $again;
if ($again =~ 'n') {
exit;
} else {
print "\n";
$filea = <STDIN>;
chomp $filea;
redo;
}
}
}
#I also open fileb the same way, but wont write it all out to save space and your time.
my output = 'output.txt';
open (OUTPUT, ">>$output");
while (my $loop1 = <FILEA>) {
chomp $loop1;
( my $var1, my $var2, my $var3, my $var4, my $var5, my $var6,
my $var7, my $var8, my $var9, my $var10, my $var11, my $var12,
my $var13, my $var14 ) = split ( "\t", $loop1);
#create the range of number which needs to be matched from file b.
my $length = length ($var4);
my $range = ($var2 + $length);
#perform the search loop through fileb
while (my $loop2 = <FILEB>) {
chomp $loop2;
( my $vala, my $valb, my $valc, my $vald, my $vale, my $valf,
my $valg) = split ( "\t", $loop2 );
#there are then several functions and additions of the data, which all work basicly so I'll just use a quick example.
if ($vald >= $val3 $$ $vald <= $range) {
print OUTPUT "$val1, $vald, $val11, $valf, $vala, $val5 \n";
}
}
}
I hope this all makes sense, I tried to make everything as clear as possible, if anyone could help me edit the code so that the loop continues through all of filea that would be great.
If possible please explain what you've done. Ideally I'd like it if its possible to obtain this result without changing the code too much.
Thanks guys!!!

Avoid naked handles when possible; use $fh (filehandle) instead of FH
You can use until instead of unless, and skip the redo:
print "Enter the file name\n";
my $file_a = <STDIN>;
chomp $file_a;
my $fh_a;
until(open $fh_a, '<', $file_a) {
print "Re-enter the file name or 'n' to cancel\n";
$file_a = <STDIN>;
chomp $file_a;
if($file_a eq 'n') {
exit;
}
}
You can (should) use an array instead of all those individual column variables: my #cols_a = split /\t/, $line;
You should read file B into an array, once, and then search that array each time you need to: my #file_b = <$fh_b>;
The result will look something like this:
#Assume we have opened both files already . . .
my #file_b = <$fh_b>;
chomp #file_b;
while(my $line = <$fh_a>) {
chomp $line;
my #cols_a = split /\t/, $line;
#Remember, most arrays (perl included) are zero-indexed,
#so $cols_a[1] is actually the SECOND column.
my $range = ($cols_a[1] + length $cols_a[3]);
foreach my $line_b (#file_b) {
#This loop will run once for every single line of file A.
#Not efficient, but it will work.
#There are, of course, lots of optimisations you can make
#(starting with, for example, storing file B as an array of array
#references so you don't have to split each line every time)
my #cols_b = split /\t/, $line_b;
if($cols_b[3] > $cols_a[2] && $cols_b[3] < ($cols_a[2] + $range)) {
#Do whatever here
}
}
}

I want to replace a sequence name in fasta file with another name

I have one fasta file and one text file fasta file contains sequences in fasta format and text file contains name of genes now I want to replace name of the sequences in fasta file after '>' sign with the gene names in text file
I am new to perl though I have written a script but I don't know why its not working can anyone help me on that please
following is my script:
print"Enter annotated file...";
$f1=<STDIN>;
print"Enter sequence file...";
$f2=<STDIN>;
open(FILE1,$f1) || die"Can't open $f1";
#annotfile=<FILE1>;
open(FILE2,$f2) || die"Can't open $f2";
#seqfile=<FILE2>;
#d=split('\t',#annotfile[0]);
for($i=0;$i<scalar(#annotfile);$i++)
{
#curr_all=split('\t',#annotfile[$i]);
#curr_id[$i]=#curr_all[0];
#gene_nm[$i]=#curr_all[1];
}
for($j=0;$j<scalar(#seqfile);$j++)
{
$id=#curr_id[$j];
$gene=#gene_nm[$j];
#seqfile[$j]=~s/$id[$j]/$gene[$j]/g;
print #seqfile[$j];
}
my files looks like following:
annot.txt
pool75_contig_389 ubiquitin ligase e3a
pool75_contig_704 tumor susceptibility
pool75_contig_1977 serine threonine-protein phosphatase 4 catalytic subunit
pool75_contig_3064 bardet-biedl syndrome 2 protein P
pool75_contig_2499 succinyl- ligase
goat300.fasta
goat300.fasta
>pool75_contig_704
CCCTTTCTCCCTTCCCAACATTCAGAGATACTGAATCGAAACTCTTACTGTCTGTTAGAT
GACAAAGAGTTATCCATCCTACATACTCCAATTTCCTTCCGCAACTTGTGATTTCGCCGC
TTGAATCTTGACGCCGTGCGTCCACAGTTTGTTGTGTTTTATCAATCAAGGTCATTATCA
ACCGAAGACGCTATCTATTTTCTTGGCGAAGCTCTCGGAAAGGAGCCATCGAAATGGAAG
TATTTCTCAAGAAAGTCCGCGAGTTATCCCGGAAGCAGTTC
>pool75_contig_389
GACCTATACCGGACCGTCACTGAAAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ACGATCCAGGCATGGAGTTGTGGTGACGAGTAGGAGGGTCACCGTGGTGAGCGGGAAGCC
TCGGGCGTGAGCCTGGGTGGAGCCGCCACGGGTGCAGATCTTGGTGGTAGTAGCAAATAT
TCAAGTGAGAACCTTGAAGGCCGAGGTGGAGAAGGNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTCATTTGTAT
CGCCCGGAAAACGTCACAAGAACGGGAGTTGCGTACAGAA
>pool75_contig_1977
AAGGGACACCGTTGGGTGAGGCGAGCTGCGTTCCTCGAACCATGGCTTCAAAAAGCGACT
TAGACCGTCAGATTGAACAGCTCAGGGCCTGCAAGCTCATTACAGAGGATGAGGTTAAGG
CACTCTGCGCTAAGGCGCGTGAGATTTTAATTGAAGAGAGTAATGTCCAGTGCGTGGACT
CACCTGTCACGGTTTGTGGCGATATCCACGGCCAGTTTTACGACTTGATTGAACTGTTTA
AAGTGGGCGGAGATGTTC
>pool75_contig_3064
TTACTATTTCTGGGCCTTAAGACTGGCTTAGTCGCTTACGACCCTTATAACAATGTAGAT
GTATATTATAAGGATCTTCCTGATGGTGCTAACGCTATGTTAATTTATTCAAACTCACCG
ACAAAGGAACAGAATATGCTTTGGCAGGTGGAAACTGTTCGATAATTGGATTGAACGACG
GCGGATGCGAGGTATTTTGGACAGTCACTGGCGACTCCGTTTGCTCTCTTTGCTCGATTA
AATCCGACAGCGATAAGTCAAGAGATTTTGTGGTTGGCTCTGAAGATTTTGACATCCGAA
TCTTCCATGGGGATGCCATAATATATGAAATCACGGAGTCTGATG
>pool75_contig_2499
AAGAGAAGAGGTGAGTTTGAGTATTGTTTGTGTGTGTGTGGTTGGGTGAGTGTGTGGTAT
GTGGTGTATGTGTGTGATGAATGTATGTGAAAGAGAGTGATGAATCTCATGGATATGTTC
GAGTTCGTGGTTTCCATTGATCGGTTATAGCCGAGATGATGGATGTGTTCCATGTGTCTG
ATTTCAGTTTAGGATTGTGTTGATGATGTTGATGATGAAAATTGTTGATGGTGATGACGA
TAGTGATGATGATGACGATGTTTCGGATAATGGTGATGATGATGATGGTTCCGACGATGA
TGTTTCGCTTGATGATGGTGATAATGATGACTCCGAAAATAACGTTGACTCGGATGAG

Consider using Bio::SeqIO to parse your Fasta dataset, instead of doing it yourself. Bio::SeqIO lives for this task, and is well developed for it. Additionally, if you're in bioinformatics, it would serve you well to get to know Bio::SeqIO. Given this, consider the following:
use strict;
use warnings;
use Bio::SeqIO;
open my $fh, '<', 'annot.txt' or die $!;
my %annot = map { /(\S+)\s+(.+)/; $1 => $2 } <$fh>;
close $fh;
my $in = Bio::SeqIO->new( -file => 'goat300.fasta', -format => 'Fasta' );
while ( my $seq = $in->next_seq() ) {
my $seqID = $annot{ $seq->id } // $seq->id;
print "$seqID\n" . $seq->seq . "\n";
}
Output on your datasets:
tumor susceptibility
CCCTTTCTCCCTTCCCAACATTCAGAGATACTGAATCGAAACTCTTACTGTCTGTTAGATGACAAAGAGTTATCCATCCTACATACTCCAATTTCCTTCCGCAACTTGTGATTTCGCCGCTTGAATCTTGACGCCGTGCGTCCACAGTTTGTTGTGTTTTATCAATCAAGGTCATTATCAACCGAAGACGCTATCTATTTTCTTGGCGAAGCTCTCGGAAAGGAGCCATCGAAATGGAAGTATTTCTCAAGAAAGTCCGCGAGTTATCCCGGAAGCAGTTC
ubiquitin ligase e3a
GACCTATACCGGACCGTCACTGAAAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGATCCAGGCATGGAGTTGTGGTGACGAGTAGGAGGGTCACCGTGGTGAGCGGGAAGCCTCGGGCGTGAGCCTGGGTGGAGCCGCCACGGGTGCAGATCTTGGTGGTAGTAGCAAATATTCAAGTGAGAACCTTGAAGGCCGAGGTGGAGAAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTCATTTGTATCGCCCGGAAAACGTCACAAGAACGGGAGTTGCGTACAGAA
serine threonine-protein phosphatase 4 catalytic subunit
AAGGGACACCGTTGGGTGAGGCGAGCTGCGTTCCTCGAACCATGGCTTCAAAAAGCGACTTAGACCGTCAGATTGAACAGCTCAGGGCCTGCAAGCTCATTACAGAGGATGAGGTTAAGGCACTCTGCGCTAAGGCGCGTGAGATTTTAATTGAAGAGAGTAATGTCCAGTGCGTGGACTCACCTGTCACGGTTTGTGGCGATATCCACGGCCAGTTTTACGACTTGATTGAACTGTTTAAAGTGGGCGGAGATGTTC
bardet-biedl syndrome 2 protein P
TTACTATTTCTGGGCCTTAAGACTGGCTTAGTCGCTTACGACCCTTATAACAATGTAGATGTATATTATAAGGATCTTCCTGATGGTGCTAACGCTATGTTAATTTATTCAAACTCACCGACAAAGGAACAGAATATGCTTTGGCAGGTGGAAACTGTTCGATAATTGGATTGAACGACGGCGGATGCGAGGTATTTTGGACAGTCACTGGCGACTCCGTTTGCTCTCTTTGCTCGATTAAATCCGACAGCGATAAGTCAAGAGATTTTGTGGTTGGCTCTGAAGATTTTGACATCCGAATCTTCCATGGGGATGCCATAATATATGAAATCACGGAGTCTGATG
succinyl- ligase
AAGAGAAGAGGTGAGTTTGAGTATTGTTTGTGTGTGTGTGGTTGGGTGAGTGTGTGGTATGTGGTGTATGTGTGTGATGAATGTATGTGAAAGAGAGTGATGAATCTCATGGATATGTTCGAGTTCGTGGTTTCCATTGATCGGTTATAGCCGAGATGATGGATGTGTTCCATGTGTCTGATTTCAGTTTAGGATTGTGTTGATGATGTTGATGATGAAAATTGTTGATGGTGATGACGATAGTGATGATGATGACGATGTTTCGGATAATGGTGATGATGATGATGGTTCCGACGATGATGTTTCGCTTGATGATGGTGATAATGATGACTCCGAAAATAACGTTGACTCGGATGAG
The hash %annot is initialized by reading and capturing the contents of your annot.txt data. A Bio::SeqIO object is created using your goat300.fasta file data. The while loop iterates through your fasta sequences. The variable $seqID either takes the associated value of the key in the %annot hash or it keeps the current sequence ID (the // notation means defined or, so that insures $seqID will be defined). Finally, the Fasta record is printed.
Hope this helps!

There were a lot of warnings in your code, and your approach was inefficient. Let me first show you a working Perl program. I'll explain afterwards.
#!/usr/bin/perl
use strict;
use warnings;
# Read the annotations file
print"Enter annotated file...\n";
# my $f1 = <STDIN>;
my $f1 = 'annot.txt';
open(my $fh_annotations, '<', $f1) or die "Can't open $f1";
my #annotfile = <$fh_annotations>;
close $fh_annotations;
# Read the sequence file
print"Enter sequence file...\n";
# my $f2 = <STDIN>;
my $f2 = 'goat300.fasta';
open(my $fh_genes, '<', $f2) or die "Can't open $f2";
my #seqfile = <$fh_genes>;
close $fh_genes;
# Process the annotations data
my %names; # this hash is going to hold the names
foreach my $line (#annotfile) {
chomp $line; # remove newline
my #fields = split /\t/, $line; # split into array
$names{$fields[0]} = $fields[1]; # save in the hash as key->value pair
}
# Process the sequence data
foreach my $line (#seqfile) {
# Look at each line
if ($line =~ m/>(.+)$/) {
# If there is a heading there, remember it...
if (exists $names{$1}) {
# ... check if we know a name for it and replace it in the line
$line =~ s/($1)/$names{$1}/;
}
}
# output the line (this would be done to another filehandle)
print $line;
}
This reads both files and saves them in memory, just like yours did. But instead of trying to build two arrays for the names, I went with a hash, which is a key/value pair. Think of it like an array with names instead of numbers and no particular sorting.
Once these names are set up, I can process the sequence file. I simply look at each line and check if there is a heading there, by looking for the > sign. If it's there (it goes into $1 because of the parenthesis), I look if we have a hash entry (with exists) in our %names hash. If we do, we can replace the heading with the proper name.
After that, we could write it out to a new file. I'm just printing it.
I've used a few other techniques. Unfortunately the literature people get in a BioPerl context is quite outdated. Please take this advice, it will make your live easier.
Always use strict and warnings. They will tell you about problems with your code.
Always declare your variables with my. This is not like other languages, where you need to set up a variable at the top of your problem. You can declare it where you need it. The vars only live in a certain scope, which means between the nearest enclosing { and } brackets, or block.
Use three-argument open and lexical file handles for security. Read more here.
Perl offers foreach as an alternative to the C for loop. In this case, it made things a lot easier.
One more thing about this program: While this example data was rather short, I believe your actual data might be a lot larger. Consider processing the sequence file while you read it so you do not run out of memory. There's no need to save all the lines, unless you want to do something else with them.
open my $fh_out, '>', $filename_out or die $!;
open my $fh_in, '<', $filename_in or die $!;
while (my $line = <$fh_in>) {
# do stuff with the line, like your regex
print $fh_out $line;
}
close $fh_in;
close $fh_out;

Using Perl to parse a CSV file from a particular row to the end of the file

am very new to Perl and need your help
I have a CSV file xyz.csv with contents:
here level1 and er values are strings names...not numbers...
level1,er
level2,er2
level3,er3
level4,er4
I parse this CSV file using the script below and pass the fields to an array in the first run
open(my $d, '<', $file) or die "Could not open '$file' $!\n";
while (my $line = <$d>) {
chomp $line;
my #data = split "," , $line;
#XYX = ( [ "$data[0]", "$data[1]" ], );
}
For the second run I take an input from a command prompt and store in variable $val. My program should parse the CSV file from the value stored in variable until it reaches the end of the file
For example
I input level2 so I need a script to parse from the second line to the end of the CSV file, ignoring the values before level2 in the file, and pass these values (level2 to level4) to the #XYX = (["$data[1]","$data[1]"],);}
level2,er2
level3,er3
level4,er4
I input level3 so I need a script to parse from the third line to the end of the CSV file, ignoring the values before level3 in the file, and pass these values (level3 and level4) to the #XYX = (["$data[0]","$data[1]"],);}
level3,er3
level4,er4
How do I achieve that? Please do give your valuable suggestions. I appreciate your help

As long as you are certain that there are never any commas in the data you should be OK using split. But even so it would be wise to limit the split to two fields, so that you get everything up to the first comma and everything after it
There are a few issues with your code. First of all I hope you are putting use strict and use warnings at the top of all your Perl programs. That simple measure will catch many trivial problems that you could otherwise overlook, and so it is especially important before you ask for help with your code
It isn't commonly known, but putting a newline "\n" at the end of your die string prevent Perl from giving file and line number details in the output of where the error occurred. While this may be what you want, it is usually more helpful to be given the extra information
Your variable names are verly unhelpful, and by convention Perl variables consist of lower-case alphanumerics and underscores. Names like #XYX and $W don't help me understand your code at all!
Rather than splitting to an array, it looks like you would be better off putting the two fields into two scalar variables to avoid all that indexing. And I am not sure what you intend by #XYX = (["$data[1]","$data[1]"],). First of all do you really mean to use $data[1] twice? Secondly, your should never put scalar variables inside double quotes, as it does something very specific, and unless you know what that is you should avoid it. Finally, did you mean to push an anonymous array onto #XYX each time around the loop? Otherwise the contents of the array will be overwritten each time a line is read from the file, and the earlier data will be lost
This program uses a regular expression to extract $level_num from the first field. All it does it find the first sequence of digits in the string, which can then be compared to the minimum required level $min_level to decide whether a line from the log is relevant
use strict;
use warnings;
my $file = 'xyz.csv';
my $min_level = 3;
my #list;
open my $fh, '<', $file or die "Could not open '$file' $!";
while (my $line = <$fh>) {
chomp $line;
my ($level, $error) = split ',', $line, 2;
my ($level_num) = $level =~ /(\d+)/;
next unless $level_num >= $min_level;
push #list, [ $level, $error ];
}

For deciding which records to process you can use the "flip-flop" operator (..) along these lines.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $level = shift || 'level1';
while (<DATA>) {
if (/^\Q$level,/ .. 0) {
print;
}
}
__DATA__
level1,er
level2,er2
level3,er3
level4,er4
The flip-flop operator returns false until its first operand is true. At that point it returns false until its second operand is true; at which point it returns false again.
I'm assuming that your file is ordered so that once you start to process it, you never want to stop. That means that the first operand to the flip-flop can be /^\Q$level,/ (match the string $level at the start of the line) and the second operand can just be zero (as we never want it to stop processing).
I'd also strongly recommend not parsing CSV records using split /,/. That may work on your current data but, in general, the fields in a CSV file are allowed to contain embedded commas which will break this approach. Instead, have a look at Text::CSV or Text::ParseWords (which is included with the standard Perl distribution).
Update: I seem to have got a couple of downvotes on this. It would be great if people would take the time to explain why.

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my #XYZ;
my $file = 'xyz.csv';
open my $fh, '<', $file or die "$file: $!\n";
my $level = shift; # get level from commandline
my $getall = not defined $level; # true if level not given on commandline
my $parser = Text::CSV->new({ binary => 1 }); # object for parsing lines of CSV
while (my $row = $parser->getline($fh)) # $row is an array reference containing cells from a line of CSV
{
if ($getall # if level was not given on commandline, then put all rows into #XYZ
or # if level *was* given on commandline, then...
$row->[0] eq $level .. 0 # ...wait until the first cell in a row equals $level, then put that row and all subsequent rows into #XYZ
)
{
push #XYZ, $row;
}
}
close $fh;

#!/usr/bin/perl
use strict;
use warnings;
open(my $data, '<', $file) or die "Could not open '$file' $!\n";
my $level = shift ||"level1";
while (my $line = <$data>) {
chomp $line;
my #fields = split "," , $line;
if($fields[0] eq $level .. 0){
print "\n$fields[0]\n";
print "$fields[1]\n";
}}
This worked....thanks ALL for your help...

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse