An array is populated from a tab delimited text (5 column) file that sometimes is missing rows. I need to identify and insert the missing rows. Inserting a string "blank row found" is sufficient.
Here is an example of data from file:
chr1:11174372 MTOR 42939 42939 7
chr1:65310459 JAK1 1948 1948 3
I’ve created an array of elements that identifies the second column of each row that should be present in the file, in the order each row should be present. However, I'm not sure how to continue from here, since I'm unable to install any Perl modules on the server (e.g. Arrays::Utils).
Is comparing arrays the correct way of approaching this problem? Perhaps there is a straightforward solution, that doesn’t require installation of any CPAN modules? Thanks for your help.
#!perl
use strict;
use warnings;
use File::Basename;
#use Arrays::Utils;
opendir my $dir, "/data/test_all_runs" or die "Cannot open directory: $!";
my @run_folder = readdir $dir;
closedir $dir;
my $run_folder = pop @run_folder; print "The folder is".$run_folder."\n";
my $home="/data/";
my $CNV_file = $home."test_all_runs/".$run_folder."/CNV.txt";
my @CNVarray;
open(TXT2, "$CNV_file");
while (<TXT2>){
    push (@CNVarray, $_);
}
close(TXT2);
foreach (@CNVarray){
    chop($_);
}
my @array1 = map { $_->[1] } @CNVarray;
my @array2 = qw(MTOR JAK1 NRAS DDR2 MYCN ALK IDH1 ERBB4 RAF1 CTNNB1 PIK3CA DCUN1D1 FGFR3 PDGFRA KIT APC FGFR4 ROS1 ESR1 EGFR CDK6 MET SMO BRAF FGFR1 MYC JAK2 GNAQ RET FGFR2 HRAS CCND1 BIRC2 KRAS ERBB3 CDK4 AKT1 MAP2K1 IDH2 NF1 ERBB2 BRCA1 GNA11 MAP2K2 JAK3 AR MED12);
my %array1_hash;
my %array2_hash;
# Create a hash entry for each element in @array1
for my $element ( @array1 ) {
    $array1_hash{$element} = @array1;
}
# Same for @array2: This time, use map instead of a loop
map { $array_2{$_} = 1 } @array2;
for my $entry ( @array2 ) {
    if ( not $array1_hash{$entry} ) {
        return 1; #Entry in @array2 but not @array1: Differ
    }else {
        return 0; #Arrays contain the same elements
    }
    #if ( keys %array_hash1 != keys %array_hash2 ) {
    #return 1; #Arrays differ
}
Note: the best version is reached at the end. It is a few lines of code.
If I get it right, you have a separate reference list of key-words that need to be in the second field in a row, with rows in that order. One way to find skipped rows is to iterate through both lists.
That approach can be picky and error prone but here it can be made easier by removing the front element from the reference list each time. Then you always need to compare the current line against the first element in the reference list. Here is the basic logic, with the better version further below.
use warnings;
use strict;
open my $cnv_fh, '<', $CNV_file or die "Can't open $CNV_file: $!";
my @CNVarray = <$cnv_fh>;
close $cnv_fh;
chomp(@CNVarray);   # drop trailing newlines; they are added back when printing
my @ref_list = qw(MTOR JAK1 ...);
foreach my $line (@CNVarray)
{
    if ( (split /\t/, $line)[1] eq $ref_list[0] ) { # good row
        shift @ref_list;
        print $line, "\n";
    }
    else {
        shift @ref_list;
        print "blank row found\n";
        while ( (split /\t/, $line)[1] ne $ref_list[0] ) {
            # multiple missing rows? keep going through the reference list
            shift @ref_list;
            print "blank row found\n";
        }
        # the reference list has caught up with this row
        shift @ref_list;
        print $line, "\n";
    }
}
# We are done with the array, but are there more reference items?
print "blank row found\n" for @ref_list;
The while loop is needed since multiple rows can be missing (in a row), so we need to get to the place in the reference list that does match the current row. A few notes on the code.
The filehandle read <...> in the list context returns a list with all lines from the resource.
The chop in the original code removes the last character, probably not what you want. It is the chomp that removes the new line (or really $/).
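The difference between the two is easiest to see on a line that has no trailing newline, as the last line of a file often does. A minimal sketch, with made-up sample strings:

```perl
use strict;
use warnings;

# Hypothetical sample lines: the second one has no trailing newline.
my @with_chop  = ("AA\n", "EE");
my @with_chomp = ("AA\n", "EE");

chop  @with_chop;    # removes the last character of every element, whatever it is
chomp @with_chomp;   # removes a trailing $/ (newline by default), only if present

print join(',', @with_chop),  "\n";   # AA,E
print join(',', @with_chomp), "\n";   # AA,EE
```

With chop the second key loses a real character, so a later comparison against the reference list would fail for that row.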
Tested against the reference list qw(AA BB CC DD EE) with the input file (note spaces not tabs)
1 AA first
2 BB more
5 EE last
To test with this, change /\t/ to /\s/ (which will then work for tabs as well). It prints
1 AA first
2 BB more
blank row found
blank row found
5 EE last
With further elements added to the @ref_list (FF etc) further blank ... lines are printed.
The code above can be simplified. Lines are also collected in an array, then printed to a new file.
use warnings;
use strict;
open my $cnv_fh, '<', $CNV_file or die "Can't open $CNV_file: $!";
my @CNVarray = <$cnv_fh>;
close $cnv_fh;
chomp(@CNVarray);
my @ref_list = qw(MTOR JAK1 ...);
my @new_lines;
foreach my $line (@CNVarray)
{
    while ( (split /\t/, $line)[1] ne $ref_list[0] ) {
        shift @ref_list;
        push @new_lines, 'blank row found';
        print "blank row found\n";
    }
    shift @ref_list;
    push @new_lines, $line;
    print $line, "\n";
}
# There may be more items remaining on the reference list
for (@ref_list) {
    push @new_lines, 'blank row found';
    print "blank row found\n";
}
my $filled_file = 'skipped_rows_added.txt';
open my $out_fh, '>', $filled_file or die "Can't open $filled_file: $!";
print $out_fh "$_\n" for @new_lines;
close $out_fh;
This behaves the same way with the test input above. It can be simplified further yet
foreach my $line (@CNVarray)
{
    while ( (split /\t/, $line)[1] ne shift @ref_list ) {
        print "blank row found\n";
    }
    print $line, "\n";
}
The shift returns the removed element, which is what needs to be tested against.
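One thing the loops above do not handle is a row whose key is not in the reference list at all: the while condition then never becomes false. The following is only a sketch of one way to guard against that. The helper name fill_missing and the decision to die on an unexpected key are assumptions for illustration, not part of the code above.

```perl
use strict;
use warnings;

# fill_missing() is a hypothetical helper name. It takes references to the
# chomped lines and to the reference list, and returns the output lines.
sub fill_missing {
    my ($lines, $ref_list) = @_;
    my @ref = @$ref_list;            # work on a copy; the caller's list is untouched
    my @out;

    for my $line (@$lines) {
        my $key = (split /\t/, $line)[1];
        $key = '' unless defined $key;   # line with fewer than two fields

        # Assumption: a key missing from the remaining reference list is an error.
        die "Unexpected or out-of-order key '$key'\n"
            unless grep { $_ eq $key } @ref;

        # Every reference item skipped on the way to $key is a missing row.
        push @out, 'blank row found' while shift(@ref) ne $key;
        push @out, $line;
    }

    # Reference items left over after the last line are missing rows too.
    push @out, ('blank row found') x @ref;
    return @out;
}

my @lines = ("1\tAA\tfirst", "2\tBB\tmore", "5\tEE\tlast");
print "$_\n" for fill_missing(\@lines, [qw(AA BB CC DD EE)]);
```

Run on the test input from earlier (with tabs), this prints the three rows with two "blank row found" lines between the second and the third.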
A note on split syntax, following the code update ("\t" changed to /\t/).
When invoked as split /$patt/, $str, the $patt is used as a regular expression, with a few very minor differences. So with /\s/ the string is split on white space as understood in regex, thus including the tab, for example.
With double quotes "..." used instead of /.../, what is inside is interpolated first which may result in surprises, in particular with escapes. (Unless it is used as m"..." in which case it is merely a regex with " being the delimiter.)
In the above code for the tab one can use /\t/, or "\t", or '\t' (or /\s/ which includes yet other types of space). The "\t" was changed to /\t/, which is better in my opinion, being clearer (it is a regex, no questions asked). Thanks to Borodin for the early edit and for the comment.
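A small sketch of how these patterns differ on the same string. The row is made up for illustration: two tabs, and a single space inside the last field.

```perl
use strict;
use warnings;

# Hypothetical row: tab-separated, with one space inside the third field.
my $row = "chr1:11174372\tMTOR\t42939 42939";

my @by_tab   = split /\t/, $row;   # splits on tabs only
my @by_space = split /\s/, $row;   # splits on any single whitespace character
my @by_str   = split "\t", $row;   # "\t" is interpolated to a tab, then used as the pattern

print scalar(@by_tab),   "\n";   # 3
print scalar(@by_space), "\n";   # 4
print "$by_str[1]\n";            # MTOR
```

Note that /\s/ splits on each single whitespace character, so runs of spaces produce empty fields; /\s+/ or the special pattern ' ' avoids that.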
I would write it like this.
The input file is read into a hash, keyed by the value of the second column. Then the hash is read back and printed in the specified sequence of keys.
Most of the code is finding the input file and setting up the sequence of keys. The core of the program is only three lines of code.
use strict;
use warnings 'all';
use File::Spec::Functions 'catfile';
my $home = '/data';
my @run_folder = grep -f, glob catfile($home, 'test_all_runs', '*', 'CNV.txt');
die "No CNV file found" unless @run_folder;
my $cnv_file = $run_folder[-1];
print "The file is $cnv_file\n\n";
my @sequence = qw/
MTOR JAK1 NRAS DDR2 MYCN ALK
IDH1 ERBB4 RAF1 CTNNB1 PIK3CA DCUN1D1
FGFR3 PDGFRA KIT APC FGFR4 ROS1
ESR1 EGFR CDK6 MET SMO BRAF
FGFR1 MYC JAK2 GNAQ RET FGFR2
HRAS CCND1 BIRC2 KRAS ERBB3 CDK4
AKT1 MAP2K1 IDH2 NF1 ERBB2 BRCA1
GNA11 MAP2K2 JAK3 AR MED12
/;
open my $fh, '<', $cnv_file or die qq{Unable to open "$cnv_file" for input: $!};
my %data;
$data{ (split)[1] } = $_ while <$fh>;
print $data{$_} // "no data for $_\n" for @sequence;
output
The file is /data/test_all_runs/XXX/CNV.txt
chr1:11174372 MTOR 42939 42939 7
chr1:65310459 JAK1 1948 1948 3
no data for NRAS
no data for DDR2
no data for MYCN
no data for ALK
no data for IDH1
no data for ERBB4
no data for RAF1
no data for CTNNB1
no data for PIK3CA
no data for DCUN1D1
no data for FGFR3
no data for PDGFRA
no data for KIT
no data for APC
no data for FGFR4
no data for ROS1
no data for ESR1
no data for EGFR
no data for CDK6
no data for MET
no data for SMO
no data for BRAF
no data for FGFR1
no data for MYC
no data for JAK2
no data for GNAQ
no data for RET
no data for FGFR2
no data for HRAS
no data for CCND1
no data for BIRC2
no data for KRAS
no data for ERBB3
no data for CDK4
no data for AKT1
no data for MAP2K1
no data for IDH2
no data for NF1
no data for ERBB2
no data for BRCA1
no data for GNA11
no data for MAP2K2
no data for JAK3
no data for AR
no data for MED12
Firstly, I apologise if my formatting here is incorrect; I am very new to writing scripts (3 days) and this is my first post on this site.
I have two files which are tab separated, File a contains 14 columns, and File b contains 8 columns.
One column in File b has a numeric value which correlates to a range of numbers generated by two numeric fields from File a.
For every line in File a, I need to search through File b and print a combination of data from fields in both files. There will be multiple matches for each line of File a because a numeric range is accepted.
The code that I have created does exactly what I want it to do, but only for the first line of File a; it doesn't continue the loop. I have looked all over the internet and I believe it may be something to do with the fact that both files read from standard input. I have tried to correct this problem but I can't seem to get anything to work.
My current understanding is that by changing one file to read from a different file descriptor my loop may work... with something such as >$3 but I don't really understand this very well despite my research. Or possibly using the grep function which I am also struggling with.
Here is the outline of the code I am using now:
use strict;
use warnings;
print "which file read from?\n";
my $filea = <STDIN>;
chomp $filea;
{
    unless (open ( FILEA, $filea) ) {
        print "cannot open, do you want to try again? y/n?\n?";
        my $again = <STDIN>;
        chomp $again;
        if ($again =~ 'n') {
            exit;
        } else {
            print "\n";
            $filea = <STDIN>;
            chomp $filea;
            redo;
        }
    }
}
#I also open fileb the same way, but won't write it all out to save space and your time.
my $output = 'output.txt';
open (OUTPUT, ">>$output");
while (my $loop1 = <FILEA>) {
    chomp $loop1;
    ( my $var1, my $var2, my $var3, my $var4, my $var5, my $var6,
      my $var7, my $var8, my $var9, my $var10, my $var11, my $var12,
      my $var13, my $var14 ) = split ( "\t", $loop1);
    #create the range of number which needs to be matched from file b.
    my $length = length ($var4);
    my $range = ($var2 + $length);
    #perform the search loop through fileb
    while (my $loop2 = <FILEB>) {
        chomp $loop2;
        ( my $vala, my $valb, my $valc, my $vald, my $vale, my $valf,
          my $valg) = split ( "\t", $loop2 );
        #there are then several functions and additions of the data, which all work basically so I'll just use a quick example.
        if ($vald >= $var3 && $vald <= $range) {
            print OUTPUT "$var1, $vald, $var11, $valf, $vala, $var5 \n";
        }
    }
}
I hope this all makes sense, I tried to make everything as clear as possible, if anyone could help me edit the code so that the loop continues through all of filea that would be great.
If possible, please explain what you've done. Ideally I'd like to obtain this result without changing the code too much.
Thanks guys!!!
Avoid bareword filehandles when possible; use a lexical filehandle such as $fh instead of FH.
You can use until instead of unless, and skip the redo:
print "Enter the file name\n";
my $file_a = <STDIN>;
chomp $file_a;
my $fh_a;
until(open $fh_a, '<', $file_a) {
print "Re-enter the file name or 'n' to cancel\n";
$file_a = <STDIN>;
chomp $file_a;
if($file_a eq 'n') {
exit;
}
}
You can (should) use an array instead of all those individual column variables: my @cols_a = split /\t/, $line;
You should read file B into an array, once, and then search that array each time you need to: my @file_b = <$fh_b>;
The result will look something like this:
#Assume we have opened both files already . . .
my @file_b = <$fh_b>;
chomp @file_b;
while(my $line = <$fh_a>) {
    chomp $line;
    my @cols_a = split /\t/, $line;
    #Remember, most arrays (Perl included) are zero-indexed,
    #so $cols_a[1] is actually the SECOND column.
    my $range = ($cols_a[1] + length $cols_a[3]);
    foreach my $line_b (@file_b) {
        #This loop will run once for every single line of file A.
        #Not efficient, but it will work.
        #There are, of course, lots of optimisations you can make
        #(starting with, for example, storing file B as an array of array
        #references so you don't have to split each line every time)
        my @cols_b = split /\t/, $line_b;
        #Same test as in the question: $range is already the upper bound.
        if($cols_b[3] >= $cols_a[2] && $cols_b[3] <= $range) {
            #Do whatever here
        }
    }
}
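The optimisation mentioned in the comments, splitting file B only once, could look roughly like this. It is a sketch under assumptions: the two arrays are made-up in-memory stand-ins for the files, and the condition is the range test outlined in the question.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for the two tab-separated files.
my @file_a = ("idA\t100\t100\tACGT");    # upper bound: 100 + length("ACGT") = 104
my @file_b = ("b1\tx\ty\t102", "b2\tx\ty\t250", "b3\tx\ty\t104");

# Split every line of file B exactly once, keeping an array of array references.
my @rows_b = map { [ split /\t/ ] } @file_b;

my @matches;
for my $line (@file_a) {
    my @cols_a = split /\t/, $line;
    my $range  = $cols_a[1] + length $cols_a[3];

    for my $cols_b (@rows_b) {
        # Column 4 of B must lie between column 3 of A and the upper bound.
        push @matches, "$cols_a[0], $cols_b->[3], $cols_b->[0]"
            if $cols_b->[3] >= $cols_a[2] && $cols_b->[3] <= $range;
    }
}

print "$_\n" for @matches;
```

The inner loop now only dereferences fields that are already split, so the cost of split on file B is paid once rather than once per line of file A.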
I have a text file which is tab separated. These files can be quite big, up to 1 GB. The number of columns varies with the number of samples in the file. Each sample has eight columns. For example, sample A has ID1, ID2, MIN_A, AVG_A, MAX_A, AR1_A, AR2_A, AR3_A, AR4_A, AR5_A, of which ID1 and ID2 are common to all the samples. What I want to achieve is to split the whole file into separate files, one per sample.
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,3535,4545,5656,5656,7675,67567,57758,875,8678,578,57856785,85587,574,56745,567356,675489,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853,457328,3457385,567438,5678934,56845,567348,58567,548948,58649,5839,546847,458274,758345,4572384,4758475,47487
This is how my model file looks. I want to have the output as:
File A :
ID1,ID2,MIN_A,AVG_A,MAX_A,AR1_A,AR2_A,AR3_A,AR4_A,AR5_A
12,134,3535,4545,5656,5656,7675,67567,57758,875
454385,3457,485784,5673489,5658,567845,575867,45785,7568,43853
File B:
ID1, ID2,MIN_B, AVG_B, MAX_B,AR1_B,AR2_B,AR3_B,AR4_B,AR5_B
12,134,8678,578,57856785,85587,574,56745,567356,675489
454385,3457,457328,3457385,567438,5678934,56845,567348,58567,548948
File C:
ID1, ID2,MIN_C,AVG_C,MAX_C,AR1_C,AR2_C,AR3_C,AR4_C,AR5_C
12,134,573586,5867,576384,75486,587345,34573,45485,5447
454385,3457,58649,5839,546847,458274,758345,4572384,4758475,47487
Is there an easier way of doing this than going through an array?
My logic so far is to count the headers, subtract 2, and divide by 8, which gives the number of samples in the file, and then to go through each element of an array and parse it. That seems a tedious way of doing this. I would be happy to know of any simpler way of handling this.
Thanks
Sipra
#!/usr/bin/env perl
use strict;
use warnings;
# open three output filehandles
my %fh;
for (qw[A B C]) {
open $fh{$_}, '>', "file$_" or die $!;
}
# open input
open my $in, '<', 'somefile' or die $!;
# read the header line. there are no doubt ways to parse this to
# work out what the rest of the program should do.
<$in>;
while (<$in>) {
    chomp;
    my @data = split /,/;
    print $fh{A} join(',', @data[0 .. 9]), "\n";
    print $fh{B} join(',', @data[0, 1, 10 .. 17]), "\n";
    print $fh{C} join(',', @data[0, 1, 18 .. $#data]), "\n";
}
Update: I got bored and made it cleverer, so it automatically handles any number of 8-column records in a file. Unfortunately, I don't have time to explain it or add comments.
#!/usr/bin/env perl
use strict;
use warnings;
# open input
open my $in, '<', 'somefile' or die $!;
chomp(my $head = <$in>);
my @cols = split/,/, $head;
die 'Invalid number of records - ' . @cols . "\n"
    if (@cols -2) % 8;
my @files;
my $name = 'A';
foreach (1 .. (@cols - 2) / 8) {
    my %desc;
    $desc{start_col} = (($_ - 1) * 8) + 2;
    $desc{end_col} = $desc{start_col} + 7;
    open $desc{fh}, '>', 'file' . $name++ or die $!;
    print {$desc{fh}} join(',', @cols[0,1],
        @cols[$desc{start_col} .. $desc{end_col}]),
        "\n";
    push @files, \%desc;
}
while (<$in>) {
    chomp;
    my @data = split /,/;
    foreach my $f (@files) {
        print {$f->{fh}} join(',', @data[0,1],
            @data[$f->{start_col} .. $f->{end_col}]),
            "\n";
    }
}
This is independent of the number of samples. I'm not confident about the output file names, though, because you might have more than 26 samples. Just replace how the output file name is built if that's the case. :)
use strict;
use warnings;
use File::Slurp;
use Text::CSV_XS;
use Carp qw( croak );
#I'm lazy
my @source_file = read_file('source_file.csv');
# you mention yours is tab separated
# just add the {sep_char => "\t"} inside new
my $csv = Text::CSV_XS->new()
    or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
my $output_file;
#read each row
while ( my $raw_line = shift @source_file ) {
    $csv->parse($raw_line);
    my @fields = $csv->fields();
    #get the first 2 ids
    my @ids = splice @fields, 0, 2;
    my $group = 0;
    while (@fields) {
        #get the first 8 columns
        my @columns = splice @fields, 0, 8;
        #if you want to change the separator of the output replace ',' with "\t"
        push @{ $output_file->[$group] }, (join ',', @ids, @columns), $/;
        $group++;
    }
}
#for filename purposes
my $letter = 65;
foreach my $data (@$output_file) {
    my $output_filename = sprintf( 'SAMPLE_%c.csv', $letter );
    write_file( $output_filename, @$data );
    $letter++;
}
#if you reach more than 26 samples then you might want to use numbers instead
#my $sample_number = 1;
#foreach my $data (@$output_file) {
#    my $output_filename = sprintf( 'sample_%s.csv', $sample_number );
#    write_file( $output_filename, @$data );
#    $sample_number++;
#}
Here is a one-liner to print the first sample. You can write a shell script to write the data for different samples into different files.
perl -F, -lane 'print "@F[0..1] @F[2..9]"' <INPUT_FILE_NAME>
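To go from the first sample to any sample, the column indexes can be computed from the sample number. A minimal sketch, assuming two leading ID columns and eight columns per sample as in the question; sample_columns is a hypothetical helper name.

```perl
use strict;
use warnings;

# sample_columns() is a hypothetical helper: zero-based sample number in,
# zero-based column indexes out (two ID columns plus eight sample columns).
sub sample_columns {
    my ($n) = @_;
    my $first = 2 + 8 * $n;
    return (0, 1, $first .. $first + 7);
}

# First data row from the question, truncated to samples A and B.
my @row = split /,/, '12,134,3535,4545,5656,5656,7675,67567,57758,875,8678,578,57856785,85587,574,56745,567356,675489';

# Sample 1 is the second sample (B).
print join(',', @row[ sample_columns(1) ]), "\n";   # 12,134,8678,578,57856785,85587,574,56745,567356,675489
```

The printed line matches the first data row of File B in the question.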
You said tab separated, but your example shows it being comma separated. I take it that's a limitation in putting your sample data in Markdown?
I guess you're a bit concerned about memory, so you want to open the multiple files and write them as you parse your big file.
I would say to try Text::CSV::Simple. However, I believe it reads the entire file into memory which might be a problem for a file this size.
It's pretty easy to read a line, and put that line into a list. The issue is mapping the fields in that list to the names of the fields themselves.
If you read in a file with a while loop, you're not reading the whole file into memory at once. If you read in each line, parse that line, then write that line to the various output files, you're not taking up a lot of memory. Output is buffered, but the buffer is written out whenever it fills, so it stays small.
The trick is to open the input file, then read in the first line. You want to create some sort of field mapping structure, so you can figure out which fields to write to each of the output files.
I would have a list of all the files you need to write to. This way, you can go through the list for each file. Each item in the list should contain the information you need for writing to that file.
First, you need a filehandle, so you know which file you're writing to. Second, you need a list of the field numbers you've got to write to that particular output file.
I see some sort of processing loop like this:
use feature 'say';
while (my $line = <$input_fh>) { #Line from the input file.
    chomp $line;
    my @input_line_array = split /\t/, $line;
    my $fileHandle;
    foreach my $output_file (@outputFileList) { #List of output files.
        $fileHandle = $output_file->{FILE_HANDLE};
        my @fieldsToWrite;
        foreach my $fieldNumber (@{$output_file->{FIELD_LIST}}) {
            push @fieldsToWrite, $input_line_array[$fieldNumber];
        }
        say {$fileHandle} join "\t", @fieldsToWrite;
    }
}
I'm reading one line of the input file into $line and dividing it into fields, which I put in @input_line_array. Now that I have the line, I have to figure out which fields get written to each of the output files.
I have a list called @outputFileList that holds all the output files I want to write to. $outputFileList[$fileNumber]->{FILE_HANDLE} contains the file handle for output file $fileNumber. $outputFileList[$fileNumber]->{FIELD_LIST} is a list of the fields I want to write to output file $fileNumber, given as indexes into @input_line_array. So
$outputFileList[$fileNumber]->{FIELD_LIST} = [0, 1, 2, 4, 6, 8];
means that I want to write the fields $input_line_array[0], $input_line_array[1], $input_line_array[2], $input_line_array[4], $input_line_array[6], and $input_line_array[8] to the output file $outputFileList[$fileNumber]->{FILE_HANDLE}, in that order, as a tab-separated list.
I hope this is making some sense.
The initial problem is reading in the first line of <$input_fh> and parsing it into the needed complex structure. However, now that you have an idea on how this structure needs to be stored, parsing that first line shouldn't be too much of an issue.
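As a rough sketch of that first step, the header line can be turned into the list of per-file descriptions like this. Everything specific here is an assumption for illustration: the helper name build_output_list, the FILE_NAME key, the sample_N.tsv naming, and the layout of two ID columns followed by eight columns per sample. Opening each file and storing the handle under FILE_HANDLE is left out so the example has no side effects.

```perl
use strict;
use warnings;

# build_output_list() is a hypothetical helper. Assumptions: two leading ID
# columns, eight columns per sample, output names of the form sample_N.tsv.
sub build_output_list {
    my ($header_line) = @_;
    chomp $header_line;
    my @names = split /\t/, $header_line;
    die "Unexpected number of columns: " . @names . "\n" if (@names - 2) % 8;

    my @list;
    for my $n (0 .. (@names - 2) / 8 - 1) {
        my $first = 2 + 8 * $n;
        push @list, {
            FILE_NAME  => 'sample_' . ($n + 1) . '.tsv',
            FIELD_LIST => [ 0, 1, $first .. $first + 7 ],
        };
    }
    return @list;
}

# Header with two samples, using the column names from the question.
my $header = join "\t", qw(ID1 ID2 MIN_A AVG_A MAX_A AR1_A AR2_A AR3_A AR4_A AR5_A
                           MIN_B AVG_B MAX_B AR1_B AR2_B AR3_B AR4_B AR5_B);

for my $output_file (build_output_list($header)) {
    print "$output_file->{FILE_NAME}: @{ $output_file->{FIELD_LIST} }\n";
}
```

Before the processing loop, each entry would additionally need an opened handle stored under FILE_HANDLE; the header row itself can be written to each file with the same FIELD_LIST.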
Although I didn't use object-oriented code in this example (I'm pulling this stuff out of my a... I mean... brain as I write this post), I would definitely use an object-oriented approach for this. It will actually make things much faster by removing errors.