How to find text in data file and calculate average using perl - perl

I would like to replace a grep | awk | perl command with a pure perl solution to make it quicker and simpler to run.
I want to match each line in input.txt with a data.txt file and calculate the average of the values with the matched ID names and numbers.
The input.txt contains 1 column of ID numbers:
FBgn0260798
FBgn0040007
FBgn0046692
I would like to match each ID number with its corresponding ID names and associated value. Here's an example of data.txt, where column 1 is the ID number, columns 2 and 3 are ID name1 and ID name2, and column 4 contains the values I want to average.
FBgn0260798 CG17665 CG17665 21.4497
FBgn0040007 Gprk1 CG40129 22.4236
FBgn0046692 RpL38 CG18001 1182.88
So far I used grep and awk to produce an output file containing the corresponding values for matched ID numbers and values and then used that output file to calculate the counts and averages using the following commands:
# First part using grep | awk
exec < input.txt
while read line
do
grep -w $line data.txt | cut -f1,2,3,4 | awk '{print $1,$2,$3,$4} ' >> output.txt
done
# Second part with perl
open my $input, '<', "output_1.txt" or die; ## the output file is from the first part and has the same layout as the data.txt file
my $total = 0;
my $count = 0;
while (<$input>) {
my ($name, $id1, $id2, $value) = split;
$total += $value;
$count += 1;
}
print "The total is $total\n";
print "The count is $count\n";
print "The average is ", $total / $count, "\n";
Both parts work OK, but I would like to simplify things by running just one script. I've been trying to find a quicker way of running the whole lot together in perl, but after several hours of reading I am totally stuck on how to do it. I've been playing around with hashes, arrays, and if and elsif statements without success. If anyone has suggestions, that would be great.
Thanks,
Harriet

If I understand you, you have a data file that contains the name of each line and the value for that line. The other two IDs are not important.
You will use a new file called an input file that will contain matching names as found in the data file. These are the values you want to average.
The fastest way is to create a hash keyed by the names, where each value is the value for that name in the data file. Because this is a hash, you can quickly locate the corresponding value. This is much faster than grepping through the same data over and over again.
This first part will read in the data.txt file and store the name and value in a hash keyed by the name.
use strict;
use warnings;
use autodie; # This way, you don't have to check if you can't open the file
use feature qw(say);
use constant {
INPUT_FILE => "input.txt",
DATA_FILE => "data.txt",
};
#
# Read in data.txt and get the values and keys
#
open my $data_fh, "<", DATA_FILE;
my %ids;
while ( my $line = <$data_fh> ) {
chomp $line;
my ($name, $id1, $id2, $value) = split /\s+/, $line;
$ids{$name} = $value;
}
close $data_fh;
Now that you have this hash, it's easy to read through the input.txt file and look up each name in the data loaded from data.txt:
open my $input_fh, "<", INPUT_FILE;
my $count = 0;
my $total = 0;
while ( my $name = <$input_fh> ) {
chomp $name;
if ( not defined $ids{$name} ) {
die qq(Cannot find matching id "$name" in data file\n);
}
$total += $ids{$name};
$count += 1;
}
close $input_fh;
say "Average = ", $total / $count;
You read through each file once. I am assuming that you only have a single instance of each name in each file.

Related

Split file Perl

I want to split parts of a file. Here is what the start of the file looks like (it continues in same way):
Location Strand Length PID Gene
1..822 + 273 292571599 CDS001
906..1298 + 130 292571600 trxA
I want to split the Location column and subtract the start from the end (e.g. 822-1), then do the same for every row and add the results together. So for these two rows the value would be: (822-1)+(1298-906) = 1213
How?
My code right now (I don't get any output at all in the terminal, it just continues to process forever):
use warnings;
use strict;
my $infile = $ARGV[0]; # Reading infile argument
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
my $line2 = <$IN>;
my $coding = 0; # Initialize coding variable
while(my $line = $line2){ # reading the file line by line
# TODO Use split and do the calculations
my @row = split(/\.\./, $line);
my @row2 = split(/\D/, $row[1]);
$coding += $row2[0]- $row[0];
}
print "total amount of protein coding DNA: $coding\n";
So what I get from my code if I put:
print "$coding \n";
at the end of the while loop just to test is:
821
1642
And so the first number is correct (822-1) but the next number doesn't make any sense to me, it should be (1298-906). What I want in the end outside the loop:
print "total amount of protein coding DNA: $coding\n";
is the sum of all the subtractions of every line i.e. 1213. But I don't get anything, just a terminal that works on forever.
As a one-liner:
perl -nE '$c += $2 - $1 if /^(\d+)\.\.(\d+)/; END { say $c }' input.txt
(Extracting the important part of that and putting it into your actual script should be easy to figure out).
Explicitly opening the file makes your code more complicated than it needs to be. Perl will automatically open any files passed on the command line and allow you to read from them using the empty file input operator, <>. So your code becomes as simple as this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $total;
while (<>) {
my ($min, $max) = /(\d+)\.\.(\d+)/;
next unless $min and $max;
$total += $max - $min;
}
say $total;
If this code is in a file called adder and your input data is in add.dat, then you run it like this:
$ adder add.dat
1213
Update: And, to explain where you were going wrong...
You only ever read a single line from your file:
my $line2 = <$IN>;
And then you continually assign that same value to another variable:
while(my $line = $line2){ # reading the file line by line
The comment in this line is wrong. I'm not sure where you got that line from.
To fix your code, just remove the my $line2 = <$IN> line and replace your loop with:
while (my $line = <$IN>) {
# your code here
}

Create a table by merging many files

This seemed like such an easy task, yet I am boggled.
I have text files, each named after a type of tissue (e.g. cortex.txt, heart.txt)
Each file contains two columns, and the column headers are gene_name and expression_value
Each file contains around 30K to 40K rows
I need to merge the files into one file with 29 columns, with headers
genename, tissue1, tissue2, tissue3, etc. to tissue28
So that each row contains one gene and its expression value in the 28 tissues
The following code creates an array containing a list of every gene name in every file:
my @list_of_genes;
foreach my $input_file ( @input_files ) {
print $input_file, "\n";
open ( IN, "outfiles/$input_file");
while ( <IN> ) {
if ( $_ =~ m/^(\w+\|ENSMUSG\w+)\t/) {
# check if the gene is already in the gene list
my $count = grep { $_ eq $1 } @list_of_genes;
# if not in list, add to the list
if ( $count == 0 ) {
push (@list_of_genes, $1);
}
}
}
close IN;
}
The next bit of code I was hoping would work, but the regex only recognises the first gene name.
Note: I am only testing it on one test file called "tissue1.txt".
The idea is to create an array of all the file names, and then take each gene name in turn and search through each file to extract each value and write it to the outfile in order along the row.
foreach my $gene (@list_of_genes) {
# print the gene name in the first column
print OUT $gene, "\t";
# use the gene name to search the first element of the @input_files array and print to the second column
open (IN, "outfiles/tissue1.txt");
while ( <IN> ) {
if ($_ =~ m/^$gene\t(.+)\n/i ) {
print OUT $1;
}
}
print OUT "\n";
}
EDIT 1:
Thank you Borodin. The output of your code is indeed a list of every gene name with all of its expression values in each tissue.
e.g. Bcl20|ENSMUSG00000000317,0.815796340254127,0.815796340245643
This is great, much better than I managed, thank you. Two additional things are needed.
1) If a gene name is not found in a .txt file then a value of 0 should be recorded
e.g. Ht4|ENSMUSG00000000031,4.75878049632381, 0
2) I need a comma separated header row so that the tissue from which each value comes remains associated with the value (basically a table) - the tissue is the name of the text file
e.g. From 2 files heart.txt and liver.txt the first row should be:
genename|id,heart,liver
where genename|id is always the first header
That's a lot of code to implement the simple idiom of using a hash to enforce uniqueness!
It's looking like you want an array of expression values for each different ENSMUSG string in all *.txt files in your outfiles directory.
If the files you need are the only ones in the outfiles directory, then the solution looks like this. I've used autodie to check the return status of all Perl IO operations (chdir, open, print etc.) and checked only that the $gene value contains |ENSMUSG. You may not need even this check if your input data is well-behaved.
Please forgive me if this is bugged, as I have no access to a Perl compiler at present. I have checked it by sight and it looks fine.
use strict;
use warnings 'all';
use autodie;
chdir '/path/to/outfiles';
my %data;
while ( my $file = glob '*.txt' ) {
open my $fh, '<', $file;
while ( <$fh> ) {
my ($gene, $value) = split;
next unless $gene =~ /\|ENSMUSG/;
push @{ $data{$gene} }, $value;
}
}
print join(',', $_, @{ $data{$_} }), "\n" for keys %data;
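The two follow-up requirements from EDIT 1 (a 0 for genes missing from a file, and a header row of tissue names) could be handled by keying the data on both gene and tissue. What follows is only a sketch under stated assumptions: the sub name build_table and the in-memory sample lines are hypothetical stand-ins for the real files, the expression values are made up, and the `//` operator needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical in-memory stand-ins for heart.txt and liver.txt.
# A real script would read these lines from the files on disk.
my %lines_for = (
    heart => [
        "gene_name\texpression_value",
        "Bcl20|ENSMUSG00000000317\t0.81",
        "Ht4|ENSMUSG00000000031\t4.75",
    ],
    liver => [
        "gene_name\texpression_value",
        "Bcl20|ENSMUSG00000000317\t0.52",
    ],
);

# Return a header row followed by one comma-separated row per gene.
sub build_table {
    my ($lines_for) = @_;
    my @tissues = sort keys %$lines_for;

    my %data;    # $data{$gene}{$tissue} = expression value
    for my $tissue (@tissues) {
        for my $line ( @{ $lines_for->{$tissue} } ) {
            my ( $gene, $value ) = split ' ', $line;
            # Skips the header line and anything that is not a gene row.
            next unless defined $value and $gene =~ /\|ENSMUSG/;
            $data{$gene}{$tissue} = $value;
        }
    }

    my @rows = ( join ',', 'genename|id', @tissues );
    for my $gene ( sort keys %data ) {
        my @values;
        for my $tissue (@tissues) {
            # Record 0 when this tissue has no entry for the gene.
            push @values, $data{$gene}{$tissue} // 0;
        }
        push @rows, join ',', $gene, @values;
    }
    return @rows;
}

print "$_\n" for build_table( \%lines_for );
```

Because every row is built by walking the same sorted tissue list as the header, each value stays in the column of the tissue it came from, whichever files happen to contain the gene.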

In perl is it possible to use the same elements created in an array from file.csv and do a foreach on file1.csv as well

In file.csv I have created an array from all the unique values found in column B. Now I want to do a foreach on the same values in column C of file1.csv. Is this possible? I can't hard-code the values of the array, as they change too frequently and there would be errors every time the user runs the script. That is why I created the array values like this.
#!/usr/bin/perl
use strict;
use warnings;
use Tk;
use Tk::BrowseEntry;
use POSIX 'mktime';
use POSIX 'strftime';
open(STDERR, ">&STDOUT");
######## entry widget to get $yyyy $mmm $dd #######################################
print "\n Select Year = $yyyy\n";
print "\n Select Month = $mmm\n";
print "\n Number of Backup Days = $dd\n";
######## create input and output files #######################################
my $filerror = "\n\n! Cannot open File below, please check it exists or is not open already?\n";
my $OUTFILE = "C:\\Temp\\$yyyy\$mmmAudit.txt";
my $INFILE1 = "c:\\file1.csv";
my $INFILE = "c:\\file.csv";
#Open input file for reading and Output file for writing
open (INPUT,"$INFILE") or die "\n$filerror\$INFILE",,1;
#open (OUTPUT,">$OUTFILE") or die "\n$filerror\n$OUTFILE",,1;
my $total_names = 0;
$total_names++ while (<INPUT>);
my $Month_total = $total_names * $dd;
######### get total number of rows in files ##################################
print "\n Total number of names is $total_names\n";
print "\n Total number of names is $Month_total\n";
close INPUT;
open (INPUT,"$INFILE") or die "\n$filerror\$INFILE",,1;
######### keep only unique names to do a foreach in file1.csv#########
my %seen;
while (<INPUT>)
{
chomp;
my $line = $_;
my @elements = split (",", $line);
my $col_name = $elements[1];
print " $col_name \n" if ! $seen{$col_name}++;
}
## now in file1.csv I want to do a foreach on all the $col_name values
close INPUT;
You don't need to loop over the whole file twice. It looks like the first pass is only there to count the lines; could you not do that inside your second loop? That also saves closing and reopening the same file, which is odd.
You can loop over your second file just after grabbing each column name, unless I'm missing something. It is not efficient code, but it will do the job.
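A rough sketch of that advice, with one deliberate change: instead of rescanning file1.csv for every name, it collects the unique column B values first and then reads file1.csv once. The sub names (unique_names, rows_by_name) and the sample CSV contents are hypothetical, and in-memory filehandles stand in for the real files so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical sample contents standing in for file.csv and file1.csv.
my $file_csv  = "1,alice,x\n2,bob,y\n3,alice,z\n";
my $file1_csv = "a,b,alice\nc,d,carol\ne,f,bob\ng,h,alice\n";

# Single pass over the first file: count the rows and collect the
# unique column B values in first-seen order.
sub unique_names {
    my ($fh) = @_;
    my $total = 0;
    my ( %seen, @names );
    while ( my $line = <$fh> ) {
        chomp $line;
        $total++;
        my $name = ( split /,/, $line )[1];
        next unless defined $name;
        push @names, $name unless $seen{$name}++;
    }
    return ( $total, \@names );
}

# Single pass over the second file: keep the rows whose column C
# matches one of the collected names.
sub rows_by_name {
    my ( $fh, $names ) = @_;
    my %rows_for = map { $_ => [] } @$names;
    while ( my $line = <$fh> ) {
        chomp $line;
        my $key = ( split /,/, $line )[2];
        next unless defined $key and exists $rows_for{$key};
        push @{ $rows_for{$key} }, $line;
    }
    return \%rows_for;
}

open my $in, '<', \$file_csv or die "Cannot open in-memory file: $!";
my ( $total_names, $names ) = unique_names($in);
close $in;

open my $in1, '<', \$file1_csv or die "Cannot open in-memory file: $!";
my $rows_for = rows_by_name( $in1, $names );
close $in1;

print "Total number of names is $total_names\n";
for my $name (@$names) {
    my $matches = @{ $rows_for->{$name} };
    print "$name: $matches matching row(s) in file1.csv\n";
}
```

In the real script the two in-memory opens would be replaced by opens on $INFILE and $INFILE1, and the foreach over @$names is where the per-name processing would go.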

New to perl - Dynamically entering a value for an array (a column) from a csv using perl

I am new to perl and I need to write a perl script with the following requirements:
Need to read a csv file
Store the columns in a array
Suppose there are 7 fields (columns) in the csv: field 1, field 2, field 3,field 4,field 5,field 6,field 7
I need to be able to give any fields as input parameter dynamically. Suppose if I give input parameters as field 3, field 7 and the data in the csv is:
No of orders'|Date'|Year'|Exp_date'|Month'|time'|Committed
12'|12122002'|2013'|02022012'|12'|1230'|Yes
Then I want the output to be:
Year~Committed
2013~Yes
and the remaining columns also to be in the same format:
No of orders~Date~Exp_date~Month~time
12~12122002~02022012~12~1230
Currently I have a perl script from the net which gives me only the left-hand result, with the columns hard-coded. But I want to give the input at run time and have the result generated. Help is appreciated.
$filename = 'xyz.csv';
# use the perl open function to open the file
open(FILE, $filename) or die "Could not read from $filename, program halting.";
# loop through each line in the file
# with the typical "perl file while" loop
while(<FILE>)
{
chomp;
# read the fields in the current line into an array
@fields = split('\`\|', $_);
# print the concat
print "$fields[0]~$fields[1]\n";
}
close FILE;
I am not going to argue whether to use Text::CSV or not. It is not relevant to the question and split is perfectly acceptable.
Here is how I would solve the problem, assuming that the poster wants to have input files of more than one line of data.
#!/usr/bin/perl -w
# take any file name from the command line, instead of hard coding it
my ($filename) = @ARGV;
my @title;
my @table;
open FILE, "<", $filename or die "Could not read from $filename, program halting.";
while (<FILE>) {
chomp;
if (!@title) {
@title = split '\'\|';
next;
}
my @row = split '\'\|';
# build an array of array references, a two dimensional table
push @table, \@row;
}
close FILE;
print "Access which row? ";
my $row = <STDIN> - 1;
die "Row value is out of range for $filename\n" if (($row < 0) || ($row >= scalar(@table)));
print "Access which column? ";
my $col = <STDIN> - 1;
die "Column value is out of range for $filename\n" if (($col < 0) || ($col >= scalar(@title)));
print "$title[$col]\~$table[$row][$col]\n";
The first pass of the while loop will store the split values in our Title array. The remaining passes will append our data, row by row, as array references to Table. Then with some user prompting and basic error checking, we print our results.
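The output layout the question asks for (the chosen columns on one side, all remaining columns on the other) could be produced along the following lines. This is a sketch, not a drop-in replacement for the script above: the sub name partition_columns is hypothetical, the field numbers are 1-based and hard-coded here as [ 3, 7 ], and the selected columns come out in file order rather than in the order given.

```perl
use strict;
use warnings;

# Sample data from the question; fields are separated by the two characters '|
my @lines = (
    "No of orders'|Date'|Year'|Exp_date'|Month'|time'|Committed",
    "12'|12122002'|2013'|02022012'|12'|1230'|Yes",
);

# Split every line and return two lists of ~-joined rows: the selected
# columns and all the others. @$wanted holds 1-based field numbers.
sub partition_columns {
    my ( $wanted, @lines ) = @_;

    my %is_wanted;
    $is_wanted{ $_ - 1 } = 1 for @$wanted;    # convert to 0-based indexes

    my ( @picked, @rest );
    for my $line (@lines) {
        my @fields = split /'\|/, $line;
        my @in     = grep {  $is_wanted{$_} } 0 .. $#fields;
        my @out    = grep { !$is_wanted{$_} } 0 .. $#fields;
        push @picked, join '~', @fields[@in];
        push @rest,   join '~', @fields[@out];
    }
    return ( \@picked, \@rest );
}

my ( $picked, $rest ) = partition_columns( [ 3, 7 ], @lines );
print "$_\n" for @$picked, @$rest;
```

To choose the fields at run time, the [ 3, 7 ] list could be filled from @ARGV or from a prompt on STDIN, as the script above does for the row and column numbers.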

Find all the occurrence of string in a file and print its line number in Perl

I have a large file which contains 400000 lines; each line contains many keywords separated by tabs.
I also have a file that contains a list of keywords to be matched. This file acts as a lookup table.
So for each keyword in the lookup table I need to find all of its occurrences in the large file and print the line number of each occurrence.
I have tried this
#!/usr/bin/perl
use strict;
use warnings;
my $linenum = 0;
print "Enter the file path of lookup table:";
my $filepath1 = <>;
print "Enter the file path that contains keywords :";
my $filepath2 = <>;
open( FILE1, "< $filepath1" );
open FILE2, "< $filepath2" ;
open OUT, ">", "SampleLineNum.txt";
while( $line = <FILE1> )
{
while( <FILE2> )
{
$linenum = $., last if(/$line/);
}
print OUT "$linenum ";
}
close FILE1;
This gives only the first occurrence of the keyword. But I need all the occurrences, and the keyword should match exactly.
The problem I am facing with exact matching is this: say I have the keywords "hello" and "hello world".
If I need to match "hello", it also returns the line numbers of lines which contain "hello world".
My script should match only "hello" and give its line number.
Here is a solution that matches every occurrence of all keywords:
#!/usr/bin/perl
use strict;
use warnings;
#Lexical variable for filehandle is preferred, and always error check opens.
open my $keywords, '<', 'keywords.txt' or die "Can't open keywords: $!";
open my $search_file, '<', 'search.txt' or die "Can't open search file: $!";
my $keyword_or = join '|', map {chomp;qr/\Q$_\E/} <$keywords>;
my $regex = qr|\b($keyword_or)\b|;
while (<$search_file>)
{
while (/$regex/g)
{
print "$.: $1\n";
}
}
keywords.txt:
hello
foo
bar
search.txt:
plonk
food is good
this line doesn't match anything
bar bar bar
hello world
lalalala
hello everyone
Output:
4: bar
4: bar
4: bar
5: hello
7: hello
Explanation:
This creates a single regex that matches all of the keywords in the keywords file.
<$keywords> - when this is used in list context, it returns a list of all lines of the file.
map {chomp;qr/\Q$_\E/} - this removes the newline from each line and applies the \Q...\E quote-literal regex operator to each line (This ensures that if you have a keyword like "foo.bar" it will treat the dot as a literal character, not a regex metacharacter).
join '|', - join the resulting list into a single string, separated by pipe characters.
my $regex = qr|\b($keyword_or)\b|; - create a regex that looks like this:
/\b(\Qhello\E|\Qfoo\E|\Qbar\E)\b/
This regex will match any of your keywords. \b is the word boundary marker, ensuring that only whole words match: food no longer matches foo. The parentheses capture the specific keyword that matched in $1. This is how the output prints the keyword that matched.
I updated the solution to match each keyword on a given line and to only match complete words.
Is this part of something bigger? Because this is a one liner with grep
grep -n hello filewithlotsalines.txt
grep -n "hello world" filewithlotsalines.txt
-n gets grep to show the line numbers first before the matching lines. You can do man grep for more options.
I am assuming here that you are on a linux or *nix system.
I have a different interpretation of your request. It seems that you may want to maintain a list of line numbers where certain entries from a lookup table are found on lines of a 'keyword' file. Here's a sample lookup table:
hello world
hello
perl
hash
Test
script
And a tab-delimited 'keyword' file, where multiple keywords may be found on a single line:
programming tests
hello everyone
hello hello world perl
scripting scalar
test perl script
hello world perl script hash
Given the above, consider the following solution:
use strict;
use warnings;
my %lookupTable;
print "Enter the file path of lookup table: \n";
chomp( my $lookupTableFile = <> );
print "Enter the file path that contains keywords: \n";
chomp( my $keywordsFile = <> );
open my $ltFH, '<', $lookupTableFile or die $!;
while (<$ltFH>) {
chomp;
undef @{ $lookupTable{$_} };
}
close $ltFH;
open my $kfFH, '<', $keywordsFile or die $!;
while (<$kfFH>) {
chomp;
for my $keyword ( split /\t+/ ) {
push @{ $lookupTable{$keyword} }, $. if defined $lookupTable{$keyword};
}
}
close $kfFH;
open my $slFH, '>', 'SampleLineNum.txt' or die $!;
print $slFH "$_: @{ $lookupTable{$_} }\n"
for sort { lc $a cmp lc $b } keys %lookupTable;
close $slFH;
print "Done!\n";
Output to SampleLineNum.txt:
hash: 6
hello: 2 3
hello world: 3 6
perl: 3 5 6
script: 5 6
Test:
The script uses a hash of arrays (HoA), where the key is an entry from the lookup table and the associated value is a reference to a list of line numbers where that entry was found on lines of a 'keyword' file. The hash %lookupTable is initialized with a reference to an empty list.
Each line of the 'keywords' file is split on the delimiting tab, and if a corresponding entry is defined in %lookupTable, the line number is pushed onto the corresponding list. When done, the %lookupTable keys are case-insensitively sorted and written out to SampleLineNum.txt, along with their corresponding list of line numbers where the entry was found, if any.
There's no sanity checks on the file names entered, so consider adding those.
Hope this helps!
To find all of the occurrences, you need to read in the keywords and then loop through the keywords to find matches for each line. Here is your code, modified to find keywords in the line using an array. In addition, I added a counter for the line number, and the line number is printed only if there is a match. Your code prints an item for each line even if there is no match.
#!/usr/bin/perl
use strict;
use warnings;
my $linenum = 0;
print "Enter the file path of lookup table:";
my $filepath1 = <>;
print "Enter the file path that contains keywords :";
my $filepath2 = <>;
open( FILE1, "< $filepath1" );
open FILE2, "< $filepath2" ;
# Read in all of the keywords
my @keywords = <FILE2>;
# Close the file2
close(FILE2);
# Remove the line returns from the keywords
chomp @keywords;
# Sort and reverse the items to compare the maximum length items
# first (hello there before hello)
@keywords = reverse sort @keywords;
foreach my $k ( @keywords)
{
print "$k\n";
}
open OUT, ">", "SampleLineNum.txt";
my $line;
# Counter for the lines in the file
my $count = 0;
while( $line = <FILE1> )
{
# Increment the counter for the number of lines
$count++;
# loop through the keywords to find matches
foreach my $k ( @keywords )
{
# If there is a match, print out the line number
# and use last to exit the loop and go to the
# next line
if ( $line =~ m/$k/ )
{
print "$count\n";
last;
}
}
}
close FILE1;
I think there are some questions similar to this one. You can check out:
Perl: Search text file for keywords from array
How can I search multiple files for a string in Perl?
The File::Grep module is interesting.
As others have already given Perl solutions, I will suggest that you could use awk here instead.
> cat temp
abc
bac
xyz
> cat temp2
abc jbfwerf kfnm
jfjkwebfkjwe bac xyz
ndwjkfn abc kenmfkwe bac xyz
> awk 'FNR==NR{a[$1];next}{for(i=1;i<=NF;i++)if($i in a)print $i,FNR}' temp temp2
abc 1
bac 2
xyz 2
abc 3
bac 3
xyz 3
>