I have data.txt with data as shown below. I have around a hundred names, each with 3 constant IDs (100, 200, 300), and each ID has 32 data values.
NAME: xyz
ID: 100
DATA: 10 15 99 13 ...
ID: 200
DATA: 23 45 78 90..
ID: 300
DATA: 45 67 89 56
NAME: abc
ID: 100
DATA: 2 4 787 8..
ID: 200
DATA: 12 14 17..
ID: 300
DATA: 45 34 22..
I need to write this data to another file which looks like
xyz_100, xyz_200,xyz_300,abc_100,...
10 , 23 , 45 ,2
15 , 45 ,67 ,4
I built a hash to store the values, but right now my code overwrites the first two entries and stores only the last one. How can I save the first two entries too? Please also let me know if I can simplify the code.
#!/usr/local/bin/perl
use diagnostics;
use strict;
use warnings;

my @avar_names;
my %record;

local $/ = '';

open my $fh, '<', 'datalog.dat' or die "failed: $!";
while (<$fh>) {
    chomp;
    my %avar;
    while (/^([A-Z:]+):\s*(.*)/mg) {
        $avar{$1} = $2;
    }
    my $avar_name = "$avar{NAME}_$avar{ID}";
    push @avar_names, $avar_name;
    $record{$avar_name} = $avar{DATA};

    use Data::Dumper;
    print Dumper \%record;
}
There is a consistency problem: the local $/ = '' line puts the filehandle into paragraph mode, so the first while reads a whole NAME/ID/DATA block at once (which the while (//mg) loop then walks through), but the rest of the code expects line-by-line processing.
I suggest keeping it simple: process the file line by line and, for each line, extract a key/value pair (the key being NAME, ID or DATA) with a regex on the current line.
For that to work, %avar must not be local to the while loop over <$fh>, since it has to keep track of the last NAME/ID it saw on earlier lines of the file.
The @avar_names array is not used.
Finally, you probably want to print the resulting records at the end, but that is your choice.
This is your program from which I tried to change only the things mentioned above:
#!/usr/local/bin/perl
use diagnostics;
use strict;
use warnings;

my @avar_names;
my %record;
#local $/ = '';    ## commented out

open my $fh, '<', 'datalog.dat' or die "failed: $!";
my %avar;          ## moved here
while (<$fh>) {
    chomp;
    if (m/^([A-Z]+):\s*(.*)/) {    ## if, not a while
        $avar{$1} = $2;
        ## ensure we have at least our 3 keys, process only if we met DATA
        if (exists($avar{NAME}) && exists($avar{ID}) && exists($avar{DATA}) && $1 eq 'DATA') {
            my $avar_name = "$avar{NAME}_$avar{ID}";
            push @avar_names, $avar_name;    ## not used
            $record{$avar_name} = $avar{DATA};
        }
    }
}

use Data::Dumper;
print Dumper \%record;
Output:
$VAR1 = {
          'xyz_100' => '10 15 99 13 ...',
          'xyz_300' => '45 67 89 56',
          'abc_200' => '12 14 17..',
          'abc_300' => '45 34 22..',
          'abc_100' => '2 4 787 8..',
          'xyz_200' => '23 45 78 90..'
        };
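Once %record holds one DATA string per name/ID pair, producing the transposed, comma-separated layout from the question is just a matter of splitting each value list and printing them side by side. Here is a minimal sketch of that last step (in Python, for illustration; the record values are hard-coded sample data):

```python
# Hypothetical sketch of the final transposition step: each record maps a
# "name_id" key to a whitespace-separated DATA string, and we emit one
# output column per record.
records = {
    "xyz_100": "10 15 99 13",
    "xyz_200": "23 45 78 90",
    "xyz_300": "45 67 89 56",
}
names = ["xyz_100", "xyz_200", "xyz_300"]    # desired column order

columns = [records[n].split() for n in names]
lines = [",".join(names)]                    # header row
for row in zip(*columns):                    # one output line per data index
    lines.append(",".join(row))

output = "\n".join(lines)
print(output)
```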
You need something like this:
my @avars;
# ... some code
while (<$fh>) {
    # more code
    my %curr;
    while (/^([A-Z]+):\s*(.*)/mg) {
        $curr{$1} = $2;
    }
    if (%curr) {
        push @avars, \%curr;
        my $avar_name = "$curr{NAME}_$curr{ID}";
        push @avar_names, $avar_name;
        $record{$avar_name} = $curr{DATA};
    }
    # ....
}
I have the table below (a txt file with tab separators) and I want to write a script that subtracts each line's values from the previous line's values and then takes the absolute value of each result.
43 402 51 360 66
61 63 67 66 65
63 60 69 63 58
65 53 89 55 57
103 138 135 135 85
For example:
abs(61-43) abs(63-402) abs(67-51) abs(66-360) abs(65-66)
abs(63-61) abs(60-63) abs(69-67) abs(63-66) abs(58-65) etc..
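The computation described above - subtract the previous row element-wise and keep the absolute values - can be sketched compactly like this (a Python illustration, with the question's first three rows hard-coded as sample input):

```python
rows = [
    [43, 402, 51, 360, 66],
    [61, 63, 67, 66, 65],
    [63, 60, 69, 63, 58],
]

# pair each row with the one before it and take element-wise absolute differences
diffs = [
    [abs(cur - prev) for prev, cur in zip(prev_row, cur_row)]
    for prev_row, cur_row in zip(rows, rows[1:])
]
print(diffs)
```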
This is what I wrote.
#!/usr/bin/perl -w
use strict;
use warnings;

my $filename = '/home/kgee/Desktop/gene_gangs/table_gangs/table_ony_numbers';
open(my $fh, '<:encoding(UTF-8)', $filename)
    or die "Could not open file '$filename' $!";

$newtable=0;
$i=0;
$j=1;
while (my $row = <$fh>) {
    chomp $row;
    print "$row\n";
    my @numbertable = split//,$filename;
    for (@numbertable){
        $newtable[$j]= ($i+2) -($i+1);
        $temp[$i]= abs($newtable[$j]);
        $newtable=$tepm[$i];
        my @newtable= split//,$newtable;
        print("@newtable","\n");
        $i=$i+1;
    }
}
I got many errors, all of the form "Global symbol XXX requires explicit package name (did you forget to declare "my XXX"?) at line XXX".
I read online that to get past this issue you can remove use warnings; (from the beginning), which is not recommended, or declare the variables outside the block (and not inside!). I tried both, but I still get some warnings.
OK, first off: do not "turn off warnings", ever. Would you turn off the warnings on your car and expect that to end well? Warnings are there because there's something you need to fix, not ignore.
my is used to declare a variable the first time you use it.
This is as simple as
my $newtable = 0;
You've also added some sources of confusion into your code, and I'd suggest you sort them out:
Indent it properly
Don't use both $newtable and @newtable - they're different variables, and it's way too easy to mix up $newtable and $newtable[0]
You've got $temp and $tepm - this is exactly the sort of thing that use warnings helps you identify.
Splitting $filename to get @numbertable - I'm pretty sure that doesn't do what you want, because it's splitting the string '/home/kgee/Desktop/gene_gangs/table_gangs/table_ony_numbers' into characters. Did you perhaps mean split /\t/, $row;?
Likewise my @newtable = split //, $newtable; ... I don't think that does what you think it does either, because $newtable is 'just' zero as instantiated earlier in your program, and you never meaningfully modify it.
for (@numbertable) iterates each element in that table (the split row?), but you never use the iterator - $_ is set to the current element on each iteration. It doesn't have anything to do with $i and $j, and you don't actually modify $j at all, so it never changes.
perl -w and use warnings; are redundant. You should stick with one or the other. (I favour use warnings; along with use strict; personally.)
Actually, the more I look at the code, I'm afraid it becomes clear it doesn't actually work, and your problems run a bit deeper than your initial warnings.
How about:
#!/usr/bin/perl
use strict;
use warnings;

use Data::Dumper;

my @previous_row;
my @new_table;

## iterate line by line - use your file handle here, <DATA> is a special case.
while ( <DATA> ) {
    # strip trailing linefeed.
    chomp;
    # split this row on any whitespace (which includes tabs)
    my @row = split;
    # Handle first iteration - can't subtract 'previous row' if there isn't one.
    if ( @previous_row ) {
        my @new_row;
        # iterate the current row
        foreach my $element ( @row ) {
            # grab the elements off the previous row - note "shift" modifies it, and this will
            # break if you've got uneven length rows. (But you don't, and I'll leave it to you
            # to handle that if you do.)
            my $previous_row_element = shift @previous_row;
            # calculate new value
            my $value = abs( $element - $previous_row_element );
            # stash new value into new row.
            push @new_row, $value;
        }
        # new row is complete, so push it into the new table.
        push @new_table, \@new_row;
    }
    # Update 'previous row' with the contents of the current row.
    @previous_row = @row;
}

# lazy mode output. Iterating the array and printing values in the format you want
# is up to you.
print Dumper \@new_table;
__DATA__
43 402 51 360 66
61 63 67 66 65
63 60 69 63 58
65 53 89 55 57
103 138 135 135 85
See @Sobrique's answer for an excellent explanation of some of the problems with your script. I just present an alternative approach here:
use feature qw(say);
use strict;
use warnings;

my $filename = 'table_only_numbers';
open(my $fh, '<:encoding(UTF-8)', $filename)
    or die "Could not open file '$filename': $!";

my @row1;
while (my $row = <$fh>) {
    chomp $row;
    my @row2 = split " ", $row;
    if (@row1) {
        die "Bad format!" if @row1 != @row2;
        my @new = map { abs($row2[$_] - $row1[$_]) } 0..$#row2;
        say join " ", @new;
    }
    @row1 = @row2;
}
close $fh;
I am trying to merge a fasta file and a qual file into a new fastq file, keeping in mind that the two files might list their sequence IDs in different orders. The first step of my script therefore sorts the sequences, and it works perfectly when I test it as a separate script. The same goes for the rest: when I run the part that combines the files into a fastq on its own, it runs perfectly. But now that I am trying to combine the two steps in one script, it doesn't work and I don't know what else to do! I would appreciate it if you can help me.
Here is my script so far. It creates the new fastq file, but the content is messed up and not what I want. I run it from the terminal like this:
$ perl script.pl reads.fasta reads.qual > reads.fq
Script :
#!/usr/bin/env perl
use strict;
use warnings;

die("Usage: script.pl reads.fasta reads.qual > reads.fq") unless (scalar @ARGV) == 2;

open FASTA, $ARGV[0] or die "cannot open fasta: $!\n";
open QUAL,  $ARGV[1] or die "cannot open qual: $!\n";

my $offset = 33;
my $count = 0;
local($/) = "\n>";
my %id2seq = ();
my $id = '';
my %idq2seq = ();
my $idq = '';
my (@sort_q, @sort_f);

while (<FASTA>) {
    chomp;
    if ($_ =~ /^>(.+)/) {
        $id = $1;
    } else {
        $id2seq{$id} .= $_;
    }
}
for $id (sort keys %id2seq)
{
    @sort_f = "$id\n$id2seq{$id}\n\n";
    print @sort_f;
}

while (<QUAL>) {
    chomp;
    if ($_ =~ /^>(.+)/) {
        $idq = $1;
    } else {
        $idq2seq{$idq} .= $_;
    }
}
for $idq (sort keys %idq2seq)
{
    @sort_q = "$idq\n$idq2seq{$idq}\n\n";
    print "@sort_q";
}

while (my @sort_f) {
    chomp @sort_f;
    my ($fid, @seq) = split "\n", @sort_f;
    my $seq = join "", @seq; $seq =~ s/\s//g;
    my $sortq = @sort_q;
    chomp my @sortq;
    my ($qid, @qual) = split "\n", @sortq;
    @qual = split /\s+/, (join(" ", @qual));
    # convert score to character code:
    my @qual2 = map { chr($_ + $offset) } @qual;
    my $quals = join "", @qual2;
    die "mismatch of fasta and qual: '$fid' ne '$qid'" if $fid ne $qid;
    $fid =~ s/^\>//;
    print STDOUT (join("\n", "@" . $fid, $seq, "+$fid", $quals), "\n");
    $count++;
}
close FASTA;
close QUAL;
print STDERR "wrote $count entries\n";
Thank you in advance
It's been a while since I have used Perl, but I would approach this using a hash of key/value pairs for both the fasta and quality input, then write out all the pairs by looping over the fasta hash and pulling out the corresponding quality string.
I have written something in Python that will do what you need.
It assumes that your input looks like this:
reads.fasta
>fa_0
GCAGCCTGGGACCCCTGTTGT
>fa_1
CCCACAAATCGCAGACACTGGTCGG
reads.qual
>fa_0
59 37 38 51 56 55 60 44 43 42 56 65 60 68 52 67 43 72 59 65 69
>fa_1
36 37 47 72 34 53 67 41 70 67 66 51 47 41 73 58 75 36 61 48 70 55 46 42 42
output
@fa_0
GCAGCCTGGGACCCCTGTTGT
+
;%&387<,+*8A<D4C+H;AE
@fa_1
CCCACAAATCGCAGACACTGGTCGG
+
$%/H"5C)FCB3/)I:K$=0F7.**
@fa_2
TCGTACAGCAGCCATTTTCATAACCGAACATGACTC
+
C?&93A#:?#F,2:'KF*20CC:I7F9J.,:E8&?F
import sys

# Check there are enough arguments
if len(sys.argv) < 3:
    print('Usage: {s} reads.fasta reads.qual > reads.fq'.format(s=sys.argv[0]), file=sys.stderr)
    sys.exit(1)

# Initialise dictionaries for reads and qualities
read_dict = dict()
qual_dict = dict()

fa_input = sys.argv[1]
qual_input = sys.argv[2]

# Read in fasta input
with open(fa_input, 'r') as fa:
    for line in fa:
        line = line.strip()
        if line[0] == '>':
            read_dict[line[1:]] = next(fa).strip()
        else:
            next(fa)

# Read in quality input
with open(qual_input, 'r') as qual:
    for line in qual:
        line = line.strip()
        if line[0] == '>':
            qual_dict[line[1:]] = next(qual).strip()
        else:
            next(qual)

count = 0
# Iterate over all fasta reads
for key, seq in read_dict.items():
    # Check if the read header is in the qualities data
    if key in qual_dict.keys():
        # There's both sequence and quality data, so write to stdout
        read_str = '@{header}\n{seq}\n+\n{qual}'.format(
            header=key,
            seq=seq,
            qual=''.join([chr(int(x)) for x in qual_dict[key].split(' ')]))
        print(read_str, file=sys.stdout)
        count += 1
    else:  # not found
        # Write error to stderr
        print('Error: {k} not found in qual file'.format(k=key), file=sys.stderr)

# Print count to stderr
print('{c} reads written'.format(c=count), file=sys.stderr)
If you need to use an offset for the quality score edit
qual=''.join([chr(int(x)) for x in qual_dict[key].split(' ')])) to
qual=''.join([chr(int(x) + offset) for x in qual_dict[key].split(' ')])) and define an offset variable before this.
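To make the offset concrete: FASTQ quality strings conventionally use Phred+33 encoding, so each numeric score becomes the character with code score + 33. A tiny sketch (the scores are made-up sample values):

```python
offset = 33                      # Phred+33: '!' (code 33) represents score 0
scores = [0, 26, 38, 40]
# encode each numeric score as a single printable character
encoded = "".join(chr(s + offset) for s in scores)
print(encoded)
```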
I am creating a script that should take in a data file and a column index, read and store that column and then perform some statistics on the data. I am unsure how to specify that I only want to store a specific column in Perl. Here is my code so far:
#! /usr/bin/perl
use warnings;
use strict;
use feature qw(say);
use Scalar::Util qw(looks_like_number);

my ($FILE, $COLUMN_TO_PARSE) = @ARGV;

# check if file arg is present
if (not defined $FILE) {
    die "Please specify file input $!";
}
# check if column arg is present
if (not defined $COLUMN_TO_PARSE) {
    die "Please specify column number $!";
}

unless (open(INPUT_FILE, "<", $FILE)) {
    die "Couldn't open ", $FILE, " for reading!", $!;
}

my @data;
while (<INPUT_FILE>) {
    # Only store $COLUMN_TO_PARSE, save to @data
}
close(INPUT_FILE);
For reference, the data coming in looks something like this(sorry for format):
01 8 0 35 0.64 22
02 8 0 37 0.68 9
03 8 0 49 0.68 49
So for example, if I ran
perl descriptiveStatistics.pl dataFile.txt 3
I would expect to have [35, 37, 49] in the @data array.
I stumbled upon this question, but it has to do with headers, which I don't have, and it is not very helpful imo. Any suggestions?
I've used split() to split the input line into a list of fields. By default, split() works on $_ and splits on whitespace - which is exactly what we want here.
I've then used a list slice to get the column that you want, and pushed that onto your array.
#! /usr/bin/perl
use warnings;
use strict;

# Check parameters
@ARGV == 2 or die "Please specify input file and column number\n";
my ($file, $column_to_parse) = @ARGV;

open my $in_fh, '<', $file
    or die "Couldn't open $file for reading: $!";

my @data;
while (<$in_fh>) {
    push @data, (split)[$column_to_parse];
}
If I was writing it for myself, I think I would replace the while loop with a map.
my @data = map { (split)[$column_to_parse] } <$in_fh>;
Update: To ensure that you have been given a valid column number (and I think that's a good idea) you might write something like this:
while (<$in_fh>) {
    my @fields = split;
    die "Not enough columns in row $.\n" if $#fields < $column_to_parse;
    push @data, $fields[$column_to_parse];
}
split is a good choice:
while (my $line = <INPUT_FILE>) {
    my @items = split(/\t/, $line);
    push @data, $items[$COLUMN_TO_PARSE];
}
You could design a regex pattern that matches a column, repeat it $COLUMN_TO_PARSE times, and then capture the content of the next column and push it onto your array @data.
Like this:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my @data;
my $COLUMN_TO_PARSE = 3;

while (<DATA>) {
    if (/([^\s]+\s+){$COLUMN_TO_PARSE}([^\s]+)/) {
        push @data, $2;
    } else {
        print("error wrong line format: $_\n");
    }
}

print Dumper(@data);

__DATA__
01 8 0 35 0.64 22
02 8 0 37 0.68 9
03 8 0 49 0.68 49
which gives the following dump for @data:
$VAR1 = '35';
$VAR2 = '37';
$VAR3 = '49';
$COLUMN_TO_PARSE is zero based, as in your example, and, as a side-effect, the regexp will fail if the requested column does not exist, thus giving you error-handling.
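The same repeated-group idea works in any regex flavour; here is the equivalent of the pattern above as a short Python sketch (the sample rows are copied from the question's data):

```python
import re

column_to_parse = 3  # zero-based, as in the Perl version
# skip column_to_parse whitespace-delimited fields, then capture the next one
pattern = re.compile(r"(?:\S+\s+){%d}(\S+)" % column_to_parse)

data = []
for line in ["01 8 0 35 0.64 22", "02 8 0 37 0.68 9", "03 8 0 49 0.68 49"]:
    m = pattern.match(line)
    if m:
        data.append(m.group(1))   # the capture group holds the requested column
print(data)
```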
You can use split to get the data column-wise. Each column is stored in a consecutive index of the array.
while (<INPUT_FILE>) {
    my @columns = split(/\t/, $_);   # Assuming the delimiter is a tab
    print "First column====$columns[0]\n";
    print "Second column====$columns[1]\n";
}
Process whichever column you want and store it in an array.
I have 2 tab-delimited files formatted similar to this:
file 1
A 100 90 PASS
B 89 80 PASS
C 79 70 PASS
D 69 60 FAIL
F 59 0 FAIL
file 2
Randy 80
Denis 44
Earl 97
I want to take the values from column 2 in file 2 and compare them with the ranges given between columns 2 and 3 of file 1. Then I want to create a new file that combines this data, printing columns 1 and 2 from file 2 and columns 1 and 4 from file 1:
file 3
Randy 80 B PASS
Denis 44 F FAIL
Earl 97 A PASS
I want to implement this using awk or perl.
You can use this awk:
awk 'BEGIN{FS=OFS="\t"}
FNR==NR {
a[$0] = $2
next
}
{
for (i in a)
if ($2>=a[i] && $3<=a[i])
print i, $1, $4
}' file2 file1
Earl 97 A PASS
Randy 80 B PASS
Denis 44 F FAIL
In perl, I'd probably do something like this:
#!/usr/bin/env perl
use strict;
use warnings 'all';
use Data::Dumper;
open ( my $grades_in, '<', "file1.txt" ) or die $!;
my #grade_lookup = map { [split] } <$grades_in>;
print Dumper \#grade_lookup;
close ( $grades_in );
open ( my $people, '<', "file2.txt" ) or die $!;
while (<$people>) {
chomp;
my ( $person, $score ) = split;
my ( $grade ) = grep { $_ -> [1] >= $score
and $_ -> [2] <= $score } #grade_lookup;
print join " ", $person, $score, $grade -> [0], $grade -> [3], "\n";
}
close ( $people );
output:
Randy 80 B PASS
Denis 44 F FAIL
Earl 97 A PASS
In Perl
use strict;
use warnings 'all';
use autodie;
use List::Util 'first';

my @grades = do {
    open my $fh, '<', 'file1.txt';
    map [ split ], <$fh>;
};

open my $fh, '<', 'file2.txt';

while ( <$fh> ) {
    my ($name, $score) = split;
    my $grade = first { $_->[2] <= $score } @grades;
    print "$name $score @$grade[0,3]\n";
}
output
Randy 80 B PASS
Denis 44 F FAIL
Earl 97 A PASS
I would like to filter out particular columns with a regex and discard others. For example, if I had the following column names:
date
mem_total
cpu.usagemhz.average_0
cpu.usagemhz.average_1
cpu.usagemhz.average_2
I would like to capture only the columns whose names begin with "cpu.usagemhz.average".
Is there a particular function of Text::CSV that will help me do a quick check of the column names?
Thanks!
JD
** Update **
I tried jimtut's answer and it is extremely close to what I am looking for. Thanks again everyone!
Here is the code from jimtut with one small edit to the print statement at the bottom. I added the print of $colCount just to see what was going on with the data.
use Text::CSV;

my $file = "foo.csv";
my $pattern = ".*In";

open(F, $file) or warn "Warning! Unable to open $file\n";
my $lineCount = 0;
my %desiredColumns;
while (<F>) {
    $lineCount++;
    my $csv = Text::CSV->new();
    my $status = $csv->parse($_);    # should really check this!
    my @fields = $csv->fields();
    my $colCount = 0;
    if ($lineCount == 1) {
        # Let's look at the column headings.
        foreach my $field (@fields) {
            $colCount++;
            if ($field =~ m/$pattern/) {
                # This heading matches, save the column #.
                $desiredColumns{$colCount} = 1;
            }
        }
    }
    else {
        # Not the header row. Parse the body of the file.
        foreach my $field (@fields) {
            $colCount++;
            if (exists $desiredColumns{$colCount}) {
                # This is one of the desired columns.
                # Do whatever you want to do with this column!
                print "$colCount\t$field\n";
            }
        }
    }
}
close(F);
Here are the results:
$colCount | $field
12 565
13 73
14 36
15 32
16 127
17 40
18 32
19 42
20 171
12 464
13 62
14 32
15 24
16 109
17 21
18 19
19 39
20 150
12 515
13 76
14 28
15 30
16 119
17 15
18 25
19 46
20 169
12 500
13 71
14 30
15 28
16 111
17 20
18 18
19 40
20 167
I would like to add this data to individual arrays or hashes. What do you think? Something like...
foreach column {
    check to see if a hash already exists with that column number. If not, then create a hash.
}
Then go through each field and add the field data to the appropriate hash.
Do you think this is the right way to go about solving this?
No, not a specific function in Text::CSV. I would do something like this:
use Text::CSV;

my $file = "foo.csv";
my $pattern = "cpu.usagemhz.average.*";

open(F, $file) or die "Unable to open $file: $!\n";
my $lineCount = 0;
my %desiredColumns;
my %columnContents;
while (<F>) {
    $lineCount++;
    my $csv = Text::CSV->new();
    my $status = $csv->parse($_);    # should really check this!
    my @fields = $csv->fields();
    my $colCount = 0;
    if ($lineCount == 1) {
        # Let's look at the column headings.
        foreach my $field (@fields) {
            $colCount++;
            if ($field =~ m/$pattern/) {
                # This heading matches, save the column #.
                $desiredColumns{$colCount} = 1;
            }
        }
    }
    else {
        # Not the header row. Parse the body of the file.
        foreach my $field (@fields) {
            $colCount++;
            if (exists $desiredColumns{$colCount}) {
                # This is one of the desired columns.
                # Do whatever you want to do with this column!
                push(@{$columnContents{$colCount}}, $field);
            }
        }
    }
}
close(F);

foreach my $key (sort keys %columnContents) {
    print "Column $key: " . join(",", @{$columnContents{$key}}) . "\n\n";
}
Hope that helps! I'm sure someone can write that in a Perl one-liner, but that's easier (for me) to read...
Since your fields of interest are at indexes 2-4, we'll just pluck those out of the field array returned by getline(). This sample code prints them, but you can do whatever you like with them.
use Text::CSV;                               # load the module

my $csv = Text::CSV->new();                  # instantiate
open my $fh, "<", "somefile";                # open the input
while ( my $fields = $csv->getline($fh) ) {  # read a line, and parse it into fields
    print "I got @{$fields}[2..4]\n";        # print the fields of interest
}
close($fh);                                  # close when done
WHY are you trying to do this? Is it to minimize storage? To eliminate the processing cost of parsing many un-needed columns?
If the latter, you can't avoid that processing cost. Any solution you come up with will STILL read and parse 100% of the file.
If the former, there are many methods, some more efficient than others.
Also, what exactly do you mean by "help me do a quick check of the column names"? If you want to get the column names, there's a column_names() method, provided you previously set the column names using $csv->column_names($csv->getline($fh)).
If you want to return only specific columns in a hash, to avoid wasting memory on un-needed columns, there's no clear-cut API for that. You can roll your own, or abuse a "bug/feature" of the getline_hr() method.
For the former (roll your own), you can do something like:

my $headers = $csv->getline( $fh );    # First line is headers.
my @headers_keep = map { /^cpu.usagemhz.average/ ? 1 : 0 } @$headers;
my @rows;
while ( my $row = $csv->getline( $fh ) ) {
    my $i = 0;
    my @row_new = grep { $headers_keep[$i++] } @$row;
    push @rows, \@row_new;
}

For the latter, you can use a "feature" of getline_hr(), which doesn't keep values for a duplicate column name (only the LAST column with that name gets assigned).
In your case, for the column names date,mem_total,cpu.usagemhz.average_0,cpu.usagemhz.average_1,cpu.usagemhz.average_2, merely set the column_names array to contain the value "cpu.usagemhz.average_0" in its first 2 elements - those columns will then NOT be saved by getline_hr().
In general, you go over the list of columns, find each consecutive range of "not needed" columns, and replace their names with the name of the first needed column following that range. The only sticking point is an "un-needed" range at the very end of the columns - replace those names with "JUNK" or something.
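The duplicate-name trick relies on a general property of hash construction: when several columns map to the same key, the last assignment wins. Python's csv.DictReader behaves the same way, so the idea can be illustrated without Text::CSV (the column names and values here are made up to mirror the example above):

```python
import csv
import io

data = io.StringIO("2011-01-01,8192,100,200,300\n")

# Give the two un-needed leading columns (date, mem_total) the SAME name as
# the first wanted column: dict construction keeps only the last value seen
# for a duplicate key, so the unwanted values are silently overwritten.
reader = csv.DictReader(
    data,
    fieldnames=["cpu.usagemhz.average_0", "cpu.usagemhz.average_0",
                "cpu.usagemhz.average_0", "cpu.usagemhz.average_1",
                "cpu.usagemhz.average_2"],
)
row = next(reader)
print(row)   # only the three wanted columns survive
```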