If I have one file FOO_1.txt that contains:
FOOA
FOOB
FOOC
FOOD
...
and lots of other files FOO_files.txt. Each of them contains:
1110000000...
a single line of 0s and 1s, one digit per FOO_1 value (FOOA, FOOB, ...)
Now I want to combine them to one file FOO_RES.csv that will have the following format:
FOOA,1,0,0,0,0,0,0...
FOOB,1,0,0,0,0,0,0...
FOOC,1,0,0,0,1,0,0...
FOOD,0,0,0,0,0,0,0...
...
What is a simple & elegant way to do that
(with a hash of arrays -> $hash{$key} = \@data )?
Thanks a lot for any help !
Yohad
If you can't describe your data and your desired result clearly, there is no way that you will be able to code it. Taking on a simple project is a good way to get started using a new language.
Allow me to present a simple method you can use to churn out code in any language, whether you know it or not. This method only works for smallish projects. You'll need to actually plan ahead for larger projects.
How to write a program:
1. Open up your text editor and write down what data you have. Make each line a comment.
2. Describe your desired results.
3. Start describing the steps needed to change your data into the desired form.
Numbers 1 & 2 completed:
#!/usr/bin/perl
use strict;
use warnings;
# Read data from multiple files and combine it into one file.
# Source files:
# * Field definitions file: has a list of field names, one per line.
# Data files:
# * Each data file has a string of digits.
# * There is a one-to-one relationship between the digits in the data file and the fields in the field defs file.
#
# Results File:
# * The results file is a CSV file.
# * Each field will have one row in the CSV file.
# * The first column will contain the name of the field represented by the row.
# * Subsequent values in the row will be derived from the data files.
# * The order of subsequent fields will be based on the order files are read.
# * However, each column (2-X) must represent the data from one data file.
Now that you know what you have, and where you need to go, you can flesh out what the program needs to do to get you there - this is step 3:
You know you need to have the list of fields, so get that first:
# Get a list of fields.
# Read the field definitions file into an array.
Since it is easiest to write CSV in a row-oriented fashion, you will need to process all your files before generating each row. So you'll need someplace to store the data.
# Create a variable to store the data structure.
Now we read the data files:
# Get a list of data files to parse.
# Iterate over the list.
# For each data file:
#     Read the string of digits.
#     Assign each digit to its field.
#     Store data for later use.
We've got all the data in memory, now write the output:
# Write the CSV file.
#     Open a file handle.
#     Iterate over the list of fields.
#     For each field:
#         Get the field name and list of values.
#         Create a comma-separated string with the field name and values.
#         Write the string to the file handle.
#     Close the file handle.
Now you can start converting comments into code. You could have anywhere from 1 to 100 lines of code for each comment. You may find that something you need to do is very complex and you don't want to take it on at the moment. Make a dummy subroutine to handle the complex task, and ignore it until you have everything else done. Then you can solve that complex, thorny sub-problem on its own.
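For instance, the first "get a list of fields" comment might turn into something like this (just a sketch; the parse_data_file stub is a hypothetical placeholder for a complex step deferred until later):

#!/usr/bin/perl
use strict;
use warnings;

# Get a list of fields: read the field definitions file into an array.
open my $field_fh, '<', 'FOO_1.txt' or die "Cannot open FOO_1.txt: $!";
chomp( my @fields = <$field_fh> );
close $field_fh;

# A complex subtask can hide behind a dummy subroutine until the rest works.
sub parse_data_file {
    my ($path) = @_;
    die "parse_data_file('$path') is not implemented yet";
}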
Since you are just learning Perl, you'll need to hit the docs to find out how to do each of the subtasks represented by the comments you've written. The best resource for this kind of work is the list of functions by category in perlfunc. The Perl syntax guide will come in handy too. Since you'll need to work with a complex data structure, you'll also want to read from the Data Structures Cookbook.
You may be wondering how the heck you should know which perldoc pages you should be reading for a given problem. An article on Perlmonks titled How to RTFM provides a nice introduction to the documentation and how to use it.
The great thing is that if you get stuck, you have some code to share when you ask for help.
If I understand correctly your first file is your key order file, and the remaining files each contain a byte per key in the same order. You want a composite file of those keys with each of their data bytes listed together.
In this case you should open all the files simultaneously. Read one key from the key order file and one byte from each of the data files. Output everything to your final file as you read it. Repeat for each key.
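A minimal sketch of that approach might look like the following; note that the glob pattern for finding the data files is an assumption, since the question doesn't say how they are named:

use strict;
use warnings;

open my $keys_fh, '<', 'FOO_1.txt' or die "FOO_1.txt: $!";

# Assumed naming scheme; adjust the glob to match the real data files.
my @data_files = grep { $_ ne 'FOO_1.txt' } glob 'FOO_*.txt';
my @data_fhs   = map  { open my $fh, '<', $_ or die "$_: $!"; $fh } @data_files;

open my $out_fh, '>', 'FOO_RES.csv' or die "FOO_RES.csv: $!";
while ( my $key = <$keys_fh> ) {
    chomp $key;
    my @digits = map { getc($_) // '' } @data_fhs;  # one digit per data file
    print {$out_fh} join( ',', $key, @digits ), "\n";
}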
It looks like you have many foo_files that have 1 line in them, something like:
1110000000
Which stands for
fooa=1
foob=1
fooc=1
food=0
fooe=0
foof=0
foog=0
fooh=0
fooi=0
fooj=0
And it looks like your foo_res is just a summation of those values? In that case, you don't need a hash of arrays, but just a hash.
my @foo_files = (); # NOT SURE HOW YOU POPULATE THIS ONE
my @foo_keys  = qw(a b c d e f g h i j);
my %foo_hash  = map { ( $_, 0 ) } @foo_keys; # initialize hash

foreach my $foo_file ( @foo_files ) {
    open( my $FOO, "<", $foo_file ) || die "Cannot open $foo_file\n";
    my $line = <$FOO>;
    close( $FOO );
    chomp($line);

    my @foo_values = split(//, $line);
    foreach my $indx ( 0 .. $#foo_keys ) {
        last if ( ! defined $foo_values[ $indx ] ); # or some kind of error checking if the input file doesn't have all the values
        $foo_hash{ $foo_keys[$indx] } += $foo_values[ $indx ];
    }
}
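If a single total per key is what you are after, a final loop like this would print the sums (hypothetical, since the desired output format isn't fully clear from the question):

foreach my $key ( @foo_keys ) {
    print "foo$key,$foo_hash{$key}\n";
}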
It's pretty hard to understand what you are asking for, but maybe this helps?
Your specifications aren't clear. You couldn't have a "lots of other files" named FOO_files.txt, because it's only one name. So I'm going to take this as the files-with-data + filelist pattern. In this case, there are files named FOO*.txt, each containing "[01]+\n".
Thus the idea is to process all the files in the filelist file and to insert them all into a result file FOO_RES.csv, comma-delimited.
use strict;
use warnings;
use English qw<$OS_ERROR>;
use IO::Handle;
open my $foos, '<', 'FOO_1.txt'
    or die "I'm dead: $OS_ERROR";
@ARGV = sort map { chomp; "$_.txt" } <$foos>;
$foos->close;

open my $foo_csv, '>', 'FOO_RES.csv'
    or die "I'm dead: $OS_ERROR";

while ( my $line = <> ) {
    chomp $line;
    my ( $foo_name ) = ( $ARGV =~ /(.*)\.txt$/ );
    $foo_csv->print( join( ',', $foo_name, split( //, $line ) ), "\n" );
}
$foo_csv->close;
You don't really need to use a hash. My Perl is a little rusty, so syntax may be off a bit, but basically do this:
open KEYFILE , "foo_1.txt" or die "cannot open foo_1 for writing";
open VALFILE , "foo_files.txt" or die "cannot open foo_files for writing";
open OUTFILE , ">foo_out.txt"or die "cannot open foo_out for writing";
my %output;
while (<KEYFILE>) {
my $key = $_;
my $val = <VALFILE>;
my $arrVal = split(//,$val);
$output{$key} = $arrVal;
print OUTFILE $key."," . join(",", $arrVal)
}
Edit: Syntax check OK
Comment by Sinan: @Byron, it really bothers me that your first sentence says the OP does not need a hash yet your code has %output which seems to serve no purpose. For reference, the following is a less verbose way of doing the same thing.
#!/usr/bin/perl
use strict;
use warnings;
use autodie qw(:file :io);
open my $KEYFILE, '<', "foo_1.txt";
open my $VALFILE, '<', "foo_files.txt";
open my $OUTFILE, '>', "foo_out.txt";
while (my $key = <$KEYFILE>) {
    chomp $key;
    chomp( my $val = <$VALFILE> );
    print $OUTFILE join(q{,}, $key, split(//, $val)), "\n";
}
__END__
I have a text file that looks like this http://www.uniprot.org/uniprot/?sort=score&desc=&compress=no&query=id:P01375%20OR%20id:P04626%20OR%20id:P08238%20OR%20id:P06213&format=txt.
This file consists of different entries that are divided with //. I think I have almost found the way to divide the txt file into multiple txt files whenever this specific pattern appears, but I still don't know how to name them after dividing, or how to print them to a specific directory. I would like each divided file to carry the specific ID, which is in the first line, second column of each entry.
This is the code that I have written so far:
mkdir "spliced_files"; #directory where I would like to put all my splitted files
$/="//\n"; # divide them whenever //appears and new line after this
open (IDS, 'example.txt') or die "Cannot open"; #example.txt is an input file
my #ids = <IDS>;
close IDS;
my $entry = 25444; #number of entries or //\n characters
my $i=0;
while ($i eq $entry) {
print $ids[$i];
};
$i++;
I am still having problems figuring out how to split all the entries from the 'example.txt' file on "//\n" and how to write the separated files into the spliced_files directory. In addition, I have to name each of these separated files with the ID that is specific to that file or entry (which appears in the first row, but only the second column).
So I expect the output to be a number of files in the spliced_files directory, each of them named with their ID (first row, second column). For example, the name of the first file would be TNFA_HUMAN, the second would be ERBB2_HUMAN, and so on.
You still look like you're programming by guesswork. And you haven't made use of any of the advice you have been given in answers to your previous questions. I strongly recommend that you spend a week working through a good beginner's book like Learning Perl and come back when you understand more about how Perl works.
But here are some comments on your new code:
open (IDS, 'example.txt') or die "Cannot open";
You have been told that using lexical variables and the three-arg version of open() is a better approach here. You should also include $! in your error message, so you know what has gone wrong.
open my $ids_fh, '<', 'example.txt'
or die "Cannot open: $!";
Then later on (I added the indentation in the while loop to make things clearer)...
my $i=0;
while ($i eq $entry) {
    print $ids[$i];
};
$i++;
The first time you test the loop condition, $i is 0 and $entry is 25444. You compare them (as strings! You probably want ==, not eq) to see if they are equal. Clearly they are different, so the loop body never runs. Only after the loop exits do you increment $i.
This code bears no relation at all to the description of your problem. I'm not going to give you the answer, but here is the structure of what you need to do:
mkdir "spliced_files";
local $/ = "//\n"; # Always localise changes to special vars
open my $ids_fh, '<', 'example.txt'
or die "Cannot open: $!";
# No need to read the whole file in one go.
# Process it a line at a time.
while (<$ids_fh>) {
# Your record (the whole thing, not just the first line) is in $_.
# You need to extract the ID value from that string. Let's assume
# you've stored in it $id
# Open a file with the right name
open my $out_fh, '>', "spliced_files/$id" or die $!;
# Print the record to the new file.
print $out_fh $_;
}
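For the extraction step itself, assuming the UniProt flat-file format, where each record begins with a line like "ID   TNFA_HUMAN   Reviewed; ...", the ID is the second whitespace-separated field of the record's first line. Inside the while loop that might look like:

# Grab the record's first line, then its second column.
my ($first_line) = split /\n/, $_;
my (undef, $id)  = split ' ', $first_line;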
But really, you need to take the time to learn about programming before you attack this task. Or, if you don't have the time for that, pay a programmer to do it for you.
Sample file:
### Current GPS Coordinates ###
Just In ... : unknown
### Power Source ###
2
### TPM Status ###
0
### Boot Version ###
GD35 1.1.0.12 - built 14:22:56, Jul 10 232323
I want to split the above file into arrays like below:
@Current0_PS_Coordinates should be like below
### Current GPS Coordinates ###
Just In ... : unknown
I'd like to do it in Perl; any help? (current program added from a comment)
#!/usr/local/lib/perl/5.14.2 -w
my @lines;
my $file;
my $lines;
my $i;
#chomp $i;
$file = 'test.txt';   # Name the file
open(INFO, $file);    # Open the file
@lines = <INFO>;      # Read it into an array
close(INFO);          # Close the file
foreach my $line (@lines) { print "$line"; }
Read the file line by line. Use chomp to remove trailing newlines
If the input line matches this regexp /^### (.*) ###/ then you have the name of an "array" in $1
It is possible to make named variables like #Current0_PS_Coordinates from these matches.
But it's better to use them as hash keys and then store the data in a hash that has arrays as its values
So put the $1 from the match in "$lastmatch" and start an empty array referred to by a hash like this $items{$lastmatch}=[] for now and read some more
If the input line does not match the "name of an array" regexp given above, and if it is not empty, then we assume that it is a value for the last match found. So it can be pushed onto the current array like this: push @{ $items{$lastmatch} }, $line
Once you've done this, all the data will be available in the %items hash (see the sketch below)
See the perldata, perlre, perldsc and perllol documentation pages for more details.
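Putting those steps together, a rough sketch might look like this (the file name test.txt is borrowed from your code):

use strict;
use warnings;

my %items;
my $lastmatch;

open my $fh, '<', 'test.txt' or die "Cannot open test.txt: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ /^### (.*) ###/ ) {
        $lastmatch = $1;                # name of the current section
        $items{$lastmatch} = [];
    }
    elsif ( length $line && defined $lastmatch ) {
        push @{ $items{$lastmatch} }, $line;    # value for the last match
    }
}
close $fh;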
A good place to start would be buying the book Learning Perl (O'Reilly). Seriously, it's a great book with interesting exercises at the end of each chapter. Very easy to learn.
1). Why do you have "my @lines" and then "my $lines" lower down? Those are two separate variables, and giving them the same name is confusing. Remember that context matters: @list can be ('a', 'b', 'c'), but evaluating @list in scalar context returns 3, the number of items in that list.
2). What is "my $i"? Even if you're just writing down thoughts, try to use descriptive names. It'll make the code a lot easier to piece together.
3). Why is there a commented out "chomp $i"? Where were you going with that thought?
4). Try to use the 3 argument form of open. This will ensure you don't accidentally destroy files you're reading from:
open INFO, "<", $file;
If you're not sure where to start this problem, Vorsprung's answer probably won't mean anything. Regex and variables like $1 are things you'll need to read a book to understand.
I have no background in programming whatsoever, so I would appreciate it if you would explain how and why any code you recommend should be written the way it is.
I have a data matrix with 2,000+ samples, and I need to manipulate the format of one of its columns so that it is easier to merge with my other matrix.
For example, one column is known as sample number (column #16). The format is currently similar to ABCD-A1-A0SD-01A-11D-A10Y-09, and I would like to change it to the following format: ABCD-A1-A0SD-01A. This will put it in the right format so that I can merge it with another matrix. I have not been able to find any information on how to proceed with this step.
The sample input should look like this:
ABCD-A1-A0SD-01A-11D-A10Y-09
ABCD-A1-A0SD-01A-11D-A10Y-09
ABCD-A1-A0SE-01A-11D-A10Y-09
ABCD-A1-A0SE-01A-11D-A10Y-09
ABCD-A1-A0SF-01A-11D-A10Y-09
ABCD-A1-A0SH-01A-11D-A10Y-09
ABCD-A1-A0SI-01A-11D-A10Y-09
I want the last three extensions removed. The output sample should look like this:
ABCD-A1-A0SD-01A
ABCD-A1-A0SD-01A
ABCD-A1-A0SE-01A
ABCD-A1-A0SE-01A
ABCD-A1-A0SF-01A
ABCD-A1-A0SH-01A
ABCD-A1-A0SI-01A
Finally, the matrix that I want to merge with has a different layout; in other words, the number of columns and rows are different. This is an issue when I tackle the next step, which is merging the two matrices together. The original matrix has about 52 columns and 2,000+ rows, whereas the merging matrix only has 15 columns and 467 rows.
Each row of the original matrix has mutational information for a patient. This means that the same patient with the same ID might appear many times. The second matrix contains the patient information, so no patients are repeated in that matrix. When merging the matrix, I want to make sure that every patient mutation (each row) is matched with its corresponding information from the merging matrix.
My sample code:
#!/usr/bin/perl
use strict;
use warnings;

my $file = 'sorted_samples_2.txt';
open(INFILE, $file) or die "Can't open file: $!\n";
open(my $outfile, '>', 'sorted_samples_changed.txt');
foreach my $line (<INFILE>) {
    print "The input line is $line\n";
    my @columns = split('\t', $line);
    ($columns[15]) = $columns[15] =~ /:((\w\w\w\w-\w\d-\w|\w\w-\d\d\w)+)$/;
    printf $outfile "@columns/n";
}
Issues: the code deletes the header and deletes the string in column 16.
A few issues about your code:
Good job on including use strict; and use warnings;. Keep doing that.
Anytime you're doing file or directory processing, include use autodie; as well.
Always use lexical file handles $infh instead of globs INFILE.
Use the 3 parameter form of open.
Always process a file line by line using a while loop. Using a for loop loads the entire file into memory
Don't forget to chomp your input from a file.
Use the line number variable $. if you want special logic for your header
The first parameter of split is a pattern, so use /\t/. The only exception to this is ' ', which has special meaning. Currently you're introducing a bug by using a single-quoted string.
When altering a value with a regex, try to focus on what you DO want instead of what you DON'T. In this case it looks like you want 4 groups separated by dashes, and then truncate the rest. Focus on matching those groups.
Don't use printf when you mean print.
The following applies these fixes to your script:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;

my $infile  = 'sorted_samples_2.txt';
my $outfile = 'sorted_samples_changed.txt';

open my $infh,  '<', $infile;
open my $outfh, '>', $outfile;

while (my $line = <$infh>) {
    chomp $line;
    my @columns = split /\t/, $line;
    if ($. > 1) {
        $columns[15] =~ s/^(\w{4}-\w\d-\w{4}-\w{3}).*/$1/
            or warn "Unable to fix column at line $.";
    }
    print $outfh join("\t", @columns), "\n";
}
When you use 'use strict', you need to declare your variables with 'my'.
In your case, you should use my @sort = sort {....} in the first line, and
you should have an array reference $t defined somewhere to dereference it in the second line. You don't have @array declared anywhere in this code; that is the reason you got all those errors. Make sure you understand what you are doing before you do it.
Sorry for the vague question, I'm struggling to think how to better word it!
I have a CSV file that looks a little like this, only a lot bigger:
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
The values in the first column are ID numbers and the second column could be described as a property (for want of a better word...). The ID number 550672 has properties 1, 2, 3, 4. Can anyone point me towards how I can begin producing strings like that for all the ID numbers? My ideal output would be a new csv file which looks something like:
550672,1;2;3;4
656372,1;2
766153,1;4
etc.
I am very much a Perl baby (only 3 days old!) so would really appreciate direction rather than an outright solution; I'm determined to learn this stuff even if it takes me the rest of my days! I have tried to investigate it myself as best I can, although I think I've been hampered by not really knowing what to search for. I am able to read in and parse CSV files (I even got so far as removing duplicate values!) but that is really where it drops off for me. Any help would be greatly appreciated!
I think it is best if I offer you a working program rather than a few hints. Hints can only take you so far, and if you take the time to understand this code it will give you a good learning experience.
It is best to use Text::CSV whenever you are processing CSV data, as all the debugging has already been done for you.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new;
open my $fh, '<', 'data.txt' or die $!;
my %data;
while (my $line = <$fh>) {
    $csv->parse($line) or die "Invalid data line";
    my ($key, $val) = $csv->fields;
    push @{ $data{$key} }, $val;
}

for my $id (sort keys %data) {
    printf "%s,%s\n", $id, join ';', @{ $data{$id} };
}
output
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
Firstly, props for seeking an approach rather than a solution.
As you've probably already found with Perl, There Is More Than One Way To Do It.
The approach I would take would be:
use strict;   # will save you big time in the long run

my %ids;      # Use a hash table with the id as the key to accumulate the properties

open a file handle on csv or die

while (read another line from the file handle) {
    split line into ID and property variable   # google the split function
    append new property to existing properties for this id in the hash table
                                               # If it doesn't exist already, it will be created
}

foreach my $key (keys %ids) {
    deduplicate properties
    print/display/do whatever you need to do with the result
}
This approach means you will need to iterate over the whole set twice (once in memory), so depending on the size of the dataset that may be a problem.
A more sophisticated approach would be to use a hashtable of hashtables to do the deduplication in the initial step, but depending on how quickly you want/need to get it working, that may not be worthwhile in the first instance.
Check out this question for a discussion on how to do the deduplication; a rough sketch follows below.
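A sketch of that hash-of-hashes deduplication, assuming the two-column CSV from the question is in data.txt:

use strict;
use warnings;

my %ids;
open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";
while ( my $line = <$fh> ) {
    chomp $line;
    my ( $id, $prop ) = split /,/, $line;
    $ids{$id}{$prop} = 1;    # re-inserting a duplicate is a no-op
}
close $fh;

for my $id ( sort keys %ids ) {
    print join( ',', $id, join( ';', sort keys %{ $ids{$id} } ) ), "\n";
}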
Well, open the file as stdin in Perl, assume each row has two columns, then iterate over all lines, using the left column as the hash key and pushing the right column onto an array referenced by that key. At the end of the input file you'll have a hash of arrays; iterate over it, printing each hash key and its array elements separated by ";" or any other sign you wish.
and here you go
dtpwmbp:~ pwadas$ cat input.txt
550672,1
656372,1
766153,1
550672,2
656372,2
868194,2
766151,2
550672,3
868179,3
868194,3
550672,4
766153,4
dtpwmbp:~ pwadas$ cat bb2.pl
#!/opt/local/bin/perl
my %hash;
while (<>)
{
    chomp;
    my ($key, $value) = split /,/;
    push @{ $hash{$key} }, $value;
}
foreach my $key (sort keys %hash)
{
    print $key . "," . join(";", @{ $hash{$key} }) . "\n";
}
dtpwmbp:~ pwadas$ cat input.txt | perl -f bb2.pl
550672,1;2;3;4
656372,1;2
766151,2
766153,1;4
868179,3
868194,2;3
dtpwmbp:~ pwadas$
perl -F"," -ane 'chomp($F[1]);$X{$F[0]}=$X{$F[0]}.";".$F[1];if(eof){for(keys %X){$X{$_}=~s/;//;print $_.",".$X{$_}."\n"}}'
Another (not perl) way which incidentally is shorter and more elegant:
#!/opt/local/bin/gawk -f
BEGIN {FS=OFS=",";}
NF > 0 { IDs[$1]=IDs[$1] ";" $2; }
END { for (i in IDs) print i, substr(IDs[i], 2); }
The first line (after specifying the interpreter) sets the input field separator (FS) and the output field separator (OFS) to the comma. The second line checks whether we have more than zero fields, and if so it uses the ID ($1) as the key and appends $2 to its value. This is done for all lines.
The END statement will print these pairs out in an unspecified order. If you want to sort them, you have the option of the gnu awk asorti function, or of connecting the output of this snippet with a pipe to sort -t, -k1n,1n.
I have a couple of text files (A.txt and B.txt) which look like this (might have ~10000 rows each)
processa,id1=123,id2=5321
processa,id1=432,id2=3721
processa,id1=3,id2=521
processb,id1=9822,id2=521
processa,id1=213,id2=1
processc,id1=822,id2=521
I need to check if every row in file A.txt is present in B.txt as well (B.txt might have more too, that is okay).
The thing is that rows can be in any order in the two files, so I am thinking I will sort them in some particular order in both the files in O(nlogn) and then match each line in A.txt to the next lines in B.txt in O(n). I could implement a hash, but the files are big and this comparison happens only once after which these files are regenerated, so I don't think that is a good idea.
What is the best way to sort the files in Perl? Any ordering would do, it just needs to be some ordering.
For example, in dictionary ordering, this would be
processa,id1=123,id2=5321
processa,id1=213,id2=1
processa,id1=3,id2=521
processa,id1=432,id2=3721
processb,id1=9822,id2=521
processc,id1=822,id2=521
As I mentioned before, any ordering would be just as fine, as long as Perl is fast in doing it.
I want to do it from within Perl code, after opening the file like so
open (FH, "<A.txt");
Any comments, ideas etc would be helpful.
To sort the file in your script, you will still have to load the entire thing into memory. If you're doing that, I'm not sure what the advantage of sorting it is over just loading it into a hash.
Something like this would work:
my %seen;

open(A, "<A.txt") or die "Can't read A: $!";
while (<A>) {
    $seen{$_} = 1;
}
close A;

open(B, "<B.txt") or die "Can't read B: $!";
while (<B>) {
    delete $seen{$_};
}
close B;

print "Lines found in A, missing in B:\n";
print join "\n", keys %seen;
Here's another way to do it. The idea is to create a flexible data structure that allows you to answer many kinds of questions easily with grep.
use strict;
use warnings;
my ($fileA, $fileB) = @ARGV;
# Load all lines: $h{LINE}{FILE_NAME} = TALLY
my %h;
$h{$_}{$ARGV} ++ while <>;
# Do whatever you need.
my @all_lines = keys %h;
my @in_both   = grep { keys %{$h{$_}} == 2 } keys %h;
my @in_A      = grep { exists $h{$_}{$fileA} } keys %h;
my @only_in_A = grep { not exists $h{$_}{$fileB} } @in_A;
my @in_A_mult = grep { $h{$_}{$fileA} > 1 } @in_A;
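For the original question, @only_in_A is the interesting one; for example:

printf "%d line(s) in %s are missing from %s\n",
    scalar @only_in_A, $fileA, $fileB;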
Well, I routinely parse very large (600MB) daily Apache log files with Perl, and I use a hash to store the information. I also go through about 30 of these files in one script instance, using the same hash. It's not a big deal assuming you have enough RAM.
May I ask why you must do this in native Perl? If the cost of calling a system call or 3 is not an issue (e.g. you do this infrequently and not in a tight loop), why not simply do:
my $cmd = "sort $file1 > $file1.sorted";
$cmd .= "; sort $file2 > $file2.sorted";
$cmd .= "; comm -23 $file1.sorted $file2.sorted |wc -l";
my $count = `$cmd`;
$count =~ s/\s+//g;
if ($count != 0) {
    print "Stuff in A exists that isn't in B\n";
}
Please note that the comm parameters might need to be different, depending on what exactly you want.
As usual, CPAN has an answer for this. Either Sort::External or File::Sort looks like it would work. I've never had occasion to try either, so I don't know which would be better for you.
Another possibility would be to use AnyDBM_File to create a disk-based hash that can exceed available memory. Without trying it, I couldn't say whether using a DBM file would be faster or slower than the sort, but the code would probably be simpler.
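A minimal sketch of the AnyDBM_File idea, mirroring the %seen approach above (file names taken from the question):

use strict;
use warnings;
use AnyDBM_File;
use Fcntl;

# Tie the hash to a DBM file so it lives on disk rather than in RAM.
tie my %seen, 'AnyDBM_File', 'seen_db', O_RDWR | O_CREAT, 0644
    or die "Cannot tie DBM file: $!";

open my $fh_a, '<', 'A.txt' or die "A.txt: $!";
$seen{$_} = 1 while <$fh_a>;
close $fh_a;

open my $fh_b, '<', 'B.txt' or die "B.txt: $!";
delete $seen{$_} while <$fh_b>;
close $fh_b;

my $missing = keys %seen;    # lines of A not found in B
print "Lines in A missing from B: $missing\n";
untie %seen;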
Test if A.txt is a subset of B.txt
open FILE.B, "B.txt";
open FILE.A, "A.txt";
my %bFile;
while(<FILE.B>) {
($process, $id1, $id2) = split /,/;
$bFile{$process}{$id1}{$id2}++;
}
$missingRows = 0;
while(<FILE.A>) {
$missingRows++ unless $bFile{$process}{$id1}{$id2};
# If we've seen a given entry already don't add it
next if $missingRows; # One miss means they aren't all verified
}
$is_Atxt_Subset_Btxt = $missingRows?FALSE:TRUE;
That gives you a test for all rows of A being in B, reading in all of B once and then testing each row of A as it is read.