Splitting one txt file into multiple txt files based on delimiter and naming them with a specific character - perl

I have a text file that looks like this http://www.uniprot.org/uniprot/?sort=score&desc=&compress=no&query=id:P01375%20OR%20id:P04626%20OR%20id:P08238%20OR%20id:P06213&format=txt.
This file consists of different entries that are divided by //. I think I have almost found a way to divide the txt file into multiple txt files whenever this specific pattern appears, but I still don't know how to name them after dividing and how to print them to a specific directory. I would like each divided file to carry the specific ID that appears in the first line, second column of each entry.
This is the code that I have written so far:
mkdir "spliced_files"; #directory where I would like to put all my splitted files
$/="//\n"; # divide them whenever //appears and new line after this
open (IDS, 'example.txt') or die "Cannot open"; #example.txt is an input file
my #ids = <IDS>;
close IDS;
my $entry = 25444; #number of entries or //\n characters
my $i=0;
while ($i eq $entry) {
print $ids[$i];
};
$i++;
I am still having problems figuring out how to split all the entries from the 'example.txt' file wherever "//\n" appears and to print all of these separated files into the directory spliced_files. In addition, I would have to name each of these separated files with the ID that is specific to that file or entry (which appears in the first row, but only in the second column).
So I expect the output to be a number of files in the spliced_files directory, each named with its ID (first row, second column only). For example, the name of the first file would be TNFA_HUMAN, the second ERBB2_HUMAN, and so on.

You still look like you're programming by guesswork. And you haven't made use of any of the advice you have been given in answers to your previous questions. I strongly recommend that you spend a week working through a good beginners book like Learning Perl and come back when you understand more about how Perl works.
But here are some comments on your new code:
open (IDS, 'example.txt') or die "Cannot open";
You have been told that using lexical variables and the three-arg version of open() is a better approach here. You should also include $! in your error message, so you know what has gone wrong.
open my $ids_fh, '<', 'example.txt'
or die "Cannot open: $!";
Then later on (I added the indentation in the while loop to make things clearer)...
my $i=0;
while ($i eq $entry) {
    print $ids[$i];
};
$i++;
The first time you enter this loop, $i is 0 and $entry is 25444. You compare them (as strings! You probably want ==, not eq) to see if they are equal. Clearly they are different, so your while loop exits immediately. Only after the loop exits do you increment $i.
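A quick illustration of the difference:
# eq compares its operands as strings; == compares them as numbers.
print "numeric\n" if 10 == 10.0;     # prints: 10 and 10.0 are numerically equal
print "string\n"  if '10' eq '10.0'; # prints nothing: the strings differ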
This code bears no relation at all to the description of your problem. I'm not going to give you the answer, but here is the structure of what you need to do:
mkdir "spliced_files";
local $/ = "//\n"; # Always localise changes to special vars
open my $ids_fh, '<', 'example.txt'
or die "Cannot open: $!";
# No need to read the whole file in one go.
# Process it a line at a time.
while (<$ids_fh>) {
    # Your record (the whole thing, not just the first line) is in $_.
    # You need to extract the ID value from that string. Let's assume
    # you've stored it in $id.

    # Open a file with the right name
    open my $out_fh, '>', "spliced_files/$id" or die $!;

    # Print the record to the new file.
    print $out_fh $_;
}
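If you get stuck on the extraction step, here is a minimal sketch, assuming the UniProt flat-file layout from your link, where the first line of each record looks like "ID   TNFA_HUMAN   Reviewed; ...":
# Inside the while loop, once the record is in $_.
my ($id) = /^ID\s+(\S+)/;
die "No ID line found in record" unless defined $id;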
But really, you need to take the time to learn about programming before you attack this task. Or, if you don't have the time for that, pay a programmer to do it for you.

Related

Perl replacement operator doesn't work under Windows when patterns contain slashes

I want to replace a string with a path:
my $somedir = "D:/somedir/someotherdir";
system("perl -pi.bak -e \"s{STRING_TO_BE_REPLACED}{$somedir}\" $file");
but under Windows it replaces the string with random symbols instead of slashes.
What's the problem?
I think it's got something to do with a syntax detail needed on Windows, but can't test now.
However, as you are in a Perl script, why go out with system and run another Perl interpreter? It is far more complex and inefficient since it involves a syscall or a shell, and starts another program. Also, it is far harder to get it right -- you need to deal with syntax details, quoting and escaping, for system, your system's command interpreter, the other instance of Perl, and the regex.
The code below reads the whole file into an array first, which is fine if the files aren't huge. In general it is better to process a file line by line, and how to do what you need that way is discussed in detail in perlfaq5. See the comment at the end, with the link.
use warnings 'all';
use strict;
# your code ...
open my $fh, '<', $file or die "Can't open $file: $!";
my @lines = <$fh>;
# Change @lines in-place. See the comment
s/STRING_TO_BE_REPLACED/$somedir/ for @lines;
open $fh, '>', $file or die "Can't open $file for write: $!";
print $fh @lines;
close $fh;
When we open $fh the second time, the handle is implicitly closed first and then re-opened, so there is no need for an explicit close. When an existing file is opened for writing ('>') it is clobbered, so this replaces it.
It's more to write but it is better.
Comment on the in-place change to @lines: This uses the fact that, when iterating over an array, if we change the index variable (here $_), the change is made to the original element. The index variable is like an alias for the array element. It says in perlsyn:
If any element of LIST is an lvalue, you can modify it by modifying VAR inside the loop. Conversely, if any element of LIST is NOT an lvalue, any attempt to modify that element will fail. In other words, the foreach loop index variable is an implicit alias for each item in the list that you're looping over.
This has the benefit of not copying data and not touching elements that don't change so it is more efficient, potentially a lot more. However, it relies on a subtle property and thus it may be tricky and error prone, so I do not recommend it as a general practice.
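A tiny demonstration of that aliasing:
my @words = ('foo', 'bar');
$_ .= '!' for @words;   # $_ aliases each element, so the array itself changes
# @words is now ('foo!', 'bar!')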
To copy the array, with modifications, to a new one
my @lines_new;
foreach my $line (@lines) {
    $line =~ s{STRING_TO_BE_REPLACED}{$somedir};
    push @lines_new, $line;
}
This also changes @lines. If it needs to be kept intact, do (my $new_line = $line) =~ s/.../ instead. Then write @lines_new to $file. Somewhere in between these two is
@lines = map { s{STRING_TO_BE_REPLACED}{$somedir}; $_ } @lines;
what was posted originally. However, since the map changes elements of @lines and copies data to build the output list, while the whole statement also overwrites the array, on reflection I think it makes more sense to do either the in-place change or an explicit copy to a new array.
In principle it is better not to read the whole file at once but rather to process it line by line, unless the file is small enough. In that case, open the file for reading and a new one for writing; after you have copied the file over (with changes), move the new one into place to replace $file. See the topic in perlfaq5.
The copied file is temporary and will overwrite $file, so it can be named using the core module File::Temp to avoid accidents. To move a file, use move from the core module File::Copy.
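A minimal sketch of that line-by-line approach, assuming $file and $somedir are set as above:
use File::Temp qw(tempfile);
use File::Copy qw(move);

open my $in, '<', $file or die "Can't open $file: $!";
my ($out, $tmpfile) = tempfile();   # safely named temporary file for the copy

while (my $line = <$in>) {
    $line =~ s/STRING_TO_BE_REPLACED/$somedir/;
    print $out $line;
}

close $in;
close $out or die "Can't close $tmpfile: $!";
move($tmpfile, $file) or die "Can't move $tmpfile over $file: $!";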

Which is the best method to read or write a particular line of a file in Perl?

I am trying to read the last line of a file. There are many methods available to read a particular line of a file.
first:
@array = <FILE_HANDLE>;
$line = (reverse @array)[0];
second: the File::ReadBackwards module
$bw = File::ReadBackwards->new( 'log_file' ) or
die "can't read 'log_file' $!" ;
$log_line = $bw->readline;
I want to know which method is preferable in Perl: using the module, or storing the whole file content in a variable.
Your title asks a different question than your body.
To read the last (or nth from last) line from an arbitrarily large file, absolutely do use File::ReadBackwards.
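For example, to fetch the nth line from the end (here the third), something along these lines should work:
use File::ReadBackwards;

my $bw = File::ReadBackwards->new('log_file')
    or die "can't read 'log_file': $!";
my $line;
$line = $bw->readline for 1 .. 3;   # each call returns the next line from the end
print $line if defined $line;       # undef if the file has fewer than 3 lines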
To read or write an arbitrary single line in a file, use Tie::File:
use Tie::File;
tie my @line, 'Tie::File', 'filename' or die "unable to open filename: $!";
print "line 123 is $line[123].";
$line[42] = 'abc';
print "line 42 is now abc.";
For large files, this will be significantly more expensive than File::ReadBackwards, because it will need to read through the entire file up to the line you want to modify (or through the entire file, if using a negative index, so if you are doing that, you are better off using File::ReadBackwards and then manually updating the file).
The thing is - there isn't a way that doesn't rely on reading the whole file. Files start at the beginning, and your process will have to seek all the way through.
This is the most expensive task your code will be doing, so it makes very little difference which approach you take.
So I'd go for a very simple:
my $last_line;
while ( my $line = <$filehandle> ) { $last_line = $line };
print $last_line,"\n";
This may look inefficient, but it's doing no more than the other things are - reading the whole file, and discarding everything but the final line.
If - as your question suggests - you're after a specific line, you can either test $. (the current line number) or use a regular expression, and 'bail out' of the loop when a condition is met. That way you don't read the whole file, which is marginally more efficient.
E.g.:
while ( my $line = <$filehandle> ) {
    $last_line = $line;
    last if $. >= 100;               # bails out after we hit line 100
    last if $line =~ m/END FILE/;    # bails out if the line contains 'END FILE'
};
print $last_line,"\n";

Perl: Substitute text string with value from list (text file or scalar context)

I am a perl novice, but have read "Learning Perl" by Schwartz, foy and Phoenix, and have a weak understanding of the language. I am still struggling, even after using the book and the web.
My goal is to be able to do the following:
Search a specific folder (current folder) and grab filenames with full path. Save filenames with complete path and current foldername.
Open a template file and insert the filenames with full path at a specific location (e.g. using substitution) as well as current foldername (in another location in the same text file, I have not gotten this far yet).
Save the new modified file to a new file in a specific location (current folder).
I have many files/folders that I want to process and plan to copy the perl program to each of these folders so the perl program can create the new file in each.
This is how far I have gotten:
use strict;
use warnings;
use Cwd;
use File::Spec;
use File::Basename;
my $current_dir = getcwd;
open SECONTROL_TEMPLATE, '<secontrol_template.txt' or die "Can't open SECONTROL_TEMPLATE: $!\n";
my @secontrol_template = <SECONTROL_TEMPLATE>;
close SECONTROL_TEMPLATE;
opendir(DIR, $current_dir) or die $!;
my @seq_files = grep {
    /gz/
} readdir (DIR);
open FASTQFILENAMES, '> fastqfilenames.txt' or die "Can't open fastqfilenames.txt: $!\n";
my @fastqfiles;
foreach (@seq_files) {
    $_ = File::Spec->catfile($current_dir, $_);
    push(@fastqfiles,$_);
}
print FASTQFILENAMES @fastqfiles;
open (my ($fastqfilenames), "<", "fastqfilenames.txt") or die "Can't open fastqfilenames.txt: $!\n";
my @secontrol;
foreach (@secontrol_template) {
    $_ =~ s/#/$fastqfilenames/eg;
    push(@secontrol,$_);
}
open SECONTROL, '> secontrol.txt' or die "Can't open SECONTROL: $!\n";
print SECONTROL @secontrol;
close SECONTROL;
close FASTQFILENAMES;
My problem is that I cannot figure out how to use my list of files to replace the "#" in my template text file:
my @secontrol;
foreach (@secontrol_template) {
    $_ =~ s/#/$fastqfilenames/eg;
    push(@secontrol,$_);
}
The substitution will not replace the "#" with the list of files in $fastqfilenames. Instead, I get the "#" replaced with GLOB(0x8ab1dc).
Am I doing this the wrong way? Should I not use a substitution because this cannot be done, and instead insert the list of files ($fastqfilenames) into the template.txt file? Instead of $fastqfilenames, can I substitute with the content of a file (e.g. s/A/{r file.txt ...)? Any suggestions?
Cheers,
JamesT
EDIT:
This made it all better.
foreach (@secontrol_template) {
    s/#/$fastqfilenames/g;
    push @secontrol, $_;
}
And as both answers pointed out, $fastqfilenames was a filehandle.
replaced this: open (my ($fastqfilenames), "<", "fastqfilenames.txt") or die "Can't open fastqfilenames.txt: $!\n";
with this:
my $fastqfilenames = join "\n", @fastqfiles;
made it all good. Thanks both of you.
$fastqfilenames is a filehandle. You have to read the information out of the filehandle before you can use it.
However, you have other problems.
You are printing all of the filenames to a file, then reading them back out of the file. This is not only a questionable design (why read from the file again, since you already have what you need in an array?), it also won't even work:
Perl buffers file I/O for performance reasons. The lines you have written to the file may not actually be there yet, because Perl is waiting until it has a large chunk of data saved up, to write it all at once.
You can override this buffering behavior in a few different ways (closing the file handle being the simplest if you are done writing to it), but as I said, there is no reason to reopen the file again and read from it anyway.
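For example, either of these avoids the stale-buffer problem (a sketch, not part of your program as written):
use IO::Handle;   # supplies the autoflush method on older Perls

open my $out, '>', 'fastqfilenames.txt' or die "Can't open: $!";
$out->autoflush(1);                    # option 1: flush after every print
print $out "some line\n";
close $out or die "Can't close: $!";   # option 2: closing flushes what is left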
Also note, the /e option in a regex replacement evaluates the replacement as Perl code. This is not necessary in your case, so you should remove it.
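To see what /e actually does:
my $s = 'total: 2+3';
$s =~ s/(\d+)\+(\d+)/$1 + $2/e;   # the replacement "$1 + $2" is evaluated as Perl code
# $s is now 'total: 5'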
Solution: Instead of reopening the file and reading it, just use the @fastqfiles variable you previously created when replacing in the template. It is not clear exactly what you mean by replacing # with the filenames.
Do you want to replace each # with a list of all the filenames together? If so, you will probably need to join the filenames together in some way before doing the replacement.
Do you want to create a separate version of the template file for each filename? If so, you need an inner for loop that goes over each filename for each template line. And you will need something other than a simple replacement, because the replacement would change the original string the first time through. If you are on Perl 5.14 or later, you can use the /r option to replace non-destructively: push(@secontrol, s/#/$file_name/gr); Otherwise, you should copy to another variable before doing the replacement.
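A sketch of that second variant, assuming Perl 5.14+ for /r and the @fastqfiles array built earlier:
my @secontrol;
for my $template_line (@secontrol_template) {
    for my $file_name (@fastqfiles) {
        # /r returns the modified copy and leaves $template_line untouched
        push @secontrol, $template_line =~ s/#/$file_name/gr;
    }
}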
$_ =~ s/#/$fastqfilenames/eg;
$fastqfilenames is a file handle, not the file contents.
In any case, I recommend the Text::Template module for this kind of work (text substitution in files).
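For instance, a minimal sketch with Text::Template, assuming a template file that uses its default { ... } delimiters and a hypothetical $filenames placeholder:
use Text::Template;

# template.txt might contain a line such as: The files are: {$filenames}
my $template = Text::Template->new(TYPE => 'FILE', SOURCE => 'template.txt')
    or die "Couldn't construct template: $Text::Template::ERROR";
my $text = $template->fill_in(HASH => { filenames => join("\n", @fastqfiles) });
print $text;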

How to open/join more than one file (depending on user input) and then use 2 files simultaneously

EDIT: Sorry for the misunderstanding, I have edited a few things, to hopefully actually request what I want.
I was wondering if there was a way to open/join two or more files to run the rest of the program on.
For example, my directory has these files:
taggedchpt1_1.txt, parsedchpt1_1.txt, taggedchpt1_2.txt, parsedchpt1_2.txt etc...
The program must use a tagged and a parsed file simultaneously. I want to run the program on both chpt1_1 and chpt1_2, preferably joined together into one .txt file, unless it would be very slow to do so. For instance, run what would be accomplished by having two files:
taggedchpt1_1_and_chpt1_2 and parsedchpt1_1_and_chpt1_2
Can this be done through Perl? Or should I just combine the text files myself (or automate that process, making chpt1.txt, which would include chpt1_1, chpt1_2, chpt1_3, etc.)?
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
print "Please type in the chapter and section NUMBERS in the form chp#_sec#:\n"; ##So the user inputs 31_3, for example
chomp (my $chapter_and_section = "chpt".<>);
print "Please type in the search word:\n";
chomp (my $search_key = <>);
open(my $tag_corpus, '<', "tagged${chapter_and_section}.txt") or die $!;
open(my $parse_corpus, '<', "parsed${chapter_and_section}.txt") or die $!;
For the rest of the program to work, I need to be able to have:
my @sentences = <$tag_corpus>; ## right now this is one file, I want to make it more
my @typeddependencies = <$parse_corpus>; ## same as above
EDIT2: Really sorry about the misunderstanding. In the program, after the steps shown, I have two for loops reading through the lines of the tagged and parsed files.
What I want is to accomplish this with more files from the same directory, without having to re-input the next files. (i.e. I can run taggedchpt31_1.txt and parsedchpt31_1.txt... I want to run taggedchpt31 and parsedchpt31, which include ~chpt31_1, ~chpt31_2, etc.)
Ultimately, it would be best if I joined all the tagged files and all the parsed files that share a chapter (in the end still needing only the two files I want to run) but without having to save the joined file to the directory... Now that I put it into words, I think I should just save files that include all the sections.
Sorry and Thanks for all your time! Look at FMc's breakdown of my question for more help.
You could iterate over the file names, opening and reading each one in turn. Or you could produce an iterator that knows how to read lines from a sequence of files.
sub files_reader {
    # Takes a list of file names and returns a closure that
    # will yield lines from those files.
    my @handles = map { open(my $h, '<', $_) or die $!; $h } @_;
    return sub {
        shift @handles while @handles and eof $handles[0];
        return unless @handles;
        return readline $handles[0];
    }
}
my $reader = files_reader('foo.txt', 'bar.txt', 'quux.txt');
while (my $line = $reader->()) {
    print $line;
}
Or you could use Perl's built-in iterator that can do the same thing:
local @ARGV = ('foo.txt', 'bar.txt', 'quux.txt');
while (my $line = <>) {
    print $line;
}
Edit in response to follow-up questions:
Perhaps it would help to break your problem down into smaller sub-tasks. As I understand it, you have three steps.
Step 1 is to get some input from the user -- perhaps a directory name, or maybe a couple of file name patterns (taggedchpt and parsedchpt).
Step 2 is for the program to find all of the relevant file names. For this task, glob() or readdir() might be useful (see the sketch after this list). There are many questions on StackOverflow related to such issues. You'll end up with two lists of file names, one for the tagged files and one for the parsed files.
Step 3 is to process the lines across all of the files in each of the two sets. Most of the answers you have received, including mine, will help you with this step.
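For example, step 2 might look something like this sketch, assuming the naming scheme from the question:
# Build the two lists of file names for one chapter.
my $chapter = 'chpt31';   # hypothetical user input
my @tagged_files = sort glob("tagged${chapter}_*.txt");
my @parsed_files = sort glob("parsed${chapter}_*.txt");
die "no tagged files found for $chapter" unless @tagged_files;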
No one has mentioned the @ARGV hack yet? Ok, here it is.
{
    local @ARGV = ('taggedchpt1_1.txt', 'parsedchpt1_1.txt', 'taggedchpt1_2.txt',
                   'parsedchpt1_2.txt');
    while (<ARGV>) {
        s/THIS/THAT/;
        print FH $_;    # assumes FH was already opened for writing elsewhere
    }
}
ARGV is a special filehandle that iterates through all the filenames in @ARGV, closing a file and opening the next one as necessary. Normally @ARGV contains the command-line arguments that you passed to perl, but you can set it to anything you want.
You're almost there... this is a bit more efficient than discrete opens on each file...
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
print "Please type in the chapter and section NUMBERS in the for chp#_sec#:\n";
chomp (my $chapter_and_section = "chpt".<>);
print "Please type in the search word:\n";
chomp (my $search_key = <>);
open(FH, '>output.txt') or die $!;   # Open an output file for writing
foreach ("tagged${chapter_and_section}.txt", "parsed${chapter_and_section}.txt") {
    open FILE, "<$_" or die $!;      # Open each input file in turn
    foreach (<FILE>) {
        $_ =~ s/THIS/THAT/g;         # Regex replace each line (use whatever
                                     # you like instead of "THIS" & "THAT")
        print FH $_;                 # Write to the output file
    }
}

How can I combine files into one CSV file?

If I have one file FOO_1.txt that contains:
FOOA
FOOB
FOOC
FOOD
...
and lots of other files FOO_files.txt. Each of them contains:
1110000000...
one line that contains a 0 or 1 for each of the FOO_1 values (fooa, foob, ...)
Now I want to combine them to one file FOO_RES.csv that will have the following format:
FOOA,1,0,0,0,0,0,0...
FOOB,1,0,0,0,0,0,0...
FOOC,1,0,0,0,1,0,0...
FOOD,0,0,0,0,0,0,0...
...
What is a simple & elegant way to do that
(with hashes & arrays -> $hash{$key} = \@data )?
Thanks a lot for any help !
Yohad
If you can't describe your data and your desired result clearly, there is no way that you will be able to code it--taking on a simple project is a good way to get started using a new language.
Allow me to present a simple method you can use to churn out code in any language, whether you know it or not. This method only works for smallish projects. You'll need to actually plan ahead for larger projects.
How to write a program:
Open up your text editor and write down what data you have. Make each line a comment
Describe your desired results.
Start describing the steps needed to change your data into the desired form.
Numbers 1 & 2 completed:
#!/usr/bin/perl
use strict;
use warnings;
# Read data from multiple files and combine it into one file.
# Source files:
# Field definitions: has a list of field names, one per line.
# Data files:
# * Each data file has a string of digits.
# * There is a one-to-one relationship between the digits in the data file and the fields in the field defs file.
#
# Results File:
# * The results file is a CSV file.
# * Each field will have one row in the CSV file.
# * The first column will contain the name of the field represented by the row.
# * Subsequent values in the row will be derived from the data files.
# * The order of subsequent fields will be based on the order files are read.
# * However, each column (2-X) must represent the data from one data file.
Now that you know what you have, and where you need to go, you can flesh out what the program needs to do to get you there - this is step 3:
You know you need to have the list of fields, so get that first:
# Get a list of fields.
# Read the field definitions file into an array.
Since it is easiest to write CSV in a row oriented fashion, you will need to process all your files before generating each row. So you'll need someplace to store the data.
# Create a variable to store the data structure.
Now we read the data files:
# Get a list of data files to parse
# Iterate over list
# For each data file:
# Read the string of digits.
# Assign each digit to its field.
# Store data for later use.
We've got all the data in memory, now write the output:
# Write the CSV file.
# Open a file handle.
# Iterate over list of fields
# For each field
# Get field name and list of values.
# Create a string - comma separated string with field name and values
# Write string to file handle
# close file handle.
Now you can start converting comments into code. You could have anywhere from 1 to 100 lines of code for each comment. You may find that something you need to do is very complex and you don't want to take it on at the moment. Make a dummy subroutine to handle the complex task, and ignore it until you have everything else done. Now you can solve that complex, thorny sub-problem on its own.
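For example, a dummy subroutine can be as simple as this (the name here is just an illustration):
# Stub for a thorny sub-problem; make the rest of the program work first.
sub assign_digits_to_fields {
    my ($digits, $fields) = @_;
    die "assign_digits_to_fields() not implemented yet";
}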
Since you are just learning Perl, you'll need to hit the docs to find out how to do each of the subtasks represented by the comments you've written. The best resource for this kind of work is the list of functions by category in perlfunc. The Perl syntax guide will come in handy too. Since you'll need to work with a complex data structure, you'll also want to read from the Data Structures Cookbook.
You may be wondering how the heck you should know which perldoc pages you should be reading for a given problem. An article on Perlmonks titled How to RTFM provides a nice introduction to the documentation and how to use it.
The great thing, is if you get stuck, you have some code to share when you ask for help.
If I understand correctly your first file is your key order file, and the remaining files each contain a byte per key in the same order. You want a composite file of those keys with each of their data bytes listed together.
In this case you should open all the files simultaneously. Read one key from the key order file, read one byte from each of the data files. Output everything as you read it to you final file. Repeat for each key.
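A sketch of that approach, assuming @data_files holds the data file names in column order:
open my $keys_fh, '<', 'FOO_1.txt' or die "Can't open FOO_1.txt: $!";
my @data_fhs = map {
    open my $fh, '<', $_ or die "Can't open $_: $!";
    $fh;
} @data_files;

open my $out_fh, '>', 'FOO_RES.csv' or die "Can't open FOO_RES.csv: $!";
while (my $key = <$keys_fh>) {
    chomp $key;
    # getc reads exactly one byte from each data file per key
    my @bits = map { getc($_) // '' } @data_fhs;
    print $out_fh join(',', $key, @bits), "\n";
}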
It looks like you have many foo_files that have 1 line in them, something like:
1110000000
Which stands for
fooa=1
foob=1
fooc=1
food=0
fooe=0
foof=0
foog=0
fooh=0
fooi=0
fooj=0
And it looks like your foo_res is just a summation of those values? In that case, you don't need a hash of arrays, but just a hash.
my @foo_files = ();   # NOT SURE HOW YOU POPULATE THIS ONE
my @foo_keys = qw(a b c d e f g h i j);
my %foo_hash = map { ( $_, 0 ) } @foo_keys;   # initialize hash
foreach my $foo_file ( @foo_files ) {
    open( my $FOO, "<", $foo_file) || die "Cannot open $foo_file\n";
    my $line = <$FOO>;
    close( $FOO );
    chomp($line);
    my @foo_values = split(//, $line);
    foreach my $indx ( 0 .. $#foo_keys ) {
        last if ( ! defined $foo_values[ $indx ] );   # or some kind of error checking if the input file doesn't have all the values
        $foo_hash{ $foo_keys[$indx] } += $foo_values[ $indx ];
    }
}
It's pretty hard to understand what you are asking for, but maybe this helps?
Your specifications aren't clear. You couldn't have "lots of other files" named FOO_files.txt, because that's only one name. So I'm going to take this as the files-with-data + filelist pattern. In this case, there are files named FOO*.txt, each containing "[01]+\n".
Thus the idea is to process all the files in the filelist file and to insert them all into a result file FOO_RES.csv, comma-delimited.
use strict;
use warnings;
use English qw<$OS_ERROR>;
use IO::Handle;
open my $foos, '<', 'FOO_1.txt'
or die "I'm dead: $OS_ERROR";
@ARGV = sort map { chomp; "$_.txt" } <$foos>;
$foos->close;
open my $foo_csv, '>', 'FOO_RES.csv'
or die "I'm dead: $OS_ERROR";
while ( my $line = <> ) {
    chomp $line;
    my ( $foo_name ) = ( $ARGV =~ /(.*)\.txt$/ );
    $foo_csv->print( join( ',', $foo_name, split //, $line ), "\n" );
}
$foo_csv->close;
You don't really need to use a hash. My Perl is a little rusty, so syntax may be off a bit, but basically do this:
open KEYFILE , "foo_1.txt" or die "cannot open foo_1 for reading";
open VALFILE , "foo_files.txt" or die "cannot open foo_files for reading";
open OUTFILE , ">foo_out.txt" or die "cannot open foo_out for writing";
my %output;
while (<KEYFILE>) {
    chomp( my $key = $_ );
    chomp( my $val = <VALFILE> );
    my @arrVal = split(//, $val);
    $output{$key} = \@arrVal;
    print OUTFILE $key . "," . join(",", @arrVal) . "\n";
}
Edit: Syntax check OK
Comment by Sinan: @Byron, it really bothers me that your first sentence says the OP does not need a hash, yet your code has %output, which seems to serve no purpose. For reference, the following is a less verbose way of doing the same thing.
#!/usr/bin/perl
use strict;
use warnings;
use autodie qw(:file :io);
open my $KEYFILE, '<', "foo_1.txt";
open my $VALFILE, '<', "foo_files.txt";
open my $OUTFILE, '>', "foo_out.txt";
while (my $key = <$KEYFILE>) {
    chomp $key;
    chomp( my $val = <$VALFILE> );
    print $OUTFILE join(q{,}, $key, split(//, $val) ), "\n";
}
__END__