How to split files using Perl? - text-processing

Each div should be written out as a separate file.
Input.txt
[[div]]
line 1
line 2
...
[[/div]]
[[div]]
line 3
line 4
line 5
...
[[/div]]
[[div]]
line 6
line 7
...
[[/div]]
filename.txt
fm.html
chap01.html
bm.html
Output needed
fm.html
<html>
<body>
line 1
line 2
...
</body>
</html>
chap01.html
<html>
<body>
line 3
line 4
line 5
...
</body>
</html>
bm.html
<html>
<body>
line 6
line 7
...
</body>
</html>
Here is the code I have tried so far, but it writes the last div into every file. I also need to add meta tags. Kindly suggest a solution.
#!/usr/bin/perl
open(REDA,"filename.txt");
@namef=<REDA>;
open(RED,"input.txt");
open(WRITX,">input1.txt");
while(<RED>)
{
chomp($_);
$_="$_"."<cr>";
print WRITX $_;
}
close(RED);
close(WRITX);
open(REDQ,"input1.txt");
open(WRITQ,">input2.txt");
while(<REDQ>)
{
$_=~s/\[\[div\]\]<cr>/\n\[\[div\]\]/gi;
print WRITQ $_;
}
close(REDQ);
close(WRITQ);
open(REDE,"input2.txt");
while(<REDE>)
{
foreach $namef (@namef)
{
chomp($namef);
$namef=~s/\.[a-z]+//gi;
open(WRIT1,">$namef.html");
if(/\[\[div\]\]/i)
{
chomp($_);
$_=~s/<cr>/\n/gi;
print WRIT1 $_;
}
}
}
close(REDA);
close(REDE);
close(REDX);
close(WRIT1);
system ("del input1.txt");
system ("del input2.txt");

If you're sure the [[div]] sections are separated by blank lines, you can make use of Perl's paragraph mode slurp which divides a file into chunks separated by one or more blank lines. The following code (tested) does what you need. Execute the following in a terminal where the current directory contains the relevant files:
perl -n00 -e '
BEGIN{ # Executed before input.txt is read
    open $f,"<","filename.txt";
    @names = split /\n+/,<$f>; # Split is needed because we changed the input record separator
}
# The following is executed for each "paragraph" (div section)
s!\[\[div\]\]\n!<html>\n<body>\n!;  # substitute <html>\n<body>\n instead of [[div]]
s!\[\[/div\]\]\n!</body>\n</html>!; # substitute </body>\n</html> instead of [[/div]]
$content{shift @names}=$_; # Add the modified content to a hash keyed by file name
END{ # This is executed after the whole of input.txt has been read
    for(keys %content){ # For each file we want to create
        open $of,">",$_;
        print $of $content{$_}
    }
}
' input.txt
Update
If you want to use the above code as a Perl script, you can do the following:
#!/usr/bin/env perl
use strict;
use warnings;
open my $f,'<','filename.txt' or die "Failed to open filename.txt: $!\n";
my @names;
chomp(@names=<$f>);
open my $if,'<','input.txt' or die "Failed to open input.txt: $!\n";
my %content;
while(my $paragraph=do{local $/="";<$if>}){
    $paragraph=~ s!\[\[div\]\]\n!<html>\n<body>\n!;
    $paragraph=~ s!\[\[/div\]\]\n!</body>\n</html>!;
    $content{shift @names}=$paragraph;
}
for(keys %content){
    open my $of,'>',$_ or die "Failed to open $_ : $!\n";
    print $of $content{$_}
}
Save the above as (say) split_file.pl, make it executable via chmod +x split_file.pl then run it as ./split_file.pl.

You could do something like this:
#!/usr/bin/env perl
use strict;
use warnings;
my @file_names;
## Read the list of file names
open(my $fh,"$ARGV[0]");
while (<$fh>) {
chomp; #remove new line character from the end of the line
push @file_names,$_;
}
my $counter=0;
my ($file_name,$fn);
## Read the input file
open($fh,"$ARGV[1]");
while (<$fh>) {
## If this is an opening DIV, open the next output file,
## and set $counter to 1.
if (/\[\[div\]\]/) {
$counter=1;
$file_name=shift(@file_names);
open($fn, '>',"$file_name");
}
## If this is a closing DIV, print the line and set $counter back to 0
if (/\[\[\/div\]\]/) {
$counter=0;
print $fn $_;
close($fn);
}
## Print into the corresponding file handle if $counter is 1
print $fn $_ if $counter==1
}
Save the script as foo.pl and run it like this:
perl foo.pl filename.txt Input.txt

In Perl you can loop through the contents of file filename.txt like so:
#!/usr/bin/perl
# somescript.pl
open (my $fh, "<", "filename.txt");
my @files = <$fh>;
close ($fh);
foreach my $file (@files) {
print "$file";
}
Put the above in a file called somescript.pl, make it executable, chmod +x somescript.pl, and run it:
$ ./somescript.pl
fm.html
chap01.html
bm.html
You can see that it's now reading in the file filename.txt and printing each line out to the screen. I leave the rest to you to try. If you get stuck ask for help.
I would use the same approach that I did to read in the filename.txt file for reading in the input.txt file.
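As a minimal sketch only (the div handling is left as a stub for you to fill in, and the output file names come from filename.txt above), that same approach applied to input.txt might look like this:
#!/usr/bin/perl
# sketch: read input.txt the same way filename.txt was read above
open (my $fh, "<", "input.txt") or die "Cannot open input.txt: $!";
my @lines = <$fh>;
close ($fh);
foreach my $line (@lines) {
    # here you would watch for [[div]] and [[/div]] markers and decide
    # which output file (fm.html, chap01.html, bm.html) the line belongs in
    print "$line";
}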

Writing it in rather more idiomatic Perl, you might get something like this:
#!/usr/bin/perl
use strict;
use warnings;
# First argument is the name of the file that contains
# the filenames.
open my $fn, shift or die $!;
chomp(my @files = <$fn>);
# Variable to contain the current open filehandle
my $curr_fh;
while (<>) {
# Skip blank lines
next unless /\S/;
# If it's the opening of a div...
if (/\[\[div]]/) {
# Open the next file...
open $curr_fh, '>', shift @files or die $!;
# Print the opening html...
print $curr_fh "<html>\n<body>\n";
# ... and skip the rest of the loop
next;
}
# If it's the end of a div
if (/\[\[\/div]]/) {
# Print the closing html...
print $curr_fh "</body>\n</html>\n";
# Close the current file...
close $curr_fh;
# Unset the variable so we can reuse it...
undef $curr_fh;
# and skip the rest of the loop
next;
}
# Otherwise, just print the record to the currently open file
print $curr_fh $_;
}
Call it with two arguments: the name of the file containing the filenames (filename.txt) followed by the name of the file containing the data (input.txt).
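For example, assuming the script above was saved as split_divs.pl (the name is arbitrary):
perl split_divs.pl filename.txt input.txt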

Related

Copying content from one file to another file using perl

The following code copies file content from readfile to writefile. Instead of copying everything to the end, I want to copy only up to some keyword.
use strict;
use warnings;
use File::Slurp;
my @lines = read_file('readfile.txt');
while ( my $line = shift @lines) {
    next unless ($line =~ m/END OF HEADER/);
    last; # here suggest some other logic
}
append_file('writefile.txt', @lines);
next will continue to the next iteration of the loop, effectively skipping the rest of the statements in the loop for that iteration (in this case, the last).
last will immediately exit the loop, which sounds like what you want. So you should be able to simply put the conditional statement on the last.
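A minimal sketch of that change, keeping the read_file/append_file approach from the question (note that this leaves @lines holding everything after the keyword, which append_file then writes; invert the logic if you want the part before the keyword instead):
use strict;
use warnings;
use File::Slurp;
my @lines = read_file('readfile.txt');
while ( my $line = shift @lines ) {
    last if $line =~ m/END OF HEADER/;   # stop shifting once the keyword line has been removed
}
append_file('writefile.txt', @lines);    # @lines now contains only the lines after the keyword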
Also, I'm not sure why you want to read the entire file into memory just to iterate over its lines; why not use a regular while(<>)? And I would recommend avoiding File::Slurp; it has some long-standing issues.
You don't show any example input with expected output, and your description is unclear - you said "i want to copy upto some keyword" but in your code you use shift, which removes items from the beginning of the array.
Do you want to remove the lines before or after and including or not including "END OF HEADER"?
This code will copy over only the header:
use warnings;
use strict;
my $infile = 'readfile.txt';
my $outfile = 'writefile.txt';
open my $ifh, '<', $infile or die "$infile: $!";
open my $ofh, '>', $outfile or die "$outfile: $!";
while (<$ifh>) {
last if /END OF HEADER/;
print $ofh $_;
}
close $ifh;
close $ofh;
Whereas if you want to copy everything after the header, you could replace the while above with:
while (<$ifh>) {
last if /END OF HEADER/;
}
while (<$ifh>) {
print $ofh $_;
}
This loops and does nothing until it sees END OF HEADER, then breaks out of the first loop and moves on to the second, which prints out the lines after the header.
data.txt:
fsffs
sfsfsf
sfSDFF
END OF HEADER
{ dsgs xdgfxdg zFZ }
dgdbg
vfraeer
Code:
use strict;
use warnings;
use 5.020;
use autodie;
use Data::Dumper;
my $infile = 'data.txt';
my $header_file = 'header.txt';
my $after_header_file = 'after_header.txt';
open my $DATA, '<', $infile;
open my $HEADER, '>', $header_file;
open my $AFTER_HEADER, '>', $after_header_file;
{
local $/ = "END OF HEADER";
my $header = <$DATA>;
say {$HEADER} $header;
my $rest = <$DATA>;
say {$AFTER_HEADER} $rest;
}
close $DATA;
close $HEADER;
close $AFTER_HEADER;
say "Created files: $header_file, $after_header_file";
Output:
$ perl 1.pl
Created files: header.txt, after_header.txt
$ cat header.txt
fsffs
sfsfsf
sfSDFF
END OF HEADER
$ cat after_header.txt
{ dsgs xdgfxdg zFZ }
dgdbg
vfraeer
$/ specifies the input record separator, which by default is a newline. Therefore, when you read from a file:
while (my $x = <$INFILE>) {
}
each value of $x is a sequence of characters up to and including the input record separator, i.e. a newline, which is what we normally think of as a line of text in a file. Often, we chomp off the newline/input_record_separator at the end of the text:
while (my $x = <$INFILE>) {
chomp $x;
say "$x is a dog";
}
But, you can set the input record separator to anything you want, like your "END OF HEADER" text. That means a line will be all the text up to and including the input record separator, which in this case is "END OF HEADER". For example, a line will be: "abc\ndef\nghi\nEND OF HEADER". Furthermore, chomp() will now remove "END OF HEADER" from the end of its argument, so you could chomp your line if you don't want the "END OF HEADER" marker in the output file.
If perl cannot find the input record separator, it keeps reading until it hits the end of the file and then returns all the text that was read.
You can use those operations to your advantage when you want to seek to some specific text in a file.
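For example, here is a minimal sketch (using the data.txt and END OF HEADER marker from above) of seeking past the marker this way:
use strict;
use warnings;
use 5.020;
open my $fh, '<', 'data.txt' or die "data.txt: $!";
my $header;
{
    local $/ = 'END OF HEADER';    # a "line" is now everything up to and including the marker
    $header = <$fh>;
    chomp $header;                 # chomp now removes "END OF HEADER" rather than "\n"
}
# $/ is back to "\n" here, and $fh is positioned just after the marker
my $rest = do { local $/; <$fh> } // '';   # slurp whatever remains
say "header:\n$header";
say "rest:$rest";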
Declaring a variable as local makes the variable magical: when the closing brace of the surrounding block is encountered, perl sets the variable back to the value it had just before the opening brace of the surrounding block:
#Here, by default $/ = "\n", but some code out here could have
#also set $/ to something else
{
local $/ = "END OF HEADER";
} # $/ gets set back to whatever value it had before this block
When you change one of perl's predefined global variables, it's considered good practice to only change the variable for as long as you need to use the variable, then change the variable back to what it was.
If you want to target just the text between the braces, you can do:
data.txt:
fsffs
sfsfsf
sfSDFF
END OF HEADER { dsgs xdgfxdg zFZ }
dgdbg
vfraeer
Code snippet:
...
...
{
local $/ = 'END OF HEADER {';
my $pre_brace = <$DATA>;
$/ = '}';
my $target_text = <$DATA>;
chomp $target_text; #Removes closing brace
say "->$target_text<-";
}
--output:--
-> dsgs xdgfxdg zFZ <-

Need to replace value from one file to another file using perl

I am writing a Perl program which reads a value from one file and replaces a value in another file with it. The program runs successfully, but the value doesn't get replaced. Please suggest where the error is.
use strict;
use warnings;
open(file1,"address0.txt") or die "Cannot open file.\n";
my $value;
$value=<file1>;
system("perl -p -i.bak -e 's/add/$value/ig' rough.sp");
The value I want to substitute is in the address0.txt file; it is the single value 1. I want to put this value in place of add in the other file, rough.sp.
My rough.sp looks like
Vdd 1 0 add
My address0.txt looks like
1
So output should be like
Vdd 1 0 1
Please help me out. Thanks in advance
Assuming that there is a 1:1 relationship between lines in address0.txt and rough.sp, you can proceed like this:
use strict;
use warnings;
my ($curline_1,$curline_2);
open(file1, "address0.txt") or die "Cannot open file.\n";
open(file2, "rough.sp") or die "Cannot open file.\n";
open(file3, ">out.sp") or die "Cannot open file.\n";
while (<file1>) {
$curline_1 = $_;
chomp($curline_1);
$curline_2 = <file2>;
$curline_2 =~ s/ add/ $curline_1/;
print file3 $curline_2;
}
close(file1);
close(file2);
close(file3);
exit(0);
Explanation:
The code iterates through the lines of your input files in parallel. Note that the lines read include the line terminator. Line contents from the 'address' file are taken as replacement values for the add literal in your .sp file. Line terminators from the 'address' file are eliminated to avoid introducing additional newlines.
Addendum:
An extension for multi-replacements might look like this:
$curline_1 = $_;
chomp($curline_1);
my #parts = split(/ +/, $curline_1); # splits the line from address0.txt into an array of strings made up of contiguous non-whitespace chars
$curline_2 = <file2>;
$curline_2 =~ s/ add/ $parts[0]/;
$curline_2 =~ s/ sub/ $parts[1]/;
# ...

perl print only the last line of the array

I am trying to print the array, but the output contains only the last line of the array. The partial code is as follows.
open OUT, "> /myFile.txt"
or die "Couldn't open output file: $!";
foreach (@result) {
print OUT;
}
the output is
List Z
which is the last line, but when I do print "@result" the output is
List A
List B
List C so on...
I am a little bit confused about why the results are different for the same array.
Working on a hunch, I tried adding \r to the end of your input lines, and sure enough, it creates the illusion that only the last line of your input is printed to the file. Here's the code to test it:
use strict;
use warnings;
my @result = map "$_\r", 'A' .. 'Z';
open (OUT, "> myFile.txt") or die("Couldn't open output file: $!");
foreach (@result) {
print OUT ;
}
What you have probably done is performed chomp on lines from a file from a different operating system (DOS, Windows), which does not strip the \r line endings. Hence, when the lines are printed, the lines overwrite each other.
If this is what is wrong, the solution is to use the dos2unix tool to fix your files, or to use:
s/\s+\z//;
to strip your newlines.
You may inspect your input by using the Data::Dumper module, using the option Useqq, e.g.:
use Data::Dumper;
$Data::Dumper::Useqq = 1;
print Dumper \@result;
If these whitespace characters are in your output, they will then be visible.
the problem is here
open OUT, "> /myFile.txt"
this should be
open OUT, ">>", "/myfile.txt"
What you wrote overwrites the entire file for each iteration of the foreach(@result) loop.
What you are intending to do is append to it (">>").
">>" appends, ">" overwrites.
Also take note of how I broke ">> /myfile.txt" into ">>", "/myfile.txt".
This is both more secure, and more robust for less specific applications of open.
Foreign line terminators from any platform can easily be fixed by clearing whitespace from the end of each line and adding a newline back when printing.
Like this
open my $out, '>', '/myFile.txt' or die "Couldn't open output file: $!";
foreach (@result) {
s/\s+$//;
print $out "$_\n";
}
or
foreach my $line (@result) {
$line =~ s/\s+$//;
print $out "$line\n";
}

Perl program skipping first line in csv file

I am trying to understand an error with a Perl program. I have a comma-separated file, and I want to extract the contents of each row to a separate text file, using the contents of the first field in each row as the file name.
The program below does exactly this EXCEPT it skips the first line of the csv file. I tried to nail down the source of the error by adding a couple of print commands. The print command on line 22 shows that the first line is read by the command in line 21. But, once the foreach loop starts, the first line is not printed.
I'm not quite sure of the problem. I appreciate any help!
#!/usr/bin/perl
# script that takes a .csv file (such as that exported from Excel) and
# extracts the contents of each row into a separate text file, using the first column as the filename
# original source: http://www.tek-tips.com/viewthread.cfm?qid=1516940
# modified 3/14/12
# usage = ./export_rows.pl <yourfilename>.csv
use warnings;
use strict;
use Text::CSV_XS;
use Tie::Handle::CSV;
unless(@ARGV) {
print "Please supply a .csv file at the command line! For example, export_rows.pl myfile.csv\n";
exit;
}
my $fh = Tie::Handle::CSV->new(file => $ARGV[0],
header => 0);
my @headers = @{scalar <$fh>};
print "$headers[0]\n\n";
foreach my $csv_line (<$fh>) {
print "$csv_line->[0]\n";
open OUT, "> $csv_line->[0].txt" or die "Could not open file $csv_line->[0].txt for output.\n$!";
for my $i (1..$#headers) {
print OUT "$csv_line->[$i]\n";
}
close OUT;
}
close $fh;
Try beginning at 0 in your for loop:
for my $i (1..$#headers)
Should be:
for my $i (0..$#headers)
EDIT:
To get the first line of the file you can use Tie::File
Here is sample code:
my @arr;
tie @arr, 'Tie::File', 'a.txt' or die $!;
my $first = $arr[0];
untie @arr;
print "$first\n";
This module is cool in that it allows you to access the lines of a file via array indices. If your file is big it is not incredibly efficient, but I think you can definitely use it here.

How do I use variables to do substitution in Perl?

I have several text files that were once tables in a database, which is now disassembled. I'm trying to reassemble them, which will be easy once I get them into a usable form. The first file, "keys.text", is just a list of labels, inconsistently formatted. Like:
Sa 1 #
Sa 2
U 328 #*
It's always letter(s), [space], number(s), [space], and sometimes symbol(s). The text files that match these keys start with the same labels, each followed by a line of text, also separated, or delimited, by a SPACE.
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
What I'm trying to do in the code below is match the key from "keys.text" with the same key in the .txt files, and put a tab between the key and the text. I'm sure I'm overlooking something very basic, but the result I'm getting looks identical to the source .txt file.
Thanks in advance for any leads or assistance!
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
open(IN1, "keys.text");
my $key;
# Read each line one at a time
while ($key = <IN1>) {
# For each txt file in the current directory
foreach my $file (<*.txt>) {
open(IN, $file) or die("Cannot open TXT file for reading: $!");
open(OUT, ">temp.txt") or die("Cannot open output file: $!");
# Add temp modified file into directory
my $newFilename = "modified\/keyed_" . $file;
my $line;
# Read each line one at a time
while ($line = <IN>) {
$line =~ s/"\$key"/"\$key" . "\/t"/;
print(OUT "$line");
}
rename("temp.txt", "$newFilename");
}
}
EDIT: Just to clarify, the results should retain the symbols from the keys as well, if there are any. So they'd look like:
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
The regex seems quoted rather oddly to me. Wouldn't
$line =~ s/$key/$key\t/;
work better?
Also, IIRC, <IN1> will leave the newline on the end of your $key. chomp $key to get rid of that.
And don't put parentheses around your print args, esp when you're writing to a file handle. It looks wrong, whether it is or not, and distracts people from the real problems.
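Putting the two fixes above together (the plain substitution and the chomp), here is a minimal, self-contained sketch; the \Q...\E escaping is an extra precaution I've added, not part of the answer, for keys that contain regex metacharacters such as *:
#!/usr/bin/perl
use strict;
use warnings;
open my $kfh, '<', 'keys.text' or die "keys.text: $!";
chomp(my @keys = <$kfh>);          # read all keys once, newlines removed
close $kfh;
foreach my $file (<*.txt>) {
    open my $in,  '<', $file         or die "$file: $!";
    open my $out, '>', "keyed_$file" or die "keyed_$file: $!";   # output name is just for the sketch
    while (my $line = <$in>) {
        foreach my $key (@keys) {
            last if $line =~ s/^\Q$key\E/$key\t/;   # insert a tab after the first matching key
        }
        print $out $line;
    }
    close $in;
    close $out;
}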
If Perl is not a must, you can use this awk one-liner:
$ cat keys.txt
Sa 1 #
Sa 2
U 328 #*
$ cat mytext.txt
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
$ awk 'FNR==NR{ k[$1 SEP $2];next }($1 SEP $2 in k) {$2=$2"\t"}1 ' keys.txt mytext.txt
Sa 1 # Random line of text follows.
Sa 2 This text is just as random.
U 328 #* Continuing text...
Using split rather than s/// makes the problem straightforward. In the code below, read_keys extracts the keys from keys.text and records them in a hash.
Then for all files named on the command line, available in the special Perl array @ARGV, we inspect each line to see whether it begins with a key. If not, we leave it alone, but otherwise insert a TAB between the key and the text.
Note that we edit the files in-place thanks to Perl's handy -i option:
-i[extension]
specifies that files processed by the <> construct are to be edited in-place. It does this by renaming the input file, opening the output file by the original name, and selecting that output file as the default for print statements. The extension, if supplied, is used to modify the name of the old file to make a backup copy …
The line split " ", $_, 3 separates the current line into exactly three fields. This is necessary to protect whitespace that's likely to be present in the text portion of the line.
#! /usr/bin/perl -i.bak
use warnings;
use strict;
sub usage { "Usage: $0 text-file\n" }
sub read_keys {
my $path = "keys.text";
open my $fh, "<", $path
or die "$0: open $path: $!";
my %key;
while (<$fh>) {
my($text,$num) = split;
++$key{$text}{$num} if defined $text && defined $num;
}
wantarray ? %key : \%key;
}
die usage unless @ARGV;
my %key = read_keys;
while (<>) {
my($text,$num,$line) = split " ", $_, 3;
$_ = "$text $num\t$line" if defined $text &&
defined $num &&
$key{$text}{$num};
print;
}
Sample run:
$ ./add-tab input
$ diff -u input.bak input
--- input.bak 2010-07-20 20:47:38.688916978 -0500
+++ input 2010-07-20 21:00:21.119531937 -0500
@@ -1,3 +1,3 @@
-Sa 1 # Random line of text follows.
-Sa 2 This text is just as random.
-U 328 #* Continuing text...
+Sa 1 # Random line of text follows.
+Sa 2 This text is just as random.
+U 328 #* Continuing text...
Fun answers:
$line =~ s/(?<=$key)/\t/;
Where (?<=XXXX) is a zero-width positive lookbehind for XXXX. That means it matches just after XXXX without being part of the match that gets substituted.
And:
$line =~ s/$key/$key . "\t"/e;
Where the /e flag at the end means to do one eval of what's in the second half of the s/// before filling it in.
Important note: I'm not recommending either of these, they obfuscate the program. But they're interesting. :-)
How about doing two separate slurps of each file? For the first file you open the keys and create a preliminary hash. For the second file, all you need to do is add the text to the hash.
use strict;
use warnings;
my $keys_file = "path to keys.txt";
my $content_file = "path to content.txt";
my $output_file = "path to output.txt";
my %hash = ();
my $keys_regex = '^([a-zA-Z]+)\s*(\d+)\s*([^\da-zA-Z\s]+)';
open my $fh, '<', $keys_file or die "could not open $keys_file";
while(<$fh>){
my $line = $_;
if ($line =~ /$keys_regex/){
my $key = $1;
my $number = $2;
my $symbol = $3;
$hash{$key}{'number'} = $number;
$hash{$key}{'symbol'} = $symbol;
}
}
close $fh;
open my $fh, '<', $content_file or die "could not open $content_file";
while(<$fh>){
my $line = $_;
if ($line =~ /^([a-zA-Z]+)/){
my $key = $1;
# strip the key/number/symbol from the content_file line to leave just the text
$line =~ s/^$key//;
$line =~ s/\s*$hash{$key}{'number'}//;
$line =~ s/\s*$hash{$key}{'symbol'}//;
$line =~ s/^\s+//g;
$hash{$key}{'text'} = $line;
}
}
close $fh;
open my $fh, '>', $output_file or die "could not open $output_file";
for my $key (keys %hash){
print $fh $key . " " . $hash{$key}{'number'} . " " . $hash{$key}{'symbol'} . "\t" . $hash{$key}{'text'} . "\n";
}
close $fh;
I haven't had a chance to test it yet, and the solution seems a little hacky with all the regexes, but it might give you an idea of something else you can try.
This looks like the perfect place for the map function in Perl! Read the entire text file into an array, then apply the map function across the entire array. The only other thing you might want to do is use the quotemeta function to escape any regular-expression metacharacters in your keys.
Using map is very efficient. I also read the keys into an array so as not to keep opening and closing the keys file in my loop. It's an O(n^2) algorithm, but if your keys aren't that big, it shouldn't be too bad.
#! /usr/bin/env perl
use strict;
use vars;
use warnings;
open (KEYS, "keys.text")
or die "Cannot open 'keys.text' for reading\n";
my @keys = <KEYS>;
close (KEYS);
foreach my $file (glob("*.txt")) {
open (TEXT, "$file")
or die "Cannot open '$file' for reading\n";
my @textArray = <TEXT>;
close (TEXT);
foreach my $line (@keys) {
    chomp $line;
    map($_ =~ s/^$line/$line\t/, @textArray);
}
open (NEW_TEXT, ">$file.new") or
    die qq(Can't open file "$file.new" for writing\n);
print NEW_TEXT @textArray;   # the lines still have their newlines, so print them as-is
close (NEW_TEXT);
}