How to merge 2 lines in perl script - perl

I am trying to write a Perl script that will transform the input
( name
( type ....
)
)
into the output
( name ( type ... ) )
I.e. all these lines matching ( ) are merged into a single line and I want to update the original file itself.
Thanks in advance

use strict;
use warnings;
my $file = "t.txt";    # or shift(@ARGV) for command-line input
open my $in, '<', $file or die "Cannot open '$file' for reading: $!";
my $new_format = do { local $/; <$in> };    # undefined $/ means read to end of file
close $in;
$new_format =~ s/\n//g;    # replaces all newline characters with nothing, i.e. removes all \n
open my $out, '>', $file or die "Cannot open '$file' for writing: $!";    # open for writing (overwrites)
print $out $new_format;
close $out;
This works, assuming that the entire file is one big expression. For reference, to remove all whitespace, use $new_format=~ s/\s//g; instead of $new_format=~ s/\n//g;. It can easily be modified to handle multiple expressions: set $/ to whatever separates the expressions (for example, if they are separated by blank lines, use paragraph mode: local $/ = "";) and read the file in a while loop. On each iteration, push the string onto an array, and after the file is completely processed, write the contents of the array to the file in the format that you require.
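A minimal sketch of the multiple-expression variant, assuming the expressions are separated by blank lines and that line breaks should become single spaces. The helper name merge_expressions and the in-memory sample are illustrative only, not part of the answer above.

```perl
use strict;
use warnings;

# Join each blank-line-separated expression onto one line.
# merge_expressions is a hypothetical helper name.
sub merge_expressions {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory handle: $!";
    local $/ = "";    # paragraph mode: a record ends at one or more blank lines
    my @merged;
    while ( my $record = <$fh> ) {
        $record =~ s/\s*\n\s*/ /g;    # turn each line break into a single space
        $record =~ s/\s+$//;          # drop the space left by the final newline(s)
        push @merged, $record;
    }
    return @merged;
}

my $sample = "( name\n( type ....\n)\n)\n\n( name2\n( type2 ....\n)\n)\n";
print "$_\n" for merge_expressions($sample);
```

To update a real file, the same loop could read from a handle opened on the file and the collected array could then be written back, as described above.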

Is the ((..)) syntax guaranteed? If so I'd suggest merging the whole thing into one line and then splitting based on )(s.
my $line = "";
while(<DATA>)
{
$_ =~ s= +$==g; # remove end spaces.
$line .= $_;
}
$line =~ s=\n==g;
my @lines = split /\)\(/,$line;
my $resulttext = join ")\n(", @lines;
print $resulttext;
__END__
( name
( type ....
)
)
( name2
( type2 ....
)
)
( name3
( type3 ....
)
)

Here's another option:
use strict;
use warnings;
while (<>) {
chomp unless /^\)/;
print;
}
Usage: perl script.pl inFile [>outFile]
Sample data:
( name
( type ....
)
)
( name_a
( type_a ....
)
)
( name_b
( type_b ....
)
)
Output:
( name ( type .... ) )
( name_a ( type_a .... ) )
( name_b ( type_b .... ) )
The script removes the input record separator unless the line read contains the last closing right paren (matched by being the first char on the line).
Hope this helps!

Related

Parsing data from delimited blocks

I have a log file containing many blocks /begin CHECK ... /end CHECK like below:
/begin CHECK
Var_AAA
"Description AAA"
DATATYPE UBYTE
Max_Value 255.
ADDRESS 0xFF0011
/end CHECK
/begin CHECK
Var_BBB
"Description BBB"
DATATYPE UBYTE
Max_Value 255.
ADDRESS 0xFF0022
/end CHECK
...
I want to extract the variable name and its address, then write to a new file like this
Name Address
Var_AAA => 0xFF0011
Var_BBB => 0xFF0022
I am just thinking about the ($start, $keyword, $end) to check for each block and extract data after keyword only
#!/usr/bin/perl
use strict;
use warnings;
my $input = 'input.log';
my $output = 'output.out';
my ( $start, $keyword, $end ) = ( '^\/begin CHECK\n\n', 'ADDRESS ', '\/end CHECK' );
my @block;
# open input file for reading
open( my $in, '<', $input ) or die "Cannot open file '$input' for reading: $!";
# open destination file for writing
open( my $out, '>', $output ) or die "Cannot open file '$output' for writing: $!";
print( "copying variable name and it's address from $input to $output \n" );
while ( $in ) { #For each line of input
if ( /$start/i .. /$end/i ) { #Block matching
push @block, $_;
}
if ( /$end/i ) {
for ( @block ) {
if ( /\s+ $keyword/ ) {
print $out join( '', @block );
last;
}
}
@block = ();
}
close $in or die "Cannot close file '$input': $!";
}
close $out or die "Cannot close file '$output': $!";
But I got nothing after execution. Can anyone suggest me with sample idea?
Most everything looks good but it's your start regex that's causing the first problem:
'^\/begin CHECK\n\n'
You are reading lines from the file but then looking for two newlines in a row. That's never going to match because a line ends with exactly one newline (unless you change $/, but that's a different topic). If you want to match the end of a line, you can use the $ (or \z) anchor:
'^\/begin CHECK$'
Here's the program I pared down. You can adjust it to do all the rest of the stuff that you need to do:
use v5.10;
use strict;
use warnings;
use Data::Dumper;
my ($start, $keyword, $end) = (qr{^/begin CHECK$}, qr(^ADDRESS ), qr(^/end CHECK));
while (<DATA>) #For each line of input
{
    state @block;
    chomp;
    if (/$start/i .. /$end/i) #Block matching
    {
        push @block, $_ unless /^\s*$/;
    }
    if( /$end/i )
    {
        print Dumper( \@block );
        @block = ();
    }
}
After that, you're not reading the data. You need to put the filehandle inside <> (the line input operator):
while ( <$in> )
The file handles will close themselves at the end of the program automatically. If you want to close them yourself that's fine but don't do that until you are done. Don't close $in until the while is finished.
From the Windows command prompt you can use a one-liner. On macOS or Unix the logic is the same, but wrap the code in single quotes so that the shell does not expand $1 and $2:
perl -wpe "$/='/end CHECK';s/^.*?(Var_\S+).*?(ADDRESS \S+).*$/$1 => $2\n/s" "your_file.txt" > "new.txt"
First, we set the input record separator: $/ = "/end CHECK".
We then keep only the first Var_ and the first ADDRESS, deleting everything else in single-line mode (the /s modifier, where the dot also matches line breaks \n): s/^.*?(Var_\S+).*?(ADDRESS \S+).*$/$1 => $2\n/s.
We then write the results into a new file, i.e. > new.txt.
Make sure to use -w -p -e, where -e executes the code, -p prints each record and -w enables warnings.
If you leave out the > new.txt part, the results are printed to the screen so that you can see them. If you include it, just open new.txt and everything will be there.
Here are some of the issues with your code
You have while ($in) instead of while ( <$in> ), so your program never reads from the input file
You close your input file handle inside the while read loop, so you can only ever read one record
Your $start regex pattern is '^\/begin CHECK\n\n'. Once it is interpolated into the regex, \n\n still means two newlines, but each line you read ends with only one, so the pattern can never match
Your test if (/\s+ $keyword/) looks for one or more whitespace characters of any sort, followed by a space, followed by ADDRESS and a space (the contents of $keyword). There are no occurrences of ADDRESS preceded by whitespace anywhere in your data
You have also written far too much without testing anything. You should start by writing your read loop on its own and make sure that the data is coming in correctly before proceeding by adding two or three lines of code at a time between tests. Writing 90% of the functionality before testing is a very bad approach.
In future, to help you address problems like this, I would point you to the excellent resources linked on the Stack Overflow Perl tag information page
The only slightly obscure thing here is that the range operator /$start/i .. /$end/i returns a useful value; I have copied it into $status. The first time the operator matches, the result will be 1; the second time 2 etc. The last time is different because it is a string that uses engineering notation like 9E0, so it still evaluates to the correct count but you can check for the last match using /E/. I've used == 1 and /E/ to avoid pushing the begin and end lines onto @block
I don't think there's anything else overly complex here that you can't find described in the Perl language reference
use strict;
use warnings;
use autodie; # Handle bad IO status automatically
use List::Util 'max';
my ($input, $output) = qw/ input.log output.txt /;
open my $in_fh, '<', $input;
my ( @block, @vars );
while ( <$in_fh> ) {
    my $status = m{^/begin CHECK}i .. m{^/end CHECK}i;
    if ( $status =~ /E/ ) { # End line
        @block = grep /\S/, @block;
        chomp @block;
        my $var = $block[0];
        my $addr;
        for ( @block ) {
            if ( /^ADDRESS\s+(0x\w+)/ ) {
                $addr = $1;
                last;
            }
        }
        push @vars, [ $var, $addr ];
        @block = ();
    }
    elsif ( $status ) {
        push @block, $_ unless $status == 1;
    }
}
# Format and generate the output
open my $out_fh, '>', $output;
my $w = max map { length $_->[0] } @vars;
printf $out_fh "%-*s => %s\n", $w, @$_ for [qw/ Name Address / ], @vars;
close $out_fh;
output
Name => Address
Var_AAA => 0xFF0011
Var_BBB => 0xFF0022
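The range operator's return value described in this answer can be seen in isolation with a small sketch. The helper name range_states and the sample lines are made up for illustration; the sketch only records what the operator returns for each line.

```perl
use strict;
use warnings;

# Return the value of the range (flip-flop) operator for each line.
# range_states is a hypothetical helper used only for illustration.
sub range_states {
    my @lines = @_;
    my @states;
    for (@lines) {
        # Scalar context makes .. a flip-flop that matches against $_
        my $status = m{^/begin CHECK} .. m{^/end CHECK};
        push @states, $status;
    }
    return @states;
}

my @sample = (
    "junk", "/begin CHECK", "Var_AAA", "ADDRESS 0xFF0011", "/end CHECK", "junk",
);
my @states = range_states(@sample);
printf "%-18s status='%s'\n", $sample[$_], $states[$_] for 0 .. $#sample;
```

Lines outside a block give an empty string, lines inside give 1, 2, 3 and so on, and the closing line carries the E0 suffix.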
Update
For what it's worth, I would have written something like this. It produces the same output as above
use strict;
use warnings;
use autodie; # Handle bad IO status automatically
use List::Util 'max';
my ($input, $output) = qw/ input.log output.txt /;
my $data = do {
open my $in_fh, '<', $input;
local $/;
<$in_fh>;
};
my @vars;
while ( $data =~ m{^/begin CHECK$(.+?)^/end CHECK$}gms ) {
    my $block = $1;
    next unless $block =~ m{(\w+).+?ADDRESS\s+(0x\w+)}ms;
    push @vars, [ $1, $2 ];
}
open my $out_fh, '>', $output;
my $w = max map { length $_->[0] } @vars;
printf $out_fh "%-*s => %s\n", $w, @$_ for [qw/ Name Address / ], @vars;
close $out_fh;

Nested Loop running very slowly

I'm trying to run a program to check each line of one file against each line of a second file to see if some of the elements match. Each file is around 200k lines.
What I've got so far looks like this;
#!/usr/bin/perl
#gffgenefind.pl
use strict;
use warnings;
die "SNP gff\n" unless @ARGV == 4;
open( my $snp, "<", $ARGV[0] ) or die "Can't open $!";
open( my $gff, "<", $ARGV[1] ) or die "can't open $!";
open( my $outg, ">", $ARGV[2] );
open( my $outs, ">", $ARGV[3] );
my $scaffold;
my $site;
my @snplines = <$snp>;
my @gfflines = <$gff>;
foreach my $snpline (@snplines) {
    my @arr = split( /\t/, $snpline );
    $scaffold = $arr[0];
    $site = $arr[1];
    foreach my $line (@gfflines) {
        my @arr1 = split( /\t/, $line );
        if ( $arr1[3] <= $site and $site <= $arr1[4] and $arr1[0] eq $scaffold ) {
            print $outg "$line";
            print $outs "$snpline";
        }
    }
}
File 1 (snp) looks like this scaffold_100 10689 A C A 0 0 0 0 0 0
File 2 (gff) looks like this scaffold_1 phytozomev10 gene 750912 765975 . - . ID=Carubv10008059m.g.v1.0;Name=Carubv10008059m.g
Essentially, I'm looking to see if the first values match and if the second value from snp is within the range defined on the second file (in this case 750912 to 765975)
I've seen that nested loops are to be avoided, and was wondering if there's an alternative way for me to look through this data.
Thanks!
Firstly - lose the foreach loop. That reads your whole file into memory, when you probably don't need to.
Try instead:
while ( my $snpline = <$snp> ) {
because it reads line by line.
Generally, mixing array indices and named variables is also bad style.
The core problem, though, is most likely that for each line of your first file you're cycling through all of the second file.
Edit: Note - because 'scaffold' isn't unique, amended accordingly
This seems like a good place to use a hash. E.g.
my %sites;
while ( <$snp> ) {
    my ( $scaffold, $site ) = split ( /\t/ );
    $sites{$scaffold}{$site}++;    # key by scaffold first, then by site
}
while ( <$gff> ) {
    my ( $name, $tmp1, $tmp2, $range_start, $range_end ) = split ( /\t/ );
    if ( $sites{$name} ) {
        foreach my $site ( keys %{ $sites{$name} } ) {
            if ( $site >= $range_start
                and $site <= $range_end ) {
                #do stuff with it;
                print;
            }
        }
    }
}
Hopefully you get the gist, even if it isn't specifically what you're after?
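The same idea can also be sketched the other way round: group the gene ranges by scaffold once, then look up each SNP only against the genes on its own scaffold. The helper names index_genes and genes_for_snp, and the shortened ID values in the sample data, are hypothetical.

```perl
use strict;
use warnings;

# Group gene ranges by scaffold so that each SNP is compared only with
# genes on its own scaffold. index_genes and genes_for_snp are
# hypothetical helper names.
sub index_genes {
    my @gff_lines = @_;
    my %genes;
    for my $line (@gff_lines) {
        chomp( my $copy = $line );
        my ( $scaffold, undef, undef, $start, $end ) = split /\t/, $copy;
        push @{ $genes{$scaffold} }, [ $start, $end, $copy ];
    }
    return \%genes;
}

# Return the gff lines whose range contains the SNP site (inclusive).
sub genes_for_snp {
    my ( $genes, $snp_line ) = @_;
    my ( $scaffold, $site ) = split /\t/, $snp_line;
    return map { $_->[2] }
        grep { $_->[0] <= $site && $site <= $_->[1] }
        @{ $genes->{$scaffold} || [] };
}

my $genes = index_genes(
    "scaffold_1\tphytozomev10\tgene\t750912\t765975\t.\t-\t.\tID=geneA",
    "scaffold_2\tphytozomev10\tgene\t100\t200\t.\t+\t.\tID=geneB",
);
print "$_\n" for genes_for_snp( $genes, "scaffold_1\t751000\tA\tC" );
```

This still scans every gene on a scaffold for each SNP, so it helps most when the data is spread over many scaffolds; sorting the ranges per scaffold would be a further step that is not shown here.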
Try this Python snippet:
#!/usr/bin/env python
import sys

if len(sys.argv) != 5:
    raise Exception('SNP gff')
snp, gff, outg, outs = sys.argv[1:]

# Index the gff records by scaffold name
gff_dict = {}
with open(gff) as gff_handler:
    for line in gff_handler:
        fields = line.split()
        try:
            gff_dict[fields[0]].append(fields[1:])
        except KeyError:
            gff_dict[fields[0]] = [fields[1:]]

# contextlib.nested was removed in Python 3; one with statement opens all three files
with open(snp) as snp_handler, \
        open(outs, 'w') as outs_handler, \
        open(outg, 'w') as outg_handler:
    for line_snp in snp_handler:
        fields = line_snp.split()
        key = fields[0]
        if key in gff_dict:
            site = int(fields[1])
            for ele in gff_dict[key]:
                # compare as integers, not as strings
                if int(ele[2]) <= site <= int(ele[3]):
                    outs_handler.write(line_snp)
                    outg_handler.write("{0}\t{1}\n".format(key, "\t".join(ele)))

replace a string of characters with the line number

I have a text file that has approximately 3,000 lines. 99% of the time I need all 3,000 lines. However, periodically I will grep out the lines I need and direct the output to another text file to use.
The only problem I have in doing so is this: embedded in the text file is a six-character string of digits that indicates the line number. In order to use the file, this area needs to be correctly renumbered. (I don't need to re-sort the data, but I need to replace the current six characters with the new line number, and it must be padded with zeros!) Unfortunately, each row is one long string of data with no field separators!
For example, my first three rows might look something like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
The six characters at positions 17-22 (Immediately following the "ZZ"), need be renumbered based on the current row number...so the above needs to look like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS
Any ideas would be greatly appreciated!
Thanks,
KSL.
Here's the solution I came up with in Perl. It assumes that the numbering is always 6 digits after the ZZ sequence.
In convert.pl:
use strict;
use warnings;
my $i = 1; # or the value you want to start numbering
while (<STDIN>) {
my $replace = sprintf("%06d", $i++);
$_ =~ s/ZZ\d{6}/ZZ$replace/g;
print $_;
}
In data.dat:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
To run:
cat data.dat | perl convert.pl
Output
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS
If I were solving this, I would write a simple Python script that reads the lines, filters them as grep does, and keeps an internal counter.
As a hint, you can read each line into a string and access the number with variablename[16:22] (Python slices are zero-based and exclude the end index, so 16:22 covers character positions 17-22).
Python strings also have a replace method; use it, or slicing, to substitute the zero-padded counter for those characters.
I hope this helps.
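The same positional idea can be sketched in Perl with four-argument substr, which overwrites a fixed range of characters. This assumes every line is at least 22 characters long; the helper name renumber is hypothetical.

```perl
use strict;
use warnings;

# Overwrite the six characters at 1-based positions 17-22 with a
# zero-padded counter. renumber is a hypothetical helper name.
sub renumber {
    my ( $line, $n ) = @_;
    substr( $line, 16, 6, sprintf( '%06d', $n ) );    # offset 16 is the 17th character
    return $line;
}

my @rows = (
    '20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH',
    '20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD',
);
my $count = 0;
print renumber( $_, ++$count ), "\n" for @rows;
```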
To do this in awk:
awk '{print substr($0,1,16) sprintf("%06d", NR) substr($0,23)}'
or
gawk 'match($0,/^(.*ZZ)[0-9]{6}(.*)/,a) {print a[1] sprintf("%06d",NR) a[2]}'
This is exactly the type of thing where unpack is useful.
#!/usr/bin/env perl
use v5.10.0;
use strict;
use warnings;
while( my $line = <> ){
    chomp $line;
    my @elem = unpack 'A16 A6 A*', $line;
    $elem[1] = sprintf '%06d', $.;
    # $. is the line number for the last used file handle
    say @elem;
}
Actually looking at the lines, it looks like there is date information stored in the first 14 characters.
Assuming that at some point you might want to parse the lines for some reason you can use the following as an example of how you could use unpack to split up the lines.
#!/usr/bin/env perl
use v5.10.0; # say()
use strict;
use warnings;
use DateTime;
my @date_elem = qw'
  year month day
  hour minute second
';
my @elem_names = ( @date_elem, qw'
  ZZ
  line_number
  random_data
');
while( my $line = <> ){
    chomp $line;
    my %data;
    @data{ @elem_names } = unpack 'A4 (A2)6 A6 A*', $line;
    # choose either this:
    $data{line_number} = sprintf '%06d', $.;
    say @data{@elem_names};
    # or this:
    $data{line_number} = $.;
    printf '%04d' . ('%02d'x5) . "%2s%06d%s\n", @data{ @elem_names };
    # the choice will affect the contents of %data
    # this just shows the contents of %data
    for( @elem_names ){
        printf qq'%12s: "%s"\n', $_, $data{$_};
    }
    # you can create a DateTime object with the date elements
    my $dt = DateTime->new(
        (map{ $_, $data{$_} } @date_elem),
        time_zone => 'floating',
    );
    say $dt;
    print "\n";
}
Although it would be better to use a regular expression, so that you could throw out bogus data.
use v5.14; # /a modifier
...
my $rdate = join '', map{"(\\d{$_})"} 4, (2)x5;
my $rx = qr/$rdate (ZZ) (\d{6}) (.*)/xa; # qr/.../ rather than qr'...' so that $rdate is interpolated
while( my $line = <> ){
    chomp $line;
    my %data;
    unless( @data{ @elem_names } = $line =~ $rx ){
        die qq'unable to parse line "$line" ($.)';
    }
...
It would be better still to use named capture groups, added in 5.10.
...
my $rx = qr'
(?<year> \d{4} ) (?<month> \d{2} ) (?<day> \d{2} )
(?<hour> \d{2} ) (?<minute> \d{2} ) (?<second> \d{2} )
ZZ
(?<line_number> \d{6} )
(?<random_data> .* )
'xa;
while( my $line = <> ){
chomp $line;
unless( $line =~ $rx ){
die qq'unable to parse line "$line" ($.)';
}
my %data = %+;
# for compatibility with previous examples
$data{ZZ} = 'ZZ';
...

Proper-case first field

My text file contains this:
COcoNut,Other,900,21_7_2011,Coimbatore,TEINGKAAY
CotTon,Others,3500,21_7_2011,Coimbatore,PARUTTI
Maize,Others,1200,21_7_2011,Coimbatore,MAKKAACHOOLAM
Bajra,Other,1325,14_7_2011,Coimbatore,KAMBU
Jowar,Other,2750,14_7_2011,Coimbatore,CHOOLAM
Ragi,Other,910,14_7_2011,Coimbatore,KEIZHVARAKU
Coconut,Grade_I,650,12_7_2011,Coimbatore,TEINGKAAY GRADE ONNU
Copra,other,5300,7_7_2011,Coimbatore,KOPPARAI
Paddy,ADT_______36,950,15_7_2011,Madurai,NELLU ADT MUPPATTI AARU
Paddy,AST_16,950,15_7_2011,Madurai,NELLU AST PATINAARU
Here I have COcoNut, CotTon, etc., but I want to print Coconut, Cotton, Jowar, i.e. only the first letter should be uppercase and the rest lowercase, using a regular expression in Perl and not any packages.
You can also see 'Others', but I want only 'Other' in the text file. This should be handled along with the above expression.
The script should then read and write the same file, i.e. the file should be overwritten in place.
Can anyone please suggest how to do this?
#!/usr/bin/perl
use strict;
use warnings;
use IO::InSitu;
my ( $in, $out ) = open_rw ( '/path/to/file' );
my $separator = ',';
while ( <$in> ) {
    chomp;    # strip the newline so that it is not doubled when printing
    my @fields = split ( $separator => $_ );
    $fields[ 0 ] = ucfirst lc $fields[ 0 ];
    $fields[ 1 ] =~ s/(?<=other)s//gi;
    print { $out } join ( $separator => @fields ) . "\n";
}
s/^([^,]*)/\u\L$1/;s/,Others,/,Other,/;
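A minimal sketch of how those two substitutions behave on the sample lines. The wrapper name normalize_line is hypothetical and exists only so the behaviour can be checked in isolation.

```perl
use strict;
use warnings;

# Apply the two substitutions to one comma-separated line.
# normalize_line is a hypothetical wrapper name.
sub normalize_line {
    my ($line) = @_;
    $line =~ s/^([^,]*)/\u\L$1/;     # first field: lowercase it, then capitalize the first letter
    $line =~ s/,Others,/,Other,/;    # second field: "Others" becomes "Other"
    return $line;
}

my @sample = (
    'COcoNut,Other,900,21_7_2011,Coimbatore,TEINGKAAY',
    'CotTon,Others,3500,21_7_2011,Coimbatore,PARUTTI',
);
print normalize_line($_), "\n" for @sample;
```

To overwrite the file in place, the same two statements could be passed to perl -pi.bak -e with the file name as the argument; the .bak suffix keeps a backup copy of the original.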

How can I extract numeric data from a text file?

I want the Perl script to extract a data from a text file and save it as another text file. Each line of the text file contains an URL to a jpg like "http://pics1.riyaj.com/thumbs/000/082/104//small.jpg". I want the script to extract the last 6 numbers of each jpg URL, (i.e 082104) to a variable. I want the variable to be added to a different location on each line of the new text.
Input text:
text http://pics1.riyaj.com/thumbs/000/082/104/small.jpg text
text http://pics1.riyaj.com/thumbs/000/569/315/small.jpg text
Output text:
text php?id=82104 text
text php?id=569315 text
Thanks
What have you tried so far?
Here's a short program that gives you the meat of the problem, and you can add the rest of it:
while( <> )
{
s|http://.*/\d+/(\d+)/(\d+).*?jpg|php?id=$1$2|;
print;
}
This is very close to the command-line program that handles the looping and printing for you with the -p switch (see the perlrun documentation for the details). With -i.old, Perl edits inputfile in place and keeps the original as inputfile.old, so no output redirection is needed:
perl -pi.old -e 's|http://.*/\d+/(\d+)/(\d+).*?jpg|php?id=$1$2|' inputfile
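A sketch of the same substitution as a function that can be checked against the sample input. Dropping the leading zero (082104 becomes 82104) follows the question's sample output and is an assumption about what is wanted; rewrite_line is a hypothetical helper name and the pattern is slightly stricter than the one above.

```perl
use strict;
use warnings;

# Replace a thumbnail URL with php?id= followed by the last two numeric
# path segments. rewrite_line is a hypothetical helper name.
sub rewrite_line {
    my ($line) = @_;
    # /e evaluates the replacement as code; adding 0 drops a leading zero
    $line =~ s{http://\S*?/\d+/(\d+)/(\d+)/\S*?\.jpg}{ 'php?id=' . ( "$1$2" + 0 ) }e;
    return $line;
}

my @sample = (
    'text http://pics1.riyaj.com/thumbs/000/082/104/small.jpg text',
    'text http://pics1.riyaj.com/thumbs/000/569/315/small.jpg text',
);
print rewrite_line($_), "\n" for @sample;
```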
I didn't know whether to answer according to what you described ("last 6 digits") or just assume that it all fits the pattern you showed. So I decided to answer both ways.
Here is a method that can handle lines more diverse than your examples.
use FileHandle;
my $jpeg_RE = qr{
(.*?) # Anything, watching out for patterns ahead
\s+ # At least one space
(?> http:// ) # Once we match "http://" we're onto the next section
\S*? # Any non-space, watching out for what follows
( (?: \d+ / )* # At least one digit, followed by a slash, any number of times
\d+ # another group of digits
) # end group
\D*? # Any number of non-digits looking ahead
\.jpg # literal string '.jpg'
\s+ # At least one space
(.*) # The rest of the line
}x;
my $infile = FileHandle->new( "<$file_in" );
my $outfile = FileHandle->new( ">$file_out" );
while ( my $line = <$infile> ) {
my ( $pre_text, $digits, $post_text ) = ( $line =~ m/$jpeg_RE/ );
$digits =~ s/\D//g;
$outfile->printf( "$pre_text php?id=%s $post_text\n", substr( $digits, -6 ));
}
$infile->close();
However, if it's just as regular as you show, it gets a lot easier:
use FileHandle;
my $jpeg_RE = qr{
(?> \Qhttp://pics1.riyaj.com/thumbs/\E )
\d{3}
/
( \d{3} )
/
( \d{3} )
\S*?
\.jpg
}x;
my $infile = FileHandle->new( "<$file_in" );
my $outfile = FileHandle->new( ">$file_out" );
while ( my $line = <$infile> ) {
$line =~ s/$jpeg_RE/php?id=$1$2/g;
$outfile->print( $line );
}
$infile->close();