How can I extract numeric data from a text file? - perl

I want the Perl script to extract a data from a text file and save it as another text file. Each line of the text file contains an URL to a jpg like "http://pics1.riyaj.com/thumbs/000/082/104//small.jpg". I want the script to extract the last 6 numbers of each jpg URL, (i.e 082104) to a variable. I want the variable to be added to a different location on each line of the new text.
Input text:
text http://pics1.riyaj.com/thumbs/000/082/104/small.jpg text
text http://pics1.riyaj.com/thumbs/000/569/315/small.jpg text
Output text:
text php?id=82104 text
text php?id=569315 text
Thanks

What have you tried so far?
Here's a short program that gives you the meat of the problem, and you can add the rest of it:
while( )
{
s|http://.*/\d+/(\d+)/(\d+).*?jpg|php?id=$1$2|;
print;
}
This is very close to the command-line program the handles the looping and printing for you with the -p switch (see the perlrun documentation for the details):
perl -pi.old -e 's|http://.*/\d+/(\d+)/(\d+).*?jpg|php?id=$1$2|' inputfile > outputfile

I didn't know whether to answer according to what you described ("last 6 digits") or just assume that it all fits the pattern you showed. So I decided to answer both ways.
Here is a method that can handle lines more diverse than your examples.
use FileHandle;
my $jpeg_RE = qr{
(.*?) # Anything, watching out for patterns ahead
\s+ # At least one space
(?> http:// ) # Once we match "http://" we're onto the next section
\S*? # Any non-space, watching out for what follows
( (?: \d+ / )* # At least one digit, followed by a slash, any number of times
\d+ # another group of digits
) # end group
\D*? # Any number of non-digits looking ahead
\.jpg # literal string '.jpg'
\s+ # At least one space
(.*) # The rest of the line
}x;
my $infile = FileHandle->new( "<$file_in" );
my $outfile = FileHandle->new( ">$file_out" );
while ( my $line = <$infile> ) {
my ( $pre_text, $digits, $post_text ) = ( $line =~ m/$jpeg_RE/ );
$digits =~ s/\D//g;
$outfile->printf( "$pre_text php?id=%s $post_text\n", substr( $digits, -6 ));
}
$infile->close();
However, if it's just as regular as you show, it gets a lot easier:
use FileHandle;
my $jpeg_RE = qr{
(?> \Qhttp://pics1.riyaj.com/thumbs/\E )
\d{3}
/
( \d{3} )
/
( \d{3} )
\S*?
\.jpg
}x;
my $infile = FileHandle->new( "<$file_in" );
my $outfile = FileHandle->new( ">$file_out" );
while ( my $line = <$infile> ) {
$line =~ s/$jpeg_RE/php?id=$1$2/g;
$outfile->print( $line );
}
$infile->close();

Related

replace a string of characters with the line number

I have a text file that has approximately 3,000 lines. 99% of the time I need all 3,000 lines. However, periodically I will grep out the lines I need and direct the output to another text file to use.
The only problem I have in doing so, is: Embedded in the text file is a 6 character string of numbers that indicate the line number. In order to use the file, this area needs to be correctly renumbered...(I don't need to re-sort the data, but I need to replace the current six characters with the new line number. and it must be padded with zeros! Unfortuantely the entire rows is one long row of data with no field separators!
For example, my first three rows might look something like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
The six characters at positions 17-22 (Immediately following the "ZZ"), need be renumbered based on the current row number...so the above needs to look like:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS
Any ideas would be greatly appreciated!
Thanks,
KSL.
Here's the solution I came up with Perl. It assumes that the numbering is always 6 digits after the ZZ sequence.
In convert.pl:
use strict;
use warnings;
my $i = 1; # or the value you want to start numbering
while (<STDIN>) {
my $replace = sprintf("%06d", $i++);
$_ =~ s/ZZ\d{6}/ZZ$replace/g;
print $_;
}
In data.dat:
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000999MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000027SILLMORERANDOMDATAFOLLOWSAFTERTHIS
To run:
cat data.dat | perl convert.pl
Output
20130918082020ZZ000001RANDOMDATAFOLLOWSAFTERTHISABCDEFGH
20130810112000ZZ000002MORERANDOMDATAFOLLOWSAFTERTHISABCD
20130810112000ZZ000003SILLMORERANDOMDATAFOLLOWSAFTERTHIS
If I would solve this, I would create a simple python script to read those lines by filtering as grep does and using a internal counter from inside the python script.
As simple hints you can read each line in a string and access them using variablename[17:22] (17:22 is the position of the string you are trying to use).
Now, there is a method in the string in python which does the replace, just replace the values by the counter you create.
I hope this helps.
To do this in awk:
awk '{print substr($0,1,16) sprintf("%06d", NR) substr($0,23)}'
or
gawk 'match($0,/^(.*ZZ)[0-9]{6}(.*)/,a) {print a[1] sprintf("%06d",NR) a[2]}'
This is exactly the type of thing where unpack is useful.
#!/usr/bin/env perl
use v5.10.0;
use strict;
use warnings;
while( my $line = <> ){
chomp $line;
my #elem = unpack 'A16 A6 A*', $line;
$elem[1] = sprintf '%06d', $.;
# $. is the line number for the last used file handle
say #elem;
}
Actually looking at the lines, it looks like there is date information stored in the first 14 characters.
Assuming that at some point you might want to parse the lines for some reason you can use the following as an example of how you could use unpack to split up the lines.
#!/usr/bin/env perl
use v5.10.0; # say()
use strict;
use warnings;
use DateTime;
my #date_elem = qw'
year month day
hour minute second
';
my #elem_names = ( #date_elem, qw'
ZZ
line_number
random_data
');
while( my $line = <> ){
chomp $line;
my %data;
#data{ #elem_names } = unpack 'A4 (A2)6 A6 A*', $line;
# choose either this:
$data{line_number} = sprintf '%06d', $.;
say #data{#elem_names};
# or this:
$data{line_number} = $.;
printf '%04d' . ('%02d'x5) . "%2s%06d%s\n", #data{ #elem_names };
# the choice will affect the contents of %data
# this just shows the contents of %data
for( #elem_names ){
printf qq'%12s: "%s"\n', $_, $data{$_};
}
# you can create a DateTime object with the date elements
my $dt = DateTime->new(
(map{ $_, $data{$_} } #date_elem),
time_zone => 'floating',
);
say $dt;
print "\n";
}
Although it would be better to use a regular expression, so that you could throw out bogus data.
use v5.14; # /a modifier
...
my $rdate = join '', map{"(\\d{$_})"} 4, (2)x5;
my $rx = qr'$rdate (ZZ) (\d{6}) (.*)'xa;
while( my $line = <> ){
chomp $line;
my %data;
unless( #data{ #elem_names } = $line =~ $rx ){
die qq'unable to parse line "$line" ($.)';
}
...
It would be better still; to use named capture groups added in 5.10.
...
my $rx = qr'
(?<year> \d{4} ) (?<month> \d{2} ) (?<day> \d{2} )
(?<hour> \d{2} ) (?<minute> \d{2} ) (?<second> \d{2} )
ZZ
(?<line_number> \d{6} )
(?<random_data> .* )
'xa;
while( my $line = <> ){
chomp $line;
unless( $line =~ $rx ){
die qq'unable to parse line "$line" ($.)';
}
my %data = %+;
# for compatibility with previous examples
$data{ZZ} = 'ZZ';
...

How to merge 2 lines in perl script

I am trying to write a Perl script that will transform the input
( name
( type ....
)
)
into the output
( name ( type ... ) )
I.e. all these lines matching ( ) are merged into a single line and I want to update the original file itself.
Thanks in advance
use strict;
use warnings;
my $file="t.txt"; #or shift (ARGV); for command line input
my $new_format=undef;
open READ, $file;
local $/=undef; #says to read to end of file
$new_format=<READ>;
$new_format=~ s/\n//g; #replaces all newline characters with nothing, aka removes all \n
close(READ);
open WRITE, ">$file"; #open for writing (overwrites)
print WRITE $new_format;
close WRITE;
This works, assuming that the entire file is one big expression. For reference, to remove all white-space, use $new_format=~ s/\s//g; instead of $new_format=~ s/\n//g;. It can be easily modified to account for multiple expressions. All one would have to do redefine $/ to be whatever you're using to separate expressions (for example if simply a blank line: local $/ = /^\s+$/;) and throw everything into a while loop. For each iteration, push the string into an array and after the file is completely processed, write the contents of the array to the file in the format that you require.
Is the ((..)) syntax guaranteed? If so I'd suggest merging the whole thing into one line and then splitting based on )(s.
my $line = "";
while(<DATA>)
{
$_ =~ s= +$==g; # remove end spaces.
$line .= $_;
}
$line =~ s=\n==g;
my #lines = split /\)\(/,$line;
my $resulttext = join ")\n(", #lines;
print $resulttext;
__END__
( name
( type ....
)
)
( name2
( type2 ....
)
)
( name3
( type3 ....
)
)
Here's another option:
use strict;
use warnings;
while (<>) {
chomp unless /^\)/;
print;
}
Usage: perl script.pl inFile [>outFile]
Sample data:
( name
( type ....
)
)
( name_a
( type_a ....
)
)
( name_b
( type_b ....
)
)
Output:
( name ( type .... ) )
( name_a ( type_a .... ) )
( name_b ( type_b .... ) )
The script removes the input record separator unless the line read contains the last closing right paren (matched by being the first char on the line).
Hope this helps!

How to split a this string 'gi|216ATGCTGATGCTGTG' in this format 'gi|216 ATGCTGTGCTGATGCTG' in Perl?

I am parsing the fasta alignment file which contains
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
I want to split this string into this:
gi|216 CCAACGAAATGATCGCCACACAA
gi|21- GCTGGTTCAGCGACCAAAAGTAGC
For first string, I use
$aar=split("\d",$string);
But that didn't work. What should I do?
So you're parsing some genetic data and each line has a gi| prefix followed by a sequence of numbers and hyphens followed by the nucleotide sequence? If so, you could do something like this:
my ($number, $nucleotides);
if($string =~ /^gi\|([\d-]+)([ACGT]+)$/) {
$number = $1;
$nucleotides = $2;
}
else {
# Broken data?
}
That assumes that you've already stripped off leading and trailing whitespace. If you do that, you should get $number = '216' and $nucleotides = 'CCAACGAAATGATCGCCACACAA' for the first one and $number = '216-' and $nucleotides = 'GCTGGTTCAGCGACCAAAAGTAGC' for the second one.
Looks like BioPerl has some stuff for dealing with fasta data so you might want to use BioPerl's tools rather than rolling your own.
Here's how I'd go about doing that.
#!/usr/bin/perl -Tw
use strict;
use warnings;
use Data::Dumper;
while ( my $line = <DATA> ) {
my #strings =
grep {m{\A \S+ \z}xms} # no whitespace tokens
split /\A ( \w+ \| [\d-]+ )( [ACTG]+ ) /xms, # capture left & right
$line;
print Dumper( \#strings );
}
__DATA__
gi|216CCAACGAAATGATCGCCACACAA
gi|21-GCTGGTTCAGCGACCAAAAGTAGC
If you just want to add a space (can't really tell from your question), use substitution. To put a space in front of any grouping of ACTG:
$string =~ s/([ACTG]+)/ \1/;
or to add a tab after any grouping of digits and dashes:
$string =~ s/([\d-]+)/\1\t/;
note that this will substitute on $string in place.

How to centrally justify in printf function in Perl

I know the printf function by default uses right-justification. - will make it left justify. But is it possible to make it centrally justify the format?
The printf function cannot center text.
However, there is a very old, and almost forgotten mechanism that can do this. You can create format statements in Perl that tells write statements how to print. By using format and write, you can center justify text.
This was something sort of done back in the days of Perl 3.x back in 1989, but sort of abandoned by the time Perl 4 came out. Perl 5, with its stronger variable scoping really put a crimp in the use of formats since using them would violate the way Perl 5 likes to scope variables (formats are global in nature).
You can learn more about it by looking at perldoc perlform. I haven't seen them used in years.
my #lines = (
"It is true that printf and sprintf",
"do not have a conversion to center-justify text.",
"However, you can achieve the same effect",
"by padding left-justified text",
"with an appropriate number of spaces."
);
my $max_length = 0;
foreach my $line (#lines) {
$max_length = (length $line > $max_length) ? length $line : $max_length;
}
foreach my $line (#lines) {
printf "%s%-${max_length}s\n", ' ' x int(($max_length - length $line)/2), $line;
}
You need to use two variables for each value you'd like to print and dynamically set the width of each around the value width. The problem becomes a little trickier if you want consistent total widths when your value has an odd/even string length. The following seems to do the trick!
use POSIX;
printf( "<%*s%*s>\n",
((10+length($_))/2), $_,
ceil((10-length($_))/2), "" )
for( qw( four five5 six666 7seven7 ) );
which prints
< four >
< five5 >
< six666 >
< 7seven7 >
You need to know the line width for this. For example, printing centered lines to the terminal:
perl -lne 'BEGIN {$cols=`tput cols`} print " " x (($cols-length)/2),$_;' /etc/passwd
Of course, this is not a printf formatting tag.
my #lines = (
"Some example lines",
"of differing length",
"to show a different approach",
"that actually",
"prints your content",
"centered in its max-width block",
"with minimal padding",
"(which the answer",
"by Sam Choukri does NOT do)."
);
# Get the required field width
my $max_length = 0;
foreach my $line ( #lines )
{
$max_length = length( $line ) if ( length( $line ) > $max_length );
}
foreach my $line ( #lines )
{
# First step, find out how much padding is required
my $padding = $max_length - length( $line );
# Get half that amount (rounded up) in spaces
$padding = ( ' ' x ( ( $padding + $padding % 2 ) / 2 ) );
# Print your output, with padding appended, right-justified to
# a max-width field.
# (Pipe character added to show that trailing padding is correct)
printf "%${max_length}s|\n", $line . $padding;
}

Neatest way to remove linebreaks in Perl

I'm maintaining a script that can get its input from various sources, and works on it per line. Depending on the actual source used, linebreaks might be Unix-style, Windows-style or even, for some aggregated input, mixed(!).
When reading from a file it goes something like this:
#lines = <IN>;
process(\#lines);
...
sub process {
#lines = shift;
foreach my $line (#{$lines}) {
chomp $line;
#Handle line by line
}
}
So, what I need to do is replace the chomp with something that removes either Unix-style or Windows-style linebreaks.
I'm coming up with way too many ways of solving this, one of the usual drawbacks of Perl :)
What's your opinion on the neatest way to chomp off generic linebreaks? What would be the most efficient?
Edit: A small clarification - the method 'process' gets a list of lines from somewhere, not nessecarily read from a file. Each line might have
No trailing linebreaks
Unix-style linebreaks
Windows-style linebreaks
Just Carriage-Return (when original data has Windows-style linebreaks and is read with $/ = '\n')
An aggregated set where lines have different styles
After digging a bit through the perlre docs a bit, I'll present my best suggestion so far that seems to work pretty good. Perl 5.10 added the \R character class as a generalized linebreak:
$line =~ s/\R//g;
It's the same as:
(?>\x0D\x0A?|[\x0A-\x0C\x85\x{2028}\x{2029}])
I'll keep this question open a while yet, just to see if there's more nifty ways waiting to be suggested.
Whenever I go through input and want to remove or replace characters I run it through little subroutines like this one.
sub clean {
my $text = shift;
$text =~ s/\n//g;
$text =~ s/\r//g;
return $text;
}
It may not be fancy but this method has been working flawless for me for years.
$line =~ s/[\r\n]+//g;
Reading perlport I'd suggest something like
$line =~ s/\015?\012?$//;
to be safe for whatever platform you're on and whatever linefeed style you may be processing because what's in \r and \n may differ through different Perl flavours.
Note from 2017: File::Slurp is not recommended due to design mistakes and unmaintained errors. Use File::Slurper or Path::Tiny instead.
extending on your answer
use File::Slurp ();
my $value = File::Slurp::slurp($filename);
$value =~ s/\R*//g;
File::Slurp abstracts away the File IO stuff and just returns a string for you.
NOTE
Important to note the addition of /g , without it, given a multi-line string, it will only replace the first offending character.
Also, the removal of $, which is redundant for this purpose, as we want to strip all line breaks, not just line-breaks before whatever is meant by $ on this OS.
In a multi-line string, $ matches the end of the string and that would be problematic ).
Point 3 means that point 2 is made with the assumption that you'd also want to use /m otherwise '$' would be basically meaningless for anything practical in a string with >1 lines, or, doing single line processing, an OS which actually understands $ and manages to find the \R* that proceed the $
Examples
while( my $line = <$foo> ){
$line =~ $regex;
}
Given the above notation, an OS which does not understand whatever your files '\n' or '\r' delimiters, in the default scenario with the OS's default delimiter set for $/ will result in reading your whole file as one contiguous string ( unless your string has the $OS's delimiters in it, where it will delimit by that )
So in this case all of these regex are useless:
/\R*$// : Will only erase the last sequence of \R in the file
/\R*// : Will only erase the first sequence of \R in the file
/\012?\015?// : When will only erase the first 012\015 , \012 , or \015 sequence, \015\012 will result in either \012 or \015 being emitted.
/\R*$// : If there happens to be no byte sequences of '\015$OSDELIMITER' in the file, then then NO linebreaks will be removed except for the OS's own ones.
It would appear nobody gets what I'm talking about, so here is example code, that is tested to NOT remove line feeds. Run it, you'll see that it leaves the linefeeds in.
#!/usr/bin/perl
use strict;
use warnings;
my $fn = 'TestFile.txt';
my $LF = "\012";
my $CR = "\015";
my $UnixNL = $LF;
my $DOSNL = $CR . $LF;
my $MacNL = $CR;
sub generate {
my $filename = shift;
my $lineDelimiter = shift;
open my $fh, '>', $filename;
for ( 0 .. 10 )
{
print $fh "{0}";
print $fh join "", map { chr( int( rand(26) + 60 ) ) } 0 .. 20;
print $fh "{1}";
print $fh $lineDelimiter->();
print $fh "{2}";
}
close $fh;
}
sub parse {
my $filename = shift;
my $osDelimiter = shift;
my $message = shift;
print "Parsing $message File $filename : \n";
local $/ = $osDelimiter;
open my $fh, '<', $filename;
while ( my $line = <$fh> )
{
$line =~ s/\R*$//;
print ">|" . $line . "|<";
}
print "Done.\n\n";
}
my #all = ( $DOSNL,$MacNL,$UnixNL);
generate 'Windows.txt' , sub { $DOSNL };
generate 'Mac.txt' , sub { $MacNL };
generate 'Unix.txt', sub { $UnixNL };
generate 'Mixed.txt', sub {
return #all[ int(rand(2)) ];
};
for my $os ( ["$MacNL", "On Mac"], ["$DOSNL", "On Windows"], ["$UnixNL", "On Unix"]){
for ( qw( Windows Mac Unix Mixed ) ){
parse $_ . ".txt", #{ $os };
}
}
For the CLEARLY Unprocessed output, see here: http://pastebin.com/f2c063d74
Note there are certain combinations that of course work, but they are likely the ones you yourself naĆ­vely tested.
Note that in this output, all results must be of the form >|$string|<>|$string|< with NO LINE FEEDS to be considered valid output.
and $string is of the general form {0}$data{1}$delimiter{2} where in all output sources, there should be either :
Nothing between {1} and {2}
only |<>| between {1} and {2}
In your example, you can just go:
chomp(#lines);
Or:
$_=join("", #lines);
s/[\r\n]+//g;
Or:
#lines = split /[\r\n]+/, join("", #lines);
Using these directly on a file:
perl -e '$_=join("",<>); s/[\r\n]+//g; print' <a.txt |less
perl -e 'chomp(#a=<>);print #a' <a.txt |less
To extend Ted Cambron's answer above and something that hasn't been addressed here: If you remove all line breaks indiscriminately from a chunk of entered text, you will end up with paragraphs running into each other without spaces when you output that text later. This is what I use:
sub cleanLines{
my $text = shift;
$text =~ s/\r/ /; #replace \r with space
$text =~ s/\n/ /; #replace \n with space
$text =~ s/ / /g; #replace double-spaces with single space
return $text;
}
The last substitution uses the g 'greedy' modifier so it continues to find double-spaces until it replaces them all. (Effectively substituting anything more that single space)