sort alphabetically whole word in first array position - perl

I have a file with values separated by pipe |
input:
car tree alfa young|salt brick|23223|
emilia jack albert| way to go|56566|
I'm trying to sort first column alphabetically and write sorted file out.
result:
alfa car tree young|salt brick|23223|
albert emilia jack| way to go|56566|
What I have tried:
can sort first array position but cant write out all file content.
$filename = 'test.dat';
open (INPUT,"$filename");
open (OUTPUT,">out.dat");
while (<INPUT>)
{
#array = split('\|');
#arr = split(' ',$array[0]);
$,=" ";
print OUTPUT sort #arr,"\n";
}
close (INPUT);
close (OUTPUT);
sorts everything in each line.
$filename = 'test.dat';
open (INPUT,"$filename");
open (OUTPUT,">out.dat");
while (<INPUT>)
{
#arr = split(' ');
$,=" ";
print OUTPUT sort #arr,"\n";
}
close (INPUT);
close (OUTPUT);

Your programs will be far more flexible if you move away from the idea of hard-coding file names in your code. It's a much better idea to read from STDIN and write to STDOUT whenever possible. That's what this code does.
#!/usr/bin/perl
use strict;
use warnings;
while (<>) {
# Split the record into the first field (the one we want to sort)
# and everything else.
my ($first, $rest) = split /\|/, $_, 2;
# Split the first record into words
my #words = split /\s+/, $first;
# Sort #words and then join the line back together and print it.
print join(' ', sort #words), "|$rest";
}
If we store this program in a file called reorder, we can then run it from a command-line prompt like this:
$ ./reorder < test.dat > out.dat
We avoid having to open any filehandles (the operating system does it for us) and we don't need to change the program if the filenames ever change.

perl -pe 's/^([^|]+)/#a = sort split ' ', $1; "#a";/e' myfile

After you've split the line into an array
my #ary = split '\|';
you need to split the first element, sort the list, and join it back
my $first = join ' ', sort split ' ', shift #ary;
where shift removes (and returns) the first element.
Then build the string back
my $new_string = join '|', $first, #ary;
This can be done without temporary variables. The last two steps can be done in one statement, and all of it right in a print statement.
Comments
Please always start your programs with use warnings; and use strict;
Use three-argument form of open
open my $fh, '<', $file or die "Can't open $file: $!";
and so use lexical filehandles ($fh), not bareword (typeglobs, FH).
Don't quote things that don't need to be quoted -- even errors can result! See What's wrong with always quoting (perlfaq4), for instance. Thanks to Sinan Ünür for the comment.
You can also do it with a regex
$line =~ s/(.*?)\|/join(" ", sort split " ", $1).q(|)/e;
This captures the first field (up to the pipe), and runs the above code on it, courtesy of /e modifier which makes replacement part be evaluated as code.
We needed to also match the pipe and then put it back. This can be avoided using lookahead
$line =~ s/(.*?)(?=\|)/join(' ', sort split ' ', $1)/e;
which only asserts that the pattern inside (?=...) is there, without consuming it.

Your first bit of code is so very close to being right. You're only outputting the first column though - you need to print out the contents of #array. If you replace the first element of #array with the sorted results, then you can just write it out like this.
#array = split('\|');
$array[0] = join(" ",sort split(' ',$array[0]));
print OUTPUT join("|",#array),"\n";

Related

Replace multiple hex values

I've written the following script because I need to do some cleanup in some files. I have a specific number of hex characters that needs to be changed into another set of hex characters (ie null to space, see below). I've written the following script, my problem is that it only replaces the first occurence and nothing else.
I've tried the /g just like a regular sed pattern but it doesnt work. Is there a way to do this and replace all matches?
(The reason i havent used a $line =~ s/... is because I think its neater and more maintainable that way, and this script will need to be accessed and run on occasion by others who may need to edit the hex values to be replaced). Another reason is because i need to change from 10+ hex values to an equivalent amount, so a huge one liner would be hard to read. Thank you in advance.
#!/usr/bin/perl
use strict;
use warnings;
my $filebase = shift || "testreplace.txt";
my $filefilter = shift || "testf";
open my $fh1, '>', 'testreplaceout';
# Iterate over file and read lines
open my $file1, '<', $filebase;
while (my $line = <$file1>)
{
chomp($line);
for ($line) {
s/\x00/\x20/g;
s/\x31/\x32/g;
}
print {$fh1} "$line \n";
}
/g will do what you want. If it doesn't seem to be working, add some debugging:
use Data::Dumper;
$Data::Dumper::Useqq = $Data::Dumper::Terse = 1;
And in your loop:
print Dumper($line);
for ($line) {
s/\x00/\x20/g;
s/\x31/\x32/g;
}
print Dumper($line);
Using tr with paired delimiters instead can be very readable/maintainable:
$line =~ tr[\x00\x31]
[\x20\x32];
Also, consider adding use autodie;
tr/// is probably your best bet here (since you are dealing with constant single character replacements). The following is a more generic solution.
my %replacements = (
'foo' => 'bar',
'bar' => 'baz',
);
my $pat = join '|', map quotemeta, keys(%replacement);
s/($pat)/$replacements{$1}/g;
Update: read comments for caveats of this answer.
Here's one way that'll allow you to keep your list of regex search/replaces at the top of your script nice and clean for ease of viewing and modification:
use warnings;
use strict;
my #re_list = (
['a', 'x'],
['b', 'y'],
);
while (my $line = <DATA>){
for my $re (#re_list){
$line =~ s/$re->[0]/$re->[1]/g;
}
print $line;
}
__DATA__
aaabbbccc
bbbcccddd
ababababa
Output:
xxxyyyccc
yyycccddd
xyxyxyxyx

Why is my Perl code not omitting newlines?

I'm reading this textfile to get ONLY the words in it and ignore all kind of whitespaces:
hello
now
do you see this.sadslkd.das,msdlsa but
i hoohoh
And this is my Perl code:
#!usr/bin/perl -w
require 5.004;
open F1, './text.txt';
while ($line = <F1>) {
#print $line;
#arr = split /\s+/, $line;
foreach $w (#arr) {
if ($w !~ /^\s+$/) {
print $w."\n";
}
}
#print #arr;
}
close F1;
And this is the output:
hello
now
do
you
see
this.sadslkd.das,msdlsa
but
i
hoohoh
The output is showing two newlines but I am expecting the output to be just words. What should I do to just get words?
You should always use strict and use warnings (in preference to the -w command-line qualifier) at the top of every Perl program, and declare each variable at its first point of use using my. That way Perl will tell you about simple errors that you may otherwise overlook.
You should also use lexical file handles with the three-parameter form of open, and check the status to make sure it succeeded. There is little point in explicitly closing an input file unless you expect your program to run for an appreciable time, as Perl will close all files for you on exit.
Do you really need to require Perl v5.4? That version is fifteen years old, and if there is anything older than that installed then you have a museum!
Your program would be better like this:
use strict;
use warnings;
open my $fh, '<', './text.txt' or die $!;
while (my $line = <$fh>) {
my #arr = split /\s+/, $line;
foreach my $w (#arr) {
if ($w !~ /^\s+$/) {
print $w."\n";
}
}
}
Note: my apologies. The warnings pragma and lexical file handles were introduced only in v5.6 so that part of my answer is irrelevant. The latest version of Perl is v5.16 and you really should upgrade
As Birei has pointed out, the problem is that, when the line has leading whitespace, there is a empty field before the first separator. Imagine if your data was comma-separated, then you would want Perl to report a leading empty field if the line started with a comma.
To extract all the non-space characters you can use a regular expression that does exactly that
my #arr = $line =~ /\S+/g;
and this can be emulated by using the default parameter for split which is a single quoted space (not a regular expression)
my #arr = $line =~ split ' ', $line;
In this case split behaves like the awk utility and discards any leading empty fields as you expected.
This is even simpler if you let Perl use the $_ variable in the read loop, as all of the parameters for split can be defaulted:
while (<F1>) {
my #arr = split;
foreach my $w (#arr) {
print "$w\n" if $w !~ /^\s+$/;
}
}
This line is the problem:
#arr=split(/\s+/,$line);
\s+ does a match just before the leading spaces. Use ' ' instead.
#arr=split(' ',$line);
I believe that in this line:
if(!($w =~ /^\s+$/))
You wanted to ask if there's nothing in this row - don't print it.
But the "+" in the REGEX actually force it to have at least 1 space.
If you change the "\s+" to "\s*", you'll see that it's working. because * is 0 occurrences or more ...

Using Perl to parse a CSV file from a particular row to the end of the file

am very new to Perl and need your help
I have a CSV file xyz.csv with contents:
here level1 and er values are strings names...not numbers...
level1,er
level2,er2
level3,er3
level4,er4
I parse this CSV file using the script below and pass the fields to an array in the first run
open(my $d, '<', $file) or die "Could not open '$file' $!\n";
while (my $line = <$d>) {
chomp $line;
my #data = split "," , $line;
#XYX = ( [ "$data[0]", "$data[1]" ], );
}
For the second run I take an input from a command prompt and store in variable $val. My program should parse the CSV file from the value stored in variable until it reaches the end of the file
For example
I input level2 so I need a script to parse from the second line to the end of the CSV file, ignoring the values before level2 in the file, and pass these values (level2 to level4) to the #XYX = (["$data[1]","$data[1]"],);}
level2,er2
level3,er3
level4,er4
I input level3 so I need a script to parse from the third line to the end of the CSV file, ignoring the values before level3 in the file, and pass these values (level3 and level4) to the #XYX = (["$data[0]","$data[1]"],);}
level3,er3
level4,er4
How do I achieve that? Please do give your valuable suggestions. I appreciate your help
As long as you are certain that there are never any commas in the data you should be OK using split. But even so it would be wise to limit the split to two fields, so that you get everything up to the first comma and everything after it
There are a few issues with your code. First of all I hope you are putting use strict and use warnings at the top of all your Perl programs. That simple measure will catch many trivial problems that you could otherwise overlook, and so it is especially important before you ask for help with your code
It isn't commonly known, but putting a newline "\n" at the end of your die string prevent Perl from giving file and line number details in the output of where the error occurred. While this may be what you want, it is usually more helpful to be given the extra information
Your variable names are verly unhelpful, and by convention Perl variables consist of lower-case alphanumerics and underscores. Names like #XYX and $W don't help me understand your code at all!
Rather than splitting to an array, it looks like you would be better off putting the two fields into two scalar variables to avoid all that indexing. And I am not sure what you intend by #XYX = (["$data[1]","$data[1]"],). First of all do you really mean to use $data[1] twice? Secondly, your should never put scalar variables inside double quotes, as it does something very specific, and unless you know what that is you should avoid it. Finally, did you mean to push an anonymous array onto #XYX each time around the loop? Otherwise the contents of the array will be overwritten each time a line is read from the file, and the earlier data will be lost
This program uses a regular expression to extract $level_num from the first field. All it does it find the first sequence of digits in the string, which can then be compared to the minimum required level $min_level to decide whether a line from the log is relevant
use strict;
use warnings;
my $file = 'xyz.csv';
my $min_level = 3;
my #list;
open my $fh, '<', $file or die "Could not open '$file' $!";
while (my $line = <$fh>) {
chomp $line;
my ($level, $error) = split ',', $line, 2;
my ($level_num) = $level =~ /(\d+)/;
next unless $level_num >= $min_level;
push #list, [ $level, $error ];
}
For deciding which records to process you can use the "flip-flop" operator (..) along these lines.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my $level = shift || 'level1';
while (<DATA>) {
if (/^\Q$level,/ .. 0) {
print;
}
}
__DATA__
level1,er
level2,er2
level3,er3
level4,er4
The flip-flop operator returns false until its first operand is true. At that point it returns false until its second operand is true; at which point it returns false again.
I'm assuming that your file is ordered so that once you start to process it, you never want to stop. That means that the first operand to the flip-flop can be /^\Q$level,/ (match the string $level at the start of the line) and the second operand can just be zero (as we never want it to stop processing).
I'd also strongly recommend not parsing CSV records using split /,/. That may work on your current data but, in general, the fields in a CSV file are allowed to contain embedded commas which will break this approach. Instead, have a look at Text::CSV or Text::ParseWords (which is included with the standard Perl distribution).
Update: I seem to have got a couple of downvotes on this. It would be great if people would take the time to explain why.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
my #XYZ;
my $file = 'xyz.csv';
open my $fh, '<', $file or die "$file: $!\n";
my $level = shift; # get level from commandline
my $getall = not defined $level; # true if level not given on commandline
my $parser = Text::CSV->new({ binary => 1 }); # object for parsing lines of CSV
while (my $row = $parser->getline($fh)) # $row is an array reference containing cells from a line of CSV
{
if ($getall # if level was not given on commandline, then put all rows into #XYZ
or # if level *was* given on commandline, then...
$row->[0] eq $level .. 0 # ...wait until the first cell in a row equals $level, then put that row and all subsequent rows into #XYZ
)
{
push #XYZ, $row;
}
}
close $fh;
#!/usr/bin/perl
use strict;
use warnings;
open(my $data, '<', $file) or die "Could not open '$file' $!\n";
my $level = shift ||"level1";
while (my $line = <$data>) {
chomp $line;
my #fields = split "," , $line;
if($fields[0] eq $level .. 0){
print "\n$fields[0]\n";
print "$fields[1]\n";
}}
This worked....thanks ALL for your help...

Why is first value of captured expression getting stored in fourth element in Perl?

I am storing information captured by regex into an array. But for some reason the first value is getting stored at 4 element of array. Any suggestion on whats going wrong and how to store the first value in the first element of array.
The following is the script:
#!/usr/bin/perl
use strict;
my #value;
my $find= qr/^\s+([0-9]+)\s+([A-Z])/;
open (FILE, "</usr/test")|| die "cant open file";
my #body=<FILE>;
foreach my $line (#body){
chomp $line;
push #value, join('', $line =~ /$find/);
}
print "$value[0]\n"; #does not print anything
print "$value[4]\n"; #prints first value i.e 1389E
exit;
DATA
1389 E not
188 S yes
24 D yes
456 K not
2 Q yes
Your second line has more than one space between the number group and the letter, so you probably want \s+ both times rather than \s the second time.
You won't necessarily know how many items you have in the #value array at the end, so you might want to put the printing into a for loop rather than assume you have a fifth item. (Maybe you know you want the first and fifth, however?) Follow-up: based on your edit, you have more than two entries after all. The version that I give below, using split and \s+ captures the number and letter for all the lines. I'll adjust the print part of the script to show you what I mean.
A few other things:
You should always enable warnings.
There's no reason to read the whole file into an array and then process through it line by line. Skip the #body array and just do what you need to in the while loop.
Use the more modern form of open with lexical filehandles and three arguments.
split seems more straightforward here to me, rather than a regular expression with captures. Since you want to capture two specific parts of the line, you can use split with an array slice to grab those two items and feed them to join.
#value is not an especially helpful variable name, but I think you should at least make it plural. It's a good habit to get into, I think, insofar as the array stores your plural records. (That's not a hard and fast rule, but it bugged me here. This point is pretty minor.)
Here's how all this might look:
#!/usr/bin/env perl
use warnings;
use strict;
my #values;
# open my $filehandle, '<', '/usr/test'
# or die "Can't open /usr/test: $!";
while (my $line = <DATA>) {
chomp $line;
push #values, join('', (split /\s+/, $line)[1..2]);
}
for my $record (#values) {
print $record, "\n";
}
__DATA__
1389 E not
188 S yes
24 D yes
456 K not
2 Q yes
I think you should be writing
print "$value[0]\n";
print "$value[4]\n";
to access elements of an array.
You should use lexical file handles and the three argument form of open as well as avoiding slurping files unnecessarily.
In any case, the most likely reason for your problem is a single character missing from your pattern. Compare the one below with the one you have above.
#!/usr/bin/perl
use strict;
use warnings;
my #value;
my $find= qr/^\s+([0-9]+)\s+([A-Z])/;
while ( my $line = <DATA> ) {
last unless $line =~ /\S/;
push #value, join '', $line =~ $find;
}
use Data::Dumper;
print Dumper \#value;
__DATA__
1389 E not
188 S yes
24 D yes
456 K not
2 Q yes
Do you have leading whitespace lines, or other leading lines in your data that don't match your regexp? Since you're unconditionally push()-ing onto your output array, regardless of whether your regexp matches, you'll get blank array elements for every non-matching line in your input.
Observe:
#!/usr/bin/perl
use strict;
use warnings;
my #lines;
while (<DATA>) {
push #lines , ( join( '' , /^\s*(\d+)/ ));
}
foreach ( 0 .. $#lines ) {
print "$_ -> $lines[$_]\n";
}
__DATA__
FOO
Bar
Baz
1234
456
bargle
Output:
0 ->
1 ->
2 ->
3 -> 1234
4 -> 456
5 ->

With Perl, how do I read records from a file with two possible record separators?

Here is what I am trying to do:
I want to read a text file into an array of strings. I want the string to terminate when the file reads in a certain character (mainly ; or |).
For example, the following text
Would you; please
hand me| my coat?
would be put away like this:
$string[0] = 'Would you;';
$string[1] = ' please hand me|';
$string[2] = ' my coat?';
Could I get some help on something like this?
This will do it. The trick to using split while preserving the token you're splitting on is to use a zero-width lookback match: split(/(?<=[;|])/, ...).
Note: mctylr's answer (currently the top rated) isn't actually correct -- it will split fields on newlines, b/c it only works on a single line of the file at a time.
gbacon's answer using the input record separator ($/) is quite clever--it's both space and time efficient--but I don't think I'd want to see it in production code. Putting one split token in the record separator and the other in the split strikes me as a little too unobvious (you have to fight that with Perl ...) which will make it hard to maintain. I'm also not sure why he's deleting multiple newlines (which I don't think you asked for?) and why he's doing that only for the end of '|'-terminated records.
# open file for reading, die with error message if it fails
open(my $fh, '<', 'data.txt') || die $!;
# set file reading to slurp (whole file) mode (note that this affects all
# file reads in this block)
local $/ = undef;
my $string = <$fh>;
# convert all newlines into spaces, not specified but as per example output
$string =~ s/\n/ /g;
# split string on ; or |, using a zero-width lookback match (?<=) to preserve char
my (#strings) = split(/(?<=[;|])/, $string);
One way is to inject another character, like \n, whenever your special character is found, then split on the \n:
use warnings;
use strict;
use Data::Dumper;
while (<DATA>) {
chomp;
s/([;|])/$1\n/g;
my #string = split /\n/;
print Dumper(\#string);
}
__DATA__
Would you; please hand me| my coat?
Prints out:
$VAR1 = [
'Would you;',
' please hand me|',
' my coat?'
];
UPDATE: The original question posed by James showed the input text on a single line, as shown in __DATA__ above. Because the question was poorly formatted, others edited the question, breaking the 1 line into 2. Only James knows whether 1 or 2 lines was intended.
I prefer #toolic's answer because it deals with multiple separators very easily.
However, if you wanted to overly complicate things, you could always try:
#!/usr/bin/perl
use strict; use warnings;
my #contents = ('');
while ( my $line = <DATA> ) {
last unless $line =~ /\S/;
$line =~ s{$/}{ };
if ( $line =~ /^([^|;]+[|;])(.+)$/ ) {
$contents[-1] .= $1;
push #contents, $2;
}
else {
$contents[-1] .= $1;
}
}
print "[$_]\n" for #contents;
__DATA__
Would you; please
hand me| my coat?
Something along the lines of
$text = <INPUTFILE>;
#string = split(/[;!]/, $text);
should do the trick more or less.
Edit: I've changed "/;!/" to "/[;!]/".
Let Perl do half the work for you by setting $/ (the input record separator) to vertical bar, and then extract semicolon-separated fields:
#!/usr/bin/perl
use warnings;
use strict;
my #string;
*ARGV = *DATA;
$/ = "|";
while (<>) {
s/\n+$//;
s/\n/ /g;
push #string => $1 while s/^(.*;)//;
push #string => $_;
}
for (my $i = 0; $i < #string; ++$i) {
print "\$string[$i] = '$string[$i]';\n";
}
__DATA__
Would you; please
hand me| my coat?
Output:
$string[0] = 'Would you;';
$string[1] = ' please hand me|';
$string[2] = ' my coat?';