PERL -- Regex incl all hash keys (sorted) + deleting empty fields from $_ in file read - perl

I'm working on a program and I have a couple of questions, hope you can help:
First I need to access a file and retrieve specific information according to an index that is obtained from a previous step, in which the indexes to retrieve are found and store in a hash.
I've been looking for a way to include all array elements in a regex that I can use in the file search, but I haven´t been able to make it work. Eventually i've found a way that works:
my #atoms = ();
my $natoms=0;
foreach my $atomi (keys %{$atome}){
push (#atoms,$atomi);
$natoms++;
}
#atoms = sort {$b cmp $a} #atoms;
and then I use it as a regex this way:
while (<IN_LIG>){
if (!$natoms) {last;}
......
if ($_ =~ m/^\s*$atoms[$natoms-1]\s+/){
$natoms--;
.....
}
Is there any way to create a regex expression that would include all hash keys? They are numeric and must be sorted. The keys refer to the line index in IN_LIG, whose content is something like this:
8 C5 9.9153 2.3814 -8.6988 C.ar 1 MLK -0.1500
The key is to be found in column 0 (8). I have added ^ and \s+ to make sure it refers only to the first column.
My second problem is that sometimes input files are not always identical and they make contain white spaces before the index, so when I create an array from $_ I get column0 = " " instead of column0=8
I don't understand why this "empty column" is not eliminated on the split command and I'm having some trouble to remove it. This is what I have done:
#info = split (/[\s]+/,$_);
if ($info[0] eq " ") {splice (#info, 0,1);} # also tried $info[0] =~ m/\s+/
and when I print the array #info I get this:
Array:
Array: 8
Array: C5
Array: 9.9153
Array: 2.3814
.....
How can I get rid of the empty column?
Many thanks for your help
Merche

There is a special form of split where it will remove both leading and trailing spaces. It looks like this, try it:
my $line = ' begins with spaces and ends with spaces ';
my #tokens = split ' ', $line;
# This prints |begins:with:spaces:and:ends:with:spaces|
print "|", join(':', #tokens), "|\n";
See the documentation for split at http://p3rl.org/split (or with perldoc split)
Also, the first part of your program might be simpler as:
my #atoms = sort {$b cmp $a} keys %$atome;
my $natoms = #atoms;
But, what is your ultimate goal with the atoms? If you simply want to verify that the atoms you're given are indeed in the file, then you don't need to sort them, nor to count them:
my #atoms = keys %$atome;
while (<IN_LIG>){
# The atom ID on this line
my ($atom_id) = split ' ';
# Is this atom ID in the array of atom IDs that we are looking for
if (grep { /$atom_id/ } #atoms) {
# This line of the file has an atom that was in the array: $atom_id
}
}

Lets warm up by refining and correcting some of your code:
# If these are all numbers, do a numerical sort: <=> not cmp
my #atoms = ( sort { $b <=> $a } keys %{$atome} );
my $natoms = scalar #atoms;
No need to loop through the keys, you can insert them into the array right away. You can also sort them right away, and if they are numbers, the sort must be numerical, otherwise you will get a sort like: 1, 11, 111, 2, 22, 222, ...
$natoms can be assigned directly by the count of values in #atoms.
while(<IN_LIG>) {
last unless $natoms;
my $key = (split)[0]; # split splits on whitespace and $_ by default
$natoms-- if ($key == $atoms[$natoms - 1]);
}
I'm not quite sure what you are doing here, and if it is the best way, but this code should work, whereas your regex would not. Inside a regex, [] are meta characters. Split by default splits $_ on whitespace, so you need not be explicit about that. This split will also definitely remove all whitespace. Your empty field is most likely an empty string, '', and not a space ' '.
The best way to compare two numbers is not by a regex, but with the equality operator ==.
Your empty field should be gone by splitting on whitespace. The default for split is split ' '.
Also, if you are not already doing it, you should use:
use strict;
use warnings;
It will save you a lot of headaches.

for your second question you could use this line:
#info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
in order to capture 9 items from each line (assuming they do not contain whitespace).
The first question I do not understand.
Update: I would read alle the lines of the file and use them in a hash with $info[0] as the key and [#info[1..8]] as the value. Then you can lookup the entries by your index.
my %details;
while (<IN_LIG>) {
#info = $_ =~ m{^\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)}xms;
$details{ $info[0] } = [ #info[1..$#info] ];
}
Later you can lookup details for the indices you are interested in and process as needed. This assumes the index is unique (has the property of keys).

thanks for all your replies. I tried the split form with ' ' and it saved me several lines of code. thanks!
As for the regex, I found something that could make all keys as part of the string expression with join and quotemeta, but I couldn't make it work. Nevertheless I found an alternative that works, but I liked the join/quotemeta solution better
The atom indexes are obtained from a text file according to some energy threshold. Later, in the IN_LIG loop, I need to access the molecule file to obtain more information about the atoms selected, thus I use the atom "index" in the molecule to identify which lines of the file I have to read and process. This is a subroutine to which I send a hash with the atom index and some other information.
I tried this for the regex:
my $strings = join "|" map quotemeta,
sort { $hash->{$b} <=> $hash->{$a}} keys %($hash);
but I did something wrong cos it wouldn't take all keys

Related

Sorting hashes on value length whilst preserving order

I'm currently writing a Perl script to sort lines from stdin and print the lines in order of line length whilst preserving order for the ones that are equal.
My sorting code consists of the following:
while (my $curr_line = <STDIN>) {
chomp($curr_line);
$lines{$curr_line} = length $curr_line;
}
for my $line (sort{ $lines{$a} <=> $lines{$b} } keys %lines){
print $line, "\n";
}
For example my stdin consists of the following:
tiny line
medium line
big line
huge line
rand line
megahugegigantic line
I'd get the following output:
big line
rand line
tiny line
huge line
medium line
megahugegigantic line
Is there any way I can preserve the order for lines of equal length such that tiny would come before huge which comes before rand? Also, the order seems to change everytime I run the script.
Thanks in advance
One possible solution
You can save the position of the line in the input file handle as well as the length. The $. magic variable (input line number) provides this. You can then sort on both values.
use strict;
use warnings;
my %lines;
while ( my $curr_line = <DATA> ) {
chomp($curr_line);
$lines{$curr_line} = [ length $curr_line, $. ];
}
for my $line (
sort {
$lines{$a}->[0] <=> $lines{$b}->[0]
|| $lines{$a}->[1] <=> $lines{$b}->[1]
} keys %lines
) {
print $line, "\n";
}
__DATA__
tiny lin1
medium line
big line
huge lin2
rand lin3
megahugegigantic line
This will always output
big line
tiny lin1
huge lin2
rand lin3
medium line
megahugegigantic line
You can of course use a hash to make the code more readable, too.
$lines{$curr_line} = {
length => length $curr_line,
position => $.,
};
Explanation of your implementation
Your results changed their order every time because of random hash ordering. The way keys returns the list of keys is random, because of the way Perl implements hashes. This is by design, and a security feature. Since there are several keys that have the same value, the sort will sometimes return different results, based on which of the equal value keys came out first.
You could mitigate this by sticking another sort in front of your keys call. That would sort the keys by name, at least making the order of the undesired result be consistent.
# vvvv
for my $line (sort{ $lines{$a} <=> $lines{$b} } sort keys %lines) { ... }
Note that you don't have to chomp the input if you put the \n back when you print. It's always of the same length anyway. If you do, you should print a $/, which is the input record separator that chomp removed, or you falsify your data.
Your problem is not with sort Perl uses the quick sort algorithm which is a stable sort, inputs that match the same sort key have the same order on output of the sort as input.
Your problem is that you are storing the lines in a hash. A hash is an unordered collection of key value pairs so adding the lines to the hash and then printing them out again with out the sort will give you the lines in a random order.
You need to read all the lines into an array and then sort them on length, the quickest way being to use a Schwartzian Transformation see below.
my #lines = <STDIN>;
chomp(#lines);
my #sorted = # This is the clever bit and needs to be red from the last map up
map { $_->[0] } # Get the lines
sort { $a->[1] <=> $b->[1] } # Sort on length
map { [$_, length $_] } # Create a list of array refs containing
# the line and the length of the line
#lines;
print join "\n", #sorted; # print out the sorted lines
Nowhere do you store the original order, so you can't possibly sort by it. The easiest fix is to store the lines in an array, and ensure that Perl is using a stable sort.
use sort 'stable';
my #lines = <>;
chomp(#lines);
for my $line ( sort { length($a) <=> length($b) } #lines) {
say $line;
}
[ ST is overkill for this. It' such overkill that it probably even slows things down! ]
As has been explained, the randomness comes from your use of hash keys to store the strings. There is no need for this, or anything more elaborate like a Schwartzian Transform, to make this work
All Perl versions since v5.8 have used a stable sort, which will keep values that sort equally in the same order. But you can insist that the sort operator you get is a stable one using the sort pragma with
use sort 'stable'
Here's how I would write your program. It stops reading input at end of file, or when it sees a blank line in case you want to enter the data from the keyboard
use strict;
use warnings 'all';
use feature 'say';
use sort 'stable';
my #list;
while ( <> ) {
last unless /\S/;
chomp;
push #list, $_;
}
say for sort { length $a <=> length $b } #list;
Using the same input as you use in the question, this produces
output
big line
tiny line
huge line
rand line
medium line
megahugegigantic line

Perl: Find a match, remove the same lines, and to get the last field

Being a Perl newbie, please pardon me for asking this basic question.
I have a text file #server1 that shows a bunch of sentences (white space is the field separator) on many lines in the file.
I needed to match lines with my keyword, remove the same lines, and extract only the last field, so I have tried with:
my #allmatchedlines;
open(output1, "ssh user1#server1 cat /tmp/myfile.txt |");
while(<output1>) {
chomp;
#allmatchedlines = $_ if /mysearch/;
}
close(output1);
my #uniqmatchedline = split(/ /, #allmatchedlines);
my $lastfield = $uniqmatchedline[-1]\n";
print "$lastfield\n";
and it gives me the output showing:
1
I don't know why it's giving me just "1".
Could someone please explain why I'm getting "1" and how I can get the last field of the matched line correctly?
Thank you!
my #uniqmatchedline = split(/ /, #allmatchedlines);
You're getting "1" because split takes a scalar, not an array. An array in scalar context returns the number of elements.
You need to split on each individual line. Something like this:
my #uniqmatchedline = map { split(/ /, $_) } #allmatchedlines;
There are two issues with your code:
split is expecting a scalar value (string) to split on; if you are passing an array, it will convert the array to scalar (which is just the array length)
You did not have a way to remove same lines
To address these, the following code should work (not tested as no data):
my #allmatchedlines;
open(output1, "ssh user1#server1 cat /tmp/myfile.txt |");
while(<output1>) {
chomp;
#allmatchedlines = $_ if /mysearch/;
}
close(output1);
my %existing;
my #uniqmatchedline = grep !$existing{$_}++, #allmatchedlines; #this will return the unique lines
my #lastfields = map { ((split / /, $_)[-1]) . "\n" } #uniqmatchedline ; #this maps the last field in each line into an array
print for #lastfields;
Apart from two errors in the code, I find the statement "remove the same lines and extract only the last field" unclear. Once duplicate matching lines are removed, there may still be multiple distinct sentences with the pattern.
Until a clarification comes, here is code that picks the last field from the last such sentence.
use warnings 'all';
use strict;
use List::MoreUtils qw(uniq)
my $file = '/tmp/myfile.txt';
my $cmd = "ssh user1\#server1 cat $file";
open my $fh, '-|', $cmd // die "Error opening $cmd: $!"; # /
while (<$fh>) {
chomp;
push #allmatchedlines, $_ if /mysearch/;
}
close(output1);
my #unique_matched_lines = uniq #allmatchedlines;
my $lastfield = ( split ' ', $unique_matched_lines[-1] )[-1];
print $lastfield, "\n";
I changed to the three-argument open, with error checking. Recall that open for a process involves a fork and returns pid, so an "error" doesn't at all relate to what happened with the command itself. See open. (The # / merely turns off wrong syntax highlighting.) Also note that # under "..." indicates an array and thus need be escaped.
The (default) pattern ' ' used in split splits on any amount of whitespace. The regex / / turns off this behavior and splits on a single space. You most likely want to use ' '.
For more comments please see the original post below.
The statement #allmatchedlines = $_ if /mysearch/; on every iteration assigns to the array, overwriting whatever has been in it. So you end up with only the last line that matched mysearch. You want push #allmatchedlines, $_ ... to get all those lines.
Also, as shown in the answer by Justin Schell, split needs a scalar so it is taking the length of #allmatchedlines – which is 1 as explained above. You should have
my #words_in_matched_lines = map { split } #allmatchedlines;
When all this is straightened out, you'll have words in the array #uniqmatchedline and if that is the intention then its name is misleading.
To get unique elements of the array you can use the module List::MoreUtils
use List::MoreUtils qw(uniq);
my #unique_elems = uniq #whole_array;

Finding equal lines in file with Perl

I have a CSV file which contains duplicated items in different rows.
x1,y1
x2,y2
y1,x1
x3,y3
The two rows containing x1,y1 and y1,x1 are a match as they contain the same data in a diffrent order.
I need your help to find an algorithm to search for such lines in a 12MB file.
If you can define some ordering and equality relations between fields, you could store a normalized form and test your lines for equality against that.
As an example, we will use string comparision for your fields, but after lowercasing them. We can then sort the parts according to this relation, and create a lookup table via a nested hash:
use strict; use warnings;
my $cache; # A hash of hashes. Will be autovivified later.
while (<DATA>) {
chomp;
my #fields = split;
# create the normalized representation by lowercasing and sorting the fields
my #normalized_fields = sort map lc, #fields;
# find or create the path in the lookup
my $pointer = \$cache;
$pointer = \${$pointer}->{$_} for #normalized_fields;
# if this is an unknow value, make it known, and output the line
unless (defined $$pointer) {
$$pointer = 1; # set some defined value
print "$_\n"; # emit the unique line
}
}
__DATA__
X1 y1
X2 y2
Y1 x1
X3 y3
In this example I used the scalar 1 as value of the lookup data structure, but in more complex scenarios the original fields or the line number could be stored here. For the sake of the example, I used space-seperated values here, but you could replace the split with a call to Text::CSV or something.
This hash-of-hashes approach has sublinear space complexity, and worst case linear space complexity. The lookup time only depends on the number (and size) of fields in a record, not on the total number of records.
Limitation: All records must have the same number of fields, or some shorter records could be falsely considered “seen”. To circumvent these problems, we can use more complex nodes:
my $pointer = \$cache;
$pointer = \$$pointer->[0]{$_} for #normalized_fields;
unless (defined $$pointer->[1]) {
$$pointer->[1] = 1; ...
}
or introduce a default value for nonexistant field (e.g. the seperator of the original file). Here an example with the NUL character:
my $fields = 3;
...;
die "record too long" if #fields > $fields;
...; # make normalized fields
push #normalized_fields, ("\x00") x ($fields - #normalized_fields);
...; # do the lookup
A lot depends on what you want to know about duplicate lines once they have been found. This program uses a simple hash to list the line numbers of those lines that are equivalent.
use strict;
use warnings;
my %data;
while (<DATA>) {
chomp;
my $key = join ',', sort map lc, split /,/;
push #{$data{$key}}, $.;
}
foreach my $list (values %data) {
next unless #$list > 1;
print "Lines ", join(', ', #$list), " are equivalent\n";
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3
output
Lines 1, 3 are equivalent
Make two hash tables A and B
Stream through your input one line at a time
For the first line pair x and y, use each as key and the other as value for both hash tables (e.g., $A->{x} = y; $B->{y} = x;)
For the second and subsequent line pairs, test if the second field's value exists as a key for either A or B — if it does, you have a reverse match — if not, then repeat the addition process from step 3 to add it to the hash tables
To do a version of amon's answer without a hash table, if your data are numerical, you could:
Stream through input line by line, sorting fields one and two by numerical ordering
Pipe result to UNIX sort on first and second fields
Stream through sorted output line by line, checking if current line matches the previous line (reporting a reverse match, if true)
This has the advantage of using less memory than hash tables, but may take more time to process.
amon already provided the answer I would've provided, so please enjoy this bad answer:
#! /usr/bin/perl
use common::sense;
my $re = qr/(?!)/; # always fails
while (<DATA>) {
warn "Found duplicate: $_" if $_ =~ $re;
next unless /^(.*),(.*)$/;
die "Unexpected input at line $.: $_" if "$1$2" =~ tr/,//;
$re = qr/^\Q$2,$1\E$|$re/
}
__DATA__
x1,y1
x2,y2
y1,x1
x3,y3

PERL: Sorting Letters from A to Z

I'm Trying to sort letters in a file from A to Z
for example: a A B d r g
sorted: A a B d g r
#ARGV == 2 or die "Usage: $0 infile outfile\n";
open $old, '<', $ARGV[0] or die $!;
open $new, '>', $ARGV[1] or die $!;
#mass=<$old>;
#array=qw(#mass);
#sort=sort #array;
#mass1=sort {uc $a cmp uc $b} #sort;
print $new #mass1;
Where am I going wrong?
I don't think you understand the the standard text ordering is ASCII-based. So because all uppercase proceed all lowercase, the same is true of your input. Therefore, you order for a straight sort would be ( 'A', 'B', 'a', 'd', 'g', 'r' ).
You want to double compare the two strings. In this case, you're going to need to pass a routine to sort.
#sort= sort { lc $a cmp lc $b or $a cmp $b } #array;
I'm not sure what you intended to do with qw, but
suffice it to say that the contents of #mass will be never be used.
#array = qw(hello world);
Will cause #array to be defined to contain 2 strings, hello and world. It is just shorthand for:
#array = ('hello', 'world');
Which is why
#array=qw(#mass);
Evaluates to ('#mass') - an array with the single literal string of 5 characters #mass.
Maybe that's what you're doing wrong. What if you try
#array = map { split /\s+/} #mass;
#mass is the list of lines. Each line has words or just letters, separated by space.
What that last line does is maps each line with split /\s+/ - which will split each
line like 'ba ab a G' into a list like ('ba', 'ab', 'a', 'G') and #array will
become a single list of words/letters.
Then it's a matter of how you want to sort them. See the other answer as well.
Oh, and remember to put back the spaces when you write out your file:
print $new (join " ", #mass1);
If you want each line to be sorted interdependently of the other, that's easy too:
$mass1 = join "\n", map { join " ", sort (split /\s+/) } #mass
That reads, 'for every line in #mass, split on space, sort and join back again with space', and with the resulting array, join with newline to produce the output of the file.
Note that you can drop in sort with a comparator like sort { $a cmp $b } etc.
If your file is too big, then looping is maybe prudent:
for my $mass (<$old>) {
my $sorted_line = join " ", sort (split /\s+/, $mass);
print $new "$sorted_line\n";
}
You need to find the correct LOCALE to use, so that the order used by all functions (sort, etc) are using the correct locale and sort accordingly to it.
See this page showing most of the variables defining locales, and look for LANG and LC_ALL. and LC_COLLATE (I have to admit I'm not exactly sure which is used when. LC_ALL is supposed to take precedence over the others, so it's the one you can change to have all LC_* values set... Please test, ymmv)
I believe you probably need to use one of the unicode locales. Ascii won't do what you want, as CAPS are before regular letters in ascii.
To find out which locales you can use: locale -a
To see which locales you are currently set to : locale (user and system-wide values are possible)
You probably need something containing "utf-8" to have the order you seek
Then : (if for example en_US.UTF-8 is available):
just before using it in the sort, define locales you want to sort with:
LC_ALL=en_US.UTF-8
(or whatever the value you need it to be set at, and is available as shown by "locale -a")
(save/restore their previous values around the invocation if you need to)
In shell, you probably better want to ass "export" to those variables you redefine, to ensure subshells use the new value too (like: something | sort : in bash, sort will be in a subshell, therefore using the default value of LC_*, or using the exported value if you exported it!)

How do I remove a a list of character sequences from the beginning of a string in Perl?

I have to read lines from a file and store them into a hash in Perl. Many of these lines have special character sequences at the beginning that I need to remove before storing. These character sequences are
| || ### ## ##||
For example, if it is ||https://ads, I need to get https://ads; if ###http, I need to get http.
I need to exclude these character sequences. I want to do this by having all the character sequences to exclude in a array and then check if the line starts with these character sequences and remove those. What is a good way to do this?
I've gone as far as:
our $ad_file = "C:/test/list.txt";
our %ads_list_hash = ();
my $lines = 0;
# List of lines to ignore
my #strip_characters = qw /| || ### ## ##||/;
# Create a list of substrings in the easylist.txt file
open my $ADS, '<', $ad_file or die "can't open $ad_file";
while(<$ADS>) {
chomp;
$ads_list_hash{$lines} = $_;
$lines ++;
}
close $ADS;
I need to add the logic to remove the #strip_characters from the beginning of each line if any of them are present.
Probably a bit too complex and general for the task, but still..
my $strip = join "|", map {quotemeta} #strip_characters;
# avoid bare [] etc. in the RE
# ... later, in the while()
s/^(?:$strip)+//o;
# /o means "compile $strip into the regex once and for all"
Why don't you do it with a regex? Something like
$line =~ s/^[## |]+//;
should work.
If you want to remove a list of characters (according to your title), then a very simple regular expression will work.
Within the loop, add the following regular expression
while( <$ADS> ) {
chomp;
s/^[## \|]+//;
$ads_list_hash{$lines++} = $_;
}
Note the pipe charachter ('|') is escapted.
However, it appears that you want to remove a list of expressions. You can do the following
while( <$ADS> ) {
chomp;
s/^((\|)|(\|\|)|(###)|(##)|(##\|\|))+//;
$add_list_hash{$lines++} = $_;
}
You said that the list of expression is stored in an array or words. In your sample code, you create this array with 'qw'. If the list of expressions isn't known at compile time, you can build a regular expression in a variable, and use it.
my #strip_expression = ... // get an array of strip expressions
my $re = '^((' . join(')|(',#strip_expression) . '))+';
and then, use the following statement in the loop:
s/$re//;
Finaly, one thing not related to the question can be said about the code: It would be much more appropriate to use Array instead of Hash, to map an integer to a set of strings. Unless you have some other requirement, better have:
our #ads_list; // no need to initialize the array (or the hash) with empty list
...
while( <$ADS> ) {
chomp;
s/.../;
push #ads_list, $_;
}
$ads_list_hash{$lines} = $_;
$lines ++;
Don't do that. If you want an array, use an array:
push #ads_lines, $_;
Shawn's Rule of Programming #7: When creating data structures: if preserving the order is important, use an array; otherwise use a hash.
Because substitutions return whether or not they did anything you can use a
substitution to search the string for your pattern and remove it if it's there.
while( <$ADS> ) {
next unless s/^\s*(?:[#]{2,3}|(?:##)?[|]{1,2})\s*//;
chomp;
$ads_list_hash{$lines} = $_;
$lines ++;
}