Perl - Hyphen and Minus - perl

I have a method where I split terms on whitespace. I want to remove the minus sign when it stands alone, like this:
$word =~ s/^\-$//;
The problem is that I cannot visually tell the difference between a minus and a hyphen (used to join two words, for example). How can I be sure that I'm only removing the minus sign?

In the ASCII printable character set, the hyphen and minus are the same symbol (ASCII 45), so when you're just scanning printable ASCII text data, whether you remove it or not really depends on the context. Also, hyphenated words shouldn't contain whitespace, and when used to set apart a phrase -- like this -- you'll usually find two consecutive dashes. So if you're finding the symbol on its own, there's something unusual going on in the file.
The en dash and em dash are not ASCII characters. In Windows-1252 text they are the bytes \226 and \227 respectively (octal; 0x96 and 0x97 in hex), and in Unicode they are U+2013 and U+2014.
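If the text has been decoded to Unicode characters, the dash-like characters have distinct code points, so a lone token can be classified by direct comparison. The following is only a sketch under that assumption; classify_dash and %dash_name are hypothetical names, not part of the original answer.

```perl
use strict;
use warnings;

# Hypothetical lookup table: dash-like characters and their Unicode names.
# Assumes the token holds decoded characters, not raw bytes.
my %dash_name = (
    "\x{002D}" => 'HYPHEN-MINUS',
    "\x{2010}" => 'HYPHEN',
    "\x{2013}" => 'EN DASH',
    "\x{2014}" => 'EM DASH',
    "\x{2212}" => 'MINUS SIGN',
);

# Return the name of the dash if the token is exactly one dash-like character.
sub classify_dash {
    my ($token) = @_;
    return $dash_name{$token} // 'not a lone dash';
}

print classify_dash('-'), "\n";          # HYPHEN-MINUS
print classify_dash("\x{2212}"), "\n";   # MINUS SIGN
```

In plain ASCII input only the first entry can ever match, which is why context is the only guide there.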

Try:
#!/usr/bin/env perl
use strict;
use warnings;
while( <DATA> ){
    if( m/(?<=[[:alpha:]])\-(?=[[:alpha:]])/ ){
        print "hyphen: $_";
    }elsif( m/\-/ ){
        print "minus: $_";
    }else{
        print "other: $_";
    }
}
__DATA__
this has hyphenated-words.
this is a negative number: -2
some confusing-2 things
-to test it
title -- one-line description
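For the per-token case described in the question, a simpler approach may be enough: after splitting on whitespace, drop only the tokens that consist of a single "-". This is a sketch, and drop_lone_dashes is a hypothetical helper name.

```perl
use strict;
use warnings;

# Split on whitespace and discard tokens that are exactly one hyphen-minus.
# Hyphenated words, negative numbers and "--" are left untouched.
sub drop_lone_dashes {
    my ($text) = @_;
    return grep { $_ ne '-' } split ' ', $text;
}

print join('|', drop_lone_dashes('a - b well-known -2 --')), "\n";
# a|b|well-known|-2|--
```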

When coding, use a suitable editor. There are many of them; search online or ask fellow developers. Here's a selection of notepads:
Notepad++
Programmer's Notepad
Notepad2
These editors won't sell you a hyphen for a minus when you clearly hit the minus key on the keyboard. So in about eleven years of programming, I've never faced this problem thanks to using appropriate editing software for coding.

Related

perl specify unicode character by name without putting name in all caps

So, this is sort of a cosmetic point, but is there an easy way to insert a unicode character by its name inside a Perl string and give the name "normal" casing?
Perl includes unicode literals that look up code points by name, as in the following:
"\N{GREEK SMALL LETTER ALPHA}"
I find something like the following easier to read:
"\N{Greek Small Letter Alpha}",
As far as I know there are no case minimal pairs when it comes to unicode character names. Is there a concise way to name the character that still triggers a compilation error very early in the process of executing a script if the character doesn't exist?
Here is an example compilation error with an intentionally misspelled character name; this is the kind of check I don't want to give up:
$ echo '%[a]' | ./unicodify
Unknown charname 'GREK SMALL LETTER ALPHA' at ./unicodify line 10, within string
Execution of ./unicodify aborted due to compilation errors.
I'm trying to write a small utility to make it easier to enter unicode characters in text files by mnemonic names delimited by %[ and ].
Here's an extremely stripped down example that just replaces %[a] and %[b].
#! /usr/bin/env perl
use strict;
use warnings;
use utf8;
use open ':std' => ':utf8';
my %abbrevs = (
    'a' => "\N{GREEK SMALL LETTER ALPHA}",
    'b' => "\N{GREEK SMALL LETTER BETA}",
);
while (<>) {
    chomp;
    my $line = $_;
    $line =~ s/(\%\[(.*?)\])/$abbrevs{$2}/g;
    print "${line}\n";
}
Quoting the charnames documentation:
Starting in Perl v5.16, any occurrence of \N{CHARNAME} sequences in a double-quotish string automatically loads this module with arguments :full and :short (described below) if it hasn't already been loaded with different arguments
One of those "different arguments" requests the use of loose matching.
$ perl -CSD -e'
use charnames ":loose";
CORE::say "\N{Greek Small Letter Alpha}";
'
α
LOOSE MATCHES
By specifying :loose, Unicode's loose character name matching rules are selected instead of the strict exact match used otherwise. That means that CHARNAME doesn't have to be so precisely specified. Upper/lower case doesn't matter (except with scripts as mentioned above), nor do any underscores, and the only hyphens that matter are those at the beginning or end of a word in the name (with one exception: the hyphen in U+1180 HANGUL JUNGSEONG O-E does matter). Also, blanks not adjacent to hyphens don't matter. The official Unicode names are quite variable as to where they use hyphens versus spaces to separate word-like units, and this option allows you to not have to care as much. The reason non-medial hyphens matter is because of cases like U+0F60 TIBETAN LETTER -A versus U+0F68 TIBETAN LETTER A. The hyphen here is significant, as is the space before it, and so both must be included.
:loose slows down look-ups by a factor of 2 to 3 versus :full, but the trade-off may be worth it to you. Each individual look-up takes very little time, and the results are cached, so the speed difference would become a factor only in programs that do look-ups of many different spellings, and probably only when those look-ups are through vianame() and string_vianame(), since \N{...} look-ups are done at compile time.
The module also provides the means for creating custom aliases.
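As a rough illustration of the alias route, the sketch below defines two short aliases with the :alias option of charnames and uses them in \N{...}. The alias names alpha and beta and the helper expand_abbrevs are made up for this example; a misspelled alias should still be reported as an unknown charname at compile time.

```perl
use strict;
use warnings;

# Custom aliases (hypothetical names) mapped to official Unicode names.
use charnames ':full', ':alias' => {
    alpha => 'GREEK SMALL LETTER ALPHA',
    beta  => 'GREEK SMALL LETTER BETA',
};

my %abbrevs = (
    a => "\N{alpha}",
    b => "\N{beta}",
);

# Replace %[x] with the mapped character; unknown mnemonics are left as-is.
sub expand_abbrevs {
    my ($line) = @_;
    $line =~ s{%\[(.*?)\]}{ exists $abbrevs{$1} ? $abbrevs{$1} : '%[' . $1 . ']' }ge;
    return $line;
}
```

Leaving unknown mnemonics in place is a design choice for the sketch; the original script would substitute an undefined value instead.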

How do I determine what a tab character is when parsing a file?

I am opening a file (in perl) and I was wondering how do I determine what a tab character looks like.
I know they are in my file, but I was wondering how I can tell what it is. I know that for output to a file you would use \t, but it's not the same for reading a file.
I also know that it reads it as some sort of TAB character because I printed out a line char by char on every line and could easily see the TABed lines.
The tab character is always \t; there is nothing more to say about it.
However, editors differ in how many spaces a single tab character should represent. Common wisdom says 8, but people often mean 4, and I have seen it mean 3 or even 2 spaces.
Some editors (like Komodo or Komodo Edit) try to be smart: they read the source file and look at the typical distribution of leading spaces and tabs. For example, if only 4, 8, 12, ... leading spaces are seen, the editor may implicitly assume that a tab should mean 4 spaces. Or, if 2, 4, 6, ... leading spaces are observed, it may use 2 spaces per tab.
If I understood you correctly, you want similar behavior for leading spaces.
In that case, you can determine the most likely tab-to-space value using the code below. Note that this code is not optimal: it ignores lines with actual tabs, it only considers the first indentation level to get the tab indent, and so on. Consider it only a starting point for a good implementation:
my %dist;
while (my $line = <>) {
    my ($spaces) = ($line =~ /(^ *)/);
    my $len = length($spaces);
    $dist{$len}++;
}
my @sp = sort {$a <=> $b} keys %dist;
print "Leading space distribution in file: "
    . join(",", @sp) . "\n";
if (scalar @sp >= 2) {
    print "Most likely tab setting is: ", $sp[1] - $sp[0], "\n";
}
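One possible refinement of this idea is to take the greatest common divisor of all observed leading-space counts instead of the difference between the two smallest. The sketch below assumes space-indented lines only; gcd and guess_indent_width are hypothetical helper names.

```perl
use strict;
use warnings;

# Greatest common divisor (Euclid); gcd(0, n) is n.
sub gcd {
    my ($x, $y) = @_;
    ($x, $y) = ($y, $x % $y) while $y;
    return $x;
}

# Guess the indent width from lines that start with spaces followed by text.
# Blank, unindented and tab-indented lines are skipped. Returns 0 if no
# space-indented line is found.
sub guess_indent_width {
    my (@lines) = @_;
    my $width = 0;
    for my $line (@lines) {
        next unless $line =~ /^( +)\S/;
        $width = gcd($width, length $1);
    }
    return $width;
}
```

A single oddly aligned continuation line will pull the result down to 1, so a real implementation would probably want to ignore rare indent widths.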
It's common for some IDEs and editors to insert four spaces instead of a tab character if you hit the tab key. The actual tab character is \t in perl (the contents depend on the platform, but the \t should always represent the tab character for your platform)
To make sure you catch both the tab character, and any groups of 4 spaces, you could regex for /\t| {4}/
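As a sketch of how that pattern might be used, the helper below counts indentation levels at the start of a line, treating one tab or one group of four spaces as a level. The name indent_level and the four-space assumption are illustrative only.

```perl
use strict;
use warnings;

# Count leading indentation units, where a unit is a tab or four spaces.
sub indent_level {
    my ($line) = @_;
    my ($indent) = $line =~ /^((?:\t| {4})*)/;   # leading run of units
    my $level = () = $indent =~ /\t| {4}/g;      # number of units in the run
    return $level;
}

print indent_level("\t" . (' ' x 4) . "foo"), "\n";   # 2
```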

Best way to print fixed columns table in Perl (using underscores instead of spaces)

I need to format database records into a table that a web forum can display properly (using bbcode). The forum in question does not respect spaces no matter which type of formatting tag I use but does have a monospace font, so I need to replace all spaces by underscores like this to keep everything aligned:
Field____Field____Field
Value____Value____Value
Value____Value____Value
Value____Value____Value
Value____Value____Value
I've looked into Perl formats and printf, but I can't figure out how to turn the spaces and tabs into underscores using these methods. The text also has variable length, so I need the column widths to be variable as well (I can't hardcode fixed values).
Any help would be appreciated. Thanks!
It's a bit of a hack, but I would use sprintf after first replacing the spaces inside my values with another character that cannot occur in those values (like ~). This can be done with a simple regex.
After the sprintf, I would replace the remaining spaces with underscores and then turn my special character in the values back into a space.
You don't need anything advanced, you just need to replace the spaces with underscore:
my $str = "Field Field Field";
$str =~ tr/ /_/;
print $str;
In case the values in your fields may contain tabs (or other space-like characters) you may want to do the following:
my $str = "Field Field\tContinued Field";
$str =~ s/\s/_/g;
print $str;
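Since the question also asks for variable column widths, here is one way the pieces could be combined: measure the widest value in each column, then pad with underscores instead of spaces. This is only a sketch; underscore_table and its $gap parameter are hypothetical, and spaces inside values are also turned into underscores.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Build underscore-padded rows. $gap is the number of extra underscores
# between columns; each row is an array reference of cell values.
sub underscore_table {
    my ($gap, @rows) = @_;

    # Widest cell in each column.
    my @width;
    for my $row (@rows) {
        for my $i (0 .. $#$row) {
            $width[$i] = max($width[$i] // 0, length $row->[$i]);
        }
    }

    my @out;
    for my $row (@rows) {
        my $line = '';
        for my $i (0 .. $#$row) {
            (my $cell = $row->[$i]) =~ tr/ /_/;
            $line .= $cell;
            # Pad every column except the last one.
            $line .= '_' x ($width[$i] - length($cell) + $gap) if $i < $#$row;
        }
        push @out, $line;
    }
    return @out;
}

print "$_\n" for underscore_table(4, [qw(Field Field Field)], [qw(Value Value Value)]);
# Field____Field____Field
# Value____Value____Value
```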

How to reformat a source file to go from 2 space indentations to 3?

This question is nearly identical to this question except that I have to go to three spaces (company coding guidelines) rather than four and the accepted solution will only double the matched pattern. Here was my first attempt:
:%s/^\(\s\s\)\+/\1 /gc
But this does not work because four spaces get replaced by three. So I think that what I need is some way to get the count of how many times the pattern matched "+" and use that number to create the other side of the substitution but I feel this functionality is probably not available in Vim's regex (Let me know if you think it might be possible).
I also tried doing the substitution manually by replacing the largest indents first and then the next smaller indent until I got it all converted but this was hard to keep track of the spaces:
:%s/^ \(\S\)/ \1/gc
I could send it through Perl as it seems like Perl might have the ability to do it with its Extended Patterns. But I could not get it to work with my version of Perl. Here was my attempt with trying to count a's:
:%!perl -pe 'm<(?{ $cnt = 0 })(a(?{ local $cnt = $cnt + 1; }))*aaaa(?{ $res = $cnt })>x; print $res'
My last resort will be to write a Perl script to do the conversion but I was hoping for a more general solution in Vim so that I could reuse the idea to solve other issues in the future.
Let vim do it for you?
:set sw=3<CR>
gg=G
The first command sets the shiftwidth option, which is how much you indent by. The second line says: go to the top of the file (gg), and reindent (=) until the end of the file (G).
Of course, this depends on vim having a good formatter for the language you're using. Something might get messed up if not.
Regexp way... Safer, but less understandable:
:%s#^\(\s\s\)\+#\=repeat(' ',strlen(submatch(0))*3/2)#g
(I had to do some experimentation.)
Two points:
If the replacement starts with \=, it is evaluated as an expression.
You can use many things instead of /, so / is available for division.
The perl version you asked for...
From the command line (edits in-place, no backup):
bash$ perl -pi -e 's{^((?:  )+)}{"   " x (length($1)/2)}e' YOUR_FILE
(in-place, original backed up to "YOUR_FILE.bak"):
bash$ perl -pi.bak -e 's{^((?:  )+)}{"   " x (length($1)/2)}e' YOUR_FILE
From vim while editing YOUR_FILE:
:%!perl -pe 's{^((?:  )+)}{"   " x (length($1)/2)}e'
The regex matches the beginning of the line, followed by (the captured set of) one or more "two space" groups. The substitution pattern is a perl expression (hence the 'e' modifier) which counts the number of "two space" groups that were captured and creates a string of that same number of "three space" groups. If an "extra" space was present in the original it is preserved after the substitution. So if you had three spaces before, you'll have four after, five before will turn into seven after, etc.
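The same substitution can be written as a small function with the two widths as parameters, which may be easier to reuse and test than a one-liner. This is a sketch; reindent is a hypothetical name, and it assumes indentation made of spaces only.

```perl
use strict;
use warnings;

# Convert leading indentation from groups of $from spaces to groups of
# $to spaces. Leftover spaces that do not fill a whole group are kept.
sub reindent {
    my ($line, $from, $to) = @_;
    my $unit = ' ' x $from;
    $line =~ s{^((?:$unit)+)}{ ' ' x (length($1) / $from * $to) }e;
    return $line;
}

print length(reindent((' ' x 4) . 'x', 2, 3)) - 1, "\n";   # 6
```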

Why does my Perl tr/// remove newlines?

I'm trying to clean up form input using the following Perl transliteration:
sub ValidateInput {
    my $input = shift;
    $input =~ tr/a-zA-Z0-9_#.:;',#$%&()\/\\{}[]?! -//cd;
    return $input;
}
The problem is that this transliteration is removing embedded newline characters that users may enter into a textarea field which I want to keep as part of the string. Any ideas on how I can update this to stop it from removing embedded newline characters? Thanks in advance for your help!
I'm not sure what you are doing, but I suspect you are trying to keep all the characters between the space and the tilde in the ASCII table, along with some of the whitespace characters. I think most of your list condenses to a single range \x20-\x7e:
$string =~ tr/\x0a\x0d\x20-\x7e//cd;
If you want to knock out a character like " (although I suspect you really want it since you allow the single quote), just adjust your range:
$string =~ tr/\x0a\x0d\x20-\x21\x23-\x7e//cd;
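Wrapped in a function, the range-based filter might look like the sketch below. It keeps line feed, carriage return and printable ASCII (space through tilde) and deletes everything else; clean_input is a hypothetical name.

```perl
use strict;
use warnings;

# Delete every character that is not LF, CR or printable ASCII.
# /c complements the list, /d deletes the characters found.
sub clean_input {
    my ($input) = @_;
    $input =~ tr/\x0a\x0d\x20-\x7e//cd;
    return $input;
}

print clean_input("line one\r\nline\x00 two\n");
# line one
# line two
```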
That's a bit of a byzantine way of doing it! If you add \012 it should keep the newlines.
$input =~ tr/a-zA-Z0-9_#.:;',#$%&()\/\{}[]?! \012-//cd;
See Form content types.
application/x-www-form-urlencoded: Line breaks are represented as "CR LF" pairs (i.e., %0D%0A).
...
multipart/form-data: As with all MIME transmissions, "CR LF" (i.e., %0D%0A) is used to separate lines of data.
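If the stored text uses bare line feeds while the form submits CR LF pairs, one option is to normalize line endings before comparing or validating. That mismatch is an assumption about this case, and normalize_newlines is a hypothetical helper.

```perl
use strict;
use warnings;

# Turn CR LF pairs and bare CRs into a single LF.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n?/\n/g;
    return $text;
}
```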
I do not know what you have in the database. Now you know what your script sees.
You are using CGI.pm, right?
Thanks for the help guys! Ultimately I decided to process all the data in our database to remove the character that was causing the issue so that any text that was submitted via our update form (and not changed by the user) would match what was in the database. Per your suggestions I also added a few additional allowed characters to the validation regex.