I want to split a multi-sentence paragraph into its constituent sentences whilst retaining the split characters, i.e. the '. ? !'. The code I'm using is:
my @Sentence = split(/[\.\?\!]/,$Paragraph);
Is there any way that I can save those sentence terminators?
Yes. If you put capturing parentheses around the delimiter pattern, the matched delimiters are included in the result list.
my @Sentence = split /([\.\?\!])/, $Paragraph;
For example, the string foo.bar.baz previously gave qw(foo bar baz); with the parentheses it gives qw(foo . bar . baz).
If you want to keep the delimiters attached to their sentences, you can use a lookbehind assertion:
my @Sentence = split /(?<=[\.\?\!])/, $Paragraph;
# result: qw(foo. bar. baz)
If you want to strip unnecessary spaces after the match, you could use /(?<=[\.\?\!]) */.
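As a minimal sketch of the two variants discussed above, the following wraps each pattern in a small helper so the results can be compared side by side. The sub names and the sample sentence are made up for illustration; only the split patterns come from the answer (written here without the backslashes, which are not needed inside a character class).

```perl
use strict;
use warnings;

# Hypothetical helper: capturing group, so each terminator
# comes back as its own list element.
sub split_keep_separate {
    my ($paragraph) = @_;
    return split /([.?!])/, $paragraph;
}

# Hypothetical helper: lookbehind, so the split happens after a
# terminator and any spaces that follow it are discarded.
sub split_keep_attached {
    my ($paragraph) = @_;
    return split /(?<=[.?!]) */, $paragraph;
}

my $text = "Is it late? Yes. Go home!";
print join("|", split_keep_separate($text)), "\n";
print join("|", split_keep_attached($text)), "\n";
```

The first call returns the terminators as separate elements (with the leading spaces still on the following sentences), while the second returns whole sentences with their terminators attached.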
I have this sample string, containing 2 backslashes. Please don't ask me for the source of the string; it is just a sample string.
my $string = "use Ppppp\\Ppppp;";
print $string;
Both double quotes and single quotes will print
use Ppppp\Ppppp;
Using
my $string = "\Quse Ppppp\\Ppppp;\E";
print $string;
will print
use\ Ppppp\\Ppppp\;
adding those extra backslashes to the output.
Is there a simple solution in Perl to display the string "literally", without modifying it, e.g. without adding extra backslashes as escapes?
I have this sample string, containing 2 backslashes. ...
my $string = "use Ppppp\\Ppppp;";
Sorry, but you're mistaken - that string only contains one backslash*, as \\ is an escape sequence in double-quoted (and single-quoted) strings that produces a single backslash. See also "Quote and Quote-like Operators" in perlop. If your string really does contain two backslashes, then you need to write "use Ppppp\\\\Ppppp;", or use a heredoc, as in:
chomp( my $string = <<'ENDSTR' );
use Ppppp\\Ppppp;
ENDSTR
If you want the string output as valid Perl source code (using its escaping), then you can use one of several options:
my $string = "use Ppppp\\Ppppp;";
# option 1
use Data::Dumper;
$Data::Dumper::Useqq=1;
$Data::Dumper::Terse=1;
print Dumper($string);
# option 2
use Data::Dump;
dd $string;
# option 3
use B;
print B::perlstring($string);
Each one of these will print "use Ppppp\\Ppppp;". (There are of course other modules available too. Personally I like Data::Dump. Data::Dumper is a core module.)
Using one of these modules is also the best way to verify what your $string variable really contains.
If that still doesn't fit your needs: A previous edit of your question said "How can I escape correctly all special characters including backslash?" - you'd have to specify a full list of which characters you consider special. You could do something like this, for example:
use 5.014; # for s///r
my $string = "use Ppppp\\Ppppp;";
print $string=~s/(?=[\\])/\\/gr;
That'll print $string with backslashes doubled, without modifying $string. You can also add more characters to the regex character class to add backslashes in front of those characters as well.
* Update: So I don't sound too pedantic here: of course the Perl source code contains two backslashes. But there is a difference between the literal source code and what the Perl string ends up containing, the same way that the string "Foo\nBar" contains a newline character instead of the two literal characters \ and n.
For the sake of completeness, as already discussed in the comments: \Q\E (aka quotemeta) is primarily meant for escaping characters that may be special in regular expressions (all ASCII characters not matching /[A-Za-z_0-9]/), which is why it also escapes the spaces and the semicolon.
Since you mention external files: If you are reading a line such as use Ppppp\\Ppppp; from an external file, then the Perl string will contain two backslashes, and if you print it, it will also show two backslashes. But if you wanted to represent that string as Perl source code, you have to write "use Ppppp\\\\Ppppp;" (or use one of the other methods from the question you linked to).
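To check the difference between the source text and the resulting string without any extra modules, a small sketch like the following counts the backslashes each string actually holds. The helper name count_backslashes and the variable names are hypothetical; the three string literals are the forms discussed above.

```perl
use strict;
use warnings;

# Hypothetical helper: count the backslash characters a string really holds.
sub count_backslashes {
    my ($s) = @_;
    my $n = () = $s =~ /\\/g;   # list assignment in scalar context yields the match count
    return $n;
}

my $one = "use Ppppp\\Ppppp;";     # two in the source, one in the string
my $two = "use Ppppp\\\\Ppppp;";   # four in the source, two in the string
chomp( my $here = <<'ENDSTR' );    # single-quoted heredoc: text is taken literally
use Ppppp\\Ppppp;
ENDSTR

printf "%d %d %d\n",
    count_backslashes($one),
    count_backslashes($two),
    count_backslashes($here);
```

This should print 1 2 2, matching the explanation that the double-quoted "\\" collapses to a single backslash while the heredoc keeps both.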
What does this line do in Perl?
s/\s//g;
I'm looking at a script that is used to search and count certain characters in an input file and I understand everything in the code except for this line. I was wondering what this line did for the script?
s/\s//g;
is short for
$_ =~ s/\s//g;
It is a substitution operator bound to $_. It replaces every match of the regex pattern \s in $_ with nothing, i.e. it deletes it. (Without g, it would only replace the first match.)
\s matches a single whitespace character.
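A short sketch of that behaviour, using hypothetical helper names and a made-up sample string: one sub applies the substitution with the g flag, the other without it.

```perl
use strict;
use warnings;

# Hypothetical helper: remove every whitespace character.
sub strip_all_whitespace {
    my ($text) = @_;
    for ($text) {      # alias $_ to our copy of the text
        s/\s//g;       # same as $_ =~ s/\s//g
    }
    return $text;
}

# Hypothetical helper: without g, only the first match is removed.
sub strip_first_whitespace {
    my ($text) = @_;
    for ($text) {
        s/\s//;
    }
    return $text;
}

print strip_all_whitespace("a b\tc\nd"), "\n";
print strip_first_whitespace("a b c"), "\n";
```

The first call removes the space, the tab and the newline alike, since \s covers all of them; the second only removes the first space.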
I have a line which can be a single word or a sentence. How can I check whether it is a single word or a sentence?
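Your input is in $line.
Check it like this (chomp returns the number of characters removed, not the string, so it must be called separately):
chomp($line);
if ($line =~ /^\w+$/) {
    # only a word
} else {
    # it contains multiple words (or non-word characters)
}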
Couldn't you just check for spaces in the input line? If it contains a space, it's safe to say it's a sentence. Then add some safety checks so it doesn't count input like " word", "word ", etc. :)
Do a split(" ") and store the result in an array. If the array has more than one element, the line is obviously not a single word.
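The split-based suggestion could be sketched as follows. The sub name is_single_word is hypothetical, and treating an empty line as "not a word" is an assumption of this sketch rather than something the answers specify.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the line holds exactly one
# whitespace-separated token.
sub is_single_word {
    my ($line) = @_;
    chomp $line;
    # The special pattern ' ' splits on runs of whitespace and
    # ignores leading whitespace, so " word" and "word " still count as one word.
    my @words = split ' ', $line;
    return @words == 1;
}

for my $input ("word\n", " word ", "two words\n", "\n") {
    (my $shown = $input) =~ s/\n/\\n/g;   # make newlines visible in the output
    printf "%-12s => %s\n", "'$shown'", is_single_word($input) ? "word" : "not a single word";
}
```

Because split ' ' skips leading whitespace and drops trailing empty fields, no separate trimming step is needed for inputs with stray spaces.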
I am opening a file (in perl) and I was wondering how do I determine what a tab character looks like.
I know they are in my file, but I was wondering how I can tell what it is. I know that for output to a file you would use \t, but it's not the same for reading a file.
I also know that it reads it as some sort of TAB character because I printed out a line char by char on every line and could easily see the TABed lines.
The tab character is always \t; there is nothing more to say about it.
However, editors differ in how many spaces a single tab character should represent. Common wisdom says 8, but people often mean 4, and I have seen it mean 3 or even 2 spaces.
Some editors (like Komodo or Komodo Edit) try to be smart: they read the source file and look at the typical distribution of leading spaces and tabs. For example, if only 4, 8, 12, ... leading spaces are seen, the editor may implicitly assume that a tab should mean 4 spaces. Or, if 2, 4, 6, ... leading spaces are observed, it may use 2 spaces per tab.
If I understood you correctly, you want similar behavior for leading spaces.
In that case, you can determine the most likely tab-to-space value using the code below. Note that this code is not optimal: it ignores lines with actual tabs, it only considers the first indentation level to get the tab width, and so on. Consider it only a starting point for a good implementation:
my %dist;
while (my $line = <>) {
    # capture the run of leading spaces (possibly empty)
    my ($spaces) = ($line =~ /(^ *)/);
    my $len = length($spaces);
    $dist{$len}++;
}
# distinct indentation widths, smallest first
my @sp = sort {$a <=> $b} keys %dist;
print "Leading space distribution in file: "
    . join(",", @sp) . "\n";
if (scalar @sp >= 2) {
    print "Most likely tab setting is: ", $sp[1] - $sp[0], "\n";
}
It's common for some IDEs and editors to insert four spaces instead of a tab character if you hit the tab key. The actual tab character is \t in Perl (the contents depend on the platform, but \t should always represent the tab character for your platform).
To make sure you catch both the tab character and any groups of 4 spaces, you could use the regex /\t| {4}/.
I just encountered this code for doing tab expansion in Perl:
1 while $string =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
I tested it and it works, but I am too much of a rookie to understand it. Would anyone care to explain why it works? Any pointer to related material that could help me understand it would also be appreciated. Thanks a lot.
Perl lets you embed arbitrary code as replacement expressions in regexes.
$& is the string matched by the last pattern match—in this case, some number of tab characters.
$` is the string preceding whatever was matched by the last pattern match—this lets you know how long the previous text was, so you can align things to columns properly.
For example, running this against the string "Something\t\t\tsomething else", $& is "\t\t\t", and $` is "Something". length($&) is 3, so there are at most 24 spaces needed, but length($`)%8 is 1, so to make it align to columns every eight it adds 23 spaces.
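The e flag on the regex means to treat the replacement string (' ' x (...etc...)) as Perl code and execute it for each match. So, basically, look for any place where there are 1 or more (+) tab characters (\t), then execute the small Perl snippet to convert those tabs into spaces.
The snippet calculates how many tabs were matched, multiplies that number by 8 to get the number of spaces required, but also accounts for anything which may have come before the matched tabs.
The leading "1 while" simply repeats the substitution until it no longer matches. There is no g flag, so each pass expands only the first run of tabs; the next pass then sees the already expanded text in $`, which keeps the column arithmetic correct for later tabs.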
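To check the arithmetic from the example above, here is a small sketch that wraps the one-liner in a sub (the name expand_tabs is hypothetical) and measures the gap it produces for the sample string.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the one-liner from the question.
sub expand_tabs {
    my ($string) = @_;
    # Repeat until no tabs are left; each pass expands the first run of tabs,
    # padding to the next multiple of 8 columns.
    1 while $string =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
    return $string;
}

my $out = expand_tabs("Something\t\t\tsomething else");
my ($gap) = $out =~ /^Something( +)something else$/;
print "spaces inserted: ", length($gap), "\n";
print "text resumes at column: ", length("Something") + length($gap), "\n";
```

For this input the gap is 3 * 8 - 9 % 8 = 23 spaces, so the text after the tabs starts at column 32, a multiple of 8.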