preg_replace whitespace also breaks my special characters - preg-replace

I have a string that contains a lot of whitespaces and line breaks that I would like to clean up, so I use:
$str = trim(preg_replace('/\s+/', ' ', $str));
However, when I echo out $str I notice that special characters like " à " turn into �.
When I remove the preg_replace, the � becomes " à " again, but my string is full of whitespace and line breaks.
I tried Google (ofc) but not a whole lot of people seem to experience this problem :)
My knowledge of PHP is intermediate, so I (still) kinda lack the insight of where this problem might occur :)

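I've had the same problem. Without the u modifier, preg_replace works on bytes rather than characters, and it can break a UTF-8 string that contains, among MANY others, one of the following characters (just mentioning some of the more usual cases here). In each of them the UTF-8 encoding ends with the byte 0xA0, which is the non-breaking space in ISO-8859-1, so \s can treat that byte as whitespace and replace it: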
(U+00E0) : à Latin small letter a with grave
(U+0160) : Š Latin capital letter s with caron
(U+03A0) : Π Greek capital letter pi
(U+0420) : Р Cyrillic capital letter er
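The answer is to use the UTF-8 pattern modifier (u). There's one catch: Unicode has whitespace characters that \s may not catch. So add \p{Z} to your pattern, which matches the Unicode separator characters (spaces, line separators and paragraph separators); together with \s this covers all whitespace. So use: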
$str = preg_replace( '/[\p{Z}\s]+/u', ' ', $str );

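Maybe something like this could help, as there could be a problem with the charset (note that utf8_decode converts to ISO-8859-1, so this only preserves characters that exist in that charset):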
$text = utf8_decode($text);
$text = trim(preg_replace('/\s+/', ' ', $text));
$text = utf8_encode($text);
Are you getting UTF-8 input?

Related

"\x{2019}" does not map to iso-8859-1 perl

I have a string named $title
Gardens and Anti-Gardens in Marie de France’s <i>Lais</i>
and I am getting this error
"\x{2019}" does not map to iso-8859-1
I try removing the italic tags but it still gives me the error i.e.
$title =~ s/<i>|<\/i>//g;
Thank you
Why do you think the HTML tags have anything to do with characters in the string?
If you google the \x{2019} the first hit is this.
Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019)
That's the ’. Typically Microsoft Word converts apostrophes (single quotes ') to those kinds of quotation marks. It looks like you are trying to print your string somewhere where it's converted to the ISO-8859-1 encoding. You should be able to specifically convert that character to something that makes more sense, like the above-mentioned single quote '.
$string =~ s/\x{2019}/'/g;
That should get rid of that one warning. But if you import something containing Unicode and then expect it to be output as Latin-1, more characters will fail.
The encoding ISO-8859-1 does not contain the character U+2019.
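To make the point above concrete, here is a minimal sketch. It assumes the title is already a decoded Perl character string and that the destination can accept UTF-8; the sample text and variable names are made up for illustration. Encode is a core module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: a character string containing U+2019.
my $title = "Marie de France\x{2019}s Lais";

# Strict encoding to ISO-8859-1 fails, because that charset has no U+2019.
# FB_CROAK may modify its input, so work on a copy.
my $copy = $title;
my $latin1_ok = eval { encode('ISO-8859-1', $copy, Encode::FB_CROAK()); 1 };

# UTF-8 can represent every Unicode character, so this succeeds.
my $utf8_bytes = encode('UTF-8', $title);

# Alternative: keep Latin-1 output, but substitute an ASCII apostrophe first.
(my $ascii_quote = $title) =~ s/\x{2019}/'/g;
my $latin1_bytes = encode('ISO-8859-1', $ascii_quote);

print $latin1_ok ? "Latin-1: ok\n" : "Latin-1: failed\n";
printf "UTF-8 length: %d bytes\n", length $utf8_bytes;
print "Latin-1 with plain quote: $latin1_bytes\n";
```

If the output really has to be Latin-1, the substitution route is the one to use; otherwise switching the output to UTF-8 (for example with binmode(STDOUT, ':encoding(UTF-8)')) avoids having to handle such characters one by one.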

Perl - can't remove trailing characters at the end of string

I have some trailing characters at the end of a string peregrinevwap^_^_
print "JH 4 - app: $application \n";
app: peregrinevwap^_^_
Do you know why they are there and how I can remove them? I tried the chomp command but this hasn't worked.
Check out the tr//cd operator to get rid of unwanted characters.
It's documented in "perldoc perlop"
$application =~ tr/a-zA-Z//cd;
Will remove everything except letters from the string and
$application =~ tr/^_//d;
Will remove all "^" and "_" characters.
If you only want to remove certain characters when they are at the end of the string, use the s// search/replace operator with regular expressions and the $ anchor to match the end of the string.
Here's an example:
s/[\^_]*$//;
Let's hope the underscores do not occur at the end of your strings, otherwise you can't automatically separate them from these unwanted characters.
Are you sure these characters are actually ^ and _ characters?
^_ could also indicate Ctrl-Underscore, ASCII character 0x1F (Unit Separator). (Not a character I've ever seen used, but you never know.)
If this is in fact the case, you can remove them with something like:
$application =~ s/\x1F//g;
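One way to check which of the two cases applies is to dump the character codes before deciding what to strip. The following is only a sketch: it assumes the trailing characters are the control character 0x1F, and the sample value is made up to mimic the output in the question.

```perl
use strict;
use warnings;

# Hypothetical value: the name followed by two Unit Separator characters (0x1F).
my $application = "peregrinevwap\x1F\x1F";

# Show every character's code in hex so that invisible ones become obvious.
my $dump = join ' ', map { sprintf('%02X', ord($_)) } split(//, $application);
print "$dump\n";

# Remove any run of control characters at the very end of the string.
(my $clean = $application) =~ s/[[:cntrl:]]+\z//;
print "[$clean]\n";
```

This would also explain why chomp had no effect: chomp only removes a trailing input record separator (normally a newline), not other control characters.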

How to search for a string that contains no whitespace in perl

my $string3 = "anima ls";
my $t3 = $string3 =~ /[^\s]+/;
print "$t3\n";
I wanted to write a regex that matches only strings containing no whitespace. The above code reports a match even if the string contains a space.
The regex [^\s]+ searches for at least one character that is not whitespace. It is better written as \S+, though. A regex that matches any string that does not contain a whitespace character is rather
/^\S+$/
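One detail worth knowing: in Perl, $ can also match just before a trailing newline, so /^\S+$/ accepts "animals\n". A stricter sketch uses \A and \z instead. The helper name has_no_whitespace is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if $s is non-empty and contains no whitespace.
sub has_no_whitespace {
    my ($s) = @_;
    return $s =~ /\A\S+\z/ ? 1 : 0;
}

for my $s ("animals", "anima ls", "animals\n") {
    (my $shown = $s) =~ s/\n/\\n/g;    # make the newline visible when printing
    printf "%-12s %s\n", "\"$shown\"",
        has_no_whitespace($s) ? "no whitespace" : "contains whitespace";
}
```

Whether the trailing-newline case matters depends on where the strings come from; for input that has already been chomped, both forms behave the same.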

searching a word with a particular character in it in perl

I am trying to find words that start with any capital letter and end with a zero, in Perl.
For example
ABC0
XYZ0
EIU0
QW0
What I have tried -
$abc =~ /^[A-Z].+0$/
But I am not getting proper output for this. Can anybody help me please?
The ^ anchors at the start of the string, the $ at the end. .+ matches as many non-newline characters as possible. Therefore
"ABC0 XYZ0 EIU0 QW0" =~ /^[A-Z].+0$/
matches the whole string.
The \b assertion matches at word edges: everywhere a word character and a non-word character are adjacent. The \w charclass holds only word characters, the \S charclass all non-space characters. Either of these is better than . here, because neither can run across the spaces between words.
So you may want to use /\b[A-Z]\w*0\b/.
This might work :
$abc =~ /\b[A-Z].*0\b/
\b matches word boundaries.
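As a quick check of the word-boundary approach, here is a small sketch that uses \b together with \w* to pull every matching word out of a line. The input is the sample from the question plus two made-up words that should not match.

```perl
use strict;
use warnings;

# Sample words from the question, plus two hypothetical non-matches.
my $text = "ABC0 XYZ0 EIU0 QW0 abc0 AB1";

# In list context, a /g match returns every match of the pattern.
my @words = $text =~ /\b[A-Z]\w*0\b/g;

print "$_\n" for @words;
```

Using \w* rather than .* keeps each match inside a single word, so the words are returned separately instead of as one long match.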

Perl- How do I insert a space before each capital letter except for the first occurrence or existing?

I have a string like:
SomeCamel WasEnteringText
I have found various means of splitting up the string and inserting spaces with PHP's str_replace, but I need it in Perl.
Sometimes there may be a space before the string, sometimes not. Sometimes there will be a space in the string, sometimes not.
I tried:
my $camel = "SomeCamel WasEnteringText";
#or
my $camel = " SomeCamel WasEntering Text";
$camel =~ s/^[A-Z]/\s[A-Z]/g;
#and
$camel =~ s/([\w']+)/\u$1/g;
and many more combinations of =~s//g; after much reading.
I need a guru to direct this camel towards an oasis of answers.
OK, based on the input below I now have:
$camel =~ s/([A-Z])/ $1/g;
$camel =~ s/^ //; # Strip out starting whitespace
$camel =~ s/([^[:space:]]+)/\u$1/g;
Which gets it done but seems excessive. Works though.
s/(?<!^)[A-Z][a-z]*+(?!\s+)\K/ /g;
And the less "screw this horsecrap" version:
s/
(?<!^) #Something not following the start of line,
[A-Z][a-z]*+ #That starts with a capital letter and is followed by
#Zero or more lowercased letters, not giving anything back,
(?!\s+) #Not followed by one or more spaces,
\K #Better explained here [1]
/ /gx; #"Replace" it with a space.
EDIT: I noticed that this also adds extra whitespace when you add punctuation into the mix, which probably isn't what the OP wants; thankfully, the fix is simply changing the negative look ahead from \s+ to \W+. Although now I'm beginning to wonder why I actually added those pluses. Drats, me!
EDIT2: Erm, apologies, originally forgot the /g flag.
EDIT3: Okay, someone downvote me. I wasn't thinking straight. No need for the negative lookbehind for ^ - I really dropped the ball on this one. Hopefully fixed:
s/[A-Z][a-z]*+(?!\W)\K/ /gx;
1: http://perldoc.perl.org/perlre.html
Try:
$camel =~ s/(?<! )([A-Z])/ $1/g; # Search for "(?<!pattern)" in perldoc perlre
$camel =~ s/^ (?=[A-Z])//; # Strip out extra starting whitespace followed by A-Z
Please note that the obvious try of $camel =~ s/([^ ])([A-Z])/$1 $2/g; has a bug: it doesn't work if there are capital letters following one another, because each match consumes two characters and matches cannot overlap (e.g. "ABCD" will be transformed into "A BC D" and not "A B C D").
Try :
s/(?<=[a-z])(?=[A-Z])/ /g
This inserts a space after a lower-case character (i.e. not a space or the start of the string) and before an upper-case character.
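Here is a small sketch of that lookaround substitution applied to the two sample strings from the question. The wrapper name space_camel is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the zero-width substitution.
sub space_camel {
    my ($s) = @_;
    # Insert a space at each position between a lower-case and an upper-case letter.
    $s =~ s/(?<=[a-z])(?=[A-Z])/ /g;
    return $s;
}

print space_camel("SomeCamel WasEnteringText"), "\n";
print space_camel(" SomeCamel WasEntering Text"), "\n";
```

Because both assertions are zero-width, nothing is consumed and existing spaces are left alone. Note that a leading space is kept as it is, and runs of consecutive capitals such as "ABCD" are not split.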
Improving on Hughmeir's, this also works with numbers and with words starting with lower-case letters:
s/[a-z0-9]+(?=[A-Z])\K/ /gx
Tests (shown with _ in place of the space as the replacement, so the insertions are visible):
myBrainIsBleeding => my_Brain_Is_Bleeding
MyBrainIsBleeding => My_Brain_Is_Bleeding
myBRAInIsBLLEding => my_BRAIn_Is_BLLEding
MYBrainIsB0leeding => MYBrain_Is_B0leeding
0My0BrainIs0Bleeding0 => 0_My0_Brain_Is0_Bleeding0
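The test table can be reproduced with a short script. This is a sketch that assumes "_" is used as the replacement, as in the listed results; the helper name mark_camel is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: the [a-z0-9]+ pattern with "_" as the replacement.
sub mark_camel {
    my ($s) = @_;
    $s =~ s/[a-z0-9]+(?=[A-Z])\K/_/g;    # \K needs Perl 5.10 or later
    return $s;
}

# Input and expected output pairs, copied from the test list.
my @cases = (
    [ 'myBrainIsBleeding',     'my_Brain_Is_Bleeding' ],
    [ 'MyBrainIsBleeding',     'My_Brain_Is_Bleeding' ],
    [ 'myBRAInIsBLLEding',     'my_BRAIn_Is_BLLEding' ],
    [ 'MYBrainIsB0leeding',    'MYBrain_Is_B0leeding' ],
    [ '0My0BrainIs0Bleeding0', '0_My0_Brain_Is0_Bleeding0' ],
);

for my $case (@cases) {
    my ($in, $want) = @$case;
    my $got = mark_camel($in);
    printf "%-22s => %-26s %s\n", $in, $got, $got eq $want ? 'ok' : 'MISMATCH';
}
```

Swapping the "_" back to a space gives the behaviour asked for in the question.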