Clean string from html tags and special characters - preg-replace

I want to clean my text of HTML tags, HTML special characters, and characters like < > [ ] / \ * ,
I used $str = preg_replace("/&#?[a-zA-Z0-9]+;/i", "", $str);
It works well with HTML special characters, but some characters are not removed, such as:
( /*/*]]>*/ )
How can I remove these characters?

If you are using PHP, as it appears, you can just use:
$str = htmlspecialchars($str);
This escapes the characters that are special in HTML (&, <, > and quotes), which is often better than just stripping them. If you really want to remove these characters, list them, escaped, inside a character class:
$str = preg_replace("/[\&#\?\]\[\/\\\<\>\*\:\(\);]*/i","",$str);
Note that this is a single character class, "/[...]*/i"; I removed a-zA-Z0-9 because you presumably want to keep those characters. Alternatively, you can whitelist only the characters you want to allow in your string. This is awkward with accented characters like á é ü, because you have to list every accepted character, and the pattern needs the u modifier so that PCRE treats those multibyte UTF-8 characters as single characters (this assumes $str is valid UTF-8):
$str = preg_replace("/[^a-zA-Z0-9áÁéÉíÍãÃüÜõÕñÑ\.\+\-\_\%\$\#\!\=;]*/u","",$str);
Note also that escaping punctuation inside a character class is harmless, except in ranges: a-z is the range from a to z, whereas a\-z matches only a, -, or z.
I hope it helps. :)
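For readers who want to try the two options outside PHP, here is a minimal sketch in Python. The sample string is made up for illustration, `html.escape` plays roughly the role of htmlspecialchars, and the character class mirrors the one above, so treat it as an approximation rather than an exact port.

```python
import html
import re

# Made-up sample input containing an entity and the leftover characters.
raw = "5 &lt; 6 /*]]>*/ done"

# Option 1: escape markup so it is displayed literally instead of interpreted.
escaped = html.escape(raw)

# Option 2: drop entities first, then drop the listed punctuation characters.
no_entities = re.sub(r"&#?[a-zA-Z0-9]+;", "", raw)
stripped = re.sub(r"[&#?\]\[/\\<>*:();]+", "", no_entities)

print(escaped)
print(repr(stripped))
```

Stripping leaves the surrounding whitespace behind, so a final whitespace cleanup may be needed depending on the use case.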

A simple regular expression for HTML tags is:
/<[^>]*>/
so use something like this:
// The regular expression to remove HTML tags
// ([^>]* stops at the first '>', so each tag is matched on its own)
$htmltagsregex = '/<[^>]*>/';
// The replacement text
$nothing = '';
// The string I want to apply it to
$string = 'this is a string with <b>HTML tags</b> that I want to <strong>remove</strong>';
// Apply the replacement
$result = preg_replace($htmltagsregex, $nothing, $string);
and it will return
this is a string with HTML tags that I want to remove
That's all
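One thing to watch with tag-stripping patterns is greediness: a pattern built on .* matches from the first < to the last > in the string, so it removes the text between tags as well. The sketch below uses Python's re module (the relevant syntax is the same in PCRE) to compare a greedy pattern with one bounded by [^>]*. It is only an illustration; a regex is not a full HTML parser and will mishandle things like > inside attribute values.

```python
import re

text = "this is a string with <b>HTML tags</b> that I want to <strong>remove</strong>"

# Greedy: .* runs from the first "<" to the last ">" in the string.
greedy = re.sub(r"<.*>", "", text)

# Bounded: [^>]* cannot cross a ">", so each tag is removed separately.
per_tag = re.sub(r"<[^>]*>", "", text)

print(repr(greedy))
print(repr(per_tag))
```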

Related

Determine if a string only contains invisible characters in Swift

I was parsing a messy XML. I found many of the nodes contain invisible characters only, for instance:
"\n "
" "
"\t "
"\n "
"\n\n"
I saw some posts and answers about alphabet and numbers, but the XML being parsed in my project includes UTF8 characters. I am not sure how I can list all visible UTF8 characters in the filter.
How can I determine whether a string consists entirely of invisible characters like the ones above, so I can filter them out? Thanks!
Use CharacterSet for that.
let nonWhitespace = CharacterSet.whitespacesAndNewlines.inverted
let containsNonWhitespace = (string.rangeOfCharacter(from: nonWhitespace) != nil)
Trim the string of whitespaces and newlines and see what's left.
if someString.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty {
    // someString only contains whitespaces and newlines
}

Replace emdash with double dash

I want to replace ― back into --
I tried with the utf8 encodings but that doesn't work
string = "blablabla -- blablabla ―"
I want to replace the long dash (if there is one) with double hyphens. I tried it the simple way but that didn't work:
string= string.replace ("―", "--")
I also tried to encode it with utf8 and use the codes of the special characters
stringutf8= string.encode("utf-8")
emdash= u"\u2014"
hyphen= u"\u002D"
if emdash in stringutf8:
    stringutf8.replace(emdash, 2*hyphen)
Any suggestions?
I am working with text files in which sometimes apparently the two hyphens are replaced automatically with a long dash...
thanks a lot!
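You are dealing with strings here. Strings are sequences of characters. Replace the character and leave the encoding out of the equation. Note also that replace returns a new string rather than modifying the original, so you have to assign the result: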
string = 'blablabla -- blablabla \u2014'
emdash = '\u2014'
hyphen = '\u002D'
string2 = string.replace(emdash, 2*hyphen)
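If the replacement still seems to do nothing, check which character the text actually contains: several dash-like code points look almost identical, for example the em dash (U+2014) and the horizontal bar (U+2015). A minimal sketch that maps more than one of them in a single pass follows; the helper name normalize_dashes and the choice of code points are assumptions for illustration, so adjust the table to your data.

```python
# Map dash-like code points to a double hyphen.
# Which code points to include is an assumption; extend as needed.
DASHES = {
    0x2014: "--",  # EM DASH
    0x2015: "--",  # HORIZONTAL BAR
}

def normalize_dashes(text):
    """Return a copy of text with the mapped dash characters replaced."""
    return text.translate(DASHES)

print(normalize_dashes("blablabla -- blablabla \u2014"))
```

To find out which code point a suspicious character is, `hex(ord(ch))` on that single character is usually enough.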

Is there a function to escape all regex-relevant characters?

The regex I'm using in my application is a combination of user input and code. Because I don't want to restrict the user, I would like to escape all regex-relevant characters like "+", brackets, slashes etc. in the entry.
Is there a function for that or at least an easy way to get all those characters in an array so that I can do something like this:
for regexChar in regexCharacterArray {
    myCombinedRegex = myCombinedRegex.replacingOccurrences(of: regexChar, with: "\\" + regexChar)
}
Yes, there is NSRegularExpression.escapedPattern(for:):
Returns a string by adding backslash escapes as necessary to protect any characters that would match as pattern metacharacters.
Example:
let escaped = NSRegularExpression.escapedPattern(for: "[*]+")
print(escaped) // \[\*]\+
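Most regex libraries ship a comparable helper. As a cross-language illustration (Python rather than Swift), the sketch below uses `re.escape` to embed user text in a larger pattern. The helper name make_prefixed_pattern and the "id=" prefix are made up here and simply stand in for the code-supplied part of the regex.

```python
import re

def make_prefixed_pattern(user_text):
    # Hypothetical helper: "id=" stands in for the code-supplied part.
    # re.escape backslash-escapes the regex metacharacters in user_text.
    return re.compile(r"id=" + re.escape(user_text))

pattern = make_prefixed_pattern("[*]+")

# The user text is matched literally, not as a character class.
print(pattern.search("id=[*]+") is not None)
print(pattern.search("id=***") is not None)
```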

How do I print a tab character in Pascal?

I'm trying to figure out in all the Internets what's the special character for printing a simple tab in Pascal. I have to format a table in a CLI program and that would be handy.
Single non-printable characters can be constructed from their ASCII code prefixed with #.
Since the ASCII value for tab is 9, a tab is #9. Characters constructed this way must be placed outside string literals, but no + is needed to concatenate them:
E.g.
const
  sometext = 'firstfield'#9'secondfield'#13#10;
contains two fields separated by a tab, terminated by a carriage return (#13) and a linefeed (#10).
The ' character can be produced either via this route (#39) or, more briefly, by ending the literal and immediately reopening it, i.e. doubling the quote:
const
  some2 = '''bla'''; // will contain 'bla' with the ticks.
  some3 = 'start''bla''end'; // will contain start'bla'end
write( ^i );
:-)

How to convert Unicode characters to escape codes

So, I have a bunch of strings like this: {\b\cf12 よろてそ } . I'm thinking I could iterate over each character and replace any unicode (Edit: Anything where AscW(char) > 127 or < 0) with a unicode escape code (\u###). However, I'm not sure how to programmatically do so. Any suggestions?
Clarification:
I have a string like {\b\cf12 よろてそ } and I want a string like {\b\cf12 [STUFF]}, where [STUFF] will display as よろてそ when I view the rtf text.
You can simply use the AscW() function to get the correct value:
sRTF = "\u" & CStr(AscW(char))
Note that, unlike other Unicode escapes, RTF uses the decimal signed 16-bit integer representation of a character. That makes the conversion in VB6 quite easy, because AscW() returns exactly that kind of value.
Edit
As MarkJ points out in a comment, you would only do this for characters outside the 0-127 range, but you would also need to give some characters inside that range special handling (in RTF, the backslash and the braces have special meaning).
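The per-character loop can be sketched as follows, in Python rather than VB6 for brevity. The function name rtf_escape_non_ascii is hypothetical. The sketch assumes the input is already RTF markup, so ASCII characters (including \, { and }) are left untouched. It also appends a one-character fallback ("?") after each escape, based on my understanding that RTF readers skip one following character by default (\uc1); treat that as an assumption and check it against the RTF specification.

```python
def rtf_escape_non_ascii(text, fallback="?"):
    """Replace non-ASCII characters with RTF \\uN escapes (N is signed 16-bit)."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)  # ASCII is assumed to be existing RTF markup or text
            continue
        # Split into UTF-16 code units so characters outside the BMP
        # become two escapes (a surrogate pair).
        data = ch.encode("utf-16-le")
        for i in range(0, len(data), 2):
            unit = int.from_bytes(data[i:i + 2], "little")
            if unit > 32767:
                unit -= 65536  # RTF expects a signed 16-bit decimal value
            out.append("\\u" + str(unit) + fallback)
    return "".join(out)

print(rtf_escape_non_ascii("{\\b\\cf12 \u3088\u308d\u3066\u305d }"))
```

Code units above 32767 come out negative, which matches what AscW() returns in VB6.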
Another, more roundabout way would be to add MSScript.OCX to the project and call VBScript's Escape function. For example:
Sub main()
    Dim s As String
    s = ChrW$(&H3088) & ChrW$(&H308D) & ChrW$(&H3066) & ChrW$(&H305D)
    Debug.Print MyEscape(s)
End Sub
Function MyEscape(s As String) As String
    Dim scr As Object
    Set scr = CreateObject("MSScriptControl.ScriptControl")
    scr.Language = "VBScript"
    scr.Reset
    MyEscape = scr.eval("escape(" & dq(s) & ")")
End Function
Function dq(s)
    dq = Chr$(34) & s & Chr$(34)
End Function
The Main routine passes in the original Japanese characters and the debug output says:
%u3088%u308D%u3066%u305D
HTH