How to escape special characters?

I am using an HTML Purifier package to purify my rich text of any XSS before storing it in the database.
But my rich text allows Wiris symbols, which use special characters such as → or  .
The problem is that the package does not let me escape these characters; it removes them completely.
What should I do to escape them?
Example of the string before purifying:
<p><math xmlns="http://www.w3.org/1998/Math/MathML"><msup><mi>x</mi><mn>2</mn></msup><mo> </mo><mo>+</mo><mo> </mo><mmultiscripts><mi>y</mi><mprescripts/><none/><mn>2</mn></mmultiscripts><mo> </mo><mover><mo>→</mo><mo>=</mo></mover><mo> </mo><msup><mi>z</mi><mn>2</mn></msup><mo> </mo></math></p>
After purifying:
<p><math xmlns="http://www.w3.org/1998/Math/MathML"><msup><mi>x</mi><mn>2</mn></msup><mo> </mo><mo>+</mo><mo> </mo><mmultiscripts><mi>y</mi><mprescripts></mprescripts><none><mn>2</mn></mmultiscripts><mo> </mo><mover><mo>→</mo><mo>=</mo></mover><mo> </mo><msup><mi>z</mi><mn>2</mn></msup><mo> </mo></math></p>

My guess is that these entities are failing the regexen that HTML Purifier is using to check for valid entities in HTMLPurifier_EntityParser, here:
$this->_textEntitiesRegex =
    '/&(?:'.
    // hex
    '[#]x([a-fA-F0-9]+);?|'.
    // dec
    '[#]0*(\d+);?|'.
    // string (mandatory semicolon)
    // NB: order matters: match semicolon preferentially
    '([A-Za-z_:][A-Za-z0-9.\-_:]*);|'.
    // string (optional semicolon)
    "($semi_optional)".
    ')/';

$this->_attrEntitiesRegex =
    '/&(?:'.
    // hex
    '[#]x([a-fA-F0-9]+);?|'.
    // dec
    '[#]0*(\d+);?|'.
    // string (mandatory semicolon)
    // NB: order matters: match semicolon preferentially
    '([A-Za-z_:][A-Za-z0-9.\-_:]*);|'.
    // string (optional semicolon)
    // don't match if trailing is equals or alphanumeric (URL
    // like)
    "($semi_optional)(?![=;A-Za-z0-9])".
    ')/';
Notice how it expects numeric entities to start with 0 currently. (Perfectly sane since it's designed to handle pure HTML, without add-ons, and to make that safe; but in your use-case, you want more entity flexibility.)
You could extend that class and override the constructor (where these regexen are defined), defining your own versions that drop the 0* from the // dec branch of each regex. Then instantiate your subclass, try setting $this->_entity_parser on a Lexer created with HTMLPurifier_Lexer::create($config) to that EntityParser instance (this is the part I am least sure would work; you might have to subclass the Lexer as well), and finally supply the altered Lexer to the config via Core.LexerImpl.
I have no working proof of concept of these steps for you right now (especially in the context of Laravel), but you should be able to go through those motions in the purifier.php file, before the return.
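Roughly, the wiring could look like the following. This is an untested sketch of the steps above, not a verified fix: it assumes an HTML Purifier version where the regexen are built in HTMLPurifier_EntityParser's constructor, that those regex properties are visible to a subclass, and that the Lexer's _entity_parser property can be assigned from outside (if it cannot, you would need the Lexer subclass mentioned above).
// Untested sketch of the approach described above.
class LenientEntityParser extends HTMLPurifier_EntityParser
{
    public function __construct()
    {
        parent::__construct();
        // Relax the decimal branch: drop the 0* so entities like &#8594; survive.
        $this->_textEntitiesRegex = str_replace('[#]0*', '[#]', $this->_textEntitiesRegex);
        $this->_attrEntitiesRegex = str_replace('[#]0*', '[#]', $this->_attrEntitiesRegex);
    }
}

$config = HTMLPurifier_Config::createDefault();
$lexer  = HTMLPurifier_Lexer::create($config);
$lexer->_entity_parser = new LenientEntityParser(); // the step I am least sure about
$config->set('Core.LexerImpl', $lexer);             // hand the patched lexer to the config
$purifier = new HTMLPurifier($config);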

I solved the problem by setting the key Core.EscapeNonASCIICharacters to true under my default key in my purifier.php file, and the problem is gone.
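For reference, that is a one-line change. A sketch, assuming the usual Laravel purifier package layout where per-profile settings live under 'settings' => 'default' (adjust to your own config structure):
// config/purifier.php — only the relevant key shown, everything else unchanged
return [
    // ...
    'settings' => [
        'default' => [
            // ...
            'Core.EscapeNonASCIICharacters' => true,
        ],
    ],
];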

Related

Using Powershell and HTMLBody.Replace how do I replace values inside an existing table?

I have an existing email template file for Outlook with To, CC, Subject and Body prefilled.
I can replace the values I need on the subject just fine; however, when it comes to the HTMLBody part, it only replaces values outside the table. I've tested this by putting all 15 placeholders outside the table.
In Powershell, I defined an array with the items that will be replaced and another that reads the values from a JSON file, then I loop through both in order to replace the values on the HTMLBody.
This is the code in question:
$emailToreplaceValues = @(
    "[DailyReportDate]",
    "[DailyReportSuccess]",
    "[DailyReportFailure]",
    "[DailyReportFailureRate]"
)
$newValues = @(
    $valuesJSON.DailyReport.Date,
    $valuesJSON.DailyReport.Success,
    $valuesJSON.DailyReport.Failure,
    $dailyReportFailureRate
)
$reportEmail = $outlookObj.CreateItemFromTemplate("$emailTemplate")
$reportEmail.Subject = $reportEmail.Subject.Replace("[date]", $date)
for ($i = 0; $i -le $newValues.Count; $i++) {
    $reportEmail.HTMLBody = $reportEmail.HTMLBody.Replace($emailToreplaceValues[$i], $newValues[$i])
}
There are more values, but for the sake of brevity I only included a few. From my understanding, the issue is that some of those values are inside an HTML table cell, but I don't know if I can access the table or cells directly.
Firstly, do not use the MailItem.HTMLBody property as a variable: it is expensive to set and read, and it might not be the same HTML you set, since Outlook performs some massaging and validation. Introduce an explicit variable, set it to the value of HTMLBody, do all your string replacements in a loop on that variable, then set the MailItem.HTMLBody property once.
You can also output the value of that variable to make sure the old values to be replaced are really there and are not broken up by HTML formatting or encoding.
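In the question's terms, that suggestion would look roughly like this (a sketch reusing the variable names from the code above):
$body = $reportEmail.HTMLBody              # read HTMLBody once
for ($i = 0; $i -lt $emailToreplaceValues.Count; $i++) {
    $body = $body.Replace($emailToreplaceValues[$i], $newValues[$i])
}
$reportEmail.HTMLBody = $body              # write it back once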
For future reference, the only way I was able to fix this was by grabbing the HTML code from the email that I based my email template on.
I organized it so that any tags I want replaced are on their own line, with nothing other than the spaces for indentation, then defined it as a variable that goes through the replace cycle and gets assigned to the MailItem.HTMLBody property after the replace cycle.

What function do I use in a Salesforce apex trigger to trim a text field?

I'm new to writing Apex triggers. I've looked through a lot of the Apex developer documentation, and I can't seem to find the combination of functions that I should use to automatically trim characters from a text field.
My org has two text fields on the Case object, which automatically store the email addresses that are included in an email-to-case. The text fields have a 255 character limit each. We are seeing errors pop up because the number of email addresses that these fields contain often exceeds 255 characters.
I need to write a trigger that can trim these text fields to the last ".com" before it hits the 255 character limit.
Perhaps I'm going about this all wrong. Any advice?
You can use the replace() function in Apex:
String s1 = 'abcdbca';
String target = 'bc';
String replacement = 'xy';
String s2 = s1.replace(target, replacement);
If you need to use a regular expression to find the pattern, then you can use replaceAll():
String s1 = 'a b c 5 xyz';
String regExp = '[a-zA-Z]';
String replacement = '1';
String s2 = s1.replaceAll(regExp, replacement);
For more information, please refer to the Apex Reference Guide.
The following code, I think, covers what you are searching for:
String initialUrl = 'https://stackoverflow.com/questions/69136581/what-function-do-i-use-in-a-salesforce-apex-trigger-to-trim-a-text-field';
Integer comPosition = initialUrl.indexOf('.com');
System.debug(initialUrl.left(comPosition + 4));
//https://stackoverflow.com
The main problems that I see are that other extensions are not covered (like ".net" URLs), and that a URL can contain a ".com" before the final one (something like "https://www.comunications.com"). But I think this covers most of the use cases.
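Putting that together with the original question, a before-save trigger could clip the field to the last ".com" that fits in 255 characters. This is only a sketch: the trigger name and ToAddresses__c are placeholders for whatever your field is actually called, and it assumes the full value is still present in Trigger.new when the trigger runs.
trigger TrimCaseEmails on Case (before insert, before update) {
    for (Case c : Trigger.new) {
        // ToAddresses__c is a placeholder for your custom email-address field
        if (c.ToAddresses__c != null && c.ToAddresses__c.length() > 255) {
            String clipped = c.ToAddresses__c.left(255);
            Integer lastCom = clipped.lastIndexOf('.com');
            // Keep everything up to and including the last ".com" that fits,
            // otherwise fall back to a plain 255-character cut.
            c.ToAddresses__c = (lastCom != -1) ? clipped.left(lastCom + 4) : clipped;
        }
    }
}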
Why not increase the length of that specific field? Trimming the text might cause data loss or make it unusable. Just go to Object Manager, find that object and that specific field, then edit the field and increase its length.

Lexer command 'more' in antlr doesn't match the expected value

I've been using different lexer modes in ANTLR and I've encountered problems with the 'more' command in the lexer, as it doesn't match everything inside the respective token. To make things more clear, here is roughly what my grammar looks like:
//DEFAULT_MODE
fragment A: ('A'); //same done for A-Z
KEYWORD_CLASS: C L A S S;
NUM: [0-9];
KEYWORD_SMTH: S M T H->mode(NUMBER_MODE);
mode NUMBER_MODE;
NUMBER: NUM+ ->mode(ANOTHER_MODE);
NO_NUMBER: ~[0-9]->more, mode(DEFAULT_MODE);
Now when I try to test the parser rule
rule: KEYWORD_SMTH NUMBER? CLASS;
then I'm expecting to match the following phrase:
SMTH CLASS
But for some reason the first letter, the C, is not matched to the token. I have to type something like
SMTH gCLASS
in order to match the keyword CLASS. If I understand correctly, the 'more' command will match everything that is not a number and bring it back to default mode, so it can be part of another token. Can someone please tell me where my mistake is? Thanks.
Assuming you omitted the rule that skips/hides spaces, this is what happens when tokenising SMTH CLASS:
a token KEYWORD_SMTH is created for the text "SMTH"
the mode changes from DEFAULT_MODE to NUMBER_MODE
the beginning of a token is created for the text "C" (NO_NUMBER...)
the mode changes from NUMBER_MODE to DEFAULT_MODE
inside the DEFAULT_MODE, the "C" previously matched is glued to whatever "LASS" is tokenised as (note this will NOT match the KEYWORD_CLASS!)
So, assuming that "LASS" is tokenised as an IDENTIFIER token or similar, you will have ended up with 2 tokens:
KEYWORD_SMTH (text "SMTH")
IDENTIFIER (text "C" + "LASS")
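If you want to verify this yourself, you can dump the token stream the lexer produces. A small sketch using the ANTLR 4 Java runtime; "MyLexer" stands in for whatever your generated lexer class is actually called:
import org.antlr.v4.runtime.*;

public class DumpTokens {
    public static void main(String[] args) {
        // Feed the problematic input to the generated lexer and print each token
        CharStream input = CharStreams.fromString("SMTH CLASS");
        MyLexer lexer = new MyLexer(input);           // hypothetical generated lexer name
        CommonTokenStream tokens = new CommonTokenStream(lexer);
        tokens.fill();
        for (Token t : tokens.getTokens()) {
            System.out.println(MyLexer.VOCABULARY.getSymbolicName(t.getType()) + " -> '" + t.getText() + "'");
        }
    }
}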

Query string parsing as number when it should be a string

I am trying to send a search input to a REST service. In some cases the form input is a long string of numbers (for example: 1234567890000000000123456789). I am getting a 500 error, and it looks like something is trying to convert the string to a number. The data type in the source database is a string.
Is there something that can be done in building the query string that will force the input to be interpreted as a string?
The service is an implementation of ArcGIS server.
More information on this issue per request.
To test, I have been using a client form provided with the service installation (see illustration below).
I have attempted to add single and double quotes, plus wildcard characters, in the form entry. The form submission does not error, but no results are found. If I shorten the number ("1234") or add some alphanumeric characters ("1234A"), the form submission does not error.
The problem surfaced after a recent upgrade to 10.1. I have looked for information that would tie this to a known problem, but have not found anything yet.
In terms of forcing the input to be interpreted as a string, you enclose the input in single quotes (e.g., '1234567890000000000123456789'). Though if you are querying a field of type string, then you need to enclose all search strings in single quotes, and in that case none of your queries should be working. So it's a little hard to tell from the information you've provided what exactly you are doing and what might be going wrong. Can you provide more detail and/or code? Are you formatting a where clause that you are using in a Query object via one of Esri's client-side APIs (such as the JavaScript API)? In that case, for fields of data type string you definitely need to enclose the search text in single quotes. For example, if the field you are querying were called FIELD, this is how you'd format the where clause:
FIELD = '1234'
or
FIELD Like '1234%'
for a wildcard search. If you are trying to enter query criteria directly into the Query form of a published ArcGIS Server service/layer, then there too you need to enclose the search in single quotes, as in the above examples.
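As an illustration of the client-side case, here is a sketch against the ArcGIS API for JavaScript; it assumes version 4.x, a FeatureLayer variable named layer, and a string field named FIELD (none of these come from the question):
const searchText = "1234567890000000000123456789";
// Quote the value (and escape embedded quotes) so it is compared as text, not a number
const query = layer.createQuery();
query.where = `FIELD = '${searchText.replace(/'/g, "''")}'`;
layer.queryFeatures(query).then((result) => {
  console.log(result.features.length + " matching features");
});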
According to an Esri help technician, this is a known bug.

Incorrect partition detected by PartitionScanner in custom TextEditor for eclipse

I have a PartitionScanner that extends RuleBasedPartitionScanner in my custom text editor plugin for Eclipse. I am having issues with the partition scanner detecting character sequences within larger strings, resulting in the document being partitioned incorrectly. For example, within the constructor of my partition scanner I have the following rule set up:
public MyPartitionScanner() {
...
rules.add(new MultiLineRule("SET", "ENDSET", mytoken));
...
}
However, if I happen to use a token that contains the character sequence "SET", it seems like the partition scanner continues searching for the end sequence ("ENDSET") and marks the rest of the document as a single partition set to "mytoken". For example:
var myRESULTSET34 = ...
Is there a way to make the partition scanner ignore the "SET" inside the token above and only recognize the whole word "SET"?
Thank you.
Using the MultiLineRule as-is, you won't be able to differentiate. But you can create your own subclass that overrides sequenceDetected and does a look-back/look-ahead when the super implementation returns true, to make sure the match is preceded/followed by EOF or whitespace. If it isn't, then push the characters back onto the scanner and return false.
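Something along these lines (an untested sketch; for brevity it only implements the look-ahead half of the check, and the class name is made up):
import org.eclipse.jface.text.rules.ICharacterScanner;
import org.eclipse.jface.text.rules.IToken;
import org.eclipse.jface.text.rules.MultiLineRule;

// Only accepts the start/end sequence when it is followed by whitespace or EOF,
// so identifiers such as myRESULTSET34 do not open a partition.
public class WholeWordMultiLineRule extends MultiLineRule {

    public WholeWordMultiLineRule(String startSequence, String endSequence, IToken token) {
        super(startSequence, endSequence, token);
    }

    @Override
    protected boolean sequenceDetected(ICharacterScanner scanner, char[] sequence, boolean eofAllowed) {
        if (!super.sequenceDetected(scanner, sequence, eofAllowed)) {
            return false;
        }
        // Look ahead at the character right after the matched sequence
        int next = scanner.read();
        scanner.unread();
        if (next == ICharacterScanner.EOF || Character.isWhitespace((char) next)) {
            return true;
        }
        // Not a whole word: push the sequence back (all but the first character,
        // which the caller unreads itself) and report no match.
        for (int i = 1; i < sequence.length; i++) {
            scanner.unread();
        }
        return false;
    }
}
You would then register it in the constructor instead of the plain rule, e.g. rules.add(new WholeWordMultiLineRule("SET", "ENDSET", mytoken)); a symmetric look-back (unreading past the start of the sequence to inspect the preceding character, then re-reading forward) can be added in the same way.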