Occurrence of "?" leads to body being combined with the subject? - special-characters

Occurrence of "?" in the body content leads to the body being combined with the subject, and the content after "?" is truncated.
Email to a friend

The last question mark in an href is used to split the query from the path, so you should URL-encode your parameters. Since you seem to be using Ruby, you can just use url_encode ( http://rdoc.info/stdlib/erb/1.8.7/ERB/Util:url_encode ).
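As a minimal sketch of the same idea in Python (the answer itself points to Ruby's url_encode), the link below percent-encodes each parameter value so that a literal "?" in the body becomes %3F and cannot be mistaken for the query separator. The helper name mailto_link and the sample subject and body are assumptions made up for illustration.

```python
from urllib.parse import quote


def mailto_link(subject, body, to=""):
    """Build a mailto: link with percent-encoded subject and body (hypothetical helper)."""
    # safe="" encodes every reserved character in the values, including "?", "&" and "=".
    return "mailto:{}?subject={}&body={}".format(
        quote(to, safe="@"),
        quote(subject, safe=""),
        quote(body, safe=""),
    )


link = mailto_link("Check this out", "Are you free? Call me")
print(link)
```

After encoding, the only "?" left in the link is the one that starts the query, so the body can no longer be split or truncated at a question mark.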

Related

Rest API - Multi-Column Sort issue

I have seen a few articles about best practices for REST APIs, and they suggest the following for multi-column sort.
GET /users?sort_by=-last_modified,+email
https://www.moesif.com/blog/technical/api-design/REST-API-Design-Filtering-Sorting-and-Pagination/
When I am using this approach, I see that - works fine but + gets replaced by a space.
A quick Google search indicates that + is a special character after ? in a URL. What am I missing out here?
> The following characters have special meaning in the path component of
> your URL (the path component is everything before the '?'): ";" |
> "/" | "?"
>
> In addition to those, the following characters have special meaning in
> the query part of your URL (everything after '?'). Therefore, if they
> are after the '?' you need to escape them: ":" | "#" | "&" | "=" |
> "+" | "$" | ","
>
> For a more in-depth explanation, see the RFC.
> What am I missing out here?
History, mostly.
U+002B (+) is a sub-delim, in the context of a URI, and can be used freely in the query part; see RFC 3986 Appendix A.
But on the web, a common source of query data is HTML form submissions; when we submit a form, the processing engine collects the key value pairs from the form and creates an application/x-www-form-urlencoded character sequence, which becomes the query of the URI.
Because this is such a common case, the query parsers in web server frameworks often default to reversing the encoding before giving your bespoke code access to the data.
Which means that in your web logs, you would see:
/users?sort_by=-last_modified,+email
because that's the URI that you received, but in your parameter mapping you would see
"sort_by" = "-last_modified, email"
Because the "form data" is being decoded before you get to look at it.
Form urlencoding has an explicit step that replaces every space (U+0020) with U+002B; a literal U+002B is percent-encoded as %2B instead.
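The decoding described above can be reproduced with Python's standard urllib.parse, which applies the application/x-www-form-urlencoded rules. This is only a sketch of the general behaviour, not of any particular web framework; the helper name sort_param is an assumption.

```python
from urllib.parse import parse_qs, urlencode


def sort_param(query):
    """Return the decoded sort_by value from a raw query string (hypothetical helper)."""
    return parse_qs(query)["sort_by"][0]


# A bare "+" in the query is decoded as a space.
as_received = sort_param("sort_by=-last_modified,+email")

# A percent-encoded plus (%2B) survives decoding as a literal "+".
escaped = sort_param("sort_by=-last_modified,%2Bemail")

# Encoding the intended value as form data escapes the plus (and the comma).
encoded = urlencode({"sort_by": "-last_modified,+email"})

print(repr(as_received), repr(escaped), encoded)
```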
To check if this is what is going on, try instead the following request:
GET /users?sort_by=-last_modified,%2Bemail
What I expect you will find is that the plus you are looking for now appears in your form parameters:
"sort_by" = "-last_modified,+email"

How to remove quotes in my product description string?

I'm using OSCommerce for my online store and I'm currently optimizing my product page for rich snippets.
Some of my Google-indexed pages are being marked as "Failed" by Google due to double quotes in the description field.
I'm using existing code that strips the HTML tags and truncates everything after 197 characters.
<?php echo substr(trim(preg_replace('/\s\s+/', ' ', strip_tags($product_info['products_description']))), 0, 197); ?>
How can I include the removal of quotes in that code so that the following string:
<strong>This product is the perfect "fit"</strong>
becomes:
This product is the perfect fit
This happened to me too; try using:
tep_output_string($product_info['products_description'])
" becomes &quot;
We can try using preg_replace_callback here:
$input = "SOME TEXT HERE <strong>This product is the perfect \"fit\"</strong> SOME MORE TEXT HERE";
$output = preg_replace_callback(
"/<([^>]+)>(.*?)<\/\\1>/",
function($m) {
return str_replace("\"", "", $m[2]);
},
$input);
echo $output;
This prints:
SOME TEXT HERE This product is the perfect fit SOME MORE TEXT HERE
The regex pattern used does the following:
<([^>]+)> match an opening HTML tag, and capture the tag name
(.*?) then match and capture the content inside the tag
<\/\\1> finally match the same closing tag
Then, we use a callback function which does an additional replacement to strip off all double quotes.
Note that in general using regex against HTML is bad practice. But, if your text only has single level/occasional HTML tags, then the solution I gave above might be viable.
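For comparison, here is a minimal Python sketch of the whole pipeline from the question (strip tags, drop double quotes, collapse whitespace, truncate to 197 characters). It is not the PHP solution above; the function name plain_snippet is an assumption, and the regex-based tag stripping is the same kind of crude approximation the caveat above warns about.

```python
import re


def plain_snippet(html_text, limit=197):
    """Strip tags and double quotes, collapse whitespace, then truncate (hypothetical helper)."""
    text = re.sub(r"<[^>]+>", "", html_text)   # crude tag removal, fine for simple markup only
    text = text.replace('"', "")               # drop every double quote
    text = re.sub(r"\s\s+", " ", text).strip() # same whitespace rule as the PHP one-liner
    return text[:limit]


print(plain_snippet('<strong>This product is the perfect "fit"</strong>'))
```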

PHP: Using preg_replace to replace an unknown string between two known strings

I have $stringF. Contained within $stringF is the following (the string is all one line, not word-wrapped as below):
http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=
AFQjCNHWQk0M4bZi9xYO4OY4ZiDqYVt2SA&clid=
c3a7d30bb8a4878e06b80cf16b898331&cid=52779892300270&ei=
H4IAW6CbK5WGhQH7s5SQAg&url=https://abcnews.
go.com/Lifestyle/wireStory/latest-royal-wedding-thousands-streets-windsor-55280649
I want to locate that string and make it look like this:
https://abcnews.go.com/Lifestyle/wireStory/latest-royal-
wedding-thousands-streets-windsor-55280649
Basically I need to use preg_replace to find the following string:
http://news.google.com/news/url?sa= ***SOME UNKNOWN CONTENT*** &url=http
and replace it with the following string:
http
I'm a little rusty with my PHP, and even rustier with regular expressions, so I'm struggling to figure this one out. My code looks like this:
$stringG = preg_replace('http://news.google.com/news/url?sa=*&url=http','http',$stringH);
except I know I can't use wildcards like that, and I know I need to deal specially with the special characters (colon, forward slash, question mark, ampersand, etc.). Hoping someone can help me out here.
Also of note is that my $stringF contains multiple instances of such strings, so I need the preg_replace to be not greedy - otherwise it will replace a huge chunk of my string unnecessarily.
PHP has tools for that, so there is no need for a regex: parse_url gets the components of a URL (scheme, host, path, fragment, query, ...) and parse_str gets the keys/values of the query part.
$url = 'http://news.google.com/news/url?sa=t&fd=R&ct2=us&usg=AFQjCNHWQk0M4bZi9xYO4OY4ZiDqYVt2SA&clid=c3a7d30bb8a4878e06b80cf16b898331&cid=52779892300270&ei=H4IAW6CbK5WGhQH7s5SQAg&url=https://abcnews.go.com/Lifestyle/wireStory/latest-royal-wedding-thousands-streets-windsor-55280649';
parse_str(parse_url($url, PHP_URL_QUERY), $arr);
echo $arr['url'];
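The same approach can be sketched in Python with urllib.parse, which plays the role of parse_url and parse_str here. Because the question mentions several such links embedded in one larger string, the sketch also includes a non-greedy regex that removes only the redirect prefix. The names target_url, strip_redirects and REDIRECT_PREFIX, and the example.com test URLs, are assumptions for illustration.

```python
import re
from urllib.parse import parse_qs, urlsplit


def target_url(redirect_url):
    """Return the value of the url= parameter of a single redirect link."""
    return parse_qs(urlsplit(redirect_url).query)["url"][0]


# ".*?" is non-greedy, so each match stops at the nearest "&url=" that is followed by "http".
REDIRECT_PREFIX = re.compile(r"http://news\.google\.com/news/url\?sa=.*?&url=(?=http)")


def strip_redirects(text):
    """Remove the redirect prefix from every embedded link, keeping the target URLs."""
    return REDIRECT_PREFIX.sub("", text)


single = ("http://news.google.com/news/url?sa=t&fd=R&ct2=us&url="
          "https://abcnews.go.com/Lifestyle/wireStory/latest-royal-wedding-thousands-streets-windsor-55280649")
print(target_url(single))
```

The parser-based function suits a string that is exactly one URL; the regex is only needed when the links sit inside surrounding text.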

How to extract email body and attachment

I am trying to extract a message from a multi-part email body or from an attachment, so I used :0B to try each option, like the following:
msgID=""
#extract message in the attachment if it's plain text
:0B
* ^Content-Disposition: *attachment.*(($)[a-z0-9].*)*($)($)\/[a-z0-9+]+=*
{msgID="$MATCH"}
#extract message in the body if it's there
:0EB
* ^()\/[a-z]+[0-9]+[^\+]
{msgID = "$MATCH"}
But msgID ended up with the same content from the body, which was inline image code. What's wrong with it? Does anyone know a better condition to filter on?
I also need to detect whether the sub-header in the body says the part is text and base64 encoded, and then decode it. How do I express that with a regex?
:0B
* ^Content-Type:text/html;
* ^Content-Location:text_0.txt
* ^Content-Transfer-Encoding:base64
* ^Content-Disposition: *attachment.*(($)[a-z0-9].*)*($)($)\/[a-z0-9+]+=*
{ msgID= msgId =`printf '%s' "$MATCH" | base64 -d` }
It always complains that there is no match for ^Content-Type:text/html;
I'm guessing you are trying to say, there are two types of incoming messages. One looks something like this:
From: Sender <there@example.net>
To: You <AmyX@example.com>
Subject: plain text

ohmigod0
And the other is a complex MIME multipart with the same contents:
From: Sender <there@example.net>
To: Amy X <AmyX@example.com>
Subject: MIME complexity
MIME-Version: 1.0
Content-Type: multipart/related; boundary=12345

--12345
Content-type: text/plain; charset="us-ascii"
Content-transfer-encoding: base64
Content-disposition: attachment; filename="text_0.txt"
Content-location: text_0.txt

b2htaWdvZDA=
--12345--
If this is correct, you would want to create a recipe to handle the more complex case first, because it has more features -- if your regex hits, it's unlikely to be a false positive. If not, fall back to the simpler pattern, and assume there will never be any false positives on this (perhaps because this account only receives email from a single system).
# extract message in the attachment if this is a MIME message
:0B
* ^Content-Disposition: *attachment.*(($)[a-z0-9].*)*($)($)\/[a-z0-9+]+=*
{ msgID="$MATCH" } # hafta have spaces inside the braces
:0EB # else, do this: assume the first non-empty body line is msgID
* ^()\/[a-z]+[0-9]+[^\+]
{ msgID="$MATCH" } # still need spaces inside braces;
# ... and, as pointed out many times before, cannot have spaces
# around the equals sign
The regular expression for the attachment is an oversimplification, but I already showed you how to cope with a complex MIME message in a previous question of yours -- if you have multiple cases (for example, base64-encoded attachment, or just a plain-text attachment, or no MIME), I would arrange them from more-complex (meaning more features in the regex) and fall back successively to simpler regexes, with higher chance of false positives. You can chain :0E ("else") cases for as long as you like -- if a regex succeeds and the following recipes are :0E recipes, they will all be skipped.
In response to your update, there are two problems with your attempt. The first, as you note, is that the first regex doesn't match. You have no space after the colon, and I'm guessing there is one in the message you are matching against. You need to understand that every character in a regex needs to match exactly, with the exception of regex metacharacters, which have special meaning. You would typically see something like this in many Procmail recipes:
* ^Content-Type:[ ]*text/html;
where the spaces between the square brackets are a space and a tab. The character class (the stuff in the square brackets) matches either character once, and the asterisk * says to repeat this pattern zero or more times. This allows for arbitrary spacing after the colon. The square brackets and the star are metacharacters. (This is very basic stuff which should be in any Procmail introduction you may have read.)
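The effect of the optional-whitespace class can be checked quickly outside Procmail. The sketch below uses Python's re module, which is an assumption for illustration and not Procmail's own regex engine, with [ \t] standing for the space-and-tab class and IGNORECASE mimicking Procmail's default case-insensitive matching.

```python
import re

# Same pattern as in the question: no whitespace allowed after the colon.
STRICT = re.compile(r"^Content-Type:text/html;", re.IGNORECASE | re.MULTILINE)

# With a character class of space and tab, repeated zero or more times.
LENIENT = re.compile(r"^Content-Type:[ \t]*text/html;", re.IGNORECASE | re.MULTILINE)

header = "Content-Type: text/html; charset=utf-8"

print(STRICT.search(header))             # no match: the space after the colon is not in the pattern
print(LENIENT.search(header) is not None)
```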
Your other problem is that each regex is applied in isolation. So your recipe says, if the Content-Type header appears anywhere in the body, and the Content-Location header appears anywhere else (typically, in another MIME header somewhere) etc. In other words, your recipe is very prone to false positives. This is why the rule I proposed earlier is so complex: It looks for these headers in sequence, in a single block, that is, in a single MIME header (though there is nothing to actually make sure that the context is a MIME body part header; more on that in a bit).
Because we want to ensure that there are four different headers, in any order, the regex for this is going to be huge, with one alternative for each of the 24 possible orderings: ABCD|ABDC|ACBD|ACDB|ADBC|ADCB|BACD|... where A is the Content-Type header regex, B is the Content-Location regex, etc. You could cheat a little bit and craft a single regex which matches a sequence of four matches of the same header-identifying regex -- this is unlikely to cause any false positives (there is no sane reason to have two copies of the same header) and simplifies the code significantly, though it's still complex. Pay attention here: We want to create a single regex which matches any one out of these four headers.
^Content-(Type:[ ]*text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment)
... followed by any header, repeated four times, followed by the MIME body part (which you had after the Content-Disposition header, slightly out of context, but not incorrectly per se).
(Your code has text/html but if the attachment isn't HTML, as suggested by the format and the filename, it should be text/plain; so I'm going with that instead.)
Before we go there, I'll point out that MIME parsing in Procmail is not done a lot, precisely because it tends to explode into enormously complex regular expressions. MIME has a lot of options, and you need each regex to allow for omission or inclusion of each optional element. There are options for how to encode things (base64, or quoted-printable, or not encoded at all) and options to include or omit quotes around many elements, and options to use a multipart message with one or more body parts or just put the data in the body, like in my constructed first example message (which is still technically a MIME message; its implied content type is text/plain; charset="us-ascii" and the default content transfer encoding is 7bit, which conveniently happens to be what email before MIME always had to look like).
So unless you are in this because (a) you really, really want to learn the deepest secrets of Procmail or (b) you are on a very constrained system where you have to because there is nothing else you can use, I would seriously suggest that you move to a language with a proper MIME parser. A Python script which decodes this would be just half a dozen lines or so, and you get everything normalized and decoded nicely for you with no need for you to reinvent quoted-printable decoding or character set translation. (You can still call the Python script from Procmail if you like.)
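As a rough sketch of what such a script could look like, the function below uses Python's standard email package to return the first text/plain part, already base64-decoded and converted to a string. It covers both message shapes discussed above, because a message without a Content-Type header defaults to text/plain. The function name extract_msg_id is an assumption, and a real script would need to decide how to handle messages with several text parts.

```python
from email import message_from_string, policy


def extract_msg_id(raw_message):
    """Return the decoded text of the first text/plain part, or None (hypothetical helper)."""
    msg = message_from_string(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no content of their own
        if part.get_content_type() == "text/plain":
            # get_content() undoes the transfer encoding and the charset for us
            return part.get_content().strip()
    return None
```

From Procmail, a script like this could be fed the message on standard input and its output captured in a variable, much like the base64 -d pipeline in the question.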
I'll also point out here that a proper MIME parser would extract the boundary= parameter from the top-level headers in a multipart message, and make sure any matching on body part headers only occurs immediately after a boundary separator. The following Procmail code does not do that, so we could get a false positive if a message contains a match somewhere else than in the MIME body part headers (such as, for example, if a bounce message contains a fragment of the MIME headers of the bounced message; in this case, you would like for the recipe not to match, but it will).
:0B
* ^(Content-(Type:[ ]*text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment).*(($)[a-z0-9].*)*)($)\
(Content-(Type:[ ]*text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment).*(($)[a-z0-9].*)*)($)\
(Content-(Type:[ ]*text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment).*(($)[a-z0-9].*)*)($)\
(Content-(Type:[ ]*text/plain;|\
Location:[ ]*text_0\.txt|\
Transfer-Encoding:[ ]*base64|\
Disposition:[ ]*attachment).*(($)[a-z0-9].*)*)($)\
($)\/[a-z0-9/+]+=*
{ msgID=`printf '%s' "$MATCH" | base64 -d` }
:0BE
* ^^\/[a-z]+[0-9]*[^\+]
{ msgID="$MATCH" }
(Unfortunately, Procmail's regex engine doesn't have the {4} repetition operator, so we have to repeat the regex literally four times!)
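For comparison only, this is roughly what the same idea looks like in an engine that does support counted repetition. The sketch uses Python's re module, not Procmail; it ignores folded (continuation) header lines for brevity, and the names ONE_HEADER, PART and decode_attachment are assumptions.

```python
import base64
import re

# One alternative per expected header; each header is assumed to fit on a single line.
ONE_HEADER = (
    r"Content-(?:Type:[ \t]*text/plain;"
    r"|Location:[ \t]*text_0\.txt"
    r"|Transfer-Encoding:[ \t]*base64"
    r"|Disposition:[ \t]*attachment).*\n"
)

# Four such headers in any order, an empty line, then the base64 payload.
PART = re.compile(
    r"^(?:" + ONE_HEADER + r"){4}\n([a-z0-9/+]+=*)",
    re.IGNORECASE | re.MULTILINE,
)


def decode_attachment(body):
    """Return the decoded payload of the first matching body part, or None."""
    match = PART.search(body)
    if match is None:
        return None
    return base64.b64decode(match.group(1)).decode("ascii")
```

Like the Procmail recipe, this does not check that the headers follow a MIME boundary line, so the same false-positive caveat applies.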
As noted before, Procmail, unfortunately, doesn't know anything about MIME. As far as Procmail is concerned, the top-level headers are headers, and everything else is body. There have been attempts to write MIME libraries or extensions for Procmail, but they don't tend to reduce complexity, just shuffle it around.

Why encode url parameter if contains # symbol

I'm passing a URL parameter to a form via a GET request. I need to URL-encode the parameter when it contains a '#'; otherwise the request fails. Why is this required? Why do I need to URL-encode the '#' character but not other text?
'#' is used in URLs to indicate where a fragment identifier (bookmarks/anchors in HTML) begins.
The part following the # is never seen by the server. It is generally used for navigation at the client-end.
The following characters need to be encoded in order to be used literally.
When using GET, anything after # (and the # itself) will not be seen by the server.
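A small sketch with Python's urllib.parse shows the effect: an unencoded # ends the query and starts the fragment, while %23 keeps the character inside the parameter value. The URL http://example.com/form and the parameter name tag are assumptions for illustration.

```python
from urllib.parse import parse_qs, quote, urlsplit

# Unencoded: everything from "#" onwards is the fragment, which is not sent to the server.
raw = urlsplit("http://example.com/form?tag=c#sharp")
print(raw.query, raw.fragment)

# Encoded: "#" becomes %23 and stays part of the query.
encoded_value = quote("c#sharp", safe="")
safe_url = urlsplit("http://example.com/form?tag=" + encoded_value)
print(safe_url.query, parse_qs(safe_url.query))
```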