Sphinx Index/Replace Curly Apostrophes - sphinx

I asked this previously as a Regex question yet the issue appears to be Sphinx as even the correct Regex set-up failed in Sphinx. So: Is there a way to convert curly apostrophes in the index to straight ones? I have tried:
Exceptions.text: ’ => '
regexp using the actual character: regexp_filter=(\w+)\’s=>\1
regexp using the unicode: regexp_filter=(\w+)\x{2019}s=>\1
And nothing has worked. I get that the ’ may fail because these are text files but not sure why unicode would. In any event the ’ is breaking things for me and I'd hate to do mysql replace on the entire database just for this.

Related

Perl regex presumably removing non ASCII characters

I found a code with regex where it is claimed that it strips the text of any non-ASCII characters.
The code is written in Perl and the part of code that does it is:
$sentence =~ tr/\000-\011\013-\014\016-\037\041-\055\173-\377//d;
I want to understand how this regex works and in order to do this I have used regexr. I found out that \000, \011, \013, \014, \016, \037, \041, \055, \173, \377 mean separate characters as NULL, TAB, VERTICAL TAB ... But I still do not get why "-" symbols are used in the regex. Do they really mean "dash symbol" as shown in regexr or something else? Is this regex really suited for deleting non-ASCII characters?
This isn't really a regex. The dash indicates a character range, like inside a regex character class [a-z].
The expression deletes some ASCII characters, too (mainly whitespace) and spares a range of characters which are not ASCII; the full ASCII range would simply be \000-\177.
To be explicit, the d flag says to delete any characters not between the first pair of slashes. See further the documentation.

How to escape '|' in emacs's org-mode? [duplicate]

I've got a table in Emacs org-mode, and the contents are regular expressions. I can't seem to figure out how to escape a literal pipe-character (|) that's part of a regex though, so it's interpreted as a table-cell separator. Could someone point me to some help? Thanks.
Update: I'm also looking for escapes for a slash (/), so that it doesn't trigger the start of an italic/emphasis sequence. I experimented with \/ and \// - for example, suppose I want the literal text /foo/ in a table cell. Here are 3 ways of attempting it:
| /foo/ | \/foo/ | \//foo/ |
In LaTeX export, that becomes:
\emph{foo} & \/foo/ & \//foo/
So none of them is the plain /foo/ I'm hoping for.
\vert for the pipe.
Forward slashes seem to work fine for me unescaped when exporting both to HTML and PDF.
Use a broken-bar character, “¦”, Unicode 00A6 BROKEN BAR. This may or may not work for your specific needs, but it’s a good visual approximation.
You could also format the relevant text as verbatim or code:
Text in the code and verbatim string is not processed for Org mode
specific syntax; it is exported verbatim.
So you might try something like =foo | bar= (code) or foo ~|~ bar (verbatim). It does change the output format, though.

SharePoint 2013 REST API odata $filter ignores unicode characters such as German umlauts äöü

I'm trying to use SharePoint 2013 REST API (odata) with unicode characters such as umlauts (ä ö ü).
...?$select=Title%2CID&$filter=substringof%28%27hello%20w%F6rld%27%2C%20Title%29&$orderby=ID%20desc&$top=14
^^ should search for "hello w*ö*rld" using substringof('...', Field)
I'm escaping the URL correctly (and also single quotes with double quotes) and filtering works for all kinds of characters (even backslash and quotes), however, entering ä/ö/ü or any other unicode character has no effect, it is as if those characters were simply filtered out on the server side (i can insert a lot of ääääääs without changing the results).
Any idea how to escape those? I tried the obvious (%ab { \u1234 \xab x1234) without success. Can't find anything on the web or in the specs either.
Thanks for suggestions.
UPDATE - SOLVED
I found that you can use the %uhhhh variant of escaping them:
?$filter=substringof('hello w%u00f6rld')
Of course one must only escape that once (i.e. not the whole thing again), but it seems that's the way to go.
(can't answer my own question now lol)

Converted to Junk Character - When Copy Paste in Text Box

Whenever i Copy and paste any Below Mention CHARACTER in text Box
Below are Copied character ( test this in notepad )
…
”
‘
Below are Typed Character
...
"
'
then that was converted to Junk Character. How can i Block this .
When i Type those character from keybord then it works but when copy paste it converted to Junk.
How can i detect and delete all this character before processing because ..user dont know about this issue ..
I want to delete that character wen user press Submit button.
” and ’ are not junk characters. They are perfectly good Unicode characters (U+201C LEFT DOUBLE QUOTATION MARK and U+2018 LEFT SINGLE QUOTATION MARK). Modern applications should be capable of dealing with all Unicode characters; if you can't handle the smart quotes you probably also can't handle accents, Greek, Cyrillic, Chinese or any of the other characters users are likely to want to use. You should concentrate on ensuring that your application supports Unicode, rather than trying to fix this one visible symptom.
Pasting ' and " (ASCII straight quote) characters into a text box should not turn them into non-ASCII ‘smart’ quotes. Where they typically tend to come from is Microsoft Word's misguided ‘AutoReplace’ feature, which replaces straight quotes with smart quotes as you type. This is an annoyance, but ultimately it's limited to Office and there's not really much you can do about it. Whilst you can manually replace “ and ” with " by doing a trivial string replacement (and how you do that depends on what language/environment you are talking about), you'll also be removing correct usage of those characters, and you won't be fixing all the other sad broken auto-replacements that MS Office does.
The … single-character ellipsis is a slightly different case, and arguably ‘junk’: to Unicode, U+2026 HORIZONTAL ELLIPSIS is a ‘compatibility character’ which is only intended to round-trip nicely to existing encodings that include it as a separate characters. Normally three dot characters should be used instead. You can replace compatibility characters by using Unicode normalisation, in particular Normal Form KC. Again, how you access normalisation is something that depends on your programming language/environment. For example in Python, unicodedata.normalize('NFKC', u'…') gives you u'...'.
Is your vnc client / server ON, try to exit (shutdown) all vnc server / clients and try again - if your copy paste works.

Comment token inside string

In pig etc. /* begins a block comment. If I put this in a regex string 'blah/blah/*', emacs thinks this is a block comment and syntax highlighting goes to hell. I am not familiar with elisp but I am certain that is a problem with script that is providing annotations for pig.
How can I fix it?
phils pointed out a better designed major mode in the question comments, but since you are still curious: The pig mode version you are using doesn't have the syntax table set up right. The most reliable way for emacs to recognize comments and strings is to use the syntax table to map characters to start/end of comments and strings. The version you are using is trying to do it with font-lock.
You have to escape the \'es and the *. All the characters that are used by the regexp engine, have to be escaped.
If you want to match "\", you might have to write "\\" when using replace-regexp interactively and "\\\\" if you use it as a lisp function.
(I even have to escape my escapes in this comment, so there are 8 escapes in the last escape sequence above)