How can I use sed to replace a word between periods without spaces? - sed

I have a large document (Johnson's DICTIONARY OF THE ENGLISH LANGUAGE) that has a large number of typos in this format:
MITHRIDATE (MI'THRIDATE) n.s.[mithridate, Fr.] Mithridate is one of the capital medicines of the shops, consisting of a great number of ingredients, and has its name from its inventor Mithridates, king of Pontus.Quincy.
where there is a proper name like Quincy beginning with a capital letter surrounded by periods that do not have spaces.
I want to change all the occurrences like .Quincy. to . Quincy. without changing the items like n.s. to n. s..
I am willing to ignore the possibly lacking space after the s. in n.s.[mithridate].
I have tried various combinations of character classes and back references without success.

Related

Delete greek letters in string

I am pre-processing dirty text fields. I have managed to delete single characters and numbers, but there are still Greek letters (from formulas) that I want to delete altogether.
Any type of Greek letter can occur at any position in the string.
Any ideas how to do it?
select regexp_replace(' ω ω α ω alkanediylbis alkylimino bis alkanolpolyethoxylate the formula where straight branched chain alkylene group also known alkanediyl group that has the range carbon atoms and least carbon atoms length and can the same different and are primary alkyl groups which contain carbon atoms each and can the same different and are alkylene groups which contain the range from carbon atoms each and and are the same different numerals the range each ', '\W+', '')
[Α-Ωα-ω] will match the standard Greek alphabet. (Note that the Α here is a distinct character from the Latin A, though they probably look identical).
Some commonly-used symbols are outside of the standard alphabet, so at the very least, you probably want to match the whole Greek Unicode block using [\u0370-\u03FF].
Unicode also has
The Greek Extended block containing letters with diacritics
The Coptic block with some very similar-looking characters
The Mathematical Operators block with its own ∆/∏/∑ symbols
Several copies of the Greek alphabet in the Mathematical Alphanumeric Symbols block
...and probably more.
Rather than trying to list everything you want to replace, it might be easier to list the things you want to keep. For example, to remove everything outside the printable ASCII range:
select regexp_replace(
'ABCΑαΒβΓγΔδΕεΖζΗηΘθΙιΚκΛλΜμΝνΞξΟοΠπΡρΣσςΤτΥυΦφΧχΨψΩω123',
'[^\u0020-\u007E]', '', 'g'
);
regexp_replace
----------------
ABC123

Weird characters in a Microsoft Word document won't export/can't be searched

I have a document which has been sloppily authored. It's a dictionary that contains cyrillic characters. Most of the dictionary is manageable, but I'm stuck with one thing I need help with. Words have accented letters in them and they're mostly formatted properly as a letter with a unicode accent (thus forming a single letter). However there are some very peculiar letters that look similar for example to: a;´ (where "a" is any arbitrary cyrillic letter). You'd expect á in its place. However it wouldn't be a problem per se if only this thing could be exported to, say HTML and manipulated in a text editor. The problem is that Word treats this "thing" as a single character/entity and
when exporting it is COMPLETELY omitted
when copied it can only be pasted into Notepad (which translates it into three separate characters), when being pasted into WordPad it just won't appear at all.
when a search is run in Word it won't find the letter, neither the actual character nor the exactly copied/pasted combination.
the letter will disappear when the document is opened in any other software, such as Libre Office
At this point I'm trying to:
understand what this combination is exactly
run a search/replace operation to find and weed out all of those errors
Here's a sample Word file.
Here's a screenshot of the word/letter in question:
which when typed correctly should appear like "скре́пка".
The 'character' appears to be a Word field of type 'eq' (equation). Here is the field with toggled field codes:
If it is a large document you could try to create a VBA routine that removes the fields and replaces them with corresponding characters.
Assuming that #Anonimista’s analysis is correct, as I think it is, you could fix the file by running some search and replace operations in Word, replacing e.g. ^19eq \o(е;´)^21 by е́ (the latter is Cyrillic letter е followed by combining acute accent U+0301). This is dull because you would need to do this for each vowel separately (and for uppercase vowels too). But I cannot find a way to use wildcards in this context; the codes ^19 and ^21 for start and end of field work only when wildcards are not enabled.

Replacing non-Latin chars with Latin counterparts like Phonebook App

I have a phonebook app where I generate the title for section headers by comparing the first letters of entries.
The indexes are predefined so I expect letters to be assigned from A-Z and for numbers #.
The problem is there are many letter with accents including ü, İ, ç etc in many languages. In my approach, since these chars do not fall under the range A-Z, they are assigned to # which is not desired.
The native iOS Phonebook app assigns for example ü to U and so on. Is there a simple way to make this casting without defining a set of chars?
Thanks.
Check out Unicode Normalization. You probably want some combination of NFD and extraction of the adequate data. If you look at this file from Unicode, you will see something like
00E9;LATIN SMALL LETTER E ACUTE;Ll;0;L;0065 0301;;;;N;;;00C9;;00C9
Where 00E9, ie 'é', is decomposed as 0065 0301. You pick up 0065 (a), and discard 0301 (´). This file should get you started nicely. There may be equivalent functions in Objective-C/iOS, but I wouldn't know where to start...

Subset of Unicode normally used in writing?

What is the subset of Unicode characters that are normally used in writing — such as those that would be typically found in a newspaper article?
For example, in English, the characters in the range [a-zA-Z0-9], plus some punctuation characters, would be sufficient for most writing.
But I want to support languages that use characters that fall outside the ASCII range, while excluding the non-printing or decorative characters.
The objective is to restrict the user input to the application to codepoints that are legitimately used in written language. Because the user input will be saved and displayed, I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters, Unicode flow control characters, etc.
Regrettably, I am not fluent in every single language found in Unicode. Has anyone compiled a list of all of the subset of Unicode characters that are normally used in writing?
The official list of Unicode code points is UnicodeData.txt. This is a plain text file with one line per code point; it's easily machine-readable. For example:
0022;QUOTATION MARK;Po;0;ON;;;;;N;;;;;
The third semicolon-delimited field is the abbreviated name of the "General Category". This is explained further in chapter 4 of the Unicode Standard, specifically in section 4.5; see the table on page 131 (page 12 of the PDF file). For example, "Lu" is uppercase letters, "Ll" is lowercase letters, Pc, Pd, Ps, et al are various kinds of punctuation. (The first letter of the two-letter abbreviation represents a higher-level category such as letter, digit, punctuation, etc.)
Note that some ranges of code points are not listed explicitly. For example, the range of CJK (Chinese, Japanese, Korean) ideographs is represented as:
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
I think there are other files on unicode.org that fill in these gaps.
I'm still not 100% clear on just what subset you're trying to define, but you can probably define it as a particular set of General Category values.
I do not want to allow pranksters to input text consisting entirely of things like diacritics, Unicode combining characters
Diacritics/combining characters will be used in normal written language. So if you want to stop 'pranksters' you're going to need something more sophisticated than just a list of permitted characters. You'll have to do some sort of linguistic analysis for every language you want to permit.
I'd recommend not bothering with this, because it's going to be hard and you won't succeed anyway. Just let people write what they want.
Try WGL4 (652 characters), MES-1 (335 characters) or MES-2 (1062 characters). Find these at Wikipedia.
You may wish to exclude characters IJijĸĿŀʼn˚―⅛⅜⅝⅞♪ from MES-1 if you want to use this set.
Edit: I realize this is a bad answer. Especially the removing characters from MES-1 part was total garbage. I shouldn't have posted this. I'm ashamed of whoever upvoted this.
If anything, use Subset1 (678 characters), Subset2 (1193 characters) and Subset3 (2823 characters). https://unicodesubsets.miraheze.org/wiki/User:PiotrGrochowski

How to enumerate all Unicode canonically equivalent sequences in Perl?

Does there exist a standard Perl module or function that, given a Unicode Combining Character Sequence (or, more generally, an arbitrary Unicode text string), will generate a list of all canonically equivalent strings?
For example, if given the character U+1EAD, I'd like to get back a list of all these canonically equivalent sequences:
0061 0302 0323
0061 0323 0302
00E2 0323
1EA1 0302
1EAD
(I don't particularly care whether the interface is in terms of arrays of USVs or utf strings.)
Is this an XY problem? If you want to compare/match 2 unicode strings and you're worried that different ways of encoding the accented characters would create false negatives, then the best way to do this would be to normalize the 2 strings using one of the normalization functions from Unicode::Normalize, before doing the comparison or match.
Otherwise it gets a little messy.
You could get the complete character name using charnames::viacode(0x1EAD); (for U+1EAD it would be LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW), and get the various composing characters by splitting the name on WITH|AND. Then you could generate all combinations (checking that they exist!) of the base character + modifiers and the other modifiers. At this point you will run into the problem of matching the combining characters names in the full name (eg CIRCUMFLEX) with the combining character real name (COMBINING CIRCUMFLEX ACCENT). There are probably rules for this, but I don't know them.
This would be my naive attempt, there may be better ways of doing this, but since so far no one has volunteered the information...