How do I match accented characters in an ANSI text file? - gold-parser

I'm trying to match accented characters in an ANSI-encoded file, but they are recognized as EOF.
The text file is:
VALB. VIÑAS - CICLÓN
The grammar I'm using is:
"Start Symbol" = <Value>
{String Ch} = {All Printable} + {Control Codes} + {All Valid}
String = {String Ch}+
<Value> ::= String
The parser treats the Ñ character as EOF and terminates.
Is there a way to tell GOLD Parser that the file is ANSI?
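No answer is recorded for this question. One possible workaround, sketched in Python, is to re-encode the file before handing it to the parser. It assumes that "ANSI" here means Windows-1252 and that the parser engine in use reads UTF-8 input correctly; neither is confirmed above. The function name and its parameters are hypothetical.

```python
from pathlib import Path

def reencode_ansi_to_utf8(src, dst, ansi_codec="cp1252"):
    """Decode src with the assumed ANSI code page and write it to dst as UTF-8."""
    # Assumption: the file was saved as Windows-1252 ("ANSI" on Western systems).
    text = Path(src).read_bytes().decode(ansi_codec)
    Path(dst).write_bytes(text.encode("utf-8"))
    return text
```

If the file was saved with a different code page, pass its codec name as ansi_codec.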

Related

Commas in SUMMARY icalendar

Is it OK to have commas in the SUMMARY property of an .ics document?
I am using calcurse to load an .ics file, and it doesn't load an event whose summary contains commas.
According to the RFC 5545 specification, commas need to be escaped with a backslash in that situation. See:
SUMMARY is defined here: https://www.rfc-editor.org/rfc/rfc5545#section-3.8.1.12 with Value Type: TEXT
TEXT is defined here: https://www.rfc-editor.org/rfc/rfc5545#section-3.3.11
Here is part of the above specification that describes what to do with certain characters if you want to include them in a text value:
text = *(TSAFE-CHAR / ":" / DQUOTE / ESCAPED-CHAR)
; Folded according to description above
ESCAPED-CHAR = ("\\" / "\;" / "\," / "\N" / "\n")
; \\ encodes \, \N or \n encodes newline
; \; encodes ;, \, encodes ,
TSAFE-CHAR = WSP / %x21 / %x23-2B / %x2D-39 / %x3C-5B /
%x5D-7E / NON-US-ASCII
; Any character except CONTROLs not needed by the current
; character set, DQUOTE, ";", ":", "\", ","
Description: If the property permits, multiple TEXT values are
specified by a COMMA-separated list of values.
...
The "TEXT" property values may also contain special characters
that are used to signify delimiters, such as a COMMA character for
lists of values or a SEMICOLON character for structured values.
In order to support the inclusion of these special characters in
"TEXT" property values, they MUST be escaped with a BACKSLASH
character. .... A COMMA character in
a "TEXT" property value MUST be escaped with a BACKSLASH
character. ....
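The escaping rules quoted above can be sketched in Python as follows. This is a minimal illustration, not a complete iCalendar writer: the function name escape_ical_text is hypothetical, and line folding is not handled.

```python
def escape_ical_text(value):
    """Escape a string for use as an RFC 5545 TEXT value."""
    # Escape backslashes first, so the backslashes added below are not doubled.
    value = value.replace("\\", "\\\\")
    value = value.replace(";", "\\;").replace(",", "\\,")
    # Normalize line breaks, then encode each one as the two characters \n.
    value = value.replace("\r\n", "\n").replace("\r", "\n")
    return value.replace("\n", "\\n")

print("SUMMARY:" + escape_ical_text("Lunch, then review; bring notes"))
# prints: SUMMARY:Lunch\, then review\; bring notes
```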

Swift String including Special Characters

I have a user enter a multi-line string in an NSTextView.
var textViewString = textView.textStorage?.string
Printing the string ( print(textViewString) ), I get a multi-line string, for example:
hello this is line 1
and this is line 2
I want a swift string representation that includes the new line characters. For example, I want print(textStringFlat) to print:
hello this is line 1\n\nand this is line 2
What do I need to do to textViewString to expose the special characters?
If you just want to replace the newlines with the literal characters \ and n then use:
let escapedText = someText.replacingOccurrences(of: "\n", with: "\\n")

Replace emdash with double dash

I want to replace ― with --
I tried working with the UTF-8 encoding, but that doesn't work.
string = "blablabla -- blablabla ―"
I want to replace the long dash (if there is one) with double hyphens. I tried it the simple way but that didn't work:
string= string.replace ("―", "--")
I also tried to encode it with utf8 and use the codes of the special characters
stringutf8= string.encode("utf-8")
emdash= u"\u2014"
hyphen= u"\u002D"
if emdash in stringutf8:
    stringutf8.replace(emdash, 2*hyphen)
Any suggestions?
I am working with text files in which sometimes apparently the two hyphens are replaced automatically with a long dash...
thanks a lot!
You are dealing with strings here. Strings are sequences of characters, so replace the character and leave the encoding out of the equation. Note that replace returns a new string, so the result has to be assigned.
string = 'blablabla -- blablabla \u2014'
emdash = '\u2014'
hyphen = '\u002D'
string2 = string.replace(emdash, 2*hyphen)
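Several Unicode characters look like a long dash, for example U+2014 (em dash) and U+2015 (horizontal bar), so a replacement that targets only one of them can appear to do nothing. A small sketch that covers both; the helper name and the choice of these two code points are assumptions, not part of the answer above:

```python
# Code points that commonly render as a long dash.
LONG_DASHES = ("\u2014", "\u2015")

def dashes_to_double_hyphen(text):
    """Return a copy of text with long dashes replaced by two hyphens."""
    for dash in LONG_DASHES:
        # str.replace returns a new string, so the result is reassigned.
        text = text.replace(dash, "--")
    return text

print(dashes_to_double_hyphen("blablabla -- blablabla \u2015"))
# prints: blablabla -- blablabla --
```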

Filtering out all non-kanji characters in a text with Python 3

I have a text that contains Latin letters and Japanese characters (hiragana, katakana and kanji).
I want to filter out all Latin characters, hiragana and katakana, but I am not sure how to do this in an elegant way.
My direct approach would be to filter out every single letter of the Latin alphabet in addition to every single hiragana/katakana character, but I am sure there is a better way.
I am guessing that I have to use a regex, but I am not quite sure how to go about it. Are characters somehow classified as Roman, Japanese, Chinese, etc.?
If yes, could I somehow use this?
Here some sample text:
"Lesson 1:",, "私","わたし","I" "私たち","わたしたち","We" "あ なた","あなた","You" "あの人","あのひと","That person" "あの方","あのかた","That person (polite)" "皆さん","みなさん"
The program should return only the kanji (Chinese characters), like this:
`私、人,方,皆`
I found the answer thanks to Olsgaarddk on reddit.
https://github.com/olsgaard/Japanese_nlp_scripts/blob/master/jp_regex.py
# -*- coding: utf-8 -*-
import re
''' This is a library of functions and variables that are helpful to have handy
when manipulating Japanese text in python.
This is optimized for Python 3.x, and takes advantage of the fact that all strings are unicode.
Copyright (c) 2014-2015, Mads Sørensen Ølsgaard
All rights reserved.
Released under BSD3 License, see http://opensource.org/licenses/BSD-3-Clause or license.txt '''
## UNICODE BLOCKS ##
# Regular expression unicode blocks collected from
# http://www.localizingjapan.com/blog/2012/01/20/regular-expressions-for-japanese-text/
hiragana_full = r'[ぁ-ゟ]'
katakana_full = r'[゠-ヿ]'
kanji = r'[㐀-䶵一-鿋豈-頻]'
radicals = r'[⺀-⿕]'
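katakana_half_width = r'[\uFF5F-\uFF9F]'  # halfwidth forms, U+FF5F to U+FF9F
alphanum_full = r'[\uFF01-\uFF5E]'  # fullwidth ASCII variants, U+FF01 to U+FF5E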
symbols_punct = r'[、-〿]'
misc_symbols = r'[ㇰ-ㇿ㈠-㉃㊀-㋾㌀-㍿]'
ascii_char = r'[ -~]'
## FUNCTIONS ##
def extract_unicode_block(unicode_block, string):
    ''' extracts and returns all texts from a unicode block from string argument.
    Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.findall( unicode_block, string)
def remove_unicode_block(unicode_block, string):
    ''' removes all characters from a unicode block and returns all remaining texts from string argument.
    Note that you must use the unicode blocks defined above, or patterns of similar form '''
    return re.sub( unicode_block, '', string)
## EXAMPLES ##
text = '初めての駅 自由が丘の駅で、大井町線から降りると、ママは、トットちゃんの手を引っ張って、改札口を出ようとした。ぁゟ゠ヿ㐀䶵一鿋豈頻⺀⿕⦅゚abc!~、〿ㇰㇿ㈠㉃㊀㋾㌀㍿'
print('Original text string:', text, '\n')
print('All kanji removed:', remove_unicode_block(kanji, text))
print('All hiragana in text:', ''.join(extract_unicode_block(hiragana_full, text)))
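Applied to the question itself, the same idea might look like the sketch below. It assumes the kanji pattern is meant to cover U+3400 to U+4DB5, U+4E00 to U+9FCB and U+F900 to U+FA6A, and writes the ranges as escapes so that copy-and-paste normalization cannot alter them. The helper unique_kanji and the shortened sample are illustrative only.

```python
import re

# Assumed kanji ranges, written as escapes rather than literal characters.
KANJI = re.compile(r"[\u3400-\u4DB5\u4E00-\u9FCB\uF900-\uFA6A]")

def unique_kanji(text):
    """Return the distinct kanji in text, in order of first appearance."""
    # dict.fromkeys drops duplicates while keeping insertion order.
    return list(dict.fromkeys(KANJI.findall(text)))

sample = '"私","わたし","I" "私たち","わたしたち","We" "あの人","あのひと" "あの方","あのかた" "皆さん","みなさん"'
print(",".join(unique_kanji(sample)))
# prints: 私,人,方,皆
```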

unicode error preventing creation of text file

What is causing this error and how can I fix it?
(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
I have also tried reading different files in the same directory and get the same unicode error.
file1 = open("C:\Users\Cameron\Desktop\newtextdocument.txt", "w")
for i in range(1000000):
    file1.write(str(i) + "\n")
You should escape backslashes inside the string literal. Compare:
>>> print("\U00000023") # single character
#
>>> print(r"\U00000023") # raw-string literal with
\U00000023
>>> print("\\U00000023") # 10 characters
\U00000023
>>> print("a\nb") # three characters (literal newline)
a
b
>>> print(r"a\nb") # four characters (note: `r""` prefix)
a\nb
\U is being treated as the start of a Unicode escape sequence. Use a raw string (a preceding r) to prevent this translation:
>>> 'C:\Users'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
>>> r'C:\Users'
'C:\\Users'
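For the path in the question, three equivalent spellings avoid the problem. This is a small sketch using the asker's path; nothing is written to disk. Note that the \n in \newtextdocument.txt would also be read as a newline in a normal string literal, so the \U error is not the only hazard in that path.

```python
from pathlib import PureWindowsPath

# Raw string: backslashes are kept as-is.
raw_path = r"C:\Users\Cameron\Desktop\newtextdocument.txt"
# Normal string with every backslash doubled.
escaped_path = "C:\\Users\\Cameron\\Desktop\\newtextdocument.txt"
# Forward slashes, which Windows APIs and pathlib also accept.
slash_path = "C:/Users/Cameron/Desktop/newtextdocument.txt"

print(raw_path == escaped_path)                       # prints: True
print(str(PureWindowsPath(slash_path)) == raw_path)   # prints: True
```

Any of the three can be passed to open() in place of the original literal.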