unicode error preventing creation of text file - unicode

What is causing this error and how can I fix it?
(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
I have also tried reading different files in the same directory an get this same unicode error as well.
file1 = open("C:\Users\Cameron\Desktop\newtextdocument.txt", "w")
for i in range(1000000):
file1.write(str(i) + "\n")

You should escape backslashes inside the string literal. Compare:
>>> print("\U00000023") # single character
#
>>> print(r"\U00000023") # raw-string literal with
\U00000023
>>> print("\\U00000023") # 10 characters
\U00000023
>>> print("a\nb") # three characters (literal newline)
a
b
>>> print(r"a\nb") # four characters (note: `r""` prefix)
a\nb

\U is being treated as the start of a Unicode literal. Use a raw string (a preceding r) to prevent this translation:
>>> 'C:\Users'
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
>>> r'C:\Users'
'C:\\Users'

Related

Swift Regex Search String Except \r\n and \t

I am attempting to match phone numbers that is 6 digits or more with the following regex in swift. Phone numbers can also possess paranthesis and + for country codes.
"[0-9\\s\\-\\+\\(\\)]{6,}".
However, the above implementation matches \r\n and \t as well. How can I write the regex such that it will not match any \r\n or \t.
I attempted the following but didn't work:
"[0-9\\s\\-\\+\\(\\)(^\\r\\n\\t)]{6,}"
"[0-9\\s\\-\\+\\(\\)(?: (\\r|\\n|\\r\\n|\\t)]{6,}"
Thanks.
I suggest using
let regex = "^(?:[ +()-]*[0-9]){6,}[ +()-]*$"
Or
let regex = "^(?:[ +()-]*[0-9]){6,}[ +()-]*\\z"
Details
^ - start of string
(?:[ +()-]*[0-9]){6,} - six or more repetitions of
[ +()-]* - zero or more spaces, +, (, ) or - chars
[0-9] - a digit
[ +()-]* - zero or more spaces, +, (, ) or - chars
$ - end of string (\z is the very end of string).
If the pattern is used inside NSPredicate with MATCHES you may omit the ^ and $/\z anchors.

How to convert string in UTF-8 to ASCII ignoring errors and removing non ASCII characters

I am new to Scala.
Please advise how to convert strings in UTF-8 to ASCII ignoring errors and removing non ASCII characters in output.
For example, how to remove non ASCII character \uc382 from result string: "hello���", so that "hello" is printed in output.
scala.io.Source.fromBytes("hello\uc382".getBytes ("UTF-8"), "US-ASCII").mkString
val str = "hello\uc382"
str.filter(_ <= 0x7f) // keep only valid ASCII characters
If you had text in UTF-8 as bytes that is now in a String then it was converted.
If you have text in a String and you want it in ASCII as bytes, you can convert it later.
It seems that you just want to filter for only the UTF-16 code units for the C0 Controls and Basic Latin codepoints. Fortunately, such codepoints take only one code unit so we can filter them directly without converting them to codepoints.
"hello\uC382"
.filter(Character.UnicodeBlock.of(_) == Character.UnicodeBlock.BASIC_LATIN)
.getBytes(StandardCharsets.US_ASCII)
.foreach {
println }
With the question generalized to an arbitrary, known character encoding, filtering doesn't do the job. Instead, the feature of the encoder to ignore characters that are not present in the target Charset can be used. An Encoder requires a bit more wrapping and unwrapping. (The API design is based on streaming and reusing the buffer within the same stream and even other streams.) So, with ISO_8859_1 as an example:
val encoder = StandardCharsets.ISO_8859_1
.newEncoder()
.onMalformedInput(CodingErrorAction.IGNORE)
.onUnmappableCharacter(CodingErrorAction.IGNORE)
val string = "ñhello\uc382"
println(string)
val chars = CharBuffer.allocate(string.length())
.put(string)
chars.rewind()
val buffer = encoder.encode(chars)
val bytes = Array.ofDim[Byte](buffer.remaining())
buffer.get(bytes)
println(bytes)
bytes
.foreach {
println }

Unicode confusion #3423435

Once again I enter that goddamn unicode-hell ... sigh =(
There are two files:
$ file *
kreise_tmp.geojson: ASCII text
pandas_tmp.csv: UTF-8 Unicode text
I read the first file like this:
with open('kreise_tmp.geojson') as f:
jdata = json.loads(f.read())
I read the second file like this:
pandas_data = pd.read_csv(r'pandas_tmp.csv', sep=";")
Now check out what's inside the strings:
>>> jdata['features'][0]['properties']['name']
u'Kreis Euskirchen' # a unicode string?
>>> pandas_data['kreis'][0]
'Kreis D\xc3\xbcren' # not a unicode string?
Why are the strings from the "UTF-8 Unicode text" file just normal strings and the strings from the "ASCII text" file unicode strings?
JSON strings are always Unicode.
~$ python2
>>> import json
>>> json.loads('"\xc3\xbc"')
u'\xfc'
But they are often serialized with \u escapes, so file will only see ASCII.
>>> json.dumps(_)
'"\\u00fc"'
add encoding='utf-8' to the opening of files to decode them with utf-8
pandas_data = pd.read_csv(r'pandas_tmp.csv', sep=";", encoding='utf8')
you can also do the same with the JSON
with open('kreise_tmp.geojson', encoding='utf8') as f:
jdata = json.loads(f.read())
Also in Python 2.7, you can add this to the top of the file..
#!/usr/bin/env python
# -*- coding: utf-8 -*-

UnicodeEncodeErrors while using DictWriter for utf-8

I am trying to write a dictionary containing utf-8 strings to a CSV. I'm following the instructions from here. However, despite meticulously encoding and decoding these utf-8 strings, I am getting a UnicodeEncodeErrors involving 'ascii' sets.
I have a list of dictionaries which contain strings and ints as values related to changes to Wikipedia articles. The list below corresponds to this change, for example:
edgelist = [{'articleName': 'Barack Obama', 'editorName': 'Schonbrunn', 'revID': '121844749', 'bytesAdded': '183'},
{'articleName': 'Barack Obama', 'editorName': 'Eep\xc2\xb2', 'revID': '121862749', 'bytesAdded': '107'}]
The problem is list[1]['editorName']. It has type 'str' and el[1]['editorName'].decode('utf-8') is u'Eep\xb2'
The code I am attempting is:
_ENCODING = 'utf-8'
def dictToCSV(edgelist,output_file):
with codecs.open(output_file,'wb',encoding=_ENCODING) as f:
w = csv.DictWriter(f,sorted(edgelist[0].keys()))
w.writeheader()
for d in edgelist:
for k,v in d.items():
if type(v) == int:
d[k]=str(v).encode(_ENCODING)
w.writerow({k:v.decode(_ENCODING) for k,v in d.items()})
This returns:
dictToCSV(edgelist,'test2.csv')
File "csv_to_charts.py", line 129, in dictToCSV
w.writerow({k:v.decode(_ENCODING,'ignore') for k,v in d.items()})
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 148, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb2' in position 3: ordinal not in range(128)
Other permutations such as swapping decode for encode or nothing in the final problematic line also return errors:
w.writerow({k:v.encode(_ENCODING) for k,v in d.items()}) returns 'UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 56: ordinal not in range(128)
w.writerow({k:v for k,v in d.items()}) returns UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 56: ordinal not in range(128)
Following this, I changed with codecs.open(output_file,'wb',encoding=_ENCODING) as f: to with open(output_file,'wb') as f: and still receive the same error.
Excluding the list element(s) or the keys containing this problematic string, the script works fine otherwise.
I just edited your code as follows and the csv was written successfully.
from django.utils.encoding import smart_str
import csv
def dictToCSV(edgelist, output_file):
f = open(output_file, 'wb')
w = csv.DictWriter(f, fieldnames=sorted(edgelist[0].keys()))
w.writeheader()
for d in edgelist:
w.writerow(dict(k=smart_str(v)) for k, v in d.items())
f.close()
Copy the Django code and customize it to your need.
A strict interpretation of ASCII encoding only allows ordinals 0-127. Any value outside that range is not ASCII by definition. Since both \xc2 & \xb2 have ordinals higher than 127, they cannot be interpreted as ASCII.
I'm not a Python user, the RFC for CSV mentions ASCII as a common usage but defines an optional 'charset' parameter for the MIME type; I wonder if the writer you're using also might have an 'encoding' setting?
Your strings are already in UTF-8, and DictWriter doesn't work with codecs.open. Following that example:
# coding: utf-8
import csv
edgelist = [
{'articleName': 'Barack Obama', 'editorName': 'Schonbrunn', 'revID': '121844749', 'bytesAdded': '183'},
{'articleName': 'Barack Obama', 'editorName': 'Eep\xc2\xb2', 'revID': '121862749', 'bytesAdded': '107'}]
with open('out.csv','wb') as f:
f.write(u'\ufeff'.encode('utf8')) # BOM (optional...Excel needs it to open UTF-8 file properly)
w = csv.DictWriter(f,sorted(edgelist[0].keys()))
w.writeheader()
for d in edgelist:
w.writerow(d)
Output:
articleName,bytesAdded,editorName,revID
Barack Obama,183,Schonbrunn,121844749
Barack Obama,107,Eep²,121862749
Note, you can use 'editorName': 'Eep²' directly instead of 'editorName': 'Eep\xc2\xb2'. The byte string will be UTF-8-encoded per the # coding: utf-8 and if you save the source file in UTF-8.

Base64ing Unicode characters

Can Unicode characters be encoded and decoded with Base64?
I have attempted to encode the string 'الله', but when I decoded it all I got was '????'.
Base64 converts binary to text. If you want to convert text to a base64 format, you'll need to convert the text to binary using some appropriate encoding (e.g. UTF-8, UTF-16) first.
Of course they can. It depends on how your language or Base64 routine handles Unicode input. For example, Python's b64 routines expect an encoded string (as Base64 encodes binary to text, not Unicode codepoints to text).
Python 2.5.1 (r251:54863, Jul 31 2008, 22:53:39)
[GCC 4.1.2 (Ubuntu 4.1.2-0ubuntu4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> a = 'ûñö'
>>> import base64
>>> base64.b64encode(a)
'w7vDscO2'
>>> base64.b64decode('w7vDscO2')
'\xc3\xbb\xc3\xb1\xc3\xb6'
>>> print '\xc3\xbb\xc3\xb1\xc3\xb6'
ûñö
>>>
>>> u'üñô'
u'\xfc\xf1\xf4'
>>> base64.b64encode(u'\xfc\xf1\xf4')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.5/base64.py", line 53, in b64encode
encoded = binascii.b2a_base64(s)[:-1]
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-2: ordinal not in range(128)
>>> base64.b64encode(u'\xfc\xf1\xf4'.encode('utf-8'))
'w7zDscO0'
>>> base64.b64decode('w7zDscO0')
'\xc3\xbc\xc3\xb1\xc3\xb4'
>>> print base64.b64decode('w7zDscO0')
üñô
>>> a = 'الله'
>>> a
'\xd8\xa7\xd9\x84\xd9\x84\xd9\x87'
>>> base64.b64encode(a)
'2KfZhNmE2Yc='
>>> b = base64.b64encode(a)
>>> print base64.b64decode(b)
الله
You didn't specify which language(s) you're using, but try converting the string to a byte array (however that's done in your language of choice) and then base64 encoding that byte array.
In .NET you can try this (encode):
byte[] encbuf;
encbuf = System.Text.Encoding.Unicode.GetBytes(input);
string encoded = Convert.ToBase64String(encbuf);
...and to decode:
byte[] decbuff;
decbuff = Convert.FromBase64String(this.ToString());
string decoded = System.Text.Encoding.Unicode.GetString(decbuff);