rake db:seed for i18n data - rake

I am running rake db:seed to populate i18n data in the database, but it doesn't recognize the i18n characters. The error I get is:
rake aborted!
$HOME/db/seeds.rb:9: invalid multibyte char (US-ASCII)
$HOME/db/seeds.rb:9: invalid multibyte char (US-ASCII)
$HOME/db/seeds.rb:9: syntax error, unexpected $end, expecting '}'
Do I need to convert the native language strings into UTF-8 characters before calling the rake?

Just add the following line as the first line of your seeds.rb file:
# -*- coding: utf-8 -*-
UPDATE:
In Ruby 2.0 and above you no longer need to do this; UTF-8 is now the default source encoding.

Related

Getting following error on generating language scorer on Deepspeech

File "generate_scorer_package", line 1
SyntaxError: Non-UTF-8 code starting with '\xea' in file generate_scorer_package on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Before answering this question, I am going to make some assumptions:
Firstly, I believe you are following the DeepSpeech Playbook and are at the step in generating a kenlm.scorer file, as documented here
Secondly, I am going to assume that you are using a Python editor of some description, like PyCharm.
The error SyntaxError: Non-UTF-8 code starting with '\xea' in file generate_scorer_package on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details is not related to DeepSpeech; it is related to the Python encoding of the file that is being executed.
Python 3 assumes that the encoding of the .py file is UTF-8; however some editors - particularly editors in other locales - can override this setting.
To force the file to UTF-8 encoding, add the following code to the top of the generate_scorer_package.py file:
# coding: utf8
NOTE: It MUST be on the first or second line of the file (per PEP 263).
Alternatively, identify where in your editor the encoding is set, and change it.
See also these Stack Overflow questions that are similar:
SyntaxError: Non-UTF-8 code starting with '\x92' in file D:\AIAssistant\build\gui.py on line 92, but no encoding declared;
SyntaxError: Non-UTF-8 code starting with '\x82'

ERROR: invalid byte sequence for encoding "UTF8": 0xff

I'm importing a database called AdventureWorks into PostgreSQL
and this message appears:
ERROR: invalid byte sequence for encoding "UTF8": 0xff
CONTEXT: COPY businessentity, line 1
SQL state: 22021
As the error says, the byte 0xFF isn't valid in a UTF8 file. Since you're trying to load data from a SQL Server sample database, I suspect the file was saved as UTF16 with a Byte Order Mark. Unicode isn't a single encoding. Unicode text files can contain a signature at the start which specifies the encoding used in the file; for UTF16 the BOM can be 0xFF 0xFE or 0xFE 0xFF, values which are invalid in UTF8.
As far as I know you can't specify a UTF16 encoding with COPY, so you'll have to either convert the CSV file to UTF8 with a command line tool or export it again as UTF8. If you exported the data using any SQL Server tool (SSMS, SSIS, bcp) you can easily specify the encoding you want. For example :
bcp Person.BusinessEntity out "c:\MyPath\BusinessEntity.csv" -c -C 65001
This exports the data using code page 65001, which is UTF-8.
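If re-exporting isn't an option, you can also sniff the BOM and re-encode the file yourself before loading it. A rough Python sketch, assuming the export really is UTF-16 with a BOM as described above (sniff_bom, to_utf8 and the file paths are illustrative names, not part of any tool mentioned here):

```python
import codecs

# Peek at the first bytes to see which BOM, if any, the file carries.
def sniff_bom(path):
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    return "utf-8"  # no BOM: assume the file is already plain UTF-8

# Re-encode the file as plain UTF-8 so COPY will accept it.
def to_utf8(src, dst):
    with open(src, "r", encoding=sniff_bom(src), newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for chunk in iter(lambda: fin.read(64 * 1024), ""):
            fout.write(chunk)
```

After converting, point COPY at the UTF-8 copy instead of the original export.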

Writing accented characters from user input to a text file Python 3.7

Hello I have the following code snippet:
while True:
    try:
        entry = input("Input element: ")
        print(entry)
        with open(fileName, 'a', encoding='UTF-8') as thisFile:
            thisFile.write(entry)
    except KeyboardInterrupt:
        break
This basically gets input continuously and writes it to a file until manually interrupted. However, when the user inputs something like Ñ, it fails with: UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed. I explicitly set the UTF-8 encoding and even tried latin-1, but I still get the same error. I have also put # -*- coding: utf-8 -*- at the top of my code and tried thisFile.write(entry.encode('utf-8')), but it still gives me the error.
Setting the following environment variables fixed it for me.
export LANG=C.UTF-8
export LC_ALL=C.UTF-8
or another method is running it via:
PYTHONIOENCODING="UTF-8" python3 writetest.py
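For background, that error means the string handed to write() contains lone surrogate code points, which typically appear when terminal input is decoded under the wrong locale (exactly what the LANG/LC_ALL fix above addresses). A minimal sketch that reproduces the failure and shows one way to recover the original bytes; the latin-1 at the end is an assumption about what the terminal actually sent:

```python
# A lone surrogate in a str cannot be encoded as UTF-8: this is the
# same "surrogates not allowed" failure described in the question.
entry = "N\udcd1"  # what a mis-decoded terminal byte can look like

try:
    entry.encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc.reason)  # -> surrogates not allowed

# surrogateescape turns the escaped surrogate back into its raw byte
# (0xD1 here), which can then be decoded with the codec the terminal
# really used -- latin-1 maps 0xD1 to the intended 'Ñ'.
raw = entry.encode("utf-8", errors="surrogateescape")
print(raw.decode("latin-1"))  # -> NÑ
```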

Character with byte sequence 0x9d in encoding 'WIN1252' has no equivalent in encoding 'UTF8'

I am reading a CSV file in my SQL script and copying its data into a PostgreSQL table. The line of code is below:
\copy participants_2013 from 'C:/Users/Acrotrend/Desktop/mip_sahil/mip/reelportdata/Participating_Individual_Extract_Report_MIPJunior_2013_160414135957.Csv' with CSV delimiter ',' quote '"' HEADER;
I am getting following error : character with byte sequence 0x9d in encoding 'WIN1252' has no equivalent in encoding 'UTF8'.
Can anyone help me with what the cause of this issue and how can I resolve it?
The problem is that 0x9D is not a valid byte value in WIN1252.
There's a table here: https://en.wikipedia.org/wiki/Windows-1252
The problem may be that you are importing a UTF-8 file and PostgreSQL is defaulting to Windows-1252 (which I believe is the default on many Windows systems).
You can change the character set on your Windows command line with chcp before running the script. Or in PostgreSQL you can:
SET CLIENT_ENCODING TO 'utf8';
Before importing the file.
Simply specify encoding 'UTF-8' as the encoding in the \copy command, e.g. (I broke it into two lines for readability but keep it all on the same line):
\copy dest_table from 'C:/src-data.csv'
(format csv, header true, delimiter ',', encoding 'UTF8');
More details:
The problem is that the Client Encoding is set to WIN1252, most likely because it is running on a Windows machine, but the file has a UTF-8 character in it.
You can check the Client Encoding with
SHOW client_encoding;
client_encoding
-----------------
WIN1252
Every encoding has numeric ranges of valid codes. Are you sure your data really are in WIN1252 encoding?
Postgres is very strict and won't import files with broken encoding. You can use iconv, which can work in a tolerant mode and remove broken characters. After cleaning with iconv you can import the file.
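If you'd rather do that tolerant cleanup in Python than with iconv, decoding with errors='ignore' drops unmappable bytes such as 0x9D much like iconv's -c flag. A sketch with illustrative names (clean_file and the paths are not from the question):

```python
# Decode as WIN1252 but silently drop bytes the codec cannot map
# (0x81, 0x8D, 0x8F, 0x90 and 0x9D are undefined in cp1252),
# then write the surviving text back out as clean UTF-8.
def clean_file(src, dst, source_encoding="cp1252"):
    with open(src, "rb") as f:
        raw = f.read()
    text = raw.decode(source_encoding, errors="ignore")
    with open(dst, "w", encoding="utf-8") as f:
        f.write(text)
```

Like iconv -c, this loses the bad characters, so diff the input and output if you want to know what was removed.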
I had this problem today and it was because inside of a TEXT column I had fancy quotes that had been copy/pasted from an external source.

How to debug Postgres copy command failure

I have around 75k records which I am loading into a Postgres table using the copy command, and it is failing. I get an exception:
ERROR: invalid byte sequence for encoding "UTF8": 0xbd
Now I need to find which line has this entry. Is there any way to do this? I am thinking along the lines of enabling some Postgres logging that might help, or any other solution.
Note: I am getting the issue with only one particular file. Other files are getting loaded without issues
I always seem to get a line-number in my error, no matter whether I use COPY or \copy and feed a file via redirection or -f.
ERROR: invalid byte sequence for encoding "UTF8": 0xa3
CONTEXT: COPY z, line 3
If there are only a couple of bad chars and you just want to strip them you can use iconv (assuming you're on a unix-like system).
iconv -c --from=utf8 --to=utf8 /tmp/badchars.txt > /tmp/stripped.txt
You could always run diff against the before + after versions if you wanted to see what was stripped out.
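If your client doesn't print the line number, a short Python scan can locate it: read the file as raw bytes and try to decode each line as UTF-8, recording every failure (find_bad_lines is an illustrative name):

```python
# Report (line number, byte offset in line, offending byte value)
# for every line that is not valid UTF-8.
def find_bad_lines(path):
    bad = []
    with open(path, "rb") as f:
        for lineno, line in enumerate(f, start=1):
            try:
                line.decode("utf-8")
            except UnicodeDecodeError as exc:
                bad.append((lineno, exc.start, line[exc.start]))
    return bad
```

The line numbers should roughly match the CONTEXT line COPY reports, keeping in mind that COPY counts data rows, so a HEADER row can put them off by one.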