"The encoding 'GB2312' is not supported" warning when reading a CSV in MATLAB

I tried to implement k-means in MATLAB. However, when I use csvread('Filename'); in my program, it gives me the warning "The encoding 'GB2312' is not supported." and the program can't read the CSV data. Can anybody tell me what is wrong?
data=csvread('ClusterSamples.csv');
plot(data(:,1),data(:,2),'r+');
[m,n]=size(data);

The character encoding is not supported.
If you're using Mac or Linux, you can use the iconv(1) tool to convert the file to UTF-8:
cp ClusterSamples.csv ClusterSamples.csv.old && \
iconv -f GB2312 -t UTF-8 < ClusterSamples.csv.old > ClusterSamples.csv
If not, you can use a text editor to change the character encoding and resave the file.
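If you'd rather script the conversion, a short Python program does the same job as iconv. This is a minimal sketch, assuming the file really is GB2312-encoded; the output filename is arbitrary:
# Re-encode the CSV from GB2312 to UTF-8 so csvread can parse it.
# Assumption: the source file is actually GB2312-encoded.
with open("ClusterSamples.csv", encoding="gb2312") as src:
    text = src.read()
with open("ClusterSamples_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)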

Related

Babel writes a UTF-16 file; how can I make it write UTF-8?

When I run babel --plugins transform-react-jsx like_button.jsx > like_button.js, the resulting like_button.js is UTF-16 encoded (and like_button.jsx has some 8-bit encoding, probably UTF-8).
How can I make Babel write like_button.js UTF-8 encoded?
Babel's output is definitely UTF-8. Since you are seeing UTF-16 in your file, and the file is being written by your terminal, it seems most likely that your terminal is re-encoding the data before writing it to a file.
The easiest option for you would be to change from
-babel --plugins transform-react-jsx like_button.jsx > like_button.js
+babel --plugins transform-react-jsx like_button.jsx --out-file like_button.js
so that Babel itself is responsible for writing the output to the file, which removes the terminal from the equation.
If you don't want to do that, you'll need to look into your terminal options to see if there is an explicit encoding set somewhere.
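To confirm that theory, check whether the file starts with a UTF-16 byte-order mark. A minimal Python sketch, using the filename from the question:
# Inspect the first two bytes of Babel's output for a UTF-16 BOM.
with open("like_button.js", "rb") as f:
    head = f.read(2)
if head in (b"\xff\xfe", b"\xfe\xff"):
    print("UTF-16 BOM found: the terminal re-encoded Babel's output.")
else:
    print("No UTF-16 BOM at the start of the file.")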

pandoc: Cannot decode byte '\xd0': Data.Text.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream

I got this error when I ran pandoc --filter pandoc-citeproc myfile.markdown myfile.pdf:
pandoc: Cannot decode byte '\xd0': Data.Text.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream
I have searched here and here, but I have double-checked in the text editor and my file is UTF-8 encoded. It has accented Spanish characters, but the same command worked without any problem in the past. Any pointers to a solution would be appreciated.
My bad. The problem is related to the command I used to tell pandoc to create the PDF output. The proper command should be:
pandoc --filter pandoc-citeproc myfile.markdown -o myfile.pdf
Note the -o flag between the input markdown file and the output PDF file. Without it, pandoc treats myfile.pdf as a second input file and tries to decode it as UTF-8, which is why I got the same UTF-8 message that people trying to convert from PDF to other formats documented in my links.
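When the input file itself is suspect, it helps to locate the first invalid byte directly. A minimal Python sketch (pass the file to check as the first argument):
# Report the offset of the first byte that is not valid UTF-8.
import sys

data = open(sys.argv[1], "rb").read()
try:
    data.decode("utf-8")
    print("File is valid UTF-8.")
except UnicodeDecodeError as e:
    print(f"Invalid byte {data[e.start]:#04x} at offset {e.start}")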
Check JabRef encoding
In my case, I bumped into a similar error when converting Pandoc Markdown to XHTML. The culprit was a set of BibTeX citations which JabRef had encoded by default in ISO8859_1.
This default JabRef behaviour can be changed once and for all by setting Default encoding: to UTF8 in JabRef's Options > Preferences > General menu.

wiki dump encoding

I'm using WikiPrep to process the latest wiki dump enwiki-20121101-pages-articles.xml.bz2. Instead of "use Parse::MediaWikiDump;" I replaced it with "use MediaWiki::DumpFile::Compat;" and made the proper changes in the code. Then I ran
perl wikiprep.pl -f enwiki-20121101-pages-articles.xml.bz2
I got an error
enwiki-20121101-pages-articles.xml.bz2:1: parser error : Document is empty
BZh91AY&SY±H¦ÂOÿ~Ð`ÿÿÿ¿ÿÿÿ¿ÿÿÿÿÿÿÿÿÿÿ½ÿýþdß8õEnÞ¶zëJ¨Eà®mEÓP|f÷Ô
^
I guess there are some non-UTF-8 characters contained in the dump, so I ran
iconv -f utf8 -t utf8 enwiki-20121101-pages-articles.xml.bz2
And indeed, I got some errors
BZh91AY&SYiconv: illegal input sequence at position 10
So, my question is: what's the encoding of the wiki dump, and if I wish to convert it to UTF-8, what shall I do? Or how should I modify wikiprep.pl to avoid such problems?
Many thanks.
[solved] I should have unzipped the file first.
You are running iconv on the compressed (bz2) version of the file, rather than the XML file itself. Uncompress it first.
(Posting borrible's answer so that this resolved question is not listed as unanswered.)
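For reference, the XML can also be read straight from the compressed dump without unpacking it to disk first; a small sketch using Python's bz2 module:
# Stream the XML out of the .bz2 archive, decompressing on the fly.
import bz2

with bz2.open("enwiki-20121101-pages-articles.xml.bz2", "rt", encoding="utf-8") as f:
    for _ in range(3):
        print(f.readline().rstrip())  # peek at the first few lines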

Arabic text inside database looks garbled after UTF-8 conversion

I converted my database following this tutorial:
http://en.gentoo-wiki.com/wiki/Convert_latin1_to_UTF-8_in_MySQL
but I didn't notice that the Arabic characters INSIDE the database are garbled, like
اوÙاµ ®ØµØ… „Ù‡ Øكلق§Ø‡Ø°Ù…ا؄مشٳÙÙ‹ ÙÙ„...
Through the PHP script that connects to the database everything looks GOOD, but inside the database the Arabic characters look like that.
I tried to return the database to the old encoding, which is WINDOWS-1256, using iconv with the following command:
# iconv -f UTF-8 -t WINDOWS-1252 database.sql > database_1252.sql
I got this error:
iconv: illegal input sequence at position
so I tried to run the command again with the -c option:
# iconv -c -f UTF-8 -t WINDOWS-1252 database.sql > database_1252.sql
It worked, and I can see the Arabic characters inside the database, but a lot of characters are missing. For example:
i would like to go shopping
after the conversion becomes
i would like to
I want to know how I can fix the Arabic characters so they read correctly inside the database, complete, with nothing missing.
Thanks
Wait, wait ... you say your database was in WINDOWS-1256 (or WINDOWS-1252?) and you converted it based on a latin1 -> utf8 tutorial? No wonder the characters are malformed.
I wouldn't trust the tutorial's solution at all. I would recommend that you restore your former version of the database and use MySQL's ALTER TABLE ... CONVERT TO CHARACTER SET command to change the encoding.
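As an aside, the sample in the question looks like classic mojibake: UTF-8 bytes decoded with a single-byte codepage. If that is what happened, the damage is often reversible. A hedged Python sketch; treating cp1256 (the Arabic Windows codepage) as the mis-applied decoder is an assumption:
# Demonstrate mojibake and its reversal on a known-good Arabic string.
# Assumption: cp1256 was the wrong decoder applied to the UTF-8 bytes.
original = "مرحبا"                                   # "hello" in Arabic
mangled = original.encode("utf-8").decode("cp1256")  # how the damage happens
repaired = mangled.encode("cp1256").decode("utf-8")  # and how to undo it
assert repaired == original
print(mangled, "->", repaired)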

Convert GB2312 to UTF-8

I have a text file that contains localized language strings that is currently encoded in GB2312 (simplified Chinese), but all of my other language files are in UTF-8. I am finding it very difficult to work with this file, as none of my text editors will work properly with it and keep corrupting it. Are there any tools to convert this to UTF-8, and are there any downsides to doing this? Would it be better to just keep it as GB2312 and use a different editor (if so, can you recommend one)?
Update: I'm using Windows XP (English install).
Update #2: I've tried using Notepad++ and Notepad2 to edit the GB2312 files, but both are unable to read the files and corrupt them.
You can try this online service that uses the Open Source iconv utility.
You can also install Charco, a command-line version of it on your machine.
For GB2312, you can use CP936 as the encoding.
If you are a .Net developer you can make a small tool that does just that.
I've struggled with this as well and found that it was actually simple to solve from a programmatic point of view.
All you need is something like this (I tested it and it works):
In C#
using System.IO;
using System.Text;

class Gb2312ToUtf8 {
    // Usage: Gb2312ToUtf8 <infile> <outfile>
    static void Main(string[] args) {
        string infile = args[0];
        string outfile = args[1];
        // Code page 936 covers GB2312 (Simplified Chinese)
        using (StreamReader sr = new StreamReader(infile, Encoding.GetEncoding(936)))
        using (StreamWriter sw = new StreamWriter(outfile, false, Encoding.UTF8)) {
            sw.Write(sr.ReadToEnd());
        }
    }
}
In VB.Net
Imports System.IO
Imports System.Text

Module Gb2312ToUtf8
    Sub Main(ByVal args() As String)
        Dim infile As String = args(0)
        Dim outfile As String = args(1)
        ' Code page 936 covers GB2312 (Simplified Chinese)
        Using sr As New StreamReader(infile, Encoding.GetEncoding(936))
            Using sw As New StreamWriter(outfile, False, Encoding.UTF8)
                sw.Write(sr.ReadToEnd())
            End Using
        End Using
    End Sub
End Module
I might be thinking a bit too simple here, but if it's just this one plain text file, you could try the following:
1. Replace all & by &amp;, all < by &lt; and all > by &gt; (to be on the safe side)
2. Prepend the following to the text file:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /></head><body><pre>
3. Open the file in your favorite browser
4. Select and copy all text
5. Paste it into Notepad and save as UTF-8.
You'd be done with this before you could have written any code to do the conversion or downloaded any programs that would do the conversion for you.
Of course, I'm not a hundred percent sure this'll work, and your browser would need the correct fonts and everything, but considering you're working with these kinds of files I'm assuming you already have those.
GB 2312 is mostly compatible with GB 18030, so any tool able to deal with the latter should treat GB 2312 correctly as well. There are many tools for converting GB 18030 to UTF-8 (or some other Unicode encoding form), but I can't recommend a specific one for Windows, because I work on Unix. If you want to write a bit of code, the iconv library, or ICU, springs to mind: you'll find all the conversion data readily available in these libraries.
Conversion from GB 2312 to UTF-8 is completely safe and lossless, you shouldn't worry about it.
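The compatibility claim is easy to sanity-check in Python (a sketch; any GB2312-encodable string would do):
# GB18030 is a superset of GB2312, so GB2312 bytes decode cleanly with it.
sample = "中文".encode("gb2312")
assert sample.decode("gb18030") == "中文"
print(sample.decode("gb18030"))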
I agree with the currently accepted answer that it is "actually simple to solve from a programmatic point of view", especially when your source file contains sensitive information that you do not want to expose to an unknown third-party online service.
And nowadays Python is available out of the box in most Linux environments and is also easy to install on Windows (easier than installing a C# stack, IMHO). So, without further ado, this is the 2-line Python script that can convert GB2312 to UTF-8. I tested it, and it works.
# Usage: python this_script.py your_input.txt your_output.txt
import io, sys
io.open(sys.argv[2], "w", encoding="utf-8").write(io.open(sys.argv[1], encoding="gb2312").read())
If the iconv command-line tool is available on your OS, you can achieve this by running a one-line script:
# From GB18030
iconv -f gb18030 -t utf8 -o output.txt input.txt
# From GB2312
iconv -f gb2312 -t utf8 -o output.txt input.txt
Check whether your OS has iconv:
$ iconv --version
iconv (Debian GLIBC 2.31-13+deb11u3) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.