Babel writes utf-16 file; how can I make it write utf-8? - babeljs

When I run babel --plugins transform-react-jsx like_button.jsx > like_button.js the resulting like_button.js is UTF-16 encoded (and like_button.jsx has some 8-bit encoding, probably UTF-8).
How can I make Babel write like_button.js UTF-8 encoded?

Babel's output is definitely UTF-8. Since you are seeing UTF-16 in your file, and the file is being written by your terminal, it seems most likely that your terminal is re-encoding the data before writing it to a file.
The easiest option for you would be to change from
-babel --plugins transform-react-jsx like_button.jsx > like_button.js
+babel --plugins transform-react-jsx like_button.jsx --out-file like_button.js
so that Babel itself is responsible for writing the output to the file, which removes the terminal from the equation.
If you don't want to do that, you'll need to look into your terminal options to see if there is an explicit encoding set somewhere.
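If you already have a UTF-16 file from an earlier redirect and just want to repair it, a small script can re-encode it. This is only a minimal sketch in Python, not anything Babel provides; the function name reencode_utf16_to_utf8 is made up for illustration, and it assumes the file starts with a UTF-16 byte order mark.

```python
import codecs
from pathlib import Path


def reencode_utf16_to_utf8(src, dst):
    """Rewrite a UTF-16 text file as UTF-8 without a BOM."""
    raw = Path(src).read_bytes()
    # Refuse to guess: only continue if a UTF-16 byte order mark is present.
    if not raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        raise ValueError(f"{src} does not start with a UTF-16 BOM")
    # The "utf-16" codec uses the BOM to pick the byte order and drops it.
    text = raw.decode("utf-16")
    # Write bytes directly so line endings are left exactly as they were.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Calling reencode_utf16_to_utf8("like_button.js", "like_button.utf8.js") would write a UTF-8 copy next to the original.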

Related

how to find the encoding type of configuration.properties files in powershell [duplicate]

This isn't really a programming question, is there a command line or Windows tool (Windows 7) to get the current encoding of a text file? Sure I can write a little C# app but I wanted to know if there is something already built in?
Open up your file using regular old vanilla Notepad that comes with Windows.
It will show you the encoding of the file when you click "Save As...".
Whatever the default-selected encoding is, that is what your current encoding is for the file.
If it is UTF-8, you can change it to ANSI and click save to change the encoding (or vice versa).
I realize there are many different types of encoding, but this was all I needed when I was informed our export files were in UTF-8 and they required ANSI. It was a onetime export, so Notepad fit the bill for me.
FYI: From my understanding I think "Unicode" (as listed in Notepad) is a misnomer for UTF-16.
More here on Notepad's "Unicode" option: Windows 7 - UTF-8 and Unicdoe
If you have "git" or "Cygwin" on your Windows Machine, then go to the folder where your file is present and execute the command:
file *
This will give you the encoding details of all the files in that folder.
The (Linux) command-line tool 'file' is available on Windows via GnuWin32:
http://gnuwin32.sourceforge.net/packages/file.htm
If you have git installed, it's located in C:\Program Files\git\usr\bin.
Example:
C:\Users\SH\Downloads\SquareRoot>file *
_UpgradeReport_Files; directory
Debug; directory
duration.h; ASCII C++ program text, with CRLF line terminators
ipch; directory
main.cpp; ASCII C program text, with CRLF line terminators
Precision.txt; ASCII text, with CRLF line terminators
Release; directory
Speed.txt; ASCII text, with CRLF line terminators
SquareRoot.sdf; data
SquareRoot.sln; UTF-8 Unicode (with BOM) text, with CRLF line terminators
SquareRoot.sln.docstates.suo; PCX ver. 2.5 image data
SquareRoot.suo; CDF V2 Document, corrupt: Cannot read summary info
SquareRoot.vcproj; XML document text
SquareRoot.vcxproj; XML document text
SquareRoot.vcxproj.filters; XML document text
SquareRoot.vcxproj.user; XML document text
squarerootmethods.h; ASCII C program text, with CRLF line terminators
UpgradeLog.XML; XML document text
C:\Users\SH\Downloads\SquareRoot>file --mime-encoding *
_UpgradeReport_Files; binary
Debug; binary
duration.h; us-ascii
ipch; binary
main.cpp; us-ascii
Precision.txt; us-ascii
Release; binary
Speed.txt; us-ascii
SquareRoot.sdf; binary
SquareRoot.sln; utf-8
SquareRoot.sln.docstates.suo; binary
SquareRoot.suo; CDF V2 Document, corrupt: Cannot read summary infobinary
SquareRoot.vcproj; us-ascii
SquareRoot.vcxproj; utf-8
SquareRoot.vcxproj.filters; utf-8
SquareRoot.vcxproj.user; utf-8
squarerootmethods.h; us-ascii
UpgradeLog.XML; us-ascii
Another tool that I found useful: https://archive.codeplex.com/?p=encodingchecker
EXE can be found here
Install git (on Windows you have to use the Git Bash console). Type:
file --mime-encoding *
for all files in the current directory, or
file --mime-encoding */*
for the files in all subdirectories
Here's my take how to detect the Unicode family of text encodings via BOM. The accuracy of this method is low, as this method only works on text files (specifically Unicode files), and defaults to ascii when no BOM is present (like most text editors, the default would be UTF8 if you want to match the HTTP/web ecosystem).
Update 2018: I no longer recommend this method. I recommend using file.exe from GIT or *nix tools as recommended by @Sybren, and I show how to do that via PowerShell in a later answer.
# from https://gist.github.com/zommarin/1480974
function Get-FileEncoding($Path) {
    # Read at most the first four bytes; a BOM, if present, is found there.
    $bytes = [byte[]](Get-Content $Path -Encoding byte -ReadCount 4 -TotalCount 4)
    if(!$bytes) { return 'utf8' }
    # Format the bytes as lowercase hex and match them against known BOM prefixes.
    switch -regex ('{0:x2}{1:x2}{2:x2}{3:x2}' -f $bytes[0],$bytes[1],$bytes[2],$bytes[3]) {
        '^efbbbf'   { return 'utf8' }
        '^2b2f76'   { return 'utf7' }
        '^fffe'     { return 'unicode' }
        '^feff'     { return 'bigendianunicode' }
        '^0000feff' { return 'utf32' }
        default     { return 'ascii' }
    }
}
# List every file in the folder together with its detected encoding.
dir ~\Documents\WindowsPowershell -File |
    select Name,@{Name='Encoding';Expression={Get-FileEncoding $_.FullName}} |
    ft -AutoSize
Recommendation: This can work reasonably well if the dir, ls, or Get-ChildItem only checks known text files, and when you're only looking for "bad encodings" from a known list of tools. (For example, SQL Management Studio defaults to UTF-16, which broke Git's auto-CRLF handling on Windows, which was the default for many years.)
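For comparison, the same BOM-sniffing idea can be sketched in Python. This is an illustrative sketch rather than a port of the gist above: the function name sniff_bom and the returned codec names are my own choices, and it returns None instead of guessing when no BOM is found. The signatures are checked longest first because the UTF-32 LE BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE).

```python
import codecs

# Longer signatures come first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(path):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    with open(path, "rb") as f:
        head = f.read(4)  # no BOM is longer than four bytes
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None
```

Like any BOM check, this says nothing about files without a signature, and a UTF-16 LE file whose first character is U+0000 would be reported as UTF-32 LE.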
A simple solution might be opening the file in Firefox.
Drag and drop the file into Firefox.
Press Ctrl+I to open the page info.
The text encoding will appear in the "Page Info" window.
Note: If the file is not in txt format, just rename it to txt and try again.
P.S. For more info see this article.
I wrote the #4 answer (at time of writing). But lately I have git installed on all my computers, so now I use @Sybren's solution. Here is a new answer that makes that solution handy from PowerShell (without putting all of git/usr/bin in the PATH, which is too much clutter for me).
Add this to your profile.ps1:
$global:gitbin = 'C:\Program Files\Git\usr\bin'
Set-Alias file.exe $gitbin\file.exe
Use it like this: file.exe --mime-encoding *. You must include .exe in the command for the PS alias to work.
But if you don't customize your PowerShell profile.ps1 I suggest you start with mine: https://gist.github.com/yzorg/8215221/8e38fd722a3dfc526bbe4668d1f3b08eb7c08be0
and save it to ~\Documents\WindowsPowerShell. It's safe to use on a computer without git, but will write warnings when git is not found.
The .exe in the command is also how I use C:\WINDOWS\system32\where.exe from powershell; and many other OS CLI commands that are "hidden by default" by powershell, *shrug*.
You can simply check that by opening Git Bash in the file's location and then running the command file -i file_name
example
user filesData
$ file -i data.csv
data.csv: text/csv; charset=utf-8
Some C code here for reliable ASCII, BOM, and UTF-8 detection: https://unicodebook.readthedocs.io/guess_encoding.html
Only ASCII, UTF-8 and encodings using a BOM (UTF-7 with BOM, UTF-8 with BOM,
UTF-16, and UTF-32) have reliable algorithms to get the encoding of a document.
For all other encodings, you have to trust heuristics based on statistics.
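The "reliable" part for ASCII and UTF-8 comes down to strict validation: either the bytes decode cleanly or they don't. Here is a rough Python sketch of that check (not the C code from the linked page; the function name guess_ascii_or_utf8 is made up for illustration):

```python
def guess_ascii_or_utf8(data):
    """Return 'ascii' or 'utf-8' if the bytes validate, otherwise None."""
    # ASCII is tried first because every ASCII file is also valid UTF-8.
    for name in ("ascii", "utf-8"):
        try:
            data.decode(name)  # strict error handling is the default
            return name
        except UnicodeDecodeError:
            continue
    # Anything else needs a BOM check or statistical heuristics.
    return None
```

A None result only means "neither ASCII nor UTF-8"; it does not identify the actual encoding.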
EDIT:
A powershell version of a C# answer from: Effective way to find any file's Encoding. Only works with signatures (boms).
# get-encoding.ps1
param([Parameter(ValueFromPipeline=$True)] $filename)
begin {
    # set .NET current directory
    [Environment]::CurrentDirectory = (pwd).path
}
process {
    $reader = [System.IO.StreamReader]::new($filename,
        [System.Text.Encoding]::default,$true)
    # Peek makes the reader inspect the BOM so CurrentEncoding is updated
    $peek = $reader.Peek()
    $encoding = $reader.currentencoding
    $reader.close()
    [pscustomobject]@{Name=split-path $filename -leaf
        BodyName=$encoding.BodyName
        EncodingName=$encoding.EncodingName}
}
.\get-encoding chinese8.txt
Name BodyName EncodingName
---- -------- ------------
chinese8.txt utf-8 Unicode (UTF-8)
get-childitem -file | .\get-encoding
Looking for a Node.js/npm solution? Try encoding-checker:
npm install -g encoding-checker
Usage
Usage: encoding-checker [-p pattern] [-i encoding] [-v]
Options:
--help Show help [boolean]
--version Show version number [boolean]
--pattern, -p, -d [default: "*"]
--ignore-encoding, -i [default: ""]
--verbose, -v [default: false]
Examples
Get encoding of all files in current directory:
encoding-checker
Return encoding of all md files in current directory:
encoding-checker -p "*.md"
Get encoding of all files in current directory and its subfolders (will take quite some time for huge folders; seemingly unresponsive):
encoding-checker -p "**"
For more examples refer to the npm documentation or the official repository.
Similar to the solution listed above with Notepad, you can also open the file in Visual Studio, if you're using that. In Visual Studio, you can select "File > Advanced Save Options..."
The "Encoding:" combo box will tell you specifically which encoding is currently being used for the file. It has a lot more text encodings listed in there than Notepad does, so it's useful when dealing with various files from around the world and whatever else.
Just like Notepad, you can also change the encoding from the list of options there, and then save the file after hitting "OK". You can also select the encoding you want through the "Save with Encoding..." option in the Save As dialog (by clicking the arrow next to the Save button).
The only way that I have found to do this is VIM or Notepad++.
EncodingChecker
File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify.
File Encoding Checker requires .NET 4 or above to run.

The encoding 'GB2312' is not supported. in reading process with matlab

I tried to implement k-means in MATLAB. However, when I use csvread('Filename'); in my program, it gives me the warning The encoding 'GB2312' is not supported. and the program can't read the CSV data. Can anybody tell me what is wrong?
data=csvread('ClusterSamples.csv');
plot(data(:,1),data(:,2),'r+');
[m,n]=size(data);
The character encoding is not supported.
If you're using Mac or Linux you can use the iconv(1) tool.
cp ClusterSamples.csv ClusterSamples.csv.old && \
iconv -f GB2312 -t UTF-8 < ClusterSamples.csv.old > ClusterSamples.csv
If not, you can use a text editor to change the character encoding and resave the file.

Which is the Dockerfile encoding?

Defining my Dockerfile I got to this line:
...
MAINTAINER Ramón <ramon@example.com>
...
Which encoding shall I use to save this file?
Shall I escape non ASCII characters?
Considering Docker is done in Go, and Go has native support for utf-8, it is best to save a Dockerfile directly encoded in UTF-8.
That way, all characters (ASCII or not) are supported.
See "Dealing with encodings in Go".
Even though Go has good support for UTF-8 (and minimal support for UTF-16), it has no built-in support for any other encoding.
If you have to use other encodings (e.g. when dealing with user input), you have to use third party packages, like for example go-charset.
Here, it is best if the Dockerfile is directly encoded in UTF-8.
Update July 2016, docker 1.12-rc5 adds:
PR 23372: Support unicode characters in parseWords
PR 23234: Skip UTF-8 BOM bytes from Dockerfile and .dockerignore if exist
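To illustrate what "skip UTF-8 BOM bytes" means in practice, here is a small Python sketch. It is not Docker's implementation (Docker is written in Go); it only shows the general idea using Python's utf-8-sig codec, and the function name read_text_without_bom is hypothetical.

```python
def read_text_without_bom(path):
    """Read a UTF-8 text file, dropping a leading BOM if one is present."""
    # "utf-8-sig" decodes UTF-8 and silently removes one leading BOM.
    # newline="" keeps the original line endings untouched.
    with open(path, encoding="utf-8-sig", newline="") as f:
        return f.read()
```

Files saved with or without a BOM then yield the same text, so non-ASCII names such as Ramón come through unchanged either way.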
You need to set the locale correctly. Remove the accent, check the encoding inside a container with a basic docker run -it container env, and then set a correct encoding. The "Bible" on that is http://jaredmarkell.com/docker-and-locales/

pandoc: Cannot decode byte '\xd0': Data.Text.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream

I'm getting this error when I run pandoc --filter pandoc-citeproc myfile.markdown myfile.pdf
pandoc: Cannot decode byte '\xd0': Data.Text.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream
I have searched here and here, but I have double-checked in the text editor and my file is UTF-8 encoded. It has accented Spanish characters, but the same command worked without any problem in the past. Any pointers to a solution would be appreciated.
My bad. The problem is related to the command I use to tell pandoc to create the PDF output. The proper command should be:
pandoc --filter pandoc-citeproc myfile.markdown -o myfile.pdf
Note the -o flag between the input markdown file and the output PDF file. Without it, pandoc treats myfile.pdf as a second input file, which is why I got the same UTF-8 message as the people trying to convert from PDF to other formats documented in my links.
Check JabRef encoding
In my case, I bumped into a similar error when converting Pandoc Markdown to XHTML. The culprit was a set of BibTeX citations which JabRef had encoded by default in ISO8859_1.
This default JabRef behaviour can be changed once and for all by setting Default encoding: to UTF8 in JabRef's Options > Preferences > General menu.
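When it is unclear which input file triggers an "Invalid UTF-8 stream" error, it can help to find the first offending byte in each candidate file. Below is a minimal Python sketch for that; it is independent of pandoc, and the function name first_invalid_utf8 is invented for illustration.

```python
def first_invalid_utf8(data):
    """Return (offset, byte value) of the first undecodable byte, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return err.start, data[err.start]
    return None
```

For a file on disk you could call first_invalid_utf8(open("refs.bib", "rb").read()), where refs.bib is a placeholder name. An ISO-8859-1 file with accented characters, or a binary file passed as input by mistake, typically fails within the first few lines.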

Convert GB2312 to UTF-8

I have a text file that contains localized language strings that is currently encoded in GB2312 (simplified Chinese), but all of my other language files are in UTF-8. I am finding it very difficult to work with this file, as none of my text editors will work properly with it and keep corrupting it. Are there any tools to convert this to UTF-8, and are there any downsides to doing this? Would it be better to just keep it as GB2312 and use a different editor (if so, can you recommend one)?
Update: I'm using Windows XP (English install).
Update #2: I've tried using Notepad++ and Notepad2 to edit the GB2312 files, but both are unable to read the files and corrupt them.
You can try this online service that uses the Open Source iconv utility.
You can also install Charco, a command-line version of it on your machine.
For GB2312, you can use CP936 as the encoding.
If you are a .Net developer you can make a small tool that does just that.
I've struggled with this as well and found that it was actually simple to solve from a programmatic point of view.
All you need is something like this (I tested it and it works):
In C#
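using System.IO;
using System.Text;
class Program {
    static void Main(string[] args) {
        string infile = args[0];
        string outfile = args[1];
        // Code page 936 is the Windows code page for Simplified Chinese (GBK, a superset of GB2312).
        // On .NET Core / .NET 5+, register CodePagesEncodingProvider.Instance first, or GetEncoding(936) throws.
        using (StreamReader sr = new StreamReader(infile, Encoding.GetEncoding(936)))
        using (StreamWriter sw = new StreamWriter(outfile, false, Encoding.UTF8)) {
            // Both streams are flushed and closed when the using block ends.
            sw.Write(sr.ReadToEnd());
        }
    }
}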
In VB.Net
Imports System.IO
Imports System.Text
Class Program
    Public Shared Sub Main(ByVal args() As String)
        Dim infile As String = args(0)
        Dim outfile As String = args(1)
        ' Code page 936 covers GB2312 text.
        Dim sr As StreamReader = New StreamReader(infile, Encoding.GetEncoding(936))
        Dim sw As StreamWriter = New StreamWriter(outfile, False, Encoding.UTF8)
        sw.Write(sr.ReadToEnd())
        sw.Close()
        sr.Close()
    End Sub
End Class
I might be thinking a bit too simple here, but if it's just this one plain text file, you could try the following:
Replace all & by &amp;, all < by &lt; and all > by &gt; (to be on the safe side)
Prepend the following to the text file:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /></head><body><pre>
Open the file in your favorite browser
Select and copy all text
Paste it in Notepad and save as UTF-8.
You'd be done with this before you could have written any code to do the conversion or downloaded any programs that would do the conversion for you.
Of course, I'm not a hundred percent sure this'll work, and your browser would need the correct fonts and everything, but considering you're working with these kinds of files I'm assuming you already have those.
GB 2312 is mostly compatible with GB 18030, so any tool able to deal with the latter should treat GB 2312 correctly as well. There are many tools for converting GB 18030 to UTF-8 (or some other Unicode encoding form), but I can't recommend any specific one for Windows, because I work on Unix. If you're wanting to write a bit of code, the iconv library, or ICU, springs to mind: you'll find all the conversion data readily available in these libraries.
Conversion from GB 2312 to UTF-8 is completely safe and lossless, so you shouldn't worry about it.
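If you want to convince yourself for your own file, you can do a round trip: convert to UTF-8, convert back, and compare the bytes. Here is a small Python sketch of that check. The function names are made up for illustration, and it relies on Python's built-in gb2312 codec with strict decoding, so malformed input raises an error instead of being silently corrupted.

```python
def gb2312_to_utf8(data):
    """Convert GB2312-encoded bytes to UTF-8 bytes."""
    return data.decode("gb2312").encode("utf-8")


def round_trips(data):
    """True if converting to UTF-8 and back reproduces the original bytes."""
    utf8 = gb2312_to_utf8(data)
    return utf8.decode("utf-8").encode("gb2312") == data


# Sample data standing in for the contents of a real GB2312 file.
sample = "简体中文 abc".encode("gb2312")
print(round_trips(sample))
```

If the check returns True for your file's bytes, nothing was lost in the conversion.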
I agree on the currently chosen answer in that "found that it was actually simple to solve from a programmatic point of view", especially when your source file contains sensitive information that you do not want to expose to an unknown 3rd-party online service.
And nowadays Python is available out of the box in most Linux environments, and is also easy to install on Windows (easier than installing a C# stack, IMHO). So, without further ado, this is a 2-liner Python script that converts GB2312 to UTF-8. I tested it, and it works.
# Usage: python this_script.py your_input.txt your_output.txt
import io, sys
io.open(sys.argv[2], "w", encoding="utf-8").write(io.open(sys.argv[1], encoding="gb2312").read())
If the command-line tool iconv is available on your OS, you can achieve this by running a one-line script:
# From GB18030
iconv -f gb18030 -t utf8 -o output.txt input.txt
# From GB2312
iconv -f gb2312 -t utf8 -o output.txt input.txt
Check whether your OS has iconv:
$ iconv --version
iconv (Debian GLIBC 2.31-13+deb11u3) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.