This isn't really a programming question, is there a command line or Windows tool (Windows 7) to get the current encoding of a text file? Sure I can write a little C# app but I wanted to know if there is something already built in?
Open up your file using regular old vanilla Notepad that comes with Windows.
It will show you the encoding of the file when you click "Save As...".
It'll look like this:
Whatever the default-selected encoding is, that is what your current encoding is for the file.
If it is UTF-8, you can change it to ANSI and click save to change the encoding (or visa-versa).
I realize there are many different types of encoding, but this was all I needed when I was informed our export files were in UTF-8 and they required ANSI. It was a onetime export, so Notepad fit the bill for me.
FYI: From my understanding I think "Unicode" (as listed in Notepad) is a misnomer for UTF-16.
More here on Notepad's "Unicode" option: Windows 7 - UTF-8 and Unicdoe
If you have "git" or "Cygwin" on your Windows Machine, then go to the folder where your file is present and execute the command:
file *
This will give you the encoding details of all the files in that folder.
The (Linux) command-line tool 'file' is available on Windows via GnuWin32:
http://gnuwin32.sourceforge.net/packages/file.htm
If you have git installed, it's located in C:\Program Files\git\usr\bin.
Example:
C:\Users\SH\Downloads\SquareRoot>file *
_UpgradeReport_Files; directory
Debug; directory
duration.h; ASCII C++ program text, with CRLF line terminators
ipch; directory
main.cpp; ASCII C program text, with CRLF line terminators
Precision.txt; ASCII text, with CRLF line terminators
Release; directory
Speed.txt; ASCII text, with CRLF line terminators
SquareRoot.sdf; data
SquareRoot.sln; UTF-8 Unicode (with BOM) text, with CRLF line terminators
SquareRoot.sln.docstates.suo; PCX ver. 2.5 image data
SquareRoot.suo; CDF V2 Document, corrupt: Cannot read summary info
SquareRoot.vcproj; XML document text
SquareRoot.vcxproj; XML document text
SquareRoot.vcxproj.filters; XML document text
SquareRoot.vcxproj.user; XML document text
squarerootmethods.h; ASCII C program text, with CRLF line terminators
UpgradeLog.XML; XML document text
C:\Users\SH\Downloads\SquareRoot>file --mime-encoding *
_UpgradeReport_Files; binary
Debug; binary
duration.h; us-ascii
ipch; binary
main.cpp; us-ascii
Precision.txt; us-ascii
Release; binary
Speed.txt; us-ascii
SquareRoot.sdf; binary
SquareRoot.sln; utf-8
SquareRoot.sln.docstates.suo; binary
SquareRoot.suo; CDF V2 Document, corrupt: Cannot read summary infobinary
SquareRoot.vcproj; us-ascii
SquareRoot.vcxproj; utf-8
SquareRoot.vcxproj.filters; utf-8
SquareRoot.vcxproj.user; utf-8
squarerootmethods.h; us-ascii
UpgradeLog.XML; us-ascii
Another tool that I found useful: https://archive.codeplex.com/?p=encodingchecker
EXE can be found here
Install git ( on Windows you have to use git bash console). Type:
file --mime-encoding *
for all files in the current directory , or
file --mime-encoding */*
for the files in all subdirectories
Here's my take how to detect the Unicode family of text encodings via BOM. The accuracy of this method is low, as this method only works on text files (specifically Unicode files), and defaults to ascii when no BOM is present (like most text editors, the default would be UTF8 if you want to match the HTTP/web ecosystem).
Update 2018: I no longer recommend this method. I recommend using file.exe from GIT or *nix tools as recommended by #Sybren, and I show how to do that via PowerShell in a later answer.
# from https://gist.github.com/zommarin/1480974
function Get-FileEncoding($Path) {
$bytes = [byte[]](Get-Content $Path -Encoding byte -ReadCount 4 -TotalCount 4)
if(!$bytes) { return 'utf8' }
switch -regex ('{0:x2}{1:x2}{2:x2}{3:x2}' -f $bytes[0],$bytes[1],$bytes[2],$bytes[3]) {
'^efbbbf' { return 'utf8' }
'^2b2f76' { return 'utf7' }
'^fffe' { return 'unicode' }
'^feff' { return 'bigendianunicode' }
'^0000feff' { return 'utf32' }
default { return 'ascii' }
}
}
dir ~\Documents\WindowsPowershell -File |
select Name,#{Name='Encoding';Expression={Get-FileEncoding $_.FullName}} |
ft -AutoSize
Recommendation: This can work reasonably well if the dir, ls, or Get-ChildItem only checks known text files, and when you're only looking for "bad encodings" from a known list of tools. (i.e. SQL Management Studio defaults to UTF16, which broke GIT auto-cr-lf for Windows, which was the default for many years.)
A simple solution might be opening the file in Firefox.
Drag and drop the file into firefox
Press Ctrl+I to open the page info
and the text encoding will appear on the "Page Info" window.
Note: If the file is not in txt format, just rename it to txt and try again.
P.S. For more info see this article.
I wrote the #4 answer (at time of writing). But lately I have git installed on all my computers, so now I use #Sybren's solution. Here is a new answer that makes that solution handy from powershell (without putting all of git/usr/bin in the PATH, which is too much clutter for me).
Add this to your profile.ps1:
$global:gitbin = 'C:\Program Files\Git\usr\bin'
Set-Alias file.exe $gitbin\file.exe
And used like: file.exe --mime-encoding *. You must include .exe in the command for PS alias to work.
But if you don't customize your PowerShell profile.ps1 I suggest you start with mine: https://gist.github.com/yzorg/8215221/8e38fd722a3dfc526bbe4668d1f3b08eb7c08be0
and save it to ~\Documents\WindowsPowerShell. It's safe to use on a computer without git, but will write warnings when git is not found.
The .exe in the command is also how I use C:\WINDOWS\system32\where.exe from powershell; and many other OS CLI commands that are "hidden by default" by powershell, *shrug*.
you can simply check that by opening your git bash on the file location then running the command file -i file_name
example
user filesData
$ file -i data.csv
data.csv: text/csv; charset=utf-8
Some C code here for reliable ascii, bom's, and utf8 detection: https://unicodebook.readthedocs.io/guess_encoding.html
Only ASCII, UTF-8 and encodings using a BOM (UTF-7 with BOM, UTF-8 with BOM,
UTF-16, and UTF-32) have reliable algorithms to get the encoding of a document.
For all other encodings, you have to trust heuristics based on statistics.
EDIT:
A powershell version of a C# answer from: Effective way to find any file's Encoding. Only works with signatures (boms).
# get-encoding.ps1
param([Parameter(ValueFromPipeline=$True)] $filename)
begin {
# set .net current directoy
[Environment]::CurrentDirectory = (pwd).path
}
process {
$reader = [System.IO.StreamReader]::new($filename,
[System.Text.Encoding]::default,$true)
$peek = $reader.Peek()
$encoding = $reader.currentencoding
$reader.close()
[pscustomobject]#{Name=split-path $filename -leaf
BodyName=$encoding.BodyName
EncodingName=$encoding.EncodingName}
}
.\get-encoding chinese8.txt
Name BodyName EncodingName
---- -------- ------------
chinese8.txt utf-8 Unicode (UTF-8)
get-childitem -file | .\get-encoding
Looking for a Node.js/npm solution? Try encoding-checker:
npm install -g encoding-checker
Usage
Usage: encoding-checker [-p pattern] [-i encoding] [-v]
Options:
--help Show help [boolean]
--version Show version number [boolean]
--pattern, -p, -d [default: "*"]
--ignore-encoding, -i [default: ""]
--verbose, -v [default: false]
Examples
Get encoding of all files in current directory:
encoding-checker
Return encoding of all md files in current directory:
encoding-checker -p "*.md"
Get encoding of all files in current directory and its subfolders (will take quite some time for huge folders; seemingly unresponsive):
encoding-checker -p "**"
For more examples refer to the npm docu or the official repository.
Similar to the solution listed above with Notepad, you can also open the file in Visual Studio, if you're using that. In Visual Studio, you can select "File > Advanced Save Options..."
The "Encoding:" combo box will tell you specifically which encoding is currently being used for the file. It has a lot more text encodings listed in there than Notepad does, so it's useful when dealing with various files from around the world and whatever else.
Just like Notepad, you can also change the encoding from the list of options there, and then saving the file after hitting "OK". You can also select the encoding you want through the "Save with Encoding..." option in the Save As dialog (by clicking the arrow next to the Save button).
The only way that I have found to do this is VIM or Notepad++.
EncodingChecker
File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify.
File Encoding Checker requires .NET 4 or above to run.
I'm adding the following line -classpath/p ${installer:sys.userHome}/.comput/updates/latest.jar to the vmoption file. (Tried both options: via installer 'Add VM option' action and via launcher config).
Works pretty fine with ASCII user name (with spaces as well), but fails with non-ascii user names (I'm testing with Russian). The vmoption file looks fine to me: the path is correct and has the right encoding: CP 1251 for my case:
However the path passed to JVM seems to have incorrectly decoded characters: On the attached screen you may see the actual path passed to JVM (checked via YourKit) from Install4J launcher:
and you may also compare it with the screen when the non-ascii path is passed via command prompt:
The only workaround I have found is to substitute the path with 8.3 Windows path, but converting to it on pure Java seems very error prone to me.
Appeciate your help very much!
I am trying to create a command line to compress as RAR file using password through command line in Windows 7. I have installed WinRAR 5.31 x64.
The following command works for me:
rar a -r -m0 -hp"!(/!$!#!#=)\%" C:\files1.rar" *.*
The password is !(/!$!#!#=)\%.
My problem occurs if I wanted to put double quotes " inside my password, for example at the beginning:
rar a -r -m0 -hp""!(/!$!#!#=)\%" C:\files1.rar" *.*
The password should be "!(/!$!#!#=)\%.
That does not work for me, I tried putting \ before of ", but this is also not working.
Could anyone guide me through it in order to figure it out this special character in my password?
Further to the answer by Mofi:
Especially for Linux users using winrar/rar from the commandline, it may be worth realizing that rar effectively accepts "keyfiles", which may overcome the need to fiddle with quotes as part of the password.
Rar's documented maximum password length is 127 characters/bytes. It is not clear (to me) precisely which characters are part of the password space, but at least base64-encoded strings work. However, rar currently uses a password based key derivation function based on PBKDF2 using the HMAC-SHA256 hash function, which has a block size of 512 bits. Per PBKDF2, passwords longer than the block size of the hash function are first pre-hashed into a digest of 256 bits, which digest is then used as the password (instead of the original password). To avoid this, the archive password you pick should be no longer than 512 bits or 64 characters.
In a base64-encoded string, each character represents 6 bits of data; a 64 character password thus amounts to 384 random bits, which may be derived from 48 random bytes.
rar a -hp"$(dd if=/dev/urandom bs=48 count=1 | base64 -w0 | tee /tmp/pwd)" archive
The dd-pipe above will read 48 (pseudo)random bytes from the kernel's (non-blocking) random number source device, convert these into a 64 character password, tell rar to use that password for deriving a 256-bit (AES256) encryption key (RAR5-format), and at the same time store the password in the file `/tmp/pwd'.
The archive may again be accessed, e.g. listed, by reading the password back from the file, for instance like so:
rar l -p"$(cat /tmp/pwd)" archive.rar
The password file may be safely stored separately or together with the archive, in the latter case (of course) after encrypting it, e.g. with gpg using your own public key so as to lock the archive password under your private key/key phrase. All of this aims to conveniently make good use of rar's password/key space.
I note that I didn't dive into unrar's publicly available source code; the above is merely based on the general documentation. In addition, I don't know if the above can be made to work under Windows.
The Windows command interpreter cmd.exe and Rar.exe itself determine how arguments specified on command line are interpreted on parsing the command line. Argument strings containing a space or one of these characters &()[]{}^=;!'+,`~<|> must be enclosed in double quotes. This makes it very difficult to pass a double quote character as part of an argument string to a console application, especially at begin of an argument string.
But there is a solution for this very uncommon and very specific problem caused by a password/passphrase starting with a straight double quote character which marks usually begin/end of an argument string within all characters between are interpreted literally.
The manual of console version of WinRAR is the text file Rar.txt in program files folder of WinRAR. It can be read in this manual that Rar.exe supports reading switches from an environment variable RAR. By using this environment variable and special parsing of Windows command line interpreter on a SET command line it is possible to create a RAR archive from command line with a password starting with a single straight double quote character.
#echo off
setlocal EnableExtensions DisableDelayedExpansion
set "RAR=-hp""!(/!$!#!#=)\%%""
"%ProgramFiles%\WinRAR\Rar.exe" a -r -m0 -x"%~f0" "%USERPROFILE%\Desktop\files1.rar" *.*
endlocal
The switch -hp is read from environment variable RAR in addition to the other switches specified directly on RAR command line as explained by the manual.
The environment variable RAR is defined using syntax set "variable=value" as explained in detail by answer on Why is no string output with 'echo %var%' after using 'set var = text' on command line?
A password/passphrase with space or one of these characters &()[]{}^=;!'+,`~<|> needs to be enclosed in double quotes on Windows command line. For that reason Rar.exe removes from the passed password/passphrase the first and last double quote if there is one at begin and/or end. So it is not possible to define the password with "!(/!$!#!#=)\%. The password must be defined with two additional double quotes using ""!(/!$!#!#=)\%" to let really used password start with a straight double quote character.
In a batch file % marks begin/end of an environment variable reference except it is escaped with one more %.
So finally the command line set "RAR=-hp""!(/!$!#!#=)\%%"" defines the environment variable RAR with switch -hp passing the string "!(/!$!#!#=)\% to Rar.exe as password to use on encryption.
The RAR archive files1.rar is created on user's desktop by this code as root of directory C: is usually write-protected.
Note: Rar and WinRAR interpret *.* different to * as also explained in manual in comparison to Windows kernel functions interpreting them identical. Rar adds only files containing a dot in name of file into the RAR archive file on using *.*. So you might better use just * as wildcard.
The switch -x"%~f0" prevents adding the batch file also into the RAR archive file if being stored in current directory on execution of the batch file. Run in a command prompt window call /? for an explanation of %~f0 – full name of argument 0 which means batch file name with extension and full path.
I am reading a csv file in my sql script and copying its data into a postgre sql table. The line of code is below :
\copy participants_2013 from 'C:/Users/Acrotrend/Desktop/mip_sahil/mip/reelportdata/Participating_Individual_Extract_Report_MIPJunior_2013_160414135957.Csv' with CSV delimiter ',' quote '"' HEADER;
I am getting following error : character with byte sequence 0x9d in encoding 'WIN1252' has no equivalent in encoding 'UTF8'.
Can anyone help me with what the cause of this issue and how can I resolve it?
The problem is that 0x9D is not a valid byte value in WIN1252.
There's a table here: https://en.wikipedia.org/wiki/Windows-1252
The problem may be that you are importing a UTF-8 file and postgresql is defaulting to Windows-1252 (which I believe is the default on many windows systems).
You need to change the character set on your windows command line before running the script with chcp. Or in postgresql you can:
SET CLIENT_ENCODING TO 'utf8';
Before importing the file.
Simply specify encoding 'UTF-8' as the encoding in the \copy command, e.g. (I broke it into two lines for readability but keep it all on the same line):
\copy dest_table from 'C:/src-data.csv'
(format csv, header true, delimiter ',', encoding 'UTF8');
More details:
The problem is that the Client Encoding is set to WIN1252, most likely because it is running on Windows machine but the file has a UTF-8 character in it.
You can check the Client Encoding with
SHOW client_encoding;
client_encoding
-----------------
WIN1252
Any encoding has numeric ranges of valid code. Are you sure so your data are in win1252 encoding?
Postgres is very strict and doesn't import any possible encoding broken files. You can use iconv that can works in tolerant mode, and it can remove broken chars. After cleaning by iconv you can import the file.
I had this problem today and it was because inside of a TEXT column I had fancy quotes that had been copy/pasted from an external source.
I have a *.tar.gz file that have inside occasionally some names with non ascii letters.
for example when tar encounter a file containing word: naïve it outputs: na\303\257ve
Is there any swich, or tool to convert these slashed values to a proper letter ?
http://www.gnu.org/software/tar/manual/tar.html
By default GNU tar attempts to unquote each file or member name, replacing escape sequences according to the following table: ...
This default behavior is controlled by the following command line
option:
--unquote
Enable unquoting input file or member names (default).
--no-unquote
Disable unquoting input file or member names.
In other words, see if "--no-unquote" is an option for your version of Cygwin.
PS:
Which version of Cygwin tar are you using?