This isn't really a programming question: is there a command-line or Windows tool (Windows 7) to get the current encoding of a text file? Sure, I could write a little C# app, but I wanted to know if there is something already built in.
Open up your file using regular old vanilla Notepad that comes with Windows.
It will show you the encoding of the file when you click "Save As...": whatever encoding is pre-selected in the drop-down is the file's current encoding.
If it is UTF-8, you can change it to ANSI and click Save to change the encoding (or vice versa).
I realize there are many different types of encoding, but this was all I needed when I was informed our export files were in UTF-8 and they required ANSI. It was a one-time export, so Notepad fit the bill for me.
FYI: From my understanding, "Unicode" (as listed in Notepad) is a misnomer for UTF-16.
More here on Notepad's "Unicode" option: Windows 7 - UTF-8 and Unicode
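If you need the same one-off conversion without the GUI, here is a minimal Windows PowerShell sketch (the file names are placeholders; note that in Windows PowerShell, -Encoding Default means the system's ANSI code page):

# re-encode a UTF-8 text file as ANSI (the system code page)
Get-Content .\export.txt -Encoding UTF8 |
    Set-Content .\export-ansi.txt -Encoding Default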
If you have "git" or "Cygwin" on your Windows Machine, then go to the folder where your file is present and execute the command:
file *
This will give you the encoding details of all the files in that folder.
The (Linux) command-line tool 'file' is available on Windows via GnuWin32:
http://gnuwin32.sourceforge.net/packages/file.htm
If you have git installed, it's located in C:\Program Files\git\usr\bin.
Example:
C:\Users\SH\Downloads\SquareRoot>file *
_UpgradeReport_Files: directory
Debug: directory
duration.h: ASCII C++ program text, with CRLF line terminators
ipch: directory
main.cpp: ASCII C program text, with CRLF line terminators
Precision.txt: ASCII text, with CRLF line terminators
Release: directory
Speed.txt: ASCII text, with CRLF line terminators
SquareRoot.sdf: data
SquareRoot.sln: UTF-8 Unicode (with BOM) text, with CRLF line terminators
SquareRoot.sln.docstates.suo: PCX ver. 2.5 image data
SquareRoot.suo: CDF V2 Document, corrupt: Cannot read summary info
SquareRoot.vcproj: XML document text
SquareRoot.vcxproj: XML document text
SquareRoot.vcxproj.filters: XML document text
SquareRoot.vcxproj.user: XML document text
squarerootmethods.h: ASCII C program text, with CRLF line terminators
UpgradeLog.XML: XML document text

C:\Users\SH\Downloads\SquareRoot>file --mime-encoding *
_UpgradeReport_Files: binary
Debug: binary
duration.h: us-ascii
ipch: binary
main.cpp: us-ascii
Precision.txt: us-ascii
Release: binary
Speed.txt: us-ascii
SquareRoot.sdf: binary
SquareRoot.sln: utf-8
SquareRoot.sln.docstates.suo: binary
SquareRoot.suo: CDF V2 Document, corrupt: Cannot read summary infobinary
SquareRoot.vcproj: us-ascii
SquareRoot.vcxproj: utf-8
SquareRoot.vcxproj.filters: utf-8
SquareRoot.vcxproj.user: utf-8
squarerootmethods.h: us-ascii
UpgradeLog.XML: us-ascii
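If you don't want to put git's usr\bin directory on your PATH, you can also call file.exe by its full path; a quick sketch from PowerShell (the path assumes a default 64-bit Git for Windows install):

& 'C:\Program Files\Git\usr\bin\file.exe' --mime-encoding *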
Another tool that I found useful: https://archive.codeplex.com/?p=encodingchecker
Install git (on Windows you have to use the Git Bash console). Type:
file --mime-encoding *
for all files in the current directory, or
file --mime-encoding */*
for the files in all subdirectories.
Here's my take on how to detect the Unicode family of text encodings via BOM. The accuracy of this method is low: it only works on text files (specifically Unicode files), and it defaults to ascii when no BOM is present (like most text editors; the default would be UTF-8 if you want to match the HTTP/web ecosystem).
Update 2018: I no longer recommend this method. I recommend using file.exe from git or *nix tools, as recommended by @Sybren, and I show how to do that via PowerShell in a later answer.
# from https://gist.github.com/zommarin/1480974
function Get-FileEncoding($Path) {
    # read the first four bytes (long enough to cover every BOM below)
    $bytes = [byte[]](Get-Content $Path -Encoding Byte -ReadCount 4 -TotalCount 4)

    if(!$bytes) { return 'utf8' }

    # compare the hex of those bytes against the known BOM signatures
    switch -regex ('{0:x2}{1:x2}{2:x2}{3:x2}' -f $bytes[0],$bytes[1],$bytes[2],$bytes[3]) {
        '^efbbbf'   { return 'utf8' }              # UTF-8 BOM
        '^2b2f76'   { return 'utf7' }              # UTF-7 BOM
        '^fffe'     { return 'unicode' }           # UTF-16 LE BOM
        '^feff'     { return 'bigendianunicode' }  # UTF-16 BE BOM
        '^0000feff' { return 'utf32' }             # UTF-32 BE BOM
        default     { return 'ascii' }             # no BOM found
    }
}

dir ~\Documents\WindowsPowershell -File |
    select Name,@{Name='Encoding';Expression={Get-FileEncoding $_.FullName}} |
    ft -AutoSize
Recommendation: This can work reasonably well if the dir, ls, or Get-ChildItem only checks known text files, and when you're only looking for "bad encodings" from a known list of tools (e.g. SQL Server Management Studio defaults to UTF-16, which broke git's auto-crlf handling on Windows, which was the default for many years).
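For example, a minimal sketch of that kind of targeted check, building on the Get-FileEncoding function above (the file extensions and the "bad encoding" filter are assumptions for illustration):

# check only known text files and surface anything that isn't plain ASCII
dir -Recurse -Include *.sql,*.cs,*.txt -File |
    select Name,@{Name='Encoding';Expression={Get-FileEncoding $_.FullName}} |
    where { $_.Encoding -ne 'ascii' } |
    ft -AutoSize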
A simple solution might be opening the file in Firefox:
Drag and drop the file into Firefox
Press Ctrl+I to open the page info
The text encoding will appear in the "Page Info" window.
Note: If the file is not in txt format, just rename it to txt and try again.
P.S. For more info see this article.
I wrote the #4 answer (at time of writing). But lately I have git installed on all my computers, so now I use @Sybren's solution. Here is a new answer that makes that solution handy from PowerShell (without putting all of git/usr/bin in the PATH, which is too much clutter for me).
Add this to your profile.ps1:
$global:gitbin = 'C:\Program Files\Git\usr\bin'
Set-Alias file.exe $gitbin\file.exe
And use it like: file.exe --mime-encoding *. You must include the .exe in the command for the PS alias to work.
But if you don't customize your PowerShell profile.ps1, I suggest you start with mine: https://gist.github.com/yzorg/8215221/8e38fd722a3dfc526bbe4668d1f3b08eb7c08be0
and save it to ~\Documents\WindowsPowerShell. It's safe to use on a computer without git, but it will write warnings when git is not found.
The .exe in the command is also how I use C:\WINDOWS\system32\where.exe from PowerShell, and many other OS CLI commands that are "hidden by default" by PowerShell, *shrug*.
You can simply check that by opening Git Bash at the file's location and then running the command file -i file_name
Example:
user filesData
$ file -i data.csv
data.csv: text/csv; charset=utf-8
Some C code for reliable ASCII, BOM, and UTF-8 detection: https://unicodebook.readthedocs.io/guess_encoding.html
Only ASCII, UTF-8 and encodings using a BOM (UTF-7 with BOM, UTF-8 with BOM,
UTF-16, and UTF-32) have reliable algorithms to get the encoding of a document.
For all other encodings, you have to trust heuristics based on statistics.
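The UTF-8 case in particular is easy to reproduce from PowerShell: a minimal sketch that asks .NET's strict UTF-8 decoder whether the bytes are valid (the file name is a placeholder):

# throws on the first invalid byte sequence, so a clean pass means valid UTF-8
$bytes = [System.IO.File]::ReadAllBytes('.\some-file.txt')
$strictUtf8 = [System.Text.UTF8Encoding]::new($false, $true)  # don't emit a BOM, throw on invalid bytes
try   { $null = $strictUtf8.GetString($bytes); 'valid UTF-8 (or plain ASCII)' }
catch { 'not UTF-8' }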
EDIT:
A PowerShell version of a C# answer from Effective way to find any file's Encoding. It only works with signatures (BOMs).
# get-encoding.ps1
param([Parameter(ValueFromPipeline=$True)] $filename)

begin {
    # set .NET's current directory to match the shell's
    [Environment]::CurrentDirectory = (pwd).path
}

process {
    # $true = detect the encoding from a byte order mark, if present
    $reader = [System.IO.StreamReader]::new($filename,
        [System.Text.Encoding]::Default, $true)
    $peek = $reader.Peek()  # the reader must touch the stream before CurrentEncoding is set
    $encoding = $reader.CurrentEncoding
    $reader.Close()

    [pscustomobject]@{Name = split-path $filename -leaf
                      BodyName = $encoding.BodyName
                      EncodingName = $encoding.EncodingName}
}
.\get-encoding chinese8.txt
Name BodyName EncodingName
---- -------- ------------
chinese8.txt utf-8 Unicode (UTF-8)
get-childitem -file | .\get-encoding
Looking for a Node.js/npm solution? Try encoding-checker:
npm install -g encoding-checker
Usage
Usage: encoding-checker [-p pattern] [-i encoding] [-v]
Options:
--help Show help [boolean]
--version Show version number [boolean]
--pattern, -p, -d [default: "*"]
--ignore-encoding, -i [default: ""]
--verbose, -v [default: false]
Examples
Get encoding of all files in current directory:
encoding-checker
Return encoding of all md files in current directory:
encoding-checker -p "*.md"
Get encoding of all files in current directory and its subfolders (this will take quite some time for huge folders and may appear unresponsive):
encoding-checker -p "**"
For more examples, refer to the npm docs or the official repository.
Similar to the solution listed above with Notepad, you can also open the file in Visual Studio, if you're using that. In Visual Studio, you can select "File > Advanced Save Options..."
The "Encoding:" combo box will tell you specifically which encoding is currently being used for the file. It has a lot more text encodings listed in there than Notepad does, so it's useful when dealing with various files from around the world and whatever else.
Just like Notepad, you can also change the encoding from the list of options there, and then save the file after hitting "OK". You can also select the encoding you want through the "Save with Encoding..." option in the Save As dialog (by clicking the arrow next to the Save button).
The only way that I have found to do this is VIM or Notepad++.
EncodingChecker
File Encoding Checker is a GUI tool that allows you to validate the text encoding of one or more files. The tool can display the encoding for all selected files, or only the files that do not have the encodings you specify.
File Encoding Checker requires .NET 4 or above to run.
I have Cygwin installed in order to use Linux command line tools on Windows. I also added it to my PATH. In general, it works fine, but I observe this weird behavior:
I want to run sha256sum on the file C:\Users\s1504gl\Desktop\Täst .txt. Note the German umlaut ä and the whitespace before the file extension. To avoid problems with paths, I always quote paths in command-line calls, such as:
sha256sum "C:\Users\s1504gl\Desktop\Täst .txt"
However, PowerShell returns
/usr/bin/sha256sum: '"C:\Users\s1504gl\Desktop\T'$'\303\244''st .txt"': No such file or directory
When I rename the file to either Täst.txt or Test .txt, it works. So the combination of the special character ä and the whitespace seems to cause the problem. Exchanging double quotes for single quotes does not change anything in this case.
I am pretty sure it has to do with PowerShell, since the example works without any problems on my Linux machine.
Is there some other way of escaping special characters and/or blanks that I do not know about?
Run it from a Cygwin terminal:
sha256sum "/cygdrive/c/Users/s1504gl/Desktop/Täst .txt"
In general, Cygwin programs do not accept Windows paths reliably; they work dependably only with POSIX paths.
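If you would rather keep calling it from PowerShell, one workaround (a sketch; it assumes Cygwin's bin directory, and therefore cygpath, is on your PATH) is to let cygpath translate the Windows path first:

# convert the Windows path to its POSIX form, then hand it to sha256sum
$posixPath = & cygpath -u 'C:\Users\s1504gl\Desktop\Täst .txt'
& sha256sum $posixPath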
I found the following workaround:
I create a temporary file from R containing all the necessary commands and then run that temp file using bash, which is also included in Cygwin. This way I escape the problem caused by the different handling of encodings between Windows and the Linux tools from Cygwin.
I am trying to build a command line that compresses files into a password-protected RAR archive on Windows 7. I have WinRAR 5.31 x64 installed.
The following command works for me:
rar a -r -m0 -hp"!(/!$!#!#=)\%" C:\files1.rar" *.*
The password is !(/!$!#!#=)\%.
My problem occurs if I want to put a double quote " inside my password, for example at the beginning:
rar a -r -m0 -hp""!(/!$!#!#=)\%" C:\files1.rar" *.*
The password should be "!(/!$!#!#=)\%.
That does not work for me; I also tried putting \ before the ", but that is not working either.
Could anyone guide me on how to get this special character into my password?
Further to the answer by Mofi:
Especially for Linux users using winrar/rar from the commandline, it may be worth realizing that rar effectively accepts "keyfiles", which may overcome the need to fiddle with quotes as part of the password.
Rar's documented maximum password length is 127 characters/bytes. It is not clear (to me) precisely which characters are part of the password space, but at least base64-encoded strings work. However, rar currently derives the encryption key with PBKDF2 using the HMAC-SHA256 hash function, which has a block size of 512 bits. Per HMAC (the pseudo-random function inside PBKDF2), passwords longer than the hash function's block size are first pre-hashed into a 256-bit digest, and that digest is then used as the password instead of the original. To avoid this, the archive password you pick should be no longer than 512 bits, i.e. 64 characters.
In a base64-encoded string, each character represents 6 bits of data; a 64 character password thus amounts to 384 random bits, which may be derived from 48 random bytes.
rar a -hp"$(dd if=/dev/urandom bs=48 count=1 | base64 -w0 | tee /tmp/pwd)" archive
The dd pipe above reads 48 (pseudo)random bytes from the kernel's (non-blocking) random number source device, converts them into a 64-character password, tells rar to use that password for deriving a 256-bit (AES-256) encryption key (RAR5 format), and at the same time stores the password in the file /tmp/pwd.
The archive may again be accessed, e.g. listed, by reading the password back from the file, for instance like so:
rar l -p"$(cat /tmp/pwd)" archive.rar
The password file may be safely stored separately or together with the archive, in the latter case (of course) after encrypting it, e.g. with gpg using your own public key, so as to lock the archive password under your private key/passphrase. All of this aims to make convenient, good use of rar's password/key space.
I note that I didn't dive into unrar's publicly available source code; the above is merely based on the general documentation. In addition, I don't know if the above can be made to work under Windows.
The Windows command interpreter cmd.exe and Rar.exe itself determine how arguments specified on the command line are interpreted during parsing. Argument strings containing a space or one of the characters &()[]{}^=;!'+,`~<|> must be enclosed in double quotes. This makes it very difficult to pass a double-quote character as part of an argument string to a console application, especially at the beginning of an argument string.
But there is a solution for this very uncommon and very specific problem, caused by a password/passphrase starting with a straight double-quote character, which usually marks the begin/end of an argument string within which all characters are interpreted literally.
The manual for the console version of WinRAR is the text file Rar.txt in the program files folder of WinRAR. This manual explains that Rar.exe supports reading switches from an environment variable named RAR. By using this environment variable and the Windows command interpreter's special parsing of a SET command line, it is possible to create a RAR archive from the command line with a password starting with a straight double-quote character.
@echo off
setlocal EnableExtensions DisableDelayedExpansion
rem Define switch -hp in variable RAR; %% escapes % in a batch file, and the
rem doubled leading quote survives Rar's quote stripping (see explanation below).
set "RAR=-hp""!(/!$!#!#=)\%%""
"%ProgramFiles%\WinRAR\Rar.exe" a -r -m0 -x"%~f0" "%USERPROFILE%\Desktop\files1.rar" *.*
endlocal
The switch -hp is read from the environment variable RAR in addition to the other switches specified directly on the Rar command line, as explained in the manual.
The environment variable RAR is defined using the syntax set "variable=value", as explained in detail in the answer to Why is no string output with 'echo %var%' after using 'set var = text' on command line?
A password/passphrase containing a space or one of the characters &()[]{}^=;!'+,`~<|> must be enclosed in double quotes on the Windows command line. For that reason, Rar.exe strips the first and last double quote from the passed password/passphrase if one is present at the beginning and/or the end. So it is not possible to define the password as "!(/!$!#!#=)\%. The password must be defined with two additional double quotes, ""!(/!$!#!#=)\%", so that the password actually used starts with a straight double-quote character.
In a batch file, % marks the begin/end of an environment variable reference unless it is escaped with one more %.
So finally, the line set "RAR=-hp""!(/!$!#!#=)\%%"" defines the environment variable RAR with the switch -hp, passing the string "!(/!$!#!#=)\% to Rar.exe as the password to use for encryption.
The RAR archive files1.rar is created on the user's desktop because the root of drive C: is usually write-protected.
Note: Rar and WinRAR interpret *.* differently from *, as also explained in the manual, whereas the Windows kernel functions treat the two identically. With *.*, Rar only adds files whose names contain a dot. So you might be better off using just * as the wildcard.
The switch -x"%~f0" prevents the batch file itself from being added to the RAR archive if it is stored in the current directory when the batch file runs. Run call /? in a command prompt window for an explanation of %~f0: the full name of argument 0, i.e. the batch file name with file extension and full path.
I have a *.tar.gz file that occasionally contains names with non-ASCII letters inside.
For example, when tar encounters a file containing the word naïve, it outputs: na\303\257ve
Is there any switch or tool to convert these backslash-escaped values back to the proper letters?
http://www.gnu.org/software/tar/manual/tar.html
By default GNU tar attempts to unquote each file or member name, replacing escape sequences according to the following table: ...
This default behavior is controlled by the following command line
option:
--unquote
Enable unquoting input file or member names (default).
--no-unquote
Disable unquoting input file or member names.
In other words, see if "--no-unquote" is an option for your version of Cygwin.
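For example (a sketch; the archive name is a placeholder, and GNU tar, as shipped with Cygwin, is assumed):

# extract with input-name unquoting disabled, per the manual excerpt above
tar --no-unquote -xzf archive.tar.gz

# if the \303\257-style escapes show up when *listing* the archive, the
# output-side setting in GNU tar is the quoting style
tar --quoting-style=literal -tzf archive.tar.gz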
PS:
Which version of Cygwin tar are you using?
I have a little C# console program that outputs some text using Console.WriteLine. I then redirect this output into a text file like:
c:\>myprogram > textfile.txt
However, the file is always an ANSI text file, even when I start cmd with the /u switch.
cmd /? says about the /u switch:
/U Causes the output of internal
commands to a pipe or file to be Unicode
And it indeed makes a difference: when I do
c:\>echo "foo" > text.txt
text.txt is Unicode (without a BOM).
I wonder why piping the output of my console program into a new file does not likewise create a Unicode file, and how I could change that?
For now I just use Windows PowerShell (which produces a Unicode file with a correct BOM), but I'd still like to know how to do it with cmd.
Thanks!
The /U switch, as the documentation says, affects whether internal commands generate Unicode output. Your program is not one of cmd.exe's internal commands, so the /U option does not affect it.
To create a Unicode text file, you need to make sure your program is generating Unicode text.
Even that may not be enough, though. I came across this blog from Junfeng Zhang describing how to write Unicode text in a console program. It checks the file type of the standard output handle. For character files (a console or LPT port), it calls WriteFileW. For all other types of handles (including disk files and pipes), it converts the output string to the console's current code page. I'm afraid I don't know how that translates into .Net terms, though.
I had a look at how mscorlib implements Console.WriteLine, and it seems to decide which text output encoding to use based on a call to GetConsoleOutputCP. So I'm guessing (but have not yet confirmed) that the code page returned for a PowerShell console differs from the one for a cmd console, so that my program indeed only outputs ANSI when running from cmd.
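If the goal is simply to end up with a UTF-16 file, the PowerShell route mentioned in the question can be written as a one-liner (myprogram.exe is a placeholder for the console app):

# In Windows PowerShell, -Encoding Unicode writes UTF-16 LE with a BOM,
# independent of the console code page. Note PowerShell first decodes the
# program's output using the console code page before re-encoding it.
.\myprogram.exe | Out-File -FilePath textfile.txt -Encoding Unicode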