Unicode to UTF-8 - unicode

I'm using VBScript to extract data from DB2 and write it to a file.
I create the file like this:
Set objTextFile = objFSO.CreateTextFile(sFilePath, True, True)
This creates the file in Unicode (UTF-16). But it is an XML file, and it is supposed to be UTF-8.
So when I open the XML file with MS XML Notepad, it throws an error:
'hexadecimal value 0x00 is an invalid character'
As a workaround, I open the text file in TextPad and save it as UTF-8. After that, the XML opens without any problems.
Can I convert the file from Unicode to UTF-8 in VBScript?

Using the Stream object to save your file with the utf-8 charset might work better for you; here's a simple .vbs function you could test out on your data:
Option Explicit

Sub Save2File (sText, sFile)
    Dim oStream
    Set oStream = CreateObject("ADODB.Stream")
    With oStream
        .Open
        .CharSet = "utf-8"
        .WriteText sText
        .SaveToFile sFile, 2 ' 2 = adSaveCreateOverWrite
    End With
    Set oStream = Nothing
End Sub

' Example usage:
Save2File "The data I want in utf-8", "c:\test.txt"
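To see where the 0x00 in the error message comes from, here is a minimal sketch in Python (used only for illustration, not as a replacement for the VBScript above). The function name `utf16_to_utf8` and the file names are hypothetical. It shows that UTF-16 stores each ASCII character with an extra zero byte, and then re-encodes a UTF-16 file as UTF-8, which is roughly what the TextPad "save as UTF-8" step does.

```python
import os
import tempfile


def utf16_to_utf8(src_path, dst_path):
    """Re-encode a UTF-16 text file (with BOM) as UTF-8 without a BOM."""
    # The "utf-16" codec reads the BOM to pick the byte order and drops it.
    with open(src_path, "r", encoding="utf-16", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


if __name__ == "__main__":
    # In UTF-16LE every ASCII character is followed by a 0x00 byte,
    # which is what a parser expecting UTF-8 trips over.
    print("<a/>".encode("utf-16-le"))

    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "unicode.xml")  # hypothetical file names
        dst = os.path.join(tmp, "utf8.xml")
        with open(src, "w", encoding="utf-16", newline="") as f:
            f.write("<a>ä</a>")
        utf16_to_utf8(src, dst)
        with open(dst, "rb") as f:
            print(f.read())
```

Note that this sketch writes UTF-8 without a byte order mark; whether a BOM is wanted depends on the consumer of the file.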

In some cases you need to do this in WSH on a machine without ADO. Keep in mind that WSH cannot create a file in UTF-8 format (the CreateTextFile method does not support UTF-8), but it is entirely possible to append data to an existing UTF-8 file. With that in mind, I found an unorthodox solution. Follow these steps:
1) Open a blank Notepad window, click File > Save As, type a name for the file (for example, UTF8FileFormat.txt), change the "Encoding" field to UTF-8 and click [Save]. Close Notepad.
2) In your WSH script, use UTF8FileFormat.txt as a template for your UTF-8 text file. After creating your FileSystemObject, use the CopyFile method to copy UTF8FileFormat.txt to a new file (remember to use the Overwrite option), then use the OpenTextFile method to open the new file with the ForAppending and NoCreate options. After this, you can write to the file normally (as with the CreateTextFile method). Your new file will be in UTF-8 format. An example follows:
'### START
' ### REMEMBER: You need to create the UTF8FileFormat.txt file in a blank
' ### NOTEPAD with UTF-8 Encoding first.
Unicode=-1 : ForAppending=8 : NoCreate=False : Overwrite=True
set fs = CreateObject("Scripting.FileSystemObject")
fs.CopyFile "UTF8FileFormat.txt","MyNewUTF8File.txt",Overwrite
set UTF8 = fs.OpenTextFile("MyNewUTF8File.txt", ForAppending, NoCreate)
UTF8.writeline "My data can be written in UTF-8 format now"
UTF8.close
set UTF8 = nothing
'### END

Related

Python Encoding String Issue: String "Île-de-France" is being written to CSV file as "√éle-de-France"

I'm trying to write a string "Île-de-France" to a CSV file, however, what is being written to file is "√éle-de-France". I tried encoding="utf-8" in the open and it didn't work. What am I missing? Thanks!
import csv

with open('/tmp/directwrite.csv', 'w', newline='') as csvfile:
    fieldnames = ['location']
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerow({'location': 'Île-de-France'})
It’s an issue with your viewer. If using Excel or some other Windows program, use encoding='utf-8-sig'. That writes a signature at the beginning of the file that Windows programs recognize as marking a UTF-8 file; otherwise, they assume an ANSI encoding for the file.
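As a rough illustration of the answer above, the sketch below writes the same CSV twice, once as plain UTF-8 and once as utf-8-sig, and compares the bytes. The helper names `write_locations` and `read_bytes` are made up for this example, and a temporary directory stands in for /tmp. It also shows that decoding the UTF-8 bytes with a legacy 8-bit code page reproduces the garbled text; Mac OS Roman happens to give exactly the "√é" from the question, though which code page the asker's viewer actually used is an assumption.

```python
import csv
import os
import tempfile


def write_locations(path, encoding):
    """Write a one-column CSV using the given text encoding."""
    with open(path, "w", newline="", encoding=encoding) as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=["location"])
        writer.writeheader()
        writer.writerow({"location": "Île-de-France"})


def read_bytes(path):
    """Return the raw bytes of a file."""
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "directwrite.csv")
        write_locations(path, "utf-8")
        plain = read_bytes(path)
        write_locations(path, "utf-8-sig")
        signed = read_bytes(path)

    # The two files differ only by the 3-byte signature EF BB BF.
    print(signed == b"\xef\xbb\xbf" + plain)
    # Misreading the UTF-8 bytes with an 8-bit code page garbles the Î.
    print(plain.decode("mac_roman"))
```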

Trying to upload specific characters in Python 3 using Windows Powershell

I'm running this code in Windows Powershell and it includes this file called languages.txt where I'm trying to convert between bytes to strings:
Here is languages.txt:
Afrikaans
አማርኛ
Аҧсшәа
العربية
Aragonés
Arpetan
Azərbaycanca
Bamanankan
বাংলা
Bân-lâm-gú
Беларуская
Български
Boarisch
Bosanski
Буряад
Català
Чӑвашла
Čeština
Cymraeg
Dansk
Deutsch
Eesti
Ελληνικά
Español
Esperanto
فارسی
Français
Frysk
Gaelg
Gàidhlig
Galego
한국어
Հայերեն
हिन्दी
Hrvatski
Ido
Interlingua
Italiano
עברית
ಕನ್ನಡ
Kapampangan
ქართული
Қазақша
Kreyòl ayisyen
Latgaļu
Latina
Latviešu
Lëtzebuergesch
Lietuvių
Magyar
Македонски
Malti
मराठी
მარგალური
مازِرونی
Bahasa Melayu
Монгол
Nederlands
नेपाल भाषा
日本語
Norsk bokmål
Nouormand
Occitan
Oʻzbekcha/ўзбекча
ਪੰਜਾਬੀ
پنجابی
پښتو
Plattdüütsch
Polski
Português
Română
Romani
Русский
Seeltersk
Shqip
Simple English
Slovenčina
کوردیی ناوەندی
Српски / srpski
Suomi
Svenska
Tagalog
தமிழ்
ภาษาไทย
Taqbaylit
Татарча/tatarça
తెలుగు
Тоҷикӣ
Türkçe
Українська
اردو
Tiếng Việt
Võro
文言
吴语
ייִדיש
中文
Then, here is the code I used:
import sys

script, input_encoding, error = sys.argv

def main(language_file, encoding, errors):
    line = language_file.readline()
    if line:
        print_line(line, encoding, errors)
        return main(language_file, encoding, errors)

def print_line(line, encoding, errors):
    next_lang = line.strip()
    raw_bytes = next_lang.encode(encoding, errors=errors)
    cooked_string = raw_bytes.decode(encoding, errors=errors)
    print(raw_bytes, "<===>", cooked_string)

languages = open("languages.txt", encoding="utf-8")
main(languages, input_encoding, error)
Here's the output (credit: Learn Python 3 the Hard Way by Zed A. Shaw):
I don't know why it doesn't upload the characters and shows question blocks instead. Can anyone help me?
The first string which fails is አማርኛ. The first character, አ, is Unicode code point U+12A0. In UTF-8, that is b'\xe1\x8a\xa0'. So that part is obviously fine: the file really is UTF-8.
Printing did not raise an exception, so your output encoding can handle all of the characters. Everything is fine.
The only remaining reason I see for it to fail is that the font used in the console does not support all of the characters.
If it is just for play, you should not worry about it. Consider it working correctly.
On the other hand, I would suggest changing some things in your code:
You are calling main recursively for each line. There is absolutely no need for that, and it would hit the recursion depth limit on a longer file. Use a for loop instead:
for line in language_file:
    print_line(line, encoding, errors)
You are opening the file as UTF-8, so reading from it automatically decodes the UTF-8 into Unicode strings; then you encode each line back into raw_bytes and decode it again into cooked_string, which is the same as the line you started with. It would be a better exercise to read the file as raw binary, split it on newlines and then decode. Then you'd have a clearer picture of what is going on.
with open("languages.txt", 'rb') as f:
    raw_file_contents = f.read()
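The suggestion above stops at reading the raw bytes. One possible way to finish the exercise is sketched below; the function name `decode_lines` is hypothetical, and a short in-memory byte string stands in for the contents of languages.txt so the sketch runs on its own.

```python
def decode_lines(raw_file_contents, encoding="utf-8", errors="strict"):
    """Split raw bytes on newlines and decode each non-empty line.

    Returns a list of (raw_bytes, decoded_string) pairs.
    """
    pairs = []
    for raw_line in raw_file_contents.splitlines():
        raw_line = raw_line.strip()
        if raw_line:
            pairs.append((raw_line, raw_line.decode(encoding, errors=errors)))
    return pairs


if __name__ == "__main__":
    # Stand-in for: open("languages.txt", "rb").read()
    raw_file_contents = "Afrikaans\nአማርኛ\n".encode("utf-8")
    for raw_bytes, cooked_string in decode_lines(raw_file_contents):
        print(raw_bytes, "<===>", cooked_string)
```

Here the bytes come first and the string is derived from them, so it is easier to see that the data is intact even when the console font cannot draw a character.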

How can I save Perl/Expect output that contains mixed ascii content?

I have a perl script that uses the expect library to login to a remote system. I'm getting the final output of the interaction with the before method:
$exp->before();
I'm saving this to a text file. When I use cat on the file it outputs fine in the terminal, but when I open the text file in an editor or try to process it the formatting is bizarre:
[H[2J[1;19HCIRCULATION ACTIVITY by TERMINAL (Nov 6,14)[11;1H
Is there a better way to save the output?
When I run enca it's identified as:
7bit ASCII characters
Surrounded by/intermixed with non-text data
You can remove non-ASCII characters:
$str1 =~ s/[^[:ascii:]]//g;
print "$str1\n";
I was able to remove the ANSI escape codes from my output by using the Text::ANSI::Util library's ta_strip() function:
use Text::ANSI::Util qw(ta_strip);
my $ansi_string = $exp->before();
my $clean_string = ta_strip($ansi_string);
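For comparison, the same idea can be sketched in Python with a regular expression. This is not what Text::ANSI::Util does internally; it is a simplified pattern that only covers CSI sequences (ESC followed by "["), and the names `CSI_PATTERN` and `strip_csi` are made up for this example. The sample string assumes that each "[" in the captured output shown in the question was preceded by an ESC byte that the editor did not display.

```python
import re

# CSI sequence: ESC, "[", parameter bytes, intermediate bytes, one final byte.
CSI_PATTERN = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def strip_csi(text):
    """Remove ANSI CSI escape sequences such as ESC[2J or ESC[1;19H."""
    return CSI_PATTERN.sub("", text)


if __name__ == "__main__":
    captured = (
        "\x1b[H\x1b[2J\x1b[1;19H"
        "CIRCULATION ACTIVITY by TERMINAL (Nov 6,14)"
        "\x1b[11;1H"
    )
    # ESC (0x1B) is itself a 7-bit ASCII character, so the whole capture is ASCII.
    print(captured.isascii())
    print(strip_csi(captured))
```

Because the escape sequences consist entirely of ASCII bytes, which matches what enca reported, a filter that only drops non-ASCII characters would leave them in place.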

Perl unicode file with non-unicode content

A piece of software is producing UTF-8 files, but writing content to them that isn't Unicode. I can't change that software and have to take the output as it is. I don't know if this will show up here correctly, but a German umlaut "ä" is shown in the file as "Ã¤".
If I open the file in Notepad++, it tells me the file is UTF-8 (without BOM) encoded. Now, if I say "convert to ANSI" in Notepad and then switch the file encoding back to UTF-8 (without converting), the German umlauts in the file are correct. How can I achieve the exact same behaviour in Perl? Whatever I tried up to now, the umlaut mess just got worse.
To reproduce, create a UTF-8 encoded file and write this to it:
MÃ¤nner SchÃ¼ler VÃ¶gel SÃ¼ÃŸ
Then, on a UTF-8 MySQL database, create a table with a varchar field and utf8_unicode encoding. Now, use this script:
use utf8;
use DBI;
use Encode;

if (open FILE, "test.csv") {
    my $db = DBI->connect(
        'DBI:mysql:your_db;host=127.0.0.1;mysql_compression=1', 'root', 'Yourpass',
        { PrintError => 1 }
    );
    my $sql = qq{SET NAMES 'utf8';};
    $db->do($sql);
    while (my $line = <FILE>) {
        my $sth = $db->prepare("INSERT IGNORE INTO testtable (testline) VALUES (?);");
        $sth->execute($line);
    }
}
The exact contents of file will get written to the database. But, the output I expect in database is with German umlauts:
Männer Schüler Vögel Süß
So, how can I convert that correctly?
It's ironic: as I see it, the software you talk about is not writing 'non-unicode content' (that's nonsense) - it encodes the content as UTF-8 twice. Take this ä character, for example: it's represented by two bytes in UTF-8, %C3 %A4. But then something in that program decides to treat these bytes as Latin-1 instead: thus they become two separate characters (which are then encoded into UTF-8 again, and that's what is saved into the file).
I suppose the simplest way of reversing this is making Perl think that it uses a series of bytes (and not a sequence of characters) when dealing with the string read from the file. It can be done as simple (and as ugly) as...
open my $fh, '<:utf8', $file_name or die $!;
my $string = <$fh>; # a sequence of characters
utf8::decode($string); # works in place: the characters are treated as octets and decoded as UTF-8
Sounds like something is converting it a second time, assuming it to be something like ISO 8859-15 and then converting that to UTF-8. You can reverse this by converting UTF-8 to ISO 8859-15 (or whichever encoding seems to make sense for your data).
As seen on http://www.fileformat.info/info/unicode/char/E4/index.htm the bytes 0xC3 0xA4 are the valid UTF-8 encoding of ä. When viewed as ISO 8859-15 (or 8859-1, or Windows-1252, or a number of other 8-bit encodings) they display the string Ã¤.
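The reversal described in both answers (convert back to the wrongly assumed 8-bit encoding, then decode as UTF-8) can be sketched in Python as follows. The function name `undo_double_utf8` is hypothetical, and Latin-1 as the intermediate encoding is an assumption; the faulty software may have used Windows-1252 or ISO 8859-15 instead, in which case the `wrong_codec` argument would need to change.

```python
def undo_double_utf8(text, wrong_codec="latin-1"):
    """Reverse text that was UTF-8 encoded, misread as `wrong_codec`, and encoded again."""
    # Step 1: turn the mis-decoded characters back into the original bytes.
    original_bytes = text.encode(wrong_codec)
    # Step 2: decode those bytes as the UTF-8 they were all along.
    return original_bytes.decode("utf-8")


if __name__ == "__main__":
    expected = "Männer Schüler Vögel Süß"
    # Simulate the faulty software: UTF-8 bytes misread as Latin-1.
    mangled = expected.encode("utf-8").decode("latin-1")
    print(undo_double_utf8(mangled) == expected)
```

If the guess for `wrong_codec` is wrong, either step can raise a UnicodeError, which is a useful signal that a different intermediate encoding was involved.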

Unicode named Folder shows ? in wscript prompt

I am facing problems with Unicode named folders. When I drag the folder to the script, it doesn't show the path of the folder properly.
Simple VBScript (this is just a portion of it):
Dim Wshso : Set Wshso = WScript.CreateObject("WScript.Shell")
Dim FSO : Set FSO = CreateObject("Scripting.FileSystemObject")
If WScript.Arguments.Count = 1 Then
    If FSO.FileExists(Wscript.Arguments.Item(0)) = true and FSO.FolderExists(Wscript.Arguments.Item(0)) = false Then
        Alert "You dragged a file, not a folder! My god." & vbcrlf & "Script will terminate immediately", 0, "Alert: User is stupid", 48
        WScript.Quit
    Else
        targetDir = WScript.Arguments.Item(0)
        Wshso.Popup targetDir
    End If
Else
    targetDir = Wshso.SpecialFolders("Desktop")
    Alert "Note: No folder to traverse detected, default set to:" & vbcrlf & Wshso.SpecialFolders("Desktop"), 0, "Alert", 48
End If
If it is a normal path without Unicode characters, it's fine. But in this case:
Directory: 4minute (포미닛) - Hit Your Heart
Then it will show something like 4minute (?) - Hit Your Heart
And if I do a FolderExists it can't find the dragged folder.
Is there any workaround to support Unicode named Folders?
Thanks!
I'll edit if this is not clear enough
This does seem to be a problem peculiar to the Windows Script Host's DropHandler shell extension. Whereas:
test.vbs "C:\포미닛.txt"
C:\WINDOWS\System32\WScript.exe "test.vbs" "C:\포미닛.txt"
both work when typed from the console (even if the console can't render the Hangul so it looks like ?), a drag and drop operation that should result in the same command goes through a Unicode->ANSI->Unicode translation that loses all characters that aren't in the current ANSI code page. (So 포미닛 will work on a default Korean Windows install but not Western.)
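The lossy Unicode->ANSI->Unicode round trip can be imitated in Python to see why the Hangul disappears. This is only a sketch of the effect, not of what the DropHandler does internally; the function name `ansi_round_trip` is hypothetical, and cp1252 and cp949 are used as stand-ins for a Western and a Korean ANSI code page.

```python
def ansi_round_trip(path, ansi_codec):
    """Simulate a Unicode -> ANSI -> Unicode conversion through one code page."""
    # Characters missing from the code page are replaced by "?" and cannot be recovered.
    return path.encode(ansi_codec, errors="replace").decode(ansi_codec)


if __name__ == "__main__":
    folder = "4minute (포미닛) - Hit Your Heart"
    # Western code page: the Hangul is lost.
    print(ansi_round_trip(folder, "cp1252"))
    # Korean code page: the name survives unchanged.
    print(ansi_round_trip(folder, "cp949") == folder)
```

Once the characters have been replaced, the resulting string no longer names an existing folder, which fits the FolderExists failure described in the question.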
I'm not aware of a proper way to fix the problem. You could perhaps work around it by changing the DropHandler for .vbs files in the registry:
HKEY_CLASSES_ROOT\VBSFile\ShellEx\DropHandler\(Default)
from the WSH DropHandler ({60254CA5-953B-11CF-8C96-00AA00B8708C}) to {86C86720-42A0-1069-A2E8-08002B30309D}, the one used for .exe, .bat and similar, which doesn't suffer from this issue. You would also probably have to change the file association for .vbs to put quotes around the filename argument too, since the EXE DropHandler doesn't, to avoid problems with spaces in filenames.
Since this affects argument-passing for all VBS files it would be a perilous fix to deploy on any machine but your own. If you needed to do that, maybe you could try creating a new file extension with the appropriate DropTarget rather than changing VBSFile itself? Or maybe forgo drop-onto-script behaviour and provide a file Open dialog or manual drop field instead.
For anyone landing here from Google...
Bobince's tip led me to work around this problem by wrapping my VBScript file (myscript.vbs) in a DOS batch file (mybatch.bat).
The tip was:
"Seem to be a problem peculiar to the Windows Script Host's
DropHandler shell extension whereas.... the one used for .exe, .bat and
similar... doesn't suffer from this issue."
mybatch.bat contains:
:Loop
IF "%1"=="" GOTO Continue
set allfiles=%allfiles% "%1"
SHIFT
GOTO Loop
:Continue
"myscript.vbs" %allfiles%
You may also find this code from my myscript.vbs to be helpful
For Each strFullFileName In Wscript.Arguments
' do stuff
Next
Based on DG's answer, if you just want to accept one file as a drop target, you can write a batch file (if it is named "x.bat", place the VBScript with the filename "x.bat.vbs" in the same folder) that contains just:
@"%0.vbs" %1
The @ means the line is not echoed to the display (I found it showed garbage text even when using chcp 1250 as the first command).
Don't use double quotes around %1; it won't work if your VBScript uses logic like the following (the code I was using is from http://jeffkinzer.blogspot.com/2012/06/vbscript-to-convert-excel-to-csv.html). I tested it and it works fine with spaces in the file and folder names:
Dim strExcelFileName
strExcelFileName = WScript.Arguments.Item(0) 'file name to parse
' get path where script is running
strScript = WScript.ScriptFullName
Dim fso
Set fso = CreateObject ("Scripting.FileSystemObject")
strScriptPath = fso.GetAbsolutePathName(strScript & "\..")
Set fso = Nothing
' If the Input file is NOT qualified with a path, default the current path
LPosition = InStrRev(strExcelFileName, "\")
If LPosition = 0 Then 'no folder path
    strExcelFileName = strScriptPath & "\" & strExcelFileName
    strScriptPath = strScriptPath & "\"
Else 'there is a folder path, use it for the output folder path also
    strScriptPath = Mid(strExcelFileName, 1, LPosition)
End If
' msgbox LPosition & " - " & strExcelFileName & " - " & strScriptPath
Modify WSH DropHandler ({60254CA5-953B-11CF-8C96-00AA00B8708C}) to {86C86720-42A0-1069-A2E8-08002B30309D} and add this function to convert short path to long:
Function Short2Long(shortFullPath)
    Dim fs
    Set fs = CreateObject("Scripting.FileSystemObject")
    Set f = fs.GetFile(shortFullPath)
    Set app = CreateObject("Shell.Application")
    Short2Long = app.NameSpace(f.ParentFolder.Path).ParseName(f.Name).Path
End Function