How to programmatically rename a file containing decomposed characters? - unicode

I occasionally have to deal with files produced in a Mac environment whose filenames contain decomposed characters (looks like "é", but really is "e´"). Those are apparently not recognized by Scripting.FileSystemObject and therefore cannot be acted on. I need to programmatically rename those files to remove the decomposed characters before further processing.
From what I found : "é (U+00E9) is a character that can be decomposed into an equivalent string of the base letter e (U+0065) and combining acute accent (U+0301)."
In other words, both strings look exactly like this : "é", but the length of the first one is 1 and the length of the second one is 2. If converted, it actually looks like this "e´".
Here's a little script for testing purposes :
(Please create those two test files by copy/pasting the names)
Filename with composed character (working) : é.txt
Filename with decomposed character (not working) : é.txt
Set FSO = CreateObject("Scripting.FileSystemObject")
For Each Arg In WScript.Arguments
    Set objFile = FSO.GetFile(Arg)
    ' Folder part of the path, including the trailing backslash
    fPath = Left(objFile.Path, Len(objFile.Path) - Len(objFile.Name))
    FSO.MoveFile Arg, fPath & "a.txt"
    Set objFile = Nothing
Next
' Release the FileSystemObject only after all arguments are processed
Set FSO = Nothing
The file with the decomposed character produces a "File not found" error.
I managed to convert a string from decomposed to composed characters, but it still does not work when I try to rename an actual file.
I'm completely stuck at this point, and any help would be highly appreciated! Thanks in advance.

This has to do with the VBS/WSH DropHandler (HKEY_CLASSES_ROOT\VBSFile\ShellEx\DropHandler)
The DropHandler of VBS/WSH files is {60254CA5-953B-11CF-8C96-00AA00B8708C}.
EXE/BAT/CMD files are handled by {86C86720-42A0-1069-A2E8-08002B30309D}.
VBS/WSH drophandler parses the dropped object(s) to a long file path while the EXE/BAT/CMD drophandler parses the dropped object(s) to short file path (such as C:\PROGRA~1).
The problem is that the VBS DropHandler does not parse the dropped object in a Unicode-aware way.
Your code apparently relies on items being dropped onto the script, and therefore on WScript.Arguments.
The FSO functions CAN handle filenames like you describe.
You can test this by performing a
Set objFile = FSO.GetFile("<PATH>\e´.txt")
or even
FSO.FileExists("<PATH>\e´.txt")
However, coming in through the arguments, the filenames are already crippled by the drophandler. I see no safe way of changing this behaviour other than messing around in the Windows Registry or by changing your script to not use 'drag-'n-drop' but getting the filenames from the OpenFile dialog perhaps.
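As an aside, here is a minimal sketch of the rename step done outside VBScript, in Python. It assumes the script reads names directly from the folder (so no DropHandler is involved) and that the file system keeps composed and decomposed names distinct. The function name rename_to_composed is hypothetical.

```python
import os
import unicodedata


def rename_to_composed(directory):
    """Rename entries whose names are not in composed (NFC) form.

    Returns a list of (old_name, new_name) pairs for the renames performed.
    """
    renamed = []
    for name in os.listdir(directory):
        composed = unicodedata.normalize("NFC", name)
        if composed == name:
            continue  # already composed, nothing to do
        source = os.path.join(directory, name)
        target = os.path.join(directory, composed)
        if os.path.exists(target):
            continue  # never overwrite an existing file with the composed name
        os.rename(source, target)
        renamed.append((name, composed))
    return renamed
```

Calling rename_to_composed on the folder holding the two test files should leave the composed "é.txt" alone and only touch the decomposed one, which is skipped here if the composed name is already taken.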

Related

Compare filenames with different encoding in Octave

I'm trying to accomplish following task in Octave:
Read filename from text file
Search for this file in particular location on hard drive
My script works for most files, but for certain files containing unicode characters I'm unable to match the filename from textfile with filename as it appears in the file system.
Filenames in textfile are in UTF-8 encoding and I read them in Octave with function fgetl().
Filenames from file system are obtained via function readdir(). I'm on Windows, NTFS file system.
For example, one problematic filename contains character "Č".
When printed out in Octave console, the characters appear exactly the same. However, a HEX viewer reveals that the characters are not actually the same. In the first case the character is encoded as 0x010C, in the second case as 0x0043 + 0x030C. Comparing both of them via strcmp() fails, of course.
What I tried to do is to omit all non-ASCII characters from the filename and then compare them. But this didn't work, probably because in the second variant the first part of the character (0x0043) is actually ASCII.
Now I'm looking for some way of converting one format to another to be able to compare them. Any ideas?
EDIT:
As I discovered later, the character Č in the filename on Windows is actually written as C+ˇ, which is just another way you can write that character. So the difference probably isn't in the encoding standard, but in 2 different ways to achieve 1 visible character (glyph).
This question basically then changes to a task of matching characters written "at once" and corresponding pair of letter+combining character.
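Matching a precomposed character against a letter plus combining character is what Unicode normalization is for. The sketch below shows the idea in Python rather than Octave, using the two code point sequences from the question. The helper name same_name is hypothetical.

```python
import unicodedata


def same_name(a, b):
    """Compare two strings after bringing both to composed (NFC) form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u010C"   # Č as a single code point
decomposed = "C\u030C"   # C followed by COMBINING CARON

print(len(precomposed), len(decomposed))    # 1 2
print(precomposed == decomposed)            # False
print(same_name(precomposed, decomposed))   # True
```

Normalizing both sides to the same form (NFC here, NFD would work equally well) before comparing avoids having to guess which form each source uses.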

Can not seem to use a base64 string from a file for a variable in batch

I'm having an issue parsing a base64 string from a text file in to a batch variable
I have a script that is generating a config for an application in XML using batch, I have the XML generating fine. The problem is, within the XML I generate is some base64 that encodes more XML with variables that I need to modify. Headache and a half. (The application in question requires this or it breaks the config)
The XML that needs to be embedded is already encoded as a base64 string in a text file. I need to load that text file into a variable, but I think the string is breaking the variable.
The base64 it has generated is this:
PD94bWwgdmVyc2lvbj0iMS4wIj8+DQo8QXJyYXlPZlN5c3RlbVZhcmlhYmxlIHhtbG5zOnhzaT0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEtaW5zdGFuY2UiIHhtbG5zOnhzZD0iaHR0cDovL3d3dy53My5vcmcvMjAwMS9YTUxTY2hlbWEiPg0KPFN5c3RlbVZhcmlhYmxlPg0KPElEPiVVU0VSTkFNRSU8L0lEPg0KPFZhbHVlPmxlZ2FsaXQ8L1ZhbHVlPg0KPFJlYWRPbmx5PnRydWU8L1JlYWRPbmx5Pg0KPFR5cGU+U3RyaW5nPC9UeXBlPg0KPC9TeXN0ZW1WYXJpYWJsZT4NCjxTeXN0ZW1WYXJpYWJsZT4NCjxJRD4lTE9HT05fVVNFUk5BTUUlPC9JRD4NCjxSZWFkT25seT50cnVlPC9SZWFkT25seT4NCjxWYWx1ZT5sZWdhbGl0PC9WYWx1ZT4NCjxUeXBlPlN0cmluZzwvVHlwZT4NCjwvU3lzdGVtVmFyaWFibGU+DQo8U3lzdGVtVmFyaWFibGU+DQo8SUQ+JVNFX0xPQ0FMX1RFTVAlPC9JRD4NCjxWYWx1ZT5DOlxVc2Vyc1xsZWdhbGl0XEFwcERhdGFcTG9jYWxcVGVtcFwyXDwvVmFsdWU+DQo8UmVhZE9ubHk+dHJ1ZTwvUmVhZE9ubHk+DQo8VHlwZT5QYXRoPC9UeXBlPg0KPC9TeXN0ZW1WYXJpYWJsZT4NCjxTeXN0ZW1WYXJpYWJsZT4NCjxJRD4lU0VfTE9DQUxfRElDVF9ST09UJTwvSUQ+DQo8VmFsdWU+QzpcVXNlcnNcbGVnYWxpdFxEb2N1bWVudHNcU3BlZWNoRXhlY1w8L1ZhbHVlPg0KPFJlYWRPbmx5PnRydWU8L1JlYWRPbmx5Pg0KPFR5cGU+UGF0aDwvVHlwZT4NCjwvU3lzdGVtVmFyaWFibGU+DQo8U3lzdGVtVmFyaWFibGU+DQo8SUQ+JVNFX0NFTlRSQUxfRElDVF9ST09UJTwvSUQ+DQo8VmFsdWU+XFxMSVQtU0VSVkVSXFBoaWxpcHNfU0VfRW50ZXJwcmlzZVxDZW50cmFsX0RpY3RhdGlvbjwvVmFsdWU+DQo8UmVhZE9ubHk+ZmFsc2U8L1JlYWRPbmx5Pg0KPFR5cGU+UGF0aDwvVHlwZT4NCjwvU3lzdGVtVmFyaWFibGU+DQo8L0FycmF5T2ZTeXN0ZW1WYXJpYWJsZT4NCg==
for /f "tokens=*" %%c in (%~dp0\base.txt) do (
set base=%%c
)
echo %base%
I'm using the above for loop to load the file into a variable, but when echoing I get no output as it doesn't seem to have set the variable for some reason. Other text files I've loaded in to a variable using this method work.
TL;DR -
Is this a pre-WinXp system?
Details:
Is that a long base64, or are you happy to see me? :-)
First I wondered about the line's length. But any WinXp+ will handle 8k chars, so the 7,664 in your example shouldn't be an issue.
So I wondered about plus signs and back slashes outside double-quotes. However, the lack of dbl-quotes wasn't an issue in my testing. This worked just fine:
--CMD:--
for /f %A in (c:\temp\z.txt) do @set zLongTmp=%A
--OUTPUT:--
...a 7,664 char output...
So I checked the string, and found only [A-Za-z0-9+/] plus the trailing "=" padding....so nothing out of the ordinary
I've done other tests, but all succeeded. I'm left wondering if this is a pre-WinXp system, that has a cmdline max of 2047 characters. And if that's not it, I'd still advise using Double quotes around the 'set' command.
Example:
for /f %A in (c:\temp\z.txt) do @set "zLongTmp=%A"
echo %zLongTmp%
The main issue turned out to be that the file I was trying to load into a variable was encoded as UCS-2 LE with a BOM. Once I made sure the file was encoded as UTF-8 without a BOM, everything worked as required.
Thanks to everyone who helped figure this out.
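For anyone who cannot control how the file is saved, the re-encoding can also be scripted. This is a rough sketch in Python, not part of the original fix. It assumes the input is either UTF-16 with a BOM or UTF-8, and the function name to_utf8_without_bom is hypothetical.

```python
import codecs


def to_utf8_without_bom(src_path, dst_path):
    """Rewrite a text file as UTF-8 without a BOM and return the decoded text."""
    with open(src_path, "rb") as handle:
        raw = handle.read()
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        text = raw.decode("utf-16")
    else:
        # Assumption: anything else is UTF-8; "utf-8-sig" drops a UTF-8 BOM.
        text = raw.decode("utf-8-sig")
    with open(dst_path, "wb") as handle:
        handle.write(text.encode("utf-8"))
    return text
```

The converted copy contains only the plain base64 bytes, which is what the for /f loop expects to read.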

How can I strip out "file separator" characters from CSS/text files?

My CSS files have become contaminated with "file separator" characters (AKA "INFORMATION SEPARATOR FOUR" or ALT/028 characters). How can I get rid of them?
This is the character:
http://www.fileformat.info/info/unicode/char/1c/index.htm
Background
I manage a number of .CSS text files that are fairly similar. Unfortunately, a number of these files have somehow had "file separator" characters pasted into them. Although they still seem to work in browsers, any file that has one of these characters anywhere within it cannot be indexed by my desktop search utility (X1 Search). This makes them extremely hard for me to manage, because I need to compare CSS files constantly.
[Bizarrely, X1 Search ignores the character if the filename extension is .TXT but fails to index the entire file if the filename extension is .CSS]
Worse this "file separator" character is almost invisible within my text editor (TextPad 7.2). The only way I can detect it is to make spaces and carriage returns visible and then it appears as blank space. Worse still it appears to be impossible to search for using text search.
To make it clear what I mean an example that I have pasted into this page. The "file separator" character is on LineB below
LineA
LineB
LineC
LineD
Is there any way to remove this character from multiple text (in this case CSS) files at once?
NB I do NOT want to remove the whole line, just the one character(!)
Thanks
J
P.S. I am running on Windows7 (x64). I am using TextPad 7.3.
I have eventually managed to answer my own question.
Text Crawler and the use of a regular expression of "\x1c" appears to be the answer.
Fwiw, both Agent Ransack and FileLocator Pro filter out any characters in the ASCII range 0-31 (excluding 0x09 - tab) from the input field.
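If a dedicated tool is not available, the same "\x1c" replacement can be scripted. Below is a small sketch in Python that works on raw bytes, so it assumes an ASCII-compatible encoding such as UTF-8 or Windows-1252. The function name and the "*.css" default pattern are illustrative, and it is worth trying on a copy of the files first.

```python
import glob
import os

FILE_SEPARATOR = b"\x1c"  # INFORMATION SEPARATOR FOUR (U+001C)


def strip_file_separators(directory, pattern="*.css"):
    """Remove every 0x1C byte from matching files; return the paths changed."""
    changed = []
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path, "rb") as handle:
            data = handle.read()
        if FILE_SEPARATOR not in data:
            continue  # leave clean files untouched
        with open(path, "wb") as handle:
            handle.write(data.replace(FILE_SEPARATOR, b""))
        changed.append(path)
    return changed
```

Only the single character is removed, so the rest of each line, including its line ending, stays as it was.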

PyGame: Proper use of Unicode

My goal is to create a program, with which the user can learn Bible verses by getting shown a problem and solving it through input (e.g. "Quote vers Gen 3:15"). As the Bible translation, I have to work with, is German, it contains a ton of umlauts, which are never showing properly.
My PyGame file's header:
#!/usr/bin/python
# -*- coding: utf-8 -*-
Later on, I list the three German umlauts:
u'ö'.encode('utf-8')
u'ä'.encode('utf-8')
u'ü'.encode('utf-8')
The txt-file is parsed by this function:
import codecs
import os

def load_list(listname):
    fullname = os.path.join("daten", listname + ".txt")
    with codecs.open(fullname, "r", "utf-8-sig") as name:
        lines = name.readlines()
    for x in range(0, len(lines)):
        lines[x] = lines[x].strip("\n")
        lines[x] = lines[x].strip("\r")
    print lines
I'm aware, that I could combine the two lines with the strip-commands, but that's not the topic here.
How can I get my PyGame to display the umlauts from the text file correctly, and also display the umlauts in the user's input correctly? I checked hundreds of suggestions, but I can't get anything really working here.
Any help is highly appreciated, before I lose my sane mind (well, as I'm sitting here, coding games, I probably did already anyway :D )
I'll try to summarize:
Printing something other than a string or unicode object triggers that object's __repr__() method. If it is a sequence, this applies to the contained elements as well, causing any non-ASCII character to be escaped with \xXX (or \uXXXX) notation. Note the difference between print 'text' and print ['text']: in the latter case, the string's quotes will be printed as well (besides the brackets, of course). Use str.join() for concatenating lists of strings in order to control the way the output looks.
It's a good idea to always explicitly decode input (as you do by using codecs) and encode the output (which is not done in the code snippets in your question).
The source file encoding (the # coding: utf8 line in the header) has nothing to do with the encoding of input and output. It only enables you to type non-ASCII characters in string literals (= characters inside quotes in the source file), instead of using \xXX escapes.
Hope that makes some things clearer. There's a lot that can go wrong that looks like an encoding error, and it's not always easy to find out what's actually happening.
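The points above can be sketched as follows. This uses Python 3 syntax, while the question is written for Python 2, so treat it as an illustration rather than a drop-in replacement. The path argument and the helper name show are assumptions, and ascii() stands in for the escaped list output that Python 2 prints.

```python
import io


def load_list(path):
    """Decode the file explicitly as UTF-8, dropping a BOM and line endings."""
    with io.open(path, "r", encoding="utf-8-sig") as handle:
        return [line.rstrip("\r\n") for line in handle]


def show(lines):
    """Join the strings so the text itself is shown, not the list's repr."""
    return "\n".join(lines)


words = ["M\u00e4dchen", "L\u00f6we"]
print(ascii(words))   # ['M\xe4dchen', 'L\xf6we'], escaped like a printed list
print(show(words))    # the two words on separate lines, umlauts intact
```

The strings returned by load_list are already decoded text, so they can be passed on for display without any further encode or decode step.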

Breaking UTF-16 Unicode text by delimiters in Applescript?

I have a list of text coded in MacRoman, broken by linefeeds. Somehow a second list could not be saved in MacRoman, so I had to use Unicode UTF-16 to get German "ö", "ä" and stuff. While ListA gets filled like expected, listB doesn't get broken anymore and I end up with a single string, which I'm unable to break anymore/don't know how. Can someone help me out?
set ListA to (read file myFile1 using delimiter linefeed) as list
display dialog "" & item 1 of ListA
--> "Name A"
set ListB to (read file myFile2 using delimiter linefeed as Unicode text) as list
display dialog "" & item 1 of ListB
--> "Name A
Name B
Name C
Name D"
There can be many different types of characters that separate lines in text files. It's not always a linefeed. The easiest way to handle them is with the AppleScript command "paragraphs" rather than using the delimiter when reading the file. Paragraphs is pretty good at figuring out what character is used and handling it. It doesn't always work, but it's worth a try before you go any deeper into the problem. As such, try reading your files like this...
set ListB to paragraphs of (read file myFile2 as Unicode text)
If that doesn't work then you'll have to try and figure out what the character is. What I do in these cases is physically open the file and select the return character with my mouse... and copy it. Then I go back to AppleScript Editor and paste it into this command. Paste it where I have the letter "a". It will give you the character id.
id of "a"
Then you can read the file using the delimiter like this, obviously using the id number from the command above in place of 97...
set ListB to read file myFile2 using delimiter (character id 97) as Unicode text
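For comparison only, the same idea (split on whatever line terminator the file happens to use) looks like this in Python, where str.splitlines() plays a role similar to "paragraphs". The function name read_paragraphs is hypothetical, and the "utf-16" default assumes the file starts with a byte order mark.

```python
def read_paragraphs(path, encoding="utf-16"):
    """Decode a text file and split it on any common line terminator."""
    with open(path, "rb") as handle:
        text = handle.read().decode(encoding)
    # splitlines() treats CR, LF and CRLF (among others) as line boundaries.
    return text.splitlines()
```

A file saved with classic Mac carriage returns would come back as a list of separate names here, which is the result the linefeed delimiter failed to produce.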
Are you sure the file uses LF line endings? This works for me:
set f to POSIX file "/tmp/1"
set b to open for access f with write permission
set eof b to 0
write "あ" & linefeed & "い" to b as Unicode text -- UTF-16
close access b
read f using delimiter linefeed as Unicode text
Did you try saving the file as UTF-8? You can read it by replacing Unicode text with «class utf8».