Search for Unicode text inside Windows XP

Is there a way of searching for Unicode characters inside a text file under Windows XP? For example, suppose I wish to find text documents containing the euro symbol. Although the standard XP search allows me to type the euro symbol, it does not produce any matches when I know there should be at least a few. Wingrep has the same issue. Is there any simple software or setting that I have missed?

The input encoding of the search field (in Windows XP, UTF-16) may not match the encoding of the text file (probably UTF-8).
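If you'd rather script the check than rely on the built-in search, here is a minimal C# sketch (the file name is a placeholder) that decodes a file with each of the encodings mentioned above and looks for the euro symbol:
using System;
using System.IO;
using System.Text;

class EuroSearch
{
    static void Main()
    {
        // Try both candidate encodings; a match under either one means
        // the file really does contain the euro symbol.
        foreach (Encoding enc in new[] { Encoding.UTF8, Encoding.Unicode })
        {
            string text = File.ReadAllText("document.txt", enc);
            if (text.Contains("€"))
                Console.WriteLine("Match when decoded as " + enc.EncodingName);
        }
    }
}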
I haven't used this tool (freeware), but it might work for your needs.

On Windows, or on any other system, how can you find out whether a document is Unicode (i.e. contains a Unicode character)?
To achieve this, just use this simple code. Note that the code is written in C#, so you may need to write your own equivalent.
public bool IsUnicode(string str)
{
    // ASCII encodes every character as exactly one byte, while UTF-8 needs
    // more than one byte for any non-ASCII character, so differing byte
    // counts mean the string contains at least one non-ASCII character.
    int asciiBytesCount = System.Text.Encoding.ASCII.GetByteCount(str);
    int unicodeBytesCount = System.Text.Encoding.UTF8.GetByteCount(str);
    return asciiBytesCount != unicodeBytesCount;
}
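A quick usage sketch (the sample strings are invented for illustration):
System.Console.WriteLine(IsUnicode("5 euros")); // False: every character is ASCII
System.Console.WriteLine(IsUnicode("5 €"));     // True: '€' is not an ASCII character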
If you do not want to write any code, you can find out whether a document contains any Unicode characters by simply checking the encoding shown in the editor's Save As dialog.

Related

Weird Normalization on .NET

I am trying to normalize a string (using .NET Standard 2.0) using Form D, and it works perfectly when running on a Windows machine.
[TestMethod]
public void TestChars()
{
    var original = "é";
    var normalized = original.Normalize(NormalizationForm.FormD);
    var originalBytesCsv = string.Join(',', Encoding.Unicode.GetBytes(original));
    Assert.AreEqual("233,0", originalBytesCsv);
    var normalizedBytesCsv = string.Join(',', Encoding.Unicode.GetBytes(normalized));
    Assert.AreEqual("101,0,1,3", normalizedBytesCsv);
}
When I run this on Linux, it returns "253,255" for both strings, before and after normalization. Read as a little-endian UTF-16 code unit, those two bytes form the value 65533 (U+FFFD), the Unicode replacement character, which is used when something goes wrong with decoding. That's the part where I am lost.
What am I missing here? Can someone point me in the right direction?
It might be related to the encoding of the source file. I'm not sure which encodings .NET on Linux supports, but to be on the safe side, you should use plain ASCII source files and Unicode escapes for non-ASCII characters:
var original = "\u00e9";
There is no text but encoded text.
When communicating text to a person or a program, both the bytes and the character encoding are essential.
The C# compiler (like all programs that process text, except in special cases like JSON) must know which character encoding the input files use. You must inform it accurately. The default is UTF-8 and that is a fine choice, especially for C# files, which are, lexically, sequences of Unicode codepoints.
If you used your editor or IDE or file transfer without full mindfulness of these requirements, you might have used an unintended character encoding.
For example, "é" when saved as Windows-1252 (0xE9) but read as UTF-8 (leading code unit that should be followed by two continuation code units), would give � to indicate this mishandling to the readers.
To be on the safe side, use UTF-8 everywhere but do it mindfully.

Word Object Model: Save as Unicode Text

I know that it is possible to save a document as text with the Word Object Model. (MSDN Link)
It says in the documentation that the number for Unicode Text is "7", which is why I use the following code in AutoHotkey: oWord.Documents(1).SaveAs2(SpeicherortB,7)
(Saves Document 1 of the oWord Application to the location "SpeicherortB" as Unicode (7))
Contrary to what the documentation suggests, the result is not Unicode though; Asian and Russian characters are not preserved. Do you have any idea how to fix this?
For reference: I need to use the Object Model as I am running my code through AutoHotkey.
The Encoding parameter (an MsoEncoding value) has to be set to the number 65001, which is msoEncodingUTF8. Since Encoding is the twelfth positional parameter of SaveAs2, the final AutoHotkey line would thus look like this:
oWord.Documents(1).SaveAs2(filename, 7,,,,,,,,,, 65001)
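For comparison, here is roughly what the same call looks like through the Word COM interop in C# (a sketch only; it assumes references to Microsoft.Office.Interop.Word and office.dll, and the file paths are placeholders):
using Word = Microsoft.Office.Interop.Word;
using Core = Microsoft.Office.Core;

class SaveAsUnicodeText
{
    static void Main()
    {
        var app = new Word.Application();
        Word.Document doc = app.Documents.Open(@"C:\input.docx");
        doc.SaveAs2(
            FileName: @"C:\output.txt",
            FileFormat: Word.WdSaveFormat.wdFormatUnicodeText, // the "7" above
            Encoding: Core.MsoEncoding.msoEncodingUTF8);       // 65001
        doc.Close();
        app.Quit();
    }
}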

Convert non-English characters into Unicode (UTF-8)

I copied a large amount of text from another system to my PC. When I viewed the text on my PC, it looked weird. So I copied all the fonts from the other PC and installed them on mine too. Now the text looks okay, but it seems it is not actually stored as Unicode: if I copy the text and paste it into a UTF-8-capable editor such as Notepad++, I get only Latin characters ("bgah;").
How can I convert this whole text into Unicode text like the sample below, so that I can copy it and paste it anywhere else?
பெயர்
The above text was manually obtained using http://www.google.com/transliterate/indic/Tamil
I need this conversion done so that I can copy the text into database tables.
'Ja-01' is a font with a custom 'visual encoding'.
That is to say, the sequence of characters really is "bgah;" and it only looks like Tamil to you because the font's shapes for the Latin characters bg look like பெ.
This is always to be avoided, because storing the content as "bgah;" means you lose the ability to search and process it as real Tamil. Still, this approach was common in the pre-Unicode days, especially for less-widespread scripts without mature encoding standards; the application that produced your text probably predates widespread use of TSCII.
Because it is a custom encoding not shared by any other font, it is very unlikely you will find a ready-made tool to convert content in this encoding to proper Unicode characters. It does not appear to follow any standard character ordering, so you will have to look at the font (e.g. in charmap.exe), note down every character, find the matching character in Unicode, and map between them.
For example here's a trivial Python script to replace characters in a file:
mapping = {
    'a': '\u0BAF',  # Tamil letter Ya
    'b': '\u0BAA',  # Tamil letter Pa
    'g': '\u0BC6',  # Tamil vowel sign E (combining)
    'h': '\u0BB0',  # Tamil letter Ra
    ';': '\u0BCD',  # Tamil sign virama (combining)
    # fill in the rest of the mapping information here!
}

with open('ja01data.txt', 'rb') as fp:
    data = fp.read().decode('utf-8')

for char in mapping:
    data = data.replace(char, mapping[char])

with open('utf8data.txt', 'wb') as fp:
    fp.write(data.encode('utf-8'))
The font you found is getting you into trouble. The actual cell text is "bgah;"; it gets rendered as பெயர் because you found a font that maps those 8-bit non-Unicode characters to Tamil shapes. So reading it or pasting it into Notepad++ is going to produce "bgah;", since that is the real text. It can only be rendered properly again by forcing whatever program displays the string to use that same font.
Ditch the font and enter the text as real Unicode so it looks like this: பெயர்
"bgah" looks like a Baamini based system, which is pre-unicode. It was popular in Canada (and the SL Tamil diaspora in general) in the 90s.
As the others mentioned, it looks like a custom visual encoding that mimics the performance of a foreign script while maintaining ASCII encoding.
Google "Baamini to unicode convertor". The University of Colombo seems to have put one up: http://www.ucsc.cmb.ac.lk/ltrl/services/feconverter/?maps=t_b-u.xml
Let me know if this works. If not, I can ask around and get something for you.
You could first check whether the encoding is TSCII, as this sounds most probable. It is an 8-bit encoding, and the fonts you copied are probably based on that encoding. Check out whether the TSCII to UTF-8 converter at SourceForge is suitable. The project there is called “Any Tamil Encoding to Unicode” but they say that only TSCII is supported for now.

Reading unicode string from registry

I'm using CodeGear C++Builder 2007. I'm trying to read a string value containing a path from the registry. This path can contain Unicode characters, for example Russian.
I have added a string value with regedit and verified by exporting it that the value really contains the expected Unicode characters. The results in S1, S2 and S3 below all contain '?' (0x3F) instead of the Unicode characters. What am I missing?
TRegistry *Registry = new TRegistry;
try
{
    Registry->RootKey = HKEY_CURRENT_USER;
    if (Registry->OpenKey("Software\\qwe\\asd", false))
    {
        AnsiString S1 = Registry->ReadString("zxc");
        WideString S2 = Registry->ReadString("zxc");
        UTF8String S3 = Registry->ReadString("zxc");
    }
}
__finally
{
    delete Registry;
}
/Björn
The VCL in C++Builder (and Delphi) 2007 uses Ansi, not Unicode. TRegistry::ReadString() is internally calling the Win32 API RegQueryValueExA() function instead of RegQueryValueExW(), and TRegistry::ReadString() returns an AnsiString that uses the OS default Ansi codepage. Any Unicode data gets automatically converted to Ansi by the OS before your code ever sees it. The '?' character means that a Unicode character got converted to an Ansi codepage that does not support that character. It does not matter what string type you assign the result of ReadString() to, the Unicode data has already been lost before ReadString() even exits.
If you need to read Unicode data as Unicode, then you need to call RegQueryValueExW() directly instead of using TRegistry::ReadString() (or upgrade to C++Builder 2009 or later, which now use Unicode).
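The '?' substitution itself is easy to reproduce. Here is a small C# sketch of the same mechanism, with Windows-1252 chosen as an example of an Ansi codepage that has no Cyrillic:
using System;
using System.Text;

class AnsiLossDemo
{
    static void Main()
    {
        // On .NET Framework this works as-is; on .NET Core/5+ you must
        // first register CodePagesEncodingProvider to get codepage 1252.
        byte[] ansi = Encoding.GetEncoding(1252).GetBytes("Невский");
        // Every Cyrillic letter becomes 0x3F ('?'), the same character
        // the question sees in S1, S2 and S3.
        Console.WriteLine(BitConverter.ToString(ansi)); // 3F-3F-3F-3F-3F-3F-3F
    }
}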
CodeGear Delphi 2006.Net's TRegistry fails in Framework 2 SP1:
http://do-the-right-things.blogspot.com/2008/03/codegear-delphi-2006nets-tregistry.html
I don't know whether C++ 2007 is also affected, but if it is, maybe there is a patch available somewhere.

Using a RichTextBox to retrieve the Text property in C++

I am using a hidden RichTextBox to retrieve the Text property from a RichEditCtrl.
rtb->Text; returns the text portion of either English or national languages – just great!
But I need this text in the form \u12232? \u32232? instead of national characters and symbols, to work with my DB and RichEditCtrl. Any idea how to get from “пассажирским поездом Невский” to “\u12415?\u12395?\u23554?\u20219?\u30456?\u35527?\u21729?” (where each national character is represented as “\u23232?”)?
If you have any, that would be great.
I am using visual studio 2008 C++ combination of MFC and managed code.
Cheers and have a wonderful weekend
If you need a System::String as an output as well, then something like this would do it:
String^ s = rtb->Text;
StringBuilder^ sb = gcnew StringBuilder(s->Length);
for (int i = 0; i < s->Length; ++i) {
    // Append each UTF-16 code unit as "\u<5 decimal digits>?".
    // Note the escaped backslash: a bare "\u{" is not a valid C++ literal.
    sb->AppendFormat("\\u{0:D5}?", (int)s[i]);
}
String^ result = sb->ToString(); // was s->ToString(), which discarded the conversion
By the way, are you sure the format is as described? \u is traditionally an escape sequence for a hexadecimal Unicode code point, exactly 4 hex digits long, e.g. \u0F3A, and it is not normally followed by ?. If you actually want hexadecimal, the format specifier {0:X4} should do the trick.
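For illustration, here is the hexadecimal variant in C# (the sample string is taken from the question):
using System;
using System.Text;

class EscapeDemo
{
    static void Main()
    {
        var sb = new StringBuilder();
        foreach (char c in "Невский")
            sb.AppendFormat(@"\u{0:X4}", (int)c);
        Console.WriteLine(sb); // \u041D\u0435\u0432\u0441\u043A\u0438\u0439
    }
}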
You don't need to use escaping to put formatted Unicode in a RichText control. You can use UTF-8. See my answer here: Unicode RTF text in RichEdit.
I'm not sure what your restrictions are on your database, but maybe you can use UTF-8 there too.