Itoa and different character sets c++ visual studio 2013 - unicode

My code:
m_ListCtrlCandidates.InsertItem(i, _itoa_s(candidate[i].ID, (char*)(LPCTSTR)str, 10));
m_ListCtrlCandidates.SetItemText(i, 1, _itoa(candidate[i].FingerNumber, (char*)(LPCTSTR)str, 10));
m_ListCtrlCandidates.SetItemText(i, 2, _itoa(candidate[i].SampleNumber, (char*)(LPCTSTR)str, 10));
m_ListCtrlCandidates.SetItemText(i, 3, _itoa(candidate[i].ConfidenceLevel, (char*)(LPCTSTR)str, 10));
Error:
Error 2 error C4996: '_itoa': This function or variable may be unsafe. Consider using _itoa_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. d:\documents\visual studio 2013\projects\gatekeeper\gatekeeper\gatekeeperdlg.cpp 416 1 Gatekeeper
I'm using an SDK that has the following code in their sample. It adds potential matches to a list in the dialog. Initially I had my project set to Unicode and updated the code to work. This was giving me trouble, so I looked at the sample code and saw that its character set was blank. So I changed mine, and now I get this error.
If I switch it to _itoa_s I get the error that the function doesn't take 3 arguments. So I guess I'm missing the size argument, but I'm not sure what size it is supposed to be. Also, it compiles fine in their sample code when left as _itoa.
I'd really like to keep it in Unicode. Using _wtoi instead of atoi helped in other places. Is there something similar for this case?

I'm using an SDK that has the following code in their sample.
That's unfortunate!
_itoa(candidate[i].FingerNumber, (char*)(LPCTSTR)str, 10)
I'm guessing that you're using MFC, str is a CString, and you're calling CListCtrl::SetItemText.
The (LPCTSTR) operator on CString gets the pointer to the underlying buffer holding the string data. That's a const TCHAR*, so if you're compiling in ANSI mode, this is a const char* pointer to bytes; in Unicode mode it's a const wchar_t* pointer to 16-bit code units.
Casting this to a non-const char* and asking _itoa to write to that is a pretty bad idea. This overwrites whatever was originally in the CString, and if the number is big enough that the resulting string is longer than what was originally in the CString you could be writing over the end of the array, causing memory corruption havoc.
Casting it to a char* in Unicode mode is even weirder, as you're using a wchar_t array as storage for char* bytes. And surely SetItemText() in Unicode mode would be expecting wchar_t characters instead?
Using _wtoi instead of atoi helped in other places. Is there something similar
_itow exists as the wchar_t analogue of _itoa. (At least in VS. Neither function is standard C[++] as such.)
You can switch on #ifdef _UNICODE and call either _itoa or _itow to match whichever type TCHAR is. But unless you really need to support an ancient ANSI-only build for some legacy reason there's not much reason to bother with TCHAR switching these days. You can usually just stick to Unicode mode and use wchar_t-based strings for text.
error C4996: '_itoa': This function or variable may be unsafe.
Microsoft deprecated a number of C functions that write variable amounts of content to pre-allocated buffers, because the buffers can easily be too short, resulting in the aforementioned memory corruption horror. This has been a cause of countless security problems in applications in the past, so it's best avoided.
Unfortunately warning 4996 actually deprecates some standard functions that aren't really dangerous too, which is quite tiresome, especially as the _s versions suggested as a replacement are typically not supported by other compilers.
In this case though MS are kind of right, _itoa isn't really safe. To use it safely you'd have to allocate a buffer large enough for the longest possible integer of the type you're passing, and it's really easy to get that wrong.
If I switch it to _itoa_s I get the error that the function doesn't take 3 arguments. So I guess I'm missing the size argument, but I'm not sure what size it is supposed to be
It's however many elements are available at the end of the pointer to write to. So at present that would depend on the length of the CString whose buffer you're stealing. But it's not a good idea to do that. You could allocate your own buffer:
wchar_t str[8];
_itow_s(candidate[i].FingerNumber, str, 8, 10);
This is safe, though it still fails (with errno EINVAL) if FingerNumber has more than 7 digits as there would be no space to store them all (including the \0 terminator).
Functions like itoa that write variable content to buffers are pretty ugly in general. If you can use modern C++ with the STL, there are simpler, safer string-handling methods available, for example:
#include <string>
std::wstring fingers = std::to_wstring(candidate[i].FingerNumber);
m_ListCtrlCandidates.SetItemText(i, 1, fingers.c_str());
(although how well this will mix with old-school MFC and CString is another question.)

Related

wxTextCtrl OSX umlauts

I am using wxMac 2.8 in a non-Unicode build. I try to read a file with umlauts such as "ü" into a wxTextCtrl. When I do, the data gets interpreted in the current encoding, but it is a multi-byte string. I narrowed the problem down to this:
text_ctrl->Clear();
text_ctrl->SetValue("üüüäääööößßß");
This is the result:
üüüäääööößßß
Note that the character count has doubled - printing the string in gdb displays "\303\274" and similar per original char. Typing "ü" or similar into the textctrl is no problem. I tried various wxMBConv methods but the result is always the same. Is there a way to solve this?
If you use anything but 7 bit ASCII, you must use Unicode build of wxWidgets. Just do yourself a favour and switch to it. If you have too much existing code that was written for "ANSI" build of wxWidgets 2.8 and earlier and doesn't compile with Unicode build, use wxWidgets 2.9 instead where it will compile -- and work as intended.
It sounds like your text editor (for program source code) is in a different encoding from the running program.
Suppose for example that your text entry control and the rest of your program are (correctly) using UTF-8. Now if your text editor is using some other encoding, then a string that looks fine on screen will actually contain garbage bytes.
Assuming you are in a position to help create a pure-UTF8 world, then you should:
1) Encode UTF-8 directly into the string literals using escapes, e.g. "\303" or "\xc3". That's annoying to do, but it means you just don't have to worry about your text editor (or the editor settings of other developers).
2) Then check that the program is using UTF-8 everywhere.

How do I mark an empty translation (msgstr) as translated in gettext PO files?

I found that if the translation for a string (msgid) is empty, all gettext tools will consider the string untranslated.
Is there a workaround for this? I do want to have an empty string as the translation for this item.
As this seems to be a big design flaw in the gettext specification, I decided to use:
Unicode Character 'ZERO WIDTH SPACE' (U+200B) inside these fields.
I realize this is an old question, but I wanted to point out an alternate approach:
msgid "This is a string"
msgstr "\0"
Since gettext uses embedded nulls to signal the end of a string, and it properly translates C escape sequences, I would guess that this might work and result in an empty-string translation? It seemed to work in my program (based on GNU libintl) but I can't tell if this is actually standard / permitted by the system. As I understand it, gettext PO is not formally specified, so there may be no authoritative answer other than looking at the source code...
https://www.gnu.org/software/gettext/manual/html_node/PO-Files.html
It's often not a nice thing to do to programmers to put embedded nulls in things but it might work in your case? Arguably it's less evil than the zero-width-space trick, since it will actually result in a string whose size is zero.
Edit:
Basically, the worst thing that can happen is you get a segfault / bad behavior when running msgfmt, if it would get confused about the size of strings which it assumes don't have embedded null, and overflow a buffer somewhere.
Assuming that msgfmt can tolerate this though, libintl is going to have to do the right thing with it because the only means it has to return strings is char *, so the final application can only see up to the null character no matter what.
For what it's worth, my po-parser library spirit-po explicitly supports this :)
https://github.com/cbeck88/spirit-po
Edit: In gettext documentation, it appears that they do mention the possibility of embedded nulls in MO files and said "it was strongly debated":
https://www.gnu.org/software/gettext/manual/html_node/MO-Files.html
Nothing prevents a MO file from having embedded NULs in strings. However, the program interface currently used already presumes that strings are NUL terminated, so embedded NULs are somewhat useless. But the MO file format is general enough so other interfaces would be later possible, if for example, we ever want to implement wide characters right in MO files, where NUL bytes may accidentally appear. (No, we don’t want to have wide characters in MO files. They would make the file unnecessarily large, and the ‘wchar_t’ type being platform dependent, MO files would be platform dependent as well.)
This particular issue has been strongly debated in the GNU gettext development forum, and it is expectable that MO file format will evolve or change over time. It is even possible that many formats may later be supported concurrently. But surely, we have to start somewhere, and the MO file format described here is a good start. Nothing is cast in concrete, and the format may later evolve fairly easily, so we should feel comfortable with the current approach.
So, at the least it's not like they're going to say "man, embedded null in message string? We never thought of that!" Most likely it works, if msgfmt doesn't crash then I would assume it's kosher.
I have had the same problem for a long time, and I actually don't think you can at all. My best option was to insert a comment so I could mark it "translated" from there:
# No translation needed / Translated
msgid "This is a string"
msgstr ""
So far, it's been my best workaround :/ If you do end up finding a way, please post!

Porting a Delphi 2006 app to XE

I want to port several large apps from Delphi 2006 to XE. The reasons are not so much to do with Unicode, but to take advantage of (hopefully) better IDE stability, native PNG support, more components, fewer VCL bugs, less dependence on 3rd party stuff, less ribbing from you guys, etc. Some of the apps might benefit from Unicode, but that's not a concern at present. At the moment I just want to take the most direct route to getting them to compile again.
As a start, I have changed all ambiguous string declarations, i.e. string to AnsiString or ShortString, char to AnsiChar and pChar to pAnsiChar and recompiled with D2006. So far so good. Nothing broke.
My question is: Where to from here? Assuming I just present my sources to the XE compiler and light the touch paper, what is likely to be the big issue?
For example,
var
S : AnsiString ;
...
MainForm.Caption := S ;
Will this generate an error? A warning? I'm assuming the VCL is now Unicode, or will XE pull in a non-Unicode component, or convert the string? Is it in fact feasible to keep an app using 8-bit strings in XE, or will there be too many headaches?
If the best/easiest way to go is to Unicode, I'll do that, even though I won't be using the extended characters, at least in the near future anyway.
The other thing that I wonder about is 3rd party stuff. I guess I will need to get updated versions that are XE-compatible.
Any (positive!) comment appreciated.
It is a long jump from Delphi 2006 to XE (2011).
But it is possible if you consider that:
You have to convert String variables using the new conversion methods;
You have to check all the versions between 2006 and XE to know how the libraries have changed, because some have been split, others merged, and a few deleted;
You have to buy/download the upgrade (if any) of your 3rd party components.
The VCL is completely Unicode now, so the code you showed will generate a compiler warning, not an error, about an implicit conversion from AnsiString to UnicodeString. That is a potentially lossy conversion if the AnsiString contains non-ASCII characters (which the compiler cannot validate). If you continue using AnsiString, then you have to do an explicit type-cast to avoid the warning:
var
S : AnsiString ;
...
MainForm.Caption := String(S);
You are best off NOT Ansi-fying your code like this. Embrace Unicode. Your code will be easier to manage for it, and it will be more portable to future versions and platforms. You should restrict AnsiString usage to just those places where Ansi is actually needed - network communications, file I/O of legacy data, etc. If you want to save memory inside your app, especially if you are using ASCII characters only, use UTF8String instead of AnsiString. UTF-8 is an 8-bit encoding of Unicode, and conversions between UTF8String and UnicodeString are loss-less with no compiler warnings.

What encoding Win32 API functions expect?

For example, the MessageBox function has an LPCTSTR-typed argument for the text and caption, which is a pointer to wchar_t or a pointer to char when _UNICODE or _MBCS is defined, respectively.
How does the MessageBox function interpret those strings? As which encoding?
The only explanation I managed to find is this:
http://msdn.microsoft.com/en-us/library/cwe8bzh0(VS.90).aspx
But it doesn't say anything about encoding? Just that in the case of _UNICODE one character takes up one wchar_t (which is 16-bit on Windows), and in the case of _MBCS one or two chars (8-bit).
So are those some Microsoft versions of UTF-8 and UTF-16 that ignore anything that has to be encoded in three or four bytes in the case of UTF-8, and anything that has to be encoded in four bytes in the case of UTF-16? And is there a way to show anything outside the Basic Multilingual Plane of Unicode with MessageBox?
There are normally two different implementations of each function:
MessageBoxA, which accepts ANSI strings
MessageBoxW, which accepts Unicode strings
Here, 'ANSI' means the multi-byte code page currently assigned to the process. This varies according to the user's preferences and locale setting, although Win32 API functions such as WideCharToMultiByte can be counted on to do the right conversion, and the GetACP function will tell you the code page in use. MSDN explains the ANSI code page and how it interacts with Unicode.
'Unicode' generally means UCS-2; that is, support for characters above 0xFFFF isn't consistent. I haven't tried this, but UI functions such as MessageBox in recent versions (> Windows 2000) should support characters outside the BMP.
The ...A functions are obsolete and only wrap the ...W functions. The former were required for compatibility with Windows 9x, but since that is no longer in use, you should avoid them at all costs and use the ...W functions exclusively. They require UTF-16 strings, the only native Windows encoding. All modern Windows versions support non-BMP characters quite well (provided a font containing those characters is available, of course).

How to remove the warning "large integer implicitly truncated" for sqlite/unicode support?

I use the solution from http://ioannis.mpsounds.net/2007/12/19/sqlite-native-unicode-like-support/ for my POS app for the iPhone, and it works great.
However, as one of the comments says:
For instance, sqlite_unicode.c line 1861 contains integral constants greater than 0xffff but are declared as unsigned short. I wonder how I should cope with that.
I'm fixing all the warnings in my project and this is the last one. The code is this:
static unsigned short unicode_unacc_data198[] = { 0x8B8A, 0x8D08, 0x8F38, 0x9072, 0x9199, 0x9276, 0x967C, 0x96E3, 0x9756, 0x97DB, 0x97FF, 0x980B, 0x983B, 0x9B12, 0x9F9C, 0x2284A, 0x22844, 0x233D5, 0x3B9D, 0x4018, 0x4039, 0x25249, 0x25CD0, 0x27ED3, 0x9F43, 0x9F8E, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF, 0xFFFF };
I don't know about this hex stuff, so what should I do? I don't get an error, but I don't know if this could cause one in the future...
Yes, 0x2284A is indeed larger than 0xFFFF, which is the largest a 16-bit unsigned integer can contain.(*)
This is a lookup table for mapping characters with diacritical marks to basic unaccented characters. For some reason, a few mappings are defined that point to characters outside the ‘Basic Multilingual Plane’ of Unicode characters that fit in 16 bits.
U+2284A and the others above are highly obscure extended Chinese characters. I'm not sure why a character in the BMP would refer to such a character as its base unaccented version. Maybe it's an error in the source data used to generate the tables, or maybe it's just another weird quirk of the Chinese writing system. Either way, it's extremely unlikely you'll ever need that mapping. So just change all the five-digit hex codes in this array to be 0xFFFF instead (which seems to be what the code is using to signify ‘no mapping’).
(*: in theory a short could be more than 16 bits, but in reality it isn't going to be. If it were it looks like this code would totally fall over anyway, as it's freely mixing short with u16 pointers.)