converting a string to Unicode in C - unicode

I have a string in a variable, and that string comes from the core part of the project. Now I want to convert it to a Unicode string. How can I do that?
Adding L or _T() or TEXT() is not an option.
To make things clearer, please see below:
void foo(char* string) {
    // Here the contents of the variable string should be converted to Unicode.
    // The solution should be usable in C code.
}
TIA
Naveen

L is used to create wchar_t literals.
From your comment about SafeArrayPutElement and the way you use the term 'Unicode', it's clear you're using Windows. Assuming that the char* string is in the legacy encoding Windows is using and not UTF-8 or something (a safe assumption on Windows), you can get a wchar_t string in the following ways:
// typical Win32 conversion in C
int output_size = MultiByteToWideChar(CP_ACP,0,string,-1,NULL,0);
wchar_t *wstring = malloc(output_size * sizeof(wchar_t));
int size = MultiByteToWideChar(CP_ACP,0,string,-1,wstring,output_size);
assert(output_size==size);
// make use of wstring here
free(wstring);
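Wrapped up as the kind of helper the question's foo() could call, it might look like this (a sketch only; to_wide is a hypothetical name, error handling is kept minimal, and the caller owns the returned buffer):
#include <windows.h>
#include <stdlib.h>

/* Hypothetical helper: converts a legacy-encoded (CP_ACP) C string to a
   newly malloc()ed wide string; the caller must free() the result. */
wchar_t *to_wide(const char *s) {
    int n = MultiByteToWideChar(CP_ACP, 0, s, -1, NULL, 0);
    if (n == 0) return NULL;                 /* conversion failed */
    wchar_t *w = malloc(n * sizeof(wchar_t));
    if (w == NULL) return NULL;
    if (MultiByteToWideChar(CP_ACP, 0, s, -1, w, n) == 0) {
        free(w);
        return NULL;
    }
    return w;
}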
If you're using C++, you might want to make that exception-safe by using std::wstring instead (this uses a tiny bit of C++11 and so may require VS2010 or above):
std::wstring ws(output_size,L'\0');
int size = MultiByteToWideChar(CP_ACP,0,string,-1,&ws[0],(int)ws.size()); // &ws[0] because std::wstring::data() is read-only before C++17
// MultiByteToWideChar tacks on a null character to mark the end of the string, but this isn't needed when using std::wstring.
ws.resize(ws.size() -1);
// make use of ws here. You can pass a wchar_t pointer to a function by using ws.c_str()
//std::wstring handles freeing the memory so no need to clean up
Here's another method that uses more of the C++ standard library (and takes advantage of VS2010 not being completely standards compliant):
#include <locale> // for wstring_convert and codecvt
std::wstring ws = std::wstring_convert<std::codecvt<wchar_t,char,std::mbstate_t>,wchar_t>().from_bytes(string);
// use ws.c_str() as before
You also imply in the comments that you tried converting to wchar_t and got the same error. If that's the case when you try these methods for converting to wchar_t then the error lies elsewhere. Probably in the actual content of your string. Perhaps it's not properly null terminated?

You can't say "converted to Unicode". You need to specify an encoding; Unicode is not an encoding but (roughly) a character set plus a set of encodings for expressing those characters as sequences of bytes.
Also, you must specify the input encoding: how is a character such as "å" encoded in string, for example?
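To illustrate that point, here is a minimal hedged C sketch. It assumes the Win32 MultiByteToWideChar API from the answer above and Windows-1252 as the legacy "ANSI" code page; the byte values are simply the standard encodings of "å":
#include <windows.h>
#include <stdio.h>

int main(void) {
    /* "å" is the single byte 0xE5 in Windows-1252, but the two bytes 0xC3 0xA5 in UTF-8. */
    const char latin1[] = { (char)0xE5, 0 };
    const char utf8[]   = { (char)0xC3, (char)0xA5, 0 };
    wchar_t out[4];

    /* Decoding succeeds only when the code page matches the actual input encoding. */
    MultiByteToWideChar(1252, 0, latin1, -1, out, 4);   /* out[0] == 0x00E5, i.e. L'å' */
    printf("%04X\n", (unsigned)out[0]);

    MultiByteToWideChar(CP_UTF8, 0, utf8, -1, out, 4);  /* also yields 0x00E5 */
    printf("%04X\n", (unsigned)out[0]);

    /* Feeding the Windows-1252 byte to the UTF-8 decoder gives garbage
       (typically the U+FFFD replacement character), not L'å'. */
    MultiByteToWideChar(CP_UTF8, 0, latin1, -1, out, 4);
    printf("%04X\n", (unsigned)out[0]);
    return 0;
}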

Related

How can I save a string array to PlayerPrefs in Unity?

I have an array and I would like to save it to PlayerPrefs. I heard I can do this:
PlayerPrefs.SetStringArray('title', anArray);
but for some reason it does not work.
Maybe I'm not using some library like using UnityEngine.PlayerPrefs;?
Can someone help me?
Thanks in advance
You can't. PlayerPrefs doesn't support arrays.
But you could use a special separator and do e.g.
PlayerPrefs.SetString("title", string.Join("###", anArray));
and then for reading use
var anArray = PlayerPrefs.GetString("title").Split(new []{"###"}, StringSplitOptions.None);
Or, if you know the content and in particular know that a certain character never appears in it, you could also use a single char, e.g.
PlayerPrefs.SetString("title", string.Join("\n", anArray));
and then for reading use
var anArray = PlayerPrefs.GetString("title").Split('\n');
Yes, as TEEBQNE mentioned, there is PlayerPrefsX.cs, which might be the source of the confusion.
I would NOT recommend it, though! It simply converts all the different input types into byte[] and from there to Base64 strings.
That might be cool and all for int[], bool[], etc., but for string[] it is absolutely inefficient, since the Base64 representation of a string's bytes is way longer than the string itself!
It might be a valid alternative, though, if you cannot rely on your strings' contents and cannot be sure that your separator sequence never actually appears in any of the strings.

In Swift, how to get estimate of String length in constant time?

In Swift 3, you can count the characters in a String with:
str.characters.count
I need to do this frequently, and that line above looks like it could be O(N). Is there a way to get a string length, or a length of something — maybe the underlying unicode buffer — with an operation that is guaranteed to not have to walk the entire string? Maybe:
str.utf16.count
I ask because I'm checking the length of some text every time the user types a character, to limit the size of a UITextView. The call doesn't need to be an exact count of the glyphs, like characters.count.
This is a good question. The answer is... complicated. Converting from UTF-8 to UTF-16, or vice-versa, or converting to or from some other encoding, will all require examining the string, since the characters can be made up of more than one code unit. So if you want to get the count in constant time, it's going to come down to what the internal representation is. If the string is using UTF-16 internally, then it's a reasonable assumption that string.utf16.count would be in constant time, but if the internal representation is UTF-8 or something else, then the string will need to be analyzed to determine what the length in UTF-16 would be. So what's String using internally? Well:
https://github.com/apple/swift/blob/master/stdlib/public/core/StringCore.swift
/// The core implementation of a highly-optimizable String that
/// can store both ASCII and UTF-16, and can wrap native Swift
/// _StringBuffer or NSString instances.
This is discouraging. The internal representation could be ASCII or UTF-16, or it could be wrapping a Foundation NSString. Hrm. We do know that NSString uses UTF-16 internally, since this is actually documented, so that's good. So the main outlier here is when the string stores ASCII. The saving grace is that since the first 128 Unicode code points have the same values as the ASCII character set, any ASCII character 0xXX should correspond to the UTF-16 code unit 0x00XX, so the UTF-16 length in bytes should simply be the ASCII length times two, and thus calculable in constant time. Is this the case in the implementation? Let's look.
In the UTF16View source, there is no implementation of count. It appears that count is inherited from Collection's implementation, which is implemented via distance():
public var count: IndexDistance {
    return distance(from: startIndex, to: endIndex)
}
UTF16View's implementation of distance() looks like this:
public func distance(from start: Index, to end: Index) -> IndexDistance {
    // FIXME: swift-3-indexing-model: range check start and end?
    return start.encodedOffset.distance(to: end.encodedOffset)
}
And in the String.Index source, encodedOffset looks like this:
public var encodedOffset : Int {
    return Int(_compoundOffset >> _Self._strideBits)
}
where _compoundOffset appears to be a simple 64-bit integer:
internal var _compoundOffset : UInt64
and _strideBits appears to be a straight integer as well:
internal static var _strideBits : Int { return 2 }
So it... looks... like you should get constant time from string.utf16.count, since unless I'm making a mistake somewhere, you're just bit-shifting a couple of integers and then comparing the results (I'd probably still run some tests to be sure). The caveat is, of course, that this isn't documented, and thus could change in the future—particularly since the documentation for String does claim that it needs to iterate through the string:
Unlike with isEmpty, calculating a view’s count property requires iterating through the elements of the string.
With all that said, you're using a UITextView, which is implemented in Objective-C on top of NSAttributedString. If you're willing to incur the Objective-C message-passing overhead (which, let's be honest, is probably occurring behind the scenes anyway to generate the String), you can just call its length property, which, since NSAttributedString is built on top of NSString, which does guarantee that it uses UTF-16 internally, is almost certain to be in constant time.

Char string encoding differences between native C++ and C++/CLI?

I have a strange problem for which I believe there is a solution but I cannot find it. Your help would be appreciated.
On the one hand, I have a native C++ class named Native which has a static wchar_t array containing accented characters. This array is const and defined at build time.
/// Header file
class Native
{
public:
    static const wchar_t* Array() { return mArray; }
private:
    static const wchar_t* mArray;
};
//--------------------------------------------------------------
/// .cpp file
const wchar_t* Native::mArray = {L"This is a description éàçï"};
On the other hand, I have a C++/CLI class that uses the array like this:
/// C++/CLI use
System::String^ S1 = gcnew System::String( Native::Array() );
System::String^ S2 = gcnew System::String( L"This is a description éàçï" );
The problem is that while S2 gives "This is a description éàçï" as expected, S1 gives "This is a description éà çï". I do not understand why passing a pointer to the static array does not give the same result as passing the same literal directly.
I guess this is an encoding problem, but I would have expected the same results for both S1 and S2. Do you know how to solve it? The way I must use it in my program is like S1, i.e. by accessing the build-time static array through a static method that returns a const wchar_t*.
Thanks for your help!
EDIT 1
What is the best way to define literals at build time in C++ using Intel C++ 13.0 so that they are directly usable in the C++/CLI System::String constructor? This may be the real question behind my problem.
I don't have enough reputation to add a comment to ask this question, so I apologize for posting this as an answer if that seems inappropriate.
Could the problem be that your compiler defines wchar_t to be 8 bits? I'm basing the possibility on this answer:
Should I use wchar_t when using UTF-8?
To answer your question (in the comments) about building a UTF-16 array at build time, I believe you can force it to be UTF-16 by using u"..." for your literal instead of L"..." (see http://en.cppreference.com/w/cpp/language/string_literal)
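As a small hedged illustration of the u"..." suggestion (C11/C++11 syntax; note that on Microsoft compilers wchar_t is already a 16-bit UTF-16 code unit, so this mainly matters if your compiler really does treat wchar_t differently):
#include <uchar.h>   /* char16_t (C11) */
#include <wchar.h>
#include <stdio.h>

int main(void) {
    /* u"..." is always a char16_t (UTF-16) literal, whatever the compiler
       decides wchar_t is; L"..." uses whatever width wchar_t happens to have. */
    const char16_t u16[]  = u"\u00E9\u00E0\u00E7\u00EF";  /* éàçï as UTF-16 code units */
    const wchar_t  wide[] = L"\u00E9\u00E0\u00E7\u00EF";  /* same characters as a wchar_t literal */

    printf("char16_t is %u bytes, wchar_t is %u bytes\n",
           (unsigned)sizeof(char16_t), (unsigned)sizeof(wchar_t));
    printf("%u UTF-16 code units\n",
           (unsigned)(sizeof(u16) / sizeof(u16[0]) - 1));
    (void)wide;
    return 0;
}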
Edit 1:
For what it's worth, I tried your code (after fixing a couple compile errors) using Microsoft Visual Studio 10 and didn't have the same problem (both strings printed as expected).
I don't know if it will help you, but another possible way to statically initialize this wchar_t array is to use std::wstring to wrap your literal and then set your array to the c-string pointer returned by wstring::c_str(), shown as follows:
std::wstring ws(L"This is a description éàçï");
const wchar_t* Native::mArray = ws.c_str();
This edit was inspired by Dynamic wchar_t array (C++ beginner)

How does the auto-free()ing work when I use functions like mktemp()?

Greetings,
I'm using mktemp() (iPhone SDK) and this function returns a char * to the new file name where all "X" are replaced by random letters.
What confuses me is the fact that the returned string is automatically free()d. How (and when) does that happen? I doubt it has something to do with the Cocoa event loop. Is it automatically freed by the kernel?
Thanks in advance!
mktemp just modifies the buffer you pass in and returns that same pointer, so there's no extra buffer to be freed.
That's at least how the OS X manpage describes it (I couldn't find documentation for iPhone), and the POSIX manpage as well (although the example in the POSIX manpage looks to be wrong, as it passes in a pointer to a string literal, possibly an old remnant; the function is also marked as legacy, with the advice to use mkstemp instead. The OS X manpage specifically mentions that as being an error).
So, this is what will happen:
#include <stdlib.h> /* mktemp is declared here on Linux; on OS X/iOS it's in <unistd.h> */
#include <assert.h>

char template[] = "/tmp/fooXXXXXX";
char *ptr;
if ((ptr = mktemp(template)) != NULL) {
    assert(ptr == template); // will be true:
    // mktemp just returns the same pointer you pass in
}
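And since the manpages recommend mkstemp instead, here is a minimal hedged sketch of that route as well (POSIX; it likewise rewrites the buffer in place, and it additionally creates and opens the file, so there is still nothing extra to free):
#include <stdlib.h>   /* mkstemp */
#include <unistd.h>   /* close, unlink */
#include <stdio.h>

int main(void) {
    char template[] = "/tmp/fooXXXXXX";   /* must be a writable array, not a string literal */
    int fd = mkstemp(template);           /* replaces the XXXXXX and opens the file */
    if (fd == -1) {
        perror("mkstemp");
        return 1;
    }
    printf("created %s\n", template);     /* template now holds the real file name */
    close(fd);
    unlink(template);                     /* clean up the temporary file */
    return 0;
}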
If it's like the cygwin function of the same name, then it's returning a pointer to an internal static character buffer that will be overwritten by the next call to mktemp(). On cygwin, the mktemp man page specifically mentions _mktemp_r() and similar functions that are guaranteed reentrant and use a caller-provided buffer.

What kind of data type is this?

In a class header I have seen something like this:
enum {
    kAudioSessionProperty_PreferredHardwareSampleRate = 'hwsr', // Float64
    kAudioSessionProperty_PreferredHardwareIOBufferDuration = 'iobd' // Float32
};
Now I wonder what data type such a kAudioSessionProperty_PreferredHardwareSampleRate actually is.
I mean, this looks like plain old C, but in Objective-C I would write @"hwsr" if I wanted to make it a string.
I want to pass such a "constant" or "enum thing" as an argument to a method.
This converts to a UInt32 enum value using the ASCII value of each of the four characters. This style has been around for a long time in Mac OS headers.
'hwsr' has the same value as if you had written 0x68777372, but is a lot more reader-friendly. If you used the @"hwsr" style instead, you would need more than 4 bytes to represent the same value.
The advantage of using this style is that you are actually able to quickly identify the content of a raw data stream if you can see the ASCII values of it.
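To make that concrete, a small C sketch (hedged: the value of a multi-character constant such as 'hwsr' is technically implementation-defined, but the Mac toolchains pack it as described above):
#include <stdio.h>
#include <stdint.h>

int main(void) {
    /* 'hwsr' packs the four ASCII bytes into one 32-bit value. */
    uint32_t code   = 'hwsr';  /* typically 0x68777372 ('h'=0x68, 'w'=0x77, 's'=0x73, 'r'=0x72) */
    uint32_t packed = ((uint32_t)'h' << 24) | ('w' << 16) | ('s' << 8) | 'r';

    printf("0x%08X 0x%08X\n", code, packed);

    /* Unpacking the bytes gives back the readable tag, which is why these
       codes are easy to spot in a raw data dump. */
    printf("%c%c%c%c\n",
           (packed >> 24) & 0xFF, (packed >> 16) & 0xFF,
           (packed >>  8) & 0xFF,  packed        & 0xFF);
    return 0;
}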