I am working on an SMS application for the iPhone. I need to detect whether the user has entered any Unicode characters inside the NSString they wish to send.
I need to do this because Unicode characters take up more space in the message, and also because I need to convert them into their hexadecimal equivalents.
So my question is: how do I detect the presence of a Unicode character in an NSString (which I read from a UITextView)? And how do I then convert those characters into their UCS-2 hexadecimal equivalents?
E.g. 繁 = 7E41, 体 = 4F53, 中 = 4E2D, 文 = 6587
To check whether a string contains only ASCII characters (or another encoding of your choice), use:
[myString canBeConvertedToEncoding:NSASCIIStringEncoding];
It will return NO if the string contains non-ASCII characters. You can then convert the string to UCS-2 data with:
[myString dataUsingEncoding:NSUTF16BigEndianStringEncoding];
or NSUTF16LittleEndianStringEncoding, depending on your platform. There are slight differences between UCS-2 and UTF-16; UTF-16 has superseded UCS-2. You can read about the differences here:
http://en.wikipedia.org/wiki/UTF-16/UCS-2
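As a minimal sketch of both steps (assuming a hypothetical message variable holding the text from the UITextView), you could check for non-ASCII content and then log each UTF-16 code unit as four hex digits; note that characters outside the Basic Multilingual Plane occupy two code units, which is exactly where UCS-2 and UTF-16 diverge:
NSString *message = textView.text; // hypothetical UITextView outlet
if (![message canBeConvertedToEncoding:NSASCIIStringEncoding]) {
    // The message contains non-ASCII characters; log each UTF-16 code unit
    // as a four-digit hex value (e.g. 繁 -> 7E41, 体 -> 4F53).
    for (NSUInteger i = 0; i < [message length]; i++) {
        NSLog(@"%04X", (unsigned int)[message characterAtIndex:i]);
    }
}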
I couldn't get this to work.
I had an HTML string containing a NON-BREAKING SPACE:
</div>Great Guildford St/SouthwarkSt &nbsp;Stop:&nbsp; BM<br>Walk to SE1 0HL<br>
"Great Guildford St/SouthwarkSt \U00a0Stop:\U00a0 BM",
I tried three types of encode/decode:
// NSData *asciiData = [instruction dataUsingEncoding:NSUTF16BigEndianStringEncoding];
// NSString *asciiString = [[NSString alloc] initWithData:asciiData
//                                               encoding:NSUTF16BigEndianStringEncoding];

// NSData *asciiData = [instruction dataUsingEncoding:NSASCIIStringEncoding];
// NSString *asciiString = [[NSString alloc] initWithData:asciiData
//                                               encoding:NSASCIIStringEncoding];

// little endian
NSData *asciiData = [instruction dataUsingEncoding:NSUTF16LittleEndianStringEncoding];
NSString *asciiString = [[NSString alloc] initWithData:asciiData
                                              encoding:NSUTF16LittleEndianStringEncoding];
None of these worked.
They seemed to work, in the sense that if I NSLog the string it looks OK:
NSLog(#"HAS UNICODE :%#", instruction);
..do encode/decode
NSLog(#"UNICODE AFTER:%#", asciiString);
Which output:
HAS UNICODE: St/SouthwarkSt Stop: BM
UNICODE AFTER: St/SouthwarkSt Stop: BM
But I happened to store these strings in an NSArray, and when I called [stringArray description] all the Unicode was still in there:
instructionsArrayString: (
"Great Guildford St/SouthwarkSt \U00a0Stop:\U00a0 BM",
"Walk to SE1 0HL"
)
So NSLog hides something that still shows up in the NSArray description, and you may think you've removed the Unicode when you haven't.
I will try another method that replaces the characters instead.
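For what it's worth, a minimal sketch of that replacement approach (assuming instruction is the string shown above) would simply swap the U+00A0 non-breaking spaces for ordinary spaces:
// Replace non-breaking spaces (U+00A0) with regular spaces.
NSString *cleaned = [instruction stringByReplacingOccurrencesOfString:@"\u00A0"
                                                           withString:@" "];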
In any URL, you can have special characters like ? & ~ : /
and, soon if not already, accented characters.
What I'd like is to convert ANY URL into its nearest equivalent in pure ASCII characters,
THEN replace any remaining special characters with a _.
I've tried the following, drawing inspiration from many examples around the net, but it does not work (for example, using this code, the character "é" is not converted to "e" in @"http://www.mélange.fr/~fermer.php?aa=10&ee=13"):
NSMutableCharacterSet *charactersToKeep = [NSMutableCharacterSet alphanumericCharacterSet];
[charactersToKeep addCharactersInString:@"://&=~?"];
NSCharacterSet *charactersToRemove = [charactersToKeep invertedSet];
myNSString = [[[myNSString decomposedStringWithCanonicalMapping]
                  componentsSeparatedByCharactersInSet:charactersToRemove]
                  componentsJoinedByString:@""];
That is just to start; afterwards I will have to replace the remaining special characters with _.
How can I achieve this?
As an example (and only as an example), I'd like to convert:
http://www.mélange.fr/~fermer.php?aa=10&ee=13
to
http___www.melange.fr__fermer_php_aa_10_ee_13
of course without having to check every possible special or accented character one by one.
Two thoughts:
To replace accented characters with unaccented ones, there are a couple of candidates:
You can use CFStringTransform:
NSMutableString *mutableString = [string mutableCopy];
CFStringTransform((__bridge CFMutableStringRef)mutableString, NULL, kCFStringTransformStripCombiningMarks, NO);
You could use dataUsingEncoding:allowLossyConversion:
NSData *data = [string dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
NSString *result = [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding];
Characters it doesn't know what to do with become ?, and this approach sometimes replaces one character with multiple characters (e.g. © with (C)), which you may or may not want.
Once you do this international character conversion, it looks like you want to replace any character that is not alphanumeric or a period with an underscore, which you could do with stringByReplacingOccurrencesOfString:withString:options:range: and a regular expression:
NSString *result = [string stringByReplacingOccurrencesOfString:@"[^a-z0-9\\.]"
                                                     withString:@"_"
                                                        options:NSRegularExpressionSearch | NSCaseInsensitiveSearch
                                                          range:NSMakeRange(0, [string length])];
There are lots of permutations of this regular expression that will accomplish the same thing, but hopefully you get the idea.
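Putting the two steps together, a rough sketch using the example URL from the question (and assuming the lossy ASCII conversion maps é to e, as described above) might look like this:
NSString *url = @"http://www.mélange.fr/~fermer.php?aa=10&ee=13";

// 1. Strip accents by converting lossily to ASCII.
NSData *asciiData = [url dataUsingEncoding:NSASCIIStringEncoding allowLossyConversion:YES];
NSString *asciiURL = [[NSString alloc] initWithData:asciiData encoding:NSASCIIStringEncoding];

// 2. Replace everything that isn't a letter, digit, or period with an underscore.
NSString *result = [asciiURL stringByReplacingOccurrencesOfString:@"[^a-z0-9\\.]"
                                                       withString:@"_"
                                                          options:NSRegularExpressionSearch | NSCaseInsensitiveSearch
                                                            range:NSMakeRange(0, [asciiURL length])];
// result should be roughly: http___www.melange.fr__fermer.php_aa_10_ee_13 (periods are kept here)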
How do I append Unicode characters in the range U+0000 to U+0099 to an NSString in iOS? I have used the following link for reference: http://en.wikipedia.org/wiki/List_of_Unicode_characters
Try to use this one....
NSString uses UTF-16 to store codepoints internally, so those in the range you're looking for (U+1F300 to U+1F6FF) will be stored as a surrogate pair (four bytes). Despite its name, characterAtIndex: (and unichar) doesn't know about codepoints and will give you the single 16-bit code unit at the index you give it (the 55357 you're seeing is the lead surrogate of the codepoint in UTF-16).
To examine the raw codepoints, you'll want to convert the string/characters into UTF-32 (which encodes them directly). To do this, you have a few options:
1) Get both UTF-16 code units that make up the codepoint, and use either this algorithm or CFStringGetLongCharacterForSurrogatePair to convert the surrogate pair to UTF-32 (see the sketch after this list).
2) Use either dataUsingEncoding: or getBytes:maxLength:usedLength:encoding:options:range:remainingRange: to convert the NSString to UTF-32, and interpret the raw bytes as a uint32_t.
3) Use a library like ICU.
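A minimal sketch of option 1, assuming a hypothetical emoji string whose first character lies outside the BMP (CFStringIsSurrogateHighCharacter and CFStringGetLongCharacterForSurrogatePair come from CoreFoundation, which Foundation pulls in):
NSString *emoji = @"🌀"; // U+1F300, stored as a surrogate pair inside NSString
unichar lead = [emoji characterAtIndex:0];
UTF32Char codepoint = lead; // BMP characters map to UTF-32 directly
if (CFStringIsSurrogateHighCharacter(lead) && [emoji length] > 1) {
    // Combine the surrogate pair into a single UTF-32 codepoint.
    unichar trail = [emoji characterAtIndex:1];
    codepoint = CFStringGetLongCharacterForSurrogatePair(lead, trail);
}
NSLog(@"U+%04X", (unsigned int)codepoint); // logs U+1F300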
I'm not sure this is a 100% correct solution, but it works:
NSString *uniString = [NSString stringWithFormat:@"%C", (unichar)0x0021];
Where 0x0021 is your unicode char code.
You can test it with this loop:
for (unichar ch = 0x0000; ch <= 0x0099; ch++) {
    NSString *uniString = [NSString stringWithFormat:@"%C", ch];
    NSLog(@"%@", uniString);
}
I have an application where I want to export the address book details to vCard format. This is the code I have written, but the problem is that the email address, photo, organisation name, etc. are not getting saved in the vCard.
- (NSString *)vcardrepresentation
{
    NSMutableArray *mutableArray = [[NSMutableArray alloc] init];
    [mutableArray addObject:@"BEGIN:VCARD"];
    [mutableArray addObject:@"VERSION:3.0"];
    [mutableArray addObject:[NSString stringWithFormat:@"FN:%@ %@", self.contactlist.objContact.firstname, self.contactlist.objContact.lastname]];
    [mutableArray addObject:[NSString stringWithFormat:@"ORG:%@", self.contactlist.objContact.companyname]];
    [mutableArray addObject:[NSString stringWithFormat:@"ADR:%@", self.contactlist.objContact.City]];
    if ([phoneArray count] != 0)
        [mutableArray addObject:[NSString stringWithFormat:@"TEL:%@", phoneemail.phoneNumber]];
    if ([emailArray count] != 0)
    {
        [mutableArray addObject:[NSString stringWithFormat:@"EMAIL:%@", phoneemail.phoneNumber]];
    }
    if ([contactlist.objContact.Photo length] == 0)
    {
        [mutableArray addObject:[NSString stringWithFormat:@"PHOTO:%@", [UIImage imageNamed:@"man.png"]]];
    }
    else
    {
        [mutableArray addObject:[NSString stringWithFormat:@"PHOTO:%@", [UIImage imageWithData:contactlist.objContact.Photo]]];
    }
    [mutableArray addObject:@"END:VCARD"];
    NSString *string = [mutableArray componentsJoinedByString:@"\n"];
    return string;
}
How can I save all the contact data in vcard format?
Rani, I suggest the following pseudocode:
Get contact photograph as NSData (contactlist.objContact.Photo)
Convert NSData bytes to BASE 64 encoding scheme (NSData to base64, base64EncodedString)
Add encoded data and properties to vCard:
[mutableArray addObject:[NSString stringWithFormat:@"PHOTO;ENCODING=BASE64;TYPE=JPEG:%@", data]];
For your information, vCard photographs are images encoded with the Base64 scheme. There are 16 supported file formats, including GIF and JPEG. Here's an example:
PHOTO;ENCODING=BASE64;TYPE=GIF:
R0lGODdhfgA4AOYAAAAAAK+vr62trVIxa6WlpZ+fnzEpCEpzlAha/0Kc74+PjyGM
SuecKRhrtX9/fzExORBSjCEYCGtra2NjYyF7nDGE50JrhAg51qWtOTl7vee1MWu1
50o5e3PO/3sxcwAx/4R7GBgQOcDAwFoAQt61hJyMGHuUSpRKIf8A/wAY54yMjHtz
...
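A minimal sketch of steps 2 and 3 above, assuming contactlist.objContact.Photo holds JPEG data and that base64EncodedStringWithOptions: (iOS 7+) is available; for strict spec compliance the long Base64 line should still be folded as in the example above:
NSData *photoData = contactlist.objContact.Photo;                     // raw image bytes from the contact
NSString *base64Photo = [photoData base64EncodedStringWithOptions:0]; // single-line Base64 string
[mutableArray addObject:[NSString stringWithFormat:@"PHOTO;ENCODING=BASE64;TYPE=JPEG:%@", base64Photo]];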
(1) It looks like you are setting the value of the EMAIL property to the phone number.
(2) The format of the ADR property is incorrect. The correct format is to separate the address into its individual components, delimited by semicolons. The format is:
ADR:post-office-box;extended-address;street-address;city;state;zip-code;country
If an address is missing a component (for example, it doesn't have a post office box), then an empty string should be used. Therefore, an ADR value should always contain 6 semicolons.
(3) Semicolons, commas, backslashes, and especially newlines should be escaped in all vCard property values. Semicolon and comma characters have special meanings inside some properties (such as ADR and ORG), so it is especially important that these characters be escaped for those properties. The characters are escaped with backslashes, like so: \;, \,, \\, \n (see the sketch below).
(4) Beware of folding. The specs recommend that no line exceed 75 characters (excluding the newline). If a line exceeds this limit, it can be "folded" by inserting a newline and adding at least one tab or space character at the beginning of the continuation line (as shown in @rjobidon's answer).
(5) The correct newline sequence for a vCard is \r\n, not \n.
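As a rough sketch of point (3), a hypothetical helper for escaping a single property value before it goes into the vCard might look like this (backslashes must be escaped first):
// Hypothetical helper: escape \ ; , and newlines in a vCard property value.
static NSString *EscapeVCardValue(NSString *value)
{
    NSString *escaped = [value stringByReplacingOccurrencesOfString:@"\\" withString:@"\\\\"];
    escaped = [escaped stringByReplacingOccurrencesOfString:@";" withString:@"\\;"];
    escaped = [escaped stringByReplacingOccurrencesOfString:@"," withString:@"\\,"];
    escaped = [escaped stringByReplacingOccurrencesOfString:@"\n" withString:@"\\n"];
    return escaped;
}
You would then build each line with the escaped value, e.g. [NSString stringWithFormat:@"ORG:%@", EscapeVCardValue(self.contactlist.objContact.companyname)].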
I'm working with a database that includes hex codes for UTF32 characters. I would like to take these characters and store them in an NSString. I need to have routines to convert in both ways.
To convert the first character of an NSString to a unicode value, this routine seems to work:
const unsigned char *cs = (const unsigned char *)
    [s cStringUsingEncoding:NSUTF32StringEncoding];
uint32_t code = 0;
for (int i = 3; i >= 0; i--) {
    code <<= 8;
    code += cs[i];
}
return code;
However, I am unable to do the reverse (i.e. take a single code and convert it into an NSString). I thought I could just reverse what I do above by creating a C string with the UTF-32 character in it, with the bytes in the correct order, and then create an NSString from that using the correct encoding.
However, converting to and from C strings does not seem to be reversible for me.
For example, I've tried this code, and the "tmp" string is not equal to the original string "s".
char *cs = [s cStringUsingEncoding:NSUTF32StringEncoding];
NSString *tmp = [NSString stringWithCString:cs encoding:NSUTF32StringEncoding];
Does anyone know what I am doing wrong? Should I be using wchar_t for the C string instead of char *?
Any help is greatly appreciated!
Thanks,
Ron
You have a couple of reasonable options.
1. Conversion
The first is to convert your UTF32 to UTF16 and use those with NSString, as UTF16 is the "native" encoding of NSString. It's not actually all that hard. If the UTF32 character is in the BMP (i.e. its high two bytes are 0), you can just cast it to unichar directly. If it's in any other plane, you can convert it to a surrogate pair of UTF16 characters. You can find the rules on the Wikipedia page. But a quick (untested) conversion would look like:
UTF32Char inputChar = // my UTF-32 character
inputChar -= 0x10000;
unichar highSurrogate = inputChar >> 10; // leave the top 10 bits
highSurrogate += 0xD800;
unichar lowSurrogate = inputChar & 0x3FF; // leave the low 10 bits
lowSurrogate += 0xDC00;
Now you can create an NSString using both characters at the same time:
NSString *str = [NSString stringWithCharacters:(unichar[]){highSurrogate, lowSurrogate} length:2];
To go backwards, you can use [NSString getCharacters:range:] to get the unichars back and then reverse the surrogate pair algorithm to get your UTF32 character back (any character which isn't in the range 0xD800-0xDFFF should just be cast to UTF32 directly).
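A minimal sketch of that reverse step, assuming str holds a single non-BMP character encoded as a surrogate pair:
unichar pair[2];
[str getCharacters:pair range:NSMakeRange(0, 2)];
// Undo the 0xD800/0xDC00 offsets, recombine the two 10-bit halves,
// and add back the 0x10000 base to recover the UTF-32 codepoint.
UTF32Char decoded = ((UTF32Char)(pair[0] - 0xD800) << 10)
                  + (pair[1] - 0xDC00)
                  + 0x10000;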
2. Byte buffers
Your other option is to let NSString do the conversion directly without using cStrings. To convert a UTF32 value into an NSString you can use something like the following:
UTF32Char inputChar = // input UTF32 value
inputChar = NSSwapHostIntToLittle(inputChar); // swap to little-endian if necessary
NSString *str = [[[NSString alloc] initWithBytes:&inputChar length:4 encoding:NSUTF32LittleEndianStringEncoding] autorelease];
To get it back out again, you can use
UTF32Char outputChar;
if ([str getBytes:&outputChar maxLength:4 usedLength:NULL encoding:NSUTF32LittleEndianStringEncoding options:0 range:NSMakeRange(0, 1) remainingRange:NULL]) {
outputChar = NSSwapLittleIntToHost(outputChar); // swap back to host endian
// outputChar now has the first UTF32 character
}
There are two problems here:
1:
The first one is that both [NSString cStringUsingEncoding:] and [NSString getCString:maxLength:encoding:] return the C string in native endianness (little-endian) without adding a BOM to it when using NSUTF32StringEncoding and NSUTF16StringEncoding.
The Unicode standard states (see "How I should deal with BOMs"):
"If there is no BOM, the text should be interpreted as big-endian."
This is also stated in NSString's documentation (see "Interpreting UTF-16-Encoded Data"):
"... if the byte order is not otherwise specified, NSString assumes that the UTF-16 characters are big-endian, unless there is a BOM (byte-order mark), in which case the BOM dictates the byte order."
Although they're referring to UTF-16, the same applies to UTF-32.
2:
The second one is that [NSString stringWithCString:encoding:] internally uses CFStringCreateWithCString to create the string from the C string. The problem with this is that CFStringCreateWithCString only accepts strings that use an 8-bit encoding. From the documentation (see the "Parameters" section):
The string must use an 8-bit encoding.
To solve this issue:
Explicitly state the encoding endianness you want to use both ways (NSString -> C-string and C-string -> NSString)
Use [NSString initWithBytes:length:encoding:] when trying to create an NSString from a C-string encoded in UTF-32 or UTF-16.
Hope this helps!
When I use this code to generate a SHA-256 hash in my iPhone app:
unsigned char hashedChars[32];
NSString *inputString;
inputString = [NSString stringWithFormat:@"hello"];
CC_SHA256([inputString UTF8String],
          [inputString lengthOfBytesUsingEncoding:NSASCIIStringEncoding],
          hashedChars);
NSData *hashedData = [NSData dataWithBytes:hashedChars length:32];
The SHA-256 of inputString is created correctly, but if I use a string like @"\x00\x25\x53\b4", the hash is different from the hash of the real string with the "\x" characters.
I think the problem is the UTF-8 encoding instead of ASCII.
Thanks!
I would be suspicious of the first character, "\x00" - that's going to terminate anything that thinks it's dealing with "regular C strings".
I'm not sure whether lengthOfBytesUsingEncoding: takes that into account, but it's something I'd experiment with.
You're getting the bytes with [inputString UTF8String] but the length with [inputString lengthOfBytesUsingEncoding:NSASCIIStringEncoding]. This is obviously wrong. Moreover (assuming you mean "\xB4" and that it turns into something not in ASCII), "\xB4" is not likely to be in ASCII. The docs for NSString say
Returns 0 if the specified encoding cannot be used to convert the receiver
So you're calculating the hash of the empty string. Of course it's wrong.
You're less likely to have problems if you only generate the data once:
NSData * inputData = [inputString dataUsingEncoding:NSUTF8StringEncoding];
CC_SHA256(inputData.bytes, inputData.length, hashedChars);
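A self-contained sketch of that approach (assuming CommonCrypto is linked), including a hex dump of the digest:
#import <CommonCrypto/CommonDigest.h>

NSString *inputString = @"hello";
NSData *inputData = [inputString dataUsingEncoding:NSUTF8StringEncoding];

unsigned char hashedChars[CC_SHA256_DIGEST_LENGTH];
CC_SHA256(inputData.bytes, (CC_LONG)inputData.length, hashedChars);

// Build a hex string from the 32 digest bytes.
NSMutableString *hex = [NSMutableString stringWithCapacity:CC_SHA256_DIGEST_LENGTH * 2];
for (int i = 0; i < CC_SHA256_DIGEST_LENGTH; i++) {
    [hex appendFormat:@"%02x", hashedChars[i]];
}
NSLog(@"SHA-256: %@", hex);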