iPhone SDK - stringWithContentsOfUrl ASCII characters in HTML source - iphone

When I fetch the source of any web page, no matter which encoding I use, I always end up with &-style entity references (such as &copy; or &reg;) instead of the actual characters themselves. The same goes for foreign characters (such as åäö in Swedish), which I have to parse from "&Aring;" and the like.
I'm using
+stringWithContentsOfURL:encoding:error:
to fetch the source and have tried several different encodings such as NSUTF8StringEncoding and NSASCIIStringEncoding, but nothing seems to affect the end result string.
Any ideas, tips or solutions are greatly appreciated! I'd rather not have to implement the entire ASCII table and replace all occurrences of every character... Thanks in advance!
Regards

I'm using
+stringWithContentsOfURL:encoding:error:
to fetch the source and have tried several different encodings such as NSUTF8StringEncoding and NSASCIIStringEncoding, but nothing seems to affect the end result string.
You're misunderstanding the purpose of that encoding: argument. The method needs to convert bytes into characters somehow; the encoding tells it what sequences of bytes describe which characters. You need to make sure the encoding matches that of the resource data.
The entity references are an SGML/XML thing. SGML and XML are not encodings; they are markup language syntaxes. stringWithContentsOfURL:encoding:error: and its cousins do not attempt to parse sequences of characters (syntax) in any way, which is what they would have to do to convert one sequence of characters (an entity reference) into a different one (the entity, in practice meaning single character, that is referenced).
You can convert the entity references to un-escaped characters using the CFXMLCreateStringByUnescapingEntities function. It takes a CFString, which an NSString is (toll-free bridging), and returns a CFString, which is an NSString.
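A minimal sketch of that call, assuming ARC. UnescapeEntities is a hypothetical helper name, not a framework API. As far as I know, passing NULL for the dictionary covers only the five standard XML entities, so HTML names such as &Aring; would have to be supplied yourself, keyed by the name without the "&" and ";".

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not a framework API); assumes ARC.
// extraEntities maps entity names without "&" and ";" (e.g. @"Aring") to their
// replacement strings. Pass nil to unescape only the standard XML entities.
// The input string must not be nil.
static NSString *UnescapeEntities(NSString *escaped, NSDictionary *extraEntities) {
    CFStringRef unescaped = CFXMLCreateStringByUnescapingEntities(
        kCFAllocatorDefault,
        (__bridge CFStringRef)escaped,
        (__bridge CFDictionaryRef)extraEntities);
    // "Create" rule: we own the result, so hand ownership over to ARC.
    return CFBridgingRelease(unescaped);
}
```

With a dictionary such as @{@"Aring": @"Å", @"ouml": @"ö"}, a string like "&Aring;ngstr&ouml;m &amp; Co" should come back with the real characters in place. A full HTML entity table would still have to be provided by the caller.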

Are you sure they are not in &Aring; form originally? Try viewing the source code in a browser first.

That really, really sucks. I wanted to convert it directly and the above solution isn't really a good one, so I just wrote my own ascii-table converter (static) class. Works as it should have worked natively (though I have to fill in the ascii table myself...)
Ideas for optimization? ("ASCII" is a static NSDictionary)
@implementation InternetHelper

+ (NSString *)HTMLSourceFromUrlWithString:(NSString *)str convertASCII:(BOOL)state
{
    NSURL *url = [NSURL URLWithString:str];
    NSString *source = [NSString stringWithContentsOfURL:url encoding:NSUTF8StringEncoding error:nil];
    if (state)
        source = [InternetHelper ConvertASCIICharactersInString:source];
    return source;
}

+ (NSString *)ConvertASCIICharactersInString:(NSString *)str
{
    NSString *ret = [NSString stringWithString:str];
    if (!ASCII)
    {
        NSString *path = [[NSBundle mainBundle] pathForResource:kASCIICharacterTableFilename ofType:kFileFormat];
        ASCII = [[NSDictionary alloc] initWithContentsOfFile:path];
    }
    for (id key in ASCII)
    {
        ret = [ret stringByReplacingOccurrencesOfString:key withString:[ASCII objectForKey:key]];
    }
    return ret;
}

@end

Related

strange NSString behaviour when used with [NSBundle mainBundle] pathForResource

I have a weird one for you guys today, I think my NSStrings are incorrectly encoded.
NSString * convertedString = [NSString stringWithUTF8String:mesh->groupMesh[i].materialData->textureName];
-textureName is just a c style string that I'm converting into an NSString.
-The string is: "dennum1.png"
NSArray * line = [convertedString componentsSeparatedByString:@"."];
NSString * texPath = [[NSBundle mainBundle] pathForResource:line[0] ofType:line[1]];
I then split it into an NSArray line, separated by periods "."
This makes it so that line[0] is dennum1, and line[1] is png.
I even do an NSLog to make sure:
NSLog(@"Name:%@ Type:%@", line[0], line[1]);
2013-09-21 02:15:27.386 SteveZissou[8846:c07] Name:dennum1 Type:png
I parse this over to the pathForResource function and I get a (null) response.
BUT if I hard type the file name into the code I.E:
convertedString = @"dennum1.png";
NSArray * line = [convertedString componentsSeparatedByString:@"."];
NSString * texPath = [[NSBundle mainBundle] pathForResource:line[0] ofType:line[1]];
NSLog(@"This is the texPath: %@",texPath);
IT WORKS?!
This is the texPath: /Users/meow/Library/Application Support/iPhone Simulator/6.0/Applications/2DEB8076-5F9D-45DE-8A73-10B1C8A084B4/SteveZissou.app/dennum1.png
Is it possible that the NSString that I hard type in the code and the NSString that comes from the conversion are encoded differently?
When I NSLog them individually I get the same result regardless of type:
2013-09-21 02:15:27.386 SteveZissou[8846:c07] This is the c style string: dennum1.png
2013-09-21 02:15:27.386 SteveZissou[8846:c07] This is the converted c style string: dennum1.png
2013-09-21 02:15:27.386 SteveZissou[8846:c07] This is the string manually typed in: dennum1.png
What happens if you use the methods in NSPathUtilities that are designed to handle this stuff. Like:
NSString * texPath = [[NSBundle mainBundle] pathForResource:string.stringByDeletingPathExtension ofType:string.pathExtension];
Also, there's -fileSystemRepresentation which will convert an NSString to a const char * and throw an exception if the string can't be converted correctly.
I figured it out after HOURS of skimming through possible code combinations and found it by accident.
I used NSURL functions to get the path based on the strings I was using, which resulted in this path:
file://localhost/Users/meow/Library/Application%20Support/iPhone%20Simulator/6.0/Applications/2DEB8076-5F9D-45DE-8A73-10B1C8A084B4/SteveZissou.app/dennum2.png%0D
Look at the very end: that %0D is not supposed to be there! It turns out to be a carriage return that was read in along with the file name (probably a remnant of the file's line endings). It is invisible in NSLog output, but NSURL percent-escapes it and so makes it visible. Snipping the carriage return off the end of the string gave me the correct file and everything is OK.
I kept thinking to myself to use a hex editor to look at the file, but couldn't find one on the mac appstore free, I think if this was windows I would've caught it in half the time.
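One way to snip the carriage return, as a hedged sketch: trim whitespace and line endings from the name before splitting it. CleanResourceName is a hypothetical helper name, and the sketch assumes the stray characters only ever sit at the start or end of the string.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: strips leading/trailing whitespace and line endings
// (including a stray "\r") before the name is looked up in the bundle.
static NSString *CleanResourceName(NSString *rawName) {
    NSCharacterSet *junk = [NSCharacterSet whitespaceAndNewlineCharacterSet];
    return [rawName stringByTrimmingCharactersInSet:junk];
}
```

The cleaned name can then be passed to pathForResource:ofType: as before. Logging the string's length next to its contents is a cheap way to notice invisible characters like this without a hex editor.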

Subscript and Superscripts in CDATA of an xml file. Using UILabel to display the parsed XML contents

I need to display subscripts and superscripts (only arabic numerals) within a UILabel. The data is taken from an XML file. Here is the snippet of XML file:
<text><![CDATA[Hello World X\u00B2 World Hello]]></text>
It's supposed to display X² (2 as superscript). When I read the string from the NSXMLParser and display it in the UILabel, it displays it as X\u00B2. Any ideas on how to make it work?
I think you can do something like this, assuming the CDATA contents have been read into an NSString and passed into this function:
- (NSString *)removeUnicodeEscapes:(NSString *)stringWithUnicodeEscapes {
    NSMutableString *result = [stringWithUnicodeEscapes mutableCopy];
    NSRange unicodeLocation = [result rangeOfString:@"\\u"];
    while (unicodeLocation.location != NSNotFound) {
        // Where to resume searching if this turns out not to be a valid escape
        NSUInteger next = unicodeLocation.location + 2;
        if (next + 4 <= result.length) {
            // Get the 4-character hex code
            NSString *charCode = [result substringWithRange:NSMakeRange(next, 4)];
            unsigned int codeValue = 0; // scanHexInt: writes an unsigned int, not a unichar
            NSScanner *scanner = [NSScanner scannerWithString:charCode];
            if ([scanner scanHexInt:&codeValue] && [scanner isAtEnd]) {
                // Convert it to an NSString and replace the 6-character escape in place
                NSString *unicodeChar = [NSString stringWithFormat:@"%C", (unichar)codeValue];
                [result replaceCharactersInRange:NSMakeRange(unicodeLocation.location, 6) withString:unicodeChar];
                next = unicodeLocation.location + 1;
            }
        }
        unicodeLocation = [result rangeOfString:@"\\u" options:0 range:NSMakeRange(next, result.length - next)];
    }
    return result;
}
I haven't had a chance to try this out, but I think the basic approach would work
\u00B2 is not any sort of XML encoding for characters. Apparently your data source has defined their own encoding scheme (which, frankly, is pretty stupid as XML is capable of encoding these directly, using entities outside of CDATA blocks).
In any case, you'll have to write your own parser that handles \u#### and converts that to the correct character.
I asked the question to my colleague and he gave me a nice and simple workaround. Am describing it here, in case others also get stuck at this.
First, go to this link. It has a list of all subscripts and superscripts. For example, in my case, I clicked on "superscript 0". On the following HTML page detailing "superscript 0", go to the "Java Data" section and copy the "⁰". You can either place this directly in the XML or write a simple regex in Obj-C to replace the escape sequence with the literal character. You will then get a nice X⁰. Do the same for any other superscript or subscript that you might want to display.

StringByAddingPercentEscapes not working on ampersands, question marks etc

I'm sending a request from my iPhone application, where some of the arguments are text that the user can enter into text boxes. Therefore, I need to URL-encode (percent-escape) them.
Here's the problem I'm running into:
NSLog(@"%@", testText); // Test & ?
testText = [testText stringByAddingPercentEscapesUsingEncoding:NSUTF8StringEncoding];
NSLog(@"%@", testText); // Test%20&%20?
As you can see, only the spaces are encoded, making the server disregard everything past the ampersand for the argument.
Is this the advertised behaviour of stringByAddingPercentEscapes? Do I have to manually replace every special character with corresponding hex code?
Thankful for any contributions.
They are not encoded because they are valid URL characters.
The documentation for stringByAddingPercentEscapesUsingEncoding: says
See CFURLCreateStringByAddingPercentEscapes for more complex transformations.
I encode my query string parameters using the following method (added to a NSString category)
- (NSString *)urlEncodedString {
    CFStringRef buffer = CFURLCreateStringByAddingPercentEscapes(kCFAllocatorDefault,
                                                                 (CFStringRef)self,
                                                                 NULL,
                                                                 CFSTR("!*'();:@&=+$,/?%#[]"),
                                                                 kCFStringEncodingUTF8);
    NSString *result = [NSString stringWithString:(NSString *)buffer];
    CFRelease(buffer);
    return result;
}
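On iOS 7 and later, a similar result can be sketched with stringByAddingPercentEncodingWithAllowedCharacters:. This is an alternative I am swapping in, not the method used in the answer above. PercentEncodeQueryValue is a hypothetical helper name, the reserved-character list is an assumption modelled on that category method, and ARC is assumed.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper; assumes ARC and iOS 7 / OS X 10.9 or later.
static NSString *PercentEncodeQueryValue(NSString *value) {
    // Start from the characters allowed in a URL query, then remove the ones
    // that carry a special meaning there so that they get escaped as well.
    NSMutableCharacterSet *allowed = [[NSCharacterSet URLQueryAllowedCharacterSet] mutableCopy];
    [allowed removeCharactersInString:@"!*'();:@&=+$,/?%#[]"];
    return [value stringByAddingPercentEncodingWithAllowedCharacters:allowed];
}
```

For the string from the question, PercentEncodeQueryValue(@"Test & ?") yields Test%20%26%20%3F, so the ampersand no longer cuts the argument short on the server.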

Organize objective-C string for filters

How can I organize this string better for best coding practices. It's a string that defines filters:
NSString* string3 = [[[[[[tvA.text stringByReplacingOccurrencesOfString:@"\n" withString:@" "] stringByReplacingOccurrencesOfString:@"&" withString:@"and"] stringByReplacingOccurrencesOfString:@"garçon" withString:@"garcon"] stringByReplacingOccurrencesOfString:@"Garçon" withString:@"Garcon"] stringByReplacingOccurrencesOfString:@"+" withString:@"and"] stringByAddingPercentEscapesUsingEncoding:NSASCIIStringEncoding];
Is there a way to have it be:
NSString* string3 = [[[[[[tvA.text filter1] filter2] filter3] filter4] filter5] stringByAddingPercentEscapesUsingEncoding:NSASCIIStringEncoding];
You shouldn't be replacing & and + before percent-escaping. The problem is that stringByAddingPercentEscapesUsingEncoding: (IIRC) adds the minimum escapes to make it a "valid" URL string, whereas you want to escape anything that might have a special interpretation. For this, use CFURLCreateStringByAddingPercentEscapes():
return [(NSString*)CFURLCreateStringByAddingPercentEscapes(NULL, (CFStringRef)aString, NULL, (CFStringRef)@":/?#[]@!$&'()*+,;=", kCFStringEncodingUTF8) autorelease];
This encodes & and + correctly, instead of just changing them to "and". It also encodes newlines as %0a (so you might want to replace them with spaces; that's your call), and encodes ç as %C3%A7 (which is decoded correctly if you use UTF-8 on the server).
The first thing I'd do is capture the transformation into a method somewhere (where "somewhere" is either an instance method on an appropriate object or a class method on a utility class).
- (NSString *) transformString: (NSString *) aString
{
    NSString *transformedString;
    transformedString = [aString stringByReplacingOccurrencesOfString:@"\n" withString:@" "];
    transformedString = [transformedString stringByReplacingOccurrencesOfString:@"&" withString:@"and"];
    transformedString = [transformedString stringByReplacingOccurrencesOfString:@"garçon" withString:@"garcon"];
    transformedString = [transformedString stringByReplacingOccurrencesOfString:@"Garçon" withString:@"Garcon"];
    transformedString = [transformedString stringByReplacingOccurrencesOfString:@"+" withString:@"and"];
    transformedString = [transformedString stringByAddingPercentEscapesUsingEncoding:NSASCIIStringEncoding];
    return transformedString;
}
Then:
NSString *result = [myTransformer transformString: tvA.text];
A bit brutish, but it'll work. And by "brutish", I mean that it is going to be slow and will cause a bunch of interim strings to pile up in the autorelease pool. However, if this is something that you only do every now and then, don't worry about it -- while brutish, it is certainly quite straightforward.
If, however, this shows up in performance analysis as a bottleneck, you could first move to using NSMutableString as it has methods for doing replacements in place. That, at least, will reduce memory thrash and will likely be a bit faster in that there is less copying of strings going on.
If that is still too slow, then you will likely need to write yourself a fun little bit of parsing and processing code that walks through the original and copies it to a new string while also doing any necessary transforms along the way.
But, don't bother optimizing until you prove that it is a problem. And, of course, if it is a problem, you have just one method to optimize!
If performance is not crucial, put the strings and their replacements into an NSDictionary and iterate over the items. Put it all in a helper method and use a NSMutableString to work on it (which reduces at least some of the cost).
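A minimal sketch of that dictionary-plus-mutable-string idea, assuming ARC. ApplyReplacements is a hypothetical helper name. Note that NSDictionary enumeration order is undefined, so this only suits replacements that do not depend on one another.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper; assumes ARC. Keys are the strings to look for,
// values are what to put in their place.
static NSString *ApplyReplacements(NSString *input, NSDictionary *replacements) {
    NSMutableString *result = [input mutableCopy];
    for (NSString *target in replacements) {
        // Replace in place instead of creating a new string for every filter.
        [result replaceOccurrencesOfString:target
                                withString:replacements[target]
                                   options:NSLiteralSearch
                                     range:NSMakeRange(0, result.length)];
    }
    return result;
}
```

The filters from the question would then live in one dictionary, for example @{@"\n": @" ", @"garçon": @"garcon", @"Garçon": @"Garcon"}, and the percent-escaping step would follow as a separate call.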

NSXMLParser rss issue NSXMLParserInvalidCharacterError

NSXMLParserInvalidCharacterError # 9
This is the error I get when I hit a weird character (like quotes copied and pasted from Word into the web form, which end up in the feed). The feed I am using does not declare an encoding, and there is no hope of getting them to change that. This is all I get in the header:
<?xml version="1.0"?>
<rss version="2.0">
What can I do about illegal characters when parsing feeds? Do I sweep the data prior to the parse? Is there something I am missing in the API? Has anyone dealt with this issue?
NSString *dataString = [[[NSString alloc] initWithData:webData encoding:NSASCIIStringEncoding] autorelease];
NSData *data = [dataString dataUsingEncoding:NSUTF8StringEncoding allowLossyConversion:YES];
NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
Fixed my problems...
The NSString -initWithData:encoding: method returns nil if it fails, so you can try one encoding after another until you find one that converts. This doesn't guarantee that you'll convert all the characters correctly, but if your feed source isn't sending you correctly encoded XML, then you'll probably have to live with it.
The basic idea is:
// try the most likely encoding
NSString *xmlString = [[NSString alloc] initWithData:xmlData
                                            encoding:NSUTF8StringEncoding];
if (xmlString == nil) {
    // try the next likely encoding
    xmlString = [[NSString alloc] initWithData:xmlData
                                      encoding:NSWindowsCP1252StringEncoding];
}
if (xmlString == nil) {
    // etc...
}
To be generic and robust, you could do the following until successful:
1.) Try the encoding specified in the Content-Type header of the HTTP response (if any)
2.) Check the start of the response data for a byte order mark and if found, try the indicated encoding
3.) Look at the first two bytes; if you find a whitespace character or '<' paired with a nul/zero character, try UTF-16 (similarly, you can check the first four bytes to see if you have UTF-32)
4.) Scan the start of the data looking for the <?xml ... ?> processing instruction and look for encoding='something' inside it; try that encoding.
5.) Try some common encodings. Definitely check Windows Latin-1, Mac Roman, and ISO Latin-1 if your data source is in English.
6.) If none of the above work, you could try removing all bytes greater than 127 (or substitute '?' or another ASCII character) and convert the data using the ASCII encoding.
If you don't have an NSString by this point, you should fail. If you do have an NSString, you should look for the encoding declaration in the <?xml ... ?> processing instruction (if you didn't already in step 4). If it's there, you should convert the NSString back to NSData using that encoding; if it's not there, you should convert back using UTF-8 encoding.
Also, the CFStringConvertIANACharSetNameToEncoding() and CFStringConvertEncodingToNSStringEncoding() functions can help get the NSStringEncoding that goes with the encoding name from the Content-Type header or the <?xml ... ?> processing instruction.
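Those two functions can be chained roughly as follows. EncodingForCharsetName is a hypothetical helper name, and returning 0 for an unknown name is a convention of this sketch, not something the frameworks define.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper. Returns 0 when the charset name is empty or not
// recognised; 0 is used here purely as a "not found" marker.
static NSStringEncoding EncodingForCharsetName(NSString *charsetName) {
    if (charsetName.length == 0) {
        return 0;
    }
    CFStringEncoding cfEncoding =
        CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)charsetName);
    if (cfEncoding == kCFStringEncodingInvalidId) {
        return 0;
    }
    return CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}
```

The name passed in would be whatever was pulled out of the Content-Type header or the encoding declaration, such as "utf-8" or "windows-1252". A non-zero result can be handed straight to -initWithData:encoding:.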
You can also remove that encoding line from xml like this:
NSUInteger length = str.length > 100 ? 100 : str.length;
NSString *mystr = [str stringByReplacingOccurrencesOfString:@"encoding=\".*?\""
                                                 withString:@""
                                                    options:NSRegularExpressionSearch
                                                      range:NSMakeRange(0, length)];