Extracting data (strings) from a large string - iPhone

A long time ago I had to extract data from a string, and I went with a while loop that walked the whole string char by char, extracting the bits of data that I needed. It wasn't very efficient, but it worked.
In my latest app I would like to do it the way a good engineer would. Are there ways to search the string for an expression, or maybe a substring?
For example, the HTML in the string contains a line with a team name:
<td width="25%"><span class="teamname">Blue Bombers</span></td>
Is there a call I can make that would find "teamname" and then extract the team name from between the > and <?
I could go char by char, saving the last 10 chars to a string until that string equals "teamname", then keep going until I hit the >, and save everything I get until I hit a < again. But I guess that's taking the easy, inefficient way.
Many Thanks
-Code

You can get the range of the string "class" as an NSRange and then apply your logic; that will probably reduce the character-by-character searching.
Your code should look like this:
if ([substring rangeOfString:@"class"].location != NSNotFound) {
    // "class" was found
} else {
    // "class" was not found
}

If that's the only part of the string you're interested in, then just find a starting point like "teamname" via -rangeOfString:. If there's more than one occurrence, make repeated calls with -rangeOfString:options:range:.
If you need more comprehensive parsing, however:
If this string is actual XHTML then you may be able to use one of the various XML parsers, e.g. TouchXML, and then find what you need via DOM lookups. However if (as seems likely) it's not pure XHTML then this is unlikely to help. In that case you might try loading up the HTML in an offscreen UIWebView and using JavaScript calls to find specific elements.
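The substring-search approach described above (find a marker such as "teamname", then take the text between the next > and the following <, and repeat the search from where the last match ended) can be sketched in a language-neutral way. This is only an illustration in Python, not the NSString API; the function name extract_after_marker is made up for the example, and it assumes the markup is regular enough that the first > after the marker closes the opening tag.

```python
def extract_after_marker(html, marker):
    """Collect the text between the first '>' after each `marker` and the next '<'."""
    results = []
    start = 0
    while True:
        pos = html.find(marker, start)          # analogous to rangeOfString: from `start`
        if pos == -1:
            break                               # no more occurrences
        open_end = html.find(">", pos)          # end of the opening tag
        if open_end == -1:
            break
        close_start = html.find("<", open_end + 1)  # start of the closing tag
        if close_start == -1:
            break
        results.append(html[open_end + 1:close_start])
        start = close_start                     # continue searching after this match
    return results


html = '<td width="25%"><span class="teamname">Blue Bombers</span></td>'
print(extract_after_marker(html, "teamname"))   # ['Blue Bombers']
```

Each search resumes from the end of the previous match, so the string is scanned once overall rather than restarted from the beginning for every occurrence.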

Related

How to display an int without commas?

I have a list of Text views that include a year saved as an int. I'm displaying it in an interpolated string:
Text("\($0.property) \($0.year) \($0.etc)")
The problem is that it adds a comma to the int; for example, it displays 1,944 instead of the year 1944. I'm sure this is a simple fix, but I've been unable to figure out how to remove the comma. Thanks!
There is an explicit Text(verbatim:) initializer that renders a string without localization formatting, so a solution for your case is
Text(verbatim: "\($0.property) \($0.year) \($0.etc)")
Use Text(String(yourIntValue)). If you use interpolation, you need to convert the value to a String explicitly. If you let Text format the int itself, it is shown with a comma.
So, to recap:
let yourIntValue = 1234
Text(String(yourIntValue)) // displays `1234`
Text("\(yourIntValue)") // displays `1,234`
I use the built-in format parameter. It's useful for formatting well beyond this one specific usage (no commas).
Text("Disk Cache \(URLCache.shared.currentDiskUsage,
format: .number.grouping(.never))")

Swift string with key-value pairs: is this format standard? How can I get it as a dictionary?

I work with an array of strings; each string is an encoded object.
I want to decode the object. When I print one of the strings, I get something structured like this:
"firstName=\"Elliot\" lastName=\"Alderson\" gender=\"male\" age=\"33\",some description I also need to get"
Is that a standard format for storing key-value properties? I can't find anything on the internet. The keys are always the same, so it's not a big deal to get these values as a dictionary, but I would like to know if there is a best-practice method to get this data instead of just searching for each key and then reading the value from the first quote to the second one (for each value).
My file is 30000 lines, so I'd better choose the most optimized way.
Thanks !
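Whether or not the format has a name, a string shaped like the sample above can be split with a single regular-expression pass rather than one search per key. The following is a minimal sketch in Python, not Swift (the same pattern should carry over to NSRegularExpression); the names PAIR and parse_record are hypothetical, and it assumes values never contain a double quote and that the free-text description follows the last pair after a comma.

```python
import re

# key="value" pairs: a word-like key, an equals sign, a double-quoted value.
PAIR = re.compile(r'(\w+)="([^"]*)"')


def parse_record(line):
    """Return (fields, description) for one encoded line."""
    fields = {}
    end = 0
    for match in PAIR.finditer(line):
        fields[match.group(1)] = match.group(2)
        end = match.end()                 # remember where the last pair stopped
    # Everything after the last pair is treated as the trailing description.
    description = line[end:].lstrip(", ")
    return fields, description


line = 'firstName="Elliot" lastName="Alderson" gender="male" age="33",some description I also need to get'
fields, description = parse_record(line)
print(fields["firstName"], fields["age"])   # Elliot 33
print(description)                          # some description I also need to get
```

Compiling the pattern once and reusing it for every line keeps the per-line cost low, which matters more than the choice of technique for a file of about 30000 lines.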

Perl XML::SAX - characters() method error

I'm new to using Perl XML::SAX and I encountered a problem with the characters event that is triggered. I'm trying to parse a very large XML file using perl.
My goal is to get the content of each tag (I do not know the tag names - given any xml file, I should be able to crack the record pattern and return every record with its data and tag like Tag:Data).
With small files, everything is OK. But when running on a large file, the characters() event delivers only part of the content. There is no specific pattern in how it cuts the data: sometimes it's the first few characters, sometimes the last few characters, and sometimes just one letter of the actual data.
The Sax Parser is:
$myhandler = MyFilter->new();
$parser = XML::SAX::ParserFactory->parser(Handler => $myhandler);
$parser->parse_file($filename);
I have written my own handler called MyFilter, which overrides the characters method:
sub characters {
my ($self, $element) = @_;
$globalvar = $element->{Data};
print "content is: $globalvar \n";
}
Even this print statement shows the values only partially at times.
I also tried setting the parser package before calling $parser->parse():
$XML::SAX::ParserPackage = "XML::SAX::ExpatXS";
It still doesn't work. Could anyone help me out here? Thanks in advance!
Sounds like you need XML::Filter::BufferText.
http://search.cpan.org/dist/XML-Filter-BufferText/BufferText.pm
From the description "One common cause of grief (and programmer error) is that XML parsers aren't required to provide character events in one chunk. They can, but are not forced to, and most don't. This filter does the trivial but oft-repeated task of putting all characters into a single event."
It's very easy to use once you have it installed and will solve your partial character data problem.
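The underlying idea, collecting every character chunk and only using the text once the element closes, can also be written by hand. The sketch below uses Python's xml.sax, whose characters() callback may likewise deliver an element's text in several pieces; it is not XML::Filter::BufferText or its API, and the BufferingHandler class name and the sample XML are invented for the illustration. It assumes leaf elements with plain text (no mixed content).

```python
import xml.sax


class BufferingHandler(xml.sax.ContentHandler):
    """Accumulate character chunks and emit one (tag, text) record per element."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._chunks = []

    def startElement(self, name, attrs):
        self._chunks = []                  # start a fresh buffer for this element

    def characters(self, content):
        self._chunks.append(content)       # may be called several times per text node

    def endElement(self, name):
        text = "".join(self._chunks).strip()
        if text:                           # skip elements that held only whitespace
            self.records.append((name, text))
        self._chunks = []


handler = BufferingHandler()
xml.sax.parseString(b"<root><name>Blue Bombers</name><score>42</score></root>", handler)
print(handler.records)   # [('name', 'Blue Bombers'), ('score', '42')]
```

The important point is that nothing is printed or stored from inside characters(); the handler only reads the buffer in endElement, so it does not matter how the parser splits the text.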

Make Lucene index a value and store another

I want Lucene.NET to store a value while indexing a modified, stripped-down version of the stored value. e.g. Consider the value:
this_example-has some/weird (chars) 100%
I want it stored right like that (so that I can retrieve exactly that for showing in the results list), but I want lucene to index it as:
this example has some weird chars 100
(you see, like a "sanitized" version of the original value) for a simplified search.
I figure this would be the job of an analyzer, but I don't want to mess with rolling my own. Ideally, the solution should remove everything that is not a letter, a number or quotes, replacing the removed chars by a white-space before indexing.
Any suggestions on how to implement that?
This is because I am indexing products for an e-commerce search, and some have really odd names. I think this would improve search accuracy.
Thanks in advance.
If you don't want a custom analyzer, try storing the value as a separate non-indexed field, and use a simple regex to generate the sanitized version.
var input = "this_example-has some/weird (chars) 100%";
var output = Regex.Replace(input, @"[\W_]+", " ");
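To see what that pattern does to the sample value, here is the same substitution as a small Python sketch (the sanitize function name is made up for the example). Note that [\W_] also strips quotes, which the question wanted to keep, so the character class would need adjusting for that requirement; the strip() call is an addition that drops the trailing space left by the final %.

```python
import re


def sanitize(value):
    """Replace each run of non-word characters or underscores with one space."""
    return re.sub(r"[\W_]+", " ", value).strip()


original = "this_example-has some/weird (chars) 100%"
print(sanitize(original))   # this example has some weird chars 100
```

The original string would go into the stored, non-indexed field and the sanitized one into the indexed field.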
You mention that you need another analyzer for some search functionality. Don't forget PerFieldAnalyzerWrapper, which lets you use different analyzers for different fields of the same document.
public static void Main() {
var wrapper = new PerFieldAnalyzerWrapper(defaultAnalyzer: new StandardAnalyzer(Version.LUCENE_29));
wrapper.AddAnalyzer(fieldName: "id", analyzer: new KeywordAnalyzer());
IndexWriter writer = null; // TODO: Retrieve these.
Document document = null;
writer.AddDocument(document, analyzer: wrapper);
}
You are correct that this is the work of the analyzer. I'd start by using a tool like Luke to see what the standard analyzer does with your term before deciding what to use; it tends to do a good job of stripping noise characters and words.

PDF Table of Contents Parsing with iOS Quartz 2D

This question has been asked before, I know. However, nobody has answered it well. I'm wondering how to parse a PDF's "table of contents" on the iPhone. The docs tell me to use CGPDFDocumentGetCatalog but not how to use it. All they say is that it returns a dictionary. Also, I can't find any example code. Any suggestions?
It looks like the closest thing on SO is "Create a table of contents from a pdf file".
It's basically just parsing the CGPDFDictionary called "Outlines" in the document catalog.
// get outline & loop through dictionary...
CGPDFDictionaryRef outlineRef;
if(CGPDFDictionaryGetDictionary(pdfDocDictionary, "Outlines", &outlineRef)) {
}
then you start with the First element and parse your way through.
CGPDFDictionaryGetDictionary(outlineRef, "First", &firstEntry)
You want to get the Title and the Destination.
NSString *outlineTitle = PSPDFStringFromPDFDict(outlineElementRef, @"Title");
CGPDFDictionaryGetObject(outlineElementRef, "Dest", &destinationRef)
The tricky part starts with getting the correct destination, because there are (hooray, PDF!) several ways to store it, plus several ways that are not defined in the PDF Reference but still exist in the wild, plus several variants that are simply broken and that you have to deal with anyway.
For example, you could get the Count of the outline dictionary using
CGPDFInteger elements;
if(CGPDFDictionaryGetInteger(outlineRef, "Count", &elements)) {
PSPDFLog(@"parsing outline: %ld elements. (Count will be ignored anyway)", (long int)elements);
} else {
PSPDFLogError(@"Error while parsing outline. No outlineRef?");
}
But note that Count is sometimes invalid because of broken PDF creation tools. Think of PDF like HTML: even if it's broken, parsers will do their best to display as much data as they can. So my advice is to ignore Count and parse the dictionary anyway. (A few weeks ago I encountered a document that had Count = -10. Go figure.)
I can't post the full code, as it's from my commercial PDF library PSPDFKit, and I need to make a living out of it ;) But this should get you started.
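As a rough illustration of the traversal described above (start at First, read Title and Dest, and ignore Count), here is a sketch in Python over plain dictionaries standing in for CGPDFDictionary objects. It is not PSPDFKit code and not the Quartz API. The Next key for sibling entries and a nested First for child entries come from the outline structure in the PDF Reference rather than from the answer, and walk_outline and the sample data are hypothetical.

```python
def walk_outline(item, depth=0):
    """Return (depth, title, dest) tuples for an outline item and its siblings."""
    entries = []
    seen = set()                           # guards against a sibling chain that loops
    while item is not None and id(item) not in seen:
        seen.add(id(item))
        entries.append((depth, item.get("Title"), item.get("Dest")))
        child = item.get("First")          # first child entry, if any
        if child is not None:
            entries.extend(walk_outline(child, depth + 1))
        item = item.get("Next")            # next sibling on the same level
    return entries


# Stand-in data; in a real document these would be PDF dictionaries.
chapter2 = {"Title": "Chapter 2", "Dest": "page2"}
section = {"Title": "Section 1.1", "Dest": "page1b"}
chapter1 = {"Title": "Chapter 1", "Dest": "page1", "First": section, "Next": chapter2}
outlines = {"First": chapter1, "Count": -10}   # Count is deliberately ignored

for depth, title, dest in walk_outline(outlines.get("First")):
    print("  " * depth + title, "->", dest)
```

Walking the First/Next links directly, instead of looping Count times, is what makes the traversal tolerant of the invalid Count values mentioned above.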