I have several large PDF docs (70-200 pages each). The PDFs themselves are generated from HTML pages (I can't get the source code of the HTML pages, which is why I am working with the PDFs). What I want to do is split each PDF into separate pages based on the attributes of the converted H1 tag. When I print out the PDF I get this:
Seller Tag (AST)
{
NSBaselineOffset = 0;
NSColor = "Device RGB colorspace 0.94118 0.32549 0.29804 1";
NSFont = "\"Helvetica 8.00 pt. P [] (0x7ff0f262e590) fobj=0x7ff0f4339680, spc=2.22\"";
}Table of Contents
{
NSBaselineOffset = 0;
NSColor = "Device RGB colorspace 0.94118 0.32549 0.29804 1";
NSFont = "\"Helvetica 34.00 pt. P [] (0x7ff0f262e590) fobj=0x7ff0f432f940, spc=9.45\"";
}...
which looks like a bunch of attributes contained in a Dictionary. But when I run this code:
let strContent = myAppManager.pdfToText(fromPDF:pdfDirPath.absoluteString + "/" + thisFile)
let strPDF:NSAttributedString = strContent
let strNSPDF = strPDF.string as NSString
let rangeOfString = NSMakeRange(0, strNSPDF.length)
let arrAttributes = strPDF.attributes(at: 0, longestEffectiveRange: nil, in: rangeOfString)
print(arrAttributes)
I get this output
[__C.NSAttributedStringKey(_rawValue: NSColor): Device RGB colorspace 0.94118 0.32549 0.29804 1, __C.NSAttributedStringKey(_rawValue: NSBaselineOffset): 0, __C.NSAttributedStringKey(_rawValue: NSFont): "Helvetica 8.00 pt. P [] (0x7ff0f441d490) fobj=0x7ff0f4339680, spc=2.22"]
I was kind of expecting a high number, like 1000 or more entries, not 1.
So snooping around, I know the H1 HTML tag gets converted to this:
Table of Contents
{
NSBaselineOffset = 0;
NSColor = "Device RGB colorspace 0.94118 0.32549 0.29804 1";
NSFont = "\"Helvetica 34.00 pt. P [] (0x7ff0f262e590) fobj=0x7ff0f432f940, spc=9.45\"";
}
So what I am looking to do is delimit the converted H1s so I can get the content between as a page and do stuff with it. Any ideas or suggestions would be appreciated.
Quickly done, assuming you have:
someText[HEADER1]someText1[HEADER2]someText2[HEADER3]someText3...
Where [HEADERN] have the same attributes (and you know them) but not the same as someTextN.
In the end we want an array of:
struct Page: CustomStringConvertible {
let title: NSAttributedString? // That'd be the h1 tag content
let content: NSAttributedString?
var description: String {
return "Title: \(title?.string ?? "") - content: \(content?.string ?? "")"
}
}
Initial sample:
let htmlString = "<b>Title 1</b> Text for part one.\n <b>Title 2</b> Text for part two<b>Title 3</b>Text for part three"
let attributedString = try! NSAttributedString(data: Data(htmlString.utf8),
options: [.documentType : NSAttributedString.DocumentType.html],
documentAttributes: nil)
With:
let headerAttributes: [NSAttributedString.Key: Any] = [.font: UIFont.boldSystemFont(ofSize: 12)]
print("headerAttributes: \(headerAttributes)")
func headerOneAttributes(_ headerAttributes: [NSAttributedString.Key: Any], matches attributes: [NSAttributedString.Key: Any]?) -> Bool {
guard let attributes = attributes else { return false }
guard let attributesFont = attributes[.font] as? NSFont, let headerFont = headerAttributes[.font] as? NSFont else {
return false
}
// The fonts are not strictly equal, so compare symbolic traits instead. This needs more work, e.g. also checking other attributes and the font size.
// Do your own check here.
return attributesFont.fontDescriptor.symbolicTraits == NSFontDescriptor.SymbolicTraits(rawValue: 268435458)
}
We can iterate over the attributes to get all the header ranges:
var headerRanges: [NSRange] = []
attributedString.enumerateAttributes(in: NSRange(location: 0, length: attributedString.length), options: []) { attributes, range, stop in
if headerOneAttributes(headerAttributes, matches: attributes) {
headerRanges.append(range)
}
}
With an iteration on the ranges:
var pages: [Page] = []
guard !headerRanges.isEmpty else { return }
// If the first title doesn't start at the beginning, the text before it is content with no title
if let first = headerRanges.first, first.location > 0 {
pages.append(Page(title: nil, content: attributedString.attributedSubstring(from: NSRange(location: 0, length: first.location))))
}
// Then we iterate
for (anIndex, aRange) in headerRanges.enumerated() {
print(pages)
let title = attributedString.attributedSubstring(from: aRange)
let subtext: NSAttributedString?
// If there is a "nextRange", then we get the end of subtext from it
if anIndex + 1 <= headerRanges.count - 1 {
let next = headerRanges[anIndex + 1]
let location = aRange.location + aRange.length
let length = next.location - location
subtext = attributedString.attributedSubstring(from: NSRange(location: location, length: length))
} else {
//There is no next => Until the end
let location = aRange.location + aRange.length
let length = attributedString.length - location
subtext = attributedString.attributedSubstring(from: NSRange(location: location, length: length))
}
pages.append(Page(title:title, content: subtext))
}
print(pages)
PS: UIFont and NSFont are roughly interchangeable here. I tested in a macOS app rather than an iOS one, which is why the matching code uses NSFont.
Okay, so Larme put me on the right track for what I was looking for. Posting the code in the hope that it helps someone else. I've tested this on a 77-page document and it worked. I should have noted in the question that I am working on macOS.
func parsePDF(_ strPDFContent:NSMutableAttributedString) -> Array<Dictionary<String, Any>> {
//some initial setup
let strNSPDF = strPDFContent.string as NSString
var arrDocSet:Array<Dictionary<String, Any>> = []
//get all the page headers
var arrRanges = [NSRange]()
strPDFContent.enumerateAttribute(NSAttributedString.Key.font, in: NSRange(0..<strPDFContent.length), options: .longestEffectiveRangeNotRequired) {
value, range, stop in
if let thisFont = value as? NSFont {
if thisFont.pointSize == 34 {
arrRanges.append(range)
}
}
}
//get the content and store data
for (idx, range) in arrRanges.enumerated() {
//get title
let strTitle = String(strNSPDF.substring(with: range))
var textRange = NSRange(location:0, length:0)
//skip opening junk
if !strTitle.contains("Table of Contents\n") {
if idx < arrRanges.count-1 {
textRange = NSRange(location: range.upperBound, length: arrRanges[idx+1].lowerBound - range.upperBound)
} else if idx == arrRanges.count-1 {
textRange = NSRange(location: range.upperBound, length: strNSPDF.length - range.upperBound)
}
let strContent = String(strNSPDF.substring(with: textRange))
arrDocSet.append(["title":strTitle, "content":strContent, "contentRange":textRange, "titleRange":range])
}
}
print(arrDocSet)
return arrDocSet
}
This will output:
["titleRange": {10001, 27}, "title": "Set up Placements with AST\n", "content": "This page contains a sample web page showing how Xandr\'s seller tag (AST) functions can be implemented in the header and body of a sample client page.\nSee AST API Reference for more details on using ...
...
ready.\nExample\n$sf.ext.status();\n", "title": " SafeFrame API Reference\n", "contentRange": {16930, 9841}
Let me know if there are places where I could be more efficient.
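One possible tightening, offered as a sketch rather than a tested drop-in: store each section in a small struct instead of a [String: Any] dictionary, and derive every content range from the end of the current header to the start of the next one. The Section type and the sections(in:headerRanges:) function are hypothetical names, and the header ranges are assumed to be sorted and non-overlapping, which is how an attribute enumeration reports them.

```swift
import Foundation

// Hypothetical value type replacing the [String: Any] dictionaries.
struct Section {
    let title: String
    let content: String
    let titleRange: NSRange
    let contentRange: NSRange
}

// Assumes `headerRanges` is sorted and non-overlapping.
func sections(in text: NSString, headerRanges: [NSRange]) -> [Section] {
    var result: [Section] = []
    for (idx, header) in headerRanges.enumerated() {
        let start = header.location + header.length
        // Content runs to the next header, or to the end of the text.
        let end = idx + 1 < headerRanges.count ? headerRanges[idx + 1].location : text.length
        let contentRange = NSRange(location: start, length: end - start)
        result.append(Section(title: text.substring(with: header),
                              content: text.substring(with: contentRange),
                              titleRange: header,
                              contentRange: contentRange))
    }
    return result
}

// Made-up sample text with two "headers" at known ranges.
let sample = NSString(string: "Intro\nbody one\nNext\nbody two")
let headers = [NSRange(location: 0, length: 6), NSRange(location: 15, length: 5)]
let parsed = sections(in: sample, headerRanges: headers)
print(parsed.map { $0.title })
```

Unwanted sections such as the table of contents could then be dropped with a filter on the title after parsing, which keeps the range arithmetic in one place.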
I get a message in my response like "your bill is: 10.00".
I need to show only the number in bold (everything after the ":"). I know I could use a substring, but I don't understand exactly how to split the text and format it correctly.
My old attempt:
self.disclaimerLabel.attributedText = String(format: my).htmlAttributedString(withBaseFont: Font.overlineRegular07.uiFont, boldFont: Font.overlineBold02.uiFont, baseColor: Color.blueyGreyTwo.uiColor, boldColor: Color.blueyGreyTwo.uiColor)
How was my built? If it was built from two parts, set the attributes on each part before joining them.
If you get my as a whole, you can access the substrings with
let parts = my.split(separator: ":")
parts[0] will be "your bill is"
parts[1] will be " 10.00" (including the leading space)
The need to add styling to a single word or phrase is so common that it is worth having on hand a method to help you:
extension NSMutableAttributedString {
func apply(attributes: [NSAttributedString.Key: Any], to targetString: String) {
let nsString = self.string as NSString
let range = nsString.range(of: targetString)
guard range.length != 0 else { return }
self.addAttributes(attributes, range: range)
}
}
So then your only problem is discovering the stretch of text that you want to apply the attributes to. If you don't know that it is "10.00" then, as you've been told, you can find out by splitting the string at the colon-plus-space.
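As a rough sketch of that discovery step, the range after the separator can be computed directly with NSString and passed to addAttributes. The helper name rangeAfterSeparator(_:in:) and the "emphasis" attribute key are made up for illustration; in an app the attributes would be something like a bold font.

```swift
import Foundation

// Hypothetical helper: the NSRange of everything after the first separator,
// or nil when the separator is missing or nothing follows it.
func rangeAfterSeparator(_ separator: String, in text: String) -> NSRange? {
    let nsText = NSString(string: text)
    let sepRange = nsText.range(of: separator)
    guard sepRange.location != NSNotFound else { return nil }
    let start = sepRange.location + sepRange.length
    guard start < nsText.length else { return nil }
    return NSRange(location: start, length: nsText.length - start)
}

let message = "your bill is: 10.00"
let styled = NSMutableAttributedString(string: message)
// Placeholder attribute so the sketch runs with Foundation alone;
// in an app this would be [.font: <a bold font>].
let emphasis: [NSAttributedString.Key: Any] = [NSAttributedString.Key(rawValue: "emphasis"): true]
if let valueRange = rangeAfterSeparator(": ", in: message) {
    styled.addAttributes(emphasis, range: valueRange)
}
```

Working with a range instead of a target string avoids styling the wrong occurrence when the same text appears twice in the message.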
You can split your string at the ":" character and then set the text attributes on each part, like this:
var str = "your bill is: 10.00"
var splitArray = str.components(separatedBy: ":")
let normalText = NSMutableAttributedString(string: splitArray[0] + ":")
let boldText = splitArray[1]
let boldTextAtr = NSMutableAttributedString(string: boldText, attributes: [NSAttributedString.Key.font: UIFont.boldSystemFont(ofSize: 16.0) ])
normalText.append(boldTextAtr)
let labell = UILabel()
labell.attributedText = normalText
labell.attributedText will then display exactly what you want.
I am in the process of writing code to display mentions within an NSAttributedString, which need to link out to a user profile. The format of the mentions is #username[userid], which would need to be displayed as simply #username, which is tappable.
I have the code working so far that the username becomes clickable, but I now need to remove the [userid] part, which of course modifies the length of the string so that ranges don't match anymore, etc. Not sure how I can solve this.
import Foundation
import UIKit
let comment = "Hello #kevin[1], #john and #andrew[2]!"
let wholeRange = NSRange(comment.startIndex..<comment.endIndex, in: comment)
let regex = try NSRegularExpression(pattern: #"(#[\w.#\-]+)\[(\d+)\]"#) // hyphen escaped: ".-#" would be read as an invalid character range
let attributedString = NSMutableAttributedString(string: comment)
regex.enumerateMatches(in: comment, options: [], range: wholeRange) { match, _, _ in
guard let match = match else {
return
}
let userIdRange = Range(match.range(at: 2), in: comment)!
let userId = comment[userIdRange]
let usernameRange = match.range(at: 1)
attributedString.addAttribute(NSAttributedString.Key.link, value: URL(string: "test://profile/\(userId)")!, range: usernameRange)
}
print(attributedString)
The result right now can be represented like this, when printed:
Hello {
}#kevin{
NSLink = "test://profile/1";
}[1], #john and {
}#andrew{
NSLink = "test://profile/2";
}[2]!{
}
So #kevin and #andrew are links, #john is not (which is expected!), but the user ids are still visible. Surely this is a problem that has been solved before but I can't find any examples, not sure what keywords to search for. There are plenty of questions about detecting usernames/mentions in strings, and even more about making links in NSAttributedString, but that's not the problem I am trying to solve.
How would I turn the #username[userid] mentions into clickable #username links, so that the [userid] part is hidden?
You just need to get all the matching ranges, iterate them in reverse order, add the link to it and then replace the whole range with the name. Something like:
let comment = "Hello #kevin[1], #john and #andrew[2]!"
let attributedString = NSMutableAttributedString(string: comment)
let regex = try NSRegularExpression(pattern: #"(#[\w.#\-]+)\[(\d+)\]"#) // hyphen escaped: ".-#" would be read as an invalid character range
var ranges: [(NSRange,NSRange,NSRange)] = []
regex.enumerateMatches(in: comment, range: NSRange(comment.startIndex..., in: comment)) { match, _, _ in
guard let match = match else {
return
}
ranges.append((match.range(at: 0),
match.range(at: 1),
match.range(at: 2)))
}
ranges.reversed().forEach {
let userId = attributedString.attributedSubstring(from: $0.2).string
let username = attributedString.attributedSubstring(from: $0.1).string
attributedString.addAttribute(.link, value: URL(string: "test://profile/\(userId)")!, range: $0.0)
attributedString.replaceCharacters(in: $0.0, with: username)
}
print(attributedString)
This will print
Hello {
}#kevin{
NSLink = "test://profile/1";
}, #john and {
}#andrew{
NSLink = "test://profile/2";
}!{
}
Quickly done:
let comment = "Hello #kevin[1], #john and #andrew[2]!"
let attributedString = NSMutableAttributedString(string: comment)
let wholeRange = NSRange(attributedString.string.startIndex..<attributedString.string.endIndex, in: attributedString.string)
let regex = try NSRegularExpression(pattern: #"(#[\w.#\-]+)\[(\d+)\]"#) // hyphen escaped: ".-#" would be read as an invalid character range
let matches = regex.matches(in: attributedString.string, options: [], range: wholeRange)
matches.reversed().forEach { aResult in
let fullMatchRange = Range(aResult.range(at: 0), in: attributedString.string)! //#kevin[1]
let replacementRange = Range(aResult.range(at: 1), in: attributedString.string)! //#kevin
let userIdRange = Range(aResult.range(at: 2), in: attributedString.string)! // 1
let atAuthor = String(attributedString.string[replacementRange])
attributedString.addAttribute(.link,
value: URL(string: "test://profile/\(attributedString.string[userIdRange])")!,
range: NSRange(fullMatchRange, in: attributedString.string))
attributedString.replaceCharacters(in: NSRange(fullMatchRange, in: attributedString.string),
with: atAuthor)
}
print(attributedString)
Output:
Hello {
}#kevin{
NSLink = "test://profile/1";
}, #john and {
}#andrew{
NSLink = "test://profile/2";
}!{
}
Things to note:
I changed the pattern to make the captures easier. See the sample comments in the forEach().
I used the matches in reverse order, otherwise the ranges won't be accurate anymore!
I kept working with attributedString.string instead of comment in case the two get out of sync.
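The reverse-order idea can also be wrapped in a reusable function. This is only a sketch under a few assumptions: linkifyMentions(in:) is a hypothetical name, the link key is built from the raw value "NSLink" seen in the printed output so that the code runs with Foundation alone (with UIKit or AppKit the .link constant would be used), and the hyphen in the character class is escaped so that it is read literally.

```swift
import Foundation

// Raw value taken from the printed output ("NSLink"); with UIKit or AppKit
// imported, the `.link` constant would be used instead.
let linkKey = NSAttributedString.Key(rawValue: "NSLink")

// Hypothetical helper wrapping the reverse-order replacement.
func linkifyMentions(in comment: String) throws -> NSAttributedString {
    let result = NSMutableAttributedString(string: comment)
    // The hyphen is escaped so it is a literal character, not a range operator.
    let regex = try NSRegularExpression(pattern: #"(#[\w.#\-]+)\[(\d+)\]"#)
    let nsComment = NSString(string: comment)
    let wholeRange = NSRange(location: 0, length: nsComment.length)
    let matches = regex.matches(in: comment, options: [], range: wholeRange)
    // Walk backwards so earlier ranges stay valid while the string shrinks.
    for match in matches.reversed() {
        let username = nsComment.substring(with: match.range(at: 1))
        let userId = nsComment.substring(with: match.range(at: 2))
        guard let url = URL(string: "test://profile/\(userId)") else { continue }
        // Replace "#name[id]" with a linked "#name" in a single step.
        let replacement = NSAttributedString(string: username, attributes: [linkKey: url])
        result.replaceCharacters(in: match.range(at: 0), with: replacement)
    }
    return result
}

let linked = try linkifyMentions(in: "Hello #kevin[1], #john and #andrew[2]!")
print(linked.string) // Hello #kevin, #john and #andrew!
```

Replacing the match with an already-attributed string means the link never has to be applied to a range that is about to change length.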
In my app I have a string like
1A11A1
I want to convert it to
1A1 1A1
There should be a space after 3 characters.
What I tried, with code = 1A11A1:
let end = code.index(code.startIndex, offsetBy: code.count)
let range = code.startIndex..<end
if code.count < 3 {
code = code.replacingOccurrences(of: "(\\d+)", with: "$1", options: .regularExpression, range: range)
}
else {
code = code.replacingOccurrences(of: "(\\d{3})(\\d+)", with: "$1 $2", options: .regularExpression, range: range)
}
If your rule is that you want a "space after 3 characters," take three characters, add a space and then the rest:
let result = "\(code.prefix(3)) \(code.dropFirst(3))"
// "1A1 1A1"
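A small wrapper around the same idea, as a sketch: it leaves short strings alone so that no trailing space is added. The function name insertingSpace(in:after:) is hypothetical.

```swift
// Hypothetical helper; `groupSize` defaults to the three characters from the question.
func insertingSpace(in code: String, after groupSize: Int = 3) -> String {
    // Nothing to separate when the string is no longer than the first group.
    guard groupSize > 0, code.count > groupSize else { return code }
    return "\(code.prefix(groupSize)) \(code.dropFirst(groupSize))"
}

print(insertingSpace(in: "1A11A1")) // 1A1 1A1
print(insertingSpace(in: "1A"))     // 1A
```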
Rob's solution is fine, just for the sake of it, there's also an option to use insert(" ", at: index), something like this:
extension String {
var postalCode: String {
var result = self
// Check that this string is the right length
guard result.count == 6 else {
return result
}
let index = result.index(result.startIndex, offsetBy: 3)
result.insert(" ", at: index)
return result
}
}
Test:
let str: String = "1A11A1"
print(str.postalCode) // prints 1A1 1A1
let str2: String = "1A1 1A1"
print(str2.postalCode) // prints 1A1 1A1 (doesn't change format)
let str3: String = "12345"
print(str3.postalCode) // prints 12345 (doesn't change format)
I imported an NSAttributedString from an RTF file and now I want to split it at another given String. With the attributedSubstring method you get one attributed substring as the result, but I want to split it at every place where the other String appears, so the result should be an Array of NSAttributedStrings.
Example:
var source = NSAttributedString(string: "I*** code*** with*** swift")
var splitter = "***"
var array = //The method I am looking for
The result should be the following Array(with attributedStrings): [I, code, with, swift]
The following extension method maps the string components into [NSAttributedString] using Array.map:
extension NSAttributedString {
func components(separatedBy string: String) -> [NSAttributedString] {
var pos = 0
return self.string.components(separatedBy: string).map {
// NSRange counts UTF-16 code units, so use utf16.count rather than count
let range = NSRange(location: pos, length: $0.utf16.count)
pos += range.length + string.utf16.count
return self.attributedSubstring(from: range)
}
}
}
Usage
let array = NSAttributedString(string: "I*** code*** with*** swift").components(separatedBy: "***")
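A variant, sketched here as a free function with the hypothetical name splitAttributed(_:at:), searches with NSString.range(of:options:range:) so that every offset is already in the UTF-16 units that NSRange expects. Like components(separatedBy:), it keeps the whitespace around each piece.

```swift
import Foundation

// Hypothetical free function; all offsets are UTF-16 based, matching NSRange.
func splitAttributed(_ source: NSAttributedString, at separator: String) -> [NSAttributedString] {
    // An empty separator cannot split anything.
    guard !separator.isEmpty else { return [source] }
    let text = NSString(string: source.string)
    var pieces: [NSAttributedString] = []
    var start = 0
    while start <= text.length {
        let searchRange = NSRange(location: start, length: text.length - start)
        let found = text.range(of: separator, options: [], range: searchRange)
        // No further separator: the rest of the string is the last piece.
        let end = found.location == NSNotFound ? text.length : found.location
        pieces.append(source.attributedSubstring(from: NSRange(location: start, length: end - start)))
        if found.location == NSNotFound { break }
        start = found.location + found.length
    }
    return pieces
}

let source = NSAttributedString(string: "I*** code*** with*** swift")
let parts = splitAttributed(source, at: "***").map { $0.string }
print(parts) // ["I", " code", " with", " swift"]
```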
I am consuming an API that gives me the next page in a response header field called Link. (GitHub does the same, for example, so it isn't unusual; see the GitHub docs.)
The service returns the pagination data in the Link header, which gives me the next page.
With $0.response?.allHeaderFields["Link"] I get: </api/games?page=1&size=20>; rel="next",</api/games?page=25&size=20>; rel="last",</api/games?page=0&size=20>; rel="first".
I have found the following code to read the page, but it is very messy. Has anyone dealt with the same problem, or is there a standard way of handling it? (I also looked for Alamofire support for this but haven't found any.)
// MARK: - Pagination
private func getNextPageFromHeaders(response: NSHTTPURLResponse?) -> String? {
if let linkHeader = response?.allHeaderFields["Link"] as? String {
/* looks like:
<https://api.github.com/user/20267/gists?page=2>; rel="next", <https://api.github.com/user/20267/gists?page=6>; rel="last"
*/
// so split on "," then on ";"
let components = linkHeader.characters.split {$0 == ","}.map { String($0) }
// now we have 2 lines like '<https://api.github.com/user/20267/gists?page=2>; rel="next"'
// So let's get the URL out of there:
for item in components {
// see if it's "next"
let rangeOfNext = item.rangeOfString("rel=\"next\"", options: [])
if rangeOfNext != nil {
let rangeOfPaddedURL = item.rangeOfString("<(.*)>;", options: .RegularExpressionSearch)
if let range = rangeOfPaddedURL {
let nextURL = item.substringWithRange(range)
// strip off the < and >;
let startIndex = nextURL.startIndex.advancedBy(1) //advance as much as you like
let endIndex = nextURL.endIndex.advancedBy(-2)
let urlRange = startIndex..<endIndex
return nextURL.substringWithRange(urlRange)
}
}
}
}
return nil
}
I think the forEach() part could be done better, but here is what I came up with:
let linkHeader = "</api/games?page=1&size=20>; rel=\"next\",</api/games?page=25&size=20>; rel=\"last\",</api/games?page=0&size=20>; rel=\"first\""
let links = linkHeader.components(separatedBy: ",")
var dictionary: [String: String] = [:]
links.forEach({
let components = $0.components(separatedBy:"; ")
let cleanPath = components[0].trimmingCharacters(in: CharacterSet(charactersIn: "<>"))
dictionary[components[1]] = cleanPath
})
if let nextPagePath = dictionary["rel=\"next\""] {
print("nextPagePath: \(nextPagePath)")
}
//Bonus
if let lastPagePath = dictionary["rel=\"last\""] {
print("lastPagePath: \(lastPagePath)")
}
if let firstPagePath = dictionary["rel=\"first\""] {
print("firstPagePath: \(firstPagePath)")
}
Console output:
$> nextPagePath: /api/games?page=1&size=20
$> lastPagePath: /api/games?page=25&size=20
$> firstPagePath: /api/games?page=0&size=20
I used components(separatedBy:) instead of split() to avoid the String() conversion at the end.
I created a Dictionary to hold the values and removed the < and > with a trim.
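An alternative sketch uses a regular expression instead of splitting, so it tolerates optional whitespace around the semicolon and keys the dictionary by the bare rel value ("next" rather than rel="next"). The function name parseLinkHeader(_:) is hypothetical, and the pattern only covers the simple <target>; rel="name" form shown in the question, not every parameter a Link header may carry.

```swift
import Foundation

// Hypothetical helper: maps each rel value ("next", "last", ...) to its target.
func parseLinkHeader(_ header: String) -> [String: String] {
    var links: [String: String] = [:]
    // Matches <target>; rel="name", allowing whitespace around the semicolon.
    let pattern = "<([^>]*)>\\s*;\\s*rel=\"([^\"]*)\""
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return links }
    let nsHeader = NSString(string: header)
    let wholeRange = NSRange(location: 0, length: nsHeader.length)
    for match in regex.matches(in: header, options: [], range: wholeRange) {
        let target = nsHeader.substring(with: match.range(at: 1))
        let rel = nsHeader.substring(with: match.range(at: 2))
        links[rel] = target
    }
    return links
}

let header = "</api/games?page=1&size=20>; rel=\"next\",</api/games?page=25&size=20>; rel=\"last\",</api/games?page=0&size=20>; rel=\"first\""
let links = parseLinkHeader(header)
print(links["next"] ?? "no next page") // /api/games?page=1&size=20
```

Because the targets are captured between < and >, a comma inside a URL does not break the parsing the way splitting on "," would.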