Parse HTML with SwiftSoup (Swift)?

I'm trying to parse some websites with SwiftSoup, for example an article on Medium. How can I extract the body of the page and load it into another UIViewController, the way Instapaper does?
Here is the code I use to extract the title:
import SwiftSoup
class WebViewController: UIViewController, UIWebViewDelegate {
...
override func viewDidLoad() {
super.viewDidLoad()
let url = URL(string: "https://medium.com/@timjwise/stop-lying-to-yourself-when-you-snub-panhandlers-its-not-for-their-own-good-199d0aa7a513")
let request = URLRequest(url: url!)
webView.loadRequest(request)
guard let myURL = url else {
print("Error: \(String(describing: url)) doesn't seem to be a valid URL")
return
}
let html = try! String(contentsOf: myURL, encoding: .utf8)
do {
let doc: Document = try SwiftSoup.parseBodyFragment(html)
let headerTitle = try doc.title()
print("Header title: \(headerTitle)")
} catch Exception.Error(let type, let message) {
print("Message: \(message)")
} catch {
print("error")
}
}
}
But I had no luck extracting the body of this website or any other. Is there a way to get it to work, perhaps with CSS or JavaScript? (I know nothing about CSS or JavaScript.)

Use the document's body() function; see https://github.com/scinfu/SwiftSoup#parsing-a-body-fragment
Try this:
let html = try! String(contentsOf: myURL, encoding: .utf8)
do {
let doc: Document = try SwiftSoup.parseBodyFragment(html)
let headerTitle = try doc.title()
// my body
let body = doc.body()
// elements to remove, in this case images
let undesiredElements: Elements? = try body?.select("img[src]")
//remove
try undesiredElements?.remove()
print("Header title: \(headerTitle)")
} catch Exception.Error(let type, let message) {
print("Message: \(message)")
} catch {
print("error")
}

Related

Why is this Swift web scraper not working?

I am having trouble scraping an image link from HTML with code I found on YouTube (https://www.youtube.com/watch?v=0jTyKu9DGm8&list=PLYjXqILgs9uPwYlmSrIkNj2O3dwPCcoBK&index=2). The code works perfectly fine in a playground, but something is wrong with my implementation in an Xcode project. (More precisely: I'm not sure how to implement it in my project :) )
When I ran this code on a Playground it pulled the link that I needed exactly as I needed it to be outputted.
import Foundation
let url = URL(string: "https://guide.michelin.com/th/en/bangkok-region/bangkok/restaurant/somtum-khun-kan")
let task = URLSession.shared.dataTask(with: url!) { (data, resp, error) in
guard let data = data else {
print("data was nil")
return
}
guard let htmlString = String(data: data, encoding: String.Encoding.utf8) else {
print("can not cast data into string")
return
}
let leftSideOfTheString = """
image":"
"""
let rightSideOfTheString = """
","@type
"""
guard let leftRange = htmlString.range(of: leftSideOfTheString) else {
print("can not find left range of string")
return
}
guard let rightRange = htmlString.range(of: rightSideOfTheString) else {
print("can not find right range of string")
return
}
let rangeOfValue = leftRange.upperBound..<rightRange.lowerBound
print(htmlString[rangeOfValue])
}
task.resume()
I then put the exact same code into a struct, with the URL as a property and the request in a method, like so:
struct ImageLink {
let url = URL(string: "https://guide.michelin.com/th/en/bangkok-region/bangkok/restaurant/somtum-khun-kan")
func getImageLink() {
let task = URLSession.shared.dataTask(with: url!) { (data, resp, error) in
guard let data = data else {
print("data was nil")
return
}
guard let htmlString = String(data: data, encoding: String.Encoding.utf8) else {
print("can not cast data into string")
return
}
let leftSideOfTheString = """
image":"
"""
let rightSideOfTheString = """
","@type
"""
guard let leftRange = htmlString.range(of: leftSideOfTheString) else {
print("can not find left range of string")
return
}
guard let rightRange = htmlString.range(of: rightSideOfTheString) else {
print("can not find right range of string")
return
}
let rangeOfValue = leftRange.upperBound..<rightRange.lowerBound
print(htmlString[rangeOfValue])
}
task.resume()
}
}
Finally, to check whether the code would give me the right link, I created an instance in a View and added a button that prints the result of getImageLink(), as shown below. You'll see in the commented-out code that I tried displaying the image both by hard-coding its link and by inserting the function call. The former worked as expected; the latter did not.
import SwiftUI
struct WebPictures: View {
var imageLink = ImageLink()
var body: some View {
VStack {
//AsyncImage(url: URL(string: "\(imageLink.getImageLink())"))
//AsyncImage(url: URL(string: "https://axwwgrkdco.cloudimg.io/v7/__gmpics__/c8735576e7d24c09b45a4f5d56f739ba?width=1000"))
Button {
print(imageLink.getImageLink())
} label: {
Text("Print Html")
}
}
}
}
When I click the button to print the link I get the following message:
()
2022-05-16 17:21:30.030264+0800 MichelinRestaurants[35477:925525] [boringssl]
boringssl_metrics_log_metric_block_invoke(153) Failed to log metrics
https://axwwgrkdco.cloudimg.io/v7/__gmpics__/c8735576e7d24c09b45a4f5d56f739ba?width=1000
And if I click the button for a second time only this gets printed:
()
https://axwwgrkdco.cloudimg.io/v7/__gmpics__/c8735576e7d24c09b45a4f5d56f739ba?width=1000
If anybody knows how to help me out here that would be much appreciated!!
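This fails because you do not wait until your function has fetched the link. getImageLink() returns nothing (hence the "()" in the output), and URLSession delivers the result asynchronously, after the function has already returned. One possible solution:
//Make it a class instead of a struct and conform to ObservableObject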
class ImageLink: ObservableObject {
let url = URL(string: "https://guide.michelin.com/th/en/bangkok-region/bangkok/restaurant/somtum-khun-kan")
//Create a published var for your view to get notified when the value changes
@Published var imageUrlString: String = ""
func getImageLink() {
let task = URLSession.shared.dataTask(with: url!) { (data, resp, error) in
guard let data = data else {
print("data was nil")
return
}
guard let htmlString = String(data: data, encoding: String.Encoding.utf8) else {
print("can not cast data into string")
return
}
let leftSideOfTheString = """
image":"
"""
let rightSideOfTheString = """
","@type
"""
guard let leftRange = htmlString.range(of: leftSideOfTheString) else {
print("can not find left range of string")
return
}
guard let rightRange = htmlString.range(of: rightSideOfTheString) else {
print("can not find right range of string")
return
}
let rangeOfValue = leftRange.upperBound..<rightRange.lowerBound
print(htmlString[rangeOfValue])
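//Assign the scraped link to the published var on the main thread, converting the Substring to a String
DispatchQueue.main.async {
self.imageUrlString = String(htmlString[rangeOfValue])
}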
}
task.resume()
}
}
And the view:
struct WebPictures: View {
//Observe changes from your imagelink class
@StateObject var imageLink = ImageLink()
var body: some View {
VStack {
AsyncImage(url: URL(string: imageLink.imageUrlString)) // assign imageurl to asyncimage
//AsyncImage(url: URL(string: "https://axwwgrkdco.cloudimg.io/v7/__gmpics__/c8735576e7d24c09b45a4f5d56f739ba?width=1000"))
Button {
imageLink.getImageLink()
} label: {
Text("Print Html")
}
}
}
}
Update:
In order to get the link when the view appears call it this way:
VStack {
AsyncImage(url: URL(string: imageLink.imageUrlString))
}
.onAppear{
if imageLink.imageUrlString.isEmpty{
imageLink.getImageLink()
}
}
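The string-slicing step used in the scraper above can be isolated into a small pure function, which makes it easier to test outside the network callback. This is only a sketch: the name extractValue(from:after:before:) and the sample input are made up for illustration. Unlike the code above, it searches for the closing marker only after the opening marker, so an earlier occurrence of the closing marker cannot produce an invalid range.

```swift
import Foundation

/// Returns the text between the first `start` marker and the next `end`
/// marker that follows it, or nil if either marker is missing.
/// (Hypothetical helper, not part of the original code.)
func extractValue(from text: String, after start: String, before end: String) -> String? {
    guard let startRange = text.range(of: start) else { return nil }
    // Search for the closing marker only in the text after the opening marker.
    let remainder = startRange.upperBound..<text.endIndex
    guard let endRange = text.range(of: end, range: remainder) else { return nil }
    return String(text[startRange.upperBound..<endRange.lowerBound])
}

// Made-up sample input, not the real page source.
let sample = #"{"name":"Demo","image":"https://example.com/a.jpg","@type":"Restaurant"}"#
let link = extractValue(from: sample, after: #"image":""#, before: #"","@type"#)
print(link ?? "not found")
```

Inside the data task closure, the result of such a function could then be assigned to imageUrlString on the main queue.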

SwiftSoup parsing with evaluateJavaScript in a web view

I want to get a video URL from this HTML:
<div class="_53mw" data-store='{"videoID":"607125377233758","playerFormat":"inline","playerOrigin":"permalink","external_log_id":null,"external_log_type":null,"rootID":607125377233758,"playerSuborigin":"misc","useOzLive":false,"playbackIsLiveStreaming":false,"canUseOffline":null,"playOnClick":true,"videoDebuggerEnabled":false,"videoViewabilityLoggingEnabled":false,"videoViewabilityLoggingPollingRate":-1,"videoScrollUseLowThrottleRate":true,"playInFullScreen":false,"type":"video","src":"https:\/\/video.fdel10-1.fna.fbcdn.net\/v\/t42.1790-2\/271574467_153621553687266_1119427332980623121_n.mp4?_nc_cat=106&ccb=1-5&_nc_sid=985c63&efg=eyJ2ZW5jb2RlX3RhZyI6InN2ZV9zZCJ9&_nc_ohc=HuufVpgJRnEAX8_AEix&_nc_rml=0&_nc_ht=video.fdel10-1.fna&oh=00_AT8eUqTIMXRHidmafowZmL7-o4k2JG0FqA4QbFKNINiQ8Q&oe=61DD4DFB","width":414,"height":621,"trackingNodes":"FH-R","downloadResources":null,"subtitlesSrc":null,"spherical":false,"sphericalParams":null,"defaultQuality":null,"availableQualities":null,"playStartSec":null,"playEndSec":null,"playMuted":null,"disableVideoControls":false,"loop":true,"numOfLoops":13,"shouldPlayInline":true,"dashManifest":null,"isAdsPreview":false,"iframeEmbedReferrer":null,"adClientToken":null,"audioOnlyVideoSrc":null,"audioOnlyEnabled":false,"permalinkShareID":null,"feedPosition":null,"chainDepth":null,"videoURL":"https:\/\/www.facebook.com\/100069898026392\/videos\/607125377233758\/","disableLogging":false}' data-sigil="inlineVideo">
I am doing this:
do {
let doc: Document = try SwiftSoup.parseBodyFragment(html)
let headerTitle = try doc.title()
// my body
let body = doc.body()
// elements to remove, in this case images
let undesiredElements: Elements? = try body?.select("a")
//remove
try! undesiredElements?.remove()
// print("Header title: \(headerTitle)")
print("Header body: \(body)")
}
How can I do this with SwiftSoup?
Try this code:
do {
let doc: Document = try SwiftSoup.parse(html)
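// First element with the class "_53mw"
let element = try doc.getElementsByClass("_53mw").first()
// data-store holds a JSON string; convertToDictionary(text:) is a helper
// (not shown here) that parses such a string into a [String: Any] dictionary
if let data = try element?.attr("data-store"),
let json = convertToDictionary(text: data),
let url = json["src"] as? String {
print(url)
}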
} catch Exception.Error(let type, let message) {
print("Message: \(message)")
} catch {
print("error")
}
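The answer above calls convertToDictionary(text:) without showing it. One possible implementation, assuming the helper is simply meant to parse a JSON object string with Foundation's JSONSerialization, might look like the sketch below. The sample string is a shortened, made-up stand-in for the data-store attribute value.

```swift
import Foundation

/// Hypothetical helper: parses a JSON object string into a dictionary.
/// Returns nil if the text is not valid JSON or is not a JSON object.
func convertToDictionary(text: String) -> [String: Any]? {
    guard let data = text.data(using: .utf8) else { return nil }
    // JSONSerialization throws on malformed input; map that to nil here.
    let object = try? JSONSerialization.jsonObject(with: data, options: [])
    return object as? [String: Any]
}

// Shortened, made-up stand-in for the data-store attribute value.
let dataStore = #"{"videoID":"1","src":"https:\/\/example.com\/video.mp4","width":414}"#
let videoSrc = convertToDictionary(text: dataStore)?["src"] as? String
print(videoSrc ?? "no src found")
```

JSONSerialization unescapes the "\/" sequences, so the resulting string is a plain URL that can be passed to URL(string:).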

View pdf documents

I have a table with the names of PDF documents. Previously there were 3 documents, and each one had its own ViewController. With hundreds of documents, how can I select one from the table and show it in a single view, and when I select another document, show that one in the same view?
So far I have the function below: I pass the document name in each class and display it in a different view. But now I need to display any selected document in one ViewController.
import UIKit
import PDFKit
class pdfViewClass {
class func filePDfFunc(nameFile: String, formatFile:String,
nameView:PDFView)
{
if let path = Bundle.main.path(forResource: nameFile,
ofType:formatFile) {
if let pdfDocument = PDFDocument(url: URL(fileURLWithPath:
path)) {
nameView.autoScales = true
nameView.displayDirection = .vertical
nameView.document = pdfDocument
}
}
}
}
You can use Apple's native UIDocumentInteractionController to view a PDF file.
Create a function like the one below to view the PDF:
func viewPdf(urlPath: String, screenTitle: String) {
// open pdf for booking id
guard let url = URL(string: urlPath) else {
print("Please pass valid url")
return
}
self.downloadPdf(fileURL: url, screenTitle: screenTitle) { localPdf in
if let url = localPdf {
DispatchQueue.main.async {
self.openDocument(atURL: url, screenTitle: screenTitle)
}
}
}
}
Function to download the PDF:
// method to download the PDF file
func downloadPdf(fileURL: URL, screenTitle: String, complition: @escaping ((URL?) -> Void)) {
// Create destination URL
if let documentsUrl: URL = FileManager.default.urls(for: .documentDirectory, in: .userDomainMask).first {
let destinationFileUrl = documentsUrl.appendingPathComponent("\(screenTitle).pdf")
if FileManager.default.fileExists(atPath: destinationFileUrl.path) {
try? FileManager.default.removeItem(at: destinationFileUrl)
}
let sessionConfig = URLSessionConfiguration.default
let session = URLSession(configuration: sessionConfig)
let request = URLRequest(url: fileURL)
let task = session.downloadTask(with: request) { tempLocalUrl, response, error in
if let tempLocalUrl = tempLocalUrl, error == nil {
// Success
if let statusCode = (response as? HTTPURLResponse)?.statusCode {
print("Successfully downloaded. Status code: \(statusCode)")
}
do {
try FileManager.default.copyItem(at: tempLocalUrl, to: destinationFileUrl)
complition(destinationFileUrl)
} catch let writeError {
print("Error creating a file \(destinationFileUrl) : \(writeError)")
// Report the failure so the caller is not left waiting
complition(nil)
}
} else {
print("Error took place while downloading a file. Error description: \(error?.localizedDescription ?? "N/A")")
complition(nil)
}
}
task.resume()
} else {
complition(nil)
}
}
Function to open the document:
func openDocument(atURL url: URL, screenTitle: String) {
self.documentInteractionController.url = url
self.documentInteractionController.name = screenTitle
self.documentInteractionController.delegate = self
self.documentInteractionController.presentPreview(animated: true)
}
When a tableView row is tapped, pass the URL for that row:
viewPdf(urlPath: "http://www.africau.edu/images/default/sample.pdf", screenTitle: "Testing Document")
You can also do this easily by loading the PDF document in a WKWebView.

How to load part of HTML from URL even after clicking event?

I wrote some code that loads all the HTML from a URL and parses it to remove the header, so only the rest of the HTML below the header is shown.
However, after a click event in the body HTML, the screen shows the full HTML from the URL.
Is there a solution for this, or is my approach to the problem wrong?
The code I wrote is below:
import UIKit
import Fuzi
class ViewController: UIViewController {
@IBOutlet weak var webView: UIWebView!
override func viewDidLoad() {
super.viewDidLoad()
let myURLString = "http://yahoo.com"
var myHTMLString = ""
guard let myURL = URL(string: myURLString) else {
print("Error: \(myURLString) doesn't seem to be a valid URL")
return
}
do {
myHTMLString = try String(contentsOf: myURL, encoding: .utf8)
} catch let error {
print("Error: \(error)")
}
do {
// if encoding is omitted, it defaults to NSUTF8StringEncoding
let doc = try! HTMLDocument(string: myHTMLString, encoding: String.Encoding.utf8)
let fullHtml:String = (doc.firstChild(xpath: "//*")?.rawXML)!
if let header = doc.firstChild(xpath: "//body/div/header") {
let headerString:String = header.rawXML
let withoutHeader = fullHtml.replacingOccurrences(of: headerString, with: "")
webView.loadHTMLString(withoutHeader as String, baseURL: nil)
}
} catch let error{
print(error)
}
}
}
Try this, it works with UIWebViewNavigationType.linkClicked. You can modify it to use with other UIWebViewNavigationType.
import UIKit
import Fuzi
class ViewController: UIViewController, UIWebViewDelegate {
@IBOutlet weak var webView: UIWebView!
override func viewDidLoad() {
super.viewDidLoad()
webView.delegate = self
let myURLString = "http://yahoo.com"
var myHTMLString = ""
guard let myURL = URL(string: myURLString) else {
print("Error: \(myURLString) doesn't seem to be a valid URL")
return
}
do {
myHTMLString = try String(contentsOf: myURL, encoding: .utf8)
} catch let error {
print("Error: \(error)")
}
do {
// if encoding is omitted, it defaults to NSUTF8StringEncoding
let doc = try! HTMLDocument(string: myHTMLString, encoding: String.Encoding.utf8)
let fullHtml:String = (doc.firstChild(xpath: "//*")?.rawXML)!
if let header = doc.firstChild(xpath: "//body/div/header") {
let headerString:String = header.rawXML
let withoutHeader = fullHtml.replacingOccurrences(of: headerString, with: "")
webView.loadHTMLString(withoutHeader as String, baseURL: nil)
}
} catch let error{
print(error)
}
}
func webView(_ webView: UIWebView, shouldStartLoadWith request: URLRequest, navigationType: UIWebViewNavigationType) -> Bool {
if navigationType == .linkClicked
{
// Use the request passed to the delegate; webView.request may still describe the previous page
guard let myURL = request.url else { return true }
webView.stopLoading()
var myHTMLString = ""
do {
myHTMLString = try String(contentsOf: myURL, encoding: .utf8)
} catch let error {
print("Error: \(error)")
}
do {
// if encoding is omitted, it defaults to NSUTF8StringEncoding
let doc = try! HTMLDocument(string: myHTMLString, encoding: String.Encoding.utf8)
let fullHtml:String = (doc.firstChild(xpath: "//*")?.rawXML)!
if let header = doc.firstChild(xpath: "//body/div/header") {
let headerString:String = header.rawXML
let withoutHeader = fullHtml.replacingOccurrences(of: headerString, with: "")
webView.loadHTMLString(withoutHeader as String, baseURL: nil)
}
} catch let error{
print(error)
}
}
return true
}
}
I found an answer!
Thanks to @Sherman for the hint that solved this problem.
I had to replace the line of code below
if navigationType == .linkClicked
to
if navigationType == UIWebViewNavigationType.linkClicked
Then it works!

URL request using Swift

I can access the moviedb "dictionary", for example: https://www.themoviedb.org/search/remote/multi?query=exterminador%20do%20futuro&language=en
How can I get only the film's name and poster from this page into my Swift project?
Here is an answer :)
import UIKit
class ViewController: UIViewController {
override func viewDidLoad() {
super.viewDidLoad()
reload()
}
private func reload() {
let requestUrl = "https://www.themoviedb.org/search/remote/multi?query=exterminador%20do%20futuro&language=en"
let session = URLSession(configuration: .default)
let request = URLRequest(url: URL(string: requestUrl)!)
let task = session.dataTask(with: request) { data, response, error in
if let error = error {
print("###### error ###### \(error)")
return
}
// The endpoint is expected to return a JSON array of dictionaries
guard let data = data,
let json = (try? JSONSerialization.jsonObject(with: data, options: .allowFragments)) as? [[String: Any]] else {
return
}
for movie in json {
// Optional casts so a missing or null field does not crash
let name = movie["name"] as? String ?? ""
let posterPath = movie["poster_path"] as? String ?? ""
print(name) // "Terminator Genisys"
print(posterPath) // "/5JU9ytZJyR3zmClGmVm9q4Geqbd.jpg"
}
}
task.resume()
}
}
You need to include your API key with the request. I'd try something like this first to see whether it works. If it does, you can then pass the API key in a different, more secure way. I wouldn't bother, as this API does not expose much sensitive functionality.
let query = "Terminator+second"
let url = URL(string: "http://api.themoviedb.org/3/search/keyword?api_key=YOURAPIKEY&query=\(query)&language=en")!
var request = URLRequest(url: url)
request.addValue("application/json", forHTTPHeaderField: "Accept")
let task = URLSession.shared.dataTask(with: request) { data, response, error in
if let response = response, let data = data {
print(response)
// Either print the raw response body...
print(String(data: data, encoding: .utf8) ?? "")
// ...or parse it as a JSON dictionary
if let dict = (try? JSONSerialization.jsonObject(with: data)) as? [String: Any] {
print(dict)
} else {
print("Could not read JSON dictionary")
}
} else {
print(error ?? "Unknown error")
}
}
task.resume()
The response you'll get will have the full list of properties. You need the poster_path and title (or original_title) property of the returned item.
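Picking out just those two properties can also be done with Codable instead of casting dictionary values by hand. The sketch below is an illustration under assumptions: it assumes the response wraps its items in a top-level "results" array, the type names MovieSearchResponse and Movie are made up, and the sample JSON is invented rather than real API output.

```swift
import Foundation

// Assumed response shape: {"results": [{"title": ..., "poster_path": ...}]}
struct MovieSearchResponse: Decodable {
    let results: [Movie]
}

struct Movie: Decodable {
    // Both fields are optional because an item may omit them or send null.
    let title: String?
    let posterPath: String?

    enum CodingKeys: String, CodingKey {
        case title
        case posterPath = "poster_path"
    }
}

/// Decodes the list of movies from a raw response body.
func decodeMovies(from data: Data) throws -> [Movie] {
    try JSONDecoder().decode(MovieSearchResponse.self, from: data).results
}

// Made-up sample payload, not real API output.
let sampleJSON = #"{"results":[{"title":"Example Movie","poster_path":"/abc123.jpg"},{"title":"No Poster","poster_path":null}]}"#
let movies = try decodeMovies(from: Data(sampleJSON.utf8))
for movie in movies {
    print(movie.title ?? "(untitled)", movie.posterPath ?? "(no poster)")
}
```

In the data task callback, decodeMovies(from:) would be called with the received data inside a do/catch block.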