Swift scraping a webpage using regex or alternative - swift

See updates below first.
I am trying to scrape all the moderators for a specified sub-reddit on reddit.
The API only lets you get all the moderators' usernames for a sub-reddit, so initially I fetched these and then made an additional request for each profile to get the avatar URL. This ended up exceeding the API limit.
So instead I want to fetch the source of the following page and paginate through it, collecting the 10 usernames and avatar URLs on each page. This will hit the website with fewer requests. I understand how to do the pagination part, but for now I am trying to understand how to gather the usernames and their associated avatar URLs.
So take the following url:
https://www.reddit.com/r/videos/about/moderators/
So I will pull the entire page source, add each mod's username and avatar URL to a mod object, and then collect those objects in an array.
Would using regex on the string I get back be a good idea?
This is my code so far, any help would be great:
func tester() {
let url = URL(string: "https://www.reddit.com/r/videos/about/moderators")!
let task = URLSession.shared.dataTask(with: url) { data, response, error in
guard let data = data, error == nil else {
print("\(error)")
return
}
let string = String(data: data, encoding: .utf8)
let regexUsernames = try? NSRegularExpression(pattern: "href=\"/user/[a-z0-9]\"", options: .caseInsensitive)
var results = regexUsernames?.matches(in: string as String, options: [], range: NSRange(location: 0, length: string.length))
let regexProfileURLs = try? NSRegularExpression(pattern: "><img src=\"[a-z0-9]\" style", options: .caseInsensitive)
print("\(results)") // This shows as empty array
}
task.resume()
}
I have also tried the following but get this error:
Can't form Range with upperBound < lowerBound
Code:
func tester() {
let url = URL(string: "https://www.reddit.com/r/videos/about/moderators")!
let task = URLSession.shared.dataTask(with: url) { data, response, error in
guard let data = data, error == nil else {
print("data was nil")
return
}
guard let htmlString = String(data: data, encoding: .utf8) else {
print("cannot cast data into string")
return
}
let leftSideOfValue = "href=\"/user/"
let rightSideOfValue = "\""
guard let leftRange = htmlString.range(of: leftSideOfValue) else {
print("cannot find range left")
return
}
guard let rightRange = htmlString.range(of: rightSideOfValue) else {
print("cannot find range right")
return
}
let rangeOfTheValue = leftRange.upperBound..<rightRange.lowerBound
print(htmlString[rangeOfTheValue])
}
UPDATE:
So I have gotten to a point where it will give me the first username, however I am looping and just getting the same one, over and over. What would be the best way to move on each incremental step? Is there a way to do something like let newHTMLString = htmlString.dropFirst(k: ?) to replace the htmlString with a substring that is after the elements we just got?
func tester() {
let url = URL(string: "https://www.reddit.com/r/pics/about/moderators")!
let task = URLSession.shared.dataTask(with: url) { data, response, error in
guard let data = data, error == nil else {
print("data was nil")
return
}
guard let htmlString = String(data: data, encoding: .utf8) else {
print("cannot cast data into string")
return
}
let counter = htmlString.components(separatedBy:"href=\"/user/")
let count = counter.count
for i in 0...count {
let leftSideOfUsernameValue = "href=\"/user/"
let rightSideOfUsernameValue = "\""
let leftSideOfAvatarURLValue = "><img src=\""
let rightSideOfAvatarURLValue = "\">"
guard let leftRange = htmlString.range(of: leftSideOfUsernameValue) else {
print("cannot find range left")
return
}
guard let rightRange = htmlString.range(of: rightSideOfUsernameValue) else {
print("cannot find range right")
return
}
let username = htmlString.slice(from: leftSideOfUsernameValue, to: rightSideOfUsernameValue)
print(username)
guard let avatarURL = htmlString.slice(from: leftSideOfAvatarURLValue, to: rightSideOfAvatarURLValue) else {
print("Error")
return
}
print(avatarURL)
}
}
task.resume()
}
I have also tried:
let endString = String(avatarURL + rightSideOfAvatarURLValue)
let endIndex = htmlString.index(endString.endIndex, offsetBy: 0)
let substringer = htmlString[endIndex...]
htmlString = String(substringer)

You should be able to pull all the names and URLs into two separate arrays with a simple regex, something like:
func tester() {
let url = URL(string: "https://www.reddit.com/r/pics/about/moderators")!
let task = URLSession.shared.dataTask(with: url) { data, response, error in
guard let data = data, error == nil else { return }
guard let htmlString = String(data: data, encoding: .utf8) else { return }
let names = htmlString.matching(regex: "href=\"/user/(.*?)\"")
let imageUrls = htmlString.matching(regex: "><img src=\"(.*?)\" style")
print(names)
print(imageUrls)
}
task.resume()
}
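extension String {
    // Returns the first capture group of every match (or the whole match if the pattern has no groups).
    func matching(regex: String) -> [String] {
        guard let regex = try? NSRegularExpression(pattern: regex, options: []) else { return [] }
        // NSRange counts UTF-16 code units, so build it from the string itself rather than from `count`.
        let fullRange = NSRange(startIndex..., in: self)
        return regex.matches(in: self, options: [], range: fullRange).compactMap { match in
            let nsRange = match.numberOfRanges > 1 ? match.range(at: 1) : match.range
            guard let range = Range(nsRange, in: self) else { return nil }
            return String(self[range])
        }
    }
}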
Or you can create an object for each of the <div class="_1sIhmckJjyRyuR_z7M5kbI"> and then grab the names and urls to use as required.
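For the UPDATE in the question (stepping through the string without regex), here is a minimal sketch. It assumes the usernames and avatar URLs are delimited the way the question describes; the helper name allSlices and the sample markup are made up for illustration, and Reddit's real markup may differ. Passing a range to range(of:) makes each search start where the previous match ended. Looking for the closing quote only after the opening delimiter also avoids the "Can't form Range with upperBound < lowerBound" crash, which occurs because the first quote in the page comes before the end of the first href="/user/ match.

```swift
import Foundation

/// Collects every substring that sits between `left` and `right`, scanning forward
/// so that each search starts where the previous match ended.
func allSlices(in text: String, from left: String, to right: String) -> [String] {
    var results: [String] = []
    var searchStart = text.startIndex

    while let leftRange = text.range(of: left, range: searchStart..<text.endIndex),
          // Search for the closing delimiter only after the opening one.
          let rightRange = text.range(of: right, range: leftRange.upperBound..<text.endIndex) {
        results.append(String(text[leftRange.upperBound..<rightRange.lowerBound]))
        // Continue after the closing delimiter so the same item is not found again.
        searchStart = rightRange.upperBound
    }
    return results
}

// Hypothetical markup, only loosely modelled on the moderators page.
let sampleHTML = """
<a href="/user/alice"><img src="https://example.com/a.png" style="x"></a>
<a href="/user/bob"><img src="https://example.com/b.png" style="x"></a>
"""

let usernames = allSlices(in: sampleHTML, from: "href=\"/user/", to: "\"")
let avatars = allSlices(in: sampleHTML, from: "<img src=\"", to: "\"")
print(usernames)
print(avatars)
```

Because the two arrays are collected independently, pairing usernames[i] with avatars[i] only works if every moderator entry contains both pieces; that is an assumption worth checking against the real page.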

Related

How to download CSV file from the web with download URL Swift

So I need to get the data from a google sheet (which I'm trying to generate the link for, not sure), but I have a dummy link which works. I wrote this code to download the CSV file from the spreadsheet and then parse it into an array. When I print the parsed CSV, I simply get an array with one element which is App/... some location in the system.
func getDataFromSheet(){
let urlString = "https://docs.google.com/spreadsheets/d/e/2PACX-1vT2-wSYyvNPeF7W3HGyw_MPhMXfuQwzBMAx8SjBOWR5PlZpeTZUCmKuPo044wYKLpweZe7ucUVl0yT5/pub?gid=1025030631&single=true&output=csv"
// 2
if let imageUrl = URL(string: urlString) {
// 3
URLSession.shared.downloadTask(with: imageUrl) { (tempFileUrl, response, error) in
// 4
if let imageTempFileUrl = tempFileUrl {
do {
let content = try String(data: imageTempFileUrl.dataRepresentation, encoding: .utf8)
let parsedCSV: [String] = content!.components(
separatedBy: "\n"
).map{ $0.components(separatedBy: ",")[0] }
print(parsedCSV.description)
} catch {
print("Error")
}
}
}.resume()
}
}
Try this example code; it works for me:
func getDataFromSheet() {
let urlString = "https://docs.google.com/spreadsheets/d/e/2PACX-1vT2-wSYyvNPeF7W3HGyw_MPhMXfuQwzBMAx8SjBOWR5PlZpeTZUCmKuPo044wYKLpweZe7ucUVl0yT5/pub?gid=1025030631&single=true&output=csv"
guard let url = URL(string: urlString) else { print("error"); return }
URLSession.shared.dataTask(with: url) { data, response, error in
if let data = data {
if let content = String(data: data, encoding: .utf8) {
let parsedCSV: [String] = content.components(separatedBy: "\n")
// all data
print(parsedCSV, "\n")
// first line
print(parsedCSV.map{ $0.components(separatedBy: ",")}[0], "\n")
// second line
print(parsedCSV.map{ $0.components(separatedBy: ",")}[1], "\n")
}
}
}.resume()
}
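As background on why the original attempt printed a file location: dataRepresentation on a URL gives the bytes of the URL string itself, not the contents of the file it points to. With a download task, the file would have to be read instead, for example with String(contentsOf:encoding:). The parsing step can be sketched separately from the networking; the function name parseCSV and the sample text below are assumptions for illustration, and the splitting is deliberately naive.

```swift
import Foundation

/// Splits CSV text into rows and columns.
/// Naive on purpose: it does not handle quoted fields that contain commas or newlines.
func parseCSV(_ text: String) -> [[String]] {
    return text
        .components(separatedBy: .newlines)       // splits on "\n" and "\r"
        .filter { !$0.isEmpty }                   // drops blank lines and "\r\n" leftovers
        .map { $0.components(separatedBy: ",") }  // one array of fields per row
}

// Made-up sample standing in for the downloaded sheet.
let sampleCSV = "name,score\r\nalice,10\r\nbob,7\r\n"
let rows = parseCSV(sampleCSV)
print(rows)
```

Filtering empty lines also protects against the out-of-range crash that indexing [1] would cause on a one-line response, although a row count check is still advisable before indexing.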

Why is this URLSession.dataTask not working in Swift 5 for macOS

I am trying to make my own DynamicIP updater as a command line tool so I can set it up to run as a launch agent. I thought this would be a pretty simple thing to do, but I am not getting anything when I run this bit of code.
main.swift:
import AppKit
let userName = "yourUserName"
let password = "yourPassword"
let domain = "yourDomainName"
let ftp = "ftp"
let www = "www"
let checkIPURL = URL(string: "https://svc.joker.com/nic/checkip")
let domainUpdateURL = URL(string: "https://svc.joker.com/nic/update?username=\(userName)&password=\(password)&hostname=\(domain)")
let ftpUpdateURL = URL(string: "https://svc.joker.com/nic/update?username=\(userName)&password=\(password)&hostname=\(ftp).\(domain)")
let wwwUpdateURL = URL(string: "https://svc.joker.com/nic/update?username=\(userName)&password=\(password)&hostname=\(www).\(domain)")
var ipAddress = ""
if let url = checkIPURL {
print("1 - \(url)")
var request = URLRequest(url: url)
print("2 - \(request.url!)")
request.httpMethod = "POST"
print("3")
let session = URLSession.shared
print("4")
session.dataTask(with: request) { data, response, error in
print("4.1")
guard error == nil else {
print("Error:", error ?? "")
return
}
print("4.2")
guard (response as? HTTPURLResponse)?
.statusCode == 200 else {
print("down")
return
}
print("4.3")
if let data = data {
if let dataString = String(decoding: data, as: UTF8.self).removeHtmlTags() {
if let startIndex = dataString.lastIndex(of: " ") {
let chars = dataString.distance(from: startIndex, to: dataString.endIndex)-1
ipAddress = String(dataString.suffix(chars))
}
}
print(ipAddress)
} else {
print("No data")
}
print("up - \(response!)")
}.resume()
print("Done.")
}
extension String {
// Credit - Andrew - https://stackoverflow.com/questions/25983558/stripping-out-html-tags-from-a-string
func removeHtmlTags() -> String? {
do {
guard let data = self.data(using: .utf8) else {
return nil
}
let attributed = try NSAttributedString(data: data, options: [.documentType: NSAttributedString.DocumentType.html, .characterEncoding: String.Encoding.utf8.rawValue], documentAttributes: nil)
return attributed.string
} catch {
return nil
}
}
}
Everything outside of the session prints, but nothing inside of it prints (4.x statements).
I deleted the AppSandbox because when I have AppSandbox as a Capability and turn on Outgoing Connections I get a crash with EXC_BAD_INSTRUCTION (code=EXC_I386_INVOP, subcode=0x0).
But even with AppSandbox deleted it does not work.
The strange thing is this works fine in a playground (with a slight modification turning the String extension into a function within the playground), which really makes this a head scratcher for me.
Here's my playground code:
import AppKit
let userName = "yourUserName"
let password = "yourPassword"
let domain = "yourDomainName"
let ftp = "ftp"
let www = "www"
let checkIPURL = URL(string: "https://svc.joker.com/nic/checkip")
let domainUpdateURL = URL(string: "https://svc.joker.com/nic/update?username=\(userName)&password=\(password)&hostname=\(domain)")
let ftpUpdateURL = URL(string: "https://svc.joker.com/nic/update?username=\(userName)&password=\(password)&hostname=\(ftp).\(domain)")
let wwwUpdateURL = URL(string: "https://svc.joker.com/nic/update?username=\(userName)&password=\(password)&hostname=\(www).\(domain)")
var ipAddress = ""
if let url = checkIPURL {
print("1 - \(url)")
var request = URLRequest(url: url)
print("2 - \(request.url!)")
request.httpMethod = "POST"
print("3")
let session = URLSession.shared
print("4")
session.dataTask(with: request) { data, response, error in
print("4.1")
guard error == nil else {
print("Error:", error ?? "")
return
}
print("4.2")
guard (response as? HTTPURLResponse)?
.statusCode == 200 else {
print("down")
return
}
print("4.3")
if let data = data {
//if let dataString = String(decoding: data, as: UTF8.self).removeHtmlTags() {
if let dataString = removeHtmlTags(data: data) {
if let startIndex = dataString.lastIndex(of: " ") {
let chars = dataString.distance(from: startIndex, to: dataString.endIndex)-1
ipAddress = String(dataString.suffix(chars))
}
}
print(ipAddress)
} else {
print("No data")
}
print("up - \(response!)")
}.resume()
print("Done.")
}
func removeHtmlTags(data: Data) -> String? {
do {
let attributed = try NSAttributedString(data: data, options: [.documentType: NSAttributedString.DocumentType.html, .characterEncoding: String.Encoding.utf8.rawValue], documentAttributes: nil)
return attributed.string
} catch {
return nil
}
}
Is there something else I need to do to get this to work within the command line tool app I am trying to build?
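One likely explanation, offered as an assumption rather than a confirmed diagnosis: in a command line tool, main.swift runs to its last line and the process exits before the asynchronous completion handler is called, whereas a playground keeps running. The sketch below shows one way to keep the process alive until the callback has finished. To stay self-contained it uses a stand-in function, fetchIPAddress, in place of the real dataTask call, and the address it returns is made up.

```swift
import Foundation

/// Stand-in for an asynchronous call such as a URLSession data task:
/// the completion closure is invoked later, on a background queue.
func fetchIPAddress(completion: @escaping (String) -> Void) {
    DispatchQueue.global().asyncAfter(deadline: .now() + 0.1) {
        completion("203.0.113.7") // made-up documentation address
    }
}

/// Blocks the calling thread until the asynchronous call has delivered its result.
func currentIPAddressBlocking() -> String {
    let done = DispatchSemaphore(value: 0)
    var result = ""
    fetchIPAddress { address in
        result = address
        done.signal()   // tell the waiting thread that the callback has run
    }
    // Without this wait, a command line tool would reach the end of main.swift
    // and exit before the completion handler gets a chance to run.
    done.wait()
    return result
}

let ipAddress = currentIPAddressBlocking()
print(ipAddress)
```

An alternative with the same effect is to call dispatchMain() or RunLoop.main.run() at the end of main.swift and call exit(0) from the completion handler once the work is done.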

Get image thumbnails from Rumble?

I want to get a thumbnail image for videos from Rumble.
For YouTube I can just use a URL like this:
https://img.youtube.com/vi/f3ZccBBjmQg/0.jpg
I want to do the same for a Rumble video URL:
https://rumble.com/vxhedt-80000-matches-chain-reaction-domino-effect.html?mref=7ju1&mc=4w36m
I checked the rumble webpage of the link you provided. I am not sure if there is a smarter/faster way but here is a way to get the thumbnailUrl from their html code.
func getThumbnailFromRumble() {
let url = URL(string:"https://rumble.com/vxhedt-80000-matches-chain-reaction-domino-effect.html?mref=7ju1&mc=4w36m")!
URLSession.shared.dataTask(with: url){ data, response, error in
guard error == nil else { return }
guard let httpURLResponse = response as? HTTPURLResponse,
httpURLResponse.statusCode == 200,
let data = data else {
return
}
let str = String(data: data, encoding: .utf8) // get the HTML response as a string
let prefix = "\"thumbnailUrl\":"
let suffix = ",\"uploadDate\""
let matches = str?.match("(\(prefix)(...*)\(suffix))")
if let thumbnailUrlMatch = matches?.first?.first {
let thumbnailUrl = thumbnailUrlMatch
.replacingOccurrences(of: prefix, with: "") // remove prefix from urlstring
.replacingOccurrences(of: suffix, with: "") // remove suffix from urlstring
.replacingOccurrences(of: "\"", with: "") // remove the surrounding quotation marks from urlstring
if let url = URL(string: thumbnailUrl),
let data = try? Data(contentsOf: url) {
let uiImage = UIImage(data: data)
// use the uiimage
}
}
}.resume()
}
I use this extension to get the necessary string part from the html response
extension String {
func match(_ regex: String) -> [[String]] {
let nsString = self as NSString
return (try? NSRegularExpression(pattern: regex, options: []))?.matches(in: self, options: [], range: NSMakeRange(0, nsString.length)).map { match in
(0..<match.numberOfRanges).map { match.range(at: $0).location == NSNotFound ? "" : nsString.substring(with: match.range(at: $0)) }
} ?? []
}
}
Feel free to improve the code.
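One possible simplification, sketched under the assumption that the page embeds the thumbnail as "thumbnailUrl":"..." (as the prefix in the answer suggests): a non-greedy capture group returns the URL directly, so the prefix, suffix and quotes do not have to be stripped afterwards. The helper name firstCapture and the sample fragment are hypothetical.

```swift
import Foundation

/// Returns the first capture group of the first match, or nil if nothing matches.
func firstCapture(of pattern: String, in text: String) -> String? {
    guard let regex = try? NSRegularExpression(pattern: pattern, options: []),
          let match = regex.firstMatch(in: text, options: [],
                                       range: NSRange(text.startIndex..., in: text)),
          match.numberOfRanges > 1,
          let range = Range(match.range(at: 1), in: text) else { return nil }
    return String(text[range])
}

// Hypothetical fragment shaped like the JSON the answer searches for.
let samplePage = #"{"name":"Demo","thumbnailUrl":"https://example.com/thumb.jpg","uploadDate":"2022-03-01"}"#

// (.*?) is non-greedy, so the match stops at the first closing quote.
let thumbnail = firstCapture(of: "\"thumbnailUrl\":\"(.*?)\"", in: samplePage)
print(thumbnail ?? "not found")
```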

Core data how to use NSManagedObjectContext in multithreaded code

Okay, I've been going at this for a day and can't seem to figure out what I am doing wrong. This is what my Core Data model looks like.
This is what my code looks like.
class Service {
static let shared = Service()
private let numberOfPokemons = 151
func downloadPokemonsFromServer(completion: @escaping ()->()) {
let urlString = "https://pokeapi.co/api/v2/pokemon?limit=\(numberOfPokemons)"
guard let url = URL(string: urlString) else { return }
var id: Int16 = 0
URLSession.shared.dataTask(with: url) { (data, response, error) in
if let err = error {
print("Unable to fetch pokemon", err)
}
guard let data = data else { return }
let privateContext = NSManagedObjectContext(concurrencyType: .privateQueueConcurrencyType)
privateContext.parent = CoreDataManager.shared.persistentContainer.viewContext
let decoder = JSONDecoder()
decoder.keyDecodingStrategy = .convertFromSnakeCase
do {
let pokemonJSON = try decoder.decode(PokemonsJSON.self, from: data)
pokemonJSON.pokemons.forEach { (JSONPokemon) in
id += 1
let pokemon = Pokemon(context: privateContext)
pokemon.name = JSONPokemon.name
pokemon.url = JSONPokemon.detailUrl
pokemon.id = id
}
try? privateContext.save()
try? privateContext.parent?.save()
completion()
} catch let err {
print("Unable to decode PokemonJSON. Error: ",err)
completion()
}
}.resume()
}
private var detailTracker = 0
func fetchMoreDetails(objectID: NSManagedObjectID) {
guard let pokemon = CoreDataManager.shared.persistentContainer.viewContext.object(with: objectID) as? Pokemon, let urlString = pokemon.url else { return }
print(pokemon.name)
print()
guard let url = URL(string: urlString) else { return }
URLSession.shared.dataTask(with: url) { (data, response, error) in
if let err = error {
print("Unable to get more details for pokemon", err)
}
guard let data = data else { return }
let privateContext = NSManagedObjectContext(concurrencyType: .privateQueueConcurrencyType)
privateContext.parent = CoreDataManager.shared.persistentContainer.viewContext
let decoder = JSONDecoder()
decoder.keyDecodingStrategy = .convertFromSnakeCase
do {
let pokemonDetailJSON = try decoder.decode(PokemonDetailJSON.self, from: data)
pokemonDetailJSON.types.forEach { (nestedType) in
let type = Type(context: privateContext)
type.name = nestedType.type.name
type.addToPokemons(pokemon)
}
try? privateContext.save()
try? privateContext.parent?.save()
} catch let err {
print("Unable to decode pokemon more details", err)
}
}.resume()
}
private var imageTracker = 0
func getPokemonImage(objectID: NSManagedObjectID) {
guard let pokemon = CoreDataManager.shared.persistentContainer.viewContext.object(with: objectID) as? Pokemon else { return }
let id = String(format: "%03d", pokemon.id)
let urlString = "https://assets.pokemon.com/assets/cms2/img/pokedex/full/\(id).png"
print(urlString)
guard let url = URL(string: urlString) else { return }
URLSession.shared.dataTask(with: url) { (data, response, error) in
if let err = error {
print("Unable to load image from session.", err)
}
guard let data = data else { return }
let privateContext = NSManagedObjectContext(concurrencyType: .privateQueueConcurrencyType)
privateContext.parent = CoreDataManager.shared.persistentContainer.viewContext
pokemon.image = data
self.imageTracker += 1
if self.imageTracker == self.numberOfPokemons {
try? privateContext.save()
try? privateContext.parent?.save()
}
}.resume()
}
}
I have three entities: Pokemon, Type and Ability. I am not doing anything with Ability right now, so we can ignore it. The first function, downloadPokemonsFromServer, fetches the first 151 pokemon and saves each one's name and URL. I then use that URL in another URLSession request to fetch more information about that pokemon, which is what fetchMoreDetails does. However, this function crashes my app. I don't know what I am doing wrong here; it crashes when I try to save.
In the third function, getPokemonImage, I start another URLSession request, get the data and save it to the pokemon's image attribute. This one works perfectly fine: it saves to Core Data and doesn't crash my app.
This is how I call it in my ViewController.
@objc func handleRefresh() {
if pokemonController.fetchedObjects?.count == 0 {
Service.shared.downloadPokemonsFromServer {
let pokemons = self.pokemonController.fetchedObjects
pokemons?.forEach({ (pokemon) in
Service.shared.getPokemonImage(objectID: pokemon.objectID)
//If I uncomment the line below it will crash my app.
//Service.shared.fetchMoreDetails(objectID: pokemon.objectID)
})
}
}
tableView.refreshControl?.endRefreshing()
}
Can someone please help me figure out what I am doing wrong? I would really appreciate the help.
You need to make sure that all the Core Data work happens on the private context's own queue. To do so, use:
privateContext.perform {
//Core data work: create new entities, connections, delete, edit and more...
}
This can save you a lot of headaches and trouble down the road.
I think the problem is that you are trying to set a relationship between two objects from different contexts. Your pokemon object is registered with the view context:
guard let pokemon = CoreDataManager.shared.persistentContainer.viewContext.object(with: objectID) as? Pokemon, let urlString = pokemon.url else { return }
whereas your type object is registered with the private context:
let type = Type(context: privateContext)
type.name = nestedType.type.name
so this line will not work:
type.addToPokemons(pokemon)
I would try amending the code to use only the privateContext, something like this:
func fetchMoreDetails(objectID: NSManagedObjectID) {
let privateContext = NSManagedObjectContext(concurrencyType: .privateQueueConcurrencyType)
privateContext.parent = CoreDataManager.shared.persistentContainer.viewContext
guard let pokemon = privateContext.object(with: objectID) as? Pokemon, let urlString = pokemon.url else { return }
print(pokemon.name)
print()
guard let url = URL(string: urlString) else { return }
URLSession.shared.dataTask(with: url) { (data, response, error) in
if let err = error {
print("Unable to get more details for pokemon", err)
}
guard let data = data else { return }
let decoder = JSONDecoder()
decoder.keyDecodingStrategy = .convertFromSnakeCase
do {
let pokemonDetailJSON = try decoder.decode(PokemonDetailJSON.self, from: data)
pokemonDetailJSON.types.forEach { (nestedType) in
let type = Type(context: privateContext)
type.name = nestedType.type.name
type.addToPokemons(pokemon)
}
try? privateContext.save()
try? privateContext.parent?.save()
} catch let err {
print("Unable to decode pokemon more details", err)
}
}.resume()
}

Logical fault in repeat-while loop?

I have a problem with a repeat-while loop. I don't know why it doesn't work in my code. Maybe there is a logical fault.
func getJson(){
repeat{
movieName.isHidden = true
let randomNumber = Int(arc4random_uniform(128188))
let jsonUrlString = "https://api.url/3/movie/" + String(randomNumber) + "?api_key="
//let jsonUrlString = "https://api.url.org/3/movie/564?api_key=key&language=de-DE"
guard let url = URL(string: jsonUrlString) else
{return}
URLSession.shared.dataTask(with: url) { (data, response, err) in
guard let data = data else {return}
let dataString = String(data: data, encoding: .utf8)
print(dataString ?? String())
strTrue = false
Here I check whether the API returned a movie; if the ID does not exist, it returns status code 34 instead, so I look for the text "status_code" in the response.
If the response contains "status_code", I set strTrue to true. The repeat-while condition is then true,
so the loop should run again and look for a new movie ID. But I tested it and it doesn't work. Maybe you guys can help me :)
let stringCode = dataString?.contains("status_code")
if stringCode == true {
print("yes fehler 34")
strTrue = stringCode!
print(strTrue)
}
do {
guard let json = try JSONSerialization.jsonObject(with: data, options: .mutableContainers) as? [String: Any] else {return}
let movie = Movie(json: json)
print(movie.name)
print(movie.genres)
let imageString = "https://image.url.org/t/p/w500"+(movie.imageUrl)
let url2 = URL(string: imageString)
//self.movieImage.downloadedFrom(url: url2!)
self.movieDescriptionLabel = movie.overview
self.movieNameLabel = movie.name
DispatchQueue.main.async {
self.movieName.isHidden = false
self.movieName.text = self.movieNameLabel
self.movieDescription.text = self.movieDescriptionLabel
self.movieImage.downloadedFrom(url: url2!)
}
} catch let jsonError {
print("Error",jsonError)
self.getJson()
}
}.resume()
}while(strTrue == true)
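A possible reading of the problem, stated as an assumption: dataTask only starts the request, so the while condition is evaluated right after resume(), before the completion handler has had a chance to set strTrue. A repeat-while loop therefore cannot react to the response. One alternative is to start the next attempt from inside the completion handler. The sketch below uses a stand-in fetchMovie function instead of the real API, with a made-up rule for which IDs exist, and takes the candidate IDs as a parameter so the example is reproducible (in the question they would come from arc4random_uniform).

```swift
import Foundation

/// Stand-in for the network request: reports asynchronously whether the ID exists.
/// Hypothetical rule for this demo: only multiples of 5 are valid movie IDs.
func fetchMovie(id: Int, completion: @escaping (String?) -> Void) {
    DispatchQueue.global().async {
        completion(id % 5 == 0 ? "Movie #\(id)" : nil)
    }
}

/// Tries the candidate IDs one after another. The next attempt is started from
/// inside the completion handler, once the previous response is actually known.
func firstExistingMovie(among ids: [Int], completion: @escaping (String?) -> Void) {
    guard let id = ids.first else { completion(nil); return }
    fetchMovie(id: id) { title in
        if let title = title {
            completion(title)
        } else {
            firstExistingMovie(among: Array(ids.dropFirst()), completion: completion)
        }
    }
}

/// Demo helper: blocks until the asynchronous search has finished.
func runDemo() -> String? {
    let finished = DispatchSemaphore(value: 0)
    var found: String?
    firstExistingMovie(among: [3, 7, 10, 15]) { title in
        found = title
        finished.signal()
    }
    finished.wait()
    return found
}

let foundTitle = runDemo()
print(foundTitle ?? "nothing found")
```

In an app, the blocking helper would not be needed; the completion handler would instead update the labels on the main queue, as the code in the question already does with DispatchQueue.main.async.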