Getting all tag elements in an array with SwiftSoup - swift

I worked on a project in Python using BeautifulSoup to parse an HTML document and add ruby and rt tags to each string. Recently I've been working on a similar project for a personal iOS app. I found SwiftSoup, which is similar, but I ran into a problem parsing a tag that I was able to handle easily with BeautifulSoup. In BeautifulSoup I am able to get a tag like the one below:
<p id="p6" data-pid="6" data-rel-pid="[41]" class="p6">
<span class="parNum" data-pnum="1"></span>
This is a(<span id="citationsource2"></span><a epub:type="noteref" href="#citation2">link</a>)to some website。
</p>
Using .contents from BS4, I am able to get the tag's children into a list like this:
['\n', <span class="parNum" data-pnum="1"></span>, '\n This is a(', <span id="citationsource2"></span>, <a epub:type="noteref" href="#citation2">link</a>, ')to some website。\n ']
Afterwards I go through the list, check whether each item is a text node or a child tag that has text, and wrap the words in ruby tags. The result is this:
<p id="p6" data-pid="6" data-rel-pid="[41]" class="p6">
<span class="parNum" data-pnum="1"></span>
<ruby>This<rt>1</rt></ruby><ruby>is<rt>2</rt></ruby> <ruby>a<rt>3</rt></ruby>(<span id="citationsource2"></span><a epub:type="noteref" href="#citation2"><ruby>link<rt>4</rt></ruby></a>)<ruby>to<rt>5</rt></ruby> <ruby>some<rt>6</rt></ruby> <ruby>website<rt>7</rt></ruby>。
</p>
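The wrapping step described above can be sketched in Python roughly as follows. This is only an illustration, not the original script: the helper name wrap_words, the definition of a "word" as a run of Latin letters, and the use of a running counter for the rt text are assumptions inferred from the sample output.

```python
import re
from itertools import count

# Assumption: a "word" is a run of Latin letters; everything else is left as-is.
WORD = re.compile(r"[A-Za-z]+")

def wrap_words(text, counter):
    """Wrap each word in <ruby>word<rt>n</rt></ruby>, taking n from counter."""
    def repl(match):
        return f"<ruby>{match.group(0)}<rt>{next(counter)}</rt></ruby>"
    return WORD.sub(repl, text)

# The text pieces of the sample paragraph, in document order
# (the middle one is the text inside the <a> tag).
counter = count(1)
parts = ["\n This is a(", "link", ")to some website。\n "]
wrapped = [wrap_words(part, counter) for part in parts]
```

Sharing one counter across all text pieces keeps the numbering continuous, so the word inside the link gets the next number after the text that precedes it.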
With SwiftSoup I parse the document like this, since I could not find an equivalent of the BS4 .contents attribute:
let soup: Document = try! SwiftSoup.parse(html)
let elements: Elements = try! soup.select("p")
for j in try! elements.html() {
    print(j)
    // Doesn't work: prints every single character, not every element
}
The problem is that it treats the whole content of the p tag as one element; it doesn't separate the elements inside the p tag like BS4 does. I looked at the documentation, but I don't see anything about separating the contents of a tag into an array.
This is what I want to achieve with SwiftSoup:
['\n', <span class="parNum" data-pnum="1"></span>, '\n This is a(', <span id="citationsource2"></span>, <a epub:type="noteref" href="#citation2">link</a>, ')to some website。\n ']
But I end up getting everything as one element in the array instead of separate elements:
[<span class="parNum" data-pnum="1"></span>This is a(<span id="citationsource2">
</span> <a epub:type="noteref" href="#citation2">link</a>)to some website.]
Is there any way of achieving this using SwiftSoup, or another Swift HTML parser that could do the same thing?

After looking at the SwiftSoup source files, I was able to find the answer to my question. SwiftSoup has a method called getChildNodes, which returns an array of the child nodes (child elements as well as text) of the specified tag. Hope this helps anyone who has faced a similar problem.
let soup: Document = try! SwiftSoup.parseBodyFragment(html)
let p: Elements = try! soup.select("p")
for j in p {
    print(try! j.getChildNodes())
}

Related

Regex: Capture Groups and Empty Fields (SWIFT 5 | ICU Regex Engine)

I need some help correcting my regex. I have a string of text (a large body of HTML), and I need to pattern match it so that the data nested within <div> tags can be extracted and used.
Let's take an example with a test case of <div id=1>:
<div id=1>UID:1currentPartNumber:63222TRES003H1workcenter:VLCSKDcycleTime:98.8curPartCycleTime:63.66partsMade:233curCycleTimeActual:62.4target:291actual:233downtime:97statusReason:lineStatus:Productionefficiency:80.05plusminus:-260curProdTime:7/16/2019 12:28:01 PM</div>
Note that lineStatus can either have a value or be empty, and the same is true of statusReason.
I am able to come up with a regex that does MOST of the work but I am struggling with cases where values are not present.
Here is my attempt:
(
(<div id=(\d|\d\d)>)
(UID:(\d|\d\d))
(currentPartNumber:(.{1,20}))
(workcenter:(.{1,20}))
(cycleTime:(.{1,6}))
(curPartCycleTime:(.{1,6}))
(partsMade:(.{1,6}))
(CycleTimeActual:(.{1,6}))
(target:(.{1,6}))
(actual:(.{1,6}))
(downtime:(.{1,6}))
((statusReason:((?:.)|(.{1,6}))))
((lineStatus:((?:.)|(.{1,6}))))
(Productionefficiency:(.{1,6}))
(plusminus:(.{1,6}))
(curProdTime:(.{1,30}))
)
Split it up just for readability.
Thanks,
You are very, very close.
If you use:
(
(<div id=\d{1,2}>)
(UID:\d{1,2})
(currentPartNumber:(.{1,20}))
(workcenter:(.{1,20}))
(cycleTime:(.{1,6}))
(curPartCycleTime:(.{1,6}))
(partsMade:(.{1,6}))
(curCycleTimeActual:(.{1,6}))
(target:(.{1,6}))
(actual:(.{1,6}))
(downtime:(.{1,6}))
(statusReason:(.{0,6}))
(lineStatus:(.{0,6}))
(Productionefficiency:(.{1,6}))
(plusminus:(.{1,6}))
(curProdTime:(.{1,30}))
(<\/div>)
)
Then $3\n$4\n$6\n$8\n$10\n$12\n$14\n$16\n$18\n$20\n$22\n$24\n$26\n$28\n$30 will be:
UID:1
currentPartNumber:63222TRES003H1
workcenter:VLCSKD
cycleTime:98.8
curPartCycleTime:63.66
partsMade:233
curCycleTimeActual:62.4
target:291
actual:233
downtime:97
statusReason:
lineStatus:
Productionefficiency:80.05
plusminus:-260
curProdTime:7/16/2019 12:28:01 PM
By using (statusReason:(.{0,6}))(lineStatus:(.{0,6})) you make the value of statusReason and lineStatus truly optional.
I also simplified the start <div> and UID detection.
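As an alternative to fixed-width quantifiers, the same idea can be sketched in Python with lazy groups anchored on the next key, so that an empty value simply matches the empty string. This is a rough sketch under assumptions: the key list below is read off the sample (in particular, it assumes the keys are curCycleTimeActual and efficiency, so that lineStatus holds Production), the keys always appear in this order, and the names KEYS, build_pattern and parse_div are made up for illustration. The question targets Swift's ICU engine, so the named-group syntax may need adjusting there.

```python
import re

SAMPLE = (
    "<div id=1>UID:1currentPartNumber:63222TRES003H1workcenter:VLCSKD"
    "cycleTime:98.8curPartCycleTime:63.66partsMade:233curCycleTimeActual:62.4"
    "target:291actual:233downtime:97statusReason:lineStatus:Production"
    "efficiency:80.05plusminus:-260curProdTime:7/16/2019 12:28:01 PM</div>"
)

# Assumption: keys as read from the sample, always present and in this order.
KEYS = [
    "UID", "currentPartNumber", "workcenter", "cycleTime", "curPartCycleTime",
    "partsMade", "curCycleTimeActual", "target", "actual", "downtime",
    "statusReason", "lineStatus", "efficiency", "plusminus", "curProdTime",
]

def build_pattern(keys):
    # Each value is a lazy group that stops where the next key (or </div>) begins,
    # so a missing value is captured as an empty string.
    body = "".join(f"{re.escape(k)}:(?P<{k}>.*?)" for k in keys)
    return re.compile(r"<div id=(?P<div_id>\d{1,2})>" + body + r"</div>", re.DOTALL)

PATTERN = build_pattern(KEYS)

def parse_div(html):
    """Return a dict of field values for the first matching div, or None."""
    match = PATTERN.search(html)
    return match.groupdict() if match else None

fields = parse_div(SAMPLE)
```

Named groups avoid having to count capture-group numbers, and the lazy quantifiers remove the need to guess a maximum length for each value.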
Try Regex: ((<div id=(\d|\d\d)>)(UID:(\d|\d\d))(currentPartNumber:(.{1,20}))(workcenter:(.{1,20}))(cycleTime:(.{1,6}))(curPartCycleTime:(.{1,6}))(partsMade:(.{1,6}))(CycleTimeActual:(.{1,6}))(target:(.{1,6}))(actual:(.{1,6}))(downtime:(.{1,6}))(statusReason:(.{1,6})?)(lineStatus:(.{1,6})?)(Productionefficiency:(.{1,6}))(plusminus:(.{1,6}))(curProdTime:(.{1,30})))
Warning: You can't Parse HTML with regex

How to compare two integer values with text?

HTML code:
<li class="list-group-item ng-binding ng-scope" ng-repeat="sourceMatch in sourceMatches">
<img ng-src="/static/images/Timesjobs_small.jpg" width="32px" height="32px">
<span class="badge ng-binding">585784</span>
Times Jobs
</li>
I am getting this error.
Expected '52675
Times Jobs' to match '52619 Times Jobs'.
It shows the integer value on the first line and the text on the second line, so I am not able to compare them. How do I fix this?
getText() always returns a string, but there is a line break in it. Try replacing the \r and \n characters with a space before comparing:
var listItemText = $('li.list-group-item').getText()
    .then(function (text) { return text.replace(/[\r\n]+/g, ' '); });
expect(listItemText).toMatch('52619 Times Jobs');

Match string and extract outside element?

Sample:
<div class="luikang">
<p>Lui Kang information:<br><br>
<strong>First game:</strong> Mortal Kombat (1992)<br>
<strong>Created by:</strong> John Tobias<br>
<strong>Orgin:</strong> Earthrealm<br>
<strong>Weapon:</strong> Nunchaku<br>
<strong>Colour:</strong> Red</p>
</div>
I would like to extract Nunchaku
My try so far:
/html/body//div[@class='luikang']/p/strong[contains(., 'Weapon:')]
I am guessing I need to use this too:
[count(preceding-sibling::br) < 1]
Any suggestion?
Try /html/body//div[@class='luikang']/p/strong[contains(., 'Weapon:')]/following-sibling::text()[1].
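For comparison, here is a rough Python sketch of the same lookup using only the standard library. xml.etree.ElementTree supports only a limited XPath subset without the following-sibling axis, so the sketch reads the .tail of the matching strong element instead, which is the text that follows its closing tag. It assumes a well-formed XML version of the sample (the br tags are written as <br/>), and the helper name value_after_label is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the sample rewritten as well-formed XML (self-closing <br/>).
SAMPLE = """<div class="luikang">
<p>Lui Kang information:<br/><br/>
<strong>First game:</strong> Mortal Kombat (1992)<br/>
<strong>Created by:</strong> John Tobias<br/>
<strong>Orgin:</strong> Earthrealm<br/>
<strong>Weapon:</strong> Nunchaku<br/>
<strong>Colour:</strong> Red</p>
</div>"""

def value_after_label(xml_text, label):
    """Return the text that directly follows the <strong> containing label."""
    root = ET.fromstring(xml_text)
    for strong in root.iter("strong"):
        if strong.text and label in strong.text:
            # .tail holds the text between </strong> and the next element.
            return (strong.tail or "").strip()
    return None

weapon = value_after_label(SAMPLE, "Weapon:")
```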

BeautifulSoup: get tag name of element itself, not its children

I have the below (simplified) code, which uses the following source:
<html>
<p>line 1</p>
<div>
<a>line 2</a>
</div>
</html>
soup = BeautifulSoup('<html><p>line 1</p><div><a>line 2</a></div></html>')
ele = soup.find('p').nextSibling
somehow_print_tag_of_ele_here
I want to get the tag of ele, in this case "div". However, I only seem to be able to get the tag of its children. Am I missing something simple? I thought that I could do ele.tag.name, but that raises an exception since tag is None.
# Below correctly prints the div element "<div><a>line 2</a></div>"
print(ele)
# Below prints "None". Printing tag.name raises an exception since tag is None
print(ele.tag)
# Below prints "a", the child of ele
allTags = ele.findAll(True)
for e in allTags:
    print(e.name)
At this point, I am considering doing something along the way of getting the parent of ele, then getting the tags of parent's children and, having counted how many upper siblings ele has, counting down to the correct child tag. That seems ridiculous.
ele is already a tag, so try this:
soup = BeautifulSoup('<html><p>line 1</p><div><a>line 2</a></div></html>')
print(soup.find('p').nextSibling.name)
so in your example it would be just
print(ele.name)
You can access anything inside an element as if accessing a dictionary.
Let's say you have an element like this one.
<input id="__VIEWSTATE3" name="__VIEWSTATE3" type="hidden" value="MwqzeTH4"/>
You can access each property like this
print(elem["id"])
# prints __VIEWSTATE3
print(soup.find('h1',id_='pdp_product_title'))
This does not print anything. Please help me solve this. The element looks like this:
<h1 id="pdp_product_title" class="headline-2 css-zis9ta" data-test="product-title">Nike Air Force 1 Shadow</h1>

Extracting an href's text from an html document

I'm trying to parse this piece of HTML:
<div>
<p>
A few years ago,
I felt like I was stuck in a rut,
so I decided to follow in the footsteps
of the great American philosopher, Morgan Spurlock,
and try something new for 30 days.
</p>
</div>
I want to know how to get the text of a link, such as "A few years ago,".
I can get the text in "<a> text </a>" in general,
but I do not know how to select "A few years ago," specifically, given link tags like these:
<a href="#" class="transcriptLink" onclick="seekVideo(0); return false;">
<a href="#" class="transcriptLink" onclick="seekVideo(2000); return false;">
....................
The links differ only in their onclick="seekVideo(....);" value.
You can use XPath: /div/p/a[1]/text() selects the a element by index, or you can match the @onclick value: /div/p/a[starts-with(@onclick, 'seekVideo(0)')]/text(). Both queries return A few years ago,.
To get the number in the @onclick seekVideo call, you can use this expression:
substring-before(substring-after(@onclick, '('), ')')
For example, to find the a element whose @onclick seekVideo value is 0, you can use this XPath:
/div/p/a[substring-before(substring-after(@onclick, '('), ')') = '0']/text()
or
/div/p/a[number(substring-before(substring-after(@onclick, '('), ')')) = 0]/text()
So both queries return A few years ago,.
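The same selection can be sketched in Python with the standard library, for readers who are not running an XPath engine. The markup below is a hypothetical reconstruction: it assumes each line of the transcript is wrapped in one of the a tags shown in the question, and the helper name link_text_by_seek is made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: each transcript line sits inside an <a> like those in the question.
SAMPLE = """<div><p>
<a href="#" class="transcriptLink" onclick="seekVideo(0); return false;">A few years ago,</a>
<a href="#" class="transcriptLink" onclick="seekVideo(2000); return false;">I felt like I was stuck in a rut,</a>
</p></div>"""

def link_text_by_seek(xml_text, seek_value):
    """Return the text of the <a> whose onclick calls seekVideo(seek_value)."""
    root = ET.fromstring(xml_text)
    for a in root.iter("a"):
        # Mirrors substring-before(substring-after(@onclick, '('), ')').
        match = re.match(r"seekVideo\((\d+)\)", a.get("onclick", ""))
        if match and int(match.group(1)) == seek_value:
            return (a.text or "").strip()
    return None

first_line = link_text_by_seek(SAMPLE, 0)
```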
Use:
string(//div/p/a[starts-with(@onclick, 'seekVideo(0)')])
This expression evaluates to the string-value of the first a element in the document that is a child of a p inside a div and whose onclick attribute starts with the string "seekVideo(0)".