I need some help correcting my regex. I have a large body of HTML as a string, and I need to pattern-match it so that the data nested within <div> tags can be extracted and used.
Let's take a test case with <div id=1>:
<div id=1>UID:1currentPartNumber:63222TRES003H1workcenter:VLCSKDcycleTime:98.8curPartCycleTime:63.66partsMade:233curCycleTimeActual:62.4target:291actual:233downtime:97statusReason:lineStatus:Productionefficiency:80.05plusminus:-260curProdTime:7/16/2019 12:28:01 PM</div>
Note that lineStatus can either have a value or be empty, and the same goes for statusReason.
I am able to come up with a regex that does MOST of the work but I am struggling with cases where values are not present.
Here is my attempt:
(
(<div id=(\d|\d\d)>)
(UID:(\d|\d\d))
(currentPartNumber:(.{1,20}))
(workcenter:(.{1,20}))
(cycleTime:(.{1,6}))
(curPartCycleTime:(.{1,6}))
(partsMade:(.{1,6}))
(CycleTimeActual:(.{1,6}))
(target:(.{1,6}))
(actual:(.{1,6}))
(downtime:(.{1,6}))
((statusReason:((?:.)|(.{1,6}))))
((lineStatus:((?:.)|(.{1,6}))))
(Productionefficiency:(.{1,6}))
(plusminus:(.{1,6}))
(curProdTime:(.{1,30}))
)
I split it across lines just for readability.
Thanks,
You are very, very close.
If you use:
(
(<div id=\d{1,2}>)
(UID:\d{1,2})
(currentPartNumber:(.{1,20}))
(workcenter:(.{1,20}))
(cycleTime:(.{1,6}))
(curPartCycleTime:(.{1,6}))
(partsMade:(.{1,6}))
(CycleTimeActual:(.{1,6}))
(target:(.{1,6}))
(actual:(.{1,6}))
(downtime:(.{1,6}))
(statusReason:(.{0,6}))
(lineStatus:(.{0,6}))
(Productionefficiency:(.{1,6}))
(plusminus:(.{1,6}))
(curProdTime:(.{1,30}))
(<\/div>)
)
Then $3\n$4\n$6\n$8\n$10\n$12\n$14\n$16\n$18\n$20\n$22\n$24\n$26\n$28\n$30 will be:
UID:1
currentPartNumber:63222TRES003H1
workcenter:VLCSKD
cycleTime:98.8
curPartCycleTime:63.66
partsMade:233cur
CycleTimeActual:62.4
target:291
actual:233
downtime:97
statusReason:
lineStatus:
Productionefficiency:80.05
plusminus:-260
curProdTime:7/16/2019 12:28:01 PM
By using (statusReason:(.{0,6}))(lineStatus:(.{0,6})) you make the value of statusReason and lineStatus truly optional.
I also simplified the start <div> and UID detection.
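As a hedged side note, the sample output above shows "cur" leaking into partsMade and "Production" being read as part of the next key. The sketch below is one possible way around that in JavaScript. It assumes the real keys are curCycleTimeActual and efficiency and that "Production" is the lineStatus value; that split is read off the sample and is not confirmed by the question. KEYS, DIV_PATTERN and parseDivs are illustrative names.

```javascript
// Keys in the order they appear in the sample <div>.
// ASSUMPTION: "curCycleTimeActual" and "efficiency" are the real key names,
// and "Production" is the value of lineStatus.
const KEYS = [
  'UID', 'currentPartNumber', 'workcenter', 'cycleTime', 'curPartCycleTime',
  'partsMade', 'curCycleTimeActual', 'target', 'actual', 'downtime',
  'statusReason', 'lineStatus', 'efficiency', 'plusminus', 'curProdTime',
];

// Builds "<div id=N>KEY1:(.*?)KEY2:(.*?)...KEYn:(.*?)</div>".
// Each lazy (.*?) stops at the next key and may match nothing,
// so empty statusReason / lineStatus values are accepted.
const DIV_PATTERN = new RegExp(
  '<div id=(\\d{1,2})>' + KEYS.map((k) => k + ':(.*?)').join('') + '</div>',
  'g'
);

// Returns one plain object per matching <div>.
function parseDivs(html) {
  const rows = [];
  for (const m of html.matchAll(DIV_PATTERN)) {
    const row = { id: m[1] };
    KEYS.forEach((k, i) => { row[k] = m[i + 2]; });
    rows.push(row);
  }
  return rows;
}
```

Calling parseDivs on the sample string should yield a single object whose statusReason is an empty string. Because the values are bounded by the key names rather than by length, the {1,6} style limits are no longer needed.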
Try Regex: ((<div id=(\d|\d\d)>)(UID:(\d|\d\d))(currentPartNumber:(.{1,20}))(workcenter:(.{1,20}))(cycleTime:(.{1,6}))(curPartCycleTime:(.{1,6}))(partsMade:(.{1,6}))(CycleTimeActual:(.{1,6}))(target:(.{1,6}))(actual:(.{1,6}))(downtime:(.{1,6}))(statusReason:(.{1,6})?)(lineStatus:(.{1,6})?)(Productionefficiency:(.{1,6}))(plusminus:(.{1,6}))(curProdTime:(.{1,30})))
Warning: you can't reliably parse HTML with regex.
I worked on a project in Python using BeautifulSoup to parse an HTML document and add ruby and rt tags to each string. Recently I've been working on a similar project for a personal iOS app. I found SwiftSoup, which is similar, but I ran into a problem parsing a tag that I was able to handle easily with BeautifulSoup. In BeautifulSoup I am able to get a tag like the one below:
<p id="p6" data-pid="6" data-rel-pid="[41]" class="p6">
<span class="parNum" data-pnum="1"></span>
This is a(<span id="citationsource2"></span><a epub:type="noteref" href="#citation2">link</a>)to some website。
</p>
Using .contents from BS4, I am able to get the tag's children into an array like this:
['\n', <span class="parNum" data-pnum="1"></span>, '\n This is a(', <span id="citationsource2"></span>, <a epub:type="noteref" href="#citation2">link</a>, ')to some website。\n ']
Afterwards I go through the array, check whether each child tag has text or whether the element itself is a text element, and wrap each word in ruby tags. The result is this:
<p id="p6" data-pid="6" data-rel-pid="[41]" class="p6">
<span class="parNum" data-pnum="1"></span>
<ruby>This<rt>1</rt></ruby><ruby>is<rt>2</rt></ruby> <ruby>a<rt>3</rt></ruby>(<span id="citationsource2"></span><a epub:type="noteref" href="#citation2"><ruby>link<rt>4</rt></ruby></a>)<ruby>to<rt>5</rt></ruby> <ruby>some<rt>6</rt></ruby> <ruby>website<rt>7</rt></ruby>。
</p>
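The walk described above can be sketched in a library-neutral way. The JavaScript below is only an illustration: it does not use the BeautifulSoup or SwiftSoup API, and the node model (plain strings for text nodes, objects with hypothetical tag, open and children fields for elements) is an assumption made for the example.

```javascript
// Hypothetical node model: a text node is a string; an element is
// { tag: 'a', open: '<a href="#x">', children: [...] }.
// Wraps every run of ASCII letters/digits in <ruby>word<rt>n</rt></ruby>,
// numbering words in document order and leaving punctuation untouched.
function annotate(nodes, counter = { n: 0 }) {
  return nodes.map((node) => {
    if (typeof node === 'string') {
      return node.replace(/[A-Za-z0-9]+/g, (word) => {
        counter.n += 1;
        return `<ruby>${word}<rt>${counter.n}</rt></ruby>`;
      });
    }
    // Element: keep its opening tag as-is and recurse into its children,
    // sharing the same counter so numbering continues across tags.
    const inner = annotate(node.children, counter);
    return `${node.open}${inner}</${node.tag}>`;
  }).join('');
}
```

The shared counter object is what keeps the numbering continuous when a word sits inside a child tag such as the link.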
With SwiftSoup I parse the document like this, since it doesn't seem to have a method similar to BS4's .contents:
let soup:Document = try! SwiftSoup.parse(html)
let elements:Elements = try! soup.select("p")
for j in try! elements.html() {
    print(try! j)
    // Doesn't work: prints out every single character, not every element
}
The problem is that it treats the whole content of the p tag as one element; it doesn't separate the elements inside the p tag like BS4 does. I looked at the documentation, but I don't see anything about separating the contents of a tag into an array.
This is what I want to achieve with SwiftSoup:
['\n', <span class="parNum" data-pnum="1"></span>, '\n This is a(', <span id="citationsource2"></span>, <a epub:type="noteref" href="#citation2">link</a>, ')to some website。\n ']
But I end up getting everything as one element in the array instead of separate elements:
[<span class="parNum" data-pnum="1"></span>This is a(<span id="citationsource2">
</span> <a epub:type="noteref" href="#citation2">link</a>)to some website.]
Is there any way of achieving this using SwiftSoup, or another Swift HTML parser that can do the same thing?
After looking at the SwiftSoup files I was able to find the answer to my question. SwiftSoup has a method called getChildNodes which allows you to get all the content of the specified tag. It returns an array of the content of the tag. Hope this helps anyone who has also faced a similar problem.
let soup:Document = try! SwiftSoup.parseBodyFragment(html)
let p : Elements = try! soup.select("p")
for j in p {
    print(try! j.getChildNodes())
}
I am using Materialize Autocomplete and I wonder if there is a way to use text instead of the "optional image". Why? If the text is not unique, the user will not know which entry to choose. The options might be names, and there might be two people with the same name and surname.
When typing my question I found out that I cannot use duplicate entries in data
data: {
"Radek": myself,
"Radek": some other Radek,
"Radoslav": 'http://placehold.it/250x250'
},
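A small JavaScript sketch of why the duplicate entries disappear, plus one possible workaround. The workaround is an assumption for illustration: toAutocompleteData is a hypothetical helper, and the detail field used to disambiguate names is made up. It relies on null being accepted as the value for an entry without an image.

```javascript
// In an object literal, a later duplicate key silently overwrites the earlier one.
const data = {
  'Radek': 'first Radek',
  'Radek': 'second Radek',
  'Radoslav': 'http://placehold.it/250x250',
};
const keys = Object.keys(data); // only two keys survive

// Hypothetical helper: fold a distinguishing detail into the key itself
// so that two people with the same name produce two distinct entries.
function toAutocompleteData(people) {
  const result = {};
  for (const person of people) {
    result[`${person.name} (${person.detail})`] = null; // null = no image
  }
  return result;
}
```

With this approach the extra text ends up in the label the user sees and selects, so the selected value needs to be mapped back to the underlying record afterwards.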
When you look at the source you find the following lines relevant for the images:
autocompleteOption.append('<img src="'+ data[key] +'" class="right circle"><span>'+ key +'</span>');
and
var img = $el.find('img');
$el.html("<span>" + beforeMatch + "<span class='highlight'>" + matchText + "</span>" + afterMatch + "</span>");
$el.prepend(img);
This prevents us from using the image-attribute for anything other than images.
We can insert something like this to trick Materialize
"Radoslav": '" style="display: none;">Inserted Text <br><span style="display: none'
but it will just be converted to text, resulting in an option equivalent to
"Inserted Text Radoslav": none
So there is sadly nothing to be gained here.
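To make the trick concrete, here is a minimal JavaScript sketch that reproduces only the string concatenation from the first source line quoted above. renderOption is an illustrative name, not a Materialize function, and the sketch does not model the later highlight step that rewrites the element's HTML.

```javascript
// Same concatenation as the quoted Materialize source line.
function renderOption(key, value) {
  return '<img src="' + value + '" class="right circle"><span>' + key + '</span>';
}

// The injected value closes the src attribute early, hides the <img>,
// emits the extra text, and opens a hidden <span> that swallows the
// leftover class attribute.
const injected = '" style="display: none;">Inserted Text <br><span style="display: none';
const html = renderOption('Radoslav', injected);
```

Here html comes out as an img tag with an empty src followed by the inserted text, which shows that the value is never treated as anything other than raw markup around the key.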
If you are looking to insert a linebreak, however, you can use this answer on How to force Materialize autocomplete text to flow to new line?
Suppose I have a string like this:
string =
"<style>li { list-style: none; }</style>
<li><b>Source:</b> $Source</li>
<li><b>Security:</b> $Security</li>
"
I still get bullet points.
I can't use an inline style like this, because the string is already wrapped in double quotes due to variables like $Source:
<li style="list-style: none;">
I get "Unexpected token 'list-style:' in expression or statement." because I can't use double quotes inside the string,
so my only choice is using it like this:
<style>li { list-style: none; }</style>
but it doesn't get applied. Why is that?
UPDATE: To clarify, I want to use the list markup WITHOUT bullet points showing up.
If you set string variables with single quotes, those values can contain literal double quotes. The problem with this is that everything within the single quotes will be treated as a literal string, which means $Source and $Security would not get expanded. If you are going to use double quotes and variables within the same string, I suggest escaping the inner double quotes with a backtick.
$string =
"<ul style=`"list-style-type:none; padding-left:0`">
<li><b>Source:</b> $Source</li>
<li><b>Security:</b> $Security</li>
</ul>
"
The list-style-type:none property sets the list item marker to none. padding-left:0 removes the left indentation of the list.
To test, just output the contents to a file (s.html) and open the file from a browser.
$string | Set-Content s.html
If you are testing this in an email client like Outlook, results may vary. Outlook does not support list-style-type: none.
If you just need a list with bolded property names that reads well in an email client, you can simplify this. Use Send-MailMessage with the -BodyAsHtml switch and the following string as the body.
$string = "
<b>Sources:</b> $source<br>
<b>Security:</b> $security<br>
"
Is there an analog of the ":contains()" selector (jQuery, jsoup) in Mojolicious?
The selector ":contains('text') ~ td + td" works in jQuery and jsoup. How can I convert it to a Mojolicious selector?
http://api.jquery.com/contains-selector/
Description: Select all elements that contain the specified text.
version added: 1.1.4
jQuery( ":contains(text)" )
text: A string of text to look for. It's case sensitive.
http://jsoup.org/apidocs/org/jsoup/select/Selector.html
:contains(text) elements that contains the specified text. The search
is case insensitive. The text may appear in the found element, or any
of its descendants.
Mojolicious analog?
Untested, but I would go in the direction of
$dom->find('*')
->grep(sub { $_->all_text =~ /text/ })
->map('following', 'td')
->map('find', 'td')
(if you have something more specific before your :contains, like at least a tag name selector, then replace the * with that, which should greatly help the performance).
After a few experiments with hobbs's code, I can reproduce the jQuery/jsoup selector result:
:contains('some string') ~ td + td
Mojo:
$dom
-> find('*')
-> grep(sub { $_ -> text =~ /some string/; })
-> map('following', '~ td + td')
-> flatten;
But I don't think this is a universal or the best way to do such a select. It's just a start.
text
Extract text content from this element only (not including child
elements), smart whitespace trimming is enabled by default.
flatten
Flatten nested collections/arrays recursively and create a new
collection with all elements.
I have a simple form with four inputs. When I submit my form, I want to use the GET http method.
For example:
aaa : foo
bbb : ____
ccc : bar
ddd : ____
I want to have this query string:
/?aaa=foo&ccc=bar
The problem is that I get this query string instead:
/?aaa=foo&bbb=&ccc=bar&ddd=
How can I remove the empty fields of my form from the query string?
Thanks.
You could use jQuery's submit function to validate/alter your form:
<form method="get" id="form1">
    <input type="text" name="aaa" id="aaa" />
    <input type="text" name="bbb" id="bbb" />
    <input type="submit" value="Go" />
</form>
<script type="text/javascript">
$(document).ready(function() {
    $("#form1").submit(function() {
        if($("#bbb").val()=="") {
            $("#bbb").remove();
        }
    });
});
</script>
The current solutions depend on the client running JavaScript/jQuery, which cannot be guaranteed. Another option is to use a rewrite rule. On an Apache server, I've used the following to strip out empty $_GET values from form submissions:
RewriteCond %{REQUEST_URI} \/formpage\/?
RewriteCond %{QUERY_STRING} ^(.*)(&|\?)[a-z-_]+=(?=&|$)(.*)$ [NC]
RewriteRule .* \/formpage?%1%2 [R,L,NE]
... which would turn this:
http://www.yoursite.com/formpage?search=staff&year=&firstname=&surname=smith&area=
... into this:
http://www.yoursite.com/formpage?search=staff&surname=smith
A quick explanation:
RewriteCond %{REQUEST_URI} \/formpage\/? is a means to limit the scope of your regular expression to a particular sub-page of your site's address (for example, http://www.yoursite.com/formpage). It's not strictly required, but you may wish to apply the expression only to the page on which the form appears. This is one way of accomplishing this.
RewriteCond %{QUERY_STRING} ^(.*)(&|\?)[a-z-_]+=(?=&|$)(.*)$ [NC] is a second condition, which then matches against the query string (i.e. everything that appears after the question mark, such as firstname=sam&surname=smith; mod_rewrite's %{QUERY_STRING} does not include the leading ? itself). Let's break it down:
^(.*) - The ^ symbol signifies that this is the START of the query string, and the period . is saying ANY character. The asterisk * is a modifier against the period wildcard, saying that any number of them should be included in the scope.
(&|\?)[a-z-_]+=(?=&|$) - This is the most interesting bit (if you like this kind of thing ...). This finds your empty query string parameter. Inside the first parentheses, we're stating that the parameter begins with an ampersand OR a literal question mark (the backslash makes the following character literal; otherwise the question mark has a different meaning). Inside the square brackets, we're saying match any character from a-z, OR a hyphen -, OR an underscore _, and the plus symbol + immediately after that states that we should match any of those preceding characters 1 or more times. The equals sign = is just what it looks like (the equals between your parameter name and its value). Finally, inside the next parentheses, we have a lookahead clause (?=...) that states that the next character MUST BE an ampersand OR the very end of the line, as indicated by the dollar sign $. The intention here is that a parameter will only be considered a match if it has no value, i.e. it is followed directly by the start of another parameter or by the end of the line.
The final bits (.*)$ [NC] are just matching against any and all content that follows an empty query string, and the [NC] clause means that this is cASe iNsENsiTIve.
RewriteRule .* \/formpage?%1%2 [R,L,NE] is where we tell mod_rewrite what to do with our matched content: essentially, rewrite the URL to your page name followed by the whole query string except the matched empty parameters. Because this will loop over your URL until it ceases to find a match, it does not matter if you have a single empty value or 50. It should spot them all and leave only parameters that were submitted with values. The [NE] flag on the rewrite rule means that characters will not be URL-encoded, which may or may not be useful to you if you are expecting special characters (but you obviously need to sanitize your $_GET data when you're processing it, which you should be doing anyway).
Again, this is obviously an Apache solution using mod_rewrite. If you are running a different web server (such as IIS on a Windows server), this will need to be adjusted accordingly.
Hope that's of some use to someone.
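For readers who want to experiment with the lookahead idea outside Apache, here is a minimal JavaScript sketch. It is not the mod_rewrite rule itself: it works on a bare query string (without the leading question mark), uses a slightly simplified pattern, and stripEmptyParams is an illustrative name.

```javascript
// Removes "name=" pairs that have no value, i.e. where the "=" is
// followed directly by "&" or by the end of the string.
function stripEmptyParams(query) {
  return query
    // (^|&)      the pair starts the string or follows an ampersand
    // [a-z_-]+=  a parameter name and the equals sign
    // (?=&|$)    lookahead: nothing between "=" and the next "&" / end
    .replace(/(^|&)[a-z_-]+=(?=&|$)/gi, '')
    // If the first pair was removed, a leading "&" is left behind.
    .replace(/^&/, '');
}
```

Because the lookahead does not consume the following ampersand, consecutive empty parameters are all caught in a single pass of the global replace.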
I love the idea given by RaphDG
But I modified the code a little. I just use the disabled property rather than removing the field.
Here is the changed code:
<form method="get" id="form1">
    <input type="text" name="aaa" id="aaa" />
    <input type="text" name="bbb" id="bbb" />
    <input type="submit" value="Go" />
</form>
<script type="text/javascript">
$(document).ready(function() {
    $("#form1").submit(function() {
        if($("#bbb").val()=="") {
            $("#bbb").prop('disabled', true);
        }
    });
});
</script>
Thanks once again for the idea RaphDG (y)
With JQuery:
$('#form').submit(function (e) {
    e.preventDefault();
    var query = $(this).serializeArray().filter(function (i) {
        return i.value;
    });
    // An array is always truthy, so check its length before adding "?"
    window.location.href = $(this).attr('action') + (query.length ? '?' + $.param(query) : '');
});
Explanations:
.submit() hooks onto the form's submit event
e.preventDefault() prevents the form from submitting
.serializeArray() gives us an array representation of the query string that was going to be sent.
.filter() removes falsy (including empty) values in that array.
$.param(query) creates a serialized and URL-compliant representation of our updated array
setting a value to window.location.href sends the request
See also this solution.
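The filter-then-serialize steps listed above can also be sketched without jQuery. This is only an illustration under stated assumptions: buildGetUrl is a hypothetical helper, and the fields are passed in as an array of [name, value] string pairs rather than being read from a real form.

```javascript
// Builds the target URL from an action and [name, value] pairs,
// dropping every pair whose value is the empty string.
function buildGetUrl(action, fields) {
  const kept = fields.filter(([, value]) => value !== '');
  // URLSearchParams takes care of URL-encoding names and values.
  const query = new URLSearchParams(kept).toString();
  return query ? action + '?' + query : action;
}

const url = buildGetUrl('/', [
  ['aaa', 'foo'],
  ['bbb', ''],
  ['ccc', 'bar'],
  ['ddd', ''],
]);
```

In a browser, the pairs could plausibly come from the entries of a FormData object built from the form, though forms with file inputs would need extra handling since those values are not strings.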
With jQuery:
$('form[method="get"]').submit(function(){
    $(this).find(':input').each(function() {
        var inp = $(this);
        if (!inp.val()) {
            inp.remove();
        }
    });
});
I use this to remove all empty parameters from all GET forms on my site. The :input pseudo selector covers all inputs, textareas, selects and buttons.
If you want to remove all empty inputs, you can iterate over the form inputs like this:
<script type="text/javascript">
$(document).ready(function() {
    $("#form1").submit(function() {
        $(this).find('input').each(function() {
            if($(this).val() == '') {
                $(this).remove();
            }
        });
    });
});
</script>
Here's a very similar answer to the other ones, but it factors in select and textarea inputs as well:
$(document).ready(function() {
    $("form").submit(function() {
        let form = $(this);
        form.find('input, select, textarea').each(function() {
            let input = $(this);
            if (input.val() == '') {
                input.prop('disabled', true);
            }
        });
    });
});