jsoup: not selecting all rows in a table - select

I have a web page containing a table in which each row has either class vc_row_odd or vc_row_even, like the sample rows shown after the code below. The table has about 3000 rows, but with jsoup I am retrieving only about half of them.
I have looked at the page source, and the rows appear to be consistently structured. Why is the jsoup selector not finding all the rows?
Here is my jsoup code:
Document doc = Jsoup.connect(url).header("Authorization",
"Basic " + base64login).timeout(10 * 1000).get();
Elements trs = doc.select("tr[class~=vc_row_(odd|even)]");
logger.debug("Size trs " + trs.size());
<tr class="vc_row_odd">
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>
<tr class="vc_row_even">
<td>...</td>
<td>...</td>
<td>...</td>
<td>...</td>
</tr>

Found a solution: setting maxSizeLimit to zero.
http://jmchung.github.io/blog/2013/10/25/how-to-solve-jsoup-does-not-get-complete-html-document/

I just ran into this myself. The link in the existing self-answer is dead, but according to the latest archived version [1], the option is called maxBodySize [2], not maxSizeLimit. The default is 1MB; setting it to zero turns the limit off.
Document doc = Jsoup.connect(DATA_URL).maxBodySize(0).get();
Sources:
[1] https://web.archive.org/web/20180209022028/http://jmchung.github.io/blog/2013/10/25/how-to-solve-jsoup-does-not-get-complete-html-document/
[2] https://jsoup.org/apidocs/org/jsoup/Connection.html#maxBodySize-int-


Selenium IDE - Extract substring from string

I have a variable with this string:
Total price: 65.09 GBP
How do I extract the number 65.09 from this string and store it in another variable? I need just the number with 2 decimals, not the currency after it. The value can be anything from 0 to 1 billion, so it may also contain commas as thousands separators.
I saw some other posts about this issue that use split with delimiters or substr(), but I could not adapt them to my scenario. Thanks!
I would use a regular expression to extract the number:
<tr>
<td>store</td>
<td>Total price: 65.09 GBP</td>
<td>v1</td>
</tr>
<tr>
<td>storeEval</td>
<td>storedVars['v1'].match(/[\d.,]+/)</td>
<td>v2</td>
</tr>
<tr>
<td>echo</td>
<td>${v2}</td>
<td></td>
</tr>
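Note that the pattern above keeps any thousands separators, so an amount such as 1,234,567.09 comes back as text with the commas still in it. If a numeric value is needed, something along these lines might work inside storeEval. This is only a minimal sketch in plain JavaScript: extractPrice is a hypothetical helper name, and the "Total price: <amount> <currency>" format is assumed from the question.

```javascript
// Hypothetical helper: pulls the first amount with two decimals out of a
// label such as "Total price: 1,234,567.09 GBP" (format assumed from the
// question). Returns null when no amount is found.
function extractPrice(text) {
  // One digit, then any digits or thousands commas, then "." and two digits.
  var match = text.match(/\d[\d,]*\.\d{2}/);
  if (match === null) {
    return null;
  }
  var raw = match[0];                        // matched text, commas included
  var value = Number(raw.replace(/,/g, "")); // commas stripped, as a number
  return { raw: raw, value: value };
}

console.log(extractPrice("Total price: 65.09 GBP"));
console.log(extractPrice("Total price: 1,234,567.09 GBP"));
```

The raw field keeps the text exactly as displayed, while value is convenient for comparisons or arithmetic in later steps.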

How to loop from certain number to certain number in AEM?

With data-sly-list I can loop through elements, as in the following example. The documentation describes it like this:
data-sly-list: Repeats the content of the host element for each enumerable property in the provided object.
However, what if I want to loop through only the first 4 elements, or from the 8th to the 10th element? How can I do that?
Here is a simple loop:
<dl data-sly-list="${currentPage.listChildren}">
<dt>index: ${itemList.index}</dt>
<dd>value: ${item.title}</dd>
</dl>
Use something like this:
<dl data-sly-list="${currentPage.listChildren}">
<div data-sly-test="${itemList.count > 4 && itemList.count <8}" data-sly-unwrap>
<dt>index: ${itemList.index}</dt>
<dd>value: ${item.title}</dd>
</div>
</dl>
You can use either the count or the index variable. The condition above keeps the 5th to 7th children; adjust the bounds for the range you need.
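When choosing the bounds, keep in mind that itemList.index is zero-based while itemList.count is one-based. The sketch below is plain JavaScript, not HTL, and only illustrates which items a count-based condition lets through; selectByCount and the sample array are hypothetical.

```javascript
// Illustration only (plain JavaScript, not HTL). For each item, "index"
// mirrors itemList.index (zero-based) and "count" mirrors itemList.count
// (one-based). selectByCount is a hypothetical helper name.
function selectByCount(items, from, to) {
  return items.filter(function (item, index) {
    var count = index + 1; // count is always index + 1
    return count >= from && count <= to;
  });
}

var children = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"];
console.log(selectByCount(children, 1, 4));  // first four children
console.log(selectByCount(children, 8, 10)); // 8th to 10th children
```

In the template, the equivalent test for the 8th to 10th children would compare itemList.count against 8 and 10, or itemList.index against 7 and 9.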

Data exporting from Financial Times

I am new to Matlab and am currently working with financial data exported from the Financial Times website. I would like to know how I can get, for example, the share price forecast information from this page:
http://markets.ft.com/research/Markets/Tearsheets/Forecasts?s=DIS:NYQ
High +34.7 % 85.00
Med +15.7 % 73.00
Low -9.6 % 57.00
and save this information as variables.
Here's a simple solution using urlread and regexpi:
% Create URL string and read in HTML
ftbaseurl = 'http://markets.ft.com/research/Markets/Tearsheets/Forecasts?s=';
ticksym = 'DIS:NYQ';
s = urlread([ftbaseurl ticksym]);
% Create pattern string for regular expression matching
trspan = '<tr><td class="text"><span class="';
tdspan1 = '</span></td><td><span class="\w\w\w color ">'; % \w\w\w matches pos or neg
matchstr1 = '(?<percent>[+-]?\d+\.\d+)'; % percent: optional sign, (1+ digits) . (1+ digits)
tdspan2 = ' %</span></td><td>';
matchstr2 = '(?<price>\d+\.\d\d)</td></tr>'; % price: match (1+ digits) . 2 digits
pat = [trspan 'high">High' tdspan1 matchstr1 tdspan2 matchstr2 '|' ...
trspan 'med">Med' tdspan1 matchstr1 tdspan2 matchstr2 '|' ...
trspan 'low">Low' tdspan1 matchstr1 tdspan2 matchstr2];
% Match patterns in HTML, case insensitive, put results in struct array
forecasts = regexpi(s,pat,'names');
The result is a 1-by-3 struct array where each element has two fields, 'percent' and 'price', that each contain strings extracted by the regular expression parser. For example
>> forecasts(3)
ans =
    percent: '-10.3'
      price: '57.00'
>> str2double(forecasts(3).percent)
-10.3000
I'll leave it to you to convert the strings to numbers and to turn this into a general function. Note that financial software usually stores prices as integer cents (or whatever the lowest denomination is) rather than as floating-point dollars, to avoid numerical issues. Here's some more information on regular expressions in Matlab.
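As a rough illustration of the conversion step, here is a minimal sketch. It is written in JavaScript rather than Matlab, toCents is a hypothetical helper, and it assumes a non-negative price string of the form digits.digits like the price field above.

```javascript
// Hypothetical helper: converts a price string such as "57.00" to integer
// cents without going through floating-point multiplication.
// Assumes a non-negative "digits.digits" (or plain "digits") string.
function toCents(priceText) {
  var parts = priceText.split(".");
  var whole = parseInt(parts[0], 10);
  // Pad or truncate the fractional part to exactly two digits.
  var fraction = parts.length > 1 ? (parts[1] + "00").slice(0, 2) : "00";
  return whole * 100 + parseInt(fraction, 10);
}

// Sample values copied from the struct shown above.
var forecast = { percent: "-10.3", price: "57.00" };
console.log(parseFloat(forecast.percent)); // -10.3
console.log(toCents(forecast.price));      // 5700
```

Working on the digits as text sidesteps rounding surprises that can appear when a decimal price is multiplied by 100 in floating point.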
My comment above still stands. This is very inefficient. You're downloading the entire webpage HTML and parsing it in order to find a few small bits of data. This is fine if this doesn't update very often or if you don't need it to be very fast. Also, this scheme is fragile. If the Financial Times updates their website, it may break the code. And if you try downloading their regular webpages very often they may also have means of blocking you.

Using objective c and xpath find column index when table contains column spanning cells

I'm parsing a table using hpple and libxml2 in an iPhone app. I have run into a real problem finding the column index when sibling cells span multiple columns using colspan.
I saw this question, but I can't use jQuery to work out the column.
Consider the following table
<table>
<tbody>
<tr>
<td>One</td>
<td>Two</td>
<td id="example1">Three</td>
<td>Four</td>
<td>Five</td>
<td>Six</td>
</tr>
<tr>
<td colspan="2">One</td>
<td colspan="2">Two</td>
<td colspan="2" id="example2">Three</td>
</tr>
<tr>
<td>One</td>
<td>Two</td>
<td>Three</td>
<td>Four</td>
<td>Five</td>
<td>Six</td>
</tr>
</tbody>
</table>
How can I get the column index as 6, NOT 3, for the cell with id 'example2'?
EDIT Added more detail
NSString *xpathQuery = [NSString stringWithFormat:@"1
+ count(//a[contains(@href,'testHref')]
/../preceding-sibling::td[not(@colspan)])
+ sum(//a[contains(@href,'testHref')]/../preceding-sibling::td/@colspan)
+ sum(@colspan)
- count(@colspan)",bookingUrl, bookingUrl];
//Execute XPath
NSArray *array = [parser searchWithXPathQuery:columnCount];
This seems to work:
1 + count(preceding-sibling::td[not(@colspan)])
+ sum(preceding-sibling::td/@colspan)
+ sum(@colspan)
- count(@colspan)
Evaluate the XPath with the target <td> as the context node.
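The expression adds up the spans of all preceding cells (counting 1 for a cell without a colspan) and then the span of the cell itself, so for a cell that has a colspan the result is the last column it covers. The same arithmetic can be sketched in plain JavaScript. This is not XPath, and columnIndex and the colspans arrays are hypothetical, with 1 standing in for a missing colspan attribute.

```javascript
// Sketch of the arithmetic behind the XPath expression above.
// colspans: the colspan of every cell in one row, in order (1 if absent).
// cellIndex: zero-based position of the target cell within that row.
// Returns the one-based index of the last column the target cell covers.
function columnIndex(colspans, cellIndex) {
  var column = 0;
  for (var i = 0; i <= cellIndex; i++) {
    column += colspans[i]; // each cell advances the column by its span
  }
  return column;
}

// Rows from the table in the question.
console.log(columnIndex([1, 1, 1, 1, 1, 1], 2)); // 3 (cell "example1")
console.log(columnIndex([2, 2, 2], 2));          // 6 (cell "example2")
```

If the first covered column is wanted instead, subtracting the target cell's own span and adding 1 gives it (5 for "example2").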

How to identify the current row in an Apex Tabular Form?

Friends,
I have written the following JavaScript to work out which row of a tabular form the user is currently on, e.g. they have clicked a select list on row 4. I need the row number so that I can get the value of a field on the same row and perform some further processing on it.
The JavaScript gets the id of the triggering item, e.g. f02_0004, which tells me that the select list in column 2 of row 4 has been selected. It then extracts just the row part, i.e. 0004, and uses it to reference another field in the same row; at the moment it just outputs the value to show that I have the correct one.
<script language="JavaScript" type="text/javascript">
function cascade(pThis){
var row = getTheCurrentRow(pThis.id);
var nameAndRow = "f03_" + row;
var costCentre = $x(nameAndRow).value;
alert("the cost centre id is " + costCentre);
}
// the triggerItem has the name fxx_yyyy where xx is the column number and
// yyyy is the row. This function just returns yyyy
function getTheCurrentRow(triggerItem){
var theRow = triggerItem.slice(4);
return theRow;
}
</script>
Whilst this works, I can't help feeling that I must be reinventing the wheel and that
either there are built-ins that I can use or, if not, there may be a "better" way.
In case it matters, I'm using Apex 4.0.
Thanks in advance for any help you can provide.
Well, what you have described is exactly what I typically do myself!
An alternative in Apex 4.0 would be to use jQuery to navigate the DOM, something like this:
var costCentre = $(pThis).parents('tr').find('input[name="f03"]')[0].value;
I have tested this and it works OK in my test form.
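If you stay with the id-based approach from the question, a regular expression makes the fxx_yyyy assumption explicit instead of relying on a fixed slice offset. This is only a sketch: sameRowItemId is a hypothetical helper, and the two-digit column format is assumed from the ids shown in the question.

```javascript
// Hypothetical helper: given the id of the triggering item (e.g. "f02_0004")
// and a two-digit column such as "03", returns the id of the item in that
// column on the same row. Returns null for ids that do not follow the
// fxx_yyyy pattern assumed from the question.
function sameRowItemId(triggerId, column) {
  var match = /^f(\d{2})_(\d+)$/.exec(triggerId);
  if (match === null) {
    return null;
  }
  return "f" + column + "_" + match[2]; // match[2] is the row part
}

console.log(sameRowItemId("f02_0004", "03")); // f03_0004
```

Inside cascade, the result could be passed to $x in place of the hand-built nameAndRow string, with a null check for unexpected ids.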