Intricate sites using HtmlUnit - dump

I'm trying to dump the whole contents of a certain site using HtmlUnit, but when I try this on a certain (rather intricate) site, I get an empty file (not an empty file per se, but it has an empty head tag, an empty body tag and that's it).
The site is https://www.abcdin.cl/abcdin/abcdin.nsf#https://www.abcdin.cl/abcdin/abcdin.nsf/linea?openpage&cat=Audio&cattxt=TV%20y%20Audio&catpos=03&linea=LCD&lineatxt=LCD%20&
And here's my code:
BufferedWriter writer = new BufferedWriter(new FileWriter(fullOutputPath));
HtmlPage page;
final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8);
webClient.setCssEnabled(false);
webClient.setPopupBlockerEnabled(true);
webClient.setRedirectEnabled(true);
webClient.setThrowExceptionOnScriptError(false);
webClient.setThrowExceptionOnFailingStatusCode(false);
webClient.setUseInsecureSSL(true);
webClient.setJavaScriptEnabled(true);
page = webClient.getPage(url);
String dumpString = page.asXml();
writer.write(dumpString);
writer.close();
webClient.closeAllWindows();
Some people say that I need to introduce a pause in my code, since the page takes a while to load in Google Chrome, but I have tried long pauses and it doesn't work.
Thanks in advance.

Just some ideas...
Retrieving that URL with wget returns a non-trivial HTML file, and so does running your code with webClient.setJavaScriptEnabled(false). So it's definitely something to do with the JavaScript in the page.
With JavaScript enabled, I see from the logs that a bunch of JavaScript jobs are being queued up, and I see corresponding errors like this:
EcmaError: lineNumber=[49] column=[0] lineSource=[<no source>] name=[TypeError] sourceName=[https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js] message=[TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)
at
com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:601)
Maybe those jobs are meant to populate your HTML? So when they fail, the resulting HTML is empty?
The error looks strange, as HtmlUnit usually has no issues with jQuery. I suspect the issue is with the code calling that particular line of the jQuery library.
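If those queued jobs are what fills in the page, it may also help to give them time to finish before dumping. A minimal sketch, assuming the same HtmlUnit 2.x API the question uses; the 10-second budget and the NicelyResynchronizingAjaxController are assumptions, not something the question confirms is needed:
// Sketch: wait for queued background JavaScript before dumping the page.
final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8);
webClient.setThrowExceptionOnScriptError(false);
webClient.setAjaxController(new NicelyResynchronizingAjaxController()); // re-synchronize AJAX calls
HtmlPage page = webClient.getPage(url);
webClient.waitForBackgroundJavaScript(10000); // wait up to 10 s for pending JS jobs
String dump = page.asXml();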

Related

Karate: How to wait for a spinner to disappear? [duplicate]

Firstly, Karate UI automation is a really awesome tool. I am kind of enjoying it while writing the UI tests using Karate. I ran into a situation where I was trying to fetch shadowRoot elements. I read a few similar posts related to the JavaScript executor with Karate and learnt that this is already answered: it is recommended to use driver.eval. But in Karate 0.9.5 there is no eval; it has script() or scriptAll(). I have gone through the documentation a couple of times to figure out how I can fetch an element inside another element, but no luck.
Using traditional Selenium + Java, we can fetch a shadowRoot that sits inside a parent element like a div or body. The HTML looks like this (it is from chrome://downloads); downloads-manager is the tag name, and under that downloads-manager a shadow-root element exists:
<downloads-manager>
#shadow-root(open)
</downloads-manager>
WebElement downloadManager = driver.findElement(By.tagName("downloads-manager"));
WebElement shadowRoot = (WebElement) ((JavascriptExecutor) driver)
    .executeScript("return arguments[0].shadowRoot", downloadManager);
So I tried the following in Karate UI:
script("downloads-manager","return _.shadowRoot"); //js injection error
script('downloads-manager', "function(e){ return e.shadowRoot;}"); // same injection error as mentioned above.
def shadowRoot = locate("downloads-manager").script("function(e){return e.shadowRoot};"); //returns an empty string.
I bet there is a way to get this shadowRoot element using Karate UI, but I am kind of running out of options and not able to figure this out.
Can someone please look into this & help me?
-San
Can you switch to XPath and see if that helps:
* def temp = script('//downloads-manager', '_.innerHTML')
Else please submit a sample in this format so we can debug: https://github.com/intuit/karate/tree/develop/examples/ui-test
EDIT: after you posted the link to that hangouts example in the comments, I figured out the JS that would work:
* driver 'http://html5-demos.appspot.com/hangouts'
* waitFor('#hangouts')
* def heading = script('hangout-module', "_.shadowRoot.querySelector('h1').textContent")
* match heading == 'Paul Irish'
It took some trial and error and fiddling with the DevTools console to figure this out. So the good news is that it is possible, you can use any JS you need, and you do need to know which HTML element to call .shadowRoot on.
EDIT: for other examples of JS in Karate: https://stackoverflow.com/a/60800181/143475

Stop huge error output from testing-library

I love testing-library, have used it a lot in a React project, and I'm trying to use it in an Angular project now - but I've always struggled with the enormous error output, including the HTML text of the render. Not only is this usually unhelpful (I couldn't find an element; here's the HTML where it isn't), but it gets truncated, often before the interesting line if you're running in debug mode.
I simply added it as a library alongside the standard Angular Karma+Jasmine setup.
I'm sure you could say the components I'm testing are too large if the HTML output causes my console window to spool for ages, but I have a lot of integration tests in Protractor, and they are SO SLOW :(.
I would say the best solution would be to use the configure method and pass a custom function for getElementError which does what you want.
You can read about configuration here: https://testing-library.com/docs/dom-testing-library/api-configuration
An example of this might look like:
configure({
  getElementError: (message: string, container) => {
    const error = new Error(message);
    error.name = 'TestingLibraryElementError';
    error.stack = null;
    return error;
  },
});
You can then put this in any single test file or use Jest's setupFiles or setupFilesAfterEnv config options to have it run globally.
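For the global route, here is a minimal sketch of the Jest wiring; the setup file name jest.setup.js is a made-up example and would contain the configure({...}) call shown above:
// jest.config.js (sketch; the setup file name is an assumption)
module.exports = {
  setupFilesAfterEnv: ['<rootDir>/jest.setup.js'],
};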
I am assuming you are running Jest with RTL in your project.
I personally wouldn't turn it off as it's there to help us, but everyone has a way so if you have your reasons, then fair enough.
1. If you want to disable errors for a specific test, you can mock the console.error.
it('disable error example', () => {
  const errorObject = console.error; // store the state of the object
  console.error = jest.fn(); // mock the object
  // code
  // assertion (expect)
  console.error = errorObject; // assign it back so you can use it in the next test
});
2. If you want to silence it for all the tests, you could use the jest --silent CLI option. Check the docs.
The above might even disable the DOM printing that is done by RTL; I am not sure, as I haven't tried this, but if you look at the docs I linked, it says:
"Prevent tests from printing messages through the console."
If the above doesn't work, you almost certainly have everything disabled except the DOM output. In that case you might look into react-testing-library's source code and find out what is used for those print statements. Is it a console.log? Is it a console.warn? Once you know, just mock it out like option 1 above.
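For instance, here is a minimal sketch of applying option 1 to a whole test file with jest.spyOn; swap console.error for console.log or console.warn once you have confirmed which one the library uses:
// Sketch: silence a console method for every test in this file, restorable afterwards.
beforeEach(() => {
  jest.spyOn(console, 'error').mockImplementation(() => {});
});

afterEach(() => {
  jest.restoreAllMocks();
});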
UPDATE
After some digging, I found out that all testing-library DOM printing is built on prettyDOM();
While prettyDOM() can't be disabled, you can limit the number of lines to 0, which would just give you the error message and three dots ... below the message.
Here is an example printout I messed around with:
TestingLibraryElementError: Unable to find an element with the text: Hello ther. This could be because the text is broken up by multiple elements. In this case, you can provide a function for your text matcher to make your matcher more flexible.
...
All you need to do is to pass in an environment variable before executing your test suite, so for example with an npm script it would look like:
DEBUG_PRINT_LIMIT=0 npm run test
Here is the doc
UPDATE 2:
As per the OP's feature request on GitHub, this can also be achieved without injecting a global variable to limit the prettyDOM line output (in case it's used elsewhere). The getElementError config option needs to be changed:
dom-testing-library/src/config.js
// called when getBy* queries fail. (message, container) => Error
getElementError(message, container) {
  const error = new Error(
    [message, prettyDOM(container)].filter(Boolean).join('\n\n'),
  )
  error.name = 'TestingLibraryElementError'
  return error
},
The call stack can also be removed.
You can change how the message is built by setting the DOM Testing Library message-building function via configure. In my Angular project I added this to test.js:
configure({
  getElementError: (message: string, container) => {
    const error = new Error(message);
    error.name = 'TestingLibraryElementError';
    error.stack = null;
    return error;
  },
});
This was answered here: https://github.com/testing-library/dom-testing-library/issues/773 by https://github.com/wyze.

PhantomJS change webpage content before evaluating

I'd like to either remove an HTML element or simply remove first N characters of a webpage before evaluating/rendering it.
Is there any way to do that?
It depends on multiple scenarios. I will only outline the steps for each combination of the answers to the following questions.
Is the piece of JS called onload (ol) or is the script block immediately evaluated (ie)?
Is it an inline script (is) or is the script loaded separately (src attribute) (ls)?
Does the script block also contain some code that should not be removed (nr) or can it be removed completely (rc)?
1. Script is loaded separately (ls) & code can be removed completely (rc)
Register an onResourceRequested listener and abort the request depending on the matched URL, as in the sketch below.
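A minimal sketch; the URL pattern is just a placeholder:
// Sketch: abort a separately loaded script by URL before PhantomJS fetches it.
// The second callback argument exposes abort() (PhantomJS 1.9+).
var page = require('webpage').create();
page.onResourceRequested = function(requestData, networkRequest) {
  if (/unwanted-script\.js/.test(requestData.url)) {
    networkRequest.abort();
  }
};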
2. Script is loaded separately (ls) & contains other code too (nr)
This can only be done when the following code blocks do not depend on the code that should not be removed (which is unlikely). This is most likely necessary for click events that are registered in the DOM.
In this case cancel the request like in 1., download the script through an XHR, remove the unwanted code parts and add the resulting code block to the DOM (see the sketch below). For this to work, you would need to disable web security, because otherwise no resource can be requested if it is not on the same domain: --web-security=false.
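A rough sketch of that idea; the script URL and the pattern marking the unwanted code are placeholders:
// Sketch: fetch the script manually, strip the unwanted part, inject the rest.
page.onLoadFinished = function() {
  page.evaluate(function(scriptUrl) {
    var xhr = new XMLHttpRequest();
    xhr.open("GET", scriptUrl, false); // synchronous for simplicity
    xhr.send();
    // remove the unwanted section (placeholder markers) and keep the rest
    var code = xhr.responseText.replace(/\/\* unwanted start \*\/[\s\S]*?\/\* unwanted end \*\//, '');
    var s = document.createElement('script');
    s.textContent = code;
    document.body.appendChild(s);
  }, 'http://www.example.com/some-script.js');
};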
3. Script is loaded with the DOM (is) & JS executed through onload (ol) & can be removed completely (rc)
This is probably very error prone. You would begin an interval with setInterval(function(){}, 5) from a page.onInitialized callback. Inside the interval you would need to check whether window.onload (or something else you can get your hands on) is set in the page context. You remove it if it is indeed the function that you wanted to remove, by checking window.onload.toString().match(/something/).
This can be done directly and completely inside the page context (inside page.evaluate), as in the sketch below.
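A sketch of what that polling might look like; /something/ is a placeholder for a pattern that identifies the handler you want to drop:
// Sketch: poll for window.onload inside the page context and remove it if it matches.
page.onInitialized = function() {
  page.evaluate(function() {
    var timer = setInterval(function() {
      if (window.onload && window.onload.toString().match(/something/)) {
        window.onload = null;
        clearInterval(timer);
      }
    }, 5);
  });
};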
4. Script is loaded with the DOM (is) & JS executed through onload (ol) & contains other code too (nr)
Begin like in 3., but instead of removing window.onload, you can do
eval("window.onload = " + window.onload.toString().replace(/something/,''))
5. Script is loaded with the DOM (is) & the script block immediately evaluated (ie)
You can load the page as an XHR, replace the text and apply the adjusted content to the page. This will essentially be a filled about:blank page. For this to work, you would need to disable web security, because otherwise no resource can be requested if it is not on the same domain: --web-security=false or --local-to-remote-url-access=true. This would also work for 3. and 4.
There is still one problem though. Pages don't use full URLs most of the time. So when a script or element refers to stuff.php, PhantomJS cannot request it. When page.content is set, the page URL is essentially about:blank and all requests with incomplete URLs point to file:///.... Obviously there are no such files. Those resources must be replaced with their full URL counterparts.
There are three types of such URLs:
//example.com/resource.php variable protocol
/resource.php variable protocol and domain
resource.php variable protocol, domain and path to resource
Complete example:
var page = require('webpage').create(),
    url = 'http://www.example.com';
page.open(url, function(status) {
  if (status !== 'success') {
    console.log('Unable to access network');
  } else {
    var content = page.evaluate(function(url) {
      var xhr = new XMLHttpRequest();
      xhr.open("GET", url, false);
      xhr.send();
      return xhr.responseText;
    }, url);
    page.render("test_example.png");
    page.content = content.replace(/xample/g, "asy");
    page.render("test_easy.png");
    console.log("url " + page.url); // about:blank
    phantom.exit();
  }
});
You might want to look into proper manipulation techniques apart from the simple string replace.

How do I customize wintersmith paginator?

I've been setting up a site with Wintersmith and am loving it for the most part, but I cannot wrap my head around some of the under-the-hood mechanics. I started with the "blog" skeleton that adds the paginator.coffee plugin.
The question requires some details, so up top, what I'm trying to accomplish:
1. Any files (markdown, html, json metadata) will be picked up either in /contents/articles/<file> or /contents/articles/<subdir>/<file>.
2. Output files are at /articles/YYYY/MM/DD/title-slug/.
3. /blog.html lists all articles, paginated.
4. Files just under /contents (not in articles) are not treated as blog posts. Markdown and JSON metadata are still processed, but there are no permalinked URLs, they are not included in blog listings, and the file/directory structure is more directly copied over.
So, I solved #1 with this suggestion: How can I have articles in Wintersmith not in their own subdirectory? So far, great, and #3 is working -- the paginated listing includes all posts. #4 has not been an issue, it's the default behavior.
On #2 I found this solution: http://andrewphilipclark.com/2013/11/08/removing-the-boilerplate-from-wintersmith-blog-posts/ . As the author mentions, his solution was (sort of) subsequently incorporated into Wintersmith master, so I tried just setting the filenameTemplate accordingly. Unfortunately this applies to all content, not just that under /articles, so the rest of my site gets hosed (breaks #4). So then I tried the author's approach, adding a blogpost.coffee plugin using his code. This generates all the files out of /contents/articles into the correct permalink URLs; however, the paginator now for some reason no longer sees files directly under /articles (point #1).
I've tried a lot of permutations and hacking. Tried changing the order of which plugin gets loaded first. Tried having PaginatorPage extend BlogpostPage instead of Page. Tried a lot of things. I finally realize, even after inspecting many of the core classes in Wintersmith source, that I do not understand what is happening.
Specifically, I cannot figure out how contents['articles']._.pages and .directories are set, which seems relevant. Nor do I understand what that underscore is.
Ultimately, Jade/CoffeeScript/Markdown are a great combo for minimizing coding and enhancing clarity except when you want to understand what's happening under the hood and you don't know these languages. It took me a bit to get the basics of Jade and CoffeeScript (Markdown is trivial of course) enough to follow what's happening. When I've had to dig into the wintersmith source, it gets deeper. I confess I'm also a node.js newbie, but I think the big issue here is just a magic framework. It would be helpful, for instance, if some of the core "plugins" were included in the skeleton site as opposed to buried in node_modules, just so curious hackers could see more quickly how things interconnect. More verbose docs would of course be helpful too. It's one thing to understand conceptually content trees, generators, views, templates, etc., but understanding the code flow and relations at runtime? I'm lost.
Any help is appreciated. As I said, I'm loving Wintersmith, just wish I could dispel magic.
Because CoffeeScript is rubbish, this is extremely hard to do. However, if you want to, you can destroy paginator.coffee and replace it with a simple JavaScript script that does a similar thing:
module.exports = function (env, callback) {
  function Page() {
    var rtn = new env.plugins.Page();
    rtn.getFilename = function() {
      return 'index.html';
    };
    rtn.getView = function() {
      return function(env, locals, contents, templates, callback) {
        var error = null;
        var context = {};
        env.utils.extend(context, locals);
        var buffer = new Buffer(templates['index.jade'].fn(context));
        callback(error, buffer);
      };
    };
    return rtn;
  }

  /** Generates a custom index page */
  function gen(contents, callback) {
    var p = Page();
    var pages = {'index.page': p};
    var error = null;
    callback(error, pages);
  }

  env.registerGenerator('magic', gen);
  callback();
};
Notice that due to 'CoffeeScript magic', there are a number of hoops to jump through here, such as making sure you return a Buffer from getView(), and 'manually' overriding rather than using the obscure CoffeeScript extension semantics.
Wintersmith is extremely picky about how it handles these functions. If callbacks are not invoked, or the returned value is not a Stream or Buffer, generated files will appear in the content summary but not be rendered to disk during a build. Enable verbose logging and check for 'skipping foo' messages to detect this.

How to obtain wicket URL from PageClass and PageParameters without running Wicket application (i.e. without RequestCycle)?

In my project, there are additional (non-wicket) applications, which need to know the URL representation of some domain objects (e.g. in order to write a link like http://mydomain.com/user/someUserName/ into a notification email).
Now I'd like to create a Spring bean in my Wicket module, exposing the URLs I need without having a running Wicket context, in order to make the other application depend on the Wicket module, e.g. offering a method public String getUrlForUser(User u) returning "/user/someUserName/".
I've been trawling the web and the Wicket source for a complete workday now, and did not find a way to retrieve the URL for a given PageClass and PageParameters without a current RequestCycle.
Any ideas how I could achieve this? Actually, all the information I need is somehow stored by my WebApplication, in which I define mount points and page classes.
Update: Because the code below caused problems under certain circumstances (in our case, being executed subsequently by a Quartz scheduled job), I dived a bit deeper and finally found a more lightweight solution.
Pros:
No need to construct and run an instance of the WebApplication
No need to mock a ServletContext
Works completely independent of web application container
Contra (or not, depends on how you look at it):
Need to extract the actual mounting from your WebApplication class and encapsulate it in another class, which can then be used by standalone processes. You can no longer use WebApplication's convenient mountPage() method then, but you can easily build your own convenience implementation; just have a look at the Wicket sources.
(Personally, I have never been happy with all the mount configuration making up 95% of my WebApplication class, so it felt good to finally extract it somewhere else.)
I cannot post the actual code, but having a look at this piece of code will give you an idea of how you should mount your pages and how to get hold of the URL afterwards:
CompoundRequestMapper rm = new CompoundRequestMapper();
// mounting the pages
rm.add(new MountedMapper("mypage", MyPage.class));
// ... mount other pages ...
// create URL from page class and parameters
Class<? extends IRequestablePage> pageClass = MyPage.class;
PageParameters pp = new PageParameters();
pp.add("param1", "value1");
IRequestHandler handler = new BookmarkablePageRequestHandler(new PageProvider(pageClass, pp));
Url url = rm.mapHandler(handler);
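To get from there to the absolute link mentioned in the question, you would prefix the rendered path yourself. A tiny sketch, where the base URL "http://mydomain.com" is an assumption that would come from your own configuration:
// Sketch: turn the mapped Url into the absolute link for the notification email.
String absoluteLink = "http://mydomain.com/" + url.toString();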
Original solution below:
After deep-diving into the intestines of the Wicket sources, I was able to glue together this piece of code:
IRequestMapper rm = MyWebApplication.get().getRootRequestMapper();
IRequestHandler handler = new BookmarkablePageRequestHandler(new PageProvider(pageClass, parameters));
Url url = rm.mapHandler(handler);
It works without a current RequestCycle, but still needs to have MyWebApplication running.
However, from Wicket's internal test classes, I have put the following together to construct a dummy instance of MyWebApplication:
MyWebApplication dummy = new MyWebApplication();
dummy.setName("test-app");
dummy.setServletContext(new MockServletContext(dummy, ""));
ThreadContext.setApplication(dummy);
dummy.initApplication();