Web crawling/scraping GWT based web pages


I am trying to crawl a web page that is built with GWT and uses the GWT RPC mechanism for its AJAX calls. The page is not mine, so I can't change the server side. I am very new to GWT, and after my first couple of days with it I think you can't deserialize the data unless you have the class interface.
Am I right, or is there a way to crawl the data intelligently?

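You could do it using HtmlUnit and its WebClient:
// Real code mixed with pseudo-code: extractLinks() is a helper you have to write yourself.
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.Map;

WebClient webClient = new WebClient(BrowserVersion.FIREFOX_3);
Map<String, String> urls = new HashMap<String, String>();   // URL -> rendered markup
LinkedList<String> urlsToVisit = new LinkedList<String>();  // crawl queue
urlsToVisit.add("http://some_gwt_app.com/#!home");
while (!urlsToVisit.isEmpty()) {
    String page = urlsToVisit.remove();
    if (urls.containsKey(page)) {
        continue; // already visited
    }
    HtmlPage htmlPage = webClient.getPage(page);
    String rendered = htmlPage.asXml(); // the DOM as HtmlUnit currently sees it
    urls.put(page, rendered);
    urlsToVisit.addAll(extractLinks(page, rendered));
}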
You might have to experiment with the WebClient options a bit. In my case these seem to do a good job:
webClient.setThrowExceptionOnScriptError(false);
webClient.setRedirectEnabled(true);
webClient.setJavaScriptEnabled(true);
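// Important! Give the headless browser enough time to execute
// JavaScript. The exact time to wait may depend on your application.
// Note: this only waits for background jobs that already exist, so it
// belongs after the getPage() call for each page.
webClient.waitForBackgroundJavaScript(20000);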
webClient.setAjaxController(new NicelyResynchronizingAjaxController());
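The extractLinks helper in the loop above is left undefined. A minimal sketch of what it might look like follows, assuming the app links between its views with "#!" fragments and that a regular expression over the rendered markup is good enough. The class name, the two-argument signature, and the regex are my own assumptions, not part of HtmlUnit.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects "#!" links from already-rendered HTML.
final class LinkExtractor {
    // Matches href="..." or href='...' and captures the attribute value.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> extractLinks(String pageUrl, String renderedHtml) {
        // Strip any fragment from the current URL so "#!x" can be appended to it.
        int hash = pageUrl.indexOf('#');
        String base = hash >= 0 ? pageUrl.substring(0, hash) : pageUrl;

        Set<String> links = new LinkedHashSet<>(); // keeps order, drops duplicates
        Matcher m = HREF.matcher(renderedHtml);
        while (m.find()) {
            String href = m.group(1);
            if (href.startsWith("#!")) {
                links.add(base + href);           // fragment-only link
            } else if (href.startsWith("http") && href.contains("#!")) {
                links.add(href);                  // absolute hash-bang link
            }
        }
        return new ArrayList<>(links);
    }
}
```

This only sees links that exist as anchors in the DOM. Navigation wired up purely through click handlers would need HtmlUnit's element API instead.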

I scrape for a living, and GWT is the one framework that almost always flummoxes me. Because it passes serialized, non-human-readable parameters, I can't interject logic that will access the site.
On some simple GWT sites I've gotten scrapes to work by parsing the JavaScript and running portions of it as is, but I can't get them all to work.
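To illustrate what those serialized parameters look like: as far as I understand the wire format, a GWT RPC request body is a pipe-delimited string made of a version, a flags field, a string-table size, the string table itself, and then a run of values that mostly index into that table. Here is a rough sketch that splits such a body apart. The class name and the sample payload in the usage note are illustrative assumptions, and escape sequences inside strings are not decoded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical reader for the outer structure of a GWT RPC request body.
final class GwtRpcRequest {
    final int version;
    final int flags;
    final List<String> stringTable; // module URL, policy hash, service, method, types, values...
    final List<String> payload;     // mostly 1-based indexes into stringTable

    private GwtRpcRequest(int version, int flags, List<String> table, List<String> payload) {
        this.version = version;
        this.flags = flags;
        this.stringTable = table;
        this.payload = payload;
    }

    static GwtRpcRequest parse(String body) {
        // String.split drops the empty token after the trailing '|'.
        String[] parts = body.split("\\|");
        int version = Integer.parseInt(parts[0]);
        int flags = Integer.parseInt(parts[1]);
        int tableSize = Integer.parseInt(parts[2]);
        List<String> all = Arrays.asList(parts);
        List<String> table = new ArrayList<>(all.subList(3, 3 + tableSize));
        List<String> payload = new ArrayList<>(all.subList(3 + tableSize, all.size()));
        return new GwtRpcRequest(version, flags, table, payload);
    }
}
```

For a body such as 5|0|6|http://localhost:8080/app/|29F4EA1240F157649C12466F01F46F60|com.test.client.GreetingService|greetServer|java.lang.String|myInput|1|2|3|4|1|5|6| the string table exposes the service interface and method name. That is enough to recognise which call is being made, even though replaying it still requires matching the server's serialization policy.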

Related

blank.html is downloaded multiple times

GWT is used and the application is deployed on WebLogic using HTTPS.
The performance is poor and with F12 Developer Tools, we could see that blank.html is downloaded multiple times. This is clearly related to GWT but we have not been able to figure out why.
The following is from javascript:
defineSeed(2613, 2614, makeCastMap([Q$BaseModelData, Q$ModelData, Q$Theme, Q$Serializable]), Slate_0);
var SLATE;
function $clinit_GXT(){
$clinit_GXT = nullMethod;
IMAGES = new XImages_generatedBundle_0;
MESSAGES = new XMessages__0;
SSL_SECURE_URL = getModuleBaseURL() + 'blank.html';
}
This is from GWT.java:
/**
* URL to a blank file used by GXT when in secure mode for iframe src to
* prevent the IE insecure content. Default value is 'blank.html'.
*/
public static String SSL_SECURE_URL = GWT.getModuleBaseURL() + "blank.html";
Does anyone know under what circumstances blank.html is called?
Thanks!

"This is from GWT.java:"
That snippet is actually from GXT.java.
It is used in a few cases when creating an <iframe> element, so that IE won't report errors when your site is served over SSL. I can actually find only one case (as of GXT 3.1.1) that uses it, in Layer.java. Only pages loaded in IE from HTTPS URLs will make use of it.
The Layer class uses this as a "shim", a way to prop up some DOM elements above others and work around some browser bugs (typically plugin or iframe related). Menus and popup dialogs use it to ensure that they don't appear "underneath" content that they should be "above".
This file is very small: just enough HTML to convince IE that the iframe has loaded correctly, and no more. It never changes, and should load nearly instantly.
As far as performance goes, this should only happen when a Menu or Window/Dialog/Tooltip is shown - these shouldn't be happening on app startup usually, at least not more than a window or two. Additionally, the browser should recognize that it is loading the same element and cache it correctly, and not load it multiple times (though it might be listed several times as hitting the cache). If the server has instructed the browser to never cache the file, that is something you should look at changing.
In short, this is very unlikely to be the cause of any performance issues, at least in GXT itself. If somehow you have the shim enabled on every single widget in your project, this should not be required. If the file is loading slowly, something may be very wrong with your server configuration.
For reference, here is the entire file:
<html></html>

Converting a Brownfield PHP Webapp to Zend Framework

We're thinking of converting our PHP webapp from using no framework (which is killing us) to using Zend Framework. Because of the size of the application, I don't think starting from scratch is going to be a viable option for management, so I want to research how to convert slowly from the current site structure to one built on Zend Framework. There isn't a lot of information on this process.
So far my plan is to dump the current code base into the public/ directory of the Zend application, fix the numerous problems that I'm sure will crop up, and then start rewriting modules one at a time.
Has anyone had experience doing this in the past and how did it work out for you?

I've done a few of these now. What worked best for me was putting ZF 'around' the old app, so all requests go through ZF. I then have a 'Legacy' controller plugin, which checks whether the request can be satisfied by ZF, and if not, sends it to the old app:
class Yourapp_Plugin_Legacy extends Zend_Controller_Plugin_Abstract
{
    public function preDispatch(Zend_Controller_Request_Abstract $request)
    {
        $dispatcher = Zend_Controller_Front::getInstance()->getDispatcher();
        if (!$dispatcher->isDispatchable($request)) {
            // send to the old code...
        }
    }
}
Exactly how you then send the request to your old app depends a bit on how it is implemented. In one project, I examined the request, determined which file from the old code the request would have gone to, and then required that file. It sounds like this might be appropriate for you. In another project my solution was to route all these requests to a LegacyController in the ZF project, which ran the old code to get the resulting HTML and then rendered it inside the Zend_Layout from the new project.
The advantages of this approach are that you can gradually introduce ZF modules as you rewrite parts of the old app, until you reach the point where 100% of requests can be served by ZF. Also, since the ZF project has initialized before your old code is run, your old code can use the ZF autoloader, so you can start replacing classes in the old code with models written in a more ZF-style, and have them used by both parts of the app.

GWT - Is it possible to create new HTML elements on the server, or can I only update the ones loaded on the client?

I'm new to this technology, and I would like to know whether it is possible to create new objects (HTML elements such as div, span, and so on) dynamically on the server and send them to the client, or whether I can only use the ones I built on the client side when developing the application.
I'm not asking how to do it (I think it's a delicate topic), only whether I can and, if so, where I can find material, examples, or tutorials on it.
Example
What I usually do:
...
public void onSuccess(Boolean result) {
    if (result) {
        myFunction();
    }
}
...
private void myFunction() {
    InlineLabel label = new InlineLabel();
    this.add(label);
}
What I'm looking for:
...
public void onSuccess(InlineLabel result) {
    this.add(result);
}
That way I would not need to load the object in advance, only when I click a button (or perform some other action). This would avoid loading a lot of code (as JavaScript) on the client that is useless if I never perform the action.
As usual, thanks for your time!

GWT does not support the pattern you showed, but you can achieve a similar effect with "code splitting": read http://code.google.com/webtoolkit/doc/latest/DevGuideCodeSplitting.html
With code splitting, the client only downloads the script it needs right away (configured by the developer). If, for example, the user navigates to a more complex area of the UI that requires more widgets, additional code will be downloaded.

I'm not entirely sure I understand your question, but please feel free to amend your question or post a comment if I've missed the mark.
The host page
A GWT app is loaded in the following (simplified) process:
1. A host page (HTML) is loaded.
2. A bootstrapping script is loaded.
3. A compiled app script is loaded.
The host page can contain any HTML you want. The only requirement is that you include a <script> element that loads the GWT bootstrapping script.
As a result, you can have the server return a page that contains any server-generated markup you like.
Server-rendered HTML at runtime
Once your app is running, you can send off asynchronous requests in your code to retrieve arbitrary data from the server. One option is to retrieve server-generated HTML and insert it into your application.
For this option, you'll want to instantiate an HTML widget, then use its setHTML method to insert the server-generated markup into the widget.
Client-generated
As an alternative, you can retrieve structured data from the server via GWT RPC. Objects created on a Java-based server are serialised by GWT on the server and deserialised on the client back into regular objects. You can then pull data out of these objects using accessor methods (getName, getId, etc.). At this point, you have several options:
- Generate some HTML using StringBuilder and the like, then use setHTML on an HTML widget.
- Generate DOM elements with the DOM class.
- Set the data into widgets and add them to panels or the root panel.
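As a rough sketch of the first option, here is plain Java that builds a list from values pulled out of deserialised objects. The class name, the list-of-names input, and the hand-rolled escape helper are assumptions made for illustration. In a real GWT client the resulting string would be handed to setHTML on an HTML widget.

```java
import java.util.List;

// Hypothetical renderer: turns a list of names into an HTML <ul> string.
final class NameListHtml {
    // Minimal escaping so data from the server cannot inject markup.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    static String render(List<String> names) {
        StringBuilder sb = new StringBuilder("<ul>");
        for (String name : names) {
            sb.append("<li>").append(escape(name)).append("</li>");
        }
        return sb.append("</ul>").toString();
    }
}
```

Escaping matters here because setHTML inserts the string as markup. If the values are meant to be plain text, leaving them unescaped opens the page to script injection.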

Grails + GWT - using the same Date Format

I am developing an app using Grails, with GWT on the client side.
I want to use the same date format on both the client and the server (preferably defined in one file).
So far I understand that Grails has its own mechanism for internationalization (grails-app/i18n). I know I can access these messages from any server code using the app context.
I can also access any resource file inside the web-app directory.
On the client side, I can use the ConstantsWithLookup interface and GWT.create(...) to get an instance of it.
But I still haven't found a good way to integrate the two so that the date format is defined in one place. Any ideas or tips?
Thanks,
Sergey

After digging into Grails more, I came to a solution.
I put the constant into a .properties file under grails-app/i18n.
Then I hook into eventCompileEnd and copy the resources from grails-app/i18n to a specific package in target\generated-sources.
After this step is completed, I generate the Google I18N interfaces from the copied property files.
I've put this functionality into a separate plugin.
_Events.groovy:
includeTargets << new File("${myPluginDir}/scripts/_MyInternal.groovy")
eventCompileEnd = {
    internalCopyMessageResources();
}
eventCopyMessageResourcesEnd = {
    generateI18NInterface();
}
Now it is possible to access localized data from server side and from client side.
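On the server side, reading the shared pattern back out of the properties text could look roughly like the sketch below. The key name date.format and the class name are hypothetical, and it assumes the pattern is one that both java.text.SimpleDateFormat and the GWT client-side formatter accept.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.Properties;
import java.util.TimeZone;

// Hypothetical server-side reader for a date pattern kept in an i18n .properties file.
final class SharedDateFormat {
    static final String KEY = "date.format"; // assumed key name

    static String pattern(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty(KEY);
        if (value == null) {
            throw new IllegalArgumentException("Missing key: " + KEY);
        }
        return value;
    }

    static String format(String propertiesText, Date date) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern(propertiesText), Locale.ROOT);
        fmt.setTimeZone(TimeZone.getTimeZone("UTC")); // fixed zone keeps output predictable
        return fmt.format(date);
    }
}
```

In a real Grails app the text would come from the messages file under grails-app/i18n rather than from a string. The string input just keeps the sketch self-contained.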

Making GWT application crawlable by a search engine

I want to use the #! token to make my GWT application crawlable, as described here:
http://code.google.com/web/ajaxcrawling/
There is a GWT sample app available online that uses this. For example, the URL
http://gwt.google.com/samples/Showcase/Showcase.html#!CwRadioButton
will serve the following static web page to Googlebot:
http://gwt.google.com/samples/Showcase/Showcase.html?_escaped_fragment_=CwRadioButton
I want my GWT app to do something similar. In short, I'd like to serve a different flavor of the page whenever the _escaped_fragment_ parameter is found in the URL.
What should I modify in order for the server to serve something else (a static page, or a page dynamically generated through a headless browser like HTML Unit)? I'm guessing it could be the web.xml file, but I'm not sure.
(Note: I thought of checking the Showcase app provided with the GWT SDK, but unfortunately it doesn't seem to support serving static files on _escaped_fragment_, and it doesn't use the #! token.)

If you want to use web.xml, then I think it won't work with a servlet-mapping, because the url-patterns ignore the GET parameters. (I'm not 100% sure whether there is another way to make this possible.)
You could of course map Showcase.html to a servlet, and in that servlet decide what to do based on the GET parameter "_escaped_fragment_". But it's a little bit expensive to call a servlet just to serve a static page for the majority of the requests (not too bad, but still; you could set cache headers if you're sure that the page doesn't change).
Or you could have an Apache or something in front of your server - but I understand, I wouldn't like to have to do that either. Maybe your JavaEE server (which one are you using BTW?) provides some mechanism for URL filtering before the request gets passed on to the web container - I'd like to know that, too!

Found my answer! The Showcase sample supporting crawlable hyperlinks is in the following branch:
http://code.google.com/p/google-web-toolkit/source/browse/branches/crawlability/samples/showcase/?r=7726
It defines a filter in the web.xml to redirect URLs with the _escaped_fragment_ token to the output of HTML Unit.
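I haven't reproduced that branch's code here, but the decision such a filter has to make can be sketched in plain Java: detect the _escaped_fragment_ parameter and rebuild the #! URL that the headless browser should render. The class and method names below are my own assumptions.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for a crawl filter: maps an _escaped_fragment_ URL back to its #! form.
final class CrawlableUrl {
    private static final String MARKER = "_escaped_fragment_=";

    /** Returns the #! URL to render, or null if this is an ordinary request. */
    static String toHashBangUrl(String url) {
        int idx = url.indexOf(MARKER);
        if (idx <= 0) {
            return null;
        }
        char separator = url.charAt(idx - 1);
        if (separator != '?' && separator != '&') {
            return null; // marker is not a query parameter name
        }
        String base = url.substring(0, idx - 1); // drop the '?' or '&' before the marker
        String encoded = url.substring(idx + MARKER.length());
        try {
            return base + "#!" + URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

A servlet filter would call this with the full request URL. A null result means the request passes through untouched, and anything else is the page to load in HtmlUnit and return as static HTML.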
