Website Data Crawler, posting data and traversal

Although there have been quite a few posts on this topic, my question is a little more specific.
I need to parse a few websites and, once that is done, send some data to them. For example, say website A offers a search box; I need to feed data to it programmatically. The resulting page might differ depending on the target site's updates.
I want to code such a crawler. Which tools or languages would be best for this?
I am already well-versed in Java and C, so anything based on these would be really helpful.

I would suggest using PhantomJS. It's completely free, and Windows, Linux and Mac are supported.
It is very simple to install.
It is very simple to execute from the command line.
The community is pretty big, and solving straightforward problems is trivial.
It uses JavaScript as its scripting language, so you should be fine, I guess, with your Java background.
You'll have to get familiar with the DOM structure. You cannot write a crawler without knowing it (even if you choose a completely visual solution).
Everything depends on how frequently the crawler should be executed: PhantomJS is great for long-term jobs. If you're looking for a one-time solution, use something visual instead, like iMacros. It can be used inside Firefox as an extension (free of charge), and there's a standalone version that costs money.
Cheers
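Whatever tool is chosen, the core step in the question (find the search form on a page, fill it in, and prepare the request to send back) looks roughly like the sketch below. It uses only Python's standard library rather than PhantomJS, Java or C, and it does not run JavaScript on the page. The page URL, the sample HTML, and the field names `token` and `q` are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin


class FormParser(HTMLParser):
    """Collects the action, method and named <input> fields of the first <form>."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Keep default values (e.g. hidden tokens) so they are sent back.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_submission(page_url, html, overrides):
    """Return (method, absolute target URL, url-encoded form data)."""
    parser = FormParser()
    parser.feed(html)
    data = dict(parser.fields)
    data.update(overrides)  # the values the crawler wants to submit
    return parser.method, urljoin(page_url, parser.action), urlencode(data)


# Hypothetical page content; a real crawler would download this first.
SAMPLE_HTML = """
<form action="/search" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="q">
</form>
"""

method, url, body = build_submission(
    "http://example.com/index.html", SAMPLE_HTML, {"q": "web crawler"}
)
print(method, url, body)
# post http://example.com/search token=abc123&q=web+crawler
```

To actually send the request, the encoded body can be passed as bytes to `urllib.request.urlopen(url, data=body.encode())`. Because the form is re-parsed on every run, changes to hidden fields on the target site are picked up automatically, though a renamed search field would still need a code change.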

Related

MIT-Scratch adding/removing language features

I am seeking a way to allow my non-tech users to specify a workflow and execute it (if anyone is interested, I want them to specify and execute test cases). Visual programming seems a good way to go.
Can I modify the Scratch IDE to remove some categories (such as sound, motion, etc), and add some of my own? Ditto for individual keywords (obviously, I then need to handle new keywords).
I have Googled, but the answer is not immediately apparent.
[Update] I have just found Google's Blockly:
"Blockly was influenced by App Inventor, which in turn was influenced by Scratch, which in turn was influenced by StarLogo."
It looks very promising, especially when it says:
Exportable code. Users can extract their programs as JavaScript, Python, PHP, Dart or other language so that when they outgrow Blockly they can keep learning.
Open source. Everything about Blockly is open: you can fork it, hack it, and use it in your own websites.
Extensible. Make Blockly fit with your application by adding custom blocks for your API and remove unneeded blocks and functionality.
One possible snag is that it is browser-based, but if my management don't like that, then I can create a dummy Windows-based app consisting of little but a TWebBrowser component.
I will investigate and report back - unless someone else posts an acceptable answer first.
The short answer to your initial question is: no. You can't customize Scratch, or not to the extent that you seem to ask/want.
That said, look at:
Custom blocks.
Scratch extensions.
Variants like Snap.
Using Scratch's source code in Squeak to make your own variant.
Other systems inspired by Scratch, like App Inventor and Blockly.
Only the first two are compatible with the Scratch website.
A word on the site: depending on your purpose with Scratch, the exchange between users is a powerful part of Scratch. Check how cooperation is supported, for example through the backpack. There's also a good wiki that documents much of the above.

HTML form parsing and submission in Clojure (as per Hpricot)?

I am trying to find a high-level Clojure library for making HTTP and HTTPS requests, parsing out forms and links from responses and then POST-ing updated forms or following links. Ideally something that would automatically handle redirects and cookies (i.e. sessions). That is, I'd like to find something whereby my code can as closely as possible mimic a user driving a webapp from a browser, without the browser.
A number of years ago we used Hpricot and Ruby for a similar task, but I'd prefer to do this in Clojure if at all possible. From memory - and I haven't used Hpricot for years - we were able to do all this with minimal effort: we were able to concentrate on the 'what' of driving the application, not the 'how'.
I found clj-http https://github.com/dakrone/clj-http but this seems to be one step lower-level than I'm looking for (no form parsing) - although it is based on Apache HttpComponents http://hc.apache.org/httpcomponents-client-ga/ which does seem to expose a nice, fluent, API for forms http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fluent.html.
The question "Screen scraping in Clojure" covers screen-scraping, and there are several good suggestions there, but nothing that really addresses the above.
HTTP Kit http://www.http-kit.org/client.html looks like it would be a great foundation for the above but doesn't do form parsing or session management (as far as I can see).
Currently I'm veering toward using the Apache HttpComponents Java library directly from Clojure. Can anyone suggest a better - perhaps more idiomatic Clojure - alternative? Or anything that they found worked well in similar circumstances? My goal is to write the minimal amount of code quickly to investigate a problem with a web service. This is not production code. Saving time, rather than getting an 'ideal' solution, is my main concern.
[The background is that I am trying to mimic certain forms of user behaviour in order to first reproduce and then track down an intermittent bug in a large body of legacy Java/EJB code. However, the problem only seems to occur about once in several thousand POSTs. (The suspicion is that it is some form of caching issue.) The existence of the problem is easy to detect after the fact, however.]
Have you looked at the Enlive library yet? Here is a good tutorial on it.
You seem to really have 2 parts here. The first part is (1) a Selenium-like client, which drives (2) a webserver.
For part (1), either Selenium, Enlive, or something similar will allow you to simulate a browser to submit data, read the responses, and respond from there. For part (2), it seems you just need a regular Clojure web framework such as Ring/Compojure (older & simpler) or Pedestal (newer & more powerful).
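Language aside, the behaviour the question asks for (keep cookies across requests, follow redirects, pull links out of responses) can be sketched as follows. This uses Python's standard library, not Clojure, Enlive or Selenium, and is only meant to show the shape of such a client. The base URL and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser
from http.cookiejar import CookieJar
from urllib.parse import urljoin
from urllib.request import HTTPCookieProcessor, build_opener


class LinkParser(HTMLParser):
    """Collects the absolute URL of every <a href="..."> in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page they came from.
                self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, html):
    parser = LinkParser(base_url)
    parser.feed(html)
    return parser.links


def make_session():
    """Return an opener that stores cookies in `jar` and follows redirects.

    build_opener() installs the default redirect handler on its own; the
    cookie processor is the only extra piece needed for session handling.
    """
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_session()
links = extract_links(
    "http://example.com/app/",
    '<a href="login">Log in</a> <a href="/help">Help</a>',
)
print(links)
```

Every call to `opener.open(url, data=...)` made through the same opener then shares one cookie jar, so a loop of several thousand POSTs runs inside a single logged-in session, much as a browser would.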

Using PHP within a SCORM course

I am creating an online course via SCORM for Moodle 1.9. I have decided that using SCORM is really the only way to design the course the way I want. It doesn't need to work in any other LMS, just this one, so I am not worried about compatibility across the board.
What is the best way to use PHP files within a SCORM course?
I have tried linking directly to an outside PHP file, which does "work", but returning to the SCORM files is kind of weird. I have to add in this obscure path:
complete course
Although I have not done much testing, the above technically works. But I would like to know what the best practices are when it comes to using anything other than HTML and JavaScript in SCORM. Please help!
Cheers
I understand your goal, but one of the key points of SCORM is portability (the "S" stands for "Shareable"). As such, server-side code is specifically prohibited in SCORM courses, because you never know which server-side code an LMS will support.
SCORM requires a pure client-side solution, with JavaScript handling course-to-LMS communication. There is no 'best way' to use PHP (or any other server-side language) in a SCORM course, and by using PHP, your course will not be SCORM-conformant.
If you want to use server-side code, perhaps you should consider AICC instead of SCORM.
I agree 100% with pipwerks' answer above. However, if you are absolutely sure you want to do the kind of "hack" you are describing, I would suggest linking your SCORM content to your PHP file through AJAX requests, so that the experience is as transparent to the user as possible. Of course, you would also need to be able to upload and execute PHP files from inside the SCORM package's folder (I'm not sure if Moodle supports this).

Prerequisites for learning how to code a Facebook application

I am having a really tough time doing this. I haven't done much web development before. I know HTML and that's it. What else do I need to learn in order to start coding a good Facebook app (specifically, technologies like JS, PHP, MySQL, etc.)? I have already created a test application using a web hosting site. Time is of the essence here.
I would pick up Head First Servlets and JSP or Head First PHP & MySQL. They will be your best friends in the process of writing PHP or JSP. Then of course you should learn what's in the Facebook API and how to use it.
You mentioned that time is of the essence, but I simply can't think of a reasonable way to learn PHP or JSP without putting in the grunt work (which takes time). You should definitely do some reading unless you want to make an app that's a hacker's paradise.
If you're already familiar with Java, then JSP will feel more natural and probably easier/faster to learn.
P.S. Sorry for the multiple edits... I felt like my wording was a bit off.

Are there any medium-sized web applications built with CGI::Application that are open-sourced?

I learn best by taking apart something that already works and figuring out why each decision was made the way it was.
Recently I've started working with Perl's CGI::Application framework, but found I don't really get along with the documentation (too little information on how best to structure an application with it). There are some examples of small applications on the cgi-app website, but each is mostly structured to demonstrate one small feature and consists mostly of code that one would never actually use in production. Other examples are massively huge and would require way too much time to dig through. And most of them are just things that run on cgiapp but aren't open source.
As such, I am looking for something that has most of the base functionality, like user logins, DB access, some processing, etc.; is actually used for something; but is not so big that it would take hours even to set it up.
Does something like that exist, or am I out of luck?
CGI::Application tends to be used for small, rapid-development web applications (much like Dancer, Maypole and other related modules). I haven't seen any real examples of open-source web apps built on top of it, though perhaps I'm not looking hard enough.
You could look at Catalyst. The wiki has a list of Catalyst-powered software and there are a large number of apps there - poke around, see if you like the look of the framework. Of course, this is Perl, so some of those apps will be using Template::Toolkit, some will use HTML::Mason... still, you'll get a general idea.
Try looking at Miril CMS, although I don't know what state it is in.
I am the same with code, and had the same request. When I did not find a solution, I created my own: https://github.com/alexxroche/Notice
I hope that it is a good solution to this request.
Notice demonstrates:
CGI::Application
CGI::Application::Plugin::ConfigAuto
CGI::Application::Plugin::AutoRunmode
CGI::Application::Plugin::DBH
CGI::Application::Plugin::Session
CGI::Application::Plugin::Authentication
CGI::Application::Plugin::Redirect
CGI::Application::Plugin::DBIC::Schema
CGI::Application::Plugin::Forward
CGI::Application::Plugin::TT
It comes with an example MySQL schema, but because of DBIC::Schema it can be used with PostgreSQL (or anything else that DBIx::Class supports).
I have used Notice in all of my real-life applications since 2007. The version on GitHub is everything except the branding and the content.
Check out the Krang CMS.