Web Scraping with Scala [closed] - scala

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 10 years ago.
Just wondering if anyone knows of a web-scraping library that takes advantage of Scala's succinct syntax. So far, I've found Chafe, but this seems poorly-documented and maintained. I'm wondering if anyone out there has done scraping with Scala and has advice. (I'm trying to integrate into an existing Scala framework rather than use a scraper written in, say, Python.)

First, there is a plethora of HTML scraping libs on the JVM; all you need to do is pimp one of them (the "pimp my library" pattern).
The four I have used are:
HtmlUnit - Will emulate the browser and even run Javascript
Jericho - Preserves formatting and ideal if you want to edit the scraped HTML
NekoHtml
JSoup -- I was unsure whether this works with Scala; it might
I have used Selenium, but never for scraping. Scala has a wrapper around Selenium.
I would recommend pimping an existing Java library over some half-baked Scala lib.

I don't have a Scala-specific recommendation, but for the JVM in general I've had good success with:
JSoup: you can use CSS selectors to "scrape" the document. Really nice to work with.
Tagsoup: use it to convert your input HTML to XML, then use XML processors to "scrape" the result.
The Tagsoup route actually works quite well with Scala since Scala's built-in XML "dsl" is pretty concise (if you can forgive its perf issues and occasional API weirdness). Also, Tagsoup will handle nearly any garbage document you give it. It also has niceties like built-in understanding of many HTML entities that other SAXParsers will choke on as being undeclared.
tl;dr - JSoup + CSS selectors if possible, otherwise Tagsoup + scala XML. If slow is ok, tagsoup first, then jsoup the result.
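The "XML processors" half of the Tagsoup route can be sketched with nothing but the JDK's XML APIs. This is only an illustration in Java, under a few assumptions: the HTML has already been turned into well-formed XML (the step Tagsoup would perform, which is not shown here), the elements carry no XML namespace, and the names XmlScrape and extractLinks as well as the sample markup are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: "scrapes" link targets out of already well-formed XML.
class XmlScrape {

    // Returns the href value of every <a> element, in document order.
    static List<String> extractLinks(String xml) {
        try {
            // Parse the XML text into a DOM tree (not namespace-aware by default).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Select all href attributes of <a> elements anywhere in the tree.
            NodeList hrefs = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate("//a/@href", doc, XPathConstants.NODESET);

            List<String> links = new ArrayList<>();
            for (int i = 0; i < hrefs.getLength(); i++) {
                links.add(hrefs.item(i).getNodeValue());
            }
            return links;
        } catch (Exception e) {
            // Malformed input is reported as an unchecked exception to keep callers simple.
            throw new IllegalArgumentException("could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample markup, standing in for what an HTML-to-XML converter would produce.
        String xml = "<html><body><a href=\"/one\">One</a>"
                + "<p><a href=\"/two\">Two</a></p></body></html>";
        System.out.println(extractLinks(xml));
    }
}
```

From Scala the same JDK classes can be called directly, or the XPath step can be replaced by Scala's own XML support as described above.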

I'd recommend Goose: https://github.com/jiminoc/goose
It's not as general-use as you might need but if you are scraping article content from popular sites, it may work out of the box. It also provides a framework for you to work from if you want to extend their code to cover other sites.

Related

Which technology stack to use for car pooling over web and mobile [closed]

Closed 11 years ago.
I want to start working on a project to build an intranet website and mobile app that people working in my office can use for carpooling. The basic idea is that anyone looking for someone to carpool with posts that they are going from A to B at time X. People can then reply to it.
I've narrowed down my options to Scala+Lift+MongoDB or Node.JS+Redis/MongoDB+HTML5. I don't know which one is better or worse for the problem I have mentioned. I'm also looking at developing mobile apps for the same application, so that people can send carpool requests from their phones, and I'm looking for a stack that complements the mobile development as well.
I know there are various solutions for this, but I'm looking to learn something new and exciting and have fun while developing it.
The only requirement that influences the technology stack is "looking to learn something new and exciting and have fun while developing it" (just as broofa said).
However I have no idea how he came from that requirement to JavaScript.
Yes, it is more marketable.
Yes, there are way more people that know it.
Yes, you'll need it anyway.
But is JavaScript in any way interesting as a language? Not much, I'd say. Any nice unique (or at least rare) concepts? To me it looks like programming in Java, but not being allowed to use anything but HashMaps + java.lang.*
Scala on the other hand combines functional and object oriented in an extremely interesting way. It has a strong type system which enables tricks that probably will make your head spin.
And even if you don't use the really fancy stuff you have a super powerful language to work with.
So if you want to learn: Go with Scala
The capabilities of the technology stack here are probably unimportant. Both Scala and Node will allow you to implement a web interface / HTML5-based application for mobile devices.
So it boils down to your other requirement, "learn something new and exciting". If you're not familiar with node or JavaScript, I'd suggest Node because ...
JS is a much more marketable skill than Scala (currently)
If you want other people to work on this code, more people know JS than Scala.
You are only learning one new language instead of two. (You have to learn JS in either case to implement the front end. With Node, that expertise carries over to the server as well.)
... and even if you are familiar with JS, working with Node will make you a much better JS developer.
My $.02. You should get somebody who knows something about Scala to chime in here however.

recommend a server side technology for gwt (beginner) [closed]

Closed 10 years ago.
I am developing a GWT project and am looking for an appropriate server-side technology.
It should be open source and support user login (without OpenID) with password recovery, etc.
It seems that the de facto standard would be Spring + Hibernate. However, I am unfamiliar with both of them and understand that the learning curve (especially for Spring) is very steep. GWT was quite easy to learn using GOOG's excellent online tutorials, but the Spring equivalents seem to impose lots of configuration files and require a deeper understanding of its internals.
So I am looking for a simpler server-side technology to deploy my GWT app. I am definitely prepared to learn a new framework if necessary, but not something that would take me 2 months just to understand the fundamentals...
Any ideas?
Spring Roo should get you started with a GWT app in no time. It even has scaffolding (like Rails) for easily generating code for views and models. Here is a good video that introduces Roo and here is a guide for the mandatory 10-minute application that Rails pioneered years ago.
Also a cool thing about Roo is that it gets you started quickly while still doing everything correctly (i.e. integrate with Spring security, Hibernate, Maven, ...).
Edit: You could also try Vaadin (tutorial here), although I am unsure whether it may be too simplistic for your needs.
You could have a look at Google AppEngine + GWT. It provides you a full development environment:
http://code.google.com/webtoolkit/doc/latest/tutorial/appengine.html
This post also provides some information on how to get started with Google Plugin for Eclipse, which supports GWT, Google AppEngine, etc.
I second using Google App Engine, especially the Java version as it integrates so easily with GWT. I am using it in this way right now. App Engine has well written and complete docs, similar to those of GWT.
A simple way to integrate the build processes is to (1) use the GWT code generator to generate the standard project tree and ant build process and then (2) read this article on integrating GAE/Java with GWT:
https://developers.google.com/web-toolkit/doc/latest/tutorial/appengine

What is the best language in which to write an expert system? [closed]

Closed 10 years ago.
Is LISP or something like Jess the best choice? I'm interested in writing a program that makes a suggestion based on users' answers. Computational considerations are not really a factor; this is pretty much a pattern-matching engine. Also, I would like to make an app for this and put it up on the web.
UPDATE: I would like to put this up on a blog or website and let people use it from there. I guess my question then is: is there a particular inference engine that works with the .NET family, or PHP, or something to that effect? What are some of the pros and cons of each option?
Step 1. Pick an inference engine. There are many choices. Here's a list: http://en.wikipedia.org/wiki/Expert_system#Shells_or_Inference_Engine
Step 2. Use the language that interfaces with the inference engine.
You'll be much happier leveraging an inference engine for expert systems work.
"I would like to put this up on a blog or website and let people use it from there"
Trivial.
"is there a particular inference engine that works with the .NET family, or PHP, or something to that effect?"
Doesn't matter.
Here's the confusion. Your "web site" and your "inference application" have NOTHING to do with each other. Nothing.
Your web site can be done in any tool set you can find. It doesn't matter.
Your inference application can be done in any tool set you can find. It doesn't matter.
Your web site will invoke the inference application through any API that makes sense. The lowest common denominator in APIs (the reason that none of these choices matter) is to do this:
Write your inference application as a stand-alone command line tool.
Write your web application to run the stand-alone tool, collect the output and turn the output into an HTML page.
Note that this multi-process implementation may be faster and make better use of multi-core processors. It forces the OS to manage the web server (Apache HTTPD, for example), the web application and the expert system as potentially three separate, parallel processes.
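The "run the stand-alone tool, collect the output, turn it into HTML" step could look roughly like the following. This is a sketch in Java, not a recommendation of any particular stack: the class and method names are invented for the example, and it assumes (hypothetically) that the inference tool prints one suggestion per line on standard output and exits with code 0 on success.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical glue code between a web application and a command-line inference tool.
class InferencePage {

    // Runs the stand-alone tool as a separate OS process and returns everything it printed.
    static String runTool(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // merge stderr into stdout so nothing is lost
        Process process = builder.start();
        // Read all output before waiting, so the child never blocks on a full pipe.
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tool exited with code " + exitCode + ": " + output);
        }
        return output;
    }

    // Escapes the characters that would otherwise be read as markup.
    static String escapeHtml(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Turns the tool's output (assumed: one suggestion per line) into an HTML list.
    static String toHtml(String toolOutput) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (String line : toolOutput.split("\\R")) {
            if (!line.isBlank()) {
                html.append("  <li>").append(escapeHtml(line.trim())).append("</li>\n");
            }
        }
        return html.append("</ul>\n").toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: InferencePage <tool> [tool arguments...]");
            return;
        }
        // The tool's command line is passed through unchanged.
        System.out.print(toHtml(runTool(List.of(args))));
    }
}
```

Because the only contract between the two sides is the tool's command line and its text output, the inference engine can be written in whatever language its shell requires, independently of the web side.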
You can also take a look at Prolog. SWI-Prolog (http://www.swi-prolog.org) is very complete and has an HTTP support library included (http://www.swi-prolog.org/pldoc/package/http.html). This paper might be helpful in using SWI-Prolog on the web ("SWI-Prolog and the web" http://dare.uva.nl/record/285350)
And, you can find a tutorial on building expert systems with prolog at: http://www.amzi.com/ExpertSystemsInProlog/
You will hear a lot of subjective opinions here, since few people have experience in more than one language writing expert systems.
I can recommend Common Lisp, as there is quite some literature and existing code available in this language, and it is a very powerful language and not too difficult to learn (read "Practical Common Lisp" by Peter Seibel). Of course, any new high level language requires some effort to learn. For the web application, you can use, e.g., Hunchentoot and CL-WHO, and there are a lot of database bindings (I like Postmodern and CL-SQLite).
I would suggest CLIPS and its .NET port, clipsnet:
http://sourceforge.net/projects/clipsnet/

Scala Tools & Libraries Wish List [closed]

Closed 10 years ago.
Which tools or libraries do you wish existed in the Scala ecosystem?
Are there any existing ones you wish were greatly improved?
In no particular order:
A Scala version of the Clojure Incanter libraries would be very handy indeed, and could probably be even nicer to use than Clojure's.
It would also be exceptionally cool if the parallel version of the 2.8 collection library had been ready for (today's!) 2.8 release, rather than waiting for 2.8.1. Even cooler would be something with the power and feel of the 2.8 collection library which offloaded calculations to something like Hadoop.
Standard library support for software transactional memory would be very nice.
The IntelliJ IDEA plugin for Scala is an amazing piece of work, but (unsurprisingly) still lags behind Java in some annoying ways, particularly in on-the-fly error reporting.
There need to be some standard shims built so that various "enterprise" libraries (Spring/Hibernate/Ibatis/Freemarker, etc.) can use Scala objects without scattering @BeanProperty annotations around and without using Java collections objects.
A single lib for time, money and physical units would be cool
Scala Swing should be more complete (and more consistent)
It would be nice if the DBC lib for wrapping JDBC access were finished
A Scala 3D engine would be awesome. Simplex3D and Sgine are on the way, but it's a long way...
I think it is important not to pack too much functionality into Scala. It is really easy to expand Scala on your own, so let's do that for a while. Then, when some framework emerges as a winner, it might be shipped with Scala.
For those of you who have suffered the result of the JCP committee, please remember the disasters of premature standardizations.
That said, I have my own wish list :-) I would like a simple DSL for Date. The one from DPP's book would do.
Off the top of my head:
A good scala <-> JDBC bridge.
A good mocking framework.
Scala wrapper for Spring DI.

Favorite programming brainstorming activity? [closed]

Closed 10 years ago.
As an artist and musician, I often want to sit down and just let the code roll like a piece of free-form poetry, but I've found that doesn't work as well as when I have a set goal in mind. I've been experimenting lately with setting up tiny, fun goals for myself, not unlike how an artist would sketch a quick still-life, but I wonder...
What do others do when they want to code for fun, without the bondage of an already-committed project?
Design work, I find, flows much easier than just coding. I find that coding is often more of just implementation of a good design; I really like to just sit down with a pad of paper and a pen (and likely a bottle of wine) and work out an interesting design.
Project Euler is where I'm having fun right now. I can go at my own pace and work on the problems that interest me. I can also work in any language I choose.
Write documentation when coding doesn't come easy - coding will quickly seem much more appealing!
Going for a walk outside.
I tend to map my idea or build a structure in a MindMapping tool like MindMeister. And it's great for a team because it can be edited in real time by multiple persons!
I like to pick up a new language and learn how to express ideas in it. This usually has the benefit of showing me what I like and don't like about the languages I currently use. I usually pick some little tool project I've been wanting to do. Using the new-language angle gets me motivated.
My most recent 'new language' is Scala; in this case it will likely become a language I use.
I like writing on whiteboards. Great for db diagrams, task lists, feature lists, (other lists,) random ideas, notes, etc. (db diagrams being the biggie for me)
Python is great for just getting things going on an idea and having the language (usually) behave like you would expect.
While it may have its drawbacks, it sounds like a great fit for what you are describing.
So to answer your question, the Python Challenge is entertaining and often gets me thinking about little things that would be fun to code, probably because it exposes you to different types of problems.
I like to code.
I like to find something interesting, code it and then see it works.
It does not have to be a project per se; it's good enough if it does something, like using the Google API to get Picasa albums, changing the song in iTunes or getting details of the current iTunes song, automating the download of a document from a web site that is behind a login and requires cookies and all that stuff, a data parser in Python, a simple app on Mac, a Core Data application, Google Code Jam problems, topcoder.com problems ...
I like to learn new features of some language or some new language/technology/patterns/tool :-)
Usually I will work in Photoshop for a while. Get creative and try to come up with a new design that's not constrained by any code. Maybe even find something inspiring on the web for some new design ideas... then try to implement the design in code. That's the fun and challenging bit.
Use the REPL.
You figure out broadly the sort of thing you need to do - what APIs you need to use, what data structures you need to handle - and then prod them interactively until they start making sense. A ton of languages I use now have REPLs: Ruby, Python, Scala, Java (BeanShell, or JRuby/Jython etc.), C# ('csharp'), PHP (Facebook have made a REPL for it), Smalltalk (GNU gst) and, obviously, LISP/Scheme.