Excluding code sections from EPIC/tidyperl source formatting - perl

In a large codebase of an application written in Perl, there is a lot of HTML and JS written inline in the Perl files.
$html_str = qq^ <A LOT OF HTML> ^;
All the code development is done using the Eclipse IDE and the EPIC plugin. For ease of merging/diffs et al., I am looking for a way to tell the EPIC source formatter not to apply formatting rules to the HTML and JS that is written inline. Is there a way to do this?

HTML embedded in the code is a red flag. That's stuff a designer is going to want to tweak, and so should be able to get at easily. The HTML should be split out into template files. I realize this doesn't answer your question, but it does solve your problem.
Otherwise, use perltidy to handle Perl code formatting. It won't mess with content inside strings and certainly isn't going to try and format HTML.


MATLAB Results to HTML

Is there a really awesome way to organize results in MATLAB and create a set of HTML pages of the data?
I want to take a bunch of different runs and visualize the data and results in a way that is easy for people to flip through but I was hoping to do better than starting from scratch and writing raw HTML/XML code to disk.
You might like to take a look at the publish-to-HTML functionality in MATLAB. It's extremely easy: you just add some mark-up to the comments in a MATLAB script, click the publish button or use the publish command, and you get a nice HTML (or Word, PowerPoint or LaTeX) file containing the code and output of the script, with your marked up comments converted to nice paragraphs of explanatory text. Here are some links to the documentation:
Publishing MATLAB Code
Publishing Code from the Editor (video)
and to a blog article containing three enhancements to publishing, which display data as HTML tables in your published HTML:
HTML tables
Hope that helps!

Is there currently a way to get Emacs muse-mode to output RTF, ODT or DOC format?

Muse is a special mode in emacs that can be used as a wiki. It has multiple output formats like static HTML pages, LaTeX, PDF etc.
But sometimes I need to output something that less tech-savvy people can edit/correct and send back to me.
I think either RTF, ODT or DOC would do the trick.
My problem is that Muse only supports HTML, LaTeX, Texinfo and XML out of the box.
Implementing my own output format is currently not an option, as I cannot program in elisp and learning it would take too much time.
I searched for a way to convert to or use markdown as pandoc can convert to RTF. But I found only the following discussion that does not solve my problem.
My last resort would be to convert to HTML and then to RTF, ODT or DOC but AFAIK the results are far from great.
I would appreciate a solution that can be automated (with custom scripts).
I think that importing HTML into MS Word (or a compatible word processor) should work. As I remember, OpenOffice has some scripting support, so you can launch it and run some commands inside it.
Another way is to write an RTF export backend. It shouldn't be too complicated, although there could be a lot of details to take into account. If you go this way, please write to the Muse mailing list, and I'll try to help you.
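Since the question asks for something scriptable, here is a minimal sketch of the HTML-then-convert route, assuming Muse has already published an HTML file and that pandoc (mentioned in the question) is installed and on the PATH. The function names and the file name are made up for illustration, and docx stands in for the legacy DOC format, which pandoc does not write. How good the converted document looks is not something this sketch can promise.

```python
import subprocess
from pathlib import Path

# Output formats this sketch allows; pandoc writes all three.
TARGETS = ("rtf", "odt", "docx")


def build_pandoc_command(html_path, target):
    """Return the pandoc argument list for converting html_path to target."""
    if target not in TARGETS:
        raise ValueError("unsupported target format: " + target)
    source = Path(html_path)
    output = source.with_suffix("." + target)
    # -s asks for a standalone document, which RTF output needs
    # in order to be a complete file rather than a fragment.
    return ["pandoc", "-f", "html", "-t", target, "-s",
            "-o", str(output), str(source)]


def convert(html_path, target):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(build_pandoc_command(html_path, target), check=True)
```

Calling convert("notes.html", "odt") would then write notes.odt next to the published HTML file.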

firefox addon development and Unicode

So I started developing my Firefox addon.
Most of the work is performed by a referenced JavaScript file.
The problem is that when I edit some of the HTML elements on the page and, say, set their text, it's written as pure gibberish. I am writing the text in Hebrew. I can't for the life of me figure out the reason.
Any ideas?
Javascript strings are already Unicode at runtime. However, you have to make sure that your files are encoded correctly.
Always use utf-8 (without BOM) file encoding for all your js, XUL, DTD, properties files to be sure.
Firefox might try to guess the file character set incorrectly otherwise, and even worse some stuff might not even try guessing the encoding and instead simply always assume utf-8.
Better yet, do not hard-code strings in js/xul, but use DTD/properties files for localization (XUL tutorial, XUL School).
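To check whether the addon's files really are UTF-8 without a BOM, a small script along these lines could help. It is only a sketch: the function name and the three result labels are invented here, and it works on raw bytes so it can be pointed at any js, XUL, DTD or properties file.

```python
import codecs


def check_utf8_no_bom(data):
    """Classify raw file bytes as 'ok', 'bom' or 'not-utf8'."""
    if data.startswith(codecs.BOM_UTF8):
        # Valid UTF-8, but with the byte order mark the advice says to avoid.
        return "bom"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # Probably saved in a legacy code page such as windows-1255.
        return "not-utf8"
    return "ok"
```

For a file on disk, read it in binary mode first, for example check_utf8_no_bom(open(path, "rb").read()).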
Setting non-ASCII text from a correctly encoded script works fine; this snippet, for example, works well for me (on this very page):
document.getElementsByTagName("h1")[0].textContent="русский язык";
(Just fire up the Firefox Web Console)
"Inline" Hebrew embedded in js files might create additional problems because it is right-to-left and bidi sucks, so the localization approach should be preferred.

Best Way to Parse HTML to XML

Essentially, I have an iPhone app that can query and parse an XML file on my server. Right now, I have to manually update and upload my XML file every morning so my users can have the updated information. I would like to automate this process, which would essentially entail parsing various websites (NYTimes, iAmBored.com, etc.), outputting the relevant information from each of these websites to an XML file, and uploading that file to my server.
Does anyone know the best way to accomplish this (parsing HTML to an XML file)? Since I am a beginner, I'm not sure what languages this requires or what the best way to do this is.
Thanks a lot in advance!
You can try to translate the HTML to XHTML (XHTML is based on XML, so it's XML with some rules defined in a DTD).
You can also try to parse the HTML directly with an SGML parser (just as XHTML is based on XML, HTML is based on SGML).
The links are provided as inspiration.
If the content you need to scrape is in XHTML, then you can easily use the XSLT language to transform the original content into what you need inside the XML you provide to your users.
Otherwise, any kind of scraping and XML-producing solution will be fine; every programming language has support for such things. You could also use XPath to select the elements you need from the page and then save them inside the output file.
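As one illustration of a "scraping and XML producing solution", here is a minimal Python sketch that uses only the standard library. It is an assumption-heavy example: it swaps in the lenient html.parser module instead of an SGML parser, XSLT or XPath, it assumes the interesting content sits in h2 elements, and the class, function and output element names are all made up.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class HeadlineCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag="h2"):
        super().__init__()
        self.tag = tag
        self.inside = False
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.inside = True
            self.headlines.append("")

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.headlines[-1] += data


def html_to_xml(html_text, tag="h2"):
    """Return a small XML document listing the text of each matching tag."""
    collector = HeadlineCollector(tag)
    collector.feed(html_text)
    collector.close()
    root = ET.Element("headlines")
    for text in collector.headlines:
        # ElementTree escapes characters such as & when serializing.
        ET.SubElement(root, "headline").text = text.strip()
    return ET.tostring(root, encoding="unicode")
```

The resulting string can be written to a file and uploaded; fetching the page and choosing which elements matter for each site are left out.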
Can you get what you need from the RSS/Atom feeds? That will simplify things greatly because they are XML rather than HTML and can be parsed by a standard XML parser. Of course, descriptions embedded inside RSS feeds will be HTML, so depending on your application, that may be when you need to parse HTML.
XSLT is a domain-specific programming language designed for processing XML, but you can also use any programming language that includes an XML parser for the task.
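If a feed is available, the job reduces to ordinary XML processing. The sketch below uses Python's standard xml.etree.ElementTree on a hand-written RSS 2.0 sample; the sample feed, the function name and the output layout are assumptions, not anything taken from a real site.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written RSS 2.0 sample standing in for a downloaded feed.
SAMPLE_FEED = (
    '<rss version="2.0"><channel><title>Example</title>'
    '<item><title>First</title><link>http://example.com/1</link></item>'
    '<item><title>Second</title><link>http://example.com/2</link></item>'
    '</channel></rss>'
)


def feed_to_items_xml(feed_text):
    """Copy each item's title and link into a flat XML document."""
    root = ET.fromstring(feed_text)
    out = ET.Element("items")
    for item in root.iter("item"):
        entry = ET.SubElement(out, "item")
        ET.SubElement(entry, "title").text = item.findtext("title", default="")
        ET.SubElement(entry, "link").text = item.findtext("link", default="")
    return ET.tostring(out, encoding="unicode")
```

Running the same function over several feeds and merging the results would give one file for the app to download.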
TagSoup - Just Keep On Truckin'
...a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Also, Taggle, a TagSoup in C++, available now

Parsing source of a webpage with Objective-C

Is there a way to parse a website's source on the iPhone to get the URLs of photos on that page? If so, how would you do that?
I'd say go for regular expressions - there is a one-page library that wraps C regexes that you can drop into your project.
I recommend regular expressions. There's a great open source Regex library for Cocoa called RegexKit. For the most part, you can just drop it in your code and it'll "just work".
Getting all the URLs of images wouldn't be too difficult (less than 20 lines of code) if you assume that all images are going to be in <img> tags. You'd just grab all the image tags (something like: <img\s+[^>]+>), then iterate through those matches. For each match, you'd pull out whatever's in the src attribute: src\s*=\s*("|')?\s*([^\s"']+)(\s|"|')
You might need to tweak that a bit, but it shouldn't be too bad.
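To see the two patterns above in action without an Xcode project, here is a quick sketch in Python rather than Objective-C. The patterns are copied from the answer; the function name and the case-insensitive flag are additions for illustration.

```python
import re

# Step 1: grab whole <img ...> tags. Step 2: pull src out of each tag.
IMG_TAG = re.compile(r"<img\s+[^>]+>", re.IGNORECASE)
SRC_ATTR = re.compile(r"""src\s*=\s*("|')?\s*([^\s"']+)(\s|"|')""", re.IGNORECASE)


def image_urls(html_text):
    """Return the src value of every <img> tag the patterns can match."""
    urls = []
    for tag in IMG_TAG.findall(html_text):
        match = SRC_ATTR.search(tag)
        if match:
            # Group 2 is the URL itself, without the surrounding quotes.
            urls.append(match.group(2))
    return urls
```

One case where tweaking is needed: an unquoted src that is the last attribute in the tag (as in <img src=a.png>) is not matched, because the pattern expects whitespace or a quote after the URL.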
There is no super easy way. When I had to do it I wrote a libxml2 SAX parser. libxml2 has an html reader that works fairly well with malformed html, and libxml2 is included with the base system.
You could try it using regular expressions, but I wouldn't recommend that. You should have a look at NSXMLParser, assuming the webpage is coded to be XHTML compliant. TouchXML is another good library.
Take a look at Event Driven XML Parsing in the iPhone reference library.
Are you OK with whatever approach you use not picking up images loaded dynamically via JavaScript?
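The event-driven idea in the parser-based answers can be sketched in Python, whose standard html.parser module calls a handler for each start tag, much as a SAX parser does. This is a stand-in for illustration, not the libxml2 or NSXMLParser API; the class name, function name and base URL are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageSrcCollector(HTMLParser):
    """Record the src attribute of each <img> start tag as it is parsed."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from HTMLParser.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page's own URL.
                self.urls.append(urljoin(self.base_url, src))


def collect_image_urls(html_text, base_url=""):
    collector = ImageSrcCollector(base_url)
    collector.feed(html_text)
    collector.close()
    return collector.urls
```

Unlike the regular expression route, this copes with unquoted attribute values, but it still only sees images that appear as img tags in the page source.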
The closest thing I could see working is to parse out any JavaScript imports, load those up too, and then use a regular expression across the whole file looking for anything that ends in ".jpg/.gif/.png" and grab the full URL out from that. The libxml approach would miss out on references to images not in img tags, but it might well be good enough.