HTML Tidy, cleaning up MS Word markup

I have 10 years of archived article data, most of it riddled with MS Word save-as-HTML markup like <p class="MsoNormal">.
First of all, is HTML Tidy up to the task of stripping out MS Word-generated markup, or do I need to take another approach?
Secondly, the first few years of articles are globbed together by month and stored in DB as text storage type. I'd dearly love to break these out into individual articles so I can make the site more easily searched (i.e. not bring up an entire month of news when a search term/phrase matches). The only clear pattern I have to work with to isolate the articles is the article title (in bold, between 16-20px) and the article date, generally 10px; both title and date appear prior to article body text. Is there a way to detect the <h1>-ness or <small>-ness of markup when I do not have exact markup to match against?
This may be next to impossible to answer, but just in general, what approach would you take to this unenviable task? ;-) I'm on the JVM in Scala, but could do the cleanup job on a LAMP stack as well.
Ideas appreciated!

If I were you, I'd use my favorite HTML::Parser kit for Perl. It works very well for complex and fuzzily stated problems like yours.
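If you'd rather do the cleanup on the LAMP side instead, the same kind of fuzzy matching is possible with PHP's built-in DOM extension. Below is a minimal sketch, not a drop-in solution: the file name, thresholds, and pt-to-px conversion are assumptions, tuned to the 16-20px titles and roughly 10px dates described in the question.
<?php
// Minimal sketch: strip Mso* classes, then flag title/date candidates
// by font size. Names and thresholds are illustrative assumptions.
$wordHtml = file_get_contents('archive.html');

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // Word's HTML is rarely well-formed
$doc->loadHTML($wordHtml);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Drop MsoNormal-style classes.
foreach ($xpath->query('//*[@class]') as $el) {
    if (strpos($el->getAttribute('class'), 'Mso') === 0) {
        $el->removeAttribute('class');
    }
}

// Heuristic for "<h1>-ness": a run styled at 16-20px is probably a title;
// a run around 10px is probably an article date.
foreach ($xpath->query('//*[@style]') as $el) {
    if (preg_match('/font-size:\s*([\d.]+)(px|pt)/i', $el->getAttribute('style'), $m)) {
        $size = (float) $m[1];
        if (strtolower($m[2]) === 'pt') {
            $size *= 96 / 72;       // rough pt-to-px conversion
        }
        if ($size >= 16 && $size <= 20) {
            echo "possible title: " . trim($el->textContent) . "\n";
        } elseif ($size <= 11) {
            echo "possible date: " . trim($el->textContent) . "\n";
        }
    }
}
Running something like this over each month-blob should surface candidate split points (a title followed by a date), which you can then cut on to recover individual articles.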

Multi-line commenting in VB

Something that has always burned me up in programming, not just VB, is how inefficient it is to make multi-line comments. I'm not exactly a neat freak, but I do like comments to all be about the same length, around 80 characters including leading whitespace. But to do this, I have to manually control how long the comment lines are. And the really frustrating part is when a change of only a few words requires an unreasonable amount of rewrapping work.
I have found many questions on StackOverflow asking about multi-line commenting, but none to actually address this feature.
Wouldn't it make sense to have a commenting feature in VB, Eclipse, etc. that enters a mini word-processing mode, allowing simple features like word wrap that would format the comment automatically? Is there one available that I'm just missing?
Or am I just being lazy? But if it is a good idea, how can it be suggested to Microsoft, Eclipse.org, and others?
The only way to do multi-line comments in VB.NET is to use a lot of single-line comments.
Really, the only option you have is the single tick (') in front of a line.
You can use Ctrl+K, Ctrl+C and Ctrl+K, Ctrl+U to comment or uncomment selected lines of text. In C# you can use /* ... */ to comment an entire block of code.

How to maintain notes organized with org-mode?

I have a bunch of papers to read and take notes on. The problem is I don't have much time to spend looking for a way to organize my note-taking system. Emacs org-mode seems like a quite powerful solution for this, and pretty straightforward.
I run into another problem, though: how can I keep my notes organized in a single file, in a way that lets me rapidly access all of them?
Since you're short on time, you might do well to start with a simple system in which you can capture the notes you need to take, and worry later about organization. Consider the following:
* Title of a paper
** First section name
- A note
- Another note
** Second section name
- Yet another note
- A fourth note
- A fifth note
* Title of another paper
** First section name
- Yet more notes
** Second section name
- &c., &c.
Using paper titles as top-level section headings makes it easy to navigate among papers with isearch; C-s Title of a paper RET brings you to the section containing all your notes on that paper. From there, you can search for a section title, or just use TAB on headings to fold and unfold until you're looking at what you want.
Unless I've misunderstood your requirement, that should give you a pretty quick and straightforward way to dive in and start taking your notes, without losing navigability. That'll also give you an opportunity for some initial, shallow exploration of the problem domain; then, once you've gotten past the current glut of work and have time to think about how you want your note-taking system to work, you can explore the problem more deeply, using org-mode's quick outline rearrangement tools as needed to turn the scheme you've got into the scheme you need.

Lightweight doxygen html output

Is there a more lightweight HTML backend for doxygen, one that does not fill the page with tons of divs and tables? Looking at the CSS file, the output seems quite bloated. It would be possible to write another backend; I'm asking whether one already exists.
Reasons why I need this
It makes it easier to integrate the dox with the rest of the website.
I use hyphenate.js to make my "Related Pages" look good, but that script needs to know which tags it should use. This is much easier with less bloated markup.
Doxygen lacks complete documentation of its output markup, making it necessary to reverse engineer the 1k-line CSS file with the Firefox Web Developer tool in order to modify it. Less bloated markup means less need for documentation, and makes the documentation easier for Dimitri to maintain.
Less bloated markup makes the pages more portable.
With doxygen you can export your data in HTML, LaTeX, XML (which you can later parse as you want), RTF, man page, or DocBook format.
The HTML output supports a custom header, footer, and stylesheet (via the HTML_HEADER, HTML_FOOTER, and HTML_STYLESHEET options), which might be what you want. You can rewrite those and adjust the output as you like.
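For example, a Doxyfile excerpt along these lines points doxygen at your own files (the file names here are placeholders):
# Use your own header, footer, and stylesheet for the HTML output.
GENERATE_HTML   = YES
HTML_HEADER     = my_header.html
HTML_FOOTER     = my_footer.html
HTML_STYLESHEET = my_style.css
Running doxygen -w html my_header.html my_footer.html my_style.css generates the default versions, which you can then edit as a starting point.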
If nothing satisfies you, then you might start thinking about manually parsing one of the outputs above with your own scripting language and generating the desired format yourself (if that suits you), or taking over control of the output generation directly via the doxygen sources (https://github.com/doxygen/).
Sources: http://www.doxygen.nl/manual/output.html
What you need to do really depends on exactly what you want to end up with.
There are input filters, e.g. ftp://ftp.rsa.com/pub/dsg/public/doxygen/doxyfilt.pl, and output filters, e.g. http://www.bigsister.ch/doxygenfilter/doxygenfilter.html. Writing a customized filter should do exactly what you want. Using existing code and putting up with its limitations will be faster, and may provide ideas for writing your own program (if you wish to do so).
You can try http://www.dirtymarkup.com/ with the output that you object to, and see if one of the tools it suggests will "clean" the code up enough to your liking without removing too much functionality (i.e. the ability to click on links and expand/collapse sections).
If you really want it raw, try html2text (https://pypi.python.org/pypi/html2text); you can then 'wrestle it back' with txt2html (http://txt2html.sourceforge.net/). That will strip it bare and yet give you back some minimal HTML functionality (preserving links).
There are many HTML-to-text converters, in many languages; use a search engine to find the one most suitable for you.
Here is a list of tools from a reputable site: http://www.w3.org/Tools/html2things.html
Here is a list of alternatives for converting languages to HTML: http://www.w3.org/Tools/Prog_lang_filters.html. More info here: http://www.w3.org/Tools/Filters.html

Translating longer texts (view and email templates) with gettext

I'm developing a multilingual PHP web application, and I've got long(-ish) texts that I need to translate with gettext. These are email templates (usually short, but still several lines) and parts of view templates (longer descriptive blocks of text). These texts would include some simple HTML (things like bold/italic for emphasis, probably a link here or there). The templates are PHP scripts whose output is captured.
The problem is that gettext seems very clumsy for handling longer texts. Longer texts would generally have more changes over time than short texts — I can either change the msgid and make sure to update it in all translations (could be lots of work and very error-prone when the msgid is long), or I can keep the msgid unchanged and modify only the translations (which would leave misleading outdated texts in the templates). Also, I've seen advice against including HTML in gettext strings, but avoiding it would break a single natural piece of text into lots of chunks, which will be an even bigger nightmare to translate and reassemble, and I've also seen advice against unnecessary splitting of gettext strings into separate msgids.
The other approach I see is to ignore gettext altogether for these longer texts, and to separate those blocks in external subtemplates for each locale, and just include the one for the current locale. The disadvantage is that I'm separating the translation effort between gettext .po files and separate templates located in a completely different location.
Since this application will be used as a starting point for other applications in the future, I'm trying to come up with the best approach for the long term. I need some advice for best practices in such scenarios. How have you implemented similar cases? What turned out to work and what turned out a bad idea?
Here's the workflow I used on a very heavily-trafficked site that had several dozen long-ish blocks of styled textual content, translated into six languages:
Pick a text-based markup language (we used Markdown)
For long strings, use fixed message IDs like "About_page_intro_markdown" that:
describes the intent of the text
makes clear that it will be interpreted in markdown format
Have our app render "*_markdown" strings appropriately, making sure to allow only a few safe HTML tags (a sketch of this step follows the list)
Build a tool for translators that:
shows them their Markdown rendered in realtime (sort of like the Markdown dingus)
makes it easy for them to see the now-authoritative base language translation of the text (since that's no longer in the msgid)
Teach translators how to use the new workflow
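As a rough illustration of the rendering step, here is the shape it might take. This is not the original code; Parsedown and its safe mode are stand-ins for whatever Markdown renderer and tag whitelist you choose:
<?php
// Sketch only: render a "*_markdown" message ID via gettext + Markdown.
require 'vendor/autoload.php';   // e.g. composer require erusev/parsedown

function t_markdown(string $msgid): string
{
    $markdown = gettext($msgid);   // translators edit Markdown, not HTML

    $parser = new Parsedown();
    $parser->setSafeMode(true);    // escape raw HTML in the translation
    return $parser->text($markdown);
}

echo t_markdown('About_page_intro_markdown');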
Pros of this workflow:
Message IDs don't change all the time
Because translators are editing in a safe higher-level syntax, it's hard to mess up the HTML
Non-technical translators found it very easy to write in Markdown, vs. HTML
Cons of this workflow:
Having static unchanging message IDs means changes in the text need to be transmitted out of band (which we'd do anyway, as long text can raise questions about tone or emphasis)
I'm very happy with the way this workflow operated for our website, and would absolutely recommend it, and use it again. It took a couple of days to get started, but it was easy to build, train, and launch.
Hope this helps, and good luck with your project.
I just had this particular problem, and I believe I solved it in an elegant way.
The problem: We wanted to use Gettext in PHP, and use primary-language strings as keys for translations. However, for large blocks of HTML (with h1, h2, p, a, etc...) I'd either have to:
Create a translation for each tag with content.
or
Put the entire block with tags in one translation.
Neither of those options appealed to me, so this is what I did:
Keep simple strings ("OK","Add","Confirm","My Awesome App") as regular Gettext .po entries, with the original text as the key
Write content (large text blocks) in Markdown, and keep it in files.
Example files would be /homepage/content.md (primary / source text), /homepage/content.da-DK.md, /homepage/content.de-DE.md
Write a class that fetches the content files (for the current locale) and parses them. I then used it like:
<?=Template::getContent("homepage/content")?>
However, what about dynamic large text? Simple. Use a templating engine. I decided on Smarty, and used it in my Template class.
I could now use templating logic... within Markdown! How awesome is that?!
Then came the tricky part..
For content to look good, at times you need to structure your HTML differently. Consider a campaign area with 3 "feature boxes" beneath it. The easy solution: Have a file for the campaign area, and one for each of the 3 boxes.
But I could do better than that.
I wrote a quick block parser, so I could write all the content in one file and then render each block separately.
Example file:
[block campaign]
Buy this now!
=============
Blaaaah... And a smarty tag: {$cool}
[/block]
[block feature 1]
Feature 1
---------
asdasd you get it..
[/block]
[block feature 2] ...
And this is how I would render them in the markup:
<?php
// At the top of the document...
// Class handles locale. :)
$template = Template::getContent("homepage/content", [
    "cool" => "Smarty variable! AWESOME!"
]);
?>
...
<title><?=_("My Awesome App")?></title>
...
<div class="hero">
    <!-- Template data already processed! :) -->
    <?=$template->renderBlock("campaign")?>
</div>
<div class="featurebox">
    <?=$template->renderBlock("feature 1")?>
</div>
<div class="featurebox">
    <?=$template->renderBlock("feature 2")?>
</div>
I'm afraid I can't provide any source code, as this was for a company project, but I hope you get the idea.
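For readers who want a starting point anyway, here is a guessed reconstruction of such a Template class, written from scratch for illustration (the real implementation belongs to the company). Smarty's string resource and Parsedown are assumed choices, and the locale resolution is stubbed out:
<?php
// Illustrative reconstruction, not the company code.
class Template
{
    private array $blocks = [];

    public static function getContent(string $path, array $vars = []): self
    {
        $locale = 'da-DK';                      // resolve from the request/session
        $file = "content/$path.$locale.md";
        if (!is_file($file)) {
            $file = "content/$path.md";         // fall back to the source text
        }
        $raw = file_get_contents($file);

        // Run Smarty first, so tags like {$cool} are expanded everywhere.
        $smarty = new Smarty();
        $smarty->assign($vars);
        $raw = $smarty->fetch('string:' . $raw);

        // Split out the [block name] ... [/block] sections.
        $self = new self();
        preg_match_all('/\[block ([^\]]+)\](.*?)\[\/block\]/s', $raw, $m, PREG_SET_ORDER);
        foreach ($m as $match) {
            $self->blocks[trim($match[1])] = trim($match[2]);
        }
        return $self;
    }

    public function renderBlock(string $name): string
    {
        // Each block is Markdown; render it to HTML on demand.
        return (new Parsedown())->text($this->blocks[$name] ?? '');
    }
}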
gettext wasn't really designed for translating large pieces of text.
fwiw I've included basic HTML (strong, a, etc) in gettext strings as I was confident our translators knew what they were doing (mostly right) and that the translations would be well tested.
I've tried the approach of breaking up the text into one string per paragraph, partly because it looks odd if there's one paragraph of English in the middle of the text. When one of those strings changed, we had to wait for translations before releasing a new version, which slowed us down. On the plus side, it's easy for translators to see which part of the text has changed. This approach worked well for the one application I've tried it with.
Splitting some text out into external locations also worked, but it caused management overhead: rather than just a .po file or two, there was a whole bunch of other text that had to be manually compared to the English version and updated accordingly. This is doable if you remember to provide notes to your translators explaining where and what the differences from the English version are.
I'm still not sold on either approach myself.

Lotus Notes Diff Tool

Is there any diff tool for Lotus Notes which allows comparing scripts, design elements, and documents?
I see this is an old question, and most of the other answers are a little outdated now, so I thought I would add some hopefully valuable information for those who should stumble upon this now.
In Domino Designer, open either the Navigator or Package Explorer (Window menu -> Show Eclipse Views). Here you can expand databases/templates to see the design elements they contain. Select two or three elements (CTRL-click). They can be in different databases or the same database. Right click on one of the elements and select Compare with -> Each other.
You can also compare two databases element by element by selecting two databases/templates, right-clicking and selecting Compare with -> Each other. You will then get the differences between the two databases listed. You will be able to see which elements differ between the two databases, and which elements exist in one database but not the other. By double-clicking on a differing element, you will open a diff tool which lets you see differences line by line, and you can easily copy changes from left to right or right to left.
There is a tool from TeamStudio called Delta: http://www.teamstudio.com/products/delta.html
If all else fails (and by "all else" I mean the often ridiculous corporate procurement system) you can always do an export to DXL (or a Design Synopsis for code alone) and use any decent text editor with a diff function. It's not TeamStudio Delta, but it will get you where you want to go.
There is a free tool from OpenNTF which does document comparisons:
http://www.openntf.org/Projects/pmt.nsf/ProjectLookup/Compare%20Notes%20Documents
Ytria also has a product which, among other things, will compare data documents (I don't believe it compares design elements).
http://www.ytria.com/website.nsf/WebPageRequest/Solutions_scanEZ_specen
And, I believe Martin Scott (http://www.martinscott.com) has a similar product which compares documents.
DDE (Domino Designer on Eclipse) lets you compare design elements natively, the same way as the search. It's pretty efficient (faster than a DXL export) and it's free.
I had a discussion on my blog a little while back about this:
http://rosshawkins.net/archive/2009/12/24/notesdomino-refactoringanalysis-tools.aspx
However what I've ended up doing in the past is exporting the design to the filesystem and using standard text tools (WinMerge and SublimeText for me personally) to do what I need.
Being able to do the raw dump is something that was added with the Eclipse based designer, and isn't overly obvious, but you can read more about it here:
rosshawkins.net/archive/2010/01/20/searching-the-contents-of-notesdomino-design-elements.aspx
(link mangled as my rep is too low to post 2 links in one post yet!)
Teamstudio Delta is really nice. However, it might kill you with too many details. As Ross pointed out, Domino Designer 8.5 can use the diff tool inherited from Eclipse. You could also head over to http://www.openntf.org and look for the DXLMagic project. It can generate a report that shows differences (including code) between 2 databases (typically a template and a variation of it). It is not as complete as Delta, but shows the essentials. It's free and source is included. (Disclaimer: I wrote it.)
This is what I do. I run a design synopsis of the database using the Notes Designer and dump it to a text file. You can actually split the synopsis out into different objects like agents, forms, views, etc. Then you can run UNIX/Linux/Mac commands to compare the elements. By doing this you find out what code is active, and you end up with completely documented source code. It takes a lot of csplit and a few sed commands.
Version 12.0.1 has such a tool as part of the server. Look for comparedbs.ntf and designsynopsis.ntf on the Domino server.