Translating longer texts (view and email templates) with gettext - email

I'm developing a multilingual PHP web application, and I've got long(-ish) texts that I need to translate with gettext. These are email templates (usually short, but still several lines) and parts of view templates (longer descriptive blocks of text). These texts would include some simple HTML (things like bold/italic for emphasis, probably a link here or there). The templates are PHP scripts whose output is captured.
The problem is that gettext seems very clumsy for handling longer texts. Longer texts would generally have more changes over time than short texts — I can either change the msgid and make sure to update it in all translations (could be lots of work and very error-prone when the msgid is long), or I can keep the msgid unchanged and modify only the translations (which would leave misleading outdated texts in the templates). Also, I've seen advice against including HTML in gettext strings, but avoiding it would break a single natural piece of text into lots of chunks, which will be an even bigger nightmare to translate and reassemble, and I've also seen advice against unnecessary splitting of gettext strings into separate msgids.
The other approach I see is to ignore gettext altogether for these longer texts, and to separate those blocks in external subtemplates for each locale, and just include the one for the current locale. The disadvantage is that I'm separating the translation effort between gettext .po files and separate templates located in a completely different location.
Since this application will be used as a starting point for other applications in the future, I'm trying to come up with the best approach for the long term. I need some advice for best practices in such scenarios. How have you implemented similar cases? What turned out to work and what turned out a bad idea?

Here's the workflow I used, on a very heavily-trafficked site that had about several dozen long-ish blocks of styled textual content, translated into six languages:
Pick a text-based markup language (we used Markdown)
For long strings, use fixed message IDs like "About_page_intro_markdown" that:
describes the intent of the text
makes clear that it will be interpreted in markdown format
Have our app render "*_markdown" strings appropriately, making sure to allow only a few safe HTML tags
Build a tool for translators that:
shows them their Markdown rendered in realtime (sort of like the Markdown dingus)
makes it easy for them to see the now-authoritative base language translation of the text (since that's no longer in the msgid)
Teach translators how to use the new workflow
Pros of this workflow:
Message IDs don't change all the time
Because translators are editing in a safe higher-level syntax, hard to mess up HTML
Non-technical translators found it very easy to write in Markdown, vs. HTML
Cons of this workflow:
Having static unchanging message IDs means changes in the text need to be transmitted out of band (which we'd do anyway, as long text can raise questions about tone or emphasis)
I'm very happy with the way this workflow operated for our website, and would absolutely recommend it, and use it again. It took a couple of days to get started, but it was easy to build, train, and launch.
Hope this helps, and good luck with your project.

I just had this particular problem, and I believe I solved it in an elegant way.
The problem: We wanted to use Gettext in PHP, and use primary language strings as keys translations. However, for large blocks of HTML (with h1, h2, p, a, etc...) I'd either have to:
Create a translation for each tag with content.
or
Put the entire block with tags in one translation.
Neither of those options appealed to me, so this is what I did:
Keep simple strings ("OK","Add","Confirm","My Awesome App") as regular Gettext .po entries, with the original text as the key
Write content (large text blocks) in markdown, and keep them in files.
Example files would be /homepage/content.md (primary / source text), /homepage/content.da-DK.md, /homepage/content.de-DE.md
Write a class that fetches the content files (for the current locale) and parses it. I then used it like:
<?=Template::getContent("homepage/content")?>
However, what about dynamic large text? Simple. Use a templating engine. I decided on Smarty, and used it in my Template class.
I could now use templating logic.. within markdown! How awesome is that?!
Then came the tricky part..
For content to look good, at times you need to structure your HTML differently. Consider a campaign area with 3 "feature boxes" beneath it. The easy solution: Have a file for the campaign area, and one for each of the 3 boxes.
But I could do better than that.
I wrote a quick block parser, so I would write all the content in one file, and then render each block seperately.
Example file:
[block campaign]
Buy this now!
=============
Blaaaah... And a smarty tag: {$cool}
[/block]
[block feature 1]
Feature 1
---------
asdasd you get it..
[/block]
[block feature 2] ...
And this is how I would render them in the markup:
<?php
// At the top of the document...
// Class handles locale. :)
$template = Template::getContent("homepage/content", [
"cool" => "Smarty variable! AWESOME!"
]);
?>
...
<title><?=_("My Awesome App")?></title>
...
<div class="hero">
<!-- Template data already processed! :) -->
<?=$template->renderBlock("campaign")?>
</div>
<div class="featurebox">
<?=$template->renderBlock("feature 1")?>
</div>
<div class="featurebox">
<?=$template->renderBlock("feature 2")?>
</div>
I'm afraid I can't provide any source code, as this was for a company project, but I hope you get the idea.

gettext wasn't really designed for translating large pieces of text.
fwiw I've included basic HTML (strong, a, etc) in gettext strings as I was confident our translators knew what they were doing (mostly right) and that the translations would be well tested.
I've tried the approach of breaking up the text into one string per paragraph. Roughly as it looks odd if there's one paragraph of English in the middle of the text. Where one of those strings have changed this has meant that we have had to wait for translations before releasing a new version, which has slowed us down. On the plus side it's easy for translators to see which part of the text has changed. This approach worked well for the one application I've tried it with.
Splitting some text out into external locations also worked, but it caused management overhead, rather than just a .po file or two, there was a whole bunch of other text that had to be manually compared to the English version and updated accordingly. This is doable if you remember to provide notes to your translators explaining where and what the difference was in the English version.
I'm still not sold on either approach myself.

Related

Include non-Racket source in Scribble files

I'm writing documentation, and part of it includes small programs written in languages other than Racket. I can of course include them inline (using #verbatim), but I'd like to be able to at least minimally test/run them, so it'd be much more convenient to store them in separate files and just include the source.
What's the easiest way do that? i.e., I'd like to do something like:
#verbatim|{#include-file{path/to/file.ext}}| (though of course that doesn't quite work) and have the content included, literally. I thought that Ben Greenman's https://gitlab.com/bengreenman/scribble-include-text would do this, but it's behaving oddly, probably because there are character sequences in the file that are not playing well.

Lightweight doxygen html output

Is there for doxygen is a more lightweight HTML backend, which does not fill the page with tons of divs and tables? When looking at the css file, the output seems quite bloated. It possible to write another backend. I ask if there already exists one.
Reasons why I need this
It makes it easier to integrate the dox with the rest of the website.
I use hyphenate.js to make my "Related Pages" look good. But that script needs to know which tags it should use. This is much easier with less bloat markup.
Doxygen lacks complete documentation on how the output markup making reverse engineering using Firefox Web Developer tool necessary to modify the 1k lines CSS file. Less bloat markup makes less need for documentation, and it makes documentation more easy for Dimitri to maintain.
Less bloat markup makes the pages more portable.
With doxygen you can export your data in html format, tex format, XML (which you can later parse as you want), RTF, Man pages or Docbook.
The html output supports a custom header, footer and stylesheet (CSS) with the HTML_STYLESHEET attribute which might be what you want. You can rewrite those and adjust the output as you like.
If nothing satisfies you, then you might start thinking manually parsing one of the outputs above with your own scripting language and generate the desired format by yourself (if that suits you) or taking over control of the output generation directly via doxygen sources (https://github.com/doxygen/)
Sources: http://www.doxygen.nl/manual/output.html
What you need to do really depends on exactly what you want to end up with.
There are 'Input Filters', eg: ftp://ftp.rsa.com/pub/dsg/public/doxygen/doxyfilt.pl and 'Output Filters, eg: http://www.bigsister.ch/doxygenfilter/doxygenfilter.html . Writing a customized Filter should do exactly what you want. Using existing Code and putting up with it's limitations will be faster and may provide ideas for writing your own Program (if you wish to do so).
You can try this Website http://www.dirtymarkup.com/ with the output that you object to and see if one of the Tools it suggests will "clean" the Code up enough to your liking without removing too much functionality (IE: the ability to click on Links and expand / contract Sections).
If you really want it 'raw' try HTML2Text https://pypi.python.org/pypi/html2text and then you can 'wrestle it back' with Text2HTML http://txt2html.sourceforge.net/ . That will strip it bare and yet give you back some minimal HTML functionality (preserve Linking).
There are MANY 'HTML <-> Text' converters, in many Languages, use a Search Engine to find your own Source; one that is most suitable for you.
Here is a List of Tools from a reputable Site: http://www.w3.org/Tools/html2things.html .
Here is an List of alternates to convert Languages to HTML: http://www.w3.org/Tools/Prog_lang_filters.html . More Info here: http://www.w3.org/Tools/Filters.html .

How to style DITA xrefs in structured Framemaker

In the middle of a conversion project from unstructured Framemaker to DITA-compliant, structured Framemaker. Customer wants xrefs to be underlined in the output. Seems straightforward enough, but I've been all over the documentation and all over the internet and can't find what I need. The EDD file shows that we should be using the "link.external" style, which makes perfect sense, but for the life of me I can't figure out where link.external is defined. I've found one piece of documentation in all my searching that sort of comes close to what I need, but the process for styling an xref, according to this document, is long and laborious. I just can't believe that applying a simple style to an element is so hard. Where would I look for the definition of the "link.external" style (or any other style, for that matter)? What obvious point am I missing?
You apply the style in the Cross-Reference panel using building blocks in the cross-reference format(s).
For example:
Section 2.3.4, Volcanoes.
would be styled using the x-ref format below:
Section <$paranumonly>, <Emphasis><$paratext>.
Therefore, to underline all of the x-refs, create an underline character format such as Underline, and use it in a building block within every x-ref format that you have.
<Underline>“<$paratext>” on page\ <$pagenum>
The change only applies to the x-ref, not to the following text.

Automatically converting coding conventions

When working on different projects, with different people and using different frameworks you often struggle to keep your code compliant to their conventions. Some teams get very strict about naming variables/methods/classes and other things the others make holy wars around the topic. I understand them and I fully support, but as any developer I have my own preference I wish I could code with comfortably. This makes me think whether there is a simple solution.
Are there any tools or editors that can automatically convert code to follow a different standard? I imagine there can be no such smart tool that will support naming conversions, so I'm ok with that, but I really wish to see
foreach($lala as $lalala) {
and not
foreach($lala as $lalala)
{
same goes with statements:
if(I_LIEK_COOKIES) {
eat_cookie();
} else {
toss_cookie();
}
and not
if ( I_LIEK_COOKIES ) {
eat_cookie();
}
else
{
toss_cookie();
}
(note the spaces between and around the parenthesis too)
I won't even mention spaces/tabs, I can convert it in my IDE with a shortcut but it would be awesome.
So the things I would like to get customized are
spaces between parenthesis
tabs/spaces and spaces per tab
mustache brackets on the end of the line or on the new line
always attach mustache brackets to any if/ifelse/else/for/foreach etc.
Some of the extras anyone would appreciate:
Line ending style
Delete extra spaces on the line endings (like sublime text 2 can do on save, but would be great for other IDE/editors)
The perfect workflow would be like this:
I pull from git
The code gets converted to my style
I code stuff
I commit and push
Before everything gets pushed(or even commited) code gets converted to the convention style
Of course, someone may wish not to use git, then it would be simply converted when opening and after saving the file but as I understand it's impossible to do outside of an IDE/editor with a tool of some kind.
Has someone stumbled upon something like that? Could not find anything anywhere but tab/space conversion.
P.S. I wish to mention I'm working with PHP/JS so it's prioritized but I code using other languages on my spare time.
You could store configurations (e.g. vim .vimrcs, Eclipse preferences etc.) in each project's version control repository.
However, I think there's a big problem wrt. converting code when pushing/pulling to/from repositories. If someone reports an issue with your code (e.g. exception at line 100), converting the code when pulling from your repository is going to give you a different line 100. I don't think you can practically operate without working on the exact code that your compatriots are working with.

HTML Tidy, cleaning up MS Word markup

Have 10 years of archived article data, most of it riddled with MS Word save-as-html markup like <p class="MsoNormal">
First of all, is html tidy up to the task of stripping out MS Word generated markup, or do I need to take another approach?
Secondly, the first few years of articles are globbed together by month and stored in DB as text storage type. I'd dearly love to break these out into individual articles so I can make the site more easily searched (i.e. not bring up an entire month of news when a search term/phrase matches). The only clear pattern I have to work with to isolate the articles is the article title (in bold, between 16-20px) and the article date, generally 10px; both title and date appear prior to article body text. Is there a way to detect the <h1>-ness or <small>-ness of markup when I do not have exact markup to match against?
This may be next to impossible to answer, but just in general, what approach would you take to this unenviable task? ;-) I'm on the JVM in Scala, but could do the cleanup job on LAMP stack as well.
Ideas appreciated!
If I was you, I'd use my favorite HTML::Parser kit for Perl. If goes very well for complex and fuzzily stated problems like yours one.