If a trans-unit contains text that will go into an e-mail, the line endings need to be CRLF (\r\n). This has to be taken care of when importing/exporting XLIFF files from and to a database. Is there a suitable attribute to set on such a trans-unit? What about extype?
XLIFF specification: http://docs.oasis-open.org/xliff/v1.2/os/xliff-core.html#trans-unit
XLIFF follows XML whitespace rules, which means you'll need to make sure to set xml:space="preserve" on such a trans-unit; otherwise the line breaks may be normalized away. If you need to store anything beyond that, extype might be a possible place, although I don't think that attribute is consistently supported across tools.
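For the CRLF requirement specifically, note that XML parsers normalize a literal CRLF to a lone LF on input, so if the CR must survive a round trip through an XML parser, it has to be written as the character reference &#13;. A minimal sketch of such a trans-unit (the id is made up):

<trans-unit id="mail-body-1" xml:space="preserve">
  <source>Dear customer,&#13;
Thank you for your order.&#13;
</source>
</trans-unit>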
I read the following sentences at the link:
Content authors need to find out how to declare the character encoding
used for the document format they are working with.
Note that just declaring a different encoding in your page won't
change the bytes; you need to save the text in that encoding too.
As I understand it, the characters of the text are stored in the computer as one or more bytes, irrespective of the 'character encoding' specified in the web page.
I understood the quoted text as well, except for the last sentence, shown in bold:
you need to save the text in that encoding too
What does this sentence mean?
Is it saying that the content author/developer has to manually save the same text (which is already stored in the computer as one or more bytes) in the encoding they specified? If yes, how is that done, and why is it needed? If not, what does this sentence actually mean?
When you make a web page publicly available, in the most basic sense you make a text file (located on a piece of hardware you own) public, in the sense that when a certain address is requested you return this file. That file may be saved on your local hardware, or it may not be (dynamic content). Whatever the case, the user accessing your web page is provided a file. Once the user gains possession of the file he should be able to read it; that is where the encoding comes into play. If you have a raw binary file, you can only guess what it contains and what encoding it is in, so most web pages announce the encoding of the file they return alongside the file itself.

This is where the bold text you ask about relates to my answer: if you declare one encoding alongside the file (for example UTF-8) but deliver the file in another encoding (for example windows-1252), the user may see parts of the text garbled or may not see it at all. So if you provide a static file, it must be saved in the correct encoding, that is, the one you declared the file to be in.
As for how to save it in a given encoding: that is highly specific to the way you provide the file. Most text editors provide a means to save a file in a specific encoding, and most tools that generate page content provide convenient ways to emit the file in a form that is easy for the user to decode.
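As a concrete illustration (an assumed setup, not anything from the question): the server typically announces the encoding in a Content-Type: text/html; charset=utf-8 header, and the page often repeats it in the markup. Both are only claims about the bytes:

<!DOCTYPE html>
<html>
<head>
  <!-- This declaration is only a claim; it holds only if the file
       really was saved as UTF-8. -->
  <meta charset="utf-8">
</head>
<body>
  <p>Café</p> <!-- the "é" is the bytes 0xC3 0xA9 only in a UTF-8 file -->
</body>
</html>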
It is just a note, probably added because some users were confused.
The text tells us that one should specify, in some form, the encoding of the file. This is straightforward: a web server usually cannot know the encoding of a file. Note that if pages are delivered from e.g. a database, the encoding could be implicit, but the web treats the file as a first-class citizen, so we still need to specify the encoding.
The note just makes clear that by changing the declared encoding, the page is not transcoded by the web browser. The page will remain byte-for-byte the same; clients (browsers) will simply misinterpret the content. So if you want to change the encoding, you should declare the new encoding, but also save the file (or save and convert) in the expected encoding. No magic will (usually) be done by web servers.
There is no text but encoded text.
The fundamental rule of character encodings is that the reader must use the same encoding as the writer. That requires communication, conventions, specifications or standards to establish an agreement.
"Is it saying that the content author/developer has to manually save the same text(which is already stored in the computer as one or more bytes) in the encoding specified by him/her? If yes, how to do it and why it is needed to do?"
Yes, it is always the case for every text file that a character encoding is chosen. Obviously, if the file already exists, it is probably best not to change the encoding. You do it via an editor option (try the Save As… dialog or its equivalent) or via some library property or configuration.
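For instance, most standard libraries let you pick the encoding at the moment you write the file; a minimal C# sketch (file name and contents made up):

using System.IO;
using System.Text;

class SaveExample
{
    static void Main()
    {
        string html = "<!DOCTYPE html><html><head><meta charset=\"utf-8\"></head>"
                    + "<body><p>Café</p></body></html>";
        // The charset declared above is only true because we also
        // write the bytes as UTF-8 here (false = no byte order mark):
        File.WriteAllText("page.html", html, new UTF8Encoding(false));
    }
}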
"save the text in that encoding too"
Actually, it's usually the other way around. You decide on the encoding you want or need to use, and the HTML editor or library updates the contents with a matching declaration and any newly necessary character entity references (e.g., does 🚲 need to be written as &#x1F6B2;? Does ¡ need to be written as &iexcl;?) as it writes or streams the document. (If your editor doesn't do that, then get a real HTML editor.)
I'm looking at an XLIFF file and found some weird boxes, and I don't know what they are. (Please see the screenshot.)
Do you have any idea what these weird boxes are?
Thank you very much; I'm looking forward to your reply!
I have never seen that character, but here is how I would go about finding out what it is:
The first thing to do is to check the source and target language of the XLIFF file, which should be defined in the XLIFF header. Perhaps this character is a valid character in either the source or the target language script.
The next step depends on whether you can contact the person who created the XLIFF file. If yes, you can show them what the file looks like for you and ask them if the file has perhaps been garbled during transmission.
If not, you could check the encoding of the XLIFF file. If it is UTF-16, just open the file in a hex editor, find the code point for this character, and look it up on unicode.org. If the file is encoded as UTF-8 open it in Notepad++ (or any other text editor that allows you to change the encoding), convert it to UTF-16, then proceed as described above.
If you don't know the encoding of the file, it becomes a matter of guessing. You can look at some other <trans-unit> elements (assuming that there are more than this one in your XLIFF file): if they contain other extended characters and those are displayed correctly, your editor has probably guessed the right encoding, and you can convert to Unicode and look up the character code. Different text editors have different ways of guessing encodings: try a few.
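Once you can read the file as Unicode, a small throwaway program can also dump the code points directly, which saves the hex editor step; a rough C# sketch (it assumes the file really is UTF-8 and well-formed; adjust the Encoding argument otherwise):

using System;
using System.IO;
using System.Text;

class DumpCodePoints
{
    static void Main(string[] args)
    {
        string text = File.ReadAllText(args[0], Encoding.UTF8);
        for (int i = 0; i < text.Length; i++)
        {
            // Combine surrogate pairs into a single code point.
            int cp = char.ConvertToUtf32(text, i);
            if (char.IsSurrogatePair(text, i)) i++;
            Console.WriteLine($"U+{cp:X4}");
        }
    }
}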
It's possible that those characters are the result of an encoding conversion error; the garbled output that results is commonly called mojibake.
It's also possible this is some sort of emoji or unusual glyph that's not rendering correctly in your editor. That would be unusual, but given that it appears to be a UI string, it is possible.
I want to, using the DocX library [https://docx.codeplex.com/], convert a .docx document to use a different font. Does anybody know how to do that? The sample projects are very sparse, and the documentation is nonexistent.
I find, too, that often there are extraneous spaces in documents, and I want to iterate over all these until there are never two contiguous spaces. I can do this in a loop, I guess, replacing " " (2 spaces) with " " (1 space) until " " (2 spaces) is no longer found.
However, I also want to remove superfluous line breaks that sometimes occur when copying and pasting text into a document. I can do it "manually" (in LibreOffice; I'm not sure how it's done in MS Word), as I got an answer to this question:
(select "Regular Expressions" and then replace "$" (without the quotes) with a space)
...but how programmatically, with DocX?
Additionally, in some cases I want to ADD line breaks/"paragraph returns" where there are legitimate line breaks between the end of one paragraph and the start of another, but no extra line to separate them visually. According to this:
...I can add a paragraph/line break to a legitimate line break by searching for "$" and replacing it with "\n\n".
This does work, too (manually, in Libre Office); but again...how to do this with the DocX library?
It appears that not all of this is possible with the current version of the DocX library you are using. If it is not exposed in documentation, the functions might as well not exist, and you should not be using undocumented features.
There is a much more mature library available, however, called the "Open XML SDK", that can do everything you need.
The correct way to change a font, regardless of whether you are doing it with the document editor or writing a program to manipulate these files, is to change the appropriate text's style attribute, or to change the definition of the style in use.
You should never, ever, ever, ever directly change the font of any text. Personally, I think that the 'font type' and 'font size' menus should be removed entirely from Word/LibreOffice/etc., and only be accessible inside a 'change style properties' dialog; the only reason to directly apply a font is if you are actually providing an example of a particular typeface under discussion!
See How to: Replace the styles parts in a word processing document (Open XML SDK) from the MSDN documentation for a description of the way that works.
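I can't speak for DocX, but with the Open XML SDK the style-based approach boils down to editing the definitions in styles.xml; a rough sketch (the font name is arbitrary, and it assumes the document already has a styles part):

using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

class FontRestyler
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument doc =
                   WordprocessingDocument.Open(args[0], true))
        {
            Styles styles = doc.MainDocumentPart.StyleDefinitionsPart.Styles;
            // Re-point every style that names a font; runs that merely
            // reference a style pick up the change automatically.
            foreach (RunFonts fonts in styles.Descendants<RunFonts>())
            {
                fonts.Ascii = "Liberation Serif";
                fonts.HighAnsi = "Liberation Serif";
            }
            styles.Save();
        }
    }
}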
To search and replace text, the applicable MSDN page is How to: Search and replace text in a document part (Open XML SDK). For specifically replacing multiple spaces with a single space, there are numerous results on Google that should all work to at least some degree.
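For the whitespace cleanup, a hedged sketch along those lines (one regex pass instead of the replace-until-stable loop):

using System.Text.RegularExpressions;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

class WhitespaceCleaner
{
    static void Main(string[] args)
    {
        using (WordprocessingDocument doc =
                   WordprocessingDocument.Open(args[0], true))
        {
            foreach (Text text in
                     doc.MainDocumentPart.Document.Body.Descendants<Text>())
            {
                // Collapse any run of two or more spaces to one.
                text.Text = Regex.Replace(text.Text, " {2,}", " ");
            }
            doc.MainDocumentPart.Document.Save();
        }
    }
}

One caveat: Word frequently splits a paragraph's text across several runs, so spaces that straddle a run boundary need extra handling. The same applies to the superfluous line breaks, which live as separate Break elements or as empty paragraphs in the XML rather than as "\n" characters in the text.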
We have some XML file that contains an invalid character, and the program reports neither which file it is, nor which line number or character offset. It would be a few seconds' work to fix the problem if I could just search for exactly that character, but I cannot find how to express a Unicode character in the file search (or at least I assume that is the problem, since the search returns nothing).
Neither 0x1e nor \u001e seem to match anything.
[EDIT] I mean, I can still change the code, and eventually find which file it is by catching the Exception, and using some kind of script/tool to find where exactly the character is, but I do believe it should be possible to search with Unicode in Eclipse, and that is what I am asking in this question.
It may be a problem with the character encoding.
As you're going to need to perform a global, workspace-wide search to find the character, you'll probably need to set the global text file encoding:
Preferences -> Workspace -> Text file encoding
This option may be under the 'General' section in Eclipse, depending on your setup and installed plugins etc.
Ensure that the encoding is set to UTF-8.
You will also need to escape the Unicode character in the search pattern, like so:
\u2665
(which I see you have tried)
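If the in-IDE search still comes up empty, the script route mentioned in the question is only a few lines; a throwaway C# sketch that reports file, line, and column for the offending code point (the directory argument and the *.xml pattern are assumptions):

using System;
using System.IO;

class FindChar
{
    static void Main(string[] args)
    {
        const char needle = '\u001E';
        foreach (string path in
                 Directory.EnumerateFiles(args[0], "*.xml",
                                          SearchOption.AllDirectories))
        {
            string[] lines = File.ReadAllLines(path);
            for (int i = 0; i < lines.Length; i++)
            {
                int col = lines[i].IndexOf(needle);
                if (col >= 0)
                    Console.WriteLine($"{path}:{i + 1}:{col + 1}");
            }
        }
    }
}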
I have built a set of scripts, part of which transform XML documents from one vocabulary to a subset of the document in another vocabulary.
For reasons that are opaque to me, but apparently non-negotiable, the target platform (Java-based) requires the output document to have 'encoding="UTF-8"' in the XML declaration, but some special characters within text nodes must be encoded with their hex Unicode value - e.g. '”' must be replaced with '&#x201D;' and so forth. I have not been able to acquire a definitive list of which chars must be encoded, but it does not appear to be as simple as "all non-ASCII".
Currently, I have a horrid mess of VBScript using ADODB to directly check each line of the output file after processing, and replace characters where necessary. This is painfully slow, and unsurprisingly some characters get missed (and are consequently nuked by the target platform).
While I could waste time "refining" the VBScript, the long-term aim is to get rid of that entirely, and I'm sure there must be a faster and more accurate way of achieving this, ideally within the XSLT stage itself.
Can anyone suggest any fruitful avenues of investigation?
(edit: I'm not convinced that character maps are the answer - I've looked at them before, and unless I'm mistaken, since my input could conceivably contain any unicode character, I would need to have a map containing all of them except the ones I don't want encoded...)
<xsl:output encoding="us-ascii"/>
Tells the serialiser that it has to produce ASCII-compatible output. That should force it to produce character references for all non-ASCII characters in text content and attribute values. (Should there be non-ASCII in other places like tag or attribute names, serialisation will fail.)
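For instance, as part of an identity transform (just an illustration of where the declaration goes):

<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="us-ascii"/>
  <!-- Identity template: copy everything; the serialiser takes care
       of turning non-ASCII text into character references. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>

One caveat: the XML declaration will then read encoding="us-ascii" rather than encoding="UTF-8". Since ASCII is a subset of UTF-8 the bytes are still valid UTF-8, but if the platform insists on the literal string 'encoding="UTF-8"', the declaration would need a post-processing tweak.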
Well, with the XSLT 2.0 you have tagged your post with, you can use a character map; see http://www.w3.org/TR/xslt20/#character-maps.
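A hedged sketch of the idea, placed inside the stylesheet element, mapping just the '”' from the question (note the double escaping: the replacement string is the literal text &#x201D;):

<xsl:character-map name="escape-specials">
  <!-- Emit U+201D as the literal text of its character reference. -->
  <xsl:output-character character="&#x201D;" string="&amp;#x201D;"/>
  <!-- ...one xsl:output-character per character the platform rejects. -->
</xsl:character-map>
<xsl:output encoding="UTF-8" use-character-maps="escape-specials"/>

This also answers the worry in the edit: the map only needs entries for the characters that must be escaped; everything else passes through untouched.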