Translation of XHTML page with MathML - Unicode

We have a few XHTML pages with MathML in them, all generated using Amaya. We have a requirement to translate them into different languages, but Amaya doesn't seem to support Unicode text encoding. Right now we plan to replace the text in the XHTML manually.
I would be happy to know of other possible ways of implementing this translation process. The translation should preserve the structure of the MathML.

Use XML to create a translation dictionary that includes the Math entities. The translation can then be done programmatically, using XSLT and Amaya, or E4X and AS3.
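Here is a minimal Java sketch of that dictionary-lookup approach, using the JDK's DOM parser. The file names, the in-memory map and its entries, and the exact-match rule are all assumptions; a real implementation would load the dictionary from your XML translation file:

    import java.io.File;
    import java.util.Map;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import org.w3c.dom.Document;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    public class XhtmlTranslator {
        private static final String MATHML_NS = "http://www.w3.org/1998/Math/MathML";

        // Hypothetical dictionary; in practice, load it from your XML translation file.
        private static final Map<String, String> DICT = Map.of(
                "Introduction", "Einführung",
                "Theorem", "Satz");

        public static void main(String[] args) throws Exception {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // needed to recognize the MathML namespace
            // Skip fetching the XHTML DTD from the W3C (Xerces-specific feature).
            dbf.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            Document doc = dbf.newDocumentBuilder().parse(new File("page.xhtml"));

            translate(doc.getDocumentElement());

            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.transform(new DOMSource(doc), new StreamResult(new File("page.translated.xhtml")));
        }

        // Replace text everywhere except inside MathML elements.
        private static void translate(Node node) {
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && MATHML_NS.equals(node.getNamespaceURI())) {
                return; // leave every <math> subtree untouched
            }
            if (node.getNodeType() == Node.TEXT_NODE) {
                String key = node.getTextContent().trim();
                if (DICT.containsKey(key)) {
                    node.setTextContent(DICT.get(key));
                }
            }
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                translate(children.item(i));
            }
        }
    }

Because the walk stops as soon as it enters the MathML namespace, every math expression survives the translation untouched, and the identity Transformer writes the result back out in UTF-8.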


Can a CSS file be used to generate PDF output?

Does DITA-OT support generation of PDF output with CSS customizations? I think it supports PDF generation using Apache FOP.
I generate both HTML and PDF output and want to use CSS.
Thanks...
The DITA Open Toolkit does not come with default support for using CSS to create PDF. But it can be done. Here is general info on a few ways to do it, to give you an idea:
If you have a late-model version of the Oxygen XML editor, you can use the transformation scenario called DITA Map PDF - based on HTML5 & CSS. This is probably the easiest way to go. If you want to have this capability on a server, there is an extra charge. See Oxygen PDF Chemistry for more info: https://www.oxygenxml.com/chemistry-html-to-pdf-converter.html
The XML Rocks DITA OT plugin, which requires a commercial PDF processor, one of these: Antenna House Formatter, PDFReactor, Vivliostyle or Prince. https://github.com/xmlrocks/dita-ot-pdf-css-page
Do it yourself. One way I have done this is to create normal XHTML output from the DITA OT, and then use a PDF processor and CSS to transform the XHTML to PDF. I have used Antenna House, but other commercial PDF processors (see above) can work also. You should make the XHTML all in one file (all DITA topics merged into one file) by adding this attribute to the <map> element: <map chunk="to-content">

Which mode should I use for RDFa code in emacs?

For all my typing tasks I use emacs.
Which mode should I use for RDFa code?
The nearest thing I can find is n3-mode-for-emacs, but there are some small differences.
From Wikipedia:
RDFa (or Resource Description Framework in Attributes) is a W3C Recommendation that adds a set of attribute-level extensions to HTML, XHTML and various XML-based document types for embedding rich metadata within Web documents.
Since RDFa lives inside HTML and XML attributes, it makes sense to use an HTML or XML mode, depending on the format of the base file.
nxml-mode works very well for XML and XHTML; html-mode or web-mode would be a good choice for XHTML and HTML.

Using RazorEngine for Text and Html emails

I'm using RazorEngine v3.3 to create emails from template files (the emails are sent using the SendGrid Web API). I implemented a base template so I can use my own Html helpers by overriding the WriteTo() method as shown here.
My problem is that my emails have both an Html part and a Text part. For the Html templates, I use Razor's default implementation, which Html-encodes the @Model values, because some of the data comes from user input.
However, I can't use the same implementation for the Text part, as the Html encoding will not be interpreted when the email is read.
So the way I see it I have 3 options:
use @Raw(Model) in all my text-based templates to bypass Html encoding
create another base template for my Text templates which doesn't Html-encode
modify my Html base template so the WriteTo() method doesn't encode anything
The 1st solution seems the safest, but I have about 50 text-based templates to go through, and it reduces readability.
The 2nd solution seems the cleanest to me; however, wouldn't this prevent the use of the cache, since I would constantly be calling Razor.SetTemplateService() to reassign the right base template?
What would you recommend doing? Thanks

ASP.NET web application to convert PDF to Word

Is there any clear and proper process to convert a PDF file into a Word file, with all formatting and images, in an ASP.NET web application?
The best way to do that is by using OCR. It will recognize the text and the images in the PDF file, and then you can save them to a DOC file. I know a third-party toolkit named LEADTOOLS that should help you meet your requirements, since it supports the ASP.NET environment. You can check their Online OCR Demo.
Also, you can check their website for more information, or contact their support team.
PDF is a presentational format where all the content is placed at absolute positions. There are no paragraphs or other structured elements (unless it is a Tagged PDF). Technically, every word can be output character by character in any order, yet visually it would still look like normal text. Thus, a proper conversion to Word requires content recognition or some kind of OCR (e.g. ABBYY FineReader).
There are some paid components on the market that can do text extraction, and some that convert pages to images (obviously not a desirable approach for converting to Word).
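As an illustration of the text-extraction half, here is a sketch using the open-source Apache PDFBox library (2.x) in Java; the library choice and file name are my own assumptions, not something mentioned above:

    import java.io.File;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PdfTextDump {
        public static void main(String[] args) throws Exception {
            // Extracts plain text only: layout, fonts and images are lost,
            // which is exactly why a full PDF-to-Word conversion needs
            // content recognition or OCR on top of this.
            try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
                System.out.println(new PDFTextStripper().getText(doc));
            }
        }
    }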

Best Way to Parse HTML to XML

Essentially, I currently have an iPhone app that can query and parse an XML file on my server. Right now, I currently have to manually update and upload my XML file every morning so my users can have the updated information. I would like to automate this process, which would essentially entail parsing various websites (NYTimes, iAmBored.com, etc), outputting the relevant information from each of these websites to an XML file, and uploading that file to my server.
Does anyone know the best way to accomplish this (parsing HTML into an XML file)? Since I am a beginner, I'm not sure which languages this requires or what the best approach is.
Thanks a lot in advance!
You can try to translate the HTML to XHTML (XHTML is based on XML, so it's XML with some rules defined in a DTD).
You can also try to parse the HTML directly with an SGML parser (just as XHTML is based on XML, HTML is based on SGML).
If the content you need to scrape is XHTML, then you can easily use the XSLT language to transform the original content into whatever you need inside the XML you provide to your users.
Otherwise, any kind of scraping-and-XML-producing solution will be fine; every programming language has support for such things. You could use XPath to select the elements you need from the page and then save them to the output file.
Can you get what you need from the RSS/Atom feeds? That will simplify things greatly because they are XML rather than HTML and can be parsed by a standard XML parser. Of course, descriptions embedded inside RSS feeds will be HTML, so depending on your application, that may be when you need to parse HTML.
XSLT is a domain-specific programming language designed for processing XML, but you can also use any programming language that includes an XML parser for the task.
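As a small sketch of that approach, the Java program below uses the JDK's built-in parser and XPath to pull headline links out of a page and print them as a simple XML fragment. The file name and the //h2/a selector are made-up assumptions, and it presumes the input is already well-formed; see TagSoup below for HTML as it exists in the wild:

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class HeadlineScraper {
        public static void main(String[] args) throws Exception {
            // Namespace handling is omitted for simplicity.
            Document page = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(new File("page.xhtml"));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Hypothetical selector; adapt it to the site you are scraping.
            NodeList links = (NodeList) xpath.evaluate("//h2/a", page, XPathConstants.NODESET);

            // Emit a minimal XML document (real code should escape the text).
            System.out.println("<items>");
            for (int i = 0; i < links.getLength(); i++) {
                Element a = (Element) links.item(i);
                System.out.printf("  <item href=\"%s\">%s</item>%n",
                        a.getAttribute("href"), a.getTextContent().trim());
            }
            System.out.println("</items>");
        }
    }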
TagSoup - Just Keep On Truckin'
...a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: poor, nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML. TagSoup also includes a command-line processor that reads HTML files and can generate either clean HTML or well-formed XML that is a close approximation to XHTML.
Also, Taggle, a TagSoup in C++, is available now.
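For what it's worth, the classic TagSoup recipe in Java is very short: because TagSoup exposes messy HTML through a standard SAX XMLReader, an identity transform is enough to write it back out as well-formed XML. The file names here are assumptions:

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.sax.SAXSource;
    import javax.xml.transform.stream.StreamResult;
    import org.ccil.cowan.tagsoup.Parser;
    import org.xml.sax.InputSource;

    public class HtmlToXml {
        public static void main(String[] args) throws Exception {
            // TagSoup is a SAX XMLReader, so any JAXP tool can consume it directly.
            Parser tagsoup = new Parser();
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.transform(
                    new SAXSource(tagsoup, new InputSource(new FileInputStream("input.html"))),
                    new StreamResult(new FileOutputStream("output.xml")));
        }
    }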