Word document to mathml? - ms-word

I have lots of word documents which contain math equations, some tables, and some expressions written in superscript and subscript. Is there a good tool besides MathType for converting my equations to mathml?

If the expressions are entered as mathzones in Word 2007 or later's in-build math formatter then Word includes a transformation to MathML built in, you can select (by an option in the ribbon) that if you cut and paste any math expression then they MathML version will be placed on the clipboard. If you want to bulk convert all the expressions in a document rather than manual cut and paste there is an old blog of mine on the subject at
http://dpcarlisle.blogspot.co.uk/2007/04/xhtml-and-mathml-from-office-20007.html

Related

Copy Microsoft Word text and equations as mathml and text together

I have text with equations in Microsoft Word 2013. I want to copy this text with equations together, but what I need is, text as plain text and equations as mathml.
When I copy mathml only Equation Options -> Copy MathML to clipboard as plain text worked perfectly. However if I copy equation with text, all comes as plain text only.
Is there any way to copy text with MathML?
I don't know of a way to do it without some sort of pre- or post-processing.
Taking the latter first, if your document contains only text and OMML equations, when you copy it, one of the clipboard formats Word provides to the Windows clipboard is HTML. In this HTML, the equations are present, but are coded with OMML, rather than MathML. You would need a script or some other means of converting the OMML into MathML (and presumably removing all the other MS-specific markup that you probably don't want in the final document).
The other way is to pre-process it. MathType will do this with its Convert Equations command on the MathType tab in Word. In the Convert Equations dialog, choose to convert OMML to MathML, and after it's finished, copy the entire text+MathML document to wherever you want to paste it. (Or save it as a .txt file.)

Removing some paragraph marks in a word document

I copy text from PDF files into word 2010 documents using Abbyy conversion software. I find the result will contain many line breaks which are incorrect. Is there any way I can remove any such marks if they are not preceded by either "." or "?" or "!"
I write macros in excel but have no experience of word coding
You could do a search and replace depending if you can find some sort of rules wich you can apply. Mayeby a little screenshot?

pandoc-generated docx misses italic variables in equations

I have the following segment of Markdown with embedded LaTeX equations:
# Fisher's linear discriminant
\newcommand{\cov}{\mathrm{cov}}
\newcommand{\A}{\mathrm{A}}
\renewcommand{\B}{\mathrm{B}}
\renewcommand{\T}{^\top}
The first method to find an optimal linear discriminant was proposed by Fisher
(1936), using the ratio of the between-class variance to the within-class variance
of the projected data, $d(\vec x)$, as a criterion. Expressed in terms of the
sample properties, the $p$-dimensional centroids $\bar {\vec x}_\A$ and
$\bar {\vec x}_\B$ and the $p \times p$ covariance matrices
$S_A = \cov_i ( \vec x_{\A i} )$ and $S_B = \cov_i ( \vec x_{\B i} )$, the
optimal direction is given by
$$
\vec w = \left ( \frac{ S_A + S_B }{2} \right ) ^{-1}
~ ( \bar {\vec x}_\B - \bar {\vec x}_\A ).
$$
When I convert it with pandoc to LaTeX and compile it with xelatex, I get the expected text with nicely rendered math. When I convert it with pandoc to MS Word using
pandoc test.text -o test.docx
and open it in MS Office Word 2007, I get the following:
Only those parts of the equations that are symbols or upright text get rendered correctly, while variable names in italics are replaced by a question mark in a box.
How can I make this work?
In Word 2007, I see a result similar to yours, except that here, I don't see the "question marks in boxes" characters, just space.
If I then take one of the expressions, and use your trick of going to linear display and back, the characters reappear for that expression.
If I save and re-open, the other expressions still do not display correctly, but if I save and look at the XML, I notice that
the Math font has been changed to Cambria Math
additional run parameter (w:rPr) XML specifying the Cambria Math
font has been inserted in many of the runs (w:r) inside the oMath
elements, even in the oMath expressions that do not display
correctly. However, in the oMath expression that now displays
correctly, this extra XML has been applied to every run. In the
others, it has only been applied to some runs (I think I can see the
pattern but I'm running out of time here right now...)
If I manually add the XML to the other runs and re-open the
document, the expressions appear correctly. Or at least, they do in
the one case I have tried.
Since Word 2010 displays the resuls correctly, I can only assume that it does not rely on these explicit font settings, whereas Word 2007 does. This doesn't really help you yet, because altering all those w:r elements would be even harder than what you are already doing. But it is possible that a default style/font needs to be set, either somewhere higher in the XML hierarchy, or perhaps elsewhere in the .zip (perhaps in fontTable.xml or styles.xml). I'm not familiar enough with Word's XML structures to guess what, if anything might be missing, but may be able to have a look tomorrow.
I suppose another possibility is that you just have to have all these extra rPr elements for this to work in Word 2007, which would suggest that pandoc may have been written for Word 2010, not 2007. (I don't know anything about the tool).
As an example, where you have
<m:r>
<m:t>(</m:t>
</m:r>
what you need is
<m:r>
<w:rPr>
<w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math" />
</w:rPr>
<m:t>(</m:t>
</m:r>
I did the following to get rid of the font issue:
Create a new empty word document.
Copy all content to the new document.
Choose Match Source Format.
As discussed above, Windows doesn't have the font Lucida Grande, so substituting the Math Font with Cambria Math should work.
Rename the test.docx to test.zip
vim test.zip and select test/word/settings.xml
find and change Lucida Grande to Cambria Math
save and rename zip to docx. This results in something like this docx.
You can then also supply that file as a sort of docx template to pandoc with the --reference-docx option.

Superscript in markdown (Github flavored)?

Following this lead, I tried this in a Github README.md:
<span style="vertical-align: baseline; position: relative;top: -0.5em;>text in superscript</span>
Does not work, the text appears as normal. Help?
Use the <sup></sup>tag (<sub></sub> is the equivalent for subscripts). See this gist for an example.
You have a few options for this. The answer depends on exactly what you're trying to do, how readable you want the content to be when viewed as Markdown and where your content will be rendered:
HTML Tags
As others have said, <sup> and <sub> tags work well for arbitrary text. Embedding HTML in a Markdown document like this is well supported so this approach should work with most tools that render Markdown.
Personally, I find HTML impairs the readable of Markdown somewhat, when working with it "bare" (eg. in a text editor) but small tags like this aren't too bad.
LaTeX (New!)
As of May 2022, GitHub supports embedding LaTeX expressions in Markdown docs directly. This gives us new way to render arbitrary text as superscript or subscript in GitHub flavoured Markdown, and it works quite well.
LaTeX expressions are delineated by $$ for blocks or $ for inline expressions. In LaTeX you indicate superscript with the ^ and subscript with _. Curly braces ({ and }) can be used to group characters. You also need to escape spaces with a backslash. The GitHub implementation uses MathJax so see their docs for what else is possible.
You can use super or subscript for mathematical expressions that require it, eg:
$$e^{-\frac{t}{RC}}$$
Which renders as..
Or render arbitrary text as super or subscript inline, eg:
And so it was indeed: she was now only $_{ten\ inches\ high}$, and her face brightened up at the thought that she was now the right size for going through the little door into that lovely garden.
Which renders as..
I've put a few other examples here in a Gist.
Unicode
If the superscript (or subscript) you need is of a mathematical nature, Unicode may well have you covered.
I've compiled a list of all the Unicode super and subscript characters I could identify in this gist. Some of the more common/useful ones are:
⁰ SUPERSCRIPT ZERO (U+2070)
¹ SUPERSCRIPT ONE (U+00B9)
² SUPERSCRIPT TWO (U+00B2)
³ SUPERSCRIPT THREE (U+00B3)
ⁿ SUPERSCRIPT LATIN SMALL LETTER N (U+207F)
People also often reach for <sup> and <sub> tags in an attempt to render specific symbols like these:
™ TRADE MARK SIGN (U+2122)
® REGISTERED SIGN (U+00AE)
℠ SERVICE MARK (U+2120)
Assuming your editor supports Unicode, you can copy and paste the characters above directly into your document or find them in your systems emoji and symbols picker.
On MacOS, simultaneously press the Command ⌘ + Control + Space keys to open the emoji picker. You can browse or search, or click the small icon in the top right to open the more advanced Character Viewer.
On Windows, you can a emoji and symbol picker by pressing ⊞ Windows + ..
Alternatively, if you're putting these characters in an HTML document, you could use the hex values above in an HTML character escape. Eg, ² instead of ². This works with GitHub (and should work anywhere else your Markdown is rendered to HTML) but is less readable when presented as raw text.
Images
If your requirements are especially unusual, you can always just inline an image. The GitHub supported syntax is:
![Alt text goes here, if you'd like](path/to/image.png)
You can use a full path (eg. starting with https:// or http://) but it's often easier to use a relative path, which will load the image from the repo, relative to the Markdown document.
If you happen to know LaTeX (or want to learn it) you could do just about any text manipulation imaginable and render it to an image. Sites like Quicklatex make this quite easy. Of course, if you know your document will be rendered on GitHub, you can use the new (2022) embedded LaTeX syntax discussed earlier)
Comments about previous answers
The universal solution is using the HTML tag <sup>, as suggested in the main answer.
However, the idea behind Markdown is precisely to avoid the use of such tags:
The document should look nice as plain text, not only when rendered.
Another answer proposes using Unicode characters, which makes the document look nice as a plain text document but could reduce compatibility.
Finally, I would like to remember the simplest solution for some documents: the character ^.
Some Markdown implementation (e.g. MacDown in macOS) interprets the caret as an instruction for superscript.
Ex.
Sin^2 + Cos^2 = 1
Clearly, Stack Overflow does not interpret the caret as a superscript instruction. However, the text is comprehensible, and this is what really matters when using Markdown.
If you only need superscript numbers, you can use pure Unicode. It provides all numbers plus several additional characters as superscripts:
x⁰¹²³⁴⁵⁶⁷⁸⁹⁺⁻⁼⁽⁾ⁿⁱ
However, it might be that the chosen font does not support them, so be sure to check the rendered output.
In fact, there are even quite a few superscript letters, however, their intended use might not be for superscript, and font support might be even worse. Use your own judgement.

Format painter through Word interop

I want to implement some complex formatting using Word interop. It would be easy if I could copy the formatting from one range and then use the format painter to apply it to another range.
Is such a thing possible through the Word interop libraries?
Visual Studio 2008/Word 2007
From http://social.msdn.microsoft.com/Forums/en-US/vsto/thread/3f905707-b6d8-453d-9807-c7a9f8e9edae/
The best you can do for "quick and dirty" (as opposed to working through the attributes, one-by-one) is to use the Selection.CopyFormat / Selection.PasteFormat methods of the Word object model. These emulate the Format Painter (paint brush) tool in the Word UI; for details look them up in the Help. Play around with that a bit and see if it gives you what you're after. If you need both paragraph and character-level attributes, copy the paragraph formatting first, then the character formatting.