pandoc-generated docx misses italic variables in equations - ms-word

I have the following segment of Markdown with embedded LaTeX equations:
# Fisher's linear discriminant
\newcommand{\cov}{\mathrm{cov}}
\newcommand{\A}{\mathrm{A}}
\renewcommand{\B}{\mathrm{B}}
\renewcommand{\T}{^\top}
The first method to find an optimal linear discriminant was proposed by Fisher
(1936), using the ratio of the between-class variance to the within-class variance
of the projected data, $d(\vec x)$, as a criterion. Expressed in terms of the
sample properties, the $p$-dimensional centroids $\bar {\vec x}_\A$ and
$\bar {\vec x}_\B$ and the $p \times p$ covariance matrices
$S_A = \cov_i ( \vec x_{\A i} )$ and $S_B = \cov_i ( \vec x_{\B i} )$, the
optimal direction is given by
$$
\vec w = \left ( \frac{ S_A + S_B }{2} \right ) ^{-1}
~ ( \bar {\vec x}_\B - \bar {\vec x}_\A ).
$$
When I convert it with pandoc to LaTeX and compile it with xelatex, I get the expected text with nicely rendered math. When I convert it with pandoc to MS Word using
pandoc test.text -o test.docx
and open it in MS Office Word 2007, I get the following:
Only those parts of the equations that are symbols or upright text get rendered correctly, while variable names in italics are replaced by a question mark in a box.
How can I make this work?

In Word 2007, I see a result similar to yours, except that here, I don't see the "question marks in boxes" characters, just space.
If I then take one of the expressions, and use your trick of going to linear display and back, the characters reappear for that expression.
If I save and re-open, the other expressions still do not display correctly, but if I save and look at the XML, I notice that:
The Math font has been changed to Cambria Math.
Additional run-property (w:rPr) XML specifying the Cambria Math font has been inserted in many of the runs (m:r) inside the oMath elements, even in the oMath expressions that do not display correctly. However, in the oMath expression that now displays correctly, this extra XML has been applied to every run. In the others, it has only been applied to some runs (I think I can see the pattern, but I'm running out of time right now).
If I manually add the XML to the other runs and re-open the document, the expressions appear correctly. Or at least, they do in the one case I have tried.
Since Word 2010 displays the results correctly, I can only assume that it does not rely on these explicit font settings, whereas Word 2007 does. This doesn't really help you yet, because altering all those run elements would be even harder than what you are already doing. But it is possible that a default style/font needs to be set, either somewhere higher in the XML hierarchy or elsewhere in the .zip (perhaps in fontTable.xml or styles.xml). I'm not familiar enough with Word's XML structures to guess what, if anything, might be missing, but I may be able to have a look tomorrow.
I suppose another possibility is that you just have to have all these extra rPr elements for this to work in Word 2007, which would suggest that pandoc may have been written for Word 2010, not 2007. (I don't know anything about the tool).
As an example, where you have
<m:r>
<m:t>(</m:t>
</m:r>
what you need is
<m:r>
<w:rPr>
<w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math" />
</w:rPr>
<m:t>(</m:t>
</m:r>
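The manual fix above can be automated. The following is a minimal Python sketch, not a tested repair tool: it assumes the document uses the m: and w: prefixes exactly as in the example, and the function names add_math_font and patch_docx are made up for illustration. It inserts the font properties into every math run that does not already have them, placing them after any existing m:rPr element.

```python
import re
import zipfile

# Run properties to insert, as in the example above.
FONT_RPR = ('<w:rPr><w:rFonts w:ascii="Cambria Math" '
            'w:hAnsi="Cambria Math"/></w:rPr>')

# An opening math run, optionally followed by its own math run properties.
_RUN_START = re.compile(r'<m:r>(\s*<m:rPr>.*?</m:rPr>|\s*<m:rPr/>)?', re.DOTALL)


def add_math_font(xml):
    """Return xml with FONT_RPR added to every m:r that lacks a w:rPr."""
    def fix(match):
        rest = match.string[match.end():].lstrip()
        if rest.startswith('<w:rPr'):
            return match.group(0)  # run already carries font properties
        return match.group(0) + FONT_RPR
    return _RUN_START.sub(fix, xml)


def patch_docx(src, dst):
    """Copy the docx at src to dst, patching only word/document.xml."""
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, 'w', zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == 'word/document.xml':
                data = add_math_font(data.decode('utf-8')).encode('utf-8')
            zout.writestr(item, data)
```

Calling patch_docx('test.docx', 'test-fixed.docx') would write a patched copy and leave the original untouched. Whether this is enough for Word 2007 in every case is not confirmed here, since the manual fix was only tried on one expression.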

I did the following to get rid of the font issue:
Create a new empty word document.
Copy all content to the new document.
Choose Match Source Format.

As discussed above, Windows doesn't have the font Lucida Grande, so substituting the Math Font with Cambria Math should work.
Rename the test.docx to test.zip
vim test.zip and select test/word/settings.xml
find and change Lucida Grande to Cambria Math
save and rename zip to docx. This results in something like this docx.
You can then also supply that file as a sort of docx template to pandoc with the --reference-docx option (renamed --reference-doc in pandoc 2.0 and later).
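The rename, edit, and rename-back steps can also be scripted. This is a rough Python sketch under the assumption that the font name appears literally in word/settings.xml, as described above; the function name swap_math_font is hypothetical.

```python
import zipfile


def swap_math_font(src, dst, old="Lucida Grande", new="Cambria Math"):
    """Copy the docx at src to dst, replacing the font name in word/settings.xml."""
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/settings.xml":
                # Plain text substitution; every other part is copied unchanged.
                data = data.decode("utf-8").replace(old, new).encode("utf-8")
            zout.writestr(item, data)
```

For example, swap_math_font("test.docx", "test-fixed.docx") writes a corrected copy next to the original.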

Replace '\t' with the correct amount of spaces inside a text object

I have a text box in my GUI, into which I want to write a tabbed text.
As you may or may not know, the \t modifier does not work in TeX-interpreted text strings.
Is there an elegant way to emulate the tab modifier with the CORRECT number of spaces, also taking into account that different characters may have different widths?
The result should look like this:
[tabText('Try\tThis') ; tabText('Tryy\tThis')]
ans =
Try This
Tryy This
Thanks.
'\t' in MATLAB is interpreted literally, as the two characters \ and t, not as a tab.
To obtain the tabulation character, you'll have to go through sprintf:
> 'Try\tThis'
Try\tThis
> sprintf('Try\tThis')
Try This
Or with char(9) (ASCII code):
> ['Try' char(9) 'This']
Try This
Looking at the relevant part of the MATLAB documentation for text (at the time of writing, this points to the R2016b docs) one can see the TeX "subset" that is supported by MATLAB, and it does not include any tab-like character. Thus it seems that there's no proper way to do this with the tex interpreter.
You have several options:
If using uifigures is an option, text labels there allow MathML to be used, which is very customizable.
If you switch to the 'latex' interpreter, you could use \quad, \qquad etc.
figure();
text(.5,.5,{'$$This \quad text$$','$$is \quad properly$$','$$tabbed, \quad Right?$$'},...
'Interpreter','latex');
What O'Neil suggested.
Regarding the unequal character width - you might be able to overcome this by changing the font, using the 'FontName' argument to text(...).
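For the fixed-width case, the padding logic behind a tabText-style helper is short. The sketch below is in Python rather than MATLAB and only illustrates the arithmetic: it assumes a monospaced font and a tab stop every 8 columns (both assumptions), and the name tab_text is made up. It does not address proportional fonts.

```python
def tab_text(s, tab_width=8):
    """Replace each tab with enough spaces to reach the next tab stop."""
    out = []
    col = 0  # current column, counted in characters
    for ch in s:
        if ch == "\t":
            pad = tab_width - col % tab_width
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)
```

With this, tab_text("Try\tThis") and tab_text("Tryy\tThis") both place "This" at column 8, so the two lines align when drawn in a monospaced font.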

Copy Microsoft Word text and equations as mathml and text together

I have text with equations in Microsoft Word 2013. I want to copy this text with equations together, but what I need is, text as plain text and equations as mathml.
When I copy only an equation, Equation Options -> Copy MathML to clipboard as plain text works perfectly. However, if I copy an equation together with text, everything comes through as plain text only.
Is there any way to copy text with MathML?
I don't know of a way to do it without some sort of pre- or post-processing.
Taking the latter first, if your document contains only text and OMML equations, when you copy it, one of the clipboard formats Word provides to the Windows clipboard is HTML. In this HTML, the equations are present, but are coded with OMML, rather than MathML. You would need a script or some other means of converting the OMML into MathML (and presumably removing all the other MS-specific markup that you probably don't want in the final document).
The other way is to pre-process it. MathType will do this with its Convert Equations command on the MathType tab in Word. In the Convert Equations dialog, choose to convert OMML to MathML, and after it's finished, copy the entire text+MathML document to wherever you want to paste it. (Or save it as a .txt file.)

How can I use the DocX library to change the font globally, remove superfluous spaces, and remove or add extra line breaks?

Using the DocX library [https://docx.codeplex.com/], I want to convert a .docx document to use a different font. Does anybody know how to do that? The sample projects are very sparse, and the documentation is nonexistent.
I also find that documents often contain extraneous spaces, and I want to iterate over these until there are never two contiguous spaces. I can do this in a loop, I guess, replacing "  " (2 spaces) with " " (1 space) until "  " (2 spaces) is no longer found.
However, I also want to remove superfluous line breaks that sometimes occur when copying-and-pasting text into a document. I can do it "manually" (in Libre Office, not sure how it's done in MS Word), as I got an answer to this question:
(select "Regular Expressions" and then replace "$" (without the quotes) with a space)
...but how programmatically, with DocX?
Additionally, in some cases I want to ADD line breaks/"paragraph returns" where there are legitimate line breaks between the end of one paragraph and the start of another, but no extra line to separate them visually. According to this:
...I can add a paragraph/line break to a legitimate line break by searching for "$" and replacing that with "\n\n"
This does work, too (manually, in Libre Office); but again...how to do this with the DocX library?
It appears that not all of this is possible with the current version of the DocX library you are using. If it is not exposed in documentation, the functions might as well not exist, and you should not be using undocumented features.
There is a much more mature library available, however, called the "Open XML SDK", that can do everything you need.
The correct way to change a font, regardless of whether you are doing it in the document editor or writing a program to manipulate these files, is to change the style applied to the text in question, or to change the definition of the style in use.
You should never, ever, ever, ever directly change the font of any text. Personally, I think that the 'font type' and 'font size' menus should be removed entirely from word/libreoffice/etc, and only be accessible inside a 'change style properties' dialog; the only reason to directly apply a font is if you are actually providing an example of particular typeface under discussion!
See How to: Replace the styles parts in a word processing document (Open XML SDK) from the MSDN documentation for a description of the way that works.
To search and replace text, the applicable MSDN page is How to: Search and replace text in a document part (Open XML SDK). For specifically replacing multiple spaces with a single space, there are numerous results on Google that should all work to at least some degree.
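Independently of which library reads and writes the document, the text clean-up itself comes down to two regular-expression substitutions. The following Python sketch works on plain strings only; it does not use the DocX or Open XML SDK APIs, and the function names are made up. Note that in a real .docx file a paragraph's text may be split across several runs, so applying this to a document needs care.

```python
import re


def collapse_spaces(text):
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)


def unwrap_lines(text):
    """Turn single line breaks into spaces, keeping blank-line paragraph breaks."""
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)
```

A single substitution with " {2,}" replaces the repeat-until-no-match loop described in the question, because it consumes the whole run of spaces at once.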

Word document to mathml?

I have lots of word documents which contain math equations, some tables, and some expressions written in superscript and subscript. Is there a good tool besides MathType for converting my equations to mathml?
If the expressions are entered as math zones using the built-in math formatter of Word 2007 or later, then Word already includes a transformation to MathML: you can select (by an option in the ribbon) that when you cut and paste any math expression, the MathML version is placed on the clipboard. If you want to bulk-convert all the expressions in a document rather than cut and paste manually, there is an old blog post of mine on the subject at
http://dpcarlisle.blogspot.co.uk/2007/04/xhtml-and-mathml-from-office-20007.html
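As a first step toward bulk conversion, the math zones can be pulled out of a .docx file with the Python standard library. This sketch only locates the OMML (m:oMath) elements; it does not convert them to MathML, which would still need a separate transformation step. The function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
import zipfile

# Namespace of Office Math Markup Language (OMML) elements.
OMML_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def find_omml(document_xml):
    """Return every m:oMath element in the given XML, serialized as a string."""
    root = ET.fromstring(document_xml)
    return [ET.tostring(el, encoding="unicode")
            for el in root.iter(f"{{{OMML_NS}}}oMath")]


def find_omml_in_docx(path):
    """Read word/document.xml from a .docx file and list its math zones."""
    with zipfile.ZipFile(path) as docx:
        return find_omml(docx.read("word/document.xml"))
```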

Perl CAM::PDF splitting words improperly

I'm using the CAM::PDF Perl module to parse PDFs. The module works great except for one issue: it seems to split words randomly. Is there any way of fixing this via settings, or some algorithmic way to put the words back together?
For example:
"has offices located in New Yor k and Dublin."
-Notice New York
"price competit ion"
-price competition
The section of code is below:
use CAM::PDF;
my $pdf = CAM::PDF->new($pdf_name);
my $text = $pdf->getPageText($page);
print("$text\n");
In general it's not always possible to reconstruct the original text from a PDF. Often the physical structure doesn't match the output.
In this case you are quite possibly being affected by manual kerning. I.e. splitting on character pairs and adjusting the spacing to produce a more pleasing result - see http://en.wikipedia.org/wiki/Kerning.
The PDF producer then breaks within words and outputs smaller chunks, which CAM::PDF recognises as separate words.
If you have some control on your PDF production, you could experiment with fonts and kerning settings - but this might also compromise output quality.
PDF::OCR2 is likely to handle kerning more robustly and might do a better overall job of recognizing the original text.
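On the "algorithmic" side of the question, one possible heuristic is to rejoin neighbouring fragments whenever the joined form is a known word. The sketch below is in Python rather than Perl and is only an illustration: it is not something CAM::PDF provides, the vocabulary is an assumed input (for example a word list you supply), and the function name is made up. It ignores punctuation and can merge wrongly when a fragment pair happens to form a different real word.

```python
def rejoin_split_words(text, vocabulary):
    """Merge adjacent tokens when their concatenation is a known word.

    vocabulary is a set of lowercase words. A token that is already a
    known word on its own is never merged with the token after it.
    """
    tokens = text.split()
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            joined = tokens[i] + tokens[i + 1]
            if (joined.lower() in vocabulary
                    and tokens[i].lower() not in vocabulary):
                out.append(joined)
                i += 2
                continue
        out.append(tokens[i])
        i += 1
    return " ".join(out)
```

With a vocabulary containing "york" and "competition", the two examples from the question come back as "New York and Dublin" and "price competition".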