I have text with equations in Microsoft Word 2013. I want to copy the text and the equations together, with the text as plain text and the equations as MathML.
When I copy only an equation, Equation Options -> Copy MathML to clipboard as plain text works perfectly. However, if I copy an equation together with text, everything arrives as plain text only.
Is there any way to copy text together with MathML?
I don't know of a way to do it without some sort of pre- or post-processing.
Taking the latter first, if your document contains only text and OMML equations, when you copy it, one of the clipboard formats Word provides to the Windows clipboard is HTML. In this HTML, the equations are present, but are coded with OMML, rather than MathML. You would need a script or some other means of converting the OMML into MathML (and presumably removing all the other MS-specific markup that you probably don't want in the final document).
The other way is to pre-process it. MathType will do this with its Convert Equations command on the MathType tab in Word. In the Convert Equations dialog, choose to convert OMML to MathML, and after it's finished, copy the entire text+MathML document to wherever you want to paste it. (Or save it as a .txt file.)
I have pdf documents.
And it's clear to me how to extract text from it.
I need to extract not only text but also coordinates associated with this text.
It's my code:
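from PyPDF2 import PdfReader
pdf_path = 'docs/doc_3.pdf'
pdf = PdfReader(pdf_path)
# PdfReader exposes pages as a list; index 1 is the second page
page_1_object = pdf.pages[1]
# extract_text() replaces the deprecated extractText()
page_1_object.extract_text().split("\n")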
The result is:
['Creating value for all stakeholders',
'Anglo\xa0American is re-imagining mining to improve people’s lives.']
I need geometries associated with extracted paragraphs.
Might be something like this for example:
[['Creating value for all stakeholders', [1, 2, 3, 4]],
['Anglo\xa0American is re-imagining mining to improve people’s lives.', [7, 8, 9, 10]]]
How can I accomplish this?
Thanks,
PyPDF2 does not currently offer that. It can parse the page content into text, as your extraction call shows, but it does not keep the x/y position of each glyph, nor does it output line coordinates.
There are other ways in Python to get the position of a single group, or of multiple groups, of letters that form words.
Calling command-line tools such as Poppler's in conjunction with a "word" of text found by PyPDF2 is possible. The norm, however, is to use another Python PDF library such as PyMuPDF; this article, https://pyquestions.com/find-text-position-in-pdf-file, covers finding text positions for highlighting with PyMuPDF.
The most common route to your goal is probably the one described in "How to extract text and text coordinates from a PDF file?"
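A minimal sketch of the PyMuPDF route, assuming the package is installed (pip install PyMuPDF) and that PyMuPDF's text blocks are close enough to the paragraphs you want. page.get_text("blocks") returns tuples of the form (x0, y0, x1, y1, text, block_no, block_type); the helper names blocks_to_paragraphs and extract_paragraphs are made up for this illustration.

```python
def blocks_to_paragraphs(blocks):
    """Turn PyMuPDF block tuples into [text, [x0, y0, x1, y1]] pairs."""
    paragraphs = []
    for x0, y0, x1, y1, text, _block_no, block_type in blocks:
        if block_type != 0:  # 0 = text block, 1 = image block
            continue
        text = " ".join(text.split())  # collapse line breaks inside the block
        if text:
            paragraphs.append([text, [x0, y0, x1, y1]])
    return paragraphs


def extract_paragraphs(pdf_path, page_index=0):
    """Read one page with PyMuPDF and return its text blocks with coordinates."""
    import fitz  # PyMuPDF; imported lazily so the helper above has no dependency

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        return blocks_to_paragraphs(page.get_text("blocks"))
```

Calling extract_paragraphs('docs/doc_3.pdf', 1) would then give a list shaped like the one asked for. The coordinates are the bounding box of each block in PDF points; if word-level boxes are needed, page.get_text("words") is the analogous call.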
Whenever an MS word (or LibreOffice or other word processor) document is opened in its respective program, the words appear normally on the page, but when the document is opened in a text editor, most of it is Unicode gibberish.
I can understand why the document might have some parts that aren't legible, like bullet points or metadata, but why isn't at least some of the content stored as plaintext? Does every letter get encoded?
The latest Microsoft Word format, docx, is a set of XML files, with the text stored as plain text, compressed into a zip archive. You can unzip it by renaming .docx to .zip, then open the extracted files in Notepad. So the content is partly stored as plain text, just compressed.
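To see this without renaming anything, here is a short sketch using only Python's standard library. It assumes the main body lives in word/document.xml, which is the usual layout of a docx archive; the function name docx_text is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the w: prefix in document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(path_or_file):
    """Return the text of each paragraph (w:p) in a .docx file."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # The visible text sits in w:t elements inside the paragraph's runs
        runs = [t.text or "" for t in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return paragraphs
```

The text is split across many w:t elements, one per formatting run, which is why it looks fragmented even after unzipping.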
I think it is probably a branding thing. If you want, you can export the document to a text file.
Go to File > Export > Change File Type > Plain Text (*.txt) and export the document from there.
I have the following segment of Markdown with embedded LaTeX equations:
# Fisher's linear discriminant
\newcommand{\cov}{\mathrm{cov}}
\newcommand{\A}{\mathrm{A}}
\renewcommand{\B}{\mathrm{B}}
\renewcommand{\T}{^\top}
The first method to find an optimal linear discriminant was proposed by Fisher
(1936), using the ratio of the between-class variance to the within-class variance
of the projected data, $d(\vec x)$, as a criterion. Expressed in terms of the
sample properties, the $p$-dimensional centroids $\bar {\vec x}_\A$ and
$\bar {\vec x}_\B$ and the $p \times p$ covariance matrices
$S_A = \cov_i ( \vec x_{\A i} )$ and $S_B = \cov_i ( \vec x_{\B i} )$, the
optimal direction is given by
$$
\vec w = \left ( \frac{ S_A + S_B }{2} \right ) ^{-1}
~ ( \bar {\vec x}_\B - \bar {\vec x}_\A ).
$$
When I convert it with pandoc to LaTeX and compile it with xelatex, I get the expected text with nicely rendered math. When I convert it with pandoc to MS Word using
pandoc test.text -o test.docx
and open it in MS Office Word 2007, I get the following:
Only those parts of the equations that are symbols or upright text get rendered correctly, while variable names in italics are replaced by a question mark in a box.
How can I make this work?
In Word 2007, I see a result similar to yours, except that here, I don't see the "question marks in boxes" characters, just space.
If I then take one of the expressions, and use your trick of going to linear display and back, the characters reappear for that expression.
If I save and re-open, the other expressions still do not display correctly, but if I save and look at the XML, I notice that
the Math font has been changed to Cambria Math
additional run parameter (w:rPr) XML specifying the Cambria Math
font has been inserted in many of the runs (w:r) inside the oMath
elements, even in the oMath expressions that do not display
correctly. However, in the oMath expression that now displays
correctly, this extra XML has been applied to every run. In the
others, it has only been applied to some runs (I think I can see the
pattern but I'm running out of time here right now...)
If I manually add the XML to the other runs and re-open the
document, the expressions appear correctly. Or at least, they do in
the one case I have tried.
Since Word 2010 displays the results correctly, I can only assume that it does not rely on these explicit font settings, whereas Word 2007 does. This doesn't really help you yet, because altering all those w:r elements would be even harder than what you are already doing. But it is possible that a default style/font needs to be set, either somewhere higher in the XML hierarchy, or perhaps elsewhere in the .zip (perhaps in fontTable.xml or styles.xml). I'm not familiar enough with Word's XML structures to guess what, if anything, might be missing, but may be able to have a look tomorrow.
I suppose another possibility is that you just have to have all these extra rPr elements for this to work in Word 2007, which would suggest that pandoc may have been written for Word 2010, not 2007. (I don't know anything about the tool).
As an example, where you have
<m:r>
<m:t>(</m:t>
</m:r>
what you need is
<m:r>
<w:rPr>
<w:rFonts w:ascii="Cambria Math" w:hAnsi="Cambria Math" />
</w:rPr>
<m:t>(</m:t>
</m:r>
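If it does come to patching every run, the change above could be scripted rather than typed by hand. This is only a sketch with Python's standard library: it assumes the XML has already been read from word/document.xml, it skips any run that already has a w:rPr (even one without a font), and the function name add_math_font is made up here. Writing the result back into the .docx is left out.

```python
import xml.etree.ElementTree as ET

M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("m", M_NS)  # keep the usual prefixes when serialising
ET.register_namespace("w", W_NS)


def add_math_font(root, font="Cambria Math"):
    """Give every m:r run that lacks a w:rPr an explicit font; return the count."""
    changed = 0
    for run in list(root.iter(f"{{{M_NS}}}r")):
        if run.find(f"{{{W_NS}}}rPr") is not None:
            continue  # run already carries explicit run properties
        rpr = ET.Element(f"{{{W_NS}}}rPr")
        ET.SubElement(rpr, f"{{{W_NS}}}rFonts",
                      {f"{{{W_NS}}}ascii": font, f"{{{W_NS}}}hAnsi": font})
        # w:rPr goes after an optional m:rPr and before m:t
        has_math_props = len(run) > 0 and run[0].tag == f"{{{M_NS}}}rPr"
        run.insert(1 if has_math_props else 0, rpr)
        changed += 1
    return changed
```

One caveat when re-serialising a whole document.xml with ElementTree: any namespace prefix that is not registered gets renamed, so it is worth checking that Word still opens the result.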
I did the following to get rid of the font issue:
Create a new empty word document.
Copy all content to the new document.
Choose Match Source Format.
As discussed above, Windows doesn't have the font Lucida Grande, so substituting the Math Font with Cambria Math should work.
Rename the test.docx to test.zip
vim test.zip and select test/word/settings.xml
find and change Lucida Grande to Cambria Math
save and rename zip to docx. This results in something like this docx.
You can then also supply that file as a sort of docx template to pandoc with the --reference-docx option.
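The rename-and-edit steps above can also be done in one go with a script. A minimal sketch using Python's zipfile module, assuming settings.xml is UTF-8 encoded and that a plain string replacement of the font name is all that is needed; replace_math_font is a hypothetical name.

```python
import zipfile


def replace_math_font(src, dst, old="Lucida Grande", new="Cambria Math"):
    """Copy a .docx, replacing `old` with `new` inside word/settings.xml."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/settings.xml":
                data = data.decode("utf-8").replace(old, new).encode("utf-8")
            # Every other archive member is copied through unchanged
            zout.writestr(item, data)
```

For example, replace_math_font("test.docx", "test-fixed.docx") writes a new file and leaves the original untouched.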
I have lots of word documents which contain math equations, some tables, and some expressions written in superscript and subscript. Is there a good tool besides MathType for converting my equations to mathml?
If the expressions are entered as math zones using the built-in math formatter of Word 2007 or later, then Word includes a built-in transformation to MathML: you can select an option in the ribbon so that whenever you cut and paste a math expression, the MathML version is placed on the clipboard. If you want to bulk convert all the expressions in a document rather than cutting and pasting manually, there is an old blog post of mine on the subject at
http://dpcarlisle.blogspot.co.uk/2007/04/xhtml-and-mathml-from-office-20007.html
I am trying to extract text from PDF files using Perl. I have been using pdftotext.exe from command line (i.e using Perl system function) for extracting text from PDF files, this method works fine.
The problem is that the PDF files contain symbols such as α and β and other special characters, which do not appear in the generated txt file. Also, a few extra spaces are added at random in the text.
Is there a better and more reliable way to extract text from PDF files such that the text will include all the symbols like α, β etc and the text will exactly match the text in the PDF (i.e without extra spaces)?
You can extract text from a PDF with these modules:
PDF::API2
CAM::PDF
CAM::PDF::PageText
From CPAN:
use CAM::PDF;
use CAM::PDF::PageText;
my $filename = shift @ARGV;    # path to the PDF file
my $pdf = CAM::PDF->new($filename);
my $pageone_tree = $pdf->getPageContentTree(1);    # page numbers start at 1
print CAM::PDF::PageText->render($pageone_tree);
This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.
All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
You may never get an appropriate solution to your problem. The PDF format can encode text either as ASCII values with a font applied, or it can encode it as a bitmap. If the tool that created your PDF decided to encode the special characters as a bitmap, you will be out of luck (unless you want to get into OCR solutions, of course).
I'm not a Perl user but I imagine you'll struggle to find a better free text extractor than pdftotext.
pdftotext usually recognises non-ASCII characters fine. Is it possible that it is extracting them correctly, but the app you're using to view the text file isn't using the correct encoding? If pdftotext on Windows is the same as the one on my Linux system, then it defaults to exporting as UTF-8.
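One way to rule the encoding out is to request UTF-8 explicitly with pdftotext's -enc option and decode the output yourself. A small sketch, assuming pdftotext is on the PATH and that passing "-" as the output file sends the text to stdout; the function names are made up for this example.

```python
import subprocess


def pdftotext_command(pdf_path, exe="pdftotext"):
    """Build a pdftotext call that writes UTF-8 text to stdout ("-")."""
    return [exe, "-enc", "UTF-8", pdf_path, "-"]


def pdf_to_text(pdf_path, exe="pdftotext"):
    """Run pdftotext and decode its output explicitly as UTF-8."""
    result = subprocess.run(pdftotext_command(pdf_path, exe),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8")
```

If α and β show up in the returned string but not in your editor, the viewer's encoding is the culprit rather than the extraction.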
There is getpdftext.pl; part of CAM::PDF.
Well, I tried two or three Perl modules, such as CAM::PDF and PDF::API2, but the problem remains the same. I'm parsing a PDF file containing main pages. CAM::PDF and PDF::API2 parse the plain text very well. However, they are not able to parse the code snippets [code snippets are usually in a different font and encoding than the plain text].
James Healy is correct. After trying CAM::PDF and PDF::API2, the former of which I've had some success reading text, downloading pdftotext worked great for a number of my implementations.
If you are on Windows, download the precompiled xpdf binary here:
http://www.foolabs.com/xpdf/download.html
Then, if you need to run this within Perl, use system with single quotes, so that the backslashes in the path stay literal, e.g.:
system('C:\Utilities\xpdfbin-win-3.04\bin64\pdftotext.exe', $saveName);
where $saveName is the full path to your PDF file.
This hopefully leaves you with a text file you can open and parse in perl.
I tried this module, which works fine for special characters in PDFs:
#!/usr/bin/perl
use strict;
use warnings;
use PDF::OCR::Thorough;
my $filename = "pdf.pdf";
my $pdf = PDF::OCR::Thorough->new($filename);
my $text = $pdf->get_text();
print "$text";
Take a look at PDFBox. It is a library, but I think it also comes with a tool for text extraction.