Accurately Reading Document Content with Position Using Open XML (Word) - openxml

I need to retrieve a string that exactly matches the content of a Word document. That is to say, at position 1000 of this string I should find exactly the text that is at position 1000 in the document.
We have been through various iterations of reading in the document content text and adjusting for field codes, tables, pictures, inline shapes etc. by padding in the right places. This approach works well, but we want to move to Open XML for speed.
We have Power Tools for Open XML installed, and have been looking at ways to recreate this string using Open XML. We can get all the text by going through the runs (as per Eric White's blogs), but we also need everything else: \r's, \t's etc. I see things like "TabChar" in runs, and wdfldChar, but I am unclear how to use this information to generically get what we want.
For example, "TabChar" in our string should be \t. We presumably need to interpret wdfldChar begin, separate and end in a certain way (maybe by adding spaces). The problem is that we don't want to have to find every possibility and code each one [if run = "TabChar", append "\t", etc.], because a) it's inefficient and b) it is unsafe.
Can anyone help with a method to reproduce this string with complete accuracy?
Thanks
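A minimal sketch of the table-driven idea, written in Python against the raw word/document.xml part rather than the Open XML SDK. The function names and the CHAR_FOR table are hypothetical, and the character chosen for each element (in particular "\v" for a manual line break and "\r" for a paragraph mark) is an assumption that should be checked against the string your existing approach produces. Fields, tables, pictures and text boxes are not handled here.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Assumed mapping from run-level elements to single characters.
# Extend this table instead of adding one if-branch per element.
CHAR_FOR = {
    W + "tab": "\t",
    W + "br": "\v",  # assumption: manual line break shows up as vertical tab
}


def _walk(el, out):
    if el.tag == W + "r":
        # Only direct children of a run are content; this skips the
        # <w:tab> tab-stop definitions that live under <w:pPr>/<w:tabs>.
        for child in el:
            if child.tag == W + "t":
                out.append(child.text or "")
            elif child.tag in CHAR_FOR:
                out.append(CHAR_FOR[child.tag])
        return
    for child in el:
        _walk(child, out)
    if el.tag == W + "p":
        out.append("\r")  # assumption: one \r per paragraph mark


def flatten(document_xml: bytes) -> str:
    """Return the text of a document.xml part with control characters."""
    out = []
    _walk(ET.fromstring(document_xml), out)
    return "".join(out)


def read_docx_text(path: str) -> str:
    """Open a .docx (a zip archive) and flatten its main document part."""
    with zipfile.ZipFile(path) as z:
        return flatten(z.read("word/document.xml"))
```

Keeping the mapping in one table means a newly discovered element is a one-line addition, and anything not in the table is visibly ignored rather than silently mishandled.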

Related

Is there an Emacs read-only or view mode that allows inserting some text?

Here's the use case: I'm writing a novel in Emacs (in org-mode). One part of my writing/editing flow is to read over some large portion of what I've written, collecting notes/possible edits/etc as I go. The sort of thing you'd do, on paper, by printing it all out and then writing notes in the margin.
I want to prevent myself from, as I do this kind of review, actually doing any writing -- but that's surprisingly hard. Like, if the buffer is editable, I start to type a brief note about a fix, then find myself starting to restructure / fix a sentence, and next thing I know, I've spent five minutes polishing a single paragraph. This not only slows me down, it breaks my ability to imagine a reader's response.
I've tried just putting the buffer in view-mode, and that sort of works -- but then it's laborious to try to identify the places I want to go back and review/fix up.
My ideal would be, to have something in view-mode, which I genuinely can't edit, but which, as I move the cursor through it, I could hit some key combination, and it would allow me to enter a brief note in the minibuffer, which would then get inserted into the main buffer, at point, possibly inside brackets or a comment or some such.
Does anyone know of something like that? Or have any pointers to something similar which I could try to adapt?
You can easily set bookmarks at any locations. And bookmarks can contain annotations.
If you use library Bookmark+:
The annotations are in Org Mode by default, and they can even be separate files (by default they are part of the bookmarks themselves, so stored in your bookmarks file).
You can bookmark not just a position but also a region of text, whether a sentence, paragraph, page, or an arbitrary span of text.
You can automatically name bookmarks as you set them, if you don't care about the names.
Updated after OP's comment saying "I prefer to shove the comments/questions/notes directly into the text of the novel. Because I end up adding/deleting/moving text a ton, and I want the notes to move with the text":
Bookmarks move with the surrounding text. That is, they generally get relocated automatically, since the surrounding text is recorded as part of the bookmark, and when jumping to a bookmark that text is looked for.
Occasionally the context has changed so much that a bookmark can't be relocated automatically, and you are prompted to relocate it manually.
But yes, bookmarks are stored in a bookmark file, separately from the files they target. There are both advantages and disadvantages to this feature. Advantages include (1) removing clutter from the text (annotations, including notes about possible text changes are metadata), (2) immediate access to particular text locations from anywhere, (3) a separate, persistent record/history of work or thoughts on it, (4) you can have multiple, separate sets of bookmarks/annotations for the same target text.
One thing you might find handy, when using bookmarks especially for annotating a particular file: C-x p C-l switches to a bookmark file that has only bookmarks for the current file or buffer, creating such a file on the fly if none exists. (This is available only with Bookmark+.)

VSCode: activeTextEditor encoding

Is there any way to get current document encoding (that is in the bottom bar) in my extension code?
Something like vscode.window.activeTextEditor.encoding
This does not appear to be possible.
Since it's nearly impossible to prove a negative, the rest of this answer documents what I explored.
The string "encoding" does not appear (in this sense) anywhere in the API docs nor in the index.d.ts file it is derived from. (With VSCode 1.37.1, current as of writing.)
I dug into the vscode sources to see if there might be a clever solution, but came up empty. The code that executes when the encoding is changed by the user is in editorStatus.ts, class ChangeEncodingAction. This makes its way to textFileEditorModel.ts, function updatePreferredEncoding, which sets preferredEncoding. That field controls what happens when the file is saved, and is used to populate the status indicator, but doesn't go anywhere else I can find.
Reading the status indicator itself does not appear possible since the API allows extensions to create new indicators with window.createStatusBarItem but not enumerate existing ones. And directly accessing the DOM is not possible.
I also came up empty searching through VSCode issues related to encoding, both open and closed, but only skimmed the most recent ~100 closed issue titles.
Alternatives
My main suggestion at this point would be to file an enhancement request on the VSCode GitHub repository.
It should also be possible to do something with reflection, but of course it would be fragile.
Finally, the encoding controls how the document in memory (a sequence of characters) maps to a file on disk (a sequence of bytes). Depending on what you're trying to do, it might work to speculatively encode the document in several encodings and compare each to what is on disk (so long as the file is not dirty).
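To illustrate the speculative-encoding idea only, here is a minimal sketch in Python rather than extension code. The function name and the candidate list are hypothetical; in an extension you would take the text from the document and the bytes from the file on disk.

```python
def matching_encodings(text, disk_bytes, candidates):
    """Return every candidate encoding that reproduces the bytes on disk."""
    matches = []
    for name in candidates:
        try:
            encoded = text.encode(name)
        except UnicodeEncodeError:
            continue  # this encoding cannot represent the text at all
        if encoded == disk_bytes:
            matches.append(name)
    return matches


text = "caf\u00e9\n"
on_disk = text.encode("latin-1")
print(matching_encodings(text, on_disk, ["utf-8", "utf-16-le", "latin-1", "cp1252"]))
```

More than one encoding can match, since many encodings agree on the characters a given file happens to contain, so the result narrows the possibilities rather than identifying the encoding.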

iText - Manipulate existing PDF - add dashes to end of each paragraph

I need to manipulate an existing PDF in iText to add dashes to the end of each paragraph. Something like this:
I would make this in Word with tab leaders.
Is it possible to do this with iText on an existing document?
Any help would be greatly appreciated.
Thanks!
Edit for clarifications
The iText version is 5.5.x, but I guess we can upgrade it if the task would be easier with a newer version.
There could be some paragraphs that do not need dashes, but I have some control over the original PDF. It is assembled from a different system, and I could add some kind of marker to the paragraphs that need leaders (i.e. I can add text like "~tab~" at the end of such paragraphs).
At the moment the documents that need this kind of editing have headers and footers, and otherwise nothing but text in one column with justified alignment.
Edit for even more clarification
I can even (by configuration) set where the dashes have to end (i.e. at 10px) for a specific document. We know every document type (and its structure) that needs to be manipulated this way.
This is insanely hard.
You should think of a PDF document as a container of instructions, rather than a WYSIWYG format. So finding out where lines are (let alone paragraphs) is very hard.
High level plan:
Use IEventListener to process events from the PDF being parsed.
Look out for TextRenderInfo events and store them.
Sort the TextRenderInfo events so that your list is in logical reading order.
Merge items in your list if they appear on the same line and are less than a certain distance apart (for instance, the width of 3 spaces in the font specified by TextRenderInfo).
Now you should have lines.
Merge lines if they are in close vertical proximity to each other and they overlap horizontally. How close they should be, and how much they should overlap, is something you'll have to figure out, and it might differ from page to page and from document to document.
Now you should have paragraphs.
Figure out the bounding box of each paragraph. Or, more accurately, the convex hull. There is a good algorithm for this called the gift-wrapping algorithm.
Now you can simply insert lines by inspecting your convex hull. This is the easy step.
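The grouping steps in the plan above can be sketched independently of iText. This is a rough Python illustration, not iText code: Chunk is a hypothetical stand-in for the position data you would copy out of each TextRenderInfo, and every threshold (y_tol, max_gap, max_leading) is an assumption you would have to tune per document. It computes a plain bounding box rather than a convex hull.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # baseline in PDF coordinates (y grows upward)


def group_into_lines(chunks, y_tol=2.0, max_gap=10.0):
    """Sort chunks top-to-bottom, left-to-right, then merge close neighbours."""
    lines = []
    # Bucket y by the tolerance so slightly uneven baselines sort together.
    for c in sorted(chunks, key=lambda c: (-round(c.y / y_tol), c.x0)):
        last = lines[-1] if lines else None
        if last is not None and abs(last.y - c.y) <= y_tol and c.x0 - last.x1 <= max_gap:
            sep = " " if c.x0 - last.x1 > 1.0 else ""
            lines[-1] = Chunk(last.text + sep + c.text, last.x0, max(last.x1, c.x1), last.y)
        else:
            lines.append(c)
    return lines


def group_into_paragraphs(lines, max_leading=16.0):
    """Merge consecutive lines that are vertically close and overlap horizontally."""
    paragraphs = []
    for line in lines:
        if paragraphs:
            prev = paragraphs[-1][-1]
            close = 0 < prev.y - line.y <= max_leading
            overlap = min(prev.x1, line.x1) > max(prev.x0, line.x0)
            if close and overlap:
                paragraphs[-1].append(line)
                continue
        paragraphs.append([line])
    return paragraphs


def bounding_box(paragraph):
    """(left, lowest baseline, right, highest baseline) of one paragraph."""
    return (min(l.x0 for l in paragraph), min(l.y for l in paragraph),
            max(l.x1 for l in paragraph), max(l.y for l in paragraph))
```

The right edge of the last line in each paragraph, together with the right edge of the bounding box, gives the horizontal span that a leader of dashes would need to fill.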
If you can insert markers, you can easily do this using iText7. iText7 has an implementation of IEventListener that allows you to look for regular expressions within a PDF document. It returns the locations where the regular expression was found. If you can ensure your markers always satisfy some kind of regular expression, you can easily look for them, get their coordinates, and insert a line at the calculated position.
Of course, then you need to get rid of the marker text.
For that you can use pdfSweep.

Is there a way to detect when a field mark with 'can grow' has truncated the field data?

I do not normally work with crystal, but I have spent nearly 2 days looking for a way to do this.
The problem is that I have a number of lines of text that need to show on a report, but they need to be cut off after 8 lines, with a 'more' prompt to inform the user that they need to go look at the rest of the details online. This was originally handled by storing the data as individual lines already wrapped to size, counting the lines with a formula, and conditionally showing a separate 'more' field. They have since added the ability to use HTML in the text, but this made the current way of doing things wrap incorrectly and show the HTML mark-up.
I wrote a database function to combine the text into a single field and used the HTML text interpretation to display it correctly on 7 other reports that do not limit the text length. The max line count works great for limiting the text size; I just can't figure out how to show the 'more' prompt when needed.
Any suggestions would be greatly appreciated.
GrumpyGeek,
If your database function now combines the text into a single field does this mean the original way, with the separated lines, is still stored? If so, why not add another calculated field called 'line-count' that tallies the old line-based data?
So you'd still have your new combined HTML field and this new field that you could use to show the 'more' button when 'line-count > x'?
Alternatively, another option might work, but would be a bit touchy. That is to make the formula that shows the more button trigger when the field length exceeds x. The catch is that html mark-up isn't displayed, and heavy use of it would skew the amount of text required before you should show the 'more' button. Put another way, a field with very heavy use of mark-up ( and tags) might force the 'more' button earlier than it should. Unless you could somehow make either your 'line-count' calculated field exclude the mark-up OR make the length calculation do the same.
This would be possible if MSSQL or Crystal Reports could run regex to strip the mark-up.
If NONE of the above works, the only other thing I can suggest is to look into UDFs. Crystal allows you to load an external library that you write. These will read functions you write and show them in the function list inside Crystal. If you do this, then you could easily write a routine that strips the HTML and calculates when the more button should be shown.
Good luck with it.
Ideally, there would be a property of the DB field that would return its displayed line count. Unfortunately, there is no such property.
You could try counting the # of line ending characters (e.g. carriage return, line feed). If they are > 7 then show the hyperlink. In a HTML situation, you have to count ending elements (e.g. ). You could make use of a RegEx UFL to make it easier to identify the elements.
Probably the easiest route is to have the DB calculate the # of lines and return that as another field. Use this field to hide/show the hyperlink.
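As a rough illustration of the counting logic that a database function or UFL would need, here is a sketch in Python (not Crystal or SQL). The set of tags treated as line endings, the chars_per_line width and the function names are all assumptions; the real wrap width depends on the field size and font on the report.

```python
import html
import re
import textwrap

# Assumed line-ending elements; adjust to the mark-up actually stored.
BREAK = re.compile(r"<br\s*/?>|</(?:p|div|li)\s*>", re.IGNORECASE)
TAG = re.compile(r"<[^>]+>")


def estimate_line_count(markup, chars_per_line=80):
    """Estimate displayed lines: split on breaks, strip tags, then wrap."""
    total = 0
    for segment in BREAK.split(markup):
        text = html.unescape(TAG.sub("", segment)).strip()
        if text:  # note: empty lines between breaks are not counted
            total += len(textwrap.wrap(text, chars_per_line))
    return total


def needs_more_prompt(markup, max_lines=8, chars_per_line=80):
    """True when the estimated line count exceeds the display limit."""
    return estimate_line_count(markup, chars_per_line) > max_lines
```

Because the mark-up is stripped before measuring, heavy use of tags does not trigger the prompt early, which was the concern with a plain field-length check.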

OO:Doc - Perl module for OpenOffice

I want to automate some Writer tasks. I need to create a .odt Writer document with oo:doc using methods such as create paragraph and append paragraph. The problem is that append paragraph and create paragraph do not allow text to start at the middle of the page or at a certain column, i.e.
Name Surname Address
When I unzip the "master" document I want to create and inspect the content.xml file, I see that the XML equivalent is
" <text:p text:style-name="Text_20_body"><text:s text:c="115"/><text:span text:style-name="T1"><text:s/>Hallo how are you today</text:span></text:p><text:p text:style-name="P1"><text:s text:c="116"/>I hope you are well also</text:p><text:p text:style-name="P1""
How do I set the text:c and text:s element(s) from within oo::doc?
Question 2:
How do I set the formatting of a paragraph so that it only extends from, e.g., column 20 to column 80?
Thanks
Those elements are for runs of consecutive space characters. The text:c attribute says how many spaces there are.
That doesn't strike me as a solution to what you want, which is to change the margins and position of a paragraph, yes?
Do you have a document that you want to use as a template, where the text will be inserted? Or are you trying to create the entire page from scratch?
I think you want to use OpenOffice.org to create a Writer document that has the structure you want, then look at the XML to see what the markup is that accomplishes that. Look at paragraph-level styles or even frames if that is what is used. You might be able to create insertion points for your generated content by then adding magic-text phrases that you can scan for.
Then figure out how to get that done with the perl module.
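To make the meaning of the text:s and text:c markup discussed above concrete, here is a small Python sketch that expands it back into plain spaces. It only illustrates the markup, not the Perl module's API; the expand function is hypothetical, and a missing text:c is taken to mean a single space.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
S = "{%s}s" % TEXT_NS  # the <text:s/> element
C = "{%s}c" % TEXT_NS  # its text:c count attribute


def expand(el):
    """Return the text of an element with each <text:s> replaced by spaces."""
    if el.tag == S:
        return " " * int(el.get(C, "1"))
    parts = [el.text or ""]
    for child in el:
        parts.append(expand(child))
        parts.append(child.tail or "")
    return "".join(parts)


sample = (
    '<text:p xmlns:text="%s" text:style-name="P1">'
    '<text:s text:c="3"/><text:span><text:s/>Hallo</text:span> you</text:p>' % TEXT_NS
)
print(repr(expand(ET.fromstring(sample))))
```

Seen this way, the 115 and 116 counts in the master document are just leading spaces used to push the text across the page, which is why a paragraph style with margins or indents is the more robust way to position it.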