OpenXML Word Document Split

I need help splitting a Word document using OpenXML.
I am trying to split a large Word document into multiple Word documents with a single page in each. I have to split the document using the OpenXML SDK 2.5 (no third-party DLLs are allowed). The documents produced by the split should contain all the styling and formatting present in the original document.

Pages do not exist in the OpenXML format; they only come into being when a word processor renders the document.
The document may contain <w:lastRenderedPageBreak/> nodes, which hint at where page breaks fell the last time the document was rendered, though not necessarily where they will fall the next time.
If you have control over how the documents are constructed, you can use inline markers to indicate where you want to split. For instance, it's a straightforward operation to locate heading levels and segment on them, splitting a document of chapters into one document per chapter, as in the sketch below.
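Here is a minimal C# sketch of that heading-based approach with the OpenXML SDK. It assumes each chapter starts with a paragraph using the built-in Heading1 style (the default English style id) and that it is acceptable to copy the whole file once per chapter and then delete the other chapters' content; copying the file is what keeps styles, numbering and theme parts intact. The file names are placeholders.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

class Splitter
{
    // True when a body-level element is a paragraph styled with the built-in
    // "Heading1" style (adjust the style id for your own template).
    static bool IsChapterStart(OpenXmlElement e) =>
        e is Paragraph p &&
        p.ParagraphProperties?.ParagraphStyleId?.Val?.Value == "Heading1";

    static void Main()
    {
        const string source = "large.docx";              // placeholder input path

        // Pass 1: record the body-level indexes at which each chapter starts.
        List<int> starts;
        int total;
        using (var doc = WordprocessingDocument.Open(source, false))
        {
            var children = doc.MainDocumentPart.Document.Body.ChildElements;
            total = children.Count;
            starts = children.Select((e, i) => (e, i))
                             .Where(x => IsChapterStart(x.e))
                             .Select(x => x.i)
                             .ToList();
        }

        // Pass 2: copy the whole file for each chapter (so styles, numbering and
        // themes survive) and delete every body element that belongs elsewhere.
        for (int c = 0; c < starts.Count; c++)
        {
            int from = starts[c];
            int to = c + 1 < starts.Count ? starts[c + 1] : total;
            string target = $"chapter_{c + 1}.docx";     // placeholder output path
            File.Copy(source, target, overwrite: true);

            using (var copy = WordprocessingDocument.Open(target, true))
            {
                var body = copy.MainDocumentPart.Document.Body;
                var children = body.ChildElements.ToList(); // snapshot before removing
                for (int i = 0; i < children.Count; i++)
                {
                    bool inChapter = i >= from && i < to;
                    bool isSectPr = children[i] is SectionProperties; // keep page setup
                    if (!inChapter && !isSectPr)
                        children[i].Remove();
                }
                copy.MainDocumentPart.Document.Save();
            }
        }
    }
}
```

If you need page-based rather than heading-based splits, the same loop could key off <w:lastRenderedPageBreak/> runs instead, with the caveat given above that those positions are only where the breaks fell at the last rendering.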

Related

Joining two different Word documents

I am using Microsoft Word. I have two different Word documents. I would like to add the first Word document (a cover page) to the beginning of the other Word document. Is there a way to do that easily?
Kind regards

SQL (MSSQL/MariaDB) or NoSQL (MongoDB): XML search and processing

Current project situation:
We receive a lot of XML from an outside system. Each message is under 50 KB, or around 5 MB at most if it contains an attachment. The XML structure is moderately complex because of nested elements: there are ~70 first-level elements, plus their children and their children's children. We store that XML in a string column in MS SQL Server.
While storing the XML, we read the search-criteria fields from it and keep them in separate columns to speed up search queries.
The search functionality displays these messages as a list. The search-criteria fields (about 10, all optional) come from XML elements. We parse the XML to show around 10-15 of its elements in the list.
Reporting functionality may be added in the future.
Challenge with this design: if a new search criterion is introduced, we have to add another column to the table and populate it with that field's value from the XML, which is the weakest part of this design.
Suggested improvement: instead of storing the XML as a string, the plan is to store it in an XML column, drop the extra columns that hold the search-field values, and use XML column queries for search and retrieval.
Questions:
Which DB will give me optimum search performance? I only need to fetch the XML documents that match the search criteria.
SQL, or NoSQL like MongoDB?
Are there any performance metrics available, or any case studies on this?
Which DB is better suited to handle the reporting load?
What client language are you using? Java / PHP / C# / ...? Most of them have XML libraries that do what you need. A database is a data repository, not a data manipulator.
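To illustrate the client-side route in C#, here is a minimal LINQ to XML sketch that extracts the handful of search-criteria fields before they are stored as ordinary columns. The element names (Message, Header, Sender, ReceivedAt) are made up for illustration and are not your actual schema.

```csharp
using System;
using System.Xml.Linq;

class SearchFieldExtractor
{
    static void Main()
    {
        // Hypothetical incoming message; the real XML has ~70 first-level elements.
        string xml = @"<Message>
                         <Header>
                           <Sender>ACME</Sender>
                           <ReceivedAt>2015-06-01T10:30:00</ReceivedAt>
                         </Header>
                         <Body>...</Body>
                       </Message>";

        XDocument doc = XDocument.Parse(xml);

        // Pull out the few fields the list/search screen needs, instead of
        // asking the database to shred the XML on every query.
        string sender = (string)doc.Root.Element("Header")?.Element("Sender");
        DateTime? receivedAt = (DateTime?)doc.Root.Element("Header")?.Element("ReceivedAt");

        Console.WriteLine($"{sender} / {receivedAt}");
    }
}
```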

MongoDB Text Search Processed Query

I'm using the text search feature and I couldn't find a way to get the stemmed terms used in the query. Is there a way to return the list of words in their stemmed form together with the query results, and also the parts of the document that matched? This would help in understanding and identifying which part of the document matches.
Cheers!
As of MongoDB 2.6, the only meta information about a text search that can be used is a score indicating the strength of the match. You can submit a ticket on the Core Server project to request this feature (I looked, and I don't think one exists at the moment).
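For reference, this is roughly how that score is surfaced with the MongoDB .NET driver; a hedged sketch in which the database name, collection name, search terms and the "score" field alias are placeholders, and a text index is assumed to already exist on the collection.

```csharp
using MongoDB.Bson;
using MongoDB.Driver;

class TextScoreExample
{
    static void Main()
    {
        var collection = new MongoClient()
            .GetDatabase("test")
            .GetCollection<BsonDocument>("articles");   // placeholder names

        // $text query; the relevance score is the only text-search metadata exposed.
        var filter = Builders<BsonDocument>.Filter.Text("coffee shop");
        var projection = Builders<BsonDocument>.Projection.MetaTextScore("score");
        var sort = Builders<BsonDocument>.Sort.MetaTextScore("score");

        var results = collection.Find(filter)
                                .Project<BsonDocument>(projection)
                                .Sort(sort)
                                .ToList();

        foreach (var doc in results)
            System.Console.WriteLine(doc.ToJson());
    }
}
```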

Index Markdown Files with MongoDB

I am looking for a document-oriented database solution - MongoDB preferred - to index a continuously growing and frequently changing number of (pandoc) markdown files.
I read that MongoDB has a clean text indexer, but I have not worked with MongoDB before, and the only related thing I found was an indexing process for preprocessed HTML. The scenario I have in mind is: automatic indexing of the markdown files, where the markdown syntax is used to create keys (for example ## FOO -> header2: FOO) and where the hierarchical structure of the key/value pairs is preserved as it appears in the document.
Is this possible with MongoDB alone, or do I always need a preprocessing step in which I transform the markdown into something like a BSON document and then ingest it into MongoDB?
Why do you want to use MongoDB for this? I think ElasticSearch is a much better fit for this purpose; it is basically built for indexing text. However - the same as with MongoDB - you won't get anything automatic: you will need to process the document before saving it if you want to improve the precision with which documents are found. The whole document needs to be sent to ElasticSearch as a JSON object, but you can also store the whole unprocessed markdown text inside one of its properties.
I'm not sure about MongoDB full-text indices, but ElasticSearch also combines all indexed properties of a document for the full-text search. Additionally, you can define the importance of different properties in your index; for instance, the title might be more important than the rest of the text.
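As a rough illustration of that preprocessing step, here is a small C# sketch that maps markdown headings to keys (e.g. ## FOO -> header2: FOO) and collects the text under each heading before the result is handed to MongoDB or ElasticSearch. It is deliberately flat, does not preserve the full heading hierarchy (which you would need to model yourself), and the file name is a placeholder.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

class MarkdownPreprocessor
{
    // Turns "## FOO" style headings into key/value pairs like header2 -> FOO,
    // and collects the lines below each heading under a key named after it.
    static Dictionary<string, string> ToDocument(string markdown)
    {
        var doc = new Dictionary<string, string>();
        string textKey = "preamble";                      // bucket for text before any heading
        var heading = new Regex(@"^(#{1,6})\s+(.*)$");

        foreach (var line in markdown.Split('\n'))
        {
            var m = heading.Match(line);
            if (m.Success)
            {
                int level = m.Groups[1].Value.Length;     // number of leading '#'
                string title = m.Groups[2].Value.Trim();
                doc[$"header{level}"] = title;            // "## FOO" -> header2: FOO
                textKey = title;                          // following text is filed under the title
            }
            else if (line.Trim().Length > 0)
            {
                doc[textKey] = doc.TryGetValue(textKey, out var t) ? t + "\n" + line : line;
            }
        }
        return doc;
    }

    static void Main()
    {
        string markdown = File.ReadAllText("note.md");    // placeholder path
        foreach (var kv in ToDocument(markdown))
            Console.WriteLine($"{kv.Key}: {kv.Value}");
    }
}
```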

Should I use Parse::RecDescent or Regexp::Grammars to extract tables from documents?

I have lots of large plain-text documents that I wish to parse with Perl. Each document consists mostly of English paragraphs, with a couple of plain-text marked-up tables in each document.
I have created a grammar to describe the table structure, but I am unsure whether it would be best to use Parse::RecDescent or Regexp::Grammars to extract the tables.
I initially leaned towards Parse::RecDescent, but I'm not sure how, with a grammar, you would deal with the 90% of the document text I want to ignore in order to find the couple of tables buried inside each document.
Perhaps I need Regexp::Grammars so I can "pull" my expression through the document until it finds matches?
Thanks
Regexp::Grammars is what I wanted, as it allows you to pull your grammar through the document and find matches like a regular expression. Parse::RecDescent doesn't seem suited to scanning through a document and finding only the text that matches the grammar.