Any way to tell if an arbitrary .docx file is in the Strict Office Open XML format vs. the Transitional format? (ECMA-376) - openxml

I've searched around the web, and haven't found any procedure or tool that can distinguish those .docx files that are encoded as Strict ECMA-376 and those that are not. (same drill for .xlsx files) Most discussions center on which formats are supported by a given app, e.g. LibreOffice, but not how to distinguish files.
Dovetail question: 2. Does anyone know of any documentation that lays out the differences in the four editions of ECMA-376? http://www.ecma-international.org/publications/standards/Ecma-376.htm
On that page, you'll see first edition, second edition, third edition, fourth edition. First edition was 2006, and fourth edition is Dec 2012. None of the documentation appears to describe the revisions from one edition to the next, no "What's New in this edition" or anything like that. (In some cases they note structural changes, like a topic that was housed in Part 1 last time is now in Part 2, etc.)
Wikipedia describes the structural contents of the first two editions...: https://en.wikipedia.org/wiki/Office_Open_XML#Versions
...but has nothing to say about third or fourth editions, or substantive changes between the first two. Can anyone point me to documentation that lays out the iterative changes?
(ECMA-376 is normally mirrored by ISO 29500. ISO might document revisions, but their pubs are paywalled, and not just any paywall, but a 352 Swiss Franc paywall, which at today's exchange rates comes to $394.20...)

There are a couple of things to your questions:
Difference between Strict and Transitional
The main difference between Strict documents and Transitional documents is namespaces. The namespaces in Strict all contain #purl.org", as far as I remember, whereas namespaces in Transitional contain the word "microsoft". See the exact strings in Part 1 of OpenXml Standard for Strict and Part 4 for Transitional.
I do not think there is such a document available - and neither for the ISO-version.
And finally - you say that ECMA is mirrored by ISO. It is actually the other way around. Whenever ISO publishes a new version of the standard, an (almost) exact copy is published by ECMA (with their letter-head etc) afterwards.
And finally, finally, ISO OpenXml standard is free of charge. You can find the latest edition at http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html (search for "29500" on the page).
Should you wish to look a bit into what we do in the ISO working group, this is a good start: http://jtc1sc34.org/wg4 .
Jesper Lund Stocholm
Appointed Expert to ISO SC34 committee working with OpenXml.

According to section 17.2.3 of Ecma Office Open XML Part 1, the document element of the main document story will have an attribute "conformance" that specifies the conformance class. This can either be strict, or transitional. If the attribute is omitted, the default value is transitional.
Annex M of Ecma Office Open XML Part 1 (page 5028) documents the differences between ECMA-376:2012 and ECMA-376:2006.

Related

linker: what is "NMAGIC" section of linked file and what is the aim of section alignment?

ld man here say
-n
--nmagic
Turn off page alignment of sections, and mark the output as "NMAGIC" if possible.
-N
--omagic
Set the text and data sections to be readable and writable. Also, do not page-align the data segment, and disable linking against shared
libraries. If the output format supports Unix style magic numbers,
mark the output as "OMAGIC". Note: Although a writable text section is
allowed for PE-COFF targets, it does not conform to the format
specification published by Microsoft.
--no-omagic
This option negates most of the effects of the -N option. It sets the text section to be read-only, and forces the data segment to be
page-aligned. Note - this option does not enable linking against
shared libraries. Use -Bdynamic for this.
I do understand that theses options are used to make the code (.text) section writable or not, but I don't get the point to align or not the sections, and what is a "NMAGIC" section
On historic (PDP-11) Unix, an executable file's header began with a branch instruction that would jump past the header, to the actual start of the code. When Unix was ported to other processors, that initial PDP-11 branch instruction became fossilized as the "magic number" for the a.out(5) file format. When "pure text" was introduced, initially to allow processes to share their code segments, a new magic number was introduced so that the kernel could tell the difference (there were some important Unix programs that relied on self-modifying code and thus needed to be loaded with writable code segments). The old magic number (0407) was given the name "OMAGIC" -- "old magic" -- and the new magic number (0410) was given the name "NMAGIC", "new magic". The data segment immediately follows the code segment in memory, so when the code segment is made read-only, it must be padded to a page boundary.
Various operating systems and file formats since then introduced other magic numbers; in the last FreeBSD releases to use a.out format, the normal formats were ZMAGIC and QMAGIC, which were introduced to allow page zero in the address space to be unmapped for safety (so that a null-pointer dereference would fault) while still allowing executables to be demand paged (i.e., mmap()ed into the process's address space).
So to answer your question more directly: NMAGIC and OMAGIC are different formats of executable files, not of individual sections. They indicate the desired correspondence between the in-memory and on-disk layouts of the executable. (The reason these numbers are traditionally written in octal rather than hex or decimal is that octal is a natural representation for the instruction format on the PDP-11.) GNU ld uses these names (only) as references to executable formats that have analogous features, even when you are not generating traditional a.out format -- which of course is quite rare today. One particular benefit to using OMAGIC format is that it is more compact than any other format, which may matter in cases like boot loaders where space is limited, there is no demand paging, and there is also no room for any sort of padding.

Cite a paper using github markdown syntax

I am coding a readme for a repo in github, and I want to add a reference to a paper. What is the most adequate way to code in the citation? e.g. As a blockquote, as code, as simple text, etc?
Suggestions?
I agree with Horizon_Net that it depends on personal preference.
I like to have something which looks similar to LaTeX.
An example is provided below.
Note that it demonstrates a numeric citation style.
Alphabetic or reading style are possible too.
For numeric citation style, higher numbers should appear later in the text, and this can make satisfying numeric citation style cumbersome.
To avoid this problem, I typically use an alphabetic citation style.
"...the **go to** statement should be abolished..." [[1]](#1).
## References
<a id="1">[1]</a>
Dijkstra, E. W. (1968).
Go to statement considered harmful.
Communications of the ACM, 11(3), 147-148.
"...the go to statement should be abolished..." [1].
References
[1]
Dijkstra, E. W. (1968).
Go to statement considered harmful.
Communications of the ACM, 11(3), 147-148.
On GitHub flavored Markdown and most other Markdown flavors, you can actually click on [1] to jump to the reference.
Apologies for taking Dijkstra his sentence out of context.
The full sentence would make this example more difficult to read.
EDIT:
If the references all have a stable link, it is also possible to use those:
The field of natural language processing (NLP) has become mostly dominated by deep learning approaches
(Young et al., [2018](https://doi.org/10.1109/MCI.2018.2840738)).
Some are based on transformer neural networks
(e.g., Devlin et al, [2018](https://arxiv.org/abs/1810.04805)).
The field of natural language processing (NLP) has become mostly dominated by deep learning approaches
(Young et al., 2018).
Some are based on transformer neural networks
(e.g., Devlin et al, 2018).
From what I know, there is no built-in mechanism for this. This leads to more subjective opinions, depending on personal preference. I personally like to have a separate section called references. An example would look like the following
References
some reference
another reference
Update
Another way would be to use simple HTML embedded in your Markdown.
The alternative (to pure markdown) is to use the new GitHub integration from Aug. 2021:
Enhanced support for citations on GitHub
GitHub now has built-in support for CITATION.cff files.
This new feature enables academics and researchers to let people know how to correctly cite their work, especially in academic publications/materials.
Originally proposed by the research software engineering community, CITATION.cff files are plain text files with human- and machine-readable citation information.
When we detect a CITATION.cff file in a repository, we use this information to create convenient APA or BibTeX style citation links that can be referenced by others.
How this works
Under the hood, we’re using the ruby-cff RubyGem to parse the contents of the CITATION.cff file and build a citation string that is then shown in GitHub when someone browses a repository with one of those files1.
Now, that is helping others making your Github paper easily citable.
But, as documented, to "add a reference to a paper":
Citing something other than software
If you would prefer the GitHub citation information to link to another resource such as a research article, then you can use the preferred-citation override in CFF with the following types.
Resource
Type
Research article
article
Conference paper
conference-paper
Book
book
Extract
preferred-citation:
type: article
authors:
- family-names: "Lisa"
given-names: "Mona"
orcid: "https://orcid.org/0000-0000-0000-0000"

Lotus Notes Diff Tool

Is there any diff tool for Lotus Notes which allows to compare scripts, design elements and documents?
I see this is an old question, and most of the other answers are a little outdated now, so I thought I would add some hopefully valuable information for those who should stumble upon this now.
In Domino Designer, open either the Navigator or Package Explorer (Window menu -> Show Eclipse Views). Here you can expand databases/templates to see the design elements they contain. Select two or three elements (CTRL-click). They can be in different databases or the same database. Right click on one of the elements and select Compare with -> Each other.
You can also compare two databases element by element by selecting two databases/templates, right-clicking and selecting Compare with -> Each other. You will then get the differences between the two databases listed. You will be able to see which elements differ between the two databases, and which elements exist in one database but not the other. By double-clicking on a differing element, you will open a diff tool which lets you see differences line by line, and you can easily copy changes from left to right or right to left.
There is a tool from TeamStudio called Delta: http://www.teamstudio.com/products/delta.html
If all else fails (and by "all else" I mean the often ridiculous corporate procurement system) you can always do a an export to DXL (or a Design Synopsis for code alone) and use any decent text editor with a diff function. It's not TeamStudio Delta, but it will get you where you want to go.
There is a free tool from OpenNTF which does document comparisons:
http://www.openntf.org/Projects/pmt.nsf/ProjectLookup/Compare%20Notes%20Documents
Ytria also has a product which, among other things, will compare data documents (I don't believe it compares design elements).
http://www.ytria.com/website.nsf/WebPageRequest/Solutions_scanEZ_specen
And, I believe Martin Scott (http://www.martinscott.com) has a similar product which compares documents.
DDE (Domino Designer on Eclipse) let's you compare design elements natively. Same way as the search. It's pretty efficient (faster than a DXL exportation) and it's free.
I had a discussion on my blog a little while back about this:
http://rosshawkins.net/archive/2009/12/24/notesdomino-refactoringanalysis-tools.aspx
However what I've ended up doing in the past is exporting the design to the filesystem and using standard text tools (WinMerge and SublimeText for me personally) to do what I need.
Being able to do the raw dump is something that was added with the Eclipse based designer, and isn't overly obvious, but you can read more about it here:
rosshawkins.net/archive/2010/01/20/searching-the-contents-of-notesdomino-design-elements.aspx
(link mangled as my rep is too low to post 2 links in one post yet!)
Teamstudio Delta is really nice. However it might kill you with too many details. As Ross pointed out the Domino Designer 8.5 can use the Diff tool inherited from Eclipse. You also could head over to http://www.openntf.org and look for the DXLMagic project. It can generate a report that shows differences (including code) between 2 databases (typically a template and a variation of it). It is not as complete as Delta, but shows the essentials. It's free and source is included (Disclaimer: I wrote it).
This is what I do. I run a design synopsis of the database using the Notes Designer. Dump the file to a text file. You can actually split the synopsis out to different objects like Agents, Forms, Views, etc. Then you can run UNIX/Linux/Mac Unix commands to compare the elements. By doing this operation you find out what code is active, and have a complete documented source code. You do a lot of csplit and a few sed commands.
Version 12.0.1 has such a tool as part of the server. Look for comparedbs.ntf and designsynopsis.ntf on the Domino server.

Tool to compare/diff HTML in bulk

I have a lot of HTML files (10,000's and GBs worth) scraped from a server and I want to check to make sure the server produces the same results after some modifications but ignore kinds of differences that don't matter, e.g. whitespace, missing newlines, timestamps, small changes in some kinds of number, etc.
Does anyone know of a tool for doing this? I'd really rather not do more filtering than I have to.
(Oh and it needs to run under linux)
You might consider using a clone detector such as our CloneDR. This tool parses large sets of computer program (HTML is special case) files, builds abstract syntax trees representing the essential structure of each files, and compares programs for similarity.
Because it is comparing essential program structure, it ignores inessential differences such as comments and whitespace, and deterimines that two code segments are either identical or one can be obtained from the other by substituting other blocks of code. The latter allows the recognition of code that has been modified in various ways. You can see samples of clone detection runs on a variety of computer languages at the web site.
In your case, what you would be looking for are files in system A which are essentially clones (exact or near misses) of files in system B. As a general rule, if a file a is a variant of file b (e.g., with a few changes) the CloneDr will report it as a clone and show the exact differences.
At the scale of 20,000 files, I can see why you want a tool, and I can see why you want near-miss matches rather than exact matches.
Doesn't run under Linux, but I assume your problem is hard to enough to solve so that isn't what you are optimizing.
I use winmerge alot in windows and from what i can see some people enjoy meld in linux, so perhaps that could do the trick for you
http://meld.sourceforge.net/
Other examples i saw from a quick googling was Kompare,xxdiff.sourceforge.net, and kdiff3.sourceforge.net
(could only post 1 link so wrote the adresses to xxdiff and kdiff3 as text)
Beyond Compare is purchased software that is actually worth the money (I never thought I'd hear myself typing that!). It is GUI based but handles thousands of files very well. It will allow you to specify unimportant changes with regular expressions as well as whitespace (beginning, middle and end of line). The feature set is very extensive, check out a trial download.
I do not work for this company, I just use Beyond Compare every day at work and enjoy it every time!

Reliably getting a character count for .doc files

What's a reliable way to automatically count the characters and/or words in a .doc or .docx file?
The only real requirement is a reasonably accurate and reasonably reliable count.
It needs to work with documents containing something other than Latin script, so counting characters is good enough for most cases.
The count does not necessarily need to match Word's, but the closer the better.
Since there are a gazillion different apps that can generate .doc files, it's okay to fail to count anything, but this case needs to be catchable so we're aware that a count may be inaccurate. For all other cases the count must be, say, at least 99% accurate at least 99% of the time.
I'm open as to the involved technologies, but something that can run on a *NIX command line would be greatly preferred.
Is there a reasonable solution for this?
Here's a link to some Linux word-to-text converters.
For example you could use
antiword file.doc | wc
to do the counting.
Edit:
This link shows that AbiWord has a command-line interface, that you could use to convert the .docx format to .txt and then count the words using "wc". AbiWord does support the docx format
Mac OS X has support for reading word files built into the system frameworks, so if you have that, it's easy. MacRuby sample:
NSSpellChecker.sharedSpellChecker.countWordsInString(NSAttributedString.alloc.initWithURL(fileURL, documentAttributes:nil), language:nil)
More portably — though it gives up support for docx — you could simply get Antiword and do antiword | wc -w.
Microsoft has published a specification for the Office binary file formats. Parsing a .DOC file doesn't look trivial, but with some care you should be able to get a dependable, repeatable result. I have no idea how closely it'll match with what Word shows -- that will probably depend (at least partly) on how you define "word" -- for example, whether you consider a group of digits a "word" or not. It probably won't take a lot to figure out how Word treats cases like that, so getting a close match shouldn't be terribly difficult.
If you consider online applications as a solution, yes, there is a solution.
This not so pretty (regarding the design) site offers both word and character count: http://allworldphone.com/count-words-characters.htm
I don't think there is a limit, and it shouldn't be a problem to just copy/paste the contents of your documents into the corresponding textarea and see the result.
Regarding the 100% or 99% accuracy, you could test it with a few (i.e. 20-50 words) by counting them yourself first.
I hope this helps.
Regards. Chris