Relationship between the standards DocBook, DITA, OpenDocument, CMIS and MoReq2 - Perl

Can anybody explain, for dummies :), the relationship between these (mostly OASIS) standards?
DocBook, DITA, OpenDocument
CMIS
MoReq2
As I understand it so far:
DITA, DocBook and OpenDocument are standards for document file formats
CMIS is the one I still need explained
MoReq2 is a standard for digital archives that store metadata about documents (a records management standard)
So, for a portable solution, do I need to
store documents in the above formats (but when should I use which one?)
and describe them with MoReq2 schemas?
And where does CMIS come in?
Or am I totally wrong?
PS: I understand that this is a complex question, but nowhere have I found a simple explanation of their relationship.
PS2: A bonus question - do any of the above have support in Perl?

The topics I know best are the first three (DocBook, DITA, OpenDocument).
DocBook and DITA are standards for writing potentially long technical documents in which you do not specify any style or presentation. Rather, you just write text, and then you can tag the text with information about its role (whether it is a keyword, a warning note, etc.). This way you can use stylesheets to apply consistent style to all of your text, and you can produce multiple publication formats from it.
DocBook focuses more on providing a large set of tags that covers every common case, while DITA focuses on a bare minimum that is easy to extend. Another difference is that DocBook encourages you to think in terms of long documents, whereas DITA encourages you to think in reusable "modular" documents.
Both DocBook and DITA documents would be stored in multiple files. A single document could span anywhere from tens to thousands of files.
OpenDocument is a standard for specific office documents. As such, an OpenDocument document would often be a single file. An OpenDocument document is more specific than DocBook or DITA. It is less likely to be a book, and more likely to be a letter, a specification, a spreadsheet or a presentation. Also, unlike DocBook and DITA, OpenDocument will very likely contain style information (colours, numbering, etc), because the text is not necessarily related to any other document and is only used once.
Each of DocBook, DITA and OpenDocument is a format used to store text in files. Usually these are XML files.
CMIS. I had never heard of this before today, but I do know about content management systems. I can therefore tell you that it is a headache to try to manage the path that a certain piece of text is supposed to take from the repository, disk or database where it is stored, up to the book, webpage, help system or blog where it is supposed to be published. Content management systems help you specify data for large sets of files; this data can then be used by a tool to decide where to publish a document, or just a piece of information. A content management system can be as simple as two folders on your hard drive: any files put in one folder should be published, for example, as PDFs in Chinese, whereas files put into the second folder should be published as blog entries in German and Turkish.
Now, content management systems are usually much more sophisticated than that, and there are many of them. I imagine that CMIS is an abstraction layer that lets different content management systems inter-operate, if by chance you have invested in more than one of them.
Finally, MoReq2. Again, I only discovered this today, and unlike CMIS, I don't even have experience with record keeping. However, you have two answers from @Tasha and @Marc Fresko which should give you a good start.
What I imagine about MoReq2 is that it can help you manage the lifecycle of your documents. For example, you may want to specify that a certain policy document is only valid until 2010, or that it has been deprecated already. I also imagine that MoReq2 is much, much more than that.
To sum up, all of these standards concern document management. DocBook, DITA and OpenDocument are about writing and storing documents. CMIS is about managing where the documents go. And MoReq2 seems to be about how long they live.

On CMIS, try this link. MoReq2 is not about digital archives, and it's not about "storing metadata". It is a set of typical functional requirements for a decent Electronic Records Management System. Both documents are in the public domain - get them and read the introductions.

Tasha's reply is 100% accurate. I'll add that the metadata model in MoReq2 is the weakest part of MoReq2, and arguably the least important - it probably contains many errors. I say this on the basis of having been the leader of the MoReq2 project.

Related

Document similarity framework

I would like to create an application which searches for similar documents in its database; e.g., the user uploads a document (text, image, etc.), and I would like to query my application for similar ones.
I have already created the necessary algorithms for the process (fingerprinting, feature extraction, hashing, hash comparison, etc.); I'm looking for a framework that couples all of these.
For example, if I were to implement it in Lucene, I would do the following:
Create a custom "tokenizer" and "stemmer" (~ feature extraction and fingerprinting)
Then add the created elements to the Lucene index
Finally, use the MoreLikeThis class to find the similar documents (a rough sketch follows)
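For illustration, a rough Java sketch of that last step with Lucene's MoreLikeThis. The index path, the "content" field and the analyzer are placeholder assumptions, not anything from the question; you would substitute whatever your custom tokenizer/stemmer pipeline produces:

    // Rough sketch: query for documents similar to one already in the index
    // using Lucene's MoreLikeThis. Field name and analyzer are assumptions.
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    public class SimilarDocs {
        public static void main(String[] args) throws Exception {
            IndexReader reader = DirectoryReader.open(
                    FSDirectory.open(Paths.get("/path/to/index")));
            IndexSearcher searcher = new IndexSearcher(reader);

            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setAnalyzer(new StandardAnalyzer());
            mlt.setFieldNames(new String[] { "content" });
            mlt.setMinTermFreq(1);  // loosen defaults for fingerprint-like terms
            mlt.setMinDocFreq(1);

            Query query = mlt.like(42);  // 42 = doc id of the reference document
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println("similar doc " + hit.doc + ", score " + hit.score);
            }
            reader.close();
        }
    }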
So basically, Lucene might be a good choice - but as far as I know, Lucene is not meant to be a document-similarity search engine, but rather a term-based search engine.
My question is: are there any applications/frameworks that might fit the above-mentioned problem?
Thanks,
krisy
UPDATE: It seems the process I described above is called content-based media (sound, image, video) retrieval.
There are many projects that use Lucene for this, see: http://wiki.apache.org/lucene-java/PoweredBy (Lire, Alike, etc.), but I still haven't found any dedicated framework...
Since you're using Lucene, you might take a look at SOLR. I do realize it's not a dedicated framework for your purpose either, but it does add stuff on top of Lucene that comes in quite handy. Given the pluggability of Lucene, its track record and the fact that there are a lot of useful resources out there, SOLR might help you get your job done.
Also, the answer that @mindas pointed to links to a blog post describing the technical details of how to accomplish your goal with SOLR (but you have probably already read that in the meantime).
If I understand correctly, you have your own database, and while/after a user uploads a document you search that database for duplicates or similar copies.
If that is the case, the domain is very broad.
1) For images you must use pattern matching; there are a few papers about image duplicate finders available on the net. Search for them and you will find many options.
2) For documents there is again a characteristic division:
DOC(x)
PDF
TXT
RTF, etc..
Each document type carries different properties. Lucene may help you here, but it is a search engine;
when searching for language patterns there are many things to check, since you are looking for similar (not exactly the same) documents.
So a fuzzy language program will come in handy.
This requirement is too large for a forum page to explain everything, but I hope this much will do.

Embedded nosql open source java database

I'm developing an open source product and need an embedded dbms.
Can you recommend an embedded open source database that ...
Can handle objects over 10 GB each
Has a license friendly to embedding (LGPL, not GPL).
Is pure Java
Is (preferably) NoSQL. SQL might work, but I prefer NoSQL
I've looked over some of the document DBMSs, like MongoDB,
but they seem to be limited to 4 or 16 MB documents.
Berkeley DB looked attractive but has a GPL-like license.
SQLite3 is attractive: good license, and you can compile
with whatever max blob size you like. But it's not Java.
I know JDBC drivers exist, but we need a pure Java system.
Any suggestions?
Thanks
Steve
Although it's an old question, I've been looking into this recently and have come across the following (at least two of which were written after this question was asked). I'm not sure how any of these handle very large objects - and at 10GB you would probably have to do some serious testing, as I presume few database developers would have objects of that size in mind for their products (just a guess). I would definitely consider storing them to disk directly, with just a reference to the file location in your database.
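That last suggestion is easy to prototype. A minimal JDK-only sketch of the idea (the directory and naming scheme are arbitrary examples): stream the large object straight to a file and keep only its path, keyed by ID, in the database.

    // Sketch of the "blob on disk, reference in the DB" idea. The blobs
    // directory and the id-based file name are illustrative choices.
    import java.io.InputStream;
    import java.nio.file.*;

    public class DiskBlobStore {
        private final Path dir;

        public DiskBlobStore(Path dir) throws Exception {
            this.dir = Files.createDirectories(dir);
        }

        /** Streams the blob to disk; returns the path to store in the DB. */
        public String put(String id, InputStream data) throws Exception {
            Path target = dir.resolve(id + ".bin");
            Files.copy(data, target, StandardCopyOption.REPLACE_EXISTING);
            return target.toString(); // persist this string in your database
        }

        public InputStream get(String pathFromDb) throws Exception {
            return Files.newInputStream(Paths.get(pathFromDb));
        }
    }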
(Opinions below are all pretty superficial, by the way, as I haven't used them in earnest yet).
OrientDB looks like the most mature of the three I found. It appears to be a document and/or graph database and claims to be very fast and light (making use of an "RB+Tree" data structure - a combination of B+ and red-black trees), with no external dependencies. There seems to be an active community developing it, with lots of commits over the last few days, for example. It's also compliant with the TinkerPop graph database standard, which adds another layer of features (such as the Gremlin graph querying language). It's ACID compliant, has REST and other external APIs, and even a web-based management app (which presumably could be deployed with your embedded DB, but I'm not sure).
The next two fall more into the simple key-value store camp of the N(ot)O(nly)SQL world.
JDBM3 is an extremely minimal data store: it has a hash map, tree map, tree set and linked list, which are written to disk through memory-mapped files. It claims to be very light and fast, is fully transactional, and is being actively developed.
HawtDB looks similarly simple and fast - a BTree- or hash-based index persisted to disk with memory-mapped files. It's (optionally) fully transactional. There have been no commits in the past seven months (to the end of March 2012) and there's not much activity on the mailing list. That's not to say it's not a good library, but it's worth mentioning.
JDBM3 and HawtDB are pretty minimal, so you're not going to get any fancy GUIs. But I think they both look very attractive for their speed and simplicity.
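To make the memory-mapped-file mechanism they both rely on a bit more concrete, here is a tiny JDK-only illustration. This shows the underlying OS facility, not either library's actual API:

    // JDK-only illustration of memory-mapped persistence, the mechanism
    // JDBM3 and HawtDB build their stores on.
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class MmapDemo {
        public static void main(String[] args) throws Exception {
            try (RandomAccessFile raf = new RandomAccessFile("data.bin", "rw");
                 FileChannel ch = raf.getChannel()) {
                // Map 4 KB of the file into memory; writes go through the
                // OS page cache rather than explicit write() calls.
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
                buf.putInt(0, 42);   // write a value at offset 0
                buf.force();         // flush dirty pages to disk
                System.out.println(buf.getInt(0));
            }
        }
    }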
Those are all I've found matching your requirements. In addition, Neo4j is great - a graph database which is now pretty mature and works very well in embedded mode. It's GPL/AGPL licensed, though, so it may require a paid license unless you can open-source your code too:
http://neotechnology.com/products/price-list/
Of course, you could also use the H2 SQL database with one big table and no indices!
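If you did settle for that, the embedded setup is plain JDBC. A minimal sketch - the file URL and table layout are arbitrary examples:

    // Minimal sketch of embedded H2 as a key -> blob store over JDBC.
    import java.io.FileInputStream;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class H2BlobStore {
        public static void main(String[] args) throws Exception {
            try (Connection conn =
                     DriverManager.getConnection("jdbc:h2:./data/blobstore")) {
                conn.createStatement().execute(
                    "CREATE TABLE IF NOT EXISTS blobs(" +
                    "id VARCHAR PRIMARY KEY, data BLOB)");
                try (PreparedStatement ps = conn.prepareStatement(
                         "INSERT INTO blobs VALUES (?, ?)");
                     FileInputStream in = new FileInputStream("big-object.bin")) {
                    ps.setString(1, "object-1");
                    ps.setBinaryStream(2, in); // streamed, not buffered in memory
                    ps.executeUpdate();
                }
            }
        }
    }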

Office development - Word

I have a Word document [a template] with some placeholders in it. I need to populate the placeholders with some data. I also need to generate a table at runtime, i.e. I can't have the table designed at design time [the number of rows and columns varies].
I see a lot of posts online: WordprocessingML, OpenXML. Which path should I take? Do I even have to use the template, or should I just generate the entire doc at runtime? I am confused...
As the comments mention, the question is a bit broad, but in general, there are a few alternatives.
1) If you can deal with ONLY the newer-format DOCX files, then Plutext's OpenDoPE is a good possible solution.
2) If you have to deal with older-format DOC files, you may find that Word COM Automation is about the only decent solution, but that has other issues, such as speed, and the much greater difficulty of using it in a server environment.
3) There are some 3rd-party Word libraries out there that let you manipulate DOC files for mail merge, but most only give you barely more functionality than the default Word mail merge. Windward Reports is one solution I came very close to using at one point. It's not cheap, but it is quite powerful. Aspose is another one, though its merge is pretty basic.
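Whichever route you take, the runtime-sized table itself is simple in WordprocessingML terms: each row is a <w:tr> and each cell a <w:tc>. A hedged sketch of generating just that fragment (packaging it into a complete .docx with a library such as docx4j is a separate step):

    // Builds the WordprocessingML <w:tbl> fragment for a table whose row and
    // column counts are only known at runtime. Only the table markup is shown.
    public class TableXml {
        static String buildTable(String[][] cells) {
            StringBuilder sb = new StringBuilder("<w:tbl>");
            for (String[] row : cells) {
                sb.append("<w:tr>");
                for (String cell : row) {
                    sb.append("<w:tc><w:p><w:r><w:t>")
                      .append(escape(cell))
                      .append("</w:t></w:r></w:p></w:tc>");
                }
                sb.append("</w:tr>");
            }
            return sb.append("</w:tbl>").toString();
        }

        static String escape(String s) {
            return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
        }
    }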

Alternative to CSV?

I intend to build a RESTful service which will return a custom text format. Given my very large volumes of data, XML/JSON is too verbose. I'm looking for a row-based text format.
CSV is an obvious candidate. I'm however wondering if there isn't something better out there. The only ones I've found through a bit of research are CTX and Fielded Text.
I'm looking for a format which offers the following:
Plain text, easy to read
very easy to parse by most software platforms
column definition can change without requiring changes in software clients
Fielded Text is looking pretty good and I could definitely build a specification myself, but I'm curious to know what others have done, given that this must be a very old problem. It's surprising that there isn't a better standard out there.
What suggestions do you have?
I'm sure you've already considered this, but I'm a fan of tab-delimited files (\t between fields, newline at the end of each row)
I would say that since CSV is the standard, and since everyone under the sun can parse it, use it.
If I were in your situation, I would take the bandwidth hit and use GZIP+XML, just because it's so darn easy to use.
And, on that note, you could always require that your users support GZIP and just send it as XML/JSON, since that should do a pretty good job of removing the redundancy across the wire.
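On the JVM that is a one-liner to prototype; a minimal sketch of compressing a JSON/XML payload by hand (in practice the web server or framework usually handles this via the Content-Encoding: gzip header):

    // Gzip-compressing a response body by hand, using only the JDK.
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    public class Gzip {
        static byte[] gzip(String payload) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(payload.getBytes(StandardCharsets.UTF_8));
            }
            return bos.toByteArray();
        }
    }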
You could try YAML; its overhead is relatively small compared to formats such as XML or JSON.
Examples here: http://www.yaml.org/
Surprisingly, the website's text itself is YAML.
I have been thinking about that problem for a while. I came up with a simple format that could work very well for your use case: JTable.
    {
      "header": ["Column1", "Column2", "Column3"],
      "rows": [
        ["aaa", "xxx", 1],
        ["bbb", "yyy", 2],
        ["ccc", "zzz", 3]
      ]
    }
If you wish, you can find a complete specification of the JTable format, with details and resources. But it is pretty self-explanatory, and any programmer would know how to handle it. The only thing you really need to say is that it is JSON.
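Since it is plain JSON, any JSON library can read it. For example, a small sketch with Jackson (the class name and literals just mirror the example above):

    // Reading the JTable example with Jackson: the header array gives column
    // names, and each entry of "rows" is one record in the same order.
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JTableDemo {
        public static void main(String[] args) throws Exception {
            String json = "{\"header\":[\"Column1\",\"Column2\",\"Column3\"]," +
                          "\"rows\":[[\"aaa\",\"xxx\",1],[\"bbb\",\"yyy\",2]]}";
            JsonNode root = new ObjectMapper().readTree(json);
            JsonNode header = root.get("header");
            for (JsonNode row : root.get("rows")) {
                for (int i = 0; i < header.size(); i++) {
                    System.out.println(header.get(i).asText() + " = " + row.get(i));
                }
            }
        }
    }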
Looking through the existing answers, most struck me as a bit dated. Especially in terms of 'big data', noteworthy alternatives to CSV include:
ORC: 'Optimised Row Columnar' stores data column-wise (rows grouped into stripes) and is useful in Python/Pandas. It originated in Hive and was optimised by Hortonworks. The schema is in the footer. The Wikipedia entry is currently quite terse (https://en.wikipedia.org/wiki/Apache_ORC) but Apache has a lot of detail.
Parquet : Similarly column-based, with similar compression. Often used with Cloudera Impala.
Avro: from Apache Hadoop. Row-based, but uses a JSON schema. Pandas support is less capable. Often found in Apache Kafka clusters.
All are splittable, all are inscrutable to people, all describe their content with a schema, and all work with Hadoop. The column-based formats are considered best where accumulated data are read often; with many writes, Avro may be more suited. See e.g. https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/
Compression of the column formats can use Snappy (faster) or gzip (slower but more compression).
You may also want to look into Protocol Buffers, Pickle (Python-specific) and Feather (for fast communication between Python and R).

NoSQL database and many semi-large blobs

Is there a NoSQL (or other type of) database suitable for storing a large number (i.e. >1 billion) of "medium-sized" blobs (i.e. 20 KB to 2 MB). All I need is a mapping from A (an identifier) to B (a blob), the ability to retrieve "B" given A, a consistent external API for access, and the ability to "just add another computer" to scale the system.
Something simpler than a database, e.g. a distributed key-value system, may be just fine, and I'd appreciate any thoughts along that vein as well.
Thank you for reading.
Brian
If your API requirements are purely along the lines of "Get(key), Put(key, blob), Remove(key)", then a key-value store (or, more accurately, a "persistent distributed hash table") is exactly what you are looking for.
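In code terms, the entire contract you describe fits in an interface this small (a sketch; the names are illustrative, and any persistent distributed hash table can sit behind it):

    // The whole API surface the question asks for: map identifier A to blob B.
    import java.io.InputStream;

    public interface KeyValueStore {
        void put(String key, InputStream blob, long length) throws Exception;
        InputStream get(String key) throws Exception;  // lookup B given A
        void remove(String key) throws Exception;
    }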
There are quite a few of these available, but without additional information it is hard to make a solid recommendation - what OS are you targeting? Which language(s) are you developing with? What are the I/O characteristics of your app (cold/immutable data such as images? high write loads, like tweets?)
Some of the KV systems worth looking into:
- MemcacheDB
- Berkeley DB
- Voldemort
You may also want to look into document stores such as CouchDB or RavenDB. Document stores are similar to KV stores, but they understand the persistence format (usually JSON), so they can provide additional services such as indexing.
If you are developing in .NET then skip directly to RavenDB (you'll thank me later).
What about Jackrabbit?
Apache Jackrabbit™ is a fully conforming implementation of the Content Repository for Java Technology API (JCR, specified in JSR 170 and 283).
A content repository is a hierarchical content store with support for structured and unstructured content, full text search, versioning, transactions, observation, and more.
I got to know Jackrabbit when I worked with the Liferay CMS. Liferay uses Jackrabbit to implement its Document Library; it stores user files in the server's file system.
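For a feel of the API, here is a hedged sketch of storing one blob through the standard JCR interfaces. TransientRepository is Jackrabbit's self-contained embedded repository; the credentials and node names are placeholders:

    // Hedged sketch: storing a blob in embedded Jackrabbit via the standard
    // JCR API (JSR 170). nt:file and nt:resource are standard JCR node types.
    import javax.jcr.Node;
    import javax.jcr.Repository;
    import javax.jcr.Session;
    import javax.jcr.SimpleCredentials;
    import org.apache.jackrabbit.core.TransientRepository;
    import java.io.FileInputStream;
    import java.util.Calendar;

    public class JcrBlobDemo {
        public static void main(String[] args) throws Exception {
            Repository repository = new TransientRepository();
            Session session = repository.login(
                    new SimpleCredentials("admin", "admin".toCharArray()));
            try {
                Node file = session.getRootNode().addNode("blob-123", "nt:file");
                Node content = file.addNode("jcr:content", "nt:resource");
                content.setProperty("jcr:mimeType", "application/octet-stream");
                content.setProperty("jcr:data", new FileInputStream("blob.bin"));
                content.setProperty("jcr:lastModified", Calendar.getInstance());
                session.save();
            } finally {
                session.logout();
            }
        }
    }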
You'll also want to take a look at Riak. Riak is very focused on doing exactly what you're asking (just add another node, easy to access).