Document Repository REST application in Java

I have a requirement to develop a Document Repository which will maintain all documents related to different listed companies. Each document will be related to a company. It has to be a REST API. Documents can be in PDF, HTML, Word, or Excel format. Along with storing documents, I need to store metadata as well, like CompanyID, document format, timestamp, document language, etc.
As the number of documents will grow in the years to come, it's important that the application is scalable.
I also need to translate non-English documents and store the translated English version in some parent-child relation that is easy to retrieve.
Any insights on the approach, libraries/JARs to use, best practices, and references are welcome.

The base64-encoded content of the file could be included as part of your payload along with the file metadata.
Posting a File and Associated Data to a RESTful WebService preferably as JSON
Once the file reaches your end, you can save it either locally to your hard disk or store the same base64-encoded content in your data store (use a BLOB/CLOB).
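As a minimal sketch of that idea: the class below packs raw file bytes plus metadata into a single JSON payload by hand. The field names (companyId, format, contentBase64) are illustrative, not a standard, and a real service would build the JSON with Jackson or Gson rather than string concatenation.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Sketch of packing a document plus its metadata into one JSON payload.
public class DocumentPayload {

    // Encode raw file bytes as base64 so they can travel inside a JSON string.
    public static String encodeContent(byte[] fileBytes) {
        return Base64.getEncoder().encodeToString(fileBytes);
    }

    // Decode the base64 string back into the original bytes on the server side.
    public static byte[] decodeContent(String base64) {
        return Base64.getDecoder().decode(base64);
    }

    // Assemble the JSON by hand; field names here are only an example schema.
    public static String toJson(String companyId, String format, byte[] fileBytes) {
        return "{\"companyId\":\"" + companyId + "\","
             + "\"format\":\"" + format + "\","
             + "\"contentBase64\":\"" + encodeContent(fileBytes) + "\"}";
    }

    public static void main(String[] args) {
        byte[] doc = "report body".getBytes(StandardCharsets.UTF_8);
        System.out.println(toJson("ACME-001", "pdf", doc));
    }
}
```

Note that base64 inflates the payload by roughly a third, so for very large documents a multipart upload is usually the better trade-off.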

Related

Creating a query to get certain parts from a file in GridFS

In my Spring Boot application I used GridFS to store large files in my database. To find certain files, I use normal queries on the files collection:
GridFSFile file = gridFsTemplate.findOne(Query.query(Criteria.where(ID).is(id)));
but with this approach I'm getting the entire file.
My question is, how do I create queries without loading the whole file into memory?
My stored files are books (in PDF format), and suppose I want to get the content of a certain page without loading the entire book into memory.
I'm guessing I'll have to use the chunks collection and perform some operations on the chunks, but I cannot find how to do that.
GridFS is described here. Drivers do not provide a standardized API for retrieving parts of a file, but you can read that spec and construct your own queries to retrieve portions of the stored chunks.
Your particular driver may provide partial file retrieval functionality; consult its docs for that.
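To sketch the arithmetic involved: per the GridFS spec, a file is split into fixed-size chunks (255 KiB by default), each stored in fs.chunks with a files_id and a sequence number n. To read only a byte range, you first compute which chunk indexes cover it; the helper below shows just that calculation.

```java
// Sketch: which GridFS chunks cover a given byte range?
// GridFS splits files into fixed-size chunks; 255 * 1024 bytes is the default.
public class ChunkRange {

    public static final int DEFAULT_CHUNK_SIZE = 255 * 1024;

    // Index of the chunk containing byte `offset` of the file.
    public static long chunkIndexFor(long offset, int chunkSize) {
        return offset / chunkSize;
    }

    // Inclusive [first, last] chunk indexes covering bytes [offset, offset + length).
    public static long[] chunksFor(long offset, long length, int chunkSize) {
        long first = chunkIndexFor(offset, chunkSize);
        long last = chunkIndexFor(offset + length - 1, chunkSize);
        return new long[] { first, last };
    }

    public static void main(String[] args) {
        // Bytes 300000..500000 of a file with the default chunk size:
        long[] range = chunksFor(300_000, 200_001, DEFAULT_CHUNK_SIZE);
        System.out.println("chunks " + range[0] + " to " + range[1]);
    }
}
```

With Spring Data MongoDB, the chunks themselves could then be fetched with something like Criteria.where("files_id").is(fileId).and("n").gte(range[0]).lte(range[1]) against the fs.chunks collection, decoding only the chunk data you need. Note this gives you raw bytes, not pages: mapping a PDF page to a byte range is a separate problem the storage layer cannot solve for you.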

How best to store HTML alongside strings in Cloud Storage

I have a collection of data, and each item consists of a chunk of HTML and a few strings, for example:
html: <div>html...</div>; name string: html chunk 1; date string: 01-01-1999; location string: London, UK. I would like to store this information together as a single cloud storage object. Specifically, I am using Google Cloud Storage. There are two ways I can think of doing this. One is to store the strings as custom metadata and the HTML as the actual file contents. The other is to store all the information as a JSON file, with the HTML as a base64-encoded string.
I want to avoid a situation where after having stored a lot of data, I find there is some limitation to the approach I am using. What is the proper way to do this - is either of these approaches bad practice? Assuming there is no problem with either, I would go with the JSON approach because it is easier to pass around all the data together as a file.
There isn't a specific right way to do what you're talking about; there are potential pitfalls and performance criteria, but they depend on what you're doing with the data and why:
Do you ever need access to the metadata for queries? You won't be able to do that efficiently if you pack everything into one variable as a JSON object.
What are you parsing the data with later? Does it have built-in support for JSON? Does it support something else?
Is speed a consideration? Is cloud storage space a consideration?
Does a user have the ability to input the HTML, and could they potentially perform some sort of attack?
How do you use the data when you retrieve it? How stable is the format of the data?
You could use JSON, Protocol Buffers, packed binary blobs in a length | value based format, base64 with a delimiter, or zip files turned into binary blobs. Do what suits your application and allows a clean, structured design that you can test and maintain.
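One of the options mentioned above, a packed length | value binary blob, can be sketched in a few lines of plain Java. The field order (html, name, date, location) is just a convention you would have to fix and document yourself:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Sketch of length|value packing for an HTML chunk plus its associated strings.
// Each field is written as a 4-byte length prefix followed by its UTF-8 bytes.
public class PackedRecord {

    public static byte[] pack(String... fields) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream data = new DataOutputStream(out);
        for (String field : fields) {
            byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
            data.writeInt(bytes.length);  // length prefix
            data.write(bytes);            // value
        }
        return out.toByteArray();
    }

    public static String[] unpack(byte[] blob, int fieldCount) throws IOException {
        DataInputStream data = new DataInputStream(new ByteArrayInputStream(blob));
        String[] fields = new String[fieldCount];
        for (int i = 0; i < fieldCount; i++) {
            byte[] bytes = new byte[data.readInt()];
            data.readFully(bytes);
            fields[i] = new String(bytes, StandardCharsets.UTF_8);
        }
        return fields;
    }

    public static void main(String[] args) throws IOException {
        byte[] blob = pack("<div>html...</div>", "html chunk 1", "01-01-1999", "London, UK");
        System.out.println(unpack(blob, 4)[3]); // prints "London, UK"
    }
}
```

Compared to JSON this avoids base64 overhead and escaping entirely, at the cost of being opaque to anything that doesn't know the field order.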

How to get LibreOffice's document binary?

I'm just starting to develop extensions for the LibreOffice suite and I'd like to get the binary of the current active document. In fact I'd like to do something similar to an AJAX request in which I'd send this document. Any idea?
As ngulam stated, the document proper is XML.
The raw file on disk is stored in a ZIP container. You can find the URL of this file from the document and then access the ZIP container directly. I do not believe it is possible, however, to see the document as a binary blob (or even the XML as stored in the ZIP container) through the API by accessing what has been loaded into memory.
Can you clarify your question. For example, are you attempting to access binary portions such as an inserted graphic into a document?
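Since the file on disk is just a ZIP container, plain java.util.zip is enough to pull the document XML (typically content.xml in an ODF file) out of it once you have the path. The sketch below builds a tiny in-memory "container" so the example is self-contained; in practice you would open a FileInputStream on the document's URL instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Sketch: an ODF file on disk is a ZIP; the document proper lives in content.xml.
public class OdfZipPeek {

    // Read one entry (e.g. "content.xml") out of a ZIP stream, or null if absent.
    public static byte[] readEntry(InputStream zipStream, String entryName) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(zipStream)) {
            for (ZipEntry entry; (entry = zip.getNextEntry()) != null; ) {
                if (entry.getName().equals(entryName)) {
                    return zip.readAllBytes(); // reads to the end of this entry
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // Build a minimal fake ODF container in memory so the example runs anywhere.
        ByteArrayOutputStream container = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(container)) {
            zip.putNextEntry(new ZipEntry("content.xml"));
            zip.write("<office:document-content/>".getBytes());
            zip.closeEntry();
        }
        byte[] xml = readEntry(new ByteArrayInputStream(container.toByteArray()), "content.xml");
        System.out.println(new String(xml));
    }
}
```

Keep in mind this reads the file as last saved to disk, not the possibly modified in-memory state of the open document.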

JSON or CSV to web service with large amount of data

I have a list of objects that I send to a web service.
In CSV it is 5 KB and in JSON it is 15 KB, and this can be larger depending on the amount of data.
Because this is the first time I'm sending a large amount of data to a web service, I need advice: should I use JSON or CSV?
What is the best practice?
I am mostly worried about performance.
Advantages:
JSON - easily interpreted on the client side, compact notation, supports hierarchical data
CSV - Opens in Excel(?)
Disadvantages:
JSON - if used improperly it can pose a security hole (don't use eval); not all languages have libraries to interpret it.
CSV - does not support hierarchical data; you'd be the only one doing it; it's actually much harder than most devs think to parse valid CSV files (values can contain newlines as long as they are between quotes, etc.).
For more detail, see this link.
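To illustrate why parsing valid CSV is "harder than most devs think": a parser that honors quoted fields already needs a small state machine, because a naive split on commas breaks on the first quoted comma or embedded newline. A minimal sketch for a single record:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal CSV record parser honoring RFC 4180-style quoting: fields may be
// quoted, and quoted fields may contain commas, newlines, and doubled
// quotes ("") as an escaped quote character.
public class CsvRecord {

    public static List<String> parse(String record) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    // "" inside quotes is an escaped quote; a lone quote ends the field
                    if (i + 1 < record.length() && record.charAt(i + 1) == '"') {
                        field.append('"');
                        i++;
                    } else {
                        inQuotes = false;
                    }
                } else {
                    field.append(c); // commas and newlines are literal inside quotes
                }
            } else if (c == '"') {
                inQuotes = true;
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parse("a,\"b, with comma\",\"say \"\"hi\"\"\""));
    }
}
```

Even this sketch ignores record splitting (newlines inside quotes mean you cannot split the input on line breaks first), which is exactly the kind of edge case a JSON library handles for free.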

Relationship between the standards DocBook, DITA, OpenDocument and CMIS, MoReq2

Can anybody explain (for dummies :)) the relationship between these (mostly OASIS) standards?
DocBook, DITA, OpenDocument
CMIS
MoReq2
As I understand it so far:
DITA, DocBook and OpenDocument are standards for document file formats
CMIS is the one I still need explained
MoReq2 is a standard for digital archives, for storing metadata about documents (a records management standard)
So, for a portable solution I need to
store documents in the above formats (but when to use which one?)
and describe them with MoReq2 schemas
but where does CMIS come in?
Or am I totally wrong?
PS: I understand this is a complex question, but I can find no simple explanation of their relationship anywhere.
PS2: Bonus question - do any of the above have support in Perl?
The topics I know best are the first three (DocBook, DITA, OpenDocument).
DocBook and DITA are standards for writing potentially long technical documents, in which you do not specify any style or presentation. Rather, you just write text, and then you can tag the text with information about its role (whether it is a keyword, whether it is a warning note, etc). This way, you can then use stylesheets to apply consistent style to all of your text, and you can produce multiple publication formats from it.
DocBook focuses more on providing a large set of tags that covers every common case, while DITA focuses on a bare minimum that is easy to extend. Another difference is that DocBook encourages you to think in terms of long documents, whereas DITA encourages you to think in reusable "modular" documents.
Both DocBook and DITA documents would be stored in multiple files. A single document could be from tens to thousands of files.
OpenDocument is a standard for specific office documents. As such, an OpenDocument document would often be a single file. An OpenDocument document is more specific than DocBook or DITA. It is less likely to be a book, and more likely to be a letter, a specification, a spreadsheet or a presentation. Also, unlike DocBook and DITA, OpenDocument will very likely contain style information (colours, numbering, etc), because the text is not necessarily related to any other document and is only used once.
Each of DocBook, DITA and OpenDocument is a format used to store text in files. Usually these are XML files.
CMIS. I have never heard of this before today, but I do know about content management systems. I can therefore tell you that it is a headache to try to manage the path that a certain piece of text is supposed to take from the repository, disk or database where it is stored, up to the book, webpage, help system or blog where it is supposed to be published. Content management systems help you specify data for large sets of files; this data can then be used by a tool to decide where to publish a document, or just a piece of information. A content management system can be as simple as two folders on your hard drive: any files put in one folder should be published, for example, as PDFs in Chinese, whereas files put into the second folder should be published as blog entries in German and Turkish.
Now, content management systems are usually much more sophisticated than that, and there are many of them. I imagine that CMIS is an abstract layer that lets you allow different content management systems to inter-operate, if by chance you have invested in more than one of them.
Finally, MoReq2. Again, I only discovered this today, and unlike CMIS, I don't even have experience with record keeping. However, you have two answers from @Tasha and @Marc Fresko which should give you a good start.
What I imagine about MoReq2 is that it can help you manage the lifecycle of your documents. For example, you may want to specify that a certain policy document is only valid until 2010, or that it has been deprecated already. I also imagine that MoReq2 is much, much more than that.
To sum up, all of these standards concern document management. DocBook, DITA and OpenDocument are about writing and storing documents. CMIS is about managing where the documents go. And MoReq2 seems to be about how long they live.
On CMIS, try this link. MoReq2 is not about digital archives, and it's not about "storing metadata". It is a typical set of functional requirements for a decent Electronic Records Management System. Both documents are in the public domain - get them and read the introductions.
Tasha's reply is 100% accurate. I'll add that the metadata model in MoReq2 is the weakest part of MoReq2, and arguably the least important - it probably contains many errors. I say this on the basis of having been the leader of the MoReq2 project.