I have hundreds of thousands of PDFs that are currently stored in the filesystem. I have a custom application that, as an afterthought to its actual purpose, provides access to these PDFs. I would like to take the "storage & retrieval" part out of the custom application and use an open source document storage backend.
Access to the PDF store should be via a REST API, so that users do not need a custom client for basic document browsing and viewing. Programs that store PDFs should also be able to work via the REST API. They would provide the actual binary or ASCII data plus structured metadata, which could later be used for retrieval.
A typical query for retrieval would be "give me all documents that were created between days X and Y with document types A or B".
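To make that query concrete, here is a minimal Python sketch of the filter over an in-memory list of metadata records. The class name and the fields `doc_id`, `created` and `doc_type` are assumptions made for illustration; they do not come from any existing product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DocumentMeta:
    # Hypothetical metadata fields; a real store would define its own schema.
    doc_id: str
    created: date
    doc_type: str


def find_documents(docs, start, end, doc_types):
    """Return documents created in [start, end] whose type is in doc_types."""
    wanted = set(doc_types)
    return [d for d in docs if start <= d.created <= end and d.doc_type in wanted]


docs = [
    DocumentMeta("1", date(2010, 1, 5), "invoice"),
    DocumentMeta("2", date(2010, 2, 1), "contract"),
    DocumentMeta("3", date(2010, 1, 20), "memo"),
]
# "Created between X and Y, with document type A or B"
hits = find_documents(docs, date(2010, 1, 1), date(2010, 1, 31), ["invoice", "memo"])
```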
My research into whether such a storage backend exists has come up empty. Do any of you know a system that provides these features? Open source preferred; reasonably priced systems considered.
I am not looking for advice on how to "roll my own" using available technologies. Rather, I'm trying to find out whether that can be avoided. Many thanks in advance.
What you describe sounds like a document management or asset management system. There are many of these, and many work with PDF files. I have some fleeting experience with commercial offerings such as Xinet (http://www.northplains.com/xinet - apparently since acquired) or Elvis (http://www.elvisdam.com). Both might fit your requirements, but they're probably too big and likely too expensive.
Have you looked at Alfresco? It is an open source alternative I came across years ago while serving on a selection committee. As far as I remember, it goes in the direction of what you are looking for, and being open source it might fit that angle as well: http://www.alfresco.com.
Related
If you want to see the differences between a content management system and a document management system in the real world, what is the best example?
Thanks for your attention. I’m looking forward to your reply.
There is a big difference between a content management system and a document management system. The names of the two kinds of tools sound similar, but they do different jobs.
CMS (Content Management System)
A content management system is a tool used to maintain the content of a website or application. Let me elaborate.
Have you ever created a website? Websites are often built with WordPress or with platforms such as Shopify or Magento. Tools like these are content management systems.
Document Management System (DMS)
A document management system is used by businesses or by individuals.
In a DMS, paper documents are converted to digital form by scanning, and the resulting files are stored on a cloud server, where they are protected against loss and theft.
All of the business's documents are secured.
Content management systems vary from document management systems in one key area – the type of information they manage.
Document management systems are designed specifically for data contained in structured documents and files like Word, PowerPoint, Excel spreadsheets, PDF, and other popular formats. Their purpose is primarily to digitize and archive files and to track and manage new documents throughout their lifecycle, as they are written, revised, and updated. Many of them include advanced imaging and scanning capabilities (for digitization of hard copy files) that can’t be found in most content management systems.
Content management systems, on the other hand, are more about the logical organization and improved accessibility of various types of structured and unstructured electronic information. This includes not only the kinds of files that are managed by document management applications but also a broader range of digital assets: for example, audio, video, Flash, and multimedia files, as well as raw data collected from various third-party Internet sources.
I have data that I need to organize, and the easiest way to do it would be with CoreData. I also want to sync this data to Dropbox so that it will be synced across multiple iOS devices and Macs. I looked at this post, and now I am kind of concerned:
You want to look at this pessimistic take on cloud sync: Why Cloud Sync Will Never Work. It covers a lot of the issues that you are wrestling with. Many of them are largely intractable.
It is very, very, very difficult to synchronize information period. Adding in different devices, different operating systems, different data structures, etc snowballs the complexity often fatally. People have been working on variants of this problem since the 70s and things really haven't improve much.
I am especially concerned because I am pretty new to iOS and programming in general, and I was hoping it would be easier. I was wondering if anyone had some tips/tutorials/experience with doing this. I could use property lists (or a different method) to store the data, but that would make it harder later in case I wanted to change any of the attributes for the data I am storing. Is this really as complicated as they are making it sound, and should I just try to find some other way to sync the data (e.g. email, drag and drop in iTunes, etc.)?
I don't have any experience with cloud sync, but I do have experience with data management. Plist files are not at all bad in terms of data manipulation. The main problem with plist files is speed when handling large amounts of data, but for what you are intending to do they should work fine. It is difficult to provide more of an answer because your question did not say what kind of data, how much data, or how often this data will be changed/accessed. If you are a beginner in iPhone development or programming in general, I will just say that Core Data has a very steep learning curve. When I first started programming for the iPhone, all I used were plists because they are simple and versatile.
Also, from reading the article linked in your question, it seems the author was condemning cloud providers for the way they handle data storage and for the services they offer to users. That article was written in 2009; since then, great strides in "cloud" storage and syncing have been made. Also, you are not actually creating a cloud sync service, you are simply using one that already exists, so almost none of those problems apply to you.
Syncing is rather easy. You just have to keep track of file creation and deletion.
I wrote this blog post about how to sync a local data store with a remote one: Basic Syncing Algorithm
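As a rough illustration of tracking creations and deletions, here is a minimal Python sketch. It is not the algorithm from the linked post. It assumes each side can list its file names and that the set of names from the last successful sync has been saved; the function name `plan_sync` is made up for this example, and conflicts between edited copies of the same file are not handled.

```python
def plan_sync(local, remote, last_synced):
    """Decide what to copy or delete so that local and remote converge.

    All three arguments are sets of file names. last_synced is the snapshot
    recorded after the previous successful sync (an assumption of this sketch).
    """
    created_locally = local - last_synced
    created_remotely = remote - last_synced
    deleted_locally = last_synced - local
    deleted_remotely = last_synced - remote

    return {
        "upload": created_locally - remote,
        "download": created_remotely - local,
        # Only delete on the other side if the file still exists there.
        "delete_remote": deleted_locally & remote,
        "delete_local": deleted_remotely & local,
    }


plan = plan_sync(
    local={"a.txt", "c.txt"},               # c.txt is new here, b.txt was deleted here
    remote={"a.txt", "b.txt", "d.txt"},     # d.txt is new on the remote side
    last_synced={"a.txt", "b.txt"},
)
```

Without the saved snapshot, a file present on only one side is ambiguous: it could be newly created there or deleted on the other side. That is why the sketch keeps `last_synced`.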
In the comments, tell me what (in general) you are using CoreData to manage. I need more information.
Now there is a product to sync your CoreData across devices, with the data stored in your Dropbox, Box, or Google Drive account. It's called NimbusBase.
You can keep using your CoreData directly, import our libraries, and your data will be saved straight to your Dropbox. We handle authentication and also moving the data back and forth.
Feel free to email me at admin#nimbusbase.com if you have questions.
Disclosure: I am a programmer at NimbusBase
I am looking for a web-based way of showing users tiff, pdf, doc(x), and xls(x) files. This is a business requirement, and I don't have much say in the decision being made. The web application will be used by both internal and external customers, but it is not publicly available.
Pricing is not such a big deal right now, the active stakeholders know this is extremely valuable and important. So to a point, pricing does not matter.
I was hoping somebody else's google-fu was better than mine, or knows of a good solution/product that doesn't necessarily have good search engine ranking.
Little more info
I do believe all we will need is a way to view the images. We will not be performing any redaction or annotations. It would be nice to have a thumbnail control to facilitate flipping through many pages (upwards of 100), but this is not required. There will be other controls on the page, so I'm looking for a minimalistic viewer. Being able to customize the navigation buttons/controls would be an added bonus as well. Also, this will be developed/deployed using ASP.NET MVC2 on an IIS7 x64 platform.
A Silverlight/Flash control/solution would also be acceptable.
Current Findings
Previewing TIF documents on the Web (.Net C#) - Only directed at TIF images
http://www.accusoft.com/prizmviewerfeatures.htm - uses a browser plugin. This is not ideal, but a possibility.
http://www.atalasoft.com/products/dotimage - Does not seem to support Microsoft Office formats, no mention of MVC support.
http://www.snowbound.com/viewer_inaction/viewerdemos.html - So far this one is coming out ahead; it supports many formats (pay for the formats you need/want). But again, no mention of MVC.
Google Docs API - From what I can tell, in order to use Google's conversion you need to put the documents on their servers. This will not work for us because the documents contain sensitive information.
There's a partial solution for this that involves converting the various documents into HTML for display (or any other web-capable format) in a web browser. It doesn't satisfy all your requirements but may lead to something useful eventually.
JODConverter offers a server-side, Java-based solution that leverages OpenOffice.org's powerful converter to convert from any supported format to any other supported format.
From the website:
JODConverter, the Java OpenDocument Converter, converts documents between different office formats. It leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today
I've used it successfully to convert documents from MSWord to HTML for display in the browser. Any format that OpenOffice supports is supported by JODConverter. So PDF, MS formats, TIFF and others are supported.
It's Java, so it's platform independent - I've used it on Windows, Mac and Linux servers.
There are a couple of other ones I have found:
http://www.eviewer.net : An HTML5-based viewer that has both .NET and Java backends.
http://www.ms-technology.com/viewing-solutions : They have Java Applet, Silverlight, Flash viewers that would fulfill your needs as well.
I hope this helps.
I am using Scantron Cognition Enterprise at work to capture data from scanned forms. Building these forms is tedious at best, especially when it would be nice to have a library of pre-built objects to use. Unfortunately, documentation and on-line resources are scarce.
Does anyone have any pointers to find some resources for this tool?
Hey Jason, believe it or not, Scantron is STILL the standard, but this is not the Scantron you probably remember. Although OMR (bubble) forms are still used extensively in education, there are a lot more advanced technologies available to be added to them today.
Concerning Cognition, I looked through the available tags and these would fit:
"document-imaging" - Cognition is a document imaging product and can feed images and index values into most commercially available document storage applications
"OCR" - Optical Character Recognition, or reading machine print.
"ICR" - Intelligent Character Recognition - reading handwriting, usually in a constrained print format (one letter per box, like a credit card application).
"datacollection" - the key purpose of Cognition is data collection.
However, there is not a tag for "OMR" - Optical Mark Recognition, or reading bubble choices, similar to the basic Scantron forms of the past. Also, I could not find one for "Key From Image", another purpose that Cognition is used for.
I am a Cognition user as well as someone who markets it, and I know that there are a large number of users in North America. Many corporations that use Cognition use it for sensitive HR functions and so might not have their usage of it posted in a searchable format. Many other organizations use it for safety inspections, insurance data entry, and also for testing and surveys - basically anywhere you have a large number of paper forms and need all of the data quickly entered into a database. Because many users run Cognition for sensitive applications, they are not likely to share their forms, but I can share a few that I have. You could also contact your Scantron rep, who might have something to share as well. I have some decent ICR fields built for name, e-mail, address, etc. The ICR fields are best when you build in your own dictionary or database look-ups. The OMR fields are the hard ones to build, but I have a few of these as well. The easiest way to share these is to send you the form that already has the field built into it. You can build your own lookups from txt, xls or db files.
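To show why a look-up list helps ICR results, here is a minimal Python sketch that snaps a recognized string to the closest entry in a list loaded from text lines. It does not use Cognition's own API. The function names, the sample city list and the 0.8 similarity cutoff are assumptions chosen for illustration.

```python
import difflib


def load_lookup(lines):
    """Build a look-up list from text lines, one allowed value per line."""
    return [line.strip() for line in lines if line.strip()]


def correct_field(recognized, lookup, cutoff=0.8):
    """Snap a recognized string to the closest look-up value, if one is close enough."""
    matches = difflib.get_close_matches(recognized, lookup, n=1, cutoff=cutoff)
    return matches[0] if matches else recognized


# In practice the lines would come from a txt file; a list stands in for it here.
cities = load_lookup(["Chicago\n", "Houston\n", "Phoenix\n"])
fixed = correct_field("Chicag0", cities)   # a typical letter/digit confusion
unknown = correct_field("Zzzzzz", cities)  # nothing similar, so it is returned unchanged
```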
We have a web application which contains a bunch of content that the system operator can change (e.g. news and events). Occasionally we publish new versions of the software. The software is being tagged and stored in subversion. However, I'm a bit torn on how to best version control the content that may be changed independently. What are some mechanisms that people use to make sure that content is stored and versioned in a way that the site can be recreated or at the very least version controlled?
When you identify two sets of files that have their own life cycles (software files on one side, "news and events" on the other), you know that:
you cannot version them together at the same time
you should not put the same label on both
You need to save the "news and events" files separately (either in the VCS, in a DB as Ian Jacobs suggests, or in a CMS - Content Management System), and find a way to link the two together (an id, a timestamp, a meta-label, ...)
Do not forget that you are not only talking about two sets of files that differ in life cycle, but also about sets of files that differ in their very nature:
Consider the terminology introduced in this SO question "Is asset management a superset of source control" by S.Lott
software files: Infrastructure information, that is "representing the processing of the enterprise information asset". Your code is part of that asset and is managed by a VCS (Version Control System), as part of the Configuration management discipline.
"news and events": Enterprise Information, that is data (not processing); this is often split between Content Managers and Relational Databases.
So not everything should end up in Subversion.
Keep everything in the DB, and give every DB transaction a timestamp. That way you can keep standard DB backups and load the site content as of whatever date you want if the worst happens.
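One way to get "content as of a given date" is an append-only history table in which every change is a new timestamped row. This is a variation on the suggestion above rather than a literal implementation of it, sketched here with Python's built-in sqlite3 module. The table name, column names and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Append-only table: every change to an item is a new row with its own timestamp.
conn.execute(
    "CREATE TABLE content_history (item_id TEXT, body TEXT, changed_at TEXT)"
)
conn.executemany(
    "INSERT INTO content_history VALUES (?, ?, ?)",
    [
        ("news-1", "Draft announcement", "2009-03-01T10:00:00"),
        ("news-1", "Final announcement", "2009-03-05T09:30:00"),
        ("event-7", "Open day", "2009-03-03T12:00:00"),
    ],
)


def content_as_of(conn, when):
    """Return {item_id: body} as the content stood at ISO timestamp `when`."""
    rows = conn.execute(
        """
        SELECT h.item_id, h.body
        FROM content_history AS h
        WHERE h.changed_at = (
            SELECT MAX(changed_at) FROM content_history
            WHERE item_id = h.item_id AND changed_at <= ?
        )
        """,
        (when,),
    )
    return dict(rows)


# Site content as it stood on 4 March, before the final announcement was saved.
snapshot = content_as_of(conn, "2009-03-04T00:00:00")
```

ISO 8601 timestamps stored as text compare correctly as strings, which keeps the sketch free of date parsing.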
I suppose part of the answer depends on what CMS you're using, and how your web app is designed, but in general, I'd regard data such as news items or events as "content". In other words, it's not part of your application - it's the data which your application processes.
Of course, there will be versioning issues between your CMS code and your application code. You could manage this by defining the interface between the two. Personally, I'd publish the data to the web app as XML, which gives you the possibility of using XML schema to define exactly what the CMS is required to produce, and what the web app should expect to process.
This ought to mean that most changes in the web app can be made without a corresponding alteration in the rendering of the data. When functionality changes require this, you can create a new version of the schema and continue to make progress. In this scenario, I'd check the schema in with the web app code, but YMMV.
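A rough Python sketch of such an interface follows. The element names, the `schemaVersion` attribute and both function names are hypothetical. The hand-written check only stands in for real XML Schema validation, which the standard library does not provide and which would need a schema-aware library.

```python
import xml.etree.ElementTree as ET

# Hypothetical contract between CMS and web app; all names are assumptions.
SCHEMA_VERSION = "1"
REQUIRED_FIELDS = ("title", "published", "body")


def export_news_item(title, published, body):
    """CMS side: serialize one news item to the agreed XML shape."""
    item = ET.Element("newsItem", {"schemaVersion": SCHEMA_VERSION})
    for name, value in (("title", title), ("published", published), ("body", body)):
        ET.SubElement(item, name).text = value
    return ET.tostring(item, encoding="unicode")


def parse_news_item(xml_text):
    """Web app side: reject documents that do not match the contract."""
    item = ET.fromstring(xml_text)
    if item.tag != "newsItem" or item.get("schemaVersion") != SCHEMA_VERSION:
        raise ValueError("unexpected document type or schema version")
    fields = {}
    for name in REQUIRED_FIELDS:
        child = item.find(name)
        if child is None:
            raise ValueError(f"missing required element: {name}")
        fields[name] = child.text or ""
    return fields


xml_text = export_news_item("Site relaunch", "2009-03-05", "We have a new look.")
parsed = parse_news_item(xml_text)
```

Checking the version on the consuming side means a functional change that needs a new data shape shows up as an explicit version mismatch rather than as a silent rendering error.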
It isn't easy, and it gets more complicated again if you need additional data fields in your CMS. Expect to plan for a fairly complex release process (also depending on how complex your Dev-Test-Acceptance-Production scenario is.)
If you aren't using a CMS, then you should consider it. (Of course, if the operation is very small, it may still fall into the category where doing it by hand is acceptable.) Simply putting raw data into a versioning system doesn't solve the problem - you need to be able to control the format in which your data is published to the web app. Almost certainly this format should be something intended for consumption by software, and therefore not usually suitable for hand-editing by the kind of people who write news items or events.