Searching for a document format: flowing layout + page control + version control

I am bouncing around the idea of creating a custom document versioning system to use on business rule manuals. These manuals are broken up into outlined sections, one rule per section, numbered in various ways (1.1, 1.2, etc.). Many manuals contain the same rule for different locations in the country (down to the state/county level); however, many locations will have different versions of the rules depending on business needs or whatnot.
My thought is to create a system which will manage versions of each section/rule separately. This would make the whole mess much easier to maintain (think hundreds of manuals times hundreds of rules), and it would make fielding query requests from management much quicker.
Ok, it's a fairly easy and straightforward design to this point. Now for the monkey wrench. These rules are regulated by government agencies, so they must be submitted to and approved by state agencies. Many states require that only the exact pages which were updated be submitted for approval with each request. Once they are approved, those pages get a new effective date and the rest of the manual remains the same. There are business reasons for this process.
So my choice of document format has to allow for a flowing layout much like Word's; however, I need to be able to programmatically determine the page range of each section and whether changes or additions will cause repagination.
The most complex layout will contain only tables, headers/footers, and a table of contents. I have thought about using OOXML, but I don't see a way to determine pagination without loading Word which is something I would prefer to avoid. I could create my own pagination algorithm, but that sounds a lot like reinventing the wheel.
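For instance, if whatever format I pick can be rendered to PDF with one bookmark per section (most toolchains can do that), I imagine something like this rough Python sketch with pypdf could read back each section's page range; diffing the result from before and after an edit would flag repagination. The bookmark-per-section convention is my own assumption:

```python
# Sketch: recover per-section page ranges from a rendered PDF's bookmarks.
# Assumes the toolchain emits one bookmark per section. pypdf only.
from pypdf import PdfReader

def section_page_ranges(pdf_path):
    reader = PdfReader(pdf_path)
    starts = []

    def walk(items):                       # the outline may be nested
        for item in items:
            if isinstance(item, list):
                walk(item)                 # subsections
            else:
                starts.append((item.title,
                               reader.get_destination_page_number(item)))

    walk(reader.outline)
    starts.sort(key=lambda pair: pair[1])

    # A section runs from its start page to just before the next section.
    sentinel = [("__end__", len(reader.pages))]
    return {title: (start + 1, nxt)        # 1-based, inclusive
            for (title, start), (_, nxt) in zip(starts, starts[1:] + sentinel)}

# Rendering the manual before and after a change and diffing the two dicts
# shows exactly which sections shifted pages, i.e. whether it repaginated.
```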
Can anyone offer pointers to a solution whether it is an open document format, a book, or something else? Thank you for taking the time to read this.

If you want a truly modular document, then DocBook might be worth a look. You get all the rich formatting you need, but it does take a bit of work. It really depends on who's doing the authoring and what tools they're comfortable using. DocBook is a rich markup language, and you can do anything from working in the base plain-text files to using one of a number of WYSIWYG editors, e.g. ArborText.
It's not Word though - which might be enough to put your authors off!
If you did go with DocBook, you would maintain each document section in a separate text file so your versioning solution would work well. DocBook can produce output in a number of formats simultaneously so you could have an HTML version, an OOXML version, and a PDF version produced from the same source. A PDF version of each changed section might be appropriate to send to government agencies for approval.
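To make that concrete, each rule could live in its own small DocBook 5 file, and each manual would just assemble the rule versions it needs via XInclude (a rough sketch; the ids, titles, and dates are invented):

```xml
<!-- rule-1.2.xml: one rule = one file = one unit of version control -->
<section xmlns="http://docbook.org/ns/docbook" version="5.0" xml:id="rule-1.2">
  <title>Rule 1.2 (invented example)</title>
  <para>Effective date: 2010-01-01.</para>
  <para>The text of the rule goes here.</para>
</section>

<!-- manual-tx.xml: a per-state manual assembles the rules it needs -->
<chapter xmlns="http://docbook.org/ns/docbook"
         xmlns:xi="http://www.w3.org/2001/XInclude" version="5.0">
  <title>Texas business rules</title>
  <xi:include href="rule-1.1.xml"/>
  <xi:include href="rule-1.2.xml"/>
</chapter>
```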
On pagination, you could make life a lot easier for yourself by not having continuous page numbers. Use section or chapter based page numbering, e.g. page I-1, I-2, ..., II-1, II-2.

How far to go with Structured Data Markup?

The more I go into the depths of structured data markup, the more complex and detailed it seems to become. One could even mark up areas of the page like the footer, header, sidebar, single menu elements, etc. I guess a page could easily consist of 80% schema markup and 20% content when taken too seriously. :)
Is it really doing any good to add more than a rough markup skeleton (WebPage or Article) to the potentially hundreds of actual content pages of a website? And shouldn't one only include full author information, business opening times, contact details, etc. on a dedicated contact/business-information page? I'm concerned about the bloat. Which kinds of markup are recommended for certain types of pages, and which can be left out because a search engine would compile the information from other parts of the website anyway?
If you only care about user-visible search result features in the big search engine services (e.g., Bing, Google Search, Yahoo! Search and Yandex, which all happen to sponsor Schema.org), the answer is easy: Provide what search engines document to recognize.
Are these user-visible search result features the only things search engines "do" with Schema.org structured data? Probably not. They’ll likely use structured data to better understand page content, and most likely to analyze what other features they could offer in the future. See for example Dan Brickley’s (he is Google’s Schema.org representative) posting about this. But all this is typically not documented by the search engines, of course. So if you care about this, too, the answer would be: Provide what is conceivable to be useful for search engines.
Are search engines the only consumers interested in Schema.org structured data? No, there are countless other consumers (services as well as tools). Enter the world of the Semantic Web and Linked Data. If you know and care about a consumer, the answer is easy again: Provide what this consumer documents to support. But you can’t know them all, of course. So if you care about all these (known and unknown, currently existing and still to appear) consumers, the answer would be: Provide what is conceivable to be useful for all consumers. Or, because the interest of these consumers varies widely, even: Provide what you can.
That said, there are certainly Schema.org types which are rarely useful to provide. A good example are the WebPageElement types, which, as you mentioned, can get used for page areas (header, footer, navigation, sidebar etc.). In my opinion, a typical web page shouldn’t provide these types.
If you care about file sizes, you’ll want to use Microdata/RDFa (because these syntaxes allow you to annotate existing content) instead of JSON-LD (because this syntax requires you to duplicate the content). With RDFa you’ll probably even save slightly more compared to Microdata.
However, structured data typically only represents a fraction of the markup/content anyway, even if you provide as much data as possible.
Instead of repeating "background information" on every page (for example, the full data about the business), you can make use of references: you define a URI for your business (or any other thing) on the page where you fully describe it, and use this URI as the property value where applicable on other pages. This is possible with #id (JSON-LD, see an example), itemid (Microdata), and resource (RDFa). The only reason not to do this is possibly lacking consumer support for such references (depending on the consumer and the use case, they might not get followed). A middle way might be to provide the item (about the business or any other thing) on every page, once with the full data, and in all other cases with a limited set of data (ideally what is visible on the page, or what is needed for a specific consumer). The URI gets used as the identifier for each item, conveying that all these items are about the same thing.
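A minimal sketch of that reference pattern in JSON-LD (all URLs and values invented): on the page that fully describes the business, mint the URI:

```json
{
  "@context": "https://schema.org",
  "@type": "LocalBusiness",
  "@id": "https://example.com/contact#business",
  "name": "Example Ltd",
  "telephone": "+1-555-0100",
  "openingHours": "Mo-Fr 09:00-17:00"
}
```

On every other page, refer to the same thing by its URI instead of repeating the data:

```json
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "An example article",
  "publisher": { "@id": "https://example.com/contact#business" }
}
```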

per-paragraph commenting system

I'm very interested in the emerging trend of per-paragraph commenting systems (also called "annotation systems"), such as the ones implemented by medium.com and qz.com, and I'm looking into developing one of my own.
Question: it seems they are mainly implemented via JavaScript that runs through the text's HTML paragraphs, each uniquely identified by an id attribute (or, in the case of Medium, a name attribute). Does that mean their CMS actually stores each paragraph as a separate entry in the database? That seems overly complex to me, but otherwise, how do they manage the fact that a paragraph can be deleted, edited, or moved around in the overall text? How would the unique id be preserved if the author changes the paragraph?
How is that unique id logically structured? (post_id + position_in_post)?
Thank you for your insights...
I can't speak to the Medium side, but as one of the developers for Quartz, I can give insight into how qz.com annotations work.
The annotations code is custom PHP code and is independent of the CMS used for publishing articles (WordPress VIP). We do indeed store a reference to each paragraph as a row in the database, in order to track any updates to the article content. We call this an annotation thread, and when a user saves an annotation the threadId gets stored along with the annotation.
We do not have a unique id stored on the WordPress side for each paragraph; instead we store the paragraph's relative position in that article (nodeIndex "3" and nodeSelector "p" == the third p-tag in the content body for a given article) and the JavaScript determines where exactly to place the annotation block. We went this route to avoid heavier customizations on the WordPress side, though depending on your CMS it may be easier to address this directly in the CMS code and add unique ids to the HTML before sending it to the client.
Every time an update to an article is published, each paragraph in the updated article is compared against what was previously stored with the annotation threads for that article. If the position and paragraph text do not match up, it attempts to find the paragraph that is the closest match and updates the row for that thread; new threads are created and deleted where appropriate. All of this is handled server-side whenever changes are published to an article.
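In rough form, the re-matching step looks like this (a toy Python sketch, not our production code; difflib stands in for the real similarity measure):

```python
# Toy sketch of re-attaching annotation threads after an article update.
import difflib

def rematch_threads(threads, new_paragraphs, cutoff=0.6):
    """threads: dicts like {"id": 7, "nodeIndex": 3, "text": "..."};
    new_paragraphs: paragraph texts of the updated article."""
    updated, orphaned = [], []
    for thread in threads:
        idx = thread["nodeIndex"]
        if idx < len(new_paragraphs) and new_paragraphs[idx] == thread["text"]:
            updated.append(thread)            # position and text still match
            continue
        close = difflib.get_close_matches(thread["text"], new_paragraphs,
                                          n=1, cutoff=cutoff)
        if close:                             # re-anchor to the closest paragraph
            thread["nodeIndex"] = new_paragraphs.index(close[0])
            thread["text"] = close[0]
            updated.append(thread)
        else:
            orphaned.append(thread)           # no close match: delete the thread
    return updated, orphaned
```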
A couple of alternate implementations that are also worth looking at are Gawker's Kinja text annotations (currently in use on Jalopnik) and the word-for-word annotations of rapgenius.com.
(disclaimer: I'm a factlink dev.)
I work for a company trying to allow per-paragraph (or per-phrase) commenting on arbitrary sites. Essentially, you've got two choices to identify the anchor of a comment.
1. Remember the structure of the page (e.g. some path from a root to a paragraph), and place comments at the same position next time.
2. Identify the content of the paragraph and place comments near identical or similar content next time.
Both systems have their downsides, but you pretty much need to go with option 2 if you want a robust system. Structural identification is fragile in the face of changing structure; in particular, irrelevant changes such as theming or the precise HTML tags used can significantly impact the "path". When that happens, you really can't fix it, unless you inspect the content, i.e. option 2.
Sam describes what comes down to a server-side content-based approach in his answer. Purely client-side content-based matching is what factlink and (IIRC) hypothes.is use. Most browsers support non-standard but fast substring search in page content using either window.find or TextRange.findText. Alternatively, you could walk the DOM, which is slower but gives you the flexibility to implement (e.g.) fuzzy matching.
It may seem like client-side matching is overkill or complex, but really, it's simpler: it's a very robust way to decouple your content-management from your commenting. Neither is really simple, so decoupling those concerns can be a win.
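Stripped of browser specifics, the content-matching core is something like this (a Python sketch of the idea only; real client code does this over the DOM, via window.find or a DOM walk):

```python
# Sketch: locate the best match for a stored anchor quote in the current
# page text. The 0.5 threshold is arbitrary; tune it for your content.
from difflib import SequenceMatcher

def locate(anchor: str, page_text: str):
    """Return (start, end) of the best match for `anchor`, or None."""
    m = SequenceMatcher(None, anchor, page_text, autojunk=False)
    match = m.find_longest_match(0, len(anchor), 0, len(page_text))
    if match.size < len(anchor) * 0.5:   # too dissimilar: treat as orphaned
        return None
    return match.b, match.b + match.size
```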
I created a fiddle along the same lines to demonstrate the power of jQuery during a training session.
http://fiddle.jshell.net/fotuzlab/Lwhu5/
It might help as a starting point, along with Sam's detailed and useful insights. You get the value of the text field in the jQuery function, from where you can send it across to your CMS using AJAX/APIs.
PS: The function is not production-ready; it's only meant as a starting point. A little tweaking will make it usable.
I've recently published a post on how to do this with WordPress, building on an existing plugin.
Like qz.com, I assign paragraph ids on the client and then provide that info to WordPress to store as comment meta when a new comment is created. I used a hash of the paragraph text to create the id, which means that the order of paragraphs is unimportant, but it does mean that if a paragraph is edited then any associated comments become orphaned.
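The scheme itself is tiny; a simplified sketch (Python here for brevity, the plugin actually does this in JavaScript on the client):

```python
# Simplified sketch of the id scheme: hash the normalized paragraph text.
# Ids survive reordering (the hash doesn't depend on position) but not
# edits (any text change yields a new id, orphaning old comments).
import hashlib

def paragraph_id(text: str) -> str:
    normalized = " ".join(text.split())  # collapse whitespace before hashing
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()[:12]
```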
At first I thought this was an issue, but thinking about it: if a reader comments on a paragraph, editing that text afterwards seems a little sneaky.
The code is freely available on GitHub if you feel like forking it and enhancing it.
There is another WordPress plugin, called "CommentPress", which has existed for a long time.
I use an old version of this plugin on my blog and it works very well.
You can choose to comment per line or per paragraph, and the ergonomics are really well thought out!
A demo here:
http://futureofthebook.org/
and all the code is on github:
https://github.com/IFBook/commentpress-core
After a quick look at the code, it seems they use the second approach, as @Eamon Nerbonne explains in his answer.
They parse each paragraph to build a signature based on the first character of each word. Here is the function that does that.
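The idea boils down to something like this (my own sketch in Python; the plugin itself is PHP):

```python
# Sketch of the CommentPress-style signature: first character of each word.
# "The quick brown fox" -> "Tqbf". A comment re-attaches to whichever
# paragraph still produces the same signature, even if it has moved.
def signature(paragraph: str) -> str:
    return "".join(word[0] for word in paragraph.split())
```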
In case someone comes looking here: I've implemented Medium-like functionality as a Django app.
It is open source and can be found as a package on PyPI, and on GitHub.
I used one of my other apps, blogging, to allocate unique paragraph IDs to each content object (currently we're only looking at <p> tags) and to store some extra internal metadata in the backend alongside it in the DB (MySQL currently, but we've JSONed the blob, so the method is more natively suited to document-oriented DBs). The frontend is mainly jQuery driven, with a REST API plugging the backend into the frontend.
I took cues from this post, but then rejected the idea of creating some kind of digest value from the paragraph, because the content can change. What I wanted was to preserve the annotations as long as the paragraph was not completely overwritten. For the complete-overwrite case, I provided for collecting the annotations in an orphaned bucket.
More in these tutorials
A legacy version of the same is running on those tutorial pages; that was the first revision. (You won't be able to post without logging in, but you can always log in using a social account to check it out :-) )

Content management system for graphics?

I am researching CMS systems, something I know little about. I am an animator; I generate large numbers of files and have many source files that I use. There are so many that it's become difficult to manage them all and keep some organization. Can someone suggest an open-source CMS solution that could aid in organizing these files?
Thanks
Apparently, these systems are called "digital asset management systems" when they're not about text but about images.
An overview of open-source ones can be found here.
Razuna looks quite good. I'm looking for a similar solution; though I probably won't have trillions of texture files or anything, I do have loads of .psd/.ai/.indd files, for which a number of systems offer thumbnail previews to a certain degree.
One thing to look out for is whether the system can handle/use/manipulate IPTC metadata. Basically, what this means is that when metadata is embedded in an asset, the system can present it to you in a digestible format. An example of this is Google's Picasa, which allows search indexing on this data. Also, a number of stock asset sites both use and produce this data in their asset sets, so when you download an image, for example, it comes pre-tagged with "woman, standing, smiling, photo, office" and you only have to add your own tags on top, for example "telecoms project, overview module".
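As a concrete example, embedded IPTC keywords can be read with a few lines of Python and Pillow (the file name is invented; (2, 25) is the IPTC "Keywords" record/dataset number):

```python
# Sketch: read the IPTC keywords embedded in an image, the kind of
# metadata a DAM system should surface and index.
from PIL import Image, IptcImagePlugin

with Image.open("stock-photo.jpg") as im:        # invented file name
    iptc = IptcImagePlugin.getiptcinfo(im) or {}

raw = iptc.get((2, 25), [])                      # (2, 25) = Keywords
if isinstance(raw, bytes):                       # a single keyword comes back bare
    raw = [raw]
keywords = [k.decode("utf-8") for k in raw]
print(keywords)  # e.g. ['woman', 'standing', 'smiling', 'photo', 'office']
```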
Again, if you're generating a swathe of output files, the kind of versioning/management you need may depend on the nature of that output.
If, for example, you have output that is made up of a bunch of source files, some of which are program-specific and some of which are linked assets, then you might want to put the whole lot under version control (PlasticSCM or Subversion perhaps) and "exclude" graphic files by their file type. Then, use something like Razuna to upload, hold and display your graphic files.
I noticed with Razuna that you can organise things by category, and assign multiple categories - that is, you have one set of files but multiple views of them. That's why I liked Razuna, though to be honest the demo crapped out on me; it could've been because I changed my email and profile data halfway through the trial.
Interested to know how you go in your search and what you've found to be useful!
We're looking for something as well, preferably cloud-based, but that's not a requirement.
We're looking right now at Razuna. It has a lot of great features, and the organization seems very flexible, which is great.
But it doesn't seem very mature in some ways. The development team, I think, is small, and some features don't work reliably (e.g., uploading additional versions of an asset [such as different resolutions] works intermittently, and only with IE as far as I can tell).
So if anyone has any other systems worth a look I'd be glad to hear about them.
In the end, Razuna was just too immature for us. It's a great effort. The dev team is obviously talented and sincere. In a couple of years they may well have a great product. I wish them luck.
We've settled on a commercial service, WebDAM. In some ways it's very comprehensive and does a lot of things well. The price is not too bad, and there's a nice API to program against, so we will be able to lean on it heavily for image selection and then incorporate it into our automated processes, grabbing images as needed programmatically.
In other ways it's a little maddening. The UI, in particular, could use a lot of work to make it easier to use for the average person. It was clearly designed by programmers; a lot of UI niceties that would not be that hard to add are missing, obvious things like boxes of data being too small while a lot of screen real estate goes unused.
The keyword capability is useful, but it doesn't have obvious things like synonyms and stemming when you search. This will make things harder on our users and will have to be planned carefully to make sure it's as useful as it can be.
We're still just in the planning stages, so not sure how it will fly once we go live, but we're going to give it a shot. You might want to have a look at it. But they have a much more mature development effort going on and more support for the customer, which swayed us in the end.

Dynamically populate a PDF

Could anyone help me? I need to insert information into the header of a PDF from a customer form online with PHP. I am not a programmer, so I need a sense of direction before I speak to my developers.
The idea is to get licence information from a field, insert that information into the header, and save the result as securely as is reasonably possible before the customised file is downloaded.
Any help would be greatly appreciated.
This will generally involve opening your existing PDF file with an appropriate PDF manipulation module (Zend_Pdf works for this purpose), performing whatever operations you need, such as inserting data into the document, and then outputting the document to the user with the appropriate headers (content type and disposition) set.
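For illustration only, here is the same open-stamp-serve flow sketched in Python with reportlab and pypdf instead of Zend_Pdf (file names, coordinates, and the licence text are invented):

```python
# Sketch: draw the licence text on a one-page overlay, merge the overlay
# onto page 1 of the existing PDF, and return the bytes to stream back.
import io
from pypdf import PdfReader, PdfWriter
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def stamp_licence(template_path: str, licence_text: str) -> bytes:
    # 1. Build the overlay containing just the header line.
    buf = io.BytesIO()
    c = canvas.Canvas(buf, pagesize=letter)
    c.drawString(72, 770, f"Licensed to: {licence_text}")  # near the top edge
    c.save()
    overlay = PdfReader(buf).pages[0]

    # 2. Merge the overlay onto the first page of the template.
    writer = PdfWriter()
    for i, page in enumerate(PdfReader(template_path).pages):
        if i == 0:
            page.merge_page(overlay)
        writer.add_page(page)

    # 3. Collect the result; serve it with Content-Type: application/pdf.
    out = io.BytesIO()
    writer.write(out)
    return out.getvalue()
```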
If you want customers to be able to download files several times, it would be wise to have the licensing information gathered from the user's account information rather than from a form.
First of all, asking for recommendations of a particular library or tool is off bounds for Stack Overflow (as it draws mostly opinionated answers). But even more importantly, in my experience, going to your developers and telling them which library to use, while you don't know anything about it and are not a developer, is generally not a good idea. Focus on what you want to accomplish; don't worry too much about what is technically needed for them to do what you want. That's their job.
Looking at your question, there's a couple of things you might think about and discuss with them:
1) Taking information and adding it on top of an existing PDF file is not the end of the world from a technical point of view. There are probably 100 or more different tools and libraries that can accomplish this in many different ways.
2) Much more technically challenging, and worthy of a discussion with them, is "save the result as securely as is reasonably possible before the customised file is downloaded". Normally, when you add information to a PDF document, it's not trivial to change it afterwards. But it's far from impossible. And if it's only a matter of removing stuff, it's even easier. A tool such as Adobe Acrobat, for example, will happily let you remove a bunch of text/graphics from a PDF file. If you want to prevent that, you have to at least protect your PDF document: set a number of flags in it that prevent it from being edited, for example.
That's still not going to be watertight, as those flags are merely supposed to be honoured by PDF processing applications. Adobe Acrobat honours them, and so do most other "decent" PDF applications, but it's certainly possible to circumvent them if you want it badly enough.
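To make that concrete, setting those flags is itself only a few lines, e.g. with pypdf in Python (file names and password invented; as said, this only deters compliant viewers, it is not real security):

```python
# Sketch: restrict editing via PDF permission flags. These are honoured
# by compliant viewers only; determined users can strip them.
from pypdf import PdfWriter
from pypdf.constants import UserAccessPermissions

writer = PdfWriter(clone_from="customised.pdf")   # invented file name
writer.encrypt(
    user_password="",                  # anyone may open the file...
    owner_password="owner-secret",     # ...but editing rights need this
    permissions_flag=UserAccessPermissions.PRINT,  # allow printing only
)
with open("customised-locked.pdf", "wb") as f:
    writer.write(f)
```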
So, from your product management point of view, think about what reasonably secure means and have a discussion about that with your developers. That's probably going to get you much further than suggesting a particular library they're going to resent instantly because it comes from a non-developer :)

How to create a deliverable for a front-end engineer?

This is a question about the development workflow of front-end engineers. I am starting a project for a rather large site with lots of pages; each page has multiple steps, and it's very difficult to lay out all the content in a spreadsheet.
The content of each page will be delivered in spreadsheet cells, and some pages have multiple variable sections that are determined by the user's preferences.
I was asked my opinion about how to structure the deliverable. I am wondering if there is a best practice out there for structuring this kind of deliverable, because a poorly structured deliverable can be almost as mind-numbing as using pen and paper to write code.
Do you have any tools, formats, or practices for creating deliverables that are easy to work with?
It sounds like you are just doing the UI design and then giving it to the front-end engineers.
If that is correct, I would suggest that you see if you can do the rough HTML/CSS work to get the page to look as you want; then they can go in and give it the functionality, and that way you have an idea of what is possible.
You can do much of the work, then leave comments about trying to center something a bit better, for example.
I am not a big fan of just getting the design on paper or as an image; it would be easier to just get the HTML/CSS.
There are plenty of tools now that make CSS and HTML easy to do. Even if you have the CSS inside the HTML, they can separate the two, and it would be a huge help to the designers.
Just do one page, and give it to them, and then come back in a day or two and get feedback as to what their thoughts are, and how you can improve what you give them.
As you go through this process, after a while both groups will know what to expect and you can get the rest done quickly.
This is more of an agile methodology with the front-end engineers as your customers.
My suggestion would be mockups or wireframes for the pages. Mockups would be examples of the pages in various states, while the wireframe is a detailed document of the structure of the page.
HTML and CSS are way too complicated for mockup use. I usually first create a requirements backlog for the UI/functionality as well (just a list of prioritized requirements in Excel).
Especially for a large site development, you should also have the process and data flow definitions done (in UML or some other form of description) to help you define the mentioned requirements.
Based on these you will know what steps (i.e., pages) the whole site's functionality needs and what the page hierarchy and structure will be like. This way it's much easier to get a grasp of the whole thing.
After that we create quick wireframes and visualize the end result with quick mockups done as images in Photoshop or similar. These are absolutely vital in my experience, as they help the customer (and other stakeholders) actually understand what is being done. For this, HTML and CSS are simply too slow to run multiple iterations with.