How to scrape and virtually combine wiki articles? - merge

So our company has a large number of internal wiki sites for different departments and I'm looking for a way to unify them. We keep trying to get everybody to use the same wiki but it never works, they keep wanting to create new ones. What I'm wanting to do as an alternative is to scrape each wiki and create a new wiki with articles that has combined information from each source.
In terms of implementation I've looked at Nutch (http://nutch.apache.org/) and (http://scrapy.org/) to do the web crawling and using MediaWiki as the frontend. Basically I'd use the crawler as the front end to scrape each wiki, write some code in the middle (I'm thinking of using Python or Perl) to make sense of it and create new articles, writing to MediaWiki using its API.
Wasn't sure if anybody had similar experience and a better way to do this, trying to do some R&D before I get too deep into the project.

I did something very similar a little while back. I wrote a little Python script that scrapes a page hierarchy in our Confluence wiki, saves the resulting html pages locally and converts them into DITA XML topics for processing by our documentation team.
Python was a good choice - I used mechanize for my browsing/scraping needs, and the lxml module for making sense of the xhtml (it has quite a nice range of xml traversing/selection methods. Worked out nicely!

Please don't do screenscraping, you make me cry.
If you just want to regularly merge all wikis into one and have them under a "single wiki", export each wiki to XML and import the XML of each wiki into its own namespace of the combined wiki.
If you want to integrate the wikis more tightly and on a live basis, you need crosswiki transclusion on the combined wiki to load the HTML from a remote wiki and show it as a local page. You can build on existing solutions:
the DoubleWiki extension,
Wikisource's interwiki transclusion gadget.

Related

Simplest CMS ever?

I’m building a super simple website with 5 pages and I want a CMS that allows me to change the text and the pictures in a couple of them.
In the past I used wordpress, but it has way too many features that i don’t need in this case.
I’ve been trying to learn gatsby.js so I would like to build it on that, but trying to see how to source from Netlify-CMS I started facing an overwhelming amount of information which I'm not sure I need.
Any tips?
Thanks!
M
Netlify has a built in CMS, and it's compatible with Gatsby! You can find examples online. It should be good for smaller sites, but for larger projects, I really like Prismic.io. Contentful is another popular one, but it's a bit pricier than prismic.
Edit: reread your comment about sourcing from Netlify. Netlify is not a "source" plug in in Gatsby. You use a local file +markdown source, and do the configuration for netlify, which adds an admin interface at an endpoint. You configure your data models in the interface, create login, etc. Then, when you submit changes, it modifies files in your connected git repo, so the local file + remark will make the data available in the graphql queries.
In the end I used Forestry.io, a good simple solution that did exactly what I needed in combination with Jekyll.

Use TYPO3 only for Backend

I am new to TYPO3. A customer wants TYPO3. It's a quite large project with WIKI, JOBS, NEWS, PROJECTS, etc.
As far as I can see, there are several ways I could go.
What would you as an experienced TYPO3 developer suggest as the easeast, fastest, most efficient approach to reach our goal?
A: Leave the object graph in TYPO3
A1: I've seen with the DOMAIN_MODEL classes there is some kind of rudimentary ORM integrated. This approach means, we would need to reevent the wheel with WIKI, NEWS, etc.
A2: It would be great to use existing extentions for WIKI, NEWS,... The customer wants them to be related (add a wiki entry to news or new to projects). That means the existing extentions have to be extented
B: Use TYPO3 only for normal content like pages (legal, contact, landig pages...) and as a backend for dynamic content as well, which is stored and handled in an external API. This approach also means, we would need to reevent the wheel with WIKI, NEWS, etc.
I would suggest to use TYPO3 for content and news (use news extension for that) and look for a suitable standalone wiki solution. Then connect both systems the way you need them to interact. No wheels have to be reinvented.

EPiServer migrate content from home grown CMS

Hopefully someone can help me, I'm new to EPiServer and have been given a data migration task. We are using the latest version 8.5. I need to migrate content from a clients home grown CMS (that luckily is in a tree like structure) to EPiServer. There doesn't seem to be a whole lot of information about this on the web - perhaps I just don't know the right thing to search for.
It looks like using the EPiServer.ServiceApi might be the route to go but again locating useful documentation is proving difficult.
I was thinking of setting up the client CMS in SQL Server and writing a simple console application to call the EPiServer.ServiceApi inserting the content. If anyone has any information on this or better still and example i would be very grateful.
Thanks,
Dan
If you are just importing content from another CMS I would write a scheduled job in EPiServer:
http://world.episerver.com/code/dannymurphy/Stoppable-Scheduled-Job-with-feedback/
That job then uses the standard IContentRepository to create content:
http://world.episerver.com/documentation/Items/Developers-Guide/EPiServer-CMS/8/Content/Persisting-IContent-instances/
That way you can run it whenever you want and have access to EPiServers complete API. Also you can see progress of the import through the job status.
In the job you can read the content as a file in any format you like or directly from the source CMS database or some xml or RSS feed perhaps.
I have moved content from PHP, Java and .NET CMS this way. In .NET you could even access the source CMS via WCF or SOAP if available.
The ServiceApi is relatively new and more focused on Commerce products and media assets rather than CMS page and block content so I wouldn't use that.
There is complete documentation below for the ServiceApi by the way, did you not find it?
http://world.episerver.com/documentation/Items/EPiServer-Service-API/
Regarding language management you can read more in the below links:
http://cjsharp.com/blog/2013/04/11/working-with-localization-and-language-branches-in-episerver-7-mvc/
http://tedgustaf.com/blog/2010/5/create-a-new-page-language-branch-programmatically-in-episerver/
Basically you have two options for multiple languages. If the content is just straight translations you should create nine different language versions (branches) of the same page. You can also have multiple sites in an EPiServer installation but that requires 9 separate licenses (and the associated costs).
I've done a lot of EpiServer content migration projects. The easiest way if it's possible is to export your current sites tree in Json and then import that into EpiServer. I've had to do it on a recent project and mixed with Json.net it's pretty easy.
If you want to go that route you can find all the code to do it here: EpiServer Content Migration With Json.Net/

What separates a content management system from just a bunch of web pages?

I have a website that has related pages. They have links that point back and forth to one another but I have no integrated system, nor do I know what that would mean.
What is the minimum code that a group of web pages must have to be considered a Content Management System (CMS). Is it that all the settings are in the database and the pages are generated somehow? Is there some small snippet that all my pages could share that makes them a CMS, database or not?
Thanks. I was also hoping not to have to study a giant CMS to see what makes it a CMS . After maybe a basic understanding I would know what I was looking for.
edit: here's why I ask about code. Whenever I have looked at a CMS, and maybe they aren't all the same, I saw that to develop a module you always had to inherit from certain classes and had some necessary code. I didn't know if there was some magic model that I just don't get that all cms makers understand.
edit: perhaps my question is more about being extendable or pluggable. What would a minimum look like? Is it possible to show that here?
edit: how about this? Is something a CMS if it is not extendable and/or pluggable?
I think this is really impossible to say. We all manage content. The "system" is just whatever mechanism you use to do so(dragging and dropping in Explorer or committing content changes via a SQL query). To say there is a minimum amount of code needed really isn't indicative. What is indicative is how often you find yourself making mistakes and how easy it is for a given user of a given skill level and knowledge to execute the functions in the designed system. That tells you the quality/degree of what you have in place being worthy of being called a "CMS."
Simply put a CMS is an application that allows the user to publish and edit existing web content.
In response to the edit:
A "good" CMS allows of extensibility. By using inheritence you can extend the functionality of a CMS outside of the core components provided. That's the magic.
About Extensibility:
Depending on the language/framework you want to build your CMS with, you can load pages or controls(ASP.NET) using command built into the framework. Typically what is being done is a parent class/interface is being defined that forces an module that is to be developed to follow some given standards:
Public MustInherit Class CMSModule
'Here you will define properties and functions that need to be global to all modules being developed to extend your CMS.
public property ModuleName as string
End Class
public class PlugInFooCMSPage
inherits CMSModule
end class
Then it's just a matter of simply loading a module dynamically in whatever construct a given language/framework provides.
Ultimately, a CMS is a system that lets you manage content, so it needs an user interface that is dedicated to letting you easily create, edit and delete pages on your website.
However, it's fairly usual to expect from a CMS to provide a browser-based WYSIWYG page editor, file uploading, image resizing, url rewriting, page categories and tags, user accounts (editor, moderator, administrator), and some kind of templae system.
Without dragging you into a theoretical explanation of what a CMS is and what it's not, perhaps some tutorials on the building methodology of a CMS will help you better understand.
http://css-tricks.com/php-for-beginners-building-your-first-simple-cms/
http://www.intranetjournal.com/php-cms/
A Content Management System is a System that Manages Content. :)
So if you got many pages that share the same layout, you can create a system that stores the content into a database and when a page is requested, it gets that content, merges it with a template that contains the page header, menu, etc.. and outputs the result.
The basis idea is that you don't want to copy HTML pages, and have to edit hundreds of them when you want to change your layout.
Such a system can be very complex, featuring wysiwyg editors, toolbars, version control, multiple user publishing and much more, but it could be as simple as a single page behind a standard loging, that contains only an input field for the title and a textarea in which you type the html content.

How to move most cms content from mediawiki to django-cms?

I'm planning to move a site from mediawiki to django-cms. It has around 150 pages and a similar number of embedded images.
The question is how to do it with minimal effort. There are several issues:
django-cms does not support mw wiki sintax, should I export html?
how can I dump all the the wiki pages as plain html
how to automate the process or at least minimize the effort.
how to deal with internal links (there are not so many internal links, probably under 2/page)
how to deal with images
First reaction - do it manually. 150 pages is not much and you can certainly do it in one working day (probably less) - which is about the same time it will take you to find and test an automated solution.
I've moved a wiki to another format and found I got a long way with just find and replace in Notepad!