Auto Categorization of Content - tags

I'm developing a script that extracts the messages from the message archive of a particular meetup.com group of which I'm a member - http://www.meetup.com/opencoffee/messages/archive/
The idea is to dynamically add these to a WordPress site and allow people to search messages, auto-tag messages, etc.
The issue I have is how best to auto-categorize these messages. I would welcome any thoughts and ideas on how to go about this and on the most efficient way of programming it.
Option 1
Find a source of tags by subject area, such as finance, technology, business, etc., by using the Delicious API to find related tags by subject:
http://delicious.com/tag/finance
http://delicious.com/tag/technology
If a message contains these tags, then the message is assigned to the respective category.
I believe this could work, but I'm not sure of the most efficient method of scanning the message for these tags.
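One way to scan messages for tags, sketched below under stated assumptions, is to compile a single whole-word regular expression per category and count the hits. The category names and tag lists here are hypothetical placeholders, not data pulled from Delicious, and `min_hits` is an illustrative threshold.

```python
import re
from collections import Counter

# Hypothetical category -> tag sets; real lists would come from your tag source.
CATEGORY_TAGS = {
    "finance": {"investment", "funding", "bank", "venture capital"},
    "technology": {"software", "startup", "api", "web"},
}


def build_matchers(category_tags):
    """Compile one case-insensitive, whole-word regex per category."""
    matchers = {}
    for category, tags in category_tags.items():
        # Longest tags first so multi-word tags are tried before shorter ones.
        ordered = sorted(tags, key=len, reverse=True)
        alternatives = "|".join(re.escape(tag) for tag in ordered)
        matchers[category] = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return matchers


def categorize(message, matchers, min_hits=1):
    """Return categories whose tags occur at least min_hits times, best match first."""
    scores = Counter()
    for category, pattern in matchers.items():
        hits = len(pattern.findall(message))
        if hits >= min_hits:
            scores[category] = hits
    return [category for category, _ in scores.most_common()]
```

With roughly 1,700 messages and a few hundred tags in total, a single pass like this should be fast enough that no index or external search engine is needed; raising `min_hits` is a simple way to reduce accidental matches.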
Option 2
Find sites that are representative of the categories I need, such as ft.com and The Economist for finance or TechCrunch for technology, then determine what tags people use to tag these sites. I would assume by default that those tags reflect how people relate to these sites and their content.
Option 3
Pass the message URL to http://semanticproxy.com/ (part of the Reuters Calais project) or use the Open Calais API. I have tried this, but without much success, as the variable depth of content is not always sufficient to return a meaningful taxonomy.
Here is an example message that I parsed through the calais api:-
Original Message
http://www.meetup.com/opencoffee/messages/6045615/
Calais Result
http://www.mashinteractive.com/opencoffee/calais.php
SUMMARY
So that's about it. I would welcome any thoughts and ideas on methodology, and tips on how best to approach the message scanning for options 1 and 2.
FYI, there are approximately 1,700 messages to date, and I'm guessing I may have 10 categories, with each category defined by 20 or 30 tags.
If anyone would like to help develop a WordPress plugin or class to do this, I would be more than happy to have you on board. Bear in mind I'm not a programmer; I just tinker around the edges and pretend I am one.
Thanks in advance
Jonathan
CEO
Crowd People

You may want to check out Zemanta, which has tools and plugins (including one for WordPress) for auto-tagging content. Also have a look at Common Tag, a vocabulary for expressing tags on content using RDFa, a semantic web standard currently indexed by some search engines.

Related

Does it make sense building a Knowledge Base using headless CMS?

Situation:
I work for a multinational/global organization in the HR dept as Engineering Manager.
HR needs a lot of content related to Immigration, Benefits, Leaves, Disability, Transfers, New Hire Onboarding, Covid Policies, Expense Policies.
These are rendered through documents/knowledge bases. As you can imagine, for a global corporation that is present in multiple countries this problem can get very complex quickly.
Almost all of the content is text/documents that are not really structured.
Today we are using AEM as the Content Management Platform. AEM was being used in a headful manner, but AEM imposed a lot of restrictions when we had to develop Applications on top of AEM.
So we are going to use AEM in a headless manner and bring all the content into content fragments, so that those content fragments can be rendered on different portals (some use cases need more than 15 portals).
Questions:
Does it make sense imposing structure on these documents?
Does continuing to use AEM make sense here?
We want to enable reuse of pages : One page is rendered on multiple platforms.
We want to enable reuse of text blocks: One block of text could be used on multiple platforms.
How do we derive information such as breadcrumbs?
How do we build an information tree: e.g. articles A, B, C should be shown under US -> Leaves -> Maternal Leaves, while D, E, F should show under Global -> Leaves -> Bereavement Leaves. That information is not going to be present in content fragments.
How do we build a site map?
How do authors discover information? If I write a content fragment - how do I manage its taxonomy?
Just in case: AEM can also deliver any content as JSON or XML. Or, if your Sling models are set up for model.json, you even get a JSON representation with all the context you need.
So I am not really convinced that this statement is true:
AEM imposed a lot of restrictions when we had to develop Applications on top of AEM
I have seen projects using experience fragments combined with content fragments in "real" AEM pages. That way you can reuse and combine content parts on several levels and even make use of the multi site manager feature.
Using tags or custom metadata fields (based on a central taxonomy) will help you add information you might need to display content parts without enclosing elements. All you need is a servlet that returns all the content or experience fragments with the right tags attached.
It's hard to tell you more here without doing the complete requirements engineering ;-)
In AEM, you could:
write different scripts that use different selectors
use reference components
perhaps render breadcrumbs from the page path
use a tree structure of tags and tag the items accordingly
create a custom picker in the left-side authoring pane
The other things are more complex (site maps!) but all can be done with AEM.
You can also use AEM to do some of those things and do other ones outside of it.
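To make the breadcrumb and tag-tree suggestions above concrete, here is a minimal, product-independent sketch. It does not use any AEM API; the root path `/content/hr`, the tag strings, and the `_articles` key are all hypothetical and only illustrate the idea of deriving navigation from paths and hierarchical tags.

```python
def breadcrumbs(page_path, root="/content/hr"):
    """Derive breadcrumb labels from a repository-style page path."""
    relative = page_path[len(root):] if page_path.startswith(root) else page_path
    segments = [segment for segment in relative.strip("/").split("/") if segment]
    # Turn "maternal-leaves" into "Maternal Leaves" for display.
    return [segment.replace("-", " ").title() for segment in segments]


def build_tree(tagged_articles):
    """Group articles under hierarchical tags such as 'us/leaves/maternal-leaves'."""
    tree = {}
    for article, tag in tagged_articles:
        node = tree
        for part in tag.split("/"):
            node = node.setdefault(part, {})
        # Articles are stored under a reserved key next to any child tags.
        node.setdefault("_articles", []).append(article)
    return tree
```

For example, tagging articles A, B, C with `us/leaves/maternal-leaves` and D, E, F with `global/leaves/bereavement-leaves` yields a nested dictionary that a portal could walk to render the information tree, even though the fragments themselves carry no position information.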
I tried my best to answer it :
Does it make sense imposing structure on these documents?
Structuring the documents will give you more control over the content and make it easier to plan a content strategy. It helps with several things, such as planning, searching, and filtering.
Does continuing to use AEM make sense here?
I do not know your business in detail; however, if your content is mostly static, unlike real-time websites with live data updates, then it will surely help.
In addition, AEM holds up well in the CMS market compared with other CMSs such as Sitecore. Other CMSs store content in databases, whereas AEM uses a content repository; which is better is debatable.
We want to enable reuse of pages : One page is rendered on multiple platforms.
If by pages you mean an HTML section of a page, AEM has a good feature for that: experience fragments. Yes, there are challenges, but this will still fit well.
We want to enable reuse of text blocks: One block of text could be used on multiple platforms.
Here too, experience fragments can be a good fit: make as many variants as you want and reuse them.
How do we derive information such as breadcrumbs?
I do not know exactly what you want here, but I would suggest implementing multiple breadcrumbs for small segments, or custom breadcrumbs that target small content segments. If your content is well designed, breadcrumbs will not be a real challenge.
How do we build an information tree: e.g. articles A, B, C should be shown under US -> Leaves -> Maternal Leaves, while D, E, F should show under Global -> Leaves -> Bereavement Leaves. That information is not going to be present in content fragments.
You can manage it through templates, restricting each template so that pages can only be created under a certain part of the tree. Alternatively, design the taxonomy so that it produces the structure you describe, for example with a parent page (HR, Engineering) for each business unit. It is OK to have redundant content; use the MSM feature and tag the content in a meaningful way.
How do we build a site map?
As with breadcrumbs, you can build sitemaps for small segments, such as one for HR/US and one for Engineering/US, and render them individually or together; it does not matter. It will still be a well-designed sitemap.
How do authors discover information? If I write a content fragment - how do I manage its taxonomy?
Either set up the folder structure in a certain way, or make variants and use the tagging framework.
To conclude: no product will be a 100% fit for every requirement; you have to use the product in the way that best suits your requirements.
Good luck!

How much, what order and where to put data?

I've been updating and moving my massage business website to WordPress. During the SEO process I became interested in structured data and decided to include some, but I'm a bit confused about how to do it properly. I'm going to test it first on my current site.
I'm going to present information with JSON-LD and I've been reading a lot of schema.org manuals and blog posts about the schemas, but they are still a bit vague to me.
How much data should I provide?
I would still like to present a list of the services we provide, a price range given as currency/minPrice/maxPrice, and data about the people who work there (name, profession, phone).
Would it be wise to put that data in the <head>-section of every page?
Or just specific data to page that they relate to like staff info to "Contact Us" page and service list to "Services" page?
Is there any penalty or downside to having all that data on every page?
How do I present the personal courses or other studies that each person has taken?
How do I present those services?
Can a business of type HealthAndBeautyBusiness handle 3 phone numbers with names, or should I just put contact info under each person's data?
Does it matter in which order I present that data?
The more data you provide, the better
Better to be specific, otherwise it could be interpreted as spam. The structured data should be closely related to the content of the page itself
You mean the employees? You could use the employee property and the alumniOf properties but that doesn't match it very well. I think such data is a bit too detailed to be described at the moment - I would omit it for the time being
List them as offers, see makesOffer property
I would limit it to 1 number
The order doesn't matter
In the future try to split your questions, would be much easier to answer them that way.
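As an illustration of the makesOffer suggestion above, the following sketch builds a JSON-LD object and wraps it in a script tag. The types and properties used (HealthAndBeautyBusiness, makesOffer, Offer, itemOffered, PriceSpecification with minPrice/maxPrice/priceCurrency, employee) are schema.org terms; the business name, phone number, prices, currency, and person are made-up placeholders, and the helper names are hypothetical.

```python
import json


def build_business_jsonld():
    """Return a JSON-LD description of a business with one offer and one employee."""
    return {
        "@context": "https://schema.org",
        "@type": "HealthAndBeautyBusiness",
        "name": "Example Massage Studio",  # placeholder
        "telephone": "+00 000 0000",        # placeholder: a single number
        "makesOffer": [
            {
                "@type": "Offer",
                "itemOffered": {"@type": "Service", "name": "Classic massage"},
                "priceSpecification": {
                    "@type": "PriceSpecification",
                    "priceCurrency": "EUR",  # assumed currency
                    "minPrice": 30,
                    "maxPrice": 80,
                },
            }
        ],
        "employee": [
            {"@type": "Person", "name": "Jane Doe", "jobTitle": "Massage therapist"}
        ],
    }


def to_script_tag(data):
    """Serialize the data as a JSON-LD script element for a page template."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"
```

Following the advice above about keeping structured data close to the page content, a site might emit the offers only on the "Services" page and the employee data only on the "Contact Us" page, rather than the whole object everywhere.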
I'm going to present information with JSON-LD and I've been reading a lot of schema.org manuals and blog posts about the schemas, but they are still a bit vague to me.
Regarding this statement: I can only assume you are just learning about technologies such as JSON-LD and how they relate to the bigger picture of the Semantic Web, also known as Web 3.0.
It sounds like you are on the right track. I would additionally suggest reading articles about APIs as well as the HTTP request life cycle.
-Happy Coding

Intranet site Content Management

I'm currently designing my very first website, an intranet for a small business (5 pages). Can anyone recommend the best way to manage content for the Company News section? I don't really want to get involved in day-to-day content updates, so I need something that makes it simple for the Marketing guy to create and upload a news article, perhaps written in MS Word. Let's assume the author has no HTML skills.
I've read about Content Management systems but,
A. I won't get any funding for purchase and
B. Think it's a bit overkill for a small 5 page internal website.
It's been an unexpected hurdle in my plans. For something that I'd assumed would be fairly common functionality, I can't seem to find any definitive articles to suit my needs.
I'm open to suggestions (even if it's confirmation that a CMS is the only way to go).
Your requirements are: a small site, no budget, and the need for it to be easy for the marketing guy to upload a news item.
My recommendation would be to go with an all-in-one CMS, e.g. WordPress, which has the kind of functionality you're talking about out of the box.
My guess is this organisation is just getting into "intranets", so something quick and simple that can be used to justify expenditure if value is returned is the key. Perhaps a plugin that automatically emails a summary of the blog posts to all employees once a week would be useful?
There are many options and you can use any one of these:
Joomla
SilverStripe CMS
ModX
Cushy CMS
Frog CMS
Drupal
In addition to what Mr. Mckinnon said, keep in mind that if you don't want to get involved in the daily updates made by the people who are going to use the platform, you should consider the following:
What kind of data you want to be displayed
Who can view/modify that data
Who can create/remove data
How you will be organizing all that data
Your intranet should not be limited to displaying or creating data. Eventually all that data can turn into a Knowledge Base (KB) for your company, where your coworkers can share their solutions to the common and rare problems the company runs into. A KB like this is a great time-saver, and it is best to start it as soon as possible, so that newcomers to your company have access to it, can see the most common issues, and can become productive quickly (we all know time is a luxury in every company regardless of size).
Keep in mind, too, that all that knowledge and data is extremely valuable to you and your coworkers, so you should also consider login credentials that your company's system administrator can manage, and eventually auditing for unauthorized access (if applicable).
I hope this helps from the administrative point of view

Is tagging organizationally superior to discrete subforums?

I am interested in choosing a good structure for an online message board-type application. I will use SO as an example, as I think it's an example that we are all familiar with, but my question is more general; it is about how to achieve the right balance between organization and flexibility in online message boards.
The questions page is a load of random stuff. It moves quickly (some might say, too quickly) and contains a huge number of questions that I'm not interested in.
The idea, I imagine, is that we can use tags to find questions that we're interested in. However, I'm not sure that this works: you can't use tags negatively. I'm not interested in PHP or perl or web development. I want to exclude such posts. But with the tags, I can't.
Although discrete subforums are in a sense less flexible, as they generally force you to pick a category even if a question might fit into two (if SO had, say, areas for "Web Development", "Games development", "Computer Science", "Systems Programming", "Databases", etc. then sure, some people might want to post about developing of web-based games, for example) is it worth sacrificing some of that flexibility in order to make it easier to find the content that you are interested in, and hide the content that you are not interested in?
Is there any way with a pure tagging system to achieve the greater ease of use that subforums provide?
The real problem with subforums comes when you guess wrong about which topics have enough interest to get their own subforums. While some topics end up with their own vibrant subcommunities others end up as empty ghettos, with little activity or feeling of community. Topics that might flourish as occasional subjects in a larger forum end up fragmented among many subforums, none of which has the critical mass of people necessary to have an active, vibrant community.
Though I think that tagging is superior to grouping, people tend to think hierarchically.
In general it depends on the target group for the forum.
Maybe you can go with a mixture: use tagging and later use tag groups to order the posts. Delicious uses this, for example, and I find it rather helpful.
If you're worried about the divide between specific forums and open tag-based systems like Stack Overflow, consider making a query system that allows somewhat more complex queries than just the AND operator available here on Stack Overflow.
I cannot make a query here that will give me all questions in .NET, SQL or C#, combined, and that is the biggest irritation I have with the tags. With such a query system, you can at least create virtual forums.
Other than that, I don't really have a good opinion. I like both, and I haven't yet decided which one is best.
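The "virtual forum" idea from the answer above can be sketched as a small tag filter that supports OR, AND, and exclusion. This is only an illustration: the function names, the parameter names `any_of`, `all_of`, and `none_of`, and the question dictionaries are assumptions, not part of any existing forum software.

```python
def matches(tags, any_of=(), all_of=(), none_of=()):
    """Check a question's tags against an OR / AND / NOT tag query."""
    tags = set(tags)
    if any_of and not tags & set(any_of):
        return False  # none of the OR-ed tags is present
    if not set(all_of) <= tags:
        return False  # at least one required tag is missing
    if tags & set(none_of):
        return False  # an excluded tag is present
    return True


def virtual_forum(questions, **query):
    """Return the titles of all questions that satisfy the tag query."""
    return [q["title"] for q in questions if matches(q["tags"], **query)]
```

A saved query such as `any_of=[".net", "sql", "c#"], none_of=["php"]` then behaves like a subforum whose membership is computed from tags instead of being fixed at posting time.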
The idea, I imagine, is that we can use tags to find questions that we're interested in. However, I'm not sure that this works: you can't use tags negatively. I'm not interested in PHP or perl or web development. I want to exclude such posts. But with the tags, I can't.
While it's currently the case that you can't use tags to hide content, it shouldn't be impossible. Using SO as an example again, there's no reason that a system similar to the ignore function on a forum couldn't be made for the tag system. By adding a right-click context menu or a small "X" link somewhere in the tag display, tags could be marked as ignored. This would also allow the current tag feature to keep working: seeing everything (minus your ignore list), or clicking a tag to see only questions with that tag.
Ignored tags could be managed in your profile if you should later develop an interest in PHP or INTERCAL that you lacked before.
The real question is that of performance. In my head it's as simple as replacing a SELECT [stuff] WHERE Tag = 'buffer-overflow' with SELECT [stuff] WHERE Tag NOT IN ('php','offtopic','funny-hat-friday'), but I've not put together any DB-backed sites that get absolutely pounded on by thousands of people.
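One caveat worth illustrating: if tags live in a separate many-to-many table, filtering rows with NOT IN only drops the ignored tag rows, so a question tagged both php and sql would still appear through its sql row. A NOT EXISTS subquery excludes the whole question instead. The sketch below uses SQLite with a hypothetical two-table schema (`questions`, `question_tags`); the real schema of any given site is unknown.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory database (hypothetical schema) for the sketch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE questions (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE question_tags (question_id INTEGER, tag TEXT);
        INSERT INTO questions VALUES (1, 'Stack smashing in C'),
                                     (2, 'PHP session bug'),
                                     (3, 'Escaping SQL in PHP');
        INSERT INTO question_tags VALUES (1, 'buffer-overflow'), (1, 'c'),
                                         (2, 'php'), (3, 'php'), (3, 'sql');
        """
    )
    return conn


def visible_questions(conn, ignored_tags):
    """Return (id, title) for questions that carry none of the ignored tags."""
    ignored = list(ignored_tags)
    if not ignored:
        return conn.execute("SELECT id, title FROM questions ORDER BY id").fetchall()
    placeholders = ",".join("?" for _ in ignored)
    sql = (
        "SELECT q.id, q.title FROM questions q "
        "WHERE NOT EXISTS ("
        "  SELECT 1 FROM question_tags t "
        f"  WHERE t.question_id = q.id AND t.tag IN ({placeholders})"
        ") ORDER BY q.id"
    )
    return conn.execute(sql, ignored).fetchall()
```

Whether this performs well under heavy load depends on indexing (an index on `question_tags(question_id, tag)` would be the obvious first step) and on the database engine, so treat it as a starting point for measurement rather than a tuned query.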

Is there any wiki engine that supports page creation by email?

I want to consolidate all the loose information of the company I work for into a knowledge base. A wiki seems to be the way to go, but most of the relevant information is buried inside PST files, and it would take ages to convince people to manually translate their emails one by one (including attachments) into wiki pages. So I'm looking for a wiki engine that supports page creation by email, that is, capable of receiving email (supporting plain text, html and attachments) and then create the corresponding page. Supporting file indexing and looking for duplicates would be a huge bonus.
I tried with WikiMatrix, but didn't find what I was looking for. I wouldn't mind building my own engine (borrowing a couple of snippets here and there for MIME decoding), but I don't think this is such a rare problem that no implementation exists.
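For the MIME decoding part of a home-built engine, Python's standard `email` package covers plain text, HTML, and attachments. The sketch below assumes the messages have already been exported from the PST files to raw RFC 822 form (for example .eml files); the function name and the shape of the returned dictionary are hypothetical.

```python
from email import message_from_bytes, policy


def email_to_page(raw_bytes):
    """Turn one raw email into a dict with a title, body text, and attachments."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)

    # Prefer the plain-text body and fall back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""

    # Decoded attachment bytes; the payload is None for nested multipart parts.
    attachments = [
        (part.get_filename(), part.get_payload(decode=True))
        for part in msg.iter_attachments()
    ]

    subject = msg["subject"]
    title = str(subject) if subject else "Untitled"
    return {"title": title, "body": body, "attachments": attachments}
```

Duplicate detection, which the question lists as a bonus, could be layered on top by hashing each attachment's bytes before storing it.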
Both Jotspot and MediaWiki allow you to do this. The latter has support for a lot of plugins, of which this is one. The format is essentially PageTitle#something. Jotspot is a hosted solution where you get your own email address; MediaWiki is self-hosted, and you give it a mailbox to monitor for incoming mail.
Articles are appended to pages if they already exist, or a new page is created if it does not. This does require a degree of discipline for naming conventions, but is great for CC'ing.
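The append-or-create behaviour described above is simple to reproduce in a custom engine. The sketch below uses a plain dictionary as a stand-in for the wiki store; the function name and return values are hypothetical and do not reflect how Jotspot or MediaWiki implement it.

```python
def file_article(wiki, title, text):
    """Append text to an existing page, or create the page if it is missing."""
    key = title.strip()
    if key in wiki:
        # Existing page: keep earlier content and add the new article below it.
        wiki[key] += "\n\n" + text
        return "appended"
    wiki[key] = text
    return "created"
```

Because the page title is the only key, this is where the naming discipline mentioned above matters: "VPN Setup" and "VPN setup" would end up as two different pages unless titles are normalized first.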
We use MediaWiki here and I like it a lot. It has the same flaws as many other wiki packages (e.g. it is difficult to reorganize without orphaning pages) but is as good as, if not better than, other wiki packages I've used.
I don't know if this is exactly what you're looking for, but I know many of 37 Signals' products support adding data through email. I use Highrise to keep track of some of my business correspondence, and I'm able to CC or forward emails to Highrise and they get added to the appropriate contact.