Is there a well known classifier library? - cluster-analysis

I'm crawling data from the internet, and it comes without any classification.
Is there a library you can recommend for this?
EDIT
I'm crawling job postings from other websites, and I need to group them by industry.

To sort unlabelled data into groups, you want clustering, not classification. One of the most complete machine learning libraries is the Java-based Weka. You'll probably want to start by extracting text from the web pages (remove script and style elements completely, strip other tags), and then run the text through the StringToWordVector filter before performing clustering.
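The extraction step described above can be sketched without Weka. This is a minimal Python illustration using only the standard library, not Weka's StringToWordVector filter itself: it drops script and style content, strips the remaining tags, and turns the text into word counts. The names `TextExtractor`, `extract_text`, and `word_vector` are made up for this sketch, and the tokenizer (lowercase ASCII letters only) is an assumption you would likely want to replace.

```python
from collections import Counter
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    """Return the text of an HTML page with tags, scripts and styles removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def word_vector(text):
    """Hypothetical stand-in for a word-vector filter: lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))
```

The resulting counts are what a clustering algorithm would then take as input; which algorithm and how many clusters to use is a separate decision.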

My current employer developed a system to categorize web pages. We could not find any useful libraries, so we had to build our own. We do not license ours out.
I can give you some hints. Spam analyzers classify email into Junk or Not Junk. You can use the same tools, such as Bayesian filters, CRM-114, etc., to do your own classifications on any text, including web pages.
You will have to watch the results of these very carefully and give them a lot of human feedback. You can often find keyword sets that will score very well for you. Finding those keyword sets will take time and effort, and the sets will change somewhat over time.
You will have to write code to divide web pages into topic sections because most pages are not all one thing. There are ad frames, navigation and other things.
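To make the Bayesian hint concrete, here is a minimal sketch of a multinomial naive Bayes text classifier in plain Python. It is not the system described above and not CRM-114; the class name `NaiveBayes`, the tokenizer, and the add-one smoothing are assumptions chosen for brevity. The human feedback mentioned above corresponds to calling `train` with corrected labels.

```python
from collections import Counter, defaultdict
import math
import re


def tokenize(text):
    """Assumed tokenizer: lowercase runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        words = tokenize(text)
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def classify(self, text):
        """Return the most probable label, or None if nothing was trained."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, doc_count in self.doc_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Work in log space to avoid underflow on long texts.
            score = math.log(doc_count / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A usage example with made-up training data:

```python
nb = NaiveBayes()
nb.train("java developer backend software engineer", "IT")
nb.train("registered nurse hospital patient care", "Healthcare")
print(nb.classify("software engineer"))
```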

Related

DNN CMS training

What's the best way to start training an end user in a CMS like DotNetNuke?
The end user will want to add, edit, and delete their own content. They will need to install modules and understand how everything works.
Should I create a manual? Is there a way to plan some training?
Any ideas?
Edit: the end users are VERY IT-illiterate; they struggled to even understand the rich text editor. I need to train them on how to use the Form and List module and the HTML module for editing content. They want a document of some sort, which is really old school.
PD24, for what most customers do, it usually takes only 5-10 minutes of training. I usually create a couple of videos with Jing, which is a free screen and audio recording tool. I go through and do a voice-over as I create a page, edit text, add photos, and add modules, and record it. Then I send them the links, which they can reference if they ever need a reminder.
Works great! (boooo to manuals, no one reads those and they take a lot of time to make!)
Also, DNNcreative is probably too detailed for your client; it's a good resource for DNN implementers.
We have a variety of videos in the video library on DotNetNuke.com; you could point users to those for specific topics.
We (DotNetNuke Corp) also provide custom training solutions; we could develop a custom training program for your client that fits the scope of your project and delivery requirements. If you want more info, feel free to email me at training#dnncorp.com.
Have a look at www.dnncreative.com; they have some awesome tutorials for developers and users.

Content management system for graphics?

I am researching CMSs, something I know little about. I am an animator; I generate large numbers of files and have many source files that I use. There are so many that it's become difficult to manage them all and keep them organized. Can someone suggest an open-source CMS solution that could aid in organizing these files?
Thanks
Apparently, these systems are called "digital asset management systems" when they're not about text but about images.
An overview of open-source ones can be found here
Razuna looks quite good. I'm looking for a similar solution; though I probably won't have trillions of texture files or the like, I do have loads of .psd/.ai/.indd files, for which a number of systems offer thumbnail previews to a certain degree.
One thing to look out for is whether the system can handle/use/manipulate IPTC metadata. Basically, this means that when metadata is embedded in an asset, the system can present it to you in a digestible format. An example of this is Google's Picasa, which allows search indexing on this data. Also, a number of stock asset sites both use and produce this data in their asset sets, so when you download an image, for example, it comes pre-tagged with "woman, standing, smiling, photo, office" and you only have to add your own tags on top, for example "telecoms project, overview module".
Again, if you're generating a swathe of output files, the kind of versioning/management you need may depend on the nature of that output.
If, for example, you have output that is made up of a bunch of source files, some of which are program-specific and some of which are linked assets, then you might want to put the whole lot under version control (PlasticSCM or Subversion perhaps) and "exclude" graphic files by their file type. Then, use something like Razuna to upload, hold and display your graphic files.
I noticed with Razuna that you can organise things by category, and assign multiple categories; that is, you have one set of files but multiple views of them. That's why I liked Razuna, though to be honest the demo crapped out, but that could've been because I changed email and profile data halfway through the trial.
Interested to know how you go in your search and what you've found to be useful!
We're looking for something as well, preferably cloud-based, but that's not a requirement.
We're looking right now at Razuna. It has a lot of great features. The organization seems very flexible, which is great. The
But it doesn't seem very mature in some ways. I think the development team is small. Some features don't work reliably (e.g., uploading additional versions of an asset [such as different resolutions] works intermittently and only with IE, as far as I can tell).
So if anyone has any other systems worth a look I'd be glad to hear about them.
In the end, Razuna was just too immature for us. It's a great effort. The dev team is obviously talented and sincere. In a couple of years they may well have a great product. I wish them luck.
We've settled on a commercial service, WebDAM. In some ways it's very comprehensive and does a lot of things well. The price is not too bad, and there's a nice API to program against, so we will be able to lean on it heavily for image selection and then incorporate it into our automated processes, grabbing images as needed programmatically.
In other ways it's a little maddening. The UI, in particular, could use a lot of work to make it easier to use for the average person. It was clearly designed by programmers. A lot of UI niceties that would not be that hard to add are missing. There are obvious problems, like boxes with data being too small while a lot of screen real estate goes unused.
The keyword capability is useful, but it doesn't have obvious things like synonyms and stemming when you search. This will make things harder on our users, and keywords will have to be planned carefully to make sure the search is as useful as it can be.
We're still just in the planning stages, so not sure how it will fly once we go live, but we're going to give it a shot. You might want to have a look at it. But they have a much more mature development effort going on and more support for the customer, which swayed us in the end.

Resources for Scantron Cognition Enterprise?

I am using Scantron Cognition Enterprise at work to capture data from scanned forms. Building these forms is tedious at best, especially when it would be nice to have a library of pre-built objects to use. Unfortunately, documentation and on-line resources are scarce.
Does anyone have any pointers to find some resources for this tool?
Hey Jason, believe it or not, Scantron is STILL the standard, but this is not the Scantron you probably remember. Although OMR (bubble) forms are still used extensively in education, there are a lot more advanced technologies available to be added to them today.
Concerning Cognition, I looked through the available tags and these would fit:
"document-imaging" - Cognition is a document imaging product and can feed images and index values into most commercially available document storage applications
"OCR" - Optical Character Recognition, or reading machine print.
"ICR" - Intelligent Character Recognition, or reading handwriting, usually in a constrained print format (one letter per box, like a credit card application).
"datacollection" - the key purpose of Cognition is data collection.
However, there is not a tag for "OMR" - Optical Mark Recognition, or reading bubble choices, similar to the basic Scantron forms of the past. Also, I could not find one for "Key From Image", another purpose that Cognition is used for.
I am a Cognition user as well as someone who markets it, and I know that there are a large number of users in North America. Many corporations that use Cognition use it for sensitive HR functions and so might not have their usage of it posted in a searchable format. Many other organizations use it for safety inspections, insurance data entry, and also for testing and surveys: basically anywhere you have a large number of paper forms and need all of the data quickly entered into a database.
Many users are using Cognition for sensitive applications and so are not likely to share, but I can share a few objects I have. You could also contact your Scantron rep, who might have something to share as well. I have some decent ICR fields built for name, e-mail, address, etc. The ICR fields are best when you build in your own dictionary or database look-ups. The OMR fields are the hard ones to build, but I have a few of these as well. The easiest way to share these is to send you the form that already has the field built into it. You can build your own lookups from txt, xls or db files.

Searching for a document format.. flowing layout + page control

I am bouncing around the idea of creating a custom document versioning system to use on business rule manuals. These manuals are broken up into sections, one rule per section, numbered in outline style (1.1, 1.2, etc.). There are many manuals which contain the same rule for different locations in the country (down to the state/county level); however, many locations will have different versions of the rules depending on business needs or whatnot.
My thought is to create a system which will manage versions of each section/rule separately. This would make the management of this mess much easier to maintain (think hundreds of manuals times hundreds of rules), and it would make fielding query requests from management much quicker.
Ok, it's a fairly easy and straightforward design to this point. Now for the monkey wrench. These rules are regulated by government agencies, so they must be submitted to and approved by state agencies. In doing this, many states require only the exact pages which are updated for each request to be submitted for approval. Once they are approved, these pages will get a new effective date and the rest of the manual will remain the same. There are business reasons for this process.
So my choice of document format has to allow for a flowing layout much like Word; however, I need to be able to programmatically determine the page range of these sections and whether changes or additions will cause a repagination.
The most complex layout will contain only tables, headers/footers, and a table of contents. I have thought about using OOXML, but I don't see a way to determine pagination without loading Word which is something I would prefer to avoid. I could create my own pagination algorithm, but that sounds a lot like reinventing the wheel.
Can anyone offer pointers to a solution whether it is an open document format, a book, or something else? Thank you for taking the time to read this.
If you want a truly modular document, then DocBook might be worth a look. You have all the rich formatting you need, but it does need a bit of work. It really depends on who's doing the authoring and what tools they're comfortable using. DocBook is a rich mark-up language, and you can either work directly in the plain text source or use one of a number of WYSIWYG editors, e.g. ArborText.
It's not Word though - which might be enough to put your authors off!
If you did go with DocBook, you would maintain each document section in a separate text file so your versioning solution would work well. DocBook can produce output in a number of formats simultaneously so you could have an HTML version, an OOXML version, and a PDF version produced from the same source. A PDF version of each changed section might be appropriate to send to government agencies for approval.
On pagination, you could make life a lot easier for yourself by not having continuous page numbers. Use section or chapter based page numbering, e.g. page I-1, I-2, ..., II-1, II-2.
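A small Python sketch can show why section-based numbering helps with the "submit only the changed pages" requirement. It assumes the number of pages per section is already known (in practice that would come from whatever renders the document); it does not compute pagination from content. The function names and the sample page counts are made up for illustration.

```python
def continuous_labels(section_pages):
    """Map each section to its page labels under continuous numbering."""
    labels, next_page = {}, 1
    for name, count in section_pages.items():
        labels[name] = [str(n) for n in range(next_page, next_page + count)]
        next_page += count
    return labels


def section_labels(section_pages):
    """Map each section to its page labels under per-section numbering."""
    return {
        name: [f"{name}-{n}" for n in range(1, count + 1)]
        for name, count in section_pages.items()
    }


def affected_sections(before, after):
    """Return the sections whose page labels differ between two layouts."""
    return [name for name in after if before.get(name) != after[name]]


# Hypothetical manual: section I grows from 2 pages to 3 after an edit.
old = {"I": 2, "II": 3, "III": 1}
new = {"I": 3, "II": 3, "III": 1}

print(affected_sections(continuous_labels(old), continuous_labels(new)))
# ['I', 'II', 'III']
print(affected_sections(section_labels(old), section_labels(new)))
# ['I']
```

With continuous numbering, growing one section renumbers every page after it; with per-section numbering, only the edited section's pages change.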

How to create a deliverable for a front-end engineer?

This is a question about the development workflow of front end engineers. I am starting a project for a rather large site with lots of pages, each page has multiple steps, and it's very difficult to lay out all the content in a spreadsheet.
The content of each page will be delivered in a spreadsheet cell, and some pages have multiple variable sections that are determined by the user's preferences.
I was asked my opinion about how to structure the deliverable. Is there a best practice for structuring this kind of deliverable? A poorly structured deliverable can be almost as mind-numbing as writing code with pen and paper.
Do you have any tools, formats, practices for creating deliverables that are easy to work with?
It sounds like you are just doing the UI design and then giving it to the front-end engineers.
If that is correct, I would suggest that you see if you can do the rough html/css work to get the page to look as you want, and then they can go in and give it the functionality, but that way you have an idea what is possible.
You can do much of the work, then leave comments about trying to center something a bit better, for example.
I am not a big fan of just getting the design on paper or as an image; it would be easier to just get the HTML/CSS.
There are plenty of tools now that make CSS and HTML easy to do. Even if you leave the CSS inside the HTML, they can separate the two, but it would be a huge help to the designers.
Just do one page, and give it to them, and then come back in a day or two and get feedback as to what their thoughts are, and how you can improve what you give them.
As you go through this process, after a while both groups will know what to expect and you can get the rest done quickly.
This is more of an agile methodology with the front-end engineers as your customers.
My suggestion would be mockups or wireframes for the pages. Mockups would be examples of the pages in various states while the wireframe is a detailed document of the structure of the page.
HTML and CSS are way too complicated for mockup use. I usually first create a requirement backlog for UI/functionality as well (just a list of prioritized requirements in Excel).
Especially for a large site, you should also have the process and data flow definitions done (in UML or some other notation) to help you define the requirements mentioned above.
Based on these, you will know what steps (i.e. pages) the whole site's functionality needs and what the page hierarchy and structure will be like. This way it's much easier to get a grasp of the whole thing.
After that we create quick wireframes and visualize the end result with quick mockups done as images in Photoshop or similar. These are absolutely vital in my experience, as they help the customer (and other stakeholders) actually understand what is being done. HTML and CSS are simply too slow for running multiple iterations.