I am working on a project that needs to identify contact points on a company's website, to be used for the purpose of enhancing security.
Right now, I managed to use Apache Nutch to crawl several rounds of sites. The next step will be to parse the HTML pages and locate where the contact information is. In this case, I am only interested in email addresses and phone numbers....
Here is what I am planning to do: write a MapReduce job to parse the HTML files, and use regular expressions in combination with an HTML parser such as Jsoup or BeautifulSoup to find the contact information.
However, I am wondering whether there is a parser plugin that has already been implemented, and maybe tested, for this purpose?
You should not need to write a custom MapReduce job. Just implement a bespoke HtmlParseFilter, which will give you a DOM to run XPath expressions on, plus the text of the document if you want to use regular expressions.
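As a rough illustration of the regular-expression part only (not the Nutch plugin itself, which would be written in Java), here is a minimal Python sketch. The function name `extract_contacts` and both patterns are assumptions made for this example; the patterns are deliberately simple and will miss or over-match some real-world formats.

```python
import re

# Deliberately simple patterns (assumptions for illustration only);
# real email addresses and phone numbers vary far more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{6,}\d")


def extract_contacts(text):
    """Return the unique email addresses and phone-like strings in text."""
    emails = sorted(set(EMAIL_RE.findall(text)))
    phones = sorted(set(match.strip() for match in PHONE_RE.findall(text)))
    return {"emails": emails, "phones": phones}
```

The same patterns could be applied to the document text that the parse filter receives; tightening them for a specific country's phone formats would likely reduce false positives.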
Having worked on something similar for a customer a few years ago, I found that there were many pages implementing schema.org. You could write a custom HtmlParseFilter with XPath to extract normalised info from the microdata. You can look at the microdata parser for StormCrawler as an example of how to leverage Apache Any23 to extract microdata.
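To give an idea of what pulling contact details out of schema.org microdata involves, here is a small Python sketch using only the standard library. It is not the StormCrawler or Any23 code; the class and function names are made up for this example, and it only looks at the schema.org `telephone` and `email` properties, ignoring item scoping and multi-valued `itemprop` attributes.

```python
from html.parser import HTMLParser


class ContactMicrodataParser(HTMLParser):
    """Collects values of itemprop="telephone" and itemprop="email"."""

    WANTED = {"telephone", "email"}

    def __init__(self):
        super().__init__()
        self.found = {}        # property name -> list of values
        self._current = None   # property whose text we are waiting for

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop not in self.WANTED:
            return
        # Microdata can carry the value in an attribute instead of the text,
        # e.g. <meta content="..."> or <a href="mailto:...">.
        value = attrs.get("content") or attrs.get("href")
        if value:
            self.found.setdefault(prop, []).append(value)
        else:
            self._current = prop

    def handle_data(self, data):
        if self._current and data.strip():
            self.found.setdefault(self._current, []).append(data.strip())
            self._current = None

    def handle_endtag(self, tag):
        self._current = None


def extract_contact_microdata(html):
    """Return a dict of the telephone/email microdata values found in html."""
    parser = ContactMicrodataParser()
    parser.feed(html)
    return parser.found
```

A full microdata extractor such as Any23 handles nesting and vocabularies properly; this sketch only shows why microdata tends to give cleaner values than regular expressions over raw text.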
If you want a more NLP-intensive method, you could use Behemoth to process Nutch segments with tools such as Apache UIMA or GATE.
HTH
I'm working on a project in which we are using semantic web technologies to create a web application that gives users recommendations to help them make the right decision (I won't get into the details).
For me and my team it's a first experience working with ontologies.
We've already created the ontology (we have RDF- and OWL-formatted files, which we keep in Eclipse).
Separately, we've created the web application. My question is how to connect the web page to the RDF- and OWL-formatted data; more precisely, how to pass input from the web page to the dataset and show the output on the page.
I've found some info (on old forums) saying that EasyRdf can be embedded in a PHP script, but it is not clear how.
Based on YouTube tutorials, I've downloaded Jena Fuseki, but I don't know what the next step is.
I would be glad to get any advice or suggestions :)
In my view, there is no single way to do this.
I usually set up a back-end application to handle this kind of processing (build SPARQL queries, execute them and parse the results) and then return the results to the front end in a form it can understand.
So, you could keep all your data in an RDF store, for example a TDB dataset exposed by Fuseki, and interact with that data through a back end that queries and updates it and parses the results.
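A minimal back-end sketch along those lines, in Python with only the standard library, might look like the following. The endpoint URL, the dataset name `mydataset` and the function names are assumptions for illustration; the request itself follows the standard SPARQL protocol (a form-encoded `query` parameter, JSON results), which Fuseki supports.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint: a local Fuseki server with a dataset named "mydataset".
FUSEKI_QUERY_URL = "http://localhost:3030/mydataset/sparql"


def build_query(limit=10):
    """Build a trivial SELECT query; a real back end would build it from user input."""
    return f"SELECT ?s ?p ?o WHERE {{ ?s ?p ?o }} LIMIT {int(limit)}"


def run_select(query, endpoint=FUSEKI_QUERY_URL):
    """POST the query to the SPARQL endpoint and return the parsed JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def simplify(results):
    """Flatten SPARQL JSON results into plain dicts the front end can render."""
    return [
        {name: cell["value"] for name, cell in row.items()}
        for row in results["results"]["bindings"]
    ]
```

The web page would then call an HTTP handler in the back end that does something like `simplify(run_select(build_query()))` and returns the list as JSON.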
That's my advice; I hope it is useful to you.
Good luck!
Is it possible to convert a complete CCDA XML document to FHIR-based XML? I would like to convert a complete CCDA XML document to FHIR-compatible XML through a Mirth Connect interface.
I would like sample messages that show how a complete CCDA is transformed to FHIR-based XML. I googled and ended up with no answers. It would be great if you could help me.
Strictly speaking, C-CDA is consolidated CDA. It is an IG - Implementation Guide.
In simple terms, there are various IGs for generating a CDA document; HITSP/C83 is one example, and there are several others. The main problem with all these separate IGs is that they are not uniform. C-CDA was created to bring uniformity to the data. This presentation here is a good place to start. Basically, it says you must have at least 4 mandatory sections in your CCD, and the remaining sections are optional. It depends entirely on your use case.
Secondly, you need to download a copy of a valid C-CDA file from this site. Let's take the inpatient summary document.
So that would be your target document, and consider it as a template.
Third, you (or your engineering team, if you are not the developer yourself) need to build the logic to extract information and place it into that template. This is an iterative process, and every time you need to validate the document you produced against the validator (site given above).
Until the validator reports 0 errors, your document is not ready.
So, there is no ready-made code or logic that you can just plug and play to start producing C-CDA documents.
I have a number of PDF documents which are dynamic forms. I want to make one document that contains all pages of the first document, then all pages of the second, and so on. How can I do it programmatically with the Java API of Adobe LiveCycle Enterprise Server?
I found documentation here, but it does not work for dynamic forms. Maybe I can convert the dynamic forms to static forms first? How can I do that?
http://livedocs.adobe.com/livecycle/8.2/programLC/programmer/help/wwhelp/wwhimpl/common/html/wwhelp.htm?context=sdkHelp&file=001473.html
Thanks in advance for your answers.
Cheers,
Arne
It depends on how many of these you need to create. I assume you are going to be making large numbers of these PDFs. The correct thing to use is Adobe LC Output ES2. The process is first to render your dynamic XFA-based forms to static PDF using the Output service with whatever data you have, and then to assemble them with the Assembler service (which requires a DDX file with the rules for the assembly).
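For orientation, a DDX document that simply concatenates several PDFs is a `DDX` root element containing a result `PDF` element with one source `PDF` child per input. The Python sketch below only generates such a file; it does not call LiveCycle. The function name and the file names are placeholders, and the assumption is that each source name matches a key in the input map passed to the Assembler service.

```python
import xml.etree.ElementTree as ET

DDX_NS = "http://ns.adobe.com/DDX/1.0/"


def build_ddx(result_name, source_names):
    """Return a DDX string that concatenates the sources in the given order."""
    ET.register_namespace("", DDX_NS)  # serialize DDX as the default namespace
    root = ET.Element(f"{{{DDX_NS}}}DDX")
    result = ET.SubElement(root, f"{{{DDX_NS}}}PDF", {"result": result_name})
    for name in source_names:
        # Each source is assumed to be a key of the Assembler input map.
        ET.SubElement(result, f"{{{DDX_NS}}}PDF", {"source": name})
    return ET.tostring(root, encoding="unicode")
```

For example, `build_ddx("out.pdf", ["doc1.pdf", "doc2.pdf"])` yields a DDX with two source entries in that order; the static PDFs rendered by the Output service would be supplied under those names.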
There are APIs (inc. Java) to call these services directly, or you can write an orchestration (in Workbench) that does all the steps and call that orchestration from various APIs, including Java. The short-lived orchestration capability is also licensed with Output.
This sample would be a good thing to review to see how to construct the orchestration (service) XDP and DDX.
http://help.adobe.com/en_US/livecycle/9.0/samples/ServiceUsageSampleOutputLetterWithAttachment.html
Places to review invoking orchestrations in Java are:
http://help.adobe.com/en_US/livecycle/9.0/programLC/help/000495.html
Best of luck C.
This is how I did it:
Add a numeric field to the master pages
Mark it as calculated - read only
Select the current page number in the menu below. You can also add another field with the total number of pages.
I'm writing a small web application using Perl, HTML::Mason and Apache.
I've been using Mason's usual <%args> method for receiving 'normal' form parameters, and Apache2::Upload for receiving files.
However, I want to write a page that allows a user to upload multiple files, and I'd like to take advantage of HTML5's multiple attribute to input fields. This will look to the server as though there were multiple file inputs in the form with the same name.
The interface for Apache2::Upload doesn't seem to directly support this, allowing you instead to just get the data for a file with a particular parameter name. The documentation alludes to using APR::Request::Param::Table, but I can't find any documentation for doing that.
Please note that I'm not interested in answers that involve adding extra file input fields with different names. This is trivial to handle on the server, and my question doesn't involve front-end scripting at all.
Use the multiple attribute (in the form as you described) and then, after submission, call the Apache request object's upload method. That will give you a list of Apache2::Upload instances.
Good luck!
I'm considering Wordpress as my CMS platform for a client site I'm doing at the moment.
However, I need to create a couple of custom 'modules'. One of these modules is a form that people will be able to complete to get a quote. Once it is submitted, there will be a listing of all the submitted quotes in a special place in the WordPress panel (like a menu or something), just fetched from a table in my database.
Another one is to manage a cafeteria menu, so the client can add a different meal to each day of the week.
I know perfectly well how to do this kind of thing using some kind of MVC framework and doing it 'by hand', but I'm just wondering if this would be possible to do with WP and, if yes, what kind of tools I'll have to use.
Thanks
Quite simply, yes, WordPress is more than capable of meeting your criteria. The question is whether the learning curve of getting to know WP outweighs using a framework you're clearly already familiar with.
Personally, it sounds like you're pretty solid with PHP, and considering that, in my opinion, what you're planning on doing is relatively easy, I'd say WordPress is an excellent solution.
I'd recommend reading about WordPress 3.0's new custom post type API, and skimming the basics of hooks and filters in the Plugin API.
Submitted quotes would merely be a custom post type. You'd be better off writing the front-end code (like handling the form, UI etc.) yourself, either within a theme or plugin, then using wp_insert_post and letting WordPress handle all the database administration. In fact, WP will go one step further and set up the entire admin for viewing, editing and deleting quotes.
Post meta (also known as custom fields) is also there for you if you need to store additional information about a quote that doesn't quite fit the post's table structure.
For the menu, this is even easier. I'd say just create a post category called 'Menu', and the client can publish 'dishes' to it as you would with a blog or any similar rolling format.
I've only scratched the surface here. Get stuck in with the above, then check out some other goodies like meta boxes and custom taxonomies!
If you want to try a plugin instead of writing something yourself, Flutter might work. It's a little unpolished sometimes but it makes this sort of thing an absolute breeze.