convert natural language to OCL script - eclipse

i have some statements in natural language that i have to convert to OCL format, i am beginner and have written some by myself but failed to convert some advanced level statements , i have searched a lot like this, this and this but failed to found something exact or similar.
statements that i want to convert:
1.Disable web server directory listing and ensure file meta data (e.g GIT) and backup files are not present within web roots
2.Log access failures , alert admins on repeated failures
3.Limit API and control access to minimize the harm from automated attack
Block SQL injection through query parameterization.
please help me
thanks

Natural language is imprecise and so cannot always be converted to a precise form such as OCL without additional human insight. The converse is possible.
The statements you want to convert appear to be from disciplined fields. You therefore just need to transliterate each idiom.

Related

Deploy Knowledge Studio dictionary pre-annotator to Natural Language Understanding

I'm getting started with Knowledge Studio and Natural Language Understanding.
I'm able to deploy a machine-learning model toNatural Language Understanding and use the API to query it.
I would know if there's a way to deploy only the pre-annotator.
I read from Knowledge Studio's documentation that
You can deploy or export a machine-learning annotator. A dictionary pre-annotator can only be used to pre-annotate documents within Watson Knowledge Studio.
Does exist a workaround to create a model that simply does the job of the pre-annotator, i.e. use dictionaries to find entities instead of the machine-learning model?
Does exist a workaround to create a model that simply does the job of the pre-annotator, i.e. use dictionaries to find entities instead of the machine-learning model?
You may need to explain this better in what you need.
WKS allows you to pre-annotate documents with dictionaries you upload. Once you have created a ML model, you can alternatively use that to annotate your training documents, and then manually correct. As you continue the amount of manual work will reduce after each model iteration.
The assumption is that you are creating a model with a reasonable amount of examples. In your model results, you will want the mention/relations to be outside or close to outside the gray area of the report.
The other interpretation of your request I took was you want to create a dictionary based model only. This is possible using the "Rule-Based Model" functionality. You would have to create the parsing rules but you just map what you want to find to the dictionary/rule.
Using this in production though is still limited. You should get a warning when you deploy these kinds of models.
It's slightly better than just a keyword search as you can map items to parts of speech.
The last point. The purpose of WKS is to create a machine learning model which will do the work in discovering new terms you haven't seen before. With the rule based engine it can only find what you explicitly tell it to find.
If all you want is just dictionary entries, then you can create a very simple string comparison solution, but you lose the linguistic features.

Creating restricted temporary Subroutines in ABAP

I would like to offer the possibility to create advanced selects, more profound than the left joins that can be created via quickviewer.
The easiest way would be to allow the user to insert some source code via "GENERATE SUBROUTINE POOL".
But it would be neccesary for me to ensure that this source code doesn't change any data, start other programms or does anything than evaluating data.
My idea would be to restict the inserted source code to some key-worlds like SELECT, LOOP, IF etc. In this case I would need to find all key words in the inserted source code and check it against some white list.
How could I do this? Are there any ways to circumvent my restrictions in order to do some real damage? Are there other ways to reach my goal?
Securing stuff like this is really hard to do, and it's not a personal limitation or an issue with the platform. You can't get the machine to understand what the programmer is doing - see Halting Problem. Doing a keyword scan for yourself is tedious work. You might want to try SCAN ABAP-SOURCE and take a look at the tables the scanner throws out. However, limiting to the most basic language elements will only get you so far, because for any substantial programming, you'll have to allow for external subroutine calls, and then you're vulnerable to about anything. In my opinion, it's better to hand the specifications to a developer and have the report developed with no dynamic stuff whatsoever. This way, you can also ensure that no incorrect reports are created just because the user hacking away at the SELECT statements didn't know about that cancellation flag...

How to tag a scientific data processing tool to ensure repeatability

we develop a data processing tool to extract some scientific results out of a given set of raw data. In data science it is very important that you can re-obtain your results and repeat the calculations, that led to a result set
Since the tool is evolving, we need a way to find out which revision/build of our tool generated a given result set and how to find the corresponding source from which the tool was build.
The tool is written in C++ and Python; gluing together the C++ parts using Boost::Python. We use CMake as a build system generating Make files for Linux. Currently the project is stored in a subversion repo, but some of us already use git resp. hg and we are planning to migrate the whole project to one of them in the very near future.
What are the best practices in a scenario like this to get a unique mapping between source code, binary and result set?
Ideas we are already discussing:
Somehow injecting the global revision number
Using a build number generator
Storing the whole sourcecode inside the executable itself
This is a problem I spend a fair amount of time working on. To what #VonC has already written let me add a few thoughts.
I think that the topic of software configuration management is well understood and often carefully practiced in commercial environments. However, this general approach is often lacking in scientific data processing environments many of which either remain in, or have grown out of, academia. However, if you are in such a working environment, there are readily available sources of information and advice and lots of tools to help. I won't expand on this further.
I don't think that your suggestion of including the whole source code in an executable is, even if feasible, necessary. Indeed, if you get SCM right then one of the essential tests that you have done so, and continue to do so, is your ability to rebuild 'old' executables on demand. You should also be able to determine which revision of sources were used in each executable and version. These ought to make including the source code in an executable unnecessary.
The topic of tying result sets in to computations is also, as you say, essential. Here are some of the components of the solution that we are building:
We are moving away from the traditional unstructured text file that is characteristic of the output of a lot of scientific programs towards structured files, in our case we're looking at HDF5 and XML, in which both the data of interest and the meta-data is stored. The meta-data includes the identification of the program (and version) which was used to produce the results, the identification of the input data sets, job parameters and a bunch of other stuff.
We looked at using a DBMS to store our results; we'd like to go this way but we don't have the resources to do it this year, probably not next either. But businesses use DBMSs for a variety of reasons, and one of the reasons is their ability to roll-back, to provide an audit trail, that sort of thing.
We're also looking closely at which result sets need to be stored. A nice approach would be only ever to store original data sets captured from our field sensors. Unfortunately some of our computations take 1000s of CPU-hours to produce so it is infeasible to reproduce them ab-initio on demand. However, we will be storing far fewer intermediate data sets in future than we have in the past.
We are also making it much harder (I'd like to think impossible but am not sure we are there yet) for users to edit result sets directly. Once someone does that all the provenance information in the world is wrong and useless.
Finally, if you want to read more about the topic, try Googling for 'scientific workflow' and 'data provenance' similar topics.
EDIT: It's not clear from what I wrote above, but we have modified our programs so that they contain their own identification (we use Subversion's keyword capabilities for this with an extension or two of our own) and write this into any output that they produce.
You need to consider git submodules of hg subrepos.
The best practice in this scenario os to have a parent repo which will reference:
the sources of the tool
the result set generated from that tool
ideally the c++ compiler (won't evolve every day)
ideally the python distribution (won't evolve every day)
Each of those are a component, that is an independent repository (Git or Mercurial).
One precise revision of each component will be reference by a parent repository.
The all process is representative of a component-based approach, and is key in using an SCM (here Software Configuration Management) at its fullest.

How can I build a generic dataset-handling Perl library?

I want to build a generic Perl module for handling and analysing biomedical character separated datasets and which can, most certain, be used on any kind of datasets that contain a mixture of categorical (A,B,C,..) and continuous (1.2,3,881..) and identifier (XXX1,XXX2...). The plan is to have people initialize the module and then use some arguments to point to the data file(s), the place were the analysis reports should be placed and the structure of the data.
By structure of data I mean which variable is in which place and its name/type. And this is where I need some enlightenment. I am baffled how to do this in a clean way. Obviously, having people create a simple schema file, be it XML or some other format would be the cleanest but maybe not all people enjoy doing something like this.
The solutions I can think of are:
Create a configuration file in XML or similar and with a prespecified format.
Pass the information during initialization of the module.
Use the first row of the data as headers and try to guess types (ouch)
Surely there must be a "canonical" way of doing this that is also usable and efficient.
This doesn't answer your question directly, but have you checked CPAN? It might have the module you need already. If not, it might have similar modules -- related either to biomedical data or simply to delimited data handling -- that you can mine for good ideas, both concerning formats for metadata and your module's API.
Any of the approaches you've listed could make sense. It all depends on how complex the data structures and their definitions are. What will make something like this useful to people is whether it saves them time and effort. So, your decision will have to be answered based on what approach will best satisfy the need to make:
use of the module easy
reuse of data definitions easy
the data definition language sufficiently expressive to describe all known use cases
the data definition language sufficiently simple that an infrequent user can spend minimal time with the docs before getting real work done.
For example, if I just need to enter the names of the columns and their types (and there are only 4 well defined types), doing this each time in a script isn't too bad. Unless I have 350 columns to deal with in every file.
However, if large, complicated structure definitions are common, then a more modular reuse oriented approach is better.
If your data description language is difficult to work with, you can mitigate the issue a bit by providing a configuration tool that allows one to create and edit data schemes.
rx might be worth looking at, as well as the Data::Rx module on the CPAN. It provides schema checking for JSON, but there is nothing inherent in the model that makes it JSON-only.

machine learning and code generator from strings

The problem: Given a set of hand categorized strings (or a set of ordered vectors of strings) generate a categorize function to categorize more input. In my case, that data (or most of it) is not natural language.
The question: are there any tools out there that will do that? I'm thinking of some kind of reasonably polished, download, install and go kind of things, as opposed to to some library or a brittle academic program.
(Please don't get stuck on details as the real details would restrict answers to less generally useful responses AND are under NDA.)
As an example of what I'm looking at; the input I'm wanting to filter is computer generated status strings pulled from logs. Error messages (as an example) being filtered based on who needs to be informed or what action needs to be taken.
Doing Things Manually
If the error messages are being generated automatically and the list of exceptions behind the messages is not terribly large, you might just want to have a table that directly maps each error message type to the people who need to be notified.
This should make it easy to keep track of exactly who/which-groups will be getting what types of messages and to update the routing of messages should you decide that some of the messages are being misdirected.
Typically, a small fraction of the types of errors make up a large fraction of error reports. For example, Microsoft noticed that 80% of crashes were caused by 20% of the bugs in their software. So, to get something useful, you wouldn't even need to start with a complete table covering every type of error message. Instead, you could start with just a list that maps the most common errors to the right person and routes everything else to a person for manual routing. Each time an error is routed manually, you could then add an entry to the routing table so that errors of that type are handled automatically in the future.
Document Classification
Unless the error messages are being editorialized by people who submit them and you want to use this information when routing them, I wouldn't recommend treating this as a document classification task. However, if this is what you want to do, here's a list of reasonably good packages for document document classification organized by programming language:
Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J, Weka, Lucene Mahout, and as adi92 mentioned Mallet.
Learning Rules with Weka - If rules are what you want, Weka might be of particular interest, since it includes a rule set based learner. You'll find a tutorial on using Weka for text categorization here.
Mallet has a bunch of classifiers which you can train and deploy entirely from the commandline
Weka is nice too because it has a huge number of classifiers and preprocessors for you to play with
Have you tried spam or email filters? By using text files that have been marked with appropriate categories, you should be able to categorize further text input. That's what those programs do, anyway, but instead of labeling your outputs a 'spam' and 'not spam', you could do other categories.
You could also try something involving AdaBoost for a more hands-on approach to rolling your own. This library from Google looks promising, but probably doesn't meet your ready-to-deploy requirements.