Meaningful data generation for Java POJO attributes - REST

I have a large number of APIs, received as Swagger JSON files. I want to stub these hundreds of APIs, which requires auto-filling the returned POJOs with meaningful data (not random data). The aim is to mock these APIs with minimal effort. The closest thing I have found is the Podam Java library, but it generates random data (as shown here - https://mtedone.github.io/podam/). It can also generate attribute-level meaningful data (see the section "Defining an attribute-level strategy" at the previous link), but that amounts to the same effort as populating the POJOs individually. Can anybody provide a suggestion as to how I should solve this with minimal effort?
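For illustration, this is the kind of thing I am after (a purely hypothetical sketch, not Podam's API): one shared, name-based filler so the "meaningful value" effort lives in a single dictionary rather than in every POJO. All class names, dictionary entries and the sample Customer POJO below are invented.

import java.lang.reflect.Field;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: fill any POJO by matching field names against one shared dictionary.
public class MeaningfulFiller {

    // One place to maintain "meaningful" defaults, keyed by lower-cased field-name fragments.
    private static final Map<String, Object> DICTIONARY = new LinkedHashMap<>();
    static {
        DICTIONARY.put("name", "Jane Doe");
        DICTIONARY.put("email", "jane.doe@example.com");
        DICTIONARY.put("city", "London");
        DICTIONARY.put("count", 3);
    }

    public static <T> T fill(T pojo) throws IllegalAccessException {
        for (Field field : pojo.getClass().getDeclaredFields()) {
            field.setAccessible(true);
            String name = field.getName().toLowerCase();
            for (Map.Entry<String, Object> entry : DICTIONARY.entrySet()) {
                if (name.contains(entry.getKey()) && compatible(field.getType(), entry.getValue())) {
                    field.set(pojo, entry.getValue());
                    break;
                }
            }
        }
        return pojo;
    }

    // Accept assignable reference types plus the common primitive/wrapper pairs.
    private static boolean compatible(Class<?> fieldType, Object value) {
        if (fieldType.isInstance(value)) return true;
        return (fieldType == int.class && value instanceof Integer)
                || (fieldType == long.class && value instanceof Long)
                || (fieldType == double.class && value instanceof Double)
                || (fieldType == boolean.class && value instanceof Boolean);
    }

    // Hypothetical POJO to illustrate usage.
    static class Customer {
        String fullName;
        String emailAddress;
        int itemCount;
        @Override public String toString() {
            return fullName + " / " + emailAddress + " / " + itemCount;
        }
    }

    public static void main(String[] args) throws IllegalAccessException {
        System.out.println(fill(new Customer())); // Jane Doe / jane.doe@example.com / 3
    }
}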

Related

Validate if graph is following ontology file

Imagine I have two RDF (turtle) files, one that contains my custom ontology (a.ttl) and one that contains values according to the ontology (b.ttl).
Is it possible to check if b.ttl is respecting all the definitions defined in a.ttl using .NET RDF?
I can load a.ttl using the OntologyGraph class, can I use this in some way to validate that the graph loaded from b.ttl is following the specification?
It depends on how your definitions are expressed.
If they are expressed in SHACL, then yes - dotNetRDF supports SHACL validation (which is sadly not yet written up in the docs, but take a look at this sample code).
If they are expressed in OWL, then no - dotNetRDF does not have an OWL inference engine so it cannot determine if your data is consistent with the ontology (in general OWL is actually for asserting new facts, and OWL "validation" is a process of determining if the facts asserted remain consistent with the ontology). You may need to look to one of the reasoners listed here to do this sort of processing.
A simple RDF-Schema based set of constraints (like just subclass, property domain, property range) could probably be converted to SHACL fairly easily, but that would be an additional step to add to your process.

EventStore basics - what's the difference between Event Meta Data/MetaData and Event Data?

I'm very much at the beginning of using / understanding EventStore or get-event-store as it may be known here.
I've consumed the documentation regarding clients, projections and subscriptions and feel ready to start using on some internal projects.
One thing I can't quite get past - is there a guide / set of recommendations describing the difference between event metadata and event data? I'm aware of the notional differences; event data is 'core' to the domain, metadata is for describing it, but the distinction is becoming quite philosophical.
I wonder if there are hard rules regarding implementation (querying etc).
Any guidance at all gratefully received!
Shamelessly copying (and paraphrasing) parts from Szymon Kulec's blog post "Enriching your events with important metadata" (emphases mine):
But what information can be useful to store in the metadata? Which info is worth storing despite the fact that it was not captured in the creation of the model?
1. Audit data
who? – simply store the user id of the action invoker
when? – the timestamp of the action and the event(s)
why? – the serialized intent/action of the actor
2. Event versioning
Event sourcing deals with the effects of actions. An action executed on a state results in an event according to the current implementation. Wait. The current implementation? Yes, the implementation of your aggregate can change, and it will, either because of bug fixing or because of introducing new features. Wouldn't it be nice if the version, like a commit id (SHA1 for gitters) or a semantic version, could be stored with the event as well? Imagine that you published a broken version and your business sold 100 tickets before fixing the bug. It'd be nice to be able to tell which events were created on the basis of the broken implementation. Having this knowledge, you can easily compensate transactions performed by the broken implementation.
3. Document implementation details
It's quite common to introduce canary releases, feature toggling and A/B tests for users. With automated deployment and small code enhancements, all of the mentioned approaches are feasible to have on a project board. If you consider toggles or different implementations coexisting at the very same moment, storing the version alone may not be enough. How about adding information about which features were applied for the action? Just create a simple set of enabled features, or a map of feature to status, and add it to the event as well. Having this and the command, it's easy to repeat the process. Additionally, it's easy to evaluate your A/B experiments: just run a scan for events with A enabled and another for the B ones.
4. Optimized combination of 2. and 3.
If you think that this is too much, create a lookup for sets of versions x features. It's not that big and is repeatable across many users, hence you can easily optimize by storing the set elsewhere, under a reference key. You can serialize this map and calculate its SHA1, put the values in a map (a table will do as well) and use the identifiers to put them in the event. There are plenty of options to shift the load either to the query side (lookups) or to the storage side (store everything as named metadata).
Summing up
If you create an event-sourced architecture, consider adding the temporal dimension (version) and a bit of configuration to the metadata. Once you have it, it's much easier to reason about the sources of your events and to introduce tooling like compensation. There's no such thing as too much data, is there?
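To make the data/metadata split concrete, here is a minimal, hypothetical Java sketch (not the EventStore client API and not akka-persistence; all names are invented) of an envelope that keeps the business event as data and carries the audit, version and feature information discussed above as metadata beside it:

import java.time.Instant;
import java.util.Set;

// Business event: only domain facts, nothing about who/when/which build.
record TicketSold(String ticketId, String customerId, long priceInCents) { }

// Metadata: the audit, versioning and feature-toggle information discussed above.
record EventMetadata(
        String userId,                 // who? - id of the action invoker
        Instant timestamp,             // when? - time of the action
        String intent,                 // why? - serialized command/intent name
        String implementationVersion,  // commit id (SHA1) or semantic version
        Set<String> enabledFeatures) { // feature toggles / A-B buckets in effect
}

// What actually gets appended to the stream: data and metadata side by side.
record EventEnvelope(Object data, EventMetadata metadata) { }

class Example {
    public static void main(String[] args) {
        EventEnvelope envelope = new EventEnvelope(
                new TicketSold("t-42", "c-7", 2500),
                new EventMetadata("user-123", Instant.now(), "SellTicket",
                        "9f3c2ab", Set.of("new-pricing", "experiment-A")));
        System.out.println(envelope);
    }
}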
I will share my experiences with you, which may help. I have been playing with akka-persistence, akka-persistence-eventstore and eventstore. akka-persistence stores its event wrapper, a PersistentRepr, in binary format. I wanted this data in JSON so that I could:
use projections
make these events easily available to any other technologies
You can implement your own serialization for akka-persistence-eventstore to do this, but that still ended up just storing the wrapper, with my event embedded in a payload attribute. The other attributes were all akka-persistence specific. The author of akka-persistence-eventstore gave me some good advice: get the serializer to store the payload as the Data, and the rest as MetaData. That way my event is now just the business data, and the metadata aids the technology that put it there in the first place. My projections now don't need to parse out the metadata to get at the payload.

Strategies for embedding change tracking in a structured data document format

When designing a specialized structured-data document format (perhaps built on XML): part of the requirements for this document format is that it accommodates, in a metadata section, a history of meaningful (app-level) changes to the structured data at a field level.
At minimum, useful tracked information would be:
an author identifier
time stamp
type of change
what it was changed from
Both data items and any lists of such data items are to be tracked meaningfully, efficiently. The data schema should be separable/unaware of the metadata tracking it - although facilitating annotations such as node identifiers could be required. A trusted application could be required to enforce the tracking; however, it would be a benefit to be able to calculate the "deltas" at intervals by comparing data-sections between versions rather than requiring the editor to track each change live.
"Meaningful" tracking may involve the metadata schema treating higher-level data changes atomically - such as an update to a group of fields which is treated at the application level as one data-point.
For character-by-character or byte-by-byte data, diff/patch type algorithms work. Structured data (to be treated as structured) seems to me to require more complex solutions.
I realize that I don't have very well-defined requirements - the purpose of my question here is to find out where these problems have been considered with more elegance.
What strategies exist for embedding change tracking in a structured data document format?
Thanks!
You might be interested in XML patch formats (e.g. as described by rfc 5261).
You could, for example, build a list of such patches embedded at the top of your structured XML file, annotating each patch with its author, a related feature request/bug number and so on, and potentially with semantic-level patch information (such as "added such object", "removed such rule"...). Using such a format could help you recover old versions of your document rather easily, as tools exist to process it.
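As a rough, hedged illustration (pure JDK DOM/XPath; the document, selector and values are invented, and the FieldChange record only loosely mirrors RFC 5261's replace-by-selector idea), applying one tracked field-level change could look like this:

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import java.io.StringReader;

public class FieldPatchDemo {

    // One tracked change: who, when, which node (XPath selector) and the old/new values.
    record FieldChange(String author, String timestamp, String selector,
                       String oldValue, String newValue) { }

    public static void main(String[] args) throws Exception {
        String xml = "<record><owner>alice</owner><status>draft</status></record>";

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        // A change entry as it might appear in the document's metadata/history section.
        FieldChange change = new FieldChange(
                "author-17", "2014-01-01T12:00:00Z", "/record/status", "draft", "approved");

        XPath xpath = XPathFactory.newInstance().newXPath();
        Node target = (Node) xpath.evaluate(change.selector(), doc, XPathConstants.NODE);

        // Verify the recorded "from" value before applying, then replace the text content.
        if (target != null && change.oldValue().equals(target.getTextContent())) {
            target.setTextContent(change.newValue());
        }

        System.out.println(target == null ? "selector not found" : target.getTextContent());
    }
}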

machine learning and code generator from strings

The problem: Given a set of hand-categorized strings (or a set of ordered vectors of strings), generate a categorization function to categorize further input. In my case, that data (or most of it) is not natural language.
The question: are there any tools out there that will do that? I'm thinking of some kind of reasonably polished, download-install-and-go kind of thing, as opposed to some library or a brittle academic program.
(Please don't get stuck on details as the real details would restrict answers to less generally useful responses AND are under NDA.)
As an example of what I'm looking at: the input I want to filter is computer-generated status strings pulled from logs. Error messages, for example, would be filtered based on who needs to be informed or what action needs to be taken.
Doing Things Manually
If the error messages are being generated automatically and the list of exceptions behind the messages is not terribly large, you might just want to have a table that directly maps each error message type to the people who need to be notified.
This should make it easy to keep track of exactly who/which-groups will be getting what types of messages and to update the routing of messages should you decide that some of the messages are being misdirected.
Typically, a small fraction of the types of errors make up a large fraction of error reports. For example, Microsoft noticed that 80% of crashes were caused by 20% of the bugs in their software. So, to get something useful, you wouldn't even need to start with a complete table covering every type of error message. Instead, you could start with just a list that maps the most common errors to the right person and routes everything else to a person for manual routing. Each time an error is routed manually, you could then add an entry to the routing table so that errors of that type are handled automatically in the future.
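A hedged sketch of such a routing table in plain Java (the addresses and error types are invented; the fallback list stands in for the person doing manual routing):

import java.util.List;
import java.util.Map;

public class ErrorRouter {

    // Direct map from error type to the people who must be notified.
    private static final Map<String, List<String>> ROUTES = Map.of(
            "DiskFullError", List.of("ops@example.com"),
            "PaymentTimeout", List.of("payments-team@example.com", "oncall@example.com"));

    // Anything unknown goes to a human for manual routing; the table grows from there.
    private static final List<String> MANUAL_QUEUE = List.of("triage@example.com");

    public static List<String> recipientsFor(String errorType) {
        return ROUTES.getOrDefault(errorType, MANUAL_QUEUE);
    }

    public static void main(String[] args) {
        System.out.println(recipientsFor("DiskFullError"));     // [ops@example.com]
        System.out.println(recipientsFor("SomethingNewError")); // [triage@example.com]
    }
}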
Document Classification
Unless the error messages are being editorialized by the people who submit them and you want to use this information when routing them, I wouldn't recommend treating this as a document classification task. However, if this is what you want to do, here's a list of reasonably good packages for document classification, organized by programming language:
Python - To do this using the Python based Natural Language Toolkit (NLTK), see the Document Classification section in the freely available NLTK book.
Ruby - If Ruby is more of your thing, you can use the Classifier gem. Here's sample code that detects whether Family Guy quotes are funny or not-funny.
C# - C# programmers can use nBayes. The project's home page has sample code for a simple spam/not-spam classifier.
Java - Java folks have Classifier4J, Weka, Lucene Mahout, and, as adi92 mentioned, Mallet.
Learning Rules with Weka - If rules are what you want, Weka might be of particular interest, since it includes a rule set based learner. You'll find a tutorial on using Weka for text categorization here.
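If you do go down the classification route, the core idea behind those packages fits in a few lines. Below is a hedged, self-contained toy naive Bayes classifier in plain Java (add-one smoothing, invented training strings); it is an illustration of the technique, not the API of any library listed above:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy multinomial naive Bayes over whitespace-separated tokens, with add-one smoothing.
public class TinyClassifier {

    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // label -> word -> count
    private final Map<String, Integer> docCounts = new HashMap<>();               // label -> number of examples
    private final Map<String, Integer> tokenTotals = new HashMap<>();             // label -> total tokens seen
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    public void train(String label, String text) {
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
        for (String token : text.toLowerCase().split("\\s+")) {
            wordCounts.computeIfAbsent(label, k -> new HashMap<>()).merge(token, 1, Integer::sum);
            tokenTotals.merge(label, 1, Integer::sum);
            vocabulary.add(token);
        }
    }

    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            // log prior + sum of log likelihoods with Laplace (add-one) smoothing
            double score = Math.log(docCounts.get(label) / (double) totalDocs);
            for (String token : text.toLowerCase().split("\\s+")) {
                int count = wordCounts.get(label).getOrDefault(token, 0);
                score += Math.log((count + 1.0) / (tokenTotals.get(label) + vocabulary.size()));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        TinyClassifier c = new TinyClassifier();
        c.train("notify-dba", "ORA-00600 internal error database instance crashed");
        c.train("notify-ops", "disk full on volume var-log rotation failed");
        System.out.println(c.classify("database instance error ORA-00600")); // notify-dba
    }
}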
Mallet has a bunch of classifiers which you can train and deploy entirely from the commandline
Weka is nice too because it has a huge number of classifiers and preprocessors for you to play with
Have you tried spam or email filters? By using text files that have been marked with appropriate categories, you should be able to categorize further text input. That's what those programs do, anyway, but instead of labeling your outputs as 'spam' and 'not spam', you could use other categories.
You could also try something involving AdaBoost for a more hands-on approach to rolling your own. This library from Google looks promising, but probably doesn't meet your ready-to-deploy requirements.

REST Media type explosion

In my attempt to redesign an existing application using the REST architectural style, I came across a problem which I would like to term "Mediatype Explosion". However, I am not sure if this is really a problem or an inherent benefit of REST. To explain what I mean, take the following example.
One tiny part of our application looks like:
collection-of-collections->collections-of-items->items
i.e. the top level is a collection of collections, and each of these collections is again a collection of items.
Also, each item has 8 attributes which can be read and written individually. Trying to expose the above hierarchy as RESTful resources leaves me with the following media types:
application/vnd.mycompany.collection-of-collections+xml
application/vnd.mycompany.collection-of-items+xml
application/vnd.mycompany.item+xml
Further more, since each item has 8 attributes which can be read and written to individually, it will result in another 8 media types. e.g. one such media type for "value" attribute of an item would be:
application/vnd.mycompany.item_value+xml
As I mentioned earlier, this is just a tiny part of our application, and I expect several different collections and items that need to be exposed in this way.
My questions are:
Am I doing something wrong by having these huge number of media types?
What is the alternative design method to avoid this explosion of media types?
I am also aware that the design above is highly granular, especially in exposing individual attributes of the item and having separate media types for each of them. However, making it coarser means I will end up transferring unnecessary data over the wire when in reality the client only needs to read or write a single attribute of an item. How would you approach such a design issue?
One approach that would reduce the number of media types required is to use a media type defined to hold lists of other media-types. This could be used for all of your collections. Generally lists tend to have a consistent set of behavior.
You could roll your own vnd.mycompany.resourcelist or you could reuse something like an Atom collection.
With regards to the specific resource representations like vnd.mycompany.item, what you can do depends a whole lot on the characteristics of your client. Is it in a browser? can you do code-download? Is your client a rich UI, or is it a data processing client?
If the client is going to do specific data processing then you pretty much need to stick with the precise media types, and you may end up with a large number of them. But look on the bright side: you will have fewer media types than you would have namespaces if you were using SOAP!
Remember, the media-type is your contract, if your application needs to define lots of contracts with the client, then so be it.
However, I would not go as far as defining contracts to exchange single attribute values. If you feel the need to do that, then you are doing something else wrong in your design. Distributed interface design needs to have chunky conversations, not chatty ones.
I think I finally got the clarification I sought for the above question from Ian Robinson's presentation and thought I should share it here.
Recently, I came across the statement "media type for helping tune the hypermedia engine, schema for structure" in a blog entry by Jim Webber. I then found this presentation by Ian Robinson of Thoughtworks. It is one of the best I have come across for providing a clear understanding of the roles and responsibilities of media types and schema languages (the entire presentation is a treat and I highly recommend it). In particular, look out for the slides titled "You've Chosen application/xml, you bstrd." and "Custom media types". Ian clearly explains the different roles of schemas and media types. In short, this is my takeaway from Ian's presentation:
A media type description includes the processing model that identifies hypermedia controls and defines what methods are applicable for resources of that type. Identifying hypermedia controls answers the question "How do we identify links?": in XHTML, links are identified by markup such as the <a> element, while RDF has different semantics for the same purpose. The next thing media types help identify is which methods are applicable for resources of a given media type. A good example is the Atom (application/atom+xml) specification, which gives a very rich description of hypermedia controls: it tells us how the link element is defined and what we can expect to be able to do when we dereference a URI, so it actually says something about the methods we can expect to apply to the resource. The structural information of a resource representation is NOT part of the media type description; it is provided by an appropriate schema for the actual representation, i.e. the media type specification won't necessarily dictate anything about the structure of the representation.
So what does this mean for us? Simply that we don't need a separate media type for describing each resource as in my original question above. We just need one media type for the entire application. This could be a totally new custom media type, a custom media type that reuses existing standard media types or, better still, simply a standard media type that can be reused without change in our application.
Hope this helps.
In my opinion, this is the weak link of the REST concept. As an architectural and interface style, REST is outstanding and the work done by Roy F. and others has advanced the state of the art considerably. But there is an upper limit to what can be communicated (not just represented) by standard media types.
For people to understand and use your REST-ish API, they need to understand the meaning of the data. There are APIs where the media types tell most of the story; e.g. if you have a text-to-speech API, the input media type is text/plain and the output media type is audio/mp4, then someone familiar with the subject matter could probably make do. Text in, audio out: probably enough to go on in this case.
But many APIs can't communicate much of their meaning with just media type. Let's say you have an API that handles airline ticketing. The inputs and outputs will mostly be data. The media types on input and output of every API could be application/json or application/xml, so the media type doesn't transmit a lot of information. So then you would look at the individual fields in the inputs & outputs. Maybe there's a field called "price". Is that in dollars or pennies? USD or some other currency? I don't know how a user would answer those questions without either (a) very descriptive names, like "price_pennies_in_usd", or (b) documentation. Not to mention format conventions. Is an account number provided with or without dashes, must letters be all-caps and so on. There is no standard media type that defines these issues.
It's one thing when we're in situations where the client doesn't need a semantic understanding of the data. That works well. The fact that browsers can visually render any compliant document, and interact with any compliant resource, is really great. That's basically the "media" use case.
But it's entirely different when the client (or actually, the developer/user behind the client) needs to understand the semantics of the data. DATA IS NOT MEDIA. There is no way to explain data in all its real-world meaning and subtlety other than documenting it. This is the "data" use case.
The overly-academic definition of REST works in the media use case. It doesn't work, and needs to be supplemented with non-pure but useful things like documentation, for other use cases.
You're using the media type to convey details of your data that should be stored in the representation itself. So you could have just one media type, say "application/xml", and then your XML representations would look like:
<collection-of-collections>
  <collection-of-items>
    <item>
    </item>
    <item>
    </item>
  </collection-of-items>
  <collection-of-items>
    <item>
    </item>
    <item>
    </item>
  </collection-of-items>
</collection-of-collections>
If you're concerned about sending too much data, substitute JSON for XML. Another way to save on bytes written and read is to use gzip encoding, which cuts things down about 60-70%. Unless you have ultra-high performance needs, one of these approaches ought to work well for you. (For better performance, you could use very terse hand-crafted strings, or even drop down to a custom binary TCP/IP protocol.)
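For the gzip point, a quick hedged sketch with the JDK's own GZIPOutputStream (the sample payload is invented, and the exact ratio depends on your data):

import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class GzipSizeDemo {
    public static void main(String[] args) throws Exception {
        // A repetitive XML collection, the kind of payload that compresses well.
        String xml = "<collection-of-items>"
                + "<item><name>widget</name><price>10</price></item>".repeat(50)
                + "</collection-of-items>";

        byte[] raw = xml.getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(compressed)) {
            gzip.write(raw);
        }

        System.out.println("raw bytes:  " + raw.length);
        System.out.println("gzip bytes: " + compressed.toByteArray().length);
    }
}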
Edit: One of your concerns is that:
making [the representation] coarse means I will end up transferring unnecessary data over the wire when in reality the client only needs to read or write a single attribute of an item
In any web service there is quite a lot of overhead in sending messages (each HTTP request might cost several hundred bytes for the start line and request headers and ditto for each HTTP response as in this example). So in general you want to have less granular representations. So you would write your client to ask for these bigger representations and then cache them in some convenient in-memory data structure where your program could read data from them many times (but be sure to honor the HTTP expiration date your server sets). When writing data to the server, you would normally combine a set of changes to your in-memory data structure, and then send the updates as a single HTTP PUT request to the server.
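A hedged sketch of that read-the-whole-representation-then-send-one-PUT pattern using the JDK HTTP client (the URI, media type and payload handling are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CoarseGrainedUpdate {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        URI itemUri = URI.create("https://api.example.com/collections/1/items/42");

        // 1. Fetch the whole item representation once and keep it in memory.
        HttpRequest get = HttpRequest.newBuilder(itemUri)
                .header("Accept", "application/vnd.mycompany.item+xml")
                .GET()
                .build();
        String representation = client.send(get, HttpResponse.BodyHandlers.ofString()).body();

        // 2. ... apply several attribute changes to the in-memory copy here ...
        String updated = representation; // placeholder for the locally modified representation

        // 3. Send all accumulated changes back as one PUT instead of one request per attribute.
        HttpRequest put = HttpRequest.newBuilder(itemUri)
                .header("Content-Type", "application/vnd.mycompany.item+xml")
                .PUT(HttpRequest.BodyPublishers.ofString(updated))
                .build();
        System.out.println(client.send(put, HttpResponse.BodyHandlers.ofString()).statusCode());
    }
}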
You should grab a copy of Richardson and Ruby's RESTful Web Services, which is a truly excellent book on how to design REST web services and explains things much more clearly than I could. If you're working in Java I highly recommend the RESTlet framework, which very faithfully models the REST concepts. Roy Fielding's USC dissertation defining the REST principles may also be helpful.
A media type should seldom be created, and time should be invested in making sure the format can survive change.
As you're relying on xml, there is no particular reason why you couldn't create one media type, provided that media type is described in one source.
Choosing ATOM over having one host media type that supports multiple root elements doesn't necessarily bring you anything: you'll still need to start reading the message within the context of a specific operation before deciding if enough information is present to process the request.
So I would suggest that you could happily have one media type, represented by one root element, and use a schema language to specify which elements it can contain.
In other words, a language like XSD can let you type your media type to support one of multiple root elements. There is nothing inherently wrong with application/vnd.acme.humanresources+xml describing an XML document that can take one of two defined elements as its root.
So to answer your question: create as few media types as you can possibly afford, by questioning whether what you put in the documentation of the media type will be understandable and implementable by a developer.
Unless you intend to register these media types, you should pick one of the existing MIME types instead of trying to make up your own formats. As Jim mentions, application/xml, text/xml or application/json works for most of what gets transmitted in a REST design.
In reply to Darrel, here is Roy's full post. Aren't you trying to define typed resources by creating your own MIME types?
Suresh, why isn't HTTP+POX Restful?