Good open source or good community edition ETL tool - mongodb

I previously worked with NoSQL databases such as MongoDB. Now I want to move into ETL, so I am looking for an ETL tool that integrates with MongoDB and Hadoop. It should be open source or have a good community edition: I am only starting to learn ETL, so buying an Enterprise Edition is not an option right now. Does anyone know a good ETL tool that meets these requirements?

One of the best is definitely Talend Open Studio for Big Data. But it requires a lot of hacking to work effectively (you may need to implement your own components or inject custom Java code to get the desired result).

I think Kettle is one of the most popular (at least from what I have noticed).
It has a lot of features and is "fairly" user friendly.

Check out StreamSets Data Collector. It has a modern web-based interface and a growing community. It is Apache 2.0 licensed and supports much of the Hadoop ecosystem and MongoDB out of the box.
Full disclosure: I'm a committer on this project.

Related

What are some of the big differences in Java Client versus Go Client when implementing Uber Cadence workflow?

I am designing a workflow with the intention of using the Cadence workflow engine and the Java client. It seems that Uber is actively using Go, and that the Go client therefore has better documentation and better Activity and other classes than the Java client. Is this true?
No, it is not really true. The majority of the open source users of Cadence and Temporal are using Java SDK.
If you go to the java-client channel in the Cadence Slack, you will see that the community has more discussion there than in go-client. Even within Uber, the Java client is heavily used by core services like payments.
The Go client happens to have more docs/samples because it started a little earlier. In fact, the docs that are missing for Java can be derived from the Go ones. It should also be noted that the Java library itself contains more documentation. For example, instead of putting the documentation on how to write unit tests on cadenceworkflow.io, we put it directly in the Javadocs, because that is where Java developers conventionally look up documentation.
IMO they are equally important for Cadence. All new features are implemented and rolled out at the same time, so there is no real difference between them.

Apache Stanbol scalability and real-world applications

I'm starting a project with requirements such as NLP, storage of semantic data, content management etc. Apache Stanbol seems like a nice fit, but I'm not sure it's ready, so I'm trying to make an appropriate assessment before starting to work with it. There are a few things that worry me:
1) Stanbol seems a bit young and immature (newest version 0.12). Has anybody used it in a commercial project/application/setup (I failed to find this information online)? What is the scale of those projects?
2) How horizontally scalable is Stanbol? What are its cloud/clustering capabilities? As far as I know it relies on Apache Jena for storage, and Jena storage isn't horizontally scalable which would make Stanbol unable to scale horizontally as well. I might be wrong about this, but this is my current understanding, please correct me if I'm wrong. Maybe Jena can be swapped with something else to be used as RDF storage provider and I'm not aware of it?
3) Learning resources for Stanbol seem a little scarce. Does anyone know of a place/book/whatever where I can get more understanding about Stanbol under the hood (other than the official Stanbol website and the IKS website)?
4) Are there any good alternatives? I know there are nice alternatives regarding NLP (e.g. GATE, UIMA), but they lack CMS capabilities.
Thanks.
To your question:
1) I've been working on a project involving Stanbol (version 0.10). It's still in the pre-production stage. For CMS, we evaluated Jackrabbit and Alfresco. Alfresco (CMIS) was found to be a better choice in our case. What I like about Stanbol is the enhancement chains and the set of Enhancement Engines that come by default. This is a small to mid-size project.
3) I found this book (Instant Apache Stanbol, Packt Publishing) very practical and useful in my work, especially the sections on Entity hubs and Enhancement engines.
A viable option is to use Redlink, which offers content analysis and linked data services in the cloud using Apache Stanbol and Apache Marmotta in the back-end.
The Redlink team has worked on IKS and Apache Stanbol, so getting in contact with them can be a good starting point when deciding whether to use these technologies in production environments.

Stored procedure in Neo4j

I wanted to know if there is any Neo4j equivalent of a stored procedure?
When I researched this, I came across events, but I found them more like triggers and not stored procedures.
There are basically two techniques to extend a Neo4j server:
Server plugins enrich the existing REST endpoints, and
unmanaged extensions allow you to create new REST endpoints.
Both techniques require you to write code in Java (or another JVM language), package a jar file and deploy it to the Neo4j server.
Since version 3.0, stored procedures are available and can be invoked from the Cypher language with CALL.
A first reference can be found here:
https://dzone.com/articles/neo4j-30-stored-procedures
A remarkable example, showing how a large graph can be processed through a procedure to achieve network clustering and community detection, is here:
http://www.markhneedham.com/blog/2016/02/28/neo4j-a-procedure-for-the-slm-clustering-algorithm/
EDIT
Since Neo4j 3.0 was released in April 2016, the stored procedure collection (APOC) has become an official, Apache 2.0 licensed repository:
https://neo4j.com/labs/apoc/
Available procedures range from data manipulation/import to spatial and complex graph algorithms (e.g. PageRank, Dijkstra, community detection, betweenness centrality, closeness centrality, etc.).
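To make the CALL syntax mentioned above a bit more concrete, here is a minimal Python sketch that only assembles Cypher statements as text; it does not connect to a server. `db.labels` is a built-in procedure and `apoc.help` ships with APOC, but the helper name `build_call` is made up for this illustration, and the arguments are assumed to be Cypher expressions already rendered as text.

```python
def build_call(procedure, args=(), yields=()):
    """Assemble a Cypher CALL statement as text (no server connection).

    args are Cypher expressions given as text, e.g. a quoted literal.
    """
    statement = "CALL {}({})".format(procedure, ", ".join(args))
    if yields:
        # YIELD picks result columns of the procedure; RETURN hands them back.
        statement += " YIELD " + ", ".join(yields)
        statement += " RETURN " + ", ".join(yields)
    return statement


labels_query = build_call("db.labels", yields=["label"])
help_query = build_call("apoc.help", args=["'dijkstra'"])
print(labels_query)
print(help_query)
```

The resulting strings can be pasted into the Neo4j browser or sent through whichever driver you already use.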
My answer here does not answer the question directly (Stefan's answer does just fine for that). With that said, if any of you are considering writing server plugins (to get Stored Proc behavior) before your project is actually being used in production (which at the time of this writing is the vast majority of the Neo4j userbase), I strongly recommend not doing so.
Server plugins add architectural complexity to your project. You will require JVM developers to maintain them. Deploying or updating them can be tricky, and the associated source control methodologies are not intuitive. Neo4j doesn't require schema migrations, which makes your job as a developer easier. With server plugins you lose that benefit, and since they are not a mainstream use case of Neo4j, you'll get little help from the developer community, and improvements and bug fixes related to that functionality will be given lower priority by the Neo4j team.
And all that would be for possibly a slight performance boost, or none at all.
"Stored Procedures" (or using server plugins as such) are an important feature to have in the context of performance tuning, but if your team is still two guys in a garage, don't even think about going down this path.

Does PostgreSQL support PMML

Using a search engine, I couldn't find any reference to PostgreSQL supporting PMML. I was wondering if anyone had any luck with this. I would like to deploy a Random Forest model that is built in R in PostgreSQL (I'm aware of other workarounds, but I want to get an answer to this question before I go down the other route).
From my own reading, PostgreSQL doesn't directly support PMML; however, if you use JPMML, it integrates seamlessly with PostgreSQL. The library is open source and extensive.
https://github.com/jpmml/jpmml-postgresql
There is no built-in support. However with the XML support, the extensible stored procedure language handlers, and such it shouldn't be too hard to implement as an add-on (or perhaps an extension).
I don't foresee PMML support coming built into PostgreSQL in the near to moderate future so you would do best to either implement it yourself or go another route.
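To give a feel for what "implement it yourself" involves: PMML is plain XML, so the core of scoring a decision tree is walking nested Node elements and testing their predicates. The following is a heavily simplified sketch in Python using only the standard library. The XML fragment is a hand-written stand-in (no PMML namespace, no DataDictionary, numeric fields only), the function names are hypothetical, and a Random Forest would additionally need an ensemble of such trees plus vote aggregation, which is not shown.

```python
import operator
import xml.etree.ElementTree as ET

# Hand-written, simplified tree fragment; real PMML files carry a namespace
# and many more elements.
PMML_FRAGMENT = """
<TreeModel functionName="classification">
  <Node score="no">
    <True/>
    <Node score="yes">
      <SimplePredicate field="age" operator="greaterThan" value="30"/>
    </Node>
    <Node score="no">
      <SimplePredicate field="age" operator="lessOrEqual" value="30"/>
    </Node>
  </Node>
</TreeModel>
"""

OPERATORS = {
    "equal": operator.eq,
    "notEqual": operator.ne,
    "lessThan": operator.lt,
    "lessOrEqual": operator.le,
    "greaterThan": operator.gt,
    "greaterOrEqual": operator.ge,
}


def predicate_holds(node, record):
    """Evaluate the <True/> or <SimplePredicate> child of a Node (numeric only)."""
    if node.find("True") is not None:
        return True
    predicate = node.find("SimplePredicate")
    if predicate is None:
        return False  # other predicate types are not handled in this sketch
    compare = OPERATORS[predicate.get("operator")]
    return compare(float(record[predicate.get("field")]), float(predicate.get("value")))


def score(node, record):
    """Descend into the first child whose predicate holds; return the leaf score."""
    children = node.findall("Node")
    if not children:
        return node.get("score")
    for child in children:
        if predicate_holds(child, record):
            return score(child, record)
    return None  # no child matched


# The root node's own predicate is assumed to be true here.
root = ET.fromstring(PMML_FRAGMENT).find("Node")
print(score(root, {"age": 42}))
```

Inside PostgreSQL, logic of this kind would have to live in one of the procedural language handlers or in an extension, as the answer above suggests.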

Process Engines for BPMN 2.0

I'm doing a comparison among all existing BPMN 2.0 Process Engines e.g. Activiti, jBPM etc.
I've prepared a list of 4 process engines which execute BPMN 2.0, given below:
Popular BPMN 2.0 compliant open-source engines:
Activiti: http://www.activiti.org/
jBPM: http://www.jboss.org/jbpm
Bonita: http://www.bonitasoft.com/
A commercial engine:
ActiveVOS: http://www.activevos.com/products
I would appreciate it if you could extend my research by adding any other existing process engines (for BPMN 2.0) to the above list, along with a quick comparison among all of them.
I would prefer a very short comparison listing only important features (distinguishing features like what is possible for one and not for others, licensing, dependencies on other products like Tomcat and JBoss, supported operating systems etc.).
P.S.: I've found a lot on Activiti vs jBPM, but your answers would still be appreciated.
I cannot offer you a full-fledged comparison but I can give you some pointers that might help you in your evaluation:
An "Activiti in Action" book has just been published (July 2012), and it contains a section reviewing other BPMN process engines (Section 1.2.3 - Knowing the competitors).
For Activiti, there has recently also been a commercially supported version called camunda fox BPM Platform. They also provide a comparison showing the added value they offer here.
I am disappointed with Activiti. It should be called Spring BPM because it doesn't work well without Spring. If you don't mind using Spring, then Activiti might be a better fit. If you are using JEE/CDI, then jBPM is a better fit.
I did similar research, too. Here are the key points which were relevant for our concrete use case:
Bonita:
Bonita has a zero-coding approach, which means that they provide an easy-to-use IDE to build your processes without the need for coding. To achieve that, Bonita has the concept of connectors. For example, if you want to consume a web service, they provide you with a graphical wizard. The downside is that you have to write the plain XML SOAP envelope manually and copy it into a graphical text box. The problem with this approach is that you can only realize use cases which are intended by Bonita. If you want to integrate a system for which Bonita has not developed a connector, you have to code such a connector on your own, which is very painful. For example, Bonita offers a SOAP connector for consuming SOAP web services. This connector only works with SOAP 1.2, but not with SOAP 1.1 (http://community.bonitasoft.com/answers/consume-soap-11-webservices-bonita-secure-web-service-connector). If you have a legacy application with SOAP 1.1, you cannot easily integrate this system into your process. The same is true for databases. There are only a few database connectors for dedicated database versions. If you have a version that does not match a connector, you have to code this on your own.
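The SOAP 1.1 / 1.2 gap is visible at the wire level: the two versions use different envelope namespaces and different HTTP content types, so a connector written for one does not simply talk to the other. A minimal sketch using only the Python standard library; the function name and the `GetStatus` payload element are hypothetical, and this says nothing about how Bonita's connector is implemented internally.

```python
import xml.etree.ElementTree as ET

# Envelope namespace and HTTP Content-Type differ between the two versions.
SOAP_VERSIONS = {
    "1.1": ("http://schemas.xmlsoap.org/soap/envelope/",
            "text/xml; charset=utf-8"),
    "1.2": ("http://www.w3.org/2003/05/soap-envelope",
            "application/soap+xml; charset=utf-8"),
}


def build_envelope(version, payload):
    """Wrap a payload element in a SOAP envelope of the given version."""
    namespace, content_type = SOAP_VERSIONS[version]
    envelope = ET.Element("{%s}Envelope" % namespace)
    body = ET.SubElement(envelope, "{%s}Body" % namespace)
    body.append(payload)
    return ET.tostring(envelope, encoding="unicode"), content_type


payload = ET.Element("GetStatus")  # hypothetical operation element
xml_11, type_11 = build_envelope("1.1", payload)
xml_12, type_12 = build_envelope("1.2", payload)
print(type_11)
print(type_12)
```

SOAP 1.1 additionally expects a separate `SOAPAction` HTTP header, while SOAP 1.2 carries the action as a parameter of the content type, which is one more reason a 1.2-only connector fails against a 1.1 endpoint.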
In addition, Bonita has no support for LDAP or Active Directory Sync in the free community edition which is quite a showstopper for a production environment. Another thing to consider is that Bonita is licensed under the GPL / LGPL license which could cause problems when you want to integrate Bonita in another enterprise application. In addition, the community support is very weak. There are several posts which are more than 2 years old and those posts are still not answered.
Another important thing is Business-IT-Alignment. Modelling processes is a collaborative discipline in which IT AND the business analysts are involved. That is why you need adequate tools for both user groups (e.g. an Eclipse plugin for the developers and an easy-to-use web modeler for the business people). Bonita only offers Bonita Studio, which needs to be installed on your machine. This IDE is quite technical and not suitable for business users. Therefore, it is very hard to realize Business-IT-Alignment with Bonita.
Bonita is a BPM tool for very trivial and easy processes. Because of the zero-coding approach, the learning curve is very gentle and you can start modelling very fast. You need fewer programming skills and you are able to realize your processes without the need for coding. But as soon as your processes become very complex, Bonita might not be the best solution because of the lack of flexibility. You can only realize use cases which are intended by Bonita.
jBPM:
jBPM is a very powerful open source BPM engine with a lot of features. The web modeler even supports prefabricated models of some van der Aalst workflow patterns (workflowpatterns.com). Business-IT-Alignment is realizable because jBPM offers an Eclipse integration as well as a web-based modeler. A bit tricky is that you can only define forms in the web modeler, but not in the Eclipse plugin, as far as I know. To sum up, jBPM is a good candidate for use in a company. Our showstopper was scalability. jBPM is based on the rules engine Drools. As a result, whole process instances are persisted as BLOBs in the database. This is a critical showstopper when you consider searching and scalability.
In addition, the learning curve is very steep because of the complexity. jBPM does not offer a Service Task like the BPMN standard suggests. Instead, you have to define your own Java service tasks and register them manually in the engine, which results in quite low-level programming.
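Why BLOB persistence hurts searching can be shown with a small, generic experiment. The sketch below uses Python's built-in sqlite3 module and compares two hypothetical layouts: one opaque BLOB per process instance versus one row per process variable. The table and function names are invented for this illustration and do not reflect jBPM's actual schema; JSON merely stands in for whatever serialization an engine uses.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Layout A: the whole instance state is one opaque BLOB.
conn.execute("CREATE TABLE instance_blob (id INTEGER PRIMARY KEY, state BLOB)")
# Layout B: each process variable is its own row.
conn.execute("CREATE TABLE instance_var (instance_id INTEGER, name TEXT, value TEXT)")

instances = {
    1: {"customer": "acme", "amount": "100"},
    2: {"customer": "globex", "amount": "250"},
}
for instance_id, variables in instances.items():
    conn.execute("INSERT INTO instance_blob VALUES (?, ?)",
                 (instance_id, json.dumps(variables).encode("utf-8")))
    conn.executemany("INSERT INTO instance_var VALUES (?, ?, ?)",
                     [(instance_id, k, v) for k, v in variables.items()])


def find_in_blobs(name, value):
    """Every row must be fetched and deserialized in the application."""
    hits = []
    query = "SELECT id, state FROM instance_blob ORDER BY id"
    for instance_id, state in conn.execute(query):
        if json.loads(state.decode("utf-8")).get(name) == value:
            hits.append(instance_id)
    return hits


def find_in_columns(name, value):
    """The database filters (and could index) on the variable directly."""
    rows = conn.execute(
        "SELECT instance_id FROM instance_var "
        "WHERE name = ? AND value = ? ORDER BY instance_id",
        (name, value))
    return [row[0] for row in rows]


print(find_in_blobs("customer", "globex"))
print(find_in_columns("customer", "globex"))
```

Both functions find the same instance, but the BLOB variant reads and decodes every stored instance to do so, which is the searching and scalability concern described above.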
Activiti:
In the end, we went with Activiti because it is a very easy-to-use framework-based engine. It offers an Eclipse plugin as well as a modern AngularJS web modeler. In this way, you can realize Business-IT-Alignment. The REST API is secured by Spring Security, which means that you can extend the engine very easily with Single Sign-on features. Because of the Apache License 2.0, there is no copyleft, which means you are completely free in terms of usage and extensibility, which is very important in a production environment.
In addition, the BPMN-coverage is very good. Not all BPMN-elements are realized, but I do not know any engine which does that.
The Activiti Explorer is a demo frontend which demonstrates the usage of the Activiti APIs. Since this frontend is based on Vaadin, it can be extended very easily. The community is very active, which means that you can get help very fast if you have any problems.
Activiti offers good integration points for external form technologies, which is very important for production use. The form technologies of all candidates are very restrictive. Therefore, it makes sense to use a standard form technology like XForms in combination with the engine. Even such more complex things are realizable via the formKey attribute.
Activiti does not follow the zero-coding approach which means that you will need a bit of coding if you want to orchestrate services. But even the communication with SOAP services can be achieved by using a Java Service Task and Apache CXF. The coding effort is low.
I hope that my key points can help you make a decision. To be clear, this is not an advertisement for Activiti. The right product choice depends on the concrete use cases. I only want to point out the most important points in our project.
Best regards Ben
Nommy, you should take a look at Roubroo, a process engine built to natively support BPMN 2.0. It does not have the legacy of an older process engine being retrofitted to support the new standard. It supports BPMN 2.0 execution semantics including the IOR gateway, which I think is key to the way business processes are defined in a networked graph. jBPM and Activiti are based on the underlying PVM, which has great support for some workflow patterns but not for others. Take a look at this research paper: http://eprints.qut.edu.au/14320/1/14320.pdf
and http://www.workflowpatterns.com/evaluations/opensource/
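For readers unfamiliar with the IOR (inclusive OR) gateway mentioned above: on the split side, every outgoing flow whose condition holds is taken, and a default flow is used only if none holds. A toy Python sketch of just that split rule follows; the function and flow names are hypothetical, it is not any engine's API, and the harder part of the gateway (the join, which has to wait for all tokens that can still arrive) is deliberately left out.

```python
def inclusive_split(flows, variables, default=None):
    """Return the targets of every outgoing flow whose condition holds."""
    taken = [target for target, condition in flows if condition(variables)]
    if taken:
        return taken
    if default is None:
        raise RuntimeError("no condition matched and no default flow defined")
    return [default]


flows = [
    ("notify_sales", lambda v: v["amount"] > 1000),
    ("notify_legal", lambda v: v["country"] != "DE"),
    ("archive", lambda v: v["archive"]),
]

big_foreign_order = {"amount": 5000, "country": "US", "archive": False}
small_local_order = {"amount": 10, "country": "DE", "archive": False}

print(inclusive_split(flows, big_foreign_order))
print(inclusive_split(flows, small_local_order, default="manual_review"))
```

An exclusive (XOR) gateway would stop at the first matching flow and a parallel (AND) gateway would take all flows unconditionally; the inclusive gateway sits between the two.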
In my opinion, Camunda BPM Platform is currently the leader in the open source field.
And you mentioned Open Source?
So try Camunda if you like:
- Clean BPMN-focused engine (shared, embeddable or "remote")
- Clean and working REST API
- Out-of-the-box platform with basic administration tools and development-ready APIs
- Biggest open-source community (my personal opinion)
- Best-of-breed approach in the Java ecosystem.
- If you like Java.
- If you want your processes to be accepted by your IT crowd.
http://www.camunda.com/fox/product/details/
jBPM5 is agnostic to the environment: it doesn't depend on JBoss, and you can run it in any application server, servlet container or Java SE environment. jBPM5 is licensed under the Apache Software License V2, which I believe is a really good idea.
You can of course find more information in the official page.
Cheers
Regarding jBPM:
jBPM is an open-source workflow engine written in Java that can execute business processes described in BPMN 2.0 (or its own process definition language jPDL in earlier versions). It is released under the ASL (or LGPL in earlier versions) by JBoss.
It includes:
Strong and powerful integration with business rules and event processing.
Process collaboration, monitoring and management through the Guvnor repository and the management consoles.
Human interaction using an independent WS-HT human task service.
In essence jBPM takes graphical process descriptions as input. A process is composed of tasks that are connected with sequence flows. Processes represent an execution flow. The graphical diagram (flow chart) of a process is used as the basis for the communication between non-technical users and developers.
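As a rough illustration of "tasks connected with sequence flows", a process can be modelled as a small directed graph and walked from its start node. This is a toy model in Python with hypothetical names, not jBPM's API, and it ignores gateways, events and human tasks.

```python
from collections import namedtuple

SequenceFlow = namedtuple("SequenceFlow", ["source", "target"])


class Process:
    def __init__(self, start, flows):
        self.start = start
        # Map each node to its single outgoing target (no gateways here).
        self.next_node = {flow.source: flow.target for flow in flows}

    def run(self, handlers):
        """Execute task handlers in flow order and return the visited nodes."""
        visited = []
        node = self.start
        while node is not None:
            visited.append(node)
            handler = handlers.get(node)
            if handler is not None:
                handler()
            node = self.next_node.get(node)
        return visited


log = []
process = Process("start", [
    SequenceFlow("start", "review_order"),
    SequenceFlow("review_order", "ship_order"),
    SequenceFlow("ship_order", "end"),
])
trace = process.run({
    "review_order": lambda: log.append("reviewed"),
    "ship_order": lambda: log.append("shipped"),
})
print(trace)
```

The same structure is what a BPMN diagram draws: the boxes are the nodes, the arrows are the sequence flows, and the engine's job is to move through them while invoking the work attached to each task.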
Also take a look at Imixs-Workflow, which is a human-centric workflow engine. Unlike the usual engines, Imixs-Workflow is characterised by strong support for user-centric processes.
Human-centric business process management means supporting human skills, activities and collaboration in a task-oriented manner. With such a Workflow engine you can protect and securely distribute business data within an event-driven BPM architecture based on the BPMN 2.0 standard.
The Imixs-Workflow engine is open source and can be integrated into Jakarta EE applications or deployed out of the box as a microservice running in a Docker container.
Take a look at Zeebe.io - a modern, cloud-native workflow engine with first-class Node.js support.