Standard process for developing DataStage jobs

I have experience using Pentaho Kettle and Talend Data Integration for ETL jobs, and typically the high-level process for developing transformations is:
define source connections
define target connections
define transformation of data between source and target
What is the 'standard' high-level process for developing datastage jobs? Is it similar to the process identified above?

Exactly the same. If you know the basic concepts of an ETL tool, they apply to all of the tools.
The three steps you listed are, however, very high level. Depending on what you're trying to do, that list can be increased quite dramatically.
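To make those three steps concrete outside of any particular tool, here is a minimal, generic sketch in plain Java/JDBC of the same source → transform → target pattern. The connection URLs, table names and the transformation are illustrative assumptions, not DataStage specifics; in DataStage itself these steps are normally designed graphically as stages and links rather than written as code.

    import java.sql.*;

    // Minimal, generic ETL sketch: define source, define target, transform rows in between.
    // URLs, credentials, table and column names below are illustrative placeholders.
    public class SimpleEtlJob {
        public static void main(String[] args) throws SQLException {
            // 1. Define source connection
            try (Connection source = DriverManager.getConnection(
                     "jdbc:postgresql://source-host/sales", "etl_user", "secret");
                 // 2. Define target connection
                 Connection target = DriverManager.getConnection(
                     "jdbc:postgresql://warehouse-host/dw", "etl_user", "secret");
                 Statement read = source.createStatement();
                 ResultSet rows = read.executeQuery("SELECT id, amount FROM orders");
                 PreparedStatement write = target.prepareStatement(
                     "INSERT INTO fact_orders (order_id, amount_usd) VALUES (?, ?)")) {

                // 3. Define the transformation between source and target
                while (rows.next()) {
                    long id = rows.getLong("id");
                    double amountUsd = rows.getDouble("amount") * 1.1; // example currency conversion
                    write.setLong(1, id);
                    write.setDouble(2, amountUsd);
                    write.addBatch();
                }
                write.executeBatch();
            }
        }
    }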

Related

Custom workflows in Cadence or Temporal

I am planning to use Cadence or Temporal Workflow for the architecture; however, we plan to give the users a lot of power in deciding the workflow. Both Cadence and Temporal mention in their use cases that custom DSLs are supported by their SDKs, but I couldn't find that feature. Will you please help me out?
Yes, custom DSLs can be implemented on top of Cadence/Temporal with relative ease. The basic idea is that a Temporal workflow definition is code, so you write a simple interpreter of the DSL in that code. Because the workflow code is fault tolerant and durable, the DSL gets all the benefits of Cadence/Temporal.
Here is a DSL sample using the Go SDK. We are going to add a Java one later.
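To illustrate the same idea with the Temporal Java SDK, here is a minimal sketch of a workflow that interprets a trivial list-of-steps "DSL". The DslActivities interface and its executeStep activity are hypothetical names used only for illustration.

    import io.temporal.activity.ActivityInterface;
    import io.temporal.activity.ActivityOptions;
    import io.temporal.workflow.Workflow;
    import io.temporal.workflow.WorkflowInterface;
    import io.temporal.workflow.WorkflowMethod;
    import java.time.Duration;
    import java.util.List;

    // A trivial "DSL": just an ordered list of step names. A real DSL would be a richer
    // structure (branches, loops, parallelism) interpreted by the same kind of workflow code.
    @ActivityInterface
    interface DslActivities {
        String executeStep(String stepName); // hypothetical activity that runs one DSL step
    }

    @WorkflowInterface
    public interface DslWorkflow {
        @WorkflowMethod
        String run(List<String> steps);
    }

    class DslWorkflowImpl implements DslWorkflow {
        private final DslActivities activities = Workflow.newActivityStub(
            DslActivities.class,
            ActivityOptions.newBuilder().setStartToCloseTimeout(Duration.ofMinutes(5)).build());

        @Override
        public String run(List<String> steps) {
            String lastResult = "";
            // The interpreter loop: because workflow code is durable, this loop survives
            // worker crashes and restarts, so the DSL execution is fault tolerant "for free".
            for (String step : steps) {
                lastResult = activities.executeStep(step);
            }
            return lastResult;
        }
    }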

Play Framework with Spark MLib vs PredictionIO

Good morning,
Currently I'm exploring my options for building an internal platform for the company I work for. Our team is responsible for the company's data warehouse and reporting.
As we evolve, we'll be developing an intranet to answer some of the company's needs, and for some time now I've been considering Scala (and the Play Framework) as the way to go.
This will also involve a lot of machine learning to cluster clients, predict sales evolution, and so on. This is when I started to think about Spark ML and came across PredictionIO.
As we are shifting our skills towards data science, which would benefit and teach us/the company most:
build everything on top of Play and Spark and have both the platform and the machine learning in the same project
use Play and PredictionIO, where most of the stuff is already prepared
I'm not trying to open an opinion-based question; rather, I want to learn from your experience / architectures / solutions.
Thank you
Both are good options:
1. Use PredictionIO if you are new to ML. It is easy to start with, but it will limit you in the long run.
2. Use Spark if you have confidence in your data science and data engineering team. Spark has an excellent, easy-to-use API along with an extensive ML library. That said, in order to put things into production you will need some distributed Spark knowledge and experience, and it can be tricky at times to make it efficient and reliable.
Here are the options:
Spark on Databricks cloud: expensive, but easy-to-use Spark with no data engineering required
PredictionIO: if you are certain that its ML templates can solve all your business cases
Spark on Google Dataproc: an easily managed cluster for about 60% less than AWS, though some engineering is still required
In summary: PredictionIO for a quick fix, and Spark for long-term data science / engineering development. You can start with Databricks to minimise expertise overheads and move to Dataproc as you go along to minimise costs.
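To give a feel for the "build it on Spark" option, here is a minimal Spark MLlib sketch in Java for clustering clients. The input path and column names are hypothetical, and it assumes Spark 2.x or later.

    import org.apache.spark.ml.clustering.KMeans;
    import org.apache.spark.ml.clustering.KMeansModel;
    import org.apache.spark.ml.feature.VectorAssembler;
    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ClientClustering {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("client-clustering")
                .getOrCreate();

            // Hypothetical client metrics exported from the data warehouse.
            Dataset<Row> clients = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("hdfs:///dw/exports/client_metrics.csv");

            // Assemble numeric columns into a single feature vector (column names are assumptions).
            Dataset<Row> features = new VectorAssembler()
                .setInputCols(new String[] {"monthly_revenue", "orders_per_month", "avg_basket"})
                .setOutputCol("features")
                .transform(clients);

            // Cluster clients into 5 segments with k-means.
            KMeansModel model = new KMeans().setK(5).setSeed(42L).fit(features);
            model.transform(features)
                .select("client_id", "prediction")
                .show(20);

            spark.stop();
        }
    }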
PredictionIO uses Spark's MLlib for the majority of its engine templates.
I'm not sure why you're separating the two?
PredictionIO is as flexible as Spark is, and can alternatively use other libraries such as deeplearning4j & H2O to name a few.

Best suited NoSQL database for Content Recommender

I am currently working on a project which includes migrating a content recommender from MySQL to a NoSQL database for performance reasons. Our team has been evaluating some alternatives like MongoDB, CouchDB, HBase and Cassandra. The idea is to choose a database that is capable of running on a single server or in a cluster.
So far we have discarded the use of HBase due to its dependency on a distributed environment. Even though we plan to scale horizontally, we need to run the DB on a single server for a little while in production. MongoDB was also discarded because it does not support the map/reduce features we need.
We still have two alternatives and no solid background on which to decide. Any guidance or help is appreciated.
NOTE: I do not intend to start a religion-like discussion with unfounded arguments. It is a strictly technical question to be discussed in the problem's context.
Graph databases are usually considered best suited for recommendation engines, since many recommendation algorithms are actually graph based. I recommend looking into Neo4j: it can handle billions of nodes/edges on a single machine, and it supports a so-called high availability mode, which is a master-slave setup with automatic master selection.
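As a concrete illustration, a simple "users who liked the same content also liked" recommendation in Neo4j can be expressed as a single Cypher query; the sketch below runs it through the official Java driver. The URI, credentials, labels and relationship types are assumptions for illustration only.

    import org.neo4j.driver.AuthTokens;
    import org.neo4j.driver.Driver;
    import org.neo4j.driver.GraphDatabase;
    import org.neo4j.driver.Record;
    import org.neo4j.driver.Result;
    import org.neo4j.driver.Session;

    import static org.neo4j.driver.Values.parameters;

    public class ContentRecommender {
        public static void main(String[] args) {
            // Hypothetical connection details.
            try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                    AuthTokens.basic("neo4j", "password"));
                 Session session = driver.session()) {

                // Collaborative-filtering style query over LIKED relationships (schema assumed).
                Result result = session.run(
                    "MATCH (u:User {id: $userId})-[:LIKED]->(:Content)<-[:LIKED]-(:User)-[:LIKED]->(rec:Content) " +
                    "WHERE NOT (u)-[:LIKED]->(rec) " +
                    "RETURN rec.id AS contentId, count(*) AS score " +
                    "ORDER BY score DESC LIMIT 10",
                    parameters("userId", "user-42"));

                while (result.hasNext()) {
                    Record record = result.next();
                    System.out.println(record.get("contentId").asString()
                        + " (score " + record.get("score").asLong() + ")");
                }
            }
        }
    }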

What would be a good application for an enhanced version of MapReduce that shares information between Mappers?

I am building an enhancement to the Spark framework (http://www.spark-project.org/). Spark is a project out of UC Berkeley that does MapReduce quickly in RAM. Spark is built in Scala.
The enhancement I'm building allows some data to be shared between the mappers while they are computing. This can be useful, for example, if each of the mappers is looking for an optimal solution and they all want to share the current best solution (to prune out bad solutions early). The shared value may be slightly out of date as it propagates, but it should still speed up the computation. In general, this is called the branch-and-bound approach.
We can share monotonically increasing numbers, but also arrays and dictionaries.
We are also looking at machine learning applications where the mappers compute local natural gradient information, and a new current best solution is then shared among all nodes.
What are some other good real-world applications of this kind of enhancement? What kinds of real, useful applications might benefit from a MapReduce computation with just a little bit of information sharing between mappers? What applications use MapReduce or Hadoop right now but are just a little too slow because of the independence restriction of the map phase?
The benefit could be either speeding up the map phase or improving the solution.
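To make the kind of sharing concrete, here is a tiny, self-contained Java sketch of the branch-and-bound idea, using plain threads and an AtomicLong as a stand-in for mappers sharing a current best bound. This only illustrates the pattern; it is not the actual Spark enhancement, and the cost function is made up.

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.stream.IntStream;

    public class SharedBoundDemo {
        // Shared "best solution so far"; in the real enhancement this value would be
        // propagated (possibly slightly stale) between mappers.
        private static final AtomicLong bestCost = new AtomicLong(Long.MAX_VALUE);

        // One "mapper": explores its partition of candidate solutions, pruning any
        // candidate whose cost cannot beat the shared bound.
        private static void explorePartition(int partition) {
            for (int candidate = 0; candidate < 1_000_000; candidate++) {
                long cost = evaluate(partition, candidate);
                if (cost >= bestCost.get()) {
                    continue; // prune: cannot improve on the current best
                }
                bestCost.accumulateAndGet(cost, Math::min); // publish a better bound
            }
        }

        // Hypothetical cost function standing in for the real optimisation problem.
        private static long evaluate(int partition, int candidate) {
            return Math.abs((partition * 31L + candidate) % 100_000);
        }

        public static void main(String[] args) {
            IntStream.range(0, 4).parallel().forEach(SharedBoundDemo::explorePartition);
            System.out.println("Best cost found: " + bestCost.get());
        }
    }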
The enhancement I'm building allows some data to be shared between the mappers while they are computing.
Apache Giraph is based on Google Pregel, which is built on the BSP (Bulk Synchronous Parallel) model, and is used for graph processing.
Giraph depends on Hadoop for implementation. In general there is no communication between the mappers in MapReduce, but in Giraph the mappers communicate with each other during the communication phase of BSP.
You might be also interested in Apache Hama which implements BSP and can be used for more than graph processing.
There may be good reasons why mappers don't communicate in MapReduce. Have you considered those factors in your enhancement?
What are some other good real-world applications of this kind of enhancement?
Graph processing is one thing I can think of, similar to Giraph. Check out the different use cases for BSP; some might be applicable to this kind of enhancement. I am also very interested in what others have to say on this.
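For reference, here is roughly what the "share the best value seen so far" idea looks like as a BSP computation in Giraph: each vertex forwards improvements to its neighbours during one superstep and picks up the best value it has received in the next. This is a sketch against the usual Giraph 1.x API; the class name and Writable type choices are assumptions.

    import org.apache.giraph.graph.BasicComputation;
    import org.apache.giraph.graph.Vertex;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;

    // Max-value propagation: each vertex keeps the largest value it has seen and
    // forwards improvements to its neighbours during the BSP communication phase.
    public class MaxValuePropagation
            extends BasicComputation<LongWritable, DoubleWritable, NullWritable, DoubleWritable> {

        @Override
        public void compute(Vertex<LongWritable, DoubleWritable, NullWritable> vertex,
                            Iterable<DoubleWritable> messages) {
            double max = vertex.getValue().get();
            for (DoubleWritable message : messages) {
                max = Math.max(max, message.get());
            }
            if (getSuperstep() == 0 || max > vertex.getValue().get()) {
                vertex.setValue(new DoubleWritable(max));
                // This send is the information sharing between workers.
                sendMessageToAllEdges(vertex, new DoubleWritable(max));
            }
            vertex.voteToHalt();
        }
    }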

How would you develop a workflow application in Java?

I want to develop an application that allows its users to define workflows and then executes them.
My environment is JBoss so naturally I'm considering jBPM.
I can NOT use the jBPM graphic workflow design tools since my workflows are very specific and I don't want to expose my users to all jBPM features.
Questions:
Is jBPM robust and scalable?
Is jBPM standard (i.e., used by enough people)?
How do I tie my own workflow GUI to the jBPM engine?
Is jBPM suitable for the job, should I consider a different platform or maybe do it (the workflow logic) myself?
Is jBPM robust and scalable?
Ans: Yes, jBPM is robust and scalable, but it needs to be configured and developed properly.
Is jBPM standard (i.e., used by enough people)?
Ans: You would need to ask on the jBPM forum.
How do I tie my own workflow GUI to the jBPM engine?
Ans: You need to develop a process configuration file for each workflow and deploy these config files (XML files); this updates the jBPM-related tables and your workflow-related tables.
Is jBPM suitable for the job, should I consider a different platform or maybe do it (the workflow logic) myself?
Ans: It is suitable for big workflows (where there are many stages/nodes and a lot of logic), and it is easy to integrate with a rule engine.
Not a direct answer to your question, but I think you should also take into consideration:
As you want your users to define workflows, are you sure you're not just referring to a finite state machine, rather than a workflow?
Can the user change existing workflows, and if so: if the workflow is changed, do you want running processes to continue using the old definition, or do you need to be able to migrate the running processes to use the new definition?
How do I tie my own workflow GUI to the jBPM engine?
As stated on the jBPM main page:
JBoss jBPM provides a process-oriented programming model (jPDL) that blends the best of both Java and declarative programming techniques.
jBPM jPDL API docs overview
Is jBPM robust and scalable?
Yes, you have a wide range of options to scale your engine to a large number of process definitions, process instances and/or requests/second.
Is jBPM standard (i.e., used by enough people)?
Difficult to define "standard" ;) But it had, for example, several thousand downloads last week, and it uses standards as much as possible, like the BPMN 2.0 specification for process definitions, a standard that is currently being adopted by almost all BPM vendors.
How do I tie my own workflow GUI to the jBPM engine?
Depends what the GUI is for. Assuming you are referring to a GUI for defining the process definitions and you don't want to use the Eclipse-based or web-based editors that are offered by default, you can:
- use any GUI you like, as long as it generates BPMN2 XML that can then be read in by the process engine
- have your GUI use the fluent process Java API to create processes in Java, which can then be loaded into the engine as well (see the sketch below)
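For the second option, a minimal sketch along the lines of the documented fluent API might look like the following. The node ids, names and the logging action are made up for illustration, and the exact packages and factory classes differ a bit between jBPM versions.

    import org.jbpm.ruleflow.core.RuleFlowProcess;
    import org.jbpm.ruleflow.core.RuleFlowProcessFactory;

    // Build a trivial start -> script -> end process in Java instead of drawing it,
    // e.g. from whatever model your own GUI produces.
    public class FluentProcessBuilder {
        public static RuleFlowProcess buildHelloProcess() {
            RuleFlowProcessFactory factory =
                RuleFlowProcessFactory.createProcess("com.example.hello");
            factory
                .name("Hello process")
                .packageName("com.example")
                .startNode(1).name("Start").done()
                .actionNode(2).name("Say hello")
                    .action("java", "System.out.println(\"Hello from the workflow\");").done()
                .endNode(3).name("End").done()
                .connection(1, 2)
                .connection(2, 3);
            return factory.validate().getProcess();
        }
    }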
Is jBPM suitable for the job, should I consider a different platform or maybe do it (the workflow logic) myself?
Trying to create a simple workflow engine yourself probably takes more effort than you might think: you might start out simple, but you usually end up adding features like persistence, monitoring, integration, dynamic loading of new process definitions, process instance migration, etc., and you end up with a home-grown workflow engine you have to maintain ;) You get these features out of the box with jBPM.
Kris