Any pointers on how to build a data lineage solution across multiple applications? - metadata

We have a requirement to capture data lineage across multiple applications. These applications span multiple technology stacks, ranging from PL/SQL to Java to Spark.
Any hints on how to proceed would be of great help.
Thanks
Anuj Mehra
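
One stack-independent way to frame the problem is to have every job, whether PL/SQL, Java, or Spark, report which datasets it read and which it wrote, and to build a graph from those records. The following is only a minimal sketch under that assumption; `LineageEvent`, the job names, and the dataset names are hypothetical and do not refer to any particular lineage product.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEvent:
    """One run of a job: the datasets it read and the datasets it wrote."""
    job: str
    inputs: tuple
    outputs: tuple


def build_upstream_index(events):
    """Map each dataset to the set of datasets it was directly derived from."""
    upstream = defaultdict(set)
    for event in events:
        for out in event.outputs:
            upstream[out].update(event.inputs)
    return upstream


def trace_upstream(dataset, upstream):
    """Return every dataset that feeds `dataset`, directly or transitively."""
    seen = set()
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


# Hypothetical events, one per technology stack mentioned in the question.
events = [
    LineageEvent("plsql_load_orders", ("crm.orders_raw",), ("dwh.orders",)),
    LineageEvent("java_enrich_orders", ("dwh.orders", "ref.customers"), ("dwh.orders_enriched",)),
    LineageEvent("spark_daily_sales", ("dwh.orders_enriched",), ("mart.daily_sales",)),
]
index = build_upstream_index(events)
print(sorted(trace_upstream("mart.daily_sales", index)))
```

The hard part in practice is usually producing the events for each stack (parsing SQL, instrumenting Java code, hooking into Spark), which this sketch does not attempt.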

Related

How can I compare serverless with monolithic projects?

I need to develop a final project in college and I chose this topic as a goal. I would like to compare the impact of using serverless on the development of an application.
To do this, I thought I'd compare repositories that use a monolithic architecture with those that use serverless. At first, the idea would be to look only at Python or JavaScript projects.
I would like to know if any of you have suggestions for software to calculate software metrics, or to make it easier to find these types of repositories on GitHub. Currently, to find serverless projects, I'm looking for repositories that contain a serverless.yml file.
The idea would be to make a comparative study of these two types of architecture, calculating the differences and the benefits of using each one. For example, how splitting code into atomic parts can affect code complexity as well as maintainability over time.
I'm still a little lost on how to proceed; any ideas or suggestions would be most welcome!
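
To make "code complexity" concrete for the Python repositories, here is a rough sketch of a per-function metric built on the standard `ast` module. It approximates cyclomatic complexity as one plus the number of decision points. The set of node types counted is an assumption (established metric tools differ on the exact rules), and `sample` is a made-up handler used only for illustration.

```python
import ast

# Node types treated as decision points (an assumption; tools differ on this set).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def function_complexity(source):
    """Return {function_name: approximate cyclomatic complexity} for Python source."""
    tree = ast.parse(source)
    result = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Count decision points anywhere inside the function body.
            # Nested functions are also counted towards their enclosing function.
            branches = sum(isinstance(child, BRANCH_NODES) for child in ast.walk(node))
            result[node.name] = 1 + branches
    return result


# Hypothetical source text standing in for a file from a repository.
sample = '''
def handler(event):
    if event.get("ok") and event.get("body"):
        return 200
    for item in event.get("items", []):
        if item is None:
            return 400
    return 204

def ping():
    return "pong"
'''

print(function_complexity(sample))
```

Running the same function over every `.py` file in a monolithic and a serverless repository would give two distributions to compare, though a real study would probably rely on an established metrics tool rather than this approximation.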

Play Framework with Spark MLlib vs PredictionIO

Good morning,
currently I'm exploring my options for building an internal platform for the company I work for. Our team is responsible for the company's data warehouse and reporting.
As we evolve, we'll be developing an intranet to answer some of the company's needs and, for some time now, I've been considering Scala (and Play Framework) as the way to go.
This will also involve a lot of machine learning to cluster clients, predict sales evolution, and so on. This is when I started thinking about Spark ML and came across PredictionIO.
As we are shifting our skills towards data science, which option will benefit and teach us and the company the most:
building everything on top of Play and Spark, with both the platform and the machine learning in the same project
using Play and PredictionIO, where most of the stuff is already prepared
I'm not trying to open an opinion-based question, but rather to learn from your experience, architectures, and solutions.
Thank you
Both are good options:
1. Use PredictionIO if you are new to ML. It is easy to start with, but it will limit you in the long run.
2. Use Spark if you have confidence in your data science and data engineering team. Spark has an excellent, easy-to-use API along with an extensive ML library. That said, putting things into production requires some distributed Spark knowledge and experience, and it is tricky at times to make it efficient and reliable.
Here are the options:
Databricks cloud: expensive, but an easy way to use Spark with no data engineering
PredictionIO: if you are certain that its ML can solve all your business cases
Spark on Google Dataproc: an easily managed cluster for 60% less than AWS, with some engineering still required
In summary: PredictionIO for a quick fix, and Spark for long-term data science and engineering development. You can start with Databricks to minimise expertise overheads and move to Dataproc as you go along to minimise costs.
PredictionIO uses Spark's MLlib for the majority of its engine templates.
I'm not sure why you're separating the two?
PredictionIO is as flexible as Spark is, and can alternatively use other libraries such as Deeplearning4j and H2O, to name a few.

Can I build a license- or time-limited demo version of a stand-alone application using MATLAB Compiler?

I have developed a stand-alone application using the MATLAB Compiler and I would like to be able to distribute a demo version of it, as it is common practice to let potential users test an application before buying.
Is there any way to do this using MATLAB?
Your help in this regard is highly appreciated. Thank you in advance.

Standard process for developing DataStage steps

I have experience using Pentaho Kettle and Talend Data Integration for ETL jobs and typically the high-level process for developing transformations is:
define source connections
define target connections
define transformation of data between source and target
What is the 'standard' high-level process for developing DataStage jobs? Is it similar to the process identified above?
Exactly the same. If you know the basic concepts of an ETL tool, they apply to all of the tools.
The three steps you listed are, however, very high level. Depending on what you're trying to do, that list can be increased quite dramatically.
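
The three high-level steps from the question can be sketched generically as follows. This is plain Python with in-memory SQLite databases standing in for real systems, not DataStage itself; the table names, columns, and the cents-to-amount conversion are hypothetical and chosen only to show where each step sits.

```python
import sqlite3


def run_job(source, target):
    """Copy orders from source to target, converting cents to a decimal amount."""
    rows = source.execute("SELECT id, amount_cents FROM orders_raw").fetchall()
    transformed = [(order_id, cents / 100) for order_id, cents in rows]
    target.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", transformed)
    target.commit()
    return len(transformed)


# 1. Define the source connection (an in-memory stand-in with sample rows).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders_raw (id INTEGER, amount_cents INTEGER)")
source.executemany("INSERT INTO orders_raw VALUES (?, ?)", [(1, 1250), (2, 399)])

# 2. Define the target connection.
target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE orders (id INTEGER, amount REAL)")

# 3. Define the transformation between source and target, then run it.
loaded = run_job(source, target)
print(loaded, target.execute("SELECT id, amount FROM orders ORDER BY id").fetchall())
```

A real job would add the items that make the list grow: error handling, rejects, lookups, scheduling, and so on.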

Which are Scala alternatives for MDB or JMS?

I am in the process of convincing my boss to start using Scala for our web application. I showed him some nice features and frameworks (Play 2.x) where Scala's power really shows, and now he has come up with a question on which I need some advice from Scala experts:
What frameworks are used in Scala for building queue-based message processing (the same as MDB and JMS), what transaction management systems are used, and how do they compare with EJB container-managed transactions?
Because I don't have too much experience with Scala, I am pretty lost when it comes to finding documentation on this. Can you please give me some suggestions?