Red Hat Process Automation Manager Database Sizing? - jboss

How would one go about sizing a data for Red Hat Process Automation Manager (jBPM)? All to documents I have read state you can hook it up to a database (mySQL mostly) but don't give any inclination to default sizing anywhere.

The size of the database will be dependent on the your business requirement or the usages. It will vary as per your implementation.
If the database tables are growing infinitely then you can remove the historical data/clean history tables. Red Hat Process Automation Manager (RHPAM) provide feature to schedule a job and clear the historical data periodically.
Configure a "LogCleanupCommand" job to clear the historical data.

Related

Can we generate a data lineage from our DataStage Jobs?

We're using IBM DataStage 11.7.1
The metadata asset manager was not used in the Project.
Can we generate a data lineage out of the existing and used jobs (knowing that not 100% can be covered)? If yes: how?
You can only generate lineage within a job, using DataStage. That is, you can answer questions "show where data flows to" and "show where data comes from" within the context of the one job. You can access this functionality by right-click on the stage about which you're interested in asking the question.
Beyond that, you can generate data lineage more formally using the Information Governance Catalog tool. If you are not using shared metadata resources, and not generating operational metadata when running jobs, then the lineage report will be based on design data only.
If you share the table definitions you use in your jobs into the common metadata repository (from the Repository menu in DataStage Designer), then you will get better lineage results in IGC. If you generate operational metadata when running your jobs then these operational metadata will also be available in lineage reports.
Don't forget that DataStage jobs are not included in lineage by default. You need to mark at least the jobs of interest as "include for lineage" in the Administration page of IGC.

Talend open studio run only created or modified records among 15k

I have a job in talend open studio which is working fine, it conects a tMSSqlinput to a tMap then tMysqlOutput, very straight forward. My problem is that i need this job running on daily basis, but only run when a new record is created or modified...any help is highly aprecciated!
It seems that you are searching for a Change Data Capture Tool for Talend.
Unfortunately it is only available on the licenced product.
To implement your need, you do have several ways. I want to show the most popular ones.
CDC from Talend
As Corentin said correctly, you could choose to use CDC (Change Data Capture) from Talend if you use the subscription version.
CDC of MSSQL
Alternatively you can check if you can activate or use CDC in your MSSQL server. This depends on your license. If it is possible, you can use the function to identify new elements and proceed them.
Triggers
Also you can create triggers on your database (if you have access to it). For example, creating a trigger for the cases INSERT, UPDATE, DELETE would help you getting the deltas. Then you could store those records separately or their IDs.
Software driven / API
If your database is connected to a software and you have developers around, you could ask for a service which identifies records on insert / update / delete and shows them to you. This could be done e.g. in a REST interface.
Delta via ID
If the primary key is an ID and it is set to autoincrement, you could also check your MySQL table for the biggest number and only SELECT those from the source which have a bigger ID than you have already got. This depends of course from the database layout.

Data mining with postgres in production environment - is there a better way?

There is a web application which is running for a years and during its life time the application has gathered a lot of user data. Data is stored in relational DB (postgres). Not all of this data is needed to run application (to do the business). However form time to time business people ask me to provide reports of this data data. And this causes some problems:
sometimes these SQL queries are long running
quires are executed against production DB (not cool)
not so easy to deliver reports on weekly or monthly base
some parts of data is stored in way which is not suitable for such
querying (queries are inefficient)
My idea (note that I am a developer not the data mining specialist) how to improve this whole process of delivering reports is:
create separate DB which regularly is update with production data
optimize how data is stored
create a dashboard to present reports
Question: But is there a better way? Is there another DB which better fits for such data analysis? Or should I look into modern data mining tools?
Thanks!
Do you really do data mining (as in: classification, clustering, anomaly detection), or is "data mining" for you any reporting on the data? In the latter case, all the "modern data mining tools" will disappoint you, because they serve a different purpose.
Have you used the indexing functionality of Postgres well? Your scenario sounds as if selection and aggregation are most of the work, and SQL databases are excellent for this - if well designed.
For example, materialized views and triggers can be used to process data into a scheme more usable for your reporting.
There are a thousand ways to approach this issue but I think that the path of least resistance for you would be postgres replication. Check out this Postgres replication tutorial for a quick, proof-of-concept. (There are many hits when you Google for postgres replication and that link is just one of them.) Here is a link documenting streaming replication from the PostgreSQL site's wiki.
I am suggesting this because it meets all of your criteria and also stays withing the bounds of the technology you're familiar with. The only learning curve would be the replication part.
Replication solves your issue because it would create a second database which would effectively become your "read-only" db which would be updated via the replication process. You would keep the schema the same but your indexing could be altered and reports/dashboards customized. This is the database you would query. Your main database would be your transactional database which serves the users and the replicated database would serve the stakeholders.
This is a wide topic, so please do your diligence and research it. But it's also something that can work for you and can be quickly turned around.
If you really want try Data Mining with PostgreSQL there are some tools which can be used.
The very simple way is KNIME. It is easy to install. It has full featured Data Mining tools. You can access your data directly from database, process and save it back to database.
Hardcore way is MADLib. It installs Data Mining functions in Python and C directly in Postgres so you can mine with SQL queries.
Both projects are stable enough to try it.
For reporting, we use non-transactional (read only) database. We don't care about normalization. If I were you, I would use another database for reporting. I will desing the tables following OLAP principals, (star schema, snow flake), and use an ETL tool to dump the data periodically (may be weekly) to the read only database to start creating reports.
Reports are used for decision support, so they don't have to be in realtime, and usually don't have to be current. In other words it is acceptable to create report up to last week or last month.

Synchronize between an MS Access (Jet / MADB) database and PostgreSQL DB, is this possible?

Is it possible to have a MS access backend database (Microsoft JET or Access Database Engine) set up so that whenever entries are inserted/updated those changes are replicated* to a PostgreSQL database?
Two-way synchronization would be nice, but one way would be acceptable.
I know it's popular to link the two and use one as a frontend, but it's essential that both be backend.
Any suggestions?
* ie reflected, synchronized, mirrored
Can you use Microsoft SQL Server Express Edition? Or do you have to use Microsoft Access Database Engine? It's possible you'll have more options using MS SQL express, like more complete triggers and logging.
Either way, you're going to need a way to accumulate a log of changed rows from the source database engine, and a program to sync them to PostgreSQL by reading the log and converting it into suitable PostgreSQL INSERT, UPDATE and DELETE statements.
You could do this by having audit triggers in MADB/Express insert a row into an audit shadow table for every "real" table whenever it changed, including inserting special "row deleted" audit entries. Then your sync program could connect to both MADB/Express, read the audit tables, apply the changes to PostgreSQL, and empty the audit tables.
I'll be surprised if you find anything to do this out of the box. It's one area where Microsoft SQL Server has a big advantage because of all the deep Access and MADB engine integation to support the synchronisation and integration features.
There are some ETL ("Extract, Transform, Load") tools that might be helpful, like Pentaho and Talend. I don't know if you can achieve the desired degree of automation with them though.

what happens to my dataset in case of unexpected failure

i know this has been asked here. But my question is slightly different. When the dataset was designed keeping the disconnected principle in mind, what was provided as a feature which would handle unexpected termination of the application, say a power failure or a windows hang or system exception leading to restart. Say the user has entered some 100 rows and it is modified at the dataset alone. Usually the dataset is updated at the application close or at a timely period.
In old times which programming using vb 6.0 all interaction used to take place directly with the database, thus each successful transaction was committing itself automatically. How can that be done using datasets?
DataSets are never for direct access to database, they are a disconnected model only. There is no intent that they be able to recover from machine failures.
If you want to work live against the database you need to use DataReaders and issue DbCommands against the database live for changes. This of course will increase your load on the database server though.
You have to balance the two for most applications. If you know a user just entered vital data as a new row, execute an insert command to the database, and put a copy in your local cached DataSet. Then your local queries can run against the disconnected data, and inserts are stored immediately.
A DataSet can be serialized very easily, so you could implement your own regular backup to disk by using serialization of the DataSet to the filesystem. This will give you some protection, but you will have to write your own code to check for any data that your application may have saved to disk previously and so on...
You could also ignore DataSets and use SqlDataReaders and SqlCommands for the same sort of 'direct access to the database' you are describing.