Talend: MapReduce code integration

I have some questions related to Talend:
Is it possible to include MapReduce code in Talend and execute it from there?
When we execute the job after setting up the components, will Talend convert it to a MapReduce job?

1) For your first question, you need Talend Enterprise Edition for Big Data. It includes custom code components where you can write your own mapper and reducer code.
2) Yes, Talend converts the job to MapReduce code, but only for the following components: Hive, Pig, Sqoop, HBase.

The answers to both questions are: yes!
This video shows you how they design the job with the tool.
You have to use the software called Talend Open Studio for Big Data.
The latest release is v5.3.1.
The documentation is here:
Getting started
Component Reference

Apache beam Python SDK

So I have a Python file with code that calls a REST API to extract data from a URL and load it into a SQL database. The code uses Python packages such as graphql to extract the data and sqlalchemy to insert the data into the database. I'm trying to integrate this code into the Beam API, but I have no clue how to do so. Do I have to generate the data first and then use the CSV output for my pipeline, or can I just put all of this into a Beam pipeline and produce the CSV by executing the Apache Beam code? Any help is extremely appreciated, thank you for reading.
I am not going to share any code, I'm just here to understand how to tackle this problem so that I can look for solutions myself!
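For illustration only, here is a minimal sketch of the second option (doing the extraction inside the pipeline instead of materializing a CSV first): wrap the existing extraction code in a DoFn and let Beam drive it. The fetch_records_from_api helper is a hypothetical stand-in for the existing graphql/requests code, the CSV formatting is deliberately naive, and the SQLAlchemy load step could similarly live in another DoFn or a Beam I/O connector.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def fetch_records_from_api():
    # Hypothetical stand-in for the existing graphql/requests extraction code.
    return [{"id": 1, "name": "example"}]


class ExtractFromApi(beam.DoFn):
    """Calls the REST/GraphQL endpoint and emits one dict per record."""

    def process(self, _):
        for record in fetch_records_from_api():
            yield record


def run():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (
            p
            | "Start" >> beam.Create([None])  # one dummy element just to trigger the extract
            | "Extract" >> beam.ParDo(ExtractFromApi())
            | "ToCsvLine" >> beam.Map(lambda r: ",".join(str(v) for v in r.values()))
            | "WriteCsv" >> beam.io.WriteToText("output", file_name_suffix=".csv")
        )


if __name__ == "__main__":
    run()
```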

How to get geospatial POINT using SparkSQL

I'm converting a process from PostgreSQL over to Databricks / Apache Spark.
The PostgreSQL process uses the following SQL expression to get the point on a map from an X and Y value: ST_Transform(ST_SetSrid(ST_MakePoint(x, y),4326),3857)
Does anyone know how I can achieve this same logic in Spark SQL on Databricks?
To achieve this you need to use a library such as Apache Sedona or GeoMesa. Sedona, for example, has the ST_Transform function, and it may well have the rest too (see the sketch below).
The only thing you need to take care of is that if you're using pure SQL on Databricks, you will need to:
install the Sedona libraries using an init script, so the libraries are there before Spark starts
set the Spark configuration parameters, as described in the following pull request
Update, June 2022: people at Databricks have developed the Mosaic library, which is heavily optimized for geospatial analysis on Databricks and is compatible with the standard ST_ functions.
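For illustration, a minimal PySpark sketch of the Sedona approach described above. It assumes the Sedona packages are installed (on Databricks, via the init script mentioned above); the registration helper and the ST_Transform signature vary between Sedona versions, so treat it as a starting point rather than a drop-in answer.

```python
from pyspark.sql import SparkSession
from sedona.register import SedonaRegistrator  # registration helper; newer Sedona versions also offer SedonaContext

spark = SparkSession.builder.appName("geo-demo").getOrCreate()
SedonaRegistrator.registerAll(spark)  # registers the ST_ functions in Spark SQL

spark.createDataFrame([(12.4924, 41.8902)], ["x", "y"]).createOrReplaceTempView("coords")

# Rough equivalent of ST_Transform(ST_SetSrid(ST_MakePoint(x, y), 4326), 3857);
# some Sedona versions take the source/target CRS as strings, others infer the source from the SRID.
spark.sql("""
    SELECT ST_Transform(ST_SetSRID(ST_Point(x, y), 4326), 'epsg:4326', 'epsg:3857') AS geom
    FROM coords
""").show(truncate=False)
```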

MongoDB data pipeline to Redshift using Apache-Spark

As my employer makes the big jump to MongoDB, Redshift and Spark, I am trying to be proactive and get hands-on with each of these technologies. Could you please refer me to any resources that would be helpful in performing this task:
"Creating a data pipeline using Apache Spark to move data from MongoDB to RedShift"
So far I have been able to download a dev version of MongoDB and create a test Redshift instance. How do I go about setting up the rest of the process and getting my feet wet?
I understand that to create the data pipeline using Apache Spark, one has to code in Scala, Python or Java. I have a solid understanding of SQL, so feel free to suggest which of Scala, Python or Java would be easiest for me to learn.
My background is in data warehousing and traditional ETL (Informatica, DataStage, etc.).
Thank you in advance :)
A really good approach may be to use AWS Database Migration Service (DMS):
http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
You can specify MongoDB as a source endpoint and Redshift as the target endpoint.
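If you do want to prototype the Spark route the question mentions, a minimal PySpark sketch might look like the following. The MongoDB Spark connector options and the Redshift JDBC details are placeholders and depend on the connector and driver versions you install; for large volumes the usual pattern is to stage to S3 and COPY rather than plain JDBC inserts.

```python
from pyspark.sql import SparkSession

# Assumes the MongoDB Spark connector and a Redshift JDBC driver are on the classpath.
spark = (SparkSession.builder
         .appName("mongo-to-redshift")
         .config("spark.mongodb.read.connection.uri",
                 "mongodb://localhost:27017/testdb.events")  # placeholder source URI
         .getOrCreate())

# Read a MongoDB collection into a DataFrame (option/format names vary by connector version).
df = spark.read.format("mongodb").load()

# Any SQL-style cleanup or transformation can happen here with DataFrame operations.
df_clean = df.drop("_id")

# Write to Redshift over JDBC; cluster endpoint, table and credentials are placeholders.
(df_clean.write
 .format("jdbc")
 .option("url", "jdbc:redshift://example.cluster.redshift.amazonaws.com:5439/dev")
 .option("dbtable", "public.events")
 .option("user", "awsuser")
 .option("password", "...")
 .option("driver", "com.amazon.redshift.jdbc42.Driver")
 .mode("append")
 .save())
```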

Integrate PMML to MongoDB

I have built a supervised learning model in R and exported the model/decision rules in PMML format. I was hoping I could link the PMML straightforwardly to MongoDB using something like the JPMML library (as JPMML integrates well with PostgreSQL).
However, it seems the only way to link MongoDB to my PMML XML file is to use Cascading Pattern through Hadoop. Since my dataset isn't large (<50GB), I don't really need Hadoop.
Has anyone used PMML with MongoDB before in a way that doesn't involve going down the Hadoop route? Many thanks.
Basically, you have two options here:
Convert the PMML file to something that you can execute inside MongoDB.
Deploy the PMML file "natively" to some outside service and connect MongoDB to it.
50 GB is still quite a lot of data, so option #1 is clearly preferable in terms of the ease of setup and the speed of execution. Is it possible to write a Java user-defined function (UDF) for MongoDB? If so, then it would be possible to run the JPMML library inside MongoDB. Otherwise, you might see if it would be possible to convert your PMML model to SQL script. For example, the latest versions of KNIME software (2.11.1 and newer) contain a "PMML to SQL" conversion node.
If you fall back to option #2, then the following technical article might provide some inspiration to you: Applying predictive models to database data: the REST web service approach.
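As an illustration that is not part of the original answer, a variant of option #2 can also be done client-side from a small Python script: pull documents from MongoDB with pymongo, score them against the exported PMML model with an evaluator such as pypmml, and write the prediction back. The database, collection and field handling below are assumptions.

```python
from pymongo import MongoClient
from pypmml import Model  # pip install pypmml

model = Model.fromFile("decision_rules.pmml")   # the PMML exported from R

client = MongoClient("mongodb://localhost:27017")
collection = client["mydb"]["observations"]     # placeholder database/collection

for doc in collection.find({"score": {"$exists": False}}):
    # Feed the document's fields to the model; keys must match the PMML input fields.
    features = {k: v for k, v in doc.items() if k != "_id"}
    result = model.predict(features)            # returns a dict of the PMML output fields
    collection.update_one({"_id": doc["_id"]}, {"$set": {"score": result}})
```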

How can I get a sample of my input data for testing purposes using Talend?

Is there a component in the Talend Open Studio edition (or else, the Enterprise edition) that allows one to do a test run of a job? I would like to do a test load of a small batch of data.
Obviously you could just provide less data as the input. This could mean manually providing a small test input file, changing your query to use LIMIT (or your RDBMS's equivalent), or adding some filtering criteria either in the database query or in a tFilterRow component.
Alternatively you can use the tSampleRow component to take a sample of your input data.