I am new to the Talend ETL tool. I want to use Talend to generate Spark batch jobs.
Can Talend generate Scala code instead of Java, or is there a way to plug a Scala-based batch job into Talend?
No, it generates Java only.
That should not matter, though, if you are using Talend as a graphical ETL tool at a higher level of abstraction.
I've started developing my first Dataflow job using Scio, the Scala SDK. The job will run in streaming mode.
Can anyone advise the best way to deploy this? I have read in the Scio docs that they use sbt-pack and then deploy it within a Docker container. I have also read about using Dataflow templates (but not in great detail).
What's best?
As with the Java and Python SDKs, you can run your code directly on Dataflow by using the Dataflow runner and launching it from your computer (or from a VM or function).
If you want to package it for reuse, you can create a template.
You can't run custom containers on Dataflow.
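For the direct-launch route, here is a minimal Scio sketch, assuming illustrative job, input, and output names; passing the Dataflow runner options on the command line submits it from your machine:

    import com.spotify.scio._

    // Minimal Scio job; the I/O paths below are placeholders.
    object WordLengthJob {
      def main(cmdlineArgs: Array[String]): Unit = {
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        sc.textFile(args("input"))           // e.g. gs://my-bucket/input/*
          .map(line => line.length.toString)
          .saveAsTextFile(args("output"))    // e.g. gs://my-bucket/output
        sc.run()                             // older Scio versions use sc.close()
      }
    }

Launching it with --runner=DataflowRunner --project=... --region=... runs it on Dataflow directly; adding --templateLocation=gs://... at launch should stage it as a classic template instead of executing it.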
I have code written entirely in Scala that uses Spark Streaming to read JSON data from a Kafka topic and, after some processing, write it to Cassandra and to another Kafka topic. Now I need to write a unit test for this code. I need help on how to write such a test and how to mock data when I am using the Spark Cassandra Connector.
You can use spark-cassandra-connector-embedded, which is developed together with the connector itself. Just add the Maven or SBT dependency to your project; for SBT:
"com.datastax.spark" %% "spark-cassandra-connector-embedded" % {latest.version}
Which Talend tool should I download? I need to use enterprise technologies like REST/SOAP/JSON/XML/JDBC/SFTP/XSD (Oracle).
My use cases are:
Exposing services (REST/SOAP)
Reading from files such as MT940, CSV, and flat files, and storing the data in a database (Oracle)
Using SFTP and file movements frequently
What is the difference between Talend Data Integration and Talend ESB?
Currently I have downloaded Talend Open Studio for Big Data.
Will this suffice?
It depends on what you want to do with it. I'm thinking the free version of Talend Data Integration should suffice if all you want to do is pull data.
I'm currently working on a recommender system using PySpark and IPython Notebook. I want to get recommendations from data stored in BigQuery. There are two options: the Spark BQ connector and the Python BQ library.
What are the pros and cons of these two tools?
The Python BQ library is the standard way to interact with BQ from Python, and so it exposes the full API capabilities of BigQuery. The Spark BQ connector you mention is the Hadoop Connector: a Java Hadoop library that lets you read from and write to BigQuery using abstracted Hadoop classes. This more closely resembles how you interact with native Hadoop inputs and outputs.
You can find example usage of the Hadoop Connector here.
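As a rough illustration of the Hadoop Connector route, here is a sketch in Scala (the same newAPIHadoopRDD call is available from PySpark); the project, bucket, and table names are placeholders:

    import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
    import com.google.gson.JsonObject
    import org.apache.hadoop.io.LongWritable
    import org.apache.spark.{SparkConf, SparkContext}

    object BigQueryReadSketch {
      def main(args: Array[String]): Unit = {
        val sc   = new SparkContext(new SparkConf().setAppName("bq-read-sketch"))
        val conf = sc.hadoopConfiguration

        // The connector stages BigQuery exports through GCS, hence the temp bucket.
        conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "my-project")      // placeholder
        conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "my-temp-bucket")  // placeholder
        BigQueryConfiguration.configureBigQueryInput(conf, "publicdata:samples.shakespeare")

        // Each record arrives as a (row id, JSON object) pair.
        val rows = sc.newAPIHadoopRDD(
          conf,
          classOf[GsonBigQueryInputFormat],
          classOf[LongWritable],
          classOf[JsonObject])

        rows.map { case (_, json) => json.get("word").getAsString }
          .take(10)
          .foreach(println)
      }
    }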
We have DataStage jobs and want to use a Java class that reads a file and returns some data. Can someone explain the steps needed to do this?
There are Java Transformer and Java Client stages in the Real Time section of the Palette.
You will need to study the API that DataStage uses to work with Java.
Simply write Java code that reads the file; you can then call its class in DataStage (see the sketch below).
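As a sketch of just the file-reading piece (written in Scala to match the other examples here; it compiles to an ordinary JVM class you can package in a jar, though wiring it into a stage still means implementing the DataStage Java API mentioned above; the class and method names are illustrative):

    import scala.io.Source

    // Illustrative helper only: reads a file and returns its lines. The code
    // that plugs this into the Java Integration Stage API is omitted.
    class FileReaderHelper {
      def readLines(path: String): Seq[String] = {
        val source = Source.fromFile(path)
        try source.getLines().toList
        finally source.close()
      }
    }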
The Java Integration Stage is a DataStage Connector through which you can call a custom Java application from InfoSphere DataStage and QualityStage parallel jobs. It is available from IBM InfoSphere Information Server version 9.1 and higher and can be used in the following topologies: as a source, as a target, as a transformer, and as a lookup stage. For more information on the Java Integration Stage, see Related topics.
The DataStage Java Pack is a collection of two plug-in stages, Java Transformer and Java Client, through which you can call Java applications from DataStage. The Java Pack is available from DataStage version 7.5.x and higher.
The Java Transformer stage is an active stage that can call a Java application which reads incoming data, transforms it, and writes it to an output link defined in a DataStage job. The Java Client stage is a passive stage that can be used as a source (producing data), as a target (consuming data), or as a lookup stage (performing lookup functions).
For more information on the Java Pack Stages, see Related topics.
https://www.ibm.com/developerworks/data/library/techarticle/dm-1305handling/index.html