Using multiple integration tools (frameworks) on HDFS

I am working on a small project. The aim of the project is to use framework-based ingestion tools to ingest data into a data lake.
- I will be ingesting data in batches.
- The data sources will be RDBMS tables, CSV files, and flat files.
I've done my research on ingestion tools and found plenty, such as Sqoop, Flume, Gobblin, Kafka, etc.
My question is: which ingestion tools or approaches do you recommend for this small project? (Keep in mind I'll be using HDFS as my lake.)
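For context on what batch ingestion into HDFS can look like in code, here is a minimal sketch using Spark's batch APIs. Spark is not one of the tools listed above, and the JDBC URL, table names, and paths are placeholders; treat this as an illustration of the general approach (pull from an RDBMS and a CSV drop zone, land both in HDFS), not a recommendation from the thread.

```scala
import org.apache.spark.sql.SparkSession

object BatchIngest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("batch-ingest")
      .getOrCreate()

    // Batch 1: pull an RDBMS table over JDBC (URL, table, and credentials are placeholders).
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "etl")
      .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
      .load()

    // Batch 2: pick up CSV / flat files dropped in a landing folder.
    val csvFeed = spark.read
      .option("header", "true")
      .csv("hdfs:///landing/csv/")

    // Land both in the lake as Parquet.
    orders.write.mode("overwrite").parquet("hdfs:///lake/raw/orders/")
    csvFeed.write.mode("append").parquet("hdfs:///lake/raw/csv_feeds/")

    spark.stop()
  }
}
```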

Related

External Kafka Stream Source in Spring Cloud Data Flow

I am migrating from StreamSets to Spring Cloud Data Flow. When looking at the module list, I realized that some of the sources are not listed in Spring Cloud Data Flow; one of them is the Kafka source.
My question is: why was the external Kafka source removed from the standard sources list in Spring Cloud Data Flow?
It is not that it has been removed; it simply does not exist yet. See https://github.com/spring-cloud/stream-applications/issues/265

HDFS file system, get latest folders using scala API

Our application reads data from several HDFS data folders. The folders are updated weekly, daily, or monthly, so based on the update period we need to find the latest path and then read the data.
We would like to do this programmatically in Scala, so are there libraries available for this?
We could only find the Hadoop filesystem API linked below, but were wondering whether better libraries are available:
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/package-summary.html
The linked package is the recommended way to use the HDFS API programmatically without going through hadoop fs CLI scripts. Any other library you may find would be built on top of the same package.
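To make that concrete, here is a minimal Scala sketch against the org.apache.hadoop.fs package: it lists the immediate sub-directories of a base folder (the path here is hypothetical) and picks the one with the most recent modification time.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object LatestFolder {
  def main(args: Array[String]): Unit = {
    // Picks up fs.defaultFS from the Hadoop configuration on the classpath.
    val fs = FileSystem.get(new Configuration())
    val base = new Path("/data/ingest") // hypothetical base folder

    // List immediate children, keep only directories, take the newest one.
    // (maxBy throws on an empty listing, so guard for that in real code.)
    val latest = fs.listStatus(base)
      .filter(_.isDirectory)
      .maxBy(_.getModificationTime)

    println(s"Latest folder: ${latest.getPath}")
  }
}
```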

What are the main differences between HDF schema registry and the Confluent one?

I was wondering about the differences between the Kafka embedded in the HDF suite and the Confluent one, specifically the schema registry tool.
https://registry-project.readthedocs.io/en/latest/competition.html
The Hortonworks schema registry depends on a MySQL or Postgres database (supposedly this is pluggable, so you could write your own storage layer) to store its schemas, while the Confluent one stores schemas directly in Kafka. Therefore there's more infrastructure to manage with the Hortonworks implementation.
The Hortonworks one supposedly has some plugin mechanism so that it can support the Confluent serialization format, but I've not seen it used in practice. It also supposedly has pluggable schema formats, but I've not seen anything other than Avro used with it.
The Hortonworks one has its own web UI and rich editor, compared to the Confluent one, where you're limited to third-party tools or purchasing a license for Confluent Control Center.
Hortonworks aims to provide integrations with Spark, Nifi, SMM, Storm, Atlas, possibly Ranger, and other components of their HDF stack. Confluent Schema Registry support in those tools is all community driven.

Which talend product to download?

Which Talend tool should I download? I need to use enterprise technologies like REST/SOAP/JSON/XML/JDBC/SFTP/XSD (Oracle).
My use cases are:
- Exposing services (REST/SOAP)
- Reading from files such as MT940, CSV and flat files, and storing the data in a database (Oracle)
- Using SFTP and file movements frequently
What is the difference between Talend Data Integration and Talend ESB?
Currently I have downloaded Talend Open Studio for Big Data.
Will this suffice?
It depends on what you want to do with it. I'm thinking Talend Data Integration; the free version should suffice if all you want to do is pull data.

Spark BigQuery Connector vs Python BigQuery Library

I'm currently working on a recommender system using PySpark and IPython Notebook. I want to generate recommendations from data stored in BigQuery. There are two options: the Spark BigQuery connector and the Python BigQuery library.
What are the pros and cons of these two tools?
The Python BQ library is the standard way to interact with BigQuery from Python, so it exposes the full API capabilities of BigQuery. The Spark BQ connector you mention is the Hadoop Connector, a Java Hadoop library that lets you read from and write to BigQuery using abstracted Hadoop classes. This will more closely resemble how you interact with native Hadoop inputs and outputs.
You can find example usage of the Hadoop Connector here.
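As a rough illustration of that Hadoop-style interaction, the sketch below reads a table through the connector's input format from Spark. It assumes the bigquery-connector jar is on the classpath, and the project ID and staging bucket are placeholders; the exact configuration keys may differ between connector versions.

```scala
import com.google.cloud.hadoop.io.bigquery.{BigQueryConfiguration, GsonBigQueryInputFormat}
import com.google.gson.JsonObject
import org.apache.hadoop.io.LongWritable
import org.apache.spark.sql.SparkSession

object BigQueryRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bq-read").getOrCreate()
    val sc = spark.sparkContext

    // The connector stages data through GCS, so it needs a project and a bucket.
    val conf = sc.hadoopConfiguration
    conf.set(BigQueryConfiguration.PROJECT_ID_KEY, "my-project")        // placeholder
    conf.set(BigQueryConfiguration.GCS_BUCKET_KEY, "my-staging-bucket") // placeholder
    BigQueryConfiguration.configureBigQueryInput(conf, "publicdata:samples.shakespeare")

    // Rows come back as (LongWritable, JsonObject) pairs, i.e. plain Hadoop input types.
    val rows = sc.newAPIHadoopRDD(
      conf,
      classOf[GsonBigQueryInputFormat],
      classOf[LongWritable],
      classOf[JsonObject])

    println(s"Row count: ${rows.count()}")
    spark.stop()
  }
}
```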