Read many files from Kafka Connect FileStreamSourceTask

I am reading one log file into Kafka and creating a topic from it, and this works. To read the file, I edited config/connect-file-source.properties as described in Step 7 of the Kafka Quickstart (http://kafka.apache.org/quickstart#quickstart_kafkaconnect).
Now I would like to read many files. In config/connect-file-source.properties I set the file property to a pattern, for instance:
file=/etc/logs/archive.log*
I want to read every file in the logs directory that matches the pattern archive*.log, but this line doesn't work.
What is the best way to read files matching a pattern using config/connect-file-source.properties?

In config/connect-file-source.properties, the connector class is FileStreamSource, which uses FileStreamSourceTask as its task class.
The task reads its file through a FileInputStream, so it can only open a single file; passing a directory name or a pattern does not work.
You should either implement your own Source and SourceTask classes or use an existing connector that supports this feature, such as kafka-connect-spooldir.
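If writing a connector is not an option, one possible workaround (not the approach recommended above, and only a sketch) is to expand the pattern yourself and generate one FileStreamSource properties file per matching file, since the standalone launcher accepts several connector property files. The function name, the connector naming scheme, and the topic below are assumptions made for illustration, and files that appear after generation are not picked up.

```python
import glob
import os


def connector_configs(pattern, topic):
    """Return {connector_name: properties_text}, one entry per file matching pattern."""
    configs = {}
    for index, path in enumerate(sorted(glob.glob(pattern))):
        name = f"file-source-{index}"  # hypothetical naming scheme; names must be unique
        configs[name] = "\n".join([
            f"name={name}",
            "connector.class=FileStreamSource",
            "tasks.max=1",
            f"file={os.path.abspath(path)}",  # each connector still reads exactly one file
            f"topic={topic}",
        ]) + "\n"
    return configs


if __name__ == "__main__":
    # Pattern taken from the question; the topic name is an assumption.
    for name, text in connector_configs("/etc/logs/archive*.log", "connect-test").items():
        print(f"# {name}.properties")
        print(text)
```

Each generated block would then be saved to its own .properties file and passed to the standalone launcher alongside the worker config.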

Related

Why does Kafka Connect Sftp source directory need to be writeable?

I am using this connector: https://docs.confluent.io/current/connect/kafka-connect-sftp/source-connector/index.html
When I configure the connector and check its status, I get the exception below:
org.apache.kafka.connect.errors.ConnectException: Directory for 'input.path' '/FOO' it not writable.\n\tat io.confluent.connect.sftp.source.SftpDirectoryPermission.directoryWritable
This makes no sense from a source standpoint, especially if you are connecting to a third-party source you do NOT control.

You need write permissions because the connector moves the files it has read into the configurable finished.path. This move into finished.path is explained in the link you provided:
Once a file has been read, it is placed into the configured finished.path directory.
The documentation on the configuration input.path states that you need write access to it:
input.path - The directory where Kafka Connect reads files that are processed. This directory must exist and be writable by the user running Connect.

Creating and using a custom kafka connect configuration provider

I have installed and tested Kafka Connect in distributed mode. It works: it connects to the configured sink and reads from the configured source.
With that in place, I moved on to hardening my installation. The area that I think needs immediate attention is that the only available means of creating a connector is through REST calls, which means I need to send my information over the wire, unprotected.
To address this, Kafka introduced the new ConfigProvider seen here.
This is helpful because it lets you set properties on the server and then reference them in the REST call, like so:
{
.
.
"property":"${file:/path/to/file:nameOfThePropertyInFile}"
.
.
}
This works really well: you add the property file on the server and the following config to the distributed.properties file:
config.providers=file # multiple comma-separated provider types can be specified here
config.providers.file.class=org.apache.kafka.common.config.provider.FileConfigProvider
While this solution works, it does not really ease my security concerns: the information has gone from being sent over the wire to sitting in a repository, in plain text for everyone to see.
The Kafka team foresaw this issue and allowed clients to write their own configuration providers by implementing the ConfigProvider interface.
I have created my own implementation and packaged it in a jar, giving it the suggested final name:
META-INF/services/org.apache.kafka.common.config.ConfigProvider
and added the following entry in the distributed file:
config.providers=cust
config.providers.cust.class=com.somename.configproviders.CustConfigProvider
However, I am getting an error from Connect stating that a class implementing ConfigProvider with the name:
com.somename.configproviders.CustConfigProvider
could not be found.
I am at a loss now, because the documentation on their site is not very explicit about how to configure custom config providers.
Has someone worked on a similar issue and could provide some insight into this? Any help would be appreciated.

I recently went through the steps to set up a custom ConfigProvider. The official documentation is ambiguous and confusing.
I have created my own implementation and packaged it in a jar, giving it the suggested final name:
META-INF/services/org.apache.kafka.common.config.ConfigProvider
You can give the jar any name you like, but it must be packaged in jar format with a .jar suffix.
Here are the complete steps. Suppose the fully-qualified name of your custom ConfigProvider is com.my.CustomConfigProvider.MyClass.
1. Create a file named org.apache.kafka.common.config.provider.ConfigProvider under the directory META-INF/services (the file name must be the fully-qualified name of the ConfigProvider interface). The file content is the fully-qualified class name:
com.my.CustomConfigProvider.MyClass
2. Include your source code and the META-INF folder above when generating the jar package. If you are using Maven, the service file goes under src/main/resources/META-INF/services.
3. Put your final jar file, say custom-config-provider-1.0.jar, under the Kafka worker plugin folder. This is the PLUGIN_PATH in the Kafka worker config file; the default is /usr/share/java.
4. Upload all the dependency jars to PLUGIN_PATH as well. Use the META-INF/MANIFEST.MF file inside your jar file to configure the Class-Path of the dependent jars that your code uses.
5. In the Kafka worker config file, create two additional properties:
CONNECT_CONFIG_PROVIDERS: 'mycustom', // Alias name of your ConfigProvider
CONNECT_CONFIG_PROVIDERS_MYCUSTOM_CLASS:'com.my.CustomConfigProvider.MyClass',
6. Restart the workers.
7. Update your connector config by sending a POST request with curl to the Kafka Connect REST API. In the connector config, you can reference a value inside the ConfigData returned from ConfigProvider:get(path, keys) with syntax like:
database.password=${mycustom:/path/pass/to/get/method:password}
Here ConfigData holds a map which contains {password: 123}.
If you are still seeing a ClassNotFound exception, your classpath is probably not set up correctly.
Note:
• If you are using AWS ECS/EC2, you need to set the worker config through environment variables.
• The worker config file and the connector config file are different files.
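To make the ${alias:path:key} syntax concrete, here is a minimal sketch of how such a placeholder could be resolved. It is an illustration in Python, not Kafka's implementation: DictProvider is a hypothetical stand-in for a ConfigProvider whose get(path, keys) returns a plain dict, and leaving unknown placeholders untouched is a choice made for this sketch only.

```python
import re

# Matches ${alias:path:key}; in this sketch the path must not contain a colon.
PLACEHOLDER = re.compile(r"\$\{([^:}]+):([^:}]*):([^}]+)\}")


class DictProvider:
    """Hypothetical stand-in for a ConfigProvider: get(path, keys) returns a dict."""

    def __init__(self, secrets):
        self._secrets = secrets  # {path: {key: value}}

    def get(self, path, keys):
        data = self._secrets.get(path, {})
        return {k: data[k] for k in keys if k in data}


def resolve(value, providers):
    """Replace each placeholder in value with the secret from the provider it names."""
    def lookup(match):
        alias, path, key = match.groups()
        provider = providers.get(alias)
        if provider is None:
            return match.group(0)  # unknown alias: leave the placeholder as it is
        return provider.get(path, [key]).get(key, match.group(0))
    return PLACEHOLDER.sub(lookup, value)


providers = {"mycustom": DictProvider({"/path/pass/to/get/method": {"password": "123"}})}
print(resolve("${mycustom:/path/pass/to/get/method:password}", providers))  # prints 123
```

The alias (mycustom) selects the provider, the middle part is handed to get as the path, and the last part is the key looked up in the returned data.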

Kafka Connect - Missing Text

Kafka Version : 2.12-2.1.1
I created a very simple example with a source and a sink connector, using the following command:
bin\windows\connect-standalone.bat config\connect-standalone.properties config\connect-file-source.properties config\connect-file-sink.properties
Source file name : test_2.txt
Sink file name : test.sink_2.txt
A topic named "connect-test-2" is used, and I created a consumer in PowerShell to show the result.
It worked perfectly the first time. However, after I rebooted my machine and started everything again, I found that some text was missing.
For example, when I type the lines below into the test_2.txt file and save it:
HAHAHA..
missing again
some text are missing
I am able to enter text
first letter is missing
testing testing.
The result window (consumer) and the sink file show the following:
As you can see, some text is missing, and I cannot figure out why this happens. Any advice?
[Added information below]
connect-file-source.properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test_2.txt
topic=connect-test-2
connect-file-sink.properties
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=test.sink_2.txt
topics=connect-test-2

I think the strange behaviour is caused by the way you are modifying the source file (test_2.txt).
There are two ways you could have applied changes after stopping the connector:
• Using some editor <-- I think you used this method
• Appending only new characters to the end of the file
FileStreamSource tracks changes based on the position in the file. You are running Kafka Connect in standalone mode, so the current position is written to the /tmp/connect.offsets file.
If you modify the source file with an editor, the whole content of the file is rewritten. However, FileStreamSource only checks whether the size has changed and then polls the characters whose offset in the file is greater than the last offset processed by the connector.
You should modify the source file only by appending new characters to the end of the file.
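The effect of position-based tracking can be reproduced with a small sketch. This is not the FileStreamSource code; poll is a hypothetical function that mimics the behaviour described above: it remembers a byte offset and only emits what lies beyond it.

```python
import os
import tempfile


def poll(path, offset):
    """Return (new_lines, new_offset), reading only the bytes past the stored offset."""
    if os.path.getsize(path) <= offset:
        return [], offset  # the file did not grow, so nothing is emitted
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    end = data.rfind(b"\n") + 1  # emit complete lines only
    lines = data[:end].decode("utf-8").splitlines()
    return lines, offset + end


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "test_2.txt")
    with open(source, "w", newline="") as handle:
        handle.write("first line\n")
    lines, offset = poll(source, 0)
    print(lines)  # ['first line']

    # An editor rewrites the whole file, here with new text inserted at the top.
    with open(source, "w", newline="") as handle:
        handle.write("HAHAHA..\nfirst line\n")
    lines, offset = poll(source, offset)
    print(lines)  # ['rst line']
```

After the rewrite, the stored offset (11 bytes) points into the middle of the new content, so the inserted line is never seen and the next line is emitted without its first letters.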

Spark Streaming textFileStream COPYING

I'm trying to monitor a directory in HDFS and process the data in files copied to it (to copy files from the local system to HDFS I use hdfs dfs -put). Sometimes this fails with: Spark Streaming: java.io.FileNotFoundException: File does not exist: .COPYING. I read about the problem in forums and in the question Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_
According to what I read, the problem is that Spark Streaming reads the file before it has finished being copied into HDFS. On GitHub:
https://github.com/maji2014/spark/blob/b5af1bdc3e35c53564926dcbc5c06217884598bb/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala , they say that they fixed the problem, but as far as I can see only for FileInputDStream, and I'm using textFileStream.
When I tried to use FileInputDStream, the IDE reported an error: the symbol is not accessible from this place.
Does anyone know how to filter out the files that are still COPYING? I tried:
var lines = ssc.textFileStream(arg(0)).filter(!_.contains("_COPYING_"))
but that didn't work, which is expected, because I guess the filter should be applied to the name of the file being processed, which I can't access.
As you can see, I did plenty of research before asking the question but didn't get lucky.
Any help, please?

So I had a look: writing with -put directly into the monitored directory is the wrong method. Look at the final comment: you have to rename the file into place in your shell script (hdfs dfs -mv performs a rename) so that the operation is atomic on HDFS.
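The idea behind the rename can be sketched on a local filesystem. This is an analogue only, not HDFS code: the publish function and the directory names are assumptions for illustration, and os.replace is a single-step rename only when the staging and watched directories are on the same filesystem.

```python
import os
import shutil


def publish(source, watched_dir, staging_dir):
    """Copy source into staging_dir, then rename it into watched_dir in one step."""
    name = os.path.basename(source)
    staged = os.path.join(staging_dir, name)
    shutil.copyfile(source, staged)  # the slow copy happens outside the watched directory
    final = os.path.join(watched_dir, name)
    os.replace(staged, final)  # rename: the file shows up complete or not at all
    return final
```

A process that lists watched_dir never sees a half-written file, because the file only appears there once the copy has finished. On HDFS the same pattern would mean copying into a staging directory that Spark does not monitor and then moving the file into the monitored one.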

Loading Amazon Redshift with a manifest, with an error in one file

When using the COPY command to load Amazon Redshift with a manifest, suppose one of the files contains an error.
Is there a way to just log the error for that file, but continue loading the other files?

The manifest file indicates whether a file is mandatory and whether an error should be generated if a file is not found. (Using a Manifest to Specify Data Files)
The COPY command will retry if it cannot read a file. (Errors When Reading Multiple Files)
The COPY command can specify a MAXERROR parameter that permits a certain number of errors before the COPY command fails. (MAXERROR)
When loading data from files, Amazon Redshift will report any errors in the STL_LOAD_ERRORS table. (STL_LOAD_ERRORS)

As said above, the MAXERROR parameter should satisfy this requirement.
In addition, the NOLOAD parameter of COPY checks the validity of the data without loading it. Running with NOLOAD is much faster because it only parses the files.
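As a rough sketch of how these pieces fit together, the helper below builds a manifest document and a COPY statement that uses MANIFEST, MAXERROR and, optionally, NOLOAD. The function names, table name, bucket, and IAM role are placeholders invented for this example; the code only produces text and does not connect to Redshift.

```python
import json


def build_manifest(urls, mandatory=True):
    """Return the manifest JSON listing each data file and whether it is mandatory."""
    entries = [{"url": url, "mandatory": mandatory} for url in urls]
    return json.dumps({"entries": entries}, indent=2)


def build_copy(table, manifest_url, iam_role, max_errors=0, noload=False):
    """Return a COPY statement that reads the manifest and tolerates max_errors errors."""
    parts = [
        f"COPY {table}",
        f"FROM '{manifest_url}'",
        f"IAM_ROLE '{iam_role}'",
        "MANIFEST",
    ]
    if max_errors:
        parts.append(f"MAXERROR {max_errors}")  # keep loading until this many errors occur
    if noload:
        parts.append("NOLOAD")  # validate the files without loading any rows
    return "\n".join(parts) + ";"


# All names below are hypothetical.
print(build_manifest(["s3://example-bucket/part-0001", "s3://example-bucket/part-0002"]))
print(build_copy("events", "s3://example-bucket/load.manifest",
                 "arn:aws:iam::123456789012:role/ExampleRole", max_errors=10))
```

Rows rejected during such a load are then reported in STL_LOAD_ERRORS, as noted above.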