Kafka Connect - Missing Text - apache-kafka

Kafka version: 2.12-2.1.1
I created a very simple example that creates a source and a sink connector using the following command:
bin\windows\connect-standalone.bat config\connect-standalone.properties config\connect-file-source.properties config\connect-file-sink.properties
Source file name: test_2.txt
Sink file name: test.sink_2.txt
A topic named "connect-test-2" is used and I created a consumer in PowerShell to show the result.
It worked perfectly the first time. However, after I rebooted my machine and started everything again, I found that some text was missing.
For example, when I type the lines below into the test_2.txt file and save it:
HAHAHA..
missing again
some text are missing
I am able to enter text
first letter is missing
testing testing.
The result window (the consumer) and the sink file show the following:
As you can see, some text is missing and I cannot figure out why this happens. Any advice?
[Added information below]
connect-file-source.properties
name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test_2.txt
topic=connect-test-2
connect-file-sink.properties
name=local-file-sink
connector.class=FileStreamSink
tasks.max=1
file=test.sink_2.txt
topics=connect-test-2

I think the strange behaviour is caused by the way you are modifying the source file (test_2.txt).
How did you apply the changes after stopping the connector?
Using an editor <-- I think you used this method
Appending only new characters to the end of the file
FileStreamSource tracks changes based on its position in the file. You are using Kafka Connect in standalone mode, so the current position is written to the /tmp/connect.offsets file.
If you modify the source file using an editor, the whole content of the file is rewritten. However, FileStreamSource only checks whether the size has changed, and it polls only the characters whose offsets in the file are greater than the last offset processed by the connector.
You should modify the source file only by appending new characters to the end of the file.
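For example, from the PowerShell window you already use for the consumer, you can append a line without rewriting the file. A minimal sketch, assuming test_2.txt sits in the current directory:

# append a new line to the source file instead of saving it from an editor
Add-Content -Path .\test_2.txt -Value "another line for connect-test-2"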

Related

Kafka connect - File Source

It seems that the Connect file source reads from the beginning of the file when the connector is restarted.
I couldn't find the relevant configuration option.
How can I specify that only the appended data should be read? (Please note this only happens when Connect is restarted; while it is running, only the data that got appended is read.)
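For what it's worth, in standalone mode the FileStreamSource connector resumes from the position stored in the worker's offset file, so a restart normally replays the whole file only when that offset file is lost (the quickstart default keeps it under /tmp). A sketch of the relevant worker settings, with an example path:

# connect-standalone.properties
# keep source offsets somewhere that survives reboots instead of /tmp
offset.storage.file.filename=/var/lib/kafka-connect/connect.offsets
offset.flush.interval.ms=10000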

Read many files from Kafka Connect FileStreamSourceTask

I am reading one log file into Kafka and creating a topic. This is successful. To read this file, I edit config/connect-file-source.properties for this purpose, following Step 7 of the Kafka Quickstart (http://kafka.apache.org/quickstart#quickstart_kafkaconnect).
But now I would like to read many files. In config/connect-file-source.properties I have set the file variable to a pattern, for instance:
file=/etc/logs/archive.log*
Because I want to read all the files in the logs directory that match the pattern archive*.log. But this line doesn't work.
What is the best way to read files matching a pattern using config/connect-file-source.properties?
In config/connect-file-source.properties, the source class is FileStreamSource and it uses FileStreamSourceTask as its task class.
It reads a single file using FileInputStream, so it cannot open multiple files at once (for example, by passing a directory name or a regex pattern).
You should implement your own SourceConnector and SourceTask classes, or use an existing connector that supports this feature, such as kafka-connect-spooldir.
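For reference, a kafka-connect-spooldir source configuration looks roughly like the sketch below; the property names are quoted from memory and vary between versions of that connector, so check its documentation before relying on them:

name=local-spooldir-source
connector.class=com.github.jcustenborder.kafka.connect.spooldir.SpoolDirLineDelimitedSourceConnector
tasks.max=1
# directory to watch and a regex for the files to pick up
input.path=/etc/logs
input.file.pattern=archive.*\.log
# processed files are moved to finished.path, unparseable ones to error.path
finished.path=/etc/logs/finished
error.path=/etc/logs/error
topic=connect-logs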

Spark Streaming textFileStream COPYING

I'm trying to monitor a directory in HDFS in order to read and process the data in files copied to it (to copy files from the local system to HDFS I use hdfs dfs -put). Sometimes it generates the problem Spark Streaming: java.io.FileNotFoundException: File does not exist: ._COPYING_, so I read about the problem in forums and in the question Spark Streaming: java.io.FileNotFoundException: File does not exist: <input_filename>._COPYING_.
According to what I read, the problem is linked to Spark Streaming reading the file before it finishes being copied into HDFS. On GitHub (https://github.com/maji2014/spark/blob/b5af1bdc3e35c53564926dcbc5c06217884598bb/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala) they say they corrected the problem, but only for FileInputDStream as far as I could see, and I'm using textFileStream.
When I tried to use FileInputDStream, the IDE throws an error: the symbol is not accessible from this place.
Does anyone know how to filter out the files that are still being copied (_COPYING_)? I tried:
var lines = ssc.textFileStream(arg(0)).filter(!_.contains("_COPYING_"))
but that didn't work, and that's expected, because I guess the filter should be applied to the name of the file being processed, which I can't access.
As you can see, I did plenty of research before asking the question, but didn't get lucky.
Any help please?
So I had a look: -put is the wrong method. Look at the final comment: you have to use a rename (hdfs dfs -mv) in your shell script to get an atomic transaction on HDFS.
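In practice that means copying into a staging directory on HDFS first and then moving the file into the monitored directory; within a single HDFS filesystem a move is a rename, which is atomic. A minimal sketch (the directory names are just examples):

# stage the file outside the directory Spark Streaming watches
hdfs dfs -put /local/logs/archive.log /staging/
# then rename it into the monitored directory in one atomic step
hdfs dfs -mv /staging/archive.log /monitored/archive.log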

Loading Amazon Redshift with a manifest, with an error in one file

When using the COPY command to load Amazon Redshift with a manifest, suppose one of the files contains an error.
Is there a way to just log the error for that file, but continue loading the other files?
The manifest file indicates whether a file is mandatory and whether an error should be generated if a file is not found. (Using a Manifest to Specify Data Files)
The COPY command will retry if it cannot read a file. (Errors When Reading Multiple Files)
The COPY command can specify a MAXERRORS parameter that permits a certain number of errors before the COPY command fails. (MAXERROR)
When loading data from files, Amazon Redshift will report any errors in the STL_LOAD_ERRORS table. (STL_LOAD_ERRORS)
As said above, the MAXERROR parameter should satisfy the above requirement.
In addition, the NOLOAD parameter checks the validity of the data without loading it. Running COPY with the NOLOAD parameter is much faster, as it only parses the files.
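Putting those options together, a rough sketch of such a COPY follows; the table, bucket and IAM role names are placeholders:

-- tolerate up to 10 bad rows in total; row-level errors are logged in STL_LOAD_ERRORS
COPY my_table
FROM 's3://my-bucket/manifests/load.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
MANIFEST
MAXERROR 10;

-- optional dry run: parse and validate the files without loading any data
COPY my_table
FROM 's3://my-bucket/manifests/load.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
MANIFEST
NOLOAD;

-- inspect whatever was rejected
SELECT * FROM stl_load_errors ORDER BY starttime DESC LIMIT 10;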

Getting null.txt file when using Cygnus HDFS sink

I'm using Cygnus 0.5 with the default configuration for the HDFS sink. In order to make it run, I have deactivated the "ds" interceptor (otherwise I get an error at start time that prevents Cygnus from starting, related to not finding the matching table file).
Cygnus seems to work, but the file in which entity information is stored in HDFS gets a weird name: "null.txt". How can I fix this?
First of all, do not deactivate the DestinationExtractor interceptor. This is the piece of code that infers the destination where the context data notified by Orion is going to be persisted. Please observe that the destination may refer to an HDFS file name, a MySQL table name or a CKAN resource name, depending on the sinks you have configured. Once inferred, the destination is added to the internal Flume event as a header called destination so that the sinks know where to persist the data. Thus, if the interceptor is deactivated, such a header is not found by the sinks and a null name is used as the destination name.
Regarding the "matching table file not found" problem you experienced (which led you to deactivate the interceptor), it was due to the Cygnus configuration template having a bad default value for the cygnusagent.sources.http-source.interceptors.de.matching_table parameter. This has been solved in Cygnus 0.5.1.
A workaround until Cygnus 0.5.1 is released:
Do not deactivate the DestinationExtractor (as @frb says in his answer).
Create an empty matching table file and use it for the matching_table configuration, i.e.:
touch /tmp/dummy_table.conf
Then set the following in the Cygnus configuration file:
cygnusagent.sources.http-source.interceptors.de.matching_table = /tmp/dummy_table.conf