I have a PCollection that contains Avro records, and I use AvroIO to write them to output files. Everything worked fine until recently, when I found that if the collection is very small (around 20 records), the output does not show up. My guess is that some buffer is not being flushed. I think FileIO or AvroIO should have a flush function. How can I force a flush?
Here is the sink part of the pipeline:

myPCollection.apply(FileIO.<String, GenericRecord>writeDynamic()
    .by(new OutputGroupFun())
    .via(AvroIO.sink(eventSchema))
    .withNaming(new FileNameFn(myOption.getOutputGcsDir()))
    .withDestinationCoder(StringUtf8Coder.of())
    .withTempDirectory(myOption.getAvroTempDir())
    .withNumShards(1));
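For context, here is a minimal sketch of how such a pipeline is normally driven to completion (the MyOptions interface and the driver code are illustrative, not my actual code); FileIO only finalizes its temporary files once the pipeline, or the enclosing window in streaming mode, completes:

// Sketch only: hypothetical driver code around the sink shown above.
MyOptions myOption = PipelineOptionsFactory.fromArgs(args).withValidation().as(MyOptions.class);
Pipeline pipeline = Pipeline.create(myOption);

// ... build myPCollection and apply the FileIO.writeDynamic() sink shown above ...

// Blocking until the pipeline finishes is what lets the temp files get
// copied/renamed to their final destinations.
pipeline.run().waitUntilFinish();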
I am currently developing a Beam pipeline using the FlinkRunner and Apache Beam 2.29.
As per the suggestion in Beam File processing patterns, I have an unbounded pipeline listening to a Kafka topic for CSV filenames; once a filename is received, the file is processed through TextIO.readFiles().
We end up with two PCollections: one from the file being processed and one from a lookup against an external datastore. The PCollections are joined using the Join extension, which forces us to set up triggering on these two PCollections. So I have defined something like the following for each PCollection, hoping that the end result after the join would produce some new output every time a new filename arrives from the Kafka topic we are monitoring.
PCollection<KV<String, Map<String, AttributeValue>>> lookupTable =
    LookupTable.getPspLookupData(p, lookupTableName, lookupTableRegionFilter)
        .apply("WindowB", Window.<KV<String, Map<String, AttributeValue>>>into(new GlobalWindows())
            .triggering(Repeatedly.forever(
                AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(15))))
            .withAllowedLateness(Duration.standardSeconds(5))
            .discardingFiredPanes());
But it simply does not work more than once. It seems that if I send one or more Kafka messages within the 15 seconds defined in plusDelayOf(), the data gets processed, but anything sent past those 15 seconds (from pipeline startup) is never processed and the pipeline is simply "stuck", despite the trigger being wrapped in Repeatedly.forever(...).
I have tried numerous combinations and I simply can't get it to work. I would welcome any ideas or suggestions to get this working; it feels like I am missing something basic, but I have been at this for hours.
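For what it is worth, one variation I am considering (a sketch only, untested) is to trigger on element count instead of on a processing-time delay, so that every new element fires a pane:

.apply("WindowB", Window.<KV<String, Map<String, AttributeValue>>>into(new GlobalWindows())
    // AfterPane.elementCountAtLeast(1) fires for every new element in the pane,
    // rather than waiting on a delay measured from the first element.
    .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
    .withAllowedLateness(Duration.standardSeconds(5))
    .discardingFiredPanes());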
Thanks,
Serge
I am using Flink streaming with the Kafka consumer connector (FlinkKafkaConsumer) and the file sink (StreamingFileSink) in cluster mode with an exactly-once policy.
The file sink writes the files to the local disk.
I've noticed that if a job fails and automatic restart is on, the task managers look for the leftover files from the last failed job (hidden files).
Obviously, since the tasks can be assigned to different task managers, this adds up to more failures over and over again.
The only solution I found so far is to delete the hidden files and resubmit the job.
If I understand correctly (and please correct me if I am wrong), the events in the hidden files were not committed to the bootstrap-server, so there is no data loss.
Is there a way to force Flink to ignore the files that were already written? Or maybe there is a better way to implement this (perhaps with savepoints)?
I got a very detailed answer on the Flink mailing list. TL;DR: in order to implement exactly-once, I have to use some kind of distributed filesystem.
The full answer:
A local filesystem is not the right choice for what you are trying to achieve. I don't think you can achieve a true exactly once policy in this setup. Let me elaborate on why.
The interesting bit is how it behaves on checkpoints. The behavior is controlled by a RollingPolicy. As you have not said which format you use, let's assume a row format first. For a row format the default rolling policy (which decides when to change a file from in-progress to pending) rolls the file when it reaches 128 MB, when it is older than 60 seconds, or when it has not been written to for 60 seconds. It does not roll on a checkpoint. Moreover, the StreamingFileSink considers the filesystem a durable sink that can be accessed after a restore. That implies that it will try to append to this file when restoring from a checkpoint/savepoint.
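For reference, a sketch of where those thresholds live (roughly the Flink 1.10 row-format API; the path and encoder are placeholders, and the values shown are the defaults described above):

// Sketch only: a row-format StreamingFileSink with an explicit DefaultRollingPolicy.
StreamingFileSink<String> sink = StreamingFileSink
    .forRowFormat(new Path("/tmp/output"), new SimpleStringEncoder<String>("UTF-8"))
    .withRollingPolicy(
        DefaultRollingPolicy.builder()
            .withMaxPartSize(128 * 1024 * 1024)                    // roll at 128 MB
            .withRolloverInterval(TimeUnit.SECONDS.toMillis(60))   // roll when the file is 60 s old
            .withInactivityInterval(TimeUnit.SECONDS.toMillis(60)) // roll after 60 s without writes
            .build())
    .build();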
Even if you rolled the files on every checkpoint, you might still face the problem of leftovers, because the StreamingFileSink moves files from pending to complete only after the checkpoint is completed. If a failure happens between finishing the checkpoint and moving the files, it will not be able to move them after a restore (it would do so if it had access).
Lastly, a completed checkpoint will contain the offsets of records that were processed successfully end-to-end, which means records that are assumed committed by the StreamingFileSink. These can be records written to an in-progress file with a pointer in the StreamingFileSink's checkpointed metadata, records in a "pending" file with an entry in the checkpointed metadata saying that this file has been completed, or records in "finished" files. [1]
Therefore as you can see there are multiple scenarios when the StreamingFileSink has to access the files after a restart.
The last thing: you mentioned committing to the "bootstrap-server". Bear in mind that Flink does not use offsets committed back to Kafka for guaranteeing consistency. It can write those offsets back, but only for monitoring/debugging purposes; Flink stores and restores the processed offsets from its checkpoints. [3]
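To make that concrete, a rough sketch (Flink 1.10-style API; the topic, group id and schema are placeholders) of how checkpointing drives both the offsets and the sink's pending-to-finished transitions:

// Sketch only.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "my-group");

FlinkKafkaConsumer<String> consumer =
    new FlinkKafkaConsumer<>("my-topic", new SimpleStringSchema(), props);
// Optional: also commit offsets back to Kafka on checkpoints, but that is only
// for monitoring; a restore always uses the offsets stored in the checkpoint.
consumer.setCommitOffsetsOnCheckpoints(true);

env.addSource(consumer); // ... attach the StreamingFileSink to this stream ...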
Let me know if it helped. I tried my best ;) BTW I highly encourage reading the linked sources as they try to describe all that in a more structured way.
I am also cc'ing Kostas, who knows more about the StreamingFileSink than I do, so he can maybe correct me somewhere.
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/streamfile_sink.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/kafka.html
[3] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/kafka.html#kafka-consumers-offset-committing-behaviour-configuration
I am building the following Kafka Streams topology (pseudo code):
gK = builder.stream().groupByKey();
g1 = gK.windowedBy(TimeWindows.of(Duration.parse("PT1H"))).reduce().mapValues().toStream().mapValues().selectKey();
g2 = gK.reduce().mapValues();
g1.leftJoin(g2).to();
If you notice, this is a diamond-shaped topology: it starts at a single input topic and ends at a single output topic, with messages flowing through two parallel branches that eventually get joined together at the end. One branch applies (tumbling?) windowing, the other does not. Both branches work on the same key (apart from the WindowedKey intermediately introduced by the windowing).
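Spelled out with concrete (hypothetical) types and placeholder lambdas, and with the intermediate mapValues() steps elided, the same topology looks roughly like this; MyValue stands in for my real message type:

// Sketch only.
StreamsBuilder builder = new StreamsBuilder();

KGroupedStream<String, MyValue> gK =
    builder.<String, MyValue>stream("input-topic").groupByKey();

// Windowed branch: hourly tumbling windows, then back to the original String key.
KStream<String, MyValue> g1 = gK
    .windowedBy(TimeWindows.of(Duration.ofHours(1)))
    .reduce((oldValue, newValue) -> newValue)               // placeholder reducer
    .toStream()
    .selectKey((windowedKey, value) -> windowedKey.key());  // drop the window from the key

// Unwindowed branch: a plain running reduce per key.
KTable<String, MyValue> g2 = gK.reduce((oldValue, newValue) -> newValue);

// Stream-table left join on the shared String key, then write to the output topic.
g1.leftJoin(g2, (windowed, running) -> windowed)            // placeholder joiner
  .to("output-topic");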
The timestamps of my messages are event-time: they are picked from the message body by my custom configured TimestampExtractor implementation. The actual timestamps in my messages are several years in the past.
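For illustration, the extractor looks roughly like this (MyValue and its getEventTime() accessor are placeholders for my real payload type):

// Sketch only: pulls the event time from the deserialized message body.
public class EventTimeExtractor implements TimestampExtractor {
    @Override
    public long extract(ConsumerRecord<Object, Object> record, long previousTimestamp) {
        Object value = record.value();
        if (value instanceof MyValue) {
            return ((MyValue) value).getEventTime(); // epoch millis taken from the body
        }
        return previousTimestamp;                    // fallback for unexpected payloads
    }
}
// Registered via:
// props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, EventTimeExtractor.class);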
That all works well at first sight in my unit tests with a couple of input/output messages and in the runtime environment (with real Kafka).
The problem seems to come when the number of messages starts being significant (e.g. 40K).
My failing scenario is the following:

1. ~40K records with the same key get uploaded into the input topic first.
2. ~40K updates come out of the output topic, as expected.
3. Another ~40K records with a single key that is different from the key used in step 1) get uploaded into the input topic.
4. Only ~100 updates come out of the output topic, instead of the expected new ~40K updates. There is nothing special about those ~100 updates; their contents seem to be right, but only for certain time windows. For other time windows there are no updates, even though the flow logic and input data should definitely generate 40K records. In fact, when I swap the datasets in steps 1) and 3), I get exactly the same situation: ~40K updates coming from the second dataset and the same ~100 from the first.
I can easily reproduce this issue locally in unit tests using TopologyTestDriver (but only with bigger numbers of input records).
In my tests, I've tried disabling caching with StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG. Unfortunately, that didn't make any difference.
UPDATE
I tried both reduce() calls and aggregate() calls instead; the issue persists in both cases.
Something else I noticed is that, with StreamsConfig.TOPOLOGY_OPTIMIZATION set to StreamsConfig.OPTIMIZE as well as without it, the mapValues() handler gets called in the debugger before the preceding reduce() (or aggregate()) handlers, at least the first time. I did not expect that.
I tried both join() and leftJoin(); unfortunately, same result.
In the debugger, the second portion of the data does not trigger the reduce() handler in the "left" branch at all, but it does trigger the reduce() handler in the "right" branch.
With my configuration, if the number of records in each dataset is 100, the problem does not manifest itself and I get the 200 output messages I expect. When I raise the number to 200 per dataset, I get fewer than the expected 400 messages out.
So, at the moment it seems that something like "old" windows get dropped and new records for those old windows get ignored by the stream.
There is a window retention setting that can be configured, but with its default value (which I use) I was expecting windows to retain their state and stay active for at least 12 hours, which significantly exceeds the duration of my unit test run.
Tried to amend the left reducer with the following window store config:

Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        Duration.ofHours(1),
        false))
still no difference in results.
The same issue persists even with only the single "left" branch, without the "right" branch and without the join(). It seems the problem is in the window retention settings of my setup. The timestamps (event-time) of my input records span 2 years, and the second dataset starts again from the beginning of those 2 years. This place in Kafka Streams makes sure that the second dataset's records get ignored:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/InMemoryWindowStore.java#L125
Kafka Streams Version is 2.4.0. Also using Confluent dependencies version 5.4.0.
My questions are
What could be the reason for such behaviour?
Did I miss anything in my stream topology?
Is such topology expected to work at all?
After some debugging time I found the reason for my problem.
My input datasets contain records with timestamps that span 2 years. I load the first dataset, and with that the "observed" stream time gets set to the maximum timestamp from that input dataset.
The upload of the second dataset, which starts with records whose timestamps are 2 years before the new observed time, causes Kafka Streams to internally drop those messages. This can be seen if you set the Kafka logging to TRACE level.
So, to fix my problem I had to configure the retention and grace period for my windows:
instead of

.windowedBy(TimeWindows.of(windowSize))

I had to specify

.windowedBy(TimeWindows.of(windowSize).grace(Duration.ofDays(5 * 365)))
Also, I had to explicitly configure reducer storage settings as:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        windowSize,
        false))
That's it, the output is as expected.
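Put together, the fixed "left" windowed reduce looks roughly like this (the reducer lambda, MyValue type and serde are placeholders):

// Sketch only: the grace period lets records that are far behind the observed
// stream time still land in their (old) windows, and the store retention is
// sized to cover the full event-time span of the data.
Duration windowSize = Duration.ofHours(1);

KTable<Windowed<String>, MyValue> leftReduced = gK
    .windowedBy(TimeWindows.of(windowSize).grace(Duration.ofDays(5 * 365)))
    .reduce(
        (oldValue, newValue) -> newValue,        // placeholder reducer
        Materialized.<String, MyValue>as(
                Stores.inMemoryWindowStore(
                    "rollup-left-reduce",
                    Duration.ofDays(5 * 365),    // store retention
                    windowSize,
                    false))
            .withKeySerde(Serdes.String())
            .withValueSerde(myValueSerde));      // placeholder serde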
In DNA informatics the files are massive (300 GB each, and biobanks have hundreds of thousands of files) and they need to go through 6 or so lengthy downstream pipelines (hours to weeks). Because I do not work at the company that manufactures the sequencing machines, I do not have access to the data as it is being generated... nor do I write assembly language.
What I would like to do is transform the lines of text from that 300 GB file into stream events, then pass these messages through the 6 pipelines with Kafka brokers handing off to Spark Streaming between each pipeline.
Is this possible? Is this the wrong use case? It would be nice to rerun single events as opposed to entire failed batches.
Desired Workflow:
------pipe1------
_------pipe2------
__------pipe3------
___------pipe4------
Current Workflow:
------pipe1------
_________________------pipe2------
__________________________________------pipe3------
___________________________________________________------pipe4------
Kafka is not meant for sending files, only relatively small events. Even if you did send a file line by line, you would need to know how to put the file back together to process it, and thus you would effectively be doing the same thing as streaming the file through a raw TCP socket.
Kafka has a default maximum message size of 1 MB, and while you can increase it, I wouldn't recommend pushing it much beyond the double-digit MB range.
How can I send large messages with Kafka (over 15MB)?
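For reference, these are the settings involved if you do raise the limit (a sketch; the 20 MB value is illustrative):

// Sketch only.
// Broker/topic side (server.properties or per-topic config):
//   message.max.bytes=20971520
//   replica.fetch.max.bytes=20971520
Properties producerProps = new Properties();
producerProps.put("max.request.size", "20971520");          // producer request limit

Properties consumerProps = new Properties();
consumerProps.put("max.partition.fetch.bytes", "20971520"); // consumer fetch limit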
If you really needed to get data like that through Kafka, the recommended pattern is to put your large files on external storage (HDFS, S3, whatever), then put the URI of the file in the Kafka event, and let consumers deal with reading that data source.
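A rough sketch of that claim-check pattern (the topic name, key and URI are placeholders):

// Sketch only: the event carries a pointer to the data, not the data itself.
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    String uri = "s3://my-bucket/runs/sample-001.fastq.gz";  // placeholder location
    producer.send(new ProducerRecord<>("sequencing-files", "sample-001", uri));
}
// Consumers read the URI and fetch the file from the external store themselves.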
If the files have any structure at all (pages, for example), then you could use Spark and a custom Hadoop InputFormat to serialize them and process the data in parallel that way. It doesn't necessarily have to go through Kafka, though. You could also try Apache NiFi, which I hear handles larger files better (maybe not GB-sized, though).
I am trying to use a file as my Kafka producer. The source file grows continuously (say, 20 records/lines per second). Below is a post similar to my problem:
How to write a file to Kafka Producer
But in this case the whole file is read and added to the Kafka topic every time a new line is inserted into the file. I want only the newly appended lines to be sent to the topic (i.e. if the file already holds 10 lines and 4 more lines are appended to it, only those 4 lines should be sent to the topic).
Is there a way to achieve this?
Other solutions tried:
Apache Flume with the source type 'spooldir'. But it was of no use, since it reads data from new files added to the directory, not data appended to an already-read file.
We also tried the Flume source type 'exec' with the command 'tail -F /path/file-name'. This doesn't seem to work either.
Suggestions for any other tool are also welcome, as my objective is to read the data from the file in real time (i.e. I need the data as soon as it is inserted into the file).
There are a couple of options that you could look at, depending on your exact needs.
Kafka Connect
As stated by Chin Huang above, the FileStreamSource connector from Kafka Connect should be able to do what you want without installing additional software. Check out the Connect Quickstart for guidance on how to get this up and running; they actually have an example for reading a file into Kafka.
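A minimal standalone configuration for it looks roughly like this (file path and topic name are placeholders), modeled on the connect-file-source.properties example that ships with Kafka:

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=/path/to/your/file
topic=topicname

You would run it with bin/connect-standalone.sh config/connect-standalone.properties connect-file-source.properties; the connector keeps track of its position in the file, so only newly appended lines are sent.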
Logstash
Logstash is the classic option for something like this; with its Kafka output it will do just what you want, for one or multiple files. The following configuration should give you roughly what you want:
input {
  file {
    path => "/path/to/your/file"
  }
}
output {
  kafka {
    bootstrap_servers => "127.0.0.1:9092"
    topic_id => "topicname"
  }
}
Filebeat
Filebeat is pretty similar to Logstash; it just offers less functionality if you want to perform additional processing on the data read from the file. Also, it is written in Go instead of Java, so its footprint on the machine it's running on should be smaller.
The following should be a minimal config to get you started (from memory, you might need to add a parameter or two if they are mandatory):
filebeat.prospectors:
- type: log
  paths:
    - /path/to/your/file

output.kafka:
  hosts: ["127.0.0.1:9092"]
  topic: 'topicname'
Flume
If you want to revisit your Flume option, have a look at the TaildirSource. I have not used it, but it sounds like it should fit your use case quite nicely.
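An agent configuration for that route might look roughly like the following (untested, and the agent/channel names, position file and paths are placeholders); TAILDIR tracks its position per file, so only newly appended lines are shipped:

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /var/flume/taildir_position.json
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /path/to/your/file
a1.sources.r1.channels = c1

a1.channels.c1.type = memory

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.bootstrap.servers = 127.0.0.1:9092
a1.sinks.k1.kafka.topic = topicname
a1.sinks.k1.channel = c1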