Beam pipeline with CheckStopReadingFn throws IllegalStateException when CheckStopReadingFn returns true

I am using Apache Beam 2.29.0 to consume from Kafka. I added a CheckStopReadingFn function to stop reading from Kafka once its check returns true. My pipeline is created like this:
return p.apply("Read from Kafka", KafkaIO.<byte[], GenericRecord>read()
.withBootstrapServers(options.getSourceBrokers().get())
.withTopics(srcTopics)
.withConsumerConfigUpdates(consumerExtraOptions)
.withConsumerFactoryFn(new ConsumerFactoryFn(offsetInfo))
.withConsumerConfigUpdates(ImmutableMap.of("group.id", "my_group"))
.withCheckStopReadingFn(new CheckStopReadingFn(options.getProjectId().get()))
.withKeyDeserializer(ByteArrayDeserializer.class)
.withStartReadTime(Instant.ofEpochMilli(options.getStartTimestamp().get()))
.withValueDeserializer(ConfluentSchemaRegistryDeserializerProvider.of(options.getSourceSchemaRegistryUrl().get(), options.getSourceSubject().get()))
.withoutMetadata()).apply("Drop keys", Values.<GenericRecord>create())
.apply("Windowing of " + windowDuration + " seconds", Window.<GenericRecord>into(FixedWindows.of(Duration.standardSeconds(windowDuration))));
The place in Beam that throws the exception is the OffsetRangeTracker method below, because lastAttemptedOffset is null:
@Override
public void checkDone() throws IllegalStateException {
if (range.getFrom() == range.getTo()) {
return;
}
checkState(
lastAttemptedOffset != null,
"Last attempted offset should not be null. No work was claimed in non-empty range %s.",
range);
checkState(
lastAttemptedOffset >= range.getTo() - 1,
"Last attempted offset was %s in range %s, claiming work in [%s, %s) was not attempted",
lastAttemptedOffset,
range,
lastAttemptedOffset + 1,
range.getTo());
}
Any ideas what the problem might be?
Here is my CheckStopReadingFn implementation:
private static class CheckStopReadingFn implements SerializableFunction<TopicPartition, Boolean> {
final private String projectId;
final private String bucketName;
final private String subdirectory;
CheckStopReadingFn(String projectId, String bucketName, String subdirectory) {
this.projectId = projectId;
this.bucketName = bucketName;
this.subdirectory = subdirectory;
}
@Override
public Boolean apply(TopicPartition topicPartition) {
return GCSUtility.filesExist(projectId, bucketName, subdirectory, Set.of(topicPartition.toString()));
}
}

Related

Select immediate predecessor by date

I am new to Drools and I'm using Drools 7.12.0 to validate a set of meter readings, which look like this:
public class MeterReading {
private long id;
private LocalDate readDate;
private int value;
private String meterId;
private boolean valid;
/* Getters & Setters omitted */
}
As part of the validation I need to compare the values of each MeterReading with its immediate predecessor by readDate.
I first tried using 'accumulate'
when $mr: MeterReading()
$previousDate: LocalDate() from accumulate(MeterReading($pdate: readDate < $mr.readDate ), max($pdate))
then
System.out.println($mr.getId() + ":" + $previousDate);
end
but then discovered that this only returns the date of the previous meter read, not the object that contains it. I then tried a custom accumulate with
when
$mr: MeterReading()
$previous: MeterReading() from accumulate(
$p: MeterReading(id != $mr.id),
init( MeterReading prev = null; ),
action( if( prev == null || $p.readDate < prev.readDate) {
prev = $p;
}),
result(prev))
then
System.out.println($mr.getId() + ":" + $previous.getId() + ":" + $previous.getReadDate());
end
but this selects the earliest read in the whole set of meter readings, not the immediate predecessor. Can someone point me in the right direction as to what I should be doing or reading in order to select the immediate predecessor of each individual meter read?
Regards
After further research I found this article http://planet.jboss.org/post/how_to_implement_accumulate_functions, which I used to write my own accumulate function:
public class PreviousReadFinder implements AccumulateFunction {
@Override
public Serializable createContext() {
return new PreviousReadFinderContext();
}
@Override
public void init(Serializable context) throws Exception {
PreviousReadFinderContext prfc = (PreviousReadFinderContext) context;
prfc.list.clear();
}
@Override
public void accumulate(Serializable context, Object value) {
PreviousReadFinderContext prfc = (PreviousReadFinderContext) context;
prfc.list.add((MeterReading) value);
}
@Override
public void reverse(Serializable context, Object value) throws Exception {
PreviousReadFinderContext prfc = (PreviousReadFinderContext) context;
prfc.list.remove((MeterReading) value);
}
@Override
public Object getResult(Serializable context) throws Exception {
PreviousReadFinderContext prfc = (PreviousReadFinderContext) context;
return prfc.findLatestReadDate();
}
@Override
public boolean supportsReverse() {
return true;
}
@Override
public Class<?> getResultType() {
return MeterReading.class;
}
@Override
public void writeExternal(ObjectOutput out) throws IOException {
}
@Override
public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
}
private static class PreviousReadFinderContext implements Serializable {
List<MeterReading> list = new ArrayList<>();
public Object findLatestReadDate() {
Optional<MeterReading> optional = list.stream().max(Comparator.comparing(MeterReading::getReadDate));
if (optional.isPresent()) {
MeterReading to = optional.get();
return to;
}
return null;
}
}
}
and my rule is now
rule "Opening Read With Previous"
dialect "mvel"
when $mr: MeterReading()
$pmr: MeterReading() from accumulate($p: MeterReading(readDate < $mr.readDate ), previousReading($p))
then
System.out.println($mr.getId() + ":" + $pmr.getReadDate());
end
How do I write a rule to select the earliest meter reading in the set, the one which does not have a previous read?

How to throttle flink output to kafka?

I want to send 100 messages/second from my stream to a Kafka topic. I have more than enough data in the stream to do so.
So far, I have found the windowing concept, but I am unable to adapt it to my use case.
You could do this easily with a ProcessFunction. You would keep a counter in Flink state, and only emit elements when the counter is less than 100. Meanwhile, use a timer to reset the counter to zero once a second.
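A minimal sketch of that approach (the class name, state name, and the choice to drop rather than buffer over-limit elements are illustrative assumptions, not part of the original answer):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits at most LIMIT_PER_SECOND elements per key per second. A processing-time
// timer clears the counter so the next second starts fresh; excess elements are
// simply dropped here (buffering them would need additional state).
public class ThrottleFunction extends KeyedProcessFunction<String, String, String> {

    private static final int LIMIT_PER_SECOND = 100;
    private transient ValueState<Integer> emittedThisSecond;

    @Override
    public void open(Configuration parameters) {
        emittedThisSecond = getRuntimeContext()
                .getState(new ValueStateDescriptor<>("emitted-this-second", Integer.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        Integer count = emittedThisSecond.value();
        if (count == null) {
            // First element of a new second: schedule the counter reset one second from now.
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 1000);
            count = 0;
        }
        if (count < LIMIT_PER_SECOND) {
            emittedThisSecond.update(count + 1);
            out.collect(value);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        // Reset the counter so the next second's elements can be emitted again.
        emittedThisSecond.clear();
    }
}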
On Flink v1.15, I created the following function.
Refer to checkpointing_under_backpressure
and process_function.
public class RateLimitFunction extends KeyedProcessFunction<String, String, String> {
private transient ValueState<Long> counter;
private transient ValueState<Long> lastTimestamp;
private final Long count;
private final Long millisecond;
public RateLimitFunction(Long count, Long millisecond) {
this.count = count;
this.millisecond = millisecond;
}
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
counter = getRuntimeContext()
.getState(new ValueStateDescriptor<>("counter", TypeInformation.of(Long.class)));
lastTimestamp = getRuntimeContext()
.getState(new ValueStateDescriptor<>("last-timestamp", TypeInformation.of(Long.class)));
}
@Override
public void processElement(String value, KeyedProcessFunction<String, String, String>.Context ctx,
Collector<String> out) throws Exception {
ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime());
long current = counter.value() == null ? 0L : counter.value();
if (current < count) {
counter.update(current + 1L);
out.collect(value);
} else {
if (lastTimestamp.value() == null) {
lastTimestamp.update(ctx.timerService().currentProcessingTime());
}
Thread.sleep(millisecond);
out.collect(value);
}
}
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
if (lastTimestamp.value() != null && lastTimestamp.value() + millisecond <= timestamp) {
counter.update(0L);
lastTimestamp.update(null);
}
}
}
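For reference, a function like this would typically be wired in along these lines (the input stream is assumed; keying everything to one constant value is an illustrative way to get a single global limit):

// Illustrative wiring: one constant key routes all elements through a single
// RateLimitFunction instance, enforcing 100 elements per 1000 ms overall.
DataStream<String> throttled = input
        .keyBy(value -> "all")
        .process(new RateLimitFunction(100L, 1000L));

Note that a single key funnels all traffic through one parallel subtask; with higher parallelism you would key by something meaningful and divide the budget per key.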

Spring batch : FlatFileItemWriter header never called

I have a weird issue with my FlatFileItemWriter callbacks.
I have a custom ItemWriter implementing both FlatFileFooterCallback and FlatFileHeaderCallback. Consequently, I set the header and footer callbacks on my FlatFileItemWriter like this:
ItemWriter Bean
@Bean
@StepScope
public ItemWriter<CityItem> writer(FlatFileItemWriter<CityProcessed> flatWriter, @Value("#{jobExecutionContext[inputFile]}") String inputFile) {
CityItemWriter itemWriter = new CityItemWriter();
flatWriter.setHeaderCallback(itemWriter);
flatWriter.setFooterCallback(itemWriter);
itemWriter.setDelegate(flatWriter);
itemWriter.setInputFileName(inputFile);
return itemWriter;
}
FlatFileItemWriter Bean
@Bean
@StepScope
public FlatFileItemWriter<CityProcessed> flatFileWriterArchive(@Value("#{jobExecutionContext[outputFileArchive]}") String outputFile) {
FlatFileItemWriter<CityProcessed> flatWriter = new FlatFileItemWriter<CityProcessed>();
FileSystemResource isr;
isr = new FileSystemResource(new File(outputFile));
flatWriter.setResource(isr);
DelimitedLineAggregator<CityProcessed> aggregator = new DelimitedLineAggregator<CityProcessed>();
aggregator.setDelimiter(";");
BeanWrapperFieldExtractor<CityProcessed> beanWrapper = new BeanWrapperFieldExtractor<CityProcessed>();
beanWrapper.setNames(new String[]{
"country", "name", "population", "popUnder25", "pop25To50", "pop50to75", "popMoreThan75"
});
aggregator.setFieldExtractor(beanWrapper);
flatWriter.setLineAggregator(aggregator);
flatWriter.setEncoding("ISO-8859-1");
return flatWriter;
}
Step Bean
@Bean
public Step stepImport(StepBuilderFactory stepBuilderFactory, ItemReader<CityFile> reader, ItemWriter<CityItem> writer, ItemProcessor<CityFile, CityItem> processor,
@Qualifier("flatFileWriterArchive") FlatFileItemWriter<CityProcessed> flatFileWriterArchive, ExecutionContextPromotionListener executionContextListener) {
return stepBuilderFactory.get("stepImport").<CityFile, CityItem> chunk(10).reader(reader(null)).processor(processor).writer(writer).stream(flatFileWriterArchive)
.listener(executionContextListener).build();
}
I have the classic content in my writeFooter, writeHeader and write methods.
ItemWriter code
public class CityItemWriter implements ItemWriter<CityItem>, FlatFileFooterCallback, FlatFileHeaderCallback, ItemStream {
private FlatFileItemWriter<CityProcessed> writer;
private static int totalUnknown = 0;
private static int totalSup10000 = 0;
private static int totalInf10000 = 0;
private String inputFileName = "-";
public void setDelegate(FlatFileItemWriter<CityProcessed> delegate) {
writer = delegate;
}
public void setInputFileName(String name) {
inputFileName = name;
}
private Predicate<String> isNullValue() {
return p -> p == null;
}
@Override
public void write(List<? extends CityItem> cities) throws Exception {
List<CityProcessed> citiesCSV = new ArrayList<>();
for (CityItem item : cities) {
String populationAsString = "";
String less25AsString = "";
String more25AsString = "";
/*
* Some processing to get total Unknown/Sup 10000/Inf 10000
* and other data
*/
// Write in CSV file
CityProcessed cre = new CityProcessed();
cre.setCountry(item.getCountry());
cre.setName(item.getName());
cre.setPopulation(populationAsString);
cre.setLess25(less25AsString);
cre.setMore25(more25AsString);
citiesCSV.add(cre);
}
writer.write(citiesCSV);
}
@Override
public void writeFooter(Writer fileWriter) throws IOException {
String newLine = "\r\n";
String unknownLine = "Subtotal:;Unknown;" + totalUnknown + newLine;
String sup10000Line = ";Sum Sup 10000;" + totalSup10000 + newLine;
String inf10000Line = ";Sum Inf 10000;" + totalInf10000 + newLine;
String totalLine = "Total:;;" + (totalUnknown + totalSup10000 + totalInf10000) + newLine;
fileWriter.write(newLine);
fileWriter.write(unknownLine);
fileWriter.write(sup10000Line);
fileWriter.write(inf10000Line);
fileWriter.write(totalLine);
}
@Override
public void writeHeader(Writer fileWriter) throws IOException {
String newLine = "\r\n";
String firstLine = "FILE PROCESSED ON: ;" + new SimpleDateFormat("MM/dd/yyyy").format(new Date()) + newLine;
String secondLine = "Filename: ;" + inputFileName + newLine;
String colNames = "Country;Name;Population...;...having less than 25;...having more than 25";
fileWriter.write(firstLine);
fileWriter.write(secondLine);
fileWriter.write(newLine);
fileWriter.write(colNames);
}
@Override
public void close() throws ItemStreamException {
writer.close();
}
@Override
public void open(ExecutionContext context) throws ItemStreamException {
writer.open(context);
}
@Override
public void update(ExecutionContext context) throws ItemStreamException {
writer.update(context);
}
}
When I run my batch, I only get the data for each city (the write method part) and the footer lines. If I comment out the whole content of the write method and the footer callback, I still don't get the header lines. I tried adding a System.out.println() in my header callback; it looks like it's never called.
Here is an example of the CSV file produced by my batch:
France;Paris;2240621;Unknown;Unknown
France;Toulouse;439553;Unknown;Unknown
Spain;Barcelona;1620943;Unknown;Unknown
Spain;Madrid;3207247;Unknown;Unknown
[...]
Subtotal:;Unknown;2
;Sum Sup 10000;81
;Sum Inf 10000;17
Total:;;100
What is weird is that my header used to work before, when I first added both the footer and header callbacks. I didn't change them, and I don't see what I've done in my code to break the header callback... And of course, I have no saved copy of my first version. Because I only noticed now that the header has disappeared (I checked my last few output files, and it looks like the header has been missing for some time without my seeing it), I can't simply undo my modifications to see when/why it happened.
Do you have any idea how to solve this problem?
Thanks
When using Java config as you are, it's best to return the most specific type possible (the opposite of what you're normally told to do in Java programming). In this case, your writer bean returns ItemWriter, but it is step scoped. Because of this, a proxy is created that can only see the type your Java config returns, which in this case is ItemWriter, and it does not expose the methods of the ItemStream interface. If you return CityItemWriter, I'd expect things to work.
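A sketch of that change, keeping the body of the bean exactly as posted and only making the declared return type more specific (CityItemWriter instead of ItemWriter<CityItem>):

@Bean
@StepScope
public CityItemWriter writer(FlatFileItemWriter<CityProcessed> flatWriter,
        @Value("#{jobExecutionContext[inputFile]}") String inputFile) {
    CityItemWriter itemWriter = new CityItemWriter();
    flatWriter.setHeaderCallback(itemWriter);
    flatWriter.setFooterCallback(itemWriter);
    itemWriter.setDelegate(flatWriter);
    itemWriter.setInputFileName(inputFile);
    // Returning the concrete type lets the step-scoped proxy also expose the
    // ItemStream methods (open/update/close), per the reasoning above.
    return itemWriter;
}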

How to process multi line input records in Spark

I have each record spread across multiple lines in the input file (a very large file).
Ex:
Id: 2
ASIN: 0738700123
title: Test tile for this product
group: Book
salesrank: 168501
similar: 5 0738700811 1567184912 1567182813 0738700514 0738700915
categories: 2
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Wicca[12484]
|Books[283155]|Subjects[1000]|Religion & Spirituality[22]|Earth-Based Religions[12472]|Witchcraft[12486]
reviews: total: 12 downloaded: 12 avg rating: 4.5
2001-12-16 cutomer: A11NCO6YTE4BTJ rating: 5 votes: 5 helpful: 4
2002-1-7 cutomer: A9CQ3PLRNIR83 rating: 4 votes: 5 helpful: 5
How do I identify and process each multi-line record in Spark?
If the multi-line data has a defined record separator, you could use the Hadoop support for multi-line records, providing the separator through a Hadoop Configuration object:
Something like this should do:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration
conf.set("textinputformat.record.delimiter", "id:")
val dataset = sc.newAPIHadoopFile("/path/to/data", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val data = dataset.map(x=>x._2.toString)
This will provide you with an RDD[String] where each element corresponds to a record. Afterwards you need to parse each record following your application requirements.
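How that parsing looks depends entirely on your format; as a purely illustrative, framework-agnostic sketch in plain Java (field names taken from the sample above), simple "key: value" lines could be split like this, with the categories and reviews sections needing their own handling:

import java.util.LinkedHashMap;
import java.util.Map;

public class RecordParser {
    // Turns one multi-line record (one element of the RDD above) into a field map,
    // e.g. the line "ASIN: 0738700123" becomes the entry "ASIN" -> "0738700123".
    public static Map<String, String> parse(String record) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : record.split("\n")) {
            int sep = line.indexOf(':');
            if (sep > 0) {
                fields.put(line.substring(0, sep).trim(), line.substring(sep + 1).trim());
            }
        }
        return fields;
    }
}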
I have done this by implementing a custom input format and record reader.
public class ParagraphInputFormat extends TextInputFormat {
@Override
public RecordReader<LongWritable, Text> createRecordReader(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) {
return new ParagraphRecordReader();
}
}
public class ParagraphRecordReader extends RecordReader<LongWritable, Text> {
private long end;
private boolean stillInChunk = true;
private LongWritable key = new LongWritable();
private Text value = new Text();
private FSDataInputStream fsin;
private DataOutputBuffer buffer = new DataOutputBuffer();
private byte[] endTag = "\n\r\n".getBytes();
public void initialize(InputSplit inputSplit, TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException {
FileSplit split = (FileSplit) inputSplit;
Configuration conf = taskAttemptContext.getConfiguration();
Path path = split.getPath();
FileSystem fs = path.getFileSystem(conf);
fsin = fs.open(path);
long start = split.getStart();
end = split.getStart() + split.getLength();
fsin.seek(start);
if (start != 0) {
readUntilMatch(endTag, false);
}
}
public boolean nextKeyValue() throws IOException {
if (!stillInChunk) return false;
boolean status = readUntilMatch(endTag, true);
value = new Text();
value.set(buffer.getData(), 0, buffer.getLength());
key = new LongWritable(fsin.getPos());
buffer.reset();
if (!status) {
stillInChunk = false;
}
return true;
}
public LongWritable getCurrentKey() throws IOException, InterruptedException {
return key;
}
public Text getCurrentValue() throws IOException, InterruptedException {
return value;
}
public float getProgress() throws IOException, InterruptedException {
return 0;
}
public void close() throws IOException {
fsin.close();
}
private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
int i = 0;
while (true) {
int b = fsin.read();
if (b == -1) return false;
if (withinBlock) buffer.write(b);
if (b == match[i]) {
i++;
if (i >= match.length) {
return fsin.getPos() < end;
}
} else i = 0;
}
}
}
endTag identifies the end of each record.
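For illustration, assuming a JavaSparkContext named jsc and a data path, this input format can be plugged into Spark the same way TextInputFormat was used above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;

// Read the file with the custom ParagraphInputFormat, then keep only the record text.
JavaPairRDD<LongWritable, Text> records =
        jsc.newAPIHadoopFile("/path/to/data", ParagraphInputFormat.class,
                LongWritable.class, Text.class, new Configuration());
JavaRDD<String> data = records.map(pair -> pair._2().toString());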

Extracting CD tagged tokens using UIMA

I wrote an annotator that extracts all CD-tagged tokens, and the code looks like this:
public class WeightAnnotator extends JCasAnnotator_ImplBase {
private Logger logger = Logger.getLogger(getClass().getName());
public static List<Token> weightTokens = new ArrayList<Token>();
public static final String PARAM_STRING = "stringParam";
@ConfigurationParameter(name = PARAM_STRING)
private String stringParam;
@Override
public void initialize(UimaContext context) throws ResourceInitializationException {
super.initialize(context);
}
@Override
public void process(JCas jCas) throws AnalysisEngineProcessException {
logger.info("Starting processing.");
for (Sentence sentence : JCasUtil.select(jCas, Sentence.class)) {
List<Token> tokens = JCasUtil.selectCovered(jCas, Token.class, sentence);
for (Token token : tokens) {
int begin = token.getBegin();
int end = token.getEnd();
if (token.getPos().equals( PARAM_STRING)) {
WeightAnnotation ann = new WeightAnnotation(jCas, begin, end);
ann.addToIndexes();
System.out.println("Token: " + token.getCoveredText());
}
}
}
}
}
but when I try to create an iterator on it in a pipeline, the iterator returns null. Here is how my pipeline looks:
AnalysisEngineDescription weightDescriptor = AnalysisEngineFactory.createEngineDescription(WeightAnnotator.class,
WeightAnnotator.PARAM_STRING, "CD"
);
AggregateBuilder builder = new AggregateBuilder();
builder.add(sentenceDetectorDescription);
builder.add(tokenDescriptor);
builder.add(posDescriptor);
builder.add(weightDescriptor);
builder.add(writer);
for (JCas jcas : SimplePipeline.iteratePipeline(reader, builder.createAggregateDescription())) {
Iterator iterator1 = JCasUtil.iterator(jcas, WeightAnnotation.class);
while (iterator1.hasNext()) {
WeightAnnotation weights = (WeightAnnotation) iterator1.next();
logger.info("Token: " + weights.getCoveredText());
}
}
I generated WeightAnnotation and WeightAnnotation_Type using JCasGen. I debugged the entire code but I don't understand where I am getting this wrong. Any ideas on how to improve this are appreciated.