WordCount on KV<String, String>: counting words in the value while preserving the key? - apache-beam

When reading some of the Beam examples I was wondering if there's a way to run WordCount on the values while preserving the key, i.e. keep the counts associated with the key so that the link is not lost.
Trying to modify the WordCount example I tried something as follows:
public static class CountWords
extends PTransform<PCollection<KV<String, String>>, PCollection<KV<String, KV<String, Long>>>> {
@Override
public PCollection<KV<String, Long>> expand(PCollection<KV<String, String>> items) {
// Convert lines of text into individual words.
PCollection<String> words = items.apply(ParDo.of(new ExtractWordsFn()));
// Count the number of times each word occurs.
PCollection<KV<String, KV<String, Long>>> wordCounts = words.apply(Count.perElement());
return wordCounts;
}
}
Still can't wrap my head around this and see how I could perform the count while keeping the link to the key.
The output I would like is: given an input of KV(string_key_1, "some random text"), get an output of KV(string_key_1, {some: 1, random: 1, text: 1}).
Is there a way to link a PCollection to a certain key and have it processed separately, so it's still processed as in the example (as a PCollection<String>)?

I am not sure if I understand what you want to do, but I guess you want to do a word count on each key separately. If that's the case, this would work:
final List<KV<String, String>> elements = Arrays.asList(
KV.of("key1", "some random text"),
KV.of("key1", "some different text"),
KV.of("key1", "another random line"),
KV.of("key2", "some random text different"),
KV.of("key2", "bla bla bla")
);
p
.apply("Create Elements", Create.of(elements))
.apply("New combined KVs", ParDo.of(new DoFn<KV<String, String>, KV<String, Integer>>() {
@ProcessElement
public void processElement(ProcessContext c) {
String[] values = c.element().getValue().split(" ");
for (String s: values) {
String key = String.format("%s %s", c.element().getKey(), s);
c.output(KV.of(key, 1));
}
}
}))
.apply(Count.perKey())
.apply("Separate KVs", ParDo.of(new DoFn<KV<String, Long>, KV<String, KV<String, Long>>>() {
@ProcessElement
public void processElement(ProcessContext c) {
String[] keys = c.element().getKey().split(" ");
KV<String, Long> value = KV.of(keys[1], c.element().getValue());
c.output(KV.of(keys[0], value));
}
}))
.apply(GroupByKey.create())
.apply("LOG", ParDo.of(new DoFn<KV<String, Iterable<KV<String, Long>>>, String>() {
@ProcessElement
public void processElement(ProcessContext c) {
LOG.info(c.element().toString());
}
}));
The output is
KV{key1, [KV{line, 1}, KV{text, 2}, KV{another, 1}, KV{different, 1}, KV{some, 2}, KV{random, 2}]}
KV{key2, [KV{different, 1}, KV{some, 1}, KV{bla, 3}, KV{text, 1}, KV{random, 1}]}
The general idea is that you combine the keys into "double keys" (e.g. "key1 some", "key1 random"), aggregate first, and then separate and group.
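An alternative sketch of the same idea that avoids splitting composite keys on spaces (a minimal, untested sketch against the Beam Java SDK; it reuses the elements list and pipeline p from above and keeps the original key and the word together in a composite KV key):
p
.apply("Create Elements", Create.of(elements))
.apply("Split into (key, word) pairs", ParDo.of(new DoFn<KV<String, String>, KV<String, String>>() {
@ProcessElement
public void processElement(ProcessContext c) {
for (String word : c.element().getValue().split(" ")) {
c.output(KV.of(c.element().getKey(), word));
}
}
}))
// Count identical (original key, word) pairs.
.apply(Count.perElement())
// Re-key by the original key, keeping (word, count) as the value.
.apply("Re-key", MapElements.via(new SimpleFunction<KV<KV<String, String>, Long>, KV<String, KV<String, Long>>>() {
@Override
public KV<String, KV<String, Long>> apply(KV<KV<String, String>, Long> e) {
return KV.of(e.getKey().getKey(), KV.of(e.getKey().getValue(), e.getValue()));
}
}))
.apply(GroupByKey.create());
This produces a PCollection<KV<String, Iterable<KV<String, Long>>>>, i.e. one entry per original key with the per-word counts as the value.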


Writable Classes in mapreduce

How can I use the values from the hashset (the docid and offset) in the reduce writable, so as to connect the map writable with the reduce writable?
The mapper (LineIndexMapper) works fine, but in the reducer (LineIndexReducer) I get an error that it can't take a string as an argument when I write this:
context.write(key, new IndexRecordWritable("some string"));
although I have the public String toString() in the reduce writable too.
I believe the hashset in the reducer's writable (IndexRecordWritable.java) maybe isn't taking the values correctly?
I have the below code.
IndexMapRecordWritable.java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
public class IndexMapRecordWritable implements Writable {
private LongWritable offset;
private Text docid;
public LongWritable getOffsetWritable() {
return offset;
}
public Text getDocidWritable() {
return docid;
}
public long getOffset() {
return offset.get();
}
public String getDocid() {
return docid.toString();
}
public IndexMapRecordWritable() {
this.offset = new LongWritable();
this.docid = new Text();
}
public IndexMapRecordWritable(long offset, String docid) {
this.offset = new LongWritable(offset);
this.docid = new Text(docid);
}
public IndexMapRecordWritable(IndexMapRecordWritable indexMapRecordWritable) {
this.offset = indexMapRecordWritable.getOffsetWritable();
this.docid = indexMapRecordWritable.getDocidWritable();
}
@Override
public String toString() {
StringBuilder output = new StringBuilder();
output.append(docid);
output.append(offset);
return output.toString();
}
@Override
public void write(DataOutput out) throws IOException {
}
@Override
public void readFields(DataInput in) throws IOException {
}
}
IndexRecordWritable.java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashSet;
import org.apache.hadoop.io.Writable;
public class IndexRecordWritable implements Writable {
// Save each index record from maps
private HashSet<IndexMapRecordWritable> tokens = new HashSet<IndexMapRecordWritable>();
public IndexRecordWritable() {
}
public IndexRecordWritable(
Iterable<IndexMapRecordWritable> indexMapRecordWritables) {
}
@Override
public String toString() {
StringBuilder output = new StringBuilder();
return output.toString();
}
@Override
public void write(DataOutput out) throws IOException {
}
@Override
public void readFields(DataInput in) throws IOException {
}
}
Alright, here is my answer, based on a few assumptions. Going by the pre-condition and post-condition comments in your reducer class, the final output is a text file containing the key and the file names separated by commas.
In this case, you really don't need the IndexRecordWritable class. You can simply write to your context using
context.write(key, new Text(valueBuilder.substring(0, valueBuilder.length() - 1)));
with the class declaration line as
public class LineIndexReducer extends Reducer<Text, IndexMapRecordWritable, Text, Text>
Don't forget to set the correct output class in the driver.
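For reference, a minimal sketch of the relevant driver settings (the job name and overall driver structure are assumptions; only the mapper/reducer and writable classes come from your code):
Job job = Job.getInstance(new Configuration(), "line indexer");
job.setMapperClass(LineIndexMapper.class);
job.setReducerClass(LineIndexReducer.class);
// the mapper emits IndexMapRecordWritable values
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IndexMapRecordWritable.class);
// the reducer emits plain Text values in this approach
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);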
That should serve the purpose according to the post-condition in your reducer class. But if you really want to write a Text-IndexRecordWritable pair to your context, there are two ways to approach it:
with a String as an argument (based on your attempt to pass a string even though your IndexRecordWritable constructor is not designed to accept strings), and
with a HashSet as an argument (based on the HashSet initialised in the IndexRecordWritable class).
Since the constructor of your IndexRecordWritable class is not designed to accept a String as input, you cannot pass a string; hence the error you are getting. PS: if you want your constructor to accept Strings, you must add another constructor to your IndexRecordWritable class, as below:
// Save each index record from maps
private HashSet<IndexMapRecordWritable> tokens = new HashSet<IndexMapRecordWritable>();
// to save the string
private String value;
public IndexRecordWritable() {
}
public IndexRecordWritable(
HashSet<IndexMapRecordWritable> indexMapRecordWritables) {
/***/
}
// to accept a string
public IndexRecordWritable (String value) {
this.value = value;
}
but that won't help if you want to use the HashSet. So approach #1 can't be used: you can't pass a string.
That leaves us with approach #2: passing a HashSet as an argument, since you want to make use of the HashSet. In this case, you must create a HashSet in your reducer before passing it as an argument to IndexRecordWritable in context.write.
To do this, your reducer must look like this:
@Override
protected void reduce(Text key, Iterable<IndexMapRecordWritable> values, Context context) throws IOException, InterruptedException {
//StringBuilder valueBuilder = new StringBuilder();
HashSet<IndexMapRecordWritable> set = new HashSet<>();
for (IndexMapRecordWritable val : values) {
set.add(val);
//valueBuilder.append(val);
//valueBuilder.append(",");
}
//write the key and the adjusted value (removing the last comma)
//context.write(key, new IndexRecordWritable(valueBuilder.substring(0, valueBuilder.length() - 1)));
context.write(key, new IndexRecordWritable(set));
//valueBuilder.setLength(0);
}
and your IndexRecordWritable.java must have this.
// Save each index record from maps
private HashSet<IndexMapRecordWritable> tokens = new HashSet<IndexMapRecordWritable>();
// to save the string
//private String value;
public IndexRecordWritable() {
}
public IndexRecordWritable(
HashSet<IndexMapRecordWritable> indexMapRecordWritables) {
/***/
tokens.addAll(indexMapRecordWritables);
}
Remember, this is not what is required according to the description in your reducer, which says:
POST-CONDITION: emit the output a single key-value where all the file names are separated by a comma ",". <"marcello", "a.txt#3345,b.txt#344,c.txt#785">
If you still choose to emit (Text, IndexRecordWritable), remember to process the HashSet in IndexRecordWritable to get it in the desired format.
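A possible toString() sketch for IndexRecordWritable that produces that format (the docid#offset joining is an assumption based on the post-condition example; adjust it to whatever per-record format you actually need):
@Override
public String toString() {
StringBuilder output = new StringBuilder();
for (IndexMapRecordWritable token : tokens) {
if (output.length() > 0) {
output.append(",");
}
// assumed per-record format: docid#offset, e.g. a.txt#3345
output.append(token.getDocid()).append("#").append(token.getOffset());
}
return output.toString();
}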

How to wait for a finite stream bulk result

I have a stream processing application built with Spring Cloud Stream and Kafka Streams.
The system takes logs from an application and compares them to observations made by another stream processor,
producing a score; the log stream is then split by that score (above and below some threshold).
The topology:
The issue:
So my problem is how to properly implement the "Log best observation selector" processor.
There is a finite number of observations at the moment the log is processed, but there may be a lot of them.
So I came up with 2 solutions...
Group and window the log-scored-observations topic by log id and then reduce to get the highest score. (Problem: scoring all observations may take longer than the window.)
Emit a scoring-completed message after every scoring, join with log-relevant-observations, use the log-scored-observations global table and an interactive query to check that every observation id is in the global table store, and when all ids are in the store, map to the observation with the highest score. (Problem: the global table does not appear to work when it is only used for interactive queries.)
What would be the best way to achieve what I'm trying to do?
I'm hoping not to create any partition, disk or memory bottleneck.
Everything has unique ids and tuples of relevant ids when the value is joined from log & observation.
(Edit: replaced the text description of the topology with a diagram and changed the title.)
Solution #2 seems to work, but it emitted warnings because interactive queries take some time to become ready, so I implemented the same solution with a Transformer:
@Slf4j
@Configuration
@RequiredArgsConstructor
@SuppressWarnings("unchecked")
public class LogBestObservationsSelectorProcessorConfig {
private String logScoredObservationsStore = "log-scored-observations-store";
private final Serde<LogEntryRelevantObservationIdTuple> logEntryRelevantObservationIdTupleSerde;
private final Serde<LogRelevantObservationIdsTuple> logRelevantObservationIdsTupleSerde;
private final Serde<LogEntryObservationMatchTuple> logEntryObservationMatchTupleSerde;
private final Serde<LogEntryObservationMatchIdsRelevantObservationsTuple> logEntryObservationMatchIdsRelevantObservationsTupleSerde;
@Bean
public Function<
GlobalKTable<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple>,
Function<
KStream<LogEntryRelevantObservationIdTuple, LogEntryRelevantObservationIdTuple>,
Function<
KTable<String, LogRelevantObservationIdsTuple>,
KStream<String, LogEntryObservationMatchTuple>
>
>
>
logBestObservationSelectorProcessor() {
return (GlobalKTable<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple> logScoredObservationsTable) ->
(KStream<LogEntryRelevantObservationIdTuple, LogEntryRelevantObservationIdTuple> logScoredObservationProcessedStream) ->
(KTable<String, LogRelevantObservationIdsTuple> logRelevantObservationIdsTable) -> {
return logScoredObservationProcessedStream
.selectKey((k, v) -> k.getLogId())
.leftJoin(
logRelevantObservationIdsTable,
LogEntryObservationMatchIdsRelevantObservationsTuple::new,
Joined.with(
Serdes.String(),
logEntryRelevantObservationIdTupleSerde,
logRelevantObservationIdsTupleSerde
)
)
.transform(() -> new LogEntryObservationMatchTransformer(logScoredObservationsStore))
.groupByKey(
Grouped.with(
Serdes.String(),
logEntryObservationMatchTupleSerde
)
)
.reduce(
(match1, match2) -> Double.compare(match1.getScore(), match2.getScore()) != -1 ? match1 : match2,
Materialized.with(
Serdes.String(),
logEntryObservationMatchTupleSerde
)
)
.toStream()
;
};
}
@RequiredArgsConstructor
private static class LogEntryObservationMatchTransformer implements Transformer<String, LogEntryObservationMatchIdsRelevantObservationsTuple, KeyValue<String, LogEntryObservationMatchTuple>> {
private final String stateStoreName;
private ProcessorContext context;
private TimestampedKeyValueStore<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple> kvStore;
@Override
public void init(ProcessorContext context) {
this.context = context;
this.kvStore = (TimestampedKeyValueStore<LogEntryRelevantObservationIdTuple, LogEntryObservationMatchTuple>) context.getStateStore(stateStoreName);
}
@Override
public KeyValue<String, LogEntryObservationMatchTuple> transform(String logId, LogEntryObservationMatchIdsRelevantObservationsTuple value) {
val observationIds = value.getLogEntryRelevantObservationsTuple().getRelevantObservations().getObservationIds();
val allObservationsProcessed = observationIds.stream()
.allMatch((observationId) -> {
val key = LogEntryRelevantObservationIdTuple.newBuilder()
.setLogId(logId)
.setRelevantObservationId(observationId)
.build();
return kvStore.get(key) != null;
});
if (!allObservationsProcessed) {
return null;
}
val observationId = value.getLogEntryRelevantObservationIdTuple().getObservationId();
val key = LogEntryRelevantObservationIdTuple.newBuilder()
.setLogId(logId)
.setRelevantObservationId(observationId)
.build();
ValueAndTimestamp<LogEntryObservationMatchTuple> observationMatchValueAndTimestamp = kvStore.get(key);
return new KeyValue<>(logId, observationMatchValueAndTimestamp.value());
}
@Override
public void close() {
}
}
}

How to preserve information about the original observable in RxJava2

I have a REST call returning a collection (the original). This collection is filtered, but in the subscriber's onSuccess I want to obtain both the original list and the filtered one.
I don't know how to 'pass' this second element. Which operator should I use to obtain this result?
I show a simplified version of my code below:
Observable.fromCallable(new Callable<List<Integer>>() {
@Override public List<Integer> call() throws Exception {
// dynamic list obtained from REST call
// for simplicity here I return a list
return Arrays.asList(1, 2, 3, 4);
}
})
.flatMap(new Function<List<Integer>, ObservableSource<Integer>>() {
@Override public ObservableSource<Integer> apply(List<Integer> integers) throws Exception {
return Observable.fromIterable(integers);
}
})
.filter(new Predicate<Integer>() {
@Override public boolean test(Integer integer) throws Exception {
return integer > 2;
}
})
.toList()
.subscribe(new SingleObserver<List<Integer>>() {
@Override public void onSubscribe(Disposable d) {}
@Override public void onSuccess(List<Integer> value) {
///////////////////
// here I want both original and filtered list
///////////////////
}
@Override public void onError(Throwable e) {}
});
One way is with ConnectableObservable. You need to share the emissions of your initial stream. Something like this:
ConnectableObservable<List<Integer>> connectableObservable
= Observable.fromCallable(() -> {
// dynamic list obtained from REST call
// for simplicity here I return a list
return Arrays.asList(1, 2, 3, 4);
}).publish();
Single.zip(connectableObservable.flatMapIterable(integers -> integers)
.filter(integer -> integer > 2)
.toList(),
connectableObservable.elementAtOrError(0),
(integers, lists) -> combine(integers, lists))
.subscribe(o -> {
///////////////////
// here you'll have a new object containing both the initial list and the filtered list
///////////////////
});
connectableObservable.connect();
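If you only need both lists once, a simpler alternative sketch (assuming you are happy to carry the two lists in a plain java.util.AbstractMap.SimpleEntry rather than a dedicated pair class) is to stay at the list level and do the filtering inside a map:
Observable.fromCallable(() -> Arrays.asList(1, 2, 3, 4)) // dynamic list obtained from the REST call
.singleOrError()
.map(original -> {
List<Integer> filtered = new ArrayList<>();
for (Integer i : original) {
if (i > 2) {
filtered.add(i);
}
}
// key = original list, value = filtered list
return new AbstractMap.SimpleEntry<>(original, filtered);
})
.subscribe(
entry -> {
List<Integer> original = entry.getKey();
List<Integer> filtered = entry.getValue();
// use both lists here
},
error -> { /* handle the error */ }
);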

IObservable output emitted before input?

The program below attempts to print out words with their respective lengths. It erroneously reports that cat has 6 letters. As I examine the log, it looks like the length of a specific word is emitted BEFORE the word it is based upon is emitted. How is this possible? The length observable is defined as word.Select(i => i.Length), so I don't see how it could produce a result before the word arrives. At first I thought this might be a bug in my logging code, but the behavior of Observable.WithLatestFrom reinforces my belief that something weird is going on here.
Log results:
0001report.Subscribe()
0002first.Subscribe()
0003second.Subscribe()
0002first.OnNext(3)
0003second.OnNext(cat)
0002first.OnNext(6)
0001report.OnNext({ Word = cat, Length = 6 })
0003second.OnNext(donkey)
The program:
static void Main(string[] args) {
ILogger logger = new DelegateLogger(Console.WriteLine);
Subject<string> word = new Subject<string>();
IObservable<int> length = word.Select(i => i.Length);
var report = Observable
.WithLatestFrom(
length.Log(logger, "first"),
word.Log(logger, "second"),
(l, w) => new { Word = w, Length = l })
.Log(logger,"report");
report.Subscribe();
word.OnNext("cat");
word.OnNext("donkey");
Console.ReadLine();
}
public interface ILogger
{
void Log(string input);
}
public class DelegateLogger : ILogger
{
Action<string> _printer;
public DelegateLogger(Action<string> printer) {
_printer = printer;
}
public void Log(string input) => _printer(input);
}
public static class ObservableLoggingExtensions
{
private static int _index = 0;
public static IObservable<T> Log<T>(this IObservable<T> source, ILogger logger, string name) {
return Observable.Create<T>(o => {
var index = Interlocked.Increment(ref _index);
var label = $"{index:0000}{name}";
logger.Log($"{label}.Subscribe()");
var disposed = Disposable.Create(() => logger.Log($"{label}.Dispose()"));
var subscription = source
.Do(
x => logger.Log($"{label}.OnNext({x?.ToString() ?? "null"})"),
ex => logger.Log($"{label}.OnError({ex})"),
() => logger.Log($"{label}.OnCompleted()")
)
.Subscribe(o);
return new CompositeDisposable(subscription, disposed);
});
}
}
I think I know what is going on. There are two subscriptions on word (1. length, 2. WithLatestFrom) and one subscription on length (1. WithLatestFrom).
When a word is emitted, a synchronous callback process starts that passes it to the first subscriber (length), which calculates a value that is passed to its own subscriber, WithLatestFrom. Only after that does WithLatestFrom receive the word that generated the calculated length. So WithLatestFrom receives the length BEFORE the word, not the other way around. That's why the report isn't giving me the results I expected.

Adding a row-number column to GWT CellTable

I need to insert a new first column into a CellTable and display the row number of the current row in it. What is the best way to do this in GWT?
Get the index of the element from the list wrapped by your ListDataProvider. Like this:
final CellTable<Row> table = new CellTable<Row>();
final ListDataProvider<Row> dataProvider = new ListDataProvider<Starter.Row>(getList());
dataProvider.addDataDisplay(table);
TextColumn<Row> numColumn = new TextColumn<Starter.Row>() {
@Override
public String getValue(Row object) {
return Integer.toString(dataProvider.getList().indexOf(object) + 1);
}
};
See here for the rest of the example.
The solution from z00bs is wrong, because the row number is calculated from the object's index in the data List. For example, for a List of Strings with the elements ["Str1", "Str2", "Str2"], the row numbers will be [1, 2, 2], which is wrong.
This solution instead uses the index of the row in the CellTable as the row number.
public class RowNumberColumn extends Column {
public RowNumberColumn() {
super(new AbstractCell() {
@Override
public void render(Context context, Object o, SafeHtmlBuilder safeHtmlBuilder) {
safeHtmlBuilder.append(context.getIndex() + 1);
}
});
}
@Override
public String getValue(Object s) {
return null;
}
}
and
cellTable.addColumn(new RowNumberColumn());
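If you want to avoid the raw types, a generified sketch of the same column (the row type Row is just a placeholder for your own type) could look like this:
public class RowNumberColumn<T> extends Column<T, Number> {
public RowNumberColumn() {
super(new AbstractCell<Number>() {
@Override
public void render(Context context, Number value, SafeHtmlBuilder sb) {
// context.getIndex() is the absolute index of the row in the display
sb.append(context.getIndex() + 1);
}
});
}
@Override
public Number getValue(T object) {
// the cell renders from the row context, not from the cell value
return null;
}
}
and add it the same way: cellTable.addColumn(new RowNumberColumn<Row>());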