Side Inputs for dynamic BigQuery schemas/tables - apache-beam

I want to write a PCollection to multiple BigQuery tables, where the destination table depends on the contents of each element and the tables have different schemas, which arrive via a side input.
I noted this in the docs for DynamicDestinations:
An instance of DynamicDestinations can also use side inputs using sideInput(PCollectionView). The side inputs must be present in getSideInputs(). Side inputs are accessed in the global window, so they must be globally windowed.
How is this practically implemented with v2.0.0 of the Apache Beam BigQueryIO API?

Supposing you have a side input prepared like
// Must be globally windowed to work with BigQueryIO
PCollectionView<MyAuxData> myView = ...
then you would access it in your DynamicDestinations like this:
new DynamicDestinations<MyElement, MyDestination>() {
    @Override
    public List<PCollectionView<?>> getSideInputs() {
        return ImmutableList.of(myView);
    }

    @Override
    public TableSchema getSchema(MyDestination dest) {
        MyAuxData auxData = sideInput(myView);
        ...
    }
    ...
}
and so on.
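To make the wiring concrete, here is a minimal sketch of preparing the globally windowed side input and attaching the DynamicDestinations to the write. The names auxData, rows, and MyDynamicDestinations (the anonymous class above, given a name for reuse), as well as the format function, are assumptions for illustration:
// Assumption: auxData is a single-element PCollection<MyAuxData>.
// Re-window into the global window so BigQueryIO can access it as a side input.
PCollectionView<MyAuxData> myView = auxData
    .apply(Window.into(new GlobalWindows()))
    .apply(View.asSingleton());

rows.apply(BigQueryIO.<MyElement>write()
    .to(new MyDynamicDestinations(myView))
    .withFormatFunction(elem -> new TableRow().set("value", elem.toString())));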

Related

Can the state stores in Kafka be shared across streams?

We have a scenario where a state store populated by one KStream needs to be accessed from another KStream. Is there any way to achieve this?
They can be accessed with Interactive Queries.
Between applications or instances of the same application, you need to use RPC calls such as adding an HTTP or gRPC server.
https://docs.confluent.io/platform/current/streams/developer-guide/interactive-queries.html
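Within a single application instance, querying such a store looks roughly like this (a sketch; the store name matches the examples below, and streams is assumed to be a running KafkaStreams instance):
// obtain a read-only view of a state store from a running KafkaStreams instance
ReadOnlyKeyValueStore<String, String> store = streams.store(
    StoreQueryParameters.fromNameAndType("myProcessorState",
        QueryableStoreTypes.keyValueStore()));
String value = store.get("some-key");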
You can attach the same state store to multiple processors if you use the Processor API, but also if you use the Processor API Integration in the DSL.
There are two ways to do that (see the javadocs). You can either manually add the store to the topology and connect it to the processor by name:
// create store
StoreBuilder<KeyValueStore<String, String>> keyValueStoreBuilder =
    Stores.keyValueStoreBuilder(
        Stores.persistentKeyValueStore("myProcessorState"),
        Serdes.String(),
        Serdes.String());
// add store to the topology
builder.addStateStore(keyValueStoreBuilder);
// connect the store to the processor by name
KStream<String, String> outputStream = inputStream.process(new ProcessorSupplier<String, String, String, String>() {
    @Override
    public Processor<String, String, String, String> get() {
        return new MyProcessor();
    }
}, "myProcessorState");
or you can implement stores() on the ProcessorSupplier you pass in:
class MyProcessorSupplier implements ProcessorSupplier<String, String, String, String> {
    // supply the processor
    @Override
    public Processor<String, String, String, String> get() {
        return new MyProcessor();
    }

    // provide store(s) that will be added and connected to the associated processor;
    // the store name from the builder ("myProcessorState") is used to access the
    // store later via the ProcessorContext
    @Override
    public Set<StoreBuilder<?>> stores() {
        StoreBuilder<KeyValueStore<String, String>> keyValueStoreBuilder =
            Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("myProcessorState"),
                Serdes.String(),
                Serdes.String());
        return Collections.singleton(keyValueStoreBuilder);
    }
}
These are examples for KStream#process(), but it works similarly for the family of KStream#*transform*() methods.

How to run BigQueryIO.read().fromQuery with parameters

I need to run multiple queries from a single .SQL file but with different params
I've tried something like this but it does not work as BigQueryIO.Read consumes only PBegin.
public PCollection<KV<String, TestDitoDto>> expand(PCollection<QueryParamsBatch> input) {
    // does not compile: BigQueryIO.Read is a PTransform<PBegin, ...>,
    // so it cannot be applied to a PCollection
    PCollection<KV<String, Section1Dto>> section1 = input.apply("Read Section1 from BQ",
        BigQueryIO
            .readTableRows()
            .fromQuery(ResourceRetriever.getResourceFile("query/test/section1.sql"))
            .usingStandardSql()
            .withoutValidation())
        .apply("Convert section1 to Dto", ParDo.of(new TableRowToSection1DtoFunction()));
}
Are there any other ways to put params from existing PCollection inside my BigQueryIO.read() invocation?
Are the different queries/parameters available at pipeline construction time? If so, you could just create multiple read transforms and combine the results, for example using a Flatten transform.
Beam Java BigQuery source does not support reading a PCollection of queries currently. Python BQ source does though.
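If they are available at construction time, the Flatten approach could look like this (a sketch; queriesKnownAtConstructionTime is an assumed list of query strings):
List<PCollection<TableRow>> perQuery = new ArrayList<>();
for (String query : queriesKnownAtConstructionTime) {
    perQuery.add(pipeline.apply("Read: " + query,
        BigQueryIO.readTableRows().fromQuery(query).usingStandardSql()));
}
PCollection<TableRow> allRows =
    PCollectionList.of(perQuery).apply(Flatten.pCollections());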
I've come up with the following solution: instead of BigQueryIO, use the regular GCP client library for accessing BigQuery, mark the client as transient (it is not Serializable), and initialize it in a method annotated with @Setup:
public class DenormalizedCase1Fn extends DoFn<*> {
    private transient BigQuery bigQuery;

    @Setup
    public void initialize() {
        this.bigQuery = BigQueryOptions.newBuilder()
            .setProjectId(bqProjectId.get())
            .setLocation(LOCATION)
            .setRetrySettings(RetrySettings.newBuilder()
                .setRpcTimeoutMultiplier(1.5)
                .setInitialRpcTimeout(Duration.ofSeconds(5))
                .setMaxRpcTimeout(Duration.ofSeconds(30))
                .setMaxAttempts(3).build())
            .build().getService();
    }

    @ProcessElement
    ...
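The elided @ProcessElement body would then run the parameterized query with the client directly. A sketch, where buildQuery and convertToTableRow are hypothetical helpers (the former substitutes the batch's params into the SQL template):
@ProcessElement
public void processElement(@Element QueryParamsBatch params, OutputReceiver<TableRow> out)
        throws InterruptedException {
    QueryJobConfiguration config = QueryJobConfiguration
        .newBuilder(buildQuery(params))   // hypothetical: fills params into the SQL template
        .setUseLegacySql(false)
        .build();
    for (FieldValueList row : bigQuery.query(config).iterateAll()) {
        out.output(convertToTableRow(row)); // hypothetical row conversion
    }
}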

Preventing KStream from emitting old unchanged aggregate value

I have a KStream pipeline which groups by key, then windows on some interval, and then applies a custom aggregation on that:
KStream<String, Integer> input = /* define input stream */;

/* group by key and then apply windowing */
KTable<Windowed<String>, MyAggregate> aggregateTable =
    input.groupByKey()
         .windowedBy(/* window definition here */)
         .aggregate(MyAggregate::new, (key, value, agg) -> agg.addAndReturn(value));

// I need to get a change log of aggregateTable, so:
aggregateTable.toStream().to("output-topic");
The problem is that the majority of the input records will not change the internal state of the MyAggregate object. The structure is similar to:
class MyAggregate {
    private Set<Integer> checkBeforeInsert = /* some predefined values */;
    private List<Integer> actualState = new ArrayList<>();

    public MyAggregate addAndReturn(Integer value) {
        /* for 99% of records the if check passes */
        if (checkBeforeInsert.contains(value)) {
            /* do nothing and return. Note that the state hasn't been changed */
            return this;
        } else {
            actualState.add(value);
            return this;
        }
    }
}
However, KStream has no clue that the aggregate object hasn't changed: it still stores the aggregate (which is the same as the old one), propagates the same old value to the changelog topic, and triggers aggregateTable.toStream() with the same old value.
Although the semantics of my application work fine (the rest of the application is aware that an unchanged state might arrive), this creates huge noise traffic on the intermediate topics. I need a way to notify KStream whether an aggregate has really changed and should be stored, or is the same as the previous record (and should just be ignored).
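Not part of the original question, but a common pattern to suppress these no-op updates is to dedupe after toStream() with a small state store that remembers the last value emitted per key. This sketch assumes MyAggregate implements equals() and that a store named "last-sent" has already been added to the topology:
aggregateTable.toStream()
    .transformValues(() -> new ValueTransformerWithKey<Windowed<String>, MyAggregate, MyAggregate>() {
        private KeyValueStore<String, MyAggregate> lastSent;

        @Override
        public void init(ProcessorContext context) {
            lastSent = context.getStateStore("last-sent");
        }

        @Override
        public MyAggregate transform(Windowed<String> key, MyAggregate value) {
            MyAggregate previous = lastSent.get(key.toString());
            if (value.equals(previous)) {
                return null;               // unchanged aggregate: dropped below
            }
            lastSent.put(key.toString(), value);
            return value;
        }

        @Override
        public void close() {}
    }, "last-sent")
    .filter((key, value) -> value != null)
    .to("output-topic");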

Accessing Pipeline within DoFn

I'm writing a pipeline to replicate data from one source to another. Info about the data sources is stored in a db (BQ). How can I use this data to build read/write endpoints dynamically?
I tried to pass the Pipeline object to my custom DoFn, but it can't be serialized. Later I tried to call getPipeline() on a passed view, but that doesn't work either -- which is actually expected.
I can't know in advance all the tables I need to replicate, so I have to read that information from the db (or any other source).
// builds some random view
PCollectionView<IdWrapper> idView = ...;

// reads tables meta and replicates data per each table
pipeline.apply(getTableMetaEndpoint().read())
    .apply(ParDo.of(new MyCustomReplicator(idView)).withSideInputs(idView));

private static class MyCustomReplicator extends DoFn<TableMeta, TableMeta> {
    private final PCollectionView<IdWrapper> idView;

    private MyCustomReplicator(PCollectionView<IdWrapper> idView) {
        this.idView = idView;
    }

    // TableMeta {string: sourceTable, string: destTable}
    @ProcessElement
    public void processElement(@Element TableMeta tableMeta, ProcessContext ctx) {
        long id = ctx.sideInput(idView).getValue();
        // builds read endpoint which depends on table meta
        // updates entities
        // stores entities using another endpoint
        // this is the part that doesn't work: a pipeline can't be expanded from inside a DoFn
        idView
            .getPipeline()
            .apply(createReadEndpoint(tableMeta).read())
            .apply(ParDo.of(new SomeFunction(tableMeta, id)))
            .apply(createWriteEndpoint(tableMeta).insert());
        ctx.output(tableMeta);
    }
}
I expect it to replicate the data specified by each TableMeta, but I can't use the pipeline within the DoFn because it can't be serialized/deserialized.
Is there any way to implement the intended behavior?
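(Not from the original post, but the usual workaround: Beam cannot expand new transforms while the pipeline is running, so the DoFn would talk to the storage system directly through its client library, much like the BigQuery answer above. A sketch:)
private static class MyCustomReplicator extends DoFn<TableMeta, TableMeta> {
    // hypothetical non-serializable client, rebuilt on each worker
    private transient BigQuery bigQuery;

    @Setup
    public void setup() {
        bigQuery = BigQueryOptions.getDefaultInstance().getService();
    }

    @ProcessElement
    public void processElement(@Element TableMeta tableMeta, OutputReceiver<TableMeta> out) {
        // read rows from tableMeta.sourceTable with the client,
        // transform them, and write them to tableMeta.destTable,
        // instead of applying new PTransforms at runtime
        out.output(tableMeta);
    }
}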

Understanding RxJava: Differences between Runnable callback

I'm trying to understand RxJava and I'm sure this question is nonsense... I have this code using RxJava:
public Observable<T> getData(int id) {
    if (dataAlreadyLoaded()) {
        return Observable.create(new Observable.OnSubscribe<T>() {
            @Override
            public void call(Subscriber<? super T> subscriber) {
                T data = getDataFromMemory(id);
                subscriber.onNext(data);
            }
        });
    }
    return Observable.create(new Observable.OnSubscribe<T>() {
        @Override
        public void call(Subscriber<? super T> subscriber) {
            T data = getDataFromRemoteService(id);
            subscriber.onNext(data);
        }
    });
}
And, for instance, I could use it this way:
Action1<String> action = new Action1<String>() {
    @Override
    public void call(String s) {
        // Do something with s
    }
};
getData(3).subscribe(action);
and this is another version with a callback that implements Runnable:
public void getData(int id, MyClassRunnable callback) {
    if (dataAlreadyLoaded()) {
        T data = getDataFromMemory(id);
        callback.setData(data);
        callback.run();
    } else {
        T data = getDataFromRemoteService(id);
        callback.setData(data);
        callback.run();
    }
}
And I would use it this way:
getData(3, new MyClassRunnable()); //Do something in run method
What are the differences? Why is the first one better?
The question is not about the framework itself but the paradigm. I'm trying to understand the use cases of reactive.
I appreciate any help. Thanks.
First of all, your RxJava version is much more complex than it needs to be. Here's a much simpler version:
public Observable<T> getData(int id) {
    return Observable.fromCallable(() ->
        dataAlreadyLoaded() ? getDataFromMemory(id) : getDataFromRemoteService(id)
    );
}
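One usage note (an addition, not from the original answer): fromCallable runs its body on whichever thread subscribes, so for a blocking remote call you would typically move the work to a scheduler:
getData(3)
    .subscribeOn(Schedulers.io())   // run the (possibly remote) fetch off the caller's thread
    .subscribe(data -> { /* do something with data */ });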
Regardless, the problem you present is so trivial that there is no discernible difference between the two solutions. It's like asking which one is better for assigning integer values - var = var + 1 or var++. In this particular case they are identical, but when using assignment there are many more possibilities (adding values other than one, subtracting, multiplying, dividing, taking into account other variables, etc).
So what is it you can do with reactive? I like the summary on reactivex's website:
Easily create event streams or data streams. For a single piece of data this isn't so important, but when you have a stream of data the paradigm makes a lot more sense.
Compose and transform streams with query-like operators. In your above example there are no operators and a single stream. Operators let you transform data in handy ways, and combining multiple callbacks is much harder than combining multiple Observables.
Subscribe to any observable stream to perform side effects. You're only listening to a single event. Reactive is well-suited for listening to multiple events. It's also great for things like error handling - you can create a long sequence of events, but any errors are forwarded to the eventual subscriber.
Let's look at a more concrete example with a bit more intrigue: validating an email and password. You've got two text fields and a button. You want the button to become enabled once there is an email (let's say matching .*@.*) and a password (of at least 8 characters) entered.
I've got two Observables that represent whatever the user has currently entered into the text fields:
Observable<String> email = /* you figure this out */;
Observable<String> password = /* and this, too */;
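(One hypothetical way to back these, if your UI toolkit doesn't provide bindings, is a PublishSubject fed from a text-change listener:)
PublishSubject<String> emailInput = PublishSubject.create();
// in the text field's change listener: emailInput.onNext(currentText);
Observable<String> email = emailInput;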
For validating each input, I can map the input String to true or false.
Observable<Boolean> validEmail = email.map(str -> str.matches(".*@.*"));
Observable<Boolean> validPw = password.map(str -> str.length() >= 8);
Then I can combine them to determine if I should enable the button or not:
Observable.combineLatest(validEmail, validPw, (b1, b2) -> b1 && b2)
.subscribe(enableButton -> /* enable button based on bool */);
Now, every time the user types something new into either text field, the button's state gets updated. I've set up the logic so that the button just reacts to the state of the text fields.
This simple example doesn't show it all, but it shows how things get a lot more interesting after you get past a simple subscription. Obviously, you can do this without the reactive paradigm, but it's simpler with reactive operators.