How to run BigQueryIO.read().fromQuery with parameters - apache-beam

I need to run multiple queries from a single .SQL file but with different params.
I've tried something like this, but it does not work because BigQueryIO.Read consumes only PBegin:
public PCollection<KV<String, TestDitoDto>> expand(PCollection<QueryParamsBatch> input) {
    PCollection<KV<String, Section1Dto>> section1 = input.apply("Read Section1 from BQ",
            BigQueryIO
                    .readTableRows()
                    .fromQuery(ResourceRetriever.getResourceFile("query/test/section1.sql"))
                    .usingStandardSql()
                    .withoutValidation())
            .apply("Convert section1 to Dto", ParDo.of(new TableRowToSection1DtoFunction()));
}
Are there any other ways to feed params from an existing PCollection into my BigQueryIO.read() invocation?

Are the different queries/parameters available at pipeline construction time? If so, you could just create multiple read transforms and combine their results, for example using a Flatten transform (see the sketch below).
The Beam Java BigQuery source does not currently support reading a PCollection of queries; the Python BQ source does.
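A hedged sketch of that construction-time approach, reusing the readTableRows()/fromQuery setup from the question (the second query file, the pipeline variable, and the TableRow element type are assumptions, not part of the original answer):
PCollection<TableRow> section1 = pipeline.apply("Read Section1",
        BigQueryIO.readTableRows()
                .fromQuery(ResourceRetriever.getResourceFile("query/test/section1.sql"))
                .usingStandardSql()
                .withoutValidation());

PCollection<TableRow> section2 = pipeline.apply("Read Section2",
        BigQueryIO.readTableRows()
                .fromQuery(ResourceRetriever.getResourceFile("query/test/section2.sql")) // hypothetical second query
                .usingStandardSql()
                .withoutValidation());

// Combine the construction-time reads into a single PCollection.
PCollection<TableRow> allSections =
        PCollectionList.of(section1).and(section2).apply(Flatten.pCollections());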

I've come up with the following solution: instead of BigQueryIO, use the regular GCP client library for BigQuery. Since the client is not Serializable, mark the field transient and initialize it in a method annotated with @Setup:
public class DenormalizedCase1Fn extends DoFn<*> {

    private transient BigQuery bigQuery;

    @Setup
    public void initialize() {
        this.bigQuery = BigQueryOptions.newBuilder()
                .setProjectId(bqProjectId.get())
                .setLocation(LOCATION)
                .setRetrySettings(RetrySettings.newBuilder()
                        .setRpcTimeoutMultiplier(1.5)
                        .setInitialRpcTimeout(Duration.ofSeconds(5))
                        .setMaxRpcTimeout(Duration.ofSeconds(30))
                        .setMaxAttempts(3).build())
                .build().getService();
    }

    @ProcessElement
    ...
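The @ProcessElement body is elided above; as a hedged sketch only (assuming the DoFn consumes QueryParamsBatch elements, and with a hypothetical query string, getId() accessor, and convertRow() mapping), it might run a parameterized query like this:
@ProcessElement
public void processElement(ProcessContext ctx) throws InterruptedException {
    QueryParamsBatch params = ctx.element(); // assumes DoFn<QueryParamsBatch, ...>
    QueryJobConfiguration queryConfig = QueryJobConfiguration
            .newBuilder("SELECT * FROM `dataset.table` WHERE id = @id")              // hypothetical query
            .addNamedParameter("id", QueryParameterValue.string(params.getId()))     // hypothetical accessor
            .setUseLegacySql(false)
            .build();
    for (FieldValueList row : bigQuery.query(queryConfig).iterateAll()) {
        ctx.output(convertRow(row)); // hypothetical mapping to the DoFn's output type
    }
}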

Related

Accessing Pipeline within DoFn

I'm writing a pipeline to replicate data from one source to another. Info about the data sources is stored in a DB (BQ). How can I use this data to build the read/write endpoints dynamically?
I tried to pass the Pipeline object to my custom DoFn, but it can't be serialized. Later I tried calling getPipeline() on a passed view, but that doesn't work either -- which is actually expected.
I can't know all the tables I need to process in advance, so I have to read them from the DB (or any other source).
// builds some random view
PCollectionView<IdWrapper> idView = ...;

// reads table meta and replicates data for each table
pipeline.apply(getTableMetaEndpont().read())
        .apply(ParDo.of(new MyCustomReplicator(idView)).withSideInputs(idView));

private static class MyCustomReplicator extends DoFn<TableMeta, TableMeta> {

    private final PCollectionView<IdWrapper> idView;

    private MyCustomReplicator(PCollectionView<IdWrapper> idView) {
        this.idView = idView;
    }

    // TableMeta {string: sourceTable, string: destTable}
    @ProcessElement
    public void processElement(@Element TableMeta tableMeta, ProcessContext ctx) {
        long id = ctx.sideInput(idView).getValue();
        // builds a read endpoint which depends on the table meta,
        // updates entities and stores them using another endpoint
        idView
                .getPipeline()
                .apply(createReadEndpoint(tableMeta).read())
                .apply(ParDo.of(new SomeFunction(tableMeta, id)))
                .apply(createWriteEndpoint(tableMeta).insert());
        ctx.output(tableMeta);
    }
}
I expect this to replicate the data described by each TableMeta, but I can't use the Pipeline inside a DoFn because it can't be serialized/deserialized.
Is there any way to implement the intended behavior?

GroupIntoBatches for non-KV elements

According to the Apache Beam 2.0.0 SDK Documentation GroupIntoBatches works only with KV collections.
My dataset contains only values and there's no need for introducing keys. However, to make use of GroupIntoBatches I had to implement “fake” keys with an empty string as a key:
static class FakeKVFn extends DoFn<String, KV<String, String>> {
    @ProcessElement
    public void processElement(ProcessContext c) {
        c.output(KV.of("", c.element()));
    }
}
So the overall pipeline looks like the following:
public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);

    long batchSize = 100L;

    p.apply("ReadLines", TextIO.read().from("./input.txt"))
            .apply("FakeKV", ParDo.of(new FakeKVFn()))
            .apply(GroupIntoBatches.<String, String>ofSize(batchSize))
            .setCoder(KvCoder.of(StringUtf8Coder.of(), IterableCoder.of(StringUtf8Coder.of())))
            .apply(ParDo.of(new DoFn<KV<String, Iterable<String>>, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    c.output(callWebService(c.element().getValue()));
                }
            }))
            .apply("WriteResults", TextIO.write().to("./output/"));

    p.run().waitUntilFinish();
}
Is there any way to group into batches without introducing “fake” keys?
It is required to provide KV inputs to GroupIntoBatches because the transform is implemented using state and timers, which are per key-and-window.
For each key+window pair, state and timers necessarily execute serially (or observably so). You have to manually express the available parallelism by providing keys (and windows, though no runner that I know of parallelizes over windows today). The two most common approaches are:
Use some natural key like a user ID
Choose some fixed number of shards and key randomly (sketched below). This can be harder to tune: you have to have enough shards to get enough parallelism, but each shard needs to include enough data that GroupIntoBatches is actually useful.
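A minimal sketch of that second approach, assuming plain String elements and a hypothetical shard count of 10:
static class AssignRandomShardFn extends DoFn<String, KV<Integer, String>> {
    private static final int NUM_SHARDS = 10; // assumption: tune to the parallelism you need

    @ProcessElement
    public void processElement(ProcessContext c) {
        int shard = ThreadLocalRandom.current().nextInt(NUM_SHARDS);
        c.output(KV.of(shard, c.element()));
    }
}

// wiring it in place of the "fake" key step:
// .apply(ParDo.of(new AssignRandomShardFn()))
// .apply(GroupIntoBatches.<Integer, String>ofSize(batchSize))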
Adding one dummy key to all elements as in your snippet will cause the transform to not execute in parallel at all. This is similar to the discussion at Stateful indexing causes ParDo to be run single-threaded on Dataflow Runner.

Side Inputs for dynamic BigQuery schemas/tables

I want to write a PCollection to multiple BigQuery tables, with the destination table and schema chosen per element based on its contents; the schema information arrives via a side input.
I noted this in the docs for DynamicDestinations:
An instance of DynamicDestinations can also use side inputs using sideInput(PCollectionView). The side inputs must be present in getSideInputs(). Side inputs are accessed in the global window, so they must be globally windowed.
How can this be practically implemented with v2.0.0 of the Apache Beam BigQueryIO API?
Supposing you have a side input prepared like
// Must be globally windowed to work with BigQueryIO
PCollectionView<MyAuxData> myView = ...
then you would access it in your DynamicDestinations like this:
new DynamicDestinations<MyElement, MyDestination>() {
    @Override
    protected List<PCollectionView<?>> getSideInputs() {
        return ImmutableList.of(myView);
    }

    @Override
    public TableSchema getSchema(MyDestination dest) {
        MyAuxData auxData = sideInput(myView);
        ...
    }
    ...
}
and so on.
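For completeness, a hedged sketch of how the globally windowed view itself might be prepared (the auxData collection, the MyAuxData type, and the singleton shape are assumptions):
PCollection<MyAuxData> auxData = ...; // however the schema/table info is loaded
PCollectionView<MyAuxData> myView = auxData
        .apply("GlobalWindow", Window.<MyAuxData>into(new GlobalWindows())) // side inputs here must be globally windowed
        .apply(View.<MyAuxData>asSingleton());
The view returned from getSideInputs() is what BigQueryIO uses to make the side input available to the DynamicDestinations callbacks.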

Insert & Update from single Spring Batch ItemReader

My process transforms the data into an SCD2 pattern. Thus, any update in the source data results in updating the end_date and active_ind in the dimension table and inserting a new record.
I have configured the SQL in an ItemReader implementation that identifies the records that changed in the source data.
I need suggestions on how to route the data to two writers, one for updates and one for inserts.
There is a general pattern in Spring for this type of use case, not specific to Spring Batch: the Classifier interface.
You can use the BackToBackPatternClassifier implementation of this interface.
Additionally, you need to use the ClassifierCompositeItemWriter provided by Spring Batch.
Here is a summary of the steps:
The POJO/Java bean that is passed on to the writer should have some kind of String field that can identify the target ItemWriter for that POJO.
Then you write a Classifier that returns that String for each POJO, like this:
public class UpdateOrInsertClassifier {

    @Classifier
    public String classify(WrittenMasterBean writtenBean) {
        return writtenBean.getType();
    }
}
and
@Bean
public UpdateOrInsertClassifier router() {
    return new UpdateOrInsertClassifier();
}
I assume that WrittenMasterBean is the POJO you send to either of the writers and that it has a private String type; field. This Classifier is your router.
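For reference, a minimal sketch of such a bean (only the type field comes from the discussion above; the other members are assumptions):
public class WrittenMasterBean {

    // "Writer1" or "Writer2" -- set while processing the item, drives the routing
    private String type;

    // ... other SCD2 fields (business key, end_date, active_ind, etc.)

    public String getType() { return type; }
    public void setType(String type) { this.type = type; }
}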
Then you set up the BackToBackPatternClassifier like this:
@Bean
public Classifier classifier() {
    BackToBackPatternClassifier classifier = new BackToBackPatternClassifier();
    classifier.setRouterDelegate(router());

    Map<String, ItemWriter<WrittenMasterBean>> writerMap = new HashMap<>();
    writerMap.put("Writer1", writer1());
    writerMap.put("Writer2", writer2());
    classifier.setMatcherMap(writerMap);

    return classifier;
}
i.e. I assume that the keys Writer1 and Writer2 identify which writer handles a particular bean.
writer1() and writer2() return the actual ItemWriter beans.
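For example (a hedged sketch only: the JdbcBatchItemWriter choice, the dataSource field, and the table/column/property names are all assumptions), the two writers could issue the SCD2 update and the insert respectively:
@Bean
public JdbcBatchItemWriter<WrittenMasterBean> writer1() {
    // hypothetical SCD2 update that closes out the currently active row
    return new JdbcBatchItemWriterBuilder<WrittenMasterBean>()
            .dataSource(dataSource)
            .sql("UPDATE dim_table SET end_date = :endDate, active_ind = 'N' " +
                 "WHERE business_key = :businessKey AND active_ind = 'Y'")
            .beanMapped() // :endDate / :businessKey must match bean properties
            .build();
}

@Bean
public JdbcBatchItemWriter<WrittenMasterBean> writer2() {
    // hypothetical insert of the new active row
    return new JdbcBatchItemWriterBuilder<WrittenMasterBean>()
            .dataSource(dataSource)
            .sql("INSERT INTO dim_table (business_key, end_date, active_ind) " +
                 "VALUES (:businessKey, :endDate, 'Y')")
            .beanMapped()
            .build();
}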
BackToBackPatternClassifier needs two things: a router classifier and a matcher map.
The restriction is that the keys of this classifier are Strings; you can't use any other key type.
Finally, pass the BackToBackPatternClassifier to the ClassifierCompositeItemWriter provided by Spring Batch:
@Bean
public ItemWriter<WrittenMasterBean> classifierWriter() {
    ClassifierCompositeItemWriter<WrittenMasterBean> writer = new ClassifierCompositeItemWriter<>();
    writer.setClassifier(classifier());
    return writer;
}
You configure classifierWriter() as the writer of your Step, as sketched below.
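A hedged sketch of that step wiring (the StepBuilderFactory, reader, processor, SourceRow type, and chunk size are assumptions):
@Bean
public Step scd2Step() {
    return stepBuilderFactory.get("scd2Step")
            .<SourceRow, WrittenMasterBean>chunk(100)  // SourceRow is a hypothetical reader type
            .reader(reader())                          // the ItemReader you already configured
            .processor(processor())                    // hypothetical processor that sets type to "Writer1"/"Writer2"
            .writer(classifierWriter())
            .build();
}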
Then you are good to go.

What is the difference between factory and pipeline design patterns?

What is the difference between factory and pipeline design patterns?
I am asking because I need to create classes, each of which has a method that will transform textual data in a certain way.
I have other classes whose data needs to be transformed. However, the order and selection of the transformations depends on (and only on) which base class these classes inherit from.
Is this somehow related to the pipeline and/or factory pattern?
A Factory creates objects without exposing the instantiation logic to the client and refers to the newly created object through a common interface. So the goal is to make the client completely unaware of which concrete type of product it uses and how that instance is created.
public interface IFactory // used by clients
{
    IProduct CreateProduct();
}

public class FooFactory : IFactory
{
    public IProduct CreateProduct()
    {
        // create new instance of FooProduct
        // setup something
        // setup something else
        // return it
    }
}
All creation details are encapsulated. You can create the instance via a new() call, or clone some existing sample FooProduct. You can skip the setup, or read some data from a database first. Anything.
Here we come to Pipeline. The purpose of a pipeline is to divide a larger processing task into a sequence of smaller, independent processing steps (filters). If the creation of your objects is a large task AND the setup steps are independent, you can use a pipeline for the setup inside the factory. But the instantiation step is definitely not independent in this case; it must occur before the other steps.
So you can provide filters (i.e., a pipeline) to set up your product:
public class BarFilter : IFilter
{
    private IFilter _next;

    public IProduct Setup(IProduct product)
    {
        // do Bar setup

        if (_next == null)
            return product;

        return _next.Setup(product);
    }
}

public abstract class ProductFactory : IProductFactory
{
    protected IFilter _filter;

    public IProduct CreateProduct()
    {
        IProduct product = InstantiateProduct();

        if (_filter == null)
            return product;

        return _filter.Setup(product);
    }

    protected abstract IProduct InstantiateProduct();
}
And in concrete factories you can configure a custom set of filters for your setup pipeline.
Factory is responsible for creating objects:
ICar volvo = CarFactory.BuildVolvo();
ICar bmw = CarFactory.BuildBMW();
IBook pdfBook = BookFactory.CreatePDFBook();
IBook htmlBook = BookFactory.CreateHTMLBook();
A pipeline will help you separate the processing into smaller tasks:
var searchQuery = new SearchQuery();
searchQuery.FilterByCategories(categoryCriteria);
searchQuery.FilterByDate(dateCriteria);
searchQuery.FilterByAuthor(authorCriteria);
There are also linear and non-linear pipelines. A linear pipeline would require us to filter by category, then by date, and then by author. A non-linear pipeline would allow us to run these steps simultaneously or in any order.
This article explains it quite well:
http://www.cise.ufl.edu/research/ParallelPatterns/PatternLanguage/AlgorithmStructure/Pipeline.htm