Accessing Pipeline within DoFn - apache-beam

I'm writing a pipeline to replicate data from one source to another. Information about the data sources is stored in a database (BigQuery). How can I use this data to build the read/write endpoints dynamically?
I tried to pass the Pipeline object to my custom DoFn, but it can't be serialized. Later I tried to call getPipeline() on a passed view, but that doesn't work either -- which is actually expected.
I can't know in advance all the tables I need to replicate, so I have to read them from the db (or any other source).
// builds some random view
PCollectionView<IdWrapper> idView = ...;

// reads table metadata and replicates data for each table
pipeline.apply(getTableMetaEndpont().read())
    .apply(ParDo.of(new MyCustomReplicator(idView)).withSideInputs(idView));

private static class MyCustomReplicator extends DoFn<TableMeta, TableMeta> {
    private final PCollectionView<IdWrapper> idView;

    private MyCustomReplicator(PCollectionView<IdWrapper> idView) {
        this.idView = idView;
    }

    // TableMeta {string: sourceTable, string: destTable}
    @ProcessElement
    public void processElement(@Element TableMeta tableMeta, ProcessContext ctx) {
        long id = ctx.sideInput(idView).getValue();
        // builds a read endpoint that depends on the table meta
        // updates entities
        // stores entities using another endpoint
        idView
            .getPipeline()
            .apply(createReadEndpoint(tableMeta).read())
            .apply(ParDo.of(new SomeFunction(tableMeta, id)))
            .apply(createWriteEndpoint(tableMeta).insert());
        ctx.output(tableMeta);
    }
}
I expect this to replicate the data specified by each TableMeta, but I can't use the pipeline within the DoFn because it can't be serialized/deserialized.
Is there any way to implement the intended behavior?

Related

How to run BigQueryIO.read().fromQuery with parameters

I need to run multiple queries from a single .sql file but with different params.
I've tried something like this, but it does not work because BigQueryIO.Read consumes only PBegin:
public PCollection<KV<String, TestDitoDto>> expand(PCollection<QueryParamsBatch> input) {
    PCollection<KV<String, Section1Dto>> section1 = input.apply("Read Section1 from BQ",
            BigQueryIO
                .readTableRows()
                .fromQuery(ResourceRetriever.getResourceFile("query/test/section1.sql"))
                .usingStandardSql()
                .withoutValidation())
        .apply("Convert section1 to Dto", ParDo.of(new TableRowToSection1DtoFunction()));
}
Are there any other ways to put params from an existing PCollection inside my BigQueryIO.read() invocation?
Are the different queries/parameters available at pipeline construction time? If so, you could just create multiple read transforms and combine the results, for example using a Flatten transform, as sketched below.
Beam's Java BigQuery source does not currently support reading a PCollection of queries. The Python BQ source does, though.
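If the queries are known when the pipeline is built, a minimal sketch (query1/query2 are hypothetical placeholders) could look like this:
// one read transform per known query, created at construction time
PCollection<TableRow> rows1 = pipeline.apply("Read query1",
    BigQueryIO.readTableRows().fromQuery(query1).usingStandardSql().withoutValidation());
PCollection<TableRow> rows2 = pipeline.apply("Read query2",
    BigQueryIO.readTableRows().fromQuery(query2).usingStandardSql().withoutValidation());

// merge the per-query results into a single PCollection
PCollection<TableRow> allRows = PCollectionList.of(rows1).and(rows2)
    .apply("Merge", Flatten.pCollections());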
I've come up with the following solution: not to use BigQueryIO but the regular GCP client library for accessing BigQuery, marking the client as transient (since it is not Serializable) and initializing it each time in a method annotated with @Setup:
public class DenormalizedCase1Fn extends DoFn<*> {
    private transient BigQuery bigQuery;

    @Setup
    public void initialize() {
        this.bigQuery = BigQueryOptions.newBuilder()
            .setProjectId(bqProjectId.get())
            .setLocation(LOCATION)
            .setRetrySettings(RetrySettings.newBuilder()
                .setRpcTimeoutMultiplier(1.5)
                .setInitialRpcTimeout(Duration.ofSeconds(5))
                .setMaxRpcTimeout(Duration.ofSeconds(30))
                .setMaxAttempts(3).build())
            .build().getService();
    }

    @ProcessElement
    ...
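For completeness, a hedged sketch of what the @ProcessElement body could look like, assuming each incoming element can produce the SQL string to run (toSql(), getKey(), and convert() are hypothetical helpers):
@ProcessElement
public void processElement(@Element QueryParamsBatch params, OutputReceiver<KV<String, TestDitoDto>> out) throws InterruptedException {
    // run the per-element query synchronously with the client initialized in @Setup
    QueryJobConfiguration config = QueryJobConfiguration.newBuilder(params.toSql()).build();
    for (FieldValueList row : bigQuery.query(config).iterateAll()) {
        out.output(KV.of(params.getKey(), convert(row))); // convert(...) maps a row to TestDitoDto
    }
}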

Any way I can change the mongo document name at runtime

In the project we need to change the collection name suffix every day based on the date.
So one day the collection is named:
samples_22032019
and the next day it is:
samples_23032019
Every day I need to change the suffix and recompile the Spring Boot application because of this. Is there any way to change this so the collection/table name is calculated dynamically based on the current date? Any advice for MongoRepository?
Considering the below is your bean, you can use the @Document annotation with a Spring Expression Language (SpEL) expression to resolve the suffix at runtime, as shown below:
@Document(collection = "samples_#{T(com.yourpackage.Utility).getDateSuffix()}")
public class Samples {
private String id;
private String name;
}
Now have your date logic in a Utility method which Spring can resolve at runtime. SpEL is handy in such scenarios.
package com.yourpackage;

import java.text.SimpleDateFormat;
import java.util.Date;

public class Utility {
    public static String getDateSuffix() {
        // Add your real logic here; the pattern below produces suffixes like 22032019.
        return new SimpleDateFormat("ddMMyyyy").format(new Date());
    }
}
HTH!
Make a cron job that runs daily, generates the new name for your collection, and executes the code below. Here I get the collection via MongoDatabase, then rename it using a MongoNamespace.
To get the old/new collection names you can write a separate method.
import org.bson.Document;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

import com.mongodb.MongoClient;
import com.mongodb.MongoNamespace;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;

@Component
public class RenameCollectionTask {

    @Scheduled(cron = "${cron}")
    public void renameCollection() {
        // creating the mongo client object
        final MongoClient client = new MongoClient(HOST_NAME, PORT);
        // selecting the mongo database
        final MongoDatabase database = client.getDatabase("databaseName");
        // selecting the mongo collection
        final MongoCollection<Document> collection = database.getCollection("oldCollectionName");
        // creating the new namespace
        final MongoNamespace newName = new MongoNamespace("databaseName", "newCollectionName");
        // renaming the collection
        collection.renameCollection(newName);
        System.out.println("Collection has been renamed");
        // closing the client
        client.close();
    }
}
To assign the collection name dynamically you can refer to this, so that a restart will not be required every time.
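For example, a minimal sketch with Spring Data's MongoTemplate (the mongoTemplate wiring and the sample object are assumed) that computes the collection name per call:
// compute today's suffix at call time, so no restart is needed when the date rolls over
String collection = "samples_" + new SimpleDateFormat("ddMMyyyy").format(new Date());
List<Samples> docs = mongoTemplate.find(new Query(), Samples.class, collection);
mongoTemplate.save(sample, collection);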
The renameCollection() method has the following limitations:
1) It cannot move a collection between databases.
2) It is not supported on sharded collections.
3) You cannot rename views.
Refer to this for details.

Getting state store data from a called function in Kafka Streams

In Kafka Streams' Processor API, can I pass the ProcessorContext from init() to another function as follows, and then get the context back, with its state store, in process()?
public void init(ProcessorContext context) {
    this.context = context;
    String resourceName = "config.properties";
    ClassLoader loader = Thread.currentThread().getContextClassLoader();
    Properties props = new Properties();
    try (InputStream resourceStream = loader.getResourceAsStream(resourceName)) {
        props.load(resourceStream);
    } catch (IOException e) {
        e.printStackTrace();
    }
    dataSplitter.timerMessageSource(props, context); // can I pass context like this?
    this.context.schedule(1000);
    // retrieve the key-value store named "patient"
    kvStore = (KeyValueStore<String, PatientDataSummary>) this.context.getStateStore("patient");
    // I want to read the state-store values written by the called function timerMessageSource(),
    // since the data to be put in the state store is generated there.
    // Is there any way I can get that via the context or otherwise?
}
The usage of ProcessorContext is somewhat limited and you cannot call every method it provides at arbitrary times. Thus, it depends on how you use it -- in general, you can pass it around as you wish (it will always be the same object throughout the lifetime of the processor).
If I understand your question correctly, you register a punctuation and use your dataSplitter within the punctuation callback, and you want to modify the store from there. That is absolutely possible -- you can either put the store into a class member, similar to what you do with the context, or use the context object to get the store within the punctuate callback, as sketched below.
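A minimal sketch, assuming the newer schedule(Duration, PunctuationType, Punctuator) overload and the "patient" store from the question; dataSplitter.buildSummary(...) is a hypothetical stand-in for your own logic:
private KeyValueStore<String, PatientDataSummary> kvStore;

@Override
@SuppressWarnings("unchecked")
public void init(ProcessorContext context) {
    this.context = context;
    // keep the store in a class member so the punctuation callback can use it later
    kvStore = (KeyValueStore<String, PatientDataSummary>) context.getStateStore("patient");
    context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
        PatientDataSummary summary = dataSplitter.buildSummary(timestamp); // hypothetical helper
        kvStore.put("patient-summary", summary);
    });
}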

How to get OutArguments from WorkflowApplication when the workflow waits for a response (bookmark or idle) and is not complete

How can I access out arguments with WorkflowApplication when the workflow waits for a response (bookmark or idle) and has not completed?
I also used Tracking to retrieve the values, but instead of saving them to a database I came up with the following solution.
Make a TrackingParticipant and collect the data from an activity.
You can fine-tune the tracking participant profile with a specific tracking query.
I have added a public property, Outputs, to hold the data from the record.
public class CustomTrackingParticipant : TrackingParticipant
{
    //TODO: Fine-tune the profile with the correct query.
    public IDictionary<String, object> Outputs { get; set; }

    protected override void Track(TrackingRecord record, TimeSpan timeout)
    {
        var customTrackingRecord = record as CustomTrackingRecord;
        if (customTrackingRecord != null)
        {
            Outputs = customTrackingRecord.Data;
        }
    }
}
In your custom activity you can set the values you want to expose for tracking with a CustomTrackingRecord.
Here is a sample to give you an idea.
protected override void Execute(NativeActivityContext context)
{
    var customRecord = new CustomTrackingRecord("QuestionActivityRecord");
    customRecord.Data.Add("Question", Question.Get(context));
    customRecord.Data.Add("Answers", Answers.Get(context).ToList());
    context.Track(customRecord);

    // This will create a bookmark with the display name and the workflow will go idle.
    context.CreateBookmark(DisplayName, Callback, BookmarkOptions.None);
}
On the WorkflowApplication instance you can add the tracking participant to the extensions:
workflowApplication.Extensions.Add(new CustomTrackingParticipant());
I subscribed to the persistable idle event of the workflowApplication instance with the following method.
In the method I get the tracking participant from the extensions.
Because we have set the outputs in the public property, we can access them and assign them to a member outside the workflow. See the following example:
private PersistableIdleAction PersistableIdle(WorkflowApplicationIdleEventArgs workflowApplicationIdleEventArgs)
{
    var ex = workflowApplicationIdleEventArgs.GetInstanceExtensions<CustomTrackingParticipant>();
    Outputs = ex.First().Outputs;
    return PersistableIdleAction.Unload;
}
I hope this example helped.
Even simpler: Use another workflow activity to store the value you are looking for somewhere (database, file, ...) before starting to wait for a response!
You could use Tracking.
The required steps would be:
1) Define a tracking profile which queries ActivityStates with the state closed.
2) Implement a TrackingParticipant to save the OutArgument in process memory, a database, or a file on disk.
3) Hook everything together.
The link contains all the information you will need to do this.

What is the difference between factory and pipeline design patterns?

What is the difference between the factory and pipeline design patterns?
I am asking because I need to make classes, each of which has a method that transforms textual data in a certain way.
I have other classes whose data needs to be transformed. However, the order and selection of the transformations depends on (and only on) which base class these classes inherit from.
Is this somehow related to the pipeline and/or the factory pattern?
A factory creates objects without exposing the instantiation logic to the client and refers to the newly created object through a common interface. So the goal is to make the client completely unaware of what concrete type of product it uses and how that instance is created.
public interface IFactory // used by clients
{
    IProduct CreateProduct();
}

public class FooFactory : IFactory
{
    public IProduct CreateProduct()
    {
        var product = new FooProduct(); // create a new instance of FooProduct
        // setup something
        // setup something else
        return product;
    }
}
All creation details are encapsulated. You can create the instance via a new() call, or clone some existing sample FooProduct, or skip the setup, or read some data from a database first. Anything.
Here we go to Pipeline. A pipeline's purpose is to divide a larger processing task into a sequence of smaller, independent processing steps (Filters). If the creation of your objects is a large task AND the setup steps are independent, you can use a pipeline for the setup inside the factory. But the instantiation step is definitely not independent in this case: it must occur prior to all other steps.
So, you can provide Filters (i.e. Pipeline) to setup your product:
public class BarFilter : IFilter
{
    private IFilter _next;

    public IProduct Setup(IProduct product)
    {
        // do Bar setup
        if (_next == null)
            return product;
        return _next.Setup(product);
    }
}

public abstract class ProductFactory : IProductFactory
{
    protected IFilter _filter;

    public IProduct CreateProduct()
    {
        IProduct product = InstantiateProduct();
        if (_filter == null)
            return product;
        return _filter.Setup(product);
    }

    protected abstract IProduct InstantiateProduct();
}
And in concrete factories you can set up a custom set of filters for your setup pipeline.
Factory is responsible for creating objects:
ICar volvo = CarFactory.BuildVolvo();
ICar bmw = CarFactory.BuildBMW();
IBook pdfBook = BookFactory.CreatePDFBook();
IBook htmlBook = BookFactory.CreateHTMLBook();
Pipeline will help you to separate processing into smaller tasks:
var searchQuery = new SearchQuery();
searchQuery.FilterByCategories(categoryCriteria);
searchQuery.FilterByDate(dateCriteria);
searchQuery.FilterByAuthor(authorCriteria);
There are also linear and non-linear pipelines. A linear pipeline would require us to filter by category, then by date, and then by author. A non-linear pipeline would allow us to run these steps simultaneously or in any order.
This article explains it quite well:
http://www.cise.ufl.edu/research/ParallelPatterns/PatternLanguage/AlgorithmStructure/Pipeline.htm