My team has a Beam pipeline where we're writing an unbounded PCollection of domain objects to BigQuery using BigQueryIO.write(). We're transforming the domain objects into TableRow objects inside BigQueryIO.write().withFormatFunction(). WriteResult.getSuccessfulInserts() gives us a PCollection of the TableRows that were successfully written to BigQuery, but we would rather have access to the original domain objects again, specifically only the domain objects that were successfully written.
We've come up with one solution where we add a groupingKey field to the domain objects and put that groupingKey into the TableRows in the withFormatFunction call. This lets us take the original input (PCollection<DomainObj>) and transform it into a PCollection<KV<String, DomainObj>> keyed by the groupingKey, transform the output from writeResult.getSuccessfulInserts() into a PCollection<KV<String, TableRow>> keyed the same way, and then do a CoGroupByKey on those keys to get a PCollection<KV<DomainObj, TableRow>>. From there we can drop the TableRows and end up with the PCollection of successfully written DomainObjects. There are a couple of reasons why this solution is undesirable:
We thought of using the BigQueryIO.write().ignoreUnknownValues() option to ensure the new groupingKey field we added to the TableRows doesn't end up in our BQ tables. This is a problem because our BigQuery schema is altered from time to time by upstream applications, and there are occasional instances where we do want unknown fields to be written to the table (we just don't want this groupingKey in the table).
The CoGroupByKey operation requires the same windowing on its inputs, and it's possible that the BigQueryIO.write operation could exceed that window length. This would force us to come up with complex solutions to handle items arriving after their window deadline.
Are there any more elegant solutions to write an unbounded PCollection of domain objects to BigQuery and end up with a PCollection of just the successfully written domain objects? Solutions that don't involve storing extra information in the TableRows are preferred. Thank you.
I'm attempting to get hands-on with Kedro, but I don't understand how to build my data fetcher (which I used before).
My data is stored in a MongoDB instance across multiple “tables”. One table holds my usernames; I want to fetch those first.
Thereafter, based on the usernames I get, I would like to fetch data from three “tables” and merge them.
How should I do this best in Kedro?
Should I put everything in a custom dataset? Or fetch only the usernames and do the rest in a part of the pipeline?
So this is an interesting one: Kedro has been designed so that the tasks have no knowledge of the IO that is required to provide or save their data. Your use case (for good reasons) requires you to cross that boundary.
My recommendation is to go down the custom dataset route, but potentially go a little further and make it return the three tables you need directly, i.e. do the username filter logic at this stage as well (a rough sketch follows below).
It's also perfectly fine to raise a NotImplementedError on save() if you're not going to do that.
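If it helps, here is a rough sketch of what such a dataset could look like. This is only an illustration under assumptions: the base class is AbstractDataSet (AbstractDataset in newer Kedro releases), and the class name, the collection names ("users", "table_a", "table_b", "table_c"), the "username" join field and the use of pymongo/pandas are all placeholders for whatever your project actually uses.

from typing import Any, Dict, Tuple

import pandas as pd
from pymongo import MongoClient
from kedro.io import AbstractDataSet  # AbstractDataset in newer Kedro releases


class MongoUserTablesDataSet(AbstractDataSet):
    """Loads the usernames, then the three related tables filtered by them."""

    def __init__(self, uri: str, database: str):
        self._uri = uri
        self._database = database

    def _load(self) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
        client = MongoClient(self._uri)
        db = client[self._database]

        # 1. Fetch the usernames first.
        usernames = [doc["username"] for doc in db["users"].find({}, {"username": 1})]

        # 2. Fetch the three other tables, already filtered to those usernames.
        query = {"username": {"$in": usernames}}
        tables = tuple(
            pd.DataFrame(list(db[name].find(query)))
            for name in ("table_a", "table_b", "table_c")  # placeholder names
        )
        client.close()
        return tables

    def _save(self, data: Any) -> None:
        # Read-only dataset, as mentioned above.
        raise NotImplementedError("Saving is not supported for this dataset.")

    def _describe(self) -> Dict[str, Any]:
        return {"uri": self._uri, "database": self._database}

You would register it in catalog.yml like any other dataset, and the first node of the pipeline then receives the three already-filtered tables and only has to do the merge.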
I've been struggling to find a good solution for this for the past day and would like to hear your thoughts.
I have a pipeline which receives a large & dynamic JSON array (containing only stringified objects),
I need to be able to create a ContainerOp for each entry in that array (using dsl.ParallelFor).
This works fine for small inputs.
Right now the array comes in as an HTTP URL to a file, due to the pipeline input argument size limitations of Argo and Kubernetes (or that is what I understood from the current open issues). But when I try to read the file in one Op and use its contents as input for the ParallelFor, I hit the output size limitation.
What would be a good & reusable solution for such a scenario?
Thanks!
the array comes in as an HTTP URL to a file, due to the pipeline input argument size limitations of Argo and Kubernetes
Usually the external data is first imported into the pipeline (downloaded and output). Then the components use inputPath and outputPath to pass big data pieces as files.
The size limitation only applies to data that you consume as a value (inputValue) instead of as a file (inputPath).
The loops consume the data by value, so the size limit applies to them.
What you can do is make this data smaller. For example, if your data is a JSON list of big objects [{obj1}, {obj2}, ..., {objN}], you can transform it into a list of indexes [1, 2, ..., N], pass that small list to the loop, and then inside the loop have a component that uses the index plus the full data file to select the single piece to work on (N -> {objN}).
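Here is a rough sketch of that index trick with the KFP v1 SDK, using lightweight Python components. Everything below is illustrative: the component names, the use of urllib for the import step, and the exact shape of your data are my own assumptions, not part of your pipeline.

import kfp
from kfp import dsl
from kfp.components import create_component_from_func, InputPath, OutputPath


def download_data(url: str, data_path: OutputPath()):
    # Import the external JSON file into the pipeline as an output artifact.
    import urllib.request
    urllib.request.urlretrieve(url, data_path)


def make_index_list(data_path: InputPath()) -> str:
    # Return a small JSON list of indexes instead of the big array itself.
    import json
    with open(data_path) as f:
        items = json.load(f)
    return json.dumps(list(range(len(items))))


def process_item(data_path: InputPath(), index: int):
    # Re-read the big file (passed by path, so no size limit) and handle one entry.
    import json
    with open(data_path) as f:
        items = json.load(f)
    print(f"processing item {index}: {str(items[index])[:100]}")


download_op = create_component_from_func(download_data)
index_op = create_component_from_func(make_index_list)
process_op = create_component_from_func(process_item)


@dsl.pipeline(name="parallel-over-big-array")
def pipeline(data_url: str):
    data = download_op(url=data_url)
    indexes = index_op(data=data.output)           # small, safe to pass by value
    with dsl.ParallelFor(indexes.output) as idx:   # loops over the index list only
        process_op(data=data.output, index=idx)    # big array always goes by file


kfp.compiler.Compiler().compile(pipeline, "pipeline.yaml")

The loop itself only ever sees the short index list, so the parameter size limit no longer applies to it.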
In Java or any other programming language we can save the state of a variable and refer to its value later as needed. This seems not to be possible with Apache Beam; can someone confirm? If it is possible, please point me to some samples or documentation.
I am trying to solve the problem below, which needs the output of a previous transform as context.
I am new to Apache Beam, so I am finding it hard to understand how to solve this.
Approach#1:
PCollection config = p.apply(ReadDBConfigFn(options.getPath()));
PCollection<Record> records = config.apply(FetchRecordsFn());
PCollection<Users> users = config.apply(FetchUsersFn());
// Now process using both 'records' and 'users'. How can this be done with Beam?
Approach#2:
PCollection config = p.apply(ReadDBConfigFn(options.getPath()));
PCollection<Record> records = config.apply(FetchRecordsFn()).apply(FetchUsersAndProcessRecordsFn());
// 'FetchUsersAndProcessRecordsFn' above needs 'config' so it can fetch the users, but there seems to be no way to pass it in?
If I understand correctly, you want to use elements from the two collections records and users in a processing step? There are two commonly used patterns in Beam to accomplish this:
If you are looking to join the two collections, you probably want to use a CoGroupByKey to group related records and users together for processing.
If one of the collections (records or users) is "small" and the entire set needs to be available during processing, you might want to send it as a side input to your processing step.
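For the side input option, here is a minimal sketch using the Beam Python SDK, purely for illustration; in the Java SDK the analogous pieces are View.asMap()/View.asList() and ParDo with .withSideInputs(). The element shapes and field names below are made up.

import apache_beam as beam


def enrich(record, user_lookup):
    # user_lookup is the materialized side input: a dict of user_id -> user.
    user = user_lookup.get(record["user_id"])
    return {**record, "user_name": user["name"] if user else None}


with beam.Pipeline() as p:
    records = p | "Records" >> beam.Create(
        [{"user_id": 1, "value": 10}, {"user_id": 2, "value": 20}]
    )
    users = p | "Users" >> beam.Create(
        [(1, {"name": "alice"}), (2, {"name": "bob"})]
    )

    enriched = records | "Enrich" >> beam.Map(
        enrich,
        user_lookup=beam.pvalue.AsDict(users),  # whole 'users' collection, visible to every element
    )
    enriched | "Print" >> beam.Map(print)

If users is instead large and keyed the same way as records, CoGroupByKey on a shared key is the better fit.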
It isn't clear what might be in the PCollection config in your example, so I may have misinterpreted... Does this meet your use case?
We are doing a POC on Druid to check whether it fits our use cases. We are able to ingest data, but we are not sure about the following:
How Druid supports schemaless input: let's say the input dimensions are at the end user's discretion, so there is no defined schema. The onus then lies on the application to identify new dimensions, identify their data types and ingest them. Is there any way to achieve this?
How Druid supports data type changes: let's say that at some point (say after ingesting 100 GB of data) there is a need to change a dimension's data type from string to long, or long to string (or another type). What is the recommended way to do this without hampering ongoing ingestion?
I looked over the docs but could not get a substantial overview of either use case.
For question 1, I'd ingest everything as a string and figure it out later. It should be possible to query string columns in Druid as numbers (a quick sketch of this is below).
The possible behaviours are explained in https://github.com/apache/incubator-druid/issues/4888:
(1) Consider the values to be zeros and do not try to parse string values. (This seems to be the current behaviour.)
(2) Try to parse string values, and consider the values to be zero if they are not parseable, null, or multi-valued.
One current inconsistency is that with expression-based column selectors (anything that goes through Parser/Expr) the behavior is (2). See IdentifierExpr + how it handles strings that are treated as numbers. But with direct column selectors the behavior is (1). In particular this means that e.g. a longSum aggregator behaves differently if it's "fieldName" : "x" vs. "expression" : "x" even though you might think they should behave the same.
You can follow the entire discussion here: https://github.com/apache/incubator-druid/issues/4888
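As a quick illustration of querying a string column as a number, here is a rough sketch using Druid SQL over HTTP; this is one possible way, not necessarily how you query today. The broker URL, datasource and column names are assumptions, and Druid SQL must be enabled on your cluster.

import requests

DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql/"  # Broker SQL endpoint (placeholder host)

# Cast a string-typed dimension to a number at query time.
query = """
SELECT SUM(CAST("user_supplied_value" AS BIGINT)) AS total
FROM "my_datasource"
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
"""

resp = requests.post(DRUID_SQL_URL, json={"query": query})
resp.raise_for_status()
print(resp.json())  # e.g. [{"total": 12345}]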
For question 2, I think it is necessary to reindex the data:
- http://druid.io/docs/latest/ingestion/update-existing-data.html
- http://druid.io/docs/latest/ingestion/schema-changes.html
I hope this helps
1) In such cases you don't need to specify any dimension columns in the Druid ingestion spec, and Druid will treat all columns which are not the timestamp as dimensions.
More detail about this approach can be found here:
Druid Schema less Ingestion
2) For the second question, you can make changes to the schema and Druid will create new segments with the new data type, while your old segments will still use the old data type.
If you want all your segments to use the new data type, you can reindex all the segments. Please check out this link for a further description of reindexing all segments: http://druid.io/docs/latest/ingestion/update-existing-data.html
Additional info on schema changes can be found here:
http://druid.io/docs/latest/ingestion/schema-changes.html
We came across a case where we want to retrieve data from a time series. Let's say we have time-based data: [“t1-t2”: {data1}, “t2-t3”: {data2}, “t3-t4”: {data3}]
With the above kind of data, we want to look up the exact data for a given time. For example, for a given time t1.5 the result has to be data1, and for t2.6 it should be data2.
To solve this, we are planning to store the data in a sorted map in Aerospike, as shown below: {“t1”: {data1}, “t2”: {data2}, “t3”: {data3}}
When a client asks for t1.5, we must return data1. To achieve this, we implemented a UDF at the server level that does a binary search for the nearest key at or below the given input (i.e. t1.5) and returns that key's value, i.e. data1.
Is there a better way of achieving the same, since this incurs a cost at the server level for every request? Even a UDF doing a binary search requires loading all the data into memory; can we avoid that?
We are planning to use Aerospike for this. Is there a better data store to handle such queries?
Thinking aloud… Storing t1-t2, t2-t3 is redundant on t2: just store t1, and t2 is inferred from the next key:value pair, i.e. {t1: data, t2: data, …}, stored key-sorted (map policy). You must know the maximum difference between any 't1' and 't2'. Build a secondary index on MAPKEY with type numeric (this essentially does the bulk of the sort work for you up front, in RAM). Search for records where t is between t-maxdiff and t+maxdiff; that gives you a set of few records, and you pass those to your UDF. Invoke the UDF on this small subset to return the data; it will be a very simple UDF. Note: UDFs are limited to 128 concurrent executions at any given time.
I'm not sure I understand the problem. First, you should be inserting into a K-ordered map, where the key is the timestamp (in millisecond or second or another resolution). The value would be a map of the attributes.
To get back any range of time you'd use a get_by_key_interval (for example the Python client's Client.map_get_by_key_range). You can figure out how to build the range but it's simply all between two timestamps.
Don't use a UDF for this; it is not going to perform or scale as well as the native map/list operations would.
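To make that concrete, here is a rough sketch with the Aerospike Python client and its aerospike_helpers map operations. The namespace, set, bin name, timestamps and the maximum interval length are all placeholders; it writes a key-ordered map of timestamp -> data and reads back a key range, taking the floor entry on the client side.

import aerospike
from aerospike_helpers.operations import map_operations as map_ops

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

key = ("test", "timeseries", "sensor-1")   # namespace, set, record key (placeholders)
MAP_BIN = "points"
KEY_ORDERED = {"map_order": aerospike.MAP_KEY_ORDERED}

# Insert entries into a key-ordered map: {timestamp: data}.
for ts, data in [(1000, {"v": 1}), (2000, {"v": 2}), (3000, {"v": 3})]:
    client.operate(key, [map_ops.map_put(MAP_BIN, ts, data, KEY_ORDERED)])

# Floor lookup for t = 2600: fetch everything in [t - max_interval, t] and take
# the last value; results come back in key order because the map is key-ordered.
t, max_interval = 2600, 1500
_, _, bins = client.operate(
    key,
    [map_ops.map_get_by_key_range(
        MAP_BIN, t - max_interval, t + 1, aerospike.MAP_RETURN_VALUE)],
)
values = bins[MAP_BIN]
print(values[-1] if values else None)   # -> {'v': 2}, i.e. the data stored at t2

client.close()

The range lookup runs entirely inside the native map operation, so there is no UDF and no need to pull the whole map into a Lua context.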