Adding a timestamp to a PCollection - apache-beam

I'm pretty new to Beam and working with a simple batch load process for a text file. I would like to add a timestamp for the insertion of the record in BigQuery. Is there a preferred pattern or best practice for adding an "insert date" to a PCollection? I've seen a couple of different approaches, but I'm curious whether one of them is preferred. Thank you!

There is a nice section on this within the Apache Beam documentation:
"An unbounded source provides a timestamp for each element. Depending on your unbounded source, you may need to configure how the timestamp is extracted from the raw data stream.
However, bounded sources (such as a file from TextIO) do not provide timestamps. If you need timestamps, you must add them to your PCollection’s elements."
There is also a nice java / python code example at:
https://beam.apache.org/documentation/programming-guide/#adding-timestamps-to-a-pcollections-elements
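For a bounded source like TextIO, a minimal Java sketch of that documented pattern could look like the following; the String element type and the class/transform names here are illustrative, not from the original post:

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Instant;

public class AddInsertTimestamp {

    /** DoFn that stamps each element with a processing-time "insert date". */
    static class AddInsertTimestampFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(@Element String line, OutputReceiver<String> out) {
            // Bounded sources such as TextIO attach no useful timestamp, so set one here.
            out.outputWithTimestamp(line, Instant.now());
        }
    }

    /** Usage: PCollection<String> stamped = AddInsertTimestamp.apply(lines); */
    static PCollection<String> apply(PCollection<String> lines) {
        return lines.apply("AddInsertTimestamp", ParDo.of(new AddInsertTimestampFn()));
    }
}
```

If the goal is literally an insert-date column in the BigQuery table, another common option is to set that field on each TableRow before the BigQueryIO write; the element-timestamp approach above is the one the linked guide describes.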

Related

What happens to a Spark DataFrame used in Structured Streaming when its underlying data is updated at the source?

I have a use case where I am joining a streaming DataFrame with a static DataFrame. The static DataFrame is read from a parquet table (a directory containing parquet files).
This parquet data is updated by another process once a day.
My question is what would happen to my static DataFrame?
Would it update itself because of the lazy execution or is there some weird caching behavior that can prevent this?
Can the update process make my code crash?
Would it be possible to force the DataFrame to update itself once a day in any way?
I don't have any code to share for this because I haven't written any yet; I am just exploring the possibilities. I am working with Spark 2.3.2.
A big (set of) question(s).
I have not implemented every aspect myself (yet), but this is my understanding, plus input from colleagues who implemented one of these approaches and whose reasoning I found compelling and logical. I note that there is not much information out there on this topic.
So, if you have a JOIN (streaming --> static), then:
If standard coding practices (as per Databricks) are applied and .cache() is applied to the static side, the Structured Streaming program will read the static source only once; no changes are seen on subsequent processing cycles and there is no program failure.
If standard coding practices are applied and caching is NOT used, the Structured Streaming program will read the static source on every processing cycle, and all changes will be seen on subsequent cycles.
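To make the two cases concrete, here is a minimal Java sketch of such a stream-static join; the Kafka source, paths, and the id join column are assumptions for illustration. Whether later micro-batches see newly arrived files then comes down to whether the static side was cached, as described above:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class StreamStaticJoin {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("stream-static-join").getOrCreate();

        // Static side: the daily-refreshed parquet directory. Adding .cache() pins the snapshot
        // read at start-up; leaving it off lets the static side be re-read each cycle.
        Dataset<Row> lookup = spark.read().parquet("/data/lookup_parquet"); // .cache() optional

        Dataset<Row> stream = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")   // assumed streaming source
                .option("subscribe", "events")
                .load()
                .selectExpr("CAST(key AS STRING) AS id", "CAST(value AS STRING) AS payload");

        stream.join(lookup, "id")                                   // stream-static inner join
              .writeStream()
              .format("parquet")
              .option("path", "/data/out")
              .option("checkpointLocation", "/data/chk")
              .start()
              .awaitTermination();
    }
}
```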
But JOINing against a LARGE static source is not a good idea. If a large dataset is involved, use HBase or some other key-value store, with mapPartitions, whether the data is volatile or not. This is more difficult, though. It was done by an airline company I worked at, and the data engineer/designer told me it was no easy task. Indeed, it is not that easy.
So, we can say that updates to static source will not cause any crash.
"...Would it be possible to force the DataFrame to update itself once a day in any way..." I have not seen any approach like this in the docs or here on SO. You could make the static source a dataframe using var, and use a counter on the driver. As the micro batch physical plan is evaluated and genned every time, no issue with broadcast join aspects or optimization is my take. Whether this is the most elegant, is debatable - and is not my preference.
If your data is small enough, the alternative is to perform the lookup with a JOIN on the primary key, augmented with a technical column holding some max (version) value so that the key becomes a compound primary key, and to have the background process add a new versioned set of data rather than overwrite the old one. This is easiest, in my view, if you know the data is volatile and small. Versioning also means others may still read the older data; I mention this because it may be a shared resource.
The final point for me is that I would NOT want to JOIN with the latest info if the static source is large - e.g. some Chinese companies have 100M customers! In that case I would use a key-value store as the lookup, with mapPartitions, as opposed to a JOIN. See https://medium.com/@anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc, which provides some insights. Also, this is older but still an applicable source of information: https://blog.codecentric.de/en/2017/07/lookup-additional-data-in-spark-streaming/. Both are good reads, but they require some experience and the ability to see the forest for the trees.
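For completeness, a shape-only sketch of that mapPartitions lookup pattern; KvClient is a hypothetical stand-in for an HBase (or other key-value) client, so treat this as the shape of the solution rather than working code. The point is one connection per partition and a point lookup per record, instead of a full JOIN:

```java
import org.apache.spark.api.java.function.MapPartitionsFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

import java.util.ArrayList;
import java.util.List;

public class KvLookupEnrichment {

    /** Hypothetical key-value client; wire up a real HBase/Redis client here. */
    interface KvClient {
        static KvClient connect(String endpoint) {
            throw new UnsupportedOperationException("plug in a real client");
        }
        String get(String key);
        void close();
    }

    /** Enriches a Dataset of keys with the value looked up per record. */
    static Dataset<String> enrich(Dataset<String> keys) {
        return keys.mapPartitions((MapPartitionsFunction<String, String>) partition -> {
            KvClient client = KvClient.connect("kv-host:1234");  // opened once per partition
            List<String> out = new ArrayList<>();
            while (partition.hasNext()) {
                String key = partition.next();
                out.add(key + "," + client.get(key));             // lookup instead of join
            }
            client.close();
            return out.iterator();
        }, Encoders.STRING());
    }
}
```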

Sorting file contents idiomatically with spring-batch

I have a CSV file with a number of fields. What is an idiomatic way to read the file, sort it using a subset of fields, and then write another CSV as output?
Should I even attempt to do this in spring-batch? I understand that *nix-based OSes have the sort utility to do this, but I'd like to contain all my work within spring batch if possible.
The Batch Processing Strategies section of the documentation seems to suggest that there might be standard utility steps to accomplish this:
In addition to the main building blocks, each application may use one or more of standard utility steps, such as:
Sort: A program that reads an input file and produces an output file where records have been re-sequenced according to a sort key field in the records. Sorts are usually performed by standard system utilities.
But I am not able to locate this. Any pointers most welcome!
Thanks very much!
Unless you really must do it inside Spring Batch, I would suggest you do it with OS-based commands.
But your point is correct: adding intermediary Steps to your Jobs to sort, filter, or even clean data is a mainstream pattern in batch processing and ETL jobs.
Hope this helps.
I found out that there is a SystemCommandTasklet that is meant to run OS commands. This can be used to do things like sorting, finding unique items, etc.
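For reference, a minimal sketch of wiring that up, assuming Spring Batch 4.x Java config; the file names, sort key, working directory, and timeout are illustrative:

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.step.tasklet.SystemCommandTasklet;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SortStepConfig {

    @Bean
    public SystemCommandTasklet sortTasklet() {
        SystemCommandTasklet tasklet = new SystemCommandTasklet();
        tasklet.setCommand("sort -t , -k 2,2 input.csv -o sorted.csv"); // sort by the 2nd CSV field
        tasklet.setWorkingDirectory("/var/batch/work");                 // assumed working directory
        tasklet.setTimeout(60_000);                                     // fail the step after 60s
        return tasklet;
    }

    @Bean
    public Step sortStep(StepBuilderFactory steps) {
        return steps.get("sortStep").tasklet(sortTasklet()).build();
    }
}
```

The tasklet maps the command's exit code to the step's exit status, so a failing sort fails the step.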

Writing Custom Extensions in Druid

I am new to Druid.
Problem Statement
We currently push raw event data to Druid. I have a requirement to apply certain calculations to the data (say, certain statistical techniques) which are not supported by Druid or the extensions it provides out of the box.
There are two questions I have -
What would be a better way to achieve this? (Have some external script that reads data from Druid, computes the calculations, and puts the results back into Druid?)
Can I take the route of writing a custom extension for Druid? I could not find any good documentation on how to go about writing/testing Druid extensions.
These links do not provide much in-depth information:
http://druid.io/docs/latest/development/modules.html
https://github.com/apache/incubator-druid (Druid repo that has some core and community contrib extensions)
Appreciate any help on this. Thank you.
You can achieve this both ways; it's up to you how comfortable you are writing an extension yourself and then maintaining it, which is certainly time-consuming compared to the other way.
If you read data from Druid, perform your calculation, and write the data back to Druid, you will end up writing to a separate table. If you are not storage-bound on the Druid cluster, you can certainly take this path, and it is less time-consuming.
Yes, writing a custom extension is the recommended way to perform any custom computation on the data, and a simple extension is easy enough to write. Here is an example GitHub repo that shows how to write a custom Druid extension: https://github.com/implydata/druid-example-extension
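To give a feel for what the linked example contains, the skeleton of an extension is a class implementing DruidModule (package org.apache.druid.initialization in recent releases, io.druid.initialization in older ones; check the version you build against). This is only a bare sketch; the class name is made up, and the Jackson registration of your custom component is shown as a comment:

```java
import com.fasterxml.jackson.databind.Module;
import com.google.inject.Binder;
import org.apache.druid.initialization.DruidModule;

import java.util.Collections;
import java.util.List;

public class MyExtensionModule implements DruidModule {

    @Override
    public List<? extends Module> getJacksonModules() {
        // Register custom JSON-typed components (aggregators, input formats, ...) here, e.g.:
        //   new SimpleModule("my-extension")
        //       .registerSubtypes(new NamedType(MyAggregatorFactory.class, "myCustomAgg"))
        return Collections.emptyList();
    }

    @Override
    public void configure(Binder binder) {
        // Guice bindings go here if the extension needs them; often empty.
    }
}
```

The extension jar also needs a service-loader file (META-INF/services/org.apache.druid.initialization.DruidModule) listing the module class, and it must be placed in an extensions directory referenced by druid.extensions.loadList, as described in the modules documentation linked above.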

Large XML with selective parsing

We are building a kind of staging application where we receive large XML files (ISO 20022 messages) with tons of elements defined in them. We store these XMLs in the database as XMLType and send them to a downstream system for further processing.
There is a GUI where we need to display some of those XML elements to users, allow them to update some of the fields, and store the result in the database again as a new XML message.
We are trying to find the most efficient implementation stack with respect to performance and memory.
One idea is to identify the XML elements that need to be displayed in the UI and define them as meta fields with XPath expressions, to avoid parsing the entire XML.
I'd appreciate any ideas for processing large XML when only certain elements need to be viewed and updated.
My experience is that using XML data types in a common RDBMS is OK, but not great. I found that native XML DBMSs, such as MarkLogic or eXist-db, work much better for ISO 20022.
If you want to continue with an RDBMS XML type, then use XQuery to pull the items you want. Oracle calls this XMLQuery; Microsoft has a query() function for XQuery.
As XQuery is based on XPath, then yes, using XPath is a good way to achieve what you want.
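On the application side, the same selective extraction can be done with plain XPath from the JDK; a small Java sketch (the sample message and path are illustrative, and real ISO 20022 documents are namespaced, which would additionally require registering a NamespaceContext):

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

public class SelectiveXmlRead {
    public static void main(String[] args) throws Exception {
        // Tiny stand-in for an ISO 20022 message (namespaces omitted for brevity).
        String xml = "<Document><CstmrCdtTrfInitn><GrpHdr><MsgId>ABC123</MsgId></GrpHdr>"
                   + "</CstmrCdtTrfInitn></Document>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // Only the fields the GUI needs are extracted; the rest of the message stays untouched.
        XPath xpath = XPathFactory.newInstance().newXPath();
        String msgId = xpath.evaluate("/Document/CstmrCdtTrfInitn/GrpHdr/MsgId", doc);
        System.out.println("MsgId = " + msgId);
    }
}
```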

DATASTAGE capabilities

I'm a Linux programmer.
I used to write code to get things done: Java, Perl, PHP, C.
I need to start working with DataStage.
All I see is that DataStage works on table/CSV-style data and does it line by line.
I want to know if DataStage can work on files that are not table/CSV-like. Can it load data into data structures and run functions on them, or is it limited to working on one line at a time?
Thank you for any information you can give on the capabilities of DataStage.
IBM (formerly Ascential) DataStage is an ETL platform that, indeed, works on data sets by applying various transformations.
This does not necessarily mean that you are constrained to applying only single-line transformations (you can also aggregate, join, split, etc.). Also, DataStage has its own programming language - BASIC - that allows you to modify the design of your jobs as needed.
Lastly, you are still free to call external scripts from within DataStage (either using the DSExecute function, Before Job property, After Job property or the Command stage).
Please check the IBM Information Center for comprehensive documentation on BASIC programming.
You could also check the DSXchange forums for DataStage specific topics.
Yes it can. As Razvan said, you can join, aggregate, and split. It can use loops and external scripts, and it can also handle XML.
My advice is that if you have large quantities of data to work on, then DataStage is your friend; if the data you have to load is not very big, it is going to be easier to use Java, C, or any programming language that you know.
You can use all kinds of functions and conversions and manipulate the data. Mainly, DataStage is valued for ease of use when you are handling humongous data from a data mart/data warehouse.
The main process in DataStage is ETL - Extraction, Transformation, Loading.
Where a programmer might use 100 lines of code to connect to some database, here we can do it with one click.
Anything can be done here, even C or C++ coding in a routine activity.
If you are talking about hierarchical files, like XML or JSON, the answer is yes.
If you are talking about complex files, such as those produced by COBOL programs, the answer is yes.
All using in-built functionality (e.g. Hierarchical Data stage, Complex Flat File stage). Review the DataStage palette to find other examples.