Write Kafka messages to Parquet every minute using Apache Beam

I want to build a simple Beam pipeline that stores events from Kafka to S3 in Parquet format every minute.
Here's a simplified version of my pipeline:
def add_timestamp(event: Any) -> Any:
    from datetime import datetime
    from apache_beam import window
    return window.TimestampedValue(event, datetime.timestamp(event[1].timestamp))
# Actual Pipeline
(
    pipeline
    | "Read from Kafka" >> ReadFromKafka(consumer_config, topics, with_metadata=False)
    | "Transformed" >> beam.Map(my_transform)
    | "Add timestamp" >> beam.Map(add_timestamp)
    | "window" >> beam.WindowInto(window.FixedWindows(60))  # 1 minute
    | "writing to parquet" >> beam.io.WriteToParquet('s3://test-bucket/', pyarrow_schema)
)
However, the pipeline failed with
GroupByKey cannot be applied to an unbounded PCollection with global windowing and a default trigger
This seems to be coming from https://github.com/apache/beam/blob/v2.41.0/sdks/python/apache_beam/io/iobase.py#L1145-L1146, which always applies GlobalWindows and thus causes this error. I'm wondering what I should do to correctly back up the events from Kafka (unbounded) to S3. Thanks!
By the way, I am running with the PortableRunner on Flink. The Beam version is 2.41.0 (the latest version seems to have the same code) and the Flink version is 1.14.5.
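One direction that is sometimes suggested for streaming file writes in the Python SDK, offered here only as an untested sketch rather than a confirmed fix, is to replace WriteToParquet with beam.io.fileio.WriteToFiles plus a small custom FileSink that buffers records and writes them out with pyarrow. The sketch below assumes the elements reaching the sink are dicts matching pyarrow_schema and a pyarrow version recent enough to provide Table.from_pylist; the shards value and file layout are placeholders.

import apache_beam as beam
import pyarrow as pa
import pyarrow.parquet as pq
from apache_beam.io import fileio

class ParquetSink(fileio.FileSink):
    # Buffers the records routed to one output file and writes them as Parquet on flush.
    def __init__(self, schema):
        self._schema = schema
        self._buffer = []
        self._fh = None

    def open(self, fh):
        self._fh = fh
        self._buffer = []

    def write(self, record):
        # Assumes each record is a dict whose keys match pyarrow_schema.
        self._buffer.append(record)

    def flush(self):
        table = pa.Table.from_pylist(self._buffer, schema=self._schema)
        pq.write_table(table, self._fh)

# ... keep the Kafka read, transform and FixedWindows(60) steps, then replace the sink with:
# | "writing to parquet" >> fileio.WriteToFiles(
#       path='s3://test-bucket/',
#       sink=lambda dest: ParquetSink(pyarrow_schema),
#       shards=1,
#   )

Because fileio.WriteToFiles follows the pipeline's own windowing, the FixedWindows(60) step should determine how records are grouped into files; whether this removes the GroupByKey error on your runner and Beam version would need to be verified.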

Related

spark read stream from parquet source

job1 -> parquet files [inputPath] -> job2 -> delta table [outputPath]
job1 writes a stream to Parquet files that are up to 3 days old.
I want to start job2 from the latest streamed files so that it only processes the latest data and does not try to process data older than the last hour if I start it now.
Right now I do:
val outDF = spark
  .readStream
  .option("latestFirst", "true")
  .option("maxFilesPerTrigger", "10000")
  .schema(Encoders.product[InputData].schema)
  .parquet(inputPath)
  .transform(Processor().process)
StreamDeltaWriter
  .write(outDF, outputPath, partitioningCols, checkpointLocation)
  .awaitTermination()
The issue I have is that it starts with the latest files and then works backwards, so it never gets to the newest arriving files because it tries to process all the historical files first.
How can I solve this?
EDIT:
According to the documentation, if latestFirst is true and it is set together with maxFilesPerTrigger, then maxFileAge is ignored.
Would maxFileAge work if I set latestFirst only, without maxFilesPerTrigger?

Pipeline Dependencies in Data Fusion

I have three pipelines in Data Fusion, say A, B and C. I want Pipeline C to be triggered only after both Pipeline A and Pipeline B have completed. Pipeline triggers only allow a dependency on a single pipeline.
Can this be implemented in Data Fusion?
You can do it using Google Cloud Composer [1]. In order to perform this action, first of all you need to create a new environment in Google Cloud Composer [2]. Once that is done, you need to install a new Python package in your environment [3]; the package you will need to install is [4] "apache-airflow-backport-providers-google".
With this package installed you will be able to use these operators [5]; the one you will need is [6] "Start a DataFusion pipeline". This way you will be able to start a new pipeline from Airflow.
An example of the python code would be as follows:
import airflow
import datetime
from airflow import DAG
from airflow import models
from airflow.operators.bash_operator import BashOperator
from datetime import timedelta
from airflow.providers.google.cloud.operators.datafusion import (
    CloudDataFusionStartPipelineOperator
)

default_args = {
    'start_date': airflow.utils.dates.days_ago(0),
    'retries': 1,
    'retry_delay': timedelta(minutes=5)
}

with models.DAG(
        'composer_DF',
        schedule_interval=datetime.timedelta(days=1),
        default_args=default_args) as dag:
    # the operations.
    A = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="A",
        instance_name="instance_name", task_id="start_pipelineA",
    )
    B = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="B",
        instance_name="instance_name", task_id="start_pipelineB",
    )
    C = CloudDataFusionStartPipelineOperator(
        location="us-west1", pipeline_name="C",
        instance_name="instance_name", task_id="start_pipelineC",
    )
    # First A then B and then C
    A >> B >> C
You can set the time intervals by checking the Airflow documentation.
Once you have this code saved as a .py file, save it to the Google Cloud Storage DAG folder of your environment.
When the DAG starts, it will execute task A, and when it finishes it will execute task B, and so on.
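Note that A >> B >> C runs the three pipelines strictly one after another. If, as in the original question, C only needs to wait for both A and B, Airflow also accepts a list on one side of the dependency operator:

# A and B run in parallel; C starts only after both have finished.
[A, B] >> C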
[1] https://cloud.google.com/composer
[2] https://cloud.google.com/composer/docs/how-to/managing/creating#:~:text=In%20the%20Cloud%20Console%2C%20open%20the%20Create%20Environment%20page.&text=Under%20Node%20configuration%2C%20click%20Add%20environment%20variable.&text=The%20From%3A%20email%20address%2C%20such,%40%20.&text=Your%20SendGrid%20API%20key.
[3] https://cloud.google.com/composer/docs/how-to/using/installing-python-dependencies
[4] https://pypi.org/project/apache-airflow-backport-providers-google/
[5] https://airflow.readthedocs.io/en/latest/_api/airflow/providers/google/cloud/operators/datafusion/index.html
[6] https://airflow.readthedocs.io/en/latest/howto/operator/google/cloud/datafusion.html#start-a-datafusion-pipeline
There is no direct way that I can think of, but here are two workarounds.
Workaround 1: merge pipeline A and pipeline B into a single pipeline AB, then trigger pipeline C (AB > C).
Pipeline A - (GCS Copy > Decompress)
Pipeline B - (GCS2 > thrashsad)
Add a BigQueryExecute stage to mitigate the error "Invalid DAG. There is an island made up of stages..", since the merged pipeline would otherwise contain two disconnected branches.
In BigQueryExecute, use a valid dummy query.
Merging the two pipelines into one may make testing them harder. To overcome this you can add a dummy condition so that only one of the pipelines runs at a time.
In BigQueryExecute, change the query to 'Select ${flag}' and pass the value of flag as a runtime argument, or use 'Select 1 as flag' and set "Row As Arguments" to true.
Add a Condition plugin after BigQueryExecute with the condition runtime['flag'] = 1.
The Condition plugin has two outlets; connect them to pipeline A and pipeline B.
Workaround 2: store a flag for both pipelines (A & B) in a BigQuery table and create two flows, A > C and B > C, to trigger pipeline C. This triggers pipeline C twice, but by using BigQueryExecute and the Condition plugin, pipeline C will only proceed when both flags are available in the BigQuery table.
How?
In pipelines A and B, write an output row to a BigQuery table 'Pipeline_Run'.
In pipeline C, add BigQueryExecute with the query 'select count(*) as Cnt from ds.Pipeline_Run' and set "Row As Arguments" to true.
In pipeline C, add a Condition plugin, check whether the value of cnt is 2 (runtime['cnt'] = 2), and connect the rest of the pipeline's plugins to its "Yes" outlet.
You can also explore "schedules" set through the CDAP REST APIs. That allows parallel execution of pipelines without depending on Cloud Composer (except for a file-based trigger of the first pipeline in the workflow; for that you would need a Cloud Function, or perhaps a Cloud Composer file sensor).

GCP Dataflow runner error when deploying pipeline using beam-nuggets library - "Failed to read inputs in the data_plane."

I have been testing an Apache Beam pipeline within the Apache Beam notebooks provided by GCP, using a Kafka instance as input and BigQuery as output. I have been able to run the pipeline successfully via the interactive runner, but when I deploy the same pipeline to the Dataflow runner it seems to never actually read from the Kafka topic that has been defined. Looking into the logs gives me the error:
Failed to read inputs in the data plane. Traceback (most recent call
last): File
/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py,
Implementation based on this post here
Any ideas? Code provided below:
from __future__ import print_function
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from beam_nuggets.io import kafkaio
kafka_config = {"topic": kafka_topic, "bootstrap_servers": ip_addr}
# p = beam.Pipeline(interactive_runner.InteractiveRunner(), options=options) # <- use for test
p = beam.Pipeline(DataflowRunner(), options=options) # <- use for dataflow implementation
notifications = p | "Reading messages from Kafka" >> kafkaio.KafkaConsume(kafka_config)
preprocess = notifications | "Pre-process for model" >> beam.ParDo(preprocess())
model = preprocess | "format & predict" >> beam.ParDo(model())
newWrite = model | beam.io.WriteToBigQuery(
    table_spec,
    schema=table_schema,
    write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
    create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
Error message from logs:
Failed to read inputs in the data plane. Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/apache_beam/runners/worker/data_plane.py", line 528, in _read_inputs for elements in elements_iterator: File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 416, in __next__ return self._next() File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 689, in _next raise self grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNAVAILABLE details = "DNS resolution failed" debug_error_string = "{"created":"#1595595923.509682344","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1595595923.509650517","description":"Resolver transient failure","file":"src/core/ext/filters/client_channel/resolving_lb_policy.cc","file_line":216,"referenced_errors":[{"created":"#1595595923.509649070","description":"DNS resolution failed","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc","file_line":375,"grpc_status":14,"referenced_errors":[{"created":"#1595595923.509645878","description":"unparseable host:port","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc","file_line":417,"target_address":""}]}]}]}" >
and also
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNAVAILABLE details = "DNS resolution failed" debug_error_string = "{"created":"#1594205651.745381243","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3948,"referenced_errors":[{"created":"#1594205651.745371624","description":"Resolver transient failure","file":"src/core/ext/filters/client_channel/resolving_lb_policy.cc","file_line":216,"referenced_errors":[{"created":"#1594205651.745370349","description":"DNS resolution failed","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/dns_resolver_ares.cc","file_line":375,"grpc_status":14,"referenced_errors":[{"created":"#1594205651.745367499","description":"unparseable host:port","file":"src/core/ext/filters/client_channel/resolver/dns/c_ares/grpc_ares_wrapper.cc","file_line":417,"target_address":""}]}]}]}" >
Pipeline settings:
Python sdk harness started with pipeline_options: {'streaming': True, 'project': 'example-project', 'job_name': 'beamapp-root-0727105627-001796', 'staging_location': 'example-staging-location', 'temp_location': 'example-staging-location', 'region': 'europe-west1', 'labels': ['goog-dataflow-notebook=2_23_0_dev'], 'subnetwork': 'example-subnetwork', 'experiments': ['use_fastavro', 'use_multiple_sdk_containers'], 'setup_file': '/root/notebook/workspace/setup.py', 'sdk_location': '/root/apache-beam-custom/packages/beam/sdks/python/dist/apache-beam-2.23.0.dev0.tar.gz', 'sdk_worker_parallelism': '1', 'environment_cache_millis': '0', 'job_port': '0', 'artifact_port': '0', 'expansion_port': '0'}
As far as I know, Failed to read inputs in the data plane ... status = StatusCode.UNAVAILABLE details = "DNS resolution failed" could be an issue in the Python Beam SDK; it is recommended to update to Python Beam SDK 2.23.0.
It seems this isn't possible with my implementation plan, but with multi-language pipelines it appears to be more viable. I opened a ticket with Google support on this matter and got the following reply after some time investigating:
“… at this moment Python doesn't have any KafkaIO that works with
DataflowRunner. You can use Java as a workaround. In case you need
Python for something in particular (TensorFlow or similar), a
possibility is to send the message from Kafka to a PubSub topic (via
another pipeline that only reads from Kafka and publish to PS or an
external application).”
So feel free to take their advice, or you might be able to hack something together. I just revised my architecture to use Pub/Sub instead of Kafka.
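For reference, swapping the Kafka source in the snippet above for Pub/Sub is a small change. A minimal sketch, assuming a subscription already exists and the pipeline runs in streaming mode; the subscription path below is a placeholder:

import apache_beam as beam
from apache_beam.io.gcp.pubsub import ReadFromPubSub

# Hypothetical replacement for the KafkaConsume step; messages arrive as bytes.
notifications = (
    p
    | "Reading messages from Pub/Sub" >> ReadFromPubSub(
        subscription="projects/example-project/subscriptions/example-subscription")
    | "Decode" >> beam.Map(lambda msg: msg.decode("utf-8"))
)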

spark structured streaming file source read from a certain partition onwards

I have a folder on HDFS like below containing ORC files:
/path/to/my_folder
It contains partitions:
/path/to/my_folder/dt=20190101
/path/to/my_folder/dt=20190102
/path/to/my_folder/dt=20190103
...
Now I need to process the data here using streaming.
A spark.readStream.format("orc").load("/path/to/my_folder") nicely works.
However, I do not want to process the whole table, but rather start only from a certain partition onwards, similar to a certain Kafka offset.
How can this be implemented? I.e., how can I specify the initial state to read from?
Spark Structured Streaming File Source Starting Offset claims that there is no such feature.
Their suggestion to use latestFirst is not desirable for my use case, as I do not aim to build an always-on streaming application, but rather to use Trigger.Once like a batch job, with the nice streaming semantics of duplicate reduction and handling of late-arriving data.
If this is not available, what would be a suitable workaround?
edit
Run a warm-up stream with option("latestFirst", true) and
option("maxFilesPerTrigger", "1"), with a checkpoint, a dummy sink and a huge
processing time. This way, the warm-up stream will save the latest file
timestamp to the checkpoint.
Run the real stream with option("maxFileAge", "0") and a real sink, using the
same checkpoint location. In this case the stream will process only newly
available files.
https://stackoverflow.com/a/51399134/2587904
Building on this idea, let's look at an example:
# in bash
rm -rf data
mkdir -p data/dt=20190101
echo "1,1,1" >> data/dt=20190101/1.csv
echo "1,1,2" >> data/dt=20190101/2.csv
mkdir data/dt=20190102
echo "1,2,1" >> data/dt=20190102/1.csv
echo "1,2,2" >> data/dt=20190102/2.csv
mkdir data/dt=20190103
echo "1,3,1" >> data/dt=20190103/1.csv
echo "1,3,2" >> data/dt=20190103/2.csv
mkdir data/dt=20190104
echo "1,4,1" >> data/dt=20190104/1.csv
echo "1,4,2" >> data/dt=20190104/2.csv
spark-shell --conf spark.sql.streaming.schemaInference=true
// from now on in scala
val df = spark.readStream.csv("data")
df.printSchema
val query = df.writeStream.format("console").start
query.stop
// cleanup the data and start from scratch.
// this time instead of outputting to the console, write to file
val query = df.writeStream.format("csv")
  .option("path", "output")
  .option("checkpointLocation", "checkpoint")
val started = query.start
// in bash
# generate new data
mkdir data/dt=20190105
echo "1,5,1" >> data/dt=20190105/1.csv
echo "1,5,2" >> data/dt=20190105/2.csv
echo "1,4,3" >> data/dt=20190104/3.csv
// in scala
started.stop
// cleanup the output, start later on with custom checkpoint
//bash: rm -rf output/*
val started = query.start
// bash
echo "1,4,3" >> data/dt=20190104/4.csv
started.stop
// *****************
//bash: rm -rf output/*
Everything works as intended: the operation picks up where the checkpoint marks the last processed file.
How can a checkpoint definition be generated by hand, such that all files in dt=20190101 and dt=20190102 are considered processed, no late-arriving data is tolerated for them anymore, and processing continues with all the files from dt=20190103 onwards?
I see that spark generates:
commits
metadata
offsets
sources
_spark-metadata
files and folders.
So far I only know that _spark-metadata can be ignored when setting an initial state / checkpoint.
But I have not yet figured out (from the other files) which minimal values need to be present so that processing picks up from dt=20190103 onwards.
edit 2
By now I know that:
commits/0 needs to be present
metadata needs to be present
offsets needs to be present
but can be very generic:
v1
{"batchWatermarkMs":0,"batchTimestampMs":0,"conf":{"spark.sql.shuffle.partitions":"200"}}
{"logOffset":0}
When I tried to remove one of the already processed and committed files from sources/0/0, the query still runs, but it processes not only data newer than the existing committed data but any data, in particular the files I had just removed from the log.
How can I change this behavior so that only data more current than the initial state is processed?
edit 3
The docs (https://jaceklaskowski.gitbooks.io/spark-structured-streaming/spark-sql-streaming-FileStreamSource.html) and also the javadocs ;) list getOffset:
The maximum offset (getOffset) is calculated by fetching all the files
in path excluding files that start with _ (underscore).
That sounds interesting, but so far I have not figured out how to use it to solve my problem.
Is there a simpler way to achieve the desired functionality besides creating a custom (copy) of the FileSource?
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L237
maxFileAge also sounds interesting.
I have started to work on a custom file stream source, but I fail to properly instantiate it: https://gist.github.com/geoHeil/6c0c51e43469ace71550b426cfcce1c1
When calling:
val df = spark.readStream.format("org.apache.spark.sql.execution.streaming.StatefulFileStreamSource")
.option("partitionState", "/path/to/data/dt=20190101")
.load("data")
The operation fails with:
InstantiationException: org.apache.spark.sql.execution.streaming.StatefulFileStreamSource
at java.lang.Class.newInstance(Class.java:427)
at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:196)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:88)
at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:88)
at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:150)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:159)
... 53 elided
Caused by: java.lang.NoSuchMethodException: org.apache.spark.sql.execution.streaming.StatefulFileStreamSource.<init>()
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.newInstance(Class.java:412)
... 59 more
Even though it is basically a copy of the original source, what is different? Why is the constructor not found from https://github.com/apache/spark/blob/v2.2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L196, while it works just fine for https://github.com/apache/spark/blob/v2.2.3/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala#L42?
Even:
touch -t 201801181205.09 data/dt=20190101/1.csv
touch -t 201801181205.09 data/dt=20190101/2.csv
val df = spark.readStream
  .option("maxFileAge", "2d")
  .csv("data")
returns the whole dataset and fails to filter down to only the k most recent days.

Flink Read AvroInputFormat PROCESS_CONTINUOUSLY

I am new to Flink (v1.3.2) and am trying to read Avro records continuously in Scala on EMR. I know that for files you can use something like the following and it will keep running and scanning the directory.
val stream = env.readFile(
  inputFormat = textInputFormat,
  filePath = path,
  watchType = FileProcessingMode.PROCESS_CONTINUOUSLY,
  interval = Time.seconds(10).toMilliseconds
)
Is there a similar way in Flink for Avro records? I have the following code:
val textInputFormat = new AvroInputFormat(new Path(path), classOf[User])
textInputFormat.setNestedFileEnumeration(true)
val avroInputStream = env.createInput(textInputFormat)
val output = avroInputStream.map(line => (line.getUserID, 1))
  .keyBy(0)
  .timeWindow(Time.seconds(10))
  .sum(1)
output.print()
I am able to see the output there, but then Flink switches to FINISHED. I still want the job to keep running and wait for any new files that arrive in the future. Is there something like FileProcessingMode.PROCESS_CONTINUOUSLY for this? Please suggest!
I figured this out by setting up a flink-yarn-session on EMR and making the job run with PROCESS_CONTINUOUSLY.
env.readFile(textInputFormat, path, FileProcessingMode.PROCESS_CONTINUOUSLY, 100)
Create a new Flink YARN session using flink-yarn-session -n 2 -d
Get the application_id using yarn application -list; for example, it is application_0000000000_0002
Attach the flink run job to that application_id: flink run -m yarn-cluster -yid application_0000000000_0002 xxx.jar
More detail can now be found in the EMR documentation: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html