Integration testing of a Beam Pipeline - apache-beam

We have a Beam pipeline written with the Python SDK. The pipeline reads from an ordered Pub/Sub topic, applies some small transformations, and then inserts into Elasticsearch.
We want to write integration tests that cover the entire flow: reading from Pub/Sub and then using the ES client to make assertions about what was inserted as a result. Since the job is streaming, it's a bit tricky to exercise inside a test. Here is an example of how I do it now:
@pytest.fixture
def test_options():
    setup_options = SetupOptions()
    standard_options = setup_options.view_as(StandardOptions)
    standard_options.streaming = True
    standard_options.runner = "DirectRunner"
    return standard_options


def test_reading_from_pubsub_to_elastic(test_options, customer_es_client, pubsub_publisher, user_events):
    expected_ids = ["1_3", "1_9", "1_4", "1_6", "1_10", "1_8", "1_5", "1_7", "1_2", "1_1"]
    p = HelpderClassToInitPipeline(pipeline=TestPipeline(blocking=False, options=test_options))
    pipe = p.create_pipeline(project=PROJECT, local_config=LocalConfig(run_local=False))
    pipe_result = pipe.run()

    # publish the test events and wait for the publish futures to complete
    futures = []
    for e in user_events:
        futures.append(
            pubsub_publisher.publish(
                topic=f"topic_",
                data=json.dumps(e).encode(),
            )
        )
    for f in futures:
        f.result()

    time.sleep(10)  # wait for the pipeline to consume events

    result = customer_es_client.search(index="customer", body={"_source": ["_id"], "query": {"match_all": {}}})
    ids = [d["_id"] for d in result["hits"]["hits"]]
    # note: list.sort() sorts in place and returns None, so compare sorted copies instead
    assert sorted(ids) == sorted(expected_ids)

    pipe_result.cancel()
We are using pytest as the test runner, and we run the tests in CI. Basically, we provision Pub/Sub topics in CI, then run the tests, and then we clean everything up.
You can see the time.sleep call that is needed so that the pipeline can consume and store the events before we query ES. I'm not sure this is best practice.
Any advice on how I can improve this?
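One common alternative to a fixed sleep is to poll Elasticsearch until the expected documents show up or a timeout expires, so the wait becomes an upper bound rather than a guess. A minimal sketch, assuming the same fixtures as above (the helper name, timeout, and poll interval are illustrative, not part of the original code):

import time

def wait_for_documents(es_client, index, expected_count, timeout_s=60, poll_interval_s=1):
    """Poll ES until `expected_count` documents are visible or `timeout_s` elapses."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        result = es_client.search(index=index, body={"query": {"match_all": {}}})
        hits = result["hits"]["hits"]
        if len(hits) >= expected_count:
            return hits
        time.sleep(poll_interval_s)
    raise AssertionError(f"Expected {expected_count} documents in '{index}' within {timeout_s}s")

# in the test, instead of time.sleep(10):
# hits = wait_for_documents(customer_es_client, "customer", len(expected_ids))
# ids = [d["_id"] for d in hits]

The test then fails with a clear message when the pipeline never delivers, and finishes as soon as the data arrives instead of always waiting the full ten seconds.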

Related

pyspark failing to run query in parallel using foreach

I have a function that generates different queries, executes them, and writes data into different tables. I want to parallelize this.
Here is an example:
def build_and_execute_sql(item):
    gen_sql = 'insert overwrite table schema.table_d_{} select * from ...'.format(item)
    spark.sql(gen_sql)

sc = spark.sparkContext
lst = ['products', 'orders', 'deliveries']
rdd = sc.parallelize(lst)
rdd.foreach(build_and_execute_sql)
When I execute this, it fails with no specific error. My goal is to execute these in parallel.
I have about 12 such queries that are generated and executed.
I tried to play around with rdd.foreach(build_and_execute_sql).collect(), but nothing really works.
Any pointers? I'm wondering why foreach would fail.
I'm familiar with multiprocessing, but I'm wondering if there is a clean way to do it in PySpark itself.
The code inside rdd.foreach runs on the executors, where the SparkSession is not available, so calls like spark.sql fail there; that is why foreach does not work here. You can instead drive the queries from the driver with Python's multiprocessing ThreadPool:
from multiprocessing.pool import ThreadPool

lst = ['products', 'orders', 'deliveries']
with ThreadPool(3) as p:
    p.map(build_and_execute_sql, lst)
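For context, a slightly fuller, self-contained sketch of the same pattern; the session setup and per-table error handling are illustrative assumptions, not part of the original answer:

from multiprocessing.pool import ThreadPool
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-inserts").enableHiveSupport().getOrCreate()

def build_and_execute_sql(item):
    # each call submits an independent Spark job from the driver;
    # the threads mostly block waiting on Spark, so a thread pool is enough
    gen_sql = 'insert overwrite table schema.table_d_{} select * from ...'.format(item)
    try:
        spark.sql(gen_sql)
        return (item, None)
    except Exception as exc:  # surface failures per table instead of failing silently
        return (item, exc)

items = ['products', 'orders', 'deliveries']
with ThreadPool(len(items)) as pool:
    for item, error in pool.map(build_and_execute_sql, items):
        if error is not None:
            print(f"query for {item} failed: {error}")

A thread pool (rather than a process pool) is the right fit here because the SparkSession lives in the driver process and can be shared across threads, but not across forked worker processes.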

Cloud Dataflow GlobalWindow trigger ignored

Using the AfterPane.elementCountAtLeast trigger does not work when run using the Dataflow runner, but works correctly when run locally. When run on Dataflow, it produces only a single pane.
The goal is to extract data from Cloud SQL, transform it, and write it to Cloud Storage. However, there is too much data to keep in memory, so it needs to be split up and written to Cloud Storage in chunks. That's what I hoped this would do.
The complete code is:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  // produce one global window with one pane per ~500 records
  .withGlobalWindow(WindowOptions(
    trigger = Repeatedly.forever(AfterPane.elementCountAtLeast(500)),
    accumulationMode = AccumulationMode.DISCARDING_FIRED_PANES
  ))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)
  .withShardNameTemplate("-P-S")
  .withWindowedWrites() // gets us one file per window & pane

pipe.saveAsCustomOutput("writer", out)
I think the root of the problem may be that the JdbcIO class is implemented as a PTransform<PBegin,PCollection> and a single call to processElement outputs the entire SQL query result:
public void processElement(ProcessContext context) throws Exception {
  try (PreparedStatement statement =
      connection.prepareStatement(
          query.get(), ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
    statement.setFetchSize(fetchSize);
    parameterSetter.setParameters(context.element(), statement);
    try (ResultSet resultSet = statement.executeQuery()) {
      while (resultSet.next()) {
        context.output(rowMapper.mapRow(resultSet));
      }
    }
  }
}
In the end, I had two problems to resolve:
1. The process would run out of memory, and 2. the data was written to a single file.
There is no way to work around problem 1 with Beam's JdbcIO and Cloud SQL because of the way it uses the MySQL driver: the driver loads the entire result within a single call to executeQuery. There is a way to get the driver to stream results, but I had to implement my own code to do that. Specifically, I implemented a BoundedSource for JDBC.
For the second problem, I used the row number to set the timestamp of each element. That allowed me to explicitly control how many rows go into each window using FixedWindows.
elementCountAtLeast is a lower bound, so producing only a single pane is a valid choice for a runner.
You have a couple of options when doing this for a batch pipeline:
Allow the runner to decide how big the files are and how many shards are written:
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")

pipe.saveAsCustomOutput("writer", out)
This is typically the fastest option when the TextIO is preceded by a GroupByKey or a source that supports splitting. To my knowledge JDBC doesn't support splitting, so your best option is to add a Reshuffle after the jdbcSelect, which enables parallel processing once the data has been read from the database, as in the sketch below.
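The question uses Scio, but Reshuffle is a standard Beam transform; purely as an illustration of where it sits in the pipeline, here is a minimal sketch with the Beam Python SDK (the Create step is a stand-in for the JDBC read, and the paths are made up):

import apache_beam as beam

rows = [("a", 1), ("b", 2)]  # stand-in data; in practice this comes from the database read

with beam.Pipeline() as p:
    (p
     | "ReadRows" >> beam.Create(rows)
     # Reshuffle redistributes the elements across workers, so the steps
     # after the (non-splittable) read can run in parallel
     | "Reshuffle" >> beam.Reshuffle()
     | "Format" >> beam.Map(lambda row: "|".join(str(v) for v in row))
     | "Write" >> beam.io.WriteToText("/tmp/staging/rows", file_name_suffix=".txt"))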
Manually group into batches using the GroupIntoBatches transform.
val pipe = sc.jdbcSelect(getReadOptions(connOptions, stmt))
  .applyTransform(ParDo.of(new Translator()))
  .map(row => row.mkString("|"))
  .apply(GroupIntoBatches.ofSize(500))

val out = TextIO
  .write()
  .to("gs://test-bucket/staging")
  .withSuffix(".txt")
  .withNumShards(1)

pipe.saveAsCustomOutput("writer", out)
In general, this will be slower than option #1, but it does allow you to choose how many records are written per file.
There are a few other ways to do this, each with its pros and cons, but the above two are likely the closest to what you want. If you add more details to your question, I may revise this answer further.

Flink: How to write DataSet to a variable instead of to a file

I have a Flink batch program written in Scala using the DataSet API which produces a final dataset I am interested in. I would like to get that dataset as a variable or value (e.g. a list or sequence of Strings) within my program, without having to write it to any file. Is that possible?
I have seen that Flink allows for collection data sinks for debugging (the only example in the docs is in Java). However, this is only allowed in local execution, and in any case I don't know its Scala equivalent. What I would like is to write the final resulting dataset, after the whole parallel Flink execution is done, to a program value or variable.
First, try this for the Scala version of the collection data sink:
import org.apache.flink.api.scala._
import org.apache.flink.api.java.io.LocalCollectionOutputFormat

// ...

val env = ExecutionEnvironment.getExecutionEnvironment

// Create a DataSet from a list of elements
val words = env.fromElements("w1", "w2", "w3")

var outData: java.util.List[String] = new java.util.ArrayList[String]()
words.output(new LocalCollectionOutputFormat(outData))

// execute program
env.execute("Flink Batch Scala")
println(outData)
Second, if your dataset fits in the memory of a single machine, why do you need a distributed processing framework at all? Think more about your use case, and try to use the right transformations on your dataset.
I used Flink 1.7.2 with Scala 2.12. This is a streaming prediction using an SVM that I wrapped up in a Model class. I think the most correct answer is to use collect(), which returns a Seq. I found this after searching for hours; I got the idea from Flink Git - Line 95.
var temp_jaringan: DataSet[(Vector, Double)] = model.predict_jaringan(value)
temp_jaringan.print()
var temp_produk: DataSet[(Vector, Double)] = model.predict_produk(value)
temp_produk.print()

var result_jaringan: Seq[(Vector, Double)] = temp_jaringan.collect()
var result_produk: Seq[(Vector, Double)] = temp_produk.collect()

// labels: "Keduanya" = both, "Jaringan" = network, "Produk" = product, "Bukan Keduanya" = neither
if (result_jaringan(0)._2 == 1.0 && result_produk(0)._2 == 1.0) {
  println("Keduanya")
} else if (result_jaringan(0)._2 == 1.0 && result_produk(0)._2 == -1.0) {
  println("Jaringan")
} else if (result_jaringan(0)._2 == -1.0 && result_produk(0)._2 == 1.0) {
  println("Produk")
} else {
  println("Bukan Keduanya")
}
It may vary for other versions. After using and searching Flink material like a mad dog for weeks, even months, for my final graduation project, I know that the Flink project needs more documentation and tutorials, especially for beginners like me.
Anyway, correct me if I'm wrong. Thanks!

Using Custom Hadoop input format for processing binary file in Spark

I have developed a Hadoop-based solution that processes a binary file, using classic Hadoop MapReduce. The binary file is about 10 GB and is divided into 73 HDFS blocks, and the business logic, written as a map process, operates on each of these 73 blocks. We have developed a custom InputFormat and RecordReader in Hadoop that return a key (IntWritable) and value (BytesWritable) to the map function. The value is nothing but the contents of an HDFS block (binary data). The business logic knows how to read this data.
Now, I would like to port this code to Spark. I am a beginner in Spark and could run simple examples (word count, the pi example), but I could not find a straightforward example for processing binary files in Spark. I see two solutions for this use case. The first is to avoid using the custom input format and record reader: find an approach in Spark that creates an RDD for those HDFS blocks, and use a map-like method that feeds the HDFS block contents to the business logic. If this is not possible, I would like to re-use the custom input format and custom reader using methods such as the Hadoop API, HadoopRDD, etc. My problem: I do not know whether the first approach is possible or not. If it is, can anyone please provide some pointers with examples? I was trying the second approach, but have been highly unsuccessful. Here is the code snippet I used:
package org {

  import org.apache.hadoop.io.{BytesWritable, IntWritable}
  import org.apache.spark.{SparkConf, SparkContext}

  object Driver {
    def myFunc(key: IntWritable, content: BytesWritable): Int = {
      println(key.get())
      println(content.getSize())
      return 1
    }

    def main(args: Array[String]) {
      // create a spark context
      val conf = new SparkConf().setAppName("Dummy").setMaster("spark://<host>:7077")
      val sc = new SparkContext(conf)
      println(sc)
      // RandomAccessInputFormat is the custom InputFormat ported from the Hadoop job
      val rd = sc.newAPIHadoopFile("hdfs:///user/hadoop/myBin.dat", classOf[RandomAccessInputFormat], classOf[IntWritable], classOf[BytesWritable])
      val count = rd.map(x => myFunc(x._1, x._2)).reduce(_ + _)
      println("The count is *****************************" + count)
    }
  }
}
Please note that the print statement in the main method prints 73, which is the number of blocks, whereas the print statements inside the map function print 0.
Can someone tell me where I am going wrong here? I think I am not using the API the right way, but I failed to find documentation/usage examples.
A couple of problems at a glance. You define myFunc but call func. Your myFunc has no return type, so you can't call collect(). If your myFunc truly doesn't have a return value, you can do foreach instead of map.
collect() pulls the data in an RDD to the driver to allow you to do stuff with it locally (on the driver).
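The question's code is Scala, but the map/foreach/collect distinction works the same way in PySpark; a small illustrative sketch (the values are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-vs-foreach").getOrCreate()
rdd = spark.sparkContext.parallelize([1, 2, 3])

# map returns a value per element; collect() pulls those results back to the driver
doubled = rdd.map(lambda x: x * 2).collect()  # [2, 4, 6]
print(doubled)

# foreach runs purely for side effects on the executors and returns nothing,
# so there is no result to collect afterwards
rdd.foreach(lambda x: print(x))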
I have made some progress on this issue. I am now using the code below, which does the job:
var hRDD = new NewHadoopRDD(sc, classOf[RandomAccessInputFormat],
  classOf[IntWritable],
  classOf[BytesWritable],
  job.getConfiguration()
)

val count = hRDD.mapPartitionsWithInputSplit { (split, iter) => myfuncPart(split, iter) }.collect()
However, I ended up with another error, the details of which I have posted here: Issue in accessing HDFS file inside spark map function
15/10/30 11:11:39 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 40.221.94.235): java.io.IOException: No FileSystem for scheme: spark
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2584)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2591)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)

Job executed with no data in Spark Streaming

My code:
// messages is JavaPairDStream<K, V>
Fun01(messages)
Fun02(messages)
Fun03(messages)
Fun01, Fun02, and Fun03 all have transformations and output operations (foreachRDD).
Fun01 and Fun03 both executed as expected, which proves that "messages" is not null or empty.
On the Spark application UI, I found Fun02's output stage under "Spark stages", which proves that it executed.
The first line of Fun02 is a map function, and I added logging inside it. I also added logging for every step in Fun02; all of it shows that no data came through.
Does somebody know possible reasons? Thanks very much.
@maasg Fun02's logic is:
msg_02 = messages.mapToPair(...)
msg_03 = msg_02.reduceByKeyAndWindow(...)
msg_04 = msg_03.mapValues(...)
msg_05 = msg_04.reduceByKeyAndWindow(...)
msg_06 = msg_05.filter(...)
msg_07 = msg_06.filter(...)
msg_07.cache()
msg_07.foreachRDD(...)
I have tested on Spark 1.1 and Spark 1.2, which are the versions supported by my company's Spark cluster.
It seems that this is a bug in Spark 1.1 and Spark 1.2, fixed in Spark 1.3.
I posted my test results here: http://secfree.github.io/blog/2015/05/08/spark-streaming-reducebykeyandwindow-data-lost.html .
When two reduceByKeyAndWindow operations are used back to back, data loss may appear, depending on the window and slide values.
I cannot find the bug in Spark's issue list, so I cannot get the patch.