Issue with Combine function in Apache Beam Go SDK - apache-beam

We came across an issue with the Combine operation in the Apache Beam Go SDK (v2.28.0) when running a pipeline on Google Cloud Dataflow. I understand that the Go SDK is experimental, but it would be great if someone could help us understand whether there is anything wrong with our code, or whether there is a bug in the Go SDK or Dataflow. The issue only happens when running the pipeline on Google Dataflow with a large data set. We are trying to combine a PCollection<pairedVec>, with
type pairedVec struct {
    Vec1 [1048576]uint64
    Vec2 [1048576]uint64
}
There are 10,000,000 items in the PCollection.
Main func:
func main() {
    flag.Parse()
    beam.Init()
    ctx := context.Background()
    pipeline := beam.NewPipeline()
    scope := pipeline.Root()
    records := textio.ReadSdf(scope, *inputFile)
    rRecords := beam.Reshuffle(scope, records)
    vecs := beam.ParDo(scope, &genVecFn{LogN: *logN}, rRecords)
    histogram := beam.Combine(scope, &combineVecFn{LogN: *logN}, vecs)
    lines := beam.ParDo(scope, &flattenVecFn{}, histogram)
    textio.Write(scope, *outputFile, lines)
    if err := beamx.Run(ctx, pipeline); err != nil {
        log.Exitf(ctx, "Failed to execute job: %s", err)
    }
}
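combineVecFn is not shown above; a minimal sketch of a combiner of this shape (assuming the combination is an element-wise sum of the two vectors, which may differ from the real implementation) would be:

// Sketch only: assumes the combination is an element-wise sum of both vectors.
type combineVecFn struct {
    LogN uint64 `json:"logN"`
}

// MergeAccumulators is the one method the Go SDK requires of a structural
// CombineFn when the input, accumulator and output types are all pairedVec.
func (fn *combineVecFn) MergeAccumulators(a, b pairedVec) pairedVec {
    for i := range a.Vec1 {
        a.Vec1[i] += b.Vec1[i]
        a.Vec2[i] += b.Vec2[i]
    }
    return a
}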
After reading the input file, Dataflow scheduled 1000 workers to generate the PCollection and started the combination. The number of workers then dropped to almost 1 and stayed there for a very long time. Eventually the job failed with the following error log:
2021-03-02T06:13:40.438112597Z Workflow failed. Causes: S09:CombinePerKey/CoGBK'1/Read+CombinePerKey/main.combineVecFn+CombinePerKey/main.combineVecFn/Extract+beam.dropKeyFn+main.flattenVecFn+textio.Write/beam.addFixedKeyFn+textio.Write/CoGBK/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers: go-job-1-1614659244459204-03012027-u5s6-harness-q8tx Root cause: The worker lost contact with the service., go-job-1-1614659244459204-03012027-u5s6-harness-44hk Root cause: The worker lost contact with the service., go-job-1-1614659244459204-03012027-u5s6-harness-05nm Root cause: The worker lost contact with the service., go-job-1-1614659244459204-03012027-u5s6-harness-l22w Root cause: The worker lost contact with the service.
[Chart: the change in worker count over time]
Edit
We tried adding a step to "pre-combine" the records into 100,000 keys (combineDomain=100000) before combining all of them together:
Main function:
func main() {
    flag.Parse()
    beam.Init()
    ctx := context.Background()
    pipeline := beam.NewPipeline()
    scope := pipeline.Root()
    records := textio.ReadSdf(scope, *inputFile)
    rRecords := beam.Reshuffle(scope, records)
    vecs := beam.ParDo(scope, &genVecFn{LogN: *logN}, rRecords)
    keyVecs := beam.ParDo(scope, &addRandomKeyFn{Domain: *combineDomain}, vecs)
    combinedKeyVecs := beam.CombinePerKey(scope, &combineVecFn{LogN: *logN}, keyVecs)
    combinedVecs := beam.DropKey(scope, combinedKeyVecs)
    histogram := beam.Combine(scope, &combineVecFn{LogN: *logN}, combinedVecs)
    lines := beam.ParDo(scope, &flattenVecFn{}, histogram)
    textio.Write(scope, *outputFile, lines)
    if err := beamx.Run(ctx, pipeline); err != nil {
        log.Exitf(ctx, "Failed to execute job: %s", err)
    }
}
But the job scheduled only one worker for it, and failed after a long time:
Workflow failed. Causes: S06:Reshuffle/e6_gbk/Read+Reshuffle/e6_gbk/GroupByWindow+Reshuffle/e6_unreify+main.genVecFn+main.addRandomKeyFn+CombinePerKey/CoGBK'2+CombinePerKey/main.combineVecFn/Partial+CombinePerKey/CoGBK'2/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers:
go-job-1-1615178257414007-03072037-mrlo-harness-ppjj
Root cause: The worker lost contact with the service.,
go-job-1-1615178257414007-03072037-mrlo-harness-czng
Root cause: The worker lost contact with the service.,
go-job-1-1615178257414007-03072037-mrlo-harness-79n8
Root cause: The worker lost contact with the service.,
go-job-1-1615178257414007-03072037-mrlo-harness-mj6c
Root cause: The worker lost contact with the service.
After adding another reshuffle before CombinePerKey(), the pipeline scheduled 1000 workers. But the job was extremely slow and used a large amount of shuffle data: one hour in, genVecFn had finished less than 10 percent and had already produced 8.08 TB of shuffle data. This is basically consistent with our production code, which eventually failed because it used up the 40 TB shuffle-data quota.
We tried another method to reduce the workload on a single worker: segment the vector [1048576]uint64 into 32 pieces of [32768]uint64 and combine each of the pieces separately. Something like:
totalLength := uint64(1 << *logN)
segLength := uint64(1 << *segmentBits)
for i := uint64(0); i < totalLength/segLength; i++ {
    fileName := strings.ReplaceAll(*outputFile, path.Ext(*outputFile), fmt.Sprintf("-%d-%d%s", i+1, totalLength/segLength, path.Ext(*outputFile)))
    pHistogram := beam.Combine(scope, &combineVecRangeFn{StartIndex: i * segLength, Length: segLength}, vecs)
    flattened := beam.ParDo(scope, &flattenVecRangeFn{StartIndex: i * segLength}, pHistogram)
    textio.Write(scope, fileName, flattened)
}
The job succeeded eventually.
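combineVecRangeFn and flattenVecRangeFn are likewise not shown; a rough sketch of what the range combiner could look like (the accumulator layout and the element-wise sum are assumptions) is:

// Sketch only: combines just the [StartIndex, StartIndex+Length) slice of each
// input vector, so the accumulator holds 2*Length uint64s instead of 2*1048576.
type combineVecRangeFn struct {
    StartIndex uint64 `json:"startIndex"`
    Length     uint64 `json:"length"`
}

func (fn *combineVecRangeFn) CreateAccumulator() []uint64 {
    // Vec1's segment occupies [0, Length), Vec2's segment [Length, 2*Length).
    return make([]uint64, 2*fn.Length)
}

func (fn *combineVecRangeFn) AddInput(acc []uint64, v pairedVec) []uint64 {
    for i := uint64(0); i < fn.Length; i++ {
        acc[i] += v.Vec1[fn.StartIndex+i]
        acc[fn.Length+i] += v.Vec2[fn.StartIndex+i]
    }
    return acc
}

func (fn *combineVecRangeFn) MergeAccumulators(a, b []uint64) []uint64 {
    for i := range a {
        a[i] += b[i]
    }
    return a
}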

Given your pipeline code, the job downsizing to 1 worker is behaving as expected for the Go SDK, since it lacks some of the optimizations of the Java and Python SDKs. The reason it happens is that you use beam.Combine, which is a global combine, meaning that every element in the PCollection is combined down to one value. On the Go SDK this means that all elements need to be localized to a single worker to be combined, which for 10 million items of about 16 megabytes each (two [1048576]uint64 arrays at 8 bytes per element is 16 MiB, roughly 160 TB in total) takes too long, and the job most likely times out (you can probably confirm this by looking for a timeout message in the Dataflow logs).
Other SDKs have optimizations in place which split the input elements among workers to combine down, before consolidating to a single worker. For example in the Java SDK: "Combining can happen in parallel, with different subsets of the input PCollection being combined separately, and their intermediate results combined further, in an arbitrary tree reduction pattern, until a single result value is produced."
Fortunately, this solution is easy to implement manually in the Go SDK. Simply place your elements into N buckets (where N is greater than the number of workers you'd ideally want) by assigning random keys in the range [0, N). Then perform a CombinePerKey; only elements with matching keys need to be localized on a worker, so this combine step can be split across multiple workers. Follow that up with DropKey and then the global Combine, and you should get the intended result.
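The only piece you need to write yourself is the keying DoFn; a minimal sketch (the Domain field name is taken from the question's addRandomKeyFn, everything else is an assumption) looks like:

import (
    "math/rand"
)

// addRandomKeyFn emits each element under a pseudo-random key in [0, Domain),
// so that the subsequent CombinePerKey spreads the first round of combining
// across up to Domain buckets (and therefore across many workers).
type addRandomKeyFn struct {
    Domain int64 `json:"domain"`
}

func (fn *addRandomKeyFn) ProcessElement(v pairedVec) (int64, pairedVec) {
    return rand.Int63n(fn.Domain), v
}

DropKey then discards the bucket keys, and the final global Combine only has to merge at most Domain partial results instead of the full input.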

Related

go func for db insert operation inside a for loop

I am new to Go. I have read about goroutines, but I am wondering whether they can be used for DB insert operations. I have the following scenario:
I need to insert rows for different types of products, one row per product.
E.g., if I have 5 products I need to insert their id, name and created_at as rows, so 5 rows in total for 5 products. Is the following approach good to use?
for _, j := range items {
    go func(j product_object) {
        obj := prepare_dto(j)
        save_in_db(obj)
    }(j)
}
I made a trial with and without using go func:
Without go func, the average time is 22 ms.
With go func, the average time is 427 ns.
Is the above approach a good practice for DB operations?
Yes, you can do it. However, you are making len(items) calls to the database, which could potentially wear down your database due to too many connections. It's almost always a bad idea to insert or update the database inside a for loop. I suggest you do a batch insert with only one call to the database.
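For illustration, here is a rough sketch of a single batched INSERT (the table and column names are invented for the example, and the ? placeholder style assumes a MySQL-like driver; note also that the 427 ns figure above most likely only measures launching the goroutines, since nothing waits for the inserts to complete):

import (
    "database/sql"
    "fmt"
    "strings"
    "time"
)

// product mirrors the fields described in the question (id, name, created_at).
type product struct {
    ID   int64
    Name string
}

// batchInsert writes all products with a single INSERT statement instead of
// one statement (or one goroutine) per row.
func batchInsert(db *sql.DB, items []product) error {
    if len(items) == 0 {
        return nil
    }
    placeholders := make([]string, 0, len(items))
    args := make([]interface{}, 0, len(items)*3)
    now := time.Now()
    for _, p := range items {
        placeholders = append(placeholders, "(?, ?, ?)")
        args = append(args, p.ID, p.Name, now)
    }
    query := fmt.Sprintf("INSERT INTO products (id, name, created_at) VALUES %s", strings.Join(placeholders, ", "))
    _, err := db.Exec(query, args...)
    return err
}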

pq: [parent] Data too large

I'm using Grafana to visualize, in different panels, some data stored in CrateDB.
Some of my dashboards work correctly, but there are 3 specific dashboards (created by someone from my work team) which, at certain times of the day, stop showing data (No Data) and display the following error:
db query error: pq: [parent] Data too large, data for [fetch-1] would be [512323840/488.5mb], which is larger than the limit of [510027366/486.3mb], usages [request=0/0b, in_flight_requests=0/0b, query=150023700/143mb, jobs_log=19146608/18.2mb, operations_log=10503056/10mb]
Honestly, I would like to understand what it means and how I can fix it.
I remain attentive to any help you can give me, and I deeply appreciate it.
What I tried
17 SQL statements of the form:
SELECT
  time_index AS "time",
  entity_id AS metric,
  v1_ps
FROM etsm
WHERE
  entity_id = 'SM_B3_RECT'
ORDER BY 1, 2
for 17 different entities.
What I expect
I expect to receive the data corresponding to each of the SQL statements for their respective graphs.
The result
As a result, some of the statements receive no data, along with the warning message I shared:
db query error: pq: [parent] Data too large, data for [fetch-1] would be [512323840/488.5mb], which is larger than the limit of [510027366/486.3mb], usages [request=0/0b, in_flight_requests=0/0b, query=150023700/143mb, jobs_log=19146608/18.2mb, operations_log=10503056/10mb]
As an additional fact, the panel is configured to refresh every 15 minutes, but no matter how many times I manually refresh it, the set of statements that receive data changes.
Example: I refresh the panel and SQL statements A, B and C get data while the others don't; I refresh again and statements D, H and J receive data while the others don't (in a seemingly random pattern).
Other additional information:
I have access to the database being queried by Grafana, and the data is there.
You don't have a time condition, so the query selects and processes all records every time, and you are hitting limits of your DB (e.g. the size of processed data). Add a time condition so that only a fraction of the records is returned.

What is an updated index configuration of Google Firestore in Datastore mode?

Since Nov 08 2022, 16h UTC, we sometimes get the following DatastoreException with code UNAVAILABLE and message:
Query timed out. Please try either limiting the entities scanned, or run with an updated index configuration.
I want to get all Keys of a certain kind of entity. These are returned in batches together with a new cursor. When using the cursor to get the next batch, the above error happens. I do not expect the query to time out so quickly. (It might be that it takes up to a few seconds until I request the next batch of Keys using the returned cursor, but this never used to be a problem in the past.)
There was no problem before the automatic upgrade to Firestore. Also, counting entities of a kind often results in the error DatastoreException: "The datastore operation timed out, or the data was temporarily unavailable."
I am wondering whether I have to make any changes on my side. Does anybody else encounter these problems with Firestore in Datastore mode?
What is meant by "an updated index configuration"?
Thanks
Stefan
I just wanted to follow up here since we were able to do detailed analysis and come up with a workaround. I wanted to record our findings here for posterity's sake.
The root of the problem is queries over large ranges of deleted keys. Given a schema like:
Kind: ExampleKind
Data:
Key                 lastUpdatedMillis
ExampleKind/1040    5
ExampleKind/1052    0
ExampleKind/1064    12
ExampleKind/1065    100
ExampleKind/1070    42
Datastore will automatically generate both an ASC and a DESC index on the lastUpdatedMillis field.
The lastUpdatedMillis ASC index table would have the following logical entries:
Index Key    Entity Key
0            ExampleKind/1052
5            ExampleKind/1040
12           ExampleKind/1064
42           ExampleKind/1070
100          ExampleKind/1065
In the workload you've described, there was an operation that did the following:
1. SELECT * FROM ExampleKind WHERE lastUpdatedMillis <= nowMillis()
2. For every ExampleKind entity returned by the query, perform some operation which updates lastUpdatedMillis.
3. Some of the updates may fail, so the query from step 1 is repeated to catch any remaining entities.
When the operation completes, there are large key ranges in the index tables that are deleted, but in the storage system these rows still exist with special deletion markers. They are visible internally to queries, but are filtered out of the results:
Index Key    Entity Key
x            xxxx
x            xxxx
x            xxxx
42           ExampleKind/1070
...          and so on ...
x            xxxx
When we repeat the query over this data, if the number of deleted rows is very large (100,000 ... 1,000,000), the storage system may spend the entire operation looking for non-deleted data in this range. Eventually the garbage collection and compaction mechanisms will remove the deleted rows, and querying this key range becomes fast again.
A reasonable workaround is to reduce the amount of work the query has to do by restricting the time range of the lastUpdatedMillis field.
For example, instead of scanning the entire range of lastUpdatedMillis < now, we could break up the query into:
(now - 60 minutes) <= lastUpdatedMillis < now
(now - 120 minutes) <= lastUpdatedMillis < (now - 60 minutes)
(now - 180 minutes) <= lastUpdatedMillis < (now - 120 minutes)
This example uses 60-minute ranges; however, the specific "chunk size" can be tuned to the shape of your data. These smaller queries will either succeed and find some results, or scan the entire key range and return 0 results, but in both scenarios they will complete within the RPC deadline.
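As an illustration in Go, the chunked keys-only scan could look roughly like this with the cloud.google.com/go/datastore client (the kind and field names come from the example above; the chunk size, millisecond conversion and overall structure are assumptions for the sketch):

import (
    "context"
    "time"

    "cloud.google.com/go/datastore"
)

// fetchKeysInChunks scans ExampleKind one time window at a time, so each RPC
// only has to walk a bounded slice of the lastUpdatedMillis index.
func fetchKeysInChunks(ctx context.Context, client *datastore.Client, oldest time.Time) ([]*datastore.Key, error) {
    var all []*datastore.Key
    chunk := 60 * time.Minute
    for end := time.Now(); end.After(oldest); end = end.Add(-chunk) {
        start := end.Add(-chunk)
        q := datastore.NewQuery("ExampleKind").
            Filter("lastUpdatedMillis >=", start.UnixMilli()).
            Filter("lastUpdatedMillis <", end.UnixMilli()).
            KeysOnly()
        keys, err := client.GetAll(ctx, q, nil)
        if err != nil {
            return nil, err
        }
        all = append(all, keys...)
    }
    return all, nil
}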
Thank you again for reaching out about this!
A couple of notes:
This deadlining query problem can occur with any kind of query over the index (projection, keys-only, full entity, etc.).
Despite what the error message says, no extra index is needed here, nor would one speed up the operation. Datastore's built-in ASC/DESC index over each field already exists and is serving this query.

Delphi - Duplicating Data between data sources

Delphi Seattle, Win10. I need to write a generic routine to refresh a set of tables from a single source to a single destination. The tables on both ends already exist, and the process is a complete refresh, i.e. empty the destination table and then copy all rows. The tables are identical: same columns, data types, etc. The challenge is the restriction on HOW to access the data. The only access I have to the source is via REST. I can pull the data via REST, using RESTClient, RESTRequest and RESTResponse connected to a RESTAdapter, TDataSource and TClientDataSet. The destination is an Oracle database, which I have direct access to.
I have approximately 15 tables, the largest being about 200,000 rows and 40 columns.
Right now I am looping through each row and, for each column on the source, finding the matching column in the destination... and the performance is killing me. Is there a more elegant (and particularly faster) way to do this?
Here is a code snippet of what I am doing now...
// for each row, loop
...
// Copy Each Field
for i := 0 to dm1.ds_Generic.DataSet.FieldCount - 1 do
begin
  FieldFrom := dm1.ds_Generic.DataSet.Fields[i];
  FieldTo := dm1.tGeneric.FindField(FieldFrom.FieldName);
  if Assigned(FieldTo) then
  begin
    FieldTo.Value := FieldFrom.Value;
  end;
end;

Occasional PostgreSQL "Duplicate key value violates unique constraint" error from Go insert

I have a table with the unique constraint
CREATE UNIQUE INDEX "bd_hash_index" ON "public"."bodies" USING btree ("hash");
I also have a Go program that takes "body" values on a channel, filters out the duplicates by hashing, and inserts only the non-duplicates into the database.
Like this:
import (
    "crypto/md5"
    "database/sql"
    "encoding/hex"
    "log"
    "strings"
    "time"
)

type Process struct {
    DB         *sql.DB
    BodiesHash map[string]bool
    Channel    chan BodyIterface
    Logger     *log.Logger
}

func (pr *Process) Run() {
    bodyInsert, err := pr.DB.Prepare("INSERT INTO bodies (hash, type, source, body, created_timestamp) VALUES ($1, $2, $3, $4, $5)")
    if err != nil {
        pr.Logger.Println(err)
        return
    }
    defer bodyInsert.Close()

    hash := md5.New()
    for p := range pr.Channel {
        nowUnix := time.Now().Unix()
        bodyString := strings.Join([]string{
            p.GetType(),
            p.GetSource(),
            p.GetBodyString(),
        }, ":")
        hash.Write([]byte(bodyString))
        bodyHash := hex.EncodeToString(hash.Sum(nil))
        hash.Reset()

        if _, ok := pr.BodiesHash[bodyHash]; !ok {
            pr.BodiesHash[bodyHash] = true
            _, err = bodyInsert.Exec(
                bodyHash,
                p.GetType(),
                p.GetSource(),
                p.GetBodyString(),
                nowUnix,
            )
            if err != nil {
                pr.Logger.Println(err, bodyString, bodyHash)
            }
        }
    }
}
But periodically I get the error
"pq: duplicate key value violates unique constraint "bd_hash_index""
in my log file. I can't imagine how this can happen, because I check the hash for uniqueness before I do an insert.
I am sure that when I call go processDebugBody.Run() the bodies table is empty.
The channel was created as a buffered channel with:
processDebugBody.Channel = make(chan BodyIterface, 1000)
When you execute a query outside of a transaction with sql.DB, it automatically retries when there's a problem with the connection – in the current implementation, up to 10 times. For example, notice maxBadConnRetries in sql.Exec.
Now, this really happens only when the underlying driver returns driver.ErrBadConn, and the specification states the following:
ErrBadConn should be returned by a driver to signal to the sql package that a driver.Conn is in a bad state (such as the server having earlier closed the connection) and the sql package should retry on a new connection.
To prevent duplicate operations, ErrBadConn should NOT be returned if there's a possibility that the database server might have performed the operation.
I think driver implementations are a little bit careless in implementing this rule, but maybe there is some logic behind it. I was studying the implementation of lib/pq the other day and noticed this scenario would be possible.
As you pointed out in the comments you have some SSL errors issued just before seeing duplicates, so this seems like a reasonable guess.
One thing to consider is using transactions. If you lose the connection before committing the transaction, you can be sure it will be rolled back. Also, the statements of a transaction are not retransmitted automatically on bad connections, so this problem might be solved – you will most probably see the SSL errors propagated directly to your application though, so you'll need to retry on your own.
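A minimal sketch of that transactional variant, reusing the INSERT statement from the question (error handling and any retry policy on your side are up to you):

// insertBodyTx wraps the single INSERT in an explicit transaction: if the
// connection dies before Commit, the row is rolled back, and database/sql
// does not retransmit statements of a transaction on another connection.
func insertBodyTx(db *sql.DB, bodyHash, bodyType, source, body string, createdUnix int64) error {
    tx, err := db.Begin()
    if err != nil {
        return err
    }
    if _, err := tx.Exec(
        "INSERT INTO bodies (hash, type, source, body, created_timestamp) VALUES ($1, $2, $3, $4, $5)",
        bodyHash, bodyType, source, body, createdUnix,
    ); err != nil {
        tx.Rollback()
        return err
    }
    return tx.Commit()
}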
I must tell you I've also been seeing SSL renegotiation errors on Postgres using Go 1.3, and that's why I've disabled SSL for my internal DB for the time being (sslmode=disable in the connection string). I was wondering whether version 1.4 has solved the issue, as one thing on the changelog was "The crypto/tls package now supports ALPN as defined in RFC 7301" (ALPN stands for Application-Layer Protocol Negotiation).