In order to add millions of records to a Postgres database with constant memory consumption, I am using a thread pool with several workers as well as a gorp.Transaction.
Per million records, the following code is called from different threads about a hundred times, each time handling a batch of 10000 records or so:
func batchCopy(p importParams) error {
    copy := pq.CopyIn("entity", "startdate", "value", "expirydate", "accountid")
    stmt, err := p.txn.Prepare(copy)
    if err != nil {
        return err
    }
    for _, r := range p.records {
        _, err := stmt.Exec(
            r.startDate,
            r.value,
            r.expiryDate,
            p.accountId)
        if err != nil {
            return err
        }
    }
    if err := stmt.Close(); err != nil {
        return err
    }
    return nil
}
I've noticed that the process tends to be very slow, with the Prepare and Close calls taking a very long time.
I then tried reusing the same statement across all of my calls to batchCopy and closing it only after all of them are done. In this case, batchCopy finishes very fast, but the final stmt.Close() takes forever.
Questions:
What would be the right way to go about statements? Should I create one per batch or reuse them?
What is happening in stmt.Close() and why does it take so long after many calls to stmt.Exec?
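For reference, the lib/pq documentation describes the CopyIn flow as Prepare, one Exec per row, then a final Exec with no arguments to flush the buffered copy data, and only then Close. A minimal per-batch sketch following that documented example (the no-argument Exec is the step missing from the code above):

func batchCopy(p importParams) error {
    stmt, err := p.txn.Prepare(pq.CopyIn("entity", "startdate", "value", "expirydate", "accountid"))
    if err != nil {
        return err
    }
    for _, r := range p.records {
        if _, err := stmt.Exec(r.startDate, r.value, r.expiryDate, p.accountId); err != nil {
            return err
        }
    }
    // Flush all buffered rows; this is where the COPY actually runs on the server.
    if _, err := stmt.Exec(); err != nil {
        return err
    }
    return stmt.Close()
}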
The following code fetches results from the database, given a collection, a filter query, a sort query, and a limit:
func DBFetch(collection *mongo.Collection, filter interface{}, sort interface{}, limit int64) ([]bson.M, error) {
    findOptions := options.Find()
    findOptions.SetLimit(limit)
    findOptions.SetSort(sort)
    cursor, err := collection.Find(context.Background(), filter, findOptions)
    var result []bson.M
    if err != nil {
        logger.Client().Error(err.Error())
        sentry.CaptureException(err)
        cursor.Close(context.Background())
        return nil, err
    }
    if err = cursor.All(context.Background(), &result); err != nil {
        logger.Client().Error(err.Error())
        sentry.CaptureMessage(err.Error())
        return nil, err
    }
    return result, nil
}
I am using mongo-go driver version 1.8.2.
MongoDB Community 4.4.7, sharded with 2 shards.
Each shard has 30 CPUs in k8s with 245 GB of memory and 1 replica.
The API receives about 200 rpm.
The API fetches the data from Mongo, formats it, and serves it.
We are reading and writing on the primary.
Heavy writes occur roughly every hour.
We are getting timeouts within milliseconds (approx. 10-20 ms).
As pointed out by @R2D2 in the comments, the cursor timeout error occurs when the default timeout (10 minutes) is exceeded and Go has not requested the next set of data.
There are a couple of workarounds you can use to mitigate this error.
The first option is to set a batch size for your find query, as shown below. By doing so, you instruct MongoDB to send the data in chunks of the specified size rather than in larger batches. Note that this will usually increase the number of round trips between MongoDB and your Go server.
findOptions := options.Find()
findOptions.SetBatchSize(10) // <- Batch size is set to `10`
cursor, err := collection.Find(context.Background(), filter, findOptions)
Furthermore, you can set the NoCursorTimeout option, which keeps the cursor of your find query alive until you manually close it. This option is a double-edged sword: you have to manually close the cursor once you no longer need it, or it will stay in memory for a prolonged time.
findOptions := options.Find()
findOptions.SetNoCursorTimeout(true) // <- Applies no cursor timeout option
cursor, err := collection.Find(context.Background(), filter, findOptions)
// VERY IMPORTANT
_ = cursor.Close(context.Background()) // <- Don't forget to close the cursor
Combining the two options above, your complete code would look like this:
func DBFetch(collection *mongo.Collection, filter interface{}, sort interface{}, limit int64) ([]bson.M, error) {
    findOptions := options.Find()
    findOptions.SetLimit(limit)
    findOptions.SetSort(sort)
    findOptions.SetBatchSize(10)         // <- Batch size is set to `10`
    findOptions.SetNoCursorTimeout(true) // <- Applies no cursor timeout option
    cursor, err := collection.Find(context.Background(), filter, findOptions)
    if err != nil {
        //logger.Client().Error(err.Error())
        //sentry.CaptureException(err)
        // Note: cursor is nil here, so there is nothing to close.
        return nil, err
    }
    var result []bson.M
    if err = cursor.All(context.Background(), &result); err != nil {
        //logger.Client().Error(err.Error())
        //sentry.CaptureMessage(err.Error())
        return nil, err
    }
    // VERY IMPORTANT
    _ = cursor.Close(context.Background()) // <- Don't forget to close the cursor
    return result, nil
}
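As an aside, when the cursor is iterated manually rather than drained with cursor.All, a deferred Close covers every return path. A sketch of that variant (hypothetical DBFetchIter name; same driver calls and options as above):

func DBFetchIter(collection *mongo.Collection, filter interface{}, sort interface{}, limit int64) ([]bson.M, error) {
    findOptions := options.Find()
    findOptions.SetLimit(limit)
    findOptions.SetSort(sort)
    cursor, err := collection.Find(context.Background(), filter, findOptions)
    if err != nil {
        return nil, err
    }
    defer cursor.Close(context.Background()) // released on every return path

    var result []bson.M
    for cursor.Next(context.Background()) { // each Next may issue a getMore for the next batch
        var doc bson.M
        if err := cursor.Decode(&doc); err != nil {
            return nil, err
        }
        result = append(result, doc)
    }
    if err := cursor.Err(); err != nil {
        return nil, err
    }
    return result, nil
}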
In my Golang project, which uses GORM as the ORM and Postgres as the database, in some situations when I begin a transaction to change three tables and then commit, only one of the tables changes; the data in the other two tables does not change.
Any idea how this might happen?
You can see an example below:
// o is a *gorm.DB obtained elsewhere
tx := o.Begin()

invoice.Number = 1
if err := tx.Save(&invoice).Error; err != nil {
    tx.Rollback()
    return err
}

receipt.Ref = "1331"
if err := tx.Save(&receipt).Error; err != nil {
    tx.Rollback()
    return err
}

payment.Status = "succeed"
if err := tx.Save(&payment).Error; err != nil {
    tx.Rollback()
    return err
}

if err := tx.Commit().Error; err != nil {
    return err
}
Only the payment data changed, and I'm not getting any error.
Apparently you are mistakenly using save points. In PostgreSQL you can have a form of nested transactions: defining save points splits the transaction into parts. I am not a Go programmer and my primary language is not Go, but my guess is that the problem is tx.Save, which creates a SavePoint rather than saving the data to the database. A SavePoint creates a new save point within the transaction, and thus only the last table's change is committed.
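For comparison, explicit save points in GORM v2 are created with SavePoint and undone with RollbackTo; a minimal sketch (hypothetical db and invoice values, not the asker's code):

tx := db.Begin()

tx.SavePoint("before_invoice") // marks a save point inside the open transaction
if err := tx.Save(&invoice).Error; err != nil {
    tx.RollbackTo("before_invoice") // rolls back to the save point; the transaction stays open
}

if err := tx.Commit().Error; err != nil { // commits whatever was not rolled back
    return err
}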
If you are familiar with Node.js, then any async function callback receives an error as its first argument. In Go we follow a similar convention, except that the error is conventionally the last return value of a function.
https://medium.com/rungo/error-handling-in-go-f0125de052f0
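A small, self-contained illustration of that convention (a hypothetical divide helper, purely for demonstration):

package main

import (
    "errors"
    "fmt"
)

// divide follows the Go convention: the error is the last return value.
func divide(a, b int) (int, error) {
    if b == 0 {
        return 0, errors.New("division by zero")
    }
    return a / b, nil
}

func main() {
    q, err := divide(10, 2)
    if err != nil {
        fmt.Println("error:", err) // check the error before using the result
        return
    }
    fmt.Println(q)
}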
I'm trying to roll back a transaction in my unit tests, between scenarios, to keep the database empty and not make my tests dirty. So, I'm trying:
for _, test := range tests {
    db := connect()
    _ = db.RunInTransaction(func() error {
        t.Run(test.name, func(t *testing.T) {
            for _, r := range test.objToAdd {
                err := db.PutObj(&r)
                require.NoError(t, err)
            }
            objReturned, err := db.GetObjsWithFieldEqualsXPTO()
            require.NoError(t, err)
            require.Equal(t, test.queryResultSize, len(objReturned))
        })
        return fmt.Errorf("returning error to clean up the database rolling back the transaction")
    })
}
I was expecting the transaction to be rolled back at the end of each scenario, so the next iteration of the loop would see an empty database, but when I run it, the data is never rolled back.
I believe I'm trying to do what the doc suggested: https://pg.uptrace.dev/faq/#how-to-test-mock-database, am I right?
More info: I noticed that my interface implements a layer over RunInTransaction as:
func (gs *DB) RunInTransaction(fn func() error) error {
    f := func(*pg.Tx) error { return fn() }
    return gs.pgDB.RunInTransaction(f)
}
I don't know what the problem is yet, but I really suspect it is something related to that (because the Tx is encapsulated entirely inside the RunInTransaction implementation).
go-pg uses connection pooling (in common with most Go database packages). This means that when you call a database function (e.g. db.Exec) it will grab a connection from the pool (establishing a new one if needed), run the command, and return the connection to the pool.
When running a transaction you need to run BEGIN, whatever updates etc. you require, followed by COMMIT/ROLLBACK, on a single connection dedicated to the transaction (any commands sent on other connections are not part of the transaction). This is why Begin() (and, effectively, RunInTransaction) provides you with a pg.Tx; use this to run commands within the transaction.
example_test.go provides an example covering the usage of RunInTransaction:
incrInTx := func(db *pg.DB) error {
    // Transaction is automatically rollbacked on error.
    return db.RunInTransaction(func(tx *pg.Tx) error {
        var counter int
        _, err := tx.QueryOne(
            pg.Scan(&counter), `SELECT counter FROM tx_test FOR UPDATE`)
        if err != nil {
            return err
        }
        counter++
        _, err = tx.Exec(`UPDATE tx_test SET counter = ?`, counter)
        return err
    })
}
You will note that this only uses the pg.DB when calling RunInTransaction; all database operations use the transaction tx (a pg.Tx). tx.QueryOne will be run within the transaction; if you ran db.QueryOne then that would be run outside of the transaction.
So RunInTransaction begins a transaction and passes the relevant Tx in as a parameter to the function you provide. You wrap this with:
func (gs *DB) RunInTransaction(fn func() error) error {
    f := func(*pg.Tx) error { return fn() }
    return gs.pgDB.RunInTransaction(f)
}
This effectively ignores the pg.Tx, and you then run commands using other connections (e.g. err := db.PutObj(&r)), i.e. outside of the transaction. To fix this you need to use the transaction (e.g. err := tx.PutObj(&r)).
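One way to do that is to let the wrapper pass the pg.Tx through to the caller; a minimal sketch (the DB type is the asker's, the new callback signature is an assumption):

// RunInTransaction hands the transaction to the callback instead of discarding it,
// so callers can run their statements on tx (i.e. inside the transaction).
func (gs *DB) RunInTransaction(fn func(tx *pg.Tx) error) error {
    return gs.pgDB.RunInTransaction(func(tx *pg.Tx) error {
        return fn(tx)
    })
}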
Let's consider the following goroutine:
func main() {
    ...
    go dbGoRoutine()
    ...
}
And the func:
func dbGoRoutine() {
    db, err := sqlx.Connect("postgres", GetPSQLInfo())
    if err != nil {
        panic(err)
    }
    defer db.Close()
    ticker := time.NewTicker(10 * time.Second)
    for _ = range ticker.C {
        _, err := db.Queryx("SELECT * FROM table")
        if err != nil {
            // handle
        }
    }
}
Each time the function iterates on the ticker, it opens a Cloud SQL connection:
[service... cloudsql-proxy] 2019/11/08 17:05:05 New connection for "location:exemple-db"
I can't figure out why it opens a new connection each time, since the sqlx.Connect is not in the for loop.
This issue is due to how the Query function in the sql package works: it returns Rows, which are documented as:
Rows is the result of a query. Its cursor starts before the first row of the result set.
Those Rows keep their cursor (and the underlying connection) until they are closed or fully iterated.
If you don't need the result set, try using Exec() instead.
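A minimal sketch of both variants inside the ticker loop (same query as the asker's; pick one depending on whether the rows are actually needed):

for range ticker.C {
    // Variant 1: keep the query, but close the rows so the connection
    // is returned to the pool instead of a new one being opened.
    rows, err := db.Queryx("SELECT * FROM table")
    if err != nil {
        // handle
        continue
    }
    // ... iterate rows here if the data is needed ...
    rows.Close()

    // Variant 2: if the result set is not needed at all, Exec holds no cursor.
    if _, err := db.Exec("SELECT * FROM table"); err != nil {
        // handle
    }
}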
I'm quite new to both PostgreSQL and Go. Mainly, I am trying to understand the following:
Why did I need the Commit call to close the connection, when the other two Close calls didn't do the trick?
Would also appreciate pointers regarding the right/wrong way in which I'm going about working with cursors.
In the following function, I'm using gorp to declare a CURSOR, query my Postgres DB row by row, and write each row to a writer function:
func(txn *gorp.Transaction,
    q string,
    params []interface{},
    myWriter func([]byte, error)) {

    cursor := "DECLARE GRABDATA NO SCROLL CURSOR FOR " + q
    _, err := txn.Exec(cursor, params...)
    if err != nil {
        myWriter(nil, err)
        return
    }

    rows, err := txn.Query("FETCH ALL in GRABDATA")
    if err != nil {
        myWriter(nil, err)
        return
    }

    defer func() {
        if _, err := txn.Exec("CLOSE GRABDATA"); err != nil {
            fmt.Println("Error while closing cursor:", err)
        }
        if err = rows.Close(); err != nil {
            fmt.Println("Error while closing rows:", err)
        } else {
            fmt.Println("\n\n\n Closed rows without error", "\n\n\n")
        }
        if err = txn.Commit(); err != nil {
            fmt.Println("Error on commit:", err)
        }
    }()

    pointers := make([]interface{}, len(cols))
    container := make([]sql.NullString, len(cols))
    values := make([]string, len(cols))
    for i := range pointers {
        pointers[i] = &container[i]
    }

    for rows.Next() {
        if err = rows.Scan(pointers...); err != nil {
            myWriter(nil, err)
            return
        }
        stringLine := strings.Join(values, ",") + "\n"
        myWriter([]byte(stringLine), nil)
    }
}
In the defer section, I initially only closed the rows, but then I saw in pg_stat_activity that the connection stayed open in the idle in transaction state, with the FETCH ALL in GRABDATA query.
Calling txn.Exec("CLOSE <cursor_name>") didn't help. After that, I had a CLOSE GRABDATA query in idle in transaction state...
Only when I started calling Commit() did the connection actually close. I thought that maybe I need to call Commit to execute anything on the transaction, but if that's the case, how come I got the results of my queries without calling it?
You want to end the transaction, not just close the declared cursor; Commit does that.
You can run multiple queries in one transaction - this is why you see the results without committing.
The pg_stat_activity.state values are: active while a statement is running (e.g. BEGIN TRANSACTION; or a FETCH from the cursor), idle in transaction when you are not currently running statements but the transaction is still open, and finally idle after you run END or COMMIT, so the transaction is over. After you disconnect, the session ends and there is no row in pg_stat_activity at all.
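A minimal sketch tying those states to the asker's cursor flow (assumes a gorp transaction txn and a placeholder query; not the original function):

func cursorLifecycle(txn *gorp.Transaction) error {
    // While each statement runs, pg_stat_activity shows state = active.
    if _, err := txn.Exec("DECLARE GRABDATA NO SCROLL CURSOR FOR SELECT id FROM some_table"); err != nil {
        return err
    }
    rows, err := txn.Query("FETCH ALL in GRABDATA")
    if err != nil {
        return err
    }
    // ... scan rows here ...
    rows.Close() // between statements the backend sits at state = idle in transaction

    if _, err := txn.Exec("CLOSE GRABDATA"); err != nil { // cursor is gone, transaction still open
        return err
    }
    return txn.Commit() // ends the transaction: state becomes idle until the session disconnects
}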