I have to post thousands of records to an endpoint that receives them and inserts them into a table. The back-end, written in Elixir (with PostgreSQL), knows beforehand how many records will arrive in total. The front-end, on the other hand, sends the records concurrently and in parts. For example, I have to send 100 records in chunks of 10, so 10 POST requests are made to this endpoint. This works fine, but the problem is that when the last chunk is inserted I want to spawn a process that makes some calculations on the records, and I want this process to spawn only once. How can I achieve this?
def bulk_create(conn, %{"records" => records_params}) do
  with {rows_affected, nil} <- Api.create_records(records_params) do
    group = Api.get_group!(records_params |> List.first() |> Map.get("group_id"))
    total_records_saved = Api.count_records_by_group_id(group.id)
    # check whether every record for this group has now been inserted
    finished = group.total_records == total_records_saved

    pid =
      if finished do
        {:ok, pid} =
          Processor.start_link(%{group_id: group.id, name: String.to_atom("PID.#{group.id}")})

        # turn #PID<0.123.0> into the string "0.123.0" for the JSON response
        :erlang.pid_to_list(pid) |> List.delete_at(0) |> List.delete_at(-1) |> List.to_string()
      else
        nil
      end

    conn
    |> put_status(:created)
    |> render("bulk_create.json",
      rows_affected: rows_affected,
      finished: finished,
      pid: pid
    )
  end
end
This seems to work fine, but if for any reason two requests insert at the same time, they could both obtain true in this line
finished = group.total_records == total_records_saved
And both spawn the process.
You need state, and once you need state in OTP, you should start a process.
It might be a simple Agent, or you might use the built-in :counters or :persistent_term.
The idea is to bump the counter on each DB insert, which is an atomic operation, instead of relying on querying the DB, which is vulnerable to race conditions.
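For illustration, a minimal sketch of that counter idea, not code from the question: it keeps one reference per group in :persistent_term and uses the closely related :atomics module, whose add_get/3 adds and reads back in a single atomic step, so exactly one request observes the moment the total is reached. The RecordCounter module and its callers are assumptions:

defmodule RecordCounter do
  # Call once when the group is created.
  def init(group_id) do
    ref = :atomics.new(1, signed: false)
    :persistent_term.put({__MODULE__, group_id}, ref)
  end

  # Call after each successful insert; returns the new running total.
  def add(group_id, rows_inserted) do
    {__MODULE__, group_id}
    |> :persistent_term.get()
    |> :atomics.add_get(1, rows_inserted)
  end
end

In bulk_create/2 the check then becomes finished = RecordCounter.add(group.id, rows_affected) == group.total_records, which is true for exactly one of the concurrent requests.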
Another way around it would be to spawn a “throttler” which would queue DB requests, execute them, and ensure the calculation and cleanup start before the N+1th record arrives (a rough sketch follows).
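As a sketch of that throttler (every name here is hypothetical except Api.create_records/1 and Processor from the question), a per-group GenServer serialises the inserts, so counting and the "spawn once" decision happen in a single process without races:

defmodule GroupThrottler do
  use GenServer

  # One throttler per group, registered under a per-group name.
  def start_link(%{group_id: group_id, total: total}) do
    GenServer.start_link(__MODULE__, %{group_id: group_id, total: total, saved: 0},
      name: {:global, {:group_throttler, group_id}}
    )
  end

  # Controllers call this instead of calling Api.create_records/1 directly.
  def insert(group_id, records) do
    GenServer.call({:global, {:group_throttler, group_id}}, {:insert, records})
  end

  @impl true
  def init(state), do: {:ok, state}

  @impl true
  def handle_call({:insert, records}, _from, state) do
    {rows_affected, nil} = Api.create_records(records)
    saved = state.saved + rows_affected

    # Requests are serialised by the GenServer, so this branch runs exactly once.
    if saved >= state.total, do: Processor.start_link(%{group_id: state.group_id})

    {:reply, {:ok, rows_affected}, %{state | saved: saved}}
  end
end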
I have the following table:
case class Project(id: Int, name: String, locked: Boolean)
Users can request some processing to be done on the project - but I'd like to make sure only one processing job is being run on the project at a time.
My approach right now is to set locked = true on the project whenever a job begins; if a user (malicious or otherwise) tries to start a second job, the app should check whether locked is already true, and if so, respond with an error message saying 'please wait' or similar.
I think I need to do this using transactions, so that race conditions / concurrent requests can't slip through and a malicious user can't send concurrent requests and have multiple jobs start because they all saw locked = false (since they started simultaneously).
How can I do this with Slick? My current attempt looks like this:
def lock(id: Long): Future[Int] = {
  val select = for { p <- projects if p.id === id && p.locked === false } yield p.locked
  val q = select.update(true).transactionally // attempting to use transactions
  db.run(q)
}
I believe db.run will return the number of rows that were updated, and if the p.locked === false condition fails, the number of rows updated will be 0; I could use that to determine whether the project was successfully locked. And .transactionally should perhaps make this run in a transaction so concurrent requests won't be an issue.
Are my assumptions / reasoning here correct? Is there a better way to do this?
The meaning of .transactionally here depends on which database you are using.
Without specifying anything else, you are using the default transaction isolation level offered by your db. For example, if you use Postgres, the level will be READ COMMITTED, which means that, given two concurrent transactions, one can see data committed by the other before it ends.
I suggest you always specify the isolation level with .transactionally.withTransactionIsolation(transactionLevel) to avoid concurrency problems.
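For example, here is a sketch of the question's lock method with an explicit isolation level, assuming Slick 3.x and the projects table from the question (Serializable is the strictest level; choose whichever level actually fits your needs):

import slick.jdbc.TransactionIsolation

def lock(id: Int): Future[Int] = {
  val q = projects
    .filter(p => p.id === id && p.locked === false)
    .map(_.locked)
    .update(true)
    .transactionally
    .withTransactionIsolation(TransactionIsolation.Serializable)
  db.run(q) // still returns the number of rows updated (0 if already locked)
}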
I have a MongoDB collection with Tasks. Each task has an interval in seconds, a task identifier, and a payload that should be sent via HTTP POST to gather results and store them in another collection.
There may be thousands of tasks with different intervals, and I cannot figure out how to schedule them.
Currently I'm using simple polling by last execution time every 10 ms, but it produces a heavy load on the DB.
It looks like this:
mongo.MongoClient.connect(MONGO_URL, (err, db) ->
  handle_error(err)
  schedule = (collection) ->
    collection.find({isEnabled: true, '$where': '((new Date()).getTime() - this.timestamp) > (this.checkInterval * 60 * 1000)'}).toArray((err, docs) ->
      handle_error(err)
      for i, doc of docs
        collection.update({_id: doc._id}, {'$set': {timestamp: (new Date()).getTime()}}, {w: 1})
        task = prepare(doc)
        request.post({url: url, formData: {task: JSON.stringify(task)}}, (err, httpResponse, body) ->
          result = JSON.parse(body)
          console.log(result)
          db.collection(MONGO_COLLECTION_RESULTS).save({
            task: result.id,
            type: result.type,
            data: result
          })
        )
      setTimeout((() -> schedule(collection)), 10)
    )
  setTimeout((() -> schedule(db.collection(MONGO_COLLECTION_TASKS))), 10)
)
Tasks can be added, updated and deleted, and I have to handle that.
What about using Redis? But I have no clue how to sync the data from Mongo to Redis when some tasks are waiting for results, an interval has been changed, and so on.
Please advise the best strategy for this.
I don't think this is the right way to solve your use case.
I would suggest not storing the tasks in a database at all, but scheduling them directly when they come in and saving the result, with or without the original task information.
Why not use Quartz to schedule the task?
If you know the tasks to be run, you can schedule them with the unix crontab, which runs a script that connects to the DB or sends HTTP requests.
If each task is unique, and you cannot pre-schedule them that way, perhaps you can use your current db collections, but not poll the db that often.
If it is not critical that the tasks are executed at exactly the right time, I would do a db lookup maybe once every 10 seconds to see what tasks should have been executed since the last lookup.
One way of reducing the db load would be to make a query that fetches tasks ordered by when they should be executed, limited to the tasks that should be executed within the next minute or so. Then you have (hopefully) a small number of tasks in memory, and can set a javascript timeout for when each should be run. If too many tasks should run at the same time, fetching them all from the db at once could still be problematic.
The essence is to batch several tasks from the db into memory, and handle some of the scheduling there.
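A rough sketch of that idea follows; it is hypothetical code, not a drop-in replacement, and the nextRunAt field plus the runTask helper are assumptions. It fetches only the tasks due within the next window, schedules them in memory with setTimeout, and goes back to the db once per window instead of every 10 ms:

LOOKAHEAD = 60 * 1000   # look one minute ahead

scheduleBatch = (collection) ->
  now = Date.now()
  collection.find({isEnabled: true, nextRunAt: {'$lte': now + LOOKAHEAD}}).sort({nextRunAt: 1}).toArray (err, docs) ->
    handle_error(err)
    for doc in docs
      do (doc) ->
        delay = Math.max(doc.nextRunAt - Date.now(), 0)
        # runTask would POST the payload and bump nextRunAt in the db
        setTimeout (-> runTask(doc)), delay
    # one query per window instead of one every 10 ms
    setTimeout (-> scheduleBatch(collection)), LOOKAHEAD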
To make things short, I have to make a script in Second Life communicating with an AppEngine app updating records in an ndb database. Records extracted from the database are sent as a batch (a page) to the LSL script, which updates customers, then asks the web app to mark these customers as updated in the database.
To create the batch I use a query on an (integer) property update_ver==0 and use fetch_page() to produce a cursor to the next batch. This cursor is also sent as a urlsafe()-encoded parameter to the LSL script.
To mark the customer as updated, the update_ver is set to some other value like 2, and the entity is updated via put_async(). Then the LSL script fetches the next batch thanks to the cursor sent earlier.
My rather simple question is: in the web app, since the query property update_ver no longer satisfies the filter, is my cursor still valid? Or do I have to use another strategy?
Stripping out irrelevant parts (including authentication), my code currently looks like this (Customer is the entity in my database).
class GetCustomers(webapp2.RequestHandler):  # handler that sends batches to the update script in SL
    def get(self):
        cursor = self.request.get("next", default_value=None)
        query = Customer.query(
            Customer.update_ver == 0,
            ancestor=customerset_key(),
            projection=[Customer.customer_name, Customer.customer_key],
        ).order(Customer._key)
        if cursor:
            results, cursor, more = query.fetch_page(batchsize, start_cursor=ndb.Cursor(urlsafe=cursor))
        else:
            results, cursor, more = query.fetch_page(batchsize)
        if more:
            self.response.write("more=1\n")
            self.response.write("next={}\n".format(cursor.urlsafe()))
        else:
            self.response.write("more=0\n")
        self.response.write("n={}\n".format(len(results)))
        for c in results:
            self.response.write("c={},{},{}\n".format(c.customer_key, c.customer_name, c.key.urlsafe()))
        self.response.set_status(200)
The handler that updates Customer entities in the database is the following. The c= parameters are urlsafe()-encoded entity keys of the records to update and the nv= parameter is the new version number for their update_ver property.
class UpdateCustomer(webapp2.RequestHandler):
    @ndb.toplevel  # don't exit until all async operations are finished
    def post(self):
        updatever = int(self.request.get("nv"))
        customers = self.request.get_all("c")
        if customers:
            for ckey in customers:
                cust = ndb.Key(urlsafe=ckey).get()
                cust.update_ver = updatever  # the filter in the query used to produce the cursor was using this property!
                cust.update_date = datetime.datetime.utcnow()
                cust.put_async()
        else:
            self.response.set_status(403)
Will this work as expected? Thanks for any help!
Your strategy will work, and that's the whole point of using these cursors: they are efficient, and you can get the next batch as intended regardless of what happened with the previous one.
On a side note, you could also optimise your UpdateCustomer handler: instead of retrieving and saving entities one by one, you can do things in batches using, for example, ndb.put_multi_async.
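For instance, a hedged sketch of the post handler using batch operations (names as in the question; ndb.get_multi and ndb.put_multi_async are the batch counterparts of get() and put_async(), and the @ndb.toplevel decorator would still apply):

def post(self):
    new_ver = int(self.request.get("nv"))
    keys = [ndb.Key(urlsafe=k) for k in self.request.get_all("c")]
    customers = ndb.get_multi(keys)      # one batch read instead of N gets
    now = datetime.datetime.utcnow()
    for cust in customers:
        cust.update_ver = new_ver
        cust.update_date = now
    ndb.put_multi_async(customers)       # one batch (async) write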
I have a scenario where 2 db connections might both run Model.find_or_initialize_by(params) and raise an error: PG::UniqueViolation: ERROR: duplicate key value violates unique constraint
I'd like to update my code so it could gracefully recover from it. Something like:
record = nil
begin
  record = Model.find_or_initialize_by(params)
rescue ActiveRecord::RecordNotUnique
  record = Model.where(params).first
end
return record
The trouble is that there's not a nice/easy way to reproduce this on my local machine, so I'm not confident that my fix actually works.
So I thought I'd get a bit creative and try calling create twice (locally) in a row, which should raise the PG::UniqueViolation: ERROR; then I could rescue from it and make sure everything is handled gracefully.
But I get this error: PG::InFailedSqlTransaction: ERROR: current transaction is aborted, commands ignored until end of transaction block
I get this error even when I wrap everything in individual transaction blocks:
record = nil

Model.transaction do
  record = Model.create(params)
end

begin
  Model.transaction do
    record = Model.create(params)
  end
rescue ActiveRecord::RecordNotUnique
end

Model.transaction do
  record = Model.where(params).first
end

return record
My questions:
What's the right way to gracefully handle the race condition I mentioned at the very beginning of this post?
How do I test this locally?
I imagine there's probably something simple that I'm missing here, but it's late and perhaps I'm not thinking too clearly.
I'm running postgres 9.3 and rails 4.
EDIT: Turns out that find_or_initialize_by should have been find_or_create_by, and the errors I was getting were from the actual save call that happened later on in execution. #VeryTiredWhenIWroteThis
Has this actually happened?
Model.find_or_initialize_by(params)
should never raise an ActiveRecord::RecordNotUnique error, as it is not saving anything to the db. It just instantiates a new ActiveRecord object.
However in the second snippet you are creating records.
create (without bang) does not throw exceptions caused by validations, but
ActiveRecord::RecordNotUnique is always thrown in case of a duplicate by both create and create!
If you're creating records you don't need explicit transactions at all. Postgres, being ACID compliant, guarantees that only one of the two operations succeeds, and once it responds, its changes are durable (a single-statement query against Postgres is also a transaction). So your code above is almost fine if you switch to find_or_create_by:
begin
  record = Model.find_or_create_by(params)
rescue ActiveRecord::RecordNotUnique
  record = Model.where(params).first
end
You can test whether the code behaves correctly by simply trying to create the same record twice in a row. However, this will not test that ActiveRecord::RecordNotUnique is actually thrown correctly on race conditions.
It's also not really your app's responsibility to test this, and testing it is not easy. You would have to start Rails in multithreaded mode on your machine, or test against a multi-process staging Rails instance. WEBrick, for example, handles only one request at a time. You can use the puma application server, but on MRI there is no true parallelism (GIL); threads only release the GIL on blocking IO. Because talking to Postgres is IO, I'd expect some concurrent requests, but to be 100% sure, the best testing scenario would be to deploy on Passenger with multiple workers and then use JMeter to run concurrent requests against the server.
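As a rough local experiment (a hypothetical script, not a proper test), you can hammer the same params from several threads and check that every thread ends up with the same record:

threads = 4.times.map do
  Thread.new do
    begin
      Model.find_or_create_by(params)
    rescue ActiveRecord::RecordNotUnique
      Model.where(params).first
    end
  end
end

records = threads.map(&:value)   # Thread#value joins and returns the block result
puts records.map(&:id).uniq.length  # should print 1

Whether this actually triggers the race depends on timing and on each thread checking out its own connection from the pool, so treat it as a smoke test rather than proof.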
I have an ASP.NET MVC4 application in which I am using Unity as my IoC container. The constructor for my controller takes in a repository, and that repository takes in a UnitOfWork (DbContext). Everything seems to work fine until multiple ajax requests from the same session happen too fast. I then get the "Store update, insert, or delete statement affected an unexpected number of rows (0)" error due to a concurrency issue. This is the method called from the ajax request:
public void CaptureData(string apiKey, Guid sessionKey, FormElement formElement)
{
    var trackingData = _trackingService.FindById(sessionKey);
    if (trackingData != null)
    {
        var formItem = trackingData.FormElements
            .Where(f => f.Name == formElement.Name)
            .FirstOrDefault();

        if (formItem != null)
        {
            formItem.Value = formElement.Value;
            _formElementRepository.Update(formItem);
        }
    }
}
This only happens when the ajax requests come in rapidly, meaning fast. When the requests happen at a normal speed everything seems fine. It is like the app needs time to catch up. I'm not sure how to handle the concurrency check in my repository so I don't miss an update. Also, I have tried setting "MultipleActiveResultSets" to true and that didn't help.
As you mentioned in the comment, you are using a row version column. The point of this column is to prevent concurrent overwrites of the same row. You have two operations:
Read record - reads the record and its current row version
Update record - updates the record with the specified key and row version. The row version is updated automatically.
Now if those operations are executed by concurrent requests, you may see this:
Request A: Read record
Request B: Read record
Request A: Write record - changes row version!
Request B: Write record - fires an exception, because the record with the row version retrieved during Read record no longer exists
The exception is fired to tell you that you are trying to update stale data, because there is already a new version of the updated record. Normally you need to refresh the data (by reloading the current record from the database) and try to save it again; a sketch of that retry loop follows the list below. In a highly concurrent scenario this handling may repeat many times, simply because your database is designed to prevent lost updates. Your options are:
Remove the row version column and let requests overwrite the value as they wish. If you really need concurrent request processing and you are happy to end up with "some" value, this may be the way to go.
Don't allow concurrent requests. If you need to process all updates, you most probably also need their real order; in that case your application should not allow concurrent requests.
Use SQL / a stored procedure instead. By using table hints you will be able to lock the record during the Read operation, and no other request will be able to read that record before the first one saves its changes and commits or rolls back its transaction.
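For the refresh-and-retry handling mentioned above the list, here is a hedged Entity Framework sketch of what the update inside CaptureData could look like; _context, the retry limit, and the usings noted in the comment are assumptions, not from the question:

// using System.Linq; using System.Data.Entity.Infrastructure;
const int maxRetries = 3;
for (var attempt = 0; attempt < maxRetries; attempt++)
{
    try
    {
        formItem.Value = formElement.Value;
        _context.SaveChanges();   // _context is a hypothetical DbContext reference
        break;                    // saved without a conflict
    }
    catch (DbUpdateConcurrencyException ex)
    {
        // Another request updated the row first: refresh the original values
        // (including the row version) from the database and try again.
        var entry = ex.Entries.Single();
        entry.OriginalValues.SetValues(entry.GetDatabaseValues());
    }
}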