Google Indexing API rateLimitExceeded - spring-batch

I have a Spring Batch process which submits around 5M URLs to the Google Indexing API. In the past, the process was segmented and parallelized into two threads by an attribute, one for the small segments and one for the bigger ones. A few days ago it was refactored to submit requests as they come from a query response (sorted by priority, ignoring the previous segmenting attribute, using a single thread to execute). After that refactoring, I started getting a "rateLimitExceeded" error from the Google API. I have (by contract) 5M requests a day and I'm submitting batches of 500 URLs at a time. The average sending time is around 1.2 seconds for each 500-URL batch.
Does anybody know what may be causing this error?

I did not do the math, but if you are getting this exception, it means you are exceeding the limit. Depending on where you are doing the API call (i.e., in the item writer or in an item processor), you can do the math and delay the call as needed with a listener so that you do not exceed the limit.
You can find a similar question/answer here: Spring batch writer throttling
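For reference, 500 URLs every ~1.2 seconds is roughly 417 URLs per second, i.e. on the order of 36M per day if the step runs continuously, while a 5M/day quota corresponds to about one 500-URL batch every 8.6 seconds (and Google may also enforce limits over shorter windows than a day, so check the quotas in your project). Below is a minimal sketch of a throttling ChunkListener; the pacing constant and the class name are illustrative, and you would register it on the step with .listener(new RateLimitingChunkListener()):
import org.springframework.batch.core.ChunkListener;
import org.springframework.batch.core.scope.context.ChunkContext;

// Illustrative throttling listener: paces each 500-item chunk so the job stays
// under an assumed quota (~5M requests/day, i.e. one 500-URL batch every ~8.6 s).
public class RateLimitingChunkListener implements ChunkListener {

    private static final long MIN_MILLIS_BETWEEN_CHUNKS = 8_600L; // 500 URLs per ~8.6 s ≈ 5M/day

    private long lastChunkStart = 0L;

    @Override
    public void beforeChunk(ChunkContext context) {
        long elapsed = System.currentTimeMillis() - lastChunkStart;
        if (lastChunkStart > 0 && elapsed < MIN_MILLIS_BETWEEN_CHUNKS) {
            try {
                // delay the next batch just enough to respect the quota
                Thread.sleep(MIN_MILLIS_BETWEEN_CHUNKS - elapsed);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        lastChunkStart = System.currentTimeMillis();
    }

    @Override
    public void afterChunk(ChunkContext context) { }

    @Override
    public void afterChunkError(ChunkContext context) { }
}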

Related

Stream Kinesis Analytics ETL Flink - skip records before and after a delay

EDITED:
I have a requirement to skip records that are created earlier than 10 s or later than 20 s after a gap in incoming data occurs.
(A gap is said to occur when event-time1 - event-time2 > 3 seconds.)
The resulting data is used to calculate an average or median in a time window.
Is this possible with Kinesis Analytics, Dataflow, the Flink API, or some other solution that works?
If I understand correctly, you want to find the median and average of records that are created between 10 and 20 seconds after a gap of at least 3 seconds.
Using Flink (or Kinesis Analytics, which is a managed Flink service), you could do that with session windows, or with a ProcessFunction. Process functions are more flexible, and are capable of handling pretty much anything you might need. However, in this case, session windows are probably simpler, especially if you are willing to wait until a session ends (i.e., until the next gap) to get the results. You could avoid this delay by implementing a custom window Trigger.
window tutorial
process function tutorial
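If you go the ProcessFunction route, here is a minimal sketch of the gap detection with a KeyedProcessFunction (Flink 1.x style). It assumes a hypothetical Event POJO with an eventTime field in epoch milliseconds and records arriving roughly in event-time order per key; the emitted stream (records 10-20 s after a gap) can then feed a windowed average/median downstream:
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// "Event" is a hypothetical POJO with a long eventTime field; adapt to your record type.
public class GapWindowFilter extends KeyedProcessFunction<String, Event, Event> {

    private transient ValueState<Long> lastEventTime; // event time of the previous record on this key
    private transient ValueState<Long> gapEnd;        // event time of the first record after the last gap

    @Override
    public void open(Configuration parameters) {
        lastEventTime = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastEventTime", Long.class));
        gapEnd = getRuntimeContext().getState(
                new ValueStateDescriptor<>("gapEnd", Long.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Event> out) throws Exception {
        Long previous = lastEventTime.value();
        if (previous != null && event.eventTime - previous > 3_000L) {
            // a gap of more than 3 seconds ended at this record
            gapEnd.update(event.eventTime);
        }
        lastEventTime.update(event.eventTime);

        Long start = gapEnd.value();
        if (start != null) {
            long offset = event.eventTime - start;
            // forward only records created between 10 and 20 seconds after the gap
            if (offset >= 10_000L && offset <= 20_000L) {
                out.collect(event);
            }
        }
    }
}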

Asynchronous or synchronous pull for counting stream data in Pub/Sub?

I would like to count the number of messages in the last hour (last hour referring to a timestamp field in the message data).
I currently have a code that will count the messages synchronously (I am using Google Cloud Pub/Sub Synchronous pull), but I noticed it will take quite long.
My code will repeatedly poll the subscription for a predefined (I set it to 100+) number of times so that I am sure there are no more messages in the last hour that are coming in out of order.
This is not an acceptable design because it means the user has to wait for 5-10 mins for the service to count the messages when they want the metric!
Are there best practices in Pub Sub design for solving this kind of problem?
This seems like a simple problem to solve (count the number of events in the last X timeframe) so I thought there might be.
Will asynchronous design help? How would an async design work? I am not too sure about the async and Python future concept (I am using GCP Pub/Sub's Python client library).
I would approach catching the message differently. My solution is based on logging and BigQuery. The idea is to write a log entry, for example "message received with timestamp xxxxx", to filter on this log pattern, and to sink the result into BigQuery.
Then, when a user asks, you simply query BigQuery and count the messages in the desired span of time (see the sketch after the list below). You also have the advantage of being able to change the time frame, to keep a history, and so on.
For writing this log, there are 2 solutions:
Cheaper but not really recommended: the process which consumes the message logs it as it processes it. However, you then depend on an external service, and this service has 2 responsibilities: its own work and this logging (for metrics). Not SOLID. It could instead be the role of the publisher, with a log like "message published at XXXX", but this implies that all the publishers or all the subscribers are on GCP.
Better: plug in a function, with the cheapest tier (128 MB of memory), to simply handle the message and write the log.
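As a sketch of the "count on demand" part, assuming the log sink writes into a BigQuery table (the project, dataset, table and column names below are placeholders), the query side could look like this with the BigQuery Java client:
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FieldValueList;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.TableResult;

public class MessageCount {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        // Table and column names are placeholders for wherever the log sink writes the entries.
        String sql = "SELECT COUNT(*) AS message_count "
                + "FROM `my-project.pubsub_logs.received_messages` "
                + "WHERE message_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)";

        TableResult result = bigquery.query(QueryJobConfiguration.newBuilder(sql).build());
        for (FieldValueList row : result.iterateAll()) {
            System.out.println("Messages in the last hour: " + row.get("message_count").getLongValue());
        }
    }
}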

Smartsheet API rate limit exceeded

Last week we encountered for the first time a rate limit exceeded error (4003) in our nightly batch process. This batch process synchronises Smartsheet objects with our time-tracking application 4TT.
Since 2016 this process has worked fine, but now this rate limit error occurs and stops the synchronisation. With the help of the API (and the blog about rate limiting) I managed to change the code, putting in pauses when this error occurs. This has taken me quite a lot of time, as every time the error occurred in a different part of the synchronisation process.
Is there, or will there be, a way to let the API automatically pause when the rate limit is about to be exceeded, instead of changing the code every time? And for those who don't want this feature, for example, an optional boolean argument 'AutomaticallyPauseWhenRateLimitExceeds' (default false) when making the connection to the Smartsheet API?
You'll need to include logic in your code to effectively handle the rate limiting error -- there's no mechanism by which the Smartsheet API can automatically handle this situation for you.
A simple approach would be for you to include logic in your code such that when a rate limiting error is thrown, your code pauses execution for 60 seconds before continuing. Alternatively, a more sophisticated approach would be to implement exponential backoff logic in your code (an error handling strategy whereby you periodically retry a failed request with progressively longer wait times between retries, until either the request succeeds or a certain number of retry attempts is reached).
Implementing this type of error handling logic should not be difficult or tedious, provided that your code is structured in an efficient manner (i.e., error handling logic is encapsulated in a single location).
Additional note: The Smartsheet API Best Practices blog post (specifically the Be practical: Adhere to rate limiting guidelines section) contains info about this topic.
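As an illustration of the exponential backoff idea, here is a generic Java sketch (not tied to a specific Smartsheet SDK; the isRateLimitError check is a placeholder for however your client surfaces error code 4003):
import java.util.concurrent.Callable;

public final class RetryWithBackoff {

    // Retries the given call with exponentially growing waits: 1 s, 2 s, 4 s, ...
    public static <T> T execute(Callable<T> call, int maxRetries) throws Exception {
        long waitMillis = 1_000L;
        for (int attempt = 0; ; attempt++) {
            try {
                return call.call();
            } catch (Exception ex) {
                if (attempt >= maxRetries || !isRateLimitError(ex)) {
                    throw ex; // out of retries, or not a rate-limiting error: give up
                }
                Thread.sleep(waitMillis);
                waitMillis *= 2; // double the wait before the next attempt
            }
        }
    }

    // Placeholder: inspect the exception for Smartsheet error code 4003 in whatever
    // form your client library reports it.
    private static boolean isRateLimitError(Exception ex) {
        return ex.getMessage() != null && ex.getMessage().contains("4003");
    }
}
A call then becomes something like RetryWithBackoff.execute(() -> updateSheet(...), 5), where updateSheet stands in for whatever your synchronisation code already does.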
All our SDKs include error retry. So that's the easiest way to handle this situation. http://smartsheet-platform.github.io/api-docs/#sdks-and-sample-code
I found this and other interesting problems (in my lab) while updating the sheet, including poor Internet connection/bandwidth issues.
If you are unable to adapt your code to process chunks of data, my suggestion is to use simple try/catch logic to pause the thread/task for 60 seconds and then try again.
using System;
using System.Threading;
...
... // all your code goes here
...
bool saved = false;
while (!saved) // in production, cap the number of retries
{
    try
    {
        // your code to save/update the sheet goes here
        saved = true;
    }
    catch (Exception ex)
    {
        // log the error, wait 60 seconds, then retry the save/update
        Console.WriteLine(ex.Message);
        Thread.Sleep(60000);
    }
}
The next step is to work on notifications when those errors happen.

rate limit policy on queries to Azure Insights REST API for Events (Audit Logs)

I have some questions regarding Azure Insights REST Api for Events.
When I make an HTTP request to the Insights API for events, I receive the header "x-ms-ratelimit-remaining-subscription-reads" with the value "14999". But the next query, 1 second later, returns the same value of remaining reads.
I see there is some throttling policy there, but I would like to understand how it works and what is the correct way to deal with that.
In particular,
1) How many reads am I able to do per second?
2) If I exhaust the whole remaining-reads allowance, how much time should I wait before it is back at the maximum?
3) Is it decreased on every query attempt, regardless of the $top parameter set and how many results have been returned?
Thank you!
This article seems to have the responses you need.
To answer the questions based on it:
1) There is no limit to the number of requests per second, but you have 15k requests/hour/subscription/region/instance of the ARM region. Worst-case scenario you will get throttled after 15k requests, but you'd have to be extremely unlucky for that.
2) If you exceed the limit, you are told how much you have to wait, and you can integrate that logic by looking at the Retry-After header. Happily, it's a matter of seconds.
3) I believe the $top parameter doesn't affect this, since no matter how many results are brought back, a paging request is still just one request.
As for the fact that you get 14999 requests remaining multiple times: as they say in their documentation, this is expected, since an ARM region has multiple instances and each instance has a 15k requests/subscription/hour limit. If you hit simultaneously and get the same number remaining, it just means that you were lucky enough to hit different instances within the same ARM region.
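As a small sketch of reading those headers with Java 11's HttpClient (the endpoint, api-version and token handling below are placeholders; use whatever request you are already making):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RateLimitHeaders {
    public static void main(String[] args) throws Exception {
        String subscriptionId = "your-subscription-id"; // placeholder
        String token = "your-aad-bearer-token";         // placeholder

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(
                        "https://management.azure.com/subscriptions/" + subscriptionId
                        + "/providers/microsoft.insights/eventtypes/management/values?api-version=2015-04-01"))
                .header("Authorization", "Bearer " + token)
                .GET()
                .build();

        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

        // Remaining quota for the instance that served this request, as described above.
        response.headers().firstValue("x-ms-ratelimit-remaining-subscription-reads")
                .ifPresent(v -> System.out.println("Remaining reads: " + v));

        // Only relevant when throttled (HTTP 429): how many seconds to wait before retrying.
        if (response.statusCode() == 429) {
            response.headers().firstValue("Retry-After")
                    .ifPresent(v -> System.out.println("Throttled, retry after " + v + " s"));
        }
    }
}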
1) How many reads am I able to do per second?
Based on the rate limits published here - https://azure.microsoft.com/en-in/documentation/articles/azure-subscription-service-limits/#subscription-limits, you can perform 15000 reads / hour (not sure it would translate to 4 reads / second).
2) If I exhaust the whole remaining-reads allowance, how much time should I wait before it is back at the maximum?
Given that the rates are defined per hour, my guess would be to wait till the next hour if you exhaust the 15000 read request limit.
3) Is it decreased on every query attempt, regardless of the $top parameter set and how many results have been returned?
This is based on the number of API calls and not the amount of data returned. So I would say defining the $top parameter should not have any impact on this.
When I make an HTTP request to the Insights API for events, I receive the header "x-ms-ratelimit-remaining-subscription-reads" with the value "14999". But the next query, 1 second later, returns the same value of remaining reads.
I would assume there's some caching in play here. Is it the same request you're repeating, or a different request altogether?

azure mobile service custom api script http request timeout

I implemented exports.get = function(request, response) in a custom API on an Azure Mobile Service. I download 5 thousand records from the REST service and then prepare the JSON for the output. The problem is that downloading all the records takes too long, so the script exceeds the default timeout of 30 seconds. I was wondering if there is a way to increase the timeout of the response.
I don't believe you can have a timeout greater than 30 seconds, as I have encountered this problem myself with Azure custom APIs. According to this link https://msdn.microsoft.com/en-us/library/azure/dd894042.aspx, Table operations are limited to 30 seconds; it's not clear whether that applies to custom APIs, but it certainly appears to.
What I would recommend is to implement pagination and return a limited number of records at a time. Your parameters should include the start index and the number of records to return, and your response should include how many records there are in total, so you can determine how many requests are needed to fetch them all.
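For illustration, a client-side paging loop could look like the sketch below. It is written in Java like the other examples in this thread (the custom API itself is a Node.js script), and the endpoint name, the start/count parameters and the { total, records } response shape are all assumptions you would adapt to your own API; it uses Jackson for the JSON parsing:
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayList;
import java.util.List;

public class PagedFetch {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        ObjectMapper mapper = new ObjectMapper();

        int pageSize = 500; // small enough to stay well under the 30-second limit
        int start = 0;
        List<JsonNode> all = new ArrayList<>();

        while (true) {
            // "start" and "count" are assumed query parameters exposed by the custom API
            String url = "https://yourservice.azure-mobile.net/api/records?start=" + start + "&count=" + pageSize;
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());

            // Assumed response shape: { "total": <n>, "records": [ ... ] }
            JsonNode body = mapper.readTree(response.body());
            body.get("records").forEach(all::add);

            start += pageSize;
            if (start >= body.get("total").asInt()) {
                break; // fetched everything the API reported
            }
        }
        System.out.println("Fetched " + all.size() + " records");
    }
}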