Why is my GitHub code search hitting secondary rate limits? - rest

I am searching for GitHub files containing the string "torch." Since, the search API limits searches to the first 100 results, I am searching based on file sizes as suggested here. However, I keep hitting the secondary rate limit. Could someone suggest if I am doing something wrong or if there is a way to optimize my code to prevent these rate limits? I have already looked at best practices to deal with rate limits. Here is my code -
import os
import requests
import httplink
import time
# This for loop searches for code based on files sizes from 0 to 500000 containing the string "torch"
for i in range(0,500000,250):
print("i = ",i," i + 250 = ", i+250)
url = "https://api.github.com/search/code?q=torch +in:file + language:python+size:"+str(i)+".."+str(i+250)+"&page=1&per_page=10"
headers = {"Authorization": f'Token xxxxxxxxxxxxxxx'} ## Please put your token over here
# Backoff when secondary rate limit is reached
backoff = 256
total = 0
cond = True
# This while loop goes over all pages of results => Pagination
while cond==True:
try:
time.sleep(2)
res = requests.request("GET", url, headers=headers)
res.raise_for_status()
link = httplink.parse_link_header(res.headers["link"])
data = res.json()
for i, item in enumerate(data["items"], start=total):
print(f'[{i}] {item["html_url"]}')
if "next" not in link:
break
total += len(data["items"])
url = link["next"].target
# Except case to catch when secondary rate limit has been reached and prevent the computation from stopping
except requests.exceptions.HTTPError as err:
print("err = ", err)
print("err.response.text = ", err.response.text)
# backoff **= 2
print("backoff = ", backoff)
time.sleep(backoff)
# Except case to catch when the given file size provides no results
except KeyError as error:
print("err = ", error)
# Set cond to False to stop the while loop
cond = False
continue
Based on this answer, it seems like it is a common occurrence. However, I was hoping someone could suggest a workaround.
I have added the tag Octokit, although I am not using that, to increase visibility and since this seems like a common problem.
A big chunk of the above logic/code was obtained through SO answers, I highly appreciate all support from the community.

Note that search has its primary and secondary rate limiting that is lower than others. For JavaScript, we have a throttle plugin that implements all the recommended best practices. For search we limit requests to 1 per 2 seconds. Hope that helps!

Related

Graphite showing rolling gap in data

I recently upgraded one of our Graphite instances from 0.9.2 to 1.1.1, and have since run into an issue where, for the lack of a better word, there is a rolling gap of data.
It shows the last few minutes correctly (I'm guessing what's in carbon cache), and after about 10-15 minutes past, it shows all of the data correctly as well.
However, inside that 10-15 minute gap, it's completely blank. I can see the gap both in Graphite, and in Grafana. It disappears after restarting carbon cache, and then comes back about a day later.
Example screenshot:
This happens for most graphs/dashboards I have.
I've spent a lot of effort optimizing disk IO, so I doubt it to be the case -> Cloudwatch shows 100% burst credit for disk. It's an m3.xlarge instance with 4 cores and 16 GB RAM. Swap file is on ephemeral storage and looks barely utilized.
Using 1 Carbon Cache instance with Whisper backend.
storage_schemas.conf:
[carbon]
pattern = ^carbon\.
retentions = 60:90d
[dumbo]
pattern = ^collectd\.dumbo # load test containers, we don't care about their data
retentions = 300:1
[collectd]
pattern = ^collectd
retentions = 10s:8h,30s:1d,1m:3d,5m:30d,15m:90d
[statsite]
pattern = ^statsite
retentions = 10s:8h,30s:1d,1m:3d,5m:30d,15m:90d
[default_1min_for_1day]
pattern = .*
retentions = 60s:1d
Non-default (or potentially relevant) carbon.conf settings:
[cache]
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 100 # was slagging disk write IO until I dropped it down from 500
MAX_CREATES_PER_MINUTE = 50
CACHE_WRITE_STRATEGY = sorted
RELAY_METHOD = rules
DESTINATIONS = 127.0.0.1:2004
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_QUEUE_SIZE = 10000
Graphite local_settings.py
CARBONLINK_TIMEOUT = 10.0
CARBONLINK_QUERY_BULK = True
USE_WORKER_POOL = False
We've seen this with some workloads on 1.1.1, can you try updating carbon to current master? If not 1.1.2 will be released shortly which should solve the problem.

Program runs out of memory reading a large TYPO3 Extbase repository

I'm writing an extension function in TYPO3 CMS 6.2 Extbase that must process every object in a large repository. My function works fine if I have about 10,000 objects, but runs out of memory if I have over about 20,000 objects. How can I handle the larger repository?
$importRecordsCount = $this->importRecordRepository->countAll();
for ($id = 1; $id <= $importRecordsCount; $id++) {
$importRecord = $this->importRecordRepository->findByUid($id);
/* Do things to other models based on properties of $importRecord */
}
The program exceeds memory near ..\GeneralUtility.php:4427 in TYPO3\CMS\Core\Utility\GeneralUtility::instantiateClass( ) after passing through the findByUid() line, above. It took 117 seconds to reach this error during my latest test. The error message is:
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 4194304 bytes) in ... \typo3\sysext\core\Classes\Utility\GeneralUtility.php on line 4448
If it matters, I do not use #lazy because of some of the processing I do later in the function.
According to official TYPO3 website, it is recommended 256M memory limit instead of 128M:
Source
So my first suggestion would be trying to do that first and it might solve your problem now. Also you should use importRecordRepository->findAll(); instead of fetching each record by iterating uid, since someone might have deleted some records.
In general, Extbase is not really suitable for processing such a large amount of data. An alternative would be to use the DataHandler if a correct history etc. is required. It also has quite some overhead compared to using the TYPO3 Database API (DatabaseConnection, $GLOBALS['TYPO3_DB']) which would be the best-performance approach. See my comments and tutorial in this answer.
If you decide to stay with the Extbase API, the only way that could work would be to persist every X item (try what works in your setup) to free some memory. From your code I cannot really see at which point your manipulation works, but take this as an example:
$importRecords = $this->importRecordRepository->findAll();
$i = 0;
foreach ($importRecords as $importRecord) {
/** #var \My\Package\Domain\Model\ImportRecord $importRecord */
// Manipulate record
$importRecord->setFoo('Test');
$this->importRecordRepository->update($importRecord);
if ($i % 100 === 0) {
// Persist after each 100 items
$this->persistenceManager->persistAll();
}
$i++;
}
// Persist the rest
$this->persistenceManager->persistAll();
There's also the clearState function of Extbase to free some memory:
$this->objectManager->get(\TYPO3\CMS\Extbase\Persistence\Generic\PersistenceManager::class)->clearState();

Controlling requests per second and timeout threshold in Gatling

I am working on a Gatling simulation. For the life of me, I cannot get my code to reach 10000 requests per second. I have read the documentation and I keep messing with different methods and whatnot but my requests per second seems capped at 5000 requests per second. I have attached my current iteration of my code. The URL and path information is blurred out. Assume that I have no issue with the HTTP part of my simulation.
package computerdatabase
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._
//import assertions._
class userSimulation extends Simulation {
object Query {
val feeder = csv("firstfileSHUF.txt").random
val query = repeat(2000) {
feed(feeder).
exec(http("user")
.get("/path/path/" + "${userID}" + "?fullData=true"))
}
}
val baseUrl = "http:URL:7777"
val httpConf = http
.baseURL(baseUrl) // Here is the root for all relative URLs
val scn = scenario("user") // A scenario is a chain of requests and pauses
.exec(Query.query)
setUp(scn.inject(rampUsers(1500) over (60 seconds)))
.throttle(reachRps(10000) in (2 minute),
holdFor(3 minutes))
.protocols(httpConf)
}
Additionally, I would like to set the maximum threshold for a timeout to be 100ms. I have tried to do this with assertions and also editing the configuration files but it never seems to show up during the tests or in my reports. How can I set a request to KO if the request took longer than 100ms? Thank you for your help with this matter!
I ended up figuring this out. My code above is correct and I know understand what Stephane, one of the main contributors to Gatling was explaining. The server at the time simply could not handle my RPS threshold. It was an upper bound that was unreachable. After making changes to the server, we could handle this sort of latency. Additionally, I found a way to timeout at 100ms in the configuration file. Specifically, requestTimeout = 100 will cause the timeout behavior I was looking for.

Python-Multithreading Time Sensitive Task

from random import randrange
from time import sleep
#import thread
from threading import Thread
from Queue import Queue
'''The idea is that there is a Seeker method that would search a location
for task, I have no idea how many task there will be, could be 1 could be 100.
Each task needs to be put into a thread, does its thing and finishes. I have
stripped down a lot of what this is really suppose to do just to focus on the
correct queuing and threading aspect of the program. The locking was just
me experimenting with locking'''
class Runner(Thread):
current_queue_size = 0
def __init__(self, queue):
self.queue = queue
data = queue.get()
self.ID = data[0]
self.timer = data[1]
#self.lock = data[2]
Runner.current_queue_size += 1
Thread.__init__(self)
def run(self):
#self.lock.acquire()
print "running {ID}, will run for: {t} seconds.".format(ID = self.ID,
t = self.timer)
print "Queue size: {s}".format(s = Runner.current_queue_size)
sleep(self.timer)
Runner.current_queue_size -= 1
print "{ID} done, terminating, ran for {t}".format(ID = self.ID,
t = self.timer)
print "Queue size: {s}".format(s = Runner.current_queue_size)
#self.lock.release()
sleep(1)
self.queue.task_done()
def seeker():
'''Gathers data that would need to enter its own thread.
For now it just uses a count and random numbers to assign
both a task ID and a time for each task'''
queue = Queue()
queue_item = {}
count = 1
#lock = thread.allocate_lock()
while (count <= 40):
random_number = randrange(1,350)
queue_item[count] = random_number
print "{count} dict ID {key}: value {val}".format(count = count, key = random_number,
val = random_number)
count += 1
for n in queue_item:
#queue.put((n,queue_item[n],lock))
queue.put((n,queue_item[n]))
'''I assume it is OK to put a tulip in and pull it out later'''
worker = Runner(queue)
worker.setDaemon(True)
worker.start()
worker.join()
'''Which one of these is necessary and why? The queue object
joining or the thread object'''
#queue.join()
if __name__ == '__main__':
seeker()
I have put most of my questions in the code itself, but to go over the main points (Python2.7):
I want to make sure I am not creating some massive memory leak for myself later.
I have noticed that when I run it at a count of 40 in putty or VNC on
my linuxbox that I don't always get all of the output, but when
I use IDLE and Aptana on windows, I do.
Yes I understand that the point of Queue is to stagger out your
Threads so you are not flooding your system's memory, but the task at
hand are time sensitive so they need to be processed as soon as they
are detected regardless of how many or how little there are; I have
found that when I have Queue I can clearly dictate when a task has
finished as oppose to letting the garbage collector guess.
I still don't know why I am able to get away with using either the
.join() on the thread or queue object.
Tips, tricks, general help.
Thanks for reading.
If I understand you correctly you need a thread to monitor something to see if there are tasks that need to be done. If a task is found you want that to run in parallel with the seeker and other currently running tasks.
If this is the case then I think you might be going about this wrong. Take a look at how the GIL works in Python. I think what you might really want here is multiprocessing.
Take a look at this from the pydocs:
CPython implementation detail: In CPython, due to the Global Interpreter Lock, only one thread can execute Python code at once (even though certain performance-oriented libraries might overcome this limitation). If you want your application to make better use of the computational resources of multi-core machines, you are advised to use multiprocessing. However, threading is still an appropriate model if you want to run multiple I/O-bound tasks simultaneously.

socket receive loop never returns

I have a loop that reads from a socket in Lua:
socket = nmap.new_socket()
socket:connect(host, port)
socket:set_timeout(15000)
socket:send(command)
repeat
response,data = socket:receive_buf("\n", true)
output = output..data
until data == nil
Basically, the last line of the data does not contain a "\n" character, so is never read from the socket. But this loop just hangs and never completes. I basically need it to return whenever the "\n" delimeter is not recognised. Does anyone know a way to do this?
Cheers
Updated
to include socket code
Update2
OK I have got around the initial problem of waiting for a "\n" character by using the "receive_bytes" method.
New code:
--socket set as above
repeat
data = nil
response,data = socket:receive_bytes(5000)
output = output..data
until data == nil
return output
This works and I get the large complete block of data back. But I need to reduce the buffer size from 5000 bytes, as this is used in a recursive function and memory usage could get very high. I'm still having problems with my "until" condition however, and if I reduce the buffer size to a size that will require the method to loop, it just hangs after one iteration.
Update3
I have gotten around this problem using string.match and receive_bytes. I take in at least 80 bytes at a time. Then string.match checks to see if the data variable conatins a certain pattern. If so it exits. Its not the cleanest solution, but it works for what I need it to do. Here is the code:
repeat
response,data = socket:receive_bytes(80)
output = output..data
until string.match(data, "pattern")
return output
I believe the only way to deal with this situation in a socket is to set a timeout.
The following link has a little bit of info, but it's on http socket: lua http socket timeout
There is also this one (9.4 - Non-Preemptive Multithreading): http://www.lua.org/pil/9.4.html
And this question: http://lua-list.2524044.n2.nabble.com/luasocket-howto-read-write-Non-blocking-TPC-socket-td5792021.html
A good discussion on Socket can be found on this link:
http://nitoprograms.blogspot.com/2009/04/tcpip-net-sockets-faq.html
It's .NET but the concepts are general.
See update 3. Because the last part of the data is always the same pattern, I can read in a block of bytes and each time check if that block has the pattern. If it has the pattern it will mean that it is the end of the data, append to the output variable and exit.