Graphite showing rolling gap in data - Grafana

I recently upgraded one of our Graphite instances from 0.9.2 to 1.1.1, and have since run into an issue where, for lack of a better term, there is a rolling gap in the data.
The last few minutes show up correctly (I'm guessing that's what's in the carbon cache), and everything older than about 10-15 minutes shows up correctly as well.
However, inside that 10-15 minute window the graphs are completely blank. I can see the gap both in Graphite and in Grafana. It disappears after restarting carbon-cache, then comes back about a day later.
Example screenshot:
This happens for most graphs/dashboards I have.
I've spent a lot of effort optimizing disk IO, so I doubt that's the cause: CloudWatch shows 100% burst credit for the disk. It's an m3.xlarge instance with 4 cores and 16 GB RAM. The swap file is on ephemeral storage and looks barely utilized.
Using 1 Carbon Cache instance with Whisper backend.
storage-schemas.conf:
[carbon]
pattern = ^carbon\.
retentions = 60:90d
[dumbo]
pattern = ^collectd\.dumbo # load test containers, we don't care about their data
retentions = 300:1
[collectd]
pattern = ^collectd
retentions = 10s:8h,30s:1d,1m:3d,5m:30d,15m:90d
[statsite]
pattern = ^statsite
retentions = 10s:8h,30s:1d,1m:3d,5m:30d,15m:90d
[default_1min_for_1day]
pattern = .*
retentions = 60s:1d
Non-default (or potentially relevant) carbon.conf settings:
[cache]
MAX_CACHE_SIZE = inf
MAX_UPDATES_PER_SECOND = 100 # was saturating disk write IO until I dropped it down from 500
MAX_CREATES_PER_MINUTE = 50
CACHE_WRITE_STRATEGY = sorted
RELAY_METHOD = rules
DESTINATIONS = 127.0.0.1:2004
MAX_DATAPOINTS_PER_MESSAGE = 500
MAX_QUEUE_SIZE = 10000
Graphite local_settings.py:
CARBONLINK_TIMEOUT = 10.0
CARBONLINK_QUERY_BULK = True
USE_WORKER_POOL = False
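To check whether the missing window is still sitting in the carbon cache rather than being lost, it can help to graph carbon's own internal cache metrics over the render API. A minimal sketch, assuming the webapp is reachable at http://localhost and that the default carbon.agents.*.cache.size metrics are being recorded:
import requests

# Query carbon-cache's internal cache-size metric for the last 30 minutes.
# The host/port and metric path are assumptions; adjust for your setup.
resp = requests.get(
    "http://localhost/render",
    params={
        "target": "carbon.agents.*.cache.size",
        "from": "-30min",
        "format": "json",
    },
    timeout=10,
)
resp.raise_for_status()
for series in resp.json():
    values = [v for v, _ in series["datapoints"] if v is not None]
    if values:
        print(series["target"], "min:", min(values), "max:", max(values))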

We've seen this with some workloads on 1.1.1. Can you try updating carbon to current master? If not, 1.1.2 will be released shortly, which should solve the problem.

Related

Why is my GitHub code search hitting secondary rate limits?

I am searching for GitHub files containing the string "torch". Since the search API limits searches to the first 1,000 results, I am searching based on file sizes as suggested here. However, I keep hitting the secondary rate limit. Could someone suggest if I am doing something wrong, or if there is a way to optimize my code to prevent these rate limits? I have already looked at the best practices for dealing with rate limits. Here is my code:
import os
import requests
import httplink
import time

# This for loop searches for code based on file sizes from 0 to 500000 containing the string "torch"
for i in range(0, 500000, 250):
    print("i = ", i, " i + 250 = ", i + 250)
    url = "https://api.github.com/search/code?q=torch+in:file+language:python+size:" + str(i) + ".." + str(i + 250) + "&page=1&per_page=10"
    headers = {"Authorization": "Token xxxxxxxxxxxxxxx"}  # Please put your token over here
    # Backoff when the secondary rate limit is reached
    backoff = 256
    total = 0
    cond = True
    # This while loop goes over all pages of results => pagination
    while cond:
        try:
            time.sleep(2)
            res = requests.request("GET", url, headers=headers)
            res.raise_for_status()
            link = httplink.parse_link_header(res.headers["link"])
            data = res.json()
            for n, item in enumerate(data["items"], start=total):
                print(f'[{n}] {item["html_url"]}')
            if "next" not in link:
                break
            total += len(data["items"])
            url = link["next"].target
        # Catches the case when the secondary rate limit has been reached, to prevent the computation from stopping
        except requests.exceptions.HTTPError as err:
            print("err = ", err)
            print("err.response.text = ", err.response.text)
            # backoff **= 2
            print("backoff = ", backoff)
            time.sleep(backoff)
        # Catches the case when the given file size range provides no results (no "link" header)
        except KeyError as error:
            print("err = ", error)
            # Set cond to False to stop the while loop
            cond = False
            continue
Based on this answer, it seems like this is a common occurrence. However, I was hoping someone could suggest a workaround.
I have added the octokit tag, even though I am not using it, to increase visibility, since this seems like a common problem.
A big chunk of the above logic/code was obtained through SO answers; I highly appreciate all support from the community.
Note that the search API has its own primary and secondary rate limits, which are lower than those of other endpoints. For JavaScript, we have a throttle plugin that implements all the recommended best practices. For search, it limits requests to 1 per 2 seconds. Hope that helps!
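In Python, one way to follow that advice is to back off based on the headers GitHub sends instead of sleeping for a fixed 256 seconds. A rough sketch; the Retry-After and X-RateLimit-Reset headers are documented, but treat the exact way of detecting a secondary-rate-limit response as an assumption:
import time

def wait_for_rate_limit(response):
    # Prefer the Retry-After header when GitHub provides it.
    retry_after = response.headers.get("Retry-After")
    if retry_after is not None:
        time.sleep(int(retry_after) + 1)
        return
    # Otherwise wait until the X-RateLimit-Reset timestamp.
    reset = response.headers.get("X-RateLimit-Reset")
    if reset is not None:
        time.sleep(max(0, int(reset) - int(time.time())) + 1)
        return
    time.sleep(60)  # fallback when neither header is present

# Usage inside the existing while loop, where `res` is the requests.Response:
# if res.status_code == 403 and "secondary rate limit" in res.text.lower():
#     wait_for_rate_limit(res)
#     continue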

How to fix a Python "pyverbs" application where an RDMA Write gets a not-acknowledge (NACK)

I'm currently working on a RoCE (RDMA over Converged Ethernet) Python application using the pyverbs library. First, I want to do a simple loopback test with an RDMA Write. I tested the setup with ib_write_bw from perftest, which worked like a charm.
This is my setup:
OS: Ubuntu 20.04.05 LTS
Kernel: 5.15.0-56-generic
NIC: Mellanox ConnectX-5 MC516A-GCA_Ax 50GbE dual-port QSFP28
I'm developing the application in a Jupyter notebook. The two ports are connected together with a QSFP28 cable. I set up a "client" and a "server" on the same system; each uses one port of the NIC. The client performs the "RDMA Write" action. In the future, metadata will be exchanged over TCP, but for ease of debugging I exchange the metadata locally in the same notebook.
I am now able to perform an "RDMA Write" action and capture the packets.
(Screenshot: captured RDMA packets)
I keep getting not-acknowledges (NACK) from the "server". The RDMA Write packet looks correct to me: it has the right payload and the headers are the same as I configured (I can post it if that would help).
I have three ideas about why it might not work:
wrong server memory address/rkey used for the RDMA Write
missing flags at the server memory allocation (see the sketch below)
missing/wrong flags at the server queue pair modification
I tried all different combinations of flags and values in the queue pair modification and memory allocation.
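For reference, a server-side memory registration that allows an incoming RDMA Write would roughly look like the sketch below. This is only an illustration of the flags involved; the protection domain, buffer length (4096 here) and variable names are assumptions, since the real allocation code isn't shown above.
from pyverbs.pd import PD
from pyverbs.mr import MR
import pyverbs.enums as e

# Register a server-side buffer that a remote peer may write into.
# IBV_ACCESS_REMOTE_WRITE is the flag that matters for an incoming RDMA Write.
server_pd = PD(server_ctx)
server_mr = MR(server_pd, 4096,
               e.IBV_ACCESS_LOCAL_WRITE |
               e.IBV_ACCESS_REMOTE_WRITE |
               e.IBV_ACCESS_REMOTE_READ)
print(hex(server_mr.buf), server_mr.rkey)  # address/rkey to hand to the client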
I start the NIC in the shell with
sudo mst start
Then I run my Python application. I posted some code snippets below which I am not sure I implemented correctly.
Server and client QP modification from INIT to RTR
gid_index = 0
port_num = 1
server_attr.qp_state = e.IBV_QPS_RTR
server_attr.path_mtu = e.IBV_MTU_4096
server_attr.rq_psn = 0
server_attr.min_rnr_timer = 12
server_attr.max_dest_rd_atomic = 10
server_attr.dest_qp_num = client_qp.qp_num
server_attr.qp_access_flags = e.IBV_ACCESS_LOCAL_WRITE | e.IBV_ACCESS_REMOTE_READ | e.IBV_ACCESS_REMOTE_WRITE
server_gr = GlobalRoute(dgid=client_ctx.query_gid(port_num=port_num,index=gid_index), sgid_index=gid_index)
server_ah_attr = AHAttr(gr=server_gr, is_global=1, port_num=1)
server_attr.ah_attr = server_ah_attr
server_qp.to_rtr(server_attr)
client_attr.qp_state = e.IBV_QPS_RTR
client_attr.path_mtu = e.IBV_MTU_4096
client_attr.rq_psn = 0
client_attr.min_rnr_timer = 12
client_attr.max_dest_rd_atomic = 10
client_attr.dest_qp_num = server_qp.qp_num
client_attr.qp_access_flags = e.IBV_ACCESS_LOCAL_WRITE | e.IBV_ACCESS_REMOTE_READ | e.IBV_ACCESS_REMOTE_WRITE
client_gr = GlobalRoute(dgid=server_ctx.query_gid(port_num=port_num,index=gid_index), sgid_index=gid_index)
client_ah_attr = AHAttr(gr=client_gr, is_global=1, port_num=port_num)
client_attr.ah_attr = client_ah_attr
client_qp.to_rtr(client_attr)
Client QP modification from RTR to RTS
client_attr.qp_state = e.IBV_QPS_RTS
client_attr.timeout = 14
client_attr.retry_cnt = 7
client_attr.rnr_retry = 7
client_attr.sq_psn = 0
client_attr.max_rd_atomic = 10
RDMA Write posted by the client
client_sge = SGE(client_mr.buf,len(SEND_STRING),client_mr.lkey)
send_wr = pwr.SendWR(num_sge = 1, sg = [client_sge],opcode=e.IBV_WR_RDMA_WRITE)
send_wr.set_wr_rdma(rkey = server_mr.rkey, addr = server_mr.buf)
client_qp.post_send(send_wr)
sleep(1)
print(server_mr.read(len(SEND_STRING),0))
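To see the failure on the client side as a work completion, rather than only as a NACK in the packet capture, it also helps to poll the client's send completion queue after posting the write. A minimal sketch, assuming the completion queue object is called client_cq (it isn't shown in the snippets above):
import pyverbs.enums as e

# Poll the send CQ once; a bad rkey/address typically surfaces here as
# IBV_WC_REM_ACCESS_ERR instead of IBV_WC_SUCCESS.
npolled, wcs = client_cq.poll(num_entries=1)
for wc in wcs:
    if wc.status != e.IBV_WC_SUCCESS:
        print(f"RDMA Write failed: status={wc.status}, opcode={wc.opcode}")
    else:
        print("RDMA Write completed successfully")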
If there is someone with knowledge of RDMA/RoCE/pyverbs, I would be glad for some help. I don't have any prior knowledge of these topics, which is why I chose to write the application in Python. I know C, but Python is much more convenient for prototyping :)
Thanks for your help!

How can you find the size of a delta table quickly and accurately?

The Microsoft documentation here:
https://learn.microsoft.com/en-us/azure/databricks/kb/sql/find-size-of-table#size-of-a-delta-table
suggests two methods:
Method 1:
import com.databricks.sql.transaction.tahoe._
val deltaLog = DeltaLog.forTable(spark, "dbfs:/<path-to-delta-table>")
val snapshot = deltaLog.snapshot // the current delta table snapshot
println(s"Total file size (bytes): ${deltaLog.snapshot.sizeInBytes}")`
Method 2:
spark.read.table("<non-delta-table-name>").queryExecution.analyzed.stats
For my table, they both return ~300 MB.
But then in storage explorer Folder statistics or in a recursive dbutils.fs.ls walk, I get ~900MB.
So the two methods that are much quicker than literally looking at every file underreport by about 67%. Using the slower methods would be fine, except that when I try to scale up to the entire container, it takes 55 hours to scan all 1 billion files and 2.6 PB.
So what is the best way to get the size of a table in ADLS Gen2? Bonus points if it works for folders that are not tables, as that's really the number I need. dbutils.fs.ls is single-threaded and only works on the driver, so it's not very parallelizable; it can be threaded, but only within the driver.
deltaLog.snapshot returns just the current snapshot. You can have more files present in the table's directory; those belong to historical versions that have been deleted/replaced from the current snapshot.
Also, it returns 0 without complaint for non-Delta paths. So I'm using this piece of code to get a database-level summary:
import com.databricks.sql.transaction.tahoe._

val databasePath = "dbfs:/<path-to-database>"

def size(path: String): Long =
  dbutils.fs.ls(path).map { fi => if (fi.isDir) size(fi.path) else fi.size }.sum

val tables = dbutils.fs.ls(databasePath).par.map { fi =>
  val totalSize = size(fi.path)
  val snapshotSize = DeltaLog.forTable(spark, fi.path).snapshot.sizeInBytes
  (fi.name, totalSize / 1024 / 1024 / 1024, snapshotSize / 1024 / 1024 / 1024)
}

display(tables.seq.sorted.toDF("name", "total_size_gb", "snapshot_size_gb"))
This parallelizes on the driver only, but since it's just file listing, it's pretty fast. I admit I don't have a billion files, but if it's slow for you, just use a bigger driver and tune the number of threads.
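If you work from a Python notebook, a rough equivalent of the same listing approach looks like the sketch below (dbutils is the Databricks notebook global; the thread count and GiB rounding are arbitrary choices):
from concurrent.futures import ThreadPoolExecutor

# Recursively sum file sizes under a path; runs on the driver only,
# like the Scala version above.
def dir_size(path: str) -> int:
    total = 0
    for fi in dbutils.fs.ls(path):
        total += dir_size(fi.path) if fi.isDir() else fi.size
    return total

database_path = "dbfs:/<path-to-database>"
with ThreadPoolExecutor(max_workers=16) as pool:
    entries = dbutils.fs.ls(database_path)
    sizes = list(pool.map(lambda fi: (fi.name, dir_size(fi.path) / 1024**3), entries))

for name, size_gb in sorted(sizes):
    print(f"{name}: {size_gb:.2f} GiB")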

Searchkick memory leak

Running rake searchkick:reindex CLASS=Product for an application causes the Rake process to leak memory; after about 15-20 minutes it's bad enough to freeze a Debian system with 16GB of RAM. There are ~3800 "Product" records.
I managed to work around this problem with the following code in a Rake task:
connection = ActiveRecord::Base.connection
res = connection.execute('select max(id) from products')
id = res.getvalue(0, 0)

1.upto(id) do |i|
  p = Product.find_by_id(i)
  next unless p
  p.reindex
end
This is also a little quicker.
Can anyone suggest a means to investigate this memory leak? It would be useful to do so in more detail before considering opening a ticket.
This causes a problem with generating indexes: "Text fields are not optimised for operations that require per-document field data".
That problem can be fixed by adding the following to the code above:
Product.reindex(import: false)
# Rest of code goes here...

Update TYPO3 7.6 to 8.7: can't get the frontend to work in a local test environment with XAMPP

I'm working on updating a TYPO3 7.6 installation to 8.7. I'm doing this on my local machine with XAMPP on Windows with PHP 7.2.
I got the backend working. It needed some manual work in the DB, like changing the CType in tt_content for my own content elements as well as filling in colPos.
However, when I call a page on the frontend, all I get is a timeout:
Fatal error: Maximum execution time of 60 seconds exceeded in
C:\xampp\htdocs\typo3_src-8.7.19\vendor\doctrine\dbal\lib\Doctrine\DBAL\Driver\Mysqli\MysqliStatement.php on line 92
(This does not change if I set max_execution_time to 300.)
Edit: I added an echo just before line 92 in the above file; this is the function:
public function __construct(\mysqli $conn, $prepareString)
{
    $this->_conn = $conn;
    echo $prepareString."<br />";
    $this->_stmt = $conn->prepare($prepareString);
    if (false === $this->_stmt) {
        throw new MysqliException($this->_conn->error, $this->_conn->sqlstate, $this->_conn->errno);
    }
    $paramCount = $this->_stmt->param_count;
    if (0 < $paramCount) {
        $this->types = str_repeat('s', $paramCount);
        $this->_bindedValues = array_fill(1, $paramCount, null);
    }
}
What I get is the following statement thousands of times, always exactly the same:
SELECT `tx_fed_page_controller_action_sub`, `t3ver_oid`, `pid`, `uid` FROM `pages` WHERE (uid = 0) AND ((`pages`.`deleted` = 0) AND (`pages`.`hidden` = 0) AND (`pages`.`starttime` <= 1540305000) AND ((`pages`.`endtime` = 0) OR (`pages`.`endtime` > 1540305000)))
Note: I don't have any entry in pages with uid=0, so I am really not sure what this is good for. Does there need to be a page with uid=0?
I enabled slow query logging in MySQL, but nothing gets logged by it. I don't get any additional PHP error, nor do I get a log entry in TYPO3.
So right now I am a bit stuck and don't know how to proceed.
I enabled general logging for MySQL, and when I call a page on the frontend I see this SQL query executed over and over again:
SELECT `tx_fed_page_controller_action_sub`, `t3ver_oid`, `pid`, `uid` FROM `pages` WHERE (uid = 0) AND ((`pages`.`deleted` = 0) AND (`pages`.`hidden` = 0) AND (`pages`.`starttime` <= 1540302600) AND ((`pages`.`endtime` = 0) OR (`pages`.`endtime` > 1540302600)))
Executing this query manually returns an empty result (I don't have any entry in pages with uid=0). I don't know if that means anything.
What options do I have? How can I find what's missing / where the error is?
First: give your PHP more time to run.
In the php.ini configuration, increase the max execution time to 240 seconds.
Be aware that for TYPO3, 240 seconds are recommended even in production mode. If you start the install tool, you can do a system check and get information about configuration settings that might need optimization.
Second: avoid development mode and use production mode.
Execution is faster, but you will lose the option to debug.
Debugging always costs more time and more memory to prepare all that information; maybe 240 seconds are not enough and you even need more memory.
The field tx_fed_page_controller_action_sub comes from an extension; it is not part of the core. Most likely you have flux and fluidpages installed in your system.
Try deactivating those extensions and proceed without them; reintegrate them later if you still need them. A timeout often means that some kind of recursion is going on. From my experience with flux, it is possible for a content element to have itself set as its own flux_parent, creating an infinite rendering loop that causes a fatal error after max_execution_time.
So, in your case I'd try to find the record that is causing this (it seems to be a page record) and/or the code that initiates the query. You do not need to debug in Doctrine itself :)