I am having a challenging time extracting time-series data from logs in Grafana.
I am using Grafana 9.1.3 with Loki and Promtail on a local server to collect data from logs.
The log looks like this (this is just an example), and I want to visualize the loss in a dashboard:
2022-09-08 22:46:31 Train Epoch: 14 [59520/60000 (99%)] Loss: 0.001811
2022-09-08 22:46:31 Train Epoch: 14 [59520/60000 (99%)] Loss: 0.001813
2022-09-08 22:46:31 Train Epoch: 14 [59520/60000 (99%)] Loss: 0.001811
But when I use the query below in a dashboard, Grafana indicates that there is no data matching the query (without the pattern stage, Grafana does show the log lines):
{filename="/root/train/training/mnist/mnist_1.log"} | pattern <_> Loss: <loss> | label=loss
What could be the problem? Thanks in advance.
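In case it helps narrow things down: as far as I understand from the LogQL documentation, the pattern expression needs to be a quoted string and the extracted label has to be unwrapped before it can be graphed, so I assume a metric query would look roughly like the sketch below (the avg_over_time function and the 1m range are just placeholders I picked, not something from my actual setup):
avg_over_time({filename="/root/train/training/mnist/mnist_1.log"} | pattern "<_> Loss: <loss>" | unwrap loss [1m])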
I am new to Caffe, and at present I am trying to use it with AlexNet. When I say use, I mean I don't want to train the network; therefore, I got the '.caffemodel' for AlexNet as mentioned here.
Now, I want to use Caffe's time feature to look at the time each layer's execution takes during the TEST phase (my goal is to get the per-layer execution time during inference).
As per Caffe's options:
usage: caffe <command> <args>
commands:
  train            train or finetune a model
  test             score a model
  ------------
  time             benchmark model execution time
  collect          collects layer data on specified device
  compare          collects layer data using inputs from other device
  Flags from tools/caffe.cpp:
  ---------------------
    -phase (Optional; network phase (TRAIN or TEST). Only used for 'time'.)
      type: string default: ""
    -sampling (Optional; Caffe test with sampling mode) type: bool
      default: false
  -------------------------
I can run the following command to benchmark AlexNet during the TEST phase:
build/tools/caffe time -model models/bvlc_alexnet/train_val.prototxt -iterations 1000 -engine MKLDNN -phase TEST
But when I do that, I get the following error:
I0304 17:37:26.183619 29987 net.cpp:409] label_data_1_split does not need backward computation.
I0304 17:37:26.183625 29987 net.cpp:409] data does not need backward computation.
I0304 17:37:26.183629 29987 net.cpp:451] This network produces output accuracy
I0304 17:37:26.183635 29987 net.cpp:451] This network produces output loss
I0304 17:37:26.183647 29987 net.cpp:491] Network initialization done.
I0304 17:37:26.183732 29987 caffe.cpp:556] Performing Forward
I0304 17:37:26.287747 29987 caffe.cpp:561] Initial loss: 6.92452
I0304 17:37:26.287784 29987 caffe.cpp:563] Performing Backward
F0304 17:37:26.385227 29987 mkldnn_pooling_layer.cpp:464] Check failed: poolingBwd_pd
*** Check failure stack trace: ***
# 0x7fe03e3980cd google::LogMessage::Fail()
# 0x7fe03e399f33 google::LogMessage::SendToLog()
# 0x7fe03e397c28 google::LogMessage::Flush()
# 0x7fe03e39a999 google::LogMessageFatal::~LogMessageFatal()
# 0x7fe03ead741c caffe::MKLDNNPoolingLayer<>::InitPoolingBwd()
# 0x7fe03eac4ec2 caffe::MKLDNNPoolingLayer<>::Backward_cpu()
# 0x7fe03e8f9b19 caffe::Net<>::Backward()
# 0x5622d81a2530 (unknown)
# 0x5622d8199353 (unknown)
# 0x7fe03ab09b97 __libc_start_main
# 0x5622d8198e1a (unknown)
I am guessing there is some problem with the way I am using the command and I may have to change the .prototxt file for this.
I would appreciate it if somebody could point me in the right direction as to how to get the benchmark numbers for AlexNet in the TEST phase.
P.S.: I could not find out what happens if you just run caffe time without specifying the phase. Does it benchmark both the TEST and TRAIN phases?
I have a problem with Fail2Ban
2018-02-23 18:23:48,727 fail2ban.datedetector [4859]: DEBUG Matched time template (?:DAY )?MON Day 24hour:Minute:Second(?:\.Microseconds)?(?: Year)?
2018-02-23 18:23:48,727 fail2ban.datedetector [4859]: DEBUG Got time 1519352628.000000 for "'Feb 23 10:23:48'" using template (?:DAY )?MON Day 24hour:Minute:Second(?:\.Microseconds)?(?: Year)?
2018-02-23 18:23:48,727 fail2ban.filter [4859]: DEBUG Processing line with time:1519352628.0 and ip:158.140.140.217
2018-02-23 18:23:48,727 fail2ban.filter [4859]: DEBUG Ignore line since time 1519352628.0 < 1519381428.727771 - 600
It says "Ignore line" because the time skew is greater than the inspection period. However, this is not the case.
If indeed 1519352628.0 is derived from Feb 23 10:23:48, then the other timestamp, 1519381428.727771, must be wrong.
I have run tests that trigger the 'invalid user' filter repeatedly, but Fail2ban always ignores the line.
I am positive I am getting Filter Matches within 1 second.
This is Ubuntu 16.04 and Fail2ban 0.9.3
Thanks for any help you might have!
Looks like there is a time zone issue on your machine that might cause the confusion. Try to set the correct time zone and restart both rsyslogd and fail2ban.
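For example, a minimal sketch (the zone name is only a placeholder; pick your actual zone from timedatectl list-timezones):
sudo timedatectl set-timezone Region/City
sudo systemctl restart rsyslog
sudo systemctl restart fail2ban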
Regarding your debug log:
1519352628.0 = Feb 23 02:23:48
-> timestamp parsed from line in log file with time Feb 23 10:23:48 - 08:00 time zone offset!
1519381428.727771 = Feb 23 10:23:48
-> timestamp of current time when fail2ban processed the log.
Coincidentally, this is the same time as the time in the log file. That's what makes it so confusing in this case.
1519381428.727771 - 600 = Feb 23 10:13:48
-> limit for how long to look backwards in time in the log file since you've set findtime = 10m in jail.conf.
Fail2ban 'correctly' ignores the log entry, which appears to be older than 10 minutes because of the -08:00 time zone setting.
btw:
If you need IPv6 support for banning, consider upgrading fail2ban to v0.10.x.
And there is also a brand new fail2ban version v0.11 (not yet marked stable, but running without issue for 1+ month on my machines) that has this wonderful new auto-increment bantime feature.
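If you do try v0.11, the incremental ban time is enabled per jail with settings roughly like the following in jail.local (a sketch from memory, so double-check the parameter names against the config shipped with your version):
[DEFAULT]
bantime.increment = true
bantime.factor = 1
bantime.maxtime = 1w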
I have two Raspberry Pis that I wanted to benchmark for load-balancing purposes.
Raspberry pi Model B v1.1 - running Raspbian Jessie
Raspberry pi Model B+ v1.2 - running Raspbian Jessie
I installed sysbench on both systems and ran sysbench --num-threads=1 --test=cpu --cpu-max-prime=10000 --validate run on the first, and changed --num-threads=4 on the second, as it is quad-core, and ran both.
The results are not at all what I expected (I obviously expected the multithreaded benchmark to severely outperform the single-threaded benchmark). When I ran the command with a single thread, performance was about the same on both systems. But when I changed the number of threads to 4 on the second Pi, it still took the same amount of time, except that the per-request statistics showed that the average request took about 4 times as much time. I can't seem to grasp why this is.
Here are the results:
Raspberry pi v1.1
Single thread
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 1325.0229s
total number of events: 10000
total time taken by event execution: 1324.9665
per-request statistics:
min: 131.00ms
avg: 132.50ms
max: 171.58ms
approx. 95 percentile: 137.39ms
Threads fairness:
events (avg/stddev): 10000.0000/0.00
execution time (avg/stddev): 1324.9665/0.00
Raspberry pi v1.2
Four threads
Maximum prime number checked in CPU test: 20000
Test execution summary:
total time: 1321.0618s
total number of events: 10000
total time taken by event execution: 5283.8876
per-request statistics:
min: 486.45ms
avg: 528.39ms
max: 591.60ms
approx. 95 percentile: 553.98ms
Threads fairness:
events (avg/stddev): 2500.0000/0.00
execution time (avg/stddev): 1320.9719/0.03
"Raspberry pi Model B+ v1.2" has the same CPU as "Raspberry pi Model B v1.1". Both boards are from the first generation of Raspberry Pi and they have 1 core CPU.
For 4 CPU you need Raspberry Pi 2 Model B instead of Raspberry pi Model B+.
Yeah, the naming is a bit confusing :(
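If in doubt, you can verify the core count sysbench actually has to work with directly on each board, for example:
nproc
grep -c ^processor /proc/cpuinfo
Both should report 1 on a first-generation Pi, so the four sysbench threads end up time-slicing a single core, which is why each request takes roughly four times as long.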
I'm surprised by the poor performance of Google Cloud SQL when I ran sysbench against it. Here's the result from sysbench after setting up and running the test with these two commands:
sysbench --test=oltp --oltp-table-size=1000000 --mysql-host=173.194.225.xxx --mysql-db=test --mysql-user=root --mysql-password=MYPASSWORD prepare
sysbench --test=oltp --oltp-table-size=1000000 --mysql-host=173.194.225.xxx --mysql-db=test --mysql-user=root --mysql-password=MYPASSWORD --max-time=60 --oltp-read-only=on --max-requests=0 --num-threads=8 run
Sysbench Result:
OLTP test statistics:
queries performed:
read: 7756
write: 0
other: 1108
total: 8864
transactions: 554 (9.13 per sec.)
deadlocks: 0 (0.00 per sec.)
read/write requests: 7756 (127.83 per sec.)
other operations: 1108 (18.26 per sec.)
Test execution summary:
total time: 60.6740s
total number of events: 554
total time taken by event execution: 484.0527
per-request statistics:
min: 856.76ms
avg: 873.74ms
max: 897.26ms
approx. 95 percentile: 890.33ms
Threads fairness:
events (avg/stddev): 69.2500/0.66
execution time (avg/stddev): 60.5066/0.21
Can anyone comment on this result? I ran the test with both the D0 and D4 tiers and I'm getting very similar results. Even a sysbench test against DigitalOcean shows far better performance, as shown below:
OLTP test statistics:
queries performed:
read: 358498
write: 0
other: 51214
total: 409712
transactions: 25607 (426.73 per sec.)
deadlocks: 0 (0.00 per sec.)
read/write requests: 358498 (5974.23 per sec.)
other operations: 51214 (853.46 per sec.)
Test execution summary:
total time: 60.0074s
total number of events: 25607
total time taken by event execution: 479.9015
per-request statistics:
min: 7.50ms
avg: 18.74ms
max: 48.85ms
approx. 95 percentile: 21.88ms
Threads fairness:
events (avg/stddev): 3200.8750/5.73
execution time (avg/stddev): 59.9877/0.00
We are having a problem with our Swift cluster, running Swift version 1.8.0.
The cluster is built from 3 storage nodes plus a proxy node, and we use 2x replication. Each node has a single 2 TB SATA HDD; the OS runs on an SSD.
The traffic is ~300 files of 1.3 MB per minute; the files are all the same size. Each file is uploaded with an X-expire-after value equivalent to 7 days.
When we started the cluster around 3 months ago we uploaded significantly fewer files (~150/min) and everything was working fine. As we put more pressure on the system, at one point the object expirer could no longer expire files as fast as they were being uploaded, slowly filling up the servers.
After our analysis we found the following:
It's not a network issue, the interfaces are not overloaded, we don't have an extreme amount of open connections
It's not a CPU issue, loads are fine
It doesn't seem to be a RAM issue, we have ~20G free of 64G
The bottleneck seems to be the disk; iostat is quite revealing:
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 57.00 0.00 520.00 0.00 3113.00 11.97 149.18 286.21 0.00 286.21 1.92 100.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 2.00 44.00 7.00 488.00 924.00 2973.00 15.75 146.27 296.61 778.29 289.70 2.02 100.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 3.00 60.00 226.00 5136.00 2659.50 54.51 35.04 168.46 49.13 200.14 3.50 100.00
Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sdc 0.00 0.00 110.00 91.00 9164.00 2247.50 113.55 2.98 14.51 24.07 2.95 4.98 100.00
The read and write wait times are not always that good :) and can go up into the thousands of milliseconds, which is pretty dreadful.
We're also seeing many ConnectionTimeout messages from the node side and in the proxy.
Some examples from the storage nodes:
Jul 17 13:28:51 compute005 object-server ERROR container update failed with 10.100.100.149:6001/sdf (saving for async update later): Timeout (3s) (txn: tx70549d8ee9a04f74a60d69842634deb)
Jul 17 13:34:03 compute005 swift ERROR with Object server 10.100.100.153:6000/sdc re: Trying to DELETE /AUTH_698845ea71b0e860bbfc771ad3dade1/container/whatever.file: Timeout (10s) (txn: tx11c34840f5cd42fdad123887e26asdae)
Jul 17 12:45:55 compute005 container-replicator ERROR reading HTTP response from {'zone': 7, 'weight': 2000.0, 'ip': '10.100.100.153', 'region': 1, 'port': 6001, 'meta': '', 'device': 'sdc', 'id': 1}: Timeout (10s)
And also from the proxy:
Jul 17 14:37:53 controller proxy-server ERROR with Object server 10.100.100.149:6000/sdf re: Trying to get final status of PUT to /v1/AUTH_6988e698bc17460bbfc74ea20fdcde1/container/whatever.file: Timeout (10s) (txn: txb114c84404194f5a84cb34a0ff74e273)
Jul 17 12:32:43 controller proxy-server ERROR with Object server 10.100.100.153:6000/sdc re: Expect: 100-continue on /AUTH_6988e698bc17460bbf71ff210e8acde1/container/whatever.file: ConnectionTimeout (0.5s) (txn: txd8d6ac5abfa34573a6dc3c3be71e454f)
If all the services pushing to swift and the object-expirer are stopped, the disk utilization stays at 100% for most of the time. There are no async_pending transactions, but there is a lot of rsyncing going on, probably coming from the object-replicator.
If all are turned on, there are 30-50 or even more async_pending transactions at almost any given moment in time.
We thought about different solutions to mitigate the problem; this is basically the outcome:
SSDs for storage are too expensive, so that won't happen
Putting in another HDD paired with each existing one in a RAID 0 array (we already have replication in Swift)
Using some caching, like bcache or flashcache
Do any of you have experience with this kind of problem?
Any hints/other places to look for the root cause?
Is there a possibility to optimize the expirer/replicator performance?
If any additional info is required, just let me know.
Thanks
I've seen issues where containers with >1 million objects cause timeouts (due to the SQLite3 DB not being able to get a lock)... can you verify your containers' object counts?
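A minimal way to check, assuming the standard python-swiftclient CLI and that your auth environment variables (ST_AUTH/ST_USER/ST_KEY or the OS_* equivalents) are set, is to stat each container and look at the Objects: line:
swift stat <container_name>
If any container is in the millions of objects, spreading the objects across more containers usually relieves that kind of SQLite lock contention.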