KDB: comparing values produced by two different systems within a time period

I have two different systems pushing data into a kdb table. These are driven by a feed of prices.
I'd like to compare the values produced so I can ultimately flag large differences.
I'm very new to kdb though and finding it hard to even work out a starting point query.
Ultimately I'd like to take a time period (likely a minute), find a row for each system within this period where the driving price is the same, and compare the derived values.
A good starting point for me, though, would be to understand how to grab the first row within a time period for each system and compare/join them.
Thanks.
Simplified example data:
System              | Time     | driver | result1 | result2
systemA.instrument1 | 11:59:59 | 101.4  | 3.4     | 4.6
systemA.instrument1 | 12:00:01 | 101.5  | 3.8     | 4.8
systemA.instrument1 | 12:00:02 | 101.6  | 3.3     | 2.3
systemA.instrument2 | 12:00:02 | 106.6  | 11.1    | 11.3
systemA.instrument1 | 12:00:05 | 101.7  | 3.9     | 5.6
systemB.instrument1 | 12:00:09 | 101.1  | 3.2     | 7.8
systemB.instrument1 | 12:00:14 | 101.2  | 3.9     | 3.4
systemB.instrument1 | 12:00:17 | 101.3  | 3.1     | 8.9
systemB.instrument2 | 12:00:19 | 106.5  | 11.2    | 11.4
systemB.instrument1 | 12:00:58 | 101.7  | 3.9     | 9.3
systemB.instrument1 | 12:00:59 | 101.7  | 3.3     | 3.4
systemB.instrument1 | 12:01:03 | 101.4  | 3.1     | 5.6
I only want data from 12:00:00 to 12:00:59.
The only matching driver between systemA and systemB for instrument1 is 101.7. I'd like either of those rows to be used and the diff between the results shown.
For instrument2 the driver never matches, so I want to use the driver prices that are closest between the two systems.
results     | driver | driver diff | result1 diff | result2 diff
instrument1 | 101.7  | 0           | 0            | 3.7
instrument2 |        | 0.1         | 0.1          | 0.1

First, split your System column into its constituent parts:
table:(flip exec `System`Instrument!flip ` vs/: System from table),'delete System from table
The answer to your first question (get the first row per Instrument and System) is:
q)select first Time,first driver,first result1,first result2 by Instrument,System from table
Instrument  System | Time     driver result1 result2
-------------------| -------------------------------
instrument1 systemA| 11:59:59 101.4  3.4     4.6
instrument1 systemB| 12:00:09 101.1  3.2     7.8
instrument2 systemA| 12:00:02 106.6  11.1    11.3
instrument2 systemB| 12:00:19 106.5  11.2    11.4
Btw, in q it is a more common use case to request the last row per group, and that is easier to achieve:
q)select by Instrument,System from table
Define a function that finds the indices of the closest pair of values across two numeric vectors:
q)closest:{a:a?min a:abs(-) ./: x cross y;(a div count y;a mod count y)}  / returns (index into x;index into y)
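If q syntax is still opaque, the same idea expressed in Python looks roughly like this (purely illustrative, not part of the kdb solution):

# Illustrative Python equivalent of the q "closest" function above: given two
# numeric vectors x and y, return (i, j) minimising |x[i] - y[j]|, with ties
# broken by the first pair encountered (as q's a?min a does).
def closest(x, y):
    best = None  # (difference, i, j)
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            d = abs(xi - yj)
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]

# With the instrument1 drivers from the example data:
print(closest([101.4, 101.5, 101.6, 101.7],
              [101.1, 101.2, 101.3, 101.7, 101.7, 101.4]))  # (0, 5): 101.4 pairs with 101.4

Note the tie-breaking: for instrument1 both 101.4 and 101.7 have an exact match, and the first pair in iteration order wins, which is consistent with the 3.4 3.1 output shown below.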
Query result1 where drivers are closest:
q)select result1:result1(value group System)#'closest . value driver group System by Instrument from table
Instrument | result1
-----------| ---------
instrument1| 3.4 3.1
instrument2| 11.1 11.2

(Solved) How to read from a MongoDB collection with pymongo while densely writing into it?

Update:
It is solved. Please check my own answer below if you are interested. Thanks to everyone all the same!
My original post:
MongoDB server version: 3.6.8 (WSL Ubuntu 20.04)
pymongo 4.1.0
I am learning machine learning. Because I find TensorBoard hard to use, I am trying to implement a simple "traceable and visible training system" ("tvts") that has some of TensorBoard's features, using MongoDB and pymongo. I chose MongoDB because it is document-based and NoSQL, and therefore better suited to recording arbitrary properties of model training.
Below is how I use it to record training conditions:
import tvts

# before training the model
ts = tvts.tvts(NAME, '172.26.41.157', init_params={
    'ver': VER,
    'batch_size': N_BATCH_SIZE,
    'lr': LR,
    'n_epoch': N_EPOCH,
}, save_dir=SAVE_DIR, save_freq=SAVE_FREQ)

# after an epoch is done
ts.save(epoch, {
    'cost': cost_avg,
    'acc': metrics_avg[0][0],
    'cost_val': cost_avg_val,
    'acc_val': metrics_avg_val[0][0],
}, save_path)
I write all such data into a collection of my MongoDB, and then I can get statistics and charts like the ones below:
Name: mnist_multi_clf_by_numpynet_mlp_softmax_ok from 172.26.41.157:27017 tvts.train_log
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| train_id | parent | cost(min:last) | LR(b-e:max-min) | epoch_count | existed save/save | from | to | duration |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 1 | None-None | 1.01055:1.01055 | 0.1:0.1 | 100 | 10/10 | 2022-04-14 11:56:17.618000 | 2022-04-14 11:56:21.273000 | 0:00:03.655000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 2 | 1-100 | 0.56357:0.56357 | 0.1:0.1 | 100 | 10/10 | 2022-04-14 12:00:53.170000 | 2022-04-14 12:00:56.705000 | 0:00:03.535000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 3 | 2-100 | 0.15667:0.15667 | 0.1:0.1 | 300 | 15/15 | 2022-04-14 12:01:35.233000 | 2022-04-14 12:01:45.795000 | 0:00:10.562000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 4 | 3-300 | 0.06820:0.06839 | 0.1:0.1 | 300 | 15/15 | 2022-04-14 18:16:08.720000 | 2022-04-14 18:16:19.606000 | 0:00:10.886000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 5 | 2-100 | 0.03418:0.03418 | 0.5:0.5 | 200 | 10/10 | 2022-04-14 18:18:27.665000 | 2022-04-14 18:18:34.644000 | 0:00:06.979000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
| 6 | None-None | 1.68796:1.68858 | 0.001:0.001 | 3000 | 30/30 | 2022-04-16 09:15:56.085000 | 2022-04-16 09:18:01.608000 | 0:02:05.523000 |
+----------+-----------+-----------------+-----------------+-------------+-------------------+----------------------------+----------------------------+----------------+
I found that it gets stuck if I try to get the list of statistics while I am densely writing into the collection at the same time, i.e. when I try to get the statistics on the fly during training and each epoch of the training is very short (about 0.03 seconds).
But I found that I can still read the records with Studio 3T (a MongoDB GUI) while I am densely writing into the collection.
I googled a lot, but I still cannot solve it. Someone said the write lock is exclusive (see for example: mongodb write is occuring then a read must wait or not wait?), but then why can Studio 3T manage it?
Actually I am new to MongoDB; I can use it because I have a little experience with MySQL, and this "tvts" only does insertion and querying, i.e. it is a rather simple usage of MongoDB. Is there some equivalent of MySQL's "concurrent inserts" concept? (see for example: concurrent read and write in MySQL) I guess it should not be a very hard task for MongoDB to be read from while it is being written to.
Although it is a simple simulation of part of TensorBoard's features, I have already written almost 600 lines of code, so I am sorry that changing the database is not preferred.
Please help me. Thanks a lot!
Unbelievable! I accidentally solved it just a few minutes after I posted this question. It seems that a MongoDB collection can be read even while there are dense insertions going on, and I guess that is its normal behaviour. I probably could not google an answer because it is not a real issue. My problem seems to be caused by the PyCharm IDE I am using: I have the issue if I run my script inside PyCharm, but it is fine when I run it in a system shell window.
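For anyone who wants to reproduce the check outside an IDE, here is a minimal sketch (assuming a local mongod; the database, collection, and field names below are made up) that inserts densely in one thread while reading in another:

import threading
import time

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local MongoDB instance
coll = client["tvts_demo"]["train_log"]            # hypothetical db/collection names

def writer():
    # Dense inserts, comparable to saving stats after every very short epoch.
    for epoch in range(10000):
        coll.insert_one({"epoch": epoch, "cost": 1.0 / (epoch + 1)})

def reader():
    # Read statistics periodically while the writer thread is still inserting.
    for _ in range(20):
        time.sleep(0.5)
        print("documents so far:", coll.count_documents({}))

t = threading.Thread(target=writer)
t.start()
reader()
t.join()

Run from a plain shell this should keep printing growing counts; if it only stalls under PyCharm, that points at the IDE console rather than at MongoDB.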

Visualizing time-series from a SQL Database (Postgres)

I am building an app that applies a data science model to a SQL database of sensor metrics. For this purpose I chose PipelineDB (based on Postgres), which lets me build a continuous view on my metrics and apply the model to each new row.
For now, I just want to observe the metrics I collect through the sensor on a dashboard. The "metrics" table looks like this:
+---------------------+--------+---------+------+-----+
| timestamp | T (°C) | P (bar) | n | ... |
+---------------------+--------+---------+------+-----+
| 2015-12-12 20:00:00 | 20 | 1.13 | 0.9 | |
+---------------------+--------+---------+------+-----+
| 2015-12-13 20:00:00 | 20 | 1.132 | 0.9 | |
+---------------------+--------+---------+------+-----+
| 2015-12-14 20:00:00 | 40 | 1.131 | 0.96 | |
+---------------------+--------+---------+------+-----+
I'd like to build a dashboard in which I can see all my metrics evolving through time, and even be able to choose which columns to display.
I found a few tools that could match my needs, namely Grafana and Chronograf (for InfluxDB).
But neither of them lets me plug directly into Postgres and query my table to generate the metric-formatted data these tools require.
Do you have any advice on what I should do to use such dashboards with such data?
A bit late here, but Grafana now supports PostgreSQL data sources directly: https://grafana.com/docs/features/datasources/postgres. I've used it in several projects and it has been really easy to set up and use.
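If you want to sanity-check the shape of the data a dashboard panel would chart, a rough sketch like the following pulls the time-series rows from the metrics table described in the question (column names are taken from that example; the psycopg2 package and the connection details are assumptions):

import datetime

import psycopg2  # assumes a reachable Postgres/PipelineDB instance

# Made-up connection details; adjust to your setup.
conn = psycopg2.connect(host="localhost", dbname="sensors", user="dashboard", password="secret")
with conn, conn.cursor() as cur:
    # Quoted identifiers because the example column names contain spaces and units.
    cur.execute(
        """
        SELECT "timestamp", "T (°C)", "P (bar)", "n"
        FROM metrics
        WHERE "timestamp" >= %s
        ORDER BY "timestamp"
        """,
        (datetime.date(2015, 12, 12),),
    )
    for row in cur.fetchall():
        print(row)
conn.close()

A time column plus one or more numeric value columns, ordered by time, is essentially what Grafana's Postgres data source expects from a panel query as well.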

How to debug "Sugar CRM X Files May Only Be Used With A Sugar CRM Y Database."

Sometimes one gets a message like:
Sugar CRM 6.4.5 Files May Only Be Used With A Sugar CRM 6.4.5 Database.
I am wondering how Sugar determines what version of the database it is using. In the above case, I get the following output:
select * from config where name='sugar_version';
+----------+---------------+-------+
| category | name | value |
+----------+---------------+-------+
| info | sugar_version | 6.4.5 |
+----------+---------------+-------+
1 row in set (0.00 sec)
cat config.php |grep sugar_version
'sugar_version' => '6.4.5',
Given the above output, I am wondering how to debug the message "Sugar CRM 6.4.5 Files May Only Be Used With A Sugar CRM 6.4.5 Database.": Sugar seems to think the files are not of version 6.4.5 even though sugar_version is 6.4.5 in config.php. Where should I look next?
There are two options for this issue:
Option 1: Update your database to the latest version.
Option 2: Follow the steps below and change the SugarCRM config version.
mysql> select * from config where name ='sugar_version';
+----------+---------------+---------+----------+
| category | name | value | platform |
+----------+---------------+---------+----------+
| info | sugar_version | 7.7.0.0 | NULL |
+----------+---------------+---------+----------+
1 row in set (0.00 sec)
Update your SugarCRM version to the appropriate value:
mysql> update config set value='7.7.1.1' where name ='sugar_version';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1 Changed: 1 Warnings: 0
The above commands seem to be correct. Sugar seems to check that config.php and the config table in the database contain the same version. In my case I was making the mistake of using the wrong database -- so if you're like me and tend to have your databases mixed up, double check in config.php that 'dbconfig' is indeed pointing to the right database.

Google Cloud SQL VERY SLOW

I am thinking of migrating my website to Google Cloud SQL, so I signed up for a free account (D32).
Upon testing on a table with 23k records the performance was very poor, and I read that moving from the free account to a full paid account would give me access to faster CPU and disk... so I did.
Performance is still VERY POOR.
I have been running my own MySQL server for years now, upgrading as needed to handle more and more connections and to gain raw speed (needed because of a legacy application). I highly optimize tables and configuration, and make heavy use of the query cache, etc.
A few pages of our legacy system run over 1.5k queries per page. Currently I have been able to push the MySQL query time (execution and pulling of the data) down to 3.6 seconds for all those queries, meaning that MySQL takes about 0.0024 seconds to execute each query and return the values; not the greatest, but acceptable for those pages.
I uploaded a table involved in those many queries to Google Cloud SQL. I noticed that INSERTs already take SECONDS to execute instead of milliseconds, but I thought it might be the sync vs async setting. I changed it to async and the execution time for the insert didn't seem to change. Not a big problem for now; I am only testing queries at this point.
I ran a simple select * FROM <table> and noticed that it takes over 6 seconds. I thought that maybe the query cache needed to build, so I tried again and this time it took 4 seconds (excluding network traffic). I ran the same query on my backup server after a restart and with no connections at all, and it took less than 1 second; running it again, 0.06 seconds.
Maybe the problem is the cache being too big... let's try a smaller subset:
select * from <table> limit 5;
to my server: 0.00 seconds
Google Cloud SQL: 0.04 seconds
So I decided to try a trivial select on an empty table, no records at all, just created with a single field:
to my server: 0.00 seconds
Google Cloud SQL: 0.03 seconds
Profiling doesn't give any insight, except that the query cache is not running on Google Cloud SQL and that query execution there seems faster, yet overall it is not...
My Server:
mysql> show profile;
+--------------------------------+----------+
| Status | Duration |
+--------------------------------+----------+
| starting | 0.000225 |
| Waiting for query cache lock | 0.000116 |
| init | 0.000115 |
| checking query cache for query | 0.000131 |
| checking permissions | 0.000117 |
| Opening tables | 0.000124 |
| init | 0.000129 |
| System lock | 0.000124 |
| Waiting for query cache lock | 0.000114 |
| System lock | 0.000126 |
| optimizing | 0.000117 |
| statistics | 0.000127 |
| executing | 0.000129 |
| end | 0.000117 |
| query end | 0.000116 |
| closing tables | 0.000120 |
| freeing items | 0.000120 |
| Waiting for query cache lock | 0.000140 |
| freeing items | 0.000228 |
| Waiting for query cache lock | 0.000120 |
| freeing items | 0.000121 |
| storing result in query cache | 0.000116 |
| cleaning up | 0.000124 |
+--------------------------------+----------+
23 rows in set, 1 warning (0.00 sec)
Google Cloud SQL:
mysql> show profile;
+----------------------+----------+
| Status | Duration |
+----------------------+----------+
| starting | 0.000061 |
| checking permissions | 0.000012 |
| Opening tables | 0.000115 |
| System lock | 0.000019 |
| init | 0.000023 |
| optimizing | 0.000008 |
| statistics | 0.000012 |
| preparing | 0.000005 |
| executing | 0.000021 |
| end | 0.000024 |
| query end | 0.000007 |
| closing tables | 0.000030 |
| freeing items | 0.000018 |
| logging slow query | 0.000006 |
| cleaning up | 0.000005 |
+----------------------+----------+
15 rows in set (0.03 sec)
Keep in mind that I connect to both servers remotely from a server located in VA, while my own server is located in Texas (even if that should not matter much).
What am I doing wrong? Why do simple queries take this long? Am I missing or not understanding something here?
As of right now I won't be able to use Google Cloud SQL, because a page with 1500 queries would take way too long (circa 45 seconds).
I know this question is old, but...
Cloud SQL has poor support for MyISAM tables; it's recommended to use InnoDB.
We had poor performance when migrating a legacy app; after reading through the docs and contacting the paid support, we had to migrate the tables to InnoDB. The lack of a query cache was also a killer.
You may also find later on that you need to tweak the MySQL configuration via the 'flags' in the Google console. An example being 'wait_timeout', which is set too high by default (imo).
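For reference, a rough sketch of finding the MyISAM tables in a schema and generating the conversion statements (assuming the mysql-connector-python package; host, credentials, and schema name are made up):

import mysql.connector

conn = mysql.connector.connect(host="127.0.0.1", user="root", password="secret", database="mydb")
cur = conn.cursor()
cur.execute(
    "SELECT table_name FROM information_schema.tables "
    "WHERE table_schema = %s AND engine = 'MyISAM'",
    ("mydb",),
)
for (table_name,) in cur.fetchall():
    # Review before running these: changing the storage engine rewrites the whole table.
    print("ALTER TABLE `{}` ENGINE=InnoDB;".format(table_name))
cur.close()
conn.close()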
Hope this helps someone :)
The query cache is not yet a feature of Cloud SQL, which may explain the results. However, I recommend closing this question as it is quite broad and doesn't fit the format of a neat and tidy Q&A. There are just too many variables not mentioned here, and it isn't clear what a decisive "answer" would look like for such a general optimization question with so many variables at play.

OrientDB: Cannot find a command executor for the command request: sql.MOVE VERTEX

I am using OrientDB Community Edition 1.7.9 on Mac OS X.
Database Info:
DISTRIBUTED CONFIGURATION: none (OrientDB is running in standalone mode)
DATABASE PROPERTIES
NAME              | VALUE
Name              | null
Version           | 9
Date format       | yyyy-MM-dd
Datetime format   | yyyy-MM-dd HH:mm:ss
Timezone          | Asia/xxxx
Locale Country    | US
Locale Language   | en
Charset           | UTF-8
Schema RID        | #0:1
Index Manager RID | #0:2
Dictionary RID    | null
Command flow:
create cluster xyz physical default default append
alter class me add cluster xyz
move vertex #1:2 to cluster:xyz
The Studio UI throws the following error:
014-10-22 14:59:33:043 SEVE Internal server error:
com.orientechnologies.orient.core.command.OCommandExecutorNotFoundException:
Cannot find a command executor for the command request: sql.MOVE
VERTEX #1:2 TO CLUSTER:xyz [ONetworkProtocolHttpDb]
The console returns a record, just as a select does. I do not see any error in the log.
I am planning a critical feature based on altering the cluster of selected records.
Could anyone help in this regard?
Thanks in advance.
Cheers
The MOVE VERTEX command is not supported in 1.7.x; you have to switch to 2.0-M2.
The OrientDB Console is a Java application made to work against OrientDB databases and server instances.