We've seen issues where a kdb database is corrupted and are looking for a way to implement a check that every kdb column is the same length in a particular table. Any recommendation on how to do this?
i.e., we'd like to get back each column in a table and its length.
These tables have upwards of 200 columns. Any way to go about this efficiently?
Any help is appreciated. Thank you.
Something like this might work for you.
q)tables[]
`positions`quote`trade
q)count each flip trade
time  | 40000
sym   | 40000
src   | 40000
price | 40000
amount| 40000
side  | 40000
You can run the same thing on partitioned data as well.
q)count each flip select from ohlc where date=last date
date     | 1440
sym      | 1440
exchange | 1440
timestamp| 1440
open     | 1440
high     | 1440
low      | 1440
close    | 1440
volume   | 1440
EDIT: The methods above will only work on tables that are not corrupted (a select or flip on a table with mismatched column lengths will typically signal a 'length error), so they may not be best suited to your use case.
If the data is corrupted, you can get each column from its location on disk and count it.
q)cols[ohlc]!{count get hsym`$"/path/to/hdb/2020.02.09/ohlc/",string x}each cols ohlc
date     | 1440
sym      | 1440
exchange | 1440
timestamp| 1439
open     | 1440
high     | 1440
low      | 1440
close    | 1440
volume   | 1440
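To turn this into a single pass/fail check across all 200-odd columns, you can compare the counts against each other. A minimal sketch along those lines, reusing the same placeholder path and comparing every column against the first one:
q)c:cols ohlc
q)n:c!{count get hsym`$"/path/to/hdb/2020.02.09/ohlc/",string x}each c
q)1=count distinct value n      / 1b when every column has the same length
q)where not n=first value n     / columns whose length differs from the first column's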
I have a Grafana dashboard that is measuring latency of a Kafka topic per partition in minutes using this query here:
avg by (topic, consumergroup, environment, partition)(kafka_consumer_lag_millis{environment="production",topic="topic.name",consumergroup="consumer.group.name"}) / 1000 / 60
The graph is working fine, but we're seeing negative spikes that don't make a lot of sense to us. Does anyone know what could be causing them?
This is more of a guess than a definitive answer. Let's suppose, in very simple terms, that we have two metrics being measured, and their difference is the number sent to Prometheus:
lag = producer_offset - consumer_offset
While the producer offset is measured with a polling mechanism, the consumer offset is measured with direct synchronous requests (to wherever those values are held). This way, we can end up with outdated values for the producer. Example:
instant | producer | consumer
t1      | 10       | 0
t2      | 30       | 15
t3      | 200      | 70
If we always had up-to-date values, we would have:
instant | lag
t1      | 10 - 0 = 10
t2      | 30 - 15 = 15
t3      | 200 - 70 = 130
Now let's suppose that, due to the long polling period, the producer offset we use is one measurement behind from t2 onwards:
l(t1) = p(t1) - c(t1)
l(t2) = p(t1) - c(t2)
l(t3) = p(t2) - c(t3)
This would produce:
instant | lag
t1      | 10 - 0 = 10
t2      | 10 - 15 = -5
t3      | 30 - 70 = -40
And there's your negative value: when the lag is growing and the producer offset is refreshed less often than Prometheus scrapes, the stale producer value can end up smaller than the fresh consumer value, so the computed lag goes negative.
To really answer your question, you'd need to check the code of the Kafka exporter that Prometheus scrapes to see whether that polling interval is configurable, and make it small enough that the negative values vanish (or at least smaller than Prometheus' scrape interval).
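As a dashboard-side stopgap (this hides the symptom rather than fixing the stale polling), you could also clamp the expression at zero with PromQL's clamp_min, wrapping the query from the question:
clamp_min(
  avg by (topic, consumergroup, environment, partition) (
    kafka_consumer_lag_millis{environment="production",topic="topic.name",consumergroup="consumer.group.name"}
  ) / 1000 / 60,
  0
)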
I am using PySpark and want to take advantage of multiple nodes to improve performance.
For example:
Suppose I have 3 columns and 1 million records:
Emp ID | Salary | % Increase | New Salary
1      | 200    | 0.05       |
2      | 500    | 0.15       |
3      | 300    | 0.25       |
4      | 700    | 0.1        |
I want to compute the New Salary column and want to use the power of multiple nodes in pyspark to reduce overall processing time.
I don't want to do an iterative, row-wise computation of New Salary.
Does df.withColumn do the computation at a dataframe level? Would it be able to give better performance as more nodes are used?
Spark's DataFrames are basically a distributed collection of data. Spark manages this distribution and the operations on it (such as .withColumn), so a column expression is evaluated in parallel across the DataFrame's partitions on the executors rather than row by row on the driver, and more nodes generally means more partitions processed at once.
A quick search on how to increase Spark's performance will turn up plenty of further tuning guidance.
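As a rough sketch of what that looks like in practice (column names are taken from the example above; the SparkSession setup is assumed), a withColumn expression is computed across the DataFrame's partitions with no explicit row iteration in your code:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("new_salary").getOrCreate()

# Small stand-in for the 1-million-row table in the question
df = spark.createDataFrame(
    [(1, 200, 0.05), (2, 500, 0.15), (3, 300, 0.25), (4, 700, 0.1)],
    ["emp_id", "salary", "pct_increase"],
)

# Column arithmetic is a narrow transformation: each partition is processed
# independently on its executor, so adding nodes (and partitions) scales the
# work out instead of looping over rows on the driver.
df = df.withColumn("new_salary", F.col("salary") * (1 + F.col("pct_increase")))
df.show()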
I'm trying to capture the adc values over, say, 50 seconds. I end up with the picture below.
I set the metro to 50, which is 0.05 sec, and the tabwrite size to 1000. I got a list of values as below.
But it doesn't seem right: when I speak louder for a few seconds, the entire graph changes. Can anyone point out what I did wrong? Thank you.
The [metro 50] will retrigger every 50 milliseconds (20 times per second).
So the table will get updated quite often, which explains why it reacts immediately to your voice input.
To record 50 seconds worth of audio, you need:
a table that can hold 2205000 (50 * 44100) samples, assuming a 44.1 kHz sample rate (as opposed to the default size of 64)
a [metro] that triggers every 50 seconds:
[tgl]
|
[metro 50000]
|
| [adc~]
|/
[tabwrite~ mytable]
[table mytable 2205000]
For the sake of the exercise, let's assume that I'm monitoring the percentage of domestic or foreign auto sales across the US.
Assume my dataset looks like:
StateOfSale | Origin     | Sales
'CA'        | 'Foreign'  | 1200
'CA'        | 'Domestic' | 800
'TX'        | 'Foreign'  | 800
'TX'        | 'Domestic' | 800
How would I show the percentage of Foreign Sales, by State of Sale, but each State is a line/mark/bar in the visual?
So for CA, the Foreign Percentage is 60%. For TX, the Foreign Percentage is 50%.
This is what Tableau was born to do! There are a lot of great ways to visualize this type of question.
Use a quick table calculation called "Percent of Total" and compute the percentage within each State. In the picture below, "StateOfSale" is on Columns and "SUM(Sales)" is on Rows, and the percentage is computed using Table (Down).
You can also graph the raw sales numbers in addition to displaying the text percentage to gain additional context about the number of sales between states.
Finally, if you've got a lot of states, it can be cool to plot it out on a map. You can do this by creating a calculated field for percentage and then filtering out the domestic sales.
Field Name: Percentage
SUM([Sales])/SUM({FIXED [StateOfSale]: SUM([Sales])})
I have a table of tick data representing prices of various financial instruments up to millisecond precision. The problem is, there are over 5 billion entries, and even the most basic queries take several minutes.
I only need data with a precision of up to 1 second - is there an efficient way to sample the table so that the precision is reduced to roughly 1 second prior to querying? This should dramatically cut the amount of data and hence execution time.
So far, as a quick hack I've added the condition where i mod 2 = 0 to my query, but is there a better way?
The best way to bucket time data is with xbar:
q)select last price, sum size by 10 xbar time.minute from trade where sym=`IBM
minute| price size
------| -----------
09:30 | 55.32 90094
09:40 | 54.99 48726
09:50 | 54.93 36511
10:00 | 55.23 35768
...
More info: http://code.kx.com/q/ref/arith-integer/#xbar
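For the 1-second precision mentioned in the question, the same pattern works at second granularity. A minimal sketch, assuming a trade table with time, sym, price and size columns like the one above:
q)select last price, sum size by sym, time.second from trade        / 1-second buckets (the cast truncates the milliseconds)
q)select last price, sum size by sym, 5 xbar time.second from trade / coarser buckets, e.g. 5-second bars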