Conditional update field based on row information

In KDB I have this table:
q)tab
items sales prices adjust factor
--------------------------------
nut   6     10     1b     1.2
bolt  8     20     1b     1.5
cam   0     15     1b     2
cog   3     20     0b     0n
nut   6     10     0b     0n
bolt  8     20     0b     0n
I would like to compute a new column based on a condition, for instance (pseudocode):
if[adjust; prices * factor;]
The aim is to get the following result:
items sales prices newPrices
----------------------------
nut   6     10     12
bolt  8     20     30
cam   0     15     30
cog   3     20     20
nut   6     10     10
bolt  8     20     20
Can someone please help me out?

I think you're looking for something like:
q) update newPrices:?[adjust;prices*factor;prices] from tab
items sales prices adjust factor newPrices
------------------------------------------
nut   6     10     1      1.2    12
bolt  8     20     1      1.5    30
cam   0     15     1      2      30
cog   3     20     0             20
nut   6     10     0             10
bolt  8     20     0             20
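
For readers who think more easily in Python, the per-row choice made by the vector conditional can be sketched as below. This is only an illustration of the logic, not q code: the `rows` list and the `new_price` helper are assumed names, and `None` stands in for the q null in the factor column.

```python
# Illustrative rows mirroring the question's table; None stands in for a q null.
rows = [
    {"items": "nut",  "prices": 10, "adjust": True,  "factor": 1.2},
    {"items": "bolt", "prices": 20, "adjust": True,  "factor": 1.5},
    {"items": "cam",  "prices": 15, "adjust": True,  "factor": 2},
    {"items": "cog",  "prices": 20, "adjust": False, "factor": None},
    {"items": "nut",  "prices": 10, "adjust": False, "factor": None},
    {"items": "bolt", "prices": 20, "adjust": False, "factor": None},
]


def new_price(row):
    # Same choice as ?[adjust; prices*factor; prices], made one row at a time:
    # multiply when the flag is set, otherwise keep the original price.
    if row["adjust"]:
        return row["prices"] * row["factor"]
    return row["prices"]


new_prices = [new_price(r) for r in rows]
print(new_prices)
```

The q version does the same thing for whole columns at once, which is why no explicit loop appears in the update statement.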

You can use a dictionary and fill the prices
q)d:`nut`bolt`cam!1.2 1.5 2
q)update newPrices:prices^prices*d items from tab
items sales prices newPrices
----------------------------
nut 6 10 12
bolt 8 20 30
cam 0 15 30
cog 3 20 20
bolt
screw
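
A rough Python sketch of the dictionary-and-fill idea follows. The names `factors` and `adjusted` are assumptions for illustration. Note that this approach keys on the item name only, so it does not look at the adjust flag: every row for a listed item is scaled.

```python
# Lookup table of factors per item, as in d:`nut`bolt`cam!1.2 1.5 2
factors = {"nut": 1.2, "bolt": 1.5, "cam": 2}

items = ["nut", "bolt", "cam", "cog"]
prices = [10, 20, 15, 20]


def adjusted(item, price):
    # A missing key gives None, which plays the role of the q null here.
    factor = factors.get(item)
    # Fall back to the original price when there is no factor (the ^ fill).
    return price if factor is None else price * factor


new_prices = [adjusted(i, p) for i, p in zip(items, prices)]
print(new_prices)
```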

Related

Pyspark dataframe conditional filter and imputation

I have a pyspark dataframe df
ID Total_Count A  B  C  D  Group Name       Chain
1  56          0  0  0  0  1     Apple      Fruits1
2  65          0  0  0  0  1     Apple      Fruits1
3  72          0  0  30 0  1     Banana     Fruits1
4  80          0  0  0  0  1     Strawberry Fruits1
5  142         58 58 14 12 1     Apple      Fruits1
6  130         63 50 9  8  1     Apple      Fruits1
7  145         74 44 17 10 1     Apple      Fruits1
8  119         54 48 8  9  1     Apple      Fruits1
11 161         71 63 16 11 1     Banana     Fruits1
12 124         54 43 19 8  1     Banana     Fruits1
I want to impute the A, B, C and D columns wherever they contain 0 (IDs 1, 2, 3 and 4).
1.) Logic: use the average over Group x Name (if available), otherwise the average over Group x Chain (if available), otherwise the average over Group.
Taking the imputation of IDs 1 and 2 as a demo:
After filtering for Group 1 and Name Apple (the rows that share the Group and Name of IDs 1 and 2), the proportions are calculated as A/Total_Count, B/Total_Count and so on:
A_PROP   B_PROP      C_PROP   D_PROP
0.408451 0.408450704 0.098592 0.084507042
0.484615 0.384615385 0.069231 0.061538462
0.510345 0.303448276 0.117241 0.068965517
0.453782 0.403361345 0.067227 0.075630252
2.) The average of the above 4 rows is then taken (for IDs 1 and 2 in this example).
A, B, C and D in df2 are calculated as X_prop_avg*Total_Count.
Expected output (df2) :
ID Total_Count A_prop_avg  B_prop_avg  C_prop_avg  D_prop_avg  A  B  C D
1  56          0.46429811  0.37496893  0.08807265  0.07266032  26 21 5 4
2  65          0.464298107 0.374968927 0.088072647 0.072660318 30 24 6 5
3  72          0.43823883  0.369039271 0.126302344 0.066419555 32 27 9 5
4  80          0.455611681 0.372992375 0.10081588  0.070580064 36 30 8 6
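
The arithmetic described above can be checked with a small plain-Python sketch. It is not PySpark code, only the fallback logic (Group x Name, then Group x Chain, then Group) written out on the sample rows. The helper names `needs_imputation` and `average_proportions` are assumptions, as is the choice to treat any row with a 0 in A to D as needing imputation and to round the result to whole counts.

```python
from statistics import mean

COLS = ["A", "B", "C", "D"]
FIELDS = ["ID", "Total_Count", "A", "B", "C", "D", "Group", "Name", "Chain"]
DATA = [
    (1, 56, 0, 0, 0, 0, 1, "Apple", "Fruits1"),
    (2, 65, 0, 0, 0, 0, 1, "Apple", "Fruits1"),
    (3, 72, 0, 0, 30, 0, 1, "Banana", "Fruits1"),
    (4, 80, 0, 0, 0, 0, 1, "Strawberry", "Fruits1"),
    (5, 142, 58, 58, 14, 12, 1, "Apple", "Fruits1"),
    (6, 130, 63, 50, 9, 8, 1, "Apple", "Fruits1"),
    (7, 145, 74, 44, 17, 10, 1, "Apple", "Fruits1"),
    (8, 119, 54, 48, 8, 9, 1, "Apple", "Fruits1"),
    (11, 161, 71, 63, 16, 11, 1, "Banana", "Fruits1"),
    (12, 124, 54, 43, 19, 8, 1, "Banana", "Fruits1"),
]
rows = [dict(zip(FIELDS, r)) for r in DATA]


def needs_imputation(row):
    # Assumption: a row is imputed if any of A, B, C, D is 0.
    return any(row[c] == 0 for c in COLS)


# Rows with complete data supply the proportions.
donors = [r for r in rows if not needs_imputation(r)]


def average_proportions(row):
    # Try Group x Name first, then Group x Chain, then Group alone.
    for keys in (("Group", "Name"), ("Group", "Chain"), ("Group",)):
        matches = [d for d in donors if all(d[k] == row[k] for k in keys)]
        if matches:
            return {c: mean(d[c] / d["Total_Count"] for d in matches) for c in COLS}
    return None


imputed = {}
for row in rows:
    if needs_imputation(row):
        props = average_proportions(row)
        if props is not None:
            # X = X_prop_avg * Total_Count, rounded to a whole count.
            imputed[row["ID"]] = {c: round(props[c] * row["Total_Count"]) for c in COLS}

print(imputed)
```

On the sample data, ID 3 falls back to the two Banana rows and ID 4, having no other Strawberry rows, falls back to Group x Chain.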

Replace infinities with nulls and then fill them using fills in q kdb

I was going through the explanation of fills and came across the example below: if there are infinities, replace them with nulls and then fill them using fills.
q)fills {(x where x=0W):0N;x} 0N 2 3 0W 0N 7 0W
Output - 0N 2 3 3 3 7 7
I want to extend this so that if the first value in the output is null, it is filled with a default value of 1. I have written two versions of the solution:
{(x where x=0N):1;x} fills {(x where x=0W):0N;x} 0N 2 3 0W 0N 20 30 0W
1^fills {(x where x=0W):0N;x} 0N 2 3 0W 0N 20 30 0W /- Output - 1 2 3 3 3 20 30 30
Which of the two is the more optimized version (I think it's the second one, using fill)?
Is there a better or more optimized version?

You can always test the solutions by timing them for a large vector
q)\ts {(x where x=0N):1;x} fills {(x where x=0W):0N;x}10000000#0N 2 3 0W 0N 20 30 0W
196 553649552
q)\ts 1^fills {(x where x=0W):0N;x}10000000#0N 2 3 0W 0N 20 30 0W
190 553649216
For large vectors you should get a small improvement by only filling the first item with 1, assuming that's the only one you need defaulted to 1
q)@[;0;1^]fills {(x where x=0W):0N;x}0N 2 3 0W 0N 20 30 0W
1 2 3 3 3 20 30 30
However, if the vector starts with a run of nulls (not just one), this won't help
q)@[;0;1^]fills {(x where x=0W):0N;x}0N 0N 2 3 0W 0N 20 30 0W
1 0N 2 3 3 3 20 30 30
In that case you're better off applying 1^ to the entire vector
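
To make the intended behaviour concrete, here is a minimal Python sketch of the same three steps: treat infinities as missing, fill forward, and default any leading run of missing values to 1. The function name `fill_forward` and the use of `None` for a q null are assumptions for illustration; it says nothing about q performance.

```python
import math


def fill_forward(values, default=1):
    """Treat infinities as missing, carry the last seen value forward,
    and use `default` for any leading run of missing values."""
    out = []
    last = default
    for v in values:
        missing = v is None or (isinstance(v, float) and math.isinf(v))
        if not missing:
            last = v
        out.append(last)
    return out


print(fill_forward([None, 2, 3, math.inf, None, 20, 30, math.inf]))
print(fill_forward([None, None, 2, 3, math.inf, None, 20, 30, math.inf]))
```

Because the default is carried as the starting value, a run of several leading nulls is handled the same way as a single one, which matches applying 1^ to the whole vector.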

Adding 1 to infinity transforms it to a null
q)0W 10 0N+1
0N 11 0N
To keep the original values and the nulls, let's subtract 1 from the resulting list
q)-1+0W 10 0N+1
0N 10 0N
fills can also take two arguments; the first is then used as the starting value, so it fills any leading nulls
q)fills 0N 10 20 0N 40
0N 10 20 20 40
q)fills[33;] 0N 10 20 0N 40
33 10 20 20 40
So, putting it all together for your requirements (where x is your input vector)
q)fills[1;] -1+x+1
1 2 3 3 3 20 30 30 30 2 3 3 3 20 30 30 30 2 3 3 3 20 30 30 30 2 3 3 3 20 30 3..
HTH,
Sean

Convert null and +/- infinity to 0:
nanInfToZero:{[x] / converts null (0n) and positive/negative infinity to zero
 if[not 9h=type x; '"Inputs should be a 64-bit float vector; cast with `float$x"];
 result:{[x] (x where x in -0w 0w):0n; x}[x]; / infinities become nulls
 0^result / nulls become zero
 };
if[not nanInfToZero[0n 2 3]~0 2 3f; '"Incorrect"];
if[not nanInfToZero[-0w 2 3]~0 2 3f; '"Incorrect"];
if[not nanInfToZero[0w 2 3]~0 2 3f; '"Incorrect"];
Then use fills to fill forward.
Tested under Windows x64 and KDB v4.

Indexing a Structure in matlab

I was under the impression that structures in MATLAB were similar to query tables in SQL, but I have a feeling I might be wrong.
I have a rather large dataset consisting of many entries and many fields. Ideally I want to index the structure, pulling out only the data I am interested in. Here is an example of the dataset:
Cond Type Stime ETime
2 10 1 900
2 10 1 900
2 10 1 900
3 1 901 1800
3 1 901 1800
4 1 1801 2700
8 1 901 1800
8 1 901 1800
9 1 901 1800
9 1 901 1800
12 1 901 1800
12 1 901 1800
13 10 1 900
13 10 1 900
13 10 1 900
16 1 901 1800
16 1 901 1800
17 10 1 900
17 10 1 900
17 10 1 900
19 10 1 900
19 10 1 900
19 10 1 900
20 10 1 900
20 10 1 900
20 10 1 900
22 1 901 1800
22 1 901 1800
25 10 1 900
25 10 1 900
25 10 1 900
27 1 901 1800
27 1 901 1800
28 1 901 1800
28 1 901 1800
30 1 1801 2700
31 1 901 1800
31 1 901 1800
32 10 1 900
32 10 1 900
32 10 1 900
35 10 1 900
35 10 1 900
35 10 1 900
What I want to do is pull specific entries for analysis. For example, I want all entries where Type is equal to 10, or all Cond from 1:20 that have ETime == 900.
I can do this by the following
idx = find([stats.Type] == 10);
[stats(idx).Stime]
but for multiple types I need a for loop as trying to use a vector throws an error.
idx = find([stats.Type] == 1:10); % Does not work
% must use this
temp = [];
for aa = 1:10
    idx = find([stats.Type] == aa);
    temp = horzcat(idx,temp);
end
[stats(temp).Stime]
Is this the wrong way to use structures? Is there an easier method to index a structure to pull data of interest?

This answer proposes using table indexing instead of struct indexing, which is a bit of a side-step to directly answering the question. However, my comments on this post were deemed useful so I've formalised as an answer...
If you use struct2table then you can interact with it as a table, which is generally much more intuitive.
Structures are useful if your fields have different numbers of elements (i.e. you couldn't form a consistent height table). In almost all other areas, I find tables are easier to use.
With tables you can use:
- Logical indexing
- Sorting (including sortrows by column name)
- The family of "join" operations
- Dot notation for accessing table columns by name, as you do for accessing struct fields, or select multiple columns by name using myTable( :, {'col1','col2'} ).
- You don't need weird syntactic tricks like [stats.Type] to group outputs, you can just do stats.Type
I would then use ismember to compare multiple items against a table column...
idx = ismember( stats.Type, 1:10 );
Unless you need the indices, you can skip using find for speed, and just directly index using idx.
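
The membership-mask idea is not specific to MATLAB. As a hedged illustration, the same pattern in plain Python looks like the sketch below; the `stats` list holds a few rows copied from the question, and the names `wanted_types`, `mask`, `stimes` and `subset` are assumptions.

```python
# A few rows from the question's dataset, one dict per entry.
stats = [
    {"Cond": 2,  "Type": 10, "Stime": 1,    "ETime": 900},
    {"Cond": 3,  "Type": 1,  "Stime": 901,  "ETime": 1800},
    {"Cond": 4,  "Type": 1,  "Stime": 1801, "ETime": 2700},
    {"Cond": 13, "Type": 10, "Stime": 1,    "ETime": 900},
    {"Cond": 25, "Type": 10, "Stime": 1,    "ETime": 900},
]

# Analogue of ismember(stats.Type, 1:10): one boolean per row, no loop over types.
wanted_types = set(range(1, 11))
mask = [row["Type"] in wanted_types for row in stats]

# Index directly with the mask instead of converting it to positions first.
stimes = [row["Stime"] for row, keep in zip(stats, mask) if keep]

# Second query from the question: Cond in 1..20 with ETime == 900.
subset = [row["Cond"] for row in stats if 1 <= row["Cond"] <= 20 and row["ETime"] == 900]

print(stimes)
print(subset)
```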

(q/kdb+) Generating an automated list

Example 1)
I have the code below
5#10+1*2
that generates
index value
0 12
1 12
2 12
3 12
4 12
How can I replace the number "1" by the index?
then generating
5#10+index*2
index value
0 10
1 12
2 14
3 16
4 18
Update: Example 2)
Now, let's say I have
mult:5;
t:([]numC:1 3 6 4 1;s:50 16 53 6 33);
update lst:(numC#'s) from t
the last update will generate
numC s lst
1 50 50
3 16 16 16 16
6 53 53 53 53 53 53 53
4 6 6 6 6 6
1 33 33
How can I generate the "lst" column as per below?
numC s lst
1 50 50+0*mult
3 16 16+0*mult 16+1*mult 16+2*mult
6 53 53+0*mult 53+1*mult 53+2*mult 53+3*mult 53+4*mult 53+5*mult
4 6 6+0*mult 6+1*mult 6+2*mult 6+3*mult
1 33 33+0*mult
I tried something like
update lst:(numC#'s + (til numC)*mult) from t
but I am getting an error
ERROR: 'type
Thanks vm

Is this what you're looking for:
q)x:5
q)x#10+(til x)*2
10 12 14 16 18
http://code.kx.com/q/ref/arith-integer/#til

You can remove take # and use til to simplify to:
q)10+2*til 5
10 12 14 16 18
til 5 already creates a list of 5 elements (0 to 4), so you do not need to take 5 elements from the result. Take is only required if your list of indices is longer than 5.
Update:
For your second example the following should work:
q)update lst:{y+x*til z}'[mult;s;numC] from t
q)update lst:s+mult*til each numC from t
numC s  lst
-------------------------
1    50 ,50
3    16 16 21 26
6    53 53 58 63 68 73 78
4    6  6 11 16 21
1    33 ,33
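
As a cross-check of the expected values, the same construction can be written in Python, where range(n) plays the role of til n. The variable names `num_c`, `s` and `lst` simply mirror the table columns and are otherwise assumptions.

```python
mult = 5
num_c = [1, 3, 6, 4, 1]
s = [50, 16, 53, 6, 33]

# Example 1: 10+2*til 5
simple = [10 + 2 * i for i in range(5)]

# Example 2: for each row, start at s and step by mult, numC times,
# i.e. s + mult * til numC evaluated row by row.
lst = [[start + mult * i for i in range(n)] for n, start in zip(num_c, s)]

print(simple)
print(lst)
```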

There are many ways to achieve this:
1) 10+2*til 5
2) (2*til 5) + 10
/ take operator: The dyadic take function creates lists. The left argument specifies the count and shape and the right argument provides the data.
It is useful for selecting from the front or end of a list.
https://code.kx.com/wiki/Reference/NumberSign
q)5#0 1 2 3 4 5 6 7 8 / take the first 5 items
0 1 2 3 4
q)-5#0 1 2 3 4 5 6 7 8 / take the last 5 elements
4 5 6 7 8
Use the take operator # only when it is required.
Say we have 10 elements, of which we need five in the output; then we can use:
5#10+2*til 10
/ The til function takes a non-negative integer argument X and returns the first X non-negative integers (0 to X-1)

How to eliminate series of values with so much variation

I have a dataset (azimuth vs. time) that measures the compass heading of an object through time. From it I can see when the object is moving (the heading varies a lot) and when it is static (the heading does not vary). My question is how to program this in MATLAB so as to eliminate the data recorded while the object is moving and keep only the data that show the object is static.
For example:
Azimuth (angle) | 30 30 30 15 10 16 19 24 24 24 17 14 12 15 16
Time (s) | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
The output would be:
Azimuth (angle) | 30 30 30 24 24 24
Time (s) | 1 2 3 8 9 10

s=diff(Azimuth)==0
%diff alone would skip the last value of each constant run (t=3 and t=10). Modify to include them as well:
s=[s(1),s(2:end)|s(1:end-1),s(end)]
Azimuth(s)
Time(s)
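
The same mask can be traced step by step in a short Python sketch using the data from the question. The names `same_as_next` and `static` are assumptions; the idea is that a sample is kept if it equals its previous or its next neighbour.

```python
azimuth = [30, 30, 30, 15, 10, 16, 19, 24, 24, 24, 17, 14, 12, 15, 16]
time = list(range(1, 16))

# Equivalent of diff(Azimuth)==0: True where a sample equals the next one.
same_as_next = [a == b for a, b in zip(azimuth, azimuth[1:])]

# A sample is static if it equals its previous or its next neighbour.
static = [
    (i > 0 and same_as_next[i - 1]) or (i < len(same_as_next) and same_as_next[i])
    for i in range(len(azimuth))
]

static_azimuth = [a for a, keep in zip(azimuth, static) if keep]
static_time = [t for t, keep in zip(time, static) if keep]

print(static_azimuth)
print(static_time)
```

This only detects exactly repeated readings; noisy real-world data would need a tolerance instead of an equality test.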
