How to join vectors with Prometheus?

It's probably something obvious, but I can't seem to find a solution for joining two vectors in Prometheus.
sum(
  rabbitmq_queue_messages{queue=~".*"}
) by (queue)
* on (queue) group_left
max(
  label_replace(
    kube_deployment_labels{label_daemon_name!=""},
    "queue",
    "$1",
    "label_daemon_queue_name",
    "(.*)"
  )
) by (deployment, queue)
Below is a picture of the output of the two separate vectors.

group_left puts the "many" side on the left of the operator, so you've got the operands of the * the wrong way around. Try it the other way.
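For illustration, the swapped form of the query above would keep group_left but put the (deployment, queue) side on the left (untested sketch):
max(
  label_replace(
    kube_deployment_labels{label_daemon_name!=""},
    "queue",
    "$1",
    "label_daemon_queue_name",
    "(.*)"
  )
) by (deployment, queue)
* on (queue) group_left
sum(
  rabbitmq_queue_messages{queue=~".*"}
) by (queue)
Equivalently, you could keep the original operand order and switch group_left to group_right.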

Cast a list column into dummy columns in Python Polars?

I have a very large data frame where there is a column that is a list of numbers representing category membership.
Here is a dummy version
import pandas as pd
import numpy as np
segments = [str(i) for i in range(1_000)]
# My real data is ~500m rows
nums = np.random.choice(segments, (100_000,10))
df = pd.DataFrame({'segments': [','.join(n) for n in nums]})
userId   segments
0        885,106,49,138,295,254,26,460,0,844
1        908,709,454,966,151,922,666,886,65,708
2        664,713,272,241,301,498,630,834,702,289
3        60,880,906,471,437,383,878,369,556,876
4        817,183,365,171,23,484,934,476,273,230
...      ...
Note that there is a known list of segments (0-999 in the example)
I want to cast this into dummy columns indicating membership to each segment.
I found a few ways of doing this:
In pandas:
df_one_hot_encoded = (df['segments']
.str.split(',')
.explode()
.reset_index()
.assign(__one__=1)
.pivot_table(index='index', columns='segments', values='__one__', fill_value=0)
)
(takes 8 seconds on a 100k row sample)
And in Polars:
df2 = pl.from_pandas(df[['segments']])
df_ans = (df2
.with_columns([
pl.arange(0, len(df2)).alias('row_index'),
pl.col('segments').str.split(','),
pl.lit(1).alias('__one__')
])
.explode('segments')
.pivot(index='row_index', columns='segments', values='__one__')
.fill_null(0)
)
df_one_hot_encoded = df_ans.to_pandas()
(takes 1.5 seconds inclusive of the conversion to and from pandas, .9s without)
However, I hear .pivot is not efficient, and that it does not work well with lazy frames.
I tried other solutions in polars, but they were much slower:
_ = df2.lazy().with_columns(**{segment: pl.col('segments').str.contains(segment) for segment in segments}).collect()
(2 seconds)
(df2
.with_columns([
pl.arange(0, len(df2)).alias('row_index'),
pl.col('segments').str.split(',')
])
.explode('segments')
.to_dummies(columns=['segments'])
.groupby('row_index')
.sum()
)
(4 seconds)
Does anyone know a better solution than the .9s pivot?
This approach ends up being slower than the pivot, but it's got a different trick so I'll include it.
df2 = pl.from_pandas(df)
df2_ans = (
    df2.with_row_count('userId')
       .with_column(pl.col('segments').str.split(','))
       .explode('segments')
       .with_columns([pl.when(pl.col('segments') == pl.lit(str(i)))
                        .then(pl.lit(1, pl.Int32))
                        .otherwise(pl.lit(0, pl.Int32))
                        .alias(str(i))
                      for i in range(1000)])
       .groupby('userId')
       .agg(pl.exclude('segments').sum())
)
df_one_hot_encoded = df2_ans.to_pandas()
A couple of other observations. I'm not sure if you checked the output of your str.contains method but I would think that wouldn't work because, for example, 15 is contained within 154 when looking at strings.
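For what it's worth, a substring-safe variant of that idea would split first and then test exact list membership per segment. This is an untested sketch reusing df and segments from the question; note that the list namespace is .arr in older Polars releases and .list in newer ones:
df2 = pl.from_pandas(df[['segments']])
# Split once, then check exact membership per segment (no substring false positives)
dummies = (
    df2.lazy()
       .with_column(pl.col('segments').str.split(','))
       .with_columns([pl.col('segments').arr.contains(pl.lit(segment)).cast(pl.Int32).alias(segment)
                      for segment in segments])
       .drop('segments')
       .collect()
)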
The other thing, which I guess is just a preference, is the with_row_count syntax vs pl.arange. I don't think the performance of either is better (at least not significantly so), but with with_row_count you don't have to reference the df name to get its length, which is nice.
I tried a couple of other things that were also worse, including skipping the explode and just using is_in, but that was slower. I also tried using bools instead of 1s and 0s and then aggregating with any, but that was slower too.

OptaPlanner, how to minimize the largest sum in DRL

Task is the PlanningEntity; each task is in a group, and groupId is a member of the Task class. Which task is in which group is fixed; groupId is neither a PlanningVariable nor a problem fact.
Each task has a duration; it's a shadow variable that is affected by the combination (optimization) of the tasks within the group. Let's call the sum of task durations in a group S: S1 for group 1, S2 for group 2, and so on.
The soft constraint is to minimize the largest S across all the groups. This is not the same as minimizing every S; that's not the goal. I can only think of something like the following, which is not working. The "index" is the PlanningVariable.
rule "Minimize the largest group duration"
when
accumulate(
accumulate(
$g: Group(),
Task(index != null, groupId == g.getId(), $d: duration);
$s: sum($d)
)
$smax: max($s)
)
then
scoreHolder.addSoftConstraintMatch(kcontext, 1, (-$smax));
end
I am not sure I entirely understand the issue, but something like this should get you the maximum duration:
Task(..., $d: duration)
not Task(..., duration > $d)
$d is now your maximum duration, and you can do anything you like with it.
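For example, if the per-group sum can be materialized on some fact (say a hypothetical GroupDuration fact with a total field that holds S for one group; this class is not part of the question), the same trick would pick out the largest sum. An untested sketch under that assumption:
rule "Minimize the largest group duration"
when
    // GroupDuration is an assumed helper fact: one instance per group, total = S for that group
    GroupDuration($max: total)
    not GroupDuration(total > $max)
then
    scoreHolder.addSoftConstraintMatch(kcontext, 1, -$max);
end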
(Note: You should add the drools tag and possibly also remove the optaplanner one, as this is barely related to OptaPlanner.)

TSQL divide one count by another to give a proportion

I would like to calculate the proportion of animals in column BreedTypeID with a value of 1. I think the easiest way is to divide the count of rows where BreedTypeID = 1 by the total count of BreedTypeID. (I also want them to have the same YEAR(DOB) and the same substring in their ID, as shown.) I tried the following:
(COUNT([dbo].[tblBreed].[BreedTypeID])=1 OVER (PARTITION BY Substring([AnimalNo],6,6), YEAR([DOB]))
 / COUNT([dbo].[tblBreed].[BreedTypeID]) OVER (PARTITION BY Substring([AnimalNo],6,6), YEAR([DOB]))) As Proportion
But it errored on the COUNT([dbo].[tblBreed].[BreedTypeID])=1 part.
How can I specify to only count [BreedTypeID] when it equals 1?
Many thanks
This will fix your problem, although I would suggest you use table aliases instead of schema.table.column. Much easier to read:
Just replace:
COUNT([dbo].[tblBreed].[BreedTypeID])=1
WITH
SUM( CASE WHEN [dbo].[tblBreed].[BreedTypeID] = 1 THEN 1 ELSE 0 END)
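Put together, it could look like the sketch below. This is illustrative only: it assumes AnimalNo and DOB are columns of the same table (the question doesn't show their source) and uses b as an alias for dbo.tblBreed. The * 1.0 avoids integer division, which would otherwise truncate the proportion to 0 or 1:
SELECT b.*,
       SUM(CASE WHEN b.BreedTypeID = 1 THEN 1 ELSE 0 END)
           OVER (PARTITION BY SUBSTRING(b.AnimalNo, 6, 6), YEAR(b.DOB)) * 1.0
     / COUNT(b.BreedTypeID)
           OVER (PARTITION BY SUBSTRING(b.AnimalNo, 6, 6), YEAR(b.DOB)) AS Proportion
FROM dbo.tblBreed AS b;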

Matlab categorical table variables: Speed? Use in join keys?

I've been dipping my toe into Matlab's categorical variable pool in the context of Matlab tables. Actually, I may have wandered into that territory in the past, but if so, it would have been in a relatively superficial manner.
These days, I want to use Matlab code patterns to do what I normally would do in MS Access, e.g., various types of joins and filtering. Much of my data is categorical, and I've read up on the advantages of using categorical variables in tables. However, they mostly centre around descriptiveness (over enumerated types) and memory efficiency. I haven't run across mention of speed. Do categorical variables offer a speed advantage?
I also wonder how advisable it is to use categorical variables when doing various types of joins. The categorical variables will occupy different tables, so it's not clear to me how equivalence in values is established if such variables are involved in the SQL ON clause (which Matlab refers to as a keys parameter).
From the dearth of relevant Google hits, it almost seems like I'm in new territory, which to me would be a scary thing. Lack of documentation of best practices, and the resulting need for trial/error and reverse engineering, requires more time than I can devote, so I'll sadly revert back to using strings.
If anyone can point to online guidance information, I'd appreciate it.
A partial answer only....
The following test indicates that categorical data behaves sensibly when used as join keys:
BigList = {'dog' 'cat' 'mouse' 'horse' 'rat'}'
SmallList = BigList( 1 : end-2 )
Nrows = 20;
% Create tables for innerjoin using strings
tBig = table( ...
(1:Nrows)' , ...
BigList( ceil( length(BigList) * rand( Nrows , 1 ) ) ) , ...
'VariableNames' , {'B_ID' 'Animal'} )
tSmall = table( ...
(1:Nrows)' , ...
SmallList( ceil( length(SmallList) * rand( Nrows , 1 ) ) ) , ...
'VariableNames' , {'S_ID' 'Animal'} )
tBigSmall = innerjoin( tBig , tSmall , 'Keys','Animal' );
tBig = sortrows( tBig , {'Animal','B_ID'} );
tSmall = sortrows( tSmall, {'Animal','S_ID'} );
tBigSmall = sortrows( tBigSmall, {'Animal' 'B_ID' 'S_ID'} );
% Now innerjoin the same tables using categorized strings
tcBig = tBig;
tcBig.cAnimal = categorical( tcBig.Animal );
tcBig.Animal = [];
tcSmall = tSmall;
tcSmall.cAnimal = categorical( tcSmall.Animal );
tcSmall.Animal = [];
tcBigSmall = innerjoin( tcBig , tcSmall , 'Keys','cAnimal' );
tcBig = sortrows( tcBig , {'cAnimal','B_ID'} );
tcSmall = sortrows( tcSmall, {'cAnimal','S_ID'} );
tcBigSmall = sortrows( tcBigSmall, {'cAnimal' 'B_ID' 'S_ID'} );
% Check if the join results are the same
if all( tBigSmall.Animal == tcBigSmall.cAnimal )
    disp('categorical vs string key: inner joins MATCH.')
else
    disp('categorical vs string key: inner joins DO NOT MATCH.')
end % if
So the only question now is about speed. This is a general question, not just for joins, so I'm not sure what would be a good test. There are many possibilities, e.g., number of table rows, number of categories, whether it's a join or a filtering, etc.
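As a rough starting point, a timing sketch over the tables built above could use timeit; the row and category counts would need to be scaled up considerably for the comparison to mean much:
% Sketch: compare innerjoin speed with string keys vs categorical keys
fString = @() innerjoin( tBig , tSmall , 'Keys', 'Animal' );
fCateg  = @() innerjoin( tcBig , tcSmall , 'Keys', 'cAnimal' );
fprintf( 'string keys:      %.4f s\n', timeit( fString ) )
fprintf( 'categorical keys: %.4f s\n', timeit( fCateg ) )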
In any case, I believe the answers to both questions deserve to be better documented.

Postgis using contains and count even if answer is 0

I have a problem with the count function...
I want to isolate all polygons lying beside the G10 polygon, and I want to count the number of points (subway stations) in my polygons (neighborhoods), but I want to receive an answer even if that answer must be 0.
I used the following statement:
select a2.name, count(m.geom)
from arr a1, arr a2, metro m
where n1.code='G10'
and ((st_touches(a1.geom, a2.geom)) or
(st_overlaps(a1.geom, a2.geom)))
and ST_Contains(a2.geom, s.geom)
group by a2.name, m.geom
I know the problem lies with the and ST_Contains(a2.geom, s.geom) part of the WHERE clause, but I do not know how to solve it!
Use an explicit LEFT JOIN:
SELECT a1.name, COUNT(a2.code)
FROM arr a1
LEFT JOIN
arr a2
ON ST_Intersects(a1.geom, a2.geom)
WHERE a1.code = 'G10'
I'm not including the other tables, as you have obvious typos in your original query and it's not clear how they should be connected.
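To extend that to the station count, a sketch along these lines should work; the table and column names (arr.code, arr.name, arr.geom, metro.geom) are guesses based on the question, so treat it as illustrative only:
-- Sketch: LEFT JOIN keeps neighborhoods with no stations; COUNT(m.geom) only counts non-NULL matches
SELECT a2.name, COUNT(m.geom) AS station_count
FROM arr a1
JOIN arr a2
  ON ST_Touches(a1.geom, a2.geom) OR ST_Overlaps(a1.geom, a2.geom)
LEFT JOIN metro m
  ON ST_Contains(a2.geom, m.geom)
WHERE a1.code = 'G10'
GROUP BY a2.name;
Neighborhoods with no stations then come back with a count of 0 instead of disappearing from the result.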