NDepend - Query to Reduce Matrix

I'm using NDepend to write a query that extracts a subset of my assemblies and their dependent assemblies into a Dependency Matrix.
I would like to further reduce the size of the matrix to show only dependent assemblies with small or medium coupling (the ones that would be relatively easy to decouple). Therefore I only want to show the assemblies that have fewer than 20 method usages.
How do I update this query to do this?
let agentAssemblies = Assemblies.WithNameLike("Agent")
let assembliesUsedByAgents = Assemblies.ExceptThirdParty().UsedByAny(agentAssemblies)
from a in agentAssemblies.Union(assembliesUsedByAgents)
select a

You can refine the query this way:
let agentAssemblies = Assemblies.WithNameLike("Agent")
let assembliesUsedByAgents = Assemblies.ExceptThirdParty().UsedByAny(agentAssemblies)
from a in assembliesUsedByAgents
// Methods of assembly 'a' that are used by the agent assemblies:
let methodsUsedFromAgentAssemblies = a.ChildMethods.UsedByAny(agentAssemblies)
// Keep only loosely coupled assemblies (fewer than 20 method usages):
where methodsUsedFromAgentAssemblies.Count() < 20
// The agent-side methods responsible for the coupling:
let agentAssembliesMethodsUsingMe = agentAssemblies.ChildMethods().UsingAny(methodsUsedFromAgentAssemblies)
select new {
   a,
   methodsUsedFromAgentAssemblies,
   agentAssembliesMethodsUsingMe
}
From the code query result you can visualise both methodsUsedFromAgentAssemblies and agentAssembliesMethodsUsingMe, and by right-clicking the method sets you can export both to the Dependency Matrix to get a clear picture of which method is calling which.

How to predict the outcome variables using a saved pipeline when the data set does not contain the actual outcome?

I have a data set with the following columns: outcome (the value we want to predict) and raw (a column of text). I want to develop an ML model that predicts outcome from the raw column. I trained the model in Databricks using the following pipeline:
regexTokenizer = RegexTokenizer(inputCol="raw", outputCol="words", pattern="\\W")
countVec = CountVectorizer(inputCol="words", outputCol="features")
indexer = StringIndexer(inputCol="outcome", outputCol="label").setHandleInvalid("skip").fit(trainDF)
inverter = IndexToString(inputCol="prediction", outputCol="prediction_label", labels=indexer.labels)
nb = NaiveBayes(labelCol="label", featuresCol="features", smoothing=1.0, modelType="multinomial")
pipeline = Pipeline(stages=[regexTokenizer, indexer, countVec, nb, inverter])
model = pipeline.fit(trainDF)
model.write().overwrite().save("/FileStore/project")
In another notebook, I load the model and try to predict the values for a new data set. This data set does not contain the outcome variable ("outcome" in this case):
model = PipelineModel.load("/FileStore/project")
score_output_df = model.transform(score_this)
When I try to predict values for the new data set, I get an error saying that the column "outcome" cannot be found. I suspect this is because some stages in the pipeline take this column as input (the indexer and inverter stages convert the outcome column to numbers and then back to string labels).
My question is: how can I load a saved model and use it to predict values when the original pipeline contains stages that take this column as input?
Instead of using
model.write().overwrite().save("/FileStore/project")
you have to write it like this:
model.write().overwrite().save("/FileStore/project/model.sav")
and then load it like this:
model = PipelineModel.load("/FileStore/project/model.sav")
score_output_df = model.transform(score_this)
I found a solution to the problem and will post it here so that anyone facing the same problem can benefit from it. The solution was simply to extract the stages I want to use for prediction and assign them back to the model:
model = PipelineModel.load("/FileStore/project")
# Keep all stages except stage [1], the StringIndexer,
# which requires the "outcome" column as input:
stages1 = [model.stages[0]] + model.stages[2:]
model.stages = stages1
score_output_df = model.transform(score_this)
In this code, I exclude the second stage ([1]) because it contains the indexer. Once I do this, I can predict values when the "outcome" column is not available.
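The same idea can be written a little more robustly by filtering the stages on their type instead of hard-coding their positions. A minimal sketch, assuming the StringIndexer is the only stage that reads "outcome":
from pyspark.ml import PipelineModel
from pyspark.ml.feature import StringIndexerModel
model = PipelineModel.load("/FileStore/project")
# Keep every stage except the fitted StringIndexerModel; the remaining
# stages only read the "raw" column and the outputs of earlier stages.
model.stages = [s for s in model.stages if not isinstance(s, StringIndexerModel)]
score_output_df = model.transform(score_this)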

Data ranged subscript strange behavior

I was playing with Swift's Data in the following small piece of code:
var d = Data(count: 10)
d[5] = 3
let d2 = d[5..<8]
print("\(d2[0])")
To my surprise, this code throws an exception in print(), while the following code does not:
var d = Data(count: 10)
d[5] = 3
let d2 = d.subdata(in: 5..<8)
print("\(d2[0])")
I somehow understand why this happens, but I don't get why it is designed like this. When I use subdata(in:) I get a copy of the whole range, so indexing is valid from 0. But when I use the range subscript [] I get access to the requested range while the indexing stays the same as before, so in my first example d2[5] is 3.
But why was it designed like this? I don't want to make a copy of my data by using the subdata(in:) method; I just want to access a portion of my data with saner indexing.
This especially creates unexpected behavior if you pass the slice to a function. For example, the following code produces unexpected results and exceptions, and you may not easily find out why:
func testit(idata: Data) {
    if idata.count > 0 {
        print("\(idata.count)")
        print("\(idata[0])")
    }
}
//...
var d = Data(count: 10)
d[5] = 3
let d2 = d[5..<8]
testit(idata: d2)
This code is really strange: if you debug it, you will see that print("\(idata.count)") prints 3 as the size of idata, which is correct, but accessing idata[0] raises an exception.
Is there any reason for this design? I expected to be able to index the Data resulting from the subscript starting at 0, but that is not the case. Can I do this without using subdata(in:), which creates a copy of the data, and without passing the base offset of the slice as an extra argument?
d[5..<8] returns Data.SubSequence, which happens to be Data itself. Generally, slices share the indices with their base collection, as documented in Slice.
One possible reason for this design decision is that it guarantees that subscripting a slice is an O(1) operation (offsetting an index into the base collection is not necessarily O(1), e.g. not for strings).
It is also convenient, as in this example, which locates the text after the second occurrence of a character in a string:
let string = "abcdefgabcdefg"
// Find first occurrence of "d":
if let r1 = string.range(of: "d") {
    // Find second occurrence of "d":
    if let r2 = string[r1.upperBound...].range(of: "d") {
        print(string[r2.upperBound...]) // efg
    }
}
As a consequence, you must never assume that the indices of a collection are zero-based (unless documented, as for Array.startIndex). Use startIndex to get the first index, or first to get the first element.
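Applied to the snippet from the question, here are two sketches of ways to deal with this without subdata(in:); the first avoids the copy entirely, the second copies (like subdata(in:)) but re-bases the indices at 0:
import Foundation

var d = Data(count: 10)
d[5] = 3
let d2 = d[5..<8]
// 1. Index relative to the slice's own startIndex (no copy):
print(d2[d2.startIndex])    // 3
// 2. Copy the slice into a fresh Data, which makes indexing zero-based again:
let rebased = Data(d2)
print(rebased[0])           // 3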

How do I merge or combine error rates?

Let's say I have a dataset with 9 columns of continuous data and 4 columns of categorical data. In MATLAB, I separate the columns into two groups, do training/testing (naïve Bayes) on them separately, and determine that the continuous columns have an error rate of 0.45 and the categorical columns an error rate of 0.33. My question is: how do I determine the combined error?
EDIT - Simple pseudocode overview added:
NB1_cumulLoss = 0;
NB2_cumulLoss = 0;
for x = 1:num_iterations
    Mdl_NB1 = fitcnb(TrainingSet_Con,TrainingTargets,'Distribution','normal');
    Mdl_NB2 = fitcnb(TrainingSet_Dis,TrainingTargets,'Distribution','mn');
    [NB1_label,NB1_Posterior,NB1_Cost] = predict(Mdl_NB1,TestPoint_Con);
    [NB2_label,NB2_Posterior,NB2_Cost] = predict(Mdl_NB2,TestPoint_Dis);
    NB1_cumulLoss = NB1_cumulLoss + resubLoss(Mdl_NB1);
    NB2_cumulLoss = NB2_cumulLoss + resubLoss(Mdl_NB2);
end
NB1_avg_score = NB1_cumulLoss/num_iterations
NB2_avg_score = NB2_cumulLoss/num_iterations
total_avg_score = ???
The three obvious choices, in principle, are:
(A+B) / 2
A * B
(A*(CountA/TotalCount)) + (B*(CountB/TotalCount))
But I'm not sure whether any of these is right in this case.
This does not make sense; you are effectively building two separate models. So either build one model with all columns (maybe with 'Distribution','mvmn'), or combine both models into one with something like
% Concatenate the posteriors column-wise so each row still
% corresponds to one observation:
Mdl_Ens = fitcnb([NB1_Posterior, NB2_Posterior],TrainingTargets,'Distribution','normal');
NEns_cumulLoss = NEns_cumulLoss + resubLoss(Mdl_Ens);
to actually build a single model out of the outputs of the two models, each based on a subset of the columns.
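Spelled out, the stacking idea might look like the sketch below. Variable names follow the question; note that for an honest error estimate, the level-2 posteriors should come from held-out predictions rather than resubstitution:
% Level-1 models, one per column group:
Mdl_NB1 = fitcnb(TrainingSet_Con, TrainingTargets, 'Distribution', 'normal');
Mdl_NB2 = fitcnb(TrainingSet_Dis, TrainingTargets, 'Distribution', 'mn');
% Posterior class probabilities for the training rows:
[~, P1] = predict(Mdl_NB1, TrainingSet_Con);
[~, P2] = predict(Mdl_NB2, TrainingSet_Dis);
% Level-2 model on the concatenated posteriors (N-by-2K), so each row
% still lines up with one target:
Mdl_Ens = fitcnb([P1, P2], TrainingTargets, 'Distribution', 'normal');
% One combined error estimate:
total_avg_score = resubLoss(Mdl_Ens)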

OrientDB create edge between two nodes with the same day of year

I'm interested in creating an edge (u,v) between two nodes of the same class in a graph if they share the same day of year and v.year = u.year+1.
Say I have vertices.csv:
id,date
A,2014-01-02
B,2015-01-02
C,2016-01-02
D,2013-06-01
E,2014-06-01
F,2016-06-01
The edge structure I'd like to see would be this:
A --> B --> C
D --> E
F
Let's set the vertex class to be "myVertex" and the edge class to be "myEdge". Is it possible to generate these edges using the SQL interface?
Based on this question, I started trying something like this:
BEGIN
LET source = SELECT FROM myVertex
LET target = SELECT FROM myVertex
LET edge = CREATE EDGE myEdge
    FROM $source
    TO (SELECT FROM $target
        WHERE $source.date.format('MM-dd') = $target.date.format('MM-dd')
          AND $source.date.format('yyyy').asInteger() = $target.date.format('yyyy').asInteger() - 1)
COMMIT
Unfortunately, this isn't correct. So I got less ambitious and wanted to see whether I could create edges based just on matching day-of-year:
BEGIN
LET source = SELECT FROM myVertex
LET target = SELECT FROM myVertex
LET edge = CREATE EDGE myEdge
    FROM $source
    TO (SELECT FROM $target
        WHERE $source.date.format('MM-dd') = $target.date.format('MM-dd'))
COMMIT
This still has errors... I'm sure it's something pretty simple for an experienced OrientDB user.
I thought about putting together a JavaScript function like Michela suggested on this question, but I'd prefer to stick to using the SQL commands as much as possible for now.
Help is greatly appreciated.
Other Stack Overflow References
How to print or log on function javascript OrientDB
I tried with an OSQL batch, but I think you can't get what you want.
With this OSQL batch
begin
let a = select #rid, $a1 as toAdd from test
    let $a1 = (select from test
               where date.format("MM") == $parent.$current.date.format("MM")
                 and date.format("dd") == $parent.$current.date.format("dd")
                 and #rid <> $parent.$current.#rid
                 and date.format("yyyy") == sum($parent.$current.date.format("yyyy").asInteger(), 1))
commit
return $a
I got this result, but the problem is that when you create the edge you cannot loop over the table obtained in the previous step.
I think the best solution is to use a JS server-side function.
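A rough, untested sketch of what such a function might look like. This assumes the OrientDB 2.x JS function environment (where orient.getGraph() is available and Java classes can be called) and the class/field names from the question:
var g = orient.getGraph();
var dayFmt = new java.text.SimpleDateFormat('MM-dd');
var yearFmt = new java.text.SimpleDateFormat('yyyy');
var verts = g.command('sql', 'select from myVertex');
for (var i = 0; i < verts.length; i++) {
  for (var j = 0; j < verts.length; j++) {
    var du = verts[i].getProperty('date');
    var dv = verts[j].getProperty('date');
    // Same day-of-year, and v exactly one year after u:
    if (dayFmt.format(du) == dayFmt.format(dv)
        && parseInt(yearFmt.format(dv)) == parseInt(yearFmt.format(du)) + 1) {
      g.addEdge(null, verts[i], verts[j], 'myEdge');
    }
  }
}
g.commit();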
Hope it helps.

Perform action on all fields of structure

I wonder whether it's possible to perform an action on all fields of a structure at once.
My scenario:
I have data from an eye tracker device. It is stored in a struct Data, and has the following fields:
Data.positionX
Data.positionY
Data.velocity
Data.acceleration
Each field contains a vector of integers. Suppose I want to delete sample number 10 from my data stream. I would have to do the following:
Data.positionX(10) = [];
Data.positionY(10) = [];
Data.velocity(10) = [];
Data.acceleration(10) = [];
How would I do this more efficiently?
Yes, use dynamic field names.
fields = fieldnames(Data);
for i = 1:length(fields)
    field = fields{i};
    Data.(field)(10) = [];
end
If your data is simple enough, it may be worth switching to a struct array, where you index the data directly instead of indexing its contents:
Data(10).positionX
Data(10).positionY
...
Then deleting a sample would be as simple as:
Data(10) = [];
Or alternatively, if you have a bunch of vectors you want to store together, you may be better off storing them in a matrix:
M = [positionX, positionY]; % and so on, possibly transposed
Then deleting a sample would be as simple as:
M(10,:) = [];
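For completeness, structfun can also apply the deletion to every field of the original struct in one call (a sketch, assuming every field is a plain vector):
% Remove element 10 from every field of the struct:
Data = structfun(@(v) v([1:9, 11:end]), Data, 'UniformOutput', false);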