merge two set of weka instances with different Attribute set - merge

i need to merge two instances(those are output of StrinToWordVector) with different set of attribute but overlapping and completely different instances in weka .is there any easy way to do it ?(in other word,i have two tfidf matrix with diferent doc and different word (but overlapped) and i want to have a tfidf matrix that is union of all of them)

You can use InputMappedClassifier, although there are two other options if you still have your documents in raw format, as discused here.

Please check if weka.join could help you, this is an extended weka.core.Instaces class with methods like innerJoin, leftJoin, fullJoin, update and union.

Related

Darknet - Is it possible to load multiple weight files?

I have several different .weight files that were outputted in training. The reason I did this, is I noticed the model trained better with fewer classes than if I combined all 35 together. Could it possible to loop through code and have multiple model.load_weights()?
Any help is appreciated!
Thanks
I don't see the code, but I can say, that you can try to create multiple class instances of the model class, each of them with their own weights and configs, and than run each of them in the way you want

When using Virtual Documents how can a template be used with different report parameters?

I'm using a virtual document structure generated by a script to create a document from EA, and I'm trying to use the same template fragment several times with different elements and different headings.
For example, I have an element which describes the input data to one program, and the output data to another program, so I can't really store the information in the element I'm documenting.
Where it is the input, I want one heading (and similar references within the template), and where it is output I want different values for the headings.
I've tried using the ReportTitle tagged value in the individual <model document> element, but this appears to be ignored and only the <report package> value is used throughout (which makes me wonder why they are there in the first place).
While I could create multiple templates all referring to the same fragment and hard-code the different headings, but that is messy, and as I already have fragments within fragments so it could result in a lot of almost identical templates and fragments. Variables that I can set for each <model document> would be much preferable.
Has anyone got a better approach than this? Thanks!
I don't think there's an easy solution.
If there is a way to determine based on the element, package, or diagram ID whether you should use one title or the other then you could use a script or SQL fragment to return the correct title.
If that is not the case I guess the only possibility is to hardcode the different titles in your templates.
In order to avoid too much duplication you could create a template with only the title and use that on a model document. Since you are generating the modeldocuments by a script anyway that doesn't cost any user time.

Determining canonical classes with text data

I have a unique problem and I'm not aware of any algorithm that can help me. Maybe someone on here does.
I have a dataset compiled from many different sources (teams). One field in particular is called "type". Here are some example values for type:
aple, apples, appls, ornge, fruits, orange, orange z, pear,
cauliflower, colifower, brocli, brocoli, leeks, veg, vegetables.
What I would like to be able to do is to group them together into e.g. fruits, vegetables, etc.
Put another way I have multiple spellings of various permutations of a parent level variable (fruits or vegetables in this example) and I need to be able to group them as best I can.
The only other potentially relevant feature of the data is the team that entered it, assuming some consistency in the way each team enters their data.
So, I have several million records of multiple spellings and short spellings (e.g. apple, appls) and I want to group them together in some way. In this example by fruits and vegetables.
Clustering would be challenging since each entry is most often 1 or two words, making it tricky to calculate a distance between terms.
Short of creating a massive lookup table created by a human (not likely with millions of rows), is there any approach I can take with this problem?
You will need to first solve the spelling problem, unless you have Google scale data that could allow you to learn fixing spelling with Google scale statistics.
Then you will still have the problem that "Apple" could be a fruit or a computer. Apple and "Granny Smith" will be completely different. You best guess at this second stage is something like word2vec trained on massive data. Then you get high dimensional word vectors, and can finally try to solve the clustering challenge, if you ever get that far with decent results. Good luck.

what is the best way to retrive information in a graph through has Step

I'm using titan graph db with tinkerpop plugin. What is the best way to retrieve a vertex using has step?
Assuming employeeId is a unique attribute which has a unique vertex centric index defined.
Is it through label
i.e g.V().has(label,'employee').has('employeeId','emp123')
g.V().has('employee','employeeId','emp123')
(or)
is it better to retrieve a vertex based on Unique properties directly?
i.e g.V().has('employeeId','emp123')
Which one of the two is the quickest and better way?
First you have 2 options to create the index:
mgmt.buildIndex('byEmployeeId', Vertex.class).addKey(employeeId).buildCompositeIndex()
mgmt.buildIndex('byEmployeeId', Vertex.class).addKey(employeeId).indexOnly(employee).buildCompositeIndex()
For option 1 it doesn't really matter which query you're going to use. For option 2 it's mandatory to use g.V().has('employee','employeeId','emp123').
Note that g.V().hasLabel('employee').has('employeeId','emp123') will NOT select all employees first. Titan is smart enough to apply those filter conditions, that can leverage an index, first.
One more thing I want to point out is this: The whole point of indexOnly() is to allow to share properties between different types of vertices. So instead of calling the property employeeId, you could call it uuid and also use it for employers, companies, etc:
mgmt.buildIndex('employeeById', Vertex.class).addKey(uuid).indexOnly(employee).buildCompositeIndex()
mgmt.buildIndex('employerById', Vertex.class).addKey(uuid).indexOnly(employer).buildCompositeIndex()
mgmt.buildIndex('companyById', Vertex.class).addKey(uuid).indexOnly(company).buildCompositeIndex()
Your queries will then always have this pattern: g.V().has('<label>','<prop-key>','<prop-value>'). This is in fact the only way to go in DSE Graph, since we got completely rid of global indexes that span across all types of vertices. At first I really didn't like this decision, but meanwhile I have to agree that this is so much cleaner.
The second option g.V().has('employeeId','emp123') is better as long as the property employeeId has been indexed for better performance.
This is because each step in a gremlin traversal acts a filter. So when you say:
g.V().has(label,'employee').has('employeeId','emp123')
You first go to all the vertices with the label employee and then from the employee vertices you find emp123.
With g.V().has('employeeId','emp123') a composite index allows you to go directly to the correct vertex.
Edit:
As Daniel has pointed out in his answer, Titan is actually smart enough to not visit all employees and leverages the index immediately. So in this case it appears there is little difference between the traversals. I personally favour using direct global indices without labels (i.e. the first traversal) but that is just a preference when using Titan, I like to keep steps and filters to a minimum.

Neo4j how can I merge two nodes in Java?

Any fast ways to merge two nodes into one, without traverse the properties and relatioships myself?
Any advice would be appreciated!
You have to implement that yourself - including conflict resolution if e.g. both nodes contains the same property with different values etc.
Neo4j does not offer any support for this (since this is probably not a very common use case).