Classification with a large volume of missing data

When building a model to classify whether a student will be admitted to a special program or not, the main features include:
Gender | Ethnicity | State | Zip code | Test score | Education | Job title | Current GPA | Admission
As the data is collected online, many features have a lot of missing values. The 'Test score' feature should be important to the admission decision, but it is missing for about 80% of the records, so imputation does not seem practical.
Should I keep it as a feature and use an algorithm that is less sensitive to missing data (EM, a Bayesian network, SVM, etc.), or simply remove this feature when building the model? Any suggestions?

You should drop the feature. The test scores cannot be averaged out with just 20% of the scores present, and random values drawn from a distribution can't be substituted either, because they are test scores.
You could also try building a model on only the rows that do contain test scores and see whether it is effective.
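If you want to check that second option empirically, a quick experiment is to compare a model trained without 'Test score' against one trained only on the subset of rows where it is present. A minimal sketch, assuming pandas/scikit-learn and hypothetical file and column names:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv("admissions.csv")  # hypothetical file with the columns listed above
other_features = ["gender", "ethnicity", "state", "zip_code",
                  "education", "job_title", "current_gpa"]

# Baseline: drop 'test_score' entirely (assumes the other features are complete).
X_all = pd.get_dummies(df[other_features])
y_all = df["admission"]
base_score = cross_val_score(RandomForestClassifier(), X_all, y_all, cv=5).mean()

# Alternative: a second model using only the ~20% of rows where 'test_score' exists.
has_score = df["test_score"].notna()
X_sub = pd.get_dummies(df.loc[has_score, other_features + ["test_score"]])
y_sub = df.loc[has_score, "admission"]
sub_score = cross_val_score(RandomForestClassifier(), X_sub, y_sub, cv=5).mean()

print(f"without test_score: {base_score:.3f}  |  rows with test_score: {sub_score:.3f}")
```

If the subset model does not clearly beat the baseline, that supports dropping the feature.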

Related

Imbalanced multiclass classification using company names

I have the classification scenario below in which I'm getting very low F1, precision, recall, and other metrics.
Target is multiclass (about ~200 classes) and is highly imbalanced
I only use company names as input (mostly 1-2 words, with a maximum of 8 words), no other fields (like description, etc.)
Training data ~ 100k+ records
Preprocessing: removal of numeric characters, special characters, and stopwords
I have very low resources for processing (that's why when I try to use oversampling techniques like SMOTE, distance_smote for multiclass, etc., I always get memory errors)
Tried using different vectorization/embedding/tokenizer approaches like word2vec, tfidf, fasttext, bert, roberta, etc. but to no avail
Tried using (and fine-tuning) different algorithms (networks, SVM, trees, boosting, etc.) but also getting low scores.
I also did cost-sensitive learning (using class weights) but it only decreased my scores.
Tried all options that I know but scores are not increasing. Can you recommend other options here, or do you think any part of the process may be wrong or should be discarded? Thank you!
(Distribution of target labels and sample observations not shown.)
There is essentially no way to know that 'Exxon' is an oil company, and 'Apple' a computer company, and 'McDonalds' a fast-food chain, just from their company names.
Even if you have a list of every other company in the world, by name and type, that's not enough to make the deduction for these last 3. Only other outside info – like a few sentences about them, or other data – could classify them.
In fact, while company names sometimes describe their exact field-of-commerce, often they're totally arbitrary, as that gives them more freedom to range over many products/services, or create their own unique associations with the name (aka branding).
So I strongly suspect your (unshown) names & (unshown) labels are just too arbitrary for the data you're using to get very good at the task you're attempting.
Is there a real-world situation where someone will only have a company name – no other info, or research options – and benefit from correctly guessing the class? If so, more specifics about the situation might help generate more specific tactical recommendations. But mainly such recommendations will be: get richer data about the targets of the classification.
You might squeeze a little more out of vague trends in corporate naming via better preprocessing/feature-extraction. You may want to keep numbers, special-characters, & punctuation in some form, as they might include extra slight hints. Using subwords (character n-grams) might also reveal some shared word-roots used even in made-up names.
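As a concrete illustration of the character n-gram suggestion, a minimal scikit-learn sketch might look like the following. The names and labels are placeholders, and class_weight is optional given that class weights hurt your scores:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Placeholder data -- substitute your ~100k company names and ~200 labels.
company_names = ["Exxon Mobil", "Apple Inc", "McDonald's", "3M Co"]
labels = ["oil", "tech", "food", "manufacturing"]

model = Pipeline([
    # char_wb n-grams keep digits/punctuation and pick up shared word roots
    ("tfidf", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4), lowercase=True)),
    # a linear SVM stays memory-friendly at this scale; class_weight is optional
    ("clf", LinearSVC(class_weight="balanced")),
])

model.fit(company_names, labels)
print(model.predict(["Exxon Chemical", "Apple Store"]))
```

Even so, expect only marginal gains without richer data about each company.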

CQRS (event sourcing): Projections with multiple aggregates

I have a question regarding projections involving multiple aggregates on a CQRS architecture.
For example's sake, suppose I have two aggregates, WorkItem and Developer, and that the following events happen sequentially (but not immediately):
WorkItemCreated (workItemId)
WorkItemTitleChanged (workItemId, title)
DeveloperCreated (developerId)
DeveloperNameChanged (developerId, name)
WorkItemAssigned (workItemId, developerId)
I wish to create a projection which is an "inner join" of developer-workitem:
| WorkItemId | DeveloperId | Title | DeveloperName | ... |
|------------|-------------|--------|---------------|-----|
| 1 | 1 | FixBug | John Doe | ... |
The way I am doing my projections is incremental, meaning I load the saved projections from the database and apply the remaining events as they come.
My problem is that the event responsible for creating a row in the projection table is WorkItemAssigned. However, that event does not carry the required information from previous events (work item title, developer name, etc.).
In order to have the required information by the time WorkItemAssigned arrives, I have to load all events from the event store and keep in-memory state for all WorkItems and Developers.
Sure, I could have a projection for WorkItem, another for Developer, and query them to retrieve their latest states. But it seems like a lot of work; if I am to create projections for each aggregate separately, I might as well create a database view to inner-join them (in fact, that is what I am doing).
I am not doing all this by hand; I am currently using a good framework called EventFlow, but it doesn't point me to an answer to this question.
This is a question on the fundamentals of CQRS, and I feel I am missing something here.
I don't think you are missing anything. Projecting read models in an event-sourced system presents a different set of problems than querying from a relational model. The problems are not necessarily easier or harder to solve; they are just different.
The good news is that you have a lot of choices. Event Sourcing allows you to project data in any imaginable way, so you can decide on a solution that is most suitable for each individual projection. I guess the "bad" news (I would argue it's not bad news) is that the solution to the problem is not the same every time as it is with a relational system, which is to construct a query using JOINs.
You've already identified a few possible solutions:
Use a relational model as one of your read models
When a certain type of event comes in, re-query the streams that hold the data you need and use them to project on demand
You could also simply hold some data in an interim state (in memory, a document database, the file system, etc.) that allows you to look up the data and project it when needed. So keep lists of updated WorkItems and Developers where they can be read and used whenever a WorkItemAssigned event comes in.
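Here is a rough, framework-agnostic sketch of that interim-state option (plain Python rather than EventFlow, with illustrative event and field names): keep small lookup tables for WorkItems and Developers so the WorkItemAssigned handler can emit the joined row even though the event itself carries neither the title nor the name.

```python
# work_items / developers are the interim lookup state; assignments is the
# "inner join" read model being projected.
work_items = {}    # workItemId -> {"title": ...}
developers = {}    # developerId -> {"name": ...}
assignments = []   # rows of the WorkItem-Developer projection

def apply(event):
    kind, data = event["type"], event["data"]
    if kind == "WorkItemCreated":
        work_items[data["workItemId"]] = {"title": None}
    elif kind == "WorkItemTitleChanged":
        work_items[data["workItemId"]]["title"] = data["title"]
    elif kind == "DeveloperCreated":
        developers[data["developerId"]] = {"name": None}
    elif kind == "DeveloperNameChanged":
        developers[data["developerId"]]["name"] = data["name"]
    elif kind == "WorkItemAssigned":
        # The interim state supplies the fields this event does not carry.
        assignments.append({
            "workItemId": data["workItemId"],
            "developerId": data["developerId"],
            "title": work_items[data["workItemId"]]["title"],
            "developerName": developers[data["developerId"]]["name"],
        })

# Replaying the event sequence from the question produces the joined row:
for e in [
    {"type": "WorkItemCreated", "data": {"workItemId": 1}},
    {"type": "WorkItemTitleChanged", "data": {"workItemId": 1, "title": "FixBug"}},
    {"type": "DeveloperCreated", "data": {"developerId": 1}},
    {"type": "DeveloperNameChanged", "data": {"developerId": 1, "name": "John Doe"}},
    {"type": "WorkItemAssigned", "data": {"workItemId": 1, "developerId": 1}},
]:
    apply(e)

print(assignments)
```

Whether that interim state lives in memory, a document store, or a table is an implementation detail; the point is that the projection owns whatever lookup data it needs.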
I would say creating a relational database as an interim or permanent read model is a perfectly viable way of solving the problem, assuming you are not trying to achieve massive scalability.

Comparison of db strategies for polymorphic many to many: RDB vs. NoSQL

I'm choosing a database approach, and I'm still at the stage where the decision is at a very high level; I'm trying as hard as possible to fit my problem into Postgres, for its enormous maturity and feature set, but I'm not at all familiar with this level of SQL complexity, and wonder if the core relationship I have in mind might just be better expressed in a different model. (I've only found other answers that ask about this within a given framework.)
The driving issue (within a larger setup) is the ability to associate Features (e.g. large, weight, type, but a VARYING set of these), that have been assigned by a specific Classifier (model), with Things. These don't just show that a Thing IS 'large,' but that it was ASSESSED for 'largeness' by a particular Classifier. And, most difficult within an RDBMS, the VALUE of 'large' might be binary, while 'weight' is an integer, 'type' is a category, etc. In other words,
Thing1 --> large=true    (says model 1)
       |-> weight=3      (says model 2)
       |-> type='great'  (says model 3)
Thing2 --> large=false   (says model 1)
       (not assessed for 'weight')
       |-> type='lousy'  (says model 4)
In a nutshell, storing (thing, feature, value, classifier) tuples, where Things and Features are many to many, and values will be of different types, for different features.
Approach 1: Relational
There is a table of Things. There is also a PARENT class, Features, which (or maybe its children?) has a many-to-many relationship with Things. Each particular feature, Large, Weight, Type, is a CHILD class of 'Feature', and each Child >-< Things junction table also holds the (consistently typed) values of the Things for that particular feature. In other words,
ThingTypeValues
-------------------
thing.id | type.id | 'great'
thing.id | type.id | 'lousy'
...
ThingLargeValues
---------------------
thing.id | large.id | true
thing.id | large.id | false
...
This allows me to get all 'thing.features,' because they share Feature as a parent, and still query the full (mixed-type) description of the Thing, while having consistent types within tables; it also allows me to (better) avoid the problem of accidentally having a 'great' and a 'grreat' and a 'Great,' by making each Feature-child be its own table (instead of just a string, within an object, or an ever-being-updated ENUM), where I can easily check to see what existing options there are for labels, by treating each Feature as a well-respected, separate-identity, thing.
As a last point, if there were only one Classifier (the thing APPLYING Features), then there could be a single table, with every column being a Feature and every cell holding a Value, or NULL to indicate that the Thing wasn't assessed for that Feature -- but because there will actually be a very large number of Models, each giving their opinion, this sounds like a pretty ugly strategy: each one would have its own enormous, mostly empty, and always growing (as we add more Features) table.
HOWEVER, using SQLAlchemy for example (I'm by far most familiar with Python), I now have to use association objects and AbstractConcreteBases, and it all looks quite nasty to my untrained eye.
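For what it's worth, the association-object route can look less nasty than expected once written out. Below is a rough SQLAlchemy sketch of just one of the per-feature junction tables (the 'large' one), with hypothetical class and column names and the Classifier carried alongside the typed value:

```python
from sqlalchemy import Column, Integer, Boolean, String, ForeignKey
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Thing(Base):
    __tablename__ = "things"
    id = Column(Integer, primary_key=True)
    large_values = relationship("ThingLargeValue", back_populates="thing")

class Classifier(Base):
    __tablename__ = "classifiers"
    id = Column(Integer, primary_key=True)
    name = Column(String)

class ThingLargeValue(Base):
    """Association object: Thing >-< 'large' feature, consistently typed as Boolean."""
    __tablename__ = "thing_large_values"
    thing_id = Column(Integer, ForeignKey("things.id"), primary_key=True)
    classifier_id = Column(Integer, ForeignKey("classifiers.id"), primary_key=True)
    value = Column(Boolean, nullable=False)
    thing = relationship("Thing", back_populates="large_values")
    classifier = relationship("Classifier")
```

Each additional Feature child (Weight, Type, ...) would get its own similarly shaped association class with its own value type.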
Approach 2: NoSQL
Suddenly mixed types, which appeared to be the biggest problem, are no longer a problem! I can have a set of features, each with a type and a value, and associate each of them with an answer. If I want to be careful not to fat-finger my categories, I can have a function that validates them, or a process that checks against existing Features before adding a new one. These sound more error-prone, certainly, but I'm asking this because I don't know how to evaluate the tradeoff, how to approach the best solution in EITHER framework, or whether something entirely different is a much better idea anyway.
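To make the comparison concrete, one Thing under the NoSQL approach might be stored as a single document along these lines (a sketch with hypothetical field names), with mixed-type values living in one list:

```python
# One Thing stored as a single document; the 'assessments' list holds
# mixed-type values together with the Classifier that produced each one.
thing_doc = {
    "_id": "thing1",
    "assessments": [
        {"feature": "large",  "value": True,    "classifier": "model1"},
        {"feature": "weight", "value": 3,       "classifier": "model2"},
        {"feature": "type",   "value": "great", "classifier": "model3"},
    ],
}
```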
Opinions, suggestions, technologies, philosophies, and donations, all welcome.

Watson Retrieve and Rank / Discovery Service always returns table of contents with high(est) score

Background:
I'm using Watson Retrieve and Rank or the Discovery Service to retrieve information from user manuals. I performed the training with an example washing machine manual in PDF format.
My goal is to receive the best passages from the document where a specific natural language string occurs (like "Positioning the drain hose"), which is working in general.
My problem is that the table of contents is almost always the passage with the highest score.
Therefore the first results are just the table of contents instead of the relevant text passage (see example results).
"Wrong" result (table of contents):
Unpacking the washing machine ----------------------------------------------------2 Overview of the washing machine --------------------------------------------------2 Selecting a location -------------------------------------------------------------------- 3 Adjusting the leveling feet ------------------------------------------------------------3 Removing the shipping bolts --------------------------------------------------------3 Connecting the water supply hose ------------------------------------------------- 3 Positioning the drain hose ----------------------------------------------------------- 4 Plugging in the machine
"correct" result
Positioning the drain hose The end of the drain hose may be positioned in three ways: Over the edge of a sink The drain hose must be placed at a height of between 60 and 90 cm. To keep the drain hose spout bent, use the supplied plastic hose
Possible solutions:
Ignore the table of contents during the training process
Use an offset parameter to e.g. ignore the first 3 results
Find out whether a result is part of the table of contents and ignore it if so
These approaches are static and not applicable to multiple documents with varying structures (table of contents at the beginning, at the end, no table of contents, ...).
Does anyone have an idea of how to approach this better?
At this time, passage retrieval results are not affected by relevancy training. Since passage retrieval always searches the entire corpus, unfortunately the only reliable way of keeping table-of-contents passages out of the results is to remove the table of contents from the documents.
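If removing the table of contents by hand is impractical, one option is a preprocessing heuristic that strips TOC-looking lines before the documents are uploaded. This is not a Watson API feature, just a plain-Python sketch, and the pattern will need tuning per manual:

```python
import re

# Lines of the form "Section title ---------- 12" (long runs of dashes/dots
# followed by a page number) are treated as table-of-contents entries.
TOC_LINE = re.compile(r"[-.\s]{5,}\s*\d{1,3}\s*$")

def strip_toc_lines(text: str) -> str:
    kept = [line for line in text.splitlines() if not TOC_LINE.search(line)]
    return "\n".join(kept)
```

It will not catch every layout (e.g. a TOC without leader dots), but it avoids hard-coding offsets per document.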

How to use cross reference table in a database modeled using Data Vault principles?

I have a Person Satellite with a Gender attribute. From source systems the values for this attribute can be: F, M, FEMALE, or MALE. Which of the two following approaches is the correct one for Data Vault modeling?
Store data in Gender as it comes from the sources, and standardize the values to FEMALE and MALE only in the Business Vault or Data Marts
Create a cross-reference table mapping F to FEMALE and M to MALE, and, while loading the Person Satellite, transform F to FEMALE and M to MALE using the cross-reference table.
I'm using Amazon Redshift that supports column compression.
I emailed Daniel Linstedt, creator of the Data Vault modeling method, to ask him the same question. His answer:
"I typically store it as it comes in, THEN translate it on the way to the Business DV.  This way, if the business ever changes it's mind, we can re-write the translation rule without affecting history.  But more than that, I've seen source systems that deliver values outside the boundaries of what's acceptable.  Do not try to translate on the way in to the Raw DV, to do so would destroy auditability."
The Data Vault concept is useful when you have very complex business logic that changes over time, but the F/FEMALE and M/MALE mapping is pretty simple and stable logic. Having a cross-reference table would just overcomplicate things here. I would simply standardize the values to F/M and use a char(1) column without compression.
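If you do go with option 2, or with Linstedt's "translate on the way to the Business DV", the cross-reference itself can be as small as a lookup applied in the loading job. A minimal sketch with hypothetical names:

```python
# Cross-reference applied only in the Business Vault / Data Mart load, keeping
# the raw source value untouched in the Raw Vault (names are hypothetical).
GENDER_XREF = {"F": "FEMALE", "M": "MALE", "FEMALE": "FEMALE", "MALE": "MALE"}

def standardize_gender(raw_value: str) -> str:
    # Values outside the expected set stay visible instead of being silently mapped.
    return GENDER_XREF.get(raw_value.strip().upper(), "UNKNOWN")

print(standardize_gender("f"), standardize_gender("MALE"))  # FEMALE MALE
```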