Did anyone had this issue of genes from same clusters showing on different contigs when you run antismash - cluster-analysis

I had assembled genomes by SPAdes and I am pretty confident that the assemblies are good per the stats report I have. I am looking into secondary metabolite clusters in these genomes and finds that when I run antismash, the genes from a particular cluster appears on different contigs.
This makes me a little worried if my assemly is not good enough
We had sequenced these genomes using illumina so they are short reads.
I tried assemling using a different tool and the stats came out pretty similar. Now I dont know if it has something to do with assemly or antismash. I work on Psuedomonads

Related

Does OptaPlanner have a "built-in" way to perform multi-unit score normalization?

At the moment, my problem has four metrics. Each of these measures something entirely different (each has different units, a different range, etc.) and each is weighted externally. I am using Drools for scoring.
I only have only one score level (SimpleLongScore) and I have to find a way to appropriately combine the individual scores of these metrics onto one long value
The most significant problem at the moment is that the range of values for the metrics can be wildly different.
So if, for example, after a move the score of a metric with a small possible range improves by, say, 10%, that could be completely dwarfed by an alternate move which improves the metric with a larger range's score by only 1% because OptaPlanner only considers the actual score value rather than the possible range of values and how changes affect them proportionally (to my knowledge).
So, is there a way to handle this cleanly which is already part of OptaPlanner that I cannot find?
Is the only feasible solution to implement Pareto scoring? Because that seems like a hack-y nightmare.
So far I have code/math to compute the best-possible and worst-possible scores for a metric that I access from within the Drools and then I can compute where in that range a move puts us, but this also feel quite hack-y and will cause issues with incremental scoring if we want to scale non-linearly within that range.
I keep coming back to thinking I should just just bite the bullet and implement Pareto scoring.
Thanks!
Take a look at #ConstraintConfiguration and #ConstraintWeight in the docs.
Also take a look at the chapter "explaning the score", which can exactly tell you which constraint had which score impact on the best solution found.
If, however, you need pareto optimization, so you need multiple best solutions that don't dominate each other, then know that OptaPlanner doesn't support that yet, but I know of 2 cases that implemented it in OptaPlanner by hacking BestSolutionRecaller.
That being said, 99% of the cases that think of pareto optimization, are 100% happy with #ConstraintWeight instead, because users don't want multiple best solutions (except during simulations), they just want one in production.

How to verify that model output & observed data distribution are similar?

Looking for advice on how to determine wether my model output data distribution is similar (and if so, then how similar) to the observed datasets distribution.
Basically I have a GBM model with mean reversion that provides seemingly good results, when I compare its distribution to observed data. You can see their PDFs side-by-side in the attached picture.
PDF of observed and model data
Both datasets are huge (~6 million datapoint), and I start to suspect that this is part of the problem...
I am looking for a way to verify that the datasets distributions are similar. I tried the two-sample Kolmogorov-Smirnov test, two-sample t-test, but for some reason both of them rejected the Null hypothesis (always, even with different Alphas). In some threads I've read that these tests are unreliable, when applied to huge datasets, but there wasn't a consensus about this.
I am using Matlab currently, but I am open to others if necessary.
Any help would be appreciated! I primarily looking for a hypothesis test for verification, but if you have a different idea don't hold it back!

Factors that Impact Translation Time

I have run across issues in developing models where the translation time (simulates quickly but takes far too long to translate) has become a serious issue and could use some insight so I can look into resolving this.
So the question is:
What are some of the primary factors that impact the translation time of a model and ideas to address the issue?
For example, things that may have an impact:
for loops vs a vectorized method - a basic model testing this didn't seem to impact anything
using input variables vs parameters
impact of annotations (e.g., Evaluate=true)
or tough luck, this is tool dependent (Dymola, OMEdit, etc.) :(
use of many connect() - this seems to be a factor (perhaps primary) as it forces translater to do all the heavy lifting
Any insight is greatly appreciated.
Clearly the answer to this question if naturally open ended. There are many things to consider when computation times may be a factor.
For distributed models (e.g., finite difference) the use of simple models and then using connect equations to link them in the appropriate order is not the best way to produce the models. Experience has shown that this method significantly increases the translation time to unbearable lengths. It is better to create distributed models in the same approach that is used the MSL Dynamic pipe (not exactly like it but similar).
Changing the approach as described is significantly faster in translational time (orders of magnitude for larger models, >~100,000 equations) than using connect statements as the number of distributed elements increases to larger numbers. This was tested using Dymola 2017 and 2017FD01.
Some related materials pointed out by others that may be useful for more information have been included below:
https://modelica.org/events/modelica2011/Proceedings/pages/papers/07_1_ID_183_a_fv.pdf
Scalable Test Suite : https://dx.doi.org/10.3384/ecp15118459

Grouping similar words (bad , worse )

I know there are ways to find synonyms either by using NLTK/pywordnet or Pattern package in python but it isn't solving my problem.
If there are words like
bad,worst,poor
bag,baggage
lost,lose,misplace
I am not able to capture them. Can anyone suggest me a possible way?
There have been numerous research in this area in past 20 years. Yes computers don't understand language but we can train them to find similarity or difference in two words with the help of some manual effort.
Approaches may be:
Based on manually curated datasets that contain how words in a language are related to each other.
Based on statistical or probabilistic measures of words appearing in a corpus.
Method 1:
Try Wordnet. It is a human-curated network of words which preserves the relationship between words according to human understanding. In short, it is a graph with nodes as something called 'synsets' and edges as relations between them. So any two words which are very close to each other are close in meaning. Words that fall within the same synset might mean exactly the same. Bag and Baggage are close - which you can find either by iteratively exploring node-to-node in a breadth first style - like starting with 'baggage', exploring its neighbors in an attempt to find 'baggage'. You'll have to limit this search upto a small number of iterations for any practical application. Another style is starting a random walk from a node and trying to reach the other node within a number of tries and distance. It you reach baggage from bag say, 500 times out of 1000 within 10 moves, you can be pretty sure that they are very similar to each other. Random walk is more helpful in much larger and complex graphs.
There are many other similar resources online.
Method 2:
Word2Vec. Hard to explain it here but it works by creating a vector of a user's suggested number of dimensions based on its context in the text. There has been an idea for two decades that words in similar context mean the same. e.g. I'm gonna check out my bags and I'm gonna check out my baggage both might appear in text. You can read the paper for explanation (link in the end).
So you can train a Word2Vec model over a large amount of corpus. In the end, you will be able to get 'vector' for each word. You do not need to understand the significance of this vector. You can this vector representation to find similarity or difference between words, or generate synonyms of any word. The idea is that words which are similar to each other have vectors close to each other.
Word2vec came up two years ago and immediately became the 'thing-to-use' in most of NLP applications. The quality of this approach depends on amount and quality of your data. Generally Wikipedia dump is considered good training data for training as it contains articles about almost everything that makes sense. You can easily find ready-to-use models trained on Wikipedia online.
A tiny example from Radim's website:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
First example tells you the closest word (topn=1) to words woman and king but meanwhile also most away from the word man. The answer is queen.. Second example is odd one out. Third one tells you how similar the two words are, in your corpus.
Easy to use tool for Word2vec :
https://radimrehurek.com/gensim/models/word2vec.html
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf (Warning : Lots of Maths Ahead)

Doubts about clustering methods for tweets

I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get introduced into this area by doing some tests, and as a first experiment I'd like to create clusters on tweets based on content similarity. The basic idea for the experiment would be storing tweets on a database and periodically calculate the clustering (ie. using a cron job). Please note that the database would obtain new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Perform Jaccard similarity and compare with each of the existing clusters. If result > threshold then it would be assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
Now I see some problems with this basic approach. Let's put aside computational cost, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4, T10 which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold but similarity with C1 would be higher. In that case Tn should go to C1. This could be done brute force as well to assign the tweet to the cluster with maximum similarity.
And last of all, it's the computational issue. I've been reading a bit about minhash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend me which approach should I aim to? I read some mentions about LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guiding.
From what I'm reading a tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case idea would be being able to cluster similar tweets into groups without any previous information. For example, tweets from Foursquare ("I'm checking in ..." which are similar to each other would be one case, or "My klout score is ..."). Also note that I'd like this to be language independent, so I'm not interested in having to deal with specific language issues.
It looks like to me that you are trying to address two different problems in one, i.e. "syntactic" and "semantic" clustering. They are quite different problems, expecially if you are in the realm of short-text analysis (and Twitter is the king of short-text analysis, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your example of Foursquare fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using a N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find significant tweets that have as low as 3-4 features). Normalization is also an important factor here, removing RT tag, mentions, hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it won't likely work if you try to aggregate random sample of tweets, due to the fact that they, usually, carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (i.e. the one matching a keyword, or an hashtag). LSA could be useful here, while it is useless for syntactic clusters.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is the fact that you need online clustering, and not static clustering. The classical clustering algorithms that would work well in the static case (like hierarchical clustering, or union find) aren't really suited for online clustering , unless you redo the clustering from scratch every time a new tweet gets added to your database. "Averaging" the clusters to add new elements isn't a great solution according to my experience, because you need to retain all the information of every cluster member to update the "average" every time new data gets in. Also, algorithms like hierarchical clustering and union find work well because they can join pre-existant clusters if a link of similarity is found between them, and they don't simply assign a new element to the "closest" cluster, which is what you suggested to do in your post.
Algorithms like MinHash (or SimHash) are indeed more suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain threshold of similarity (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(nlog(n)) in time). It is, though, quadratic in space, therefore a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query the new ones against that index. You can then form a similarity graph, by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
However don't expect useful results from clustering tweets. Tweet data is just to much noise. Most tweets have just a few words, too little to define a good similarity. On the other hand, you have the various retweets that are near duplicates - but trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?