What is RNA in scala? - scala

I am reading about scala collection architecture. RNA term is mentioned here. Can somebody explain what it is?

From Wikipedia:
Ribonucleic acid (RNA) is a polymeric molecule implicated in various biological roles in coding, decoding, regulation, and expression of genes. RNA and DNA are nucleic acids, and, along with proteins and carbohydrates, constitute the three major macromolecules essential for all known forms of life. [...] Cellular organisms use messenger RNA (mRNA) to convey genetic information (using the letters G, U, A, and C to denote the nitrogenous bases guanine, uracil, adenine, and cytosine) that directs synthesis of specific proteins.
Of course this has nothing to do with Scala or programming as such - it is merely used as a real-life example in the tutorial you're reading.

It's an example problem domain used in that article. From there:
In the next few sections you’ll be walked through two examples that do this, namely sequences of RNA bases and prefix maps implemented with Patricia tries.
If you want to learn more about RNA, check out its Wikipedia page.

Related

How to preprocess text for embedding?

In the traditional "one-hot" representation of words as vectors you have a vector of the same dimension as the cardinality of your vocabulary. To reduce dimensionality usually stopwords are removed, as well as applying stemming, lemmatizing, etc. to normalize the features you want to perform some NLP task on.
I'm having trouble understanding whether/how to preprocess text to be embedded (e.g. word2vec). My goal is to use these word embeddings as features for a NN to classify texts into topic A, not topic A, and then perform event extraction on them on documents of topic A (using a second NN).
My first instinct is to preprocess removing stopwords, lemmatizing stemming, etc. But as I learn about NN a bit more I realize that applied to natural language, the CBOW and skip-gram models would in fact require the whole set of words to be present --to be able to predict a word from context one would need to know the actual context, not a reduced form of the context after normalizing... right?). The actual sequence of POS tags seems to be key for a human-feeling prediction of words.
I've found some guidance online but I'm still curious to know what the community here thinks:
Are there any recent commonly accepted best practices regarding punctuation, stemming, lemmatizing, stopwords, numbers, lowercase etc?
If so, what are they? Is it better in general to process as little as possible, or more on the heavier side to normalize the text? Is there a trade-off?
My thoughts:
It is better to remove punctuation (but e.g. in Spanish don't remove the accents because the do convey contextual information), change written numbers to numeric, do not lowercase everything (useful for entity extraction), no stemming, no lemmatizing.
Does this sound right?
I've been working on this problem myself for some time. I totally agree with the other answers, that it really depends on your problem and you must match your input to the output that you expect.
I found that for certain tasks like sentiment analysis it's OK to remove lot's of nuances by preprocessing, but e.g. for text generation, it is quite essential to keep everything.
I'm currently working on generating Latin text and therefore I need to keep quite a lot of structure in the data.
I found a very interesting paper doing some analysis on that topic, but it covers only a small area. However, it might give you some more hints:
On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis
by Jose Camacho-Collados and Mohammad Taher Pilehvar
https://arxiv.org/pdf/1707.01780.pdf
Here is a quote from their conclusion:
"Our evaluation highlights the importance of being consistent in the preprocessing strategy employed across training and evaluation data. In general a simple tokenized corpus works equally or better than more complex preprocessing techniques such as lemmatization or multiword grouping, except for a dataset corresponding to a specialized domain, like health, in which sole tokenization performs poorly. Addi- tionally, word embeddings trained on multiword- grouped corpora perform surprisingly well when applied to simple tokenized datasets."
So many questions. The answer to all of them is probably "depends". It needs to be considered the classes you are trying to predict and the kind of documents you have. It's not the same to try to predict authorship (then you definitely need to keep all kinds of punctuation and case so stylometry will work) than sentiment analysis (where you can get rid of almost everything but have to pay special attention to things like negations).
I would say apply the same preprocessing to both ends. The surface forms are your link so you can't normalise in different ways. I do agree with the point Joseph Valls makes, but my impression is that most embeddings are trained in a generic rather than a specific manner. What I mean is that the Google News embeddings perform quite well on various different tasks and I don't think they had some fancy preprocessing. Getting enough data tends to be more important. All that being said -- it still depends :-)

Can RNA assisted protein folding be performed computationally?

I am looking for tools/ softwares to perform protein folding in presence of its RNA partner. Part of my protein is disordered and I suspect that it will take up structure while interacting with its RNA partner. Till now I did not find any literature where computational tools like MD has been implemented to check for such folding.
If anyone has come across any literature or have performed such studies, can you kindly help me how to proceed with it?
Thanking You,
Dolly

Boolean expression simplification

I am trying to simplify a Boolean expression with exactly 39 inputs, and about 500 million - 800million terms (as in that many and/not/or statements).
A perfect simplification is not needed, but a good one would be nice.
I am aware of the K-maps , Quine–McCluskey, Espresso algorithms. However I am also aware that these mechanisms would take way too long to simplify a circuit of this size based on what I have read
I would need to simplify this expression as much as possible within a 24 hour period.
After google searching, I find it difficult to find any resources for attempting to simplify a machine of quite this magnitude! Any resources out there or a library out there that can attempt to at least simplify this to some extent within a 24 time period?
A greedy heuristic Simplify is described in the somewhat dated book
Robert K. Brayton , Gary D. Hachtel , C. McMullen , Alberto Sangiovanni-Vincentelli
Logic Minimization Algorithms for VLSI Synthesis
You can find the chapter online.
Simplify is based on the unate paradigm. In divide-and-conquer style, it recursively applies Shannon's expansion theorem to split the function into smaller sub-functions. The heuristic rule is to split by the most binate variable first, i.e. the variable which separates the largest number of terms.
A second approach could be to use graph partitioning tools like METIS to split the terms into independent (or at least loosely related) subsets. But I am not aware that this has been tried sucessfully for logic synthesis tasks. My favorite search engine is sceptical and does not return any hits.
A more recent algorithm based on Binary Decision Diagrams was published in
Olivier Coudert: Doing Two-Level Logic Minimization 100 Times Faster
The paper lists examples with very high number of terms similar to your task at hand.
A somewhat related simplification technique BDD Sweeping as described in A Study of Sweeping Algorithms in the Context of Model Checking.
This is a duplicate question. See https://stackoverflow.com/a/60535990/1531728 for resources about logic optimization, or simplication of boolean expressions.

Grouping similar words (bad , worse )

I know there are ways to find synonyms either by using NLTK/pywordnet or Pattern package in python but it isn't solving my problem.
If there are words like
bad,worst,poor
bag,baggage
lost,lose,misplace
I am not able to capture them. Can anyone suggest me a possible way?
There have been numerous research in this area in past 20 years. Yes computers don't understand language but we can train them to find similarity or difference in two words with the help of some manual effort.
Approaches may be:
Based on manually curated datasets that contain how words in a language are related to each other.
Based on statistical or probabilistic measures of words appearing in a corpus.
Method 1:
Try Wordnet. It is a human-curated network of words which preserves the relationship between words according to human understanding. In short, it is a graph with nodes as something called 'synsets' and edges as relations between them. So any two words which are very close to each other are close in meaning. Words that fall within the same synset might mean exactly the same. Bag and Baggage are close - which you can find either by iteratively exploring node-to-node in a breadth first style - like starting with 'baggage', exploring its neighbors in an attempt to find 'baggage'. You'll have to limit this search upto a small number of iterations for any practical application. Another style is starting a random walk from a node and trying to reach the other node within a number of tries and distance. It you reach baggage from bag say, 500 times out of 1000 within 10 moves, you can be pretty sure that they are very similar to each other. Random walk is more helpful in much larger and complex graphs.
There are many other similar resources online.
Method 2:
Word2Vec. Hard to explain it here but it works by creating a vector of a user's suggested number of dimensions based on its context in the text. There has been an idea for two decades that words in similar context mean the same. e.g. I'm gonna check out my bags and I'm gonna check out my baggage both might appear in text. You can read the paper for explanation (link in the end).
So you can train a Word2Vec model over a large amount of corpus. In the end, you will be able to get 'vector' for each word. You do not need to understand the significance of this vector. You can this vector representation to find similarity or difference between words, or generate synonyms of any word. The idea is that words which are similar to each other have vectors close to each other.
Word2vec came up two years ago and immediately became the 'thing-to-use' in most of NLP applications. The quality of this approach depends on amount and quality of your data. Generally Wikipedia dump is considered good training data for training as it contains articles about almost everything that makes sense. You can easily find ready-to-use models trained on Wikipedia online.
A tiny example from Radim's website:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
First example tells you the closest word (topn=1) to words woman and king but meanwhile also most away from the word man. The answer is queen.. Second example is odd one out. Third one tells you how similar the two words are, in your corpus.
Easy to use tool for Word2vec :
https://radimrehurek.com/gensim/models/word2vec.html
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf (Warning : Lots of Maths Ahead)

Using Conditional Random Fields for Named Entity Recognition

What is Conditional Random Field?
How does exactly Conditional Random Field identify proper names as person, organization, or place in a structured or unstructured text?
For example: This product is ordered by StackOverFlow Inc.
What does Conditional Random Field do to identify StackOverFlow Inc. as an organization?
A CRF is a discriminative, batch, tagging model, in the same general family as a Maximum Entropy Markov model.
A full explanation is book-length.
A short explanation is as follows:
Humans annotate 200-500K words of text, marking the entities.
Humans select a set of features that they hope indicate entities. Things like capitalization, or whether the word was seen in the training set with a tag.
A training procedure counts all the occurrences of the features.
The meat of the CRF algorithm search the space of all possible models that fit the counts to find a pretty good one.
At runtime, a decoder (probably a Viterbi decoder) looks at a sentence and decides what tag to assign to each word.
The hard parts of this are feature selection and the search algorithm in step 4.
Well to understand that you got to study a lot of things.
For start
Understand the basic of markov and bayesian networks.
Online course available in coursera by daphne coller
https://class.coursera.org/pgm/lecture/index
CRF is a special type of markov network where we have observation and hidden states.
The objective is to find the best State Assignment to the unobserved variables also known as MAP problem.
Be Prepared for a lot of probability and Optimization. :-)