Keras LSTM training for text generation

I am working on a character level text generator using Keras. In going through examples/tutorials there is something that I still do not understand.
The training data (X) is being split into semi redundant sequences of length maxlen, with y being the character immediately following the sequence.
I understand that this is for efficiency as it means that the training will only realize dependencies within maxlen characters.
I am struggling to understand why it is done in sequences though. I thought LSTM/RNN were trained by inputting characters one at a time and comparing the predicted next character to the actual next character. This seems very different from inputting them, say, maxlen=50 characters at a time and comparing each length-50 sequence to the next character.
Does Keras actually break up the training sequences and input them character by character "under the hood"?
If not why?

Since you are doing sequence generation, I'm assuming that you are setting the flag stateful=True in your recurrent layers. Without this option, different sequences are treated as independent of one another, which I think is not what you want here. If this flag is set to True, then both approaches are equivalent, and dividing the text into sequences is done for performance and simplicity.
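The semi-redundant windowing described in the question can be sketched in plain Python. This is only an illustration: the names make_windows, maxlen, and step are hypothetical, and the snippet stops before any one-hot encoding or Keras call. A recurrent layer that receives one such window still steps through it one timestep at a time internally.

```python
def make_windows(text, maxlen=5, step=2):
    """Split text into overlapping windows X and next-character targets y."""
    X, y = [], []
    # Stop early enough that every window has a character following it.
    for start in range(0, len(text) - maxlen, step):
        X.append(text[start:start + maxlen])   # maxlen input characters
        y.append(text[start + maxlen])         # the single character to predict
    return X, y


X, y = make_windows("hello world", maxlen=5, step=2)
for window, target in zip(X, y):
    print(repr(window), "->", repr(target))
```

With step smaller than maxlen, consecutive windows overlap, which is what makes them "semi redundant".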

Related

In Fasttext skipgram training, what will happen if some sentences in the corpus have just one word?

Imagine that you have a corpus in which some lines have just one word, so there is no context around some of the words. In this situation, how does FastText provide embeddings for these single words? Note that the frequency of some of these words is one and there is no cut-off to get rid of them.

There's no way to train a context_word -> target_word skip-gram pair for such words (in either 'context' or 'target' roles), so such words can't receive trained representations. Only texts with at least 2 tokens contribute anything to word2vec or FastText word-vector training.
(One possible exception: FastText in its 'supervised classification' mode might be able to make use of, and train vectors for, such words, because then even single words can be used to predict the known-label of training texts.)
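To see why a one-token text contributes nothing, here is a minimal sketch of skip-gram pair generation. It is a simplification under assumptions: skipgram_pairs is a hypothetical helper, and it ignores subsampling, randomly reduced windows, and FastText's character n-grams.

```python
def skipgram_pairs(tokens, window=2):
    """Return (context_word, target_word) pairs for one text."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is never its own context
                pairs.append((tokens[j], target))
    return pairs


print(skipgram_pairs(["lonely"]))                         # single-word text
print(skipgram_pairs(["the", "quick", "fox"], window=1))  # three-word text
```

The single-word text yields an empty list, so there is no training example through which its vector could be updated.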
I suspect that such corpuses will still result in the model counting the word in its initial vocabulary-discovery scan, and thus it will be allocated a vector (if it appears at least min_count times), and that vector will receive the usual small-random-vector initialization. But the word-vector will receive no further training – so when you request the vector back after training, it will be of low-quality, with the only meaningful contributions coming from any char n-grams shared with other words that received real training.
You should consider any text-breaking process that results in single-word texts as buggy for the purposes of FastText. If those single-word texts come from another meaningful context where they were once surrounded by other contextual words, you should change your text-breaking process to work in larger chunks that retain that context.
Also note: it's rare for min_count=1 to be a good idea for word-vector models, at least when the training text is real natural-language material where word-token frequencies roughly follow Zipf's law. There will be many, many 1-occurrence (or few-occurrence) words, but with just one to a few example usage contexts, not likely representing the true breadth and subtleties of that word's real usages, it's nearly impossible for such words to receive good vectors that generalize to other uses of those same words elsewhere.
Training good vectors requires a variety of usage examples, and just one or a few examples will practically be "noise" compared to the tens-to-hundreds of examples of other words' usage. So keeping these rare words, instead of dropping them like a default min_count=5 (or higher in larger corpuses) would do, tends to slow training, slow convergence ("settling") of the model, and lower the quality of the other more-frequent word vectors at the end, due to the significant-but-largely-futile efforts of the algorithm to helpfully position these many rare words.

Implementing one hot encoding

I already understand the uses and concept behind one hot encoding with neural networks. My question is just how to implement the concept.
Let's say, for example, I have a neural network that takes in up to 10 letters (not case sensitive) and uses one hot encoding. Each input will be a 26 dimensional vector of some kind for each spot. In order to code this, do I act as if I have 260 inputs with each one displaying only a 1 or 0, or is there some other standard way to implement these 26 dimensional vectors?

In your case, you have to distinguish between various frameworks. I can speak for PyTorch, which is my go-to framework when programming a neural network.
There, one-hot encodings for sequences are generally performed in a way where your network will expect a sequence of indices. Taking your 10 letters as an example, this could be the sequence of ["a", "b", "c" , ...]
The embedding layer will be initialized with a "dictionary length", i.e. the number of distinct elements (num_embeddings) your network can receive - in your case 26. Additionally, you can specify embedding_dim, i.e. the output dimension of a single character. This is already past the step of one-hot encodings, since you generally only need them to know which value to associate with that item.
Then, you would feed an index-coded version of the above sequence to the layer, which could look like this: [0,1,2,3, ...]. Assuming the sequence is of length 10, this will produce an output of shape [10,embedding_dim], i.e. a 2-dimensional Tensor.
To summarize, PyTorch essentially allows you to skip this rather tedious step of encoding it as a one-hot encoding. This is mainly due to the fact that your vocabulary can in some instances be quite large: Consider for example Machine Translation Systems, in which you could have 10,000+ words in your vocabulary. Instead of storing every single word as a 10,000-dimensional vector, using a single index is more convenient.
If that does not completely answer your question (since I am essentially telling you how it is generally preferred): instead of making a 260-dimensional vector, you would again use a [10,26] Tensor, in which each row represents a different letter.
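The index sequence and the [10,26] layout can be sketched without any framework. This is a plain-Python illustration, not PyTorch API: to_indices and one_hot are hypothetical helpers, and padding for inputs shorter than 10 letters is left out.

```python
import string

ALPHABET = string.ascii_lowercase  # 26 letters; input is treated case-insensitively


def to_indices(word):
    """Map each letter to its index 0..25 (raises ValueError on non-letters)."""
    return [ALPHABET.index(ch) for ch in word.lower()]


def one_hot(word):
    """Return one 26-element row per letter, with a single 1 in each row."""
    rows = []
    for idx in to_indices(word):
        row = [0] * len(ALPHABET)
        row[idx] = 1
        rows.append(row)
    return rows


encoded = one_hot("abcdefghij")                 # 10 rows of 26 values
flat = [bit for row in encoded for bit in row]  # the same data as 260 inputs
print(to_indices("abc"), len(encoded), len(flat))
```

The flat 260-value list and the 10 x 26 grid hold identical information; the 2-dimensional form just keeps the position of each letter explicit.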

If you have 10 distinct elements (e.g. a, b, ..., j or 1, 2, ..., 10) to be represented as one-hot vectors of dimension 26, then your input is 10 vectors, each of dimension 26. Do this:
import torch
y = torch.eye(26)  # 26x26 identity matrix: row i is the one-hot vector for letter i
y[torch.arange(0, 10)]  # selects rows 0..9, i.e. 10 one-hot vectors, each of dimension 26
Hope this helps a bit.

How to predict word using trained CBOW

I have a question about CBOW prediction. Suppose my job is to use 3 surrounding words w(t-3), w(t-2), w(t-1) as input to predict one target word w(t). Once the model is trained, I want to predict a missing word at the end of a sentence. Does this model only work for a four-word sentence in which the first three words are known and the last is unknown? If I have a 10-word sentence in which the first nine words are known, can I use those 9 words as input to predict the last missing word?

Word2vec CBOW mode typically uses symmetric windows around a target word. But it simply averages the (current in-training) word-vectors for all words in the window to find the 'inputs' for the prediction neural-network. Thus, it is tolerant of asymmetric windows: if fewer words are available on either side, fewer words on that side are used (and perhaps even zero on that side, for words at the front/end of a text).
Additionally, during each training example, it doesn't always use the maximum-window specified, but some random-sized window up-to the specified size. So for window=5, it will sometimes use just 1 on either side, and other times 2, 3, 4, or 5. This is done to effectively overweight closer words.
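The overweighting effect of the random window can be made concrete with a small sketch. It assumes the effective window is drawn uniformly from 1 to window, as described above; effective_window and inclusion_probability are hypothetical names, not functions from any word2vec library.

```python
import random


def effective_window(window, rng=random):
    """Draw the window actually used for one training example."""
    return rng.randint(1, window)  # uniform over 1..window, inclusive


def inclusion_probability(distance, window):
    """Chance that a word `distance` positions away falls inside the drawn window."""
    if distance < 1 or distance > window:
        return 0.0
    # The word is included whenever the drawn window is >= distance.
    return (window - distance + 1) / window


for d in range(1, 6):
    print(d, inclusion_probability(d, 5))
```

Immediate neighbours are always used, while words at the maximum distance are used in only one draw out of window, so nearer words take part in more training updates.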
Finally and most importantly for your question, word2vec doesn't really do a full prediction during training of "what exact word does the model say should be at this target location?" In either the 'hierarchical softmax' or 'negative-sampling' variants, such an exact prediction can be expensive, requiring calculations of neural-network output-node activation levels proportionate to the size of the full corpus vocabulary.
Instead, it does the much-smaller number-of-calculations required to see how strongly the neural-network is predicting the actual target word observed in the training data, perhaps in contrast to a few other words. In hierarchical-softmax, this involves calculating output nodes for a short encoding of the one target word – ignoring all other output nodes encoding other words. In negative-sampling, this involves calculating the one distinct output node for the target word, plus a few output nodes for other randomly-chosen words (the 'negative' examples).
In neither case does training know if this target word is being predicted in preference over all other words, because it's not taking the time to evaluate all other words. It just looks at the current strength-of-outputs for a real example's target word, and nudges them (via back-propagation) to be slightly stronger.
The end result of this process is a set of word-vectors that are usefully arranged for other purposes, where similar words are close to each other, and even certain relative directions and magnitudes also seem to match human judgements of words' relationships.
But the final word-vectors, and model-state, might still be just mediocre at predicting missing words from texts – because it was only ever nudged to be better on individual examples. You could theoretically compare a model's predictions for every possible target word, and thus force-create a sort of ranked-list of predicted-words – but that's more expensive than anything needed for training, and prediction of words like that isn't the usual downstream application of sets of word-vectors. So indeed most word2vec libraries don't even include any interface methods for doing full target-word prediction. (For example, the original word2vec.c from Google doesn't.)
A few versions ago, the Python gensim library added an experimental method for prediction, predict_output_word(). It only works for negative-sampling mode, and it doesn't quite handle window-word-weighting the same way as is done in training. You could give it a try, but don't be surprised if the results aren't impressive. As noted above, making actual predictions of words isn't the usual real goal of word2vec-training. (Other more stateful text-analysis, even just large co-occurrence tables, might do better at that. But they might not force word-vectors into interesting constellations like word2vec.)
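The idea of averaging any number of context words and then force-ranking every candidate target can be sketched as a toy. Everything here is an assumption for illustration: the 2-dimensional vectors are made up rather than trained, rank_targets is a hypothetical helper, and it scores the whole vocabulary with a plain softmax, which is not what training or gensim does internally.

```python
import math

# Made-up input (context) and output (target) vectors for a 4-word vocabulary.
INPUT_VECTORS = {"the": [0.1, 0.3], "cat": [0.9, 0.2],
                 "sat": [0.4, 0.8], "mat": [0.8, 0.3]}
OUTPUT_VECTORS = {"the": [-1.0, -1.0], "cat": [1.0, 0.0],
                  "sat": [0.0, 1.0], "mat": [0.5, 0.5]}


def average(vectors):
    """Element-wise mean; works for any number of context vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def rank_targets(context_words):
    """Score every vocabulary word against the averaged context, best first."""
    hidden = average([INPUT_VECTORS[w] for w in context_words])
    scores = {w: sum(h * o for h, o in zip(hidden, out))
              for w, out in OUTPUT_VECTORS.items()}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return sorted(((w, e / total) for w, e in exps.items()),
                  key=lambda pair: pair[1], reverse=True)


print(rank_targets(["sat"]))                # one context word
print(rank_targets(["the", "cat", "sat"]))  # three context words
```

Because the context is reduced to a single averaged vector, the same function accepts 1, 3, or 9 context words; the cost is one score per vocabulary word, which is the expense that training avoids.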

Choosing a checksum for short code to prevent typing errors

I need to choose a checksum algorithm to detect when users mistype a 4-character [A-Z0-9] code, by adding 1 check character (also in [A-Z0-9]) at the end of the code.
Summing ASCII codes and applying a modulo is a bad solution, since swapping 2 keystrokes won't be noticed.
I would probably use the Fletcher algorithm, but I would like to know if anyone knows an algorithm designed for this use case (a very small number of bytes, position dependent).
Thank you.

You can try the ISO 7064 Mod x,y algorithms. According to the ISO description:
The check character systems specified in ISO/IEC 7064:2002 ( http://www.iso.org/iso/home/store/catalogue_ics/catalogue_detail_ics.htm?csnumber=31531 ) can detect:
all single substitution errors (the substitution of a single character for another, for example 4234 for 1234);
all or nearly all single (local) transposition errors (the transposition of two single characters, either adjacent or with one character between them, for example 12354 or 12543 for 12345);
all or nearly all shift errors (shifts of the whole string to the left or right);
a high proportion of double substitution errors (two separate single substitution errors in the same string, for example 7234587 for 1234567);
a high proportion of all other errors.
There are some partial implementations you can find like:
http://code.google.com/p/checkdigits/wiki/CheckDigitSystems (includes Java and Javascript implementations of several checksum algorithms).
http://www.codeproject.com/Articles/16540/Error-Detection-Based-on-Check-Digit-Schemes (explains and includes VC implementations).
For example, you could use ISO 7064 Mod 37,36, which can use 0-9 and A-Z for both the data and the check character. The detailed description of the algorithm (if you don't feel like buying the ISO standard) can be found in:
http://www.cdfa.ca.gov/ahfss/animal_health/pdfs/NAIS/Program_Standard_and_Technical_Reference10-07.pdf (it's used for animal identification)
http://www.ifpi.org/content/library/GRid_Standard_v2_1.pdf (also used by the music industry)
http://www.ddex.net/sites/default/files/DDEX-DPID-10-2006.pdf (other media companies)
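For a 4-character code, the hybrid Mod 37,36 recursion can be sketched as below. This is my reading of the publicly described recursion, not text from the standard, so verify it against one of the documents above before relying on it; check_char and is_valid are hypothetical names, and input is assumed to be already upper-cased.

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
M = len(ALPHABET)  # 36


def _carry_after(text):
    """Run the hybrid recursion over text and return the carry value."""
    p = M
    for ch in text:
        s = (p + ALPHABET.index(ch)) % M  # add the character value, modulo M
        p = ((s or M) * 2) % (M + 1)      # a zero is replaced by M, then doubled mod M+1
    return p


def check_char(data):
    """Character to append so that the full code validates."""
    return ALPHABET[(1 - _carry_after(data)) % M]


def is_valid(code):
    """A code is valid when the final sum is congruent to 1 modulo M."""
    return (_carry_after(code[:-1]) + ALPHABET.index(code[-1])) % M == 1


code = "AB12" + check_char("AB12")
print(code, is_valid(code), is_valid("BA" + code[2:]))
```

Because each step depends on the carry from the previous character, the result is position dependent, which is what lets it catch swapped keystrokes that a plain sum modulo N misses.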

Near Duplicate Detection in Data Streams

I am currently working on a streaming API that generates a lot of textual content. As expected, the API gives out a lot of duplicates and we also have a business requirement to filter near duplicate data.
I did a bit of research on duplicate detection in data streams and read about Stable Bloom Filters. Stable bloom filters are data structures for duplicate detection in data streams with an upper bound on the false positive rate.
But, I want to identify near duplicates and I also looked at Hashing Algorithms like LSH and MinHash that are used in Nearest Neighbour problems and Near Duplicate Detection.
I am kind of stuck and looking for pointers as to how to proceed and papers/implementations that I could look at?

First, normalize the text to all lowercase (or uppercase) characters, replace all non-letters with a white space, compress all multiple white spaces to one, and remove leading and trailing white space; for speed I would perform all these operations in one pass over the text. Next take the MD5 hash (or something faster) of the resulting string. Do a database lookup of the MD5 hash (as two 64 bit integers) in a table: if it exists, it is an exact duplicate; if not, add it to the table and proceed to the next step. You will want to age off old hashes based either on time or memory usage.
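The exact-duplicate step can be sketched with the standard library. Assumptions are labeled here and in the comments: an in-memory OrderedDict stands in for the database table, "letters" means ASCII a-z only, aging is by entry count, and ExactDuplicateFilter is a hypothetical name.

```python
import hashlib
import re
import struct
from collections import OrderedDict

_NON_LETTERS = re.compile(r"[^a-z]+")  # assumption: only ASCII letters are kept


def normalize(text):
    """Lowercase, turn every run of non-letters into one space, trim the ends."""
    return _NON_LETTERS.sub(" ", text.lower()).strip()


def md5_pair(text):
    """MD5 of the normalized text as two unsigned 64-bit integers."""
    digest = hashlib.md5(normalize(text).encode("utf-8")).digest()
    return struct.unpack(">QQ", digest)


class ExactDuplicateFilter:
    def __init__(self, max_entries=100_000):
        self.max_entries = max_entries
        self._seen = OrderedDict()  # stand-in for the hash table in a database

    def is_duplicate(self, text):
        key = md5_pair(text)
        if key in self._seen:
            self._seen.move_to_end(key)  # recently seen hashes survive longer
            return True
        self._seen[key] = None
        if len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)  # age off the oldest hash
        return False


f = ExactDuplicateFilter()
print(f.is_duplicate("Hello, world!"), f.is_duplicate("  hello   WORLD "))
```

Only texts for which is_duplicate returns False would move on to the near-duplicate step.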
To find near duplicates, the normalized string needs to be converted into potential signatures (hashes of substrings); see the SpotSigs paper and the blog post by Greg Linden. Suppose the routine Sigs() does that for a given string, that is, given the normalized string x, Sigs(x) returns a small (1-5) set of 64 bit integers. You could use something like the SpotSigs algorithm to select the substrings in the text for the signatures, but making your own selection method could perform better if you know something about your data. You may also want to look at the simhash algorithm.
Given Sigs(), the problem of efficiently finding the near duplicates is commonly called the set similarity joins problem. The SpotSigs paper outlines some heuristics to trim the number of sets a new set needs to be compared to, as does the simhash method.
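One possible shape for Sigs() and the candidate lookup is sketched below. It is a stand-in for illustration, not the SpotSigs or simhash selection method: sigs keeps the few smallest 64-bit hashes of word shingles, and SignatureIndex is a hypothetical in-memory inverted index.

```python
import hashlib
from collections import defaultdict


def sigs(normalized, shingle_words=3, keep=3):
    """Hypothetical Sigs(): the `keep` smallest 64-bit hashes of word shingles."""
    words = normalized.split()
    if not words:
        return set()
    count = max(1, len(words) - shingle_words + 1)
    shingles = {" ".join(words[i:i + shingle_words]) for i in range(count)}
    hashes = {
        int.from_bytes(
            hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest(), "big")
        for s in shingles
    }
    return set(sorted(hashes)[:keep])


class SignatureIndex:
    """Maps each signature to the ids of the documents that produced it."""

    def __init__(self):
        self._by_sig = defaultdict(set)

    def candidates(self, doc_sigs):
        """Ids of stored documents sharing at least one signature."""
        found = set()
        for sig in doc_sigs:
            found |= self._by_sig.get(sig, set())
        return found

    def add(self, doc_id, doc_sigs):
        for sig in doc_sigs:
            self._by_sig[sig].add(doc_id)


index = SignatureIndex()
index.add("doc1", sigs("the quick brown fox jumps over the lazy dog"))
print(index.candidates(sigs("the quick brown fox jumps over the lazy dog")))
```

Documents returned by candidates are only suspects; a real pipeline would confirm them with a proper set-similarity measure before discarding anything.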

http://micvog.com/2013/09/08/storm-first-story-detection/ has some nice implementation notes
