Given a finite character vocabulary, what is the easiest way to represent arbitrarily long sequences of characters with uniform length? - data-representation

I am attempting to manipulate a finite state transducer (FST) for a project. However, in constructing the FST, I need each output symbol to be some arbitrarily long sequence of characters from the input symbols, which are simply the individual unique characters from an associated corpus of text. Additionally, I need to represent these arbitrarily long sequences uniformly, such that each combination's representation has the same length. Of course, with arbitrary length, the longest possible combination has infinite length, so let us assume that no combination can be longer than the longest document from the associated corpus.
In other words, given an input_vocabulary of ['a', 'b', 'c'], each item of an output_vocabulary of ['a', 'ab', 'acb', 'abcb'] needs to be represented as a vector of length 4 whose items come from the input_vocabulary. My only idea is to do so with a padded vector, such as, for this example, [ [0, 3, 3, 3], [0, 1, 3, 3], [0, 2, 1, 3], [0, 1, 2, 1] ], where 0, 1, and 2 are the indices of 'a', 'b', and 'c' and 3 is a pad token, but I am very new to this, so any help would be greatly appreciated.
To clarify, I want to know if there is a way to do this without pad tokens.
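For reference, the padded representation described above could be sketched roughly as follows. This is only an illustration of the idea in the question, not an FST library API: the function name encode_padded is hypothetical, and using the first unused index as the pad token is an assumption taken from the example.

```python
def encode_padded(words, vocabulary):
    """Map each word to a fixed-length list of symbol indices, padded on the right."""
    index = {ch: i for i, ch in enumerate(vocabulary)}
    pad = len(vocabulary)                 # assumption: first unused index is the pad token
    width = max(len(w) for w in words)    # longest sequence sets the uniform length
    return [[index[ch] for ch in w] + [pad] * (width - len(w)) for w in words]


input_vocabulary = ['a', 'b', 'c']
output_vocabulary = ['a', 'ab', 'acb', 'abcb']
encoded = encode_padded(output_vocabulary, input_vocabulary)
print(encoded)  # [[0, 3, 3, 3], [0, 1, 3, 3], [0, 2, 1, 3], [0, 1, 2, 1]]
```

Here the width comes from the longest item in output_vocabulary; under the assumption stated in the question, it would instead be the length of the longest document in the corpus.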

Related

Printing n choose k combinations in matlab

I need to create an algorithm in MATLAB that returns every combination of k elements from a set of n elements. For example, I have the set {1,2,3,4,5} and I need every combination of 3 numbers from this set. So this function should return:
[[1, 2, 3], [1, 2, 4], [1, 2, 5], [1, 3, 4], [1, 3, 5], [1, 4, 5], [2, 3, 4], [2, 3, 5], [2, 4, 5], [3, 4, 5]]
I have tried to write it myself, but without success, and I have given up. It partially works, but it runs into an endless loop.
for i=1:n
if(firstTime)
lastComb=min //123
firstTime=false
else
for d=k:-1:1
while(lastComb(:,end) < n-k+d && lastComb(:,end)<=n)
newComb=lastComb
newComb(d)=lastComb(d)+1
combos= [combos; newComb]
lastComb=newComb
end
while(lastComb(:,end) > n-k+d && lastComb(:,end)<=n)
newComb=lastComb
for p=d:-1:1
if(newComb(p)+1 <=n)
newComb(p)=newComb(p)+1
combos= [combos; newComb]
end
end
end
end
end
end
Overall, your syntax is a little confusing (as previously stated by someone else). If you're posting a question here, it's a good idea to include all of your code, including the defined variables just for ease of those that are helping you.
A few major issues that I'm seeing with what you have are as follows:
(1) You seem to only be getting "n" combinations with what you're writing, which I think is 3 combinations. Instead, you should be getting 10 combinations. The beginning of this function should probably have a combination calculation [nCk = n!/((n-k)!*k!)].
(2) You're defining the first combination as 1,2,3, but then you don't put it into the "combos" set that you are making. Instead you replace it with something else before it can reach "combos."
There are a couple more, but try fixing these parts and the others may come naturally.
Overall, this function already exists. Just type "open nchoosek" into MATLAB to see a refined version of what you are attempting if you get stuck!
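To make the two points above concrete, here is a minimal sketch of one way to generate the combinations in lexicographic order. It is written in Python rather than MATLAB and is not the implementation inside nchoosek; the function name combinations_lex is hypothetical. Note that the first combination is stored before any advancing happens, and that the loop ends once no position can be incremented, which is what prevents an endless loop.

```python
def combinations_lex(n, k):
    """Return all k-element subsets of 1..n in lexicographic order."""
    if k > n:
        return []
    combo = list(range(1, k + 1))      # first combination: 1, 2, ..., k
    result = [combo[:]]                # store it before advancing
    while True:
        # Find the rightmost position that has not reached its maximum value.
        # Position d (0-based) can hold at most n - k + d + 1.
        d = k - 1
        while d >= 0 and combo[d] == n - k + d + 1:
            d -= 1
        if d < 0:                      # every position is at its maximum: done
            return result
        combo[d] += 1
        for p in range(d + 1, k):      # reset the tail to the smallest valid values
            combo[p] = combo[p - 1] + 1
        result.append(combo[:])


combos = combinations_lex(5, 3)
print(len(combos))  # 10
```

For n = 5 and k = 3 this yields the ten rows listed in the question, from [1, 2, 3] to [3, 4, 5].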

Take array input from a file

I have a file whose first line is of the form:
6, [6; 2], 1000, 0.5, 0.01, [6 2], 0, 3.1416, [1 1 1]
Any of the cells can be a vector/array, but only a one-dimensional one.
I tried taking input using textscan:
C = textscan(fid, '%f%f%f%f%f%f%f%f%f',1,'delimiter',',');
but this doesn't give me the right output.
How can I take this input such that I get all the arrays?
Thanks in advance!
textscan with that format specifier is always going to fail because the [, ;, space, and ] characters aren't going to be parsed properly.
You could instead split the string at the commas, and then use str2num to convert each piece into a number or array. This assumes that you never use , within an array.
str = fgetl(fid); % read the first line of the file as text
value = cellfun(@str2num, strsplit(str, ','), 'UniformOutput', false)
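The split-then-convert idea can also be sketched outside MATLAB. The Python version below is only an illustration under the same assumption (no commas inside an array); parse_cell is a hypothetical helper, and unlike str2num it does not distinguish row vectors from column vectors.

```python
def parse_cell(text):
    """Convert one comma-separated cell into a float or a list of floats."""
    text = text.strip()
    if text.startswith('[') and text.endswith(']'):
        # Inside the brackets, elements are separated by spaces or semicolons.
        return [float(tok) for tok in text[1:-1].replace(';', ' ').split()]
    return float(text)


line = "6, [6; 2], 1000, 0.5, 0.01, [6 2], 0, 3.1416, [1 1 1]"
values = [parse_cell(cell) for cell in line.split(',')]
print(values)
```

Each entry of values is either a single number or a flat list, in the same order as the cells on the line.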

Matlab find the maximum and minimum value for each point of series of arrays (with negative values)

Let's say that we have the following series of arrays:
A = [1, 2, -2, -24];
B = [1, 4, -7, -2];
C = [3, 1, -7, -14];
D = [11, 4, -7, -1];
E = [1, 2, -3, -4];
F = [5, 14, -17, -12];
I would like to create two arrays,
the first will be the maximum of each column for all arrays,
i.e.
Maxi = [11, 14, -2, -1];
the second will be the minimum of each column for all arrays
i.e.
Mini = [1, 1, -17, -24];
I have been trying all day, using loops with max and abs, but I can't make it work.
In my real problem I have a 100-by-200 matrix, so I am using the example above to approach the problem in a simpler form. The ultimate goal is a kind of fit of the 100 y-lines over 200 x-points. The idea is to calculate two lines (i.e. max and min) that form the "visual" borders of all the lines (the maximum and minimum values for each x). The next step is to calculate the average of these two arrays, which gives a line running through the middle of all the lines.
Any help is more than welcome!
How about this?
Suppose you stack all the row vectors, namely A, B, ..., F, as
arr=[A;B;C;D;E;F];% stack the vectors
And then use the max(), min() and mean() functions provided by Matlab. That is,
Maxi = max(arr); % Maxi is a row vector carrying the max of each column of arr
Mini = min(arr);
Meani = mean(arr);
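You just have to stack them as shown above. If you have hundreds of separate row vectors, use a loop to stack them into the array arr first. If the data is already a matrix with one line per row (such as your 100-by-200 matrix), no stacking is needed, because max, min, and mean operate on each column of a matrix by default.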
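As a cross-check of the expected results, the same column-wise computation can be sketched in plain Python (not MATLAB). The variable names are illustrative. The sketch also computes the midline the question asks for, the average of the max and min envelopes, which in general differs from the column mean of all rows.

```python
rows = [
    [1, 2, -2, -24],    # A
    [1, 4, -7, -2],     # B
    [3, 1, -7, -14],    # C
    [11, 4, -7, -1],    # D
    [1, 2, -3, -4],     # E
    [5, 14, -17, -12],  # F
]

columns = list(zip(*rows))                 # transpose: one tuple per column
maxi = [max(col) for col in columns]       # upper envelope
mini = [min(col) for col in columns]       # lower envelope
middle = [(hi + lo) / 2 for hi, lo in zip(maxi, mini)]  # average of the two envelopes

print(maxi)    # [11, 14, -2, -1]
print(mini)    # [1, 1, -17, -24]
print(middle)  # [6.0, 7.5, -9.5, -12.5]
```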

How to get a regular sampled matrix in Scilab

I'm trying to program a function in Scilab (or, even better, find one if it already exists) that calculates regularly timed samples of a set of values.
I.e., I have a vector 'values' which contains the value of a signal at different times. These times are in the vector 'times'. So at time times(N), the signal has value values(N).
At the moment the times are not regular, so the variable 'times' and 'values' can look like:
times = [0, 2, 6, 8, 14]
values= [5, 9, 10, 1, 6]
This represents that the signal had value 5 from second 0 to second 2. Value 9 from second 2 to second 6, etc.
Therefore, if I want to calculate the signal's average value, I cannot just calculate the average of the vector 'values'. This is because, for example, the signal can hold the same value for a long time, but that value will appear only once in the vector.
One option is to use the time deltas to calculate the mean, but I will also need to perform other calculations besides the average.
Other option is to create a function that given a deltaT, samples the time and values vectors to produce an equally spaced time vector and corresponding values. For example, with deltaT=2 and the previous vectors,
[sampledTime, sampledValues] = regularSample(times, values, 2)
sampledTime = [0, 2, 4, 6, 8, 10, 12, 14]
sampledValues = [5, 9, 9, 10, 1, 1, 1, 6]
This is easy if deltaT is small enough to fit exactly with all the times. If the deltaT is bigger, then the average of values or some approximation must be done...
Is there anything already done in Scilab?
How can this function be programmed?
Thanks a lot!
PS: I don't know if this is the correct forum to post scilab questions, so any pointer would also be useful.
If you like to implement it yourself, you can use a weighted sum.
times = [0, 2, 6, 8, 14]
values = [5, 9, 10, 1, 6]
weightedSum = 0
highestIndex = length(times)
for i=1:(highestIndex-1)
// Get the amount of time a certain value contributed
deltaTime = times(i+1) - times(i);
// Add the weighted amount to the total weighted sum
weightedSum = weightedSum + deltaTime * values(i);
end
totalTimeDelta = times($) - times(1);
average = weightedSum / totalTimeDelta
printf( "Result is %f", average )
Or, if you want functionally the same but less readable code:
timeDeltas = diff(times)
sum(timeDeltas.*values(1:$-1))/sum(timeDeltas)
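The regularSample function from the question could be sketched along the following lines. This is written in Python rather than Scilab, it is not an existing Scilab function, and it assumes the signal holds each value until the next timestamp (a zero-order hold), as the question describes. The names regular_sample and time_weighted_average are hypothetical.

```python
def regular_sample(times, values, delta_t):
    """Sample a piecewise-constant signal every delta_t, holding the last known value."""
    sampled_times, sampled_values = [], []
    i = 0                      # index of the most recent timestamp at or before t
    t = times[0]
    while t <= times[-1]:
        while i + 1 < len(times) and times[i + 1] <= t:
            i += 1
        sampled_times.append(t)
        sampled_values.append(values[i])
        t += delta_t           # assumption: delta_t > 0; float steps may accumulate rounding error
    return sampled_times, sampled_values


def time_weighted_average(times, values):
    """Average of the signal, weighting each value by how long it was held."""
    deltas = [b - a for a, b in zip(times, times[1:])]
    return sum(d * v for d, v in zip(deltas, values)) / sum(deltas)


times = [0, 2, 6, 8, 14]
values = [5, 9, 10, 1, 6]
sampled_time, sampled_values = regular_sample(times, values, 2)
print(sampled_time)    # [0, 2, 4, 6, 8, 10, 12, 14]
print(sampled_values)  # [5, 9, 9, 10, 1, 1, 1, 6]
print(time_weighted_average(times, values))
```

When delta_t is larger than some of the gaps between timestamps, this sketch simply skips the values that fall between samples; averaging over each interval, as the question suggests, would need extra logic.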

Create array of points from single dimensional array of points

What I need to do is take a single-dimensional array, i.e.:
[1, 1, 2, 2, 3, 3]
and turn it into an array of points:
[[1, 1], [2, 2], [3, 3]]
I am hoping for a simple native MATLAB way of doing it rather than a function. This will be going into sets of points, i.e.:
[ [[1, 1], [2, 2], [3, 3]],
[[4, 4], [5, 5], [6, 6]],
[[7, 7], [7, 7], [8, 8]] ]
The reason for this is that the points will be stored in a text file as a single stream, and I need to turn them into something meaningful.
First note that a horizontal concatenation of row vectors will result in one larger row vector rather than in a row of pairs, that is [[1, 1], [2, 2], [3, 3]] is the same as [1 1 2 2 3 3]. Hence, you need to concatenate them vertically.
You can try
a = [1, 1, 2, 2, 3, 3];
b = reshape(a, 2, floor(length(a)/2))';
This will result in a matrix where each row represents the coordinates of one point.
b =
1 1
2 2
3 3
I'm just adding this answer for the sake of diversity:
Just as H.Muster said, concatenation of vectors will result in a larger vector or a matrix (depending on your operation). You can go with that.
But you can also use a cell array, which is a set of data containers called "cells". A cell can contain any type of data, regardless of what other cells contain in the same cell array.
In your case, creating a cell array can be done using a slightly different syntax (than H.Muster's answer):
a = [1, 1, 2, 2, 3, 3];
p = mat2cell(a, 1, 2 * ones(1, numel(a) / 2))
p is a cell array, each cell containing a 1-by-2 point vector. To access an element in a cell array, you'll have to use curly braces. For instance, the second point would be p{2} = [2, 2].
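For comparison, the same pairing of consecutive elements can be sketched in Python (not MATLAB), where a list of two-element lists plays the role of the cell array. The function name to_points is hypothetical.

```python
def to_points(flat):
    """Group a flat list of coordinates into consecutive [x, y] pairs."""
    if len(flat) % 2:
        raise ValueError("expected an even number of coordinates")
    return [flat[i:i + 2] for i in range(0, len(flat), 2)]


points = to_points([1, 1, 2, 2, 3, 3])
print(points)     # [[1, 1], [2, 2], [3, 3]]
print(points[1])  # [2, 2]
```

Note that Python indexing is 0-based, so the second point is points[1], whereas in MATLAB it is p{2}.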