How to present an Integer representation to a Neural Network?

How to present an Integer representation to a Neural Network? - neural-network

I want to train a NN with details about parts that make up an entire product, and part of that data is the part ID that I receive as a unique integer ID; how would I go about inputting that to the NN? I don't want to normalize it as I don't want to put it on a common scale as it's not a scalar value; the number represents something. Do I just input it as the raw number?

I guess you should have an input unit for each different part_id you have, that looks similar to what happens when you train networks to process language and you assign each word to a unit.

One way to encode this is as a vector of booleans/integers. If the part with part id x exists, then vec[x] = 1, otherwise it is 0.
On the output side, it can be a vector of probabilities that that part exists.
Alternatively, you can use the number of times that part appeared rather than a simple exists vs not exists.

Related

printmat function: Decimal and percentage

I am very new to MATLAB. I am sorry if my question is basic. I am using "printmat" function to show some matrices in the command console. For example, printmat(A) and printmat(B), where A = 2.79 and B = 0.45e-7 is a scalar (for the sake of simplicity).
How do I increase the precision arbitrarily to seven decimals? For example: my output looks like 2.7943234 and B = 0.00000004563432.
How do I add a currency (say dollar) figure to the output of printmat?
How do I add a percentage figure (%) to the output of printmat?
Note: The reason I use printmat is that I can name my rows and columns. If you know a better function that can do all above, I would be glad to know.

Regards Mariam. From what I understand, you would like to display the numbers and show their full precision. I am also newbie, If I may contribute, you could convert the number data to string data (for display purposes) by using the sprintf function.
I am using the variable A=2.7943234 as example. This value will not display the full precision, instead it will display 2.7943. To show all the decimal tails, you could first convert this to string by
a = sprintf('%0.8f',A);
It will set the value a to a string '2.79432340'. The %0.8f means you want it to display 8 decimal tails. For this example,%0.7f is sufficient of course.
Another example: A=0.00000004563432, use %0.14f.
A=0.00000004563432;
a=sprintf('%0.14f $ or %%',A);
the output should be : '0.00000004563432 $ or %'.
You could analyze further in https://www.mathworks.com/help/matlab/ref/sprintf.html
You could try this first. If this does not help to reach your objective, I appreciate some inputs. Thanks.

The printmat function is very obsolete now. I think table objects are its intended successor (and functions such as array2table to convert a matrix to a table of data). Tables allow you to add row and column names and format the columns in different ways. I don't think there's a way to add $ or % to each number, but you can specify the units of each column.
In general, you can also format the display precision using format. Something like this may be what you want:
format long

Universal hashing, should get the same hash value for the same key?

I mean, I have implemented an universal hashing function using this expression:
h(k) = ((a*k + b)mod p)mod m; (from Cormen)
where:
-p is big prime number greater than k;
-a and b are two numbers that are randomly choosen the first in the range [1, p-1] and the second one [0, p-1].
Now, I implemented this, and for the random function I have choosen the seed equal to k. That's because, if I don't do this, when I insert a value with the key k, it will generate a hash value, that will depends on the default seed of Random function (maybe the time). So if I want to search the key again, I can't do this, because now the universal hashing function returns me another value. So, I would appreciate you to tell me if my reasoning is correct or not.
My doubt is that now, doing so, if two elements have the same key, they will be irrimediably stored in the same linked list (thing that I didn't understand if it is correct or not).
Thanks in advance.

I think you have a slight misunderstanding about how universal hashing works. Rather than choosing a and b at random every time you compute the hash, instead, before you do any hashing at all, select a random a and b. Once you've done that, every time you need to compute the hash, go and compute it using the formula above based on the input value k and the values a and b that you chose initially.

Partitioning a number into a number of almost equal partitions

I would like to partition a number into an almost equal number of values in each partition. The only criteria is that each partition must be in between 60 to 80.
For example, if I have a value = 300, this means that 75 * 4 = 300.
I would like to know a method to get this 4 and 75 in the above example. In some cases, all partitions don't need to be of equal value, but they should be in between 60 and 80. Any constraints can be used (addition, subtraction, etc..). However, the outputs must not be floating point.
Also it's not that the total must be exactly 300 as in this case, but they can be up to a maximum of +40 of the total, and so for the case of 300, the numbers can sum up to 340 if required.

Assuming only addition, you can formulate this problem into a linear programming problem. You would choose an objective function that would maximize the sum of all of the factors chosen to generate that number for you. Therefore, your objective function would be:
(source: codecogs.com)
.
In this case, n would be the number of factors you are using to try and decompose your number into. Each x_i is a particular factor in the overall sum of the value you want to decompose. I'm also going to assume that none of the factors can be floating point, and can only be integer. As such, you need to use a special case of linear programming called integer programming where the constraints and the actual solution to your problem are all in integers. In general, the integer programming problem is formulated thusly:
You are actually trying to minimize this objective function, such that you produce a parameter vector of x that are subject to all of these constraints. In our case, x would be a vector of numbers where each element forms part of the sum to the value you are trying to decompose (300 in your case).
You have inequalities, equalities and also boundaries of x that each parameter in your solution must respect. You also need to make sure that each parameter of x is an integer. As such, MATLAB has a function called intlinprog that will perform this for you. However, this function assumes that you are minimizing the objective function, and so if you want to maximize, simply minimize on the negative. f is a vector of weights to be applied to each value in your parameter vector, and with our objective function, you just need to set all of these to -1.
Therefore, to formulate your problem in an integer programming framework, you are actually doing:
(source: codecogs.com)
V would be the value you are trying to decompose (so 300 in your example).
The standard way to call intlinprog is in the following way:
x = intlinprog(f,intcon,A,b,Aeq,beq,lb,ub);
f is the vector that weights each parameter of the solution you want to solve, intcon denotes which of your parameters need to be integer. In this case, you want all of them to be integer so you would have to supply an increasing vector from 1 to n, where n is the number of factors you want to decompose the number V into (same as before). A and b are matrices and vectors that define your inequality constraints. Because you want equality, you'd set this to empty ([]). Aeq and beq are the same as A and b, but for equality. Because you only have one constraint here, you would simply create a matrix of 1 row, where each value is set to 1. beq would be a single value which denotes the number you are trying to factorize. lb and ub are the lower and upper bounds for each value in the parameter set that you are bounding with, so this would be 60 and 80 respectively, and you'd have to specify a vector to ensure that each value of the parameters are bounded between these two ranges.
Now, because you don't know how many factors will evenly decompose your value, you'll have to loop over a given set of factors (like between 1 to 10, or 1 to 20, etc.), place your results in a cell array, then you have to manually examine yourself whether or not an integer decomposition was successful.
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = intlinprog(-ones(n,1),1:n,[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
You can then go through results and see which value of n was successful in decomposing your number into that said number of factors.
One small problem here is that we also don't know how many factors we should check up to. That unfortunately I don't have an answer to, and so you'll have to play with this value until you get good results. This is also an unconstrained parameter, and I'll talk about this more later in this post.
However, intlinprog was only released in recent versions of MATLAB. If you want to do the same thing without it, you can use linprog, which is the floating point version of integer programming... actually, it's just the core linear programming framework itself. You would call linprog this way:
x = linprog(f,A,b,Aeq,beq,lb,ub);
All of the variables are the same, except that intcon is not used here... which makes sense as linprog may generate floating point numbers as part of its solution. Due to the fact that linprog can generate floating point solutions, what you can do is if you want to ensure that for a given value of n, you could loop over your results, take the floor of the result and subtract with the final result, and sum over the result. If you get a value of 0, this means that you had a completely integer result. Therefore, you'd have to do something like:
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = linprog(-ones(n,1),[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
%// Loop through and determine which decompositions were successful integer ones
out = cellfun(#(x) sum(abs(floor(x) - x)), results);
%// Determine which values of n were successful in the integer composition.
final_factors = find(~out);
final_factors will contain which number of factors you specified that was successful in an integer decomposition. Now, if final_factors is empty, this means that it wasn't successful in finding anything that would be able to decompose the value into integer factors. Noting your problem description, you said you can allow for tolerances, so perhaps scan through results and determine which overall sum best matches the value, then choose whatever number of factors that gave you that result as the final answer.
Now, noting from my comments, you'll see that this problem is very unconstrained. You don't know how many factors are required to get an integer decomposition of your value, which is why we had to semi-brute-force it. In fact, this is a more general case of the subset sum problem. This problem is NP-complete. Basically, what this means is that it is not known whether there is a polynomial-time algorithm that can be used to solve this kind of problem and that the only way to get a valid solution is to brute-force each possible solution and check if it works with the specified problem. Usually, brute-forcing solutions requires exponential time, which is very intractable for large problems. Another interesting fact is that modern cryptography algorithms use NP-Complete intractability as part of their ciphertext and encrypting. Basically, they're banking on the fact that the only way for you to determine the right key that was used to encrypt your plain text is to check all possible keys, which is an intractable problem... especially if you use 128-bit encryption! This means you would have to check 2^128 possibilities, and assuming a moderately fast computer, the worst-case time to find the right key will take more than the current age of the universe. Check out this cool Wikipedia post for more details in intractability with regards to key breaking in cryptography.
In fact, NP-complete problems are very popular and there have been many attempts to determine whether there is or there isn't a polynomial-time algorithm to solve such problems. An interesting property is that if you can find a polynomial-time algorithm that will solve one problem, you will have found an algorithm to solve them all.
The Clay Mathematics Institute has what are known as Millennium Problems where if you solve any problem listed on their website, you get a million dollars.
Also, that's for each problem, so one problem solved == 1 million dollars!
(source: quickmeme.com)
The NP problem is amongst one of the seven problems up for solving. If I recall correctly, only one problem has been solved so far, and these problems were first released to the public in the year 2000 (hence millennium...). So... it has been about 14 years and only one problem has been solved. Don't let that discourage you though! If you want to invest some time and try to solve one of the problems, please do!
Hopefully this will be enough to get you started. Good luck!

Precision and Recall in Item based Recommender with boolean preferences in Mahout

I am trying to calculate precision and recall at n of a Data set with Boolean Preferences using item item Recommender given in mahout.
I am using GenericBooleanPrefItemBasedRecommender and
evaluate(RecommenderBuilder recommenderBuilder,DataModelBuilder dataModelBuilder, DataModel dataModel,IDRescorer rescorer,int at,double relevanceThreshold,double evaluationPercentage) throws TasteException;
`
Since there are Boolean preferences, the set of "relevant" or "good" movies for a user are all the ones rated 1.
If I run the same code many times it always gives the same value of precision and recall and they are always equal to each other. Why? I am NOT using RandomUtils.useTestSeed()
How does it split the data into training and test set?
Possibilities:
a)Randomly divides the total data set into test and training at the beginning OR for each user it randomly puts a fixed percentage of relevant movies into test set: :How does it decide this percentage since there is no place for user to input this as a parameter.Why do I get the same value of P and R each time I run the code and why is the value of P at n and R at n the same?
b)For each user, it puts all relevant movies in the training set:
Then there is no information left on user to make any recommendations and thus its not possible.
Since I am getting that value of P and R at n are equal, does that mean that for each user, the number of relevant movies are moved to the test set each time = number of recommendations i.e. n. If the n relevant movies put in the test set are random then why do I get same value of P and R each time I run the code.
The only explanation that I can think of that explains the results is that the recommender calculates P and R at n as follows:
One by one, for each user it randomly puts 'n' relevant movies in test set. The process has to be random since it can't distinguish between all relevant movies but the process is fixed and each time the code is run it picks the same n relevant movies for each user. It then makes n recommendations and calculates P and R at n.
While this explains the results I don't think it is a good process because:
1)The concept of training and test set is not defined as a percentage and thus not consistent with the usual definition.
2) P and R will always be equal to each other so we only get one metric as opposed to two.
3) The process of picking 'n' movies randomly is the same each time.
EDIT: I AM ADDING MY FULL CODE IN CASE IT HELPS ANSWER MY QUESTION:
public static void main (String[] args) throws Exception {
FileDataModel model = new FileDataModel(new File("data/test.csv"));
RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
RecommenderBuilder recommenderBuilder = new RecommenderBuilder() {
#Override
public Recommender buildRecommender(DataModel model) {
ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
return new GenericBooleanPrefItemBasedRecommender(model, similarity);
}
};
IRStatistics stats = evaluator.evaluate(
recommenderBuilder, null, model, null, 5,
GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD,1.0);
System.out.println(stats.getPrecision());
System.out.println(stats.getRecall());
}

Don't know for sure but if you seed a random number generator with the same value each time you use it, the sequence of numbers it returns will be identical. Check to see if there is a way to seed the rng with something like the system time. Just a guess.

Check out my answer on related question:
How mahout's recommendation evaluator works
I think this will help you understand how the evaluation works, how the relevant items are chosen, and how Precision and Recall are computed.

strcmp files - Very large file size output

I'm reading in a csv file that is about 80MB - data_O3. It's about 250,000 x 5 in size. I created E, which is a little bit larger because it has all the days (data_O3 is missing some days). I want to compare the two so that if the date (saved in variable d3) and siteID (d4) are the same, the data point (column 5) is placed in E.
for j = 1:size(data_O3,1)
E(strcmp(d3,data_O3{j,3})&d4 == data_O3{j,4},5) = data_O3(j,5);
end
This script works fine, but for some reason, running it takes longer than expected. I've run the same code for other data that were only slightly smaller with no problem. Is this an issue with the strcmp code or something else?
The script and files used can be found here: https://www.dropbox.com/sh/7bzq3m1ixfeuhu6/i4oOvxHPkn

There are certainly see a number of ways to speed this up significantly.
First of all, read in all numeric data in as numbers. Matlab is not optimized to work with strings, and even cells should generally be avoided as much as possible. If you want to keep everything as strings, use another language (python or perl)
Once you have the state, county and site read in as numbers, then create a number instead of a string for the siteID. One approach would be to use the formula:
siteID = siteNum + 1e4*countyCode + 1e7*stateCode
That would generate unique siteIDs for all sites.
Use datenum to convert the date field into a number.
You are now in a position where the data_O3 defined on line 79 can be a purely numeric array (no cells!), as can your E matrix. That alone will make the process many times faster.
You also might want to define the E as something other than NaN. Maybe give it values of -1.
There may be more optimizations you can do in the comparison, but do the above first and I expect you will see a huge improvement.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse