How to interpret the Microsoft Neural Network

I've created many types of neural networks with Microsoft's data tools, but I still struggle with the interpretation.
My neural network output gives some weird numbers. For the output attribute "Total Spent" it contains ranges with negative numbers (e.g. -384 to -58) even though the data set I use contains no negative numbers. What do these ranges actually mean, and how am I supposed to interpret them correctly?
I am using a data set on customer purchases and I group by CustomerID. In the data set, some customers purchase thousands of dollars' worth of goods and others spend fifty dollars.
Thank you in advance for the help.

Related

Purpose of bitcoin mining puzzle

I understand how Bitcoin mining requires a long effort of guessing the nonce until one is able to produce a hash with leading zeros.
I have two particular questions here --
Why is Bitcoin mining made so computationally expensive in the first place? If the purpose is just to choose a random winner for block placement, why not use a simpler and faster proof-of-work algorithm? (One example could be to generate a random number between 0 and 1, and the one with the smallest/largest value wins the round.) By making the puzzle less computationally expensive, we would save a lot of electrical energy globally.
Is there any specific advantage to choosing a puzzle that produces a resulting hash with leading zeros?
The difficulty of the algorithm is precisely what makes it difficult to cheat/steal on the Bitcoin network. If the algorithm was easy, then anyone could recreate old blocks and delete old spends so it looks like they never spent any Bitcoin after they buy something, for example. The purpose is not to pick a random winner, the purpose is to reward the miners doing the most work. It's true the winner is random, but the more work you do (more hashpower) the higher the probability that you will win. The probability is equal to the proportion of hashpower you're spending in relation to the total hashpower of the network.
Leading zeros are not what makes the hash valid; the hash just has to be under a threshold value (the target). The leading zeros appear because the number is low. It's like writing 1000 versus 001000: it's still the same number, but the hash is 32 bytes, so the leading zeros are shown so you can see all 32 bytes.
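As a rough illustration of "under a threshold" (a toy sketch in Python, not Bitcoin's real header serialization or difficulty encoding), the check looks something like this:

import hashlib

# Toy proof-of-work: the double-SHA256 of the header plus nonce, read as a
# 256-bit integer, must fall below a target. A lower target means more
# leading zero bits and therefore more expected hashing work.
target = 2 ** 240  # roughly 16 leading zero bits required (made-up difficulty)

def mine(block_header: bytes) -> int:
    nonce = 0
    while True:
        payload = block_header + nonce.to_bytes(8, "little")
        digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce  # this nonce yields a hash under the target
        nonce += 1

print(mine(b"example block header"))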
I highly recommend reading the Bitcoin Whitepaper on proof-of-work. Also check out the Bitcoin Wiki - PoW

Evaluating mahout based Boolean recommendation engine - interpreting precision & recall

I would like to evaluate a Mahout-based recommendation engine for a fashion e-commerce site. It uses shopping cart information about items bought together, so the data is boolean. I want to evaluate the engine using precision and recall.
1) How can I use these metrics to evaluate the recommendation engine? Is it just possible to use these values when altering the algorithm and to then check against yourself?
2) Or does it make sense to compare to other algorithms (also using boolean data)? If yes, is there any benchmark of precision and recall available (e.g if precision is x and recall is y, then algorithm should be discarded or accepted)?
Hoping to find help, I thank you in advance!
Well, in the Information Retrieval context items are handled in a boolean manner, i.e. they're either relevant or non-relevant. Mahout's GenericRecommenderIRStatsEvaluator uses a data splitter to build a set of already preferred (or, in your case, bought) items, which represent the relevant items. In Mahout's case the selected items are the top-n most preferred items; since the ratings are boolean, it simply selects n preferred items. I don't believe this makes the evaluation itself drastically less accurate than with normal five-star ratings, since buying is a pretty strong sign of preference. So:
1) If you have managed to make recommendations then you are able to evaluate the recommendations using precision and recall as metrics.
2) I have used a random recommender as a benchmark (just an implementation of a Mahout recommender which selects n random items). It usually produces pretty low precision and recall, so if your algorithm has lower precision and recall than the random recommender it probably should be ditched. Another metric I would look at in the offline evaluation phase is reach, since a recommender that produces recommendations for only 80 users out of 6,000 active users is pretty useless.
Also, it should be noted that in academic papers precision and recall have been criticized when used as the sole metrics. In the end the user decides what is relevant and what is not, and a recommender that scores slightly lower than another is not necessarily worse. For example, more novel or serendipitous recommendations may lower precision and recall.
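If it helps to see the mechanics, here is a minimal sketch (plain Python, not Mahout's API) of precision and recall for boolean top-n recommendations, where a user's held-out purchases play the role of the relevant set; the item names are made up:

# Hold out some of a user's purchases as the "relevant" set, recommend n items,
# then compare the two sets.
def precision_recall_at_n(recommended, held_out, n):
    top_n = set(recommended[:n])
    relevant = set(held_out)
    hits = top_n & relevant
    precision = len(hits) / n
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

recommended = ["shirt", "jeans", "belt", "hat", "socks"]  # ranked recommendations
held_out = ["jeans", "socks", "scarf"]                    # purchases hidden from training

p, r = precision_recall_at_n(recommended, held_out, n=5)
print("precision@5 = %.2f, recall@5 = %.2f" % (p, r))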

How to train a neural network to detect presence of a pattern?

The question phrasing is vague - and I'm happy to change it based on feedback. But, I am trying to train a neural network to detect fraudulent transactions on a website. I have a lot of parameters as inputs (time of day, country of origin, number of visits in the past month, number of visits from unique IP's in the past month, number of transactions, average transaction size, etc, etc). Altogether, perhaps over 100 inputs. The inputs have been normalized and sanitized and they form a time series. Historically, I can look at my data and identify that a transaction was fraudulent of Type A or of Type B or not fraudulent. My training set can be large (in the thousands or tens of thousands of points).
Ultimately, I want an indicator: Fraud of Type A, Fraud of Type B or Not Fraud. Generally, fraudulent transactions tend to fit a pattern. I can't exactly identify the pattern (that's why I'm using a NN). However, not fraudulent transactions can be of any type of pattern. So it seems strange to identify things into 3 buckets when the third bucket is "other".
If this were a switch / case, it would be something like:
switch transactionData
when transactionData.transaction_count < 0.2 && ....
FRAUD_A
when transactionData.transaction_count > 0.5 && ....
FRAUD_B
else
NOT_FRAUD
Obviously, these are simplified cases, but my problem runs into how to properly train for the else case. Do I get three types of data (fraud_a, fraud_b and not_fraud) and train them? Or is there another way to train for other?
It is usually perfectly OK to have an OTHER (NOT FRAUD) class along with the ones you are interested in. But I understand your concern. Basically, it's the job of the NN to learn the "case/switch", and in most cases it will learn the right one, assuming that most samples belong to the NOT FRAUD class. In some pathological cases classifiers can learn a different idea, e.g. that everything is the FRAUD A class unless proven otherwise. You usually can't control this directly, but it can be influenced by creating better features and some other tricks. For now, proceed with what you have and see what happens.
One thing you can do is train two classifiers: one for FRAUD/NOT FRAUD, and then, if fraud is detected, feed the data into a second two-class classifier (FRAUD A/FRAUD B). Sometimes (but not always) this works better.
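A minimal sketch of that two-stage setup, assuming scikit-learn purely for illustration (any binary classifiers would do; the data below is synthetic filler, not your features):

import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))   # stand-in for ~100 normalized inputs
y = rng.choice(["NOT_FRAUD", "FRAUD_A", "FRAUD_B"], size=1000, p=[0.9, 0.05, 0.05])

# Stage 1: fraud vs. not fraud
stage1 = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500)
stage1.fit(X, y != "NOT_FRAUD")

# Stage 2: trained only on the fraudulent rows, distinguishes A from B
fraud = y != "NOT_FRAUD"
stage2 = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500)
stage2.fit(X[fraud], y[fraud])

def classify(x):
    x = x.reshape(1, -1)
    if not stage1.predict(x)[0]:
        return "NOT_FRAUD"
    return stage2.predict(x)[0]

print(classify(X[0]))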

Newbie to Neural Networks

Just starting to play around with Neural Networks for fun after playing with some basic linear regression. I am an English teacher, so I don't have a math background, and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). Just looking for some general guidance put in layman's terms. I am using a trial version of an Excel Add-In called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
If I split the data into an even number, say 9x1000 (or however many) and created a network for each one, then tested the results of each of these 9 on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
What if ALL I cared about was whether or not the score was above or below 600. Would I then just make the output 0(below 600) or 1(above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1(below 600), 0(exactly 600), 1(above 600), would this work?
5c. Would the network always output -1, 0, 1, or would it output fractions that I would then have to round up or round down to finalize the prediction?
Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the activation function? Will log-sigmoid do the trick, or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid)?
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!
First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary."' and then proceed to ask the single most thorough and well thought out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
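One simple way to get that kind of coverage, sketched here with scikit-learn (an assumption on my part, since I don't know what NEURO XL offers), is a stratified split on binned output scores:

import numpy as np
from sklearn.model_selection import train_test_split

scores = np.random.randint(200, 801, size=9000)   # stand-in for the real SAT scores
features = np.random.rand(9000, 21)               # stand-in for the 21 inputs

bins = np.digitize(scores, [400, 500, 600, 700])  # coarse score bands, used only for stratifying
X_train, X_test, y_train, y_test = train_test_split(
    features, scores, test_size=0.2, stratify=bins, random_state=42
)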
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that this is not recommended as the input to a Neural Network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for some coefficients for each of their inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARSITY_SPORTS ...
The coefficients for each value are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you just change the scale of the coefficient that will be found, by the reciprocal of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
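For what it's worth, if you do decide to try z-scores, computing them yourself is a one-liner (a sketch; the array here is a stand-in for your real input columns):

import numpy as np

inputs = np.random.rand(9000, 21) * 100                    # stand-in for raw inputs on mixed scales
z = (inputs - inputs.mean(axis=0)) / inputs.std(axis=0)    # subtract the mean, divide by the std dev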
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
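If you do want to experiment with the bucketed version anyway, the mapping itself is trivial; here is one possible reading of the 50-point bands from the question (the exact band edges are my guess):

def score_to_bucket(score):
    # 750-800 -> 10, 700-740 -> 9, and so on in 50-point bands, clamped at both ends
    return min(max((score - 250) // 50, 0), 10)

print(score_to_bucket(760))  # 10
print(score_to_bucket(720))  # 9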
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.

How probable is it that two exact same calculations give different results?

I am currently working on remaking an old invoicing program that was originally written in VB6.
It has two parts, one on an Android tablet, the other on a PC. The old database stored derived values because there was a chance that the calculations would be incorrect if repeated.
For example, if one sold 5 items whose price was 10 euros, at a 10% discount and a tax rate of 23%, it would store the above 4 values but also the result of the calculation (5 * (10 * 1.23)) * 0.9.
I do not really like having duplicate or derivable information in my database, but the actual sale value must be the same whether it is viewed on a tablet or a PC.
So my question is: is there a chance (even the slightest one) that the above calculation (to a three-decimal precision) would have different results on different operating systems (such as an Android device and a desktop computer)?
Thanks in advance for any help you can provide
Yes, it's possible. Floating-point arithmetic is always subject to rounding errors and different languages (and architectures) deal with those errors in different ways. There are best practices in dealing with these issues, though I don't consider myself knowledgeable enough to speak to them. But here are a couple of options for you.
Use a data type meant for exact decimal math rather than binary floating point. For example, VB6 has Single and Double types for floating point, but also a Currency type for accurate decimal math.
Scale your floating-point values to integers and perform your calculations on these integer values. You can even store the results as integers in your DB. The ERP system we use does this and includes a data dictionary that defines how each type was scaled so that it can be "unscaled" before display.
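As a rough illustration of both options in Python (the figures are from your example: 5 items at 10 euros, 23% tax, 10% discount; the cent scaling is my own choice):

from decimal import Decimal, ROUND_HALF_UP

# Option 1: exact decimal arithmetic instead of binary floating point
price = Decimal("10.00")
total = (5 * (price * Decimal("1.23"))) * Decimal("0.9")
print(total.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP))  # 55.350

# Option 2: scale to integer cents, multiply first, divide once at the end
price_cents = 1000                                          # 10.00 euros as integer cents
total_cents = (5 * price_cents * 123 * 90) // (100 * 100)   # tax and discount as integer percentages
print(total_cents / 100)                                    # 55.35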
Hope that helps.