Am I using PCA in Orange in a correct way?

I am analysing whether 15 books can be grouped according to 6 variables (of the 15 books, 2 are written by one author, 6 by another, and 7 by a third). I counted the number of occurrences of the variables and calculated the percentages. Then I used the Orange software to run PCA. I uploaded the file and selected the columns and rows. When it comes to PCA, the program asks me whether I want to normalize the data, but I am not sure about that because I have already calculated the percentages - is normalizing different from calculating the percentage? Moreover, below the normalize option there is a "Show only:" setting where I have to choose a number between 0 and 100, but I don't really know what it means.
Could you help me understand what I should do? Thank you in advance
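In case it helps to see the difference concretely: converting counts to percentages rescales each book's row by that book's total, while the kind of normalization PCA dialogs usually offer rescales each variable column to zero mean and unit variance so that no variable dominates just because of its scale. A minimal sketch in Python (the 15x6 counts matrix is made up for illustration):

import numpy as np

# Stand-in for the 15-books x 6-variables table of raw occurrence counts.
counts = np.random.default_rng(0).integers(1, 50, size=(15, 6)).astype(float)

# Percentages: divide each book's counts by that book's total (rows sum to 100).
percentages = 100 * counts / counts.sum(axis=1, keepdims=True)

# Normalization in the z-score sense: each variable column gets zero mean and
# unit variance, which is a different operation from taking percentages.
zscores = (percentages - percentages.mean(axis=0)) / percentages.std(axis=0)

The two steps are not interchangeable, so having percentages does not by itself answer the normalize question.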

Related

Netlogo: Built-in function to calculate the expected profit

Sorry for the long post. I am a newbie in agent-based modelling, so please accept my apology in advance if my question sounds stupid. I am trying to model a scenario where a farmer (i.e. the agent) decides which type of crop should be planted in different types of fields to increase profit. The farmer agent has a budget, i.e. the amount of money that can be spent on farming each time step, equal to $100.
The farmer operates a farm that is subdivided into nine fields, arranged in a 3x3 cellular grid. Each field is of the same size. Water availability varies spatially across the fields, with a rating of either 1 (driest), 2 (moderate), or 3 (wettest); the ratings are assigned to the fields at random.
The farmer must choose among three crops. As initial parameter settings, the crops have the following characteristics:

          Yield   Price   Costs   Minimum Water Req.
Crop 1      300      20      15                    3
Crop 2      200      12      10                    2
Crop 3      100       7       5                    1
Each crop requires a certain amount of water to grow. Crop yields will only be realized if the crop is
planted in a field with at least the crop’s minimum water requirement.
Now the problem is that I couldn't find any function in NetLogo that calculates the permutations or combinations of crop, field, and water requirements needed to work out the expected profit. Any help would be highly appreciated.
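For what it's worth, with 9 fields and 3 crops the combinations are small enough to enumerate directly. A hedged sketch of that brute-force idea in Python (not NetLogo), assuming revenue = yield * price when the field is wet enough, that the planting cost is always paid, and a made-up water rating per field:

from itertools import product

crops = {  # crop id: (yield, price, cost, minimum water requirement)
    1: (300, 20, 15, 3),
    2: (200, 12, 10, 2),
    3: (100,  7,  5, 1),
}
water = [1, 2, 3, 1, 2, 3, 1, 2, 3]   # hypothetical rating for each of the 9 fields
budget = 100

best_profit, best_plan = float("-inf"), None
for plan in product([None, 1, 2, 3], repeat=len(water)):   # None = leave the field fallow
    cost = sum(crops[c][2] for c in plan if c is not None)
    if cost > budget:                                      # respect the per-step budget
        continue
    revenue = sum(crops[c][0] * crops[c][1]
                  for c, w in zip(plan, water)
                  if c is not None and w >= crops[c][3])   # yield only if wet enough
    if revenue - cost > best_profit:
        best_profit, best_plan = revenue - cost, plan

print(best_plan, best_profit)

There is no single NetLogo primitive for this kind of search, which is why the answer below reaches for an optimization extension instead.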
I believe you describe a linear programming problem.
Useful functions for solving Simplex Linear Programming problems are in the NumAnal extension, which does not come bundled with NetLogo but which you can get as follows:
In NetLogo, under Tools / Extensions... you can find NumAnal, probably with no green check-mark. Select it. On the right, you have buttons to install it and then one to add it to your code. When you click those, it should get a green check-mark and you should have a new line in your code, "extensions [ numanal ]", and you are then able to use its commands with the "numanal:" prefix, for example numanal:simplex.
The documentation for it is in the folder where it was installed. But where is that?
Sadly, the documentation for where extensions are downloaded is not current.
https://ccl.northwestern.edu/netlogo/docs/extensions.html#where-extensions-are-located
After exhaustive search by date-modified, I actually found the folder on my Windows 10 laptop here: c:\Users\condor\AppData\Roaming\NetLogo\6.1\extensions
( Note the "\Roaming\" ).
That folder has a README.md text file, a PDF document named "NumAnal-v3.4.0" explaining how to use it, and an examples folder with code. It is a little dense.
The basics of how to describe a Linear Programming problem are beyond the scope of StackOverflow, but you can find plenty of help via Google.
Here's one 8-minute video (as of 24-Nov-2019) that might help you figure out whether this is what you need.
Simplex Algorithm Explanation (How to Solve a Linear Program)
https://www.youtube.com/watch?v=RO5477EKlXE
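For readers who want to see what "describing it as a linear program" looks like concretely, here is a hedged sketch of the same farm problem as a small mixed-integer linear program, written with SciPy rather than NumAnal purely for illustration (the per-field water ratings are an assumption):

import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

yields = np.array([300, 200, 100])
prices = np.array([20, 12, 7])
costs  = np.array([15, 10, 5])
minwat = np.array([3, 2, 1])
water  = np.array([1, 2, 3, 1, 2, 3, 1, 2, 3])    # hypothetical rating per field
F, C = len(water), len(yields)

# profit[f, c]: revenue is realized only if field f meets crop c's water need;
# the planting cost is assumed to be paid either way.
revenue = np.where(water[:, None] >= minwat[None, :], yields * prices, 0)
profit = (revenue - costs).ravel()                # decision variable x[f, c], flattened

A_field  = np.kron(np.eye(F), np.ones(C))         # at most one crop per field
A_budget = np.tile(costs, F).reshape(1, -1)       # total planting cost within the budget
constraints = [LinearConstraint(A_field, 0, 1),
               LinearConstraint(A_budget, 0, 100)]

res = milp(-profit,                               # milp minimizes, so negate the profit
           constraints=constraints,
           integrality=np.ones(F * C),            # x[f, c] is a 0/1 planting decision
           bounds=Bounds(0, 1))
print(res.x.reshape(F, C), -res.fun)

The objective vector, the one-crop-per-field rows and the budget row are the same ingredients an LP solver such as numanal:simplex expects, although plain simplex will not enforce the integrality requirement.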

Qliksense: Compute median of grouped data

I'm facing an issue in QlikSense, trying to compute some statistical indicators (percentiles, quartiles, standard deviation, median, etc.) on a dataset which is already grouped at the source.
I mean that my dataset is something similar to the following, in which I have, for each combination of Week and Customer Age, the total number of purchases:
I want to show the median of Customer Age, and due to the structure of the dataset I can't use the fractile or median built-in functions, since they would return something different.
Let's suppose I want to calculate the median age across all 3 weeks, i.e. I want to know the age of the people who account for 50% of my purchases.
To help you better understand the question, here is the histogram:
In this case, the median I want to get is 24-26 years, since 50% of the total population falls within that range.
I found a useful reference here, but I am having trouble writing this formula in QlikSense:
https://mba-lectures.com/statistics/descriptive-statistics/603/relationship-between-quartiles-decile...
Thanks a lot in advance.
[EDIT]: This is my Data Model View:
[EDIT 2]: Here is my qvf with a dataset more similar to the original one I'm using. As you can see, I can't get the correct result using your formula. In addition, I would like to use it to plot the trend of the median across weeks, but that doesn't seem to be possible (even if I use the modified version of the formula I pointed out in the comments).
If you want to calculate the median in such a scenario you need a weighted median, which basically checks which dimension value falls in the middle of the cumulative distribution:
Aggr(
    If(
        (Rangesum(Above([# Purchases], 0, RowNo()))
            / Sum(TOTAL [# Purchases])) >= 0.5
        and
        (Rangesum(Above([# Purchases], 1, RowNo() - 1))
            / Sum(TOTAL [# Purchases])) < 0.5,
        [Customer Age]
    ),
    [Customer Age]
)
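To see the logic outside QlikSense, here is the same cumulative-share idea as a short Python sketch over grouped data (the counts are made up; the real ones come from the source table):

# Weighted median over grouped data: walk the age bands in order and return the
# first band whose cumulative share of purchases reaches 50%.
groups = [("18-20", 10), ("21-23", 25), ("24-26", 40), ("27-29", 15), ("30-32", 10)]

total = sum(count for _, count in groups)
running = 0
for age_band, count in groups:
    previous_share = running / total
    running += count
    if running / total >= 0.5 and previous_share < 0.5:
        print("weighted median band:", age_band)   # 24-26 with these numbers
        break

The two Rangesum/Above terms in the expression above play the roles of running / total and previous_share here.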

Generate subset of data with known mean

I have a dataset of n observations (an n-by-1 vector) and would like to create a subset of this data whose mean is known in advance, by selecting at random only n/3 observations (or within some constraint, i.e. where the mean of the data subset is within a range around the known mean).
Can someone please help me with the code to do this in MATLAB?
Note, I don't want to use the rand function to create random data as I already have my data collected.
For example on a smaller scale: If I had the following dataset of 12 observations:
data = [8;7;4;6;9;6;4;7;3;2;1;1];
but then wanted to randomly select a subset of this data containing only 4 observations with a mean of 4 (or with a mean between 3.5-4.5 for example):
Then the answer might be datasubset=[7;3;2;4] but the answer could also be datasubset=[6;4;2;4] or datasubset=[6;4;3;4].
It doesn't matter if there are several possible solutions, I just need one of them, though I'd like to know the alternative solutions also.
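One straightforward approach is rejection sampling: keep drawing random subsets of the required size until one lands inside the tolerance around the target mean. A hedged sketch in Python (the same idea ports to MATLAB with randperm and mean; the attempt cap and tolerance are arbitrary choices):

import random

data = [8, 7, 4, 6, 9, 6, 4, 7, 3, 2, 1, 1]
k = len(data) // 3                 # subset size, 4 for this example
target, tol = 4.0, 0.5             # accept subsets whose mean is in [3.5, 4.5]

for _ in range(100_000):           # cap the attempts; a very narrow range may never hit
    subset = random.sample(data, k)
    if abs(sum(subset) / k - target) <= tol:
        print(subset)
        break
else:
    print("no subset found within the tolerance")

Collecting every accepted subset instead of breaking on the first one would give the alternative solutions as well.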

Newbie to Neural Networks

Just starting to play around with Neural Networks for fun after playing with some basic linear regression. I am an English teacher, so I don't have a math background, and trying to read a book on this stuff is way over my head. I thought this would be a better avenue to get some basic questions answered (even though I suspect there is no easy answer). I'm just looking for some general guidance put in layman's terms. I am using a trial version of an Excel add-in called NEURO XL. I apologize if these questions are too "elementary."
My first project is related to predicting a student's Verbal score on the SAT based on a number of test scores, GPA, practice exam scores, etc. as well as some qualitative data (gender: M=1, F=0; took SAT prep class: Y=1, N=0; plays varsity sports: Y=1, N=0).
In total, I have 21 variables that I would like to feed into the network, with the output being the actual score (200-800).
I have 9000 records of data spanning many years/students. Here are my questions:
How many records of the 9000 should I use to train the network?
1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
If I split the data into even chunks, say 9 sets of 1,000 (or however many), and created a network for each one, then tested each of these 9 networks on the other 8 sets to see which had the lowest MSE across the samples, would this be a valid way to "choose" the best network if I wanted to predict the scores for my incoming students (not included in this data at all)?
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
E.g.
750-800 = 10
700-740 = 9
etc.
Is there any benefit to doing this or should I just go ahead and try to predict the exact score?
What if ALL I cared about was whether the score was above or below 600? Would I then just make the output 0 (below 600) or 1 (above 600)?
5a. I read somewhere that it's not good to use 0 and 1, but instead 0.1 and 0.9 - why is that?
5b. What about -1 (below 600), 0 (exactly 600), 1 (above 600) - would this work?
5c. Would the network always output -1, 0, or 1, or would it output fractions that I would then have to round up or round down to finalize the prediction?
Once I have found the "best" network from Question #3, would I then play around with the different parameters (number of epochs, number of neurons in hidden layer, momentum, learning rate, etc.) to optimize this further?
6a. What about the Activation Function? Will Log-sigmoid do the trick, or should I try the other options my software has as well (threshold, hyperbolic tangent, zero-based log-sigmoid)?
6b. What is the difference between log-sigmoid and zero-based log-sigmoid?
Thanks!
First a little bit of meta content about the question itself (and not about the answers to your questions).
I have to laugh a little that you say 'I apologize if these questions are too "elementary"' and then proceed to ask the single most thorough and well-thought-out question I've seen as someone's first post on SO.
I wouldn't be too worried that you'll have people looking down their noses at you for asking this stuff.
This is a pretty big question in terms of the depth and range of knowledge required, especially the statistical knowledge needed and familiarity with Neural Networks.
You may want to try breaking this up into several questions distributed across the different StackExchange sites.
Off the top of my head, some of it definitely belongs on the statistics StackExchange, Cross Validated: https://stats.stackexchange.com/
You might also want to try out https://datascience.stackexchange.com/ , a beta site specifically targeting machine learning and related areas.
That said, there is some of this that I think I can help to answer.
Anything I haven't answered is something I don't feel qualified to help you with.
Question 1
How many records of the 9000 should I use to train the network? 1a. Should I completely randomize the selection of this training data or be more involved and make sure I include a variety of output scores and a wide range of each of the input variables?
Randomizing the selection of training data is probably not a good idea.
Keep in mind that truly random data includes clusters.
A random selection of students could happen to consist solely of those who scored above a 30 on the ACT exams, which could potentially result in a bias in your result.
Likewise, if you only select students whose SAT scores were below 700, the classifier you build won't have any capacity to distinguish between a student expected to score 720 and a student expected to score 780 -- they'll look the same to the classifier because it was trained without the relevant information.
You want to ensure a representative sample of your different inputs and your different outputs.
Because you're dealing with input variables that may be correlated, you shouldn't try to do anything too complex in selecting this data, or you could mistakenly introduce another bias in your inputs.
Namely, you don't want to select a training data set that consists largely of outliers.
I would recommend trying to ensure that your inputs cover all possible values for all of the variables you are observing, and all possible results for the output (the SAT scores), without constraining how these requirements are satisfied.
I'm sure there are algorithms out there designed to do exactly this, but I don't know them myself -- possibly a good question in and of itself for Cross Validated.
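As a hedged illustration of that "representative sample" idea (nothing NEURO XL-specific), one common trick is to bin the output and stratify the split on those bins; the file name, column name and bin edges below are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical table with the 21 input columns plus the SAT verbal score.
df = pd.read_csv("students.csv")

# Bin the 200-800 score into 100-point bands so every band is represented
# proportionally in the training set.
score_band = pd.cut(df["sat_verbal"], bins=range(200, 801, 100), include_lowest=True)

train, test = train_test_split(df, train_size=0.8, random_state=0,
                               stratify=score_band)

Stratifying only on the output keeps the selection simple and avoids the over-engineered sampling the paragraph above warns about.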
Question 3
Since the scores on the tests that I am using as inputs vary in scale (some are on 1-100, and others 1-20 for example), should I normalize all of the inputs to their respective z-scores? When is this recommended vs not recommended?
My understanding is that this is not recommended as the input to a Neural Network, but I may be wrong.
The convergence of the network should handle this for you.
Every node in the network will assign a weight to its inputs, multiply them by their weights, and sum those products as a core part of its computation.
That means that every node in the network is searching for a coefficient for each of its inputs.
To do this, all inputs will be converted to numeric values -- so conditions like gender will be translated into "0=MALE,1=FEMALE" or something similar.
For example, a node's metric might look like this at a given point in time:
2*ACT_SCORE + 0*GENDER + (-5)*VARSITY_SPORTS ...
The coefficients for each value are exactly what the network is searching for as it converges.
If you change the scale of a value, like ACT_SCORE, you just change the scale of the coefficient that will be found, by the reciprocal of that scaling factor.
The result should still be the same.
There are other concerns in terms of accuracy (computers have limited capacity to represent small fractions) and speed that may enter this, but not being familiar with NEURO XL, I can't say whether or not they apply for this technology.
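That rescaling claim is easy to check numerically. A small sketch with synthetic data (the numbers are invented; only the relationship between the two fitted coefficients matters):

import numpy as np

rng = np.random.default_rng(0)
act = rng.uniform(10, 36, 200)                    # ACT-like scores on their native scale
sat = 400 + 12 * act + rng.normal(0, 20, 200)     # synthetic target with noise

w_raw    = np.polyfit(act, sat, 1)[0]             # slope on the original scale
w_scaled = np.polyfit(act / 36.0, sat, 1)[0]      # slope after dividing the input by 36

print(w_raw, w_scaled / 36.0)                     # the two agree: only the coefficient's scale changes

In practice many practitioners still standardize inputs because it tends to help the optimizer converge, but as noted above, the underlying fit is the same either way.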
Question 4
I am predicting the actual score, but in reality, I'm NOT that concerned about the exact score but more of a range. Would my network be more accurate if I grouped the output scores into buckets and then tried to predict this number instead of the actual score?
This will reduce accuracy, although you should converge to a solution much faster with fewer possible outputs (scores).
Neural Networks actually describe very high-dimensional functions in their input variables.
If you reduce the granularity of that function's output space, you essentially state that you don't care about local minima and maxima in that function, especially around the borders between your output scores.
As a result, you are sacrificing information that may be an essential component of the "true" function that you are searching for.
I hope this has been helpful, but you really should break this question down into its many components and ask them separately on different sites -- potentially some of them do belong here on StackOverflow as well.

Mahout Log Likelihood similarity metric behaviour

The problem I'm trying to solve is finding the right similarity metric, rescorer heuristic and filtration level for my data. (I'm using 'filtration level' to mean the number of ratings that a user or item must have associated with it to make it into the production database.)
Setup
I'm using Mahout's Taste collaborative filtering framework. My data comes in the form of (user, item, rating) triplets, where ratings are contained in the set {1,2,3,4,5}. I'm using an itemBased recommender atop a logLikelihood similarity metric. I filter out users who rate fewer than 20 items from the production dataset. RMSE looks good (1.17ish) and there is no data capping going on, but there is an odd behavior that is undesirable and borders on error-like.
Question
First Call -- Generate a 'top items' list with no info from the user. To do this I use what I call a Centered Sum:
for i in items
    for r in i's ratings
        sum += r - center

where center = (5 + 1) / 2, if you allow ratings on a scale of 1 to 5, for example
I use a centered sum instead of average ratings to generate a top items list mainly because I want the number of ratings that an item has received to factor into the ranking.
Second Call -- I ask for 9 similar items for each of the top items returned in the first call. For each top item I asked about, 7 out of the 9 similar items returned are the same (as the similar-items sets returned for the other top items)!
Is it about time to try some rescoring? Maybe multiplying the similarity of two games by (number of co-rated items)/x, where x is tuned (around 50 or something to begin with).
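For clarity, that proposed rescoring heuristic written out as a tiny Python sketch (the cap at 1.0 is my own addition so that well-supported pairs are not inflated):

def rescore(similarity, num_co_rated, x=50):
    # Damp similarities that rest on only a few co-rated items; x is the knob to tune.
    return similarity * min(1.0, num_co_rated / x)

In Mahout itself this kind of adjustment would be plugged in through a rescorer, as mentioned at the top of the question.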
Thanks in advance fellas
You are asking for 50 items similar to some item X. Then you look for 9 similar items for each of those 50. And most of them are the same. Why is that surprising? Similar items ought to be similar to the same other items.
What's a "centered" sum? ranking by sum rather than average still gives you a relatively similar output if the number of items in the sum for each calculation is roughly similar.
What problem are you trying to solve? Because none of this seems to have a bearing on the recommender system you describe that you're using and works. Log-likelihood similarity is not even based on ratings.