Storing and accessing data for future analysis of intra and inter-operator variability - perl

I am about to start a short project which will involve a reasonable amount of data that I would like to store in a sensible manner - preferably a PostgreSQL database.
I will give a quick outline of the task. I will be processing and analysing data for a series of images, each of which has a unique ID. For each image, other operators and I will complete some simple image processing tasks, including adjusting angles and placing regions, with the end result being numerous quantitative parameters - e.g. mean, variance, etc. We expect there will be intra- and inter-operator variability in these measures, which is what I would like to analyse.
My initial plan was to store the data in the following way
ID  Operator  Attempt  Date      Result1  Result2  Reconstruction Method  Iterations
1   AB        1        01/01/13  x        x        FBP
1   AB        2        01/01/13  x        x        FBP
1   CD        1        01/01/13  x        x        FBP
1   CD        2        01/01/13  x        x        FBP
2   AB        1        01/01/13  x        x        FBP
2   AB        2        01/01/13  x        x        FBP
2   CD        1        01/01/13  x        x        FBP
2   CD        2        01/01/13  x        x        FBP
1   AB        1        11/01/13  x        x        FBP
1   AB        2        01/01/13  x        x        MLEM
Now, what I would like to compare (using correlation and Bland-Altman plots) are the differences in results when the same operator processes the same image (the rows must share ID, Date and Reconstruction Method): i.e., for each identical image and operator, how do attempts 1 and 2 differ? I want to do the same analysis for inter-operator variability: e.g., how does AB compare to CD for all images reconstructed with FBP, or EF to AB for all images reconstructed with MLEM? Images with the same unique ID but acquired on different dates or with different reconstruction techniques should not be compared, as they will contain differences outwith operator variability.
I have various R scripts to do the analysis, but what I am uncertain of is how to access my data and arrange it in a sensible format to carry out the analysis, and whether my planned storage method is optimal for doing so. In the past I have used Perl to access the database and pull out the numbers, but I have recently discovered RPostgreSQL, which may be more suitable.
I guess my question is, for such a database how can I pick out:
(a) all unique images (ID, acquired on same date with same reconstruction method) and compare the difference in all Result1 for operator AB (CD etc) for attempt 1 and 2
(b) the same thing comparing all Result1 attempt 1s between AB and CD, CD and EF etc
Here is an example of the output I would like for (a)
ID  Operator  Date      Result1 (Attempt 1)  Result1 (Attempt 2)
1   AB        01/01/13  10                   12
2   AB        01/01/13  22                   21
3   AB        03/01/13  15                   17
4   AB        04/01/13  27                   25
5   AB        06/01/13  14                   12
1   AB        11/01/13  3                    6
I would then analyse the last 2 columns
An example output for (b) comparing AB and CD
ID  Date      Result1 (Op: AB, Att: 1)  Result1 (Op: CD, Att: 1)
1   01/01/13  10                        12
2   01/01/13  22                        21
3   05/01/13  12                        14
1   11/01/13  19                        24

These are just a rough idea!
(a) all unique images (ID, acquired on same date with same
reconstruction method) and compare the difference in all Result1 for
operator AB (CD etc) for attempt 1 and 2
For (a) you can use SQL statements that make use of the DISTINCT and ORDER BY clauses (note: standard SQL has ORDER BY, not SORT BY).
For example
SELECT DISTINCT "ID", "Date", "Reconstruction Method" FROM YourTable ORDER BY "Date", "Reconstruction Method";
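DISTINCT alone won't put the two attempts side by side, though; a self-join does that. Here is a minimal sketch using SQLite through Python's sqlite3 module - the table and column names are assumptions for illustration, not the asker's actual schema:

```python
# Pair attempt 1 and attempt 2 of the same (id, operator, date, method)
# on one row via a self-join. Assumed schema, SQLite for portability.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE results
    (id INT, operator TEXT, attempt INT, date TEXT, method TEXT, result1 REAL)""")
con.executemany("INSERT INTO results VALUES (?,?,?,?,?,?)", [
    (1, "AB", 1, "01/01/13", "FBP", 10),
    (1, "AB", 2, "01/01/13", "FBP", 12),
    (2, "AB", 1, "01/01/13", "FBP", 22),
    (2, "AB", 2, "01/01/13", "FBP", 21),
])
rows = con.execute("""
    SELECT a.id, a.operator, a.date, a.result1 AS attempt1, b.result1 AS attempt2
    FROM results a
    JOIN results b
      ON a.id = b.id AND a.operator = b.operator
     AND a.date = b.date AND a.method = b.method
     AND a.attempt = 1 AND b.attempt = 2
    ORDER BY a.id""").fetchall()
for r in rows:
    print(r)   # e.g. (1, 'AB', '01/01/13', 10.0, 12.0)
```

The join itself is standard SQL and should run unchanged in PostgreSQL, e.g. via RPostgreSQL's dbGetQuery.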
(b) the same thing comparing all Result1 attempt 1s between AB and CD,
CD and EF etc
For (b) you can use SQL statements that make use of the WHERE clause.
For example
SELECT * FROM YourTable WHERE Operator = 'AB';
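For (b) the same self-join idea works with the operators fixed instead of the attempts. A sketch under the same assumed schema (SQLite via Python's sqlite3):

```python
# Pair operator AB's attempt 1 with operator CD's attempt 1 for the same
# (id, date, method). Assumed table/column names, as above.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE results
    (id INT, operator TEXT, attempt INT, date TEXT, method TEXT, result1 REAL)""")
con.executemany("INSERT INTO results VALUES (?,?,?,?,?,?)", [
    (1, "AB", 1, "01/01/13", "FBP", 10),
    (1, "CD", 1, "01/01/13", "FBP", 12),
    (2, "AB", 1, "01/01/13", "FBP", 22),
    (2, "CD", 1, "01/01/13", "FBP", 21),
])
rows = con.execute("""
    SELECT a.id, a.date, a.result1 AS ab, b.result1 AS cd
    FROM results a
    JOIN results b
      ON a.id = b.id AND a.date = b.date AND a.method = b.method
     AND a.attempt = 1 AND b.attempt = 1
     AND a.operator = 'AB' AND b.operator = 'CD'
    ORDER BY a.id""").fetchall()
print(rows)   # [(1, '01/01/13', 10.0, 12.0), (2, '01/01/13', 22.0, 21.0)]
```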

Running total using two columns

Given a table with data like:
A     B
Qty.  Running Total
5     5
5     10
5     15
I can create the running total using the formula =SUM($A$2:A2) and then drag down to get the running total after each quantity (here Qty.)
What can I do to calculate a running total using two columns, which may or may not be consecutive, as shown below:
A       B      C       D
Qty. 1  Other  Qty. 2  RT
2       blah   2       4
2       phew   2       8
3       xyz    2       13
Place the formula =SUM(A2,C2,D1) in cell D2. Don't worry that the formula refers to the non-numeric cell D1: unlike ordinary addition (=A2+C2+D1), the SUM() function ignores text and will not break. Now just drag the formula down.
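The same two-column running total is easy to check outside the spreadsheet. A small Python sketch using the example rows above (A = Qty. 1, C = Qty. 2, D = the running total):

```python
# Running total over two quantity columns, mirroring =SUM(A2, C2, D1)
# where D1 is the previous running total (treated as 0 on the first row).
rows = [(2, 2), (2, 2), (3, 2)]   # (Qty. 1, Qty. 2) per sheet row

running = 0
totals = []
for q1, q2 in rows:
    running += q1 + q2
    totals.append(running)

print(totals)   # [4, 8, 13]
```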

Correlation matrix from categorical and non-categorical variables (Matlab)

In Matlab, I have a dataset in a table of the form:
SCHOOL  SEX  AGE  ADDRESS  STATUS  JOB  GUARDIAN  HEALTH  GRADE
UR      F    12   U        FT      TEA  MOTHER    1       11
GB      M    22   R        FT      SER  FATHER    5       15
GB      M    12   R        FT      OTH  FATHER    3       12
GB      M    11   R        PT      POL  FATHER    2       10
Some variables are binary, some categorical, some numerical. Would it be possible to extract a correlation matrix from it, with the correlation coefficients between the variables? I tried using both corrcoef and corrplot from the Econometrics Toolbox, but I run into errors such as 'observed data must be convertible to type double'.
Does anyone have a take on how this can be done? Thank you.
As said above, you first need to transform your categorical and binary variables to numerical values.
So if your data is in a table (T) do something like:
T.SCHOOL = categorical(T.SCHOOL);
A worked example can be found in the Matlab help here, where they use the patients dataset, which seems to be similar to your data.
You could then transform your categorical columns to double:
T.SCHOOL = double(T.SCHOOL);
Be careful with double though, as it transforms categorical variables to arbitrary numbers, see the matlab forum.
Also note, that you are introducing order into your categorical variables, if you simply transform them to numbers. So if you for example transform JOB 'TEA', 'SER', 'OTH' to 1, 2, 3, etc. you are making the variable ordinal. 'TEA' is then < 'OTH'.
If you want to avoid that you can re-code the categorical columns into 'binary' dummy variables:
dummy_schools = dummyvar(T.SCHOOL);
Which returns a matrix of size nrows-by-numel(unique(T.SCHOOL)).
And then there is the whole discussion, whether it is useful to calculate correlations of categorical variables. Like here.
I hope this helps :)
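To see what dummy coding does without any toolbox, here is a pure-Python sketch of the dummyvar idea applied to the SCHOOL column from the question: each unique category becomes its own 0/1 column, so no artificial order is introduced.

```python
# Dummy-code a categorical column: one 0/1 indicator column per category.
school = ["UR", "GB", "GB", "GB"]

categories = sorted(set(school))            # ['GB', 'UR']
dummies = [[1 if s == c else 0 for c in categories] for s in school]

print(categories)   # ['GB', 'UR']
print(dummies)      # [[0, 1], [1, 0], [1, 0], [1, 0]]
```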
I think you need to make all the data numeric, i.e change/code the non-numerical columns to for example:
SCHOOL  SEX  AGE  ADDRESS  STATUS  JOB  GUARDIAN  HEALTH  GRADE
1       1    12   1        1       1    1         1       11
2       2    22   2        1       2    2         5       15
2       2    12   2        1       3    2         3       12
2       2    11   2        2       4    2         2       10
and then do the correlation.

Clustering matrix distance between 3 time series

I have a question about the application of clustering techniques, more concretely K-means.
I have a data frame with 3 sensors (A,B,C):
time      A   B   C
8:00:00   6   10  11
8:30:00   11  17  20
9:00:00   22  22  15
9:30:00   20  22  21
10:00:00  17  26  26
10:30:00  16  45  29
11:00:00  19  43  22
11:30:00  20  32  22
...       ..  ..  ..
And I want to group sensors that have the same behavior.
My question is: looking at the data frame above, should I calculate the correlation between each pair of columns and then apply the Euclidean distance to this correlation matrix, obtaining a 3-by-3 matrix of distance values?
Or should I transpose my data frame and then compute the dist() matrix with the Euclidean metric only, which will also give a 3-by-3 matrix of distance values?
You have just three sensors. That means you'll need three values: d(A,B), d(B,C) and d(A,C). Any "clustering" here does not seem to make sense to me, certainly not k-means. K-means is for points (!) in R^d, for small d.
Choose any form of time series similarity that you like. Could be simply correlation, but also DTW and the like.
Q1: No. Why: The correlation is not needed here.
Q2: No. Why: I'd calculate the distances differently
For the first row, R's built-in dist() function (which uses Euclidean distance by default)
dist(c(6, 10, 11))
gives you the distances between each pair of values:
  1 2
2 4
3 5 1
Items 2 and 3 are closest to each other. That's simple.
But there is no single way to calculate the distance between a point and a group of points. There you need a linkage function (min/max/average/...)
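To make the 3-by-3 idea concrete, here is a pure-Python sketch using the eight sensor readings from the question and plain Euclidean distance between whole series (correlation or DTW could be slotted in the same way):

```python
# Pairwise Euclidean distance between the three sensor series A, B, C,
# using the eight rows shown in the question's data frame.
import math

A = [6, 11, 22, 20, 17, 16, 19, 20]
B = [10, 17, 22, 22, 26, 45, 43, 32]
C = [11, 20, 15, 21, 26, 29, 22, 22]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

for name, (u, v) in {"A-B": (A, B), "A-C": (A, C), "B-C": (B, C)}.items():
    print(name, round(dist(u, v), 1))
```

On these numbers A and C come out closest (about 20.5), with B the outlier, so A and C behave most alike under this metric.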
What I would do using R's built-in kmeans() function:
Ignore the date column,
(assuming there are no NA values in any A,B,C columns)
scale the data if necessary (here they all seem to have same order of magnitude)
perform KMeans analysis on the A,B,C columns, with k = 1...n ; evaluate results
perform a final KMeans with your suitable choice of k
get the cluster assignments for each row
put them in a new column to the right of C
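The steps above can be sketched in plain Python (fixed starting centroids for reproducibility; in practice you would just call R's kmeans() or a library implementation, and the starting points and k are choices, not prescriptions):

```python
# Minimal k-means (k = 2) on the rows of the A, B, C columns.
# Deterministic start so the result is reproducible; assumes no empty cluster.
rows = [[6, 10, 11], [11, 17, 20], [22, 22, 15], [20, 22, 21],
        [17, 26, 26], [16, 45, 29], [19, 43, 22], [20, 32, 22]]

def assign(rows, centroids):
    labels = []
    for r in rows:
        d = [sum((a - b) ** 2 for a, b in zip(r, c)) for c in centroids]
        labels.append(d.index(min(d)))
    return labels

def update(rows, labels, k):
    cents = []
    for j in range(k):
        members = [r for r, l in zip(rows, labels) if l == j]
        cents.append([sum(col) / len(members) for col in zip(*members)])
    return cents

centroids = [rows[0], rows[5]]          # fixed initial centroids
for _ in range(10):
    labels = assign(rows, centroids)
    centroids = update(rows, labels, k=2)

print(labels)   # [0, 0, 0, 0, 1, 1, 1, 1] with this start
```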

Calculating group means with own group excluded in MATLAB

To be generic the issue is: I need to create group means that exclude own group observations before calculating the mean.
As an example: let's say I have firms, products and product characteristics. Each firm (f=1,...,F) produces several products (i=1,...,I). I would like to create a group mean for a certain characteristic of the product i of firm f, using all products of all firms, excluding firm f product observations.
So I could have a dataset like this:
firm  prod  width
1     1     30
1     2     10
1     3     20
2     1     25
2     2     15
2     4     40
3     2     10
3     4     35
To reproduce the table:
firm=[1,1,1,2,2,2,3,3]
prod=[1,2,3,1,2,4,2,4]
hp=[30,10,20,25,15,40,10,35]
x=[firm' prod' hp']
Then I want to estimate a mean which will use values of all products of all other firms, that is excluding all firm 1 products. In this case, my grouping is at the firm level. (This mean is to be used as an instrumental variable for the width of all products in firm 1.)
So, the mean that I should find is: (25+15+40+10+35)/5=25
Then repeat the process for other firms.
firm  prod  width  mean_desired
1     1     30     25
1     2     10     25
1     3     20     25
2     1     25
2     2     15
2     4     40
3     2     10
3     4     35
I guess my biggest difficulty is to exclude the own firm values.
This question is related to this page here: Calculating group mean/medians in MATLAB where group ID is in a separate column. But here, we do not exclude the own group.
p.s.: just out of curiosity if anyone works in economics, I am actually trying to construct Hausman or BLP instruments.
Here's a way that avoids loops, but may be memory-expensive. Let x denote your three-column data matrix.
m = bsxfun(@ne, x(:,1).', unique(x(:,1))); % or m = ~sparse(x(:,1), 1:size(x,1), true);
result = m*x(:,3);
result = result./sum(m,2);
This creates a zero-one matrix m such that each row of m multiplied by the width column of x (second line of code) gives the sum of the other groups' widths. m is built by comparing each entry in the firm column of x with the unique values of that column (first line). Dividing by the respective count of other-group observations (third line) gives the desired result.
If you need the results repeated as per the original firm column, use result(x(:,1))
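For comparison, the same leave-own-group-out mean in plain Python, using the (firm, width) data from the question:

```python
# For each firm f: mean of widths over all rows NOT belonging to f,
# computed from the overall total minus the group's own sum and count.
firm  = [1, 1, 1, 2, 2, 2, 3, 3]
width = [30, 10, 20, 25, 15, 40, 10, 35]

total, n = sum(width), len(width)
group_sum, group_n = {}, {}
for f, w in zip(firm, width):
    group_sum[f] = group_sum.get(f, 0) + w
    group_n[f] = group_n.get(f, 0) + 1

mean_excl = {f: (total - group_sum[f]) / (n - group_n[f]) for f in group_sum}
print(mean_excl)   # firm 1 -> 25.0, firm 2 -> 21.0, firm 3 -> 23.33...
```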

Matlab NN inputs & output manipulation

Assume I have this matrix, A :
A=[ 25 11 2010 10 23 75
30 11 2010 11 24 45
31 12 2010 19 24 44
31 12 2010 22 27 32
1 1 2011 14 27 27
2 12 2011 15 28 30
3 12 2011 16 24 42 ];
The first 5 columns represent the inputs of some measured parameters and the last column is the corresponding output. The number of rows is the number of measurements taken.
I want to use Matlab's neural network GRNN with the function newgrnn (or any other NN function) to train on the data up to the 5th row and then test on the inputs of the remaining 2 rows to evaluate their corresponding outputs. I have tried many times, but it always gives me an error and the program does not run correctly. I have looked at the newgrnn help example, but it only uses one input while I have 5 inputs in this example.
My question is: how do we put the inputs and the output into the newgrnn function? Actually, I have a very large matrix with 22 inputs and one output (its size is 26352 by 23), but the above is only a small example.
Since you haven't given any examples of what you've tried and what errors you get from your attempts, I'll have to give you a fairly generic answer.
Have a look at the newgrnn help file.
net = newgrnn(P,T,spread) takes three inputs:
P      - R-by-Q matrix of Q input vectors
T      - S-by-Q matrix of Q target class vectors
spread - spread of radial basis functions (default = 1.0)
So if the last column of your matrix A always holds the outputs (target class vectors), then the targets are A(1:5,end) and the inputs are A(1:5,1:end-1) (MATLAB indexes with parentheses, not square brackets). These say "first 5 rows of A, last column" and "first 5 rows of A, all but the last column" respectively. Note that newgrnn expects one column per sample (P is R-by-Q), so these row-per-sample blocks must be transposed.
Then (simply following the example in the newgrnn help file; you will have to tweak to your own particular A):
net = newgrnn(A(1:5,1:end-1)', A(1:5,end)');
% predict new values
Y = sim(net, A(6:7,1:end-1)')
I think you should also read the Matlab help file for indexing arrays and matrices.
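For intuition (not a replacement for the toolbox): a GRNN prediction is essentially a Gaussian-kernel-weighted average of the training targets. Here is a pure-Python sketch of that mechanics on the first five rows of A, with the caveat that unscaled features such as the year will dominate the distances in a real application:

```python
# Nadaraya-Watson style prediction, the core idea behind a GRNN:
# weight each training target by a Gaussian kernel of the input distance.
import math

train_X = [[25, 11, 2010, 10, 23], [30, 11, 2010, 11, 24],
           [31, 12, 2010, 19, 24], [31, 12, 2010, 22, 27],
           [1, 1, 2011, 14, 27]]
train_y = [75, 45, 44, 32, 27]

def grnn_predict(x, X, y, spread=1.0):
    weights = [math.exp(-sum((a - b) ** 2 for a, b in zip(x, xi))
                        / (2 * spread ** 2)) for xi in X]
    return sum(w * t for w, t in zip(weights, y)) / sum(weights)

# Predict row 6 of A; with this small spread the nearest training row
# dominates, so the result is close to its target, 27.
print(grnn_predict([2, 12, 2011, 15, 28], train_X, train_y))
```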