Saving and reusing PCA eigenvectors in Stata

I performed a Principal Component Analysis (PCA) in Stata.
My dataset includes eight financial indicators that vary across 9 countries.
For example:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str7 Country double(Investment Profit Income Tax Repayment Leverage Interest Liquidity) int Year
"France" -.1916055239385184 .046331346724579184 .16438012750896466 .073106839282063 30.373216652548326 4.116650784492168 3.222219873614461 .01453109309122077 2010
"UK" -.09287803170279468 .10772082765154019 .19475363707485557 .05803923583546618 31.746409646181174 9.669982727208433 1.2958094802269167 .014273374324088752 2010
"US" -.06262935107629553 .08674901201182428 .1241593221865416 .13387194413811226 25.336612638526013 11.14330064161111 1.954785887176916 .008355601163285917 2010
"Italy" -.038025847122363045 .1523162032749684 .23885658237030563 .2057478638900476 31.02007902336988 2.9660938817562292 6.12544787693943 .011694993164234125 2010
"Germany" -.05454795914578491 .06287079763890834 .09347194572148769 .08730237262847926 35.614342337621174 12.03770488195981 1.1958205191308358 .012467084153714813 2010
"Spain " -.09133982259799572 .1520056836126315 .20905656056324853 .21054797530580743 30.133833346916546 2.0623245902645073 5.122615899157435 .013545432336873187 2010
"Sweden" -.05403262462960799 .20463787181576967 .22924827352771968 .05655833155565016 20.30540887860061 10.392313613725324 .8634381995636089 .008030624504967313 2010
"Norway " -.07560184571862992 .08383822093909514 .15469418498932822 .06569716455818478 29.568228705840234 14.383460621594622 1.5561013535825234 .012843159364225464 2010
"Algeria" -.0494187835163535 .056252436429004446 .09174672864585759 .08143181185307143 34.74103858167055 15.045254276254616 1.2074942921860699 .011578038401820303 2010
"France" -.03831442432584342 .14722819896988698 .22035417794604084 .12183886462162773 28.44763045286005 12.727100288710087 1.405629911115614 .011186908059399987 2011
"UK" -.05002189329928202 .16833493262244398 .2288402623558823 .04977050186975224 27.640103129372747 11.17376089844228 1.1764542835994092 .008386726178729322 2011
"US" -.0871005985124144 .10270482619857023 .1523559355903486 .06775742210623094 26.840586700880362 10.783899184031576 1.454011947763254 .013501919089967212 2011
"Italy" -.1069324103590126 -.5877872620957578 -.47469302172710803 .2004436360021364 23.133243742952658 5.3936761686065875 4.532771849692548 .012586313916956204 2011
"Germany" -.05851794344524515 .09960345907923154 .136805115392161 .1373407846168154 32.6182637042919 14.109738344526052 1.5077699357228835 .013200993625042274 2011
"Spain " -.10650743527105216 -.015785638597076792 .1808727613216441 .05038848927405154 28.22206251292902 10.839614113486853 1.5021425852392374 .012076771099482617 2011
"Sweden" -.09678946710644694 .11801761803893955 .18569993056826523 .1481844716617448 27.439283362903794 5.771154420635893 5.493437819181101 .013820243145673811 2011
"Norway " -.04263379351591438 .09931719473864983 .14469611775596314 .0796835513869996 26.68561168581991 14.06385602832082 1.5200488174887825 .01029136242440406 2011
"Algeria" -.04871983526465598 .2139061303228528 .2728647845448156 .056537570099712456 22.50263575072073 16.919641035094685 .7539881754626142 .009734650338902404 2011
end
I called my first component "indebtedness" and my second one "profitability", after rotation.
I have the same data for 2011, 2012, 2013, 2014 and so on. I want to use the matrix of weights Stata computed for 2010 and apply it to 2011, 2012, 2013 separately. My goal is to compare the indebtedness and the profitability between countries over time.
To do this, I use the estimates save and estimates use commands (Chapter 20 of the Stata manual on estimates, and the pca postestimation help).
However, I can't understand what Stata is saving. Is it saving the scores computed for 2010, or the eigenvalues and eigenvectors?
This is the code I use:
tempfile pca
save `pca'
use `pca' if Year==2010, clear
global xlist Investment Profit Income Tax Repayment Leverage Interest Liquidity
pca $xlist, components(2)
estimates save pcaest, replace
predict score
summarize score
use `pca' if Year==2011, clear
estimates use pcaest
predict score
summarize score
Does this method and code seem correct to you?
I'd also like to save the matrix of weights and create a new variable Z = b[1,1]*Investment + ....

Using your toy example for year 2010:
clear
input str7 Country double(Investment Profit Income Tax Repayment Leverage Interest Liquidity) int Year
"France" -.1916055239385184 .046331346724579184 .16438012750896466 .073106839282063 30.373216652548326 4.116650784492168 3.222219873614461 .01453109309122077 2010
"UK" -.09287803170279468 .10772082765154019 .19475363707485557 .05803923583546618 31.746409646181174 9.669982727208433 1.2958094802269167 .014273374324088752 2010
"US" -.06262935107629553 .08674901201182428 .1241593221865416 .13387194413811226 25.336612638526013 11.14330064161111 1.954785887176916 .008355601163285917 2010
"Italy" -.038025847122363045 .1523162032749684 .23885658237030563 .2057478638900476 31.02007902336988 2.9660938817562292 6.12544787693943 .011694993164234125 2010
"Germany" -.05454795914578491 .06287079763890834 .09347194572148769 .08730237262847926 35.614342337621174 12.03770488195981 1.1958205191308358 .012467084153714813 2010
"Spain " -.09133982259799572 .1520056836126315 .20905656056324853 .21054797530580743 30.133833346916546 2.0623245902645073 5.122615899157435 .013545432336873187 2010
"Sweden" -.05403262462960799 .20463787181576967 .22924827352771968 .05655833155565016 20.30540887860061 10.392313613725324 .8634381995636089 .008030624504967313 2010
"Norway " -.07560184571862992 .08383822093909514 .15469418498932822 .06569716455818478 29.568228705840234 14.383460621594622 1.5561013535825234 .012843159364225464 2010
"Algeria" -.0494187835163535 .056252436429004446 .09174672864585759 .08143181185307143 34.74103858167055 15.045254276254616 1.2074942921860699 .011578038401820303 2010
end
I get the following results:
local xlist Investment Profit Income Tax Repayment Leverage Interest Liquidity
pca `xlist', components(2)
Principal components/correlation Number of obs = 9
Number of comp. = 2
Trace = 8
Rotation: (unrotated = principal) Rho = 0.7468
--------------------------------------------------------------------------
Component | Eigenvalue Difference Proportion Cumulative
-------------+------------------------------------------------------------
Comp1 | 3.43566 .896796 0.4295 0.4295
Comp2 | 2.53887 1.23215 0.3174 0.7468
Comp3 | 1.30672 .750756 0.1633 0.9102
Comp4 | .555959 .472866 0.0695 0.9797
Comp5 | .0830926 .0181769 0.0104 0.9900
Comp6 | .0649157 .0526462 0.0081 0.9982
Comp7 | .0122695 .00975098 0.0015 0.9997
Comp8 | .00251849 . 0.0003 1.0000
--------------------------------------------------------------------------
Principal components (eigenvectors)
------------------------------------------------
Variable | Comp1 Comp2 | Unexplained
-------------+--------------------+-------------
Investment | 0.0004 -0.3837 | .6262
Profit | 0.3896 -0.3794 | .1131
Income | 0.4621 -0.1162 | .232
Tax | 0.4146 0.1236 | .3706
Repayment | -0.1829 0.4747 | .3131
Leverage | -0.4685 -0.2596 | .07464
Interest | 0.4580 0.2625 | .1045
Liquidity | -0.0082 0.5643 | .1913
------------------------------------------------
To see what items the pca command returns, type:
ereturn list
scalars:
e(N) = 9
e(f) = 2
e(rho) = .7468162625387222
e(trace) = 8
e(lndet) = -13.76082122673546
e(cond) = 36.93476257313668
macros:
e(cmdline) : "pca Investment Profit Income Tax Repayment Leverage Interest Liquidity, components(2)"
e(cmd) : "pca"
e(title) : "Principal components"
e(marginsnotok) : "_ALL"
e(estat_cmd) : "pca_estat"
e(rotate_cmd) : "pca_rotate"
e(predict) : "pca_p"
e(Ctype) : "correlation"
e(properties) : "nob noV eigen"
matrices:
e(sds) : 1 x 8
e(means) : 1 x 8
e(C) : 8 x 8
e(Psi) : 1 x 8
e(Ev) : 1 x 8
e(L) : 8 x 2
functions:
e(sample)
One way to save the returned matrix containing the eigenvectors as variables for the next year is to create a copy of the matrix and load the 2011 data:
matrix A = e(L)
clear
input str7 Country double(Investment Profit Income Tax Repayment Leverage Interest Liquidity) int Year
"France" -.03831442432584342 .14722819896988698 .22035417794604084 .12183886462162773 28.44763045286005 12.727100288710087 1.405629911115614 .011186908059399987 2011
"UK" -.05002189329928202 .16833493262244398 .2288402623558823 .04977050186975224 27.640103129372747 11.17376089844228 1.1764542835994092 .008386726178729322 2011
"US" -.0871005985124144 .10270482619857023 .1523559355903486 .06775742210623094 26.840586700880362 10.783899184031576 1.454011947763254 .013501919089967212 2011
"Italy" -.1069324103590126 -.5877872620957578 -.47469302172710803 .2004436360021364 23.133243742952658 5.3936761686065875 4.532771849692548 .012586313916956204 2011
"Germany" -.05851794344524515 .09960345907923154 .136805115392161 .1373407846168154 32.6182637042919 14.109738344526052 1.5077699357228835 .013200993625042274 2011
"Spain " -.10650743527105216 -.015785638597076792 .1808727613216441 .05038848927405154 28.22206251292902 10.839614113486853 1.5021425852392374 .012076771099482617 2011
"Sweden" -.09678946710644694 .11801761803893955 .18569993056826523 .1481844716617448 27.439283362903794 5.771154420635893 5.493437819181101 .013820243145673811 2011
"Norway " -.04263379351591438 .09931719473864983 .14469611775596314 .0796835513869996 26.68561168581991 14.06385602832082 1.5200488174887825 .01029136242440406 2011
"Algeria" -.04871983526465598 .2139061303228528 .2728647845448156 .056537570099712456 22.50263575072073 16.919641035094685 .7539881754626142 .009734650338902404 2011
end
Then you can simply use the svmat command:
svmat A
list A* if _n < 9
+-----------------------+
| A1 A2 |
|-----------------------|
1. | .0003921 -.383703 |
2. | .3895898 -.3793983 |
3. | .4621098 -.1162487 |
4. | .4146066 .1235683 |
5. | -.1828703 .4746658 |
|-----------------------|
6. | -.4685374 -.2596268 |
7. | .457974 .2624738 |
8. | -.0081538 .5643047 |
+-----------------------+
EDIT:
Revised according to comments:
use X1, clear
local xlist Investment Profit Income Tax Repayment Leverage Interest Liquidity
forvalues i = 1 / 5 {
    pca `xlist' if Year == 201`i', components(2)
    matrix A201`i' = e(L)
    * each indicator is weighted by its own element of the loadings matrix
    generate B201`i'1 = (A201`i'[1,1] * Investment) + (A201`i'[2,1] * Profit) + ///
        (A201`i'[3,1] * Income) + (A201`i'[4,1] * Tax) + ///
        (A201`i'[5,1] * Repayment) + (A201`i'[6,1] * Leverage) + ///
        (A201`i'[7,1] * Interest) + (A201`i'[8,1] * Liquidity) if Year == 201`i'
    generate B201`i'2 = (A201`i'[1,2] * Investment) + (A201`i'[2,2] * Profit) + ///
        (A201`i'[3,2] * Income) + (A201`i'[4,2] * Tax) + ///
        (A201`i'[5,2] * Repayment) + (A201`i'[6,2] * Leverage) + ///
        (A201`i'[7,2] * Interest) + (A201`i'[8,2] * Liquidity) if Year == 201`i'
}
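For intuition, the linear algebra behind reusing one year's weights can be sketched outside Stata. This illustrative Python/numpy snippet uses random stand-in data (not the real indicators) and assumes, as the stored e(means) and e(sds) suggest, that predict standardizes with the base-year mean and SD before multiplying by the stored loadings e(L):

```python
import numpy as np

rng = np.random.default_rng(0)

# random stand-ins for the 9 x 8 indicator matrices of two years
X2010 = rng.normal(size=(9, 8))
X2011 = rng.normal(size=(9, 8))

# correlation-matrix PCA on the base year: standardize, then eigendecompose
mu = X2010.mean(axis=0)
sd = X2010.std(axis=0, ddof=1)
Z2010 = (X2010 - mu) / sd
corr = np.corrcoef(Z2010, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
L = eigvecs[:, ::-1][:, :2]               # loadings of the two largest components

# fixed 2010 weights applied to both years; note the 2010 mean/sd are reused
scores2010 = Z2010 @ L
scores2011 = ((X2011 - mu) / sd) @ L
```

Because the PCA is run on the correlation matrix, reusing the 2010 loadings also means reusing the 2010 standardization; applying them to raw later-year values would change the scale of the scores.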


statistical test to compare 1st/2nd differences based on output from ggpredict / ggeffect

I want to conduct a simple two sample t-test in R to compare marginal effects that are generated by ggpredict (or ggeffect).
Both ggpredict and ggeffect provide nice output: (1) a table (predicted probability / standard error / CIs) and (2) a plot. However, they do not provide p-values for assessing the statistical significance of the marginal effects (i.e., is the difference between two predicted probabilities different from zero?). Further, since I'm working with interaction effects, I'm also interested in two-sample t-tests for the first differences (between two marginal effects) and the second differences.
Is there an easy way to run the relevant t tests with ggpredict/ggeffect output? Other options?
Attaching:
. reprex code with fictitious data
. To be specific: I want to test the following "1st differences":
--> .67 - .33=.34 (diff from zero?)
--> .5 - .5 = 0 (diff from zero?)
...and the following Second difference:
--> 0.0 - .34 = -.34 (diff from zero?)
See also Figure 12 / Table 3 in Mize 2019 (interaction effects in nonlinear models)
Thanks Scott
library(mlogit)
#> Loading required package: dfidx
#>
#> Attaching package: 'dfidx'
#> The following object is masked from 'package:stats':
#>
#> filter
library(sjPlot)
library(ggeffects)
# create ex. data set. 1 row per respondent (dataset shows 2 resp). Each resp answers 3 choice sets, w/ 2 alternatives in each set.
cedata.1 <- data.frame(
    id     = c(1,1,1,1,1,1,2,2,2,2,2,2), # respondent ID
    QES    = c(1,1,2,2,3,3,1,1,2,2,3,3), # choice set (with 2 alternatives)
    Alt    = c(1,2,1,2,1,2,1,2,1,2,1,2), # alternative 1 or 2 in the choice set
    LOC    = c(0,0,1,1,0,1,0,1,1,0,0,1), # attribute describing alternative; binary categorical
    SIZE   = c(1,1,1,0,0,1,0,0,1,1,0,1), # attribute describing alternative; binary categorical
    Choice = c(0,1,1,0,1,0,0,1,0,1,0,1), # whether the alternative is chosen (1) or not (0)
    gender = c(1,1,1,1,1,1,0,0,0,0,0,0)  # male or female (repeats for each individual)
)
# convert dep var Choice to factor as required by sjPlot
cedata.1$Choice <- as.factor(cedata.1$Choice)
cedata.1$LOC <- as.factor(cedata.1$LOC)
cedata.1$SIZE <- as.factor(cedata.1$SIZE)
# estimate model.
glm.model <- glm(Choice ~ LOC*SIZE, data=cedata.1, family = binomial(link = "logit"))
# estimate MEs for use in IE assessment
LOC.SIZE <- ggpredict(glm.model, terms = c("LOC", "SIZE"))
LOC.SIZE
#>
#> # Predicted probabilities of Choice
#> # x = LOC
#>
#> # SIZE = 0
#>
#> x | Predicted | SE | 95% CI
#> -----------------------------------
#> 0 | 0.33 | 1.22 | [0.04, 0.85]
#> 1 | 0.50 | 1.41 | [0.06, 0.94]
#>
#> # SIZE = 1
#>
#> x | Predicted | SE | 95% CI
#> -----------------------------------
#> 0 | 0.67 | 1.22 | [0.15, 0.96]
#> 1 | 0.50 | 1.00 | [0.12, 0.88]
#> Standard errors are on the link-scale (untransformed).
# plot
# plot(LOC.SIZE, connect.lines = TRUE)
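A rough way to test such a difference is a Wald z-test on the link scale (the printed SEs are link-scale, per the output's own note). This is a hedged sketch, not the ggeffects API: the helper names are mine, and it assumes the two predictions are uncorrelated, which ignores their covariance; a delta-method calculation using the model's vcov would be more exact.

```python
import math

def wald_diff_test(est1, se1, est2, se2):
    """Two-sided Wald z-test for H0: est1 - est2 == 0.

    Assumes the two estimates are uncorrelated; predictions from the
    same model are not, so treat the result as a rough screen.
    """
    diff = est1 - est2
    se = math.sqrt(se1 ** 2 + se2 ** 2)
    z = diff / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # normal CDF via erf
    return diff, se, z, p

def logit(p):
    return math.log(p / (1 - p))

# first difference at SIZE = 1: compare LOC = 0 vs LOC = 1 on the logit
# scale, using the link-scale SEs from the ggpredict output (1.22 and 1.00)
diff, se, z, p = wald_diff_test(logit(0.67), 1.22, logit(0.50), 1.00)
```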

Odds and Rate Ratio CIs in Hurdle Models with Factor-Factor Interactions

I am trying to build hurdle models with factor-factor interactions but can't figure out how to calculate the CIs of the odds or rate ratios among the various factor-factor combinations.
library(glmmTMB)
data(Salamanders)
m3 <- glmmTMB(count ~ spp + mined + spp * mined,
              zi = ~ spp + mined + spp * mined,
              family = truncated_poisson, data = Salamanders) # added in the interaction
pred_dat <- data.frame(spp   = rep(unique(Salamanders$spp), 2),
                       mined = rep(unique(Salamanders$mined),
                                   each = length(unique(Salamanders$spp))))
pred_dat # all factor-factor combos
Does anyone know how to appropriately calculate the CI around the ratios among these various factor-factor combos? I know how to calculate the actual ratio estimates (which consists of exponentiating the sum of 1-3 model coefficients, depending on the exact comparison being made) but I just can't seem to find any info on how to get the corresponding CI when an interaction is involved. If the ratio in question only requires exponentiating a single coefficient, the CI can easily be calculated; I just don't know how to do it when two or three coefficients are involved in calculating the ratio. Any help would be much appreciated.
EDIT:
I need the actual odds and rate ratios and their CIs, not the predicted values and their CIs. For example: exp(confint(m3)[2,3]) gives the rate ratio of sppPR/minedYes vs sppGP/minedYes, and c(exp(confint(m3)[2,1]), exp(confint(m3)[2,2])) gives the CI of that rate ratio. However, a number of the potential comparisons among the spp/mined combinations require summing multiple coefficients, e.g. exp(confint(m3)[2,3] + confint(m3)[8,3]), but in these circumstances I do not know how to calculate the rate ratio CI, because multiple coefficients are involved and each has its own SE estimate. How can I calculate those CIs?
If I understand your question correctly, this would be one way to obtain the uncertainty around the predicted/fitted values of the interaction term:
library(glmmTMB)
library(ggeffects)
data(Salamanders)
m3 <- glmmTMB(count ~ spp + mined + spp * mined,
              zi = ~ spp + mined + spp * mined,
              family = truncated_poisson, data = Salamanders) # added in the interaction
ggpredict(m3, c("spp", "mined"))
#>
#> # Predicted counts of count
#> # x = spp
#>
#> # mined = yes
#>
#> x | Predicted | SE | 95% CI
#> --------------------------------------
#> GP | 1.59 | 0.92 | [0.26, 9.63]
#> PR | 1.13 | 0.66 | [0.31, 4.10]
#> DM | 1.74 | 0.29 | [0.99, 3.07]
#> EC-A | 0.61 | 0.96 | [0.09, 3.96]
#> EC-L | 0.42 | 0.69 | [0.11, 1.59]
#> DF | 1.49 | 0.27 | [0.88, 2.51]
#>
#> # mined = no
#>
#> x | Predicted | SE | 95% CI
#> --------------------------------------
#> GP | 2.67 | 0.11 | [2.15, 3.30]
#> PR | 1.59 | 0.28 | [0.93, 2.74]
#> DM | 3.10 | 0.10 | [2.55, 3.78]
#> EC-A | 2.30 | 0.17 | [1.64, 3.21]
#> EC-L | 5.25 | 0.07 | [4.55, 6.06]
#> DF | 2.68 | 0.12 | [2.13, 3.36]
#> Standard errors are on link-scale (untransformed).
plot(ggpredict(m3, c("spp", "mined")))
Created on 2020-08-04 by the reprex package (v0.3.0)
The ggeffects-package calculates marginal effects / estimates marginal means (EMM) with confidence intervals for your model terms. ggpredict() computes these EMMs based on predict(), ggemmeans() wraps the fantastic emmeans package and ggeffect() uses the effects package.
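For the CI of a ratio built from several coefficients (the EDIT's actual question), the usual route is the variance of the linear combination on the link scale: Var(w'b) = w'Vw, where V is the coefficient variance-covariance matrix (in glmmTMB, vcov(m3)$cond for the conditional part), then exponentiate the endpoints. A sketch in Python with made-up numbers standing in for the two relevant coefficients and their vcov entries:

```python
import numpy as np

def lincom_ratio_ci(beta, vcov, weights, z=1.96):
    """CI for exp(w'beta): the variance of the linear combination on the
    link scale is w' V w; exponentiate the point estimate and endpoints."""
    beta, weights = np.asarray(beta), np.asarray(weights)
    est = weights @ beta
    se = np.sqrt(weights @ np.asarray(vcov) @ weights)
    return np.exp(est), np.exp(est - z * se), np.exp(est + z * se)

# hypothetical coefficients and vcov block for the two terms being summed
beta = [0.40, -0.25]
vcov = [[0.04, 0.01],
        [0.01, 0.09]]
ratio, lo, hi = lincom_ratio_ci(beta, vcov, [1, 1])
```

The covariance term is the whole point: summing the two SEs (or their squares) without the 2*Cov term would give the wrong interval.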

How to use estat vif in the right way

I have 2 questions concerning estat vif to test multicollinearity:
1. Is it correct that you can only calculate estat vif after the regress command?
2. If I execute this command, Stata only gives me the VIF of one independent variable. How do I get the VIF of all the independent variables?
Q1. I find estat vif documented under regress postestimation. If you can find it documented under any other postestimation heading, then it is applicable after that command.
Q2. You don't give any examples, reproducible or otherwise, of your problem. But estat vif by default gives a result for each predictor (independent variable).
. sysuse auto, clear
(1978 Automobile Data)
. regress mpg weight price
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(2, 71) = 66.85
Model | 1595.93249 2 797.966246 Prob > F = 0.0000
Residual | 847.526967 71 11.9369995 R-squared = 0.6531
-------------+---------------------------------- Adj R-squared = 0.6434
Total | 2443.45946 73 33.4720474 Root MSE = 3.455
------------------------------------------------------------------------------
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
weight | -.0058175 .0006175 -9.42 0.000 -.0070489 -.0045862
price | -.0000935 .0001627 -0.57 0.567 -.000418 .0002309
_cons | 39.43966 1.621563 24.32 0.000 36.20635 42.67296
------------------------------------------------------------------------------
. estat vif
Variable | VIF 1/VIF
-------------+----------------------
price | 1.41 0.709898
weight | 1.41 0.709898
-------------+----------------------
Mean VIF | 1.41
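For reference, estat vif computes, for each predictor, 1/(1 - R²) from an auxiliary regression of that predictor on all the others; with exactly two predictors the two VIFs coincide (both 1.41 above) because each auxiliary R² is the same squared correlation. A small illustrative Python sketch on simulated stand-in data, not the auto dataset:

```python
import numpy as np

def vif(X):
    """VIF for each column of X: regress it on the other columns plus a
    constant and return 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(1)
w = rng.normal(size=50)            # stand-in for weight
p = 0.6 * w + rng.normal(size=50)  # correlated stand-in for price
vifs = vif(np.column_stack([w, p]))  # one VIF per predictor
```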

Spark: All RDD data not getting saved to Cassandra table

Hi, I am trying to load RDD data into a Cassandra column family using Scala. Out of a total of 50 rows, only 28 are getting stored in the Cassandra table.
Below is the Code snippet:
val states = sc.textFile("state.txt") // list of all 50 states of the USA
var n = 0 // corrected to var
val statesRDD = states.map { a =>
  n = n + 1
  (n, a)
}
scala> statesRDD.count
res2: Long = 50
cqlsh:brs> CREATE TABLE BRS.state(state_id int PRIMARY KEY, state_name text);
statesRDD.saveToCassandra("brs","state", SomeColumns("state_id","state_name"))
// this statement saves only 28 rows out of 50, not sure why!!!!
cqlsh:brs> select * from state;
state_id | state_name
----------+-------------
23 | Minnesota
5 | California
28 | Nevada
10 | Georgia
16 | Kansas
13 | Illinois
11 | Hawaii
1 | Alabama
19 | Maine
8 | Oklahoma
2 | Alaska
4 | New York
18 | Virginia
15 | Iowa
22 | Wyoming
27 | Nebraska
20 | Maryland
7 | Ohio
6 | Colorado
9 | Florida
14 | Indiana
26 | Montana
21 | Wisconsin
17 | Vermont
24 | Mississippi
25 | Missouri
12 | Idaho
3 | Arizona
(28 rows)
Can anyone please help me in finding where the issue is?
Edit:
I understood why only 28 rows are getting stored in Cassandra: I made the first column a PRIMARY KEY, and it looks like in my code n is incremented to a maximum of 28 and then starts again from 1 up to 22 (50 in total), so rows with duplicate keys overwrite each other.
val states = sc.textFile("states.txt")
var n = 0
var statesRDD = states.map { a =>
  n += 1
  (n, a)
}
I tried making n an accumulator variable as well (viz. val n = sc.accumulator(0, "Counter")), but I don't see any difference in the output.
scala> statesRDD.foreach(println)
[Stage 2:> (0 + 0) / 2]
(1,New Hampshire)
(2,New Jersey)
(3,New Mexico)
(4,New York)
(5,North Carolina)
(6,North Dakota)
(7,Ohio)
(8,Oklahoma)
(9,Oregon)
(10,Pennsylvania)
(11,Rhode Island)
(12,South Carolina)
(13,South Dakota)
(14,Tennessee)
(15,Texas)
(16,Utah)
(17,Vermont)
(18,Virginia)
(19,Washington)
(20,West Virginia)
(21,Wisconsin)
(22,Wyoming)
(1,Alabama)
(2,Alaska)
(3,Arizona)
(4,Arkansas)
(5,California)
(6,Colorado)
(7,Connecticut)
(8,Delaware)
(9,Florida)
(10,Georgia)
(11,Hawaii)
(12,Idaho)
(13,Illinois)
(14,Indiana)
(15,Iowa)
(16,Kansas)
(17,Kentucky)
(18,Louisiana)
(19,Maine)
(20,Maryland)
(21,Massachusetts)
(22,Michigan)
(23,Minnesota)
(24,Mississippi)
(25,Missouri)
(26,Montana)
(27,Nebraska)
(28,Nevada)
I am curious to know what is causing n not to be updated past 28. Also, in what ways can I create a counter that I can use for numbering the records of an RDD?
There are some misconceptions about distributed systems embedded inside your question. The real heart of this is "How do I have a counter in a distributed system?"
The short answer is you don't. What your original code example does is effectively this:
Task One {
  var x = 0
  record 1:  x = 1
  record 2:  x = 2
}
Task Two {
  var x = 0
  record 20: x = 1
  record 21: x = 2
}
Each machine independently creates a new x variable set to 0, which gets incremented within its own context, independently of the other nodes.
For most use cases the "counter" question can be replaced with "How can I get a Unique Identifier per Record in a distributed system?"
For this most users end up using a UUID which can be generated on independent machines with infinitesimal chances of conflicts.
If the question is instead "How can I get a monotonically increasing unique identifier?"
then you can use zipWithIndex, which assigns consecutive indices (at the cost of an extra Spark job to compute partition sizes), or the cheaper zipWithUniqueId, which generates ids that are unique but not consecutive.
If you just want the records numbered, it's best to do it on the local system.
Edit: Why can't I use an accumulator?
Accumulators store their state (surprise) per task. You can see this with a little example:
val x = sc.accumulator(0, "x")
sc.parallelize(1 to 50).foreachPartition{ it => it.foreach(y => x+= 1); println(x)}
/*
6
7
6
6
6
6
6
7
*/
x.value
// res38: Int = 50
The accumulators combine their state after finishing their tasks, which means you can't use them as a global distributed counter.
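The overwrite behaviour is easy to reproduce locally. This small Python sketch (plain lists standing in for the two Spark tasks, with hypothetical state names) restarts the counter per "task", so the 50 (key, name) pairs collapse to 28 distinct primary keys, matching the question:

```python
# Simulate two tasks that each start their own counter at zero; the
# partition sizes (22 and 28) match the foreachPartition output above.
states = [f"state_{i}" for i in range(50)]  # stand-ins for the 50 names
partitions = [states[:22], states[22:]]

pairs = []
for part in partitions:
    n = 0  # fresh counter per task, like each task's copy of the closure's n
    for name in part:
        n += 1
        pairs.append((n, name))

# Cassandra upserts by primary key, so duplicate keys collapse to one row
table = dict(pairs)
```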

MATLAB sorting data and creating matrix by year

I have the following data sample:
2001 1
2000 1
1974 1
2007 1
2007 2
2007 6
2007 3
1994 1
1986 1
2007 1
I want to sort the data by year and then plot the values. I wrote code using for and find. However, with fprintf I only get the output in the command window, like this:
Ano-modelo 2009 | 88242 veiculos
Ano-modelo 2010 | 125822 veiculos
Ano-modelo 2011 | 132360 veiculos
Ano-modelo 2012 | 167984 veiculos
So, is there some alternative way, inside the for loop, to create a matrix c = [year; sum_vehicles]?
My code is the following:
dados = dlmread('c:\experimental\frota_detran\frota-detran_total.dat');
ano = 1922:2015;
for i = ano
    % find only the rows for this year
    pro = find(dados(:,1)==i);
    % keep only those rows
    qt = dados(pro,:);
    % sum the vehicle counts for this model year
    total = sum(qt(:,2));
    % display the total for each model year
    fprintf('%s %d %s %d %s \n', 'Ano-modelo', i, '|', total, 'veiculos');
end
Looks like you want the data sorted and aggregated:
[sorted, ~, ic] = unique(dados(:,1));
c = [sorted accumarray(ic, dados(:,2))];
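For comparison, the same unique-plus-aggregate pattern in Python/numpy, using the ten sample rows from the question:

```python
import numpy as np

# sample data from the question: [year, vehicle_count] rows
dados = np.array([[2001, 1], [2000, 1], [1974, 1], [2007, 1], [2007, 2],
                  [2007, 6], [2007, 3], [1994, 1], [1986, 1], [2007, 1]])

# unique sorted years plus an inverse index, then sum the counts per
# year; this mirrors the unique + accumarray pairing in the MATLAB answer
years, ic = np.unique(dados[:, 0], return_inverse=True)
c = np.column_stack([years, np.bincount(ic, weights=dados[:, 1]).astype(int)])
```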