PCA Example in SAS
Creating the Data Set
DATA appledat;
  INPUT farm $ yy nat kk pp shade water;
  LABEL yy='AppleTaste'
        nat='Sodium'
        kk='Potassium'
        pp='Phosphorus'
        shade='Shade'
        water='Water';
  DATALINES;
A 2876 20.0 38 2488 2.42 216
B 2078 11.1 13 2998 1.62 321
C 3052 19.8 31 3835 2.79 376
D 2265 13.9 19 2360 1.65 265
E 940 17.0 24 233 0.86 18
F 2815 16.9 26 3922 2.70 369
G 2661 11.6 16 4343 2.40 453
H 2181 14.3 22 3110 2.05 267
I 2052 10.5 13 2869 1.63 286
J 2064 18.2 31 2335 2.17 252
K 1551 8.3 8 1784 0.84 185
L 2338 20.4 36 2601 2.47 275
M 1753 8.7 18 2124 1.27 201
N 2110 7.5 4 4408 1.85 411
;
RUN;
Description
Before we start, a brief description of the data set. A panel judges the taste of apples grown under different levels of five variables: three relating to soil nutrients and two giving the amounts of shade and water. Our job is to come up with a model that predicts taste from these five inputs. In doing this analysis, it helps to know which of the variables matter most. Are any of them important? Are some combinations better than others? The approach we have seen for disentangling the joint effects of such variables is Principal Components Analysis (PCA).

The variables in the data set are:

yy: the dependent variable that we are trying to predict, i.e., taste
farm: an identifier for the farm the apples came from; we will not use it in our models, only as a plotting label
nat: the amount of sodium in the soil
kk: the amount of potassium in the soil
pp: the amount of phosphorus in the soil
shade: the amount of shade the apples get
water: the amount of water the apples get
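As an optional first look at how the five inputs move together, one could inspect their pairwise correlations before fitting anything. The step below is a minimal sketch and not part of the original analysis; it assumes the appledat data set created above and uses only the standard PROC CORR procedure.

```sas
/* Optional check: pairwise correlations among taste and the five inputs.
   Assumes the appledat data set from the DATA step above. */
PROC CORR DATA=appledat;
  VAR yy nat kk pp shade water;
RUN;
```

Large correlations among the inputs would suggest that their separate effects are hard to tell apart in an ordinary regression, which is the situation PCA is meant to help with.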
Regression on Original Variables
PROC GLM DATA=appledat; MODEL yy = nat kk pp shade water; RUN;
Applying PCA
PROC PRINCOMP DATA=appledat OUT=pcadat OUTSTAT=stadat; VAR nat kk pp shade water; RUN;
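For reference, here is a sketch of what the prin1-prin5 variables written to the OUT= data set contain. It assumes the default settings used in the call above (no COV or STD option), under which PROC PRINCOMP works from the correlation matrix of the VAR variables. The symbols are notation introduced here for illustration: x_{ik} is the value of input k for farm i, \bar{x}_k and s_k are that input's sample mean and standard deviation, R is the 5 x 5 correlation matrix of the inputs, and a_j, \lambda_j are its unit-length eigenvectors and eigenvalues.

```latex
\[
  \mathrm{prin}j_{\,i} \;=\; \sum_{k=1}^{5} a_{jk}\,\frac{x_{ik}-\bar{x}_k}{s_k},
  \qquad j = 1,\dots,5,
\]
\[
  R\,a_j = \lambda_j a_j, \qquad a_j^{\top} a_j = 1, \qquad
  \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_5 \ge 0,
\]
\[
  \operatorname{Var}(\mathrm{prin}j) = \lambda_j, \qquad
  \operatorname{Cov}(\mathrm{prin}j,\mathrm{prin}l) = 0 \ \ (j \ne l), \qquad
  \sum_{j=1}^{5} \lambda_j = 5 .
\]
```

Under these assumptions each principal component has mean zero, its variance equals the corresponding eigenvalue, and the five variances add up to the number of input variables.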
Basic Analysis
Means
PROC MEANS DATA=pcadat MEAN VAR MAXDEC=4; VAR prin1-prin5; RUN;
Correlations
PROC CORR DATA=pcadat; VAR yy prin1-prin5; RUN;
Plots
PROC PLOT DATA=pcadat; PLOT prin2*prin1 = farm; RUN;
Additional Analysis
Regression on all 5 Principal Components
PROC GLM DATA=pcadat; MODEL yy = prin1-prin5; RUN;
Regression on Largest 2 Principal Components
PROC GLM DATA=pcadat; MODEL yy = prin1 prin2; RUN;
Regression on Largest Principal Component
PROC GLM DATA=pcadat; MODEL yy = prin1; RUN;