This calculator performs comprehensive Principal Component Analysis (PCA), a powerful dimensionality reduction technique that transforms your multivariate data into a set of uncorrelated components. PCA helps you identify patterns, reduce complexity, and visualize high-dimensional data while retaining the most important information.
💡 Pro Tip: PCA works best with standardized data (recommended for variables with different scales). Use the scree plot and Kaiser criterion (eigenvalue > 1) to decide how many components to retain. For classification tasks, consider Linear Discriminant Analysis.
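To see why standardization matters, here is a minimal synthetic sketch (the variable names and scales are made up for illustration): when one variable is measured in much larger units, it dominates PC1 unless the data are standardized first.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two correlated variables on very different scales (age in years, income in dollars)
age = rng.normal(40, 10, 200)
income = 1000 * age + rng.normal(0, 5000, 200)
X = np.column_stack([age, income])

# Without standardization, the large-scale variable dominates PC1
raw = PCA().fit(X)
# With standardization, both variables contribute comparably
std = PCA().fit(StandardScaler().fit_transform(X))

print("Raw loadings on PC1:         ", raw.components_[0].round(3))
print("Standardized loadings on PC1:", std.components_[0].round(3))
```

On the raw data, PC1's loading on income is essentially 1 and the loading on age is near 0; after standardization, both variables get comparable weight.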
There is no single official standard for PCA biplot scaling. Different statistical software packages use different scaling conventions for displaying biplots:
R (biplot()): uses the scale parameter (0, 1, or values in between) to control arrow lengths.
Ready to explore your multivariate data? Try the example dataset (student test scores) to see PCA in action, or upload your own data to discover the underlying structure in your variables.
Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms correlated variables into a smaller set of uncorrelated variables called principal components. Each component is a linear combination of the original variables and captures as much variance as possible.
Eigenvalue (Variance Explained): λᵢ, the variance captured by the i-th component; it is an eigenvalue of the correlation (or covariance) matrix.
Principal Component (Linear Combination): PCᵢ = wᵢ₁x₁ + wᵢ₂x₂ + … + wᵢₚxₚ, where the weights wᵢⱼ are the loadings.
Proportion of Variance Explained: λᵢ / (λ₁ + λ₂ + … + λₚ), the share of total variance carried by component i.
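These quantities can be verified numerically. The sketch below (the data-generating step is an illustrative assumption) eigendecomposes the correlation matrix of synthetic correlated data and checks that the variance of each component's scores equals its eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(1)
# Three variables sharing a common latent factor
z = rng.normal(size=(100, 1))
X = np.hstack([z + 0.3 * rng.normal(size=(100, 1)) for _ in range(3)])

# Standardize, then eigendecompose the correlation matrix
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr = np.corrcoef(Xs, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
eigenvalues = eigenvalues[::-1]  # sort descending

# Proportion of variance explained: lambda_i / sum(lambda)
prop = eigenvalues / eigenvalues.sum()
print("Eigenvalues:", eigenvalues.round(3))
print("Proportion of variance:", prop.round(3))

# Scores (the principal components) are linear combinations of the standardized variables,
# and their sample variances equal the eigenvalues
scores = Xs @ eigenvectors[:, ::-1]
print("Score variances:", scores.var(axis=0, ddof=1).round(3))
```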
Drag the slider to rotate the red line. Try to find the angle that maximizes the variance (spread) of the projected blue dots. This is exactly what PCA does automatically - it finds the direction of maximum variance!
💡 Tip: The optimal angle (around 30°) gives the maximum variance. This would be the first principal component (PC1). A line perpendicular to this (around 120°) would capture the remaining variance and become PC2.
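The brute-force search the slider performs can be written in a few lines. In this sketch (synthetic 2-D data, made up for illustration), the best angle found by scanning matches the direction of the covariance matrix's leading eigenvector, which is exactly what PCA computes.

```python
import numpy as np

rng = np.random.default_rng(2)
# Correlated 2-D cloud, centered
x = rng.normal(size=300)
y = 0.6 * x + 0.4 * rng.normal(size=300)
X = np.column_stack([x, y]) - [x.mean(), y.mean()]

# Variance of the projection onto a line at each whole-degree angle
angles = np.deg2rad(np.arange(0, 180))
variances = [np.var(X @ [np.cos(a), np.sin(a)]) for a in angles]
best = np.argmax(variances)  # best angle in degrees

# Compare with the first eigenvector of the covariance matrix (PC1)
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = evecs[:, -1]
pca_angle = np.degrees(np.arctan2(pc1[1], pc1[0])) % 180

print(f"Best angle by brute force: {best} degrees")
print(f"PC1 angle: {pca_angle:.1f} degrees")
```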
The scree plot helps you decide how many principal components to keep. Look for the "elbow" where the curve flattens out - components before the elbow contain meaningful information (signal), while those after are likely just noise.
Signal Components (Keep)
Components 1-3 explain 79.6% of variance. These capture the meaningful patterns in your data.
Noise Components (Discard)
Components 4-10 add little value and likely represent measurement error or random variation.
Decision Rules: keep components with eigenvalues above 1 (Kaiser criterion), keep the components before the scree-plot elbow, or keep enough components to reach a cumulative-variance target (commonly 80–90%).
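The Kaiser criterion and a cumulative-variance threshold can be computed directly. This is a minimal sketch on synthetic data (the 3-factor structure and 80% target are illustrative assumptions, not part of the calculator):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# 10 observed variables driven by 3 shared latent factors plus noise
latent = rng.normal(size=(200, 3))
W = rng.normal(size=(3, 10))
X = latent @ W + 0.5 * rng.normal(size=(200, 10))

pca = PCA().fit(StandardScaler().fit_transform(X))
eig = pca.explained_variance_
cum = np.cumsum(pca.explained_variance_ratio_)

kaiser = int((eig > 1).sum())                     # Kaiser: eigenvalues above 1
threshold = int(np.searchsorted(cum, 0.80) + 1)   # smallest k reaching 80% cumulative variance
print(f"Kaiser criterion keeps {kaiser} components")
print(f"80% cumulative variance needs {threshold} components")
```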
Both PCA and linear regression find a "line of best fit," but they minimize different types of distances. Toggle between the two methods to see the key difference!
🔄 PCA (Green lines): Projects points perpendicularly onto the line. This treats all variables symmetrically—neither X nor Y is special. PCA finds the direction that captures maximum variance in the data.
✓ Use when: You want to reduce dimensions without treating any variable as the "outcome"
💡 Key Insight: Notice how the projection lines change! PCA's perpendicular projections are shorter overall, treating both axes equally. Regression's vertical projections only care about errors in the Y direction, which is perfect when you're trying to predict Y from X.
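The difference shows up directly in the fitted slopes. This sketch (synthetic data, illustrative only) fits both lines to the same centered cloud; the perpendicular (total least squares) fit gives a slope at least as steep as the ordinary regression slope, which is attenuated toward the X axis.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100)
y = 1.5 * x + rng.normal(size=100)
X = np.column_stack([x - x.mean(), y - y.mean()])

# Regression slope: minimizes vertical (Y-direction) squared errors
b_reg = (X[:, 0] @ X[:, 1]) / (X[:, 0] @ X[:, 0])

# PCA / total least squares slope: minimizes perpendicular squared distances,
# i.e. the direction of the covariance matrix's leading eigenvector
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = evecs[:, -1]
b_pca = pc1[1] / pc1[0]

print(f"Regression slope: {b_reg:.3f}")
print(f"PCA slope:        {b_pca:.3f}")
```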
library(tidyverse)
# Student test scores data
data <- tibble(
  math_score    = c(85, 78, 92, 88, 76, 95, 82, 89, 91, 73, 87, 94, 79, 86, 90),
  science_score = c(82, 75, 89, 85, 74, 92, 80, 86, 88, 70, 84, 91, 77, 83, 87),
  reading_score = c(88, 82, 95, 90, 79, 97, 85, 91, 93, 76, 89, 96, 81, 88, 92),
  writing_score = c(86, 80, 93, 88, 77, 94, 83, 89, 91, 74, 87, 95, 79, 86, 90),
  study_hours   = c(15, 10, 20, 17, 9, 22, 13, 18, 19, 8, 16, 21, 11, 15, 18)
)
# Perform PCA
pca_result <- prcomp(data, scale. = TRUE)
# View results
summary(pca_result)
# Scree plot
screeplot(pca_result, main = "Scree Plot", type = "lines")
# Biplot (base R; scale = 0 plots unscaled scores)
biplot(pca_result, main = "PCA Biplot", scale = 0)
# Component loadings
pca_result$rotation
# Component scores
head(pca_result$x)

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Student test scores data
data = pd.DataFrame({
    'math_score': [85, 78, 92, 88, 76, 95, 82, 89, 91, 73, 87, 94, 79, 86, 90],
    'science_score': [82, 75, 89, 85, 74, 92, 80, 86, 88, 70, 84, 91, 77, 83, 87],
    'reading_score': [88, 82, 95, 90, 79, 97, 85, 91, 93, 76, 89, 96, 81, 88, 92],
    'writing_score': [86, 80, 93, 88, 77, 94, 83, 89, 91, 74, 87, 95, 79, 86, 90],
    'study_hours': [15, 10, 20, 17, 9, 22, 13, 18, 19, 8, 16, 21, 11, 15, 18]
})
# Standardize the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Perform PCA
pca = PCA()
principal_components = pca.fit_transform(data_scaled)
# Variance explained
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Eigenvalues:", pca.explained_variance_)
# Component loadings
loadings = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i+1}' for i in range(len(pca.components_))],
    index=data.columns
)
print("Component loadings:")
print(loadings)
# Scree plot
plt.figure(figsize=(8, 5))
plt.plot(range(1, len(pca.explained_variance_) + 1),
         pca.explained_variance_, 'bo-', linewidth=2)
plt.axhline(y=1, color='r', linestyle='--', label='Kaiser criterion')
plt.xlabel('Principal Component')
plt.ylabel('Eigenvalue')
plt.title('Scree Plot')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
pc1 = principal_components[:, 0]
pc2 = principal_components[:, 1]
# Loadings for PC1 and PC2
loadings_pc1 = pca.components_[0]
loadings_pc2 = pca.components_[1]
# Scaling factor to make arrows visible
# (Try adjusting 2.5, 3, etc., depending on your data)
scaling_factor = 3
plt.figure(figsize=(10, 8))
# Scatter plot of PCA scores
plt.scatter(pc1, pc2, alpha=0.5)
# Add arrows for each variable
for i, feature in enumerate(data.columns):
    plt.arrow(
        0, 0,
        loadings_pc1[i] * scaling_factor,
        loadings_pc2[i] * scaling_factor,
        color='red',
        width=0.005,
        head_width=0.08
    )
    plt.text(
        loadings_pc1[i] * scaling_factor * 1.1,
        loadings_pc2[i] * scaling_factor * 1.1,
        feature,
        color='red',
        fontsize=12
    )
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('PCA Biplot with Loadings')
plt.grid(True, alpha=0.3)
plt.axhline(0, color='black', linewidth=0.5)
plt.axvline(0, color='black', linewidth=0.5)
plt.show()