Building Your First Stock Forecasting Model
Building your first stock forecasting model is an exciting journey into the intersection of finance, data science, and machine learning. While professional systems are complex, you can create a functional forecasting model that teaches fundamental concepts and provides a foundation for more sophisticated work. This step-by-step tutorial guides you through the entire process, from data acquisition to model validation.
Prerequisites and Tools
Before diving in, ensure you have the necessary tools and knowledge:
Required Knowledge
- Basic Python programming
- Understanding of pandas DataFrames
- Familiarity with machine learning concepts (training, testing, overfitting)
- Basic understanding of stock markets (what prices, volume, and returns mean)
Required Libraries
- yfinance: For downloading stock data
- pandas: Data manipulation
- numpy: Numerical computations
- scikit-learn: Machine learning models and preprocessing
- matplotlib/seaborn: Visualization
Install with: pip install yfinance pandas numpy scikit-learn matplotlib seaborn
Step 1: Data Acquisition
First, we need historical stock data. We'll use yfinance to download data for Apple (AAPL) as our example:
Download Historical Data
import yfinance as yf
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
# Download 5 years of Apple stock data
ticker = "AAPL"
end_date = datetime.now()
start_date = end_date - timedelta(days=5*365)
# Download data (auto_adjust=False keeps the separate Adj Close column used
# throughout this tutorial; newer yfinance versions adjust prices by default)
data = yf.download(ticker, start=start_date, end=end_date, auto_adjust=False)
# Some yfinance versions return a (price, ticker) column MultiIndex even for
# a single ticker; flatten it so data['Adj Close'] works as expected
if isinstance(data.columns, pd.MultiIndex):
    data.columns = data.columns.get_level_values(0)
print(f"Downloaded {len(data)} days of data")
print(data.head())
This gives us a DataFrame with columns: Open, High, Low, Close, Volume, and Adj Close (adjusted for splits and dividends).
Understanding the Data
- Close: The price at market close
- Adj Close: Adjusted for corporate actions such as splits and dividends (use this for analysis; see the quick check after this list)
- Volume: Number of shares traded
- High/Low: Trading range for the day
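A quick way to see what the adjustment does is to compare the two columns directly. This small sketch, using the data DataFrame downloaded above, shows the ratio drifting below 1.0 further back in history as past dividends and splits are folded into the adjusted series:
# The adjusted series folds dividends and splits back into the price,
# so the ratio drifts away from 1.0 the further back you look
ratio = data['Adj Close'] / data['Close']
print(ratio.iloc[[0, len(ratio) // 2, -1]])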
Step 2: Feature Engineering
Raw price data isn't directly useful for machine learning. We need to create features (predictive variables) from the data.
Calculate Technical Indicators
def create_features(df):
"""Create technical indicators as features"""
df = df.copy()
# Returns (percentage change)
df['returns'] = df['Adj Close'].pct_change()
# Moving averages
df['SMA_20'] = df['Adj Close'].rolling(window=20).mean()
df['SMA_50'] = df['Adj Close'].rolling(window=50).mean()
# Price position relative to moving averages
df['price_to_SMA20'] = df['Adj Close'] / df['SMA_20']
df['price_to_SMA50'] = df['Adj Close'] / df['SMA_50']
# Volatility (20-day standard deviation of returns)
df['volatility'] = df['returns'].rolling(window=20).std()
# RSI (Relative Strength Index)
delta = df['Adj Close'].diff()
gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
rs = gain / loss
df['RSI'] = 100 - (100 / (1 + rs))
# Volume features
df['volume_ratio'] = df['Volume'] / df['Volume'].rolling(window=20).mean()
# Price momentum (return over different periods)
df['momentum_5'] = df['Adj Close'].pct_change(periods=5)
df['momentum_20'] = df['Adj Close'].pct_change(periods=20)
# MACD (Moving Average Convergence Divergence)
exp1 = df['Adj Close'].ewm(span=12, adjust=False).mean()
exp2 = df['Adj Close'].ewm(span=26, adjust=False).mean()
df['MACD'] = exp1 - exp2
df['MACD_signal'] = df['MACD'].ewm(span=9, adjust=False).mean()
return df
# Apply feature engineering
data = create_features(data)
print(f"Created {len(data.columns)} total columns")
print(data.columns.tolist())
Create Target Variable
We need to define what we're predicting. Let's predict whether the price will be higher 5 days from now (binary classification):
# Create target: 1 if the price is higher 5 days from now, 0 otherwise
future_price = data['Adj Close'].shift(-5)
data['target'] = (future_price > data['Adj Close']).astype(int)
# Remove rows with NaN values from the rolling calculations, plus the final
# 5 rows, whose future price is unknown (comparing against NaN would silently
# label them 0 rather than producing NaN)
data = data[future_price.notna()].dropna()
print(f"After cleaning: {len(data)} samples")
print(f"Target distribution: {data['target'].value_counts()}")
Step 3: Prepare Data for Training
Now we split our data into features (X) and target (y), then create training and test sets.
Select Features
# Define which columns are features (exclude target and raw price data)
feature_columns = [
'returns', 'price_to_SMA20', 'price_to_SMA50', 'volatility',
'RSI', 'volume_ratio', 'momentum_5', 'momentum_20', 'MACD', 'MACD_signal'
]
X = data[feature_columns]
y = data['target']
print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")
Train-Test Split (Time-Series Aware)
# For time series, we MUST split chronologically, not randomly
split_point = int(len(data) * 0.8)
X_train = X[:split_point]
X_test = X[split_point:]
y_train = y[:split_point]
y_test = y[split_point:]
print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Train period: {data.index[0]} to {data.index[split_point-1]}")
print(f"Test period: {data.index[split_point]} to {data.index[-1]}")
Feature Scaling
from sklearn.preprocessing import StandardScaler
# Fit scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_columns, index=X_test.index)
Step 4: Train Multiple Models
Let's train three different models and compare their performance:
1. Logistic Regression (Baseline)
from sklearn.linear_model import LogisticRegression
# Train logistic regression
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)
# Predictions
lr_train_pred = lr_model.predict(X_train_scaled)
lr_test_pred = lr_model.predict(X_test_scaled)
# Probabilities (useful for analysis)
lr_train_proba = lr_model.predict_proba(X_train_scaled)[:, 1]
lr_test_proba = lr_model.predict_proba(X_test_scaled)[:, 1]
2. Random Forest
from sklearn.ensemble import RandomForestClassifier
# Train random forest
rf_model = RandomForestClassifier(
n_estimators=100,
max_depth=10,
random_state=42,
n_jobs=-1
)
rf_model.fit(X_train_scaled, y_train)
# Predictions (we need training-set probabilities for the evaluation step too)
rf_train_pred = rf_model.predict(X_train_scaled)
rf_test_pred = rf_model.predict(X_test_scaled)
rf_train_proba = rf_model.predict_proba(X_train_scaled)[:, 1]
rf_test_proba = rf_model.predict_proba(X_test_scaled)[:, 1]
3. Gradient Boosting (XGBoost)
# Install if needed: pip install xgboost
import xgboost as xgb
# Train XGBoost
xgb_model = xgb.XGBClassifier(
n_estimators=100,
max_depth=5,
learning_rate=0.1,
random_state=42
)
xgb_model.fit(X_train_scaled, y_train)
# Predictions (again including training-set probabilities for evaluation)
xgb_train_pred = xgb_model.predict(X_train_scaled)
xgb_test_pred = xgb_model.predict(X_test_scaled)
xgb_train_proba = xgb_model.predict_proba(X_train_scaled)[:, 1]
xgb_test_proba = xgb_model.predict_proba(X_test_scaled)[:, 1]
Step 5: Evaluate Model Performance
Now we evaluate how well our models perform:
Calculate Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
def evaluate_model(y_true, y_pred, y_proba, model_name, dataset_name):
"""Calculate and print evaluation metrics"""
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_proba)
print(f"\n{model_name} - {dataset_name} Set:")
print(f" Accuracy: {accuracy:.4f}")
print(f" Precision: {precision:.4f}")
print(f" Recall: {recall:.4f}")
print(f" F1 Score: {f1:.4f}")
print(f" ROC AUC: {auc:.4f}")
return accuracy, precision, recall, f1, auc
# Evaluate all models
print("="*50)
print("MODEL EVALUATION")
print("="*50)
# Logistic Regression
evaluate_model(y_train, lr_train_pred, lr_train_proba, "Logistic Regression", "Training")
evaluate_model(y_test, lr_test_pred, lr_test_proba, "Logistic Regression", "Test")
# Random Forest
evaluate_model(y_train, rf_train_pred, rf_train_proba, "Random Forest", "Training")
evaluate_model(y_test, rf_test_pred, rf_test_proba, "Random Forest", "Test")
# XGBoost
evaluate_model(y_train, xgb_train_pred, xgb_train_proba, "XGBoost", "Training")
evaluate_model(y_test, xgb_test_pred, xgb_test_proba, "XGBoost", "Test")
Understanding the Metrics
- Accuracy: Percentage of correct predictions (can be misleading if classes are imbalanced)
- Precision: Of positive predictions, how many were actually correct?
- Recall: Of actual positive cases, how many did we catch?
- F1 Score: Harmonic mean of precision and recall (balanced metric)
- ROC AUC: Area under the receiver operating characteristic curve (measures overall performance)
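If the precision/recall definitions feel abstract, a confusion matrix makes them concrete. Here is a minimal sketch using the XGBoost test predictions from above:
from sklearn.metrics import confusion_matrix
# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
tn, fp, fn, tp = confusion_matrix(y_test, xgb_test_pred).ravel()
print(f"Precision = TP / (TP + FP) = {tp / (tp + fp):.4f}")
print(f"Recall    = TP / (TP + FN) = {tp / (tp + fn):.4f}")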
Step 6: Feature Importance Analysis
Understanding which features drive predictions is crucial:
import matplotlib.pyplot as plt
import seaborn as sns
# Get feature importance from Random Forest
feature_importance = pd.DataFrame({
'feature': feature_columns,
'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
# Plot
plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature')
plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance')
plt.tight_layout()
plt.savefig('feature_importance.png')
plt.show()
print("\nTop 5 Most Important Features:")
print(feature_importance.head())
Step 7: Backtesting (Simulated Trading)
Let's simulate how the model would perform in a simple trading strategy:
Simple Trading Strategy
def backtest_strategy(data, predictions, probabilities, threshold=0.5):
    """
    Simple strategy: go long when the predicted probability of an up move
    exceeds the threshold, hold cash otherwise
    """
    # Create results DataFrame
    results = pd.DataFrame(index=data.index)
    # Forward 5-day return for each day (the same horizon as the target)
    results['actual_return'] = data['Adj Close'].pct_change(5).shift(-5)
    results['prediction'] = predictions
    results['probability'] = probabilities
    # Only trade when confidence is above threshold
    results['signal'] = (results['probability'] > threshold).astype(int)
    # Calculate strategy returns
    results['strategy_return'] = results['actual_return'] * results['signal']
    # Calculate cumulative returns
    # Caveat: consecutive 5-day returns overlap, so compounding them daily
    # overstates both curves; treat these as a relative comparison, not an
    # achievable equity curve
    results['buy_hold_cumulative'] = (1 + results['actual_return']).cumprod()
    results['strategy_cumulative'] = (1 + results['strategy_return']).cumprod()
# Remove NaN values
results = results.dropna()
# Calculate metrics
total_return_bh = results['buy_hold_cumulative'].iloc[-1] - 1
total_return_strategy = results['strategy_cumulative'].iloc[-1] - 1
sharpe_bh = results['actual_return'].mean() / results['actual_return'].std() * np.sqrt(252/5)
sharpe_strategy = results['strategy_return'].mean() / results['strategy_return'].std() * np.sqrt(252/5)
print(f"\nBacktest Results:")
print(f" Buy & Hold Return: {total_return_bh:>8.2%}")
print(f" Strategy Return: {total_return_strategy:>8.2%}")
print(f" Buy & Hold Sharpe: {sharpe_bh:>8.2f}")
print(f" Strategy Sharpe: {sharpe_strategy:>8.2f}")
print(f" Trades Taken: {results['signal'].sum():>8.0f}")
print(f" Win Rate: {(results[results['signal']==1]['actual_return'] > 0).mean():>8.2%}")
return results
# Backtest on test set
test_data = data[split_point:]
backtest_results = backtest_strategy(test_data, xgb_test_pred, xgb_test_proba)
# Plot cumulative returns
plt.figure(figsize=(12, 6))
plt.plot(backtest_results['buy_hold_cumulative'], label='Buy & Hold', linewidth=2)
plt.plot(backtest_results['strategy_cumulative'], label='ML Strategy', linewidth=2)
plt.title('Cumulative Returns: Buy & Hold vs ML Strategy')
plt.xlabel('Date')
plt.ylabel('Cumulative Return')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('backtest_results.png')
plt.show()
Step 8: Walk-Forward Validation
For a more realistic evaluation, implement walk-forward validation:
from sklearn.model_selection import TimeSeriesSplit
def walk_forward_validation(X, y, model, n_splits=5):
"""
Perform walk-forward validation
"""
tscv = TimeSeriesSplit(n_splits=n_splits)
results = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
# Split data
X_train_fold = X.iloc[train_idx]
X_test_fold = X.iloc[test_idx]
y_train_fold = y.iloc[train_idx]
y_test_fold = y.iloc[test_idx]
# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_fold)
X_test_scaled = scaler.transform(X_test_fold)
# Train
model.fit(X_train_scaled, y_train_fold)
# Predict
y_pred = model.predict(X_test_scaled)
y_proba = model.predict_proba(X_test_scaled)[:, 1]
# Evaluate
accuracy = accuracy_score(y_test_fold, y_pred)
auc = roc_auc_score(y_test_fold, y_proba)
results.append({
'fold': fold + 1,
'accuracy': accuracy,
'auc': auc,
'train_size': len(train_idx),
'test_size': len(test_idx)
})
print(f"Fold {fold+1}: Accuracy={accuracy:.4f}, AUC={auc:.4f}")
results_df = pd.DataFrame(results)
print(f"\nAverage Accuracy: {results_df['accuracy'].mean():.4f} (+/- {results_df['accuracy'].std():.4f})")
print(f"Average AUC: {results_df['auc'].mean():.4f} (+/- {results_df['auc'].std():.4f})")
return results_df
# Perform walk-forward validation
print("\nWalk-Forward Validation Results:")
print("="*50)
wf_results = walk_forward_validation(X, y, RandomForestClassifier(n_estimators=100, random_state=42))
Common Pitfalls and How to Avoid Them
1. Look-Ahead Bias
Problem: Using future information in features or incorrect train-test splits.
Solution: Always think chronologically. Calculate features using only past data. Split data by time, not randomly.
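A concrete feature-level example of this mistake: a centered rolling window quietly averages in future prices, while the trailing window used in Step 2 sees only the past. Both lines below are pandas one-liners on the same DataFrame:
# WRONG: center=True averages prices from both before AND after each day,
# so the feature "knows" the future
df['SMA_20_leaky'] = df['Adj Close'].rolling(window=20, center=True).mean()
# RIGHT: the default trailing window uses only the past 20 days
df['SMA_20'] = df['Adj Close'].rolling(window=20).mean()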
2. Overfitting
Problem: Model performs well on training data but poorly on test data.
Solution: Use simpler models, regularization, and walk-forward validation. If training accuracy is 90% but test is 52%, you've overfit.
3. Ignoring Transaction Costs
Problem: Backtests show profits but ignore 0.1-0.5% trading costs.
Solution: Subtract realistic costs from every trade. Factor in bid-ask spread and market impact.
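As a rough sketch, you can net an assumed cost out of the Step 7 results whenever the signal flips between 0 and 1. The 0.2% figure below is an assumption to calibrate against your broker's fees and the stock's spread, not a measured cost:
# Assumed cost per position change (entry or exit) -- illustrative only
COST_PER_TRADE = 0.002
# A position change occurs whenever the signal flips between 0 and 1
position_changes = backtest_results['signal'].diff().abs()
position_changes = position_changes.fillna(backtest_results['signal'].iloc[0])
net_return = backtest_results['strategy_return'] - position_changes * COST_PER_TRADE
net_cumulative = (1 + net_return).cumprod()
print(f"Strategy return net of costs: {net_cumulative.iloc[-1] - 1:.2%}")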
4. Data Snooping
Problem: Testing many model variations and picking the best test performance.
Solution: Use proper validation. If you test 20 models, expect one to look good by chance.
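A quick simulation shows why this matters: evaluate 20 "models" that are really coin flips on a 250-day test set, and the best of them will usually look like it has an edge:
# Twenty coin-flip models on 250 test days -- the best one looks skilled
rng = np.random.default_rng(42)
fake_accuracies = rng.binomial(n=250, p=0.5, size=20) / 250
print(f"Best of 20 random models: {fake_accuracies.max():.1%}")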
Next Steps and Improvements
Once you've mastered this basic system, consider:
Model Improvements
- Try LSTM or Transformer models for better temporal pattern capture
- Implement ensemble methods combining multiple models
- Add more sophisticated features (sentiment, fundamentals)
- Optimize hyperparameters using walk-forward cross-validation (see the sketch below)
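For the last item, scikit-learn's GridSearchCV accepts a TimeSeriesSplit as its cv argument, which keeps every fold chronological. A minimal sketch, with an illustrative search space and X, y from Step 3:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
# Illustrative search space -- expand it once this runs end to end
param_grid = {'max_depth': [3, 5, 10], 'n_estimators': [100, 300]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),  # chronological folds, no shuffling
    scoring='roc_auc',
    n_jobs=-1
)
search.fit(X, y)  # trees don't need feature scaling, so raw X is fine here
print(search.best_params_, search.best_score_)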
Risk Management
- Implement position sizing based on confidence levels (see the sketch after this list)
- Add stop-loss and take-profit rules
- Calculate maximum drawdown and Value at Risk
- Diversify across multiple stocks
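A hypothetical starting point for confidence-based sizing: scale exposure linearly with how far the predicted probability exceeds 0.5, capped at full exposure. The linear rule is an assumption for illustration (Kelly-style sizing is a common alternative):
def position_size(probability, max_position=1.0):
    """Map predicted probability to a fraction of capital.
    0.5 or below -> no position; 1.0 -> max_position.
    """
    edge = max(probability - 0.5, 0.0)
    return 2 * edge * max_position
# Example: a 65% predicted probability maps to roughly a 30% position
print(position_size(0.65))  # ~0.30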
Production Considerations
- Automate daily data collection and model retraining
- Implement monitoring to detect model degradation
- Paper trade before risking real capital
- Build risk management safeguards
Conclusion
Congratulations! You've built a complete stock forecasting model from scratch. While this system isn't ready for real trading, you've learned the fundamental workflow:
- Acquire clean, adjusted historical data
- Engineer meaningful features from raw data
- Split data chronologically to avoid look-ahead bias
- Train multiple models and compare performance
- Evaluate using appropriate metrics
- Backtest to simulate real-world performance
- Validate using walk-forward testing
The most important skills you've developed are understanding data integrity, avoiding common pitfalls like look-ahead bias, and maintaining realistic expectations about performance. These principles apply whether you're building simple models or sophisticated deep learning systems.
Remember: successful quantitative trading requires far more than accurate predictions. Risk management, transaction cost awareness, psychological discipline, and continuous learning matter just as much as model accuracy. Use this foundation to keep learning, experimenting, and improving your skills in this fascinating intersection of finance and machine learning.