Building Your First Stock Forecasting Model

AI Forecasting & Finance · 2025-02-15 · By All About AI

Building your first stock forecasting model is an exciting journey into the intersection of finance, data science, and machine learning. While professional systems are complex, you can create a functional forecasting model that teaches fundamental concepts and provides a foundation for more sophisticated work. This step-by-step tutorial guides you through the entire process, from data acquisition to model validation.

Prerequisites and Tools

Before diving in, ensure you have the necessary tools and knowledge:

Required Knowledge

  • Basic Python programming
  • Understanding of pandas DataFrames
  • Familiarity with machine learning concepts (training, testing, overfitting)
  • Basic understanding of stock markets (what prices, volume, and returns mean)

Required Libraries

  • yfinance: For downloading stock data
  • pandas: Data manipulation
  • numpy: Numerical computations
  • scikit-learn: Machine learning models and preprocessing
  • matplotlib/seaborn: Visualization

Install with: pip install yfinance pandas numpy scikit-learn matplotlib seaborn

Learning Approach: This tutorial prioritizes understanding over complexity. We'll build a simple but complete system, explaining every step thoroughly. Once you master these fundamentals, you can explore advanced techniques.

Step 1: Data Acquisition

First, we need historical stock data. We'll use yfinance to download data for Apple (AAPL) as our example:

Download Historical Data

import yfinance as yf
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Download 5 years of Apple stock data
ticker = "AAPL"
end_date = datetime.now()
start_date = end_date - timedelta(days=5*365)

# auto_adjust=False keeps a separate "Adj Close" column; recent yfinance
# versions default to auto_adjust=True, which folds adjustments into "Close"
data = yf.download(ticker, start=start_date, end=end_date, auto_adjust=False)

# Newer yfinance releases may return MultiIndex columns even for a single
# ticker; flatten them so columns are plain strings like "Adj Close"
if isinstance(data.columns, pd.MultiIndex):
    data.columns = data.columns.get_level_values(0)

print(f"Downloaded {len(data)} days of data")
print(data.head())

This gives us a DataFrame with columns: Open, High, Low, Close, Volume, and Adj Close (adjusted for splits and dividends).

Understanding the Data

  • Close: The price at market close
  • Adj Close: Adjusted for corporate actions (use this for analysis)
  • Volume: Number of shares traded
  • High/Low: Trading range for the day
Critical Decision: Always use Adj Close for price calculations, not Close. Adjusted prices account for stock splits and dividends, ensuring accurate historical returns.
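To see the difference concretely, compare one-day returns computed from each column. Apple's 4-for-1 split on 2020-08-31 makes the raw Close drop roughly 75% overnight even though shareholders lost nothing; Adj Close removes that artifact. A quick check, assuming your download window covers the split date:

# Compare one-day returns from raw vs. adjusted prices; around a stock
# split the raw Close shows a large artificial "loss" that Adj Close removes
raw_returns = data['Close'].pct_change()
adj_returns = data['Adj Close'].pct_change()

# The dates where the two series disagree most are typically splits
diff = (raw_returns - adj_returns).abs().sort_values(ascending=False)
print(diff.head())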

Step 2: Feature Engineering

Raw price data isn't directly useful for machine learning. We need to create features (predictive variables) from the data.

Calculate Technical Indicators

def create_features(df):
    """Create technical indicators as features"""
    df = df.copy()

    # Returns (percentage change)
    df['returns'] = df['Adj Close'].pct_change()

    # Moving averages
    df['SMA_20'] = df['Adj Close'].rolling(window=20).mean()
    df['SMA_50'] = df['Adj Close'].rolling(window=50).mean()

    # Price position relative to moving averages
    df['price_to_SMA20'] = df['Adj Close'] / df['SMA_20']
    df['price_to_SMA50'] = df['Adj Close'] / df['SMA_50']

    # Volatility (20-day standard deviation of returns)
    df['volatility'] = df['returns'].rolling(window=20).std()

    # RSI (Relative Strength Index)
    delta = df['Adj Close'].diff()
    gain = (delta.where(delta > 0, 0)).rolling(window=14).mean()
    loss = (-delta.where(delta < 0, 0)).rolling(window=14).mean()
    rs = gain / loss
    df['RSI'] = 100 - (100 / (1 + rs))

    # Volume features
    df['volume_ratio'] = df['Volume'] / df['Volume'].rolling(window=20).mean()

    # Price momentum (return over different periods)
    df['momentum_5'] = df['Adj Close'].pct_change(periods=5)
    df['momentum_20'] = df['Adj Close'].pct_change(periods=20)

    # MACD (Moving Average Convergence Divergence)
    exp1 = df['Adj Close'].ewm(span=12, adjust=False).mean()
    exp2 = df['Adj Close'].ewm(span=26, adjust=False).mean()
    df['MACD'] = exp1 - exp2
    df['MACD_signal'] = df['MACD'].ewm(span=9, adjust=False).mean()

    return df

# Apply feature engineering
data = create_features(data)
print(f"Created {len(data.columns)} total columns")
print(data.columns.tolist())

Create Target Variable

We need to define what we're predicting. Let's predict whether the price will be higher 5 days from now (binary classification):

# Create target: 1 if price increases in next 5 days, 0 otherwise
future_price = data['Adj Close'].shift(-5)
data['target'] = (future_price > data['Adj Close']).astype(int)

# Comparing against NaN silently yields False (target 0) for the last 5 rows,
# so drop them explicitly; dropna() then removes the rolling-window NaNs
data = data[future_price.notna()].dropna()

print(f"After cleaning: {len(data)} samples")
print(f"Target distribution: {data['target'].value_counts()}")
Key Concept: We use shift(-5) to look 5 days forward. This creates our target variable. However, we must be careful not to use this future information in our features (that would be look-ahead bias).
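A toy example makes the alignment concrete: each row's target asks "is the price 5 rows ahead higher than today's?", and the last 5 rows have no answer, which is why we dropped them above.

# Toy series: the target at row t compares the price 5 rows ahead to today's
prices = pd.Series([100, 101, 99, 102, 103, 105, 104, 106, 108, 107])
future = prices.shift(-5)
print(pd.DataFrame({
    'price': prices,
    'price_in_5': future,                    # NaN for the last 5 rows
    'target': (future > prices).astype(int)
}))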

Step 3: Prepare Data for Training

Now we split our data into features (X) and target (y), then create training and test sets.

Select Features

# Define which columns are features (exclude target and raw price data)
feature_columns = [
    'returns', 'price_to_SMA20', 'price_to_SMA50', 'volatility',
    'RSI', 'volume_ratio', 'momentum_5', 'momentum_20', 'MACD', 'MACD_signal'
]

X = data[feature_columns]
y = data['target']

print(f"Features shape: {X.shape}")
print(f"Target shape: {y.shape}")

Train-Test Split (Time-Series Aware)

# For time series, we MUST split chronologically, not randomly
split_point = int(len(data) * 0.8)

X_train = X.iloc[:split_point]
X_test = X.iloc[split_point:]
y_train = y.iloc[:split_point]
y_test = y.iloc[split_point:]

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"Train period: {data.index[0]} to {data.index[split_point-1]}")
print(f"Test period: {data.index[split_point]} to {data.index[-1]}")
Critical Mistake to Avoid: Never use random train-test split for time series! This causes look-ahead bias where future information leaks into training. Always split chronologically.
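To make the contrast explicit, here is a minimal sketch of both approaches using scikit-learn's train_test_split. The shuffled version scatters future rows into the training set, which inflates test scores on autocorrelated features:

from sklearn.model_selection import train_test_split

# WRONG for time series: shuffle=True (the default) mixes future rows
# into the training set, leaking information the model couldn't have had
X_tr_bad, X_te_bad, y_tr_bad, y_te_bad = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# RIGHT: shuffle=False preserves chronological order, equivalent to the
# manual split above
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, shuffle=False)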

Feature Scaling

from sklearn.preprocessing import StandardScaler

# Fit scaler on training data only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for easier handling
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_columns, index=X_test.index)

Step 4: Train Multiple Models

Let's train three different models and compare their performance:

1. Logistic Regression (Baseline)

from sklearn.linear_model import LogisticRegression

# Train logistic regression
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)

# Predictions
lr_train_pred = lr_model.predict(X_train_scaled)
lr_test_pred = lr_model.predict(X_test_scaled)

# Probabilities (useful for analysis)
lr_train_proba = lr_model.predict_proba(X_train_scaled)[:, 1]
lr_test_proba = lr_model.predict_proba(X_test_scaled)[:, 1]

2. Random Forest

from sklearn.ensemble import RandomForestClassifier

# Train random forest
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train_scaled, y_train)

# Predictions
rf_train_pred = rf_model.predict(X_train_scaled)
rf_test_pred = rf_model.predict(X_test_scaled)
rf_train_proba = rf_model.predict_proba(X_train_scaled)[:, 1]
rf_test_proba = rf_model.predict_proba(X_test_scaled)[:, 1]

3. Gradient Boosting (XGBoost)

# Install if needed: pip install xgboost
import xgboost as xgb

# Train XGBoost
xgb_model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=5,
    learning_rate=0.1,
    random_state=42
)
xgb_model.fit(X_train_scaled, y_train)

# Predictions
xgb_train_pred = xgb_model.predict(X_train_scaled)
xgb_test_pred = xgb_model.predict(X_test_scaled)
xgb_train_proba = xgb_model.predict_proba(X_train_scaled)[:, 1]
xgb_test_proba = xgb_model.predict_proba(X_test_scaled)[:, 1]

Step 5: Evaluate Model Performance

Now we evaluate how well our models perform:

Calculate Metrics

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def evaluate_model(y_true, y_pred, y_proba, model_name, dataset_name):
    """Calculate and print evaluation metrics"""
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_proba)

    print(f"\n{model_name} - {dataset_name} Set:")
    print(f"  Accuracy:  {accuracy:.4f}")
    print(f"  Precision: {precision:.4f}")
    print(f"  Recall:    {recall:.4f}")
    print(f"  F1 Score:  {f1:.4f}")
    print(f"  ROC AUC:   {auc:.4f}")

    return accuracy, precision, recall, f1, auc

# Evaluate all models
print("="*50)
print("MODEL EVALUATION")
print("="*50)

# Logistic Regression
evaluate_model(y_train, lr_train_pred, lr_train_proba, "Logistic Regression", "Training")
evaluate_model(y_test, lr_test_pred, lr_test_proba, "Logistic Regression", "Test")

# Random Forest
evaluate_model(y_train, rf_train_pred, rf_train_proba, "Random Forest", "Training")
evaluate_model(y_test, rf_test_pred, rf_test_proba, "Random Forest", "Test")

# XGBoost
evaluate_model(y_train, xgb_train_pred, xgb_train_proba, "XGBoost", "Training")
evaluate_model(y_test, xgb_test_pred, xgb_test_proba, "XGBoost", "Test")

Understanding the Metrics

  • Accuracy: Percentage of correct predictions (can be misleading if classes are imbalanced)
  • Precision: Of positive predictions, how many were actually correct?
  • Recall: Of actual positive cases, how many did we catch?
  • F1 Score: Harmonic mean of precision and recall (balanced metric)
  • ROC AUC: Area under the receiver operating characteristic curve (measures overall performance)
What's Good Performance? For stock prediction, accuracy above 53-55% is often considered decent. This isn't impressive compared to other ML tasks, but remember: stock markets are inherently noisy. Even small edges can be profitable with proper risk management.
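To make precision and recall concrete, you can recompute them by hand from the confusion matrix of one model (here the XGBoost test predictions from above):

from sklearn.metrics import confusion_matrix

# For binary 0/1 labels, ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, xgb_test_pred).ravel()

print(f"Precision = TP/(TP+FP) = {tp / (tp + fp):.4f}")   # trust in 'up' calls
print(f"Recall    = TP/(TP+FN) = {tp / (tp + fn):.4f}")   # 'up' days we caught
print(f"Accuracy  = (TP+TN)/N  = {(tp + tn) / (tp + tn + fp + fn):.4f}")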

Step 6: Feature Importance Analysis

Understanding which features drive predictions is crucial:

import matplotlib.pyplot as plt
import seaborn as sns

# Get feature importance from Random Forest
feature_importance = pd.DataFrame({
    'feature': feature_columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

# Plot
plt.figure(figsize=(10, 6))
sns.barplot(data=feature_importance, x='importance', y='feature')
plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance')
plt.tight_layout()
plt.savefig('feature_importance.png')
plt.show()

print("\nTop 5 Most Important Features:")
print(feature_importance.head())

Step 7: Backtesting (Simulated Trading)

Let's simulate how the model would perform in a simple trading strategy:

Simple Trading Strategy

def backtest_strategy(data, predictions, probabilities, threshold=0.5):
    """
    Simple strategy: Buy when model predicts up, hold cash otherwise
    """
    # Create results DataFrame
    results = pd.DataFrame(index=data.index)
    results['actual_return'] = data['Adj Close'].pct_change(5).shift(-5)
    results['prediction'] = predictions
    results['probability'] = probabilities

    # Only trade when confidence is above threshold
    results['signal'] = (results['probability'] > threshold).astype(int)

    # Calculate strategy returns
    results['strategy_return'] = results['actual_return'] * results['signal']

    # Calculate cumulative returns (note: the 5-day returns overlap, so this
    # compounding is an approximation rather than an exact equity curve)
    results['buy_hold_cumulative'] = (1 + results['actual_return']).cumprod()
    results['strategy_cumulative'] = (1 + results['strategy_return']).cumprod()

    # Remove NaN values
    results = results.dropna()

    # Calculate metrics
    total_return_bh = results['buy_hold_cumulative'].iloc[-1] - 1
    total_return_strategy = results['strategy_cumulative'].iloc[-1] - 1

    sharpe_bh = results['actual_return'].mean() / results['actual_return'].std() * np.sqrt(252/5)
    sharpe_strategy = results['strategy_return'].mean() / results['strategy_return'].std() * np.sqrt(252/5)

    print(f"\nBacktest Results:")
    print(f"  Buy & Hold Return:    {total_return_bh:>8.2%}")
    print(f"  Strategy Return:      {total_return_strategy:>8.2%}")
    print(f"  Buy & Hold Sharpe:    {sharpe_bh:>8.2f}")
    print(f"  Strategy Sharpe:      {sharpe_strategy:>8.2f}")
    print(f"  Trades Taken:         {results['signal'].sum():>8.0f}")
    print(f"  Win Rate:             {(results[results['signal']==1]['actual_return'] > 0).mean():>8.2%}")

    return results

# Backtest on test set
test_data = data.iloc[split_point:]
backtest_results = backtest_strategy(test_data, xgb_test_pred, xgb_test_proba)

# Plot cumulative returns
plt.figure(figsize=(12, 6))
plt.plot(backtest_results['buy_hold_cumulative'], label='Buy & Hold', linewidth=2)
plt.plot(backtest_results['strategy_cumulative'], label='ML Strategy', linewidth=2)
plt.title('Cumulative Returns: Buy & Hold vs ML Strategy')
plt.xlabel('Date')
plt.ylabel('Cumulative Return')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('backtest_results.png')
plt.show()
Reality Check: This backtest doesn't include transaction costs, slippage, or market impact. Real-world performance would be lower. Use this as a learning tool, not a trading system.

Step 8: Walk-Forward Validation

For a more realistic evaluation, implement walk-forward validation:

from sklearn.model_selection import TimeSeriesSplit

def walk_forward_validation(X, y, model, n_splits=5):
    """
    Perform walk-forward validation
    """
    tscv = TimeSeriesSplit(n_splits=n_splits)

    results = []

    for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
        # Split data
        X_train_fold = X.iloc[train_idx]
        X_test_fold = X.iloc[test_idx]
        y_train_fold = y.iloc[train_idx]
        y_test_fold = y.iloc[test_idx]

        # Scale
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train_fold)
        X_test_scaled = scaler.transform(X_test_fold)

        # Train
        model.fit(X_train_scaled, y_train_fold)

        # Predict
        y_pred = model.predict(X_test_scaled)
        y_proba = model.predict_proba(X_test_scaled)[:, 1]

        # Evaluate
        accuracy = accuracy_score(y_test_fold, y_pred)
        auc = roc_auc_score(y_test_fold, y_proba)

        results.append({
            'fold': fold + 1,
            'accuracy': accuracy,
            'auc': auc,
            'train_size': len(train_idx),
            'test_size': len(test_idx)
        })

        print(f"Fold {fold+1}: Accuracy={accuracy:.4f}, AUC={auc:.4f}")

    results_df = pd.DataFrame(results)
    print(f"\nAverage Accuracy: {results_df['accuracy'].mean():.4f} (+/- {results_df['accuracy'].std():.4f})")
    print(f"Average AUC: {results_df['auc'].mean():.4f} (+/- {results_df['auc'].std():.4f})")

    return results_df

# Perform walk-forward validation
print("\nWalk-Forward Validation Results:")
print("="*50)
wf_results = walk_forward_validation(X, y, RandomForestClassifier(n_estimators=100, random_state=42))

Common Pitfalls and How to Avoid Them

1. Look-Ahead Bias

Problem: Using future information in features or incorrect train-test splits.

Solution: Always think chronologically. Calculate features using only past data. Split data by time, not randomly.

2. Overfitting

Problem: Model performs well on training data but poorly on test data.

Solution: Use simpler models, regularization, and walk-forward validation. If training accuracy is 90% but test is 52%, you've overfit.
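A quick diagnostic is to print the train-test gap for every model using the predictions computed in Step 4; a large gap is the signature of overfitting:

# Compare in-sample vs. out-of-sample accuracy; a wide gap means the model
# memorized the training data rather than learning a general pattern
for name, tr_pred, te_pred in [
    ("Logistic Regression", lr_train_pred, lr_test_pred),
    ("Random Forest", rf_train_pred, rf_test_pred),
    ("XGBoost", xgb_train_pred, xgb_test_pred),
]:
    tr_acc = accuracy_score(y_train, tr_pred)
    te_acc = accuracy_score(y_test, te_pred)
    print(f"{name:20s} train={tr_acc:.3f} test={te_acc:.3f} gap={tr_acc - te_acc:+.3f}")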

3. Ignoring Transaction Costs

Problem: Backtests show profits but ignore 0.1-0.5% trading costs.

Solution: Subtract realistic costs from every trade. Factor in bid-ask spread and market impact.
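As a rough sketch, you can rerun the math on the backtest results from Step 7, charging an assumed round-trip cost on each new position (the 0.2% here is an illustrative figure, not a quoted broker fee):

# Charge an assumed 0.2% round-trip cost each time the signal switches on;
# real costs depend on broker fees, bid-ask spread, and order size
cost_per_trade = 0.002
entries = backtest_results['signal'].diff().fillna(0) == 1   # new positions

net_return = backtest_results['strategy_return'] - entries * cost_per_trade
net_cumulative = (1 + net_return).cumprod()

print(f"Gross strategy return: {backtest_results['strategy_cumulative'].iloc[-1] - 1:.2%}")
print(f"Net of costs:          {net_cumulative.iloc[-1] - 1:.2%}")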

4. Data Snooping

Problem: Testing many model variations and picking the best test performance.

Solution: Use proper validation. If you test 20 models, expect one to look good by chance.
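A tiny simulation shows why: flip coins for 20 "models" over a year of trading days, and the best one looks skilled purely by luck:

# 20 models that are pure coin flips, each scored on ~250 trading days
rng = np.random.default_rng(42)
n_models, n_days = 20, 250
accuracies = rng.binomial(n_days, 0.5, size=n_models) / n_days

print(f"Mean accuracy:       {accuracies.mean():.2%}")   # ~50%, as expected
print(f"Best-of-20 accuracy: {accuracies.max():.2%}")    # often 54-57% by chance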

Most Important Lesson: If your backtest looks too good to be true (accuracy >70%, consistent high returns), you probably have a bug. Stock prediction is hard—realistic models have modest performance with high variability.

Next Steps and Improvements

Once you've mastered this basic system, consider:

Model Improvements

  • Try LSTM or Transformer models for better temporal pattern capture
  • Implement ensemble methods combining multiple models
  • Add more sophisticated features (sentiment, fundamentals)
  • Optimize hyperparameters using walk-forward cross-validation (see the sketch below)
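For the last item, scikit-learn's GridSearchCV accepts a TimeSeriesSplit as its cv argument, so every fold trains on the past and validates on the future. A minimal sketch (the parameter grid is illustrative, not a tuned recommendation):

from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Chronological folds: each split trains on earlier data, validates on later
tscv = TimeSeriesSplit(n_splits=5)

param_grid = {                    # illustrative values only
    'n_estimators': [100, 300],
    'max_depth': [3, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=tscv,
    scoring='roc_auc',
    n_jobs=-1,
)
search.fit(X, y)
print(f"Best params: {search.best_params_}, AUC: {search.best_score_:.4f}")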

Risk Management

  • Implement position sizing based on confidence levels (sketched after this list)
  • Add stop-loss and take-profit rules
  • Calculate maximum drawdown and Value at Risk
  • Portfolio diversification across multiple stocks
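The first item can be sketched directly from the probabilities our models already produce: scale exposure with the model's confidence instead of the all-or-nothing signal used in Step 7 (the threshold and sizing rule here are illustrative):

# Map predicted probability of an up-move to a fraction of capital;
# the 0.55 floor and linear scaling are illustrative, not recommendations
def position_size(prob_up, min_conf=0.55, max_size=1.0):
    if prob_up < min_conf:
        return 0.0                    # not confident enough: stay in cash
    # Scale linearly from 0 at min_conf up to max_size at probability 1.0
    return max_size * (prob_up - min_conf) / (1.0 - min_conf)

sizes = pd.Series(xgb_test_proba).apply(position_size)
print(sizes.describe())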

Production Considerations

  • Automate daily data collection and model retraining
  • Implement monitoring to detect model degradation
  • Paper trading before risking real capital
  • Build risk management safeguards

Conclusion

Congratulations! You've built a complete stock forecasting model from scratch. While this system isn't ready for real trading, you've learned the fundamental workflow:

  1. Acquire clean, adjusted historical data
  2. Engineer meaningful features from raw data
  3. Split data chronologically to avoid look-ahead bias
  4. Train multiple models and compare performance
  5. Evaluate using appropriate metrics
  6. Backtest to simulate real-world performance
  7. Validate using walk-forward testing

The most important skills you've developed are understanding data integrity, avoiding common pitfalls like look-ahead bias, and maintaining realistic expectations about performance. These principles apply whether you're building simple models or sophisticated deep learning systems.

Remember: successful quantitative trading requires far more than accurate predictions. Risk management, transaction cost awareness, psychological discipline, and continuous learning matter just as much as model accuracy. Use this foundation to keep learning, experimenting, and improving your skills in this fascinating intersection of finance and machine learning.