{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "e5945e24",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "# 📘 **NOAI 2026 - Programming Task 1**\n",
    "- **Subject:** Machine Learning (Regression)\n",
    "- **Title:** Gradient Boosting from Scratch\n",
    "- **Total** = 25 Marks\n",
    "  * **Part 1: The Foundation (8 Marks)**\n",
    "  * **Part 2: The Assembler (12 Marks)**\n",
    "  * **Part 3: Optimisation (Bonus: 5 Marks)**\n",
    "\n",
    "Timestamp: 20 Feb 2026"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a9df06f3",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "## **1. The Scenario**\n",
    "You are engineering a predictive model for a remote environmental sensor. The hardware is low-power and cannot run heavy libraries like `XGBoost` or `LightGBM`. Your task is to build a **Gradient Boosting Regressor** manually, using only `numpy` and standard Decision Trees.\n",
    "\n",
    "### **What is Gradient Boosting? (The Intuition)**\n",
    "Imagine playing golf. You want to get the ball to the hole (the target $y$).\n",
    "1.  **Shot 1 (Initial Prediction):** You hit the ball, but it stops 20 meters short.\n",
    "    * *Error (Residual):* +20 meters.\n",
    "2.  **Shot 2 (First Tree):** You don't aim for the hole; you aim for the *ball's current distance* to the hole (the residual). You hit it 15 meters.\n",
    "    * *New Position:* 5 meters left.\n",
    "3.  **Shot 3 (Second Tree):** You aim for the remaining 5 meters.\n",
    "\n",
    "By adding up all your small shots, you get very close to the target. Gradient Boosting works the same way: **We fit new models to the errors of the previous models.**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7fd830d",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "### **📚 Supplementary Reading**\n",
    "\n",
    "* **[A Guide to The Gradient Boosting Algorithm](https://www.datacamp.com/tutorial/guide-to-the-gradient-boosting-algorithm)**"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "014326c5",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "## **2. The Math (Reference Table)**\n",
    "You will need these formulas for the questions below.\n",
    "\n",
    "| Loss Type | Loss Function $L(y, \\hat{y})$ | Negative Gradient (What we fit to) |\n",
    "| :--- | :--- | :--- |\n",
    "| **L1 (MAE)** | $\\lvert y - \\hat{y} \\rvert$ | $sign(y - \\hat{y})$ |\n",
    "| **L2 (MSE)** | $\\frac{1}{2}(y - \\hat{y})^2$ | $y - \\hat{y}$ (The Residual) |\n",
    "\n",
    "**Legend:**\n",
    "* $y$ : The true/target value\n",
    "* $\\hat{y}$ : The predicted value\n",
    "\n",
    "**Definition of Sign (Boundary Condition):**\n",
    "The derivative of the absolute error at 0 is undefined, so we use the *sub-gradient*:\n",
    "\n",
    "$\\displaystyle \\text{sign}(x) \\in \\begin{cases} \\{-1\\} & \\text{if } x < 0 \\\\ [-1, 1] & \\text{if } x = 0 \\\\ \\{1\\} & \\text{if } x > 0 \\end{cases}$\n",
    "\n",
    "**Coding Hint:** For the L1 Negative Gradient, you can use `np.sign(y_true - y_pred)` to compute it for the whole array at once."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "211223dc-83b0-4fb6-bd19-b5251e8f9af1",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.tree import DecisionTreeRegressor\n",
    "from sklearn.datasets import make_regression\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import mean_squared_error"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7a89f4a",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "## **Part 1: The Foundation (8 Marks)**\n",
    "---\n",
    "### **Question 1.1 & 1.2: Negative Gradients (4 Marks)**\n",
    "### **Question 1.3 & 1.4: Initialisation (4 Marks)**\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ed2eb55a",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "### Question 1.1 (2 points)\n",
    "\n",
    "The \"Negative Gradient\" is simply the direction of the error. This is what our next tree will try to predict.\n",
    "* **L1 Gradient:** The *sign* of the difference (robust to outliers).\n",
    "* **L2 Gradient:** The raw difference between the truth and prediction.\n",
    "\n",
    "**Task:** Implement the two functions below using the formulas from the Reference Table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4b67fa5a",
   "metadata": {
    "tags": [
     "otter_answer_cell"
    ]
   },
   "outputs": [],
   "source": [
    "### ANSWER QUESTION 1.1 HERE\n",
    "\n",
    "\n",
    "def l1_negative_gradient(y: np.ndarray, y_pred: np.ndarray) -> np.ndarray:\n",
    "    \"\"\"\n",
    "    Compute L1 negative gradient (sign of residual).\n",
    "    \n",
    "    This returns the direction of the error, not the magnitude.\n",
    "    Used for MAE (L1) loss in gradient boosting.\n",
    "\n",
    "    Parameters:\n",
    "        y (np.ndarray): True values\n",
    "        y_pred (np.ndarray): Current predictions\n",
    "\n",
    "    Returns:\n",
    "        np.ndarray: L1 Negative gradient (values will be -1, 0, or 1)\n",
    "    \"\"\"\n",
    "\n",
    "    ### QUESTION 1.1 START\n",
    "    # TODO: Implement L1 gradient computation\n",
    "    ### QUESTION 1.1 END"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "42064d4c",
   "metadata": {
    "deletable": false,
    "editable": false,
    "otter": {
     "tests": [
      "q1.1"
     ]
    }
   },
   "outputs": [],
   "source": [
    "### TEST FOR QUESTION 1.1\n",
    "\n",
    "\n",
    "print(\"=\" * 60)\n",
    "print(\"Testing l1_negative_gradient:\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "y = np.random.rand(3, 5)\n",
    "y_pred = np.random.rand(3, 5)\n",
    "print(repr(y))\n",
    "print(repr(y_pred))\n",
    "print(repr(l1_negative_gradient(y, y_pred)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "117ec2aa",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "### Question 1.2 (2 points)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "96a75fa3",
   "metadata": {
    "tags": [
     "otter_answer_cell"
    ]
   },
   "outputs": [],
   "source": [
    "### ANSWER QUESTION 1.2 HERE\n",
    "\n",
    "\n",
    "def l2_negative_gradient(y: np.ndarray, y_pred: np.ndarray) -> np.ndarray:\n",
    "    \"\"\"\n",
    "    Compute L2 negative gradient (residual).\n",
    "    \n",
    "    This is simply the error: how far predictions are from truth.\n",
    "    Used for MSE (L2) loss in gradient boosting.\n",
    "\n",
    "    Parameters:\n",
    "        y (np.ndarray): True values\n",
    "        y_pred (np.ndarray): Current predictions\n",
    "\n",
    "    Returns:\n",
    "        np.ndarray: L2 Negative gradient (same shape as y)\n",
    "    \"\"\"\n",
    "\n",
    "    ### QUESTION 1.2 START\n",
    "    # TODO: Implement L2 gradient computation\n",
    "    ### QUESTION 1.2 END"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "678f0eb0",
   "metadata": {
    "deletable": false,
    "editable": false,
    "otter": {
     "tests": [
      "q1.2"
     ]
    }
   },
   "outputs": [],
   "source": [
    "### TEST FOR QUESTION 1.2\n",
    "\n",
    "\n",
    "print(\"=\" * 60)\n",
    "print(\"Testing l2_negative_gradient:\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "y = np.random.rand(3, 5)\n",
    "y_pred = np.random.rand(3, 5)\n",
    "print(repr(y))\n",
    "print(repr(y_pred))\n",
    "print(repr(l2_negative_gradient(y, y_pred)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5e3829e",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "### **Question 1.3 & 1.4: Initialisation (4 Marks)**\n",
    "Before we start adding trees, the algorithm needs a baseline guess ($F_0$).\n",
    "* For **L1 Loss**, the best constant guess is the **Median**.\n",
    "* For **L2 Loss**, the best constant guess is the **Mean**.\n",
    "\n",
    "**Task:** Implement the functions to return the mean or median of the input array `y`.\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "717a83b5",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "### Question 1.3 (2 points)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a36e2652",
   "metadata": {
    "tags": [
     "otter_answer_cell"
    ]
   },
   "outputs": [],
   "source": [
    "### ANSWER QUESTION 1.3 HERE\n",
    "\n",
    "\n",
    "def l1_initial_prediction(y: np.ndarray) -> np.float64:\n",
    "    \"\"\"\n",
    "    Compute initial prediction for L1 (MAE) loss.\n",
    "    \n",
    "    For L1 loss, the best constant prediction is the MEDIAN.\n",
    "\n",
    "    Parameters:\n",
    "        y (np.ndarray): True values (training targets)\n",
    "\n",
    "    Returns:\n",
    "        np.float64: A single number - the initial prediction for all samples\n",
    "    \"\"\"\n",
    "\n",
    "    ### QUESTION 1.3 START\n",
    "    # TODO: Return the median of y\n",
    "    ### QUESTION 1.3 END"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f185607b",
   "metadata": {
    "deletable": false,
    "editable": false,
    "otter": {
     "tests": [
      "q1.3"
     ]
    }
   },
   "outputs": [],
   "source": [
    "### TEST FOR QUESTION 1.3\n",
    "\n",
    "\n",
    "print(\"=\" * 60)\n",
    "print(\"Testing l1_initial_prediction:\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "y = np.random.rand(3, 5)\n",
    "print(repr(y))\n",
    "print(repr(l1_initial_prediction(y)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f17500db",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "### Question 1.4 (2 points)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17828448",
   "metadata": {
    "tags": [
     "otter_answer_cell"
    ]
   },
   "outputs": [],
   "source": [
    "### ANSWER QUESTION 1.4 HERE\n",
    "\n",
    "\n",
    "def l2_initial_prediction(y: np.ndarray) -> np.float64:\n",
    "    \"\"\"\n",
    "    Compute initial prediction for L2 (MSE) loss.\n",
    "    \n",
    "    For L2 loss, the best constant prediction is the MEAN.\n",
    "\n",
    "    Parameters:\n",
    "        y (np.ndarray): True values (training targets)\n",
    "\n",
    "    Returns:\n",
    "        np.float64: A single number - the initial prediction for all samples\n",
    "    \"\"\"\n",
    "\n",
    "    ### QUESTION 1.4 START\n",
    "    # TODO: Return the mean of y\n",
    "    ### QUESTION 1.4 END"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "692e87ba",
   "metadata": {
    "deletable": false,
    "editable": false,
    "otter": {
     "tests": [
      "q1.4"
     ]
    }
   },
   "outputs": [],
   "source": [
    "### TEST FOR QUESTION 1.4\n",
    "\n",
    "\n",
    "print(\"=\" * 60)\n",
    "print(\"Testing l2_initial_prediction:\")\n",
    "print(\"=\" * 60)\n",
    "\n",
    "y = np.random.rand(3, 5)\n",
    "print(repr(y))\n",
    "print(repr(l2_initial_prediction(y)))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0f854ed1",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "---\n",
    "## **Part 2: The Assembler (12 Marks)**\n",
    "* **Question 1.5: The Class Structure (4 Marks)**\n",
    "* **Question 1.6: The Fitting Algorithm (8 Marks)**\n",
    "\n",
    "---"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "545237dd",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "Now we will assemble the pieces into a proper Class.\n",
    "\n",
    "### **Question 1.5: The Class Structure (4 Marks)**\n",
    "We will define the class `BaseGradientBoosting`. Your first task is to implement the `__init__` method to store our settings.\n",
    "\n",
    "**Task:**\n",
    "Initialise the following attributes using `self.variable_name = variable_name`:\n",
    "1.  `n_estimators` (int): The number of trees to build.\n",
    "2.  `learning_rate` (float): The step size for each tree (prevents overfitting).\n",
    "3.  `max_depth` (int): The maximum depth of each decision tree.\n",
    "4.  `criterion` (str): The loss function to use ('l1' or 'l2')."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0108032c",
   "metadata": {
    "tags": [
     "otter_answer_cell"
    ]
   },
   "outputs": [],
   "source": [
    "### ANSWER QUESTION 1.5 HERE\n",
    "\n",
    "\n",
    "from abc import ABC, abstractmethod\n",
    "\n",
    "\n",
    "class BaseGradientBoosting(ABC):\n",
    "    \"\"\"Abstract base class for Gradient Boosting.\"\"\"\n",
    "\n",
    "    def _compute_negative_gradient(self, y, y_pred):\n",
    "        if self.criterion == \"l1\":\n",
    "            return l1_negative_gradient(y, y_pred)\n",
    "        elif self.criterion == \"l2\":\n",
    "            return l2_negative_gradient(y, y_pred)\n",
    "        else:\n",
    "            raise ValueError(f\"Unknown criterion: {self.criterion}\")\n",
    "\n",
    "    def _compute_initial_prediction(self, y):\n",
    "        if self.criterion == \"l1\":\n",
    "            return l1_initial_prediction(y)\n",
    "        elif self.criterion == \"l2\":\n",
    "            return l2_initial_prediction(y)\n",
    "        else:\n",
    "            raise ValueError(f\"Unknown criterion: {self.criterion}\")\n",
    "\n",
    "    def _subsample_indices(self, n_samples, rng):\n",
    "        \"\"\"\n",
    "        Generate indices for stochastic gradient boosting.\n",
    "\n",
    "        Parameters:\n",
    "        -----------\n",
    "        n_samples : int\n",
    "            Total number of samples\n",
    "        rng : np.random.RandomState\n",
    "            Random number generator\n",
    "\n",
    "        Returns:\n",
    "        --------\n",
    "        np.ndarray\n",
    "            Indices of selected samples\n",
    "        \"\"\"\n",
    "        if self.subsample < 1.0:\n",
    "            n_subsample = int(n_samples * self.subsample)\n",
    "            return rng.choice(n_samples, size=n_subsample, replace=False)\n",
    "        else:\n",
    "            return np.arange(n_samples)\n",
    "\n",
    "    def predict(self, X) -> np.array:\n",
    "        \"\"\"\n",
    "        Generate predictions for input samples.\n",
    "\n",
    "        Parameters:\n",
    "        -----------\n",
    "        X : np.ndarray, shape (n_samples, n_features)\n",
    "            Input features\n",
    "\n",
    "        Returns:\n",
    "        --------\n",
    "        np.ndarray, shape (n_samples,)\n",
    "            Predicted values\n",
    "            \n",
    "        Algorithm:\n",
    "            1. Start with initial prediction (F0) for all samples\n",
    "            2. Add each tree's contribution (scaled by learning_rate)\n",
    "            3. Return final predictions\n",
    "        \"\"\"\n",
    "        ### QUESTION 1.5 START\n",
    "        \n",
    "        # TODO: Initialize predictions array with self.init_prediction\n",
    "        \n",
    "        # TODO: Loop through self.trees and add their predictions\n",
    "        ...\n",
    "        ### QUESTION 1.5 END\n",
    "\n",
    "    @abstractmethod\n",
    "    def fit(self, X, y):\n",
    "        return"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9071edbf",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "### **Question 1.6: The Fitting Algorithm (8 Marks)**\n",
    "This is the core of the task. You must implement the `fit` and `predict` methods.\n",
    "\n",
    "**The Logic (Pseudocode):**\n",
    "1.  **Initialise:** Calculate $F_0$ (using your Q1.3/1.4 functions) and create a `current_pred` array filled with this value.\n",
    "2.  **Loop:** Iterate `n_estimators` times:\n",
    "    * **Step A:** Compute residuals ($r$) using your Q1.1/1.2 functions.\n",
    "    * **Step B:** Fit a `DecisionTreeRegressor` to predict $r$ from $X$.\n",
    "    * **Step C:** Update `current_pred`: $F_{new} = F_{old} + \\text{learning\\_rate} \\times \\text{tree.predict}(X)$.\n",
    "    * **Step D:** Save the tree to `self.trees`.\n",
    "\n",
    "**Important:** Do not forget to multiply the tree's prediction by the `learning_rate`!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9882727c-85aa-45d0-bc7b-90737ab90d22",
   "metadata": {
    "tags": [
     "otter_answer_cell"
    ]
   },
   "outputs": [],
   "source": [
    "### ANSWER QUESTION 1.6 HERE\n",
    "\n",
    "class GradientBoostingRegressor(BaseGradientBoosting):\n",
    "    \n",
    "    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3, \n",
    "             criterion='l2', subsample=1.0, random_state=42):\n",
    "        \"\"\"\n",
    "        Gradient Boosting for regression.\n",
    "        \n",
    "        Parameters:\n",
    "        -----------\n",
    "        n_estimators : int\n",
    "            Number of boosting stages (trees)\n",
    "        learning_rate : float\n",
    "            Shrinkage factor for each tree's contribution\n",
    "        max_depth : int\n",
    "            Maximum depth of individual trees\n",
    "        criterion : str\n",
    "            Loss function: 'l1' (MAE) or 'l2' (MSE)\n",
    "        subsample : float\n",
    "            Fraction of samples used for fitting each tree (1.0 = use all samples)\n",
    "        random_state : int\n",
    "            Random seed for reproducibility\n",
    "        \"\"\"\n",
    "        self.n_estimators = n_estimators\n",
    "        self.learning_rate = learning_rate\n",
    "        self.max_depth = max_depth\n",
    "        self.criterion = criterion\n",
    "        self.subsample = subsample\n",
    "        self.random_state = random_state\n",
    "        \n",
    "        # Internal storage\n",
    "        self.trees = []\n",
    "        self.init_prediction = None\n",
    "\n",
    "    def fit(self, X, y):\n",
    "        \"\"\"\n",
    "        Fit the gradient boosting model.\n",
    "        \n",
    "        Parameters:\n",
    "        -----------\n",
    "        X : np.ndarray, shape (n_samples, n_features)\n",
    "            Training features\n",
    "        y : np.ndarray, shape (n_samples,)\n",
    "            Training targets\n",
    "            \n",
    "        Returns:\n",
    "        --------\n",
    "        self : GradientBoostingRegressor\n",
    "            Fitted model\n",
    "        \"\"\"\n",
    "        # 1. Initialisation\n",
    "        if self.criterion == 'l2':\n",
    "            self.init_prediction = l2_initial_prediction(y)\n",
    "        else:\n",
    "            self.init_prediction = l1_initial_prediction(y)\n",
    "            \n",
    "        # Initialize\n",
    "        current_pred = np.full(y.shape, self.init_prediction)\n",
    "        rng = np.random.RandomState(self.random_state)\n",
    "        n_samples = X.shape[0]\n",
    "        \n",
    "        # 2. Iteratively fit trees\n",
    "        ### QUESTION 1.6a START\n",
    "        \n",
    "        for m in range(self.n_estimators):\n",
    "            # TODO Step a: Compute negative gradient (residuals)\n",
    "            # Use l2_negative_gradient() or l1_negative_gradient() based on self.criterion\n",
    "\n",
    "            # TODO Step b: Get subsample indices (for stochastic gradient boosting)\n",
    "\n",
    "            # TODO Step c: Fit a DecisionTreeRegressor to the residuals\n",
    "\n",
    "            # TODO Step d: Update predictions for ALL samples\n",
    "\n",
    "            # TODO Step e: Append tree to self.trees\n",
    "            \n",
    "        return self\n",
    "        ### QUESTION 1.6a END\n",
    "\n",
    "    def predict(self, X):\n",
    "        \"\"\"\n",
    "        Generate predictions for input samples.\n",
    "        \n",
    "        Parameters:\n",
    "        -----------\n",
    "        X : np.ndarray, shape (n_samples, n_features)\n",
    "            Input features\n",
    "            \n",
    "        Returns:\n",
    "        --------\n",
    "        np.ndarray, shape (n_samples,)\n",
    "            Predicted values\n",
    "        \"\"\"\n",
    "        ### QUESTION 1.6b START\n",
    "        \n",
    "        if self.init_prediction is None:\n",
    "            raise ValueError(\"Model not fitted yet\")\n",
    "            \n",
    "        # TODO: Start with F0 for all samples\n",
    "        # TODO: Add contributions from all trees (scaled by learning_rate)\n",
    "        ...\n",
    "            \n",
    "        ### QUESTION 1.6b END"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0007767f",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "### TEST FOR QUESTION 1.6\n",
    "\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    # Generate synthetic data\n",
    "    X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)\n",
    "    X_train, X_test, y_train, y_test = train_test_split(\n",
    "        X, y, test_size=0.2, random_state=42\n",
    "    )\n",
    "\n",
    "    # Test MSE loss\n",
    "    print(\"=\" * 60)\n",
    "    print(\"Testing MSE Loss:\")\n",
    "    print(\"=\" * 60)\n",
    "    gb_l1 = GradientBoostingRegressor(\n",
    "        n_estimators=100, learning_rate=0.1, max_depth=3, criterion=\"l1\"\n",
    "    )\n",
    "    gb_l1.fit(X_train, y_train)\n",
    "    pred_l1 = gb_l1.predict(X_test)\n",
    "    print(f\"  MSE: {mean_squared_error(y_test, pred_l1):.4f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fa4e9138",
   "metadata": {
    "deletable": false,
    "editable": false,
    "otter": {
     "tests": [
      "q1.5_6"
     ]
    }
   },
   "outputs": [],
   "source": [
    "### TEST FOR QUESTION 1.5\n",
    "### QUESTION 1.6 MUST BE COMPLETED FIRST\n",
    "\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    # Generate synthetic data\n",
    "    X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)\n",
    "    X_train, X_test, y_train, y_test = train_test_split(\n",
    "        X, y, test_size=0.2, random_state=42\n",
    "    )\n",
    "\n",
    "    # Test L1 loss\n",
    "    print(\"=\" * 60)\n",
    "    print(\"Testing L1 Loss:\")\n",
    "    print(\"=\" * 60)\n",
    "    gb_l1 = GradientBoostingRegressor(\n",
    "        n_estimators=100, learning_rate=0.1, max_depth=3, criterion=\"l1\"\n",
    "    )\n",
    "    gb_l1.fit(X_train, y_train)\n",
    "    pred_l1 = gb_l1.predict(X_test)\n",
    "    print(f\"  MSE: {mean_squared_error(y_test, pred_l1):.4f}\")\n",
    "\n",
    "    # Test L2 loss\n",
    "    print()\n",
    "    print(\"=\" * 60)\n",
    "    print(\"Testing L2 Loss:\")\n",
    "    print(\"=\" * 60)\n",
    "    gb_l2 = GradientBoostingRegressor(\n",
    "        n_estimators=100, learning_rate=0.1, max_depth=3, criterion=\"l2\"\n",
    "    )\n",
    "    gb_l2.fit(X_train, y_train)\n",
    "    pred_l2 = gb_l2.predict(X_test)\n",
    "    print(f\"  MSE: {mean_squared_error(y_test, pred_l2):.4f}\")\n",
    "\n",
    "    # Test stochastic gradient boosting\n",
    "    print()\n",
    "    print(\"=\" * 60)\n",
    "    print(\"Testing Stochastic GB (subsample=0.8):\")\n",
    "    print(\"=\" * 60)\n",
    "    gb_stoch = GradientBoostingRegressor(\n",
    "        n_estimators=100, learning_rate=0.1, max_depth=3, criterion=\"l1\", subsample=0.8\n",
    "    )\n",
    "    gb_stoch.fit(X_train, y_train)\n",
    "    pred_stoch = gb_stoch.predict(X_test)\n",
    "    print(f\"  MSE: {mean_squared_error(y_test, pred_stoch):.4f}\")\n",
    "\n",
    "    # Compare with sklearn\n",
    "    print(\"\\n\" + \"=\" * 60)\n",
    "    print(\"Comparison with sklearn:\")\n",
    "    print(\"=\" * 60)\n",
    "    from sklearn.ensemble import GradientBoostingRegressor as SklearnGB\n",
    "\n",
    "    sklearn_gb = SklearnGB(n_estimators=100, learning_rate=0.1, max_depth=3)\n",
    "    sklearn_gb.fit(X_train, y_train)\n",
    "    sklearn_pred = sklearn_gb.predict(X_test)\n",
    "    print(f\"  Sklearn GB MSE: {mean_squared_error(y_test, sklearn_pred):.4f}\")\n",
    "    print(f\"  Our GB MSE:     {mean_squared_error(y_test, pred_l2):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "509634bc-f52d-41c7-9a8d-e15014bcc446",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "## **Part 3: Optimisation (Bonus)**\n",
    "---\n",
    "### **1.7 Bonus: Early Stopping (5 Marks)**\n",
    "Training too many trees can lead to **overfitting** (memorising the training data).\n",
    "\n",
    "**Task:**\n",
    "Implement `GBREarlyStop`. This class checks a validation set (`X_val`, `y_val`) after every tree.\n",
    "* If the validation error does not improve for `patience` iterations, **stop the loop early**.\n",
    "* This saves time and usually improves accuracy on new data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9082331b",
   "metadata": {
    "tags": [
     "otter_answer_cell"
    ]
   },
   "outputs": [],
   "source": [
    "### ANSWER QUESTION 1.7 HERE\n",
    "\n",
    "\n",
    "class GBREarlyStop(GradientBoostingRegressor):\n",
    "\n",
    "    def __init__(\n",
    "        self,\n",
    "        n_estimators=100,\n",
    "        learning_rate=0.1,\n",
    "        max_depth=3,\n",
    "        criterion=\"l1\",\n",
    "        subsample=1.0,\n",
    "        random_state=42,\n",
    "    ):\n",
    "        \"\"\"\n",
    "        Gradient Boosting for regression with early stopping.\n",
    "\n",
    "        Parameters:\n",
    "        -----------\n",
    "        n_estimators : int\n",
    "            Number of boosting stages (trees)\n",
    "        learning_rate : float\n",
    "            Shrinkage factor for each tree's contribution\n",
    "        max_depth : int\n",
    "            Maximum depth of individual trees\n",
    "        criterion : str\n",
    "            Loss function: one of 'l1' or 'l2'\n",
    "        subsample : float\n",
    "            Fraction of samples used for fitting each tree (for stochastic GB)\n",
    "        random_state : int\n",
    "            Random seed for reproducibility\n",
    "\n",
    "        \"\"\"\n",
    "\n",
    "        self.n_estimators = n_estimators\n",
    "        self.learning_rate = learning_rate\n",
    "        self.max_depth = max_depth\n",
    "        self.criterion = criterion\n",
    "        self.subsample = subsample\n",
    "        self.random_state = random_state\n",
    "\n",
    "        self.trees = []\n",
    "        self.init_prediction = None\n",
    "\n",
    "    def fit(self, X_train, y_train, X_val, y_val, patience=10, min_delta=1e-4):\n",
    "        \"\"\"\n",
    "        Fits Boosting for Regression with early stopping\n",
    "\n",
    "        Parameters:\n",
    "        -----------\n",
    "\n",
    "        X_train : ndarray of shape (n_samples, n_features)\n",
    "            The training data array.\n",
    "        y_train : ndarray of shape (n_samples,)\n",
    "            The training target labels.\n",
    "        X_val : ndarray of shape (n_samples, n_features)\n",
    "            The evaluation data array.\n",
    "        y_val : ndarray of shape (n_samples,)\n",
    "            The evaluation target labels.\n",
    "        patience : int\n",
    "            Number of rounds of no improvements before training is stopped early\n",
    "        min_delta :\n",
    "            Minimum improvement in evaluation loss to count as an improvement\n",
    "        \"\"\"\n",
    "\n",
    "        rng = np.random.RandomState(self.random_state)\n",
    "        n_samples = X_train.shape[0]\n",
    "\n",
    "        # Tracking variables\n",
    "        train_losses = []\n",
    "        val_losses = []\n",
    "\n",
    "        best_val_loss = float(\"inf\")\n",
    "        best_n_estimators = 0\n",
    "        patience_counter = 0\n",
    "\n",
    "        # Step 1: Initialise prediction\n",
    "        self.init_prediction = self._compute_initial_prediction(y_train)\n",
    "        F_train = np.full(X_train.shape[0], self.init_prediction)\n",
    "        F_val = np.full(X_val.shape[0], self.init_prediction)\n",
    "\n",
    "        # Step 2: Iteratively fit trees\n",
    "        for m in range(self.n_estimators):\n",
    "            ### BONUS QUESTION 1.7 START\n",
    "\n",
    "            # a. Compute negative gradient\n",
    "\n",
    "            # b. Get subsample indices if using stochastic GB\n",
    "\n",
    "            # c. Fit a DecisionTreeRegressor to negative gradients\n",
    "\n",
    "            # d. Update predictions for training and evaluation sets\n",
    "\n",
    "            # e. Append tree to self.trees\n",
    "\n",
    "            # Compute losses on training and evaluation\n",
    "            ...\n",
    "\n",
    "            # Check for improvement on evaluation loss. Increment patience_counter if there is no improvement\n",
    "            ...\n",
    "\n",
    "            # Early stop if patience_counter exceeds patience\n",
    "            ...\n",
    "\n",
    "        # Trim trees to best number\n",
    "        self.trees = self.trees[:best_n_estimators]\n",
    "        self.n_estimators = best_n_estimators\n",
    "\n",
    "        ### BONUS QUESTION 1.7 END\n",
    "\n",
    "        return self"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fdace8ad",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "### TEST FOR BONUS QUESTION 1.7\n",
    "\n",
    "\n",
    "if __name__ == \"__main__\":\n",
    "    # Generate synthetic data\n",
    "    X, y = make_regression(n_samples=1000, n_features=10, noise=10, random_state=42)\n",
    "    X_train, X_test, y_train, y_test = train_test_split(\n",
    "        X, y, test_size=0.2, random_state=42\n",
    "    )\n",
    "\n",
    "    print(\"=\" * 60)\n",
    "    print(\"Testing Early Stopping:\")\n",
    "    print(\"=\" * 60)\n",
    "    X_train_es, X_val, y_train_es, y_val = train_test_split(\n",
    "        X_train, y_train, test_size=0.2, random_state=42\n",
    "    )\n",
    "\n",
    "    gb_es = GBREarlyStop(\n",
    "        n_estimators=500, learning_rate=0.1, max_depth=3, criterion=\"l2\"\n",
    "    )\n",
    "    gb_es.fit(X_train_es, y_train_es, X_val, y_val, patience=1)\n",
    "    pred_es = gb_es.predict(X_test)\n",
    "    print(f\"  Best n_estimators: {gb_es.n_estimators}\")\n",
    "    print(f\"  MSE: {mean_squared_error(y_test, pred_es):.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bbd1083a",
   "metadata": {
    "deletable": false,
    "editable": false,
    "otter": {
     "tests": [
      "q1.7"
     ]
    }
   },
   "source": [
    "## End of NOAI 2026 - Programming Question 1\n",
    "---"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "noai-questions-2026",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.14"
  },
  "otter": {
   "OK_FORMAT": true,
   "tests": {
    "q1.1": {
     "name": "q1.1",
     "points": null,
     "suites": [
      {
       "cases": [],
       "scored": true,
       "setup": "",
       "teardown": "",
       "type": "doctest"
      }
     ]
    },
    "q1.2": {
     "name": "q1.2",
     "points": null,
     "suites": [
      {
       "cases": [],
       "scored": true,
       "setup": "",
       "teardown": "",
       "type": "doctest"
      }
     ]
    },
    "q1.3": {
     "name": "q1.3",
     "points": null,
     "suites": [
      {
       "cases": [],
       "scored": true,
       "setup": "",
       "teardown": "",
       "type": "doctest"
      }
     ]
    },
    "q1.4": {
     "name": "q1.4",
     "points": null,
     "suites": [
      {
       "cases": [],
       "scored": true,
       "setup": "",
       "teardown": "",
       "type": "doctest"
      }
     ]
    },
    "q1.5_6": {
     "name": "q1.5_6",
     "points": null,
     "suites": [
      {
       "cases": [
        {
         "code": ">>> gb = GradientBoostingRegressor(n_estimators=10, criterion='l2')\n>>> np.random.seed(42)\n>>> X = np.random.rand(50, 3)\n>>> y = np.random.rand(50)\n>>> gb_fit = gb.fit(X, y)\n>>> assert len(gb.trees) == 10\n",
         "hidden": false,
         "locked": false
        }
       ],
       "scored": true,
       "setup": "",
       "teardown": "",
       "type": "doctest"
      }
     ]
    },
    "q1.7": {
     "name": "q1.7",
     "points": null,
     "suites": [
      {
       "cases": [
        {
         "code": ">>> X, y = make_regression(n_samples=300, n_features=5, noise=5, random_state=42)\n>>> X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)\n>>> gb = GBREarlyStop(n_estimators=500, criterion='l2')\n>>> gb_fit = gb.fit(X_train, y_train, X_val, y_val, patience=10)\n>>> n_trees = len(gb.trees)\n>>> preds = gb.predict(X_val)\n>>> assert preds.shape == y_val.shape\n>>> assert n_trees < 500\n",
         "hidden": false,
         "locked": false
        }
       ],
       "scored": true,
       "setup": "",
       "teardown": "",
       "type": "doctest"
      }
     ]
    }
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
