Models are at the core of InsightRX Nova, our dosing application for model-informed precision dosing (MIPD). More specifically, these models are population pharmacokinetic (popPK) models, which provide a quantitative and somewhat mechanistic interpretation of the fate of drugs inside the human body. Especially when applied with Therapeutic Drug Monitoring (TDM) and Bayesian feedback, these models can be a great tool for tailoring drug exposure through individualization of dosing regimens.
In a recent blog post I wrote about the importance of selecting the optimal model, but I didn’t go into the statistical techniques used to actually make such decisions. In this blog post, I’ll cover these aspects in more detail: How is model performance commonly evaluated and compared in literature, and what should you look out for when reading such articles?
As is often the case in science, there is no single approach that is universally accepted as the best (or even the most widely used) method. All approaches follow the same general pattern in that they compare collected observations (usually TDM data) with predictions from a popPK model. However, it is important to realize which approach(es) are used, as they may differ in whether they look backward or forward (fitting historical data vs. forecasting future data) or in the type of data used (with or without TDM); therefore, the interpretation of their results will differ, and metrics cannot be compared between them.
Roughly speaking, three different approaches to generating predictions of drug exposure can be distinguished when comparing the performance of popPK models in MIPD1: population predictions, individual fitted predictions, and individual forecasted predictions. I’ll go over them one by one below.
Population predictions are always based only on characteristics of the patient (“covariates”) known at the time of dosing, such as weight, sex, age, renal function, etc., as well as the dosing record, but explicitly not on measured drug concentrations (i.e., those collected as part of TDM). The approach is therefore forward-looking and forecasts future drug exposure in the patient. If no TDM is performed, this is the only type of prediction that is available.
Predictions made using individual fits are based on a Bayesian estimate that weighs the population prior against a fit of the model to some or all of the TDM samples available for the patient. A key characteristic is that the “predictions” in this approach are always looking backward, and are not forecasts of future drug levels. Obviously, the fitted predictions will be closer to the observed TDM than the population predictions were, as the fit is now partially based on the measured concentrations themselves.
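To make this weighing a bit more tangible, below is a minimal sketch of the kind of objective function that is minimized in a MAP (“maximum a posteriori”) Bayesian fit, assuming additive, normally distributed residual error and normally distributed inter-individual variability. The names `predict_fn`, `omega2`, and `sigma2` are illustrative placeholders, not actual InsightRX code: the first term pulls the estimate toward the measured concentrations, the second term pulls it back toward the population prior.

```python
import numpy as np

def map_objective(eta, observed, predict_fn, omega2, sigma2):
    """Sketch of a MAP (Bayesian) objective function.

    Balances the fit to the observed TDM samples (first term) against the
    deviation of the individual parameters `eta` from the population prior
    (second term). `predict_fn` maps `eta` to predicted concentrations at
    the observation times and is a placeholder for an actual PK model.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predict_fn(eta), dtype=float)
    fit_term = np.sum((observed - predicted) ** 2 / sigma2)   # pull toward the data
    prior_term = np.sum(np.asarray(eta) ** 2 / omega2)        # pull toward the prior
    return fit_term + prior_term  # minimized to obtain the MAP estimate
```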
Lastly, forecasted predictions are also based on fits to one or more retrospective TDM samples, but predictions are then calculated only for subsequently measured TDM samples, usually using an iterative approach. For example, first, a Bayesian fit is performed based on only the first TDM sample; the individual estimates are then updated and used to predict the second TDM sample. Next, a Bayesian fit based on the first two TDM samples provides a new set of individual model parameters, from which a new prediction for the third TDM sample is made. And so on, until we have a forecasted prediction for each TDM sample. This approach thus mimics exactly how MIPD is typically implemented in clinical practice (I’ll discuss an exception below), and will therefore provide the best translation to actual clinical performance. When the approach used to calculate predictive ability matches how the model will be used in practice later, we call such an analysis “fit for purpose” (see ter Heine et al).2
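As an illustration of this iterative scheme, here is a minimal sketch of the evaluation loop (dosing history and other inputs are omitted for brevity). The `fit` and `predict` callables are hypothetical placeholders for a Bayesian estimation step and a model simulation step, to be supplied by whatever PK engine is used; this is not the actual InsightRX implementation.

```python
from typing import Callable, Dict, List, Tuple

def forecasted_predictions(
    tdm_times: List[float],
    tdm_obs: List[float],
    fit: Callable[[List[float], List[float]], Dict[str, float]],
    predict: Callable[[Dict[str, float], float], float],
) -> List[Tuple[float, float]]:
    """Iteratively forecast each TDM sample from the samples before it.

    `fit` performs a Bayesian fit on the TDM times/observations seen so far
    and returns individual parameter estimates; `predict` simulates the
    concentration at a given time from those estimates.
    """
    pairs = []
    for i in range(1, len(tdm_obs)):
        params = fit(tdm_times[:i], tdm_obs[:i])   # fit on samples 0..i-1 only
        forecast = predict(params, tdm_times[i])   # forecast the next sample
        pairs.append((forecast, tdm_obs[i]))       # (forecasted, observed) pair
    return pairs
```

The forecast/observation pairs returned this way are what the bias and accuracy metrics discussed later in this post are computed on.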
So, in summary, the hallmark of approaches 1 and 3 is that they forecast future data. This is in contrast to approach 2, which is always based only on fitting historical data. At InsightRX, we prioritize how well our models predict future drug levels, rather than just how closely they fit historical data. This focus on forecasting precision ensures our dosing recommendations can perform reliably in real-world clinical settings.
An interesting analogy is algorithm-based stock trading. In algorithmic trading, algorithms are always “backtested” before being put into production: the trading strategy is initially trained on historical data, but it is then always tested on how it would perform on future data, i.e., data not seen during training. This is obvious for trading, since we care about future gains and less about how well we can describe historical patterns.3 The same is true for MIPD, and we therefore consider approaches that look at forecasting accuracy the gold standard for model comparisons in MIPD.
Still, many comparative studies in literature look only at approach 2 (fits) and not at forecasting accuracy. Is it a big deal if models are evaluated and compared based only on fit as opposed to forecasting accuracy? It depends. In most cases, the fit approach may directionally give similar answers to the forecasting approach: a model that fits very poorly will likely not have good forecasting performance, and a model that fits great will often show at least decent forecasting performance. It is more in the “grey zone” where differences will be seen. For example, it often happens that a model that provides good individual fits will only show moderate forecasting performance. This may happen if the model is “overfitting” the available TDM data: in such scenarios it may have lower accuracy than a model that only showed mediocre fit. It is therefore much harder to draw firm conclusions about real-world performance from analyses that only focus on fit accuracy.
One caveat to the forecasting approach that one could raise is that it may not always be truly “fit for purpose”: for example, when optimizing drug exposure for vancomycin, we are primarily interested in targeting the area under the curve (AUC) and not trough levels. So ideally we would want to know how well we can predict future AUCs, rather than future TDM levels. However, answering this question would require dedicated clinical trials, which are neither financially nor ethically feasible, so looking at predictive performance on future TDM levels is the next best thing. If both trough and peak levels are available in a dataset, and possibly even mid-interval samples, predictive performance measured this way should still be a good proxy for the prediction of the shape of the PK curve, and hence of AUC.
So why is the forecasted approach not always reported in literature? There may be several reasons for this.4 One reason is that it is not (yet!) a default statistic in most pharmacometricians’ workflows and is not something that is taught during training. Pharmacometricians focus primarily on the fit of the model to the data; only rarely is forecasting precision a concern. This is because in most cases models are developed not for use in MIPD, but to answer specific questions during drug development (from clinical trial data) or routine clinical use (e.g. to study drug-drug interactions). In such scenarios, forecasting accuracy is less important than population predictions or model fits. A second reason is that the calculation of forecasting precision is more complex than the calculation of population and individual predictions, which are computed routinely in all model development software. In contrast, forecasting precision requires more specialized software or custom scripting. At InsightRX, we have automated procedures set up to calculate these statistics, either by simulating data or through analysis of historical data5, but adoption of these tools in pharmacometrics is lagging.
Beyond these high-level differences between model comparison approaches, another important aspect is the statistic that is used to summarize predictive performance. If you read a few articles on PK model comparisons, there is a good chance you will see at least a handful of statistical metrics thrown around, such as MPE, R², or RMSE, and researchers may speak of bias, precision, and accuracy. An overview of many of the common statistics was recently published, but I’ll give a brief introduction into how we think about these metrics below. I’d like to highlight that: a) there is no single statistic that tells the entire story, b) most statistics are not inherently good or bad; they just summarize different aspects of a model’s performance, and c) while it might be tempting, one should be extremely careful comparing statistics between different articles. I’ll get back to that last point later.
At a high level, we need to distinguish between the different aspects of predictive performance that these metrics measure. Other blogs and articles already provide excellent explanations and visualizations of these terms, but when comparing models we generally distinguish bias and accuracy (with a note on precision below):
Bias: on average, are predictions under- or overshooting the observed data? For example, one could describe bias as “on average the model over-predicts by 2 mg/L”. Bias is definitely useful to know, but a model with low or absent bias can still be very poor if its predictions have large random errors. For bias, the most often used statistic is the “mean percentage error” or MPE, and this is probably the easiest and least controversial choice. It is calculated as the average of the relative (percentage) differences between the predicted and observed values.
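As a concrete illustration, a minimal MPE calculation could look as follows; note that sign conventions and denominators vary between publications, so this is one common variant rather than a canonical definition.

```python
import numpy as np

def mean_percentage_error(observed, predicted):
    """Mean percentage error (MPE): the average relative deviation of the
    predictions from the observations, expressed as a percentage. With this
    sign convention, positive values indicate over-prediction."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return 100.0 * np.mean((predicted - observed) / observed)
```

For example, with observed values of 10 and 20 mg/L and predictions of 12 and 19 mg/L, this returns +7.5%, i.e. the model over-predicts by 7.5% on average.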
Accuracy: in my view this is a more important measure than bias, as it captures both the bias and the magnitude of the random error in the predictions. To measure accuracy, at InsightRX we prefer the use of a custom metric that is also easily interpretable for clinicians: we often define it as the percentage of future predictions within an acceptable range of the measured or true value. For example, one could define accuracy in a specific analysis as “the percentage of predictions that are within 2 mg/L or 15% of the observed TDM”. Of course, with this custom definition, what constitutes the “acceptable range” is still subjective (is it 20% or 25%?), and ultimately what is an acceptable accuracy (should it be 80% or 90%?) is again subjective... But subjectivity is unavoidable for any metric, so we feel that it is more important to make sure the metric is at least understood by the end-user of the model so it can be incorporated into decision-making. In any case, metrics are more useful for comparing between models and prediction approaches than for (in-)validating a single model or forecasting approach. We often also compute the root mean squared error (RMSE), which can also be considered a measure of accuracy. However, we prefer our custom definition because RMSE is less intuitive for the average clinician.
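As a sketch of what this could look like in practice, using the example thresholds mentioned above (2 mg/L or 15%; both are analysis-specific choices rather than fixed InsightRX defaults), together with a plain RMSE for comparison:

```python
import numpy as np

def accuracy_within(observed, predicted, abs_tol=2.0, rel_tol=0.15):
    """Percentage of predictions within `abs_tol` (e.g. 2 mg/L) or `rel_tol`
    (e.g. 15%) of the observed value; thresholds are analysis-specific."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    abs_err = np.abs(predicted - observed)
    within = (abs_err <= abs_tol) | (abs_err <= rel_tol * np.abs(observed))
    return 100.0 * np.mean(within)

def rmse(observed, predicted):
    """Root mean squared error of the predictions."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))
```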
Precision: we generally refrain from using the term “precision” when evaluating models because, following its technical definition, precision relates to repeated measurements of the same quantity, which we generally do not have when evaluating model predictions. In principle, we are able to calculate the uncertainty around our predictions (in Nova this uncertainty can be shown by selecting “Show CI (80%)” under the plotting options for the concentration-time curves). However, when we compare models for predictive performance we generally only look at the point estimates of the predictions and ignore the uncertainty.
For both bias and accuracy, other groups and software packages may have different preferences: sometimes measures like RMSE or the mean absolute prediction error (MAPE) are referred to as “precision”, even though technically precision should not look at differences between predictions and true values. As long as the underlying statistics are well defined, this is not a major problem: ultimately, terms like bias and precision are just labels, and only the actual statistics like RMSE and MPE have explicit mathematical definitions. Given this variability in nomenclature, it is important to read and understand how terms are defined in each paper.
Furthermore, as I already mentioned, it is difficult to compare statistics between different papers. For example, metrics like RMSE are especially sensitive to outliers (as residuals are squared), so if one analysis applied a more stringent removal of outliers and erroneous data than another, their metrics cannot be compared reliably. Besides differences in the underlying data, we often see differences in the definitions of the statistics that are used, even if the same acronym is used (e.g. “mean percentage error (MPE)” and “median prediction error (MPE)” are very different!), and, as mentioned, accuracy and precision are sometimes used interchangeably. Even within our own data science team we may occasionally choose a different statistic for different analyses depending on clinical considerations. For example, we often prefer to use the normalized RMSE as a measure of prediction error (normalized to the observations), as it is a bit more intuitive than “regular” RMSE. However, when many observations are close to zero, the normalized RMSE may be greatly inflated and become unusable as a statistic. In such cases, regular RMSE is more useful.
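For completeness, one common way to compute a normalized RMSE is shown below, here normalized to the mean of the observations; other normalizations (e.g. to the range of the observations, or per observation) are also in use, so this is an illustrative choice rather than a standard.

```python
import numpy as np

def normalized_rmse(observed, predicted):
    """RMSE normalized to the mean of the observations. Note that this
    statistic blows up when observations are close to zero (see the caveat
    in the text above)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse_value = np.sqrt(np.mean((predicted - observed) ** 2))
    return float(rmse_value / np.mean(observed))
```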
A final note about the statistics that are reported: ideally, articles should report not only the point value of a statistic but also its confidence interval or standard error. Especially if the goal of the analysis is to pick one or more “best” models, one should really look at the statistical significance of the differences between the model statistics. In our own research papers we commonly choose to report a “best” model but also to highlight which models were “not significantly different from the best model”. Confidence intervals can be calculated fairly easily using parametric approximations, but they can also be obtained, usually more reliably, using non-parametric bootstraps.
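A non-parametric (percentile) bootstrap of such a statistic is straightforward to sketch: resample observation/prediction pairs with replacement, recompute the statistic on each resample, and take the relevant percentiles. The number of resamples and the percentile method below are common but arbitrary choices.

```python
import numpy as np

def bootstrap_ci(observed, predicted, statistic, n_boot=2000, alpha=0.05, seed=1):
    """Non-parametric (percentile) bootstrap confidence interval for a
    prediction-error statistic computed on paired observed/predicted values."""
    rng = np.random.default_rng(seed)
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    n = len(observed)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample observation/prediction pairs
        stats[b] = statistic(observed[idx], predicted[idx])
    lower, upper = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper
```

For example, `bootstrap_ci(obs, pred, rmse)` would give an approximate 95% confidence interval around the RMSE function sketched earlier, for hypothetical arrays `obs` and `pred` of observed and forecasted concentrations.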
Model comparison is an intricate process, and while we can make some objective distinctions, there will always be some subjectivity to these analyses: for example, in the selection of statistical approaches and metrics, and in the thresholds used to judge model performance. When reading model comparisons in the literature, I would advise considering at least: