Machine learning (ML) is progressively reshaping the fields of quantitative finance and algorithmic trading. ML tools are increasingly adopted by hedge funds and asset managers, notably for alpha signal generation and stock selection. The technicality of the subject can make it hard for non-specialists to join the bandwagon, as the jargon and coding requirements may seem out of reach. Machine Learning for Factor Investing: R Version bridges this gap. It provides a comprehensive tour of modern ML-based investment strategies that rely on firm characteristics.
The book covers a wide array of subjects, ranging from economic rationales to rigorous portfolio back-testing, and encompasses both data processing and model interpretability. Common supervised learning algorithms such as tree models and neural networks are explained in the context of style investing, and the reader can also dig into more complex techniques like autoencoder asset returns, Bayesian additive trees, and causal models.
All topics are illustrated with self-contained R code samples and snippets, applied to a large public dataset containing over 90 predictors. The material, along with the content of the book, is available online so that readers can reproduce and enhance the examples at their convenience. If you have even a basic knowledge of quantitative finance, this combination of theoretical concepts and practical illustrations will help you learn quickly and deepen your financial and technical expertise.
Guillaume Coqueret is associate professor of finance and data science at EMLYON Business School. His recent research revolves around applications of machine learning tools in financial economics.
Tony Guida is executive director at RAM Active Investments. He serves as chair of the machineByte think tank and is the author of Big Data and Machine Learning in Quantitative Investment.
1. Preface
   What this book is not about; The targeted audience; How this book is structured; Companion website; Why R?; Coding instructions; Acknowledgements; Future developments
2. Notations and data
   Notations; Dataset
3. Introduction
   Context; Portfolio construction: the workflow; Machine Learning is no Magic Wand
4. Factor investing and asset pricing anomalies
   Introduction; Detecting anomalies; Simple portfolio sorts; Factors; Predictive regressions, sorts, and p-value issues; Fama-Macbeth regressions; Factor competition; Advanced techniques; Factors or characteristics?; Hot topics: momentum, timing and ESG; Factor momentum; Factor timing; The green factors; The link with machine learning; A short list of recent references; Explicit connections with asset pricing models; Coding exercises
5. Data preprocessing
   Know your data; Missing data; Outlier detection; Feature engineering; Feature selection; Scaling the predictors; Labelling; Simple labels; Categorical labels; The triple barrier method; Filtering the sample; Return horizons; Handling persistence; Extensions; Transforming features; Macro-economic variables; Active learning; Additional code and results; Impact of rescaling: graphical representation; Impact of rescaling: toy example; Coding exercises
II Common supervised algorithms
6. Penalized regressions and sparse hedging for minimum variance portfolios
   Penalised regressions; Simple regressions; Forms of penalizations; Illustrations; Sparse hedging for minimum variance portfolios; Presentation and derivations; Example; Predictive regressions; Literature review and principle; Code and results; Coding exercise
7. Tree-based methods
   Simple trees; Principle; Further details on classification; Pruning criteria; Code and interpretation; Random forests; Principle; Code and results; Boosted trees: Adaboost; Methodology; Illustration; Boosted trees: extreme gradient boosting; Managing Loss; Penalisation; Aggregation; Tree structure; Extensions; Code and results; Instance weighting; Discussion; Coding exercises
8. Neural networks
   The original perceptron; Multilayer perceptron (MLP); Introduction and notations; Universal approximation; Learning via back-propagation; Further details on classification; How deep should we go? And other practical issues; Architectural choices; Frequency of weight updates and learning duration; Penalizations and dropout; Code samples and comments for vanilla MLP; Regression example; Classification example; Custom losses; Recurrent networks; Presentation; Code and results; Other common architectures; Generative adversarial networks; Auto-encoders; A word on convolutional networks; Advanced architectures; Coding exercise
9. Support vector machines
   SVM for classification; SVM for regression; Practice; Coding exercises
10. Bayesian methods
   The Bayesian framework; Bayesian sampling; Gibbs sampling; Metropolis-Hastings sampling; Bayesian linear regression; Naive Bayes classifier; Bayesian additive trees; General formulation; Priors; Sampling and predictions; Code
III From predictions to portfolios
11. Validating and tuning
   Learning metrics; Regression analysis; Classification analysis; Validation; The variance-bias tradeoff: theory; The variance-bias tradeoff: illustration; The risk of overfitting: principle; The risk of overfitting: some solutions; The search for good hyperparameters; Methods; Example: grid search; Example: Bayesian optimization; Short discussion on validation in backtests
12. Ensemble models
   Linear ensembles; Principles; Example; Stacked ensembles; Two stage training; Code and results; Extensions; Exogenous variables; Shrinking inter-model correlations; Exercise
13. Portfolio backtesting
   Setting the protocol; Turning signals into portfolio weights; Performance metrics; Discussion; Pure performance and risk indicators; Factor-based evaluation; Risk-adjusted measures; Transaction costs and turnover; Common errors and issues; Forward looking data; Backtest overfitting; Simple safeguards; Implication of non-stationarity: forecasting is hard; General comments; The no free lunch theorem; Example; Coding exercises
IV Further important topics
14. Interpretability
   Global interpretations; Simple models as surrogates; Variable importance (tree-based); Variable importance (agnostic); Partial dependence plot; Local interpretations; LIME; Shapley values; Breakdown
15. Two key concepts: causality and non-stationarity
   Causality; Granger causality; Causal additive models; Structural time-series models; Dealing with changing environments; Non-stationarity: yet another illustration; Online learning; Homogeneous transfer learning
16. Unsupervised learning
   The problem with correlated predictors; Principal component analysis and autoencoders; A bit of algebra; PCA; Autoencoders; Application; Clustering via k-means; Nearest neighbors; Coding exercise
17. Reinforcement learning
   Theoretical layout; General framework; Q-learning; SARSA; The curse of dimensionality; Policy gradient; Principle; Extensions; Simple examples; Q-learning with simulations; Q-learning with market data; Concluding remarks; Exercises
Data Description
Solution to exercises