Enhance Mutual Fund Grouping Using Machine Learning
July 2023

How to optimise mutual fund categorisation through machine learning

The methods used to recommend mutual funds to customers vary greatly between companies. Recommendation techniques often rely on fund metadata, such as region, category, or investment objective. Grouping funds by these properties gives investors an overview of funds with similar classifications, so that they can select funds from the groups they are interested in.

While grouping mutual funds in this way may provide investors with a convenient way to explore funds that align with their preferences and investment strategy, this method of recommendation has some potential limitations and risks. For example, there is a risk of oversimplification of the funds themselves. Funds in groups may share a common region, category or asset type but can differ significantly in terms of performance, risk and management style. Investors may also become overconfident in the diversity of their portfolio by relying too much on picking mutual funds from different regions or categories.

One can instead opt for a more advanced partitioning technique, namely clustering. Clustering is an unsupervised machine learning task in which a data set is divided into smaller groups, called clusters. Here, the data consists of the mutual funds' historical performance. Clustering methods work by defining some measure of similarity or dissimilarity between the funds, then grouping similar funds together and placing dissimilar funds in separate clusters. By clustering mutual funds using some measure of correlation, investors can gain valuable insight into the relationships and dependencies between the funds. The resulting clusters consist of mutual funds that tend to move together and exhibit similar patterns of return over time. By selecting mutual funds from different clusters, the investor can thus create a portfolio that is diversified based on historical performance rather than manual classification.
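As a minimal sketch of the idea above, the Python snippet below clusters six synthetic funds using one minus the Pearson correlation of their returns as the dissimilarity. The data, the function names, and the choice of a simple k-medoids procedure are assumptions made for illustration; the article does not prescribe a particular clustering algorithm.

```python
import numpy as np

def correlation_dissimilarity(returns):
    """returns has shape (n_periods, n_funds); result is 1 - Pearson correlation."""
    return 1.0 - np.corrcoef(returns, rowvar=False)

def k_medoids(dissim, k, n_iter=100):
    """Simple k-medoids on a precomputed dissimilarity matrix (illustrative only)."""
    # Deterministic start: the most central fund, then repeatedly the fund
    # farthest from the medoids chosen so far.
    medoids = [int(np.argmin(dissim.sum(axis=1)))]
    while len(medoids) < k:
        medoids.append(int(np.argmax(dissim[:, medoids].min(axis=1))))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        # Assign every fund to its nearest medoid.
        labels = np.argmin(dissim[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid if a cluster ends up empty
            # The new medoid is the member with the smallest total
            # dissimilarity to the other members of its cluster.
            within = dissim[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dissim[:, medoids], axis=1)
    return labels, medoids

# Synthetic daily returns (an assumption, not real fund data): funds 0-2 follow
# one common factor, funds 3-5 follow another, each with its own small noise.
rng = np.random.default_rng(0)
n_periods = 500
factor_a = rng.normal(0, 0.01, n_periods)
factor_b = rng.normal(0, 0.01, n_periods)
noise = rng.normal(0, 0.003, (n_periods, 6))
returns = np.column_stack([factor_a] * 3 + [factor_b] * 3) + noise

labels, medoids = k_medoids(correlation_dissimilarity(returns), k=2)
print("cluster label per fund:", labels)
print("medoid fund per cluster:", medoids)
```

In this toy setting, funds that share a factor end up in the same cluster, so picking one fund from each cluster gives two holdings that do not move together.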

By evaluating their existing investment portfolio, investors can identify mutual funds that complement their current holdings. This can be achieved by selecting funds from clusters in which they presently hold no funds. Moreover, within the clusters they already hold, investors can discover funds comparable to their current holdings that might exhibit superior performance or have lower fees.

Performing clustering on mutual funds also highlights the flaws of partitioning them using metadata. For example, North American funds are typically heavily correlated with Global funds, because the majority of the stocks making up a Global fund are usually North American. Without this knowledge, an investor may falsely believe that owning both Global funds and North American funds increases the diversity of their portfolio, when this is not always the case.

For the clustering results to be meaningful, one must select the clustering method that produces the best partition of the data. But how do we determine which method is the best? It turns out that this is no trivial task. In the master's thesis "Evaluating clustering techniques in financial time series", a range of validation methods was examined in order to find ones suitable for financial time series data.

What to consider when evaluating clustering methods

There are a few factors that are important to consider when selecting a method of cluster evaluation. One should consider the properties of the data being clustered and select validation methods accordingly. Financial time series describing mutual funds are time-dependent and multivariate, which may break assumptions made by conventional cluster validation methods. Additionally, the measure of dissimilarity between the mutual funds affects which evaluation methods are available. Using some measure of correlation is intuitive, since we expect similar funds to correlate with one another. The problem with basing a distance or dissimilarity measure on correlation, however, is that the distance will be non-Euclidean. The Euclidean distance between two time series is built from the pointwise differences between observations with the same index (the square root of the sum of their squares). Correlation distances instead quantify how the series move in relation to one another. Some quantitative evaluation methods rely on the notion of a centroid, which is the arithmetic mean position of all data points in the cluster. Unfortunately, centroids are generally not well defined for non-Euclidean distances.
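The difference between the two notions of distance can be illustrated with a small Python sketch. The return series below are synthetic, and using one minus the Pearson correlation as the dissimilarity is an assumption made for the example; other correlation-based measures exist.

```python
import numpy as np

def euclidean_distance(x, y):
    """Square root of the summed squared pointwise differences."""
    return float(np.sqrt(np.sum((x - y) ** 2)))

def corr_dissimilarity(x, y):
    """One minus the Pearson correlation (0 when the series move in lockstep)."""
    return float(1.0 - np.corrcoef(x, y)[0, 1])

# Synthetic daily returns (assumed data, for illustration only).
rng = np.random.default_rng(1)
base = rng.normal(0, 0.01, 250)
leveraged = 3.0 * base                  # same direction every day, three times the size
unrelated = rng.normal(0, 0.01, 250)    # independent of `base`

for name, other in [("leveraged", leveraged), ("unrelated", unrelated)]:
    print(
        f"base vs {name}: "
        f"euclidean={euclidean_distance(base, other):.3f}, "
        f"1 - corr={corr_dissimilarity(base, other):.3f}"
    )
```

The Euclidean distance treats the leveraged series as farther from the base series than the unrelated one, whereas the correlation-based dissimilarity treats the base and leveraged series as essentially identical, which is closer to what an investor means by funds that move together.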

The search for the optimal clustering method

So what alternative course of action do we take? In the thesis, alternative validation methods better suited to time series clusters were researched. The methods that performed best when evaluating the clustering results took both the stability and the quality of the clusters into account.

Since we are working with time series, the clustering result may change when we use different segments of the series. A clustering algorithm can thus be considered more stable the less its result changes when different parts of the time series are used to perform the clustering. By this definition, an algorithm that places all data points in one cluster receives a perfect stability score. To counteract this, the quality of the clusters must also be considered. The quality can be measured in many ways, but should preferably capture the density of the cluster. One method is to use the mean squared distance from all mutual funds in a cluster to the fund with the smallest total distance to the other funds in that cluster. This fund is called the medoid of the cluster, and is the non-Euclidean analogue of the centroid.
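The two ingredients described above can be sketched in Python as follows. This is not the validation method used in the thesis; it is a simplified illustration in which stability is approximated by the share of fund pairs treated the same way in two clusterings (the Rand index), and quality by the mean squared dissimilarity to the cluster medoid. The dissimilarity matrix, the labels, and the function names are made up for the example.

```python
from itertools import combinations

import numpy as np

def pair_agreement(labels_a, labels_b):
    """Share of fund pairs that are together in both clusterings or apart in both."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def medoid_index(dissim, members):
    """Member with the smallest total dissimilarity to the other members."""
    sub = dissim[np.ix_(members, members)]
    return int(members[np.argmin(sub.sum(axis=1))])

def cluster_quality(dissim, labels):
    """Mean squared dissimilarity from each fund to its cluster medoid (lower is tighter)."""
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        m = medoid_index(dissim, members)
        total += np.sum(dissim[members, m] ** 2)
    return total / len(labels)

# Hypothetical dissimilarities between four funds: 0 and 1 are close, 2 and 3 are close.
dissim = np.array([
    [0.0, 0.1, 0.9, 1.0],
    [0.1, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.2],
    [1.0, 0.9, 0.2, 0.0],
])
window_1 = [0, 0, 1, 1]     # clustering obtained from one part of the series
window_2 = [0, 0, 1, 0]     # fund 3 switches cluster in another part
one_cluster = [0, 0, 0, 0]  # trivial clustering: everything together

print("stability, two clusters:", pair_agreement(window_1, window_2))
print("stability, one cluster: ", pair_agreement(one_cluster, one_cluster))
print("quality, two clusters:  ", cluster_quality(dissim, window_1))
print("quality, one cluster:   ", cluster_quality(dissim, one_cluster))
```

The trivial single-cluster solution is perfectly stable but scores much worse on quality, which is why the two measures need to be considered together.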

In the thesis, a validation method called Cluster Over-Time Stability Evaluation (CLOSE) (Klassen et al. (2022)) was applied to clustering results containing mutual funds available on the Swedish market. This method was originally developed for clusters of time series, and considers both the stability and the quality of the clusters during evaluation. It showed that there is a trade-off between cluster stability and quality, and gave the highest ratings to clustering algorithms that produced highly stable clusters of acceptable quality.

The experiments also showed that while the method performed the best of the validation methods tested, it emphasised stability over cluster quality. If slightly less stable clusters of higher quality are preferred, CLOSE still provides a good starting point for the clustering method's parameters, which can then be tuned further to one's liking.

Consider using clustering for mutual funds

When selecting a clustering method for your financial time series data, it is important to consider both the stability of the clusters over time as well as the quality of the clusters themselves. A clustering method that produces stable clusters may be more robust and reliable in the future as well as when new observations are added to the time series. By considering the trade-off between stability and quality, one can decide what is more important: stability, quality, or a mix of both.

While selecting an appropriate clustering method requires a few extra steps to achieve good results, there are many benefits to using clustering over grouping mutual funds by metadata. Clustering uses the historical performance of the mutual funds to partition them into separate groups, and can help in giving customers tailored advice based on their current mutual fund portfolio. Clustering reduces the risk of over-reliance on metadata-driven grouping and offers a more nuanced understanding of mutual fund performance and diversity. By selecting mutual funds from different clusters created using a measure of correlation, investors can manage portfolio risk by spreading it across mutual funds that do not correlate with one another.


G. Klassen, M. Tatusch, and S. Conrad, “Cluster-based stability evaluation in time series data sets”. In: Applied Intelligence (2022), pp. 1–24.

J. Millberg, “Evaluating clustering techniques in financial time series”, 2023. URL
