Rajeev Kasthuri
May 12, 2004
Sector Exposure Model
Simple Multifactor Regression
Portfolio optimization seeks the minimum risk for a given expected return. The risk structure of securities, such as their exposure to countries, industrial sectors, or commodities and other factors, has to be characterized first; the optimal weights of the securities in a portfolio can then be determined to minimize the portfolio's exposure to any specific risk factor.
Typically, the risk factors are not independent; for example, the health and pharmaceutical industries are generally positively correlated. In addition, the expected return of a security is related to its exposure to certain risk factors. Therefore, portfolio optimization, or risk management, cannot be done by simply minimizing the exposure of the portfolio to each risk factor independently. Instead, an efficient portfolio frontier can be constructed using Markowitz portfolio theory [1]. However, the statistical inputs required to apply Markowitz portfolio theory to a large portfolio are significant. Specifically, we must estimate the expected returns and the covariance matrix, which becomes difficult for a large portfolio, and a Sector Exposure Model, or Multifactor Model, becomes necessary to achieve this goal. The objectives of developing sector exposure models are therefore two-fold: to understand and control the exposure of our portfolio to any specific risk factor, and to implement portfolio optimization or risk management.
The CAPM assumes that the only relevant source of risk arises from variations in stock returns, and classifies sources of uncertainty as systematic (market) factors or idiosyncratic (firm-specific) factors [2]. A representative (market) portfolio can perfectly eliminate the idiosyncratic risks of individual stocks. Therefore, only systematic risk of an individual stock can contribute to overall portfolio risk; hence the risk premium on an individual stock is determined solely by its beta on the market portfolio.
In reality, however, this oversimplified view of risk cannot fully capture the movement of the equity market. For example, firm size is an important characteristic of a firm. Investors working for small companies diversify their portfolios by adding more large stocks, and vice versa. If the demands for small stocks and large stocks are mismatched, stock prices, and therefore the expected returns of small and large stocks, will move away from the CAPM prediction. Merton developed a multifactor CAPM (ICAPM) by deriving the demand for securities from investors' concern with lifetime consumption [3]. While the single-factor CAPM predicts that only market risk affects expected returns, the ICAPM predicts that other sources of risk can also affect them. Possible common sources of uncertainty that affect expected security returns include the size, value and volatility of the firm, and the industry and country in which the firm operates.
Empirically, the R-square of a single-factor model regression, which measures the fraction of the variation in a security's return that can be attributed to variation in the market return, typically averages 0.16 [4]. A large part of the risk of an individual security is left unexplained. Therefore, from an empirical point of view, a multifactor model is preferred over a single-factor model.
1.2. Application of Multifactor Model
Sector Exposure Analysis
The Multifactor Model can be used to quantify the risk exposure to a specific factor and to measure a security's performance. For example, by how much does IBM's stock outperform the risk-free interest rate after adjusting for risk? Theoretically, if its risk-adjusted expected return is larger than the risk-free interest rate, IBM is performing well and adding its stock to an investor's portfolio is a good choice. Furthermore, when comparing the risk profiles of two companies in the same area, for example Goldman Sachs and JP Morgan, the company with smaller and more uniform risk coefficients across the various factors is better at risk management and diversification.
To quantify the risk exposure of a stock, for example IBM, to various factors, such as computers or transportation, the simplest and most naive way is to calculate the correlation between IBM and each risk factor directly. This approach is valid only if there is little interaction or dependence between the risk factors. Unfortunately, things are never this simple in real life. A high correlation between the return of IBM and the transportation sector may simply result from the transportation sector happening to move together with the computer sector within our observation window, which hides or corrupts the real information about IBM's risk profile. Therefore, extracting the real risk characteristics of a stock requires a more sophisticated statistical tool, such as the Multifactor Model.
Customize portfolio
By understanding the exposure of each individual security to each risk factor, we can construct a portfolio with specific risk characteristics to match an investor's risk preference. For example, if our client demands a portfolio immune to variation in the gold market, we can add a constraint that sets the exposure of the portfolio to the gold factor to zero.
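As a minimal sketch of such a constraint, assume each security's exposure to the gold factor has already been estimated. The portfolio's exposure is then the weighted sum of the individual exposures, and the constraint requires that sum to be zero. The function names and the two exposures below are hypothetical and only illustrate the arithmetic for a two-security portfolio.

```python
def portfolio_exposure(weights, exposures):
    """Portfolio exposure to one factor: sum of weight * security exposure."""
    if len(weights) != len(exposures):
        raise ValueError("weights and exposures must have the same length")
    return sum(w * b for w, b in zip(weights, exposures))


def neutralizing_weights(beta_a, beta_b):
    """Weights (w_a, w_b) summing to 1 that zero the exposure of a two-asset mix.

    Solves w_a * beta_a + (1 - w_a) * beta_b = 0 for w_a.
    """
    if beta_a == beta_b:
        raise ValueError("exposures must differ to neutralize the factor")
    w_a = beta_b / (beta_b - beta_a)
    return w_a, 1.0 - w_a


# Hypothetical gold-factor exposures of two securities (not from our data).
w_a, w_b = neutralizing_weights(0.30, -0.10)
print(w_a, w_b, portfolio_exposure([w_a, w_b], [0.30, -0.10]))
```

With more than two securities the zero-exposure condition becomes one linear equality constraint among many in the optimizer, rather than something solved in closed form.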
Covariance Matrix Construction
Portfolio optimization requires as input a covariance matrix of the securities available to investors. Because the number of available securities is extremely large, constructing the covariance matrix is not trivial. For example, suppose we have n assets and T historical returns for each asset. If n > T, the sample covariance matrix is singular. Therefore, we cannot rely on a covariance matrix calculated directly from the historical data. Instead, we can put more structure on returns, i.e. the linear multifactor model.
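To make the extra structure concrete, the covariance implied by a linear factor model is commonly written as below. The symbols are our own notation for illustration: r is the vector of n security returns, f the vector of k factor returns, B the n-by-k matrix of factor exposures, F the k-by-k covariance matrix of the factors, and D the diagonal matrix of residual variances. The expression assumes the residuals are uncorrelated with the factors and with each other.

```latex
r = a + B f + \varepsilon
\quad\Longrightarrow\quad
\Sigma = \operatorname{Cov}(r) = B F B^{\top} + D
```

Under these assumptions only nk exposures, k(k+1)/2 factor covariances and n residual variances need to be estimated, instead of n(n+1)/2 free covariance entries, and the resulting matrix is positive definite whenever every residual variance is positive.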
1.3. Several risk factors models
Single set risk factors models
These models assume that all equities are driven by exactly one parsimonious set of factors. For example, various industries in the
In this project, we plan to explore a large set of risk factors, including stock characteristics (size), industry (Autos, Energy, Financials, etc.) and others, provided data are accessible. Various statistical tools and tests, such as stepwise regression (forward and backward) and the t-test, can be used to determine the set of risk factors that most effectively explains the security returns [6]. With multiple regression, we might run into the problems of heteroskedasticity, serial correlation, or multicollinearity [7]. Therefore, care needs to be taken to detect and correct each possible problem.
Multiset risk factors model, BARRA
BARRA proposed a multiset risk factors model based on the observation that various industries in different countries are correlated to a limited extent, but not perfectly [8]. Therefore, their approach to modeling global equities consists of building a set of local risk models and establishing the linkage between them. They fit models for local markets, and then model the relationship between local factors in different markets to quantify the cross market risk in a global portfolio.
2. Methodology of Multifactor Regression
Multifactor regression is regression analysis with more than one independent variable. It is used to measure quantitatively the impact of two or more independent variables on a dependent variable. Compared to simple linear regression, which explains stock variation only in terms of market risk, multifactor regression allows stock returns to be regressed against market risk and additional variables, such as firm size, industry and any other factors that might influence the returns of a specific stock.
A multiple linear regression has the form:
$Y_i = a_0 + a_1 X_{1i} + a_2 X_{2i} + \dots + a_k X_{ki} + \varepsilon_i$
Where:
$Y_i$ = ith return value of the dependent variable Y, i = 1, 2, …, n
$X_j$ = jth independent variable, j = 1, 2, …, k
$X_{ji}$ = ith return of the jth independent variable
$a_0$ = intercept term
$a_j$ = slope coefficient of the jth independent variable
$\varepsilon_i$ = error term of the ith observation
Here, the intercept term is the value of the dependent variable when all independent variables equal zero. The objective of multiple linear regression is to compute the intercept and slope coefficients that minimize the sum of the squared error terms, $\sum_{i=1}^{n} \left(a_0 + \sum_{j=1}^{k} a_j X_{ji} - Y_i\right)^2$.
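The least-squares fit described above can be sketched in a few lines. This is an illustrative Python/NumPy version, not the MATLAB code used for the report; the function name `fit_ols` and the synthetic data are assumptions made for the example.

```python
import numpy as np


def fit_ols(factors, y):
    """Least-squares fit of y = a0 + a1*X1 + ... + ak*Xk.

    factors: (n, k) array, one column per independent variable.
    y:       (n,) array of the dependent variable.
    Returns the coefficient vector [a0, a1, ..., ak].
    """
    factors = np.asarray(factors, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept a0.
    design = np.column_stack([np.ones(len(y)), factors])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coeffs


# Synthetic data generated from known coefficients without noise, so the
# fit should recover them up to rounding error.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 0.5 + 1.2 * X[:, 0] - 0.7 * X[:, 1]
print(fit_ols(X, y))
```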
Why is Multiple Regression not easy?
The main reason it is difficult to build a good or reasonable multiple regression model is that regression analysis rests on many assumptions. Once some of these assumptions are violated, the model becomes questionable. The most common problem in regression analysis is multicollinearity.
Multicollinearity
Multicollinearity is the condition in which two or more of the independent variables in a multiple regression are highly correlated. In this case, the variables taken together may explain much of the variation in the dependent variable, while the individual independent variables do not appear to. Because the independent variables are highly correlated with each other, it is their common source of variation that explains the dependent variable, and the high degree of correlation makes the individual effects appear weak.
Stepwise regression
The most common solution to the problem of multicollinearity is to delete one or more of the independent variables that are causing trouble, but sometimes it is not easy to find the source of the multicollinearity. A statistical procedure called stepwise regression is used to remove variables from the regression until multicollinearity is minimized.
There are two alternative versions of stepwise regression, forward and backward. Forward stepwise regression starts from one variable. In each step, it adds the one variable that, together with the variables already selected, best explains the variation in the dependent variable. The procedure ends when the explanatory power of the model cannot be improved by adding more variables. Backward stepwise regression takes the opposite approach: it starts with the full set of independent variables and deletes one variable in each step, until deleting any one of the remaining independent variables no longer improves the model's explanatory power.
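A minimal sketch of the backward variant is given below. It measures "explanatory power" with adjusted R-square, which is our assumption for the example; other criteria (an F-test, AIC) are equally common, and this is not necessarily the exact procedure used in the report. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def adjusted_r2(X, y):
    """Adjusted R-square of an OLS fit of y on the columns of X plus intercept."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coeffs
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - (ss_res / (n - k - 1)) / (ss_tot / (n - 1))


def backward_stepwise(X, y, names):
    """Drop one variable per step while doing so raises adjusted R-square."""
    keep = list(range(X.shape[1]))
    best = adjusted_r2(X[:, keep], y)
    while len(keep) > 1:
        # Score every model obtained by deleting exactly one remaining variable.
        trials = [
            (adjusted_r2(X[:, [c for c in keep if c != d]], y), d) for d in keep
        ]
        score, drop = max(trials)
        if score <= best:
            break  # no single deletion improves the model
        keep.remove(drop)
        best = score
    return [names[c] for c in keep], best


# Synthetic example: y depends on x0 and x1 only; x2 is unrelated noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 1.0 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
kept, score = backward_stepwise(X, y, ["x0", "x1", "x2"])
print(kept, round(score, 3))
```

The two informative variables are always retained here because removing either one lowers adjusted R-square sharply; whether the noise variable is dropped depends on the sample.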
3. Implementation
3.1. Data collection
The required data included the industry indexes for over 25 industries and sectors, as well as additional indexes based on market capitalization and other factors. The U.S. federal interest rate data for the period under consideration were also required. In addition, the daily adjusted closing prices of all 500 stocks in the Standard and Poor's 500 Index for the last 2 years were required to build the model.
Requirements
The data used by the program had to meet several requirements. They had to be accurate, completely reliable, and synchronized with all of the other data in terms of dates, frequency, etc. Additionally, since the dataset itself would be vast, data collection and formatting had to be automated. After taking these and other considerations into account, we used the database from CBS MarketWatch (www.marketwatch.com). The site offered historical quotes for a specific date, so an API was built in Java to automate data collection from the site.
API Specifications
The API was built in Java and enabled a user to get the daily adjusted closing prices for a set of stocks, given the stock symbols, the start date and the end date. The program stored the data as tab-separated text and returned a text file with the specified name.
The program was specifically designed to let a user collect data for a large set of stocks automatically by merely specifying the period for which data collection was required. Additionally, the program formats the data so that they can be viewed directly in Microsoft Excel or used directly in MATLAB with minimal changes.
Issues in Data collection
The program had to go through several versions, particularly after it was found that the website did not have a uniform HTML representation of the quotes, but instead had several versions that differed slightly from each other. Another problem was that the site sometimes rejected the Java socket connection, which caused the program to crash. Additional reconnection and reliability features were therefore implemented so that the program could run unsupervised. The output data were also automatically formatted to the input required by the model (tab-separated text) so as to minimize human supervision.
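The reconnection behaviour can be illustrated with a small retry wrapper. This is a sketch in Python rather than the Java of the original tool, and it does not reproduce the tool's actual logic; `fetch_with_retry`, `flaky_fetch`, and the quote line they return are hypothetical placeholders.

```python
import time


def fetch_with_retry(fetch, symbol, attempts=5, delay_seconds=2.0):
    """Call fetch(symbol), retrying on connection errors with a fixed delay."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(symbol)
        except (ConnectionError, TimeoutError) as err:
            last_error = err
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise RuntimeError(
        f"giving up on {symbol} after {attempts} attempts"
    ) from last_error


# Hypothetical stand-in for a real download: fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(symbol):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("connection rejected")
    return f"{symbol}\t2004-05-12\t1.00"  # placeholder quote line


result = fetch_with_retry(flaky_fetch, "IBM", delay_seconds=0.0)
print(result)
```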
Another issue was that on certain dates the values of one or more indexes were unavailable, for example because of holidays. The program had to detect this situation automatically to ensure that incomplete data were not stored.
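One simple way to guard against incomplete dates is to keep only the dates on which every series has a quote. The sketch below shows that idea in Python under the assumption that quotes are held as date-to-price mappings; the function name and the quotes are made up for illustration.

```python
def align_on_common_dates(series_by_symbol):
    """Keep only the dates on which every symbol has a quote.

    series_by_symbol: dict mapping symbol -> dict mapping ISO date -> price.
    Returns (sorted list of common dates, dict symbol -> list of prices).
    """
    if not series_by_symbol:
        return [], {}
    date_sets = [set(quotes) for quotes in series_by_symbol.values()]
    # ISO-formatted dates sort chronologically as plain strings.
    common = sorted(set.intersection(*date_sets))
    aligned = {
        symbol: [quotes[d] for d in common]
        for symbol, quotes in series_by_symbol.items()
    }
    return common, aligned


# Made-up quotes: the index has no value on 2004-03-02.
raw = {
    "STOCK": {"2004-03-01": 10.0, "2004-03-02": 10.5, "2004-03-03": 10.2},
    "INDEX": {"2004-03-01": 100.0, "2004-03-03": 101.0},
}
dates, aligned = align_on_common_dates(raw)
print(dates, aligned)
```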
Evaluation of data collection
The data tool was used to collect nearly 280,000 stock and index quotes from the CBS MarketWatch site. The entire collection process took less than 3 days, with the program running unsupervised most of the time. As a further indication of performance, on a 2.4 GHz Pentium 4 machine with 512 MB of RAM and an Ethernet connection, the program took less than 17 minutes to retrieve and format 2 years of daily closing prices for a set of 12 input stock symbols, a total of 12 × 508 = 6,096 quotes. Performance depended, however, on the load on the machine and on the network speed. I ran several processes simultaneously to maximize throughput. Multi-threading could also have been used, but since the target machine had a single processor, it would not have improved performance significantly.
3.2. Multifactor regression sector exposure model
The type and number of factors are crucial to building a reasonable multiple regression model, since they determine whether the model covers enough characteristics to explain the risk variation of stocks. The factors need to span many areas so that they reflect as much risk exposure information as possible for a security. Further, the factors need to be independent of each other, which is a fundamental requirement for multifactor regression. Because this stringent requirement is rarely satisfied in real life, stepwise regression is used to relax it. We have access to data for 23 indices that can be used as factors in the model, as listed in Table 1.
| Short Name | Full Name | Ticker |
|---|---|---|
| Airlines | | |
| Banks | | |
| Biotech | | |
| Brokers | | |
| Consumer | | |
| Internet | | |
| Pharmaceuticals | | |
| Gold-Silver | | |
| Semiconductor | | |
| Gas | | |
| Oil | | |
| Retail | | |
| Insurance | | |
| Computers | | |
| NASDAQ-100 | NASDAQ-100 (NASDAQ.COM) | ^NDX |
| Industrial | Dow Jones Industrial Average (Yahoo!) | ^DJI |
| Utility | Dow Jones Utility Average (Yahoo!) | ^DJU |
| Transportation | Dow Jones Transportation Average (Yahoo!) | ^DJT |
| S&P | S&P500 | SPX |
| Size (L) | ING fix income for large company | IDLOX |
| Size (S) | ING fix income for small company | IDSOX |
Table 1. Indices used as data in the model
Select Factors
To choose risk factors that are independent of each other, we delete several highly correlated factors in the first step. SPX and DJI, for example, fall into this category. Since both represent equity market performance and have been more than 97% correlated with each other over the last two years, we decided to include DJI only. The high correlation between them can be verified simply with the charts at finance.yahoo.com. Figure 1 shows the trend of both SPX and DJI over the last two years. Comparing trend charts in this way is a simple and effective method of checking the correlation between two or more factors.
Figure 1. Comparison of SPX and DJI over the last two years
Another index we prefer not to use in the model is NDX. NDX stands for the Nasdaq-100, which includes 100 high-tech companies spanning biotechnology, computers, the internet, utilities and so on. Since it depends on several other factors, it has a large impact on the coefficients of those factors. Therefore, we keep the underlying sectors, such as computers and internet, and remove NDX.
As a rule of thumb, to use the model to manage risk over a two-year investment horizon, we should look back two years to collect historical data. So we use the daily prices of stocks and indices over the last two years. The price used is neither the open nor the unadjusted close price; using either would underestimate the stock return because dividends received by stockholders are not included. Instead, we use the adjusted close price, which is adjusted for dividends and, more importantly, stock splits.
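Turning a series of adjusted closes into returns is a one-line transformation. The sketch below uses simple (arithmetic) daily returns, which is our assumption since the report does not state whether simple or log returns were used; the prices are made up.

```python
def simple_returns(adjusted_closes):
    """Convert a chronological list of adjusted closes into simple returns.

    The return for day t is P_t / P_{t-1} - 1, so n prices give n - 1 returns.
    """
    if any(p <= 0 for p in adjusted_closes):
        raise ValueError("prices must be positive")
    return [
        today / yesterday - 1.0
        for yesterday, today in zip(adjusted_closes, adjusted_closes[1:])
    ]


# Made-up adjusted closes for illustration: up 2%, then down 2%.
returns = simple_returns([100.0, 102.0, 99.96])
print(returns)
```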
Build the model
Since building the risk management model requires many statistical methods, MATLAB, a popular and powerful mathematical software package, is used to develop our model. The following steps are needed to build the model:
· Normalize the data
Transform data from adjusted close price to return.
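· Build the initial model by least-squares regression.
The model parameters are chosen such that the sum of squared deviations between the realized return and the model-predicted return, $\|AX - Y\|^2$, is minimized, where A holds the factor returns, X the unknown coefficients and Y the stock returns. This optimization problem is solved by treating AX = Y as an overdetermined system of linear equations and solving it in the least-squares sense, which MATLAB expresses with the backslash operator "\". For rectangular systems the backslash operator uses a QR-based solver rather than the singular value decomposition.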
· Stepwise regression.
Correlation between two or more independent variables distorts the coefficient standard errors, which in turn distorts the t-test for the statistical significance of the parameters. However, it is not easy to determine the degree of correlation among a set of factors. Stepwise regression is a statistical solution to this problem. Of the two kinds of stepwise regression, forward and backward, we choose the backward approach. Figure 2 shows the algorithm of the stepwise regression.
Figure 2. The algorithm of stepwise regression
· T-test and rebuild model again.
After stepwise regression, the correlations between the remaining risk factors are much smaller. A t-test can now be used to measure the significance of each factor, and the model deletes the factors that are not statistically significant. Because we use more than 400 observations, the t-distribution with about 400 degrees of freedom is close to the normal distribution. With a two-tailed test that puts 2.5% in each tail (an overall significance level of 5%), any t-value larger than 1.96 or smaller than -1.96 is considered significant. We then delete the insignificant factors, run the regression one last time, and record its R-square value. The risk matrix is now available for further analysis.
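The significance filter in the last step can be sketched as follows. This is an illustrative Python/NumPy version, not the report's MATLAB code. It uses the classical OLS standard errors, which assume homoskedastic and serially uncorrelated residuals; the function names and the synthetic data are assumptions for the example.

```python
import numpy as np


def ols_t_statistics(X, y):
    """Coefficients and t-statistics for an OLS fit with an intercept.

    Uses the classical formula Var(a_hat) = s^2 (Z'Z)^{-1}, where Z is the
    design matrix with a leading column of ones and s^2 = SSR / (n - k - 1).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])
    coeffs, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coeffs
    s2 = float(resid @ resid) / (n - k - 1)
    std_err = np.sqrt(s2 * np.diag(np.linalg.inv(Z.T @ Z)))
    return coeffs, coeffs / std_err


def significant_factors(X, y, names, critical=1.96):
    """Names of factors whose |t| exceeds the critical value (intercept ignored)."""
    _, t = ols_t_statistics(X, y)
    return [name for name, t_j in zip(names, t[1:]) if abs(t_j) > critical]


# Synthetic example: only the first factor drives y.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 2))
y = 0.8 * X[:, 0] + 0.5 * rng.normal(size=400)
print(significant_factors(X, y, ["f0", "f1"]))
```

A factor unrelated to y will still pass the 1.96 threshold in roughly 5% of samples, which is exactly the overall significance level of the two-tailed test.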
4. Performance Evaluation
The result of the multiple regression of our model is shown in Table 2, which includes the risk exposure of 10 stocks to 20 factors. Tests for significance in multiple regression involve two questions:
1. The t-test checks whether each individual independent variable explains the variation in the dependent variable well;
2. R-square checks whether all independent variables together explain the variation in the dependent variable well.
In this model, we applied the t-test in the last step and kept only the factors that are statistically significant at the 1.96 critical value (2.5% in each tail).
The most direct measure of the strength of a multifactor model is probably R-square. However, R-square almost always increases as more risk factors are added to the model, even if the contribution of the new risk factor is not statistically significant. Consequently, a relatively high R-square may reflect the sheer number of independent variables rather than how well they explain the dependent variable. Adjusted R-square, which penalizes additional variables, is used to overcome this problem. On average, the adjusted R-square of our model is around 59%, which compares favorably with many models in empirical research. We now look at each stock in detail.
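For reference, the usual definition of adjusted R-square is given below, where n is the number of observations and k the number of independent variables (our notation, matching the regression form in Section 2).

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1}
```

As a purely hypothetical illustration, with n = 500, k = 20 and R-square = 0.60, the adjustment gives 1 - 0.40 × 499/479 ≈ 0.583, so the penalty is small when the number of observations is large relative to the number of factors.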
Advanced Micro Devices, Inc. (AMD), incorporated in 1969, is a semiconductor manufacturer with manufacturing facilities in the
Our model shows that AMD has its largest exposure (1.01) to the semiconductor sector, as expected for a company that produces chips. It also shows a positive exposure to the airline sector (0.30). It seems that AMD's products are widely used in airplane electronic device fabrication and that a large fraction of AMD's profit comes from contracts with large airline companies. Another interesting fact is that AMD has a -0.56 exposure to the transportation sector. The market capitalization of AMD is $5.47B, which is neither very large nor very small, and this is consistent with its zero exposure to the company size factor. In total, these factors explain 49% of the variation of AMD's return.
Amgen, Inc. (AMGN) is a global biotechnology company that discovers, develops, manufactures and markets human therapeutics based on advances in cellular and molecular biology. The Company markets human therapeutic products [9]. Our model shows that AMGN has its highest exposure to biotechnology (0.62). It also has a very high exposure to the drug sector (0.53). As a high-tech company focused on new drug development, it bears a high risk because the approval of new drugs and therapeutics is strictly regulated by the FDA. On the other hand, it also has large upside profit potential. Therefore, capital flows to Amgen move opposite to those of traditional production sectors, such as the consumer product sector, which explains its negative exposure to the consumer product sector (-0.26). In total, these factors explain 55% of the variation of Amgen's return.
General Motors Corporation (GM) participates in the automotive industry through the activities of General Motors Automotive. GM designs, manufactures and/or markets vehicles, primarily in
The Goldman Sachs Group, Inc. (GS) is a global investment banking, securities and investment management firm that provides a range of services worldwide to a substantial and diversified client base that includes corporations, financial institutions, governments and high-net-worth individuals. The Company's activities are divided into three segments: Investment Banking, through which Goldman Sachs provides a range of investment banking services to a diverse group of corporations, financial institutions, governments and individuals; Trading and Principal Investments, through which the Company facilitates customer transactions with a diverse group of corporations, financial institutions, governments and individuals; and Asset Management and Securities Services, through which Goldman Sachs provides investment advisory and financial planning services to a diverse client base of institutions and individuals [9]. As expected, GS has a high exposure to the bank sector (0.60) and a large exposure to the market (0.39). As a focused investment banking institution, GS has very little exposure to other sectors. In total, the bank sector and the market explain more than 60% of the variation of GS's return.
International Business Machines Corporation (IBM) is an information technology (IT) company. Its portfolio of capabilities ranges from services that include business transformation consulting to software, hardware, fundamental research, financing and the component technologies used to build larger systems. These capabilities are combined to provide business insight and solutions in the enterprise computing space. IBM's clients include many different kinds of enterprises, from sole proprietorships to large organizations, governments and companies, representing every major industry and endeavor [9]. IBM has a high exposure to the market (0.39) and to the computer sector (0.64). Because of its large market capitalization, it has a negative exposure to the size factor (-0.17). These factors together explain more than 60% of the variation of IBM's return.
Avigen, Inc. (IVGN) is focused on the development of adeno-associated virus-based gene therapy products for the treatment of serious, chronic diseases. The Company has developed a proprietary gene delivery platform technology based on adeno-associated virus vectors (AAV vectors) [9]. As a drug development and medical service company, IVGN has a large exposure to biotechnology (0.71) and some exposure to the retail sector (0.18). These two factors explain more than 40% of the variation of IVGN's return.
J.P. Morgan Chase & Co. (JPM) is a global financial services firm with operations in more than 50 countries. IB provides investment banking and commercial banking products and services. The Company's national consumer and middle market businesses, which provide lending and full-service banking to consumers and small and middle market businesses, collectively comprise Chase Financial Services [9]. As expected, JPM has a large exposure to the bank sector (1.23) and the market (0.53). It also has negative exposure to consumer products (-0.41) and transportation (-0.19). These factors together explain more than 70% of the variation of JPM's return.
Southwest Airlines Co. (LUV) is a domestic airline that provides predominantly shorthaul, high-frequency, point-to-point, low-fare service in the
Starbucks Corporation (SBUX) purchases and roasts whole bean coffees and sells them, along with fresh, rich-brewed coffees, Italian-style espresso beverages, cold blended beverages, a variety of complementary food items, coffee-related accessories and equipment, a selection of premium teas and a line of compact discs, primarily through Company-operated retail stores [9]. As a food service company, Starbucks has a high exposure to the consumer (0.69) and retail (0.40) sectors. It also has negative exposure to the drug (-0.15) and insurance (-0.33) sectors. These factors together explain about 40% of the variation of SBUX's return.
Wal-Mart Stores, Inc. (WMT) operates retail stores in various formats around the world. It organizes its business into three segments: Wal-Mart Stores, SAM'S CLUB and International [9]. Wal-Mart has a large exposure to the retail sector (0.94). This factor by itself explains almost 70% of the variation of WMT's return.
The result of the simple multiple regression for the same 10 stocks is shown in Table 3. Simple multiple regression does not select the factors that best explain the variation of a security; it simply regresses on all factors. As a result, AMD appears to have a very high correlation with banks (BKX), which is not true in real life. Our model removes the BKX factor through stepwise regression and gives a much more accurate result. Further, if two or more factors are more highly correlated with each other than the factors we selected, simple multiple regression is very likely to produce unreasonable results.
5. Conclusion and Future Work
A sector exposure model for the risk management of a security portfolio explores a large set of risk factors, including stock characteristics (size), industry (Autos, Energy, Financials, etc.) and others. It has a wide range of applications: quantifying the risk exposure to a specific factor, constructing a portfolio with specific risk characteristics to match an investor's risk preference, and building a covariance matrix of securities. Based on broad research in related books and Internet websites, this report describes the background of the multifactor model, the mathematical method of multiple regression, and stepwise regression. It also covers the implementation details of the model, such as data collection, factor selection and the multifactor model algorithm. Furthermore, it applies the multifactor model to 10 stocks and describes the risk analysis for each stock. On average, the adjusted R-square is around 59%, which compares favorably with many models in empirical research.
For data collection, our future work is to develop a GUI-based tool that runs on top of the API. This would enable inexperienced users to use the tool for data collection. The tool could also be extended to retrieve additional data such as the opening price, the closing price and the percentage change in price. This would not be very difficult to do and would be a simple yet useful extension to the program. The tool currently returns its output as a tab-separated text file, but could very easily store the data in a database as well.
For the model, one future direction is to fit local models for a certain number of countries (the number depends on the data available). With each local model, the stock return is regressed against risk factors, including stock characteristics, industries, etc. These are then regressed against the global industry factors and country factors to construct a global model. The local models and global models, on different levels, can then be integrated to display the risk structure or construct the covariance matrix of a global portfolio.
References
[1] Harry Markowitz, “Portfolio Selection,” Journal of Finance, March 1952.
[2] William Sharpe, “Capital Asset Prices: A Theory of Market Equilibrium,” Journal of Finance, September, 1964.
[3] Robert C. Merton, “An Intertemporal Capital Asset Pricing Model”, Econometrica 41, 1973.
[4] Zvi Bodie, Alex Kane, and Alan J. Marcus, Investments, Fourth Edition, Irwin McGraw-Hill, p. 297.
[5] Eugene F. Fama and Kenneth R. French, “Multifactor Explanations of Asset Pricing Anomalies,” Journal of Finance 51, 1996.
[6] Trevor Hastie, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning, Springer.
[7] Schweser Study Notes for the 2004 CFA Exam, Level 2, Ethics and Quantitative Methods.
[8] The Barra Newsletter, Spring 2003, p. 2.
[9] finance.yahoo.com
| | | BKX | BTK | CMR | DJI | DJT | DJU | DOT | DRG | IUX | RLX | SOXX | XAL | XAU | XBD | XCI | XNG | XOI | Size | R2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AMD | -0.00011 | 0.00 | -0.23 | 0.00 | 0.00 | -0.56 | 0.00 | 0.00 | 0.00 | -0.35 | -0.16 | 1.01 | 0.30 | 0.00 | 0.69 | 0.00 | 0.00 | 0.00 | 0.00 | 0.49 |
| AMGN | 0.00002 | 0.00 | 0.62 | -0.26 | 0.00 | 0.23 | 0.00 | 0.00 | 0.53 | 0.00 | 0.00 | 0.00 | -0.14 | -0.03 | 0.00 | 0.00 | -0.16 | 0.00 | 0.00 | 0.55 |
| GM | 0.00044 | -0.39 | 0.00 | -0.59 | 2.02 | 0.00 | 0.17 | 0.00 | -0.37 | 0.15 | 0.00 | -0.02 | 0.09 | -0.06 | 0.00 | 0.00 | 0.09 | 0.00 | 0.00 | 0.62 |
| GS | -0.00046 | 0.60 | 0.00 | 0.00 | 0.39 | 0.00 | 0.00 | -0.02 | 0.00 | 0.00 | 0.00 | 0.13 | 0.00 | 0.05 | 0.00 | 0.00 | 0.00 | -0.15 | 0.00 | 0.68 |
| IBM | -0.00010 | 0.00 | 0.00 | 0.00 | 0.39 | 0.00 | 0.00 | 0.00 | 0.00 | -0.07 | 0.00 | -0.05 | 0.00 | -0.05 | 0.00 | 0.64 | 0.02 | 0.00 | -0.17 | 0.66 |
|
IVGN |
-0.00090 |
0.00 |