Working with Multiple Stocks
Last updated
Last updated
To recap, here is an illustration diagramming the structure and content of a pandas DataFrame.
The DataFrame has columns, one for each symbol, and rows, one for each date. Each cell has a value; in this case, it is the closing price of the security represented by the symbol on the corresponding date.
To build out a DataFrame like the one pictured above, we need to consider a few things first.
For example, you might remember that our CSV file from a previous lecture contains data from 2000 through 2015. Since the DataFrame we want to build contains data from only 2010 to 2012, we need to figure out how to read in data from specific date ranges.
This new DataFrame contains information about multiple equities, whereas previous DataFrames only covered a single equity. We need to learn how to read in data for multiple stocks.
Additionally, we need a mechanism by which we can align dates. For example, if SPY traded on some days, and IBM traded on other days, we need to be able to create a DataFrame that aligns the closing prices correctly.
Finally, we need to undo the problem we discovered in the CSV from the last lecture; specifically, the dates were presented in reverse order. We need to build our DataFrame such that the chronology is as we expect: from past to present.
See breakdown here (for 2018 and 2019).
We can start constructing our DataFrame by first building an empty DataFrame df
, which we index by the dates that we want to consider.
Since our goal is to load df
with a column for SPY, IBM, GOOG, and GLD, we begin by reading in SPY data into a DataFrame dfSPY
.
When we compare df1
and dfSPY
, we notice two interesting things.
First, there are many more dates in dfSPY
than there are in the target df
. dfSPY
contains all of the data for SPY, and we need a way to retrieve data from only the dates we are considering.
Second, there are dates present in df
that are not present in dfSPY
. When we constructed our index, we didn't skip weekends or holidays. Obviously, SPY did not trade on those dates since the market was not open. We will need to deal with this alignment issue.
We now need to combine df
and dfSPY
into a single DataFrame. Thankfully, pandas has several different strategies for doing just that. We are going to look at an operation called a join.
There are a few different types of joins, and the names may sound familiar if you have ever taken a databases course. The type of join that we are interested in retains only those rows that have dates present in both df
and dfSPY
. Formally, this is known as an inner join.
Not only does the inner join eliminate the weekends and holidays originally present in df
, but it also drops the dates in dfSPY
that fall outside of the desired date range.
After we have joined df
and dfSPY
, we can repeat the procedure for the other equities - IBM, GOOG, and GLD - by performing additional joins.
Before we start joining in our stock data, we first need to create our empty DataFrame using pandas.
We can create a date range dates
that will serve as the index of our DataFrame with the following code:
If we print dates
, we see the following.
Note that dates
is not a list of strings, but rather a series of DatetimeIndex
objects.
We can retrieve the first object from dates
by calling dates[0]
, which looks like the following if we print it.
The "00:00:00" represents the default timestamp, midnight, for a DateTimeIndex
object. The index for our DataFrame only concerns dates, so we can effectively ignore this timestamp.
We can now create a DataFrame df
that uses dates
as its index with the following code:
If we print df
, we see the following.
We can see that df
is an empty DataFrame, with no columns, that uses dates
as its index.
With df
in place, we can now read SPY data into a temporary DataFrame, dfSPY
, with the following code:
We can attempt to join the two DataFrames with the following code:
The default join is a left join, which means the following. First, all of the rows in df
are retained. Second, all of the rows in dfSPY
whose index values are not present in the index of df
are dropped. Third, all of the rows in df
whose index values are not present in the index of dfSPY
are filled with NaN
values.
If we print the resulting DataFrame, relabelled as df
, we see the following.
None of the values from dfSPY
appear in our new df
. Let's print dfSPY
to debug.
The issue here is that while df
has an index of DatetimeIndex
objects, dfSPY
has a simple, numeric index.
We can rectify this by telling pandas that the Date
column of the SPY CSV should be used as the index column and that the values in this column should be interpreted as dates. We accomplish this with the following code:
If we print dfSPY
now, we see the following, correct DataFrame.
Since we only care about the adjusted close and the date columns, we can construct dfSPY
to only include those columns using the usecols
parameter.
Additionally, we can replace textual values representing null or absent values with proper NaN
s using the na_values
parameter. In the SPY CSV, NaN
is represented by the string "nan".
The full initialization of dfSPY
is demonstrated by the following code:
If we again print df
, the result of the join, we see the following DataFrame.
Finally, we can drop weekends and holidays - where adjusted close is NaN
- using the following code:
If we print out df
, we see the following, correct DataFrame.
We can avoid having to explicitly call dropna
on line 22 by passing a certain value for the 'how' parameter of the join
call on line 19.
What is the value of that parameter?
We want to read in data about three more stocks - GOOG, IBM, and GLD - and create our complete DataFrame. We can iterate over the stock symbols, create temporary DataFrames, and join them to df
with the following code:
However, when we try to print df
, we see an error.
The issue here is that we have multiple DataFrames that each has a column name "Adj Close". Pandas complains that it does not know how to resolve the overlap when joining DataFrames with identical column names. In other words, column names must be unique.
Instead of having four columns with the same name, we'd like to name each column after the symbol whose data it contains. We can accomplish that with the following code:
If we print df
now, we see the following, correct output.
If we look at the code we have written so far, we see some duplication.
Your task is to consolidate this code into one location: the utility function get_data
.
We have consolidated both the DataFrame initialization code and the joining code.
Additionally, since we are using SPY as our baseline, we drop all rows from df
where SPY has not traded - that is, where the SPY column has NaN
values - with the following code:
Here is our complete DataFrame, containing data from 2010 to 2012 for SPY, IBM, GOOG, and GLD.
Suppose we want to focus on a subset, or a slice or this data; for instance, we might be interested in just values of GOOG and GLD for 2/13/10 through 2/15/10.
Pandas exposes a syntax for creating such a slice. Given a DataFrame df
, we can retrieve the data for all columns between these two dates with the following expression:
We can further scope this slice to include information about only GOOG and GLD with the following expression:
Note that neither the rows nor the columns that we slice must be contiguous in df
. The following slice of nonadjacent columns IBM and GLD is just as valid as our original slice:
Our original DataFrame only contained data for four days. We can read in data for the whole year of 2010 by changing the dates of the DatetimeIndex
object we built at the beginning of this lesson:
If we rebuild df
with this new index and print it out, we see the following.
Pandas offers several different options for slicing DataFrames.
Row slicing gives us the requested rows and all of the columns. This type of slicing might be useful when you want to compare the movement of all the stocks over a subset of time.
If we want to retrieve data for all of the symbols during January, for example, we can use the following code:
Note that this code is equivalent to the following:
However, the former is considered to be more pythonic.
aside: it's not. The
ix
method has since been deprecated, and users should use either theloc
oriloc
methods.
Column slicing returns all of the rows for a given column or columns, and is helpful when we want to view just the movement of a subset of stocks over the entire date range.
The following code shows the syntax for retrieving information for just GOOG, as well as for both IBM and GLD.
Finally, we can slice through both rows and columns. For example, we can select information about just SPY and IBM, for just the dates between March 10th and March 15th:
It's pretty easy to generate a good-looking plot from a DataFrame; in fact, we merely have to call the plot
method.
One feature that we can immediately notice about this plot is that GOOG is priced quite higher than the other stocks. It's often the case that different stocks are priced at significantly different levels.
As a result, it can be hard to compare these stocks objectively. We'd like to adjust the absolute prices so that they all start from a single point - such as 1.0 - and move from there.
If we can normalize the stock prices in this manner, we can more easily compare them on an "apples to apples" basis.
While both of these are technically correct, the second approach leverages vectorization which is must faster than the iterative approach. Read more about vectorization here.
To plot our data, we first need to import matplotlib
. We can then define a plot_data
function that receives a DataFrame df
and calls plot
on it.
If we run this function, we see the following graph.
We need to add more information to our graph, such as a title as well as x- and y-axis labels. Additionally, we can change the font-size of the text to improve readability. We can adjust the title and the font-size with the following code:
To generate axis labels, we need to call the set_xlabel
and set_ylabel
methods on the object that the plot
method returns.
If we run this function now, we see a more informative graph.
Your task is to write code that first slices df
along the rows bounded by start_index
and end_index
and across columns
, and then passes the slice to the plot_selected
method.
We can create our slice using the following code, which we then pass to the plot_data
method:
Let's take another look at the plot we have created.
Here we see the absolute price movement of each stock. What we are more interested in, however, is the price movement of each stock relative to its starting price.
To create a plot that shows this relative movement, we first need to normalize the prices of each stock. We do this by dividing the values of each column by their price on day one. This ensures that each stock starts at 1.0.
We can create a normalize_data
method that normalizes the data in a DataFrame df
:
If we graph our normalized DataFrame, we see the following graph. Notice that all the stocks start at 1.0 and move from there.