📉
OMSCS Notes
  • OMSCS Lecture Notes
  • information-security
    • Firewalls
    • Database Security
    • Operating System Security
    • Access Control
    • Cyber Security
    • Law, Ethics, and Privacy
    • Security Protocols
    • Introduction to Cryptography
    • Software Security
    • Midterm 2 Study Guide
    • Mandatory Access Control
    • Modern Malware
    • Udacity Quizzes
    • Midterm 1 Study Guide
    • IPSec and TLS
    • Wireless and Mobile Security
    • Welcome
    • Malicious Code
    • Intrusion Detection
    • Public-Key Cryptography
    • The Security Mindset
    • Hashes
    • Symmetric Encryption
    • Web Security
    • Authentication
  • machine-learning-trading
    • The Fundamental Law of Active Portfolio Management
    • So You Want to be a Hedge Fund Manager
    • Dealing With Data
    • Technical Analysis
    • Sharpe Ratio and Other Portfolio Statistics
    • Incomplete Data
    • Dyna
    • Portfolio Optimization and the Efficient Frontier
    • Histograms and Scatter Plots
    • How Hedge Funds Use the CAPM
    • The Power of NumPy
    • Regression
    • Market Mechanics
    • Optimizers: How to Optimize a Portfolio
    • Efficient Markets Hypothesis
    • The Capital Assets Pricing Model (CAPM)
    • Optimizers: Building a Parameterized Model
    • Reinforcement Learning
    • Q-Learning
    • Statistical Analysis of Time Series
    • What Is a Company Worth
    • Udacity Quizzes - ML4T
    • Textbook Information
    • Welcome
    • How Machine Learning is Used at a Hedge Fund
    • Ensemble Learners, Bagging, and Boosting
    • Reading and Plotting Stock Data
    • Assessing a Learning Algorithm
    • Working with Multiple Stocks
  • computer-networks
    • Congestion Control & Streaming
    • Programming SDNs
    • Spam
    • Network Security
    • Switching
    • DNS
    • Content Distribution
    • Naming, Addressing & Forwarding
    • Architecture and Principles
    • Routing
    • Internet Worms
    • Traffic Engineering
    • Denial of Service Attacks
    • Welcome
    • Router Design Basics
    • Rate Limiting & Traffic Shaping
    • Software Defined Networking
  • operating-systems
    • Threads And Concurrency
    • Introduction To Operating Systems
    • Datacenter Technologies
    • Synchronization Constructs
    • Thread Performance Considerations
    • Threads Case Study - PThreads
    • Processes and Process Management
    • Thread Design Considerations
    • Final Exam Review Questions
    • Memory Management
    • Midterm Exam Review Questions
    • Distributed File Systems
    • IO Management
    • Virtualization
    • Inter-Process Communication
    • Scheduling
    • Welcome
    • Remote Procedure Calls
    • Distributed Shared Memory
  • simulation
    • Course Tour
    • Arena, Continued
    • Arena
    • Output Data Analysis
    • Welcome
    • Random Variate Generation
    • Calculus, Probability, and Statistics Primers, Continued
    • Hand and Spreadsheet Simulations
    • Comparing Systems
    • General Simulation Principles
    • Comparing Systems, Continued
    • Generating Uniform Random Numbers
    • Calculus, Probability, and Statistics Primers
    • Random Variate Generation, Continued
    • Input Analysis
  • secure-computer-systems
    • Welcome to SCS
    • Getting Started
    • Design Principles for Secure Computer Systems
    • Protecting the TCB from Untrusted Applications
    • Virtualization and Security
    • Authentication
    • Discretionary Access Control
    • Mandatory Access Control
    • Midterm Exam Review
  • .github
    • ISSUE_TEMPLATE
      • bug_report
      • typo-report
Powered by GitBook
On this page
  • Pristine Data
  • Why Data Goes Missing
  • Why this is Bad / What we can Do
  • Pandas fillna() Quiz
  • Pandas fillna() Quiz Solution
  • Documentation
  • Using fillna()
  • Documentation
  • Fill Missing Values Quiz
  • Fill Missing Values Solution
  • Documentation

Was this helpful?

  1. machine-learning-trading

Incomplete Data

PreviousSharpe Ratio and Other Portfolio StatisticsNextDyna

Last updated 4 years ago

Was this helpful?

Pristine Data

People have many misconceptions when it comes to financial data, and one of the biggest is that the data is always pristine. People often imagine that financial data is perfectly recorded, minute by minute, free of gaps and other errors.

That's just not the case.

For example, a stock might be traded on multiple exchanges - the New York Stock Exchange, the NASDAQ, and , among others. At any particular minute during the day, it may trade at one price on one exchange, and a different price on another.

As a result, there is no single price for a stock at a particular time, and it's hard to say which exchange is "right" when it comes to pricing a stock. The reality of the data that we get is that it's often an amalgamation from several different sources, and different data providers supply different numbers.

Additionally, not all stocks trade every day, so we can see gaps in our data in between periods of activity. Additionally, stocks come into and go out of existence, so we might see data for a stock suddenly begin or end.

Why Data Goes Missing

Let's consider the following plot of SPY pricing data.

Let's look at another stock: JAVA.

As we can see, JAVA was trading from the beginning of our time frame but then abruptly stopped. The reason for this is that Sun Microsystems, trading under the ticker JAVA, was acquired by Oracle in 2010. On that day, the JAVA ticker ceased to exist. From this date forward, all data for JAVA will be NaN.

Interestingly, before JAVA represented Sun Microsystems, it was the ticker for Mr. Coffee. If you look historically for data for JAVA, you'll find two different time series for the ticker: one for Mr. Coffee, and one for Sun Microsystems.

Let's look at another, admittedly fake, stock: FAKE1.

As you can see, FAKE1 didn't exist before a date roughly in the middle of our range. While we see NaN values after a particular date for JAVA, we see NaN values before a particular date for FAKE1.

Let's look at a second fake stock: FAKE2.

We can see that, in addition to not having data at the beginning of the range, FAKE2 also has serious gaps in the data throughout the range. This data is not typical for a very large, liquid stock like GOOG or APPL, but can be typical for thinly traded stocks, such as those that have a smaller market capitalization and trade infrequently.

Why this is Bad / What we can Do

What do we do when we have significant gaps in our data? Let's take another look at our fake ticker FAKE2.

The truth of the matter is that, between the two points on either side of the gap, there simply was no trading; in other words, there was no price for this stock.

However, if we want to compute any sort of statistic over these gaps, we need to fill them with something because we can't run statistics of NaN values.

Our first thought might be to interpolate the data. That is, we could fill in the data between the two points on either side of the gap by drawing a line connecting them.

This is the wrong approach.

Suppose we were looking for patterns in the data, and we had rolled back time to step through the pricing data day by day. Let's suppose the current day we are examining is one of the points that we interpolated.

To interpolate the points in the gap, we had to know the points at the beginning and the end of the gap. However, from the perspective of any individual interpolated point, the end of the gap exists in the future.

As a result, we cannot consider any of the interpolated points without conceding that we necessarily baked information from the future (relative to the point) to bring about its existence. Since we can't peek into the future in real life, we cannot take this approach in simulation.

A better approach is the fill forward approach. In this approach, each data point in the gap is given the value of the last-seen data point.

However, we still have missing data at the beginning of our range. The first gap begins somewhere to the left of the y-axis, leaving us with no value from which to fill forward.

We need to fill in this gap, so we choose to fill backward from the first known price in this situation.

Generally, we fill forward first and fill backward second. This approach helps us avoid peeking into the future as much as possible.

Pandas fillna() Quiz

Our task is to find the parameter that we need to pass to fillna to fill forward missing values.

Pandas fillna() Quiz Solution

Documentation

Using fillna()

Let's take another look at the data for FAKE2.

We notice a gap in the data at the beginning of the date range, as well as several gaps in the middle.

Given a DataFrame df, we can forward fill missing values with the following method call:

df.fillna(method="ffill")

Let's plot the FAKE2 prices again.

We can see the horizontal lines where we have forward filled the gaps in the data. However, we are still missing values at the beginning of the range.

Documentation

Fill Missing Values Quiz

If we look at the following plot of stock price data, we can see several gaps.

Our task is to use the fillna method to fill these gaps.

Fill Missing Values Solution

We can use forward filling for gaps that have a definitive start date, and backward filling for gaps that have no beginning or begin before our date range.

Documentation

SPY, an that tracks the S&P 500, is one of the most liquid and actively-traded ETFs on the market. Because of its frequent trading, we typically use SPY as a proxy for the stock market being open and as a time and date reference for other stocks.

ETF
pandas.DataFrame.fillna
pandas.DataFrame.fillna
pandas.DataFrame.fillna
BATS