Webscraping with Python

This is some code I wrote to scrape stock prices with Python. I wrote it on Jupyter notebook.

First off you’ll need chromedriver (Google “download chromedriver” and get the file on the first link. Put it in the folder with your Jupyter notebook.

Next, you’ll need a bunch of libraries, some of which will need to be pip installed.

from selenium import webdriver
import datetime
import time
from multiprocessing import Pool,TimeoutError
import urllib.request
import re
from urllib.error import URLError, HTTPError

In the code below, you won’t need all of this, but I’m just copying the entire import section of my code.

Next, we’ll fire up a browser.

driverspy = webdriver.Chrome()
driverspy.get(‘https://finance.yahoo.com/quote/SPY?p=SPY’)

This should open a python controlled browser that surfs its way to Yahoo Finance and loads up the page for SPY (a popular S&P 500 ETF).

Finally, we’ll define a function to scrape the price and then scrape the price off this page.

sourcespy = driverspy.page_source
found = re.search(‘”35″>(\d+\.\d+)</span>’, sourcespy).group(1)

If you look at the html code of the page_source of the Yahoo page with the SPY data, you’ll see it has, buried in it, something that looks like this:

<span class=”Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)” data-reactid=“35”>283.82</span><span class=”Trsdu(0.3s) Fw(500) Pstart(10px) Fz(24px) C($dataGreen)” data-reactid=”36″>+1.72 (+0.61%)</span><div

We rely on the bolded part always being the same (“35”> … </span> and encapsulating theĀ bold+underlinedĀ price (283.82) to extract the price. The \d+.\d+ tells Python to look for a positive number, a period and another positive number.

Now, we have a basic scraper to get prices from Yahoo finance. If we set up a loop, we can get prices every few minutes and generate a time series dataset.