{"id":216,"date":"2018-08-16T01:36:29","date_gmt":"2018-08-16T01:36:29","guid":{"rendered":"http:\/\/www.sugata.in\/?p=216"},"modified":"2018-08-16T02:23:52","modified_gmt":"2018-08-16T02:23:52","slug":"webscraping-with-python","status":"publish","type":"post","link":"http:\/\/www.sugata.in\/index.php\/2018\/08\/16\/webscraping-with-python\/","title":{"rendered":"Webscraping with Python"},"content":{"rendered":"<p>This is some code I wrote to scrape stock prices with Python. I wrote it in a Jupyter notebook.<\/p>\n<p>First off, you&#8217;ll need chromedriver (search for &#8220;download chromedriver&#8221; and get the file from the first link). Put it in the folder with your Jupyter notebook.<\/p>\n<p>Next, you&#8217;ll need a bunch of libraries, some of which will need to be installed with pip.<\/p>\n<p style=\"padding-left: 30px;\"><em>from selenium import webdriver<\/em><br \/>\n<em>import datetime<\/em><br \/>\n<em>import time<\/em><br \/>\n<em>from multiprocessing import Pool, TimeoutError<\/em><br \/>\n<em>import urllib.request<\/em><br \/>\n<em>import re<\/em><br \/>\n<em>from urllib.error import URLError, HTTPError<\/em><\/p>\n<p>You won&#8217;t need all of these for the code below; I&#8217;m just copying the entire import section of my code.<\/p>\n<p>Next, we&#8217;ll fire up a browser.<\/p>\n<p style=\"padding-left: 30px;\"><em>driverspy = webdriver.Chrome()<\/em><br \/>\n<em>driverspy.get(&#039;https:\/\/finance.yahoo.com\/quote\/SPY?p=SPY&#039;)<\/em><\/p>\n<p>This should open a Python-controlled browser that surfs its way to Yahoo Finance and loads up the page for SPY (a popular S&amp;P 500 ETF).<\/p>\n<p>Finally, we&#8217;ll grab the page source and use a regular expression to pull the price out of it.<\/p>\n<p style=\"padding-left: 30px;\"><em>sourcespy = driverspy.page_source<\/em><br \/>\n<em>found = re.search(&#039;&quot;35&quot;&gt;(\d+\.\d+)&lt;\/span&gt;&#039;, sourcespy).group(1)<\/em><\/p>\n<p>If you look at the HTML code of the page_source of the 
Yahoo page with the SPY data, you&#8217;ll see that it has, buried in it, something that looks like this:<\/p>\n<p style=\"padding-left: 30px;\">&lt;span class=&quot;Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)&quot; data-reactid=<strong>&quot;35&quot;&gt;<span style=\"text-decoration: underline;\">283.82<\/span>&lt;\/span&gt;<\/strong>&lt;span class=&quot;Trsdu(0.3s) Fw(500) Pstart(10px) Fz(24px) C($dataGreen)&quot; data-reactid=&quot;36&quot;&gt;+1.72 (+0.61%)&lt;\/span&gt;&lt;div<\/p>\n<p>To extract the price, we rely on the <strong>bolded<\/strong> part always being the same (&quot;35&quot;&gt; &#8230; &lt;\/span&gt;) and enclosing the <span style=\"text-decoration: underline;\"><strong>bold+underlined<\/strong><\/span> price (283.82). The \d+\.\d+ tells Python to look for one or more digits, a literal period, and then one or more digits; the parentheses around it capture that number so group(1) can return it.<\/p>\n<p>Now we have a basic scraper to get prices from Yahoo Finance. If we set up a loop, we can get prices every few minutes and generate a time series dataset.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is some code I wrote to scrape stock prices with Python. I wrote it in a Jupyter notebook. First off, you&#8217;ll need chromedriver (search for &#8220;download chromedriver&#8221; and get the file from the first link). Put it in the folder with your Jupyter notebook. 
Next, you&#8217;ll need a bunch of libraries, some of which will need [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[9,10],"tags":[],"_links":{"self":[{"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/posts\/216"}],"collection":[{"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/comments?post=216"}],"version-history":[{"count":4,"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/posts\/216\/revisions"}],"predecessor-version":[{"id":220,"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/posts\/216\/revisions\/220"}],"wp:attachment":[{"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/media?parent=216"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/categories?post=216"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/www.sugata.in\/index.php\/wp-json\/wp\/v2\/tags?post=216"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}