I recently learned Python and started using it for some of my projects. Python is a high-level programming language with a large number of ready-made packages for a variety of useful tasks. Tasks I’ve used Python for include scraping the web for data (excellent!), machine learning (meh … but that’s more my fault than Python’s), OCR (super meh), and algorithmic name classification, such as gender determination (again, excellent!).
While I will not provide direct code to perform predictive analysis using Python, I will use this post to link to a variety of resources that I have used, along with how I use them.
First, how to get started with Python. I use Jupyter Notebook, which comes bundled with Anaconda. Both Python and Jupyter Notebook are installed when you download and install the latest version of Anaconda – google “download jupyter notebook” and go to the first link. The actual download will be from the Anaconda website. As of posting, the latest version is Python 3.6. Click “Download,” run the file, and choose all the default options to install Python and Jupyter Notebook.
Jupyter Notebook runs inside your browser. Open up Jupyter Notebook, create a folder for coding, and then create a new Notebook. Each Notebook has distinct cells for distinct blocks of code that can be run separately. Once you run the code in a cell, the output is produced right below. Here is an example:
As you can see, when you run each cell, it simply generates the output right below. One thing I wanted to point out is that variables and variable types are generated dynamically. The code “a=1” creates a as an integer and sets it to one in a single step, with no separate type declaration. Printing (and other functions) can be applied to integers (e.g. “print(a)”) or strings (e.g. “print('hello world')”), but combining the two directly, for example joining a string and an integer with “+”, raises an error (see the error in the second cell).
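The behavior described above can be sketched roughly as follows. The variable names and values are assumptions for illustration, and the mixed case assumes the error comes from joining a string and an integer with “+”.

```python
# Variable names and values are illustrative, not taken from the notebook.
a = 1                      # a is created as an int; no type declaration needed
print(type(a).__name__)

a = 2                      # rebinding the same name to a new value
print(a)

a = 'hello world'          # the same name can later hold a string
print(type(a).__name__)

count = 2
try:
    message = 'count is ' + count      # str + int is not allowed
except TypeError as err:
    message = 'TypeError: ' + str(err)
print(message)

# Converting explicitly, or passing several arguments to print, avoids the error.
fixed = 'count is ' + str(count)
print(fixed)
print('count is', count)
```

Each chunk between blank lines could go in its own notebook cell, and the output would appear right below that cell.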
The second thing (and I love this) is that indentation is part of the language.
if 3>2:
    print('hi')
    print('there')
will return
hi
there
if 3<2:
    print('hi')
    print('there')
will return nothing
but
if 3<2:
    print('hi')
print('there')
will return
there
The indentation controls what is run in the “if” statement. This forces discipline in generating readable (and workable) code.
Once you’ve gotten Python up and running, you’ll need additional packages for more specialized tasks.
For webscraping, I’d recommend selenium and chromedriver.
For OCR, I’d recommend Tesseract (an open-source OCR engine whose development was long sponsored by Google).
For machine learning, I use (but don’t know enough to recommend) tensorflow.
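Before starting on any of these, it can help to check which packages are already installed. The following is a minimal sketch using only the standard library. The import names are assumptions: in particular, “pytesseract” is one common Python wrapper for Tesseract and is not something named above, and chromedriver is a separate program rather than a Python package, so it is not checked here.

```python
import importlib.util

# Import names are assumptions; "pytesseract" is a wrapper chosen for illustration.
PACKAGES = ['selenium', 'pytesseract', 'tensorflow']


def is_installed(import_name):
    """Return True if the top-level module can be found, without importing it."""
    return importlib.util.find_spec(import_name) is not None


def report(packages):
    """Map each import name to whether it is currently installed."""
    return {name: is_installed(name) for name in packages}


status = report(PACKAGES)
for name, installed in status.items():
    if installed:
        print(name, 'is installed')
    else:
        print(name, 'is missing; try: pip install ' + name)
```

Anything reported as missing can usually be installed from the Anaconda Prompt with pip (or conda) and then imported in a Notebook cell.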