In Section 4.2 you wrote and ran a few short scripts in various programming languages. Often, though, we want not only to write and execute code, but to do so piece by piece, and to share the results with other people without requiring them to run the code themselves...
Section 6.1 Intro to Jupyter
Definition 6.1.1.
A Jupyter notebook is a file that stores commentary, code, and output in an all-in-one format suitable for sharing with other people.
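To make the "all-in-one" idea concrete, it may help to know that an *.ipynb file is a JSON document. The sketch below builds a simplified notebook in the version 4 notebook format, with made-up contents, and checks which parts carry saved output. The `summarize` helper and the rainfall example are hypothetical illustrations, not part of the GitHub repository.

```python
import json

# A simplified, made-up notebook in the version 4 notebook format.
# Real notebooks carry more metadata than is shown here.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            # Commentary is stored as Markdown text.
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# Rainfall report\n", "Totals for a made-up week."],
        },
        {
            # Code is stored together with the output it produced.
            "cell_type": "code",
            "metadata": {},
            "execution_count": 1,
            "source": ["sum([3, 0, 12])"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 1,
                    "metadata": {},
                    "data": {"text/plain": ["15"]},
                }
            ],
        },
    ],
}


def summarize(nb):
    """Return (cell_type, has_saved_output) for each part of the notebook."""
    return [(cell["cell_type"], bool(cell.get("outputs"))) for cell in nb["cells"]]


# Round-trip through JSON text, which is what gets written to the file.
serialized = json.dumps(notebook, indent=1)
summary = summarize(json.loads(serialized))
summary
```

Because the output travels inside the file, a reader who opens the notebook sees the result `15` without running anything.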
Jupyter is a popular open-source tool used in data science, scientific computing, and computational journalism. GitHub provides a Codespace ready for running Jupyter notebooks out of the box: https://github.com/github/codespaces-jupyter/.
directly. Before we dive into editing a notebook ourselves, we can first browse the notebooks directory on the repository page. We see three files, each with the extension *.ipynb (short for “IPYthon NoteBook”, Jupyter’s original name).
Clicking on each file, you’ll note that while there’s code, most of the file is actually narrative and visualization. That’s the appeal of Jupyter for many people: it’s about communicating stories, not just data or software.
Additionally, you’ll see a data directory, which includes a *.csv (Comma Separated Values) spreadsheet. This file can be read into a notebook for analysis.
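As a rough sketch of what "reading a CSV into a notebook" can look like, the example below uses Python's built-in `csv` module. The column names and values are invented for illustration and stand in for the real file; the repository's own notebooks may well load their data differently.

```python
import csv
import io

# Hypothetical stand-in for a file in the data directory. In a notebook you
# would typically pass an opened file, e.g. open(path, newline=""), instead.
sample_csv = io.StringIO(
    "city,population\n"
    "Springfield,30000\n"
    "Shelbyville,25000\n"
)


def load_rows(file_obj):
    """Read CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(file_obj))


rows = load_rows(sample_csv)

# Every CSV value arrives as a string, so convert before doing arithmetic.
total_population = sum(int(row["population"]) for row in rows)
total_population
```

Each row becomes a dictionary, so `rows[0]["city"]` gives the first city's name.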
Now, let’s follow the instructions in the repository’s README file (Remark 2.3.2). As of writing, it recommends simply using the Code button to open a Codespace, without needing to fork (Section 5.3) the repository. This allows you to “try out” the Codespace without saving your work long-term, but you can still create a fork with your changes later if you decide to.
Section 6.3 Kernels
At the core of any Jupyter notebook is its “kernel”.
Definition 6.3.1.
The kernel of a Jupyter notebook is a process that wires up a notebook to a particular programming language.
Kernels exist for several different programming languages. We will use a Python kernel in this book, not least because it’s one of the most commonly used kernels, and the one that’s already set up for use with the GitHub Jupyter Codespace repo.
In your Codespace, use the “Select kernel” button to choose a “Python environment”. You should be able to select the default global environment without needing to create a new one. Your notebook is ready once you see Python 3.x.y (for some values of \(x,y\)) in the upper-right corner of the notebook.
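Once a kernel is selected, one way to double-check which Python it runs is to ask Python itself from a Code cell. This is a minimal sketch; `python_version_label` is a hypothetical helper name, not something provided by Jupyter or the Codespace.

```python
import sys


def python_version_label(info=sys.version_info):
    """Format a version triple in the same 'Python 3.x.y' style as the kernel label."""
    major, minor, micro = info[:3]
    return f"Python {major}.{minor}.{micro}"


# With no argument, this reports the version of the Python running this code.
python_version_label()
```

If the label shown here differs from the one in the corner of the notebook, the notebook is probably attached to a different environment than you expected.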
Section 6.4 Cells
A notebook is composed of many consecutive parts, known as “cells”.
Definition 6.4.1.
A cell of a notebook encapsulates either commentary/documentation (as a Markdown cell) or code (as a Code cell). Cells can be rearranged, inserted, cut, pasted, and so on.
Each Markdown cell uses, well, Markdown (Definition 2.3.1) to describe content that should be displayed to the reader, similar to a README file in your repository.
But it’s the Code cells that set a notebook apart. Each Code cell in a notebook is run consecutively, with the result of the final line of code being displayed for the reader. Importantly, these outputs are saved to the notebook itself, meaning that by sharing the notebook with a colleague, they can see the output of your code without running it themselves! This is not only convenient, but it’s essential when communicating the result of code that uses software your reader does not have installed themselves. Likewise, it allows for showing the results of code via a web browser, such as at this link 1
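The "result of the final line" behavior can be illustrated with the contents of a single hypothetical Code cell. The temperature readings are made up for this sketch.

```python
# Contents of one hypothetical Code cell; the readings are invented.
temperatures_c = [18.5, 21.0, 19.5]

average_c = sum(temperatures_c) / len(temperatures_c)

# This expression is evaluated, but because it is not the final line
# of the cell, a notebook does not display its value.
max(temperatures_c)

# Use print() when you want to show something from the middle of a cell.
print("number of readings:", len(temperatures_c))

# The value of the final line is what the notebook shows as the cell's result.
round(average_c, 1)
```

In a notebook, this cell would show the printed line followed by the rounded average, and both would be saved with the notebook.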
that you can upload to your Codespace to experiment with.
Section 6.6 Handling big datasets
A (possible) disadvantage of using Codespaces compared to your own computer is that all processing happens in the cloud, so you’re limited by the resources made available to you by GitHub. But Remark 4.5.1 describes how to beef up your Codespace with more resources, should you need to crunch a particularly large dataset.
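Before deciding whether you need more resources, you can get a rough picture of what the current machine offers from inside a notebook. The sketch below uses only Python's standard library; `describe_resources` is a hypothetical helper, and the numbers it reports depend entirely on the machine it runs on.

```python
import os
import shutil


def describe_resources(path="."):
    """Report the CPU count and the free disk space (in GB) for the given path."""
    usage = shutil.disk_usage(path)
    return {
        "cpu_count": os.cpu_count(),
        "disk_free_gb": round(usage.free / 1e9, 1),
    }


resources = describe_resources()
resources
```

This does not report memory, which is often the first limit a large dataset runs into.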