Import and Export Files to and from GitHub via API
GitHub is typically used as a repository for code, but it can also serve as a locker for a project’s assets and files.
Uploading a file on the GitHub website is straightforward: there is an “Add File” button at the top right of every directory. Downloading isn’t as simple, however. You either have to view the raw file in a browser window and then do a “Save As”, or you need to download the entire repo as a zip file, unzip it, and then locate the file you want.
Regardless, data engineers and software developers sometimes need to access these assets remotely without visiting the GitHub website at all, and they need to automate the upload and download of a project’s assets. For these tasks, we can use GitHub’s APIs.
Create a Python Script to Upload a File to GitHub
Suppose I have a dataset as a csv file in a local folder, and I want to upload it to a GitHub repo. It only takes a relatively short block of code.
For the following scripts, I have PyGithub installed on my machine, which allows for easy interaction with the GitHub API.
This script uploads a csv file to a repo.
Here’s a general overview of the steps:
- Create a GitHub instance. In the scripts in this post, I’m authenticating my account with a token. (To generate a token, in your GitHub account go to settings → developer settings, or go to https://github.com/settings/apps when logged in. On the left sidebar, there’s a dropdown called “Personal access tokens.”)
- Assign the repo you wish to access to a variable in the format username/repo-name.
- Use a with statement and open() to read the file contents and assign the contents to a variable.
- Send the contents with the create_file command.
With the create_file command, you should have:
- the directory path in your GitHub repo with the file’s desired filename
- a commit message
- the contents to write to the file
- the branch name (optional), which defaults to the repo’s default branch (usually ‘main’) if not included
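The steps above can be sketched roughly as follows. This is a minimal sketch rather than the original script: the function names connect and upload_text_file are made up for illustration, and the Auth.Token login assumes a reasonably recent PyGithub release (older releases took the token directly, as in Github(token)).

```python
def connect(token, repo_name):
    """Return a PyGithub Repository for a 'username/repo-name' string."""
    # Imported here so upload_text_file can be tried out without PyGithub installed.
    from github import Auth, Github
    return Github(auth=Auth.Token(token)).get_repo(repo_name)


def upload_text_file(repo, local_path, dest_path, message, branch=None):
    """Read a local text file and create it in the repo at dest_path."""
    with open(local_path, "r", encoding="utf-8") as f:
        contents = f.read()
    # Only pass branch when one was given; otherwise the repo's default branch is used.
    extra = {"branch": branch} if branch else {}
    return repo.create_file(dest_path, message, contents, **extra)


# Hypothetical usage (placeholder token, repo, and paths):
# repo = connect("<your-token>", "username/repo-name")
# upload_text_file(repo, "data/dataset.csv", "datasets/dataset.csv", "Add dataset")
```

Keeping the connection step separate from the upload step means the same repo object can be reused for several files.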
After running this script, go to the repo in a browser, and you’ll see the new file there.
Use Variables to Assist Automation
If you’re using the API to upload multiple files to different repos owned by different users, it likely won’t be practical to hardcode the information into the script and rewrite it every time. Instead, you can create the skeleton of a script and pass in the details as variables. A text file or Python’s sys module are two ways to achieve this.
Pass in Variables with a Text File
Create a text file with the unique information. In this example, I’m using a file called vars.txt, and I’m including:
- the token
- GitHub repo to which to upload
- the path to the local file
- the destination directory and filename on the GitHub repo
- a commit message
Then, have the Python script read this text file, assign each line to a variable, and use the variables in the appropriate places.
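Reading the text file could look something like the sketch below. It assumes vars.txt holds exactly one value per line in the order listed above; the helper name read_vars and the UploadVars field names are hypothetical, chosen only for this example.

```python
from collections import namedtuple

# Field order mirrors the assumed line order in vars.txt.
UploadVars = namedtuple(
    "UploadVars", "token repo_name local_path dest_path message"
)


def read_vars(path="vars.txt"):
    """Read the five settings from a text file, one per line."""
    with open(path, "r", encoding="utf-8") as f:
        # Strip newlines and ignore blank lines.
        lines = [line.strip() for line in f if line.strip()]
    if len(lines) != 5:
        raise ValueError(
            f"expected 5 non-empty lines in {path}, found {len(lines)}"
        )
    return UploadVars(*lines)
```

The returned fields (for example settings.repo_name after settings = read_vars()) can then be used wherever the script previously had hardcoded strings. Since the file contains the token, it is best kept out of version control.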
Pass in Variables with the sys Module
If you would rather not create text files, Python’s sys module allows you to pass in variables directly from the command line.
Suppose my script file is called ghAPI_script.py. I can call the script from the command line and tack on strings at the end.
In the script itself, I include import sys and assign variables to each string using sys.argv[i], where i refers to the order in which the strings are supplied. (Note that sys.argv[0] is skipped because it holds the script’s own name.)
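As a rough sketch, the argument handling might look like this. The argument order (token, repo, local path, destination path, commit message) is an assumption made for this example, as is the helper name parse_upload_args.

```python
import sys

# Assumed invocation (placeholder values):
#   python ghAPI_script.py <token> username/repo-name data.csv datasets/data.csv "Add dataset"


def parse_upload_args(argv=None):
    """Map command-line strings to named settings."""
    argv = sys.argv if argv is None else argv
    # argv[0] is the script name, so the five values start at argv[1].
    if len(argv) != 6:
        raise SystemExit(
            "usage: python ghAPI_script.py TOKEN REPO LOCAL_PATH DEST_PATH MESSAGE"
        )
    token, repo_name, local_path, dest_path, message = argv[1:]
    return {
        "token": token,
        "repo_name": repo_name,
        "local_path": local_path,
        "dest_path": dest_path,
        "message": message,
    }
```

A commit message with spaces needs quotes on the command line so that it arrives as a single string.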
…and the file is uploaded.
Download a File from GitHub Using a Python Script
Let’s now go in the reverse direction and download a file.
Again, start by creating a GitHub instance, and set the relevant repo to a variable. Then, use the get_contents() method to grab all the file’s information from that GitHub repo and assign it to a variable.
This call creates a PyGithub ContentFile object, but the content is not in the desired form yet, because it is still Base64-encoded. The decoded_content attribute returns the decoded bytes, which can then be written to a new file.
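A download along these lines might look like the following sketch. The helper name download_file is hypothetical, and repo is assumed to be a PyGithub Repository obtained the same way as for uploading.

```python
def download_file(repo, repo_path, local_path):
    """Fetch one file from the repo and write it to local_path."""
    content_file = repo.get_contents(repo_path)
    # get_contents returns a list when repo_path is a directory;
    # this sketch handles single files only.
    if isinstance(content_file, list):
        raise ValueError(f"{repo_path} is a directory, not a file")
    # decoded_content is bytes, so the local file is opened in binary mode.
    with open(local_path, "wb") as f:
        f.write(content_file.decoded_content)
    return local_path


# Hypothetical usage (placeholder paths):
# download_file(repo, "datasets/dataset.csv", "dataset_copy.csv")
```

Writing in binary mode keeps the bytes exactly as they are stored in the repo, which avoids guessing the text encoding.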
Upload and Download Binary Files to GitHub
The above scripts work if I’m using the API to send and receive files that contain only text, such as csv files, markdown files, or files of code. But what if I want to send or receive a zip file or a media file like an image, a video, or a song?
Binary files like these require an extra step or two. Basically, they need to be encoded to or decoded from the Base64 format.
When uploading, include import base64 at the top of the script and then convert the contents into Base64 before doing the API call. As an example, this script uploads an image called “image.jpg”.
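The Base64 conversion itself might look like the sketch below. The helper name read_as_base64 is made up for illustration, and how the result is handed to create_file is left out on purpose: some PyGithub releases accept raw bytes and apply Base64 themselves, so it is worth checking the installed version to avoid encoding twice.

```python
import base64


def read_as_base64(local_path):
    """Read a binary file and return its contents as a Base64 string."""
    # "rb" reads raw bytes, which is what images and zip files need.
    with open(local_path, "rb") as f:
        raw = f.read()
    return base64.b64encode(raw).decode("ascii")


# Hypothetical usage:
# encoded = read_as_base64("image.jpg")
```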
And when downloading from GitHub, again use import base64. The file will arrive already encoded in Base64, so it takes another step or two to get it into an actual image file.
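Those extra steps could be sketched as follows, assuming content_file is the ContentFile returned by get_contents() and that its content attribute holds the Base64 text. The helper name save_binary_file is hypothetical.

```python
import base64


def save_binary_file(content_file, local_path):
    """Decode a ContentFile's Base64 content and write it as a binary file."""
    # b64decode ignores the line breaks that appear in the Base64 text.
    raw = base64.b64decode(content_file.content)
    with open(local_path, "wb") as f:
        f.write(raw)
    return len(raw)


# Hypothetical usage (placeholder paths):
# content_file = repo.get_contents("images/image.jpg")
# save_binary_file(content_file, "image_copy.jpg")
```

The return value is the number of bytes written, which gives a quick check that the file is not empty.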
Summary
This Medium post explains how to upload and download files, whether text-based or binary, to a GitHub repo via APIs.