Trial by Fire: Effective Use of Python Asyncio

Learning to Effectively Use Python Asyncio

I’m working on an old project I started when I first began coding, previously called Nuntium and now Newspy. The goal is to create an archive of news articles that is routinely rechecked for changes - e.g. an article that gets a correction or editor’s note and is then removed not long thereafter. When I originally coded the project I recall growing frustrated by Python’s concurrency. So much so that I wrote a threaded HTML downloader in Go. As I begin rewriting Newspy I am determined to keep it all in Python.

Today, while reworking some code, I stumbled upon the way I should have been handling the event loop and async coroutines all along. I’ve included snippets of the before and after code, and below them I explain the changes.

TLDR - when a coroutine has to process multiple items, wrap them in a task list and await asyncio.gather() once, as opposed to awaiting the result of each loop iteration.

TLDR 2 - while I’m getting a better handle on Python async, and it can be really fast - and somewhat concurrent - the GIL is a pain in the ass.

"""
Before code: ~40 urls took 13 seconds
"""
async def py_run():
    print("starting py download...")
    s_time = time.perf_counter()
    output = []
    async with aiohttp.ClientSession() as session:
        for u in _articles_py:
            output.append(await download_html(u, session))
    print(output)
    print(f"finished py download in {time.perf_counter() - s_time} seconds")
"""
After code: ~40 urls took 3 seconds
"""
async def py_run():
    print("starting py download...")
    s_time = time.perf_counter()
    tasks = []
    async with aiohttp.ClientSession() as session:
        for u in _articles_py:
            tasks.append(download_html(u, session))
        output = await asyncio.gather(*tasks)
    print(output)
    print(f"finished py download in {time.perf_counter() - s_time} seconds")

The main difference is how the work is scheduled on the event loop. In the before code block, the loop body blocks inside the output.append() call on each await download_html(u, session), so only one download is ever in flight at a time. This is simply wrong, and a result of my ignorance regarding the handling of an event loop and how it’s passed throughout a chain of awaitable coroutines.

The after code block creates a list of coroutines that are awaited as a whole. asyncio.gather() takes the unpacked tasks list, schedules each coroutine as a task on the event loop, and runs them concurrently.

Note how the program now awaits only once, on gather, whereas the before code would stop and await the downloaded HTML on each iteration of _articles_py.
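
To see the scheduling difference in isolation, here is a minimal, self-contained sketch (not Newspy code) that uses asyncio.sleep() as a stand-in for the network call. The sequential version takes roughly len(urls) seconds, the gathered version roughly 1 second:

import asyncio
import time

async def fake_download(url: str) -> str:
    # stand-in for an HTTP request that takes ~1 second
    await asyncio.sleep(1)
    return f"<html>{url}</html>"

async def sequential(urls):
    # awaits each download before starting the next: ~len(urls) seconds total
    return [await fake_download(u) for u in urls]

async def gathered(urls):
    # schedules everything at once and awaits together: ~1 second total
    return await asyncio.gather(*(fake_download(u) for u in urls))

urls = [f"https://example.com/{i}" for i in range(5)]

s = time.perf_counter()
asyncio.run(sequential(urls))
print(f"sequential: {time.perf_counter() - s:.1f}s")

s = time.perf_counter()
asyncio.run(gathered(urls))
print(f"gathered: {time.perf_counter() - s:.1f}s")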

One final thought I want to get out so I don’t forget: is it possible to wrap an event loop in a thread pool executor? Say I’m processing 3 sources - ideally each would run in its own thread / process with an event loop specific to that source.
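
As far as I can tell this should work: asyncio.run() creates a fresh event loop in whatever thread calls it, so a concurrent.futures.ThreadPoolExecutor can give each source its own thread and therefore its own loop. A rough sketch, where process_source() and the source names are hypothetical stand-ins for the real per-source pipeline:

import asyncio
from concurrent.futures import ThreadPoolExecutor

async def process_source(source: str) -> str:
    # hypothetical per-source pipeline: download, parse, diff, archive...
    await asyncio.sleep(1)  # stand-in for the real async work
    return f"{source} done"

def run_source(source: str) -> str:
    # asyncio.run() builds a new event loop in the calling thread,
    # runs the coroutine to completion, and tears the loop down
    return asyncio.run(process_source(source))

sources = ["source_a", "source_b", "source_c"]  # hypothetical names
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    for result in pool.map(run_source, sources):
        print(result)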

There’s a lot to learn - stay tuned for more of my trial by fire coding notes.
