Improving Python Concurrency With Asyncio

In this post, I will share my experience of optimizing the performance of a Python web scraper that downloads news articles and monitors them for changes. The project, formerly known as Nuntium and now Newspy, aims to create a dynamic archive of news articles that can track corrections, edits, or deletions over time.

When I first started this project, I was frustrated by the limitations of Python’s concurrency model. I even resorted to writing a threaded HTML downloader in Go, which was faster but more complex to maintain.

However, I recently discovered a better way to handle concurrency in Python using the asyncio library. Asyncio lets you write asynchronous code that runs multiple I/O-bound tasks concurrently on a single thread. It uses an event loop to manage the execution of coroutines, which are functions that can suspend and resume their execution at defined await points.
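To make the suspend-and-resume idea concrete, here is a minimal, self-contained sketch (not from Newspy; asyncio.sleep stands in for real I/O) of two coroutines being interleaved by the event loop:

import asyncio

async def tick(name: str, delay: float) -> str:
    # The coroutine suspends here and the event loop is free to run something else.
    await asyncio.sleep(delay)
    return f"{name} finished after {delay}s"

async def main():
    # Wrapping coroutines in tasks schedules them on the event loop immediately.
    t1 = asyncio.create_task(tick("a", 1.0))
    t2 = asyncio.create_task(tick("b", 0.5))
    print(await t1, await t2)  # total runtime is ~1s, not ~1.5s

asyncio.run(main())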

Before and After Code

To illustrate the difference that asyncio makes, I will show you two versions of the same function that downloads HTML from a list of URLs. Both use the aiohttp library to make HTTP requests. The first version is the original one, which awaits each request inside a simple for loop. The second version is the improved one, which builds a list of coroutines and uses asyncio.gather to schedule them all at once.

TLDR - when a coroutine has to process multiple items, wrap them in a task list and await asyncio.gather() once, rather than awaiting the result of each loop iteration.

TLDR 2 - while I’m getting a better handle on Python async, and it can be really fast for I/O-bound work, it gives you concurrency rather than parallelism, and the GIL is still a pain in the ass.

"""
Before code: ~40 urls took 13 seconds
"""
async def py_run():
    print("starting py download...")
    s_time = time.perf_counter()
    output = []
    async with aiohttp.ClientSession() as session:
        for u in _articles_py:
            output.append(await download_html(u, session))
    print(output)
    print(f"finished py download in {time.perf_counter() - s_time} seconds")
"""
After code: ~40 urls took 3 seconds
"""
async def py_run():
    print("starting py download...")
    s_time = time.perf_counter()
    tasks = []
    async with aiohttp.ClientSession() as session:
        for u in _articles_py:
            tasks.append(download_html(u, session))
        output = await asyncio.gather(*tasks)
    print(output)
    print(f"finished py download in {time.perf_counter() - s_time} seconds")

Explanation of the Changes

The main difference between the two versions is how the work is scheduled on the event loop. In the before code, the await inside the for loop waits for each download_html coroutine to finish before the next one starts, so the requests run strictly one after another. The event loop is not technically blocked, but only one download is ever in flight, which throws away the concurrency that asyncio offers.

In the after code, the for loop builds a list of coroutine objects and passes them to asyncio.gather, which wraps them in tasks and returns a single awaitable that completes when every task has finished. This lets the event loop run all the downloads concurrently, interleaving them as responses arrive, which is what produces the large improvement in execution time shown by the print statements.

Note how the after code awaits only once, on the asyncio.gather call, while the before code awaits on every iteration of the for loop. Awaiting once means every request has already been scheduled and is in flight together, and the event loop simply resumes each coroutine as its response comes back.
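If you want to see the scheduling difference without hitting the network, here is a small standalone sketch that swaps the HTTP call for asyncio.sleep. The sequential version takes roughly one delay per URL, the gathered version roughly one delay in total:

import asyncio
import time

async def fake_download(url: str) -> str:
    await asyncio.sleep(0.5)  # stand-in for network latency
    return f"<html>{url}</html>"

async def sequential(urls):
    return [await fake_download(u) for u in urls]  # ~0.5s per URL

async def gathered(urls):
    return await asyncio.gather(*(fake_download(u) for u in urls))  # ~0.5s total

def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    print(f"{len(result)} pages in {time.perf_counter() - start:.2f}s")

urls = [f"https://example.com/{i}" for i in range(10)]
timed(sequential(urls))  # ~5s
timed(gathered(urls))    # ~0.5s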

Future Work

One question that I still have is whether it is possible to run multiple event loops in parallel, using a thread pool or a process pool. For example, if I have three sources of news articles, I might want to run a separate event loop for each source, to isolate them from each other and increase the throughput. I will explore this possibility in a future post.
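My first instinct, and it is only a sketch until I test it properly, is that asyncio.run can be called once per worker thread, so something like the following (with a hypothetical scrape_source coroutine standing in for the real scraping logic) would give each news source its own event loop:

import asyncio
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-source coroutine; stands in for the real scraping logic.
async def scrape_source(source: str) -> str:
    await asyncio.sleep(1)  # pretend to download and parse this source's articles
    return f"{source}: done"

def run_source(source: str) -> str:
    # asyncio.run creates a fresh event loop in the calling thread,
    # so each worker thread gets its own isolated loop.
    return asyncio.run(scrape_source(source))

sources = ["bbc", "reuters", "ap"]
with ThreadPoolExecutor(max_workers=len(sources)) as pool:
    for result in pool.map(run_source, sources):
        print(result)

Note that this still shares one GIL across the threads, so it would isolate sources and help with I/O throughput rather than CPU parallelism; a process pool would be the next thing to try for CPU-bound work.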

Conclusion

In this post, I showed you how to improve the performance of a Python web scraper using asyncio. I demonstrated the difference between awaiting requests one at a time and scheduling them concurrently, and how to use asyncio.gather to run multiple tasks at once. I hope you found this post useful and learned something new about Python concurrency. Stay tuned for more posts on my coding journey.
