Iterate through multiple sorted iterables without merging
Suppose you have multiple lists or iterables (such as lines in a file or records in a database), and you want to iterate through all of…
Suppose you have multiple lists or iterables (such as lines in a file or records in a database), and you want to iterate through all of them, sorted by a key.
Sure, you could use itertools
to achieve that:
from itertools import chain
sortedlist = sorted(chain(iter1, iter2, ...), key=lambda item: ...)
What that would do, however, is bring all items into memory and sort them. If you are dealing with small lists, it’s ok, but if you iterate through large files or if your query gets too big, that won’t work.
If, however, these lists are sorted in their source, or if you are able to quickly sort them (for example, using the Unix command sort
or specifying ORDER BY
in your query), you don’t have to bring them all to memory, and you can iterate through the items using the function below.
import bisect
SENTINEL = object()
def sortedmerge(*iters, key=None):
"""Iterates through all the items in the sorted iterables and
yield all items in other.
In other words, merges all iterables into one.
"""
# These functions will be used to sort the lists below
if key is None:
def itemkey(item):
return item[0]
else:
def itemkey(item):
return key(item[0])
# Transform all arguments into iterables (if not already)
iters = [iter(i) for i in iters]
# Obtain the first item in each (or SENTINEL if they're empty)
iters = [(next(it, SENTINEL), it) for it in iters]
# Remove the empty iterables
iters = [(value, it) for value, it in iters if value is not SENTINEL]
# Sort it the first time - and keep it sorted
iters.sort(key=itemkey)
while iters:
# Take the first item (smallest)
value, it = iters.pop(0)
yield value
# Reinsert the iterable in the right position (unless exhausted)
if (value := next(it, SENTINEL)) is not SENTINEL:
bisect.insort_left(iters, (value, it), key=itemkey)
This function will simply return the next item of each iterable in order. An example of its usage is as below:
# Creates a single log file (merged.log) with the contents of 3 other files
with open('merged.log', 'w') as outlog, \
open('app.log') as log1, \
open('system.log') as log2, \
open('internal.log') as log3:
for line in sortedmerge(log1, log2, log3):
outlog.write(line)
Note: Since the iterables have to be sorted, this function simply presumes that each line starts with the date/time in a sorted format (e.g. “YYYY-MM-DD HH:MM:SS”). If there are multi-line log entries, this function won’t work.