Managing and grouping files with Python
Introduction
We are sometimes faced with systems that generate lots of files we have to deal with, be it for triage or later reading; they could be logs, data dumps, automatically generated reports and so on. In many cases, we don't have a say in the how and why of the files' organization and might want to deal with them automatically later on. That is what I'd like to tackle here, using the Python standard library. This is very much an "automate the boring stuff" article where I'll show a handful of functions that make dealing with files easier.
The bread and butter of the following sections will be the pathlib module, which provides a lot of cross-platform functionality to work with paths and files. I'll also use the itertools module, which is worth getting familiar with.
The goal of this post is to provide functions that will allow turning a directory such as this one (sub-directories can be handled as well):
AREA44_LOG1.txt
AREA51_LOG2.txt
info.md
AREZ51_LOG3.txt -> mistake here!
AREA23_LOG1.txt
AREA51_LOG1.txt
junk.json
AREA00_LOG1.txt
Makefile.mk
AREA23_LOG2.txt
Into this result:
output-dir/
├── AREA00
│   └── AREA00_LOG1.txt
├── AREA23
│   ├── AREA23_LOG1.txt
│   └── AREA23_LOG2.txt
├── AREA44
│   └── AREA44_LOG1.txt
└── AREA51
    ├── AREA51_LOG1.txt
    └── AREA51_LOG2.txt
Filtering files on extensions
A simple, useful operation is filtering on file extensions, such as extracting all the text files from a directory, or all the image files from a directory and its sub-directories.
Python's Path objects allow globbing at three different levels: inside the directory only, inside its direct sub-directories only, and inside the directory and all its sub-directories, recursively.
See, for example, the following structure:
test-dir/
├── other-subdir
│   ├── deeper-dir
│   │   └── deeper-file.txt
│   ├── subfile.py
│   └── subfile.txt
├── random.txt
├── stuff.py
└── subdir
    ├── subdir.log
    ├── subdir.mk
    └── subdir.txt
We can grab its files like this:
from pathlib import Path
directory = Path("test-dir")
# First level: files inside the dir only
level_1 = list(directory.glob("*.txt"))
# Second level: files inside the direct sub-directories only
level_2 = list(directory.glob("*/*.txt"))
# Recursive: the dir and all its sub-directories
level_recur = list(directory.glob("**/*.txt"))
print(level_1, "\n")
print(level_2, "\n")
print(level_recur, "\n")
[PosixPath('test-dir/random.txt')]
[PosixPath('test-dir/other-subdir/subfile.txt'),
PosixPath('test-dir/subdir/subdir.txt')]
[PosixPath('test-dir/random.txt'), PosixPath('test-dir/other-subdir/subfile.txt'),
PosixPath('test-dir/other-subdir/deeper-dir/deeper-file.txt'),
PosixPath('test-dir/subdir/subdir.txt')]
There are a couple of things of note here. Firstly, I'm running this file inside the top directory and using a relative path (namely, test-dir). I could also use an absolute path; pathlib lets us pick how paths are declared and can extract the absolute path from a relative one using the absolute() method. Secondly, I had to wrap the glob results in a list because the output is a generator object. Thirdly, the output is a list of PosixPath objects because I'm on a Linux machine; on other operating systems, you might see something different (WindowsPath on Windows, for instance).
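To illustrate those first two points, here is a quick sketch (the absolute path will of course depend on where the code is run):

from pathlib import Path

directory = Path("test-dir")

# Extract the absolute path from the relative one
print(directory.absolute())  # e.g. /home/user/blog/test-dir

# glob() returns a lazy generator, not a list
matches = directory.glob("*.txt")
print(matches)        # e.g. <generator object Path.glob at 0x...>
print(list(matches))  # consuming it gives the actual paths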
Now this is all well and good but somewhat limited, because we can only glob a single file extension at a time. We can easily fix this by creating a list of valid extensions.
Let's try to get all the Python files and text files inside a directory and its sub-directories; we'll use a generator expression and f-strings. The first attempt below won't have the desired output because it creates a list of generators instead of a list of files. I could iterate over the list and create a list of lists, but I'd like a flat output instead. This is where itertools comes to the rescue.
from pathlib import Path
from itertools import chain
directory = Path("test-dir")
extensions = ["py", "txt"]
# List of generators, bad output
py_and_text_files = list(directory.glob(f"**/*.{ext}") for ext in extensions)
# List of files from flattened generators, good output
# Per the docs: chain.from_iterable(['ABC', 'DEF']) --> A B C D E F
py_and_text_files = list(
    chain.from_iterable(directory.glob(f"**/*.{ext}") for ext in extensions)
)
print(py_and_text_files)
[PosixPath('test-dir/stuff.py'),
PosixPath('test-dir/other-subdir/subfile.py'),
PosixPath('test-dir/random.txt'),
PosixPath('test-dir/other-subdir/subfile.txt'),
PosixPath('test-dir/other-subdir/deeper-dir/deeper-file.txt'),
PosixPath('test-dir/subdir/subdir.txt')]
Armed with enough knowledge, let's write a helper function that will filter a directory given some extensions and a recursion level. I've added a couple of checks that I thought were useful: converting string paths to Path objects, handling one extension vs. a list of extensions and, finally, handling how deep the globbing goes.
from pathlib import Path
from itertools import chain
def filter_extensions(directory, extensions, level=0):
    # 0: the directory only, 1: direct sub-directories, 2: fully recursive
    lvl_map = {0: ".", 1: "*", 2: "**"}
    if isinstance(directory, str):
        directory = Path(directory)
    if isinstance(extensions, str):
        return directory.glob(f"{lvl_map[level]}/*.{extensions}")
    return chain.from_iterable(
        directory.glob(f"{lvl_map[level]}/*.{e}") for e in extensions
    )
# Getting all Python and text files from all levels
list(filter_extensions(Path("test-dir"), ["py", "txt"], 2))
[PosixPath('test-dir/stuff.py'),
PosixPath('test-dir/other-subdir/subfile.py'),
PosixPath('test-dir/random.txt'),
PosixPath('test-dir/other-subdir/subfile.txt'),
PosixPath('test-dir/other-subdir/deeper-dir/deeper-file.txt'),
PosixPath('test-dir/subdir/subdir.txt')]
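Since the example above only exercises the recursive level, here is a sketch of what the other two levels would return on the same test-dir (globbing order isn't guaranteed, so the exact ordering may differ on your machine):

# Level 0: files directly inside the directory only
list(filter_extensions("test-dir", ["py", "txt"], 0))
# -> [PosixPath('test-dir/stuff.py'), PosixPath('test-dir/random.txt')]

# Level 1: files inside the direct sub-directories only
list(filter_extensions("test-dir", ["py", "txt"], 1))
# -> [PosixPath('test-dir/other-subdir/subfile.py'),
#     PosixPath('test-dir/other-subdir/subfile.txt'),
#     PosixPath('test-dir/subdir/subdir.txt')]

Note that passing the directory as a plain string also works here, thanks to the isinstance check.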
We've got the first step down! Now we can move on to the next one, which is grouping files.
Grouping files
After filtering the files to get only the relevant extensions, we'll manipulate them a bit more. I'll show you how to group files by arbitrary conditions using a function.
A common solution to capture patterns in (file) names is to use regular expressions. They have their place when you know in advance what you're matching, but that's not always the case. For example, consider log files for some actions starting with LOG_X_... where X is a number. The range of numbers might not be known beforehand, but we still want to group logs with the same number together, or more likely, with the same code, ID, timestamp, etc. We can accomplish this with itertools.groupby.
A groupby requires a key function that indicates how to group; the data generally needs to be sorted with that same function first, so that the members of each group sit next to each other. Then, it can separate the keys and the groups. Here is an example with tuples where the first element of the tuple is the key.
from itertools import groupby

unique_keys = []
groups = []

tuples = [("a", 1, 2), ("b", 3, 4), ("b", 2), ("c", 11, 12), ("a", 0, 2, 6)]
key_func = lambda x: x[0]

data = sorted(tuples, key=key_func)
for key, group in groupby(data, key_func):
    unique_keys.append(key)
    # group is an iterator, so wrapping it in a list is necessary
    groups.append(list(group))

print(unique_keys)
print(groups)
['a', 'b', 'c']
[[('a', 1, 2), ('a', 0, 2, 6)], [('b', 3, 4), ('b', 2)], [('c', 11, 12)]]
In the output, you can see that the data is sorted by the keys (the first element of each tuple) and grouped into one list per key. This is exactly what we are going to do with the file names, but using key-value pairs (a dictionary) instead of plain lists.
Let's try it with the following files under the same directory:
AREA44_LOG1.txt
AREA51_LOG2.txt
info.md
AREZ51_LOG3.txt -> mistake here!
AREA23_LOG1.txt
AREA51_LOG1.txt
junk.json
AREA00_LOG1.txt
Makefile.mk
AREA23_LOG2.txt
We would like to keep only the text files and group the area paths together, but there's a small mistake in one of the file names, which would lead to it having its own group. We don't want groups to be created willy-nilly, so let's add a filter to remove anything that doesn't start with AREA. This will require a bit of regex knowledge.
import re
# Filters paths by file name (ignores the rest of the path)
def filter_keyword(path_generator, regex):
    pat = re.compile(regex)
    return filter(lambda x: pat.match(x.name), path_generator)
Next, let's call this function with a simple regex after filtering by extension:
p = Path("test-groups")
list(filter_keyword(filter_extensions(p, "txt", 0), r'^AREA'))
[PosixPath('test-groups/AREA44_LOG1.txt'),
PosixPath('test-groups/AREA51_LOG2.txt'),
PosixPath('test-groups/AREA23_LOG1.txt'),
PosixPath('test-groups/AREA51_LOG1.txt'),
PosixPath('test-groups/AREA00_LOG1.txt'),
PosixPath('test-groups/AREA23_LOG2.txt')]
Finally, let's define the key, and therefore the grouping rule, as the characters before the underscore (_). Each area name will be stored as a dictionary key while the paths will be the values. A defaultdict is used to store the elements.
from collections import defaultdict

keyfunc = lambda x: x.name.split("_")[0]  # first element of the split on `_`

def group_files(filepaths, key_function):
    grp_files = defaultdict(list)
    data = sorted(filepaths, key=key_function)
    for key, group in groupby(data, key_function):
        grp_files[key] = list(group)
    return grp_files
When everything is put together, we can just do this:
p = Path("test-groups")
keyfunc = lambda x: x.name.split("_")[0]
group_files(filter_keyword(filter_extensions(p, "txt", 0), r'^AREA'), keyfunc)
defaultdict(<class 'list'>,
{
'AREA00': [PosixPath('test-groups/AREA00_LOG1.txt')],
'AREA23': [PosixPath('test-groups/AREA23_LOG1.txt'), PosixPath('test-groups/AREA23_LOG2.txt')],
'AREA44': [PosixPath('test-groups/AREA44_LOG1.txt')],
'AREA51': [PosixPath('test-groups/AREA51_LOG2.txt'), PosixPath('test-groups/AREA51_LOG1.txt')]
})
That closes the grouping section; the next step is to move each group of files into a dedicated folder.
Moving & copying files
Up until now, we've only manipulated file paths, which is quite safe; if you played around with the code above, nothing big happened. This section, though, will move actual files around. If you want to try the code out, do so with throwaway files or non-critical directories at first.
The goal here is to use the keys of the grouped file paths as new directories to store the files.
This will require a destination directory but no source directory, as that information is already contained in the (absolute) path of the files. Note that the / operator is overloaded such that path1 / path2 becomes path1/path2. Also, the parents and exist_ok parameters are there to mirror those of Path.mkdir.
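A quick illustration of the operator:

from pathlib import Path

print(Path("output-dir") / "AREA51")  # output-dir/AREA51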
def mv_files_into_dir(paths_dict, dest_dir, parents=False, exist_ok=False):
    for dirname, files in paths_dict.items():
        # One sub-directory per group key
        curr_dir = dest_dir / dirname
        curr_dir.mkdir(parents=parents, exist_ok=exist_ok)
        for f in files:
            # rename() moves the file to its new location
            f.rename(curr_dir / f.name)
If we put all the calls together, we get:
from pathlib import Path
from itertools import chain, groupby
from collections import defaultdict
import re
# function definitions here ... #
p = Path("test-groups")
out = Path("output-dir")
key_func = lambda x: x.name.split("_")[0]
paths_dict = group_files(
    filter_keyword(filter_extensions(p, "txt", 0), r'^AREA'), key_func
)
mv_files_into_dir(paths_dict, out)
With the output directory looking something like this:
output-dir/
├── AREA00
│   └── AREA00_LOG1.txt
├── AREA23
│   ├── AREA23_LOG1.txt
│   └── AREA23_LOG2.txt
├── AREA44
│   └── AREA44_LOG1.txt
└── AREA51
    ├── AREA51_LOG1.txt
    └── AREA51_LOG2.txt
This moved the files; they are not in the test-groups directory anymore. As such, you should have a strategy for reversal in case something goes wrong. The command pattern is aimed at encapsulating operations and allows for reversal. I won't go into it because it's beyond the scope of the article, but if you're planning on making tools that move files around, an undo command will most likely come in handy.
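To give a taste of the idea, a minimal sketch (hypothetical, nowhere near a full command pattern) could log each move so it can be replayed in reverse:

# Hypothetical log of (src, dst) pairs for undoing moves
moves = []

def mv_with_log(src, dst):
    src.rename(dst)
    moves.append((src, dst))

def undo_moves():
    # Replay the log backwards, moving each file back where it came from
    for src, dst in reversed(moves):
        dst.rename(src)
    moves.clear()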
An alternative is to copy the files instead of moving them, using the shutil module, with the caveat that shutil.copy won't copy the file's metadata (creation time, modification time) unless you're willing to sacrifice some speed and use shutil.copy2.
import shutil
def copy_files_into_dir(paths_dict, dest_dir, parents=False, exist_ok=False):
    for dirname, files in paths_dict.items():
        curr_dir = dest_dir / dirname
        curr_dir.mkdir(parents=parents, exist_ok=exist_ok)
        for f in files:
            shutil.copy(f, curr_dir)  # use copy2 to try to keep metadata
Conclusion
In this post, I attempted to show how powerful Python built-ins let us manipulate paths and files, with cross-platform support. There are ways to improve what's been shown; some of the functions would benefit from being wrapped into a class, and more powerful operations are possible as well. Probably the best path (😉) to improve the functionality here would be to use more complex functions during the groupby, such as grouping per file size until some threshold is reached, or grouping per last file modification to archive stale files.
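As a parting sketch (my own assumption, not something shown above), grouping per last modification would only require swapping the key function, here by year and month:

from datetime import datetime

# Hypothetical key function: group files by the year-month of last modification
by_month = lambda p: datetime.fromtimestamp(p.stat().st_mtime).strftime("%Y-%m")

group_files(filter_extensions("test-groups", "txt", 0), by_month)

The size-threshold variant is a bit different, since it needs to accumulate sizes across files; a plain key function won't cut it there, and an explicit loop would do the job instead.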