TL;DR: Here are two equivalents of find -type f that go over all files in all folders below and including the current one:
folder = '.'

import os
for currentpath, folders, files in os.walk(folder):
    for file in files:
        print(os.path.join(currentpath, file))

# or:

import glob
for pathstr in glob.iglob(glob.escape(folder) + '/**/*', recursive=True):
    print(pathstr)
Comparing the two methods:

- os.walk is about 3× faster.
- os.walk used slightly more memory in my test, because the files list held 82k entries, whereas glob returns an iterator and streams the results. The piecemeal handling of each result (more calls and less buffering) likely explains the speed difference.
- If you forget the i in glob.iglob(), it will return a list rather than an iterator and potentially use a lot more memory.
- os.walk will not silently give incomplete results or unexpectedly interpret a name as a matching pattern.
- glob doesn't show empty directories.
- glob needs directory and file names to be escaped with glob.escape(name), because they can contain characters that are special to glob.
- glob excludes directories and files starting with a dot (e.g. ~/.bashrc or ~/.vim), and include_hidden does not solve that: it includes hidden folders only, so you need a second pattern for dotfiles.
- glob doesn't tell you what is a file and what is a directory.
- glob walks into symlinks and may lead to you enumerating a lot of files in completely different places (which may be what you want; in that case, os.walk has followlinks=True as an option).
- os.walk lets you control which paths it walks down by modifying the folders list while it's running (see the sketch after this list), though personally this feels a bit messy and I'm not sure I would recommend it.
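To make that last point concrete, here is a minimal sketch of such in-place pruning; skipping hidden directories is just an example of something you might want to filter out:

import os

for currentpath, folders, files in os.walk('.'):
    # Slice assignment edits the list that os.walk handed us, so it
    # will not descend into the removed folders. Rebinding the name
    # (folders = [...]) would have no effect on the walk.
    folders[:] = [f for f in folders if not f.startswith('.')]
    for file in files:
        print(os.path.join(currentpath, file))

This only works in the default top-down mode (topdown=True), because os.walk looks at the list only after yielding it to you.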
Other answers already mentioned os.walk(), but it could be explained better. It's quite simple! Let's walk through this tree:
docs/
└── doc1.odt
pics/
todo.txt
With this code:
for currentpath, folders, files in os.walk('.'):
    print(currentpath)
The currentpath is the current folder it is looking at. This will output:
.
./docs
./pics
So it loops three times, because there are three folders: the current one, docs, and pics. In every loop, it fills the variables folders and files with the folders and files that the current folder directly contains. Let's show them:
for currentpath, folders, files in os.walk('.'):
    print(currentpath, folders, files)
This shows us:
# currentpath  folders             files
.              ['pics', 'docs']    ['todo.txt']
./pics         []                  []
./docs         []                  ['doc1.odt']
So in the first line, we see that we are in the folder ., that it contains two folders, namely pics and docs, and that there is one file, namely todo.txt. You don't have to do anything to recurse into those folders: as you can see, it recurses automatically and gives you the files in any subfolders, and in any subfolders of those (though we don't have those in the example).
If you just want to loop through all files, the equivalent of find -type f, you can do this:
for currentpath, folders, files in os.walk('.'):
    for file in files:
        print(os.path.join(currentpath, file))
This outputs:
./todo.txt
./docs/doc1.odt
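And if you'd rather collect the paths into a list than print them, the same loop can be flattened into a list comprehension; this is just a variant of the snippet above:

import os

all_files = [
    os.path.join(currentpath, file)
    for currentpath, folders, files in os.walk('.')
    for file in files
]
# For the example tree above this gives ['./todo.txt', './docs/doc1.odt']
print(all_files)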