343

I have a C++/Obj-C background and I am just discovering Python (been writing it for about an hour). I am writing a script to recursively read the contents of text files in a folder structure.

The problem I have is the code I have written will only work for one folder deep. I can see why in the code (see #hardcoded path), I just don't know how I can move forward with Python since my experience with it is only brand new.

Python Code:

import os
import sys

rootdir = sys.argv[1]

for root, subFolders, files in os.walk(rootdir):

    for folder in subFolders:
        outfileName = rootdir + "/" + folder + "/py-outfile.txt" # hardcoded path
        folderOut = open( outfileName, 'w' )
        print "outfileName is " + outfileName

        for file in files:
            filePath = rootdir + '/' + file
            f = open( filePath, 'r' )
            toWrite = f.read()
            print "Writing '" + toWrite + "' to" + filePath
            folderOut.write( toWrite )
            f.close()

        folderOut.close()

17 Answers

477

Make sure you understand the three return values of os.walk:

for root, subdirs, files in os.walk(rootdir):

has the following meaning:

  • root: the path of the directory currently being walked
  • subdirs: the directories inside root
  • files: the files directly inside root (not inside any of subdirs)

And please use os.path.join instead of concatenating with a slash! Your problem is filePath = rootdir + '/' + file - you must concatenate the currently "walked" folder instead of the topmost folder, so it must be filePath = os.path.join(root, file). By the way, file is a builtin, so you normally shouldn't use it as a variable name.
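Applied to the question's loop, the fix might look like this minimal sketch (collect_files is a hypothetical helper name, not from the original code):

```python
import os

def collect_files(rootdir):
    """Return the full path of every file under rootdir, recursively."""
    paths = []
    for root, subdirs, files in os.walk(rootdir):
        for filename in files:
            # Join against root (the folder currently being walked),
            # not against the topmost rootdir.
            paths.append(os.path.join(root, filename))
    return paths
```

Because root changes on every iteration, the joined paths are correct at any depth.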

Another problem is your loops, which should look like this, for example:

import os
import sys

walk_dir = sys.argv[1]

print('walk_dir = ' + walk_dir)

# If your current working directory may change during script execution, it's recommended to
# immediately convert program arguments to an absolute path. Then the variable root below will
# be an absolute path as well. Example:
# walk_dir = os.path.abspath(walk_dir)
print('walk_dir (absolute) = ' + os.path.abspath(walk_dir))

for root, subdirs, files in os.walk(walk_dir):
    print('--\nroot = ' + root)
    list_file_path = os.path.join(root, 'my-directory-list.txt')
    print('list_file_path = ' + list_file_path)

    with open(list_file_path, 'wb') as list_file:
        for subdir in subdirs:
            print('\t- subdirectory ' + subdir)

        for filename in files:
            file_path = os.path.join(root, filename)

            print('\t- file %s (full path: %s)' % (filename, file_path))

            with open(file_path, 'rb') as f:
                f_content = f.read()
                list_file.write(('The file %s contains:\n' % filename).encode('utf-8'))
                list_file.write(f_content)
                list_file.write(b'\n')

If you didn't know, the with statement for files is a shorthand:

with open('filename', 'rb') as f:
    dosomething()

# is effectively the same as

f = open('filename', 'rb')
try:
    dosomething()
finally:
    f.close()

6 Comments

Superb, lots of prints to understand what's going on and it works perfectly. Thanks! +1
Heads up to anyone as dumb/oblivious as me... this code sample writes a txt file to each directory. Glad I tested it in a version controlled folder, though everything I need to write a cleanup script is here too :)
that second (longest) code snippet worked very well, saved me a lot of boring work
Since speed is obviously the most important aspect, os.walk is not bad, though I came up with an even faster way via os.scandir. All glob solutions are a lot slower than walk & scandir. My function, as well as a complete speed analysis, can be found here: stackoverflow.com/a/59803793/2441026
This is a great SO answer, not only drilling into the issue but stuff like this "BTW "file" is a builtin, so you don't normally use it as variable name." is golden for someone new to a language
290

If you are using Python 3.5 or above, you can get this done in one line.

import glob

# root_dir needs a trailing slash (i.e. /root/dir/)
for filename in glob.iglob(root_dir + '**/*.txt', recursive=True):
     print(filename)

As mentioned in the documentation

If recursive is true, the pattern '**' will match any files and zero or more directories and subdirectories.

If you want every file, you can use

import glob

for filename in glob.iglob(root_dir + '**/**', recursive=True):
     print(filename)
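As a comment below notes, the trailing slash is easy to forget; one way to sidestep it is to build the pattern with os.path.join (a sketch with an assumed helper name):

```python
import glob
import os

def find_txt_files(root_dir):
    """Recursively match *.txt under root_dir, trailing slash or not."""
    # '**' with recursive=True matches zero or more intermediate directories,
    # so top-level .txt files are included too.
    pattern = os.path.join(root_dir, '**', '*.txt')
    return list(glob.iglob(pattern, recursive=True))
```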

7 Comments

As mentioned in the beginning, it is only for Python 3.5+
root_dir must have a trailing slash (otherwise you get something like 'folder**/*' instead of 'folder/**/*' as the first argument). You can use os.path.join(root_dir, '*/'), but I don't know if it's acceptable to use os.path.join with wildcard paths (it works for my application though).
@ChillarAnand Can you please add a comment to the code in this answer that root_dir needs a trailing slash? This will save people time (or at least it would have saved me time). Thanks.
If I ran this as in the answer it didn't work recursively. To make this work recursively I had to change it to: glob.iglob(root_dir + '**/**', recursive=True). I'm working in Python 3.8.2
Be aware that glob.glob does not match dotfiles. You may use pathlib.glob instead
45

I agree with Dave Webb: os.walk will yield an item for each directory in the tree. The fact is, you don't have to care about subFolders at all.

Code like this should work:

import os
import sys

rootdir = sys.argv[1]

for folder, subs, files in os.walk(rootdir):
    with open(os.path.join(folder, 'python-outfile.txt'), 'w') as dest:
        for filename in files:
            with open(os.path.join(folder, filename), 'r') as src:
                dest.write(src.read())

1 Comment

Nice one. This works as well. I do however prefer AndiDog's version even though its longer because it's clearer to understand as a beginner to Python. +1
41

TL;DR: These are equivalents to find -type f, to go over all files in all folders below and including the current one:

folder = '.'

import os
for currentpath, folders, files in os.walk(folder):
    for file in files:
        print(os.path.join(currentpath, file))
## or:
import glob
for pathstr in glob.iglob(glob.escape(folder) + '/**/*', recursive=True):
    print(pathstr)

Comparing the two methods:

  • os.walk is about 3× faster
  • os.walk used slightly more memory in my test because the files array held 82k entries whereas glob returns an iterator and streams the results. The piecemeal handling of each result (more calls and less buffering going on) likely explains the speed difference
    • If you forget the i in glob.iglob(), it will return a list rather than an iterator and potentially use a lot more memory
  • os.walk will not silently give incomplete results or unexpectedly interpret a name as a matching pattern
  • glob doesn't show empty directories
  • glob needs to escape directory and file names using glob.escape(name) because they can contain special characters
  • glob excludes directories and files starting with a dot (e.g., ~/.bashrc or ~/.vim) and include_hidden does not solve that (it includes hidden folders only; you need to specify a second pattern for dotfiles)
  • glob doesn't tell you what is a file and what is a directory
  • glob walks into symlinks and may lead to you enumerating a lot of files in completely different places (which may be what you want; in that case, os.walk has followlinks=True as an option)
  • os.walk lets you modify which paths to walk down by modifying the folders array while it's running, though personally this feels a bit messy and I'm not sure I would recommend that
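The last point - pruning the walk by editing the directory list in place - can be sketched like this (the skipped names and the helper name are assumed examples):

```python
import os

def walk_skipping(root, skip=('.git', '__pycache__')):
    """Yield file paths under root, pruning unwanted directories in place."""
    for currentpath, folders, files in os.walk(root):
        # Assigning to folders[:] mutates the list os.walk holds,
        # so pruned directories are never descended into.
        folders[:] = [f for f in folders if f not in skip]
        for filename in files:
            yield os.path.join(currentpath, filename)
```

Note that rebinding the name (folders = [...]) would not work; os.walk only sees in-place mutation.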

Other answers already mentioned os.walk(), but it could be explained better. It's quite simple! Let's walk through this tree:

docs/
└── doc1.odt
pics/
todo.txt

With this code:

for currentpath, folders, files in os.walk('.'):
    print(currentpath)

The currentpath is the current folder it is looking at. This will output:

.
./docs
./pics

So it loops three times, because there are three folders: the current one, docs, and pics. In every loop, it fills the variables folders and files with all folders and files. Let's show them:

for currentpath, folders, files in os.walk('.'):
    print(currentpath, folders, files)

This shows us:

# currentpath  folders           files
.              ['pics', 'docs']  ['todo.txt']
./pics         []                []
./docs         []                ['doc1.odt']

So in the first line, we see that we are in folder ., that it contains two folders namely pics and docs, and that there is one file, namely todo.txt. You don't have to do anything to recurse into those folders, because as you see, it recurses automatically and just gives you the files in any subfolders. And any subfolders of that (though we don't have those in the example).

If you just want to loop through all files, the equivalent of find -type f, you can do this:

for currentpath, folders, files in os.walk('.'):
    for file in files:
        print(os.path.join(currentpath, file))

This outputs:

./todo.txt
./docs/doc1.odt

Comments

22

The pathlib library is really great for working with files. You can do a recursive glob on a Path object like so.

from pathlib import Path

for elem in Path('/path/to/my/files').rglob('*.*'):
    print(elem)

Comments

11

I've found the following to be the easiest:

from glob import glob
import os

files = [f for f in glob('rootdir/**', recursive=True) if os.path.isfile(f)]

Using glob('some/path/**', recursive=True) gets all files, but also includes directory names. Adding the if os.path.isfile(f) condition filters the list down to files only.

Comments

10
import glob
import os

root_dir = <root_dir_here>

for filename in glob.iglob(root_dir + '**/**', recursive=True):
    if os.path.isfile(filename):
        with open(filename,'r') as file:
            print(file.read())

The **/** pattern matches all files and directories recursively.

if os.path.isfile(filename) checks whether filename refers to a file or a directory; if it is a file, we can read it. Here I am printing the file's contents.

Comments

8

If you want a flat list of all paths under a given dir (like find . in the shell):

   files = [ 
       os.path.join(parent, name)
       for (parent, subdirs, files) in os.walk(YOUR_DIRECTORY)
       for name in files + subdirs
   ]

To only include full paths to files under the base dir, leave out + subdirs.

Comments

5

For my taste, os.walk() is a little too complicated and verbose. You can do what the accepted answer does more cleanly with:

import pathlib

all_files = [str(f) for f in pathlib.Path(dir_path).glob("**/*") if f.is_file()]

with open(outfile, 'wb') as fout:
    for f in all_files:
        with open(f, 'rb') as fin:
            fout.write(fin.read())
            fout.write(b'\n')

Comments

4

Use os.path.join() to construct your paths - it's neater:

import os
import sys
rootdir = sys.argv[1]
for root, subFolders, files in os.walk(rootdir):
    for folder in subFolders:
        outfileName = os.path.join(root, folder, "py-outfile.txt")
        folderOut = open(outfileName, 'w')
        print("outfileName is " + outfileName)
        for file in files:
            filePath = os.path.join(root, file)
            toWrite = open(filePath).read()
            print("Writing '" + toWrite + "' to " + filePath)
            folderOut.write(toWrite)
        folderOut.close()

1 Comment

It looks like this code works for folders 2 levels (or deeper) only. Still it does get me closer.
4

If just the file names are not enough, it's easy to implement a depth-first search on top of os.scandir():

import os

stack = ['.']
files = []
total_size = 0
while stack:
    dirname = stack.pop()
    with os.scandir(dirname) as it:
        for e in it:
            if e.is_dir(): 
                stack.append(e.path)
            else:
                size = e.stat().st_size
                files.append((e.path, size))
                total_size += size

The docs have this to say:

The scandir() function returns directory entries along with file attribute information, giving better performance for many common use cases.

1 Comment

Unfortunately when I run this, I get locked in an infinite loop if there are any subdirectories. It seems like there is a need to keep track of which directories have already been visited, or the loop will get stuck in re-doing them until all the computer memory is used up.
1

os.walk does a recursive walk by default. For each directory, starting from root, it yields a 3-tuple (dirpath, dirnames, filenames).

from os import walk
from os.path import splitext, join

def select_files(root, files):
    """
    simple logic here to filter out interesting files
    .py files in this example
    """

    selected_files = []

    for file in files:
        #do concatenation here to get full path 
        full_path = join(root, file)
        ext = splitext(file)[1]

        if ext == ".py":
            selected_files.append(full_path)

    return selected_files

def build_recursive_dir_tree(path):
    """
    path    -    where to begin folder scan
    """
    selected_files = []

    for root, dirs, files in walk(path):
        selected_files += select_files(root, files)

    return selected_files

1 Comment

In Python 2.6 walk() does return a recursive list. I tried your code and got a list with many repeats... If you just remove the lines under the comment "# recursive calls on subfolders" - it works fine
1

If you prefer an (almost) one-liner:

from pathlib import Path

lookuppath = '.' #use your path
filelist = [str(item) for item in Path(lookuppath).glob("**/*") if Path(item).is_file()]

In this case you will get a list with just the paths of all files located recursively under lookuppath. Without str(), each entry would be a PosixPath object instead of a plain string.

Comments

1

Starting from Python 3.12, you can also use walk() from pathlib which is similar to os.walk(), but yields tuples of (dirpath, dirnames, filenames) where dirpath is a Path. For example:

from pathlib import Path

for root, dirs, files in Path("cpython/Lib/concurrent").walk(on_error=print):
    print(
        root,
        "consumes",
        sum((root / file).stat().st_size for file in files),
        "bytes in",
        len(files),
        "non-directory files"
    )
    if '__pycache__' in dirs:
        dirs.remove('__pycache__')

Comments

0

I think the problem is that you're not processing the output of os.walk correctly.

Firstly, change:

filePath = rootdir + '/' + file

to:

filePath = root + '/' + file

rootdir is your fixed starting directory; root is a directory returned by os.walk.

Secondly, you don't need to indent your file processing loop, as it makes no sense to run this for each subdirectory. You'll get root set to each subdirectory. You don't need to process the subdirectories by hand unless you want to do something with the directories themselves.
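Putting both fixes together, a corrected version of the question's script might look like this (a sketch keeping the per-directory output file; the function name is made up for illustration):

```python
import os

def concat_per_directory(rootdir):
    """Write a py-outfile.txt in each directory, holding its files' contents."""
    for root, subFolders, files in os.walk(rootdir):
        out_path = os.path.join(root, 'py-outfile.txt')
        with open(out_path, 'w') as folder_out:
            for filename in files:
                # root, not rootdir: the directory currently being walked
                with open(os.path.join(root, filename)) as f:
                    folder_out.write(f.read())
```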

2 Comments

I have data in each sub directory, so I need to have a separate text file for the contents of each directory.
@Brock: the files part is the list of files in the current directory. So the indentation is indeed wrong. You are writing to filePath = rootdir + '/' + file, that doesn't sound right: file is from the list of current files, so you are writing to a lot of existing files?
0

Try this:

import os
import sys

path = sys.argv[1]

for root, subdirs, files in os.walk(path):
    for file in os.listdir(root):
        filePath = os.path.join(root, file)
        if os.path.isdir(filePath):
            pass
        else:
            f = open(filePath, 'r')
            # Do Stuff

1 Comment

Why would you do another listdir() and then isdir() when you already have the directory listing split into files and directories from walk()? This looks like it would be rather slow in large trees (do three syscalls instead of one: 1=walk, 2=listdir, 3=isdir, instead of just walk and loop through the 'subdirs' and 'files').
0

This worked for me:

import glob

root_dir = "C:\\Users\\Scott\\" # Don't forget trailing (last) slashes    
for filename in glob.iglob(root_dir + '**/*.jpg', recursive=True):
     print(filename)
     # do stuff

Comments
