Direct IO in Python

Doing file I/O of any kind in Python is really easy. You can start with plain open() and friends, working with Python’s file objects. By the way, Python’s open() resembles C’s fopen() so closely that I can’t stop thinking that open() may be based on fopen().

When that’s not enough, you can always upgrade to open() and close() from the os module. Opening the man page for open() (the system call, open(2)) reveals all those O_something options that you can pass to os.open(). But not all of them work out of the box in Python. For example, if you open a file with O_DIRECT and then try to write to it, you will end up with a strange error message.

>>> import os
>>> f = os.open('file', os.O_CREAT | os.O_TRUNC | os.O_DIRECT | os.O_RDWR)
>>> s = ' ' * 1024
>>> os.write(f, s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OSError: [Errno 22] Invalid argument
>>>

Invalid argument? What invalid argument? Hey, there’s nothing wrong with those arguments…

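Reading the open(2) man page further reveals that O_DIRECT requires all buffers used for I/O to be aligned to a 512-byte boundary (more precisely, to the logical block size of the underlying device, which is typically 512 bytes; the transfer length and the file offset generally have to be multiples of it as well). But how can you have a memory buffer aligned to 512 bytes in Python?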

Apparently, there’s a way. Python comes with a module called mmap. mmap() is a system call that lets you map a portion of a file into memory. Everything you write to a memory-mapped file ends up in the file, even though it looks like you’re working with a plain memory buffer. Same with reads.

There’s one interesting thing about mmap. It works with a granularity of one memory page, which is typically 4 KB. So every memory-mapped buffer is naturally aligned to 4 KB, and thus to a 512-byte boundary too. But hey, shouldn’t mmap map files?
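If you want to see that alignment for yourself, here is a minimal sketch. It assumes CPython, and buffer_address is a helper name made up for this example: it uses ctypes to peek at the address behind the buffer. The mapping is an anonymous one (file descriptor -1), so no file is involved at all.

```python
import ctypes
import mmap


def buffer_address(buf):
    """Return the address of a writable buffer's first byte (CPython only)."""
    # from_buffer() shares memory with buf instead of copying it,
    # so addressof() reports where the mapped region really lives.
    view = ctypes.c_char.from_buffer(buf)
    return ctypes.addressof(view)


m = mmap.mmap(-1, 1024 * 1024)   # anonymous mapping, not backed by a file
addr = buffer_address(m)

print(hex(addr))
print(addr % mmap.PAGESIZE)      # remainder against the system page size
print(addr % 512)                # remainder against a 512-byte boundary
m.close()
```

Both remainders should come out as zero, since the kernel hands out mappings in whole pages.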

Well, apparently mmap can be used for memory allocation too. Specifying -1 as the file descriptor does just that: it allocates RAM, as much as you tell it to. So, this is what we do:

>>> import os
>>> import mmap
>>>
>>> f = os.open('file', os.O_CREAT | os.O_DIRECT | os.O_TRUNC | os.O_RDWR)
>>> m = mmap.mmap(-1, 1024 * 1024)
>>> s = ' ' * 1024 * 1024
>>>
>>> m.write(s)
>>> os.write(f, m)
1048576
>>> os.close(f)
>>>

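Note that an mmap memory buffer object behaves like a file, i.e. you can write into the buffer and read from it, like I do in line 8 (m.write(s)). The snippets here were written for Python 2; under Python 3 the data has to be a bytes object, e.g. b' ' * 1024 * 1024. More on it in the official documentation.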
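For Python 3, where os.write() accepts any bytes-like object, the same trick might look like the sketch below. aligned_buffer, write_direct and read_direct are hypothetical helper names, not part of any library. The sketch assumes Linux (os.O_DIRECT is not defined everywhere) and a filesystem that accepts O_DIRECT; some reject it right at open() time. The read side uses os.readv(), which fills a buffer you supply instead of allocating one of its own.

```python
import mmap
import os


def aligned_buffer(nbytes):
    """Return an anonymous mmap big enough for nbytes, sized in whole pages."""
    pages = max(1, -(-nbytes // mmap.PAGESIZE))  # ceiling division
    return mmap.mmap(-1, pages * mmap.PAGESIZE)


def write_direct(path, payload):
    """Write payload to path with O_DIRECT, zero-padded to whole pages."""
    buf = aligned_buffer(len(payload))
    try:
        buf.write(payload)  # bytes past the payload stay zero
        flags = os.O_CREAT | os.O_TRUNC | os.O_WRONLY | os.O_DIRECT
        fd = os.open(path, flags, 0o644)
        try:
            # os.write() uses the mmap's own memory, which is page-aligned
            return os.write(fd, buf)
        finally:
            os.close(fd)
    finally:
        buf.close()


def read_direct(path, nbytes):
    """Read up to nbytes from the start of path with O_DIRECT."""
    buf = aligned_buffer(nbytes)
    try:
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
        try:
            got = os.readv(fd, [buf])  # reads straight into our aligned buffer
        finally:
            os.close(fd)
        return buf[:min(got, nbytes)]
    finally:
        buf.close()
```

Keep in mind that the file ends up padded to a whole number of pages, because O_DIRECT transfers need block-sized lengths. If that matters, one option is to trim the file afterwards with os.ftruncate() or os.truncate().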

Have fun! :-)

32 Comments

  1. Ivan Novick says:

    The real question is why would you want to do this? I would like to hear a use case for using direct IO in python.

    Databases are the best use case for direct IO and even some of them don’t use it (see postgresql).

    You may also find this posting by Linus regarding direct io useful.

    http://lkml.org/lkml/2007/1/10/233

    Cheers,
    Ivan

  2. @Ivan Novick
    In Linux, when write() finishes, this does not necessarily mean that the data is on media. If you want an example of this behavior, try dding 1MB to floppy disk and see what happens.
    This is natural behavior considering page cache. On the other hand this is a nightmare for everyone concerned with high availability of data. Why? Because you can’t build highly available system knowing that although you think your data is on disk, it may not be there if something bad happens.
    So, O_DIRECT is naturally used in databases, clustered file systems, and every other highly available system.

    As for what Linus said about it… Well, Linus and other kernel developers define what Linux will look like in the future. What Linus says today will become reality in 5-10 years. But it is not a reality yet. In reality, the thread you linked to is three years old and yet POSIX_FADV_NOREUSE is still a no-op in Linux.

  3. Ivan Novick says:

    Making sure your data is on disk is a separate issue from bypassing the filesystem cache.

    To make sure your data is on disk, the system call is fsync. If you don’t call fsync, the OS will not harden the data to disk immediately, for performance reasons. When you need to know the data is on disk, you make the fsync call.

    O_DIRECT bypasses the filesystem cache and writes “directly” to disk. This will be a lot slower than writing to filesystem cache so you don’t want to use O_DIRECT unless your application is caching data itself.

    Linus’s argument is that someone has to cache the data, so why not let the OS do it. This is exactly what postgresql does: they rely on the OS filesystem cache and don’t use O_DIRECT.

  4. @Ivan Novick
    Well, first of all, fsync() isn’t a solution for the problem because it is not atomic, it is expensive, and it is unnecessary. The operating system should give you a method to disable caching for a portion of the disk, instead of you trying to simulate it with fsync().

    It’s simple. Imagine you have a cluster of two machines that processes financial transactions. Imagine that cluster being asked to place 10000$ on your account. Server 1 handles the request and crashes right after it. A moment later, the cluster receives a request to transfer 5000$ from your account to some other bank account. The question is how much money remains on your account.

    This particular application would do its own caching, as you suggested, but so would many other applications.

    Don’t get me wrong. I am all for caching in the OS, with all of my limbs. However, there must be a way to control it, because there are many applications that need something more delicate than OS caching.

  5. […] days ago I’ve written a post explaining how to do a direct I/O in Python. But then I thought that it might be a good idea to explain what direct I/O is. So, here we […]

  6. zls says:

    But there’s nothing we can do for os.read where file is opened with flag O_DIRECT. Because system call “read” also needs memory alignment for the read buffer. But the memory is allocated in python without any memory alignment.

  7. @zls
    This is exactly the problem this post addresses. Please read it more carefully.

  8. eswierk says:

    @Alexander Sandler – OK, I give up. How exactly does allocating an aligned buffer help you if you’re calling os.read, which does its own memory allocation?

  9. Kepler Kramer says:

    I too would like to use Python for O_DIRECT reads and could really use an example similar to your excellent write snippet.

  10. @eswierk
    I am not sure os.read does memory allocations. In any case, direct I/O requires aligned memory buffer. Otherwise, any buffer will do.

  11. jpetazzo says:

    4 years ago, we wondered how to perform direct I/O with Python. One of our interns did a small proof of concept (available at http://pypi.python.org/pypi/directio). I received some feedback about it (it had some horrible bugs). I have (hopefully) fixed most of the bugs tonight (actually, I have applied patches that other people have submitted) and I’m in the process of uploading the new version to PyPI. It should allow you to play with direct I/O with Python!

    (Behind the scenes, it works exactly like the regular io module, but does a little bit of pointer arithmetics to cope with memory alignment when you do a read or write.)

  12. Lyubomir says:

    http://bugs.python.org/issue5396

    O_DIRECT does not work with os.read, and apparently it won’t :( .

  13. jpetazzo says:

    @Lyubomir
    If you require O_DIRECT with Python, have a look at the above-mentioned directio Python module at http://pypi.python.org/pypi/directio. I just uploaded the 1.1 version (which includes patches provided by nice and competent people, hopefully fixing some awful flaws of the first version).
    I hope this helps.

  14. @Lyubomir
    I didn’t try os.read(), but I don’t see any particular reason for it not to work with memory allocated using mmap.

  15. Lyubomir says:

    @Alexander Sandler
    read(…)
    read(fd, buffersize) -> string
    Read a file descriptor.

    Unlike POSIX read, os.read does not take a parameter where read data is stored, but instead returns a string allocated without special alignment considerations. So, unless I’m missing something obvious, there’s no way to use memory allocated in advance with os.read.
    Great site, by the way. I still have open and unread tabs :).

    @jpetazzo
    Thanks, the module does the job on x86 machines :).
    However, I had strange problems getting it to work on a x64 CentOS 5. The Python script tries to measure disk performance by making a lot of consecutive read/write calls. For some reason, calloc returns a null pointer after several invocations – that is, it fails to allocate 1MB on an idle machine with 4 GB of RAM. I could not trace the cause, so for the moment I’m using separate functions to create and delete a global aligned buffer.
    By the way, I’ll try to port it to OS X, would that be of interest to you?

  16. @Lyubomir
    I missed that point. You’re right of course. And.. Thanks :-)

  17. jpetazzo says:

    @Lyubomir

    About the CentOS 5 issue: maybe you can run your test program within ltrace to actually see the calls issued. Using malloc debugging macros might also give some information, but I’m afraid that the information would be in the middle of other Python malloc/free, which would make debugging quite hard.

    About the OS X port: I don’t use OS X, but I would be happy to know if it works (and/or what fixes were needed)!

  18. Evan Jones says:

    The comments here are ignoring the fact that O_DIRECT writes *do* write directly to the device, but *don’t* issue “cache flush” commands to ensure that the data is in fact on disk. In other words: even when using O_DIRECT writes you *still* need fsync, or you need to use the O_DSYNC or O_SYNC flags as well, when calling open().

    I’ve done a bunch of power failure testing with this stuff. I really should write a blog post about it …

  19. […] only works for raw system file descriptors – not python file objects. There is a interesting post here on the subject. Changing the filesystem, as suggested by orgcandman, could help, a good […]

  21. @Evan Jones
    Please do so. I’ll post a link if you do.

  22. Evan Jones says:

    @Alexander Sandler – I actually did, but forgot to post something here. See Making Writes Durable. This is unfortunately written from the perspective of the underlying C system calls, and doesn’t discuss Python at all. However, it is relevant provided that you use the correct Python function calls.

    http://www.evanjones.ca/durable-writes.html

  23. @Evan Jones
    Good. Very interesting reading. Thanks a lot.

  24. Sebastian says:

    Thanks for the hint to alignment as a reason for “Invalid argument”!

  25. Kenneth says:

    How do you go about direct io reading from the file? I get the same silly error when I do a “data = os.read(4*1024)”.

  26. Matt C says:

    I found out how to do direct I/O reading from a file: convert it into a file object and then use the readinto() method of python file objects to read into the mmap’d buffer. To add onto the original code above:

    fo = os.fdopen(f, 'rw')
    fo.readinto(m)

  27. Brad Goodman says:

    That’s what I needed – thanks!

  28. Logan Gunthorpe says:

    This works for reads into an mmap buffer:

    import mmap
    import os

    fd = os.open("strace.txt", os.O_DIRECT | os.O_RDWR)
    f = os.fdopen(fd, "rb+", 0)
    m = mmap.mmap(-1, 4096)

    print(f.readinto(m))

    • Reshmi says:

      @Logan: this code does not seem to work in my system.

      Traceback (most recent call last):
      File "test.py", line 4, in <module>
      f = os.fdopen(fd, "rb+", 0)
      OSError: [Errno 22] Invalid argument
