rogerbinns
Occasional Visitor
It is important to distinguish between what happens in user space and what the kernel and network filesystem redirectors can do. This applies to Windows as well as to other operating systems such as Linux.
A read call specifies where the resulting data will be placed, how much to read, and an implicit or explicit file offset. Traditionally the read call is used synchronously - control is not returned to the program until all the data has been read.
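To make that concrete, here is a minimal sketch of a synchronous read using the Win32 ReadFile call; the file name and buffer size are just placeholders:

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* Open the file; the handle carries the implicit file offset. */
        HANDLE h = CreateFileA("test.dat", GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE)
            return 1;

        char buffer[4096];      /* where the resulting data will be placed */
        DWORD bytesRead = 0;

        /* Synchronous: ReadFile does not return until the data is
           available (or an error / end of file occurs). */
        if (ReadFile(h, buffer, sizeof(buffer), &bytesRead, NULL))
            printf("read %lu bytes\n", (unsigned long)bytesRead);

        CloseHandle(h);
        return 0;
    }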
Any amount can be requested, from one byte up to the entire file size (provided there is sufficient memory to hold the result). The kernel then has to decide how to satisfy the read request. Typically a large request will be broken into smaller ones, and small requests will be turned into larger ones to perform readahead. With local disks this is all easy. With a network filesystem there are tradeoffs to be made. If data is read ahead, it may be (undetectably) stale by the time the program reads it. If a large request is decomposed into smaller ones, they could all be sent over the wire at once, but that could saturate the network and overwhelm the server on the other end. You also have to use kernel memory to buffer the data coming back, and kernel memory is considered very expensive in Windows up to and including XP (NT was optimized to run in 4-8MB of memory, and XP defaults to a 10MB file cache!).
What this means is that what you see in Process Monitor are requests to the kernel. The kernel can then satisfy those requests using any amount of behind-the-scenes asynchrony, breaking requests into smaller ones, readahead, etc. You can only see what really happened by using a network sniffer. Windows kernel implementations tended to try to conserve kernel memory and not overwhelm servers, all of which changed in Vista when they recognised that more memory is available. Everything that is done in Vista could have been done in earlier versions of Windows and in earlier versions of SMB. They simply chose not to. (I can give you the long boring story if you want.)
If you really want to do a good benchmark, use read buffer sizes of something like 64MB (or even larger) from user space; the kernel can then break that up into pieces of its own choosing. (The CopyFile API tries to pick the sizes in user space.) Using one-byte reads will tell you how well the readahead works as well as the overhead of user-to-kernel-to-user context switches. SMB has opportunistic locks, so it is possible to do any amount of readahead and know if the data becomes stale.
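As a rough illustration of that kind of benchmark (not a rigorous tool; the UNC path and sizes are placeholders, and error handling is minimal), something like this compares one huge chunk size against byte-at-a-time reads:

    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Time reading a file in chunks of the given size. A 64MB chunk lets
       the kernel pick its own decomposition; a 1 byte chunk mostly
       measures readahead plus the user/kernel transition overhead. */
    static double time_read(const char *path, DWORD chunk)
    {
        HANDLE h = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE)
            return -1.0;

        char *buf = malloc(chunk);
        if (!buf) { CloseHandle(h); return -1.0; }

        DWORD got = 0;
        LARGE_INTEGER freq, start, end;
        QueryPerformanceFrequency(&freq);
        QueryPerformanceCounter(&start);

        while (ReadFile(h, buf, chunk, &got, NULL) && got > 0)
            ;                   /* just pull the data through */

        QueryPerformanceCounter(&end);
        free(buf);
        CloseHandle(h);
        return (double)(end.QuadPart - start.QuadPart) / (double)freq.QuadPart;
    }

    int main(void)
    {
        const char *path = "\\\\server\\share\\big.dat";   /* placeholder */
        printf("64MB chunks:   %.3f s\n", time_read(path, 64 * 1024 * 1024));
        printf("1 byte chunks: %.3f s\n", time_read(path, 1));  /* slow on big files */
        return 0;
    }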
Also look at the flags to CreateFile (the file open function), where you can give additional hints such as whether the file will be accessed sequentially (which turns on readahead in Windows), whether the cache should be bypassed, etc.
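For example, a minimal sketch using two of those hints, FILE_FLAG_SEQUENTIAL_SCAN and FILE_FLAG_NO_BUFFERING (the file name is a placeholder):

    #include <windows.h>

    int main(void)
    {
        /* Hint that the file will be read sequentially, which encourages
           the cache manager to read ahead aggressively. */
        HANDLE seq = CreateFileA("big.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                                 OPEN_EXISTING, FILE_FLAG_SEQUENTIAL_SCAN, NULL);

        /* Bypass the system cache entirely; reads then have to be aligned
           to sector boundaries (buffer address, offset and length). */
        HANDLE raw = CreateFileA("big.dat", GENERIC_READ, FILE_SHARE_READ, NULL,
                                 OPEN_EXISTING, FILE_FLAG_NO_BUFFERING, NULL);

        if (seq != INVALID_HANDLE_VALUE) CloseHandle(seq);
        if (raw != INVALID_HANDLE_VALUE) CloseHandle(raw);
        return 0;
    }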
For historical reasons Windows programs have second-guessed the kernel to improve performance, which has led to the kernel trying to second-guess the programs - an unvirtuous cycle. Raymond Chen's blog has lots of amusing stories about this sort of thing. The Linux/Unix user space APIs are considerably simpler (opening a file takes 3 parameters, not 7!) and programs don't try to optimize, so the kernel does get to do the right thing.
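For contrast with the seven-argument CreateFile calls above, the POSIX open is simply (file name again a placeholder):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Path and access flags; a third mode argument is only needed
           when creating a file with O_CREAT. */
        int fd = open("big.dat", O_RDONLY);
        if (fd >= 0)
            close(fd);
        return 0;
    }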
Write requests have a similar set of issues. If you do a 64MB write from user space, the kernel has to break that up and issue the pieces concurrently or sequentially. It can return before the data is on the platters, or wait. (Implementations usually return as soon as possible, before all the data is actually written, since most user space write APIs are used synchronously.)
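A quick sketch of that distinction using WriteFile and FlushFileBuffers (file name and size are placeholders; error handling omitted):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        HANDLE h = CreateFileA("out.dat", GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h == INVALID_HANDLE_VALUE)
            return 1;

        static char buffer[64 * 1024 * 1024];   /* 64MB of zeroes */
        DWORD written = 0;

        /* WriteFile typically returns once the data has been accepted
           into the cache, not once it is on the platters. */
        WriteFile(h, buffer, sizeof(buffer), &written, NULL);

        /* Only after this call is the data known to have been flushed
           through to the device. */
        FlushFileBuffers(h);

        CloseHandle(h);
        printf("wrote %lu bytes\n", (unsigned long)written);
        return 0;
    }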
There is a rather vibrant WAN optimization industry where the number one protocol that customers care about is SMB/CIFS. The really funny thing is that the appliances optimise the requests for maximum performance and all the techniques used could be done by Microsoft's kernel code, but generally aren't.