On Windows filesystem tuning and funky cache mechanisms

Edited: I just realized I must have used different postmark settings for vista and XP. Do NOT use the following numbers to compare Vista to XP performance.

I won’t go into a diatribe on how to tune Windows – there are excellent guides on Microsoft’s and IBM’s sites, among others.

But I wanted to share some goodness based on some recent findings of mine.

First, the part that most people probably know (this works on XP and 2003):

From a command window, run:

fsutil behavior set disablelastaccess 1

This disables last-access-time recording, which in my opinion is useless unless you genuinely care when a file was last accessed, or there isn’t much going on with your disk anyway (or you’re on some fancy EMC box with tons of cache). If you have busy disks, this typically helps a bit.
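To sanity-check the change, fsutil can also query the current value; a minimal sketch (run from an administrator command window; a reboot makes it fully effective):

```shell
rem Query the current setting (1 means last-access updates are disabled)
fsutil behavior query disablelastaccess

rem Disable last-access-time updates
fsutil behavior set disablelastaccess 1
```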

On 2003, you can also increase the size of the lookaside buffer if you have many concurrent file operations:

fsutil behavior set memoryusage 2

This also works on Vista but, sadly, not on XP. See more here: http://technet2.microsoft.com/WindowsServer/en/library/9fcf44c8-68f4-4204-b403-0282273bc7b31033.mspx?mfr=true
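As with the last-access tweak, fsutil can report the current value before you change it; a small sketch (2003 and Vista only, reboot required):

```shell
rem Query the current setting (1 = default, 2 = larger paged-pool limits)
fsutil behavior query memoryusage

rem Increase the memory available for caching file system metadata
fsutil behavior set memoryusage 2
```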

Now, for the interesting part. I use a laptop that’s pretty decent (100GB 7200RPM drive, 2GB RAM). I hammer my disk, since I use the laptop for VMware and other duties (music software with thousands of files, for instance).

I like Postmark and IOzone for measuring performance. Here’s how I configure Postmark:

set number 10000

set transactions 20000

set subdirectories 5

set size 500 100000

set read 4096

set write 4096

run

This will create 10,000 files, then perform 20,000 transactions on them. The files will range from 500 bytes to 100KB in size. This is brutal on CPU, cache and disk. If you want different-sized files, just specify the min and max sizes; be careful with the file count, though (if you leave it at 10,000 and tell it to make 100GB files, you’d better make sure you have the space).
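For reference, Postmark can read those commands from a file given on the command line instead of typing them at the interactive prompt; the filename below is just a placeholder:

```shell
rem Run Postmark non-interactively; pmconfig.txt holds the "set ... / run"
rem commands shown above (the filename is a placeholder for your setup)
postmark pmconfig.txt
```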

Anyway, here are some results:

Vista untweaked (10000 files and transactions, 512 byte I/O):

Time:
181 seconds total
51 seconds of transactions (196 per second)

Files:
15047 created (83 per second)
Creation alone: 10000 files (121 per second)
Mixed with transactions: 5047 files (98 per second)
4945 read (96 per second)
5055 appended (99 per second)
15047 deleted (83 per second)
Deletion alone: 10094 files (210 per second)
Mixed with transactions: 4953 files (97 per second)

Data:
257.93 megabytes read (1.43 megabytes per second)
826.79 megabytes written (4.57 megabytes per second)

Vista tweaked with fsutil as described above:

Time:
159 seconds total
51 seconds of transactions (196 per second)

Files:
15047 created (94 per second)
Creation alone: 10000 files (158 per second)
Mixed with transactions: 5047 files (98 per second)
4945 read (96 per second)
5055 appended (99 per second)
15047 deleted (94 per second)
Deletion alone: 10094 files (224 per second)
Mixed with transactions: 4953 files (97 per second)

Data:
257.93 megabytes read (1.62 megabytes per second)
826.79 megabytes written (5.20 megabytes per second)

So it’s a bit better.
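As a quick sanity check on Postmark’s arithmetic (and on what the tweak actually bought), the per-second throughput figures above can be reproduced from the totals; a small Python sketch using the numbers from the two Vista runs:

```python
# Reproduce Postmark's MB/s figures: data moved divided by total elapsed time.
mb_read, mb_written = 257.93, 826.79
untweaked_s, tweaked_s = 181, 159

for label, secs in [("untweaked", untweaked_s), ("tweaked", tweaked_s)]:
    print(f"{label}: {mb_read / secs:.2f} MB/s read, "
          f"{mb_written / secs:.2f} MB/s written")

# The fsutil tweaks shaved about 12% off the total run time.
improvement = (untweaked_s - tweaked_s) / untweaked_s
print(f"improvement: {improvement:.0%}")
```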

Another thing you can do is set the processor quanta to fixed 120ms chunks (simply done by right-clicking “My Computer”, then Properties, Advanced, Performance, Settings, Advanced, and setting processor scheduling to favor background services). Yes, I’ve had by far the best luck with XP by tuning it like a server. Your mileage may vary, but this also improves Postmark results a bit.
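If you’d rather script that radio button, it maps (to my understanding — verify on your own box and back up the registry first) to the Win32PrioritySeparation value, where 0x18 selects long, fixed quanta:

```shell
rem "Background services" processor scheduling, scripted.
rem 0x18 (24 decimal) = long, fixed quanta; check the GUI agrees afterwards.
reg add "HKLM\SYSTEM\CurrentControlSet\Control\PriorityControl" ^
    /v Win32PrioritySeparation /t REG_DWORD /d 24 /f
```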

You can also play with increasing the cache: in that advanced pane again, select “System cache”, and with regedit go to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters and set Size to 3. This is all for XP; 2003 comes that way out of the box. Unless you want to run SQL, IIS or Exchange, that is, in which case there’s a setting, “maximize throughput for network applications”, which limits the cache to 512MB and lets the apps do their own caching.
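For a scripted equivalent: the “System cache” radio button corresponds (again, to my understanding — back up the registry first) to the LargeSystemCache value under Memory Management, alongside the lanmanserver Size value just mentioned:

```shell
rem "System cache" mode: favor the file system cache over process working sets
reg add "HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" ^
    /v LargeSystemCache /t REG_DWORD /d 1 /f

rem Server service memory profile; 3 = maximize throughput for file sharing
reg add "HKLM\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters" ^
    /v Size /t REG_DWORD /d 3 /f
```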
Or, you can actually spend some money and ridiculously increase performance by getting a caching product like SuperSpeed’s SuperCache or DataCore’s UpTempo (I tried O&O CleverCache as well and was thoroughly underwhelmed).
Here are results with 20,000 transactions and 4K I/O, XP tuned just like a server:

Time:
386 seconds total
308 seconds of transactions (64 per second)

Files:
20092 created (52 per second)
Creation alone: 10000 files (142 per second)
Mixed with transactions: 10092 files (32 per second)
9935 read (32 per second)
10064 appended (32 per second)
20092 deleted (52 per second)
Deletion alone: 10184 files (1273 per second)
Mixed with transactions: 9908 files (32 per second)

Data:
548.25 megabytes read (1.42 megabytes per second)
1158.00 megabytes written (3.00 megabytes per second)

And here are results with the exact same settings but with 256MB of Supercache on that volume, lazy writes on:

Time:
196 seconds total
163 seconds of transactions (122 per second)

Files:
20092 created (102 per second)
Creation alone: 10000 files (344 per second)
Mixed with transactions: 10092 files (61 per second)
9935 read (60 per second)
10064 appended (61 per second)
20092 deleted (102 per second)
Deletion alone: 10184 files (2546 per second)
Mixed with transactions: 9908 files (60 per second)

Data:
548.25 megabytes read (2.80 megabytes per second)
1158.00 megabytes written (5.91 megabytes per second)
I am a believer. The size of the dataset far exceeded the capacity of SuperCache, but it helped tremendously regardless.

Since I don’t trust any single benchmark, I also ran IOzone.
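I don’t have the exact command line handy, but a typical IOzone invocation that produces this kind of file-size/record-size matrix looks like this (the flags shown are a representative assumption, not necessarily the exact ones used here):

```shell
rem Auto mode over file sizes up to 512MB, with an Excel-style crossover report
iozone -Ra -g 512m -b iozone-results.xls
```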

         4096    8192    16384
64
128
256
512
1024
2048
4096     70011
8192     29264   50257
16384    26229   33289   37198
32768    27578   28827   34778
65536    26982   27890   28997
131072   20901   21680   22223
262144   21769   20789   22249
524288   23076   25270   26258
The top row shows record size, the left column file size. The above is without the cache. Now with cache:

         4096    8192    16384
64
128
256
512
1024
2048
4096     279746
8192     264110  262117
16384    250322  249355  238230
32768    233373  238932  233980
65536    204786  232418  234544
131072   234552  230336  225731
262144   164434  227792  222540
524288   35515   31533   41262

These results are for writes in both cases. IOzone’s output is too large to include here, but I’ll gladly send the entire file to anyone who wants it. I would ignore record sizes under 4K, since Windows will coalesce writes to 4K and up anyway (up to 64K).
It seems that these products are worth a serious look. In most cases, significant benefits are realized by caching the volume that holds the swapfile, even with only 128MB of cache. In one case I went from 124 seconds for a Postmark run to 70 seconds by caching the swap volume, even though I had ample memory and Windows shouldn’t have been touching swap.

Unix is generally a bit more robust when it comes to caching and virtual memory, so you don’t need extra products; Windows, it seems, needs a bit of help. Indeed, I found out that Microsoft uses SuperCache on the servers that host MSN…
Anyway, you can see that up to 256MB, SuperCache kicks Windows’ cache’s ass. And remember, this is a box tuned just like a server; it was using about 1GB of cache even without SuperCache. After you exceed the size of the cache with the large 512MB test file, you still realize some benefits, as you can see.

DataCore’s UpTempo produced similar results; it is far less tunable, uses a unified cache (instead of a chunk per partition), is easier to configure, and can be more or less expensive: SuperCache for 4 CPUs is about $1K, but half that for 2 CPUs, while UpTempo is about $700 regardless. Another difference is that UpTempo is 32-bit only at the moment.

D

7 Replies to “On windows filesystem tuning and funky cache mechanisms”

  1. Do you know where you can still download Postmark (a win32 binary, if it comes as one)? It doesn’t appear to be at the NetApp links that show up on Google.

    Cheers,
    J

  2. I would like to get a copy of the win32 binary as well – would you mind sending it to my email or posting a link?
