TrueZIP
  1. TrueZIP
  2. TRUEZIP-343

Using TPath at least twice as slow as TFile when creating files before writing to them

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Works as designed
    • Affects Version/s: TrueZIP 7.7.6
    • Fix Version/s: None
    • Component/s: TrueZIP Archetype Path
    • Labels:
      None

      Description

      Using TPath is at least twice as slow as TFile when each file in an archive is created before the contents is written. I've created a benchmark application for this based on a benchmark I found when searching through the mailing list (created by Gergely Csucs). The tests I've run using a profiler suggest that the difference is caused by calls to TempFilePool.createTempFile() and TempFilePool$Buffer.pool(). Is this the result of something that can be optimized or is it expected?

      The results on my machine for the benchmark:

      Start (TPath)
      Completed writing (TPath): 28285
      Finalized (TPath): 64513
      Start (TFile)
      Completed writing (TFile): 27540
      Finalized (TFile): 27873

      The code:

      ZipBenchMark.java
      package com.quintiq.vfs.test.manual;
      
      import java.io.OutputStream;
      import java.nio.file.Files;
      import java.util.Random;
      
      import de.schlichtherle.truezip.file.TFile;
      import de.schlichtherle.truezip.file.TFileOutputStream;
      import de.schlichtherle.truezip.file.TVFS;
      import de.schlichtherle.truezip.nio.file.TPath;
      
      public class ZipBenchMark
      {
        private static final String FILE_TPATH = "testA.zip", FILE_TFILE = "testB.zip";
        
        public static void main(String args[]) throws Exception
        {
          long start;
          Random r = new Random(0);
          byte bytes[] = new byte[65536];
          
          // Test TPath
          start = System.currentTimeMillis();
          TPath base = new TPath(FILE_TPATH);
          System.out.println("Start (TPath)");
          
          for (int i=0; i<10000; i++)
          {
            OutputStream os = Files.newOutputStream( Files.createFile( base.resolve("dump" + i + ".byt") ) );
            r.nextBytes(bytes);
            os.write(bytes);
            os.close();
          }
          
          System.out.println("Completed writing (TPath): " + (System.currentTimeMillis() - start));
          TVFS.umount();    
          System.out.println("Finalized (TPath): " + (System.currentTimeMillis() - start));
          
          // Test TFile
          start = System.currentTimeMillis();
          TFile file = new TFile(FILE_TFILE);
          System.out.println("Start (TFile)");
          
          for (int i=0; i<10000; i++)
          {
            TFile fileInArchive = new TFile(file, "dump" + i + ".byt");
            fileInArchive.createNewFile();
            
            OutputStream os = new TFileOutputStream(fileInArchive);
            r.nextBytes(bytes);
            os.write(bytes);
            os.close();
          }
          
          System.out.println("Completed writing (TFile): " + (System.currentTimeMillis() - start));
          TVFS.umount();
          System.out.println("Finalized (TFile): " + (System.currentTimeMillis() - start));
        }
      }
      

        Activity

        Hide
        Christian Schlichtherle added a comment -

        There are several things to note first:

        1) Please specify the OS version. The behavior of file systems and the JRE is quite different among Unix and Windows. I am running the following code on Mac OS X 10.9.2, Oracle JDK 1.7.0_51-b13.
        2) The JVM optimizes hot spots at run time, so for a real benchmark please wrap the test in another loop to compensate for JVM warm up time.
        3) In production, please make sure to ALWAYS close your streams, even in case of a throwable, e.g. within a finally block or with a try-with-resources statement.
        4) For comparable results, you must delete the archive files before the test runs because TrueZIP behaves differently depending on whether the archive file already exists or not.

        Now here is the simple solution: You have introduced superfluous calls to TFile.createNewFile() and Files.createFile(TPath). These calls have side effects, and in the case of Files.createFile(TPath), when looping over an archive file this results in quadratic runtime, which is what you were witnessing.

        The following code is like yours except that I have removed these superfluous calls:

        package com.company.product.mavenproject1.java;
        
        import de.schlichtherle.truezip.file.TFile;
        import de.schlichtherle.truezip.file.TFileOutputStream;
        import de.schlichtherle.truezip.file.TVFS;
        import de.schlichtherle.truezip.nio.file.TPath;
        import java.io.OutputStream;
        import java.nio.file.Files;
        import java.util.Random;
        
        public class ZipBenchMark
        {
          private static final String FILE_TPATH = "testA.zip", FILE_TFILE = "testB.zip";
        
          public static void main(String args[]) throws Exception
          {
            long start;
            Random r = new Random(0);
            byte bytes[] = new byte[65536];
        
            // Test TPath
            start = System.currentTimeMillis();
            TPath base = new TPath(FILE_TPATH);
            System.out.println("Start (TPath)");
        
            for (int i=0; i<10000; i++)
            {
              OutputStream os = Files.newOutputStream(base.resolve("dump" + i + ".byt"));
              r.nextBytes(bytes);
              os.write(bytes);
              os.close();
            }
        
            System.out.println("Completed writing (TPath): " + (System.currentTimeMillis() - start));
            TVFS.umount();
            System.out.println("Finalized (TPath): " + (System.currentTimeMillis() - start));
        
            // Test TFile
            start = System.currentTimeMillis();
            TFile file = new TFile(FILE_TFILE);
            System.out.println("Start (TFile)");
        
            for (int i=0; i<10000; i++)
            {
              TFile fileInArchive = new TFile(file, "dump" + i + ".byt");
              OutputStream os = new TFileOutputStream(fileInArchive);
              r.nextBytes(bytes);
              os.write(bytes);
              os.close();
            }
        
            System.out.println("Completed writing (TFile): " + (System.currentTimeMillis() - start));
            TVFS.umount();
            System.out.println("Finalized (TFile): " + (System.currentTimeMillis() - start));
          }
        }
        

        On my machine, I get the following sample results:

        Start (TPath)
        Completed writing (TPath): 26980
        Finalized (TPath): 27537
        Start (TFile)
        Completed writing (TFile): 24568
        Finalized (TFile): 24966
        

        To demonstrate the effect of the JVM warmup, when I swap the usage of the APIs, I get the following sample results:

        Start (TFile)
        Completed writing (TFile): 26320
        Finalized (TFile): 26848
        Start (TPath)
        Completed writing (TPath): 24751
        Finalized (TPath): 25121
        

        To sum it up, there really isn't a noticeable performance difference between these APIs, but you need to be careful about the side effects of the API calls.

        Show
        Christian Schlichtherle added a comment - There are several things to note first: 1) Please specify the OS version. The behavior of file systems and the JRE is quite different among Unix and Windows. I am running the following code on Mac OS X 10.9.2, Oracle JDK 1.7.0_51-b13. 2) The JVM optimizes hot spots at run time, so for a real benchmark please wrap the test in another loop to compensate for JVM warm up time. 3) In production, please make sure to ALWAYS close your streams, even in case of a throwable, e.g. within a finally block or with a try-with-resources statement. 4) For comparable results, you must delete the archive files before the test runs because TrueZIP behaves differently depending on whether the archive file already exists or not. Now here is the simple solution: You have introduced superfluous calls to TFile.createNewFile() and Files.createFile(TPath). These calls have side effects, and in the case of Files.createFile(TPath), when looping over an archive file this results in quadratic runtime, which is what you were witnessing. The following code is like yours except that I have removed these superfluous calls: package com.company.product.mavenproject1.java; import de.schlichtherle.truezip.file.TFile; import de.schlichtherle.truezip.file.TFileOutputStream; import de.schlichtherle.truezip.file.TVFS; import de.schlichtherle.truezip.nio.file.TPath; import java.io.OutputStream; import java.nio.file.Files; import java.util.Random; public class ZipBenchMark { private static final String FILE_TPATH = "testA.zip" , FILE_TFILE = "testB.zip" ; public static void main( String args[]) throws Exception { long start; Random r = new Random(0); byte bytes[] = new byte [65536]; // Test TPath start = System .currentTimeMillis(); TPath base = new TPath(FILE_TPATH); System .out.println( "Start (TPath)" ); for ( int i=0; i<10000; i++) { OutputStream os = Files.newOutputStream(base.resolve( "dump" + i + ".byt" )); r.nextBytes(bytes); os.write(bytes); os.close(); } System .out.println( "Completed writing (TPath): " + ( System .currentTimeMillis() - start)); TVFS.umount(); System .out.println( "Finalized (TPath): " + ( System .currentTimeMillis() - start)); // Test TFile start = System .currentTimeMillis(); TFile file = new TFile(FILE_TFILE); System .out.println( "Start (TFile)" ); for ( int i=0; i<10000; i++) { TFile fileInArchive = new TFile(file, "dump" + i + ".byt" ); OutputStream os = new TFileOutputStream(fileInArchive); r.nextBytes(bytes); os.write(bytes); os.close(); } System .out.println( "Completed writing (TFile): " + ( System .currentTimeMillis() - start)); TVFS.umount(); System .out.println( "Finalized (TFile): " + ( System .currentTimeMillis() - start)); } } On my machine, I get the following sample results: Start (TPath) Completed writing (TPath): 26980 Finalized (TPath): 27537 Start (TFile) Completed writing (TFile): 24568 Finalized (TFile): 24966 To demonstrate the effect of the JVM warmup, when I swap the usage of the APIs, I get the following sample results: Start (TFile) Completed writing (TFile): 26320 Finalized (TFile): 26848 Start (TPath) Completed writing (TPath): 24751 Finalized (TPath): 25121 To sum it up, there really isn't a noticeable performance difference between these APIs, but you need to be careful about the side effects of the API calls.
        Hide
        qui.tiqe added a comment -

        Thank you for taking the time to reply to my original "issue".

        "1) Please specify the OS version. The behavior of file systems and the JRE is quite different among Unix and Windows. I am running the following code on Mac OS X 10.9.2, Oracle JDK 1.7.0_51-b13."

        I'm running JDK 7u51 on Windows 7.

        "2) The JVM optimizes hot spots at run time, so for a real benchmark please wrap the test in another loop to compensate for JVM warm up time."

        True for actual benchmarking, though in this case what happens after "TVFS.umount();" is the interesting part.

        "3) In production, please make sure to ALWAYS close your streams, even in case of a throwable, e.g. within a finally block or with a try-with-resources statement."
        "4) For comparable results, you must delete the archive files before the test runs because TrueZIP behaves differently depending on whether the archive file already exists or not."

        The code was intended ONLY as an example; it reproduces the "slowness" which I'm experiencing in the actual code but is short enough to serve as an example (as error handling, etc. is left out). For me it was obvious that the code I posted was not intended as anything else than a quick and dirty example, but I guess I should have mentioned that (and especially the part where the created files should be deleted after running the code).

        "Now here is the simple solution: You have introduced superfluous calls to TFile.createNewFile() and Files.createFile(TPath). These calls have side effects, and in the case of Files.createFile(TPath), when looping over an archive file this results in quadratic runtime, which is what you were witnessing."

        Indeed, the calls are superfluous. But without them, the issue is not reproducible. That's why I stated in my original post: "Using TPath is at least twice as slow as TFile when each file in an archive is created before the contents is written". The question was therefore why this makes such a huge difference (as using TFile.createNewFile() has a negligible impact on performance)?

        Show
        qui.tiqe added a comment - Thank you for taking the time to reply to my original "issue". "1) Please specify the OS version. The behavior of file systems and the JRE is quite different among Unix and Windows. I am running the following code on Mac OS X 10.9.2, Oracle JDK 1.7.0_51-b13." I'm running JDK 7u51 on Windows 7. "2) The JVM optimizes hot spots at run time, so for a real benchmark please wrap the test in another loop to compensate for JVM warm up time. " True for actual benchmarking, though in this case what happens after "TVFS.umount();" is the interesting part. "3) In production, please make sure to ALWAYS close your streams, even in case of a throwable, e.g. within a finally block or with a try-with-resources statement." "4) For comparable results, you must delete the archive files before the test runs because TrueZIP behaves differently depending on whether the archive file already exists or not." The code was intended ONLY as an example; it reproduces the "slowness" which I'm experiencing in the actual code but is short enough to serve as an example (as error handling, etc. is left out). For me it was obvious that the code I posted was not intended as anything else than a quick and dirty example, but I guess I should have mentioned that (and especially the part where the created files should be deleted after running the code). "Now here is the simple solution: You have introduced superfluous calls to TFile.createNewFile() and Files.createFile(TPath). These calls have side effects, and in the case of Files.createFile(TPath), when looping over an archive file this results in quadratic runtime, which is what you were witnessing." Indeed, the calls are superfluous. But without them, the issue is not reproducible. That's why I stated in my original post: "Using TPath is at least twice as slow as TFile when each file in an archive is created before the contents is written" . The question was therefore why this makes such a huge difference (as using TFile.createNewFile() has a negligible impact on performance)?
        Hide
        Christian Schlichtherle added a comment - - edited

        The reason is that TFile.createNewFile() and Files.createFile(TPath, ...) are implemented in fundamentally different ways:

        TFile.createNewFile() simply creates a node in the virtual file system for the archive file. No I/O is done at all. This is the equivalent of an mknod() POSIX call.

        Files.createFile(Path, ...) allocates a new SeekableByteChannel and then immediately closes it. This is very expensive because the TrueZIP Kernel needs to allocate a temporary I/O buffer (usually a file) so that you can seek in it, connect a new SeekableByteChannel to it and then immediate close it. Then again, you start overwriting the same temporary I/O buffer. Finally, when calling TVFS.umount(), all 10.000 temporary I/O buffers get assembled into the archive file, which is why you are witnessing double runtime. Note that this implements a write-back caching strategy. There is also a write-through caching strategy implemented in the kernel, but not actually used. The write-through strategy would even result in quadratic runtime because the kernel would have to do an implicit umount upon the second write to the same archive entry within the loop.

        Unfortunately I have no influence on the implementation of Files.createFile(Path, ...) because it is a static method. I once contacted Alan Bateman, the specification lead for the NIO.2 API on the matter, but it doesn't seem Oracle would add the equivalent of an mknod() call to the NIO.2 API.

        Show
        Christian Schlichtherle added a comment - - edited The reason is that TFile.createNewFile() and Files.createFile(TPath, ...) are implemented in fundamentally different ways: TFile.createNewFile() simply creates a node in the virtual file system for the archive file. No I/O is done at all. This is the equivalent of an mknod() POSIX call. Files.createFile(Path, ...) allocates a new SeekableByteChannel and then immediately closes it. This is very expensive because the TrueZIP Kernel needs to allocate a temporary I/O buffer (usually a file) so that you can seek in it, connect a new SeekableByteChannel to it and then immediate close it. Then again, you start overwriting the same temporary I/O buffer. Finally, when calling TVFS.umount(), all 10.000 temporary I/O buffers get assembled into the archive file, which is why you are witnessing double runtime. Note that this implements a write-back caching strategy. There is also a write-through caching strategy implemented in the kernel, but not actually used. The write-through strategy would even result in quadratic runtime because the kernel would have to do an implicit umount upon the second write to the same archive entry within the loop. Unfortunately I have no influence on the implementation of Files.createFile(Path, ...) because it is a static method. I once contacted Alan Bateman, the specification lead for the NIO.2 API on the matter, but it doesn't seem Oracle would add the equivalent of an mknod() call to the NIO.2 API.
        Hide
        qui.tiqe added a comment -

        Thank you for the detailed explanation, it is much appreciated!

        Show
        qui.tiqe added a comment - Thank you for the detailed explanation, it is much appreciated!

          People

          • Assignee:
            Christian Schlichtherle
            Reporter:
            qui.tiqe
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved: