llvm-ranlib: Bus Error in regressions + fix

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

llvm-ranlib: Bus Error in regressions + fix

Evan Jones-4
I ran the LLVM regression tests today (via make check) and noticed that
llvm-ranlib crashes with a Bus Error on my test system (a fairly old
RedHat 9 system), using the latest CVS version. I did some digging and
I think I know what the problem is, and I have attached a quick and
dirty patch that fixes the problem for me, but I need a suggestion
about how it should be integrated properly. Here are the details:

To reproduce the crash, run llvm-ranlib on the "GNU.a" file in the
llvm/test/Regression/Archive directory (make a copy first: it corrupts
it). It then crashes with a Bus Error.

The stack trace is:

#0  0x4207c1aa in memcpy () from /lib/tls/libc.so.6
#1  0x400d55e8 in std::basic_streambuf<char, std::char_traits<char>
 >::xsputn(char const*, int) () from /usr/lib/libstdc++.so.5
#2  0x4009c818 in std::basic_filebuf<char, std::char_traits<char>
 >::xsputn(char const*, int) () from /usr/lib/libstdc++.so.5
#3  0x400cbed1 in std::ostream::write(char const*, int) ()
    from /usr/lib/libstdc++.so.5
#4  0x0829c9d0 in llvm::Archive::writeMember(llvm::ArchiveMember
const&, std::basic_ofstream<char, std::char_traits<char> >&, bool,
bool, bool) (
     this=0x8356088, member=@0x8356180, ARFile=@0xbfffd630,
     CreateSymbolTable=false, TruncateNames=false, ShouldCompress=false)
     at ArchiveWriter.cpp:294
#5  0x0829d297 in llvm::Archive::writeToDisk(bool, bool, bool) (
     this=0x8356088, CreateSymbolTable=true, TruncateNames=false,
     Compress=false) at ArchiveWriter.cpp:439
#6  0x081a5618 in main (argc=2, argv=0xbfffd9b4) at llvm-ranlib.cpp:76
#7  0x42015574 in __libc_start_main () from /lib/tls/libc.so.6


At frame #4 (Archive::writeMember) looks like this:

>   // Write the (possibly compressed) member's content to the file.
>   ARFile.write(data,fSize);

If I examine the backtrace, fSize equals 46, and "data" points to 46
null bytes. However, the "data" pointer is invalid, since if I inspect
it *before* the crash, the crash does not occur.

frame #5 (Archive::writeToDisk) looks like this:

>       // If there is a foreign symbol table, put it into the file now.
> Most
>       // ar(1) implementations require the symbol table to be first
> but llvm-ar
>       // can deal with it being after a foreign symbol table. This
> ensures
>       // compatibility with other ar(1) implementations as well as
> allowing the
>       // archive to store both native .o and LLVM .bc files, both
> indexed.
>       if (foreignST) {
>         writeMember(*foreignST, FinalFile, false, false, false);
>       }
So I tracked back the foreignST pointer, and when it is set the "data"
pointer is *not* 46 null bytes. It is valid data mmap-ed from the
archive file. But when it gets to the call to writeMember, that data
pointer is no longer valid. Running "strace" on llvm-ranlib solved the
mystery. Here are the relevant calls:

open("temp.GNU.a", O_RDONLY)            = 13
fstat64(13, {st_mode=S_IFREG|0600, st_size=4210, ...}) = 0
mmap2(NULL, 8192, PROT_READ, MAP_PRIVATE, 13, 0) = 0x40017000

** The source file is mapped, and a lot of stuff happens **

open("temp.GNU.a", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 15
fstat64(15, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0

** Here the source file is TRUNCATED. Essentially, this invalidates the
data pointer. Two lines follow in the trace: **

_llseek(15, 0, [0], SEEK_CUR)           = 0
--- SIGBUS (Bus error) @ 0 (0) ---



So the fix is pretty simple: before opening the file again, unlink it.
This has the effect of creating a *new* file, instead of overwriting
the old data. I've attached my quick-and-dirty patch that will only
work on Unix. I'm not sure how this should be solved correctly. The
other strange part is why hasn't anyone else seen this problem? I would
think that this would occur pretty reliably on all systems. Any ideas?

Evan Jones



--
Evan Jones
http://evanjones.ca/
_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

archive.patch (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Reid Spencer
Evan,

Your patch uses an operating system call that is not portable. All non-portable
code needs to be located in the lib/System library. I'm not sure why this
problem appears on an old Red Hat system. Perhaps the C++ io library is not up
to snuff on that platform? What compiler are you using?

Reid.

Evan Jones wrote:

> I ran the LLVM regression tests today (via make check) and noticed that
> llvm-ranlib crashes with a Bus Error on my test system (a fairly old
> RedHat 9 system), using the latest CVS version. I did some digging and I
> think I know what the problem is, and I have attached a quick and dirty
> patch that fixes the problem for me, but I need a suggestion about how
> it should be integrated properly. Here are the details:
>
> To reproduce the crash, run llvm-ranlib on the "GNU.a" file in the
> llvm/test/Regression/Archive directory (make a copy first: it corrupts
> it). It then crashes with a Bus Error.
>
> The stack trace is:
>
> #0  0x4207c1aa in memcpy () from /lib/tls/libc.so.6
> #1  0x400d55e8 in std::basic_streambuf<char, std::char_traits<char>
>  >::xsputn(char const*, int) () from /usr/lib/libstdc++.so.5
> #2  0x4009c818 in std::basic_filebuf<char, std::char_traits<char>
>  >::xsputn(char const*, int) () from /usr/lib/libstdc++.so.5
> #3  0x400cbed1 in std::ostream::write(char const*, int) ()
>    from /usr/lib/libstdc++.so.5
> #4  0x0829c9d0 in llvm::Archive::writeMember(llvm::ArchiveMember const&,
> std::basic_ofstream<char, std::char_traits<char> >&, bool, bool, bool) (
>     this=0x8356088, member=@0x8356180, ARFile=@0xbfffd630,
>     CreateSymbolTable=false, TruncateNames=false, ShouldCompress=false)
>     at ArchiveWriter.cpp:294
> #5  0x0829d297 in llvm::Archive::writeToDisk(bool, bool, bool) (
>     this=0x8356088, CreateSymbolTable=true, TruncateNames=false,
>     Compress=false) at ArchiveWriter.cpp:439
> #6  0x081a5618 in main (argc=2, argv=0xbfffd9b4) at llvm-ranlib.cpp:76
> #7  0x42015574 in __libc_start_main () from /lib/tls/libc.so.6
>
>
> At frame #4 (Archive::writeMember) looks like this:
>
>>   // Write the (possibly compressed) member's content to the file.
>>   ARFile.write(data,fSize);
>
>
> If I examine the backtrace, fSize equals 46, and "data" points to 46
> null bytes. However, the "data" pointer is invalid, since if I inspect
> it *before* the crash, the crash does not occur.
>
> frame #5 (Archive::writeToDisk) looks like this:
>
>>       // If there is a foreign symbol table, put it into the file now.
>> Most
>>       // ar(1) implementations require the symbol table to be first
>> but llvm-ar
>>       // can deal with it being after a foreign symbol table. This
>> ensures
>>       // compatibility with other ar(1) implementations as well as
>> allowing the
>>       // archive to store both native .o and LLVM .bc files, both
>> indexed.
>>       if (foreignST) {
>>         writeMember(*foreignST, FinalFile, false, false, false);
>>       }
>
>
> So I tracked back the foreignST pointer, and when it is set the "data"
> pointer is *not* 46 null bytes. It is valid data mmap-ed from the
> archive file. But when it gets to the call to writeMember, that data
> pointer is no longer valid. Running "strace" on llvm-ranlib solved the
> mystery. Here are the relevant calls:
>
> open("temp.GNU.a", O_RDONLY)            = 13
> fstat64(13, {st_mode=S_IFREG|0600, st_size=4210, ...}) = 0
> mmap2(NULL, 8192, PROT_READ, MAP_PRIVATE, 13, 0) = 0x40017000
>
> ** The source file is mapped, and a lot of stuff happens **
>
> open("temp.GNU.a", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 15
> fstat64(15, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
>
> ** Here the source file is TRUNCATED. Essentially, this invalidates the
> data pointer. Two lines follow in the trace: **
>
> _llseek(15, 0, [0], SEEK_CUR)           = 0
> --- SIGBUS (Bus error) @ 0 (0) ---
>
>
>
> So the fix is pretty simple: before opening the file again, unlink it.
> This has the effect of creating a *new* file, instead of overwriting the
> old data. I've attached my quick-and-dirty patch that will only work on
> Unix. I'm not sure how this should be solved correctly. The other
> strange part is why hasn't anyone else seen this problem? I would think
> that this would occur pretty reliably on all systems. Any ideas?
>
> Evan Jones
>
>
> --
> Evan Jones
> http://evanjones.ca/
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Evan Jones-4
On Nov 22, 2005, at 17:18, Reid Spencer wrote:
> Your patch uses an operating system call that is not portable. All
> non-portable code needs to be located in the lib/System library.

Yep! I know. That is why I posted it for discussion. I'm not sure if
this is the "right" way to fix the problem, or if there is a different
fix that should be applied (like perhaps copying the data out of the
mmap-ed archive?).

> I'm not sure why this problem appears on an old Red Hat system.
> Perhaps the C++ io library is not up to snuff on that platform? What
> compiler are you using?

It is very strange to me that it doesn't appear on other systems. I'll
try to load LLVM on my bleeding edge Debian laptop tomorrow and see
what happens there.

I am pretty certain that this has nothing to do with the C++ library,
and everything to do with the behaviour of mmap when the file that was
mmaped is modified. I actually can reproduce this behaviour with the
attached C test case. The program mmaps a file called 'data,' prints
the last byte, truncates the file, then tries to read the last byte
again. It causes a Bus Error on both the RedHat system and my Mac OS X
workstation. Hence, this appears to be valid (or at least common) mmap
behaviour.

rn-spra1c07:~ ejones$ dd if=/dev/zero of=data bs=1 count=4096
4096+0 records in
4096+0 records out
4096 bytes transferred in 0.067263 secs (60895 bytes/sec)
rn-spra1c07:~ ejones$ ./mmaptest
last byte = 0x00
Bus error

I can also reproduce it with a minimal LLVM example, also attached.
That program needs the "GNU.a" file in the current directory. It opens
the archive and scans through all the members, printing out the first
byte of each one. Then it truncates the file and repeats that
experiment. It also causes a Bus Error.

Essentially, this is what happens in ArchiveWriter.cpp:429. This bug
will be triggered by any archive that has a native symbol table, since
that member (foreignST) references data that was mmaped from the
original file. All the other members are copied from the temporary
archive, so they are not a problem.

Evan Jones



--
Evan Jones
http://evanjones.ca/

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

mmaptest.c (1K) Download Attachment
llvm-buserror.cpp (1K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Reid Spencer
Evan Jones wrote:

  > I am pretty certain that this has nothing to do with the C++ library,
> and everything to do with the behaviour of mmap when the file that was
> mmaped is modified. I actually can reproduce this behaviour with the
> attached C test case. The program mmaps a file called 'data,' prints the
> last byte, truncates the file, then tries to read the last byte again.
> It causes a Bus Error on both the RedHat system and my Mac OS X
> workstation. Hence, this appears to be valid (or at least common) mmap
> behaviour.

Yes, this is the correct behavior for mmap in such a situation. The mapped
file, when it is truncated, invalidates the memory corresponding to truncated
portion of the file. The memory is taken out of the virtual memory table so
that any attempt to access generates a, you guessed it, bus error.

>
> rn-spra1c07:~ ejones$ dd if=/dev/zero of=data bs=1 count=4096
> 4096+0 records in
> 4096+0 records out
> 4096 bytes transferred in 0.067263 secs (60895 bytes/sec)
> rn-spra1c07:~ ejones$ ./mmaptest
> last byte = 0x00
> Bus error
>
> I can also reproduce it with a minimal LLVM example, also attached. That
> program needs the "GNU.a" file in the current directory. It opens the
> archive and scans through all the members, printing out the first byte
> of each one. Then it truncates the file and repeats that experiment. It
> also causes a Bus Error.
>
> Essentially, this is what happens in ArchiveWriter.cpp:429. This bug
> will be triggered by any archive that has a native symbol table, since
> that member (foreignST) references data that was mmaped from the
> original file. All the other members are copied from the temporary
> archive, so they are not a problem.

The file gets corrupted because it is overwriting itself. The strace showed
that it opened the same file for reading and writing with 2 file handlers. This
isn't what the code is supposed to do. The TmpArchive variable in
ArchiveWriter.cpp is supposed to reference a unique file name and it is not.
At that point in the ArchiveWriter, it is trying to insert the symbol table. It
does that by creating a temporary file and mmaping the original file. The
temporary file is written with the symbol table and then it writes the entire
content of the mmaped file into the temporary file (single write using mmaped
pointer).  When that is done, it renames the temporary file to that of the
original.  The problem is, the temporary and the original are the same file!
This is a failure of Path::makeUnique, which is system dependent.

I don't have a debugging environment handy to track this down, but I would
suggest that you break out a debugger and investigate the following:

1. What is the path name associated with TmpArchive? If its the same as the
path name associated with archPath then that's a bug, probably introduced when
Path::makeUnique is called from Path::createTemporaryFileOnDisk which is called
from line 377 of ArchiveWriter.cpp.

2. If item 1. holds, break in Path::makeUnique and see how it is computing the
temporary name. There are three mechanisms: mkstemp, mktemp, and "manual". I
don't know which mechanism it is using or why its not creating a unique file
name. If your Red Hat system is really old, its possible one of the system
mechanisms is broken and you'll need to adjust the code for the broken (but
available) library call.

Reid.

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Evan Jones-4
On Nov 22, 2005, at 19:10, Reid Spencer wrote:
> 1. What is the path name associated with TmpArchive? If its the same
> as the path name associated with archPath then that's a bug, probably
> introduced when Path::makeUnique is called from
> Path::createTemporaryFileOnDisk which is called from line 377 of
> ArchiveWriter.cpp.

This does not appear to be the problem. I excluded the lines from the
strace that created this temporary file. After line 377:

(gdb) p TmpArchive
$2 = {path = {static npos = 4294967295,
     _M_dataplus = {<allocator<char>> = {<No data fields>},
       _M_p = 0x835407c "temp.GNU.a-PozKFJ"}, static
_S_empty_rep_storage = {0,
       0, 4, 0}}}
(gdb) p archPath
$3 = {path = {static npos = 4294967295,
     _M_dataplus = {<allocator<char>> = {<No data fields>},
       _M_p = 0x83545f4 "temp.GNU.a5\b"}, static _S_empty_rep_storage =
{0, 0,
       4, 0}}}

So these two variables are pointing to different files, and the
creation of TmpArchive works just fine.  The strace including the parts
that reference the temporary file is appended to the end of the email.
The very last open to the "archPath" file is at line 429, where it
truncates it even though the foreignST pointer refers to data mmaped in
that file. Is this data supposed to be copied out of the original file,
or is another temporary supposed to be created and then the original
could be replaced using a file move operation instead?

I'll try this on my Debian unstable system tomorrow. If ranlib works
there, maybe I can track down the difference.

Evan Jones


open("temp.GNU.a", O_RDONLY)            = 3
fstat64(3, {st_mode=S_IFREG|0600, st_size=4210, ...}) = 0
mmap2(NULL, 8192, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40017000
gettimeofday({1132714484, 283020}, NULL) = 0
getpid()                                = 28656
open("temp.GNU.a-O1Q6E8", O_RDWR|O_CREAT|O_EXCL, 0600) = 4
close(4)                                = 0
open("temp.GNU.a-O1Q6E8", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
close(4)                                = 0

*** SIGNAL HANDLING REMOVED ***

open("temp.GNU.a-O1Q6E8", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
brk(0)                                  = 0x8357000
brk(0x8359000)                          = 0x8359000
fstat64(4, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x40019000
_llseek(4, 0, [0], SEEK_CUR)            = 0
_llseek(4, 0, [0], SEEK_SET)            = 0
_llseek(4, 0, [0], SEEK_SET)            = 0
_llseek(4, 0, [0], SEEK_SET)            = 0
_llseek(4, 0, [0], SEEK_SET)            = 0
brk(0)                                  = 0x8359000
brk(0x8369000)                          = 0x8369000
brk(0)                                  = 0x8369000
brk(0x836a000)                          = 0x836a000
mmap2(NULL, 2002944, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
-1, 0) = 0x4013b000
munmap(0x4013b000, 2002944)             = 0
_llseek(4, 0, [0], SEEK_SET)            = 0
_llseek(4, 0, [0], SEEK_SET)            = 0
_llseek(4, 0, [0], SEEK_SET)            = 0
write(4, "!<arch>\nevenlen/        11008330"..., 4040) = 4040
close(4)                                = 0
munmap(0x40019000, 4096)                = 0
access("temp.GNU.a-O1Q6E8", F_OK)       = 0
open("temp.GNU.a-O1Q6E8", O_RDONLY)     = 4
fstat64(4, {st_mode=S_IFREG|0600, st_size=4040, ...}) = 0
mmap2(NULL, 4096, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40019000
open("temp.GNU.a", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5
fstat64(5, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x4001a000
_llseek(5, 0, [0], SEEK_CUR)            = 0
--- SIGBUS (Bus error) @ 0 (0) ---


--
Evan Jones
http://evanjones.ca/

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Reid Spencer
Evan Jones wrote:

> On Nov 22, 2005, at 19:10, Reid Spencer wrote:
>
>> 1. What is the path name associated with TmpArchive? If its the same
>> as the path name associated with archPath then that's a bug, probably
>> introduced when Path::makeUnique is called from
>> Path::createTemporaryFileOnDisk which is called from line 377 of
>> ArchiveWriter.cpp.
>
>
> This does not appear to be the problem. I excluded the lines from the
> strace that created this temporary file. After line 377:
>
> (gdb) p TmpArchive
> $2 = {path = {static npos = 4294967295,
>     _M_dataplus = {<allocator<char>> = {<No data fields>},
>       _M_p = 0x835407c "temp.GNU.a-PozKFJ"}, static _S_empty_rep_storage

That looks like mkstemp working correctly.

> = {0,
>       0, 4, 0}}}
> (gdb) p archPath
> $3 = {path = {static npos = 4294967295,
>     _M_dataplus = {<allocator<char>> = {<No data fields>},
>       _M_p = 0x83545f4 "temp.GNU.a5\b"}, static _S_empty_rep_storage =

What's with the "5\b" at the end? Looks like garbage to me. Not sure what's up
with that.
> {0, 0,
>       4, 0}}}
>
> So these two variables are pointing to different files, and the creation
> of TmpArchive works just fine.  The strace including the parts that
> reference the temporary file is appended to the end of the email.

Okay, the previous strace looked like they were both opening "temp.GNU.a";
perhaps the names were truncated in the strace.  Anyway, you're right, this
isn't the problem if the file names are different.

> The
> very last open to the "archPath" file is at line 429, where it truncates
> it even though the foreignST pointer refers to data mmaped in that file.

Yes, the foreignST points to a non-LLVM symbol table which must be retained for
compatibility with other AR implementations. Unfortunately, it is being held
from a file that is about to be mmap'd.

> Is this data supposed to be copied out of the original file,

That's one solution but it defeats the purpose/efficiency of the mmap.

or is
> another temporary supposed to be created and then the original could be
> replaced using a file move operation instead?

Another temporary file would be even slower than copying the memory in memory.
>
> I'll try this on my Debian unstable system tomorrow. If ranlib works
> there, maybe I can track down the difference.

Sounds like a plan. Its a bit difficult to debug this over email. If you get
stuck, let me know and I'll gen up a debug environment to take a look at it.
Chances are, however, that I won't be able to replicate this problem on FC3
since it passes the nightly test here.

>
  > open("temp.GNU.a", O_RDONLY)            = 3

Original file being opened (read-only, because we're not supposed to modify it)

> fstat64(3, {st_mode=S_IFREG|0600, st_size=4210, ...}) = 0
> mmap2(NULL, 8192, PROT_READ, MAP_PRIVATE, 3, 0) = 0x40017000
> gettimeofday({1132714484, 283020}, NULL) = 0
> getpid()                                = 28656
> open("temp.GNU.a-O1Q6E8", O_RDWR|O_CREAT|O_EXCL, 0600) = 4
The temporary file that will contain the symbol table
> close(4)                                = 0
> open("temp.GNU.a-O1Q6E8", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
> close(4)                                = 0
Not sure why this is opened twice .. probably an llvm::sys::Path issue.
>
> *** SIGNAL HANDLING REMOVED ***
>
> open("temp.GNU.a-O1Q6E8", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
Open yet again the temnporary file into which we'll build the symtab.
> brk(0)                                  = 0x8357000
> brk(0x8359000)                          = 0x8359000
> fstat64(4, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x40019000
This mmap is just allocating memory

> _llseek(4, 0, [0], SEEK_CUR)            = 0
> _llseek(4, 0, [0], SEEK_SET)            = 0
> _llseek(4, 0, [0], SEEK_SET)            = 0
> _llseek(4, 0, [0], SEEK_SET)            = 0
> _llseek(4, 0, [0], SEEK_SET)            = 0
> brk(0)                                  = 0x8359000
> brk(0x8369000)                          = 0x8369000
> brk(0)                                  = 0x8369000
> brk(0x836a000)                          = 0x836a000
> mmap2(NULL, 2002944, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS,
> -1, 0) = 0x4013b000
more memory allocations

> munmap(0x4013b000, 2002944)             = 0
> _llseek(4, 0, [0], SEEK_SET)            = 0
> _llseek(4, 0, [0], SEEK_SET)            = 0
> _llseek(4, 0, [0], SEEK_SET)            = 0
> write(4, "!<arch>\nevenlen/        11008330"..., 4040) = 4040
> close(4)                                = 0
> munmap(0x40019000, 4096)                = 0
> access("temp.GNU.a-O1Q6E8", F_OK)       = 0
> open("temp.GNU.a-O1Q6E8", O_RDONLY)     = 4
> fstat64(4, {st_mode=S_IFREG|0600, st_size=4040, ...}) = 0
> mmap2(NULL, 4096, PROT_READ, MAP_PRIVATE, 4, 0) = 0x40019000
map in the contents of the file we wrote.
> open("temp.GNU.a", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 5
wipe out the original file
> fstat64(5, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
> mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
> 0) = 0x4001a000
allocate memory
> _llseek(5, 0, [0], SEEK_CUR)            = 0
> --- SIGBUS (Bus error) @ 0 (0) ---

 From this strace it looks to me like the only solution is to copy the data
pointed to by foreignST into malloc'd space. Its just a symbol table so it
shouldn't be too huge.  The problem, as you've noted, is that the foreignST
pointer is pointing into file handle 3. When file handle 5 is opened, the
O_TRUNC parameter causes the memory pointed to by foreignST to be invalidated
(unmapped). When we try to read this data in order to write it back to
temp.GNU.a, it is getting the bus error trying to access the foreignST data.

There's two solutions:
(1) copy the data pointed to by foreignST
(2) build the new file with the symbol table in a 3rd temporary file which is
later renamed as the original (temp.GNU.a in this case).


Reid.

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Evan Jones-4
On Nov 22, 2005, at 23:59, Reid Spencer wrote:
>> = {0,
>>       0, 4, 0}}}
>> (gdb) p archPath
>> $3 = {path = {static npos = 4294967295,
>>     _M_dataplus = {<allocator<char>> = {<No data fields>},
>>       _M_p = 0x83545f4 "temp.GNU.a5\b"}, static _S_empty_rep_storage =
> What's with the "5\b" at the end? Looks like garbage to me. Not sure
> what's up with that.

The implementation of std::string on this system does not always NULL
terminate strings. I verified that the path.length() value in that case
was 10, and calling path.c_str() returned the correct NULL terminated
string.


I found a system where llvm-ranlib "works," and I have figured out the
problem. Here is the crashed strace, after it opens the file:

open("temp.GNU.a", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 15
fstat64(15, {st_mode=S_IFREG|0600, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1,
0) = 0x4001a000
_llseek(15, 0, [0], SEEK_CUR)           = 0
--- SIGBUS (Bus error) @ 0 (0) ---


and here is the "working" strace, after it opens the file:

open("temp.GNU.a", O_WRONLY|O_CREAT|O_TRUNC|O_LARGEFILE, 0666) = 17
write(17, "!<arch>\n", 8)               = 8
_llseek(17, 0, [8], SEEK_CUR)           = 0
write(17, "/               1100833490  0   "..., 106) = 106
_llseek(17, 0, [114], SEEK_CUR)         = 0


Note that in the "working" case, the C++ library decides to flush the
magic number to the file immediately. This causes the first page in the
MMAPed file to be allocated (with zeros). Hence, when we go to read the
foreignST pointer, it doesn't crash, but it has silently corrupted the
data. The foreignST pointer looks like this when it is first read from
the file:

(gdb) x/46bx foreignST->getData()
0xf6ffe044:     0x00    0x00    0x00    0x02    0x00    0x00    0x07    
0x4e
0xf6ffe04c:     0x00    0x00    0x07    0x4e    0x5f    0x5a    0x4e    
0x34
0xf6ffe054:     0x6c    0x6c    0x76    0x6d    0x35    0x49    0x73    
0x4e
0xf6ffe05c:     0x41    0x4e    0x45    0x66    0x00    0x5f    0x5a    
0x4e
0xf6ffe064:     0x34    0x6c    0x6c    0x76    0x6d    0x35    0x49    
0x73
0xf6ffe06c:     0x4e    0x41    0x4e    0x45    0x64    0x00


And it looks like this just before it gets written out:

(gdb) x/46bx foreignST->getData()
0xf6ffe044:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    
0x00
0xf6ffe04c:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    
0x00
0xf6ffe054:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    
0x00
0xf6ffe05c:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    
0x00
0xf6ffe064:     0x00    0x00    0x00    0x00    0x00    0x00    0x00    
0x00
0xf6ffe06c:     0x00    0x00    0x00    0x00    0x00    0x00


You can tell that it is getting corrupted by looking at the file with
"hexdump -C" before and after llvm-ranlib. The native symbol table is
supposed to start at offset 0x44.


>> Is this data supposed to be copied out of the original file,
> That's one solution but it defeats the purpose/efficiency of the mmap.

Well, only the native symbol table would need to be copied, and only in
the writeToDisk function, before the file gets invalidated.


>> another temporary supposed to be created and then the original could
>> be replaced using a file move operation instead?
> Another temporary file would be even slower than copying the memory in
> memory.

I'm not so sure about that. The archive file is being truncated anyway.
How much worse would it be to unlink the file before writing it out?
The operating system probably already deallocates the disk blocks as
soon as it is truncated. That also fixes the problem, since the
original mapped file still exists until it is unmapped.


> There's two solutions:
> (1) copy the data pointed to by foreignST
> (2) build the new file with the symbol table in a 3rd temporary file
> which is later renamed as the original (temp.GNU.a in this case).

(3) unlink the original file first (basically equivalent to (2))
(4) Write the foreignST into the TmpArchive file. Is there any reason
that this isn't possible? Then the final archive would be created in a
single pass, and it could just be moved into place.

Also, since writeToDisk() invalidates the mapped file that was
originally used to create the archive, shouldn't this function also
unmap that file and erase all its members? This would prevent and
further bugs like this from happening.

Evan Jones

--
Evan Jones
http://evanjones.ca/

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Evan Jones-4
On Nov 23, 2005, at 8:16, Evan Jones wrote:
> (4) Write the foreignST into the TmpArchive file. Is there any reason
> that this isn't possible? Then the final archive would be created in a
> single pass, and it could just be moved into place.

Ah. I see: It needs to be written in order to compute the offsets.
However, it would be possible to use a stringstream for this.

I think the best thing to do is to write the final archive to a
temporary file, and then move it into place. This has the advantage
that if something goes wrong the original archive will not be
corrupted. Additionally, all the members of the current archive object
instance stay valid, since the original mmap is still valid.

One disadvantage of this approach is that if CreateSymbolTable is
false, it will be slightly more expensive as it will build the archive
in memory before writing it to disk, instead of just writing it out to
disk directly.

> Also, since writeToDisk() invalidates the mapped file that was
> originally used to create the archive, shouldn't this function also
> unmap that file and erase all its members? This would prevent and
> further bugs like this from happening.

This is not necessary using the replace method described above.
However, it *would* be necessary on Windows since you can't
unlink/replace an open file on that platform. To fix this, the original
archive file must be closed before performing the move.

I've attached a patch that builds the temporary archive in a
stringstream, then writes the temporary file and moves it into place.
It fixes the bug on the old RedHat system.

Evan Jones


--
Evan Jones
http://evanjones.ca/

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

archive.patch (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Reid Spencer
Evan Jones wrote:
> On Nov 23, 2005, at 8:16, Evan Jones wrote:
>
>> (4) Write the foreignST into the TmpArchive file. Is there any reason
>> that this isn't possible? Then the final archive would be created in a
>> single pass, and it could just be moved into place.
>
>
> Ah. I see: It needs to be written in order to compute the offsets.

Exactly.

> However, it would be possible to use a stringstream for this.

Aggg! No! Those perform horribly with anything more than a small amount of
data. Some Archive files can be huge (e.g. not fit in memory).

>
> I think the best thing to do is to write the final archive to a
> temporary file, and then move it into place. This has the advantage that
> if something goes wrong the original archive will not be corrupted.
> Additionally, all the members of the current archive object instance
> stay valid, since the original mmap is still valid.

Yeah, that sounds like the safest/sanest strategy.

>
> One disadvantage of this approach is that if CreateSymbolTable is false,
> it will be slightly more expensive as it will build the archive in
> memory before writing it to disk, instead of just writing it out to disk
> directly.

You can always use CreateSymbolTable as a flag to determine which mechanism to use.

>
>> Also, since writeToDisk() invalidates the mapped file that was
>> originally used to create the archive, shouldn't this function also
>> unmap that file and erase all its members? This would prevent and
>> further bugs like this from happening.
>
>
> This is not necessary using the replace method described above. However,
> it *would* be necessary on Windows since you can't unlink/replace an
> open file on that platform. To fix this, the original archive file must
> be closed before performing the move.

Yup. If you use the sys::Path class to do the "unlink" (removeFromDisk) then
the platform differences should be accounted for. Please don't put the direct
unlink call into ArchiveWriter.cpp.

>
> I've attached a patch that builds the temporary archive in a
> stringstream, then writes the temporary file and moves it into place. It
> fixes the bug on the old RedHat system.

Please test the performance of that on the large .a file in the regression
tests. I have my doubts about the scalability and speed for stringstream. Why
can't you just write it out directly to the temporary file?

Reid.

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Evan Jones-4
On Nov 23, 2005, at 12:08, Reid Spencer wrote:
>> However, it would be possible to use a stringstream for this.
> Aggg! No! Those perform horribly with anything more than a small
> amount of data. Some Archive files can be huge (e.g. not fit in
> memory).

An archive can be too big to fit in memory? That would be a BIG
library. In that case, building a temporary in memory is a bad idea,
but then so is copying it twice, as is being done now. Maybe it really
would be better to scan over everything and build the symbol table, and
then scan over everything again and build the final archive. This would
avoid writing an extra copy out to disk, and it wouldn't be hard to do
this.

>> This is not necessary using the replace method described above.
>> However, it *would* be necessary on Windows since you can't
>> unlink/replace an open file on that platform. To fix this, the
>> original archive file must be closed before performing the move.
> Yup. If you use the sys::Path class to do the "unlink"
> (removeFromDisk) then the platform differences should be accounted
> for. Please don't put the direct unlink call into ArchiveWriter.cpp.

So this comment means that I should attempt to theoretically support
Windows, and close the current archive file before updating it?

>> I've attached a patch that builds the temporary archive in a
>> stringstream, then writes the temporary file and moves it into place.
>> It fixes the bug on the old RedHat system.
> Please test the performance of that on the large .a file in the
> regression tests. I have my doubts about the scalability and speed for
> stringstream. Why can't you just write it out directly to the
> temporary file?

I'll remove the stringstream, but this means that I need to create two
temporary files. However, this change is less invasive, so I'll rework
my patch again.

Evan Jones

--
Evan Jones
http://evanjones.ca/

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Reid Spencer
Evan Jones wrote:
> On Nov 23, 2005, at 12:08, Reid Spencer wrote:
>
>>> However, it would be possible to use a stringstream for this.
>>
>> Aggg! No! Those perform horribly with anything more than a small
>> amount of data. Some Archive files can be huge (e.g. not fit in memory).
>
>
> An archive can be too big to fit in memory? That would be a BIG library.

Yes, but they do exist. I've seen 1GByte archive files.

> In that case, building a temporary in memory is a bad idea, but then so
> is copying it twice, as is being done now. Maybe it really would be
> better to scan over everything and build the symbol table, and then scan
> over everything again and build the final archive. This would avoid
> writing an extra copy out to disk, and it wouldn't be hard to do this.

Yup, that would work too. It would, however, require significant rework in
ArchiveWriter.cpp code.

>> Yup. If you use the sys::Path class to do the "unlink"
>> (removeFromDisk) then the platform differences should be accounted
>> for. Please don't put the direct unlink call into ArchiveWriter.cpp.
>
> So this comment means that I should attempt to theoretically support
> Windows, and close the current archive file before updating it?

If you use sys::Path, it is not "theoretically" supported, it is supported.
To remove a file, use sys::Path::removeFromDisk, not unlink(2).

  > I'll remove the stringstream, but this means that I need to create two
> temporary files. However, this change is less invasive, so I'll rework
> my patch again.

Sounds good. We can change to the two-pass reading (when building a symbol
table) some other time. I'm sure you just want to get this little issue done
and over with :)

Reid.

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Evan Jones-4
On Nov 23, 2005, at 18:17, Reid Spencer wrote:
>> So this comment means that I should attempt to theoretically support
>> Windows, and close the current archive file before updating it?
> If you use sys::Path, it is not "theoretically" supported, it is
> supported.

Well, except of course that it isn't currently possible to build LLVM
on Windows. Hence this is a somewhat theoretical issue.

> Sounds good. We can change to the two-pass reading (when building a
> symbol table) some other time. I'm sure you just want to get this
> little issue done and over with :)

I have a problem sometimes: I need to find *why* something goes wrong.

At any rate, I've attached the latest version of my patch. It has some
interesting details:

1. Before replacing the original archive file, it invalidates all the
members of the current Archive instance. This required me to refactor
the destructor code into a separate function. This should make this
code work on Windows.

2. I fixed a memory leak in the Archive destructor. The "foreignST"
member was never deleted.

3. There are now two temporary files created if we need to build the
symbol table, and they are moved over top of the original archive once
the process is complete.

Let me know if there are any comments. The part where the second
temporary file is kind of gross, since it needs to catch exceptions and
be erased if it crashes.

Evan Jones



--
Evan Jones
http://evanjones.ca/

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

archive.patch (5K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Jeff Cohen
Evan Jones wrote:

> On Nov 23, 2005, at 18:17, Reid Spencer wrote:
>
>>> So this comment means that I should attempt to theoretically support
>>> Windows, and close the current archive file before updating it?
>>
>> If you use sys::Path, it is not "theoretically" supported, it is
>> supported.
>
>
> Well, except of course that it isn't currently possible to build LLVM
> on Windows. Hence this is a somewhat theoretical issue.
>
News to me.  I do it all the time.  Notice the "win32" directory under
the root :)

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Evan Jones-4
On Nov 23, 2005, at 22:49, Jeff Cohen wrote:
> News to me.  I do it all the time.  Notice the "win32" directory under
> the root :)

Okay, so I spoke about something that I don't really know about, since
I don't use Windows. But what was the whole "Still can't compile
backend of frontend on Windows" discussion about? My impression is that
LLVM only builds under Cygwin. This doesn't really count (IMO) since it
makes Windows look like Unix anyway, from an API perspective.

In fact, with regards to the specific "cannot unlink while open" issue,
Cygwin works around it (partially) by removing the file as soon as the
file is closed. I'm not sure that would be sufficient in this case, but
it might be.

Evan Jones

--
Evan Jones
http://evanjones.ca/

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Jeff Cohen
Evan Jones wrote:

> On Nov 23, 2005, at 22:49, Jeff Cohen wrote:
>
>> News to me.  I do it all the time.  Notice the "win32" directory
>> under the root :)
>
>
> Okay, so I spoke about something that I don't really know about, since
> I don't use Windows. But what was the whole "Still can't compile
> backend of frontend on Windows" discussion about? My impression is
> that LLVM only builds under Cygwin. This doesn't really count (IMO)
> since it makes Windows look like Unix anyway, from an API perspective.

No, it builds using Microsoft's Visual C++.  In fact, there have been
more problems building it with cygwin than with Visual Studio.  The
project and solution files for Visual Studio are all under the win32
directory.  It uses the native Win32 API (see lib/System/Win32).  The
entire backend, except for a (very) few tools, builds natively.  The
front end cannot build this way because it's a hacked-up GCC, which is
well known for being unbuildable with VC++.  But that isn't really
LLVM's problem; we can just distribute a pre-built llvm-gcc binary
compiled with cygwin or mingw.  Anyway, the usefulness of LLVM is not
limited to compiling C/C++ code.

> In fact, with regards to the specific "cannot unlink while open"
> issue, Cygwin works around it (partially) by removing the file as soon
> as the file is closed. I'm not sure that would be sufficient in this
> case, but it might be.

Cygwin isn't involved, so this is irrelevant.  Windows does permit a
file to be deleted while it is still open, but only when it was opened
with delete share permission.  This requires that the file be opened
with the CreateFile API, which LLVM cannot do in a portable fashion.  
This functionality cannot be accessed via fopen.

>
> Evan Jones
>
> --
> Evan Jones
> http://evanjones.ca/
>
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
>
>
>

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Evan Jones-4
On Nov 24, 2005, at 8:32, Jeff Cohen wrote:
>  Windows does permit a file to be deleted while it is still open, but
> only when it was opened with delete share permission.  This requires
> that the file be opened with the CreateFile API, which LLVM cannot do
> in a portable fashion.  This functionality cannot be accessed via
> fopen.

Right, except that the semantics are different (delete on close, not
delete immediately), so it still doesn't solve the portability problem
in this case. Also, DeleteFile fails if the file is mmapped. From MSDN:

> The  DeleteFile function fails if an application attempts to delete a
> file that is open for normal I/O or as a memory-mapped file.

So in this case, I still need to do the ugly contortions to close the
old archive before replacing it.

Evan Jones

--
Evan Jones
http://evanjones.ca/

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Reid Spencer
In reply to this post by Evan Jones-4
Evan Jones wrote:

> Windows look like Unix anyway, from an API perspective.
>
> In fact, with regards to the specific "cannot unlink while open" issue,
> Cygwin works around it (partially) by removing the file as soon as the
> file is closed. I'm not sure that would be sufficient in this case, but
> it might be.

That occurs because of a sharing violation within the Windows operating system.
Windows doesn't have the notion of unlink. If you remove a file, its gone.
There are no links. Consequently, if you try to modify a file that is open by
another handle, you get a sharing violation.  To work around that, you have to
close or remove the file first. Cygwin does work around this, but that is not
our solution. We have Windows specific code in lib/System that also works
around this by closing the file handles properly.  That is why I suggested that
you use the sys::Path class so that the operations on the files work reliably
on all platforms (and if not, we only have to fix it in one place).

Reid.

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: llvm-ranlib: Bus Error in regressions + fix

Reid Spencer
In reply to this post by Evan Jones-4
Evan,

Your patch passed tests on my system and is now committed. Thanks!

Reid.

Evan Jones wrote:

> On Nov 23, 2005, at 18:17, Reid Spencer wrote:
>
>>> So this comment means that I should attempt to theoretically support
>>> Windows, and close the current archive file before updating it?
>>
>> If you use sys::Path, it is not "theoretically" supported, it is
>> supported.
>
>
> Well, except of course that it isn't currently possible to build LLVM on
> Windows. Hence this is a somewhat theoretical issue.
>
>> Sounds good. We can change to the two-pass reading (when building a
>> symbol table) some other time. I'm sure you just want to get this
>> little issue done and over with :)
>
>
> I have a problem sometimes: I need to find *why* something goes wrong.
>
> At any rate, I've attached the latest version of my patch. It has some
> interesting details:
>
> 1. Before replacing the original archive file, it invalidates all the
> members of the current Archive instance. This required me to refactor
> the destructor code into a separate function. This should make this code
> work on Windows.
>
> 2. I fixed a memory leak in the Archive destructor. The "foreignST"
> member was never deleted.
>
> 3. There are now two temporary files created if we need to build the
> symbol table, and they are moved over top of the original archive once
> the process is complete.
>
> Let me know if there are any comments. The part where the second
> temporary file is kind of gross, since it needs to catch exceptions and
> be erased if it crashes.
>
> Evan Jones
>
>
> --
> Evan Jones
> http://evanjones.ca/
>
>
> ------------------------------------------------------------------------
>
> _______________________________________________
> LLVM Developers mailing list
> [hidden email]         http://llvm.cs.uiuc.edu
> http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev