Advice on implementing fast per-thread data

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Advice on implementing fast per-thread data

Brian Hurt

Hello- I'm looking to implement a new programming language using LLVM as a
back-end.  Generally it's looking very good, I only have one question.
The language is going to be an ML-style language, similiar to Haskell or
Ocaml, except explicitly multithreaded and (like Haskell but unlike Ocaml)
purely functional.  But this means that speed of allocation is essential-
purely functional languages have been measured to allocate an average of 1
word every 6 instructions.  Generally they allocate larger blocks (often
10-20 words at a shot) less often, but we're still talking about an insane
(to imperitive programming language standards) allocation rate.

What I'd like is something similiar to the Ocaml garbage collector, but
with a unique young heap for each thread (so I don't have to acquire a
lock to allocate).  But this requires me to have two words of thread-local
storage for every thread (young_heap_pointer and young_heap_limit).  So
this is the question I have: what is the best way to implement this, given
exceedingly tight constraints on performance?  I'd like something that
could be implemented on both Unix/Mac and Windows, just to make it more of
a problem.

By far the best way I can think of is to play games with the virtual
memory- have the two pointers (and maybe a little bit of other per-thread
information) on a global page all by themselves, and remap that virtual
page for every thread.  Then I could simply dereference the globals.  On
Unix I think I can do this playing games with mmap(), but I'm not sure how
to do this on Windows.  I played around a little with the VirtualAlloc
function, but it doesn't look like it does what I need.  Almost as good
would be to add a level of indirection, have a global pointer to where the
page is to be mapped, and just map it in the same place in every thread.

Another possibility, and I'm not sure how to do this in LLVM, would be to
sacrifice a register to hold the pointer to the unique per-thread
structure.  This would be worthwhile to me even on the register-starved
x86-32.  I suppose I could also just add a "hidden" (compiler-added and
-maintained) argument to every function which is the pointer to the
per-thread data.

Using the normal thread-local storage scares me, because I don't know the
performance implications.  Obviously calling a syscall every allocation
would be unacceptable, nor would I like an O(log N) tree-walking hit.
Ocaml's allocation is like 3-5 clock cycles for the common case (no minor
GC needed), so adding another 5 clock cycles to get the thread-local data
doubles the cost of allocation.  Adding 50 or 1000 clock cycles is right
out.  Basically, I'd like allocation to remain single-digit clock cycle
cost for the common case.  If someone can convince me that thread-local
storage is that fast, then we're good.  Otherwise, I need to look
elsewhere.

Opinions, advice, and/or alternative ideas about this problem would all be
greatly welcomed.  Which of the above alternatives do you think would work
best?  Or some other approach entirely?

Thanks.

Brian

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: Advice on implementing fast per-thread data

Chris Lattner
On Mon, 4 Feb 2008, Brian Hurt wrote:
> Another possibility, and I'm not sure how to do this in LLVM, would be to
> sacrifice a register to hold the pointer to the unique per-thread
> structure.  This would be worthwhile to me even on the register-starved
> x86-32.  I suppose I could also just add a "hidden" (compiler-added and
> -maintained) argument to every function which is the pointer to the
> per-thread data.

Thread local storage (TLS) on Linux is better than this.  Instead of
sacrificing a GPR, it uses a segment register to reach the TLS area,
making it very very cheap.

> Using the normal thread-local storage scares me, because I don't know the
> performance implications.

You should read up about it then. :)
Start here: http://people.redhat.com/drepper/tls.pdf

-Chris

--
http://nondot.org/sabre/
http://llvm.org/
_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: Advice on implementing fast per-thread data

Brian Hurt


On Tue, 5 Feb 2008, Chris Lattner wrote:

> On Mon, 4 Feb 2008, Brian Hurt wrote:
>> Another possibility, and I'm not sure how to do this in LLVM, would be to
>> sacrifice a register to hold the pointer to the unique per-thread
>> structure.  This would be worthwhile to me even on the register-starved
>> x86-32.  I suppose I could also just add a "hidden" (compiler-added and
>> -maintained) argument to every function which is the pointer to the
>> per-thread data.
>
> Thread local storage (TLS) on Linux is better than this.  Instead of
> sacrificing a GPR, it uses a segment register to reach the TLS area,
> making it very very cheap.
>
>> Using the normal thread-local storage scares me, because I don't know the
>> performance implications.
>
> You should read up about it then. :)
> Start here: http://people.redhat.com/drepper/tls.pdf
>

Thank you.  You've just made my life about 3000% easier.  Somehow I've
missed __thread- I was thinking of the clunky POSIX threads
implementation.

Playing around a little bit with this, I find that:
static __thread int i;

int foo(void) {
  i += 1;
  return i;
}

compiles to:
foo:
         pushl   %ebp
         movl    %esp, %ebp
         movl    %gs:i@NTPOFF, %eax
         addl    $1, %eax
         movl    %eax, %gs:i@NTPOFF
         popl    %ebp
         ret

So, other than the segment override, this is no different than accessing a
global variable.  Which means I don't have to give up a clock cycle on
allocation speed for the common case (actually doing a collection is a
little bit trickier).

Brian

_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev
Reply | Threaded
Open this post in threaded view
|

Re: Advice on implementing fast per-thread data

Chris Lattner
On Tue, 5 Feb 2008, Brian Hurt wrote:
>> You should read up about it then. :)
>> Start here: http://people.redhat.com/drepper/tls.pdf
>>
>
> Thank you.  You've just made my life about 3000% easier.  Somehow I've

:) Happy to help,

> Playing around a little bit with this, I find that:
> static __thread int i;
>
> int foo(void) {
> i += 1;
> return i;
> }

> So, other than the segment override, this is no different than accessing a
> global variable.  Which means I don't have to give up a clock cycle on
> allocation speed for the common case (actually doing a collection is a
> little bit trickier).

Doing a collection should be relatively straight-forward.  You can take
the address of these TLS variables, and I believe that the address is
visible across threads.  Just take the address in the startup for each
thread that is created, and store it in some global list that the
collector walks.

-Chris

--
http://nondot.org/sabre/
http://llvm.org/
_______________________________________________
LLVM Developers mailing list
[hidden email]         http://llvm.cs.uiuc.edu
http://lists.cs.uiuc.edu/mailman/listinfo/llvmdev