Scroll down for previous days.

****** Day 3:

------- --- Steering Panel Discussion

Q. How well has the pass 1/2/3 system worked this time around?
A. Stage 1 is smoother these days because we reserve it for large but self-contained changes. Stage 2 is when we handle changes that touch large parts of the tree, so it still feels unstable, but the smoothness of Stage 1 nowadays is a real improvement. We'd like to have a more serious Stage 3 this time, when we really fix all the bugs, rather than giving up and doing the branch, then fixing bugs in the release branch. In other words, we'd like to cut the release very soon after doing the release branch.

Q. (Dan J.) But that means we'll need somebody to nag developers to fix their bugs during Stage 3.
A. By jove, you're right. Thanks for volunteering.

Q. (Dan J.) The current system gives people no incentive to fix bugs until Stage 3. What about alternating weeks of Stage 1 and Stage 2? Or allowing new changes only if the tree is perceived as stable?
A. How about using slush periods for two reasons: for stabilizing, and for testing large patches about to go in? But individual maintainers kind of do that already.

Q. We need more and better autotesters.
A. Janis has one, but it's hard to debug because it only fires up when bootstrap breaks.
Audience: we can make that happen more often (laughter)

Q. Janis, don't you also have a manual regression finder?
A. Yes, I can fire it off manually as well. It searches for which patch (from cvs on a particular branch) caused breakage. I should make that more available.

Q. How can we get more ports into the primary platform list?
A. If a company steps up and promises it will test and fix promptly, that will be looked on favorably. And even if you're only on the tertiary list and you test and fix problems promptly, that's good.

Q. When will we switch from CVS to SVN?
A. (Where's Dan Berlin?) We don't want to make too many changes simultaneously.

Thanks, everyone!

------- --- Porting gcc and Linux to Cell processor (Ulrich Weigand, IBM)

Cell is a 64 bit ppc, a 512K L2 cache, plus eight "spe's" (all ten of which take up about the same area on the chip). The ppc is an in-order pipeline, with two-way HW multithreading. Each SPE is an independent computer with a 256KB local store as fast as L1 cache, but no direct access to main memory; communication is via mailboxes, signals, DMA (using host virtual addresses), etc.

The SPU has 128 128-bit registers, up to 16-way SIMD. Only one register class, all registers are the same. No virtual memory on the SPEs, no memory protection. Accesses must be 16 byte aligned; the low four bits of the address are ignored. Instructions are also fetched from local store. There's a branch-if-register-zero instruction. There's a compare-and-save-result-in-register instruction. So there are no flag bits, or the flag register is any regular register. No special stack pointer; just register 1 by convention. You can do byte accesses if you really want to.

Audience: Q. Can one SPE access another SPE's local store?
A. No.

At this point, the projector bulb exploded. (Cheers from the audience.)
Audience: The projector must have really been offended by your architecture...

SPE floating point is either 32 bit or 64 bit. The format is the same as used by Sony's PS2. The 64 bit format is closer to IEEE than the 32 bit is.

From the ppc side, you can memory map the SPE local store, and the system will send your accesses via the DMA channel. There are 16 DMA channels, and you can wait for individual channels to finish.

We broke for lunch to give them time to fix the projector.

Sony's going to use this chip for games; Toshiba and IBM might use it for other things. IBM's working on the "Cell Processor Based Workstation", which is a blade like the HS20 or JS20, with two Cell processors. This is the reference platform for Linux on Cell.
IBM LTC Boeblingen (linux-2.6.12), the STI Design Center in Austin, and Sony's SCEA Foster City (gcc-3.4.1) have all been working on Linux and tools, and one of the first deployments will be the Barcelona Supercomputer Center (which already has a supercomputer built out of JS20 blades). They are trying hard to do all their changes the "right way" so they can get all their patches into upstream.

The OS runs only on the PPC, but provides access to the SPEs via a file system abstraction. The SPEs are like devices that happen to be on die and are fully programmable.

Current design, which will change as needed to get it in mainline:
  spufs directories represent virtual SPE context
  Files provide access to SPE resources
    /spu/myvirtspu/{mem,mbox,ibox,wbox,signal1,signal2,...,run}
  spufs_run ioctl on 'run' file controls execution
  Synchronous execution on behalf of PPE thread
  SPE can do DMA back to the user's address space

8 SPEs per chip; two chips; so you can see all 16. (Maybe they'll have the kernel schedule virtual SPE contexts onto physical SPEs, but that would be slow.)

libspu makes the above more portable. Provides calls like spu_create_thread, etc., which calls pthread_create then does all the magic to load code onto the SPE and start it. This means each SPE thread has a real thread id on the host system.

Language and library features:
  Toolchain supports C and a dialect of embedded C++ (no exceptions due to memory limitations).
  Vector data types & intrinsics, SPE assembler intrinsics, which let you write SPE assembly without using traditional inline assembly.
  Limited library facilities available (256KB, eh?); they use a port of newlib
  You can use overlays if you need.
  "System calls" to offload processing to PPE

The SPE can throw signals; these are translated to signals on the Unix side on the thread devoted to that SPE, and the spufs_run system call is restarted. You can really kill an SPE thread just by killing the corresponding host thread.
"System calls" are implemented either in kernel space or in the libspu library (by checking return values from spufs_run, doing the call, and reissuing spufs_run). The SPE is a new ELF flavor EM_SPU 32 bit big-endian No shared library support (but there is some sort of relocatable plugin thingy) Manipulated via cross-binutils Cell: combined boject files SPE executable embedded as a whole by embedspu tool (a wrapper around objcopy) Contained in .spuelf.filename section in ppe object But there are public symbols on PPC side for the size and location of the SPE executable sections Eventually they'll support SPU->PPE symbol references (Carlos: HP has done something similar, have a look at it) GCC support PPE: handled by rs6000 back end - processor-specific tuning, pipeline description SPE: new spu backend - built as cross-compiler - handles vector data types, intrinsics - Middle-end support: branch hints, aggressive if-conversion - Future: gcc 4.x port exploiting auto-vectorization, SMS Debugging: Minimum is an SPE-only debugger that uses spufs instead of ptrace There's also a front end that makes multiple gdb's look like a single one, so you can debug a mixed PPE and SPE process without thinking about the details (although it's a leaky abstraction, prone to deadlock) He's looking into a unified GDB that handles this without kludges (Andrew: "A lot of people have tried...") Context switches (or even just looking at SPE registers from the PPE) take 50 tedious steps... this is why you try to avoid it. Lots of interesting issues in implementing nicer debugger support; he outlined them perhaps in hope of suggestions from the crowd GCC Test suite: they'll use binfmt_misc wrapper to let you run standalone SPE binaries. Cell documentation to be released in the near future Initial Linux distribution to be released soon ------- --- Sitting in hallway between talks, I decided to look at the openssl performance regression, http://gcc.gnu.org/PR19923, again. 
Eventually Michael Meissner wandered by and looked at it, too. He agreed that the problem is the loop; gcc isn't converting the while loop to a countdown loop anymore.

------- --- Structure Aliasing in GCC (Dan Berlin, IBM)

Virtual SSA explicitly represents aliasing analysis.

gcc-4.0 didn't distinguish between structure fields when doing alias analysis, so it thought changing field a might also change field b. There were bug reports because e.g. when people did things like use global structure fields as loop variables, the generated code was poor. gcc-4.1 tries harder to do this right. (I hadn't had my coffee yet, so I couldn't capture the details.)

The audience demanded that he go on and do the rest of his slides on pointer analysis. There are two general approaches: when you see a statement like 'p = q' where p and q are pointers, you can either say 'p can point to the same things as q' (this is called unification analysis, and is heavily patented by Microsoft), or you can say 'p points to at least the things q points to' (this is called subset analysis). These sets of things pointers can point to are used for copy constraints. When you see a cycle of pointer assignments (and quite large ones of thousands of pointers are common), you can collapse the cycle, since they all can point to the same things.

Structure fields are now handled by creating separate virtual variables for all fields. As a result of this pointer analysis, gcc-4.1 is rather better at knowing that variables are not clobbered by stores to other variables.

There is some work left to do, but most of this stuff is in mainline gcc-4.1 already.

This info is not yet propagated to RTL. That would be possible, if it were useful.
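
To make the subset approach above concrete, here is a minimal sketch of how 'p = q' and 'p = &x' constraints could be solved by iterating to a fixed point. It is not GCC's implementation: the names Constraints, PointsTo and solve_subset are invented for illustration, and cycle collapsing is not implemented (the solver just iterates until the sets stop growing).

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not GCC's data structures.
using PointsTo = std::map<std::string, std::set<std::string>>;

struct Constraints {
    // {p, x} records "p = &x": x joins the points-to set of p.
    std::vector<std::pair<std::string, std::string>> address_of;
    // {p, q} records "p = q": p points to at least what q points to.
    std::vector<std::pair<std::string, std::string>> copies;
};

// Subset solving: apply pts(p) |= pts(q) for every copy constraint,
// and repeat until no points-to set grows any more.
inline PointsTo solve_subset(const Constraints& c) {
    PointsTo pts;
    for (const auto& a : c.address_of) pts[a.first].insert(a.second);

    bool changed = true;
    while (changed) {
        changed = false;
        for (const auto& cp : c.copies) {
            // Copy the source set so "p = p" cannot alias the destination.
            const std::set<std::string> src = pts[cp.second];
            std::set<std::string>& dst = pts[cp.first];
            const std::size_t before = dst.size();
            dst.insert(src.begin(), src.end());
            if (dst.size() != before) changed = true;
        }
    }
    return pts;
}
```

With 'p = &x; q = &y; p = q', this gives p the set {x, y} while q stays at {y}; a unification-style analysis would merge the two sets instead. With a cycle such as 'a = b; b = c; c = a', all three pointers end up with identical sets, which is why such a cycle can be collapsed into a single node.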

****** Day 2:

------ --- 

[I had to run away for a couple hours to help deal with Summer of Code applications, so I missed a couple talks ]

------ --- Profile-driven feedback

-fprofile-generate/-fprofile-use
Currently migrating from rtl to tree level
Automake support would be handy

Limitations
* profile sensitive to corruption
* Minimal sanity checking possible, finds problems fairly often
* Threads not supported(!) (we need to either lock the counters and be slow, or use TLS and be a memory hog) (Maybe atomic increment? Only 32 bit :-( )
* Constructors/destructors are broken as runtime is initialized by them
* problems loading plugin dlls
* Only known functions can be used to fork/clone etc.

If profile not available, you can estimate it statically from the program graph
* basic block probs guessed first
* bb frequencies computed by finding stable distribution of Markov chain
* Counts remain 0
* Guessing machinery is used even with profile feedback when given branch/fn was not executed during training run

Reliable heuristics:
* loop iterations
* __builtin_expect
* noreturn
* loop branch/exit

Unreliable heuristics:
* ptr usually !=
* opcode positive: intval >= 0, intval > 0 is likely
* goto: usually not taken
* early return unlikely
* constant return unlikely if there is a return var case
* negative constant return even less likely
* return NULL unlikely

Combining heuristics
How often do the heuristics fire?
reliable heuristic hitrates up to 90% (noreturn 100.01% :-)
Coverage ranges from 0 to 20%
Combined: hitrate 75%, not bad

Basic optimizations
* Inlining
* loop unrolling/peeling/unswitching
* Superblock formation (moving to tree level someday)
* Hot/cold partitioning
* basic block reordering
* function partitioning
* register allocation

Value profile transformations
* notice mul/dividing by power of 2

Result on specfint2000s
* guessed: 6% speedup
* real: 13% speedup
* Currently bloats executable a bit, but it shouldn't eventually

Real world results:
* optimized gcc: 7% speedup compiling large file at -O2
* optimized bash: 10% speedup running configure scripts

Future plans:
* devirtualization
* Revisit tree level optimizations
* Solve the profile feedback problems
* Perhaps low overhead profiling
* make profile available to machine descriptions for instruction choice
* Replace more -Os checks by hot/cold block checks

------ -- Inter-Module Analysis (Geoff Keating, Apple)

This talk covers stuff already done a while ago, and points at what he might do in future.
* IMA for C: done, in 3.4 and 4.0
* IMA for C++: not done yet, is planned

Historically, cc1 would exit after each translation unit. With IMA, cc1 runs for multiple translation units at once, so it has much more information it could use to optimize with.

IMA's been talked about for ages, but never done in gcc. We wanted to get it done in a couple months, so we picked a simple design and focussed just on C. (Later I'd like to add C++.)

Driver changes: with IMA, cc1 takes multiple .c files and produces a single .s file which is assembled to a single .o file. (If you really need a .o file for each .c file, you can probably just create a zero-length .o file.)

C Semantic changes:
* Had to pop the outermost scope between translation units
  This is like what happens when you leave a function, so all the machinery was already there.
* Disambiguate static objects in different TUs
  This was almost already there, wasn't hard
* Structure type compatibility across TUs
  In C, structs (and unions and enums) with same name are compatible sometimes; before IMA, this was easy to check. IMA made this a bit harder.

Cross-Module Linkage
* foo() in foo.c and extern foo() in bar.c need to be linked by compiler if foo.c and bar.c are compiled together
* gcc used to let you declare foo() extern first, and static later, and he wanted to support this (big mistake). When this was removed (in gcc4), they changed how cross-module linkage works.
  But there's only one VAR_DECL, which is nice

Implementation experience
* Implementation was straightforward
* Most bugs also visible with equivalent constructs at block scope, so they weren't introduced by this change
* A few problems of scale exposed by building gcc with IPA: deep (100's of levels!) nesting, complex loop structure, huge single routines, hundreds of thousands of variables, all created by inlining and optimization passes
  Good way to find spots that are O(n^2) in number of basic blocks
* 5% performance gain on gcc

IMA for C++
* everybody wants it
* here's how I'm thinking about doing it
  + reuse same driver changes
  + semantics different
  + One Definition Rule helps here

C++ Types have Linkage
* very different from C -- structs with same name must be identical. This is much easier.

One Definition Rule
If you have same struct defined in multiple places, they must be identical and mean the same thing.
IMA provides a speedup here; you can skip parsing the 2nd definition if you're in a hurry (according to the standard)

export
* permits one TU to use a template defined in another TU
* just like any other declare/define pair (at least with IMA)
* Put into the C++ standard because obviously something like it was needed... but they didn't know exactly what yet.
* (Took EDG 4 person-years to do this)

Pre-Compiled Headers
* For C++, PCH can be separate module
* Allows PCH that contains export
* Needs better dependency analysis

Multilanguage IMA using one big .s file?
* Could we reuse the same scheme?
* we should do some of the work, since it would force the front ends to have clearly defined interfaces
* but it'd be a can of worms

An Elaborate Solution
* have each front end write out IMA for each .c/.cc/.java/.f file
* a JIT or interpreter or link-time optimizing compiler would then use the IL files
* Sound familiar?
* Not planned. We'd love it, though. Can somebody do it in the car on the way home?

(Audience: our car's full.
Audience: Q. What about RMS?
A. We haven't asked yet. Well, we've been negotiating this for two years, and signs are encouraging that he might let us read and write an IL.
Audience: Q. shall we do this in XML, and change our optimizations into XSLT transformations? Or perhaps VHDL?
A. Go away.
Audience: Q. how do we handle Oracle with this? Can we do whole program optimization on just the hot spots?
A. Nah, people do use big machines and do whole-program optimization on Oracle, so don't worry.)

A long and interesting discussion ensued on how this would really work. This would let us catch errors that currently our linker can't see.

--- Interprocedural constant propagation optimizations

* Helps when a function is called with a constant argument
* At least one SPEC program has this kind of call
* Works by producing extra versions of functions which know that particular argument(s) are constant
* In tree-profiling branch, scheduled for inclusion in 4.1

----- --- Interprocedural Analysis

This was originally motivated by the struct reorg work mentioned earlier, and it has limitations, but it's cheap enough to run at -O1.
* What's new for gcc is that these analysis passes are now applied to an entire compilation unit
* used to only be possible in the front end
* Changing this required a lot of work
* Took a year to beat the frontends into submission
* World is now safe for doing IPA. Other people can now put new IPA passes into this framework.

Call Side Effect Analysis
* Determine which static variables may be read or written as a side effect of a call
* Requires *complete* and *correct* call graph
  Front ends must tell the *whole* story for this to work
* Must assume that any function not seen can call back into the module and thus get access to the static variables
* For a *standard* C library you can do better (except qsort and bsearch, and in glibc, printf, which has a strange feature where you can register callbacks -- and printf is everywhere for debugging code, so this is a problem. Can we remove that extension, please?)

Read-only and non-addressable variable detection
* It seems that the front ends aren't doing as good a job at this as this code does, so perhaps it should be removed from the front ends
Q: why is this only turned on at -O2 for C?
A: I would like to run it at -O1; it just doesn't do that right now

Pure and const function detection
* replaces the phase done at the RTL level, which is nice, since it's easier to do correctly here
* one problem: misses some cases because constant propagation and dead code elimination are not run before the detection step. Plan for that is we'll figure it out on the drive back :-)

Type based alias analysis
* simple idea
* complex problem - what does "address is never taken" really mean? Depends on the language and one's reading of the standard

What does "address is never taken" mean?
* unions can screw you up - obviously breaks assumption
  Turns out the tree-level and rtl-level alias analysis were making different assumptions

What if you take the address of a member of a structure?
* As negotiated on the list recently, we will assume that users might do a manual upcast, and access members of surrounding structs; if we don't see them do that, then we assume no aliasing. (Guns don't kill people, people kill people.)

If a type X escapes, its subtypes, supertypes, and types of its members all escape
* Malloc and free are special cased to keep these from killing everything, but abstracted versions of these functions still cause problems

In whole program mode, this algorithm is very effective.

If we make this flow-sensitive, it would have a big effect on the SPEC2000 program that does superficial void abuse.

This is just one little step along the way to good aliasing analysis. It gives a better starting point for the "points to" analysis.

Transformation enhancements:
  promotion of static variables
  * lets you keep static variables in registers across calls
  refinements to call clobbering
  * reduces the number of variables listed as being call clobbered

Percentage improvement on SPEC: up to 2% regression, up to 4% improvement
The regressions are evidence of a register allocator problem

Conclusion
* This is the first round of IPA to be added to GCC
* Most provide only modest improvement when compiling a single module (but big payoff will come someday when we do whole-program optimization)
* Occasionally some trigger big changes
* Many times, the improvement is lost because the analysis overwhelms downstream transformations
* We might be able to be more aggressive for C++ structs

---------

******* Day 1:

--------- --- Register Allocation discussion

Vlad says he'll put his new register allocator infrastructure on a branch in three-four months.

Another person is trying out something. His observation is that sometimes you have to reduce the register pressure below what the architecture can handle before you should even try to do register allocation at all.

--------- --- Yet Another Register Allocator (Vladimir Makarov)

Introduction of tree-ssa etc.
increased register pressure a lot, which causes performance regressions on x86.

The current register allocator has been there since day 1, and really needs updating. There have been several attempts in the past to replace or improve register allocation, but all have failed for one reason or another. The problem is likely due to the extreme complexity of the reload pass. (So it's time to bring out the big guns.)

Approach I am trying: build an infrastructure so we can experiment easily. My goal is to remove reload (cheers from the audience). We probably will end up with different algorithms for machines with highly regular vs. highly irregular register files.

Data structures: allocno is a live range of pseudo-register or reload value. Copy represents potential moves, loads, and stores. I call this representation 'pessimistic splitting' (as opposed to optimistic coalescing).

...

Current status: started six months ago. Probably will take two years. Focussing only on x86 and x86_64. Many SPEC2000 tests are compiled correctly. No performance improvements yet (well, there's a 1.5% improvement in one benchmark that's really hard).

Again, this talk was very long and detailed, so I didn't capture much of it.

--------- --- Structure Splitting (peeling) (IBM, Apple)

Accessing one field of all elements of an array of structures means poor locality.
Two possible fixes:
1. change the access pattern
   linear loop transforms (Daniel Berlin)
2. change the data layout
   reorder data elements within data structure, but leave access patterns alone
   That's the subject of this talk.
   e.g. split the structure up to pack the first field densely at the beginning of the "array". Or reorder the members within the struct (less aggressive).

Data Layout Optimization
  Only do legal reorganizations
  Use CFG with profiling information to generate Close Proximity Graph to pick best places to split
  In extreme cases, they split big structs into many (10!) structs
  They even fix up calls to malloc if needed.
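
As a hand-written illustration of the layout change described above (not the output of the struct-reorg work), the sketch below splits a record with one hot field and a cold payload into two densely packed arrays. The type and function names (Record, ColdPart, SplitRecords, split_records, sum_weights) and the 56-byte payload are all assumptions made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record: one hot field plus a rarely used cold payload.
struct ColdPart {
    char description[56];
};

struct Record {
    double weight;   // hot: read on every pass over the array
    ColdPart cold;   // cold: dragged through the cache anyway
};

// Split ("peeled") layout: the hot field is packed densely on its own.
struct SplitRecords {
    std::vector<double> weight;
    std::vector<ColdPart> cold;
};

// Convert the original array-of-structs into the split layout.
inline SplitRecords split_records(const std::vector<Record>& in) {
    SplitRecords out;
    out.weight.reserve(in.size());
    out.cold.reserve(in.size());
    for (const Record& r : in) {
        out.weight.push_back(r.weight);
        out.cold.push_back(r.cold);
    }
    return out;
}

// Original access pattern: each step skips over a whole Record.
inline double sum_weights(const std::vector<Record>& records) {
    double total = 0.0;
    for (const Record& r : records) total += r.weight;
    return total;
}

// Same loop on the split layout: consecutive doubles, nothing skipped.
inline double sum_weights(const SplitRecords& records) {
    double total = 0.0;
    for (double w : records.weight) total += w;
    return total;
}
```

Both overloads of sum_weights return the same total; the difference is only in how much memory each loop walks over. Doing this by hand also shows why the compiler has to fix up allocation calls: one allocation of Record objects becomes two separate arrays.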
(Audience: don't you have to fix up calls to memset, too? A. We disallow the optimization now if memset is used, but we should probably allow that and handle it properly eventually.)

Performance results on SPEC benchmarks:
  Good news: nothing got any worse (rousing applause from audience)
  Better news: 'art' benchmark improved by 50%!
  Sad news: nothing else got any better, which was disappointing given that they had verified that this kind of fix worked well when done by hand. They think perhaps their safety analysis (which is flow-insensitive) is too safe. Maybe when they do flow-sensitive safety analysis, that'll just fix it.

Compile time overhead: 1 to 6%. Not too bad.

Code is in struct-reorg branch. All three stages ready and working. They're about to start submitting bits of it for comment, and will submit to mainline as soon as it makes sense.

Their presentation was long and detailed, and given by three people, so it looks like they're very serious about this. (Unfortunately, I figured out how to check my email remotely at the beginning of the talk, so I spent most of my time trying to not laugh out loud at jokes forwarded by my wife instead of paying attention... bad Dan...)

--------- --- Static stack requirement analysis in gcc

Not much of a problem in single-threaded apps.
Multithreaded apps, though, are hard because you have to set the stack size for each thread on creation, and 32 bit virtual space is not infinite.
And the page protection at the end of the stack isn't really effective without stack checking.

In safety critical systems (e.g. DO-178B std), stack overflows are considered major threats. Static guarantees are highly desirable. Customers in this situation are willing to accept some limitations on the language.
Stack checking (-fstack-check) approach can be expensive, doesn't give static guarantee

Instrumentation and Testing approach
  Fill with 0xdeadbeef before calling, check on return
  Lets you know how much you're using
  Easy to do, widely available (e.g. VxWorks has a gui for it)
  But approach is expensive, not very strong guarantees
  How much testing needed? DO-178B requires full coverage testing, but that's not enough to find the longest path!
  So it would be nice if you could find the longest paths first, and start by testing them.

Static analysis approach with gcc
* Find the stack frame size of each fn (new option -fstack-usage)
* Get complete call graph (new option -fcallgraph)
* Compute max path on weighted graph

-fstack-usage option
  Even handles alloca to some extent
  Found a 300K function stack frame in GNAT runtime
  Finds problems that -fstack-check misses sometimes
  Found problem in libgcc in unwind-dw2.c's execute_cfa_program

Using perl for postprocessing for the moment

Experiments on real code are promising with small apps.
Experiments with huge apps (1.5MB): no reliable results due to massive use of recursion, dynamically sized locals, and indirect calls in this particular app. Did find that compilation time impact of options was not bad at all. Results took only seconds to obtain. So this is promising for other large apps that refrain from heavy recursion, etc. And it's really nice because you get feedback early in development rather than during testing.

Future work:
* port these options into gcc-4.x
* use higher-level semantic info to deal with indirect and dispatching calls
* productize the postprocessor (it's too rough now)
* integrate with IDE

Conclusion
* With proper language restrictions, guarantees against stack overflows can be obtained early in development
* gcc offers an ideal framework for this kind of analysis

Audience: Q. when will you get this into gcc-4.x?
A. Maybe in a couple months?
Need to think about format a bit more, design isn't really finalized.

--------- --- Lunch

Anthony and I ended up tagging along with Geoff Keating, Eric Christopher, and Angela Thomas. Now I can say I did lunch with somebody who works at Tivo! :-)

--------- --- Intel: C/C++ compatibility on Linux

Started off explaining why binary compatibility from one compiler to another is really important for Linux. Currently, there may be *no* version of gcc that supports all the third party libraries you want to use! Getting third party closed source library developers to support new versions can be hard (and expensive). We can ameliorate this by trying to provide maximum backwards compatibility.

There are four kinds of compatibility:
* Source (can compile old source on new system)
* Binary (can run old binary on new system)
* Compiler (can mix code compiled on old and new systems)
* Library (library header files can be updated without making code that depends on the older version incompatible; allows you to build an app against an old library, and run it against a new one. This is getting more important as more and more functions move into templates in headers)

We're doing pretty well at the first two. And recently we're starting to do well at compiler compatibility, thanks to the new C++ ABI, and its adoption by the Linux development community, including C++ ABI conformance test suites for compilers, and mix-and-match testing to identify additional issues the ABI may not cover (e.g. stack unwinding).

"Steps to Making Two Compilers Interoperable"
This is going on, and issues get worked out now and then e.g. on the abi mailing list, and shows that people are really working towards full interoperability.
Sometimes you have to implement the other compiler's ABI deviations.

Cross Version C++ Library Compatibility Proposal: Attribute Strong

Maybe attribute Strong could be used to version standard C++ library versions.
  ...
  using namespace ns2 __attribute((__strong__));
  ...
Means "search this namespace even if you normally wouldn't". (ten line example actually makes this clearer, but I can't type it here)

Use this to pull in particular versions of std:: in the standard include files. Mangling done with original namespace name, e.g. my_version2::, not with std::. Library developers can explicitly say std_version2:: instead of std:: if they want a stable version, using a macro to make it easy to compile with new version later.

Users can then mix old third-party libraries built with old libstdc++ in programs built with new libstdc++! Some effort required, but at least it'd be possible.

Audience:
Q. Don't forget about debugger compatibility.
Q. Conversion routines? How can it work? Automatic conversion in overload resolution sounds dangerous.
A. The user would have to invoke them explicitly sometimes, and it won't work for all programs. You'll get compile errors in the case you're worried about.
Q. What about old versions already released?
A. This is a suggestion for future versions of libstdc++.
Q. Will you be submitting patches?
A. We wanted to see if you shot us for simply suggesting the idea first.
Q. What about model-specific code (amd, i586, etc.)? Will this let library developers release model-specific fat libraries?
Q. Why isn't namespace aliasing an appropriate solution to this problem?
A (from audience) because it doesn't let you link both old and new into same binary. gcc uses attribute strong to let you build half your app with optimized libraries and half with debug mode libraries. Let's take this offline.

--------- --- Discussion

Geoff Keating mentioned that they're taking the trouble at Apple of letting people use new compilers with old system libraries (so they can build with gcc-4.0 for older releases of OS X). -isysroot is very handy for this.
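
Since the ten line example from the attribute Strong slides isn't reproduced in these notes, here is a stand-in sketch of the same versioning idea. It uses C++11 inline namespaces rather than the strong using extension itself; GCC's documentation later described the two as equivalent, but that substitution is mine, not the speakers'. The library name mylib, the versions v1 and v2, and the Buffer type are all hypothetical.

```cpp
// Hypothetical library "mylib" with two ABI versions side by side.
namespace mylib {

namespace v1 {
struct Buffer {
    static constexpr int abi_version = 1;
    int length;
};
}  // namespace v1

// Members of an inline namespace are found as if they were declared
// in mylib itself, but they are still named (and mangled) mylib::v2::...
inline namespace v2 {
struct Buffer {
    static constexpr int abi_version = 2;
    long length;
    long capacity;
};
}  // namespace v2

}  // namespace mylib

// Code that just says mylib::Buffer gets the current default, v2.
constexpr int default_version() { return mylib::Buffer::abi_version; }

// Code that needs the old layout asks for it by its versioned name.
constexpr int pinned_version() { return mylib::v1::Buffer::abi_version; }
```

Because the two Buffer types have different qualified names, objects built against either version can in principle be linked into one binary, which is the property the audience answer about namespace aliasing was pointing at.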

--------- --- Paolo: "The C++ Library is being enhanced"

a big chunk of tr1 is in gcc-4.0

"A shared-ownership semantics smart pointer is probably one of the most requested additions, and one of the hardest to implement correctly." (std::auto_ptr provides transfer-of-ownership semantics, which aren't needed as often)

So they added two new class templates:
  shared_ptr<X> p(new X);
    You can get the reference count for debugging purposes
  weak_ptr stores a "weak reference" to an object referenced by a shared_ptr.

Audience member: I've used shared_ptr already in a multithreaded server. It was very slick, thanks!

This means atomicity primitives now have to be supported and uniform across all targets. Already in mainline for i486, x86_64, ia64, powerpc, alpha, s390x.

TR1 adds reference_wrapper, a way to put references into containers (it adds CopyConstructible and Assignable semantics to references). Without this, you can't pass objects through template parameters to functions that expect references.

There's now a nice function<> wrapper that lets you hold any kind of function-like object in a variable.

There's a fun generalization of bind1st, bind2nd:
  bind(f, 3, _1)(four);
  bind(f, _2, _1)(four, six);
where _1 means "bind the first argument", etc.
This is hard for the compiler front end - it spends a lot of time in name lookup. We're 2x slower than the boost version of bind.
Audience: we know how to speed this up, just need to find the time to do it.

Metaprogramming / type traits improvements
Large set of categories: is_void, is_integral, is_pointer, etc; is_scalar, etc; is_const, etc.
Also new stuff:
  const size_t align = alignment_of::value;
  typedef aligned_storage<8, align>::type aligned_type;
yields "a type usable as uninitialized storage for any object whose size is at most 8 and whose alignment is a divisor of align"

Tuple types: generalization of std::pair.
  tie(a, b) = c;

Finally, standard replacements for __gnu_cxx::hash_map - unordered_{set,map,multiset,multimap}
And a specialization is provided for basic_string, no more defining your own!
Waiting on copyright assignment from Google so Matt Austern can maintain his own code.

tr1 pushes the language to its limits. Some of this stuff is next to impossible to implement in pure C++ without compiler support, e.g. is_union vs. is_class. So implementors are allowed to add extensions to support this for now; c++200x will standardize reflection.

aligned_storage *can't* be implemented properly in gcc right now, so they're cheating and using specializations of _Align in the range 1 to 32 for now.

Can't add tr1:: identifiers to std:: because users might have e.g.
  #define is_void 1

On the subject of possible future enhancements to the core language: In c++200x, they might even provide a way to avoid macro clashes, e.g.
  #define A 9
  #scope
  int A = 3; // ok!
  #endscope
Audience: no! Was that proposed over beer?

Or, on a much bigger scope, modules, e.g. putting std into a "module", in which case macros defined in your translation unit cannot possibly affect anything defined in module std. People at EDG are trying this out... using export... yow.

And, later, tr2 will add lots more stuff, but that's for the next gcc summit.

Thanks to Matt Austern, Boost, and many others for their contributions to gcc's implementation of tr1.

Audience: I can confirm that we'll need builtins for the reflection stuff. It's ugly, so I haven't contributed it yet.

--------- --- Toon Moene: Compiling a million line weather prediction program

Toon started out with a quick introduction to the physics of weather prediction and how to model them in Fortran, and went on to list in detail the errors he hit compiling his million-line app with gfortran-4.0.0. gfortran found lots of problems in his code, which was nice, and only had one ICE.

After about 15 minutes, he dove in to interesting problems.
* gfortran's -O2 heuristics regarding induction variable reduction are a bit off for him. (Audience says mainline is better at this already.)
* gfortran's autovectorization setup is expensive because it has to handle misaligned arrays sometimes; he'd like the common case to be fast and the misaligned case pushed out of line, please. The audience says this is fixed, but not yet in mainline.

Audience questions:

Q. Do you care about higher level things like OpenMP?
A. He says they don't use OpenMP anymore, since everything is done with MPI.

Q. You say gfortran is better at catching errors. Have you filed bug reports against these other compilers?
A. Novel idea. We haven't yet.

Intermission

Talked with David Edelsohn of IBM a bit about performance regressions in gcc-3.4 and gcc-4.0. He says gcc-4.0 will do a lot better at c++ than gcc-3.4 did; they figured out how to scalarize object members, so they can pull things into registers better. He also says that when filing PRs about performance regressions, it's helpful to have not only a small testcase and the resulting assembly, but also, if possible, the smallest change to the C testcase that works around the performance problem.

--------- --- Breakfast

The waiter at the conference breakfast said "They told us to watch out for you guys. Are you weird?"

---------

******* Day -1:

I take the hotel's first shuttlebus of the morning, and arrive at the airport at 5:10 for a 6:22 flight. When I try to check in, I find that my reservation for the Chicago to Ottawa leg is intact, but there is no record of my San Jose to Chicago flight! The agent tells me that missing my flight yesterday caused my entire trip to be cancelled, and that I was lucky any of it was left. Worse, the flight in question was overbooked, so she couldn't get me a seat. She put me on standby and told me to get over there and hope.

At the gate, I paced nervously while the other passengers boarded.
At the end, about five people were waiting on standby for the one remaining seat. The gate agent was angrily talking with a distraught passenger, repeatedly telling her "I don't have time to deal with you. Do you understand me? Go to the main counter and tell them what you want to do." The 6:22 filled up, and I had to wait for the next flight.

I called Southwest and asked if they could rescue me. Sadly, they don't have a direct flight to Chicago or Ottawa from San Jose.

To pass the time, I fire up the wireless and see if I can connect; there are two T-Mobile access points and one Waypoint access point. I can connect to both services, but decline to pay their high fees. Still, it's nice to know everything's working technically, after the problem connecting in the hotel. I also figure out how to click the middle mouse button (I have to disable UltraNav in the control panel).

When the gate counter opens to check in for the 8am flight, I ask where I am in the standby order. They scratch their heads, call the main desk, and tell me to get on the white courtesy phone and call their main office. I do, and the nice lady tells me that since I missed my flight yesterday, all the rest of my flights have been cancelled, my tickets are worthless, and it will cost me $1900 to continue my trip.

After grumbling and kicking a bit, I gave in and bought the new ticket. I am now scheduled to arrive in Ottawa at 10:30 PM, so I'll miss the opening night party. Just as well; I'm not in a party mood at the moment.

Seems to me Travelocity ought to put a warning on its multihop page like this: "Warning: failure to use one leg of this trip will automatically cancel all later legs. These are not individual tickets, you must use them all exactly as issued, or they all become void. Even simply being late for one flight may render the entire set of tickets null and void."

Which brings up an obvious question: Why did I buy a multihop trip, rather than booking a bunch of one-way flights?
Had I done that, I would have been able to breeze through my initial LAX checkin via the self-service checkin booths, and made the flight; and even if I didn't make the flight, having them separate would have kept American from invalidating all my tickets when I jumped carriers on the first leg.

I actually made it on the next flight. Amazing they didn't cancel my ticket for looking at them funny, or something.

The flight was uneventful. Well, we did see a big cluster of a couple hundred windmills, we did hear a scary Whoop Whoop noise for a while from the cockpit, and the airplane did pitch up suddenly to avoid a jet coming straight for us about an hour outside Chicago. (Oddly, nobody else noticed. I guess that's just as well, too.)

I tried to tell the nice lady from Paris sitting next to me about the windmills, but it took me about two minutes to figure out how to communicate it to her. (I don't think I'll forget "moulin a vent" ever again.) I couldn't understand half of her heavily accented English, so I guess I'm no better than that receptionist.

I knew I was no longer too mad about the ticket and the delay when I started hearing They Might Be Giants songs in my head.

At O'Hare, I had three hours to kill, so I wandered around, sampling the fine McDonald's cuisine, reading, and trying the wireless. The first three places I tested (a big food court, and gates G8 and K1) had no usable access points, but the fourth place (the other side of my gate, G8) had a usable T-Mobile access point.

Unfortunately, there was also a Cinnabon next to my gate. I held out for nearly three hours, then succumbed to the aroma, knowing full well that the faux "caramel pecan" rolls they sell always disappoint once in the mouth. And sure enough, the bread was tough, the "caramel" sauce cloyingly sweet and artificially flavored, and the "pecans" were actually pralines, not nuts.
It's as if the soulless corporate machine that is Cinnabon actively tries to avoid producing the kind of yummy cinnamon roll made by raising yeast dough, rolling it in sugar and cinnamon, topping with real pecans and butter, and heating so the butter starts to melt. (I fondly remember them tasting particularly good at the Pie and Burger at 4AM after a hard night upgrading the scoreboard at the local stadium, but that's another story :-) Once again, I swear never to buy another Cinnabon as long as I live, but someday the arousing aroma of cinnamon will once again overcome me. Damn you, Cinnabon! Damn you!

The Ottawa airport and customs went extremely quickly, as the airport was nearly deserted but nicely staffed. I exchanged some money, bought four bus tickets at the information desk, and took the #97 bus into town. The hotel was an easy stroll from the bus stop, and my roommates were already there. A quick call to LC and DS at home and off to bed.

******* Day -2:

Arrived at airport at 7:25 for an 8:05 flight (hey, it works with Southwest). Parked at Park One, right next to Terminal 1. Asked the valet where the American terminal is, was told Terminal 3. Walked to Terminal 3, oops, then to Terminal 4. At 7:40, tried using the self-serve checkin, but failed, since my Day 2 flight ends in Toronto, and that's apparently enough to taint my Day 1 trip. Crap. So I get in line. The line takes 40 minutes, and I miss the flight. They put me on standby for the 10AM flight, but tell me it's fully booked, and I probably won't make it.

Just as I'm about to hand them my suitcase, I realize that Southwest probably has seats free. Sure enough, a call to 1-800-IFLYSWA later, I have a guaranteed seat on the SWA 10AM flight. I try to get a gate agent's attention to tell them I'm switching airlines, but they're all far too busy, so after a minute I give up and head to Terminal 1. I get there early enough to fly standby on the 9AM flight, and wonder of wonders, I actually fly at 9AM. Score!
While in line, I run into a friend from the office going to the same place, so I hop a ride in his rental car. Double score! But when we get to the office, he drops me off, and we both forget about my suitcase in his trunk. I don't realize this until 8PM as I'm leaving for my hotel. By 10PM, I've verified that the suitcase is in the trunk of vehicle 7231792, lost somewhere in a sea of 1200 cars spread across four lots, and the rental car company is sorry, but they won't be able to locate it until after my flight leaves the next day. Oh, well, all I'll miss out of it is my cellphone charger; the rest is easily replaceable.

While checking in at the hotel, I overheard the receptionist struggling to understand somebody on the phone. "French", she said. I offered to help, so she handed me the phone. The voice on the line said "Je voudrais parler a Gabrielle Cohen Enriquez". I told the receptionist the name, and she said "Oh, ok." Some people just can't understand heavy accents, I guess.

The hotel has wireless Internet, but I can't connect to it for some reason. Fortunately, they have a real Ethernet jack, and that works. I can remote log in to my workstation, and run my mail client via remote X over ssh from my Windows laptop using Cygwin's nice rootless X Windows server. (OK, Mozilla is almost unusably slow on a 1Mbit/sec remote X link, but I am able to look up things on our intranet, which is a lifesaver. Correctness first, then performance. Cygwin's X server has really come a long way recently; it's very slick, as long as you know how to start it (hint: type 'startx').) I struggled a bit with X text selection -- couldn't figure out how to click the middle mouse button.

-- fin --