OK, coming back to this point: a friend taught me that if we use += (resp. -=) to build + (resp. -), we don't need an additional temporary object. It would also comply with the standard way of doing it in C++. The downside is that this always requires a temporary matrix object, which affects performance.
Would that be a good solution for you?
Note that adding an extra call level would not degrade performance, since all of it is inlined.
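To make the idea concrete, here is a minimal sketch of + and - built on top of += and -=. The `Mat2` type and its `v` member are hypothetical, made up for illustration, and not the project's actual matrix class. The left operand is taken by value, so that copy is what gets modified and returned, and no separate named temporary is declared in the function body.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 2x2 matrix stored as 4 doubles (illustration only).
struct Mat2 {
    std::array<double, 4> v{};

    // The compound operators do the real work, in place.
    Mat2& operator+=(const Mat2& rhs) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] += rhs.v[i];
        return *this;
    }
    Mat2& operator-=(const Mat2& rhs) {
        for (std::size_t i = 0; i < v.size(); ++i) v[i] -= rhs.v[i];
        return *this;
    }
};

// lhs is passed by value: the caller's operand is untouched,
// and the copy is updated in place and returned.
inline Mat2 operator+(Mat2 lhs, const Mat2& rhs) {
    lhs += rhs;
    return lhs;
}

inline Mat2 operator-(Mat2 lhs, const Mat2& rhs) {
    lhs -= rhs;
    return lhs;
}
```

With this layout the non-member operators are thin wrappers, which is why the extra call level should disappear once inlined.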
BTW: I've been doing some testing, and having to build a temporary object when using "*" to build "*=" doesn't result in any measurable difference. (Granted, I didn't spend time setting up a first-class test setup and timing, but still, the actual computation should take far longer than any temporary object creation.)
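For reference, this is roughly the shape I mean for "*=" built from "*", as a sketch only. `Mat2x2` is a hypothetical row-major 2x2 type, not the real class. For matrix products the temporary is hard to avoid anyway, because writing results straight into the left operand would overwrite entries that are still needed for the rest of the row.

```cpp
#include <array>
#include <cstddef>

// Hypothetical row-major 2x2 matrix (illustration only).
struct Mat2x2 {
    std::array<double, 4> v{};
    Mat2x2& operator*=(const Mat2x2& rhs);
};

// The product is computed into a fresh, zero-initialized result object.
inline Mat2x2 operator*(const Mat2x2& a, const Mat2x2& b) {
    Mat2x2 r;
    for (std::size_t i = 0; i < 2; ++i)
        for (std::size_t j = 0; j < 2; ++j)
            for (std::size_t k = 0; k < 2; ++k)
                r.v[i * 2 + j] += a.v[i * 2 + k] * b.v[k * 2 + j];
    return r;
}

// "*=" reuses "*": the temporary holds the product until it is assigned back,
// which also keeps self-multiplication (m *= m) correct.
inline Mat2x2& Mat2x2::operator*=(const Mat2x2& rhs) {
    *this = *this * rhs;
    return *this;
}
```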
BTW2: Inlining gives a slight increase in performance, but as far as my testing goes, it's only around 1%. I don't know how much it increases the overall software size, but I would be interested in experimenting on that part. Is there any rationale for inlining all of it?