
Semi-global MMX

Yesterday I came across the paper “Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information”. I had read this paper before, but yesterday I got the urge to write some code and decided to implement it. The semi-global (SG) matching technique is similar to the dynamic programming approaches to stereo vision, except that it integrates costs along multiple paths (e.g., horizontal, vertical, and some diagonals), 16 of these in total. Instead of doing the typical DP backward pass where the solution is extracted, SG uses the sum of these path costs through a pixel as the score and selects the best disparity label for each pixel.
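In other words, the final label at each pixel comes from a winner-take-all over the summed path costs. Here is a minimal scalar sketch of that step (the names, sizes, and `int` cost type are illustrative, not from the paper's code):

```c
#include <assert.h>
#include <limits.h>

#define NPATHS 4   /* illustrative; the paper aggregates 8 or 16 paths */
#define NDISP  8   /* illustrative disparity range */

/* Sum the per-path aggregated costs L_r(p, d) for one pixel and pick
   the disparity with the smallest total (winner-take-all). */
static int select_disparity(int L[NPATHS][NDISP])
{
  int best_d = 0, best_s = INT_MAX;
  for (int d = 0; d < NDISP; ++d) {
    int s = 0;
    for (int r = 0; r < NPATHS; ++r)
      s += L[r][d];                  /* S(p,d) = sum over paths r */
    if (s < best_s) { best_s = s; best_d = d; }
  }
  return best_d;
}
```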

Anyhow, the implementation is straightforward (with a sum-of-absolute-differences matching score, at least). The paper reports that the optimization takes roughly 1 second on some standard benchmarks, which is pretty fast for what is going on. My initial implementation was on the order of dozens of seconds for similar sequences.

The author suggests the use of SIMD and definitely seems to have taken care in his implementation. Enter MMX. I haven’t written any MMX code in a while, but with compiler intrinsics I knew it would not be as hard as the first time I did it (using inline assembly). I decided to take a crack at optimizing my implementation.

My first, thoughtless approach was to process several pixels in parallel. This quickly turned out to be problematic, as the multiple scan traversals make it impossible to do efficiently (at least I think so). The other option is to parallelize the innermost loop: along a scan, for each disparity at a pixel, we must find the minimum over the previous pixel’s costs plus a penalty that depends on the distance between the disparities.
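Concretely, the recurrence along one path looks like the following scalar sketch (the `int` costs and these particular names are my own shorthand; the MMX code below vectorizes this loop four disparities at a time):

```c
#include <limits.h>

/* One step of the per-path recurrence: given the previous pixel's
   costs L_r(p-1, .) in lcost_prev and the matching score C(p, .) in
   score, fill lcost with L_r(p, .). P1 penalizes a one-level disparity
   change, P2 any larger jump. */
static void scalar_inner(const int *score, int *lcost,
                         const int *lcost_prev, int ndisp, int p1, int p2)
{
  int prev_min = INT_MAX;                    /* min over all previous costs */
  for (int k = 0; k < ndisp; ++k)
    if (lcost_prev[k] < prev_min) prev_min = lcost_prev[k];

  for (int d = 0; d < ndisp; ++d) {
    int best = lcost_prev[d];                              /* same disparity */
    if (d > 0 && lcost_prev[d - 1] + p1 < best)
      best = lcost_prev[d - 1] + p1;                       /* one level down */
    if (d + 1 < ndisp && lcost_prev[d + 1] + p1 < best)
      best = lcost_prev[d + 1] + p1;                       /* one level up   */
    if (prev_min + p2 < best)
      best = prev_min + p2;                                /* large jump     */
    lcost[d] = score[d] + best - prev_min;   /* subtract min to stay bounded */
  }
}
```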

/** \brief MMX-based per-pixel accumulation of scores along one path.
 *  \note Uses MMX registers; the caller must issue _m_empty() before
 *        any floating-point code runs.
 */
static void mmx_inner(const cost_type_t * score_base,
                      cost_type_t * obase, cost_type_t * lcost,
                      cost_type_t * lcost_prev, cost_type_t mcost_many[4],
                      int ndisp, const cost_type_t p1[4], const cost_type_t p2[4]){

  __m64 this_mins;
  __m64 mn = *((__m64 *)mcost_many);
  cost_type_t * next = lcost;

  for(int d0=0; d0<ndisp; d0+=4){

    __m64 penalty = *((__m64*)(p1));
    __m64 n = *((__m64*)(lcost_prev + d0));

    //Within-group d-1 and d+1 neighbours (lanes 0 and 3 duplicate
    //themselves, which is harmless under the min)
    n = _m_pminsw(n, _m_paddw(_m_pshufw(n, 0x90), penalty));
    n = _m_pminsw(n, _m_paddw(_m_pshufw(n, 0xF9), penalty));

    //Neighbours that cross the 4-lane boundary: the unaligned loads
    //already line each lane up with its d-1 (resp. d+1) neighbour,
    //and the stricter upper bound avoids reading past lcost_prev
    if(d0){
      __m64 t = *((__m64*)(lcost_prev + d0 - 1));
      n = _m_pminsw(n, _m_paddw(t, penalty));
    }

    if(d0+4<ndisp){
      __m64 t = *((__m64*)(lcost_prev + d0 + 1));
      n = _m_pminsw(n, _m_paddw(t, penalty));
    }

    penalty = *((__m64*)(p2));

    //Now do all the disparities
    for(int d=0; d<ndisp; d+=4){

      __m64 m5 = *((__m64*)(lcost_prev + d));

      //Horizontal min: fold the three rotations of m5 so every lane
      //ends up holding the minimum of the group
      __m64 m6 = _m_pminsw(m5, _m_pshufw(m5, 0x93));
      m6 = _m_pminsw(m6, _m_pshufw(m5, 0x4E));
      m6 = _m_pminsw(m6, _m_pshufw(m5, 0x39));

      n = _m_pminsw(n, _m_paddw(m6, penalty));
    }

    __m64 cost_many = *((__m64*)(score_base + d0));
    //Subtract the previous minimum to keep the costs bounded
    __m64 c = _m_psubw(_m_paddw(cost_many, n), mn);

    *((__m64 *)(lcost + d0)) = c;
    *((__m64 *)(obase + d0)) = _m_paddw(*((__m64 *)(obase + d0)), c);


    if(d0==0) this_mins = c;
    else this_mins = _m_pminsw(this_mins, c);
  }

  this_mins = _m_pminsw(this_mins, _m_pshufw(this_mins, 0x93));

  this_mins = _m_pminsw(this_mins, _m_pshufw(this_mins, 0x4E));

  this_mins = _m_pminsw(this_mins, _m_pshufw(this_mins, 0x39));

  *((__m64 *)(mcost_many)) = this_mins;
}

Now isn’t that pretty? Here’s the kicker. On the system I was developing this on (AMD Phenom II), I get a speedup (rather, a slowdown) of 0.43 (MMX takes 2.29s and basic takes 0.997s for one pass over a 1024×1024 image with 32 disparities).

But on the machine where I had originally developed and tested (older compiler, Intel quad-core), I get a speedup of roughly 8x (MMX takes 0.944s, whereas basic takes 7.8s). This was the machine on which I wrote the original basic implementation and decided it was slow enough to warrant hand optimization.

I am convinced that gcc’s auto-vectorization has improved significantly. Running the gcc 4.3.2 executable on the Intel machine, I get only a little speedup from the hand-written code: 1.2 (MMX takes 0.95s and basic takes 1.12s).

Lesson: writing MMX code is fun (in that you feel like you accomplished something), but the last couple of times I tried, the compiler did a better job (and it is much faster at it, too!).

For the Tsukuba data set, my code takes about 1.4s for 32 disparity levels. Without doing a forward-backward consistency check, it gives pretty decent results too:
[resulting disparity map]

It is still slower than what is reported in the paper (and the paper does forward and backward comparisons). Adding some threading gets this down to about 0.59s (4 threads on the quad core, where another user is eating up part of one CPU). That is a bit better. Further tuning gets it down to about 0.38s (of which 0.1s is the non-parallel matching-cost computation and image loading). That is somewhat better; not real-time, but not worth any more optimization effort.
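One straightforward way to thread the aggregation (a sketch under my own assumptions, not the code I uploaded): the path directions are independent of one another, so they can be farmed out with OpenMP and the per-path results summed afterwards.

```c
#include <assert.h>
#include <string.h>

#define NPATHS 4   /* illustrative; real SG uses 8 or 16 directions */
#define NCOST 16   /* illustrative size of one path's cost buffer   */

/* Stand-in for the real scan-line aggregation along direction r; it
   just fills the buffer deterministically so the example runs. */
static void aggregate_one_path(int r, int *out, int n)
{
  for (int i = 0; i < n; ++i)
    out[i] = r + i;
}

/* Each direction writes only its own buffer, so the loop needs no
   locking; only the final summation touches shared data. */
static void aggregate_all_paths(int total[NCOST])
{
  int path_cost[NPATHS][NCOST];

  #pragma omp parallel for                /* compile with -fopenmp */
  for (int r = 0; r < NPATHS; ++r)
    aggregate_one_path(r, path_cost[r], NCOST);

  memset(total, 0, NCOST * sizeof(int));
  for (int r = 0; r < NPATHS; ++r)
    for (int i = 0; i < NCOST; ++i)
      total[i] += path_cost[r][i];
}
```

Since each direction writes a disjoint buffer, the pragma parallelizes cleanly with no synchronization beyond the implicit barrier at the end of the loop.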

I’ve uploaded the current source for this if you care to check it out (no guarantees of correctness or cleanliness): sgstereo. I was originally searching (albeit quickly) for semi-global matching source code and didn’t find any in the first few hits. Now it is definitely online.

Next time I will have to remember to try compiling on a newer compiler before wasting any time with intrinsics. I think this has happened before…

