# Posts Tagged stereo

### Variational Displacement Map Recovery

Shortly after working on some variational scene flow (from a single moving camera), I thought it might be a good idea to apply the same ideas to reconstruct both a displacement map and a flow map on top of a base mesh. The variational formulation for displacement map estimation is more or less the same. I parameterized the displacement as motion along the normal (something that we have done before), so the objective is to find the displacements on the mesh such that the image score is minimized (in this case, pairwise SSD scores), while enforcing a regularization constraint over the displacements (and flow vectors) in the uv-coordinates.
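In symbols, the objective is roughly of the following form (my notation, not necessarily what the code uses: D(u) is the scalar displacement at base-mesh point x(u) along normal n(u), π_i projects into image i, and λ weights the smoothness term):

```latex
E(D) = \sum_{i<j} \sum_{u} \mathrm{SSD}\big(
         I_i(\pi_i(x(u) + D(u)\,n(u))),\;
         I_j(\pi_j(x(u) + D(u)\,n(u))) \big)
     + \lambda \int \|\nabla_{uv} D\|^2 \, du
```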

I had implemented this idea, but barely tested it on anything. This last week, I figured that I could use parts of the project to generate some data, so I wanted to share my results. Below is a sample of three input images from a synthetic sequence. The images were lit from several lights to ensure the best conditions for shape estimation (e.g., so the SSD score wouldn’t get confused). The results look pretty good. And they should: this is a pretty good situation for stereo.

The idea of solving for flow required that there were multiple images of the object deforming over time. Again, I tested this on a similar sequence, where now the object had some texture (to enable the flow recovery), and I also introduced some motion. The idea is now to recover both the displacement map (which ensures stereo consistency at time t=0) and the 3D flow map that warps this surface forward in time (t > 0). Ideally, there would also be some temporal consistency between flow maps at t > 0, but for now I simply solved for the displacement and flow simultaneously for pairs (t=0, t=1), (t=0, t=2), etc.
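For a pair (t=0, t=k), the joint objective then looks something like the sketch below (again my notation: F_k is the 3D flow map warping the t=0 surface to time k, and the stereo and flow data terms are pairwise SSD scores as before):

```latex
E(D, F_k) = E_{\text{stereo}}(D) + E_{\text{flow}}(D, F_k)
          + \lambda_D \int \|\nabla_{uv} D\|^2 \, du
          + \lambda_F \int \|\nabla_{uv} F_k\|^2 \, du
```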

In this case the input sequences look something like the sequence below:

Again, the reconstruction, for the most part, was okay. There is one exception: the displaced meshes sometimes overlap/intersect, which means that they are not as useful in the application that I wanted to use them in (that is, without post-processing). Notice that there is flow roughly in the regions of the eyes and near the mouth, which agrees with the input sequence. The displacement looked similar to the non-flowed case.

The resulting recovered mesh appears beside the input rendering in the following video. I probably could have chosen the regularization parameters better. If the video doesn’t load, try this link: flowed_result.

### Semi-global MMX

Yesterday I came across the paper “Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information”. I have read this paper before, but yesterday I got the urge to write some code and decided to implement it. The semi-global (SG) matching technique is similar to the dynamic programming approaches for stereo vision, except that it integrates costs across multiple paths (e.g., horizontal, vertical, and some diagonals), 16 in total. Instead of doing the typical DP backward pass where the solution is extracted, SG uses the sum of these path costs through a pixel as the score and selects the best disparity label for each pixel.
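For reference, the cost along a path in direction r accumulates per pixel p and disparity d as

```latex
L_r(\mathbf{p}, d) = C(\mathbf{p}, d)
  + \min\big( L_r(\mathbf{p}-\mathbf{r}, d),\;
              L_r(\mathbf{p}-\mathbf{r}, d-1) + P_1,\;
              L_r(\mathbf{p}-\mathbf{r}, d+1) + P_1,\;
              \min_i L_r(\mathbf{p}-\mathbf{r}, i) + P_2 \big)
  - \min_k L_r(\mathbf{p}-\mathbf{r}, k)
```

where P1 penalizes a one-level disparity change, P2 penalizes larger jumps, and subtracting the previous pixel’s minimum keeps the values bounded. The final score is S(p, d) = Σ_r L_r(p, d), and each pixel takes the disparity minimizing S.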

Anyhow, the implementation is straightforward (with a sum-of-absolute-differences matching score, at least). The paper reports that the optimization takes roughly 1 second on some standard benchmarks, which is pretty fast for what is going on. My initial implementation took on the order of dozens of seconds for similar sequences.
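The matching cost itself is the simple part; a minimal sketch of a windowed SAD cost might look like the following (my own illustrative helper, not the post’s code, assuming rectified grayscale images so a match is a horizontal shift by d):

```c
#include <stdint.h>
#include <stdlib.h>

/* Windowed sum-of-absolute-differences cost at pixel (x, y) for disparity d.
   Assumes the window stays inside both images and disparity shifts left. */
static uint16_t sad_cost(const uint8_t *left, const uint8_t *right,
                         int width, int x, int y, int d, int win)
{
    uint16_t s = 0;
    for (int dy = -win; dy <= win; ++dy)
        for (int dx = -win; dx <= win; ++dx)
            s += (uint16_t)abs((int)left[(y + dy) * width + x + dx]
                             - (int)right[(y + dy) * width + x + dx - d]);
    return s;
}
```

These per-pixel, per-disparity costs are what the semi-global aggregation then smooths along the 16 paths.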

The author suggests the use of SIMD, and definitely seems to have taken care in his implementation. Enter MMX. I haven’t written any MMX code for a while, but with compiler intrinsics I knew it would not be as hard as the first time I did it (using inline assembly). I decided to take a crack at optimizing my implementation.

My first, thoughtless approach was to process several pixels in parallel. This quickly turned out to be problematic, as the multiple scan traversals would make it impossible to do efficiently (at least I think so). The other option is to parallelize the innermost loop along a scan: for each disparity at a pixel, find the minimum over the previous pixel’s costs plus a penalty that depends on the distance between the disparities.
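Before vectorizing, the scalar version of that innermost loop is worth spelling out; a minimal sketch (my own restatement, with hypothetical names) of the per-pixel path update is:

```c
#include <stdint.h>

/* Scalar per-pixel path update: for each disparity d, take the matching cost
   plus the minimum over the previous pixel's path costs, penalized by p1 for
   a one-level disparity change and p2 for any larger jump. Subtracting the
   previous minimum keeps the accumulated values bounded. */
static void sgm_step(const uint16_t *cost, const uint16_t *prev,
                     uint16_t *out, int ndisp, uint16_t p1, uint16_t p2)
{
    uint16_t prev_min = prev[0];
    for (int d = 1; d < ndisp; ++d)
        if (prev[d] < prev_min) prev_min = prev[d];

    for (int d = 0; d < ndisp; ++d) {
        int best = prev[d];                          /* same disparity      */
        if (d > 0 && prev[d - 1] + p1 < best)        /* d-1, small penalty  */
            best = prev[d - 1] + p1;
        if (d + 1 < ndisp && prev[d + 1] + p1 < best)/* d+1, small penalty  */
            best = prev[d + 1] + p1;
        if (prev_min + p2 < best)                    /* big jump, penalty p2 */
            best = prev_min + p2;
        out[d] = (uint16_t)(cost[d] + best - prev_min);
    }
}
```

The MMX version below does four disparities of this update at a time, which is where the shuffles for the d±1 neighbors come from.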

```c
/** \brief MMX-based per-pixel accumulation of scores. */
static void mmx_inner(const cost_type_t * score_base, cost_type_t * obase,
                      cost_type_t * lcost, cost_type_t * lcost_prev,
                      cost_type_t mcost_many[4], int ndisp,
                      const cost_type_t p1[4], const cost_type_t p2[4])
{
    __m64 this_mins;
    __m64 mn = *((__m64 *)mcost_many);

    for (int d0 = 0; d0 < ndisp; d0 += 4) {
        __m64 penalty = *((__m64 *)(p1));
        __m64 n = *((__m64 *)(lcost_prev + d0));

        // Only consider middle shifts (Are these correct?)
        n = _m_pminsw(n, _m_pshufw(n, 0x90) + penalty);
        n = _m_pminsw(n, _m_pshufw(n, 0xF9) + penalty);

        // Need two more operations for the boundaries...
        if (d0) {
            __m64 t = *((__m64 *)(lcost_prev + d0 - 1));
            n = _m_pminsw(n, _m_pshufw(t, 0xE7) + penalty);
        }
        if (d0 + 1 < ndisp) {
            __m64 t = *((__m64 *)(lcost_prev + d0 + 1));
            n = _m_pminsw(n, _m_pshufw(t, 0x24) + penalty); // shuffle t (was n, leaving t unused)
        }

        penalty = *((__m64 *)(p2));

        // Now do all the disparities
        for (int d = 0; d < ndisp; d += 4) {
            __m64 m5 = *((__m64 *)(lcost_prev + d));
            __m64 m6 = _m_pminsw(m5, _m_pshufw(m5, 0x93));
            m6 = _m_pminsw(m6, _m_pshufw(m5, 0x4E));
            m6 = _m_pminsw(m6, _m_pshufw(m5, 0x39));
            n = _m_pminsw(n, m6 + penalty);
        }

        __m64 cost_many = *((__m64 *)(score_base + d0));
        __m64 c = cost_many + n - mn;
        *((__m64 *)(lcost + d0)) = c;
        *((__m64 *)(obase + d0)) += c;

        if (d0 == 0) this_mins = c;
        else         this_mins = _m_pminsw(this_mins, c);
    }

    this_mins = _m_pminsw(this_mins, _m_pshufw(this_mins, 0x93));
    this_mins = _m_pminsw(this_mins, _m_pshufw(this_mins, 0x4E));
    this_mins = _m_pminsw(this_mins, _m_pshufw(this_mins, 0x39));
    *((__m64 *)(mcost_many)) = this_mins;
}
```

Now isn’t that pretty? Here’s the kicker. On the system I was developing this on (AMD Phenom II), I get a speedup (rather, a slowdown) of 0.43x: MMX takes 2.29s, while the basic version takes 0.997s, for a pass through a 1024×1024 image with 32 disparities.

But, on the machine where I had originally developed and tested (older compiler, Intel Quad Core), I get a speedup of roughly 10x: MMX takes 0.944s, whereas the basic version takes 7.8s. This was the machine on which I wrote the original basic implementation and decided it was slow enough to warrant hand optimization.

I am convinced that gcc’s internal vectorization has improved significantly. Running the gcc 4.3.2 executable on the Intel machine, I get a little bit of a speedup: 1.2x (MMX takes 0.95s and basic takes 1.12s).

Lesson: writing MMX code is fun (in that you feel like you accomplished something), but the last couple of times I tried, the compiler has done a better job (and it is much faster at it, too!).

For the Tsukuba data set, my code takes about 1.4s for 32 disparity levels. Without doing a forward-backward comparison, it gives pretty decent results too:

It is still slower than what is reported in the paper (and the paper does forward and backward comparisons). Adding some threading gets this down to about 0.59s (4 threads on the quad core, where another user is eating up some of one CPU). That is a bit better. It can be brought down further to about 0.38s (where 0.1s is the non-parallel matching cost computation and image loading). That is somewhat better; not real-time, but not worth any more effort to optimize.
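The post doesn’t say how the work was split across threads; one natural decomposition (an assumption on my part, with a placeholder per-direction pass) is across the path directions, since each direction’s scan is independent until the final sum:

```c
#include <stdint.h>
#include <stdlib.h>

enum { NDIRS = 16 };

/* Placeholder for a real per-direction scan (e.g. the per-pixel recurrence);
   here it just adds the raw matching cost so the sketch is self-contained. */
static void one_direction(int dir, const uint16_t *cost, uint16_t *acc, int n)
{
    (void)dir;
    for (int i = 0; i < n; ++i)
        acc[i] += cost[i];
}

/* Run each direction into a private buffer (the loop is trivially
   parallel), then reduce the buffers into the summed score. */
static void aggregate(const uint16_t *cost, uint32_t *ssum, int n)
{
    uint16_t *bufs = calloc((size_t)NDIRS * n, sizeof *bufs);
    #pragma omp parallel for /* pragma is simply ignored without OpenMP */
    for (int dir = 0; dir < NDIRS; ++dir)
        one_direction(dir, cost, bufs + (size_t)dir * n, n);
    for (int dir = 0; dir < NDIRS; ++dir)
        for (int i = 0; i < n; ++i)
            ssum[i] += bufs[(size_t)dir * n + i];
    free(bufs);
}
```

Because each direction writes only its own buffer, there is no locking in the parallel section; the serial reduction at the end is cheap compared to the scans.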

I’ve uploaded the current source for this if you care to check it out. I don’t guarantee correctness or cleanliness: sgstereo. I was originally looking (albeit quickly) for semi-global matching source code and couldn’t find any in the first few hits. Now it is definitely online.

Next time I will have to remember to try compiling on a newer compiler before wasting any time with intrinsics. I think this has happened before…