Posts Tagged vision

JavaScript depth mesh renderer

Was playing around with some old code for generating depth maps and decided to create a demo that renders the depth maps in WebGL. The color video is stacked vertically on the depth texture so that the two will always be in sync. Looked into packing the depth into the 24-bit RGB channels, but as the video codec is using YUV there was significant loss and the depth looked horrible. A better approach would be to pack the data into the YUV channels, but I didn’t try. For this example, the depth is only 8-bit.
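To make the packing problem concrete, here is a minimal sketch of the two encodings discussed above: an 8-bit quantization and a 24-bit split across R, G, and B. The function names and the byte order (most significant byte in R) are my own assumptions for illustration, not what the demo necessarily uses.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Quantize a depth value in [near_z, far_z] to one 8-bit gray level.
std::uint8_t quantize_depth8(float depth, float near_z, float far_z) {
  assert(far_z > near_z);
  float t = (depth - near_z) / (far_z - near_z);
  t = std::min(1.0f, std::max(0.0f, t));  // clamp to the valid range
  return static_cast<std::uint8_t>(std::lround(t * 255.0f));
}

// Pack a 24-bit depth code into R, G, B (most significant byte in R).
std::array<std::uint8_t, 3> pack_depth24(std::uint32_t code) {
  assert(code < (1u << 24));
  return {static_cast<std::uint8_t>(code >> 16),
          static_cast<std::uint8_t>((code >> 8) & 0xFF),
          static_cast<std::uint8_t>(code & 0xFF)};
}

// Recover the 24-bit code from the three channels.
std::uint32_t unpack_depth24(const std::array<std::uint8_t, 3>& rgb) {
  return (static_cast<std::uint32_t>(rgb[0]) << 16) |
         (static_cast<std::uint32_t>(rgb[1]) << 8) |
         static_cast<std::uint32_t>(rgb[2]);
}
```

With this layout, an error of a single level in G moves the decoded depth by 256 codes, and a single level in R moves it by 65536, which is why a lossy codec wrecks the 24-bit version while the 8-bit version degrades gracefully.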

You can see one of the videos here:

Stacked color and depth





This is a camera calibration tool inspired by the toolbox for MATLAB. Give it a few checkerboard images, select some corners, and you get the internal calibration parameters for your images.


  • Output to xml/clb file
  • Radial distortion
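
As a rough illustration of what the internal parameters and radial distortion mean, here is a sketch of projecting a normalized image point to pixels. It assumes the common polynomial model with two radial coefficients; the struct layout and the names (`Intrinsics`, `k1`, `k2`) are hypothetical and not taken from the tool's xml/clb output.

```cpp
#include <cassert>

struct Point2 { double x, y; };

// Hypothetical intrinsics: focal lengths, principal point, and two radial
// distortion coefficients (polynomial model, assumed for this sketch).
struct Intrinsics { double fx, fy, cx, cy, k1, k2; };

// Map a normalized image point (x/z, y/z) to pixel coordinates.
Point2 project_with_radial(const Intrinsics& K, Point2 n) {
  assert(K.fx > 0 && K.fy > 0);
  double r2 = n.x * n.x + n.y * n.y;              // squared radius
  double s = 1.0 + K.k1 * r2 + K.k2 * r2 * r2;    // radial scale factor
  return {K.fx * s * n.x + K.cx, K.fy * s * n.y + K.cy};
}
```

Points at the principal point are unaffected, and the displacement grows with the distance from it.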

Windows binaries (source code) to come.


Region-based tracking

I recently had to revisit some tracking from last year. In the process, I uncovered some other relevant works that were doing region-based tracking by essentially segmenting the image at the same time as registering a 3D object. The formulation is similar to Chan-Vese for image segmentation; however, instead of parameterizing the curve or region directly with a level-set, the segmentation is parameterized by the 3D pose parameters of a known 3D object. The pose parameters are then found by gradient descent.
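
To give a feel for this kind of energy, here is a minimal sketch in the pixel-wise log-likelihood style. It is an illustration under my own assumptions (sign convention, smoothed Heaviside, names), not the exact energy of any of the papers; in a tracker, `phi` would come from projecting the 3D model at the current pose, and the pose would be updated by descending this energy.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Smoothed Heaviside step: close to 1 well inside the silhouette (phi > 0),
// close to 0 well outside. The width eps is a free parameter.
double smooth_heaviside(double phi, double eps = 1.0) {
  const double pi = 3.14159265358979323846;
  return 0.5 + std::atan(phi / eps) / pi;
}

// Region energy: each pixel is explained by the foreground model where the
// projected object covers it and by the background model elsewhere.
// phi: signed distance to the projected silhouette (positive inside, assumed)
// pf, pb: per-pixel likelihoods under the foreground / background models
double region_energy(const std::vector<double>& phi,
                     const std::vector<double>& pf,
                     const std::vector<double>& pb) {
  assert(phi.size() == pf.size() && phi.size() == pb.size());
  double e = 0.0;
  for (std::size_t i = 0; i < phi.size(); ++i) {
    double h = smooth_heaviside(phi[i]);
    e -= std::log(h * pf[i] + (1.0 - h) * pb[i]);
  }
  return e;
}
```

The energy is low when pixels covered by the projection look like foreground and the rest look like background, which is what ties the segmentation to the pose.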

Surprisingly, the formulation is pretty simple. There are two similar derivations (and an earlier third):
1) PWP3D: Real-time segmentation and tracking of 3D objects, Prisacariu & Reid
2) S. Dambreville, R. Sandhu, A. Yezzi, and A. Tannenbaum. Robust 3D Pose Estimation and Efficient 2D Region-Based Segmentation from a 3D Shape Prior. In Proc. European Conf. on Computer Vision (ECCV), volume 5303, pages 169-182, 2008.
3) There is another one from 2007 by Schmaltz.

This is similar to aligning a shape to a known silhouette, but in this case the silhouette is estimated at the same time.

The versions are really similar, and I think the gradient in the two versions can be brought into agreement if you use the projected normal as an approximation to the gradient of the signed distance function (you then have to use the fact that = 0). This should actually benefit implementation 1) because you don’t need to compute the signed distance of the projected mesh.

I implemented version 2), in what took maybe 1 hour late at night, and then a couple more hours of fiddling and testing it the next day. Version 1) claims real-time; my implementation isn’t real-time, but is probably on the order of a frame or 2 a second. The gradient descent seems quite dependent on the step size parameter, which IMO needs to change based on the scene context (size of object, contrast between foreground and background).

Here are some of the results. In most of the examples, I used a 3D model of a duck. The beginning of the video illustrates that the method can track with a poor starting point. In fact, the geometry is also inaccurate (it comes from shape-from-silhouette, and has some artifacts on the bottom). In spite of this, the tracking is still pretty solid, although it seems more sensitive to rotation (not sure if this is just due to the parameterization).

Here are 4 videos illustrating the tracking (mostly on the same object).  The last one is of a skinned geometry (probably could have gotten a better example, but it was late, and this was just for illustration anyhow).


Variational Displacement Map Recovery

Shortly after working on some variational scene flow (from a single moving camera), I thought it might be a good idea to implement the same ideas to reconstruct both a displacement map and a flow map on top of a base mesh.  The variational formulation for displacement map estimation is more or less the same.  I parameterized the displacement as displacement along the normal (something that we have done before), so the objective is to find the displacements on the mesh such that the image score is minimized (in this case, pairwise SSD scores), while having a regularization constraint over the displacements (and flow vectors) in the uv-coordinates.
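
The pieces of that objective can be sketched as follows. This is only an outline under my own assumptions: the names are hypothetical, the regularizer is a plain squared-difference term on a regular uv grid, and the image (SSD) term is taken as a number computed elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Move a base-mesh vertex by a scalar displacement d along its unit normal n.
Vec3 displace(Vec3 v, Vec3 n, double d) {
  return {v.x + d * n.x, v.y + d * n.y, v.z + d * n.z};
}

// Regularizer: squared forward differences of a w x h displacement map
// stored row-major in uv space.
double smoothness(const std::vector<double>& d, int w, int h) {
  assert(w > 0 && h > 0 && d.size() == static_cast<std::size_t>(w) * h);
  double s = 0.0;
  for (int v = 0; v < h; ++v)
    for (int u = 0; u < w; ++u) {
      double c = d[v * w + u];
      if (u + 1 < w) { double dx = d[v * w + u + 1] - c; s += dx * dx; }
      if (v + 1 < h) { double dy = d[(v + 1) * w + u] - c; s += dy * dy; }
    }
  return s;
}

// Total objective: image term plus weighted smoothness. data_term is assumed
// to be the pairwise SSD score, evaluated elsewhere from the input images.
double objective(double data_term, const std::vector<double>& d,
                 int w, int h, double lambda) {
  return data_term + lambda * smoothness(d, w, h);
}
```

The weight `lambda` plays the role of the regularization parameter: larger values favor smoother displacement maps over a better image score.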

I had implemented this idea, and barely tested it on anything. This last week, I figured that I could use parts of the project to generate some data. So I wanted to share my results. Below is a sample of three input images from a synthetic sequence. The images were lit from several lights to ensure the best conditions for the shape estimation (e.g., so the SSD score wouldn’t get confused). The results look pretty good. And they should. This is a pretty good situation for stereo.

Input images for the displacement map estimation



Base mesh, recovered displaced mesh, and recovered displacement map

The idea of solving for flow required that there were multiple images of the object deforming over time.  Again, I tested this on a similar sequence, where now the object had some texture (to enable the flow recovery), and I also introduced some motion.  The idea is now to recover both the displacement map (that ensures stereo consistency at time t=0), and also the 3D flow map that warps this image forward in time (t > 0).  Ideally, there would also be some temporal consistency between flow maps at (t>0), but for now I simply solved for the displacement and flow simultaneously for pairs (t=0, t=1), (t=0, t=2), etc

In this case the input sequences look something like the sequence below:

Again, the reconstruction was okay for the most part. There is one exception: the displaced meshes sometimes overlap/intersect, which means that they are not as useful in the application that I wanted to use them in (that is, without post-processing). Notice that there is flow roughly in the regions of the eyes and near the mouth, which agrees with the input sequence. The displacement looked similar to the non-flowed case.

The u, v, and w- components of the flow for the last image.


The resulting, recovered mesh appears beside the input rendering in the following video.  I could have probably chosen the regularization parameters better.  If the video doesn’t load, try this link: flowed_result.



Capture from multiple libdc1394/FireWire cameras on different hosts. Each host can have several cameras. The library bases the multi-host communication on the PVM library. Higher-level classes are used to develop the UI (Qt).

The mode of operation assumes that recorded files will be sent back and used in post-processing. Files can be encoded on the client side (using mencoder) and then transferred back, or saved in a raw binary form for transfer afterwards. Alternatively, for short sequences, or when the disk or encoding is a bottleneck, the client side saves images in memory buffers and transfers them after the recording is complete. Raw files from multiple hosts are merged and can be viewed with a simple multi-video viewer. File transfer is done with a combination of scp/ssh and TCP/IP sockets; as such it will probably take some effort to port to Windows.

MCap supports changing features and settings, or loading configurations of cameras from files (in a crude manner).

Mcap screenshot

Issues: changing video modes sometimes causes failures (possibly due to trying to allocate more bandwidth than the 1394 bus can handle). Failure is painful due to

Planned (Maybe): add the ability to plug in framesinks (for transfer, encoding), or other filters and processing, so the client can do processing in real-time applications.

To come: possibly a system architecture, source code, and some videos.



Semi-global MMX

Yesterday I came across the paper “Accurate and Efficient Stereo Processing by Semi-Global Matching and Mutual Information”. I have read this paper before, but yesterday I got the urge to write some code and decided to implement it. The semi-global (SG) matching technique is similar to the dynamic programming approaches for stereo-vision, except that it integrates costs across multiple paths (e.g., horizontal, vertical, and some diagonals), 16 of these in total. Instead of doing the typical DP backward pass where the solution is extracted, SG uses the sum of these costs through a pixel as the score and selects the best disparity label for each pixel.
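
The final selection step is simple enough to sketch. This assumes the per-path costs have already been added up into one array laid out as `summed[pixel * ndisp + d]`; the layout and the function name are my own choices for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Pick, for every pixel, the disparity with the lowest summed path cost.
// summed[p * ndisp + d] is the sum over all scan directions of the path
// costs through pixel p at disparity d.
std::vector<int> select_disparities(const std::vector<int>& summed,
                                    int npix, int ndisp) {
  assert(npix >= 0 && ndisp > 0);
  assert(summed.size() == static_cast<std::size_t>(npix) * ndisp);
  std::vector<int> best(npix, 0);
  for (int p = 0; p < npix; ++p) {
    const int* s = &summed[static_cast<std::size_t>(p) * ndisp];
    for (int d = 1; d < ndisp; ++d)
      if (s[d] < s[best[p]]) best[p] = d;  // keep the cheapest label
  }
  return best;
}
```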

Anyhow, the implementation is straightforward (with a sum-of-absolute-differences matching score, at least). In the paper it is reported that the optimization takes roughly 1 second for some standard benchmarks, which is pretty fast for what is going on. My initial implementation was on the order of dozens of seconds for similar sequences.
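
For reference, one step of the cost accumulation along a single scan path can be sketched in plain scalar code. This follows the recurrence from the paper with a small penalty `p1` for a disparity change of one and a larger penalty `p2` for any other jump; the function name and the use of `std::vector<int>` are my own choices and not how my uploaded source is organized.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One step along a scan path. prev holds the path costs at the previous
// pixel, cost the matching costs at the current pixel. Writes the current
// path costs to cur (which must not alias prev) and returns their minimum.
int sgm_path_step(const std::vector<int>& cost, const std::vector<int>& prev,
                  int p1, int p2, std::vector<int>& cur) {
  assert(!cost.empty() && cost.size() == prev.size());
  assert(&cur != &prev);
  const int ndisp = static_cast<int>(cost.size());
  const int prev_min = *std::min_element(prev.begin(), prev.end());
  cur.resize(cost.size());
  for (int d = 0; d < ndisp; ++d) {
    // Same disparity, or a jump of more than one level.
    int best = std::min(prev[d], prev_min + p2);
    // Disparity changed by exactly one level.
    if (d > 0) best = std::min(best, prev[d - 1] + p1);
    if (d + 1 < ndisp) best = std::min(best, prev[d + 1] + p1);
    // Subtracting prev_min keeps the values from growing along the path.
    cur[d] = cost[d] + best - prev_min;
  }
  return *std::min_element(cur.begin(), cur.end());
}
```

Calling this once per pixel along each scan direction and adding the results into a common array gives the summed costs from which the disparity is selected.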

The author suggests the use of SIMD, and definitely seems to have taken care in his implementation. Enter MMX. I haven’t written any MMX code for a while, but with compiler intrinsics I know that it is not as hard as the first time I did it (using inline assembly). I decided to take a crack at optimizing my implementation.

My first thoughtless implementation was to do several pixels in parallel. This quickly turned out to be problematic, as the multiple scan traversals would make it impossible to do this efficiently (at least I think so). The other solution is to parallelize the innermost loop along a scan: for each disparity at a pixel, it must find the minimum over all of the previous pixel’s costs plus a penalty that depends on the distance between the disparities.

#include <xmmintrin.h>  // _m_pminsw and _m_pshufw (SSE integer extensions to MMX)

/** \brief MMX-based per-pixel accumulation of scores.
 *
 * Processes four 16-bit disparity costs per iteration, so ndisp must be a
 * multiple of 4. lcost_prev is assumed to be padded with one large cost on
 * each side, because it is read at d0 - 1 and d0 + 4.
 */
static void mmx_inner(const cost_type_t * score_base,
                      cost_type_t * obase, cost_type_t * lcost,
                      cost_type_t * lcost_prev, cost_type_t  mcost_many[4],
                      int ndisp, const cost_type_t p1[4], const cost_type_t p2[4]){

  __m64 this_mins;
  __m64 mn = *((__m64 *)mcost_many);  //min of the previous pixel, in every lane

  for(int d0=0; d0<ndisp; d0+=4){

    __m64 penalty = *((__m64*)(p1));
    __m64 n = *((__m64*)(lcost_prev + d0));

    //Only consider middle shifts (Are these correct?)
    n = _m_pminsw(n, _m_pshufw(n, 0x90) + penalty);
    n = _m_pminsw(n, _m_pshufw(n, 0xF9) + penalty);

    //Need two more operations for boundaries...
    {
      //Lanes hold lcost_prev[d-1], so lane 0 sees the previous group
      __m64 t = *((__m64*)(lcost_prev + d0 - 1));
      n = _m_pminsw(n, t + penalty);
    }
    {
      //Lanes hold lcost_prev[d+1], so lane 3 sees the next group
      __m64 t = *((__m64*)(lcost_prev + d0 + 1));
      n = _m_pminsw(n, t + penalty);
    }

    penalty = *((__m64*)(p2));

    //Now do all the disparities
    for(int d=0; d<ndisp; d+=4){
      __m64 m5 = *((__m64*)(lcost_prev + d));
      //Min of m5 and its three rotations: the group minimum in every lane
      __m64 m6 = _m_pminsw(m5, _m_pshufw(m5, 0x93));
      m6 = _m_pminsw(m6, _m_pshufw(m5, 0x4E));
      m6 = _m_pminsw(m6, _m_pshufw(m5, 0x39));
      n = _m_pminsw(n, m6 + penalty);
    }

    __m64 cost_many = *((__m64*)(score_base + d0));
    __m64 c = cost_many + n - mn;

    *((__m64 *)(lcost + d0)) = c;
    *((__m64 *)(obase + d0)) += c;

    if(d0==0) this_mins = c;
    else this_mins = _m_pminsw(this_mins, c);
  }

  //Horizontal minimum: every lane ends up holding the overall minimum
  this_mins = _m_pminsw(this_mins, _m_pshufw(this_mins, 0x93));
  this_mins = _m_pminsw(this_mins, _m_pshufw(this_mins, 0x4E));
  this_mins = _m_pminsw(this_mins, _m_pshufw(this_mins, 0x39));

  *((__m64 *)(mcost_many)) = this_mins;
}

Now isn’t that pretty? Here’s the kicker. On the system I was developing this on (AMD Phenom II), I get a speedup (rather, a slowdown) of 0.43 (where MMX takes 2.29s, and basic takes 0.997s, for a pass through a 1024×1024 image with 32 disparities).

But on the machine where I had originally developed and tested (older compiler, Intel Quad Core), I get a speedup of roughly 8x (the MMX takes 0.944s, whereas the basic takes 7.8s). This was the machine on which I wrote the original basic implementation and decided it was slow enough to warrant hand optimization.

I am convinced that gcc’s internal vectorization has been improved significantly. Running the gcc 4.3.2 executable on the Intel machine, I get a little bit of a speedup: 1.2 (MMX takes 0.95s and Basic takes 1.12s).

Lesson: writing MMX code is fun (in that you feel like you accomplished something), but the last couple times I tried the compiler has done a better job (and it is much faster at it too!).

For the tsukuba data set, my code takes about 1.4s for 32 disparity levels. Without doing forward backward comparison, it gives pretty decent results too:

It is still slower than what is reported in the paper (and the paper is doing forward and backward comparisons). Now add some threading, we can get this down to about 0.59s (4 threads on quad core, where some other user is eating up some of one CPU). That is a bit better. We can get this down to about 0.38s (where 0.1s is the non-parallel matching cost computation and image loading). That is somewhat better; not real-time, but not worth any more effort to optimize.

I’ve uploaded the current source for this if you care to check it out. I don’t guarantee correctness or cleanliness: sgstereo. I was originally looking (albeit quickly) for semi-global matching source code and it wasn’t online in the first few hits. Now it is definitely online.

Next time I will have to remember to try compiling on a newer compiler before wasting any time with intrinsics. I think this has happened before…


Color Tracker

This last Friday I was reading a paper that integrated time of flight sensors with stereo cameras for real-time human tracking, and I came across some old papers on skin color tracking. I couldn’t stop myself from trying out an implementation of the tracker: color_tracker and color_tracker2

It is really simple, but works pretty well for fast motions.
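
To show how little is needed for this kind of tracker, here is a minimal sketch: classify pixels by normalized chromaticity and follow the centroid of the matches. This is one simple color-tracking scheme and not necessarily the one from the papers or my implementation; the `ChromaBox` thresholds and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct RGB { std::uint8_t r, g, b; };

// Hypothetical box in normalized (r, g) chromaticity space that describes
// the target color; the bounds would be tuned or learned from samples.
struct ChromaBox { double rmin, rmax, gmin, gmax; };

struct Blob { double x, y; int count; };

bool matches(RGB p, const ChromaBox& box) {
  const int sum = p.r + p.g + p.b;
  if (sum == 0) return false;  // black carries no chromaticity
  const double r = static_cast<double>(p.r) / sum;
  const double g = static_cast<double>(p.g) / sum;
  return r >= box.rmin && r <= box.rmax && g >= box.gmin && g <= box.gmax;
}

// Centroid of all matching pixels in a row-major w x h image.
// count == 0 means the target color was not found.
Blob track(const std::vector<RGB>& img, int w, int h, const ChromaBox& box) {
  assert(w > 0 && h > 0 && img.size() == static_cast<std::size_t>(w) * h);
  Blob b{0.0, 0.0, 0};
  for (int y = 0; y < h; ++y)
    for (int x = 0; x < w; ++x)
      if (matches(img[y * w + x], box)) { b.x += x; b.y += y; ++b.count; }
  if (b.count > 0) { b.x /= b.count; b.y /= b.count; }
  return b;
}
```

Because every frame is classified from scratch, there is no motion model to break, which fits with it coping well with fast motions.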

Another reference was on boosting for detection of faces in images. I was tempted to try and implement a version. It’s probably a good thing it was close to the end of a Friday, so I didn’t start, as I have little use for such a beast.

Vids will probably only play if you have VLC or Divx installed (encoded with ffmpeg and the usual incompatibility problems)



Swiss Ranger

gtk swiss-ranger

We recently got an SR3K Swiss Ranger time of flight range finder. After trying to get some data out of the thing (their API is really straightforward), I realized I was going to need something similar to their Windows API (provided with the developer kit) to probe all the available parameters.

This code is the result of that work. It is not much, but gives a way to try out the device using C/C++ in Linux. I realize that they provide sample code for Python (and MATLAB), but there were some problems getting that to run out of the box (mostly because the drivers are 32-bit, meaning that the Python interpreter has to be 32-bit, and I didn’t have a 32-bit version installed).

After all of this, I think that the best settings are the following: AutoExposure on (with a desiredPos of about 50), Median Filter on, AN-SWF on (whatever this is), and an extra bilateral sw filtering (my own).
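
For anyone unfamiliar with what the median filter does to range data, here is a small software sketch of a 3x3 median on a depth image. It only illustrates the idea; it is not the device's built-in filter, nor my bilateral filter, and the function name is made up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// 3x3 median filter on a row-major w x h range image.
// Border pixels are copied unchanged.
std::vector<float> median3x3(const std::vector<float>& in, int w, int h) {
  assert(w > 0 && h > 0 && in.size() == static_cast<std::size_t>(w) * h);
  std::vector<float> out(in);
  float win[9];
  for (int y = 1; y + 1 < h; ++y)
    for (int x = 1; x + 1 < w; ++x) {
      int k = 0;
      for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx)
          win[k++] = in[(y + dy) * w + (x + dx)];
      std::nth_element(win, win + 4, win + 9);  // win[4] is now the median
      out[y * w + x] = win[4];
    }
  return out;
}
```

Isolated spikes in the range image are replaced by a neighboring value, while step edges mostly survive, which is why it is a reasonable first pass before any smoothing filter.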

The screenshots below show the control window and some of the display modes.

To illustrate the noisiness of the data, the result after only median filtering is displayed in the Blender screenshot below. This is compared to the right image with the extra AN-SWF and bilateral filtering.

Source is also available. Requires nimage to build, but the binary should probably work as long as 32-bit versions of gtkgl-2.0 and gtk+-2.0 are installed.



More to come.


Image-Based Face Tracking for VR

Not so much a research topic as an attempt to duplicate a YouTube video that did head tracking for VR using a WiiMote. (This wasn’t just some YouTube video, but actually quite a popular one.)

There is a company seeingmachines that has a commercial version of a head tracker that they use for a similar purpose.

The non-pattern based one uses a 4DOF SSD tracker. I’m currently working on a port of the demo for the Mac.
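
The score at the heart of an SSD tracker is easy to sketch. The version below only evaluates the score for a translated window; the real tracker minimizes it over four warp parameters, and which four is not something this sketch assumes. The function name and image layout are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum of squared differences between a template and the image window whose
// top-left corner is at (ox, oy). Both are row-major grayscale images.
double ssd(const std::vector<float>& image, int iw, int ih,
           const std::vector<float>& templ, int tw, int th, int ox, int oy) {
  assert(iw > 0 && ih > 0 && tw > 0 && th > 0);
  assert(image.size() == static_cast<std::size_t>(iw) * ih);
  assert(templ.size() == static_cast<std::size_t>(tw) * th);
  assert(ox >= 0 && oy >= 0 && ox + tw <= iw && oy + th <= ih);
  double sum = 0.0;
  for (int y = 0; y < th; ++y)
    for (int x = 0; x < tw; ++x) {
      double diff = image[(oy + y) * iw + (ox + x)] - templ[y * tw + x];
      sum += diff * diff;
    }
  return sum;
}
```

A tracker then looks for the warp parameters that make this score smallest in each new frame, starting from the previous frame's estimate.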



PhD Progress

Vision-based Modeling of dynamic Appearance and Deformation of Human Actors


The goal of the proposed research is to capture a concise representation of
human deformation and appearance in a low-cost environment that can be seamlessly integrated
into entertainment and Virtual Reality applications–applications that often require
relighting, high levels of realism, and rendering from novel viewpoints. Geometric modeling
techniques exist in the current state of the field, but they focus solely on
geometric deformation and ignore the appearance component. I propose
a combined model of both geometric and appearance deformation, leading
to a more compact representation while still achieving high levels of realism.
Furthermore, many existing methods require input data from artists or expensive
laser scanners. The proposed vision-based method is less expensive
and requires only limited manual intervention.

In a controlled environment, several cameras will be used to acquire a dense
static geometry and a basic appearance model of a street-clothed human (e.g., a texture mapped model).
A skeletal model is manually aligned with the existing geometry,
and the actor performs a sequence of predefined motions. Multi-image vision-based
tracking techniques are used to extract the kinematic motion parameters that account for the
gross motion of the object. Multi-view stereo and silhouette cues, and temporal
coherence are used to extract the time-varying residual geometric deformation. These time-dependent
geometric quantities, and the time and view-dependent photometric quantities are used to create
the final compact model of appearance and geometric variation. Following several graphics techniques
for modeling articulated deformation and leveraging existing tools for image-based rendering,
the deformation model is built on top of linear blend skinning. Appearance and geometric variation
are modeled together as a function of abstract interpolating parameters
that include, e.g., relative joint angles and view angles. Examples of deformation include muscle bulging
and some forms of cloth budging. We model this compactly using a dimensionality
reduction and sparse data interpolation, where several low-dimensional subspaces and corresponding
interpolation spaces are used, effectively clustering
portions of the surface that are affected by the same abstract interpolators. In contrast to many of the
vision-based human modeling techniques, our complete model can be re-rendered under novel viewpoints,
novel animation, and even novel illumination when the illumination during capture is calibrated.
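
Since the deformation model sits on top of linear blend skinning, a minimal sketch of plain skinning may help as a baseline. The types and names here are hypothetical, and the additional appearance and residual deformation terms of the proposed model are not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vec3 { double x, y, z; };

// Rigid bone transform stored as a 3x4 row-major matrix [R | t].
struct Bone { double m[12]; };

Vec3 apply(const Bone& b, Vec3 v) {
  return {b.m[0] * v.x + b.m[1] * v.y + b.m[2] * v.z + b.m[3],
          b.m[4] * v.x + b.m[5] * v.y + b.m[6] * v.z + b.m[7],
          b.m[8] * v.x + b.m[9] * v.y + b.m[10] * v.z + b.m[11]};
}

// Linear blend skinning: the rest-pose vertex is moved by every bone and the
// results are averaged with per-vertex weights (expected to sum to one).
Vec3 skin(Vec3 rest, const std::vector<Bone>& bones,
          const std::vector<double>& weights) {
  assert(bones.size() == weights.size());
  Vec3 out{0.0, 0.0, 0.0};
  for (std::size_t i = 0; i < bones.size(); ++i) {
    Vec3 p = apply(bones[i], rest);
    out.x += weights[i] * p.x;
    out.y += weights[i] * p.y;
    out.z += weights[i] * p.z;
  }
  return out;
}
```

Effects such as muscle bulging are exactly what this baseline cannot express, which is why the model adds a learned residual driven by the interpolating parameters.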

The specified model also has potential applications in the use of appearance and
deformation transfer between different subjects. Furthermore, the combined geometric and appearance
model will almost directly transfer to similar domains, such as modeling hand deformations, or
related domains, such as facial deformations. An improved model of geometry and appearance could also
be used to improve markerless vision-based motion capture. This work also hopes to better identify the
range of clothing and deformations that can be both captured and satisfactorily modeled as a function
of relative joint angles, neighboring rigid body velocities, and other related external factors, such as direction
of gravity, and wind.


The third version of the candidacy report is now available. The document is currently
a work in progress and should not be redistributed: candidacy.pdf (Updated: Wed Apr 18 12:01)

Update: I passed my candidacy; more details relevant to the exam are on a separate page.


Multi-view tracking preliminary results.

Testing of collision bounding representation



Capture System

The original capture system project. This started as a capture system written by Keith and myself in MATLAB. About a year later (2003/2004) I wrote a C++ version with wxWidgets.

It turned out to be very useful for several other projects.


  • Cross-platform
  • Plugin style interface (very simple)
  • 1394 capture or load from images
  • Internal tools for computing texture coordinates


Gimp – Integrate Normals

This plugin is used to integrate normal maps into depth maps. As input the plugin takes an rgb description of a normal map, and outputs the corresponding depth map. There are two modes of operation.

More details and screenshots to come.

As an example, the program can be used to create normal maps from depth maps, or more importantly depth maps from normal maps.

To illustrate, consider the input depth map:

In some vision problems (e.g., shape from shading), you have a normal map and you want to recover the depth map. The plugin can be used to generate the normal map from the depth map (left). Running the plugin once more (with this typical input), we can recover the depth:

The interface has several modes for packing either the normals or the depth, but if you want to use them, you will need to look at the code. Requires the fast Fourier transform library fftw3 to compile and run. It uses a sparse linear least squares solver, lsqr, which is included in the source.
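
The relation between normals and depth that the plugin relies on can be sketched as follows. The RGB decoding assumes the common mapping of [0, 255] to [-1, 1] per channel, which may differ from the plugin's packing modes, and the actual integration step (FFT or lsqr) is not shown. All names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Vec3 { double x, y, z; };
struct Gradient { double p, q; };  // p = dz/dx, q = dz/dy

// Decode an RGB-encoded normal, assuming [0, 255] maps to [-1, 1] per channel.
Vec3 decode_normal(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
  return {r / 255.0 * 2.0 - 1.0, g / 255.0 * 2.0 - 1.0, b / 255.0 * 2.0 - 1.0};
}

// Depth gradient implied by a normal (n.z must be non-zero).
Gradient normal_to_gradient(Vec3 n) {
  assert(n.z != 0.0);
  return {-n.x / n.z, -n.y / n.z};
}

// Unit normal of the surface z(x, y) with the given gradient.
Vec3 gradient_to_normal(Gradient g) {
  double len = std::sqrt(g.p * g.p + g.q * g.q + 1.0);
  return {-g.p / len, -g.q / len, 1.0 / len};
}
```

Going from depth to normals only needs finite differences and `gradient_to_normal`; going back requires finding the depth map whose gradients best match `normal_to_gradient` at every pixel, which is the least squares problem the plugin solves.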
