The following is a brief report on the evaluation of the Kinect as a user-computer input device. Examples and notes are given that may be helpful for others starting out on similar projects. The libraries used are those provided on openkinect.org, namely libfreenect. The history of these is relatively recent (at the time of writing): the open sourcing of the code by Hector Martin occurred only 6 weeks ago.
Kinect, technology licensed from Primesense
One can request two "images" from the Kinect: a standard image from a normal camera, and a depth map derived by an infrared structured light emitter and camera. The IR emitter is the leftmost circle (when viewed from the front), the central circle is the normal camera, and the rightmost circle is the IR camera. The resolution of the camera image is 640x480 at 30 fps (there is also a 1280x1024 mode that runs at 15 fps). The image data can be supplied in different formats, for example standard RGB, raw Bayer, or YUV.
The depth image is also supplied as 640x480 at 30 fps. There is also a high resolution mode (1280x1024) at a lower frame rate, around 10 fps if the high resolution camera data is also being acquired. The maximum dynamic range is 11 bit (2048 states). There is also control over the tilt of the Kinect and the appearance of the single LED light, and one can receive data from the accelerometer. These will not be discussed here.
Single pointer example
The first stage of any camera tracking is often background subtraction. With the Kinect the depth image makes this trivial and not susceptible to changes in lighting. The algorithm used in the following example is very simple: choose a depth (that of the foreground hand in this example), find the pixels whose depth values lie within some tolerance of that depth (red), and take a running average of their position for a smooth tracked result (green ellipse).
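The steps just described can be sketched in a few lines. This is a minimal illustration in plain Python, not the code used for the example: the depth frame is assumed to be a list of rows of depth values, the function and class names are hypothetical, and the "running average" is assumed here to be an exponential moving average of the centroid, which is only one of several ways to smooth the result.

```python
def segment_centroid(depth, target, tolerance):
    """Centroid (x, y) of the pixels within `tolerance` of `target` depth.

    `depth` is assumed to be a list of rows of depth values.
    Returns None if no pixel falls inside the depth band.
    """
    sum_x = sum_y = count = 0
    for y, row in enumerate(depth):
        for x, d in enumerate(row):
            if abs(d - target) <= tolerance:  # the "red" pixels
                sum_x += x
                sum_y += y
                count += 1
    if count == 0:
        return None
    return (sum_x / count, sum_y / count)


class SmoothedPointer:
    """Exponential moving average of the centroid (an assumed smoothing choice)."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha      # weight given to the newest measurement
        self.position = None    # smoothed (x, y), None until first detection

    def update(self, centroid):
        if centroid is None:    # nothing detected: keep the last position
            return self.position
        if self.position is None:
            self.position = centroid
        else:
            a = self.alpha
            self.position = (
                a * centroid[0] + (1 - a) * self.position[0],
                a * centroid[1] + (1 - a) * self.position[1],
            )
        return self.position
```

Calling `segment_centroid` on each new depth frame and passing the result to `SmoothedPointer.update` gives the smoothed pointer position; a smaller `alpha` gives a steadier but slower pointer.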
Two hand driver example
The following is a simple angular interface using two hands: their relative positions are estimated and the angle from the horizon is derived from them. An important feature of both these examples is how stable the derived statistics are and how low the latency is. The capture runs at 25 fps.
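The angle calculation itself is small enough to sketch. The following assumes the two hand positions have already been found as (x, y) pixel centroids, with image y increasing downwards; the function name and the sign convention (positive when the right hand is higher) are choices made for illustration only.

```python
import math


def steering_angle(left_hand, right_hand):
    """Angle in degrees of the line joining two hands, relative to the horizon.

    Each hand is an (x, y) pixel position with y increasing downwards.
    Positive angles mean the right hand is higher than the left.
    """
    dx = right_hand[0] - left_hand[0]
    dy = left_hand[1] - right_hand[1]  # flip sign because image y points down
    return math.degrees(math.atan2(dy, dx))
```

Using `atan2` rather than dividing `dy` by `dx` avoids a division by zero when one hand is directly above the other.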
The white patches in the images/videos presented here are either regions the IR camera cannot see because they are obstructed by a closer object, or IR cold objects (such as the large black leather couch).
In my experience the whole dynamic range of 2048 was never encountered. I chose to maintain the range encountered so far and stored the depths in a float buffer normalised from 0 to 1. Depth 2047 is reserved to indicate no reliable depth estimate.
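One way to implement this normalisation is sketched below. It is an illustration under stated assumptions: the class name is hypothetical, the frame is a list of rows of raw 11 bit values, and unreliable pixels are returned as `None`, which is a choice made here rather than something the text specifies.

```python
NO_DEPTH = 2047  # raw value reserved for "no reliable depth estimate"


class DepthNormaliser:
    """Maps raw depths to 0..1 using the range of values seen so far."""

    def __init__(self):
        self.lo = None  # smallest valid raw depth encountered so far
        self.hi = None  # largest valid raw depth encountered so far

    def normalise(self, raw_frame):
        # Extend the running range with the valid depths in this frame.
        valid = [d for row in raw_frame for d in row if d != NO_DEPTH]
        if valid:
            frame_lo, frame_hi = min(valid), max(valid)
            self.lo = frame_lo if self.lo is None else min(self.lo, frame_lo)
            self.hi = frame_hi if self.hi is None else max(self.hi, frame_hi)

        span = 0 if self.lo is None else self.hi - self.lo
        out = []
        for row in raw_frame:
            out.append([
                None if d == NO_DEPTH            # keep invalid pixels marked
                else ((d - self.lo) / span if span else 0.0)
                for d in row
            ])
        return out
```

Because the range only ever grows, the normalised values settle down after the first few frames once the nearest and farthest objects in the scene have been seen.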
The image camera has a slightly wider field of view (both horizontally and vertically) than the depth camera. To correct for this a calibration process is required, with a scaling applied as a first approximation.
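As a rough sketch of that first approximation, a depth pixel can be mapped to the corresponding camera image pixel by scaling about the image centre. The function name is hypothetical and the scale factor has to come from calibration; no measured value is given here, and the 0.9 in the usage note is only a placeholder.

```python
def depth_to_image_pixel(x, y, scale, width=640, height=480):
    """Map a depth image pixel to camera image coordinates.

    Scales about the image centre. Since the camera image sees a wider
    field of view, the depth image covers a smaller central region of it,
    so `scale` is expected to be a little below 1 (found by calibration).
    """
    cx, cy = width / 2, height / 2
    return (cx + (x - cx) * scale, cy + (y - cy) * scale)
```

For example, `depth_to_image_pixel(0, 0, 0.9)` moves the top left corner of the depth image inwards, while the centre pixel stays where it is. A fuller calibration would also account for the offset between the two cameras and for lens distortion.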
While not employed in the examples here, some improvement can be achieved by performing a non-linear transformation of the depth values, essentially a gamma correction. A power of 3 is proposed.
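A sketch of such a transformation follows. It assumes the depths have already been normalised to the range 0 to 1 and that the proposed power is applied directly as the exponent; the function name is hypothetical.

```python
def gamma_correct(depth, power=3.0):
    """Apply a power-law (gamma style) curve to a normalised depth in 0..1."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be normalised to the range 0..1")
    return depth ** power
```

The end points 0 and 1 are unchanged, so the result can be stored in the same normalised buffer; only the distribution of values in between is altered.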
The skeleton model supported by the Kinect is not yet fully exposed through the library. The best source for this is the open source drivers from Primesense.
For those who might think about mounting the Kinect on the ceiling pointing down, it appears the gears for tilting are not able to deal with that.