These are my notes on WWDC 2017 Session 506, called “Vision Framework: Building on Core ML.” I’ve linked to Wikipedia for technical terms I wasn’t familiar with.

What is it?

High-level on-device solutions to computer vision problems through one simple API.

– Brett Keating, WWDC 2017

What can it do?

The Vision framework can detect faces using deep learning. Apple says this gives higher precision and higher recall than previous technologies such as Core Image or AVCapture. That allows better detection of smaller faces, side views (“strong profiles”), and partially blocked (“occluded”) faces, including occlusions from hats and glasses.

The full feature list:

  • Face landmarks, a “constellation of points” on the face. Essentially, tracing eyes, nose, mouth, and the chin.
  • Photo Stitching (“Image Registration”) with two techniques: “translation only” and full “homography.”
  • Rectangle, barcode, and text detection.
  • Object tracking, for faces or other rectangles in video.
  • Automatically integrate Core ML models directly into Vision. (Apple showed a demo using an MNIST handwriting-recognition model and Core Image filters to read the number four from a sticky note.)
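
The Core ML integration from the last bullet can be sketched like this. `MyClassifier` stands in for whatever compiled Core ML model you drag into Xcode; the rest is the real Vision API:

```swift
import Vision
import CoreML

// Wrap a Core ML model (hypothetical "MyClassifier") so Vision can drive it.
guard let model = try? VNCoreMLModel(for: MyClassifier().model) else {
    fatalError("Could not load model")
}

// Vision hands classification results back as VNClassificationObservations.
let request = VNCoreMLRequest(model: model) { request, error in
    guard let results = request.results as? [VNClassificationObservation],
          let top = results.first else { return }
    print("\(top.identifier) (confidence: \(top.confidence))")
}
```

Vision handles scaling and converting the input image to whatever size and format the model expects, which is a big part of the convenience being advertised here.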

Three Steps To Vision

  1. Use a VNRequest subclass to ask Vision for something. For example, VNDetectBarcodesRequest for barcodes.
  2. Pass the request to one of two kinds of request handlers. (The completion block is supplied when you create the request.)
  3. In the completion block, we get back the initial request, its results array populated with “observations,” like VNClassificationObservation or VNDetectedObjectObservation.
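
The three steps above look like this in code, using the barcode request as the example. `someCGImage` is a placeholder for whatever image you already have:

```swift
import Vision

// 1. Ask Vision for something via a VNRequest subclass.
let request = VNDetectBarcodesRequest { request, error in
    // 3. The completion block gets the request back, with its
    //    results array populated with observations.
    guard let observations = request.results as? [VNBarcodeObservation] else { return }
    for barcode in observations {
        print(barcode.payloadStringValue ?? "no payload")
    }
}

// 2. Pass the request to a request handler.
let handler = VNImageRequestHandler(cgImage: someCGImage, options: [:])
try? handler.perform([request])
```

Note that `perform(_:)` takes an array, so you can run several requests (say, faces and barcodes) against the same image in one pass.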

Request Handlers

Vision offers two request handlers:

  1. Image Request Handlers: For “interactive” exploration of a still image. Image Request Handlers retain images for their lifecycle, as a performance optimization.

  2. Sequence Request Handler: For tracking movement across video. Sequence request handlers don’t have the same optimization. (Imagine how many frames would need to be cached for a 30-second video clip at 24 frames per second: 720.)
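
Here’s a minimal sketch of the sequence handler driving object tracking across frames. `initialObservation` is a placeholder for an observation you got from an earlier detection request; the feedback loop (last observation in, new observation out) is how Vision’s tracking works:

```swift
import Vision

let sequenceHandler = VNSequenceRequestHandler()
var lastObservation: VNDetectedObjectObservation = initialObservation

// Call once per video frame.
func track(frame pixelBuffer: CVPixelBuffer) {
    // Seed the tracker with the most recent observation.
    let request = VNTrackObjectRequest(detectedObjectObservation: lastObservation)
    try? sequenceHandler.perform([request], on: pixelBuffer)

    // Feed the updated observation back in for the next frame.
    if let newObservation = request.results?.first as? VNDetectedObjectObservation {
        lastObservation = newObservation
    }
}
```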

Best Practices

Apple discussed three areas for best practices:

1. Image Type

Vision supports a wide variety of image types/sources: CVPixelBuffer, CGImageRef, CIImage, NSURL, and NSData.

  • Use CVPixelBuffers for streaming. (You get CVPixelBuffers from the CMSampleBuffers delivered by a camera stream’s AVCaptureVideoDataOutput.) These are a low-level format for in-memory pixel data.
  • Use URL for accessing images saved to disk, or NSData for images from the web. For URL-based images, you don’t need to pass EXIF orientation data. (But you can specify an orientation if you want to override the default.)
  • You can pass in CIImages from Core Image.
  • Finally, if you already have a UIImage or NSImage, get the CGImageRef and pass that into Vision. Easy.
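
All of these sources map onto initializers of the same handler class. A quick sketch, where `fileURL`, `pixelBuffer`, and `uiImage` are placeholders for values you already have:

```swift
import Vision
import UIKit

// From a file URL — Vision reads the EXIF orientation itself.
let urlHandler = VNImageRequestHandler(url: fileURL, options: [:])

// From a CVPixelBuffer (camera stream) — pass the orientation explicitly.
let bufferHandler = VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                          orientation: .right,
                                          options: [:])

// From a UIImage — grab its CGImage and pass that in.
if let cgImage = uiImage.cgImage {
    let imageHandler = VNImageRequestHandler(cgImage: cgImage,
                                             orientation: .up,
                                             options: [:])
}
```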

2. What am I going to do with the image?

  • Use the appropriate handler. (VNImageRequestHandler or VNSequenceRequestHandler.)
  • Don’t pre-scale images.
  • Do pass in EXIF orientation data. (Except for URL-based images.)

3. Performance

Dispatch to a background queue so Vision doesn’t block your UI. In the completion handler, remember to dispatch back to the main queue if you’re updating UI.
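
A minimal sketch of that queue discipline, using a hypothetical `detectFaces` helper around a face-rectangles request:

```swift
import Foundation
import Vision

// Run detection off the main queue; deliver results back on main.
func detectFaces(in cgImage: CGImage,
                 completion: @escaping ([VNFaceObservation]) -> Void) {
    DispatchQueue.global(qos: .userInitiated).async {
        let request = VNDetectFaceRectanglesRequest()
        let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
        try? handler.perform([request])  // synchronous, so keep it off main
        let faces = request.results as? [VNFaceObservation] ?? []

        DispatchQueue.main.async {
            completion(faces)  // safe to touch UIKit/AppKit here
        }
    }
}
```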


Apple ended the session with a demo.