These are my notes on WWDC 2017 Session 506, called “Vision Framework: Building on Core ML.” I’ve linked to Wikipedia for technical terms I wasn’t familiar with.
What is it?
High-level on-device solutions to computer vision problems through one simple API.
– Brett Keating, WWDC 2017
What can it do?
The Vision framework can detect faces using deep learning. Apple says this gives higher precision and higher recall than previous technologies, such as Core Image or AVCapture. This allows for better detection of smaller faces, side views (“strong profiles”), and partially blocked (“occluded”) faces, including those covered by hats and glasses.
The full feature list:
- Face landmarks, a “constellation of points” on the face. Essentially, tracing eyes, nose, mouth, and the chin.
- Photo Stitching (“Image Registration”) with two techniques: “translation only” and full “homography.”
- Rectangle, barcode, and text detection.
- Object tracking, for faces or other rectangles in video.
- Automatically integrate Core ML models directly into Vision. (Apple showed a demo using an MNIST handwriting recognition model and Core Image filters to read the number four from a sticky note.)
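A minimal sketch of the Core ML integration, hedged: `MyDigitClassifier` is a hypothetical stand-in for whatever model class Xcode generates from your compiled Core ML model (e.g. an MNIST classifier), not an API that ships with Vision.

```swift
import CoreML
import Vision

// "MyDigitClassifier" is hypothetical — substitute your own generated model class.
do {
    let mlModel = try MyDigitClassifier(configuration: MLModelConfiguration()).model
    let visionModel = try VNCoreMLModel(for: mlModel)

    let request = VNCoreMLRequest(model: visionModel) { request, error in
        guard let results = request.results as? [VNClassificationObservation],
              let top = results.first else { return }
        print("\(top.identifier) (\(top.confidence))")
    }
    // Vision scales and crops the input image to the model's expected size for you.
    request.imageCropAndScaleOption = .centerCrop
} catch {
    print("Could not load model: \(error)")
}
```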
Three Steps To Vision
- Use a VNRequest subclass to ask Vision for something.
- Pass the request to one of two kinds of request handlers, along with a completion block.
- In the completion block, we get back the initial request, with its results array populated with “observations.”
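The three steps above can be sketched roughly like this (the `cgImage` value is an assumption — any image you already have in hand):

```swift
import Vision

// 1. Create a request describing what you want Vision to find.
let request = VNDetectFaceRectanglesRequest { request, error in
    // 3. The completion block receives the same request back,
    //    with its `results` array populated with observations.
    guard let faces = request.results as? [VNFaceObservation] else { return }
    for face in faces {
        print("Face at \(face.boundingBox)") // normalized coordinates
    }
}

// 2. Pass the request to a request handler.
let handler = VNImageRequestHandler(cgImage: cgImage)
try? handler.perform([request])
```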
Vision offers two request handlers:
Image Request Handler: For “interactive” exploration of a still image. Image request handlers retain images for their lifecycle, as a performance optimization.
Sequence Request Handler: For tracking movement across video. Sequence request handlers don’t have the same optimization. (Imagine how many frames would need to be cached for a 30-second video clip at 24 frames per second: 720.)
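A sketch of the sequence-handler path, with assumptions labeled: `initialObservation` is an observation you obtained from an earlier detection request, and `pixelBuffer` is the next video frame.

```swift
import Vision

// One sequence handler is reused across frames; it carries tracking state.
let sequenceHandler = VNSequenceRequestHandler()

// `initialObservation` (from a prior detection) seeds the tracker — an assumption here.
let trackingRequest = VNTrackObjectRequest(detectedObjectObservation: initialObservation)

// Call this per frame; `pixelBuffer` is the current CVPixelBuffer.
try? sequenceHandler.perform([trackingRequest], on: pixelBuffer)

if let tracked = trackingRequest.results?.first as? VNDetectedObjectObservation {
    print("Object moved to \(tracked.boundingBox)")
}
```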
Apple discussed three areas for best practices:
1. Image Type
Vision supports a wide variety of image types/sources:
- CVPixelBuffers for streaming. (You get a CMSampleBuffer from a camera stream’s VideoDataOut.) These are a low-level format for in-memory RGB data.
- URL for accessing images that are saved to disk, or NSData for images from the web. For URL-based images, you don’t need to pass EXIF orientation data. (But you can specify it if you want to override the default.)
- You can pass in CIImages from Core Image.
- Finally, if you already have an NSImage, get the CGImageRef and pass that into Vision. Easy.
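Each source above maps to a VNImageRequestHandler initializer. A sketch, assuming the image values (`pixelBuffer`, `fileURL`, `imageData`, `ciImage`, `cgImage`) already exist in your code:

```swift
import Vision

// Streaming: pass EXIF orientation yourself for buffer-based sources.
let fromBuffer = VNImageRequestHandler(cvPixelBuffer: pixelBuffer, orientation: .right)

// Disk: orientation is read from the file's EXIF data by default.
let fromURL = VNImageRequestHandler(url: fileURL)

// Web: raw image bytes, e.g. downloaded NSData/Data.
let fromData = VNImageRequestHandler(data: imageData)

// Core Image and Core Graphics images work directly as well.
let fromCI = VNImageRequestHandler(ciImage: ciImage, orientation: .up)
let fromCG = VNImageRequestHandler(cgImage: cgImage)
```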
2. What am I going to do with the image?
- Use the appropriate handler. (Image for stills, sequence for video.)
- Don’t pre-scale images.
- Do pass in EXIF orientation data. (Except for URL-based images.)
- Dispatch to a background queue, so Vision doesn’t block your UI. In the completion handler, remember to dispatch back to the main queue if you’re updating UI.
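The dispatch pattern in the last point looks roughly like this; `cgImage` and `updateUI(with:)` are placeholders for your own image and UI code.

```swift
import Vision

// Run Vision work off the main queue so it doesn't block the UI.
DispatchQueue.global(qos: .userInitiated).async {
    let request = VNDetectFaceRectanglesRequest { request, _ in
        let faces = request.results as? [VNFaceObservation] ?? []
        // Hop back to the main queue before touching UI.
        DispatchQueue.main.async {
            updateUI(with: faces)   // hypothetical UI-update helper
        }
    }
    let handler = VNImageRequestHandler(cgImage: cgImage)
    try? handler.perform([request])
}
```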
Apple ended the session with a demo.