Vision
These are my notes on WWDC 2017 Session 506, "Vision Framework: Building on Core ML." I've linked to Wikipedia for technical terms I wasn't familiar with.
What is it?
High-level on-device solutions to computer vision problems through one simple API.
– Brett Keating, WWDC 2017
What can it do?
The Vision framework can detect faces using deep learning. Apple says this gives higher precision and higher recall than previous technologies such as Core Image or AVCapture. It enables better detection of smaller faces, side views ("strong profiles"), and partially blocked ("occluded") faces, including faces obscured by hats and glasses.
The full feature list:
- Face landmarks, a “constellation of points” on the face. Essentially, tracing eyes, nose, mouth, and the chin.
- Photo Stitching (“Image Registration”) with two techniques: “translation only” and full “homography.”
- Rectangle, barcode, and text detection.
- Object tracking, for faces or other rectangles in video.
- Automatic integration of Core ML models directly into Vision. (Apple showed a demo using an MNIST handwriting recognition model and Core Image filters to read the number four from a sticky note; a sketch of wrapping a Core ML model in a Vision request follows this list.)
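Here's a minimal sketch of that last point, assuming `MNISTClassifier` is a hypothetical class Xcode generated from a `.mlmodel` file (the Core Image preprocessing from the demo is omitted):

```swift
import CoreGraphics
import CoreML
import Vision

// Hypothetical: `MNISTClassifier` stands in for whatever class Xcode
// generates from your .mlmodel file.
func classifyDigit(in image: CGImage) throws {
    let model = try VNCoreMLModel(for: MNISTClassifier().model)
    let request = VNCoreMLRequest(model: model) { request, _ in
        // Classification models produce VNClassificationObservations,
        // sorted by confidence.
        guard let best = request.results?.first as? VNClassificationObservation else { return }
        print("Saw digit \(best.identifier) with confidence \(best.confidence)")
    }
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    try handler.perform([request])
}
```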
Three Steps To Vision
- Use a `VNRequest` subclass to ask Vision for something. For example, `VNDetectBarcodesRequest` for barcodes.
- Pass the request to one of two kinds of request handlers, along with a completion block.
- In the completion block, we get back the initial request, its `results` array populated with "observations," like `VNClassificationObservation` or `VNDetectedObjectObservation`. (A sketch of this flow follows the list.)
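A minimal sketch of those three steps, using barcode detection against a `CGImage` (the image source here is just a placeholder parameter):

```swift
import CoreGraphics
import Vision

func readBarcodes(in image: CGImage) {
    // Step 1: ask Vision for something by creating a request.
    let barcodeRequest = VNDetectBarcodesRequest { request, error in
        // Step 3: the completion block hands back the request with its
        // results array populated with observations.
        guard let observations = request.results as? [VNBarcodeObservation] else { return }
        for barcode in observations {
            print(barcode.payloadStringValue ?? "barcode with no string payload")
        }
    }

    // Step 2: pass the request to a request handler.
    let handler = VNImageRequestHandler(cgImage: image, options: [:])
    do {
        try handler.perform([barcodeRequest])
    } catch {
        print("Vision request failed: \(error)")
    }
}
```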
Request Handlers
Vision offers two request handlers:
- Image request handler: for "interactive" exploration of a still image. Image request handlers retain images for their lifecycle, as a performance optimization.
- Sequence request handler: for tracking movement across video. Sequence request handlers don't have that optimization. (Imagine how many frames would need to be cached for a 30-second video clip at 24 frames per second: 720.) A sketch of the sequence case follows this list.
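A minimal sketch of tracking an object across video frames with a sequence request handler; the initial bounding box is a made-up placeholder that would normally come from an earlier detection:

```swift
import CoreGraphics
import CoreVideo
import Vision

// One sequence request handler is reused across all frames of the video.
let sequenceHandler = VNSequenceRequestHandler()

// Seed the tracker with a bounding box in normalized coordinates.
// (Placeholder values; normally this comes from a detection on the first frame.)
let seed = VNDetectedObjectObservation(boundingBox: CGRect(x: 0.4, y: 0.4, width: 0.2, height: 0.2))
let trackingRequest = VNTrackObjectRequest(detectedObjectObservation: seed)

// Call once per CVPixelBuffer the camera delivers.
func track(frame pixelBuffer: CVPixelBuffer) {
    do {
        try sequenceHandler.perform([trackingRequest], on: pixelBuffer)
        if let updated = trackingRequest.results?.first as? VNDetectedObjectObservation {
            print("Object is now at \(updated.boundingBox)")
            // Feed the latest observation back in for the next frame.
            trackingRequest.inputObservation = updated
        }
    } catch {
        print("Tracking failed: \(error)")
    }
}
```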
Best Practices
Apple discussed three areas for best practices:
1. Image Type
Vision supports a wide variety of image types/sources: `CVPixelBuffer`, `CGImageRef`, `CIImage`, `NSURL`, and `NSData`.
- Use `CVPixelBuffer`s for streaming. (You get `CVPixelBuffer`s from the `CMSampleBuffer`s that a camera stream's `AVCaptureVideoDataOutput` delivers.) These are a low-level format for in-memory RGB data.
- Use `URL` for accessing images that are saved to disk, or `NSData` for images from the web. For URL-based images, you don't need to pass EXIF orientation data. (But you can specify it if you want to override the default.)
- You can pass in `CIImage`s from Core Image.
- Finally, if you already have a `UIImage` or `NSImage`, get the `CGImageRef` and pass that into Vision. Easy. (A sketch of each of these initializers follows the list.)
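A quick sketch of those options; the parameters are placeholders standing in for wherever your images actually come from:

```swift
import CoreVideo
import UIKit
import Vision

// Each parameter is a placeholder for a real source: the capture pipeline,
// the file system, the network, or an image already loaded in your UI.
func makeHandlers(pixelBuffer: CVPixelBuffer,
                  fileURL: URL,
                  imageData: Data,
                  uiImage: UIImage) -> [VNImageRequestHandler] {
    var handlers: [VNImageRequestHandler] = []

    // Streaming: a CVPixelBuffer pulled from a camera CMSampleBuffer.
    // Pass the real EXIF orientation here; .right is just an example.
    handlers.append(VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                          orientation: .right,
                                          options: [:]))

    // A file on disk: Vision reads the orientation from the file itself.
    handlers.append(VNImageRequestHandler(url: fileURL, options: [:]))

    // Image data downloaded from the web.
    handlers.append(VNImageRequestHandler(data: imageData, options: [:]))

    // An existing UIImage: hand Vision its CGImage.
    if let cgImage = uiImage.cgImage {
        handlers.append(VNImageRequestHandler(cgImage: cgImage, options: [:]))
    }

    return handlers
}
```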
2. What am I going to do with the image?
- Use the appropriate handler (`VNImageRequestHandler` or `VNSequenceRequestHandler`).
- Don't pre-scale images.
- Do pass in EXIF orientation data, except for URL-based images. (A sketch of converting a UIKit orientation to the EXIF-style value follows this list.)
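As one sketch of what "pass in EXIF orientation data" can look like in practice, here's a hypothetical helper mapping UIKit's image orientation to the `CGImagePropertyOrientation` value that Vision's handlers accept:

```swift
import ImageIO
import UIKit

// Hypothetical helper: convert UIKit's image orientation to the EXIF-style
// CGImagePropertyOrientation that VNImageRequestHandler expects.
func exifOrientation(for orientation: UIImage.Orientation) -> CGImagePropertyOrientation {
    switch orientation {
    case .up:            return .up
    case .upMirrored:    return .upMirrored
    case .down:          return .down
    case .downMirrored:  return .downMirrored
    case .left:          return .left
    case .leftMirrored:  return .leftMirrored
    case .right:         return .right
    case .rightMirrored: return .rightMirrored
    @unknown default:    return .up
    }
}
```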
3. Performance
Dispatch to a background queue, so Vision doesn’t block your UI. In the completion handler, remember to dispatch back to the main queue if you’re updating UI.
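A minimal sketch of that pattern, using face detection as the example request (the specific request type here is my choice, not prescribed by the session):

```swift
import CoreGraphics
import Dispatch
import Vision

// Run the Vision work off the main thread, then hop back before touching UI.
func countFaces(in image: CGImage, completion: @escaping (Int) -> Void) {
    DispatchQueue.global(qos: .userInitiated).async {
        let request = VNDetectFaceRectanglesRequest { request, _ in
            let faceCount = request.results?.count ?? 0
            // Back on the main queue for any UI updates.
            DispatchQueue.main.async {
                completion(faceCount)
            }
        }
        let handler = VNImageRequestHandler(cgImage: image, options: [:])
        do {
            try handler.perform([request])
        } catch {
            DispatchQueue.main.async { completion(0) }
        }
    }
}
```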
Demo
Apple ended the session with a demo.