Vision
These are my notes on WWDC 2017 Session 506, “Vision Framework: Building on Core ML.” I’ve linked to Wikipedia for technical terms I wasn’t familiar with.
What is it?
High-level on-device solutions to computer vision problems through one simple API.
– Brett Keating, WWDC 2017
What can it do?
The Vision framework can detect faces using deep learning. Apple says this gives higher precision and higher recall than previous technologies, such as Core Image or AVCapture. This allows for better detection of smaller faces, side views (“strong profiles”), and partially blocked (“occluded”) faces, including faces obscured by hats and glasses.
The full feature list:
- Face landmarks, a “constellation of points” on the face. Essentially, tracing the eyes, nose, mouth, and chin.
- Photo Stitching (“Image Registration”) with two techniques: “translation only” and full “homography.”
- Rectangle, barcode, and text detection.
- Object tracking, for faces or other rectangles in video.
- Automatically integrate Core ML models directly into Vision. (Apple showed a demo using an MNIST handwriting recognition model and Core Image filters to read the number four from a sticky note.)
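As a rough sketch of that Core ML integration (the `classifyDigit` function and the idea of passing in an already-loaded `MLModel` are my own framing, not from the session):

```swift
import CoreML
import Vision

// `mlModel` is any loaded Core ML classifier, e.g. an MNIST digit model.
func classifyDigit(in cgImage: CGImage, using mlModel: MLModel) throws {
    let visionModel = try VNCoreMLModel(for: mlModel)

    let request = VNCoreMLRequest(model: visionModel) { request, _ in
        guard let top = (request.results as? [VNClassificationObservation])?.first else { return }
        print("Saw \(top.identifier) with confidence \(top.confidence)")
    }

    // Vision scales and crops the image to whatever input size the model expects.
    try VNImageRequestHandler(cgImage: cgImage, options: [:]).perform([request])
}
```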
Three Steps To Vision
- Use a `VNRequest` subclass to ask Vision for something. For example, `VNDetectBarcodesRequest` for barcodes.
- Pass the request to one of two kinds of request handlers, along with a completion block.
- In the completion block, we get back the initial request, with its `results` array populated with “observations,” like `VNClassificationObservation` or `VNDetectedObjectObservation`.
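A minimal sketch of those three steps for barcode detection, assuming you already have a `CGImage` in hand:

```swift
import Vision

func detectBarcodes(in cgImage: CGImage) {
    // Step 1: Ask Vision for something via a VNRequest subclass.
    let request = VNDetectBarcodesRequest { request, error in
        // Step 3: The completion block hands back the same request,
        // with its results array populated with observations.
        guard error == nil,
              let barcodes = request.results as? [VNBarcodeObservation] else { return }
        for barcode in barcodes {
            print("Found barcode: \(barcode.payloadStringValue ?? "unknown payload")")
        }
    }

    // Step 2: Pass the request to a request handler.
    let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
    do {
        try handler.perform([request])
    } catch {
        print("Vision request failed: \(error)")
    }
}
```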
Request Handlers
Vision offers two request handlers:
- Image Request Handler (`VNImageRequestHandler`): For “interactive” exploration of a still image. Image request handlers retain images for their lifecycle, as a performance optimization.
- Sequence Request Handler (`VNSequenceRequestHandler`): For tracking movement across video. Sequence request handlers don’t have the same optimization. (Imagine how many frames would need to be cached for a 30-second video clip at 24 frames per second: 720.)
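A sketch of using the sequence request handler to track a face across video frames; the `FaceTracker` class and its method names are my own, and the pixel buffers would come from your camera or video pipeline:

```swift
import Vision

final class FaceTracker {
    private let sequenceHandler = VNSequenceRequestHandler()
    private var trackingRequest: VNTrackObjectRequest?

    // Seed the tracker with an observation from an earlier face-detection request.
    func startTracking(_ face: VNDetectedObjectObservation) {
        trackingRequest = VNTrackObjectRequest(detectedObjectObservation: face)
    }

    // Call once per video frame.
    func track(in pixelBuffer: CVPixelBuffer) {
        guard let request = trackingRequest else { return }
        try? sequenceHandler.perform([request], on: pixelBuffer)

        // Feed the updated observation back in for the next frame.
        if let updated = request.results?.first as? VNDetectedObjectObservation {
            request.inputObservation = updated
            print("Face is now at \(updated.boundingBox)")
        }
    }
}
```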
Best Practices
Apple discussed three areas for best practices:
1. Image Type
Vision supports a wide variety of image types/sources: CVPixelBuffer, CGImageRef, CIImage, NSURL, and NSData.
- Use `CVPixelBuffer`s for streaming. (You get a `CVPixelBuffer` from the `CMSampleBuffer` delivered by a camera stream’s `AVCaptureVideoDataOutput`.) These are a low-level format for in-memory RGB data.
- Use `URL` for accessing images that are saved to disk, or `NSData` for images from the web. For URL-based images, you don’t need to pass EXIF orientation data. (But you can specify it if you want to override the default.)
- You can pass in `CIImage`s from Core Image.
- Finally, if you already have a `UIImage` or `NSImage`, get the `CGImageRef` and pass that into Vision. Easy.
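Each of those source types maps onto a `VNImageRequestHandler` initializer. A sketch (the function and the `.right` orientation are just illustrative):

```swift
import UIKit
import Vision

func makeHandlers(pixelBuffer: CVPixelBuffer,
                  fileURL: URL,
                  ciImage: CIImage,
                  uiImage: UIImage) -> [VNImageRequestHandler] {
    var handlers: [VNImageRequestHandler] = []

    // Streaming: a CVPixelBuffer pulled out of a CMSampleBuffer.
    // The caller supplies the orientation; .right is only a placeholder.
    handlers.append(VNImageRequestHandler(cvPixelBuffer: pixelBuffer,
                                          orientation: .right,
                                          options: [:]))

    // On disk: Vision reads the EXIF orientation from the file itself.
    handlers.append(VNImageRequestHandler(url: fileURL, options: [:]))

    // Core Image.
    handlers.append(VNImageRequestHandler(ciImage: ciImage, options: [:]))

    // UIImage (or NSImage on macOS): hand Vision the underlying CGImage.
    if let cgImage = uiImage.cgImage {
        handlers.append(VNImageRequestHandler(cgImage: cgImage, options: [:]))
    }
    return handlers
}
```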
2. What am I going to do with the image?
- Use the appropriate handler. (`VNImageRequestHandler` or `VNSequenceRequestHandler`.)
- Don’t pre-scale images.
- Do pass in EXIF orientation data. (Except for URL-based images.)
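Vision takes orientation as a `CGImagePropertyOrientation`, while UIKit hands you a `UIImage.Orientation`, so a small conversion like this one (the same direct mapping Apple uses in its sample code) is handy:

```swift
import UIKit
import ImageIO

extension CGImagePropertyOrientation {
    // Map UIKit's orientation cases onto the EXIF-style cases Vision expects.
    init(_ orientation: UIImage.Orientation) {
        switch orientation {
        case .up:            self = .up
        case .upMirrored:    self = .upMirrored
        case .down:          self = .down
        case .downMirrored:  self = .downMirrored
        case .left:          self = .left
        case .leftMirrored:  self = .leftMirrored
        case .right:         self = .right
        case .rightMirrored: self = .rightMirrored
        @unknown default:    self = .up
        }
    }
}
```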
3. Performance
Dispatch to a background queue, so Vision doesn’t block your UI. In the completion handler, remember to dispatch back to the main queue if you’re updating UI.
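A sketch of that dispatch pattern, using face detection as the example (the `runFaceDetection` function is my own name, not from the session):

```swift
import Foundation
import Vision

func runFaceDetection(on cgImage: CGImage, completion: @escaping (Int) -> Void) {
    let request = VNDetectFaceRectanglesRequest { request, _ in
        let faceCount = (request.results as? [VNFaceObservation])?.count ?? 0
        // Hop back to the main queue before touching UI.
        DispatchQueue.main.async {
            completion(faceCount)
        }
    }

    // Keep the Vision work off the main queue so it doesn't block the UI.
    DispatchQueue.global(qos: .userInitiated).async {
        let handler = VNImageRequestHandler(cgImage: cgImage, options: [:])
        try? handler.perform([request])
    }
}
```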
Demo
Apple ended the session with a demo.