4.1. Computer Vision Solution Types
💡 First Principle: Computer vision capabilities differ by what the OUTPUT contains. Image classification returns one label for the whole image. Object detection returns bounding boxes with locations. OCR returns extracted text. The output structure tells you which capability you need.
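To make the difference in output structure concrete, here is a minimal Python sketch. The class names and fields (`ClassificationResult`, `DetectedObject`, `OcrLine`) are illustrative assumptions, not the response schema of any particular vision service. Real services typically return richer payloads, but the overall shapes are similar.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical result types, for illustration only.
# They are NOT the schema of any real vision API.

@dataclass
class ClassificationResult:
    label: str         # one label describing the whole image
    confidence: float  # score between 0 and 1


@dataclass
class DetectedObject:
    label: str
    confidence: float
    box: Tuple[int, int, int, int]  # (x, y, width, height) in pixels


@dataclass
class OcrLine:
    text: str  # text transcribed from the image


# Classification: a single label, no location information.
classification = ClassificationResult(label="beach", confidence=0.97)

# Detection: a list of objects, each with a label AND a bounding box.
detections: List[DetectedObject] = [
    DetectedObject("person", 0.91, (10, 20, 80, 200)),
    DetectedObject("umbrella", 0.88, (300, 150, 120, 140)),
]

# OCR: the text found in the image.
ocr_lines: List[OcrLine] = [OcrLine("PRIVATE BEACH")]
```

Note that only the detection result carries a `box` field. If a scenario asks where something is, a classification result has no field that could answer it.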
What breaks without this distinction: You'll see scenarios like "identify products on store shelves with their positions" and need to choose between classification, detection, and OCR. Without understanding output differences, these blur together. But "with their positions" means bounding boxes—that's object detection. Classification gives only a label; detection gives labels PLUS locations.
Think of these capabilities as different ways of describing a photo. Classification is labeling: "This is a beach scene." Detection is annotating: "Person at top-left, umbrella at center, dog at bottom-right." OCR is transcribing: "The sign says PRIVATE BEACH."
Consider a retail scenario. "Does this shelf contain our product?" requires classification. "Where on the shelf are our products?" requires detection. "What price is shown on the tag?" requires OCR. Ask what output type your scenario needs: a label, a location, or text. Answer that question and you've identified the capability.
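The label / location / text question can be written as a small lookup. This is only a study aid under stated assumptions: the function name `pick_capability` and the three output-type keys are hypothetical, and real scenarios may need more than one capability at once.

```python
# Hypothetical helper: maps the kind of output a scenario needs
# to the computer vision capability that produces it.
CAPABILITY_BY_OUTPUT = {
    "label": "image classification",
    "location": "object detection",
    "text": "OCR",
}


def pick_capability(needed_output: str) -> str:
    """Return the capability name for a needed output type."""
    try:
        return CAPABILITY_BY_OUTPUT[needed_output]
    except KeyError:
        raise ValueError(f"unknown output type: {needed_output!r}") from None


# The three retail questions from the text, tagged with the output each needs.
retail_questions = [
    ("Does this shelf contain our product?", "label"),
    ("Where on the shelf are our products?", "location"),
    ("What price is shown on the tag?", "text"),
]

for question, needed_output in retail_questions:
    print(f"{question} -> {pick_capability(needed_output)}")
```

The lookup is deliberately trivial. The work in an exam question is deciding which output type the wording implies, for example reading "with their positions" as a request for locations.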
Building on the Output Structure concept from Section 1.2.2, let's examine each computer vision capability in detail. The key distinction is what the output contains.