In "The utility of image descriptions in the initial stages of vision: A case study of printed text", Watt and Dakin (2010) describe a model that integrates mechanisms at both early and middle stages of visual processing, and demonstrate its application to the relational organization of printed text. In the following, we discuss several merits of this approach, but argue that it is, at this stage, very difficult to assess the utility of the model as a plausible description of human visual processing. First, we point out that the authors' description of the model is underspecified. Second, we question the generalizability of the model. Third, we argue that the model needs to be compared directly with quantitative empirical data. Fourth, we argue that it needs to be compared directly with alternative models.