Neural Information Processing


Huber, L. S., Mast, F. W., and Wichmann, F. A. (2024). Contrasting learning dynamics: Immediate generalisation in humans and generalisation lag in deep neural networks. Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster).

Behavioral comparisons of human and deep neural network (DNN) models of object recognition help to benchmark and improve DNN models but also might help to illuminate the intricacies of human visual perception. However, machine-to-human comparisons are often fraught with difficulty: Unlike DNNs, which typically learn from scratch using static, uni-modal data, humans process continuous, multi-modal information and leverage prior knowledge. Additionally, while DNNs are predominantly trained in a supervised manner, human learning heavily relies on interactions with unlabeled data. We address these disparities by attempting to align the learning processes and examining not only the outcomes but also the dynamics of representation learning in humans and DNNs.
We engaged humans and DNNs in a task to learn representations of three novel 3D object classes. Participants completed six epochs of an image classification task—reflecting the train-test iteration process common in machine learning—with feedback provided only during training phases. To align the starting point of learning we utilized pre-trained DNNs. This experimental design ensured that both humans and models learn new representations from the same static, uni-modal inputs in a supervised learning environment, enabling side-by-side comparison of learning dynamics. We collected over 6,000 trials from human participants and compared the observed dynamics with various DNNs. While DNNs exhibit learning dynamics with fast training progress but lagging generalization, human learners often display a simultaneous increase in train and test performance, showcasing immediate generalization. However, when solely focusing on test performance, DNNs show good alignment with the human generalization trajectory.
By synchronizing the learning environment and examining the full scope of the learning process, the present study offers a refined comparison of representation learning. The collected data reveal both similarities and differences between human and DNN learning dynamics. These disparities emphasize that global assessments of DNNs as models of human visual perception are problematic without considering specific modeling objectives.
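For the machine side of such a comparison, the train-test loop is standard; the following is a minimal, hypothetical sketch (not the study's actual code or stimuli) of fine-tuning a pre-trained network on a few novel classes while recording train and test accuracy after every epoch, which is the kind of learning curve the abstract compares against human data. The random tensors merely stand in for rendered object images.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
import torchvision.models as models

# stand-in data: three "novel object classes"; random tensors replace rendered images
x_train, y_train = torch.rand(120, 3, 64, 64), torch.randint(0, 3, (120,))
x_test, y_test = torch.rand(60, 3, 64, 64), torch.randint(0, 3, (60,))
train_loader = DataLoader(TensorDataset(x_train, y_train), batch_size=16, shuffle=True)

# a pre-trained backbone aligns the starting point of learning; new 3-way readout
model = models.resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, 3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

def accuracy(x, y):
    model.eval()
    with torch.no_grad():
        return (model(x).argmax(dim=1) == y).float().mean().item()

for epoch in range(6):                      # six train-test epochs, as in the abstract
    model.train()
    for xb, yb in train_loader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()
    print(f"epoch {epoch + 1}: "
          f"train acc {accuracy(x_train, y_train):.2f}, test acc {accuracy(x_test, y_test):.2f}")
```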

Klein, T., Brendel, W., and Wichmann, F. A. (2024). Error consistency between humans and machines as a function of presentation duration. Vision Sciences Society (VSS), St. Pete Beach, FL, USA (talk).

Within the last decade, Artificial Neural Networks (ANNs) have emerged as powerful computer vision systems that match or exceed human performance on some benchmark tasks such as image classification. But whether current ANNs are suitable computational models of the human visual system remains an open question: While ANNs have proven to be capable of predicting neural activations in primate visual cortex, psychophysical experiments show behavioral differences between ANNs and human subjects as quantified by error consistency.
Error consistency is typically measured by briefly presenting natural or corrupted images to human subjects and asking them to perform an n-way classification task under time pressure. But for how long should stimuli ideally be presented to guarantee a fair comparison with ANNs?
Here we investigate the role of presentation time and find that it strongly affects error consistency. We systematically vary presentation times from 8.3 ms to >1000 ms, followed by a noise mask, and measure human performance and reaction times on natural, lowpass-filtered and noisy images. Our experiment constitutes a fine-grained analysis of human image classification under both image corruption and time pressure, showing that even drastically time-constrained humans who are exposed to the stimuli for only two frames, i.e. 16.6 ms, can still solve our 8-way classification task with success rates above chance. Importantly, the shift and slope of the psychometric function relating recognition accuracy to presentation time also depend on the type of corruption. In addition, we find that error consistency also depends systematically on presentation time.
Together, our findings raise the question of how presentation time should be set to guarantee a fair comparison with ANNs. Moreover, the differential benefit of longer presentation times depending on image corruption is consistent with the notion that recurrent processing plays a role in human object recognition, at least for difficult-to-recognise images.
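As a concrete reference for the measure discussed above, the following minimal sketch computes trial-by-trial error consistency as a Cohen's-kappa-style statistic: the observed fraction of trials on which two decision makers are jointly right or jointly wrong, compared against the overlap expected from their accuracies alone. The boolean response arrays are hypothetical placeholders, not data from this experiment.

```python
import numpy as np

def error_consistency(correct_a, correct_b):
    """Trial-by-trial error consistency (Cohen's kappa) between two decision makers.

    correct_a, correct_b: boolean arrays, True where the response was correct,
    for the same trials in the same order.
    """
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)

    # observed consistency: fraction of trials on which both are right or both are wrong
    c_obs = np.mean(correct_a == correct_b)

    # consistency expected from the two accuracies alone, if errors were independent
    p_a, p_b = correct_a.mean(), correct_b.mean()
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)

    # kappa: how much the observed overlap exceeds the accuracy-driven expectation
    return (c_obs - c_exp) / (1 - c_exp)

# hypothetical placeholder responses for human and model on the same 8-way trials
rng = np.random.default_rng(0)
human_correct = rng.random(1000) < 0.80
model_correct = rng.random(1000) < 0.75
print(error_consistency(human_correct, model_correct))   # close to 0 for independent errors
```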

Künstle, D.-E., and Wichmann, F. A. (2024). Effect or artifact? Assessing the stability of comparison-based scales. Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster).

Measuring the subjective similarity of stimuli—for example, the visual impression of materials or the categories in object images—can be achieved through multidimensional scales. These scales represent each stimulus as a point, with inter-point distances reflecting similarity measurements from behavioral experiments. An intuitive task used in this context is the ordinal triplet comparison: “Is stimulus i more similar to stimulus j or k?”. Modern ordinal embedding algorithms infer the (metric) scale from a subset of (ordinal) triplet comparisons, remaining robust to observer response noise. However, the unknown residual errors raise concerns about interpreting the scale’s exact shape and whether additional data may be necessary or helpful. These observations demand an examination of the scale’s stability. Here, we present an approach to visualize the variation of comparison-based scales via bootstrapping techniques and a probabilistic model. Simulation experiments demonstrate that variation is broadly captured by an ensemble of scales estimated from resampled trials. However, common methods to align the ensemble parts in size, rotation, and translation can distort the local variation. For example, standardization results in zero variation at some points but inflates variation at others, while Procrustes analysis leads to a uniform “noise” distribution across all scale points. To address this, we propose a probabilistic model to identify the variation at individual scale points. In essence, we “wiggle” scale points to observe changes in triplet correspondence, indicating their stability level. These localized estimates are combined in the ensemble to provide a robust measure of variation. Simulations validate our approach, while case studies on behavioral datasets emphasize the practical relevance. In these case studies, we visualize perceptual estimates through regions instead of points and identify the most variable stimuli or trials. Beyond data analysis, our stability measures enable further downstream tasks like adaptive trial selection to expedite experiments.
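The alignment problem described above can be illustrated numerically. The sketch below is only a stand-in: instead of re-estimating an ordinal embedding from resampled trials, it perturbs a known 2D scale with point-specific noise to mimic a bootstrap ensemble, then compares the apparent per-point variability after Procrustes alignment and after per-scale standardization. All parameters are arbitrary.

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(1)

# a known 2D "scale": 15 stimuli along an arc
t = np.linspace(0, np.pi, 15)
truth = np.c_[np.cos(t), np.sin(t)]

# stand-in for a bootstrap ensemble: each "resampled" scale is the truth plus
# point-specific noise (the first points are stable, later ones increasingly variable)
noise_sd = np.linspace(0.01, 0.2, len(truth))[:, None]
ensemble = [truth + rng.normal(0, 1, truth.shape) * noise_sd for _ in range(200)]

def spread(aligned_scales):
    """Per-point standard deviation (averaged over x and y) across the ensemble."""
    return np.asarray(aligned_scales).std(axis=0).mean(axis=1)

# alignment option 1: Procrustes (translation, scaling, rotation) towards the truth
proc = [procrustes(truth, s)[1] for s in ensemble]

# alignment option 2: per-scale standardization (zero mean, unit variance per axis)
std = [(s - s.mean(axis=0)) / s.std(axis=0) for s in ensemble]

print("Procrustes spread:     ", np.round(spread(proc), 3))
print("Standardization spread:", np.round(spread(std), 3))
# the generating variability increases monotonically from the first to the last point;
# comparing the two printed profiles against that pattern shows how the alignment step
# itself reshapes the apparent per-point variation
```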

Künstle, D.-E., and Wichmann, F. A. (2024). The Robustness of Metric Perceptual Representations Derived From Ordinal Comparisons. Computational and Mathematical Models in Vision (MODVIS), St. Pete Beach, FL, USA (poster).

Experiments quantifying perceived stimulus similarity with ordinal comparisons, such as the triplet task, are becoming increasingly popular. Using ordinal embedding algorithms, the perceived similarity is approximately represented by distances between points in a putative internal space. The stimuli can range from material perception and visual distortions to object images and words. Whilst theoretical results guarantee that a metric representation can be reconstructed from ordinal comparisons, it remains unclear whether realistic data, that is, data from a limited number of trials collected from at least partially inconsistent and lapsing observers, result in severely distorted representations in practice.
Here, we present an investigation into the robustness of metric representations using simulations as well as behavioral data. We discuss three approaches to quantify the (potential) distortions of the estimated representation relative to the (typically unknown) ground truth.
First, we look at performance metrics and see that the popular triplet error is an insufficient predictor of distortion. Second, we examine bootstrapping approaches, which, in general, provide a valuable statistical tool for many different investigations. We show, however, that bootstrapping applied to ordinal embeddings cannot be trusted locally. Finally, we introduce a probabilistic interpretation of ordinal comparisons and convert local inconsistencies to uncertainties in the representation space.
Overall, our simulations show that metric representations from ordinal comparisons are robust and that additional data can, perhaps not surprisingly, compensate for inconsistent observer responses. Suitable performance metrics are valuable indicators of whether the estimated representation reliably captures the effects under investigation.
Additional distributional assumptions of probabilistic models convert local inconsistencies into uncertainties and help visualize representational distortions. Thus, our methods help determine if the differences between points in an inferred perceptual representation are an effect or rather an artifact.

Reichert, J., and Wichmann, F. A. (2024). A modular image-computable psychophysical spatial vision model. Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster).

To explain the initial encoding of pattern information in the human visual system, the standard psychophysical spatial vision model is based on channels specific to spatial frequency and orientation, followed by divisive normalization (contrast gain-control). Schütt and Wichmann (2017, Journal of Vision) developed an image-computable implementation of the standard model and showed it to be able to explain data for contrast detection, contrast discrimination, and oblique and natural-image masking. Furthermore, the model induces a sparse encoding of luminance information. Whilst the model's MATLAB code is publicly available, it is non-trivial to extend or to integrate into larger pipelines because it does not provide a modular, pluggable programming framework. Based on the previous MATLAB implementation, we developed a modular image-computable implementation of this spatial vision model as a PyTorch framework. Furthermore, we added a number of refinements, such as a contrast gain-control that depends jointly on spatial position and spatial frequency. With luminance images as input, it is easy to employ the model on real-world images. Using the same psychophysical data, we compare our model's predictions of contrast detection, contrast discrimination, and oblique and natural-image masking with the previous implementation. The major advantage of our framework, however, derives from its modularity and the automatic differentiation offered by PyTorch, as these facilitate the implementation and evaluation of new components for the spatial vision model. Furthermore, our framework allows the integration of this psychophysically validated spatial vision model into larger image-processing pipelines. This could be used to take inputs from retina models instead of from pre-computed luminance images or to further process the model's outputs with higher-level vision models. Given its flexibility, the model could also be used as a plug-in for or replacement of parts of artificial neural networks, which would enable comparison of aspects of human and machine vision.
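The following is a minimal sketch of what such a modular, pluggable PyTorch pipeline can look like: a fixed bank of orientation- and frequency-tuned filters followed by a divisive-normalization stage, composed with nn.Sequential so individual stages can be swapped or extended. It is not the framework described in the abstract; filter parameters and the normalization pool are arbitrary illustrative choices.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def gabor_kernel(size, wavelength, theta, sigma):
    """Odd-sized Gabor kernel tuned to one spatial frequency and orientation."""
    r = torch.arange(size, dtype=torch.float32) - size // 2
    y, x = torch.meshgrid(r, r, indexing="ij")
    xr = x * math.cos(theta) + y * math.sin(theta)
    envelope = torch.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return envelope * torch.cos(2 * math.pi * xr / wavelength)

class ChannelDecomposition(nn.Module):
    """Fixed bank of orientation- and spatial-frequency-tuned linear filters."""
    def __init__(self, wavelengths=(4, 8, 16), n_orient=4, size=31):
        super().__init__()
        kernels = [gabor_kernel(size, w, o * math.pi / n_orient, sigma=0.5 * w)
                   for w in wavelengths for o in range(n_orient)]
        self.register_buffer("weight", torch.stack(kernels).unsqueeze(1))   # (C, 1, k, k)

    def forward(self, luminance):                                           # (B, 1, H, W)
        return F.conv2d(luminance, self.weight, padding=self.weight.shape[-1] // 2)

class DivisiveNormalization(nn.Module):
    """Channel responses divided by pooled, spatially blurred activity of all channels."""
    def __init__(self, sigma=0.1, p=2.0, pool_size=5):
        super().__init__()
        self.sigma, self.p, self.pool_size = sigma, p, pool_size

    def forward(self, responses):
        energy = responses.abs() ** self.p
        pooled = F.avg_pool2d(energy, self.pool_size, stride=1, padding=self.pool_size // 2)
        return energy / (self.sigma ** self.p + pooled.sum(dim=1, keepdim=True))

# a pluggable pipeline: any stage can be swapped for a refined component
model = nn.Sequential(ChannelDecomposition(), DivisiveNormalization())
out = model(torch.rand(1, 1, 128, 128))     # a luminance image goes in
print(out.shape)                            # torch.Size([1, 12, 128, 128])
```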

Schmittwilken, L., Wichmann, F. A., and Maertens, M. (2024). Do mechanisms of sinusoidal contrast sensitivity account for edge sensitivity? Computational and Mathematical Models in Vision (MODVIS), St. Pete Beach, FL, USA (talk).

Visual sensitivity varies with the spatial frequency (SF) content of the visual input. This established characteristic of early vision is known as the contrast sensitivity function (CSF). The CSF represents the notion that the visual system decomposes information into SF-selective channels [1], which inspired the now standard model of spatial vision. This model has been extensively tested with low-contrast sinusoidal gratings because their narrow SF spectra are ideal to isolate putative SF-selective mechanisms.
It is less extensively studied how well these mechanisms account for sensitivity to perhaps more relevant stimuli with broad SF spectra and high contrasts. As a middle ground between simple gratings and complex scenes, we investigated how well mechanisms underlying sinusoidal contrast sensitivity can account for sensitivity to high-contrast edges which abound in the environment and have broad SF spectra.
For this, we probed human edge sensitivity empirically (2-AFC, edge localization) and tested how well a spatial vision model could account for the data. We probed human edge sensitivity in the absence and presence of three broadband noises (white, pink, brown) and three narrow-band noises (center SFs: 0.5, 3, 9 cpd, one octave bandwidth) while varying the peak SFs of the edges. As edge stimuli we used three Cornsweet luminance profiles (peak SFs: 0.5, 3, 9 cpd), totaling 21 stimulus conditions.

Wu, S., Yoerueten, M., Wichmann, F. A., and Schulz, E. (2024). Normalized Cuts Characterize Visual Recognition Difficulty of Amorphous Image Sub-parts. Computational and Systems Neuroscience (COSYNE), Lisbon, Portugal (poster).

Upon glimpsing an image, we effortlessly perceive structures within it. What characterizes this process? Historically, gestalt psychologists have suggested that people tend to group nearby similar image parts together as a whole. Can an algorithm that partitions images into sub-parts based on similarity characterize visual perception behavior? We look at the normalized min-cut algorithm and its relation to the recognition difficulty of image parts. The algorithm transfers an image segmentation problem into a graph-cutting problem, approximating an energy optimization problem that preserves within-graph similarities. We study whether the number of computational steps needed for the algorithm to isolate an image part correlates with participants' difficulty in recognizing that part, and whether longer exposure times ease the recognition of parts that require further computational steps. We propose a psychophysics paradigm to study subjects' recognition behavior upon seeing images tiled by amorphous sub-parts. We found that image sub-parts with a higher cut number (more computation steps) are harder for subjects to recognize after a brief exposure time, and that longer exposure times increase recognition ease, consistent with the model's prediction that a higher cut number demands more computation steps to isolate a part. Our study relates the recognition difficulty of image parts to the computational resources needed to solve an optimization problem of grouping by similarity.
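For readers unfamiliar with the algorithm, the sketch below shows the core normalized-cut step (Shi and Malik's spectral relaxation) on a toy image: build a brightness-and-proximity affinity graph over pixels, solve the generalized eigenproblem, and threshold the second eigenvector to obtain the first bipartition. The image, affinity parameters, and threshold are illustrative assumptions, not the stimuli or settings of the study.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)

# toy 12x12 "image": a brighter square on a darker background, plus a little noise
img = np.full((12, 12), 0.2) + rng.normal(0, 0.02, (12, 12))
img[3:9, 3:9] += 0.6

h, w = img.shape
yy, xx = np.mgrid[0:h, 0:w]
coords = np.c_[yy.ravel(), xx.ravel()].astype(float)
vals = img.ravel()

# affinity: pixels are similar if they are close in brightness AND close in space
d_val = (vals[:, None] - vals[None, :]) ** 2
d_pos = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(-1)
W = np.exp(-d_val / 0.05) * np.exp(-d_pos / 16.0) * (d_pos <= 25)

# normalized-cut relaxation (Shi & Malik): second-smallest generalized eigenvector of
# (D - W) v = lambda D v; thresholding it at zero is a common way to read off the
# first bipartition
D = np.diag(W.sum(axis=1))
_, eigvecs = eigh(D - W, D)                # generalized symmetric eigenproblem
labels = (eigvecs[:, 1] > 0).astype(int).reshape(h, w)
print(labels)                              # roughly separates the square from the background
```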

Klein, T., Brendel, W., and Wichmann, F. A. (2023). Feature Visualizations do not sufficiently explain hidden units of Artificial Neural Networks. Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster).

Artificial Neural Networks (ANNs) have been proposed as computational models of the primate ventral stream, because their performance on tasks such as image classification rivals or exceeds human baselines. But useful models should not only predict data well, but also offer insights into the systems they represent, which remains a challenge for ANNs. We here investigate a specific method that has been proposed to shed light on the representations learned by ANNs: Feature Visualizations (FVs), that is, synthetic images specifically designed to excite individual units ("neurons") of the target network. Theoretically, these images should visualize the features that a unit is sensitive to, like receptive fields in neurophysiology.
We conduct a psychophysical experiment to establish an upper bound on the interpretability afforded by FVs, in which participants need to match five sets of exemplars (natural images that highly activate certain units) to five sets of FVs of the same units—a task that should be trivial if FVs were informative. Extending earlier work that has cast doubts on the utility of this method, we show that (1) even human experts perform hardly better than chance when trying to match a unit's FVs to its exemplars and that (2) matching exemplars to each other is much easier, even if only a single exemplar is shown per set. Presumably, this difficulty is not caused by so-called polysemantic units (i.e. neurons that code for multiple unrelated features, possibly mixing them in their visualizations) but by the unnatural visual appearance of FVs themselves. We also investigate the effect of visualizing units from different layers and find that the interpretability of FVs declines in later layers, contrary to what one might expect, given that later layers should represent semantic concepts.
These findings highlight the need for better interpretability techniques if ANNs are ever to become useful models of human vision.
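For context, the sketch below shows the bare activation-maximization idea behind feature visualizations: gradient ascent on the input image to drive one channel of a hidden layer of a torchvision ResNet-50. Real FV pipelines add strong regularizers (frequency-space parameterization, jitter, transformation robustness) that are omitted here, and the chosen layer and channel are arbitrary; this is not the procedure used to generate the FVs in the study.

```python
import torch
import torchvision.models as models

model = models.resnet50(weights="IMAGENET1K_V1").eval()

# record the activations of one hidden layer via a forward hook
activations = {}
model.layer3.register_forward_hook(lambda module, inputs, output: activations.update(out=output))

channel = 42                                    # arbitrary hidden unit ("neuron") to visualize
image = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(256):
    optimizer.zero_grad()
    model(image)
    # gradient ascent on the input: maximize the chosen channel's mean activation
    loss = -activations["out"][0, channel].mean()
    loss.backward()
    optimizer.step()

feature_visualization = image.detach().clamp(0, 1)  # crude range clamp for display
print(feature_visualization.shape)
```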

Künstle, D.-E., and Wichmann, F. A. (2023). Measuring lightness constancy with varying realism. Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster).

In everyday life, surface colours typically appear constant under different lighting despite the (potentially) dramatic changes in luminance—human observers are (nearly) colour constant. Failures of colour constancy using simple artificial stimuli are often explicable in terms of luminance normalization and colour adaptation, but not in realistic scenes with multiple light sources in which the perceiver ought to separate all illuminants from the surface reflectances. What remains to be established, however, is the link between lightness constancy with simple artificial stimuli and with realistic scenes. Here we present a lightness scaling experiment to investigate the impact of realism on lightness perception. Observers see physically accurate renderings of indoor scenes that contain three target patches of varying illumination and luminance; the observers have to judge which lateral target patch appears more similar to the center target. Based on these triad judgments, we estimated 1D and 2D lightness scales for different manipulations of scene realism with an ordinal embedding algorithm, a modern variant of multidimensional scaling. Preliminary results show a mixed picture of lightness constancy's dependence on scene realism.
Scene realism per se appears not critical; instead, we observe lightness constancy if the scene contains spotlights with clearly visible, sharp light cones. Targets in a scene without sharp light cones—but still clearly visible illuminants, e.g. a ceiling window—were judged based on brightness in both the full and the reduced realism conditions. Our results may indicate that it is not realism per se that is crucial for lightness constancy but the ease with which individual illuminants can be identified through the local context.

Sauer, Y.,  Künstle, D.-E., Wichmann, F. A., and Wahl, S. (2023). Psychophysical scale of optical distortions of multifocal spectacle lenses. Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster).

Multifocal lenses have regions of different optical power, correcting far and near vision for presbyopes with one pair of glasses. Optical distortions are an unavoidable side effect of those lenses. From previous research and reports of spectacle wearers we know that such distortions cause unnatural and often discomforting motion percepts. Depending on design choices and the intended refractive correction, the strength and shape of distortions vary between lenses. For the design of spectacle lenses it is important to understand how perceived distortions depend on different parameters of the spectacle lens. In this work we analyze physical distortions induced by multifocal glasses with varying correction on a subjective, psychophysical scale of distortions. To measure the internal scale we performed a psychophysical experiment in virtual reality (VR). The image in VR was transformed to replicate the distortions of ten simulated lenses. Subjects looked around freely in a virtual indoor environment while three different distortions were presented consecutively in each trial; they were asked for the pair of distortions that appeared more similar: first to second, or second to third. We estimated the scale as a coordinate for every lens with an ordinal embedding algorithm, so that the distances in the scale maximally agree with the subjects' perceived similarities. The fit is best with a single coordinate dimension for the perceived distortions, even though we varied multiple lens parameters, resulting in complex transformations of the images. This perceived distortion increases with the optical power in the far vision area of the lens and the additional power in the near vision area of the lens. Distortions for lenses with negative optical power in the far area appear closer to undistorted vision with increasing power in the near area, suggesting a compensation of distortions for negative lenses.

Schmittwilken, L., Wichmann, F. A., and Maertens, M. (2023). Is edge sensitivity more than contrast sensitivity? Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster).

How does the human visual system extract relevant information from the pattern of light on the retinae? The psychophysical study of early vision has led to important insights into the initial processing characteristics of human vision, and we have image-computable models which quantitatively account for relevant psychophysical findings. Many of these findings come from experiments with sinusoidal gratings and Gabor patches, which, due to their narrow spectra, are thought to be ideal to isolate and test mechanisms of early vision. However, during natural viewing relevant stimulus features such as object boundaries involve sharp luminance discontinuities (i.e. edges) and thus broad spectra. Here we explore whether the computational mechanisms which account for the perception of sinusoidal gratings can also predict psychophysical sensitivity to (isolated) edges. Psychophysically, we probe human edge sensitivity (2-AFC, position of edge) in the presence of different types of noise. Edges were Cornsweet edges with peak frequencies at 0.5, 3 and 9 cpd, and noise types were white, pink, and brown as well as three narrowband noises with center frequencies at 0.5, 3 and 9 cpd. Computationally, we implemented several edge models using standard components of existing edge models (single or multiple odd-symmetric log-Gabor filters) and standard components from spatial vision models (divisive contrast gain control, signal-detection-theory based decoders of varying complexity). We find that several different models can reasonably account for the data and thus conclude that human contrast and edge sensitivity likely employ similar computational mechanisms—results from narrowband sinusoidal gratings thus generalize to more natural, broadband stimuli. Finally, we discuss the assumptions and design choices involved in our computational modeling and make both explicit to foster their critical examination.
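As an illustration of one of the model components mentioned above, the following one-dimensional sketch builds an odd-symmetric (sine-phase) log-Gabor filter in the frequency domain and applies it to a schematic Cornsweet luminance profile. Peak frequency, bandwidth parameter and the edge's decay constant are arbitrary choices for illustration, not the fitted model parameters.

```python
import numpy as np

n, deg_per_sample = 4096, 1.0 / 256        # 16 deg of "visual field" at 256 samples/deg
x = (np.arange(n) - n // 2) * deg_per_sample

# schematic Cornsweet profile: exponentially decaying flanks of opposite sign
tau = 0.5                                   # flank decay constant in degrees (illustrative)
cornsweet = np.sign(x) * np.exp(-np.abs(x) / tau)

# log-Gabor amplitude spectrum with peak frequency f0 (cycles/deg); sigma_ratio sets the bandwidth
f = np.fft.fftfreq(n, d=deg_per_sample)
f0, sigma_ratio = 3.0, 0.55
amp = np.zeros(n)
nz = f != 0
amp[nz] = np.exp(-np.log(np.abs(f[nz]) / f0) ** 2 / (2 * np.log(sigma_ratio) ** 2))

# odd-symmetric (sine-phase) filter: a purely imaginary, odd transfer function
transfer = -1j * np.sign(f) * amp
response = np.fft.ifft(np.fft.fft(cornsweet) * transfer).real

print("peak filter response to the edge:", np.abs(response).max())
```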

Geirhos, R., Narayanappa, K., Mitzkus, B., Thieringer, T., Bethge, M., Wichmann, F. A., and Brendel, W. (2022). The bittersweet lesson: data-rich models narrow the behavioural gap to human vision. Vision Sciences Society (VSS), St. Pete Beach, FL, USA (talk).

A major obstacle to understanding human visual object recognition is our lack of behaviourally faithful models. Even the best models based on deep learning classifiers strikingly deviate from human perception in many ways. To study this deviation in more detail, we collected a massive set of human psychophysical classification data under highly controlled conditions (17 datasets, 85K trials across 90 observers). We made this data publicly available as an open-sourced Python toolkit and behavioural benchmark called "model-vs-human", which we use for investigating the very latest generation of models. Generally, in terms of robustness, standard machine vision models make many more errors on distorted images, and in terms of image-level consistency, they make very different errors than humans. Excitingly, however, a number of recent models make substantial progress towards closing this behavioural gap: "simply" training models on large-scale datasets (between one and three orders of magnitude larger than standard ImageNet) is sufficient to, first, reach or surpass human-level distortion robustness and, second, to improve image-level error consistency between models and humans. This is significant given that none of those models is particularly biologically faithful on the implementational level, and in fact, large-scale training appears much more effective than, e.g., biologically-motivated self-supervised learning. In the light of these findings, it is hard to avoid drawing parallels to the "bitter lesson" formulated by Rich Sutton, who argued that "building in how we think we think does not work in the long run", and that ultimately scale would be all that matters. While human-level distortion robustness and improved behavioural consistency with human decisions through large-scale training is certainly a sweet surprise, this leaves us with a nagging question: Should we, perhaps, worry less about biologically faithful implementations and more about the algorithmic similarities between human and machine vision induced by training on large-scale datasets?

Künstle, D., von Luxburg, U., and Wichmann, F. A. (2022). Estimating the perceived dimensionality of psychophysical stimuli using a triplet accuracy and hypothesis testing procedure. Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster).

In vision research we are often interested in understanding the mapping of complex physical stimuli to their perceptual dimensions. This mapping can be explored experimentally with multi-dimensional psychophysical scaling. One fruitful approach is the combination of triplet comparison judgments, asking if stimulus A or B is more similar to C, together with ordinal embedding methods. Ordinal embedding infers a point representation such that the distances in the inferred, internal perceptual space agree with the observer's judgments. One fundamental problem of psychophysical scaling in multiple dimensions is, however, that the inferred representation only reflects perception if it has the correct dimensionality. Unfortunately, the methods to derive the “correct” dimensionality have thus far not been satisfactory for noisy, behavioural data (e.g. no clear “knee” in the stress-by-dimension graph of multi-dimensional scaling). Here we propose a statistical procedure inspired by model selection to choose the dimensionality: Dimensionality can be tuned to prevent both under- and overfitting. The key elements are, first, measuring the scale's quality by the number of correctly predicted triplets (cross-validated triplet accuracy). Second, performing a statistical test to assess if adding another dimension improves triplet accuracy significantly. In order to validate this procedure we simulated noisy and sparse judgments and assessed how reliably the ground-truth dimensionality could be identified. Even for high noise levels we are able to identify a lower-bound estimate of the perceived dimensions. Furthermore, we studied the properties and limitations of our procedure using a variety of behavioural datasets from psychophysical experiments. We conclude that our procedure is a robust tool in the exploration of new perceptual spaces and is able to help identify a lower bound on the number of perceptual dimensions for a given dataset. The identified dimensions and the resulting representation can then be related to perceptual processes in order to explore human vision.
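A compact sketch of the proposed ingredients, under simplifying assumptions, is given below: simulated noisy triplet judgments, a small hinge-loss ("soft") ordinal embedding fitted by gradient descent, held-out (cross-validated) triplet accuracy for two candidate dimensionalities, and a paired McNemar-style binomial test on the triplets where the two embeddings disagree. It is illustrative only and not the authors' implementation; noise level, item count and optimizer settings are arbitrary.

```python
import numpy as np
import torch
from scipy.stats import binomtest

rng = np.random.default_rng(0)

# --- simulate noisy triplets "i is more similar to j than to k" from a 2D ground truth ---
n_items = 30
truth = rng.normal(size=(n_items, 2))
trip = rng.integers(0, n_items, size=(4000, 3))
trip = trip[(trip[:, 0] != trip[:, 1]) & (trip[:, 0] != trip[:, 2]) & (trip[:, 1] != trip[:, 2])]

def dist(a, b):
    return np.linalg.norm(truth[a] - truth[b], axis=-1)

closer = dist(trip[:, 0], trip[:, 1]) < dist(trip[:, 0], trip[:, 2])
closer ^= rng.random(len(trip)) < 0.15          # 15 % response noise
trip[~closer] = trip[~closer][:, [0, 2, 1]]     # reorder so column 1 is always the chosen item
train, test = trip[: len(trip) // 2], trip[len(trip) // 2 :]

def fit_embedding(triplets, dim, steps=600):
    """Soft ordinal embedding: hinge loss on distance differences, optimized with Adam."""
    x = torch.randn(n_items, dim, requires_grad=True)
    t = torch.as_tensor(triplets)
    opt = torch.optim.Adam([x], lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        d_ij = (x[t[:, 0]] - x[t[:, 1]]).norm(dim=1)
        d_ik = (x[t[:, 0]] - x[t[:, 2]]).norm(dim=1)
        torch.clamp(1 + d_ij - d_ik, min=0).mean().backward()
        opt.step()
    return x.detach()

def correct_on(x, triplets):
    t = torch.as_tensor(triplets)
    d_ij = (x[t[:, 0]] - x[t[:, 1]]).norm(dim=1)
    d_ik = (x[t[:, 0]] - x[t[:, 2]]).norm(dim=1)
    return (d_ij < d_ik).numpy()

ok1 = correct_on(fit_embedding(train, dim=1), test)
ok2 = correct_on(fit_embedding(train, dim=2), test)
print("held-out triplet accuracy (1D, 2D):", ok1.mean(), ok2.mean())

# paired (McNemar-style) test: among disagreeing triplets, does 2D win more often than chance?
wins_2d, wins_1d = int((ok2 & ~ok1).sum()), int((ok1 & ~ok2).sum())
print("p-value for adding the second dimension:", binomtest(wins_2d, wins_2d + wins_1d, 0.5).pvalue)
```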

Schönmann, I., Künstle, D., and Wichmann, F. A. (2022). Using an Odd-One-Out Design Affects Consistency, Agreement and Decision Criteria in Similarity Judgement Tasks Involving Natural Images. Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster).

Recently, similarity judgement tasks have been employed to estimate the perceived similarity of natural images (Hebart, Zheng, Pereira, & Baker, 2020). Such tasks typically take the form of triplet questions in which participants are presented with a reference image and two additional images and are asked to indicate which of the two is more similar to the reference. Alternatively, participants can be presented with three images and asked to indicate the odd one out. Though both questions are mathematically similar, they might affect participants' decision criteria, the agreement among observers, or the consistency of single observers—these possibilities have hitherto not been assessed. To address these issues, we presented four observers with triplets from three image sets designed to juxtapose different perceptual and conceptual features. Using a soft ordinal embedding algorithm—a machine learning version of multidimensional scaling—we represented the images in a two-dimensional space such that the Euclidean distances between images reflected observers' choices. Agreement between observers was assessed through a leave-one-out procedure in which embeddings based on three observers served to predict the respective fourth observer's choices. Consistency was calculated as the proportion of identical choices in a repeat session. Here we show that design choices in similarity judgement tasks can indeed affect results. The odd-one-out design resulted in greater embedding accuracy, higher agreement among, and higher consistency within, observers. Hence, an individual observer's choices could be better predicted in the odd-one-out than in the triplet design. However, predicting individual responses was only possible for image sets for which participants could report a predominant relationship. Otherwise, predictability dropped to close to chance level. Our results suggest that seemingly innocuous experimental variations—standard triplet versus odd-one-out—can have a strong influence on the resulting perceptual spaces. Furthermore, we note severe limitations regarding the predictive power of models relying on pooled observer data.

Huber, L. S., Geirhos, R. and Wichmann, F. A. (2021)
The developmental trajectory of object recognition robustness: comparing children, adults, and CNNs
Vision Sciences Society, virtual meeting (V-VSS) (talk)

Core object recognition refers to the ability to rapidly recognize objects in natural scenes across identity-preserving transformations, such as variation in perspective, size or lighting. In laboratory object recognition tasks using 2D images, adults and Convolutional Neural Networks (CNNs) perform close to ceiling. However, while current CNNs perform poorly on distorted images, adults' performance is robust against a wide range of distortions. It remains an open question whether this robustness is the result of superior information representation and processing in the human brain, or due to extensive experience (training) with distorted visual input during childhood. In the case of the latter, we would expect robustness to be low in early childhood and increase with age. Here we investigated the developmental trajectory of core object recognition robustness. We first evaluated children's and adults' object classification performance on undistorted images and then systematically tested how recognition accuracy degrades when images are distorted by salt-and-pepper noise, eidolons, or texture-shape conflicts. Based on 22,000 psychophysical trials collected in children aged 4–15 years, our results show that: First, while overall performance improves with age, already the youngest children showed remarkable robustness and outperformed standard CNNs on moderately distorted images. Second, weaker overall performance in younger children is due to weak performance on a small subset of image categories, not reduced performance across all categories. Third, when recognizing objects, children—like adults but unlike standard CNNs—heavily rely on shape but not on texture cues. Our results suggest that robustness emerges early in the developmental trajectory of human object recognition and is already in place by the age of four. The robustness gap between humans and standard CNNs thus cannot be explained by a mere accumulation of experience with distorted visual input, and is more likely explained by a difference in visual information representation and processing.

Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M. and Wichmann, F. A. (2020)
Unintended cue learning: Lessons for deep learning from experimental psychology
Vision Sciences Society, virtual meeting (V-VSS) (poster)

Recently, Deep Neural Networks (DNNs) have become a major tool and model in vision science. However, DNNs often fail unexpectedly. For example, they are highly vulnerable to noise and struggle to transfer their performance from the lab to the real world. In experimental psychology, unexpected failures are often the consequence of unintended cue learning.
For example, rats trained to perform a colour discrimination experiment may appear to have learned the task but fail unexpectedly once the odour of the colour paint is controlled for, revealing that they exploited an unintended cue—smell—to solve what was intended to be a vision experiment. Here we ask whether unexpected failures of DNNs too may be caused by unintended cue learning. We demonstrate that DNNs are indeed highly prone to picking up on subtle unintended cues: neural networks love to cheat. For instance, in a simple classification problem with two equally predictive cues, object shape and object location, human observers unanimously relied on object shape whereas DNNs used object location, a strategy which fails once an object appears at a different location. Drawing parallels to other recent findings, we show that a wide variety of DNN failures can be understood as a consequence of unintended cue learning: their over-reliance on object background and context, adversarial examples, and a number of stunning generalisation errors. The perspective of unintended cue learning unifies some of the key challenges for DNNs as useful models of the human visual system. Drawing inspiration from experimental psychology (with its years of expertise in identifying unintended cues), we argue that we will need to exercise great care before attributing high-level abilities like "object recognition" or "scene understanding" to machines. Taken together, this opens up an opportunity for the vision sciences to contribute towards a better and more cautionary understanding of deep learning.

Bruijns, S. A., Meding, K., Schölkopf, B. and Wichmann, F. A. (2019)
Phenomenal Causality and Sensory Realism
European Conference on Visual Perception (ECVP), Leuven, BEL (poster)

One of the most important tasks for humans is the attribution of causes and effects in diverse contexts, including visual perception. Albert Michotte was one of the first to systematically study causal visual perception using his now well-known launching event paradigm. Launching events are the collision and transfer of movement between two objects (featureless disks in the original experiments). The perceptual simplicity of the original displays allows for insight into the basics of the mechanisms governing causal perception. We wanted to study the relation between causal ratings for launching in the usual abstract setting and launching collisions in a photo-realistic setting. For this purpose we presented typical launching events with differing temporal gaps, as well as the same launching processes with photo-realistic billiard balls, and also photo-realistic billiard balls with realistic physics, i.e. an initial rebound of the first ball after collision and a short sliding phase of the second ball. We found that simply giving the normal launching stimulus realistic visuals leads to lower causal ratings, but realistic visuals together with realistic physics evoked higher ratings. We discuss this initially perhaps counter-intuitive result in terms of cue conflict and the seemingly detailed (implicit) physical knowledge embodied in our visual system.

Flachot, A. C., Schütt, H. H., Fleming, R. W., Wichmann, F. A. and Gegenfurtner, K. R. (2019)
Color Constancy in Deep Neural Networks
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

Color constancy contributes to our visual system's ability to recognize objects. Here, we explored whether and how Deep Neural Networks can learn to identify the colours of objects across varying illuminations. We devised a 6-layer feedforward network (3 convolutional layers, 2 fully connected layers, one classification layer). The network was trained to classify the reflectances of objects. Stimuli consisted of the cone absorptions in rendered images of 3D objects, generated using 2115 different 3D-models, the reflectances of 330 different Munsell chips, and 265 different natural illuminations. One model, Deep65, was trained under a fixed daylight D65 illumination, while DeepCC was trained under varying illuminations. Both networks were capable of learning the task, reaching 69% and 82% accuracy for DeepCC and Deep65 respectively on their validation sets (chance performance is 0.3%). In cross validation, however, Deep65 fails when tested on inputs with varying illuminations. This is the case even when chromatic noise is added during training, mimicking some of the effects of the varying illumination. DeepCC, on the other hand, performs at 73% when tested on a fixed D65 illumination. Importantly, color categorization errors were systematic, reflecting distances in color space. We then removed some cues for color constancy from the input images. DeepCC was slightly affected when hiding a panel of colorful patches, which had constant reflectance across all input images. Removing the complete image background deteriorated performance to nearly the level of Deep65. A multidimensional scaling analysis of both networks showed that they represent Munsell space quite accurately, but more robustly in DeepCC. Our results show that DNNs can be trained on color constancy, and that they use similar cues as observed in humans (e.g., Kraft & Brainard, PNAS 1999). Our approach allows us to quickly test the effect of image manipulations on constancy performance.
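The architecture is only coarsely specified in the abstract; the sketch below is one plausible PyTorch rendering of a 6-layer feedforward network (three convolutional layers, two fully connected layers and a 330-way classification layer over Munsell reflectances). Channel widths, kernel sizes and the input resolution are guesses for illustration, not the network actually used.

```python
import torch
import torch.nn as nn

class ReflectanceNet(nn.Module):
    """3 conv + 2 fully connected + classification layer, as coarsely sketched above.
    Input: 3-channel (L, M, S cone absorption) images; output: 330 Munsell classes."""
    def __init__(self, n_classes=330):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 4 * 4, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, n_classes),          # classification layer
        )

    def forward(self, cone_image):
        return self.classifier(self.features(cone_image))

model = ReflectanceNet()
logits = model(torch.rand(8, 3, 128, 128))      # a hypothetical batch of rendered scenes
print(logits.shape)                             # torch.Size([8, 330])
```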

Flachot, A. C., Schütt, H. H., Fleming, R. W., Wichmann, F. A. and Gegenfurtner, K. R. (2019)
Color Constancy in Deep Neural Networks
European Conference on Visual Perception (ECVP), Leuven, BEL (talk)

Color constancy is our ability to perceive constant colors across varying illuminations. Here, we trained a deep neural network to be color constant and compared it with humans.
We trained a 6-layer feedforward network to classify the reflectances of objects. Stimuli consisted of the cone excitations in 3D-rendered images of 2115 different 3D-shapes, the reflectances of 330 different Munsell chips and 278 different natural illuminations. One model, Deep65, was trained under a fixed daylight D65 illumination, while DeepCC was trained under varying illuminations. Testing was done with 4 new illuminants with equally spaced CIELab chromaticities, 2 along the daylight locus and 2 orthogonal to it.
We found an average color constancy of 0.69 for DeepCC, and constancy was higher along the daylight locus (0.86 vs 0.52). When gradually taking cues away from the scene, constancy decreased. For an isolated object on a neutral background, it was close to zero, which was also the level of constancy for Deep65. Within the DeepCC model, constancy gradually increased throughout the network layers.
Overall, the DeepCC network shares many similarities with human color constancy. It also provides a platform for detailed comparisons to behavioral experiments.

Geirhos, R., Rubisch, P., Rauber, J., Medina Temme, C. R., Michaelis, C., Brendel, W., Bethge, M. and Wichmann, F.A. (2019)
Inducing a human-like shape bias leads to emergent human-level distortion robustness in CNNs
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (talk)

Convolutional neural networks (CNNs) have been proposed as computational models for (rapid) human object recognition and (the feedforward component of) the primate ventral stream. The usefulness of CNNs as such models obviously depends on the degree of similarity they share with human visual processing. Here we investigate two major differences between human vision and CNNs: first, distortion robustness (CNNs fail to cope with novel, previously unseen distortions) and, second, texture bias (unlike humans, standard CNNs primarily recognise objects by texture rather than shape). During our investigations we discovered an intriguing connection between the two: inducing a human-like shape bias in CNNs makes them inherently robust against many distortions. First we show that CNNs cope with novel distortions worse than humans even if many distortion-types are included in the training data. We hypothesised that the lack of generalisation in CNNs may lie in fundamentally different classification strategies: Humans primarily use object shape, whereas CNNs may rely more on (easily distorted) object texture. Thus in a second set of experiments we investigated the importance of texture vs. shape cues for human and CNN object recognition using a novel method to create texture-shape cue conflict stimuli. Our results, based on 49K human psychophysical trials and eight widely used CNNs, reveal that CNNs trained with typical "natural" images indeed depend much more on texture than on shape, a result in contrast to the recent literature claiming human-like object recognition in CNNs.
However, both differences between humans and CNNs can be overcome: we created a suitable dataset which induces a human-like shape bias in CNNs during training. This resulted in emergent human-level distortion robustness in CNNs. Taken together, our experiments highlight how key differences between human and machine vision can be harnessed to improve CNN robustness, and thus make CNNs more similar to the human visual system, by inducing a human-like bias.

Lang, B., Aguilar, G., Maertens, M. and Wichmann, F. A. (2019)
The influence of observer lapses on maximum-likelihood difference scaling
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

The psychometric function relates an observer's performance to some physical stimulus quantity in a psychophysical task. Performance is characterized by several parameters of the psychometric function such as the point of subjective equality or the slope. Apart from these primary parameters of interest, two other parameters have been modelled to increase goodness-of-fit: guesses and lapses. Lapses refer to mistakes that are independent of the stimulus level, for example when an observer mixes up response buttons or attention lapses. Here, we explore the question of whether an explicit modelling of the lapse rate would also improve the estimation of perceptual scales in procedures such as Maximum Likelihood Difference Scaling (MLDS). MLDS is a psychophysical method to derive perceptual scales from forced-choice judgments of suprathreshold stimulus differences. It was tested for its robustness against violations of several model assumptions (Maloney and Yang, 2003), but the influence of lapses on estimated scales has not yet been studied systematically.
We ran computer simulations to test how a stimulus-independent error rate influences scale estimates in MLDS. We simulated data from different statistical models: we included the classical implementation of MLDS as a generalized linear model (GLM), a Bayesian implementation of the same GLM, as well as two models that explicitly model the lapse rate. We also used the models to reanalyse data from a previous study (Wiebel, Aguilar, and Maertens, 2017), to test the effect of modelling the lapse rate in actual data. In the simulations, lapses led to an overestimation of the internal noise. In the reanalysis of the experimental data we found that for experienced observers with a low noise estimate the different models did not differ much. For observers with a higher internal noise estimate, models that considered the lapse rate resulted in scales with a smaller internal noise estimate.
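To make the modelling question concrete, here is a minimal sketch of an MLDS-style likelihood for triads extended with a stimulus-independent lapse rate, fitted by direct maximum likelihood on simulated data rather than via the GLM route used in the study. Scale anchoring, the simulated generating scale, and the optimizer are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)

# --- simulate triad data from a known scale with lapses ---------------------------------
levels = np.linspace(0, 1, 8)
true_scale = levels ** 2                      # hypothetical expansive perceptual scale
sigma_true, lapse_true = 0.1, 0.05

triads = np.sort(rng.integers(0, len(levels), size=(3000, 3)), axis=1)
triads = triads[(triads[:, 0] < triads[:, 1]) & (triads[:, 1] < triads[:, 2])]
delta = (true_scale[triads[:, 2]] - true_scale[triads[:, 1]]) \
      - (true_scale[triads[:, 1]] - true_scale[triads[:, 0]])
p_upper = lapse_true / 2 + (1 - lapse_true) * norm.cdf(delta / sigma_true)
resp = rng.random(len(triads)) < p_upper      # True: "upper interval looks larger"

# --- negative log-likelihood with scale values, noise and lapse as free parameters ------
def neg_log_lik(params):
    psi = np.concatenate(([0.0], params[: len(levels) - 2], [1.0]))  # anchored scale
    sigma = np.exp(params[-2])                 # positivity via log parameterization
    lapse = 1 / (1 + np.exp(-params[-1]))      # lapse rate in (0, 1) via logistic
    d = (psi[triads[:, 2]] - psi[triads[:, 1]]) - (psi[triads[:, 1]] - psi[triads[:, 0]])
    p = lapse / 2 + (1 - lapse) * norm.cdf(d / sigma)
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -(np.where(resp, np.log(p), np.log(1 - p))).sum()

start = np.concatenate((np.linspace(0, 1, len(levels))[1:-1], [np.log(0.2), -3.0]))
fit = minimize(neg_log_lik, start, method="Nelder-Mead",
               options={"maxiter": 20000, "xatol": 1e-6, "fatol": 1e-6})
psi_hat = np.concatenate(([0.0], fit.x[: len(levels) - 2], [1.0]))
print("estimated scale:", np.round(psi_hat, 3))
print("estimated sigma and lapse:", np.exp(fit.x[-2]), 1 / (1 + np.exp(-fit.x[-1])))
```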

Li, S., Lang, B., Aguilar, G., Maertens, M. and Wichmann, F. A. (2019)
Comparing scaling methods in lightness perception
European Conference on Visual Perception (ECVP), Leuven, BEL (poster)

Psychophysical scaling methods measure the perceptual relations between stimuli that vary along one or more physical dimensions.
Maximum-likelihood difference scaling (MLDS) is a recently developed method to measure perceptual scales which is based on forced-choice comparisons between stimulus intervals. An alternative scaling method that is based on adjusting stimulus intervals is equisection scaling.
In MLDS, an observer has to answer which of two shown intervals is greater. In equisection scaling the observer adjusts values between two anchoring points such that the resulting intervals are perceived as equal in magnitude.
We compared MLDS and bisection scaling, a variant of equisection scaling, by replicating a lightness scaling experiment with both methods. Bisection scaling is attractive because it requires less data than MLDS. We found that, qualitatively, the lightness scales recovered by each method agreed in terms of their shape. However, the bisection measurements were noisier. Even worse, scales from the same observers but measured in different sessions sometimes differed substantially.
We would therefore not advise using equisection scaling as a method on its own. But we suggest that it can be usefully employed to choose more favourable sampling points for a subsequent MLDS experiment.

Meding, K., Schölkopf, B. and Wichmann, F. A. (2019)
Perception of temporal dependencies in autoregressive motion
European Conference on Visual Perception (ECVP), Leuven, BEL (poster)

Understanding the principles of causal inference in the visual system has a long history, certainly since the seminal studies by Michotte. During the last decade, a new type of causal inference algorithm has been developed in statistics. These algorithms use the dependence structure of residuals in a fitted additive noise framework to detect the direction of causal information from data alone (Peters et al., 2008). In this work we investigate whether the human visual system may employ similar causal inference algorithms when processing visual motion patterns, focusing, as in the theory, on the arrow of time. Our data suggest that human observers can discriminate forward and backward played movies of autoregressive (AR) motion with non-Gaussian additive independent noise, i.e., they appear sensitive to the subtle temporal dependencies of the AR-motion, analogous to the high sensitivity of human vision to spatial dependencies in natural images (Gerhard et al., 2013). Intriguingly, a comparison to known causal inference algorithms suggests that humans employ a different strategy. The results demonstrate that humans can use spatiotemporal motion patterns in causal inference tasks. This finding raises the question of whether the visual system is tuned to motion in an additive noise framework.
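The statistical idea behind the comparison can be sketched in a few lines, under simplifying assumptions: for an AR(1) process with non-Gaussian innovations, regressing each sample on its predecessor gives residuals that are independent of the past in the true (forward) direction but not in the time-reversed direction. Below, dependence is scored with scikit-learn's mutual-information estimator as a crude stand-in for the dedicated independence tests used in the causal-inference literature; the AR coefficient and noise distribution are arbitrary.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(7)

# AR(1) "motion" with non-Gaussian (uniform) innovations: x_t = a * x_{t-1} + noise_t
a, n = 0.7, 5000
noise = rng.uniform(-1, 1, n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = a * x[t - 1] + noise[t]

def residual_dependence(series):
    """Fit x_t = b * x_{t-1} by least squares and estimate MI(residual, regressor)."""
    past, present = series[:-1], series[1:]
    b = (past @ present) / (past @ past)
    residuals = present - b * past
    return mutual_info_regression(past.reshape(-1, 1), residuals, random_state=0)[0]

print("forward  residual dependence:", residual_dependence(x))        # close to zero
print("backward residual dependence:", residual_dependence(x[::-1]))  # noticeably larger
```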

Schütt, H. H. and Wichmann, F. A. (2019)
A divisive model of midget and parasol ganglion cells explains the contrast sensitivity function
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

A long-standing proposition in psychophysics is that there are two temporal channels in spatial vision: One temporal lowpass filter which is spatially bandpass and one temporal bandpass filter which is spatially lowpass. Here we equate these two channels with the midget and parasol ganglion cells of the primate retina respectively. Parasol cells show a frequency doubling response at high spatial frequencies, i.e. they respond to both polarities of high spatial frequency gratings. This is usually thought to result from them pooling over smaller units. If true, this explanation predicts that signals with both high spatial and high temporal frequency should be detectable, but their spatial orientation and phase should not be perceivable, i.e. parasol cells do not act as labelled lines in this scenario. We confirm this prediction by finding a difference between detection and orientation discrimination thresholds at high spatial and temporal frequency, not observed at other spatial-temporal frequency combinations. We model midget and parasol cells using a standard divisive model dividing a generalized Gaussian center by a larger surround. When scaling cells proportional to their known separation in the retina, we can make predictions for the perceptual performance of human observers assuming optimal decoding after a single quadratic non-linearity. To confirm the predictions of our model we measured contrast sensitivity functions (CSFs) under diverse conditions and fit data from the literature. Our model fits the CSF under different temporal conditions and changes in size with fixed spatial parameters and can thus replace previous CSF models which require new parameters for each condition and have no mechanistic interpretation. Finally, our model provides a counter hypothesis to the textbook explanation for the CSF which describes it as the envelope of the spatial frequency tuned channels in V1; instead, we believe its shape to result from properties of retinal cells.
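As a toy caricature of the divisive center-surround idea (not the fitted model in the abstract, and ignoring temporal factors, retinal sampling and the decoding stage), the sketch below divides the Fourier amplitude of a small generalized-Gaussian center by a constant plus that of a roughly ten-times-larger surround, which yields a band-pass sensitivity curve. All parameters are arbitrary.

```python
import numpy as np

# toy caricature of a divisive centre/surround receptive field: the response to a grating
# is taken as the centre's Fourier amplitude divided by a constant plus the (larger)
# surround's Fourier amplitude
n, dx = 8192, 0.002                       # samples and sample spacing in degrees
x = (np.arange(n) - n // 2) * dx
freqs = np.fft.rfftfreq(n, d=dx)          # spatial frequency in cycles/deg

def generalized_gaussian_amplitude(scale, power):
    profile = np.exp(-np.abs(x / scale) ** power)
    return np.abs(np.fft.rfft(np.fft.ifftshift(profile))) * dx   # amplitude spectrum

center = generalized_gaussian_amplitude(scale=0.03, power=1.5)    # small centre
surround = generalized_gaussian_amplitude(scale=0.30, power=1.5)  # ~10x larger surround
sensitivity = center / (0.02 + surround)                          # divisive interaction

peak = freqs[np.argmax(sensitivity)]
print(f"band-pass sensitivity curve with a peak near {peak:.1f} cycles/deg")
```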

Wichmann, F. A.  (2019)
Object recognition in man and machine
European Conference on Visual Perception (ECVP), Leuven, BEL (talk)

Convolutional neural networks (CNNs) have been proposed as computational models for (rapid) human object recognition and (the feedforward component of) the primate ventral stream. The usefulness of CNNs as such models obviously depends on the degree of similarity they share with human visual processing. In my talk I will discuss two major differences we have found between human vision and current feedforward CNNs. First, distortion robustness: unlike the human visual system, typical feedforward CNNs fail to cope with novel, previously unseen distortions. Second, texture bias: unlike humans, and contrary to widespread belief, standard CNNs primarily recognise objects by texture rather than by object shape.
Both differences between humans and CNNs can be diminished, however: we created a suitable dataset which induces a human-like shape bias in standard, feedforward CNNs during training. The shape bias of the CNN was accompanied by emergent human-level distortion robustness. Taken together, our experiments highlight how key differences between human and machine vision can be harnessed to improve CNN robustness by inducing a human-like bias. I will discuss the remaining behavioural differences between CNNs and humans in object recognition in terms of architectural and training discrepancies.

Wu, S., Geirhos, R. and Wichmann, F. A.  (2019)
An early vision-inspired visual recognition model improves robustness against image distortions compared to a standard convolutional neural network
EPFL Neuro Symposium: Neuroscience meets Deep Learning, Lausanne, CH (poster)

Convolutional neural networks (CNNs) have been proposed as models for human ventral stream object recognition due to excellent performance on image recognition tasks and representational similarities with monkey neural recordings and human fMRI data [1,2]. At the same time, CNNs differ from human visual recognition in substantial ways. For example, CNN image recognition performance deteriorates much faster with image distortions compared to human observers [3]. Computationally, deep convolutional networks lack neurally-inspired components of local gain control which are ubiquitous in biological sensory systems [4]. We here test whether these two key differences between human and machine vision might be linked: Does incorporating mechanisms of human early visual processing lead to an improvement in CNN distortion robustness?
To this end, we build a hybrid model consisting of an early-vision-inspired front end combined with a standard CNN back end, and train it on an object recognition task (the CIFAR-10 dataset). The front end is an image-computable model of human early visual processing with an overcomplete set of spatial frequency filters and divisive normalization followed by a nonlinearity, which has previously been shown to be a good fit for human psychophysical data [5]. The back end is ResNet-18, a standard CNN. Interestingly, this hybrid network leads to robustness improvements on various types of image distortions compared to a vanilla CNN that lacks an early vision front end. This model could serve as a starting point to better understand the benefits of biologically plausible components and their potential for today's machine vision systems.
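The composition itself is straightforward in PyTorch; the sketch below chains a toy front end (a fixed filter bank with divisive normalization, standing in for the psychophysically validated early-vision model cited as [5]) into a torchvision ResNet-18 back end via a 1x1 channel adapter. The front end, adapter and all parameters are illustrative assumptions, not the hybrid model evaluated in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

class ToyFrontEnd(nn.Module):
    """Stand-in for an early-vision model: a fixed random filter bank + divisive normalization."""
    def __init__(self, out_channels=16, kernel_size=7):
        super().__init__()
        self.register_buffer("filters",
                             torch.randn(out_channels, 3, kernel_size, kernel_size) * 0.1)

    def forward(self, x):
        r = F.conv2d(x, self.filters, padding=self.filters.shape[-1] // 2)
        pooled = F.avg_pool2d(r ** 2, 5, stride=1, padding=2).sum(dim=1, keepdim=True)
        return r / torch.sqrt(0.01 + pooled)        # divisive (contrast) normalization

front_end = ToyFrontEnd()
back_end = models.resnet18(weights=None, num_classes=10)    # CIFAR-10 has 10 classes
adapter = nn.Conv2d(16, 3, kernel_size=1)   # ResNet-18 expects 3 input channels

hybrid = nn.Sequential(front_end, adapter, back_end)
logits = hybrid(torch.rand(4, 3, 32, 32))   # a hypothetical CIFAR-sized batch
print(logits.shape)                         # torch.Size([4, 10])
```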

Lang, B., Aguilar, G., Maertens, M. and Wichmann, F. A. (2018)
Generating Photorealistic Stimuli for Psychophysical Experiments
14th Biannual Conference of the German Cognitive Science Society (KogWis), Darmstadt, FRG (poster)

The field of lightness perception investigates how humans infer surface reflectance, what we would call the “achromatic colour” or intensity of a surface in everyday life.

Stimuli used in lightness perception often consist of two-dimensional patches in different shades of grey. Such simple stimuli are agnostic towards reflectance and illumination of the patches they are made of.

This may be problematic, since we are interested in the perception of surface reflectance, not the perception of the luminance arriving at the retina.

Therefore we argue that to completely understand lightness perception it may be important to study it with more natural stimuli which are not ambiguous with respect to reflectance and illumination. Advances in computer hardware and computer graphics allow the generation of photorealistic looking images of 3D scenes, where we can control both reflectance and illumination at will.

We developed a method to adjust the luminance of image regions by varying the contribution of different light sources in the scene. This is achieved by solving a linear equation system for the light source intensities. We then use this method to generate stimuli for a scaling experiment, comparing rendered stimuli against reduced stimuli consisting of checkers on an articulated background.
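Because rendered luminance is linear in the light-source intensities, one render per light yields a contribution matrix, and the desired region luminances can be obtained by solving a linear system for the intensity weights. The numpy sketch below illustrates this with random placeholder values instead of actual per-light renders.

```python
import numpy as np

rng = np.random.default_rng(42)

n_lights, n_regions = 4, 6

# A[r, l]: mean luminance that light l alone (at unit intensity) contributes to image
# region r -- in practice obtained from one render per light source
A = rng.uniform(0.0, 1.0, size=(n_regions, n_lights))

# desired mean luminance for each region of interest in the final stimulus
target = np.array([0.20, 0.35, 0.35, 0.50, 0.65, 0.80])

# light intensities solving A @ w ~= target (least squares; more regions than lights)
w, residuals, rank, _ = np.linalg.lstsq(A, target, rcond=None)
w = np.clip(w, 0.0, None)            # physical lights cannot have negative intensity

print("light intensities:", np.round(w, 3))
print("achieved region luminances:", np.round(A @ w, 3))
```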

Preliminary results show that, at least for the still comparatively simple stimulus arrangement tested, the local surround has a larger influence on the perceptual scales than the (realistic) scene embedding.

Meding, K., Hirsch, M. and Wichmann, F. A. (2018)
Retinal image quality of the human eye across the visual field
14th Biannual Conference of the German Cognitive Science Society (KogWis), Darmstadt, FRG (poster)

When humans are watching scenes in the surrounding world, light rays propagate through the lens of the human eye onto the retina. This is the first stage of all visual processing. Despite the robust design of the lens, aberrations impair visual quality, especially in the periphery. Wavefront sensing is used to measure individual lens aberrations and to calculate point spread functions (PSFs) and retinal images on patches across the visual field. However, it is still unknown how optics impair image quality as a whole and no systematic analysis has been conducted up to now.
In this work, data from peripheral wavefront aberration sensing is used in combination with the Efficient-Filter-Flow algorithm to calculate retinal images for a wider visual field. We extended the work on monochromatic PSFs from Watson [1] with data from Polans et al. [2] to generate retinal images for a 50° × 80° visual field. This is the first time that retinal images for a larger visual field have been obtained from measured wavefront aberrations. These images show that the optical performance of the human lens is remarkably constant up to 25°. Additionally, an anisotropy in image quality between the nasal and the temporal side was found.
The dataset from Jaeken et al. [3] was used to check this anisotropy for more subjects. We quantified the optical quality by calculating PSFs and modulation transfer functions (MTFs). This second dataset verified the difference between the nasal and temporal side and additionally revealed differences in the off-axis behavior of emmetropes and myopes.
Furthermore, we could identify the anisotropy between nasal and temporal side in a psychophysical experiment by measuring the visual acuity with two point sources.
We anticipate our results to be a starting point for the computation of retinal images for large visual fields. With aberration measurements for large visual fields, one could further investigate the claim that the optics of a human eye show a wide-angle-lens behavior as proposed by Navarro in 1993. Additionally, improved computing of retinal images could benefit vision models using these, e.g. early vision models or gaze prediction models.
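The Efficient-Filter-Flow algorithm itself uses overlapping windows and FFT-based filtering; the sketch below only illustrates the underlying idea of a spatially varying blur by convolving image patches with their local PSFs. The grid, the Gaussian PSF model and all parameter values are illustrative assumptions, not the measured aberration data.

```python
import numpy as np
from scipy.signal import fftconvolve

def patchwise_blur(image, psf_for_patch, grid=(5, 8)):
    """Very rough stand-in for a spatially varying optical blur: split the
    image into a grid of patches and convolve each patch with its local PSF.
    (No blending between patches, which Efficient Filter Flow handles with
    overlapping windows.)

    image         : 2D luminance image
    psf_for_patch : callable (row, col) -> 2D PSF, normalised to sum to 1
    grid          : number of patches along (vertical, horizontal)
    """
    H, W = image.shape
    rows, cols = grid
    out = np.zeros_like(image, dtype=float)
    for r in range(rows):
        for c in range(cols):
            ys = slice(r * H // rows, (r + 1) * H // rows)
            xs = slice(c * W // cols, (c + 1) * W // cols)
            out[ys, xs] = fftconvolve(image[ys, xs], psf_for_patch(r, c), mode="same")
    return out

def gaussian_psf(sigma, size=21):
    """Isotropic Gaussian PSF, an illustrative stand-in for a measured PSF."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return g / g.sum()

# Hypothetical usage: blur increases with horizontal distance from the centre.
image = np.random.rand(200, 320)
blurred = patchwise_blur(image, lambda r, c: gaussian_psf(1.0 + 0.5 * abs(c - 4)))
```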

 

Schütt, H. H., Rothkegel, L., Trukenbrod, H. A., Engbert, R. and Wichmann, F. A. (2018)
Predicting the fixation density over time
Computational and Mathematical Models in Vision (MODVIS), St. Pete Beach, FL, USA (talk)

When modelling human eye movements we usually separate bottom-up and top-down effects, i.e. whether an effect is caused by the stimulus or by some internal state of the observer such as their task or intentions. We also separate the orthogonal dimension of whether the features used to guide eye movements are low-level—like local contrast, luminance or orientation content—or high-level—like object locations, scene congruence or scene category. Furthermore, humans display systematic tendencies in their eye movements, preferring certain saccade lengths and directions and tending to continue moving in the same direction.

It is unclear how these different factors interact to determine where we look, and most models include only a small selection of the mentioned influences. To disentangle them, we analyse the dependencies between fixations and the fixation densities over time. In our analysis we include how well fixation densities are predicted by, first, low-level bottom-up saliency, including a saliency model based on our early spatial vision model (Schütt and Wichmann, 2017), and, second, a recent DNN-based saliency model including low- and high-level bottom-up saliency (DeepGaze II, Kümmerer et al., 2016).

To separate top-down effects, we use two datasets: a corpus dataset in which 105 subjects viewed 90 images in order to memorize them, and a search dataset in which 10 subjects each searched 8 times for 6 different targets with varying spatial frequency and orientation content superimposed on 25 natural images, resulting in 480 searches per image.

Based on the corpus dataset we separate the exploration into three phases: an onset response with the first saccade, an initial exploration lasting around 10 fixations, and a final equilibrium phase. First fixations are the most predictable but follow a different density than later ones. During the initial exploration, fixations gradually become less predictable. Finally, the fixation density stops broadening and the equilibrium state is reached, in which fixations are least focussed but still favour the same areas as during the exploration.

The prediction quality of all saliency models follows the curve of predictable information. They predict fixations best at the beginning and gradually get worse. The simple saliency model based on our early spatial vision model performs as well as classical saliency models. However, DeepGaze II performs substantially better by using high-level information throughout the whole trial. This advantage is present at the latest 200 ms after image onset; however, the predictive power of the early vision saliency model for the first fixation(s) is much better than for later fixations, and as a corollary the advantage of DeepGaze II is relatively small for the first fixation(s).

On the search dataset all saliency models perform badly after a small initial prediction success, even if the non-linearity and central fixation bias are newly adjusted to the search data. Instead, we observe that subjects adjust where they look and their eye movement dynamics to the target they search for. Specifically, they make shorter saccades and exhibit shorter fixation durations for higher-frequency targets than for lower-frequency targets.

Our observations confirm that bottom-up guidance of eye movements can be overridden almost entirely by task effects in static natural scenes. Nonetheless, our data support some early bottom-up guidance, which already includes high-level features for the very first saccade.
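One way to quantify predictability over time, consistent with the analysis described above but not necessarily the exact metric used, is the average log-likelihood of fixations under a normalised saliency map, computed separately for each ordinal fixation index. The interface below is hypothetical.

```python
import numpy as np

def log_likelihood_per_fixation_index(saliency, fixations_by_index, eps=1e-12):
    """Average log-likelihood (bits per fixation) of observed fixations under
    a saliency map, computed separately for each ordinal fixation index.

    saliency           : 2D non-negative array; normalised to a density below
    fixations_by_index : list of (N_k, 2) integer arrays of (row, col)
                         fixation positions, one array per fixation index k
    """
    density = saliency / saliency.sum()
    out = []
    for fix in fixations_by_index:
        p = density[fix[:, 0], fix[:, 1]]
        out.append(np.mean(np.log2(p + eps)))
    return np.array(out)

# A uniform map yields log2(1 / n_pixels); the difference to that baseline is
# the information gain, which allows comparing models across fixation indices.
```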

Schütt, H. H., Rothkegel, L., Trukenbrod, H. A., Engbert, R. and Wichmann, F. A. (2018)
Predicting fixation densities over time from early visual processing
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

Bottom-up saliency is often cited as a factor driving the choice of fixation locations of human observers, based on the (partial) success of saliency models in predicting fixation densities in free viewing. However, these observations are only weak evidence for a causal role of bottom-up saliency in natural viewing behaviour. To test bottom-up saliency more directly we analyse the performance of a number of saliency models---including our own saliency model based on our recently published model of early visual processing (Schütt & Wichmann, 2017, JoV)---as well as the theoretical limits for predictions over time. On free-viewing data our model performs better than classical bottom-up saliency models, but worse than the current deep-learning-based saliency models incorporating higher-level information such as knowledge about objects. However, on search data all saliency models perform worse than the optimal image-independent prediction. We observe that the fixation density in free viewing is not stationary over time, but changes over the course of a trial. It starts with a pronounced central fixation bias on the first chosen fixation, which is nonetheless influenced by image content. Starting with the 2nd to 3rd fixation, the fixation density is already well predicted by later densities, but is more concentrated. From there the fixation distribution broadens until it reaches a stationary distribution around the 10th fixation. Taken together, these observations argue against bottom-up saliency as a mechanistic explanation for eye movement control after the initial orienting reaction in the first one to two saccades, although we confirm the predictive value of early visual representations for fixation locations. The fixation distribution is, first, not well described by any stationary density, second, predicted better when object information is included and, third, badly predicted by any saliency model in a search task.

 

Schütt, H. H., Rothkegel, L., Trukenbrod, H. A., Engbert, R. and Wichmann, F. A. (2018)
Predicting fixation densities over time from early visual processing
European Conference on Visual Perception (ECVP), Trieste, IT (poster)

Low-level saliency is often cited as a factor driving the choice of fixation locations of human observers, based on the (partial) success of saliency models in predicting fixation densities in free viewing. To test this hypothesis more directly we analyse a number of saliency models as well as the theoretical limits for predictions over time. The fixation density in free viewing is not stationary over time, but changes over the course of a trial. The first saccade is biased towards the image centre, but is nonetheless influenced by image content. Starting with the 2nd fixation, the fixation density is similar to but more concentrated than later fixation densities, and predictions profit from high-level image information. From there the fixation distribution broadens until it reaches a stationary distribution around the 10th fixation. Taken together, these observations argue against low-level saliency as a mechanistic explanation for eye movement control after the initial orienting reaction.

 

Schütt, H. H., Rothkegel, L., Trukenbrod, H. A., Engbert, R. and Wichmann, F. A. (2018)
Predicting the fixation density over time
14th Biannual Conference of the German Cognitive Science Society (KogWis), Darmstadt, FRG (talk)

When modelling human eye movements we usually separate bottom-up and top-down effects, i.e. whether an effect is caused by the stimulus or by some internal state of the observer such as their task or intentions. We also separate the orthogonal dimension of whether the features used to guide eye movements are low-level—like local contrast—or high-level—like object locations. Furthermore, humans display systematic tendencies in their eye movements, such as their preference for certain saccade lengths and directions.

To disentangle these factors, we analyse how well fixation densities are predicted over time by, first, low-level bottom-up saliency, including a saliency model based on our early spatial vision model (Schütt and Wichmann, 2017), and, second, a recent DNN-based saliency model including low- and high-level bottom-up saliency (DeepGaze II, Kümmerer et al., 2016).

To manipulate top-down effects, we use two datasets: a corpus dataset in which 105 subjects viewed 90 images in order to memorize them, and a search dataset in which 10 subjects each searched 8 times for 6 different targets with varying spatial frequency and orientation content superimposed on 25 natural images, resulting in 480 searches per image.

Based on the corpus dataset we separate the exploration into three phases: an onset response with the first saccade, an initial exploration lasting around 10 fixations, and a final equilibrium phase. First fixations are the most predictable but follow a different density than later ones. During the initial exploration, fixations gradually become less predictable. Finally, the fixation density stops broadening and the equilibrium state is reached, in which fixations are least focussed but still favour the same areas as during the exploration.

All saliency models predict fixations best at the beginning and gradually get worse. The simple saliency model based on our early spatial vision model performs as well as classical saliency models. However, DeepGaze II performs substantially better by using high-level information throughout the whole trial. This advantage is present at the latest 200 ms after image onset. On the search dataset all saliency models perform badly after a small initial prediction success. Instead, we observe that subjects adjust where they look to the target they search for.

Our observations confirm that bottom-up guidance of eye movements can be overridden almost entirely by task effects in static natural scenes. Nonetheless, our data support some early bottom-up guidance, which already includes high-level features for the first saccade.

Wichmann, F. A. and Schütt, H. H. (2018)
Modelling early influences on visual perception
European Conference on Visual Perception (ECVP), Trieste, IT (talk)

Models of spatial vision usually start with spatial-frequency- and orientation-specific channels applied to an image that is already coded in contrast units relative to the background luminance, ignoring earlier processing. To investigate the effects of pre-neural processing, we use our recently published image-computable model of early spatial vision and investigate how the model's behaviour changes with different preprocessing schemes. We discuss the effect of local transformations from luminance to contrast, which result in much higher sensitivity in dark image regions. Additionally, we discuss the optics of the eye, which are interestingly asymmetric, degrading more quickly towards the nasal visual field and thus mimicking the faster decline in receptor density in that direction. We find large improvements in model performance on natural-image masking data when these earlier influences are taken into account. These results argue for the importance of very early visual processing.
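A minimal sketch of the kind of local luminance-to-contrast transformation meant here, assuming a Gaussian local mean; the filter size is an illustrative choice, not the value used in the model.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast(luminance, sigma=8.0, eps=1e-6):
    """Convert a luminance image into local Weber-like contrast by dividing
    the deviation from a local mean by that local mean. Dividing by the local
    mean is what boosts responses in dark image regions."""
    local_mean = gaussian_filter(luminance, sigma)
    return (luminance - local_mean) / (local_mean + eps)
```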

 

Funke, C. M., Wallis, T. S. A., Ecker, A. S., Gatys, L. A., Wichmann, F. A. and Bethge, M. (2017)
A parametric texture model based on deep convolutional features closely matches texture appearance for humans
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

Much of our visual environment consists of texture—“stuff” like cloth, bark or gravel as distinct from “things” like dresses, trees or paths—and we humans are adept at perceiving textures and their subtle variation. How does our visual system achieve this feat? Here we psychophysically evaluate a new parametric model of texture appearance (the CNN texture model; Gatys et al., 2015) that is based on the features encoded by a deep convolutional neural network (deep CNN) trained to recognise objects in images (the VGG-19; Simonyan and Zisserman, 2015). By cumulatively matching the correlations of deep features up to a given layer (using up to five convolutional layers) we were able to evaluate models of increasing complexity. We used a three-alternative spatial oddity task to test whether model-generated textures could be discriminated from original natural textures under two viewing conditions: when test patches were briefly presented to the parafovea (“single fixation”) and when observers were able to make eye movements to all three patches (“inspection”). For 9 of the 12 source textures we tested, the models using more than three layers produced images that were indiscriminable from the originals even under foveal inspection. The venerable parametric texture model of Portilla and Simoncelli (Portilla and Simoncelli, 2000) was also able to match the appearance of these textures in the single fixation condition, but not under inspection. Of the three source textures our model could not match, two contain strong periodicities. In a second experiment, we found that matching the power spectrum in addition to the deep features used above (Liu et al., 2016) greatly improved matches for these two textures. These results suggest that the features learned by deep CNNs encode statistical regularities of natural scenes that capture important aspects of material perception in humans.
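The statistics matched by the CNN texture model are the correlations ("Gram matrices") of feature maps in selected VGG-19 layers. A rough sketch using torchvision follows; the chosen layer indices are assumptions picking out conv1_1 to conv5_1, and the synthesis-by-optimisation step is only indicated, not reproduced from the original implementation.

```python
import torch
import torchvision.models as models

# Pretrained VGG-19 feature extractor (downloads ImageNet weights).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
LAYERS = (0, 5, 10, 19, 28)  # assumed indices of conv1_1 ... conv5_1

def gram_matrices(image, layers=LAYERS):
    """image: tensor of shape (1, 3, H, W), ImageNet-normalised."""
    grams, x = {}, image
    with torch.no_grad():
        for i, layer in enumerate(vgg):
            x = layer(x)
            if i in layers:
                _, c, h, w = x.shape
                f = x.reshape(c, h * w)
                grams[i] = (f @ f.T) / (h * w)  # c x c feature correlation matrix
            if i >= max(layers):
                break
    return grams

# Texture synthesis then optimises a noise image so that its Gram matrices for
# the chosen layers match those of the target texture (e.g. by gradient descent
# on the summed squared differences).
```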

Geirhos, R., Janssen, D., Schütt, H. H., Bethge, M. and Wichmann, F. A. (2017)
Of Human Observers and Deep Neural Networks: A Detailed Psychophysical Comparison
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

Deep Neural Networks (DNNs) have recently been put forward as computational models for feedforward processing in the human and monkey ventral streams. Not only do they achieve humanlevel performance in image classification tasks, recent studies also found striking similarities between DNNs and ventral stream processing systems in terms of the learned representations (e.g. Cadieu et al., 2014, PLOS Comput. Biol.) or the spatial and temporal stages of processing (Cichy et al., 2016, arXiv).
In order to obtain a more precise understanding of the similarities and differences between current DNNs and the human visual system, we here investigate how classification accuracies depend on image properties such as colour, contrast, the amount of additive visual noise, as well as on image distortions resulting from the Eidolon Factory. We report results from a series of image classification (object recognition) experiments on both human observers and three DNNs (AlexNet, VGG16, GoogLeNet). We used experimental conditions favouring singlefixation, purely feedforward processing in human observers (short presentation time of t = 200 ms followed by a high contrast mask); additionally, we used exactly the same images from 16 basic level categories for human observers and DNNs. Under nonmanipulated conditions we find that DNNs indeed outperformed human observers (96.2% correct versus 88.5%; colour, fullcontrast, noisefree images). However, human observers clearly outperformed DNNs for all of the image degrading manipulations: most strikingly, DNN performance severely breaks down with even small quantities of visual random noise. Our findings reinforce how robust the human visual system is against various image degradations, and indicate that there may still be marked differences in the way the human visual system and the three tested DNNs process visual information. We discuss which differences between known properties of the early and higher visual system and DNNs may be responsible for the behavioural discrepancies we find.

Rothkegel, L. O. M., Schütt, H. H.,Trukenbrod, H. A., Wichmann, F. A. and Engbert, R. (2017)
We know what we can see - peripheral visibility of search targets shapes eye movement behavior in natural scenes
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

Influences of target features on fixation locations and search durations have been widely studied. What happens to the properties of the eyes' scanpath when looking for different targets has not been investigated as thoroughly. One important target aspect is how far into the periphery it is detectable, which can be varied by changing its spatial frequency content. Here we show that human participants adapt their eye movement behavior immediately when searching for different targets in natural scenes. In our study, participants searched natural scenes for 6 artificial targets with different spatial frequency content. High spatial frequency targets led to shorter fixation durations and smaller saccade amplitudes than low spatial frequency targets. The effect of smaller saccade amplitudes appeared from the first of eight experimental sessions, without training, persisted throughout all sessions and disappeared when subjects were not told which of the targets to search for. Fixation durations were shorter for high spatial frequency targets after one training session, also persisted throughout all subsequent sessions and likewise disappeared when subjects were not told which of the targets to search for. The differences in eye movement patterns between low and high spatial frequency targets led to longer search times for the high frequency targets, but the probability of finding the target within 10 seconds was unchanged. As high spatial frequency targets cannot be detected far into the periphery, it is adaptive to choose a scanning strategy with shorter fixation durations and shorter saccade amplitudes when searching for them. Our results suggest that humans are capable of adequately adapting their eye movement behavior instantaneously according to the spatial frequency content of the target, implicitly adapting to how far from the fovea targets can be detected.

Schütt, H. H., Rothkegel, L., Trukenbrod, H. A., Engbert, R. and Wichmann, F. A. (2017)
Testing an Early Vision Model on Natural Image Stimuli
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

Early visual processing has been studied extensively over the last decades. From these studies a relatively standard model of the first steps in visual processing has emerged. However, most implementations of the standard model cannot take arbitrary images as input, but only the typical grating stimuli used in many of the early vision experiments. Previously we presented an image-based early vision model implementing our knowledge about early visual processing, including oriented spatial frequency channels, divisive normalization and optimal decoding (Schütt & Wichmann, VSS, 2016). The model explains the classical psychophysical data reasonably well, matching the performance of non-image-based models on contrast detection, contrast discrimination and oblique masking data.
Here we report tests of our model using natural images, exploiting the benefits of image-based models of visual processing. First, we assessed the performance of the model against human observer thresholds for detecting noise Gabors masked by patches of natural scenes (Alam et al., JoV, 2014). Our model predicts the thresholds for this masking experiment well, although it slightly overestimates the sensitivity of observers. Second, we investigated the channel activities for natural scene patches fixated by observers in a free-viewing eye movement experiment. Before normalization, channel activities follow typically observed biases of natural scenes, including the decline in energy over spatial frequency and the stronger activity along the cardinal axes. After divisive inhibition, the distribution of activity is no longer skewed towards low spatial frequencies, while the preference for cardinal axes is preserved. Finally, we observe that the channels are extremely sparsely activated: each natural image patch activates few channels and each channel is activated by few stimuli.
Thus our model is able to generalize from simple grating stimuli to natural image stimuli, and it reproduces normative desiderata stemming from the efficient coding hypothesis and natural image statistics.
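A generic form of the divisive normalization stage referred to above, with illustrative exponents and saturation constant rather than the fitted model parameters:

```python
import numpy as np

def divisive_normalization(channel_activity, p=2.0, q=2.0, sigma=0.05):
    """Generic divisive normalization across channels: each rectified channel
    response is raised to an exponent and divided by a pooled sum of all
    channel responses plus a saturation constant.

    channel_activity : array with shape (..., n_channels)
    """
    a = np.abs(np.asarray(channel_activity, dtype=float))
    pool = sigma ** q + np.sum(a ** q, axis=-1, keepdims=True)
    return a ** p / pool
```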

Schütt, H. H., Rothkegel, L., Trukenbrod, H. A., Engbert, R. and Wichmann, F. A. (2017)
Using an Image-Computable Early Vision Model to Predict Eye Movements
European Conference on Visual Perception (ECVP), Berlin, FRG (poster)

It is widely believed that early visual processing influences eye movements via bottom-up visual saliency calculations. However, direct tests of this hypothesis in natural scenes have been rare as image-computable models of early visual processing were lacking. We recently developed an image-computable early vision model, and thus we now have the means to investigate the connection from early vision to eye movements.
Here we explore eye movement data measured while subjects searched for simple, early-vision-inspired targets such as Gabors and Gaussian blobs overlaid on natural scenes. We compare the output of the early vision model for patches around fixated locations with that for randomly chosen patches. Additionally, we use a neural network to predict the fixation density from the early vision model output. Finally, we use the model's predictions of target detectability to predict search performance.
We find clear differences between the early vision outputs at fixated and at randomly chosen locations, which roughly follow the activations generated by the target alone. Additionally, the fixation density can be predicted reasonably well from the early vision outputs using different weightings for different targets. Finally, target detectability at a specific location predicts search performance in terms of both the probability of finding the target and the time needed to find it.
Our findings show a clear dependence between eye movements and early visual processing. Additionally, they highlight the possibility of using our spatial vision model as a preprocessing step for models of mid- and high-level vision.

Schütt, H. H., Rothkegel, L., Trukenbrod, H. A., Reich, S., Wichmann, F. A. and Engbert, R. (2017)
Likelihood-based Parameter Estimation and Comparison of Dynamical Eye Movement Models
European Conference on Eye Movements (ECEM), Wuppertal, FRG (talk)

Recent eye movement models aim to predict full scanpaths instead of fixation densities only, so parameter estimation, model analysis and model comparison are now of essential importance. We propose a likelihood-based approach for model analysis in a fully dynamical framework that includes time-ordered experimental data and illustrate its use for the recent SceneWalk model. First we show that we can directly compute a likelihood for any model that predicts a distribution for the next fixation given the previous ones. Computing a likelihood makes the full range of mathematical statistics available for such models. In particular, we can perform frequentist and Bayesian parameter estimation, which additionally provides credible intervals informing us how well each parameter is constrained by the data. Using hierarchical models, inference is even possible for individual observers, which allows us to fit individual differences in saccade lengths with the model. Furthermore, our likelihood approach can be used to compare different models. In our example, the dynamical framework is shown to outperform non-dynamical models. Additionally, the likelihood-based evaluation differentiates model variants that produced indistinguishable predictions on hitherto-used statistics. Our results indicate that the likelihood approach is a promising framework for models of full scanpaths.
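The core of the approach is that any model predicting a density for the next fixation given the previous ones yields a scanpath log-likelihood by summation; the interface below is hypothetical and not the SceneWalk API.

```python
import numpy as np

def scanpath_log_likelihood(next_fixation_density, scanpath, eps=1e-12):
    """Log-likelihood of an observed scanpath under a dynamical model.

    next_fixation_density : callable(history) -> 2D array summing to 1, the
                            predicted density for the next fixation given the
                            fixations made so far (hypothetical interface)
    scanpath              : (N, 2) integer array of (row, col) fixations
    """
    ll = 0.0
    for k in range(len(scanpath)):
        density = next_fixation_density(scanpath[:k])
        r, c = scanpath[k]
        ll += np.log(density[r, c] + eps)
    return ll
```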

Wallis, T. S. A., Funke, C. M., Ecker, A. S., Gatys, L. A., Wichmann, F. A. and Bethge, M. (2017)
Towards matching peripheral appearance for arbitrary natural images using deep features
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

Due to the structure of the primate visual system, large distortions of the input can go unnoticed in the periphery, and objects can be harder to identify. What encoding underlies these effects? Similarly to Freeman & Simoncelli (Nature Neuroscience, 2011), we developed a model that uses summary statistics averaged over spatial regions whose size increases with retinal eccentricity (assuming central fixation on an image). We also designed the averaging areas such that changing their scaling progressively discards more information from the original image (i.e. a coarser model produces greater distortions to original image structure than a model with higher resolution). Different from Freeman and Simoncelli, we use the features of a deep neural network trained on object recognition (the VGG-19; Simonyan & Zisserman, ICLR 2015), which is state-of-the-art in parametric texture synthesis. We tested whether human observers can discriminate model-generated images from their original source images. Three images subtending 25 deg, two of which were physically identical, were presented for 200 ms each in a three-alternative temporal oddity paradigm. We find a model that, for most original images we tested, produces synthesised images that cannot be told apart from the originals despite producing significant distortions of image structure. However, some images were readily discriminable. Therefore, the model has successfully encoded necessary but not sufficient information to capture appearance in human scene perception. We explore what image features are correlated with discriminability on the image (which images are harder than others?) and pixel (where in an image is the hardest location?) level. While our model does not produce "metamers", it does capture many features important for the appearance of arbitrary natural images in the periphery.

Wallis, T. S. A., Funke, C. M., Ecker, A. S., Gatys, L. A., Wichmann, F. A. and Bethge, M. (2017)
Matching peripheral scene appearance using deep features: Investigating image-specific variance and contributions of spatial attention
European Conference on Visual Perception (ECVP), Berlin, FRG (poster)

The visual system represents the periphery as a set of summary statistics. Cohen, Dennett and Kanwisher (TICS 2016) recently proposed that this influential idea can explain the discrepancy between experimental demonstrations that we can be insensitive to large peripheral changes and our rich subjective experience of the world. We present a model that summarises the information encoded by a deep neural network trained on object recognition (VGG-19; Simonyan & Zisserman, ICLR 2015) over spatial regions that increase with retinal eccentricity (see also Freeman & Simoncelli, 2011). We synthesise images that approximately match the model response to a target scene, then test whether observers can discriminate model syntheses from original scenes using a temporal oddity task. For some images, model syntheses cannot be told apart from the original despite large structural distortions, but other images were readily discriminable. Can focussed spatial attention overcome the limits imposed by summary statistics? We test this proposal in a pre-registered cueing experiment, finding no evidence that sensitivity is strongly affected by cueing spatial attention to areas of large pixel- or conv5-MSE between the original and synthesised image. Our results suggest that human sensitivity to summary-statistic-invariant scrambling of natural scenes depends more on the image content than on eccentricity or spatial attention. Accounting for this in a feedforward summary statistic model would require that the model also encodes these content-specific dependencies.

Janssen, D. H. J., Schütt, H. H. and Wichmann, F. A. (2016)
Some observations on the psychophysics of Deep Neural Networks
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

Deep convolutional neural networks (DNNs) are currently very popular, drawing interest for their high performance on object classification tasks. Additionally, they are being examined for purported parallels between their hierarchical features and those found in systems of biological vision (e.g. Yamins et al., 2014).
Human vision has been studied extensively by psychophysics using simple grating stimuli, and many experimental results can be accommodated within a model where linear filters are followed by point-wise non-linearities as well as non-linear interactions between filters (Goris et al., 2013). However, two of the most striking failures of current spatial vision models are their inability to account for the contrast-modulation experiments by Henning et al. (1975) and the plaid-masking experiments by Derrington and Henning (1989).
GoogLeNet and AlexNet are two DNNs that perform well on object recognition. We ran contrast-modulated and plaid-masking stimuli through the networks and extracted the layer activations. Since these networks are fully deterministic, we designed an optimal linear decoder around the assumption of late, zero-mean additive noise, where the variance of the noise was calibrated to match human performance in contrast detection experiments. Unlike human observers, neither AlexNet nor GoogLeNet shows any trace in any of its layers of masking by contrast-modulated gratings. Worse still, adding the contrast-modulated mask strongly facilitated detection. Using plaid masks, GoogLeNet again showed strong facilitation. AlexNet, on the other hand, shows plaid-masking effects at least qualitatively similar to those found in human observers. However, this was only true for the last layers, the "object" layers, not the early layers. Strong claims that DNNs mirror the human visual system appear premature. Not only do the DNNs fail to show the masking effects found in human observers, different DNNs were also found to behave wildly differently in response to simple stimuli.
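Under the stated assumption of late, zero-mean, isotropic additive Gaussian noise, the ideal linear decoder's sensitivity reduces to a scaled Euclidean distance between mean activation patterns; a minimal sketch (the noise level is hypothetical and would be calibrated to human contrast detection):

```python
import numpy as np

def dprime_from_activations(act_target, act_reference, sigma):
    """Sensitivity of an ideal linear decoder reading a network layer under
    late, zero-mean, isotropic additive Gaussian noise: d' is the Euclidean
    distance between the two mean activation patterns divided by the noise
    standard deviation."""
    act_target = np.ravel(act_target).astype(float)
    act_reference = np.ravel(act_reference).astype(float)
    return np.linalg.norm(act_target - act_reference) / sigma

# Masking can then be assessed layer by layer, e.g. by comparing d' for
# (mask + target) vs. (mask alone) with d' for (target alone) vs. (blank).
```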

Lewke, B., Wallis, T. S. A. and Wichmann, F. A. (2016)
The influence of semantic information on early visual processing in natural scenes
Tagung experimentell arbeitender Psychologen (TeaP), Heidelberg, FRG (poster)

Recent research has demonstrated that early visual processing can be influenced by contextual and semantic effects in natural scenes, presumably via feedback mechanisms. Here we investigate semantic influences on orientation sensitivity at the level of single trials using a dual-report paradigm. In a temporal 2AFC, observers identified the relative orientation of a peripherally presented (7 degrees from the fovea) Gabor probe embedded in a natural scene, and identified the object (16-alternative discrimination) on which the probe was superimposed. We experimentally varied the amount of scene context by changing the area of the image presented around the probe, and disrupted semantic processing while preserving local image structure by presenting scenes both upright and inverted. Including more scene context improved object identification performance while inverting the scene impaired it. Orientation sensitivity was largely unaffected by the scene manipulations, and there was weak evidence that thresholds were invariant to whether the object identification was correct on a given trial. The results suggest that either scene context does not affect the precision of orientation coding, or that in our experimental paradigm observers were able to segregate the probe from the surrounding scene, thus mitigating any contextual influence. We plan further experiments to discriminate between these possibilities.

Rothkegel, L. O. M., Trukenbrod, H. A., Schütt, H. H., Wichmann, F. A. and Engbert, R. (2016)
Reducing the central fixation bias: The influence of scene preview
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

Models that aim at predicting fixation locations of observers in a natural scene can be based on assumptions about bottom-up processes, top-down processes, and systematic eye movement tendencies. Among the best predictors of fixation locations in an image is the distance of an image location from the image center. Because this central fixation bias (Tatler, 2007) is independent of image content, initial fixation position, screen or eye position, it is a very strong nuisance factor for scene viewing experiments under laboratory conditions. A scene preview as short as 75 ms has been shown to influence saccade target selection effectively (Vo & Henderson, 2010). Thus, a short scene preview might alter eye guidance during initial fixations and result in a reduced central fixation bias. In a scene viewing experiment, we manipulated the initial fixation position to keep the eyes fixated on one location for a certain amount of time before observers were allowed to explore the scene freely for five seconds. As a result, we found that the central fixation bias was reduced for all pretrial fixation times from 125 ms to 1000 ms compared to a control group without scene preview. There were no systematic differences between the different pretrial fixation times. Our results have important practical implications for the evaluation of models of visual attention, since controlled initial fixation durations reduce attentional capture by the stimulus and, therefore, the central fixation bias.

Schütt, H. H., Baier, F. and Fleming, R. W. (2016)
Perception of light source distance from shading patterns
Tagung experimentell arbeitender Psychologen (TeaP), Heidelberg, FRG (poster)

Illumination is thought to play a central role in the perception of shape and surface properties, but little is known about how we estimate properties of the illumination itself. In particular, the perception of light source distance has not been studied before. Varying the distance of a light source from an object alters both the intensity and the spatial distribution of surface shading patterns. We tested whether observers can use such cues to infer light source distance.
Participants viewed stereoscopic renderings of rough objects with Lambertian or glossy surfaces, which were illuminated by a point light source at a range of distances. In one experiment, they adjusted the position of a small probe dot in 3D to report the apparent location of the light in the scene. In a second experiment, they adjusted the shading on one object (by moving an invisible light source), until it appeared to be illuminated from the same distance as another object.
Participants’ responses on average increased linearly with the true light source distance in both experiments and all conditions, suggesting that they have clear intuitions about how light source distance affects shading patterns for a variety of different surfaces. However, there were also systematic errors: Subjects overestimated light source distance in the probe adjustment task, and in both experiments roughness and glossiness affected responses. Subjects perceived the light source to be nearer to smooth and/or glossy objects. Furthermore, perceived light source distance varied substantially between and within subjects.
The differences between conditions were predicted surprisingly well by a simplistic model based only on the area of the image that exceeds a certain intensity threshold.
Thus, although subjects can report light source distance, they may rely on simple---sometimes erroneous---heuristics to do so.
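A sketch of the simplistic thresholded-area model mentioned above; the threshold value is an illustrative assumption.

```python
import numpy as np

def bright_area(image, threshold=0.8):
    """Proportion of pixels whose intensity exceeds a fixed threshold. In the
    simplistic model described above, this single number predicts the
    differences in perceived light source distance between conditions."""
    image = np.asarray(image, dtype=float)
    return np.mean(image > threshold)
```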

Schütt, H. H. and Wichmann, F. A. (2016)
An Image-Based Model for Early Visual Processing
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

Early spatial vision was explored extensively over several decades in many psychophysical detection and discrimination experiments, and thus a large body of data is available. Goris et al. (2013; Psych. Rev.) integrated this psychophysical literature and proposed a model based on maximum-likelihood decoding of a neurophysiologically inspired population of model neurons. Their neural population model (NPM) is able to predict several data sets simultaneously, using a single set of parameters. However, the NPM is only one-dimensional, operating on the activity of abstract spatial frequency channels. Thus it cannot be applied to arbitrary images as a generic front-end to explore the influence of early visual processing on mid- or high-level vision. Bradley et al. (2014; JoV), on the other hand, presented a model operating on images. Their model is thus able to make predictions for arbitrary images. However, compared to the NPM, their model lacks nonlinear processing, which is replaced by an effective masking contrast depending on the detection target. Thus, while Bradley et al. fit a range of detection data, they do not capture nonlinear aspects of early vision such as the dipper function. Here we combine both approaches and present a model which includes nonlinear processing and operates on images. In addition, the model applies optical degradation and retinal processing to the image before it is passed to a spatial frequency and orientation decomposition followed by divisive inhibition. For the optical transfer function of the eye and the distribution of retinal midget ganglion cells we use the approximations of Watson (2013, 2014; JoV). We tested the predictions of our model against a broad range of the early psychophysical literature and found that it predicts some hallmarks of early visual processing, such as the contrast sensitivity function under different temporal conditions and the dipper function for contrast discrimination.


Wallis, T. S. A., Bethge, M. and Wichmann, F. A. (2016)
Testing models of peripheral encoding using metamerism in an oddity paradigm
Tagung experimentell arbeitender Psychologen (TeaP), Heidelberg, FRG (poster)

Most of the visual field is peripheral, and, compared to the fovea, the periphery encodes visual input with less fidelity. What information is encoded and what is lost in the visual periphery? A systematic way to answer this question is to determine how sensitive the visual system is to different kinds of lossy image perturbations. A difficulty of this approach is that, in addition to information loss in the visual system, other factors can reduce performance in behavioural experiments; for example, task performance may be limited by cognitive factors such as attention or memory. Here we develop and explore an experimental paradigm that probes the detectability of perturbations of natural image structure with high sensitivity. Observers compared modified images to original natural scenes in a temporal three-interval oddity task. We consider several lossy image transformations, including Gaussian blur and textures synthesised from the Portilla and Simoncelli algorithm. While our paradigm demonstrates metamerism (physically different images that appear the same) under some conditions, in general we find that, contrary to an extreme "lossy representation" account of peripheral encoding, humans can be impressively sensitive to deviations from natural appearance. The results force us to consider richer representations of peripheral image structure.

Wallis, T. S. A., Ecker, A. S., Gatys, L. A., Funke, C. M., Wichmann, F. A. and Bethge, M. (2016)
Seeking summary statistics that match peripheral visual appearance using naturalistic textures generated by Deep Neural Networks
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

An important hypothesis that emerged from crowding research is that the perception of image structure in the periphery is texture-like. We investigate this hypothesis by measuring perceptual properties of a family of naturalistic textures generated using Deep Neural Networks (DNNs), a class of algorithms that can identify objects in images with near-human performance. DNNs function by stacking repeated convolutional operations in a layered feedforward hierarchy. Our group has recently shown how to generate shift-invariant textures that reproduce the statistical structure of natural images increasingly well, by matching the DNN representation at an increasing number of layers. Here, observers discriminated original photographic images from DNN-synthesised images in a spatial oddity paradigm. In this paradigm, low psychophysical performance means that the model is good at matching the appearance of the original scenes. For photographs of natural textures (a subset of the MIT VisTex dataset), discrimination performance decreased as the DNN representations were matched to higher convolutional layers. For photographs of natural scenes (containing inhomogeneous structure), discrimination performance was nearly perfect until the highest layers were matched, whereby performance declined (but never to chance). Performance was only weakly related to retinal eccentricity (from 1.5 to 10 degrees) and strongly depended on individual source images (some images were always hard, others always easy). Surprisingly, performance showed little relationship to size: within a layer-matching condition, images further from the fovea were somewhat harder to discriminate but this result was invariant to a three-fold change in image size (changed via up/down sampling). The DNN stimuli we examine here can match texture appearance but are not yet sufficient to match the peripheral appearance of inhomogeneous scenes. In the future, we can leverage the flexibility of DNN texture synthesis for testing different sets of summary statistics to further refine what information can be discarded without affecting appearance.

Wichmann, F. A., Eichert, N. and Schütt, H. H. (2016)
An Image-Based Multi-Channel Model for Light Adaptation
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (talk)

The human visual system is sensitive to luminances ranging from 10⁻⁶ to 10⁸ cd/m² (Hood & Finkelstein, 1986, Handbook of Perception and Human Performance, Vol. 1). Given the more limited dynamic range of the photoreceptors and subsequent neurons, effective light adaptation must thus be an essential property of the visual system.

In an important study Kortum & Geisler (1995, Vision Research) measured contrast increment thresholds for increment-Gabor probes on flashed backgrounds in the presence of steady-state backgrounds, exploring how the spatial frequency of the increment-Gabor affects thresholds, i.e. how light adaptation and spatial vision interact. In addition to their experiments, Kortum & Geisler presented a successful model in which the incoming signal undergoes multiplicative and subtractive adapting stages followed by non-linear transduction and late noise.

Here we significantly expand and modify their model: First, our model is image-based and thus accepts any image as input, whereas the original model is only applicable to the putative scalar activations of sine wave stimuli. Second, our model has a single set of parameters for the multiplicative and subtractive adaptation stages, followed by a multi-scale pyramid decomposition (Simoncelli & Freeman, 1995, IEEE International Conference on Image Processing, Vol. III). The spatial frequency dependence of the thresholds is modelled via the DC components’ selective influence on the variance of the late noise in the decision stage of the model. In the original model by Kortum & Geisler the parameters of the multiplicative and subtractive adaptation stages are all spatial frequency dependent, which is problematic if one believes adaptation to happen very early in the visual system, before the signal is split into separate spatial frequency channels.

Our image-based multi-channel light adaptation model not only accounts well for the data of Kortum & Geisler (1995), but in addition captures, for example, the effects of test patches of different size (Geisler, 1979, Vision Research).
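A schematic of multiplicative-then-subtractive adaptation of a luminance image, in the spirit of the stages described above; the functional forms and constants are illustrative assumptions, not the fitted model, and the multi-scale pyramid decomposition and decision stage are omitted.

```python
import numpy as np

def adapt(image, background, gain_constant=100.0, subtractive_weight=0.9):
    """Schematic multiplicative-then-subtractive adaptation of a luminance
    image (cd/m^2). The gain falls with the steady-state background level
    (multiplicative stage), and a scaled copy of the adapted background is
    then subtracted (subtractive stage)."""
    gain = gain_constant / (gain_constant + background)         # multiplicative
    adapted = image * gain
    adapted -= subtractive_weight * background * gain           # subtractive
    return adapted
```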

Aguilar, G., Wichmann, F. A. and Maertens, M. (2015)
Comparing sensitivity estimates from MLDS and forced-choice methods in a slant-from-texture experiment
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

Maximum likelihood difference scaling (MLDS) is a method for the estimation of perceptual scales based on an equal-variance, Gaussian, signal detection model (Maloney & Yang, 2003). It has recently been shown that perceptual scales derived with MLDS allowed the prediction of near-threshold discrimination performance for the Watercolor effect (Devinck & Knoblauch, 2012). The use of MLDS-based scales to predict sensitivity is psychophysically attractive, because the MLDS scale estimation promises to require a comparatively small amount of data relative to classical forced-choice procedures, such as the method of constant stimuli. However, the relationship between estimates from MLDS and forced-choice procedures is not yet well characterized with respect to their bias and variability. It also remains to be tested whether their close correspondence applies to stimuli other than the Watercolor effect. Here, we studied these issues by comparing the MLDS and forced-choice methods in a slant-from-texture experiment. We used a ‘polka dot’ texture pattern which was slanted 0 to 80 deg away from fronto-parallel and viewed through a circular aperture (Rosas et al., 2004). We first obtained perceptual scales describing the relationship between physical and perceived slant using MLDS with the method of triads. Based on individual scales we measured slant discrimination thresholds at different performance levels (d’ = 0.5, 1 and 2) for four different standard slants using standard forced-choice procedures. We obtained a high correspondence between the slant thresholds obtained from the two methods. The variability of the estimates, however, depended heavily on the amount of data for MLDS, somewhat questioning its efficiency. Furthermore, the correspondence between the two methods was reduced in the lower region of the MLDS scale. We conclude that MLDS scales can be used to estimate sensitivity; however, we would advise caution with respect to the generalizability of the correspondence between MLDS and forced-choice based sensitivity measures across experimental tasks.
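For reference, a compact sketch of maximum-likelihood difference scaling for the method of triads under the equal-variance Gaussian model; this is a simplified stand-in for the standard MLDS implementation, with the scale anchored at 0 and 1 and a free noise standard deviation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_mlds_triads(triads, responses, n_levels):
    """Fit scale values psi for n_levels stimulus levels from triad judgements.

    triads    : (N, 3) integer array of stimulus indices (a, b, c), a < b < c
    responses : (N,) array, 1 if the (b, c) pair was judged more different
    """
    def unpack(params):
        psi = np.concatenate(([0.0], params[:-1], [1.0]))
        sigma = np.exp(params[-1])          # log-parameterised, keeps sigma > 0
        return psi, sigma

    def neg_log_lik(params):
        psi, sigma = unpack(params)
        a, b, c = triads.T
        delta = (psi[c] - psi[b]) - (psi[b] - psi[a])
        p = np.clip(norm.cdf(delta / sigma), 1e-9, 1 - 1e-9)
        return -np.sum(responses * np.log(p) + (1 - responses) * np.log(1 - p))

    x0 = np.concatenate((np.linspace(0, 1, n_levels)[1:-1], [np.log(0.2)]))
    res = minimize(neg_log_lik, x0, method="Nelder-Mead")
    return unpack(res.x)   # (estimated scale values, noise sd)
```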

Acknowledgement: This work was supported by the German Research Foundation, Research Training Group GRK 1589/1 and grant DFG MA5127/1-1.

Schütt, H., Harmeling, S., Macke, J. and Wichmann, F. A. (2015)
Psignifit 4: Pain-free Bayesian inference for psychometric functions
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (poster)

Psychometric functions are frequently used in vision science to model task performance. These sigmoid functions can be fit to data using likelihood maximization, but this ignores the reliability or variance of the point estimates. In contrast, Bayesian methods automatically provide this reliability. However, using Bayesian methods in practice usually requires expert knowledge, user interaction and computation time. Also, most methods---including Bayesian ones---are vulnerable to non-stationary observers (whose performance is not constant): for such observers, all methods that assume a stationary binomial observer are overconfident in their estimates.

We present Psignifit 4, a new method for fitting psychometric functions that provides an efficient Bayesian analysis based on numerical integration, requires little user interaction and runs in seconds on a common office computer. Additionally, it fits a beta-binomial model, increasing stability against non-stationarity, and provides sensible default settings, including a heuristic that sets the prior based on the range of stimulus levels in the experimental data. All properties of the analysis can be adjusted if desired.

To test our method, we ran it on extensive simulated datasets. First, we tested the numerical accuracy with different settings and identified settings that compute a good estimate quickly and reliably.

Testing the statistical properties, we find that our method produces correct or slightly conservative confidence intervals in all tested conditions, including different sampling schemes, beta-binomial observers, other non-stationary observers and adaptive methods. When enough data are collected to overcome the small-sample bias caused by the prior, the point estimates are also essentially unbiased.

In summary, we present a user-friendly, fast, correct and comprehensively tested Bayesian method for fitting psychometric functions, which handles non-stationary observers well and is freely available online as a MATLAB implementation.
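The beta-binomial observer model at the heart of the non-stationarity correction can be sketched as follows; the overdispersion parameterisation shown is a common convention and stated here as an assumption rather than the exact psignifit formulation.

```python
import numpy as np
from scipy.stats import betabinom, binom

def block_likelihood(k, n, psi, eta):
    """Likelihood of k correct out of n trials in one block at predicted
    success probability psi. For a stationary observer this is binomial;
    letting the block-wise success probability scatter around psi (beta prior)
    yields a beta-binomial, which widens the likelihood and protects the
    credible intervals against non-stationarity. eta in (0, 1) controls the
    overdispersion; eta -> 0 recovers the binomial."""
    if eta < 1e-9:
        return binom.pmf(k, n, psi)
    nu = 1.0 / (eta ** 2) - 1.0            # concentration of the beta prior
    return betabinom.pmf(k, n, psi * nu, (1.0 - psi) * nu)
```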

Schütt, H., Trukenbrod, H. A., Rothkegel, L. and Engbert, R. (2015)
Test of a Dynamical Model for Natural Scene Exploration
European Conference on Eye Movements (ECEM), Vienna, Austria (poster)

Recently, we proposed a dynamical model for the generation of scanpaths during free viewing of natural scenes (Engbert et al., 2015, J Vis). Here we studied important aspects of the simulated scanpath (e.g., saccade length distributions and clustering of fixations) by comparison with a new experimental data set in combination with an alternative parameter-fitting procedure. In the new experiment, participants reinspected images for a second time, which changes many aspects of the eye movement traces including statistical aggregation of fixations. We apply point process statistics and spatial correlation functions to compare experiments and model simulations. Furthermore, we investigate different methods to fit the model to data, based on a maximum likelihood approach and Bayesian Markov Chain Monte-Carlo sampling (MCMC). Our results indicate that variations of model parameters between different viewing conditions can be interpreted as strategic variations in viewing behavior.

Wallis, T. S. A., Bethge, M. and Wichmann, F. A. (2015)
Metamers of the ventral stream revisited
Vision Sciences Society (VSS), St. Pete Beach, FL, USA (talk)

Peripheral vision has been characterised as a lossy representation: information present in the periphery is discarded to a greater degree than in the fovea. What information is lost and what is retained?

Freeman and Simoncelli (2011) recently revived the concept of metamers (physically different stimuli that look the same) as a way to test this question. Metamerism is a useful criterion, but several details must be refined. First, their paper assessed metamerism using a task with a significant working memory component (ABX). We use a purely spatial discrimination task to probe perceptual encoding. Second, a strong test of any hypothesised representation is to what extent it is metameric for a real scene. Several subsequent studies have misunderstood this to be the result of the paper. Freeman and Simoncelli instead only compared synthetic stimuli to each other.

Pairs of stimuli were synthesised from natural images such that they were physically different but equal under the model representation.

The experiment then assessed the scaling factor (spatial pooling region as a function of retinal eccentricity) required to make these two synthesised images indiscriminable from one another, finding that these scaling factors approximated V2 receptive field sizes. We find that a smaller scale factor than V2 neurons is required to make the synthesised images metameric for natural scenes (which are also equal under the model). We further show that this varies over images and is modified by including the spatial context of the target patches. While this particular model therefore fails to capture some perceptually relevant information, we believe that testing specific models against the criteria that they should *discard as much information as possible* while *remaining metameric* is a useful way to understand perceptual representations psychophysically.

Aguilar, G., Wichmann, F. A. and Maertens, M. (2014)
On the role of two-dimensional cues for perceived differences in slant
European Conference on Visual Perception (ECVP), Belgrade, SRB (poster)

An experimental paradigm frequently used in the study of human depth perception is the inference of surface slant from texture. Previous work has shown that slant is consistently underestimated when close to fronto-parallel, and this has been interpreted as a reduced sensitivity to fronto-parallel slants. For textures with discrete elements, such as 'polka dots', a number of two-dimensional cues change concomitantly with slant. While it has been assumed that these cues are used to infer slant, and that differences in discrimination performance thus reflect differences in slant sensitivity, it has recently been suggested that some differences might rather reflect sensitivity differences to the two-dimensional cues themselves (Todd et al., 2010).

To further test this possibility, we derived perceptual scales for slants defined by texture using the maximum likelihood difference scaling method (Maloney & Yang, 2003). We studied the influence of different two-dimensional cues that varied contingently with slant (the elements' aspect and area ratios and scaling contrast, among others), and compared the experimental results with observer models that selectively responded to each cue. The derived perceptual scales confirm lower sensitivities at slants close to fronto-parallel. Interestingly, none of the different cues was sufficient to explain the sensitivity differences across the entire perceptual scale.

This work was supported by the German Research Foundation, Research Training Group GRK 1589/1 and grant DFG MA5127/1-1.

Betz, T., Wichmann, F. A., Shapley, R. and Maertens, M. (2014)
Testing the ODOG brightness model with narrowband noise stimuli
European Conference on Visual Perception (ECVP), Belgrade, SRB (poster)

The oriented difference of Gaussians (ODOG) model (Blakeslee and McCourt, 1999) computes the perceived brightness of an image region through the operation of orientation and frequency selective filters. The model successfully predicts the brightness of test regions for a number of brightness phenomena, including White’s illusion.

Salmela and Laurinen (2009) showed that masking White's illusion with narrow-band noise affects the strength of the illusion. The effect of the narrow-band noise mask depended on its spatial frequency and orientation, suggesting that orientation- and frequency-selective mechanisms are important for generating the illusion. The ODOG model comprises orientation- and frequency-selective mechanisms, and thus the effect of narrow-band noise on its output is a critical test of its adequacy as a mechanistic account of White's illusion. We analyzed the effects of narrow-band noise on the output of the ODOG model. The orientation of the noise had a similar effect on model and human performance, but the model was much more sensitive to the noise than were human observers. Furthermore, the model was sensitive to noise at low frequencies, where observers were unaffected. We explain the discrepancies through a detailed analysis of the individual components and computational steps of the model.

Janssen, D. and Wichmann, F. A. (2014)
Subband decompositions are inherently incompatible with (most) non-linear models of visual perception
Tagung experimentell arbeitender Psychologen (TeaP), Giessen, FRG (poster)

Image-driven models in vision predict image perceptibility. Typically, such models combine subband decomposition (e.g. Steerable Pyramid) with a late, non-linear decision stage (e.g. Minkowski norm).

The interaction between subband decompositions and the decision non-linearity, however, creates a problem. A subband decomposition represents images as a set of overlapping bands with different peak sensitivities in frequency and orientation. Image content that matches a peak sensitivity is represented as a maximal value in its band. Frequencies or orientations that fall between two peak sensitivities, however, are represented as multiple smaller values in adjacent bands.

For example, the Minkowski norm---a standard decision non-linearity---maps the "activity" values within the bands to a single value. Depending on the Minkowski exponent beta, the norm weighs the values in the vector equally or increasingly favours the highest values. Thus, the contribution of image content falling close to the peak sensitivities of the bands is treated differently from that falling between peaks. As a result, models combining subband decomposition with non-linear processing show an undesirable dependence of their response on the---arbitrarily chosen---peak sensitivities.

Such a dependence implies, for example, that image detectability should oscillate with viewing distance. This is clearly not observed in reality.
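For concreteness, the Minkowski-norm pooling described above can be written compactly; beta is the free exponent discussed in the text.

```python
import numpy as np

def minkowski_pool(band_responses, beta):
    """Minkowski-norm pooling of subband "activity" values into a single
    decision variable. With beta = 2 all values contribute according to their
    energy; as beta grows the pool is increasingly dominated by the largest
    value (beta -> infinity approaches the maximum)."""
    r = np.abs(np.asarray(band_responses, dtype=float))
    return np.sum(r ** beta) ** (1.0 / beta)
```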

Janssen, D. and Wichmann, F. A. (2014)
Improving models of early vision through Bayesian analysis
European Conference on Visual Perception (ECVP), Belgrade, SRB (poster)

Computational models are often used in vision science to formalize theories about how the visual system functions. The most general models attempt to relate visual input---images---to results obtained through psychophysical experiments.
Typically, image-based models contain many parameters. Maximum likelihood estimation (MLE) is a commonly used technique to statistically estimate the model parameters from the data, but MLE only provides point estimates. Even when MLE is complemented by a variability estimate based on the normal approximation at the posterior mode, it does not explore the confidence regions of, and dependencies between, model parameters in sufficient detail.
We present an image-driven model of early vision within a Bayesian framework in which we estimate the posterior distributions of the parameters via Markov Chain Monte Carlo (MCMC) sampling. This provides us with both a "best-fitting model" and estimates of the confidence intervals on, and correlations between, the model parameters. We demonstrate how this information helped guide our model design and avoid non-obvious pitfalls that would not have been apparent via MLE.
Finally, we provide an aside on why subband decompositions do not function well within standard models of early vision and provide a computationally intensive but functional alternative.
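
A minimal sketch of the general approach, using a random-walk Metropolis sampler on a toy two-parameter Weibull psychometric function rather than the full image-driven model; the data, flat priors and proposal widths are invented for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    contrasts = np.array([0.02, 0.04, 0.08, 0.16, 0.32])
    n_trials = np.full(5, 40)
    n_correct = np.array([21, 24, 31, 38, 40])       # toy 2AFC data

    def log_posterior(theta):
        alpha, beta = theta                          # threshold, slope
        if alpha <= 0 or beta <= 0:
            return -np.inf                           # flat priors on the positive axis
        p = 0.5 + 0.5 * (1 - np.exp(-(contrasts / alpha) ** beta))
        return np.sum(n_correct * np.log(p) + (n_trials - n_correct) * np.log(1 - p + 1e-12))

    samples, theta = [], np.array([0.1, 2.0])
    log_p = log_posterior(theta)
    for _ in range(20000):
        proposal = theta + rng.normal(scale=[0.01, 0.2])
        log_p_new = log_posterior(proposal)
        if np.log(rng.uniform()) < log_p_new - log_p:
            theta, log_p = proposal, log_p_new
        samples.append(theta)

    samples = np.array(samples)[5000:]               # drop burn-in
    print("posterior means:", samples.mean(axis=0))
    print("parameter correlation:", np.corrcoef(samples.T)[0, 1])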

Janssen, D. and Wichmann, F. A. (2014)
A better image decomposition for computational vision models
European Mathematical Psychology Group Meeting (EMPG), Tübingen, FRG (poster)

Image-driven models of vision attempt to express the perceptibility of images by relating the actual image to measures of human behavior. Typically, such models combine a subband decomposition (e.g. Steerable Pyramid, Gabor wavelet banks) with subsequent non-linearities (e.g. point-nonlinearities, Minkowski-norm pooling, divisive contrast normalization). The combination of these two, however, causes aliasing-like artifacts to occur in model behavior in the domains of spatial frequency and orientation.
This aliasing behavior can lead to some very strange predictions, such as the perceptibility of certain image aspects oscillating with viewing distance. We propose a number of methods of spacing filters more homogeneously in frequency and orientation. Our approach solves this problem, while additionally more closely resembling what we know about frequency sensitivities in V1.
One such method, which we explore in great detail, is based on methods of spatial statistics, namely a Gibbs process. A Gibbs process represents distributions of points in a space through a pair-potential function, a function that expresses the likelihood of certain interpoint distances occurring. We represent spatial filters as points in 4D space (x, y, frequency, orientation) and generate our sampling grids through statistical sampling from this Gibbs process.
These new methods of generating image decompositions, although much slower and less efficient numerically, offer robust and statistically rigorous methods of sampling frequency and orientation content from images. Additionally, they do not suffer from the frequency- and orientation aliasing problem that occurs in more traditional image decompositions.
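
An illustrative sketch of sampling filter positions from a Gibbs (pairwise-interaction) point process, reduced here to a 2D (log-frequency, orientation) space instead of the full 4D space described above; the pair-potential function and all numerical values are assumptions for illustration:

    import numpy as np

    rng = np.random.default_rng(1)
    n_filters = 40
    lo = np.array([0.0, 0.0])            # log2 frequency in [0, 4], orientation in [0, pi)
    hi = np.array([4.0, np.pi])

    def pair_potential(d):
        # soft repulsion: nearby filters are penalised, distant ones are not
        return 5.0 * np.exp(-(d / 0.3) ** 2)

    def energy(points):
        diff = points[:, None, :] - points[None, :, :]
        d = np.sqrt((diff ** 2).sum(-1))
        iu = np.triu_indices(len(points), k=1)
        return pair_potential(d[iu]).sum()

    points = rng.uniform(lo, hi, size=(n_filters, 2))
    E = energy(points)
    for _ in range(5000):                     # Metropolis moves of single points
        i = rng.integers(n_filters)
        proposal = points.copy()
        proposal[i] = rng.uniform(lo, hi)
        E_new = energy(proposal)
        if np.log(rng.uniform()) < E - E_new:  # accept if energy decreases, else by chance
            points, E = proposal, E_new

    print("sampled filter centres (log2 frequency, orientation):")
    print(points[:5])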

Kümmerer, M., Wallis, T. and Bethge, M. (2014)
How close are we to understanding saliency?
Bernstein Conference (BCCN), Göttingen, FRG (poster)

The study of saliency, here defined as a spatial combination of image features that can capture fixation patterns, is important for both human perception and for computer vision research. However, human fixations are not only determined by the image presented on the screen but also by myriad behavioural biases, including a tendency to look at the centre of the screen and mechanistic temporal effects [1]. The influence of these biases is large [2] and strongly task dependent [3]. Any comparison of saliency models must take these biases into account.

Here we separate the contribution of image-based saliency and several behavioural biases using a probabilistic framework and information theoretic evaluation. We compare several prominent saliency models to a nonparametric gold standard, in which we capture all the spatial structure that can be explained in the data set. We model fixations using point processes [4], and compare models using mutual information. Using a popular data set [5], we show that the total amount of mutual information that can be extracted from an image about the spatial structure of fixations is 2.1 bits/fixation. The best performing model explains 59% of this total information. Remarkably, a model that ignores image content altogether but captures the observers' centre bias explains 42% of the total information. Once the effect of spatial behavioural biases is removed, saliency models explain only 29% of the mutual information. Thus, purely spatial saliency remains a significant challenge.

Finally, we extend the point process framework to capture temporal dependencies between eye movements, revealing that including a tendency to fixate near to the previous fixation [see also 6] increases the likelihood more than the best saliency models do. If we are interested in understanding how people look at images, focusing on only spatial information ignores important structure that is easily accounted for.
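
A toy sketch of the information-theoretic scoring idea: treat a saliency map and a centre-bias baseline as probability densities over pixels and compute the model's gain in bits per fixation; the maps and fixations below are synthetic placeholders, not the data set or point-process likelihoods used in the poster.

    import numpy as np

    rng = np.random.default_rng(0)
    h, w = 48, 64
    saliency = rng.random((h, w))                      # toy "saliency map"
    yy, xx = np.arange(h)[:, None], np.arange(w)[None, :]
    centre_bias = np.exp(-(((yy - h / 2) / (h / 4)) ** 2 + ((xx - w / 2) / (w / 4)) ** 2))

    def to_density(m):
        m = m - m.min() + 1e-9
        return m / m.sum()

    p_model = to_density(saliency * centre_bias)       # saliency combined with centre bias
    p_base = to_density(centre_bias)                   # centre-bias-only baseline

    # draw toy "fixations" from the model density, then score both models on them
    flat = rng.choice(h * w, size=500, p=p_model.ravel())
    fy, fx = np.unravel_index(flat, (h, w))
    gain = np.mean(np.log2(p_model[fy, fx]) - np.log2(p_base[fy, fx]))
    print(f"information gain over the centre-bias baseline: {gain:.3f} bits/fixation")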

Schütt, H. and Wichmann, F. A. (2014)
Uncertainty effects in visual psychophysics
Tagung experimentell arbeitender Psychologen (TeaP), Giessen, FRG (poster)

Uncertainty effects refer to decreased sensitivity in detection experiments when the subject is uncertain about the physical properties of the stimulus (such as its size, position or spatial frequency). In particular, uncertainty about the spatial frequency of a sinusoidal grating is thought to cause a large drop in detectability (Davis & Graham, 1980).
The uncertainty model of Pelli (1985) and a recent model by Goris, Putzeys, Wagemans, and Wichmann (2013) both account for this effect. However, the models make different predictions about the slope of the psychometric function: Pelli's model predicts that the slope increases as the threshold rises, while the model of Goris et al. predicts a pure shift without any slope change.
We measured psychometric functions for the detection of sinusoidal gratings with spatial frequencies of 3, 0.75 and 12 cyc/deg, both alone and intermixed under uncertainty.
Surprisingly, we do not find a sizeable uncertainty effect in our data. We measured at least 13800 trials per subject. With the statistical power afforded by this large number of trials we can exclude effect sizes of the magnitude found in the literature. We discuss different explanations for this lack of effect in our data compared to previous studies.

Schütt, H., Harmeling, S., Macke, J. and Wichmann, F. A. (2014)
Pain-free Bayesian inference for psychometric functions
European Conference on Visual Perception (ECVP), Belgrade, SRB (poster)

To estimate psychophysical performance, psychometric functions are usually modeled as sigmoidal functions, whose parameters are estimated by likelihood maximization. While this approach gives a point estimate, it ignores the reliability (the variance) of that estimate. This is in contrast to Bayesian methods, which in principle can determine the posterior of the parameters and thus the reliability of the estimates. However, using Bayesian methods in practice usually requires extensive expert knowledge, user interaction and computation time. Moreover, many methods, including Bayesian ones, are vulnerable to non-stationary observers (whose performance is not constant).

Our work provides an efficient Bayesian analysis, which runs within seconds on a common office computer, requires little user interaction and improves robustness against non-stationarity. A Matlab implementation of our method, called PSIGNIFIT 4, is freely available online. We additionally provide methods to combine posteriors in order to test differences between psychometric functions (such as between conditions), to obtain posterior distributions for the average of a group, and to perform other comparisons of practical interest.

Our method uses numerical integration, allowing robust estimation of a beta-binomial model that is stable against non-stationarities. Comprehensive simulations to test the numerical and statistical correctness and robustness of our method are in progress, and initial results look very promising.
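
A rough sketch of the grid-based (numerical-integration) Bayesian approach, evaluating a beta-binomial psychometric-function likelihood on a coarse parameter grid; the grid ranges, fixed overdispersion value and toy data are assumptions, and the released toolbox is far more complete than this.

    import numpy as np
    from scipy.stats import betabinom
    from scipy.special import expit

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])      # stimulus levels (toy)
    k = np.array([3, 6, 12, 17, 19])             # correct responses
    n = np.full(5, 20)                           # trials per level

    thresholds = np.linspace(1.0, 5.0, 60)       # parameter grid
    widths = np.linspace(0.3, 4.0, 60)
    eta = 0.05                                   # fixed overdispersion, for the sketch only

    log_post = np.zeros((60, 60))
    for i, m in enumerate(thresholds):
        for j, w in enumerate(widths):
            p = 0.5 + 0.5 * expit(4.0 * (x - m) / w)     # 2AFC logistic psychometric function
            # beta-binomial: binomial success probability p with extra (non-stationarity) variance
            a, b = p * (1 / eta - 1), (1 - p) * (1 / eta - 1)
            log_post[i, j] = betabinom.logpmf(k, n, a, b).sum()

    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    m_mean = (post.sum(axis=1) * thresholds).sum()
    print("posterior mean threshold:", m_mean)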

Tobias, S., Wallis, T., Bethge, M. and Wichmann, F. A. (2014)
Human sensitivity to spatial distortions in the periphery
Bernstein Conference (BCCN), Göttingen, FRG (poster)

The resolution of spatial position, spatial-frequency and orientation encoding decays with distance from the fovea. Furthermore, when the area surrounding a target is cluttered, objects become even more difficult to identify and discriminate ("crowding"). We investigated the coding of position and orientation information by measuring human sensitivity to spatial distortions applied to letter stimuli. Spatial distortions offer a useful method to probe sensitivity to position and orientation because they can be applied to arbitrary images. Observers identified which of four peripherally-presented letters was distorted (4AFC task). We varied the amplitude and frequency characteristics of two band-limited distortion techniques: radial frequency distortion patterns [1] and bandpass noise distortion [2]. In a second condition, target letters were surrounded by four flanking letters to induce crowding. We characterise the change in sensitivity across the spatial scale of distortion for both distortion types and under crowded conditions. We will present an image-based metric of contour change that quantifies the size of the distortions on the same physical axis, and discuss the degree to which sensitivity to spatial distortions can be accounted for by early visual mechanisms.

Wallis, T., Dorr, M. and Bex, P. (2014)
A Bayesian multilevel modelling approach to characterising contrast sensitivity in naturalistic movies
European Mathematical Psychology Group Meeting (EMPG), Tübingen, FRG (poster)

Sensitivity to luminance contrast is a fundamental property of the visual system. We presented contrast increments gaze-contingently within naturalistic video, freely viewed by observers, to examine contrast increment detection performance in a way that approximates the natural environmental input of the visual system. On each trial, one spatial scale of the video sequence was incremented in contrast in a local region at one of four locations relative to the observer's current gaze point. The target was centred 2 degrees from the fovea and smoothly blended with the surrounding unmodified video in space (Gaussian, SD = 0.5 deg) and time (modified raised cosine with 120 ms at maximum amplitude). Five observers made forced-choice responses to the location of the target (4AFC), resulting in approximately 25,000 Bernoulli trials.

Contrast discrimination performance is typically modelled by assuming the underlying contrast response follows a nonlinear transducer function, which is used to determine the expected proportion correct via signal detection theory. We implemented this model in a Bayesian multilevel framework (with a population level across subjects) and estimated the posterior over model parameters via MCMC. Our data poorly constrain the parameters of this model to interpretable values, and constraining the model using strong priors taken from previous research provides a poor fit to the data. In contrast, logistic regression models were better constrained by the data, more interpretable, and provide equivalent prediction performance to the best-performing nonlinear transducer model. We explore the properties of an extended logistic regression that incorporates both eye movement and image content features to predict performance. Using this varying-intercept model, we demonstrate the characteristic contrast sensitivity function with a peak in the range of 0.75 to 3 cycles per degree.
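
A heavily reduced sketch of the varying-intercept logistic-regression idea: predict correctness from log contrast and spatial frequency with one intercept per observer. This is a non-Bayesian, un-pooled approximation to the multilevel model in the abstract, it ignores the 4AFC guessing rate, and the simulated data and predictors are invented for illustration.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 4000
    subject = rng.integers(0, 5, n)                    # five observers
    log_c = rng.uniform(-2.5, -0.5, n)                 # log10 target contrast (hypothetical)
    sf = rng.choice([0.75, 1.5, 3.0, 6.0], n)          # spatial frequency band (cpd)

    # simulate detection with a subject-specific offset and a mid-frequency peak
    offset = np.array([0.0, 0.3, -0.2, 0.1, -0.4])[subject]
    logit_p = 3.0 * (log_c + 1.5) + 1.0 - 0.5 * (np.log2(sf) - np.log2(1.5)) ** 2 + offset
    correct = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit_p)))

    subj_dummies = np.eye(5)[subject]                  # one intercept per observer
    X = np.column_stack([subj_dummies, log_c, np.log2(sf), np.log2(sf) ** 2])
    fit = sm.GLM(correct, X, family=sm.families.Binomial()).fit()
    print(np.round(fit.params, 2))                     # subject intercepts, contrast and frequency terms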

Barthelmé, S., Trukenbrod, H., Engbert, R. and Wichmann, F. A. (2013)
Point process models for eye movement data
Annual Meeting of the Society of Mathematical Psychology (MathPsych), Potsdam, FRG (talk)

Whenever eye movements are measured, a central part of the analysis has to do with where subjects fixate, and why they fixated where they fixated. To a first approximation, a set of fixations can be viewed as a set of points in space: this implies that fixations are spatial data and that the analysis of fixation locations can be beneficially thought of as a spatial statistics problem. We will argue that thinking of fixation locations as arising from point processes is a very fruitful framework for eye movement data.
We will provide a tutorial introduction to some of the main ideas of the field of spatial statistics, focusing especially on spatial Poisson processes. We will show how point processes help relate image properties to fixation locations. In particular, they express quite naturally the idea that how image features predict fixations might vary from one image to another. We will review other methods of analysis used in the literature, show how they relate to point process theory, and argue that thinking in terms of point processes substantially extends the range of analyses that can be performed and clarifies their interpretation.

Betz, T., Maertens, M. and Wichmann, F. A. (2013)
Spatial filtering vs edge integration: comparing two computational models of lightness perception
European Conference on Visual Perception (ECVP), Bremen, FRG (poster)

The goal of computational models of lightness perception is to predict the perceived lightness of any surface in a scene based on the luminance value at the corresponding retinal image location. Here, we compare two approaches that have been taken towards that goal: the oriented difference-of-Gaussian (ODOG) model [Blakeslee and McCourt, 1999, Vision Research, 39(26): 4361–4377], and a model based on the integration of edge responses [Rudd, 2010, Journal of Vision, 10(14): 1–37]. We reimplemented the former model, and extended it by replacing the ODOG filters with steerable pyramid filters [Simoncelli and Freeman, 1995, IEEE, ICIP proceedings, 3], making the output less dependent on the specific spatial frequencies present in the input. We also implemented Rudd's edge integration idea and supplemented it with an image-segmentation stage to make it applicable to more complex stimuli than the ones he considered. We apply both models to various stimuli that have been experimentally used to probe lightness perception (e.g. disk-annulus configurations, White's illusion, Adelson's checkerboard). The model outputs are compared with human lightness responses. The discrepancies between the models and the human data can be used to infer those model components that are critical to capture human lightness perception.

Engbert, R., Trukenbrod, H. A. and Wichmann, F. A. (2013)
Using spatial point processes to evaluate models of eye guidance in scene viewing
European Conference on Eye Movements (ECEM), Lund, S (poster)

The distribution of fixation locations on a stationary visual scene can be interpreted as an intensity function of an underlying spatial point process (Illian et al., 2008). In point process theory, we try to analyze the point-to-point interactions to infer possible generating mechanisms. The pair correlation function provides a mathematical measure of the density and statistical interaction of neighboring points. We explore the possibility of applying the pair correlation function to the spatial statistics of fixation locations generated from individual scanpaths of human observers. We demonstrate that the inhomogeneous pair correlation function removes first-order heterogeneity induced by systematic variation of saliency within a given scene from second-order spatial statistics. Results indicate significant spatial clustering at short length scales. Finally, we use the inhomogeneous pair correlation function for the evaluation of a dynamical model of saccade generation in active vision during scene perception.
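
A minimal sketch of an inhomogeneous pair correlation estimate of the kind described above, with a kernel plug-in intensity estimate, no edge correction and toy fixation coordinates; dedicated spatial-statistics software implements this far more carefully.

    import numpy as np

    rng = np.random.default_rng(2)
    fix = rng.uniform(0, 1, size=(300, 2))             # toy fixation coordinates in a unit window

    def kernel_intensity(points, query, bw=0.1):
        # first-order intensity lambda(x) via a Gaussian kernel estimate
        d2 = ((query[:, None, :] - points[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bw ** 2)).sum(1) / (2 * np.pi * bw ** 2)

    def pair_correlation(points, r_values, bw=0.02):
        lam = kernel_intensity(points, points)          # plug-in intensity at each point
        d = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(-1))
        mask = ~np.eye(len(points), dtype=bool)         # all ordered pairs i != j
        d, w = d[mask], (1.0 / np.outer(lam, lam))[mask]  # inhomogeneous weighting
        g = []
        for r in r_values:
            k = np.exp(-0.5 * ((d - r) / bw) ** 2) / (bw * np.sqrt(2 * np.pi))
            g.append((w * k).sum() / (2 * np.pi * r))   # window area = 1; edge correction omitted
        return np.array(g)

    r = np.linspace(0.02, 0.3, 15)
    print(np.round(pair_correlation(fix, r), 2))        # values > 1 would indicate clustering at scale r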

Engbert, R., Trukenbrod, H. A. and Wichmann, F. A. (2013)
Using spatial point processes to evaluate models of eye guidance in scene viewing
Annual Meeting of the Society of Mathematical Psychology (MathPsych), Potsdam, FRG (poster)

The distribution of fixation locations on a stationary visual scene can be interpreted as an intensity function of an underlying spatial point process (Illian et al., 2008). In point process theory, we try to analyze the point-to-point interactions to infer possible generating mechanisms. The pair correlation function provides a mathematical measure of the density and statistical interaction of neighboring points. We explore the possibility of applying the pair correlation function to the spatial statistics of fixation locations generated from individual scanpaths of human observers. We demonstrate that the inhomogeneous pair correlation function removes first-order heterogeneity induced by systematic variation of saliency within a given scene from second-order spatial statistics. Results indicate significant spatial clustering at short length scales. Finally, we use the inhomogeneous pair correlation function for the evaluation of a dynamical model of saccade generation in active vision during scene perception.

Gerhard, H. E., Wichmann, F. A. and Bethge, M. (2013)
How sensitive is the human visual system to the local statistics of natural images?
Computational and Mathematical Models in Vision (MODVIS), Naples, FL, USA (talk)

Several physiological links between natural image regularities and visual representation have been made using probabilistic natural image models. However, such models had not yet been linked directly to perceptual sensitivity. Here we present results from a new test of model efficacy based on perceptual discrimination. Observers viewed two sets of image samples on every trial: one set of natural images, the other set matched in joint probability under a natural image model (generated by shuffling the natural set's content subject to model assumptions). Task: which set contains true natural images? We tested 8 models from one capturing only 2nd-order correlations to one among the current state-of-the-art in capturing higher-order correlations. Discrimination performance was accurately predicted by model likelihood, an information theoretic measure of model efficacy, and was overall quite high indicating that the visual system's sensitivity to higher-order regularities in natural images far surpasses that of any current image model.

Janssen, D. and Wichmann, F. A. (2013)
Bayesian analysis of a psychophysical model of human pattern detection
Bernstein Conference (BCCN), Tübingen, FRG (poster)

In psychophysics and the behavioural neurosciences, computational models are used to bridge the gap between raw data and understanding. Researchers formulate mathematical models summarizing the experimental data into a more comprehensible set of model parameters, typically estimated using maximum likelihood (ML) methods.
Traditional ML methods have a number of problems, however, which arise from the fact that they only find the single best solution, but do not explore the parameter space fully. First, the parameter space may contain very different model specifications that still produce very similar results. Second, the parameters can be correlated. Third, maximum likelihood estimates by themselves provide no insight into the variability of the estimated parameters. Without estimates of the variability of the fitted parameters, however, it is difficult to interpret the modelling results. One traditional solution is to estimate the variability (standard errors) from the normal approximation at the mode found by ML estimation. As a natural alternative, we propose and demonstrate Bayesian methods of exploring the parameter space which not only provide model solutions, but also provide information on the distributions of, and correlations between, the model parameters.
We apply Bayesian methods to psychophysical models of human pattern detection. Such models assume that the retinal image is first analysed through (nearly) independent and linear pathways tuned to different spatial frequencies and orientations. Second, the activation within the pathways is non-linearly transformed, either via a non-linear transducer or via a divisive contrast gain control mechanism. Third, the outputs are subjected to a suitable noise source and are then combined into a single decision variable via a simple Minkowski-norm. We discuss how our Bayesian exploration of model parameters reflects on model quality and interpretability, and thus provides useful model diagnostics.

Maertens, M., Wichmann, F. A. and Shapley, R. M. (2013)
Context affects lightness at the level of surfaces
OSA Fall Vision Meeting, Houston, TX, USA (poster)

The accurate perception of object attributes such as surface lightness is vital for the successful interaction with the environment. However, it is unknown how the visual system assigns lightness values to surfaces based on the intensity distribution in the retinal image. It has been shown that the perceived lightness of an image region is influenced by its context but whether that influence involves so-called low-, mid- or high-level mechanisms is disputed. To probe the level of lightness perception psychophysically we manipulated the surface character of a target region and measured its influence on perceived lightness. We show that only when the target region was consistent with the perceptual interpretation of a surface was it subject to a subjective brightening (assimilation) effect. When the target region was consistent with a spotlight, and hence inconsistent with surface perception, there was no concomitant increase in perceived lightness. These results suggest that the effect of context on the lightness of an image region is not deterministic, but can instead be modulated by other attributes of the image region that imply a mid-level scene interpretation.

Putzky, P., Wichmann, F. A. and Macke, J. H. (2013)
A statistical framework for molecular psychophysics
Osnabrück Computational Cognition Alliance Meeting (OCCAM), Osnabrück, FRG (poster)

Psychophysical experiments aim to provide accurate descriptions of how an observer's responses depend on a presented stimulus. However, perceptual decision making can – aside from the physical stimulus – be influenced by various intrinsic factors such as the subject's attentive state or previous experience.
Although it is well accepted that observers show inter-trial dependencies in their responses, standard data analysis is often performed on responses averaged across trials. In 1964 Green introduced the idea of 'molecular psychophysics', i.e. the analysis of psychophysical data on a trial-by-trial basis. Yet, the statistical tools for applying these kinds of analyses to standard psychophysical data have been lacking. We present a probabilistic framework for perceptual decision making which is (1) compatible with a commonly used framework for estimating psychometric functions (Wichmann and Hill, 2001) and (2) enables the experimenter to separate non-stimulus related responses from stimulus responses. We use a hierarchical mixture of experts to represent the perceptual decision making process, and perform Bayesian variational inference to determine the factors that influence decisions on individual trials (Bishop and Svensen, 2003).
In a simulation study we show that ignoring intrinsic factors, such as serial dependency, can lead to a systematic bias in the inference of a psychometric function. Our method successfully captures, visualizes and corrects for these effects and is applicable to a wide range of psychophysical data. Thus it has the potential to contribute to a more realistic understanding of perceptual decision making.

Trukenbrod, H. A., Wichmann, F. A. and Engbert, R. (2013)
A dynamical model of attentional selection for saccades during scene viewing
European Conference on Eye Movements (ECEM), Lund, S (poster)

When viewing a scene, we reorient our fovea about three times per second to inspect areas of interest with high visual acuity. Stimulus-driven (bottom-up/saliency) factors as well as task (top-down) factors have been identified to predict individual fixation locations. However, little is known about the dynamical rules that induce the generation of sequences of fixations (i.e., the eye's scanpath). Here we propose a computational model to investigate the dynamical interaction between the build-up of saliency and the inhibition of recently selected targets. In our model, fixation sequences are determined by the interaction of two processing maps. In a first map, an attentional processing window generates the build-up of a saliency field by dynamical rules. A secondary motor map keeps track of recently fixated targets. Finally, both maps interact to generate a movement-planning field for saccadic eye movements. Our simulations predict properties of the experimentally observed eye-movement data, e.g., distribution of saccade amplitudes. The new computational model represents a promising framework to investigate the link between internal saliency of a scene and subsequent target selection during free viewing of natural scenes.

Wichmann, F. A. (2013)
Machine learning methods for system identification in sensory psychology
Annual Meeting of the Society of Mathematical Psychology (MathPsych), Potsdam, FRG (keynote talk)

As a prerequisite to quantitative psychophysical models of sensory processing it is necessary to know to what extent decisions in behavioral tasks depend on specific stimulus features, the perceptual cues: Given the high-dimensional input, which are the features the sensory systems base their computations on? Over the last years we have developed inverse machine learning methods for (potentially nonlinear) system identification, and have applied them to identify regions of visual saliency (Kienzle et al., JoV, 2009), to gender discrimination of human faces (Wichmann et al., 2005; Macke & Wichmann, 2010), and to the identification of auditory tones in noise (Schönfelder & Wichmann, 2012; 2013). In my talk I will concentrate on how stimulus-response data can be analyzed using L1-regularized multiple logistic regression. This method prevents over-fitting to noisy data and enforces sparse solutions. In simulations, "behavioral" data from a classical auditory tone-in-noise detection task were generated, and L1-regularized logistic regression precisely identified observer cues from a large set of covarying, interdependent stimulus features (a setting where standard correlational and regression methods fail). In addition, the method succeeds for deterministic as well as probabilistic observers. The detailed decision rules of the simulated observers could be reconstructed from the estimated model weights, thus allowing predictions of responses on the basis of individual stimuli. Data from a real psychophysical experiment confirm the power of the proposed method.
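
A toy sketch of this analysis idea: simulate a probabilistic observer who bases responses on a few of many correlated stimulus features, and recover those cues with L1-regularised logistic regression; the feature dimensions, the observer's weights and the regularisation strength are illustrative choices, not the actual simulation from the talk.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_trials, n_features = 2000, 50
    cov = 0.6 * np.ones((n_features, n_features)) + 0.4 * np.eye(n_features)
    X = rng.multivariate_normal(np.zeros(n_features), cov, size=n_trials)  # correlated features

    true_w = np.zeros(n_features)
    true_w[[3, 17, 31]] = [1.5, -1.0, 0.8]             # the simulated observer uses only three cues
    p_yes = 1.0 / (1.0 + np.exp(-(X @ true_w)))
    y = rng.binomial(1, p_yes)                         # probabilistic observer responses

    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    model.fit(X, y)
    recovered = np.flatnonzero(np.abs(model.coef_[0]) > 1e-6)
    print("features with non-zero weight:", recovered)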

Wichmann, F. A. (2013)
Models of Early Spatial Vision: Bayesian Statistics and Population Decoding
European Conference on Visual Perception (ECVP), Bremen, FRG (invited talk)

In psychophysical models of human pattern detection it is assumed that the retinal image is analyzed through (nearly) independent and linear pathways (“channels”) tuned to different spatial frequencies and orientations, followed by a simple maximum-output decoding rule. This hypothesis originates from a series of very carefully conducted and frequently replicated psychophysical pattern detection, summation, adaptation, and uncertainty experiments, whose data are all consistent with the simple model described above. However, spatial-frequency tuned neurons in primary visual cortex are neither linear nor independent, and ample evidence suggests that perceptual decisions are mediated by pooling responses of multiple neurons. Here I will present recent work by Goris, Putzeys, Wagemans & Wichmann (Psychological Review, in press), proposing an alternative theory of detection in which perceptual decisions develop from maximum-likelihood decoding of a neurophysiologically-inspired model of population activity in primary visual cortex. We demonstrate that this model predicts a broad range of classic detection results. Using a single set of parameters, our model can account for several summation, adaptation and uncertainty effects, thereby offering a new theoretical interpretation for the vast psychophysical literature on pattern detection. One key component of this model is a task-specific, normative decoding mechanism instead of the task-independent maximum-output (or any Minkowski-norm) decoder typically employed in early vision models.

This opens the possibility that perceptual learning may at least sometimes be understood in terms of learning the weights of the decoder: Why and when can we successfully learn it, as in the examples presented by Goris et al. (in press)? Why do we fail to learn it in other cases, e.g. Putzeys, Bethge, Wichmann, Wagemans & Goris (PLoS Computational Biology, 2012)?

Furthermore, the success of the Goris et al. (2013) model highlights the importance of moving away from ad-hoc models designed to account for data of a single experiment, and instead moving towards more systematic and principled modeling efforts accounting for many different datasets using a single model.

Finally, I will briefly show how statistical modeling can complement the mechanistic modeling approach of Goris et al. (2013). Using a Bayesian graphical model approach to contrast discrimination, I show how Bayesian inference allows us to estimate the posterior distribution of the parameters of such a model. The posterior distribution provides diagnostics of the model that help in drawing meaningful conclusions from the model and its parameters.

Barthelmé, S., Trukenbrod, H. A., Engbert, R. and Wichmann, F. A. (2012)
Fixation patterns as point processes
Vision Sciences Society (VSS), Naples, FL, USA (poster)

Sequences of eye movements are composed of fixations and saccades. In many cases, what is of interest is essentially where in space fixations occur: what is relevant is the point pattern formed by successive fixations. This suggests that the analysis of eye movements could benefit from models of point processes developed in spatial statistics, including latent Gaussian fields. We show that these are a valuable tool for the understanding of eye movement patterns. We focus on questions occurring in the study of overt attention. When subjects select what part of a stimulus to attend, several factors are often at work. Where people look will depend on what information they need, on what the stimulus is, but also on things less directly relevant to the analysis: for example, the common bias for central locations over peripheral ones. To understand the strategies at work, one must be able to somehow separate these different factors. One might ask, for example, what part of fixations on natural images can be explained by the presence of high contrast edges, or other low-level features. But how do we formalise the idea that the pattern of fixations is partly due to contrast, and partly not? Based on techniques borrowed from Functional Data Analysis we formulate a framework for the analysis of fixation locations. This allows us to describe fixation distributions in a non-parametric way. To introduce some regularity we assume that fixation distributions do not vary completely freely but are functions of some known variables, based on the stimulus or on trial history. The use of log-additive decompositions lets one separate out the various factors at work, and allows for a direct evaluation of their relative influence.

Barthelmé, S., Trukenbrod, H. A., Engbert, R. and Wichmann, F. A. (2012)
Analysing fixations using latent Gaussian fields
Computational and Systems Neuroscience (COSYNE), Salt Lake City, UT, USA (poster)

Although eye movements are often described as arising from one of the simplest decision mechanisms, they sometimes reflect fairly sophisticated behaviour and are not trivial to predict (Schütz et al., 2011). One important aspect of eye movement sequences is, naturally, fixation locations – i.e., where people choose to look. Several factors are often at work, because where people look will depend on what information they need, on what the stimulus is, but also on things less directly relevant to the analysis: for example, the common bias for central locations over peripheral ones (Tatler and Vincent, 2009). To understand the strategies at work, one must be able to somehow separate these different factors.

Based on techniques borrowed from Functional Data Analysis (Ramsay and Silverman, 2005) and Spatial Statistics (Møller et al., 1998) we formulate a framework for the analysis of fixation locations. We use Latent Gaussian Fields to directly describe conditional fixation distributions, adapting the Logistic Gaussian Process of Lenk (1988). This allows us to describe fixation distributions in a non-parametric way. To introduce some regularity we assume that fixation distributions do not vary completely freely but are functions of some known variables, based on the stimulus or on trial history. The use of log-additive decompositions lets one separate out the various factors at work.

We show that our framework is extremely useful for the analysis of saliency in natural images. It is customary to analyse the role of low-level saliency by focusing on the properties of regions that are empirically salient, i.e. regions that subjects often look at. However, we know that people tend to fixate the same locations regardless of what the image is, and make small saccades rather than large ones. The implication is that not all fixations signal the same level of intrinsic saliency. Using a dataset collected by Kienzle et al. (2009), we illustrate how intrinsic saliency can be inferred using our framework.

Fründ, I., Wichmann, F. A. and Macke, J. H. (2012)
Dealing with sequential dependencies in psychophysical data
Vision Sciences Society (VSS), Naples, FL, USA (poster)

Psychophysical experiments are the standard approach for quantifying functional abilities and properties of sensory systems, and for linking observed behaviour to the underlying neural mechanisms. In most psychological experiments, human observers or animals respond to multiple trials that are presented in a sequence, and it is commonly assumed that these responses are independent of responses on previous trials, as well as of stimuli presented on previous trials. There are, however, multiple reasons to question the ubiquitous assumption of “independent and identically distributed trials”. In addition, it has been reported that inter-trial dependencies are pronounced in behaving animals (Busse et al, 2011, J Neurosci). These observations raise two central questions: First, how strong are sequential dependencies in psychophysical experiments? Second, what are statistical methods that would allow us to detect these dependencies, and to deal with them appropriately? Here, we present a statistical modelling framework that allows for quantification of sequential dependencies, and for investigating their effect on psychometric functions estimated from data. In particular, we extend a commonly used model for psychometric functions by including additional regressors that model the effect of experimental history on observed responses. We apply our model to both simulated data and multiple real psychophysical data-sets of experienced human observers. We show that our model successfully detects trial by trial dependencies if they are present and allows for a statistical assessment of the significance of these dependencies. We find that, in our data-sets, the majority of human observers displays statistically significant history dependencies. In addition, we show how accounting for history dependencies can lead to changes in the estimated slopes of psychometric functions. As sequential dependencies are presumably stronger in inexperienced observers or behaving animals, we expect that methods like the ones presented here will become important tools for modelling psychophysical data.
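
A schematic version of the history-extended model, here written as a plain logistic GLM with the previous response and previous stimulus as additional regressors; the simulated data and the simple coding are stand-ins for the full model used in the poster.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 1500
    stim = rng.choice([-1.0, -0.5, 0.5, 1.0], size=n)  # signed stimulus levels (toy)
    resp = np.zeros(n, dtype=int)
    for t in range(n):
        # simulated observer with a bias towards repeating the previous response
        hist = 0.4 * (2 * resp[t - 1] - 1) if t > 0 else 0.0
        p = 1.0 / (1.0 + np.exp(-(2.0 * stim[t] + hist)))
        resp[t] = rng.binomial(1, p)

    prev_resp = np.r_[0, 2 * resp[:-1] - 1]            # previous response coded -1 / +1
    prev_stim = np.r_[0, stim[:-1]]
    X = sm.add_constant(np.column_stack([stim, prev_resp, prev_stim]))
    fit = sm.GLM(resp, X, family=sm.families.Binomial()).fit()
    print(fit.params)                                  # the weight on prev_resp should recover ~0.4
    print(fit.pvalues)                                 # significance of the history regressors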

Fründ, I., Wichmann, F. A. and Macke, J. H. (2012)
Dealing with sequential dependencies in psychophysical data
Computational and Systems Neuroscience (COSYNE), Salt Lake City, UT, USA (poster)

Psychophysical experiments are the standard approach for quantifying functional abilities and properties of sensory systems, and for linking observed behaviour to the underlying neural mechanisms. In most psychological experiments, human observers or animals respond to multiple trials that are presented in a sequence, and it is commonly assumed that these responses are independent of responses on previous trials, as well as of stimuli presented on previous trials. There are, however, multiple reasons to question the ubiquitous assumption of “independent and identically distributed trials”. In addition, it has been reported that inter-trial dependencies are pronounced in behaving animals (Busse et al, 2011, J Neurosci). These observations raise two central questions: First, how strong are sequential dependencies in psychophysical experiments? Second, what are statistical methods that would allow us to detect these dependencies, and to deal with them appropriately? Here, we present a statistical modelling framework that allows for quantification of sequential dependencies, and for investigating their effect on psychometric functions estimated from data. In particular, we extend a commonly used model for psychometric functions by including additional regressors that model the effect of experimental history on observed responses. We apply our model to both simulated data and multiple real psychophysical data-sets of experienced human observers. We show that our model successfully detects trial by trial dependencies if they are present and allows for a statistical assessment of the significance of these dependencies. We find that, in our data-sets, the majority of human observers displays statistically significant history dependencies. In addition, we show how accounting for history dependencies can lead to changes in the estimated slopes of psychometric functions. As sequential dependencies are presumably stronger in inexperienced observers or behaving animals, we expect that methods like the ones presented here will become important tools for modelling psychophysical data.

Gerhard, H. and Wichmann, F. A. (2012)
How sensitive is the human visual system to the local statistics of natural images?
Bernstein Conference, München, FRG (poster)

A key hypothesis in sensory system neuroscience is that sensory representations are adapted to the statistical regularities in sensory signals and thereby incorporate knowledge about the outside world. Supporting this hypothesis, several probabilistic models of local natural image regularities have been proposed that reproduce neural response properties. Although many such physiological links have been made, these models have not been linked directly to visual sensitivity. Previous psychophysical studies focus on global perception of large images, so little is known about sensitivity to local regularities. We present a new paradigm for controlled psychophysical studies of local natural image regularities and use it to compare how well such models capture perceptually relevant image content. To produce image stimuli with precise statistics, we start with a set of patches cut from natural images and alter their content to generate a matched set of patches whose statistics are equally likely under a model’s assumptions. Observers have the task of discriminating natural patches from model patches in a forced choice experiment. The results show that human observers are remarkably sensitive to local correlations in natural images and that no current model is perfect for patches of 5 by 5 pixels or larger. Furthermore, discrimination performance was accurately predicted by model likelihood, an information theoretic measure of model efficacy, which altogether suggests that the visual system possesses a surprisingly large knowledge of natural image higher-order correlations, much more so than current image models. We also perform three cue identification experiments where we measure visual sensitivity to selected natural image features. The results reveal several prominent features of local natural image regularities including contrast fluctuations and shape statistics.

Maertens, M. and Wichmann, F. A. (2012)
When luminance increment thresholds depend on apparent lightness
Vision Sciences Society (VSS), Naples, FL, USA (poster)

The just noticeable difference (JND) between two stimulus intensities increases in proportion to the background intensity (Weber's law). It is less clear, however, whether the JND is a function of the proximal or perceptual stimulus intensity. In the domain of achromatic surface colors the question would translate to whether the JND depends on local luminance or surface lightness. In the laboratory using simple stimuli such as uniform patches, proximal (luminance) and perceived intensity (lightness) often coincide. Reports that tried to disentangle the two factors yielded inconsistent results (e.g. Heinemann, 1961 JEP 61 389-399; Cornsweet and Teller, 1965 JOSA 55(10) 1303-1308; Zaidi and Krauskopf, 1985 Vision Res 26 759-62; McCourt and Kingdom, 1996 Vision Res 36 2563-73; Henning, Millar and Hill, 2000 JOSA 17(7) 1147-1159; Hillis and Brainard, 2007 CurrBiol 17 1714-1719). Following a previous experiment (Maertens and Wichmann, 2010 JVis 10 424) we measured discrimination thresholds in the Adelson checkerboard pattern for two equiluminant checks which differed in lightness (black vs. white). Discrimination performance was measured in two conditions: in the 'blob' condition, the increment was a two-dimensional gaussian centered on the check, in the 'check' condition, the increment was a constant that was added to the entire check. Performance was assessed in a 2-interval forced-choice and a yes-no task. In the 'blob' condition thresholds were indistinguishable between equiluminant checks and did not differ between the tasks. In the 'check' condition thresholds differed between equiluminant checks and were elevated for the lighter one. This was true for the yes-no task and to a lesser extent in the 2-IFC task. We think that these results require discussion beyond the question of the appropriate type of increment. We believe that the visual system might respond fundamentally differently to light emanating from meaningful surfaces and to isolated spots of light.

Trukenbrod, H. A., Barthelmé, S., Wichmann, F. A. and Engbert, R. (2012)
Color does not guide eye movements: Evidence from a gaze-contingent experiment
European Conference on Visual Perception (ECVP), Alghero, I (poster)

Color plays a crucial role in everyday life and supports actions such as searching for specific objects. Whether this results from modified eye guidance has not been systematically explored. Using natural scenes, we investigated the influence of color on eye movements in a visual search task. The availability of color information was limited to a constant area around fixation by presenting gaze-contingent stimuli. The remaining image was masked by a luminance-matched grayscale version of the scene. Across trials, we used six different mask sizes. Fixated stimuli ranged from black-and-white to fully colored images. Before each trial, participants were instructed to look for a bullseye-shaped target defined either by luminance or luminance plus color. Our results show that color information was not used to guide eye movements. Except for minor disruptions in conditions with small masks, fixation durations and saccade amplitudes did not differ across conditions. Even looking for a specific color target did not change statistical measures of eye guidance. While it is beyond question that color supports vision, our results suggest that color does not modify eye-movement control. While color might help to facilitate processes like object segmentation, oculomotor control seems to be unaffected by color information.

Barthelmé, S., Trukenbrod, H. A., Engbert, R. and Wichmann, F. A. (2011)
Inferring intrinsic saliency from free-viewing data
Bernstein Conference, Freiburg, FRG (poster)

A central debate in the literature on the deployment of attention in natural images is whether the observed fixation patterns reflect the work of basic low-level saliency mechanisms or, on the contrary, higher-level, object-based behaviour (Nuthmann & Henderson, 2010). The most common approach is to have subjects freely explore natural images ("free-viewing"), and collect fixation locations. Locations which are fixated by the subjects are said to be salient. Therefore, when analysing such data, the goal is often to correlate local image features with fixations, to see if the former can predict the latter (Kienzle et al. 2009).

However, one important problem is the assumption that all fixations reflect the same underlying level of intrinsic saliency: this is unreasonable because subjects show patterns in their fixated locations irrespective of image content. For example, a well-documented bias is simply to fixate around the center, regardless of what the image is (Torralba et al., 2006). Therefore, a fixation around the center is not necessarily motivated by high intrinsic saliency, because subjects tend to fixate around the center anyway. However, they need good reasons to go look at locations far away from the center, and off-center fixations should receive greater weight.

The raw data used to fit and test image-based models of visual saliency is therefore difficult to work with, because it confuses image-dependent factors and image-independent ones. To untangle these factors, we have developed a statistical model based on a log-additive decomposition of the fixation probability density. Taking inspiration from functional data analysis (Ramsay & Silverman, 2005) and the analysis of spatial data using log-Gaussian Cox processes (Møller et al., 1998), we show that it is possible to approximate the intrinsic saliency in an image by analysing large datasets from free-viewing tasks.

Dold, H. M. H., Fründ, I. and Wichmann, F. A. (2011)
Separate Bayesian inference reveals model properties shared between multiple experimental conditions
Bernstein Conference, Freiburg, FRG (poster)

Statistical modeling produces compressed and often more meaningful descriptions of experimental data. Many experimental manipulations target selected parameters of a model, and to interpret these parameters other model components need to remain constant.
For example, perceptual psychologists are interested in the perception of luminance patterns depending on their contrast. The model describing this data has two critical parameters: the contrast that elicits a predefined performance, the threshold, and the rate of performance change with increases in contrast, the slope. Typical experiments target threshold differences, assuming constant slope across conditions. This situation requires a balance between model complexity to perform joint inference of all conditions and the simplicity of isolated fits in order to apply robust standard procedures. We show how separate analysis of experimental conditions can be performed such that all conditions are implicitly taken into account. The procedure is mathematically equivalent to a single Gibbs sampling step in the joint model embracing all conditions. We present a very natural way to check whether separate treatment of each condition or a joint model is more appropriate.
The method is illustrated for the specific case of psychometric functions; however, the procedure applies to all models that encompass multiple experimental conditions. Furthermore, it is straightforward to extend the method to models that consist of multiple modules.

Dold, H. M. H., Fründ, I. and Wichmann, F. A. (2011)
How to identify a model for spatial vision?
Computational and Systems Neuroscience (COSYNE), Salt Lake City, UT, USA (poster)

Empirical evidence gathered in experimental work often leads to computational models which help make progress towards a more complete understanding of the phenomenon under study. This increased knowledge in turn enables the design of better subsequent experiments. In the case of psychophysical experiments and models of spatial vision (multi-channel linear / non-linear cascade models), this experiment-modeling cycle resulted in an almost factory-like production of data. Experimental variants ranged from detection and discrimination to summation and adaptation experiments. While the model was certainly productive in terms of experimental output, it is not yet clear to what extent the experimental data really helped to identify the model. We use Markov Chain Monte Carlo sampling to estimate the parameter posterior distributions of an image-driven spatial vision model. Inspection of the posterior distribution allows us to draw conclusions about whether the model is well parametrized, about the applicability of model components, and about the explanatory power of the data. Specifically, we show that Minkowski pooling over channels as a decision stage does not allow all parameters of an upstream static non-linearity to be recovered from typical experimental data. Our results, and the approach in general, are relevant not only for psychophysics but should be applicable to computational models in spatial vision and in the neurosciences more generally.

Fründ, I., Haenel, V. and Wichmann, F. A. (2011)
Leistungsschwankungen in wahrnehmungspsychologischen Verhaltensmessungen
Tagung Experimentell Arbeitender Psychologen, Halle (Saale), FRG (talk)

Large parts of perceptual psychology probe the limits of detection: one seeks the contrast at which a pattern can just barely be seen, or the sound pressure at which a tone can just barely be heard.

In a typical experiment, an observer reports whether or not a stimulus was detected. The stimulus is presented at various intensities (contrasts, sound pressure levels, ...).
The psychometric function is a parametric model that summarises the relationship between stimulus intensity and the probability of detection in such experiments. This summary assumes that the observer's performance remains constant throughout the experiment; performance fluctuations due to learning, attention or changes in response strategy are thus ignored.
In our contribution we show two things:
1. Ignoring performance fluctuations leads to an overestimation of the reliability of the psychometric function.
2. We present a method for estimating the magnitude of performance fluctuations and thus for reporting the reliability of the summary correctly.

Fründ, I., Dold, H. M. H. and Wichmann, F. A. (2011)
Statistical model structure represented by Bayesian priors
Society for Mathematical Psychology Meeting, Somerville, MA, USA (talk)

Bayesian statistics applies the rules of probability to combine experimental evidence with prior knowledge. In practice, however, "prior knowledge" often reflects mathematical convenience rather than real prior knowledge. The approximate equality of parameters across experimental conditions is an example of statistical model structure and a form of prior knowledge.
Previous approaches to Bayesian model fitting used prior distributions typically to describe parameter ranges, while parameter equality was enforced by fitting a sufficiently complex model to all experimental conditions.
The disadvantage is that the approach often produces complex models that are difficult to evaluate.
The challenge is to retain a simple model structure and to include prior knowledge about parameter equality without impairing the model fit significantly. Here we demonstrate how information about the statistical model structure can be recast as Bayesian priors in a collection of isolated, simple models.
We illustrate the method using the psychometric function, a dose rate mixture model that is particularly important in perceptual psychology.
In a first step, a posterior distribution is determined for each experimental condition in isolation.
In a second step, these posterior distributions are combined across conditions to provide informed priors for Bayesian inference.
The Bayesian priors obtained in this way reflect real prior knowledge about the model structure.
This procedure can be interpreted as a single step of a Gibbs sampler in the full model.

Fründ, I., Wichmann, F. A. and Macke, J. H. (2011)
Sequential dependencies in perceptual decisions
European Mathematical Psychology Group Meeting, Paris, F (talk)

In most psychological experiments, observers respond to multiple trials that are presented in a sequence. In perceptual psychology, it is common to assume that these responses are independent of responses on previous trials, as well as of stimuli presented on previous trials. There are, however, multiple reasons to question the ubiquitous assumption of "independent trials": for example, responses in cognitive experiments depend on previous stimuli and responses, and it is unclear why perceptual tasks should be unaffected by such serial dependencies. This observation raises two central questions: First, how strong are trial by trial dependencies in psychophysical experiments? Second, what are statistical methods that would allow us to detect these dependencies, and to deal with them appropriately?
Here, we present a model that allows for quantification of such trial by trial dependencies and apply it to psychophysical data-sets from perceptual decision tasks. Using multiple data-sets from one auditory and two visual experiments as well as simulated data, we show that our model successfully detects trial by trial dependencies if they are present and allows for a statistical assessment of the significance of these dependencies. Although the strength and direction of trial by trial dependencies varied considerably between observers, significant trial by trial dependencies were observed in 6 out of 7 observers. For those observers, model fits improved considerably if trial by trial history was incorporated into the model. The trial by trial dependencies we observed could be well captured by linear superposition of effects from multiple previous responses and stimuli.
We conclude that previous trials and responses influence responses in perceptual tasks, too.

Fründ, I., Wichmann, F. A. and Macke, J. H. (2011)
Sequential dependencies in perceptual decisions
Bernstein Conference, Freiburg, FRG (poster)

In most psychological experiments, observers respond to multiple trials that are presented in a sequence. In perceptual psychology, it is common to assume that these responses are independent of responses on previous trials, as well as of stimuli presented on previous trials. There are, however, multiple reasons to question the ubiquitous assumption of "independent trials": for example, responses in cognitive experiments depend on previous stimuli and responses, and it is unclear why perceptual tasks should be unaffected by such serial dependencies. This observation raises two central questions: First, how strong are trial by trial dependencies in psychophysical experiments? Second, what are statistical methods that would allow us to detect these dependencies, and to deal with them appropriately?
Here, we present a model that allows for quantification of such trial by trial dependencies and apply it to psychophysical data-sets from perceptual decision tasks. Using multiple data-sets from one auditory and two visual experiments as well as simulated data, we show that our model successfully detects trial by trial dependencies if they are present and allows for a statistical assessment of the significance of these dependencies. Although the strength and direction of trial by trial dependencies varied considerably between observers, significant trial by trial dependencies were observed in 6 out of 7 observers. For those observers, model fits improved considerably if trial by trial history was incorporated into the model. The trial by trial dependencies we observed could be well captured by linear superposition of effects from multiple previous responses and stimuli.
We conclude that previous trials and responses influence responses in perceptual tasks, too.

Gerhard, H. E., Wiecki, T., Wichmann, F. A. and Bethge, M. (2011)
Perceptual sensitivity to statistical regularities in natural images
The 9th Göttingen Meeting of the German Neuroscience Society, Göttingen, FRG (poster)

Introduction:
A long-standing hypothesis is that neural representations are adapted to environmental statistical regularities (Attneave 1954, Barlow 1959), yet the relation between the primate visual system's functional properties and the statistical structure of natural images is still unknown. The central problem is that the high-dimensional space of natural images is difficult to model. While many statistical models of small image patches that have been suggested share certain neural response properties with the visual system (Atick 1990, Olshausen & Field 1996, Schwartz & Simoncelli 2001), it is unclear how informative they are about the functional properties of visual perception. Previously, we quantitatively evaluated how well different models capture natural image statistics using average log-loss (e.g. Eichhorn et al., 2009). Here we assess human sensitivity to natural image structure by measuring how discriminable images synthesized by statistical models are from natural images. Our goal is to improve the quantitative description of human sensitivity to natural image regularities and to evaluate various models' relative efficacy in capturing perceptually relevant image structure.

Methods:
We measured human perceptual thresholds to detect statistical deviations from natural images. The task was two-alternative forced choice with feedback. On a trial, two textures were presented side-by-side for 3 seconds: one a tiling of image patches from the van Hateren photograph database, the other of model-synthesized images (Figure 1A). The task was to select the natural image texture. We measured sensitivity at 3 patch sizes (3x3, 4x4, & 5x5 pixels) for 7 models. Five were natural image models: a random filter model capturing only 2nd-order pixel correlations (RND), the independent component analysis model (ICA), a spherically symmetric model (L2S), the Lp-spherical model (LpS), and the mixture of elliptically contoured distributions (MEC) with cluster number varied at 4 levels (k = 2, 4, 8, & 16). For MEC, we also used patch size 8x8. We also tested perceptual sensitivity to independent phase scrambling in the Fourier basis (IPS) and to global phase scrambling (GPS), which preserves all correlations between the phases and between the amplitudes but destroys statistical dependencies between phases and amplitudes. For each type, we presented 30 different textures to 15 naïve subjects (1020 trials/subject).

Results:
Figure 1B shows performance by patch size for each model. Low values indicate better model performance, as the synthesized texture was harder to discriminate from natural. Surprisingly, subjects were significantly above chance in all cases except at patch size 3x3 for MEC. This shows that human observers are highly sensitive to local higher-order correlations, as the models insufficiently reproduced natural image statistics for the visual system. Further, the psychometric functions' ordering parallels the models' average log-loss ordering nicely, particularly within MEC as a function of cluster number, suggesting that the human visual system may have near-perfect knowledge of natural image statistical regularities and that average log-loss is a useful model comparison measure in terms of perceptual relevance. Next, we will determine the features human observers use to discriminate the textures' naturalness, which can help improve statistical modeling of perceptually relevant natural image structure.
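
For readers unfamiliar with the comparison measure used here, the sketch below illustrates average log-loss on held-out patches. A multivariate Gaussian stands in for the models actually evaluated (ICA, L2S, LpS, MEC), and the patch data are random placeholders rather than natural image patches.

# Minimal sketch of the "average log-loss" model-comparison measure: average
# negative log-likelihood per pixel of held-out patches under a fitted density
# model.  A multivariate Gaussian stands in for the actual models; the "patches"
# below are random placeholders.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
patch_size = 4
train = rng.standard_normal((10000, patch_size ** 2))   # placeholder "image patches"
test = rng.standard_normal((2000, patch_size ** 2))

mu = train.mean(axis=0)
cov = np.cov(train, rowvar=False) + 1e-6 * np.eye(train.shape[1])
model = multivariate_normal(mean=mu, cov=cov)

# Average log-loss in bits per pixel (lower = better model of the patch statistics).
avg_log_loss = -model.logpdf(test).mean() / (np.log(2) * patch_size ** 2)
print(f"average log-loss: {avg_log_loss:.3f} bits/pixel")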

Goris, R., Putzeys, T., Wagemans, J. and Wichmann, F. A. (2011)
A neural population model for pattern detection
Computational and Systems Neuroscience (COSYNE), Salt Lake City, UT, USA (poster)

Behavioural pattern detection experiments have greatly advanced our understanding of the computations performed by the early visual system to extract information from the retinal image. Up to now, psychophysical near-threshold measurements have been taken to suggest that observers select the maximum response from a bank of parallel linear visual filters, each sensitive to a specific image resolution, to perform detection. However, spatial-frequency tuned neurons in primary visual cortex are neither linear nor independent, and ample evidence emphasizes that perceptual decisions are mediated by pooling responses of multiple neurons. Why then does the aforementioned model do so well in explaining pattern detection? One possibility is that near-threshold stimuli are too weak to drive the early visual system's nonlinearities and activate only a few sensory neurons. Alternatively, the ability of this theory to account for threshold experiments modelled in isolation belies the fact that its assumptions about pattern detection are inherently wrong. Here, we challenge both a linear channel model (LCM) and a neural population model (NPM) to fit a broad range of well-known and robust psychophysical pattern detection results, using a single set of parameters. In the LCM, psychophysical decisions reflect maximum-output decoding of linear and independent spatial-frequency channels. In the NPM, perceptual choice behaviour is driven by maximum-likelihood decoding of a population of normalized spatial-frequency tuned units resembling V1 neurons. We find that the LCM fails to satisfactorily explain pattern detection. The NPM, on the other hand, can fully account for pattern detectability as investigated in behavioural summation, adaptation and uncertainty experiments. This work thus offers a new theoretical interpretation for the vast psychophysical literature on pattern detection, in which both normalization and maximum-likelihood decoding turn out to be crucial.

Goris, R., Putzeys, T., Wagemans, J. and Wichmann, F. A. (2011)
A neural population model for pattern detection
Vision Sciences Society (VSS), Naples, FL, USA (poster)

Behavioural pattern detection experiments have greatly advanced our understanding of the computations performed by the early visual system to extract information from the retinal image. Up to now, psychophysical near-threshold measurements have been taken to suggest that observers select the maximum response from a bank of parallel linear visual filters, each sensitive to a specific image resolution, to perform detection. However, spatial-frequency tuned neurons in primary visual cortex are neither linear nor independent, and ample evidence emphasizes that perceptual decisions are mediated by pooling responses of multiple neurons. Why then does the aforementioned model do so well in explaining pattern detection? One possibility is that near-threshold stimuli are too weak to drive the early visual system's nonlinearities and activate only a few sensory neurons. Alternatively, the ability of this theory to account for threshold experiments modelled in isolation belies the fact that its assumptions about pattern detection are inherently wrong. Here, we challenge both a linear channel model (LCM) and a neural population model (NPM) to fit a broad range of well-known and robust psychophysical pattern detection results, using a single set of parameters. In the LCM, psychophysical decisions reflect maximum-output decoding of linear and independent spatial-frequency channels. In the NPM, perceptual choice behaviour is driven by maximum-likelihood decoding of a population of normalized spatial-frequency tuned units resembling V1 neurons. We find that the LCM fails to satisfactorily explain pattern detection. The NPM, on the other hand, can fully account for pattern detectability as investigated in behavioural summation, adaptation and uncertainty experiments. This work thus offers a new theoretical interpretation for the vast psychophysical literature on pattern detection, in which both normalization and maximum-likelihood decoding turn out to be crucial.
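
To make the two NPM ingredients concrete, here is a toy sketch of divisive normalization across spatial-frequency tuned units combined with maximum-likelihood (Poisson) decoding in a simulated 2AFC detection trial. The tuning curves, exponents, gains and noise model are arbitrary assumptions for illustration, not the fitted model reported above.

# Toy sketch of the two NPM ingredients named above -- divisive normalization of
# spatial-frequency tuned units and maximum-likelihood decoding -- applied to a
# simulated 2AFC detection trial.  All parameter values are assumptions.
import numpy as np

rng = np.random.default_rng(2)
pref = np.geomspace(0.5, 16, 12)          # preferred spatial frequencies (c/deg)

def mean_rates(freq, contrast, p=2.0, sigma=0.1, gain=30.0):
    """Normalized mean firing rates of the unit population to a grating."""
    drive = contrast * np.exp(-0.5 * (np.log2(pref / freq) / 0.7) ** 2)
    return gain * drive ** p / (sigma ** p + np.sum(drive ** p)) + 1.0   # +1: baseline

def log_lik(counts, rates):
    """Poisson log-likelihood of observed spike counts given mean rates."""
    return np.sum(counts * np.log(rates) - rates)

def trial_2afc(freq=4.0, contrast=0.05):
    """One 2AFC trial: signal in one interval, blank in the other; ML decision."""
    r_sig, r_blank = mean_rates(freq, contrast), mean_rates(freq, 0.0)
    intervals = [rng.poisson(r_sig), rng.poisson(r_blank)]       # signal in interval 0
    lr = [log_lik(c, r_sig) - log_lik(c, r_blank) for c in intervals]
    return int(np.argmax(lr) == 0)                               # 1 = correct choice

pc = np.mean([trial_2afc() for _ in range(2000)])
print(f"proportion correct at 5% contrast: {pc:.2f}")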

Putzeys, T., Bethge, M., Wichmann, F. A., Wagemans, J. and Goris, R. (2011)
A new perceptual bias reveals suboptimal Bayesian decoding of sensory responses
European Conference on Visual Perception (ECVP), Toulouse, F (poster)

Much of our understanding of sensory decoding stems from the comparison of human to ideal observer performance in simple two-alternative discrimination tasks. The optimal Bayesian decoding strategy consists of integrating noisy neural responses into a reliable function that captures the likelihood of specific stimuli being present. As only two stimulus values are relevant in a two-alternative discrimination task, the likelihood function has to be read out at two precise locations to obtain a likelihood ratio. Here, we report a new perceptual bias suggesting that human observers make use of a less optimal likelihood read-out strategy when discriminating grating spatial frequencies. Making use of spectrally filtered noise, we induce an asymmetry in the stimulus frequency likelihood function. We find that perceived grating frequency is significantly altered by this manipulation, indicating that the likelihood function was sampled with remarkably low precision. Although observers are provided with prior knowledge of the two relevant grating frequencies on each trial, they evaluate the likelihood of a broad range of irrelevant frequencies. Overall, our results suggest that humans perform estimation of a stimulus variable of unknown quantity rather than evaluation of two known alternatives when discriminating grating spatial frequencies.

Schönfelder, V. H. and Wichmann, F. A. (2011)
Extracting Auditory Cues in Tone-in-Noise detection with a Sparse Feature Selection Algorithm
Bernstein Conference, Freiburg, FRG (poster)

Introduction: As a classical paradigm in auditory psychophysics, Tone-in-Noise (TiN) detection still presents a challenge regarding the question of which auditory cues human observers use to detect the signal tone (Fletcher, 1938). For narrow-band noise, no conclusive answer has been given as to which stimulus features explain observer behavior on a trial-by-trial level (Davidson, 2009). In the present study, a large behavioral data set for TiN detection was analyzed with a modern machine learning algorithm, L1-regularized logistic regression (Tibshirani, 1996). By enforcing sparse solutions, this method serves as a feature selection technique, allowing identification of the set of features that is critical to explaining observer behavior (Schönfelder and Wichmann, 2011).

Methods: An extensive data set (>20'000 trials/observer) was collected with six naïve observers performing TiN detection in a yes/no paradigm. Stimuli were short (200 ms) sound bursts consisting of a narrow-band Gaussian noise masker (100 Hz) centred around a signal tone (500 Hz). Data were collected in blocks with fixed signal-to-noise ratios (SNRs) at four levels along the slope of the psychometric function. Data on response consistency were also collected, estimated from responses to pairs of similar stimuli and serving as a measure of the reproducibility of single-trial decisions. Subsequently, linear observer models were fit to the data with L1-regularized logistic regression, for each observer and each SNR separately. The set of features used during data fitting consisted of three components: energy, sound spectrum and envelope spectrum, with each component comprising one (energy) or multiple (spectra) scalar entries characterizing the presented sound.

Results: In terms of the psychometric function, observers could hardly be distinguished; only one – a trained musician – had a significantly lower threshold than the rest. Nevertheless, the analysis of perceptual features revealed two groups of subjects using different combinations of auditory cues, as already observed by Richards (1993). Energy alone, as suggested by Green and Swets (1966), was not sufficient to explain responses, nor was the shape of the envelope spectrum, as proposed by Dau (1996). Instead, most observers relied predominantly on a mixture of sound energy and asymmetric spectral filters, with a peak frequency centered above the signal tone and a negative lobe below. These filters may correspond to off-frequency listening effects or result from the asymmetry of the auditory filters. The results suggest that observers relied on multiple detectors instead of one single feature in this task. Differences in detection strategy across different SNRs were not observed. In general, observers showed poor consistency in their responses, in particular for low SNRs. Nevertheless, single-trial predictions from the extracted observer models were reliable within the boundaries dictated by response consistency (Neri, 2006).
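
A minimal sketch of the core analysis tool named above, L1-regularized logistic regression used for sparse feature selection, follows. The feature matrix and the simulated "observer" are placeholders, not the collected data or the exact fitting pipeline.

# Minimal sketch: L1-regularized logistic regression as a sparse feature-selection
# tool for trial-by-trial responses.  Features and the simulated observer below
# are placeholders, not the behavioral data described above.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n_trials, n_features = 5000, 40             # e.g. energy + spectral + envelope bins
X = rng.standard_normal((n_trials, n_features))

# Hypothetical observer: decisions driven by feature 0 (energy) and feature 7 only.
true_w = np.zeros(n_features)
true_w[[0, 7]] = [1.0, -0.8]
p = 1.0 / (1.0 + np.exp(-(X @ true_w)))
y = rng.random(n_trials) < p

# The L1 penalty drives irrelevant weights to exactly zero; C controls sparsity.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X, y)
selected = np.flatnonzero(clf.coef_[0])
print("features with non-zero weight:", selected)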

Schönfelder, V. H. and Wichmann, F. A. (2011)
Peering into the Black Box — Using sparse feature selection to identify critical stimulus properties in audition
Berlin Brain Days, Berlin, FRG (talk)

In order to understand and predict human behaviour in perceptual tasks, we need to learn which stimulus features are critical to the observer's decisions. As a particular instance of this general question, we investigated a classical task in auditory psychophysics, Tone-in-Noise detection. Understanding the perceptual mechanisms behind this task is of general interest to our basic understanding of auditory processing. A number of sound features have been proposed that human listeners use to detect a pure tone masked by noise. So far, however, no conclusive answer has been given as to which features are critical to explain behaviour on a trial-by-trial level. We collected a large data set for Tone-in-Noise detection with six naive listeners. Relying on sparse feature selection algorithms developed in machine learning, we fit an observer model that combines a set of sound features as well as the history of previous responses. Listeners could hardly be discriminated in terms of psychometric performance averaged across trials. Nevertheless, we observed substantial differences regarding the features that explain single-trial response behaviour. For all observers, only a mixture of multiple sound features — including energy, spectral fine structure and envelope spectrum — could account for individual decisions. In addition, for three listeners, responses in previous trials played a significant role for the current decision. When contrasting trials with slow and fast reaction times, we also found differences in feature weighting within individuals. In conclusion, only a mixture of different stimulus features combined with the history of previous responses was generally sufficient to explain trial-by-trial behaviour of individual listeners in the present task.

Dold, H. M. H., Dähne, S. and Wichmann, F. A. (2010)
Effects of arbitrary structural choices on the parameters of early spatial vision models
Vision Sciences Society (VSS), Naples, FL, USA (poster)

Typical models of early spatial vision are based on a common, generic structure: First, the input image is processed by multiple spatial frequency and orientation selective filters. Thereafter, the output of each filter is non-linearly transformed, either by a non-linear transducer function or, more recently, by a divisive contrast-gain control mechanism. In a third stage noise is injected and, finally, the results are combined to form a decision. Ideally, this decision is consistent with experimental data (Legge and Foley, 1980 Journal of the Optical Society of America 70(12) 1458-1471; Watson and Ahumada, 2005 Journal of Vision 5 717-740).

Often a Gabor filter bank with fixed frequency and orientation spacing forms the first processing stage. These Gabor filters, or Gaussian derivative filters with suitably chosen parameters, both visually resemble simple cells in visual cortex. However, model predictions obtained with either of those two filter banks can deviate substantially (Hesse and Georgeson, 2005 Vision Research 45 507-525). Thus, the choice of filter bank potentially influences the fitted parameters of the non-linear transduction/gain-control stage as well as the decision stage. This may be problematic: In the transduction stage, for example, the exponent of a Naka-Rushton type transducer function is interpreted to correspond to different mechanisms, e.g. a mechanism based on stimulus energy if it is around two. Here we systematically examine the influence of arbitrary choices regarding filter bank properties (the filter form, number and additional parameters) on the psychophysically interesting parameters at subsequent stages of early spatial vision models. We reimplemented different models within a Python modeling framework and report the modeling results using the ModelFest data (Carney et al., 1999, DOI:10.1117/12.348473).
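
To illustrate the generic front end under discussion (a filter bank followed by a nonlinear transducer), here is a minimal sketch; the filter spacing, transducer parameters and test image are arbitrary choices of exactly the kind the abstract examines, not the reimplemented models themselves.

# Minimal sketch of a generic early-vision front end: a small Gabor filter bank
# followed by a Naka-Rushton transducer.  All parameter choices here are
# arbitrary illustrations, not fitted model values.
import numpy as np
from scipy.signal import fftconvolve

def gabor(size=31, freq=0.15, theta=0.0, sigma=6.0):
    """Odd-symmetric Gabor patch (frequency in cycles/pixel, orientation in radians)."""
    r = np.arange(size) - size // 2
    x, y = np.meshgrid(r, r)
    xr = x * np.cos(theta) + y * np.sin(theta)
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.sin(2 * np.pi * freq * xr)

def naka_rushton(resp, exponent=2.0, c50=0.1):
    """Naka-Rushton transducer applied to rectified filter outputs."""
    r = np.abs(resp) ** exponent
    return r / (r + c50 ** exponent)

rng = np.random.default_rng(4)
image = rng.standard_normal((64, 64))                  # placeholder input image

bank = [gabor(freq=f, theta=t)                         # 3 frequencies x 4 orientations
        for f in (0.05, 0.1, 0.2) for t in np.linspace(0, np.pi, 4, endpoint=False)]
outputs = [naka_rushton(fftconvolve(image, k, mode="same")) for k in bank]
print(len(outputs), outputs[0].shape)                  # 12 transformed response maps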

Dold, H. M. H., Fründ, I. and Wichmann, F. A. (2010)
Bayesian estimation of shared parameters
Berlin Brain Days, Berlin, FRG (talk)

Experimental data are typically noisy, and one aspect of fitting a model to one’s data is to parametrically describe the data and to see how different experimental conditions affect the model parameters. When comparing results from different experimental conditions we expect some model parameters to vary, but others to remain more or less constant. Sometimes we would even like to enforce that some parameters stay constant; in the following we term parameters that stay constant across experimental conditions “shared parameters.”
In Bayesian estimation, each parameter is associated with a prior distribution and a posterior distribution, which describe the probability of a given parameter value before and after observing the data, respectively. For shared parameters it is required that their posterior distributions under the different experimental conditions are highly overlapping. One solution to achieve the posterior distribution overlap is to choose suitable prior distributions. The key challenge is to find such priors without impairing the model fit significantly.
Here we present a two-step technique to estimate suitable prior distributions under a concurrency constraint and illustrate our technique using psychometric functions. The posterior distributions obtained through the estimation procedure are highly overlapping and follow an analytically derived distribution. The goodness-of-fit remains comparable to the goodness-of-fit obtained from estimation without the concurrency assumption, if the shared parameters actually originate from the same distribution. The proposed prior distributions are thus superior to so-called uninformative priors, but are chosen in a well-defined way and do not reflect the scientist's prejudices or assumptions.
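
For readers who want the notion of a shared parameter in concrete form, here is a simpler illustration than the two-step Bayesian procedure described above: a joint maximum-likelihood fit of two psychometric functions with separate thresholds but one hard-shared slope. The data, stimulus levels and logistic form are assumptions made for the sketch.

# Simpler illustration of a "shared parameter" (not the authors' two-step Bayesian
# procedure): two psychometric functions fitted jointly by maximum likelihood,
# with separate thresholds but one shared slope.  Data are simulated.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

def psi(x, threshold, slope):
    return 0.5 + 0.5 / (1.0 + np.exp(-slope * (x - threshold)))   # 2AFC logistic

rng = np.random.default_rng(5)
x = np.array([0.02, 0.04, 0.08, 0.16, 0.32])
n = 40                                                   # trials per block
k1 = rng.binomial(n, psi(x, 0.06, 30.0))                 # condition 1
k2 = rng.binomial(n, psi(x, 0.12, 30.0))                 # condition 2, same slope

def neg_log_lik(params):
    th1, th2, slope = params
    ll = binom.logpmf(k1, n, psi(x, th1, slope)).sum()
    ll += binom.logpmf(k2, n, psi(x, th2, slope)).sum()
    return -ll

fit = minimize(neg_log_lik, x0=[0.05, 0.1, 20.0], method="Nelder-Mead")
print("thresholds and shared slope:", fit.x)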

Fründ, I., Haenel, V. and Wichmann, F. A. (2010)
Estimating psychometric functions in nonstationary observers
Vision Sciences Society (VSS), Naples, FL, USA (poster)

The psychometric function relates a physical dimension, such as stimulus contrast, to the responses of an observer.
This relation is conveniently summarized by fitting a parametric model to the responses.
In fitting such a model, we typically assume the responses to be independent of each other and that responses at the same stimulus level follow the same distribution.
However, there is evidence that casts doubt on the validity of this independence assumption: responses in psychophysical tasks are mutually dependent due to factors such as learning, fatigue, or fluctuating motivation.
These kinds of dependencies are summarized as nonstationary behavior.
From a theoretical point of view, nonstationarity renders inference about psychometric functions incorrect -- it can result in rejection of otherwise correct psychometric functions or wrong credible intervals for thresholds and other characteristics of the psychometric function.
However, it is not known how severe these errors are and how to properly correct for them.
We simulated a number of observers with different types of nonstationary behavior.
Psychometric functions were fitted for a large number of experimental settings, defined by the number of trials, the number of experimental blocks, and the task (2AFC vs yes-no).
We present criteria to identify psychometric functions that are influenced by nonstationarity, and develop strategies that can be applied in different statistical paradigms -- frequentist and Bayesian -- to correct for errors introduced by nonstationary behavior.
Software that automates the proposed procedures will be made available.

Fründ, I., Haenel, V. and Wichmann, F. A. (2010)
Estimating psychometric functions in nonstationary observers
European Mathematical Psychology Group Meeting (EMPG), Jyväskylä, FIN (talk)

The psychometric function relates a physical dimension, such as stimulus contrast, to the responses of an observer.
This relation is conveniently summarized by fitting a parametric model to the responses.
In fitting such a model, we typically assume the responses to be independent of each other and that responses at the same stimulus level follow the same distribution.
However, there is evidence that casts doubt on the validity of this independence assumption: responses in psychophysical tasks are mutually dependent due to factors such as learning, fatigue, or fluctuating motivation.
These kinds of dependencies are summarized as nonstationary behavior.
From a theoretical point of view, nonstationarity renders inference about psychometric functions incorrect -- it can result in rejection of otherwise correct psychometric functions or wrong credible intervals for thresholds and other characteristics of the psychometric function.
However, it is not known how severe these errors are and how to properly correct for them.
We simulated a number of observers with different types of nonstationary behavior.
Psychometric functions were fitted for a large number of experimental settings, defined by the number of trials, the number of experimental blocks, and the task (2AFC vs yes-no).
In general, nonstationary behavior resulted in severely underestimated credible intervals.
We present criteria to identify psychometric functions that are influenced by nonstationarity, and develop strategies that can be applied in different statistical paradigms --- frequentist and Bayesian --- to correct for errors introduced by nonstationary behavior.
Software that automates the proposed procedures is available.
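
As an illustration of the kind of diagnostic at stake (not necessarily one of the criteria proposed above), the sketch below simulates an observer whose threshold drifts across blocks, fits a single stationary psychometric function, and asks via Monte Carlo whether the binomial deviance is larger than expected under stationarity.

# Sketch of a possible nonstationarity check (an illustration, not the proposed
# criteria): fit a stationary psychometric function to data from a drifting
# observer and compare the binomial deviance to its Monte Carlo distribution.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

def psi(x, th, sl):
    return 0.5 + 0.5 / (1.0 + np.exp(-sl * (x - th)))       # 2AFC logistic

def fit(x, k, n):
    nll = lambda p: -binom.logpmf(k, n, psi(x, *p)).sum()
    return minimize(nll, x0=[0.1, 20.0], method="Nelder-Mead").x

def deviance(x, k, n, p):
    phat, mod = k / n, psi(x, *p)
    with np.errstate(divide="ignore", invalid="ignore"):
        t1 = np.where(k > 0, k * np.log(phat / mod), 0.0)
        t2 = np.where(k < n, (n - k) * np.log((1 - phat) / (1 - mod)), 0.0)
    return 2 * (t1 + t2).sum()

rng = np.random.default_rng(6)
x = np.tile([0.05, 0.1, 0.2, 0.4], 5)                       # 20 blocks of 20 trials
n = 20
drift = np.linspace(0.08, 0.2, x.size)                      # threshold drifts over time
k = rng.binomial(n, psi(x, drift, 20.0))

p_hat = fit(x, k, n)
d_obs = deviance(x, k, n, p_hat)
d_mc = [deviance(x, rng.binomial(n, psi(x, *p_hat)), n, p_hat) for _ in range(500)]
print(f"observed deviance {d_obs:.1f}, Monte Carlo 95th percentile {np.percentile(d_mc, 95):.1f}")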

Goris, R., Putzeys, T., Wagemans, J. and Wichmann, F. A. (2010)
Neural population code model for pattern detection
Research in Encoding And Decoding of Neural Ensembles (AREADNE), Santorini, GR (poster)

The visual system initially encodes the retinal image in terms of its "basic" constituents. Subsequently these basic constituents are used to construct the complex visual percepts that allow us to perform visual tasks. This process is referred to as decoding. A multitude of psychophysical detection threshold measurements has previously been taken as evidence that visual encoding is performed by independent, linear, spatial-frequency tuned detectors, so-called channels. In this view, the decoding process in detection tasks is formalized simply as selecting the output of the maximally responsive channel.
However, the traditional channel model is at odds with recent neurophysiological findings: spatial-frequency tuned neurons in primary visual cortex are neither linear nor independent, due to squaring and gain-control mechanisms. Furthermore, ample evidence emphasizes that behavioural performance in perceptual tasks is mediated by pooling responses of multiple neurons, rather than only relying on the most responsive neuron.
Here we show that the crucial psychophysical findings that have led to the linear independent channel model can be explained even better by a population code model consisting of a neurophysiologically inspired encoding front-end, followed by a population-decoding stage that approximates optimal Bayesian decoding. We simulated V1 population responses using the normalization model of simple cells and found that a simple combination rule successfully predicts square wave detectability, summation of far-apart frequencies, as well as the complex changes in contrast sensitivity following pattern adaptation. Intriguingly, the statistical characteristics of the discharge of cortical neurons allow this near-optimal readout rule to be computed in a bottom-up way. This can in addition explain the remarkable resistance to stimulus uncertainty displayed by human observers performing contrast detection. Importantly, thus, all the data hitherto believed to imply linear, independent psychophysical channels can be linked to well-understood and simple physiological nonlinearities using statistical decision theory as the bridge.

Maertens, M. and Wichmann, F. A. (2010)
On the relationship between luminance increment thresholds and apparent brightness
Vision Sciences Society (VSS), Naples, FL, USA (poster)

It has long been known that the just noticeable difference (JND) between two stimulus intensities increases in proportion to the background intensity (Weber's law). It is less clear, however, whether the JND is a function of the physical or the apparent stimulus intensity. In many situations, especially in the laboratory using simple stimuli such as uniform patches or sinusoidal gratings, physical and perceived intensity coincide. Reports that tried to disentangle the two factors yielded inconsistent results (e.g. Heinemann, 1961 Journal of Experimental Psychology 61 389-399; Cornsweet and Teller, 1965 Journal of the Optical Society of America 55(10) 1303-1308; Henning, Millar and Hill, 2000 Journal of the Optical Society of America 17(7) 1147-1159; Hillis and Brainard, 2007 Current Biology 17 1714-1719). A necessary condition for estimating the potential effect of appearance on JNDs is to quantify the difference between physical and apparent intensity in units of physical intensity, because only that allows one to predict the expected JNDs. In the present experiments we utilized a version of the Craik-O'Brien-Cornsweet stimulus (Purves, Shimpi and Lotto, 1999 Journal of Neuroscience 19 8542-8551) to study the relationship between JNDs and apparent brightness. We quantitatively assessed apparent brightness using a paired comparison procedure related to maximum-likelihood difference scaling (Maloney and Yang, 2003 Journal of Vision 3(8) 573-585), in which observers compared the perceptual difference between two pairs of surface intensities. Using the exact same stimulus arrangement, that is, two pairs of surfaces, we asked observers to detect a luminance increment in a standard spatial 4-alternative forced-choice (4-AFC) task.

Schönfelder, V. H. and Wichmann, F. A. (2010)
Machine Learning in Auditory Psychophysics: System Identification with Sparse Pattern Classifiers
10th Biannual Conference of the German Society for Cognitive Science (KogWis), Potsdam, FRG (talk)

The identification of critical features (cues) of the input stimulus on which observers base their decisions is a main objective in psychophysics. In auditory experiments, a number of cues are typically available. For direct cue identification, multiple regression analysis has been established, exploiting correlations between features and responses [Ahumada and Lovell, 1971]. This method is prone to emphasise non-critical features, however, when they correlate with critical cues. Recent methods from Machine Learning, in particular pattern classifiers, provide a powerful alternative for quantitatively modelling behaviour [Macke and Wichmann, 2010]. We propose a general approach applicable to a broad class of auditory tasks: Using the outcome of psychophysical experiments, i.e., stimuli as input and subject decisions as output, we train pattern classifiers in order to mimic observer responses. When algorithm and observer show similar behaviour, we presume the underlying decision mechanism and employed cues also to be similar. Here, we focus on the classical paradigm of Tone-in-Noise (TiN) detection by H. Fletcher. As yet, it has not been conclusively demonstrated which cues observers rely on to solve this task [Davidson et al., 2009]. In simulations, we show that both a linear Support Vector Machine and Logistic Regression with sparse regularisation can explicitly identify different observer strategies, across a wide range of psychometric performances and even for noisy observers. In contrast to multiple regression, the reconstruction of employed cues is mostly unaffected by correlated features. We then analyse a massive data set collected with naive observers performing TiN detection in a yes/no paradigm. Employing classically proposed and newly established feature sets, we investigate observer cues and cue-switching strategies and demonstrate how psychophysical measures, such as response times and sensitivity, are incorporated into our statistical analysis.

Macke, J. H. and Wichmann, F. A. (2009)
Estimating Critical Stimulus Features from Psychophysical Data: The Decision-Image Technique Applied to Human Faces
Vision Sciences Society (VSS), Naples, FL, USA (poster)

One of the main challenges in the sensory sciences is to identify the stimulus features on which the sensory systems base their computations: they are a prerequisite for computational models of perception. We describe a technique—decision-images—for extracting critical stimulus features based on logistic regression. Rather than embedding the stimuli in noise, as is done in classification image analysis, we want to infer the important features directly from physically heterogeneous stimuli. A decision-image not only defines the critical region-of-interest within a stimulus but is a quantitative template which defines a direction in stimulus space. Decision-images thus enable the development of predictive models, as well as the generation of optimized stimuli for subsequent psychophysical investigations. Here we describe our method and apply it to data from a human face discrimination experiment. We show that decision-images are able to predict human responses not only in terms of overall percent correct but are able to predict, for individual observers, the probabilities with which individual faces are (mis-)classified. We then test the predictions of the models using optimized stimuli. Finally, we discuss possible generalizations of the approach and its relationships with other models.
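
A minimal sketch of the decision-image idea: logistic regression on the stimuli themselves, with the fitted weight vector reshaped into an image-sized template. The stimuli, the simulated observer and the regularization setting below are placeholders, not the face data or the fitted models reported here.

# Minimal sketch of a decision-image analysis: logistic regression on vectorized
# stimuli; the weight vector, reshaped to image size, is the "decision image".
# Stimuli and the simulated observer are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
h, w, n_stim = 32, 32, 800
stimuli = rng.standard_normal((n_stim, h * w))        # each row: one vectorized image

# Hypothetical observer whose decisions depend on a smooth spatial template.
yy, xx = np.mgrid[0:h, 0:w]
template = np.exp(-((xx - w / 2) ** 2 + (yy - h / 2) ** 2) / 50.0).ravel()
responses = rng.random(n_stim) < 1 / (1 + np.exp(-stimuli @ template))

clf = LogisticRegression(C=0.1, max_iter=2000).fit(stimuli, responses)
decision_image = clf.coef_.reshape(h, w)              # template defining a direction in stimulus space
pred = clf.predict_proba(stimuli)[:, 1]               # per-stimulus (mis)classification probabilities
print("correlation with generating template:",
      np.corrcoef(decision_image.ravel(), template)[0, 1].round(2))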

Macke, J. H. and Wichmann, F. A. (2009)
Estimating Critical Stimulus Features from Psychophysical Data: The Decision-Image Technique Applied to Human Faces
Vision Sciences Society (VSS), Symposium: Modern Approaches to Modeling Visual Data, Naples, FL, USA (invited talk)

One of the main challenges in the sensory sciences is to identify the stimulus features on which the sensory systems base their computations: they are a prerequisite for computational models of perception. We describe a technique—decision-images—for extracting critical stimulus features based on logistic regression. Rather than embedding the stimuli in noise, as is done in classification image analysis, we want to infer the important features directly from physically heterogeneous stimuli. A decision-image not only defines the critical region-of-interest within a stimulus but is a quantitative template which defines a direction in stimulus space. Decision-images thus enable the development of predictive models, as well as the generation of optimized stimuli for subsequent psychophysical investigations. Here we describe our method and apply it to data from a human face discrimination experiment. We show that decision-images are able to predict human responses not only in terms of overall percent correct but are able to predict, for individual observers, the probabilities with which individual faces are (mis-)classified. We then test the predictions of the models using optimized stimuli. Finally, we discuss possible generalizations of the approach and its relationships with other models.

Macke, J. H. and Wichmann, F. A. (2009)
Predicting psychophysical responses from stimulus features: A statistical evaluation of human gender categorization models
Vision Sciences Society (VSS), Naples, FL, USA (poster)

One of the main challenges in visual psychophysics is to identify the stimulus features on which the visual system bases its computations: they are a prerequisite for computational models of perception. Here, we use logistic regression for extracting critical stimulus features and predicting the responses of observers in psychophysical experiments. Rather than embedding the stimuli in noise, as is done in classification-image analysis, we infer the important features directly from physically heterogeneous stimuli. Using this approach, which we call 'decision-image analysis', we predict the decisions of observers performing a gender-classification task with human faces as stimuli. Our decision-image models are able to predict human responses not only in terms of overall percent-correct, but predict, for individual observers, the probabilities with which individual faces are (mis-)classified. Comparing the prediction performance of different models can be used to rigorously rule out some seemingly plausible models of human classification performance: We show that a simple prototype classifier, popular in so-called "norm-based" models of face perception, is inadequate for predicting human responses. In contrast, an optimised generalised linear model can predict responses with remarkable accuracy. While this predictor is based on a single linear filter, this filter is not aligned with the first principal component of the stimulus set, in contrast to what has been proposed by proponents of "eigenface-based" models. In addition, we show how decision-images can be used to design optimised, maximally discriminative stimuli, which we use to test the predictions of our models. Finally, the performance of our model is correlated with the reaction times (RTs) of observers on individual stimuli: responses with short RTs are more predictable than others, consistent with the notion that short RTs may reflect earlier, more perceptual decisions modelled well by our decision-images, whereas longer RTs may be indicative of a larger cognitive or top-down component.

Schönfelder, V. H. and Wichmann, F. A. (2009)
Machine Learning in Auditory Psychophysics: System Identification beyond Regression Analysis
ITD Processing, Frauenwörth, FRG (poster)

The identification of the critical aspects ("cues") of the input stimulus on which observers base their decisions represents a main objective in psychophysics. Specifically in auditory experiments, often even for the simplest tasks, a number of cues are available to the observers, from sound energy to spectral variations over time. In general, comparing subject performance with ideal observers of specific cues is not sufficient to determine the critical features, especially when subjects employ a combination of cues or switch cues depending on their reliability. For direct cue identification, multiple regression analysis has been established that exploits correlations between specific features and behavioural responses. This method is however limited to the linear case and is prone to emphasise non-critical features when they correlate with critical features, as is often the case with auditory stimuli. Recently, statistical algorithms from Machine Learning have come to substantially assist psychophysics in quantitatively modelling and explaining behaviour. Specifically, methods of pattern identification provide a very powerful and flexible alternative to classical regression analysis. First, regularisation during classifier training gradually eliminates non-critical features. Second, straightforward non-linear extensions of simple linear classifiers are able to capture highly complex relations between stimulus and behaviour. We propose a general approach that can be applied to a very broad class of auditory tasks: Using the outcome of psychophysical experiments, i.e. the sound stimuli as input and subject decisions as output, we train pattern classification algorithms in order to mimic observer responses. When algorithm and observer show a similar behaviour, we presume that the underlying decision mechanism and employed cues may also be similar. In subsequent psychophysical tests, this presumption will be directly tested. Here, we focus on the classical paradigm of tone detection in centered narrow-band noise (TiN), established by Fletcher 70 years ago. Despite a long history of research, no conclusive answer has yet been provided to the question of which cues observers rely on to solve this task. In preliminary simulations, we show that a linear Support Vector Machine (SVM) can indeed clearly discriminate between different observer strategies. Compared to classical linear regression, the reconstruction of employed cues is much more specific and less biased by correlations between different features. Our simulations also provide an estimate of the amount of psychophysical data required for reliable analysis and interpretation. Finally, the method was extended to non-linear classifiers, increasing flexibility and thus sparing signal pre-processing, but complicating feature identification.

Schönfelder, V. H. and Wichmann, F. A. (2009)
Machine Learning in Auditory Psychophysics: System Identification beyond Regression Analysis
Berlin Brain Days, Berlin, FRG (poster)

The identification of the critical aspects ("cues") of the input stimulus on which observers base their decisions represents a main objective in psychophysics. Specifically in auditory experiments, a number of cues are available to the observers. In general, comparing subject performance with ideal observers is not sufficient to determine the critical features, especially when subjects employ a combination of cues or switch cues. Recently, statistical algorithms from Machine Learning have come to substantially assist psychophysics in quantitatively modelling and explaining behaviour, providing a very powerful and flexible alternative to classical regression analysis. We propose a general approach that can be applied to a very broad class of auditory tasks: Using the outcome of psychophysical experiments, i.e. the sound stimuli as input and subject decisions as output, we train classification algorithms in order to mimic observer responses. When algorithm and observer show a similar behaviour, we presume that the underlying decision mechanism may also be similar. Subsequently, this presumption will be directly tested. Here, we focus on the classical paradigm of tone detection in centered narrow-band noise. Despite a long history of research, no conclusive answer has yet been provided to the question of which cues observers rely on to solve this task. While experimental data is still being collected, we show in preliminary simulations that a linear Support Vector Machine and Logistic Regression can indeed clearly discriminate between different observer strategies. Compared to classical linear regression, the reconstruction of employed cues is much more accurate and less biased by correlations between different features. Our simulations also provide an estimate of the amount of psychophysical data required for reliable analysis and interpretation.

Wichmann, F. A. and Henning, G. B. (2009)
Spatial-frequency tuning develops over time
Vision Sciences Society (VSS), Naples, FL, USA (poster)

Recent neurophysiological observations on the development of orientation and spatial-frequency tuning in the primary visual cortex are equivocal—some studies report that tuning is virtually complete as soon as it can be measured while others report significant sharpening of both aspects of tuning. The issue is important because it bears on the mechanisms that produce the tuning: is tuning acquired simply from careful combination of the weakly tuned lower mechanisms in the midbrain (LGN) or is intra-cortical or even cortico-thalamic interaction required to produce the much tighter tuning seen in cortical cells? Here we provide unequivocal behavioural evidence derived from psychophysical, 2-AFC critical band-masking experiments with human observers, showing that spatial-frequency tuning develops over the first tens of milliseconds. We discuss the implications for the implementation of the characteristics we measure.

Wichmann, F. A. and Henning, G. B. (2009)
Spatial-frequency tuning develops over time
ARVO Annual Meeting, Fort Lauderdale, FL, USA (poster)

Recent neurophysiological observations on the development of orientation and spatial-frequency tuning in the primary visual cortex are equivocal—some studies report that tuning is virtually complete as soon as it can be measured while others report significant sharpening of both aspects of tuning. The issue is important because it bears on the mechanisms that produce the tuning: is tuning acquired simply from careful combination of the weakly tuned lower mechanisms in the midbrain (LGN) or is intra-cortical or even cortico-thalamic interaction required to produce the much tighter tuning seen in cortical cells? Here we provide unequivocal behavioural evidence derived from psychophysical, 2-AFC critical band-masking experiments with human observers, showing that spatial-frequency tuning develops over the first tens of milliseconds. We discuss the implications for the implementation of the characteristics we measure.

Wichmann, F. A., Kienzle, W., Schölkopf, B. and Franz, M. (2009)
Non-linear System Identification: Visual Saliency Inferred from Eye-Movement Data
Vision Sciences Society (VSS), Symposium: Modern Approaches to Modeling Visual Data, Naples, FL, USA (invited talk)

Humans perceive the world by directing the center of gaze from one location to another via rapid eye movements, called saccades. In the period between saccades the direction of gaze is held fixed for a few hundred milliseconds (fixations). It is primarily during fixations that information enters the visual system. Remarkably, however, after only a few fixations we perceive a coherent, high-resolution scene despite the visual acuity of the eye quickly decreasing away from the center of gaze: This suggests an effective strategy for selecting saccade targets.
Top-down effects, such as the observer's task, thoughts, or intentions, have an effect on saccadic selection. Equally well known is that bottom-up effects (local image structure) influence saccade targeting regardless of top-down effects. However, the question of what the most salient visual features are is still under debate. Here we model the relationship between spatial intensity patterns in natural images and the response of the saccadic system using tools from machine learning. This allows us to identify the most salient image patterns that guide the bottom-up component of the saccadic selection system, which we refer to as perceptive fields. We show that center-surround patterns emerge as the optimal solution to the problem of predicting saccade targets. Using a novel nonlinear system identification technique we reduce our learned classifier to a one-layer feed-forward network which is surprisingly simple compared to previously suggested models assuming more complex computations such as multi-scale processing, oriented filters and lateral inhibition. Nevertheless, our model is equally predictive and generalizes better to novel image sets. Furthermore, our findings are consistent with neurophysiological hardware in the superior colliculus. Bottom-up visual saliency may thus not be computed cortically as has been thought previously.

Drewes, J., Hübner, G., Wichmann, F. A., Gegenfurtner, K. R. (2008)
How natural are natural images?
European Conference on Visual Perception (ECVP), Utrecht, NL (poster)

The global amplitude spectrum allows for surprisingly accurate classification of natural scenes, particularly animal vs non-animal. However, humans evidently do not utilize this information in classification tasks. In a new approach, we represent the images of the Corel Stock Photo library (CSPL) by means of frequency, orientation, and location. Achieving 78% classification accuracy, we discovered an apparently photographically induced artifact in the animal images of the CSPL: the consistent use of depth of field causes upper image regions to be out of focus, while the image centre is always well focused. This affects the distribution of high-frequency energy within an image, explaining why simple classifiers can reach relatively high classification accuracy. It does not, however, correlate with human performance. Comparing the CSPL to the Tübingen natural image database (TNID), we found the TNID to be generally more difficult for human and algorithmic classification, yet also affected by the photographic artifact. These results show a strong effect in a popular image database, greatly affecting algorithmic classification while barely affecting human performance.

Schönfelder, V. H. and Wichmann, F. A. (2008)
Machine Learning and Psychophysics: Unveiling Tone-in-Noise detection
Bernstein Conference, München, FRG (poster)

Detecting pure tones in bands of noise is the basis for the study of many complex skills in audition. Since Fletcher's first experiments in 1938, "Tone-in-Noise detection" has represented a standard paradigm in auditory psychophysics for characterizing human performance in this task. Still, no conclusive answer has been given as to exactly which cues observers rely on, particularly when tones are presented in narrow bands of noise.

Following a data-driven approach to examine a broad class of possible features, we will train pattern recognition and regression algorithms on sound stimuli and the corresponding observer responses. When the derived classifier matches human performance and the classifier's decision function has been extracted, the corresponding features are interpreted as candidate cues for human perception — an assumption which can be directly verified in subsequent psychophysical tests.

Schönfelder, V. H. and Wichmann, F. A. (2008)
Machine Learning and Auditory Psychophysics: Unveiling Tone-in-Noise detection
Berlin Brain Days, Berlin, FRG (talk)

A central capacity in human hearing is frequency analysis, to separate and identify individual components of a complex sound. An understanding of this ability represents the foundation for the study of most complex auditory tasks, such as the decomposition of sound streams into musical instruments in a concert hall. About 70 years ago, H. Fletcher established a classical psychophysical paradigm to characterise the spectral resolution of the human auditory system: "Tone-in-Noise detection" (TiN). Despite a long history of research, however, no conclusive answer has yet been provided to the question of how observers detect pure tones centered in narrow-band noise.

Commonly, a small set of features, such as total energy, fluctuations in envelope or regularity of zero-crossings, are considered as cues observers might rely on and are tested directly in 2-interval forced-choice (2IFC) tasks, with one interval containing noise alone and the other additionally containing a signal tone. Instead of following a modeling approach relying on such a small set of stimulus features, we intend to investigate a larger and more general class of cues. Recently, statistical algorithms from Machine Learning have come to assist psychophysics in quantitatively modelling and explaining behaviour. Using the outcome of 2IFC-TiN detection experiments, i.e. preprocessed sound stimuli as input and observer decisions as output, we train pattern classification algorithms in order to mimic observer responses. When algorithm and observer show a similar behaviour, we presume that the underlying decision mechanism and employed cues may also be similar. In subsequent psychophysical tests, this presumption will be directly tested. While the collection of experimental data is still pending, we show in preliminary computer simulations that different observer strategies, based on energy or spectral filters, can indeed be discriminated by a simple linear support-vector classifier in combination with signal preprocessing by spectral analysis and envelope extraction.

Wichmann, F. A., Kienzle, W., Schölkopf, B. and Franz, M. (2008)
Visual saliency re-visited: Center-surround patterns emerge as optimal predictors for human fixation targets
Vision Sciences Society (VSS), Naples, FL, USA (poster)

Humans perceive the world by directing the center of gaze from one location to another via rapid eye movements, called saccades. In the period between saccades the direction of gaze is held fixed for a few hundred milliseconds (fixations). It is primarily during fixations that information enters the visual system. Remarkably, however, after only a few fixations we perceive a coherent, high-resolution scene despite the visual acuity of the eye quickly decreasing away from the center of gaze: This suggests an effective strategy for selecting saccade targets.

Top-down effects, such as the observer's task, thoughts, or intentions, have an effect on saccadic selection. Equally well known is that bottom-up effects (local image structure) influence saccade targeting regardless of top-down effects. However, the question of what the most salient visual features are is still under debate. Here we model the relationship between spatial intensity patterns in natural images and the response of the saccadic system using tools from machine learning. This allows us to identify the most salient image patterns that guide the bottom-up component of the saccadic selection system, which we refer to as perceptive fields. We show that center-surround patterns emerge as the optimal solution to the problem of predicting saccade targets. Using a novel nonlinear system identification technique we reduce our learned classifier to a one-layer feed-forward network which is surprisingly simple compared to previously suggested models assuming more complex computations such as multi-scale processing, oriented filters and lateral inhibition. Nevertheless, our model is equally predictive and generalizes better to novel image sets. Furthermore, our findings are consistent with neurophysiological hardware in the superior colliculus. Bottom-up visual saliency may thus not be computed cortically as has been thought previously.

Jäkel, F., Schölkopf, B. and Wichmann, F.A. (2007)
About the triangle inequality in perceptual spaces
Computational and Systems Neuroscience (COSYNE), Salt Lake City, UT, USA (poster)

Perceptual similarity is often formalized as a metric in a multi-dimensional space. Stimuli are points in the space and stimuli that are similar are close to each other in this space. A large distance separates stimuli that are very different from each other. This conception of similarity prevails in studies from color perception and face perception to studies of categorization. While this notion of similarity is intuitively plausible, there has been an intense debate in cognitive psychology whether perceived dissimilarity satisfies the metric axioms. In a seminal series of papers, Tversky and colleagues have challenged all of the metric axioms [1,2,3].
The triangle inequality has been the hardest of the metric axioms to test experimentally. The reason for this is that measurements of perceived dissimilarity are usually only on an ordinal scale, on an interval scale at most. Hence, the triangle inequality on a finite set of points can always be satisfied, trivially, by adding a big enough constant to the measurements. Tversky and Gati [3] found a way to test the triangle inequality in conjunction with a second, very common assumption. This assumption is segmental additivity [1]: The distance from A to C equals the distance from A to B plus the distance from B to C, if B is “on the way”. All of the metrics that had been suggested to model similarity also had this assumption of segmental additivity, be it the Euclidean metric, the Lp-metric, or any Riemannian geometry. Tversky and Gati collected a substantial amount of data using many different stimulus sets, ranging from perceptual to cognitive, and found strong evidence that many human similarity judgments cannot be accounted for by the usual models of similarity. This led them to the conclusion that either the triangle inequality has to be given up or one has to use metric models with subadditive metrics. They favored the first solution. Here, we present a principled subadditive metric based on Shepard’s universal law of generalization [4].
Instead of representing each stimulus as a point in a multi-dimensional space, our subadditive metric stems from representing each stimulus by its similarity to all other stimuli in the space. This similarity function, as for example given by Shepard's law, will usually be a radial basis function and also a positive definite kernel. Hence, there is a natural inner product defined by the kernel and a metric that is induced by the inner product. This metric is subadditive. In addition, this metric has the psychologically desirable property that the distance between stimuli is bounded.
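
A worked version of this construction, in notation chosen here rather than taken from the poster: with a positive definite similarity kernel such as Shepard's exponential law, the kernel-induced distance is bounded and subadditive.

% Kernel-induced metric (illustrative notation; k is a positive definite
% similarity function with k(x,x)=1, e.g. Shepard's law k(x,y)=e^{-|x-y|/\lambda}):
\[
  d(x,y) \;=\; \lVert \phi(x) - \phi(y) \rVert
         \;=\; \sqrt{\,k(x,x) - 2\,k(x,y) + k(y,y)\,}
         \;=\; \sqrt{\,2 - 2\,k(x,y)\,}.
\]
% As the norm of a difference of feature vectors, d satisfies the triangle
% inequality, and it is bounded: 0 \le d(x,y) < \sqrt{2}.  Written as a function
% of the original separation t = |x-y|/\lambda, g(t) = \sqrt{2 - 2e^{-t}} is
% concave with g(0) = 0, hence subadditive: d(x,z) < d(x,y) + d(y,z) for distinct
% collinear stimuli x, y, z, which is exactly the violation of segmental
% additivity discussed above.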

Kienzle, W. , Macke, J. H., Wichmann, F. A., Schölkopf, B. and Franz, M. O. (2007)
Nonlinear receptive field analysis: Making kernel methods interpretable
Computational and Systems Neuroscience (COSYNE), Salt Lake City, UT, USA (poster)

Identification of stimulus-response functions is a central problem in systems neuroscience and related areas. Prominent examples are the estimation of receptive fields and classification images [1]. In most cases, the relationship between a high-dimensional input and the system output is modeled by a linear (first-order) or quadratic (second-order) model. Models with third or higher order dependencies are seldom used, since both parameter estimation and model interpretation can become very difficult.

Recently, Wu and Gallant [3] proposed the use of kernel methods, which have become a standard tool in machine learning during the past decade [2]. Kernel methods can capture relationships of any order, while solving the parameter estimation problem efficiently. In short, the stimuli are mapped into a high-dimensional feature space, where a standard linear method, such as linear regression or Fisher discriminant, is applied. The kernel function allows for doing this implicitly, with all computations carried out in stimulus space. As a consequence, the resulting model is nonlinear, but many desirable properties of linear methods are retained. For example, the estimation problem has no local minima, which is in contrast to other nonlinear approaches, such as neural networks [4].

Unfortunately, although kernel methods excel at modeling complex functions, the question of how to interpret the resulting models remains. In particular, it is not clear how receptive fields should be defined in this context, or how they can be visualized. To remedy this, we propose the following definition: noting that the model is linear in feature space, we define a nonlinear receptive field as a stimulus whose image in feature space maximizes the dot-product with the learned model. This can be seen as a generalization of the receptive field of a linear filter: if the feature map is the identity, the kernel method becomes linear, and our receptive field definition coincides with that of a linear filter. If it is nonlinear, we numerically invert the feature space mapping to recover the receptive field in stimulus space.

Experimental results show that receptive fields of simulated visual neurons, using natural stimuli, are correctly identified. Moreover, we use this technique to compute nonlinear receptive fields of the human fixation mechanism during free-viewing of natural images.
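
The receptive-field definition above can be sketched in a few lines: for a kernel model f(x) = sum_i alpha_i k(x, x_i) with a Gaussian kernel, the gradient of f is available in closed form, so the maximizing stimulus can be found numerically. The training setup, kernel width and "true" receptive field below are placeholders, not the models analysed in the paper.

# Minimal sketch of the receptive-field definition: for a kernel model
# f(x) = sum_i alpha_i k(x, x_i) with a Gaussian kernel, find the stimulus that
# maximizes f.  Training data and all parameters are placeholders.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(8)
d, n = 64, 300                                    # stimulus dimension, training stimuli
X = rng.standard_normal((n, d))
true_rf = np.sin(np.linspace(0, 4 * np.pi, d))    # hypothetical generating receptive field
y = np.tanh(X @ true_rf) + 0.1 * rng.standard_normal(n)

sigma2 = float(d)                                 # Gaussian kernel width (assumption)
sq_dist = ((X[:, None] - X[None]) ** 2).sum(-1)
alpha = np.linalg.solve(np.exp(-sq_dist / (2 * sigma2)) + 1e-1 * np.eye(n), y)   # kernel ridge

def neg_f_and_grad(x):
    """Negative model output and its gradient, for f(x) = sum_i alpha_i k(x, x_i)."""
    k = np.exp(-((X - x) ** 2).sum(1) / (2 * sigma2))
    grad = ((alpha * k)[:, None] * (X - x)).sum(0) / sigma2
    return -(alpha @ k), -grad

res = minimize(neg_f_and_grad, x0=np.zeros(d), jac=True, method="L-BFGS-B")
nonlinear_rf = res.x                              # stimulus that maximizes the model output
print("correlation with the generating receptive field:",
      np.corrcoef(nonlinear_rf, true_rf)[0, 1].round(2))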



Kienzle, W., Wichmann, F. A., Schölkopf, B. and Franz, M. O. (2007)
Learning the influence of spatio-temporal variations in local image structure on visual saliency
Tübingen Perception Conference (TWK), Tübingen, FRG (poster)

Computational models for bottom-up visual attention traditionally consist of a bank of Gabor-like or Difference-of-Gaussians filters and a nonlinear combination scheme which combines the filter responses into a real-valued saliency measure [1]. Recently it was shown that a standard machine learning algorithm can be used to derive a saliency model from human eye movement data with a very small number of additional assumptions. The learned model is much simpler than previous models, but nevertheless has state-of-the-art prediction performance [2]. A central result from this study is that DoG-like center-surround filters emerge as the unique solution to optimizing the predictivity of the model. Here we extend the learning method to the temporal domain. While the previous model [2] predicts visual saliency based on local pixel intensities in a static image, our model also takes into account temporal intensity variations. We find that the learned model responds strongly to temporal intensity changes occurring 200-250 ms before a saccade is initiated. This delay coincides with typical saccadic latencies, indicating that the learning algorithm has extracted a meaningful statistic from the training data. In addition, we show that the model correctly predicts a significant proportion of human eye movements on previously unseen test data.

Kienzle, W., Wichmann, F. A., Schölkopf, B. and Franz, M. O. (2007)
Center-surround filters emerge from optimizing predictivity in a free-viewing task
Computational and Systems Neuroscience (COSYNE), Salt Lake City, UT, USA (poster)

In which way do the local image statistics at the center of gaze differ from those at randomly chosen image locations? In 1999, Reinagel and Zador [1] showed that RMS contrast is significantly increased around fixated locations in natural images. Since then, numerous additional hypotheses have been proposed, based on edge content, entropy, self-information, higher-order statistics, or sophisticated models such as that of Itti and Koch [2].
While these models are rather different in terms of the image features they use, they hardly differ in terms of their predictive power. This complicates the question of which bottom-up mechanism actually drives human eye movements. To shed some light on this problem, we analyze the nonlinear receptive fields of an eye movement model which is purely data-driven. It consists of a nonparametric radial basis function network, fitted to human eye movement data. To avoid a bias towards specific image features such as edges or corners, we deliberately chose raw pixel values as the input to our model, not the outputs of some filter bank. The learned model is analyzed by computing its optimal stimuli. It turns out that there are two maximally excitatory stimuli, both of which have center-surround structure, and two maximally inhibitory stimuli, which are basically flat. We argue that these can be seen as nonlinear receptive fields of the underlying system. In particular, we show that a small radial basis function network with the optimal stimuli as centers predicts unseen eye movements as precisely as the full model.
The fact that center-surround filters emerge from a simple optimality criterion, without any prior assumption that would make them more probable than, e.g., edges, corners, or any other configuration of pixel values in a square patch, suggests a special role of these filters in free-viewing of natural images.

References
[1] P. Reinagel and A. M. Zador, Natural scene statistics at the centre of gaze, Network: Computation in Neural Systems, 1999.
[2] L. Itti, C. Koch and E. Niebur, A saliency-based search mechanism for overt and covert shifts of visual attention, Vision Research, 2000.
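For illustration only, a sketch of the reduced model mentioned above: a radial basis function network whose centers are a handful of fixed patches (in the study, the optimal stimuli) and whose readout weights are fitted to fixated-versus-control labels. The centers, bandwidth, data, and the logistic readout used here are placeholders, not the original fitting procedure.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def rbf_features(X, centers, sigma):
        # One feature per center: exp(-||x - c||^2 / (2 sigma^2))
        sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-sq / (2.0 * sigma ** 2))

    rng = np.random.default_rng(0)
    centers = rng.standard_normal((4, 169))   # stand-in for the four optimal stimuli (13x13 patches)
    X = rng.standard_normal((1000, 169))      # stand-in patches at fixated and control locations
    labels = rng.integers(0, 2, size=1000)    # stand-in labels (1 = fixated)

    # Small RBF network: the centers are fixed, only the readout weights are fitted.
    model = LogisticRegression().fit(rbf_features(X, centers, sigma=5.0), labels)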

Wiebel, Ch. and Wichmann, F. A. (2007)
Oblique- and plaid-masking re-visited
Tagung experimentell arbeitender Psychologen (TeaP), Mannheim, FRG (talk)

Almost all current models of early spatial vision assume the existence of independent linear channels tuned to a limited range of spatial frequencies and orientations, i.e. that the early visual system acts akin to a sub-band coder. Some classic studies cast doubt on this notion, however, in particular the results reported by Henning, Hertz & Broadbent (1975) using amplitude-modulated gratings and those of Derrington & Henning (1989) using 2-D plaid maskers. In our study we explore the unusually strong masking induced by space-time coincident plaids in more detail: in particular, we determine how masking depends on stimulus presentation time and on the contrast asymmetry between the plaid components.
Such data will help to constrain non-linear extensions of the current channel model.

Drewes, J., Wichmann, F. A. and Gegenfurtner, K. R. (2006)
Classification of natural scenes: Critical features revisited
Tübingen Perception Conference (TWK), Tübingen, FRG (poster)

Human observers are capable of detecting animals within novel natural scenes with remarkable speed and accuracy. Despite the seeming complexity of such decisions it has been hypothesized that a simple global image feature, the relative abundance of high spatial frequencies at certain orientations, could underlie such fast image classification [1].
We successfully used linear discriminant analysis to classify a set of 11,000 images into “animal” and “non-animal” images based on their individual amplitude spectra only [2]. We proceeded to sort the images based on the performance of our classifier, retaining only the best and worst classified 400 images ("best animals", "best distractors" and "worst animals", "worst distractors").
We used a Go/No-go paradigm to evaluate human performance on this subset of our images. Both reaction time and proportion of correctly classified images showed a significant effect of classification difficulty. Images more easily classified by our algorithm were also classified faster and better by humans, as predicted by the Torralba & Oliva hypothesis.
We then equated the amplitude spectra of the 400 images, which, by design, reduced algorithmic performance to chance whereas human performance was only slightly reduced [3]. Most importantly, the same images as before were still classified better and faster, suggesting that even in the original condition features other than specifics of the amplitude spectrum made particular images easy to classify, clearly at odds with the Torralba & Oliva hypothesis.

[1] A. Torralba & A. Oliva, Network: Comput. Neural Syst., 2003
[2] Drewes, Wichmann, Gegenfurtner VSS 2005
[3] cf. Wichmann, Rosas, Gegenfurtner, VSS 2005
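A sketch of the two image-processing steps described above, assuming same-sized grayscale images stored in a single NumPy array: the amplitude spectrum serves as the classifier input (e.g. for a linear discriminant analysis), and spectra are equated by recombining each image's own phase with the mean amplitude spectrum. This is an illustrative reconstruction, not the original analysis code.

    import numpy as np

    def amplitude_spectrum(img):
        # Magnitude of the 2-D Fourier transform; flattened, this is the classifier input.
        return np.abs(np.fft.fft2(img))

    def equate_amplitude_spectra(images):
        # Give every image the mean amplitude spectrum while keeping its own phase.
        spectra = np.fft.fft2(images, axes=(-2, -1))
        mean_amplitude = np.abs(spectra).mean(axis=0)
        equated = np.fft.ifft2(mean_amplitude * np.exp(1j * np.angle(spectra)), axes=(-2, -1))
        return equated.real

By construction, all equated images share the same amplitude spectrum, so a classifier that relies on amplitude spectra alone drops to chance, while phase-dependent image content remains available to human observers.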

Drewes, J., Wichmann, F. A. and Gegenfurtner, K. R. (2006)
Classification of natural scenes: Critical features revisited
Tagung experimentell arbeitender Psychologen (TeaP), Mainz, FRG (poster)

Human observers are capable of detecting animals within novel natural scenes with remarkable speed and accuracy. Despite the seeming complexity of such decisions it has been hypothesized that a simple global image feature, the relative abundance of high spatial frequencies at certain orientations, could underlie such fast image classification [1].
We successfully used linear discriminant analysis to classify a set of 11,000 images into “animal” and “non-animal” images based on their individual amplitude spectra only [2]. We proceeded to sort the images based on the performance of our classifier, retaining only the best and worst classified 400 images ("best animals", "best distractors" and "worst animals", "worst distractors").
We used a Go/No-go paradigm to evaluate human performance on this subset of our images. Both reaction time and proportion of correctly classified images showed a significant effect of classification difficulty. Images more easily classified by our algorithm were also classified faster and better by humans, as predicted by the Torralba & Oliva hypothesis.
We then equated the amplitude spectra of the 400 images, which, by design, reduced algorithmic performance to chance whereas human performance was only slightly reduced [3]. Most importantly, the same images as before were still classified better and faster, suggesting that even in the original condition features other than specifics of the amplitude spectrum made particular images easy to classify, clearly at odds with the Torralba & Oliva hypothesis.

[1] A. Torralba & A. Oliva, Network: Comput. Neural Syst., 2003
[2] Drewes, Wichmann, Gegenfurtner VSS 2005
[3] cf. Wichmann, Rosas, Gegenfurtner, VSS 2005

Drewes, J., Wichmann, F. A. and Gegenfurtner, K. R. (2006)
Classification of natural scenes: Critical features revisited
Vision Sciences Society (VSS), Sarasota, FL, USA (talk)

Human observers are capable of detecting animals within novel natural scenes with remarkable speed and accuracy. Despite the seeming complexity of such decisions it has been hypothesized that a simple global image feature, the relative abundance of high spatial frequencies at certain orientations, could underlie such fast image classification [1].
We successfully used linear discriminant analysis to classify a set of 11,000 images into “animal” and “non-animal” images based on their individual amplitude spectra only [2]. We proceeded to sort the images based on the performance of our classifier, retaining only the best and worst classified 400 images ("best animals", "best distractors" and "worst animals", "worst distractors").
We used a Go/No-go paradigm to evaluate human performance on this subset of our images. Both reaction time and proportion of correctly classified images showed a significant effect of classification difficulty. Images more easily classified by our algorithm were also classified faster and better by humans, as predicted by the Torralba & Oliva hypothesis.
We then equated the amplitude spectra of the 400 images, which, by design, reduced algorithmic performance to chance whereas human performance was only slightly reduced [3]. Most importantly, the same images as before were still classified better and faster, suggesting that even in the original condition features other than specifics of the amplitude spectrum made particular images easy to classify, clearly at odds with the Torralba & Oliva hypothesis.

[1] A. Torralba & A. Oliva, Network: Comput. Neural Syst., 2003
[2] Drewes, Wichmann, Gegenfurtner VSS 2005
[3] cf. Wichmann, Rosas, Gegenfurtner, VSS 2005

Wichmann, F. A. and Henning, B. (2006)
The pedestal effect is caused by off-frequency looking, not nonlinear transduction or contrast gain-control
Tübingen Perception Conference (TWK), Tübingen, FRG (poster)

The pedestal or dipper effect is the large improvement in the detectability of a sinusoidal grating observed when the signal is added to a pedestal or masking grating having the signal's spatial frequency, orientation, and phase. The effect is largest with pedestal contrasts just above the ‘threshold’ in the absence of a pedestal. We measured the pedestal effect in both broadband and notched masking noise---noise from which a 1.5-octave band centered on the signal and pedestal frequency had been removed. The pedestal effect persists in broadband noise, but almost disappears with notched noise. The spatial-frequency components of the notched noise that lie above and below the spatial frequency of the signal and pedestal prevent the use of information about changes in contrast carried in channels tuned to spatial frequencies that are very much different from that of the signal and pedestal. We conclude that the pedestal effect in the absence of notched noise results principally from the use of information derived from channels with peak sensitivities at spatial frequencies that are different from that of the signal and pedestal. Thus the pedestal or dipper effect is not a characteristic of individual spatial-frequency tuned channels.
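As an illustration of the notched-noise manipulation, a sketch that removes a 1.5-octave band centered on the signal frequency from broadband 2-D noise in the Fourier domain; the image size, signal frequency, and lack of contrast scaling are placeholder choices, not the stimulus parameters of the experiment.

    import numpy as np

    def notched_noise(size=256, signal_freq=8.0, notch_octaves=1.5, rng=None):
        # Broadband 2-D noise with a notch_octaves-wide band (geometrically centered
        # on signal_freq, in cycles per image) removed in the Fourier domain.
        rng = np.random.default_rng() if rng is None else rng
        F = np.fft.fft2(rng.standard_normal((size, size)))
        f = np.fft.fftfreq(size) * size                   # frequency axis in cycles per image
        radius = np.hypot(f[:, None], f[None, :])         # radial spatial frequency
        lo = signal_freq * 2.0 ** (-notch_octaves / 2.0)
        hi = signal_freq * 2.0 ** (notch_octaves / 2.0)
        F[(radius >= lo) & (radius <= hi)] = 0.0          # carve out the notch
        return np.fft.ifft2(F).real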

Wichmann, F. A. and Henning, B. (2006)
The pedestal effect is caused by off-frequency looking, not nonlinear transduction or contrast gain-control
Tagung experimentell arbeitender Psychologen (TeaP), Mainz, FRG (talk)

The pedestal or dipper effect is the large improvement in the detectability of a sinusoidal grating observed when the signal is added to a pedestal or masking grating having the signal's spatial frequency, orientation, and phase. The effect is largest with pedestal contrasts just above the ‘threshold’ in the absence of a pedestal. We measured the pedestal effect in both broadband and notched masking noise---noise from which a 1.5-octave band centered on the signal and pedestal frequency had been removed. The pedestal effect persists in broadband noise, but almost disappears with notched noise. The spatial-frequency components of the notched noise that lie above and below the spatial frequency of the signal and pedestal prevent the use of information about changes in contrast carried in channels tuned to spatial frequencies that are very much different from that of the signal and pedestal. We conclude that the pedestal effect in the absence of notched noise results principally from the use of information derived from channels with peak sensitivities at spatial frequencies that are different from that of the signal and pedestal. Thus the pedestal or dipper effect is not a characteristic of individual spatial-frequency tuned channels.

Wichmann, F. A. and Henning, B. (2006)
The pedestal effect is caused by off-frequency looking, not nonlinear transduction or contrast gain-control
Vision Sciences Society (VSS), Sarasota, FL, USA (poster)

The pedestal or dipper effect is the large improvement in the detectability of a sinusoidal grating observed when the signal is added to a pedestal or masking grating having the signal's spatial frequency, orientation, and phase. The effect is largest with pedestal contrasts just above the ‘threshold’ in the absence of a pedestal. We measured the pedestal effect in both broadband and notched masking noise---noise from which a 1.5-octave band centered on the signal and pedestal frequency had been removed. The pedestal effect persists in broadband noise, but almost disappears with notched noise. The spatial-frequency components of the notched noise that lie above and below the spatial frequency of the signal and pedestal prevent the use of information about changes in contrast carried in channels tuned to spatial frequencies that are very much different from that of the signal and pedestal. We conclude that the pedestal effect in the absence of notched noise results principally from the use of information derived from channels with peak sensitivities at spatial frequencies that are different from that of the signal and pedestal. Thus the pedestal or dipper effect is not a characteristic of individual spatial-frequency tuned channels.

Jäkel, F. and Wichmann, F. A. (2005)
Kernel-methods, similarity and exemplar theories of categorization
Annual Summer Interdisciplinary Conference (ASIC), Briançon, F (poster)

Kernel methods are popular tools in machine learning and statistics that can be implemented in a simple feed-forward neural network. They have strong connections to several psychological theories. For example, Shepard's universal law of generalization can be given a kernel interpretation. This leads to an inner product and a metric on the psychological space that is different from the usual Minkowski norm. The metric has psychologically interesting properties: it is bounded from above and does not have additive segments. As categorization models often rely on Shepard's law as a model for psychological similarity, some of them can be recast as kernel methods. In particular, ALCOVE is shown to be closely related to kernel logistic regression. The relationship to the Generalized Context Model is also discussed. It is argued that functional analysis, which is routinely used in machine learning, also provides valuable insights for psychology.
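For concreteness, a sketch of the induced geometry in the one-dimensional case, taking Shepard's exponential similarity as the kernel: with $k(x,y)=\exp(-|x-y|)$ and a feature map $\varphi$ satisfying $\langle\varphi(x),\varphi(y)\rangle=k(x,y)$,
$$ d(x,y)^2 = \|\varphi(x)-\varphi(y)\|^2 = k(x,x)+k(y,y)-2\,k(x,y) = 2\bigl(1-e^{-|x-y|}\bigr), $$
so $d(x,y)=\sqrt{2\,(1-e^{-|x-y|})}<\sqrt{2}$ for all $x,y$: the metric is bounded from above, and since $d(x,z)<d(x,y)+d(y,z)$ holds strictly even when $y$ lies between $x$ and $z$, it has no additive segments.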

Jäkel, F., Hill, J. and Wichmann, F. A. (2004)
m-Alternative-Forced-Choice: Improving the efficiency of the method of constant stimuli
Tübingen Perception Conference (TWK), Tübingen, FRG (poster)

We explored several ways to improve the efficiency of measuring psychometric functions without resorting to adaptive procedures. a) Increasing the number m of alternatives in an m-alternative-forced-choice (m-AFC) task improves the efficiency of the method of constant stimuli. b) When alternatives are presented simultaneously at different positions on a screen rather than sequentially, time can be saved and the subject's memory load is reduced. c) A touch-screen can further help to make the experimental procedure more intuitive. We tested these ideas in the measurement of contrast sensitivity and compared them to results obtained by sequential presentation in two-interval-forced-choice (2-IFC). Qualitatively, all methods (m-AFC and 2-IFC) recovered the characteristic shape of the contrast sensitivity function in three subjects. The m-AFC paradigm only took about 60% of the time of the 2-IFC task. We tried m={2,4,8} and found 4-AFC to give the best model fits and 2-AFC to have the least bias.
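As an illustration of the underlying psychometric model, a sketch that fits an m-AFC psychometric function with guess rate 1/m by maximum likelihood; the Weibull shape, fixed lapse rate, and simulated data are assumptions made for this example, not the analysis used in the study.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import binom

    def psi(x, alpha, beta, m, lam=0.01):
        # m-AFC psychometric function: guess rate 1/m, lapse rate lam, Weibull core.
        gamma = 1.0 / m
        return gamma + (1.0 - gamma - lam) * (1.0 - np.exp(-(x / alpha) ** beta))

    def fit_mafc(contrasts, n_correct, n_trials, m):
        # Maximum-likelihood estimate of the Weibull threshold (alpha) and slope (beta).
        def nll(params):
            p = np.clip(psi(contrasts, params[0], params[1], m), 1e-6, 1 - 1e-6)
            return -binom.logpmf(n_correct, n_trials, p).sum()
        return minimize(nll, x0=[np.median(contrasts), 2.0],
                        bounds=[(1e-6, None), (0.1, 10.0)], method='L-BFGS-B').x

    # Example: simulated 4-AFC data at six contrast levels, 40 trials each.
    rng = np.random.default_rng(0)
    contrasts = np.array([0.005, 0.01, 0.02, 0.04, 0.08, 0.16])
    n_trials = np.full(6, 40)
    n_correct = rng.binomial(n_trials, psi(contrasts, alpha=0.02, beta=2.5, m=4))
    alpha_hat, beta_hat = fit_mafc(contrasts, n_correct, n_trials, m=4)

The higher guess rate 1/m for small m is what makes 2-AFC trials individually less informative; raising m lowers the guess rate and thereby increases the information gained per trial of the method of constant stimuli.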