Attention Science

How AI Content Analysis Works: The Technology Behind Attention Prediction

A technical deep-dive into how AI predicts where humans look in an image. From DeepGaze neural networks to cognitive activation modeling.

When you upload an image to FlowDx, three AI systems analyze it simultaneously. Here's what actually happens under the hood — no marketing speak, just the technical reality.

Engine 1: Attention Prediction (DeepGaze IIE)

The attention heatmap is generated by DeepGaze IIE (Linardos et al., 2021), developed at the University of Tübingen's Bethge Lab. It's the #1 ranked model on the MIT Saliency Benchmark.

How it works

DeepGaze IIE is an ensemble built on top of several pre-trained backbone networks (among them DenseNet-201 and ResNeXt-50) that extract visual features at multiple scales. These features are combined through learned readout layers that predict fixation probability maps.
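
Here is a minimal PyTorch sketch of that structure, with illustrative channel counts and a deliberately simplified readout head; the published model also applies a center-bias prior and Gaussian blurring, which are omitted here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

class SaliencyReadout(nn.Module):
    """DeepGaze-style architecture: frozen pre-trained backbones extract
    features; a small learned readout maps them to a fixation map.
    Channel counts and readout depth here are illustrative."""

    def __init__(self):
        super().__init__()
        densenet = models.densenet201(weights="DEFAULT").features       # 1920 channels
        resnext = models.resnext50_32x4d(weights="DEFAULT")
        resnext = nn.Sequential(*list(resnext.children())[:-2])         # 2048 channels
        self.backbones = nn.ModuleList([densenet, resnext])
        for b in self.backbones:                     # backbones stay frozen;
            b.requires_grad_(False)                  # only the readout is trained
        self.readout = nn.Sequential(                # learned 1x1-conv readout
            nn.Conv2d(1920 + 2048, 16, kernel_size=1),
            nn.Softplus(),
            nn.Conv2d(16, 1, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [b(x) for b in self.backbones]
        h, w = feats[0].shape[-2:]                   # align spatial sizes
        feats = [F.interpolate(f, size=(h, w), mode="bilinear",
                               align_corners=False) for f in feats]
        return self.readout(torch.cat(feats, dim=1))  # unnormalized fixation map
```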

The key insight: the model doesn't just detect "bright" or "contrasty" regions. It learns complex feature interactions that correlate with actual human eye movements. The training data comes from large-scale fixation datasets: the model is pre-trained on SALICON (Jiang et al., 2015) and fine-tuned on genuine eye-tracking data such as MIT1003 (Judd et al., 2009), thousands of images with corresponding gaze data from human observers, and it is evaluated on the MIT Saliency Benchmark (Bylinskii et al., 2019).
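
Training is probabilistic: the readout output is normalized into a density over pixel locations, and the model is fit by maximizing the log-likelihood of recorded fixations under that density. A minimal sketch of the objective (function and variable names are ours, not from the DeepGaze codebase):

```python
import torch
import torch.nn.functional as F

def fixation_nll(readout_map: torch.Tensor,
                 fixations: list[tuple[int, int]]) -> torch.Tensor:
    """Negative log-likelihood of observed fixations under the predicted
    fixation density. readout_map: (H, W) unnormalized scores;
    fixations: (row, col) coordinates of recorded gaze points."""
    log_density = F.log_softmax(readout_map.flatten(), dim=0).view_as(readout_map)
    return -torch.stack([log_density[r, c] for r, c in fixations]).mean()
```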

Accuracy

DeepGaze IIE achieves an AUC-Judd score of 0.87+ on the MIT benchmark, meaning its predictions correlate strongly with where actual humans look. For comparison, the theoretical upper bound (inter-subject agreement) is around 0.92.
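
AUC-Judd treats the saliency map as a classifier separating fixated pixels from all other pixels and measures the area under the resulting ROC curve. A simplified NumPy implementation, with edge-case handling trimmed for brevity:

```python
import numpy as np

def auc_judd(saliency: np.ndarray, fixation_mask: np.ndarray) -> float:
    """AUC-Judd: how well the saliency map separates fixated pixels
    from all other pixels. saliency: (H, W) float map; fixation_mask:
    (H, W) boolean array, True where a human fixation landed."""
    s = saliency.ravel().astype(float)
    f = fixation_mask.ravel().astype(bool)
    n_fix, n_pix = int(f.sum()), s.size
    tpr, fpr = [0.0], [0.0]
    for t in np.sort(s[f])[::-1]:            # one threshold per fixated pixel
        above = s >= t
        tpr.append((above & f).sum() / n_fix)             # fixations above threshold
        fpr.append((above & ~f).sum() / (n_pix - n_fix))  # background above threshold
    tpr.append(1.0)
    fpr.append(1.0)
    return float(np.trapz(tpr, fpr))         # area under the ROC curve
```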

Engine 2: Cognitive Activation Analysis

This engine estimates how different brain regions would respond to the visual stimulus, based on computational neuroscience models of visual processing. A toy sketch of the idea appears after the list below.

The five dimensions

  • Visual Cortex (V1-V4) — Low-level visual processing: edges, colors, textures. Correlates with "visual impact." Based on Hubel & Wiesel's (1962) foundational work on receptive fields in the visual cortex.
  • Amygdala — Emotional salience detection. LeDoux (2000) showed this structure evaluates emotional content within 170ms.
  • Hippocampus — Memory encoding. Stern et al. (1996) demonstrated hippocampal activation predicts whether a stimulus will be remembered.
  • Prefrontal Cortex — Decision-making and action planning. Miller & Cohen (2001) established the PFC's role in goal-directed behavior.
  • Fusiform Face Area (FFA) — Face recognition, with adjacent fusiform regions handling body recognition. Kanwisher et al. (1997) identified this face-selective region.
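
FlowDx doesn't publish this engine's internals, but the general idea of mapping measurable image properties onto per-region proxy scores can be shown with a toy sketch; every feature-to-region mapping below is an illustrative assumption, not the product's actual model:

```python
import numpy as np
from scipy import ndimage

def activation_profile(image: np.ndarray, face_count: int = 0) -> dict[str, float]:
    """Toy proxy scores in [0, 1] for the five dimensions above.
    image: (H, W, 3) RGB array with values in [0, 255]. All feature
    choices are illustrative stand-ins, not FlowDx's actual model."""
    gray = image.mean(axis=2) / 255.0
    edge_energy = np.abs(ndimage.sobel(gray)).mean()            # V1-style edge response
    saturation = (image.max(axis=2) - image.min(axis=2)).mean() / 255.0
    contrast = gray.std()
    return {
        "visual_cortex": min(1.0, 4 * edge_energy + contrast),  # edges, contrast
        "amygdala":      min(1.0, 2 * saturation),              # crude emotional-salience proxy
        "hippocampus":   min(1.0, 3 * contrast),                # crude distinctiveness proxy
        "prefrontal":    0.5,                                   # placeholder: needs task context
        "fusiform":      min(1.0, face_count / 3),              # scaled by detected faces
    }
```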

Engine 3: Gemini Vision AI Diagnosis

The third engine uses Google's Gemini multimodal AI with extended thinking to analyze the image holistically. Unlike the first two engines (which are specialized neural networks), Gemini performs high-level visual reasoning.

What Gemini does differently

Gemini can understand context, read text within images, identify compositional problems, and generate natural-language recommendations. It receives the attention data from engines 1 and 2 as context, then produces the following (a minimal sketch of this wiring appears after the list):

  • Specific diagnosis of visual problems (not just "low attention" but "the white text is invisible against the light background")
  • Evidence-based recommendations (referencing the attention data)
  • Precise annotation coordinates for marking problem areas
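
A sketch of how such a call can be wired up with Google's google-generativeai Python SDK; the model name, prompt wording, and attention-summary format are illustrative placeholders, not FlowDx's production code:

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")            # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")    # illustrative model choice

def diagnose(image_path: str, attention_summary: str) -> str:
    """Send the image plus the upstream engines' findings to Gemini
    and ask for a structured visual diagnosis."""
    prompt = (
        "You are a visual-attention analyst. Using the image and the "
        "attention data below, list (1) specific visual problems, "
        "(2) evidence-based fixes, and (3) pixel coordinates for each problem.\n\n"
        f"Attention data:\n{attention_summary}"
    )
    response = model.generate_content([prompt, Image.open(image_path)])
    return response.text
```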

Why Three Engines?

Each engine has blind spots:

  • DeepGaze predicts where people look but not why or how to fix it
  • Cognitive activation tells you which brain systems respond but not which specific elements trigger that response
  • Gemini understands context and meaning but lacks the perceptual accuracy of specialized saliency models

Together, they provide a complete picture: where attention goes (DeepGaze), how the brain responds (cognitive activation), and what to do about it (Gemini).

Try It Yourself

Upload any image to FlowDx and see all three engines at work. The analysis takes about 30 seconds and costs 1 credit.

References

  • Bylinskii, Z., Judd, T., Oliva, A., Torralba, A., & Durand, F. (2019). What do different evaluation metrics tell us about saliency models? IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(3), 740–757.
  • Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of Physiology, 160(1), 106–154.
  • Jiang, M., Huang, S., Duan, J., & Zhao, Q. (2015). SALICON: Saliency in Context. Proceedings of CVPR 2015.
  • Judd, T., Ehinger, K., Durand, F., & Torralba, A. (2009). Learning to predict where humans look. Proceedings of ICCV 2009.
  • Kanwisher, N., McDermott, J., & Chun, M. M. (1997). The fusiform face area: A module in human extrastriate cortex specialized for face perception. The Journal of Neuroscience, 17(11), 4302–4311.
  • LeDoux, J. E. (2000). Emotion circuits in the brain. Annual Review of Neuroscience, 23, 155–184.
  • Linardos, A., Kümmerer, M., Press, O., & Bethge, M. (2021). DeepGaze IIE: Calibrated prediction in and out-of-domain for state-of-the-art saliency modeling. Proceedings of ICCV 2021.
  • Miller, E. K., & Cohen, J. D. (2001). An integrative theory of prefrontal cortex function. Annual Review of Neuroscience, 24, 167–202.
  • Stern, C. E., Corkin, S., González, R. G., et al. (1996). The hippocampal formation participates in novel picture encoding: Evidence from functional magnetic resonance imaging. PNAS, 93(16), 8660–8665.
