Method Heart: How the Texture Scorer Works

This page explains the core scoring logic in paper style, now with explicit equations and pseudocode. The key idea is to avoid treating texture suitability as a binary yes-or-no classification and instead estimate it as structured evidence consolidation: proposal fragmentation statistics, region-level boundary coherence, texture occupancy, and semantic counter-evidence are combined into a conservative score used for ranking. The system is therefore not trying to recognize object identity; it is estimating whether an image naturally supports RWTD-like texture partitioning.

Method and Evidence Open ADE20K Gallery Home

What Problem the Scorer Solves

Many natural images are visually rich but not texture-segmentation-friendly. A highly semantic scene can look complicated while still being a weak candidate for texture-transition analysis, and a visually simple scene can contain two or three strong material regimes that are ideal for texture segmentation. The scorer resolves this mismatch by ranking images according to texture-transition structure rather than semantic saliency, which is why the pipeline separates positive texture evidence from semantic counter-evidence before final ranking.

End-to-End Staged Pipeline (Visual Walkthrough)

Instead of describing the stages abstractly, this section shows one representative example per stage from the live manifest. The goal is to make the decision flow concrete: Stage A checks whether proposal structure is texture-friendly, geometry enrichment turns that structure into region-level texture evidence, Stage B tests semantic alignment using contrastive retrieval, and Stage D applies a conservative rubric that can veto semantically dominant false positives.

Stage A: Proposal Statistics Gate

This stage asks whether the proposal distribution looks like texture fragmentation rather than a single dominant object. It summarizes proposal count, dominance, small-region prevalence, and entropy into a single prior score.

Geometry Enrichment

This stage transforms region structure into texture evidence: boundary coherence, texture occupancy, object occupancy, and large-region organization. The overlay view is where boundary quality becomes visually interpretable.

Stage B: CLIP Retrieval Filter

This stage keeps images whose text-image contrast supports texture transition cues while suppressing object-centric semantics. Showing both a kept and filtered sample clarifies how the retrieval contrast behaves around the decision boundary.

Stage D: Rubric Audit (Optional)

This stage is a safety layer: it can preserve strong texture-dominant scenes, but it can also veto semantically problematic scenes even when upstream scores are high. The paired examples below show this conservative correction in action.

Global Objective and Fusion

For each image \(x_i\), we compute a final score \(s_i \in [0,100]\). The fusion is intentionally geometry-dominant: retrieval and rubric signals are corrective, while penalties enforce conservative behavior when evidence is inconsistent.

\[ s_i = \Pi_{[0,100]}\!\left( s_i^{\text{base}} + \lambda_B\,\Delta_i^B + \lambda_D\,\Delta_i^D - \rho_i \right), \]

\(\Pi_{[0,100]}\) denotes clipping to \([0,100]\); \(\rho_i\) is a penalty term from explicit risk flags and vetoes.

This form is the core design choice: when corrective terms are too large, the model drifts toward semantics; when they are absent, the ranking is under-calibrated on hard edge cases. The selected blend keeps interpretation stable while still allowing recovery of valid but rare texture scenes.

Stage A: Fragmentation Prior (Proposal Statistics)

Let \(\{a_{ij}\}_{j=1}^{N_i}\) be proposal areas for image \(i\), and \(A_i\) the image area. We define normalized proposal ratios \(r_{ij}=a_{ij}/A_i\), then summarize proposal fragmentation with count, dominance, small-region prevalence, and entropy. This yields a weak but robust prior that the image contains sufficient texture micro-structure.

\[ \ell_i = \max_j r_{ij},\quad m_i = \operatorname{median}_j(r_{ij}),\quad q_i = \frac{1}{N_i}\sum_j \mathbf{1}[r_{ij}<\tau_s],\quad h_i = -\sum_j p_{ij}\log_2 p_{ij},\ \ p_{ij}=\frac{r_{ij}}{\sum_k r_{ik}}. \]

\[ s_i^A = 100\Big(0.28\,u_N + 0.24\,u_\ell + 0.18\,u_m + 0.18\,u_q + 0.12\,u_h\Big), \]

where \(u_N,u_\ell,u_m,u_q,u_h\in[0,1]\) are normalized versions of count, dominance, median, small-mask rate, and entropy using fixed saturation ranges.

\[ g_i^A = \mathbf{1}\! \big[N_i\!\ge\!N_{\min}\ \land\ \ell_i\!\le\!\ell_{\max}\ \land\ m_i\!\le\!m_{\max}\ \land\ q_i\!\ge\!q_{\min}\ \land\ h_i\!\ge\!h_{\min}\big]. \]

In practice this stage is conservative and not expected to finalize ranking quality. Its job is to suppress obvious non-candidates while preserving uncertain cases for stronger downstream evidence, which is why unknown parse cases can continue rather than being deterministically dropped.

Geometry Backbone: Boundary-Coherent Texture Evidence

The geometry backbone provides the principal discriminative power by promoting boundary-coherent texture decomposition and penalizing semantic/object dominance or ambiguous clutter. Let \(b_i\) be boundary coherence score, \(t_i\) texture occupancy, \(o_i\) object fraction, \(a_i\) ambiguity fraction, and \(c_i\) clutter term. Let \(r_i\) and \(v_i\) denote region-count and strong-boundary bonuses derived from coarse region structure.

\[ s_i^{\text{base}} = \Pi_{[0,100]}\!\left(0.20\,s_i^A + 0.60\,b_i + \beta_t\,\phi(t_i) + \beta_r\,r_i + \beta_v\,v_i - \beta_o\,o_i - \beta_a\,a_i - \beta_c\,c_i\right). \]

Conceptually, this base score encodes the actual target structure of texture segmentation: two-to-four coherent material regions with meaningful boundaries. Because it is geometry-first rather than class-first, it remains aligned with non-semantic texture tasks and does not rely on object labels to appear strong.

Stage B: Retrieval Correction with Diversity-Preserving Keep Rule

Stage B computes retrieval evidence from positive and negative textual anchors. If \(e_i\) is the image embedding, \(\{t_q^+\}\) positive prompt embeddings, and \(\{t_q^-\}\) negative prompt embeddings, the retrieval contrast is:

\[ c_i = \max_q\,\operatorname{sim}(e_i,t_q^+) - \alpha\,\max_q\,\operatorname{sim}(e_i,t_q^-). \]

Instead of a single hard threshold, the keep policy uses a threshold-or-top-k union to preserve rare but valid positives:

\[ g_i^B = \mathbf{1}\left[c_i\ge\theta\ \lor\ i\in\bigcup_q\operatorname{TopK}_q\right]. \]

This keeps recall on minority texture modes without allowing retrieval to dominate the global ranking, since Stage B remains a correction layer on top of the geometry backbone.

Stage D: Rubric Correction, Semantic Veto, and Overlay Audit

Stage D evaluates only the shortlisted subset and returns a rubric score \(r_i\), a categorical decision \(d_i\in\{\text{match},\text{borderline},\text{not\_match}\}\), and a set of flags \(F_i\). The correction term includes a small continuous adjustment and discrete decision offsets, while hard veto logic can cap scores when semantic dominance or boundary-audit failure is detected.

\[ \Delta_i^D = \eta_r\,\frac{r_i-50}{50} + \eta_m\,\mathbf{1}[d_i=\text{match}] - \eta_n\,\mathbf{1}[d_i=\text{not\_match}], \]

\[ \rho_i = \sum_{f\in F_i}\omega_f\,\mathbf{1}[f\in\mathcal{P}] + \rho_i^{\text{veto}}, \]

\(\mathcal{P}\) is the penalized flag set (e.g., object-centric, weak boundary, synthetic/collage cues), and \(\rho_i^{\text{veto}}\) is an additional downgrade under strict safety rules.

This stage should be read as de-risking rather than primary ranking. It limits semantic leakage and unsafe boundary interpretations while preserving the geometry-first ordering.

Pseudocode: Single-Image Scoring

The following pseudocode is the operational scoring path for one image, written to mirror the implemented behavior while staying model-level rather than code-level.

Algorithm 1: ScoreOneImage

Input: image x_i, proposal set P_i, configuration Theta
Output: final score s_i and diagnostics D_i

1:  Compute Stage-A statistics from P_i -> (N_i, l_i, m_i, q_i, h_i)
2:  Compute s_i^A and gate g_i^A
3:  Extract geometry terms -> (b_i, t_i, o_i, a_i, c_i, r_i, v_i)
4:  Compute base score s_i^base
5:  If Stage-B enabled and (g_i^A or unknown-allowed):
6:      compute retrieval contrast c_i and gate g_i^B
7:      map c_i to normalized correction Delta_i^B
8:  Else:
9:      set Delta_i^B = 0 and g_i^B = 0
10: If Stage-D enabled and g_i^B and x_i in shortlist:
11:     run rubric evaluator -> (r_i, d_i, F_i)
12:     compute Delta_i^D and penalties rho_i (including veto rules)
13: Else:
14:     set Delta_i^D = 0 and rho_i = 0
15: Fuse terms: s_i = clip(s_i^base + lambda_B Delta_i^B + lambda_D Delta_i^D - rho_i, 0, 100)
16: Emit s_i together with all intermediate diagnostics D_i

Pseudocode: Dataset Ranking and Review Subset

At dataset scale, the miner first scores all candidates, then applies a deterministic review-subset policy so the hosted gallery remains tractable while still representing selected, borderline, and rejected regimes.

Algorithm 2: RankDatasetAndBuildReviewSet

Input: dataset X = {x_i}_{i=1}^M, review limit L
Output: ranked manifest R and hosted review subset H

1:  For each image x_i in X:
2:      (s_i, D_i) <- ScoreOneImage(x_i)
3:  Sort all images by descending s_i to get ranked manifest R
4:  Assign status labels from score intervals (selected/borderline/rejected)
5:  If |R| <= L: return R as hosted set H
6:  Else build H with deterministic class-balanced allocation:
7:      allocate quotas across selected/borderline/rejected
8:      keep high-score hard negatives plus diverse low-score negatives
9:      cap to L and preserve deterministic ordering by score
10: Return full manifest R and hosted subset H

This separation between full ranking and hosted subset is the reason the site can stay lightweight without changing the underlying scoring behavior.

From Continuous Score to Practical Decisions

The score remains continuous so users can choose operating points by task. The hosted review uses fixed status intervals for rapid browsing, while mining workflows can use a broader operational threshold to favor recall. These are not conflicting definitions; they are two operating points on the same calibrated ranking axis.

Interpretability claim: every final score is tied to separable terms and explicit gates, so large changes can be traced to concrete evidence shifts rather than opaque model behavior.

Reproducibility Note

The implementation uses fixed configuration defaults and deterministic subset construction for the live 1500-image browser set, so the ranking profile and status composition are reproducible from the same manifest snapshot. This site is intentionally separated from ARCHITEXTURE outputs to keep conclusions independent and auditable.