FindLab Starry

From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR

Nan XU · Shiheng LI · Shengchao HOU

FindLab

Abstract

We propose a new approach for a practical two-stage Optical Music Recognition (OMR) pipeline, with a particular focus on its second stage. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphonic staff notation, especially piano scores, where voice separation and intra-measure timing are the main bottlenecks. Our approach formulates second-stage decoding as a structure decoding problem and uses topology recognition with probability-guided search (BeadSolver) as its core method. We also describe a data strategy that combines procedural generation with recognition-feedback annotations. The result is a practical decoding component for real OMR systems and a path to accumulate structured score data for future end-to-end, multimodal, and RL-style methods.

What problem we focus on

In complex piano notation, the main bottleneck is often not detecting individual noteheads, stems, or rests, but deciding how locally plausible events should be assembled into voices and timing. Several voices may overlap at nearly the same horizontal position, partial voices may appear only locally, and rhythmic logic can depend on tuplets, grace notes, or whole-measure rest conventions.
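As a concrete illustration of why rhythmic logic is non-local, consider tuplet arithmetic. The sketch below is illustrative only and assumes a resolution of 480 ticks per quarter note (a common MIDI convention, not something specified by Starry): a triplet eighth occupies fewer ticks than a plain eighth, so a decoder that ignores the tuplet ratio cannot make a measure add up.

```python
# Illustrative only: how tuplets change intra-measure timing arithmetic.
# Assumes 480 ticks per quarter note (a common MIDI choice, not Starry's).
from fractions import Fraction

TICKS_PER_QUARTER = 480

def duration_ticks(base_quarters: Fraction, actual: int = 1, normal: int = 1) -> int:
    """Ticks for a note lasting `base_quarters` quarters, scaled by an
    actual:normal tuplet ratio (e.g. 3:2 for a triplet)."""
    ticks = base_quarters * TICKS_PER_QUARTER * Fraction(normal, actual)
    assert ticks.denominator == 1, "non-integer tick duration"
    return int(ticks)

eighth = Fraction(1, 2)
print(duration_ticks(eighth))        # plain eighth: 240 ticks
print(duration_ticks(eighth, 3, 2))  # triplet eighth: 160 ticks
# Three triplet eighths fill exactly one quarter: 3 * 160 == 480
```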

Piano measure with multiple voices overlapping at adjacent horizontal positions.
Multi-voice overlap. Local geometry alone is not enough to decide voice continuation: events that appear close together horizontally may belong to different voices, while a usable score requires globally coherent voice chains and time positions.

System Overview

Starry treats OMR as a staged transformation from images to editable music structure. The visual system operates at per-page and per-staff levels to produce event candidates, and regulation resolves measure-level topology into a serialized representation that can be exported to standard music-language formats such as MusicXML and LilyPond.
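The staged data flow can be sketched roughly as follows. This is a hypothetical outline, not Starry's actual interfaces: the class and function names (`EventCandidate`, `regulate`, `export_musicxml`) are illustrative, and the real regulation and serialization stages are far richer.

```python
# Hypothetical sketch of the staged transformation described above; names
# and fields are illustrative, not the project's actual API. Each stage
# narrows commitment: pixels -> per-staff event candidates ->
# measure-level topology -> serialized score.
from dataclasses import dataclass, field

@dataclass
class EventCandidate:
    staff: int
    x: float          # horizontal position on the rectified staff
    pitch_step: int   # vertical offset in staff-line units
    confidence: float

@dataclass
class Measure:
    events: list = field(default_factory=list)
    voices: list = field(default_factory=list)  # lists of indices into events

def regulate(measure: Measure) -> Measure:
    # Placeholder for the regulation stage: assign events to voice chains.
    measure.voices = [list(range(len(measure.events)))]
    return measure

def export_musicxml(measures: list) -> str:
    # Placeholder serializer; a real exporter would emit full MusicXML.
    return f"<score-partwise>{len(measures)} measures</score-partwise>"
```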

System overview of the Starry pipeline from images to music-language formats.
Starry overview. A compact overview of the full pipeline, from images through visual candidate generation and regulation to downstream music-language formats.
Visual pipeline for candidate generation in the Starry OMR system.
Layout analysis

Find page-level staff regions and structural guide lines before staff-local recognition.

Layout predictor input page. Layout predictor output heatmap.
Staff straightening

Estimate staff distortion, then render a rectified staff crop for downstream predictors.

Gauge predictor input staff crop. Gauge predictor geometry output. Rectified straightened staff rendering.
Semantic heatmaps

Predict notation-class heatmaps that become local evidence for candidate assembly.

Semantic predictor input staff crop. Semantic predictor output heatmap composite.
Foreground mask

Separate notation foreground from the corrected staff image before symbol grouping.

Mask predictor input corrected staff. Mask predictor foreground output.
Brackets

Recognize staff grouping marks at the left edge of a system and serialize them as structural tokens.

Visual pipeline. The system first produces robust local evidence with layout analysis, staff processing, semantic heatmaps, OCR, and candidate assembly before any global structural commitment is made.
Assembly flow from semantic recognition to measure-level event candidates.
Assembly. Local semantic detections are grouped into measure-level event candidates, with attributes derived from geometric measurements and confidence cues from the recognition stage.

The Regulation Method: BeadSolver

BeadSolver treats measure-level decoding as a topology problem. Instead of asking the system to predict a complete polyphonic structure in one shot, it uses probability-guided tree search to explore candidate voice-chain assignments among event candidates, then selects a topology that is globally coherent in time.
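A minimal sketch of such probability-guided search, under assumptions of our own (this is not the actual BeadSolver): `successor_probs` stands in for the learned model, a special `END` marker closes the current voice chain, and a beam keeps several partial topologies alive until their measure-level consequences can be compared.

```python
# Minimal beam-search sketch (not the actual BeadSolver) over voice-chain
# assignments. `successor_probs(prefix, remaining)` stands in for the
# learned model and returns {event_or_END: probability}.
import heapq
import math
from itertools import count

END = "end-voice"          # voice-closing boundary marker (assumed token)
_tiebreak = count()        # keeps tuple comparisons total

def decode_measure(events, successor_probs, beam_width=8):
    """Return (neg_log_prob, ordering): the best assignment of events
    into voice chains, each chain terminated by an END marker."""
    beam = [(0.0, next(_tiebreak), (), frozenset(events))]
    best = None
    while beam:
        candidates = []
        for nll, _, prefix, remaining in beam:
            # A hypothesis is complete when every event is placed and the
            # last voice has been closed.
            if not remaining and (not prefix or prefix[-1] == END):
                if best is None or nll < best[0]:
                    best = (nll, prefix)
                continue
            for choice, p in successor_probs(prefix, remaining).items():
                # Disallow empty voices (two ENDs in a row) and zero-prob moves.
                if p <= 0 or (choice == END and prefix and prefix[-1] == END):
                    continue
                rem = remaining - {choice} if choice != END else remaining
                candidates.append((nll - math.log(p), next(_tiebreak),
                                   prefix + (choice,), rem))
        beam = heapq.nsmallest(beam_width, candidates)
    return best
```

The point of the sketch is the separation the text describes: the model only scores local continuations, while the search layer defers structural commitment until whole-measure hypotheses can be ranked.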

Raw measure with severe voice ambiguity.
Raw ambiguous measure
Target voice structure after regulation.
Regulated voice structure

The key idea is to separate local evidence from structural commitment. The model estimates which continuation is plausible at each step, while the solver keeps multiple possibilities alive long enough to compare their consequences at the measure level.

Voice chains as a topology view over event candidates.
Voice chains as topology. Events are linked into ordered voice-wise chains. In this view, the main question is not only which symbols were detected, but how they continue across the measure and how different voices interleave.
From chained topology to time-consistent measure structure.
From topology to timing. Once a candidate chain structure is fixed, the solver can derive ticks, durations, and voice-wise timelines, and reject solutions that do not yield a coherent measure-level interpretation.
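A coherence check of this kind can be sketched as follows. The semantics are assumed for illustration (this is not the actual Starry evaluator): every voice chain in a measure must tile the same tick span exactly, with no gaps, overlaps, or leftover duration.

```python
# Sketch of a measure-level coherence check (assumed semantics, not the
# actual Starry evaluator): every voice chain must sum to the measure span.
TICKS_PER_MEASURE = 1920  # 4/4 at 480 ticks per quarter (assumption)

def voice_is_coherent(durations, measure_ticks=TICKS_PER_MEASURE):
    """A voice chain is coherent if all durations are positive and they
    sum exactly to the measure length."""
    tick = 0
    for d in durations:
        if d <= 0:
            return False
        tick += d
    return tick == measure_ticks

def topology_is_coherent(voices):
    """A candidate topology is accepted only if every voice tiles the measure."""
    return all(voice_is_coherent(v) for v in voices)

# Two voices (quarter+quarter+half vs. half+half) fill 4/4; a dangling
# extra eighth note does not.
print(topology_is_coherent([[480, 480, 960], [960, 960]]))  # True
print(topology_is_coherent([[480, 480, 960, 240]]))         # False
```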

How BeadSolver Becomes an MDP

How can structure decoding across multiple voices be formulated as a Markov decision process? One useful intuition is to treat each barline as a space-time portal: the solver may finish one voice, jump through the barline, and continue from the beginning of another voice, while keeping the global measure structure consistent.

Animated intuition for multi-voice structure decoding as an MDP with barlines acting like space-time portals.
Barlines as portals. The animated path shows how decoding can move across voices while still building one coherent measure-level topology.
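The portal intuition can be written down as a toy state transition. This is our own formalization for illustration, not Starry's actual MDP: the state tracks the current voice's tick position and the still-unplaced events, and the "portal" action closes the voice at the barline and reopens decoding at tick 0 in a fresh voice.

```python
# Toy formalization (assumption, not Starry's actual MDP) of barlines as
# space-time portals. State = (tick in current voice, free events, voices
# closed so far); actions either consume an event or jump through the portal.
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    tick: int                 # position within the current voice
    free: frozenset           # events not yet committed to any voice
    closed_voices: int

PORTAL = "portal"

def step(state: State, action, duration_of):
    """Transition: consume an event in the current voice, or jump through
    the barline portal and restart at tick 0 in the next voice."""
    if action == PORTAL:
        return State(tick=0, free=state.free,
                     closed_voices=state.closed_voices + 1)
    return State(tick=state.tick + duration_of(action),
                 free=state.free - {action},
                 closed_voices=state.closed_voices)
```

Because every non-portal action removes an event from `free` and advances `tick`, a policy over these actions traces exactly the kind of multi-voice path the animation shows.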

Searching Visualization

The following examples visualize how different solvers explore measure-level topology candidates and converge to coherent voice structures. Click an image to play its animation.

BeadPicker

BeadPicker is the learned model inside the solver. At each step, it reads the current measure candidates together with the committed prefix, then estimates a probability distribution over which event should come next (including voice-closing boundary markers) and provides duration predispositions and tick estimates used by the evaluator.

BeadPicker Transformer architecture for topology recognition.
BeadPicker architecture. The model reads measure-level candidates, geometry, local hints, and prefix context, and produces successor probabilities and related fields used by the solver.
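The outputs described above can be pictured as a typed interface. The field and function names below are hypothetical, chosen for illustration rather than taken from the project's actual API.

```python
# Hypothetical typed view of the BeadPicker outputs described above;
# field names are illustrative, not the project's actual API.
from dataclasses import dataclass

@dataclass
class PickerOutput:
    successor_probs: dict    # candidate id (or END marker) -> probability
    duration_logits: dict    # candidate id -> scores over duration classes
    tick_estimate: float     # predicted onset tick for the next event

def pick_next(output: PickerOutput, committed: set):
    """Greedy read-out: highest-probability successor not yet committed.
    The full solver would instead feed these probabilities into search."""
    viable = {k: p for k, p in output.successor_probs.items()
              if k not in committed}
    return max(viable, key=viable.get)
```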

Showcases

The examples below show a simplified player for Starry recognition results. You can play back the preview output and switch between the original score image and the segmented, denoised view to compare the two visualizations.

Try it yourself

Try the live demo on Hugging Face: upload a score image or PDF for Starry recognition, inspect and edit the recognized result, and export it to MusicXML and other standard music notation formats.

🤗 Open the Starry✨ live demo

Links