FindLab Starry

From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR

Nan XU · Shiheng LI · Shengchao HOU

FindLab

Abstract

We propose a new approach for a practical two-stage Optical Music Recognition (OMR) pipeline, with a particular focus on its second stage. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphonic staff notation, especially piano scores, where voice separation and intra-measure timing are the main bottlenecks. Our approach formulates second-stage decoding as a structure decoding problem and uses topology recognition with probability-guided search (BeadSolver) as its core method. We also describe a data strategy that combines procedural generation with recognition-feedback annotations. The result is a practical decoding component for real OMR systems and a path to accumulate structured score data for future end-to-end, multimodal, and RL-style methods.

What problem we focus on

In complex piano notation, the main bottleneck is often not detecting individual noteheads, stems, or rests, but deciding how locally plausible events should be assembled into voices and timing. Several voices may overlap at nearly the same horizontal position, partial voices may appear only locally, and rhythmic logic can depend on tuplets, grace notes, or whole-measure rest conventions.
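As a concrete illustration of why rhythmic logic is non-local, consider tuplet arithmetic. The sketch below is illustrative only and assumes a resolution of 480 ticks per quarter note (a common MIDI convention, not something specified by Starry): a triplet eighth occupies fewer ticks than a plain eighth, so a decoder that ignores the tuplet ratio cannot make a measure add up.

```python
# Illustrative only: how tuplets change intra-measure timing arithmetic.
# Assumes 480 ticks per quarter note (a common MIDI choice, not Starry's).
from fractions import Fraction

TICKS_PER_QUARTER = 480

def duration_ticks(base_quarters: Fraction, actual: int = 1, normal: int = 1) -> int:
    """Ticks for a note lasting `base_quarters` quarters, scaled by an
    actual:normal tuplet ratio (e.g. 3:2 for a triplet)."""
    ticks = base_quarters * TICKS_PER_QUARTER * Fraction(normal, actual)
    assert ticks.denominator == 1, "non-integer tick duration"
    return int(ticks)

eighth = Fraction(1, 2)
print(duration_ticks(eighth))        # plain eighth: 240 ticks
print(duration_ticks(eighth, 3, 2))  # triplet eighth: 160 ticks
# Three triplet eighths fill exactly one quarter: 3 * 160 == 480
```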

Piano measure with multiple voices overlapping at adjacent horizontal positions.
Multi-voice overlap. Local geometry alone is not enough to decide voice continuation: events that appear close together horizontally may belong to different voices, while a usable score requires globally coherent voice chains and time positions.

System Overview

Starry treats OMR as a staged transformation from images to editable music structure. The visual system operates at per-page and per-staff levels to produce event candidates, and regulation resolves measure-level topology into a serialized representation that can be exported to standard music-language formats such as MusicXML and LilyPond.
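The staged data flow can be sketched roughly as follows. This is a hypothetical outline, not Starry's actual interfaces: the class and function names (`EventCandidate`, `regulate`, `export_musicxml`) are illustrative, and the real regulation and serialization stages are far richer.

```python
# Hypothetical sketch of the staged transformation described above; names
# and fields are illustrative, not the project's actual API. Each stage
# narrows commitment: pixels -> per-staff event candidates ->
# measure-level topology -> serialized score.
from dataclasses import dataclass, field

@dataclass
class EventCandidate:
    staff: int
    x: float          # horizontal position on the rectified staff
    pitch_step: int   # vertical offset in staff-line units
    confidence: float

@dataclass
class Measure:
    events: list = field(default_factory=list)
    voices: list = field(default_factory=list)  # lists of indices into events

def regulate(measure: Measure) -> Measure:
    # Placeholder for the regulation stage: assign events to voice chains.
    measure.voices = [list(range(len(measure.events)))]
    return measure

def export_musicxml(measures: list) -> str:
    # Placeholder serializer; a real exporter would emit full MusicXML.
    return f"<score-partwise>{len(measures)} measures</score-partwise>"
```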

System overview of the Starry pipeline from images to music-language formats.
Starry overview. A compact overview of the full pipeline, from images through visual candidate generation and regulation to downstream music-language formats.
Visual pipeline for candidate generation in the Starry OMR system.
Layout analysis

Find page-level staff regions and structural guide lines before staff-local recognition.

Layout predictor input page. Layout predictor output heatmap.
Staff straightening

Estimate staff distortion, then render a rectified staff crop for downstream predictors.

Gauge predictor input staff crop. Gauge predictor geometry output. Rectified straightened staff rendering.
Semantic heatmaps

Predict notation-class heatmaps that become local evidence for candidate assembly.

Semantic predictor input staff crop. Semantic predictor output heatmap composite.
Foreground mask

Separate notation foreground from the corrected staff image before symbol grouping.

Mask predictor input corrected staff. Mask predictor foreground output.
Brackets

Recognize staff grouping marks at the left edge of a system and serialize them as structural tokens.

Visual pipeline. The system first produces robust local evidence with layout analysis, staff processing, semantic heatmaps, OCR, and candidate assembly before any global structural commitment is made.
Assembly flow from semantic recognition to measure-level event candidates.
Assembly. Local semantic detections are grouped into measure-level event candidates, with attributes derived from geometric measurements and confidence cues from the recognition stage.

The Regulation Method: BeadSolver

BeadSolver treats measure-level decoding as a topology problem. Instead of asking the system to predict a complete polyphonic structure in one shot, it uses probability-guided tree search to explore candidate voice-chain assignments among event candidates, then selects a topology that is globally coherent in time.
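A minimal sketch of such probability-guided search, under assumptions of our own (this is not the actual BeadSolver): `successor_probs` stands in for the learned model, a special `END` marker closes the current voice chain, and a beam keeps several partial topologies alive until their measure-level consequences can be compared.

```python
# Minimal beam-search sketch (not the actual BeadSolver) over voice-chain
# assignments. `successor_probs(prefix, remaining)` stands in for the
# learned model and returns {event_or_END: probability}.
import heapq
import math
from itertools import count

END = "end-voice"          # voice-closing boundary marker (assumed token)
_tiebreak = count()        # keeps tuple comparisons total

def decode_measure(events, successor_probs, beam_width=8):
    """Return (neg_log_prob, ordering): the best assignment of events
    into voice chains, each chain terminated by an END marker."""
    beam = [(0.0, next(_tiebreak), (), frozenset(events))]
    best = None
    while beam:
        candidates = []
        for nll, _, prefix, remaining in beam:
            # A hypothesis is complete when every event is placed and the
            # last voice has been closed.
            if not remaining and (not prefix or prefix[-1] == END):
                if best is None or nll < best[0]:
                    best = (nll, prefix)
                continue
            for choice, p in successor_probs(prefix, remaining).items():
                # Disallow empty voices (two ENDs in a row) and zero-prob moves.
                if p <= 0 or (choice == END and prefix and prefix[-1] == END):
                    continue
                rem = remaining - {choice} if choice != END else remaining
                candidates.append((nll - math.log(p), next(_tiebreak),
                                   prefix + (choice,), rem))
        beam = heapq.nsmallest(beam_width, candidates)
    return best
```

The point of the sketch is the separation the text describes: the model only scores local continuations, while the search layer defers structural commitment until whole-measure hypotheses can be ranked.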

Raw measure with severe voice ambiguity.
Raw ambiguous measure
Target voice structure after regulation.
Regulated voice structure

The key idea is to separate local evidence from structural commitment. The model estimates which continuation is plausible at each step, while the solver keeps multiple possibilities alive long enough to compare their consequences at the measure level.

Voice chains as a topology view over event candidates.
Voice chains as topology. Events are linked into ordered voice-wise chains. In this view, the main question is not only which symbols were detected, but how they continue across the measure and how different voices interleave.
From chained topology to time-consistent measure structure.
From topology to timing. Once a candidate chain structure is fixed, the solver can derive ticks, durations, and voice-wise timelines, and reject solutions that do not yield a coherent measure-level interpretation.
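A coherence check of this kind can be sketched as follows. The semantics are assumed for illustration (this is not the actual Starry evaluator): every voice chain in a measure must tile the same tick span exactly, with no gaps, overlaps, or leftover duration.

```python
# Sketch of a measure-level coherence check (assumed semantics, not the
# actual Starry evaluator): every voice chain must sum to the measure span.
TICKS_PER_MEASURE = 1920  # 4/4 at 480 ticks per quarter (assumption)

def voice_is_coherent(durations, measure_ticks=TICKS_PER_MEASURE):
    """A voice chain is coherent if all durations are positive and they
    sum exactly to the measure length."""
    tick = 0
    for d in durations:
        if d <= 0:
            return False
        tick += d
    return tick == measure_ticks

def topology_is_coherent(voices):
    """A candidate topology is accepted only if every voice tiles the measure."""
    return all(voice_is_coherent(v) for v in voices)

# Two voices (quarter+quarter+half vs. half+half) fill 4/4; a dangling
# extra eighth note does not.
print(topology_is_coherent([[480, 480, 960], [960, 960]]))  # True
print(topology_is_coherent([[480, 480, 960, 240]]))         # False
```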

How BeadSolver Becomes an MDP

How can structure decoding across multiple voices be formulated as a Markov decision process? One useful intuition is to treat each barline as a space-time portal: the solver may finish one voice, jump through the barline, and continue from the beginning of another voice, while keeping the global measure structure consistent.

Animated intuition for multi-voice structure decoding as an MDP with barlines acting like space-time portals.
Barlines as portals. The animated path shows how decoding can move across voices while still building one coherent measure-level topology.
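The portal intuition can be written down as a toy state transition. This is our own formalization for illustration, not Starry's actual MDP: the state tracks the current voice's tick position and the still-unplaced events, and the "portal" action closes the voice at the barline and reopens decoding at tick 0 in a fresh voice.

```python
# Toy formalization (assumption, not Starry's actual MDP) of barlines as
# space-time portals. State = (tick in current voice, free events, voices
# closed so far); actions either consume an event or jump through the portal.
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    tick: int                 # position within the current voice
    free: frozenset           # events not yet committed to any voice
    closed_voices: int

PORTAL = "portal"

def step(state: State, action, duration_of):
    """Transition: consume an event in the current voice, or jump through
    the barline portal and restart at tick 0 in the next voice."""
    if action == PORTAL:
        return State(tick=0, free=state.free,
                     closed_voices=state.closed_voices + 1)
    return State(tick=state.tick + duration_of(action),
                 free=state.free - {action},
                 closed_voices=state.closed_voices)
```

Because every non-portal action removes an event from `free` and advances `tick`, a policy over these actions traces exactly the kind of multi-voice path the animation shows.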

Searching Visualization

The following examples visualize how different solvers explore measure-level topology candidates and converge to coherent voice structures. Click an image to play its animation.

BeadPicker

BeadPicker is the learned model inside the solver. At each step, it reads the current measure candidates together with the committed prefix, then estimates a probability distribution over which event should come next (including voice-closing boundary markers) and provides duration predispositions and tick estimates used by the evaluator.

BeadPicker Transformer architecture for topology recognition.
BeadPicker architecture. The model reads measure-level candidates, geometry, local hints, and prefix context, and produces successor probabilities and related fields used by the solver.
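The outputs described above can be pictured as a typed interface. The field and function names below are hypothetical, chosen for illustration rather than taken from the project's actual API.

```python
# Hypothetical typed view of the BeadPicker outputs described above;
# field names are illustrative, not the project's actual API.
from dataclasses import dataclass

@dataclass
class PickerOutput:
    successor_probs: dict    # candidate id (or END marker) -> probability
    duration_logits: dict    # candidate id -> scores over duration classes
    tick_estimate: float     # predicted onset tick for the next event

def pick_next(output: PickerOutput, committed: set):
    """Greedy read-out: highest-probability successor not yet committed.
    The full solver would instead feed these probabilities into search."""
    viable = {k: p for k, p in output.successor_probs.items()
              if k not in committed}
    return max(viable, key=viable.get)
```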

Showcases

The examples below show a simplified player for Starry recognition results. You can play back the preview output and switch between the original score image and the segmented, denoised view to compare the two visualizations.

Try it yourself

Try the live demo on Hugging Face: upload a score image or PDF for Starry recognition, inspect and edit the recognized result, and export it to MusicXML and other standard music notation formats.

🤗 Open the Starry✨ live demo

Links