Deep Dive 01

Determinism: Tests vs. Formal Proofs

Your specification requires Coq machine-checked proofs for determinism verification. We propose an N=10 reproducibility test suite with CI gates, the industry standard for ML pipelines, saving 6+ months and removing the dependency on scarce formal-methods specialists.

Your Specification

Coq Formal Proofs

5 machine-checked lemmas:

  • CRT_Uniqueness
  • RNS_NoOverflow
  • GETA_Invariant_NoOverflow
  • Quantization_Determinism
  • Determinism_Composition

Risk: Requires scarce Coq specialist

Our Proposal

Test Suite (N=10)

Reproducibility verification:

  • Run pipeline N times, compare BLAKE3 hashes
  • Cross-node verification (different hardware)
  • CI gate on every PR (N=3 for speed)
  • Release gate (N=10, multi-node)

Risk: Standard skills, proven approach

Key Evidence

"When there is intentional randomness in training and testing, this repeatability can be obtained by controlling random seeds as a means to achieve a deterministic ordering of results."

— Mellinger et al., "The Myth of Machine Learning Non-Reproducibility," Carnegie Mellon SEI, January 2025
Supported by: AI Magazine: ML Reproducibility (Wiley, April 2025)

Determinism Test Harness Python
def verify_determinism(manifest_id: str, n_runs: int = 10) -> bool:
    """Run the pipeline N times and verify all BLAKE3 hashes are identical.

    derive_seed, set_deterministic_mode, run_pipeline, and blake3_hash are
    pipeline-specific helpers (set_deterministic_mode wraps set_deterministic below).
    """
    hashes = set()
    for _ in range(n_runs):
        set_deterministic_mode(seed=derive_seed(manifest_id))
        result = run_pipeline(manifest_id)
        hashes.add(blake3_hash(result))

    return len(hashes) == 1  # All runs produced the same hash

# PyTorch deterministic mode (Official Docs v2.9)
import torch
import os

def set_deterministic(seed):
    torch.manual_seed(seed)                           # seed all RNGs (CPU and CUDA)
    torch.use_deterministic_algorithms(True)          # error out on nondeterministic ops
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False            # disable autotuned (nondeterministic) kernel selection
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"  # required for deterministic cuBLAS

Key Question for Discussion

"If we demonstrate that running the same pipeline 10 times on different nodes produces identical BLAKE3 hashes every time, would that satisfy your determinism requirement? Or is there a specific failure mode Coq proofs protect against that test suites don't?"

Escape Hatch

If Coq proofs are truly required post-MVP, they can be added later without changing the runtime architecture. The test suite provides immediate confidence; proofs can provide additional assurance for regulatory or contractual needs.

Deep Dive 02

Aggregation: Welford's vs. RNS

Your specification requires RNS residue arithmetic with Q31.32 fixed-point for incremental aggregation. We propose Welford's online algorithm: proven since 1962, used internally by PyTorch, NumPy, and Pandas, and offering the same O(1) update cost.

Your Specification

RNS + Q31.32

  • Residue Number System arithmetic
  • Q31.32 fixed-point representation
  • CRT reconstruction for results
  • 128-bit intermediate precision

Purpose: FPGA/ASIC hardware optimization
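For context on what the RNS path entails, here is a minimal sketch of residue encoding and CRT reconstruction. The moduli are illustrative primes near 2^31 chosen for this example only; they are not the moduli your specification mandates.

RNS/CRT Round-Trip Sketch Python
# Three pairwise-coprime moduli, chosen for illustration only.
MODULI = (2**31 - 1, 2**31 - 19, 2**31 - 61)

def to_residues(x: int) -> tuple:
    """Encode an integer as its residues modulo each modulus."""
    return tuple(x % m for m in MODULI)

def from_residues(residues) -> int:
    """Reconstruct the integer via the Chinese Remainder Theorem."""
    M = 1
    for m in MODULI:
        M *= m
    x = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # pow(..., -1, m) is the modular inverse (Python 3.8+)
    return x % M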

Our Proposal

Welford's Algorithm

  • ~100 lines of Python
  • Numerically stable (avoids catastrophic cancellation)
  • O(1) per update—same as RNS
  • IEEE 754 float64 + deterministic mode

Purpose: Software MVP, hardware-ready interfaces

Key Evidence

"This algorithm is much less prone to loss of precision due to catastrophic cancellation... It is a special case of an algorithm for computing arbitrary order moments."

— Wikipedia, "Algorithms for calculating variance"
Reference to Welford (1962), West (1979)
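A toy illustration of the cancellation the quote refers to, using synthetic numbers: with a large common offset, the naive sum-of-squares formula loses most of its accuracy, while a numerically stable computation does not.

Catastrophic Cancellation Demo Python
import numpy as np

# Three samples with a large common offset and a tiny spread (true variance = 2/3).
x = 1e8 + np.array([0.0, 1.0, 2.0], dtype=np.float64)

naive_var = np.mean(x**2) - np.mean(x)**2   # sum-of-squares formula: cancellation destroys precision
stable_var = np.var(x)                      # two-pass computation stays accurate

print(naive_var, stable_var)                # naive result is far from 0.666..., stable result is not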

Welford's Algorithm Implementation Python
import numpy as np

class IncrementalMelAggregator:
    """O(1) voice addition using Welford's online algorithm (weighted form per West 1979)"""

    def __init__(self, n_mels=128, n_frames=100):
        self.n = 0          # number of voices added
        self.w_sum = 0.0    # running sum of weights (equals n when all weights are 1.0)
        self.mean = np.zeros((n_mels, n_frames), dtype=np.float64)
        self.M2 = np.zeros((n_mels, n_frames), dtype=np.float64)

    def add_voice(self, mel_spectrogram, weight=1.0):
        """Add a voice contribution in O(1) time"""
        self.n += 1
        self.w_sum += weight
        delta = mel_spectrogram - self.mean
        self.mean += (weight / self.w_sum) * delta   # Welford/West mean update
        delta2 = mel_spectrogram - self.mean
        self.M2 += weight * delta * delta2           # weighted sum of squared deviations

    def variance(self):
        """Aggregate (population) variance per mel bin and frame."""
        return self.M2 / self.w_sum
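Illustrative usage, continuing from the class and numpy import above; the random spectrograms and seed are synthetic placeholders for real per-voice mel data.

Usage Sketch Python
rng = np.random.default_rng(0)
agg = IncrementalMelAggregator(n_mels=128, n_frames=100)

for _ in range(1_000):                               # stand-in for real per-voice mel spectrograms
    agg.add_voice(rng.standard_normal((128, 100)))

print(agg.n, agg.mean.shape, float(agg.variance().mean()))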

Key Question for Discussion

"Is hardware acceleration (FPGA/ASIC) in scope for this phase, or is it a future consideration? For software MVP, RNS adds complexity without performance benefit."

Escape Hatch

We'll design clean interfaces that allow RNS substitution when hardware targets are defined. Welford's for MVP software; RNS reserved for FPGA roadmap.

Deep Dive 03

Denoising: DeepFilterNet vs. Ensemble

Your specification calls for a FullSubNet + DCCRN + MetricGAN-U ensemble. DeepFilterNet3 achieves better quality scores as a single model, runs 25× faster, and simplifies determinism verification.

Your Specification

3-Model Ensemble

  • FullSubNet (PESQ ~2.78)
  • DCCRN (PESQ ~2.68)
  • MetricGAN-U (PESQ ~2.86)
  • 3× compute, 3× determinism surface

RTF: 0.35+ (combined)

Our Proposal

DeepFilterNet3

  • Single state-of-the-art model
  • PESQ 3.17 (better than ensemble)
  • Deep filtering, not complex masking
  • Production-deployed by multiple companies

RTF: 0.04 (25× faster)

Model            PESQ   STOI    RTF (CPU)   Complexity
FullSubNet       2.78   0.93    0.35        HIGH
DCCRN            2.68   0.94    0.28        MEDIUM
MetricGAN-U      2.86   0.93    0.40        HIGH
DeepFilterNet3   3.17   0.944   0.04        LOW
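RTF in the table above is processing time divided by audio duration (lower is better, below 1.0 is faster than real time). A model-agnostic sketch for measuring it on your own samples; enhance_fn stands in for whichever denoiser is under test and is not a specific library API.

RTF Measurement Sketch Python
import time
import numpy as np

def measure_rtf(enhance_fn, audio: np.ndarray, sample_rate: int) -> float:
    """Real-time factor = wall-clock processing time / audio duration."""
    start = time.perf_counter()
    enhance_fn(audio)                                # any denoiser exposed as a callable
    elapsed = time.perf_counter() - start
    return elapsed / (len(audio) / sample_rate)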

Key Evidence

"DeepFilterNet excels in noise reduction... achieving state-of-the-art speech enhancement benchmarks while achieving a real-time-factor of 0.19 on a single threaded notebook CPU."

— Schröter et al., ICASSP 2022, Interspeech 2023
"DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement"

Key Question for Discussion

"Can we validate DeepFilterNet on your actual audio samples? If it achieves MOS ≥ 3.5 on your test set, would a single-model approach be acceptable?"

Quality Escalation Path

  • Week 2: Test on your audio samples
  • Week 6: Target MOS 3.3
  • Week 8: Target MOS 3.5

If DeepFilterNet alone doesn't hit the bar, the escalation order is (1) WPE tuning, (2) spectral post-filtering, (3) a lightweight secondary model. We won't start complex, but we have a clear path; a proxy quality gate is sketched below.
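A sketch of the kind of automated gate we would run at each milestone, assuming the pesq PyPI package (an ITU-T P.862 wrapper) as a cheap proxy score. The true acceptance metric is listening-test MOS; the 3.5 threshold and 16 kHz sample rate are taken from the plan above, and the function name is illustrative.

Quality Gate Sketch Python
import numpy as np
from pesq import pesq        # assumption: the `pesq` PyPI package (ITU-T P.862 wrapper)

def passes_quality_gate(clean: np.ndarray, enhanced: np.ndarray,
                        sample_rate: int = 16_000, threshold: float = 3.5) -> bool:
    """Wideband PESQ as a proxy gate; requires a clean reference signal.

    Final sign-off still uses listening-test MOS on your audio.
    """
    score = pesq(sample_rate, clean, enhanced, 'wb')
    return score >= threshold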

Deep Dive 04

Scale: 100K vs. 5M for MVP

Your specification targets 5M voices for the MVP. Because aggregation is O(1), 100K proves the same architecture as 5M; scaling up is linear storage growth, not an algorithmic change. We include a 10M-synthetic-embedding DiskANN PoC to validate the path to billion-scale.

Your Specification

5M Voices MVP

  • Full-scale data ingestion
  • Production storage provisioning
  • Longer test runtime
  • More synthetic data generation

What it proves: Same as 100K

Our Proposal

100K + 10M PoC

  • 100K real voices: full pipeline validated
  • 10M synthetic embeddings: DiskANN PoC
  • Architecture proven for billion-scale
  • Faster bug discovery at smaller scale

What it proves: Identical architecture

Scale-up path: Linear storage, same code
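For the 10M synthetic-embedding DiskANN PoC above, a minimal sketch of a reproducible embedding generator. The dimension, batch size, and seed are placeholders, and the DiskANN index build itself is a separate step not shown here.

Synthetic Embedding Generator Sketch Python
import numpy as np

def synthetic_embeddings(n: int, dim: int = 256, seed: int = 0, batch: int = 100_000):
    """Yield reproducible batches of unit-norm float32 embeddings for index-building PoCs."""
    rng = np.random.default_rng(seed)
    for start in range(0, n, batch):
        x = rng.standard_normal((min(batch, n - start), dim)).astype(np.float32)
        x /= np.linalg.norm(x, axis=1, keepdims=True)   # normalize rows to unit length
        yield x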

Validation Target   What It Proves
10K voices          Basic pipeline works
100K voices         Incremental aggregation at scale
1M voices           Sharding strategy validated
5M voices           Same as 1M, just more data

Why O(1) Changes Everything

Because Welford's algorithm has O(1) time complexity per update, the cost of adding voice N is identical whether N=100,001 or N=5,000,001. There is no algorithmic threshold where behavior changes.

— Cone Red Technical Analysis
Validated by: "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search" (NeurIPS 2019) — proves commodity hardware handles billion-scale indexes. Bug found at 5M would exist at 100K. Scale-up is storage, not code.

Key Question for Discussion

"What's driving the 5M target? Is it a contractual commitment, or an architecture validation goal? If architecture validation, 100K + 10M synthetic PoC achieves the same confidence faster."

Middle Ground Option

If 5M is a hard requirement, we can accommodate it. But we recommend proving architecture at 100K first, then scaling—it's faster to find and fix issues at smaller scale.