Technical Deep-Dives
Evidence-backed justification for each major technical decision in our counter-proposal. Citations, benchmarks, and code samples included.
Deep Dive 01
Determinism: Tests vs. Formal Proofs
Your specification requires Coq machine-checked proofs for determinism verification. We propose N=10 reproducibility test suites with CI gates—the industry standard for ML pipelines, saving 6+ months and eliminating specialist dependencies.
Coq Formal Proofs
5 machine-checked lemmas:
- CRT_Uniqueness
- RNS_NoOverflow
- GETA_Invariant_NoOverflow
- Quantization_Determinism
- Determinism_Composition
Risk: Requires scarce Coq specialist
Test Suite (N=10)
Reproducibility verification:
- Run pipeline N times, compare BLAKE3 hashes
- Cross-node verification (different hardware)
- CI gate on every PR (N=3 for speed)
- Release gate (N=10, multi-node)
Risk: Minimal (standard skills, proven approach)
Key Evidence
"When there is intentional randomness in training and testing, this repeatability can be obtained by controlling random seeds as a means to achieve a deterministic ordering of results."
— Mellinger et al., Carnegie Mellon SEI, January 2025
"The Myth of Machine Learning Non-Reproducibility" (SEI/CMU 2025)
Supported by: AI Magazine: ML Reproducibility (Wiley, April 2025)
```python
import os

import torch


def verify_determinism(manifest_id: str, n_runs: int = 10) -> bool:
    """Run pipeline N times, verify identical BLAKE3 hashes.

    set_deterministic_mode, derive_seed, run_pipeline, and blake3_hash are
    pipeline helpers (set_deterministic_mode wraps set_deterministic below).
    """
    hashes = set()
    for _ in range(n_runs):
        set_deterministic_mode(seed=derive_seed(manifest_id))
        result = run_pipeline(manifest_id)
        hashes.add(blake3_hash(result))
    return len(hashes) == 1  # All runs identical


# PyTorch deterministic mode (Official Docs v2.9)
def set_deterministic(seed):
    torch.manual_seed(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```
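For concreteness, a minimal sketch of how this check could back the CI and release gates above. The manifest ID, the RELEASE_GATE environment flag, and the pytest wiring are illustrative assumptions, not part of the specification:

```python
import os

import pytest

PR_GATE_RUNS = 3        # fast determinism check on every PR
RELEASE_GATE_RUNS = 10  # full multi-run check before a release

@pytest.mark.parametrize("manifest_id", ["golden_manifest_small"])  # hypothetical manifest
def test_determinism_gate(manifest_id):
    n_runs = RELEASE_GATE_RUNS if os.environ.get("RELEASE_GATE") else PR_GATE_RUNS
    assert verify_determinism(manifest_id, n_runs=n_runs)
```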
Key Question for Discussion
"If we demonstrate that running the same pipeline 10 times on different nodes produces identical BLAKE3 hashes every time, would that satisfy your determinism requirement? Or is there a specific failure mode Coq proofs protect against that test suites don't?"
Escape Hatch
If Coq proofs are truly required post-MVP, they can be added later without changing the runtime architecture. The test suite provides immediate confidence; proofs can provide additional assurance for regulatory or contractual needs.
Deep Dive 02
Aggregation: Welford's vs. RNS
Your specification requires RNS residue arithmetic with Q31.32 fixed-point for incremental aggregation. We propose Welford's online algorithm—proven since 1962, used by PyTorch/NumPy/Pandas internally, achieving the same O(1) updates.
RNS + Q31.32
- Residue Number System arithmetic
- Q31.32 fixed-point representation
- CRT reconstruction for results (sketched below)
- 128-bit intermediate precision
Purpose: FPGA/ASIC hardware optimization
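For comparison only, a minimal sketch of what residue encoding plus CRT reconstruction involves. The moduli are small Mersenne primes chosen for illustration and the inputs are plain integers; the specified design would use wider moduli and Q31.32 fixed-point values:

```python
from math import prod

MODULI = (2**13 - 1, 2**17 - 1, 2**19 - 1)  # small Mersenne primes, pairwise coprime

def to_residues(x: int) -> tuple[int, ...]:
    """Encode an integer as its residues modulo each channel."""
    return tuple(x % m for m in MODULI)

def add_residues(a, b):
    """Addition happens independently in each residue channel."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, MODULI))

def crt_reconstruct(residues) -> int:
    """Recover the integer via the Chinese Remainder Theorem."""
    M = prod(MODULI)
    total = 0
    for r, m in zip(residues, MODULI):
        Mi = M // m
        total += r * Mi * pow(Mi, -1, m)  # pow(Mi, -1, m) is the modular inverse
    return total % M

# 1_000_000 + 2_500_000 survives the round trip through residue space
assert crt_reconstruct(add_residues(to_residues(1_000_000), to_residues(2_500_000))) == 3_500_000
```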
Welford's Algorithm
- ~100 lines of Python
- Numerically stable (avoids catastrophic cancellation)
- O(1) per update—same as RNS
- IEEE 754 float64 + deterministic mode
Purpose: Software MVP, hardware-ready interfaces
Key Evidence
"This algorithm is much less prone to loss of precision due to catastrophic cancellation... It is a special case of an algorithm for computing arbitrary order moments."
— Wikipedia, "Algorithms for calculating variance"
Reference to Welford (1962), West (1979)
```python
import numpy as np


class IncrementalMelAggregator:
    """O(1) voice addition using Welford's online algorithm."""

    def __init__(self, n_mels=128, n_frames=100):
        self.n = 0
        self.mean = np.zeros((n_mels, n_frames), dtype=np.float64)
        self.M2 = np.zeros((n_mels, n_frames), dtype=np.float64)

    def add_voice(self, mel_spectrogram, weight=1.0):
        """Add a voice contribution in O(1) time (weight is reserved for a
        future weighted variant and unused here)."""
        self.n += 1
        delta = mel_spectrogram - self.mean
        self.mean += delta / self.n           # Welford's mean update
        delta2 = mel_spectrogram - self.mean
        self.M2 += delta * delta2             # accumulates sum of squared deviations
```
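A short usage sketch; the synthetic input and the sample-variance convention (dividing M2 by n - 1) are illustrative assumptions:

```python
rng = np.random.default_rng(seed=0)
agg = IncrementalMelAggregator(n_mels=128, n_frames=100)
for _ in range(1000):                        # illustrative stream of voices
    agg.add_voice(rng.random((128, 100)))    # stand-in for a real mel spectrogram

mean_mel = agg.mean                          # running mean, available at any point
var_mel = agg.M2 / max(agg.n - 1, 1)         # per-bin sample variance from Welford's M2
```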
Key Question for Discussion
"Is hardware acceleration (FPGA/ASIC) in scope for this phase, or is it a future consideration? For software MVP, RNS adds complexity without performance benefit."
Escape Hatch
We'll design clean interfaces that allow RNS substitution when hardware targets are defined. Welford's for MVP software; RNS reserved for FPGA roadmap.
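A sketch of what those interfaces could look like; the names (MelAggregator, build_aggregator) are hypothetical, not part of the specification:

```python
from typing import Protocol

import numpy as np

class MelAggregator(Protocol):
    """Anything that can absorb voice contributions one at a time."""
    mean: np.ndarray
    def add_voice(self, mel_spectrogram: np.ndarray, weight: float = 1.0) -> None: ...

def build_aggregator(backend: str = "welford") -> MelAggregator:
    """Factory stub: Welford's for the software MVP; RNS slot reserved."""
    if backend == "welford":
        return IncrementalMelAggregator()  # Welford's implementation shown above
    raise NotImplementedError("RNS + Q31.32 backend is reserved for the FPGA roadmap")
```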
Deep Dive 03
Denoising: DeepFilterNet vs. Ensemble
Your specification calls for a FullSubNet + DCCRN + MetricGAN-U ensemble. DeepFilterNet3 achieves better quality scores as a single model, runs 25× faster, and simplifies determinism verification.
3-Model Ensemble
- FullSubNet (PESQ ~2.78)
- DCCRN (PESQ ~2.68)
- MetricGAN-U (PESQ ~2.86)
- 3× compute, 3× determinism surface
RTF: ~1.03 (combined; sum of the per-model RTFs in the table below)
DeepFilterNet3
- Single state-of-the-art model
- PESQ 3.17 (better than ensemble)
- Deep filtering, not complex masking
- Production-deployed by multiple companies
RTF: 0.04 (25× faster)
| Model | PESQ | STOI | RTF (CPU) | Complexity |
|---|---|---|---|---|
| FullSubNet | 2.78 | 0.93 | 0.35 | HIGH |
| DCCRN | 2.68 | 0.94 | 0.28 | MEDIUM |
| MetricGAN-U | 2.86 | 0.93 | 0.40 | HIGH |
| DeepFilterNet3 | 3.17 | 0.944 | 0.04 | LOW |
Key Evidence
"DeepFilterNet excels in noise reduction... achieving state-of-the-art speech enhancement benchmarks while achieving a real-time-factor of 0.19 on a single threaded notebook CPU."
— Schröter et al., ICASSP 2022, Interspeech 2023
"DeepFilterNet: Perceptually Motivated Real-Time Speech Enhancement"
Key Question for Discussion
"Can we validate DeepFilterNet on your actual audio samples? If it achieves MOS ≥ 3.5 on your test set, would a single-model approach be acceptable?"
Quality Escalation Path
Week 2: Test on your audio. Week 6: Target MOS 3.3. Week 8: Target MOS 3.5. If DeepFilterNet alone doesn't hit the bar: (1) WPE tuning, (2) spectral post-filtering, (3) lightweight secondary model. We won't start complex, but we have a clear path.
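A hedged sketch of what the Week 2 validation could look like: plug a DeepFilterNet3 wrapper into enhance_fn and score it against clean references. The directory layout, the 16 kHz sample rate, and the use of wideband PESQ as a proxy for the MOS targets are our assumptions:

```python
from pathlib import Path

import numpy as np
import soundfile as sf
from pesq import pesq  # ITU-T P.862 scorer from the `pesq` package

def score_enhancement(enhance_fn, noisy_dir: Path, clean_dir: Path, sr: int = 16000) -> float:
    """Average wideband PESQ of enhance_fn over paired noisy/clean WAV files."""
    scores = []
    for noisy_path in sorted(noisy_dir.glob("*.wav")):
        noisy, _ = sf.read(noisy_path, dtype="float32")
        clean, _ = sf.read(clean_dir / noisy_path.name, dtype="float32")
        enhanced = enhance_fn(noisy, sr)                  # e.g. a DeepFilterNet3 wrapper
        n = min(len(enhanced), len(clean))
        scores.append(pesq(sr, clean[:n], enhanced[:n], "wb"))
    return float(np.mean(scores))
```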
Deep Dive 04
Scale: 100K vs. 5M for MVP
Your specification targets 5M voices for MVP. Because aggregation is O(1), 100K proves the same architecture as 5M—scaling is linear storage, not algorithmic change. We include a 10M synthetic DiskANN PoC to validate billion-scale.
5M Voices MVP
- Full-scale data ingestion
- Production storage provisioning
- Longer test runtime
- More synthetic data generation
What it proves: Same as 100K
100K + 10M PoC
- 100K real voices: full pipeline validated
- 10M synthetic embeddings: DiskANN PoC
- Architecture proven for billion-scale
- Faster bug discovery at smaller scale
What it proves: Identical architecture
Scale-up path: Linear storage, same code
| Validation Target | What It Proves |
|---|---|
| 10K voices | Basic pipeline works |
| 100K voices | Incremental aggregation at scale |
| 1M voices | Sharding strategy validated |
| 5M voices | Same as 1M, just more data |
Why O(1) Changes Everything
Because Welford's algorithm has O(1) time complexity per update, the cost of adding voice N is identical whether N=100,001 or N=5,000,001. There is no algorithmic threshold where behavior changes.
— Cone Red Technical Analysis
Validated by: "DiskANN: Fast Accurate Billion-point Nearest Neighbor Search on a Single Node" (NeurIPS 2019) — proves commodity hardware handles billion-scale indexes.
A bug found at 5M would exist at 100K. Scale-up is storage, not code.
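An illustrative micro-benchmark of that claim; the population sizes and spectrogram shape are arbitrary, and IncrementalMelAggregator is the Welford's class from Deep Dive 02:

```python
import time

import numpy as np

def time_one_update(voices_already_aggregated: int) -> float:
    agg = IncrementalMelAggregator()       # Welford's aggregator from Deep Dive 02
    agg.n = voices_already_aggregated      # simulate an existing population of size N
    voice = np.random.default_rng(0).random((128, 100))
    start = time.perf_counter()
    agg.add_voice(voice)                   # cost depends on array shape, not on N
    return time.perf_counter() - start

# Both calls take roughly the same time on typical hardware
print(time_one_update(100_001), time_one_update(5_000_001))
```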
Key Question for Discussion
"What's driving the 5M target? Is it a contractual commitment, or an architecture validation goal? If architecture validation, 100K + 10M synthetic PoC achieves the same confidence faster."
Middle Ground Option
If 5M is a hard requirement, we can accommodate it. But we recommend proving architecture at 100K first, then scaling—it's faster to find and fix issues at smaller scale.