An engineering foundation model cannot be trained from CAD files alone. The basic record should be an engineering episode: geometry, semantic tags, mesh, material, boundary conditions, solver provenance, convergence logs, field outputs, scalar quantities of interest, uncertainty, validation data, and the design decision that used the result.
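As a rough illustration of what such an episode record might look like, the sketch below mirrors the components listed above as a Python dataclass. The class name `EngineeringEpisode`, every field name, and the `missing()` helper are assumptions made for illustration, not an established schema. Large artifacts (geometry, mesh, fields, logs) are assumed to be stored elsewhere and referenced by path or ID.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class EngineeringEpisode:
    """One learnable engineering episode (hypothetical schema, not a standard)."""

    geometry_ref: Optional[str] = None            # path or ID of the geometry
    semantic_tags: list[str] = field(default_factory=list)
    mesh_ref: Optional[str] = None
    material: Optional[str] = None
    boundary_conditions: dict[str, Any] = field(default_factory=dict)
    solver_provenance: dict[str, str] = field(default_factory=dict)
    convergence_log_ref: Optional[str] = None
    field_outputs_ref: Optional[str] = None
    qois: dict[str, float] = field(default_factory=dict)   # scalar quantities of interest
    uncertainty: dict[str, float] = field(default_factory=dict)
    validation_ref: Optional[str] = None
    design_decision: Optional[str] = None

    def missing(self) -> list[str]:
        """Return the names of components that are still empty."""
        empty = []
        for f in fields(self):
            value = getattr(self, f.name)
            # A component counts as missing if it is None or an empty container/string.
            if value is None or (hasattr(value, "__len__") and len(value) == 0):
                empty.append(f.name)
        return empty


# Usage with made-up values: a record that so far holds only CAD and one QoI.
episode = EngineeringEpisode(
    geometry_ref="bracket_v3.step",               # hypothetical file name
    qois={"max_displacement_mm": 0.42},           # hypothetical value
)
print(episode.missing())  # lists every component not yet captured
```

A completeness check of this kind makes the gap between a CAD-only record and a full episode explicit before the record enters a training set.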
Recent work on physics models makes the point concrete. The Well provides 15 TB of diverse physics simulations; PDEBench standardizes PDE tasks; PhysiX and GPhyT point toward physics foundation models; and bias-aware evaluation from 2026 shows that current models generalize conditionally rather than universally. How well a model transfers depends on the data distribution, the physical regimes covered, temporal scale, initial-condition complexity, and how OOD splits are constructed.
Minimum data conditions
- Editable geometry or B-rep plus mesh and semantic feature tags.
- Boundary and initial conditions with provenance.
- Material and manufacturing assumptions.
- Solver version, mesh strategy, convergence, residuals, warnings, and failed runs.
- Field outputs plus engineering QoIs such as drag, pressure drop, hot spot temperature, displacement, reaction force, or margin.
- Multi-fidelity level: screening, design review, validated model, or operational evidence.
- Regime-aware train/test/OOD splits rather than random splits alone.
- Uncertainty, validation evidence, license, security, and traceability to requirements.
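The regime-aware split in the list above can be sketched as follows. This is a minimal illustration under stated assumptions: episodes are plain dictionaries, the regime is a single scalar (a Reynolds number here, chosen only as an example), and the function name `regime_aware_split` and its parameters are hypothetical. Real regimes may be multi-dimensional and need a more careful partition.

```python
import random


def regime_aware_split(episodes, regime_key, ood_range, test_fraction=0.2, seed=0):
    """Split episodes into train, test, and OOD sets.

    Episodes whose regime value lies inside ood_range (inclusive) are held out
    entirely as OOD. The remaining episodes are shuffled and split randomly
    into in-distribution train and test sets.
    """
    lo, hi = ood_range
    ood = [e for e in episodes if lo <= e[regime_key] <= hi]
    in_dist = [e for e in episodes if not (lo <= e[regime_key] <= hi)]

    rng = random.Random(seed)          # local RNG keeps the split reproducible
    rng.shuffle(in_dist)
    n_test = int(round(test_fraction * len(in_dist)))
    return {"train": in_dist[n_test:], "test": in_dist[:n_test], "ood": ood}


# Hypothetical episodes tagged with a Reynolds number as the regime variable.
reynolds_values = [1e2, 5e2, 1e3, 5e3, 1e4, 5e4, 1e5, 5e5, 1e6, 5e6]
episodes = [{"id": i, "reynolds": re} for i, re in enumerate(reynolds_values)]

# Hold out the high-Reynolds regime so the model never sees it in training.
splits = regime_aware_split(episodes, "reynolds", ood_range=(1e5, 1e7))
print({name: len(items) for name, items in splits.items()})
```

Because the OOD set is defined by regime rather than drawn at random, a gap between test and OOD error measures how far the model extrapolates, which a purely random split cannot show.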
For RHX, the data strategy should be to accumulate learnable engineering episodes: Plan creates requirements and decision context, Sim creates load cases and physical evidence, Render connects geometry and material state to review context, and prototype tests close the loop.