Data Requirements for an Engineering Foundation Model

An engineering foundation model cannot be trained from CAD files alone. The basic record should be an engineering episode: geometry, semantic tags, mesh, material, boundary conditions, solver provenance, convergence logs, field outputs, scalar quantities of interest, uncertainty, validation data, and the design decision that used the result.
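
To make the idea of an episode record concrete, the components listed above could be gathered into a single typed structure. The following is only a minimal sketch: the class name, field names, and the use of string references for large artifacts are assumptions made for illustration, not a schema defined in this document.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class EngineeringEpisode:
    """One learnable record. All names here are hypothetical."""
    geometry_ref: Optional[str] = None        # path or ID of the CAD / B-rep file
    semantic_tags: dict[str, str] = field(default_factory=dict)
    mesh_ref: Optional[str] = None
    material: dict[str, Any] = field(default_factory=dict)
    boundary_conditions: dict[str, Any] = field(default_factory=dict)
    solver_provenance: dict[str, str] = field(default_factory=dict)
    convergence_log_ref: Optional[str] = None
    field_outputs_ref: Optional[str] = None
    scalar_qois: dict[str, float] = field(default_factory=dict)
    uncertainty: dict[str, float] = field(default_factory=dict)
    validation_ref: Optional[str] = None
    design_decision: Optional[str] = None


def missing_parts(episode: EngineeringEpisode) -> list[str]:
    """Return the names of fields that are still empty (None, '' or {})."""
    return [f.name for f in fields(episode) if not getattr(episode, f.name)]


# A record that holds only a CAD file (the file name is made up).
cad_only = EngineeringEpisode(geometry_ref="bracket_v3.step")
print(missing_parts(cad_only))
```

Run on a record that contains only geometry, the helper lists every other component as missing, which is the gap between a CAD archive and a trainable episode.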

Recent work on physics models highlights the issue. The Well provides 15 TB of diverse physics simulations; PDEBench standardizes PDE tasks; PhysiX and GPhyT point toward physics foundation models; and 2026 bias-aware evaluation work shows that current models are conditional generalists rather than universal ones. How well a model transfers depends on the data distribution, the physical regimes covered, the temporal scale, the complexity of the initial conditions, and how out-of-distribution (OOD) splits are constructed.

Minimum data conditions

Editable geometry or B-rep plus mesh and semantic feature tags.
Boundary and initial conditions with provenance.
Material and manufacturing assumptions.
Solver version, mesh strategy, convergence, residuals, warnings, and failed runs.
Field outputs plus engineering quantities of interest (QoIs) such as drag, pressure drop, hot-spot temperature, displacement, reaction force, or margin.
Multi-fidelity level: screening, design review, validated model, or operational evidence.
Regime-aware train/test/OOD splits rather than only random splits.
Uncertainty, validation evidence, license, security, and traceability to requirements.
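
The regime-aware split in the list above can be sketched as follows. This is one possible approach under stated assumptions, not a prescribed method: each episode is assumed to carry a Reynolds number under a hypothetical `reynolds` key, the regime thresholds are commonly quoted pipe-flow values used purely as an example, and the function names are invented here. One whole regime is held out as the OOD set, and only the remaining episodes are split at random.

```python
import random


def regime_aware_split(episodes, regime_of, ood_regimes, test_fraction=0.2, seed=0):
    """Split episodes into train / test / ood.

    regime_of:   function mapping an episode to a regime label.
    ood_regimes: labels held out of training and in-distribution testing entirely.
    """
    in_dist = [e for e in episodes if regime_of(e) not in ood_regimes]
    ood = [e for e in episodes if regime_of(e) in ood_regimes]

    # Random split only within the in-distribution regimes.
    rng = random.Random(seed)
    rng.shuffle(in_dist)
    n_test = int(round(test_fraction * len(in_dist)))
    return {"train": in_dist[n_test:], "test": in_dist[:n_test], "ood": ood}


def reynolds_regime(episode):
    """Illustrative regime label based on example pipe-flow thresholds."""
    re = episode["reynolds"]
    if re < 2300:
        return "laminar"
    if re < 4000:
        return "transitional"
    return "turbulent"


# Made-up episodes, for demonstration only.
reynolds_values = [500, 1200, 1800, 2100, 3000, 3500, 8000, 20000, 50000, 90000]
episodes = [{"id": i, "reynolds": r} for i, r in enumerate(reynolds_values)]

splits = regime_aware_split(episodes, reynolds_regime, ood_regimes={"turbulent"})
print({name: len(items) for name, items in splits.items()})
```

A purely random split of the same data would place turbulent cases in both train and test, so the test score would say little about behavior in a regime the model has not seen.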

For RHX, the data strategy should be to accumulate learnable engineering episodes: Plan creates requirements and decision context, Sim creates load cases and physical evidence, Render connects geometry and material state to review context, and prototype tests close the loop.
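
As a rough illustration of how such an episode might accumulate, the sketch below merges contributions from each stage into one record and reports which stages have not yet contributed. Everything in it is hypothetical: the stage keys, function names, and payload values are invented for this example and do not describe an actual RHX interface.

```python
# Stage keys mirror the four sources named above; the identifiers are assumptions.
STAGES = ("plan", "sim", "render", "prototype_test")


def add_contribution(episode, stage, payload):
    """Return a copy of the episode with the stage's payload merged in."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    updated = dict(episode)
    updated[stage] = {**episode.get(stage, {}), **payload}
    return updated


def open_stages(episode):
    """List the stages that have not contributed anything yet."""
    return [s for s in STAGES if not episode.get(s)]


# Made-up values, for demonstration only.
episode = {}
episode = add_contribution(episode, "plan", {"requirement": "REQ-12"})
episode = add_contribution(episode, "sim", {"load_case": "LC-3", "max_displacement_mm": 0.42})

print(open_stages(episode))  # ['render', 'prototype_test']
```

In this sketch an episode counts as closed only when `open_stages` returns an empty list, that is, once a prototype test has been attached alongside the planning, simulation, and rendering context.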

[Figure: Visual review map. Panels: Episode, Schema, Validation, Learning, Minimum data conditions.]