## Summary
This PR consolidates **18 separate CI workflow files** into a single **tiered orchestration workflow** that reduces costs, improves execution speed, and provides better failure visibility through intelligent early-exit strategies.
## CI Pipeline Architecture
```mermaid
graph TD
Start([Push/PR Event])
subgraph Tier1["🚪 Tier 1: Gate Checks (2-5 min, GitHub runners)"]
T1A[Gate Checks<br/>Format, Style, Pre-commit]
T1B[Shellcheck<br/>Shell Script Validation]
T1C[MAVSDK Python<br/>Python Code Quality]
end
Start --> T1A & T1B & T1C
T1A & T1B & T1C --> T1Pass{All Pass?}
T1Pass -->|No| Fail1[❌ Stop - Save ~55 min]
T1Pass -->|Yes| T2
subgraph Tier2["🔨 Tier 2: Basic Builds (5-8 min, AWS 4cpu)"]
T2A[Build SITL<br/>px4_sitl_default + ccache]
T2B[Basic Tests<br/>~2000 unit tests]
T2C[EKF Functional<br/>EKF validation]
end
T2 --> T2A & T2B & T2C
T2A & T2B & T2C --> T2Pass{All Pass?}
T2Pass -->|No| Fail2[❌ Stop - Save ~40 min]
T2Pass -->|Yes| T3
subgraph Tier3["🏗️ Tier 3: Platform Builds (15-25 min, AWS 4cpu/8cpu)"]
T3A[Ubuntu Builds<br/>Install Scripts + SITL + FMUv5]
T3B[macOS Build<br/>Install Scripts + SITL + FMUv5]
T3C[ITCM Check<br/>Memory Validation]
T3D[Flash Analysis<br/>Binary Size Check]
T3E[Failsafe Sim<br/>Web Simulator]
end
T3 --> T3A & T3B & T3C & T3D & T3E
T3A & T3B & T3C & T3D & T3E --> T3Pass{All Pass?}
T3Pass -->|No| Fail3[❌ Stop - Save ~20 min]
T3Pass -->|Yes| T4
subgraph Tier4["🧪 Tier 4: Integration Tests (15-30 min, AWS 4cpu)"]
T4A[SITL Tests<br/>Gazebo + MAVSDK]
T4B[ROS Integration<br/>ROS2 Galactic]
T4C[MAVROS Tests<br/>Mission + Offboard]
T4D[ROS Translation<br/>Humble + Jazzy]
end
T4 --> T4A & T4B & T4C & T4D
T4A & T4B & T4C & T4D --> T4Pass{All Pass?}
T4Pass -->|No| Fail4[❌ Stop - Save ~30-60 min]
T4Pass -->|Yes| Summary
Summary[CI Summary<br/>All Tiers Passed] --> T5
subgraph Tier5["🎯 Tier 5: Full Build Matrix (30-60 min, AWS 8cpu)"]
T5A[Build All Targets<br/>100+ Board Configs]
end
T5 --> T5A --> T5Pass{Pass?}
T5Pass -->|No| Fail5[❌ Board Build Failed]
T5Pass -->|Yes| UploadCheck{Protected<br/>Branch or Tag?}
UploadCheck -->|Yes| Upload[📦 Upload Artifacts<br/>S3 + GitHub Release]
UploadCheck -->|No| SuccessPR[✅ PR CI Complete]
Upload --> SuccessFull[✅ Full CI Complete]
style Tier1 fill:#e1f5e1
style Tier2 fill:#e3f2fd
style Tier3 fill:#fff3e0
style Tier4 fill:#f3e5f5
style Tier5 fill:#fce4ec
style Fail1 fill:#ffcdd2
style Fail2 fill:#ffcdd2
style Fail3 fill:#ffcdd2
style Fail4 fill:#ffcdd2
style Fail5 fill:#ffcdd2
style SuccessPR fill:#c8e6c9
style SuccessFull fill:#c8e6c9
```
## Problem with the Old Approach
Previously, PX4's CI ran **18+ independent workflow files** on every push/PR:
- `checks.yml`, `clang-tidy.yml`, `python_checks.yml`
- `compile_ubuntu.yml`, `compile_macos.yml`
- `sitl_tests.yml`, `ros_integration_tests.yml`, `mavros_mission_tests.yml`, `mavros_offboard_tests.yml`, `ros_translation_node.yml`
- `ekf_functional_change_indicator.yml`, `itcm_check.yml`, `flash_analysis.yml`, `failsafe_sim.yml`
- And more...
**Issues:**
1. **Wasteful execution** - All jobs ran in parallel regardless of early failures. If basic checks failed, expensive platform builds and integration tests still consumed CI resources.
2. **Poor cost efficiency** - No way to stop expensive jobs (macOS builds, full platform matrix, ROS tests) when fast checks failed.
3. **Slow feedback** - Developers had to wait for all workflows to complete to see overall status, even when basic linting failed.
4. **Maintenance burden** - Each workflow duplicated setup steps, cache configuration, and dependency declarations.
5. **No visibility into stages** - Hard to understand which phase of CI failed without checking multiple workflow runs.
## New Tiered Orchestration Approach
The new `ci-orchestrator.yml` implements a **5-tier waterfall pipeline** with intelligent early-exit at each stage:
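
As a rough illustration of how such a waterfall can be expressed, the sketch below chains one job per tier with `needs:`. By default, GitHub Actions skips a job when a job it depends on failed or was skipped, so a failure in an early tier stops everything downstream. The job names, runner labels, and `echo` placeholders are assumptions for illustration, not the contents of the actual `ci-orchestrator.yml`.

```yaml
# Minimal sketch of a tiered waterfall. All names and commands are placeholders.
name: CI Orchestrator (sketch)
on: [pull_request]

jobs:
  gate_checks:                 # Tier 1: cheap checks on GitHub-hosted runners
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "format, style, and pre-commit checks go here"

  build_sitl:                  # Tier 2: starts only if Tier 1 succeeded
    needs: [gate_checks]
    runs-on: ubuntu-latest     # placeholder for the AWS 4cpu runner label
    steps:
      - uses: actions/checkout@v4
      - run: echo "SITL build and unit tests go here"

  platform_builds:             # Tier 3: starts only if Tier 2 succeeded
    needs: [build_sitl]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "platform builds go here"

  integration_tests:           # Tier 4: starts only if Tier 3 succeeded
    needs: [platform_builds]
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "integration tests go here"
```

When a tier contains several parallel jobs, the next tier would list all of them in its `needs:` array, so it waits for every one to pass.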
### Tier 1: Gate Checks (2-5 min, GitHub runners)
**Fast quality gates that prevent bad code from consuming expensive resources**
- Code style checks (shellcheck, python formatting, MAVSDK validation)
- Static analysis
- Pre-commit hooks validation
**Exit strategy:** If any check fails, **stop immediately** - saves 30-60 minutes of expensive builds.
### Tier 2: Basic Builds (5-8 min, AWS 4cpu)
**Verify code compiles for primary SITL target**
- Build `px4_sitl_default` with ccache (250M cache, ~99% hit rate)
- Run basic unit tests (~2000 test cases)
- EKF functional validation
**Exit strategy:** If SITL doesn't build or basic tests fail, **stop before platform matrix** - saves 20-40 minutes of multi-platform builds.
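
A job of this shape could be sketched as follows. The cache path, key prefix, and runner label are assumptions; `CCACHE_DIR`, `CCACHE_MAXSIZE`, `ccache -z`, and `ccache -s` are standard ccache settings and commands, and `make px4_sitl_default` is the target named above.

```yaml
# Sketch only: path, key names, and runner label are illustrative assumptions.
jobs:
  build_sitl:
    runs-on: ubuntu-latest           # placeholder for the AWS 4cpu runner label
    env:
      CCACHE_MAXSIZE: 250M           # cache size quoted in this PR
    steps:
      - uses: actions/checkout@v4
        with:
          submodules: recursive
      - name: Point ccache at a directory inside the workspace
        run: echo "CCACHE_DIR=$GITHUB_WORKSPACE/.ccache" >> "$GITHUB_ENV"
      - name: Restore ccache
        uses: actions/cache@v4
        with:
          path: .ccache
          key: ccache-sitl-${{ github.sha }}
          restore-keys: |
            ccache-sitl-
      - name: Build SITL
        run: |
          ccache -z                  # zero the statistics so they cover this build only
          make px4_sitl_default
          ccache -s                  # print hit/miss statistics for the log
```

Printing the statistics after each build is one way to obtain hit-rate figures like the ones quoted here.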
### Tier 3: Platform Builds (15-25 min, AWS 4cpu + 8cpu)
**Build for all supported platforms**
- Ubuntu: tests the install scripts, then builds SITL and FMUv5
- macOS: tests the install scripts, then builds SITL and FMUv5 with the Apple toolchain
- ITCM size validation
- Flash memory analysis
- Failsafe web simulator
**Exit strategy:** If any platform fails, **stop before integration tests** - saves 15-30 minutes of ROS/MAVROS/simulation tests.
### Tier 4: Integration Tests (15-30 min, AWS 4cpu)
**End-to-end testing with simulators and ROS**
- SITL integration tests (Gazebo, MAVSDK scenarios)
- ROS 2 integration (Galactic builds)
- MAVROS tests (Mission planning, Offboard control)
- ROS Translation Node (Humble, Jazzy compatibility)
**Exit strategy:** If integration tests fail, **stop before full board matrix** - saves 30-60 minutes of expensive board builds.
### Tier 5: Full Build Matrix (30-60 min, AWS 8cpu)
**Runs on all PRs and protected branches after Tiers 1-4 pass**
- Builds **all** hardware targets (100+ NuttX board configurations)
- Only executes after all previous tiers succeed
- Catches board-specific compilation issues before merge
- Uploads artifacts to S3 and GitHub releases only on protected branches/tags
**Why run on PRs?** Board-specific breakages need to be caught before merge. Running this only after 4 tiers pass ensures we don't waste resources on PRs that would fail earlier checks.
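
One way the "upload only on protected branches/tags" rule might be written is a job-level `if:` that inspects the triggering event. This is a sketch under stated assumptions: the job names are hypothetical, the branch list is copied from the trigger section of this description, and the real workflow may gate uploads differently.

```yaml
# Sketch only: job names are hypothetical and the build/upload steps are stubs.
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: echo "board builds go here"

  upload_artifacts:
    needs: [build]
    runs-on: ubuntu-latest
    # Upload for release tags pushed directly, or for orchestrator runs that
    # were themselves triggered by a push to a protected branch.
    if: >-
      (github.event_name == 'push' && startsWith(github.ref, 'refs/tags/v')) ||
      (github.event_name == 'workflow_run' &&
      github.event.workflow_run.event == 'push' &&
      (contains(fromJSON('["main", "stable", "beta"]'), github.event.workflow_run.head_branch) ||
      startsWith(github.event.workflow_run.head_branch, 'release/')))
    steps:
      - run: echo "upload to S3 and publish the GitHub release here"
```

Pull request runs fall through both branches of the condition, so they build every board but skip the upload.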
## Benefits
### 💰 Cost Reduction
- **Early exit saves 50-90% of CI costs on failing PRs**
  - Gate failure: Save ~85 min (stops after 2-5 min, skips all expensive jobs)
  - Build failure: Save ~70 min (stops after 8 min, skips platforms + tests + boards)
  - Platform failure: Save ~50 min (stops after 25 min, skips integration + boards)
  - Integration failure: Save ~30-60 min (stops after 40 min, skips board matrix)
- **Smart caching reduces redundant builds**
  - Dedicated cache keys per job type (no race conditions)
  - Optimized cache sizes based on usage analysis (250M for SITL, 500M for EKF)
  - ~99% cache hit rates on SITL builds
- **Single workflow execution** - No duplicate runs on protected branches
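
The "dedicated cache keys per job type" idea can be sketched as below. An `actions/cache` entry cannot be overwritten once it has been saved under a key, so two parallel jobs sharing one key compete to save it. Giving each job its own prefix avoids that. The key prefixes and job names are assumptions, and the build steps are omitted.

```yaml
# Fragment: only the cache steps are shown. Prefixes are illustrative.
jobs:
  build_sitl:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/cache@v4
        with:
          path: .ccache
          key: ccache-sitl-${{ github.sha }}
          restore-keys: |
            ccache-sitl-

  ekf_functional:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/cache@v4
        with:
          path: .ccache
          key: ccache-ekf-${{ github.sha }}   # separate prefix, never shared with SITL
          restore-keys: |
            ccache-ekf-
```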
### ⚡ Speed Improvements
- **Parallel execution within tiers** - Jobs at the same tier run concurrently on AWS infrastructure
- **Optimized runner allocation** - Fast checks on GitHub runners, builds on AWS 4cpu/8cpu instances
- **Reduced queue time** - Single workflow = single queue entry vs. 18+ separate queues
### Better Visibility
- **Tier-based status reporting** - See exactly which phase failed
- **Consolidated summary job** - Single point to check overall CI status
- **Reduced noise** - No need to check 18+ workflow tabs
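
A consolidated summary job is commonly built with `if: always()` and the `needs.<job>.result` values. The fragment below is a sketch of that pattern, not the PR's actual job. The upstream job names are hypothetical and their definitions are omitted.

```yaml
# Fragment: assumes jobs named gate_checks, build_sitl, platform_builds,
# and integration_tests are defined elsewhere in the same workflow.
jobs:
  ci_summary:
    name: CI Summary
    if: always()               # run even when an earlier tier failed or was skipped
    needs: [gate_checks, build_sitl, platform_builds, integration_tests]
    runs-on: ubuntu-latest
    steps:
      - name: Write per-tier results to the run summary
        run: |
          {
            echo "| Tier | Result |"
            echo "| --- | --- |"
            echo "| 1: Gate Checks | ${{ needs.gate_checks.result }} |"
            echo "| 2: Basic Builds | ${{ needs.build_sitl.result }} |"
            echo "| 3: Platform Builds | ${{ needs.platform_builds.result }} |"
            echo "| 4: Integration Tests | ${{ needs.integration_tests.result }} |"
          } >> "$GITHUB_STEP_SUMMARY"
      - name: Fail if any tier did not succeed
        if: contains(needs.*.result, 'failure') || contains(needs.*.result, 'cancelled') || contains(needs.*.result, 'skipped')
        run: exit 1
```

Because the job always runs, it can serve as the single required status check for branch protection.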
### 🛠️ Easier Maintenance
- **Single source of truth** - All CI logic in one file
- **Consistent caching strategy** - Shared cache configuration patterns
- **Easier to modify** - Add/remove jobs without managing separate files
- **Better dependency tracking** - Explicit `needs:` declarations show job relationships
## Workflow Trigger Strategy
### CI Orchestrator (`ci-orchestrator.yml`)
**Triggers on:**
- All pull requests
- Pushes to `main`, `stable`, `beta`, `release/**`
**Does NOT trigger on:**
- Release tags (build-all-targets handles these directly)
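
An `on:` block matching the triggers listed above might look like the following sketch. Because `push` specifies only a `branches` filter, pushes that affect tags do not start the workflow.

```yaml
# Sketch of the orchestrator's triggers as described in this section.
on:
  pull_request:
  push:
    branches:
      - main
      - stable
      - beta
      - 'release/**'
```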
### Build All Targets (`build_all_targets.yml`)
**Triggers on:**
- `workflow_run` - After CI orchestrator completes successfully (all branches/PRs)
- `push` - **Only** on release tags `v*` (orchestrator doesn't run on tags)
- `workflow_dispatch` - Manual trigger for debugging
**Key benefit:** No duplicate runs. Protected branches get orchestrator → build-all-targets sequencing. Release tags trigger build-all-targets directly for artifact uploads.
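
The three triggers could be declared roughly as shown below. Two details matter with `workflow_run`: it fires when the upstream run completes regardless of outcome, so success has to be checked explicitly, and the default checkout ref is the default branch rather than the commit the orchestrator tested. The workflow name `CI Orchestrator` and the job name are assumptions. The `workflows:` entry must match the orchestrator's `name:` field.

```yaml
# Sketch only: the workflow name and job name are assumed, and the build is a stub.
on:
  workflow_run:
    workflows: ["CI Orchestrator"]
    types: [completed]
  push:
    tags:
      - 'v*'
  workflow_dispatch:

jobs:
  build_all_targets:
    # For workflow_run, continue only if the orchestrator succeeded.
    if: github.event_name != 'workflow_run' || github.event.workflow_run.conclusion == 'success'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          # Check out the commit the orchestrator ran on, falling back to
          # github.sha for tag pushes and manual runs.
          ref: ${{ github.event.workflow_run.head_sha || github.sha }}
      - run: echo "board builds go here"
```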
## Migration Details
### Consolidated Workflows
Merged these workflows into tiers:
- **Tier 1:** `checks.yml`, `python_checks.yml`, MAVSDK validation
- **Tier 2:** `compile_ubuntu.yml` (SITL only), basic tests, `ekf_functional_change_indicator.yml`
- **Tier 3:** `compile_ubuntu.yml` (install scripts + targeted builds), `compile_macos.yml` (install scripts + targeted builds), `itcm_check.yml`, `flash_analysis.yml`, `failsafe_sim.yml`
- **Tier 4:** `sitl_tests.yml`, `ros_integration_tests.yml`, `mavros_mission_tests.yml`, `mavros_offboard_tests.yml`, `ros_translation_node.yml`
- **Tier 5:** `build_all_targets.yml` (now triggered via `workflow_run`)
### Job Consolidations
- **MAVROS tests** - Merged mission and offboard tests into single matrix job
- **Basic tests** - Combined unit tests and formatting checks
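
Merging two near-identical jobs into one matrix job usually looks like the sketch below. The script name `run_mavros_tests.sh` is hypothetical and stands in for whatever command the original mission and offboard workflows ran.

```yaml
# Sketch only: the test script is a hypothetical placeholder.
jobs:
  mavros_tests:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false          # let both suites finish even if one fails
      matrix:
        suite: [mission, offboard]
    steps:
      - uses: actions/checkout@v4
      - name: Run MAVROS ${{ matrix.suite }} tests
        run: ./run_mavros_tests.sh "${{ matrix.suite }}"
```

The shared setup steps are then written once and run for each `suite` value.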
### Optimizations Applied
- **Cache tuning:**
  - SITL cache: 120M → 250M (eliminated thrashing, 284 cleanups → ~0)
  - EKF tests: Dedicated cache key + 500M (resolved race condition with SITL)
  - Cache hit rates: 48% → 99%+ on SITL builds
- **Shell compatibility:** Fixed bash/sh issues in ROS and emscripten steps
- **Container alignment:** Matched MAVSDK versions to container GLIBC versions
- **Dependency fixes:** Corrected job dependencies for proper waterfall execution
- **Trigger optimization:** Eliminated duplicate workflow runs on protected branches
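
The bash/sh class of issue typically arises because `run` steps inside a container job default to `sh`, while ROS setup scripts are written for bash. A common remedy, sketched here, is to set the job's default shell. The `ros:humble` image and job name are only examples and are not necessarily what this PR uses.

```yaml
# Sketch only: image and job name are illustrative.
jobs:
  ros_translation:
    runs-on: ubuntu-latest
    container: ros:humble
    defaults:
      run:
        shell: bash             # override the container default of sh
    steps:
      - name: Source the ROS environment
        run: |
          source /opt/ros/humble/setup.bash
          ros2 --help
```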
## Testing
This orchestration has been tested through multiple iterations:
- ✅ All tiers execute in proper order
- ✅ Early exit works at each tier
- ✅ Cache performance optimized (99%+ hit rates)
- ✅ All tests pass equivalently to old workflows
- ✅ Job consolidations maintain same test coverage
- ✅ No duplicate workflow runs on protected branches
- ✅ Single artifact upload per release tag
## Future Improvements
Potential next steps (out of scope for this PR):
- Container upgrades (Ubuntu 22.04/24.04 for newer GLIBC)
- Further job consolidations where appropriate
- Dynamic tier sizing based on PR scope
- Cost tracking and reporting per tier
---
**Breaking Change:** This replaces 18+ workflow files with a single orchestrator. The old workflows will be removed in a follow-up PR after a validation period.