1 Executive Snapshot
171 candidates scanned
32 X/search fallback signals
20 YouTube signals
64 GitHub repo signals
0 arXiv fresh; HTTP 429
- Agent harness/runtime: 8 of 30 dev-web samples center on deterministic benchmarks, portable harnesses, and runtime OS → NEXA needs a 2-week internal harness test.
- CLI/IDE coding agents: 64 GitHub repos; manaflow-ai/cmux has 20,014 stars / 1,501 forks / 2,215 issues → high demand for orchestration, high issue-load risk.
- Sandbox/security: microsandbox 6,327 stars / 308 forks / 52 issues → enterprise adoption needs isolation before auto-PR.
- Spec-as-source: the HN item “specs feel more like source code” has only 1 point but points in the right direction: coding-agent productivity depends on context/spec formalization → FARE prioritizes the repo+spec graph.
- Social completeness: 3 of 4 social groups have signals (X/YT/Reddit); Facebook public has 0 usable → overall confidence 72%, does not block publishing.
2 KOL/OG Feed Watch
| Platform | Author/Channel | Metric | URL | CTO takeaway |
|---|---|---|---|---|
| HN | GodelNumbering · 2026-04-27 | 393 pts / 148 comments | dirac | OSS agent claims the top TerminalBench result → needs re-benchmarking on Fabbi repos. |
| HN | sjhalani7 · 2026-05-27 | 6 pts / 3 comments | VAEN | Portable AI coding-agent harness → packaging pattern for NEXA eval. |
| HN | e2e4 · 2026-05-27 | 2 pts / 1 comment | DeepSWE | Frontier coding-agent measurement → adds a benchmark beyond SWE-bench. |
| GitHub | manaflow-ai · 2026-05-28 | 20,014 stars / 1,501 forks | cmux | Multi-agent/workflow orchestration is driving adoption. |
| GitHub | gptme · 2026-05-28 | 4,311 stars / 390 forks | gptme | OSS CLI agent baseline for comparison. |
| Facebook (public) | N/A | 0 usable | N/A: public search fallback returned no usable links | Lowers social-market confidence by 8 points. |
3 Trend Radar
- HOT Harness/eval runtime: 5 HN/GitHub signals related to TerminalBench/SWE-like benchmarks.
- HOT Sandbox for agents: microsandbox 6,327 stars.
- EMERGING Spec-as-source/context graph: 2 dev-web signals, low engagement but high strategic fit.
- WATCH Parallel Claude/coding workers: Claudeverse 1 HN point, early.
- NOISE Vibe-code app launches: ≥3 HN items, low reusable enterprise value.
4 CTO Evaluation Matrix
| Signal | Thesis | Evidence | Counter-signal | Fabbi implication | Conf. | Decision | Next validation |
|---|---|---|---|---|---|---|---|
| Harness-first agent | Agent value is shifting from the model prompt to a measurable harness. | VAEN, DeepSWE, Tracecore, TerminalBench mentions: 5+ signals. | Low HN engagement (1-6 pts) except Dirac at 393 pts. | NEXA/SYNCA need an internal eval pack. | 78% | trial | 20 tasks from 2 customer repos; pass@1/cost/time. |
| Sandbox runtime | Enterprises only scale when agent execution is permission-restricted. | microsandbox 6,327 stars; Amber capability runtime HN. | Security maturity is not yet sufficient for audit. | AIOS governance + SYNCA risk gate. | 74% | trial | PoC with locked filesystem/network across 5 workflows. |
| Context/spec graph | Spec/codebase memory is the main bottleneck. | Repowise, spec-as-source, implicit knowledge threads: 3 signals. | Adoption data is missing. | FARE should prioritize codebase intelligence. | 70% | adopt | Measure retrieval precision@10 on 3 repos. |
| OSS CLI baseline | OSS agents are good enough to serve as a control group. | gptme 4,311 stars; cmux 20,014 stars; orca 3,551 stars. | cmux issue load is high at 2,215. | Reduces Claude/Codex lock-in when benchmarking. | 76% | trial | Compare Claude Code/Codex/gptme on 30 tasks. |
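The "pass@1/cost/time" validation named in the matrix could be aggregated per agent roughly as follows. This is a minimal sketch, not the report's actual harness: the `TaskResult` record, its field names, and the `summarize` helper are hypothetical, and it assumes each task is attempted once per agent so that the mean of `passed` equals pass@1.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One first-attempt run of one agent on one task (hypothetical schema)."""
    agent: str
    task_id: str
    passed: bool      # first attempt only, so the mean over tasks is pass@1
    cost_usd: float
    seconds: float


def summarize(results):
    """Group results by agent and return pass@1, mean cost, and mean time."""
    by_agent = defaultdict(list)
    for result in results:
        by_agent[result.agent].append(result)

    summary = {}
    for agent, rows in by_agent.items():
        n = len(rows)
        summary[agent] = {
            "pass@1": sum(r.passed for r in rows) / n,
            "mean_cost_usd": sum(r.cost_usd for r in rows) / n,
            "mean_seconds": sum(r.seconds for r in rows) / n,
        }
    return summary
```

Keeping cost and time next to pass@1 in the same summary makes it harder to pick a winner on accuracy alone when the comparison runs over the same task set.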
5 Repo Watch
| Repo | Momentum metric | Risk | Move |
|---|---|---|---|
| cmux | 20,014 stars / 1,501 forks / 2,215 issues | Issue pressure 2,215 | Watch architecture, not adopt raw. |
| microsandbox | 6,327 stars / 308 forks / 52 issues | Security audit N/A | PoC sandbox layer. |
| gptme | 4,311 stars / 390 forks / 11 issues | Enterprise controls N/A | Baseline CLI agent. |
| orca | 3,551 stars / 238 forks / 229 issues | 229 issues | Watch. |
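One way to make the "issue pressure" column comparable across repos of different sizes is to normalize issues by stars. The sketch below uses only the figures from the table above; the issues-per-1,000-stars ratio is an illustrative assumption, not a metric the scan itself reports.

```python
# Star, fork, and issue counts copied from the Repo Watch table.
repos = {
    "cmux": {"stars": 20014, "forks": 1501, "issues": 2215},
    "microsandbox": {"stars": 6327, "forks": 308, "issues": 52},
    "gptme": {"stars": 4311, "forks": 390, "issues": 11},
    "orca": {"stars": 3551, "forks": 238, "issues": 229},
}


def issues_per_1k_stars(stats):
    """Issues per 1,000 stars (hypothetical normalization for comparison)."""
    return 1000 * stats["issues"] / stats["stars"]


# Highest relative issue load first.
ranked = sorted(repos, key=lambda name: issues_per_1k_stars(repos[name]), reverse=True)

for name in ranked:
    print(f"{name}: {issues_per_1k_stars(repos[name]):.1f} issues per 1k stars")
```

Under this normalization cmux and orca carry the heaviest relative issue load, which is consistent with the "watch" moves in the table.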
6 Paper / Benchmark / Product Watch
- arXiv fresh: 0 collected (HTTP 429 ×5) → reduced confidence in benchmark/paper claims.
- TerminalBench/DeepSWE: 4+ web signals; use as external eval inspiration, not direct KPI until reproduced.
- Claude Code/Codex/Cursor/Devin/OpenCode: no fresh official product changelogs were captured in this run; track over the next 7 days.
- Product move: OSS runtime/orchestration repos provide faster test surface than vendor announcements today.
7 Impact Coverage
| Domain | Now 0-2w | Next 1-2m | Later 3-6m | Mode |
|---|---|---|---|---|
| FARE | Spec/codebase graph MVP; 3 repos. | Retrieval precision + agent context pack. | Cross-project knowledge memory. | adopt |
| NEXA | 20-task harness, 3 agents. | Sandboxed execution + auto-PR. | Customer-specific agent runtime. | trial |
| SYNCA | Risk rubric: code diff, test, secret, permission. | Human-in-loop gate. | Governed AI SDLC platform. | adopt |
| DOMUS | Monitor low direct signal. | Apply only to internal automation. | Vertical agent workflows if ROI proven. | monitor |
| Japan/VN/Global | Japan: security-first; VN: productivity pilots; Global: OSS orchestration. | Package case study with 15-25% cycle-time saving target. | Managed AI SDLC offering. | trial |
8 CTO Recommendations (exactly 4)
1. Build Fabbi Agent Eval Pack v0
Why now: 171-signal run shows harness/runtime cluster strongest. ROI/time-saving: 15-25% dev cycle if pass@1 improves ≥10%. Risk: 2/5. Owner: Head of AI Eng. TTV: 10 working days. Validate: 20 tasks, 3 repos, pass@1/cost/time/security.
2. Ship sandboxed NEXA PoC
Why now: microsandbox 6,327 stars proves execution isolation demand. ROI/time-saving: 10-18% via safer auto-run tests. Risk: 3/5. Owner: Platform Lead. TTV: 2 weeks. Validate: 5 workflows with blocked network/secrets/fs writes.
3. FARE context/spec graph pilot
Why now: spec-as-source + repo-intel signals indicate context bottleneck. ROI/time-saving: 12-20% fewer clarification loops. Risk: 2/5. Owner: Solution Architect. TTV: 3 weeks. Validate: precision@10, answer groundedness, hallucination rate.
4. Vendor-neutral coding-agent benchmark
Why now: OSS/control group avoids Claude/Codex lock-in. ROI/time-saving: 8-15% procurement+tooling efficiency. Risk: 2/5. Owner: CTO Office. TTV: 1 week. Validate: Claude Code vs Codex vs gptme on same 30 tasks.
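The precision@10 check named for the FARE pilot can be computed as below. This is a minimal sketch under stated assumptions: `retrieved` is a ranked list of document or chunk identifiers, `relevant` is a hand-labeled set for the query, and the function name and the divide-by-k convention (even when fewer than k items come back) are choices made for illustration.

```python
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k retrieved items that are labeled relevant.

    Divides by k, so returning fewer than k items is penalized
    (one common convention; an assumption here).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant_set = set(relevant)
    top_k = retrieved[:k]
    hits = sum(1 for item in top_k if item in relevant_set)
    return hits / k


# Example with hypothetical chunk identifiers: 3 of the top 10 are relevant.
retrieved = [f"chunk-{i}" for i in range(1, 13)]
relevant = {"chunk-2", "chunk-5", "chunk-9", "chunk-12"}
print(precision_at_k(retrieved, relevant))  # 0.3
```

Averaging this value over a fixed query set per repo gives one number per repo that can be tracked across pilot iterations.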
9 Must-read Sources
S01 HN: Dirac OSS agent / TerminalBench claim
P0: 393 pts / 148 comments. Read to design benchmark control.
S02 GitHub: manaflow-ai/cmux
P0: 20,014 stars / 1,501 forks. Orchestration signal.
S03 GitHub: microsandbox
P0: sandbox primitive for agent execution.
S04 HN: DeepSWE
P1: frontier coding-agent measurement.
S05 HN: VAEN portable harness
P1: packaging harness concept.
S06 Blog: Specs feel more like source code
P1: context/spec operating model.
10 Data Quality / Scan Health Appendix
- Plan executed: manifest → collectors → normalize → gates → score → Vietnamese synthesis → HTML → Cloudflare deploy.
- Counts: total 171; dev_web/HN 30; GitHub 64; Reddit 25; YouTube 20; X 32 (via search fallback); papers/product 0 (arXiv HTTP 429); Facebook public 0 (no usable links).
- Gates: source volume PASS; social completeness PASS (3/4); source links PASS; papers/product PARTIAL.
- Confidence: 72% overall; paper/product claims limited.
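The counts and the social completeness gate reported in this appendix can be cross-checked with a few lines. This is a sketch only: the dictionary keys and the `social_gate` helper are hypothetical, and the pass threshold of 3 groups is an assumption inferred from the "PASS 3/4" result, since the pipeline's real threshold is not stated.

```python
# Per-source counts as reported in the appendix.
counts = {
    "dev_web_hn": 30,
    "github": 64,
    "reddit": 25,
    "youtube": 20,
    "x_search_fallback": 32,
    "papers_product": 0,
    "facebook_public": 0,
}
REPORTED_TOTAL = 171

# The per-source counts should add up to the reported total.
total = sum(counts.values())

# The four social groups tracked by the completeness gate.
social_sources = ["x_search_fallback", "youtube", "reddit", "facebook_public"]
social_with_signal = sum(1 for source in social_sources if counts[source] > 0)


def social_gate(groups_with_signal, required=3):
    """PASS when enough social groups have signal (threshold is an assumption)."""
    return "PASS" if groups_with_signal >= required else "FAIL"


print(total == REPORTED_TOTAL)
print(f"{social_with_signal}/{len(social_sources)}", social_gate(social_with_signal))
```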