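<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Multi-agent evaluation can't rely on the final answer alone]]></title><description><![CDATA[<p dir="auto">How should multi-agent evaluation be done? Checking only whether the final answer is correct doesn't seem to show whether the researcher and reviewer agents contribute anything.</p>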
]]></description><link>https://localaihub.com/topic/125/多智能体评测不能只看最终答案</link><generator>RSS for Node</generator><lastBuildDate>Wed, 03 Jun 2026 20:32:18 GMT</lastBuildDate><atom:link href="https://localaihub.com/topic/125.rss" rel="self" type="application/rss+xml"/><pubDate>Fri, 08 May 2026 10:47:00 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Sat, 09 May 2026 15:04:00 GMT]]></title><description><![CDATA[<p dir="auto">Remember to keep the failure cases. Production optimization runs on failure cases, not on a good-looking average score.</p>
]]></description><link>https://localaihub.com/post/1161</link><guid isPermaLink="true">https://localaihub.com/post/1161</guid><dc:creator><![CDATA[melo]]></dc:creator><pubDate>Sat, 09 May 2026 15:04:00 GMT</pubDate></item><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Sat, 09 May 2026 13:02:00 GMT]]></title><description><![CDATA[<p dir="auto">I'll start with a 100-item set, label the error cause for each failure, and compare a single agent against the three-role workflow.</p>
]]></description><link>https://localaihub.com/post/1160</link><guid isPermaLink="true">https://localaihub.com/post/1160</guid><dc:creator><![CDATA[小谢]]></dc:creator><pubDate>Sat, 09 May 2026 13:02:00 GMT</pubDate></item><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Sat, 09 May 2026 10:02:00 GMT]]></title><description><![CDATA[<p dir="auto">Process logs can only explain a result; they are not quality itself.</p>
]]></description><link>https://localaihub.com/post/1159</link><guid isPermaLink="true">https://localaihub.com/post/1159</guid><dc:creator><![CDATA[Grace]]></dc:creator><pubDate>Sat, 09 May 2026 10:02:00 GMT</pubDate></item><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Sat, 09 May 2026 08:47:00 GMT]]></title><description><![CDATA[<p dir="auto">We've been fooled by logs before. Multiple roles, a pile of steps, and the final answer still had no evidence behind it.</p>
]]></description><link>https://localaihub.com/post/1158</link><guid isPermaLink="true">https://localaihub.com/post/1158</guid><dc:creator><![CDATA[小周]]></dc:creator><pubDate>Sat, 09 May 2026 08:47:00 GMT</pubDate></item><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Sat, 09 May 2026 06:56:00 GMT]]></title><description><![CDATA[<p dir="auto">Human review has to be blind. Otherwise reviewers see a long multi-agent log and instinctively assume the work was more careful.</p>
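]]></description><link>https://localaihub.com/post/1157</link><guid isPermaLink="true">https://localaihub.com/post/1157</guid><dc:creator><![CDATA[nora]]></dc:creator><pubDate>Sat, 09 May 2026 06:56:00 GMT</pubDate></item><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Sat, 09 May 2026 04:52:00 GMT]]></title><description><![CDATA[<p dir="auto">Right, don't look only at the pass rate. For example, if the failure mode shifts from “wrong answers” to “too many refusals,” the pass rate may appear to improve while the user experience gets worse.</p>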
]]></description><link>https://localaihub.com/post/1156</link><guid isPermaLink="true">https://localaihub.com/post/1156</guid><dc:creator><![CDATA[小傅]]></dc:creator><pubDate>Sat, 09 May 2026 04:52:00 GMT</pubDate></item><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Sat, 09 May 2026 03:53:00 GMT]]></title><description><![CDATA[<p dir="auto">I'd suggest doing an ablation first. Run single agent, single agent + reviewer, and multi-agent on the same batch of questions, then see how the error causes change.</p>
]]></description><link>https://localaihub.com/post/1155</link><guid isPermaLink="true">https://localaihub.com/post/1155</guid><dc:creator><![CDATA[qwer_asdf]]></dc:creator><pubDate>Sat, 09 May 2026 03:53:00 GMT</pubDate></item><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Sat, 09 May 2026 01:31:00 GMT]]></title><description><![CDATA[<p dir="auto">It depends on the value of the task. For a high-risk compliance report the cost is acceptable; for an ordinary FAQ it isn't worth it.</p>
]]></description><link>https://localaihub.com/post/1154</link><guid isPermaLink="true">https://localaihub.com/post/1154</guid><dc:creator><![CDATA[陈一]]></dc:creator><pubDate>Sat, 09 May 2026 01:31:00 GMT</pubDate></item><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Fri, 08 May 2026 22:38:00 GMT]]></title><description><![CDATA[<p dir="auto">If the multi-agent final answer is better but costs 5x as much, how do you weigh that?</p>
]]></description><link>https://localaihub.com/post/1153</link><guid isPermaLink="true">https://localaihub.com/post/1153</guid><dc:creator><![CDATA[半糖]]></dc:creator><pubDate>Fri, 08 May 2026 22:38:00 GMT</pubDate></item><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Fri, 08 May 2026 19:53:00 GMT]]></title><description><![CDATA[<p dir="auto">The same goes for SWE-bench or SWE-agent results. A benchmark is a reference, not an acceptance test for your repository.</p>
]]></description><link>https://localaihub.com/post/1152</link><guid isPermaLink="true">https://localaihub.com/post/1152</guid><dc:creator><![CDATA[阿航]]></dc:creator><pubDate>Fri, 08 May 2026 19:53:00 GMT</pubDate></item><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Fri, 08 May 2026 17:01:00 GMT]]></title><description><![CDATA[<p dir="auto">Benchmarks like WebArena can show web agent capability, but an enterprise's internal admin systems and data constraints are different, so the results can't be taken directly as proof of launch readiness.</p>
]]></description><link>https://localaihub.com/post/1151</link><guid isPermaLink="true">https://localaihub.com/post/1151</guid><dc:creator><![CDATA[小蓝]]></dc:creator><pubDate>Fri, 08 May 2026 17:01:00 GMT</pubDate></item><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Fri, 08 May 2026 14:14:00 GMT]]></title><description><![CDATA[<p dir="auto">There's also the “near miss” metric. When the reviewer agent blocks an error before it is sent out, that should be credited.</p>
]]></description><link>https://localaihub.com/post/1150</link><guid isPermaLink="true">https://localaihub.com/post/1150</guid><dc:creator><![CDATA[林小北]]></dc:creator><pubDate>Fri, 08 May 2026 14:14:00 GMT</pubDate></item><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Fri, 08 May 2026 12:30:00 GMT]]></title><description><![CDATA[<p dir="auto">But don't track too many metrics, or nobody ends up looking at them. We keep 6: correctness, citation support, unauthorized actions, latency, cost, and amount of human editing.</p>
]]></description><link>https://localaihub.com/post/1149</link><guid isPermaLink="true">https://localaihub.com/post/1149</guid><dc:creator><![CDATA[melo]]></dc:creator><pubDate>Fri, 08 May 2026 12:30:00 GMT</pubDate></item><item><title><![CDATA[Reply to Multi-agent evaluation can't rely on the final answer alone on Fri, 08 May 2026 11:28:00 GMT]]></title><description><![CDATA[<p dir="auto">You need process metrics. Researcher recall, source quality, orchestrator selection, and reviewer interception rate all have to be broken out separately.</p>
]]></description><link>https://localaihub.com/post/1148</link><guid isPermaLink="true">https://localaihub.com/post/1148</guid><dc:creator><![CDATA[Grace]]></dc:creator><pubDate>Fri, 08 May 2026 11:28:00 GMT</pubDate></item></channel></rss>