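<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[High benchmark scores, so why do business samples still fail?]]></title><description><![CDATA[<p dir="auto">We picked a model with a very high leaderboard score, but on our internal policy Q&amp;A it does worse than another model with a lower score. Are leaderboards meaningless?</p>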
]]></description><link>https://localaihub.com/topic/162/评测基准分数高-为什么业务样例还是翻车</link><generator>RSS for Node</generator><lastBuildDate>Wed, 03 Jun 2026 19:23:26 GMT</lastBuildDate><atom:link href="https://localaihub.com/topic/162.rss" rel="self" type="application/rss+xml"/><pubDate>Mon, 11 May 2026 13:38:00 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to High benchmark scores, so why do business samples still fail? on Tue, 12 May 2026 09:49:00 GMT]]></title><description><![CDATA[<p dir="auto">I'm going to put the leaderboard screenshots in the appendix and our own sample results in the main report.</p>
]]></description><link>https://localaihub.com/post/1703</link><guid isPermaLink="true">https://localaihub.com/post/1703</guid><dc:creator><![CDATA[阿远]]></dc:creator><pubDate>Tue, 12 May 2026 09:49:00 GMT</pubDate></item><item><title><![CDATA[Reply to High benchmark scores, so why do business samples still fail? on Tue, 12 May 2026 09:34:00 GMT]]></title><description><![CDATA[<p dir="auto">There's also cost and deployment. However smart a model is, if it doesn't run reliably it isn't the right fit for you.</p>
]]></description><link>https://localaihub.com/post/1702</link><guid isPermaLink="true">https://localaihub.com/post/1702</guid><dc:creator><![CDATA[阿航]]></dc:creator><pubDate>Tue, 12 May 2026 09:34:00 GMT</pubDate></item><item><title><![CDATA[Reply to High benchmark scores, so why do business samples still fail? on Tue, 12 May 2026 08:12:00 GMT]]></title><description><![CDATA[<p dir="auto">No need to go to extremes. Leaderboards help you narrow down the candidates; business evaluation decides what goes live.</p>
]]></description><link>https://localaihub.com/post/1701</link><guid isPermaLink="true">https://localaihub.com/post/1701</guid><dc:creator><![CDATA[陈一]]></dc:creator><pubDate>Tue, 12 May 2026 08:12:00 GMT</pubDate></item><item><title><![CDATA[Reply to High benchmark scores, so why do business samples still fail? on Tue, 12 May 2026 07:59:00 GMT]]></title><description><![CDATA[<p dir="auto">Should we just ignore leaderboards entirely?</p>
]]></description><link>https://localaihub.com/post/1700</link><guid isPermaLink="true">https://localaihub.com/post/1700</guid><dc:creator><![CDATA[小蓝]]></dc:creator><pubDate>Tue, 12 May 2026 07:59:00 GMT</pubDate></item><item><title><![CDATA[Reply to High benchmark scores, so why do business samples still fail? on Tue, 12 May 2026 06:47:00 GMT]]></title><description><![CDATA[<p dir="auto">Don't forget refusal and safety samples. Leaderboards usually don't cover your company's red lines.</p>
]]></description><link>https://localaihub.com/post/1699</link><guid isPermaLink="true">https://localaihub.com/post/1699</guid><dc:creator><![CDATA[nora]]></dc:creator><pubDate>Tue, 12 May 2026 06:47:00 GMT</pubDate></item><item><title><![CDATA[Reply to High benchmark scores, so why do business samples still fail? on Tue, 12 May 2026 03:53:00 GMT]]></title><description><![CDATA[<p dir="auto">To start with, a few dozen samples are enough to rule out the obviously unsuitable models. For a formal launch you need at least a hundred or more stratified samples.</p>
]]></description><link>https://localaihub.com/post/1698</link><guid isPermaLink="true">https://localaihub.com/post/1698</guid><dc:creator><![CDATA[林小北]]></dc:creator><pubDate>Tue, 12 May 2026 03:53:00 GMT</pubDate></item><item><title><![CDATA[Reply to High benchmark scores, so why do business samples still fail? on Tue, 12 May 2026 03:44:00 GMT]]></title><description><![CDATA[<p dir="auto">So how many samples are enough for an internal evaluation?</p>
]]></description><link>https://localaihub.com/post/1697</link><guid isPermaLink="true">https://localaihub.com/post/1697</guid><dc:creator><![CDATA[普通网友A]]></dc:creator><pubDate>Tue, 12 May 2026 03:44:00 GMT</pubDate></item><item><title><![CDATA[Reply to High benchmark scores, so why do business samples still fail? on Tue, 12 May 2026 02:27:00 GMT]]></title><description><![CDATA[<p dir="auto">I've seen models that rank high on leaderboards and write very well but have poor citation discipline. That's a real problem in knowledge-base scenarios.</p>
]]></description><link>https://localaihub.com/post/1696</link><guid isPermaLink="true">https://localaihub.com/post/1696</guid><dc:creator><![CDATA[半截薯条]]></dc:creator><pubDate>Tue, 12 May 2026 02:27:00 GMT</pubDate></item><item><title><![CDATA[Reply to High benchmark scores, so why do business samples still fail? on Mon, 11 May 2026 23:59:00 GMT]]></title><description><![CDATA[<p dir="auto">Also, in a RAG system the model is only the last link in the chain. If retrieval returns the wrong material, even the highest score won't help.</p>
]]></description><link>https://localaihub.com/post/1695</link><guid isPermaLink="true">https://localaihub.com/post/1695</guid><dc:creator><![CDATA[小吴]]></dc:creator><pubDate>Mon, 11 May 2026 23:59:00 GMT</pubDate></item><item><title><![CDATA[Reply to High benchmark scores, so why do business samples still fail? on Mon, 11 May 2026 23:00:00 GMT]]></title><description><![CDATA[<p dir="auto">You can use leaderboards as a basis for initial screening, but the decision should rest on internal evaluation. The two aren't on the same level.</p>
]]></description><link>https://localaihub.com/post/1694</link><guid isPermaLink="true">https://localaihub.com/post/1694</guid><dc:creator><![CDATA[Grace]]></dc:creator><pubDate>Mon, 11 May 2026 23:00:00 GMT</pubDate></item><item><title><![CDATA[Reply to High benchmark scores, so why do business samples still fail? on Mon, 11 May 2026 20:00:00 GMT]]></title><description><![CDATA[<p dir="auto">My boss likes looking at leaderboard screenshots.</p>
]]></description><link>https://localaihub.com/post/1693</link><guid isPermaLink="true">https://localaihub.com/post/1693</guid><dc:creator><![CDATA[阿远]]></dc:creator><pubDate>Mon, 11 May 2026 20:00:00 GMT</pubDate></item><item><title><![CDATA[Reply to High benchmark scores, so why do business samples still fail? on Mon, 11 May 2026 16:57:00 GMT]]></title><description><![CDATA[<p dir="auto">Business samples are your real exam, especially internal abbreviations, old documents, permissions, and colloquial phrasing.</p>
]]></description><link>https://localaihub.com/post/1692</link><guid isPermaLink="true">https://localaihub.com/post/1692</guid><dc:creator><![CDATA[melo]]></dc:creator><pubDate>Mon, 11 May 2026 16:57:00 GMT</pubDate></item><item><title><![CDATA[Reply to High benchmark scores, so why do business samples still fail? on Mon, 11 May 2026 15:27:00 GMT]]></title><description><![CDATA[<p dir="auto">They are meaningful, but they aren't your business acceptance test. A leaderboard measures the tasks it defines.</p>
]]></description><link>https://localaihub.com/post/1691</link><guid isPermaLink="true">https://localaihub.com/post/1691</guid><dc:creator><![CDATA[陈一]]></dc:creator><pubDate>Mon, 11 May 2026 15:27:00 GMT</pubDate></item></channel></rss>