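<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Don't Just Copy Public Leaderboards for Your Model Eval Set]]></title><description><![CDATA[<p dir="auto">My boss asked me to pick a model. Is it enough to go by public leaderboard rankings? I'm worried that running my own evaluation would be too subjective.</p>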
]]></description><link>https://localaihub.com/topic/84/模型评测集别只抄公开榜单</link><generator>RSS for Node</generator><lastBuildDate>Wed, 03 Jun 2026 19:16:28 GMT</lastBuildDate><atom:link href="https://localaihub.com/topic/84.rss" rel="self" type="application/rss+xml"/><pubDate>Tue, 05 May 2026 02:57:00 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Wed, 06 May 2026 05:42:00 GMT]]></title><description><![CDATA[<p dir="auto">Now we have something concrete to discuss. Without an eval set, picking a model is mostly an act of faith.</p>
]]></description><link>https://localaihub.com/post/543</link><guid isPermaLink="true">https://localaihub.com/post/543</guid><dc:creator><![CDATA[林小北]]></dc:creator><pubDate>Wed, 06 May 2026 05:42:00 GMT</pubDate></item><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Wed, 06 May 2026 02:50:00 GMT]]></title><description><![CDATA[<p dir="auto">I plan to start with 200 business eval cases and treat public leaderboards only as a footnote. Each case records the input, the expected output, the error severity, and the cost.</p>
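<p dir="auto">A minimal sketch of what one such record could look like. The field names, the three severity levels, and the sample cases are all assumptions for illustration; the post only names the four things to record.</p>

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    """Hypothetical error levels; pick whatever scale fits your business."""
    MINOR = "minor"
    MAJOR = "major"
    UNACCEPTABLE = "unacceptable"


@dataclass(frozen=True)
class EvalRecord:
    case_id: str
    input_text: str
    expected: str
    passed: bool
    severity: Optional[Severity]  # None when the answer passed
    cost_yuan: float              # cost of this single call


def summarize(records):
    """Return pass rate, total cost, and failure counts keyed by severity."""
    if not records:
        raise ValueError("no records to summarize")
    failures = {}
    for r in records:
        if not r.passed and r.severity is not None:
            key = r.severity.value
            failures[key] = failures.get(key, 0) + 1
    return {
        "pass_rate": sum(r.passed for r in records) / len(records),
        "total_cost_yuan": sum(r.cost_yuan for r in records),
        "failures_by_severity": failures,
    }


# Made-up sample cases, only to show the shape of the data.
records = [
    EvalRecord("refund-001", "Can I get a refund after 30 days?",
               "Explain the refund window", True, None, 0.002),
    EvalRecord("invoice-001", "Please reissue my invoice",
               "Ask for the order number", False, Severity.MINOR, 0.002),
]
print(summarize(records))
```

<p dir="auto">Keeping severity separate from pass/fail means the summary can report the failure mix, not just a single number.</p>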
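]]></description><link>https://localaihub.com/post/542</link><guid isPermaLink="true">https://localaihub.com/post/542</guid><dc:creator><![CDATA[卡在第7步]]></dc:creator><pubDate>Wed, 06 May 2026 02:50:00 GMT</pubDate></item><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Wed, 06 May 2026 01:54:00 GMT]]></title><description><![CDATA[<p dir="auto">It can, which is why you split it into a dev set and a hidden set. Tune prompts on the dev set and use the hidden set only for acceptance. Then periodically fold production failures into a fresh set.</p>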
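<p dir="auto">One way to keep the split stable is to assign each case by hashing its ID, so a case never drifts from hidden to dev when the set grows. This is only a sketch: the function name, the ID format, and the 30% hidden fraction are assumptions, not something from this thread.</p>

```python
import hashlib


def assign_split(case_id: str, hidden_fraction: float = 0.3) -> str:
    """Deterministically map a case ID to 'dev' or 'hidden'."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    # Turn the first 8 bytes into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "hidden" if bucket < hidden_fraction else "dev"


# Hypothetical IDs; in practice these come from your own eval cases.
case_ids = [f"case-{i:03d}" for i in range(200)]
splits = {cid: assign_split(cid) for cid in case_ids}
dev = [cid for cid, s in splits.items() if s == "dev"]
hidden = [cid for cid, s in splits.items() if s == "hidden"]
print(len(dev), len(hidden))
```

<p dir="auto">Because the assignment depends only on the ID, new cases from production failures can be added later without reshuffling the ones people have already tuned against.</p>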
]]></description><link>https://localaihub.com/post/541</link><guid isPermaLink="true">https://localaihub.com/post/541</guid><dc:creator><![CDATA[zeroOne]]></dc:creator><pubDate>Wed, 06 May 2026 01:54:00 GMT</pubDate></item><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Tue, 05 May 2026 23:48:00 GMT]]></title><description><![CDATA[<p dir="auto">Won't the eval set get overfitted? After a few rounds of prompt tuning, the setup just gets good at answering these specific questions.</p>
]]></description><link>https://localaihub.com/post/540</link><guid isPermaLink="true">https://localaihub.com/post/540</guid><dc:creator><![CDATA[普通网友A]]></dc:creator><pubDate>Tue, 05 May 2026 23:48:00 GMT</pubDate></item><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Tue, 05 May 2026 23:19:00 GMT]]></title><description><![CDATA[<p dir="auto">Code-model evals need to actually run tests. Having a model explain an algorithm is not the same as having it modify your legacy project.</p>
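]]></description><link>https://localaihub.com/post/539</link><guid isPermaLink="true">https://localaihub.com/post/539</guid><dc:creator><![CDATA[小陈在改bug]]></dc:creator><pubDate>Tue, 05 May 2026 23:19:00 GMT</pubDate></item><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Tue, 05 May 2026 21:36:00 GMT]]></title><description><![CDATA[<p dir="auto">There's also the cost dimension. At the same score of 92, one model costs 0.8 yuan per thousand calls and another costs 8 yuan per thousand calls, and that leads to a different product decision.</p>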
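<p dir="auto">To put rough numbers on that, here is a small sketch. It assumes the score of 92 can be read as a 92% pass rate and that traffic is one million calls a month; both are assumptions for illustration, and the model names are placeholders.</p>

```python
def cost_per_thousand_passes(price_per_thousand_calls, pass_rate):
    """Yuan spent per thousand acceptable answers."""
    if not 0 < pass_rate <= 1:
        raise ValueError("pass_rate must be in (0, 1]")
    return price_per_thousand_calls / pass_rate


def monthly_cost(price_per_thousand_calls, calls_per_month):
    """Total yuan per month at the given call volume."""
    return price_per_thousand_calls * calls_per_month / 1000


# Prices are from the post; names and volume are hypothetical.
candidates = {"model_cheap": 0.8, "model_pricey": 8.0}
PASS_RATE = 0.92
CALLS_PER_MONTH = 1_000_000

for name, price in candidates.items():
    per_pass = cost_per_thousand_passes(price, PASS_RATE)
    total = monthly_cost(price, CALLS_PER_MONTH)
    print(f"{name}: {per_pass:.2f} yuan per 1000 passes, {total:.0f} yuan per month")
```

<p dir="auto">With equal scores the tenfold price gap carries straight through to the monthly bill, so the cost column belongs next to the score column in any comparison table.</p>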
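]]></description><link>https://localaihub.com/post/538</link><guid isPermaLink="true">https://localaihub.com/post/538</guid><dc:creator><![CDATA[阿宁]]></dc:creator><pubDate>Tue, 05 May 2026 21:36:00 GMT</pubDate></item><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Tue, 05 May 2026 19:57:00 GMT]]></title><description><![CDATA[<p dir="auto">We added an "unacceptable error" label: fabricating policy, leaking internal information, making promises beyond its authority, and claiming evidence it doesn't have. This matters more than the average score.</p>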
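<p dir="auto">A sketch of how that label could act as a hard gate ahead of the average. The tag strings, the result format, and the threshold of 80 are assumptions for illustration; the four categories are the ones listed above.</p>

```python
# Hypothetical tag names for the four unacceptable categories.
UNACCEPTABLE_TAGS = frozenset({
    "fabricated_policy",
    "internal_leak",
    "overpromise",
    "fake_evidence",
})


def judge_model(results, min_average):
    """Reject on any unacceptable error, whatever the average score is."""
    if not results:
        raise ValueError("no results to judge")
    average = sum(r["score"] for r in results) / len(results)
    blockers = [
        r["case_id"] for r in results
        if UNACCEPTABLE_TAGS.intersection(r["tags"])
    ]
    accepted = not blockers and average >= min_average
    return {"average": average, "blockers": blockers, "accepted": accepted}


# Made-up results: model A scores higher but fabricates a policy once.
model_a = [
    {"case_id": "a-1", "score": 100, "tags": []},
    {"case_id": "a-2", "score": 100, "tags": []},
    {"case_id": "a-3", "score": 85, "tags": ["fabricated_policy"]},
]
model_b = [
    {"case_id": "b-1", "score": 90, "tags": []},
    {"case_id": "b-2", "score": 88, "tags": []},
    {"case_id": "b-3", "score": 86, "tags": ["verbose"]},
]
print(judge_model(model_a, min_average=80))
print(judge_model(model_b, min_average=80))
```

<p dir="auto">In this toy data the higher-scoring model is the one that gets rejected, which is the situation an average alone would hide.</p>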
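]]></description><link>https://localaihub.com/post/537</link><guid isPermaLink="true">https://localaihub.com/post/537</guid><dc:creator><![CDATA[melo]]></dc:creator><pubDate>Tue, 05 May 2026 19:57:00 GMT</pubDate></item><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Tue, 05 May 2026 18:11:00 GMT]]></title><description><![CDATA[<p dir="auto">Set the criteria by hand first, then bring in a model to assist. A model judge also tends to favor a particular writing style. In Chinese especially, answers padded with polite phrases tend to get high scores.</p>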
]]></description><link>https://localaihub.com/post/536</link><guid isPermaLink="true">https://localaihub.com/post/536</guid><dc:creator><![CDATA[leaf_1997]]></dc:creator><pubDate>Tue, 05 May 2026 18:11:00 GMT</pubDate></item><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Tue, 05 May 2026 15:51:00 GMT]]></title><description><![CDATA[<p dir="auto">For scoring, do you use human raters or a model as the judge?</p>
]]></description><link>https://localaihub.com/post/535</link><guid isPermaLink="true">https://localaihub.com/post/535</guid><dc:creator><![CDATA[小周]]></dc:creator><pubDate>Tue, 05 May 2026 15:51:00 GMT</pubDate></item><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Tue, 05 May 2026 14:20:00 GMT]]></title><description><![CDATA[<p dir="auto">For long-context scenarios you can borrow the approach from LongBench, but don't treat its score as your conclusion. What you need to test is whether the answer cites the correct passage.</p>
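]]></description><link>https://localaihub.com/post/534</link><guid isPermaLink="true">https://localaihub.com/post/534</guid><dc:creator><![CDATA[nora]]></dc:creator><pubDate>Tue, 05 May 2026 14:20:00 GMT</pubDate></item><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Tue, 05 May 2026 13:30:00 GMT]]></title><description><![CDATA[<p dir="auto">Sample from real failures and high-frequency scenarios. For customer service, for example, sample refunds, invoices, requests beyond authority, abusive messages, missing information, and policy conflicts. Start with 20 cases per category.</p>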
]]></description><link>https://localaihub.com/post/533</link><guid isPermaLink="true">https://localaihub.com/post/533</guid><dc:creator><![CDATA[林小北]]></dc:creator><pubDate>Tue, 05 May 2026 13:30:00 GMT</pubDate></item><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Tue, 05 May 2026 10:42:00 GMT]]></title><description><![CDATA[<p dir="auto">How do I keep my own test set from being arbitrary guesswork?</p>
]]></description><link>https://localaihub.com/post/532</link><guid isPermaLink="true">https://localaihub.com/post/532</guid><dc:creator><![CDATA[小潘同学]]></dc:creator><pubDate>Tue, 05 May 2026 10:42:00 GMT</pubDate></item><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Tue, 05 May 2026 08:01:00 GMT]]></title><description><![CDATA[<p dir="auto">Frameworks like lm-evaluation-harness, OpenCompass, and HELM are good for building a feel for evaluation methodology, but before going to production you need to add your own test set.</p>
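]]></description><link>https://localaihub.com/post/531</link><guid isPermaLink="true">https://localaihub.com/post/531</guid><dc:creator><![CDATA[陈一]]></dc:creator><pubDate>Tue, 05 May 2026 08:01:00 GMT</pubDate></item><item><title><![CDATA[Reply to Don't Just Copy Public Leaderboards for Your Model Eval Set on Tue, 05 May 2026 05:07:00 GMT]]></title><description><![CDATA[<p dir="auto">Public leaderboards work for initial screening, but they can't replace business evaluation. Leaderboard questions are too far from the distribution of questions your users actually ask.</p>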
]]></description><link>https://localaihub.com/post/530</link><guid isPermaLink="true">https://localaihub.com/post/530</guid><dc:creator><![CDATA[zeroOne]]></dc:creator><pubDate>Tue, 05 May 2026 05:07:00 GMT</pubDate></item></channel></rss>