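<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[PDF parsing loses the entire heading hierarchy. How do I rescue RAG?]]></title><description><![CDATA[<p dir="auto">After parsing, our company policy PDFs come out as one blob of body text with no heading hierarchy. Retrieval still hits the right passages, but the answers often miss exception clauses.</p>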
]]></description><link>https://localaihub.com/topic/61/pdf-解析出来标题层级全丢-rag-怎么救</link><generator>RSS for Node</generator><lastBuildDate>Wed, 03 Jun 2026 19:24:09 GMT</lastBuildDate><atom:link href="https://localaihub.com/topic/61.rss" rel="self" type="application/rss+xml"/><pubDate>Sun, 03 May 2026 02:32:00 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 22:41:00 GMT]]></title><description><![CDATA[<p dir="auto">Update: after switching to Docling the headings are much better. Tables still need separate handling, but at least it is no longer one blob.</p>
]]></description><link>https://localaihub.com/post/198</link><guid isPermaLink="true">https://localaihub.com/post/198</guid><dc:creator><![CDATA[半截薯条]]></dc:creator><pubDate>Sun, 03 May 2026 22:41:00 GMT</pubDate></item><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 20:36:00 GMT]]></title><description><![CDATA[<p dir="auto">Right, an intermediate artifact that the business side can read matters a lot. RAG input should be something a human can review first.</p>
]]></description><link>https://localaihub.com/post/197</link><guid isPermaLink="true">https://localaihub.com/post/197</guid><dc:creator><![CDATA[nora]]></dc:creator><pubDate>Sun, 03 May 2026 20:36:00 GMT</pubDate></item><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 17:54:00 GMT]]></title><description><![CDATA[<p dir="auto">I export the parse result to Markdown for the business side to review first, instead of loading it straight into the knowledge base.</p>
]]></description><link>https://localaihub.com/post/196</link><guid isPermaLink="true">https://localaihub.com/post/196</guid><dc:creator><![CDATA[半截薯条]]></dc:creator><pubDate>Sun, 03 May 2026 17:54:00 GMT</pubDate></item><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 15:46:00 GMT]]></title><description><![CDATA[<p dir="auto">Document-structure rules are not fake AI. What is fake is passing rules off as intelligent Q&amp;A. In production, parsing needs rules and models working together.</p>
]]></description><link>https://localaihub.com/post/195</link><guid isPermaLink="true">https://localaihub.com/post/195</guid><dc:creator><![CDATA[林小北]]></dc:creator><pubDate>Sun, 03 May 2026 15:46:00 GMT</pubDate></item><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 14:20:00 GMT]]></title><description><![CDATA[<p dir="auto">Does that count as hard-coded rules?</p>
]]></description><link>https://localaihub.com/post/194</link><guid isPermaLink="true">https://localaihub.com/post/194</guid><dc:creator><![CDATA[不想写周报]]></dc:creator><pubDate>Sun, 03 May 2026 14:20:00 GMT</pubDate></item><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 13:20:00 GMT]]></title><description><![CDATA[<p dir="auto">We ended up adding rules: text counts as a heading only if it is consecutive short lines, set in a larger font size, with blank lines before and after. It is not pure AI.</p>
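<p dir="auto">A minimal sketch of what such a rule could look like, not our exact production code. The <code>Line</code> record, the 30-character limit, and the 1.1 font-size ratio are all assumptions for illustration; the font size and blank-line flags are assumed to come from whatever layout-aware extractor you use. This version checks each short line on its own and does not group consecutive ones.</p>

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    """Hypothetical record for one extracted line (fields are assumptions)."""
    text: str
    font_size: float
    blank_before: bool
    blank_after: bool


def is_heading(line: Line, body_size: float,
               max_chars: int = 30, size_ratio: float = 1.1) -> bool:
    """All three signals must agree: short, larger font, blank lines around."""
    text = line.text.strip()
    return (
        0 < len(text) <= max_chars
        and line.font_size >= body_size * size_ratio
        and line.blank_before
        and line.blank_after
    )


def tag_headings(lines: list[Line]) -> list[tuple[bool, str]]:
    """Use the median font size as the body size, then flag heading candidates."""
    if not lines:
        return []
    body_size = median(line.font_size for line in lines)
    return [(is_heading(line, body_size), line.text) for line in lines]
```

<p dir="auto">Requiring all three signals keeps false positives down at the cost of missing headings that sit flush against the body text, so the thresholds need tuning per document set.</p>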
]]></description><link>https://localaihub.com/post/193</link><guid isPermaLink="true">https://localaihub.com/post/193</guid><dc:creator><![CDATA[半糖]]></dc:creator><pubDate>Sun, 03 May 2026 13:20:00 GMT</pubDate></item><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 12:05:00 GMT]]></title><description><![CDATA[<p dir="auto">It can, but don't trust it blindly. Chinese policy documents mix “一、”, “（一）”, and “1.” numbering, and parsers get confused by that too.</p>
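<p dir="auto">One way to cross-check the parser is to map those numbering markers to levels yourself. This is only a sketch: the level order (“一、” above “（一）” above “1.”) is an assumed convention, and since documents mix the styles you should confirm it against your own files. <code>heading_level</code> is a hypothetical helper name.</p>

```python
import re
from typing import Optional

CN_NUM = "一二三四五六七八九十百"

# Assumed nesting order; verify against the actual documents.
PATTERNS = [
    (1, re.compile(rf"^[{CN_NUM}]+、")),             # 一、
    (2, re.compile(rf"^[（(][{CN_NUM}]+[）)]")),      # （一） or (一)
    (3, re.compile(r"^\d+[.．](?!\d)")),             # 1. but not 1.5
]


def heading_level(line: str) -> Optional[int]:
    """Return the assumed level for a numbered line, or None if no marker matches."""
    text = line.strip()
    for level, pattern in PATTERNS:
        if pattern.match(text):
            return level
    return None
```

<p dir="auto">A match only marks a candidate. Numbered list items inside a clause use the same markers, so it is worth combining this with a length or font-size check before treating the line as a heading.</p>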
]]></description><link>https://localaihub.com/post/192</link><guid isPermaLink="true">https://localaihub.com/post/192</guid><dc:creator><![CDATA[阿白]]></dc:creator><pubDate>Sun, 03 May 2026 12:05:00 GMT</pubDate></item><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 09:40:00 GMT]]></title><description><![CDATA[<p dir="auto">Can the parser detect headings automatically?</p>
]]></description><link>https://localaihub.com/post/191</link><guid isPermaLink="true">https://localaihub.com/post/191</guid><dc:creator><![CDATA[小树]]></dc:creator><pubDate>Sun, 03 May 2026 09:40:00 GMT</pubDate></item><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 07:06:00 GMT]]></title><description><![CDATA[<p dir="auto">I suggest running a “parse quality acceptance check” first instead of ingesting directly. Sample 20 pages and manually check the headings, paragraphs, tables, and page numbers.</p>
]]></description><link>https://localaihub.com/post/190</link><guid isPermaLink="true">https://localaihub.com/post/190</guid><dc:creator><![CDATA[小路灯]]></dc:creator><pubDate>Sun, 03 May 2026 07:06:00 GMT</pubDate></item><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 06:31:00 GMT]]></title><description><![CDATA[<p dir="auto">There are also clauses that span pages. Visually it is a single clause in the PDF, but the extracted text breaks at the end of the page.</p>
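<p dir="auto">A rough heuristic for stitching these back together, as a sketch only: if the last paragraph on a page does not end with sentence-final punctuation, treat the first paragraph of the next page as its continuation. <code>merge_cross_page</code> and the punctuation set are assumptions, and it presumes headers and footers have already been stripped and that the text is CJK (so fragments are joined without a space).</p>

```python
# Assumed set of sentence-final marks; extend it for your documents.
TERMINALS = tuple("。！？；.!?;")


def merge_cross_page(pages: list[list[str]]) -> list[str]:
    """Flatten per-page paragraphs, joining a paragraph cut off at a page break.

    pages: one list of paragraph strings per page, in reading order.
    """
    merged: list[str] = []
    for paragraphs in pages:
        for i, para in enumerate(paragraphs):
            # Only the first paragraph of a page can continue the previous page.
            continues = (
                i == 0
                and bool(merged)
                and not merged[-1].rstrip().endswith(TERMINALS)
            )
            if continues:
                merged[-1] = merged[-1].rstrip() + para.lstrip()
            else:
                merged.append(para)
    return merged
```

<p dir="auto">It will wrongly glue a heading that has no trailing punctuation to the next page's first paragraph, so run heading detection first or skip lines already tagged as headings.</p>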
]]></description><link>https://localaihub.com/post/189</link><guid isPermaLink="true">https://localaihub.com/post/189</guid><dc:creator><![CDATA[米饭]]></dc:creator><pubDate>Sun, 03 May 2026 06:31:00 GMT</pubDate></item><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 06:08:00 GMT]]></title><description><![CDATA[<p dir="auto">Strip the headers and footers. Otherwise every chunk carries the company name and a page number, and the embeddings get polluted.</p>
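<p dir="auto">A sketch of one way to do it, assuming you already have each page as a list of text lines: count lines that repeat near the top or bottom of most pages, with digits masked so “第 1 页” and “第 2 页” count as the same line. The function name, the 0.6 ratio, and the two-line edge window are all assumptions to tune.</p>

```python
import re
from collections import Counter


def _normalize(line: str) -> str:
    """Mask digit runs so page numbers on different pages compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_repeated_lines(pages: list[list[str]],
                         min_ratio: float = 0.6,
                         edge: int = 2) -> list[list[str]]:
    """Drop lines that recur in the top/bottom `edge` lines of most pages."""
    counts: Counter = Counter()
    for lines in pages:
        edge_lines = lines[:edge] + lines[-edge:]
        # A set, so a line is counted at most once per page.
        counts.update({_normalize(l) for l in edge_lines if l.strip()})

    threshold = max(2, int(len(pages) * min_ratio))
    boilerplate = {key for key, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for idx, line in enumerate(lines):
            at_edge = idx < edge or idx >= len(lines) - edge
            if at_edge and _normalize(line) in boilerplate:
                continue
            kept.append(line)
        cleaned.append(kept)
    return cleaned
```

<p dir="auto">Because digits are masked, body lines near a page edge that differ only by an Arabic number (for example numbered clauses) can be misread as footers, so spot-check what gets removed.</p>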
]]></description><link>https://localaihub.com/post/188</link><guid isPermaLink="true">https://localaihub.com/post/188</guid><dc:creator><![CDATA[林小北]]></dc:creator><pubDate>Sun, 03 May 2026 06:08:00 GMT</pubDate></item><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 05:48:00 GMT]]></title><description><![CDATA[<p dir="auto">We used to extract text directly with pdfplumber. It was fast, but the table of contents, page headers, and tables all got mixed into the output.</p>
]]></description><link>https://localaihub.com/post/187</link><guid isPermaLink="true">https://localaihub.com/post/187</guid><dc:creator><![CDATA[小周]]></dc:creator><pubDate>Sun, 03 May 2026 05:48:00 GMT</pubDate></item><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 04:14:00 GMT]]></title><description><![CDATA[<p dir="auto">Docling is worth a try, especially when you need to preserve structure. Unstructured's partition can also split the document by element type.</p>
]]></description><link>https://localaihub.com/post/186</link><guid isPermaLink="true">https://localaihub.com/post/186</guid><dc:creator><![CDATA[nora]]></dc:creator><pubDate>Sun, 03 May 2026 04:14:00 GMT</pubDate></item><item><title><![CDATA[Reply to PDF parsing loses the entire heading hierarchy. How do I rescue RAG? on Sun, 03 May 2026 03:59:00 GMT]]></title><description><![CDATA[<p dir="auto">This is a parsing problem, not a RAG problem. Once the heading hierarchy is gone, no amount of chunk tuning helps; you are just tearing up scrap paper.</p>
]]></description><link>https://localaihub.com/post/185</link><guid isPermaLink="true">https://localaihub.com/post/185</guid><dc:creator><![CDATA[阿航]]></dc:creator><pubDate>Sun, 03 May 2026 03:59:00 GMT</pubDate></item></channel></rss>