<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Ingesting scanned PDFs: OCR errors turn RAG into guesswork]]></title><description><![CDATA[<p dir="auto">After running OCR on a scanned contract, "不可抗力" (force majeure) came out as "不司抗力", and retrieval cannot find the clause at all. How do you handle this?</p>
]]></description><link>https://localaihub.com/topic/63/扫描版-pdf-入库-ocr-错字会让-rag-变玄学</link><generator>RSS for Node</generator><lastBuildDate>Wed, 03 Jun 2026 19:15:55 GMT</lastBuildDate><atom:link href="https://localaihub.com/topic/63.rss" rel="self" type="application/rss+xml"/><pubDate>Sun, 03 May 2026 08:56:00 GMT</pubDate><ttl>60</ttl><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Mon, 04 May 2026 08:59:00 GMT]]></title><description><![CDATA[<p dir="auto">Right. Better to answer fewer questions than to let bad OCR pollute the whole knowledge base.</p>
]]></description><link>https://localaihub.com/post/228</link><guid isPermaLink="true">https://localaihub.com/post/228</guid><dc:creator><![CDATA[林小北]]></dc:creator><pubDate>Mon, 04 May 2026 08:59:00 GMT</pubDate></item><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Mon, 04 May 2026 06:29:00 GMT]]></title><description><![CDATA[<p dir="auto">Got it. Quarantine the low-confidence pages first; they must not be mixed straight into the production index.</p>
]]></description><link>https://localaihub.com/post/227</link><guid isPermaLink="true">https://localaihub.com/post/227</guid><dc:creator><![CDATA[小满]]></dc:creator><pubDate>Mon, 04 May 2026 06:29:00 GMT</pubDate></item><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Mon, 04 May 2026 05:43:00 GMT]]></title><description><![CDATA[<p dir="auto">A more realistic approach is to do the visual parsing at ingestion time, then answer from the clean text and fall back on a citation to the original image.</p>
]]></description><link>https://localaihub.com/post/226</link><guid isPermaLink="true">https://localaihub.com/post/226</guid><dc:creator><![CDATA[nora]]></dc:creator><pubDate>Mon, 04 May 2026 05:43:00 GMT</pubDate></item><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Mon, 04 May 2026 04:45:00 GMT]]></title><description><![CDATA[<p dir="auto">If you use a multimodal model for verification, that works, but account for cost and latency. Don't read the full-page image on every question.</p>
]]></description><link>https://localaihub.com/post/225</link><guid isPermaLink="true">https://localaihub.com/post/225</guid><dc:creator><![CDATA[阿航]]></dc:creator><pubDate>Mon, 04 May 2026 04:45:00 GMT</pubDate></item><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Mon, 04 May 2026 02:44:00 GMT]]></title><description><![CDATA[<p dir="auto">Is it necessary to show the model the original image as well?</p>
]]></description><link>https://localaihub.com/post/224</link><guid isPermaLink="true">https://localaihub.com/post/224</guid><dc:creator><![CDATA[小满]]></dc:creator><pubDate>Mon, 04 May 2026 02:44:00 GMT</pubDate></item><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Mon, 04 May 2026 00:06:00 GMT]]></title><description><![CDATA[<p dir="auto">Numbers, amounts, and dates need their own validation step. If OCR reads an 8 as a 3, no embedding model can fix that, however strong it is.</p>
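<p dir="auto">One way to sketch such a check, assuming you can get two independent OCR readings of the same page (two engines, or two passes with different preprocessing; that setup is an assumption, not something described in this thread). The pattern and the function names <code>numeric_tokens</code> and <code>numbers_disagree</code> are hypothetical:</p>

```python
import re

# Hypothetical pattern: digit runs that may contain , . / - inside,
# e.g. amounts like 1,280,000.50 or dates like 2026-05-03.
NUMERIC_TOKEN = re.compile(r"\d[\d,./-]*\d|\d")


def numeric_tokens(text: str) -> list[str]:
    """Return every numeric-looking token in reading order."""
    return NUMERIC_TOKEN.findall(text)


def numbers_disagree(reading_a: str, reading_b: str) -> bool:
    """True when two OCR readings of one page differ on any numeric token."""
    return numeric_tokens(reading_a) != numeric_tokens(reading_b)


if __name__ == "__main__":
    a = "Amount: 8,000 due 2026-05-03"
    b = "Amount: 3,000 due 2026-05-03"
    # A page whose readings disagree on digits would go to manual review.
    print(numbers_disagree(a, b))
```

<p dir="auto">This only detects disagreement; it cannot tell which reading is right, so a flagged page still needs a human or a second source of truth.</p>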
]]></description><link>https://localaihub.com/post/223</link><guid isPermaLink="true">https://localaihub.com/post/223</guid><dc:creator><![CDATA[rootless]]></dc:creator><pubDate>Mon, 04 May 2026 00:06:00 GMT</pubDate></item><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Sun, 03 May 2026 21:42:00 GMT]]></title><description><![CDATA[<p dir="auto">We added synonym-based and error-correcting query expansion. It rescues some queries, but it cannot rescue misrecognized key numbers.</p>
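<p dir="auto">A minimal sketch of the error-correcting half only (synonyms are left out), and not necessarily how the setup above works. It assumes a hand-built table of characters your OCR engine tends to confuse; <code>CONFUSIONS</code> and <code>expand_query</code> are hypothetical names, and the single entry comes from the example in the original question:</p>

```python
# Hypothetical confusion table: correct character -> OCR misreadings.
# Fill it from your own OCR error logs.
CONFUSIONS = {"可": ["司"]}


def expand_query(query: str, confusions=CONFUSIONS) -> list[str]:
    """Return the query plus every variant with exactly one character swapped."""
    variants = [query]
    for i, ch in enumerate(query):
        for wrong in confusions.get(ch, []):
            variants.append(query[:i] + wrong + query[i + 1:])
    return variants


if __name__ == "__main__":
    # Each variant is sent to keyword retrieval alongside the original query.
    print(expand_query("不可抗力"))
```

<p dir="auto">Swapping one character at a time keeps the number of variants linear in the query length; allowing several simultaneous swaps grows much faster and is probably only worth it for short queries.</p>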
]]></description><link>https://localaihub.com/post/222</link><guid isPermaLink="true">https://localaihub.com/post/222</guid><dc:creator><![CDATA[小潘同学]]></dc:creator><pubDate>Sun, 03 May 2026 21:42:00 GMT</pubDate></item><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Sun, 03 May 2026 21:11:00 GMT]]></title><description><![CDATA[<p dir="auto">OCR errors hurt more than recall; they also undermine the credibility of citations. Users who see typos simply stop trusting the system.</p>
]]></description><link>https://localaihub.com/post/221</link><guid isPermaLink="true">https://localaihub.com/post/221</guid><dc:creator><![CDATA[小路灯]]></dc:creator><pubDate>Sun, 03 May 2026 21:11:00 GMT</pubDate></item><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Sun, 03 May 2026 20:51:00 GMT]]></title><description><![CDATA[<p dir="auto">It's usable, but it depends on the font and the scan quality. Don't ask a forum about quality in the abstract; test it on your own files.</p>
]]></description><link>https://localaihub.com/post/220</link><guid isPermaLink="true">https://localaihub.com/post/220</guid><dc:creator><![CDATA[melo]]></dc:creator><pubDate>Sun, 03 May 2026 20:51:00 GMT</pubDate></item><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Sun, 03 May 2026 17:46:00 GMT]]></title><description><![CDATA[<p dir="auto">Is Tesseract good enough for Chinese?</p>
]]></description><link>https://localaihub.com/post/219</link><guid isPermaLink="true">https://localaihub.com/post/219</guid><dc:creator><![CDATA[小周]]></dc:creator><pubDate>Sun, 03 May 2026 17:46:00 GMT</pubDate></item><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Sun, 03 May 2026 16:12:00 GMT]]></title><description><![CDATA[<p dir="auto">We use PaddleOCR for Chinese scans. The results are decent, but seals, tables, and photos shot at an angle are still troublesome.</p>
]]></description><link>https://localaihub.com/post/218</link><guid isPermaLink="true">https://localaihub.com/post/218</guid><dc:creator><![CDATA[melo]]></dc:creator><pubDate>Sun, 03 May 2026 16:12:00 GMT</pubDate></item><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Sun, 03 May 2026 14:25:00 GMT]]></title><description><![CDATA[<p dir="auto">For high-risk documents such as contracts, scans should ideally go through manual or semi-automatic proofreading first.</p>
]]></description><link>https://localaihub.com/post/217</link><guid isPermaLink="true">https://localaihub.com/post/217</guid><dc:creator><![CDATA[林小北]]></dc:creator><pubDate>Sun, 03 May 2026 14:25:00 GMT</pubDate></item><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Sun, 03 May 2026 12:26:00 GMT]]></title><description><![CDATA[<p dir="auto">You can keep the OCR confidence scores. Low-confidence pages should not go straight into the production index; at the very least, flag them.</p>
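<p dir="auto">A rough sketch of that gate, assuming your OCR step already yields one mean confidence per page on a 0 to 1 scale. The <code>OcrPage</code> type, the <code>split_by_confidence</code> name, and the 0.90 threshold are all hypothetical; calibrate the threshold on your own scans:</p>

```python
from dataclasses import dataclass


@dataclass
class OcrPage:
    page_no: int
    text: str
    confidence: float  # assumed: mean OCR confidence for the page, 0.0 to 1.0


def split_by_confidence(pages, threshold=0.90):
    """Split pages into (indexable, quarantined) by mean OCR confidence."""
    indexable, quarantined = [], []
    for page in pages:
        if page.confidence >= threshold:
            indexable.append(page)
        else:
            # Held back for review; never written to the production index.
            quarantined.append(page)
    return indexable, quarantined


if __name__ == "__main__":
    pages = [OcrPage(1, "clean page", 0.97), OcrPage(2, "smudged page", 0.62)]
    ok, held = split_by_confidence(pages)
    print([p.page_no for p in ok], [p.page_no for p in held])
```

<p dir="auto">A page-level mean can hide one badly misread line, so a stricter variant would gate on the lowest line or word confidence instead.</p>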
]]></description><link>https://localaihub.com/post/216</link><guid isPermaLink="true">https://localaihub.com/post/216</guid><dc:creator><![CDATA[nora]]></dc:creator><pubDate>Sun, 03 May 2026 12:26:00 GMT</pubDate></item><item><title><![CDATA[Reply to Ingesting scanned PDFs: OCR errors turn RAG into guesswork on Sun, 03 May 2026 11:57:00 GMT]]></title><description><![CDATA[<p dir="auto">OCR quality should be part of your acceptance metrics. Don't wait until users can't get answers to find out.</p>
]]></description><link>https://localaihub.com/post/215</link><guid isPermaLink="true">https://localaihub.com/post/215</guid><dc:creator><![CDATA[阿白]]></dc:creator><pubDate>Sun, 03 May 2026 11:57:00 GMT</pubDate></item></channel></rss>