Maximizing Document Ingestion: Converting Formats for Optimal Chunking
Hello Dust team, I've read your article https://blog.dust.tt/dust-is-better-at-company-data/ which describes your approach to structure-preserving, balanced chunking for document ingestion and semantic search. In this context, I have two questions:
1. I'm extracting large amounts of content from various formats (PDF, DOCX, PPTX, HTML, etc.). To speed up ingestion and avoid certain limitations (such as the number of PDFs that can be uploaded to Dust), I'm considering converting all documents to Markdown on my server before sending them to Dust; a rough sketch of the conversion step I have in mind follows this list. Is this a good idea, given your chunking approach? Are there any pitfalls I should watch out for, especially regarding structure preservation (headings, lists, sections) in Markdown compared to the native formats? I've noticed your dynamic conversion can sometimes be a bottleneck, so server-side conversion would be much faster and more scalable for me.
2. Which open-source Python tools or libraries would you recommend for this kind of conversion and pre-chunking, especially to stay as close as possible to your "structure-preserving, balanced chunking" philosophy? (I'm considering Microsoft's MarkItDown, but I'm open to other suggestions that might better match your ingestion pipeline.) The second sketch below shows the kind of heading-aware pre-chunking I have in mind.
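For context, here is a minimal sketch of the conversion step I'm considering. It assumes MarkItDown as the converter (not a settled choice) and uses placeholder directory names:

```python
from pathlib import Path

from markitdown import MarkItDown  # pip install markitdown

# Placeholder directories; these would be adjusted to the real ingestion layout.
SOURCE_DIR = Path("documents/raw")
OUTPUT_DIR = Path("documents/markdown")


def convert_all_to_markdown() -> None:
    """Convert PDF/DOCX/PPTX/HTML files to Markdown before sending them to Dust."""
    converter = MarkItDown()
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

    for source in SOURCE_DIR.rglob("*"):
        if source.suffix.lower() not in {".pdf", ".docx", ".pptx", ".html"}:
            continue
        # MarkItDown returns a result whose text_content is Markdown text.
        result = converter.convert(str(source))
        target = OUTPUT_DIR / f"{source.stem}.md"
        target.write_text(result.text_content, encoding="utf-8")


if __name__ == "__main__":
    convert_all_to_markdown()
```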
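And here is a rough, naive approximation of the heading-aware pre-chunking I would apply to the resulting Markdown. The character budget is a placeholder, since I don't know your actual chunk sizing:

```python
import re

# Placeholder budget; I don't know Dust's real chunk-size target.
MAX_CHUNK_CHARS = 4000


def split_markdown_by_headings(markdown: str) -> list[str]:
    """Split a Markdown document at heading boundaries, then pack whole
    sections into chunks so no section is cut in half."""
    # Split right before every ATX heading (#, ##, ...), keeping each heading with its body.
    sections = re.split(r"\n(?=#{1,6} )", markdown)

    chunks: list[str] = []
    current = ""
    for section in sections:
        if current and len(current) + len(section) > MAX_CHUNK_CHARS:
            chunks.append(current)
            current = section
        else:
            current = f"{current}\n{section}" if current else section
    if current:
        chunks.append(current)
    return chunks
```

In this sketch, a single section longer than the budget still comes out as one oversized chunk; in practice I would probably recurse into subheadings or paragraphs for that case.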
Any advice on preserving the structural richness needed for good chunking and semantic search would be much appreciated! Thank you in advance for your feedback.