Maximizing Document Ingestion: Converting Formats for Optimal Chunking
Hello Dust team, I’ve read your article https://blog.dust.tt/dust-is-better-at-company-data/, which describes your approach to structure-preserving, balanced chunking for document ingestion and semantic search. In this context, I have two questions:
1. I’m extracting large amounts of content from various formats (PDF, DOCX, PPTX, HTML, etc.). To speed up ingestion and avoid certain limitations (such as the number of PDFs that can be uploaded to Dust), I’m considering converting all documents to Markdown on my server before sending them to Dust. Is this a good idea, given your chunking approach? Are there any pitfalls I should watch out for, especially regarding structure preservation (headings, lists, sections) in Markdown compared to the native formats? I’ve noticed your dynamic conversion can sometimes be a bottleneck, so server-side conversion would be much faster and more scalable for me; I’ve included a sketch of the conversion step I have in mind after this list.
2. Which open-source Python tools or libraries would you recommend for this kind of conversion and pre-chunking, especially to stay as close as possible to your “structure-preserving, balanced chunking” philosophy? (I’m considering Microsoft’s MarkItDown, but I’m open to other suggestions that might better match your ingestion pipeline.) I’ve also sketched, below the list, the heading-aware pre-chunking I would do on my side, in case that helps frame the question.
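
For context, here is the server-side conversion step I have in mind, a minimal sketch built on Microsoft’s MarkItDown (`pip install markitdown`); the directory paths and the extension filter are placeholders for my own setup:

```python
# Minimal sketch: convert a folder of mixed documents to Markdown with MarkItDown.
# Paths and the extension filter are placeholders for my own setup.
from pathlib import Path

from markitdown import MarkItDown

SOURCE_DIR = Path("documents/raw")        # mixed PDF / DOCX / PPTX / HTML
OUTPUT_DIR = Path("documents/markdown")   # .md files I would then send to Dust
SUPPORTED = {".pdf", ".docx", ".pptx", ".html", ".htm"}

converter = MarkItDown()
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

for source in sorted(SOURCE_DIR.rglob("*")):
    if source.suffix.lower() not in SUPPORTED:
        continue
    try:
        result = converter.convert(str(source))
    except Exception as exc:
        # Conversion failures are logged and skipped rather than aborting the batch.
        print(f"Skipping {source}: {exc}")
        continue
    target = OUTPUT_DIR / f"{source.stem}.md"
    # result.text_content holds the Markdown rendering of the document.
    target.write_text(result.text_content, encoding="utf-8")
    print(f"Converted {source.name} -> {target.name}")
```

The resulting .md files are what I would upload instead of the original binaries, so that your pipeline only ever sees Markdown with explicit headings and lists.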
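
And in case it helps frame question 2, this is the rough heading-aware pre-chunking I would apply on my side if you recommend splitting before upload. The 1,500-character target and the greedy merging are my own guesses, not anything taken from your pipeline, so please tell me if this would work against your chunker:

```python
# Rough sketch of heading-aware pre-chunking: split on ATX headings, then greedily
# merge adjacent sections toward a target size so chunks stay roughly balanced.
# The target size is my own guess, not a Dust value; fenced code blocks whose lines
# start with "#" are not handled in this sketch.
import re

TARGET_CHARS = 1500

def split_on_headings(markdown: str) -> list[str]:
    """Split a Markdown document into sections, each starting at a heading."""
    sections: list[str] = []
    current: list[str] = []
    for line in markdown.splitlines(keepends=True):
        if re.match(r"^#{1,6} ", line) and current:
            sections.append("".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("".join(current))
    return sections

def balance(sections: list[str], target: int = TARGET_CHARS) -> list[str]:
    """Greedily merge adjacent sections; oversized single sections are kept whole."""
    chunks: list[str] = []
    buffer = ""
    for section in sections:
        if buffer and len(buffer) + len(section) > target:
            chunks.append(buffer)
            buffer = ""
        buffer += section
    if buffer:
        chunks.append(buffer)
    return chunks

if __name__ == "__main__":
    sample = "# Title\n\nIntro.\n\n## Part A\n\nDetails.\n\n## Part B\n\nMore details.\n"
    for chunk in balance(split_on_headings(sample)):
        print(repr(chunk))
```

If your ingestion already rebalances Markdown sections better than this, I would rather skip the pre-chunking entirely and just send whole .md files.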
Any advice on keeping the necessary structural richness for optimal semantic search and chunking would be much appreciated! Thank you in advance for your feedback.
