Hello Dust team, I've read your article https://blog.dust.tt/dust-is-better-at-company-data/ which describes your approach to structure-preserving balanced chunking for document ingestion and semantic search. In this context, I have two questions:
I'm extracting large amounts of content from various formats (PDF, DOCX, PPTX, HTML, etc.). To speed up ingestion and avoid certain limitations (such as the number of PDFs that can be uploaded to Dust), I'm considering converting all documents to Markdown (MD) on my server before sending them to Dust. Is this a good idea, given your chunking approach? Are there any pitfalls I should watch out for, especially regarding structure preservation (headings, lists, sections) in Markdown compared to native formats? I've noticed your dynamic conversion can sometimes be a bottleneck, so server-side conversion would be much faster and more scalable for me.
Which open-source Python tools or libraries would you recommend for this kind of conversion and pre-chunking, especially to stay as close as possible to your "structure-preserving, balanced chunking" philosophy? (I'm considering Microsoft's MarkItDown, but I'm open to other suggestions that might better match your ingestion pipeline.)
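For context, here is roughly the server-side conversion I have in mind, as a minimal sketch using MarkItDown's `convert()` API (the folder paths and the file-type filter are just placeholders on my side):

```python
from pathlib import Path

from markitdown import MarkItDown  # pip install markitdown

converter = MarkItDown()
src_dir = Path("documents")   # placeholder: incoming PDF, DOCX, PPTX, HTML files
out_dir = Path("markdown")    # placeholder: one .md file per source document
out_dir.mkdir(exist_ok=True)

for path in src_dir.iterdir():
    if path.suffix.lower() not in {".pdf", ".docx", ".pptx", ".html"}:
        continue
    # convert() returns a result object whose text_content is the Markdown text
    result = converter.convert(str(path))
    (out_dir / f"{path.stem}.md").write_text(result.text_content, encoding="utf-8")
```

The idea is to keep a strict one-to-one mapping between source documents and Markdown files, so the structure (headings, lists, sections) of each original document stays in a single file.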
Any advice on keeping the necessary structural richness for optimal semantic search and chunking would be much appreciated! Thank you in advance for your feedback.
Hello Damien Laborie 👋 Dust handles the chunking itself. The only thing I would recommend is to keep 1 PDF = 1 "doc" in a Dust folder. This ensures Dust knows the chunks are linked and lets you benefit from Dust's built-in context-awareness. MD should be good 👍 For most use cases you don't have to overthink it: the vector search works well, and since chunks are "grouped" together, having text cut in half by chunking isn't much of an issue in our experience. On the library question, let me ask the team and get back to you.
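In the meantime, here is roughly what "1 PDF = 1 doc" could look like when pushing your converted Markdown via the API. This is only a sketch: the endpoint path, the `text` payload field, and the workspace / data source identifiers are assumptions on my side, so please double-check the API reference before relying on it.

```python
import requests

DUST_API_KEY = "sk-..."          # your Dust API key
WORKSPACE_ID = "your-workspace"  # assumption: replace with your workspace id
DATA_SOURCE = "your-folder"      # assumption: the folder / data source to upsert into


def upsert_markdown_doc(document_id: str, markdown_text: str) -> None:
    """Upsert one Markdown document (one per original PDF) into a Dust data source.

    The URL layout and JSON field below are assumptions; check the Dust API docs
    for the exact path and payload.
    """
    url = (
        f"https://dust.tt/api/v1/w/{WORKSPACE_ID}"
        f"/data_sources/{DATA_SOURCE}/documents/{document_id}"
    )
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {DUST_API_KEY}"},
        json={"text": markdown_text},  # assumption: field name for the document body
    )
    resp.raise_for_status()


# Example: one call per converted PDF, keeping the 1 PDF = 1 doc mapping.
# upsert_markdown_doc("quarterly-report-2024", open("markdown/quarterly-report-2024.md").read())
```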
