Hi everyone, I'm reaching out for guidance on setting up a Dust agent to perform RAG (using the 'Search' tool) on a text file containing a list of all our HubSpot tickets. The goal is to enable product managers to query the agent for examples of user tickets related to specific features within a given context.

Current issue: the file I'm working with is ~600MB, but I've noticed Dust's upload limit is 2MB. While I could technically split the file into 300 smaller parts, I suspect there might be a more efficient solution.

Questions:
Is the 2MB upload limit tied to my current plan, or is it a universal restriction for all users?
Are there recommended workarounds or best practices for handling large datasets like this in Dust, since this use case is quite common?
Any insights or alternatives you could provide would be greatly appreciated! Thanks in advance
Hey Raphaël Gardies, the size limit is universal, it's not tied to your plan. In your case, I would recommend creating one doc per ticket (instead of one massive file). You can push the documents using the Make/Zapier integration or the API. It is a bit more work but will give you much better results in the end (for a bunch of different, somewhat subtle & technical reasons).
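To illustrate the splitting step, here is a minimal sketch in Python, assuming your HubSpot export is a CSV with one ticket per row (the file name and column names below are placeholders, adjust them to your actual export):

```python
import csv
from pathlib import Path

# Placeholder names: adjust to match your actual HubSpot export.
EXPORT_FILE = "hubspot_tickets.csv"
OUT_DIR = Path("tickets")
OUT_DIR.mkdir(exist_ok=True)

with open(EXPORT_FILE, newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        ticket_id = row["ticket_id"]
        # One small text document per ticket, well under the 2MB limit.
        body = (
            f"Ticket {ticket_id}\n"
            f"Created: {row['created_at']}\n"
            f"Subject: {row['subject']}\n\n"
            f"{row['messages']}\n"
        )
        (OUT_DIR / f"ticket-{ticket_id}.txt").write_text(body, encoding="utf-8")
```

Each resulting file is tiny and maps one-to-one to a ticket, which is the shape you want before pushing to Dust.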
Hey Raphaël Gardies, I'm working on a similar solution to analyze our Intercom tickets. I've been cleaning the data upstream to keep strictly the message exchanges, the creation date and the contacts. In the end, months of tickets weigh less than 1 MB. But I'm hitting another limit, which is the number of tokens. The short-term solution is to perform the analysis on limited time frames (i.e. up to a month of data). The mid-to-long-term solution I'm exploring is to add a step to the cleaning process that compiles the messages into an AI-generated summary. That would divide the context size by roughly ten, enabling a full year of analysis.
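For that summarization step, here is a rough sketch of what I have in mind, using the OpenAI client as one example LLM (the model name and prompt are placeholders, any summarization API would do):

```python
from openai import OpenAI  # any LLM API works; OpenAI is just one example

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_ticket(exchange: str) -> str:
    """Compress a full message exchange into a short summary (~10x smaller)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": (
                    "Summarize this support ticket in a few sentences, keeping "
                    "the feature involved, the user's problem and the resolution."
                ),
            },
            {"role": "user", "content": exchange},
        ],
    )
    return response.choices[0].message.content
```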
Thanks! Remi, actually I have about 1M tickets, and even taking only the last month leaves 100k, so using Make to write files one by one might explode our quotas. Are you sure it would scale properly? Khaled K, I'm already exporting only the message exchanges, and regarding the number of tokens, I don't know if you are using RAG, but with RAG it shouldn't be an issue.
Ah, that's a lot 😬 Is using the API an option? The doc is here: https://docs.dust.tt/reference/post_api-v1-w-wid-spaces-spaceid-data-sources-dsid-documents-documentid. You can also ask @help in Dust to help you write the code.
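A rough sketch of what the per-ticket upsert could look like in Python; the workspace/space/data source IDs are placeholders, and the request body field should be double-checked against the reference above:

```python
import requests

# Placeholders: use your own workspace/space/data source IDs and an API key
# generated from your Dust workspace.
DUST_API_KEY = "sk-..."
WORKSPACE_ID = "your-workspace-id"
SPACE_ID = "your-space-id"
DATA_SOURCE_ID = "your-data-source-id"

def upsert_ticket(ticket_id: str, text: str) -> None:
    """Create or update one Dust document per ticket."""
    url = (
        f"https://dust.tt/api/v1/w/{WORKSPACE_ID}/spaces/{SPACE_ID}"
        f"/data_sources/{DATA_SOURCE_ID}/documents/ticket-{ticket_id}"
    )
    resp = requests.post(
        url,
        headers={"Authorization": f"Bearer {DUST_API_KEY}"},
        # Body field name from memory; check the linked reference for the exact schema.
        json={"text": text},
    )
    resp.raise_for_status()
```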
Working with Raphaël Gardies on this at Zeffy. Remi, does this API endpoint have a limit in terms of volume? Can we make 100K API calls (if we have 100K tickets on our end)? Or would you suggest we explore Khaled K's suggestion in order to decrease the volume?
Do you have any updates on this?
I'm pushing a codebase of about 40k small files. There is a 120 requests/min rate limit. I'm able to get about 100 requests/min into the system, so I'm looking at about 7-8h of running the process.
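In case it helps, the throttling itself can be as simple as sleeping between calls; a minimal sketch, where upsert is whatever function actually hits the API (e.g. the upsert_ticket sketch above):

```python
import time
from typing import Callable

REQUESTS_PER_MINUTE = 100            # stay comfortably below the 120/min limit
DELAY = 60.0 / REQUESTS_PER_MINUTE   # ~0.6s between calls

def push_all(documents: dict[str, str], upsert: Callable[[str, str], None]) -> None:
    """Throttled upload: at ~100 docs/min, 40k documents take roughly 7 hours."""
    for doc_id, text in documents.items():
        upsert(doc_id, text)  # e.g. the upsert_ticket sketch above
        time.sleep(DELAY)
```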
