Guidance Needed for Setting Up Dust Agent with Large Data Files

Hi everyone, I’m reaching out for guidance on setting up a Dust agent to perform RAG (using the 'Search' tool) on a text file containing a list of all our HubSpot tickets. The goal is to enable product managers to query the agent for examples of user tickets related to specific features within a given context.

Current Issue: The file I’m working with is ~600MB, but I’ve noticed Dust’s upload limit is 2MB. While I could technically split the file into 300 smaller parts, I suspect there might be a more efficient solution.

Questions:

  1. Is the 2MB upload limit tied to my current plan, or is it a universal restriction for all users?

  2. Are there recommended workarounds or best practices for handling large datasets like this in Dust, since this use case is quite common?

Any insights or alternatives you could provide would be greatly appreciated! Thanks in advance!

  • Remi

    Hey Raphaël Gardies, the size limit is a universal limit. In your case, I would recommend creating one doc per ticket (instead of one massive file). You can push using the Make/Zapier integration or the API. It is a bit more work but will give you much better results in the end (for a bunch of different, somewhat subtle & technical reasons).
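
    A minimal sketch of the per-ticket split, assuming the HubSpot export is a JSON array of tickets with hypothetical `id`, `subject`, `created_at` and `messages` fields:

```python
import json
from pathlib import Path

# Hypothetical export shape: a JSON array of tickets, each with
# "id", "subject", "created_at" and "messages" fields.
tickets = json.loads(Path("hubspot_tickets.json").read_text())

out_dir = Path("tickets")
out_dir.mkdir(exist_ok=True)

for ticket in tickets:
    # One small text document per ticket, named after the ticket id.
    body = "\n".join(
        [
            f"Subject: {ticket['subject']}",
            f"Created: {ticket['created_at']}",
            "",
            *(message["text"] for message in ticket["messages"]),
        ]
    )
    (out_dir / f"ticket-{ticket['id']}.txt").write_text(body)
```

    Each resulting file stays far below the 2MB limit and maps one-to-one to a ticket, which is what the retrieval step benefits from.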

  • Khaled K

    Hey Raphaël Gardies, I'm working on a similar solution to analyze our Intercom tickets. I've been cleaning the data upstream to keep strictly the message exchanges, the creation date and the contacts. In the end, months of tickets will weigh less than 1 MB. But I'm hitting another limit, which is the number of tokens. The short-term solution is performing analysis on the data with limited time frames (i.e., up to a month). The mid-to-long-term solution I'm exploring is to add a step in the cleaning process that compiles the messages into an AI-generated summary. That would divide the context size by ten, enabling a full year of analysis.
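
    A rough sketch of that upstream cleaning pass, assuming an Intercom JSON export with hypothetical `contacts`, `created_at` and `conversation_parts` fields:

```python
import json
from pathlib import Path

# Hypothetical Intercom export: a JSON array with one object per conversation.
conversations = json.loads(Path("intercom_export.json").read_text())

cleaned = [
    {
        # Keep only what the analysis needs: who, when, and what was said.
        "contacts": convo.get("contacts", []),
        "created_at": convo.get("created_at"),
        "messages": [part.get("body", "") for part in convo.get("conversation_parts", [])],
    }
    for convo in conversations
]

Path("intercom_cleaned.json").write_text(json.dumps(cleaned, indent=2))
```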

  • Raphaël Gardies

    Thanks, Remi! Actually I have about 1M tickets, and if I take only the last month I have 100k, so using Make to write files one by one might explode our quotas. Are you sure it would scale properly? Khaled K, I'm already exporting only the message exchanges, and regarding the number of tokens, I don't know if you're using RAG, but it shouldn't be an issue.

  • Remi

    Ah, that's a lot 😬 Is using the API an option? The doc is here: https://docs.dust.tt/reference/post_api-v1-w-wid-spaces-spaceid-data-sources-dsid-documents-documentid (you can also ask @help in Dust to help you write the code).
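
    A sketch of a call to that endpoint, assuming the request body takes a `text` field as described in the linked doc; the workspace, space and data source IDs and the API key are placeholders:

```python
import requests

# Placeholders: replace with your workspace, space and data source IDs,
# and a Dust API key with write access to the data source.
WORKSPACE_ID = "your-workspace-id"
SPACE_ID = "your-space-id"
DATA_SOURCE_ID = "your-data-source-id"
API_KEY = "sk-..."


def upsert_ticket(ticket_id: str, text: str) -> None:
    # Endpoint from the doc linked above. Using the ticket id as the document id
    # makes the upsert idempotent: re-running the script overwrites, not duplicates.
    url = (
        f"https://dust.tt/api/v1/w/{WORKSPACE_ID}/spaces/{SPACE_ID}"
        f"/data_sources/{DATA_SOURCE_ID}/documents/ticket-{ticket_id}"
    )
    response = requests.post(
        url,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"text": text},  # assuming a "text" body field per the linked doc
    )
    response.raise_for_status()
```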

  • Hector - Zeffy

    Working with Raphaël Gardies on it at Zeffy. Remi, does this API endpoint have a limit in terms of volume? Can we make 100k API calls (if we have 100k tickets on our end)? Or would you suggest we explore Khaled K's suggestion to decrease the volume?

  • Remi

    Let me loop in someone from the team just to be sure! cc Alban, what do you think? Can they push 100k tickets through the API? Is there another, better approach?

  • Raphaël Gardies

    Do you have any updates on this?

  • Gregor

    I’m pushing a codebase of about 40k small files. There is a 120 requests/min rate limit. I’m able to get about 100 requests/min into the system, so I’m looking at roughly 7-8h of running the process.
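
    A simple way to stay under that limit is to throttle client-side. A sketch, assuming the 120 requests/min limit above and reusing the hypothetical `upsert_ticket` helper from the earlier sketch:

```python
import time

REQUESTS_PER_MINUTE = 100  # stay comfortably under the 120/min limit


def push_all(documents: dict[str, str]) -> None:
    # documents maps a document id to its text content.
    interval = 60.0 / REQUESTS_PER_MINUTE
    for doc_id, text in documents.items():
        started = time.monotonic()
        try:
            upsert_ticket(doc_id, text)  # hypothetical helper from the earlier sketch
        except Exception as exc:
            # A real run would retry with backoff; here we just log and continue.
            print(f"failed to push {doc_id}: {exc}")
        # Sleep out the remainder of the per-request interval.
        elapsed = time.monotonic() - started
        if elapsed < interval:
            time.sleep(interval - elapsed)
```

    At 100 requests/min, 40k documents is roughly 400 minutes of wall-clock time, which is consistent with the 7-8h Gregor reports once retries and overhead are included.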

  • Gregor

    Looks to be doable; just make sure to use tags on the files (for example “title: Unique identifier”), as it helps RAG from what I’ve found out so far.
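
    Assuming the upsert endpoint accepts a `tags` array (the linked doc describes the exact body), the earlier sketch could pass one like this; the tag format follows Gregor's “title: ...” example and the constants come from the earlier sketch:

```python
def upsert_ticket_with_tags(ticket_id: str, text: str, title: str) -> None:
    # Same hypothetical endpoint, auth and constants as the earlier sketch,
    # now with a "tags" array added to the request body.
    url = (
        f"https://dust.tt/api/v1/w/{WORKSPACE_ID}/spaces/{SPACE_ID}"
        f"/data_sources/{DATA_SOURCE_ID}/documents/ticket-{ticket_id}"
    )
    response = requests.post(
        url,
        headers={"Authorization": f"Bearer {API_KEY}"},
        # "key: value" tag format is an assumption based on Gregor's example.
        json={"text": text, "tags": [f"title: {title}"]},
    )
    response.raise_for_status()
```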