Using the Table Queries Tool for Database Files

Hi, you need to use the "table queries" tool to query database files (Gsheet, etc.)

  • Pierre Bernat

    I thought it was more for quantitative analysis... ?

  • Pierre Bernat

    In my case the types of instructions are "categorize these products according to this specific taxonomy"

  • Pierre Bernat

    this kind of thing

  • Pierre Bernat

    @help advised to go through a data extraction, Alban

  • Pierre Bernat

    Then my question is:

  • Pierre Bernat

    even if I provide the precise header labels in the "Tool description"

  • Pierre Bernat

    why is the tool kind of "interpreting" the data and creating its own categorization that is not related to the content of the CSV?

  • Pierre Bernat

    it’s creating properties like "ingredient_function", "ingredient_origin", "consumer_benefit", which is certainly very interesting, since it’s even my objective with this assistant (to generate this type of qualification for each of the products in the CSV), but which represent information that is absolutely not in the CSV?

  • Pierre Bernat

    I may have found an answer, which is that it’s all about the context window of the model. I’ve uploaded an extract of the CSV I want the assistant to analyse (categorize a list of products according to a specific taxonomy). I’ve asked how many lines like these a model like o3, Claude, or GPT could handle, and the answer is that it’s all about the context window of each:

    • o3 mini (≈4,096 tokens context) = 8 rows per file

    • Claude 3.7 (≈200,000 tokens context) = 417 rows

    • Standard ChatGPT-4 (≈8,000 tokens context) can handle about 8,000 ÷ 480 ≈ 16 rows

    Let’s say you have a 10k-row file. How would you proceed to have an assistant process that many lines without having to split the file, call the assistant, and iterate hundreds of times?
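
    A minimal sketch of that arithmetic, assuming roughly 480 tokens per row (the figure quoted above) and an illustrative 1,000-token reserve for instructions and output:

        def rows_per_call(context_tokens: int, tokens_per_row: int = 480,
                          overhead_tokens: int = 1000) -> int:
            """Estimate how many CSV rows fit in one call, reserving
            overhead_tokens for instructions, taxonomy and the reply."""
            usable = max(context_tokens - overhead_tokens, 0)
            return usable // tokens_per_row

        # Illustrative context sizes, not authoritative model specs
        for name, context in [("small context", 8_000), ("large context", 200_000)]:
            print(name, rows_per_call(context))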

  • Pierre Bernat

    I got confirmation: this is the ‘real’ problem with AI, the context window, i.e. the size of the input. It can’t be exceeded; there’s no way around it, the system itself limits it. In this case, the feedback I was given is as follows: automate the processing (file upload + execution of instructions) via a script plus a model’s API:

    • Either it’s text processing and you can go through RAG, i.e. load everything and then the LLM searches the RAG (a kind of contextual database),

    • or you write a script that makes an LLM call for every XX lines (sketched below this message)

    In my case, there is on one side a taxonomy made up of categories and subcategories, and on the other a CSV of 14,000 lines/products. If anyone sees a way to solve this case in Dust, I’m interested.
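
    A minimal sketch of the second option (batching XX lines per LLM call), assuming a hypothetical call_llm() helper that wraps whichever model API you use; the batch size, taxonomy string and prompt wording are illustrative, not a definitive implementation:

        import csv

        BATCH_SIZE = 50   # tune so one batch stays within the model's context window
        TAXONOMY = "..."  # your categories and subcategories, as text

        def call_llm(prompt: str) -> str:
            """Hypothetical placeholder for a call to the model of your
            choice (OpenAI, Anthropic, Dust, ...)."""
            raise NotImplementedError

        def categorize_file(path: str) -> list[str]:
            with open(path, newline="", encoding="utf-8") as f:
                rows = list(csv.DictReader(f))
            categories: list[str] = []
            for i in range(0, len(rows), BATCH_SIZE):
                batch = rows[i:i + BATCH_SIZE]
                prompt = (
                    "Categorize each product below using only this taxonomy:\n"
                    f"{TAXONOMY}\n\n"
                    + "\n".join(str(r) for r in batch)
                    + "\n\nReturn one category per line, in the same order."
                )
                categories.extend(call_llm(prompt).splitlines())
            return categories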

  • Remi

    Pierre Bernat, if I understand correctly, your use case is:

    • You have a file with 14k products

    • You'd like to categorize those products (lines) using AI

    Is that right? In that case you could create a Dust agent that takes one row as input and gives the category as output. You can give the list of categories in the agent's instructions. Then you can use the Dust Gsheet integration to call your agent directly in Gsheet: you'd give all the cells in the row as input to the agent and output the answer in a new column. (Documentation: https://docs.dust.tt/docs/beta-google-spreadsheets) You'll probably hit your workspace limit at some point and will need to do this over time. Hope this helps!
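
    A minimal sketch of the same per-row pattern done from a script rather than from Gsheet, with a hypothetical dust_agent_categorize() standing in for however you call the agent (the Gsheet integration above, the Dust API, ...); it writes the agent's answer into a new "category" column:

        import csv

        def dust_agent_categorize(row: dict) -> str:
            """Hypothetical placeholder: send one row's cells to your Dust
            agent and return the category it answers with."""
            raise NotImplementedError

        def add_category_column(in_path: str, out_path: str) -> None:
            with open(in_path, newline="", encoding="utf-8") as f:
                rows = list(csv.DictReader(f))
            if not rows:
                return
            fieldnames = list(rows[0].keys()) + ["category"]
            for row in rows:
                row["category"] = dust_agent_categorize(row)
            with open(out_path, "w", newline="", encoding="utf-8") as f:
                writer = csv.DictWriter(f, fieldnames=fieldnames)
                writer.writeheader()
                writer.writerows(rows)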

  • Pierre Bernat

    It does, thanks!