Hi, you need to use the "table queries" tool to query database files (gsheet, etc)
I thought it was more for quantitative analysis... ?
In my case the instructions are of the type "categorize these products according to this specific taxonomy",
that kind of thing.
@help advised me to go through a data extraction, Alban
then my question is:
even if I provide the precise header labels in the "Tool description",
why is the tool kind of "interpreting" the data and creating its own categorization that is unrelated to the content of the CSV?
it's creating properties like "ingredient_function", "ingredient_origin", "consumer_benefit", which is certainly very interesting, since generating this type of qualification for each product in the CSV is precisely my objective with this assistant, but which is information that simply isn't in the CSV.
I may have found an answer: it's all about the model's context window. I uploaded an extract of the CSV I want the assistant to analyse (categorize a list of products according to a specific taxonomy) and asked how many rows like these a model such as o3, Claude or GPT could handle. The answer is that it all depends on each model's context window:
o3-mini (≈4,096-token context): ≈ 8 rows per file
Claude 3.7 (≈200,000-token context): ≈ 417 rows
Standard ChatGPT-4 (≈8,000-token context): 8,000 ÷ 480 ≈ 16 rows
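For what it's worth, here is a rough back-of-the-envelope sketch of that arithmetic in Python. The ~480 tokens/row figure and the context sizes are just the numbers quoted above, and the prompt overhead is a guess, not a measured value:

```python
# Rough capacity estimate: how many ~480-token CSV rows fit in a context window,
# leaving some headroom for the instructions/taxonomy and the model's answer.
TOKENS_PER_ROW = 480        # assumption taken from the estimate above
PROMPT_OVERHEAD = 1_000     # guessed allowance for instructions + taxonomy

CONTEXT_WINDOWS = {         # sizes as quoted above
    "o3-mini (as quoted)": 4_096,
    "Claude 3.7": 200_000,
    "Standard ChatGPT-4": 8_000,
}

for model, ctx in CONTEXT_WINDOWS.items():
    rows = max(0, (ctx - PROMPT_OVERHEAD) // TOKENS_PER_ROW)
    print(f"{model}: ~{rows} rows per call")
```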
let's say you have a 10k-row file. How would you have an assistant process that many lines without having to split the file, call the assistant and iterate hundreds of times?
I got confirmation: this is the 'real' problem with AI, the context window, i.e. basically the size of the input. It can't be exceeded, there's no way around it, the system itself limits it. In this case, the feedback I was given is as follows: automate the processing (file upload + execution of the instructions) via a script plus a model's API:
either it's text processing and you can go through RAG, i.e. load everything and then have the LLM search in the RAG (a kind of contextual database),
or you write a script that makes one LLM call per batch of XX lines (a rough sketch of that approach is below).
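Just to make that second option concrete, here is a minimal batching sketch in Python, assuming the official openai client; the model name, batch size, file name and prompt are all placeholders to adapt:

```python
import csv

from openai import OpenAI  # assumes the official `openai` package (v1+)

client = OpenAI()   # reads OPENAI_API_KEY from the environment
BATCH_SIZE = 50     # tune so each batch stays well within the context window
TAXONOMY = "…your categories and subcategories here…"

def categorize(rows: list[dict]) -> str:
    """Send one batch of CSV rows to the model and return its categorization."""
    listing = "\n".join(str(r) for r in rows)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": f"Categorize each product using this taxonomy:\n{TAXONOMY}"},
            {"role": "user", "content": listing},
        ],
    )
    return response.choices[0].message.content

with open("products.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# One API call per batch of BATCH_SIZE rows instead of one giant prompt.
for i in range(0, len(rows), BATCH_SIZE):
    print(categorize(rows[i:i + BATCH_SIZE]))
```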
In my case, on the one hand I have a taxonomy made up of categories and subcategories, and on the other a CSV of 14,000 lines/products. If anyone sees a way to handle this case in Dust, I'm interested.
Pierre Bernat If I understand correctly, your use case is:
You have a file with 14k products
You'd like to categorize those products (lines) using AI
Is that right? In that case you could create a Dust agent that takes one row as input and gives the category as output. You can give the list of categories in the agent's instructions. Then you can use the Dust Gsheet integration to call your agent directly in Gsheets: you'd give all the cells in the row as input to the agent and output the answer in a new column. (Documentation: https://docs.dust.tt/docs/beta-google-spreadsheets) You'll probably hit your workspace limit at some point and will need to do this over time. Hope this helps!
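If you end up scripting that loop instead of (or alongside) the Gsheet formula, the overall shape is roughly this; call_dust_agent is a placeholder for however you actually invoke the agent (for example via the Dust API), everything else is plain CSV handling:

```python
import csv

def call_dust_agent(row: dict) -> str:
    """Placeholder: invoke the categorization agent for one product row
    (e.g. through the Dust API) and return the category it answers with."""
    raise NotImplementedError("wire this up to your Dust agent")

with open("products.csv", newline="", encoding="utf-8") as src, \
     open("products_categorized.csv", "w", newline="", encoding="utf-8") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=list(reader.fieldnames) + ["category"])
    writer.writeheader()
    for row in reader:                        # one agent call per product row
        row["category"] = call_dust_agent(row)
        writer.writerow(row)
```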