Hi, you need to use the "table queries" tool to query database files (gsheet, etc)
I thought it was more for quantitative analysis... ?
In my case the types of instructions are "categorize these products according to this specific taxonomy",
this kind of thing
@help advised me to go through a data extraction, Alban
then my question is
even if I provide the precise header labels in the "Tool description"
why is the tool kind of "interpreting" the data, creating its own categorization that is not related to the content of the csv?
it's creating properties like "ingredient_function", "ingredient_origin", "consumer_benefit", which is certainly very interesting, since it's even my objective with this assistant (to generate this type of qualification for each of the products in the csv), but which represent information that is absolutely not in the csv?
I may have found an answer, which is that it's all about the context window of the model. I've uploaded an extract of the csv I want the assistant to analyse (categorize a list of products according to a specific taxonomy). I've asked how many lines like these a model like o3, Claude, or GPT could handle, and the answer is that it's all about the context window of each:
o3 mini (≈4,096 tokens context) = 8 rows per file
Claude 3.7 (≈200,000 tokens context) = 417 rows
Standard ChatGPT-4 (≈8,000 tokens context) can handle about 8,000 ÷ 480 ≈ 16 rows
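A quick sanity check of that arithmetic (a minimal sketch; the ~480 tokens-per-row figure is the estimate from my csv extract, and the context sizes are just the ones quoted above):

```python
# Back-of-the-envelope: how many csv rows fit in a model's context window,
# assuming ~480 tokens per row (estimate from the csv extract above).
TOKENS_PER_ROW = 480

context_windows = {
    "o3 mini (as quoted)": 4_096,
    "Claude 3.7": 200_000,
    "GPT-4 (standard)": 8_000,
}

for model, context in context_windows.items():
    # Floor division: only whole rows count, and this ignores the tokens
    # needed for instructions and for the model's output, which reduce it further.
    rows = context // TOKENS_PER_ROW
    print(f"{model}: ~{rows} rows")
```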
let's say you have a 10k-row file. How would you proceed to have an assistant process that many lines without having to split the file, call the assistant, and iterate hundreds of times?
I got confirmation: this is the "real" problem with AI, the context window, so basically the size of the input. It can't be exceeded; there's no way around it, the system itself limits it. In this case, the feedback I was given is as follows: automate the processing (file upload + execution of the instructions) via a script plus the API of a model:
Either it's text processing and you can go through RAG: load everything and then the LLM searches in the RAG (a kind of contextual database),
or you make a script that makes an LLM call for every XX lines (see the sketch after this message)
In my case, on the one hand a taxonomy made up of categories and subcategories, on the other a csv of 14,000 lines/products. If anyone sees a way to resolve this case in Dust, I'm interested.
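For reference, a minimal sketch of the second option (batching rows per API call). The file names, the chunk size of 50, and the model name are placeholders for illustration, not recommendations, and it assumes the openai>=1.x Python SDK, but any chat-completion API would do:

```python
import csv
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment
CHUNK_SIZE = 50     # rows per call; tune so chunk + taxonomy + output stay under the context window

# Hypothetical file names for illustration.
taxonomy = open("taxonomy.txt", encoding="utf-8").read()

with open("products.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
header, data = rows[0], rows[1:]

results = []
for i in range(0, len(data), CHUNK_SIZE):
    chunk = data[i : i + CHUNK_SIZE]
    chunk_text = "\n".join(";".join(row) for row in chunk)
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system",
             "content": "Categorize each product line strictly using this taxonomy:\n" + taxonomy},
            {"role": "user",
             "content": "Columns: " + ";".join(header) + "\n" + chunk_text},
        ],
    )
    results.append(resp.choices[0].message.content)

print("\n".join(results))
```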
Pierre Bernat If I understand correctly, your use case is:
You have a file with 14k products
You'd like to categorize those products (lines) using AI
Is that right? In that case you could create a Dust agent that takes one row as input and gives the category as output. You can give the list of categories in the agent's instructions. Then you can use the Dust Gsheet integration to call your agent directly in Gsheet. You'd give all the cells in the row as input to the agent and output the answer in a new column. (Documentation: https://docs.dust.tt/docs/beta-google-spreadsheets) You'll probably hit your workspace limit at some point and will need to do this over time. Hope this helps!