Is there a way to get a list of filenames in a Knowledge base folder? I’ve uploaded ~40k files but had some interruptions during the upload, so there might be ~100 files missing. I don’t want to repush the full 40k one more time just to be sure 😅
I had to do something similar. Here is the code I had:
import requests

def get_existing_dust_files():
    """Return the set of document IDs already present in the data source."""
    all_documents = set()
    offset = 0
    limit = 100
    while True:
        url = f"https://dust.tt/api/v1/w/{wld}/spaces/{space_id}/data_sources/{dsId}/documents?limit={limit}&offset={offset}"
        response = requests.get(url, headers={"Authorization": f"Bearer {dust_token}"})
        if not response.ok:
            print(f"Error fetching documents: {response.status_code} - {response.text}")
            break
        data = response.json()
        documents = data.get("documents", [])
        # If no more documents were returned, stop paginating
        if not documents:
            break
        # Add document IDs to the set
        for doc in documents:
            all_documents.add(doc["document_id"])
        # Update offset for the next batch
        offset += limit
    return all_documents
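FWIW, here's roughly how I then worked out which files were missing and re-pushed only those. It's just a sketch: the folder path is a placeholder, and it assumes each document ID is simply the local filename (adjust if you named the documents differently on upload):

import os

# Rough sketch: compare local filenames against the document IDs in Dust.
existing = get_existing_dust_files()
local_files = set(os.listdir("/path/to/your/folder"))  # hypothetical folder path
missing = local_files - existing
print(f"{len(missing)} files still need to be uploaded")
for name in sorted(missing):
    print(name)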
oh my. but that basically pulls everything back over. Crazy. Thank you for the code, I guess I need to use it 😄
To be clear, it just pulls the metadata, not the doc content. It should be significantly faster than re-uploading everything.
oh! That is good news! 😅
Let me know how it goes! I suspect that it will take less than a minute to list your 40k documents, and then you can easily find what's missing.
testing now
Pulled 40967 documents in 282 seconds
it took 4 mins to get all 41k 😄
Ok, I was a bit optimistic, but that's probably a lot less than it took you to upload them! 😅 Might be able to make it a bit faster by getting more than 100 at a time, though there is some limit to the batch size (not sure what it is).
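Something like this, though I haven't checked what page size the API actually allows, so treat the number as a guess. Advancing the offset by the number of documents actually returned keeps the loop correct even if the server silently caps the page size:

import requests

def list_document_ids(limit=1000):
    """Same pagination loop, just with a larger (hypothetical) page size."""
    all_documents = set()
    offset = 0
    while True:
        url = f"https://dust.tt/api/v1/w/{wld}/spaces/{space_id}/data_sources/{dsId}/documents?limit={limit}&offset={offset}"
        response = requests.get(url, headers={"Authorization": f"Bearer {dust_token}"})
        response.raise_for_status()
        documents = response.json().get("documents", [])
        if not documents:
            break
        all_documents.update(doc["document_id"] for doc in documents)
        # Advance by the real batch size, not the requested limit
        offset += len(documents)
    return all_documents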
Well, I’ll take 4 mins over 7h of upload