
Best Practices for Scraping Newspaper Websites Efficiently

Hi 👋 I was wondering what the best solution is for scraping newspapers' websites. If I add a website as a source, it will synchronise 50 child pages and refresh the same 50 pages every week, whereas I would want it to refresh with 50 different pages every week to stay updated on the latest news. Thanks for the help!

  • Remi

    Hey Gabriel Zeitoun 👋 Did you have a look at this guide: https://dust-tt.notion.site/Tech-Radar-09e11010411e4758860302c0febbbfb0? It's a pretty useful one for your case. An alternative would be to use the "web search" feature and add the instruction "browse this website and get the content of the latest articles", then add the URL of a sub-category of the website. (I'm not 100% confident in this one, but it's worth testing.)

  • Gabriel Zeitoun

    Yes I did! I used it as the base for my assistant's instructions. That's what I thought of doing as well, since I don't think websites as sources will be useful in my case. Let me know if you have other ideas 🙂 Thanks

  • Josselin Davy

    Hey Gabriel Zeitoun, I worked on a similar project, and I found that the best approach is to use RSS feeds. You can leverage tools like Zapier or Make to trigger a workflow whenever a new article is published. This allows you to automatically add the content to a designated Dust folder. With an RSS feed, you’ll have access to the article’s title, URL, and a short description. If you need the full article, you can make an HTTP request to retrieve the complete content. To find an RSS feed for a specific media outlet, simply search for “RSS feed [media name]” (e.g., RSS feed Le Monde).
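
    A minimal sketch of that flow in Python, assuming the feedparser and requests packages and an example Le Monde feed URL (in practice, Zapier or Make would handle the trigger and the upload to the Dust folder):

    ```python
    import feedparser
    import requests

    # Example feed; substitute the outlet's real RSS URL
    # (found by searching "RSS feed <media name>").
    FEED_URL = "https://www.lemonde.fr/rss/une.xml"

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries[:10]:
        title = entry.get("title", "")
        link = entry.get("link", "")
        summary = entry.get("summary", "")  # short description from the feed
        # If the summary isn't enough, fetch the full article HTML;
        # an HTML-to-text step would still be needed before pushing
        # the content to a Dust folder.
        html = requests.get(link, timeout=10).text
        print(title, link, len(html))
    ```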

  • Gabriel Zeitoun

    Thanks a lot for the tip Josselin Davy! I'll look into it. My project is rather to scrape a dozen websites and ask my assistant to sort the articles and output the 10 most relevant ones every week. Do you still think an RSS feed might be the solution? Otherwise, I might have another solution thanks to Alban from Dust. Lmk

  • Josselin Davy

    My use case is to aggregate content from around 30 websites and generate a weekly newsletter with the most relevant insights. It works well, but it might be overengineered for your use case.
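
    For reference, a sketch of that weekly aggregation step under the same assumptions as above (feedparser installed, placeholder feed URLs): collect everything published in the last 7 days across the outlets, ready for an assistant to rank.

    ```python
    import time
    import feedparser

    # Placeholder list: one RSS URL per outlet you follow.
    FEEDS = [
        "https://www.lemonde.fr/rss/une.xml",
        # ... add the other outlets' feeds here
    ]

    cutoff = time.time() - 7 * 24 * 3600  # one week ago

    recent = []
    for url in FEEDS:
        for entry in feedparser.parse(url).entries:
            published = entry.get("published_parsed")  # struct_time, if the feed provides it
            if published and time.mktime(published) >= cutoff:
                recent.append({"title": entry.title, "link": entry.link})

    print(f"{len(recent)} articles from the past week")
    ```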