Set up free web scraping in less than 5 minutes using GitHub Actions

TLDR: Create a new GitHub repository and add a workflow file containing a schedule and a curl statement that downloads a JSON file from an API endpoint into the repository on a cron schedule. Visualize the results using Flat-viewer. See below for a more detailed step-by-step guide.

```yaml
on:
  push:
  workflow_dispatch:
  schedule:
    - cron: '6,26,46 * * * *' # every twenty minutes

jobs:
  scheduled:
    runs-on: ubuntu-latest
    steps:
      - name: Check out this repo
        uses: actions/checkout@v2
      - name: Fetch latest data from the API endpoint
        run: |-
          curl -s "https://www.nu.nl/block/lean_json/articlelist?limit=20&offset=0&source=latest&filter=site" | jq '.data.context.articles' > headlines.json
      - name: Commit and push if the data has changed
        run: |-
          git config user.name "Automated"
          git config user.email "actions@users.noreply.github.com"
          git add -A
          timestamp=$(date -u)
          git commit -m "Latest data: ${timestamp}" || exit 0
          git push
```

Git scraping

This guide is inspired by Simon Willison's excellent "Git Scraping" concept (https://simonwillison.net/2021/Mar/5/git-scraping/), which combines the free compute of GitHub Actions with storing flat files (e.g. JSON) in a Git repository. ...
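Before committing the workflow, you can sanity-check the jq filter locally. A minimal sketch: the sample payload below is invented, but its `data.context.articles` path mirrors the filter used in the workflow's fetch step.

```shell
# Simulate the API response with a small invented payload whose shape
# matches the .data.context.articles path used in the workflow's jq filter.
echo '{"data":{"context":{"articles":[{"title":"Example headline"}]}}}' \
  | jq '.data.context.articles' > headlines.json

# Inspect the result: headlines.json should now hold a JSON array of articles.
jq 'length' headlines.json
```

If the filter path is wrong, jq emits `null` instead of an array, which is a cheap way to catch API-shape mistakes before the scheduled job starts committing bad data.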
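The payoff of storing the flat file in Git is that every scheduled commit becomes one snapshot, so the repository doubles as a time series. A self-contained sketch of that idea, using a throwaway repo and invented headline data rather than the real API:

```shell
# Build a throwaway repo with two fake snapshots of headlines.json
# (data and paths here are invented for demonstration only).
rm -rf /tmp/git-scrape-demo
mkdir -p /tmp/git-scrape-demo && cd /tmp/git-scrape-demo
git init -q .
git config user.name "Demo" && git config user.email "demo@example.com"

echo '[{"title":"first headline"}]'  > headlines.json
git add headlines.json && git commit -qm "Latest data: snapshot 1"

echo '[{"title":"second headline"}]' > headlines.json
git add headlines.json && git commit -qm "Latest data: snapshot 2"

# Every commit touching headlines.json is one scraped snapshot:
git log --oneline -- headlines.json

# Recover an earlier snapshot straight from history:
git show HEAD~1:headlines.json
```

The same `git log` / `git show` commands work on the real scraper repository once the scheduled workflow has run a few times.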