
Tadpole, the Language for Scraping, 0.2.0 – Complex Control Flow, Stealth, and More

zachperkitny | 2026-02-16 18:35 UTC
6 points | 2 comments
Hello,

I posted a few weeks ago about my custom scraping language. It definitely got some traction, which was very exciting to see.

GitHub Repo: https://github.com/tadpolehq/tadpole
Docs: https://tadpolehq.com/

For the past 2 weeks, I've been focusing my efforts on introducing dedicated stealth actions, more complex control-flow actions, and a variety of evaluators for cleaning data.

Here is an example that scrapes `books.toscrape.com`:

  main {
    new_page {
      goto "https://books.toscrape.com/"
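      // paginate: extract each page, then click "next" while the link exists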
      loop {
        do {
          $$ article.product_pod {
            extract "books[]" {
              title { $ "h3 a"; attr title }
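              // the class attribute looks like "star-rating Three"; extract the word and map it to 1-5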
              rating {
                $ ".star-rating";
                attr "class";
                extract "star-rating (One|Two|Three|Four|Five)" caseInsensitive=#true;
                func "(v) => ({'one': 1, 'two': 2, 'three': 3, 'four': 4, 'five': 5}[v.toLowerCase()] || null)"
              }
              price { $ "p.price_color"; text; as_float }
              in_stock { $ "p.availability"; text; matches "In stock" caseInsensitive=#true }
            }
          }
        }
        while { $ "li.next" }
        next {
          $ "li.next a" { click }
          wait_until
        }
      }
    }
  }

I've introduced actions like `apply_identity` to override the User-Agent header and User-Agent metadata. Here is an example module for selectively applying different identities:

  module stealth {
    // Apple M2 Pro
    action apply_apple_m2 {
      apply_identity mac
      set_webgl_vendor "Apple Inc." "Apple M2"
      set_device_memory 16
      set_hardware_concurrency 8
      set_viewport 1440 900 deviceScaleFactor=2
    }

    // Windows Desktop
    action apply_windows_16_8 {
      apply_identity windows
      set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
      set_device_memory 16
      set_hardware_concurrency 8
      set_viewport 1920 1080
    }

    // Windows Budget Laptop
    action apply_windows_8_4 {
      apply_identity windows
      set_webgl_vendor "Google Inc. (Intel)" "ANGLE (Intel, Intel(R) UHD Graphics 620 Direct3D11 vs_5_0 ps_5_0)"
      set_device_memory 8
      set_hardware_concurrency 4
      set_viewport 1366 768
    }
  }
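
For anyone curious what an identity override like this boils down to at the protocol level, here is a rough sketch using Playwright and a raw CDP `Emulation.setUserAgentOverride` call, which rewrites both the User-Agent header and the structured metadata behind `navigator.userAgentData`. This is not Tadpole's implementation; the browser setup, UA string, and metadata values are illustrative placeholders.

  # Not Tadpole code: an illustrative Playwright + raw CDP sketch of the kind of
  # override an identity action performs. UA string and metadata are placeholders.
  import asyncio
  from playwright.async_api import async_playwright

  async def main():
      async with async_playwright() as p:
          browser = await p.chromium.launch()
          context = await browser.new_context()
          page = await context.new_page()
          cdp = await context.new_cdp_session(page)
          # One call overrides the User-Agent header and the Client Hints metadata
          # that pages read via navigator.userAgentData.
          await cdp.send("Emulation.setUserAgentOverride", {
              "userAgent": (
                  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/120.0.0.0 Safari/537.36"
              ),
              "platform": "MacIntel",
              "userAgentMetadata": {
                  "brands": [{"brand": "Chromium", "version": "120"}],
                  "platform": "macOS",
                  "platformVersion": "14.2.0",
                  "architecture": "arm",
                  "model": "",
                  "mobile": False,
              },
          })
          await page.goto("https://example.com/")
          await browser.close()

  asyncio.run(main())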

The full release changelog is available here: https://github.com/tadpolehq/tadpole/releases/

My goals for the next release, 0.3.0, are to focus heavily on Plugins, Distributed Execution through Message Queues, Redis Support for Crawling, and Static Parsing as an alternative to running everything over CDP/Chrome.

I will try to hold my release cadence at every 2 weeks!

Comments

rithdmc | 2026-02-17 09:53 UTC
This seems neat. My previous experience has been with Scrapy, but if you're using books.toscrape, then you probably already know of it.

I'll keep this in mind as an alternative next time I'm scraping something.

zachperkitny | 2026-02-17 16:20 UTC
Thanks! I've used Scrapy before; I like it a lot. This is built around CDP and uses an actual browser, so it supports client-side rendered content as well. I am adding a feature specifically for static HTML parsing for performance reasons in my next release. It's useful to have both.