From Blocked to Blazing Fast: A Scraping Showdown, Live

Goal: scrape all 37,650 player card records from pesdb.net/efootball.
End result: requests cut from 38,827 to 1,177, and estimated runtime cut from 10–20 hours to 30–60 minutes.

Phase One: First Version Goes Live, Gets Rate-Limited Immediately

The initial scraper logic was dead simple:

Scrape 1,177 list pages, collect all player IDs
Fetch 37,650 player detail pages one by one
Parse HTML, save to CSV

Not long after it started running, HTTP 429s flooded in. I added rate limits and random delays, but the 429s kept triggering.

Problem One: Python’s TLS Fingerprint Gives You Away

When the requests library sends an HTTPS request, it exposes its identity at the TLS handshake stage.

Every TLS client opens a connection with a ClientHello message that contains, among other things:

Supported cipher suites list
TLS extension order and parameters
Supported elliptic curves and point formats

Combined, these fields form a JA3/JA4 fingerprint. Right after the handshake, an HTTP/2 client also gives itself away through the values and ordering in its SETTINGS frame. Python's requests/urllib3 stack produces a fingerprint completely different from Chrome's or Firefox's, so the server can identify you as a scraper at the network layer, totally independent of User-Agent.
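To make the fingerprint idea concrete, here is a minimal sketch of how a JA3 digest is built: five ClientHello fields, each written as dash-joined decimal values, joined by commas and hashed with MD5. The numeric values below are made up for illustration (they are not captured from Chrome or from Python), and real implementations also strip GREASE values first, which this sketch skips.

```python
import hashlib

def ja3_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Return a JA3-style digest: MD5 over five comma-separated fields,
    each field being its decimal values joined with '-'."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    return hashlib.md5(ja3_string.encode("ascii")).hexdigest()

# Two illustrative clients offering the same ciphers in a different order.
client_a = ja3_hash(771, [4865, 4866, 4867], [0, 10, 11, 16], [29, 23, 24], [0])
client_b = ja3_hash(771, [4867, 4866, 4865], [0, 10, 11, 16], [29, 23, 24], [0])
print(client_a != client_b)  # True: ordering alone changes the fingerprint
```

This is why editing headers cannot help: none of these inputs are under the control of the HTTP layer.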

Solution: curl_cffi

# Before
import requests
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 ..."})

# After
from curl_cffi.requests import Session
s = Session(impersonate="chrome136")  # Fully replicate Chrome's TLS handshake
r = s.get(url)

curl_cffi is built on curl-impersonate, a patched build of libcurl, and reproduces the target browser’s:

Cipher Suite ordering
TLS extensions (ALPN, SNI, session tickets, etc.)
HTTP/2 frame format

Problem Two: Getting the impersonate Version Wrong Crashes Immediately

ImpersonateError: Impersonating firefox117 is not supported

curl_cffi only supports specific version numbers — not all versions exist. The fix: first check what your installed version actually supports:

from curl_cffi.requests import BrowserType
print([b.value for b in BrowserType])
# ['chrome99', 'chrome116', 'chrome124', 'chrome136', 'firefox133', 'firefox135', ...]

Pick from the returned list. Don’t guess.
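One way to avoid hard-coding a version is to pick the newest target of a browser family from whatever list your install reports. The helper below is a hypothetical sketch (pick_target is not part of curl_cffi); it works on a plain list of strings, here the names copied from the truncated output above. In real use you would pass in [b.value for b in BrowserType].

```python
import re

def pick_target(supported, family="chrome"):
    """Return the highest-numbered plain '<family><digits>' target, or None."""
    pattern = re.compile(rf"^{re.escape(family)}(\d+)$")
    versions = []
    for name in supported:
        match = pattern.match(name)
        if match:
            versions.append((int(match.group(1)), name))
    return max(versions)[1] if versions else None

# Names taken from the printed list above (which is truncated in the source).
supported = ["chrome99", "chrome116", "chrome124", "chrome136",
             "firefox133", "firefox135"]

print(pick_target(supported, "chrome"))   # chrome136
print(pick_target(supported, "firefox"))  # firefox135
print(pick_target(supported, "safari"))   # None
```

Returning None instead of raising lets the caller decide whether to fall back to another family or stop.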

Problem Three: Multi-Threaded curl_cffi Connections All Fail

After switching to curl_cffi, single-threaded tests worked fine, but in the multi-threaded version every request failed with CONN ERR.

Root cause: the module-level get() of curl_cffi.requests (imported here as curl_requests) was being called concurrently from multiple threads, and libcurl’s internal handle initialization conflicted.

Solution: each thread maintains its own independent Session object

import threading
_tls = threading.local()

def _get_session(impersonate: str) -> Session:
    if not hasattr(_tls, "session") or _tls.impersonate != impersonate:
        _tls.session     = Session(impersonate=impersonate)
        _tls.impersonate = impersonate
    return _tls.session

threading.local() gives each thread its own attribute namespace, so every thread builds and reuses its own Session with no interference.
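The per-thread behavior is easy to check without touching the network. The sketch below swaps in a stand-in class (FakeSession is hypothetical, used only so the example runs without curl_cffi) and confirms that four threads end up with four distinct sessions, each reused within its own thread.

```python
import itertools
import threading

class FakeSession:
    """Stand-in for curl_cffi's Session; only records a unique serial number."""
    _counter = itertools.count(1)
    _lock = threading.Lock()

    def __init__(self, impersonate):
        with FakeSession._lock:
            self.serial = next(FakeSession._counter)
        self.impersonate = impersonate

_tls = threading.local()

def get_session(impersonate):
    # Same shape as _get_session above: build once per thread, then reuse.
    if getattr(_tls, "impersonate", None) != impersonate:
        _tls.session = FakeSession(impersonate)
        _tls.impersonate = impersonate
    return _tls.session

def worker(results, index):
    first = get_session("chrome136")
    second = get_session("chrome136")
    results[index] = (first.serial, first is second)

def run(n_threads=4):
    results = [None] * n_threads
    threads = [threading.Thread(target=worker, args=(results, i))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

results = run(4)
print(len({serial for serial, _ in results}))  # 4 distinct sessions
print(all(reused for _, reused in results))    # True: reused inside each thread
```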

Problem Four: 429s Still Frequent — Swapping Identities Doesn’t Help

Even after TLS fingerprinting was solved, 429s kept firing. Root cause: IP-level rate limiting.

The server has a hard cap on requests per time window from the same IP — doesn’t matter which browser you impersonate.

To handle this gracefully, I implemented a three-layer defense:

1. Three-Layer Perlin Noise Rate Limiting

Using Perlin noise to simulate uneven human clicking rhythm instead of fixed intervals:

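import noise  # third-party "noise" package

# t should be a float such as elapsed seconds: Perlin noise is 0 at integer
# coordinates, so an integer t would flatten n3. base is the target interval.
n1 = noise.pnoise1(t * 0.06) * 0.25  # Slow wave: overall rhythm drift ±25%
n2 = noise.pnoise1(t * 0.35) * 0.15  # Medium wave: burst/pause episodes ±15%
n3 = noise.pnoise1(t * 2.00) * 0.08  # Fast wave: per-request jitter ±8%
interval = base * (1 + n1 + n2 + n3)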

Every 35–90 requests, a random 1.5–5 second “thinking pause” is inserted.
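The "thinking pause" can be sketched as a small countdown, using only the numbers above (35–90 requests between pauses, 1.5–5 seconds per pause). The class name and the choice of a uniform distribution are my assumptions for illustration, not necessarily how the original scraper does it.

```python
import random

class ThinkingPause:
    """Counts requests and occasionally asks the caller to pause."""

    def __init__(self, rng=None):
        self._rng = rng if rng is not None else random.Random()
        self._remaining = self._rng.randint(35, 90)

    def after_request(self):
        """Return seconds to pause after this request (0.0 most of the time)."""
        self._remaining -= 1
        if self._remaining > 0:
            return 0.0
        # Countdown expired: schedule the next pause and return this one.
        self._remaining = self._rng.randint(35, 90)
        return self._rng.uniform(1.5, 5.0)

pauser = ThinkingPause(random.Random(0))  # seeded only to make runs repeatable
pauses = [pauser.after_request() for _ in range(2000)]
pause_positions = [i for i, seconds in enumerate(pauses) if seconds > 0]
print(len(pause_positions) > 0)
```

In the scraper loop the returned value would simply be passed to time.sleep() on top of the noise-based interval.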

2. ThrottleGate Global Circuit Breaker

When any thread hits a 429, immediately close the global gate — all threads pause:

COOLDOWNS = [20, 40, 60, 90, 120]  # Cooldown escalates with trigger count

After cooldown, automatically switch to a new browser profile and continue with a fresh identity.
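A gate like this can be sketched with a threading.Event that every worker waits on before each request. Only the COOLDOWNS list comes from the original; the class layout, the method names, staying at the last cooldown once the list is exhausted, and the injectable sleep function are assumptions made so the sketch is testable. Switching browser profiles after the cooldown is left out.

```python
import threading
import time

COOLDOWNS = [20, 40, 60, 90, 120]  # Cooldown escalates with trigger count

class ThrottleGate:
    def __init__(self, cooldowns=COOLDOWNS, sleep=time.sleep):
        self._cooldowns = list(cooldowns)
        self._sleep = sleep
        self._open = threading.Event()
        self._open.set()               # gate starts open
        self._lock = threading.Lock()
        self._trips = 0

    def wait(self, timeout=None):
        """Workers call this before every request; blocks while the gate is closed."""
        return self._open.wait(timeout)

    def trip(self):
        """Called by a thread that received a 429.

        Returns the cooldown served, or None if another thread already
        closed the gate and is handling the cooldown.
        """
        with self._lock:
            if not self._open.is_set():
                return None
            self._open.clear()
            index = min(self._trips, len(self._cooldowns) - 1)
            self._trips += 1
            cooldown = self._cooldowns[index]
        try:
            self._sleep(cooldown)
        finally:
            self._open.set()           # reopen even if sleep is interrupted
        return cooldown

# Demo with a fake sleep so nothing actually waits.
slept = []
gate = ThrottleGate(sleep=slept.append)
print([gate.trip() for _ in range(7)])  # [20, 40, 60, 90, 120, 120, 120]
```

A worker loop would call gate.wait() before each request and gate.trip() when it sees a 429; a thread that gets None back just goes back to gate.wait().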

3. Adaptive Backoff

429 hit → multiply request interval by 1.6; every 100 successes → multiply interval by 0.95, gradually returning to normal speed.
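The backoff rule fits in a few lines. The 1.6 and 0.95 factors and the 100-success window are from the rule above; the floor at the base interval, the ceiling, and resetting the success counter on a 429 are my assumptions to keep the value bounded.

```python
class AdaptiveInterval:
    """Tracks the current request interval in seconds."""

    def __init__(self, base, ceiling=60.0):
        self.base = base
        self.ceiling = ceiling      # assumed upper bound, not from the original
        self.current = base
        self._successes = 0

    def on_429(self):
        # Slow down sharply and start counting successes from zero again.
        self.current = min(self.current * 1.6, self.ceiling)
        self._successes = 0

    def on_success(self):
        # Every 100th success speeds up by 5%, never below the base interval.
        self._successes += 1
        if self._successes >= 100:
            self._successes = 0
            self.current = max(self.current * 0.95, self.base)

interval = AdaptiveInterval(base=1.0)
interval.on_429()
print(round(interval.current, 2))   # 1.6
for _ in range(100):
    interval.on_success()
print(round(interval.current, 2))   # 1.52
```

With these factors, a single 429 takes ten rounds of 100 successes to fully recover, since 1.6 × 0.95^10 drops below 1.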

The Turning Point: Realizing I Never Needed Detail Pages At All

While analyzing the site structure, I noticed something: the page’s column display was controlled by a Cookie — and the Cookie was literally named columns.

In browser dev tools, I found the JS function that submits column selections:

function submitColumns() {
    // Read all checkboxes with id starting with col_
    // Build ?columns=id,pos,name,speed,...
    document.location.search = Vars;
}

This meant: the list page can directly display all attribute fields — including speed, shooting, passing, all 118 columns.

Just send the request with the right Cookie:

import urllib.parse
session.cookies.set(
    "columns",
    urllib.parse.quote("id,pos,name,overall_rating,speed,acceleration,...", safe=""),
    domain="pesdb.net",
)
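The cookie value itself is just the comma-joined column ids, percent-encoded. A small helper makes that explicit; columns_cookie_value is a hypothetical name, and the list below contains only the six column ids visible in the snippet above (the rest are elided in the source).

```python
import urllib.parse

def columns_cookie_value(columns):
    """Join column ids with commas and percent-encode the result (',' -> '%2C')."""
    return urllib.parse.quote(",".join(columns), safe="")

wanted = ["id", "pos", "name", "overall_rating", "speed", "acceleration"]
value = columns_cookie_value(wanted)
print(value)  # id%2Cpos%2Cname%2Coverall_rating%2Cspeed%2Cacceleration

# Decoding gives back the original list.
print(urllib.parse.unquote(value).split(",") == wanted)  # True
```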

Verification result:

Status: 200
Columns: 118, Data rows: 32
First player:
  ID: 8554076, Name: Safi Belal, Speed: 95, Overall: 116 ...

Final Architecture Comparison

	Original Plan	Optimized
Phase 1 (collect IDs)	1,177 requests	Not needed
Phase 2 (detail pages)	37,650 requests	Not needed
List pages (full columns)	—	1,177 requests
Total requests	38,827	1,177
Estimated runtime	10–20 hours	30–60 minutes
Rate-limit risk	Extremely high	Low

97% of requests never needed to be sent. Sometimes the best way to fight rate limiting isn’t to fight harder — it’s to look at the problem from a different angle.
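The numbers are easy to double-check from the figures in this post alone:

```python
list_pages = 1_177
detail_pages = 37_650

original_total = list_pages + detail_pages
saved_fraction = detail_pages / original_total
rows_per_page = detail_pages / list_pages

print(original_total)            # 38827
print(f"{saved_fraction:.1%}")   # 97.0%
print(f"{rows_per_page:.1f}")    # 32.0
```

The last line matches the 32 data rows seen in the verification request.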

Core Takeaways

Inspect network requests before writing the scraper. Sites often expose their data through channels other than the page you first look at, such as XHR endpoints or Cookie-controlled views. Here, finding the right entry point eliminated 97% of the requests.
TLS fingerprint matters more than User-Agent. Servers increasingly identify you at the handshake stage. Changing the UA alone does nothing against that; you need to swap the entire TLS stack.
A 429 isn’t necessarily an identity problem; it might be an IP problem. When the server caps requests per IP, no impersonation profile helps: you either send fewer requests (as here) or rotate IPs.
Multi-threading + HTTP clients: always read the thread-safety docs. In this project, concurrent module-level curl_cffi calls failed until each thread got its own Session.
A global circuit breaker beats per-request backoff. When rate limiting is detected, having all threads stop together is far more stable than each one flailing with its own backoff.

Tool stack: Python 3.12 · curl_cffi · BeautifulSoup · Rich · noise

by Jiahao Ren | github.com/Giggitycountless | jiahao.uk