From Blocked to Blazing Fast: A Scraping Showdown, Live
Goal: scrape all 37,650 player card records from pesdb.net. From TLS fingerprint exposure, multi-threaded crashes, to discovering a single Cookie that saves 97% of requests — a complete record of an anti-anti-scraping battle.
Goal: scrape all 37,650 player card records from pesdb.net/efootball.
End result: requests compressed from 38,827 to 1,177, estimated runtime cut from 10–20 hours to 30–60 minutes.
Phase One: First Version Goes Live, Gets Rate-Limited Immediately
The initial scraper logic was dead simple:
- Scrape 1,177 list pages, collect all player IDs
- Fetch 37,650 player detail pages one by one
- Parse HTML, save to CSV
Not long after it started running, HTTP 429s flooded in. I added rate limits and random delays, but the 429s kept firing.
Problem One: Python’s TLS Fingerprint Gives You Away
When the requests library sends an HTTPS request, it exposes its identity at the TLS handshake stage.
Every TLS client opens a connection by sending a ClientHello message containing:
- Supported cipher suites list
- TLS extension order and parameters
Right after the handshake, an HTTP/2 client also sends a SETTINGS frame whose parameters and ordering are just as distinctive.
The ClientHello features combine into a JA3/JA4 fingerprint. Python's requests/urllib3 stack produces a fingerprint completely different from Chrome's or Firefox's, so the server can identify you as a scraper at the network layer, independent of User-Agent.
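To make the fingerprint idea concrete, here is a minimal sketch of how a JA3 hash is derived: five ClientHello fields are joined into a string and hashed with MD5. The numeric values below are hypothetical and chosen only for illustration; they were not captured from any real client.

```python
import hashlib


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Return (ja3_string, md5_hex) for the given ClientHello fields."""
    fields = [
        str(version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Hypothetical values: same ciphers and extensions, different ordering.
client_a = ja3_hash(771, [4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
client_b = ja3_hash(771, [4866, 4867, 4865], [0, 10, 11, 23, 65281], [29, 23, 24], [0])

print(client_a[0])                 # the raw JA3 string
print(client_a[1] != client_b[1])  # ordering alone changes the hash
```

Because ordering is part of the hash input, two clients offering the same set of ciphers still look different if they list them in a different order, which is why changing the User-Agent header alone does not help.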
Solution: curl_cffi
# Before
import requests
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0 ..."})
# After
from curl_cffi.requests import Session
s = Session(impersonate="chrome136") # Fully replicate Chrome's TLS handshake
r = s.get(url)
curl_cffi is built on curl-impersonate, a patched build of libcurl, and replicates the target browser's:
- Cipher Suite ordering
- TLS extensions (ALPN, SNI, session tickets, etc.)
- HTTP/2 frame format
Problem Two: Getting the impersonate Version Wrong Crashes Immediately
ImpersonateError: Impersonating firefox117 is not supported
curl_cffi only supports specific version numbers — not all versions exist. The fix: first check what your installed version actually supports:
from curl_cffi.requests import BrowserType
print([b.value for b in BrowserType])
# ['chrome99', 'chrome116', 'chrome124', 'chrome136', 'firefox133', 'firefox135', ...]
Pick from the returned list. Don’t guess.
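One way to avoid guessing is to keep a preference list and take the first entry the installed version actually reports. The helper below is a hypothetical sketch (pick_impersonate is not part of curl_cffi); it takes the supported list as a plain argument so it can be fed from the BrowserType check above.

```python
def pick_impersonate(preferred, supported):
    """Return the first target in `preferred` that appears in `supported`."""
    supported_set = set(supported)
    for target in preferred:
        if target in supported_set:
            return target
    raise ValueError(f"none of {list(preferred)} is supported")


# In real use, `supported` would come from: [b.value for b in BrowserType]
supported = ["chrome99", "chrome116", "chrome124", "chrome136", "firefox133"]
choice = pick_impersonate(["firefox117", "chrome136", "chrome124"], supported)
print(choice)  # chrome136
```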
Problem Three: Multi-Threaded curl_cffi Connections All Fail
After switching to curl_cffi, single-threaded tests worked fine, but in the multi-threaded version every request failed with CONN ERR.
Root cause: calling the module-level get() (imported here as curl_requests) concurrently from multiple threads causes libcurl's internal handle initialization to conflict.
Solution: each thread maintains its own independent Session object
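import threading

from curl_cffi.requests import Session

_tls = threading.local()  # one storage slot per thread

def _get_session(impersonate: str) -> Session:
    # Build a new Session the first time this thread asks, or when the
    # requested browser profile differs from the one it currently holds.
    if not hasattr(_tls, "session") or _tls.impersonate != impersonate:
        _tls.session = Session(impersonate=impersonate)
        _tls.impersonate = impersonate
    return _tls.session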
threading.local() guarantees each thread gets its own Session, no interference.
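The isolation is easy to check without touching the network. The sketch below repeats the thread-local pattern with a stand-in class (FakeSession is hypothetical and only takes the place of curl_cffi's Session), then confirms that each thread reuses its own object and never sees another thread's.

```python
import threading

_tls = threading.local()


class FakeSession:
    """Stand-in for curl_cffi's Session, so the demo needs no network."""

    def __init__(self, impersonate):
        self.impersonate = impersonate


def get_session(impersonate):
    # Same shape as the thread-local helper: one object per thread and profile.
    if getattr(_tls, "impersonate", None) != impersonate:
        _tls.session = FakeSession(impersonate)
        _tls.impersonate = impersonate
    return _tls.session


def worker(results, index):
    first = get_session("chrome136")
    second = get_session("chrome136")
    # Keep the object itself so it stays alive for the comparison below.
    results[index] = (first, first is second)


def run(n=4):
    results = [None] * n
    threads = [threading.Thread(target=worker, args=(results, i)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = run()
print(all(reused for _, reused in results))          # each thread reuses its own
print(len({id(session) for session, _ in results}))  # one distinct object per thread
```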
Problem Four: 429s Still Frequent — Swapping Identities Doesn’t Help
Even after TLS fingerprinting was solved, 429s kept firing. Root cause: IP-level rate limiting.
The server has a hard cap on requests per time window from the same IP — doesn’t matter which browser you impersonate.
To handle this gracefully, I implemented a three-layer defense:
1. Three-Layer Perlin Noise Rate Limiting
Using Perlin noise to simulate uneven human clicking rhythm instead of fixed intervals:
n1 = noise.pnoise1(t * 0.06) * 0.25 # Slow wave: overall rhythm drift ±25%
n2 = noise.pnoise1(t * 0.35) * 0.15 # Medium wave: burst/pause episodes ±15%
n3 = noise.pnoise1(t * 2.00) * 0.08 # Fast wave: per-request jitter ±8%
interval = base * (1 + n1 + n2 + n3)
Every 35–90 requests, a random 1.5–5 second “thinking pause” is inserted.
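Put together, the layered jitter and the occasional pause might look like the sketch below. Two assumptions: t is treated as the request index, and the noise function is pluggable. The default smooth_wave is a sine-based stand-in so the example runs without the noise package; passing noise.pnoise1 instead would give the Perlin version described here.

```python
import math
import random


def smooth_wave(x):
    """Stand-in for noise.pnoise1: smooth and bounded in [-1, 1]."""
    return math.sin(x)


def next_interval(t, base, noise_fn=smooth_wave):
    n1 = noise_fn(t * 0.06) * 0.25  # slow drift
    n2 = noise_fn(t * 0.35) * 0.15  # burst/pause episodes
    n3 = noise_fn(t * 2.00) * 0.08  # per-request jitter
    return base * (1 + n1 + n2 + n3)


def delays(count, base, rng, noise_fn=smooth_wave):
    """Yield `count` delays, adding a 1.5-5 s pause every 35-90 requests."""
    until_pause = rng.randint(35, 90)
    for t in range(count):
        delay = next_interval(t, base, noise_fn)
        until_pause -= 1
        if until_pause == 0:
            delay += rng.uniform(1.5, 5.0)  # the "thinking pause"
            until_pause = rng.randint(35, 90)
        yield delay


schedule = list(delays(200, base=1.0, rng=random.Random(0)))
print(min(schedule), max(schedule))
```

With the three amplitudes summing to 0.48, an ordinary delay stays within roughly half to one and a half times the base interval; only the pauses fall outside that band.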
2. ThrottleGate Global Circuit Breaker
When any thread hits a 429, immediately close the global gate — all threads pause:
COOLDOWNS = [20, 40, 60, 90, 120] # Cooldown escalates with trigger count
After cooldown, automatically switch to a new browser profile and continue with a fresh identity.
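A gate like this can be sketched in a few lines. The class below is an illustration under stated assumptions rather than the project's actual ThrottleGate: the method names trip and wait are made up, the clock and sleep functions are injectable so the logic can be exercised without real waiting, and a 429 that arrives while the gate is already closed does not escalate the cooldown again. Switching browser profiles after the cooldown is left out.

```python
import threading
import time

COOLDOWNS = [20, 40, 60, 90, 120]  # seconds, escalating with trigger count


class ThrottleGate:
    def __init__(self, cooldowns=COOLDOWNS, clock=time.monotonic, sleep=time.sleep):
        self._cooldowns = list(cooldowns)
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._open_at = 0.0  # gate is open once clock() >= _open_at
        self._trips = 0

    def trip(self):
        """Close the gate after a 429. Returns the cooldown applied (0 if already closed)."""
        with self._lock:
            if self._clock() < self._open_at:
                return 0.0  # assumption: concurrent 429s count as one trigger
            index = min(self._trips, len(self._cooldowns) - 1)
            cooldown = self._cooldowns[index]
            self._trips += 1
            self._open_at = self._clock() + cooldown
            return cooldown

    def wait(self):
        """Block the calling thread until the gate is open."""
        while True:
            with self._lock:
                remaining = self._open_at - self._clock()
            if remaining <= 0:
                return
            self._sleep(remaining)


# Demo with a fake clock so nothing actually sleeps.
now = [0.0]
slept = []


def fake_sleep(seconds):
    slept.append(seconds)
    now[0] += seconds


gate = ThrottleGate(clock=lambda: now[0], sleep=fake_sleep)
first = gate.trip()      # first 429 closes the gate for 20 s
duplicate = gate.trip()  # another thread's 429 during the cooldown
gate.wait()              # every worker calls this before each request
second = gate.trip()     # next 429 escalates to 40 s
print(first, duplicate, second, slept)
```

Each worker would call gate.wait() before sending a request and gate.trip() on a 429, so one thread's rate-limit response pauses all of them.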
3. Adaptive Backoff
429 hit → multiply request interval by 1.6; every 100 successes → multiply interval by 0.95, gradually returning to normal speed.
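The backoff rule fits in a small class. This is a sketch, not the project's code: the name AdaptiveInterval is hypothetical, and two details are assumptions added here: the interval never drops below its base value, and a 429 resets the success counter.

```python
class AdaptiveInterval:
    def __init__(self, base, penalty=1.6, recovery=0.95, recover_every=100):
        self.base = base
        self.current = base
        self.penalty = penalty
        self.recovery = recovery
        self.recover_every = recover_every
        self._successes = 0

    def on_429(self):
        self.current *= self.penalty
        self._successes = 0  # assumption: start the success streak over

    def on_success(self):
        self._successes += 1
        if self._successes >= self.recover_every:
            self._successes = 0
            # assumption: never go faster than the original base interval
            self.current = max(self.base, self.current * self.recovery)


pacer = AdaptiveInterval(base=2.0)
pacer.on_429()
after_429 = pacer.current
for _ in range(100):
    pacer.on_success()
after_recovery = pacer.current
print(after_429, after_recovery)
```

Because recovery is only 5% per 100 successes while the penalty is 60% per 429, the scraper slows down quickly and speeds back up cautiously.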
The Turning Point: Realizing I Never Needed Detail Pages At All
While analyzing the site structure, I noticed something: the page’s column display was controlled by a Cookie — and the Cookie was literally named columns.
In browser dev tools, I found the JS function that submits column selections:
function submitColumns() {
// Read all checkboxes with id starting with col_
// Build ?columns=id,pos,name,speed,...
document.location.search = Vars;
}
This meant: the list page can directly display all attribute fields — including speed, shooting, passing, all 118 columns.
Just send the request with the right Cookie:
import urllib.parse
session.cookies.set(
"columns",
urllib.parse.quote("id,pos,name,overall_rating,speed,acceleration,...", safe=""),
domain="pesdb.net",
)
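For reference, this is what the encoding step produces. The helper name columns_cookie_value is hypothetical, and the column list is just the first few names from the snippet above, not the full set of 118.

```python
from urllib.parse import quote, unquote


def columns_cookie_value(columns):
    """Join column names with commas and percent-encode the result."""
    return quote(",".join(columns), safe="")


value = columns_cookie_value(["id", "pos", "name", "overall_rating", "speed"])
print(value)           # id%2Cpos%2Cname%2Coverall_rating%2Cspeed
print(unquote(value))  # id,pos,name,overall_rating,speed
```

With safe="", the commas are encoded as %2C, while letters, digits and underscores pass through unchanged.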
Verification result:
Status: 200
Columns: 118, Data rows: 32
First player:
ID: 8554076, Name: Safi Belal, Speed: 95, Overall: 116 ...
Final Architecture Comparison
| | Original Plan | Optimized |
|---|---|---|
| Phase 1 (collect IDs) | 1,177 requests | Not needed |
| Phase 2 (detail pages) | 37,650 requests | Not needed |
| List pages (full columns) | — | 1,177 requests |
| Total requests | 38,827 | 1,177 |
| Estimated runtime | 10–20 hours | 30–60 minutes |
| Rate-limit risk | Extremely high | Low |
97% of requests never needed to be sent. Sometimes the best way to fight rate limiting isn’t to fight harder — it’s to look at the problem from a different angle.
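The numbers in the table can be checked with a few lines of arithmetic, using only the figures quoted in this post:

```python
import math

list_pages = 1_177
detail_pages = 37_650  # one per player card

original_total = list_pages + detail_pages
saved_fraction = 1 - list_pages / original_total
rows_per_page = 32  # from the verification output above

print(original_total)                          # 38827
print(f"{saved_fraction:.1%}")                 # 97.0%
print(math.ceil(detail_pages / rows_per_page)) # 1177
```

The last line also confirms that 32 rows per page is consistent with 37,650 records spread over 1,177 list pages.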
Core Takeaways
- Inspect network requests before writing the scraper. A lot of site data is exposed earlier than the HTML page, in XHR or Cookies. Finding the right entry point can save 97% of the work.
- TLS fingerprint matters more than User-Agent. Servers increasingly identify you at the handshake stage. Changing UA does nothing; you need to swap the entire TLS stack.
- A 429 isn't necessarily a speed problem. It might be an IP problem, and then no matter how slow you go, rotating IPs is the real fix.
- Multi-threading + HTTP clients: always read the thread-safety docs. curl_cffi Sessions cannot be shared across threads.
- A global circuit breaker beats per-request backoff. When rate limiting is detected, having all threads stop together is far more stable than each one flailing with its own backoff.
Tool stack: Python 3.12 · curl_cffi · BeautifulSoup · Rich · noise
by Jiahao Ren | github.com/Giggitycountless | jiahao.uk