eFootball Scraper Tech Deep Dive: From JA3 to the Cookie Hack

Technical notes for people who “know my scraper works, but don’t know why every piece inside it exists.”

What the Scraper Actually Does

When you open pesdb.net/efootball/, your browser does something simple:

1. It sends an HTTP request to the server: “give me this page’s HTML.”
2. The server returns a blob of HTML.
3. The browser paints it into the table you see.

The scraper does the same thing, just replacing step 3 — instead of painting it for a human, it pulls player data out of that HTML and saves it to CSV.

So my script only does two actions, over and over:

Fetch: send a request, get back HTML — fetch_html()
Parse: extract data from that HTML — parse_list_page()
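
As a rough illustration of the parse step, here is a minimal sketch that turns an HTML table into one dict per player. It is not the actual parse_list_page(): my script uses BeautifulSoup, while this sketch swaps in the standard library's html.parser so it runs without dependencies. The function name parse_table_sketch, the sample markup, and the assumption that the first row holds the column headers are all hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table_sketch(html):
    """Treat the first row as headers and return one dict per data row."""
    parser = TableRows()
    parser.feed(html)
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Hypothetical markup; the real pesdb page structure may differ.
sample = """
<table>
  <tr><th>Player Name</th><th>Position</th><th>Overall Rating</th></tr>
  <tr><td><a href="#">Example Player</a></td><td>CF</td><td>95</td></tr>
</table>
"""
players = parse_table_sketch(sample)
```

Each dict can then be handed straight to csv.DictWriter, which is why a header-keyed shape is convenient here.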

pesdb has 37,650 players. Fetching each player’s own detail page would mean 37,650 requests. That is way too many, and the smartest thing about this project is cutting that number down to 1,177.

A cookie is a little note the website stores in your browser. Every time you revisit the site, your browser automatically tucks that note into the request, and the site can “recognize” you and remember your preferences.

pesdb’s player list page only shows a few columns by default (name, position, overall rating…). But the site lets you customize which columns to show — and your “which columns did I pick” preference is stored in a cookie called columns.

Key insight: if you tick all 118 columns in the cookie, the list page spits out every single data point for every player on that page in one shot — including all ability stats, skills, and playing styles that you’d normally have to click into a detail page to see.

What this means

Strategy	Pages to Fetch
Dumb way: scrape every player’s detail page	37,650 requests
My way: cookie + list pages	1,177 requests

Request volume slashed by 97%. This is why I can run on a single IP, with no proxies, and finish in minutes. No other optimization comes close: this one trick squeezes “how many times you knock on the door” down to the bare minimum.
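
The arithmetic behind that figure is easy to check. The sketch below uses only the two numbers from the table; the players-per-page value it derives is an inference from those numbers, not something pesdb documents.

```python
import math

TOTAL_PLAYERS = 37_650   # players on pesdb (from the table above)
LIST_PAGES = 1_177       # list pages needed to cover them (from the table above)

# Fraction of requests saved by fetching list pages instead of detail pages.
reduction = 1 - LIST_PAGES / TOTAL_PLAYERS

# Implied players per list page (inferred, not documented by the site).
players_per_page = math.ceil(TOTAL_PLAYERS / LIST_PAGES)

print(f"reduction: {reduction:.1%}")             # reduction: 96.9%
print(f"players per page: {players_per_page}")   # players per page: 32
```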

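import urllib.parse
from curl_cffi.requests import Session as CurlSession

# _COLS_PARAM lists all 118 columns
_COLS_PARAM = "id,club_number,…,P05,P06,P07"

def _get_session():
    # Simplified: the real function also reuses one session per worker thread
    session = CurlSession(impersonate="chrome136")
    # URL-encode the list (commas become %2C) and store it as the "columns" cookie
    session.cookies.set(
        "columns",
        urllib.parse.quote(_COLS_PARAM, safe=""),
        domain="pesdb.net",
    )
    return session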

This forges the “little note” that tells pesdb: “This user wants to see all 118 columns.”
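
To see what the encoded cookie value looks like, here is a small standard-library sketch. The shortened column list is hypothetical (the real one has 118 entries), and the decode step only shows that the encoding is reversible; how pesdb actually reads the value on its side is an assumption.

```python
import urllib.parse

# Shortened, hypothetical column list; the real _COLS_PARAM has 118 entries.
cols = "id,club_number,P05,P06,P07"

# safe="" makes quote() escape every reserved character, including commas.
encoded = urllib.parse.quote(cols, safe="")
print(encoded)   # id%2Cclub_number%2CP05%2CP06%2CP07

# Roughly what gets attached to each request as the Cookie header.
cookie_header = f"columns={encoded}"

# The encoding is reversible: unquote and split to get the list back.
decoded = urllib.parse.unquote(encoded).split(",")
```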

How Websites Tell “You’re Not Human”

If scraping is just “sending requests,” then how does a website block scripts while letting real people through? Because programmatic requests and real browser requests look different in a lot of subtle ways. There are four layers:

Layer	What the site inspects	Real human vs bare scraper
① TLS Fingerprint	The “handshake style” when establishing an encrypted connection	Python’s default handshake looks nothing like Chrome’s
② HTTP Headers	Browser model, language preferences…	Bare scrapers have sparse, fake headers
③ Behavioral Rhythm	Request frequency and regularity	Humans vary; scrapers hammer at a steady pace
④ IP Address	Network address of the request origin	A single IP sending massive requests in a short window = suspicious

I have a weapon for each layer:

① → TLS fingerprint spoofing (curl_cffi)
② → Browser identity rotation (BrowserProfile)
③ → Human-like rate limiting + circuit breaker backoff
④ → Proxy pool (but at my volume, I don’t need it)

Weapon One: TLS Fingerprint Spoofing (JA3 / curl_cffi)

When you visit an https:// site, the browser and server first shake hands to agree on encryption. The first step of the handshake is the client sending a “here are the cipher suites, extensions, and order I support” list (ClientHello).

Different software produces different lists with different ordering — Chrome has one arrangement, Firefox another, Python’s standard library yet another. Compress that list into a hash and you get a fingerprint. JA3 is the most famous fingerprinting algorithm (with the newer JA4 also in play).
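
To make “compress that list into a hash” concrete, here is a sketch of how a JA3 fingerprint is assembled: five ClientHello fields are written as decimal numbers, joined with dashes inside each field and commas between fields, then hashed with MD5. The numeric values below are illustrative only, not a capture from any real client, and the sketch skips details such as filtering out GREASE values.

```python
import hashlib


def ja3_fingerprint(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string from ClientHello fields and hash it with MD5."""
    fields = [
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ]
    ja3_string = ",".join(fields)
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Illustrative values only.
client_a = ja3_fingerprint(771, [4865, 4866, 4867], [0, 23, 65281], [29, 23, 24], [0])

# Same cipher suites in a different order: the hash changes completely.
client_b = ja3_fingerprint(771, [4867, 4866, 4865], [0, 23, 65281], [29, 23, 24], [0])

print(client_a[0])   # 771,4865-4866-4867,0-23-65281,29-23-24,0
```

Because ordering is part of the input, two clients that support exactly the same ciphers still get different fingerprints if they list them differently.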

Python’s default HTTP libraries (requests, urllib) have a handshake that smells distinctly “Python-like” and matches no real browser. Anti-bot systems detect that and know immediately: “Not a browser. A script.”

How I solved it

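curl_cffi is a Python binding for curl-impersonate, a patched curl build linked against BoringSSL (the TLS library Chrome uses) and configured to send the same cipher suites, extensions, and ordering as the browser it impersonates. The ClientHello it produces is therefore designed to match real Chrome or real Firefox. And it doesn’t just fool JA3: it also replicates HTTP/2 fingerprints (another, newer identification method).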

from curl_cffi.requests import Session as CurlSession
session = CurlSession(impersonate="chrome136")

impersonate="chrome136" is saying: “When you shake hands, pretend you’re Chrome 136.”

Stability note

The impersonate value has to be a target that your installed curl_cffi version recognizes. I’m on 0.15.0, and my code uses 6 targets. But watch out:

Install an older curl_cffi on a different machine, and it might not know chrome136 → immediate error
Real Chrome keeps upgrading. The oldest target in my code, chrome116, is already an “antique fingerprint” in 2026. Long-term, impersonation targets should drift toward newer versions.

Weapon Two: Browser Identity Rotation (BrowserProfile)

Even if every request impersonates Chrome, if thousands of requests are all “the same Chrome 136, same language settings,” it looks fishy.

I prepared 6 different browser + language combos:

_PROFILES = [
    ("chrome136",  "en-US,en;q=0.9"),
    ("chrome124",  "en-GB,en;q=0.9,de;q=0.7"),
    ("chrome120",  "en-US,en;q=0.8,zh-CN;q=0.5,zh;q=0.3"),
    ("chrome116",  "ja,en-US;q=0.9,en;q=0.8"),
    ("firefox135", "en-US,en;q=0.9"),
    ("firefox133", "en-GB,en;q=0.8,fr;q=0.3"),
]

The key is: don’t rotate on every request — constant switching is unnatural. I only swap identities after getting rate-limited and triggering a circuit break. Like: “This identity got flagged, so I’ll change clothes and come back in.”
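
A minimal sketch of “rotate only when flagged” might look like the following. The class name matches my BrowserProfile, but the internals shown here (a cycling iterator guarded by a lock) are an assumption made for illustration, not a copy of the real class.

```python
import itertools
import threading

# Same (impersonate target, Accept-Language) pairs as _PROFILES above.
PROFILES = [
    ("chrome136",  "en-US,en;q=0.9"),
    ("chrome124",  "en-GB,en;q=0.9,de;q=0.7"),
    ("chrome120",  "en-US,en;q=0.8,zh-CN;q=0.5,zh;q=0.3"),
    ("chrome116",  "ja,en-US;q=0.9,en;q=0.8"),
    ("firefox135", "en-US,en;q=0.9"),
    ("firefox133", "en-GB,en;q=0.8,fr;q=0.3"),
]


class BrowserProfile:
    """Holds the current identity; it changes only when rotate() is called."""

    def __init__(self, profiles):
        self._cycle = itertools.cycle(profiles)
        self._lock = threading.Lock()
        self.impersonate, self.accept_language = next(self._cycle)

    def rotate(self):
        """Advance to the next identity, e.g. after a 429/503 cooldown."""
        with self._lock:
            self.impersonate, self.accept_language = next(self._cycle)
            return self.impersonate


profile = BrowserProfile(PROFILES)
```

Every normal request reads profile.impersonate and profile.accept_language unchanged; only the circuit-breaker path calls rotate().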

Weapon Three: Proxy Pool — The One I Don’t Need

A proxy server is a “middleman” that forwards your requests. The site sees the proxy’s IP, not your real one. Proxy pool = a rotating collection of proxies to distribute IPs.

My code has a ProxyPool class that reads from proxies.txt. But that file doesn’t exist in my runtime environment, so the whole thing runs on my real IP, direct connection.

And that’s totally fine, because the cookie trick already squashed request volume from 37K to 1,177 — that’s light enough for a single IP to handle comfortably.

Proxies solve the problem of “too many requests from the same IP.” I solved it at the source by making request count so low that the problem never arises. That anyIP sales email was trying to sell me something I didn’t need.

Weapon Four: Human-Like Rate Limiting (Perlin Noise)

The easiest way to spot a bot is a rhythm that is too regular: exactly 1.5 seconds between requests, thousands of times, never wavering. No human does that.

I didn’t use time.sleep(1.5) as a fixed wait. Instead, I use a 1.5-second baseline modulated by Perlin noise to make intervals smoothly drift up and down:

n1 = noise.pnoise1(self._t * 0.06) * 0.25   # Slow wave: overall rhythm drift ±25%
n2 = noise.pnoise1(self._t * 0.35) * 0.15   # Medium wave: burst/pause episodes ±15%
n3 = noise.pnoise1(self._t * 2.00) * 0.08   # Fast wave: per-request jitter ±8%
interval = base * (1 + n1 + n2 + n3)

Plain random numbers are “jumpy” — 0.1 this time, 0.9 next, no connection. Perlin noise is “smooth randomness”: values undulate continuously over time, like real tides, like the natural tremor of a human hand. Three waves at different speeds layered together produce an organic, irregular rhythm.

Also, every 35–90 requests, I randomly insert a 1.5–5 second “zoning out” pause — simulating a human pausing to think.
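
Here is a self-contained sketch of the whole interval calculation, including the occasional long pause. To keep it runnable with the standard library only, it swaps noise.pnoise1 for a small hand-written value-noise function (smooth_noise), which is a stand-in and not Perlin noise proper. The class name HumanizedIntervals and the choice to add the pause on top of the normal interval are assumptions.

```python
import math
import random


def smooth_noise(x, seed=0):
    """1-D value noise in [-1, 1]: fixed random values at integer points,
    smoothly interpolated in between. A stand-in for noise.pnoise1."""
    def lattice(i):
        return random.Random(i * 1_000_003 + seed).uniform(-1.0, 1.0)

    i = math.floor(x)
    frac = x - i
    t = frac * frac * (3 - 2 * frac)   # smoothstep easing between lattice points
    return lattice(i) * (1 - t) + lattice(i + 1) * t


class HumanizedIntervals:
    """Produces wait times around `base` seconds from three layered waves."""

    def __init__(self, base=1.5, rng=None):
        self.base = base
        self._t = 0
        self._rng = rng or random.Random()
        self._until_pause = self._rng.randint(35, 90)

    def next_interval(self):
        self._t += 1
        n1 = smooth_noise(self._t * 0.06, seed=1) * 0.25   # slow drift
        n2 = smooth_noise(self._t * 0.35, seed=2) * 0.15   # burst/pause episodes
        n3 = smooth_noise(self._t * 2.00, seed=3) * 0.08   # per-request jitter
        interval = self.base * (1 + n1 + n2 + n3)

        self._until_pause -= 1
        if self._until_pause <= 0:
            # "Zoning out": add 1.5-5 s on top, then schedule the next pause.
            interval += self._rng.uniform(1.5, 5.0)
            self._until_pause = self._rng.randint(35, 90)
        return interval


limiter = HumanizedIntervals(rng=random.Random(42))
intervals = [limiter.next_interval() for _ in range(200)]
```

With the three amplitudes summing to 0.48, a normal interval stays between 0.78 s and 2.22 s for a 1.5 s baseline; anything longer in the sequence is one of the inserted pauses.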

Weapon Five: Circuit Breaker and Backoff (ThrottleGate / ReliabilityTracker)

Everything above is proactive camouflage. This section is “what to do once you’ve been spotted.”

When requests come too fast, servers usually respond with HTTP 429 (Too Many Requests) or 503 (Service Unavailable).

ThrottleGate: the master switch

Once a 429/503 hits, ThrottleGate acts like a master circuit breaker — slams shut, halting all workers, entering cooldown. Cooldown times escalate in steps:

COOLDOWNS = [20, 40, 60, 90, 120]   # 1st time: 20s, 2nd: 40s… max 120s

After cooldown, the gate reopens, and we rotate browser identity before continuing.
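
The gate behaviour can be sketched with a threading.Event. This is an illustration under assumptions, not the real ThrottleGate: the method names wait() and trip(), the injectable sleep function, and the fact that the trip counter never resets are all choices made for the sketch.

```python
import threading
import time


class ThrottleGate:
    """Shared gate: every worker calls wait(); whoever sees 429/503 calls trip()."""

    COOLDOWNS = [20, 40, 60, 90, 120]   # seconds; the last value is the cap

    def __init__(self, cooldowns=None, sleep=time.sleep):
        self._cooldowns = list(cooldowns or self.COOLDOWNS)
        self._sleep = sleep               # injectable so tests need not really sleep
        self._open = threading.Event()
        self._open.set()                  # gate starts open
        self._lock = threading.Lock()
        self._trips = 0

    def wait(self):
        """Block the calling worker while the gate is closed."""
        self._open.wait()

    def trip(self):
        """Close the gate, cool down, reopen. Returns the cooldown used, or
        None if another worker is already running the cooldown."""
        with self._lock:
            if not self._open.is_set():
                return None
            self._open.clear()
            index = min(self._trips, len(self._cooldowns) - 1)
            cooldown = self._cooldowns[index]
            self._trips += 1
        try:
            self._sleep(cooldown)
        finally:
            self._open.set()              # always reopen, even on error
        return cooldown


# Usage with a fake sleep that just records the requested durations.
slept = []
gate = ThrottleGate(sleep=slept.append)
used = [gate.trip() for _ in range(7)]
```

The lock ensures that when several workers hit a 429 at once, only one of them runs the cooldown while the others simply find the gate closed.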

ReliabilityTracker: adaptive backoff

Just pausing isn’t enough — if rate limits keep firing, the overall pace is too fast and needs permanent slowing:

Every rate-limit hit: multiply base interval by 1.6 (slow down)
Every 100 successes: multiply interval by 0.95 (gradually recover)
The interval is clamped between a lower and an upper bound; the upper bound is 3× the baseline
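
Those three rules fit in a few lines. The sketch below is an assumption-laden illustration rather than the real ReliabilityTracker: the method names are invented, the lower bound is assumed to be the baseline itself, and resetting the success counter after a rate-limit hit is also an assumption.

```python
class ReliabilityTracker:
    """Multiplicative slow-down on rate limits, gentle recovery on success."""

    def __init__(self, baseline=1.5, max_factor=3.0):
        self.baseline = baseline
        self.min_interval = baseline                # assumed lower bound
        self.max_interval = baseline * max_factor   # cap: 3x the baseline
        self.interval = baseline
        self._successes = 0

    def on_rate_limited(self):
        """A 429/503 arrived: slow down by 1.6x, up to the cap."""
        self.interval = min(self.interval * 1.6, self.max_interval)
        self._successes = 0                         # assumed: restart the count
        return self.interval

    def on_success(self):
        """Every 100th success speeds up by 5%, down to the lower bound."""
        self._successes += 1
        if self._successes % 100 == 0:
            self.interval = max(self.interval * 0.95, self.min_interval)
        return self.interval


tracker = ReliabilityTracker()
after_hits = [tracker.on_rate_limited() for _ in range(3)]   # approx. 2.4, 3.84, 4.5
for _ in range(100):
    recovered = tracker.on_success()                          # approx. 4.275 at the end
```

The asymmetry is deliberate: one hit costs 60%, while winning it back takes hundreds of clean requests, so the pace settles just under whatever the site tolerates.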

This “brake — ease off” system keeps me running right at the fastest speed the site will tolerate.

Component	Role	Timescale
ThrottleGate	Emergency full stop on trouble	Short-term (seconds)
ReliabilityTracker	Adjust long-term cruising speed	Long-term (gradual across the run)
HumanizedRateLimiter	Control per-request interval	Every single request

Concurrency: 3 Workers, Side by Side

I use 3 threads fetching different pages simultaneously — scheduled by ThreadPoolExecutor. Why 3 and not 30? Too many = too aggressive toward the site = instant rate limit. 3 is the balance between “fast enough” and “don’t provoke.”

3 workers writing to CSV and updating progress at the same time can collide, so I protect with a lock:

with csv_lock:
    writer.writerow(row)

Each worker also gets its own independent session (threading.local()), so their cookies and identities don’t interfere with each other.
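
Put together, the concurrency pattern looks roughly like this. It is a sketch with stand-ins: a plain dict plays the role of the HTTP session, the rows are fake, and an in-memory buffer replaces the CSV file. The helper names get_session and scrape_page are hypothetical.

```python
import csv
import io
import threading
from concurrent.futures import ThreadPoolExecutor

csv_lock = threading.Lock()
_local = threading.local()


def get_session():
    """Return this thread's own session, creating it on first use.
    A plain dict stands in for the real HTTP session."""
    if not hasattr(_local, "session"):
        _local.session = {"owner": threading.current_thread().name}
    return _local.session


def scrape_page(page, writer):
    session = get_session()
    # Fake "parsed" rows: (page, row index, which worker fetched it).
    rows = [[page, i, session["owner"]] for i in range(3)]
    with csv_lock:                  # only one worker writes at a time
        writer.writerows(rows)
    return page


buffer = io.StringIO()              # stands in for the CSV file
writer = csv.writer(buffer)
with ThreadPoolExecutor(max_workers=3) as pool:
    done = sorted(pool.map(lambda p: scrape_page(p, writer), range(1, 11)))
```

Writing all rows of a page inside one lock acquisition also keeps each page's rows contiguous in the output.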

Resume-From-Checkpoint + Dashboard

Every 50 pages completed, progress is saved to efootball_progress_v2.json. Power loss, error, Ctrl+C: the next launch automatically skips finished pages. The CSV is opened in append mode, so existing rows are never overwritten.
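
A checkpoint like this can be sketched in a few functions. The JSON layout (a single done_pages list), the function names, and the write-to-temp-then-rename step are assumptions for illustration; the actual format of efootball_progress_v2.json is not shown in these notes.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the set of finished pages, or an empty set on first run."""
    try:
        with open(path, encoding="utf-8") as f:
            return set(json.load(f)["done_pages"])
    except FileNotFoundError:
        return set()


def save_done(path, done_pages):
    """Write to a temp file first, then rename, so a crash mid-write
    cannot leave a half-written progress file behind."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"done_pages": sorted(done_pages)}, f)
    os.replace(tmp, path)


def pending_pages(path, total_pages):
    """Pages still to fetch, in order."""
    done = load_done(path)
    return [p for p in range(1, total_pages + 1) if p not in done]


# Usage against a throwaway directory.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "progress.json")
    first_run = pending_pages(path, 5)    # nothing saved yet
    save_done(path, {1, 2, 4})
    second_run = pending_pages(path, 5)   # finished pages are skipped
```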

The dashboard is a real-time panel drawn with the rich library: progress bar, ETA, current identity, cooldown status, HTTP error stats. Refreshes every 0.25 seconds. Purely for looks — delete it and the scraper still works.

A Single Request’s Full Journey

Here’s what happens when we fetch page 500:

ThreadPoolExecutor assigns the task to an idle worker
_gate.wait() — if we’re in cooldown, the worker waits for the gate to open
_rate.wait() — pause for the Perlin-noise-computed interval (~1.5s with drift), occasional longer zoning-out
_get_session() — get this worker’s own session, impersonating Chrome 136, with the full-118-column cookie
_proxy_pool.next() — returns None, direct connection
Send the request with spoofed TLS fingerprint + browser headers
Judge the result: 200 → log success and slightly speed up; 429/503 → backoff + circuit-break cooldown + swap identity and retry
parse_list_page() — pull dozens of players × 118 columns out of the HTML
Grab csv_lock, write to CSV, mark page 500 complete
Background thread refreshes the dashboard
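
The “judge the result” step in the middle of that list can be sketched as a small retry loop. Everything here is illustrative: fetch_page, the callback parameters, and the max_attempts limit are hypothetical names and values, and a fake fetcher replaces the real network call.

```python
RETRYABLE = {429, 503}


def fetch_page(page, fetch, on_success, on_rate_limited, max_attempts=5):
    """Fetch one page, retrying after rate-limit responses."""
    for _ in range(max_attempts):
        status, html = fetch(page)
        if status == 200:
            on_success()            # e.g. tell the tracker things went well
            return html
        if status in RETRYABLE:
            on_rate_limited()       # e.g. back off, cool down, swap identity
            continue
        raise RuntimeError(f"page {page}: unexpected HTTP {status}")
    raise RuntimeError(f"page {page}: still rate-limited after {max_attempts} attempts")


# Usage with a fake fetcher: two rate-limit responses, then success.
responses = iter([(429, ""), (503, ""), (200, "<html>page 500</html>")])
events = []
html = fetch_page(
    500,
    fetch=lambda page: next(responses),
    on_success=lambda: events.append("success"),
    on_rate_limited=lambda: events.append("cooldown"),
)
```

Passing the reactions in as callbacks keeps the loop itself free of any knowledge about gates, trackers, or profiles.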

All 1,177 pages go through this flow (3 workers in parallel). When they’re all done, you’ve got a complete efootball_players.csv.

Summary: Three Takeaways

The real killer is the cookie column selector — slashing request volume by 97% made everything else easy
TLS fingerprint spoofing is request-level “face-swapping” — making the script’s handshake look exactly like real Chrome; identity rotation, human-like rate limiting, and circuit-breaker backoff are all about “being fast without alarming the site”
I don’t need a proxy pool because request volume is low enough for a single IP to handle easily. That anyIP sales email was peddling exactly what I didn’t need.

Tool stack: Python 3.12 · curl_cffi · BeautifulSoup · Rich · noise · ThreadPoolExecutor