Sometimes curl tells the truth about the server response and still gives you the wrong page.
That is exactly what happened while collecting recent selfh.st Weekly projects for the FOSS Engineer backlog.
The public HTML loaded fine, but the important Development Activity section was missing. The browser showed it. The saved file did not.
The fix was not a better regex. It was using a real browser.
Playwright is usually introduced as an end-to-end testing framework, but it is also a very practical automation layer for rendered pages, logged-in sessions, screenshots, DOM snapshots, and “wait until the thing I need is really there” workflows.
What Is Playwright?
Playwright is an open-source browser automation framework maintained by Microsoft. It can drive Chromium, Firefox, and WebKit, and it has first-class support for JavaScript/TypeScript, Python, Java, and .NET.
For test suites, Playwright is useful because it has locators, browser contexts, tracing, screenshots, and auto-waiting. For scraping and content workflows, the useful part is simpler:
- Open the same page a human opens.
- Reuse a logged-in browser profile.
- Wait for hydrated content.
- Read the rendered DOM.
- Save screenshots or full HTML.
- Fail if the expected content never appears.
That last point matters. A script that quietly saves a partial page is worse than a script that fails loudly.
The Problem With curl
curl is excellent when the server response is the artifact you need.
But many modern sites split the page into layers:
- Public server-rendered shell.
- JavaScript hydration.
- Member-only or logged-in content.
- API calls that run after the page loads.
- UI elements that appear only after client-side checks.
In the selfh.st case, the short public page included Weekly Highlights and Content Spotlight, but not Development Activity. A plain HTML save taken at the wrong moment had the same problem. The page visible in the browser was correct; the file was not.
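One way to confirm this kind of gap before reaching for a browser is to check the raw response for the sections you expect. The sketch below is illustrative only: missing_sections and fetch_raw are hypothetical helper names, and the User-Agent string is an arbitrary placeholder.

```python
from urllib.request import Request, urlopen


def missing_sections(html: str, required: list[str]) -> list[str]:
    """Return the required section titles that do not appear in the HTML."""
    lowered = html.lower()
    return [title for title in required if title.lower() not in lowered]


def fetch_raw(url: str, timeout: float = 30.0) -> str:
    """Fetch the server response the way curl would: no JavaScript, no login."""
    request = Request(url, headers={"User-Agent": "section-check/0.1"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling missing_sections(fetch_raw(url), [...]) with the three section titles gives a quick signal: a non-empty result means the raw response alone is not enough.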
So the automation rule became:
Do not save the page until the rendered DOM contains Development Activity.
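A minimal sketch of that rule on the saving side, assuming the rendered HTML is already held in a string. save_if_complete and IncompletePageError are hypothetical names, not taken from the repo script, and the temporary-file rename is an added precaution so a failed run never leaves a partial file at the final path.

```python
import os
import tempfile
from pathlib import Path


class IncompletePageError(RuntimeError):
    """Raised when the required text is missing from the rendered HTML."""


def save_if_complete(html: str, required_text: str, out_path: Path) -> Path:
    """Write html to out_path only if it contains required_text."""
    if required_text not in html:
        raise IncompletePageError(
            f"{required_text!r} not found; refusing to write {out_path}"
        )
    out_path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then rename it into
    # place, so the final path never holds a half-written file.
    fd, tmp_name = tempfile.mkstemp(dir=out_path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(html)
        os.replace(tmp_name, out_path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return out_path
```

The check runs before any file is created, so a missing section produces an exception and no output at all.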
A Small Rendered HTML Capture Script
This is the workflow now in this repo:
uv run --with playwright==1.56.0 \
  Z_Codex_Posts/save_selfhst_rendered.py \
  "https://selfh.st/weekly/2026-05-29/" \
  "Z_Codex_Posts/Self-Host Weekly (29 May 2026).html" \
  --headed --wait-text "Development Activity"
The important details:
- --headed opens a visible browser so you can log in if the session is not already authenticated.
- The script uses a persistent profile, so cookies and local storage survive the next run.
- It waits for Development Activity.
- It refuses to save if that text never appears.
- It writes document.documentElement.outerHTML, not the original server response.
That gives the downstream scraper the same DOM the user saw in the browser.
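The repo script itself is not reproduced in this post, but a command-line interface matching the invocation above could be sketched roughly as follows. The two positional arguments and the --headed and --wait-text flags come from that command; making --wait-text required, and the --timeout-ms option with its default, are assumptions added for illustration.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for: URL OUTPUT [--headed] --wait-text TEXT."""
    parser = argparse.ArgumentParser(
        description="Save the rendered HTML of a page once required text appears."
    )
    parser.add_argument("url", help="Page to open in the browser")
    parser.add_argument("output", type=Path, help="Where to write the rendered HTML")
    parser.add_argument(
        "--headed",
        action="store_true",
        help="Show the browser window so you can log in if needed",
    )
    parser.add_argument(
        "--wait-text",
        required=True,  # assumption: never save without a content check
        help="Text that must appear in the rendered page before saving",
    )
    parser.add_argument(
        "--timeout-ms",
        type=int,
        default=120_000,  # assumption: leaves time for a manual login
        help="How long to wait for the text, in milliseconds",
    )
    return parser
```

Requiring the wait text at the interface level means the script cannot be run in a mode that saves whatever happens to be on screen.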
Why Persistent Profiles Matter
Playwright can create isolated browser contexts for test suites, which is what you want for repeatable tests. But for member-gated content, a persistent profile is useful because it behaves like a normal browser profile:
- Log in once.
- Keep cookies.
- Keep local storage.
- Reuse the state on future runs.
For this site, the profile path lives outside the git repository:
~/.cache/fossengineer-selfhst-playwright
That is intentional. Browser profiles can contain session cookies and other private state. They belong in local cache, not in source control.
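A script could enforce that placement rather than rely on convention. The sketch below assumes a POSIX-style home directory; the helper names are hypothetical, and the directory name is the one shown above.

```python
from pathlib import Path

PROFILE_DIR = Path.home() / ".cache" / "fossengineer-selfhst-playwright"


def is_inside_git_repo(path: Path) -> bool:
    """Return True if path or any of its parents contains a .git entry."""
    resolved = path.expanduser().resolve()
    return any(
        (candidate / ".git").exists()
        for candidate in (resolved, *resolved.parents)
    )


def prepare_profile_dir(path: Path = PROFILE_DIR) -> Path:
    """Create the profile directory, owner-only, refusing paths inside a repo."""
    if is_inside_git_repo(path):
        raise ValueError(f"{path} is inside a git repository; move the profile elsewhere")
    path.mkdir(parents=True, exist_ok=True)
    path.chmod(0o700)  # owner-only access; has limited effect on Windows
    return path
```

One caveat: if the home directory itself is a git repository (a common dotfiles setup), this check rejects the default path, and an ignore rule would be the more practical safeguard.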
Why not commit storage state?
Playwright also supports saving and loading storage state. That is useful for tests and CI when you control the account and secret handling.
For this editorial workflow, a persistent local profile is simpler. The browser opens, the user logs in if needed, and the scraper waits for the section it needs. No cookie export, no committed auth file, and no pretending a session artifact is harmless.
The Python Shape
The core pattern is short:
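from pathlib import Path

from playwright.sync_api import sync_playwright

# A "~" inside a plain string is not expanded by Python, so build the
# absolute profile path explicitly.
profile_dir = Path.home() / ".cache" / "fossengineer-selfhst-playwright"

with sync_playwright() as p:
    context = p.chromium.launch_persistent_context(
        str(profile_dir),
        headless=False,
        executable_path="/usr/bin/google-chrome",
        viewport={"width": 1440, "height": 1200},
    )
    # Reuse the page the persistent context opens with, if there is one.
    page = context.pages[0] if context.pages else context.new_page()
    page.goto("https://selfh.st/weekly/2026-05-29/", wait_until="domcontentloaded")
    page.wait_for_load_state("networkidle")
    # Raises a timeout error if the section never appears.
    page.locator("body").filter(has_text="Development Activity").wait_for()
    # Serialize the DOM as the browser currently sees it.
    html = page.evaluate("document.documentElement.outerHTML")
    context.close()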
The repo script wraps this with argument parsing, timeout handling, output paths, and a safer “do not save partial HTML” failure path.
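That wrapper is not shown in this post, so the following is only a sketch of what the wait-and-capture step could look like as a function. capture_rendered_html is a hypothetical name and the 120-second default is an assumption. It expects an already-open Playwright page and relies on Locator.wait_for raising Playwright's TimeoutError when the text never becomes visible.

```python
from typing import Any


def capture_rendered_html(page: Any, wait_text: str, timeout_ms: int = 120_000) -> str:
    """Return the rendered DOM once wait_text is visible on the page.

    `page` is expected to be a playwright.sync_api.Page. If the text never
    appears, wait_for() raises and this function returns nothing, so the
    caller has no HTML to save.
    """
    # .first avoids a strict-mode error when the text matches several elements.
    page.get_by_text(wait_text).first.wait_for(state="visible", timeout=timeout_ms)
    # Serialize the DOM only after the wait has succeeded.
    return page.evaluate("document.documentElement.outerHTML")
```

Inside the with sync_playwright() block above, html = capture_rendered_html(page, "Development Activity") would stand in for the separate wait and evaluate calls. Letting the timeout propagate is deliberate: an unhandled exception is the loud failure this workflow wants.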
Where Playwright Fits
Playwright is not a replacement for every HTTP client.
Use curl, httpx, requests, or fetch when:
- The HTML or JSON response already contains the data.
- You need speed and low resource usage.
- You are calling a documented API.
- You are running large-scale extraction where a browser would be wasteful.
Use Playwright when:
- The page is rendered after JavaScript runs.
- Authenticated state matters.
- You need screenshots or PDFs.
- You need to click through UI states.
- A page behaves differently in a real browser than in raw HTTP.
- You need browser APIs such as cookies, local storage, IndexedDB, viewport, or user agent behavior.
That makes Playwright a good bridge between simple scraping and full manual browsing.
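One way to apply both lists in a single script is to try the cheap request first and start a browser only when the expected marker is missing. This is a sketch of that idea, not something the selfh.st workflow does; the function name and the two injected fetchers are hypothetical.

```python
from typing import Callable


def fetch_with_fallback(
    url: str,
    marker: str,
    fetch_raw: Callable[[str], str],
    fetch_rendered: Callable[[str], str],
) -> tuple[str, str]:
    """Return (html, source), where source is "raw" or "rendered"."""
    html = fetch_raw(url)
    if marker in html:
        # The plain HTTP response already has the data; no browser needed.
        return html, "raw"
    html = fetch_rendered(url)
    if marker not in html:
        raise LookupError(f"{marker!r} not found in raw or rendered HTML for {url}")
    return html, "rendered"
```

Passing the fetchers in as arguments keeps the decision logic free of network and browser code, so it can be tested with plain functions.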
If the problem is not rendering but browser fingerprint behavior, the next layer is different. I covered CloakBrowser separately because it wraps Playwright/Puppeteer-style automation around a patched Chromium runtime for authorized QA, browser-agent experiments, and fingerprint testing. Playwright is the baseline automation layer; CloakBrowser is a specialized browser runtime built around stealthier automation signals.
Playwright vs Selenium vs Puppeteer
Selenium is still the long-running WebDriver baseline. It is widely supported and familiar in enterprise QA.
Puppeteer is excellent for Chromium-centric automation, especially Node.js projects that already live close to the Chrome DevTools Protocol.
Playwright feels more modern for new work because it bakes in browser contexts, locators, auto-waiting, tracing, and multi-browser support. The Python API is also comfortable for automation scripts that are not full test suites.
The main idea is not that one tool wins forever. It is that Playwright is often the shortest path when you need a real browser and deterministic waiting.
Practical Lessons From This Workflow
The selfh.st capture bug produced a few useful rules:
- Wait for content, not time. Waiting five seconds is a guess. Waiting for Development Activity is a contract.
- Fail before saving bad data. A missing section should stop the script, not create a misleading output file.
- Keep profiles out of git. Browser profiles can contain cookies.
- Prefer rendered DOM for hydrated pages. If the browser is the source of truth, save what the browser sees.
- Keep the parser boring. Once Playwright saves the right HTML, BeautifulSoup can do the simple part.
The Full selfh.st Flow
After Playwright captures the full rendered issue, the existing parser workflow takes over:
# Group repos by newsletter section
uv run --quiet --with beautifulsoup4==4.12.3 \
  Z_Codex_Posts/v4.py "Z_Codex_Posts/Self-Host Weekly (29 May 2026).html" \
  > Z_Codex_Posts/discovered-2026-05-29.md
# Rank repos by GitHub stars and descriptions
uv run --quiet --with beautifulsoup4==4.12.3 \
  Z_Codex_Posts/v5.py "Z_Codex_Posts/Self-Host Weekly (29 May 2026).html" \
  > Z_Codex_Posts/discovered-2026-05-29-stars.md
v4.py preserves the editorial structure. v5.py adds prioritization data from GitHub.
The browser step is only there to make sure those scripts receive the complete issue.
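Neither v4.py nor v5.py is shown here. To illustrate the kind of boring parsing a complete HTML file makes possible, here is a sketch that groups GitHub links under the nearest preceding heading. It uses the standard library's html.parser instead of BeautifulSoup so it runs without extra dependencies, and it assumes sections are introduced by h2 headings, which may not match the real markup.

```python
from html.parser import HTMLParser


class SectionLinkParser(HTMLParser):
    """Collect github.com links under the most recent <h2> heading."""

    def __init__(self) -> None:
        super().__init__()
        self.sections: dict[str, list[str]] = {}
        self._current = "(no section)"
        self._in_heading = False
        self._heading_parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self._in_heading = True
            self._heading_parts = []
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.startswith("https://github.com/"):
                links = self.sections.setdefault(self._current, [])
                if href not in links:  # skip repeated links within a section
                    links.append(href)

    def handle_data(self, data):
        if self._in_heading:
            self._heading_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "h2" and self._in_heading:
            self._in_heading = False
            # Collapse runs of whitespace inside the heading text.
            self._current = " ".join("".join(self._heading_parts).split())


def group_links_by_section(html: str) -> dict[str, list[str]]:
    """Map each section heading to the GitHub links that follow it."""
    parser = SectionLinkParser()
    parser.feed(html)
    parser.close()
    return parser.sections
```

If the saved file were missing Development Activity, a parser like this would return a smaller dictionary without raising any error, which is why the completeness check belongs in the capture step.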
Self-Hosting Angle
Playwright itself is not a “self-hosted app” in the usual dashboard sense. It is infrastructure for workflows:
- Editorial content discovery.
- Screenshot generation.
- Visual regression checks.
- Authenticated page archiving.
- Admin UI smoke tests.
- Browser-based data extraction.
- Agent-visible web automation.
If you run a homelab, a blog, or internal tools, Playwright is the piece that lets scripts interact with the web like a user rather than like a socket.
Conclusion
The practical lesson is simple: when a page lies to curl, ask a browser.
Playwright gives you that browser without turning the workflow into a manual process. You can keep login state locally, wait for the exact content you need, save the rendered DOM, and hand the result to ordinary parsers.
That is a useful pattern well beyond selfh.st newsletters. Any time the useful page exists only after JavaScript, auth, or hydration, Playwright is worth reaching for.