Hyperbrowser Web Scraping Tools
Hyperbrowser is a platform for running and scaling headless browsers. It lets you launch and manage browser sessions at scale and provides easy-to-use solutions for any web scraping need, such as scraping a single page or crawling an entire site.
Key Features:
- Instant Scalability - Spin up hundreds of browser sessions in seconds without infrastructure headaches
- Simple Integration - Works seamlessly with popular tools like Puppeteer and Playwright
- Powerful APIs - Easy-to-use APIs for scraping and crawling any site, and much more
- Bypass Anti-Bot Measures - Built-in stealth mode, ad blocking, automatic CAPTCHA solving, and rotating proxies
This notebook provides a quick overview for getting started with Hyperbrowser web tools.
For more information about Hyperbrowser, please visit the Hyperbrowser website, or check out the Hyperbrowser docs.
Key Capabilities
Scrape
Hyperbrowser provides powerful scraping capabilities that allow you to extract data from any webpage. The scraping tool can convert web content into structured formats like markdown or HTML, making it easy to process and analyze the data.
Crawl
The crawling functionality enables you to navigate through multiple pages of a website automatically. You can set parameters like page limits to control how extensively the crawler explores the site, collecting data from each page it visits.
Extract
Hyperbrowser's extraction capabilities use AI to pull specific information from webpages according to your defined schema. This allows you to transform unstructured web content into structured data that matches your exact requirements.
Overview
Integration details
| Tool | Package | Local | Serializable | JS support |
|---|---|---|---|---|
| Crawl Tool | langchain-hyperbrowser | ❌ | ❌ | ❌ |
| Scrape Tool | langchain-hyperbrowser | ❌ | ❌ | ❌ |
| Extract Tool | langchain-hyperbrowser | ❌ | ❌ | ❌ |
Setup
To access the Hyperbrowser web tools, you'll need to install the langchain-hyperbrowser integration package and create a Hyperbrowser account to get an API key.
Credentials
Head to Hyperbrowser to sign up and generate an API key. Once you've done this, set the HYPERBROWSER_API_KEY environment variable:
export HYPERBROWSER_API_KEY=<your-api-key>
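If you prefer to set the key from within Python (for example, when running this notebook directly), here's a minimal sketch using only the standard library; it reads the same HYPERBROWSER_API_KEY variable as above:

import getpass
import os

# Prompt for the key only if it isn't already set in the environment
if not os.environ.get("HYPERBROWSER_API_KEY"):
    os.environ["HYPERBROWSER_API_KEY"] = getpass.getpass("Enter your Hyperbrowser API key: ")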
Installation
Install langchain-hyperbrowser.
%pip install -qU langchain-hyperbrowser
Instantiation
Crawl Tool
The HyperbrowserCrawlTool is a powerful tool that can crawl entire websites, starting from a given URL. It supports configurable page limits and scraping options.
from langchain_hyperbrowser import HyperbrowserCrawlTool
tool = HyperbrowserCrawlTool()
Scrape Tool
The HyperbrowserScrapeTool is a tool that can scrape content from web pages. It supports both markdown and HTML output formats, along with metadata extraction.
from langchain_hyperbrowser import HyperbrowserScrapeTool
tool = HyperbrowserScrapeTool()
Extract Tool
The HyperbrowserExtractTool is a powerful tool that uses AI to extract structured data from web pages. It can extract information based on a predefined schema.
from langchain_hyperbrowser import HyperbrowserExtractTool
tool = HyperbrowserExtractTool()
Invocation
Basic Usage
Crawl Tool
from langchain_hyperbrowser import HyperbrowserCrawlTool
result = HyperbrowserCrawlTool().invoke(
    {
        "url": "https://example.com",
        "max_pages": 2,
        "scrape_options": {"formats": ["markdown"]},
    }
)
print(result)
{'data': [CrawledPage(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None, url='https://example.com', status='completed', error=None)], 'error': None}
Scrape Tool
from langchain_hyperbrowser import HyperbrowserScrapeTool
result = HyperbrowserScrapeTool().invoke(
    {"url": "https://example.com", "scrape_options": {"formats": ["markdown"]}}
)
print(result)
{'data': ScrapeJobData(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None), 'error': None}
Extract Tool
from langchain_hyperbrowser import HyperbrowserExtractTool
from pydantic import BaseModel
class SimpleExtractionModel(BaseModel):
    title: str


result = HyperbrowserExtractTool().invoke(
    {
        "url": "https://example.com",
        "schema": SimpleExtractionModel,
    }
)
print(result)
{'data': {'title': 'Example Domain'}, 'error': None}
With Custom Options
Crawl Tool with Custom Options
result = HyperbrowserCrawlTool().run(
    {
        "url": "https://example.com",
        "max_pages": 2,
        "scrape_options": {
            "formats": ["markdown", "html"],
        },
        "session_options": {"use_proxy": True, "solve_captchas": True},
    }
)
print(result)
{'data': [CrawledPage(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None, url='https://example.com', status='completed', error=None)], 'error': None}
Scrape Tool with Custom Options
result = HyperbrowserScrapeTool().run(
    {
        "url": "https://example.com",
        "scrape_options": {
            "formats": ["markdown", "html"],
        },
        "session_options": {"use_proxy": True, "solve_captchas": True},
    }
)
print(result)
{'data': ScrapeJobData(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html='<html><head>\n <title>Example Domain</title>\n\n <meta charset="utf-8">\n <meta http-equiv="Content-type" content="text/html; charset=utf-8">\n <meta name="viewport" content="width=device-width, initial-scale=1">\n \n</head>\n\n<body>\n<div>\n <h1>Example Domain</h1>\n <p>This domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.</p>\n <p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n\n\n</body></html>', markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None), 'error': None}
Extract Tool with Custom Schema
from typing import List
from pydantic import BaseModel
class ProductSchema(BaseModel):
    title: str
    price: float


class ProductsSchema(BaseModel):
    products: List[ProductSchema]


result = HyperbrowserExtractTool().run(
    {
        "url": "https://dummyjson.com/products?limit=10",
        "schema": ProductsSchema,
        "session_options": {"use_proxy": True},
    }
)
print(result)
{'data': {'products': [{'price': 9.99, 'title': 'Essence Mascara Lash Princess'}, {'price': 19.99, 'title': 'Eyeshadow Palette with Mirror'}, {'price': 14.99, 'title': 'Powder Canister'}, {'price': 12.99, 'title': 'Red Lipstick'}, {'price': 8.99, 'title': 'Red Nail Polish'}, {'price': 49.99, 'title': 'Calvin Klein CK One'}, {'price': 129.99, 'title': 'Chanel Coco Noir Eau De'}, {'price': 89.99, 'title': "Dior J'adore"}, {'price': 69.99, 'title': 'Dolce Shine Eau de'}, {'price': 79.99, 'title': 'Gucci Bloom Eau de'}]}, 'error': None}
Async Usage
All tools support async usage:
from typing import List
from langchain_hyperbrowser import (
HyperbrowserCrawlTool,
HyperbrowserExtractTool,
HyperbrowserScrapeTool,
)
from pydantic import BaseModel
class ExtractionSchema(BaseModel):
    popular_library_name: List[str]


async def web_operations():
    # Crawl
    crawl_tool = HyperbrowserCrawlTool()
    crawl_result = await crawl_tool.arun(
        {
            "url": "https://example.com",
            "max_pages": 5,
            "scrape_options": {"formats": ["markdown"]},
        }
    )

    # Scrape
    scrape_tool = HyperbrowserScrapeTool()
    scrape_result = await scrape_tool.arun(
        {"url": "https://example.com", "scrape_options": {"formats": ["markdown"]}}
    )

    # Extract
    extract_tool = HyperbrowserExtractTool()
    extract_result = await extract_tool.arun(
        {
            "url": "https://npmjs.com",
            "schema": ExtractionSchema,
        }
    )

    return crawl_result, scrape_result, extract_result
results = await web_operations()
print(results)
Use within an agent
Here's how to use any of the web tools within an agent:
from langchain_hyperbrowser import HyperbrowserCrawlTool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent
# Initialize the crawl tool
crawl_tool = HyperbrowserCrawlTool()
# Create the agent with the crawl tool
llm = ChatOpenAI(temperature=0)
agent = create_react_agent(llm, [crawl_tool])
user_input = "Crawl https://example.com and get content from up to 5 pages"
for step in agent.stream(
    {"messages": user_input},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()
================================ Human Message =================================
Crawl https://example.com and get content from up to 5 pages
================================== Ai Message ==================================
Tool Calls:
  hyperbrowser_crawl_data (call_G2ofdHOqjdnJUZu4hhbuga58)
 Call ID: call_G2ofdHOqjdnJUZu4hhbuga58
  Args:
    url: https://example.com
    max_pages: 5
    scrape_options: {'formats': ['markdown']}
================================= Tool Message =================================
Name: hyperbrowser_crawl_data
{'data': [CrawledPage(metadata={'url': 'https://www.example.com/', 'title': 'Example Domain', 'viewport': 'width=device-width, initial-scale=1', 'sourceURL': 'https://example.com'}, html=None, markdown='Example Domain\n\n# Example Domain\n\nThis domain is for use in illustrative examples in documents. You may use this\ndomain in literature without prior coordination or asking for permission.\n\n[More information...](https://www.iana.org/domains/example)', links=None, screenshot=None, url='https://example.com', status='completed', error=None)], 'error': None}
================================== Ai Message ==================================
I have crawled the website [https://example.com](https://example.com) and retrieved content from the first page. Here is the content in markdown format:
```
Example Domain
# Example Domain
This domain is for use in illustrative examples in documents. You may use this
domain in literature without prior coordination or asking for permission.
[More information...](https://www.iana.org/domains/example)
```
If you would like to crawl more pages or need additional information, please let me know!
Configuration Options
Common Options
All tools support these basic configuration options (a combined sketch follows after the option lists below):
- url: The URL to process
- session_options: Browser session configuration
  - use_proxy: Whether to use a proxy
  - solve_captchas: Whether to automatically solve CAPTCHAs
  - accept_cookies: Whether to accept cookies
Tool-Specific Options
Crawl Tool
- max_pages: Maximum number of pages to crawl
- scrape_options: Options for scraping each page
  - formats: List of output formats (markdown, html)
Scrape Tool
- scrape_options: Options for scraping the page
  - formats: List of output formats (markdown, html)
Extract Tool
- schema: Pydantic model defining the structure to extract
- extraction_prompt: Natural language prompt for extraction
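As a quick illustration, here is a minimal sketch that combines the common and tool-specific options above in a single crawl invocation; the values are illustrative, not recommendations:

from langchain_hyperbrowser import HyperbrowserCrawlTool

result = HyperbrowserCrawlTool().invoke(
    {
        "url": "https://example.com",
        # Tool-specific options
        "max_pages": 3,
        "scrape_options": {"formats": ["markdown", "html"]},
        # Common session options
        "session_options": {
            "use_proxy": True,
            "solve_captchas": True,
            "accept_cookies": True,
        },
    }
)

The extraction_prompt option is not demonstrated elsewhere in this notebook; the following is a hypothetical sketch assuming it accepts free-form text as described in the list above:

from langchain_hyperbrowser import HyperbrowserExtractTool

result = HyperbrowserExtractTool().invoke(
    {
        "url": "https://example.com",
        # Assumption: extraction_prompt can guide extraction on its own,
        # without a Pydantic schema
        "extraction_prompt": "Extract the page title and the main heading text",
    }
)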
For more details, see the API reference.
Related
- Tool conceptual guide
- Tool how-to guides