Scrape

Scrape Content of a Given URL

POST /env/scrape

This endpoint scrapes the content of a given URL within the session’s environment. If a session ID is not provided, a new session is created. If a session ID is provided and no URL is provided, the current page in the session is scraped.

Request Example:

curl --location \
--request POST 'https://api.cros.one/env/scrape' \
--header 'Authorization: Bearer your-api-key' \
--header 'Content-Type: application/json' \
--data '{
  "session_id": "1234567890abcdef",
  "url": "https://example.com",
  "only_main_content": true,
  "scrape_images": false,
  "screenshot": true,
  "session_timeout_minutes": 10
}'
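
For readers calling the endpoint from code rather than the shell, the curl request above can be approximated in Python with only the standard library. This is a minimal sketch, not an official client: the helper names `build_scrape_request` and `send_scrape_request` and the 60-second HTTP timeout are assumptions made for illustration. The URL, headers, and body fields are taken from the example above.

```python
import json
import urllib.request

# Endpoint taken from the request example above.
SCRAPE_URL = "https://api.cros.one/env/scrape"


def build_scrape_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request for the scrape endpoint.

    Hypothetical helper: the name and signature are illustrative only.
    """
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        SCRAPE_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def send_scrape_request(request: urllib.request.Request, timeout_s: float = 60.0) -> dict:
    """Send the request and decode the JSON response body.

    The 60-second HTTP timeout is an assumption, not a documented value.
    """
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)


# Same body as the curl example.
payload = {
    "session_id": "1234567890abcdef",
    "url": "https://example.com",
    "only_main_content": True,
    "scrape_images": False,
    "screenshot": True,
    "session_timeout_minutes": 10,
}
request = build_scrape_request("your-api-key", payload)
# result = send_scrape_request(request)  # uncomment to perform the network call
```

Building the request separately from sending it makes it easy to inspect the headers and body before any network traffic occurs.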

Response Example:

{
  "metadata": {
    "url": "https://example.com",
    "page_title": "Example Website",
    "timestamp": "2025-02-06T14:00:00.000Z"
  },
  "session": {
    "session_id": "1234567890abcdef",
    "status": "active",
    "last_accessed_at": "2025-02-06T14:05:00.000Z",
    "timeout_minutes": 10
  },
  "data": {
    "main_content": "<div><h1>Example Page</h1><p>This is the main content of the page.</p></div>"
  },
  "screenshot": "base64_encoded_image_data_here",
  "space": {
    "description": "Available actions on the current page",
    "actions": [
      {
        "id": "I1",
        "description": "Click on the login button",
        "category": "User Interaction"
      }
    ]
  }
}
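
A response shaped like the example above can be used to continue working in the same session. The sketch below assumes the response has already been parsed into a Python dict. The helper names `follow_up_payload` and `list_action_ids` are hypothetical, and the sample dict is a trimmed copy of the example response.

```python
def follow_up_payload(response: dict) -> dict:
    """Build a request body that scrapes the current page of the same session.

    The "url" key is deliberately left out: per the endpoint description,
    a request with a session ID and no URL scrapes the session's current page.
    """
    return {"session_id": response["session"]["session_id"]}


def list_action_ids(response: dict) -> list:
    """Return the IDs of the actions listed under "space", if any.

    "space" may be null, so a missing or null value yields an empty list.
    """
    space = response.get("space") or {}
    return [action["id"] for action in space.get("actions", [])]


# Trimmed copy of the response example above.
sample_response = {
    "session": {"session_id": "1234567890abcdef", "status": "active"},
    "space": {
        "description": "Available actions on the current page",
        "actions": [
            {
                "id": "I1",
                "description": "Click on the login button",
                "category": "User Interaction",
            }
        ],
    },
}

next_payload = follow_up_payload(sample_response)
action_ids = list_action_ids(sample_response)
```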

Fields in the Request Body:

session_id (string | null): The ID of the session. If not provided, a new session will be created.
url (string | null): The URL to scrape. If not provided, the current page of the session is scraped.
keep_alive (boolean, default: false): If true, the session will not be closed after the operation is completed.
max_nb_actions (integer, default: 100): The maximum number of actions to list. The listing will stop after this number is reached.
min_nb_actions (integer | null): The minimum number of actions to list before stopping. If not provided, the listing will continue until the maximum number of actions is reached.
only_main_content (boolean, default: true): If true, only the main content of the page will be scraped, excluding elements like navbars, footers, etc.
scrape_images (boolean, default: false): If true, images will be scraped from the page.
screenshot (boolean | null): Whether to include a screenshot in the response.
session_timeout_minutes (integer, default: 5): Session timeout in minutes. Cannot exceed the global timeout. Required range: 0 < x < 30.
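
The defaults and constraints listed above can be checked on the client before a request is sent. The following is a sketch under stated assumptions: `prepare_scrape_payload` is a hypothetical helper, rejecting unknown field names is a client-side choice, and dropping fields whose value is `None` assumes the API treats an omitted field the same as null. The global timeout is not checked because its value is not given here.

```python
# Field names and defaults as listed in the request body description.
KNOWN_FIELDS = {
    "session_id", "url", "keep_alive", "max_nb_actions", "min_nb_actions",
    "only_main_content", "scrape_images", "screenshot", "session_timeout_minutes",
}
DEFAULTS = {
    "keep_alive": False,
    "max_nb_actions": 100,
    "only_main_content": True,
    "scrape_images": False,
    "session_timeout_minutes": 5,
}


def prepare_scrape_payload(**fields) -> dict:
    """Apply documented defaults and validate the documented timeout range."""
    unknown = set(fields) - KNOWN_FIELDS
    if unknown:
        # Client-side choice (assumption): fail early on misspelled field names.
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    payload = {**DEFAULTS, **fields}

    timeout = payload["session_timeout_minutes"]
    if not 0 < timeout < 30:
        # Documented range: 0 < x < 30.
        raise ValueError("session_timeout_minutes must satisfy 0 < x < 30")

    # Assumption: omitting a field is equivalent to sending null.
    return {key: value for key, value in payload.items() if value is not None}


payload = prepare_scrape_payload(url="https://example.com", screenshot=True)
```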

Fields in the Response:

metadata (object): Metadata of the current page (e.g., URL, page title, and timestamp).
session (object): Browser session information, including the session ID, status, last accessed time, and timeout.
data (object | null): Extracted data from the page, such as the main content scraped or other information.
screenshot (file | null): A base64-encoded screenshot of the current page, if requested.
space (object | null): Available actions on the current page, such as clickable elements or form submissions.
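
Because data, screenshot, and space may each be null, client code should check for their presence before using them. The sketch below shows one way to do that. The helper names are hypothetical, the `main_content` key is taken from the response example, and the image format of the decoded screenshot is not stated in this reference, so the bytes are returned as-is.

```python
import base64
from typing import Optional


def decode_screenshot(response: dict) -> Optional[bytes]:
    """Return the decoded screenshot bytes, or None if no screenshot was returned."""
    encoded = response.get("screenshot")
    if encoded is None:
        return None
    # validate=True raises an error on characters outside the base64 alphabet.
    return base64.b64decode(encoded, validate=True)


def get_main_content(response: dict) -> Optional[str]:
    """Return data.main_content, or None if "data" is null or lacks the key."""
    data = response.get("data") or {}
    return data.get("main_content")


# Stand-in response for illustration; the bytes are not a real image.
sample_response = {
    "data": {"main_content": "<div><h1>Example Page</h1></div>"},
    "screenshot": base64.b64encode(b"fake-image-bytes").decode("ascii"),
    "space": None,
}

image_bytes = decode_screenshot(sample_response)
html = get_main_content(sample_response)
```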

