Web Scraping with Selenium IDE
How to extract data from a website
This page explains how to do web scraping with Selenium IDE commands. Web scraping works if the data is inside the HTML of a website. If you want to extract data from a PDF, image or video you need to use visual screen scraping instead.
How to generate a good XPath for web scraping
The easiest method is to first record a CLICK on the element that you want to extract. This works even if the element is not a link, but e. g.
a table entry. This generates a CLICK command a few suggested locators (XPath). Press the "Down" arrow in the IDE to see the full list of suggested, possible locators.
Once you have decided on a good locator, then simply change the CLICK command to e. g. STORETEXT or STOREATTRIBUTE:
When to use what command?
The table belows shows the best command for each type of data extraction. Click the recommended command for more information and example code.
| Data to extract is in... | Command to use | Comment |
|---|---|---|
| Visible website text, for example text in a table just like this one, or a price on website | storeText | |
| Text in input fields (input box, text area, select drop down,...) | storeValue | Do not confuse this command with storeEval, which is not for web scraping. |
| Get the status of a checkbox or radiobutton | storeChecked | |
| URL "behind" an image | storeAttribute@href | storeAttribute | xpath=...@href extracts the link of any element - if it has one! If that fails, consider browser automation to copy the link to the ${!clipboard} variable. |
| ALT text "behind" an image | storeAttribute@alt | The storeAttribute command can be used to get any attribute the HTML element has. For example, use @alt to get the "Alt" text of an image. |
| Page title | storeTitle | |
| Table content: Row/Column/Cell | storeText with XPath locator | See TABLE Web Scraping or automate browser addon |
| Data from a list e. g. search results | Loop over storeText | See How to web scrape search results |
| Save complete web page source code | XType | ${KEY_CTRL+KEY_S}* | On Mac it is ${KEY_CMD+KEY_S}. |
| Save complete web page with images | XType | ${KEY_CTLR+KEY_S}* | See Forum post: How to save the entire HTML code |
| Take screenshot of website | captureEntirePageScreenshot* | This saves the complete website as image. |
| Take screenshot of a web page element | storeImage* | The element can be an image or any other web page HTML tag |
| Download an image from a website | saveItem* | Retrieve the image directly from the browser cache. |
| Text found only website source code | sourceExtract* | e. g. Google Analytics ID. For text inside page comments or Javascript, this is the only option |
| Extract complete website HTML with | executeScript to get entire HTML code | Useful if you want to extract the complete HTML source code of the website, e. g. as input for aiPrompt |
| JSON data displayed in web browser | storeText | css=Pre | json | A web api (e. g. OCR Api) displays JSON in the browser, not HTML. storeText with locator "css=Pre" can be used to extract it (Example: JSON scraping RPA) |
| PDF, Image, Video, Canvas | OCRExtractRelative and OCRExtractByTextRelative | This screen scraping command works everywhere because it works visually. The disadvantage is that it is slower than the pure HTML-based commands like storeText. |
| Text from outside the web page | OCRExtractRelative and OCRExtractByTextRelative | If you run these commands in desktop mode you can read data from any desktop app. |
(*) These commands are only available in the Ui.Vision RPA Selenium IDE. They are not part of the classic Selenium IDE.
See also
- - Screen scraping (scraping/data extraction with computer vision, OCR)
- - Text parsing with AI and LLM
- - Form filling with Selenium IDE (the opposite of web scraping)
- - File uploads with Selenium IDE
- - Best Selenium IDE Locator Strategy
- - RPA Software User Manual.
Anything wrong or missing on this page? Suggestions?
...then please contact us.