Python Parsel Cheat Sheet
This cheat sheet is a quick reference for using the parsel library in Python for web scraping. parsel builds on top of lxml, providing a more Pythonic and user-friendly interface for working with HTML and XML.
1. Installation:
pip install parsel
2. Basic Usage:
- Creating a Selector: The core of parsel is the Selector object. You create it by passing HTML or XML content as text.
from parsel import Selector
html_content = """
<html>
<body>
<table>
<tr><td>Data 1</td><td>Data 2</td></tr>
<tr><td>Data 3</td><td>Data 4</td></tr>
</table>
</body>
</html>
"""
selector = Selector(text=html_content)
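- CSS Selectors: parsel supports CSS selectors for querying elements, extended with the non-standard ::text and ::attr() pseudo-elements for extracting text and attribute values.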
# Get the first table element
first_table = selector.css('table').get() # Returns the first matching element as a string
# Get all table elements
all_tables = selector.css('table').getall() # Returns a list of matching elements as strings
# Extract text from all <td> elements inside any table
data_cells = selector.css('table td::text').getall() # Returns a list of text content
# Extract attributes
href_values = selector.css('a::attr(href)').getall() # Extracts href attributes from all <a> tags
# Extract the href attribute of the first <a> element in the document
first_link_href = selector.css('a::attr(href)').get() # .get() already returns only the first match
3. XPath Selectors:
While CSS selectors are often sufficient, parsel also supports XPath for more complex scenarios.
# Get all table rows using XPath
rows_xpath = selector.xpath('//table/tr').getall()
# Extract text from the first cell of each row using XPath
first_cell_data = selector.xpath('//table/tr/td[1]/text()').getall()
4. Handling Multiple Matches:
The .get() method returns the first match, while .getall() returns a list of all matches. If no matches are found, .get() returns None, and .getall() returns an empty list.
5. Error Handling:
It's good practice to handle potential errors, such as when no element is found.
first_paragraph = selector.css('p:first-of-type::text').get()
if first_paragraph:
    print(f"First paragraph: {first_paragraph}")
else:
    print("No paragraph found.")
6. Working with Attributes:
Use the ::attr() pseudo-element to extract attribute values, for example a::attr(href).
7. Advanced Techniques:
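- Combining Selectors: Combine selectors for more precise targeting. For example, selector.css('div.container p::text').getall() extracts text from all <p> tags within a <div> with the class "container". Calls can also be chained, as in selector.css('div.container').css('p::text').getall().
- Pattern Matching and Regular Expressions: CSS attribute selectors can match on part of an attribute value. For example, selector.css('[id^="item-"]::text').getall() extracts text from elements whose id attribute starts with "item-". This is a prefix match, not a regular expression; to apply an actual regular expression to the selected strings, use the .re() and .re_first() methods.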
This cheat sheet provides a foundation for using parsel. Refer to the official documentation for more advanced features and detailed explanations. Remember to always respect the robots.txt of the websites you are scraping.