Python Parsel Cheat Sheet

This cheat sheet provides a quick reference for using the parsel library in Python for web scraping. parsel builds on top of lxml, providing a more Pythonic and user-friendly interface for working with HTML and XML.

1. Installation:

bash
pip install parsel

2. Basic Usage:

  • Creating a Selector: The core of parsel is the Selector object. You create it by passing HTML or XML content.
python
from parsel import Selector

html_content = """
<html>
<body>
<table>
  <tr><td>Data 1</td><td>Data 2</td></tr>
  <tr><td>Data 3</td><td>Data 4</td></tr>
</table>
</body>
</html>
"""

selector = Selector(text=html_content)
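  • CSS Selectors: The .css() method queries elements with CSS selectors. The ::text and ::attr(name) pseudo-elements are parsel extensions that select text nodes and attribute values instead of whole elements.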
python
# Get the first table element
first_table = selector.css('table').get()  # Returns the first matching element as a string

# Get all table elements
all_tables = selector.css('table').getall() # Returns a list of matching elements as strings

# Extract text from every <td> element inside any table
data_cells = selector.css('table td::text').getall() # Returns a list of text content

# Extract attributes (the sample HTML above has no <a> tags, so this returns [])
href_values = selector.css('a::attr(href)').getall() # Extracts href attributes from all <a> tags

# Extract a specific attribute from a specific element (returns None for the sample HTML)
# :first-of-type matches each <a> that is the first <a> among its siblings;
# .get() then returns the first of those matches in document order
first_link_href = selector.css('a:first-of-type::attr(href)').get()

3. XPath Selectors:

While CSS selectors are often sufficient, parsel also supports XPath for more complex scenarios.

python
# Get all table rows using XPath
rows_xpath = selector.xpath('//table/tr').getall()

# Extract text from the first cell of each row using XPath
first_cell_data = selector.xpath('//table/tr/td[1]/text()').getall()

4. Handling Multiple Matches:

The .get() method returns the first match, while .getall() returns a list of all matches. If no matches are found, .get() returns None, and .getall() returns an empty list.

5. Error Handling:

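A query that matches nothing does not raise an exception; .get() simply returns None. Check for None explicitly before using the result, since a plain truthiness test would also reject an empty string.

python
first_paragraph = selector.css('p:first-of-type::text').get()
if first_paragraph is not None:
    print(f"First paragraph: {first_paragraph}")
else:
    print("No paragraph found.")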

6. Working with Attributes:

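Use the ::attr(name) pseudo-element, a parsel extension to standard CSS, to extract attribute values. For example, selector.css('a::attr(href)').getall() returns the href of every <a> tag.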

7. Advanced Techniques:

  • Combining Selectors: Chain selectors together for more precise targeting. For example, selector.css('div.container p::text').getall() extracts text from all <p> tags within a <div> with the class "container".
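  • Pattern Matching: CSS attribute selectors match on attribute values without regular expressions. For example, selector.css('[id^="item-"]::text').getall() extracts text from elements whose id attribute starts with "item-". For true regular expressions, call .re() or .re_first() on a query result, or use the EXSLT re:test() function inside an XPath expression.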

This cheat sheet provides a foundation for using parsel. Refer to the official documentation for more advanced features and detailed explanations. Remember to always respect the robots.txt of the websites you are scraping.

Released under the MIT License.