rikako-note/python/beatifulsoup.md
2024-05-04 22:26:57 +08:00
- [Beautiful Soup](#beautiful-soup)
- [Installation](#installation)
- [Constructing a BeautifulSoup Object](#constructing-a-beautifulsoup-object)
- [Tag](#tag)
- [name](#name)
- [attributes](#attributes)
- [Multi-valued Attributes](#multi-valued-attributes)
- [NavigableString](#navigablestring)
- [Navigating the Tree](#navigating-the-tree)
- [.contents](#contents)
- [.children](#children)
- [.descendants](#descendants)
- [.string](#string)
- [.parent](#parent)
- [.parents](#parents)
- [.next\_sibling and .previous\_sibling](#next_sibling-and-previous_sibling)
- [.next\_siblings and .previous\_siblings](#next_siblings-and-previous_siblings)
- [Searching the Tree](#searching-the-tree)
- [Filter Types](#filter-types)
- [Filtering by name](#filtering-by-name)
- [Regular Expressions](#regular-expressions)
- [Lists](#lists)
- [True](#true)
- [Custom Functions](#custom-functions)
- [Searching by Attribute](#searching-by-attribute)
- [Searching by class](#searching-by-class)
- [Searching by string](#searching-by-string)
- [limit](#limit)
- [recursive](#recursive)
- [find](#find)
- [Other find Variants](#other-find-variants)
- [CSS Selectors](#css-selectors)
- [Output](#output)
# Beautiful Soup
## Installation
```bash
pip install beautifulsoup4
```
## Constructing a BeautifulSoup Object
```python
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("index.html"),'html.parser')
soup = BeautifulSoup("<html>data</html>",'html.parser')
```
After the constructor is called, BeautifulSoup parses the given HTML text into a tree of Python objects. These objects come in the following types:
- Tag
- NavigableString
- BeautifulSoup
- Comment
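All four types can be observed in a single parse; a minimal sketch (the markup below is made up for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# A made-up snippet containing a tag, an HTML comment, and a plain string
soup = BeautifulSoup('<b><!--a comment-->bold text</b>', 'html.parser')
print(type(soup))                # <class 'bs4.BeautifulSoup'>
print(type(soup.b))              # <class 'bs4.element.Tag'>
print(type(soup.b.contents[0]))  # <class 'bs4.element.Comment'>
print(type(soup.b.contents[1]))  # <class 'bs4.element.NavigableString'>
```

Note that Comment is a subclass of NavigableString, so comment text also behaves like a string.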
## Tag
A Tag object corresponds to a tag in the HTML document:
```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>','html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
```
A Tag object has two important properties: name and attributes.
### name
Every Tag object has a name, accessible via `.name`:
```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
print(soup.b.name) # b
```
A tag's name can be modified; changing it updates the BeautifulSoup object as well:
```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
soup.b.name = 'h1'
print(soup)
# <h1 class="boldest">Extremely bold</h1>
```
### attributes
A tag can have any number of attributes, which are accessed like dictionary entries:
```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
print(soup.b['class']) # ['boldest']
print(soup.b.attrs['class']) # ['boldest']
```
Tag attributes can also be modified at runtime with the same dictionary-style operations: attributes can be added, changed, or deleted, and the changes are reflected in the BeautifulSoup object:
```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
print(soup) # <b class="boldest">Extremely bold</b>
soup.b['class'] = 'nvidia'
print(soup) # <b class="nvidia">Extremely bold</b>
soup.b['id'] = 'amd'
print(soup) # <b class="nvidia" id="amd">Extremely bold</b>
del soup.b['class']
print(soup) # <b id="amd">Extremely bold</b>
```
### Multi-valued Attributes
HTML defines some attributes that can hold multiple values; the most common are `rel`, `rev`, `accept-charset`, `headers`, `class`, and `accesskey`.
BeautifulSoup returns a multi-valued attribute as a list:
```py
soup = BeautifulSoup('<b class="nvidia amd">Extremely bold</b>', 'html.parser')
print(soup.b['class']) # ['nvidia', 'amd']
```
If an attribute looks like it has multiple values but is not defined as multi-valued in the HTML specification, BeautifulSoup returns its value as a plain string:
```py
soup = BeautifulSoup('<b id="nvidia amd">Extremely bold</b>', 'html.parser')
print(soup.b['id']) # nvidia amd
```
## NavigableString
A NavigableString is usually nested inside a Tag and can be accessed via `tag.string`:
```py
soup = BeautifulSoup('<b id="nvidia amd">Extremely bold</b>', 'html.parser')
print(soup.b.string) # Extremely bold
print(type(soup.b.string)) # <class 'bs4.element.NavigableString'>
```
To use a NavigableString outside of BeautifulSoup, first convert it to an ordinary string. The original note used Python 2's `unicode()`; in Python 3 the equivalent is `str()`:
```py
soup = BeautifulSoup('<b id="nvidia amd">Extremely bold</b>', 'html.parser')
plain_string = str(soup.b.string)
print(type(plain_string))  # <class 'str'>
```
## Navigating the Tree
Tag objects in a BeautifulSoup tree usually have children. `tag.{child-tag-type}` returns the tag's first child of that type, for example:
```py
soup = BeautifulSoup('<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p></body>', 'html.parser')
print(soup.p) # <p>i hate nvidia</p>
```
To get all tags of a given type, call `find_all`. It searches the whole subtree, so it returns indirect descendants as well as direct children:
```py
soup = BeautifulSoup(
'<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>', 'html.parser')
print(soup.find_all("p"))
# [<p>i hate nvidia</p>, <p>nvidia is a piece of shit</p>, <p>fuck Jensen Huang</p>]
```
### .contents
The `contents` attribute returns the direct children as a list:
```py
soup = BeautifulSoup(
'<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>', 'html.parser')
print(soup.body.contents)
# [<p>i hate nvidia</p>, <p>nvidia is a piece of shit</p>, <block><p>fuck Jensen Huang</p></block>]
```
### .children
The `children` attribute iterates over the direct children:
```py
soup = BeautifulSoup(
'<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>', 'html.parser')
for e in soup.body.children:
    print(e)
# <p>i hate nvidia</p>
# <p>nvidia is a piece of shit</p>
# <block><p>fuck Jensen Huang</p></block>
```
### .descendants
The `descendants` attribute iterates over all direct and indirect children:
```py
soup = BeautifulSoup(
'<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>', 'html.parser')
for t in soup.descendants:
    print(t)
```
The output is:
```bash
<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>
<p>i hate nvidia</p>
i hate nvidia
<p>nvidia is a piece of shit</p>
nvidia is a piece of shit
<block><p>fuck Jensen Huang</p></block>
<p>fuck Jensen Huang</p>
fuck Jensen Huang
```
### .string
If a tag has exactly one NavigableString child, it can be accessed directly via the `string` attribute:
```py
soup = BeautifulSoup(
'<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>', 'html.parser')
print(soup.body.p.string) # i hate nvidia
```
### .parent
The `parent` attribute returns a node's direct parent:
```py
soup = BeautifulSoup(
'<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>', 'html.parser')
grand_son_node = soup.body.contents[2].p
print(grand_son_node.parent)
```
The output is:
```bash
<block><p>fuck Jensen Huang</p></block>
```
### .parents
The `parents` attribute iterates over all of a node's ancestors:
```py
soup = BeautifulSoup(
'<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
'html.parser')
grand_son_node = soup.body.contents[2].p
i = 0
for p in grand_son_node.parents:
    for j in range(0, i):
        print("\t", end='')
    i += 1
    print(f"type: {type(p)}, content: {p}")
```
The output is:
```bash
type: <class 'bs4.element.Tag'>, content: <block><p>fuck Jensen Huang</p></block>
type: <class 'bs4.element.Tag'>, content: <body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>
type: <class 'bs4.BeautifulSoup'>, content: <body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>
```
### .next_sibling and .previous_sibling
Sibling nodes can be queried via `.next_sibling` and `.previous_sibling`:
```py
soup = BeautifulSoup(
'<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
'html.parser')
mid_node = soup.body.contents[1]
print(mid_node.previous_sibling) # <p>i hate nvidia</p>
print(mid_node.next_sibling) # <block><p>fuck Jensen Huang</p></block>
```
### .next_siblings and .previous_siblings
The `.next_siblings` and `.previous_siblings` attributes iterate over all following and preceding siblings:
```py
soup = BeautifulSoup(
'<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
'html.parser')
mid_node = soup.body.contents[1]
print([e for e in mid_node.previous_siblings])
print([e for e in mid_node.next_siblings])
```
The output is:
```bash
[<p>i hate nvidia</p>]
[<block><p>fuck Jensen Huang</p></block>]
```
## Searching the Tree
### Filter Types
When searching the document tree, a filter can be one of the following:
- name
- attributes
- string
### Filtering by name
```py
soup = BeautifulSoup(
'<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
'html.parser')
print(soup.find_all(name='p'))
```
The output is:
```bash
[<p>i hate nvidia</p>, <p>nvidia is a piece of shit</p>, <p>fuck Jensen Huang</p>]
```
#### Regular Expressions
A name query can also take a regular expression:
```py
import re

soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
print(soup.find_all(name=re.compile("^b")))
```
The pattern above matches the `body` and `block` tags; the result is:
```bash
[<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>, <block><p>fuck Jensen Huang</p></block>]
```
#### Lists
A name query can also take a list of tag types:
```py
soup = BeautifulSoup(
'<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
'html.parser')
print(soup.find_all(name=['p', 'block']))
```
This finds all `p` and `block` tags; the output is:
```bash
[<p>i hate nvidia</p>, <p>nvidia is a piece of shit</p>, <block><p>fuck Jensen Huang</p></block>, <p>fuck Jensen Huang</p>]
```
#### True
To match every tag, pass True as the name:
```py
soup = BeautifulSoup(
'<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
'html.parser')
for e in soup.find_all(name=True):
    print(e)
```
The output is:
```bash
<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>
<p>i hate nvidia</p>
<p>nvidia is a piece of shit</p>
<block><p>fuck Jensen Huang</p></block>
<p>fuck Jensen Huang</p>
```
#### Custom Functions
Beyond the filters above, you can pass a custom function that decides whether a tag matches:
```py
def is_match(tag):
    return tag.name == 'p' and 'id' in tag.attrs and tag.attrs['id'] == '100'

soup = BeautifulSoup(
    '<body><p id = "200">i hate nvidia</p><p id = "100">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
for e in soup.find_all(name=is_match):
    print(e)
```
The output is:
```bash
<p id="100">nvidia is a piece of shit</p>
```
### Searching by Attribute
If `find_all` is given a keyword argument that is not one of its built-in parameters, the argument name is treated as an attribute name to search by:
```py
soup = BeautifulSoup(
'<body><p id = "200">i hate nvidia</p><p id = "100">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
'html.parser')
for e in soup.find_all(id="200"):
    print(e)
```
The call above finds tags whose id attribute equals 200; the output is:
```bash
<p id="200">i hate nvidia</p>
```
Attribute searches likewise support regular expressions:
```py
import re

soup = BeautifulSoup(
    '<body><p id = "200">i hate nvidia</p><p id = "100">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
for e in soup.find_all(id=re.compile("^[0-9]{1,}00$")):
    print(e)
```
This finds tags whose id attribute matches the pattern; the output is:
```bash
<p id="200">i hate nvidia</p>
<p id="100">nvidia is a piece of shit</p>
```
You can also search by attributes by passing a dict to the attrs parameter:
```py
soup = BeautifulSoup(
'<body><p id = "200">i hate nvidia</p><p id = "100" page="1">nvidia is a piece of shit</p><block id = "100"><p>fuck Jensen Huang</p></block></body>',
'html.parser')
for e in soup.find_all(attrs={
        'id': "100",
        "page": True
}):
    print(e)
```
This time the output is:
```bash
<p id="100" page="1">nvidia is a piece of shit</p>
```
### Searching by class
Because class is a reserved word in Python, searching by CSS class uses the `class_` keyword argument:
```py
soup = BeautifulSoup(
'<body><p id = "200">i hate nvidia</p><p class = "main show">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
'html.parser')
for e in soup.find_all(class_="main"):
    print(e)
```
The output is:
```bash
<p class="main show">nvidia is a piece of shit</p>
```
### Searching by string
With the string parameter, the search runs against the document's string content; every result is a NavigableString:
```py
import re

soup = BeautifulSoup(
    '<body><p id = "200">i hate nvidia</p><p class = "main show">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
for e in soup.find_all(string=re.compile("nvidia")):
    print(f"type: {type(e)}, content: {e}")
```
The output is:
```bash
type: <class 'bs4.element.NavigableString'>, content: i hate nvidia
type: <class 'bs4.element.NavigableString'>, content: nvidia is a piece of shit
```
### limit
The limit parameter caps the number of matches; on large documents this can reduce search time:
```py
import re

soup = BeautifulSoup(
    '<body><p id = "200">i hate nvidia</p><p class = "main show">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
for e in soup.find_all(string=re.compile("nvidia"), limit=1):
    print(f"type: {type(e)}, content: {e}")
```
Only one result is returned:
```bash
type: <class 'bs4.element.NavigableString'>, content: i hate nvidia
```
### recursive
With `recursive=False`, `find_all` searches only the tag's direct children; by default it searches all direct and indirect descendants.
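A minimal sketch of the difference (the markup below is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body><p>direct</p><div><p>nested</p></div></body>', 'html.parser')
# Default: every <p> in the subtree is found
all_p = soup.body.find_all('p')
# recursive=False: only <p> tags that are direct children of <body>
direct_p = soup.body.find_all('p', recursive=False)
print(all_p)     # [<p>direct</p>, <p>nested</p>]
print(direct_p)  # [<p>direct</p>]
```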
### find
`find_all` returns every matching object; if you only want a single result, call `find` instead.
The `find` method is equivalent to `find_all(..., limit=1)`, except that it returns the matching object itself (or None when nothing matches) rather than a list.
> Both find and find_all search only the current node's descendants; the current node itself is not included.
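A short sketch of the two calls side by side (markup made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p>first</p><p>second</p></body>', 'html.parser')
print(soup.find('p'))               # <p>first</p>
print(soup.find_all('p', limit=1))  # [<p>first</p>]
# find returns None on no match, while find_all returns an empty list
print(soup.find('h1'))              # None
```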
### Other find Variants
Besides find and find_all, the following variants are available:
```py
find_parents( name , attrs , recursive , string , **kwargs )
find_parent( name , attrs , recursive , string , **kwargs )
find_next_siblings( name , attrs , recursive , string , **kwargs )
find_next_sibling( name , attrs , recursive , string , **kwargs )
find_previous_siblings( name , attrs , recursive , string , **kwargs )
find_previous_sibling( name , attrs , recursive , string , **kwargs )
find_all_next( name , attrs , recursive , string , **kwargs )
find_next( name , attrs , recursive , string , **kwargs )
find_all_previous( name , attrs , recursive , string , **kwargs )
find_previous( name , attrs , recursive , string , **kwargs )
```
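A few of these variants in action (the markup below is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<body><p id="a">one</p><p id="b">two</p><p id="c">three</p></body>',
    'html.parser')
first = soup.find('p', id='a')
print(first.find_next_sibling('p'))    # <p id="b">two</p>
print(first.find_next_siblings('p'))   # [<p id="b">two</p>, <p id="c">three</p>]
print(first.find_parent('body').name)  # body
```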
### CSS Selectors
The `select` method finds tags using CSS selector syntax. The examples below use the "three sisters" sample document from the official Beautiful Soup documentation:
```py
soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
```
Matching tags at any depth:
```py
soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]
```
Matching direct children:
```py
soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []
```
Matching by CSS class:
```py
soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
Matching sibling tags:
```py
soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
Matching tags by id:
```py
soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
Combining multiple selectors:
```py
soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
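Since the snippets above assume the "three sisters" soup was already built, here is a self-contained sketch with made-up markup:

```python
from bs4 import BeautifulSoup

# A minimal document to run the selectors against
soup = BeautifulSoup(
    '<div id="main"><p class="intro">hello</p><p class="intro big">world</p></div>',
    'html.parser')
print(soup.select("div > p"))          # both <p> tags
print(soup.select(".big"))             # [<p class="intro big">world</p>]
print(soup.select("#main .intro")[0])  # <p class="intro">hello</p>
```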
## Output
The `prettify` method formats the document with indentation:
```py
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n <a href="http://example.com/">\n...'
print(soup.prettify())
# <html>
# <head>
# </head>
# <body>
# <a href="http://example.com/">
# I linked to
# <i>
# example.com
# </i>
# </a>
# </body>
# </html>
```