- [Beautiful Soup](#beautiful-soup)
  - [Installation](#installation)
  - [Constructing a BeautifulSoup object](#constructing-a-beautifulsoup-object)
  - [Tag](#tag)
    - [name](#name)
    - [attributes](#attributes)
    - [Multi-valued attributes](#multi-valued-attributes)
  - [NavigableString](#navigablestring)
  - [Navigating the tree](#navigating-the-tree)
    - [.contents](#contents)
    - [.children](#children)
    - [.descendants](#descendants)
    - [.string](#string)
    - [.parent](#parent)
    - [.parents](#parents)
    - [.next\_sibling and .previous\_sibling](#next_sibling-and-previous_sibling)
    - [.next\_siblings and .previous\_siblings](#next_siblings-and-previous_siblings)
  - [Searching the tree](#searching-the-tree)
    - [Filter types](#filter-types)
    - [Filtering by name](#filtering-by-name)
      - [Regular expressions](#regular-expressions)
      - [Lists](#lists)
      - [True](#true)
      - [Custom functions](#custom-functions)
    - [Searching by attribute](#searching-by-attribute)
    - [Searching by class](#searching-by-class)
    - [Searching by string](#searching-by-string)
    - [limit](#limit)
    - [recursive](#recursive)
    - [find](#find)
    - [Other find method variants](#other-find-method-variants)
    - [CSS selectors](#css-selectors)
  - [Output](#output)

# Beautiful Soup

## Installation

```bash
pip install beautifulsoup4
```

## Constructing a BeautifulSoup object

```python
from bs4 import BeautifulSoup

# parse from an open file handle
soup = BeautifulSoup(open("index.html"), 'html.parser')

# parse from a string
soup = BeautifulSoup("<html>data</html>", 'html.parser')
```

Once the constructor has been called, Beautiful Soup parses the given HTML text into a tree of Python objects. These objects come in the following types (see the sketch after the list):

- Tag
- NavigableString
- BeautifulSoup
- Comment

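As a quick illustration (a minimal sketch added here, using a tiny made-up document), the snippet below shows which type each parsed piece belongs to:

```py
from bs4 import BeautifulSoup

# made-up markup chosen so that all four object types appear
soup = BeautifulSoup('<p>hello<!-- a comment --></p>', 'html.parser')

print(type(soup))                # <class 'bs4.BeautifulSoup'>
print(type(soup.p))              # <class 'bs4.element.Tag'>
print(type(soup.p.contents[0]))  # <class 'bs4.element.NavigableString'>
print(type(soup.p.contents[1]))  # <class 'bs4.element.Comment'>
```
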
## Tag

A Tag object corresponds to a tag in the HTML document:

```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
```

A Tag object has two important attributes: name and attributes.

### name

Every Tag object has a name, which can be read through `.name`:

```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
print(soup.b.name) # b
```

A tag's name can be modified, and changing it updates the BeautifulSoup object as well:

```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
soup.b.name = 'h1'
print(soup)
# <h1 class="boldest">Extremely bold</h1>
```

### attributes

A tag can have any number of attributes, which are accessed just like a dictionary:

```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
print(soup.b['class']) # ['boldest']
print(soup.b.attrs['class']) # ['boldest']
```

Tag attributes can also be added, modified, or deleted at runtime using the same dictionary operations, and these changes are reflected in the BeautifulSoup object:

```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
print(soup) # <b class="boldest">Extremely bold</b>
soup.b['class'] = 'nvidia'
print(soup) # <b class="nvidia">Extremely bold</b>
soup.b['id'] = 'amd'
print(soup) # <b class="nvidia" id="amd">Extremely bold</b>
del soup.b['class']
print(soup) # <b id="amd">Extremely bold</b>
```

### Multi-valued attributes

In HTML, some attributes can hold more than one value. The most common multi-valued attributes are `rel`, `rev`, `accept-charset`, `headers`, `class`, and `accesskey`.

Beautiful Soup returns a multi-valued attribute as a list:

```py
soup = BeautifulSoup('<b class="nvidia amd">Extremely bold</b>', 'html.parser')
print(soup.b['class']) # ['nvidia', 'amd']
```

If an attribute looks like it has multiple values but is not defined as multi-valued by the HTML specification, Beautiful Soup returns its value as a plain string:

```py
soup = BeautifulSoup('<b id="nvidia amd">Extremely bold</b>', 'html.parser')
print(soup.b['id']) # nvidia amd
```

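Going the other way (a small sketch added here, not from the original notes), assigning a list to a multi-valued attribute is serialized back as a space-separated string:

```py
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="nvidia amd">Extremely bold</b>', 'html.parser')

# assigning a list: the values are joined with spaces when the tag is written out
soup.b['class'] = ['red', 'bold']
print(soup)  # <b class="red bold">Extremely bold</b>
```
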
## NavigableString

A NavigableString holds the text nested inside a Tag and can be accessed through `tag.string`:

```py
soup = BeautifulSoup('<b id="nvidia amd">Extremely bold</b>', 'html.parser')
print(soup.b.string) # Extremely bold
print(type(soup.b.string)) # <class 'bs4.element.NavigableString'>
```

To convert a NavigableString into a plain Unicode string, call `str()` (on Python 2, `unicode()`):

```py
unicode_string = str(tag.string)
unicode_string
# 'Extremely bold'
type(unicode_string)
# <class 'str'>
```

If you want to use the string outside of Beautiful Soup, convert it to a plain string first.

## Navigating the tree

In a BeautifulSoup object, Tag objects usually have children. Accessing `soup.{tag-name}` returns the first tag of that name, for example:

```py
soup = BeautifulSoup('<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p></body>', 'html.parser')
print(soup.p) # <p>i hate nvidia</p>
```

If you want every tag of a given name, call `find_all`. It searches the whole subtree, so it returns indirect as well as direct children:

```py
soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>', 'html.parser')
print(soup.find_all("p"))
# [<p>i hate nvidia</p>, <p>nvidia is a piece of shit</p>, <p>fuck Jensen Huang</p>]
```

### .contents

The `contents` attribute returns the direct children as a list:

```py
soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>', 'html.parser')
print(soup.body.contents)
# [<p>i hate nvidia</p>, <p>nvidia is a piece of shit</p>, <block><p>fuck Jensen Huang</p></block>]
```

### .children

The `children` attribute provides an iterator over the direct children:

```py
soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>', 'html.parser')
for e in soup.body.children:
    print(e)
# <p>i hate nvidia</p>
# <p>nvidia is a piece of shit</p>
# <block><p>fuck Jensen Huang</p></block>
```

### .descendants

The `descendants` attribute iterates over all children, direct and indirect:

```py
soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>', 'html.parser')
for t in soup.descendants:
    print(t)
```

The output is:

```bash
<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>
<p>i hate nvidia</p>
i hate nvidia
<p>nvidia is a piece of shit</p>
nvidia is a piece of shit
<block><p>fuck Jensen Huang</p></block>
<p>fuck Jensen Huang</p>
fuck Jensen Huang
```

### .string

If a tag has exactly one NavigableString child, it can be accessed directly through the `string` attribute:

```py
soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>', 'html.parser')
print(soup.body.p.string) # i hate nvidia
```

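A small follow-up sketch (added here, on a made-up document): when a tag has more than one child, `.string` cannot decide which text to return and evaluates to `None`:

```py
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p>one</p><p>two</p></body>', 'html.parser')

# <body> has two children, so .string cannot pick one and returns None
print(soup.body.string)   # None

# a <p> with a single text child still works
print(soup.body.p.string) # one
```
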
### .parent

The `parent` attribute gives a node's direct parent:

```py
soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>', 'html.parser')
grand_son_node = soup.body.contents[2].p
print(grand_son_node.parent)
```

Output:

```bash
<block><p>fuck Jensen Huang</p></block>
```

### .parents

The `parents` attribute iterates over all of a node's ancestors:

```py
soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
grand_son_node = soup.body.contents[2].p
i = 0
for p in grand_son_node.parents:
    for j in range(0, i):
        print("\t", end='')
    i += 1
    print(f"type: {type(p)}, content: {p}")
```

Output (each ancestor is indented one level further):

```bash
type: <class 'bs4.element.Tag'>, content: <block><p>fuck Jensen Huang</p></block>
	type: <class 'bs4.element.Tag'>, content: <body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>
		type: <class 'bs4.BeautifulSoup'>, content: <body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>
```

### .next_sibling and .previous_sibling

Sibling nodes can be reached through `.next_sibling` and `.previous_sibling`:

```py
soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
mid_node = soup.body.contents[1]
print(mid_node.previous_sibling) # <p>i hate nvidia</p>
print(mid_node.next_sibling) # <block><p>fuck Jensen Huang</p></block>
```

### .next_siblings and .previous_siblings

`.next_siblings` and `.previous_siblings` iterate over all following and all preceding siblings:

```py
soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
mid_node = soup.body.contents[1]
print([e for e in mid_node.previous_siblings])
print([e for e in mid_node.next_siblings])
```

Output:

```bash
[<p>i hate nvidia</p>]
[<block><p>fuck Jensen Huang</p></block>]
```

## Searching the tree

### Filter types

When searching the document tree, a filter can be one of the following types:

- a name
- attributes
- a string

### Filtering by name

```py
soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
print(soup.find_all(name='p'))
```

Output:

```bash
[<p>i hate nvidia</p>, <p>nvidia is a piece of shit</p>, <p>fuck Jensen Huang</p>]
```

#### Regular expressions

When filtering by name, a regular expression can be used:

```py
import re

soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
print(soup.find_all(name=re.compile("^b")))
```

The pattern above matches the `body` and `block` tags, so the result is:

```bash
[<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>, <block><p>fuck Jensen Huang</p></block>]
```

#### Lists

When filtering by name, a list of tag names can also be passed:

```py
soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
print(soup.find_all(name=['p', 'block']))
```

This finds both `p` and `block` tags:

```bash
[<p>i hate nvidia</p>, <p>nvidia is a piece of shit</p>, <block><p>fuck Jensen Huang</p></block>, <p>fuck Jensen Huang</p>]
```

#### True

To match every tag, pass `True` as the name:

```py
soup = BeautifulSoup(
    '<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
for e in soup.find_all(name=True):
    print(e)
```

Output:

```bash
<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>
<p>i hate nvidia</p>
<p>nvidia is a piece of shit</p>
<block><p>fuck Jensen Huang</p></block>
<p>fuck Jensen Huang</p>
```

#### Custom functions

Besides the filters above, you can also define your own function to filter Tag objects:

```py
def is_match(tag):
    return tag.name == 'p' and 'id' in tag.attrs and tag.attrs['id'] == '100'


soup = BeautifulSoup(
    '<body><p id = "200">i hate nvidia</p><p id = "100">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
for e in soup.find_all(name=is_match):
    print(e)
```

Output:

```bash
<p id="100">nvidia is a piece of shit</p>
```

### Searching by attribute

If `find_all` is given a keyword argument that is not one of its built-in parameters, the argument name is treated as an attribute name to search on:

```py
soup = BeautifulSoup(
    '<body><p id = "200">i hate nvidia</p><p id = "100">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
for e in soup.find_all(id="200"):
    print(e)
```

This finds every tag whose `id` attribute equals 200:

```bash
<p id="200">i hate nvidia</p>
```

Attribute searches also support regular expressions:

```py
import re

soup = BeautifulSoup(
    '<body><p id = "200">i hate nvidia</p><p id = "100">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
for e in soup.find_all(id=re.compile("^[0-9]{1,}00$")):
    print(e)
```

This finds every tag whose `id` value matches the pattern:

```bash
<p id="200">i hate nvidia</p>
<p id="100">nvidia is a piece of shit</p>
```

You can also search by attribute by passing a dictionary to the `attrs` parameter; a value of `True` matches any tag that has that attribute, whatever its value:

```py
soup = BeautifulSoup(
    '<body><p id = "200">i hate nvidia</p><p id = "100" page="1">nvidia is a piece of shit</p><block id = "100"><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
for e in soup.find_all(attrs={
    'id': "100",
    "page": True
}):
    print(e)
```

Output:

```bash
<p id="100" page="1">nvidia is a piece of shit</p>
```

### Searching by class

Because `class` is a reserved word in Python, the keyword argument `class_` is used to search by CSS class:

```py
soup = BeautifulSoup(
    '<body><p id = "200">i hate nvidia</p><p class = "main show">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
for e in soup.find_all(class_="main"):
    print(e)
```

Output:

```bash
<p class="main show">nvidia is a piece of shit</p>
```

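Since `class` is a multi-valued attribute, a tag matches when any one of its classes matches; a short sketch (added here, on a made-up snippet):

```py
from bs4 import BeautifulSoup

# made-up snippet with two classes on one tag
soup = BeautifulSoup('<p class="main show">hello</p>', 'html.parser')

# matching on either class finds the same tag
print(soup.find_all(class_="main"))  # [<p class="main show">hello</p>]
print(soup.find_all(class_="show"))  # [<p class="main show">hello</p>]
```
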
### Searching by string

With the `string` parameter you can search the document's text content instead of its tags; the results are NavigableString objects:

```py
import re

soup = BeautifulSoup(
    '<body><p id = "200">i hate nvidia</p><p class = "main show">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
for e in soup.find_all(string=re.compile("nvidia")):
    print(f"type: {type(e)}, content: {e}")
```

Output:

```bash
type: <class 'bs4.element.NavigableString'>, content: i hate nvidia
type: <class 'bs4.element.NavigableString'>, content: nvidia is a piece of shit
```

### limit

The `limit` parameter caps the number of results; on a large document this can cut down search time:

```py
import re

soup = BeautifulSoup(
    '<body><p id = "200">i hate nvidia</p><p class = "main show">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>',
    'html.parser')
for e in soup.find_all(string=re.compile("nvidia"), limit=1):
    print(f"type: {type(e)}, content: {e}")
```

Only one result is returned:

```bash
type: <class 'bs4.element.NavigableString'>, content: i hate nvidia
```

### recursive

Setting `recursive=False` restricts `find_all` to the current tag's direct children; when the parameter is not given, all direct and indirect children are searched (see the sketch below).

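A minimal sketch of the difference (added here, using a small made-up document):

```py
from bs4 import BeautifulSoup

# made-up sample: two direct <p> children and one nested inside a <div>
soup = BeautifulSoup(
    '<body><p>one</p><p>two</p><div><p>three</p></div></body>',
    'html.parser')

# default: every <p> descendant of <body>, direct or indirect
print(len(soup.body.find_all('p')))                   # 3

# recursive=False: only the direct children of <body> are considered
print(len(soup.body.find_all('p', recursive=False)))  # 2
```
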
### find

Calling `find_all` returns every matching object; if you only want a single result, call `find`.

`find` is essentially `find_all(..., limit=1)`, except that it returns the element itself rather than a one-element list, and returns `None` when nothing matches.

> Both `find` and `find_all` search only the descendants of the current node; the node itself is not included.

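A brief sketch (added here, on a made-up document):

```py
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p>one</p><p>two</p></body>', 'html.parser')

# find returns the first matching tag directly, not a list
print(soup.find('p'))          # <p>one</p>

# when nothing matches: find gives None, find_all gives an empty list
print(soup.find('table'))      # None
print(soup.find_all('table'))  # []
```
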
### Other find method variants

Besides `find` and `find_all`, the following variants exist:

```py
find_parents(name, attrs, recursive, string, **kwargs)

find_parent(name, attrs, recursive, string, **kwargs)

find_next_siblings(name, attrs, recursive, string, **kwargs)

find_next_sibling(name, attrs, recursive, string, **kwargs)

find_previous_siblings(name, attrs, recursive, string, **kwargs)

find_previous_sibling(name, attrs, recursive, string, **kwargs)

find_all_next(name, attrs, recursive, string, **kwargs)

find_next(name, attrs, recursive, string, **kwargs)

find_all_previous(name, attrs, recursive, string, **kwargs)

find_previous(name, attrs, recursive, string, **kwargs)
```

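A quick illustrative sketch of two of these variants (added here, on a made-up document):

```py
from bs4 import BeautifulSoup

soup = BeautifulSoup('<body><p id="a">one</p><p id="b">two</p></body>', 'html.parser')

first_p = soup.find('p')

# find_next_sibling searches forward among this tag's siblings
print(first_p.find_next_sibling('p'))    # <p id="b">two</p>

# find_parent searches upward through this tag's ancestors
print(first_p.find_parent('body').name)  # body
```
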
### CSS selectors

The `select` method supports looking up tags with CSS selector syntax.

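The `soup` used in the examples below is not constructed in these notes; the outputs match the "Dormouse's story" sample document from the official Beautiful Soup documentation, so here is a sketch of that setup (reconstructed, details such as attribute order may differ slightly):

```py
from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
```
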
```py
soup.select("title")
# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
```

Searching level by level:

```py
soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]
```

Finding direct child tags:

```py
soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []
```

Searching by CSS class name:

```py
soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```

Finding sibling tags:

```py
soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```

Searching by a tag's id:

```py
soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```

Combining multiple CSS selectors:

```py
soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```

## Output

The `prettify` method pretty-prints the document:

```py
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n <a href="http://example.com/">\n...'

print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>
```