- [Beautiful Soup](#beautiful-soup)
  - [Installation](#installation)
  - [Constructing a BeautifulSoup object](#constructing-a-beautifulsoup-object)
  - [Tag](#tag)
    - [name](#name)
    - [attributes](#attributes)
    - [Multi-valued attributes](#multi-valued-attributes)
  - [NavigableString](#navigablestring)
  - [Navigating the tree](#navigating-the-tree)
    - [.contents](#contents)
    - [.children](#children)
    - [.descendants](#descendants)
    - [.string](#string)
    - [.parent](#parent)
    - [.parents](#parents)
    - [.next\_sibling and .previous\_sibling](#next_sibling-and-previous_sibling)
    - [.next\_siblings and .previous\_siblings](#next_siblings-and-previous_siblings)
  - [Searching the tree](#searching-the-tree)
    - [Filter types](#filter-types)
    - [Filtering by name](#filtering-by-name)
      - [Regular expressions](#regular-expressions)
      - [Lists](#lists)
      - [True](#true)
      - [Custom functions](#custom-functions)
    - [Searching by attribute](#searching-by-attribute)
    - [Searching by CSS class](#searching-by-css-class)
    - [Searching by string](#searching-by-string)
    - [limit](#limit)
    - [recursive](#recursive)
    - [find](#find)
    - [Other find variants](#other-find-variants)
    - [CSS selectors](#css-selectors)
  - [Output](#output)

# Beautiful Soup

## Installation

```bash
pip install beautifulsoup4
```

## Constructing a BeautifulSoup object

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"), 'html.parser')
soup = BeautifulSoup("<html>data</html>", 'html.parser')
```

Once the constructor has run, Beautiful Soup has parsed the given HTML into a tree of Python objects. Four kinds of objects can appear in the tree:

- Tag
- NavigableString
- BeautifulSoup
- Comment

## Tag

A Tag object corresponds to a tag of the original HTML document:

```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
type(tag)
# <class 'bs4.element.Tag'>
```

A Tag object carries two important kinds of data: its name and its attributes.

### name

Every Tag has a name, accessible as `.name`:

```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
print(soup.b.name)
# b
```

A tag's name can be reassigned; renaming a tag changes the BeautifulSoup tree accordingly:

```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
soup.b.name = 'h1'
print(soup)
# <h1 class="boldest">Extremely bold</h1>
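# Supplementary note (not in the original text): after the rename the element
# is reachable only under its new name; the old name now finds nothing.
print(soup.h1.name)
# h1
print(soup.b)
# None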
```

### attributes

A tag can have any number of attributes, which are read with dictionary-style access:

```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
print(soup.b['class'])
# ['boldest']
print(soup.b.attrs['class'])
# ['boldest']
```

Attributes can also be modified at run time with the same dictionary operations: attributes can be added, changed, and deleted, and every change is reflected in the BeautifulSoup tree:

```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
print(soup)
# <b class="boldest">Extremely bold</b>
soup.b['class'] = 'nvidia'
print(soup)
# <b class="nvidia">Extremely bold</b>
soup.b['id'] = 'amd'
print(soup)
# <b class="nvidia" id="amd">Extremely bold</b>
del soup.b['class']
print(soup)
# <b id="amd">Extremely bold</b>
```

### Multi-valued attributes

HTML defines some attributes as holding more than one value. The most common multi-valued attribute is `class`; others include `rel`, `rev`, `accept-charset`, `headers`, and `accesskey`. Beautiful Soup returns a multi-valued attribute as a list:

```py
soup = BeautifulSoup('<b class="nvidia amd">Extremely bold</b>', 'html.parser')
print(soup.b['class'])
# ['nvidia', 'amd']
```

If an attribute merely looks like it has several values but is not defined as multi-valued by the HTML standard, Beautiful Soup returns the value as a plain string:

```py
soup = BeautifulSoup('<b id="nvidia amd">Extremely bold</b>', 'html.parser')
print(soup.b['id'])
# nvidia amd
```

## NavigableString

A NavigableString holds the text nested inside a Tag and is reached through `tag.string`:

```py
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
print(soup.b.string)
# Extremely bold
print(type(soup.b.string))
# <class 'bs4.element.NavigableString'>
```

To turn a NavigableString into an ordinary string, call `str()` (in Python 2 this was `unicode()`):

```py
plain_string = str(soup.b.string)
plain_string
# 'Extremely bold'
type(plain_string)
# <class 'str'>
```

If you want to use the text outside of Beautiful Soup, you should convert it to a plain string first.

## Navigating the tree

In a BeautifulSoup tree, a Tag usually has children. `tag.{child-tag-name}` returns the first tag of that name below `tag`:

```py
soup = BeautifulSoup(
    '<body>'
    '<p>i hate nvidia</p>'
    '<p>nvidia is a piece of shit</p>'
    '</body>', 'html.parser')
print(soup.p)
# <p>i hate nvidia</p>
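# Supplementary sketch (not in the original text): the same shortcut can be
# chained through the tree, and it returns None when no matching tag exists.
print(soup.body.p)
# <p>i hate nvidia</p>
print(soup.div)
# None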
```

To collect every tag of a given type, call `find_all`. It searches the whole subtree, so it returns indirect children as well as direct ones:

```py
soup = BeautifulSoup(
    '<body>'
    '<p>i hate nvidia</p>'
    '<p>nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
print(soup.find_all("p"))
# [<p>i hate nvidia</p>, <p>nvidia is a piece of shit</p>, <p>fuck Jensen Huang</p>]
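# Supplementary note (not in the original text): find_all returns an ordinary
# list, so the usual list operations apply -- here, counting the matches above.
print(len(soup.find_all("p")))
# 3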
```

### .contents

The `contents` attribute returns a list of a tag's direct children:

```py
soup = BeautifulSoup(
    '<body>'
    '<p>i hate nvidia</p>'
    '<p>nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
print(soup.body.contents)
# [<p>i hate nvidia</p>, <p>nvidia is a piece of shit</p>, <block><p>fuck Jensen Huang</p></block>]
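# Supplementary note (not in the original text): .contents is a plain list,
# so it supports len() and indexing.
print(len(soup.body.contents))
# 3
print(soup.body.contents[0])
# <p>i hate nvidia</p>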
```

### .children

The `children` attribute is an iterator over the direct children:

```py
soup = BeautifulSoup(
    '<body>'
    '<p>i hate nvidia</p>'
    '<p>nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
for e in soup.body.children:
    print(e)
# <p>i hate nvidia</p>
# <p>nvidia is a piece of shit</p>
# <block><p>fuck Jensen Huang</p></block>
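# Supplementary note (not in the original text): unlike .contents, .children
# is an iterator rather than a list; materialize it with list() if you need
# random access.
print(len(list(soup.body.children)))
# 3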
```

### .descendants

The `descendants` attribute iterates over all direct and indirect children:

```py
soup = BeautifulSoup(
    '<body>'
    '<p>i hate nvidia</p>'
    '<p>nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
for t in soup.descendants:
    print(t)
```

The output is

```bash
<body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>
<p>i hate nvidia</p>
i hate nvidia
<p>nvidia is a piece of shit</p>
nvidia is a piece of shit
<block><p>fuck Jensen Huang</p></block>
<p>fuck Jensen Huang</p>
fuck Jensen Huang
```

### .string

If a tag has exactly one child of type NavigableString, that child is available directly through the `string` attribute:

```py
soup = BeautifulSoup(
    '<body>'
    '<p>i hate nvidia</p>'
    '<p>nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
print(soup.body.p.string)
# i hate nvidia
```

### .parent

The `parent` attribute returns a node's direct parent:

```py
soup = BeautifulSoup(
    '<body>'
    '<p>i hate nvidia</p>'
    '<p>nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
grand_son_node = soup.body.contents[2].p
print(grand_son_node.parent)
```

The output is

```bash
<block><p>fuck Jensen Huang</p></block>
```

### .parents

The `parents` attribute iterates over all of a node's ancestors:

```py
soup = BeautifulSoup(
    '<body>'
    '<p>i hate nvidia</p>'
    '<p>nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
grand_son_node = soup.body.contents[2].p
i = 0
for p in grand_son_node.parents:
    for j in range(0, i):
        print("\t", end='')
    i += 1
    print(f"type: {type(p)}, content: {p}")
```

The output is

```bash
type: <class 'bs4.element.Tag'>, content: <block><p>fuck Jensen Huang</p></block>
	type: <class 'bs4.element.Tag'>, content: <body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>
		type: <class 'bs4.BeautifulSoup'>, content: <body><p>i hate nvidia</p><p>nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>
```

### .next_sibling and .previous_sibling

`.next_sibling` and `.previous_sibling` return the nodes immediately after and before a node under the same parent:

```py
soup = BeautifulSoup(
    '<body>'
    '<p>i hate nvidia</p>'
    '<p>nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
mid_node = soup.body.contents[1]
print(mid_node.previous_sibling)
# <p>i hate nvidia</p>
print(mid_node.next_sibling)
# <block><p>fuck Jensen Huang</p></block>
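# Supplementary note (not in the original text): at the edges of the parent
# there is nothing further, so the sibling attributes return None.
print(soup.body.contents[0].previous_sibling)
# None
print(soup.body.contents[2].next_sibling)
# None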
```

### .next_siblings and .previous_siblings

`.next_siblings` and `.previous_siblings` iterate over all of a node's following and preceding siblings:

```py
soup = BeautifulSoup(
    '<body>'
    '<p>i hate nvidia</p>'
    '<p>nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
mid_node = soup.body.contents[1]
print([e for e in mid_node.previous_siblings])
print([e for e in mid_node.next_siblings])
```

Output:

```bash
[<p>i hate nvidia</p>]
[<block><p>fuck Jensen Huang</p></block>]
```

## Searching the tree

### Filter types

A search over the document tree can filter by the following kinds of values:

- name
- attributes
- string

### Filtering by name

```py
soup = BeautifulSoup(
    '<body>'
    '<p id="200">i hate nvidia</p>'
    '<p class="main" id="100" page="1">nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
print(soup.find_all(name='p'))
```

The output is

```bash
[<p id="200">i hate nvidia</p>, <p class="main" id="100" page="1">nvidia is a piece of shit</p>, <p>fuck Jensen Huang</p>]
```

#### Regular expressions

The name filter also accepts a compiled regular expression:

```py
import re

soup = BeautifulSoup(
    '<body>'
    '<p id="200">i hate nvidia</p>'
    '<p class="main" id="100" page="1">nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
print(soup.find_all(name=re.compile("^b")))
```

The pattern matches the `body` and `block` tags, so the result is

```bash
[<body><p id="200">i hate nvidia</p><p class="main" id="100" page="1">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>, <block><p>fuck Jensen Huang</p></block>]
```

#### Lists

The name filter can also be a list of several tag names:

```py
soup = BeautifulSoup(
    '<body>'
    '<p id="200">i hate nvidia</p>'
    '<p class="main" id="100" page="1">nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
print(soup.find_all(name=['p', 'block']))
```

This finds every `p` and every `block` tag:

```bash
[<p id="200">i hate nvidia</p>, <p class="main" id="100" page="1">nvidia is a piece of shit</p>, <block><p>fuck Jensen Huang</p></block>, <p>fuck Jensen Huang</p>]
```

#### True

To match every tag in the document, pass True as the name:

```py
soup = BeautifulSoup(
    '<body>'
    '<p id="200">i hate nvidia</p>'
    '<p class="main" id="100" page="1">nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
for e in soup.find_all(name=True):
    print(e)
```

The output is

```bash
<body><p id="200">i hate nvidia</p><p class="main" id="100" page="1">nvidia is a piece of shit</p><block><p>fuck Jensen Huang</p></block></body>
<p id="200">i hate nvidia</p>
<p class="main" id="100" page="1">nvidia is a piece of shit</p>
<block><p>fuck Jensen Huang</p></block>
<p>fuck Jensen Huang</p>
```

#### Custom functions

Finally, you can supply your own filter function, which receives a Tag and returns whether it matches:

```py
def is_match(tag):
    return tag.name == 'p' and 'id' in tag.attrs and tag.attrs['id'] == '100'

soup = BeautifulSoup(
    '<body>'
    '<p id="200">i hate nvidia</p>'
    '<p class="main" id="100" page="1">nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
for e in soup.find_all(name=is_match):
    print(e)
```

The output is

```bash
<p class="main" id="100" page="1">nvidia is a piece of shit</p>
```

### Searching by attribute

If `find_all` is given a keyword argument that is not one of its built-in parameters, the argument name is treated as an attribute name to search for:

```py
soup = BeautifulSoup(
    '<body>'
    '<p id="200">i hate nvidia</p>'
    '<p class="main" id="100" page="1">nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
for e in soup.find_all(id="200"):
    print(e)
```

This finds every tag that has an `id` attribute whose value is 200:

```bash
<p id="200">i hate nvidia</p>
```

Attribute searches also accept regular expressions:

```py
import re

soup = BeautifulSoup(
    '<body>'
    '<p id="200">i hate nvidia</p>'
    '<p class="main" id="100" page="1">nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
for e in soup.find_all(id=re.compile("^[0-9]{1,}00$")):
    print(e)
```

This finds every tag whose `id` attribute matches the pattern:

```bash
<p id="200">i hate nvidia</p>
<p class="main" id="100" page="1">nvidia is a piece of shit</p>
```

An attribute search can also be written as a dict passed to the `attrs` parameter (`True` matches any tag that has the attribute at all):

```py
soup = BeautifulSoup(
    '<body>'
    '<p id="200">i hate nvidia</p>'
    '<p class="main" id="100" page="1">nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
for e in soup.find_all(attrs={'id': "100", "page": True}):
    print(e)
```

The output is:

```bash
<p class="main" id="100" page="1">nvidia is a piece of shit</p>
```

### Searching by CSS class

To search by CSS class, use the keyword argument `class_` (with a trailing underscore, because `class` is a reserved word in Python):

```py
soup = BeautifulSoup(
    '<body>'
    '<p id="200">i hate nvidia</p>'
    '<p class="main" id="100" page="1">nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
for e in soup.find_all(class_="main"):
    print(e)
```

The output is

```bash
<p class="main" id="100" page="1">nvidia is a piece of shit</p>
```

### Searching by string

The `string` parameter searches the document's text content instead of its tags; every result is a NavigableString:

```py
import re

soup = BeautifulSoup(
    '<body>'
    '<p id="200">i hate nvidia</p>'
    '<p class="main" id="100" page="1">nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
for e in soup.find_all(string=re.compile("nvidia")):
    print(f"type: {type(e)}, content: {e}")
```

The output is:

```bash
type: <class 'bs4.element.NavigableString'>, content: i hate nvidia
type: <class 'bs4.element.NavigableString'>, content: nvidia is a piece of shit
```

### limit

The `limit` parameter caps the number of results; on a large document this can cut search time considerably:

```py
import re

soup = BeautifulSoup(
    '<body>'
    '<p id="200">i hate nvidia</p>'
    '<p class="main" id="100" page="1">nvidia is a piece of shit</p>'
    '<block><p>fuck Jensen Huang</p></block>'
    '</body>', 'html.parser')
for e in soup.find_all(string=re.compile("nvidia"), limit=1):
    print(f"type: {type(e)}, content: {e}")
```

Now only a single result is printed:

```bash
type: <class 'bs4.element.NavigableString'>, content: i hate nvidia
```

### recursive

If `recursive=False` is passed, `find_all` searches only the direct children of the current tag. When the parameter is not given, the default is to search all direct and indirect children.

### find

`find_all` returns every matching object; when only one match is wanted, call `find` instead. `find` is equivalent to `find_all(..., limit=1)`, except that it returns the element itself (or None when nothing matches) rather than a one-element list.

> Both `find` and `find_all` search only below the current node; the node itself is never part of the results.

### Other find variants

Besides `find` and `find_all`, the following variants exist:

```py
find_parents( name , attrs , recursive , string , **kwargs )
find_parent( name , attrs , recursive , string , **kwargs )
find_next_siblings( name , attrs , recursive , string , **kwargs )
find_next_sibling( name , attrs , recursive , string , **kwargs )
find_previous_siblings( name , attrs , recursive , string , **kwargs )
find_previous_sibling( name , attrs , recursive , string , **kwargs )
find_all_next( name , attrs , recursive , string , **kwargs )
find_next( name , attrs , recursive , string , **kwargs )
find_all_previous( name , attrs , recursive , string , **kwargs )
find_previous( name , attrs , recursive , string , **kwargs )
```

### CSS selectors

The `select` method finds tags using CSS selector syntax. (The examples below use the "three sisters" document from the official documentation.)

```py
soup.select("title")
# [<title>The Dormouse's story</title>]

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
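# Supplementary note (not in the original text): select_one() returns the
# first match itself instead of a list.
soup.select_one("title")
# <title>The Dormouse's story</title>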
```

Find tags beneath other tags:

```py
soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("html head title")
# [<title>The Dormouse's story</title>]
```

Find tags directly beneath other tags:

```py
soup.select("head > title")
# [<title>The Dormouse's story</title>]

soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("body > a")
# []
```

Find tags by CSS class:

```py
soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```

Find sibling tags:

```py
soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```

Find tags by id:

```py
soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```

Combine several selectors at once:

```py
soup.select("#link1,#link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```

## Output

The `prettify` method turns the parse tree into a nicely indented string:

```py
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
soup.prettify()
# '<a href="http://example.com/">\n I linked to\n <i>\n...'

print(soup.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>
```
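Often it is only the text of a document you want rather than prettified markup. As a supplementary example (not part of the original notes), `get_text()` returns all the text beneath a tag as a single string:

```py
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')

# get_text() concatenates every string beneath the tag
print(soup.get_text())
# I linked to example.com

# an optional separator is inserted between the pieces of text
print(soup.get_text("|"))
# I linked to |example.com
```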