From 3f77333e4e4c6a556fd3933ce88015a7966f8196 Mon Sep 17 00:00:00 2001
From: asahi
Date: Sat, 4 May 2024 22:26:57 +0800
Subject: [PATCH] Read the Beautiful Soup documentation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 python/beatifulsoup.md | 541 +++++++++++++++++++++++++++++++++++++++++
 python/py.md           |  71 ++++++
 2 files changed, 612 insertions(+)
 create mode 100644 python/beatifulsoup.md

diff --git a/python/beatifulsoup.md b/python/beatifulsoup.md
new file mode 100644
index 0000000..b3f24a1
--- /dev/null
+++ b/python/beatifulsoup.md
@@ -0,0 +1,541 @@
+- [Beautiful Soup](#beautiful-soup)
+  - [Installation](#installation)
+  - [Constructing a BeautifulSoup object](#constructing-a-beautifulsoup-object)
+  - [Tag](#tag)
+    - [name](#name)
+    - [attributes](#attributes)
+    - [Multi-valued attributes](#multi-valued-attributes)
+  - [NavigableString](#navigablestring)
+  - [Navigating the tree](#navigating-the-tree)
+    - [.contents](#contents)
+    - [.children](#children)
+    - [.descendants](#descendants)
+    - [.string](#string)
+    - [.parent](#parent)
+    - [.parents](#parents)
+    - [.next\_sibling and .previous\_sibling](#next_sibling-and-previous_sibling)
+    - [.next\_siblings and .previous\_siblings](#next_siblings-and-previous_siblings)
+  - [Searching the tree](#searching-the-tree)
+    - [Filter types](#filter-types)
+    - [Filtering by name](#filtering-by-name)
+      - [Regular expressions](#regular-expressions)
+      - [Lists](#lists)
+      - [True](#true)
+      - [Custom functions](#custom-functions)
+    - [Searching by attribute](#searching-by-attribute)
+    - [Searching by class](#searching-by-class)
+    - [Searching by string](#searching-by-string)
+    - [limit](#limit)
+    - [recursive](#recursive)
+    - [find](#find)
+    - [Other find method variants](#other-find-method-variants)
+    - [CSS selectors](#css-selectors)
+  - [Output](#output)
+
+
+# Beautiful Soup
+## Installation
+```bash
+pip install beautifulsoup4
+```
+## Constructing a BeautifulSoup object
+```python
+from bs4 import BeautifulSoup
+
+# Parse an HTML file; the with block closes the file handle afterwards.
+with open("index.html") as fp:
+    soup = BeautifulSoup(fp, 'html.parser')
+
+# Parse an HTML string directly.
+soup = BeautifulSoup("data", 'html.parser')
+```
+After the constructor is called, BeautifulSoup parses the HTML text it was given into a tree of Python objects. These objects come in the following types:
+- Tag
+- NavigableString
+- BeautifulSoup
+- Comment
+
+## Tag
+A Tag object corresponds to a tag in the HTML document.
+```py
+soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
+tag = soup.b
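+type(tag)
+# <class 'bs4.element.Tag'>
+```
+A Tag object has two important properties: name and attributes.
+### name
+Every Tag object has a name, available through `.name`:
+```py
+soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
+print(soup.b.name) # b
+```
+A tag's name can be changed, and the change is reflected in the BeautifulSoup object:
+```py
+soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
+soup.b.name = 'h1'
+print(soup)
+# <h1 class="boldest">Extremely bold</h1>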
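+```
+### attributes
+A tag can have any number of attributes, and they are read the same way as dictionary entries:
+```py
+soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
+print(soup.b['class']) # ['boldest']
+print(soup.b.attrs['class']) # ['boldest']
+```
+Attributes can also be changed at runtime. They support the usual dictionary operations (adding, modifying, deleting), and every change is reflected in the BeautifulSoup object:
+```py
+soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
+print(soup) # <b class="boldest">Extremely bold</b>
+soup.b['class'] = 'nvidia'
+print(soup) # <b class="nvidia">Extremely bold</b>
+soup.b['id'] = 'amd'
+print(soup) # <b class="nvidia" id="amd">Extremely bold</b>
+del soup.b['class']
+print(soup) # <b id="amd">Extremely bold</b>
+```
+### Multi-valued attributes
+In HTML, some attributes can hold several values at once. The most common one is `class`; others include `rel`, `rev`, `accept-charset`, `headers`, and `accesskey`.
+
+Beautiful Soup returns the value of a multi-valued attribute as a list:
+```py
+soup = BeautifulSoup('<b class="nvidia amd">Extremely bold</b>', 'html.parser')
+print(soup.b['class']) # ['nvidia', 'amd']
+```
+If an attribute looks like it has several values but the HTML standard does not define it as multi-valued, Beautiful Soup returns the value as a single string:
+```py
+soup = BeautifulSoup('<b id="nvidia amd">Extremely bold</b>', 'html.parser')
+print(soup.b['id']) # nvidia amd
+```
+
+## NavigableString
+A NavigableString holds a piece of text, usually nested inside a Tag, and can be read through `tag.string`:
+```py
+soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
+print(soup.b.string) # Extremely bold
+print(type(soup.b.string)) # <class 'bs4.element.NavigableString'>
+```
+To convert a NavigableString into a plain Python string, pass it to `str` (in Python 2 this was `unicode`):
+```py
+plain_string = str(tag.string)
+plain_string
+# 'Extremely bold'
+type(plain_string)
+# <class 'str'>
+```
+To use a NavigableString outside of Beautiful Soup, convert it to a plain string first; otherwise it keeps a reference to the entire parse tree.
+
+## Navigating the tree
+A Tag usually has children. `tag.{child-tag-type}` returns the first tag of type `child-tag-type` beneath it, for example:
+```py
+soup = BeautifulSoup('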

i hate nvidia

nvidia is a piece of shit

', 'html.parser') +print(soup.p) #

i hate nvidia

+```
+To get every tag of type `child-tag-type` in the soup, call `find_all`. It searches the whole tree beneath the node, so it returns indirect descendants as well as direct children:
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

', 'html.parser') +print(soup.find_all("p")) + # [

i hate nvidia

,

nvidia is a piece of shit

,

fuck Jensen Huang

]
+```
+### .contents
+The `contents` attribute returns a tag's direct children as a list:
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

', 'html.parser') +print(soup.body.contents) +# [

i hate nvidia

,

nvidia is a piece of shit

,

fuck Jensen Huang

]
+```
+### .children
+The `children` attribute iterates over a tag's direct children:
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

', 'html.parser') +for e in soup.body.children: + print(e) +#

i hate nvidia

+#

nvidia is a piece of shit

+#

fuck Jensen Huang

+```
+### .descendants
+The `descendants` attribute iterates over every descendant of a tag, direct or indirect:
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

', 'html.parser')
+for t in soup.descendants:
+    print(t)
+```
+This produces:
+```bash
+

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

+

i hate nvidia

+i hate nvidia +

nvidia is a piece of shit

+nvidia is a piece of shit +

fuck Jensen Huang

+

fuck Jensen Huang

+fuck Jensen Huang
+```
+### .string
+If a tag has exactly one child and that child is a NavigableString, the text is available directly through the `string` attribute:
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

', 'html.parser')
+print(soup.body.p.string) # i hate nvidia
+```
+### .parent
+The `parent` attribute returns a node's direct parent:
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

', 'html.parser')
+grand_son_node = soup.body.contents[2].p
+print(grand_son_node.parent)
+```
+Output:
+```bash
+

fuck Jensen Huang

+```
+### .parents
+The `parents` attribute iterates over all of a node's ancestors, from its direct parent up to the BeautifulSoup object:
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

',
+    'html.parser')
+grand_son_node = soup.body.contents[2].p
+# Indent each ancestor one tab deeper than the previous one.
+for depth, p in enumerate(grand_son_node.parents):
+    print("\t" * depth + f"type: {type(p)}, content: {p}")
+```
+Output:
+```bash
+type: <class 'bs4.element.Tag'>, content:

fuck Jensen Huang

+    type: <class 'bs4.element.Tag'>, content:

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

+        type: <class 'bs4.BeautifulSoup'>, content:

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

+```
+
+### .next_sibling and .previous_sibling
+`.next_sibling` and `.previous_sibling` return the nodes immediately after and before a node on the same level of the tree:
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

', + 'html.parser') +mid_node = soup.body.contents[1] +print(mid_node.previous_sibling) #

i hate nvidia

+print(mid_node.next_sibling) #

fuck Jensen Huang

+```
+
+### .next_siblings and .previous_siblings
+`.next_siblings` and `.previous_siblings` iterate over all the siblings that come after or before a node:
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

',
+    'html.parser')
+mid_node = soup.body.contents[1]
+print(list(mid_node.previous_siblings))
+print(list(mid_node.next_siblings))
+```
+Output:
+```bash
+[

i hate nvidia

] +[

fuck Jensen Huang

]
+```
+## Searching the tree
+### Filter types
+A search of the tree can filter on any of the following:
+- name
+- attributes
+- string
+
+### Filtering by name
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

',
+    'html.parser')
+print(soup.find_all(name='p'))
+```
+Output:
+```bash
+[

i hate nvidia

,

nvidia is a piece of shit

,

fuck Jensen Huang

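]
+```
+#### Regular expressions
+When searching by name, a compiled regular expression can be passed; it is matched against each tag name:
+```py
+import re
+
+soup = BeautifulSoup(
+    '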

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

',
+    'html.parser')
+print(soup.find_all(name=re.compile("^b")))
+```
+The pattern above matches the `body` and `block` tags, so the result is:
+```bash
+[

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

,

fuck Jensen Huang

]
+```
+#### Lists
+When searching by name, a list of tag names can be passed:
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

',
+    'html.parser')
+print(soup.find_all(name=['p', 'block']))
+```
+This matches both `p` and `block` tags. Output:
+```bash
+[

i hate nvidia

,

nvidia is a piece of shit

,

fuck Jensen Huang

,

fuck Jensen Huang

]
+```
+
+#### True
+To match every tag, pass `True` as the name:
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

',
+    'html.parser')
+for e in soup.find_all(name=True):
+    print(e)
+```
+Output:
+```bash
+

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

+

i hate nvidia

+

nvidia is a piece of shit

+

fuck Jensen Huang

+

fuck Jensen Huang

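+```
+#### Custom functions
+Beyond the filters above, a custom function can be used. It receives each Tag and returns `True` for the tags to keep:
+```py
+def is_match(tag):
+    # tag.get returns None when the attribute is missing.
+    return tag.name == 'p' and tag.get('id') == '100'
+
+
+soup = BeautifulSoup(
+    '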

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

',
+    'html.parser')
+for e in soup.find_all(name=is_match):
+    print(e)
+```
+
+Output:
+```bash
+

nvidia is a piece of shit

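+```
+### Searching by attribute
+If `find_all` receives a keyword argument that is not one of its built-in parameters, the argument name is treated as an attribute name and the search filters on that attribute:
+```py
+soup = BeautifulSoup(
+    '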

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

',
+    'html.parser')
+for e in soup.find_all(id="200"):
+    print(e)
+```
+The call above returns every tag that has an `id` attribute equal to `200`.
+
+Attribute searches accept regular expressions as well:
+```py
+import re
+
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

',
+    'html.parser')
+for e in soup.find_all(id=re.compile("^[0-9]{1,}00$")):
+    print(e)
+```
+The call above finds the tags that have an `id` attribute whose value matches the pattern. Output:
+```bash
+

i hate nvidia

+

nvidia is a piece of shit

+```
+
+Attributes can also be searched by passing a dictionary to the `attrs` parameter:
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

',
+    'html.parser')
+for e in soup.find_all(attrs={
+    'id': "100",
+    "page": True  # True matches any tag that has a page attribute, whatever its value
+}):
+    print(e)
+```
+Output:
+```bash
+

nvidia is a piece of shit

+```
+### Searching by class
+`class` is a reserved word in Python, so searches by CSS class use the `class_` keyword argument:
+```py
+soup = BeautifulSoup(
+    '

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

',
+    'html.parser')
+for e in soup.find_all(class_="main"):
+    print(e)
+```
+Output:
+```bash
+

nvidia is a piece of shit

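+```
+### Searching by string
+The `string` argument searches the text content of the document instead of its tags. The results are NavigableString objects, not Tags:
+```py
+import re
+
+soup = BeautifulSoup(
+    '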

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

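',
+    'html.parser')
+for e in soup.find_all(string=re.compile("nvidia")):
+    print(f"type: {type(e)}, content: {e}")
+```
+Output:
+```bash
+type: <class 'bs4.element.NavigableString'>, content: i hate nvidia
+type: <class 'bs4.element.NavigableString'>, content: nvidia is a piece of shit
+```
+### limit
+The `limit` parameter caps the number of results. `find_all` stops searching once it has found that many, which saves time on large documents:
+```py
+import re
+
+soup = BeautifulSoup(
+    '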

i hate nvidia

nvidia is a piece of shit

fuck Jensen Huang

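',
+    'html.parser')
+for e in soup.find_all(string=re.compile("nvidia"), limit=1):
+    print(f"type: {type(e)}, content: {e}")
+```
+Only one result is printed this time:
+```bash
+type: <class 'bs4.element.NavigableString'>, content: i hate nvidia
+```
+### recursive
+With `recursive=False`, `find_all` only examines the direct children of the current tag. By default it searches all descendants, direct and indirect.
+
+### find
+`find_all` returns every match. When only one result is wanted, call `find` instead.
+
+`find` behaves like `find_all(..., limit=1)`, except that it returns the match itself (or `None` when nothing matches) rather than a list.
+
+> `find` and `find_all` only search beneath the current node; the node itself is never included.
+
+### Other find method variants
+Besides `find` and `find_all`, the following variants exist. They search in other directions (ancestors, siblings, or document order), so none of them takes a `recursive` parameter:
+```py
+find_parents(name, attrs, string, limit, **kwargs)
+
+find_parent(name, attrs, string, **kwargs)
+
+find_next_siblings(name, attrs, string, limit, **kwargs)
+
+find_next_sibling(name, attrs, string, **kwargs)
+
+find_previous_siblings(name, attrs, string, limit, **kwargs)
+
+find_previous_sibling(name, attrs, string, **kwargs)
+
+find_all_next(name, attrs, string, limit, **kwargs)
+
+find_next(name, attrs, string, **kwargs)
+
+find_all_previous(name, attrs, string, limit, **kwargs)
+
+find_previous(name, attrs, string, **kwargs)
+```
+### CSS selectors
+The `select` method finds tags using CSS selector syntax. The examples in this section assume `soup` was built from the "three sisters" document used in the official Beautiful Soup documentation.
+```py
+soup.select("title")
+# [<title>The Dormouse's story</title>]
+
+soup.select("p:nth-of-type(3)")
+# [<p class="story">...</p>]
+```
+Find tags beneath other tags:
+```py
+soup.select("body a")
+# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+soup.select("html head title")
+# [<title>The Dormouse's story</title>]
+```
+Find tags directly beneath other tags:
+```py
+soup.select("head > title")
+# [<title>The Dormouse's story</title>]
+
+soup.select("p > a")
+# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+soup.select("p > a:nth-of-type(2)")
+# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+
+soup.select("p > #link1")
+# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
+
+soup.select("body > a")
+# []
+```
+Find tags by CSS class:
+```py
+soup.select(".sister")
+# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+soup.select("[class~=sister]")
+# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+```
+Find the siblings of tags:
+```py
+soup.select("#link1 ~ .sister")
+# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
+#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
+
+soup.select("#link1 + .sister")
+# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+```
+Find tags by ID:
+```py
+soup.select("#link1")
+# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
+
+soup.select("a#link2")
+# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+```
+Find tags that match any selector in a list:
+```py
+soup.select("#link1,#link2")
+# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
+#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
+```
+## Output
+The `prettify` method returns the tree as an indented string, with each tag and each string on its own line:
+```py
+markup = '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
+soup = BeautifulSoup(markup, 'html.parser')
+soup.prettify()
+# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
+
+print(soup.prettify())
+# <html>
+#  <head>
+#  </head>
+#  <body>
+#   <a href="http://example.com/">
+#    I linked to
+#    <i>
+#     example.com
+#    </i>
+#   </a>
+#  </body>
+# </html>
+```
+
+
+
+
+
+
+
diff --git a/python/py.md b/python/py.md
index 4d59d0e..017411c 100644
--- a/python/py.md
+++ b/python/py.md
@@ -1,3 +1,74 @@
+- [Python](#python)
+  - [Variables](#variables)
+  - [Strings](#字符串)
+  - [Capitalizing the first letter of a string](#字符串首字母大写)
+  - [Converting a string to uppercase](#字符串全部字符大写)
+  - [Converting a string to lowercase](#字符串全部字符小写)
+  - [Stripping whitespace from a string](#字符串删除空白符)
+  - [Accessing characters in a string](#访问字符串中字符)
+  - [String slicing](#字符串切片)
+  - [Iterating over a string](#字符串迭代)
+  - [Numbers](#数字)
+  - [/](#)
+  - [//](#-1)
+  - [Converting numbers to strings](#数字类型向字符串类型转换)
+  - [Lists](#列表)
+  - [Accessing list elements](#访问列表中元素)
+  - [Operating on list elements](#列表元素操作)
+  - [Modifying](#修改)
+  - [Appending to the end](#插入到末尾)
+  - [Inserting before a position](#在某位置之前插入)
+  - [Deleting list elements](#删除列表中的元素)
+  - [Lists and the stack API](#列表与栈api)
+  - [remove](#remove)
+  - [Sorting a list](#列表排序)
+  - [sort](#sort)
+  - [sorted](#sorted)
+  - [Reversing a list](#列表中顺序反转)
+  - [Getting the length of a list](#获取列表长度)
+  - [Iterating over a list](#遍历列表)
+  - [range](#range)
+  - [max, min, sum](#max-min-sum)
+  - [Building one list from another](#根据一个列表生成另一个列表)
+  - [Slicing](#切片)
+  - [Copying a list](#列表复制)
+  - [Tuples](#元组)
+  - [if](#if)
+  - [and/or](#andor)
+  - [Checking whether a list contains a value](#列表中是否包含某值)
+  - [Checking whether a list does not contain a value](#列表中是否不包含某值)
+  - [Multi-branch if/elif/else](#多分支ifelifelse)
+  - [Dictionaries](#字典)
+  - [Adding key-value pairs to a dictionary](#向字典中添加键值对)
+  - [Deleting key-value pairs from a dictionary](#删除字典中的键值对)
+  - [Iterating over a dictionary](#字典遍历)
+  - [Iterating over dictionary keys in order](#按顺序遍历字典的key)
+  - [while](#while)
+  - [Functions](#函数)
+  - [Default parameter values](#参数默认值)
+  - [Accepting multiple arguments](#接收多个参数)
+  - [File operations](#文件操作)
+  - [Reading a file](#文件读取)
+  - [Writing a file](#写入文件)
+  - [Appending to a file](#文件末尾追加)
+  - [Exception handling](#异常处理)
+  - [Raising exceptions](#抛异常)
+  - [Re-raising an exception after catching it](#捕获异常后重新抛出异常)
+  - [Raising an exception explicitly](#主动抛出异常)
+  - [Data storage](#数据存储)
+  - [json.dump](#jsondump)
+  - [json.load](#jsonload)
+  - [http api](#http-api)
+  - [Multithreading](#多线程)
+  - [Interacting with Linux commands](#linux命令交互)
+  - [Regular expressions](#正则)
+  - [match](#match)
+  - [search](#search)
+  - [pattern](#pattern)
+  - [findall](#findall)
+  - [finditer](#finditer)
+
+
 # Python
 ## Variables
 In Python, a variable is created as follows: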