
Python Web Scraping Study Notes: Using the BeautifulSoup4 Library in Detail

Date: 2022-08-06 16:43:24

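Contents

Usage example
Common objects: Tag
Common objects: NavigableString
Common objects: BeautifulSoup
Common objects: Comment
Navigating the document tree
When a tag contains multiple strings
.stripped_strings: removing whitespace
Searching the document tree: find and find_all
The select method (various lookups)
Getting content
Summary

Usage example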

from bs4 import BeautifulSoup

# html is assumed to hold the page source as a string
# Create the Beautiful Soup object, using lxml for parsing
soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())
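The snippet above depends on an `html` variable that the article does not show. A minimal self-contained sketch follows; the sample markup is made up for illustration, and it uses Python's built-in `html.parser` instead of lxml so it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the page source (`html`).
html = (
    "<html><head><title>Demo</title></head>"
    "<body><p class='intro'>Hello</p></body></html>"
)

# html.parser ships with Python; swap in 'lxml' if it is installed.
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the parsed tree as an indented string.
pretty = soup.prettify()
print(pretty)
```

Printing the prettified tree is a quick way to confirm that the parser saw the structure you expected before you start querying it.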

Common objects: Tag

A Tag is simply one of the individual tags in the HTML.

Building on the example above, add:

from bs4 import BeautifulSoup

# Create the Beautiful Soup object, using lxml for parsing
soup = BeautifulSoup(html, 'lxml')
# print(soup.prettify())
print(soup.title)    # None, because this page has no title tag
print(soup.head)     # None, because this page has no head tag
print(soup.a)        # returns <a target='_blank'>編輯自我介紹，讓更多人了解你<span class='write-icon'></span></a>
print(type(soup.p))  # returns <class 'bs4.element.Tag'>
print(soup.p)

For print(soup.p), the result is:

[Image: output of print(soup.p)]

Likewise, building on the above, add:

print(soup.name)  # [document]; the soup object itself is special, and its name is [document]


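# For other inner tags, the value is the tag's own name.
# This needs a page that has a head tag; if soup.head is None, it raises AttributeError.
print(soup.head.name)  # head

print(soup.p.attrs)  # prints all attributes of the p tag; the result is a dict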


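print(soup.p['class'])  # get the class attribute of the p tag (returned as a list, since class can hold several values)

soup.p['class'] = 'newClass'
print(soup.p)  # attributes, content, and so on can be modified this way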
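To see name, attrs, and attribute assignment together without relying on the article's page, here is a small sketch on made-up markup. The HTML string and variable names are illustrative assumptions; `html.parser` is used in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class and id values are illustrative only.
html = '<p class="title main" id="first"><b>Hello</b></p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p
print(soup.name)   # the special name of the BeautifulSoup object itself
print(tag.name)    # the tag's own name
print(tag.attrs)   # all attributes as a dict

# class is a multi-valued attribute, so its value is a list of strings.
original_classes = tag["class"]
print(original_classes)

# Assigning to an attribute modifies the tree in place.
tag["class"] = "newClass"
print(tag)
```

Note that the assignment replaces the whole list with a single string; assign a list if the tag should keep several classes.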


Common objects: NavigableString

Building on the previous code, add:

print(soup.p.string)        # The Dormouse's story
print(type(soup.p.string))  # <class 'bs4.element.NavigableString'>

Result:

[Image: output of the two print calls]
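The behaviour of .string depends on how many children a tag has, which is easy to trip over. The sketch below uses invented markup (not the article's page) and `html.parser` to show both cases.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Hypothetical markup: one tag with a single nested string, one with mixed children.
soup = BeautifulSoup(
    "<p><b>The Dormouse's story</b></p><div>one<i>two</i></div>",
    "html.parser",
)

# p has exactly one child (b), which has exactly one string,
# so .string resolves through to that string.
text = soup.p.string
print(type(text).__name__)
print(isinstance(text, str))  # NavigableString is a str subclass

# div has more than one child, so .string is ambiguous and yields None.
print(soup.div.string)
```

When .string comes back as None on a tag that clearly has text, get_text() or .strings is usually what you want instead.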

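Common objects: BeautifulSoup

The BeautifulSoup object represents the whole document. In most cases you can treat it as a Tag object: it supports most of the methods for navigating and searching the document tree. Because the BeautifulSoup object is not a real HTML or XML tag, it has no name and no attributes of its own. It is sometimes convenient to look at its .name, though, so the BeautifulSoup object carries a special .name whose value is "[document]".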

print(soup.name)  # returns '[document]'

Common objects: Comment

A Comment object represents the content of a comment in the document.

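# The b tag must contain an HTML comment for .string to be a Comment.
markup = '<b><!--Hey, buddy. Want to buy a used parser?--></b>'
# A separate name keeps the page's soup object available for later examples.
comment_soup = BeautifulSoup(markup, 'lxml')
comment = comment_soup.b.string
print(type(comment))  # <class 'bs4.element.Comment'>

Navigating the document tree

Building on the earlier page example, add: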

head_tag = soup.div
# .contents returns a list of all child nodes
print(head_tag.contents)


Similarly:

head_tag = soup.div
# .children returns an iterator over all child nodes
for child in head_tag.children:
    print(child)
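The difference between .contents and .children is easier to see on a tiny tree. This is a sketch on made-up markup, parsed with `html.parser`; the variable names are illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: two child tags followed by a bare text node.
soup = BeautifulSoup("<div><p>one</p><p>two</p>tail</div>", "html.parser")
div = soup.div

# .contents is a real list, so len() and indexing work.
children_list = div.contents
print(len(children_list), children_list[0])

# .children walks the same nodes lazily, one at a time.
for child in div.children:
    print(repr(child))

# Children include text nodes; filter on Tag to keep only elements.
names = [child.name for child in div.children if isinstance(child, Tag)]
print(names)
```

Both views cover direct children only; nested descendants are not included.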


When a tag contains multiple strings

Use .strings to iterate over them:

for string in soup.strings:
    print(repr(string))


.stripped_strings: removing whitespace

for string in soup.stripped_strings:
    print(repr(string))
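A side-by-side comparison makes the effect of .stripped_strings concrete. The markup below is invented for illustration and parsed with `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with indentation and padding around the text.
html = "<div>\n  <p> first </p>\n  <p>second</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node, including whitespace-only ones.
raw = list(soup.strings)
print(raw)

# .stripped_strings skips whitespace-only nodes and strips the rest.
clean = list(soup.stripped_strings)
print(clean)
```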


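Searching the document tree: find and find_all

Find all matches:

print(soup.find_all('a', id='link2'))

The find method returns as soon as it finds the first tag that satisfies the conditions, and it returns a single element (or None if nothing matches). The find_all method selects every tag that satisfies the conditions and returns them all in a list.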
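The contrast between the two methods, including what each returns when nothing matches, can be sketched as follows. The markup, ids, and URLs are illustrative assumptions, and `html.parser` is used in place of lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two links.
html = """
<p class="story">
  <a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>
  <a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                   # a single Tag: the first match
all_links = soup.find_all("a")           # every match, in document order
by_id = soup.find_all("a", id="link2")   # keyword arguments filter on attributes

missing = soup.find("table")             # no match: None
missing_all = soup.find_all("table")     # no match: empty result

print(first["id"], len(all_links), [a.get_text() for a in by_id])
print(missing, len(missing_all))
```

Because find can return None, check the result before accessing attributes on it; find_all is always safe to loop over.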

The select method (various lookups)

# Look up by tag name:
print(soup.select('a'))
# Look up by class name: put a '.' in front of the class
print(soup.select('.sister'))
# Look up by id: put a '#' in front of the id
print(soup.select('#link1'))

Result of the a tag lookup:

[Image: output of soup.select('a')]

The other lookups return an empty list, because the page itself has no matching elements.

Combined lookup

print(soup.select('p #link1'))  # find elements with id link1 inside p tags

Child tag lookup

print(soup.select('head > title'))
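Since the article's page returns empty lists for most of these selectors, a sketch on markup that does contain matches may be clearer. The HTML here is made up for illustration and parsed with `html.parser`; select relies on the soupsieve package that recent Beautiful Soup releases install as a dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup containing a title, a class, and ids to match against.
html = (
    "<html><head><title>Demo</title></head><body>"
    '<p class="story">'
    '<a class="sister" id="link1" href="http://example.com/elsie">Elsie</a>'
    '<a class="sister" id="link2" href="http://example.com/lacie">Lacie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("a")             # tag name
by_class = soup.select(".sister")     # class name, prefixed with '.'
by_id = soup.select("#link1")         # id, prefixed with '#'
nested = soup.select("p #link1")      # descendant: #link1 anywhere inside a p
child = soup.select("head > title")   # direct child only
not_child = soup.select("body > a")   # a is a grandchild of body, so no match

print(len(by_tag), len(by_class), len(by_id), len(nested), len(child), len(not_child))
```

The space in 'p #link1' means "descendant of", while '>' restricts the match to direct children, which is why the last lookup comes back empty.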

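Lookup by attribute

# The attribute test goes inside the square brackets (href is an illustrative choice here).
# The attribute and the tag belong to the same node, so there must be no space between them.
print(soup.select('a[href]'))

Getting content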

First check the type:

print(type(soup.select('div')))


for title in soup.select('div'):
    print(title.get_text())


print(soup.select('div')[20].get_text())  # content of the div at index 20 (indexing starts at 0, so this is the 21st div)
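The same pattern on a small, made-up document shows what get_text() returns and how indexing into the select result works. The markup is an illustrative assumption, parsed with `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two div tags, one containing a nested tag.
soup = BeautifulSoup(
    "<div>first <b>bold</b></div><div>second</div>",
    "html.parser",
)

divs = soup.select("div")

# get_text() joins all text inside a tag, including nested tags.
texts = [div.get_text() for div in divs]
print(texts)

# The result is indexed from 0, so [1] is the second div.
second = divs[1].get_text()
print(second)
```

Indexing past the end of the result raises IndexError, so check len(divs) first when the number of matches is not known in advance.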


Summary

That is all for this article. I hope it helps.
