
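Downloading All of a CSDN Blogger's Posts with Python and Converting Them to PDF

Views: 4 | Date: 2022-06-16 18:21:40

When learning to program, we often come across material worth saving. One option is to scrape the content and convert it to PDF, which makes it easy to read on a phone in spare moments.

Let's start with how a single blog post is downloaded and converted to PDF.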


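The usual tool for converting HTML to PDF from Python is wkhtmltopdf, and pdfkit is a Python wrapper around it. Setting up pdfkit takes the following steps.

Download the wkhtmltopdf installer and install it on your computer. Download page: https://wkhtmltopdf.org/downloads.html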


During installation, note the install path. It is needed later when wkhtmltopdf is called from Python.
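If you are not sure where the installer placed the executable, a small helper can look for it. This is only a sketch using the standard library: the function name find_wkhtmltopdf is made up for illustration, and the Windows path in the example is an assumed default install location, not something you should rely on.

```python
import os
import shutil


def find_wkhtmltopdf(candidates=()):
    """Return the first usable wkhtmltopdf path, or None if none is found."""
    # Explicit candidate paths (for example the folder chosen in the installer) win.
    for path in candidates:
        if os.path.isfile(path):
            return path
    # Otherwise fall back to whatever is on the system PATH.
    return shutil.which("wkhtmltopdf")


if __name__ == "__main__":
    # Assumed default install location on Windows; adjust to your own path.
    guess = r"C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe"
    print(find_wkhtmltopdf([guess]) or "wkhtmltopdf not found")
```

The returned path is the value to pass as the wkhtmltopdf argument of pdfkit.configuration.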


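Development tools: Python, PyCharm, pdfkit (pip install pdfkit), lxml. The source code below also imports parsel and requests_html.

Today's goal: download all of a blogger's posts and save them as PDF files.

Basic approach:

1. url + headers
2. Analyze the page: CSDN pages are static, so requesting the URL returns the page source
3. Parse with lxml to get boke_urls and author_name
4. Loop over boke_urls to get each boke_url
5. Use XPath to get the file name
6. Use a CSS selector to get the article body
7. Build the HTML file
8. Save the HTML file
9. Convert the file

Analyzing the page: CSDN pages are static, so a plain request returns the page source. Taking start_url = "https://i1bit.blog.csdn.net/" as an example, we confirm that the page content is loaded synchronously.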
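One way to check that a page is loaded synchronously is to look for the post links directly in the raw HTML, before any JavaScript runs. The sketch below does this with the standard-library HTML parser. The class name ArticleLinkParser and the function links_in_raw_html are hypothetical, the blog-list-box class is taken from the XPath used in the source code, and the parser assumes that article elements are not nested.

```python
from html.parser import HTMLParser


class ArticleLinkParser(HTMLParser):
    """Collect href values of <a> tags found inside <article class="blog-list-box">."""

    def __init__(self):
        super().__init__()
        self.inside = False  # True while inside a matching <article>
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "article" and "blog-list-box" in classes:
            self.inside = True
        elif tag == "a" and self.inside and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        # Assumes <article> elements are not nested.
        if tag == "article":
            self.inside = False


def links_in_raw_html(html):
    """Return post links present in the raw page source (empty if none)."""
    parser = ArticleLinkParser()
    parser.feed(html)
    return parser.links
```

If the list is non-empty for the raw response text, the links are already in the server's response and a plain request is enough. An empty list suggests the content is filled in later by scripts, or that the page layout has changed.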


Extracting the article body with a CSS selector is the key part of the code. The CSS part looks like this:

# Use a CSS selector to get the article body
html_css = parsel.Selector(response_2)
html_content = html_css.css('article').get()
# Build the HTML file
html = '''<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<body>
{}
</body>
</html>
'''.format(html_content)
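The snippet above leaves the page title as the literal text "Title" and would write the string "None" into the body if the selector matched nothing. A possible variation, shown here as a sketch rather than the author's method, wraps the step in a function. The names PAGE_TEMPLATE and build_page are introduced only for this example.

```python
import html

PAGE_TEMPLATE = """<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>{title}</title>
</head>
<body>
{body}
</body>
</html>
"""


def build_page(title, article_html):
    """Wrap an extracted <article> fragment in a minimal HTML document."""
    if article_html is None:
        # Selector.css(...).get() returns None when nothing matches.
        raise ValueError("no <article> element found in the page")
    # Escape the title only; the body is already HTML and is inserted as-is.
    return PAGE_TEMPLATE.format(title=html.escape(title), body=article_html)
```

It would be called as build_page(file_name, html_content), using the two values the scraper already extracts.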

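Open one of the blogger's posts and open the browser's developer tools to inspect where the post body sits in the page.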


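File conversion

import pdfkit

config = pdfkit.configuration(wkhtmltopdf=r'path to the installed wkhtmltopdf.exe')
# First argument: the HTML file to convert; second argument: the PDF file to write
pdfkit.from_file('input.html', 'output.pdf', configuration=config)

# The form above is easier to read, but the same call can be written as one statement:
pdfkit.from_file('input.html', 'output.pdf',
                 configuration=pdfkit.configuration(wkhtmltopdf=r'path to the installed wkhtmltopdf.exe'))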
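When many HTML files have already been saved, the conversion can also be run as a separate pass over the folder. The sketch below is one way to do that. The function convert_folder is hypothetical, and it takes the converter as a parameter so that the loop can be tried without wkhtmltopdf installed.

```python
from pathlib import Path


def convert_folder(html_dir, pdf_dir, convert):
    """Convert every .html file in html_dir; return (done, failed) name lists."""
    html_dir, pdf_dir = Path(html_dir), Path(pdf_dir)
    pdf_dir.mkdir(parents=True, exist_ok=True)
    done, failed = [], []
    for src in sorted(html_dir.glob("*.html")):
        dst = pdf_dir / (src.stem + ".pdf")
        try:
            # convert is any callable taking (source_path, target_path)
            convert(str(src), str(dst))
            done.append(src.name)
        except Exception as exc:
            # Report the failure and move on to the next file.
            print("failed: {} ({})".format(src.name, exc))
            failed.append(src.name)
    return done, failed
```

With pdfkit, the converter could be passed as lambda src, dst: pdfkit.from_file(src, dst, configuration=config), where config is the configuration object created above.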

Source code:

import os

import parsel
import pdfkit
from lxml import etree
from requests_html import HTMLSession

session = HTMLSession()


def main():
    # 1. url + headers
    start_url = input('Enter the URL of the CSDN blogger: ')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36'
    }
    # 2. Analyze the page: CSDN pages are static, so request the page source
    response_1 = session.get(start_url, headers=headers).text
    # 3. Parse boke_urls and author_name
    html_xpath_1 = etree.HTML(response_1)
    author_name = html_xpath_1.xpath(
        '//*[@id="floor-user-profile_485"]/div/div[1]/div[2]/div[2]/div[1]/div[1]/text()')[0]
    boke_urls = html_xpath_1.xpath('//article[@class="blog-list-box"]/a/@href')
    # 4. Loop over the list to get each boke_url
    for boke_url in boke_urls:
        # 5. Request the post
        response_2 = session.get(boke_url, headers=headers).text
        # 6. Use XPath to get the file name
        html_xpath_2 = etree.HTML(response_2)
        file_name = html_xpath_2.xpath('//h1[@id="articleContentId"]/text()')[0]
        # 7. Use a CSS selector to get the article body
        html_css = parsel.Selector(response_2)
        html_content = html_css.css('article').get()
        # 8. Build the HTML file
        html = '''<!DOCTYPE html>
        <html lang="en">
        <head>
        <meta charset="UTF-8">
        <title>Title</title>
        </head>
        <body>
        {}
        </body>
        </html>
        '''.format(html_content)
        # 9. Create two folders, one for the HTML files and one for the PDF files
        if not os.path.exists('{}-html'.format(author_name)):
            os.mkdir('{}-html'.format(author_name))
        if not os.path.exists('{}-pdf'.format(author_name)):
            os.mkdir('{}-pdf'.format(author_name))
        # 10. Save the HTML file
        try:
            with open('{}-html/{}.html'.format(author_name, file_name), 'w', encoding='utf-8') as f:
                f.write(html)
        except Exception as e:
            # Usually a title containing characters that are not allowed in file names
            print('Invalid file name')
            continue
        # 11. Convert the file
        try:
            config = pdfkit.configuration(wkhtmltopdf=r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe')
            pdfkit.from_file('{}-html/{}.html'.format(author_name, file_name),
                             '{}-pdf/{}.pdf'.format(author_name, file_name),
                             configuration=config)
            print('--File downloaded: {}.pdf'.format(file_name))
        except Exception as e:
            # Skip posts that fail to convert
            continue


if __name__ == '__main__':
    main()
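The script above reports a file name error when a post title cannot be used as a file name, which typically happens when the title contains characters such as / or ?. One possible remedy, given only as a sketch, is to clean the title before building the path. The function safe_filename and its parameters are assumptions made for this example.

```python
import re

# Characters that Windows does not allow in file names, plus control characters.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(title, replacement="_", max_length=100):
    """Turn a post title into a string that is safe to use as a file name."""
    # Replace illegal characters, then trim spaces and dots from both ends.
    name = _ILLEGAL.sub(replacement, title).strip(" .")
    # Cap the length and fall back to a fixed name if nothing is left.
    return name[:max_length] or "untitled"
```

In the script it would be applied right after the title is extracted, for example file_name = safe_filename(file_name).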

