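Scraping the drug administration (药监局) website with Playwright automation, no reverse engineering
This site has fairly aggressive anti-scraping protection, and common automation tools such as selenium struggle to get any data out of it. After trying many approaches, I finally got the data using Playwright with the Firefox browser engine. It also works in headless mode, so it can be deployed and run in Docker. It is a bit slow, but it does return the data.
Without further ado, here is the code.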
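Install the dependencies first (the script also imports bs4, so beautifulsoup4 is needed, and only the Firefox engine has to be downloaded):
pip install playwright beautifulsoup4
playwright install firefox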
import time

from bs4 import BeautifulSoup
from playwright.sync_api import Playwright, sync_playwright


def run(playwright: Playwright, url: str) -> None:
    # Launch the Firefox engine in headless mode.
    browser = playwright.firefox.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    page.goto(url)
    # Wait until the target element is attached to the DOM.
    page.wait_for_selector('div.top', state='attached')
    html = page.content()
    soup = BeautifulSoup(html, 'html.parser')
    print(soup.get_text())
    context.close()
    browser.close()


def get(url: str) -> None:
    with sync_playwright() as playwright:
        run(playwright, url)


if __name__ == '__main__':
    start_time = time.perf_counter()
    get('URL')  # replace with the target page URL
    end_time = time.perf_counter()
    print(f'Elapsed time: {end_time - start_time:.2f} s')
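Because the site is slow and heavily protected, a single page load may occasionally fail or time out. The following is a minimal sketch of a retry wrapper, not part of the original script. The names `fetch_with_retry`, `attempts`, and `delay` are hypothetical, and `fetch` is assumed to be any callable that takes a URL and returns the page HTML (for example, a variant of `get` that returns `page.content()` instead of printing it).

```python
import time
from typing import Callable, Optional


def fetch_with_retry(fetch: Callable[[str], str], url: str,
                     attempts: int = 3, delay: float = 2.0) -> str:
    """Call fetch(url) up to `attempts` times, sleeping `delay` seconds after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as exc:
            # Broad on purpose for this sketch; in real use, narrow it,
            # e.g. to playwright.sync_api.TimeoutError.
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    # Every attempt failed: surface the last underlying error as the cause.
    raise RuntimeError(f"all {attempts} attempts failed for {url}") from last_error
```

The wrapper knows nothing about Playwright itself, so the browser logic stays in one place and the retry policy can be tuned separately.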
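Once the required element is detected, the script prints the data. The principle is simple, and the code is easy to build on for your own needs. Note that this only works with the Firefox engine (playwright.firefox).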
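As one possible direction for building on the script, the HTML returned by `page.content()` can be turned into structured records instead of being dumped as plain text. This is only a sketch: the function name `extract_items` is hypothetical, and the real page structure is not shown here, so the `div.top` selector is simply reused from the wait condition and would likely need adjusting.

```python
from typing import Dict, List, Optional

from bs4 import BeautifulSoup


def extract_items(html: str, selector: str = "div.top") -> List[Dict[str, Optional[str]]]:
    """Return the text and first link of every element matching `selector`."""
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for node in soup.select(selector):
        link = node.find("a")  # first <a> inside the element, if any
        items.append({
            "text": node.get_text(" ", strip=True),
            "href": link.get("href") if link else None,
        })
    return items
```

For example, passing the `html` variable from `run` to `extract_items(html)` would give a list of dictionaries that can be written to CSV or a database.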