XPath is short for XML Path Language. It is a language for locating information in XML and HTML documents.
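1. Common XPath rules
Expression   Description
nodename     Selects the child nodes of the current node that have this name
/            Selects direct children of the current node (at the start of an expression, it starts from the document root)
//           Selects descendants of the current node at any depth
.            Selects the current node
..           Selects the parent of the current node
@            Selects an attribute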
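The rules in the table can be tried out on a small fragment. The following is a minimal sketch, assuming lxml is installed; the fragment and the variable names (doc, ul, li_children and so on) are made up for illustration.

```python
from lxml import etree

# Illustrative fragment; etree.HTML wraps it in <html><body> automatically.
doc = etree.HTML('<div><ul><li class="ex1"><a href="ex1.html">ex1</a></li></ul></div>')

ul = doc.xpath('//ul')[0]             # // searches descendants at any depth
li_children = ul.xpath('./li')        # . is the current node, / steps to direct children
parent_tag = ul.xpath('..')[0].tag    # .. is the parent node
hrefs = doc.xpath('//a/@href')        # @ selects an attribute
direct_a = ul.xpath('./a')            # empty: a is a grandchild of ul, not a child
nested_a = ul.xpath('.//a')           # found: // also looks below the direct children

print(len(li_children), parent_tag, hrefs, len(direct_a), len(nested_a))
```

The last two lines show the practical difference between / and //: the a element sits inside li, so only the descendant search reaches it from ul.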
2. Preparation: install the lxml library (for example with pip install lxml)
3. Example:
from lxml import etree

text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a>
</ul>
</div>
'''
html = etree.HTML(text)    # parse the string as HTML; the parser repairs the markup (the second li has no closing tag)
r = etree.tostring(html)   # serialize the repaired tree, including the html and body elements that were added
print(r.decode('utf-8'))   # tostring returns bytes, so decode them as UTF-8

4. All nodes
Select all nodes:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//*')    # select every element node in the document
print(res)
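The contents of ./test.html are not shown here, so the following is a self-contained variant that parses a string instead. It is a sketch that assumes a fragment similar to the one in section 3; printing the tag names makes the result easier to read than the default list of Element objects.

```python
from lxml import etree

# Assumed sample markup, standing in for the contents of test.html.
text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
</ul>
</div>
'''
doc = etree.HTML(text)

# //* matches every element; text nodes and attributes are not included.
tags = [element.tag for element in doc.xpath('//*')]
print(tags)  # ['html', 'body', 'div', 'ul', 'li', 'a', 'li', 'a']
```

The html and body entries appear because the HTML parser adds them around the fragment.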
5. Child nodes
Select all a nodes that are direct children of li nodes:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li/a')
print(res)
6. Parent nodes
Use . for the current node and .. for its parent.
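As an illustration of stepping up to a parent, here is a minimal sketch that finds an a node by its href and then reads the class of its parent li. The fragment is assumed sample data, and parent::* is the axis form of the .. shorthand.

```python
from lxml import etree

doc = etree.HTML('<div><ul><li class="ex1"><a href="ex1.html">ex1</a></li></ul></div>')

# Locate the a node, step up to its parent with .., then read the class attribute.
parent_class = doc.xpath('//a[@href="ex1.html"]/../@class')

# The same query written with the parent axis.
same_via_axis = doc.xpath('//a[@href="ex1.html"]/parent::*/@class')

print(parent_class, same_via_axis)  # ['ex1'] ['ex1']
```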
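7. Attribute matching

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
# The XPath string uses double quotes inside so that it does not end the Python string early.
res = html.xpath('//li[@class="ex1"]')
print(res)

8. Text retrieval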
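There are two ways to get the text inside an li node; the second is recommended.
a. Step down to the a node and take its text:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]/a/text()')
print(res)

b. Recommended, since it returns more complete information: //text() collects the text of every descendant of the li node (the result can include whitespace-only strings such as line breaks).

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li[@class="ex1"]//text()')
print(res)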
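To make the difference between the two methods visible, here is a small sketch on an assumed fragment in which the li node has text of its own around the a node. The sample markup and variable names are illustrative only.

```python
from lxml import etree

# The li has its own text before and after the nested a element.
doc = etree.HTML('<ul><li class="ex1">first <a href="ex1.html">ex1</a> tail</li></ul>')

only_a = doc.xpath('//li[@class="ex1"]/a/text()')   # text of the a node only
all_text = doc.xpath('//li[@class="ex1"]//text()')  # text of li and all its descendants

print(only_a)             # ['ex1']
print(all_text)           # ['first ', 'ex1', ' tail']
print(''.join(all_text))  # first ex1 tail
```

Joining the pieces, and stripping them if needed, is a common way to turn the //text() result into a single string.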
9. Attribute retrieval
Get the href attribute of every a node that is a direct child of an li node:

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())
res = html.xpath('//li/a/@href')
print(res)
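10. Matching attributes with multiple values

from lxml import etree

text = '''
<div>
<ul>
<li class="li li-first"><a href="ex1.html">li1</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
# class holds two values, so an exact comparison @class="li" would not match; contains() does.
res = html.xpath('//li[contains(@class,"li")]/a/text()')
print(res)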
[Note]
In contains(), the first argument is the attribute (for example @class) and the second argument is the attribute value to look for.
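One thing to keep in mind is that contains() is a plain substring test, so "li" also matches a class such as "list". The sketch below shows this on assumed sample data, together with a commonly used XPath 1.0 idiom (concat plus normalize-space) that matches a whole class token; the idiom is an addition for illustration, not part of the original examples.

```python
from lxml import etree

doc = etree.HTML('<ul><li class="li li-first">a</li><li class="list">b</li></ul>')

# Substring test: both class values contain the characters "li".
loose = doc.xpath('//li[contains(@class, "li")]/text()')

# Token test: pad the class value with spaces and look for " li " as a whole word.
strict = doc.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " li ")]/text()'
)

print(loose)   # ['a', 'b']
print(strict)  # ['a']
```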
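11. Matching multiple attributes
Identify a node by more than one attribute, combining the conditions with and:

from lxml import etree

text = '''
<div>
<ul>
<li class="li" name="123"><a href="ex1.html">ex1</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
# contains() is a function, so it takes no @ prefix; only attribute names do.
res = html.xpath('//li[contains(@class,"li") and @name="123"]/a/text()')
print(res)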
12. Selecting by position (among several nodes)

from lxml import etree

text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
res = html.xpath('//li[1]/a/text()')             # the first li
print(res)
res = html.xpath('//li[last()]/a/text()')        # the last li
print(res)
res = html.xpath('//li[position()<3]/a/text()')  # the first two li nodes
print(res)
res = html.xpath('//li[last()-2]/a/text()')      # the third li from the end (with three li nodes, the first one)
print(res)
[Note]
Positions start at 1, not 0.
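A short sketch of what 1-based positions mean in practice, on assumed sample data with two lists. It also shows that //li[1] counts positions within each parent, while (//li)[1] counts over the whole result.

```python
from lxml import etree

doc = etree.HTML(
    '<div><ul><li>a1</li><li>a2</li></ul><ul><li>b1</li><li>b2</li></ul></div>'
)

per_parent = doc.xpath('//li[1]/text()')   # first li inside each ul
overall = doc.xpath('(//li)[1]/text()')    # first li in the whole document
zero = doc.xpath('//li[0]')                # position 0 does not exist

print(per_parent)  # ['a1', 'b1']
print(overall)     # ['a1']
print(zero)        # []
```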
13. Selecting with node axes

from lxml import etree

text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
</div>
'''
html = etree.HTML(text)
res = html.xpath('//li[1]/ancestor::*')                   # all ancestor nodes
res = html.xpath('//li[1]/ancestor::div')                 # only the div ancestor
res = html.xpath('//li[1]/attribute::*')                  # all attribute values
res = html.xpath('//li[1]/child::a[@href="ex1.html"]')    # direct a children whose href is ex1.html
res = html.xpath('//li[1]/descendant::span')              # span descendants (none in this sample, so the result is empty)
res = html.xpath('//li[1]/following::*[2]')               # the second node after the current node in document order
res = html.xpath('//li[1]/following-sibling::*')          # all later siblings of the current node
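The axis queries above return Element objects (or strings, for attributes), which are hard to read when printed directly. The sketch below reruns a few of them on the same three-item list and prints tag names and attribute values instead; the variable names are illustrative.

```python
from lxml import etree

text = '''
<div>
<ul>
<li class="ex1"><a href="ex1.html">ex1</a></li>
<li class="ex2"><a href="ex2.html">ex2</a></li>
<li class="ex3"><a href="ex3.html">ex3</a></li>
</ul>
</div>
'''
doc = etree.HTML(text)

# Ancestors include the html and body elements added by the HTML parser.
ancestors = [e.tag for e in doc.xpath('//li[1]/ancestor::*')]

# attribute::* yields the attribute values as strings.
attrs = doc.xpath('//li[1]/attribute::*')

# following::* runs li, a, li, a after the first li; [2] picks the second of them.
second_following = doc.xpath('//li[1]/following::*[2]')[0]

# Later siblings of the first li are the other two li nodes.
sibling_classes = [e.get('class') for e in doc.xpath('//li[1]/following-sibling::*')]

print(ancestors)
print(attrs)
print(second_following.tag, second_following.get('href'))
print(sibling_classes)
```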