
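Beautiful Soup is a Python library for extracting data from HTML and XML files. This article summarizes the basics of Beautiful Soup and how to use it; hopefully it is useful to anyone learning Python.

　　Version: 4.4.0

　　Installing Beautiful Soup

　　Once Python 3 is installed, a single command is enough.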

　　 pip install beautifulsoup4

　　Note: on a Mac you may need to use pip3 install beautifulsoup4 instead.

　　After installing Beautiful Soup we also need an HTML parser, for example the third-party parser lxml:

　　 pip install lxml

　　Everything is now in place.

　　Quick Start

　　>>> from bs4 import BeautifulSoup
　　>>> soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')

　　Kinds of Objects

　　Beautiful Soup converts a complex HTML document into a tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.

　　 Tag

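　　A Tag object corresponds to a tag in the original XML or HTML document. For example:

　　>>> tag = soup.b
　　>>> type(tag)
　　<class 'bs4.element.Tag'>

　　The two most important attributes, name and attributes, are described below. A Tag has many other attributes and methods, which are covered in detail under navigating the tree and searching the tree.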

　　 Name

　　Use .name to read or change a tag's name:

　　>>> tag.name
　　'b'

　　 Attributes

　　A tag can have any number of attributes. For example, the tag above has a class attribute.

　　>>> tag['class']

　　['boldest']

　　To get all attributes as a dictionary:

　　 tag.attrs

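　　A tag's attributes can also be added, removed, and modified, in the same way as dictionary entries.

　　Note: some attributes are multi-valued, meaning a single attribute can hold several values at once. class is the most common case, which is why tag['class'] returns a list.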
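　　The dictionary-style operations described above can be sketched as follows. This is a minimal illustration, not part of the original article: the markup, the id value, and the data-note attribute are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one tag with a multi-valued class and a single-valued id.
soup = BeautifulSoup(
    '<b class="boldest strikeout" id="x1">Extremely bold</b>', "html.parser"
)
tag = soup.b

classes = tag["class"]        # multi-valued attribute: returned as a list
tag_id = tag["id"]            # single-valued attribute: returned as a string

tag["id"] = "headline"        # modify an existing attribute
tag["data-note"] = "demo"     # add a new attribute
del tag["class"]              # remove an attribute

remaining = dict(tag.attrs)
print(classes, tag_id, remaining)
```

　　tag.get('class') returns None once the attribute has been deleted, whereas tag['class'] would raise a KeyError, just as with an ordinary dictionary.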

　　 NavigableString

　　Strings are usually contained inside tags. Beautiful Soup wraps the text inside a tag in the NavigableString class:

　　>>> tag.string
　　'Extremely bold'
　　>>> type(tag.string)
　　<class 'bs4.element.NavigableString'>

　　 BeautifulSoup

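　　The BeautifulSoup object represents the document as a whole rather than an actual HTML or XML tag, so it has no name or attributes of its own. Because it is sometimes useful to inspect its .name, it is given the special .name value [document]:

　　>>> soup.name
　　'[document]'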

　　 Comment

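　　The three classes above cover almost everything in an HTML or XML document, but a few special objects remain. The one you are most likely to meet is the comment:

　　>>> markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
　　>>> soup = BeautifulSoup(markup, 'html.parser')
　　>>> comment = soup.b.string
　　>>> type(comment)
　　<class 'bs4.element.Comment'>
　　>>> comment
　　'Hey, buddy. Want to buy a used parser?'

　　A Comment object is a special type of NavigableString.

　　When it is printed as part of a document, a Comment is output with special formatting:

　　>>> print(soup.b.prettify())
　　<b>
　　 <!--Hey, buddy. Want to buy a used parser?-->
　　</b>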

　　Navigating the Tree

　　An example document:

>>> html_doc = """
... <html><head><title>The Dormouse's story</title></head>
... <body>
... <p class="title"><b>The Dormouse's story</b></p>
...
... <p class="story">Once upon a time there were three little sisters; and their names were
... <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
... <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
... <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
... and they lived at the bottom of a well.</p>
...
... <p class="story">...</p>
... """
>>>
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(html_doc, 'html.parser')

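　　Child Nodes

　　A Tag may contain several strings or other Tags; these are all children of that Tag. Beautiful Soup provides many attributes for working with and iterating over a tag's children.

　　 Tag names

　　The simplest way to navigate the tree is to name the tag you want: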

>>> soup.head

<head><title>The Dormouse's story</title></head>

>>> soup.title

<title>The Dormouse's story</title>

>>> soup.body.b

<b>The Dormouse's story</b>

　　 .contents and .children

　　The .contents attribute returns a tag's children as a list:

　　>>> head_tag = soup.head

>>> head_tag

<head><title>The Dormouse's story</title></head>

>>> head_tag.contents

[<title>The Dormouse's story</title>]

# .contents returns a list

>>> title_tag = head_tag.contents[0]

>>> title_tag

<title>The Dormouse's story</title>

>>> title_tag.contents

["The Dormouse's story"]

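　　Note: a string has no children, so strings do not have a .contents attribute.

　　Instead of getting the children as a list, you can iterate over them with the .children generator:

　　>>> for child in title_tag.children:
　　...     print(child)
　　...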

　　The Dormouse's story

　　 .descendants

　　The .contents and .children attributes only cover a tag's direct children. The .descendants attribute iterates recursively over all of a tag's descendants:

>>> for child in head_tag.descendants:
...     print(child)

...

<title>The Dormouse's story</title>

The Dormouse's story
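　　To make the difference between direct children and descendants concrete, here is a small self-contained sketch. It reuses only the <head> fragment of the example document, and the variable names direct and everything are chosen for illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<head><title>The Dormouse's story</title></head>", "html.parser"
)
head_tag = soup.head

# .children yields only the direct children of <head>: the <title> tag.
direct = list(head_tag.children)

# .descendants also walks into <title> and yields the string inside it.
everything = list(head_tag.descendants)

print(len(direct), len(everything))
```

　　Here <head> has one child but two descendants, because the text inside <title> is a child of <title>, not of <head>.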

　　 .string

　　>>> title_tag.string
　　"The Dormouse's story"

　　 .strings and .stripped_strings

　　· If a tag contains more than one string, use .strings to iterate over them.

for string in soup.strings:
    print(repr(string))
# "The Dormouse's story"
# '\n\n'
# "The Dormouse's story"
# '\n\n'
# 'Once upon a time there were three little sisters; and their names were\n'
# 'Elsie'
# ',\n'
# 'Lacie'
# ' and\n'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# '...'
# '\n'

　　· The strings returned this way can contain a lot of extra spaces and blank lines. Use .stripped_strings to drop whitespace-only strings and strip leading and trailing whitespace from the rest.

for string in soup.stripped_strings:
    print(repr(string))
# "The Dormouse's story"
# "The Dormouse's story"
# 'Once upon a time there were three little sisters; and their names were'
# 'Elsie'
# ','
# 'Lacie'
# 'and'
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '...'

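　　Parent Nodes

　　 .parent

　　The .parent attribute returns an element's parent node.

　　Strings have a parent too: the tag that contains them.

　　The parent of the top-level <html> tag is the BeautifulSoup object.

　　The parent of the BeautifulSoup object is None.

　　 .parents

　　The .parents attribute iterates over all of an element's ancestors, from its parent up to the BeautifulSoup object.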
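　　The parent relationships listed above can be checked with a short sketch. The markup is a trimmed-down version of the example document, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)
title_tag = soup.title

# A string's parent is the tag that contains it.
string_parent = title_tag.string.parent.name

# .parents walks upward: <head>, <html>, then the BeautifulSoup object.
ancestors = [parent.name for parent in title_tag.parents]

# The top-level tag's parent is the BeautifulSoup object, whose own parent is None.
top_is_soup = soup.html.parent is soup
soup_parent = soup.parent

print(string_parent, ancestors, top_is_soup, soup_parent)
```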

　　Sibling Nodes

　　Use the .next_sibling and .previous_sibling attributes to move between nodes that share the same parent.

　　The .next_siblings and .previous_siblings attributes iterate over all of the siblings that follow or precede the current node.
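　　A point that is easy to miss is that the text between two tags is itself a sibling node. The sketch below, using a made-up one-line fragment, shows both the single-step and the iterating attributes; filtering with isinstance(..., Tag) is one possible way to skip the text nodes, not the only one.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

soup = BeautifulSoup(
    '<p><a id="link1">Elsie</a>, <a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>',
    "html.parser",
)
first = soup.a
last = soup.find(id="link3")

# The node right after the first <a> is the text ", ", not the next <a>.
after_first = first.next_sibling

# Iterate over all following siblings, keeping only the tags.
following_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

# .previous_siblings walks backwards from the current node.
preceding_ids = [s["id"] for s in last.previous_siblings if isinstance(s, Tag)]

print(repr(after_first), following_ids, preceding_ids)
```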

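　　Searching the Tree

　　Beautiful Soup defines many search methods, for example find() and find_all().

　　Filters

　　The common filter types are the following:

　　String

　　The simplest filter. For example, to find <b> tags you can write find_all('b').

　　Regular expression

　　Matches names that satisfy the regular expression.

　　List

　　Matches anything that equals any one item in the list.

　　True

　　Matches any value; for example, find_all(True) returns every tag in the document, but no text strings.

　　Function

　　You can define a function that takes one argument (the element) and returns a Boolean: True means the current element matches, False means it does not.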
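　　The five filter types can be compared side by side. The following is a sketch on a small made-up document; the helper has_class_but_no_id is a hypothetical name chosen for this example.

```python
import re
from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head><body>"
    '<p class="title"><b>Bold</b></p><a id="link1">Elsie</a>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# String: exact tag name.
by_string = [t.name for t in soup.find_all("b")]

# Regular expression: tag names starting with "b" (<body> and <b>).
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]

# List: any tag whose name is in the list.
by_list = [t.name for t in soup.find_all(["a", "b"])]

# True: every tag, but no text strings.
by_true = [t.name for t in soup.find_all(True)]

# Function: called with each tag, returns True for a match.
def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

by_function = [t.name for t in soup.find_all(has_class_but_no_id)]

print(by_string, by_regex, by_list, by_true, by_function)
```

　　Results come back in document order, which is why <body> precedes <b> in the regular-expression case.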

　　 find_all

　　find_all(name, attrs, recursive, string, **kwargs)

　　Searches all tag descendants of the current tag and checks each one against the filter conditions.

　　1. name

　　The name argument finds all tags whose name matches name.

soup.find_all("title")

# [<title>The Dormouse's story</title>]

　　2. keyword arguments

　　If a named argument is not one of the built-in search parameter names, it is treated as a filter on the tag attribute of that name.

# id

soup.find_all(id="link2")

# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

# href

>>> import re

>>> soup.find_all(href=re.compile('elsie'))

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

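# attrs: attribute names such as data-foo are not valid Python keyword names, so pass them in a dictionary

>>> data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'html.parser')

>>> data_soup.find_all(attrs={"data-foo": "value"})

[<div data-foo="value">foo!</div>]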

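Searching by class

To search by CSS class name, use the keyword class_, because class is a reserved word in Python.

>>> soup.find_all("a", class_="sister")

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]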

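　　3. string argument

　　The string argument searches the text content of the document instead of the tags. It accepts the same kinds of values as the name argument: a string, a regular expression, a list, True, or a function.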

　　>>> soup.find_all(string="Elsie")

　　['Elsie']

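　　4. limit argument

　　Use limit to cap the number of results returned; the search stops once that many matches have been found.

　　>>> soup.find_all("a", limit=2)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]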

　　5. recursive argument

　　With recursive set to False, only the tag's direct children are searched.
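　　A quick sketch of the effect of recursive=False, using a reduced version of the example document (the variable names deep and shallow are illustrative):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head></html>",
    "html.parser",
)

# Default: search all descendants of <html>, so the nested <title> is found.
deep = soup.html.find_all("title")

# recursive=False: only direct children of <html> are checked; <title> is
# a grandchild (inside <head>), so nothing matches.
shallow = soup.html.find_all("title", recursive=False)

print(len(deep), len(shallow))
```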

　　 find

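　　find() differs from find_all() in that the former returns the first match directly (or None when nothing matches), while the latter returns a list of all matches.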
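　　The difference in return values matters most when nothing matches. A minimal sketch with made-up markup:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a id="link1">Elsie</a><a id="link2">Lacie</a>', "html.parser"
)

first_link = soup.find("a")          # a single Tag: the first match
all_links = soup.find_all("a")       # a list of every match

missing_one = soup.find("table")     # None when nothing matches
missing_all = soup.find_all("table") # an empty list when nothing matches

print(first_link["id"], len(all_links), missing_one, len(missing_all))
```

　　Because find() can return None, code that chains attribute access onto its result should check for None first.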

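　　 CSS Selectors

　　Pass a CSS selector string to the select() method of a BeautifulSoup object (or of any Tag) to search with CSS selector syntax. The result is a list of matching tags.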
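　　As a rough illustration of select(), the sketch below uses a few basic selectors (class, id, descendant, and child) on a made-up fragment. Which more advanced selectors are available depends on the installed Beautiful Soup version, so only simple ones are shown.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story"><a class="sister" id="link1">Elsie</a>, '
    '<a class="sister" id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Descendant selector combined with class selectors.
sister_ids = [a["id"] for a in soup.select("p.story a.sister")]

# Id selector; select() always returns a list, even for a single match.
lacie_text = soup.select("#link2")[0].get_text()

# Child selector: <b> tags that are direct children of a <p>.
bold_tags = soup.select("p > b")

print(sister_ids, lacie_text, len(bold_tags))
```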

Source: 紫電清霜
