2.3 BeautifulSoup Example - 2
Here is another example:
import requests
from bs4 import BeautifulSoup

# Structure of the example html page:
# body
#   - div
#     - h2
#     - p
#     - table.table
#       - thead
#         - tr
#           - th
#           - th
#           - th
#           - th
#       - tbody
#         - tr
#           - td
#           - td
#           - td
#           - td
#             - a
#               - img
#         - tr
#         - ...


def main():
    url = 'http://blog.castman.net/web-crawler-tutorial/ch2/table/table.html'
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    count_course_number(soup)
    calculate_course_average_price1(soup)
    calculate_course_average_price2(soup)
    retrieve_all_tr_contents(soup)


def count_course_number(soup):
    # Each tr in the table body is one course.
    rows = soup.find('table', 'table').tbody.find_all('tr')
    print('Total course count: ' + str(len(rows)) + '\n')


def calculate_course_average_price1(soup):
    # Calculate the average course price.
    # Retrieve the price by column index (the third td of each row):
    prices = []
    rows = soup.find('table', 'table').tbody.find_all('tr')
    for row in rows:
        price = row.find_all('td')[2].text
        print(price)
        prices.append(int(price))
    print('Average course price: ' + str(sum(prices) / len(prices)) + '\n')


def calculate_course_average_price2(soup):
    # Retrieve the price via siblings: the td before the one that holds the link.
    prices = []
    links = soup.find_all('a')
    for link in links:
        price = link.parent.previous_sibling.text
        prices.append(int(price))
    print('Average course price: ' + str(sum(prices) / len(prices)) + '\n')


def retrieve_all_tr_contents(soup):
    # Retrieve every tr record:
    rows = soup.find('table', 'table').tbody.find_all('tr')
    for row in rows:
        # Besides all_tds = row.find_all('td'), the td elements can also be
        # collected from row.children. This relies on the tr having no
        # whitespace text nodes between its td tags.
        all_tds = [td for td in row.children]
        # Not every link has an href attribute, so check before reading it.
        if 'href' in all_tds[3].a.attrs:
            href = all_tds[3].a['href']
        else:
            href = None
        print(all_tds[0].text, all_tds[1].text, all_tds[2].text, href, all_tds[3].a.img['src'])


if __name__ == '__main__':
    main()
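Compared with the previous example, find() and find_all() are not necessarily the most convenient tools for this kind of page. When walking through the page structure like this, parent, children, and next/previous siblings can also work very well.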
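As a minimal sketch of these navigation attributes, the snippet below walks a small table by hand. The inline HTML and its course names are made up for illustration and are not the tutorial page. It is written without whitespace between tags so that every sibling is a tag rather than a text node.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML (not the tutorial page). There is no whitespace
# between tags, so .children and .previous_sibling yield tags only.
HTML = (
    '<table class="table"><tbody>'
    '<tr><td>Course A</td><td>Beginner</td><td>100</td>'
    '<td><a href="http://example.com/a"><img src="img/a.png"></a></td></tr>'
    '<tr><td>Course B</td><td>Intermediate</td><td>200</td>'
    '<td><a><img src="img/b.png"></a></td></tr>'
    '</tbody></table>'
)

soup = BeautifulSoup(HTML, 'html.parser')

first_link = soup.find('a')
link_cell = first_link.parent            # the td that wraps the a tag
price_cell = link_cell.previous_sibling  # the td right before it
row = link_cell.parent                   # the enclosing tr
cells = list(row.children)               # the four td tags of that row
next_row = row.next_sibling              # the second tr

print(price_cell.text)   # 100
print(len(cells))        # 4
print(next_row.td.text)  # Course B
```

Stepping from a known element to its neighbours avoids a second search when the target, like the price cell here, has no class or id of its own.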
Running the full example script, starting from main(), prints the following:
Total course count: 6

1490
1890
1890
1890
1890
1890
Average course price: 1823.3333333333333

Average course price: 1823.3333333333333

初心者 - Python入門 初學者 1490 http://www.pycone.com img/python-logo.png
Python 網頁爬蟲入門實戰 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 機器學習入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 資料科學入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 資料視覺化入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 網站架設入門實戰 (預計) 有程式基礎的初學者 1890 None img/python-logo.png
Process finished with exit code 0
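The last row prints None because the script checks for an href attribute before reading it. A shorter way to write that check, assuming the same tag layout, is Tag.get, which returns None when the attribute is absent. The HTML below is a made-up fragment, not the tutorial page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the second link has no href attribute.
html = (
    '<td><a href="http://example.com"><img src="img/a.png"></a></td>'
    '<td><a><img src="img/b.png"></a></td>'
)
soup = BeautifulSoup(html, 'html.parser')

# Tag.get works like dict.get: a missing attribute gives None, not a KeyError.
hrefs = [a.get('href') for a in soup.find_all('a')]
print(hrefs)  # ['http://example.com', None]
```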
Click here for the source code.
Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree
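One caveat from that part of the documentation is worth a sketch: when the HTML source has line breaks between tags, previous_sibling and next_sibling often return a whitespace string instead of a tag. The snippet below uses a made-up fragment to show this, with find_previous_sibling as one way to skip the text nodes. It is an illustration under those assumptions, not part of the original script.

```python
from bs4 import BeautifulSoup

# Hypothetical row with newlines between the td tags.
html = '<tr>\n<td>1490</td>\n<td><a href="#">link</a></td>\n</tr>'
soup = BeautifulSoup(html, 'html.parser')

link_cell = soup.find('a').parent

# The direct sibling is the newline between the two cells, not the price cell.
print(link_cell.previous_sibling == '\n')  # True

# find_previous_sibling('td') skips text nodes and returns the nearest td.
price_cell = link_cell.find_previous_sibling('td')
print(int(price_cell.text))  # 1490
```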