2.3 BeautifulSoup Example - 2

Here is another example:

import requests
from bs4 import BeautifulSoup

# Structure of the example html page:
#  body
#   - div
#     - h2
#     - p
#     - table.table
#       - thead
#         - tr
#           - th
#           - th
#           - th
#           - th
#       - tbody
#         - tr
#           - td
#           - td
#           - td
#           - td
#             - a
#               - img
#         - tr
#         - ...


def main():
    url = 'http://blog.castman.net/web-crawler-tutorial/ch2/table/table.html'
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')

    count_course_number(soup)
    calculate_course_average_price1(soup)
    calculate_course_average_price2(soup)
    retrieve_all_tr_contents(soup)


def count_course_number(soup):
    # Count the rows inside the <tbody> of <table class="table">
    rows = soup.find('table', 'table').tbody.find_all('tr')
    print('Total course count: ' + str(len(rows)) + '\n')


def calculate_course_average_price1(soup):
    # Calculate the average course price by indexing into each row's <td> cells
    prices = []
    rows = soup.find('table', 'table').tbody.find_all('tr')
    for row in rows:
        price = row.find_all('td')[2].text
        print(price)
        prices.append(int(price))
    print('Average course price: ' + str(sum(prices) / len(prices)) + '\n')


def calculate_course_average_price2(soup):
    # Calculate the average price again, this time navigating from each <a>
    # tag to its parent's previous sibling (the price cell)
    prices = []
    links = soup.find_all('a')
    for link in links:
        price = link.parent.previous_sibling.text
        prices.append(int(price))
    print('Average course price: ' + str(sum(prices) / len(prices)) + '\n')


def retrieve_all_tr_contents(soup):
    # Print every <tr>'s cell contents together with the link href and the image source
    rows = soup.find('table', 'table').tbody.find_all('tr')
    for row in rows:
        # Besides all_tds = row.find_all('td'), you can also collect the <td> cells like this:
        all_tds = [td for td in row.children]
        if 'href' in all_tds[3].a.attrs:
            href = all_tds[3].a['href']
        else:
            href = None
        print(all_tds[0].text, all_tds[1].text, all_tds[2].text, href, all_tds[3].a.img['src'])


if __name__ == '__main__':
    main()

Compared with the previous example, find() and find_all() are not always the most convenient tools for this kind of page. When walking the document tree like this, parent, children, and the next/previous sibling attributes can work just as well.
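For reference, here is a minimal, self-contained sketch of those traversal attributes. The HTML string and values below are made up for illustration; they are not taken from the tutorial page:

from bs4 import BeautifulSoup

# A tiny hypothetical table, kept on one line so that no whitespace text
# nodes appear between the tags (with pretty-printed HTML, previous_sibling
# or children may return whitespace strings instead of tags).
html = ('<table><tbody>'
        '<tr><td>Python 101</td><td>1490</td><td><a href="http://example.com">link</a></td></tr>'
        '</tbody></table>')
soup = BeautifulSoup(html, 'html.parser')

link = soup.find('a')
print(link.parent.name)                   # 'td'   - the cell that wraps the <a>
print(link.parent.previous_sibling.text)  # '1490' - the price cell right before it
row = link.find_parent('tr')
print([td.text for td in row.children])   # every cell in the same row

This prints td, 1490, and ['Python 101', '1490', 'link'].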

The output:

Total course count: 6

1490
1890
1890
1890
1890
1890
Average course price: 1823.3333333333333

Average course price: 1823.3333333333333

初心者 - Python入門 初學者 1490 http://www.pycone.com img/python-logo.png
Python 網頁爬蟲入門實戰 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 機器學習入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 資料科學入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 資料視覺化入門實戰 (預計) 有程式基礎的初學者 1890 http://www.pycone.com img/python-logo.png
Python 網站架設入門實戰 (預計) 有程式基礎的初學者 1890 None img/python-logo.png


Source code: 點我

Reference:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-the-tree