2.1 BeautifulSoup範例 - 1

以下這個範例會簡單示範BeautifulSoup的使用方式, 可以邊看html結構邊對照程式碼了解一下各種場合要怎麼去擷取資訊.

import requests
from bs4 import BeautifulSoup


def main():
    url = 'http://blog.castman.net/web-crawler-tutorial/ch2/blog/blog.html'
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')

    # The following two lines are the same.
    # print(soup.find('h4'))
    print('Content of the first h4:')
    print(soup.h4)

    # To find the first text content of anchor of h4
    print('\nText content of the first h4:')
    print(soup.h4.a.text)

    print('\nTo find all the h4 text content:')
    h4_tags = soup.find_all('h4')
    for h4 in h4_tags:
        print(h4.a.text)

    print('\nTo find all the h4 text content with class named \'card-title\' :')
    # The following three ways are the same.
    # h4_tags = soup.find_all('h4', {'class': 'card-title'})
    # h4_tags = soup.find_all('h4', 'card-title')
    h4_tags = soup.find_all('h4', class_='card-title')
    for h4 in h4_tags:
        print(h4.a.text)

    print('\nTo find elements with id attribute: ')
    print(soup.find(id='mac-p').text.strip())
    # If the attribute key contains special character, it will occur SyntaxError:
    # print(soup.find(data-foo='mac-p').text.strip())
    # To prevent this, you can do as the following line:
    print(soup.find_all('', {'data-foo': 'mac-foo'}))

    print('\nTo retrieve all the blog post\'s information:')
    divs = soup.find_all('div', 'content')
    for div in divs:
        # If we only use print(div.text) to retrieve the content, it's not easy to handle the information,
        # to make the retrieved data clearly, you can craw the blog page like this:
        print(div.h6.text.strip(), div.h4.a.text.strip(), div.p.text.strip())

    # There is also another good way the retrieve the blog info, by stripped_strings() function,
    # it will return all the text content that are under the parent tag, even wrap by other sub tags.
    # However, the return object of stripped_strings is an iterator object, so it's not human-readable.
    # To solve this, take a look at following code block:
    print('\nTo find all blog contents via stripped_strings function:')
    for div in divs:
        # If you feel it's hard to understand, google "[s for s in subsets(S)]"
        print([s for s in div.stripped_strings])


if __name__ == '__main__':
    main()

輸出如下:

原始碼點我

Last updated

Was this helpful?