📘
Python web crawler note
2.4 Adding Regular Expressions


Last updated 5 years ago


Sometimes you need to find content that matches a particular pattern, such as phone numbers, emails, URLs, or specific tags (e.g. h4). Knowing regular expressions lets you extract that kind of information much more efficiently.

Some common patterns:

  • URL: http(s)?://[a-zA-Z0-9./_]+

  • Email: [a-zA-Z0-9._+]+@[a-zA-Z0-9._]+\.(com|org|edu|gov|net)

  • All Chinese characters (excluding punctuation): [\u4e00-\u9fa5]+

  • Online Unicode lookup:

  • Or Google for patterns others have already written
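As a quick check of the patterns above, `re.findall` returns every match in a string (the sample text here is made up for illustration; the groups are written as non-capturing `(?:...)` so that `findall` returns the full match rather than just the group):

```python
import re

text = 'Contact: alice@example.com, site: https://example.org/docs, 電話 0912-345-678'

# URL pattern from the list above
urls = re.findall(r'http(?:s)?://[a-zA-Z0-9./_]+', text)
# Email pattern, with the dot before the TLD escaped
emails = re.findall(r'[a-zA-Z0-9._+]+@[a-zA-Z0-9._]+\.(?:com|org|edu|gov|net)', text)
# All runs of Chinese characters
chinese = re.findall(r'[\u4e00-\u9fa5]+', text)

print(urls)     # ['https://example.org/docs']
print(emails)   # ['alice@example.com']
print(chinese)  # ['電話']
```

Note that inside a character class like `[a-zA-Z0-9./_]` the dot is already literal, so it only needs escaping outside the brackets.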

import requests
import re
from bs4 import BeautifulSoup


def main():
    url = 'http://blog.castman.net/web-crawler-tutorial/ch2/blog/blog.html'
    resp = requests.get(url)
    soup = BeautifulSoup(resp.text, 'html.parser')
    find_text_content_by_reg(soup, 'h[1-6]')

    # [a-zA-Z0-9]+ -> matches a string composed of the characters a-z, A-Z and 0-9,
    # with length at least 1 (the "+" means one or more repetitions).

    # http(s)?://[a-zA-Z0-9\./_]+ -> matches a hyperlink.

    # [\u4e00-\u9fa5]+ -> matches a run of Chinese characters (by Unicode range).

    print('\nFind all .png img source:')
    # Find .png image sources with a regex.
    # $ anchors the match to the end of the string.
    # \. matches a literal "."; the \ escapes the special character.
    png_source_pattern = r'\.png$'
    find_img_source_by_reg(soup, png_source_pattern)

    # Find .png image sources whose file name contains "beginner".
    # In the pattern, the "." after "beginner" matches any character,
    # and "*" means zero or more repetitions.
    print('\nFind all .png img sources that contain \"beginner\" in file name:')
    find_img_source_by_reg(soup, 'beginner.*'+png_source_pattern)

    print('\nTo count the blog number:')
    blog_class_pattern = r'card-blog$'
    count_blog_number(soup, blog_class_pattern)

    print('\nTo find how many image sources contain the word \"crawler\"')
    target_pattern = 'crawler.*'
    find_img_source_by_reg(soup, target_pattern)


# re.compile API DOC: https://docs.python.org/3/library/re.html#re.compile
def find_text_content_by_reg(soup, reg_pattern):
    for element in soup.find_all(re.compile(reg_pattern)):
        print(element.text.strip())


def find_img_source_by_reg(soup, source_type):
    for img in soup.find_all('img', {'src': re.compile(source_type)}):
        print(img['src'])


def count_blog_number(soup, blog_pattern):
    count = len(soup.find_all('div', {'class': re.compile(blog_pattern)}))
    print('Blog count: ' + str(count))


if __name__ == '__main__':
    main()

Output:

Python教學文章
開發環境設定
Mac使用者
資料科學
給初學者的 Python 網頁爬蟲與資料分析
資料科學
給初學者的 Python 網頁爬蟲與資料分析
資料科學
給初學者的 Python 網頁爬蟲與資料分析
資料科學
給初學者的 Python 網頁爬蟲與資料分析
資料科學
給初學者的 Python 網頁爬蟲與資料分析

Find all .png img source:
static/python-for-beginners.png
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png

Find all .png img sources that contain "beginner" in file name:
static/python-for-beginners.png

To count the blog number:
Blog count: 6

To find how many image sources contain the word "crawler"
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png
static/python_crawler.png

Process finished with exit code 0
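The anchors and escapes used in the script's patterns behave the same way in plain `re`, outside of BeautifulSoup (the file names below are made up for illustration):

```python
import re

sources = ['static/python-for-beginners.png', 'static/logo.jpg', 'notpng.png.bak']

# \. matches a literal dot; $ anchors the match to the end of the string,
# so "notpng.png.bak" is rejected even though it contains ".png".
png = [s for s in sources if re.search(r'\.png$', s)]
print(png)  # ['static/python-for-beginners.png']

# .* matches any run of characters (zero or more), so this pattern requires
# "beginner" to appear somewhere before the trailing ".png".
beginner_png = [s for s in sources if re.search(r'beginner.*\.png$', s)]
print(beginner_png)  # ['static/python-for-beginners.png']
```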

Source code

Test regular expressions online: http://www.regexpal.com/

Online Unicode lookup: http://unicodelookup.com