# 5.1 The Pervert Crawler (PTT Beauty Board Downloader)

~~I looove this crawler so much!!!! (gets dragged away~~

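Alright, as the name suggests, this crawler heads to the PTT Beauty board (表特版), scrapes the images, and saves them to your ~~D drive~~ computer. Exception handling matters a lot here: one dead link should not abort the whole run, because ~~no error is allowed to stop a pervert from scraping images~~\~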
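
Before the full listing, here is a minimal sketch of the "log it and keep going" idea. The names `process_all` and `fake_download` are hypothetical and not part of the crawler; `fake_download` only pretends to download, so the snippet runs without touching the network.

```python
def process_all(items, handler):
    """Apply handler to every item; record failures instead of stopping."""
    succeeded, failed = [], []
    for item in items:
        try:
            succeeded.append(handler(item))
        except Exception as e:  # broad on purpose: a single failure is only logged
            print('Skipping', item, '->', e)
            failed.append(item)
    return succeeded, failed


def fake_download(url):
    # Stand-in for a real download (assumption for this sketch):
    # URLs containing 'missing' behave like a 404.
    if 'missing' in url:
        raise IOError('HTTP Error 404: Not Found')
    return url.split('/')[-1]


ok, bad = process_all(
    [
        'http://i.imgur.com/a.jpg',
        'http://i.imgur.com/missing.jpg',
        'http://i.imgur.com/b.jpg',
    ],
    fake_download,
)
print(ok)   # ['a.jpg', 'b.jpg']
print(bad)  # ['http://i.imgur.com/missing.jpg']
```

The point is where the `try` sits: inside the loop, so a failure costs one item rather than everything after it.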

```python
import requests
import time
from bs4 import BeautifulSoup
import os
import re
import urllib.request
import json


PTT_URL = 'https://www.ptt.cc'


def get_web_content(url):
    resp = requests.get(url=url, cookies={'over18': '1'})
    if resp.status_code != 200:
        print('Invalid url: ' + resp.url)
        return None
    else:
        return resp.text


def get_articles(dom, date):
    soup = BeautifulSoup(dom, 'html5lib')

    # The second link in the paging button group points at the previous (older) page.
    paging_div = soup.find('div', 'btn-group btn-group-paging')
    prev_url = paging_div.find_all('a')[1]['href']

    articles = []
    divs = soup.find_all('div', 'r-ent')
    for div in divs:
        if div.find('div', 'date').text.strip() == date:
            push_count = 0
            push_str = div.find('div', 'nrec').text
            if push_str:
                try:
                    push_count = int(push_str)
                except ValueError:
                    if push_str == '爆':
                        push_count = 99
                    elif push_str.startswith('X'):
                        push_count = -10

            if div.find('a'):
                href = div.find('a')['href']
                title = div.find('a').text
                author = div.find('div', 'author').text if div.find('div', 'author') else ''
                articles.append({
                    'title': title,
                    'href': href,
                    'push_count': push_count,
                    'author': author
                })
    return articles, prev_url


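def parse(dom):
    soup = BeautifulSoup(dom, 'html.parser')
    main_content = soup.find(id='main-content')
    if main_content is None:
        return []
    img_urls = []
    # href=True skips anchors that have no href attribute (avoids a KeyError).
    for link in main_content.find_all('a', href=True):
        # Dots are escaped so they match a literal '.' rather than any character.
        if re.match(r'^https?://(i\.)?(m\.)?imgur\.com', link['href']):
            img_urls.append(link['href'])
    return img_urls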


def save(img_urls, title):
    if not img_urls:
        return
    folder_name = title.strip()
    try:
        # exist_ok=True lets the script be re-run without failing on an existing folder.
        os.makedirs(folder_name, exist_ok=True)
    except OSError as e:
        print(e)
        return
    for img_url in img_urls:
        # Handle errors per image so that one bad link (e.g. a 404) does not skip the rest.
        try:
            # e.g. 'http://imgur.com/9487qqq.jpg'.split('//') -> ['http:', 'imgur.com/9487qqq.jpg']
            if img_url.split('//')[1].startswith('m.'):
                img_url = img_url.replace('//m.', '//i.')
            if not img_url.split('//')[1].startswith('i.'):
                img_url = img_url.split('//')[0] + '//i.' + img_url.split('//')[1]
            if not img_url.endswith('.jpg'):
                img_url += '.jpg'
            file_name = img_url.split('/')[-1]
            urllib.request.urlretrieve(img_url, os.path.join(folder_name, file_name))
        except Exception as e:
            print(e)


def main():
    current_page = get_web_content(PTT_URL + '/bbs/Beauty/index.html')
    if current_page:
        articles = []
        date = time.strftime("%m/%d").lstrip('0')
        current_articles, prev_url = get_articles(current_page, date)
        while current_articles:
            articles += current_articles
            current_page = get_web_content(PTT_URL + prev_url)
            if not current_page:
                # Stop paging if the previous page could not be fetched.
                break
            current_articles, prev_url = get_articles(current_page, date)

        for article in articles:
            print('Collecting beauty from:', article)
            page = get_web_content(PTT_URL + article['href'])
            if page:
                img_urls = parse(page)
                save(img_urls, article['title'])
                article['num_image'] = len(img_urls)

        with open('data.json', 'w', encoding='utf-8') as file:
            json.dump(articles, file, indent=2, sort_keys=True, ensure_ascii=False)


if __name__ == '__main__':
    main()
```
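
The URL rewriting inside `save()` is easier to follow (and to test) when pulled out on its own. The sketch below mirrors that logic in a standalone helper; the name `normalize_imgur_url` is hypothetical and does not appear in the original source.

```python
def normalize_imgur_url(img_url):
    """Rewrite an imgur link so it points at the i.imgur.com image host."""
    # 'http://imgur.com/9487qqq'.split('//', 1) -> ['http:', 'imgur.com/9487qqq']
    scheme, rest = img_url.split('//', 1)
    if rest.startswith('m.'):
        # Mobile host -> image host.
        rest = 'i.' + rest[len('m.'):]
    elif not rest.startswith('i.'):
        # Bare imgur.com -> image host.
        rest = 'i.' + rest
    url = scheme + '//' + rest
    # Same simplification as save(): anything not ending in .jpg gets .jpg appended.
    if not url.endswith('.jpg'):
        url += '.jpg'
    return url


print(normalize_imgur_url('http://imgur.com/9487qqq'))         # http://i.imgur.com/9487qqq.jpg
print(normalize_imgur_url('https://m.imgur.com/9487qqq'))      # https://i.imgur.com/9487qqq.jpg
print(normalize_imgur_url('https://i.imgur.com/9487qqq.jpg'))  # https://i.imgur.com/9487qqq.jpg
```

Note that, just like `save()`, this appends `.jpg` unconditionally, so a link that already ends in another extension such as `.png` would end up as `.png.jpg`. Treat it as a sketch of the idea rather than a complete URL handler.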

Output:

```
Collecting beauty from: {'title': '[正妹] pokcy 超好吃', 'href': '/bbs/Beauty/M.1495876101.A.BEF.html', 'push_count': 3, 'author': 'ljislovej'}
Collecting beauty from: {'title': '[神人] 求神此妹....', 'href': '/bbs/Beauty/M.1495877463.A.5FA.html', 'push_count': 0, 'author': 'ymtk3280'}
Collecting beauty from: {'title': '[正妹] 正妹記者', 'href': '/bbs/Beauty/M.1495877750.A.3ED.html', 'push_count': 0, 'author': 'ellemo'}
Collecting beauty from: {'title': '[神人] 網路上看到的 求神', 'href': '/bbs/Beauty/M.1495879474.A.F81.html', 'push_count': 4, 'author': 'thejackys'}
Collecting beauty from: {'title': '[公告] wkheinz 水桶', 'href': '/bbs/Beauty/M.1495815970.A.2AB.html', 'push_count': 18, 'author': 'ffwind'}
Collecting beauty from: {'title': '[正妹] GAL GADOT', 'href': '/bbs/Beauty/M.1495818616.A.AC7.html', 'push_count': 36, 'author': 'as314'}
Collecting beauty from: {'title': '[神人] 有人認識最近龜甲萬醬油的女主角嗎?', 'href': '/bbs/Beauty/M.1495843739.A.166.html', 'push_count': 0, 'author': 'DL3'}
Collecting beauty from: {'title': '[正妹] 職能治療師', 'href': '/bbs/Beauty/M.1495853540.A.128.html', 'push_count': 0, 'author': 'catiesweet'}
Collecting beauty from: {'title': '[正妹] 我覺得後方的趴板好像滿好用的........', 'href': '/bbs/Beauty/M.1495855959.A.AA1.html', 'push_count': 1, 'author': 'li08090627'}
Collecting beauty from: {'title': '[正妹] 越南', 'href': '/bbs/Beauty/M.1495856490.A.735.html', 'push_count': 10, 'author': 'xuexiaomi'}
Collecting beauty from: {'title': '[正妹] 一張', 'href': '/bbs/Beauty/M.1495858089.A.BE4.html', 'push_count': 10, 'author': 'iyowe'}
Collecting beauty from: {'title': '[正妹] 艾瑪史東', 'href': '/bbs/Beauty/M.1495858709.A.AE9.html', 'push_count': 49, 'author': 'howgain'}
Collecting beauty from: {'title': '[正妹] 這樣我可以......', 'href': '/bbs/Beauty/M.1495859133.A.819.html', 'push_count': 3, 'author': 'li08090627'}
Collecting beauty from: {'title': 'Re: [神人]高雄左營麥當勞民族店員', 'href': '/bbs/Beauty/M.1495859367.A.971.html', 'push_count': 3, 'author': 'poca777'}
HTTP Error 404: Not Found
Collecting beauty from: {'title': '[正妹] 短髮清新', 'href': '/bbs/Beauty/M.1495860044.A.E28.html', 'push_count': 17, 'author': 'kitagawa0822'}
Collecting beauty from: {'title': '[神人] 這位短髮正妹是誰？', 'href': '/bbs/Beauty/M.1495866159.A.57A.html', 'push_count': 3, 'author': 'fawangching'}
Collecting beauty from: {'title': '[神人] 昨天看GOGORO2直播', 'href': '/bbs/Beauty/M.1495866447.A.143.html', 'push_count': 8, 'author': 'jerry121937'}
Collecting beauty from: {'title': '[正妹] 一張 一則影片', 'href': '/bbs/Beauty/M.1495867850.A.267.html', 'push_count': 6, 'author': 'starmind2230'}
Collecting beauty from: {'title': '[神人] 在woo遇到的正妹', 'href': '/bbs/Beauty/M.1495870616.A.1D3.html', 'push_count': -10, 'author': 'qwe88599'}
Collecting beauty from: {'title': '[神人] 短髮正妹', 'href': '/bbs/Beauty/M.1495870739.A.E51.html', 'push_count': 0, 'author': 'llauoykcuf'}

Process finished with exit code 0
```
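
The `push_count` values in the output above (including the `-10`) come from the mapping in `get_articles()`: plain numbers are used as-is, `爆` becomes 99, and anything starting with `X` becomes -10. As a rough sketch, the same mapping can be isolated in a small helper. The name `parse_push_count` is hypothetical, and the extra `strip()` is an assumption added here for safety, not something the original code does.

```python
def parse_push_count(push_str):
    """Convert the text of PTT's 'nrec' cell into an integer score."""
    push_str = push_str.strip()  # assumption: tolerate stray whitespace
    if not push_str:
        return 0  # empty cell: no pushes recorded
    try:
        return int(push_str)
    except ValueError:
        if push_str == '爆':
            return 99   # same value the crawler assigns to '爆'
        if push_str.startswith('X'):
            return -10  # same value the crawler assigns to 'X...' markers
        return 0        # unknown marker: fall back to 0, like the original default


print(parse_push_count('36'))  # 36
print(parse_push_count(''))    # 0
print(parse_push_count('爆'))  # 99
print(parse_push_count('X1'))  # -10
```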

Looking at all these Beauty board images, I suddenly feel that knowing how to write Python is just awesome\~

![](/files/-M523td92Bt5VzV7CEt-)

![](/files/-M523tdBFiXEPEnlujhL)

Source code: [click here](https://github.com/yotsuba1022/web-crawler-practice/blob/master/ch5/ptt_beauty_crawler.py)

