5.1 The Creep Crawler (PTT Beauty Board Downloader)

I LOOOVE this crawler!!!! (gets dragged away)

Alright then. As the name suggests, this crawler goes to the Beauty board, scrapes the images, and saves them to your computer (straight into the legendary D: drive). Exception handling matters a lot in this scenario, because no mere error can be allowed to stop a creep from collecting pics~
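
The full program is below. But first, here is the heart of the "nothing stops the crawl" attitude in isolation; a minimal sketch, where download_image is a hypothetical stand-in for any step that might blow up:

import logging


def download_all(img_urls, download_image):
    for img_url in img_urls:
        try:
            download_image(img_url)  # hypothetical helper that may raise
        except Exception as e:
            # log the failure and keep going -- one bad image must not
            # abort the rest of the haul
            logging.warning('skipped %s: %s', img_url, e)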

import requests
import time
from bs4 import BeautifulSoup
import os
import re
import urllib.request
import json


PTT_URL = 'https://www.ptt.cc'


def get_web_content(url):
    # the over18 cookie answers PTT's age-check page for boards like Beauty
    resp = requests.get(url=url, cookies={'over18': '1'})
    if resp.status_code != 200:
        print('Invalid url: ' + resp.url)
        return None
    else:
        return resp.text


def get_articles(dom, date):
    # html5lib copes better with PTT's occasionally sloppy markup
    # (it is a separate package: pip install html5lib)
    soup = BeautifulSoup(dom, 'html5lib')

    # the second button in the paging group is '‹ 上頁' (previous page)
    paging_div = soup.find('div', 'btn-group btn-group-paging')
    prev_url = paging_div.find_all('a')[1]['href']

    articles = []
    divs = soup.find_all('div', 'r-ent')
    for div in divs:
        if div.find('div', 'date').text.strip() == date:
            push_count = 0
            push_str = div.find('div', 'nrec').text
            if push_str:
                try:
                    push_count = int(push_str)  # a plain number of pushes
                except ValueError:
                    # PTT shows '爆' for 100+ pushes and an 'X…' marker
                    # for heavily booed articles
                    if push_str == '爆':
                        push_count = 99
                    elif push_str.startswith('X'):
                        push_count = -10

            if div.find('a'):
                href = div.find('a')['href']
                title = div.find('a').text
                author = div.find('div', 'author').text if div.find('div', 'author') else ''
                articles.append({
                    'title': title,
                    'href': href,
                    'push_count': push_count,
                    'author': author
                })
    return articles, prev_url


def parse(dom):
    soup = BeautifulSoup(dom, 'html.parser')
    # image links in a PTT post body are plain <a> tags inside #main-content
    links = soup.find(id='main-content').find_all('a')
    img_urls = []
    for link in links:
        href = link.get('href', '')  # some <a> tags may lack an href
        # escape the dots so they only match a literal '.' in the host name
        if re.match(r'^https?://(i\.)?(m\.)?imgur\.com', href):
            img_urls.append(href)
    return img_urls


def save(img_urls, title):
    if not img_urls:
        return
    folder_name = title.strip()
    try:
        # raises if the folder already exists, which conveniently skips
        # articles we have downloaded before
        os.makedirs(folder_name)
    except OSError as e:
        print(e)
        return
    for img_url in img_urls:
        try:
            # e.g. 'http://imgur.com/9487qqq.jpg'.split('//')
            #   -> ['http:', 'imgur.com/9487qqq.jpg']
            if img_url.split('//')[1].startswith('m.'):
                img_url = img_url.replace('//m.', '//i.')
            if not img_url.split('//')[1].startswith('i.'):
                img_url = img_url.split('//')[0] + '//i.' + img_url.split('//')[1]
            if not img_url.endswith('.jpg'):
                img_url += '.jpg'
            file_name = img_url.split('/')[-1]
            urllib.request.urlretrieve(img_url, os.path.join(folder_name, file_name))
        except Exception as e:
            # one broken link must not stop the remaining downloads
            print(e)


def main():
    current_page = get_web_content(PTT_URL + '/bbs/Beauty/index.html')
    if current_page:
        articles = []
        date = time.strftime("%m/%d").lstrip('0')  # PTT lists dates as '2/14', not '02/14'
        current_articles, prev_url = get_articles(current_page, date)
        while current_articles:
            articles += current_articles
            current_page = get_web_content(PTT_URL + prev_url)
            if not current_page:
                break  # stop paging if an index page fails to load
            current_articles, prev_url = get_articles(current_page, date)

        for article in articles:
            print('Collecting beauty from:', article)
            page = get_web_content(PTT_URL + article['href'])
            if page:
                img_urls = parse(page)
                save(img_urls, article['title'])
                article['num_image'] = len(img_urls)

        with open('data.json', 'w', encoding='utf-8') as file:
            json.dump(articles, file, indent=2, sort_keys=True, ensure_ascii=False)


if __name__ == '__main__':
    main()
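
Once a run finishes, everything is also written to data.json, so you can slice the haul afterwards. A minimal sketch of reading it back and ranking the day's articles by push count:

import json

with open('data.json', encoding='utf-8') as f:
    articles = json.load(f)

# hottest first; articles whose page failed to load carry no 'num_image'
for article in sorted(articles, key=lambda a: a['push_count'], reverse=True):
    print(article['push_count'], article.get('num_image', 0), article['title'])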

Output
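
While it runs, the script prints one line per article it visits, roughly like this (the values here are made-up placeholders):

Collecting beauty from: {'title': '[正妹] 9487', 'href': '/bbs/Beauty/M.1234567890.A.ABC.html', 'push_count': 99, 'author': 'someone'}

When it finishes, each article has a folder named after its title, full of .jpg files, plus a data.json summarising the whole day.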

Staring at screens full of Beauty board pics, it suddenly hits me that being able to write Python is just the best~

Source code: click me
