4.1 Where do Gossiping board users come from?

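This crawler collects today's posts on the PTT Gossiping board, takes the first 50 of them, and checks which country each poster's IP address belongs to: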

import requests
import time
import json
import re
from bs4 import BeautifulSoup


PTT_URL = 'https://www.ptt.cc'
FREEGEOIP_API = 'http://freegeoip.net/json/'


def get_web_page(url):
    resp = requests.get(url=url, cookies={'over18': '1'})
    if resp.status_code != 200:
        print('Invalid url: ', resp.url)
        return None
    else:
        return resp.text


def get_articles(dom, date):
    soup = BeautifulSoup(dom, 'html5lib')
    # Retrieve the link of previous page
    paging_div = soup.find('div', 'btn-group btn-group-paging')
    prev_url = paging_div.find_all('a')[1]['href']

    articles = []
    divs = soup.find_all('div', 'r-ent')
    for d in divs:
        # If post date matched:
        if d.find('div', 'date').text.strip() == date:
            # To retrieve the push count:
            push_count = 0
            push_str = d.find('div', 'nrec').text
            if push_str:
                try:
                    push_count = int(push_str)
                except ValueError:
                    # If transform failed, it might be '爆', 'X1', 'X2', etc.
                    if push_str == '爆':
                        push_count = 99
                    elif push_str.startswith('X'):
                        push_count = -10

            # To retrieve title and href of the article:
            if d.find('a'):
                href = d.find('a')['href']
                title = d.find('a').text
                author = d.find('div', 'author').text if d.find('div', 'author') else ''
                articles.append({
                    'title': title,
                    'href': href,
                    'push_count': push_count,
                    'author': author
                })

    return articles, prev_url


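def get_ip(dom):
    # e.g., ※ 發信站: 批踢踢實業坊(ptt.cc), 來自: 27.52.6.175
    # Raw string so that '\d' and '\.' reach the regex engine unchanged
    pattern = r'來自: (\d+\.\d+\.\d+\.\d+)'
    match = re.search(pattern, dom)
    if match:
        # Group 1 holds only the IP address, without the '來自: ' prefix
        return match.group(1)
    else:
        return None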


def get_country(ip):
    if ip:
        data = json.loads(requests.get(FREEGEOIP_API + ip).text)
        # .get() avoids a KeyError when the response has no 'country_name' field
        return data.get('country_name') or None
    return None


def main():
    print('取得今日文章列表:')  # "Fetching today's article list:"
    current_page = get_web_page(PTT_URL + '/bbs/Gossiping/index.html')
    if current_page:
        articles = []
        # PTT shows dates like '6/04': no leading zero on the month
        today = time.strftime('%m/%d').lstrip('0')
        current_articles, prev_url = get_articles(current_page, today)
        # Follow the "previous page" link while pages still contain today's posts
        while current_articles:
            articles += current_articles
            current_page = get_web_page(PTT_URL + prev_url)
            if not current_page:
                # Stop paging if the previous page could not be fetched
                break
            current_articles, prev_url = get_articles(current_page, today)
        print('共 %d 篇文章' % (len(articles)))  # "%d articles in total"

        print('取得前50篇文章的IP:')  # "Fetching the IPs of the first 50 articles:"
        country_to_count = dict()
        for article in articles[:50]:
            print('查詢 IP:', article['title'])  # "Looking up IP:"
            page = get_web_page(PTT_URL + article['href'])
            if page:
                ip = get_ip(page)
                country = get_country(ip)
                # A key of None collects posts whose IP or country was not found
                country_to_count[country] = country_to_count.get(country, 0) + 1

        print('各國IP分佈: ')  # "IP distribution by country:"
        for k, v in country_to_count.items():
            print(k, v)


if __name__ == "__main__":
    main()

The output looks like this:

取得今日文章列表:
共 1450 篇文章
取得前50篇文章的IP:
查詢 IP: Re: [問卦] 統一根本是台灣唯一的出路了吧
查詢 IP: [問卦] 有人在武嶺嗎
查詢 IP: [問卦] 中國人吃大蒜蔥薑 氣味感人的八卦？
查詢 IP: [問卦] 為何不圈塊地 給甲甲自給自足？
查詢 IP: [問卦] 人類真的有登陸月球嗎
查詢 IP: Re: [問卦] 東京一哥是哪個地方？
查詢 IP: [問卦] 什麼時候開始大家喜歡跟別人撞包的八卦?
查詢 IP: [問卦] 在國道上時速80邊開車邊吃便當邊拍照是不是很厲害？
查詢 IP: [問卦] 為什麼國文課是必修科目？？？？？？？？
查詢 IP: [新聞] 鴿子身懷大量毒品　飛越邊境運毒被逮
查詢 IP: [新聞] 菲掃毒殺無赦 台灣就近變毒品中心
查詢 IP: [問卦] av男優穿建中制服拍甲片會被吉嗎？
查詢 IP: Re: [新聞] 中國遊客實拍　「台灣和柬埔寨很像」
查詢 IP: [問卦] 甲，T，雙性？？？
查詢 IP: Re: [問卦] 陳星工讀生是不是成功帶風向惹
查詢 IP: [問卦] 台灣同性戀什麼時候開始變風向了？
查詢 IP: [新聞] Ariana宣布重回曼徹斯特開唱 為受害者家
查詢 IP: [問卦] 有沒有圍棋跟傳武一樣中國垃圾八卦？
查詢 IP: [問卦] 褪黑激素臺灣藥局買的到嗎？
查詢 IP: [問卦] 一場不公平的選舉
查詢 IP: [公告] 八卦板板主投票提醒
查詢 IP: [問卦] 請問這三個符號到底是什麼意思？
查詢 IP: Re: [問卦] 說甲甲是愛滋高風險群算是歧視嗎?
查詢 IP: [問卦] 臺灣參加日本大胃王節目
查詢 IP: [問卦] 50% lakin這種牌子的衣服都誰在買？
查詢 IP: Re: [問卦] 遊戲有中文不買買日文的人在想啥
查詢 IP: Re: [新聞] 房仲百萬年薪拿不完 家庭事業兩兼顧
查詢 IP: [問卦] 東京一哥是哪個地方？
查詢 IP: [問卦] 有沒有股市上萬點民眾還是無感的八卦
查詢 IP: [問卦] 有沒有高度近視的八卦？
查詢 IP: [新聞] 如何對付文化恐怖份子?柯P：儘量做就對了
查詢 IP: [新聞] 妙齡女裸胸貼影印機  綠光劃過她大叫一聲
查詢 IP: Re: [問卦] 16歲女學生會想住豪宅想瘋了嗎?
查詢 IP: [問卦] 教主在哪個圈子最吃得開
查詢 IP: [問卦] 如果登入別人的人生想要幹嘛麼？
查詢 IP: [問卦] 藏鏡人有沒有自我意識?
查詢 IP: Re: [問卦] 有沒有中國遊戲在台灣發行 卻用日本配音?
查詢 IP: Re: [問卦] 遊戲有中文不買買日文的人在想啥
查詢 IP: [問卦] 惡魔果實能力者為何不當山賊
查詢 IP: [問卦] 取代tw ice
查詢 IP: [新聞] 白人學生髒話謾罵台裔師 影片瘋傳網友肉搜
查詢 IP: Re: [問卦] 陳星工讀生是不是成功帶風向惹
查詢 IP: [問卦] 李世石是怎麼贏AlphaGo的
查詢 IP: [問卦] 說甲甲是愛滋高風險群算是歧視嗎?
查詢 IP: [問卦] 反串??
查詢 IP: Re: [新聞] 李明哲遭陸逮捕 美國務院:鼓勵北京與台北
查詢 IP: Re: [新聞] 陸生問對習近平評價 馬英九：酒量不錯
查詢 IP: Re: [問卦] 統一根本是台灣唯一的出路了吧
查詢 IP: [問卦] youtube排擠台灣？
查詢 IP: [問卦] 有沒有中國遊戲在台灣發行 卻用日本配音?
各國IP分佈: 
Taiwan 45
China 2
Japan 1
Macao 1
Vietnam 1

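The per-country tally at the end of main() can also be written with collections.Counter from the standard library. The following is only a minimal sketch, not part of the original script: tally_countries is a hypothetical helper name, and the sample list is made-up data shaped like the values get_country returns (including None for failed lookups).

```python
from collections import Counter


def tally_countries(countries):
    """Count how many times each country name appears.

    None entries (failed lookups) are counted under the key None,
    the same way the dict-based loop in main() treats them.
    """
    return Counter(countries)


# Hypothetical sample data, not real lookup results
sample = ['Taiwan', 'Taiwan', 'Japan', None, 'Taiwan']
counts = tally_countries(sample)

# most_common() yields (country, count) pairs, largest count first
for country, n in counts.most_common():
    print(country, n)
```

Compared with a plain dict, Counter removes the "is the key already there?" branch and gives sorted output through most_common().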

Click here for the source code
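To see the IP extraction step in isolation, the sketch below applies the same kind of regular expression as get_ip to the sample footer line quoted in the script's comment. The names IP_PATTERN and extract_ip are hypothetical, chosen only for this illustration, and the pattern does not check that each number is a valid octet (0 to 255).

```python
import re

# Capture the dotted-quad that follows '來自: ' ("from:") in a PTT post footer
IP_PATTERN = re.compile(r'來自: (\d+\.\d+\.\d+\.\d+)')


def extract_ip(text):
    """Return the first IP-looking string after '來自: ', or None if absent."""
    match = IP_PATTERN.search(text)
    return match.group(1) if match else None


footer = '※ 發信站: 批踢踢實業坊(ptt.cc), 來自: 27.52.6.175'
print(extract_ip(footer))            # prints 27.52.6.175
print(extract_ip('no footer here'))  # prints None
```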

The country is looked up by sending the IP address to the freegeoip API. Details are available on the official site: https://freegeoip.net
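The lookup described above can be sketched so that it tolerates network failures and can be tried without a live service. This is an illustrative variant, not the script's own code: it swaps requests for the standard library's urllib, the names fetch_json, lookup_country and fake_fetch are hypothetical, and the fake response is made-up data that only assumes the 'country_name' field the script reads.

```python
import json
import urllib.request

FREEGEOIP_API = 'http://freegeoip.net/json/'  # same endpoint as the script


def fetch_json(url, timeout=5):
    """Return the parsed JSON body of url, or None on common network/parsing errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.loads(resp.read().decode('utf-8'))
    except (OSError, ValueError):
        # URLError/HTTPError are OSError subclasses; bad JSON raises ValueError
        return None


def lookup_country(ip, fetch=fetch_json):
    """Map an IP string to a country name, or None when it cannot be resolved."""
    if not ip:
        return None
    data = fetch(FREEGEOIP_API + ip)
    if not isinstance(data, dict):
        return None
    return data.get('country_name') or None


# Offline demonstration: a stand-in for the network call (made-up response)
def fake_fetch(url):
    return {'ip': url.rsplit('/', 1)[-1], 'country_name': 'Taiwan'}


print(lookup_country('27.52.6.175', fetch=fake_fetch))  # prints Taiwan
print(lookup_country(None, fetch=fake_fetch))           # prints None
```

Passing the fetch function as a parameter keeps the parsing logic separate from the HTTP call, so the lookup can be exercised with canned data before pointing it at a real geolocation service.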


Last updated 5 years ago
