
# 4.1 Where Do Gossiping Board Users Come From?

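This crawler collects today's posts on the PTT Gossiping board, takes the first 50, and looks up which country each poster's IP address belongs to: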

```python
import requests
import time
import json
import re
from bs4 import BeautifulSoup


PTT_URL = 'https://www.ptt.cc'
FREEGEOIP_API = 'http://freegeoip.net/json/'


def get_web_page(url):
    resp = requests.get(url=url, cookies={'over18': '1'})
    if resp.status_code != 200:
        print('Invalid url: ', resp.url)
        return None
    else:
        return resp.text


def get_articles(dom, date):
    soup = BeautifulSoup(dom, 'html5lib')
    # Retrieve the link of previous page
    paging_div = soup.find('div', 'btn-group btn-group-paging')
    prev_url = paging_div.find_all('a')[1]['href']

    articles = []
    divs = soup.find_all('div', 'r-ent')
    for d in divs:
        # If post date matched:
        if d.find('div', 'date').text.strip() == date:
            # To retrieve the push count:
            push_count = 0
            push_str = d.find('div', 'nrec').text
            if push_str:
                try:
                    push_count = int(push_str)
                except ValueError:
                    # If transform failed, it might be '爆', 'X1', 'X2', etc.
                    if push_str == '爆':
                        push_count = 99
                    elif push_str.startswith('X'):
                        push_count = -10

            # To retrieve title and href of the article:
            if d.find('a'):
                href = d.find('a')['href']
                title = d.find('a').text
                author = d.find('div', 'author').text if d.find('div', 'author') else ''
                articles.append({
                    'title': title,
                    'href': href,
                    'push_count': push_count,
                    'author': author
                })

    return articles, prev_url


def get_ip(dom):
    # e.g., ※ 發信站: 批踢踢實業坊(ptt.cc), 來自: 27.52.6.175
    pattern = r'來自: \d+\.\d+\.\d+\.\d+'
    match = re.search(pattern, dom)
    if match:
        return match.group(0).replace('來自: ', '')
    else:
        return None


def get_country(ip):
    # Query the freegeoip JSON API; returns None if the IP or country is missing.
    if ip:
        data = json.loads(requests.get(FREEGEOIP_API + ip).text)
        # .get() avoids a KeyError when the response lacks 'country_name'.
        return data.get('country_name') or None
    return None


def main():
    print('取得今日文章列表:')
    current_page = get_web_page(PTT_URL + '/bbs/Gossiping/index.html')
    if current_page:
        articles = []
        today = time.strftime('%m/%d').lstrip('0')
        current_articles, prev_url = get_articles(current_page, today)
        while current_articles:
            articles += current_articles
            current_page = get_web_page(PTT_URL + prev_url)
            if not current_page:
                # Stop paging if the previous page could not be fetched.
                break
            current_articles, prev_url = get_articles(current_page, today)
        print('共 %d 篇文章' % (len(articles)))

        print('取得前50篇文章的IP:')
        country_to_count = dict()
        for article in articles[:50]:
            print('查詢 IP:', article['title'])
            page = get_web_page(PTT_URL + article['href'])
            if page:
                ip = get_ip(page)
                country = get_country(ip)
                # None (unknown IP or country) is counted under its own key.
                country_to_count[country] = country_to_count.get(country, 0) + 1

        print('各國IP分佈: ')
        for k, v in country_to_count.items():
            print(k, v)


if __name__ == "__main__":
    main()
```

The output looks like this:

```
取得今日文章列表:
共 1450 篇文章
取得前50篇文章的IP:
查詢 IP: Re: [問卦] 統一根本是台灣唯一的出路了吧
查詢 IP: [問卦] 有人在武嶺嗎
查詢 IP: [問卦] 中國人吃大蒜蔥薑 氣味感人的八卦？
查詢 IP: [問卦] 為何不圈塊地 給甲甲自給自足？
查詢 IP: [問卦] 人類真的有登陸月球嗎
查詢 IP: Re: [問卦] 東京一哥是哪個地方？
查詢 IP: [問卦] 什麼時候開始大家喜歡跟別人撞包的八卦?
查詢 IP: [問卦] 在國道上時速80邊開車邊吃便當邊拍照是不是很厲害？
查詢 IP: [問卦] 為什麼國文課是必修科目？？？？？？？？
查詢 IP: [新聞] 鴿子身懷大量毒品　飛越邊境運毒被逮
查詢 IP: [新聞] 菲掃毒殺無赦 台灣就近變毒品中心
查詢 IP: [問卦] av男優穿建中制服拍甲片會被吉嗎？
查詢 IP: Re: [新聞] 中國遊客實拍　「台灣和柬埔寨很像」
查詢 IP: [問卦] 甲，T，雙性？？？
查詢 IP: Re: [問卦] 陳星工讀生是不是成功帶風向惹
查詢 IP: [問卦] 台灣同性戀什麼時候開始變風向了？
查詢 IP: [新聞] Ariana宣布重回曼徹斯特開唱 為受害者家
查詢 IP: [問卦] 有沒有圍棋跟傳武一樣中國垃圾八卦？
查詢 IP: [問卦] 褪黑激素臺灣藥局買的到嗎？
查詢 IP: [問卦] 一場不公平的選舉
查詢 IP: [公告] 八卦板板主投票提醒
查詢 IP: [問卦] 請問這三個符號到底是什麼意思？
查詢 IP: Re: [問卦] 說甲甲是愛滋高風險群算是歧視嗎?
查詢 IP: [問卦] 臺灣參加日本大胃王節目
查詢 IP: [問卦] 50% lakin這種牌子的衣服都誰在買？
查詢 IP: Re: [問卦] 遊戲有中文不買買日文的人在想啥
查詢 IP: Re: [新聞] 房仲百萬年薪拿不完 家庭事業兩兼顧
查詢 IP: [問卦] 東京一哥是哪個地方？
查詢 IP: [問卦] 有沒有股市上萬點民眾還是無感的八卦
查詢 IP: [問卦] 有沒有高度近視的八卦？
查詢 IP: [新聞] 如何對付文化恐怖份子?柯P：儘量做就對了
查詢 IP: [新聞] 妙齡女裸胸貼影印機  綠光劃過她大叫一聲
查詢 IP: Re: [問卦] 16歲女學生會想住豪宅想瘋了嗎?
查詢 IP: [問卦] 教主在哪個圈子最吃得開
查詢 IP: [問卦] 如果登入別人的人生想要幹嘛麼？
查詢 IP: [問卦] 藏鏡人有沒有自我意識?
查詢 IP: Re: [問卦] 有沒有中國遊戲在台灣發行 卻用日本配音?
查詢 IP: Re: [問卦] 遊戲有中文不買買日文的人在想啥
查詢 IP: [問卦] 惡魔果實能力者為何不當山賊
查詢 IP: [問卦] 取代tw ice
查詢 IP: [新聞] 白人學生髒話謾罵台裔師 影片瘋傳網友肉搜
查詢 IP: Re: [問卦] 陳星工讀生是不是成功帶風向惹
查詢 IP: [問卦] 李世石是怎麼贏AlphaGo的
查詢 IP: [問卦] 說甲甲是愛滋高風險群算是歧視嗎?
查詢 IP: [問卦] 反串??
查詢 IP: Re: [新聞] 李明哲遭陸逮捕 美國務院:鼓勵北京與台北
查詢 IP: Re: [新聞] 陸生問對習近平評價 馬英九：酒量不錯
查詢 IP: Re: [問卦] 統一根本是台灣唯一的出路了吧
查詢 IP: [問卦] youtube排擠台灣？
查詢 IP: [問卦] 有沒有中國遊戲在台灣發行 卻用日本配音?
各國IP分佈: 
Taiwan 45
China 2
Japan 1
Macao 1
Vietnam 1

Process finished with exit code 0
```
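
The per-country tally printed at the end of the output can also be built with `collections.Counter` instead of a hand-rolled dictionary. The following is a minimal sketch, not part of the original script: `tally_countries` is a hypothetical helper, the `'Unknown'` label for failed lookups is an assumption, and the input list simply mirrors the distribution shown in the output above.

```python
from collections import Counter


def tally_countries(countries):
    """Count how many times each country name appears.

    None entries (IP not found or lookup failed) are grouped under
    the label 'Unknown' so they stay visible in the tally.
    """
    return Counter(c if c else 'Unknown' for c in countries)


# Sample input mirroring the distribution in the output above.
countries = ['Taiwan'] * 45 + ['China'] * 2 + ['Japan', 'Macao', 'Vietnam']
counts = tally_countries(countries)

# most_common() sorts by count, highest first.
for country, count in counts.most_common():
    print(country, count)
```

Sorting by frequency puts the dominant country first, which is usually what a reader wants to see in a distribution like this one.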

The source code is available [here](https://github.com/yotsuba1022/web-crawler-practice/blob/master/ch4/ptt_gossiping_ip.py).

The country lookup works by sending each IP to the freegeoip API; details are documented on the official site: <https://freegeoip.net>
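
To make the lookup step concrete, here is a minimal offline sketch of its two halves: pulling the IP out of an article page and reading the country from a JSON response body. It assumes the response is a JSON object with a `country_name` field, which is the only field the script above relies on. `extract_ip`, `parse_country`, and the sample payload are hypothetical and for illustration only; no network request is made.

```python
import json
import re

# Matches the footer PTT appends to each post, e.g.
# "※ 發信站: 批踢踢實業坊(ptt.cc), 來自: 27.52.6.175"
IP_PATTERN = re.compile(r'來自: (\d+\.\d+\.\d+\.\d+)')


def extract_ip(page_text):
    """Return the poster's IP from the article text, or None."""
    match = IP_PATTERN.search(page_text)
    return match.group(1) if match else None


def parse_country(payload):
    """Return the country name from a JSON response body, or None.

    Assumption: the body is a JSON object with a 'country_name' key.
    """
    try:
        data = json.loads(payload)
    except ValueError:
        # Body was not valid JSON (json.JSONDecodeError is a ValueError).
        return None
    if not isinstance(data, dict):
        return None
    # Treat a missing or empty country name as unknown.
    return data.get('country_name') or None


footer = '※ 發信站: 批踢踢實業坊(ptt.cc), 來自: 27.52.6.175'
# Hypothetical response body, for illustration only.
sample_payload = '{"ip": "27.52.6.175", "country_name": "Taiwan"}'

print(extract_ip(footer))
print(parse_country(sample_payload))
```

Keeping the parsing separate from the HTTP call means both helpers can be checked without hitting PTT or the geolocation service.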

