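Preface

In the previous post we hopped at random between article pages inside Wikipedia and ignored links to external sites. This post deals with a site's external links and tries to collect some data about them. Unlike crawling a single domain, sites on different domains vary widely in structure, so our code has to be more flexible to cope with them.
For that reason we write the code as a set of functions that can be combined to serve different kinds of crawling needs.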

Hopping to random external links

With this set of functions, about 50 lines are enough to crawl external sites.
Example code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
import datetime
import random
from urllib.parse import quote

pages = set()
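# random.seed() no longer accepts a datetime object (Python 3.11+), so seed with the timestamp
random.seed(datetime.datetime.now().timestamp())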


# Collect all links found on a page

# Get all internal links on a page
def get_internal_links(soup, include_url):
    internal_links = []
    print(include_url)
    # Find all links whose href contains include_url
    # (re.escape keeps the dots in the URL from acting as regex wildcards)
    for link in soup.find_all('a',
                              href=re.compile(r'^.*' + re.escape(include_url))):
        href = link.attrs['href']
        if href not in internal_links:
            internal_links.append(href)
    return internal_links


# Get all external links on a page
def get_external_links(soup, exclude_url):
    external_links = []
    # Find all links that start with 'http' or 'www' and do not contain the
    # current URL (re.escape keeps its dots from acting as regex wildcards)
    for link in soup.find_all('a',
                              href=re.compile(r'^(http|www)((?!' + re.escape(exclude_url) + ').)*$')):
        href = link.attrs['href']
        if href not in external_links:
            external_links.append(href)
    return external_links


# Split a URL into its parts; the first part is the domain
# (only the 'http://' prefix is stripped)
def split_address(address):
    address_parts = address.replace('http://', '').split('/')
    return address_parts


# Hop to a random external link
def get_random_external_link(starting_page):
    html = urlopen(starting_page)

    soup = BeautifulSoup(html, 'lxml')
    external_links = get_external_links(
        soup, split_address(starting_page)[0])  # find the domain URL
    if len(external_links) == 0:
        # No external links on this page: move to a random internal page and retry
        internal_links = get_internal_links(soup, starting_page)
        print(len(internal_links))
        return get_random_external_link(random.choice(internal_links))
    else:
        return random.choice(external_links)



hop_count = set()

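# Follow external links only; loop sets the number of hops (default 5)
def follow_external_only(starting_site, loop=5):
    global hop_count
    external_link = get_random_external_link(
        quote(starting_site, safe='/:?='))
    print('Random external link is: ' + external_link)
    # The source listing breaks off after "while len(hop_count)"; the rest is
    # reconstructed here to match the output shown below.
    while len(hop_count) < loop:
        hop_count.add(external_link)
        print(len(hop_count))
        follow_external_only(external_link, loop)


follow_external_only('http://www.baidu.com')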

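The code has no exception handling and no countermeasures against anti-scraping defences, so errors are to be expected. Because the hops are random, you can run it several times; if you are interested, use the error from each run to improve the code.
Output: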

Random external link is: http://v.baidu.com/v?ct=301989888&rn=20&pn=0&db=0&s=25&ie=utf-8&word=
1
Random external link is: http://baishi.baidu.com/watch/6388818335201070269.html
2
Random external link is: http://v.baidu.com/tv/
3
Random external link is: http://player.baidu.com/yingyin.html
4
Random external link is: http://help.baidu.com/question?prod_en=player
5
Random external link is: http://home.baidu.com
[Finished in 6.3s]
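
Since the listing above has no exception handling, one possible first step is to wrap `urlopen` in a small helper. The sketch below is only an illustration: `fetch_html`, its `timeout` parameter and the `Mozilla/5.0` User-Agent string are assumptions, not part of the original code, and it covers only the most common failures. It returns None instead of raising, so the caller can skip a bad link.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Return the page body as bytes, or None if the request fails."""
    try:
        # Some sites reject urllib's default User-Agent, so send a browser-like one
        # (the value here is an assumption; adjust it as needed).
        request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        with urlopen(request, timeout=timeout) as response:
            return response.read()
    except HTTPError as error:
        # The server answered, but with an error status such as 403 or 404.
        print('HTTP error', error.code, 'for', url)
    except (OSError, ValueError) as error:
        # OSError covers URLError and timeouts; ValueError covers malformed URLs
        # such as relative links that have no scheme.
        print('Could not fetch', url, ':', error)
    return None
```

A function such as `get_random_external_link` could call `fetch_html(starting_page)` and pick a different link whenever it gets None back.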

Collecting all external links on a site

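The advantage of writing the code as functions is that it can easily be modified or extended to meet different needs without breaking what already works. For example:
Goal: crawl an entire site for external links and record each one.
We can add the following function: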

# Collects a list of all external URLs found on the site
all_ext_links = set()
all_int_links = set()


def get_all_external_links(site_url):
    html = urlopen(site_url)
    soup = BeautifulSoup(html, 'lxml')
    print(split_address(site_url)[0])
    internal_links = get_internal_links(soup, split_address(site_url)[0])
    external_links = get_external_links(soup, split_address(site_url)[0])
    for link in external_links:
        if link not in all_ext_links:
            all_ext_links.add(link)
            print(link)
    for link in internal_links:
        if link not in all_int_links:
            print('About to get link: ' + link)
            all_int_links.add(link)
            get_all_external_links(link)

# follow_external_only("http://www.baidu.com")
get_all_external_links('http://oreilly.com')

The output is as follows:

oreilly.com
oreilly.com
https://cdn.oreillystatic.com/pdf/oreilly_high_performance_organizations_whitepaper.pdf
http://twitter.com/oreillymedia
http://fb.co/OReilly
https://www.linkedin.com/company/oreilly-media
https://www.youtube.com/user/OreillyMedia
About to get link: https://www.oreilly.com
https:
https:
https://www.oreilly.com
http://www.oreilly.com/ideas
https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav
http://www.oreilly.com/conferences/
http://shop.oreilly.com/
http://members.oreilly.com
https://www.oreilly.com/topics
https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+get+started+now
https://www.safaribooksonline.com/accounts/login/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170203+homepage+sign+in
https://www.safaribooksonline.com/live-training/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+take+a+live+online+course
https://www.safaribooksonline.com/learning-paths/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+follow+a+path
https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170505+homepage+unlimited+access
http://www.oreilly.com/live-training/?view=grid
https://www.safaribooksonline.com/your-experience/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170201+homepage+safari+platform
https://www.oreilly.com/ideas/8-data-trends-on-our-radar-for-2017?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+2017+trends
https://www.oreilly.com/ideas?utm_medium=referral&utm_source=oreilly.com&utm_campaign=lgen&utm_content=link+read+latest+articles
http://www.oreilly.com/about/
http://www.oreilly.com/work-with-us.html
http://www.oreilly.com/careers/
http://shop.oreilly.com/category/customer-service.do
http://www.oreilly.com/about/contact.html
http://www.oreilly.com/emails/newsletters/
http://www.oreilly.com/terms/
http://www.oreilly.com/privacy.html
http://www.oreilly.com/about/editorial_independence.html
About to get link: https://www.safaribooksonline.com/?utm_medium=content&utm_source=oreilly.com&utm_campaign=lgen&utm_content=20170601+nav
https:
https:
https://www.oreilly.com/
About to get link: https://www.oreilly.com/
https:
https:
About to get link: https://www.oreilly.com/topics
......
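
The `https:` lines in the output above appear because `split_address` only strips the `http://` prefix, so for an `https://` address the first element of the split is `https:` rather than the domain. One possible alternative is sketched below using the standard library's `urlparse`; the helper name `get_domain` is hypothetical and not part of the original code.

```python
from urllib.parse import urlparse


def get_domain(address):
    """Return the host part of a URL, whatever its scheme."""
    # urlparse only fills in netloc when the address has a scheme or starts
    # with '//', so prefix bare addresses such as 'oreilly.com/ideas'.
    if '://' not in address:
        address = '//' + address
    return urlparse(address).netloc


# What the original split does with an https address:
print('https://www.oreilly.com'.replace('http://', '').split('/')[0])  # prints https:

# What the urlparse-based helper returns:
print(get_domain('https://www.oreilly.com/topics'))  # prints www.oreilly.com
print(get_domain('http://oreilly.com'))              # prints oreilly.com
```

Passing `get_domain(site_url)` instead of `split_address(site_url)[0]` would give the link filters the actual host name to compare against.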

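The program keeps recursing until it hits Python's default recursion limit (1000 in CPython). If you are interested, you can add a default limit such as loop=5, as in the earlier code.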
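
One way to add such a limit without relying on recursion is to use a queue and stop after a fixed number of pages. The following is a minimal sketch under stated assumptions: `crawl_external_links`, its `get_links` callback and `max_pages` parameter are hypothetical names, and `fake_site` is made-up data on `.test` domains used so the example runs without network access.

```python
from collections import deque


def crawl_external_links(start_url, get_links, max_pages=5):
    """Breadth-first crawl that visits at most max_pages pages.

    get_links(url) must return (internal_links, external_links) for one page.
    """
    seen_internal = {start_url}
    seen_external = set()
    queue = deque([start_url])
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()
        visited += 1
        internal_links, external_links = get_links(url)
        seen_external.update(external_links)
        # Queue only internal pages that have not been seen before
        for link in internal_links:
            if link not in seen_internal:
                seen_internal.add(link)
                queue.append(link)
    return seen_external


# Made-up site map: page -> (internal links, external links)
fake_site = {
    'http://a.test/': (['http://a.test/1', 'http://a.test/2'], ['http://x.test/']),
    'http://a.test/1': (['http://a.test/'], ['http://y.test/']),
    'http://a.test/2': (['http://a.test/1'], ['http://z.test/']),
}


def fake_get_links(url):
    return fake_site.get(url, ([], []))


print(sorted(crawl_external_links('http://a.test/', fake_get_links, max_pages=2)))
# prints ['http://x.test/', 'http://y.test/']
```

In a real crawl, `get_links` would fetch the page, build the soup, and return the results of `get_internal_links` and `get_external_links` for it.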

Source (from the internet): "Python web crawler: crawling a page's external sites"
