A small Python crawler example: scraping photos of my idol G.E.M. (邓紫棋)

2015-03-09

Programming


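0x00 Preface (rambling)

There are always things online that I like and want to keep a copy of. For example, I am fond of photos of G.E.M., but I can't keep asking other people for them. So what to do?

It's simple: a small crawler written in Python takes care of it.

What is a crawler? You can look that up yourself.

By crawler I mean a small program that imitates what a browser does.

For example, to grab photos of G.E.M., I work out a way to parse the image URLs out of a page and then write a small function to download them.

Easy to say,

but where do we start?

As I write this sentence, I have only just decided which sites to scrape.

(I just searched Baidu for 邓紫棋壁纸 (G.E.M. wallpapers) and found this site: http://www.6188.com/show/12788_1.html)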

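0x01 Preparation and crawling approach

Python 3, plus Chrome or Firefox
Python 3 libraries: beautifulsoup4 and lxml (the scripts in this post also use requests)

The tasks are:

1. A first simple case: http://www.6188.com/show/12788_1.html

2. Parsing the Baidu API

3. Scraping photos from Xiami

4. Downloading G.E.M. photos from Instagram, which sits outside the firewall

5. For die-hard G.E.M. fans: Xiami, down! down! down!

Topics covered:

The most basic idea behind a crawler
Several parsing methods: regular expressions, bs4, and lxml
Using multiple threads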

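0x02 Scraping wallpapers from 6188.com (keyword: crawler)

First, the idea. To begin with, you need to know how to click a download button. (#-#)

1. Visit http://www.6188.com/show/12788_1.html

2. Click "download large image"

3. View the large image and save it by hand with "Save as"

That is how an ordinary person downloads a picture.

Now let's look at it through a programmer's eyes.

For what the browser actually does to render the page, see the part of my PyDjango write-up that covers the HTTP protocol.

Inspecting the HTTP traffic (Chrome's F12 developer tools) shows that getting the original image URL takes the following steps:

From http://www.6188.com/show/12788_1.html, parse out the next link.

From http://www.6188.com/show.php?pic=/flashAll/20140211/1392111065nvjKS7.jpg, parse out the next link.

From http://pic.6188.com/upload_6188s/flashAll/20140211/1392111065nvjKS7.jpg, download the image.

Laid out like this, it is simple and clear. This is the chain of links for downloading one image.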
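
The chain above can be sketched as two small parsing functions that work on HTML text, so they can be tried without touching the network. The sample markup is invented for illustration and only assumes what the regular expressions in this post assume: the gallery page contains a quoted /show.php?pic=...jpg link, and the large-image page contains a src='http://...jpg' attribute. The function names are hypothetical.

```python
import re

SITE = "http://www.6188.com"


def find_show_link(gallery_html):
    """Step 1 -> 2: return the absolute URL of the large-image page, or None."""
    m = re.search(r'(/show\.php.+?jpg)"', gallery_html)
    return SITE + m.group(1) if m else None


def find_image_url(show_html):
    """Step 2 -> 3: return the real image URL from a src='...' attribute, or None."""
    m = re.search(r"src='(http://.+?\.jpg)", show_html)
    return m.group(1) if m else None


# Invented sample markup (assumption: the real pages look roughly like this).
gallery_html = '<a href="/show.php?pic=/flashAll/20140211/1392111065nvjKS7.jpg">download</a>'
show_html = "<img src='http://pic.6188.com/upload_6188s/flashAll/20140211/1392111065nvjKS7.jpg'>"

show_link = find_show_link(gallery_html)
image_url = find_image_url(show_html)
print(show_link)
print(image_url)
```

Returning None when nothing matches lets the caller skip a page instead of crashing on it.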

By the same logic, wrapping the program below in a for loop downloads all 35 images in one go.


http://www.6188.com/show/12788_1.html
http://www.6188.com/show/12788_2.html
http://www.6188.com/show/12788_3.html
...
http://www.6188.com/show/12788_35.html
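
The page URLs above differ only in the trailing number, so they can be generated rather than typed out. A minimal sketch: `page_urls` is a hypothetical helper, and the count of 35 comes from the list above.

```python
def page_urls(album_id, page_count):
    """Yield the gallery page URLs for pages 1..page_count of one album."""
    for page in range(1, page_count + 1):  # range() excludes its upper bound
        yield "http://www.6188.com/show/{}_{}.html".format(album_id, page)


urls = list(page_urls(12788, 35))
print(urls[0])
print(urls[-1])
```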

The script below should be fairly easy to follow.

import os
import re
import requests

__author__ = 'micheal'

# Step 1: fetch the gallery page and find the link to the large-image page.
r = requests.get("http://www.6188.com/show/12788_1.html")
m = re.search(r'(/show\.php.+?jpg)"', r.text)

# group(1) is the path only; group(0) would also include the closing quote.
pic_url = "http://www.6188.com" + m.group(1)
print(pic_url)
data = requests.get(pic_url)
print(data.text)

# Step 2: the large-image page carries the real image URL in a src attribute.
m = re.search(r"src='(http://.+?\.jpg)", data.text)
real_url = m.group(1)

# Step 3: stream the image to disk.
try:
    print("real_url -- downloading -- " + real_url)
    r = requests.get(real_url, stream=True)
    fileName = "GEM.jpg"
    fileFullPath = os.path.join('/home/micheal/Pictures/', fileName)
    print("status code " + str(r.status_code))
    with open(fileFullPath, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024 * 2):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
except Exception:
    print("error")
    raise

print("done")

Add a for loop and tidy things up a little

import os
import re
import requests

__author__ = 'micheal'

s = requests.session()  # behave like a browser: reuse one session
img_store_dir = "/home/micheal/Pictures/GEM/6188"
if not os.path.exists(img_store_dir):
    os.makedirs(img_store_dir)

for i in range(1, 36):  # pages 12788_1 through 12788_35
    html_url = "http://www.6188.com/show/12788_" + str(i) + ".html"
    print("downloading " + html_url)
    try:
        r = s.get(html_url)
        # group(1) is the path only; group(0) would include the closing quote
        m = re.search(r'(/show\.php.+?jpg)"', r.text)
        pic_url = "http://www.6188.com" + m.group(1)
        print(pic_url)
        data = s.get(pic_url)

        m = re.search(r"src='(http://.+?\.jpg)", data.text)
        real_url = m.group(1)

        print("real_url -- downloading -- " + real_url)
        r = s.get(real_url, stream=True)
        fileName = "GEM" + str(i) + ".jpg"
        fileFullPath = os.path.join(img_store_dir, fileName)
        print("status code " + str(r.status_code))
        with open(fileFullPath, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1024 * 2):
                if chunk:  # filter out keep-alive new chunks
                    f.write(chunk)
    except Exception:
        # one bad page (including a regex that found nothing) must not stop the loop
        print("error")
        continue


print("done")

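Review

This first small program should be very easy to understand.

So what are its shortcomings?

Downloads are too slow; it needs multiple threads.
The parsing approach does not generalize. Each of these pages has only one address to extract, so a regular expression is up to the job, but it will not cope with a complex page.
It downloads too few images.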
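
On the second point, a parser walks the document structure instead of matching raw text, so it keeps working when a page holds many tags with attributes in varying order and quote styles. The sketch below uses the standard library's html.parser to stay dependency-free; it stands in for the bs4 and lxml approaches named earlier, which are not shown here. The class name and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag, whatever the quoting."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with the quotes already removed
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


# Invented sample page: mixed quote styles, shuffled attributes, an unrelated link.
sample_html = """
<div><a href="/about.html">about</a>
<img src='http://example.com/a.jpg' alt="first">
<img alt="second" src="http://example.com/b.jpg"></div>
"""

collector = ImageSrcCollector()
collector.feed(sample_html)
print(collector.sources)
```

A single regular expression that handled both quote styles and both attribute orders would already be noticeably harder to read than this class.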

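0x03 Parsing the Baidu API (keyword: multithreading)

The main change is the addition of multithreaded downloading; the principle is basically the same as in the previous example.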

import json
import os
from queue import Queue
import threading
import requests

__author__ = 'micheal'

# http://image.baidu.com/i?tn=resultjsonavatarnew&ie=utf-8&word=%E9%82%93%E7%B4%AB%E6%A3%8B&cg=star&pn=0&rn=60

q = Queue(maxsize=0)
s = requests.session()
img_store_dir = "/home/micheal/Pictures/GEM/baidu"


def worker():
    while True:
        real_url = q.get()
        try:
            print("real_url -- downloading -- " + real_url)
            r = s.get(real_url, stream=True)
            fileName = real_url.split("/")[-1]
            fileFullPath = os.path.join(img_store_dir, fileName)
            print("status code " + str(r.status_code))
            with open(fileFullPath, 'wb') as f:
                for chunk in r.iter_content(chunk_size=1024 * 2):
                    if chunk:  # filter out keep-alive new chunks
                        f.write(chunk)
        except Exception:
            # log and carry on; re-raising here would kill the worker thread
            print("error")
        finally:
            # always mark the item done, otherwise q.join() never returns
            q.task_done()


if __name__ == "__main__":
    if not os.path.exists(img_store_dir):
        os.makedirs(img_store_dir)

    # start the 20 workers once, not once per result page
    for j in range(20):
        t = threading.Thread(target=worker)
        t.daemon = True
        t.start()

    for i in range(100):
        fr = i * 60  # pn moves forward 60 at a time; rn stays at 60 as in the sample URL above
        r = requests.get("http://image.baidu.com/i?tn=resultjsonavatarnew&ie=utf-8&word=%E9%82%93%E7%B4%AB%E6%A3%8B&cg=star&pn="+str(fr)+"&rn=60&itg=1&z=3&fr=&width=0&height=0&lm=-1&ic=0&s=0&st=-1")
        data = json.loads(r.text)

        # a page may hold fewer than 60 usable entries, so don't index blindly
        for img in data['imgs']:
            real_url = img.get('objURL')
            if real_url:
                print(real_url)
                q.put(real_url)

    q.join()

    print("done")
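
The query string in the script above is assembled by hand. As a sketch, the same URL can be built from a list of parameters with urllib.parse.urlencode, which also takes care of percent-encoding the search word. The parameter names are copied from the sample URL in the script; what each one means on Baidu's side is an assumption (pn as the offset of the first result, rn as the page size), and `build_page_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE_URL = "http://image.baidu.com/i"
PAGE_SIZE = 60


def build_page_url(word, page):
    """Return the query URL for one zero-based result page."""
    params = [
        ("tn", "resultjsonavatarnew"),
        ("ie", "utf-8"),
        ("word", word),            # urlencode percent-encodes this as UTF-8
        ("cg", "star"),
        ("pn", page * PAGE_SIZE),  # assumed: offset of the first result
        ("rn", PAGE_SIZE),         # assumed: number of results per page
    ]
    return BASE_URL + "?" + urlencode(params)


print(build_page_url("邓紫棋", 0))
print(build_page_url("邓紫棋", 2))
```

Only the first six parameters of the original URL are reproduced here; the remaining ones can be appended to the list in the same way.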

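Multithreading is now in place, but the results are not great: roughly 10% of the photos seem to have something wrong with them. Try lowering the thread count from 20.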
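
One way to experiment with the thread count and to spot damaged files is sketched below. It uses concurrent.futures.ThreadPoolExecutor, which is a different tool from the Queue-based workers above, and it checks each payload for the JPEG start and end markers. Everything here is illustrative: `fake_fetch` stands in for a real HTTP download, the URLs are invented, and the marker check is only a rough heuristic, since a perfectly good file may be in another format or carry trailing bytes.

```python
from concurrent.futures import ThreadPoolExecutor


def looks_like_jpeg(data):
    """Rough integrity check: JPEG data starts with FF D8 and ends with FF D9."""
    return len(data) >= 4 and data[:2] == b"\xff\xd8" and data[-2:] == b"\xff\xd9"


def download_all(urls, fetch, max_workers=5):
    """Fetch every URL with at most max_workers threads.

    fetch is any callable that takes a URL and returns bytes.
    Returns (good, bad): URLs whose payload passed or failed the check.
    """
    good, bad = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs
        for url, data in zip(urls, pool.map(fetch, urls)):
            (good if looks_like_jpeg(data) else bad).append(url)
    return good, bad


def fake_fetch(url):
    """Stand-in for a real download: the URL ending in 3.jpg comes back truncated."""
    payload = b"\xff\xd8" + b"pixels" + b"\xff\xd9"
    return payload[:-2] if url.endswith("3.jpg") else payload


urls = ["http://example.com/{}.jpg".format(n) for n in range(1, 6)]
good, bad = download_all(urls, fake_fetch, max_workers=2)
print(good)
print(bad)
```

With a real fetch function, the bad list gives the URLs worth retrying, and max_workers can be lowered step by step to see whether the failure rate drops.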

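That's it for now; I'll continue with the rest of the code tomorrow.

0x04 Scraping Xiami photo albums (keyword: anti-anti-crawler)

All right, time to reach our grubby hands toward the photo section of Xiami Music.

http://www.xiami.com/artist/pic-55712

Let's try fetching the page with yesterday's method.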


import requests

ss = requests.session()
r = ss.get("http://www.xiami.com/artist/pic-55712?spm=0.0.0.0.IaKt5o&page=3")
print(r.text)

But this runs into a problem:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html>
<head><title>400 Bad Request</title></head>
<body bgcolor="white"><script>
with(document)with(body)with(insertBefore(createElement("script"),firstChild))setAttribute("exparams","category=&userid=&aplus&yunid=&&asid=AABI4v5USkJh1pUp01o=",id="tb-beacon-aplus",src=(location>"https"?"//s":"//a")+".tbcdn.cn/s/aplus_v2.js")
</script>

<h1>400 Bad Request</h1>
<p>Your browser sent a request that this server could not understand. Sorry for the inconvenience.<br/>
Please report this message and include the following information to us.<br/>
Thank you very much!</p>
<table>
<tr>
<td>URL:</td>
<td>http://www.xiami.com/artist/pic-55712?spm=0.0.0.0.IaKt5o&amp;page=3</td>
</tr>
<tr>
<td>Server:</td>
<td>web-xiami-main-030.cm10</td>
</tr>
<tr>
<td>Date:</td>
<td>2015/03/10 20:23:36</td>
</tr>
</table>
<hr/>Powered by Tengine</body>
</html>


Process finished with exit code 0

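The code so far only targeted sites that do little to keep crawlers out. Today's target is a site that does have protective measures.

There's nothing for it but to modify the headers and request the page again. For the principle behind this, see my article on computer networks.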
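
As a sketch of what modifying the headers can look like: a dict of browser-like headers is passed along with the request. The header values below are illustrative assumptions and `fetch_page` is a hypothetical helper; whether these particular headers are enough for Xiami is not something this sketch can promise. The helper accepts any object with a requests-style get method, so a small recording stub is used here in place of a live session.

```python
BROWSER_HEADERS = {
    # Illustrative values (assumption): any realistic browser string would do.
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/41.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Referer": "http://www.xiami.com/",
}


def fetch_page(session, url, headers=BROWSER_HEADERS):
    """Request url through session, sending browser-like headers."""
    return session.get(url, headers=headers)


class RecordingSession:
    """Stub with a requests-style get(); records what would have been sent."""

    def __init__(self):
        self.calls = []

    def get(self, url, headers=None):
        self.calls.append((url, dict(headers or {})))
        return None


stub = RecordingSession()
fetch_page(stub, "http://www.xiami.com/artist/pic-55712")
sent_url, sent_headers = stub.calls[0]
print(sent_url)
print(sent_headers["User-Agent"])
```

With the real library, `fetch_page(requests.session(), "http://www.xiami.com/artist/pic-55712")` sends the same headers over the network.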


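Original-content statement:
Unless marked as compiled or reposted, every article on this blog is my own original work and may be freely reposted and shared for non-commercial purposes. Code is a separate matter: unless stated otherwise, it is released under the MIT license. I would suggest keeping the original author credit, but under MIT it counts as yours even if you only change the name. That is the open-source world, and I like things that can be picked up and used directly, in the purest spirit of free and open. But please, at least change the name and other details before claiming the copyright is yours.

About me:
Click the link to see my introduction as a web slideshow.

My GitHub: https://github.com/twocucao (there is not much there yet, but you are welcome to Star and Fork)
Jianshu: http://www.jianshu.com/users/9a7e0b9da317/latest_articles (rarely updated, with almost no technical write-ups)
Contact: twocucao@gmail.com
I am a programmer of modest ability and limited learning. If you spot any mistakes in my writing, corrections are welcome, technical ones especially.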