2016-07-18

[Study Notes] BeautifulSoup: A Document Parser for Web Scraping

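While analyzing housing prices I needed to scrape data from Lianjia and used BeautifulSoup to parse the pages, so here is a quick look at how it works.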

Brief Introduction

The official site describes it as follows:

> Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:
>
> 1. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.
> 2. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't detect one. Then you just have to specify the original encoding.
> 3. Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.
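The second point in the quote, about encodings, can be sketched as follows. This is a minimal example under my own assumptions: the GBK-encoded sample bytes are made up for illustration, and the built-in "html.parser" is used so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical input: a page delivered as GBK bytes with no charset declaration.
raw = "<html><head><title>链家</title></head></html>".encode("gbk")

# from_encoding states the original encoding explicitly,
# for the case where Beautiful Soup cannot detect it on its own.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

title = soup.title.string            # decoded to a Unicode string
utf8_bytes = soup.encode("utf-8")    # the document serialized as UTF-8 bytes

print(title)
print(type(utf8_bytes))
```

When the bytes do carry a charset declaration, the from_encoding argument can usually be left out.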

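Limitation: it cannot handle dynamic pages. BeautifulSoup only parses the HTML text it is given and does not execute JavaScript, so content generated in the browser is not visible to it.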
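A small sketch of that limitation, using a made-up page fragment (the `price` id and the script are hypothetical): the element that JavaScript would fill in is still empty after parsing, because the script is treated as plain text and never run.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the price is written into the div by JavaScript after load.
html = """
<div id="price"></div>
<script>document.getElementById("price").textContent = "100";</script>
"""

soup = BeautifulSoup(html, "html.parser")
price_text = soup.find(id="price").get_text()

# The script was never executed, so the div has no text.
print(repr(price_text))
```

For such pages the data usually has to come from somewhere else, for example the request the page itself makes in the background, or a tool that drives a real browser.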

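Using BeautifulSoup

Creating a BeautifulSoup object

from bs4 import BeautifulSoup
# html_context is the HTML text of the page; naming the parser avoids a warning
soup = BeautifulSoup(html_context, "html.parser")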
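To make the snippet above runnable on its own, here is a sketch with a small sample document. The content of `html_context` is my own assumption, built from the tags used later in these notes plus one extra link.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document reusing the tags shown in these notes.
html_context = """
<html><head><title>The Dormouse's story</title></head>
<body>
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
</body></html>
"""

# "html.parser" ships with Python; "lxml" or "html5lib" could be named here
# instead if those packages are installed.
soup = BeautifulSoup(html_context, "html.parser")

print(soup.title)
print(len(soup.find_all("a")))
```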

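Four kinds of objects

BeautifulSoup turns the HTML into a document tree. Every node in the tree is a Python object, and there are four kinds of objects.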

Tag

An individual tag in the HTML, for example:

<title>The Dormouse's story</title>

<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

print(soup.title)
# <title>The Dormouse's story</title>

print(soup.find('a', class_="sister"))
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
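As a sketch of what a Tag object carries, the following uses the same `<a>` tag as above and reads its name and attributes. Variable names are my own.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.find("a", class_="sister")

print(isinstance(tag, Tag))  # the result of find() is a Tag object
print(tag.name)              # the tag name
print(tag["id"])             # a single attribute, dictionary style
print(tag["class"])          # class can hold several values, so it is a list
print(tag.attrs)             # all attributes as a dict
```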

NavigableString

The text inside a tag. Note that .string returns a NavigableString, while .text returns a plain str.

print(soup.find('a', class_="sister").string)
# Elsie
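The difference between a tag's `.string` and its text can be sketched like this, on a made-up fragment. `.string` only works when the tag has a single child, while `get_text()` joins all text below the tag.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

html = "<p>Hello <b>world</b></p>"
soup = BeautifulSoup(html, "html.parser")

b_string = soup.b.string      # <b> has exactly one child, a NavigableString
p_string = soup.p.string      # <p> has two children, so .string is None
p_text = soup.p.get_text()    # all text below <p>, joined into a plain str

print(type(b_string).__name__, repr(b_string))
print(p_string)
print(repr(p_text))
```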

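BeautifulSoup

Similar to a Tag, but it represents the whole document. The attributes (attrs) of the tags found through it can be read with dictionary-style access:

print(soup.find('a', class_="sister")['href'])
# http://example.com/elsie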
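A short sketch of the document object and of attribute access, again on made-up markup. It also shows why dot access such as `.href` does not read an attribute: dot access on a tag searches for a child tag with that name.

```python
from bs4 import BeautifulSoup

html = (
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a id="nolink">no href here</a>'
)
soup = BeautifulSoup(html, "html.parser")

doc_name = soup.name                  # the document object has a special name
link = soup.find("a", class_="sister")

href_item = link["href"]              # raises KeyError if the attribute is missing
href_get = link.get("href")           # returns None if the attribute is missing
missing = soup.find("a", id="nolink").get("href")
child_lookup = link.href              # looks for a child <href> tag, not the attribute

print(doc_name, href_item, href_get, missing, child_lookup)
```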

Print all links on the page

hrefs = []
for link in soup.find_all('a'):
	# link.href would search for a child <href> tag; .get() reads the attribute
	hrefs.append(link.get('href'))
print(hrefs)
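A self-contained variant of the link loop, on a hypothetical fragment. Passing `href=True` to `find_all` keeps only the `<a>` tags that actually have an href attribute, so no None values end up in the list.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: two real links and one anchor without an href.
html = """
<a href="http://example.com/elsie">Elsie</a>
<a href="http://example.com/lacie">Lacie</a>
<a name="anchor-only">no link here</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True matches only tags on which the href attribute is present.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]
print(hrefs)
```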

Comment

A comment in the HTML. It is a special kind of NavigableString.
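Since the notes give no example for comments, here is a minimal sketch with a made-up comment. It shows how to recognize a Comment object and how to collect every comment in a document.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical fragment with one HTML comment.
html = "<p><!-- price table starts here --></p><p>visible</p>"
soup = BeautifulSoup(html, "html.parser")

first = soup.p.string                  # the only child of the first <p>
is_comment = isinstance(first, Comment)

# Collect all comments by filtering strings on their type.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))

print(is_comment, repr(first))
print(len(comments))
```

Checking the type this way is useful when extracting text, so that comment content is not mistaken for visible page text.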

Traversing the document tree

Use .children to iterate over a node's direct children and .parent to get its parent node.

<body>
	<ul>
		<li>1</li>
		<li>2</li>
	</ul>
</body>

soup.body.ul.children
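The traversal above can be tried out with the following sketch, which parses the same `<ul>` fragment. One thing worth knowing is that `.children` also yields the whitespace between tags as strings, so the example filters for Tag objects. Variable names are my own.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<body>
	<ul>
		<li>1</li>
		<li>2</li>
	</ul>
</body>"""
soup = BeautifulSoup(html, "html.parser")

ul = soup.body.ul

# .children is an iterator over direct children, whitespace strings included.
all_children = list(ul.children)
li_tags = [child for child in all_children if isinstance(child, Tag)]
items = [li.get_text() for li in li_tags]

# .parent goes one level up in the tree.
parent_name = ul.parent.name

print(items)
print(parent_name)
```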

BeautifulSoup documentation

ZhangAnam
