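While analyzing housing prices, I needed to scrape data from Lianjia and used BeautifulSoup to parse the pages, so here is a quick look at how it works.
A brief introduction
The official site describes it as follows: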
> Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:
>
> 1. Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.
> 2. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't detect one. Then you just have to specify the original encoding.
> 3. Beautiful Soup sits on top of popular Python parsers like lxml and html5lib, allowing you to try out different parsing strategies or trade speed for flexibility.
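
The second point of the quote can be illustrated with a small sketch. The GBK-encoded sample string is made up for illustration, and the choice of the built-in html.parser is an assumption; from_encoding is the constructor argument for stating the original encoding when it cannot be detected.

```python
from bs4 import BeautifulSoup

# Hypothetical input: bytes encoded as GBK with no encoding declared in the markup.
raw = "<html><body><p>房价</p></body></html>".encode("gbk")

# from_encoding states the original encoding of the incoming bytes.
soup = BeautifulSoup(raw, "html.parser", from_encoding="gbk")

# Inside the tree, text is ordinary Unicode.
text = soup.p.string

# On the way out, the document can be serialized as UTF-8 bytes.
utf8_bytes = soup.encode("utf-8")

print(text)
```

If the encoding is declared in the document or can be guessed, the from_encoding argument can simply be left out.
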
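Limitation: it cannot handle dynamic pages. BeautifulSoup only parses the HTML it is given and does not execute JavaScript, so content rendered in the browser never reaches it.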
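Using BeautifulSoup
Creating a BeautifulSoup object

    from bs4 import BeautifulSoup

    # html_context holds the HTML source as a string; naming the parser avoids a warning
    soup = BeautifulSoup(html_context, "html.parser")

The four kinds of objects
BeautifulSoup converts the HTML into a document tree in which every node is a Python object, and these objects come in four kinds.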
Tag
An individual tag in the HTML, for example:

    <title>The Dormouse's story</title>

    <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

    print(soup.title)
    # <title>The Dormouse's story</title>

    print(soup.find('a', class_="sister"))
    # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

NavigableString
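The text inside a tag. The .string attribute returns it as a NavigableString; .text returns the same content as a plain str.

    print(soup.find('a', class_="sister").text)
    # Elsie

BeautifulSoup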
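Similar to a Tag, but it represents the entire document. Attributes of the tags it contains can be read with dictionary-style access:

    print(soup.find('a', class_="sister")['href'])
    # http://example.com/elsie

Printing all links on the page

    hrefs = []
    for link in soup.find_all('a'):
        # link.href would look for a child <href> tag and return None
        hrefs.append(link.get('href'))
    print(hrefs)

Comment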
A comment in the HTML.
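
To see the four kinds side by side, here is a minimal sketch. The HTML snippet (including the comment) is made up for illustration, and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical snippet containing one tag with text and one HTML comment.
html = (
    '<p><a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>'
    "<!-- a comment --></p>"
)
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a", class_="sister")
# Comments are a special kind of string, so search the strings for one.
comment = soup.find(string=lambda s: isinstance(s, Comment))

kinds = {
    "soup": type(soup).__name__,
    "link": type(link).__name__,
    "link.string": type(link.string).__name__,
    "comment": type(comment).__name__,
}
for name, kind in kinds.items():
    print(name, "->", kind)
```

Since Comment is a subclass of NavigableString, code that walks over strings may want to check for comments explicitly before treating them as visible text.
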
Traversing the document tree
Use .children for all of a node's child nodes and .parent for its parent node.

    <body>
    <ul>
    <li>1</li>
    <li>2</li>
    </ul>
    </body>

    soup.body.ul.children
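
A runnable sketch of the same traversal, assuming the html.parser parser. One detail worth knowing: .children yields the whitespace strings between tags as well, so this version keeps only the Tag nodes.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = """<body>
<ul>
<li>1</li>
<li>2</li>
</ul>
</body>"""
soup = BeautifulSoup(html, "html.parser")

ul = soup.body.ul

# .children is an iterator over direct children, including newline strings.
items = [child for child in ul.children if isinstance(child, Tag)]
texts = [str(li.string) for li in items]

# .parent walks one level up the tree.
parent_of_li = items[0].parent.name
parent_of_ul = ul.parent.name

print(texts, parent_of_li, parent_of_ul)
```
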