DrissionPage/docs/实用示例/获取各国疫情排名.md
2022-01-18 00:35:15 +08:00

67 lines
2.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

此示例爬取全球新冠情况排行榜。该网站是纯 html 页面,特别适合 s 模式爬取和解析。
# 页面分析
网址https://www.outbreak.my/zh/world
目标:获取所有国家疫情排名信息,并显示出来。
![](https://gitee.com/g1879/DrissionPage-demos/raw/master/pics/%E5%BE%AE%E4%BF%A1%E6%88%AA%E5%9B%BE_20211231225026.jpg)
`F12`观察页面代码,可以发现:
- 标题栏有写列是不显示的,它们的`style`属性包含`display: none;`
- 表格内容中 3、5、7 列的真正数据在隐藏的 2、4、6 列中
![标题栏](https://gitee.com/g1879/DrissionPage-demos/raw/master/pics/QQ截图20220118002732.jpg)
![内容列](https://gitee.com/g1879/DrissionPage-demos/raw/master/pics/QQ截图20220118002945.jpg)
因此,在获取标题栏时,要跳过隐藏的列,在获取正文内容时,反而要用隐藏的数据。
# 示例代码
```python
from DrissionPage import MixPage
# 用 s 模式创建页面对象
page = MixPage('s')
# 访问数据网页
page.get('https://www.outbreak.my/zh/world')
# 获取表头元素
thead = page('tag:thead')
# 获取表头列,跳过其中的隐藏的列
title = thead.eles('tag:th@@-style:display: none;')
data = [th.text for th in title]
print(data) # 打印表头
# 获取内容表格元素
tbody = page('tag:tbody')
# 获取表格所有行
rows = tbody.eles('tag:tr')
for row in rows:
# 获取当前行所有列
cols = row.eles('tag:td')
# 生成当前行数据列表(跳过其中没用的几列)
data = [td.text for k, td in enumerate(cols) if k not in (2, 4, 6)]
print(data) # 打印行数据
```
# 结果
```
['总 (205)', '累积确诊', '死亡', '治愈', '现有确诊', '死亡率', '恢复率']
['美国', '55252823', '845745', '41467660', '12,939,418', '1.53%', '75.05%']
['印度', '34838804', '481080', '34266363', '91,361', '1.38%', '98.36%']
['巴西', '22277239', '619024', '21567845', '90,370', '2.78%', '96.82%']
['英国', '12748050', '148421', '10271706', '2,327,923', '1.16%', '80.57%']
['俄罗斯', '10499982', '308860', '9463919', '727,203', '2.94%', '90.13%']
['法国', '9740600', '123552', '8037752', '1,579,296', '1.27%', '82.52%']
......
```