
Learn some real techniques: scraping Zhihu, Bilibili, Douban, Douyin, WeChat official accounts, Weibo, and other platforms

苏生不惑 2024-03-23

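This is 苏生不惑's 463rd original article. Join my Knowledge Planet (知识星球).

A few days ago, a member of my Knowledge Planet asked how to download a Zhihu user's answers.

If you are interested, you are welcome to join my Knowledge Planet. It is updated almost every day, mainly with interesting websites and software I come across on the Chinese and international internet, plus tips from work and daily life. It covers a wide range of topics and is something of an internet treasure trove, which is why it is called "Internet Expert" (互联网达人). Every post is tagged, so you can browse by tag: https://t.zsxq.com/13bqoLXHJ

A long time ago I wrote an article about scraping data with Web Scraper. Today I am revisiting and sharing it: you can scrape data freely without writing any code.

This walkthrough uses the Zhihu account 渤海小吏 as the example: https://www.zhihu.com/people/dai-zong-66 . First install the Web Scraper browser extension (reply "scraper" in the official account chat to get the download link). After installing it, open the browser developer tools and click Import sitemap.

Copy the following code: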

{"_id":"zhihu_answer","startUrl":["https://www.zhihu.com/people/dai-zong-66/answers?page=[1-5]"],"selectors":[{"id":"row","parentSelectors":["_root"],"type":"SelectorElement","selector":"div.List-item","multiple":true},{"id":"知乎问题标题","parentSelectors":["row"],"type":"SelectorText","selector":"div[itemprop='zhihu:question'] a","multiple":false,"regex":""},{"id":"知乎问题链接","parentSelectors":["row"],"type":"SelectorElementAttribute","selector":"[itemprop='zhihu:question'] a","multiple":false,"extractAttribute":"href"}]}

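Click Data Preview to check that the data looks right.

Then click Scrape to start scraping.

The browser then scrapes the data automatically, with no intervention needed. When it finishes, the scraping window closes by itself and you can check that all the data was collected.

Finally, export the data to Excel. The file contains the title and link of every question the user answered.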

[Figure: exported spreadsheet of answered-question titles and links]
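To reuse the sitemap for another account, only the profile slug and the page range in `startUrl` need to change. A minimal Python sketch of that idea, assuming the selectors from the sitemap above still match the page; the function name `build_zhihu_answer_sitemap` and its parameters are hypothetical, chosen only for illustration:

```python
import json


def build_zhihu_answer_sitemap(user_slug: str, last_page: int) -> str:
    """Return Web Scraper sitemap JSON for one user's answer pages (1..last_page)."""
    sitemap = {
        "_id": "zhihu_answer",
        # The [1-N] part is Web Scraper's page-range syntax, as in the sitemap above.
        "startUrl": [
            f"https://www.zhihu.com/people/{user_slug}/answers?page=[1-{last_page}]"
        ],
        "selectors": [
            {"id": "row", "parentSelectors": ["_root"], "type": "SelectorElement",
             "selector": "div.List-item", "multiple": True},
            {"id": "知乎问题标题", "parentSelectors": ["row"], "type": "SelectorText",
             "selector": "div[itemprop='zhihu:question'] a", "multiple": False,
             "regex": ""},
            {"id": "知乎问题链接", "parentSelectors": ["row"],
             "type": "SelectorElementAttribute",
             "selector": "[itemprop='zhihu:question'] a", "multiple": False,
             "extractAttribute": "href"},
        ],
    }
    # ensure_ascii=False keeps the Chinese selector ids readable in the output.
    return json.dumps(sitemap, ensure_ascii=False)


print(build_zhihu_answer_sitemap("dai-zong-66", 5))
```

The printed JSON can be pasted into the Import sitemap box in the same way as the hand-written version.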

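To download the full text of every answer, you can run a second extraction over the scraped answer links; that part is left for you to explore. Scraping Zhihu articles works the same way. Import the following code: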

{"_id":"zhihu_zhuanlan","startUrl":["https://www.zhihu.com/people/dai-zong-66/posts/posts?page=[1-30]"],"selectors":[{"id":"row","type":"SelectorElement","parentSelectors":["_root"],"selector":"div.List-item","multiple":true,"delay":0},{"id":"知乎标题","type":"SelectorText","parentSelectors":["row"],"selector":"h2.ContentItem-title","multiple":false,"regex":"","delay":0},{"id":"知乎链接","type":"SelectorElementAttribute","parentSelectors":["row"],"selector":"h2.ContentItem-title span a ","multiple":false,"extractAttribute":"href","delay":0}]}

The exported Excel data contains the Zhihu article titles, links, comment counts, and upvote counts.
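Before running a second extraction over the exported links, it can help to turn them into absolute URLs and drop duplicates. The sketch below assumes the data was exported as CSV, that the link column is named after the selector id (`知乎问题链接`), and that some `href` values may be relative or protocol-relative. The helper name `absolute_links` and the sample rows are made up for illustration:

```python
import csv
import io
from urllib.parse import urljoin


def absolute_links(rows, column, base="https://www.zhihu.com/"):
    """Return unique absolute URLs from one column, keeping first-seen order."""
    seen = set()
    links = []
    for row in rows:
        href = (row.get(column) or "").strip()
        if not href:
            continue  # skip rows where the selector matched nothing
        url = urljoin(base, href)  # resolves "/path" and "//host/path" forms
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Stand-in for an exported CSV file; the rows are invented sample data.
sample = io.StringIO(
    "知乎问题标题,知乎问题链接\n"
    "Example question,//www.zhihu.com/question/1/answer/2\n"
    "Example question,//www.zhihu.com/question/1/answer/2\n"
    "Another question,/question/3/answer/4\n"
)
links = absolute_links(csv.DictReader(sample), "知乎问题链接")
print(links)
```

The cleaned list could then serve as the start URLs of a second sitemap that extracts the answer body.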

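If you also want to batch-download the articles of a Zhihu column, you can use the tool I developed (see "2023 update: the original tools and scripts developed by 苏生不惑"). Download results: articles and answers are saved to the html directory, with file names made of the date plus the title; all articles are merged into a single PDF file; videos are saved to the video directory.

Zhihu topics can be scraped as well. Import the following code: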

{"_id":"zhihu_topic","startUrl":["https://www.zhihu.com/topic/19559424/top-answers"],"selectors":[{"id":"row","parentSelectors":["_root"],"type":"SelectorElementScroll","selector":"div.List-item:nth-of-type(-n+10)","multiple":true,"delay":2000,"elementLimit":500},{"id":"知乎标题","parentSelectors":["row"],"type":"SelectorText","selector":"h2 a","multiple":false,"regex":""},{"id":"知乎链接","parentSelectors":["row"],"type":"SelectorLink","selector":"[itemprop='zhihu:question'] a[data-za-detail-view-element_name]","multiple":false,"linkType":"linkFromHref"}]}

Bilibili videos can be scraped too. For example, to scrape all videos of 木鱼水心 on Bilibili ( https://space.bilibili.com/927587/video ), import the following code:

{"_id":"bilibili_videos","startUrl":["https://space.bilibili.com/927587/video?tid=0&pn=[1-42:1]&keyword=&order=pubdate"],"selectors":[{"id":"row","parentSelectors":["_root"],"type":"SelectorElement","selector":"li.small-item","multiple":true},{"id":"视频标题","parentSelectors":["row"],"type":"SelectorText","selector":"a.title","multiple":false,"regex":""},{"id":"视频链接","parentSelectors":["row"],"type":"SelectorElementAttribute","selector":"a.cover","multiple":false,"extractAttribute":"href"},{"id":"视频封面","parentSelectors":["row"],"type":"SelectorElementAttribute","selector":"a.cover div.b-img picture img","multiple":false,"extractAttribute":"src"},{"id":"视频播放量","parentSelectors":["row"],"type":"SelectorText","selector":".play span","multiple":false,"regex":""},{"id":"视频长度","parentSelectors":["row"],"type":"SelectorText","selector":" a.cover  span.length","multiple":false,"regex":""},{"id":"发布时间","parentSelectors":["row"],"type":"SelectorText","selector":"span.time","multiple":false,"regex":""}]}
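The `pn=[1-42:1]` part of the start URL above is a page range: pages 1 through 42 in steps of 1. To show what such a range stands for, here is a small Python sketch that expands it into concrete URLs. It is a reimplementation for illustration, not the extension's own code, and the zero-padding rule (a start such as `001` keeps its width) is an assumption based on the range syntax described in the Web Scraper documentation:

```python
import re

# Matches [start-stop] or [start-stop:step] inside a URL.
RANGE = re.compile(r"\[(\d+)-(\d+)(?::(\d+))?\]")


def expand_range_url(url: str) -> list[str]:
    """Expand the first [start-stop:step] range in a URL into a list of URLs."""
    match = RANGE.search(url)
    if match is None:
        return [url]  # no range present, nothing to expand
    start_text, stop_text, step_text = match.groups()
    step = int(step_text) if step_text else 1
    # A start with leading zeros such as "001" requests zero-padded numbers.
    padded = len(start_text) > 1 and start_text.startswith("0")
    width = len(start_text) if padded else 0
    prefix, suffix = url[:match.start()], url[match.end():]
    return [
        prefix + str(n).zfill(width) + suffix
        for n in range(int(start_text), int(stop_text) + 1, step)
    ]


pages = expand_range_url(
    "https://space.bilibili.com/927587/video?tid=0&pn=[1-42:1]&keyword=&order=pubdate"
)
print(len(pages), pages[0], pages[-1])
```

Changing `42` to the account's current last page is the only edit needed when the number of videos grows.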
 

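The exported Excel data includes each video's title, link, cover, play count, length, and publish time. From 2013 to 2023 the account published more than 1,200 videos.

To scrape the Bilibili popular ranking, import the following code: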

{"_id":"bilibili","startUrl":["https://www.bilibili.com/v/popular/rank/all"],"selectors":[{"id":"row","multiple":true,"parentSelectors":["_root"],"selector":"li.rank-item","type":"SelectorElement"},{"id":"视频排名","multiple":false,"parentSelectors":["row"],"regex":"","selector":"i.num","type":"SelectorText"},{"id":"视频标题","multiple":false,"parentSelectors":["row"],"regex":"","selector":"a.title","type":"SelectorText"},{"id":"播放量","multiple":false,"parentSelectors":["row"],"regex":"","selector":".detail-state > span:nth-of-type(1)","type":"SelectorText"},{"id":"弹幕数","multiple":false,"parentSelectors":["row"],"regex":"","selector":"span:nth-of-type(2)","type":"SelectorText"},{"id":"up主","multiple":false,"parentSelectors":["row"],"regex":"","selector":"a span","type":"SelectorText"},{"id":"视频链接","multiple":false,"parentSelectors":["row"],"selector":"a.title","type":"SelectorLink"},{"id":"点赞数","multiple":false,"parentSelectors":["视频链接"],"regex":"","selector":"span.like","type":"SelectorText"},{"id":"投币数","multiple":false,"parentSelectors":["视频链接"],"regex":"","selector":"span.coin","type":"SelectorText"},{"id":"收藏数","multiple":false,"parentSelectors":["视频链接"],"regex":"","selector":"span.collect","type":"SelectorText"}]}
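In the sitemap above, the like, coin, and favorite counts list `视频链接` as their parent rather than `row`. Since `视频链接` is a link selector, those three fields are read from each video's own page and not from the ranking list. The Python sketch below prints that nesting from the `parentSelectors` fields. It uses a trimmed copy of the sitemap that keeps only the fields needed for nesting, and the helper names `children_by_parent` and `render` are hypothetical:

```python
import json
from collections import defaultdict

# Trimmed copy of the ranking sitemap: only ids, parents, and types are kept.
SITEMAP = json.loads("""
{"_id": "bilibili", "selectors": [
  {"id": "row", "parentSelectors": ["_root"], "type": "SelectorElement"},
  {"id": "视频标题", "parentSelectors": ["row"], "type": "SelectorText"},
  {"id": "视频链接", "parentSelectors": ["row"], "type": "SelectorLink"},
  {"id": "点赞数", "parentSelectors": ["视频链接"], "type": "SelectorText"},
  {"id": "投币数", "parentSelectors": ["视频链接"], "type": "SelectorText"},
  {"id": "收藏数", "parentSelectors": ["视频链接"], "type": "SelectorText"}
]}
""")


def children_by_parent(sitemap):
    """Map each parent selector id to the ids of its child selectors."""
    tree = defaultdict(list)
    for selector in sitemap["selectors"]:
        for parent in selector["parentSelectors"]:
            tree[parent].append(selector["id"])
    return dict(tree)


def render(tree, node="_root", depth=0, seen=frozenset()):
    """Return indented lines for the selector tree rooted at node."""
    lines = ["  " * depth + node]
    if node in seen:
        # A selector that is its own ancestor; stop here to avoid looping.
        lines[-1] += " (recursive)"
        return lines
    for child in tree.get(node, []):
        lines.extend(render(tree, child, depth + 1, seen | {node}))
    return lines


print("\n".join(render(children_by_parent(SITEMAP))))
```

Fields nested under a link selector cost one extra page load per row, so scraping with this sitemap takes noticeably longer than with a list-only one.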

To scrape the Douban Top 250 movie ranking, import the following code:

{"_id":"douban_movie_top_250","startUrl":["https://movie.douban.com/top250?start=0&filter="],"selectors":[{"id":"next_page","type":"SelectorLink","parentSelectors":["_root","next_page"],"selector":".next a","multiple":true,"delay":0},{"id":"container","type":"SelectorElement","parentSelectors":["_root","next_page"],"selector":".grid_view li","multiple":true,"delay":0},{"id":"title","type":"SelectorText","parentSelectors":["container"],"selector":"span.title:nth-of-type(1)","multiple":false,"regex":"","delay":0},{"id":"number","type":"SelectorText","parentSelectors":["container"],"selector":"em","multiple":false,"regex":"","delay":0}]}
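The sitemap above pages through the list by following the `.next a` link, with `next_page` listing itself as a parent. The same pages can also be addressed directly through the `start` offset in the URL. A minimal sketch, assuming the list shows 25 films per page and 250 films in total; the function name `top250_start_urls` is made up for illustration:

```python
def top250_start_urls(total: int = 250, page_size: int = 25) -> list[str]:
    """Return one URL per page of the Top 250 list, using the start offset."""
    base = "https://movie.douban.com/top250?start={offset}&filter="
    # Offsets run 0, 25, 50, ... up to the last page that still has entries.
    return [base.format(offset=offset) for offset in range(0, total, page_size)]


urls = top250_start_urls()
print(len(urls), urls[0], urls[-1])
```

Under the same assumption, a range start URL such as `https://movie.douban.com/top250?start=[0-225:25]&filter=` should cover the same pages without the recursive `next_page` selector.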

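The same can be done for all video data of a Douyin account, including video date, title, link, and the numbers of likes, comments, favorites, and shares. All data of a Weibo account can be collected too, including post link, post content, publish time, likes, reposts, comments, and topics.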


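The same goes for all article data of a WeChat official account, including article date, title, link, summary, author, cover image, whether it is marked original, IP location, read count, Wow (在看) count, like count, comment count, number of rewards, number of videos, and number of audio clips. For example, the 2022 articles of 深圳卫健委 all have 100,000+ reads. For the analysis, see the article "2022 is over: scraping official account read, like, Wow, and comment counts for data analysis, using 深圳卫健委 as the example".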

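Latest original articles:

A proper introduction to my Knowledge Planet

2023 update: the original tools and scripts developed by 苏生不惑

Updated again: batch-download official account article content, topics, images, covers, videos, and audio in 2023, export articles to PDF, with article data including read, like, Wow, and comment counts

Weibo handled in one go: 苏生不惑 wrote another script to download Weibo content and export it to PDF with one click, and to batch-scrape Weibo comment and repost data into Excel

The digital library zlibrary is back in 2023, with a new client anyone can use

Batch-download Douyin and Xiaohongshu videos, and scrape Douyin video data into Excel

If this article helped you, please support it with a like, a Wow, and a share. Thank you all!

WeChat official account: 苏生不惑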
