WSH Crawler

Preface

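I spent some time learning how to write a crawler with WSH (Windows Script Host). It can only do simple scraping, but it is still quite handy. I am not yet comfortable with VBS syntax, so the VBS version fetches only a single page; the JS code below fetches the content of every page of a multi-page document.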

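Run the scripts from the console (with cscript); otherwise every WSH.Echo call pops up its own dialog box.
The following is the result of running the JS script: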

js

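var http = new ActiveXObject("Msxml2.ServerXMLHTTP");

// The document has 42 pages, numbered from 1.
for (var pageNum = 1; pageNum <= 42; pageNum++) {
    var url = "http://www.doczj.com/doc/0f47f800a6c30c2259019e5e-" + pageNum + ".html";
    http.open("GET", url, false); // false = synchronous request
    http.send();                  // call the method explicitly
    var strHtml = http.responseText;

    // Use a fresh htmlfile document for each page so that getElementById
    // cannot match the "contents" element of an earlier page.
    var html = new ActiveXObject("htmlfile");
    html.designMode = "on"; // enable edit mode
    html.write(strHtml);    // write data

    var text = html.getElementById("contents");
    if (text) {
        WSH.Echo(text.innerText);
    }
    WSH.Echo("--------------------------------------------------------------------------------------------");
}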
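
The page URLs in the loop above follow a fixed pattern: a document prefix, then the page number, then ".html". As a minimal sketch, that pattern can be pulled into a small helper so that the prefix and page count are parameters instead of hardcoded values. The name buildPageUrls and its parameters are hypothetical, not part of the original script. It sticks to plain ES3-style syntax (var, for loops) so that it should also run under the WSH JScript engine.

```javascript
// Hypothetical helper: builds the URL of every page of one document.
// baseUrl   - URL prefix ending just before the page number (assumed pattern)
// pageCount - number of pages, numbered from 1
function buildPageUrls(baseUrl, pageCount) {
    var urls = [];
    for (var page = 1; page <= pageCount; page++) {
        urls.push(baseUrl + page + ".html");
    }
    return urls;
}

// The prefix and the page count of 42 are taken from the script above.
var urls = buildPageUrls("http://www.doczj.com/doc/0f47f800a6c30c2259019e5e-", 42);
```

Each entry of urls can then be passed to http.open in place of the string concatenation inside the loop.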

vbs

Set html = CreateObject("htmlfile")
Set http = CreateObject("Msxml2.ServerXMLHTTP")

html.designMode = "on" 'enable edit mode

'Fetch the first page synchronously (third argument False)
http.open "GET", "http://www.doczj.com/doc/0f47f800a6c30c2259019e5e-1.html", False
http.send
strHtml = http.responseText

html.write strHtml 'write the fetched HTML into the document
Set bln = html.getElementById("contents")
WSH.Echo bln.innerText

References

Method of parsing HTML document by VBS (htmlfile)