Toy Project: Crawling MDN

Requirements

A canvas that renders a mind map, similar to https://echarts.apache.org/examples/en/editor.html?c=tree-basic . Think of it as an MDN menu 🤣

  • Left-click (click) a node to collapse or expand it
  • Right-click (contextmenu) a node to navigate to its page

Live demo: http://hojondo.com/MDN_MIND_MAPPING/
Still to do: filter, search, nav

Target JSON format

interface PageNode {
  name: string;
  link: string;
  childrenPage?: Array<PageNode>;
}

[
  {
    "name": "Standard built-in objects",
    "link": "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects",
    "childrenPage": [
      {
        "name": "Text processing",
        "link": "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects#text_processing",
        "childrenPage": [
          {
            "name": "String",
            "link": "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String",
            "childrenPage": [
              {
                "name": "Static properties",
                "link": "",
                "childrenPage": []
              },
              {
                "name": "Static methods",
                "link": "",
                "childrenPage": []
              },
              {
                "name": "Instance properties",
                "link": "",
                "childrenPage": []
              },
              {
                "name": "Instance methods",
                "link": "",
                "childrenPage": [
                  {
                    "name": "match",
                    "link": "granularity stops at individual properties // #parameters #return #examples (notable use cases) could be appended later, to be decided..."
                  },
                  {}
                ]
              }
            ]
          },
          {
            "name": "RegExp",
            "link": "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp"
          }
        ]
      },
      {
        "name": "Keyed collections",
        "link": "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects#keyed_collections",
        "childrenPage": [
          // Map,
          // Set,
          // WeakMap,
          // WeakSet
        ]
      }
    ]
  }
  // https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference#statements
]
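
Since the front end draws this data with an echarts tree, the PageNode shape eventually has to become the { name, children } nodes that a tree series reads. The sketch below is one possible conversion, not the project's actual code: the function name toEchartsNode, the collapseFrom parameter, and the decision to skip empty placeholder objects such as {} are all assumptions made for illustration.

```javascript
// Convert a PageNode tree into the { name, children } shape read by an echarts tree series.
// Assumptions: placeholder entries without a string `name` (such as `{}`) are skipped,
// and `link` is carried along as an extra field so a click handler can read it later.
function toEchartsNode(page, depth = 0, collapseFrom = 2) {
  const children = (page.childrenPage ?? [])
    .filter((child) => child && typeof child.name === "string")
    .map((child) => toEchartsNode(child, depth + 1, collapseFrom));

  return {
    name: page.name,
    link: page.link ?? "",
    // Start deeper levels collapsed so the initial chart stays readable.
    collapsed: depth >= collapseFrom && children.length > 0,
    children,
  };
}

const sample = {
  name: "Standard built-in objects",
  link: "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects",
  childrenPage: [{ name: "Text processing", link: "", childrenPage: [] }, {}],
};
console.log(JSON.stringify(toEchartsNode(sample), null, 2));
```

Keeping link on each converted node means the contextmenu handler can look it up on the clicked node's data instead of keeping a separate lookup table.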

Mapping out the layout of each MDN page type

The root is https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects
because its structure is relatively uniform.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference
URL check (number of '/'-separated segments after https://): (url.match(/(?<=^https\:\/\/).*/)??[''])[0].split('/').length === 6
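
The same depth check can be written without a regular expression by using the standard URL class. This is only a sketch: the name pageDepth is made up here, and unlike the one-liner it ignores a trailing slash.

```javascript
// Count the host plus every non-empty path segment, which matches the
// "split on '/' after https://" rule used in these notes.
// Hash and query parts are not part of `pathname`, so they never change the depth.
function pageDepth(url) {
  const { pathname } = new URL(url);
  return 1 + pathname.split("/").filter(Boolean).length;
}

const reference =
  "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference";
console.log(pageDepth(reference)); // 6
console.log(pageDepth(reference + "/Global_Objects")); // 7
console.log(pageDepth(reference + "/Global_Objects/String")); // 8
```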

h2 > a {
  // points to pages whose bread-crumb length === 7
  /** includes
  Global_Objects; 
  statements; 
  expressions_and_operators; 
  functions; 
  additional_reference_pages
  */
}

There are only 5 h2 headings, and the sub-pages reached from this entry point do not share a consistent structure.
Skipped for now.

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects
URL check: (url.match(/(?<=^https\:\/\/).*/)??[''])[0].split('/').length === 7

All h3 categories and their sub-pages

h3 > a {
  // category heading, href is a #hash, e.g. #text_processing
}
h3 + div li > a {
  // sub-pages of the category
  // TODO: exclude links preceded by an svg icon (either nonstandard or deprecated)
}

Mapped to Node.js with cheerio as $ (a sketch; assumes html already holds the page source)

const cheerio = require("cheerio");
const $ = cheerio.load(html);

// Category headings. Each href is a bare #hash, to be joined with the current URL later.
const childNodes = $("h3 > a")
  .map((_, a) => ({ name: $(a).text().trim(), link: $(a).attr("href") }))
  .get();

// Sub-page links. Each href is site-relative, to be joined with 'https://developer.mozilla.org' later.
const grandChildNodes = $("h3 + div li > a")
  .map((_, a) => ({ name: $(a).children("code").text(), link: $(a).attr("href") }))
  .get();
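
Both kinds of href mentioned in the comments above (a bare #hash and a site-relative path) can be resolved the same way by passing the current page URL as the base to the standard URL constructor, which avoids hand-written string concatenation. A minimal sketch, with resolveLink being a made-up helper name:

```javascript
// Resolve an href scraped from a page against that page's own URL.
// Works for "#hash", "/site/relative/path" and already-absolute links alike.
function resolveLink(href, pageUrl) {
  if (!href) return ""; // headings without a link keep an empty string, as in the target JSON
  return new URL(href, pageUrl).href;
}

const rootPage =
  "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects";

console.log(resolveLink("#text_processing", rootPage));
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects#text_processing

console.log(
  resolveLink("/en-US/docs/Web/JavaScript/Reference/Global_Objects/String", rootPage)
);
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String
```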

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String

URL check: (url.match(/(?<=^https\:\/\/).*/)??[''])[0].split('/').length === 8

h2 > a {
  // category heading, href is a #hash, e.g. #static_methods
}
h2 + div dt > a {
  // sub-pages of the category
  // TODO: exclude links followed by an svg icon (either nonstandard or deprecated)
}

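Mapped to Node.js with cheerio as $ (a sketch; assumes $ = cheerio.load(html) for this page)

// Category headings. Each href is a bare #hash, to be joined with the current URL later.
const childNodes = $("h2 > a")
  .map((_, a) => ({ name: $(a).text().trim(), link: $(a).attr("href") }))
  .get();

// Sub-page links. Each href is site-relative, to be joined with 'https://developer.mozilla.org' later.
const grandChildNodes = $("h2 + div dt > a")
  .map((_, a) => ({ name: $(a).children("code").text(), link: $(a).attr("href") }))
  .get();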
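
The two page types differ only in which selectors they use, so the choice can be driven by the URL depth. The lookup below is a sketch of that idea; SELECTORS_BY_DEPTH and selectorsFor are hypothetical names, and the selector strings are copied from the notes above.

```javascript
// Selector pairs from the notes, keyed by URL depth
// (host plus path segments: 7 for Global_Objects, 8 for Global_Objects/String).
const SELECTORS_BY_DEPTH = {
  7: { category: "h3 > a", entry: "h3 + div li > a" },
  8: { category: "h2 > a", entry: "h2 + div dt > a" },
};

// Return the selectors for a page, or null for page types that are skipped
// (for example the depth-6 Reference page).
function selectorsFor(url) {
  const { pathname } = new URL(url);
  const depth = 1 + pathname.split("/").filter(Boolean).length;
  return SELECTORS_BY_DEPTH[depth] ?? null;
}

console.log(
  selectorsFor(
    "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String"
  )
);
```

Returning null for unknown depths gives the crawler an explicit "do not parse this page" signal instead of running selectors that may not match.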

Draft code

Points to note:

Getting the JSON with Node.js

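const https = require("https");
const fs = require("fs");
const cheerio = require("cheerio"); // for filterHtml, once it is implemented

const rootUrl =
  "https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects";

// Placeholder so the script runs; the real parsing is still a TODO (see below).
function filterHtml(html) {
  console.log(`fetched ${html.length} characters`);
}

https
  .get(rootUrl, (res) => {
    let html = "";

    res.setEncoding("utf8"); // decode chunks safely even if a multi-byte character is split
    res.on("data", (chunk) => {
      html += chunk;
    });
    res.on("end", () => {
      // Keep a copy of the raw HTML for inspection.
      fs.writeFile("crawler.txt", html, (err) => {
        if (err) console.error(err);
      });
      filterHtml(html);
    });
  })
  .on("error", (err) => {
    console.error("request failed:", err.message);
  });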

// TODO: filterHtml should turn the HTML into .json
// URL depth === 7 or 8 decides which selectors to use
// then recurse into the sub-pages
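
One way the recursion in the TODO could look is sketched below. It is not the project's implementation: crawl, fetchHtml, parsePage and maxDepth are hypothetical names, and the network and cheerio parts are passed in as functions so that the recursion itself can be tried out with fake data.

```javascript
// Recursively build a PageNode tree.
// deps.fetchHtml(url)        -> Promise<string>, hypothetical page downloader
// deps.parsePage(html, url)  -> [{ name, link, entries: [{ name, link }] }],
//                               hypothetical parser: one item per category heading
// deps.maxDepth              -> how many page levels to follow
async function crawl(url, name, deps, depth = 0, seen = new Set()) {
  const node = { name, link: url, childrenPage: [] };
  if (depth >= deps.maxDepth || seen.has(url)) return node; // leaf or already visited
  seen.add(url);

  const html = await deps.fetchHtml(url);
  for (const category of deps.parsePage(html, url)) {
    const categoryNode = { name: category.name, link: category.link, childrenPage: [] };
    for (const entry of category.entries) {
      // Sequential on purpose, to keep the request rate low.
      categoryNode.childrenPage.push(
        await crawl(entry.link, entry.name, deps, depth + 1, seen)
      );
    }
    node.childrenPage.push(categoryNode);
  }
  return node;
}

// Usage with fake pages instead of real requests.
const fakePages = {
  root: [{ name: "Text processing", link: "root#text_processing",
           entries: [{ name: "String", link: "root/String" }] }],
};
const fakeDeps = {
  maxDepth: 1,
  fetchHtml: async (url) => url, // the "HTML" is just the key into fakePages
  parsePage: (html) => fakePages[html] ?? [],
};
crawl("root", "Standard built-in objects", fakeDeps).then((tree) =>
  console.log(JSON.stringify(tree, null, 2))
);
```

The seen set guards against loops between pages that link to each other, and maxDepth keeps a first run small while the selectors are still being tuned.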

Front end, rendered with echarts

https://github.com/Hojondo/MDN_MIND_MAPPING

node 爬虫扫盲 (a primer on Node crawlers)
node爬虫实践总结 (a summary of Node crawler practice)
Puppeteer


Reposting rules

"Toy Project: Crawling MDN" by Ryan Who is licensed under the Creative Commons Attribution 4.0 International License.