词法分析实现

逻辑

有限自动机表示

状态集合、字母表、转移函数、开始状态、接收状态五元组
线性逐字母向前读取
根据转移函数逐个转移状态
判断读取完成时是否处于接收状态

能被 DFA 识别的语言称为正则语言
可基于正则表达式实现词法分析器
考虑括号匹配语言的场景

(()())
状态转移

实现

词法分析暴露出来的是一个 lex 的函数，输入的是字符串模版，输出是 tokens 的数组。他的核心就是在 while 循环中不停的对代码进行切分，对字符串做一些处理。这里面我们会使用 trim 和 getToken 辅助函数帮我们更快的实现功能。

定义 template 的 Mock 数据

let template = `
  <p>hello</p>
  <div>
    <h2>
      <small>world</small>
    </h2>
    <span>!</span>
  </div>`;

定义词法分析函数

function lex (template) {

}

定义 tokens 存放所有词法分析的 tokens , token 暂存单次的 token 。

function lex (template) {
  let tokens = []
  let token

  return tokens
}

对 template 做切分，直到 template 为空。

function lex (template) {
  let tokens = []
  let token
  while (template.length > 0) {
    template = trim(template)
    token = getToken(template)
    template = template.substr(token.val.length)
    token.val = trim(token.val)
    tokens.push(token)
  }
  return tokens
}

template = trim(template) 去除 template 中的空白字符
token = getToken(template) 对字符串进行匹配
template = template.substr(token.val.length) 把匹配到字符的做一次切分
tokens.push(token) 然后再把匹配返回的 token 添加到 tokens 数组里面

字符串进行匹配定义了三个表达式，声明了三种不同的 token 类型。这里我们会对每次输入的字符串进行匹配，如果匹配到，就会把匹配到的一段做一次 match，把这次 match 获得的做切分。

const TagClose = new RegExp(/^<\/[\w]+>/)
const TagOpen = new RegExp(/^<[\w]+>/)
const Value = new RegExp(/^[^<]+/)

function getToken (str) {
  if (str.match(TagClose)) {
    return { type: 'TagClose', val: str.match(TagClose)[0] }
  } else if (str.match(TagOpen)) {
    return { type: 'TagOpen', val: str.match(TagOpen)[0] }
  } else if (str.match(Value)) {
    return { type: 'Value', val: str.match(Value)[0] }
  }
  throw Error('LexicalError')
}

const TagClose = new RegExp(/^</[\w]+>/)

闭合标签匹配，以小于号(<)跟着一个斜杠(/)开头，斜杠前方需要添加反斜杠。接着匹配一个或则多个字符最后以大于号(>)结尾。

const TagOpen = new RegExp(/^<[\w]+>/)

开始标签匹配，和闭合标签的正则匹配相近，却别在于少了中间的斜杠。

const Value = new RegExp(/^[^<]+/)

字符串匹配，不包含小于号开始的一连串字符的集合

参考代码

```js
const TagClose = new RegExp(/^<\/[\w]+>/)
const TagOpen = new RegExp(/^<[\w]+>/)
const Value = new RegExp(/^[^<]+/)

function trim (str) {
  return str.replace(/^\s+|\s+$/, '')
}

function getToken (str) {
  if (str.match(TagClose)) {
    return { type: 'TagClose', val: str.match(TagClose)[0] }
  } else if (str.match(TagOpen)) {
    return { type: 'TagOpen', val: str.match(TagOpen)[0] }
  } else if (str.match(Value)) {
    return { type: 'Value', val: str.match(Value)[0] }
  }
  throw Error('LexicalError')
}

export default {
  lex (template) {
    let tokens = []
    let token
    while (template.length > 0) {
      template = trim(template)
      token = getToken(template)
      template = template.substr(token.val.length)
      token.val = trim(token.val)
      tokens.push(token)
    }
    return tokens
  }
}

输出

最终输出的 tokens 结构如下

tokens = [
  { type: 'TagOpen', val: '<p>' },
    { type: 'Value', val: 'hello' },
  { type: 'TagClose', val: '</p>' },

  { type: 'TagOpen', val: '<div>' },

    { type: 'TagOpen', val: '<h2>' },
      { type: 'TagOpen', val: '<small>' },
        { type: 'Value', val: 'world' },
      { type: 'TagClose', val: '</small>' },
    { type: 'TagClose', val: '</h2>' },

    { type: 'TagOpen', val: '<span>' },
      { type: 'Value', val: '!' },
    { type: 'TagClose', val: '</span>' },

  { type: 'TagClose', val: '</div>' }
]

# 词法分析实现

# 逻辑

# 实现

# 参考代码

# 输出

词法分析实现

逻辑

实现

参考代码

输出