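I have written a regular expression in PHP7 to validate URI schemes. It is meant to support every scheme listed by IANA, whether permanent, provisional, or historical. So far I have worked through the permanent schemes as far as `shttp`.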
The regular expression is written as a defined constant in my code:
define('URL_VALIDATION_REGEX', '/\b(?:'.
'aaas?|about|acap|acct|cap|cid|coaps?(?:\+(?:tcp|ws))?|crid|data|dav|dict|dns|example|file|ftp|geo|'.
'go|gopher|h323|iax|icap|im(?:ap)?|info|ipps?|iris(?:\.(?:beep|lwz|xpcs?))?|jabber|ldap|mailto|'.
'mid|msrps?|mtqp|mupdate|news|nfs|nih?|nntp|opaquelocktoken|pkcs11|pop|pres|reload|rtsp[su]?|service|'.
'session|s?https?'.
'):\/\//i');
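The problematic part of the code is `s?https?`. Obviously, this regex returns a match when the supplied scheme is `http`, `https`, or `shttp`, but it also incorrectly matches `shttps`.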
I could remove `s?https?` and add `https?` and `shttp` to the regex instead. That would work, but it does not seem very elegant to me.
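My question is: does PHP7 allow a regex that behaves like `s?https?` but does not match `shttps`, without having to include the string `shttps` as a literal or list `https?` and `shttp` as separate parts of the regex?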
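One way this could be approached is with a PCRE conditional subpattern, which PHP's `preg_*` functions support. The sketch below is not from the thread; it uses a reduced pattern containing only the fragment in question, and the constant names `SCHEME_CONDITIONAL` and `SCHEME_LOOKBEHIND` are made up for illustration. The lookbehind variant still contains the literal `shttp`, so it may not satisfy the "no literal" requirement as well as the conditional does.

```php
<?php
declare(strict_types=1);

// Sketch only: (s)? captures an optional leading "s". The conditional
// (?(1)|s?) then permits a trailing "s" only when group 1 did NOT match.
const SCHEME_CONDITIONAL = '/\b(?:(s)?http(?(1)|s?)):\/\//i';

// Alternative: permit the trailing "s" only when the text just before it
// is not "shttp" (negative lookbehind).
const SCHEME_LOOKBEHIND = '/\b(?:s?http(?:(?<!shttp)s)?):\/\//i';

foreach (['http://a', 'https://a', 'shttp://a', 'shttps://a'] as $url) {
    printf(
        "%-12s conditional=%d lookbehind=%d\n",
        $url,
        preg_match(SCHEME_CONDITIONAL, $url),
        preg_match(SCHEME_LOOKBEHIND, $url)
    );
}
```

Note that the conditional version introduces a capturing group. In the full pattern every other group is non-capturing, so it would still be group 1, but a named group may be safer if capturing groups are added later.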
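I don't know how to improve the regex, but a combination of parse_url, in_array, and strtolower() seems to work quite well. On my laptop this code runs in about 52 ms including opcode compilation, or about 30 ms excluding it (in production the opcodes are cached after the first execution anyway).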
<?php
declare(strict_types = 1);
$tests=array(
'http://foo.bar'=>true,
'irc://irc.freenode.net/#anime'=>true,
'foobar://wtf'=>false,
'shouldfail://wat'=>false
);
foreach($tests as $test=>$expected){
echo "$test: ";
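// parse_url() returns null when the URL has no scheme; cast to string so strtolower() does not throw under strict_types
if(in_array(strtolower((string)parse_url( $test, PHP_URL_SCHEME )),array('aaa','aaas','about','acap','acct','acr','adiumxtra','afp','afs','aim','appdata','apt','attachment','aw','barion','beshare','bitcoin',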
'blob','bolo','browserext','callto','cap','chrome','chrome-extension','cid','coap','coap+tcp','coap+ws','coaps','coaps+tcp','coaps+ws',
'com-eventbrite-attendee','content','conti','crid','cvs','data','dav','diaspora','dict','dis','dlna-playcontainer','dlna-playsingle','dns','dntp','dtn',
'dvb','ed2k','example','facetime','fax','feed','feedready','file','filesystem','finger','fish','ftp','geo','gg','git','gizmoproject','go','gopher',
'graph','gtalk','h323','ham','hcp','http','https','hxxp','hxxps','hydrazone','iax','icap','icon','im','imap','info','iotdisco','ipn','ipp','ipps',
'irc','irc6','ircs','iris','iris.beep','iris.lwz','iris.xpc','iris.xpcs','isostore','itms','jabber','jar','jms','keyparc','lastfm','ldap','ldaps',
'lvlt','magnet','mailserver','mailto','maps','market','message','microsoft.windows.camera','microsoft.windows.camera.multipicker',
'microsoft.windows.camera.picker','mid','mms','modem','mongodb','moz','ms-access','ms-browser-extension','ms-drive-to','ms-enrollment','ms-excel',
'ms-gamebarservices','ms-gamingoverlay','ms-getoffice','ms-help','ms-infopath','ms-inputapp','ms-lockscreencomponent-config','ms-media-stream-id',
'ms-mixedrealitycapture','ms-officeapp','ms-people','ms-project','ms-powerpoint','ms-publisher','ms-restoretabcompanion','ms-search-repair',
'ms-secondary-screen-controller','ms-secondary-screen-setup','ms-settings','ms-settings-airplanemode','ms-settings-bluetooth','ms-settings-camera',
'ms-settings-cellular','ms-settings-cloudstorage','ms-settings-connectabledevices','ms-settings-displays-topology','ms-settings-emailandaccounts',
'ms-settings-language','ms-settings-location','ms-settings-lock','ms-settings-nfctransactions','ms-settings-notifications','ms-settings-power',
'ms-settings-privacy','ms-settings-proximity','ms-settings-screenrotation','ms-settings-wifi','ms-settings-workplace','ms-spd','ms-sttoverlay',
'ms-transit-to','ms-useractivityset','ms-virtualtouchpad','ms-visio','ms-walk-to','ms-whiteboard','ms-whiteboard-cmd','ms-word','msnim','msrp',
'msrps','mtqp','mumble','mupdate','mvn','news','nfs','ni','nih','nntp','notes','ocf','oid','onenote','onenote-cmd','opaquelocktoken','pack','palm',
'paparazzi','pkcs11','platform','pop','pres','prospero','proxy','pwid','psyc','qb','query','redis','rediss','reload','res','resource','rmi',
'rsync','rtmfp','rtmp','rtsp','rtsps','rtspu','secondlife','service','session','sftp','sgn','shttp','sieve','sip','sips','skype','smb','sms','smtp',
'snews','snmp','soap.beep','soap.beeps','soldat','spiffe','spotify','ssh','steam','stun','stuns','submit','svn','tag','teamspeak','tel','teliaeid',
'telnet','tftp','things','thismessage','tip','tn3270','tool','turn','turns','tv','udp','unreal','urn','ut2004','v-event','vemmi','ventrilo',
'videotex','vnc','view-source','wais','webcal','wpid','ws','wss','wtai','wyciwyg','xcon','xcon-userid','xfire','xmlrpc.beep','xmlrpc.beeps',
'xmpp','xri','ymsgr','z39.50','z39.50r','z39.50s'),true) === $expected){
echo "OK";
}else{
echo "FAIL";
}
echo "\n";
}
This is the full list, not just the subset your regex contains (I extracted the schemes from the CSV file).
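The same lookup could also be wrapped in a small function. The following is a minimal sketch, not the code from the answer: the function name `hasKnownScheme` is hypothetical, the scheme list is shortened for illustration, and it swaps `in_array` for an `isset` check on a flipped array, which is a common way to get a hash lookup in PHP.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: returns true when the URL's scheme is one of the
// keys of $schemeSet.
function hasKnownScheme(string $url, array $schemeSet): bool
{
    $scheme = parse_url($url, PHP_URL_SCHEME);
    // parse_url() yields null when there is no scheme and false for
    // seriously malformed input, so guard before calling strtolower().
    if (!is_string($scheme)) {
        return false;
    }
    return isset($schemeSet[strtolower($scheme)]);
}

// Shortened list for illustration; array_flip() turns the values into keys.
$schemeSet = array_flip(['http', 'https', 'irc', 'shttp']);

var_dump(hasKnownScheme('HTTP://foo.bar', $schemeSet));
var_dump(hasKnownScheme('foobar://wtf', $schemeSet));
var_dump(hasKnownScheme('no-scheme-here', $schemeSet));
```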
Adding a benchmark inside PHP, with `$start=microtime(true);` before the loop and `$end=microtime(true);var_dump($end-$start);` after it, reports that the loop itself takes about 0.1 ms on my laptop, so there's that: double(0.00010299682617188)
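For reference, the placement of the timing calls might look like the sketch below. The scheme list and test URLs are shortened for illustration, and the measured time will vary by machine.

```php
<?php
declare(strict_types=1);

$schemes = ['http', 'https', 'irc', 'shttp']; // shortened list for illustration
$tests = ['http://foo.bar', 'irc://irc.freenode.net/#anime', 'foobar://wtf'];

$start = microtime(true); // wall-clock time in seconds, as a float
foreach ($tests as $test) {
    $scheme = strtolower((string) parse_url($test, PHP_URL_SCHEME));
    $known = in_array($scheme, $schemes, true);
}
$end = microtime(true);

var_dump($end - $start); // elapsed seconds for the loop only
```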
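I decided to follow @sln's comment and go for a full trie, since for me speed is part of what makes code elegant. I believe the code is still readable because it is listed alphabetically: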
define('URL_VALIDATION_REGEX', '/\b(?:'.
'a(?:aas?|bout|c(?:ap|ct|r)|diumxtra|f[ps]|im|p(?:pdata|t)|ttachment|w)|'.
'b(?:arion|eshare|itcoin|lob|olo|rowserext)|'.
'c(?:a(?:llto|p)|hrome(?:-extension)?|id|o(?:aps?(?:\+(?:tcp|ws))?|m-eventbrite-attendee|'.
'nt(?:ent|i))|rid|vs)|'.
'd(?:a(?:ta|v)|i(?:aspora|ct|s)|lna-play(?:container|single)|n(?:s|tp)|tn|vb)|'.
'e(?:d2k|xample)|'.
'f(?:a(?:cetime|x)|eed(?:ready)?|i(?:(?:le(?:system)?)|nger|sh)|tp)|'.
'g(?:eo|g|i(?:t|zmoproject)|o(?:pher)?|raph|talk)|'.
'h(?:323|am|cp|ttps?|xxps?|ydrazone)|'.
'i(?:ax|c(?:ap|on)|m(?:ap)?|nfo|otdisco|p(?:n|ps?)|r(?:c[6s]?|is(?:\.(?:beep|lwz|xpcs?))?)|sostore|'.
'tms)|'.
'j(?:a(?:bber|r)|ms)|'.
'keyparc|'.
'l(?:astfm|daps?|vlt)|'.
'm(?:a(?:gnet|il(?:server|to)|ps|rket)|essage|i(?:crosoft\.windows\.camera(?:\.(?:multi)?picker)?|d)|ms|'.
'o(?:dem|ngodb|z)|s(?:-(?:access|browser-extension|drive-to|e(?:nrollment|xcel)|'.
'g(?:am(?:ebarservices|ingoverlay)|etoffice)|help|in(?:fopath|putapp)|'.
'lockscreencomponent-config|m(?:edia-stream-id|ixedrealitycapture)|officeapp|p(?:eople|roject|'.
'owerpoint|ublisher)|restoretabcompanion|s(?:e(?:arch-repair|condary-screen-(?:controller|setup)|'.
'ttings(?:-(?:airplanemode|bluetooth|c(?:amera|ellular|loudstorage|onnectabledevices)|'.
'displays-topology|emailandaccounts|l(?:anguage|oc(?:ation|k))|n(?:fctransactions|otifications)|'.
'p(?:ower|r(?:ivacy|oximity))|screenrotation|w(?:ifi|orkplace)))?)|pd|ttoverlay)|transit-to|'.
'useractivityset|v(?:irtualtouchpad|isio)|w(?:alk-to|hiteboard(?:-cmd)?|ord))|nim|rps?)|tqp|'.
'u(?:mble|pdate)|vn)|'.
'n(?:ews|fs|ih?|ntp|otes)|'.
'o(?:cf|id|nenote(?:-cmd)?|paquelocktoken)|'.
'p(?:a(?:ck|lm|parazzi)|kcs11|latform|op|r(?:es|o(?:spero|xy))|wid|syc)|'.
'q(?:b|uery)|'.
'r(?:e(?:diss?|load|s(?:ource)?)|mi|sync|t(?:mf?p|sp[su]?))|'.
's(?:e(?:condlife|rvice|ssion)|ftp|gn|http|i(?:eve|ps?)|kype|m(?:b|s|tp)|n(?:ews|mp)|o(?:ap\.beeps?|'.
'ldat)|p(?:iffe|otify)|sh|t(?:eam|uns?)|ubmit|vn)|'.
't(?:ag|e(?:amspeak|l(?:iaeid|net)?)|ftp|hi(?:ngs|smessage)|ip|n3270|ool|urns?|v)|'.
'u(?:dp|nreal|rn|t2004)|'.
'v(?:-event|e(?:mmi|ntrilo)|ideotex|nc|iew-source)|'.
'w(?:ais|ebcal|pid|ss?|tai|yciwyg)|'.
'x(?:con(?:-userid)?|fire|m(?:lrpc\.beeps?|pp)|ri)|'.
'ymsgr|'.
'z39\.50[rs]?'.
'):\/\//');
Note that the code contains IANA's full list of schemes, not my original subset.
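A trie regex of this size is tedious to maintain by hand, so it may be worth generating it from the plain scheme list. The following is a rough sketch of one way to do that, not the method used above: the function names `buildTrie` and `trieToRegex` are hypothetical, and the scheme list is shortened for illustration. It relies on `preg_quote` to escape characters such as `.` and `+`.

```php
<?php
declare(strict_types=1);

// Hypothetical helper: insert each word into a nested-array trie.
// The key '' marks "a word ends at this node".
function buildTrie(array $words): array
{
    $trie = [];
    foreach ($words as $word) {
        $node = &$trie;
        foreach (str_split($word) as $char) {
            if (!isset($node[$char])) {
                $node[$char] = [];
            }
            $node = &$node[$char];
        }
        $node[''] = true; // end-of-word marker
        unset($node);     // drop the reference before the next word
    }
    return $trie;
}

// Hypothetical helper: turn a trie node into a regex fragment.
function trieToRegex(array $trie): string
{
    $isEnd = isset($trie['']);
    unset($trie['']);
    if ($trie === []) {
        return ''; // leaf: nothing more to match
    }
    $alternatives = [];
    foreach ($trie as $char => $child) {
        // Digit keys come back as integers, hence the string cast.
        $alternatives[] = preg_quote((string) $char, '/') . trieToRegex($child);
    }
    if (count($alternatives) === 1 && !$isEnd) {
        return $alternatives[0]; // single path, no group needed
    }
    $group = '(?:' . implode('|', $alternatives) . ')';
    // If a word ends here, everything that follows is optional.
    return $isEnd ? $group . '?' : $group;
}

// Shortened list for illustration only.
$schemes = ['http', 'https', 'shttp', 'coap', 'coap+tcp', 'z39.50', 'z39.50r'];
$pattern = '/\b' . trieToRegex(buildTrie($schemes)) . ':\/\//i';

echo $pattern, "\n";
var_dump(preg_match($pattern, 'shttps://example'));
```

The generated pattern is less compact than a hand-tuned one (for example, it emits `(?:s)?` rather than `s?`), but it stays in sync with the list it was built from.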