Can this PHP regular expression for validating URI schemes, which checks for the http, https, and shttp protocols, be written better?

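I wrote a regular expression in PHP 7 to validate URI schemes. It is meant to support every scheme listed by IANA, whether permanent, provisional, or historical. So far I have worked through the permanent schemes as far as `shttp`.

The regular expression is written in my code as a defined constant: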

define('URL_VALIDATION_REGEX', '/\b(?:'.
    'aaas?|about|acap|acct|cap|cid|coaps?(?:\+(?:tcp|ws))?|crid|data|dav|dict|dns|example|file|ftp|geo|'.
    'go|gopher|h323|iax|icap|im(?:ap)?|info|ipps?|iris(?:\.(?:beep|lwz|xpcs?))?|jabber|ldap|mailto|'.
    'mid|msrps?|mtqp|mupdate|news|nfs|nih?|nntp|opaquelocktoken|pkcs11|pop|pres|reload|rtsp[su]?|service|'.
    'session|s?https?'.
    '):\/\//i');

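The problematic part of the code is `s?https?`. Obviously, this regex returns a match if the scheme provided is `http`, `https`, or `shttp`, but it also incorrectly matches `shttps`.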
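The over-matching can be reproduced in isolation. This is a minimal sketch, assuming a reduced pattern that keeps only the last alternative of the constant above and a placeholder host (`example.org`) that is not part of the original code:

```php
<?php
// Hypothetical reduced pattern: only the s?https? alternative from the question.
$pattern = '/\b(?:s?https?):\/\//i';

$results = [];
foreach (['http', 'https', 'shttp', 'shttps'] as $scheme) {
    // preg_match() returns 1 on a match and 0 when there is none.
    $results[$scheme] = preg_match($pattern, $scheme . '://example.org') === 1;
}

// All four entries come out true, including the unwanted "shttps".
var_dump($results);
```

Both `s` characters are optional and independent of each other, so nothing stops them from appearing together.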

I could remove `s?https?` and add `https?` and `shttp` to the regex instead. That would work, but it does not seem very elegant to me.

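My question is: does PHP 7 allow a regular expression that works like `s?https?` but excludes `shttps` from matching, without having to spell out the string shttps as a literal or to keep `https?` and `shttp` as separate parts of the regex?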
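For reference, one PCRE construct that appears to fit this description is a conditional subpattern: capture the optional leading `s`, and allow the trailing `s` only when that capture did not take part in the match. The sketch below is an illustration, not part of the original pattern. It assumes the leading `(s)` is capturing group 1, so in the full constant the group number would have to be adjusted; the helper `matchesScheme()` and the host `example.org` are hypothetical.

```php
<?php
// (s)?        optionally capture a leading "s" as group 1
// (?(1)|s?)   if group 1 matched, allow nothing more; otherwise allow an optional "s"
$pattern = '/\b(?:(s)?http(?(1)|s?)):\/\//i';

// Hypothetical helper: true when the pattern matches somewhere in $uri.
function matchesScheme(string $pattern, string $uri): bool
{
    return preg_match($pattern, $uri) === 1;
}

foreach (['http', 'https', 'shttp', 'shttps'] as $scheme) {
    $verdict = matchesScheme($pattern, $scheme . '://example.org') ? 'match' : 'no match';
    echo $scheme, ': ', $verdict, "\n";
}
```

The `\b` at the start matters here: without it, the regex could still find `https://` inside `shttps://` by starting one character later.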

php regex php-7
2 Answers
0
votes

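I don't know how to improve the regex, but a combination of parse_url(), in_array(), and strtolower() seems to work well. On my laptop this code runs in about 52 milliseconds including opcode compilation, and about 30 milliseconds excluding it (in production the opcodes would be cached after the first execution anyway).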

<?php
declare(strict_types = 1);
$tests=array(
        'http://foo.bar'=>true,
        'irc://irc.freenode.net/#anime'=>true,
        'foobar://wtf'=>false,
        'shouldfail://wat'=>false
);
foreach($tests as $test=>$expected){
    echo "$test: ";
    if(in_array(strtolower(parse_url( $test, PHP_URL_SCHEME )),array('aaa','aaas','about','acap','acct','acr','adiumxtra','afp','afs','aim','appdata','apt','attachment','aw','barion','beshare','bitcoin',
            'blob','bolo','browserext','callto','cap','chrome','chrome-extension','cid','coap','coap+tcp','coap+ws','coaps','coaps+tcp','coaps+ws',
            'com-eventbrite-attendee','content','conti','crid','cvs','data','dav','diaspora','dict','dis','dlna-playcontainer','dlna-playsingle','dns','dntp','dtn',
            'dvb','ed2k','example','facetime','fax','feed','feedready','file','filesystem','finger','fish','ftp','geo','gg','git','gizmoproject','go','gopher',
            'graph','gtalk','h323','ham','hcp','http','https','hxxp','hxxps','hydrazone','iax','icap','icon','im','imap','info','iotdisco','ipn','ipp','ipps',
            'irc','irc6','ircs','iris','iris.beep','iris.lwz','iris.xpc','iris.xpcs','isostore','itms','jabber','jar','jms','keyparc','lastfm','ldap','ldaps',
            'lvlt','magnet','mailserver','mailto','maps','market','message','microsoft.windows.camera','microsoft.windows.camera.multipicker',
            'microsoft.windows.camera.picker','mid','mms','modem','mongodb','moz','ms-access','ms-browser-extension','ms-drive-to','ms-enrollment','ms-excel',
            'ms-gamebarservices','ms-gamingoverlay','ms-getoffice','ms-help','ms-infopath','ms-inputapp','ms-lockscreencomponent-config','ms-media-stream-id',
            'ms-mixedrealitycapture','ms-officeapp','ms-people','ms-project','ms-powerpoint','ms-publisher','ms-restoretabcompanion','ms-search-repair',
            'ms-secondary-screen-controller','ms-secondary-screen-setup','ms-settings','ms-settings-airplanemode','ms-settings-bluetooth','ms-settings-camera',
            'ms-settings-cellular','ms-settings-cloudstorage','ms-settings-connectabledevices','ms-settings-displays-topology','ms-settings-emailandaccounts',
            'ms-settings-language','ms-settings-location','ms-settings-lock','ms-settings-nfctransactions','ms-settings-notifications','ms-settings-power',
            'ms-settings-privacy','ms-settings-proximity','ms-settings-screenrotation','ms-settings-wifi','ms-settings-workplace','ms-spd','ms-sttoverlay',
            'ms-transit-to','ms-useractivityset','ms-virtualtouchpad','ms-visio','ms-walk-to','ms-whiteboard','ms-whiteboard-cmd','ms-word','msnim','msrp',
            'msrps','mtqp','mumble','mupdate','mvn','news','nfs','ni','nih','nntp','notes','ocf','oid','onenote','onenote-cmd','opaquelocktoken','pack','palm',
            'paparazzi','pkcs11','platform','pop','pres','prospero','proxy','pwid','psyc','qb','query','redis','rediss','reload','res','resource','rmi',
            'rsync','rtmfp','rtmp','rtsp','rtsps','rtspu','secondlife','service','session','sftp','sgn','shttp','sieve','sip','sips','skype','smb','sms','smtp',
            'snews','snmp','soap.beep','soap.beeps','soldat','spiffe','spotify','ssh','steam','stun','stuns','submit','svn','tag','teamspeak','tel','teliaeid',
            'telnet','tftp','things','thismessage','tip','tn3270','tool','turn','turns','tv','udp','unreal','urn','ut2004','v-event','vemmi','ventrilo',
            'videotex','vnc','view-source','wais','webcal','wpid','ws','wss','wtai','wyciwyg','xcon','xcon-userid','xfire','xmlrpc.beep','xmlrpc.beeps',
            'xmpp','xri','ymsgr','z39.50','z39.50r','z39.50s'),true) === $expected){
        echo "OK";
    }else{
        echo "FAIL";
    }
    echo "\n";
}
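  • This is the full list, not just the subset your regex covers (I extracted the schemes from the CSV file).

  • Adding an in-PHP benchmark, with

    $start=microtime(true);
    before the loop and
    $end=microtime(true);var_dump($end-$start);
    after the loop, reports that the loop itself takes about 0.1 milliseconds on my laptop, so there is that.
    double(0.00010299682617188)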
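The same lookup can be wrapped in a small function that also copes with input that has no scheme at all. This is a minimal sketch under stated assumptions: the function name `hasKnownScheme()` is hypothetical, and the shortened `$schemes` array stands in for the full list above. Flipping the list once turns each check into an array-key lookup instead of a scan with in_array().

```php
<?php
declare(strict_types=1);

// Hypothetical, shortened list; the full list from the answer would go here.
$schemes = ['http', 'https', 'shttp', 'irc'];

// Flip once so every lookup is a key check rather than a linear search.
$schemeSet = array_flip($schemes);

// Hypothetical helper: true when $uri starts with one of the known schemes.
function hasKnownScheme(string $uri, array $schemeSet): bool
{
    $scheme = parse_url($uri, PHP_URL_SCHEME);
    // parse_url() returns null when there is no scheme and false on seriously
    // malformed input; under strict_types neither may be passed to strtolower().
    if (!is_string($scheme)) {
        return false;
    }
    return isset($schemeSet[strtolower($scheme)]);
}

var_dump(hasKnownScheme('http://foo.bar', $schemeSet));
var_dump(hasKnownScheme('foo.bar', $schemeSet));
```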


0
votes

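I decided to follow @sln's comment and go with a full trie, since speed is part of elegant code for me. I believe the code is still readable because the schemes are listed alphabetically: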

define('URL_VALIDATION_REGEX', '/\b(?:'.
    'a(?:aas?|bout|c(?:ap|ct|r)|diumxtra|f[ps]|im|p(?:pdata|t)|ttachment|w)|'.
    'b(?:arion|eshare|itcoin|lob|olo|rowserext)|'.
    'c(?:a(?:llto|p)|hrome(?:-extension)?|id|o(?:aps?(?:\+(?:tcp|ws))?|m-eventbrite-attendee|'.
        'nt(?:ent|i))|rid|vs)|'.
    'd(?:a(?:ta|v)|i(?:aspora|ct|s)|lna-play(?:container|single)|n(?:s|tp)|tn|vb)|'.
    'e(?:d2k|xample)|'.
    'f(?:a(?:cetime|x)|eed(?:ready)?|i(?:(?:le(?:system)?)|nger|sh)|tp)|'.
    'g(?:eo|g|i(?:t|zmoproject)|o(?:pher)?|raph|talk)|'.
    'h(?:323|am|cp|ttps?|xxps?|ydrazone)|'.
    'i(?:ax|c(?:ap|on)|m(?:ap)?|nfo|otdisco|p(?:n|ps?)|r(?:c[6s]?|is(?:\.(?:beep|lwz|xpcs?))?)|sostore|'.
        'tms)|'.
    'j(?:a(?:bber|r)|ms)|'.
    'keyparc|'.
    'l(?:astfm|daps?|vlt)|'.
    'm(?:a(?:gnet|il(?:server|to)|ps|rket)|essage|i(?:crosoft\.windows\.camera(?:\.(?:multi)?picker)?|d)|ms|'.
        'o(?:dem|ngodb|z)|s(?:-(?:access|browser-extension|drive-to|e(?:nrollment|xcel)|'.
        'g(?:am(?:ebarservices|ingoverlay)|etoffice)|help|in(?:fopath|putapp)|'.
        'lockscreencomponent-config|m(?:edia-stream-id|ixedrealitycapture)|officeapp|p(?:eople|roject|'.
        'owerpoint|ublisher)|restoretabcompanion|s(?:e(?:arch-repair|condary-screen-(?:controller|setup)|'.
        'ttings(?:-(?:airplanemode|bluetooth|c(?:amera|ellular|loudstorage|onnectabledevices)|'.
        'displays-topology|emailandaccounts|l(?:anguage|oc(?:ation|k))|n(?:fctransactions|otifications)|'.
        'p(?:ower|r(?:ivacy|oximity))|screenrotation|w(?:ifi|orkplace)))?)|pd|ttoverlay)|transit-to|'.
        'useractivityset|v(?:irtualtouchpad|isio)|w(?:alk-to|hiteboard(?:-cmd)?|ord))|nim|rps?)|tqp|'.
        'u(?:mble|pdate)|vn)|'.
    'n(?:ews|fs|ih?|ntp|otes)|'.
    'o(?:cf|id|nenote(?:-cmd)?|paquelocktoken)|'.
    'p(?:a(?:ck|lm|parazzi)|kcs11|latform|op|r(?:es|o(?:spero|xy))|wid|syc)|'.
    'q(?:b|uery)|'.
    'r(?:e(?:diss?|load|s(?:ource)?)|mi|sync|t(?:mf?p|sp[su]?))|'.
    's(?:e(?:condlife|rvice|ssion)|ftp|gn|http|i(?:eve|ps?)|kype|m(?:b|s|tp)|n(?:ews|mp)|o(?:ap\.beeps?|'.
                    'ldat)|p(?:iffe|otify)|sh|t(?:eam|uns?)|ubmit|vn)|'.
    't(?:ag|e(?:amspeak|l(?:iaeid|net)?)|ftp|hi(?:ngs|smessage)|ip|n3270|ool|urns?|v)|'.
    'u(?:dp|nreal|rn|t2004)|'.
    'v(?:-event|e(?:mmi|ntrilo)|ideotex|nc|iew-source)|'.
    'w(?:ais|ebcal|pid|ss?|tai|yciwyg)|'.
    'x(?:con(?:-userid)?|fire|m(?:lrpc\.beeps?|pp)|ri)|'.
    'ymsgr|'.
    'z39\.50[rs]?'.
    '):\/\//');

Note that the code contains IANA's full list of schemes, not my original subset.
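A pattern of this size is easy to get subtly wrong by hand, for instance by leaving a `.` or `+` unescaped. As a cross-check, a plain (non-trie) alternation can be generated from the scheme list with preg_quote(). This is a sketch under stated assumptions, not the method used above: the shortened `$schemes` array is a hypothetical stand-in for the full list, and no claim is made about how its speed compares with the hand-built trie.

```php
<?php
// Hypothetical, shortened list; the full IANA list would go here.
$schemes = ['http', 'https', 'shttp', 'coap+tcp', 'iris.beep', 'z39.50'];

// preg_quote() escapes regex metacharacters such as "." and "+",
// plus the "/" delimiter passed as the second argument.
$alternatives = array_map(function (string $scheme): string {
    return preg_quote($scheme, '/');
}, $schemes);

$generated = '/\b(?:' . implode('|', $alternatives) . '):\/\//i';

var_dump(preg_match($generated, 'iris.beep://host'));
var_dump(preg_match($generated, 'irisXbeep://host'));
```

Running the generated pattern and the hand-written one against the same inputs is one way to spot where they disagree.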
