t_,@s+ddlmZmZmZddlmZddlmZm Z ddl m Z ddl m Z ddl mZddl mZmZdd l mZmZmZdd l mZmZdd l mZdd lmZdd lmZeeZe dkr eZne ZGdddeZdS))absolute_importdivisionunicode_literals)unichr)deque OrderedDict) version_info)spaceCharacters)entities) asciiLettersasciiUpper2Lower)digits hexDigitsEOF) tokenTypes tagTokenTypes)replacementCharacters)HTMLInputStream)TriecseZdZdZdfddZddZddZdd d d Zd d ZddZ ddZ ddZ ddZ ddZ ddZddZddZddZd d!Zd"d#Zd$d%Zd&d'Zd(d)Zd*d+Zd,d-Zd.d/Zd0d1Zd2d3Zd4d5Zd6d7Zd8d9Zd:d;Zd<d=Z d>d?Z!d@dAZ"dBdCZ#dDdEZ$dFdGZ%dHdIZ&dJdKZ'dLdMZ(dNdOZ)dPdQZ*dRdSZ+dTdUZ,dVdWZ-dXdYZ.dZd[Z/d\d]Z0d^d_Z1d`daZ2dbdcZ3dddeZ4dfdgZ5dhdiZ6djdkZ7dldmZ8dndoZ9dpdqZ:drdsZ;dtduZ<dvdwZ=dxdyZ>dzd{Z?d|d}Z@d~dZAddZBddZCddZDddZEddZFddZGddZHddZIddZJddZKddZLS) HTMLTokenizera  This class takes care of tokenizing HTML. * self.currentToken Holds the token that is currently being processed. * self.state Holds a reference to the method to be invoked... XXX * self.stream Points to HTMLInputStream object. Nc sbt|||_||_d|_g|_|j|_d|_d|_t t |j dS)NF) rstreamparser escapeFlag lastFourChars dataStatestateescape currentTokensuperr__init__)selfrrkwargs) __class__/builddir/build/BUILDROOT/alt-python35-pip-20.2.4-1.el7.x86_64/opt/alt/python35/lib/python3.5/site-packages/pip/_vendor/html5lib/_tokenizer.pyr"(s      zHTMLTokenizer.__init__ccs{tg|_xe|jrvx4|jjrTdtdd|jjjdiVq!Wx|jrr|jjVqXWqWdS)z This is where the magic happens. We do our usually processing through the states and when we have a token to return we yield the token which pauses processing until the next token is requested. type ParseErrordatarN)r tokenQueuerrerrorsrpoppopleft)r#r&r&r'__iter__7s ( zHTMLTokenizer.__iter__c %Cst}d}|rt}d}g}|jj}x8||krm|tk rm|j||jj}q6Wtdj||}|tkrt|}|j jdt ddddd |iind |kod kns|d kr(d }|j jdt ddddd |iinld|ko?dknsd|ko[dknsd|kowdknsd|kodkns|t ddddddddddd d!d"d#d$d%d&d'd(d)d*d+d,d-d.d/d0d1d2d3d4d5d6d7d g#kr?|j jdt ddddd |iiyt |}WnBt k r|d8}t d |d?Bt d9|d:@B}YnX|d;kr|j jdt ddd<i|jj||S)=zThis function returns either U+FFFD or the character based on the decimal or hexadecimal representation. It also discards ";" if present. If not present self.tokenQueue.append({"type": tokenTypes["ParseError"]}) is invoked. r(r)r*z$illegal-codepoint-for-numeric-entitydatavars charAsIntiiiu�r ii iiiiiiiiiiiiiiiiiii i i i i i i i i i iiiiiiii;z numeric-entity-without-semicolon)rrrcharrappendintjoinrr+r frozensetchr ValueErrorunget) r#isHexallowedradix charStackcr4r<vr&r&r'consumeNumberEntityGsb             +  z!HTMLTokenizer.consumeNumberEntityFc Csd}|jjg}|dtks]|dtddfks]|dk rt||dkrt|jj|dn|ddkrkd}|j|jj|ddkrd }|j|jj|r|dtks| r|dtkr|jj|d|j|}q5|j jd t d d d i|jj|j ddj |}nxC|dtk rt jdj |sP|j|jjqnWy2t jdj |dd}t|}Wntk rd}YnX|dk r|ddkr:|j jd t d d di|ddkr|r||tks||tks||dkr|jj|j ddj |}q5t|}|jj|j |dj ||d7}nI|j jd t d d di|jj|j ddj |}|rW|jd dd|7r)z'expected-tag-name-but-got-right-bracketrRz<>?z'expected-tag-name-but-got-question-markzexpected-tag-namerLT)rr<markupDeclarationOpenStatercloseTagOpenStater rr tagNameStater+r=rrCbogusCommentState)r#r*r&r&r'rnws6             zHTMLTokenizer.tagOpenStatecCs1|jj}|tkrOdtdd|dgddi|_|j|_n|dkr|jjdtddd i|j |_n|t kr|jjdtddd i|jjdtd dd i|j |_nH|jjdtddd dd|ii|jj ||j |_dS)Nr(rdrbr*reFr|r)z*expected-closing-tag-but-got-right-bracketz expected-closing-tag-but-got-eofrRz|jjdtdddin|dkrY|j|_n|dkr|jjdtdddi|j|_n|dkr|jjdtddd i|jjdtddd i|j|_nG|t kr |j |_n,|jjdtdd|i|j|_d S) Nrr(rRr*rLr|rlr)zinvalid-codepointu�T) rr<r+r=rrrrwrrr)r#r*r&r&r'rs& #         z,HTMLTokenizer.scriptDataEscapedDashDashStatecCs|jj}|dkr3d|_|j|_n|tkr{|jjdtddd|i||_|j |_n<|jjdtdddi|jj ||j |_dS)Nrzr2r(rRr*rLT) rr<r scriptDataEscapedEndTagOpenStaterr r+r=r scriptDataDoubleEscapeStartStaterCr)r#r*r&r&r'rs   $   z0HTMLTokenizer.scriptDataEscapedLessThanSignStatecCss|jj}|tkr3||_|j|_n<|jjdtdddi|jj ||j |_dS)Nr(rRr*z|jjdtdddin8|dkry|jjdtdddi|j|_n|dkr|jjdtdddi|j|_n|dkr|jjdtddd i|jjdtddd i|j|_ng|t krJ|jjdtddd i|j |_n,|jjdtdd|i|j|_d S) Nrr(rRr*rLr|rlr)zinvalid-codepointu�zeof-in-script-in-scriptT) rr<r+r=rrrrwrrr)r#r*r&r&r'r%s, #           z2HTMLTokenizer.scriptDataDoubleEscapedDashDashStatecCss|jj}|dkrS|jjdtdddid|_|j|_n|jj||j |_dS)Nrzr(rRr*r2T) rr<r+r=rrscriptDataDoubleEscapeEndStaterrCr)r#r*r&r&r'r>s    z6HTMLTokenizer.scriptDataDoubleEscapedLessThanSignStatecCs|jj}|ttdBkrx|jjdtdd|i|jjdkri|j |_ q|j |_ nZ|t kr|jjdtdd|i|j|7_n|jj ||j |_ dS) Nrzr|r(rRr*rT)rzr|)rr<r r@r+r=rrrrrrr rC)r#r*r&r&r'rIs    z,HTMLTokenizer.scriptDataDoubleEscapeEndStatecCs|jj}|tkr1|jjtdnt|tkrf|jdj|dg|j|_n?|dkr|j n&|dkr|j |_n |dkr|j jd t d dd i|jdj|dg|j|_n|d krD|j jd t d ddi|jdjddg|j|_na|t kr|j jd t d ddi|j|_n&|jdj|dg|j|_dS)NTr*r2r|rz'"rPrLr(r)z#invalid-character-in-attribute-namerlzinvalid-codepointu�z#expected-attribute-name-but-got-eof)rrrPrL)rr<r ror r r=attributeNameStaterrkrr+rrr)r#r*r&r&r'rYs6            z&HTMLTokenizer.beforeAttributeNameStatecCsc|jj}d}d}|dkr6|j|_n|tkrw|jddd||jjtd7               z'HTMLTokenizer.beforeAttributeValueStatecCs|jj}|dkr*|j|_n|dkrF|jdn|dkr|jjdtdddi|jdd dd 7|tkr{|jj||jj|j|j|_ndS)Nr|T) rr<r+r=r rrrrC)r#r*r&r&r'rs  zHTMLTokenizer.bogusDoctypeStatecCs\g}x|j|jjd|j|jjd|jj}|tkrZPq |dkslt|ddddkr|ddd|ds