`TX` is short for `Tuxee XML`. It's a set of Python modules to
generate, transform, parse, and search XML (and HTML) documents.
10 | *Module* | *Description* |
| `tx.nodes` | Define classes for each type of node used to build an XML tree |
| `tx.tags` | Simplify tree creation |
| `tx.htmltree` | Build an XML tree using the `htmlparser` module |
14 | `tx.xpath` | Translate XPath expressions to Python functions |
16 And some modules used internally:
18 | *Module* | *Description* |
19 | `tx.error` | Exceptions used in `tx` |
20 | `tx.parser` | Generic parser inspired by PyParsing |
| `tx.iterators` | Iterators to walk an XML tree in various ways |
22 | `tx.htmlparser` | Error tolerant HTML parser |
23 | `tx.xpathparser` | Translate XPath expressions to "s-expression" |
24 | `tx.xpathfn` | Provide XPath/XQuery functions and operators |
25 | `tx.context` | XPath context object |
26 | `tx.sequence` | XPath sequence object |
30 | *Module* | *Description* |
31 | `tx.misc` | Contains some utility functions |
32 | `tx.rxpcompat` | Translate RXP-like tree structure to `tx` tree |
33 | `tx.xpath_misc` | ... |
34 | `tx.sequence_misc` | ... |
35 | `tx.nodes_misc` | ... |
39 | Archive | `frederic@jolliton.com--2005-main` |
40 | Location | `http://arch.tuxee.net/2005` |
41 | Version | `tx--main--0.1` |
42 | ArchZoom | [Web interface http://arch.tuxee.net/archzoom.cgi/frederic@jolliton.com--2005-main/tx--main--0.1] |
The `install` script installs `tx` into the `/opt/tuxeenet` directory
and adds a link (`.pth`) from the Python directory, so that `tx` is
directly reachable under the name `tuxeenet.tx`.
52 *Note*: The special variable `_` is the result of the previous computation.
54 == Generating tree with tags ==
56 Importing the `tags` object as `w`:
59 >>> from tuxeenet.tx.tags import tags as w
62 then generating a tree and serializing it:
>>> w.html( w.head( w.title( 'Hello, World!' ) ) , w.body( 'bla bla bla' ) )
>>> _.serialize()
'<html><head><title>Hello, World!</title></head><body>bla bla bla</body></html>'
The `_doc_` and `_comment_` names have special meanings. The former
creates a `Document` node, while the latter creates a `Comment` node.
74 >>> w._doc_( w.foo( w._comment_( ' this is a comment ' ) ) , w.bar( 'quux&baz' ) ).serialize()
75 '<foo><!-- this is a comment --></foo><bar>quux&baz</bar>'
77 >>> w.foo( w._comment_( ' a comment ' ) , 'bar' , id = 'contents' , width = '92' ).serialize()
78 '<foo id="contents" width="92"><!-- a comment -->bar</foo>'
For attribute names, a double `_` is translated to `:` and a single `_`
is translated to `-`. A `_` starting a name is dropped (useful when
the name matches a Python keyword).
86 >>> w._return( 'Et voila !' , xml__lang = 'fr' , _class = 'rt2' ).serialize()
87 '<return xml:lang="fr" class="rt2">Et voila !</return>'
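The naming rules above can be sketched as a small standalone function. This only illustrates the rules as stated and is not the code used by `tx.tags`; the name `translate_name` is hypothetical, and the special names `_doc_` and `_comment_` are not handled.

```python
def translate_name(name):
    """Apply the naming rules described above to a Python identifier."""
    # A leading '_' is dropped (lets us spell keywords such as 'class').
    if name.startswith('_'):
        name = name[1:]
    # Replace double '_' first, so that only the remaining single '_' become '-'.
    return name.replace('__', ':').replace('_', '-')

print(translate_name('xml__lang'))   # xml:lang
print(translate_name('_class'))      # class
print(translate_name('http_equiv'))  # http-equiv
```

The order of the two `replace` calls matters: doing the single `_` first would turn `xml__lang` into `xml--lang`.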
90 == Generating tree from nodes ==
The `tags` object from the `tags` module is just a convenient way of
building a tree; behind the scenes it simply constructs `Element`,
`Attribute`, `Text`, .. nodes implicitly.

Here is how to generate a tree directly from these objects:
99 >>> from tuxeenet.tx.nodes import *
100 >>> a = Attribute( 'id' , 'contents' )
101 >>> b = Attribute( 'width' , '92' )
102 >>> c = Comment( ' a comment ' )
103 >>> d = Text( 'bar' )
>>> e = Element( 'foo' , ( a , b ) , ( c , d ) ) # name, attributes, children
>>> e.serialize()
'<foo id="contents" width="92"><!-- a comment -->bar</foo>'
112 >>> from urllib import urlopen
113 >>> from tuxeenet.tx.htmltree import parse
116 Fetching and parsing the [http://slashdot.org] page:
>>> doc = parse( urlopen( 'http://slashdot.org/' ).read() )
>>> doc
<Document with 2 children>
The examples below use this `doc` variable. Note that you will not
necessarily get exactly the same output, since the page (the homepage of
Slashdot) changes over time.
From `doc`, we can output a verbose tree that shows the document
structure along with the type of each node:
134 >>> print doc.asDebug( maxChildren = 2 )
135 DOCUMENT[0] with 2484 nodes
140 TEXT[5] 'Slashdot: News for nerds, stuff that matters'
142 ATTRIBUTE[7] rel = `top`
143 ATTRIBUTE[8] title = `News for nerds, stuff that matters`
144 ATTRIBUTE[9] href = `//slashdot.org/`
145 [.. and 13 more children ..]
147 [.. and 2 more children ..]
150 == Querying with XPath ==
152 A large part of XPath 2.0 is available.
For example, to extract the string value of the `title` attribute of
a `link` element that also has a `rel` attribute with value `top`,
where this `link` element is a child of the `head` element, itself a
child of the root element `html`:
160 >>> doc[ '/html/head/link[@rel="top"]/@title/string()' ]
161 >>> tuple(_) # Convert resulting sequence to a tuple
162 (u'News for nerds, stuff that matters',)
(Note that a Unicode string is returned.)
167 == Building RXP-like structure ==
The `pyRXP` module is a wrapper for the `RXP` XML parser. `tx` provides
a way to convert an existing tree to the type of structure used by
`pyRXP`.

This is really only useful for compatibility with `RXP`.
177 >>> sequence = doc[ '//font[@face="verdana"]' ]
178 >>> sequence[ 0 ].asRxp()
179 ('font', {'color': '#001670', 'face': 'verdana'}, [u'\xa0', ('b', None, ['OSTG'], None)], None)
182 and translating back to a `tx` tree:
185 >>> doc = ('font', {'color': '#001670', 'face': 'verdana'}, [u'\xa0', ('b', None, ['OSTG'], None)], None)
>>> from tuxeenet.tx.rxpcompat import fromRxp
>>> fromRxp( doc )
<Element font with 2 attributes and 2 children>
189 >>> fromRxp( doc ).serialize()
190 '<font color="#001670" face="verdana">\xc2\xa0<b>OSTG</b></font>'
*Important note*: You may have noticed that the `\xa0` is printed as
`\xc2\xa0`. This is because `.serialize()` produces a string with `utf-8`
encoding.
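The difference between the two forms can be reproduced with plain Python strings, independently of `tx`. A minimal sketch:

```python
# U+00A0 (no-break space) is one character, but two bytes once encoded as UTF-8.
text = u'\xa0'
encoded = text.encode('utf-8')

print(len(text))                        # 1
print(len(encoded))                     # 2
print(encoded == b'\xc2\xa0')           # True
print(encoded.decode('utf-8') == text)  # True
```

Decoding the serialized string with `utf-8` therefore gives back the original character.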
197 == Translating XQuery and XSLT examples to Python ==
The first two examples show how to translate XQuery to Python, while the
third example shows how to translate XSLT to Python.
These examples show that specialized languages are not necessary to
process XML documents with the same power as XQuery or XSLT.
206 === XQuery example 1 ===
Python can be used to fully replace complex XQuery operations.
210 Taking the following example from the XQuery spec:
213 let $i := <tool>wrench</tool>
214 let $o := <order> {$i} <quantity>5</quantity> </order>
215 let $odoc := document ($o)
219 Which is followed by these expected results:
221 * `fn:root($i)` returns `$i`
223 * `fn:root($o/quantity)` returns `$o`
225 * `fn:root($odoc//quantity)` returns `$odoc`
227 * `fn:root($newi)` returns `$o`
* The XQuery version implicitly copies the tree, but in Python we have to
  request the copy explicitly with the `clone` member function.

* We have to call `.finalize()` on trees which are not `Document`
  nodes, because by default any node which is not a `Document` is
  considered part of another tree, and not the root of a tree by itself.

* We have to run `fnRoot` (the `fn:root` function) in the
  `nullContext` explicitly.
241 First part translated in Python with `tx` modules (using `tags` module):
244 i = w.tool( 'wrench' )
245 o = w.order( i.clone() , w.quantity( '5' ) )
247 newi = o/'tool' # Notice the use of the '/' operator to use XPath
250 Declare trees as standalone:
258 (Note: We make a `root` function to simplify `fnRoot` usage.)
261 >>> from tuxeenet.tx.sequence import Sequence
262 >>> from tuxeenet.tx.context import Context
263 >>> from tuxeenet.tx.xpathfn import fnRoot
264 >>> root = lambda node : fnRoot( Context() , Sequence( node ) )
267 Then we can check expected results:
270 >>> assert root( i ) == Sequence( i )
271 >>> assert root( o/'quantity' ) == Sequence( o )
272 >>> assert root( odoc/'.//quantity' ) == Sequence( odoc ) # The '.' is important
273 >>> assert root( newi ) == Sequence( o )
276 === XQuery example 2 ===
278 An example from [http://www.perfectxml.com/XQuery.html], with the
279 `books.xml` document used below:
284 <title>TCP/IP Illustrated</title>
289 <publisher>Addison-Wesley</publisher>
294 <title>Advanced Programming in the UNIX Environment</title>
299 <publisher>Addison-Wesley</publisher>
304 <title>Data on the Web</title>
306 <last>Abiteboul</last>
317 <publisher>Morgan Kaufmann Publishers</publisher>
322 <title>The Economics of Technology and Content for Digital TV</title>
326 <affiliation>CITI</affiliation>
328 <publisher>Kluwer Academic Publishers</publisher>
329 <price>129.95</price>
334 The XQuery source code:
339 for $p in distinct-values(doc("books.xml")//publisher)
345 for $b in doc("books.xml")/bib/book
346 where $b/publisher = $p
A translation to Python using `tx`, supposing the `books.xml` tree is in `doc`:
360 w.result( p , sorted( b/'title'
361 for b in doc/'/bib/book'
362 if b/'publisher' == p ) )
363 for p in sorted( doc/'distinct-values(//publisher)' ) )
Which constructs the following document (indentation added manually):
371 <publisher>Addison-Wesley</publisher>
372 <title>Advanced Programming in the UNIX Environment</title>
373 <title>TCP/IP Illustrated</title>
376 <publisher>Kluwer Academic Publishers</publisher>
377 <title>The Economics of Technology and Content for Digital TV</title>
380 <publisher>Morgan Kaufmann Publishers</publisher>
381 <title>Data on the Web</title>
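For comparison, the same grouping can be written with only the standard library's `xml.etree.ElementTree`, which is not part of `tx`. This is a sketch under assumptions: the `BOOKS` sample keeps only the `title` and `publisher` of each book, and the `results` wrapper name and the `titles_by_publisher` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced sample: only the title and publisher of each book are kept.
BOOKS = """
<bib>
  <book><title>TCP/IP Illustrated</title><publisher>Addison-Wesley</publisher></book>
  <book><title>Advanced Programming in the UNIX Environment</title><publisher>Addison-Wesley</publisher></book>
  <book><title>Data on the Web</title><publisher>Morgan Kaufmann Publishers</publisher></book>
  <book><title>The Economics of Technology and Content for Digital TV</title><publisher>Kluwer Academic Publishers</publisher></book>
</bib>
"""

def titles_by_publisher(bib):
    """Build one <result> per distinct publisher, listing its sorted titles."""
    results = ET.Element('results')   # hypothetical wrapper element
    publishers = sorted(set(p.text for p in bib.iter('publisher')))
    for name in publishers:
        result = ET.SubElement(results, 'result')
        ET.SubElement(result, 'publisher').text = name
        titles = sorted(book.findtext('title')
                        for book in bib.findall('book')
                        if book.findtext('publisher') == name)
        for title in titles:
            ET.SubElement(result, 'title').text = title
    return results

grouped = titles_by_publisher(ET.fromstring(BOOKS.strip()))
for result in grouped:
    print(result.findtext('publisher'))
    for title in result.findall('title'):
        print('  ' + title.text)
```

The `set` plays the role of `distinct-values`, and the generator with an `if` clause plays the role of the `where` clause.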
386 === XSLT example 1 ===
Doing an XSLT-like transformation.
390 Taking example from [http://www.adp-gmbh.ch/xml/xslt_examples.html],
394 <?xml version="1.0" ?>
397 <persons category="medicine">
399 <firstname> Edward </firstname>
400 <name> Jenner </name>
403 <firstname> Gertrude </firstname>
407 <persons category="computer science">
409 <firstname> Charles </firstname>
410 <name> Babbage </name>
413 <firstname> Alan </firstname>
414 <name> Touring </name>
417 <firstname> Ada </firstname>
421 <persons category="astronomy">
423 <firstname> Tycho </firstname>
427 <firstname> Johannes </firstname>
428 <name> Kepler </name>
431 <firstname> Galileo </firstname>
432 <name> Galilei </name>
441 <?xml version="1.0" ?>
442 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
444 <xsl:template match="/">
445 <html><head><title>Sorting example</title></head><body>
446 <xsl:apply-templates select="famous-persons/persons">
447 <xsl:sort select="@category" />
448 </xsl:apply-templates>
452 <xsl:template match="persons">
453 <h2><xsl:value-of select="@category" /></h2>
455 <xsl:apply-templates select="person">
456 <xsl:sort select="name" />
457 <xsl:sort select="firstname" />
458 </xsl:apply-templates>
462 <xsl:template match="person">
463 <xsl:text disable-output-escaping="yes">
466 <b><xsl:value-of select="name" /></b>
467 <xsl:value-of select="firstname" />
could be translated to Python as follows:
import operator as op # for op.itemgetter
def transform( node ) :
    if node.match( '/' ) :
        return w.html( w.head( w.title( 'Sorting example' ) ) ,
                       w.body( map( transform ,
                                    sorted( node/'famous-persons/persons' ,
                                            key = op.itemgetter( '@category' ) ) ) ) )
    elif node.match( 'persons' ) :
        return ( w.h2( node/'@category/string()' ) ,
                 w.ul( map( transform ,
                            sorted( sorted( node/'person' ,
                                            key = op.itemgetter( 'firstname' ) ) ,
                                    key = op.itemgetter( 'name' ) ) ) ) )
    elif node.match( 'person' ) :
        return w.li( w.b( node/'name/string()' ) ,
                     node/'firstname/string()' )

result = w._doc_( transform( doc ) )
499 DOCUMENT[0] with 47 nodes
503 TEXT[4] 'Sorting example'
514 TEXT[15] u' Galilei '
515 TEXT[16] u' Galileo '
519 TEXT[20] u' Johannes '
521 TEXT[22] u'computer science'
525 TEXT[26] u' Babbage '
526 TEXT[27] u' Charles '
*Note* that we first sort by `firstname`, then by `name`: because
Python's sort is stable, the result is ordered by `name`, with
`firstname` breaking ties. Also note that such examples are only here
to show that we can translate XSLT or XQuery to Python; the result is
not necessarily an optimized alternative.
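The double `sorted` call in the `persons` branch relies on Python's sort being stable: sorting by the secondary key first and by the primary key second gives the same order as a single sort on the pair. A sketch with plain tuples; the names come from the example document, except the `('Galilei', 'Albert')` entry, which is invented here to create a tie on the name.

```python
from operator import itemgetter

# (name, firstname) pairs; the last one is made up to force a tie on 'name'.
people = [('Kepler', 'Johannes'), ('Galilei', 'Galileo'),
          ('Babbage', 'Charles'), ('Galilei', 'Albert')]

# Secondary key first, primary key second: stability keeps the inner order for ties.
two_pass = sorted(sorted(people, key=itemgetter(1)), key=itemgetter(0))
one_pass = sorted(people, key=itemgetter(0, 1))

print(two_pass == one_pass)  # True
print(two_pass)
# [('Babbage', 'Charles'), ('Galilei', 'Albert'), ('Galilei', 'Galileo'), ('Kepler', 'Johannes')]
```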
*Note* also that to emulate XSLT more closely we might have to add
other node matches at the end of `transform`, such as:
539 elif node.match( '@*|text()' ) :
542 return map( transform , node/'node()' )
545 but in our example that was not necessary.
549 An XML document is a tree of nodes.
In the current implementation, nodes are meant to be immutable: once
created, they are not expected to be updated. The main reason is that
a `Document` node numbers all its descendants, and inserting children
somewhere in the document would require this numbering to be redone
at some point.
556 FIXME: Tag the root node with some flag (or the node from which we
557 need to restart the numbering -the lowest one if several descendant
558 are updated) to let it know that it should number again its descendant
559 when needed ? In the meantime, numbers new children with unique number
A `Document` node can contain any node except `Document` and `Attribute`.
566 Constructor: `Document( children = () , finalize = True )`
If `finalize` is `True` (the default), the `Document` numbers all
its descendants and sets their `root` pointers to itself, making the
`Document` node the root node of the tree.
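How such a numbering pass might work can be sketched with a toy node class. The order used here (pre-order, with attributes numbered before children) is inferred from the debug listings in this document; the `ToyNode` class and the `finalize` function are hypothetical and are not the `tx.nodes` implementation.

```python
class ToyNode(object):
    def __init__(self, kind, attributes=(), children=()):
        self.kind = kind
        self.attributes = list(attributes)
        self.children = list(children)
        self.number = None   # position in document order, set by finalize()
        self.root = None     # root of the tree this node belongs to

def finalize(root):
    """Number root and all its descendants in document order; return the node count."""
    counter = [0]
    def visit(node):
        node.number = counter[0]
        node.root = root
        counter[0] += 1
        # Attributes come before children in document order.
        for child in node.attributes + node.children:
            visit(child)
    visit(root)
    return counter[0]

title = ToyNode('element', children=[ToyNode('text')])
link = ToyNode('element', attributes=[ToyNode('attribute'), ToyNode('attribute')])
doc = ToyNode('document', children=[ToyNode('element', children=[title, link])])

print(finalize(doc))   # 7
print(link.number)     # 4
```

Inserting a node in the middle of such a tree would shift every later number, which is why a renumbering pass would be needed after an update.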
An `Element` node can contain any node except `Document`.
576 Constructor: `Element( name , attributes = () , children = () , finalize = False )`
580 An `Attribute` node is only allowed inside an `Element` node.
582 Constructor: `Attribute( name , value )`
586 A `Comment` node is only allowed inside a `Document` or an `Element` node.
588 Constructor: `Comment( contents )`
*Restriction*: In XML, a comment cannot contain `--` or end with `-`.
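The restriction is easy to check before building a `Comment`. The helper below is a hypothetical sketch; whether `tx` itself performs such a check is not stated here.

```python
def is_valid_comment(contents):
    """Return True if contents may appear between '<!--' and '-->'."""
    return '--' not in contents and not contents.endswith('-')

print(is_valid_comment(' this is a comment '))  # True
print(is_valid_comment('a -- b'))               # False
print(is_valid_comment('trailing-'))            # False
```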
594 A `Text` node is only allowed inside a `Document` or an `Element` node.
596 Constructor: `Text( contents )`
The `tags` module provides the `tags` object (imported as `w` in this
document), which can be used to create a document tree with Python syntax.

A string is automatically translated to a `Text` node.

Otherwise, a node of type `Document`, `Element` or `Comment` is
created with the general syntax: `w.name( child1 , .. , attribute1 = value1 , .. )`.
Attributes only make sense for the `Element` node type, however.
It's also possible to pass a function, which takes no parameters and
should return a correct XML tree. This function will be called at
each serialization.
616 w._comment_( ' Header ' ) ,
618 w.title( 'This is a example page' ) ,
619 w.link( rel = 'stylesheet' , href = '/default-style.css' , title = 'Default style' ) ,
620 w.meta( http_equiv = 'Content-type' , content = 'text/html' , charset = 'utf-8' ) ) ,
621 w._comment_( ' Body ' ) ,
623 w.h1( 'Section 1' ) ,
624 w.h2( 'Section 1.1' ) ,
626 w.h2( 'Section 1.2' ) ,
generates a document which, once serialized, gives:
636 <title>This is a example page</title>
637 <link href="/default-style.css" rel="stylesheet" title="Default style"/>
638 <meta content="text/html" charset="utf-8" http-equiv="Content-type"/>
651 *Note* that the result here is split into several lines and indented
652 while in reality the result is just one line of text since no `\n`
653 (new line) characters are part of the document.
655 Same document presented with `debug` output:
658 >>> print doc.asDebug()
659 DOCUMENT[0] with 24 nodes
661 COMMENT[2] ' Header '
664 TEXT[5] 'This is a example page'
666 ATTRIBUTE[7] href = `/default-style.css`
667 ATTRIBUTE[8] rel = `stylesheet`
668 ATTRIBUTE[9] title = `Default style`
670 ATTRIBUTE[11] content = `text/html`
671 ATTRIBUTE[12] charset = `utf-8`
672 ATTRIBUTE[13] http-equiv = `Content-type`
678 TEXT[19] 'Section 1.1'
679 TEXT[20] 'Bla bla bla.'
681 TEXT[22] 'Section 1.2'
682 TEXT[23] 'Bla bla bla.'
685 Example of deferred function:
692 return w.p( "I'm generated %d time(s)." % count )
696 print doc.serialize()
697 print doc.serialize()
698 print doc.serialize()
704 <body><p>I'm generated 1 time(s).</p></body>
705 <body><p>I'm generated 2 time(s).</p></body>
706 <body><p>I'm generated 3 time(s).</p></body>
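The mechanism can be imitated without `tx` by letting a serializer call any child that is callable. This is only a sketch of the idea: `serialize` and the tuple representation of elements are hypothetical and say nothing about how `tx.tags` stores deferred children.

```python
import itertools

counter = itertools.count(1)

def generated():
    """Deferred child: called again at every serialization."""
    return ('p', ["I'm generated %d time(s)." % next(counter)])

def serialize(node):
    if callable(node):          # deferred child: build the subtree now
        node = node()
    if isinstance(node, str):   # text child (no escaping in this toy)
        return node
    name, children = node
    return '<%s>%s</%s>' % (name, ''.join(serialize(c) for c in children), name)

doc = ('body', [generated])
print(serialize(doc))  # <body><p>I'm generated 1 time(s).</p></body>
print(serialize(doc))  # <body><p>I'm generated 2 time(s).</p></body>
```

Because the tree stores the function itself rather than its result, each serialization sees a freshly built subtree.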
The `xpath` module provides a large subset of XPath 2.0.
713 Unsupported features are:
715 * `instance of`, `treat as`, `castable as`, `cast as` operators,
717 * `processing-instruction(..)` and `namespace(..)` tests,
719 * `schema-attribute(..)` and `schema-element(..)` tests,
721 * date support or any type except `string`, `float` and `boolean`
722 (`decimal` and `double` are considered as `float`.)
The `xpath` module contains a `compile` function which takes an XPath
expression and returns a function. That function takes a context as
argument and returns a sequence as the result of the evaluation.
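The compile-then-evaluate shape described above can be illustrated with a toy that only understands child steps such as `a/b`. It uses the standard library's `xml.etree.ElementTree` instead of `tx`, and the `compile_path` name is hypothetical; the real `compile` handles XPath 2.0 and takes a context object rather than a bare node.

```python
import xml.etree.ElementTree as ET

def compile_path(expression):
    """Translate 'a/b/c' once into a function from a context node to a list of nodes."""
    steps = [step for step in expression.split('/') if step]
    def evaluate(context_node):
        nodes = [context_node]
        for name in steps:
            # One child step: replace each node by its children called `name`.
            nodes = [child for node in nodes for child in node.findall(name)]
        return nodes
    return evaluate

html = ET.fromstring('<html><head><title>Hello</title></head><body/></html>')
titles = compile_path('head/title')      # parsed once...
print([t.text for t in titles(html)])    # ...evaluated as often as needed: ['Hello']
```

Separating the two phases means the cost of parsing the expression is paid once, however many context nodes it is later applied to.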
730 == XPath class wrapper ==
The `XPath` class is a small convenience wrapper around the `compile` function.

An instance of the `XPath` class is created with an XPath expression.
To evaluate the XPath expression, use the `eval` member function with an
optional context node.
739 >>> from tuxeenet.tx.xpath import XPath
740 >>> x1 = XPath( '//@href' )
>>> x1.eval( doc ) # returns all href attributes in document 'doc'
744 == Shortcut syntax ==
The `Node` base class defines the operators `[]` and `/` to make it easy to
query a tree with an XPath expression.
750 >>> doc[ '/html/head/link[@rel="top"]/@title/string()' ]
756 >>> doc / '/html/head/link[@rel="top"]/@title/string()'
are equivalent; the latter form could, however, be written:
762 >>> doc/'html/head/link[@rel="top"]/@title/string()'
765 (without the initial `/` in the XPath expression) since `doc` is
766 already the root node (for this example.)
768 This is almost the direct translation of the following XQuery code:
771 $doc/html/head/link[@rel="top"]/@title/string()
For debugging purposes, an "XPath prompt" application is available
to interactively evaluate XPath expressions.
781 XPath TX 0.1 - (c)2005 Frederic Jolliton <frederic@jolliton.com>
786 Then any supported XPath expression can be entered.
There are some special commands:

* `\.` followed by a URI (filename, URL, ..) to load a document
  as the default context item,
793 * `\d` switch to default display mode,
795 * `\f` switch to full display mode,
797 * `\s` switch to short display mode,
799 * `\i` switch to inline display mode,
801 * `\l` switch the display of the location of resulting node on/off,
803 * `\e` followed by an XPath expression show its syntax tree,
805 * `\x` toggle query duration display,
807 * `\v` display names of variables currently defined,
809 * `\o` toggle query optimization on/off,
811 * `\c` flush query cache (useful after `\o` command to flush already
812 compiled expression),
814 * `$name := expression` evaluate XPath `expression` and store
815 the result into variable named `name`.
817 The `$current` variable is used as context item when evaluating XPath
820 Producing sequence of numbers:
831 XPath2.0> 12, 17 to 20, 22
832 Sequence(12, 17, 18, 19, 20, 22)
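In Python terms, `to` builds an inclusive integer range and `,` concatenates sequences. A rough equivalent using plain lists (not the `tx` `Sequence` class):

```python
# XPath: 12, 17 to 20, 22
# range() excludes its upper bound, so 'to 20' becomes range(17, 21).
numbers = [12] + list(range(17, 21)) + [22]
print(numbers)  # [12, 17, 18, 19, 20, 22]
```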
835 Using `if` ternary operator:
838 XPath2.0> if (1=1) then "ok" else "failed"
840 XPath2.0> if (1!=1) then "ok" else "failed"
844 Working with XML tree:
849 XPath2.0> $current := doc('http://slashdot.org')
852 * Extracting the title:
855 XPath2.0> /html/head/title
856 <Element title with 0 attributes and 1 children>
859 XPath2.0> /html/head/title
860 <title>Slashdot: News for nerds, stuff that matters</title>
* Extracting article titles (pre-September 2005):
868 XPath2.0> //td[@align='LEFT']//font[@color]/b[text()]/string()
869 1 STRING Ask Slashdot: GSM and Asterisk Integration?
870 2 STRING Hardware: Free WiFi Trend Continues
871 3 STRING Linux: Winemaker Drinks To Linux
872 4 STRING Games: World of Warcraft Card Game Coming Soon
873 5 STRING Your Rights Online: Is Your Boss a Psychopath?
874 6 STRING Linux: Australian Linux Trademark Holds Water
875 7 STRING Science: Nanotubes Start to Show their Promise
* Extracting article titles (post-September 2005):
883 XPath2.0> //div[@class='generaltitle']/normalize-space()
884 1 STRING Tivo Institutes 1 Year Service Contracts
885 2 STRING Politics: US Senate Allows NASA To Buy Soyuz Vehicles
886 3 STRING IT: Reconnaissance In Virtual Space
887 4 STRING Your Rights Online: FBI Agents Put New Focus on Deviant Porn
888 5 STRING Ask Slashdot: Top 50 Science Fiction TV Shows
889 6 STRING Your Rights Online: Business At The Price Of Freedom
890 7 STRING Apple: Music Exec Fires Back At Apple CEO
891 8 STRING Science: Grammar Traces Language Roots
892 9 STRING Developers: RMS Previews GPL3 Terms
893 10 STRING Massachusetts Finalizes OpenDocument Standard Plan
894 11 STRING Developers: Palm Teams With Microsoft for Smart Phone
895 12 STRING Developers: Why Vista Had To Be Rebuilt From Scratch
896 13 STRING Hardware: Nabaztag the WiFi Bunny
897 14 STRING Revamping the Movie Distribution Chain
898 15 STRING Politics: Municipal Broadband Projects Spread Across U.S.
* Extracting RSS titles from an external document:
906 XPath2.0> doc('http://rss.slashdot.org/Slashdot/slashdot')//item/title/string()
907 1 STRING Tivo Institutes 1 Year Service Contracts
908 2 STRING US Senate Allows NASA To Buy Soyuz Vehicles
909 3 STRING Reconnaissance In Virtual Space
910 4 STRING FBI Agents Put New Focus on Deviant Porn
911 5 STRING Top 50 Science Fiction TV Shows
912 6 STRING Business At The Price Of Freedom
913 7 STRING Music Exec Fires Back At Apple CEO
914 8 STRING Grammar Traces Language Roots
915 9 STRING RMS Previews GPL3 Terms
916 10 STRING Massachusetts Finalizes OpenDocument Standard Plan
919 * Extracting 1st, 3rd and 7th comment of the document:
926 XPath2.0> (//comment())[position()=(1,3,7)]
927 <!-- BEGIN: AdSolution-Tag 4.2: Global-Code -->
928 <!-- begin OSTG navbar -->
931 -- 0.038567s(parse) + 0.006351s(eval) --
934 * Querying distinct `value` attributes of the document:
939 XPath2.0> distinct-values(//@value)
940 1 NODE ATTRIBUTE[1856] value = ``
941 2 NODE ATTRIBUTE[1866] value = `//slashdot.org/`
942 3 NODE ATTRIBUTE[1871] value = `userlogin`
943 4 NODE ATTRIBUTE[1882] value = `yes`
944 5 NODE ATTRIBUTE[1889] value = `Log in`
945 6 NODE ATTRIBUTE[1940] value = `1307`
946 7 NODE ATTRIBUTE[1945] value = `mainpage`
947 8 NODE ATTRIBUTE[1954] value = `1`
948 9 NODE ATTRIBUTE[1960] value = `2`
949 10 NODE ATTRIBUTE[1966] value = `3`
950 11 NODE ATTRIBUTE[1972] value = `4`
951 12 NODE ATTRIBUTE[1978] value = `5`
952 13 NODE ATTRIBUTE[1984] value = `6`
953 14 NODE ATTRIBUTE[1990] value = `7`
954 15 NODE ATTRIBUTE[1995] value = `Vote`
955 16 NODE ATTRIBUTE[2346] value = `freshmeat.net`
956 17 NODE ATTRIBUTE[2439] value = `Search`
957 -- 0.012589s(parse) + 0.055717s(eval) --
960 * Computing number of pixels covered by `img` elements
965 XPath2.0> for $img in //img return $img/@width * $img/@height
966 Sequence(19800, 4800, 2862, 4307, 4225, 4602, 5, 208, 4800, 208, 2862, \
967 208, 4307, 208, 4800, 208, 4225, 208, 4602, 208, 4125, 208, 6216, 208, \
968 5184, 208, 5070, 230, 230, 230, 230, 230, 230, 230, 5, nan, 1)
974 XPath2.0> //img/(@height * @width)
975 Sequence(19800, 4800, 2862, 4307, 4225, 4602, 5, 208, 4800, 208, 2862, \
976 208, 4307, 208, 4800, 208, 4225, 208, 4602, 208, 4125, 208, 6216, 208, \
977 5184, 208, 5070, 230, 230, 230, 230, 230, 230, 230, 5, nan, 1)
980 * Looking for `rel` attribute, with location (the XPath expression
981 that could be used to identify precisely the node in the resulting
990 1 NODE /html/head/link[1]/@rel
991 ATTRIBUTE[7] rel = `top`
992 2 NODE /html/head/link[2]/@rel
993 ATTRIBUTE[12] rel = `search`
994 3 NODE /html/head/link[3]/@rel
995 ATTRIBUTE[17] rel = `alternate`
996 4 NODE /html/head/link[4]/@rel
997 ATTRIBUTE[23] rel = `shortcut icon`
Displaying the parse tree of some expressions (mainly useful for debugging purposes!):
1004 (exprlist (+ (path (integer "1"))
1005 (path (integer "2"))))
1007 XPath2.0> \e foo() + bar()
1008 (exprlist (+ (path (call "foo"))
1009 (path (call "bar"))))
1011 XPath2.0> \e ../foo | @bar
1012 (exprlist (union (path (parent (node))
1013 (child (element "foo")))
1014 (path (attribute (attribute "bar")))))
1016 XPath2.0> \e /html/child::head/element(title)/string()
1018 (child (element "html"))
1019 (child (element "head"))
1020 (child (element "title"))
1023 XPath2.0> \e for $att in distinct-values(//@*/name()) return ($att,count(//attribute()[name()=$att]))
1024 (exprlist (for ((att (path (call "distinct-values"
1026 (descendant-or-self (node))
1027 (attribute (attribute "*"))
1029 (path (exprlist (path (varref "att"))
1032 (descendant-or-self (node))
1033 (predicates (attribute (attribute))
1034 (exprlist (= (path (call "name"))
1035 (path (varref "att"))))))))))))
1040 Patterns are used for XSLT nodes matching.
1043 node.match( 'a/b' ) # Return True if node is an element `b` with a parent element `a`
*Note*: Internally, patterns are translated to XPath expressions. Such
an expression returns the empty sequence if the pattern doesn't match
the node. For example, `a/b` becomes something like
`self::element(b)/parent::element(a)`.
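The right-to-left reading of a pattern can be sketched with a toy node that knows its parent. This handles only `/`-separated element names; `ToyElement` and `matches` are hypothetical and unrelated to the actual `match` implementation.

```python
class ToyElement(object):
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

def matches(node, pattern):
    """True if node is named like the last step and its ancestors match the earlier steps."""
    for name in reversed(pattern.split('/')):
        if node is None or node.name != name:
            return False
        node = node.parent   # equivalent of a parent:: step
    return True

a = ToyElement('a')
b = ToyElement('b', parent=a)

print(matches(b, 'a/b'))  # True
print(matches(a, 'a/b'))  # False
print(matches(b, 'b'))    # True
```

Starting from the node itself and walking up through its parents avoids searching the whole document for candidates.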
The `htmlparser` module provides a replacement for the `HTMLParser` module
shipped with Python. The main difference is that the `tx` module
never raises an error: it is able to parse even the worst HTML documents.
*Note*: To parse regular XML documents, a parser like `rxp` could be
used instead with the help of the `rxpcompat` module, because `HTMLParser`
is not designed for XML and is not necessarily good enough for this
purpose. Your mileage may vary.
1062 The `htmltree` module use the `htmlparser` module and produce
>>> import sys
>>> from tuxeenet.tx.htmltree import parse
1068 >>> parse( '<html><test></html>' ).asDebug( file = sys.stdout )
1069 DOCUMENT[0] with 3 nodes
1072 >>> parse( '<html>some ~< bad <p>document</b> /><p>really' ).asDebug( file = sys.stdout )
1073 DOCUMENT[0] with 7 nodes
1075 TEXT[2] 'some ~< bad '
1077 TEXT[4] 'document />'
1080 >>> parse( '<html>some ~< bad <p>document</b> /><p>really' ).serialize()
1081 '<html>some ~< bad <p>document /></p><p>really</p></html>'