*** TX ***

[[TableOfContents]]

= Summary =

`TX` is short for `Tuxee XML`. It's a set of Python modules to generate, transform, parse and search XML (and HTML) documents.

| *Module* | *Description* |
| `tx.nodes` | Defines classes for each type of node used to build an XML tree |
| `tx.tags` | Simplifies tree creation |
| `tx.htmltree` | Builds an XML tree using the `htmlparser` module |
| `tx.xpath` | Translates XPath expressions to Python functions |

And some modules used internally:

| *Module* | *Description* |
| `tx.error` | Exceptions used in `tx` |
| `tx.parser` | Generic parser inspired by PyParsing |
| `tx.iterators` | Iterators to walk an XML tree in various ways |
| `tx.htmlparser` | Error-tolerant HTML parser |
| `tx.xpathparser` | Translates XPath expressions to "s-expressions" |
| `tx.xpathfn` | Provides XPath/XQuery functions and operators |
| `tx.context` | XPath context object |
| `tx.sequence` | XPath sequence object |

Misc. modules:

| *Module* | *Description* |
| `tx.misc` | Contains some utility functions |
| `tx.rxpcompat` | Translates RXP-like tree structures to `tx` trees |
| `tx.xpath_misc` | ... |
| `tx.sequence_misc` | ... |
| `tx.nodes_misc` | ... |

= Arch repository =

| Archive | `frederic@jolliton.com--2005-main` |
| Location | `http://arch.tuxee.net/2005` |
| Version | `tx--main--0.1` |
| ArchZoom | [http://arch.tuxee.net/archzoom.cgi/frederic@jolliton.com--2005-main/tx--main--0.1 Web interface] |

= Installation =

The script `install` installs `tx` into the `/opt/tuxeenet` directory, and adds a link (`.pth`) from the Python directory so that `tx` is directly reachable under the name `tuxeenet.tx`.

= Overview =

*Note*: The special variable `_` is the result of the previous computation.

== Generating tree with tags ==

Importing the `tags` object as `w`:
-=-=-
>>> from tuxeenet.tx.tags import tags as w
-=-=-
then generating a tree and serializing it:
-=-=-
>>> w.html( w.head( w.title( 'Hello, World!' ) ) , w.body( 'bla bla bla' ) )
>>> _.serialize()
'<html><head><title>Hello, World!</title></head><body>bla bla bla</body></html>'
-=-=-
The `_doc_` and `_comment_` names have special meanings. The former creates a `Document` node, while the latter creates a `Comment` node.
-=-=-
>>> w._doc_( w.foo( w._comment_( ' this is a comment ' ) ) , w.bar( 'quux&baz' ) ).serialize()
'<foo><!-- this is a comment --></foo><bar>quux&amp;baz</bar>'
>>> w.foo( w._comment_( ' a comment ' ) , 'bar' , id = 'contents' , width = '92' ).serialize()
'<foo id="contents" width="92"><!-- a comment -->bar</foo>'
-=-=-
For attribute names, a double `_` is translated to `:` and a single `_` is translated to `-`. A `_` starting a name is dropped (useful when the name matches a Python keyword.)
-=-=-
>>> w._return( 'Et voila !' , xml__lang = 'fr' , _class = 'rt2' ).serialize()
'<return xml:lang="fr" class="rt2">Et voila !</return>'
-=-=-

== Generating tree from nodes ==

The `tags` object from the `tags` module is just a convenient way to build trees; in reality it just constructs `Element`, `Attribute`, `Text`,.. nodes implicitly. Here is how to generate a tree directly from these objects:
-=-=-
>>> from tuxeenet.tx.nodes import *
>>> a = Attribute( 'id' , 'contents' )
>>> b = Attribute( 'width' , '92' )
>>> c = Comment( ' a comment ' )
>>> d = Text( 'bar' )
>>> e = Element( 'foo' , ( a , b ) , ( c , d ) ) # name, attributes, children
>>> e.serialize()
'<foo id="contents" width="92"><!-- a comment -->bar</foo>'
-=-=-

== Parsing HTML ==

-=-=-
>>> from urllib import urlopen
>>> from tuxeenet.tx.htmltree import parse
-=-=-
Fetching and parsing the [http://slashdot.org] page:
-=-=-
>>> doc = parse( urlopen( 'http://slashdot.org/' ).read() )
>>> doc
-=-=-
Examples below will use this `doc` variable. Note that you will not necessarily get the exact same output, since the page (the homepage of Slashdot) can of course change.
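The attribute-name translation rule described in the Tags overview above (double `_` to `:`, single `_` to `-`, leading `_` dropped) can be sketched in a few lines of plain Python. This is an illustrative helper, not `tx`'s actual code, and the name `attr_name` is ours:

```python
def attr_name(name):
    # A leading '_' is dropped, so '_class' can stand for the
    # Python keyword-colliding attribute name 'class'.
    if name.startswith('_'):
        name = name[1:]
    # '__' becomes ':' (namespace separator) first, then any
    # remaining single '_' becomes '-'.
    return name.replace('__', ':').replace('_', '-')
```

With this rule, `xml__lang` maps to `xml:lang`, `http_equiv` to `http-equiv`, and `_class` to `class`, matching the serialized output shown above.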
== Debug tree ==

From `doc`, we can output a verbose tree showing the document structure with node types:
-=-=-
>>> print doc.asDebug( maxChildren = 2 )
DOCUMENT[0] with 2484 nodes
 TEXT[1] '\n'
 ELEMENT[2] html
  ELEMENT[3] head
   ELEMENT[4] title
    TEXT[5] 'Slashdot: News for nerds, stuff that matters'
   ELEMENT[6] link
    ATTRIBUTE[7] rel = `top`
    ATTRIBUTE[8] title = `News for nerds, stuff that matters`
    ATTRIBUTE[9] href = `//slashdot.org/`
   [.. and 13 more children ..]
  TEXT[36] '\n'
  [.. and 2 more children ..]
-=-=-

== Querying with XPath ==

A large part of XPath 2.0 is available. For example, to extract the string value of the attribute `title` of a `link` element which also has an attribute `rel` with value `top`, where this `link` element is a child of the `head` element, itself a child of the root element `html`:
-=-=-
>>> doc[ '/html/head/link[@rel="top"]/@title/string()' ]
>>> tuple(_) # Convert resulting sequence to a tuple
(u'News for nerds, stuff that matters',)
-=-=-
(Note that a Unicode string is returned.)

== Building RXP-like structure ==

The `pyRXP` module is a wrapper for the `RXP` XML parser. `tx` provides a way to convert an existing tree to the type of structure used by `pyRXP`. This is really only useful for compatibility with the `RXP` module.
-=-=-
>>> sequence = doc[ '//font[@face="verdana"]' ]
>>> sequence[ 0 ].asRxp()
('font', {'color': '#001670', 'face': 'verdana'}, [u'\xa0', ('b', None, ['OSTG'], None)], None)
-=-=-
and translating back to a `tx` tree:
-=-=-
>>> doc = ('font', {'color': '#001670', 'face': 'verdana'}, [u'\xa0', ('b', None, ['OSTG'], None)], None)
>>> from tuxeenet.tx.rxpcompat import fromRxp
>>> fromRxp( doc )
>>> fromRxp( doc ).serialize()
'<font color="#001670" face="verdana">\xc2\xa0<b>OSTG</b></font>'
-=-=-
*Important note*: You may have noticed that the `\xa0` is printed as `\xc2\xa0`. That's because `.serialize()` produces strings encoded in UTF-8 by default.
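The RXP 4-tuple shape `(name, attributes, children, extra)` is simple enough to serialize without `tx` at all. Here is a minimal sketch in modern Python (this is our own illustration, not `rxpcompat`'s implementation):

```python
from xml.sax.saxutils import escape, quoteattr

def rxp_to_xml(node):
    # An RXP node is either a text string or a 4-tuple:
    # (tag name, attribute dict or None, child list or None, extra info).
    if isinstance(node, str):
        return escape(node)
    name, attrs, children, _extra = node
    parts = ['<%s' % name]
    for key in sorted(attrs or {}):
        parts.append(' %s=%s' % (key, quoteattr(attrs[key])))
    if not children:
        parts.append('/>')
    else:
        parts.append('>')
        parts.extend(rxp_to_xml(child) for child in children)
        parts.append('</%s>' % name)
    return ''.join(parts)
```

For instance `rxp_to_xml(('b', None, ['OSTG'], None))` gives `<b>OSTG</b>`, mirroring what `fromRxp(...).serialize()` does above (modulo attribute ordering and output encoding).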
== Translating XQuery and XSLT examples to Python ==

The first two examples show how to translate XQuery to Python, while the third shows how to translate XSLT to Python. These examples are important to show that specialized languages are not necessary for processing XML documents with the same power as XQuery, XSLT,..

=== XQuery example 1 ===

Python can fully replace complex XQuery operations. Taking the following example from the XQuery spec:
-=-=-
let $i := <tool>wrench</tool>
let $o := <order> {$i} <quantity>5</quantity> </order>
let $odoc := document ($o)
let $newi := $o/tool
-=-=-
Which is followed by these expected results:

* `fn:root($i)` returns `$i`
* `fn:root($o/quantity)` returns `$o`
* `fn:root($odoc//quantity)` returns `$odoc`
* `fn:root($newi)` returns `$o`

Some notes:

* The XQuery version implicitly copies trees, but in Python we have to ask for it explicitly with the `clone` member function.
* We have to call `.finalize()` on trees which are not `Document` nodes, because by default any node which is not a `Document` is considered part of another tree, not a root of a tree by itself.
* We have to run `fnRoot` (the `fn:root` function) in the `nullContext` explicitly.

First part translated to Python with the `tx` modules (using the `tags` module):
-=-=-
i = w.tool( 'wrench' )
o = w.order( i.clone() , w.quantity( '5' ) )
odoc = o.clone()
newi = o/'tool' # Notice the use of the '/' operator to use XPath
-=-=-
Declare trees as standalone:
-=-=-
i.finalize()
o.finalize()
odoc.finalize()
-=-=-
(Note: We make a `root` function to simplify `fnRoot` usage.)
-=-=-
>>> from tuxeenet.tx.sequence import Sequence
>>> from tuxeenet.tx.context import Context
>>> from tuxeenet.tx.xpathfn import fnRoot
>>> root = lambda node : fnRoot( Context() , Sequence( node ) )
-=-=-
Then we can check the expected results:
-=-=-
>>> assert root( i ) == Sequence( i )
>>> assert root( o/'quantity' ) == Sequence( o )
>>> assert root( odoc/'.//quantity' ) == Sequence( odoc ) # The '.' is important
>>> assert root( newi ) == Sequence( o )
-=-=-

=== XQuery example 2 ===

An example from [http://www.perfectxml.com/XQuery.html], with the `books.xml` document used below:
-=-=-
<bib>
  <book year="1994">
    <title>TCP/IP Illustrated</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
  <book year="1992">
    <title>Advanced Programming in the UNIX Environment</title>
    <author><last>Stevens</last><first>W.</first></author>
    <publisher>Addison-Wesley</publisher>
    <price>65.95</price>
  </book>
  <book year="2000">
    <title>Data on the Web</title>
    <author><last>Abiteboul</last><first>Serge</first></author>
    <author><last>Buneman</last><first>Peter</first></author>
    <author><last>Suciu</last><first>Dan</first></author>
    <publisher>Morgan Kaufmann Publishers</publisher>
    <price>65.95</price>
  </book>
  <book year="1999">
    <title>The Economics of Technology and Content for Digital TV</title>
    <editor><last>Gerbarg</last><first>Darcy</first><affiliation>CITI</affiliation></editor>
    <publisher>Kluwer Academic Publishers</publisher>
    <price>129.95</price>
  </book>
</bib>
-=-=-
The XQuery source code:
-=-=-
<listings>
{
  for $p in distinct-values(doc("books.xml")//publisher)
  order by $p
  return
    <result>
      { $p }
      {
        for $b in doc("books.xml")/bib/book
        where $b/publisher = $p
        order by $b/title
        return $b/title
      }
    </result>
}
</listings>
-=-=-
Translation to Python using `tx`, supposing the `books.xml` XML tree is in the `doc` variable:
-=-=-
w.listings( w.result( p ,
                      sorted( b/'title'
                              for b in doc/'/bib/book'
                              if b/'publisher' == p ) )
            for p in sorted( doc/'distinct-values(//publisher)' ) )
-=-=-
Which constructs the following document: (indentation added manually)
-=-=-
<listings>
  <result>Addison-Wesley
    <title>Advanced Programming in the UNIX Environment</title>
    <title>TCP/IP Illustrated</title>
  </result>
  <result>Kluwer Academic Publishers
    <title>The Economics of Technology and Content for Digital TV</title>
  </result>
  <result>Morgan Kaufmann Publishers
    <title>Data on the Web</title>
  </result>
</listings>
-=-=-

=== XSLT example 1 ===

Doing XSLT-like transformations. Taking an example from [http://www.adp-gmbh.ch/xml/xslt_examples.html], with the document:
-=-=-
<famous-persons>
  <persons category="medicine">
    <person><firstname> Edward </firstname><name> Jenner </name></person>
    <person><firstname> Gertrude </firstname><name> Elion </name></person>
  </persons>
  <persons category="computer science">
    <person><firstname> Charles </firstname><name> Babbage </name></person>
    <person><firstname> Alan </firstname><name> Touring </name></person>
    <person><firstname> Ada </firstname><name> Byron </name></person>
  </persons>
  <persons category="astronomy">
    <person><firstname> Tycho </firstname><name> Brahe </name></person>
    <person><firstname> Johannes </firstname><name> Kepler </name></person>
    <person><firstname> Galileo </firstname><name> Galilei </name></person>
  </persons>
</famous-persons>
-=-=-
and stylesheet:
-=-=-
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:template match="/">
    <html>
      <head><title>Sorting example</title></head>
      <body>
        <xsl:apply-templates select="famous-persons/persons">
          <xsl:sort select="@category"/>
        </xsl:apply-templates>
      </body>
    </html>
  </xsl:template>
  <xsl:template match="persons">
    <h2><xsl:value-of select="@category"/></h2>
    <ul>
      <xsl:apply-templates select="person">
        <xsl:sort select="name"/>
        <xsl:sort select="firstname"/>
      </xsl:apply-templates>
    </ul>
  </xsl:template>
  <xsl:template match="person">
    <li>
      <b><xsl:value-of select="name"/></b>
      <xsl:value-of select="firstname"/>
    </li>
  </xsl:template>
</xsl:stylesheet>
-=-=-
could be translated to Python as follows:
-=-=-
import operator as op # for op.itemgetter

def transform( node ) :
    if node.match( '/' ) :
        return w.html( w.head( w.title( 'Sorting example' ) ) ,
                       w.body( map( transform ,
                                    sorted( node/'famous-persons/persons' ,
                                            key = op.itemgetter( '@category' ) ) ) ) )
    elif node.match( 'persons' ) :
        return ( w.h2( node/'@category/string()' ) ,
                 w.ul( map( transform ,
                            sorted( sorted( node/'person' ,
                                            key = op.itemgetter( 'firstname' ) ) ,
                                    key = op.itemgetter( 'name' ) ) ) ) )
    elif node.match( 'person' ) :
        return w.li( w.b( node/'name/string()' ) , node/'firstname/string()' )

result = w._doc_( transform( doc ) )
-=-=-
which produces:
-=-=-
DOCUMENT[0] with 47 nodes
 ELEMENT[1] html
  ELEMENT[2] head
   ELEMENT[3] title
    TEXT[4] 'Sorting example'
  ELEMENT[5] body
   ELEMENT[6] h2
    TEXT[7] u'astronomy'
   ELEMENT[8] ul
    ELEMENT[9] li
     ELEMENT[10] b
      TEXT[11] u' Brahe '
     TEXT[12] u' Tycho '
    ELEMENT[13] li
     ELEMENT[14] b
      TEXT[15] u' Galilei '
     TEXT[16] u' Galileo '
    ELEMENT[17] li
     ELEMENT[18] b
      TEXT[19] u' Kepler '
     TEXT[20] u' Johannes '
   ELEMENT[21] h2
    TEXT[22] u'computer science'
   ELEMENT[23] ul
    ELEMENT[24] li
     ELEMENT[25] b
      TEXT[26] u' Babbage '
     TEXT[27] u' Charles '
[...]
-=-=-
*Note* that we first sort by `firstname`, then by `name`. Also note that such examples are here just to show that we can translate XSLT or XQuery to Python; this does not necessarily give an optimized alternative.

*Note* also that to emulate XSLT we might have to add other node matches at the end of `transform`, such as:
-=-=-
[...]
elif node.match( '@*|text()' ) :
    return node
else :
    return map( transform , node/'node()' )
-=-=-
but in our example that was not necessary.

= Nodes =

An XML document is a tree of nodes. In the current implementation, nodes are supposed to be constant: once created, they're not expected to be updated. This is mainly because the `Document` node numbers all its descendants, and inserting children somewhere in the document would require redoing this numbering at some point.
FIXME: Tag the root node with some flag (or the node from which we need to restart the numbering - the lowest one if several descendants are updated) to let it know that it should renumber its descendants when needed? In the meantime, number new children with a unique number (the parent's one)?

== Document ==

A `Document` node can contain any nodes except `Document` and `Attribute`.

Constructor: `Document( children = () , finalize = True )`

If `finalize` is `True` (the default), then the `Document` numbers all its descendants and marks their `root` pointers to it, hence making the `Document` node the root node of the tree.

== Element ==

An `Element` node can contain any nodes except `Document`.

Constructor: `Element( name , attributes = () , children = () , finalize = False )`

== Attribute ==

An `Attribute` node is only allowed inside an `Element` node.

Constructor: `Attribute( name , value )`

== Comment ==

A `Comment` node is only allowed inside a `Document` or an `Element` node.

Constructor: `Comment( contents )`

*Restriction*: In XML, a comment cannot contain `--` nor end with `-`.

== Text ==

A `Text` node is only allowed inside a `Document` or an `Element` node.

Constructor: `Text( contents )`

= Tags =

The `tags` module provides the `tags` object (imported as `w` in the examples) which can be used to create a document tree with Python syntax. A string is automatically translated to a `Text` node. Otherwise, a node of type `Document`, `Element` or `Comment` is created with the general syntax: `w.name( child1 , .. , attribute1 = value1 , .. )`. Attributes make sense only for the `Element` node type however.

It's also possible to pass a function, which takes no parameters and should return a correct XML tree. This function will be called at serialization time.
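The `w.name( .. )` syntax relies on dynamic attribute lookup. A toy version shows the mechanism (this is a hypothetical `Builder` class returning plain tuples, not `tx`'s node-producing implementation):

```python
class Builder(object):
    # Any attribute access yields a factory function, so t.foo(...)
    # works for arbitrary element names without declaring them.
    def __getattr__(self, name):
        def make(*children, **attributes):
            # Return a (name, attributes, children) triple
            # instead of a real Element node.
            return (name, attributes, list(children))
        return make

t = Builder()
```

With this sketch, `t.title('Hello')` yields `('title', {}, ['Hello'])`; the real `tags` object does the same kind of interception but builds actual `Element`, `Attribute` and `Text` nodes.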
-=-=-
w._doc_( w.html( w._comment_( ' Header ' ) ,
                 w.head( w.title( 'This is a example page' ) ,
                         w.link( rel = 'stylesheet' ,
                                 href = '/default-style.css' ,
                                 title = 'Default style' ) ,
                         w.meta( http_equiv = 'Content-type' ,
                                 content = 'text/html' ,
                                 charset = 'utf-8' ) ) ,
                 w._comment_( ' Body ' ) ,
                 w.body( w.h1( 'Section 1' ) ,
                         w.h2( 'Section 1.1' ) ,
                         'Bla bla bla.' ,
                         w.h2( 'Section 1.2' ) ,
                         'Bla bla bla.' ) ) )
-=-=-
generates a document which, once serialized, gives:
-=-=-
<html>
  <!-- Header -->
  <head>
    <title>This is a example page</title>
    <link href="/default-style.css" rel="stylesheet" title="Default style"/>
    <meta content="text/html" charset="utf-8" http-equiv="Content-type"/>
  </head>
  <!-- Body -->
  <body>
    <h1>Section 1</h1>
    <h2>Section 1.1</h2>
    Bla bla bla.
    <h2>Section 1.2</h2>
    Bla bla bla.
  </body>
</html>
-=-=-
*Note* that the result here is split into several lines and indented, while in reality the result is just one line of text, since no `\n` (newline) characters are part of the document.

Same document presented with the debug output:
-=-=-
>>> print doc.asDebug()
DOCUMENT[0] with 24 nodes
 ELEMENT[1] html
  COMMENT[2] ' Header '
  ELEMENT[3] head
   ELEMENT[4] title
    TEXT[5] 'This is a example page'
   ELEMENT[6] link
    ATTRIBUTE[7] href = `/default-style.css`
    ATTRIBUTE[8] rel = `stylesheet`
    ATTRIBUTE[9] title = `Default style`
   ELEMENT[10] meta
    ATTRIBUTE[11] content = `text/html`
    ATTRIBUTE[12] charset = `utf-8`
    ATTRIBUTE[13] http-equiv = `Content-type`
  COMMENT[14] ' Body '
  ELEMENT[15] body
   ELEMENT[16] h1
    TEXT[17] 'Section 1'
   ELEMENT[18] h2
    TEXT[19] 'Section 1.1'
   TEXT[20] 'Bla bla bla.'
   ELEMENT[21] h2
    TEXT[22] 'Section 1.2'
   TEXT[23] 'Bla bla bla.'
-=-=-
Example of a deferred function:
-=-=-
count = 0
def foo() :
    global count
    count += 1
    return w.p( "I'm generated %d time(s)." % count )

doc = w.body( foo )
print doc.serialize()
print doc.serialize()
print doc.serialize()
-=-=-
produces:
-=-=-
<body><p>I'm generated 1 time(s).</p></body>
<body><p>I'm generated 2 time(s).</p></body>
<body><p>I'm generated 3 time(s).</p></body>
-=-=-

= XPath =

The module `xpath` provides a large subset of XPath 2.0. Unsupported features are:

* the `instance of`, `treat as`, `castable as` and `cast as` operators,
* the `processing-instruction(..)` and `namespace(..)` tests,
* the `schema-attribute(..)` and `schema-element(..)` tests,
* date support, or any type except `string`, `float` and `boolean` (`decimal` and `double` are treated as `float`.)

== Compilation ==

The module `xpath` contains a `compile` function which takes an XPath expression and returns a function taking a context as argument and returning a sequence as the result of the evaluation.

== XPath class wrapper ==

The `XPath` class is a convenient (small) wrapper around the `compile` function. An instance of the `XPath` class is created with an XPath expression. To evaluate the XPath expression, use the `eval` member function with an optional context node.
-=-=-
>>> from tuxeenet.tx.xpath import XPath
>>> x1 = XPath( '//@href' )
>>> x1.eval( doc ) # return all href attributes in document 'doc'
-=-=-

== Shortcut syntax ==

The `Node` base class defines the `[]` and `/` operators to make it easy to query a tree with an XPath expression.
-=-=-
>>> doc[ '/html/head/link[@rel="top"]/@title/string()' ]
-=-=-
or
-=-=-
>>> doc / '/html/head/link[@rel="top"]/@title/string()'
-=-=-
are equivalent. The latter form could however also be written:
-=-=-
>>> doc/'html/head/link[@rel="top"]/@title/string()'
-=-=-
(without the initial `/` in the XPath expression) since `doc` is already the root node (for this example.) This is almost a direct translation of the following XQuery code:
-=-=-
$doc/html/head/link[@rel="top"]/@title/string()
-=-=-

== XPath prompt ==

For debugging purposes, an "XPath prompt" application is available to interactively evaluate XPath expressions.
-=-=-
$ tx-prompt
XPath TX 0.1 - (c)2005 Frederic Jolliton
XPath2.0>
-=-=-
Then any supported XPath expression can be entered. There are some special commands:

* `\.` followed by a URI (filename, URL,..) to load a document as the default context item,
* `\d` switch to default display mode,
* `\f` switch to full display mode,
* `\s` switch to short display mode,
* `\i` switch to inline display mode,
* `\l` switch the display of the location of resulting nodes on/off,
* `\e` followed by an XPath expression show its syntax tree,
* `\x` toggle query duration display,
* `\v` display names of variables currently defined,
* `\o` toggle query optimization on/off,
* `\c` flush the query cache (useful after the `\o` command to flush already compiled expressions),
* `$name := expression` evaluate the XPath `expression` and store the result into the variable named `name`.

The `$current` variable is used as the context item when evaluating an XPath expression.

Producing sequences of numbers:
-=-=-
XPath2.0> 1+2
3
XPath2.0> 18 div 4
4.5
XPath2.0> (12+3)*5
75
XPath2.0> 1, 2, 3
Sequence(1, 2, 3)
XPath2.0> 12, 17 to 20, 22
Sequence(12, 17, 18, 19, 20, 22)
-=-=-
Using the `if` ternary operator:
-=-=-
XPath2.0> if (1=1) then "ok" else "failed"
ok
XPath2.0> if (1!=1) then "ok" else "failed"
failed
-=-=-
Working with an XML tree:

* Fetching a document:
-=-=-
XPath2.0> $current := doc('http://slashdot.org')
-=-=-
* Extracting the title:
-=-=-
XPath2.0> /html/head/title
XPath2.0> \i
[inline]
XPath2.0> /html/head/title
Slashdot: News for nerds, stuff that matters
-=-=-
* Extracting articles title (pre-September 2005):
-=-=-
XPath2.0> \s
[short]
XPath2.0> //td[@align='LEFT']//font[@color]/b[text()]/string()
 1 STRING Ask Slashdot: GSM and Asterisk Integration?
 2 STRING Hardware: Free WiFi Trend Continues
 3 STRING Linux: Winemaker Drinks To Linux
 4 STRING Games: World of Warcraft Card Game Coming Soon
 5 STRING Your Rights Online: Is Your Boss a Psychopath?
 6 STRING Linux: Australian Linux Trademark Holds Water
 7 STRING Science: Nanotubes Start to Show their Promise
-=-=-
* Extracting articles title (post-September 2005):
-=-=-
XPath2.0> \s
[short]
XPath2.0> //div[@class='generaltitle']/normalize-space()
 1 STRING Tivo Institutes 1 Year Service Contracts
 2 STRING Politics: US Senate Allows NASA To Buy Soyuz Vehicles
 3 STRING IT: Reconnaissance In Virtual Space
 4 STRING Your Rights Online: FBI Agents Put New Focus on Deviant Porn
 5 STRING Ask Slashdot: Top 50 Science Fiction TV Shows
 6 STRING Your Rights Online: Business At The Price Of Freedom
 7 STRING Apple: Music Exec Fires Back At Apple CEO
 8 STRING Science: Grammar Traces Language Roots
 9 STRING Developers: RMS Previews GPL3 Terms
10 STRING Massachusetts Finalizes OpenDocument Standard Plan
11 STRING Developers: Palm Teams With Microsoft for Smart Phone
12 STRING Developers: Why Vista Had To Be Rebuilt From Scratch
13 STRING Hardware: Nabaztag the WiFi Bunny
14 STRING Revamping the Movie Distribution Chain
15 STRING Politics: Municipal Broadband Projects Spread Across U.S.
-=-=-
* Extracting RSS titles from an external document:
-=-=-
XPath2.0> \s
[short]
XPath2.0> doc('http://rss.slashdot.org/Slashdot/slashdot')//item/title/string()
 1 STRING Tivo Institutes 1 Year Service Contracts
 2 STRING US Senate Allows NASA To Buy Soyuz Vehicles
 3 STRING Reconnaissance In Virtual Space
 4 STRING FBI Agents Put New Focus on Deviant Porn
 5 STRING Top 50 Science Fiction TV Shows
 6 STRING Business At The Price Of Freedom
 7 STRING Music Exec Fires Back At Apple CEO
 8 STRING Grammar Traces Language Roots
 9 STRING RMS Previews GPL3 Terms
10 STRING Massachusetts Finalizes OpenDocument Standard Plan
-=-=-
* Extracting the 1st, 3rd and 7th comments of the document:
-=-=-
XPath2.0> \i
[inline]
XPath2.0> \x
Timer On
XPath2.0> (//comment())[position()=(1,3,7)]
-- 0.038567s(parse) + 0.006351s(eval) --
-=-=-
* Querying distinct `value` attributes of the document:
-=-=-
XPath2.0> \f
[full]
XPath2.0> distinct-values(//@value)
 1 NODE ATTRIBUTE[1856] value = ``
 2 NODE ATTRIBUTE[1866] value = `//slashdot.org/`
 3 NODE ATTRIBUTE[1871] value = `userlogin`
 4 NODE ATTRIBUTE[1882] value = `yes`
 5 NODE ATTRIBUTE[1889] value = `Log in`
 6 NODE ATTRIBUTE[1940] value = `1307`
 7 NODE ATTRIBUTE[1945] value = `mainpage`
 8 NODE ATTRIBUTE[1954] value = `1`
 9 NODE ATTRIBUTE[1960] value = `2`
10 NODE ATTRIBUTE[1966] value = `3`
11 NODE ATTRIBUTE[1972] value = `4`
12 NODE ATTRIBUTE[1978] value = `5`
13 NODE ATTRIBUTE[1984] value = `6`
14 NODE ATTRIBUTE[1990] value = `7`
15 NODE ATTRIBUTE[1995] value = `Vote`
16 NODE ATTRIBUTE[2346] value = `freshmeat.net`
17 NODE ATTRIBUTE[2439] value = `Search`
-- 0.012589s(parse) + 0.055717s(eval) --
-=-=-
* Computing the number of pixels covered by `img` elements:
-=-=-
XPath2.0> \d
[default]
XPath2.0> for $img in //img return $img/@width * $img/@height
Sequence(19800, 4800, 2862, 4307, 4225, 4602, 5, 208, 4800, 208, 2862, \
208, 4307, 208, 4800, 208, 4225, 208, 4602, 208, 4125, 208, 6216, 208, \
5184, 208, 5070, 230, 230, 230, 230, 230, 230, 230, 5, nan, 1)
-=-=-
or alternatively:
-=-=-
XPath2.0> //img/(@height * @width)
Sequence(19800, 4800, 2862, 4307, 4225, 4602, 5, 208, 4800, 208, 2862, \
208, 4307, 208, 4800, 208, 4225, 208, 4602, 208, 4125, 208, 6216, 208, \
5184, 208, 5070, 230, 230, 230, 230, 230, 230, 230, 5, nan, 1)
-=-=-
* Looking for the `rel` attribute, with location (the XPath expression that could be used to precisely identify each node in the resulting sequence):
-=-=-
XPath2.0> \f
[full]
XPath2.0> \l
Location on
XPath2.0> //@rel
 1 NODE /html/head/link[1]/@rel ATTRIBUTE[7] rel = `top`
 2 NODE /html/head/link[2]/@rel ATTRIBUTE[12] rel = `search`
 3 NODE /html/head/link[3]/@rel ATTRIBUTE[17] rel = `alternate`
 4 NODE /html/head/link[4]/@rel ATTRIBUTE[23] rel = `shortcut icon`
-=-=-
Displaying the parse tree of some expressions (mainly useful for debugging purposes!):
-=-=-
XPath2.0> \e 1+2
(exprlist (+ (path (integer "1")) (path (integer "2"))))
XPath2.0> \e foo() + bar()
(exprlist (+ (path (call "foo")) (path (call "bar"))))
XPath2.0> \e ../foo | @bar
(exprlist (union (path (parent (node)) (child (element "foo"))) (path (attribute (attribute "bar")))))
XPath2.0> \e /html/child::head/element(title)/string()
(exprlist (path "/" (child (element "html")) (child (element "head")) (child (element "title")) (call "string")))
XPath2.0> \e for $att in distinct-values(//@*/name()) return ($att,count(//attribute()[name()=$att]))
(exprlist (for ((att (path (call "distinct-values" (path "/" (descendant-or-self (node)) (attribute (attribute "*")) (call "name")))))) (path (exprlist (path (varref "att")) (path (call "count" (path "/" (descendant-or-self (node)) (predicates (attribute (attribute)) (exprlist (= (path (call "name")) (path (varref "att"))))))))))))
-=-=-

= Pattern =

Patterns are used for XSLT node matching.
-=-=-
node.match( 'a/b' ) # Return True if node is an element `b` with a parent element `a`
-=-=-
*Note*: Internally, patterns are translated to XPath expressions.
Such expressions return the empty sequence if the pattern doesn't match the node. For example, `a/b` becomes something like `self::element(b)/parent::element(a)`.

= Parsing HTML =

The `htmlparser` module provides a replacement for the `HTMLParser` module shipped with Python. The main difference is that the `tx` module never throws an error. It is able to parse the worst HTML documents.

*Note*: To parse regular XML documents, a parser like `rxp` could be used instead, with the help of the `rxpcompat` module, because `HTMLParser` is not designed for XML and is not necessarily good enough for this purpose. Your mileage may vary.

The `htmltree` module uses the `htmlparser` module and produces a `Document`.
-=-=-
>>> import sys
>>> from tuxeenet.tx.htmltree import parse
>>> parse( '<test/>' ).asDebug( file = sys.stdout )
DOCUMENT[0] with 3 nodes
 ELEMENT[1] html
  ELEMENT[2] test
>>> parse( 'some ~< bad <p>document /><p>really' ).asDebug( file = sys.stdout )
DOCUMENT[0] with 7 nodes
 ELEMENT[1] html
  TEXT[2] 'some ~< bad '
  ELEMENT[3] p
   TEXT[4] 'document />'
  ELEMENT[5] p
   TEXT[6] 'really'
>>> parse( 'some ~< bad <p>document /><p>really' ).serialize()
'<html>some ~&lt; bad <p>document /&gt;</p><p>really</p></html>'
-=-=-
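For comparison, Python's standard parser (shown here via Python 3's `html.parser`) also survives this kind of malformed input, although it only reports events and builds no tree by itself. A minimal sketch, with our own hypothetical `Collector` class:

```python
from html.parser import HTMLParser

class Collector(HTMLParser):
    # Record parse events instead of raising on malformed markup.
    def __init__(self):
        super().__init__()
        self.events = []
    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag))
    def handle_endtag(self, tag):
        self.events.append(('end', tag))
    def handle_data(self, data):
        self.events.append(('text', data))

c = Collector()
# A malformed input similar to the example above: a stray '<'
# and two unclosed <p> tags.
c.feed('some ~< bad <p>document /><p>really')
c.close()
```

The stray `<` is reported as plain text and the unclosed `<p>` tags as start events, never as errors; turning such an event stream into a repaired tree (as `htmltree` does) is left to the caller.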