<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-3944976411672994427</id><updated>2012-02-16T16:13:13.043+07:00</updated><category term='M'/><category term='xml'/><category term='Open source'/><category term='Oslo'/><category term='ES4'/><category term='REST'/><category term='security'/><category term='mac'/><category term='RELAX NG'/><category term='jing-trang'/><category term='xml:id'/><category term='Type systems'/><category term='ECMAScript'/><category term='I18N'/><category term='JSON'/><category term='Thailand'/><category term='schemas'/><category term='HTTP'/><title type='text'>James Clark's Random Thoughts</title><subtitle type='html'></subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://blog.jclark.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://blog.jclark.com/feeds/posts/default/-/jing-trang'/><link rel='alternate' type='text/html' href='http://blog.jclark.com/search/label/jing-trang'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>James Clark</name><uri>http://www.blogger.com/profile/08445951113700394609</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='16' height='16' src='http://img2.blogblog.com/img/b16-rounded.gif'/></author><generator version='7.00' 
uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>3</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-3944976411672994427.post-682813966902532634</id><published>2009-01-17T13:21:00.001+07:00</published><updated>2009-01-17T14:34:15.131+07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='RELAX NG'/><category scheme='http://www.blogger.com/atom/ns#' term='xml:id'/><category scheme='http://www.blogger.com/atom/ns#' term='xml'/><category scheme='http://www.blogger.com/atom/ns#' term='jing-trang'/><title type='text'>RELAX NG and xml:id</title><content type='html'>&lt;p&gt;One part of the vision underlying RELAX NG is that validation should not be monolithic: it is not necessary or desirable to have one schema language that can handle every possible kind of validation you might want to do; it is better instead to have multiple specialized languages, each of which does one kind of validation really well. Consistent with this vision, RELAX NG provides only grammar-based validation. There's no implicit claim that other kinds of validation aren't useful and important.&lt;/p&gt;  &lt;p&gt;One kind of validation that is clearly useful and important and that can't be done by grammars is checking of cross-references. One possibility is to use Schematron for this. The designers of RELAX NG anticipated that there would be a little schema language specialized to this, which would be created as part of the ISO &lt;a href="http://dsdl.org/"&gt;DSDL&lt;/a&gt; effort (as part 6); this wouldn't be a million miles from the kind of thing that XSD provides with xs:key/xs:unique/xs:keyref.
Unfortunately this hasn't happened yet.&lt;/p&gt;  &lt;p&gt;Since DTDs provide ID/IDREF checking and we wanted people to be able to move easily from DTDs to RELAX NG, we felt we had to provide some transitional support for ID/IDREF checking while awaiting the ultimate &amp;quot;right&amp;quot; solution. We therefore provided a separate, optional spec called &lt;a href="http://relaxng.org/compatibility-20011203.html"&gt;RELAX NG DTD Compatibility&lt;/a&gt;. Amongst other things, this defines a way in which RELAX NG processors can optionally provide DTD-compatible ID/IDREF checking based on the datatypes of attributes declared in the schema. Note that this can't be handled by the XSD datatypes library for RELAX NG, because assignment of types in the schema to values in the instance is not part of the RELAX NG model of validation.&lt;/p&gt;  &lt;p&gt;When defining RELAX NG DTD compatibility, we took a fairly hard line about being DTD-compatible. In particular, we made it a requirement that you should be able to generate a DTD subset from the RELAX NG schema that would perform the same type assignment that the process defined by the spec would perform. This creates some problems when you use DTD Compatibility in conjunction with wildcards (which of course aren't a DTD feature). For example:&lt;/p&gt;  &lt;pre&gt;start = element doc { p* }
p = element p { id?, any* }
id = attribute id { xsd:ID }
any = element * { attribute * { text }*, (any|text)* }&lt;/pre&gt;

&lt;p&gt;will get an error about conflicting ID-types for p/@id.&amp;#160; This is because the schema allows &amp;lt;p&amp;gt; to contain a &amp;lt;p&amp;gt; element with an id attribute that doesn't have type ID: the nested &amp;lt;p&amp;gt; also matches the wildcard in &lt;code&gt;any&lt;/code&gt;, whose attributes are all declared as text. Instead you would have to write:&lt;/p&gt;

&lt;pre&gt;start = element doc { p* }
p = element p { id?, any* }
id = attribute id { xsd:ID }
any = element * - p { attribute * { text }*, (any|text)* }&lt;/pre&gt;

&lt;p&gt;Several years after the DTD compatibility spec was finished, the W3C came out with the xml:id &lt;a href="http://www.w3.org/TR/xml-id/"&gt;Recommendation&lt;/a&gt;. The spec mentions RELAX NG in a non-normative appendix and encourages authors &amp;quot;to declare attributes named &lt;code&gt;xml:id&lt;/code&gt; with the type &lt;code&gt;xs:ID&lt;/code&gt;&amp;quot;. Now on the face of it, this seems pretty reasonable advice.&amp;#160; Unfortunately, from the point of view of the RELAX NG DTD Compatibility spec it's precisely the wrong thing to do.&amp;#160; For example, this&lt;/p&gt;

&lt;pre&gt;start = element doc { p* }
p = element p { id?, any* }
id = attribute xml:id { xsd:NCName }
any = element * { attribute * { text}*, (any|text)* }&lt;/pre&gt;

&lt;p&gt;will work perfectly with RELAX NG with or without DTD compatibility. The XML processor does the xml:id checking, and RELAX NG can ignore ID/IDREFs. But if instead you follow the xml:id Recommendation's suggestion and do:&lt;/p&gt;

&lt;pre&gt;start = element doc { p* }
p = element p { id?, any* }
id = attribute xml:id { xsd:ID }
any = element * { attribute * { text}*, (any|text)* }&lt;/pre&gt;

&lt;p&gt;a RELAX NG validator that implements RELAX NG DTD compatibility will give you an error about conflicting ID-types for p/@xml:id. You might think you could do&lt;/p&gt;

&lt;pre&gt;start = element doc { p* }
p = element p { id?, any* }
id = attribute xml:id { xsd:ID }
any = element * { attribute * - xml:id { text}*, id?, (any|text)* }&lt;/pre&gt;

&lt;p&gt;but that won't work either, because although you can now write a DTD subset that does equivalent type assignment for p, you can't do it for the other elements.&lt;/p&gt;

&lt;p&gt;(The xml:id Recommendation also says in the RELAX NG section that &amp;quot;A document that uses &lt;code&gt;xml:id&lt;/code&gt; attributes that have a declared type other than &lt;code&gt;xs:ID&lt;/code&gt; will always generate xml:id errors.&amp;quot; I don't see why: the xml:id processor is quite likely to be part of the XML parser, which doesn't know anything about RELAX NG, nor does RELAX NG know anything about xml:id.)&lt;/p&gt;

&lt;p&gt;Back when the RELAX NG DTD Compatibility spec came out, I implemented support for the ID/IDREF checking part of DTD Compatibility in Jing.&amp;#160; I also decided to make Jing enforce this by default. There's a -i switch to turn it off. Before xml:id came along, this seemed to work OK: if a schema author specifies ID/IDREF in a RELAX NG schema then they usually want ID/IDREFs to be checked and RELAX NG DTD Compatibility was the only thing that could do this checking. With xml:id this no longer works well: if you&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;use xml:id &lt;/li&gt;

  &lt;li&gt;declare xml:id attributes as type xsd:ID in the RELAX NG schema &lt;/li&gt;

  &lt;li&gt;use wildcards in your RELAX NG schema &lt;/li&gt;

  &lt;li&gt;don't use any special options to Jing &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;you are very likely to get an error from Jing.&lt;/p&gt;

&lt;p&gt;At first, my plan was simply to change Jing not to enforce DTD Compatibility by default. However, Alex Brown &lt;a href="http://code.google.com/p/jing-trang/issues/detail?id=36#c2"&gt;pointed out&lt;/a&gt; that this isn't completely satisfactory: people who are coming from DTDs and aren't using xml:id lose the sensible ID/IDREF checking that they might reasonably expect to happen by default. So now I'm thinking that a better solution might be to add two boolean options to Jing, both of which would be enabled by default.&lt;/p&gt;

&lt;p&gt;The first option would be to make it a warning rather than an error if the schema does not use ID/IDREF in a DTD-compatible way. (If the schema is DTD-compatible, then duplicate IDs or IDREFs to non-existent IDs would still be errors.)&lt;/p&gt;

&lt;p&gt;The second option would tell Jing to be &amp;quot;xml:id aware&amp;quot;. This would have several effects.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;It would require attributes named xml:id to be declared with type xsd:ID (or with the ID type from the datatype library defined by the DTD compatibility spec). This isn't strictly necessary, but it would seem to minimize confusion and be in keeping with the spirit of the xml:id Recommendation. It's slightly tricky to decide what this means with various unusual RELAX NG wildcards. It is obvious that attribute xml:id { text } is an error.&amp;#160; But the following are not all obvious to me: 
    &lt;ul&gt;
      &lt;li&gt;attribute xml:id|id { text } &lt;/li&gt;

      &lt;li&gt;attribute * { text } &lt;/li&gt;

      &lt;li&gt;attribute xml:* { text } &lt;/li&gt;

      &lt;li&gt;attribute *|xml:id { text } &lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;

  &lt;li&gt;When checking whether you can generate an equivalent DTD subset, xml:id attributes would be ignored. In the terms defined by the RELAX NG DTD Compatibility spec, you would ignore xml:id attributes when determining whether the schema is compatible with the ID/IDREF feature. &lt;/li&gt;

  &lt;li&gt;When checking uniqueness of IDs, and when checking IDREFs, an attribute named xml:id would always be treated as an ID attribute. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It might also be a good idea to revise the RELAX NG DTD compatibility spec to be xml:id aware in this way.&lt;/p&gt;  &lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3944976411672994427-682813966902532634?l=blog.jclark.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.jclark.com/feeds/682813966902532634/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3944976411672994427&amp;postID=682813966902532634' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3944976411672994427/posts/default/682813966902532634'/><link rel='self' type='application/atom+xml' href='http://blog.jclark.com/feeds/posts/default/682813966902532634'/><link rel='alternate' type='text/html' href='http://blog.jclark.com/2009/01/relax-ng-and-xmlid.html' title='RELAX NG and xml:id'/><author><name>James Clark</name><uri>http://www.blogger.com/profile/04798042939786677843</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_CRGhVAUz8CE/SZPj6KJBPZI/AAAAAAAAAAM/TQ6htTUw0nk/S220/oriental-tea-portrait.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3944976411672994427.post-4884344461967366335</id><published>2008-11-17T08:55:00.001+07:00</published><updated>2008-11-17T08:55:33.316+07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='RELAX NG'/><category scheme='http://www.blogger.com/atom/ns#' term='I18N'/><category scheme='http://www.blogger.com/atom/ns#' term='xml'/><category scheme='http://www.blogger.com/atom/ns#' term='jing-trang'/><title type='text'>What's allowed in a URI?</title><content type='html'>&lt;p&gt;Java 1.4 introduced the java.net.URI class, which provides RFC 2396-compliant URI handling.
I thought I should try to fix Jing and Trang to use this. So I've been looking through all the relevant specs to figure out to what extent I can leave things to java.net.URI.&lt;/p&gt;  &lt;p&gt;It's convenient to begin with XLink.&amp;#160; &lt;a href="http://www.w3.org/TR/2001/REC-xlink-20010627/#link-locators" target="_blank"&gt;Section 5.4&lt;/a&gt; requires the value of the href attribute to be a URI reference after certain characters that are disallowed by RFC 2396 are escaped. These are described as&lt;/p&gt;  &lt;blockquote&gt;   &lt;p&gt;all non-ASCII characters, plus the excluded characters listed in Section 2.4 of IETF RFC 2396, except for the number sign (#) and percent sign (%) and the square bracket characters re-allowed in IETF RFC 2732&lt;/p&gt; &lt;/blockquote&gt;  &lt;p&gt;If we look at &lt;a href="http://tools.ietf.org/html/rfc2396#section-2.4.3" target="_blank"&gt;2.4.3 of RFC 2396&lt;/a&gt; (why does XLink reference section 2.4 rather than 2.4.3?), we see the following sets of characters excluded:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;control&amp;#160;&amp;#160;&amp;#160;&amp;#160; = &amp;lt;US-ASCII coded characters 00-1F and 7F hexadecimal&amp;gt; &lt;/li&gt;    &lt;li&gt;space&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; = &amp;lt;US-ASCII coded character 20 hexadecimal&amp;gt; &lt;/li&gt;    &lt;li&gt;delims&amp;#160;&amp;#160;&amp;#160;&amp;#160;&amp;#160; = &amp;quot;&amp;lt;&amp;quot; | &amp;quot;&amp;gt;&amp;quot; | &amp;quot;#&amp;quot; | &amp;quot;%&amp;quot; | &amp;lt;&amp;quot;&amp;gt; &lt;/li&gt;    &lt;li&gt;unwise&amp;#160;&amp;#160;&amp;#160;&amp;#160; = &amp;quot;{&amp;quot; | &amp;quot;}&amp;quot; | &amp;quot;|&amp;quot; | &amp;quot;\&amp;quot; | &amp;quot;^&amp;quot; | &amp;quot;[&amp;quot; | &amp;quot;]&amp;quot; | &amp;quot;`&amp;quot; &lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;&lt;a href="http://tools.ietf.org/html/rfc2732#section-3" target="_blank"&gt;Section 3 of RFC 2732&lt;/a&gt; (which modifies RFC 2396 to handle 
IPv6 addresses)&amp;#160; does indeed allow square brackets by removing them from the 'unwise' set.&lt;/p&gt;  &lt;p&gt;Putting these all together, we can distinguish the following categories of characters that are allowed by XLink but not allowed by RFC 2396/RFC 2732&lt;/p&gt;  &lt;ol&gt;   &lt;li&gt;C0 control characters (#x00 - #x1F); of these only #x9, #xA and #xD are allowed in XML documents &lt;/li&gt;    &lt;li&gt;space (#x20) &lt;/li&gt;    &lt;li&gt;disallowed ASCII graphic characters, specifically: &amp;lt;&amp;gt;&amp;quot;{}|\^` &lt;/li&gt;    &lt;li&gt;delete (#x7F) &lt;/li&gt;    &lt;li&gt;non-ASCII Unicode characters, excluding surrogates #x80-#xD7FF, #xE000-#x10FFFF (XML does not allow #xFFFE and #xFFFF) &lt;/li&gt; &lt;/ol&gt;  &lt;p&gt;Looking at the various XML-related specs, things seem to be nicely aligned:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;&lt;a href="http://www.w3.org/TR/1998/REC-xml-19980210#sec-external-ent" target="_blank"&gt;XML 1.0 First Edition&lt;/a&gt; required escaping just for category 5, but &lt;a href="http://www.w3.org/TR/1998/REC-xml-19980210#sec-external-ent" target="_blank"&gt;XML 1.0 Second Edition&lt;/a&gt; got fixed to use the same wording as XLink &lt;/li&gt;    &lt;li&gt;&lt;a href="http://www.w3.org/TR/2001/REC-xmlbase-20010627/#escaping" target="_blank"&gt;XML Base&lt;/a&gt; uses the same wording as XLink &lt;/li&gt;    &lt;li&gt;&lt;a href="http://www.w3.org/TR/2001/REC-xmlschema-2-20010502/#anyURI" target="_blank"&gt;XML Schema Part 2&lt;/a&gt; references XLink (in specifying xs:anyURI) &lt;/li&gt;    &lt;li&gt;&lt;a href="http://relaxng.org/spec-20011203.html#href" target="_blank"&gt;RELAX NG&lt;/a&gt; references XLink &lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;XSLT 1.0 just references RFC 2396 and doesn't say anything about escaping (as regards xsl:include and xsl:import). 
That seems like a bug to me.&amp;#160; Erratum &lt;a href="http://www.w3.org/1999/11/REC-xslt-19991116-errata/#E39" target="_blank"&gt;E39&lt;/a&gt; adds the following to the first paragraph of the spec:&lt;/p&gt;  &lt;blockquote&gt;   &lt;p&gt;For convenience, XML 1.0 and XML Names 1.0 references are usually used. Thus, URI references are also used though IRI may also be supported. In some cases, the XML 1.0 and XML 1.1 definitions may be exactly the same.&lt;/p&gt; &lt;/blockquote&gt;  &lt;p&gt;This seems to be intended to extend it to allow IRIs, though it seems like a bit of a hack: there's no reference to the IRI spec, and I don't see how it's &amp;quot;Thus, &amp;quot;. In any case, &lt;a href="http://www.w3.org/TR/xslt20/#uri-references" target="_blank"&gt;XSLT 2.0&lt;/a&gt; gets it right: it references xs:anyURI.&lt;/p&gt;  &lt;p&gt;RFC 2396 has been updated by &lt;a href="http://tools.ietf.org/html/rfc3986" target="_blank"&gt;RFC 3986&lt;/a&gt;.&amp;#160; This no longer has a section describing excluded characters, but I believe I am right in saying that the set of Unicode characters that cannot occur anywhere in a URI as defined by RFC 3986 is precisely the union of my categories 1 through 5.&lt;/p&gt;  &lt;p&gt;Next we have the IRI spec, &lt;a href="http://tools.ietf.org/html/rfc3987" target="_blank"&gt;RFC 3987&lt;/a&gt;. This defines:&lt;/p&gt;  &lt;pre&gt;   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD&lt;/pre&gt;

&lt;p&gt;It adds ucschar to the set of unreserved characters and adds iprivate to what's allowed in the query of a URI. The characters in my category 5 that are in neither ucschar nor iprivate are as follows:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;C1 controls: #x80 - #x9F &lt;/li&gt;

  &lt;li&gt;the 66 Unicode noncharacters: #xFDD0 - #xFDEF, and any code point whose bottom 16 bits are FFFE or FFFF &lt;/li&gt;

  &lt;li&gt;Specials: #xFFF0 - #xFFFD; these fall into three groups, unassigned specials (#xFFF0 - #xFFF8), annotation characters (#xFFF9 - #xFFFB) and replacement characters (#xFFFC - #xFFFD) &lt;/li&gt;

  &lt;li&gt;Language tags: #xE0000 - #xE0FFF &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I can buy controls and noncharacters being excluded, but the other two seem like over-engineering to me. The arguments for excluding these could equally be applied to various other weird Unicode characters.&amp;#160; You don't want to have to change the definition of an IRI whenever Unicode adds some new weird character.&lt;/p&gt;

&lt;p&gt;RFC 3987 also has the following in &lt;a href="http://tools.ietf.org/html/rfc3987#section-3.2" target="_blank"&gt;Section 3.2&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;Systems accepting IRIs MAY also deal with the printable characters in US-ASCII that are not allowed in URIs, namely &amp;quot;&amp;lt;&amp;quot;, &amp;quot;&amp;gt;&amp;quot;, '&amp;quot;', space, &amp;quot;{&amp;quot;, &amp;quot;}&amp;quot;, &amp;quot;|&amp;quot;, &amp;quot;\&amp;quot;, &amp;quot;^&amp;quot;, and &amp;quot;`&amp;quot;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Those characters correspond to my categories 2 and 3. Overall there are a lot of subtle differences between IRIs and the thing that is currently allowed by XML specs.&lt;/p&gt;

&lt;p&gt;Fortunately there is a &lt;a href="http://tools.ietf.org/html/draft-duerst-iri-bis-04" target="_blank"&gt;draft of a new version of the IRI spec&lt;/a&gt;. This introduces Legacy Extended IRI (LEIRI) references, for which it defines ucschar as:&lt;/p&gt;

&lt;pre&gt;   ucschar        = &amp;quot; &amp;quot; / &amp;quot;&amp;lt;&amp;quot; / &amp;quot;&amp;gt;&amp;quot; / '&amp;quot;' / &amp;quot;{&amp;quot; / &amp;quot;}&amp;quot; / &amp;quot;|&amp;quot;
                     / &amp;quot;\&amp;quot; / &amp;quot;^&amp;quot; / &amp;quot;`&amp;quot; / %x0-1F / %x7F-D7FF
                     / %xE000-FFFD / %x10000-10FFFF&lt;/pre&gt;

&lt;p&gt;which exactly corresponds to my categories 1 to 5.&lt;/p&gt;

&lt;p&gt;LEIRIs seem like a very useful innovation.&amp;#160; XML-related specs such as RELAX NG that referenced or incorporated the XLink wording will be able to simply reference RFC 3987bis and say that URI references MUST be LEIRIs and SHOULD be IRIs.&lt;/p&gt;

&lt;p&gt;Finally we are ready to look at &lt;a href="http://java.sun.com/javase/6/docs/api/java/net/URI.html" target="_blank"&gt;java.net.URI&lt;/a&gt;. This allows URIs to contain an additional set of &amp;quot;other&amp;quot; characters which consist of non-ASCII characters with the exception of:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;C1 controls (#x80 - #x9F) &lt;/li&gt;

  &lt;li&gt;Characters with a category of Zs, Zl or Zp &lt;/li&gt;
&lt;/ul&gt;
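
&lt;p&gt;A quick way to see this behaviour is to probe the parser directly. The following is only a minimal sketch: the class name &lt;code&gt;UriProbe&lt;/code&gt; and its &lt;code&gt;accepts&lt;/code&gt; helper are made up for illustration, and it assumes nothing beyond the single-argument &lt;code&gt;java.net.URI&lt;/code&gt; constructor throwing &lt;code&gt;URISyntaxException&lt;/code&gt; for strings it rejects.&lt;/p&gt;

&lt;pre&gt;
```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, for illustration only.
final class UriProbe {
    // Returns true if java.net.URI parses the string without complaint.
    static boolean accepts(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }
}
```
&lt;/pre&gt;

&lt;p&gt;With this, a path containing U+00E9 (an ordinary non-ASCII letter) should be accepted unescaped, while the same path containing U+00A0 (category Zs), U+2028 (category Zl) or U+0085 (a C1 control) should be rejected.&lt;/p&gt;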

&lt;p&gt;This means that if you want to give an LEIRI such as an XML system identifier to java.net.URI you first need to percent encode any of the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;the following ASCII graphic characters: &amp;lt;&amp;gt;&amp;quot;{}|\^` &lt;/li&gt;

  &lt;li&gt;C0 control characters (#x00 - #x1F); of these only #x9, #xA and #xD are allowed in XML documents &lt;/li&gt;

  &lt;li&gt;space (#x20) &lt;/li&gt;

  &lt;li&gt;delete (#x7F) &lt;/li&gt;

  &lt;li&gt;C1 controls (#x80 - #x9F) &lt;/li&gt;

  &lt;li&gt;Characters with a category of Zs, Zl or Zp &lt;/li&gt;
&lt;/ul&gt;

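&lt;p&gt;All except the first can be tested with &lt;code&gt;Character.isISOControl(c) || Character.isSpaceChar(c)&lt;/code&gt;. (Note that it has to be &lt;code&gt;isSpaceChar&lt;/code&gt;, which tests for the Zs, Zl and Zp categories, rather than the deprecated &lt;code&gt;isSpace&lt;/code&gt;, which only recognizes a handful of ASCII whitespace characters.)&lt;/p&gt;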
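
&lt;p&gt;Putting the list above together, the pre-escaping step might look roughly like the following. This is a sketch under my own assumptions rather than anything taken from Jing or Trang: the class and method names are hypothetical, the characters are percent-encoded as UTF-8 octets, and the space test uses &lt;code&gt;Character.isSpaceChar&lt;/code&gt;, which is the method that checks for the Zs, Zl and Zp categories. An existing percent sign is left alone, so input that is already escaped is not escaped twice.&lt;/p&gt;

&lt;pre&gt;
```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class LeiriEscaper {
    // ASCII graphic characters that LEIRIs allow but URIs do not:
    // less-than, greater-than, double quote, braces, bar, backslash,
    // caret and backtick. The angle brackets are spelled as Unicode
    // escapes so the listing can sit inside markup unescaped.
    private static final String EXCLUDED_ASCII = "\u003C\u003E\"{}|\\^`";

    // True for any code point that java.net.URI would reject unescaped.
    static boolean needsEscape(int cp) {
        return EXCLUDED_ASCII.indexOf(cp) != -1
            || Character.isISOControl(cp)   // C0 controls, delete, C1 controls
            || Character.isSpaceChar(cp);   // space and other Zs, Zl, Zp
    }

    // Percent-encodes only the offending code points, as UTF-8 octets,
    // leaving all other non-ASCII characters readable.
    static String escape(String leiri) {
        StringBuilder out = new StringBuilder();
        for (int cp : leiri.codePoints().toArray()) {
            if (needsEscape(cp)) {
                String s = new String(Character.toChars(cp));
                for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
                    out.append(String.format("%%%02X", Byte.toUnsignedInt(b)));
                }
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```
&lt;/pre&gt;

&lt;p&gt;The result of &lt;code&gt;LeiriEscaper.escape&lt;/code&gt; can then be handed to &lt;code&gt;java.net.URI.create&lt;/code&gt;; a space becomes %20 and U+00A0 becomes %C2%A0, while a letter such as U+00E9 passes through untouched.&lt;/p&gt;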

&lt;p&gt;Note that you don't want to blindly percent encode all non-ASCII characters because that will unnecessarily make IRIs containing non-ASCII characters unintelligible to humans.&lt;/p&gt;  &lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3944976411672994427-4884344461967366335?l=blog.jclark.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://blog.jclark.com/feeds/4884344461967366335/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3944976411672994427&amp;postID=4884344461967366335' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3944976411672994427/posts/default/4884344461967366335'/><link rel='self' type='application/atom+xml' href='http://blog.jclark.com/feeds/posts/default/4884344461967366335'/><link rel='alternate' type='text/html' href='http://blog.jclark.com/2008/11/what-allowed-in-uri.html' title='What&amp;#39;s allowed in a URI?'/><author><name>James Clark</name><uri>http://www.blogger.com/profile/04798042939786677843</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_CRGhVAUz8CE/SZPj6KJBPZI/AAAAAAAAAAM/TQ6htTUw0nk/S220/oriental-tea-portrait.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-3944976411672994427.post-3298162536941975690</id><published>2008-11-09T11:18:00.001+07:00</published><updated>2008-11-09T11:58:41.036+07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='RELAX NG'/><category scheme='http://www.blogger.com/atom/ns#' term='Open source'/><category scheme='http://www.blogger.com/atom/ns#' term='xml'/><category scheme='http://www.blogger.com/atom/ns#' term='schemas'/><category scheme='http://www.blogger.com/atom/ns#' term='jing-trang'/><title 
type='text'>Working on Jing and Trang</title><content type='html'>&lt;p&gt;I've been back working on &lt;a href="http://www.thaiopensource.com/relaxng/jing.html"&gt;Jing&lt;/a&gt; and &lt;a href="http://www.thaiopensource.com/relaxng/trang.html"&gt;Trang&lt;/a&gt; for about a month now. It would be something of an understatement to say that they were badly in need of some maintenance love.&lt;/p&gt;  &lt;p&gt;I started a &lt;a href="http://jing-trang.googlecode.com"&gt;jing-trang project on Google Code&lt;/a&gt; to host future development. There are new releases of both Jing and Trang in the &lt;a href="http://code.google.com/p/jing-trang/downloads/list"&gt;downloads&lt;/a&gt; section of the project site. These have been out for about 10 days, and there have been a reasonable number of downloads, and no reports of any major bugs, so I think these should be fairly solid.&amp;#160; (Interestingly, the number of downloads of Trang has been running at about twice that of Jing.)&lt;/p&gt;  &lt;p&gt;It's been 5 years since the last release, so what new and exciting features are there? Well, actually, in the current release, none.&amp;#160; My work for that release was focused on two areas:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;getting things to work properly with current versions of Java and other dependencies; &lt;/li&gt;    &lt;li&gt;getting the source code structure and build system into reasonable shape. &lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;The second was a lot of work.
The code base for Jing and Trang had evolved over a number of years, incorporating various bits of functionality that were independent of each other to various degrees; its structure only made any sense from a historical perspective.&amp;#160; The &lt;a href="http://code.google.com/p/jing-trang/wiki/SourceOverview"&gt;current structure&lt;/a&gt; is now nicely modular.&amp;#160; I converted my CVS repository to Subversion before I started moving things around, so the &lt;a href="http://code.google.com/p/jing-trang/source/list"&gt;complete history&lt;/a&gt; is available in the project repository. For people who want to stay on the bleeding edge, it's now &lt;a href="http://code.google.com/p/jing-trang/wiki/HowToBuildFromSource"&gt;really easy&lt;/a&gt; to check out and build from Subversion.&lt;/p&gt;  &lt;p&gt;My natural tendencies are much more to the cathedral than to the bazaar, but I'm trying to be more open.&amp;#160; I'm pleased to say that there are already two committers in addition to myself. There's a commercial XML editor called &lt;a href="http://www.oxygenxml.com/" target="_blank"&gt;&amp;lt;oXygen/&amp;gt;&lt;/a&gt;, which uses Jing and Trang to support RELAX NG. The main guy behind that, George Bina, had made a number of useful improvements. In particular, he upgraded Jing's support for the &lt;a href="http://thaiopensource.com/relaxng/nrl.html" target="_blank"&gt;Namespace Routing Language&lt;/a&gt; to its ISO-standardized version, which is called &lt;a href="http://nvdl.org/" target="_blank"&gt;NVDL&lt;/a&gt; (you might want to start with this &lt;a href="http://jnvdl.sourceforge.net/tutorial.html" target="_blank"&gt;NVDL tutorial&lt;/a&gt; rather than the spec).&amp;#160; This is now on the &lt;a href="http://code.google.com/p/jing-trang/source/browse/#svn/trunk/mod/nvdl/src/main/com/thaiopensource/validate/nvdl" target="_blank"&gt;trunk&lt;/a&gt;.
The other committer is Henri Sivonen, who has been using Jing in his &lt;a href="http://validator.nu/" target="_blank"&gt;Validator.nu&lt;/a&gt; service.&lt;/p&gt;  &lt;p&gt;My goals for the next release are:&lt;/p&gt;  &lt;ul&gt;   &lt;li&gt;complete support for NVDL (I think the only missing feature is inline schemas) &lt;/li&gt;    &lt;li&gt;support for the ISO-standardized version of &lt;a href="http://www.schematron.com/" target="_blank"&gt;Schematron&lt;/a&gt; &lt;/li&gt;    &lt;li&gt;customizable resource resolution support (so that, for example, you can use &lt;a href="http://www.oasis-open.org/committees/entity/" target="_blank"&gt;XML catalogs&lt;/a&gt;) &lt;/li&gt;    &lt;li&gt;support for the standard JAXP XML validation API (javax.xml.validation) &lt;/li&gt;    &lt;li&gt;more code cleanup &lt;/li&gt; &lt;/ul&gt;  &lt;p&gt;Please use the &lt;a href="http://code.google.com/p/jing-trang/issues/list" target="_blank"&gt;issue tracker&lt;/a&gt; to let me know what you would like.&amp;#160; Google Code has a system that allows you to vote for issues: if you are logged in, which you can do with a regular Google account, each issue will be displayed with a check box next to a star; checking this box &amp;quot;stars&amp;quot; the issue for you, which both adds a vote for the issue and gets you email notifications about changes to it.&lt;/p&gt;  &lt;p&gt;I haven't started any project-specific mailing lists yet.&amp;#160; For developers, the issue tracker seems to be enough at the moment.&amp;#160; For users, Jing and Trang are within the scope of the existing &lt;a href="http://tech.groups.yahoo.com/group/rng-users/" target="_blank"&gt;RELAX NG Users mailing list&lt;/a&gt; on Yahoo Groups.&lt;/p&gt;  &lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/3944976411672994427-3298162536941975690?l=blog.jclark.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml'
href='http://blog.jclark.com/feeds/3298162536941975690/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=3944976411672994427&amp;postID=3298162536941975690' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/3944976411672994427/posts/default/3298162536941975690'/><link rel='self' type='application/atom+xml' href='http://blog.jclark.com/feeds/posts/default/3298162536941975690'/><link rel='alternate' type='text/html' href='http://blog.jclark.com/2008/11/working-on-jing-and-trang.html' title='Working on Jing and Trang'/><author><name>James Clark</name><uri>http://www.blogger.com/profile/04798042939786677843</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://3.bp.blogspot.com/_CRGhVAUz8CE/SZPj6KJBPZI/AAAAAAAAAAM/TQ6htTUw0nk/S220/oriental-tea-portrait.jpg'/></author><thr:total>2</thr:total></entry></feed>
