URI Generic SyntaxΒΆ

In RFC 3986 there is embedded a grammar for parsing URIs:

:grammar URI

In chapter 1.1.2. there is started with some examples, which can be used as testcases for the grammar:

<URI> ~~ ftp://ftp.co.za/rfc/rfc1808.txt
<URI> ~~ http://www.ietf.org/rfc/rfc2396.txt
<URI> ~~ ldap://[2001:db8::7]/c=GB?objectclass?one
<URI> ~~ mailto:John.Doe@example.com
<URI> ~~ news:comp.infosystems.www.servers.unix
<URI> ~~ tel:+1-816-555-1212
<URI> ~~ telnet://192.0.2.16:80/
<URI> ~~ urn:oasis:names:specification:docbook:dtd:xml:4.1.2

For finding out which parts should be captured, an expected parse-tree can be added to some of the test-urls:

<URI> ~~ ftp://ftp.co.za/rfc/rfc1808.txt
      -> URI(
             scheme( 'ftp' ),
             authority( host( reg-name( 'ftp.co.za' ) ) ),
             path( '/rfc/rfc1808.txt' )
         )

<URI> ~~ ldap://[2001:db8::7]/c=GB?objectclass?one
      -> URI(
             scheme( 'ldap' ),
             authority( host( IPv6-address( '2001:db8::7' ) ) ),
             path( '/c=GB' ),
             query( 'objectclass?one' )
         )

<URI> ~~ tel:+1-816-555-1212
      -> URI(
             scheme( 'tel' ),
             path( '+1-816-555-1212' )
         )

<URI> ~~ urn:oasis:names:specification:docbook:dtd:xml:4.1.2
      -> URI(
             scheme( 'urn' ),
             path( 'oasis:names:specification:docbook:dtd:xml:4.1.2' )
         )

We add a test to match all parts:

<URI> ~~ http://userinfo@foo.bar.com/some/path?some=query#fragment
      -> URI(
             scheme( 'http' ),
             authority(
                 userinfo( 'userinfo' ),
                 host( reg-name( 'foo.bar.com' ) ) ),
             path( '/some/path' ),
             query( 'some=query' ),
             fragment( 'fragment' )
         )

Instead of doing the grammar 1-1 here, we create an aPTK optimized form:

<URI>  ::= <scheme> ":" <.hier-part> [ "?" <query> ]? [ "#" <fragment> ]?

There are two more possible entry-points into the grammar:

<URI-reference> ::= <.URI> | <.relative-ref>
<absolute-URI>  ::= <scheme> ":" <.hier-part> [ "?" <query> ]?

Setup basic character-classes:

alpha         = A-Z a-z
digit         = 0-9
unreserved    = {:alpha:} {:digit:} \- . ~
gen-delims    = : / ? # \[ \] @
sub-delims    = ! $ & ' ( ) * + , ; =
reserved      = {:gen-delims:} {:sub-delims:}
hexdigit      = 0-9 A-F a-f

Other basic tokens:

pct-encoded   = % [{:hexdigit:}]{2}
pchar         = [{:unreserved:} {:sub-delims:} : @] | {:pct-encoded:}
pchar-qs      = [{:unreserved:} {:sub-delims:} : @ ? /] | {:pct-encoded:}

Paths can be created as (capturing) tokens:

segment       = {pchar}*
segment-nz    = {pchar}+
segment-nz-nc = [ [{:unreserved:} {:sub-delims:} @] | {:pct-encoded:} ]+
path-abempty  = (?P<path> [ / {segment} ]* )
path-absolute = (?P<path> / [ {segment-nz} [ / {segment} ]* ]? )
path-noscheme = (?P<path> {segment-nz-nc}  [ / {segment} ]* )
path-rootless = (?P<path> {segment-nz}     [ / {segment} ]* )
path-empty    = (?P<path> (?!{pchar}) )

And IP-addresses can also be parsed by tokens:

h16         = [{:hexdigit:}]{1,4}
h16c        = {:h16:} :
dec-octet   = \d | [1-9]\d | 1\d\d | 2[0-4]\d | 25[0-5]
IPv4-address = [ {dec-octet} ]{3} {dec-octet}
ls32        = {:h16:} : {:h16:} | {IPv4-address}

IPv6-address =                         {h16c}{6}{ls32}
             |                      :: {h16c}{5}{ls32}
             |               {h16}? :: {h16c}{4}{ls32}
             | [ {h16c}{,1}{h16} ]? :: {h16c}{3}{ls32}
             | [ {h16c}{,2}{h16} ]? :: {h16c}{2}{ls32}
             | [ {h16c}{,3}{h16} ]? :: {h16c}{1}{ls32}
             | [ {h16c}{,4}{h16} ]? ::          {ls32}
             | [ {h16c}{,5}{h16} ]? ::          {h16}
             | [ {h16c}{,6}{h16} ]? ::

reg-name     = [ [{:unreserved:}{:sub-delims:}] | {pct-encoded} ]+

reg-name    := {reg-name}

Now rules are created top bottom in order of appearence:

scheme      := [{:alpha:}][{:alpha:}{:digit:}+\-.]*

hier-part   := "//" <authority> {path-abempty}
              | {path-absolute}
              | {path-rootless}
              | {path-empty}

authority   := [ <userinfo> "@" ]? <host> [ ":" <port> ]?

port        := \d+

userinfo    := [{:unreserved:}{:sub-delims:}:]*

host        := <.IP-literal> | <IPv4-address> | <reg-name>

IP-literal  := "[" [ <IPv6-address> | <IPvFuture> ] "]"

IPvFuture   := "v" [{:hexdigit:}]+ "." [{:unreserved:}{:sub-delims:}:]+

IPv4-address := {IPv4-address}
IPv6-address := {IPv6-address}

query        := {pchar-qs}*
fragment     := {pchar-qs}*

relative-ref := <.relative-part> [ "?" <query> ]? [ "#" <fragment> ]?

relative-part := "//" <authority> {path-abempty}
                | {path-absolute}
                | {path-noscheme}
                | {path-empty}