As most of script language have own regular expression mechanism, Sagittarius also has own regular expression library. It is influenced by Java's regular expression, so there are a lot of differences between the most famous Perl regular expression(perlre).
This feature may be changed, if R7RS large requires Perl like regular expression.
Following examples show how to use Sagittarius's regular expression.
;; For Perl like
(cond ((looking-at (regex "^hello\\s*(.+)") "hello world!")
=> (lambda (m) (m 1))))
world!
;; For Java like
(cond ((matches (regex "(\\w+?)\\s*(.+)") "123hello world!")) ;; this won't match
(else "incovenient eh?"))
The matches
procedure is total match, so it ignores boundary matcher
'^'
and '$'
. The looking-at
procedure is partial match, so
it works as if perlre.
This library provides Sagittarius regular expression procedures.
String must be regular expression. Returns compiled regular expression. Flags' descriptions are the end of this section. The following table is the supported regular expression constructs.
Construct | Matches |
---|---|
Characters | < |
x
|
The character x |
\\
|
The backslash character |
\0n
|
The character with octal value 0n (0 <= n <= 7) |
\0nn
|
The character with octal value 0nn (0 <= n <= 7) |
\0mnn
|
The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7) |
\xhh
|
The character with hexadecimal value 0xhh |
\uhhhh
|
The character with hexadecimal value 0xhhhh |
\Uhhhhhhhh
|
The character with hexadecimal value 0xhhhhhhhh. If the value exceed the maxinum fixnum value it rases an error. |
\t
|
The tab character ('\u0009') |
\n
|
The newline (line feed) character ('\u000A') |
\r
|
The carriage-return character ('\u000D') |
\f
|
The form-feed character ('\u000C') |
\a
|
The alert (bell) character ('\u0007') |
\e
|
The escape character ('\u001B') |
\cx
|
The control character corresponding to x |
Character classes | < |
[abc]
|
a, b, or c (simple class) |
[^abc]
|
Any character except a, b, or c (negation) |
[a-zA-Z]
|
a through z or A through Z, inclusive (range) |
[a-d[m-p]]
|
a through d, or m through p: [a-dm-p] (union) |
[a-z&&[def]]
|
d, e, or f (intersection) |
[a-z&&[^bc]]
|
a through z, except for b and c: [ad-z] (subtraction) |
[a-z&&[^m-p]]
|
a through z, and not m through p: a-lq-z |
Predefined character classes | < |
.
|
Any character (may or may not match line terminators) |
\d
|
A digit: [0-9] |
\D
|
A non-digit: [^0-9] |
\s
|
A whitespace character: [ \t\n\x0B\f\r] |
\S
|
A non-whitespace character: [^\s] |
\w
|
A word character: [a-zA-Z_0-9] |
\W
|
A non-word character: [^\w] |
Boundary matchers | < |
^
|
The beginning of a line |
$
|
The end of a line |
\b
|
A word boundary |
\B
|
A non-word boundary |
\A
|
The beginning of the input |
\G
|
The end of the previous match |
\Z
|
The end of the input but for the final terminator, if any |
\z
|
The end of the input |
Greedy quantifiers | < |
X?
|
X, once or not at all |
X*
|
X, zero or more times |
X+
|
X, one or more times |
X{n}
|
X, exactly n times |
X{n,}
|
X, at least n times |
X{n,m}
|
X, at least n but not more than m times |
Reluctant quantifiers | < |
X??
|
X, once or not at all |
X*?
|
X, zero or more times |
X+?
|
X, one or more times |
X{n}?
|
X, exactly n times |
X{n,}?
|
X, at least n times |
X{n,m}?
|
X, at least n but not more than m times |
Possessive quantifiers | < |
X?+
|
X, once or not at all |
X*+
|
X, zero or more times |
X++
|
X, one or more times |
X{n}+
|
X, exactly n times |
X{n,}+
|
X, at least n times |
X{n,m}+
|
X, at least n but not more than m times |
Logical operators | < |
XY
|
X followed by Y |
X|Y
|
Either X or Y |
(X)
|
X, as a capturing group |
Back references | < |
\n
|
Whatever the nth capturing group matched |
Quotation | < |
\
|
Nothing, but quotes the following character |
\Q
|
Nothing, but quotes all characters until \E |
\E
|
Nothing, but ends quoting started by \Q |
Special constructs (non-capturing) | < |
(?:X)
|
X, as a non-capturing group |
(?imsux-imsux)
|
Nothing, but turns match flags on - off |
(?imsux-imsux:X)
|
X, as a non-capturing group with the given flags on - off |
(?=X)
|
X, via zero-width positive lookahead |
(?!X)
|
X, via zero-width negative lookahead |
(?<=X)
|
X, via zero-width positive lookbehind |
(?<!X)
|
X, via zero-width negative lookbehind |
(?>X)
|
X, as an independent, non-capturing group |
(?#...)
|
comment. |
Since version 0.2.3, \p
and \P
are supported. It is cooporated
with SRFI-14 charset. However it is kind of tricky. For example regex parser
can reads \p{InAscii}
or \p{IsAscii}
and search charset named
char-set:ascii
from current library. It must have In
or Is
as its prefix.
This reader macro provides Perl like regular expression syntax.
It allows you to write regular expression like this #/\w+?/i
instead of
like this (regex "\\w+?" CASE-INSENSITIVE)
.
Regex must be regular expression object. Returns closure if regex matches input string.
The matches
procedure attempts to match the entire input string against
the pattern of regex.
The looking-at
procedure attempts to match the input string against the
pattern of regex.
Pattern must be pattern object.
The first form of these procedures are for convenience. It is implemented like this;
(define (regex-replace-all pattern text replacement)
(regex-replace-all (regex-matcher pattern text) replacement))
Text must be string.
Replacement must be either string or procedure which takes matcher object and string port as its arguments respectively.
Replaces part of text where regex matches with replacement.
If replacement is a string, the procedure replace text with given
string. Replacement can refer the match result with `$_n_
`.
n must be group number of given pattern or matcher.
If replacement is a procedure, then it must accept either one or two arguments. This is for backward compatibility.
The first argument is always current matcher.
If the procedure only accepts one argument, then it must return a string which will be used for replacement value.
If the procedure accepts two arguments, then the second one is string output port. User may write string to the given port and will be the replacement string.
The regex-replace-first
procedure replaces the first match.
The regex-replace-all
procedure replaces the all matches.
text must be a string.
pattern must be a string or regex-pattern object.
Split text accoding to pattern.
The above procedures are wrapped User level API. However, you might want to use low level API directory when you re-use matcher and recursively find pattern from input. For that purpose, you need to use low level APIs directly.
NOTE: This API might be changed in future depending on R7RS large.
Returns #f if obj is regular expression object, otherwise #f.
Returns #f if obj is matcher object, otherwise #f.
The same as regex
procedure.
Regex must be regular expression object. Returns matcher object.
Matcher must be matcher object. Returns #t if matcher matches the entire input string against input pattern, otherwise #f.
Matcher must be matcher object. Returns #t if matcher matches the input string against input pattern, otherwise #f.
Matcher must be matcher object. Resets matcher and then attempts to find the next subsequence of the input string that matches the pattern, starting at the specified index if optional argument is given otherwise from the beginning.
Matcher must be matcher object. Index must be non negative exact integer.
Retrieve captured group value from matcher.
Matcher must be matcher object.
Returns number of captured groups.
Regular expression compiler can take following flags.
Enables case-insensitive matching. i
as a
flag
Permits whitespace and comments in pattern. x
as a
flag
Enables multiline mode. m
as a flag
Enables literal parsing of the pattern.
Enables dotall mode. s
as a flag
Enables Unicode-aware case folding and pre defined charset.
u
as a flag.
NOTE: when this flag is set then pre defined charset, such as \d
or
\w
, expand it's content to Unicode characters. Following properties
are applied to charsets.
[[:lower:]]
The lower case charset contains Ll
and Other_Lowercase
.
[[:upper:]]
The upper case charset contains Lu
and Other_Uppercase
.
[[:title:]]
The upper case charset contains Lt
.
[[:alpha:]]
The upper case charset contains L
, Nl
and
Other_Alphabetic
.
[[:numeric:]]
The upper case charset contains Nd
.
[[:punct:]]
The upper case charset contains P
.
symbol
The upper case charset contains Sm
, Sc
, Sk
and
So
.
[[:space:]]
The upper case charset contains Zs
, Zl
and Zp
.
[[:cntrl:]]
The upper case charset contains Cc
, Cf
, Co
, Cs
,
and Cn
.
Above APIs can be used bytevectors as well. In this case, the regular expression engine treats given bytevectors as if it's ASCII strings. If users want to use this feature, users must give bytevectors instead of strings.