Match a regular expression against a string
Tcl distributed man page found at http://www.purl.org/tcl/home/man/tcl8.5/TclCmd/regexp.htm
regexp ?switches? exp string ?matchVar? ?subMatchVar subMatchVar ...?
Determines whether the regular expression exp matches part or all of string and returns 1 if it does, 0 if it doesn't. (Regular expression matching is described in the re_syntax reference page.)
If additional arguments are specified after string then they are treated as the names of variables in which to return information about which part(s) of string matched exp. MatchVar will be set to the range of string that matched all of exp. The first subMatchVar will contain the characters in string that matched the leftmost parenthesized subexpression within exp, the next subMatchVar will contain the characters that matched the next parenthesized subexpression to the right in exp, and so on.
If the initial arguments to regexp start with - then they are treated as switches. The following switches are currently supported:
If there are more subMatchVars than parenthesized subexpressions within exp, or if a particular subexpression in exp doesn't match the string (e.g. because it was in a portion of the expression that wasn't matched), then the corresponding subMatchVar will be set to "-1 -1" if -indices has been specified or to an empty string otherwise. (From: TclHelp 8.2.3)
puts "enter string:"
set input [read stdin]
if {[regexp "abc" $input]} {
    puts "yes"
} else {
    puts "no"
}
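To illustrate the matchVar and subMatchVar rules described above, here is a minimal sketch. The sample string and the variable names (whole, name, digits, extra and their i-prefixed counterparts) are made up for this example.

```tcl
set line "key=value"

# (\w+) captures "key"; the optional (\d+)? takes no part in the match,
# and "extra" has no corresponding subexpression at all.
set found [regexp {(\w+)=(\d+)?} $line whole name digits extra]
puts "found=$found whole=<$whole> name=<$name> digits=<$digits> extra=<$extra>"

# The same match with -indices: each variable receives a start/end pair,
# and the unmatched ones receive "-1 -1".
regexp -indices {(\w+)=(\d+)?} $line iwhole iname idigits iextra
puts "whole={$iwhole} name={$iname} digits={$idigits} extra={$iextra}"
```

Without -indices the two unmatched variables come back as empty strings, which cannot be told apart from a subexpression that matched nothing; -indices makes the difference visible.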
More info about the return values from -about, written by DKF in Feb, 2007 (with further additions and clarifications by DKF from a bit later in italics):
" currently only exist for testing purposes. Going through the definitive list, I see:
If you're not an RE wonk or matcher, I'd assert that virtually all of these are totally uninteresting. :-) The backrefs, lookahead and bounds are probably most interesting from a "describing what's in there" POV."
I can't see any value in UNONPOSIX, UUNSPEC, UUNPORT or ULOCALE; they just don't seem to correspond to any question I might ever wish to ask about a regular expression. UBSALNUM and UPBOTCH are very low-value too, as they only apply when you move the RE engine into a non-standard mode.
re_syntax covers the regular expression syntax, right?
You don't use regexp for replacements - see regsub for that.
someone needs to write up greedy vs non-greedy re issues
MG OK, I'm sure someone can do this better than me, but since nothing's here at the moment I'll make a start...
By default, the regexp characters + and * match as much as possible (which is called greedy matching). By placing a ? after them, you can make them match as little as possible (non-greedy). For example...
regexp "a.+3" "abc123abc123" var
set var
would show the match as abc123abc123, because the + matches all the characters up until the last 3. If you used...
regexp "a.+?3" "abc123abc123" var
set var
you'd see the match as abc123 because +? matches as little as possible. Greedy regexp matching is a particular problem in parsing HTML, etc, because...
set str "<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>"
regexp "<b>(.*)</b>" $str -> var
set var
would show Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text - matching as much as possible, it takes everything between the first occurrence of <b> and the last occurrence of </b>. But, using a non-greedy regexp to match...
set str "<b>Some Bold Text</b><br><i>Some Italic Text</i><br><b>More Bold Text</b>"
regexp "<b>(.*?)</b>" $str -> var
set var
would show what you want; Some Bold Text. Hope that explanation/rambling is some use, at least until someone with more idea what they're doing puts something up :)
AvL I'll now mention a common pitfall with non-greedy REs: Let's go back to the first example, but with a modified string:
regexp "a.+?3" "abc123ax3" var
set var
Although the second possible match ax3 would be shorter, it will still find the first match abc123, because even with non-greedy quantifiers, the first match always wins.
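If the shorter, later match is the one you are after, one option is to collect all matches and choose among them yourself. This is only a sketch; the variable names all and shortest are made up for the example.

```tcl
set text "abc123ax3"

# A single regexp reports the match that starts earliest.
regexp {a.+?3} $text first
puts $first

# -all -inline returns every non-overlapping match, in order.
set all [regexp -all -inline {a.+?3} $text]
puts $all

# Pick the shortest match by hand.
set shortest [lindex $all 0]
foreach m $all {
    if {[string length $m] < [string length $shortest]} {
        set shortest $m
    }
}
puts $shortest
```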
Could someone replace this line with some verbiage regarding the way one uses regular expressions for specific newline and carriage return handling (as opposed to the use of the $ metacharacter)?
Janos Holanyi: I would really need to build up an RE that would match one line and only one line - that is, excluding carriage-return-newlines (\r\n) from matching... What would such an RE look like?
LV how about something like this?
set a "abc
dev"
# a now has two lines in it
regexp -line -- {(.*)} $a b c d   ;# returns 1
puts $b                            ;# prints abc
puts $c                            ;# prints abc
Note that if you want to keep carriage returns or newlines by themselves, but not when they are together, you need something like:
regexp -- {^([^\r]|\r(?!\n))*} $a b c d
This allows plain carriage return or plain newline.
Thanks to bbh and Donal Fellows for this regular expression.
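A small sketch of how that pattern behaves; the sample string is made up, and the string map call is only there to make the control characters visible when printing.

```tcl
# The pattern from the text: any character except CR, or a CR not followed by LF.
set re {^([^\r]|\r(?!\n))*}

# A lone CR, a lone LF, then a CR-LF pair.
set sample "one\rtwo\nthree\r\nfour"
regexp -- $re $sample matched

# Show the match with control characters made visible.
puts [string map [list "\r" {\r} "\n" {\n}] $matched]
```

The match runs through the lone carriage return and the lone newline and stops just before the \r\n pair.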
From the comp.lang.tcl newsgroup: I did some experimenting with other strings, like "just a HHHHEEEEAAAADDDDEEEERRRR". The regular expression {(.)\1\1\1} does the job I would have wanted, whereas {(.){4}} will return the last of each four characters - as posted as well.
That surprised me too -- being able to place backreferences within the regex is an extremely powerful technique.
regsub -all {(.)\1{3}} $string {\1} result
for exactly 4 char repeats, and {(.)\1+} for arbitrary repeats
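Here is a short sketch of both backreference patterns applied with regsub. The first input is the string quoted from the newsgroup post; the second input and the variable names count and squeezed are made up.

```tcl
set string "just a HHHHEEEEAAAADDDDEEEERRRR"

# Collapse each group of four identical characters to a single one.
set count [regsub -all {(.)\1{3}} $string {\1} result]
puts "$count replacements: $result"

# Collapse runs of any length (two or more) instead.
regsub -all {(.)\1+} "aaabccccd" {\1} squeezed
puts $squeezed
```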
Laurent Riesterer has written a Visual Regexp tool to help understand regexp operation.
Feb 9th 2007 CJL wondered on Ask#5 what the correct/best/proper way of writing a regexp with quotes and the current value of a variable in the expression was? I want to match various patterns of the form <INPUT TYPE="TEXT" NAME="$something" SIZE="\d+" MAXLENGTH="\d" VALUE="\S+">, where $something has a range of values that is a subset of all possible values, i.e. I don't want to put \S+ in place of $something as that will give unwanted matches. Note the presence of quotes and escapes to complicate things.
MG Using format is probably one of the simplest.
set something "foobar"
set pattern {<INPUT TYPE="TEXT" NAME="%s" SIZE="\d+" MAXLENGTH="\d" VALUE="\S+">}
set pattern [format $pattern $something]
Assuming, of course, you don't have %-'s in your string. Otherwise, building it in steps may be easiest:
set something "foobar"
set pattern {<INPUT TYPE="TEXT" NAME="}
append pattern $something
append pattern {" SIZE="\d+" MAXLENGTH="\d" VALUE="\S+">}
LV I suspect the OP will need to replace those \d with %d and the \S with %s .
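A quick check of the format approach: format only interprets % sequences, so the \d and \S escapes in the braced template should pass through to the RE engine untouched. The two sample HTML strings are made up, the variable name template is chosen for this sketch, and it assumes $something contains no RE metacharacters of its own.

```tcl
set something "foobar"
set template {<INPUT TYPE="TEXT" NAME="%s" SIZE="\d+" MAXLENGTH="\d" VALUE="\S+">}
set pattern [format $template $something]
puts $pattern

# Hypothetical inputs: one with the wanted NAME, one with a different NAME.
set wanted {<INPUT TYPE="TEXT" NAME="foobar" SIZE="20" MAXLENGTH="8" VALUE="hello">}
set other  {<INPUT TYPE="TEXT" NAME="bazqux" SIZE="20" MAXLENGTH="8" VALUE="hello">}

puts [regexp -- $pattern $wanted]
puts [regexp -- $pattern $other]
```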
NEM - A question on the Tclers Chat brought up a common problem that I've had when dealing with regular expressions. The RE engine allows [^AB] to mean "not A or B", but what if you want to match anything but the string "AB"? The only way to do it is to put lots of negated classes one after the other, which is ugly. So, here is a way to wrap that up into something a bit more elegant:
proc not {pattern} {
    set ret "(?:" ;# Not capturing bracket
    foreach char [split $pattern {}] {
        append ret "\[^$char\]"
    }
    append ret ")"
    return $ret
}
Then you can do:
regexp -- "AB([not AB]*)AB(.*)" ABcdefghABijklmnopqrst -> first rest
first = "cdefgh"
rest = "ijklmnopqrst"
And it handles things like:
regexp -- "AB([not AB]*)AB(.*)" ABcdefghBAijkABlmnopqrst -> first rest
first = "cdefghBAijk"
rest = "lmnopqrst"
Note though, that this will only match patterns which are at least the same length as the negated expression:
regexp -- "AB([not AB]*)AB(.*)" ABcABslkdjf -> first rest
=> 0
The proper solution to this problem is a lot more complex, unfortunately.
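To see why the length matters, it helps to print what not actually generates. The sketch below repeats the proc so that it runs on its own; the group it builds consumes text in chunks as long as the negated pattern, one negated class per character.

```tcl
proc not {pattern} {
    set ret "(?:" ;# non-capturing group
    foreach char [split $pattern {}] {
        append ret "\[^$char\]"
    }
    append ret ")"
    return $ret
}

# The generated fragment and the full pattern built from it.
puts [not AB]
set re "AB([not AB]*)AB(.*)"
puts $re

# Six characters between the delimiters: three chunks of two.
puts [regexp -- $re ABcdefghABijklmnopqrst -> first rest]
puts "first=$first rest=$rest"

# One character between the delimiters: no whole chunk fits.
puts [regexp -- $re ABcABslkdjf]
```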
The above three regexp's can be written using a lookahead constraint.
foreach str {ABcdefghABijklmnopqrst ABcdefghBAijkABlmnopqrst ABcABslkdjf} {
    set e "regexp -- {AB(?!AB)(.*)AB(.*)} $str -> first rest"
    puts "$e\n=> [eval $e]\nfirst = $first\nrest = $rest\n"
}
Output:
regexp -- {AB(?!AB)(.*)AB(.*)} ABcdefghABijklmnopqrst -> first rest
=> 1
first = cdefgh
rest = ijklmnopqrst

regexp -- {AB(?!AB)(.*)AB(.*)} ABcdefghBAijkABlmnopqrst -> first rest
=> 1
first = cdefghBAijk
rest = lmnopqrst

regexp -- {AB(?!AB)(.*)AB(.*)} ABcABslkdjf -> first rest
=> 1
first = c
rest = slkdjf
Important to note on the "not pattern" example above is that it will NOT match strings where there is an occurrence of the first letter from {pattern} when not part of the entirety of {pattern}:
% regexp -- "AB([not AB]*)AB(.*)" ABcdefghAijklmABnopqrst -> first rest
0
% regexp -- "AB([not AB]*)AB(.*)" ABcdefghBijklmABnopqrst -> first rest
1
% set first
cdefghBijklm
% set rest
nopqrst
DKF: It's actually fairly easy to request that an RE shouldn't match something. You just need some magic around it like this:
regexp {^(?:(?!AB).)*$} $string
That matches any string that doesn't contain "AB" as a substring.
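A quick sketch trying DKF's pattern on a few made-up strings, including the empty string:

```tcl
set re {^(?:(?!AB).)*$}

# Each candidate is tested in turn; 1 means "AB" does not occur in it.
foreach s {ABC xABy ACB BA {}} {
    puts [format "%-6s => %d" "\"$s\"" [regexp -- $re $s]]
}
```

The lookahead is checked before every single character, so the RE can only reach the end of the string if no position starts an "AB".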
elfring 2003-10-29 A Tcl value can carry a compiled regular expression as its internal representation. REs can be pre-compiled by the call "regexp $RE {}".
I would love to see some clarification on exactly how non-reporting subpatterns work with -inline, specifically whether you can silence the overall pattern match:
% set str { asd;flkj <img src="example.jpg" > sad;lfjl;kjf<IMg src="browser/ie.gif"> asdflaj;lkfjasdf lsdk }
% set _Img {<img src="?([\w\./]*)"?[^>]*>}
<img src="?([\w\./]*)"?[^>]*>
% regexp -all -nocase -inline $_Img $str
{<img src="example.jpg" >} example.jpg {<IMg src="browser/ie.gif">} browser/ie.gif
glennj: You can't silence the full match. You will have to iterate over the results of regexp thusly:
set matches [list]
foreach {full submatch} [regexp -all -nocase -inline $_Img $str] {
    lappend matches $submatch
}
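Put together with the data from the question, the loop can be run on its own as sketched below. The sample text in str is shortened here; everything else is taken from the exchange above.

```tcl
set str {asd;flkj <img src="example.jpg" > sad;lfjl;kjf<IMg src="browser/ie.gif"> asdf}
set _Img {<img src="?([\w\./]*)"?[^>]*>}

# regexp -all -inline returns full match and submatch alternately;
# take them two at a time and keep only the submatch.
set matches [list]
foreach {full submatch} [regexp -all -nocase -inline $_Img $str] {
    lappend matches $submatch
}
puts $matches
```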
elfring 2004-07-05 Does anybody know of problems and solutions for matching optional parts with regular expressions?
MG July 17th 2004 - The problem with the regexp there seems to be that one of the parts to match optional white space is in the wrong place, and is matching too much. If you use this regexp instead, it works for me, on Win XP with Tcl 8.4.6. (The change is that, after </S_URI> and before <P_URI>, the .*? has been moved inside the (?: ... ).)
set pattern {<name>(.+)</name>(?:.*?<scope>(SYSTEM|PUBLIC)</scope>.*?<S_URI>(.+)</S_URI>(?:.*?<P_URI>(.+)</P_URI>)?)?(?:.*?<definition>(.*?)</definition>)?(?:.*?<attributes>(.*?)</attributes>)?.*?<content>(.*)</content>\s*$}
set string {<name>gruss</name> <scope>SYSTEM</scope> <S_URI>http://XXX/Hallo.dtd</S_URI> <P_URI>http://YYY/Leute.dtd</P_URI> <definition><!ELEMENT gruss (#PCDATA)></definition> <attributes>Versuch="1"</attributes> <content><h1>Guten Tag!</h1></content>}
regexp $pattern $string z name scope system public definition attributes content
regexp {([^:]+)://([^:/]+)(:([0-9]+))} [ns_conn location] match protocol server x port
the above author should remember this is a TCL wiki, and not an aolserver one, but thanks for the submission ;)
Tcl dynamically caches compiled regular expressions. The Tcl core caches the last 30 REs it compiled, but you can cause any number of REs to be cached by assigning them to variables. If a regular expression is assigned to a variable and the variable is not changed, the Tcl core will save the compiled version of the RE and reuse it the next time the variable is used as a pattern. In the core, the compiled version of the RE is stored in the Tcl_Obj, along with its string representation.
To find #pragma <something> statements define a pattern like
set re {^\s*#\s*pragma\s+(.)}
if { [regexp $re $line -> rest] } {
    ...
}
The above example will cause the compiled regular expression to be stored in the re variable.
(From c.l.t)
The run time benefit of regular expression caching can easily be shown:
# Run N different regexp patterns
proc test_regexps N {
    for {set i 0} {$i < $N} {incr i} {
        regexp "foobar$i" "foobar1"
    }
}
puts "29 Took: [time { test_regexps 29 } 100]"
puts "30 Took: [time { test_regexps 30 } 100]"
puts "31 Took: [time { test_regexps 31 } 100]"
puts "32 Took: [time { test_regexps 32 } 100]"
One run of this gave:
29 Took: 298 microseconds per iteration
30 Took: 372 microseconds per iteration
31 Took: 2000 microseconds per iteration
32 Took: 2107 microseconds per iteration
...clearly showing the extra cost of having to recompile each regexp pattern each time through, due to exceeding NUM_REGEXPS (30).
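Following the earlier remark about keeping REs in variables, one way to stay clear of the cache limit might be to build the pattern values once and reuse those same values on every call. This is only a sketch: count_matches and patterns are names invented here, and no timings were measured for it.

```tcl
# Build 40 pattern values once. Each list element is a value of its own
# that can hold on to its compiled form after first use.
set patterns {}
for {set i 0} {$i < 40} {incr i} {
    lappend patterns "foobar$i"
}

# Reuse the stored values instead of rebuilding the pattern strings.
proc count_matches {patterns subject} {
    set n 0
    foreach re $patterns {
        if {[regexp -- $re $subject]} {
            incr n
        }
    }
    return $n
}

puts [count_matches $patterns "foobar1"]
```

Comparing this against test_regexps with the time command would show whether it helps on a given Tcl version.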
DKF writes that it is hard to do this with any single RE on its own, though you can do it quite easily using a couple of things coupled together. This example uses regsub to strip the problematic lines, but cannot completely get rid of leading and trailing newlines without the extra string trim:
string trim [regsub -all {\n(?:\s*\n)+} $data \n] \n
However, I prefer selecting things positively, leading to a solution using regexp and join:
join [regexp -all -inline {(?=[^\n]*\S)[^\n]+} $data] \n
DKF 10-Aug-2006: More experimentation indicates that a single regsub can do the whole job:
regsub -all {^\n+|\n+$|(\n)+} $data {\1}
Note that the order of the alternatives is important!
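As a sketch, here are the first two approaches run on a small made-up sample that has empty lines at both ends and a whitespace-only line in the middle. The variable names a and b are chosen for this example.

```tcl
set data "\n\nline one\n\n   \nline two\n\n"

# Approach 1: collapse runs of blank lines, then trim the ends.
set a [string trim [regsub -all {\n(?:\s*\n)+} $data \n] \n]

# Approach 2: select every line that contains a non-space character.
set b [join [regexp -all -inline {(?=[^\n]*\S)[^\n]+} $data] \n]

puts [list $a $b]
```

On this input both give the same two lines joined by a single newline.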
DKF: Sometimes it is useful to be able to write a regular expression that matches a string that contains some number of substrings (typically words) in any order. In normal regexps, this is a horrible thing to write down as the size of the RE term varies exponentially with the number of substrings. However, if you don't mind matching behaviour that is guaranteed to be non-optimal in some strict sense, and if you don't want any capturing parens, you can use positive lookahead assertions to make things neater.
Thus, to match a string that contains foo, bar and spong within it in any order, use a RE like this:
set RE {(?=.*foo)(?=.*bar)(?=.*spong).}
set matched [regexp $RE $string]
Just note that if you use this, you cannot know where those strings matched; lookahead assertions don't support that. If you need that data, use multiple regexp matches instead
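A sketch of both variants: the lookahead RE for a plain yes/no answer, and one regexp -indices call per word when the positions are needed. The helper where_are and the sample sentences are made up, and the helper assumes the words contain no RE metacharacters.

```tcl
set RE {(?=.*foo)(?=.*bar)(?=.*spong).}
puts [regexp $RE "spong, then bar, then foo"]
puts [regexp $RE "foo and bar only"]

# Hypothetical helper: returns a list of word/range pairs,
# or an empty list if any word is missing.
proc where_are {string words} {
    set result {}
    foreach w $words {
        if {![regexp -indices -- $w $string range]} {
            return {}
        }
        lappend result $w $range
    }
    return $result
}

puts [where_are "spong, then bar, then foo" {foo bar spong}]
puts [where_are "foo and bar only" {foo bar spong}]
```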
---
MAH: What would be the correct way to loop over all matches of a regular expression in a string? I came up with the following solution for finding all include statements in a string, but using -start has side effects on the meaning of characters like $ and ^.
set pos 0
while {[regexp -start $pos {`include "([\w/.]+)"} $data string vincfile]==1} {
    set pos [expr {$pos+[string length $string]}]
    puts "file=$vincfile"
}
Lars H: The option combination -all -inline is probably what you're looking for (although in general the problem of "finding all matches" runs into several technical issues, due to the fact that matches may overlap).
In combination with -start, one has to use \A and \Z instead of ^ and $, unless the intent is to use the newline-sensitive behaviours of the latter. -indices may also be useful.
MAH: Okay, -inline is too clumsy for me since I don't want the overall match string. Instead I'll go with -indices. This gives me
set pos 0
while {[regexp -start $pos -indices {`include "([\w/.]+)"} $data -> vincfilepos]==1} {
    set vincfile [string range $data [lindex $vincfilepos 0] [lindex $vincfilepos 1]]
    set pos [lindex $vincfilepos 1]
    puts "file=$vincfile"
}
I think that's rather clumsy for a task this common. Any ideas on how to make it simpler?
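One possibly simpler variant, following Lars H's -all -inline suggestion: let foreach take the results two at a time and ignore the overall match. This is a sketch; the sample data and the variable names whole and files are made up.

```tcl
set data {`include "defs/common.vh"
module top;
`include "cpu.vh"}

# -all -inline yields: full match, submatch, full match, submatch, ...
set files {}
foreach {whole vincfile} [regexp -all -inline {`include "([\w/.]+)"} $data] {
    lappend files $vincfile
    puts "file=$vincfile"
}
```

The overall match is still returned, but it only costs an unused loop variable, and no position bookkeeping is needed.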
BAS: just a tidbit, the PostgreSQL DBMS uses Tcl's regexp engine for its own regexp handling.
Gururajesh: A perfect regular expression to validate an IP address with a single expression.
if {[regexp {(^[2][5][0-5].|^[2][0-4][0-9].|^[1][0-9][0-9].|^[0-9][0-9].|^[0-9].)([2][0-5][0-5].|[2][0-4][0-9].|[1][0-9][0-9].|[0-9][0-9].|[0-9].)([2][0-5][0-5].|[2][0-4][0-9].|[1][0-9][0-9].|[0-9][0-9].|[0-9].)([2][0-5][0-5]|[2][0-4][0-9]|[1][0-9][0-9]|[0-9][0-9]|[0-9])$} $string match v1 v2 v3 v4]} {
    puts "$v1$v2$v3$v4"
} else {
    puts "none"
}
For string "245.254.253.2", the output is 245.254.253.2. For string "265.254.243.2", the output is none, as an IP address can't have a number greater than 255.
Lars H: Perfect? No, it looks like it would accept 99a99b99c99, since . will match any character. Also, it can be shortened significantly by making use of {4} and the like (see Regular expressions).
AK - 2010-03-18 17:07:32
Tcllib should be useful
AMG: Here's a very similar script that uses scan instead of regexp. It's much more readable, in my opinion.
if {[scan $string %d.%d.%d.%d a b c d] == 4
 && 0 <= $a && $a <= 255 && 0 <= $b && $b <= 255
 && 0 <= $c && $c <= 255 && 0 <= $d && $d <= 255} {
    puts $a.$b.$c.$d
} else {
    puts none
}
There are a few differences. One, the trailing dot is omitted from the first three output variables (which I call a, b, c, d instead of v1, v2, v3, v4). Two, leading zeroes are permitted and discarded. Three, -0 is accepted as 0. Four, garbage at the end of $string is silently discarded. Five, each octet can have a leading +, e.g. +255.+255.+255.+255. Six, it's OVER FIVE TIMES FASTER! On this machine, my version using scan takes 15 microseconds, whereas your version using regexp takes 78 microseconds. Use the time command to measure performance. (I replaced puts with return when testing.)
Now, here's a hybrid version that uses regexp.
if {[regexp {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} $string _ a b c d]
 && 0 <= $a && $a <= 255 && 0 <= $b && $b <= 255
 && 0 <= $c && $c <= 255 && 0 <= $d && $d <= 255} {
    puts $a.$b.$c.$d
} else {
    puts none
}
This version takes 46 microseconds to execute. It doesn't accept leading + or -. It rejects garbage at the end of the string. It treats the octets as octal if they are given leading zeroes, and invalid octal is always accepted. The reason for the latter is that the if command treats strings containing invalid octal as nonnumeric text, so the <= operator compares them as text rather than as numbers. Corrected version:
if {[regexp {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} $string _ a b c d]
 && [string is integer $a] && 0 <= $a && $a <= 255
 && [string is integer $b] && 0 <= $b && $b <= 255
 && [string is integer $c] && 0 <= $c && $c <= 255
 && [string is integer $d] && 0 <= $d && $d <= 255} {
    puts $a.$b.$c.$d
} else {
    puts none
}
This version takes 47 microseconds and it rejects invalid octal. However, it still interprets numbers as octal if leading zeroes are given, so 0377.255.255.255 is accepted (but 0400.255.255.255 is rejected). To fix this, it would be necessary to make a pattern that rejects leading zeroes unless the octet is exactly zero, something like: {(0|[1-9]\d*)}. But this is getting clumsy and slow; I prefer the scan solution. Regexp: not always the right tool!
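For completeness, here is a sketch of what such a leading-zero-rejecting version might look like. The proc name valid_ipv4 and the cap of three digits per octet are choices made for this example, not something from the discussion above, and no timings were taken.

```tcl
# Hypothetical helper: returns 1 for a dotted quad with four octets in 0..255.
proc valid_ipv4 {addr} {
    # Either a lone zero, or one to three digits with no leading zero.
    set octet {(0|[1-9]\d{0,2})}
    set re "^$octet\\.$octet\\.$octet\\.$octet\$"
    if {![regexp -- $re $addr -> a b c d]} {
        return 0
    }
    # No leading zeroes can get this far, so these are plain decimal compares.
    foreach n [list $a $b $c $d] {
        if {$n > 255} {
            return 0
        }
    }
    return 1
}

foreach s {245.254.253.2 265.254.243.2 0377.255.255.255 99a99b99c99 0.0.0.0} {
    puts "$s => [valid_ipv4 $s]"
}
```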
Gururajesh
set string "0377.255.255.255"
if {[regexp {^(\d+)\.(\d+)\.(\d+)\.(\d+)$} $string _ a b c d]
 && [string is integer $a] && [scan $a %d v1] && 0 <= $v1 && $v1 <= 255
 && [string is integer $b] && [scan $b %d v2] && 0 <= $v2 && $v2 <= 255
 && [string is integer $c] && [scan $c %d v3] && 0 <= $v3 && $v3 <= 255
 && [string is integer $d] && [scan $d %d v4] && 0 <= $v4 && $v4 <= 255} {
    puts $v1.$v2.$v3.$v4
} else {
    puts none
}
This will be OK for the above-mentioned issue.
Saravanan Can anyone tell me how to count the occurrences of a particular character in a given string (using regexp only)? E.g.: set a "hithisisisis"; I need to find how many times 'i' occurs in $a.
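One way to do this with regexp alone is the -all switch, which makes regexp return the number of matches instead of just 0 or 1. A minimal sketch (count is a name chosen here):

```tcl
set a "hithisisisis"

# With -all and no match variables, regexp returns the number of matches.
set count [regexp -all -- {i} $a]
puts $count
```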