<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<font face="Helvetica, Arial, sans-serif">Hi Elias,<br>
<br>
see below.<br>
<br>
/// Jürgen<br>
<br>
</font><br>
<div class="moz-cite-prefix">On 10/12/2017 09:13 AM, Elias Mårtenson
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CADtN0WJGxBjNCMSu=Sxu8w=HKfPi3xp-***@mail.gmail.com">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">On 11 October 2017 at 21:15, Juergen
Sauermann <span dir="ltr">&lt;<a
href="mailto:***@t-online.de"
target="_blank" moz-do-not-send="true">***@t-online.de</a>&gt;</span>
wrote:<br>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"><font face="Helvetica, Arial,
sans-serif">If I understand <b>libpcre2</b> correctly
(and I probably don't) then a general regular
expression RE is a tree whose<br>
structure is determined by the nesting of the
parentheses in RE, and the result of a match follows
the tree structure.<br>
</font></div>
</blockquote>
<div><br>
</div>
<div>Actually, this is not the case. When you have
subexpressions, what you have is simply a list of them,
and each subexpression has a value. Whether or not these
subexpressions are nested does not matter. Its position is
purely dictated by the index of the opening parentheses.</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
Not exactly. It is true that libpcre2 returns a list of matches in
terms of the position of each<br>
match in the subject string B. However, any two matches are either
disjoint or one match is<br>
contained in the other. This containment relation defines a partial
order between the<br>
matches, which is most conveniently described by a tree. In that tree
one RE, say RE1, is a<br>
child of another RE, say RE2, if the substring of B corresponding to RE1
is contained in the<br>
substring of B that corresponds to RE2.<br>
<br>
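To illustrate the containment relation, here is a rough sketch in Python
(not GNU APL code, and Python's re module is not libpcre2, so details may
differ). The names contains and build_tree are made up for this example,
and the sketch assumes a pattern without lookarounds, so that every group
lies inside the whole match:<br>
<pre>
```python
import re


def contains(outer, inner):
    # outer contains inner when outer starts no later and ends no earlier
    starts_first = min(outer[0], inner[0]) == outer[0]
    ends_last = max(outer[1], inner[1]) == outer[1]
    return starts_first and ends_last


def build_tree(spans):
    """Turn a flat list of (start, end) spans into a nested tree.

    spans[0] is the whole match and becomes the root; the remaining
    spans are in group-number order (order of the opening parentheses).
    """
    root = {"span": spans[0], "children": []}
    stack = [root]
    for span in spans[1:]:
        if span == (-1, -1):
            # Python reports (-1, -1) for a group that did not take part
            continue
        # climb up until the node on top of the stack contains this span
        while not contains(stack[-1]["span"], span):
            stack.pop()
        node = {"span": span, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


m = re.search(r"((\w+)=(\w+));", "key=val;")
spans = [m.span(i) for i in range(m.re.groups + 1)]
tree = build_tree(spans)
print(spans)  # [(0, 8), (0, 7), (0, 3), (4, 7)]
print([child["span"] for child in tree["children"][0]["children"]])
# [(0, 3), (4, 7)]
```
</pre>
Here the root is the whole match, its only child is the name=value group,
and that group in turn has the name and the value as children.<br>
<br>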
The question is then: shall <b>⎕RE</b> simply return the array of
matches (which was what your<br>
implementation did) or shall <b>⎕RE</b> return the matches as a
tree? In other words:<br>
shall the tree be represented as a simple vector of nodes
(corresponding to an APL<br>
vector of some kind), or as a recursive
node-properties + children structure (corresponding to a nested APL
value)?<br>
<br>
The vector of nodes and the nested APL value are both equivalent in
describing the<br>
tree. However, converting the nested tree structure to a vector of
nodes is much simpler<br>
(in APL) than the other way around because converting a node vector
to the tree involves<br>
a lot of comparisons, which are quite lightweight but extremely ugly
in APL. That was why<br>
I decided to return the tree and not the vector of nodes.<br>
<br>
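The easy direction (tree to node vector) is just a depth-first walk. A
minimal sketch, again in Python rather than APL; the dictionary layout
with "span" and "children" is only an assumption made for this example
and is not the actual ⎕RE result format:<br>
<pre>
```python
def flatten(node):
    """Preorder walk: the node itself, then each child's subtree in order."""
    out = [node["span"]]
    for child in node["children"]:
        out.extend(flatten(child))
    return out


# hypothetical result tree for a name=value match, written out by hand
tree = {"span": (0, 8), "children": [
    {"span": (0, 7), "children": [
        {"span": (0, 3), "children": []},
        {"span": (4, 7), "children": []},
    ]},
]}

nodes = flatten(tree)
print(nodes)      # [(0, 8), (0, 7), (0, 3), (4, 7)]
print(nodes[1:])  # everything except the root: [(0, 7), (0, 3), (4, 7)]
```
</pre>
Going the other way needs a containment test for every node, which is
the part that gets ugly in APL.<br>
<br>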
Now, to have an option that drops the first element means to have an
option that returns<br>
the nodes of the result tree except its root node. Although
technically possible, this sounds<br>
very arbitrary to me. It may suit a particular use case, but it does
not, IMHO, deserve a<br>
special flag. I could also create a use case where it makes sense
that only every second<br>
node of the tree is returned, for example when matching some
name=value pairs where<br>
I am only interested in the values and not the names.<br>
<br>
I am not entirely against a flag that goes in that direction, but
I believe that flag should<br>
determine whether the tree is returned (the default) or the node
vector of the tree (if<br>
the flag is given). Unfortunately that flag, even though it is far
more consistent with the<br>
structure of the ⎕RE result than 1↓, does not make your 1↓ unnecessary, because
the node vector would still contain<br>
the top-level match (= the root of the tree).<br>
<br>
<blockquote type="cite"
cite="mid:CADtN0WJGxBjNCMSu=Sxu8w=HKfPi3xp-***@mail.gmail.com">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>When you use subexpressions, it means that I am
interested in specific parts of the matched string. If I
am interested in a specific part of a string, it is very
unlikely that I want to know the content of the entire
match. But, if I do, I can always retrieve that using
another set of parens that surrounds the entire regexp.<br>
</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
Not necessarily. It could also be a boundary condition of your match
that you<br>
only want to be satisfied, no matter how. REs like <b>[A-Z][a-z][0-9]</b>
are often used that way.<br>
<blockquote type="cite"
cite="mid:CADtN0WJGxBjNCMSu=Sxu8w=HKfPi3xp-***@mail.gmail.com">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>When you don't have any subexpressions, it's most
likely that I am not interested in the matched string at
all, but rather just a boolean result telling me if I have
a match at all.</div>
<div><br>
</div>
<div>The boolean case is simple, so the only aspect of this
that warrants any discussion is how that should be
achieved. My opinion is that it should be the default, but
a flag can also be used.</div>
<div><br>
</div>
<div>For subexpressions, I think a few examples will help
explain how they are used:</div>
<div><br>
</div>
<div>Let's assume the following regexp:</div>
<div><br>
</div>
<div><font face="monospace, monospace"> A(.)|B(.)</font></div>
<div><br>
</div>
<div>This regexp has two subexpressions, and the result
will therefore have two values. Due to the fact that they
are separated by the alternation symbol (|), one of the
subexpressions will always be empty. So, here are the
different possible results when matching different
strings:</div>
<div><br>
</div>
<div><font face="monospace, monospace"> "AXY" Subexpr 1:
"X", Subexpr 2: ""</font></div>
<div><font face="monospace, monospace"> "BZA" Subexpr 1:
"", Subexpr 2: "Z"</font></div>
<div><font face="monospace, monospace"> "CXY" <i>No
match</i></font></div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
Not sure if that should be so, but I am not too familiar with <b>libpcre2</b>
either. I would naively<br>
expect that an RE of the form <b>A|B</b> would either return a
match for <b>A</b> or a match for <b>B</b> but<br>
not both. <b>man pcre2pattern</b> says:<br>
<br>
<i> Vertical bar characters are used to separate alternative
patterns. For</i><i><br>
</i><i> example, the pattern</i><i><br>
</i><i><br>
</i><i> gilbert|sullivan</i><i><br>
</i><i><br>
</i><i> matches either "gilbert" or "sullivan". Any number of
alternatives may</i><i><br>
</i><i> appear, and an empty alternative is permitted
(matching the empty</i><i><br>
</i><i> string). The matching process tries each alternative
in turn, from left</i><i><br>
</i><i> to right, and the first one that succeeds is used.</i><br>
<br>
My understanding of this is that, for example, B is ignored if A
matches. That implies that<br>
the matching of <b>B</b> is not even attempted, so "" (for no match)
would be misleading, because<br>
B might have matched as well.<br>
<br>
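As a quick cross-check, here is what Python does with the same pattern.
This is only an analogy: Python's re module is not libpcre2, and the
helper name groups_for is made up for this example. In Python the group
on the alternative that was not taken is reported as unset (None) rather
than as an empty string, and as far as I can tell libpcre2 has a similar
notion (PCRE2_UNSET offsets in the output vector):<br>
<pre>
```python
import re

pattern = re.compile(r"A(.)|B(.)")


def groups_for(subject):
    """Return the tuple of subexpression values, or None if nothing matched."""
    m = pattern.search(subject)
    if m is None:
        return None
    return m.groups()


print(groups_for("AXY"))  # ('X', None)
print(groups_for("BZA"))  # (None, 'Z')
print(groups_for("CXY"))  # None

# an unset group also has no position in the subject string
print(pattern.search("AXY").span(2))  # (-1, -1)
```
</pre>
So at least in Python "did not participate" and "matched the empty
string" are distinguishable, which would matter for how ⎕RE should
represent such a group.<br>
<br>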
<blockquote type="cite"
cite="mid:CADtN0WJGxBjNCMSu=Sxu8w=HKfPi3xp-***@mail.gmail.com">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>(with the current implementation, there is no way I can
differentiate between cases 1 and 2, which shows that the
current implementation is not working correctly)</div>
<div><br>
</div>
<div>As you can see from this example, I can look at the
content of subexpressions 1 and 2 to determine which of
the alternatives was matched.</div>
<div><br>
</div>
<div>If I really want to see the whole match as well, I can
force this by adding a third subexpression (which will be
number 1 since its opening parenthesis comes first):</div>
<div><br>
</div>
<div><font face="monospace, monospace"> (A(.)|B(.))</font></div>
<div><br>
</div>
<div>Here, the result will also contain the full match:</div>
<div><br>
</div>
<div>
<div><font face="monospace, monospace"> "AXY" Subexpr
1: "AX", Subexpr 2: "X", Subexpr 3: ""</font></div>
<div><font face="monospace, monospace"> "BZA" Subexpr
1: "BZ", Subexpr 2: "", Subexpr 3: "Z"</font></div>
<div><font face="monospace, monospace"> "CXY" <i>No
match</i></font></div>
</div>
<div><br>
</div>
<div>I hope this helps explain why my design was the way it
was. There is an argument that the no-subexpression case
should not return the full match but rather a boolean
value simply indicating whether a match was found or not.
In that case the old behaviour can still be achieved by
wrapping the entire regexp in a set of parentheses as
shown above. However, I think a flag to achieve this would
be more clear.</div>
<div><br>
</div>
<div>Regards,</div>
<div>Elias</div>
</div>
</div>
</div>
</blockquote>
<br>
</body>
</html>