Discussion:
[Bug-apl] Suggestion for Quad-RE
Christian Robert
2017-10-11 03:12:30 UTC
Permalink
Sometimes we only want to know if it match or not.

I suggest a new flag ['m'] (as match) that will return ...

for a string: either 0 or 1 as a scalar for "not matching" or "matching"
for an array of strings: a vector of 0/1 for each string saying like above.


lets say:

z←⎕fio[49] '/var/log/messages' // beware that this file is inaccessible by default unless being "root" on linux
// or you chmod a+r /var/log/messages # as root

who may return 50,000 lines or even 2 millions, on an average of say ~120 characters each.


I would hope to be able to use a flag as ['m']:

'Started|Stopped' ⎕RE['m'] z

who will return an array of (0/1) telling which lines match or not the pattern, so I can
only retain those matching for further fine tuning (via diadic operator "/").

It will be a LOT faster than letting ⎕RE returning the whole result of pcre2 INTO the physical Gnu-APL memory engine
creating a lot of integers arrays for no real purpose, ie: seen from the application.

comments welcome,

my usual 2 cents,
Xtian.
Elias Mårtenson
2017-10-11 05:14:16 UTC
Permalink
I think you have a point. It would be very useful to be able to have ⎕RE
filter the results for you.

In experimenting with your specific case, I came across another use-case
that might warrant another flag: One that does not return the full match,
but only the parenthesised subexpressions (this used to be the default in
my initial draft version). Now I have to use 1↓ to remove this.

Here is my somewhat realistic test case that takes the log file, and
extracts the date and the name of the service that was started or stopped:

* file ← ⎕FIO[49] "/some/file/name"*

* x ← "^([a-zA-Z]{3} [0-9]+ [0-9]{2}:[0-9]{2}:[0-9]{2}).*:
(Started|Stopped) (.*)$" ⎕RE file*
* ⍎ x*
┏→━━━━┓
┃69339┃
┗━━━━━┛
* result ← ⊃ 1↓¹ ({⍬≢⍵}šx) / x*
* ⍎ result*
┏→━━━━━┓
┃7269 3┃
┗━━━━━━┛

This is a lot more complicated than it needs to be. The two new flags
mentioned would completely remove the last line and replace it with a
simple pair of ⎕RE["XY"] flags.

Regards,
Elias
Post by Christian Robert
Sometimes we only want to know if it match or not.
I suggest a new flag ['m'] (as match) that will return ...
for a string: either 0 or 1 as a scalar for "not matching" or "matching"
for an array of strings: a vector of 0/1 for each string saying like above.
z←⎕fio[49] '/var/log/messages' // beware that this file is
inaccessible by default unless being "root" on linux
// or you chmod a+r
/var/log/messages # as root
who may return 50,000 lines or even 2 millions, on an average of say ~120 characters each.
'Started|Stopped' ⎕RE['m'] z
who will return an array of (0/1) telling which lines match or not the pattern, so I can
only retain those matching for further fine tuning (via diadic operator "/").
It will be a LOT faster than letting ⎕RE returning the whole result of
pcre2 INTO the physical Gnu-APL memory engine
creating a lot of integers arrays for no real purpose, ie: seen from the application.
comments welcome,
my usual 2 cents,
Xtian.
Juergen Sauermann
2017-10-11 13:15:45 UTC
Permalink
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<font face="Helvetica, Arial, sans-serif">Hi Elias,<br>
<br>
I understand your case but I am afraid that 1↓ is the wrong
approach in general. It happens to work in<br>
special cases (read: 1 level of sub-expressions) but not in
general.<br>
<br>
If I understand <b>libpcre</b> correctly (and I propably don't)
then a general regular expression RE is a tree whose<br>
structure is determined by the nesting of the parentheses in RE,
and the result of a match follows the tree structure.<br>
If the tree has only a single level (so it has a root and only one
level of children of the root) then you can (mis-)interpret<br>
the children as an array and dropping the root (by means of 1↓ )
happens (by chance) to return the children.<br>
But it remains incorrect because the functions to access tree-like
nested values should be ⌷, ⊃, and ⊂ and not<br>
↑, ↓, or <font face="Courier New, Courier, monospace">[].</font><br>
<br>
Best Regards,<br>
/// Jürgen<br>
<br>
</font><br>
<div class="moz-cite-prefix">On 10/11/2017 07:14 AM, Elias Mårtenson
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CADtN0W+U+SGH3u54vgV=***@mail.gmail.com">
<div dir="ltr">I think you have a point. It would be very useful
to be able to have ⎕RE filter the results for you.
<div><br>
</div>
<div>In experimenting with your specific case, I came across
another use-case that might warrant another flag: One that
does not return the full match, but only the parenthesised
subexpressions (this used to be the default in my initial
draft version). Now I have to use 1↓ to remove this.</div>
<div><br>
</div>
<div>Here is my somewhat realistic test case that takes the log
file, and extracts the date and the name of the service that
was started or stopped:</div>
<div><font face="monospace, monospace"><br>
</font></div>
<div><font face="monospace, monospace"><b>      file ← ⎕FIO[49]
"/some/file/name"</b></font></div>
<div><font face="monospace, monospace"><b>      x ←
"^([a-zA-Z]{3} [0-9]+ [0-9]{2}:[0-9]{2}:[0-9]{2}).*:
(Started|Stopped) (.*)$" ⎕RE file<br>
</b></font></div>
<div>
<div><font face="monospace, monospace">
<div><b>      ⍴ x</b></div>
<div>┏→━━━━┓</div>
<div>┃69339┃</div>
<div>┗━━━━━┛</div>
</font></div>
<div><font face="monospace, monospace"><b>      result ← ⊃ 1↓¨
({⍬≢⍵}¨x) / x</b></font></div>
<div><font face="monospace, monospace"><b>      ⍴ result</b></font></div>
<div><font face="monospace, monospace">┏→━━━━━┓</font></div>
<div><font face="monospace, monospace">┃7269 3┃</font></div>
<div><font face="monospace, monospace">┗━━━━━━┛</font></div>
</div>
<div><br>
</div>
<div>This is a lot more complicated than it needs to be. The two
new flags mentioned would completely remove the last line and
replace it with a simple pair of <font face="monospace,
monospace">⎕RE["XY"]</font> flags.</div>
<div><br>
</div>
<div>Regards,</div>
<div>Elias</div>
</div>
<div class="gmail_extra"><br>
<div class="gmail_quote">On 11 October 2017 at 11:12, Christian
Robert <span dir="ltr">&lt;<a
href="mailto:***@polymtl.ca" target="_blank"
moz-do-not-send="true">***@polymtl.ca</a>&gt;</span>
wrote:<br>
<blockquote class="gmail_quote" style="margin:0 0 0
.8ex;border-left:1px #ccc solid;padding-left:1ex">Sometimes
we only want to know if it match or not.<br>
<br>
I suggest a new flag ['m']  (as match) that will return ...<br>
<br>
  for a string:  either 0 or 1 as a scalar for "not
matching" or "matching"<br>
  for an array of strings: a vector of 0/1 for each string
saying like above.<br>
<br>
<br>
lets say:<br>
<br>
      z←⎕fio[49] '/var/log/messages'  // beware that this
file is inaccessible by default unless being "root" on linux<br>
                                      // or you chmod a+r
/var/log/messages  # as root<br>
<br>
who may return 50,000 lines or even 2 millions, on an
average of say ~120 characters each.<br>
<br>
<br>
I would hope to be able to use a flag as ['m']:<br>
<br>
     'Started|Stopped' ⎕RE['m'] z<br>
<br>
who will return an array of (0/1) telling which lines match
or not the pattern, so I can<br>
only retain those matching for further fine tuning (via
diadic operator "/").<br>
<br>
It will be a LOT faster than letting ⎕RE returning the whole
result of pcre2 INTO the physical Gnu-APL memory engine<br>
creating a lot of integers arrays for no real purpose, ie:
seen from the application.<br>
<br>
comments welcome,<br>
<br>
my usual 2 cents,<br>
Xtian.<br>
<br>
</blockquote>
</div>
<br>
</div>
</blockquote>
<br>
</body>
</html>
Elias Mårtenson
2017-10-12 07:13:10 UTC
Permalink
On 11 October 2017 at 21:15, Juergen Sauermann <
If I understand *libpcre* correctly (and I propably don't) then a general
regular expression RE is a tree whose
structure is determined by the nesting of the parentheses in RE, and the
result of a match follows the tree structure.
Actually, this is not the case. When you have subexpressions, what you have
is simply a list of them, and each subexpression has a value. Whether or
not these subexpressions are nested does not matter. Its position is purely
dictated by the index of the opening parentheses.

When you use subexpressions, it means that I am interested in specific
parts of the matched string. If I am interested in a specific part of a
string, it is very unlikely that I want to know the content of the entire
match. But, if I do, I can always retrieve that using another set of parens
that surrounds the entire regexp.

When you don't have any subexpressions, it's most likely that I am not
interested in the matched string at all, but rather just a boolean result
telling me if I have a match at all.

The boolean case is simple, so the only aspect of this that warrants any
discussion is how that should be achieved. My opinion is that it should be
the default, but a flag can also be used.

For subexpressions, I think a few examples will help explain how they are
used:

Let's assume the following regexp:

A(.)|B(.)

This regexp has two subexpressions, and the result with therefore have two
values. Due to the fact that they are separated by the alternation symbol
(|), one of the subexpressions will always be empty. So, here are the
different possible results when matching different strings:

"AXY" Subexpr 1: "X", Subexpr 2: ""
"BZA" Subexpr 1: "", Subexpr 2: "Z"
"CXY" *No match*

(with the current implementation, there is no way I can differentiate
between cases 1 and 2, which shows that the current implementation is not
working correctly)

As you can see from this example, I can look at the content of
subexpressions 1 and 2 to determine which of the alternatives was matched.

If I really want to see the whole match as well, I can force this by adding
a third subexpression (which will be number 1 since its opening parenthesis
comes first):

(A(.)|B(.))

Here, the result will also contain the full match:

"AXY" Subexpr 1: "AX", Subexpr 2: "X", Subexpr 3: ""
"BZA" Subexpr 1: "BZ", Subexpr 2: "", Subexpr 3: "Z"
"CXY" *No match*

I hope this helps explain why my design was the way it was. There is an
argument that the no-subexpression case should not return the full match
but rather a boolean value simply indicating whether a match was found or
not. In that case the old behaviour can still be achieved by wrapping the
entire regexp in a set of parentheses as shown above. However, I think a
flag to achieve this would be more clear.

Regards,
Elias
Juergen Sauermann
2017-10-12 12:55:57 UTC
Permalink
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<font face="Helvetica, Arial, sans-serif">Hi Elias,<br>
<br>
see below.<br>
<br>
/// Jürgen<br>
<br>
</font><br>
<div class="moz-cite-prefix">On 10/12/2017 09:13 AM, Elias Mårtenson
wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CADtN0WJGxBjNCMSu=Sxu8w=HKfPi3xp-***@mail.gmail.com">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">On 11 October 2017 at 21:15, Juergen
Sauermann <span dir="ltr">&lt;<a
href="mailto:***@t-online.de"
target="_blank" moz-do-not-send="true">***@t-online.de</a>&gt;</span>
wrote:<br>
<div> </div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div bgcolor="#FFFFFF"><font face="Helvetica, Arial,
sans-serif">If I understand <b>libpcre2</b> correctly
(and I probably don't) then a general regular
expression RE is a tree whose<br>
structure is determined by the nesting of the
parentheses in RE, and the result of a match follows
the tree structure.<br>
</font></div>
</blockquote>
<div><br>
</div>
<div>Actually, this is not the case. When you have
subexpressions, what you have is simply a list of them,
and each subexpression has a value. Whether or not these
subexpressions are nested does not matter. Its position is
purely dictated by the index of the opening parentheses.</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
Not exactly. It is true that libpcre returns a list of matches in
terms of the position of each<br>
match in the subject string B. However any two matches are either
disjoint or one match is<br>
contained in the other. This containment relation defines a partial
order between the<br>
matches which is most conveniently described by a tree. In that tree
one RE, say RE1 is a<br>
child of another RE RE2 if the substring of B corresponding to RE2
is contained in the<br>
substring of B that corresponds to RE2.<br>
<br>
The question is then: shall <b>⎕RE</b> simply return the array of
matches (which was what your<br>
implementation did) or shall <b>⎕RE</b> return the matches as a
tree? This is the same question<br>
as shall the tree be represented as a simple vector of nodes
(corresponding to an APL<br>
vector of some kind) or shall it be represented as a recursive
node-properties + children structure (corresponding to a nested APL
value)?<br>
<br>
The vector of nodes and the nested APL value are both equivalent in
describing the<br>
tree. However, converting the nested tree structure to a vector of
nodes is much simpler<br>
(in APL) than the other way around because converting a node vector
to the tree involves<br>
a lot of comparisons which are quite lightweight but extremely ugly
in APL. That was why<br>
decided to return the tree and not the vector of nodes.<br>
<br>
Now, to have an option that drops the first element means to have an
option that returns<br>
the nodes of the result tree except its root node. Although
technically possible, this sounds<br>
very arbitrary to me. It may suit a particular use case, but it do
not, IMHO, deserve a<br>
special flag. I could also create a use case where it makes sense
that only every second<br>
node of the tree is returned, for example when matching some
name=value pairs where<br>
I am only interested in the values and not the names.<br>
<br>
I am not entirely against a flag that goes into that direction, but
I believe that flag should<br>
determine if either the tree is returned (default) or the node
vector of the of the tree if<br>
the flag is given. Unfortunately that flag, even though it is far
more consistent with the<br>
structure of the ⎕RE result than 1↓, does not solve your 1↓ because
it would still contain<br>
the top-level match (= the root of the tree).<br>
<br>
<blockquote type="cite"
cite="mid:CADtN0WJGxBjNCMSu=Sxu8w=HKfPi3xp-***@mail.gmail.com">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>When you use subexpressions, it means that I am
interested in specific parts of the matched string. If I
am interested in a specific part of a string, it is very
unlikely that I want to know the content of the entire
match. But, if I do, I can always retrieve that using
another set of parens that surrounds the entire regexp.<br>
</div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
Not necessarily. It could also be a boundary condition of your match
that you<br>
only want to be satisfied no matter how. REs like  <b>[A-Z][a-z][0-9]</b>
are often used that way.<br>
<blockquote type="cite"
cite="mid:CADtN0WJGxBjNCMSu=Sxu8w=HKfPi3xp-***@mail.gmail.com">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>When you don't have any subexpressions, it's most
likely that I am not interested in the matched string at
all, but rather just a boolean result telling me if I have
a match at all.</div>
<div><br>
</div>
<div>The boolean case is simple, so the only aspect of this
that warrants any discussion is how that should be
achieved. My opinion is that it should be the default, but
a flag can also be used.</div>
<div><br>
</div>
<div>For subexpressions, I think a few examples will help
explain how they are used:</div>
<div><br>
</div>
<div>Let's assume the following regexp:</div>
<div><br>
</div>
<div><font face="monospace, monospace">    A(.)|B(.)</font></div>
<div><br>
</div>
<div>This regexp has  two subexpressions, and the result
with therefore have two values. Due to the fact that they
are separated by the alternation symbol (|), one of the
subexpressions will always be empty. So, here are the
different possible results when matching different
strings:</div>
<div><br>
</div>
<div><font face="monospace, monospace">    "AXY"  Subexpr 1:
"X", Subexpr 2: ""</font></div>
<div><font face="monospace, monospace">    "BZA"  Subexpr 1:
"",  Subexpr 2: "Z"</font></div>
<div><font face="monospace, monospace">    "CXY"  <i>No
match</i></font></div>
<div><br>
</div>
</div>
</div>
</div>
</blockquote>
Not sure if that should be so but i am not too familiar with <b>libpre2</b>
either. I would naively<br>
expect that an RE of the form <b>A|B</b> would either return a
match for <b>A</b> or a match for <b>B</b> but<br>
not both. <b>man pcre2pattern</b> says:<br>
<br>
<i>       Vertical bar characters are used to separate alternative
patterns.  For</i><i><br>
</i><i>       example, the pattern</i><i><br>
</i><i><br>
</i><i>         gilbert|sullivan</i><i><br>
</i><i><br>
</i><i>       matches  either "gilbert" or "sullivan". Any number of
alternatives may</i><i><br>
</i><i>       appear, and an empty  alternative  is  permitted 
(matching  the  empty</i><i><br>
</i><i>       string). The matching process tries each alternative
in turn, from left</i><i><br>
</i><i>       to right, and the first one that succeeds is used.</i><br>
<br>
My understanding of this is that, for example, B is ignored if A
matches. That implies that<br>
the matching of <b>B</b> is not even performed so "" (for no match)
would be incorrect because<br>
B could also match as well.<br>
<br>
<blockquote type="cite"
cite="mid:CADtN0WJGxBjNCMSu=Sxu8w=HKfPi3xp-***@mail.gmail.com">
<div dir="ltr">
<div class="gmail_extra">
<div class="gmail_quote">
<div>(with the current implementation, there is no way I can
differentiate between cases 1 and 2, which shows that the
current implementation is not working correctly)</div>
<div><br>
</div>
<div>As you can see from this example, I can look at the
content of subexpressions 1 and 2 to determine which of
the alternatives was matched.</div>
<div><br>
</div>
<div>If I really want to see the whole match as well, I can
force this by adding a third subexpression (which will be
number 1 since its opening parenthesis comes first):</div>
<div><br>
</div>
<div><font face="monospace, monospace">    (A(.)|B(.))</font></div>
<div><br>
</div>
<div>Here, the result will also contain the full match:</div>
<div><br>
</div>
<div>
<div><font face="monospace, monospace">    "AXY"  Subexpr
1: "AX", Subexpr 2: "X", Subexpr 3: ""</font></div>
<div><font face="monospace, monospace">    "BZA"  Subexpr
1: "BZ", Subexpr 2: "",  Subexpr 3: "Z"</font></div>
<div><font face="monospace, monospace">    "CXY"  <i>No
match</i></font></div>
</div>
<div><br>
</div>
<div>I hope this helps explain why my design was the way it
was. There is an argument that the no-subexpression case
should not return the full match but rather a boolean
value simply indicating whether a match was found or not.
In that case the old behaviour can still be achieved by
wrapping the entire regexp in a set of parentheses as
shown above. However, I think a flag to achieve this would
be more clear.</div>
<div><br>
</div>
<div>Regards,</div>
<div>Elias</div>
</div>
</div>
</div>
</blockquote>
<br>
</body>
</html>
Elias Mårtenson
2017-10-13 08:46:36 UTC
Permalink
On 12 October 2017 at 20:55, Juergen Sauermann <
Post by Juergen Sauermann
Hi Elias,
see below.
/// JÃŒrgen
On 11 October 2017 at 21:15, Juergen Sauermann <
If I understand *libpcre2* correctly (and I probably don't) then a
general regular expression RE is a tree whose
structure is determined by the nesting of the parentheses in RE, and the
result of a match follows the tree structure.
Actually, this is not the case. When you have subexpressions, what you
have is simply a list of them, and each subexpression has a value. Whether
or not these subexpressions are nested does not matter. Its position is
purely dictated by the index of the opening parentheses.
Not exactly. It is true that libpcre returns a list of matches in terms of
the position of each
match in the subject string B. However any two matches are either disjoint
or one match is
contained in the other. This containment relation defines a partial order
between the
matches which is most conveniently described by a tree. In that tree one
RE, say RE1 is a
child of another RE RE2 if the substring of B corresponding to RE2 is
contained in the
substring of B that corresponds to RE2.
Wow, this mail became longer than I had expected. Just to be clear: I am
mostly talking about the case where strings are returned. The cases where
indexes or ⊂-data is returned are different, and what I'm saying below does
not necessarily apply entirely.

I will henceforth use the word "group" instead of "subexpression". The
formal name is “capturing group”.

The fact that parenthesised groups can nest is orthogonal to the behaviour
of the groups themselves.

The groups are referenced by index or by name (there is a special pattern
syntax to give a group a name), and the API then provides functions to
access the values of these groups. You can extract the values of a group
either by using the name or by the index.

Notice that you can even refer to a capturing group using backreferences
within the pattern itself. This should help prove how simple they are:

In a pattern the sequence \1 can be used to refer back to the result of the
first group, \2 to the second, etc. Thus (and you can test this with PCRE
yourself):

*x(.(.))-\1* matches: *xab-ab*, xwe-we, etc...
*x(.(.))-\2* matches: xab-b, xwe-e, etc...

Every single Regexp API I have used (including languages such as C, Perl,
Java, Javascript, Elixir, etc) returns the groups in the exact same way as
I have explained. You use groups to choose what parts of the text you're
interested in, and refer to them by index.

Speaking of Elixir, this is the set of group-related flags that their API
supports (something tells me they also use PCRE behind the scenes):

-

:all - all captured subpatterns including the complete matching string
(this is the default)
-

:first - only the first captured subpattern, which is always the
complete matching part of the string; all explicitly captured subpatterns
are discarded
-

:all_but_first- all but the first matching subpattern, i.e. all
explicitly captured subpatterns, but not the complete matching part of the
string
-

:none - does not return matching subpatterns at all
-

:all_names - captures all names in the Regex
-

list(binary) - a list of named captures to capture

I think this is an excellent approach, and definitely something to emulate.

The question is then: shall *⎕RE* simply return the array of matches (which
Post by Juergen Sauermann
was what your
implementation did) or shall *⎕RE* return the matches as a tree? This is
the same question
as shall the tree be represented as a simple vector of nodes
(corresponding to an APL
vector of some kind) or shall it be represented as a recursive
node-properties + children structure (corresponding to a nested APL value)?
My argument here is one of pragmatism:

1. Regexp implementations in all known languages returns an indexed list
of groups
2. A simple list is simply the most useful. I struggle to think of any
case where returning a nested structure is in any way useful.

The vector of nodes and the nested APL value are both equivalent in
Post by Juergen Sauermann
describing the
tree. However, converting the nested tree structure to a vector of nodes
is much simpler
(in APL) than the other way around because converting a node vector to the
tree involves
a lot of comparisons which are quite lightweight but extremely ugly in
APL. That was why
decided to return the tree and not the vector of nodes.
You are indeed correct that converting it into a tree is more difficult
than the other way around. But then again, why should we go out of our ways
to make an incredibly rare use-case somewhat easier when the common
use-case becomes annoyingly complicated?

Can you think of a case where returning a nested structure is useful?
Post by Juergen Sauermann
I am not entirely against a flag that goes into that direction, but I
believe that flag should
determine if either the tree is returned (default) or the node vector of
the of the tree if
the flag is given. Unfortunately that flag, even though it is far more
consistent with the
structure of the ⎕RE result than 1↓, does not solve your 1↓ because it
would still contain
the top-level match (= the root of the tree).
I'm still curious as to why you are interested in even having the option of
returning a tree. I just don't see any need for it, and no other language
that I have ever used have done anything like it.

Do you have a use-case in mind?
Post by Juergen Sauermann
When you use subexpressions, it means that I am interested in specific
parts of the matched string. If I am interested in a specific part of a
string, it is very unlikely that I want to know the content of the entire
match. But, if I do, I can always retrieve that using another set of parens
that surrounds the entire regexp.
Not necessarily. It could also be a boundary condition of your match that
you
only want to be satisfied no matter how. REs like *[A-Z][a-z][0-9]* are
often used that way.
You can easily change that to *(*[A-Z][a-z][0-9]*)* get the entire match in
this case.
Post by Juergen Sauermann
When you don't have any subexpressions, it's most likely that I am not
interested in the matched string at all, but rather just a boolean result
telling me if I have a match at all.
The boolean case is simple, so the only aspect of this that warrants any
discussion is how that should be achieved. My opinion is that it should be
the default, but a flag can also be used.
For subexpressions, I think a few examples will help explain how they are
A(.)|B(.)
This regexp has two subexpressions, and the result with therefore have
two values. Due to the fact that they are separated by the alternation
symbol (|), one of the subexpressions will always be empty. So, here are
"AXY" Subexpr 1: "X", Subexpr 2: ""
"BZA" Subexpr 1: "", Subexpr 2: "Z"
"CXY" *No match*
Not sure if that should be so but i am not too familiar with *libpre2*
either. I would naively
expect that an RE of the form *A|B* would either return a match for *A*
or a match for *B* but
The pattern always has a fixed number of groups that is defined by the
pattern itself. In fact, PCRE has an API call that allows you to extract
the number of groups, and this number is always constant for any given
pattern (the call is pcre2_get_ovector_count(), and it should be noted that
this is the number of substrings that should be returned. The return value
from pcre2_match() may be lower than this number, at which it should be
assumed that the remaining captured groups are empty. My old code didn't
handle this case correctly).
Post by Juergen Sauermann
* Vertical bar characters are used to separate alternative
patterns. For*
* example, the pattern*
* gilbert|sullivan*
* matches either "gilbert" or "sullivan". Any number of
alternatives may*
* appear, and an empty alternative is permitted (matching the
empty*
* string). The matching process tries each alternative in turn, from
left*
* to right, and the first one that succeeds is used.*
My understanding of this is that, for example, B is ignored if A matches.
That implies that
the matching of *B* is not even performed so "" (for no match) would be
incorrect because
B could also match as well.
This is correct. But it has no relation to capturing groups. If we
generalise your example a bit and add some groups, we get the following:

(gil....)|(sul......)

In this case, if you try to match against "gilbert", the second group still
exists, but is empty. The same goes for the opposite case. Let's take three
cases and look at what PCRE returns in this case:

*foo* - As there is no match, pcre2_match() will return -1.
*gilbert* - pcre2_match() will return 2, since it never looked at the
second alternative
*sullivan* - pcre2_match() will return 3, since it had to look at both sides

In all of the cases where a match happened, pcre2_get_ovector_count() will
return 3, since there is one full match + 2 capture groups.

Note that pcre returns unused capture groups with both start and end
indexes being -1. This can be used to determine which side of an alternate
actually matched. How to actually treat this in the case where strings
(rather than indexes) are being returned in an open question. Returning ⍬
is an option, as is the empty string.

Regards,
Elias
Juergen Sauermann
2017-10-13 13:24:06 UTC
Permalink
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
Hi Elias,<br>
<br>
I believe we consider the <b>⎕RE </b>from two points of view. From
a language or function<br>
designer's point of view your focus is less on specific use cases
but more on how well the function fits into the rest of the
language. From a function user's point of view you care<br>
more about use cases and the simplicity of use.<br>
<br>
Before we dive too deeply into technical details let me ask you two
questions:<br>
<br>
1. would you agree that the result that is returned by <b>pcre2_match()</b>
is a tree, regardless of<br>
 how that tree is represented in the API of <b>libpcre2</b>?<br>
<br>
2. Suppose you have to choose between two APL libraries <b>libA</b>
and <b>libB</b>.<br>
<br>
 <b>libA</b> provides a single function <font face="Courier New,
Courier, monospace"><b>V←foo B</b></font> that solves some problem
in a generic way (so that<br>
all use cases can be covered) but the result returned by <b>foo</b>
may need to be adopted to<br>
different use cases by means of simple APL operations such as <b>1↓foo
B</b>, <b>N⍴foo B</b>, etc.<br>
<br>
<b>libB</b> does not provide the generic <b>foo</b> of <b>libA</b>
but instead a number of different functions<br>
<b>foo1</b> returning<b> 1↓foo B</b>, function <b>foo2</b>
returning <b>N⍴foo B</b>, etc. Some rare use cases of<br>
<b>foo</b> in <b>libA</b> that are not covered by one of the <b>fooN</b>
functions in <b>libB</b>, but your project<br>
does not need them.<br>
<br>
Given these two libraries, you have to irrevocably decide between
using <b>libA</b> or <b>libB</b> in<br>
your project. What would be your decision?<br>
<br>
/// Jürgen<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
</body>
</html>

Juergen Sauermann
2017-10-11 12:40:20 UTC
Permalink
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body text="#000000" bgcolor="#FFFFFF">
<font face="Helvetica, Arial, sans-serif">Hi Xtian,<br>
<br>
in general we should only add flags if they are frequently used
and if there is no appropriate solution in APL.<br>
What we should avoid, IMHO, is redundancy by introducing new
functionality inside ⎕RE that<br>
only increases the internal complexity for the sake of saving a
small number of characters at APL<br>
level. I believe that for the average APL programmer it is easier
to create or decode a slightly more complex<br>
APL expression that to remember a zoo of flags for different use
cases.<br>
<br>
I truly believe that GNU APL should aim at being minimalistic, but
not in terms of the number of APL characters<br>
typed, but in terms of concepts deployed.<br>
<br>
Coming back to your example, there is a difference between a RE
match and a line containing some RE match.<br>
If you rewrite the RE in your example (using a fixed string z
instead of a long /var/log/messages):<br>
<br>
<font face="Courier New, Courier, monospace">      z←'foo' 'bar'
'Started at 11:22' 'something' 'else' '*** Stopped  12:33'<br>
      4 ⎕CR ⊃((⊂⍬)≢¨'.*Started.*|.*Stopped.*' ⎕RE[''] z)/z<br>
┏→━━━━━━━━━━━━━━━━━┓<br>
↓Started at 11:22  ┃<br>
┃*** Stopped  12:33┃<br>
┗━━━━━━━━━━━━━━━━━━┛<b><br>
</b></font><br>
then you get the matching lines without returning large integer
vectors from <b>⎕</b><b>RE</b>.<br>
<br>
An even faster approach would be to somehow use <b><font
face="Courier New, Courier, monospace">⎕FIO[26]<font
face="Helvetica, Arial, sans-serif"> </font></font></b>instead
of </font><font face="Helvetica, Arial, sans-serif"><font
face="Helvetica, Arial, sans-serif"><b><font face="Courier New,
Courier, monospace">⎕FIO[49]</font></b> and do the line
splitting<br>
inside <b>⎕RE</b> rather than inside <b>⎕FIO</b>.<br>
<b><font face="Courier New, Courier, monospace"><font
face="Helvetica, Arial, sans-serif"> <br>
</font></font></b></font>I was actually thinking of an
output flag that can be used directly with <b>⌷</b> (not <b>⎕</b>!)
or <font face="Courier New, Courier, monospace">[]</font></font>
<font face="Helvetica, Arial, sans-serif">for cases where the result<br>
of a match is fairly sparse.<br>
<br>
Regarding performance, I would argue that you cannot gain a lot by
optimizing the output format because you<br>
always have to process the matched string and the production of
large APL values (as opposed to many small<br>
ones) is fairly efficient. And the possible overhead of additional
APL functions does not count if the results are large.<br>
<br>
Best Regards,<br>
/// Jürgen<br>
<br>
</font><br>
<div class="moz-cite-prefix">On 10/11/2017 05:12 AM, Christian
Robert wrote:<br>
</div>
<blockquote type="cite"
cite="mid:9cfd56b7-1905-7d00-8985-***@polymtl.ca">Sometimes
we only want to know if it match or not.
<br>
<br>
I suggest a new flag ['m']  (as match) that will return ...
<br>
<br>
  for a string:  either 0 or 1 as a scalar for "not matching" or
"matching"
<br>
  for an array of strings: a vector of 0/1 for each string saying
like above.
<br>
<br>
<br>
lets say:
<br>
<br>
      z←⎕fio[49] '/var/log/messages'  // beware that this file is
inaccessible by default unless being "root" on linux
<br>
                                      // or you chmod a+r
/var/log/messages  # as root
<br>
<br>
who may return 50,000 lines or even 2 millions, on an average of
say ~120 characters each.
<br>
<br>
<br>
I would hope to be able to use a flag as ['m']:
<br>
<br>
     'Started|Stopped' ⎕RE['m'] z
<br>
<br>
who will return an array of (0/1) telling which lines match or not
the pattern, so I can
<br>
only retain those matching for further fine tuning (via diadic
operator "/").
<br>
<br>
It will be a LOT faster than letting ⎕RE returning the whole
result of pcre2 INTO the physical Gnu-APL memory engine
<br>
creating a lot of integers arrays for no real purpose, ie: seen
from the application.
<br>
<br>
comments welcome,
<br>
<br>
my usual 2 cents,
<br>
Xtian.
<br>
<br>
<br>
</blockquote>
<br>
</body>
</html>
Leslie S Satenstein
2017-10-12 15:43:13 UTC
Permalink
What about using the shared variable processor capability for your requirement.
 Regards
 Leslie
Leslie Satenstein
Montréal Québec, Canada



From: Christian Robert <***@polymtl.ca>
To: "bug-***@gnu.org" <bug-***@gnu.org>
Sent: Tuesday, October 10, 2017 11:13 PM
Subject: [Bug-apl] Suggestion for Quad-RE

Sometimes we only want to know if it match or not.

I suggest a new flag ['m']  (as match) that will return ...

  for a string:  either 0 or 1 as a scalar for "not matching" or "matching"
  for an array of strings: a vector of 0/1 for each string saying like above.


lets say:

      z←⎕fio[49] '/var/log/messages'  // beware that this file is inaccessible by default unless being "root" on linux
                                      // or you chmod a+r /var/log/messages  # as root

who may return 50,000 lines or even 2 millions, on an average of say ~120 characters each.


I would hope to be able to use a flag as ['m']:

      'Started|Stopped' ⎕RE['m'] z

who will return an array of (0/1) telling which lines match or not the pattern, so I can
only retain those matching for further fine tuning (via diadic operator "/").

It will be a LOT faster than letting ⎕RE returning the whole result of pcre2 INTO the physical Gnu-APL memory engine
creating a lot of integers arrays for no real purpose, ie: seen from the application.

comments welcome,

my usual 2 cents,
Xtian.
Loading...