Sublime Forum

Oniguruma syntax definition help

#1

Hi guys. I have a bit of a complicated one here. I really wish ST2 used the same regex for syntax files as its search so I could test my regex more easily than writing to the syntax file each time (if anybody knows a better way to test syntax definition matches please let me know).

Anyways… I’m trying to make a slightly complicated match here. In Oniguruma with JSON escaping it is (I believe): [code][/code] although that’s not working so well for me, so I could be wrong. Using the regex search, I can achieve what I want with this: \{(?:[^{}]*+|(?0))*\} but I can’t seem to get a similar result in my syntax definition file. Right now it’s only matching empty brackets: {}

While I have your attention, I should probably mention my end-goal to see if there are any alternate solutions I could explore that any of you can think of. My goal is to capture all the text between the first ‘;’ that is not between brackets and the first “:Code” following it. Essentially, detecting orphaned code which the language I’m writing this file for will interpret as a comment. I can’t simply match between the ‘;’ and ‘:Code’ because that might either match the ‘;’ in between {}'s:

@foo{bar,blah;};graaa ; asdfl
dsfsh;
:Code

or the last ‘;’:

@foo{bar,blah;};graaa ; asdfl
dsfsh;
:Code

when what I want to match is:

@foo{bar,blah;};graaa ; asdfl
dsfsh;
:Code

My current strategy has been to attempt to match the outermost braces:

@foo{bar,blah;};graaa ; asdfl
dsfsh;
:Code

And then to match from the first ‘;’ which would achieve what I want.

0 Likes

#2

I’m not familiar with Oniguruma but you can try using a negative lookahead that will prevent matching an ending curly bracket

begin: ;(?!})
end: (?i::code)

Edit: Also check out gskinner.com/RegExr/ it’s great for testing
Edit2: I guess I am familiar with Oniguruma

0 Likes

#3

The problem is there could be an ending bracket that I would want to match. This is the match I would want to make in the following:

@foo{asldfkh}; @bar{}

Because after the ‘;’ the @bar{} becomes orphaned code. Using negative lookahead I would not match that ‘;’. I’ve been trying a lot. Am I correct that ST2 uses Oniguruma for syntax definition files?

Also, does anybody know of a similar tool to the gskinner site that uses Oniguruma? I specifically want to test its syntax as I’m pretty sure that’s where I’m running into problems here.

0 Likes

#4

You are correct: Regex support -- doesn't ST2 use the Boost library?
Oniguruma docs: geocities.jp/kosako3/oniguruma/

It’s pretty consistent with other regex engines. The only thing I’ve missed is the conditional as described in that thread.

I’ll give it a shot…

Edit: Unless I’m missing something, this is just a matter of establishing precedence:

<key>patterns</key>
<array>
	<dict>
		<key>name</key>
		<string>entity.name.function</string>
		<key>match</key>
		<string>.*?(?={)</string>
	</dict>

	<dict>
		<key>name</key>
		<string>two</string>
		<key>begin</key>
		<string>{</string>
		<key>end</key>
		<string>}</string>
	</dict>

	<dict>
		<key>name</key>
		<string>comment</string>
		<key>begin</key>
		<string>;</string>
		<key>end</key>
		<string>Code</string>
	</dict>
</array>

Ignore the fact that the middle example is treated as a comment from the stray semi-colon on line 5… :wink:

Edit 2: Seems I might’ve missed a detail – I think you want to include the semi-colon following the closing bracket. I’ll have another go.
Not extensively tested:

[code]<key>patterns</key>
<array>
	<dict>
		<key>name</key>
		<string>two</string>
		<key>begin</key>
		<string>{</string>
		<key>end</key>
		<string>};</string>
	</dict>

	<dict>
		<key>contentName</key>
		<string>comment</string>
		<key>begin</key>
		<string>(?&lt;=};)</string>
		<key>end</key>
		<string>:Code</string>
	</dict>

	<dict>
		<key>name</key>
		<string>entity.name.function</string>
		<key>match</key>
		<string>[^{]*?(?={)</string>
	</dict>
</array>
[/code]

0 Likes

#5

Hey I tried!
Nice work @nick; really thorough.

I’ve been having some memory problems (recuperating from surgery). But, now I definitely remember that post.
Again, very nice job!
Edit: Sorry for the noise :smile:

0 Likes

#6

I think you’ve given me some good things to work with. The language I’m working with is non-standard and allows for some pretty weird stuff so I’m afraid the answer isn’t completely there but you’ve definitely helped me a lot. One further question. Can anybody help me understand how the order of matching is determined and what effect each match has on those below it in ST2 syntax files? My impression is that once a match has been made, it is excluded from all other matches below it. But is this true for the actual regex match or does it just mean that once a scope has been assigned to a match it won’t be reassigned by a later match?

A specific example might make this question clearer. I have one or two other matches to pick out specific content inside of braces, and to make these matches I am already matching on the braces. Does this mean that if I try to match on the braces later (even if I’m not assigning them a scope and am only using them as a bound on my actual match), the regex won’t find anything because it’s already been matched? I know this is true within one regex (which is why zero-width assertions, a.k.a. lookarounds, are so useful).

0 Likes

#7

Also, how would you adapt your {} matching with begin and end to account for nested {}'s? Thank you so much. This was why I was attempting something with a recursive match using Oniguruma’s named subexpression: [code][/code] I’m wondering if I was just not capturing it because of precedence with another match in my syntax file.
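
For illustration only, a recursive match with a named subexpression might look roughly like the sketch below. It is not the pattern attempted above; the group name `braces` and the scope name are made up. Inside a plist `<string>`, the `<` and `>` of the group syntax are written as `&lt;` and `&gt;`. As far as I know, syntax-file regexes are only applied to one line at a time, so a pattern like this could only balance braces that open and close on the same line.

```xml
<dict>
	<!-- Hypothetical scope name, pick whatever suits your language. -->
	<key>name</key>
	<string>meta.braces.example</string>
	<!-- Unescaped regex: (?<braces>\{(?:[^{}]++|\g<braces>)*\}) -->
	<!-- The group calls itself via \g<braces> to allow nested {...}. -->
	<key>match</key>
	<string>(?&lt;braces&gt;\{(?:[^{}]++|\g&lt;braces&gt;)*\})</string>
</dict>
```

Because of the single-line limit, a begin/end rule that includes itself is usually the easier way to handle nesting across lines.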

0 Likes

#8

Assume two files, the source code file (“source”) and the syntax tmLanguage file (“syntax”). The syntax is always parsed top to bottom. The source is parsed line by line, left to right, top to bottom against the syntax until a match is found. When that happens, the cursor is moved to the end of the match in the source and parsing continues again from the top of the syntax. The important takeaway is that nothing can be matched twice.
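
A minimal sketch of how that plays out with two rules, assuming the parsing model described above. The scope names are placeholders, not anything from a real syntax file.

```xml
<key>patterns</key>
<array>
	<!-- Begins at a "{" and keeps control until the next "}". -->
	<dict>
		<key>name</key>
		<string>meta.braces.example</string>
		<key>begin</key>
		<string>\{</string>
		<key>end</key>
		<string>\}</string>
	</dict>
	<!-- Only sees a ";" that the brace rule has not already consumed. -->
	<dict>
		<key>name</key>
		<string>comment.block.example</string>
		<key>begin</key>
		<string>;</string>
		<key>end</key>
		<string>:Code</string>
	</dict>
</array>
```

Given the line `@foo{bar,blah;};graaa`, the `{` is reached before any `;`, so the brace rule claims `{bar,blah;}` including the `;` inside it. Scanning resumes after the `}`, and the next `;` starts the comment rule, which runs until `:Code`. With no nested patterns the brace rule stops at the first `}`, so this sketch does not handle nested braces.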

See my answer here for dealing with nested braces.

To match content within the braces, use the patterns key and an array of dicts. If you give a specific example I can write it out for you.

0 Likes

#9

Does this mean for a given pattern or across all patterns? If I have g{A{B}}, can I match the ‘A’, the ‘B’, and the whole nested {}?

The match I’m trying to make in plain English would be something like: From the first ‘;’ that is not enclosed in any level of nested {}'s until the very next “:Code” without actually capturing the first ; or the :Code.

Here is an example of some especially sticky code I’m trying to handle (note nested {}'s as well as ;'s and {}'s following the first ‘;’ not enclosed in {}'s):

IF{{O}@Foo(Bar)="Baz"@FB ""; {O,{O,@FB}@Foo(Bar)} @IF{@Foo(Bar)="Baz"@FB ""; 1 @Foo(Bar)@{}@Foo(Bar)}}; This would be orphaned code (comments) regardless of any ; or {}'s (like the preceding) until the next :Code

My end goal is to match, in the above, “This would be orphaned code (comments) regardless of any ; or {}'s (like the preceding) until the next” as a comment.
Note that after a ; there can be other ;'s or {}'s that are interpreted as comments.

0 Likes

#10

I know that it’s across one given pattern, BUT you can get around that by matching your syntax within your pattern by nesting another regex pattern within it.
If you don’t want to match the starting ; and ending :Code you can use non-capturing groups, i.e.: (?:;)

0 Likes

#11

I appreciate the tip, there are lots of ways I could deal with the bounds on it. I’m struggling more with the super specific first ‘;’. I’ve gotten an implementation of matching outer-most brackets working with the Boost regex in ST2’s search, but I can’t seem to get it to translate to the Oniguruma regex for the syntax file. And even then, I’m not 100% sure I’ll be able to match the first ‘;’ not in braces. I think I know how it will work but I haven’t been able to test it. Any implementation to find that ‘;’ and use it as the leftmost bound of my match would be seriously welcome.

Also, just curious about the “across one pattern” part. I know that once I’ve matched comments or strings, any “special” syntax within them doesn’t get highlighted. Is that because they’ve been matched already or because they’ve been matched and assigned a scope-name?

0 Likes

#12

You can still highlight syntax nested in an already matched string like comments.

What you do is nest a new pattern within that comment syntax definition and scope it.

I was always thinking something simple like:

 <dict>
    <key>begin</key>
    <string>(;)(?!})</string>
    <key>captures</key>
    <dict>
        <key>1</key>
        <dict>
            <key>name</key>
            <string>punctuation.definition.something.begin.code</string>
        </dict>
    </dict>
    <key>end</key>
    <string>(?i:\:code)</string>
    <key>name</key>
    <string>orphaned.blah.something.code</string>
    <key>patterns</key>
    <array>
        <dict>
            <key>match</key>
            <string>{(.*)}</string>
            <key>name</key>
            <string>entity.some.nested.curly.code</string>
        </dict>
    </array>
</dict>

But yea, the scope names should be changed to whatever is appropriate for your case.
Edit: Not tested…

0 Likes

#13

I’m not saying this is what you should do, it’s just a 5-minute example to get you started:


[code]<?xml version="1.0" encoding="UTF-8"?>

<key>fileTypes</key>
<array>
	<string>pl</string>
</array>

<key>firstLineMatch</key>
<string>^#!.*\bperl\b</string>

<key>foldingStartMarker</key>
<string></string>

<key>foldingStopMarker</key>
<string>]</string>

<key>name</key>
<string>Foo language</string>

<key>scopeName</key>
<string>source.foo</string>

<key>uuid</key>
<string>491AB614-A9F1-4733-ADC1-2F33213E87F7</string>





/* Syntax Patterns
 * ================================== */
<key>patterns</key>
<array>
	<dict>
		<key>contentName</key>
		<string>comment</string>
		<key>begin</key>
		<string>;</string>
		<key>end</key>
		<string>\:Code</string>
	</dict>

	<dict>
		<key>name</key>
		<string>three</string>
		<key>match</key>
		<string>[^{]*(?={)</string>
	</dict>

	<dict>
		<key>name</key>
		<string>one</string>
		<key>begin</key>
		<string>{</string>
		<key>beginCaptures</key>
		<dict>
			<key>0</key>
			<dict>
				<key>name</key>
				<string>begin</string>
			</dict>
		</dict>
		<key>end</key>
		<string>}</string>
		<key>endCaptures</key>
		<dict>
			<key>0</key>
			<dict>
				<key>name</key>
				<string>end</string>
			</dict>
		</dict>
		<key>patterns</key>
		<array>
			<dict>
				<key>include</key>
				<string>#function</string>
			</dict>
			<dict>
				<key>include</key>
				<string>#string</string>
			</dict>
			<dict>
				<key>include</key>
				<string>#nested_braces</string>
			</dict>
		</array>
	</dict>
</array>





/* Repository
* ================================== */

<key>repository</key>
<dict>

	<key>function</key>
	<dict>
		<key>name</key>
		<string>markup.deleted.diff</string>
		<key>match</key>
		<string>@\w+</string>
	</dict>

	<key>string</key>
	<dict>
		<key>name</key>
		<string>string</string>
		<key>match</key>
		<string>"\w+"</string>
	</dict>

	<key>nested_braces</key>
	<dict>
		<key>begin</key>
		<string>{</string>
		<key>beginCaptures</key>
		<dict>
			<key>0</key>
			<dict>
				<key>name</key>
				<string>two</string>
			</dict>
		</dict>
		<key>end</key>
		<string>}</string>
		<key>endCaptures</key>
		<dict>
			<key>0</key>
			<dict>
				<key>name</key>
				<string>two</string>
			</dict>
		</dict>
		<key>patterns</key>
		<array>
			<dict>
				<key>include</key>
				<string>#function</string>
			</dict>
			<dict>
				<key>include</key>
				<string>#string</string>
			</dict>
			<dict>
				<key>include</key>
				<string>#nested_braces</string>
			</dict>
		</array>
	</dict>

</dict>
[/code]
0 Likes

#14

This looks super helpful! Thank you so much! If I could just ask you a couple of questions about what’s happening here to make sure I understand as I fine tune it:

    1. I haven’t used the repository keyword before. Is that just where you define patterns that can be used with an include? Or can any pattern be used as an include and is the repository just a way of visually separating patterns that are explicitly for that purpose? Does that mean that your “function” and “string” matches will match the entire file like any other match?
    2. Do you have to match the function/string inside the nested braces explicitly (as you’re doing with the includes)? Technically any code is valid inside braces and I’m already doing a lot of matching for that. Will the nested brace matching negate all that other stuff or do I just have to match the nested braces after (or before?) all the other matches?

I didn’t quite understand the patterns keyword inside of a dict before seeing your example. Also it seems like doing an include of a match inside itself is by far the best way to implement recursive matching in a pattern. Certainly the easiest to read. Would you agree?
Once again thank you so much!

0 Likes

#15
  1. The repository is like a library that you reference using the include key. You can use include anywhere a key is valid, but you can only include from the repository.

  2. As I said previously, nothing can be matched twice. So if you match some content as a nested brace, it will be only scoped as a nested brace. You can implement subpatterns of the nested braces (as I have done in the example) to add specific scopes within the braces.

It’s also possible to include $self, but that’s pretty tricky to get right.
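
A rough sketch of what a `$self` include could look like inside the brace rule, assuming the `nested_braces` repository entry from the earlier example. `$self` pulls in the grammar's own top-level patterns, so anything matched outside the braces would also be matched inside them.

```xml
<key>nested_braces</key>
<dict>
	<key>begin</key>
	<string>\{</string>
	<key>end</key>
	<string>\}</string>
	<key>patterns</key>
	<array>
		<!-- Re-applies every top-level rule of this grammar inside the braces. -->
		<dict>
			<key>include</key>
			<string>$self</string>
		</dict>
	</array>
</dict>
```

One reason this is tricky here: a top-level `;` to `:Code` comment rule would then also fire on a `;` inside braces, which is the case this thread is trying to avoid.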

See here for the full documentation.

0 Likes

#16

So anything in the repository isn’t matched unless it’s included somewhere? Ergo, if I want to match all the same things inside the nested braces as I am outside the nested braces, I’ll essentially have to move all my matches to the repository, and then include them in my patterns as well as in the “begin”: “{”, “end”: “}”'s patterns? Obviously I could play with how I group it to be a little more readable/clean but that’s the gist of it?

0 Likes

#17

Yea, the repository is used so you can easily group, include and nest multiple patterns. It prevents copy paste code. Anything in the repository is not matched unless it’s included in “patterns”
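
As a sketch of that layout, the fragment below keeps each rule once in the repository and includes it both at the top level and inside the braces. The entry names `function` and `nested_braces` follow the earlier example; the scope name is a placeholder.

```xml
<key>patterns</key>
<array>
	<!-- Top level: reuse the shared rules. -->
	<dict>
		<key>include</key>
		<string>#function</string>
	</dict>
	<dict>
		<key>include</key>
		<string>#nested_braces</string>
	</dict>
</array>

<key>repository</key>
<dict>
	<key>function</key>
	<dict>
		<key>name</key>
		<string>entity.name.function.example</string>
		<key>match</key>
		<string>@\w+</string>
	</dict>

	<key>nested_braces</key>
	<dict>
		<key>begin</key>
		<string>\{</string>
		<key>end</key>
		<string>\}</string>
		<key>patterns</key>
		<array>
			<!-- Inside braces: the same rules again, plus the rule itself for nesting. -->
			<dict>
				<key>include</key>
				<string>#function</string>
			</dict>
			<dict>
				<key>include</key>
				<string>#nested_braces</string>
			</dict>
		</array>
	</dict>
</dict>
```

Each rule is written once, and adding a new shared rule means one repository entry plus an include wherever it should apply.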

Great questions by the way. This is going to be a nice thread to point to folks asking about tmLanguage development.

0 Likes

#18

Thanks :smile: I think writing a syntax file is one of the best ways to teach yourself regex. I’ve used it many times in the past but this experience has taught me so many tricks, like zero-width assertions (Awesome!). I’m a little disappointed by how complicated my syntax file will be now but that’s really the fault of this screwed up language I’m being forced to work with.

0 Likes

#19

Could I simplify this:

/* Syntax Patterns
 * ================================== */
<key>patterns</key>
<array>
	...
	<dict>
		<key>name</key>
		<string>one</string>
		<key>begin</key>
		<string>{</string>
		<key>beginCaptures</key>
		<dict>
			<key>0</key>
			<dict>
				<key>name</key>
				<string>begin</string>
			</dict>
		</dict>
		<key>end</key>
		<string>}</string>
		<key>endCaptures</key>
		<dict>
			<key>0</key>
			<dict>
				<key>name</key>
				<string>end</string>
			</dict>
		</dict>
		<key>patterns</key>
		<array>
			<dict>
				<key>include</key>
				<string>#function</string>
			</dict>
			<dict>
				<key>include</key>
				<string>#string</string>
			</dict>
			<dict>
				<key>include</key>
				<string>#nested_braces</string>
			</dict>
		</array>
	</dict>
</array>

to this?

/* Syntax Patterns
 * ================================== */
<key>patterns</key>
<array>
	...
	<dict>
		<key>include</key>
		<string>#nested_braces</string>
	</dict>
</array>

I’m thinking since it’s already in the repository and it’s already doing the recursion there I should just be able to use it right?

0 Likes

#20

Also, (sorry for the multi-posting) I see that you’re matching the ;-:Code bit before the nested braces. I’m confused about how the order of evaluation plays out here. Wouldn’t the ; match to something inside the braces (if it existed) because the braces match hasn’t happened yet?

0 Likes