Sublime Forum

Oniguruma syntax definition help

#4

You are correct: Regex support -- doesn't ST2 use the Boost library?
Oniguruma docs: geocities.jp/kosako3/oniguruma/

It's pretty consistent with other regex engines. The only thing I've missed is the conditional as described in that thread.

I'll give it a shot...

Edit: Unless I'm missing something, this is just a matter of establishing precedence:

<key>patterns</key>
<array>
	<dict>
		<key>name</key>
		<string>entity.name.function</string>
		<key>match</key>
		<string>.*?(?={)</string>
	</dict>

	<dict>
		<key>name</key>
		<string>two</string>
		<key>begin</key>
		<string>{</string>
		<key>end</key>
		<string>}</string>
	</dict>

	<dict>
		<key>name</key>
		<string>comment</string>
		<key>begin</key>
		<string>;</string>
		<key>end</key>
		<string>Code</string>
	</dict>
</array>

Ignore the fact that the middle example is treated as a comment from the stray semi-colon on line 5... :wink:

Edit 2: Seems I might've missed a detail: I think you want to include the semi-colon following the closing bracket. I'll have another go.
Not extensively tested:

[code]<key>patterns</key>
<array>
<dict>
	<key>name</key>
	<string>two</string>
	<key>begin</key>
	<string>{</string>
	<key>end</key>
	<string>};</string>
</dict>

<dict>
	<key>contentName</key>
	<string>comment</string>
	<key>begin</key>
	<string>(?&lt;=};)</string>
	<key>end</key>
	<string>:Code</string>
</dict>

<dict>
	<key>name</key>
	<string>entity.name.function</string>
	<key>match</key>
	<string>[^{]*?(?={)</string>
</dict>
</array>
[/code]


#5

Hey I tried!
Nice work @nick; really thorough.

I've been having some memory problems (recuperating from surgery). But now I definitely remember that post.
Again, very nice job!
Edit: Sorry for the noise :smile:


#6

I think you've given me some good things to work with. The language I'm working with is non-standard and allows for some pretty weird stuff, so I'm afraid the answer isn't completely there, but you've definitely helped me a lot. One further question: can anybody help me understand how the order of matching is determined, and what effect each match has on those below it in ST2 syntax files? My impression is that once a match has been made, it is excluded from all other matches below it. But is this true for the actual regex match, or does it just mean that once a scope has been assigned to a match it won't be reassigned by a later match?

A specific example might make this question clearer. I have one or two other matches to pick out specific content inside of braces, and to make these matches I am already matching on the braces. Does this mean that if I try to match on the braces later (even if I'm not assigning them a scope and am only using them as a bound on my actual match), the regex won't find anything because it's already been matched? I know this is true within one regex (which is why zero-width assertions, a.k.a. lookarounds, are so useful).


#7

Also, how would you adapt your {} matching with begin and end to account for nested {}'s? Thank you so much. This was why I was attempting something with a recursive match using Oniguruma's named subexpression. I'm wondering if I was just not capturing it because of precedence with another match in my syntax file.


#8

Assume two files, the source code file ("source") and the syntax tmLanguage file ("syntax"). The syntax is always parsed top to bottom. The source is parsed line by line, left to right, top to bottom against the syntax until a match is found. When that happens, the cursor is moved to the end of the match in the source and parsing continues again from the top of the syntax. The important takeaway is that nothing can be matched twice.
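
A rough illustration of that ordering, assuming two made-up rules (the scope names and regexes are hypothetical, and this is untested). Both rules could match the word foobar at the same position, but the first one is listed first and consumes it, so the second never gets a chance to scope it:

```xml
<key>patterns</key>
<array>
	<!-- Listed first: tried first at each position, so it wins on "foobar" -->
	<dict>
		<key>name</key>
		<string>keyword.other.example</string>
		<key>match</key>
		<string>foo\w*</string>
	</dict>
	<!-- Listed second: "foobar" is already consumed by the rule above,
	     so this rule never scopes it -->
	<dict>
		<key>name</key>
		<string>entity.name.example</string>
		<key>match</key>
		<string>foobar</string>
	</dict>
</array>
```

Swapping the two dicts would let the second rule claim foobar instead.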

See my answer here for dealing with nested braces.

To match content within the braces, use the patterns key and an array of dicts. If you give a specific example I can write it out for you.


#9

Does this mean for a given pattern or across all patterns? If I have g{A{B}} can I match both the 'A', 'B', and the whole nested {}?

The match I'm trying to make in plain English would be something like: From the first ';' that is not enclosed in any level of nested {}'s until the very next ":Code" without actually capturing the first ; or the :Code.

Here is an example of some especially sticky code I'm trying to handle (note nested {}'s as well as ;'s and {}'s following the first ';' not enclosed in {}'s):

IF{{O}@Foo(Bar)="Baz"@FB ""; {O,{O,@FB}@Foo(Bar)} @IF{@Foo(Bar)="Baz"@FB ""; 1 @Foo(Bar)@{}@Foo(Bar)}}; This would be orphaned code (comments) regardless of any ; or {}'s (like the preceding) until the next :Code

My end goal is to match, in the above, "This would be orphaned code (comments) regardless of any ; or {}'s (like the preceding) until the next" as a comment.
Note that after a ; can be other ;'s or {}'s that are interpreted as comments.


#10

I know that it's across one given pattern, BUT you can get around that by matching your syntax within your pattern by nesting another regex pattern within it.
If you don't want to match the starting ; and ending :Code you can use non-capturing groups, i.e.: (?:;)
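
One caveat: a non-capturing group still consumes the text it matches, so the ; and :Code would remain part of the scoped region. If they should stay outside it, zero-width lookarounds are one option. A minimal, untested sketch (the scope name is a placeholder):

```xml
<dict>
	<key>name</key>
	<string>comment.block.example</string>
	<!-- Lookbehind: start right after a ";" without consuming it
	     (the "<" has to be escaped as &lt; inside the plist) -->
	<key>begin</key>
	<string>(?&lt;=;)</string>
	<!-- Lookahead: stop right before ":Code" without consuming it -->
	<key>end</key>
	<string>(?=:Code)</string>
</dict>
```

The contentName key used in post #4 is another way to keep the delimiters out of the comment scope.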


#11

I appreciate the tip, there are lots of ways I could deal with the bounds on it. I'm struggling more with the super specific first ';'. I've gotten an implementation of matching outer-most brackets working with the Boost regex in ST2's search, but I can't seem to get it to translate to the Oniguruma regex for the syntax file. And even then, I'm not 100% sure I'll be able to match the first ';' not in braces. I think I know how it will work but I haven't been able to test it. Any implementation to find that ';' and use it as the leftmost bound of my match would be seriously welcome.

Also, just curious about the "across one pattern" part. I know that once I've matched comments or strings, any "special" syntax within them doesn't get highlighted. Is that because they've been matched already or because they've been matched and assigned a scope-name?


#12

You can still highlight syntax nested in an already matched string like comments.

What you do is nest a new pattern within that comment syntax definition and scope it.

I was always thinking something simple like:

<dict>
    <key>begin</key>
    <string>(;)(?!})</string>
    <key>captures</key>
    <dict>
        <key>1</key>
        <dict>
            <key>name</key>
            <string>punctuation.definition.something.begin.code</string>
        </dict>
    </dict>
    <key>end</key>
    <string>(?i:\:code)</string>
    <key>name</key>
    <string>orphaned.blah.something.code</string>
    <key>patterns</key>
    <array>
        <dict>
            <key>match</key>
            <string>{(.*)}</string>
            <key>name</key>
            <string>entity.some.nested.curly.code</string>
        </dict>
    </array>
</dict>

But yea, the scope names should be changed to whatever is appropriate for your case.
Edit: Not tested...


#13

I'm not saying this is what you should do, it's just a 5-minute example to get you started:


[code]<?xml version="1.0" encoding="UTF-8"?>

<key>fileTypes</key>
<array>
	<string>pl</string>
</array>

<key>firstLineMatch</key>
<string>^#!.*\bperl\b</string>

<key>foldingStartMarker</key>
<string></string>

<key>foldingStopMarker</key>
<string>]</string>

<key>name</key>
<string>Foo language</string>

<key>scopeName</key>
<string>source.foo</string>

<key>uuid</key>
<string>491AB614-A9F1-4733-ADC1-2F33213E87F7</string>





/* Syntax Patterns
 * ================================== */
<key>patterns</key>
<array>
	<dict>
		<key>contentName</key>
		<string>comment</string>
		<key>begin</key>
		<string>;</string>
		<key>end</key>
		<string>\:Code</string>
	</dict>

	<dict>
		<key>name</key>
		<string>three</string>
		<key>match</key>
		<string>[^{]*(?={)</string>
	</dict>

	<dict>
		<key>name</key>
		<string>one</string>
		<key>begin</key>
		<string>{</string>
		<key>beginCaptures</key>
		<dict>
			<key>0</key>
			<dict>
				<key>name</key>
				<string>begin</string>
			</dict>
		</dict>
		<key>end</key>
		<string>}</string>
		<key>endCaptures</key>
		<dict>
			<key>0</key>
			<dict>
				<key>name</key>
				<string>end</string>
			</dict>
		</dict>
		<key>patterns</key>
		<array>
			<dict>
				<key>include</key>
				<string>#function</string>
			</dict>
			<dict>
				<key>include</key>
				<string>#string</string>
			</dict>
			<dict>
				<key>include</key>
				<string>#nested_braces</string>
			</dict>
		</array>
	</dict>
</array>





/* Repository
* ================================== */

<key>repository</key>
<dict>

	<key>function</key>
	<dict>
		<key>name</key>
		<string>markup.deleted.diff</string>
		<key>match</key>
		<string>@\w+</string>
	</dict>

	<key>string</key>
	<dict>
		<key>name</key>
		<string>string</string>
		<key>match</key>
		<string>"\w+"</string>
	</dict>

	<key>nested_braces</key>
	<dict>
		<key>begin</key>
		<string>{</string>
		<key>beginCaptures</key>
		<dict>
			<key>0</key>
			<dict>
				<key>name</key>
				<string>two</string>
			</dict>
		</dict>
		<key>end</key>
		<string>}</string>
		<key>endCaptures</key>
		<dict>
			<key>0</key>
			<dict>
				<key>name</key>
				<string>two</string>
			</dict>
		</dict>
		<key>patterns</key>
		<array>
			<dict>
				<key>include</key>
				<string>#function</string>
			</dict>
			<dict>
				<key>include</key>
				<string>#string</string>
			</dict>
			<dict>
				<key>include</key>
				<string>#nested_braces</string>
			</dict>
		</array>
	</dict>

</dict>
[/code]

#14

This looks super helpful! Thank you so much! If I could just ask you a couple of questions about what's happening here to make sure I understand as I fine tune it:

    1. I haven't used the repository keyword before. Is that just where you define patterns that can be used with an include? Or can any pattern be used as an include and is the repository just a way of visually separating patterns that are explicitly for that purpose? Does that mean that your "function" and "string" matches will match the entire file like any other match?
    2. Do you have to match the function/string inside the nested braces explicitly (as you're doing with the includes)? Technically any code is valid inside braces and I'm already doing a lot of matching for that. Will the nested brace matching negate all that other stuff or do I just have to match the nested braces after (or before?) all the other matches?

I didn't quite understand the patterns keyword inside of a dict before seeing your example. Also it seems like doing an include of a match inside itself is by far the best way to implement recursive matching in a pattern. Certainly the easiest to read. Would you agree?
Once again thank you so much!


#15
  1. The repository is like a library that you reference using the include key. You can use include anywhere a key is valid, but you can only include from the repository.

  2. As I said previously, nothing can be matched twice. So if you match some content as a nested brace, it will be only scoped as a nested brace. You can implement subpatterns of the nested braces (as I have done in the example) to add specific scopes within the braces.

It's also possible to include $self, but that's pretty tricky to get right.
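
For reference, a $self include might look something like the sketch below (untested; the scope name is a placeholder). It pulls the grammar's own top-level patterns into the brace rule, so everything matched outside the braces is also matched inside them. Assuming this rule is itself listed in the top-level patterns, that includes nested braces:

```xml
<dict>
	<key>name</key>
	<string>meta.braces.example</string>
	<key>begin</key>
	<string>\{</string>
	<key>end</key>
	<string>\}</string>
	<key>patterns</key>
	<array>
		<!-- $self refers to the top-level patterns of this same grammar -->
		<dict>
			<key>include</key>
			<string>$self</string>
		</dict>
	</array>
</dict>
```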

See here for the full documentation.


#16

So anything in the repository isn't matched unless it's included somewhere? Ergo, if I want to match all the same things inside the nested braces as I am outside the nested braces, I'll essentially have to move all my matches to the repository, and then include them in my patterns as well as in the "begin": "{", "end": "}"'s patterns? Obviously I could play with how I group it to be a little more readable/clean but that's the gist of it?


#17

Yea, the repository is used so you can easily group, include and nest multiple patterns. It prevents copy-pasted code. Anything in the repository is not matched unless it's included in "patterns".
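
A minimal sketch of that layout, assuming a repository group called main (a made-up name; the scope name is a placeholder too, and this is untested). Both the top-level patterns and the brace rule include the same group, so the rules are written only once:

```xml
<key>patterns</key>
<array>
	<dict>
		<key>include</key>
		<string>#main</string>
	</dict>
</array>

<key>repository</key>
<dict>
	<!-- A repository entry can simply group other rules under one name -->
	<key>main</key>
	<dict>
		<key>patterns</key>
		<array>
			<dict>
				<key>include</key>
				<string>#braces</string>
			</dict>
			<dict>
				<key>name</key>
				<string>string.quoted.double.example</string>
				<key>match</key>
				<string>"\w+"</string>
			</dict>
		</array>
	</dict>

	<key>braces</key>
	<dict>
		<key>begin</key>
		<string>\{</string>
		<key>end</key>
		<string>\}</string>
		<!-- Same rules inside the braces, including nested braces -->
		<key>patterns</key>
		<array>
			<dict>
				<key>include</key>
				<string>#main</string>
			</dict>
		</array>
	</dict>
</dict>
```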

Great questions by the way. This is going to be a nice thread to point to folks asking about tmLanguage development.


#18

Thanks :smile: I think writing a syntax file is one of the best ways to teach yourself regex. I've used it many times in the past but this experience has taught me so many tricks, like zero-width assertions (Awesome!). I'm a little disappointed by how complicated my syntax file will be now but that's really the fault of this screwed up language I'm being forced to work with.


#19

Could I simplify this:

/* Syntax Patterns
 * ================================== */
<key>patterns</key>
<array>
	...
	<dict>
		<key>name</key>
		<string>one</string>
		<key>begin</key>
		<string>{</string>
		<key>beginCaptures</key>
		<dict>
			<key>0</key>
			<dict>
				<key>name</key>
				<string>begin</string>
			</dict>
		</dict>
		<key>end</key>
		<string>}</string>
		<key>endCaptures</key>
		<dict>
			<key>0</key>
			<dict>
				<key>name</key>
				<string>end</string>
			</dict>
		</dict>
		<key>patterns</key>
		<array>
			<dict>
				<key>include</key>
				<string>#function</string>
			</dict>
			<dict>
				<key>include</key>
				<string>#string</string>
			</dict>
			<dict>
				<key>include</key>
				<string>#nested_braces</string>
			</dict>
		</array>
	</dict>
</array>

to this:

/* Syntax Patterns
 * ================================== */
<key>patterns</key>
<array>
	...
	<dict>
		<key>include</key>
		<string>#nested_braces</string>
	</dict>
</array>

?

I'm thinking since it's already in the repository and it's already doing the recursion there I should just be able to use it, right?


#20

Also, (sorry for the multi-posting) I see that you're matching the ;-:Code bit before the nested braces. I'm confused about how the order of evaluation plays out here. Wouldn't the ; match to something inside the braces (if it existed) because the braces match hasn't happened yet?


#21

Final post! I promise!

Just wanted to say... totally working now. I've been beating my head against this for a couple of days so... yeah, pretty great feeling. Thank you so much. The one little bug I'm having with it (that I can totally live with) is if there's an unmatched { before the ;, it seems to be stuck in the nested_braces match for the rest of the file. Is there an easy workaround to this? (Is that why you redid the begin/end match in the patterns?)


#22

That's correct. I only tested against the sample of code you provided.

Not sure what you're asking in the first of the above three posts.

If you have an unmatched brace, won't that be a syntax error anyway?


#23

Yeah I should apologize for that. Working on not much sleep and a little giddy on that break-through after being stuck on a problem for a while. You're totally right about the unmatched {. Sorry for cluttering up the thread :blush:

I'm still a little curious about why the ;-:Code match coming before the nested braces match works. Because it is working, it just seems like it shouldn't.
