Sublime Forum

**facelessuser** · January 13, 2016, 9:01am

[quote=“FichteFoll”]
@facelessuser: Instead of (?imx-imx:subexp) you can try (?:(?imx-imx)subexp). It should be the same functionality-wise but I don’t know how it’s implemented. Still a bug.[/quote]

Thanks. I don’t have a problem working around it, but I have to do it manually whenever I run the convert tool, which kind of sucks. And if you don’t know what the issue is, it isn’t obvious at first. I don’t particularly have any use cases where I apply one of these to just a group, but many tmLanguage files do this even when they don’t have to, which causes conversion issues. It would be nice if the root issue just got fixed, but yeah, it is also good to post the workaround for others.

**huot25** · January 13, 2016, 9:01am

Have any of you experienced any performance differences between using a regex with a bunch of keywords vs doing each one written out?

vs.

\b(?i:AND)\b
\b(?i:OR)\b
…

I have about 400 keywords I need to include in a syntax def I am creating, and the load time is brutal. It’s still better than the tmLanguage format, but not as fast as I would like.

Here is the main context where each include is a list of keywords that needs to be checked. Any suggestions to make this faster would be helpful. I’ll be doing my own performance comparisons this weekend, but wanted to post here to see if anyone has any better ideas.

contexts:
  main:
    - include: comments
    - include: compOperators
    - include: mathOperators
    - include: types
    - include: functionDefs

    - include: keywordsControl
    - include: keywordsSql
    - include: keywordsOther
    - include: keywordsOtherUnsorted
    - include: keywordsString
    - include: keywordsBool
    - include: keywordSupport

    - match: \b(?i:TRUE|FALSE|NULL|NOTFOUND)\b
      scope: support.constant.4gl

    - match: \b\d+\b
      scope: constant.numeric.4gl

    - match: '"'
      push: string_double

    - match: \'
      push: string_single

    - match: \(
      push: parens
    - match: \)
      scope: invalid.illegal.stray-bracket-end

    - match: \[
      push: brackets
    - match: \]
      scope: invalid.illegal.stray-bracket-end

Thanks!

**FichteFoll** · January 13, 2016, 9:01am

Try atomic groups, they are better in performance since there will be no useless backtracking. Should be fine for plain keyword matching.
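For anyone wanting to try this, here is a minimal sketch in Python that builds an atomic-group keyword pattern in Oniguruma's `(?>...)` syntax. The function name and the sample keywords are made up for illustration, and the result is only a pattern string to paste into a `match:` rule; nothing here measures Sublime's load time.

```python
import re

def atomic_keyword_pattern(keywords):
    """Return a pattern string matching any keyword as a whole word."""
    # Longest first: an atomic group commits to the first alternative that
    # matches, so "OR" listed before "ORDER" would make "ORDER" unmatchable
    # once the trailing \b fails and no backtracking is allowed.
    ordered = sorted(set(keywords), key=lambda w: (-len(w), w))
    alternation = "|".join(re.escape(w) for w in ordered)
    return r"\b(?>" + alternation + r")\b"

pattern = atomic_keyword_pattern(["AND", "OR", "ORDER"])
print(pattern)  # \b(?>ORDER|AND|OR)\b
```

The ordering is the one design choice that matters: without backtracking, a shorter keyword that is a prefix of a longer one has to come after it.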

**jps** · January 13, 2016, 9:01am

Any incompatibilities with Oniguruma are bugs, and will be fixed. The new regex engine is intended to be entirely invisible, aside from any efficiency gains.

I’ll fix the issues mentioned in the above ticket.

**jps** · January 13, 2016, 9:01am

[quote=“facelessuser”]In sublime-syntax, this works:

(?imx-imx)         option on/off

But this does not work:

(?imx-imx:subexp)  option on/off for subexp

This keeps a good number of syntax from directly converting. I really feel like this should be allowed.[/quote]

It should work, and does in the limited set of tests that I’ve got. Do you have an example of where it’s not working?

**facelessuser** · January 13, 2016, 9:01am

Sure. This is for JSON. You will notice that escape chars such as \t scope as invalid instead of as normal escape chars, because the regex engine does not recognize (?x: some syntax ). If you change those to use (?x) instead, they start to scope properly.

[code]%YAML 1.2
---
# http://www.sublimetext.com/docs/3/syntax.html
name: JSON
file_extensions:
  - json
  - sublime-settings
  - sublime-menu
  - sublime-keymap
  - sublime-mousemap
  - sublime-theme
  - sublime-build
  - sublime-project
  - sublime-completions
  - sublime-commands
scope: source.json
contexts:
  main:
    - include: value
  array:
    - match: '\['
      captures:
        0: punctuation.definition.array.begin.json
      push:
        - meta_scope: meta.structure.array.json
        - match: '\]'
          captures:
            0: punctuation.definition.array.end.json
          pop: true
        - include: value
        - match: ","
          scope: punctuation.separator.array.json
        - match: '[^\s\]]'
          scope: invalid.illegal.expected-array-separator.json
  comments:
    - match: /\*\*
      captures:
        0: punctuation.definition.comment.json
      push:
        - meta_scope: comment.block.documentation.json
        - match: \*/
          pop: true
    - match: /\*
      captures:
        0: punctuation.definition.comment.json
      push:
        - meta_scope: comment.block.json
        - match: \*/
          pop: true
    - match: (//).*$\n?
      scope: comment.line.double-slash.js
      captures:
        1: punctuation.definition.comment.json
  constant:
    - match: \b(?:true|false|null)\b
      scope: constant.language.json
  keyString:
    - match: '"'
      captures:
        0: punctuation.definition.string.begin.json
      push:
        - meta_scope: string.quoted.double.key.json
        - match: '"'
          captures:
            0: punctuation.definition.string.end.json
          pop: true
        - match: |
            (?x: # turn on extended mode
            \\ # a literal backslash
            (?: # ...followed by...
            ["\\/bfnrt] # one of these characters
            | # ...or...
            u # a u
            [0-9a-fA-F]{4} # and four hex digits
            )
            )
          scope: constant.character.escape.json
        - match: \\.
          scope: invalid.illegal.unrecognized-string-escape.json
  number:
    - match: |
        (?x: # turn on extended mode
        -? # an optional minus
        (?:
        0 # a zero
        | # ...or...
        [1-9] # a 1-9 character
        \d* # followed by zero or more digits
        )
        (?:
        (?:
        \. # a period
        \d+ # followed by one or more digits
        )?
        (?:
        [eE] # an e character
        [+-]? # followed by an option +/-
        \d+ # followed by one or more digits
        )? # make exponent optional
        )? # make decimal portion optional
        )
      comment: handles integer and decimal numbers
      scope: constant.numeric.json
  object:
    - match: '\{'
      comment: a JSON object
      captures:
        0: punctuation.definition.dictionary.begin.json
      push:
        - meta_scope: meta.structure.dictionary.json
        - match: '\}'
          captures:
            0: punctuation.definition.dictionary.end.json
          pop: true
        - include: keyString
        - include: comments
        - match: ":"
          captures:
            0: punctuation.separator.dictionary.key-value.json
          push:
            - meta_scope: meta.structure.dictionary.value.json
            - match: '(,)|(?=\})'
              captures:
                1: punctuation.separator.dictionary.pair.json
              pop: true
            - include: value
            - match: '[^\s,]'
              scope: invalid.illegal.expected-dictionary-separator.json
        - match: '[^\s\}]'
          scope: invalid.illegal.expected-dictionary-separator.json
  string:
    - match: '"'
      captures:
        0: punctuation.definition.string.begin.json
      push:
        - meta_scope: string.quoted.double.json
        - match: '"'
          captures:
            0: punctuation.definition.string.end.json
          pop: true
        - match: |
            (?x: # turn on extended mode
            \\ # a literal backslash
            (?: # ...followed by...
            ["\\/bfnrt] # one of these characters
            | # ...or...
            u # a u
            [0-9a-fA-F]{4} # and four hex digits
            )
            )
          scope: constant.character.escape.json
        - match: \\.
          scope: invalid.illegal.unrecognized-string-escape.json
  value:
    - include: constant
    - include: number
    - include: string
    - include: array
    - include: object
    - include: comments
[/code]

**jps** · January 13, 2016, 9:01am

That’s a YAML surprise rather than an issue with the regex parser. Multi-line strings with the pipe indicator have a newline character added after each line, including the last one. Because your regex uses the subexp version of the extended form, the newline after the closing parenthesis falls outside the extended group, so the regex expects a literal newline to be matched at the end. You can verify this by converting it to a normal double-quoted string, adding newlines and escapes as required: the regex will then work as expected.

**jps** · January 13, 2016, 9:01am

I’ll see what I can do about this issue, given it’s caused by convert_syntax.py.

**FichteFoll** · January 13, 2016, 9:01am

- match: |-
    (?x: # turn on extended mode
    \\ # a literal backslash
    (?: # ...followed by...
    ["\\/bfnrt] # one of these characters
    | # ...or...
    u # a u
    [0-9a-fA-F]{4} # and four hex digits
    )
    )

The trailing hyphen tells YAML not to append the trailing newline.
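The effect of that trailing newline can be reproduced outside Sublime. The sketch below uses Python's `re` module, which also supports the scoped `(?x:...)` form, as a stand-in for the real engine, so treat it as an illustration rather than proof of what sregex or Oniguruma do. The two pattern strings are written by hand to mimic what `|` and `|-` would hand to the regex engine, and the pattern itself is a shortened version of the escape regex.

```python
import re

# What a YAML "|" block scalar yields: every line, including the last,
# ends with "\n".
CLIPPED = r"""(?x:          # extended mode, scoped to this group only
  \\ [bfnrt]    # a backslash followed by one escape letter
)
"""

# What "|-" yields: the same text minus the final newline.
STRIPPED = CLIPPED.rstrip("\n")

escape = r"\t"  # two characters: backslash, t

# The newline after ")" sits outside the extended group, so it is literal.
print(re.search(CLIPPED, escape))                     # None
print(re.search(STRIPPED, escape).group(0))           # \t
print(re.search(CLIPPED, escape + "\n") is not None)  # True
```

With a global `(?x)` at the start of the pattern the trailing newline would be ignored like any other whitespace, which matches the observation that switching to `(?x)` made the scoping work.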

**jps** · January 13, 2016, 9:01am

Thanks for the pointer, I wasn’t aware. I’ll update convert_syntax.py, and fix the too eager leading whitespace stripping at the same time.

**asomorjai** · January 13, 2016, 9:01am

Anyone seeing the “Unable to fetch update url contents” on OS X 10.10.3? I’m unable to update automatically since build 3084.

Thanks, Akos

**bathos** · January 13, 2016, 9:01am

The minus says not to include the final newline; however if you also wish to have it ignore the ones inside, use “>-” instead of “|-”. I think it doesn’t matter if you already have (?x) on, but it’s worth noting since >- will introduce no whitespace at all.

**jps** · January 13, 2016, 9:01am

[quote=“huot25”]Have any of you experienced any performance differences between using a regex with a bunch of keywords vs doing each one written out?

vs.

\b(?i:AND)\b
\b(?i:OR)\b
…
[/quote]

Performance with Oniguruma will be fundamentally similar between them; however, I’d expect the single regex case to be faster.

I’d suggest not worrying about this case too much: sregex is currently falling back to Oniguruma for case-insensitive matching, however support will be added for it shortly. sregex is much more efficient than onig for these alternation-heavy regexes: you can get a feel for this by removing the case insensitive flag and seeing the speed difference. For sregex, there is no speed difference between one large regex vs multiple smaller ones.
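To make the two shapes being compared concrete, here is a small Python sketch that builds both from the same keyword list and checks that they find the same matches. The keyword list and function names are hypothetical, and Python's `re` is a different engine from both Oniguruma and sregex, so this shows equivalence of the patterns only, not their relative speed.

```python
import re

# Hypothetical sample list; the real definition has roughly 400 keywords.
KEYWORDS = ["AND", "OR", "NOT"]

def combined_pattern(keywords):
    """One case-insensitive alternation covering every keyword."""
    alternation = "|".join(re.escape(k) for k in keywords)
    return rf"\b(?i:{alternation})\b"

def individual_patterns(keywords):
    """One case-insensitive pattern per keyword."""
    return [rf"\b(?i:{re.escape(k)})\b" for k in keywords]

def matches_combined(text, keywords=KEYWORDS):
    return [m.span() for m in re.finditer(combined_pattern(keywords), text)]

def matches_individual(text, keywords=KEYWORDS):
    spans = []
    for pattern in individual_patterns(keywords):
        spans.extend(m.span() for m in re.finditer(pattern, text))
    return sorted(spans)

text = "a and b Or not c"
print(combined_pattern(KEYWORDS))                          # \b(?i:AND|OR|NOT)\b
print(matches_combined(text) == matches_individual(text))  # True
```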

**facelessuser** · January 13, 2016, 9:01am

Yup, that is it. Thanks.

[quote]The minus says not to include the final newline; however if you also wish to have it ignore the ones inside, use “>-” instead of “|-”. I think it doesn’t matter if you already have (?x) on, but it’s worth noting since >- will introduce no whitespace at all.
[/quote]

Also, good to know. I still don’t use YAML enough to remember these differences about multi-line strings. This is a good reminder.

**FichteFoll** · January 13, 2016, 9:01am

This is not exactly correct: > folds the lines and replaces line breaks with a single space, whereas >- additionally removes the trailing newline. There are a few more special cases, however. (yaml.org/spec/1.2/spec.html#id2796251)

[code]import yaml

# Block scalar content must be indented by at least one space
s = """
 This is
 a folded
 string
 With no

 trailing newline
 but
     leading spaces
 sometimes

"""

# safe_load avoids the Loader argument that newer PyYAML releases require for yaml.load
d = yaml.safe_load(">" + s)

print(d, "tail")
print()

d = yaml.safe_load(">-" + s)

print(d, "tail")
[/code]

[code]This is a folded string With no
trailing newline but
    leading spaces
sometimes
 tail

This is a folded string With no
trailing newline but
    leading spaces
sometimes tail[/code]

Anyway, since the converter tool preserves the original string it needs to respect the newlines and include them in the converted representation, so > is not really an option.

**bathos** · January 13, 2016, 9:01am

Thanks, I described its effect incorrectly. It’s the version I’d come to prefer since it’s never surprised me, but I always use (?x) flag for multiline anyway since I end up using spaces for regex readability, so I hadn’t given much thought to the fact that newline=space with this case.

**huot25** · January 13, 2016, 9:01am

[quote=“jps”]
I’d suggest not worrying about this case too much: sregex is currently falling back to Oniguruma for case-insensitive matching, however support will be added for it shortly. sregex is much more efficient than onig for these alternation-heavy regexes: you can get a feel for this by removing the case insensitive flag and seeing the speed difference. For sregex, there is no speed difference between one large regex vs multiple smaller ones.[/quote]

Thanks Jon! It’s good to know I am not crazy! I spent a few hours over the weekend trying to find the best-performing option and was not seeing any difference between onig and sregex, but that was because it was falling back to onig, since my regex was using case-insensitive matching. Looking forward to having case-insensitive matching in one of the future releases. I removed that flag and the files now load instantly vs. a 4-second load time.

**gwenzek** · January 13, 2016, 9:02am

Hi,

I find the new sublime-syntax format really nice.
I always wanted to change the syntax highlighting for Scala but never wanted to dive into the tmLanguage format.
Now I can do it!

I started something on github at https://github.com/gwenzek/scalaSublimeSyntax if you want to have a look.

I was wondering if there is a way to match all Unicode letters? Something like match: [a-zA-Z], but one that would also catch éèà…

**bathos** · January 13, 2016, 9:02am

Gwenzek, you can use the “core” Unicode character property classes with \p{NAME_OF_PROP}.

So for example, matching any letters, regardless of case or alphabet, would be \p{L}.
And all Latin letters, regardless of case (and including characters with diacritics) would be \p{Latin}.

See geocities.jp/kosako3/oniguruma/doc/RE.txt , section 3 for more info, or check out the Unicode homepage to figure out exactly what’s included in a given character property class.
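As a rough way to see which characters \p{L} covers, the Python sketch below checks the Unicode general category with the standard `unicodedata` module. This is only an approximation for exploring the character classes: Python's built-in `re` does not understand \p{...}, and the helper name `is_letter` is made up here.

```python
import re
import unicodedata

def is_letter(ch):
    """True if ch is in a Unicode general category starting with "L"
    (Lu, Ll, Lt, Lm, Lo), which is what \\p{L} stands for."""
    return unicodedata.category(ch).startswith("L")

ascii_only = re.compile(r"[a-zA-Z]+")

word = "\u00e9\u00e8\u00e0"  # éèà as precomposed characters

print(ascii_only.fullmatch(word))         # None: [a-zA-Z] misses accented letters
print(all(is_letter(ch) for ch in word))  # True
print(is_letter("1"), is_letter("_"))     # False False
```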

**gwenzek** · January 13, 2016, 9:02am

Thanks for the link, just what I needed!

Dev Build 3085

[code]%YAML 1.2

http://www.sublimetext.com/docs/3/syntax.html