Sublime Forum

Dev Build 3085

#16

[quote="FichteFoll"]
@facelessuser: Instead of (?imx-imx:subexp) you can try (?:(?imx-imx)subexp). It should be the same functionality-wise but I don’t know how it’s implemented. Still a bug.[/quote]

Thanks. I don’t have a problem working around it, but I have to do it manually every time I run the convert tool, which kind of sucks. And if you don’t know what the issue is, it isn’t obvious at first. I don’t have any particular use cases where I apply one of these to just a group, but many tmLanguages do this even when they don’t have to, which causes convert issues. It would be nice if the root issue just got fixed, but yeah, it is also good to post the workaround for others.
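
For anyone post-processing converted files, the manual rewrite can be scripted. Below is a minimal Python sketch, assuming only the group opener needs to change (the closing parenthesis stays where it is). `rewrite_scoped_flags` is a hypothetical helper, not part of the convert tool, and it makes no attempt to handle openers that appear inside character classes.

```python
import re

# Matches the opener of a scoped-flag group such as "(?i:" or "(?imx-imx:".
# The lookbehind skips a parenthesis escaped with a single backslash.
SCOPED_FLAGS = re.compile(r"(?<!\\)\(\?([imx]*(?:-[imx]+)?):")


def rewrite_scoped_flags(pattern):
    """Turn (?flags:subexp) into (?:(?flags)subexp); hypothetical helper."""
    def repl(match):
        flags = match.group(1)
        if not flags:
            # "(?:" is a plain non-capturing group, leave it alone
            return match.group(0)
        return "(?:(?" + flags + ")"
    return SCOPED_FLAGS.sub(repl, pattern)


print(rewrite_scoped_flags(r"\b(?i:AND|OR)\b"))
```

Only the opener is touched, so nested groups inside the subexpression do not need to be parsed.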


#17

Have any of you experienced any performance differences between using a regex with a bunch of keywords vs doing each one written out?

\b(?i:AND|OR|NOT|TRUE|FALSE|SELECT|INPUT)\b

vs.

\b(?i:AND)\b
\b(?i:OR)\b

I have about 400 keywords I need to include in a syntax def I am creating, and the load time is brutal. It’s still better than the tmLanguage format, but not as fast as I would like.
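
With that many keywords, the single-alternation form is easier to generate than to type. A minimal Python sketch, assuming the keywords are available as a plain list; `keyword_pattern` is a made-up name, and Python's `re` is used here only to sanity-check the generated string, not to predict how Sublime Text will perform:

```python
import re


def keyword_pattern(keywords):
    """Join keywords into one case-insensitive, word-bounded alternation."""
    unique = list(dict.fromkeys(keywords))  # drop duplicates, keep order
    return r"\b(?i:" + "|".join(re.escape(k) for k in unique) + r")\b"


pattern = keyword_pattern(
    ["AND", "OR", "NOT", "TRUE", "FALSE", "SELECT", "INPUT", "AND"]
)
print(pattern)
print(re.search(pattern, "select * from t"))
```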

Here is the main context where each include is a list of keywords that needs to be checked. Any suggestions to make this faster would be helpful. I’ll be doing my own performance comparisons this weekend, but wanted to post here to see if anyone has any better ideas.

contexts:
  main:
    - include: comments
    - include: compOperators
    - include: mathOperators
    - include: types
    - include: functionDefs

    - include: keywordsControl
    - include: keywordsSql
    - include: keywordsOther
    - include: keywordsOtherUnsorted
    - include: keywordsString
    - include: keywordsBool
    - include: keywordSupport

    - match: \b(?i:TRUE|FALSE|NULL|NOTFOUND)\b
      scope: support.constant.4gl

    - match: \b\d+\b
      scope: constant.numeric.4gl

    - match: '"'
      push: string_double

    - match: \'
      push: string_single

    - match: \(
      push: parens
    - match: \)
      scope: invalid.illegal.stray-bracket-end

    - match: \[
      push: brackets
    - match: \]
      scope: invalid.illegal.stray-bracket-end

Thanks!


#18

Try atomic groups. They perform better since there is no useless backtracking, and they should be fine for plain keyword matching.
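
One thing to watch with atomic groups is alternative order: once `(?>NOT|NOTFOUND)` has matched `NOT`, the engine will not go back and try `NOTFOUND` when the following `\b` fails. A small Python sketch that builds the pattern string with longer keywords first; `atomic_keyword_pattern` is a hypothetical helper and assumes the keywords are plain words that need no escaping:

```python
def atomic_keyword_pattern(keywords):
    """Build \\b(?>...)\\b with longer alternatives listed first."""
    # Longest first, so a keyword is never shadowed by one of its prefixes
    ordered = sorted(set(keywords), key=lambda k: (-len(k), k))
    return r"\b(?>" + "|".join(ordered) + r")\b"


print(atomic_keyword_pattern(["NOT", "NOTFOUND", "NULL"]))
```

The function only produces the string; it is meant to be pasted into a `match:` line rather than compiled in Python.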


#19

Any incompatibilities with Oniguruma are bugs, and will be fixed. The new regex engine is intended to be entirely invisible, aside from any efficiency gains.

I’ll fix the issues mentioned in the above ticket.


#20

[quote="facelessuser"]In sublime-syntax, this works:

(?imx-imx)         option on/off

But this does not work:

(?imx-imx:subexp)  option on/off for subexp

This keeps a good number of syntax from directly converting. I really feel like this should be allowed.[/quote]

It should work, and does in the limited set of tests that I’ve got. Do you have an example of where it’s not working?


#21

Sure. This is for JSON. You will notice that escape chars such as \t scope as invalid instead of as normal escape chars, because the regex engine does not recognize (?x: some syntax ). If you adjust those to use (?x) instead, they start to scope properly.

[code]%YAML 1.2
---
# http://www.sublimetext.com/docs/3/syntax.html
name: JSON
file_extensions:
  - json
  - sublime-settings
  - sublime-menu
  - sublime-keymap
  - sublime-mousemap
  - sublime-theme
  - sublime-build
  - sublime-project
  - sublime-completions
  - sublime-commands
scope: source.json
contexts:
  main:
    - include: value
  array:
    - match: '\['
      captures:
        0: punctuation.definition.array.begin.json
      push:
        - meta_scope: meta.structure.array.json
        - match: '\]'
          captures:
            0: punctuation.definition.array.end.json
          pop: true
        - include: value
        - match: ","
          scope: punctuation.separator.array.json
        - match: '[^\s\]]'
          scope: invalid.illegal.expected-array-separator.json
  comments:
    - match: /\*\*
      captures:
        0: punctuation.definition.comment.json
      push:
        - meta_scope: comment.block.documentation.json
        - match: \*/
          pop: true
    - match: /\*
      captures:
        0: punctuation.definition.comment.json
      push:
        - meta_scope: comment.block.json
        - match: \*/
          pop: true
    - match: (//).*$\n?
      scope: comment.line.double-slash.js
      captures:
        1: punctuation.definition.comment.json
  constant:
    - match: \b(?:true|false|null)\b
      scope: constant.language.json
  keyString:
    - match: '"'
      captures:
        0: punctuation.definition.string.begin.json
      push:
        - meta_scope: string.quoted.double.key.json
        - match: '"'
          captures:
            0: punctuation.definition.string.end.json
          pop: true
        - match: |
            (?x: # turn on extended mode
            \\ # a literal backslash
            (?: # ...followed by...
            ["\\/bfnrt] # one of these characters
            | # ...or...
            u # a u
            [0-9a-fA-F]{4} # and four hex digits
            )
            )
          scope: constant.character.escape.json
        - match: \\.
          scope: invalid.illegal.unrecognized-string-escape.json
  number:
    - match: |
        (?x: # turn on extended mode
        -? # an optional minus
        (?:
        0 # a zero
        | # ...or...
        [1-9] # a 1-9 character
        \d* # followed by zero or more digits
        )
        (?:
        (?:
        \. # a period
        \d+ # followed by one or more digits
        )?
        (?:
        [eE] # an e character
        [+-]? # followed by an option +/-
        \d+ # followed by one or more digits
        )? # make exponent optional
        )? # make decimal portion optional
        )
      comment: handles integer and decimal numbers
      scope: constant.numeric.json
  object:
    - match: '\{'
      comment: a JSON object
      captures:
        0: punctuation.definition.dictionary.begin.json
      push:
        - meta_scope: meta.structure.dictionary.json
        - match: '\}'
          captures:
            0: punctuation.definition.dictionary.end.json
          pop: true
        - include: keyString
        - include: comments
        - match: ":"
          captures:
            0: punctuation.separator.dictionary.key-value.json
          push:
            - meta_scope: meta.structure.dictionary.value.json
            - match: '(,)|(?=\})'
              captures:
                1: punctuation.separator.dictionary.pair.json
              pop: true
            - include: value
            - match: '[^\s,]'
              scope: invalid.illegal.expected-dictionary-separator.json
        - match: '[^\s\}]'
          scope: invalid.illegal.expected-dictionary-separator.json
  string:
    - match: '"'
      captures:
        0: punctuation.definition.string.begin.json
      push:
        - meta_scope: string.quoted.double.json
        - match: '"'
          captures:
            0: punctuation.definition.string.end.json
          pop: true
        - match: |
            (?x: # turn on extended mode
            \\ # a literal backslash
            (?: # ...followed by...
            ["\\/bfnrt] # one of these characters
            | # ...or...
            u # a u
            [0-9a-fA-F]{4} # and four hex digits
            )
            )
          scope: constant.character.escape.json
        - match: \\.
          scope: invalid.illegal.unrecognized-string-escape.json
  value:
    - include: constant
    - include: number
    - include: string
    - include: array
    - include: object
    - include: comments
[/code]

#22

That’s a YAML surprise, rather than an issue with the regex parser. Multi line strings with the pipe indicator have a newline character added after each line, including the last line. Because your regex is using the subexp version of the extended form, it’s expecting a literal newline to be matched at the end. You can verify this by converting it to a normal double quoted string, adding newlines and escapes as required: the regex will then work as expected.
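
The effect is easy to reproduce outside Sublime Text. Here is a rough sketch using Python's `re` module as a stand-in for the editor's engine; this rests on the assumption that both treat anything after the closing parenthesis of `(?x:...)` as non-extended. The two pattern strings imitate the same regex with and without the final newline that the pipe indicator adds:

```python
import re

body = r"(?x: \\ u [0-9a-fA-F]{4} )"  # scoped extended group
with_newline = body + "\n"             # what a "|" block scalar hands over
sample = r"\u00e9"                     # backslash, u, four hex digits

# The newline sits outside the (?x:...) group, so it must match literally
print(re.search(with_newline, sample))

# Without the trailing newline the escape sequence is matched
print(re.search(body, sample))
```

With a global `(?x)` at the start of the pattern, the trailing newline would be ignored as whitespace, which is why that form kept working.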


#23

I’ll see what I can do about this issue, given it’s caused by convert_syntax.py.


#24

- match: |-
    (?x: # turn on extended mode
    \\ # a literal backslash
    (?: # ...followed by...
    ["\\/bfnrt] # one of these characters
    | # ...or...
    u # a u
    [0-9a-fA-F]{4} # and four hex digits
    )
    )

The trailing hyphen tells YAML not to append the trailing newline.


#25

Thanks for the pointer, I wasn’t aware. I’ll update convert_syntax.py, and fix the too eager leading whitespace stripping at the same time.


#26

Anyone seeing the “Unable to fetch update url contents” on OS X 10.10.3? I’m unable to update automatically since build 3084.

Thanks, Akos


#27

The minus says not to include the final newline; however if you also wish to have it ignore the ones inside, use “>-” instead of “|-”. I think it doesn’t matter if you already have (?x) on, but it’s worth noting since >- will introduce no whitespace at all.


#28

[quote="huot25"]Have any of you experienced any performance differences between using a regex with a bunch of keywords vs doing each one written out?

\b(?i:AND|OR|NOT|TRUE|FALSE|SELECT|INPUT)\b

vs.

\b(?i:AND)\b
\b(?i:OR)\b

[/quote]

Performance with Oniguruma will be fundamentally similar between them, however I’d expect the single regex case to be faster.

I’d suggest not worrying about this case too much: sregex is currently falling back to Oniguruma for case-insensitive matching, however support will be added for it shortly. sregex is much more efficient than onig for these alternative heavy regexes: you can get a feel for this by removing the case insensitive flag and seeing the speed difference. For sregex, there is no speed difference between one large regex vs multiple smaller ones.


#29

Yup, that is it. Thanks.

[quote]The minus says not to include the final newline; however if you also wish to have it ignore the ones inside, use “>-” instead of “|-”. I think it doesn’t matter if you already have (?x) on, but it’s worth noting since >- will introduce no whitespace at all.
[/quote]

Also, good to know. I still don’t use YAML enough to remember these differences about multi-line strings. This is a good reminder.


#30

This is not exactly correct: > folds the lines and replaces single line breaks with a space, whereas >- additionally removes the trailing newline. There are a few more special cases, however. (yaml.org/spec/1.2/spec.html#id2796251)

[code]import yaml

s = """
  This is
  a folded
  string ;)
  With no

  trailing newline
  but
      leading spaces
  sometimes

"""

# safe_load is enough here; yaml.load needs an explicit Loader in PyYAML 6+
d = yaml.safe_load(">" + s)

print(d, "tail")
print()

d = yaml.safe_load(">-" + s)

print(d, "tail")
[/code]

[code]This is a folded string ;) With no
trailing newline but
    leading spaces
sometimes
 tail

This is a folded string ;) With no
trailing newline but
    leading spaces
sometimes tail[/code]

Anyway, since the converter tool preserves the original string it needs to respect the newlines and include them in the converted representation, so > is not really an option.


#31

Thanks, I described its effect incorrectly. It’s the version I’d come to prefer since it’s never surprised me, but I always use (?x) flag for multiline anyway since I end up using spaces for regex readability, so I hadn’t given much thought to the fact that newline=space with this case.


#32

[quote="jps"]
I’d suggest not worrying about this case too much: sregex is currently falling back to Oniguruma for case-insensitive matching, however support will be added for it shortly. sregex is much more efficient than onig for these alternative heavy regexes: you can get a feel for this by removing the case insensitive flag and seeing the speed difference. For sregex, there is no speed difference between one large regex vs multiple smaller ones.[/quote]

Thanks Jon! It’s good to know I am not crazy! I spent a few hours over the weekend trying to find the best-performing option and was not seeing any difference between onig and sregex, because it was falling back to onig since my regexes used case-insensitive matching. Looking forward to having case-insensitive matching in one of the future releases. I removed the flag and the files load instantly vs. a 4 second load time.


#33

Hi,

I find the new sublime-syntax format really nice.
I always wanted to change the syntax highlighting for Scala, but I never wanted to dive into the tmLanguage format.
Now I can do it!

I started something on github at https://github.com/gwenzek/scalaSublimeSyntax if you want to have a look.

I was wondering if there is a way to match all Unicode letters? Something like match: [a-zA-Z], but that would also catch éèà…


#34

Gwenzek, you can use the “core” Unicode character property classes with \p{NAME_OF_PROP}.

So for example, matching any letters, regardless of case or alphabet, would be \p{L}.
And all Latin letters, regardless of case (and including characters with diacritics) would be \p{Latin}.

See geocities.jp/kosako3/oniguruma/doc/RE.txt , section 3 for more info, or check out the Unicode homepage to figure out exactly what’s included in a given character property class.
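
To get a feel for what the general category behind \p{L} covers, Python's standard `unicodedata` module can be queried. This only illustrates the Unicode categories themselves, not the regex engine; `is_letter` is a made-up helper for the example:

```python
import unicodedata


def is_letter(ch):
    # General categories Lu, Ll, Lt, Lm and Lo all start with "L"
    return unicodedata.category(ch).startswith("L")


for ch in "aéèàЖ1_":
    print(ch, unicodedata.category(ch), is_letter(ch))
```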


#35

Thanks for the link, just what I needed!
