Sublime Forum

Fixing XML.tmLanguage and XSL.tmLanguage

#1

I’ve been looking at .tmLanguage files and reading chapter 12 on “Language Grammars” in the TextMate manual. What I’m preparing for is to create syntax highlighting for SVG and OWL files.

However, in looking for example patterns in the XML.tmLanguage and XSL.tmLanguage files, located in **~\AppData\Roaming\Sublime Text 2\Packages\XML**, I have concluded that the basic syntax definitions in both those files are seriously flawed.

For instance, the first match pattern in the XML file is apparently describing the beginning of the XML declaration as well as processing instructions:

<key>begin</key> <string>(&lt;\?)\s*([-_a-zA-Z0-9]+)</string>
I translate this regex as “Begin with a left angle bracket and a question mark, then optional whitespace, followed by one or more characters, each of which may be an upper- or lower-case letter, a digit, a hyphen or an underscore.”
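
That reading is easy to check by running the pattern against a few openings. Here is a minimal sketch in Python; its `re` module is only a stand-in for Oniguruma (the engine the TextMate manual names), which I am assuming behaves the same for a pattern this simple. The function name `pi_target` and the sample strings are made up for illustration:

```python
import re

# The begin pattern from XML.tmLanguage, with the plist escape &lt; written as <.
PI_BEGIN = re.compile(r"(<\?)\s*([-_a-zA-Z0-9]+)")


def pi_target(text):
    """Return the name captured by group 2, or None if the pattern does not match."""
    m = PI_BEGIN.match(text)
    return m.group(2) if m else None


# Hypothetical sample openings: one well-formed, three that are not.
for sample in ["<?xml version", "<? XML version", "<?-xml version", "<?30 version"]:
    print(repr(sample), "->", pi_target(sample))
```

Every sample that yields a name rather than `None` is one the grammar would scope as a processing instruction or XML declaration.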

The following are all allowed by this pattern but are not well-formed XML:

[code]<? XML …
<?-xml ...
<?30 ...[/code]

Similarly, the match pattern for XML elements is wacky, not only allowing element names to begin with a hyphen but also not permitting periods to appear in element names at all. And since **?** [question mark] is not a wildcard but refers to whatever token precedes it, what is **((?:** supposed to mean in the pattern for any XML element name:

[code](<)((?:([-_a-zA-Z0-9]+)((:)))?([-_a-zA-Z0-9:]+))(?=(\s[^>]*)?></\2>)[/code]

Of course, anyone writing XML ought to be relying on an XML parser and not syntax highlighting to get things right. Still, the syntax highlighter needs to get the scope selectors right or the colors won't be right.

Being new to plist files and only moderately fluent in regexes, I don't propose to write the definitive XML language grammar for Sublime Text (and TextMate). Nonetheless, I can offer this match pattern in place of the existing one for processing instructions and the XML declaration:

[code]
<key>begin</key>
<string>(&lt;\?)(((X|x)(M|m)(L|l))|(([_a-zA-Z0-9]+)([-_a-zA-Z0-9:\.]*)))\s+</string>
<key>captures</key>
<dict>
    <key>1</key>
    <dict>
        <key>name</key>
        <string>markup.other.xml.tag.begin</string>
    </dict>
    <key>2</key>
    <dict>
        <key>name</key>
        <string>markup.other.xml.pi</string>
    </dict>
</dict>
[/code]

(I chose names that seem to fit within the naming conventions better than the existing ones.)

Can anyone here improve this and suggest fixes for other mal-engineered match patterns in XML?

Thx, Rgr

#2

(?:something) creates a non-capturing group.
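
A quick illustration of the difference, sketched in Python (its `re` module supports the same `(?:...)` syntax). The sample tag and the simplified name patterns are my own, not taken from XML.tmLanguage:

```python
import re

text = "<svg:rect"  # hypothetical sample input

# Capturing: the optional prefix-plus-colon group gets its own number.
capturing = re.match(r"<(([-\w]+):)?([-\w]+)", text)

# Non-capturing: (?:...) still groups the tokens so the trailing ? applies
# to all of them, but the group takes no number of its own.
non_capturing = re.match(r"<(?:([-\w]+):)?([-\w]+)", text)

print(capturing.groups())      # ('svg:', 'svg', 'rect')
print(non_capturing.groups())  # ('svg', 'rect')
```

So in the grammar, the **?** after the closing parenthesis makes the whole namespace prefix optional, and the capture numbers used for scope names skip over the `(?:` group.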

I would consider the pros and cons of modifying current scope-names, particularly if you intend to create new names (or parts thereof). Specific theme files would need to be created to take advantage of these, and existing themes might apply the wrong (undesired) scope-colour.

Although theme colours should not be relied upon to spot malformed XML, the ‘invalid’ scope-name-part can be used helpfully to indicate some obvious errors.

#3

I’m relieved to know this is a valid construct and not surprised at all that there is even more overloading of symbols in regexes than I’m aware of. Thanks for pointing this out.

I guess the real question is why comments aren’t included in any of the .tmLanguage files I’ve checked. Since regexes are already notorious for being unreadable, I’ll just note that the unfamiliarity of scope selectors and the awkwardness of plists are like piling on, for those of us new-to-the-concept.

You’re right to point out the disadvantages of changing scope-names for an established package. On the other hand, “punctuation” is an invented group when there are already 11 root groups described in the naming conventions.

Can one assign multiple names to a scope selector?

I see that the scope selector for the Document Type Declaration (e.g., DOCTYPE) is given different names in the HTML and XML packages (although they both make use of “punctuation” to describe angle brackets). How are themes expected to handle that kind of situation?

And to echo your sentiment about syntax coloring, I definitely fall among those who rely on the scope selectors to reject my typos and other such obvious errors.

Again, thanks for the information.


PS: In the SVG.tmLanguage file I’ve begun, I’ve gone back and added comments explaining just what each match pattern is intended to match.

#4

If you want to see what scope names are targeted by color schemes, look for .tmTheme files inside the Packages folder. Given that scope names are arbitrary, color schemes can’t possibly cover all the cases, but partial matches are possible (so “punctuation.foo” and “punctuation.bar” would get styled in the same way). (Although I don’t think any default .tmTheme targets “punctuation”, btw.)
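
To make the partial-match idea concrete, here is a rough sketch in Python. It models only the simplest case, a single dotted name matched part by part from the left; real scope selectors also allow descendant, grouped and excluded scopes, which this ignores. The function name `selector_matches` is hypothetical:

```python
def selector_matches(selector: str, scope: str) -> bool:
    """Return True if selector is a dot-wise prefix of scope (simplified model)."""
    sel_parts = selector.split(".")
    scope_parts = scope.split(".")
    # Compare whole dotted parts, so "punctuation" does not match "punctuationx".
    return scope_parts[:len(sel_parts)] == sel_parts


print(selector_matches("punctuation", "punctuation.foo"))   # True
print(selector_matches("punctuation", "punctuation.bar"))   # True
print(selector_matches("punctuation", "punctuationx.foo"))  # False
print(selector_matches("punctuation.foo", "punctuation"))   # False
```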

#5

More information here:

docs.sublimetext.info/en/latest/ … xdefs.html
docs.sublimetext.info/en/latest/ … xdefs.html

#6

Multi-part names can be applied, but not completely different scopes, because only the first match will “stick”.

If you create a new scope-name part, it is advisable to make it, perhaps, the third item, as in ‘existing.name.yours.other’. This way, it will at least be caught by the first two parts in existing themes.
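
Roughly, the fallback works like the Python sketch below. It assumes the most specific matching selector wins and ignores everything else a real theme does; the `THEME` table, its colours and the function name `colour_for` are invented for illustration, not taken from any .tmTheme:

```python
from typing import Optional

# Hypothetical theme rules: selector -> colour.
THEME = {
    "markup": "#808080",
    "markup.other": "#3366cc",
}


def colour_for(scope: str, theme: dict) -> Optional[str]:
    """Return the colour of the longest dotted prefix of scope found in theme."""
    parts = scope.split(".")
    # Try the full name first, then drop trailing parts one at a time.
    for end in range(len(parts), 0, -1):
        candidate = ".".join(parts[:end])
        if candidate in theme:
            return theme[candidate]
    return None


print(colour_for("markup.other.svg.tag", THEME))  # #3366cc
print(colour_for("svg.markup.other", THEME))      # None
```

With the new part in third place the name still picks up the ‘markup.other’ colour; with it in first place, nothing in this theme applies.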

Copy an existing theme and modify it as you progress: it will be very simple to check your syntax by applying distinctive, temporary, colours. You might also consider doing this for both a light and dark theme.

Also bear in mind that sometimes the simplest solution is to move a section/scope within the syntax file :wink:

**ScopeHunter**, available via Package Control, should also prove very useful. Andy.
