The Microsoft code name "M" Modeling Language Specification - Lexical Structure

November 2009

[This content is no longer valid. For the latest information on "M", "Quadrant", SQL Server Modeling Services, and the Repository, see the Model Citizen blog.]

[This documentation targets the Microsoft SQL Server Modeling CTP (November 2009) and is subject to change in future releases. Blank topics are included as placeholders.]

Sections:
1: Introduction to "M"
2: Lexical Structure
3: Text Pattern Expressions
4: Productions
5: Rules
6: Languages
7: Types
8: Computed and Stored Values
9: Expressions
10: Module
11: Attributes
12: Catalog
13: Standard Library
14: Glossary

2 Lexical Structure

An M program consists of one or more source files, known formally as compilation units. A compilation unit is an ordered sequence of Unicode characters. Compilation units typically have a one-to-one correspondence with files in a file system, but this correspondence is not required. For maximal portability, it is recommended that files in a file system be encoded as UTF-8. The byte order mark is optional.

Any Unicode code point with either of the following properties:

  • General_Category=Surrogate
  • Noncharacter_Code_Point=True

is disallowed from occurring literally in M program text, as is any code point with General_Category=Control unless it matches WhitespaceCharacter. The notation “none of” implicitly excludes any of these disallowed characters.
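The three exclusion rules above can be sketched as a small check. This is an illustration only, not part of the M toolchain: the function names are hypothetical, and the whitespace exception is taken from the WhitespaceCharacter and NewLineCharacter tokens defined in §2.2.1.

```python
import unicodedata

# Characters admitted by WhitespaceCharacter / NewLineCharacter (section 2.2.1).
# Several of them are controls, so they are exempt from the Control rule.
WHITESPACE_CHARACTERS = set(
    "\u0009\u000B\u000C\u0020\u000A\u000D\u0085\u2028\u2029"
)


def is_noncharacter(code_point):
    """True for the 66 Unicode noncharacters (Noncharacter_Code_Point=True)."""
    return 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE


def is_allowed_in_source(ch):
    """Hypothetical helper: may this character occur literally in M text?"""
    category = unicodedata.category(ch)
    if category == "Cs" or is_noncharacter(ord(ch)):
        return False  # surrogate or noncharacter
    if category == "Cc" and ch not in WHITESPACE_CHARACTERS:
        return False  # control character that is not whitespace
    return True
```

Under these assumptions a horizontal tab is accepted, while U+0000 and U+FDD0 are rejected.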

Rather than define a separate meta-language, this specification uses syntax, token, and interleave rules to define the M language. The concepts are introduced in §1.2 and defined in §5.

2.1 Pre-Processing

Pre-processing directives provide the ability to conditionally skip sections of source files, as a separate pre-processing step.

token PPDirective
    = PPDeclaration
    | PPConditional;

The tokens PPDeclaration and PPConditional are not yet defined in this specification.

The following pre-processing directives are available:

  • #define which is used to define conditional compilation symbols.
  • #if, #else, and #endif, which are used to conditionally skip sections of source code.

A pre-processing directive always occupies a separate line of source code and always begins with a # character and a pre-processing directive name. White space may not occur before the # character or between the # character and the directive name.

A source line containing a #define, #if, #else, or #endif directive may end with a single-line comment. Delimited comments (the /* */ style of comments) are not permitted on source lines containing pre-processing directives. All #define directives have to appear before the first #if directive.

Pre-processing directives are neither tokens nor part of the syntactic grammar of M. However, pre-processing directives can be used to include or exclude sequences of tokens and can in that way affect the meaning of an M program. For example, pre-processing the source text:

#define A
type C
{
#if A
    F {}
#else
    G {}
#endif
#if B
    H {}
#else
    I {}
#endif
}

results in the exact same sequence of tokens as the source text:

type C
{
    F {}
    I {}
}

Thus the two programs are lexically quite different but syntactically identical.
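The conditional skipping described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the M compiler's pre-processor: the `preprocess` function is hypothetical, it handles only #define, #if, #else, and #endif with a single symbol operand, and it assumes well-formed input.

```python
def preprocess(source):
    """Return the source lines that survive conditional skipping."""
    defined = set()   # conditional compilation symbols seen in #define
    active = []       # one flag per open #if: is its current branch kept?
    kept = []
    for line in source.splitlines():
        if line.startswith("#"):
            # A directive line may end with a single-line comment.
            parts = line.split("//", 1)[0].split()
            name, operands = parts[0], parts[1:]
            if name == "#define":
                defined.add(operands[0])
            elif name == "#if":
                active.append(operands[0] in defined)
            elif name == "#else":
                active[-1] = not active[-1]
            elif name == "#endif":
                active.pop()
            else:
                raise ValueError("unknown directive: " + name)
            continue
        if all(active):  # kept only if every enclosing branch is kept
            kept.append(line)
    return kept


SOURCE = """#define A
type C
{
#if A
    F {}
#else
    G {}
#endif
#if B
    H {}
#else
    I {}
#endif
}
"""
```

Applied to `SOURCE`, the sketch keeps the lines `type C`, `{`, `F {}`, `I {}`, and `}`, matching the second source text shown above.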

2.1.1 Directives

2.2 Lexical Analysis

2.2.1 Whitespace

Whitespace is defined as any character with Unicode class Zs (which includes the space character), as well as the horizontal tab character, the vertical tab character, the form feed character, and the new line characters.

interleave Whitespace
    = WhitespaceCharacter+;

token WhitespaceCharacter
    = "\u0009" // Horizontal Tab
    | "\u000B" // Vertical Tab
    | "\u000C" // Form Feed
    | "\u0020" // Space
    | NewLineCharacter;

token NewLineCharacter
    = "\u000A" // New Line
    | "\u000D" // Carriage Return
    | "\u0085" // Next Line
    | "\u2028" // Line Separator
    | "\u2029"; // Paragraph Separator

2.2.2 Comments

Two forms of comments are supported: single-line comments and delimited comments. Single-line comments start with the characters // and extend to the end of the source line. Delimited comments start with the characters /* and end with the characters */. Delimited comments may span multiple lines.

interleave Comment
    = CommentDelimited
    | CommentLine;

token CommentDelimited
    = "/*" CommentDelimitedContent* "*/";

token CommentDelimitedContent
    = !("*")
    | "*"  !("/");

token CommentLine
    = "//" CommentLineContent*;

token CommentLineContent
    = !NewLineCharacter;

Comments do not nest. The character sequences /* and */ have no special meaning within a // comment, and the character sequences // and /* have no special meaning within a delimited comment.

Comments are not processed within Text literals.
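The non-nesting behavior can be illustrated with a regular expression. This is an approximation rather than a transcription of the grammar: the names `COMMENT` and `find_comments` are hypothetical, and the sketch does not track Text literals, so it would wrongly report comment markers that appear inside a literal.

```python
import re

# The five characters matched by NewLineCharacter.
NEW_LINE = "\n\r\u0085\u2028\u2029"

COMMENT = re.compile(
    r"/\*.*?\*/"                    # delimited: ends at the first */
    r"|//[^" + NEW_LINE + r"]*",    # single-line: runs to the end of the line
    re.DOTALL,
)


def find_comments(source):
    """Return every comment in source, in order of appearance."""
    return [match.group(0) for match in COMMENT.finditer(source)]
```

For the input `a /* x /* y */ z */ b` the only comment found is `/* x /* y */`; the trailing `z */ b` is ordinary program text, because the inner `/*` did not open a second comment.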

The example

// This defines a
// Person entity
//
type Person {
    Name : Text;
    Age : Number;
}

shows three single-line comments.

The example

/* This defines a
   Person entity
*/
type Person {
    Name : Text;
    Age : Number;
}

includes one delimited comment.

2.3 Identifiers

A regular identifier begins with a letter or underscore, followed by any sequence of letters, underscores, dollar signs, or digits. An escaped identifier is enclosed in @[ and ]. It contains any sequence of Text literal characters.

syntax Identifier
    = id:IdentifierVerbatim
    | id:IdentifierName;

token IdentifierBegin = "_" | Letter;

token IdentifierCharacter
    = IdentifierBegin
    | "$"
    | DecimalDigit;

token IdentifierCharacters
    = IdentifierCharacter+;

token IdentifierVerbatim
    = "@[" IdentifierVerbatimCharacters "]";

token IdentifierVerbatimCharacter
    = !( "]" )
    | IdentifierVerbatimEscape;

token IdentifierVerbatimCharacters
    = IdentifierVerbatimCharacter+;

token IdentifierVerbatimEscape
    = "\\\\" | "\\]";
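The two identifier forms can be checked with the following sketch. It rests on assumptions: the function names are hypothetical, Letter is approximated by Python's `str.isalpha`, and the regular-identifier check does not exclude keywords (see §2.4).

```python
import re

# "@[" then one or more characters or escapes (\\ or \]) then "]".
VERBATIM = re.compile(r"@\[(?:\\\\|\\\]|[^\]])+\]")


def is_regular_identifier(text):
    """Letter or underscore first, then letters, underscores, $, or digits."""
    def is_begin(c):
        return c == "_" or c.isalpha()

    def is_character(c):
        return is_begin(c) or c == "$" or c in "0123456789"

    return bool(text) and is_begin(text[0]) and all(
        is_character(c) for c in text[1:]
    )


def is_verbatim_identifier(text):
    """True if text has the @[...] form, with \\] allowed inside."""
    return VERBATIM.fullmatch(text) is not None
```

For example, `_total$1` and `@[type]` are accepted, while `1abc`, `$x`, and the empty form `@[]` are rejected.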

2.4 Keywords

A keyword is an identifier-like sequence of characters that is reserved, and cannot be used as an identifier except when escaped with '@[]'.

Keywords:

any
accumulate
by
empty
equals
error
export
false
final
from
group
id
identity
import
in
interleave
join
language
labelof
left
let
module
null
precedence
right
select
syntax
token
true
type
unique
value
valuesof
where

The following keywords are reserved for future use:

checkpoint identifier nest override new virtual partial

2.5 Operators and punctuators

There are several kinds of operators and punctuators. Operators are used in expressions to describe operations involving one or more operands. For example, the expression a + b uses the + operator to add the two operands a and b. Punctuators are for grouping and separating.

[  ]  (  )  .  ,  :  ;  ?  =  <  >  <=  >=  ==  !=  +  -  *  /  %  &  |  !  &&  ||  ~  <<  >>  {  }  #  ..  @  '  "  ??  =>  ::

2.6 Literals

A literal is a source code representation of a value.

syntax Literal
    = Binary
    | Date
    | DateTime
    | DateTimeOffset
    | Decimal
    | Guid
    | Integer
    | Logical
    | Null
    | Scientific
    | Text
    | Time;

2.6.1 Date literals

Date literals are used to write a date independent of a specific time of day.

token Date
    = Sign? DateYear "-" DateMonth "-" DateDay;

token DateDay
    = "01" | "02" | "03" | "04" | "05" | "06" | "07" | "08" | "09" | "10"
    | "11" | "12" | "13" | "14" | "15" | "16" | "17" | "18" | "19" | "20"
    | "21" | "22" | "23" | "24" | "25" | "26" | "27" | "28" | "29" | "30"
    | "31";

token DateMonth
    = "01" | "02" | "03" | "04" | "05" | "06" | "07" | "08"
    | "09" | "10" | "11" | "12";

token DateYear
    = DecimalDigit DecimalDigit DecimalDigit DecimalDigit;

The type of a Date literal is Date.

  • 0001-01-01 is the representation of January 1st, 1 AD.
  • There is no year 0; therefore ‘0000-01-01’ is not a valid Date literal.
  • -0001-01-01 is the representation of January 1st, 1 BC.

Examples of Date literals follow:

0001-01-01
2008-08-14
-1184-03-01
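The Date token and the year-0 rule can be sketched as a small recognizer. The function name is hypothetical, and the sketch mirrors only what the grammar states: it checks the shape of the literal and rejects year 0000, but it does not validate days against the calendar, so a literal such as 2008-02-31 still matches.

```python
import re

# Sign? DateYear "-" DateMonth "-" DateDay
DATE = re.compile(
    r"([+-]?)([0-9]{4})-(0[1-9]|1[0-2])-(0[1-9]|[12][0-9]|3[01])"
)


def parse_date_literal(text):
    """Return (year, month, day), or None if text is not a Date literal."""
    match = DATE.fullmatch(text)
    if match is None:
        return None
    sign, year, month, day = match.groups()
    if int(year) == 0:
        return None  # there is no year 0
    signed_year = -int(year) if sign == "-" else int(year)
    return (signed_year, int(month), int(day))
```

With this sketch, `-1184-03-01` yields a negative year, and `0000-01-01` is rejected.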

2.6.2 DateTime literals

DateTime literals are used to write a time of day on a specific date independent of time zone.

token DateTime
    = Date "T" Time;

The type of a DateTime literal is DateTime.

Examples of DateTime literals follow:

2008-08-14T13:13:00
0001-01-01T00:00:00
2005-05-19T20:05:00

2.6.3 DateTimeOffset literals

DateTimeOffset literals are used to write a time of day on a specific date within a specific time zone.

token DateTimeOffset = Date "T" Time TimeZone;

token TimeZone
    = Sign OffsetTimeHourMinute
    | "Z";

token OffsetTimeHour
    = "00" | "01" | "02" | "03" | "04" | "05" | "06" | "07" | "08" | "09"
    | "10" | "11" | "12" | "13" | "14";

token OffsetTimeHourMinute
    = OffsetTimeHour ":" TimeMinute;

The type of a DateTimeOffset literal is DateTimeOffset.

Examples of DateTimeOffset literals follow:

2008-08-14T13:13:00+06:00
0001-01-01T00:00:00-03:00
2005-05-19T20:05:00Z

2.6.4 Decimal literals

Decimal literals are used to write fixed-point or exact number values.

token Decimal = DecimalDigit+ "." DecimalDigit+;

Decimal literals default to the smallest standard library type that can contain the value. Examples of Decimal literals follow:

99.999
0.1
1.0

2.6.5 Guid literals

token Guid
    = "#" "["
       HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit "-"
       HexDigit HexDigit HexDigit HexDigit "-"
       HexDigit HexDigit HexDigit HexDigit "-"
       HexDigit HexDigit HexDigit HexDigit "-"
       HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit
       HexDigit HexDigit HexDigit HexDigit HexDigit HexDigit
       "]";

token HexDigit = "0".."9" | "a".."f" | "A".."F";

The type of a Guid literal is Guid.

Examples of Guid literals follow:

#[a0ee7e0f-c6ac-4c63-b57f-816a5259595a]
#[7fbc28ba-8205-45ca-983e-ece117f7a776]
#[a05e63ca-25de-43a6-bf70-0bc04d40a000]
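The 8-4-4-4-12 digit grouping of the Guid token can be sketched as follows. The function name is hypothetical; the regular expression enforces the exact shape from the grammar, because Python's `uuid.UUID` constructor on its own accepts several other spellings.

```python
import re
import uuid

# "#[" then hex digits grouped 8-4-4-4-12 then "]"
GUID = re.compile(
    r"#\[([0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})\]"
)


def parse_guid_literal(text):
    """Return a uuid.UUID, or None if text is not a Guid literal."""
    match = GUID.fullmatch(text)
    return uuid.UUID(match.group(1)) if match else None
```

Because HexDigit admits both cases, literals that differ only in letter case denote the same value in this sketch.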

2.6.6 Integer literals

Integer literals are used to write integral values.

token Integer
    = DecimalDigit+
    | Hexadecimal;

token DecimalDigit = "0".."9";

token Hexadecimal
    = "0x" HexDigit*
    | "0X" HexDigit*;

Integer literals written in decimal notation default to the smallest precision type that can contain the value, starting with Integer32. Hexadecimal literals default to the smallest precision type that can contain the value, starting with Unsigned32.

Examples of Integer literals follow:

0
123
999999999999999999999999999999
0x00
0XFF
0x1234
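The two notations and their different starting types can be sketched as below. Several points are assumptions and not statements of this specification: the function names are hypothetical, Integer32 is taken to be a signed 32-bit type and Unsigned32 an unsigned 32-bit type, and the sketch returns None for a bare `0x` prefix, which the grammar admits but for which no value is stated.

```python
import re

INTEGER = re.compile(r"0[xX](?P<hex>[0-9a-fA-F]*)|(?P<dec>[0-9]+)")


def parse_integer_literal(text):
    """Return the integer value, or None if no value can be determined."""
    match = INTEGER.fullmatch(text)
    if match is None:
        return None
    if match.group("dec") is not None:
        return int(match.group("dec"), 10)
    digits = match.group("hex")
    return int(digits, 16) if digits else None  # bare "0x": no value stated


def fits_starting_type(text):
    """Does the value fit the type the defaulting rule starts with?

    Assumes Integer32 holds up to 2**31 - 1 and Unsigned32 up to 2**32 - 1.
    """
    value = parse_integer_literal(text)
    if value is None:
        return None
    is_hex = text[:2] in ("0x", "0X")
    limit = 2**32 - 1 if is_hex else 2**31 - 1
    return value <= limit
```

Under these assumptions `0xFFFFFFFF` fits its starting type, while the same value written as `4294967295` does not and would need a wider type.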

2.6.7 Logical literals

Logical literals are used to write logical values.

syntax Logical
    = "true"
    | "false";

The type of a Logical literal is Logical.

Examples of Logical literals follow:

true
false

2.6.8 Null literal

The null literal is equal to no other value.

syntax Null
    = "null";

The type of a null literal is Null.

An example of the null literal follows:

null

2.6.9 Scientific literals

Scientific literals are used to write floating-point or inexact numbers.

token Scientific
    = Decimal "e" Sign? DecimalDigit+
    | Decimal "E" Sign? DecimalDigit+;

token Sign
    = "+"
    | "-";

Scientific literals default to the smallest precision type that can contain the value, starting with Double.

Examples of Scientific literals follow:

0.31416e+1
9.9999e-1
0.0E0

2.6.10 Text literals

M supports two forms of Text literals: regular Text literals and verbatim Text literals.

A regular Text literal consists of zero or more characters enclosed in either double quotes, as in "hello", or in single quotes, as in 'hello', and may include both simple escape sequences (such as \t for the tab character), and hexadecimal and Unicode escape sequences.

A verbatim Text literal consists of an @ character followed by a double-quote character, zero or more characters, and a closing double-quote character (or likewise using only single-quote characters). Simple examples are @"hello" and @'hello'. In a verbatim Text literal, the characters between the delimiters are interpreted exactly as they occur in the compilation unit, the only exception being a quote escape sequence (DoubleQuoteTextVerbatimCharacterEscape or SingleQuoteTextVerbatimCharacterEscape), in which the delimiting quote character is written twice. In particular, simple escape sequences, and hexadecimal and Unicode escape sequences are not processed in verbatim Text literals. A verbatim Text literal may span multiple lines.

token Text
    = RegularStringLiteral
    | VerbatimStringLiteral;

token RegularStringLiteral
    = '"' DoubleQuoteTextCharacter* '"'
    | "'" SingleQuoteTextCharacter* "'";

token VerbatimStringLiteral
    = '@' '"' DoubleQuoteTextVerbatimCharacter* '"'
    | '@' "'" SingleQuoteTextVerbatimCharacter* "'";

token SingleQuoteTextCharacter
    = SingleQuoteTextSimple
    | CharacterEscapeSimple
    | CharacterEscapeUnicode;

token SingleQuoteTextSimple
    = !( '\u0027'  // Single Quote
       | '\u005C'  // Backslash
       | NewLineCharacter );

token SingleQuoteTextVerbatimCharacter
    = !('\u0027')   // Single Quote
    | SingleQuoteTextVerbatimCharacterEscape;

token SingleQuoteTextVerbatimCharacterEscape = '\u0027' '\u0027';

token SingleQuoteTextVerbatimCharacters = SingleQuoteTextVerbatimCharacter+;

token DoubleQuoteTextCharacter
    = DoubleQuoteTextSimple
    | CharacterEscapeSimple
    | CharacterEscapeUnicode;

token DoubleQuoteTextSimple
    = !( '\u0022'  // Double Quote
       | '\u005C'  // Backslash
       | NewLineCharacter );

token DoubleQuoteTextVerbatimCharacter
    = !('\u0022')   // Double Quote
    | DoubleQuoteTextVerbatimCharacterEscape;

token DoubleQuoteTextVerbatimCharacterEscape = '\u0022' '\u0022';

token DoubleQuoteTextVerbatimCharacters = DoubleQuoteTextVerbatimCharacter+;

token NewLineCharacter
    = '\u000A'  // New Line
    | '\u000D'  // Carriage Return
    | '\u0085'  // Next Line
    | '\u2028'  // Line Separator
    | '\u2029'; // Paragraph Separator

token CharacterEscapeSimple = '\u005C' CharacterEscapeSimpleCharacter;

token CharacterEscapeSimpleCharacter
    = "'"      // Single Quote
    | '"'      // Double Quote
    | '\u005C' // Backslash
    | '0'      // Null
    | 'a'      // Alert
    | 'b'      // Backspace
    | 'f'      // Form Feed
    | 'n'      // New Line
    | 'r'      // Carriage Return
    | 't'      // Horizontal Tab
    | 'v';     // Vertical Tab

token CharacterEscapeUnicode
    = "\\u"  HexDigit#4
    | "\\U"  HexDigit#8;

The value of the hexadecimal number in a CharacterEscapeUnicode must be between 0 and 10FFFF, and must not be a surrogate (that is, must not be between D800 and DFFF).

The type of a Text literal is Text.
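How the two literal forms map to Text values can be sketched with a small decoder. This is an illustration under assumptions: the function name is hypothetical, the input is assumed to be a well-formed literal including its delimiters, and the control characters chosen for \0, \a, \b, \f, \n, \r, \t, and \v follow the names in the grammar comments.

```python
import string

# CharacterEscapeSimpleCharacter -> the character it denotes.
SIMPLE_ESCAPES = {
    "'": "'", '"': '"', "\\": "\\", "0": "\0", "a": "\a", "b": "\b",
    "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v",
}


def decode_text_literal(literal):
    """Return the Text value of a regular or verbatim Text literal."""
    if literal.startswith("@"):
        quote, body = literal[1], literal[2:-1]
        return body.replace(quote * 2, quote)  # doubled quote -> one quote
    body, out, i = literal[1:-1], [], 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        kind = body[i + 1]
        if kind in ("u", "U"):
            width = 4 if kind == "u" else 8
            digits = body[i + 2:i + 2 + width]
            if len(digits) != width or any(
                d not in string.hexdigits for d in digits
            ):
                raise ValueError("expected %d hex digits" % width)
            value = int(digits, 16)
            if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
                raise ValueError("escape is out of range or a surrogate")
            out.append(chr(value))
            i += 2 + width
        else:
            out.append(SIMPLE_ESCAPES[kind])  # KeyError: unknown escape
            i += 2
    return "".join(out)
```

For instance, the regular literal `"\u0026"` decodes to the single character `&`, and the verbatim literal `@"""Hello World"""` decodes to `"Hello World"` including the quotes.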

Examples of Text literals follow:

"a"
"\u2323"
"Hello World"
@"""Hello World"""

The following values are all equal:

"&"
'&'
"\u0026"
"\U00000026"

2.6.11 Time literals

Time literals are used to write a time of day independent of a specific date or time zone.

token Time
    = TimeHourMinute ":" TimeSecond;

token TimeHour
    = "00" | "01" | "02" | "03" | "04" | "05" | "06" | "07" | "08" | "09"
    | "10" | "11" | "12" | "13" | "14" | "15" | "16" | "17" | "18" | "19"
    | "20" | "21" | "22" | "23";

token TimeHourMinute
    = TimeHour ":" TimeMinute;

token TimeMinute
    = "0" DecimalDigit
    | "1" DecimalDigit
    | "2" DecimalDigit
    | "3" DecimalDigit
    | "4" DecimalDigit
    | "5" DecimalDigit;

token TimeSecond
    = "0" DecimalDigit TimeSecondDecimalPart?
    | "1" DecimalDigit TimeSecondDecimalPart?
    | "2" DecimalDigit TimeSecondDecimalPart?
    | "3" DecimalDigit TimeSecondDecimalPart?
    | "4" DecimalDigit TimeSecondDecimalPart?
    | "5" DecimalDigit TimeSecondDecimalPart?;

token TimeSecondDecimalPart = "." DecimalDigit+;

Examples of Time literals follow:

11:30:00
01:01:01.111
13:13:00
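The Time grammar can be sketched as a recognizer that also extracts the components. The function name is hypothetical, and representing the seconds as a `decimal.Decimal` is a choice made for this illustration so that the optional fractional part is kept exactly.

```python
import re
from decimal import Decimal

# TimeHour ":" TimeMinute ":" TimeSecond, with an optional fractional part.
TIME = re.compile(
    r"([01][0-9]|2[0-3]):([0-5][0-9]):([0-5][0-9](?:\.[0-9]+)?)"
)


def parse_time_literal(text):
    """Return (hour, minute, second), or None if text is not a Time literal."""
    match = TIME.fullmatch(text)
    if match is None:
        return None
    hour, minute, second = match.groups()
    return (int(hour), int(minute), Decimal(second))
```

The hour 24 and a literal without a seconds field, such as `13:13`, are both rejected, as the grammar requires.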