Unicode: The Programmer’s Alphabet
When someone is thinking about learning to program, the first question is usually one of these:
- What programming language should I learn first?
- What is the easiest programming language for a beginner?
- I’d like to learn this particular language, where should I start?
- And so on…
Every answer is influenced by the preferences of the programmer who gives it, and the answers vary so much that beginners can quickly get stuck on their first attempt to find a concise answer and a good starting point for the learning journey.
But wait, a programming language is itself a language! Programming languages borrow their structure from a natural language (mostly English) and have a very small and strict set of rules compared with that natural language. Because of this, once you know one programming language deeply, it is easier to learn another. The more programming languages you know, the easier it is to learn a new one. This happens because you can link what you know across the different programming languages, and more specifically, across the features that all programming languages, and hence natural languages, share.
So, what are those common features? The grammar. It sounds boring, and it is, because learning a new natural language is considered a very hard, long-term task 😔. Don’t worry, this article is focused on programming 😅, and you only need a quick look at a few grammar concepts 🙏🏻.
Before continuing, take into account:
- Natural languages: English, Spanish, French, Japanese… have their own grammar and all of those grammars have a very similar structure.
- Programming languages: Java, JavaScript, TypeScript, Python, C, C++, C#, Ruby, PHP, Go… have their own grammar as well, but it is a very small subset of the grammar of the natural language they are based on (mostly English). Hence, they share a common grammar structure.
You can figure out why the English language is crucial for programmers, not only for documentation, communication or collaboration, but for coding itself.
Let’s see what a natural language grammar structure looks like in the next diagram. Note that this is only a simplified diagram focused on programming; I apologize if you are an expert in linguistics. I know there is more complexity behind the scenes, but I think this is enough for the article’s scope.
As you can see, there are two sub grammars:
- Lexical grammar: the purpose of the lexical grammar is to serve the necessary elements to the syntax grammar. If you don’t follow the lexical rules to construct words or you provide non-existent alphabet or dictionary elements, you will get a lexical error, a.k.a. orthographic error.
- Syntax grammar: the purpose of the syntax grammar is to produce syntactically correct phrases with the elements served by the lexical grammar. If you don’t follow the syntax rules you will get a syntax error. Note that only syntax correctness is analyzed here. Syntax grammar doesn’t worry about the semantic meaning of the produced phrase.
The purpose of any grammar is to produce syntactically and semantically correct phrases. For example, the phrase “My house is a girl.” is syntactically correct, but semantically it makes no sense.
Syntax grammar has only one type of element, phrases, and its own ruleset to construct them. Lexical grammar, on the other hand, needs a bit more clarification:
- Alphabet: a set that contains the smallest units of information and forms part of the repertoire used by the syntax grammar. Every element in this set is also known as a character. There are several types of characters depending on their purpose: letters (to create dictionary words), numbers (0 to 9 in our decimal number system), separators (white space, new line…), punctuators (colon, semicolon, comma, plus, minus…) and symbols (currency, emoji…).
- Dictionary: a set that contains all the existing words of the language, built only from alphabet letters. Together with the alphabet, it completes the repertoire used by the syntax grammar. Obviously, to construct words, it is mandatory to follow the lexical rules (nouns, verbs, adjectives, prepositions, prefixes, suffixes…).
At this point, you can see how a natural language grammar works, and you can apply this to every natural language you know. But let’s see an example using the English and Spanish grammars before going further with the grammar structure of programming languages.
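The same gap between syntax and meaning shows up in code. As a minimal sketch, assuming Python as the example language (the article itself is language-agnostic), the snippet below parses a phrase that follows the syntax rules but fails as soon as it is evaluated, and then shows a phrase that never gets past the syntax rules:

```python
import ast

# Syntactically valid Python: a string literal, the + operator, an integer literal.
source = '"My house" + 1'

# Parsing succeeds because the phrase follows the syntax rules.
tree = ast.parse(source, mode="eval")
parsed_ok = isinstance(tree, ast.Expression)

# Evaluating it fails: adding a string and a number has no meaning in Python.
try:
    eval(compile(tree, "<example>", "eval"))
    outcome = "evaluated"
except TypeError as error:
    outcome = f"TypeError: {error}"

# A phrase that breaks the syntax rules is rejected before any evaluation.
try:
    ast.parse('"My house" + + ', mode="eval")
    syntax_outcome = "parsed"
except SyntaxError:
    syntax_outcome = "SyntaxError"

print(parsed_ok, outcome, syntax_outcome)
```

The first phrase is the code version of “My house is a girl.”: the grammar accepts it, but it means nothing.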
English grammar example:
Lexical grammar:
  Alphabet:
    Letters: [a-z] [A-Z]
    Numbers: [0-9]
    Separators: ' '
    Punctuators: ',' '.'
    Symbols: '£'
  Dictionary:
    Words: 'I' 'am' 'years' 'old' 'My' 'car' 'the' 'blue' 'one' 'was'
Syntactic grammar:
  Phrases: 'I am 45 years old.'
           'My car, the blue one, was £1000.'
The same example with Spanish grammar:
Lexical grammar:
  Alphabet:
    Letters: [a-z] [A-Z] [ñ,Ñ] [á,é,í,ó,ú] [Á,É,Í,Ó,Ú] [ü,Ü]
    Numbers: [0-9]
    Separators: ' '
    Punctuators: ',' '.'
    Symbols: '€'
  Dictionary:
    Words: 'Tengo' 'años' 'Mi' 'coche' 'el' 'azul' 'costó'
Syntactic grammar:
  Phrases: 'Tengo 45 años.'
           'Mi coche, el azul, costó 1000€.'
As you can see, you could continue with more and more examples in every natural language. Tough! Well, this is the main purpose of Unicode: to classify and unify the representation of every alphabet character of all natural languages, giving us a robust, reliable and standardized way to store, transmit and display text-based data. I’m planning to write an article covering the Unicode standard in depth, focused on programming.
Now it is time to talk about the grammar structure of programming languages. Take a look at the next diagram and compare it with the one above.
Apparently, things don’t change too much. And that’s true. Every concept applies to this grammar as well. But take into account that now not only letters but also other elements of the alphabet can be used to construct dictionary words, and there are two types of dictionary words:
- Predefined words: words specific to the programming language, a.k.a. reserved words or just keywords. They usually follow different lexical rules than user-defined words and, depending on the programming language and the word, can’t be redefined as a user-defined word. They usually consist of two or more characters (typically ASCII, the first block of Unicode’s Basic Multilingual Plane) and never contain separator characters. Examples: if, for, var, goto, import, using, do, while…
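To make that classification a bit more tangible, here is a small sketch in Python using the standard `unicodedata` module. Unicode assigns every character a general category, a two-letter code such as `Ll` (lowercase letter) or `Sc` (currency symbol). The `GROUPS` mapping from the first letter of that code to this article’s terms is my own simplification for illustration, not something the standard defines:

```python
import unicodedata

# Map the first letter of a Unicode general category to this article's terms.
# This mapping is an illustrative assumption, not part of the Unicode standard.
GROUPS = {
    "L": "letter",
    "N": "number",
    "Z": "separator",
    "P": "punctuator",
    "S": "symbol",
}

def classify(char: str) -> str:
    """Return the article-style group for a single character."""
    category = unicodedata.category(char)  # e.g. 'Ll', 'Nd', 'Zs', 'Po', 'Sc'
    return GROUPS.get(category[0], "other")

# Characters taken from the English and Spanish examples above.
for char in "añ4 ,.£€":
    print(repr(char), f"U+{ord(char):04X}", unicodedata.category(char), classify(char))
```

Unicode’s grouping does not always match the simplified one used here. For instance, it files '+' under math symbols (`Sm`) rather than punctuation, so treat the mapping as a rough guide.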
- User-defined words: words that programmers define in order to name variables, functions, methods, method arguments or parameters, classes, interfaces… Depending on the programming language, different types of characters are allowed (excluding separators, obviously), and a user-defined word can’t coincide with a reserved predefined word. Examples: myAge, add_numbers, doThisAction, $this, #myPrivate, _thisIsAField, Employee…
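As a quick illustration of the two kinds of dictionary words, the sketch below asks Python which group a word falls into, using the standard `keyword` module and `str.isidentifier()`. The helper name `kind_of_word` is made up for this example, and the results are specific to Python: other languages draw the lines differently (`$this` is fine in PHP, for instance).

```python
import keyword

def kind_of_word(word: str) -> str:
    """Classify a word by Python's own lexical rules."""
    if keyword.iskeyword(word):
        return "predefined word"
    if word.isidentifier():
        return "user-defined word"
    return "not a valid word in Python"

words = ["if", "while", "myAge", "add_numbers", "_thisIsAField",
         "años", "$this", "my age"]
for word in words:
    print(f"{word!r}: {kind_of_word(word)}")
```

Note that `años` is accepted as a user-defined word because Python allows Unicode letters in names, while `my age` is rejected because it contains a separator.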
That is how a compiler, interpreter, linter… works out the programmer’s intent and the purpose of the data and code provided.
Another difference is that two types of phrases (or sentences in this context) are defined: expressions and statements. Not every programming language differentiates between them or defines them. This could be a nice topic for another article because it is a little confusing and fun enough. I can’t leave them undefined, so here is a very simple introduction:
- Expressions: phrases (or sentences) that produce a value. Examples: 1 + 4, 3 > 5, 4 + 6 / 2, "My string"…
- Statements: phrases (or sentences) that perform an action from the point of view of the compiler, interpreter, linter... Examples: var myVar = "Hello", if (myNumber == 5), while (true)…
It is easy to see that an expression can contain several expressions and a statement can contain expressions, but in most languages expressions can’t contain statements.
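Python is one of the languages that draws this line explicitly, so it works well for a rough demonstration. In the sketch below, the hypothetical helper `phrase_kind` first tries to parse a phrase as a single expression (`mode="eval"`) and only then as statements (`mode="exec"`). The Python phrases are stand-ins for the examples above, since `var` is not Python syntax:

```python
import ast

def phrase_kind(source: str) -> str:
    """Return 'expression', 'statement' or 'invalid' for a line of Python."""
    try:
        ast.parse(source, mode="eval")  # accepts a single expression only
        return "expression"
    except SyntaxError:
        pass
    try:
        ast.parse(source, mode="exec")  # accepts any sequence of statements
        return "statement"
    except SyntaxError:
        return "invalid"

phrases = ['1 + 4', '3 > 5', '4 + 6 / 2', '"My string"',
           'my_var = "Hello"', 'while True: pass', 'my_var = = 1']
for source in phrases:
    print(f"{source!r}: {phrase_kind(source)}")

# An expression produces a value.
print(eval("4 + 6 / 2"))  # 7.0
```

The order of the two attempts matters: Python also accepts a bare expression as a statement, so checking `mode="exec"` first would label everything a statement.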
Let’s see a couple of examples:
Lexical grammar:
  Alphabet:
    Letters: [a-z] [A-Z]
    Numbers: [0-9]
    Separators: ' '
    Punctuators: '+' '=' '"' ';' '_'
    Symbols: n/a
  Dictionary:
    Predefined words: 'var'
    User-defined words: 'my_String'
Syntactic grammar:
  Phrases:
    Expressions: '4 + 7'
    Statements: 'var my_String = "Hello";'
Lexical grammar:
  Alphabet:
    Letters: [a-z] [A-Z]
    Numbers: [0-9]
    Separators: ' '
    Punctuators: '(' '-' '/' '!' '=' ';' '<' '+' ')' '{' '}'
    Symbols: n/a
  Dictionary:
    Predefined words: 'for' 'int'
    User-defined words: 'i' 'count'
Syntactic grammar:
  Phrases:
    Expressions: '((25 - 4) / 2) != 9'
    Statements: 'for (int i = 0; i < count; i++) { }'
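To see the lexical side of these two examples at work, here is a toy lexer written in Python. It is only a sketch for an imaginary language that contains just the pieces used above: the names `PREDEFINED_WORDS`, `TOKEN_PATTERN` and `tokenize` are mine, and treating a whole quoted string as a single element is a simplifying assumption. Real compilers and interpreters are far more elaborate.

```python
import re

# Hypothetical toy language: only the pieces used in the examples above.
PREDEFINED_WORDS = {"var", "for", "int"}

# Order matters: earlier alternatives win when several could match.
TOKEN_PATTERN = re.compile(r"""
    (?P<separator>\s+)
  | (?P<string>"[^"]*")
  | (?P<number>[0-9]+)
  | (?P<word>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<punctuator>\+\+|!=|[-+/=;<(){}])
""", re.VERBOSE)

def tokenize(source: str) -> list[tuple[str, str]]:
    """Split source into (kind, text) pairs, skipping separators."""
    tokens = []
    position = 0
    while position < len(source):
        match = TOKEN_PATTERN.match(source, position)
        if match is None:
            # A character outside the alphabet: a lexical error.
            raise ValueError(f"lexical error at {position}: {source[position]!r}")
        kind, text = match.lastgroup, match.group()
        position = match.end()
        if kind == "separator":
            continue
        if kind == "word":
            kind = "predefined word" if text in PREDEFINED_WORDS else "user-defined word"
        tokens.append((kind, text))
    return tokens

for kind, text in tokenize('var my_String = "Hello";'):
    print(f"{kind}: {text}")
```

Feeding it a character that is not in this alphabet, for example `tokenize('var price = 10€;')`, raises the lexical error described earlier, before any syntax rule is even considered.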
In conclusion, if you want to learn programming, this language-agnostic knowledge is extremely useful for understanding the grammar of programming languages, one of their pillars. Unicode is widely used and standardized in the grammars of many programming languages and gives us the ability to write code; after all, our main tasks are text-based.