What the heck is Regex? How did it come into existence? What came before Regex?
Have you ever found yourself searching for specific patterns in text data, wishing there was a more efficient way to extract or modify the information you need? Look no further than regular expressions, commonly known as regex. This versatile and powerful tool is designed to handle precisely these situations and has become a fundamental aspect of working with text data.
What is regex?
A regular expression (shortened as regex or regexp) is a sequence of characters that specifies a match pattern in text. Usually, such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques were developed in theoretical computer science and formal language theory.
For a concrete example, let's consider the following regex: /h[aeiou]+/g. This regex pattern matches the letter 'h' followed by one or more vowels in a given text. Let's see how it works with the text "hello how are you?":
Text: "hello how are you?"
Matches:
'he'
'ho'
In this example, the regex pattern /h[aeiou]+/g successfully identifies all occurrences of the letter 'h' followed by one or more vowels in the given text.
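If you want to try this yourself, here is a minimal sketch in TypeScript/JavaScript (the same pattern works in most regex engines; the /g flag asks for every match rather than just the first one):

```typescript
const text = "hello how are you?";

// The global (g) flag returns every non-overlapping match.
const matches = text.match(/h[aeiou]+/g);

console.log(matches); // [ "he", "ho" ]
```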
How did regex come into existence?
Regular expressions came into existence through the pioneering work of American mathematician Stephen Cole Kleene in the 1950s. Kleene formalized the concept of regular languages, which are sets of strings that can be recognized by a finite automaton—a simple abstract machine with a finite number of states and transitions.
So what's a finite automaton? Let's understand it with the following example -
Imagine you have a box with different buttons on it. Each button has a specific label. When you press a button, something happens inside the box, and it may change to a different state.
A finite automaton is like that box with buttons, but it can only be in one state at a time. It starts in an initial state. When you give it input, it looks at the current state and the input, and based on that, it may transition to a different state. It keeps doing this for each input until there are no more inputs left.
For example, let's say the automaton is a traffic light. It has three states: red, yellow, and green. When you start, it's in the red state. If you press a button, it transitions to the green state, and the traffic light shows a green light. If you press the button again, it transitions to the yellow state, and the light turns yellow. Pressing the button once more takes it back to the red state.
In this example, the traffic light is like a finite automaton because it has a limited number of states (red, yellow, green) and transitions between those states based on the input (pressing the button).
So, a finite automaton is like a box with buttons that can be in different states, and it changes its state based on the input it receives. It's a simple way to represent machines that can do different things depending on what you give them.
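To make this concrete, here is a tiny sketch of the traffic-light automaton in TypeScript. The state names and the single "button press" input are just this example's assumptions, not a general-purpose automaton library:

```typescript
// The three states of the traffic-light automaton.
type State = "red" | "green" | "yellow";

// On each button press, the machine moves to exactly one next state.
const nextState: Record<State, State> = {
  red: "green",
  green: "yellow",
  yellow: "red",
};

let state: State = "red"; // initial state
for (let press = 1; press <= 3; press++) {
  state = nextState[state];
  console.log(`press ${press} -> ${state}`);
}
// press 1 -> green
// press 2 -> yellow
// press 3 -> red
```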
(Now, let's go back to regular expressions.)
To describe regular languages, Kleene introduced a notation called regular events. These events use operations such as concatenation, union, and closure (now known as the Kleene star). For instance, the regular event ab*c represents the set of strings that start with 'a', followed by zero or more occurrences of 'b', and end with 'c'. This pattern can match strings such as 'ac', 'abc', 'abbc', and so on.
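In modern regex notation, Kleene's ab*c example translates almost directly. A quick sketch (anchored with ^ and $ so the whole string has to match the pattern):

```typescript
const pattern = /^ab*c$/; // 'a', then zero or more 'b's, then 'c'

for (const s of ["ac", "abc", "abbc", "abd"]) {
  console.log(s, pattern.test(s));
}
// ac true, abc true, abbc true, abd false
```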
Kleene's work laid the foundation for regular expressions as a means to describe and manipulate patterns in text data. Over time, regular expressions found their way into practical applications, particularly in the field of computer science. They became an essential tool for tasks such as pattern matching, search, and text manipulation.
Today, regular expressions are widely supported in programming languages, text editors, and various software tools. They have become a fundamental aspect of working with text data, empowering developers, data analysts, and other professionals to efficiently extract, validate, and transform information based on specific patterns.
How was text processing done before regex?
Before regex became popular, text processing was done with various tools and languages that relied on their own pattern-matching constructs rather than regular expressions. SNOBOL, for example, was a programming language developed in the 1960s with powerful string-manipulation features such as pattern variables, alternation, concatenation, and repetition.
For instance, the SNOBOL pattern A = . . . 'HELLO' . . . matches any string that contains HELLO and assigns the matched string to the variable A.
Apart from this, manual string manipulation techniques or customized algorithms were designed for specific pattern-matching tasks. This includes custom parsing algorithms, pattern-specific libraries, string manipulation functions and so on.
How did regex enter the computing world?
Regular expressions entered popular use in 1968 in two applications: pattern matching in a text editor and lexical analysis in a compiler.
Among the first appearances of regular expressions in program form was when Ken Thompson built Kleene's notation into the editor QED as a means to match patterns in text files. He later added this capability to the Unix editor ed, which eventually led to the popular search tool grep's use of regular expressions ("grep" is a word derived from the command for regular expression searching in the ed editor: g/re/p meaning "Global search for Regular Expression and Print matching lines").
Around the same time that Thompson developed QED, a group of researchers including Douglas T. Ross implemented a tool based on regular expressions that was used for lexical analysis in compiler design. Lexical analysis is the process of converting a sequence of characters into a sequence of tokens, such as keywords, identifiers, literals, operators, etc.
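As a rough illustration of how regular expressions can drive lexical analysis, here is a toy tokenizer sketch. The token names and rules are made up for this example and are far simpler than anything a real compiler would use:

```typescript
// Each rule pairs a token name with a regex anchored at the start of the input.
const rules: [string, RegExp][] = [
  ["NUMBER", /^\d+/],
  ["IDENT",  /^[A-Za-z_]\w*/],
  ["OP",     /^[+\-*/=]/],
  ["SPACE",  /^\s+/],
];

function tokenize(src: string): [string, string][] {
  const tokens: [string, string][] = [];
  while (src.length > 0) {
    const rule = rules.find(([, re]) => re.test(src));
    if (!rule) throw new Error(`unexpected character: ${src[0]}`);
    const [kind, re] = rule;
    const text = src.match(re)![0];
    if (kind !== "SPACE") tokens.push([kind, text]); // whitespace is skipped
    src = src.slice(text.length);
  }
  return tokens;
}

console.log(tokenize("count = 42 + x"));
// [["IDENT","count"], ["OP","="], ["NUMBER","42"], ["OP","+"], ["IDENT","x"]]
```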
How did regex evolve?
Many variations of these original forms of regular expressions were used in Unix programs at Bell Labs in the 1970s, including vi, lex, sed, AWK, and expr, and in other programs such as Emacs (which has its own, incompatible syntax and behavior).
Different syntaxes for writing regular expressions have existed since the 1980s, one being the POSIX standard and another, widely used, being the Perl syntax. The POSIX standard defines two flavors of regular expressions: basic (BRE) and extended (ERE). The basic flavor supports only a limited set of metacharacters and modifiers, while the extended flavor adds more features such as parentheses for grouping, alternation with |, and repetition with ?, +, and {}. The Perl syntax is more expressive and flexible than both POSIX flavors and supports features such as backreferences, named capture groups, look-around assertions, non-greedy quantifiers, Unicode properties, etc.
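Many of those Perl-inspired features are available in today's mainstream engines. For example, JavaScript's regex engine supports named capture groups and non-greedy quantifiers; the specific patterns below are just illustrations:

```typescript
// Named capture groups: label the pieces of a match instead of counting parentheses.
const m = "2024-05-17".match(/(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/);
console.log(m?.groups); // { year: "2024", month: "05", day: "17" }

// Non-greedy quantifier: +? stops at the shortest possible match.
console.log("<a><b>".match(/<.+?>/g)); // [ "<a>", "<b>" ]
```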
Regular expressions are supported in many programming languages such as Java, Python, Ruby, PHP, JavaScript, C#, etc., either natively or through libraries. Each language may have its own variations and extensions of regex syntax and semantics.
Conclusion
Regular expressions are a powerful tool for working with text data by describing patterns to match, search and manipulate text. They have a long history that dates back to the 1950s and have evolved to become more expressive and flexible. Regular expressions are widely used in various domains and applications, such as text editors, compilers, web development, data analysis, etc.
Regex offers a wide range of pattern-matching rules and operators, giving it great flexibility and power. There is much more to explore when it comes to the various pattern syntax and advanced techniques of regular expressions. Delving into these topics could be the subject of another blog, where we can dive deeper into the fascinating world of regex and uncover its hidden gems. Stay tuned for future discussions on advanced regex techniques and take your text manipulation skills to the next level!