How to use Regex? – CloudSavvy IT



Regex, short for regular expression, is often used in programming languages to match patterns in strings, find and replace text, validate input, and reformat text. Learning how to use Regex properly can make working with text much easier.

Regex Syntax, Explained

Regex has a reputation for terrible syntax, but it is much easier to write than it is to read. For example, here is a general Regex for an RFC 5322-compliant email validator:

  (?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

If you think it looks like someone smashed their face on the keyboard, you're not alone. But under the hood, all of this mess is actually programming a finite-state machine. This machine runs on each character, chugging along and matching based on the rules you have specified. Plenty of online tools render railroad diagrams that show how your Regex machine works. Here's the same Regex in visual form:

Still very confusing, but a lot more understandable. It is a machine with moving parts and rules that define how everything fits together. You can see how someone assembled this.

First off: Use a Regex Debugger

Before we begin, unless your Regex is very short or you are very skilled, you should use an online debugger when writing and testing it. This makes the syntax much easier to understand. We recommend Regex101 and RegExr, both of which offer testing and a built-in syntax reference.

How does Regex work?

Now let's focus on something much simpler. This is a diagram from Regulex for a very short (and definitely not RFC 5322-compliant) email-matching Regex:

The Regex engine starts on the left and travels along the lines, matching characters as it goes. Group #1 matches any character except a line break, and will keep matching characters until the next block finds a match. In this case, it stops when it reaches an @ symbol, which means Group #1 captures the name part of the email address, and everything after it matches the domain.

The Regex that defines Group #1 in our email example is:

  (.+)

The parentheses define a capture group, which tells the Regex engine to store the contents of this group's match in a special variable. When you run a Regex on a string, the default return is the entire match (in this case, the whole email). But it also returns each capture group, which makes this Regex useful for pulling names out of emails.

The period is the symbol for "any character except newline." It matches anything on a line, so if you passed this email Regex an address like:

  %$#^&%*#%$#^@gmail.com

It would match %$#^&%*#%$#^ as the name, ridiculous as that is.

The plus symbol (+) is a control structure that means "match the preceding character or group one or more times." It ensures that the whole name is matched and not just the first character. This is what creates the loop in the railroad diagram.

The rest of the Regex is fairly easy to decipher:

  (.+)@(.+\..+)

The first group stops when it hits the @ symbol. The next group then starts, which again matches multiple characters until it reaches a period character.

Because characters such as periods, parentheses, and slashes are part of the Regex syntax, whenever you want to match those characters literally you need to escape them with a backslash. In this example, to match the period, we write \. and the parser treats it as a single symbol meaning "match a period."
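To make the capture groups concrete, here is a minimal JavaScript sketch of the simplified email pattern above. The helper name splitEmail and the sample addresses are made up for illustration; the pattern is the article's short example, not a real email validator.

```javascript
// Simplified email pattern from the text: (.+)@(.+\..+)
// Group 1 captures the name, group 2 captures the domain.
const emailPattern = /(.+)@(.+\..+)/;

// Hypothetical helper: returns the whole match plus both capture groups,
// or null when the input does not look like name@domain.tld.
function splitEmail(address) {
  const match = emailPattern.exec(address);
  if (match === null) {
    return null;
  }
  return { full: match[0], name: match[1], domain: match[2] };
}

console.log(splitEmail("jane.doe@example.com"));
console.log(splitEmail("%$#^&%*#%$#^@gmail.com"));
console.log(splitEmail("not-an-email"));
```

Index 0 of the result is always the entire match; the capture groups follow in the order their opening parentheses appear.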

Character Matching

If you have non-control characters in your Regex, the Regex engine will treat them as a literal matching block. For example, the Regex:

  he+llo

Will match the word "hello" with one or more e's. Any control characters you want to match literally must be escaped to work properly.

Regex also has character classes, which act as shorthand for a set of characters. These may vary based on the Regex implementation, but these few are standard:

  • . – matches anything except a newline.
  • \w – matches any "word" character, including digits and underscores.
  • \d – matches digits.
  • \s – matches whitespace characters (i.e., space, tab, newline).

The last three all have uppercase counterparts that invert their function. For example, \D matches anything that is not a digit.
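A short JavaScript sketch of literal characters and the shorthand classes in action. The sample string is invented for illustration, and the class behavior shown is what JavaScript's engine does; other engines may differ slightly.

```javascript
// Literal characters match themselves; + repeats only the preceding "e".
const helloPattern = /he+llo/;
const matchesLongHello = helloPattern.test("heeello"); // one or more e's
const matchesNoE = helloPattern.test("hllo");          // zero e's, so no match

const sample = "Order 66 shipped_to\tBay 7";

// \d matches a single digit; the g flag collects every match.
const digits = sample.match(/\d/g);
// \w+ matches runs of letters, digits, and underscores.
const words = sample.match(/\w+/g);
// \s matches whitespace: spaces, tabs, newlines.
const whitespaceCount = sample.match(/\s/g).length;
// \D is the inverse of \d, so removing every \D leaves only the digits.
const digitsOnly = sample.replace(/\D/g, "");

console.log(matchesLongHello, matchesNoE);
console.log(digits, words, whitespaceCount, digitsOnly);
```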

Regex also supports character sets. For example:

  [abc] 

Will match either a, b, or c. This acts as one block, and the square brackets are just control structures. Alternatively, you can specify a range of characters:

  [a-c] 

Or negate the set, which will match any character that is not in the set:

  [^a-c] 
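The three forms of character set can be compared side by side. This is a small JavaScript sketch with a made-up input string:

```javascript
const letters = "abcdef";

// [abc] matches any single one of the listed characters.
const inSet = letters.match(/[abc]/g);

// [a-c] is the same set written as a range; + grabs the whole run.
const inRange = letters.match(/[a-c]+/)[0];

// [^a-c] negates the set: any character that is NOT a, b, or c.
const notInRange = letters.match(/[^a-c]+/)[0];

console.log(inSet, inRange, notInRange);
```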

Quantifiers

Quantifiers are an important part of Regex. They let you match strings when you do not know the exact format, but you have a pretty good idea.

The + operator from the email example is a quantifier, specifically the "one or more" quantifier. If we do not know how long a given string is, but we know that it consists of word characters (and is not empty), we can write:

  \w+

In addition to +, there are also:

  • The * operator, which matches "zero or more." Essentially the same as +, except that it also succeeds when the item is not there at all.
  • The ? operator, which matches "zero or one." It has the effect of making a character optional; either it is there or it is not, and it does not match more than once.
  • Numerical quantifiers. These can be a single number like {3}, which means "exactly three times," or a range like {3,6}. You can leave out the second number to make it unlimited. For example, {3,} means "three or more times." Oddly enough, you can't leave out the first number, so if you want "three or fewer times," you must write a range that starts at zero, such as {0,3}.
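Each quantifier above can be checked against a few inputs. The sketch below is JavaScript; the helper name accepted and the test strings are illustrative assumptions, and the patterns are anchored with ^ and $ so that each input must match in full.

```javascript
// Hypothetical helper: keep only the inputs the pattern matches.
function accepted(pattern, inputs) {
  return inputs.filter((input) => pattern.test(input));
}

const letterInputs = ["ac", "abc", "abbbc"];
const zeroOrMore = accepted(/^ab*c$/, letterInputs); // * allows zero b's
const oneOrMore = accepted(/^ab+c$/, letterInputs);  // + needs at least one b
const optionalU = accepted(/^colou?r$/, ["color", "colour", "colouur"]); // ? allows at most one u

const numberInputs = ["12", "123", "123456", "1234567"];
const exactlyThree = accepted(/^\d{3}$/, numberInputs);
const threeToSix = accepted(/^\d{3,6}$/, numberInputs);
const threeOrMore = accepted(/^\d{3,}$/, numberInputs);
const upToThree = accepted(/^\d{0,3}$/, numberInputs); // range starting at zero

console.log({ zeroOrMore, oneOrMore, optionalU });
console.log({ exactlyThree, threeToSix, threeOrMore, upToThree });
```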

Greedy and Lazy Quantifiers

Under the hood, the * and + operators are greedy. They match as much as possible and then give back only what is needed for the next block to match. This can be a huge problem.

Here's an example: say you're trying to match HTML, or anything else with closing brackets. Your input text is:

  <div>Hello World</div>

And you want to match everything inside the angle brackets. You might write something like:

  <.*> 

This is the right idea, but it fails for one crucial reason: the Regex engine matches "div>Hello World</div>" for the sequence .* and then backtracks until the next block matches, in this case a closing bracket (>). You might expect it to backtrack far enough to match only "div", and then repeat to match the closing div. But the backtracker runs from the end of the string and stops at the final bracket, so the pattern ends up matching everything between the first < and the last >.

The solution is to make our quantifier lazy, which means it will match as few characters as possible. Under the hood, the engine starts with the smallest possible match and then expands one character at a time until the next block matches, which can also save work on long inputs.

Making a quantifier lazy is done by adding a question mark immediately after the quantifier. This is a little confusing because ? is already a quantifier (and is itself greedy by default). For our HTML example, the Regex is fixed with this simple addition:

  <.*?> 

The lazy operator can be attached to any quantifier, including +?, {0,3}?, and even ??. With the last one, the engine simply tries matching zero characters before it tries one, so there is little room to expand.
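The difference is easy to see by running both patterns on the same input. A minimal JavaScript sketch, using the div example from the text:

```javascript
const html = "<div>Hello World</div>";

// Greedy: .* runs to the end of the line, then backtracks to the LAST ">".
const greedyMatches = html.match(/<.*>/g);

// Lazy: .*? expands only until the FIRST ">" it can reach.
const lazyMatches = html.match(/<.*?>/g);

console.log(greedyMatches); // one match spanning the whole string
console.log(lazyMatches);   // two matches, one per tag
```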

Grouping and Lookarounds

Groups in Regex have many purposes. At a basic level, they combine multiple tokens into one block. For example, you can create a group and then apply a quantifier to the whole group:

  ba(na)+

This groups the repeated "na" to match the words banana, banananana, and so on. Without the group, the Regex engine would only repeat the final character over and over again.

This type of group, with two plain parentheses, is called a capture group, and its contents are included in the output:

If you want to avoid this and simply group tokens together for execution reasons, you can use a non-capturing group:

  ba(?:na)

The question mark (a reserved character) marks a non-standard group, and the character that follows defines what kind of group it is. Starting these groups with a question mark is ideal, because otherwise, if you wanted to match a colon at the start of a group, you would need to escape it for no good reason. A literal question mark, by contrast, always has to be escaped in Regex.
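Here is a small JavaScript sketch contrasting the two kinds of group. The variable names are illustrative; the point is what ends up in the result array.

```javascript
// Capturing group: the result holds the full match plus one entry per group.
const capturing = /ba(na)+/.exec("banana");
// capturing[0] is the full match; capturing[1] is the group's last repetition.

// Non-capturing group: same grouping for the quantifier, nothing extra stored.
const nonCapturing = /ba(?:na)+/.exec("banana");

// Without any group, + applies only to the final "a".
const noGroup = /bana+/.exec("banana");

console.log(capturing[0], capturing[1], capturing.length);
console.log(nonCapturing[0], nonCapturing.length);
console.log(noGroup[0]);
```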

You can also name your groups, which is convenient when working with the output:

  (?'group')

You can reference these in your Regex, which makes them work like variables. You can reference unnamed groups with a numbered token such as \1. Numbered references get hard to track as a pattern grows (and some tools, such as POSIX sed, only go up to \9), so for longer patterns it is easier to name groups. The syntax for referring to named groups is:

  \k{group}

This refers to the result of the named group, which can be dynamic. In essence, it checks whether the text the group matched occurs again, wherever the reference is placed. For example, this can be used to match all text between three identical words:
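As a hedged illustration, here is how backreferences look in JavaScript. Note that JavaScript writes named groups with angle brackets, (?<name>...) and \k<name>, rather than the quote and brace forms shown above; the group names and sample sentences are made up for this sketch.

```javascript
// Numbered backreference: \1 must repeat whatever group 1 matched.
const doubled = /(\w+) \1/.exec("in the the end");

// Named backreference: \k<word> must repeat the text captured by "word".
// The two lazy groups capture the text between the three identical words.
const between = /(?<word>\w+) (?<first>.*?) \k<word> (?<second>.*?) \k<word>/
  .exec("spam eggs spam ham spam");

console.log(doubled[0], doubled[1]);
console.log(between.groups.word, between.groups.first, between.groups.second);
```

Because the reference is resolved at match time, the same pattern works for any repeated word, not just the one in the sample.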

The group class is where you will find most of Regex's control structures, including lookaheads. A lookahead requires that an expression matches but does not include it in the result. In a way, it resembles an if statement: the overall match fails if the lookahead returns false.

The syntax for a positive lookahead is (?=). Here is an example:

This matches the name part of an email address very cleanly, by stopping at the @. Lookaheads do not consume any characters, so if you want to keep matching after a lookahead succeeds, you can still match the character used in the lookahead.
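A minimal JavaScript sketch of a positive lookahead on an email address. The exact pattern .+(?=@) is an assumption chosen to fit the description, not necessarily the example the article originally showed.

```javascript
const address = "jane.doe@example.com";

// The lookahead (?=@) requires an "@" to follow, but leaves it out of the match.
const nameOnly = /.+(?=@)/.exec(address)[0];

// Because the "@" was not consumed, the pattern can still match it afterwards.
const parts = /(.+)(?=@)@(.+)/.exec(address);

console.log(nameOnly);
console.log(parts[1], parts[2]);
```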

In addition to positive lookaheads, there are also:

  • (?!) – Negative lookaheads, which ensure an expression does not match.
  • (?<=) – Positive lookbehinds, which are not supported everywhere due to some technical limitations. These are placed before the expression you want to match, and they must have a fixed width (i.e., no quantifiers except {number}). In this example, you could use (?<=@)\w+\.\w+ to match the domain portion of the email.
  • (?<!) – Negative lookbehinds, which are the same as positive lookbehinds, but negated.
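The three variants in the list can be sketched in JavaScript as follows. Lookbehind needs a reasonably modern engine (it was added in ES2018); the sample strings are invented for illustration.

```javascript
// Positive lookbehind: match the domain only when an "@" comes right before it.
const domain = /(?<=@)\w+\.\w+/.exec("jane@example.com")[0];

// Negative lookahead: file names whose extension is NOT "txt".
const nonTxtFiles = "a.txt b.log c.txt".match(/\b\w+\.(?!txt)\w+/g);

// Negative lookbehind: whole numbers that are NOT preceded by a dollar sign.
const plainNumbers = "cost $30 for 4 items".match(/(?<!\$)\b\d+/g);

console.log(domain, nonTxtFiles, plainNumbers);
```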

Differences between Regex engines

Not all Regex is created equal. Most Regex engines do not follow a specific standard, and some switch things up to suit their language, so features that work in one language may not work in another.

For example, the versions of sed compiled for macOS and FreeBSD do not support using \t to represent a tab character. You must manually copy a tab character and paste it into the terminal to use a tab in command-line sed.

Most of this tutorial is compatible with PCRE, the default Regex engine used for PHP. But JavaScript's Regex engine is different: it doesn't support named capture groups with quotation marks (it wants angle brackets) and can't do recursion, among other things. Even PCRE is not completely compatible across its own versions, and it has many differences from Perl regex.

There are too many minor differences to list here, so you can use this reference table to compare the differences between multiple Regex engines. Regex debuggers like Regex101 also let you switch Regex engines, so make sure you are debugging with the right engine.

How to run Regex

We have been discussing the matching portion of regular expressions, which makes up most of what a Regex is. But when you actually want to run your Regex, you have to form it into a full regular expression.

This usually takes the format:

  /match/g

Everything inside the forward slashes is our match. The g is a mode modifier. In this case, it tells the engine not to stop after finding the first match. For find-and-replace Regex, you often have to format it like this:

  /find/replace/g

This replaces every occurrence throughout the file. You can use capture group references when replacing, which makes Regex very good at formatting text. For example, this Regex will match all HTML tags and replace the angle brackets with square brackets:

  /<(.+?)>/[\1]/g

When this runs, the engine will match <div> and </div>, so you can replace this text (and only this text). As you can see, the inner HTML is not affected:
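The same replacement can be sketched in JavaScript, where the find and replace parts are passed separately and the group reference in the replacement string is written $1 rather than \1:

```javascript
const markup = "<div>Hello World</div>";

// <(.+?)> lazily captures each tag name; $1 re-inserts it between square brackets.
const bracketed = markup.replace(/<(.+?)>/g, "[$1]");

console.log(bracketed); // [div]Hello World[/div]
```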

This makes Regex very useful for finding and replacing text. The command-line utility for this is sed, which uses the basic format:

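  sed 's/find/replace/g' file > newfile

This runs on a file and outputs to STDOUT. To actually change the file on disk, redirect the output to a new file (as shown here) and then replace the original; redirecting straight back to the same file would truncate it before sed reads it. Most versions of sed also have an in-place option (-i), though its syntax differs between GNU and BSD sed.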

Regex is also supported in many text editors, and can really speed up your workflow when doing batch operations. Vim, Atom, and VS Code all have Regex find and replace built in.

Of course, Regex can also be used programmatically, and it is built into many languages. The exact implementation will depend on the language, so you need to consult your language's documentation.

For example, in JavaScript a regex can be created as a literal, or dynamically using the global RegExp object:

  var re = new RegExp('abc')

This can be used directly by calling the .exec() method of the newly created regex object, or by using the .replace(), .match(), and .matchAll() methods on strings.
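A brief JavaScript sketch tying these methods together. The sample text is made up, and the g flag is added here so that the string methods return every match rather than only the first.

```javascript
// Two equivalent ways to build the same regex.
const literalRe = /abc/g;
const dynamicRe = new RegExp("abc", "g");

const haystack = "abc abcabc";

// exec() returns one match at a time, including its position.
const firstMatch = dynamicRe.exec(haystack);

// match() with the g flag returns every matching substring.
const allMatches = haystack.match(literalRe);

// matchAll() yields full match objects, so each index is available too.
const matchPositions = Array.from(haystack.matchAll(literalRe), (m) => m.index);

// replace() with the g flag substitutes every occurrence.
const replacedText = haystack.replace(literalRe, "X");

console.log(firstMatch.index, allMatches, matchPositions, replacedText);
```

One thing to keep in mind: a regex object with the g flag remembers where exec() last stopped, so repeated exec() calls on the same object walk through the string instead of starting over.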

