Brainy Ghosts Blog: Java Tutorial: Regular Expressions

Chapters

Java Regular Expressions

matches() Method
lookingAt() Method
find() Method
split() Method
replaceAll() Method
region() Method
PatternSyntaxException

Regular-Expression Constructs

Characters
Character Classes

Negate(^)
Range(-)
Union and Intersection(&&)

Pre-Defined Character Classes
Boundary Matchers

^ and $
Word Boundary(\b)
Non-Word Boundary(\B)

Quantifiers

Greedy Quantifiers
Reluctant Quantifiers
Possessive Quantifiers
Zero-Length Match

Logical Operators

Escaping Metacharacters
Groups and Capturing

Embedded Flag Expressions(Non-capturing groups)
Capturing Groups
Backreferences
Named Capturing Groups

Java Regular Expressions

Regular expressions (or regex) are sequences of characters that form search patterns. These patterns can be used as criteria for finding specific text such as words, names or repeating characters.

There are two main classes that we need to understand in order to perform a search operation using regular expressions: Pattern and Matcher. Both classes reside in the java.util.regex package.

First, we compile our pattern by using the compile() method. Next, we pair the pattern with the input sequence (a character or a set of characters) by calling the matcher() method. Then we choose a matching operation. There are three matching operations available: matches(), lookingAt() and find().

matches() Method

This method attempts to match the pattern against the entire input sequence. The input sequence is the string on which the search operation is performed. This method is a member of the Matcher class.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
  
    //we use Pattern.compile() to create a pattern
    //The string that is used as argument is the pattern
    //that we're going to use
    //Pattern.compile returns a Pattern object
    //with the created pattern
    Pattern pattern = Pattern.compile("Ghosts");
    
    //matcher() method sets the pattern and
    //input sequence to be ready for matching operation.
    //returns a Matcher object.
    //
    //
    //Once they are set, we can perform three matching
    //operations: matches(), lookingAt() and find()
    Matcher matcher = pattern.matcher("Brainy Ghosts Blogs");
    
    //
    //This will return false 'cause matches() requires
    //the pattern to match the entire region of the
    //input sequence, and "Ghosts" is only a part of
    //"Brainy Ghosts Blogs".
    System.out.println("Match Found: " + matcher.matches());
    
    //We can use method chaining to shorten our regex
    //syntax.
    boolean isMatch = Pattern.compile("Ghosts").
                      matcher("Ghosts").matches();
    
    //This will return true 'cause the pattern and the 
    //entire region of the input sequence are equal.
    System.out.println("Match Found: " + isMatch);
    
    //We can also use the static Pattern.matches() method with
    //two parameters: the first parameter is the 
    //pattern and the second parameter is the input sequence.
    isMatch = Pattern.matches("Ghosts","Ghosts Blogs");
    
    //this will return false
    System.out.println("Match Found: " + isMatch);
    
    //The compile() method has an overloaded form with two
    //parameters. The first parameter is the pattern and 
    //the second is a flag. There are different types of
    //flags that we can use as argument. To know the flags,
    //visit the Pattern class documentation.
    //
    //Pattern.CASE_INSENSITIVE ignores the lettercases of both
    //pattern and the input sequence during comparison.
    //
    //To add multiple flags, use the bitwise OR operator("|")
    //e.g. Pattern.compile("Ghosts",Pattern.CASE_INSENSITIVE | 
    //                              Pattern.MULTILINE);
    //
    //or put the flags in an int variable
    //e.g
    //int flags = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
    //Pattern.compile("Ghosts",flags);
    pattern = Pattern.compile("Ghosts",Pattern.CASE_INSENSITIVE);
    matcher = pattern.matcher("GhOsTs");
    
    //This will return true regardless of the lettercases of
    //the pattern and input sequence.
    System.out.println("Match Found: " + matcher.matches());
    
  }
}

lookingAt() Method

This method attempts to match the pattern against the input sequence, starting at the beginning of the region (index 0 by default). This is similar to matches(), but unlike matches(), this method does not require the entire region of the input sequence to match the pattern. This method is a member of the Matcher class.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    Pattern pattern = Pattern.compile("Ghosts");
    Matcher matcher = pattern.matcher("Ghosts "
                                     +"Blogs Ghosts");
                                 
    if(matcher.lookingAt()){
      System.out.println("Match Found!");
      System.out.println("Start index: " + matcher.start());
      System.out.println("End index: " + matcher.end());
      System.out.println();
    }
  }
}
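
To see how the three matching operations differ, it can help to run them side by side on the same inputs. The sketch below is only an illustration: the class name MatchOperationsDemo and the helper tryAll() are made up for this example and are not part of java.util.regex.

```java
import java.util.regex.Pattern;

class MatchOperationsDemo {

  // Hypothetical helper: returns the results of matches(),
  // lookingAt() and find(), in that order, for one pattern
  // and one input.
  static boolean[] tryAll(String regex, String input) {
    Pattern pattern = Pattern.compile(regex);
    // A fresh matcher per operation keeps the calls independent.
    boolean whole = pattern.matcher(input).matches();
    boolean prefix = pattern.matcher(input).lookingAt();
    boolean anywhere = pattern.matcher(input).find();
    return new boolean[] { whole, prefix, anywhere };
  }

  public static void main(String[] args) {
    String[] inputs = { "Ghosts", "Ghosts Blogs", "Brainy Ghosts" };
    for (String input : inputs) {
      boolean[] r = tryAll("Ghosts", input);
      System.out.println(input + " -> matches=" + r[0]
          + ", lookingAt=" + r[1] + ", find=" + r[2]);
    }
  }
}
```

matches() succeeds only for "Ghosts", lookingAt() also accepts "Ghosts Blogs" because the match starts at index 0, and find() accepts all three inputs.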

find() Method

This method scans the input sequence and looks for each subsequence that matches the pattern (a subsequence is a sequence that is a subset of another sequence). We will use the start() and end() methods to determine the start and end indexes of the matched subsequences. Indexes are counted from the first index, which is 0. This method is a member of the Matcher class.
Note: start() returns the start index of the most recent match whereas end() returns the offset after the last character matched.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    Pattern pattern = Pattern.compile("Ghosts");
    Matcher matcher = pattern.matcher("Brainy Ghosts "
                                     +"Blogs Ghosts");
    int matchCount = 0;
    
    //Each call to find() continues from where the
    //previous match ended, so calling it in a loop
    //visits every match in the input sequence.
    //If the matcher is reset, find() starts again at
    //the beginning of the input sequence
    while(matcher.find()){
      matchCount++;
      System.out.println("Match Found!");
      System.out.println("Start index: " + matcher.start());
      System.out.println("End index: " + matcher.end());
      System.out.println();
    }
    System.out.println("Match Count: " + matchCount);
    System.out.println();
    System.out.println("Resetting Matcher...");
    //reset the matcher
    matcher.reset();
    
    //we can use find() method to get the first match
    //only.
    if(matcher.find()){
      System.out.println("Match Found!");
      System.out.println("Start index: " + matcher.start());
      System.out.println("End index: " + matcher.end());
      System.out.println();
    }
    
  }
}
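
Besides start() and end(), the Matcher class has a group() method that returns the matched text itself, which is the same text as input.substring(start(), end()). Here is a minimal sketch of that relationship. The class name FindGroupDemo and the helper findAll() are assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindGroupDemo {

  // Hypothetical helper: collects every match as "text@start-end".
  static List<String> findAll(String regex, String input) {
    List<String> found = new ArrayList<>();
    Matcher matcher = Pattern.compile(regex).matcher(input);
    while (matcher.find()) {
      // group() is the matched text; end() is exclusive
      String viaGroup = matcher.group();
      String viaIndexes = input.substring(matcher.start(), matcher.end());
      if (!viaGroup.equals(viaIndexes)) {
        throw new IllegalStateException("group() and substring() differ");
      }
      found.add(viaGroup + "@" + matcher.start() + "-" + matcher.end());
    }
    return found;
  }

  public static void main(String[] args) {
    // prints [Ghosts@7-13, Ghosts@20-26]
    System.out.println(findAll("Ghosts", "Brainy Ghosts Blogs Ghosts"));
  }
}
```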

split() Method

This method splits the input sequence around matches of the pattern, so the pattern acts as the delimiter. This method is a member of the Pattern class.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String str = "Apple,Banana,Citrus,Durian";
    
    //"," is the delimiter that we're going to use
    Pattern pattern = Pattern.compile(",");
    String[] strArr = pattern.split(str);
    
    for(String s: strArr)
      System.out.println(s);
  }
}
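
The delimiter doesn't have to be a single literal character, and split() also has a two-parameter form that takes a limit. The sketch below assumes a comma-separated list with stray spaces around the commas. The class name SplitDemo and the helper splitList() are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SplitDemo {

  // Hypothetical helper: splits a comma-separated list and ignores
  // the spaces around each comma. The limit is passed straight to
  // Pattern.split(CharSequence, int): 0 drops trailing empty
  // strings, a positive n returns at most n pieces, and a negative
  // value keeps trailing empty strings.
  static String[] splitList(String input, int limit) {
    Pattern delimiter = Pattern.compile("\\s*,\\s*");
    return delimiter.split(input, limit);
  }

  public static void main(String[] args) {
    String str = "Apple , Banana,Citrus ,  Durian";
    System.out.println(Arrays.toString(splitList(str, 0)));
    System.out.println(Arrays.toString(splitList(str, 2)));
    System.out.println(Arrays.toString(splitList("a,b,,", 0)));
    System.out.println(Arrays.toString(splitList("a,b,,", -1)));
  }
}
```

With a limit of 2, the second piece is the unsplit remainder "Banana,Citrus ,  Durian". For "a,b,,", a limit of 0 returns two pieces whereas a limit of -1 returns four, the last two being empty strings.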

replaceAll() Method

This method replaces every subsequence of the input that matches the pattern with the replacement string and returns the resulting string. The original input string is not modified. This method is a member of the Matcher class.

Method form: replaceAll(String replacement)

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String str = "Apple,Banana,Citrus,Durian";
  
    Pattern pattern = Pattern.compile(",");
    Matcher matcher = pattern.matcher(str);
    
    String modifiedStr = matcher.replaceAll("-");
    System.out.println(modifiedStr);
    System.out.println(str);
  }
}
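
Two related details are worth a quick sketch. First, Matcher also has replaceFirst(), which replaces only the first match. Second, "$" and "\" have a special meaning inside the replacement string, so a replacement that should be taken literally can be wrapped in Matcher.quoteReplacement(). The class name ReplaceDemo and both helper methods below are assumptions for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {

  // Hypothetical helper: replaces every match and treats the
  // replacement as literal text, even if it contains "$" or "\".
  static String replaceAllLiteral(String regex, String input,
                                  String replacement) {
    return Pattern.compile(regex).matcher(input)
        .replaceAll(Matcher.quoteReplacement(replacement));
  }

  // Hypothetical helper: replaces only the first match.
  static String replaceFirstOnly(String regex, String input,
                                 String replacement) {
    return Pattern.compile(regex).matcher(input)
        .replaceFirst(replacement);
  }

  public static void main(String[] args) {
    String str = "Apple,Banana,Citrus,Durian";
    System.out.println(replaceAllLiteral(",", str, "$"));
    System.out.println(replaceFirstOnly(",", str, "-"));
  }
}
```

Without quoteReplacement(), passing "$" alone as the replacement makes replaceAll() throw an exception because "$" is read as the start of a group reference.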

region() Method

This method limits the searchable region of an input sequence. By default, matching operations use the entire input sequence. By using the region() method, we can specify which part of the input is searchable. This method is a member of the Matcher class.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String str = "apt, ape, apart, uptown apostle";
    Pattern pattern = Pattern.compile("ap");
    Matcher matcher = pattern.matcher(str);
    
    //the first parameter is the starting index
    //(inclusive) of our searchable region
    //the second parameter is the end index
    //(exclusive) of our searchable region
    matcher.region(0,9);
    
    while(matcher.find()){
      System.out.println("Found a Match!");
      System.out.println("First index: " + matcher.start());
      System.out.println("Last index: " + matcher.end());
      System.out.println();
    }
  }
}

Result

Found a Match!
First index: 0
Last index: 2

Found a Match!
First index: 5
Last index: 7
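
One thing that is easy to miss: the indexes returned by start() and end() are still relative to the whole input sequence, not to the region. Here is a small sketch of that, using a region that doesn't start at 0. The class name RegionDemo and the helper startsInRegion() are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegionDemo {

  // Hypothetical helper: returns the start index of every match
  // found between the indexes from (inclusive) and to (exclusive).
  static List<Integer> startsInRegion(String regex, String input,
                                      int from, int to) {
    Matcher matcher = Pattern.compile(regex).matcher(input);
    matcher.region(from, to);
    List<Integer> starts = new ArrayList<>();
    while (matcher.find()) {
      // start() is an index into the whole input, not the region
      starts.add(matcher.start());
    }
    return starts;
  }

  public static void main(String[] args) {
    String str = "apt, ape, apart, uptown apostle";
    // The region 5..16 covers "ape, apart,"
    System.out.println(startsInRegion("ap", str, 5, 16)); // prints [5, 10]
  }
}
```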

PatternSyntaxException

PatternSyntaxException is an unchecked exception that is thrown when we use a pattern with a syntax error.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    try{
      //The pattern here is invalid 'cause a lone
      //backslash has nothing to escape. Therefore,
      //java will throw a PatternSyntaxException
      //at runtime.
      Pattern pattern = Pattern.compile("\\");
    }
    catch(PatternSyntaxException e){
      e.printStackTrace();
    }
    
  }
}
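
Instead of printing the whole stack trace, we can ask the exception what went wrong. PatternSyntaxException has getDescription(), getIndex() and getPattern() methods for that. A minimal sketch follows. The class name SyntaxErrorDemo and the helper describe() are assumptions for this example, and the exact description text depends on the java version, so it's not shown here.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class SyntaxErrorDemo {

  // Hypothetical helper: returns "valid" if the regex compiles,
  // otherwise a short message built from the exception.
  static String describe(String regex) {
    try {
      Pattern.compile(regex);
      return "valid";
    } catch (PatternSyntaxException e) {
      // getIndex() is the approximate error position in the
      // pattern, or -1 if the position isn't known
      return "invalid at index " + e.getIndex()
          + ": " + e.getDescription();
    }
  }

  public static void main(String[] args) {
    System.out.println(describe("Ghosts"));
    System.out.println(describe("\\"));
    System.out.println(describe("("));
  }
}
```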

Regular-Expression Constructs

We already know that we can use string literals as a construct in a pattern. Regular expressions are not limited to string literals. There are other constructs that we can use to create a pattern.

Characters

We can use characters in octal, hexadecimal and unicode format. Also, we can use special escape sequences like "\n" as part of our pattern. Since compile() takes the pattern as a String, we need to wrap these characters in double quotes when using them as a pattern or part of a pattern.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String str = "\"Banana\" \"Nuts\"";
  
    //\042 is the octal representation of double
    //quotes("). Here it's a java string escape; the
    //regex form of the same escape is "\\042"
    Pattern pattern = Pattern.compile("\042");
    Matcher matcher = pattern.matcher(str);
    int quotesCount = 0;
    
    while(matcher.find())
      quotesCount++;
    
    System.out.println("Input: " + str);
    System.out.println("Double quotes count: " + quotesCount);
    
    str = "Banana\nNuts\n";
    //\n or newline is a special escape sequence
    pattern = Pattern.compile("\n");
    String[] strArr = pattern.split(str);
    
    for(String s : strArr)
      System.out.println(s);
  }
}
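
The example above relies on java string escapes. The regex engine has its own escapes for octal, hexadecimal and unicode values, and to reach them we double the backslash so that the java compiler leaves the escape alone. The sketch below counts double quotes using each regex form. The class name EscapeDemo and the helper countMatches() are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDemo {

  // Hypothetical helper: counts how many times regex matches.
  static int countMatches(String regex, String input) {
    Matcher matcher = Pattern.compile(regex).matcher(input);
    int count = 0;
    while (matcher.find()) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    String str = "\"Banana\" \"Nuts\"";
    // Each pattern is a regex escape for the double quote
    // character (code point 34). The doubled backslash passes
    // the escape on to the regex engine untouched.
    String octal = "\\042";     // regex octal escape
    String hex = "\\x22";       // regex hexadecimal escape
    String unicode = "\\u0022"; // regex unicode escape

    System.out.println("octal: " + countMatches(octal, str));
    System.out.println("hex: " + countMatches(hex, str));
    System.out.println("unicode: " + countMatches(unicode, str));
  }
}
```

All three patterns find the same four double quotes in the input.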

Character Classes

Character classes are sets of characters enclosed within brackets([]). Character classes may contain union(implicit) or intersection(&&) operator. Character classes may also contain another character class and other operators like the negate(^) and range(-) operators.
Note: Java standard operators and regex operators function differently.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    
    //Using lookingAt() operation
    //[tuv] means, one of these letters(t,u,v)
    //must be the first character in the
    //input sequence for a match to happen.
    Pattern pattern = Pattern.compile("[tuv]",
                      Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher("Tell Me Why...");
    boolean isMatch = matcher.lookingAt();
    System.out.println("using lookingAt()");
    System.out.println("Match Found? " + isMatch);
    System.out.println();
    
    //reset matcher with new input sequence
    matcher.reset("abc");
    
    //Using matches() operation
    //[tuv] matches exactly one character, so
    //the entire input sequence must be a single
    //"t","u" or "v" for a match to happen.
    //It's not considered as a match if the
    //input sequence has two or more characters,
    //even if all of them are in [tuv]
    //
    //matches() will return false 'cause
    //the input sequence is not "t","u"
    //or "v"
    isMatch = matcher.matches();
    System.out.println("using matches()");
    System.out.println("Match Found? " + isMatch);
    System.out.println();
    
    //reset matcher with new input sequence
    matcher.reset("t");
    //This will return true 'cause the input
    //sequence is one of the letters of the
    //character class([tuv])
    isMatch = matcher.matches();
    System.out.println("using matches()");
    System.out.println("Match Found? " + isMatch);
    System.out.println();
     
    //reset matcher with new input sequence
    matcher.reset("tu");
    
    //This will return false 'cause the input
    //sequence is not "t","u" or "v".
    isMatch = matcher.matches();
    System.out.println("using matches()");
    System.out.println("Match Found? " + isMatch);
    System.out.println();
    
    //Using find() operation
    //[tuv] means, one of these letters(t,u,v)
    //must be in the input sequence for a
    //match to happen.
    
    String str = "Tell them to vacate the area and"+
                 " hide underground!";
    //reset matcher with new input sequence
    matcher.reset(str);
    
    System.out.println("using find()");
    while(matcher.find()){
      System.out.println("Match Found!");
      System.out.println("index: " + matcher.start());
      System.out.println("Character: " + 
                         str.charAt(matcher.start()));
      System.out.println();
    }
    
  }
}

Negate(^)

Let's create another example. Let's use the negate(^) operator this time. The negate(^) operator reverses the effect of an expression in a character class.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String str = "Trace my card.";
  
    //[^td] means, any character except for "t"
    //and "d".
    Pattern pattern = Pattern.compile("[^td]",
                      Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(str);
    
    System.out.println("Input Sequence: " + str);
    System.out.print("Result: ");
    while(matcher.find())
      System.out.print(str.charAt(matcher.start()));
  }
}

Range(-)

Next, let's use the range(-) operator. Range(-) operator simply sets a range between two characters.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String str = "Gone, Igloo, Fear, Hall";
    
    //[f-i] means, any letter between "f" and "i"
    //in this example, we use the negate operator
    //so, [^f-i] means, any letter except for the
    //letters between "f" and "i"
    Pattern pattern = Pattern.compile("[^f-i]",
                      Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(str);
    
    System.out.println("Input Sequence: " + str);
    System.out.print("Result: ");
    while(matcher.find())
      System.out.print(str.charAt(matcher.start()));
  }
}

Note, the left operand of the range operator must not be greater than the right operand. Otherwise, you will encounter a PatternSyntaxException. Try changing [^f-i] to [^i-f].

Each character in the unicode character table has a number representation. You can use those numbers as reference to compare if a character is greater than the other one.
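
To check the ordering rule without looking up a table, we can cast a char to int to see its number and try to compile the range in both directions. This is only a sketch; the class name RangeOrderDemo and the helper compiles() are assumptions for this example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RangeOrderDemo {

  // Hypothetical helper: true if the regex compiles without a
  // PatternSyntaxException.
  static boolean compiles(String regex) {
    try {
      Pattern.compile(regex);
      return true;
    } catch (PatternSyntaxException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Casting a char to int shows its number representation
    System.out.println("f = " + (int) 'f'); // prints f = 102
    System.out.println("i = " + (int) 'i'); // prints i = 105

    // 102 comes before 105, so f-i is a valid range
    System.out.println("[f-i] compiles? " + compiles("[f-i]"));
    // the reversed range is rejected
    System.out.println("[i-f] compiles? " + compiles("[i-f]"));
  }
}
```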

Union and Intersection(&&)

Next, let's try the union(implicit) and intersection(&&) operators. Quoted from Pattern class documentation: "The union operator denotes a class that contains every character that is in at least one of its operand classes. The intersection operator denotes a class that contains every character that is in both of its operand classes."

First off, let's create an example to demonstrate the union operator.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String str = "Six6Nine9";
    
    //This pattern "[a-z[A-Z]]" has a union operator
    //between "a-z" and "[A-Z]" expressions.
    //
    //[a-z[A-Z]] can be shortened into [a-zA-Z]
    //
    //[a-z[A-Z]] means, any character ranging from "a"
    //to "z" or "A" to "Z"
    Pattern pattern = Pattern.compile("[a-z[A-Z]]");
    Matcher matcher = pattern.matcher(str);
    
    System.out.println("Input Sequence: " + str);
    System.out.print("Result: ");
    while(matcher.find())
      System.out.print(str.charAt(matcher.start()));
  }
}

Next, let's create an example to demonstrate intersection operator.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String str = "because dude";
    
    //"[a-d&&[bc]]" can be shortened into:
    //"[a-d&&bc]"
    Pattern pattern = Pattern.compile("[a-d&&[bc]]");
    Matcher matcher = pattern.matcher(str);
    
    System.out.println("Input Sequence: " + str);
    System.out.print("Result: ");
    while(matcher.find())
      System.out.print(str.charAt(matcher.start()));
  }
}

The result is "bc" because the characters "b" and "c" are present in both intersection operands(a-d and bc).

Let's try a more complex character class.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String str = "1234abcdefghi";
    
    //This pattern means, "123" or "a" to "z" but limited to
    //"a" to "g" and exclude "bc".
    //The result is: 123adefg
    //4 is not included in the result 'cause we only include
    //"123" in our pattern.
    //b and c are not included 'cause those are excluded
    //h and i are not included 'cause they're not in the 
    //range between "a" and "g"
    String regEx = "[[123][a-z&&[a-g&&[^bc]]]]";
    Pattern pattern = Pattern.compile(regEx);
    Matcher matcher = pattern.matcher(str);
    
    System.out.println("Input Sequence: " + str);
    System.out.print("Result: ");
    while(matcher.find())
      System.out.print(str.charAt(matcher.start()));
  }
}
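
A common use of intersection is subtraction: intersecting a range with a negated class removes the negated characters from the range. For instance, [a-z&&[^aeiou]] can be read as "a" to "z" except the vowels. The sketch below is an illustration only. The class name SubtractionDemo and the helper extract() are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubtractionDemo {

  // Hypothetical helper: joins every match of regex into one string.
  static String extract(String regex, String input) {
    Matcher matcher = Pattern.compile(regex).matcher(input);
    StringBuilder result = new StringBuilder();
    while (matcher.find()) {
      result.append(matcher.group());
    }
    return result.toString();
  }

  public static void main(String[] args) {
    String str = "because dude";
    // lowercase letters minus the vowels
    System.out.println(extract("[a-z&&[^aeiou]]", str)); // prints bcsdd
  }
}
```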

Pre-Defined Character Classes

There are characters or metacharacters that denote pre-defined character classes. For example, "\d" denotes the digit-only character class "[0-9]", "." denotes any character (it matches line terminators only if the DOTALL flag is enabled), etc. Check out the Pattern class documentation for more pre-defined character classes.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    
    System.out.println("\".\" metacharacter");
    //"p.." means, a "p" and two characters of
    //any kind
    boolean isMatch = Pattern.matches("p..","pie");
    //This will return true.
    System.out.println("isMatch? " + isMatch);
    
    isMatch = Pattern.matches("p..","pit");
    //This will return true.
    System.out.println("isMatch? " + isMatch);
    
    isMatch = Pattern.matches("p..","pi");
    //This will return false 'cause the match
    //operation requires three characters.
    System.out.println("isMatch? " + isMatch);
    
    isMatch = Pattern.matches("p..","sit");
    //This will return false 'cause the first
    //character must be "p"
    System.out.println("isMatch? " + isMatch);
    
    isMatch = Pattern.matches(".pe","ape");
    //This will return true 'cause the first
    //character can be any character
    System.out.println("isMatch? " + isMatch);
    
    //"." in the character class brackets and
    //"." outside the brackets function differently
    //"." outside brackets works as a metacharacter
    //whereas "." inside brackets works as
    //a literal.
    //So, .[.] means, any character followed by
    //a dot.
    isMatch = Pattern.matches(".[.]","a.");
    //This will return true
    System.out.println("isMatch? " + isMatch);
    
    isMatch = Pattern.matches(".[.]","ap");
    //This will return false
    System.out.println("isMatch? " + isMatch);
    System.out.println();
    
    System.out.println("\\d metacharacter");
    //\d is equivalent to [0-9] character class
    //We need to escape "\" by using "\"
    //that's why there are two "\"
    isMatch = Pattern.matches("\\d","2");
    //This will return true
    System.out.println("isMatch? " + isMatch);
    
    isMatch = Pattern.matches("\\d","a");
    //This will return false
    System.out.println("isMatch? " + isMatch);
    System.out.println();
    
    System.out.println("\\s metacharacter");
    //\s means whitespace character. \s is
    //equivalent to [ \t\n\x0B\f\r] character class
    //(\W, on the other hand, means a non-word
    //character and is equivalent to [^\w])
    isMatch = Pattern.matches("\\s"," ");
    //This will return true
    System.out.println("isMatch? " + isMatch);
    
    isMatch = Pattern.matches("\\s","\t");
    //This will return true
    System.out.println("isMatch? " + isMatch);
  }
}

Note: Some metacharacters that denote pre-defined character classes may not work as intended if they're inside the character class brackets([]). One example is the "." metacharacter that I explain in the example above.
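
The lowercase and uppercase pre-defined classes come in pairs: \d and \D, \s and \S, \w and \W. The uppercase one is the negation of the lowercase one. Here is a short sketch that checks a few single characters against these classes. The class name PredefinedClassDemo and the helper isIn() are assumptions for this example.

```java
import java.util.regex.Pattern;

class PredefinedClassDemo {

  // Hypothetical helper: true if the whole input (expected to be
  // one character) belongs to the given pre-defined class.
  static boolean isIn(String characterClass, String input) {
    return Pattern.matches(characterClass, input);
  }

  public static void main(String[] args) {
    // \s is whitespace, \S is anything but whitespace
    System.out.println("\\s on space: " + isIn("\\s", " "));
    System.out.println("\\S on space: " + isIn("\\S", " "));

    // \w is a word character [a-zA-Z_0-9], \W is its negation
    System.out.println("\\w on _: " + isIn("\\w", "_"));
    System.out.println("\\W on _: " + isIn("\\W", "_"));
    System.out.println("\\W on space: " + isIn("\\W", " "));

    // \d is a digit, \D is anything but a digit
    System.out.println("\\d on 5: " + isIn("\\d", "5"));
    System.out.println("\\D on 5: " + isIn("\\D", "5"));
  }
}
```

Note that a space is matched by both \s and \W: it is a whitespace character and it is also not a word character.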

Boundary Matchers

Boundary matchers are metacharacters that match a position in the input sequence rather than a character. They are usually placed at the beginning or end of a pattern. For example, the "^" character denotes that a character or set of characters must be at the beginning of a line for a match to happen e.g. ^Hello. Check out the Pattern class documentation for more boundary matchers.

^ and $

"^" matches at the beginning of a line whereas "$" matches at the end of a line.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String str = "Hello This is Me!\nHello There!";
    
    Pattern pattern = Pattern.compile("^Hello");
    Matcher matcher = pattern.matcher(str);
    int helloCount = 0;
    
    while(matcher.find()) helloCount++;
    
    System.out.println("Input: " + str);
    //helloCount is 1 'cause there's one "Hello" word
    //at the beginning
    System.out.println("How many Hello: " + helloCount);
    System.out.println();
    
    matcher.reset();
    helloCount = 0;
    
    //"$" denotes that a character or a word
    //must be at the end of a line.
    matcher.usePattern(Pattern.compile("Hello$"));
    
    while(matcher.find()) helloCount++;
    
    System.out.println("Input: ");
    System.out.println(str);
    //helloCount is 0 'cause there's no "Hello" word
    //at the end
    System.out.println("How many Hello: " + helloCount);
    System.out.println();
    
    boolean isMatch = Pattern.matches("^Hello$","Hello");
    //This will return true 'cause the word "Hello" is
    //at the beginning and end of the input sequence
    System.out.println("isMatch: " + isMatch);
    
    isMatch = Pattern.matches("^Hello$",str);
    //This will return false 'cause "Hello" is not
    //at the beginning and at the end of the input
    //sequence
    System.out.println("isMatch: " + isMatch);
  }
}

In the example above, java treats the "Hello This is Me!\nHello There!" input as a single line even though there's a \n in it because, by default, "^" and "$" only match at the beginning and at the end of the entire input sequence. To make "^" and "$" match at the beginning and end of each line, we need to enable the MULTILINE flag.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String str = "Hello This is Me!\nHello There!";
    
    //Enable MULTILINE Flag
    Pattern pattern = Pattern.compile("^Hello",Pattern.MULTILINE);
    Matcher matcher = pattern.matcher(str);
    int helloCount = 0;
    
    while(matcher.find()) helloCount++;
    
    System.out.println("Input: " + str);
    //helloCount is 2 'cause java recognizes \n so,
    //"Hello This is Me!" is on the first line and
    //"Hello There!" is on the second line.
    System.out.println("How many Hello: " + helloCount);
    System.out.println();
  }
}
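
The MULTILINE flag affects "$" the same way it affects "^". If we need an anchor that always means the beginning of the whole input regardless of the flag, the Pattern class also has the \A boundary matcher. A minimal sketch follows. The class name LineAnchorDemo and the helper countMatches() are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineAnchorDemo {

  // Hypothetical helper: counts matches of regex compiled with
  // the given flags (0 means no flags).
  static int countMatches(String regex, int flags, String input) {
    Matcher matcher = Pattern.compile(regex, flags).matcher(input);
    int count = 0;
    while (matcher.find()) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    String str = "Hello This is Me!\nHello There!";

    // Without MULTILINE, "$" matches only at the end of the input
    System.out.println(countMatches("!$", 0, str)); // prints 1

    // With MULTILINE, "$" matches at the end of each line
    System.out.println(countMatches("!$", Pattern.MULTILINE, str)); // prints 2

    // \A ignores MULTILINE: only the beginning of the input counts
    System.out.println(
        countMatches("\\AHello", Pattern.MULTILINE, str)); // prints 1
  }
}
```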

Word Boundary

Next, let's use the word boundary(\b). A word boundary matches the position right before the first character of a word and the position right after the last character of a word in the input sequence. It matches a position, not a character.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
  
    String str = "Hello! Java,regex File?Tutorial";
	
    Pattern pattern = Pattern.compile("\\b");
    Matcher matcher = pattern.matcher(str);
    
    while(matcher.find()){
      try{
        System.out.println("index: " + matcher.start() +
                           " character: " + 
                           str.charAt(matcher.start()));
      }
      catch(StringIndexOutOfBoundsException e){
        System.out.println("An exception occured");
        System.out.println("Input last index: " + 
                          (str.length()-1));
        System.out.println("Matcher index: " + matcher.start());
      }
    }
    
  }
}

Result

index: 0 character: H
index: 5 character: !
index: 7 character: J
index: 11 character: ,
index: 12 character: r
index: 17 character: 
index: 18 character: F
index: 22 character: ?
index: 23 character: T
An exception occured
Input last index: 30
Matcher index: 31

So, the first match is at "H", which is the first character of the "Hello" word. The next match is at "!", which is the character right after "o", the last character of the "Hello" word, and so on. "\b" doesn't consume any character: every match here is a "zero-length match", which is why we print the character found at the matched position. Notice the last match: the matcher index (31) is larger than the last index of the input sequence (30). It's the position right after the final "l", so there's no character to read there.

Try removing the try-catch clause in the example above and you will encounter a StringIndexOutOfBoundsException in the last part of the matching operation.

We can specify which word boundary to get by adding a sequence before "\b" or after it.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
  
    String str = "Hello! Java,regex Jill?Loo";
	
    Pattern pattern = Pattern.compile("\\bJ");
    Matcher matcher = pattern.matcher(str);
    
    System.out.println("Get first subsequence boundary");
    while(matcher.find()){
      if(matcher.start() < str.length())
        System.out.println("index: " + matcher.start() +
                           " character: " + 
                           str.charAt(matcher.start()));
      else
        System.out.println("An empty string");
    }
    
    System.out.println();
    matcher.reset();
    matcher.usePattern(Pattern.compile("o\\b"));
    
    System.out.println("Get last subsequence boundary");
    while(matcher.find()){
      if(matcher.start() < str.length())
        System.out.println("index: " + matcher.start() +
                           " character: " + 
                           str.charAt(matcher.start()));
      else
        System.out.println("An empty string");
    }
    
  }
}

Result

Get first subsequence boundary
index: 7 character: J
index: 18 character: J

Get last subsequence boundary
index: 4 character: o
index: 25 character: o
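
A typical use of "\b" is a whole-word search: wrapping a word in two word boundaries prevents matches inside longer words. Here is a short sketch, assuming we want to count the word "cat" but not the "cat" inside "concatenate". The class name WholeWordDemo and the helper countMatches() are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeWordDemo {

  // Hypothetical helper: counts how many times regex matches.
  static int countMatches(String regex, String input) {
    Matcher matcher = Pattern.compile(regex).matcher(input);
    int count = 0;
    while (matcher.find()) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    String str = "cat concatenate cat.";

    // A plain literal also matches inside "concatenate"
    System.out.println(countMatches("cat", str)); // prints 3

    // Word boundaries on both sides keep whole words only
    System.out.println(countMatches("\\bcat\\b", str)); // prints 2
  }
}
```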

Non-Word Boundary

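Non-word boundary(\B) is the opposite of word boundary: it matches every position where "\b" doesn't match, which is a position between two word characters or between two non-word characters (the beginning and end of the input count as non-word characters here). I like to think of word boundary as container bounds whereas non-word boundary as separator bounds, if you will.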

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String str = "Hello! This is me!";
	
    Pattern pattern = Pattern.compile("\\B");
    Matcher matcher = pattern.matcher(str);
    
    String result = "";
    while(matcher.find())
      try{
        result += str.charAt(matcher.start());
      }catch(StringIndexOutOfBoundsException e){
        System.out.println("An exception occured");
        System.out.println("Input last index: " + 
        (str.length()-1));
        System.out.println("Matcher index: " + matcher.start());
      }
      System.out.println();
      System.out.println("Non-word boundary");
      System.out.println("Input: " + str);
      System.out.println("result: " + result);
      System.out.println();
      
      pattern = Pattern.compile("\\b");
      matcher = pattern.matcher(str);
      
      result = "";
      while(matcher.find())
      try{
        result += str.charAt(matcher.start());
      }catch(StringIndexOutOfBoundsException e){
        System.out.println("An exception occured");
        System.out.println("Input last index: " + (str.length()-1));
        System.out.println("Matcher index: " + matcher.start());
      }
      System.out.println();
      System.out.println("Word boundary");
      System.out.println("Input: " + str);
      System.out.println("result: " + result);
      System.out.println();
      
  }
}

Result

An exception occured
Input last index: 17
Matcher index: 18

Non-word boundary
Input: Hello! This is me!
result: ello hisse


Word boundary
Input: Hello! This is me!
result: H!T i m!

So, in the non-word boundary result, the "ello" sequence is the bounds(separator) between the "H" and "!" characters, whereas in the word boundary result, "H" and "!" are the bounds(container) of the "ello" sequence. The "his" sequence is the bounds(separator) between the "T" and " "(whitespace) characters whereas "T" and " " are the bounds(container) of the "his" sequence and so on.

We can specify which non-word boundary to get by adding a sequence before "\B" or after it.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String str = "Hello! meat is great!";
	
    Pattern pattern = Pattern.compile("\\Be");
    Matcher matcher = pattern.matcher(str);
    
    System.out.println("Pattern: " + "\\Be");
    System.out.println("Input: " + str);
    while(matcher.find()){
      System.out.println("Match Found!");
      System.out.println("index: " + matcher.start());
    }
    System.out.println();
    
    pattern = Pattern.compile("a\\B");
    matcher = pattern.matcher(str);
    
    System.out.println("Pattern: " + "a\\B");
    System.out.println("Input: " + str);
    while(matcher.find()){
      System.out.println("Match Found!");
      System.out.println("index: " + matcher.start());
    }
  }
}

Result

Pattern: \Be
Input: Hello! meat is great!
Match Found!
index: 1
Match Found!
index: 8
Match Found!
index: 17

Pattern: a\B
Input: Hello! meat is great!
Match Found!
index: 9
Match Found!
index: 18

Quantifiers

Note: Quantifiers can be used in conjunction with character classes and capturing groups.
As the name implies, quantifiers specify how many times a character, character class or group must occur for a match to happen. There are three types of quantifiers: greedy, reluctant and possessive. These quantifiers have similar constructs. However, their mechanics are subtly different. Let's create an example to demonstrate quantifiers.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    System.out.println("? quantifier");
    //"?" means once or not at all.
    boolean isMatch = Pattern.matches("pie?","pie");
    //isMatch is true 'cause "e" occurs only once
    System.out.println("isMatch: " + isMatch);
    
    isMatch = Pattern.matches("pie?","pi");
    //isMatch is true 'cause "e" doesn't occur
    System.out.println("isMatch: " + isMatch);
    
    isMatch = Pattern.matches("pie?","piee");
    //isMatch is false 'cause "e" occurs many times
    System.out.println("isMatch: " + isMatch);
    System.out.println();
    
    System.out.println("* quantifier");
    //"*" means zero or more
    isMatch = Pattern.matches("pi*e","pie");
    //isMatch is true 'cause "i" occurs one time
    System.out.println("isMatch: " + isMatch);
    
    isMatch = Pattern.matches("pi*e","pe");
    //isMatch is true 'cause "i" doesn't occur
    System.out.println("isMatch: " + isMatch);
    
    isMatch = Pattern.matches("pi*e","piiie");
    //isMatch is true 'cause "i" occurs many times
    System.out.println("isMatch: " + isMatch);
    System.out.println();
    
    System.out.println("+ quantifier");
    //"+" means one or more
    isMatch = Pattern.matches("p+ie","pie");
    //isMatch is true 'cause "p" occurs one time
    System.out.println("isMatch: " + isMatch);
    
    isMatch = Pattern.matches("p+ie","pppie");
    //isMatch is true 'cause "p" occurs many times
    System.out.println("isMatch: " + isMatch);
    
    isMatch = Pattern.matches("p+ie","ie");
    //isMatch is false 'cause "p" doesn't occur
    System.out.println("isMatch: " + isMatch);
    System.out.println();
    
    System.out.println("{n} quantifier");
    //{n} means, exactly n times
    isMatch = Pattern.matches("p{2}ie","ppie");
    //isMatch is true 'cause "p" occurs
    //exactly two times
    System.out.println("isMatch: " + isMatch);
    
    isMatch = Pattern.matches("p{2}ie","pie");
    //isMatch is false 'cause "p" occurs one time
    System.out.println("isMatch: " + isMatch);
    System.out.println();
    
    System.out.println("{n,} quantifier");
    //{n,} means, at least n times
    isMatch = Pattern.matches("p{2,}ie","pppie");
    //isMatch is true 'cause "p" occurs
    //at least two times
    System.out.println("isMatch: " + isMatch);
    
    isMatch = Pattern.matches("p{2,}ie","pie");
    //isMatch is false 'cause "p" doesn't occur
    //at least two times
    System.out.println("isMatch: " + isMatch);
    System.out.println();
    
    System.out.println("{n,m} quantifier");
    //{n,m} means, at least n times and not more
    //than m times
    isMatch = Pattern.matches("pie{2,3}","piee");
    //isMatch is true 'cause "e" occurs at
    //least 2 times and not more than 3 times 
    System.out.println("isMatch: " + isMatch);
    
    isMatch = Pattern.matches("pie{2,3}","pieeee");
    //isMatch is false 'cause "e" occurs 
    //more than 3 times 
    System.out.println("isMatch: " + isMatch);
    
    isMatch = Pattern.matches("pie{2,3}","pie");
    //isMatch is false 'cause "e" doesn't occur 
    //at least 2 times 
    System.out.println("isMatch: " + isMatch);
    System.out.println();
    
    //This pattern is more complicated than previous
    //patterns and we need to understand the subtle
    //differences between greedy, reluctant and
    //possessive quantifiers in order to deeply
    //understand this pattern
    Matcher matcher = Pattern.compile(".*pie").
                      matcher("apple pie apple pie");
    //Result:
    //Match Found!
    //start index: 0
    //end index: 19
    while(matcher.find()){
      System.out.println("Match Found!");
      System.out.println("start index: " + matcher.start());
      System.out.println("end index: " + matcher.end());
    }
    
  }
}

Greedy Quantifiers

A greedy quantifier first grabs as many characters of the input as it can, and then the matcher tries to match the rest of the pattern against what is left. If that fails, the matcher releases one character at a time from the end of the grabbed sequence and retries. The matcher stops releasing characters when the overall pattern is satisfied or when there are no more characters to release.

Greedy Quantifiers
X? 	X, once or not at all
X* 	X, zero or more times
X+ 	X, one or more times
X{n} 	X, exactly n times
X{n,} 	X, at least n times
X{n,m} 	X, at least n but not more than m times

Let's create an example to demonstrate the mechanics of greedy quantifier.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
	
    //".*" means zero or more characters of
    //any kind before the sequence "pie"
    //"." means any character, "*" is a greedy
    //quantifier that means zero or more occurrences
    Pattern pattern = Pattern.compile(".*pie");
    Matcher matcher = pattern.matcher("apple pie apple pie");
    
    while(matcher.find()){
      System.out.println("Match Found!");
      System.out.println("start index: " + matcher.start());
      System.out.println("end index: " + matcher.end());
    }
  }
}

Result

Match Found!
start index: 0
end index: 19

So first off, we have two parts here: ".*" and "pie". ".*" grabs the entire input first, and then the matcher tries to match what is left against the subsequent part, "pie". The problem here is that ".*" grabbed everything, so there's nothing left to match against "pie".

The overall pattern needs to be satisfied for a match to happen. So, the matcher releases one character from the grabbed sequence, starting from the last index, and then tries to match the released characters against the subsequent part.

If the match still fails, the matcher releases another character, until the overall pattern is satisfied or there's no character left to release. ".*pie" is satisfied once the released characters are equal to "pie"; ".*" keeps whatever comes before them, which may be nothing at all.

The start index is 0 'cause that's where the match begins. The end index is 19 'cause the end() method returns the offset after the last character matched. The input sequence is 19 characters long, so its last character is at index 18 and end() returns 18 + 1 = 19. This is a regular match; zero-length matches are a different thing that we'll discuss in the "Zero-Length Match" topic.
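To make the backtracking visible, here is a minimal sketch that wraps both parts of the pattern in capturing groups (covered later in "Capturing Groups"), so we can see what ".*" ended up holding once the matcher was done releasing characters. The class name GreedyBacktrackDemo and the helper greedyPart() are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyBacktrackDemo {

  // Returns the text that the greedy ".*" kept after backtracking,
  // or null if the overall pattern did not match at all.
  static String greedyPart(String input) {
    Matcher matcher = Pattern.compile("(.*)(pie)").matcher(input);
    return matcher.find() ? matcher.group(1) : null;
  }

  public static void main(String[] args) {
    String input = "apple pie apple pie";
    // The input has 19 characters, so valid indexes are 0 to 18
    System.out.println("input length: " + input.length());
    // ".*" gave back just enough characters for "pie" to match
    System.out.println(".* kept: \"" + greedyPart(input) + "\"");
  }
}
```

Running it should show that ".*" kept "apple pie apple " (everything before the last "pie"), which is why the greedy pattern reports a single match covering the whole input.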

Reluctant Quantifiers

A reluctant quantifier starts by grabbing as few characters as it is allowed to, and lets the matcher try the subsequent part of the pattern first. If that fails, the quantifier grabs one more character, moving forward from the start of the match, and the matcher tries again. The quantifier stops grabbing characters when the overall pattern is satisfied or when there are no more characters to grab.

Reluctant quantifiers
X?? 	X, once or not at all
X*? 	X, zero or more times
X+? 	X, one or more times
X{n}? 	X, exactly n times
X{n,}? 	X, at least n times
X{n,m}? 	X, at least n but not more than m times

Let's create an example to demonstrate the mechanics of reluctant quantifier.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
	
    //Reluctant quantifier construct is similar to
    //greedy but with additional "?" after the
    //quantifier.
    //
    //".*?" still accepts zero or more
    //characters of any kind, but this time, the
    //quantifier is reluctant.
    Pattern pattern = Pattern.compile(".*?pie");
    Matcher matcher = pattern.matcher("apple pie apple pie");
    
    while(matcher.find()){
      System.out.println("Match Found!");
      System.out.println("start index: " + matcher.start());
      System.out.println("end index: " + matcher.end());
      System.out.println();
    }
  }
}

Result

Match Found!
start index: 0
end index: 9

Match Found!
start index: 9
end index: 19

First off, the input is matched against "pie" at index 0, which fails 'cause the input doesn't start with "pie". So, the quantifier grabs one character, and the matcher tries the subsequent part again. This repeats until ".*?" holds "apple " and "pie" is matched starting at index 6, which satisfies the overall pattern. For this match, start() returns 0 and end() returns 9, the offset after the last character matched.

Since we're using find(), our search operation continues until it reaches the second match with a start index of 9 and end index of 19.
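The difference between the two quantifier types is easier to see when we print the matched text instead of the indexes. The following is a small sketch; the class name GreedyVsReluctantDemo and the helper allMatches() are hypothetical names used only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctantDemo {

  // Collects the text of every match that find() reports.
  static List<String> allMatches(String regex, String input) {
    List<String> matches = new ArrayList<>();
    Matcher matcher = Pattern.compile(regex).matcher(input);
    while (matcher.find()) {
      matches.add(matcher.group());
    }
    return matches;
  }

  public static void main(String[] args) {
    String input = "apple pie apple pie";
    // Greedy: one long match that runs to the last "pie"
    System.out.println("greedy:    " + allMatches(".*pie", input));
    // Reluctant: stops at the nearest "pie", so find() gets two matches
    System.out.println("reluctant: " + allMatches(".*?pie", input));
  }
}
```

Note that the second reluctant match starts with a space, 'cause the matcher resumes at index 9, right where the first match ended.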

Possessive Quantifiers

A possessive quantifier grabs every character it can, just like a greedy quantifier. However, a possessive quantifier never releases any of the characters it grabbed, even if that makes the overall match fail.

Possessive quantifiers
X?+ 	X, once or not at all
X*+ 	X, zero or more times
X++ 	X, one or more times
X{n}+ 	X, exactly n times
X{n,}+ 	X, at least n times
X{n,m}+ 	X, at least n but not more than m times

Let's create an example to demonstrate the mechanics of possessive quantifier.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
	
    //Possessive quantifier construct is similar to
    //greedy but with additional "+" after the
    //quantifier.
    //
    //".*+" still accepts zero or more
    //characters of any kind, but this time, the
    //quantifier is possessive.
    Pattern pattern = Pattern.compile(".*+pie");
    Matcher matcher = pattern.matcher("apple pie apple pie");
    
    while(matcher.find()){
      System.out.println("Match Found!");
      System.out.println("start index: " + matcher.start());
      System.out.println("end index: " + matcher.end());
      System.out.println();
    }
    System.out.println("End of the program.");
    
  }
}

Result

End of the program.

As we can see, our program didn't find any matches. That's because ".*+" grabbed the entire input and didn't release any characters back to be matched against "pie". So, the matcher can't find a match between the input and the overall pattern.

Use a possessive quantifier if you want to grab all the characters it can match without ever needing to release any of them. Because it skips backtracking, it can outperform the equivalent greedy quantifier in situations where a match is not immediately found.
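A possessive quantifier is not doomed to fail; it only fails when the rest of the pattern needs characters that the quantifier already took. Here is a short sketch on a smaller input to show both outcomes. The class name PossessiveDemo and the helper wholeMatch() are hypothetical.

```java
import java.util.regex.Pattern;

class PossessiveDemo {

  // True if the regex matches the entire input.
  static boolean wholeMatch(String regex, String input) {
    return Pattern.matches(regex, input);
  }

  public static void main(String[] args) {
    // "a*+" takes "aaa" and stops at "b" on its own,
    // so nothing needs to be released for "b" to match
    System.out.println(wholeMatch("a*+b", "aaab"));

    // Greedy "a*" takes "aaa", then releases one "a"
    // so that the trailing "ab" can match
    System.out.println(wholeMatch("a*ab", "aaab"));

    // Possessive "a*+" takes "aaa" and never gives one back,
    // so the "a" in the trailing "ab" has nothing to match
    System.out.println(wholeMatch("a*+ab", "aaab"));
  }
}
```

The three calls should print true, true and false in that order.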

Zero-Length Match

A zero-length match happens when a pattern matches successfully without consuming any characters. We already encountered zero-length matches in previous topics. These matches can be found in an empty input string(""), at the beginning or end of an input string, or between two characters of an input string. A zero-length match can be spotted easily by checking the start and end index of a match. If both indexes are equal then, that match is a zero-length match.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    Pattern pattern = Pattern.compile("a?");
    Matcher matcher = pattern.matcher("");
    
    System.out.println("a? pattern");
    while(matcher.find()){
      System.out.println("Match Found!");
      System.out.println("start index: " + matcher.start());
      System.out.println("end index: " + matcher.end());
      System.out.println();
    }
    
    matcher = Pattern.compile("a*").matcher("");
    
    System.out.println("a* pattern");
    while(matcher.find()){
      System.out.println("Match Found!");
      System.out.println("start index: " + matcher.start());
      System.out.println("end index: " + matcher.end());
      System.out.println();
    }
    
  }
}

Result
a? pattern
Match Found!
start index: 0
end index: 0

a* pattern
Match Found!
start index: 0
end index: 0

The matcher found a match even though we used an empty string. This happens 'cause "?" and "*" accept zero occurrences. "a?" and "a*" look for "a" in the empty string, and a match happens whether "a" is found or not. Try the pattern "a+" and the matcher won't find any matches in the example above, because "+" doesn't accept zero occurrences. Let's try another example.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    Pattern pattern = Pattern.compile("a?");
    Matcher matcher = pattern.matcher("abaaa");
    
    System.out.println("pattern: a?");
    System.out.println("input: abaaa");
    System.out.println("Matches");
    while(matcher.find()){
      System.out.print("start index: " + matcher.start());
      System.out.print(" ");
      System.out.print("end index: " + matcher.end());
      System.out.println();
    }
    
    matcher = Pattern.compile("a*").matcher("abaaa");
    
    System.out.println();
    System.out.println("pattern: a*");
    System.out.println("input: abaaa");
    System.out.println("Matches");
    while(matcher.find()){
      System.out.print("start index: " + matcher.start());
      System.out.print(" ");
      System.out.print("end index: " + matcher.end());
      System.out.println();
    }
    
    matcher = Pattern.compile("a+").matcher("abaaa");
    
    System.out.println();
    System.out.println("pattern: a+");
    System.out.println("input: abaaa");
    System.out.println("Matches");
    while(matcher.find()){
      System.out.print("start index: " + matcher.start());
      System.out.print(" ");
      System.out.print("end index: " + matcher.end());
      System.out.println();
    }
    
  }
}

Result
pattern: a?
input: abaaa
Matches
start index: 0 end index: 1
start index: 1 end index: 1
start index: 2 end index: 3
start index: 3 end index: 4
start index: 4 end index: 5
start index: 5 end index: 5

pattern: a*
input: abaaa
Matches
start index: 0 end index: 1
start index: 1 end index: 1
start index: 2 end index: 5
start index: 5 end index: 5

pattern: a+
input: abaaa
Matches
start index: 0 end index: 1
start index: 2 end index: 5

If we look at the result, "a?" produces a match at every position 'cause "?" means once or not at all. At each position, "a?" matches one "a" if it's there, or nothing if it isn't. "a*" checks the input differently: "*" means zero or more. So, first, "a*" checks if "a" is present; if it is, then it checks if this "a" is followed by another "a" and so on, and all of them become a single match.

Notice this result "start index: 1 end index: 1" in both "?" and "*" quantifiers. Index 1 holds the "b" character, and a match is still reported there. This happens 'cause "?" and "*" accept zero occurrences of "a". "b" is not part of the match: since "a" doesn't appear at index 1, the matcher matches nothing at that position, and the result is a zero-length match.

Next, look at the result "start index: 5 end index: 5" in both "?" and "*" quantifiers. The max index of "abaaa" is 4, but we still got a match at index 5. Index 5 is the position right after the last character, the end of the input. There's no character there (Java strings are not terminated by a null character), so "a?" and "a*" match the empty string at that position. Thus, the result is a zero-length match.

Now, let's check the "a+" pattern. "+" means one or more. "+" checks if "a" appears at a position, then checks if it's followed by another "a" and so on. In the result, "+" didn't report matches at index 1 and index 5 'cause "a" didn't appear at those positions at least once.
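The "start equals end" check described above can be wrapped in a small helper. This is only a sketch; the class name ZeroLengthDemo and the helper describeMatches() are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroLengthDemo {

  // Describes every match as "start-end" and flags the ones
  // whose start and end offsets are equal.
  static List<String> describeMatches(String regex, String input) {
    List<String> result = new ArrayList<>();
    Matcher matcher = Pattern.compile(regex).matcher(input);
    while (matcher.find()) {
      String span = matcher.start() + "-" + matcher.end();
      if (matcher.start() == matcher.end()) {
        span += " (zero-length)";
      }
      result.add(span);
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println("a*: " + describeMatches("a*", "abaaa"));
    System.out.println("a+: " + describeMatches("a+", "abaaa"));
    System.out.println("a? on empty input: " + describeMatches("a?", ""));
  }
}
```

With "a*" the helper should flag the matches at offsets 1 and 5, while "a+" should produce no flagged matches at all.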

Logical Operators

We use logical operators to combine different types of regular-expression constructs. These are the logical operators:
XY(AND operator: X followed by Y)
X|Y(OR operator: either X or Y)
(X)(Capturing Group)
We're going to discuss the AND and OR operators in this topic. Capturing groups have their own topic. Let's create an example to demonstrate AND and OR operators.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    
    //.+[mu].++ simply means, one or more
    //characters on both sides of "m" or "u".
    //".+" is a greedy quantifier whereas .++
    //is a possessive quantifier.
    //
    //This is a combination of quantifiers and a
    //character class. The AND operator is 
    //implicit. That's why we don't see any
    //operator token between ".+", "[mu]" and ".++"
    //
    //regex constructs that are connected with one
    //another using the AND operator are matched
    //one after another, and each construct must be
    //satisfied in the given order for a match to happen
    Matcher matcher = Pattern.compile(".+[mu].++").
                      matcher("immutable");
    
    boolean isMatch = matcher.matches(); 
    //isMatch is true
    System.out.println("AND operator");
    System.out.println("isMatch? " + isMatch);
    
    matcher.reset("metabolism");
    isMatch = matcher.matches();
    //isMatch is false
    System.out.println("isMatch? " + isMatch);
    System.out.println();
    
    //In this pattern, we use "|" or OR operator
    //between ^facebook$ and ^youtube$.
    //
    //regex constructs that are connected with one
    //another using the OR operator are alternatives:
    //only one construct is required to be
    //satisfied for a match to happen
    //
    //Don't confuse "^facebook$|^youtube$" with
    //"^facebook|youtube$". These two are different.
    //
    //"^facebook|youtube$" means "facebook" at the start
    //of a line or "youtube" at the end of a line.
    //
    //"^facebook$|^youtube$" means a line that is 
    //exactly "facebook" or exactly "youtube".
    matcher = Pattern.compile("^facebook$|^youtube$").
              matcher("facebook");
    
    isMatch = matcher.matches();      
    System.out.println("OR operator");
    //isMatch is true 'cause even though ^youtube$
    //is not satisfied, ^facebook$
    //is satisfied.
    //
    //Since we're using OR operator between
    //^facebook$ and ^youtube$, only one of
    //them is needed to be satisfied for a
    //match to happen
    System.out.println("isMatch? " + isMatch);
              
  }
}
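The difference between "^facebook|youtube$" and "^facebook$|^youtube$" doesn't show up with matches(), 'cause matches() always requires the whole input to match. A sketch with find() makes it visible. The class name AlternationDemo and the helper found() are hypothetical, and the inputs are made up for this illustration.

```java
import java.util.regex.Pattern;

class AlternationDemo {

  // True if the regex matches anywhere in the input.
  static boolean found(String regex, String input) {
    return Pattern.compile(regex).matcher(input).find();
  }

  public static void main(String[] args) {
    // "|" splits the whole pattern, so "^" only belongs to
    // "facebook" and "$" only belongs to "youtube"
    System.out.println(found("^facebook|youtube$", "facebook.com"));
    System.out.println(found("^facebook|youtube$", "my youtube"));

    // Here each alternative carries its own "^" and "$"
    System.out.println(found("^facebook$|^youtube$", "facebook.com"));

    // A group limits the reach of "|", so "^" and "$"
    // apply to both alternatives
    System.out.println(found("^(facebook|youtube)$", "youtube"));
  }
}
```

The four calls should print true, true, false and true. The last pattern is a shorter way of writing "^facebook$|^youtube$".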

Escaping Metacharacters

Metacharacters are characters that have special meaning in regular expressions, like brackets([]), braces({}), dot(.), etc. Sometimes, we want to match them as ordinary characters instead. Escaping metacharacters is similar to writing escape sequences in string literals: we use a backslash(\) to escape a metacharacter.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    
    //use double backslash to escape "[" and "]"
    //if we only use one backslash then, the compiler
    //will read "\[" and "\]" as escape sequences like \n,
    //which is invalid 'cause "\[" and "\]" are 
    //not valid escape sequences in a string literal.
    //
    //we add another backslash to escape the backslash
    //itself. The string then holds \[\] and the matcher
    //treats those metacharacters as regular literals.
    //
    //In this pattern, "[" and "]" are treated as
    //literals
    boolean isMatch = Pattern.matches("\\[\\]","[]");
    //isMatch is true
    System.out.println("isMatch? " + isMatch);
    
    //This pattern escapes "." which means
    //"any characters"
    Matcher matcher = Pattern.compile("\\.+").
                      matcher(".abba.");
    
    System.out.println();
    //Two matches are found 'cause "\\.+" looks
    //for one or more occurrences of dot(.)
    System.out.println("Matches");
    while(matcher.find()){
      System.out.print("start index: " + matcher.start());
      System.out.print(" ");
      System.out.print("end index: " + matcher.end());
      System.out.println();
    }
    
    //In this pattern, "." is not escaped
    //matcher treats the "." here as a
    //metacharacter
    matcher = Pattern.compile(".+").
                      matcher(".abba.");
    
    System.out.println();
    //the match spans offsets 0 to 6 
    //'cause ".+" looks for one or more
    //occurrences of any character
    System.out.println("Matches");
    while(matcher.find()){
      System.out.print("start index: " + matcher.start());
      System.out.print(" ");
      System.out.print("end index: " + matcher.end());
      System.out.println();
    }
  }
}

Next, let's look at constructs that start with a backslash themselves, like "\b", and see how each extra backslash changes their meaning.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
  
    boolean isMatch = Pattern.matches("\b","\b");
    //isMatch is true 'cause the Java compiler turns
    //"\b" in both arguments into a single backspace
    //character (U+0008), so the regex engine matches
    //a literal backspace against a backspace
    System.out.println("isMatch? " + isMatch);
    
    //we escape the backslash by adding another
    //backslash before it. Now the string holds the
    //two characters \ and b, which the regex engine
    //reads as the word boundary construct \b
    isMatch = Pattern.matches("\\b","\b");
    
    //isMatch is false 'cause a word boundary is a
    //zero-width construct and it can't consume the
    //backspace character in the input
    System.out.println("isMatch? " + isMatch);
    
    //with three backslashes, the string holds a
    //backslash followed by a backspace character.
    //The regex engine reads that as an escaped
    //(literal) backspace
    isMatch = Pattern.matches("\\\b","\b");
    //isMatch is true
    System.out.println("isMatch? " + isMatch);
    
    //with four backslashes, the string holds two
    //backslashes followed by "b". The regex engine
    //reads the two backslashes as one literal
    //backslash, followed by the letter "b"
    isMatch = Pattern.matches("\\\\b","\\b");
    
    //isMatch is true 'cause the pattern matches a
    //backslash followed by "b", which is exactly
    //what the second argument "\\b" contains
    System.out.println("isMatch? " + isMatch); 
    
    //This is how we put the literal backslash(\)
    //in a pattern
    isMatch = Pattern.matches("\\\\","\\");
    //isMatch is true
    System.out.println("isMatch? " + isMatch);
  }
}
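There are two layers of escaping at work here: first the Java compiler processes the string literal, then the regex engine reads whatever characters are left. To see what the compiler hands over to the regex engine, we can print the characters of each string. This is just a sketch; the class name EscapeLayersDemo and the helper codeUnits() are made up for this illustration.

```java
class EscapeLayersDemo {

  // Lists the UTF-16 code units of a string, so that invisible
  // characters such as backspace (U+0008) become visible.
  static String codeUnits(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      if (i > 0) {
        sb.append(' ');
      }
      sb.append(String.format("U+%04X", (int) s.charAt(i)));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // U+0008 is backspace, U+005C is backslash, U+0062 is "b"
    System.out.println("one backslash:     " + codeUnits("\b"));
    System.out.println("two backslashes:   " + codeUnits("\\b"));
    System.out.println("three backslashes: " + codeUnits("\\\b"));
    System.out.println("four backslashes:  " + codeUnits("\\\\b"));
  }
}
```

Only the second string reaches the regex engine as a backslash followed by "b", which is why it is the only one that acts as a word boundary.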

Another way of escaping metacharacters is using the quote() method in Pattern class.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    
    //quote() method returns a pattern string
    //that matches the given string literally
    //
    //special meanings of characters like
    //metacharacters, escape sequences, etc.
    //will be ignored
    boolean isMatch = Pattern.matches(Pattern.quote("[]"),"[]");
    //isMatch is true
    System.out.println("isMatch? " + isMatch);
  }
}
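quote() is handy when the text to search for is not known in advance and may contain metacharacters. The sketch below compares a quoted pattern with an unquoted one, and also uses the \Q...\E quotation construct, which quotes everything between the two markers. The class name QuoteDemo and the helper matchesLiterally() are hypothetical.

```java
import java.util.regex.Pattern;

class QuoteDemo {

  // True if the input is exactly the given text, with every
  // character of the text treated as a literal.
  static boolean matchesLiterally(String text, String input) {
    return Pattern.matches(Pattern.quote(text), input);
  }

  public static void main(String[] args) {
    // Unquoted, "." is a metacharacter and matches "x" too
    System.out.println(Pattern.matches("a.b", "axb"));

    // Quoted, "." only matches a real dot
    System.out.println(matchesLiterally("a.b", "axb"));
    System.out.println(matchesLiterally("a.b", "a.b"));

    // \Q...\E quotes a part of a pattern by hand
    System.out.println(Pattern.matches("\\Qa.b\\E", "a.b"));
  }
}
```

The four calls should print true, false, true and true.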

Groups and Capturing

A capturing group wraps multiple characters or constructs into a single unit by enclosing them in parentheses. If a capturing group matches part of the input, the matched text is saved in memory and can be recalled by using backreferences, which we will discuss later.

Capturing groups are numbered from left to right by counting their opening parentheses. For example, we group our regular expression like this: ((A)(B(C)))
We can count the number of groups like this:

1. ((A)(B(C)))
2. (A)
3. (B(C))
4. (C)

To get how many groups are in an expression, use the groupCount() method in the Matcher class. There's a special group called "group 0". This group represents the whole expression and it's not included in the count that groupCount() returns.

There are two types of groups: capturing and non-capturing groups. The groups that we just discussed are capturing groups. Non-capturing groups are groups that don't capture text and are not included in the total count of groups.

Groups that start with "(?" are either non-capturing groups or named capturing groups. Embedded Flag Expressions are non-capturing groups that start with "(?".
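The numbering of ((A)(B(C))) can be checked with a short sketch. It also shows the plain non-capturing group (?:X), which groups constructs without taking a group number. The class name GroupNumberingDemo and the helper groups() are invented for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberingDemo {

  // Returns the text of group 0 up to group groupCount(),
  // or an empty list if the regex doesn't match the whole input.
  static List<String> groups(String regex, String input) {
    List<String> result = new ArrayList<>();
    Matcher matcher = Pattern.compile(regex).matcher(input);
    if (matcher.matches()) {
      for (int i = 0; i <= matcher.groupCount(); i++) {
        result.add(matcher.group(i));
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // Four capturing groups plus group 0
    System.out.println(groups("((A)(B(C)))", "ABC"));

    // (?:A) is non-capturing, so only three groups are numbered
    System.out.println(groups("((?:A)(B(C)))", "ABC"));
  }
}
```

The first call should list ABC, ABC, A, BC and C (group 0 to group 4). In the second call "A" no longer gets its own entry, so the list is one item shorter.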

Embedded Flag Expressions(Non-capturing groups)

Embedded Flag Expressions are non-capturing groups that act as an alternative to setting flags via compile() method with two arguments. We can include these expressions to set the flags that we need in an expression. These are the embedded flag expressions.
Pattern.CANON_EQ:  None
Pattern.CASE_INSENSITIVE:  (?i)
Pattern.COMMENTS:  (?x)
Pattern.MULTILINE:  (?m)
Pattern.DOTALL:  (?s)
Pattern.LITERAL:  None
Pattern.UNICODE_CASE:  (?u)
Pattern.UNIX_LINES:  (?d)

Let's create an example to demonstrate these expressions.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String input = "SixFive\nSixFour";
    //If we only need one flag then, we write
    //it like this: (?X)pattern
    //where X is the flag letter
    //e.g. "(?i)^six"
    String pattern = "(?i)(?m)^six";
    
    Matcher matcher = Pattern.compile(pattern).matcher(input);
    
    while(matcher.find()){
      System.out.println("Match Found!");
      System.out.print("start: " + matcher.start());
      System.out.print(" ");
      System.out.print("end: " + matcher.end());
      System.out.println();
    }
  }
}

Result
Match Found!
start: 0 end: 3
Match Found!
start: 8 end: 11
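Since embedded flag expressions are an alternative to the two-argument compile() method, both ways should give the same matches. Here is a sketch that compares them; it also combines two flag letters in one expression, (?im). The class name FlagEquivalenceDemo and the helper matchStarts() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlagEquivalenceDemo {

  // Collects the start offset of every match in the input.
  static List<Integer> matchStarts(Pattern pattern, String input) {
    List<Integer> starts = new ArrayList<>();
    Matcher matcher = pattern.matcher(input);
    while (matcher.find()) {
      starts.add(matcher.start());
    }
    return starts;
  }

  public static void main(String[] args) {
    String input = "SixFive\nSixFour";

    // Flags written inside the pattern
    Pattern embedded = Pattern.compile("(?im)^six");
    // The same flags passed as the second argument
    Pattern viaFlags = Pattern.compile(
        "^six", Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
    // No flags at all: case-sensitive, "^" only at input start
    Pattern plain = Pattern.compile("^six");

    System.out.println("embedded: " + matchStarts(embedded, input));
    System.out.println("viaFlags: " + matchStarts(viaFlags, input));
    System.out.println("plain:    " + matchStarts(plain, input));
  }
}
```

The first two patterns should both report matches starting at 0 and 8, while the pattern without flags finds nothing 'cause "six" doesn't match the uppercase "S".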

Capturing Groups

I already discussed the definition of capturing groups in the "Groups and Capturing" topic which can be seen above. Now, let's create an example to demonstrate capturing groups.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String pattern = "((http://|https://)(www\\.)"+
                     "(dailymotion|youtube))\\.com";
    String input = "https://www.youtube.com";
    
    Matcher matcher = Pattern.compile(pattern).
                      matcher(input);
    
    //Use groupCount() to count how many groups are in an expression
    System.out.println("Group Count: " + matcher.groupCount());
    
    if(matcher.lookingAt()){
      System.out.println("It's a match!");
      //Use the group(int index) method to get the portion
      //of the input that was captured by a group.
      //Valid group indexes range from 0 up to
      //groupCount()
      //
      //Note: group() without argument returns the input
      //subsequence matched by the previous match
      System.out.println("Overall Match: " + matcher.group(0));
      System.out.println("1st group: " + matcher.group(1));
      System.out.println("2nd group: " + matcher.group(2));
      System.out.println("3rd group: " + matcher.group(3));
      System.out.println("4th group: " + matcher.group(4));
      
      //Use start(int group) and end(int group) to get the
      //start and end offsets of the input captured by a group.
      System.out.println();
      System.out.println("1st group start: " + matcher.start(1));
      System.out.println("3rd group start: " + matcher.start(3));
      System.out.println("4th group start: " + matcher.start(4));
    }
  }
}

Result
Group Count: 4
It's a match!
Overall Match: https://www.youtube.com
1st group: https://www.youtube
2nd group: https://
3rd group: www.
4th group: youtube

1st group start: 0
3rd group start: 8
4th group start: 12

Let's try another example.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    //The capture groups here are a bit complicated
    String pattern = "(.*?([Pp]([Aa])))\\d{2}";
    String input = "Peppa69";
    
    Matcher matcher = Pattern.compile(pattern).
                      matcher(input);
    
    System.out.println("Group Count: " + matcher.groupCount());
    
    if(matcher.matches()){
      System.out.println("It's a match!");
      
      for(int i = 0; i <= matcher.groupCount(); i++){
        System.out.println("Group " + i + ": " +
                           matcher.group(i));
      }
    }
    matcher.reset();
    
    System.out.println();
    while(matcher.find()){
      System.out.println("It's a match!");
      System.out.println("Group 2 start: " + matcher.start(2));
      System.out.println("Group 3 start: " + matcher.start(3));
    }
    
  }
}

Result
Group Count: 3
It's a match!
Group 0: Peppa69
Group 1: Peppa
Group 2: pa
Group 3: a

It's a match!
Group 2 start: 3
Group 3 start: 4

The pattern in the above example is kinda complicated. My explanation about that complicated pattern is based on my intuition. So, take it with a grain of salt. I expect readers to have an understanding about the regex constructs that I used in the pattern above.

First off, the input is matched against the pattern. ".*?" is reluctant, so it starts with nothing and the matcher tries "[Pp]([Aa])" right away. "P" at index 0 is not followed by "a", so ".*?" grabs one character and the matcher tries again. This repeats until index 3, where a "p" that is followed by "a" is found. That "p" and the subsequent "a" satisfy group 2 and group 3: the group 2 match starts at index 3 and the group 3 match starts at index 4. Then, the subsequent digits "69" satisfy "\\d{2}".

So, group 0 is the entire subsequence that matches the pattern. Group 1 is the subsequence matched by the outermost parentheses, which includes what group 2 and group 3 matched. Group 2 is the subsequence matched by "[Pp]([Aa])", which includes what group 3 matched. Group 3 only has its own subsequence.

Backreferences

A backreference allows us to recall the text that was captured by a group. A backreference starts with a backslash, followed by a digit that represents the group number.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String pattern = "(\\d\\d)\\1";
    String input = "1212";
    
    Matcher matcher = Pattern.compile(pattern).
                      matcher(input);
    
    System.out.println("Group Count: " + 
                       matcher.groupCount());
    
    System.out.println();
    while(matcher.find()){
      System.out.println("Match found!");
      System.out.println("start: " + matcher.start());
      System.out.println("end: " + matcher.end());
    }
    
  }
}

Result
Group Count: 1

Match found!
start: 0
end: 4

In the example above, we use "\1" to recall the "(\d\d)" group. Try changing the input to "1234" and the matcher won't find any match. That's because a backreference doesn't repeat the pattern of the group; it matches the exact text that the group captured.

In the example, (\d\d) looked for two digits and captured "12". Then, the backreference "\1" had to match that same text, so the next two characters must be "1" and "2". Let's try another example.

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String pattern = "([a-zA-Z]*?(\\d{3,3}))\\-\\2";
    String input = "Thud333-333";
    
    Matcher matcher = Pattern.compile(pattern).
                      matcher(input);
    
    System.out.println("Group Count: " + 
                       matcher.groupCount());
    
    System.out.println();
    while(matcher.find()){
      System.out.println("First Match");
      System.out.println("start: " + matcher.start());
      System.out.println("end: " + matcher.end());
      System.out.println();
      System.out.println("Group 0: " + matcher.group(0));
      System.out.println("Group 1: " + matcher.group(1));
      System.out.println("Group 2: " + matcher.group(2));
      System.out.println();
    }
    
    pattern = "([a-zA-Z]*?(\\d{3,3}))\\-\\1";
    input = "Thud333-333";
    
    matcher = Pattern.compile(pattern).
              matcher(input);
              
    System.out.println();
    while(matcher.find()){
      System.out.println("Second Match");
      System.out.println("start: " + matcher.start());
      System.out.println("end: " + matcher.end());
      System.out.println();
      System.out.println("Group 0: " + matcher.group(0));
      System.out.println("Group 1: " + matcher.group(1));
      System.out.println("Group 2: " + matcher.group(2));
      System.out.println();
    }
    
  }
}

Result
Group Count: 2

First Match
start: 0
end: 11

Group 0: Thud333-333
Group 1: Thud333
Group 2: 333

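Second Match
start: 4
end: 11

Group 0: 333-333
Group 1: 333
Group 2: 333

So, in the first match, the most suitable match for the "([a-zA-Z]*?(\\d{3,3}))\\-\\2" pattern is "Thud333-333", the entire input. "\2" is a backreference that refers to the "(\\d{3,3})" group, which captured "333".

In the second match, the most suitable match for "([a-zA-Z]*?(\\d{3,3}))\\-\\1" is "333-333". "\1" is a backreference that refers to the "([a-zA-Z]*?(\\d{3,3}))" group. If group 1 captured "Thud333", then "\1" would need another "Thud333" after the hyphen, which the input doesn't have. So, the match starts at index 4, where group 1 captures only "333".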
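To see that a backreference matches the captured text and not the group's pattern, we can compare "(\\d\\d)\\1" with a pattern that simply repeats "\\d\\d". The following is a small sketch; the class name BackreferenceDemo and its two helpers are hypothetical.

```java
import java.util.regex.Pattern;

class BackreferenceDemo {

  // Two digits followed by the same two digits again.
  static boolean repeatsPair(String input) {
    return Pattern.matches("(\\d\\d)\\1", input);
  }

  // Two digits followed by any two digits.
  static boolean fourDigits(String input) {
    return Pattern.matches("(\\d\\d)\\d\\d", input);
  }

  public static void main(String[] args) {
    // "\1" must match "12" again, so "1234" is rejected
    System.out.println(repeatsPair("1212"));
    System.out.println(repeatsPair("1234"));

    // Repeating the pattern accepts any four digits
    System.out.println(fourDigits("1212"));
    System.out.println(fourDigits("1234"));
  }
}
```

The calls should print true, false, true and true.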

Named Capturing Groups

We can give names to our capturing groups. Capturing groups are still numbered even when we assign names to them.
Syntax: (?<name>X)

import java.util.regex.*;
public class SampleClass{

  public static void main(String[]args){
    String pattern = "(?<num>(\\d){4})\\-\\1";
    String input = "1234-1234";
    
    Matcher matcher = Pattern.compile(pattern).
                      matcher(input);
    
    System.out.println("Group Count: " + 
                       matcher.groupCount());
    
    System.out.println();
    while(matcher.find()){
      System.out.println("Match found!");
      System.out.println("start: " + matcher.start());
      System.out.println("end: " + matcher.end());
      System.out.println();
      
      System.out.println("Group 0: " + matcher.group(0));
      //we use the name of group1 to get the subsequence
      //in it
      System.out.println("num: " + matcher.group("num"));
      System.out.println("Group 2: " + matcher.group(2));
    }
    
  }
}
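A named group can also be recalled by its name with the \k<name> backreference, instead of its number. Here is a minimal sketch with a slightly simplified pattern (it drops the inner group, so the named group is the only group). The class name NamedGroupDemo and the helper repeatedNumber() are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {

  // Returns the four digits if the input looks like "NNNN-NNNN"
  // with the same digits on both sides, or null otherwise.
  static String repeatedNumber(String input) {
    // \k<num> is a backreference to the group named "num"
    Matcher matcher =
        Pattern.compile("(?<num>\\d{4})-\\k<num>").matcher(input);
    return matcher.matches() ? matcher.group("num") : null;
  }

  public static void main(String[] args) {
    System.out.println(repeatedNumber("1234-1234"));
    System.out.println(repeatedNumber("1234-5678"));
  }
}
```

The first call should print 1234 and the second one null, 'cause "5678" is not the text that the "num" group captured.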

Sunday, July 11, 2021
