tl;dr

My regular expression is as follows:

pattern = (
    r"^" # start
    r"(https?://)?" # protocol
    r"(([a-z\d][a-z\d-]*[a-z\d]\.)+[a-z]{2,}|" # domain
    r"(\d{1,3}\.){3}(\d{1,3}))" # OR IPv4
    r"(:\d+)?" # port
    r"(/[a-z\d\-%_.~+]*)*" # path
    r"(\?[a-z\d%_.~+=&]*)?" # query parameters
    r"(\#[a-z\d_\-]*)?" # fragment locators
    r"$" # end
)
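
As a quick sanity check, the pattern can be compiled and tried against a few strings. This is only a minimal sketch: the names `URL_RE` and `is_url` are hypothetical, the sample inputs are made up, and `re.IGNORECASE` is my assumption, added because the character classes only list lowercase letters.

```python
import re

# Same expression as above, repeated so the snippet runs on its own.
pattern = (
    r"^"
    r"(https?://)?"
    r"(([a-z\d][a-z\d-]*[a-z\d]\.)+[a-z]{2,}|"
    r"(\d{1,3}\.){3}(\d{1,3}))"
    r"(:\d+)?"
    r"(/[a-z\d\-%_.~+]*)*"
    r"(\?[a-z\d%_.~+=&]*)?"
    r"(\#[a-z\d_\-]*)?"
    r"$"
)

# Assumption: match case-insensitively so "HTTPS://Example.COM" is accepted too.
URL_RE = re.compile(pattern, re.IGNORECASE)


def is_url(text: str) -> bool:
    """Return True if the whole string matches the pattern (hypothetical helper)."""
    return URL_RE.match(text) is not None


for candidate in (
    "https://www.example.com/path/to/page?x=1&y=2#top",
    "example.com",
    "192.168.0.1:8080/index.html",
    "http://localhost",    # rejected: the domain branch requires at least one dot
    "ftp://example.com",   # rejected: only http and https are allowed
    "999.999.999.999",     # accepted: the IPv4 blocks are not range-checked
):
    print(candidate, is_url(candidate))
```

The last three inputs show limitations of the expression itself rather than of the helper.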

Explanation

Let us go through the expression line by line and check what is happening:

  • ^ - This anchors the match to the start of the string. There should not be anything before the URL.
  • (https?://)? - This defines the protocol. The ? at the end indicates that the protocol may be missing; s? does the same for the s. If the group is present, it matches one of two strings: http:// or https://. I realize that there are more protocols one may consider, but this is sufficient.
  • The next two lines have to be seen as a unit - particularly, there is an “or” between them (see the pipe at the end of the first line):
    • ([a-z\d][a-z\d-]*[a-z\d]\.)+[a-z]{2,} matches a domain name.
      Starting at the end, we have got a top level domain which consists only of letters and is at least two characters long. That’s the [a-z]{2,} part.
      The group before that defines each level of the domain. Because of the +, we need at least one such level before the top level domain. Each level starts and ends with a letter or a digit. In the middle, there may be letters, digits or the - sign. Note that the separate start and end characters force each level to be at least two characters long, so a.com does not match.
    • The second option here is an IP address. My regular expression currently only supports IPv4 addresses, v6 still to come. We first have got three blocks where each block has got between $1$ and $3$ digits followed by a dot. Finally, the fourth block again has between $1$ and $3$ digits but is not followed by a dot. The blocks are not range-checked, so values above $255$ are accepted as well.
  • The port again can be there or not. It is also quite simple: It is just a colon, followed by one or more digits. We technically could restrict the expression even further because ports cannot be higher than $65535$ but I did not do that here.
  • The path is quite simple: Each path segment starts with a forward slash, followed by any number of the allowed characters (possibly none). We can have an arbitrary number of path segments.
  • The query and the fragment are a bit hard to define, so I chose the simple way: They start with ? and # respectively, followed by any number of allowed characters. Both are optional.
  • Finally, I close the expression with $ - there should not be anything after that.
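
The port and IPv4 bullets above mention that neither value is range-checked. One possible way to close that gap, instead of making the expression longer, is to do the numeric checks in Python after the regex has matched. The following is a minimal sketch under my own assumptions: the names `URL_RE` and `is_valid_url` are hypothetical, `re.IGNORECASE` is added by me, and the group numbers (2 for the host, 5 for the last IPv4 block, 6 for the port) are simply counted from the opening parentheses of the pattern.

```python
import re

# The expression from above, compiled once (re.IGNORECASE is an assumption).
URL_RE = re.compile(
    r"^(https?://)?"
    r"(([a-z\d][a-z\d-]*[a-z\d]\.)+[a-z]{2,}|"
    r"(\d{1,3}\.){3}(\d{1,3}))"
    r"(:\d+)?"
    r"(/[a-z\d\-%_.~+]*)*"
    r"(\?[a-z\d%_.~+=&]*)?"
    r"(\#[a-z\d_\-]*)?$",
    re.IGNORECASE,
)


def is_valid_url(text: str) -> bool:
    """Regex match plus numeric range checks (hypothetical helper)."""
    match = URL_RE.match(text)
    if match is None:
        return False

    host = match.group(2)        # domain name or IPv4 address
    last_octet = match.group(5)  # only set when the IPv4 branch matched
    port = match.group(6)        # includes the leading colon, e.g. ":8080"

    # Each IPv4 block has to fit into one byte.
    if last_octet is not None:
        if any(int(octet) > 255 for octet in host.split(".")):
            return False

    # Ports cannot be higher than 65535.
    if port is not None and int(port[1:]) > 65535:
        return False

    return True
```

With this, `is_valid_url("192.168.0.1:8080")` is accepted, while `is_valid_url("999.1.1.1")` and `is_valid_url("example.com:70000")` are rejected even though the regex alone matches them.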