04 Regular Expressions

String Canonicalization

Canonicalize: convert data that has more than one possible representation into a standard form

A Joining Problem

lifecycle

  • Useful string methods:
    • Replacement: str.replace('&', 'and')
    • Deletion: str.replace(' ', '')
    • Transformation: str.lower()
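
The three operations above can be chained into one helper so that two tables spell a join key the same way. This is a minimal sketch, not a general-purpose cleaner: the function name `canonicalize` and the sample spellings are hypothetical and chosen only for illustration.

```python
def canonicalize(name):
    """Return a standard form of `name` using the three operations above."""
    return (
        name.lower()               # transformation: ignore case
            .replace('&', 'and')   # replacement: unify '&' and 'and'
            .replace(' ', '')      # deletion: ignore spacing
    )

# Hypothetical spellings of the same key in two tables.
left = 'De Witt & Clark'
right = 'dewitt and clark'

print(canonicalize(left))                          # dewittandclark
print(canonicalize(left) == canonicalize(right))   # True
```

Once both columns pass through the same function, an ordinary equality join on the canonical form lines the rows up.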

Extracting data from text

Date Information

169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] "GET /stat141/Winter04/
HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"
  • How do we get the date information out of the log line above?
  • Better to use a regular expression than to parse the string from scratch

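    import re
    # the log line shown above, joined onto one line
    text = ('169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
            '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 "http://anson.ucdavis.edu/courses/"')
    pattern = r'\[(\d+)/(\w+)/(\d+):(\d+):(\d+):(\d+) (.+)\]'
    day, month, year, hour, minute, second, time_zone = re.search(pattern, text).groups()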
    
  • Formal Language : set of strings, typically described implicitly

    • "The set of all strings of length less than 10 that contain 'horse'"
  • Regular Language : a formal language that can be described by a regular expression
    • `[0-9]{3}-[0-9]{2}-[0-9]{4}`
    • 3 of any digit, then a dash, then 2 of any digit, then a dash, then 4 of any digit.
      text = "My social security number is 123-45-6789."
      pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
      re.findall(pattern, text)
      
  • Useful Site for testing Regex: Regex101

  • Basic Operators

    operation               order  example    matches        does not match
    concatenation           3      AABAAB     AABAAB         every other string
    or                      4      AA|BAAB    AA, BAAB       every other string
    closure (zero or more)  2      AB*A       AA, ABBBBBBA   AB, ABABA
    parenthesis             1      A(A|B)AAB  AAAAB, ABAAB   every other string
                                   (AB)*A     A, ABABABABA   AA, ABBA

Regex that matches `moon`, `moooon` (an even number of `o`s, at least two):
moo(oo)*n
Regex that matches `muun`, `muuuun`, `moon`, `moooon` (an even number of `o`s or of `u`s, at least two):
m(uu(uu)*|oo(oo)*)n
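
The "matches" and "does not match" columns treat each regex as a description of whole strings, which in Python corresponds to `re.fullmatch` rather than `re.search`. The sketch below checks the operator precedence rules and the even-`o` pattern that way; the helper name `matches` is hypothetical.

```python
import re

def matches(pattern, s):
    """True if the whole string `s` is described by `pattern`."""
    return re.fullmatch(pattern, s) is not None

# Closure binds tighter than concatenation: AB*A is A, zero or more B, then A.
print(matches(r"AB*A", "ABBBBBBA"))     # True
print(matches(r"AB*A", "ABABA"))        # False

# Parentheses override that order: (AB)*A repeats the pair AB.
print(matches(r"(AB)*A", "ABABABABA"))  # True
print(matches(r"(AB)*A", "ABBA"))       # False

# Or has the lowest precedence: AA|BAAB is either AA or BAAB.
print(matches(r"AA|BAAB", "AA"))        # True
print(matches(r"AA|BAAB", "AABAAB"))    # False

# moo(oo)*n: two o's, then any number of further pairs of o's.
words = ["mn", "mon", "moon", "mooon", "moooon"]
even_o = [w for w in words if matches(r"moo(oo)*n", w)]
print(even_o)                           # ['moon', 'moooon']
```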
  • Expanded Regex Syntax

    operation                          example         matches            does not match
    any character (except newline)     .U.U.U.         CUMULUS, JUGULUM   SUCCUBUS, TUMULTUOUS
    character class                    [A-Za-z][a-z]*  word, Capitalized  camelCase, 4illegal
    at least one                       jo+hn           john, joooooohn    jhn, jjohn
    zero or one                        joh?n           jon, john          any other string
    repeated exactly a times: {a}      j[aeiou]{3}hn   jaoehn, jooohn     jhn, jaeiouhn
    repeated from a to b times: {a,b}  j[ou]{1,2}hn    john, juohn        jhn, jooohn

  • More Regular Expression Examples

    regex                        matches                                does not match
    .*SPB.*                      RASPBERRY, CRISPBREAD                  SUBSPACE, SUBSPECIES
    [0-9]{3}-[0-9]{2}-[0-9]{4}   231-41-5121, 573-57-1821               231415121, 57-3571821
    [a-z]+@([a-z]+\.)+(edu|com)  horse@pizza.com, horse@pizza.food.com  frank_99@yahoo.com, hug@cs

Regex for any lowercase string that has a repeated vowel:
[a-z]*(aa|ee|ii|oo|uu)+[a-z]*
Regex for any string that contains both a lowercase letter and a number:
(.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)
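
These example patterns can be checked against a few strings with `re.fullmatch`. The sketch below is only a spot check; the variable names and the helper `matches` are hypothetical, and the email pattern is written without spaces around `|` because a space inside a regex is a literal space.

```python
import re

repeated_vowel = r"[a-z]*(aa|ee|ii|oo|uu)+[a-z]*"
letter_and_digit = r"(.*[a-z].*[0-9].*)|(.*[0-9].*[a-z].*)"
email = r"[a-z]+@([a-z]+\.)+(edu|com)"

def matches(pattern, s):
    """True if the whole string `s` is described by `pattern`."""
    return re.fullmatch(pattern, s) is not None

print(matches(repeated_vowel, "moon"))         # True
print(matches(repeated_vowel, "banana"))       # False: no doubled vowel
print(matches(repeated_vowel, "noon1"))        # False: '1' is not lowercase

print(matches(letter_and_digit, "abc123"))     # True
print(matches(letter_and_digit, "123abc"))     # True: either order is accepted
print(matches(letter_and_digit, "ABC123"))     # False: no lowercase letter

print(matches(email, "horse@pizza.food.com"))  # True
print(matches(email, "frank_99@yahoo.com"))    # False: '_' and digits not allowed
print(matches(email, "hug@cs"))                # False: no '.edu' or '.com'
```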

Advanced Regular Expressions Syntax

  • Because regular expressions are difficult to read, they are (sarcastically) called a “write-only language”

    operation                        example   matches                does not match
    built-in character classes       \w+, \d+  fawef, 231231          this person, 423 people
    character class negation         [^a-z]+   PEPPERS3982, 17211!@#  porch, CLAmS
    escape character                 cow\.com  cow.com                cowscom
    beginning of line                ^ark      ark two, ark o ark     dark
    end of line                      ark$      dark, ark o ark        ark two
    lazy version of zero or more *?  5.*?5     5005, 55               5005005

  • Escape character: think of it as “take this next character literally”
Regex that matches a pair of angle brackets <> and anything inside them:
`<.*?>`
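
The difference between the greedy `*` and the lazy `*?` is easiest to see side by side. The sketch below runs both forms on a small HTML string (chosen here only for illustration) and then checks the `^` and `$` anchors with `re.search`.

```python
import re

html = '<div><td valign="top">Moo</td></div>'

# Greedy .* runs to the LAST '>' in the string: one big match.
greedy = re.findall(r"<.*>", html)
# Lazy .*? stops at the FIRST '>' after each '<': one match per tag.
lazy = re.findall(r"<.*?>", html)

print(greedy)  # ['<div><td valign="top">Moo</td></div>']
print(lazy)    # ['<div>', '<td valign="top">', '</td>', '</div>']

# ^ and $ pin the match to the start or end of the text being searched.
starts = [s for s in ["ark two", "dark"] if re.search(r"^ark", s)]
ends = [s for s in ["ark two", "dark"] if re.search(r"ark$", s)]
print(starts)  # ['ark two']
print(ends)    # ['dark']
```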

Regular Expressions in Python

  • re.findall(pattern, text) : return list of all matches

    text = """My social security number is 456-76-4295 bro,
            or actually maybe it’s 456-67-4295."""
    pattern = r"[0-9]{3}-[0-9]{2}-[0-9]{4}"
    m = re.findall(pattern, text)
    print(m)
    

    >>> ['456-76-4295', '456-67-4295']

  • re.sub(pattern, repl, text) : return text with all instances of pattern replaced by repl.

    text = '<div><td valign="top">Moo</td></div>'
    pattern = r"<[^>]+>"
    cleaned = re.sub(pattern, '', text)
    print(cleaned)
    

    >>> Moo

  • Raw strings in Python: strongly recommended for regex patterns

    • Write r"..." instead of "..." or '...'
    • Rough idea: regular expressions and Python strings both use \ as an escape character, so in a normal string every regex backslash must itself be escaped
    • Using non-raw strings leads to uglier regular expressions.
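
A short sketch of why raw strings matter, using `\b` as the example: Python reads `"\b"` as a backspace character, while the regex engine reads `\b` as a word boundary. The sample sentence and variable names are made up for illustration.

```python
import re

# In a normal string, Python interprets the backslash first.
print(len("\n"))    # 1: a single newline character
print(len(r"\n"))   # 2: a backslash followed by 'n'

text = "a cat sat on a catalog"

# "\b" becomes a backspace character before the regex engine ever sees it.
non_raw = re.findall("\bcat\b", text)
# r"\b" reaches the regex engine intact and means "word boundary".
raw = re.findall(r"\bcat\b", text)

print(non_raw)  # []
print(raw)      # ['cat']

# Without a raw string, every regex backslash has to be doubled.
print("\\bcat\\b" == r"\bcat\b")   # True
```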
  • RE Groups

    • Parentheses specify a so-called ‘group’
    • Regular expression matchers such as `re.findall` return matches organized by group (as tuples, in Python)
    s = """Observations: 03:04:53 - Horse awakens.
    03:05:14 - Horse goes back to sleep."""
    pattern = r"(\d\d):(\d\d):(\d\d) - (.*)"
    matches = re.findall(pattern, s)
    

    >>> [('03', '04', '53', 'Horse awakens.'), ('03', '05', '14', 'Horse goes back to sleep.')]
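
Because each match comes back as a tuple of strings, the groups can be unpacked directly in a loop. The sketch below converts each timestamp to seconds since midnight (an illustrative use, not part of the original example) and shows what `re.findall` returns when the pattern has no groups.

```python
import re

s = """Observations: 03:04:53 - Horse awakens.
03:05:14 - Horse goes back to sleep."""

# With groups, findall returns one tuple per match, one entry per group.
pattern = r"(\d\d):(\d\d):(\d\d) - (.*)"
events = []
for hour, minute, second, description in re.findall(pattern, s):
    seconds_since_midnight = int(hour) * 3600 + int(minute) * 60 + int(second)
    events.append((seconds_since_midnight, description))

print(events)
# [(11093, 'Horse awakens.'), (11114, 'Horse goes back to sleep.')]

# Without parentheses, findall returns the whole matched text instead.
times = re.findall(r"\d\d:\d\d:\d\d", s)
print(times)   # ['03:04:53', '03:05:14']
```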

  • Practice Problem:

    pattern = "YOUR REGEX HERE"
    matches = re.findall(pattern, log[0])
    day, month, year = matches[0]
    
    Answer

    r"\[(\d{2})/(\w{3})/(\d{4})"
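
A working form of the answer can be checked against the log line from the "Date Information" example. In this sketch `log` is a stand-in list with a single entry, assumed only so that `log[0]` lines up with the practice code; the opening bracket is escaped as `\[` because an unescaped `[` would start a character class.

```python
import re

# Stand-in for the notes' `log`: one entry, the log line shown earlier.
log = ['169.237.46.168 - - [26/Jan/2014:10:47:58 -0800] '
       '"GET /stat141/Winter04/ HTTP/1.1" 200 2585 '
       '"http://anson.ucdavis.edu/courses/"']

# \[ matches a literal '['; the three groups capture day, month, year.
pattern = r"\[(\d{2})/(\w{3})/(\d{4})"
matches = re.findall(pattern, log[0])
day, month, year = matches[0]

print(day, month, year)   # 26 Jan 2014
```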

  • Summary and other (alternative) tools

    basic python   re           pandas
                   re.findall   df.str.findall
    str.replace    re.sub       df.str.replace
    str.split      re.split     df.str.split
    'ab' in str    re.search    df.str.contains
    len(str)                    df.str.len
    str[1:4]                    df.str[1:4]
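
The first two columns of the table can be compared directly with the standard library alone. The sketch below contrasts each basic string operation with its `re` counterpart on a made-up string; the pandas column is left out so that the example needs no third-party install.

```python
import re

line = "apple, banana;cherry ,  date"

# str.split handles one fixed separator; re.split accepts a pattern.
plain_parts = line.split(",")
fruits = re.split(r"\s*[,;]\s*", line)
print(plain_parts)  # ['apple', ' banana;cherry ', '  date']
print(fruits)       # ['apple', 'banana', 'cherry', 'date']

# `in` tests for a fixed substring; re.search tests for a pattern.
has_an = "an" in line
has_digit = re.search(r"[0-9]", line) is not None
print(has_an)       # True
print(has_digit)    # False

# str.replace swaps a fixed substring; re.sub swaps every pattern match.
masked = re.sub(r"[aeiou]", "_", "banana")
print(masked)       # b_n_n_
```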