Text Processing: grep, sed & awk

The data-mining trio for parsing logs, scraping output, and finding the one line that matters.

Medium · 24 min · grep · sed · awk · regex

Raw output is noise. The ability to slice, filter, transform, and count text is what separates a hacker who stares at walls of data from one who instantly finds the one line that matters. This lesson teaches you to think with the text-processing toolkit.

We will work with a realistic scenario throughout: you have captured an Apache access log from a web server and you need to analyze it for reconnaissance value, anomalies, and attacker fingerprints.


Just from this sample we can see a brute-force login attempt and a sqlmap probe. Let's build the tools to find these at scale.
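The captured sample itself is not reproduced in this text. To follow along without it, you could build a small stand-in like the one below. Every line is hypothetical, written for illustration in Apache's combined log format; the addresses, timestamps, paths, and the `log` variable are assumptions, not data from the original capture.

```shell
# Create a tiny, HYPOTHETICAL access log (Apache combined format) for practice.
# None of these lines come from the lesson's real capture.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.42 - - [10/Oct/2024:13:55:36 +0000] "POST /login HTTP/1.1" 401 128 "-" "Mozilla/5.0"
10.0.0.42 - - [10/Oct/2024:13:55:37 +0000] "POST /login HTTP/1.1" 401 128 "-" "Mozilla/5.0"
10.0.0.7 - - [10/Oct/2024:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "curl/8.0"
10.0.0.99 - - [10/Oct/2024:13:57:12 +0000] "GET /item?id=1 HTTP/1.1" 500 64 "-" "sqlmap/1.7"
EOF

# Sanity check: count the lines we just wrote (tr strips padding some wc builds add)
lines=$(wc -l < "$log" | tr -d ' ')
echo "wrote $lines lines to $log"
```

Copy the temp file to access.log in a scratch directory if you want to run the commands in this lesson unchanged.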

grep - find lines that match

grep is your first filter. It prints lines matching a pattern.

Essential flags

grep "pattern" file          # Basic match
grep -i "pattern" file       # Case-insensitive
grep -r "pattern" /dir       # Recursive (search whole directory)
grep -v "pattern" file       # Invert: print non-matching lines
grep -n "pattern" file       # Show line numbers
grep -c "pattern" file       # Count matching lines only
grep -l "pattern" /dir/*     # Print filenames with matches
grep -E "regex" file         # Extended regex (ERE)
grep -o "pattern" file       # Print only the matched part, one per line

grep on the access log

# Find all 401 (unauthorized) responses - potential brute force
grep ' 401 ' access.log
 
# Find all requests from a specific IP
grep '^10\.0\.0\.42' access.log
 
# Find sqlmap, nikto, or dirbuster user agents (case-insensitive)
grep -iE 'sqlmap|nikto|dirbuster|hydra|masscan' access.log
 
# Find all POST requests
grep '"POST ' access.log
 
# Count how many 401s came from one IP
grep ' 401 ' access.log | grep -c '^10\.0\.0\.42'
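One caveat with the IP patterns above: `^` anchors only the start, so the pattern is a prefix match. The sketch below uses three hypothetical log lines to show a shorter prefix catching a longer address, and how a trailing space pins the match to the whole first field.

```shell
# Hypothetical three-line log: only the first field matters here.
log=$(mktemp)
printf '%s\n' \
  '10.0.0.4 - - "GET / HTTP/1.1" 200 10' \
  '10.0.0.42 - - "POST /login HTTP/1.1" 401 10' \
  '10.0.0.42 - - "POST /login HTTP/1.1" 401 10' > "$log"

# Prefix match: also counts 10.0.0.42, because it starts with 10.0.0.4
loose=$(grep -c '^10\.0\.0\.4' "$log")

# The trailing space marks the end of the IP field
strict=$(grep -c '^10\.0\.0\.4 ' "$log")

echo "loose=$loose strict=$strict"
```

The same trick applies when you grep for status codes: the spaces around ' 401 ' are what keep it from matching inside a longer number.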


Regex quick reference for grep -E

Pattern	Matches
`.`	Any single character
`*`	Zero or more of the preceding
`+`	One or more (ERE only)
`?`	Zero or one (ERE only)
`[abc]`	One of: a, b, or c
`[0-9]`	Any digit
`^`	Start of line
`$`	End of line
`\b`	Word boundary (GNU extension, not POSIX)
`(a|b)`	a or b (ERE)

# Extract all IP addresses from the log using -o
grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' access.log | sort -u
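To see why the "ERE only" notes in the table matter, here is a minimal sketch on made-up input. Without -E, `+` and `|` are ordinary characters; with -E they become operators.

```shell
# Without -E (basic regex), + is a literal plus sign: only the line "a+" matches.
basic=$(printf 'aaa\na+\n' | grep 'a+')

# With -E, + means "one or more of the preceding": both lines match.
extended=$(printf 'aaa\na+\n' | grep -E 'a+')

# Alternation likewise needs -E to act as "or".
alt=$(printf 'cat\ndog\nbird\n' | grep -cE 'cat|dog')

echo "basic: $basic"
echo "extended matches: $(printf '%s\n' "$extended" | wc -l | tr -d ' ')"
echo "alternation matches: $alt"
```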

cut - extract columns

cut extracts specific fields from structured text. It works on fixed character positions or delimiters.

cut -d',' -f1         # Field 1 of comma-delimited data
cut -d':' -f1,3       # Fields 1 and 3, colon-delimited
cut -c1-10            # Characters 1 through 10

The access log uses spaces as delimiters, so:

# Extract just the IP addresses (field 1)
cut -d' ' -f1 access.log
 
# Extract the HTTP method from quoted field (requires some creativity)
grep -oE '"(GET|POST|PUT|DELETE|HEAD|OPTIONS)' access.log | cut -c2-

When cut struggles, reach for awk

cut accepts only a single-character delimiter and treats every occurrence of it as a field boundary, so a run of spaces yields empty fields. The moment your data has variable whitespace or you need conditional logic, awk handles it more cleanly.
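A quick way to see the difference is to feed both tools a line with a run of spaces. The input below is invented for the demonstration.

```shell
line='alice    42'   # four spaces between the two columns (hypothetical data)

# cut treats EVERY space as a delimiter, so field 2 is the empty string here
with_cut=$(printf '%s\n' "$line" | cut -d' ' -f2)

# awk's default field splitting collapses runs of whitespace
with_awk=$(printf '%s\n' "$line" | awk '{print $2}')

echo "cut: [$with_cut]  awk: [$with_awk]"
```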

sort and uniq - ranking and deduplication

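sort orders lines; uniq collapses consecutive duplicate lines. Because uniq only compares adjacent lines, its input usually needs to be sorted first, which is why the two are almost always used together.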

sort file                    # Alphabetical
sort -n file                 # Numeric sort
sort -r file                 # Reverse
sort -t':' -k2,2 file        # Sort by field 2 only, colon-separated (-k2 alone keys on field 2 through end of line)
sort -u file                 # Sort and deduplicate
 
uniq file                    # Collapse adjacent duplicates
uniq -c file                 # Prefix each line with count
uniq -d file                 # Show only duplicated lines
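The adjacency rule is easy to trip over, so here is a small sketch on made-up input. It compares uniq with and without a preceding sort, then builds a ranked count. The awk at the end only normalizes the padding that uniq -c puts before the count.

```shell
# Hypothetical input with non-adjacent duplicates
data=$(printf 'b\na\nb\na\n')

# Without sort, no two equal lines are adjacent, so nothing collapses
unsorted=$(printf '%s\n' "$data" | uniq | wc -l | tr -d ' ')

# After sort, duplicates sit next to each other and collapse
sorted=$(printf '%s\n' "$data" | sort | uniq | wc -l | tr -d ' ')

# Ranked counts, most frequent first
ranked=$(printf '%s\n' b a b b | sort | uniq -c | sort -rn | awk '{print $1, $2}')

echo "unsorted=$unsorted sorted=$sorted"
echo "$ranked"
```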

Finding the top attacking IPs

cut -d' ' -f1 access.log | sort | uniq -c | sort -rn | head -10


That single pipeline - 5 commands - identifies the most active clients in any log file. 10.0.0.42 made 47 requests; that matches the 47 failed logins we found earlier.

wc - counting

wc -l file      # Line count
wc -w file      # Word count
wc -c file      # Byte count

# How many unique IPs hit the server today?
cut -d' ' -f1 access.log | sort -u | wc -l
 
# How many 500 errors?
grep ' 500 ' access.log | wc -l
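Two small details are worth knowing when you script with wc. The sketch below uses a throwaway temp file with invented contents: given a filename, wc prints the name after the count; reading from stdin gives just the number. For counting matches, grep -c does the same job as grep | wc -l.

```shell
f=$(mktemp)
printf 'ok\nerror\nok\nerror\n' > "$f"

# With a filename argument, the name is printed after the count
with_name=$(wc -l "$f")

# From stdin you get only the number (tr removes padding some systems add)
just_count=$(wc -l < "$f" | tr -d ' ')

# Two equivalent ways to count matching lines
via_grep=$(grep -c 'error' "$f")
via_wc=$(grep 'error' "$f" | wc -l | tr -d ' ')

echo "lines=$just_count errors=$via_grep"
```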

tr - character translation

tr replaces or deletes individual characters. It reads stdin only (no filename argument).

echo "Hello World" | tr 'a-z' 'A-Z'    # Lowercase to uppercase
echo "hello:world" | tr ':' ' '         # Replace colons with spaces
tr -d '\r' < file                        # Strip Windows carriage returns
tr -s ' ' < file                         # Squeeze repeated spaces to one

Useful for normalizing output before feeding it to other tools.
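As a sketch of that normalizing step, the invented line below has doubled spaces and a Windows line ending. Fed straight to cut, it gives an empty field; cleaned with tr first, it gives the field you wanted.

```shell
# Hypothetical messy input: runs of spaces and a trailing carriage return.
# Straight into cut, field 3 is an empty string between two adjacent spaces.
broken=$(printf 'GET   /index.html  200\r\n' | cut -d' ' -f3)

# Delete the carriage return, squeeze the spaces, then cut behaves
clean=$(printf 'GET   /index.html  200\r\n' | tr -d '\r' | tr -s ' ')
status=$(printf '%s\n' "$clean" | cut -d' ' -f3)

echo "before: [$broken]  after: [$status]"
```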

sed - stream editor

sed applies editing operations to a stream of text line by line. The most important operation by far is substitution.

Substitution: s/pattern/replacement/flags

sed 's/foo/bar/'        # Replace first occurrence per line
sed 's/foo/bar/g'       # Replace ALL occurrences (g = global)
sed 's/foo/bar/i'       # Case-insensitive match (not POSIX; GNU sed accepts i or I)
sed 's/foo/bar/2'       # Replace only the second occurrence
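Running the flags against one made-up string makes the differences concrete. The last line also shows that the delimiter does not have to be a slash, which saves escaping when the pattern is a path.

```shell
s='foo foo foo'

first=$(printf '%s\n' "$s" | sed 's/foo/bar/')     # first occurrence only
all=$(printf '%s\n' "$s" | sed 's/foo/bar/g')      # every occurrence
second=$(printf '%s\n' "$s" | sed 's/foo/bar/2')   # second occurrence only

# Any character can delimit the s command; | avoids escaping the slashes
path=$(printf '/var/log/app\n' | sed 's|/var/log|/tmp|')

printf '%s\n' "$first" "$all" "$second" "$path"
```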

Other useful sed operations

sed -n '5,10p' file         # Print only lines 5 through 10
sed '5,10d' file            # Delete lines 5 through 10
sed '/pattern/d' file       # Delete all lines matching a pattern
sed -i 's/old/new/g' file   # Edit file in-place (be careful)
sed -n '/pattern/p' file    # Print only matching lines (like grep)
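Since -i overwrites the file, one cautious workflow is to preview the change on stdout first and then keep a backup when you commit it. The sketch below assumes a throwaway temp file; the attached-suffix form -i.bak is accepted by both GNU and BSD sed.

```shell
f=$(mktemp)
printf 'old value\nkeep me\n' > "$f"

# Preview: without -i, sed writes to stdout and leaves the file alone
preview=$(sed 's/old/new/g' "$f")

# Commit: edit in place, saving the original as "$f.bak"
sed -i.bak 's/old/new/g' "$f"

echo "now:    $(head -n 1 "$f")"
echo "backup: $(head -n 1 "$f.bak")"
```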

sed on the log

# Anonymize IPs for sharing (replace last octet)
sed 's/\([0-9]\+\.[0-9]\+\.[0-9]\+\)\.[0-9]\+/\1.XXX/g' access.log
 
# Strip user agent strings from log lines
sed 's/ "[^"]*"$//' access.log
 
# Extract just the URL paths
sed -n 's/.*"\(GET\|POST\) \([^ ]*\).*/\2/p' access.log


Interesting - someone probed /wp-login.php and /phpmyadmin even though neither should exist on this server. That is a vulnerability scanner fingerprint.
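The backslashed `\(`, `\|` and `\+` forms used above lean on GNU sed's basic-regex extensions. If you need the same extraction elsewhere, the -E flag (available in current GNU and BSD sed) switches to extended regex. A minimal sketch on one hypothetical log line:

```shell
# Hypothetical log line in combined format
line='10.0.0.99 - - [10/Oct/2024:13:57:12 +0000] "GET /wp-login.php HTTP/1.1" 404 196 "-" "Mozilla/5.0"'

# With -E, ( ) and | are operators without backslashes.
# Group 2 captures everything after the method up to the next space.
path=$(printf '%s\n' "$line" | sed -nE 's/.*"(GET|POST) ([^ ]*).*/\2/p')

echo "$path"
```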

awk - the scripting Swiss Army knife

awk is a full mini-language for processing structured text. Each line is automatically split into fields ($1, $2, etc.) separated by whitespace (or a custom FS).

Basic structure

awk 'pattern { action }' file

If pattern is omitted, the action runs on every line. If action is omitted, matching lines are printed.
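The three forms can be tried side by side on a few invented lines:

```shell
# Hypothetical three-line input: a number and a word per line
data=$(printf '1 alpha\n2 beta\n3 gamma\n')

# Pattern only: matching lines are printed whole
pattern_only=$(printf '%s\n' "$data" | awk '$1 >= 2')

# Action only: runs on every line
action_only=$(printf '%s\n' "$data" | awk '{print $2}')

# Pattern and action: the action runs only on matching lines
both=$(printf '%s\n' "$data" | awk '$1 == 2 {print $2}')

printf '%s\n' "$pattern_only" "---" "$action_only" "---" "$both"
```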

Field access

awk '{print $1}' file           # Print field 1
awk '{print $1, $NF}' file      # Print first and last field
awk 'NR==5' file                # Print only line 5
awk 'NR>=5 && NR<=10' file      # Print lines 5-10
awk 'NF > 0' file               # Skip blank lines

Custom field separator

awk -F':' '{print $1}' /etc/passwd      # Colon delimiter
awk -F',' '{print $2}' data.csv         # CSV
awk 'BEGIN{FS="\t"} {print $3}' file    # Tab delimiter
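The separator settings can be sketched on a single passwd-style record. The record is made up. OFS, the output field separator, is not covered above but is a standard awk variable and pairs naturally with FS.

```shell
# A hypothetical passwd-style record
rec='alice:x:1000:1000:Alice:/home/alice:/bin/bash'

# -F sets the input separator; print's comma inserts OFS (a space by default)
user_shell=$(printf '%s\n' "$rec" | awk -F':' '{print $1, $NF}')

# Setting OFS changes what print puts between fields
joined=$(printf '%s\n' "$rec" | awk 'BEGIN{FS=":"; OFS=","} {print $1, $3}')

echo "$user_shell"
echo "$joined"
```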

Patterns and conditions

awk '/pattern/' file                    # Lines matching regex (like grep)
awk '!/pattern/' file                   # Lines NOT matching
awk '$9 == 401' access.log              # Lines where field 9 equals 401
awk '$9 >= 500' access.log              # Server errors
awk '$1 == "10.0.0.42"' access.log      # Specific IP
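The field numbers used here ($1, $7, $9, $10) follow from how awk's whitespace splitting falls on a combined-format line. The sketch below labels them on one hypothetical entry. It assumes the standard combined layout, so check a line of your own log before trusting the numbers.

```shell
# Hypothetical combined-format line
line='10.0.0.42 - - [10/Oct/2024:13:55:36 +0000] "POST /login HTTP/1.1" 401 128 "-" "Mozilla/5.0"'

# $4 and $5 hold the two halves of the timestamp; $6 is the method with its
# leading quote, which substr($6, 2) strips off.
fields=$(printf '%s\n' "$line" | awk '{
  print "ip=" $1, "method=" substr($6, 2), "path=" $7, "status=" $9, "bytes=" $10
}')

echo "$fields"
```

Fields past the referer are less predictable, because user-agent strings usually contain spaces of their own.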

Aggregation with awk

# Count requests per IP (like the uniq -c pipeline but in one step)
awk '{count[$1]++} END{for(ip in count) print count[ip], ip}' access.log | sort -rn
 
# Sum the response size (field 10) for a specific IP
awk '$1 == "10.0.0.42" {total += $10} END{print total " bytes"}' access.log
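To check your understanding of the associative-array idiom, here is a sketch on a four-line hypothetical log. It counts failed logins per IP and sums bytes per IP, both keyed on $1.

```shell
# Hypothetical four-line log in combined-format field order
log=$(mktemp)
printf '%s\n' \
  '10.0.0.42 - - [10/Oct/2024:13:55:36 +0000] "POST /login HTTP/1.1" 401 128' \
  '10.0.0.42 - - [10/Oct/2024:13:55:37 +0000] "POST /login HTTP/1.1" 401 128' \
  '10.0.0.42 - - [10/Oct/2024:13:55:40 +0000] "GET /admin HTTP/1.1" 200 300' \
  '10.0.0.7 - - [10/Oct/2024:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 512' > "$log"

# Conditional count: only lines with status 401 increment the array
failed=$(awk '$9 == 401 {count[$1]++} END{for (ip in count) print count[ip], ip}' "$log")

# Per-key sum: bytes served to each IP, largest first
bytes=$(awk '{sum[$1] += $10} END{for (ip in sum) print sum[ip], ip}' "$log" | sort -rn)

echo "$failed"
echo "$bytes"
```

The for (ip in count) loop visits keys in no particular order, which is why the result is piped through sort.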


printf for formatted output

awk '{printf "%-20s %s\n", $1, $7}' access.log   # Left-aligned columns
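In a format like %-20s, the minus sign left-aligns and the number is the minimum width. The sketch below uses a narrower width and a visible | so the padding is easy to see. The input is invented.

```shell
# %-12s pads the first column to 12 characters, left-aligned
out=$(printf '10.0.0.42 /login\n10.0.0.7 /index.html\n' |
  awk '{printf "%-12s|%s\n", $1, $2}')

echo "$out"
```

Unlike print, printf adds no newline on its own, so the format string has to end in \n.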

Putting it all together - full log analysis pipeline

Here is a complete pipeline that takes the raw access log and produces a security-relevant report:

echo "=== Top 10 IPs ==="
cut -d' ' -f1 access.log | sort | uniq -c | sort -rn | head -10
 
echo ""
echo "=== Scanner User-Agents ==="
grep -oiE '"[^"]*(sqlmap|nikto|nmap|masscan|dirbuster|hydra)[^"]*"' access.log | sort -u
 
echo ""
echo "=== Failed Logins by IP ==="
awk '$9 == 401 {count[$1]++} END{for(ip in count) print count[ip], ip}' access.log | sort -rn
 
echo ""
echo "=== Top Requested URLs ==="
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -10
 
echo ""
echo "=== 500 Errors ==="
awk '$9 == 500 {print $1, $7}' access.log | sort -u
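If you expect to run this against more than one file, the pipeline could be wrapped in a function that takes the log path as an argument. This is a trimmed sketch under assumptions: log_report is a hypothetical name, only two of the sections are included, and the sample data is invented.

```shell
# log_report FILE - hypothetical helper wrapping part of the report above
log_report() {
  log=$1
  # Fail early with a message if the file is missing or unreadable
  [ -r "$log" ] || { echo "cannot read: $log" >&2; return 1; }

  echo "=== Top 10 IPs ==="
  cut -d' ' -f1 "$log" | sort | uniq -c | sort -rn | head -10

  echo ""
  echo "=== Failed Logins by IP ==="
  awk '$9 == 401 {count[$1]++} END{for (ip in count) print count[ip], ip}' "$log" | sort -rn
}

# Try it on a small hypothetical log
sample=$(mktemp)
printf '%s\n' \
  '10.0.0.42 - - [10/Oct/2024:13:55:36 +0000] "POST /login HTTP/1.1" 401 128' \
  '10.0.0.42 - - [10/Oct/2024:13:55:37 +0000] "POST /login HTTP/1.1" 401 128' \
  '10.0.0.7 - - [10/Oct/2024:13:56:01 +0000] "GET /index.html HTTP/1.1" 200 512' > "$sample"

report=$(log_report "$sample")
echo "$report"
```

Taking the path as a parameter also makes it easy to point the same report at rotated logs such as access.log.1.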

Hands-on Lab

Grep the Logs

Practice these techniques on a real Apache log dump - find the attacker, their tools, and what they were targeting.

Key takeaways

grep filters lines; -i ignores case, -v inverts, -E enables extended regex, -o extracts just the match.
cut -d'x' -f2 extracts a field from delimited text.
sort | uniq -c | sort -rn is the canonical "frequency ranking" pipeline - memorize it.
sed 's/old/new/g' substitutes text; sed '/pattern/d' deletes matching lines.
awk '{count[$1]++} END{...}' aggregates in a single pass and can sum or conditionally count fields, which sort | uniq -c can't.
Chaining these tools with pipes turns a raw log file into actionable intelligence in seconds.
