Text Processing: grep, sed & awk
The data-mining trio for parsing logs, scraping output, and finding the one line that matters.
Raw output is noise. The ability to slice, filter, transform, and count text is what separates a hacker who stares at walls of data from one who instantly finds the one line that matters. This lesson teaches you to think with the text-processing toolkit.
We will work with a realistic scenario throughout: you have captured an Apache access log from a web server and you need to analyze it for reconnaissance value, anomalies, and attacker fingerprints.
Just from this sample we can see a brute-force login attempt and a sqlmap probe. Let's build the tools to find these at scale.
grep - find lines that match
grep is your first filter. It prints lines matching a pattern.
Essential flags
grep "pattern" file # Basic match
grep -i "pattern" file # Case-insensitive
grep -r "pattern" /dir # Recursive (search whole directory)
grep -v "pattern" file # Invert: print non-matching lines
grep -n "pattern" file # Show line numbers
grep -c "pattern" file # Count matching lines only
grep -l "pattern" /dir/* # Print filenames with matches
grep -E "regex" file # Extended regex (ERE)
grep -o "pattern" file # Print only the matched part, one per line
grep on the access log
# Find all 401 (unauthorized) responses - potential brute force
grep ' 401 ' access.log
# Find all requests from a specific IP
grep '^10\.0\.0\.42' access.log
# Find sqlmap, nikto, or dirbuster user agents (case-insensitive)
grep -iE 'sqlmap|nikto|dirbuster|hydra|masscan' access.log
# Find all POST requests
grep '"POST ' access.log
# Count how many 401s came from one IP
grep ' 401 ' access.log | grep -c '^10\.0\.0\.42'
Regex quick reference for grep -E
| Pattern | Matches |
|---|---|
| `.` | Any single character |
| `*` | Zero or more of the preceding |
| `+` | One or more (ERE only) |
| `?` | Zero or one (ERE only) |
| `[abc]` | One of: a, b, or c |
| `[0-9]` | Any digit |
| `^` | Start of line |
| `$` | End of line |
| `\b` | Word boundary |
| `(a\|b)` | a or b (ERE) |
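A few rows of the table can be checked directly. The sketch below runs grep -E over four request lines that are invented for illustration (they are not taken from the access log) and stores each result in a variable so it can be inspected.

```shell
# Sample input is hypothetical, made up for this illustration.
sample='GET /admin
GET /admin.php
POST /login
GET /log'

# ^ and $ anchor the pattern to the whole line: only "GET /log" qualifies.
anchored=$(printf '%s\n' "$sample" | grep -E '^GET /log$')

# ? makes the preceding group optional: /admin and /admin.php both match.
optional=$(printf '%s\n' "$sample" | grep -cE '/admin(\.php)?$')

# (a|b) is alternation: either method followed by a path starting with /log.
alternation=$(printf '%s\n' "$sample" | grep -cE '^(GET|POST) /log')

echo "$anchored"     # GET /log
echo "$optional"     # 2
echo "$alternation"  # 2
```

Note that without the trailing $ the first pattern would also match longer paths such as /login, which is why anchoring matters when you count hits.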
# Extract all IP addresses from the log using -o
grep -oE '[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}' access.log | sort -u
cut - extract columns
cut extracts specific fields from structured text. It works on fixed character positions or delimiters.
cut -d',' -f1 # Field 1 of comma-delimited data
cut -d':' -f1,3 # Fields 1 and 3, colon-delimited
cut -c1-10 # Characters 1 through 10
The access log uses spaces as delimiters, so:
# Extract just the IP addresses (field 1)
cut -d' ' -f1 access.log
# Extract the HTTP method from quoted field (requires some creativity)
grep -oE '"(GET|POST|PUT|DELETE|HEAD|OPTIONS)' access.log | cut -c2-
When cut struggles, reach for awk
cut only works with single-character delimiters and fixed fields. The moment your data has variable whitespace or you need conditional logic, awk handles it more cleanly.
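The difference is easiest to see on input padded with several spaces. In this minimal sketch the input line is hypothetical; it only serves to show how the two tools count fields.

```shell
# Hypothetical input: columns separated by a varying number of spaces.
line='alice   42    /bin/bash'

# cut treats every single space as a delimiter, so field 2 is the empty
# string between the first and second space.
cut_field=$(printf '%s\n' "$line" | cut -d' ' -f2)

# awk splits on runs of whitespace by default, so field 2 is the value
# we actually wanted.
awk_field=$(printf '%s\n' "$line" | awk '{print $2}')

printf 'cut: [%s]\nawk: [%s]\n' "$cut_field" "$awk_field"
# cut: []
# awk: [42]
```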
sort and uniq - ranking and deduplication
sort sorts lines; uniq collapses consecutive duplicate lines. They are almost always used together.
sort file # Alphabetical
sort -n file # Numeric sort
sort -r file # Reverse
sort -k2 -t':' file # Sort by field 2, colon-separated
sort -u file # Sort and deduplicate
uniq file # Collapse adjacent duplicates
uniq -c file # Prefix each line with count
uniq -d file # Show only duplicated lines
Finding the top attacking IPs
cut -d' ' -f1 access.log | sort | uniq -c | sort -rn | head -10
That single pipeline - 5 commands - identifies the most active clients in any log file. 10.0.0.42 made 47 requests; that matches the 47 failed logins we found earlier.
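To see why the sort before uniq -c is not optional, here is a small sketch on a made-up list of client addresses (hypothetical values, deliberately interleaved). It compares the number of groups uniq reports with and without sorting first.

```shell
# Hypothetical client addresses, one per line, deliberately unsorted.
ips='10.0.0.42
192.168.1.7
10.0.0.42
192.168.1.7
10.0.0.42'

# Without sort, uniq only merges adjacent repeats, so nothing collapses here.
# tr strips the padding some wc implementations add.
unsorted_groups=$(printf '%s\n' "$ips" | uniq -c | wc -l | tr -d ' ')

# With sort first, identical lines become adjacent and are counted together.
# The final awk only normalizes the leading spaces uniq -c emits.
ranked=$(printf '%s\n' "$ips" | sort | uniq -c | sort -rn | awk '{print $1, $2}')

echo "groups without sort: $unsorted_groups"   # 5
echo "$ranked"
# 3 10.0.0.42
# 2 192.168.1.7
```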
wc - counting
wc -l file # Line count
wc -w file # Word count
wc -c file # Byte count
# How many unique IPs hit the server today?
cut -d' ' -f1 access.log | sort -u | wc -l
# How many 500 errors?
grep ' 500 ' access.log | wc -l
tr - character translation
tr replaces or deletes individual characters. It reads stdin only (no filename argument).
echo "Hello World" | tr 'a-z' 'A-Z' # Lowercase to uppercase
echo "hello:world" | tr ':' ' ' # Replace colons with spaces
tr -d '\r' < file # Strip Windows carriage returns
tr -s ' ' < file # Squeeze repeated spaces to one
Useful for normalizing output before feeding it to other tools.
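As a sketch of that normalization step, the example below cleans one hypothetical line (padded columns plus a Windows line ending, both invented for illustration) so that cut can address its fields reliably.

```shell
# Hypothetical input: space-padded columns and a trailing carriage return.
normalized=$(printf 'root     0    /bin/bash\r\n' | tr -d '\r' | tr -s ' ')

# After squeezing, each column is separated by exactly one space,
# so cut can pick field 3 directly.
third=$(printf '%s\n' "$normalized" | cut -d' ' -f3)

echo "$normalized"   # root 0 /bin/bash
echo "$third"        # /bin/bash
```

Without the tr -d '\r' step the last field would carry an invisible carriage return, which tends to break later string comparisons.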
sed - stream editor
sed applies editing operations to a stream of text line by line. The most important operation by far is substitution.
Substitution: s/pattern/replacement/flags
sed 's/foo/bar/' # Replace first occurrence per line
sed 's/foo/bar/g' # Replace ALL occurrences (g = global)
sed 's/foo/bar/i' # Case-insensitive match (GNU sed extension)
sed 's/foo/bar/2' # Replace only the second occurrence
Other useful sed operations
sed -n '5,10p' file # Print only lines 5 through 10
sed '5,10d' file # Delete lines 5 through 10
sed '/pattern/d' file # Delete all lines matching a pattern
sed -i 's/old/new/g' file # Edit file in-place (be careful)
sed -n '/pattern/p' file # Print only matching lines (like grep)
sed on the log
# Anonymize IPs for sharing (replace last octet)
sed 's/\([0-9]\+\.[0-9]\+\.[0-9]\+\)\.[0-9]\+/\1.XXX/g' access.log
# Strip user agent strings from log lines
sed 's/ "[^"]*"$//' access.log
# Extract just the URL paths
sed -n 's/.*"\(GET\|POST\) \([^ ]*\).*/\2/p' access.log
Interesting - someone probed /wp-login.php and /phpmyadmin even though neither should exist on this server. That is a vulnerability scanner fingerprint.
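The \+ and \| used in the commands above are GNU extensions to basic regular expressions. A possible alternative, sketched here, is the -E flag (accepted by both GNU and BSD sed), which switches to extended syntax so that +, ( ) and | need no backslashes. The log line is hypothetical, written in Apache combined format purely for illustration.

```shell
# Hypothetical log line in Apache combined format, made up for illustration.
line='203.0.113.9 - - [01/Jan/2024:00:00:01 +0000] "GET /wp-login.php HTTP/1.1" 404 153 "-" "curl/8.0"'

# Mask the last octet of the first IPv4-looking token on the line.
masked=$(printf '%s\n' "$line" | sed -E 's/([0-9]+\.[0-9]+\.[0-9]+)\.[0-9]+/\1.XXX/')

# Print only the request path, and only for GET or POST requests.
path=$(printf '%s\n' "$line" | sed -nE 's/.*"(GET|POST) ([^ ]*).*/\2/p')

echo "${masked%% *}"   # 203.0.113.XXX
echo "$path"           # /wp-login.php
```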
awk - the scripting swiss army knife
awk is a full mini-language for processing structured text. Each line is automatically split into fields ($1, $2, etc.) separated by whitespace (or a custom FS).
Basic structure
awk 'pattern { action }' file
If pattern is omitted, the action runs on every line. If action is omitted, matching lines are printed.
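The three forms (pattern only, action only, pattern plus action) can be compared side by side. This sketch uses three hypothetical log lines in common log format, invented for illustration; with default whitespace splitting the client is $1, the path is $7, and the status code is $9.

```shell
# Hypothetical access-log lines, made up for illustration.
log='10.0.0.42 - - [01/Jan/2024:10:00:00 +0000] "POST /login HTTP/1.1" 401 512
10.0.0.7 - - [01/Jan/2024:10:00:01 +0000] "GET /index.html HTTP/1.1" 200 2048
10.0.0.42 - - [01/Jan/2024:10:00:02 +0000] "POST /login HTTP/1.1" 401 512'

# Pattern only: matching lines are printed whole; here we just count them.
pattern_only=$(printf '%s\n' "$log" | awk '$9 == 401' | wc -l | tr -d ' ')

# Action only: runs on every line and prints client and path.
action_only=$(printf '%s\n' "$log" | awk '{print $1, $7}')

# Pattern plus action: print the path only for requests that were not 401.
both=$(printf '%s\n' "$log" | awk '$9 != 401 {print $7}')

echo "$pattern_only"   # 2
echo "$action_only"
echo "$both"           # /index.html
```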
Field access
awk '{print $1}' file # Print field 1
awk '{print $1, $NF}' file # Print first and last field
awk 'NR==5' file # Print only line 5
awk 'NR>=5 && NR<=10' file # Print lines 5-10
awk 'NF > 0' file # Skip blank lines
Custom field separator
awk -F':' '{print $1}' /etc/passwd # Colon delimiter
awk -F',' '{print $2}' data.csv # CSV
awk 'BEGIN{FS="\t"} {print $3}' file # Tab delimiter
Patterns and conditions
awk '/pattern/' file # Lines matching regex (like grep)
awk '!/pattern/' file # Lines NOT matching
awk '$9 == 401' access.log # Lines where field 9 equals 401
awk '$9 >= 500' access.log # Server errors
awk '$1 == "10.0.0.42"' access.log # Specific IP
Aggregation with awk
# Count requests per IP (like the uniq -c pipeline but in one step)
awk '{count[$1]++} END{for(ip in count) print count[ip], ip}' access.log | sort -rn
# Sum the response size (field 10) for a specific IP
awk '$1 == "10.0.0.42" {total += $10} END{print total " bytes"}' access.log
printf for formatted output
awk '{printf "%-20s %s\n", $1, $7}' access.log # Left-aligned columns
Putting it all together - full log analysis pipeline
Here is a complete pipeline that takes the raw access log and produces a security-relevant report:
echo "=== Top 10 IPs ==="
cut -d' ' -f1 access.log | sort | uniq -c | sort -rn | head -10
echo ""
echo "=== Scanner User-Agents ==="
grep -oiE '"[^"]*(sqlmap|nikto|nmap|masscan|dirbuster|hydra)[^"]*"' access.log | sort -u
echo ""
echo "=== Failed Logins by IP ==="
awk '$9 == 401 {count[$1]++} END{for(ip in count) print count[ip], ip}' access.log | sort -rn
echo ""
echo "=== Top Requested URLs ==="
awk '{print $7}' access.log | sort | uniq -c | sort -rn | head -10
echo ""
echo "=== 500 Errors ==="
awk '$9 == 500 {print $1, $7}' access.log | sort -u
Hands-on Lab
Grep the Logs
Practice these techniques on a real Apache log dump - find the attacker, their tools, and what they were targeting.
Key takeaways
- grep filters lines; -i ignores case, -v inverts, -E enables extended regex, -o extracts just the match.
- cut -d'x' -f2 extracts a field from delimited text.
- sort | uniq -c | sort -rn is the canonical "frequency ranking" pipeline - memorize it.
- sed 's/old/new/g' substitutes text; sed '/pattern/d' deletes matching lines.
- awk '{count[$1]++} END{...}' aggregates data in ways that sort | uniq can't.
- Chaining these tools with pipes turns a raw log file into actionable intelligence in seconds.