2 Unix Filters

Filters are building block programs.
- Read from stdin and write to stdout.
- Output is a function of input.
- Used in pipelines, often in shell scripts.
- Typical usage:
    prog file1 file2 | filter1 | filter2
    cat file1 file2 | filter1 | filter2
    filter1 <file1 | filter2
  The output of the final filter is often redirected to a file.

- Some filters can have file arguments:

    sort file1 file2 | filter1 | filter2

The following sections concentrate on particular filters.
 

2.1 tr

- tr translates characters.
- Normal usage: tr string1 string2 translates any character read from stdin that is in string1 into the corresponding character of string2; any other character is passed through unchanged.

- string1 and string2 normally have the same number of characters.

tr abc X5Z translates a into X, b into 5 and c into Z.

- A range of characters can be given using a special (limited) syntax with the [-] characters:

    tr [a-z] [A-Z]

Unfortunately the shell treats [] as a filename wildcard, which is not intended here, so the arguments must be quoted (the shell removes the quotes before passing the arguments to tr):

      tr '[a-z]' '[A-Z]'

translates lower-case letters to upper-case.

- A * after a character in string2 repeats that character as many times as required, which is useful when translating every character to the same one, eg 0 in the following:

       tr '[a-z]' '[0*]'

- To delete characters, eg all alphabetic characters, use -d:

      tr -d '[a-zA-Z]'

- The complement of string1 can be specified by -c, and a run of the same output character can be squeezed into a single character with -s.

- Control characters can be specified as octal numbers of the form \012 (newline).

- Eg to put each word in lower case on a separate line (ie separated by a newline or \012 character) while discarding every other character:

      tr '[A-Z]' '[a-z]' | tr -cs '[a-z]' '[\012*]'
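
- The tr usages above can be tried on a made-up sentence. This is only a sketch: the sample text and the variable names are assumptions for illustration. It keeps the bracketed forms used in these notes; in a POSIX-style tr the brackets are ordinary characters that translate to themselves, so the results are the same for this input.

```shell
# Sample input (hypothetical); each result is captured in a variable.
text='Hello, World! Hello again.'

# Translate lower-case letters to upper-case.
upper=$(printf '%s\n' "$text" | tr '[a-z]' '[A-Z]')

# Delete all alphabetic characters, leaving punctuation and spaces.
no_letters=$(printf '%s\n' "$text" | tr -d '[a-zA-Z]')

# One lower-case word per line: fold case, then replace every run of
# non-letters with a single newline (\012).
words=$(printf '%s\n' "$text" | tr '[A-Z]' '[a-z]' | tr -cs '[a-z]' '[\012*]')

printf '%s\n' "$upper" "$no_letters" "$words"
```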
 

2.2 grep

- grep searches for regular expression patterns in lines.

- As a filter, grep copies the lines from stdin that contain the search pattern to stdout; other lines are discarded.

- Egs
    grep found
    displays lines containing found.

    grep '[Pp]hrase of interest'
    displays lines containing phrase of interest or Phrase of interest. Quotes are needed so
    that the shell passes the spaces and [] on to grep unchanged.

- As a program reading its file arguments, grep is used to find the files that contain the search pattern and/or show the matching lines.

    grep '^function' file1.c file2.c
    names each file and displays its lines that have function at the beginning of the line.

- grep returns exit status 0 if a match was found, 1 if no match was found, and 2 for an error such as a bad pattern. (Display it with echo $?)

- grep has several options:
        -i ignore case
        -l list file names only
        -c print only a count of the matching lines
        -n number lines
        -v output lines that don't match pattern
        -e pattern  give the pattern explicitly; allows a pattern starting with - or multiple patterns

- Eg to find processes run by user fred in BSD and SystemV respectively:

       ps aux | grep fred | grep -v grep
       ps -ef | grep fred | grep -v grep

- Eg to edit the C source files containing Fred:

    vi `grep -l Fred *.c`

There are also fgrep (fixed strings with no regular expressions, hence fast; also available as grep -F) and egrep (extended regular expressions; also available as grep -E).
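
- A small sketch of grep as a filter, showing some of the options and the exit status. The sample lines and variable names are made up for illustration.

```shell
# Sample input (hypothetical).
input='nothing here
Phrase of interest
a phrase of interest
lost and found'

# Lines containing either capitalisation of the phrase.
matches=$(printf '%s\n' "$input" | grep '[Pp]hrase of interest')

# -c counts the matching lines; -i ignores case.
count=$(printf '%s\n' "$input" | grep -c -i 'PHRASE')

# -v keeps the lines that do NOT match; -n prefixes each with its line number.
others=$(printf '%s\n' "$input" | grep -v -n -i 'phrase')

# Exit status: 1 means no line matched (0 would mean a match was found).
status=0
printf '%s\n' "$input" | grep 'absent' >/dev/null || status=$?

printf '%s\n' "$matches" "$count" "$others" "$status"
```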
 

2.3 head

- head is used to extract only the first lines from a pipeline or from files, often when prototyping.

- As a filter,
    prog file | head -100 | filter2
    passes only the first 100 lines of stdin on to stdout.

- As a program with one file argument, it can drive a pipeline from the start of an input file:

    head -1000 largeFile | filter1 | filter2
 
 

2.4 tail


- tail displays lines through to the end of a file.

- Commonly used to look at the end of a file, eg the last 30 lines in the following:

       tail -30 file

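- Also used to skip leading lines and display from a given line onwards. Eg to display from line 100000 to end:

       tail +100000 file

  (The POSIX form is tail -n +100000 file; some current implementations accept only this form.)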

- Works as a filter. Eg start from line 2000 and use head to stop after 1000 lines (ie pass on lines 2000 to 2999):

       prog file | tail +2000 | head -1000 | filter
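
- The head and tail usages can be sketched on a ten-line input. The helper function numbers is made up for illustration and stands in for prog file; the POSIX -n forms are used here on the assumption that the local head and tail follow POSIX.

```shell
# Generate the lines 1..10 (a hypothetical stand-in for "prog file").
numbers() { awk 'BEGIN { for (i = 1; i <= 10; i++) print i }'; }

first3=$(numbers | head -n 3)               # first 3 lines
last2=$(numbers | tail -n 2)                # last 2 lines
from8=$(numbers | tail -n +8)               # line 8 to the end
slice=$(numbers | tail -n +4 | head -n 3)   # lines 4 to 6

printf '%s\n' "$first3" "$last2" "$from8" "$slice"
```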
 
 

2.5 sort

- sort reads lines and reorders them into ascending or descending order (according to default or given options).

- sort as a filter:
    prog file1 file2 | sort | filter2

- sort as program driving a pipeline:

    sort file1 file2 | filter1 | filter2

- By default, lines are ordered using the Ascii collation sequence (or order).
        - Display the Ascii order with man ascii.
        - sort treats lines as records separated by newline characters.

- In the following examples, the input file displayed in the left column produces the output shown in the right column. Eg sort sAscii

        sAscii:        output:
        abcc a         aa b
        ab a           aabc 9
        aabc 9         ab a
        c d            abcc a
        ba 3           abcc e
        abcc e         ba 3
        aa b           c d

- The sort order can be reversed, ie made descending, by
    sort -r sAscii

        sAscii:         output:
        abcc a          c d
        ab a            ba 3
        aabc 9          abcc e
        c d             abcc a
        ba 3            ab a
        abcc e          aabc 9
        aa b            aa b

- For sorting numbers, the default Ascii order is inappropriate.

    Eg sort sNumbers

        sNumbers: output:
        11          1
         1          23
        21         05
        05         10
        10         11
         23        2
        2          21

Notice that space comes before the digits (and the alphabetic characters) in the Ascii collation sequence. Also, the integers are being treated as strings (so 2 was placed after 10!).

To sort numerically, give the -n option, eg sort -n sNumbers

    sNumbers:  output:
    11          1
     1         2
    21         4
    05         05
    4          10
    10         11
     23        21
    2           23
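
- The difference between the default and numeric orders can be checked with the data from the sort -n example above. A sketch only: the helper function numbers is made up for illustration, and LC_ALL=C is set on the assumption that the local sort would otherwise collate according to the current locale rather than in Ascii order.

```shell
# The input lines from the sort -n example; two of them start with a space.
numbers() { printf '%s\n' '11' ' 1' '21' '05' '4' '10' ' 23' '2'; }

# Default: lines compared as strings, so space sorts first and 2 follows 11.
ascii=$(numbers | LC_ALL=C sort)

# -n: lines compared as numbers; leading spaces and zeros do not matter.
numeric=$(numbers | LC_ALL=C sort -n)

# -n combined with -r: descending numeric order.
descending=$(numbers | LC_ALL=C sort -nr)

printf '%s\n' "$ascii" '--' "$numeric" '--' "$descending"
```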
 

- Fields in sort are separated by space and/or Tab characters.
        - The following line has 5 fields, numbered 1 to 5:

      a*b->c**->d->->e

    where * represents a space and -> represents a Tab character.

- Sorting can require the use of certain fields before others, ie specifying the sort key order.

    - Sort keys are usually specified by (skip,finish) pairs in the format +skip -finish,
      where skip is the number of fields to skip over and finish is the number of the
      last field included in the key:

       +1 -3 skip 1 field (so start on field 2) and finish on field 3
       +5 -6 skip 5 fields (so start on field 6) and finish on field 6
       +0 -2 skip 0 fields (so start on field 1) and finish on field 2

    - If -finish is omitted, the key runs to the end of the line:

       +4 for 5 fields is the same as +4 -5

    - POSIX sort writes the same keys as -k start,finish with fields numbered from 1,
      eg +1 -3 is -k 2,3; some current implementations accept only the -k form.

- Eg for 5 fields, could specify sort key order of fields 2 and 3, then field 1, then fields 4 and 5 by the pairs
       sort +1 -3 +0 -1 +3 -5
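
- The same key order can be sketched in the POSIX -k notation, where +1 -3 +0 -1 +3 -5 becomes -k2,3 -k1,1 -k4,5. The five-field sample lines and the helper function records are made up for illustration; LC_ALL=C is an assumption to force Ascii order.

```shell
# Hypothetical input with 5 space-separated fields per line.
records() {
    printf '%s\n' 'b 2 x 1 1' 'a 1 y 9 9' 'a 1 y 3 3' 'c 1 x 5 5'
}

# Sort on fields 2-3 first, then field 1, then fields 4-5.
# Equivalent to the old-style: sort +1 -3 +0 -1 +3 -5
sorted=$(records | LC_ALL=C sort -k2,3 -k1,1 -k4,5)

printf '%s\n' "$sorted"
```

  The two lines starting with a tie on fields 2-3 and on field 1, so only the last key (fields 4-5) decides their order.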

- Can have more complicated orderings. Eg for 5 fields again, could specify sort key order of field 2 as numbers, then fields 4 and 5 as Ascii (default) but in reverse order, then field 1 as numbers in reverse order, then field 3, by either of
    sort +1 -2n +3 -5r +0 -1nr +2 -3
    sort +1n -2 +3r -5 +0nr -1 +2 -3

- sort +1 -2nr +0 -1 sAscii produces:

        sAscii:         output:
        abcc a         aabc 9
        ab a           ba 3
        aabc 9         aa b
        c d            ab a
        ba 3           abcc a
        abcc e         abcc e
        aa b           c d

Notice that the first field was needed to resolve the order of the lines whose second field did not contain a number (and so was taken as 0).
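
- The sAscii example above can be reproduced with a sort that takes only the POSIX -k notation, writing +1 -2nr +0 -1 as -k2,2nr -k1,1. A sketch: the helper function sAscii simply prints the lines of the file of that name, and LC_ALL=C is an assumption to force Ascii order.

```shell
# The lines of the sAscii file used in these notes.
sAscii() {
    printf '%s\n' 'abcc a' 'ab a' 'aabc 9' 'c d' 'ba 3' 'abcc e' 'aa b'
}

# Field 2 as a number in reverse order, then field 1 in Ascii order.
# Non-numeric values in field 2 count as 0.
sorted=$(sAscii | LC_ALL=C sort -k2,2nr -k1,1)

printf '%s\n' "$sorted"
```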

- Changing the field separator (or tabulation character) from space or Tab:
       -t: changes the field separator to :

- Ignoring case (or folding):
       -f fold upper case to lower case

- Merging pre-sorted files is linear in time complexity and hence quicker than sorting all the files together. sort -m file1 file2 merges the pre-sorted files.

- It is sometimes convenient to specify an output file
       -o outFile
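
- A sketch of merging two pre-sorted files, first to stdout and then to an output file with -o. The file names and contents are made up for illustration, and a temporary directory from mktemp is assumed to be available.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)

# Two files that are each already sorted (hypothetical contents).
printf '%s\n' a c e > "$dir/file1"
printf '%s\n' b d f > "$dir/file2"

# -m merges without re-sorting; the result goes to stdout.
merged=$(LC_ALL=C sort -m "$dir/file1" "$dir/file2")

# -o names an output file instead.
LC_ALL=C sort -m -o "$dir/outFile" "$dir/file1" "$dir/file2"
from_file=$(cat "$dir/outFile")

printf '%s\n' "$merged"
rm -rf "$dir"
```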

- Subfields may be specified using dot notation on the fields. Eg
       +2.4 -2.7 skip 2 fields and 4 characters of field 3 (so start on its 5th character); finish on 7th character of field 3

    - Note, you can treat a line as one field by specifying a tabulation character that is
    never used on any line; the keys are then given as character positions within that single field.

- To discard duplicated lines after writing the first occurrence
    -u unique output
 

2.6 uniq

uniq compares adjacent lines only, so it assumes the lines of the Ascii file are sorted; it deals with duplicated lines according to the option given.

- With no option, one copy of each distinct line is printed to stdout; the repeated copies are discarded.
- With -u only those lines which have no duplicates are printed.
- With -d only those lines which have duplicates are printed.

Egs uniq dups, uniq -u dups and uniq -d dups produce the following:

dups:     default output:     -u output:     -d output:
a         a                   a              b
b         b                   c              ddd
b         c                   dd
c         dd
dd        ddd
ddd
ddd
ddd

uniq runs as filter or program driving a pipeline.

There are also options to skip leading fields or characters when comparing lines, and to count duplicated lines (-c).
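
- The three outputs above, plus the counting option -c, can be sketched as follows. The helper function dups prints the lines of the dups file; the awk step is an addition for illustration that tidies the padding uniq -c puts before each count, which varies between implementations.

```shell
# The lines of the dups file used above (already sorted).
dups() { printf '%s\n' a b b c dd ddd ddd ddd; }

plain=$(dups | uniq)           # one copy of each distinct line
only_unique=$(dups | uniq -u)  # lines that have no duplicates
only_dups=$(dups | uniq -d)    # one copy of each duplicated line

# -c prefixes each line with its count; awk rewrites that as line=count.
counts=$(dups | uniq -c | awk '{ print $2 "=" $1 }')

printf '%s\n' "$plain" '--' "$only_unique" '--' "$only_dups" '--' "$counts"
```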
 
 

2.7 cut


- cut prints selected columns or fields to stdout.
- cut is available on SystemV and as a GNU program (free software rather than public domain).
- cut runs as filter or program driving a pipeline.
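- For columns, use -clist where list is a comma-separated list of column numbers and/or ranges. Eg to output columns 3 to 7 and column 11 onwards:
       cut -c3-7,11-
  cut always outputs the selected columns in their input order, whatever order the list is written in, so cut -c7,3-6,11- produces the same output.

- Fields are separated by the Tab character. For fields, use -flist where list is a comma-separated list of field numbers and/or ranges. Eg to output fields 1, 3 and 7:
       cut -f1,3,7
  Again the order of the list is ignored: cut -f3,7,1 produces the same output.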

- To change the delimiting (or tabulation) character use -dc, where c is the required character. Eg -d' ' to set to space and -d: to set to :.
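
- A sketch of cut by column, by field with a changed delimiter, and by Tab-separated field. The sample line is made up in the style of a password file entry; the variable names are for illustration only.

```shell
# Hypothetical colon-separated record.
line='root:x:0:0:Super User:/root:/bin/sh'

# Columns 1 to 4 and column 6 (column 5 is the first colon).
cols=$(printf '%s\n' "$line" | cut -c1-4,6)

# Fields 1 and 7 with : as the delimiter.
fields=$(printf '%s\n' "$line" | cut -d: -f1,7)

# Writing the list in another order does not reorder the output.
swapped=$(printf '%s\n' "$line" | cut -d: -f7,1)

# Default delimiter is Tab: field 2 onwards.
tabbed=$(printf 'one\ttwo\tthree\n' | cut -f2-)

printf '%s\n' "$cols" "$fields" "$swapped" "$tabbed"
```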
 

2.8 paste


- paste puts files into columns, printing to stdout.

- paste is available on SystemV and as a GNU program (free software rather than public domain).

- To put multiple files into side-by-side columns, use for example paste pfile1 pfile2 to produce

        pfile1:     pfile2:            output:

        a line 1     b line 1         a line 1     b line 1
        a line 2     b line 2         a line 2     b line 2
        a line 3                      a line 3
 

- By default a Tab character is placed between the columns for tabulation.

- To change the delimiting (or tabulation) character use -dc, where c is the required character. Eg paste -d: pfile1 pfile2 produces

       pfile1:     pfile2:         output:

        a line 1     b line 1         a line 1:b line 1
        a line 2     b line 2         a line 2:b line 2
        a line 3                      a line 3:
 

- To squash the lines of a file into a single line (ie turn its rows into columns), use -s. Eg paste -s -d:, pfile1 squashes three lines into one line, using a : and a , in turn as the separators, to produce

        pfile1:     output:

        a line 1     a line 1:a line 2,a line 3
        a line 2
        a line 3
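
- The paste examples above can be reproduced as a sketch using scratch copies of pfile1 and pfile2. The temporary directory from mktemp and the variable names are assumptions for illustration.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)
printf '%s\n' 'a line 1' 'a line 2' 'a line 3' > "$dir/pfile1"
printf '%s\n' 'b line 1' 'b line 2' > "$dir/pfile2"

# Side-by-side columns separated by : (pfile2 is one line short).
side=$(paste -d: "$dir/pfile1" "$dir/pfile2")

# -s squashes the lines of one file, cycling through the delimiters : and ,
serial=$(paste -s -d:, "$dir/pfile1")

printf '%s\n' "$side" "$serial"
rm -rf "$dir"
```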

- cut and paste can be used to reorder columns. Eg from a file with columns col1 col2 col3 to get col2 col1 col3:

    cut -f2 file > col2
    cut -f1,3 file > cols1+3
    paste col2 cols1+3 > newfile

This can also be done as follows

    cut -f2 file > col2
    cut -f1,3 file | paste col2 - > newfile

where - as a filename means stdin. (This is used in many other filters too.)
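
- The second form can be sketched end to end on a small Tab-separated file. The file contents, the temporary directory from mktemp and the variable name reordered are made up for illustration.

```shell
# Work in a scratch directory so no real files are touched.
dir=$(mktemp -d)

# Hypothetical input with three Tab-separated columns.
printf 'a1\tb1\tc1\na2\tb2\tc2\n' > "$dir/file"

# Save column 2, then paste it in front of columns 1 and 3 read from stdin (-).
cut -f2 "$dir/file" > "$dir/col2"
cut -f1,3 "$dir/file" | paste "$dir/col2" - > "$dir/newfile"

reordered=$(cat "$dir/newfile")
printf '%s\n' "$reordered"
rm -rf "$dir"
```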