Each of them is optimized to handle large files without storing a lot of information in memory, so they can be useful for quick operations on large bioinformatic data sets. Neither is a fully-featured programming language that you would want to write large, complex programs in (that said, I did once implement a complete program for full-sibling inference from multiallelic markers in awk); however, they do share many of the useful text-manipulation capabilities of such languages as Perl and Python.
Both awk and sed rely heavily on regular expressions to describe patterns in text upon which some operation should be performed. In this chapter we will only scratch the surface of what can be done with awk and sed. Our goal here is to provide an introduction to a few basic maneuvers with both awk and sed and to describe instances where they can be useful, as well as to give an introduction to many (but certainly not all) of the patterns that can be expressed using regular expressions.
We will start with a basic overview of how awk works. Then we will have a short look at regular expressions, then we will use those further in awk, and finally we will play with sed a little bit.
In order to have a set of files to use in the examples, I have made a small GitHub repository called awk-and-sed-inputs. You can get it by cloning that repository with git. All of the examples in this chapter that use such external files assume that the current working directory is the repository directory awk-and-sed-inputs. awk reads its input one line at a time. At each new line, awk tests whether it should do anything with the contents of the line.
The scope of ways that awk can operate on text is quite wide, but the most common use of awk is to print parts of the line, or do small calculations on parts of the line. In the basic syntax, an awk script is a series of tests, each followed by the actions to perform when that test is true. Code describing the actions must always appear within a set of curly braces. We will refer to the text within those curly braces as an action block. Two things to note: first, different tests and actions do not have to be written on separate lines.
Second, if no input file is named, awk reads from standard input. This makes it easy to pipe data into awk. Putting these two points together, an awk script skeleton can equally be written on a single line, with its input piped in. Take a moment to make sure you understand which parts are the tests, and which are the actions, in any awk script you read. Every time awk processes a line of text it breaks it into different fields, which you can think of as columns (as in a spreadsheet). By default, any number of whitespace characters (space or TAB) constitutes a break between columns.
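As a minimal sketch of tests, action blocks, and fields, consider the following. The three-column input and the function names demo_input and tests_and_actions are hypothetical, made up for illustration; they are not files from the awk-and-sed-inputs repository.

```sh
# Hypothetical whitespace-delimited input: chromosome, position, allele.
demo_input() {
  printf 'chr1 100 A\nchr2 250 G\nchr1 300 T\n'
}

# Two test/action pairs written on a single line. awk reads standard input
# here because no file name is given.
#   test 1: first field equals "chr1"   -> print fields 2 and 3
#   test 2: second field exceeds 200    -> print a label and the whole line
tests_and_actions() {
  awk '$1 == "chr1" {print $2, $3}  $2 > 200 {print "big:", $0}'
}

demo_input | tests_and_actions
```

Both tests are evaluated for every line, so a line that satisfies both (the third one here) triggers both action blocks.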
If you want field-splitting to be done using more specific characters, you can specify that on the command line with the -F option. While the field separator is often a single character, in full-blown practice it can be specified as a regular expression (see below). Note that if you give the action print without any arguments, that also just prints the whole line.
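The -F option can be sketched as follows. The helper name second_field and the sample inputs are hypothetical and only serve to show a single-character separator next to a separator given as a regular expression.

```sh
# Print the second field, splitting on the separator passed as argument 1.
second_field() {
  awk -F"$1" '{print $2}'
}

printf 'a:b:c\n' | second_field ':'      # colon-delimited input
printf 'x,y,z\n' | second_field ','      # comma-delimited input
printf 'p:q;r\n' | second_field '[:;]'   # separator given as a regular expression
```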
The most common operation done with sed is substitution, replacing one block of text with another.
Unlike many conventional languages, awk is "data driven" -- you specify what kind of data you are interested in and the operations to be performed when that data is found. Subsequent sed commands are always applied to the current content of the pattern space and not the original input line.
There is a separate print command p for printing the pattern space. The second buffer used by sed is the hold space. The pattern space can be stored in the hold space for later reuse. The hold space is not erased if a new cycle is begun. The content of the pattern space can be overwritten by the content of the hold space. In addition, appending one of the two buffers to the other is possible. If pce is applied to a file, then, first, all strings george in a line are replaced by strings bill.
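The interplay of pattern space and hold space can be illustrated with a well-known sed idiom that prints the input lines in reverse order. This is a sketch for illustration, not a program from the text; the function name reverse_lines is hypothetical.

```sh
# Print the input lines in reverse order using the hold space.
#   1!G  on every line but the first, append the hold space to the pattern space
#   h    copy the pattern space into the hold space
#   $p   on the last line, print the pattern space
# -n switches off the default printing, so only $p produces output.
reverse_lines() {
  sed -n '1!G;h;$p'
}

printf 'one\ntwo\nthree\n' | reverse_lines
```

Because the hold space survives from one cycle to the next, it accumulates all lines seen so far, newest first.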
These two actions comprise the cycle per line. In that case, Command is then applied to every pattern space. If an Address is given, then Command is applied to the pattern space only in the case the latter matches Address. Patterns are matched by sed as the longest, non-overlapping strings possible. As already illustrated above, some of the sed commands allow the placement of newline characters in the pattern space.
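A small sketch of the two points above: a command with and without an address, and the longest-match rule. The function names and sample inputs are hypothetical.

```sh
# Without an address, the command is applied to every pattern space.
replace_o_everywhere() {
  sed 's/o/0/'
}

# With an address, the command is applied only to lines matching /cat/.
capitalize_on_cat_lines() {
  sed '/cat/s/c/C/'
}

# Longest match: a.*b extends to the last b on the line, not the first.
longest_match() {
  sed 's/a.*b/X/'
}

printf 'cot\ndog\n' | replace_o_everywhere
printf 'cat\ncot\n' | capitalize_on_cat_lines
printf 'a1b2b3\n' | longest_match
```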
They must not be repeated in the replacement in a substitution command. Within a range [what], the following rules apply. R1: The backslash only represents itself. R2: The closing bracket ] must be the first character in what in order to be recognized as itself. R3: Ranges of the type a-z, A-Z, in what are permitted.
R4: The hyphen - must be at the beginning or the end of what in order to be recognized as itself. The rules R1-R4 set under 5 also apply here. As indicated in the last two examples, patterns are matched by sed as the longest, non-overlapping strings possible. If one wants to process overlapping patterns, then one can use the t command described below. In the next sections, we shall explore the possibilities in using patterns in substitution commands.
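Rules R2 to R4 can be checked with a short sketch. The function names and inputs are hypothetical.

```sh
# R2 and R4: a ] placed first and a - placed last inside the brackets
# are both taken literally.
brackets_to_underscore() {
  sed 's/[]-]/_/g'
}

# R3: ranges such as a-z and A-Z may be combined in one bracket expression.
letters_to_L() {
  sed 's/[a-zA-Z]/L/g'
}

printf 'a]b-c\n' | brackets_to_underscore
printf 'aB1\n' | letters_to_L
```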
This is in our experience the most frequent use of patterns. Patterns as addresses and other types of addresses will be discussed afterwards. Alternatively, the code given below may be included in larger sed programs when needed. However, dividing processes into small entities as given in the examples below is a very useful technique to isolate reusable components and to avoid programming mistakes resulting from over-complexity of single programs.
In what follows, we shall refer to this program as addBlanks. All ranges in the sed program contain a blank and a tab. Then, a single blank is placed at the beginning and the end of the pattern space. Finally, any resulting white pattern space is cleared from blanks and tabs in the last substitution command. To identify the first liberal, sed needs the blank in the string which is then not available to identify the second.
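The original listing of addBlanks is not reproduced here, so the following is only a hedged reconstruction from the description. In particular, the first substitution (replacing every string of blanks and tabs by two blanks) is an assumption, chosen because it lets neighboring words each keep a blank of their own. The search step markLiberal is hypothetical.

```sh
TAB=$(printf '\t')   # a literal tab, so that every range holds a blank and a tab

# Sketch of addBlanks (reconstruction, not the original listing):
#   1. ASSUMPTION: every string of blanks/tabs becomes two blanks
#   2. a single blank is placed at the beginning of the pattern space
#   3. a single blank is placed at the end of the pattern space
#   4. a pattern space that is now entirely white is emptied
addBlanks() {
  sed -e "s/[ ${TAB}][ ${TAB}]*/  /g" \
      -e 's/^/ /' \
      -e 's/$/ /' \
      -e "s/^[ ${TAB}]*\$//"
}

# Hypothetical search step: mark every stand-alone word "liberal".
markLiberal() {
  sed 's/ liberal / LIBERAL /g'
}

# Without preprocessing, the blank between the two words is consumed by the
# first match, so the second occurrence is missed.
printf ' liberal liberal \n' | markLiberal

# After addBlanks, each word has its own framing blanks and both are marked.
printf 'liberal liberal\n' | addBlanks | markLiberal
```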
Recall that sed matches non-overlapping patterns. Instead of repeating the first pattern, one could loop over it once.
Looping with sed will be explained below. If one preprocesses the source file with addBlanks, only the first pattern is needed once. Thus, a sed based search program for Liberal and liberal is shortened and faster. Example: The following program is a variation of addBlanks.
It can be used to isolate words in text in a somewhat crude fashion. In fact, abbreviations and words that contain a hyphen or an apostrophe are not properly identified.
The white ranges in the sed program contain a blank and a tab each. Then, a single blank is placed at the beginning and the end of a line. In what follows, we shall refer to this program as adjustBlankTabs. Every range contains a blank and a tab.
All white strings are replaced by a single blank in the last substitution command. Application: adjustBlankTabs standardizes and minimizes phrases as strings which may automatically be obtained from e-mail messages with inconsistent typing style or text files that have been justified left and right.
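A program along the lines of adjustBlankTabs might look as follows. Only the last substitution is stated in the description; removing white characters at the beginning and the end of each line first is an assumption about the missing listing.

```sh
TAB=$(printf '\t')   # a literal tab, so that every range holds a blank and a tab

# Sketch of adjustBlankTabs (reconstruction, not the original listing):
#   1. ASSUMPTION: leading blanks and tabs are removed
#   2. ASSUMPTION: trailing blanks and tabs are removed
#   3. every remaining white string is replaced by a single blank
adjustBlankTabs() {
  sed -e "s/^[ ${TAB}]*//" \
      -e "s/[ ${TAB}]*\$//" \
      -e "s/[ ${TAB}][ ${TAB}]*/ /g"
}

printf '  two \t words  \n' | adjustBlankTabs
```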
This is useful if one wants to analyze sentences. Example: The following program folds all lines in a text (it inserts newline characters after the first string of blanks or tabs following every string of at least 10 characters).
All ranges contain a blank and a tab. A newline character is inserted in the pattern space after every sequence of characters specified in the combined pattern. Application: Some editors allow sending files via e-mail from within the editor. This can lead to particularly long lines in e-mail messages. If one intends to process such e-mail messages automatically, then a customized version of the above program that folds after a suitable number of characters can be used to counter this effect. Tagged regular expressions can be used for extending, dividing and rearranging patterns and their parts.
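As a minimal illustration of tagged regular expressions, the sketch below rearranges and extends matched text. The function names and inputs are hypothetical.

```sh
# \( ... \) tags a part of the pattern; \1 and \2 in the replacement refer
# to the first and second tagged part, so the parts can be rearranged.
swap_first_two_words() {
  sed 's/\([^ ]*\) \([^ ]*\)/\2 \1/'
}

# & stands for the whole match, so a pattern can be extended in place.
bracket_numbers() {
  sed 's/[0-9][0-9]*/<&>/g'
}

printf 'hello world again\n' | swap_first_two_words
printf 'room 12, floor 3\n' | bracket_numbers
```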
More detail about the usage of tagged regular expressions is given in the following examples: Example: The following program shows a first application of the techniques introduced so far. We shall refer to it as markDeterminers.
After the tagging is completed, the triple period is restored. For example, the string "A liberal? A note on addBlanks: Instead of using addBlanks one may be tempted to work, e. However, this substitution command causes the string "Another?
Application: A collection of tagging programs such as markDeterminers can be used for elementary grammatical analysis. Example: The following program shows how one can properly identify words in text. We shall refer to it as leaveOnlyWords in the sequel. This is the longest program listing in this paper. Next, strings of the type letters. For example, v. Next comes a collection of substitution commands that replaces the period in standard abbreviations with an underscore character.
Then (line 7), all period characters are replaced by blanks and subsequently all underscore characters by periods. Next (line 8), every apostrophe which is embedded between two letters is replaced by an underscore character.
All other apostrophes are then replaced by blanks, and subsequently all underscore characters are replaced by apostrophes. Finally (line 9), the hyphen is treated in a similar way as the apostrophe. We shall refer to this program as doubleLetterWords. In the second substitution command, all unmarked words are deleted. To illustrate this by an example consider the following: after being processed by the first substitution command, the line Now, I will tell you why.
Finally, the underscore characters are deleted. Exercise: (1) Modify doubleLetterWords to search for double vowels as in moon. Use only one tagged regular expression for the latter. In that case, retain also the word that follows the word containing the string ing.
In what follows, we shall refer to this program as hideUnderscore. In what follows, we shall refer to this inverse program as restoreUnderscore.
Observe that sed scans the pattern space from left to right. This technique has already been demonstrated above in leaveOnlyWords and doubleLetterWords. Framed by underscore characters, these keywords are easily distinguishable from regular words in the text. Another application is to recognize the ending of sentences in the case of the period character. The period appears also in numbers and in abbreviations.
Replacing the period in the two latter cases by an underscore character first, and then interpreting each remaining period as the marker for the end of a sentence, is, with minor additions, one way to generate a file which contains one whole sentence per line. In that case, only the format of numbers changes occasionally.
Usually, the format of numbers in text sources is not checked for the purpose of language analysis. As outlined at the end of the last section, one has to implement the following steps: 1 If necessary, encrypt the source in such a way that one character which is unimportant for the subsequent analysis disappears from the text. This can be achieved by a program such as hideUnderscore. This has to be done in such a way that the pattern matching done under 3 does not apply to the special cases marked here.
Thus, rearrangement of tagged regular expressions is possible in the replacement in a substitution command. Example: The following program acts on short sentences on single lines. Line 42 is the last line that is put into the pattern space; it is processed (copied to the output), and sed then quits. Consult man more and man less for alternatives to the above program.
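Quitting at a given line number can be sketched as follows. The original listing is missing; the text's reference to line 42 suggests a program of the form sed 42q, which the hypothetical helper first_lines generalizes to any line number.

```sh
# Copy the first N lines of the input and then quit.
# With N = 42 this corresponds to the command: sed 42q
first_lines() {
  sed "${1}q"
}

printf '1\n2\n3\n4\n5\n' | first_lines 3
```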
Line numbers are cumulative over several files to which a sed program is applied. Consult man tail and man less for alternatives to the above program. In an address range of the form Address1,Address2, Address1 specifies where (on which line) actions start, and
Address2 specifies where actions end. The following program indents the code by two blanks. In fact, only non-empty code lines are indented. All white ranges contain a blank and a tab. The period in the line addressed by the range matches only the first character in a non-empty pattern space since there is no g trailing the substitution command.
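The indentation technique can be sketched like this. The original program and its addresses are not shown, so the marker lines BEGIN and END used as Address1 and Address2 are hypothetical.

```sh
# Indent every non-empty line in the range by two blanks.
# s/./  &/ has no trailing g, so only the first character is matched;
# an empty line contains no character and is therefore left alone.
# ASSUMPTION: the range is delimited by the marker lines BEGIN and END.
indent_block() {
  sed '/^BEGIN$/,/^END$/s/./  &/'
}

printf 'text\nBEGIN\nx = 1\n\nEND\ntext\n' | indent_block
```

Note that the two marker lines belong to the range and are indented as well.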
Example: The source code for this document contains several test programs for the claims made about sed commands in the next section. These programs are eliminated from the document through preprocessing with a one-line sed program. This is done in a similar fashion as above using a begin and an end address and the delete command d addressed by the range begin,end.
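Such a one-line preprocessing step might look as follows. The begin and end markers %BEGIN-TEST and %END-TEST are hypothetical, since the markers actually used in the source are not given.

```sh
# Delete every line from a begin marker through the matching end marker.
# ASSUMPTION: test programs are framed by %BEGIN-TEST and %END-TEST lines.
strip_tests() {
  sed '/^%BEGIN-TEST$/,/^%END-TEST$/d'
}

printf 'keep 1\n%%BEGIN-TEST\nscratch\n%%END-TEST\nkeep 2\n' | strip_tests
```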
In it, the number at the end of each section is the number of addresses possible. Usually, the command labeled by an address range is executed for every line in the range. We shall mention those commands that behave differently. The others may be skipped on first reading. What is appended is not put into the pattern space and not subject to the following sed commands.
The appended text is printed even if the pattern space is deleted afterwards in the cycle or the quit command is executed. If there is no whereTo, then branch to the end of the script. The : whereTo may occur before b whereTo in the program creating a loop.
In that case, another b command has to be used to leave the loop. Or, an address in front of the b command must deactivate the loop eventually. The current content of the pattern space is deleted, and a new cycle is started. Consequently, what is printed is not subject to the following sed commands. If an address range is given, then printing is done at the end of the address range. However, the current content of the pattern space is deleted for the full address range. Thus, with an address range one can exchange, e.
In those cases, a newline character separating old and new is appended first. If the hold space is empty, then this results in an empty pattern space. This is useful for repeated analysis of the original input line which can be stored in the hold space with the command h. This includes appending a newline character first separating old and new. Storing the pattern space in the hold space makes it possible to reinvestigate the original line or an intermediate state of the pattern space.
This includes appending a newline character first separating old and new. Thus, an H-G sequence may create many empty lines due to double newline characters. What is inserted is not subject to the following sed commands. In connection with the first line address 1, the i command can be used to prepend something to a document. This can be used to identify Japanese characters [Lunde ] in bilingual text. In addition, the next line of input is put into the pattern space.
The current line number changes. However, a new cycle is not started from the top. Instead, the sed program is continued at the current program line for the pattern space with the new content. If there is no interference by other commands, then the switch by the n command in the pattern space is done for every second line of input.
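The effect of n on every second line can be seen in a minimal sketch. The function name and input are hypothetical, and the input deliberately has an even number of lines so that n never runs out of input.

```sh
# n prints the pattern space, reads the next input line into it, and the
# program continues with the command that follows n. The substitution
# therefore only ever sees lines 2, 4, 6, ...
every_second_line() {
  sed 'n;s/a/X/'
}

printf 'a1\na2\na3\na4\n' | every_second_line
```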
In the case of an address range, the addresses will only work, if the pattern space is matched before the n command is executed.
Compare the example given next. If sed is invoked as sed -n, then printing is suppressed, and only the next line of input is put into the pattern space. The lines 2, 4 and 6 were only subject to the first substitution command. The lines 3, 5 and 7 were only subject to the final substitution command. Note that 7ay was obtained after 6bxE. This shows that the n command may have consequences one line beyond an address range associated with it. The 8by in the output shows that executing n stopped at 6bxE since both substitution commands were applied to the line containing 8ax.
In contrast to that, 1ax 2axS 3ax 4ax 5axE 6ax 7ax 8ax yields 1by 2bxS 3ay 4bx 5ayE 6bx 7ay 8bx. As above, N has an effect one line beyond a range and can miss an address, if the line matching the address is appended. If there is an attempt to append something beyond the end of the file, then sed quits and misses processing and printing the last pattern space.
Thus, in the usual sed mode one gets an additional line of output if the pattern space is not deleted afterwards. However, the default printing by sed can be switched off by invoking it as sed -n.
What is copied is not put into the pattern space. Copying is done even if the current pattern space is deleted or the q command is executed afterwards in the cycle. If no n or N commands are used, then the copying is done before processing the next input line. One can print to at most 10 different files. In case one has to use more files, one can split the sed program and use a pipe in which every piece uses up to 10 files.
Larger text files that are processed may contain exceptional cases to patterns that are manipulated. If there is no whereTo, then start a new cycle. The : whereTo may occur before t whereTo in the program creating a loop. Creating loops with the t command can be used to (re)substitute in overlapping patterns. It can also be used for reprocessing if the pattern in a particular preceding substitution is possibly generated by a subsequent substitution. The w command can be used to sort pieces of a file into several files.
After a w command, everything that follows after some white space on the same line is understood as the filename to which the command is supposed to write. Thus, after a w command no other command can follow on the same line. The lengths of string1 and string2 must be equal. Ranges are not allowed. A substitution for it can be achieved by an additional s command. The y command can, e.
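The y command can be sketched with a character-by-character mapping to lower case. The name mapToLowerCase is taken from a program mentioned later in the text; whether the original listing looks exactly like this is an assumption.

```sh
# Map every upper-case letter to its lower-case counterpart.
# Both strings are spelled out in full and have equal length, because
# ranges such as A-Z are not allowed in the y command.
mapToLowerCase() {
  sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/'
}

printf 'Hello World\n' | mapToLowerCase
```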
As with the p command, printing is done immediately. Commands can be on separate lines or be separated by semicolons ;. Using a framing pair of curly braces, a non-address range command such as i can be applied to a range. Note the semicolon terminating the s command. An address range is only allowed if function allows it. For example, a header containing an address may be inserted in a document several times.
Or, a certain piece of code such as the declaration of a standard set of variables is used in many function definitions. This should be done only if the headers resp. If a header or footer is always added to a document, then using the UNIX command cat mentioned above together with separate files that contain the additions is best. It isolates non-white strings of characters in a text and puts every such string on a separate line. We shall call this oneItemPerLine in the sequel.
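A hedged sketch of oneItemPerLine, reconstructed from the description that follows, is given here. Deleting lines that contain only white characters is an assumption about the first command of the missing listing.

```sh
TAB=$(printf '\t')   # a literal tab, so that every range holds a blank and a tab

# Sketch of oneItemPerLine (reconstruction, not the original listing):
#   1. ASSUMPTION: lines consisting only of blanks and tabs are deleted
#   2. white characters at the beginning of a line are removed
#   3. white characters at the end of a line are removed
#   4. every remaining white string is replaced by a newline character
#      (written as a backslash followed by a real newline in the replacement)
oneItemPerLine() {
  sed -e "/^[ ${TAB}]*\$/d" \
      -e "s/^[ ${TAB}]*//" \
      -e "s/[ ${TAB}]*\$//" \
      -e "s/[ ${TAB}][ ${TAB}]*/\\
/g"
}

printf '  a  b \n \t\nc\n' | oneItemPerLine
```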
For non-white lines, white characters at the beginning and the end of lines are removed. Finally, all remaining strings of white characters are replaced by newline characters. Example: The following program finds all four-letter-words in a text. We shall refer to it as findFourLetterWords in the sequel.
This will occur only if a four-letter-word was found on a line. Example: The following program sorts all characters 0 zero to the right of a line. This shows the typical use of the t command. The second command exchanges all characters 0 with a neighboring non-zero to the right.
The last command tests whether or not a substitution happened. If a substitution happened, then the cycle is continued at : again. Otherwise, the cycle is terminated. Application: In the course of the investigation in [Abramson et al. Such control sequences had to be removed. This was done using substitution commands with empty replacements. Some of the control sequences in the source are important in regard to the database which was generated. In [Nelson ], Japanese is represented using kanji, kun pronunciation and on pronunciation.
The on pronunciation of kanji is typeset in italics characters. In the source file, the associated text is framed by a unique pair of control sequences. Similarly, the kun pronunciation of kanji is represented by small caps printing. Though quite regular already, it contains a certain collection of describable irregularities. For example, the ranges of framing pairs of control sequences overlap sometimes.
In order to match kun pronunciation and on pronunciation in the source file of [Nelson ] properly, a collection of commutation rules for control sequences was implemented to achieve that the control sequences needed for pattern matching only frame a piece of text and no other control sequences. These commutation rules were implemented in a similar way as the last example shows.
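The zero-sorting loop described above, which the commutation rules resemble, can be sketched in three commands. The label again is taken from the description; the function name zerosToRight is hypothetical, and the listing is a reconstruction rather than the original.

```sh
# Move every character 0 to the right end of the line.
#   :again              label marking the start of the loop
#   s/0\([^0]\)/\10/g   swap each 0 with a non-zero neighbor to its right
#                       (\1 is the tagged neighbor, followed by a literal 0)
#   t again             if a substitution happened, branch back to :again
zerosToRight() {
  sed -e ':again' -e 's/0\([^0]\)/\10/g' -e 't again'
}

printf '10203\n0012\n' | zerosToRight
```

The loop ends on its own once no 0 has a non-zero character to its right, because t only branches after a successful substitution.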
We shall refer to this program as quadrupleWords. The third command tests whether or not a substitution happened. If a substitution happened, then the cycle is continued at : AD. If no substitution happened, then the cycle is continued in the next line of the program. In the last line, all pattern spaces are deleted that do not contain a triple underscore character corresponding to a quadruple word in the original line of input.
In the last command of the program, everything after the first word is deleted. This will be outlined below in greater detail. We shall refer to it as sortByVowel in the sequel. An alternative is to use the UNIX rm-command.
Consult man rm for more details. Observe the use of the single quotes. Note that output by the w command is always appended to an existing file. Thus, the files have to be removed or empty versions have to be created in case the program has been used before on the same file. There is no direct output by this UNIX command. It is clear how to generalize this procedure to a more significant analysis, e. Recall that everything after a w command and separating white space until the end of the line is understood as the filename the w command is supposed to write to.
Example: We shall refer to the following program as mapToLowerCase. It does what its name says. The latter procedure was applied to short essays submitted by Japanese students via e-mail as homework. We were subsequently interested in selecting student-generated example sentences containing a specific problematical pattern for presentation in class.
The next two examples show such selection procedures. Also consult man grep. The grep-family of filters is designed to find lines in a file that match a certain pattern. We shall refer to it as printPredecessorBecause in the sequel. Also consult man grep in regard to the options -n n a positive integer , -A, -B, and -C.
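A hedged sketch of printPredecessorBecause, reconstructed from the command-by-command description that follows, is shown below. The address /[Bb]ecause/ and the final h command are assumptions, since the original listing is not reproduced.

```sh
# Print every line containing "because" together with the line before it.
# ASSUMPTION: the address is /[Bb]ecause/ and non-matching lines are
# simply stored in the hold space by the final h.
#   x  exchange pattern space and hold space: the previous line is now in
#      the pattern space and the current line is in the hold space
#   p  print the previous line
#   g  overwrite the pattern space with the current line
#   p  print the current line
#   b  branch to the end of the script, ending the cycle
#   h  (all other lines) store the line for the next cycle
printPredecessorBecause() {
  sed -n -e '/[Bb]ecause/{' -e 'x' -e 'p' -e 'g' -e 'p' -e 'b' -e '}' -e 'h'
}

printf 'It rained.\nBecause I stayed home.\nThe end.\n' | printPredecessorBecause
```

If the very first line matches, the hold space is still empty, so this sketch prints an empty line before it.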
Next, the new pattern space containing the previous line is printed by p. Then, the pattern space is overwritten by g with the current line which is also printed by p. The b command terminates the cycle. Write an awk program that does the same as the latter program. Disregard applying the program to multiple files. Print the filename using echo. Use tagged regular expressions in order to recognize the double words.
Be aware of properly processing the last line in connection with the N command. Such action can precede the use of the generated program. Alternatively, the generation of a program and its subsequent use are part of a single UNIX command. The latter possibility is outlined next. For example, the words the, a, an, if, then, and, or, We shall refer to the following program as eliminateList in the sequel.