I searched for \s{2,}
to select any spot where there
were two or more spaces and then replaced them with a comma followed by
a space.
Then, I searched for (?<=\d),(?=\d)
to find all
commas squished between two numbers, and replaced them with nothing.
Text produced:
Candidate Choice, Absentee Mail, Early Voting, Election Day, Total
Votes
TODD RUSS, 7021, 8194, 135216, 150431
CLARK JOLLEY, 7012, 5835, 107714, 120561
I used \s{2,}
to find all spots where number of spaces
was greater than 2 and replaced them with one space.
Then I used \s\w{1}.\s
to find any single letters after
a space and before a period and another space. These are the initials. I
replaced each one with a space.
Then I captured first name, last name, and email using this command:
(\w+), (\w+) (\w+@\w+.\w+)
Each capture contains a word
with a variable number of letters. These words are separated by a comma,
which I did not capture. For the email addresses, I included the at sign
and period. as part of the capture. Then I replaced with this command:
\2 \1 (\3)
This takes the second element I captured and
pastes it before the first element that I captured, and puts parentheses
around the third element.
Text produced:
Emily Adamic (ema3896@utulsa.edu)
Emily Bierbaum (elb0588@utulsa.edu)
Laci Cartmell (ljc454@utulsa.edu)
Elise Delaporte (eld0070@utulsa.edu)
Rebekah Hansen (reh9623@utulsa.edu)
Madison Herrboldt (mah1626@utulsa.edu)
Cari Lewis (cdl5261@utulsa.edu)
Tanner Mierow (ttm5619@utulsa.edu)
Daniel Naranjo (dsn8679@utulsa.edu)
Caleb Paslay (cap1050@utulsa.edu)
Olivia Pletcher (omp9336@utulsa.edu)
Amy West (acw1471@utulsa.edu)
I used this command , \w+ \w+,
to select the genus
species name. This selected the two words of variable length following
the space and comma and before another comma. Then I replaced this with
a new comma.
Text produced:
Banded sculpin, 5
Redspot chub, 5
Northern hog sucker, 6
Creek chub, 8
Stippled darter, 9
Smallmouth bass, 10
Logperch, 13
Slender madtom, 14
I searched for , (\w)\w+ (\w+)
This captured the first
letter of the word (genus) after the first comma and space, ignored the
rest of the word, and captured the species name after the space.
I replaced this with , \1_\2
This replaced it with a new
comma followed by a space, pasted the captured first letter, added an
underscore, and pasted the captured species name.
Text produced:
Banded sculpin, C_carolinae, 5
Redspot chub, N_asper, 5
Northern hog sucker, H_nigricans, 6
Creek chub, S_atromaculatus, 8
Stippled darter, E_punctulatum, 9
Smallmouth bass, M_dolomieu, 10
Logperch, P_caprodes, 13
Slender madtom, N_exilis, 14
I searched for , (\w)\w+ (\w{3})\w+
This captured the
first letter of the genus name after the first comma and space, ignored
the rest of the word, captured the first three letters of the species
name and ignored the rest of that word.
Then I replaced with , \1_\2.
This replaced it with a
new comma, pasted the captured first letter of the genus, added an
underscore, pasted the captured first three letters of species, and
added new period after the three letters.
Text produced:
Banded sculpin, C_car., 5
Redspot chub, N_asp., 5
Northern hog sucker, H_nig., 6
Creek chub, S_atr., 8
Stippled darter, E_pun., 9
Smallmouth bass, M_dol., 10
Logperch, P_cap., 13
Slender madtom, N_exi., 14
I used the command
grep '^>' protein.faa > headers.txt
to create a
document containing only the FASTA headers.
grep searches the text for the regular expression. The regular
expression is inside of the ’’ marks, and the ^ symbol tells the
terminal to check the beginning of each line in the indicated file for
the > symbol. We can then use > headers.txt
to create
a new text file name headers which will display each line that starts
with >. Lastly, we can then use the command
head headers.txt
to see the start of the new text file.
Result:
>NP_001009188.1 carboxylesterase 5A precursor [Felis catus]\
>NP_001009190.1 agouti-signaling protein precursor [Felis catus]\
>NP_001009193.1 interferon alpha-8 precursor [Felis catus]\
>NP_001009197.2 interferon alpha precursor [Felis catus]\
>NP_001009199.1 growth arrest and DNA damage-inducible protein GADD45 alpha [Felis catus]\
>NP_001009200.1 tumor necrosis factor receptor superfamily member 4 precursor [Felis catus]\
>NP_001009203.1 acetylcholinesterase precursor [Felis catus]\
>NP_001009205.1 low affinity immunoglobulin gamma Fc region receptor III-A precursor [Felis catus]\
>NP_001009206.1 potassium voltage-gated channel subfamily E member 1 [Felis catus]\
>NP_001009207.1 interleukin-15 precursor [Felis catus]\
sed '/>/ i\
ENDSEQUENCE \
' ./protein.faa | sed -n '/ribosom/,/ENDSEQUENCE/ {
/ENDSEQUENCE/! p
}' > ribosomes.txt
My explanation:
First, we used the '/>/ i\
command to insert a temporary
marker called “ENDSEQUENCE” as a new line above each line containing the
> bracket in the protein.faa file. This marker will allow us to
select the information before it without consuming any following lines
of text that must be searched for a match to ‘ribosom’. We need this
marker to address cases where two ribosomal protein sequences are listed
consecutively in the file.
Then, we selected each section of lines where the first line contains
“ribosom” and the last line contains “ENDSEQUENCE”. The comma in between
/ribosom/ and /ENDSEQUENCE/ indicates the range. We fed that into a
command that prints the line as long as the line doesn’t contain
ENDSEQUENCE. In this command, the exclamation point is a ‘not’ specifier
that says we are looking for lines that do not contain ENDSEQUENCE, and
the p command tells it to print. We then printed to the file
ribosomes.txt. The ENDSEQUENCE line is not printed to the file thanks to
the ‘not’ command earlier, and therefore our temporary marker is
removed.
My brother’s more detailed explanation of why we used a temporary
marker:
The use of the ENDSEQUENCE line is necessary because when matching a
range in sed, it consumes all the lines in the range. So without adding
the ENDSEQUENCE line, there are cases where we would have back-to-back
ribosomal proteins and the match on the first protein consumes the first
line of the next protein, which means sed would not match the next
protein properly. As a result, it provides the first line with this
protein’s name without printing any of the protein sequence
underneath.
Result:
>NP_001009832.1 40S ribosomal protein S7 [Felis catus]
MFSSSAKIVKPNGEKPDEFESGISQALLELEMNSDLKAQLRELNITAAKEIEVGGGRKAIIIFVPVPQLKSFQKIQVRLV
RELEKKFSGKHVVFIAQRRILPKPTRKSRTKNKQKRPRSRTLTAVHDAILEDLVFPSEIVGKRIRVKLDGSRLIKVHLDK
AQQNNVEHKVETFSGVYKKLTGKDVNFEFPEFQL
>NP_001041622.1 60S ribosomal protein L41 [Felis catus]
MRAKWRKKRMRRLKRKRRKMRQRSK
>NP_001116826.1 ubiquitin-60S ribosomal protein L40 precursor [Felis catus]
MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGGIIEP
SLRQLAQKYNCDKMICRKCYARLHPRAVNCRKKKCGHTNNLRPKKKVK