GeneCore basic linux tutorial

Introduction to terminal and bash

The command line is a classic way to run programs, and not only in UNIX environments.
It has no fancy windows, but it is hard to beat in terms of power and low resource use.
You need to grasp two concepts to work with the command line:

paths
commands

A path describes a unique location on the file system. It points to a file or a directory.
You are most likely familiar with the Windows folder structure, for example C:\Users\Jonathan\Documents\StreetFrenchLessons\

In UNIX systems this is almost the same, except that the separator is / rather than \. UNIX also does not treat separate disks the way Windows does. You will not have C:\ or D:\ drives. Instead, each drive is linked to a folder.
There are two special "folders" in a UNIX environment: the "dot" . and the "dot-dot" .. folders. The "dot" folder refers to "here", meaning the folder you are in right now. The "dot-dot" folder refers to the folder one level above the one you are in right now. These exist to simplify the way you write paths, so that you can use a shorter notation and save time.

The folder where the materials for this tutorial reside is /home/solexa/teaching. To go to this directory, you will have to use a command that changes directories.

This gets us to commands. Commands are programs that perform some function. They usually take some additional information from the user, called parameters or options. A parameter can be, for example, a path to a file; an option (switch) written as - and a letter (most commands recognize -h or --help to give you quick help); or an option written as - and a letter that requires an additional argument, such as a path to a file or a number (for example, -t [number] is used by many commands to set the number of processors to use).

To run a command, type the command name followed by any parameters, separated by spaces. For example, to change directory as mentioned above, you would run a simple command called cd and give it the target directory as a parameter, like this: cd /home/solexa/teaching.

A very brief list of commands you might need:

Command	Role	Example
pwd	prints your current directory	`pwd`
cd	change directory	`cd /home/student/rna-seq-tutorial/`
ls	list files in directory	`ls`
cp	copy a file from source to target	`cp sourcefile targetfile`
rm	delete file (use with caution, there is no trash-bin)	`rm file`
cal	display calendar	`cal`
date	display date	`date`
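
To see how cd, pwd and the "dot" and "dot-dot" folders fit together, here is a small sketch you can type line by line. The folder names dot_demo and inner are made up for this illustration, and mkdir -p (which creates a folder and any missing folders above it) is used only to have something to walk around in. It assumes /tmp exists and is writable, which is the case on most Linux systems.

```shell
cd /tmp                    # absolute path: starts with /
mkdir -p dot_demo/inner    # create a folder with another folder inside it
cd dot_demo/inner          # relative path: starts from where you are now
pwd                        # /tmp/dot_demo/inner
cd ..                      # ".." takes you one folder up
pwd                        # /tmp/dot_demo
cd ./inner                 # "." means "here", so this is the same as: cd inner
pwd                        # /tmp/dot_demo/inner
```

When you are done, you can leave the scratch folder in /tmp or delete it; nothing else depends on it.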

A few tricky things about the terminal:

Shortcut	Effect
`ctrl+c`	KILLS your running process (super effective)
`ctrl+shift+c`	copies from terminal
`ctrl+shift+v`	pastes into terminal
`ctrl+z`	stops/pauses process
`tab`	tab completion (if you are typing, you are doing something wrong)
`up arrow`	cycles back through previous commands
`ctrl+r`	start searching command history
`ctrl+l`	clears your terminal window

Examples and practical part

First of all, let us all get to the same place on the computer. Run the following command:

cd /home/solexa/teaching

You should see the prompt text change to something like:

solexa@orpheus[teaching]:$

The last part, in square brackets, is the last folder in your path. Now we are all in the same folder. To (hopefully) reduce confusion later on, I want each of you to create your own folder. To do this, we are going to use mkdir NAMEOFFOLDER. Please replace NAMEOFFOLDER with your first name. In my case it would be mkdir jan. Remember that Linux is case-sensitive: Jan and jan would be two different folders.
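
If you want to convince yourself about the case sensitivity, here is a minimal sketch. It works in a throwaway folder created by mktemp -d, so it does not touch the teaching folder, and it assumes a normal Linux file system (some other systems treat Jan and jan as the same name).

```shell
cd "$(mktemp -d)"   # jump into a fresh, empty scratch folder

mkdir jan           # lower-case name
mkdir Jan           # a different name as far as Linux is concerned

ls                  # lists two separate folders
```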

Parameters and options

Since commands perform various tasks, the amount of input they need also varies a lot: from very simple commands that require nothing from the user (for example cal) to commands that take whole lines of options (pretty much any bioinformatics analysis tool :) ). To accommodate this complexity, there are several different ways to supply parameters. Let us see some examples.

ls will list the contents of your current working directory, but it will not list any hidden files. If you want it to do that, you need to give it the option -a, like this: ls -a. This will also list files and folders whose names start with a dot (that is how you tell Linux to make a file hidden).

If you are in /home/solexa/teaching, then along with all the folders that you and your colleagues created there should also be a secret file. To view this file, you can run cat SECRETFILENAME, replacing SECRETFILENAME with, you guessed it, the name of the file (don't forget the dot at the beginning, as it is part of the filename). This command illustrates another kind of input: this time we pointed the command to a file.

You can imagine that with only one-letter options you would run out of letters quite fast, so by convention you can also use whole words with a -- prefix as options. For ls -a there is the long alternative ls --all. This option is a switch: it is either on or off. Some switch options can be chained together after a single dash: ls -hal is exactly the same as ls -h -a -l.

In some cases options require more verbose input from the user. One example, sticking with ls, is changing the time display: ls -hal --time-style=locale. Here the user can choose between different styles by typing one of these: full-iso, long-iso, iso, locale.
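
Here is a small sketch of the different option styles side by side. The file names visible.txt and .hidden.txt are made up, everything happens in a scratch folder from mktemp -d, and the long option --all assumes GNU ls, which is what Linux systems normally ship.

```shell
cd "$(mktemp -d)"               # fresh, empty scratch folder
touch visible.txt .hidden.txt   # create two empty files, one of them hidden

ls            # shows only visible.txt
ls -a         # also shows .hidden.txt, plus the special . and .. folders
ls --all      # long form of -a, same result
ls -hal       # three switches chained after one dash...
ls -h -a -l   # ...which is the same as giving them one by one
```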

Saving output

To understand how to save the output of our commands, we have to understand how the terminal handles input and output. Imagine that the terminal has 3 open files at its disposal when you run it:

0 - Standard input
1 - Standard output
2 - Standard error

File number 0 corresponds to what you type on the keyboard. File 1 corresponds to what is displayed in the terminal window on your screen. File 2 is also displayed on your screen, but its existence allows you to separate errors from normal output.

So by default, when you type something on the keyboard, it is recorded in file 0. As soon as you hit Enter, whatever you typed is processed by the terminal, and the output of that processing is written to file 1, which is then displayed on your screen. If your program runs into an error, the error message is written to file 2 and displayed on screen as well.

Example:

cat /proc/cpuinfo

The cat command is used here to read the contents of a file called /proc/cpuinfo. The content itself is not important; try to think about what is going on with the input and output.

Now what happens if we try to cat a file that does not exist?

cat /foo/bar/nonsense

It looks pretty much the same, but the output went to file 2 (standard error). We can use this behavior later on when we want to split log files to have the non-erroneous output in one file and the errors in another for easy debugging.

Now for sure there is a way to redirect the flow of the output, right? Indeed there is! The magical symbol is the greater-than sign >.

The syntax of this is:

cat /proc/cpuinfo > savedOutput.txt

We can now view the file using for example:

less savedOutput.txt

What happens is that you run your command, and what would normally go to file 1 (and be displayed on screen) is sent into a file instead, which we decided to name savedOutput.txt. You can name the file whatever you want and are not limited to savedOutput.txt :) If the file does not exist, it is created. If it does exist, its old contents are thrown away and the output is written into the now-empty file.

In case you just want to append to a file, you can use the >> operator instead of >.
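
The difference between > and >> is easiest to see on a tiny file. The sketch below also shows 2>, which is not covered in the text above: it is the standard bash way to redirect file 2 (standard error), and it is what lets you split normal output and errors into separate files. The file names log.txt, output.txt and errors.txt are made up for this example.

```shell
cd "$(mktemp -d)"          # fresh, empty scratch folder

echo "first"  > log.txt    # creates log.txt and writes "first" into it
echo "second" > log.txt    # overwrites: log.txt now holds only "second"
echo "third"  >> log.txt   # appends: log.txt now holds "second" and "third"
cat log.txt

# One file exists, one does not, so ls produces normal output AND an error.
# > catches file 1 (standard output), 2> catches file 2 (standard error).
# "|| true" only keeps a script going even though ls reports a failure.
ls log.txt /foo/bar/nonsense > output.txt 2> errors.txt || true

cat output.txt             # the normal output: log.txt
cat errors.txt             # the complaint about /foo/bar/nonsense
```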

Pipes

Now that we have some idea about how to save the output from the screen into a file, let us talk about pipes. Pipes allow you to redirect the output not to a file or the screen but to another program. The operator we use for this is | (found above the backslash \ key on most keyboards). What this operator does is feed the output from the program on the left as input to the program on the right.

Imagine that in the previous example we want to see only the first five lines of the file cpuinfo:

cat /proc/cpuinfo | head -n 5

We are displaying the contents of the file and passing them to a program called head with the parameters -n 5. head displays the first N lines of its input (the default is 10).

This might not seem like much, but try to imagine the possibilities. You can string together an almost limitless number of programs to parse, transform and otherwise process your data.
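
A good habit is to build a pipe one step at a time and look at the output after each step. Here is a minimal sketch of that; the three fruit names are made-up input produced by printf, standing in for the output of a real program.

```shell
# Step 1: three made-up lines stand in for the output of a real program.
printf 'pear\napple\nfig\n'

# Step 2: feed those lines to sort, which orders them alphabetically.
printf 'pear\napple\nfig\n' | sort

# Step 3: feed the sorted lines to head, which keeps only the first two.
printf 'pear\napple\nfig\n' | sort | head -n 2
# prints:
# apple
# fig
```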

One real-life example you might know: listing undetermined barcodes from a fastq.gz file. Those of you who remember the command will have it a bit easier, but let us see. Here are the commands you can use (in scrambled order :) ) to build your pipeline. Feel free to use Google.

tail
zcat
cut
uniq
sort
awk
head

A hint: You need to open a file, select only some lines, select only a certain part of each line (think columns) and then somehow count the occurrences of barcodes. The barcode is a part of the header for each read. Work step by step and add on to the pipe.

Wildcards (adapted from here )

Wildcards are a set of building blocks that allow you to create a pattern defining a set of files or directories. As you may remember, whenever we refer to a file or directory on the command line we are actually referring to a path. Whenever we refer to a path, we may also use wildcards in that path to turn it into a set of files or directories.

Here is the basic set of wildcards:

* represents zero or more characters
? represents exactly one character
[] represents one character out of the set or range listed inside the brackets

This can make your life much easier if you want to do something with more than one file or directory at the same time. Let us practice the basics now. Please go to the /home/solexa/teaching/wildcards folder using the cd command. Then you can list the contents of the folder with ls. Imagine you want information only on the files starting with the letter "a". You can do that easily with:

ls a*

Now how do you list only files that have "e" as the third letter?

ls ??e*

Or only files that start with "a" or "h"?

ls [ah]*
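
If you want to try the same patterns somewhere safe, here is a sketch that works on a handful of made-up, empty files in a scratch folder (the file names are hypothetical and unrelated to the files in the wildcards folder). The last line shows a range inside the brackets.

```shell
cd "$(mktemp -d)"   # fresh, empty scratch folder
touch apple.txt avocado.txt bread.txt cheese.txt ham.txt

ls a*        # apple.txt avocado.txt
ls ??e*      # bread.txt cheese.txt  (third letter is "e")
ls [ah]*     # apple.txt avocado.txt ham.txt
ls [a-c]*    # a range: everything starting with a, b or c
```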

Now how would you select all the ingredients for a Hawaii toast from the files? Namely: ananas, cheese, ham and toast_bread? You can list the files again and look for a pattern :)

Wildcards are very useful in day-to-day life. Imagine you want to remove a set of folders before re-running the pipeline. You can simply type:

rm -rf *ligned_lane? log_pipeline

and all those pesky folders are gone forever. Just be careful :)
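
One way to be careful is to do a dry run first: put echo in front of the command, and the shell will only print what the pattern expands to without deleting anything. Below is a sketch of this habit in a scratch folder; the folder names, including keep_me, are made up for the illustration.

```shell
cd "$(mktemp -d)"   # fresh, empty scratch folder
mkdir Aligned_lane1 Aligned_lane2 log_pipeline keep_me

# Dry run: echo prints the expanded command, nothing is deleted.
echo rm -rf *ligned_lane? log_pipeline

# Only when the printed list looks right, run the real thing.
rm -rf *ligned_lane? log_pipeline

ls                  # only keep_me is left
```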

Variables

A variable is a place in the memory of the computer that can hold some sort of information. You can assign a value to a variable of your choosing. The way we do it in bash is (note that there must be no spaces around the = sign):

NAME=Jan

After you hit Enter, it looks like not much has happened, but the system now remembers a variable called "NAME" that holds the value "Jan".

You can now use this variable in your commands:

echo "Hello, my name is $NAME"

Notice the "$". This tells the command line that the word that follows is the name of a variable.

You can also assign a variable from the output of a command!

NAME=$(echo "Jan")
echo "Hello, my name is $NAME"
NAME=$(echo $NAME | sed 's/a/onatha/')
echo "Hello, my name is $NAME"

Again a little exercise :) There are some files in the folder "variables". I would like you to make your command line print a nice sentence, telling us how many files in the folder are food. You need to use a variable in the sentence and you need to get the number by some command, not just write it.

Hints: wildcards, listing of folder content, piping, number of lines.

Variables are also very useful in daily work, mainly for larger scripts like our pipeline. There we define a lot of variables, such as the run folder or the location of the software, so that we do not have to retype these values throughout the script.
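
As a rough sketch of that idea (this is not the actual pipeline, and the names RUN_DIR and LOG_FILE are made up), a script might define its locations once at the top and then only refer to the variables. A scratch folder from mktemp -d stands in for a real run folder here.

```shell
# Defined once, at the top of the script.
RUN_DIR=$(mktemp -d)               # stands in for a real run folder
LOG_FILE="$RUN_DIR/pipeline.log"   # built from the variable above

# Further down, only the variables are used, never the full paths.
echo "Pipeline started"    >  "$LOG_FILE"
echo "Working in $RUN_DIR" >> "$LOG_FILE"
cat "$LOG_FILE"
```

If the run folder changes, only the first line has to be edited. The double quotes around the variables keep paths that contain spaces in one piece.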

Loops

Now for the juicy stuff that some of you already use, but maybe without a full understanding of how it actually works.

Imagine you have a very repetitive task, like running a program for each file in a folder. And there are a few hundred files. And the task itself cannot be solved by a simple wildcard. You could do this manually, but surely there is a better way.

Enter the for loop!

The basic pseudo-syntax is:

for variable in <list>
do
<commands>
done

or on one line:

for variable in <list>; do <commands>; done

Let's see an example combining what we have learned so far. In the folder "loops" run:

for F in cars*; do echo $F; done

This will just list the files in the same way that ls -1 would.

You can also loop through a variable!

NAMES='Michal Davide Jonathan Tobi Jürgen Ferris Jan'
for NAME in $NAMES; do echo "$NAME sits in the \"boys room\". "; done

Could you write a for loop that would rename all these files so that they have the prefix "cars_"? I want you to loop through a variable, and to set this variable from the output of a program. Renaming in Linux is done with the mv <from> <to> command. For example mv ananas.txt food_ananas.txt.

As a final part we can take a look at the renaming loop that can be used for HiSeq run Aligned_lane folders.

for i in {1..8}
do
cd Aligned_lane$i && rename FCID SE $i ../SampleSheetOriginal.csv && cd ..
done
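
Two pieces of this loop have not been explained yet: {1..8} is bash brace expansion and simply produces the numbers 1 to 8, and && runs the next command only if the previous one succeeded. Here is a harmless sketch of the same structure that you can run anywhere. It needs bash, uses three made-up lane folders in a scratch folder instead of eight real ones, and replaces the rename step with an echo.

```shell
cd "$(mktemp -d)"   # fresh, empty scratch folder
mkdir Aligned_lane1 Aligned_lane2 Aligned_lane3

echo {1..3}         # brace expansion: prints 1 2 3

for i in {1..3}
do
    # Go into the lane folder, report where we are, then step back out.
    # If the cd fails, the commands after && are skipped.
    cd Aligned_lane$i && echo "now in $(basename "$PWD")" && cd ..
done
```

Because every pass ends with cd .., the loop always starts the next lane from the same place.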

Introduction