
How to extract and sort columns from log files on Linux – CloudSavvy IT



  Sort a log file.

Sorting a log file by a specific column is useful for quickly finding information. Logs are usually stored as plain text, so you can use command-line text manipulation tools to process them and view them in a more readable way.

Extracting Columns with cut and awk

The cut and awk tools are two different ways to extract a column of information from text files. Both assume that the columns in your log files are separated by spaces, for example:

  column column column 

This presents a problem if the data in the columns contains spaces, such as dates ("Wed, June 12"). While cut may see this as three separate columns, you can still extract all three at once, provided the structure of your log file is consistent.

cut is very easy to use:

  cat system.log | cut -d ' ' -f 1-6 

The cat command reads the contents of system.log and pipes it to cut . The -d flag specifies the delimiter, in this case a space. (The default is a tab.) The -f flag specifies which fields to output. This command will print the first six columns of system.log . If you only wanted to print the third column, you would use the -f 3 flag.
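
For example, a minimal sketch of that single-column case, assuming the same space-separated system.log as above:

  # Print only the third space-separated column of system.log
  cat system.log | cut -d ' ' -f 3 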

awk is more powerful but not as concise. cut is useful for extracting columns, such as when you want to retrieve a list of IP addresses from your Apache logs. awk can rearrange whole lines, which can be useful for sorting an entire document by a specific column. awk is a complete programming language, but you can use a simple command to print columns:

  cat system.log | awk '{print $1, $2}' 

awk executes your command for each line in the file. By default, it splits each line on whitespace and stores each column in the variables $1 , $2 , $3 , and so on. Using the command print $1 , you can print the first column, but there is no easy way to print a range of columns without using loops.
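
As a rough sketch of that loop approach, the following prints fields one through six, mirroring the earlier cut -f 1-6 example (the field count is just an assumption here):

  cat system.log | awk '{for (i = 1; i <= 6; i++) printf "%s ", $i; print ""}' 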

An advantage of awk is that the command can refer to the entire line at once. The line's contents are stored in the variable $0 , which you can use to print the whole line. For example, you could print the third column before printing the rest of the line:

  awk '{print $3 " " $0}' 

"" prints a space between $ 3 and $ 0 . This command repeats column three twice, but you can resolve it by setting the $ 3 variable to null:

  awk '{printf $3; $3=""; print " " $0}' 

The printf command does not print a newline. Similarly, you can exclude specific columns from the output by setting them all to empty strings before printing $0 :

  awk '{$1=$2=$3=""; print $0}' 

You can do much more with awk , including regex matching, but the out-of-the-box column extraction works well for this use case.
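
As a quick sketch of that regex matching, the following prints only lines whose third column starts with "ERROR" (the column number and pattern here are just placeholders for illustration):

  cat system.log | awk '$3 ~ /^ERROR/ {print $0}' 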

Sorting Columns with sort and uniq

The sort command can be used to order a list of data based on a specific column. The syntax is:

  sort -k 1 

where the -k flag denotes the column number. You pipe input into this command, and it spits out an ordered list. By default, sort uses alphabetical order, but it supports more options through flags, for example -n for numeric sorting, -h for human-readable sizes (1M > 1K), -M for sorting month abbreviations, and -V for sorting file version numbers (file-1.2.3 > file-1.2.1).
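
As an illustration, here is a sketch of a few of those flags applied to a hypothetical second column (the file names are placeholders):

  sort -k 2 -n access.log      # numeric sort on column two
  sort -k 2 -h disk-usage.log  # human-readable sizes (1K, 1M, 1G)
  sort -k 2 -V versions.log    # version-number ordering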

The uniq command filters out duplicate lines and leaves only unique ones. It only works on adjacent lines (for performance reasons), so you always have to use it after sort to remove duplicates throughout the file. The syntax is simply:

  sort -k 1 | uniq 

If you just want to list the duplicates, use the -d flag.
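
For example, a minimal sketch that lists only the lines that occur more than once in system.log:

  sort -k 1 system.log | uniq -d 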

uniq can also count the number of duplicates with the -c flag, which makes it very good for tracking frequency. For example, if you want a list of the top IP addresses hitting your Apache server, you can run the following command on your access.log :

  cat access.log | cut -d ' ' -f 1 | sort | uniq -c | sort -nr | head 

This string of commands will cut out the IP address column, sort it so duplicates are adjacent, collapse the duplicates while counting each occurrence, then sort based on the count column in descending numeric order, giving you a list that looks like:

  21 192.168.1.1
  12 10.0.0.1
  5 1.1.1.1
  2 8.0.0.8

You can apply the same techniques to your log files, using other tools like awk and sed to extract useful information. These chained commands are long, but you do not need to type them each time, because you can always store them in a bash script or alias them through your ~/.bashrc .
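
For example, a sketch of such an alias (the name topips and the log path are just assumptions; point it at wherever your access log actually lives):

  # In ~/.bashrc: show the top client IPs in the Apache access log
  alias topips="cut -d ' ' -f 1 /var/log/apache2/access.log | sort | uniq -c | sort -nr | head"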

Filtering Data with grep and awk

grep is a very simple command; you give it a search term and pass it some input, and it will spit out every line containing that term. For example, if you wanted to search your Apache access log for 404 errors, you could do:

  cat access.log | grep "404" 

which would spit out a list of log entries matching the given text.

grep , however, cannot restrict the search to a specific column, so this command will also match lines that happen to contain the text "404" elsewhere. If you only want to search the HTTP status code column, you must use awk :

  cat access.log | awk '{if ($9 == "404") print $0;}' 

With awk , you also have the advantage of being able to perform negative searches. For example, you can search for all log entries that did not return status code 200 (OK):

  cat access.log | awk '{if ($9 != "200") print $0;}' 

and you have access to all of the programmatic features awk provides.
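
As one sketch of those programmatic features, awk's associative arrays can count 404 responses per client IP in a single pass (this assumes, as above, that the IP is in column one and the status code in column nine):

  cat access.log | awk '$9 == "404" {count[$1]++} END {for (ip in count) print count[ip], ip}' | sort -nr 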

GUI Options for Web Logs

 Monitor a web server's access log in real time.

GoAccess is a CLI tool for monitoring your web server's access log in real time, and it can sort by any useful field. It runs entirely in your terminal, so you can use it over SSH, but it also has a much more intuitive web interface.
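
If you want to try it, a typical invocation looks something like the following sketch (exact flags depend on your GoAccess version and log format; COMBINED is assumed here to match the standard Apache combined log):

  # Interactive terminal dashboard
  goaccess access.log --log-format=COMBINED
  # Or generate a static HTML report instead
  goaccess access.log --log-format=COMBINED -o report.html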

apachetop is another Apache-specific tool that can be used to filter and sort by columns in your access log. It runs in real time, directly on your access.log.

