
Here's how to scrape a list of topics from a subreddit using Bash




Reddit offers JSON feeds for each subreddit. Here's how to create a Bash script that fetches and parses a list of posts from any subreddit you like. This is just one of the things you can do with Reddit's JSON feeds.

Install curl and jq

We'll use curl to fetch the JSON feed from Reddit and jq to parse the JSON data and extract the fields we want from the results. Install these two dependencies using apt-get on Ubuntu and other Debian-based Linux distributions. On other Linux distributions, use your distribution's package management tool instead.

  sudo apt-get install curl jq 
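If you want to confirm both tools are available before continuing, each one can report its version (a quick sanity check; the exact version numbers will differ on your system):

  curl --version
  jq --version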

Download some JSON data from Reddit

Let's see how the data feed looks. Use curl to fetch the latest posts from the MildlyInteresting subreddit:

  curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json

Note the options used before the URL: -s forces curl to run in silent mode so that we don't see any output except the data from Reddit's servers. The next option and its parameter, -A "reddit scraper example", set a custom user agent string that helps Reddit identify the service accessing its data. The Reddit API servers apply rate limits based on the user agent string. Setting a custom value means Reddit will track our rate limit separately from other callers, reducing the chance that we get an HTTP 429 Too Many Requests error.
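If you'd like to study the raw response before parsing it, curl's -o option can save it to a file first (a minimal sketch; the output filename here is just an example):

  curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json -o mildlyinteresting.json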

The output should fill the terminal window and look something like this:

 Scraping a subreddit from Bash

There are lots of fields in the output data, but all we're interested in are Title, Permalink, and URL. You can see an exhaustive list of types and their fields on Reddit's API documentation page: https://github.com/reddit-archive/reddit/wiki/JSON

Extracting data from the JSON output

We want to extract Title, Permalink, and URL from the output data and save them to a tab-delimited file. We could use text-processing tools like sed and grep, but we have another tool at our disposal that understands JSON data structures, called jq. For our first attempt, let's use it to pretty-print and color-code the output. We'll use the same call as before, but this time pipe the output through jq and instruct it to parse and print the JSON data.

  curl -s -A "reddit scraper example" https: //www.reddit.com/r/MildlyInteresting.json | jq. 

Note the period that follows the jq command. This expression simply parses the input and prints it as-is. The output is neatly formatted and color-coded:

 Extracting data from a subreddit's JSON in Bash

Let's examine the structure of the JSON data we get back from Reddit. The root result is an object that contains two properties: kind and data. The latter holds a property called children, which contains an array of posts to this subreddit.
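Abbreviated, that shape looks something like this (a hand-trimmed sketch; the field values are placeholders, and real responses carry many more fields per post):

  {
    "kind": "Listing",
    "data": {
      "children": [
        {
          "kind": "t3",
          "data": {
            "title": "An example post title",
            "url": "https://i.redd.it/example.jpg",
            "permalink": "/r/MildlyInteresting/comments/abc123/an_example_post_title/"
          }
        }
      ]
    }
  }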

Each element in that array is itself an object that also contains two fields, kind and data. The properties we want to grab are in the data object. jq expects an expression that can be applied to the input data and produces the desired output. It must describe the contents in terms of their hierarchy and array membership, as well as how the data should be transformed. Let's run the whole command again with the correct expression:

  curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json | yes & # 39; .data.children | . [] | .data.title, .data.url, .data.permalink 

The output shows the Title, URL, and Permalink, each on its own line:

 Analyze the contents of a subreddit from the Linux command line

Let's dive into the jq command we called:

  jq & # 39; .data.children | . [] | .data.title, .data.url, .data.permalink 

There are three expressions in this command, separated by two pipe symbols. The results of each expression are passed to the next for further evaluation. The first expression filters out everything except the array of Reddit listings. This output is piped into the second expression and forced into an array. The third expression acts on each element of the array and extracts three properties. More information about jq and its expression syntax can be found in jq's official manual.
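To see what each stage contributes, you can run the pipeline one expression at a time (the same feed as above; the output will vary with the subreddit's current posts):

  # Stage 1: keep only the array of posts under .data.children
  curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json | jq '.data.children'

  # Stages 1 and 2: flatten that array into a stream of individual post objects
  curl -s -A "reddit scraper example" https://www.reddit.com/r/MildlyInteresting.json | jq '.data.children | .[]'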

Put it all together in a script

Let's put the API call and the JSON post-processing together in a script that will generate a file with the posts we want. We'll add support for fetching posts from any subreddit, not just /r/MildlyInteresting.

Open your editor and copy the contents of this code into a file called scrape-reddit.sh:

  #!/bin/bash

  # Require a subreddit name as the first argument
  if [ -z "$1" ]
    then
      echo "Please specify a subreddit"
      exit 1
  fi

  SUBREDDIT=$1
  NOW=$(date +"%m_%d_%y-%H_%M")
  OUTPUT_FILE="${SUBREDDIT}_${NOW}.txt"

  # Fetch the feed, extract the three fields, and write one tab-delimited line per post
  curl -s -A "bash-scrape topics" https://www.reddit.com/r/${SUBREDDIT}.json | \
      jq '.data.children | .[] | .data.title, .data.url, .data.permalink' | \
      while read -r TITLE; do
          read -r URL
          read -r PERMALINK
          echo -e "${TITLE}\t${URL}\t${PERMALINK}" | tr --delete \" >> ${OUTPUT_FILE}
      done

This script first checks whether the user supplied a subreddit name. If not, it exits with an error message and a non-zero return code.

After that, it stores the first argument as the subreddit name and builds a date-stamped filename where the output will be saved. The action begins when curl is called with a custom header and the URL of the subreddit to scrape. The output is piped to jq, where it is parsed and reduced to three fields: Title, URL, and Permalink. These lines are read, one at a time, and saved into variables using the read command, all inside a while loop, which continues until there are no more lines to read. The last line of the inner block echoes the three fields, delimited by a tab character, and then pipes it through the tr command so that the double quotes can be stripped out. The output is then appended to a file.
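The three-reads-per-record pattern is easier to see in isolation. Here is a minimal, self-contained sketch, with printf standing in for jq's output (the three input lines are made-up placeholders):

  # Three lines in, one tab-delimited line out
  printf 'Example title\nhttps://example.com/image.jpg\n/r/example/comments/abc123/\n' |
  while read -r TITLE; do
      read -r URL
      read -r PERMALINK
      echo -e "${TITLE}\t${URL}\t${PERMALINK}"
  done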

Before we can execute this script, we must ensure that it has been granted execute permissions. Use the chmod command to apply these permissions to the file:

  chmod u+x scrape-reddit.sh

And finally, run the script with a subreddit name:

  ./scrape-reddit.sh MildlyInteresting

An output file is created in the same directory, and its contents will look something like this:

 Scraping and viewing topics from a subreddit in Bash

Each row contains the three fields we're after, separated using a tab character.
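Because the fields are tab-delimited, standard tools can slice the file further. For example, cut can pull out just the titles (the glob below assumes the timestamped filename produced by the script):

  cut -f 1 MildlyInteresting_*.txt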

Moving on

Reddit is a goldmine of interesting content and media, and it's all easily reached with its JSON API. Now that you have a way to access this data and process the results, you can do things like:

  • Grab the latest headlines from /r/WorldNews and send them to your desktop using notify-send (see the sketch after this list)
  • Integrate the best jokes from /r/DadJokes into your system's Message-Of-The-Day
  • Get today's best picture from /r/aww and make it your desktop background
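As an example, here is a hedged sketch of the first idea. It assumes notify-send (part of libnotify) is installed, and it grabs only the newest headline rather than all of them:

  # Fetch the newest /r/worldnews title and show it as a desktop notification
  HEADLINE=$(curl -s -A "bash-scrape topics" https://www.reddit.com/r/worldnews.json | jq -r '.data.children[0].data.title')
  notify-send "r/worldnews" "${HEADLINE}"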

All of this is possible using the data and the tools you have on your system. Happy hacking!



