Nov 6, 2020 | data science

Collecting and Storing Corgi Tweets: Part 1

seeking dope corgi content

It's been a long week here in North America, and for those of us who lurk on Twitter, the world has ended and been reborn at least 13 times in the past 48 hours. And sure, Twitter probably isn't the best place for an anxious brain to be spending its time these days, but this anxious brain is stubborn and prefers to lean in, particularly if it can find those crevices of Twitter that harbor Billy Joel and corgi videos. So before we dive into this post, I would like to begin today's feature with a bit of a palate cleanse.

I'm not crying! You're crying!

Now imagine you wanted to garner some insight from all of this corgi content. How could you go about setting up this pipeline of 1) collecting corgi Tweets to 2) storing corgi Tweets to 3) harnessing your corgi Tweets for corgi analysis? In this first part of this two-part series, we'll take a look at the first two steps. Here, I decided to use JavaScript and the axios library to make calls to the Twitter API, but there are many packages in many languages that can accomplish this task for you. Regardless, if hopping into Tweet collection is your prerogative, hopefully you'll find this post helpful.

Getting started with the Twitter API

Wooo boy... the Twitter API has gone through some changes since I last set up a Tweet collection project. They've actually begun migrating to version 2, which has some pretty neat features that we will barely scratch the surface of today. But whether you are using V1.1 or V2, you'll still need to set up a Twitter Developer account, which means that you will need a Twitter account. From there, it's actually pretty simple.

Once you have your Twitter account, you are going to want to go here to set up your developer account. Go ahead and click Apply for a Developer Account, and they will ask you a couple of questions about your intended use. They want you to be as detailed as possible, which includes word minimums on some of the responses they are seeking from you, so think of something a bit more informative than just "seeking dope corgi content".

Once you make it through those questions, you should be redirected to your brand new developer portal. For the sake of brevity, I'll let you click around here, but I will point out a couple of key things that you will need to do before we can think about harvesting our own corgi farm. First, you're going to want to start a new Project and, within that Project, start an App. Creating a Project allows you to start using Twitter's V2 endpoints, so I highly recommend it... because... it's the future!!! Boom! So now you have a Project and an App set up. You are sooooo close. Now all you have to do is click on that little key icon within your App (in the Overview section) to collect the authentication keys that will allow you to start making requests to Twitter's endpoints. From here, you're going to first generate a bearer token. In fact, that is all you will need to start making requests, so once it is generated, press "View Keys" to check it out. In just a moment, you are going to want to copy this key and keep it in a safe place, but we'll get there.

Token Screen

Making Axios calls to collect Tweets

Alright, you've made it past the boring stuff. Now we can venture into the actual collection of Tweets. We're going to be building a small Node.js application that is going to be responsible for making requests to the Twitter API and storing the response into a database so we can pull these Tweets at a later time. Let's take care of that first part here...

quick note...

You'll need Node in order to run this application, which you can download here. I also prefer yarn over npm as my package manager for Node, but both should work fine. If you would like to install yarn, you can do so by following these simple instructions. You can also clone my copy of this project from the repository here if that is your preferred method of following along.

Following the directory structure and naming conventions of the repository, let's go ahead and create the top level directory and name it tweet-collection and then initialize it as a Node application by running yarn init in your terminal like so...

shell
mkdir tweet-collection
cd tweet-collection
yarn init

Then from there, we have to download the packages that will make this whole process run (I promise, we'll get to the fun stuff before the end of this post 😉). The first is dotenv, a package that allows us to pull in environment variables from our machine or from an environment file in our application. The second package we'll need to grab is axios, the bread and butter, the library that makes sending requests to other APIs a bit less taxing. Then finally, we're going to need pg, which gives us a JavaScript interface for connecting to a Postgres database and sending it queries (but more on this later). Let's quickly add these to our application...

shell
yarn add dotenv axios pg

Okay, remember a few centuries ago when we generated that bearer token in our Twitter Developer environment? Well, it's time to grab that token and bring it into our application. First, we are going to create a .env file to store these variables so that we can access them in our scripts without hard-coding them into our files. You really don't want these keys to be public, so if you are going to publish to a repo, you are going to want to reference the .env file in your .gitignore. You can go ahead and create this file at the top level.

txt
# ./.env
TWITTER_BEARER_TOKEN="paste your bearer token here"

Again, if you are going to push this application to a remote repository such as GitHub or GitLab, be sure to reference this .env file in your .gitignore.
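If you don't already have a .gitignore, a couple of lines at the top level will do the trick (node_modules/ is worth ignoring too, since yarn can always regenerate it from package.json):

txt
# ./.gitignore
.env
node_modules/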

Okay, so what are we doing here? Well, essentially, the purpose of this file is to save environment variables that are specific to this project. We can now reference these local environment variables using the dotenv package when writing scripts so we don't have to store that sensitive information in the script itself. Let's start writing out our script to collect corgi Tweets to see what I'm talking about here...

js
// ./endpoint_calls/twitter_search.js
// resolve .env relative to this file so the script works from any directory
require('dotenv').config({ path: __dirname + '/../.env' })
const bearer = process.env.TWITTER_BEARER_TOKEN

Alright, we've now created a JavaScript variable called bearer and assigned it the value of the TWITTER_BEARER_TOKEN variable stored in our .env file. Let's keep things rolling.

Twitter's API is a REST API, which I am not going to get into very deeply, but essentially, we will be making requests to it using HTTP methods and receiving responses back. In our case, part of the response will include the corgi Tweets we so desperately need. Let's take a look at how we begin to build this out.
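As a quick sanity check before we write any JavaScript, you can hit the recent search endpoint directly with curl (this assumes you've exported your bearer token as TWITTER_BEARER_TOKEN in your shell):

shell
curl -H "Authorization: Bearer $TWITTER_BEARER_TOKEN" \
  "https://api.twitter.com/2/tweets/search/recent?query=corgi%20has%3Aimages"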

js
// ./endpoint_calls/twitter_search.js
// resolve .env relative to this file so the script works from any directory
require('dotenv').config({ path: __dirname + '/../.env' })
const axios = require('axios').default;
const bearer = process.env.TWITTER_BEARER_TOKEN
const headers = { headers: { Authorization: `Bearer ${bearer}` } }

axios.get('https://api.twitter.com/2/tweets/search/recent?query=(corgi has:images)&tweet.fields=id,text', headers)
  .then(res => console.log(res.data.data)); // the Tweets live in the body's data field

You will notice that we brought in the axios package above, as well as created a new variable called headers. This object is sent along with the request in order to authorize us to make this request to the Twitter API. You can see here that it is the second parameter of the axios.get method, which provides a nice segue into the first parameter, the request string itself. The first part, https://api.twitter.com, is the API domain, followed by the endpoint /2/tweets/search/recent. We are using the recent search endpoint, which returns Tweets created within the last week. And now, finally, we have arrived! This next part is where we actually tell the API that we want corgi Tweets. The ?query=(corgi has:images) tells the Twitter API that we only want Tweets that contain the word "corgi", and while we are at it, we can throw in the has:images option to ensure we only get Tweets that contain an image. If I'm searching for corgis, I also want to see a corgi... duh. And finally, we wrap it up with the response fields that we want returned to us: &tweet.fields=id,text. This tells the Twitter API that once it has filtered for corgis with images, it should send back the id of each Tweet and its text... please.
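By the way, if hand-building that query string feels error-prone, axios can assemble and URL-encode it for you via its params option. Here's the same request written that way, purely as a matter of taste:

js
// the same request, letting axios build and encode the query string for us
require('dotenv').config({ path: __dirname + '/../.env' })
const axios = require('axios').default;
const bearer = process.env.TWITTER_BEARER_TOKEN

axios.get('https://api.twitter.com/2/tweets/search/recent', {
  headers: { Authorization: `Bearer ${bearer}` },
  params: { 'query': 'corgi has:images', 'tweet.fields': 'id,text' }
})
  .then(res => console.log(res.data.data));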

This request takes a couple of seconds, and we don't want to block the rest of the script while we wait, so axios returns a promise that we can operate on once the response comes back. Thus, the .then is saying: when we get the response res, print out the data field of the response body, which is where the Tweets themselves live. That output looks like this after running node ./endpoint_calls/twitter_search.js in your terminal...

js
[
  {
    "id": "1324752352771932160",
    "text": "Researchers can apparently extract effective COVID antibody fragments from llamas!\n\nIf everyone just dresses up their dogs in llama costumes & convinces them they were merely “assigned corgi at birth”, we should be out of this pandemic in no time!\n\nhttps://t.co/JCSXSVja3y https://t.co/UQAxQuBZEE"
  },
  {
    "id": "1324751495997390850",
    "text": "Join Ms. Megan and George the Corgi for songs and rhymes featuring the letter H.\nhttps://t.co/fDku6j3R4p https://t.co/5jnynZFJgf"
  },
  {
    "id": "1324751245274435587",
    "text": "@BasoStream Ein süßer Corgi ist immer gut für die Seele 😉 https://t.co/iemcwBLQrX"
  },
  {
    "id": "1324751147933073408",
    "text": "cubone corgi https://t.co/7uhK4Zif0g"
  },
  {
    "id": "1324751089174896643",
    "text": "RT @hourlykoga: wolf corgi ☆ミ https://t.co/omfV6o0pHf"
  },
  {
    "id": "1324750731388329984",
    "text": "Got my Deck of Many Things ready in case people donate to #EXTRALIFE during our Death Zone Corgi #ttrpg #dnd5e one shot! https://t.co/302bEMPEQw"
  },
  {
    "id": "1324749989990596609",
    "text": "INTRODUCING THE CORGI COMMITTEE 2020/2021 🐶\n\nLauren Burns: Chair\nBrona Starrs: Vice Chair\nHolly Haughian: Social Media\nCeilidh Davison: Organiser\nKatherine Freeman: Organiser\nNatasha Rooney: Organiser https://t.co/2m6lCnHkHs"
  },
  {
    "id": "1324748433840840705",
    "text": "RT @retroist: Super Hero cars from Corgi. https://t.co/DyZB0OBH48"
  },
  {
    "id": "1324747894566604800",
    "text": "@corgi___yuta \nご参加ありがとうございます!\n結果は…\n\n残念、はずれです…\nキャンペーンは明日も開催しますので、ぜひまた挑戦してくださいね!\n\n▼原神は好評配信中\nhttps://t.co/aHse2rIWmD https://t.co/wdCIT5YCYT"
  },
  {
    "id": "1324747839793221633",
    "text": "@Ms5000Watts Our Corgi Cinder https://t.co/UK9oT5lIS8"
  }
]
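One more aside: if .then chains aren't your style, the same request reads nicely with async/await. A minimal equivalent sketch:

js
// the same request using async/await instead of a .then chain
require('dotenv').config({ path: __dirname + '/../.env' })
const axios = require('axios').default;
const bearer = process.env.TWITTER_BEARER_TOKEN

const getCorgiTweets = async () => {
  const res = await axios.get(
    'https://api.twitter.com/2/tweets/search/recent?query=(corgi has:images)&tweet.fields=id,text',
    { headers: { Authorization: `Bearer ${bearer}` } }
  );
  console.log(res.data.data); // the same array of Tweets shown above
};

getCorgiTweets();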

Now we're cookin'. But printing out the Tweets isn't really what we want here. Ideally, you are going to scale this up and probably collect more than 10 Tweets at a time. As it stands, the console.log doesn't let us do anything with these Tweets, so let's talk about storing them in a database so you can access them for later use.

Storing those Tweets in a Postgres Database

Okay, I feel a little bit guilty about the direction I am about to take this without really giving any background on databases, but... that's what I am about to do. I'll circle back around to the intricacies of databases at some point in the future, but for now, just know we are storing our corgi Tweets in one: specifically, a PostgreSQL database. If you already have psql installed on your computer, the following should work nicely.

If you are following along with the repository, you'll notice a folder called ./database/migrations. In there, you will find a fairly small SQL file called 01_create_tables.sql. Open it up and we can see what's inside...

sql
/* ./database/migrations/01_create_tables.sql */
DROP TABLE IF EXISTS tweets CASCADE;
CREATE TABLE tweets (
  id TEXT PRIMARY KEY,
  tweet_text TEXT
);

Now, if you open up psql and connect to the database where you will store this data (for me, it's a database called tweets. Not the most imaginative, but it gets the job done), you can run the command \i ./database/migrations/01_create_tables.sql. This will first remove any table called tweets if one already exists and then create a new tweets table with the columns id and tweet_text. These correspond to the id and text fields coming back to us in the response from the Twitter API.
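If you'd rather run the migration from your regular terminal instead of from inside psql, something like this should also do the trick (assuming Postgres is running locally and your user is allowed to create databases):

shell
createdb tweets
psql -d tweets -f ./database/migrations/01_create_tables.sql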

Now that we have a table set up in our database, we need to establish a connection to it from our application. Take a look at the file ./database/db.js...

js
// ./database/db.js
require('dotenv').config()
const { Pool } = require("pg");

// database credentials pulled from environment variables
const config = {
  user: process.env.USER,
  host: process.env.DB_HOST || 'localhost',
  database: process.env.DB_NAME || 'tweets',
  password: process.env.DB_PASS,
  port: 5432
};

// a pool manages a set of reusable client connections
const pool = new Pool(config);
module.exports = pool;

dotenv makes an appearance yet again, and as you might imagine, I've created a couple more variables in my .env file that contain my database credentials. Once I pass those to my config object, I can create a Pool, which allows me to make connections. We export it so we can pull it into other scripts.
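For reference, those extra database variables live right alongside the bearer token in the same .env file. The values below are placeholders, so swap in your own credentials:

txt
# ./.env
TWITTER_BEARER_TOKEN="paste your bearer token here"
DB_HOST="localhost"
DB_NAME="tweets"
DB_PASS="your database password"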

Okay, so we have created a developer account, made an axios call to retrieve Tweets about corgis, and set up our database. Now all we need to do is modify our ./endpoint_calls/twitter_search.js script to push our response data into the database. This will tie everything together.

js
// ./endpoint_calls/twitter_search.js
// resolve .env relative to this file so the script works from any directory
require('dotenv').config({ path: __dirname + '/../.env' })
const axios = require('axios').default;
const pool = require('../database/db');
const bearer = process.env.TWITTER_BEARER_TOKEN
const headers = { headers: { Authorization: `Bearer ${bearer}` } }
const query = 'INSERT INTO tweets(id, tweet_text) VALUES($1, $2)'

// fetch tweets from the recent search endpoint
const fetchTwitterData = (callback) => {
  axios.get('https://api.twitter.com/2/tweets/search/recent?query=(corgi has:images)', headers)
    .then(res => {
      callback(res.data);
    })
    .catch(err => console.error(`error: ${err}`))
}

// invoke fetchTwitterData and save the results to the database
fetchTwitterData((res) => {
  pool.connect((err, client, done) => {
    if (err) throw `error: ${err}`;
    try {
      console.log('🕶 Connected to the database. Grab a coffee ☕️ and call it a day!')
      // each item in res.data is a Tweet with an id and text
      res.data.forEach(row => {
        console.log(row)
        client.query(query, [row.id, row.text], (err, response) => {
          if (err) {
            console.log(err.stack)
          } else {
            console.log("inserted")
          }
        })
      })
    } finally {
      done()
    }
  })
})

Whoa! This got a lot larger since we last looked at this file. But not to worry, there are just a couple of things that needed to happen to make this work nicely. I'll point them out...

First, take a look at the function fetchTwitterData. We've gone ahead and wrapped our axios call in there so we can use a callback function to save the response to the database.

That brings me to when fetchTwitterData is invoked. You can see in this example that an anonymous function is passed that seems like it is doing a lot... and... honestly, it is. Let's take it piece by piece. First, we need to open up a connection to our database with pool.connect. Boom, easy! From there, if there is an error with the connection, we want that error to be shown to us. However, if there isn't an error, then we go ahead and execute the try block. If we get this far (in other words, if an error isn't thrown), then we are connected to the database and we can start operating on the response that the Twitter API returned to us. If you look at the response data again, you'll notice that it is an array of objects, and each of those objects represents a Tweet with an id and its text. Essentially, we want to iterate through this array and insert that data into the Postgres table that we created. The SQL for doing that is stored in the variable query, and we use the client's .query() method to pass in the Tweet data, creating a new row for each new Tweet. Finally, we close the connection and call it a day.
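One quick design note: pg's Pool also exposes a query method of its own that checks out a client, runs the query, and releases the client back to the pool for you. If you don't need to hold a single connection across queries, the insert loop above can be condensed into something like this sketch:

js
// a condensed alternative: let the pool manage clients for us
fetchTwitterData((res) => {
  res.data.forEach(row => {
    pool.query(query, [row.id, row.text])
      .then(() => console.log('inserted'))
      .catch(err => console.error(err.stack));
  });
});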

Now, if you navigate to your database and run select * from tweets;, you should see your brand new Tweets, all containing content pertaining to corgis!

What's next

We did a lot here, but there's a whole lot more that we didn't do. For one, it's kind of annoying to run node ./endpoint_calls/twitter_search.js every single time we want to collect a new batch of Tweets. Secondly, we aren't doing anything with these Tweets once we store them in the database. That is primarily what I am going to focus on in part 2 of this series: pulling the data out and playing with the results. So stay tuned. Thirdly, there's a whole world out there beyond corgis (for example, shiba inus). For real, there is a lot you can do with the Twitter API beyond just searching for text. Plus, this post only touches on one endpoint, recent search, so I highly recommend at least looking through their documentation for other ways you can pull data from Twitter.

A special thanks shout-out to Brianna Santellan for the wonderful featured photo for this post.