In this tutorial, we are going to be using C# to scrape Craigslist to look for web development gigs. We will be using .NET Core with ScrapySharp. To get started, you will need to install the .NET Core SDK, Visual Studio Code, and the C# extension for Visual Studio Code. We will be creating a console app, but you can take a similar approach to any other type of project.
Creating a Console App
The first thing we’ll need to do is create a folder for this project. You can put it wherever you’d like and name it anything you want. For this project, I created a folder called “Scraper” and placed it on my desktop.
Once you have that created, open that folder in VS Code. .NET Core comes with a command-line interface we will use to create our console app. In VS Code, open your terminal (Terminal → new Terminal) and run this command:
dotnet new console
After this command runs, you should have the following files in your folder: Program.cs, a project file named after the folder (Scraper.csproj in my case), and an obj folder.
The main file we will be working out of is Program.cs and it should look like this:
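The template generated by dotnet new console should be close to this (the namespace will match your folder name, Scraper in my case):

```csharp
using System;

namespace Scraper
{
    class Program
    {
        static void Main(string[] args)
        {
            Console.WriteLine("Hello World!");
        }
    }
}
```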
Configuring Debug Settings
To debug the application, we need to configure a launch.json file. To do this, click on the debugger icon in VS Code. Then click on “create launch.json file”.
After you click on “create launch.json file”, you should be presented with a dropdown list. In that list, select .NET Core. This will create a configuration in your launch.json file. Before we continue, we’re going to change one property so we can capture input. Change the console property from internalConsole to externalTerminal.
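After the change, your launch.json should look roughly like this (the program path depends on your project name and the target framework version your SDK installed, so yours may differ):

```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": ".NET Core Launch (console)",
            "type": "coreclr",
            "request": "launch",
            "preLaunchTask": "build",
            "program": "${workspaceFolder}/bin/Debug/netcoreapp3.1/Scraper.dll",
            "args": [],
            "cwd": "${workspaceFolder}",
            "stopAtEntry": false,
            "console": "externalTerminal"
        }
    ]
}
```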
To test that it is working, click on the play button in the debug window:
This will launch your terminal on Mac or command prompt on Windows.
Adding ScrapySharp
ScrapySharp is a NuGet package we are going to add that will make parsing HTML documents much easier. To add the package, open the terminal and run the following command:
dotnet add package ScrapySharp
Now that we have ScrapySharp added as a NuGet package, we need to import it into our Program.cs file. At the top of the file, under “using System;”, add the following:
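These are the namespaces we’ll need for the scraping code. ScrapySharp’s CSS-selector extensions live in ScrapySharp.Extensions, the browser lives in ScrapySharp.Network, and the HtmlNode type comes from HtmlAgilityPack, which ScrapySharp pulls in as a dependency:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;
```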
Scraping Craigslist For Developer Gigs
You can create a scraper for just about anything, but for our purposes, we will be creating one to scrape Craigslist for developer gigs, specifically developer gigs in New York City. To get our initial URL, I went to Craigslist, found New York City, and clicked on computer gigs. This yielded the following URL: https://newyork.craigslist.org/d/computer-gigs/search/cpg. This will be our starting point for the scraper.
Scraping the Initial URL
The first thing we want to do is get all the links on the main gigs page. To do this, we will use ScrapySharp’s ScrapingBrowser. This mimics a real browser navigating to a web page. We’ll keep it as a global variable since we will be reusing it in different functions we create. Right above static void Main(string[] args), add the following line:
static ScrapingBrowser _browser = new ScrapingBrowser();
Now we’ll create another method right below static void Main(string[] args) that returns the HTML of a particular web page.
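A minimal version of that method, using ScrapingBrowser’s NavigateToPage (which issues the request and parses the response into an HtmlAgilityPack node for us):

```csharp
static HtmlNode GetHtml(string url)
{
    // NavigateToPage downloads and parses the page
    WebPage webPage = _browser.NavigateToPage(new Uri(url));
    return webPage.Html;
}
```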
Then we’ll create another method, above our GetHtml method and below Main, to grab all the links off the main page of our Craigslist URL. This method takes in a URL, gets the HTML, grabs all the links on the page, checks whether they are links to gigs (they end in .html), and puts them into a List.
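Here’s a sketch of that method (I’m calling it GetMainPageLinks; the name is up to you):

```csharp
static List<string> GetMainPageLinks(string url)
{
    var mainPageLinks = new List<string>();
    var html = GetHtml(url);

    // CssSelect comes from ScrapySharp.Extensions
    var links = html.CssSelect("a");

    foreach (var link in links)
    {
        var href = link.GetAttributeValue("href", string.Empty);

        // Individual gig listings end in .html
        if (href.EndsWith(".html") && !mainPageLinks.Contains(href))
        {
            mainPageLinks.Add(href);
        }
    }

    return mainPageLinks;
}
```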
Now that we’ve created this method, let's call it in our Main method and pass our main gig page URL.
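In Main, that call looks like this:

```csharp
static void Main(string[] args)
{
    var gigLinks = GetMainPageLinks("https://newyork.craigslist.org/d/computer-gigs/search/cpg");
}
```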
Scraping the Gig Details
Now that we have the links to gigs, let's create a method that parses the page and pulls out the title of the post and description. At the bottom of the file, let's create a new class to act as our model to hold the page details.
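A simple model with the fields we care about:

```csharp
public class PageDetails
{
    public string Title { get; set; }
    public string Description { get; set; }
    public string Url { get; set; }
}
```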
We can then create the method to scrape the details from the page. Be sure to watch the video below this article to see how I determined what path to pass to SelectSingleNode.
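Here’s a sketch of the method. The two XPath expressions are what I found on Craigslist posting pages (the title in a span with id titletextonly, the body in a section with id postingbody), but verify them yourself in your browser’s dev tools since the markup can change:

```csharp
static List<PageDetails> GetPageDetails(List<string> urls)
{
    var pageDetails = new List<PageDetails>();

    foreach (var url in urls)
    {
        var html = GetHtml(url);

        var title = html.SelectSingleNode("//span[@id='titletextonly']");
        var description = html.SelectSingleNode("//section[@id='postingbody']");

        // Skip pages that don't have the elements we expect
        if (title != null && description != null)
        {
            pageDetails.Add(new PageDetails
            {
                Title = title.InnerText,
                Description = description.InnerText,
                Url = url
            });
        }
    }

    return pageDetails;
}
```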
Now we can call our GetPageDetails method inside of Main.
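Main now looks like this:

```csharp
static void Main(string[] args)
{
    var gigLinks = GetMainPageLinks("https://newyork.craigslist.org/d/computer-gigs/search/cpg");
    var gigs = GetPageDetails(gigLinks);
}
```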
At this point, there may be a lot of gigs being scraped, especially in the New York area. You can also see that there are a lot of spammy gigs on Craigslist. So that we don’t waste our time on these, let’s add a search filter.
In our Main method, let’s allow the user to input a search term with Console.ReadLine(). Now when our program runs, we enter the search term and hit enter. Then, we’ll use that search term in our GetPageDetails method to determine whether to add a listing to the list of PageDetails we return.
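With the search term wired through, Main and GetPageDetails look roughly like this (I’m doing a simple case-insensitive Contains match; you could get fancier):

```csharp
static void Main(string[] args)
{
    Console.WriteLine("Enter a search term:");
    var searchTerm = Console.ReadLine();

    var gigLinks = GetMainPageLinks("https://newyork.craigslist.org/d/computer-gigs/search/cpg");
    var gigs = GetPageDetails(gigLinks, searchTerm);
}

// GetPageDetails now takes the search term and only keeps matches
static List<PageDetails> GetPageDetails(List<string> urls, string searchTerm)
{
    var pageDetails = new List<PageDetails>();

    foreach (var url in urls)
    {
        var html = GetHtml(url);
        var title = html.SelectSingleNode("//span[@id='titletextonly']");
        var description = html.SelectSingleNode("//section[@id='postingbody']");

        // Case-insensitive match against the title or the description
        if (title != null && description != null &&
            (title.InnerText.ToLower().Contains(searchTerm.ToLower()) ||
             description.InnerText.ToLower().Contains(searchTerm.ToLower())))
        {
            pageDetails.Add(new PageDetails
            {
                Title = title.InnerText,
                Description = description.InnerText,
                Url = url
            });
        }
    }

    return pageDetails;
}
```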
Writing Gigs to CSV
At this point, we’ve set up our scraper and we only return relevant results. So, now we just need to put those into a usable format. For this, we’re going to be using another NuGet package called CsvHelper (https://www.nuget.org/packages/CsvHelper).
In the terminal, run this command:
dotnet add package CsvHelper
After we have the NuGet package installed, at the top of the file under our existing dependencies, import the dependencies we’ll need for this.
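CsvHelper’s CsvWriter needs a TextWriter and a CultureInfo, so we pull in these three namespaces:

```csharp
using System.Globalization;
using System.IO;
using CsvHelper;
```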
Now, let's write the method that will create our CSV file and save it to our computer.
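Here’s a sketch of that method (the file-name format is my own choice; anything unique per run works):

```csharp
static void WriteToCsv(List<PageDetails> pageDetails, string searchTerm)
{
    // A timestamp in the name keeps repeated runs from overwriting each other
    using (var writer = new StreamWriter($"{searchTerm}-gigs-{DateTime.Now.ToString("yyyyMMdd-HHmmss")}.csv"))
    using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
    {
        // Writes a header row plus one row per PageDetails
        csv.WriteRecords(pageDetails);
    }
}
```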
In the method above, we’ve passed our list of page details and search term, used some string interpolation on the file name to make it unique, and saved a CSV with our results. The last step is to use it in our Main method.
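At the end of Main, after we’ve collected the filtered gigs:

```csharp
WriteToCsv(gigs, searchTerm);
```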
That’s it! You can now scrape gigs from any Craigslist page you’d like. Your Program.cs final file should look something like this:
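Putting all the pieces together, here is roughly what the finished file looks like (selector paths and file naming as discussed above):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using CsvHelper;
using HtmlAgilityPack;
using ScrapySharp.Extensions;
using ScrapySharp.Network;

namespace Scraper
{
    class Program
    {
        static ScrapingBrowser _browser = new ScrapingBrowser();

        static void Main(string[] args)
        {
            Console.WriteLine("Enter a search term:");
            var searchTerm = Console.ReadLine();

            var gigLinks = GetMainPageLinks("https://newyork.craigslist.org/d/computer-gigs/search/cpg");
            var gigs = GetPageDetails(gigLinks, searchTerm);

            WriteToCsv(gigs, searchTerm);
        }

        static List<string> GetMainPageLinks(string url)
        {
            var mainPageLinks = new List<string>();
            var html = GetHtml(url);

            foreach (var link in html.CssSelect("a"))
            {
                var href = link.GetAttributeValue("href", string.Empty);
                if (href.EndsWith(".html") && !mainPageLinks.Contains(href))
                {
                    mainPageLinks.Add(href);
                }
            }

            return mainPageLinks;
        }

        static List<PageDetails> GetPageDetails(List<string> urls, string searchTerm)
        {
            var pageDetails = new List<PageDetails>();

            foreach (var url in urls)
            {
                var html = GetHtml(url);
                var title = html.SelectSingleNode("//span[@id='titletextonly']");
                var description = html.SelectSingleNode("//section[@id='postingbody']");

                if (title != null && description != null &&
                    (title.InnerText.ToLower().Contains(searchTerm.ToLower()) ||
                     description.InnerText.ToLower().Contains(searchTerm.ToLower())))
                {
                    pageDetails.Add(new PageDetails
                    {
                        Title = title.InnerText,
                        Description = description.InnerText,
                        Url = url
                    });
                }
            }

            return pageDetails;
        }

        static HtmlNode GetHtml(string url)
        {
            return _browser.NavigateToPage(new Uri(url)).Html;
        }

        static void WriteToCsv(List<PageDetails> pageDetails, string searchTerm)
        {
            using (var writer = new StreamWriter($"{searchTerm}-gigs-{DateTime.Now.ToString("yyyyMMdd-HHmmss")}.csv"))
            using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
            {
                csv.WriteRecords(pageDetails);
            }
        }
    }

    public class PageDetails
    {
        public string Title { get; set; }
        public string Description { get; set; }
        public string Url { get; set; }
    }
}
```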
You can view the GitHub repo for this code here: https://github.com/TheDiligentDev/CraigslistScraper