Writing Your Own Resume Parser


Resumes are an excellent example of unstructured data. Each resume has its own formatting style, its own blocks of information, and its own ways of presenting data, which makes resumes difficult to read programmatically. Recruiters spend a great deal of time reviewing resumes and selecting the ones that fit their openings. Tech giants like Google and Facebook receive thousands of resumes every day for various job openings, and recruiters cannot review every one of them. This is where resume screeners help: they make it easier to shortlist suitable resumes from the pool of those received.

In this blog post, we’ll learn how to write our own simple resume screener. Over the course of the post, we will extract names, phone numbers, email IDs, education, and skills from resumes.

Step One: Read the resume
- Installing pdfminer
- Installing doc2text
- Extracting text from PDF
- Extracting text from doc and docx
Step Two: Extract Names
- Installing spaCy
- Rule-Based Matching
Step Three: Extract Phone Numbers
Step Four: Extract Emails
Step Five: Extract Skills
- Installing Pandas
- Tokenization and Word Extraction
Step Six: Extract Education

Resumes do not come in a fixed file format; they can arrive as .pdf, .doc, or .docx files. So our first challenge is reading the resume and converting it to plain text. For this we can use two Python modules: pdfminer and doc2text. These modules help extract text from .pdf files and from .doc and .docx files, respectively.

pdfminer installation (on Python 3, the maintained distribution on PyPI is pdfminer.six): pip install pdfminer.six

doc2text installation:

Extracting text from PDF:
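A minimal sketch of this step, assuming the maintained pdfminer.six distribution, whose `pdfminer.high_level.extract_text` function returns the text of a PDF as a single string. The helper names `pdf_to_text` and `clean_text`, and the whitespace clean-up itself, are illustrative choices rather than part of the library.

```python
import re
from pathlib import Path


def clean_text(raw_text):
    """Collapse runs of spaces/tabs and drop the blank lines left by the PDF layout."""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in raw_text.splitlines())
    return "\n".join(line for line in lines if line)


def pdf_to_text(pdf_path):
    """Return the plain text of a PDF resume (hypothetical helper name)."""
    path = Path(pdf_path)
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"expected a .pdf file, got {path.name!r}")
    # Imported lazily so this module still loads where pdfminer.six is missing.
    from pdfminer.high_level import extract_text
    return clean_text(extract_text(str(path)))
```

Scanned resumes that contain only images will come back empty from this approach, since no OCR is involved.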

Extracting text from doc and docx:
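As a dependency-free sketch of what such a library does for .docx files: a .docx file is a ZIP archive whose main body lives in `word/document.xml`, so the standard library is enough to pull out the paragraph text. This is a stand-in written for illustration, not the module's own implementation; the function name `docx_to_text` is hypothetical, and legacy binary .doc files are not handled here.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx_path):
    """Return the body paragraphs of a .docx file joined by newlines."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is spread over one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text.strip())
    return "\n".join(paragraphs)
```

Headers, footers, and text boxes are stored in other parts of the archive and are skipped by this sketch; a dedicated library is the better choice when those matter.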

To extract names from resumes, we could make use of regular expressions, but we will use a more sophisticated tool called spaCy. spaCy is an industrial-strength natural language processing library. It comes with pre-trained models for tagging, parsing, and entity recognition. Our main aim here is to use entity recognition to extract names (after all, a name is an entity!). Without a doubt, spaCy has become my favorite tool for language processing these days. So, let’s start by installing spaCy.

Installing spaCy: pip install spacy

Next, we need to download one of spaCy’s pre-trained models, for example the small English model. For this, we need to run: python -m spacy download en_core_web_sm

Rule-Based Matching:

spaCy lets us process text with rule-based matching. We’ll use this spaCy feature to extract the first and last names from our resumes.

The matcher is given a pattern that we want to search for in our text. Here, the pattern is a simple one, based on the assumption that a person’s first and last names are usually tagged as proper nouns. So we tell spaCy to look for two consecutive tokens whose part-of-speech tag is PROPN (proper noun).
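The pattern described above can be sketched as follows, assuming spaCy v3 (where `Matcher.add` takes a key and a list of patterns) and the `en_core_web_sm` model. The function name `extract_name` and the choice to return the earliest match are illustrative assumptions.

```python
# Two consecutive tokens tagged as proper nouns, e.g. "Jane Doe".
NAME_PATTERN = [{"POS": "PROPN"}, {"POS": "PROPN"}]


def extract_name(resume_text, nlp=None):
    """Return the earliest two-token proper-noun span, or None if there is none."""
    if not resume_text.strip():
        return None
    # Imported lazily so this module still loads where spaCy is not installed.
    import spacy
    from spacy.matcher import Matcher

    # Assumed model; any English pipeline that assigns POS tags should work.
    nlp = nlp or spacy.load("en_core_web_sm")
    doc = nlp(resume_text)
    matcher = Matcher(nlp.vocab)
    matcher.add("NAME", [NAME_PATTERN])  # spaCy v3 signature: add(key, patterns)
    # Each match is a (match_id, start, end) tuple; sort by start token.
    matches = sorted(matcher(doc), key=lambda match: match[1])
    if not matches:
        return None
    _match_id, start, end = matches[0]
    return doc[start:end].text
```

This is a heuristic: the first pair of proper nouns in a resume may just as well be a heading such as "Curriculum Vitae" or a company name, so the result is worth checking before it is trusted.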

To extract phone numbers, we’ll use regular expressions. Phone numbers come in various forms, such as (+91) 1234567890, +911234567890, +91 123 456 7890, or +91 1234567890. Therefore, we need to define a generic regular expression that can match all such combinations. Thanks to this blog, I was able to extract phone numbers from the resume text by making minor adjustments.

Our phone number extraction function will be as follows:
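A minimal sketch of such a function, using only the standard `re` module. The regular expression below is written for this illustration and covers the four formats listed earlier (an optional country code followed by ten digits); it is not the expression from the blog credited above, and the names `PHONE_RE` and `extract_phone_numbers` are hypothetical.

```python
import re

PHONE_RE = re.compile(
    r"""
    (?<!\d)                        # do not start inside a longer run of digits
    (?:\(?\+?\d{1,3}\)?[\s-]?)?    # optional country code: +91, (+91), 91
    \d{3}[\s-]?\d{3}[\s-]?\d{4}    # ten digits, optionally grouped 3-3-4
    (?!\d)                         # do not stop inside a longer run of digits
    """,
    re.VERBOSE,
)


def extract_phone_numbers(text):
    """Return every phone-number-like substring, exactly as written in the text."""
    return PHONE_RE.findall(text)


print(extract_phone_numbers("Call (+91) 1234567890 or +91 123 456 7890."))
# → ['(+91) 1234567890', '+91 123 456 7890']
```

The two lookarounds keep the pattern from matching a slice of a longer number, such as an order ID or account number.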

For a fuller explanation of the regular expression used, please visit this website.

To extract email IDs from a resume, we can use an approach similar to the one we used for phone numbers. Email IDs have a fixed form: an alphanumeric string, followed by an @ symbol, followed by another string, then a . (dot) and a final string. We can use a regular expression to extract text of that form.
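The form described above translates into a short regular expression. This is a simplified sketch that is good enough for typical resume text, not a full validator for every legal address; `EMAIL_RE` and `extract_emails` are hypothetical names.

```python
import re

# local part, "@", domain labels, then a dot and a top-level domain of 2+ letters
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return the distinct email addresses in the text, in order of appearance."""
    return list(dict.fromkeys(EMAIL_RE.findall(text)))


print(extract_emails("Contact: jane.doe@example.com, or j_smith99@mail.example.org."))
# → ['jane.doe@example.com', 'j_smith99@mail.example.org']
```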

Now that we have extracted basic information about the person, let’s extract what matters most from the recruiter’s point of view, namely the skills. We can extract skills using a technique called tokenization. Tokenization is simply breaking text down into smaller units: paragraphs into sentences, and sentences into words. Accordingly, there are two main kinds of tokenization: sentence tokenization and word tokenization.
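To make the two kinds of tokenization concrete, here is a rough sketch using only regular expressions. It is a stand-in for illustration; a real project would more likely rely on a library tokenizer such as nltk’s `sent_tokenize` and `word_tokenize`, which handle abbreviations and punctuation far better. The function names are hypothetical.

```python
import re


def split_sentences(text):
    """Sentence tokenization: split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def split_words(sentence):
    """Word tokenization: runs of letters/digits, allowing inner hyphens and apostrophes."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", sentence)


sentences = split_sentences("I build NLP systems. I also teach ML!")
print(sentences)                  # → ['I build NLP systems.', 'I also teach ML!']
print(split_words(sentences[0]))  # → ['I', 'build', 'NLP', 'systems']
```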

Before implementing tokenization, we need a data set against which we can compare the skills found on a particular resume. For this, we’ll create a comma-separated values (.csv) file with the desired skill set. For example, if I am a recruiter looking for a candidate with skills including NLP, ML, and AI, I can create a CSV file whose content is a single line listing them: NLP,ML,AI

Assuming we named the above file skills.csv, we can go on to tokenize our extracted text and compare its tokens against the skills in skills.csv. To read the CSV file, we will use the pandas module. After reading the file, we will remove all stop words from our resume text. In short, a stop word is a common word (such as “the” or “is”) that can be removed without changing the meaning of the sentence much.

Pandas installation: pip install pandas

Word tokenization and extraction:
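A sketch of this step under a few assumptions: skills.csv holds a single comma-separated line of skills (so pandas reads them as column names), tokens are produced by a simple regular expression, and the stop word list is a tiny stand-in for a full list such as nltk’s. The names `load_skills`, `tokenize`, and `extract_skills` are hypothetical.

```python
import re

# Tiny stand-in list; nltk.corpus.stopwords.words("english") is far more complete.
STOP_WORDS = {"a", "an", "and", "the", "in", "of", "on", "to", "for", "with"}


def load_skills(csv_path):
    """Read the skills from the single header row of a file such as skills.csv."""
    import pandas as pd  # imported lazily; only needed when reading from disk
    return [str(name).strip() for name in pd.read_csv(csv_path).columns]


def tokenize(text):
    """Lowercase word tokens; '+' and '#' are kept so that 'c++' and 'c#' survive."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def extract_skills(resume_text, skills):
    """Return the wanted skills found in the resume, in the recruiter's spelling."""
    wanted = {skill.strip().lower(): skill.strip() for skill in skills}
    tokens = [tok for tok in tokenize(resume_text) if tok not in STOP_WORDS]
    # Check single words first, then adjacent word pairs ("machine learning").
    candidates = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    found = []
    for candidate in candidates:
        if candidate in wanted and wanted[candidate] not in found:
            found.append(wanted[candidate])
    return found


print(extract_skills("Worked on NLP and machine learning pipelines in Python.",
                     ["NLP", "ML", "AI", "Machine Learning"]))
# → ['NLP', 'Machine Learning']
```

Because stop words are removed before the word pairs are built, two words separated only by a stop word are treated as adjacent; that keeps the sketch short but can occasionally produce a false match.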

Moving to the last step of our resume parser, we will extract the education details of the candidates. Specifically, we will extract the degree and the year of completion. For example, if XYZ completed an MS in 2018, we’ll extract a tuple like (‘MS’, ‘2018’). For this we need to discard all the stop words. We’ll use the nltk module to load a full list of stop words and then discard them from our resume text.

Installing nltk: pip install nltk

Recruiters are very specific about the minimum education or degree required for a particular job. Therefore, we will prepare an EDUCATION list that specifies all the equivalent degree titles that meet the requirements.
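A minimal sketch of this step, assuming one education entry per line of resume text. The contents of `EDUCATION` and the short stop word set are placeholders to be replaced with the recruiter’s own list and with nltk’s stop words; `extract_education` is a hypothetical name.

```python
import re

# Placeholder degree titles, stored upper-case with punctuation removed.
EDUCATION = {"BE", "BS", "BSC", "BTECH", "ME", "MS", "MSC", "MTECH", "MBA", "PHD"}

# Stand-in stop words. Dropping lower-case "be" and "me" before the lookup is
# what keeps ordinary words from being mistaken for the degrees BE and ME.
STOP_WORDS = {"a", "an", "and", "the", "in", "is", "of", "to", "be", "me"}

YEAR_RE = re.compile(r"(?<!\d)(?:19|20)\d{2}(?!\d)")  # four-digit years 1900-2099


def extract_education(resume_text):
    """Return (degree, year) tuples; year is None when the line has no year."""
    results = []
    for line in resume_text.splitlines():
        year = YEAR_RE.search(line)
        for word in line.split():
            bare = re.sub(r"[^A-Za-z]", "", word)  # "B.Tech," -> "BTech"
            if bare in STOP_WORDS:
                continue
            if bare.upper() in EDUCATION:
                results.append((bare.upper(), year.group() if year else None))
    return results


print(extract_education("XYZ completed MS in 2018"))
# → [('MS', '2018')]
```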

This is how we can implement our own resume parser. It’s fun, isn’t it? You can play with words, sentences and of course grammar too! By integrating the steps above, we can extract all of these entities from a resume and return them together as our final output.

The full code can be found on GitHub. Feel free to open an issue for any problem you are facing. You too can contribute! Have an idea to help make the code even better? Open a pull request 🙂
