/machine-learning/NLP/01-python-basics.md

01 - Python Basics for NLP

NLP (Natural Language Processing) is a subfield of machine learning, so it is common practice to use the Anaconda Python distribution.
With Anaconda, you can create a virtual environment from the dependencies listed in a given yml file by running:

conda env create -f <filename.yml>

String interpolation

Python 2.6 and later support the .format() interpolation method, as follows:

'{} {}'.format(var1, var2)

Python 3.6 and later also support f-string interpolation:

f'{var1} {var2} {var3}'
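
As a quick check that the two styles produce the same string, here is a minimal sketch. The variable names and values are made up for illustration.

```python
# Hypothetical values, chosen only for illustration.
name = 'spaCy'
version = 3

# str.format() fills the {} placeholders positionally.
with_format = 'Library: {} v{}'.format(name, version)

# An f-string evaluates the expressions inside {} in place.
with_fstring = f'Library: {name} v{version}'

print(with_format)
print(with_fstring)
```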

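Padding each interpolated value with spaces to a width of N characters:

f'{var1:{10}} {var2:{20}} {var3:>{30}}'

The :> in the var3 specifier right-aligns the value within its field. Strings are left-aligned by default, so > is needed to right-align a string; numbers are already right-aligned by default.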

Padding with "-" characters instead of spaces (a fill character must be followed by an alignment flag such as < or >):

f'{var1:-<{10}} {var2:-<{20}} {var3:->{30}}'
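
The width, alignment and fill options can be tried on a small example. This is a sketch with made-up values; the widths 10 and 6 are arbitrary.

```python
# Hypothetical values, chosen only for illustration.
author, pages = 'Twain', 601

# Width only: the string is left-aligned, the number right-aligned.
plain = f'{author:{10}}|{pages:{6}}'

# Fill character "-" followed by an explicit alignment flag (< or >).
dashed = f'{author:-<{10}}|{pages:->{6}}'

print(plain)   # Twain     |   601
print(dashed)  # Twain-----|---601
```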

Files I/O

A text file can be written from a Jupyter notebook with the %%writefile cell magic, as follows:

%%writefile test.txt
Hello, this is a quick test file.
This is second line of quick test file.

To load the file content into the Python environment, use the built-in open() function, as follows:

file = open('test.txt')

The file variable holds an instance of the TextIOWrapper class.
On Windows, backslashes in a path string must be escaped as \\ (or the path written as a raw string).
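
Equivalent ways to spell a Windows path are sketched below. The path itself is hypothetical, and the pathlib variant is an alternative not covered in these notes.

```python
from pathlib import PureWindowsPath

# Hypothetical Windows path, used only for illustration.
escaped = 'C:\\Users\\me\\test.txt'   # every backslash escaped
raw = r'C:\Users\me\test.txt'         # raw string: backslashes kept literally

# pathlib accepts forward slashes and renders Windows separators.
from_pathlib = str(PureWindowsPath('C:/Users/me/test.txt'))

print(escaped == raw == from_pathlib)  # True
```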

text_content = file.read()

Calling file.read() several times in a row does not return the content again: after the first read, the position of the TextIOWrapper is at the end of the file, so later calls return an empty string.
Use .seek(0) to move the position back to the start of the file:

file.seek(0)

The readlines() method returns a list of strings, one for each line of the text file:

lines = file.readlines()

Close the file once it is no longer needed:

file.close()
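
The read, seek and readlines behavior described above can be checked end to end. This sketch writes its own sample file into a temporary directory so that it does not depend on the notebook cell.

```python
import os
import tempfile

# Create a small sample file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(path, 'w') as f:
    f.write('Hello, this is a quick test file.\n'
            'This is second line of quick test file.\n')

file = open(path)
first_read = file.read()    # whole file as one string
second_read = file.read()   # position is at the end, so this is ''
file.seek(0)                # move the position back to the start
lines = file.readlines()    # one string per line, trailing '\n' kept
file.close()

print(repr(second_read))
print(lines)
```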

A file can be opened for both reading and writing, as follows:

file = open('test.txt', 'w+')

Note that opening a file in w or w+ mode truncates it, so the original content is lost.
Use the write() method to write text to the file:

file.write('NEW ADDED LINE')
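
A minimal sketch of the truncation, using a temporary file with made-up content:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(path, 'w') as f:
    f.write('original content')

file = open(path, 'w+')       # the file is truncated as soon as it is opened
before = file.read()          # '' because the old content is gone
file.write('NEW ADDED LINE')
file.seek(0)                  # rewind before reading what was just written
after = file.read()
file.close()

print(repr(before), repr(after))
```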

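To keep the existing content and add to the end of it, open the file in append mode by passing a+ as the mode parameter:

file = open('test.txt', 'a+')
file.write('THIS IS THE FIRST LINE')
file.write('\nTHIS IS THE SECOND LINE')
file.seek(0)  # writing leaves the position at the end; rewind before reading
print(file.read())
file.close()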
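
For contrast with w+, this sketch shows that a+ keeps what is already in the file. The file and its content are made up, and the seek(0) is needed because the position sits at the end of the file after a write.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(path, 'w') as f:
    f.write('first line')

file = open(path, 'a+')          # existing content is preserved
file.write('\nappended line')    # writes always go to the end of the file
file.seek(0)                     # rewind, otherwise read() returns ''
content = file.read()
file.close()

print(content)
```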

The with keyword opens the file through a context manager:

with open('test.txt', 'r') as myfile:
  variable = myfile.readlines()

With a context manager, the file is closed automatically when the block ends.
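
The automatic close can be observed through the closed attribute of the file object. A small sketch with a temporary file:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(path, 'w') as f:
    f.write('line one\nline two\n')

with open(path, 'r') as myfile:
    variable = myfile.readlines()
    closed_inside = myfile.closed    # False: still open inside the block

closed_outside = myfile.closed       # True: closed when the block ended

print(closed_inside, closed_outside)
```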

PDF

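It is often necessary to extract text data from PDF files, which can be done with the PyPDF2 library.
The calls below use the PyPDF2 1.x/2.x API; PyPDF2 3.x replaced them with PdfReader, len(reader.pages), reader.pages[n] and extract_text().
For example: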

import PyPDF2

file = open('test.pdf', mode='rb')

pdf_reader = PyPDF2.PdfFileReader(file)

pdf_reader.numPages  # number of pages in the PDF

page_1 = pdf_reader.getPage(0)

text = page_1.extractText()

file.close()

The rb mode ("read binary") is required because a PDF is a binary file, not a plain text file.
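
The difference between the two modes can be seen without any PDF library. The sketch below writes a few bytes that only mimic the start of a PDF (it is not a valid document), then reads them back in binary mode and in text mode.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'fake.pdf')

# Not a real PDF: only a header plus some non-text bytes, for illustration.
with open(path, 'wb') as f:
    f.write(b'%PDF-1.4\n\xe2\xe3\xcf\xd3\n')

# Binary mode returns raw bytes, with no decoding step.
with open(path, 'rb') as f:
    header = f.read(5)

# Text mode tries to decode the bytes and fails on the non-text part.
try:
    with open(path, encoding='utf-8') as f:
        f.read()
    decoded = True
except UnicodeDecodeError:
    decoded = False

print(header, decoded)
```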

Copying a page into a new file with a PDF writer:

f = open('test.pdf', 'rb')

pdf_reader = PyPDF2.PdfFileReader(f)

first_page = pdf_reader.getPage(0)

pdf_writer = PyPDF2.PdfFileWriter()

pdf_writer.addPage(first_page)

pdf_output = open('test_out.pdf', 'wb')

pdf_writer.write(pdf_output)

pdf_output.close()
f.close()

Collecting the extracted text of every page in a list:

f = open('test.pdf', 'rb')  # reopen the file: f was closed above

pdf_text = []

pdf_reader = PyPDF2.PdfFileReader(f)

for p in range(pdf_reader.numPages):
  page = pdf_reader.getPage(p)
  pdf_text.append(page.extractText())

f.close()

for page in pdf_text:
  print(page)
  print('\n\n\n\n\n')

REGEX

Regular expressions (regex) are a fundamental Python capability for NLP.
A regex is an expression used to find and/or extract patterns in text data.
Python's standard library includes the re module to handle them.
The most important re functions are:

re.search(pattern, text) => returns a Match object for the first occurrence found, or None
re.findall(pattern, text) => returns a list of all occurrences found
re.finditer(pattern, text) => returns an iterator of Match objects, one per occurrence
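
The three functions can be compared on the same input. The sample text and pattern below are made up for illustration.

```python
import re

# Hypothetical sample text and pattern.
text = 'Call 555-1234 or 555-9876 today.'
pattern = r'\d{3}-\d{4}'

first = re.search(pattern, text)            # Match object, or None if absent
all_matches = re.findall(pattern, text)     # list of matched strings
spans = [m.span() for m in re.finditer(pattern, text)]  # (start, end) pairs

print(first.group())   # 555-1234
print(all_matches)     # ['555-1234', '555-9876']
print(spans)           # [(5, 13), (17, 25)]
```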

REGEX Characters

The most used character classes are listed below; for more complex patterns, an online regex tester makes things easier.

\d => one Unicode digit in any script
\w => "word character": Unicode letter, ideogram, digit, or underscore
\s => any Unicode whitespace character (space, tab, newline, ...)
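
A short sketch of what each class matches, on a made-up string. The + used in the last pattern means "one or more".

```python
import re

# Hypothetical sample text containing a tab character.
text = 'Order 42 shipped_to\tBerlin'

digits = re.findall(r'\d', text)       # each digit separately
whitespace = re.findall(r'\s', text)   # two spaces and one tab
words = re.findall(r'\w+', text)       # runs of word characters

print(digits)      # ['4', '2']
print(whitespace)  # [' ', ' ', '\t']
print(words)       # ['Order', '42', 'shipped_to', 'Berlin']
```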

REGEX Quantifiers

+ => One or more times
{3} => Exactly three times
{2,4} => Two to four times
{3,} => Three or more times
* => Zero or more times
? => Once or none
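
To see which repetitions each quantifier accepts, the sketch below tests every pattern against strings of zero to five "a" characters. The accepted helper is hypothetical, and re.fullmatch is used so that a pattern must cover the whole string.

```python
import re

# Strings made of 0 to 5 repetitions of "a".
candidates = ['a' * n for n in range(6)]

def accepted(pattern):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(accepted(r'a+'))      # one or more: everything except ''
print(accepted(r'a*'))      # zero or more: everything, including ''
print(accepted(r'a?'))      # once or none: ['', 'a']
print(accepted(r'a{3}'))    # exactly three: ['aaa']
print(accepted(r'a{2,4}'))  # two to four: ['aa', 'aaa', 'aaaa']
print(accepted(r'a{3,}'))   # three or more: ['aaa', 'aaaa', 'aaaaa']
```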

REGEX logics and more

. => Any character except line break
\ => Escapes a special character
| => Alternation / OR operator
( … ) => Capturing group
(?: … ) => non-capturing group
[ … ] => One of the characters in the brackets
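
The constructs listed above can be tried on a few made-up strings. Note that re.findall returns the captured group, not the whole match, when the pattern contains a capturing group.

```python
import re

# Hypothetical file names used as sample text.
files = 'cat.jpg dog.png bird.gif'

# "." matches any character except a line break; "\." matches a literal dot.
any_char = re.findall(r'a.c', 'abc a-c a\nc')
literal_dot = re.findall(r'a\.c', 'abc a.c')

# Alternation inside a capturing group: findall returns the group content.
extensions = re.findall(r'\w+\.(jpg|png)', files)

# Same pattern with a non-capturing group: findall returns the full match.
names = re.findall(r'\w+\.(?:jpg|png)', files)

# Character set: any single character listed in the brackets.
vowels = re.findall(r'[aeiou]', 'regex')

print(any_char, literal_dot, extensions, names, vowels)
```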
