NLP* in Python


Tarek Amr


eg.okfn.org

NLP: Natural Language Processing.


NLP: Neuro-linguistic Programming.



What do I do?


Open Knowledge Foundation Ambassador*


Freelance Data Mining and Visualisation*



Just take notes


Slides: http://tarekamr.appspot.com/slides/pynlp


Code: https://github.com/gr33ndata/NLP_GDGCairo2013

Why NLP?



Search Engines*

Why NLP?



Classification*

Why NLP?



Clustering

Why NLP?



Unstructured to Structured Data*

NLTK


Natural Language Toolkit


Free Book: http://nltk.org/book/

NLTK Modules


Accessing corpora: nltk.corpus

String processing: nltk.tokenize, nltk.stem

Part-of-speech tagging: nltk.tag

Classification: nltk.classify, nltk.cluster

Chunking: nltk.chunk

Etc...

Normalization


Free domain and free hosting.

"This is an Apple".lower() => "this is an apple"


The CEO of Apple gave me an apple.

Normalization


"mba".upper()


"\t\n String ".strip()


"The cost is $ 500".replace('$','Money')


re.sub(r"\$\d+", "Money", "Fees are $500 plus $20 VAT")
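The calls above can be combined into one small helper. This is only a sketch: the function name normalize and the MONEY placeholder are made up for illustration, and the pattern assumes amounts written like $500, $ 500 or $6,000. Note that lowercasing everything also merges "Apple" the company with "apple" the fruit, which is the trade-off shown on the first Normalization slide.

```python
import re

def normalize(text):
    """Trim whitespace, lowercase, and replace dollar amounts with a placeholder."""
    text = text.strip().lower()
    # Raw string so that \$ and \d reach the regex engine unchanged.
    # Matches "$500", "$ 500" and "$6,000" (thousands separated by commas).
    return re.sub(r"\$\s*\d+(?:,\d{3})*", "MONEY", text)

print(normalize("\t\n Fees are $500 plus $20 VAT "))
print(normalize("The cost is $ 500"))
```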

Normalization


You have won <b>$6,000</b>, click <a href="http://site.com">here</a> to claim your prize.


nltk.clean_html()
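As far as I know, nltk.clean_html() is no longer implemented in NLTK 3.x and instead points users to BeautifulSoup. If no extra dependency is wanted, the standard library can do a rough job. The sketch below uses html.parser; the TextExtractor class and strip_tags function are hypothetical helpers, not part of NLTK.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

spam = ('You have won <b>$6,000</b>, click '
        '<a href="http://site.com">here</a> to claim your prize.')
print(strip_tags(spam))
```

This keeps only the visible text, so the link target is lost. That is usually what is wanted before tokenization.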

Tokenization


myText = "Some text with spaces in between"

myText.split(" ")


["Some", "text", "with", "spaces", "in", "between"]

Tokenization


from nltk.tokenize import *


sent_tokenize()

word_tokenize()

wordpunct_tokenize()
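To see why split() alone is not enough, here is a small sketch that needs no NLTK data downloads. The function simple_wordpunct is a hypothetical stand-in that follows the same idea as wordpunct_tokenize(): runs of word characters, or runs of punctuation, each become a token.

```python
import re

def simple_wordpunct(text):
    """Split into runs of word characters or runs of punctuation."""
    return re.findall(r"\w+|[^\w\s]+", text)

text = "Don't split on spaces only. Punctuation matters, too!"

# Whitespace split: punctuation stays glued to the words ("only.", "too!").
print(text.split())

# Regex split: punctuation becomes separate tokens.
print(simple_wordpunct(text))
```

Note that the regex version also splits "Don't" into three tokens. Whether that is desirable depends on the task.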

Analysis


myString = "Be who you are and say what you feel, because those who mind don't matter, and those who matter don't mind"


t = nltk.Text(word_tokenize(myString))

t.vocab()

t.count("who")

t.collocations(num, window_size)

t.plot()

Analysis


Text([list])

t.tokens

t.vocab().most_common(100)
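The same kind of counting can be tried without NLTK. The sketch below uses collections.Counter in place of t.vocab() and t.count(); the tokenization rule (lowercase, letters and apostrophes only) is a simplifying assumption.

```python
import re
from collections import Counter

my_string = ("Be who you are and say what you feel, because those who mind "
             "don't matter, and those who matter don't mind")

# Lowercase, then keep runs of letters and apostrophes as tokens.
tokens = re.findall(r"[a-z']+", my_string.lower())

freq = Counter(tokens)        # plays the role of t.vocab()
print(freq["who"])            # plays the role of t.count("who")
print(freq.most_common(5))    # most frequent tokens first
print(len(freq))              # vocabulary size
```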

+GDGCairoOrg



[+1s < 3] [+1s >= 3]

Term Weighting


Analysis for Wikipedia pages of Egypt, Tunisia and Lebanon.
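The slide does not say which weighting scheme was used. One common choice is TF-IDF, sketched below under that assumption. The three token lists are tiny made-up stand-ins, not the real Wikipedia pages, and tf_idf is a hypothetical helper.

```python
import math
from collections import Counter

# Toy stand-ins for the three pages (assumed data, for illustration only).
docs = {
    "egypt": "cairo nile pyramids arab country".split(),
    "tunisia": "tunis mediterranean arab country".split(),
    "lebanon": "beirut mediterranean arab country".split(),
}

def tf_idf(docs):
    """Weight = relative term frequency * log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    weights = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        weights[name] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return weights

weights = tf_idf(docs)
for name, w in weights.items():
    print(name, sorted(w, key=w.get, reverse=True)[:2])
```

Terms that appear in every document, such as "arab" here, get weight zero. Terms unique to one document get the highest weights.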

VSM


How Google Works
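In the vector space model, documents and queries are treated as vectors of term weights, and relevance is measured by the angle between them. The sketch below ranks two made-up documents against a query using cosine similarity on raw term counts. It only illustrates the idea; real search engines combine many more signals.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy documents and query (assumed data, for illustration only).
docs = {
    "doc1": "python natural language processing".split(),
    "doc2": "cairo traffic is blocked".split(),
}
query = Counter("language processing in python".split())

ranked = sorted(docs,
                key=lambda name: cosine(query, Counter(docs[name])),
                reverse=True)
print(ranked)
```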


Language Identification



Search for this video on YouTube


Dawwar 3al video dah 3ala YouTube

Language Identification



She has 3 kids and 2 cars.


El wad dah beyeshrab Pepsi

Characters Distribution

English - Francoarab

n-grams


2-grams (bigrams): 'is', 'el', 'qu', etc.

3-grams (trigrams): 'the', 'ing', 'ion', etc.

4-grams, 5-grams, 6-grams, etc.

Top 10 bigrams


English: 'in', 'th', 'an', 'he', 'te', 'ed', 'or', 'at', 'it', 'ng'

Francoarab: 'al', 'el', 'la', 'la', '3a', 'ar', 'sh', 'sa', 'et', 'ha'
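The bigram lists above are enough for a very rough language guesser. The sketch below extracts character bigrams within each word and counts how many fall in each list. The function names and the tie-breaking rule are assumptions for illustration, not necessarily how the talk's code works.

```python
from collections import Counter

def char_ngrams(text, n=2):
    """Character n-grams taken inside each word (spaces are not crossed)."""
    grams = []
    for word in text.lower().split():
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams

def top_ngrams(text, n=2, k=10):
    """The k most frequent n-grams in a text, most frequent first."""
    return [gram for gram, _ in Counter(char_ngrams(text, n)).most_common(k)]

# Bigram lists copied from the slide above.
ENGLISH = {'in', 'th', 'an', 'he', 'te', 'ed', 'or', 'at', 'it', 'ng'}
FRANCOARAB = {'al', 'el', 'la', '3a', 'ar', 'sh', 'sa', 'et', 'ha'}

def guess_language(text):
    """Pick the language whose top bigrams overlap most with the text."""
    grams = set(char_ngrams(text, 2))
    english_hits = len(grams & ENGLISH)
    franco_hits = len(grams & FRANCOARAB)
    if english_hits == franco_hits:
        return "unknown"
    return "english" if english_hits > franco_hits else "francoarab"

print(guess_language("Search for this video on YouTube"))
print(guess_language("Dawwar 3al video dah 3ala YouTube"))
```

With only about ten bigrams per language, short sentences are easily misclassified. A real classifier would compare full bigram frequency profiles built from much larger samples.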

PoS Tagging


Cairo Traffic


The traffic from Cairo to Alex is totally blocked.


Kobry October to Nasr City from Down Town looz el 3enab.



[FROM] - [TO] - [DIRECTION]

Conclusion


Feel free to contact me if you have any comments about the code, or if you would like help using it in any of your projects.


Contact me!

Published under Creative Commons license