Tutorial: Indonesian Natural Language Processing Using Sastrawi | Google Colab Python

Definition of terms:

Sastrawi is a simple library that reduces inflected words in Indonesian (Bahasa Indonesia) to their base form (stem). It was originally written in PHP; this tutorial uses its Python port, which is installed with pip.

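Cleansing is the process of systematically cleaning raw data before analysis. In this tutorial it means lowercasing the text and removing numbers, punctuation, and surrounding whitespace.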
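As a rough sketch of what cleansing can look like, the steps can be wrapped in one small function. The name clean_text and the sample sentence are made up for illustration, and only the Python standard library is used.

```python
import re
import string


def clean_text(text):
    """Lowercase the text and strip digits, ASCII punctuation, and extra spaces."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)  # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return " ".join(text.split())  # collapse runs of whitespace


print(clean_text("Harga BBM naik 10%, kata Menteri!"))
# harga bbm naik kata menteri
```

Note that string.punctuation only covers ASCII punctuation, so characters such as curly quotes would survive this step.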

Stemming is the process of reducing affixed words to their root words, for example "membaca" to "baca".
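To make the idea concrete, here is a deliberately tiny stemmer that strips one suffix and one prefix and checks the result against a root-word list. This is not Sastrawi's algorithm; the function name toy_stem, the affix lists, and the four-word dictionary are all assumptions chosen for the example.

```python
# Hypothetical mini dictionary of root words (a real stemmer ships thousands).
ROOT_WORDS = {"makan", "baca", "main", "ajar"}
PREFIXES = ("ber", "ter", "di", "me")
SUFFIXES = ("kan", "an", "i")


def toy_stem(word):
    """Return the root word if stripping one suffix and/or one prefix finds it."""
    word = word.lower()
    candidates = [word]
    # Candidates with one suffix removed
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            candidates.append(word[:-len(suffix)])
    # For every candidate so far, also try removing one prefix
    for cand in list(candidates):
        for prefix in PREFIXES:
            if cand.startswith(prefix):
                candidates.append(cand[len(prefix):])
    # The first candidate found in the dictionary wins
    for cand in candidates:
        if cand in ROOT_WORDS:
            return cand
    return word  # unknown word: leave it unchanged


for w in ["makanan", "dibaca", "bermain", "dimakan"]:
    print(w, "->", toy_stem(w))
# makanan -> makan
# dibaca -> baca
# bermain -> main
# dimakan -> makan
```

A real Indonesian stemmer also has to handle prefixes that change the first letter of the root, which this sketch ignores.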

Tokenizing is the process of splitting text (sentences, paragraphs, or whole documents) into smaller units called tokens, typically words. Tokenization is widely used in linguistics, and the resulting tokens are the input for further text analysis.
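A minimal sketch of word tokenization, assuming that a token is simply a run of letters, digits, or underscores. The function name tokenize and the sample sentence are illustrative only.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


sentence = "Jakarta banjir, warga mengungsi."
print(sentence.split())    # ['Jakarta', 'banjir,', 'warga', 'mengungsi.']
print(tokenize(sentence))  # ['jakarta', 'banjir', 'warga', 'mengungsi']
```

Plain str.split() leaves punctuation attached to the words, which is why a pipeline that tokenizes by whitespace needs the punctuation removed beforehand.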

Stop-words are common words that appear very frequently but carry little meaning on their own. Filtering them out is a common step in information retrieval tasks, including at Google.
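Stop-word removal can be sketched as a simple filter against a set. The seven-word list below is a hand-picked assumption for illustration, not the NLTK Indonesian list, and remove_stopwords is a made-up name.

```python
# Hypothetical mini stop-word list; real lists contain hundreds of entries.
STOP_WORDS = {"yang", "di", "dan", "ke", "dari", "ini", "itu"}


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stop-word set."""
    return [token for token in tokens if token not in STOP_WORDS]


print(remove_stopwords(["banjir", "di", "jakarta", "dan", "bekasi"]))
# ['banjir', 'jakarta', 'bekasi']
```

Using a set keeps each membership check fast even when the stop-word list is long.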

Source code:

# Run this first in a Colab cell to install the Python port of Sastrawi
!pip install Sastrawi

import re
import string

import requests
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

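# Download the page and parse it; naming the parser avoids a BeautifulSoup warning
web = requests.get('https://wartakota.tribunnews.com/').text
soup = BeautifulSoup(web, 'html.parser')
# Drop <script> and <style> elements so that only visible text remains
for s in soup(['script', 'style']):
    s.decompose()
teks = ' '.join(soup.stripped_strings)
print(teks)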

teks = teks.lower()  # convert to lowercase
teks = re.sub(r"\d+", "", teks)  # remove numbers
teks = teks.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
teks = teks.strip()  # remove leading and trailing whitespace

# Create the Sastrawi stemmer and reduce every word to its root form
factory = StemmerFactory()
stemmer = factory.create_stemmer()
output = stemmer.stem(teks)
print(output)

tokens = output.split()  # tokenize on whitespace
print(tokens)

nltk.download('stopwords')  # fetch only the stop-word corpus
stop_words = set(stopwords.words('indonesian'))  # load the list once, as a set
clean_tokens = [token for token in tokens if token not in stop_words]

# Count how often each remaining token occurs
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))

freq.plot(30)  # plot the 30 most frequent tokens
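The frequency count above can also be approximated without NLTK. The sketch below uses collections.Counter from the standard library; the helper name top_tokens and the sample token list are assumptions for illustration.

```python
from collections import Counter


def top_tokens(tokens, n):
    """Return the n most frequent tokens as (token, count) pairs."""
    return Counter(tokens).most_common(n)


sample = ["banjir", "jakarta", "banjir", "warga", "jakarta", "banjir"]
print(top_tokens(sample, 2))
# [('banjir', 3), ('jakarta', 2)]
```

This gives the same counts as a frequency distribution but without the built-in plotting.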

I have also wrapped all of these steps into a single video.

