Python - Word Embedding using Word2Vec

Word embedding is a language modeling technique that maps words to vectors of real numbers. It represents words or phrases as points in a multi-dimensional vector space. Word embeddings can be generated by several methods, such as neural networks, co-occurrence matrices, and probabilistic models.
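
To make the co-occurrence approach concrete, here is a minimal sketch that turns each word into a vector of neighbor counts. The toy corpus, the function name cooccurrence_vectors, and the window parameter are all assumptions made for illustration; real systems use far larger corpora and usually reduce the dimensionality of the resulting matrix.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical, for illustration only)
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
]

def cooccurrence_vectors(sentences, window=1):
    """Count, for each word, how often other words occur within `window` positions."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(0, i - window)
            hi = min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][sentence[j]] += 1
    vocab = sorted(counts)
    # Each word becomes a vector of counts, one dimension per vocabulary word
    vectors = {w: [counts[w][v] for v in vocab] for w in vocab}
    return vocab, vectors

vocab, vectors = cooccurrence_vectors(corpus)
print(vocab)
print(vectors["cat"], vectors["dog"])
```

In this toy corpus "cat" and "dog" appear in the same contexts, so they end up with the same count vector, which is the intuition behind embeddings: words used in similar contexts get similar vectors.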

Word2Vec is a group of models used to produce word embeddings. These models are shallow two-layer neural networks with an input layer, a hidden layer, and an output layer.
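
The input, hidden, and output layers described above can be sketched as a single forward pass in plain Python. This is only an illustration under stated assumptions: the vocabulary, the hidden size N, the random weights, and the function name forward are hypothetical, no training is performed, and practical implementations typically avoid computing a full softmax over the vocabulary.

```python
import math
import random

random.seed(0)

vocab = ["alice", "was", "in", "wonderland"]  # hypothetical toy vocabulary
V, N = len(vocab), 3                           # vocabulary size, hidden-layer size

# Weight matrices: input->hidden (V x N) and hidden->output (N x V)
W_in = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]

def forward(word):
    """Return the hidden vector and the output probabilities for one input word."""
    idx = vocab.index(word)
    # The input is a one-hot vector, so the hidden layer is simply row `idx` of W_in
    hidden = W_in[idx]
    # Output layer: one score per vocabulary word, followed by a softmax
    scores = [sum(hidden[k] * W_out[k][j] for k in range(N)) for j in range(V)]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    return hidden, probs

hidden, probs = forward("alice")
print(len(hidden), len(probs))
print(probs)
```

After training, the rows of the input-to-hidden matrix (W_in in this sketch) are what is normally used as the word vectors.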

Example

# Import all necessary modules
# (sent_tokenize needs NLTK's Punkt tokenizer data to be downloaded beforehand)
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings
warnings.filterwarnings(action='ignore')
import gensim
from gensim.models import Word2Vec

# Read the 'alice.txt' file (UTF-8 encoding is an assumption about the file)
with open("C:\\Users\\Vishesh\\Desktop\\alice.txt", "r", encoding="utf-8") as sample:
    s = sample.read()

# Replace newline characters with spaces
f = s.replace("\n", " ")
data = []

# Iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []

    # Tokenize the sentence into lowercase words
    for j in word_tokenize(i):
        temp.append(j.lower())

    data.append(temp)

# Create CBOW model (sg=0 is the default).
# gensim 4.x uses vector_size (formerly size) and exposes similarity on model.wv.
model1 = gensim.models.Word2Vec(data, min_count=1, vector_size=100, window=5)

# Print results
print("Cosine similarity between 'alice' and 'wonderland' - CBOW : ",
      model1.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))

# Create Skip Gram model (sg=1)
model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)

# Print results
print("Cosine similarity between 'alice' and 'wonderland' - Skip Gram : ",
      model2.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))
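
The values printed above are cosine similarities between word vectors. As a minimal sketch of what that measure computes, the following plain-Python function works on any two vectors of equal length; the function name cosine_similarity and the three sample vectors are hypothetical and are not taken from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional vectors, for illustration only
v1 = [1.0, 2.0, 3.0]
v2 = [2.0, 4.0, 6.0]    # points in the same direction as v1
v3 = [-3.0, 0.0, 1.0]   # orthogonal to v1

print(cosine_similarity(v1, v2))
print(cosine_similarity(v1, v3))
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, and opposite vectors score close to -1, so a higher value in the output above means the model places the two words closer together.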