Difference between revisions of "Word2vec basic.py"

From Algolit

==training-window-words.txt==

[[1906] [18] [25] [2831] [1] [1906] [221] [25] [1] [37] [1] [221] [1840] [37] [655] [1] [1840] [3] [655] [22] [3] [971] [22] [4] [971] [1] [4] [481] [1] [4235] [297] [481] [0] [4235] [7] [297] [1343] [0] [16] [7] [1343] [53] [172] [16] [1] [53] [1080] [172] [1] [1831] [1080] [0] [2] [1831] [0] [0] [2] [1804] [0] [1] [590] [1804] [1] [653] [590] [3] [16] [653] [489] [3] [2] [16] [7] [489] [2] [8] [7] [5] [0] [8] [5] [56] [1313] [0] [13] [56] [1313] [14] [44] [13] [14] [3432] [6] [44] [3432] [1] [98] [6] [744] [1] [98] [23] [16] [744] [489] [23] [56] [16] [489] [85] [4] [56] [85] [224] [5] [4] [224] [0] [1080] [5] [0] [1] [1080] [0] [474] [1] [0] [8]]

In words:

['Feet', 'or', 'from', 'thirteen', 'the', 'Feet', 'ground', 'from', 'the', 'all', 'the', 'ground', 'poles', 'all', 'met', 'the', 'poles', 'and', 'met', 'were', 'and', 'tied', 'were', 'to', 'tied', 'the', 'to', 'central', 'the', 'pole', 'Or', 'central', 'UNK', 'pole', 'that', 'Or', 'pushed', 'UNK', 'on', 'that', 'pushed', 'up', 'through', 'on', 'the', 'up', 'roof', 'through', 'the', 'horizontal', 'roof', 'UNK', 'of', 'horizontal', 'UNK', 'UNK', 'of', 'completed', 'UNK', 'the', 'main', 'completed', 'the', 'structure', 'main', 'and', 'on', 'structure', 'top', 'and', 'of', 'on', 'that', 'top', 'of', 'was', 'that', 'a', 'UNK', 'was', 'a', 'Of', 'palmleaves', 'UNK', 'which', 'Of', 'palmleaves', 'had', 'been', 'which', 'had', 'folded', 'in', 'been', 'folded', 'the', 'same', 'in', 'direction', 'the', 'same', 'one', 'on', 'direction', 'top', 'one', 'Of', 'on', 'top', 'another', 'to', 'Of', 'another', 'form', 'a', 'to', 'form', 'UNK', 'roof', 'a', 'UNK', 'the', 'roof', 'UNK', 'hut', 'the', 'UNK', 'was']

==cosine similarity calculation updates==

...

==logfile.txt==

An example snippet from the logfile.txt, as it appeared in the previous revision of this page (trained on ''Mankind in the Making''):

'''Nearest to education''': criticism, family, statistics, varieties, sign, karl, manner, euphemism, concurrence, absurdity,

'''Nearest to complex''': love, ascribed, sadder, abundance, positivist, spin, subtlety, spectacle, heedless, number,

'''Nearest to beliefs''': access, advantageous, bound, opens, determined, idle, bringing, binding, considered, unprotected,

'''Nearest to harmony''': fumble, alienists, sketching, disaster, compete, survival, rule, textbooks, encumbered, dowries,

'''Nearest to uncivil''': narrowchested, inferiors, pitiful, angry, beautifully, accentuate, petals, predisposition, individualistic, produced,
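The "Nearest to" lists in the training log rank words by the cosine similarity between their embedding vectors. The sketch below illustrates that calculation in plain Python. The function names and the two-dimensional toy vectors are invented for illustration and are not taken from the Algolit script; only the words themselves come from the log.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word, embeddings, top_k=2):
    """Return the top_k words whose vectors are most similar to `word`."""
    target = embeddings[word]
    scored = [
        (cosine_similarity(target, vector), other)
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:top_k]]

# Toy embeddings (hypothetical values, two dimensions instead of 20).
embeddings = {
    'education': [1.0, 0.0],
    'family': [0.9, 0.1],
    'criticism': [0.5, 0.5],
    'harmony': [-1.0, 0.2],
}

closest = nearest('education', embeddings)
print('Nearest to education:', ', '.join(closest))
```

During training the vectors change with every update, so the ranking printed for each test word shifts from one log entry to the next.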

Revision as of 21:32, 24 October 2017

Type: Algolit extension
Datasets: Tristes Tropiques
Technique: calculating semantic similarity with word-embeddings, code inspection
Collectively developed by: The people behind Google Tensorflow's word2vec, Algolit
Graph generated by the word2vec_basic.py example script, trained on the book "Mankind in the Making" by H.G. Wells.

This is an annotated version of the basic word2vec script. The code is based on the Word2Vec tutorial provided by TensorFlow.

History

Word2vec consists of related models used to generate vectors from words (also called word embeddings). It is a two-layer neural network, produced by a team of researchers led by Tomas Mikolov at Google.

word2vec_basic_algolit.py

The structure of the annotated word2vec script is the following:

  • Step 1: Download data.
  • Algolit step 1: read data from plain text file
    • Algolit inspection: from continuous text to list of words, exported as wordlist.txt.
  • Step 2: Create a dictionary and replace rare words with UNK token.
    • Algolit inspection: from list of words to a list with the structure [(word, value)], exported as counted.txt.
    • Algolit inspection: reversed dictionary, a list of the 5000 (= vocabulary size) most common words, each accompanied by an index number.
    • Algolit inspection: the object data is created, the original text in which words are replaced with index numbers, exported as data.txt.
    • Algolit inspection: list of disregarded words that fall outside the vocabulary size, exported as disregarded.txt.
    • Algolit adaptation: reversed version of the initial dataset, where all the disregarded words are replaced with UNK (unknown), exported as reversed-input.txt.
  • Step 3: Function to generate a training batch for the skip-gram model.
    • Algolit inspection: an example of a training batch, a vector with a vector size of 128x20.
  • Step 4: Build and train a skip-gram model.
    • Algolit inspection: a big random matrix is created, with a vector size of 5000x20, exported as big-random-matrix.txt.
    • Algolit adaptation: select your own set of test words to be previewed during the training process, using the dictionary.txt
  • Step 5: Begin training.
    • Algolit inspection: a training batch of 64 words, with a vector size of 128x20, exported as training-words.txt.
    • Algolit inspection: the 128 connected window words, one to the left, one to the right, with a vector size of 128x20, exported as training-window-words.txt.
    • Algolit adaptation: visualisation of the cosine similarity calculation updates.
    • Algolit inspection: save training log, exported as logfile.txt
  • Step 6: Visualize the embeddings.
    • Algolit adaptation: select 3 words to be included in the graph and highlighted in red.
    • Algolit adaptation: add metadata to the plot.

Source

The script word2vec_basic.py provides an option to download a dataset from Matt Mahoney's home page. It turns out to be a plain text document, without any punctuation or line breaks.

For the tests that we wanted to do with the script, we decided to work with a piece of academic literature instead: Tristes Tropiques, written by Claude Lévi-Strauss and translated by John Russell (https://archive.org/details/tristestropiques000177mbp).

Before we could use Lévi-Strauss' text as training material, we needed to remove all the punctuation from the file. To do this, we wrote a small Python script, text-punctuation-clean-up.py. The script saves a *stripped* version of the original book under another filename.

The book contains 153,003 words in total.
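The clean-up step can be sketched as follows. This is not the text-punctuation-clean-up.py script itself, which is not reproduced on this page, but a minimal illustration under the assumption that only ASCII punctuation is removed. The function name and the sample sentence are hypothetical.

```python
import string

def strip_punctuation(text):
    """Remove ASCII punctuation and collapse runs of whitespace."""
    table = str.maketrans('', '', string.punctuation)
    return ' '.join(text.translate(table).split())

# A short stand-in for the book; the real script reads and writes files.
sample = "By Claude Lévi-Strauss. Translated by John Russell!"
stripped = strip_punctuation(sample)
words = stripped.split()

print(stripped)
print(len(words), 'words')
```

Note that removing the hyphen glues the two halves of a name together, which is consistent with tokens such as 'levistrauss' in the word list below.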

wordlist.txt

['xt', '1250', 'By', 'Claude', 'levistrauss', 'Translated', 'by', 'john', 'r', 'ussell', 'Illustrated', 'with', '48', 'pages', 'of', 'photographs', 'and', '48', 'line', 'drawings', 'Have', 'sought', 'a', 'human', 'society', 'reduced', 'To', 'its', 'most', 'basic', 'expression', 'His', 'search', 'has', 'taken', 'claude', 'levi', 'Strauss', 'eminent', 'french', 'anthropologist', 'And', 'one', 'of', 'the', 'founders', 'of', 'structural', 'Anthropology', 'to', 'the', 'far', 'corners', 'of', 'the', 'Earth', 'not', 'as', 'a', 'superficial', 'sightseer', 'but', 'As', 'a', 'close', 'student', 'of', 'man', 'and', 'the', 'varied', 'Cultures', 'he', 'has', 'erected', 'around', 'himself', 'While', 'a', 'professor', 'at', 'sao', 'paolo', 'univer', 'Sity', 'in', 'brazil' ... ]

counted.txt

[['UNK', 18767], ('the', 10108), ('of', 5790), ('and', 4229), ('to', 3895), ('a', 3407), ('in', 3092), ('that', 1633), ('was', 1380), ('it', 1367), ('as', 1271), ('with', 1206), ('for', 1196), ('which', 1158), ('had', 1129), ('is', 1119), ('on', 1015), ('i', 1014), ('or', 945), ('they', 905), ('their', 886), ('by', 876), ('were', 868), ('one', 800), ('at', 794), ('from', 764), ('The', 762), ('be', 731), ('we', 726), ('he', 678), ('not', 668), ('his', 646), ('an', 596), ('this', 584), ('but', 576), ('have', 558), ('are', 555), ('all', 547), ('them', 509), ('its', 454), ('our', 452), ('would', 449), ('s', 445), ('so', 440), ('been', 396), ('my', 394), ('these', 386), ('who', 375), ('there', 361), ('And', 348), ('two', 346), ('no', 341), ('into', 336), ('up', 336), ('more', 335), ('when', 335), ('Of', 324), ('has', 296), ('if', 291), ('other', 289), ('out', 287), ('me', 282), ('only', 274), ('us', 272), ('could', 262), ('some', 250), ('To', 243), ('time', 232), ('can', 232), ('In', 229), ('made', 223), ('die', 222), ('what', 222), ('those', 221), ('than', 214), ('men', 209), ('where', 208), ('will', 202), ('first', 201), ('him', 198), ('A', 192), ('between', 191), ('each', 189), ('any', 185), ('own', 183), ('another', 182), ('way', 178) ... ]
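A list of this shape can be built with a counter from the standard library. The sketch below assumes that the UNK entry starts out as a list with a placeholder value and that the remaining entries are the tuples returned by the counter, which would explain why the first item above uses square brackets and the others use parentheses. The function name and the toy sentence are illustrative.

```python
import collections

def count_words(words, vocabulary_size):
    """Return [['UNK', -1], (word, count), ...] for the most common words."""
    count = [['UNK', -1]]  # placeholder; filled in once unknown words are counted
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    return count

words = 'the cat and the dog and the bird'.split()
count = count_words(words, vocabulary_size=3)
print(count)
```

With a vocabulary size of 3, only 'the' and 'and' are kept next to UNK; every other word will later be treated as unknown.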

dictionary.txt

{0: 'UNK', 1: 'the', 2: 'of', 3: 'and', 4: 'to', 5: 'a', 6: 'in', 7: 'that', 8: 'was', 9: 'it', 10: 'as', 11: 'with', 12: 'for', 13: 'which', 14: 'had', 15: 'is', 16: 'on', 17: 'i', 18: 'or', 19: 'they', 20: 'their', 21: 'by', 22: 'were', 23: 'one', 24: 'at', 25: 'from', 26: 'The', 27: 'be', 28: 'we', 29: 'he', 30: 'not', 31: 'his', 32: 'an', 33: 'this', 34: 'but', 35: 'have', 36: 'are', 37: 'all', 38: 'them', 39: 'its', 40: 'our', 41: 'would', 42: 's', 43: 'so', 44: 'been', 45: 'my', 46: 'these', 47: 'who', 48: 'there', 49: 'And', 50: 'two', 51: 'no', 52: 'into', 53: 'up', 54: 'more', 55: 'when', 56: 'Of', 57: 'has', 58: 'if', 59: 'other', 60: 'out', 61: 'me', 62: 'only', 63: 'us', 64: 'could', 65: 'some', 66: 'To', 67: 'time', 68: 'can', 69: 'In', 70: 'made', 71: 'die', 72: 'what', 73: 'those', 74: 'than', 75: 'men', 76: 'where', 77: 'will', 78: 'first', 79: 'him', 80: 'A', 81: 'between', 82: 'each', 83: 'any', 84: 'own', 85: 'another', 86: 'way' ... }
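The index numbers follow the order of the counted list: the most frequent word gets the lowest number. A minimal sketch of how a word-to-index dictionary and its reversed version could be derived is given below; the variable names are assumptions, and the toy counts stand in for counted.txt.

```python
# Toy version of the counted list (word, number of occurrences).
count = [['UNK', -1], ('the', 3), ('and', 2)]

# Word -> index: each new word receives the next free index number.
dictionary = {}
for word, _ in count:
    dictionary[word] = len(dictionary)

# Index -> word: the reversed dictionary, as shown in dictionary.txt.
reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))

print(dictionary)
print(reverse_dictionary)
```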

data.txt

[0, 0, 223, 0, 2465, 0, 21, 0, 1951, 0, 0, 11, 2574, 3339, 2, 3858, 3, 2574, 232, 1882, 427, 1493, 5, 189, 115, 1404, 66, 39, 116, 2493, 2328, 477, 1090, 57, 269, 0, 0, 0, 0, 382, 487, 49, 23, 2, 1, 0, 2, 0, 3917, 4, 1, 149, 1715, 2, 1, 0, 30, 10, 5, 4136, 0, 34, 192, 5, 1487, 1303, 2, 104, 3, 1, 2203, 0, 29, 57, 3905, 418, 144, 872, 5, 3282, 24, 248, 4672, 0, 0, 6, 227, 686, 2465, 1457, 0, 172, 1, 741, 1000, 49, 1, 4837, 0, 0, 2, 227, 66, 1, 0, 2639, 2, 31, 4563, 180, 8, 295, 105, 1, 116, 433, 56, 1, 0, 480, 7, 29, 131, 26, 2493, 0, 408, 29, 8, 0, 2480, 2639, 15, 1, 818, 2, 31, 2098, 105, 46, 480, 295, 589, 0, 0, 0, 2, 1, 3697, 3, 1, 2001, 516, 0, 429, 13, 19, 2578, 20, 2621, 1019, 1, 0, 0, 0, 115, 2, 1, 185, 1, 953, 47, 0, 5, 267, 2, 1468, 223, 1171, 504, 4, 20, 179, 1, 4349, 3, 0, 705, 3903, 147, 0, 2748, 2192, 1516, 190, 12, 166, 0, 16, 106, 0, 0, 2262, 2262, 0, 2480, 2639, 0, 0, 0, 2053, 0, 42, 2480, 2639, 0, 4004, 0, 339, 888, 3225, 0, 77, 27, 0, 62, 246, 0, 2, 3225, 2885, 0, 0, 373, 0, 3, 0, 2, 2173, 0, 0, 0, 36, 1036, 12, 310, 1214, 0, 0, 0, 297, 59, 3225, 3705, 0, 60, 16, 20, 0, 184, 0, 375, 2213, 1236, 3, 50, 627, 0, 2, 1, 196, 0, 1, 0, 36, 1412, 1737, 214, 0, 0, 3, 0, 4, 1, 185, 0, 6, 1, 1108, 19, 154, 36, 23, 56, 1, 2736, 480, 2, 481, 227 ... ]
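Each number in this list is the dictionary index of the word at that position in the text, with 0 standing in for every word outside the vocabulary. The sketch below shows one way to produce such a list; the function name and the toy dictionary are hypothetical.

```python
def words_to_indices(words, dictionary):
    """Replace each word by its index; unknown words become 0 (UNK)."""
    data = []
    unk_count = 0
    for word in words:
        index = dictionary.get(word, 0)  # 0 is the index of 'UNK'
        if index == 0:
            unk_count += 1
        data.append(index)
    return data, unk_count

dictionary = {'UNK': 0, 'the': 1, 'and': 2}
words = 'the cat and the dog and the bird'.split()
data, unk_count = words_to_indices(words, dictionary)

print(data)
print(unk_count, 'words replaced by UNK')
```

The number of replaced words is presumably what ends up as the value for UNK at the head of counted.txt.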

disregarded.txt

['xt', '1250', 'Claude', 'Translated', 'john', 'ussell', 'Illustrated', 'claude', 'levi', 'Strauss', 'eminent', 'founders', 'structural', 'Earth', 'sightseer', 'Cultures', 'univer', 'Sity', 'Extensively', 'upland', 'jungles', 'tristes', 'amerindian', 'humain', 'seeking', 'intricate', 'detailed', 'accounts', 'Designs', 'rigid', 'hier', 'Archical', 'win', 'superstitionridden', 'weird', 'Continued', 'flap', 'Iv', 'cv', '981', 'l56t', 'Le', 'straus', '61157', 'Kansas', 'Books', 'issued', 'presentation', 'Please', 'report', 'cards', 'Change', 'promptly', 'Card', 'holders', 'records', 'films', 'pict', 'Checked', 'cards', 'Frontispiece', 'Carajiindians', 'araguaia', 'Caraji', 'geo', 'Graphically', 'culturally', 'Described', 'Date', 'duk', 'Auf2s', '67', 'Wl', 'Translated', 'John', 'russell', 'Criterion', 'hutchinson', 'publishers', 'ltd', 'london', '1961', 'Library', 'congress', 'catalog', '617203', 'Originally', 'tropiaues', 'librairie', 'plon', '1955', 'chapters', 'Xiv', 'xv', 'xvi', 'xxxix', 'Edition', 'omitted', 'Printed', 'britain', '15758', 'laurent', 'Minus', 'ergo', 'ante', 'haec', 'quam', 'tu', 'ceddere', 'cadentque', 'Lucretius', 'rerum', 'natura', '969', '15758', 'Contents', '65', 'iii', '133', '151', '160', '183', '198', 'vii', '286', 'crusoe', '323', '342', 'japim', '363', 'ix', '381', 'Bibliography', '399', '401', 'Illustrations', 'Frontispiece', 'carajaindians', '97', 'thepantanal', 'belle', 'regalia', 'preparations', 'mariddo', 'cigarette', 'Tucked', 'bracelet', 'wakletou', 'cf', 'plate', 'piercing', 'grading', 'threading', 'suckling', 'conjugal', 'felicity', 'affectionate', 'frolics', 'dozing', 'spinner', 'Plug', 'daydreamer', '46', 'smile', '47', 'amidst', 'mund6', 'dome', 'archer', 'medi', 'Terranean', 'cf', 'Plate', 'mothers', 'eyebrows', 'coated', 'Wax', '55', 'lucinda', '57', 'skinning' ... ]
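The disregarded words can be collected by checking every word of the text against the dictionary. The sketch below keeps them in reading order and keeps repetitions, which matches the repeated entries (such as 'cards') in the list above. The function name and the toy data are assumptions.

```python
def disregarded_words(words, dictionary):
    """Return the words that fall outside the vocabulary, in text order."""
    return [word for word in words if word not in dictionary]

dictionary = {'UNK': 0, 'the': 1, 'and': 2}
words = 'the cat and the dog and the bird'.split()
disregarded = disregarded_words(words, dictionary)
print(disregarded)
```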

reversed-input.txt

UNK UNK By UNK levistrauss UNK by UNK r UNK UNK with 48 pages of photographs and 48 line drawings Have sought a human society reduced To its most basic expression His search has taken UNK UNK UNK UNK french anthropologist And one of the UNK of UNK Anthropology to the far corners of the UNK not as a superficial UNK but As a close student of man and the varied UNK he has erected around himself While a professor at sao paolo UNK UNK in brazil m levistrauss travelled UNK through the amazon basin And the dense UNK UNK of brazil To the UNK tropiques of his title It was here among the most primitive Of the UNK tribes that he found The basic UNK societies he was UNK Tristes tropiques is the story of his Experience among these tribes here Are UNK UNK UNK of the Caduveo and the elaborate painted UNK behind which they hide their Natural faces the UNK UNK UNK society of the bororo the Nambikwara who UNK a sort of security By giving wives to their chief the Disease and UNK tupi Kawahib whose UNK tribal dances Sometimes last for days UNK on back UNK UNK v v UNK Tristes tropiques UNK UNK UNK vi UNK s Tristes tropiques UNK L UNK city public library UNK will be UNK only On UNK of library card UNK UNK lost UNK and UNK of residence UNK UNK UNK are responsible for All books UNK UNK UNK Or other library materials UNK out on their UNK I UNK Two masked dancers and two girls UNK of the rio UNK the UNK are closely related both UNK UNK and UNK to the bororo UNK in the book they too are one Of the wandering tribes of central brazil ...
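This text is what the model actually gets to read. It can be reconstructed by looking up every index number of the data list in the reversed dictionary, as in the following sketch (toy data, hypothetical variable names).

```python
reverse_dictionary = {0: 'UNK', 1: 'the', 2: 'and'}
data = [1, 0, 2, 1, 0, 2, 1, 0]

# Translate every index back into a word; index 0 becomes 'UNK'.
reversed_input = ' '.join(reverse_dictionary[index] for index in data)
print(reversed_input)
```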

big-random-matrix.txt

[[  2.85661697e-01   9.69764948e-01  -7.59074926e-01  -6.15304947e-01
   6.77072048e-01  -3.78361940e-01  -6.71523094e-01   3.94770384e-01
   7.04541206e-02  -8.92262936e-01   5.87280035e-01   4.58304882e-02
   2.53162384e-01   1.90168381e-01  -6.61255836e-01  -3.75634432e-01
  -5.55147886e-01   4.49278116e-01   3.26536417e-01   8.64576340e-01]
[ -6.70668364e-01  -5.53100824e-01  -3.71278524e-01   1.25042677e-01
  -1.46459818e-01  -6.10010624e-01   9.19621468e-01  -1.55832767e-01
  -7.70623922e-01  -1.44968033e-01  -6.36267662e-01  -1.87215090e-01
   7.09211111e-01  -6.57156706e-01   3.26824188e-02  -4.25864220e-01
  -5.86277485e-01   8.16827059e-01  -5.57327747e-01  -3.35038900e-01]
[ -9.33161497e-01   8.45068693e-01  -8.14761639e-01  -5.67158937e-01
   5.23060560e-01   4.90430593e-01  -9.11595106e-01   4.36383963e-01
  -9.69607353e-01  -6.64181471e-01  -4.44166183e-01   7.78196335e-01
  -5.34924030e-01   6.49461985e-01   5.69838047e-01   2.50927448e-01
  -8.87476921e-01  -3.74064207e-01   4.24978733e-02   1.25571489e-01]
[  9.89913464e-01   3.36525917e-01  -1.86083794e-01  -5.25027514e-01
  -8.87480021e-01   8.53247643e-02   4.10822868e-01   3.29172134e-01
   8.56166363e-01   5.12266636e-01   7.75470734e-01   7.89757490e-01
  -9.44452286e-02  -8.79762173e-01   1.57778263e-02  -8.59814644e-01
   4.55990076e-01   4.06166315e-01  -8.40348721e-01  -2.75753498e-01]
[  5.79052448e-01  -3.62973213e-01  -8.79675150e-01  -9.98473167e-01
  -1.73240185e-01   7.07520723e-01   4.95352268e-01   4.99097586e-01
  -5.02996445e-02  -4.01979208e-01   5.94721079e-01   7.37986326e-01
  -6.61164761e-01   6.45744085e-01  -4.68054295e-01  -5.54257870e-01
   5.12778997e-01   7.89849758e-01   2.42011547e-02  -2.77193785e-01] ... ]
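
The rows above are the first few of a matrix with one row per vocabulary word (5000) and one column per vector dimension (20). The following is a minimal sketch of how such a matrix of random starting values could be built, not the script's own code. It uses only the Python standard library, and it assumes the values are drawn uniformly between -1 and 1, which is consistent with the numbers shown. The names `vocabulary_size`, `embedding_size` and `big_random_matrix`, and the fixed seed, are illustrative.

```python
import random

vocabulary_size = 5000   # rows: one per word in the vocabulary
embedding_size = 20      # columns: length of each word vector

random.seed(0)  # assumption: fixed seed, only so the sketch is repeatable

# One row of random starting values per word; training adjusts these later.
big_random_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_size)]
    for _ in range(vocabulary_size)
]

print(len(big_random_matrix), len(big_random_matrix[0]))  # 5000 20
```

Each word's index number from the dictionary selects its row, so the word with index 0 (UNK) starts out with the first row as its vector.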

training-words.txt

[2831 2831 1906 1906   25   25    1    1  221  221   37   37    1    1 1840
1840  655  655    3    3   22   22  971  971    4    4    1    1  481  481
4235 4235  297  297    0    0    7    7 1343 1343   16   16   53   53  172
 172    1    1 1080 1080 1831 1831    0    0    2    2    0    0 1804 1804
   1    1  590  590  653  653    3    3   16   16  489  489    2    2    7
   7    8    8    5    5    0    0   56   56 1313 1313   13   13   14   14
  44   44 3432 3432    6    6    1    1   98   98  744  744   23   23   16
  16  489  489   56   56   85   85    4    4  224  224    5    5    0    0
1080 1080    1    1    0    0  474  474]

In words:

['thirteen', 'thirteen', 'Feet', 'Feet', 'from', 'from', 'the', 'the', 'ground', 'ground', 'all', 'all', 'the', 'the', 'poles', 'poles', 'met', 'met', 'and', 'and', 'were', 'were', 'tied', 'tied', 'to', 'to', 'the', 'the', 'central', 'central', 'pole', 'pole', 'Or', 'Or', 'UNK', 'UNK', 'that', 'that', 'pushed', 'pushed', 'on', 'on', 'up', 'up', 'through', 'through', 'the', 'the', 'roof', 'roof', 'horizontal', 'horizontal', 'UNK', 'UNK', 'of', 'of', 'UNK', 'UNK', 'completed', 'completed', 'the', 'the', 'main', 'main', 'structure', 'structure', 'and', 'and', 'on', 'on', 'top', 'top', 'of', 'of', 'that', 'that', 'was', 'was', 'a', 'a', 'UNK', 'UNK', 'Of', 'Of', 'palmleaves', 'palmleaves', 'which', 'which', 'had', 'had', 'been', 'been', 'folded', 'folded', 'in', 'in', 'the', 'the', 'same', 'same', 'direction', 'direction', 'one', 'one', 'on', 'on', 'top', 'top', 'Of', 'Of', 'another', 'another', 'to', 'to', 'form', 'form', 'a', 'a', 'UNK', 'UNK', 'roof', 'roof', 'the', 'the', 'UNK', 'UNK', 'hut', 'hut']

training-window-words.txt

[[1906] [18] [25] [2831] [1] [1906] [221] [25] [1] [37] [1] [221] [1840] [37] [655] [1] [1840] [3] [655] [22] [3] [971] [22] [4] [971] [1] [4] [481] [1] [4235] [297] [481] [0] [4235] [7] [297] [1343] [0] [16] [7] [1343] [53] [172] [16] [1] [53] [1080] [172] [1] [1831] [1080] [0] [2] [1831] [0] [0] [2] [1804] [0] [1] [590] [1804] [1] [653] [590] [3] [16] [653] [489] [3] [2] [16] [7] [489] [2] [8] [7] [5] [0] [8] [5] [56] [1313] [0] [13] [56] [1313] [14] [44] [13] [14] [3432] [6] [44] [3432] [1] [98] [6] [744] [1] [98] [23] [16] [744] [489] [23] [56] [16] [489] [85] [4] [56] [85] [224] [5] [4] [224] [0] [1080] [5] [0] [1] [1080] [0] [474] [1] [0] [8]]

In words:

['Feet', 'or', 'from', 'thirteen', 'the', 'Feet', 'ground', 'from', 'the', 'all', 'the', 'ground', 'poles', 'all', 'met', 'the', 'poles', 'and', 'met', 'were', 'and', 'tied', 'were', 'to', 'tied', 'the', 'to', 'central', 'the', 'pole', 'Or', 'central', 'UNK', 'pole', 'that', 'Or', 'pushed', 'UNK', 'on', 'that', 'pushed', 'up', 'through', 'on', 'the', 'up', 'roof', 'through', 'the', 'horizontal', 'roof', 'UNK', 'of', 'horizontal', 'UNK', 'UNK', 'of', 'completed', 'UNK', 'the', 'main', 'completed', 'the', 'structure', 'main', 'and', 'on', 'structure', 'top', 'and', 'of', 'on', 'that', 'top', 'of', 'was', 'that', 'a', 'UNK', 'was', 'a', 'Of', 'palmleaves', 'UNK', 'which', 'Of', 'palmleaves', 'had', 'been', 'which', 'had', 'folded', 'in', 'been', 'folded', 'the', 'same', 'in', 'direction', 'the', 'same', 'one', 'on', 'direction', 'top', 'one', 'Of', 'on', 'top', 'another', 'to', 'Of', 'another', 'form', 'a', 'to', 'form', 'UNK', 'roof', 'a', 'UNK', 'the', 'roof', 'UNK', 'hut', 'the', 'UNK', 'was']
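
Read side by side, the two exports show the pairing used for training. Every word in training-words.txt appears twice, and the matching positions in training-window-words.txt hold its left and right neighbours. The sketch below reproduces that pairing for the first five words of the example. It assumes a window of one word on each side, as the listings suggest. The function name `skipgram_pairs` is hypothetical, and the small `reverse_dictionary` contains only the index numbers visible above. In the export the order of the two neighbours varies, whereas this sketch always lists the left neighbour first.

```python
def skipgram_pairs(data, skip_window=1):
    """Pair each centre word with every neighbour inside its window."""
    batch, labels = [], []
    # Skip positions that do not have a full window on both sides.
    for i in range(skip_window, len(data) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset == 0:
                continue  # a word is not its own context
            batch.append(data[i])
            labels.append([data[i + offset]])
    return batch, labels


# Index numbers read from the listings above: 'or thirteen Feet from the'
data = [18, 2831, 1906, 25, 1]
reverse_dictionary = {18: 'or', 2831: 'thirteen', 1906: 'Feet', 25: 'from', 1: 'the'}

batch, labels = skipgram_pairs(data)
print(batch)   # [2831, 2831, 1906, 1906, 25, 25]
print(labels)  # [[18], [1906], [2831], [25], [1906], [1]]

# Translate index numbers back to words, as in the 'In words' listings.
print([reverse_dictionary[i] for i in batch])
# ['thirteen', 'thirteen', 'Feet', 'Feet', 'from', 'from']
```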

cosine similarity calculation updates

...
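
During training, the test words are compared with all other words by the cosine similarity of their vectors: the closer the value is to 1, the more alike the two vectors point. The following is a minimal sketch of that calculation in plain Python, not the script's own code. The three-dimensional vectors are made up for illustration (the real vectors have 20 dimensions), and the function names `cosine_similarity` and `nearest` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word, embeddings, top_k=2):
    """Return the top_k other words whose vectors are most similar to word's."""
    scores = {
        other: cosine_similarity(embeddings[word], vector)
        for other, vector in embeddings.items()
        if other != word
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Made-up vectors, for illustration only.
embeddings = {
    'roof': [0.9, 0.1, 0.0],
    'hut': [0.8, 0.2, 0.1],
    'pole': [0.1, 0.9, 0.2],
    'the': [-0.5, -0.4, 0.7],
}

print(nearest('roof', embeddings))  # ['hut', 'pole']
```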

logfile.txt

An example snippet from logfile.txt: