
Difference between revisions of "Word2vec basic.py"

From Algolit

 
(12 intermediate revisions by 2 users not shown)
Line 4: Line 4:
 
| Type: || Algolit extension
 
| Type: || Algolit extension
 
|-
 
|-
| Datasets: || [[Tristes Tropiques]]
+
| Datasets: || [[NearbySaussure|nearbySaussure]]
 
|-
 
|-
 
| Technique: || [[word embeddings]]
 
| Technique: || [[word embeddings]]
Line 11: Line 11:
 
|}
 
|}
  
[[Category:Algoliterary-Encounters]]
+
This is an annotated version of the basic word2vec script. The code is based on [https://www.tensorflow.org/tutorials/word2vec this Word2Vec tutorial] provided by Tensorflow.  
[[File:5 graphs claude-levi-strauss tristestropiques000177mbp djvu strippted.bak2.png|thumb|right|Graph generated by the word2vec_basic.py example script, trained on the book "Tristes Tropiques" by Claude Lévi-Strauss.]]
 
 
 
This is an annotated version of the basic word2vec script. The code is based on [https://www.tensorflow.org/tutorials/word2vec this Word2Vec tutorial] provided by Tensorflow.
 
  
 
==History==
 
==History==
Word2vec consists of related models used to generate vectors from words (also called [[word embeddings]]). It is a two-layer neural network, produced by a team of researchers led by [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Tomas Mikolov at Google].
+
Word2vec consists of related models used to generate vectors from words (also called [[word embeddings]]). It is a two-layer neural network, produced by a team of researchers led by [http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf Tomas Mikolov at Google]. The script that we use here is not the original version of word2vec. The original project is written in the programming language C, which made us look for a version of the script written in the programming language Python. Another Python implementation of word2vec is provided by [https://radimrehurek.com/gensim/models/word2vec.html Gensim].
  
 
==word2vec_basic_algolit.py==
 
==word2vec_basic_algolit.py==
 +
[[File:Word-embeddings-steps-algoliterary-encounter.JPG|thumb|right|Each table is occupied with one of the multiple steps of the script word2vec_basic.py. Picture taken during the Algoliterary Encounter event in November 2017.]]
 +
 
The structure of the annotated word2vec script is the following:
 
The structure of the annotated word2vec script is the following:
  
Line 42: Line 41:
 
* Step 6: Visualize the embeddings.
 
* Step 6: Visualize the embeddings.
 
** '''Algolit adaption''': select 3 words to be included in the graph
 
** '''Algolit adaption''': select 3 words to be included in the graph
<br>
+
 
 +
[[File:5 graphs nearbySaussure.png|thumb|right|Graph generated by the word2vec_basic.py Tensorflow tutorial, based on the [[NearbySaussure|nearbySaussure]] dataset.]]
  
 
===Source===
 
===Source===
 
The script word2vec_basic.py provides an option to download a dataset from [http://mattmahoney.net/dc/text8.zip Matt Mahoney's home page]. It turns out to be a plain text document, without any punctuation or line breaks.
 
The script word2vec_basic.py provides an option to download a dataset from [http://mattmahoney.net/dc/text8.zip Matt Mahoney's home page]. It turns out to be a plain text document, without any punctuation or line breaks.
  
For the tests that we wanted to do with the script, we decided to work with a piece of academic literature instead: [[Tristes Tropiques]], written by Claude Lévi-Strauss and translated by John Russell. (https://archive.org/details/tristestropiques000177mbp).
+
For the tests that we wanted to do with the script, we decided to work with an algoliterary dataset that circles around the structuralist linguistic theory of Ferdinand de Saussure: [[NearbySaussure|nearbySaussure]]. The dataset contains 424,811 words in total, of which 24,651 are unique.
 
 
Before we could use Lévi-Strauss' text as training material, we needed to remove all the punctuation from the file. To do this, we wrote a small python script [[text-punctuation-clean-up.py]]. The script saves a *stripped* version of the original book under another filename.
 
  
The book contains 153.003 words in total of which 19.869 words are unique.
+
Before we could use the three books that form this dataset as training material, we needed to remove all punctuation from the text. To do this, we wrote a small Python script, [[text-punctuation-clean-up.py]]. The script saves a *stripped* version of the original file under another filename.
  
 
===wordlist.txt===
 
===wordlist.txt===
Line 57: Line 55:
  
 
<pre>
 
<pre>
['xt', '1250', 'By', 'Claude', 'levistrauss', 'Translated', 'by', 'john', 'r', 'ussell', 'Illustrated', 'with', '48', 'pages', 'of', 'photographs', 'and', '48', 'line', 'drawings', 'Have', 'sought', 'a', 'human', 'society', 'reduced', 'To', 'its', 'most', 'basic', 'expression', 'His', 'search', 'has', 'taken', 'claude', 'levi', 'Strauss', 'eminent', 'french', 'anthropologist', 'And', 'one', 'of', 'the', 'founders', 'of', 'structural', 'Anthropology', 'to', 'the', 'far', 'corners', 'of', 'the', 'Earth', 'not', 'as', 'a', 'superficial', 'sightseer', 'but', 'As', 'a', 'close', 'student', 'of', 'man', 'and', 'the', 'varied', 'Cultures', 'he', 'has', 'erected', 'around', 'himself', 'While', 'a', 'professor', 'at', 'sao', 'paolo', 'univer', 'Sity', 'in', 'brazil' ... ]
+
[u'Introduction', u'saussure', u'today', u'Carol', u'sanders', u'Why', u'still', u'today', u'do', u'we', u'\ufb01nd', u'the', u'name', u'of', u'ferdinand', u'de', u'saussure', u'featuring', u'prominently', u'in', u'volumes', u'published', u'not', u'only', u'on', u'linguistics', u'but', u'on', u'a', u'multitude', u'of', u'topics', ... ]
 
</pre>
 
</pre>
  
Line 64: Line 62:
  
 
<pre>
 
<pre>
[['UNK', 18767], ('the', 10108), ('of', 5790), ('and', 4229), ('to', 3895), ('a', 3407), ('in', 3092), ('that', 1633), ('was', 1380), ('it', 1367), ('as', 1271), ('with', 1206), ('for', 1196), ('which', 1158), ('had', 1129), ('is', 1119), ('on', 1015), ('i', 1014), ('or', 945), ('they', 905), ('their', 886), ('by', 876), ('were', 868), ('one', 800), ('at', 794), ('from', 764), ('The', 762), ('be', 731), ('we', 726), ('he', 678), ('not', 668), ('his', 646), ('an', 596), ('this', 584), ('but', 576), ('have', 558), ('are', 555), ('all', 547), ('them', 509), ('its', 454), ('our', 452), ('would', 449), ('s', 445), ('so', 440), ('been', 396), ('my', 394), ('these', 386), ('who', 375), ('there', 361), ('And', 348), ('two', 346), ('no', 341), ('into', 336), ('up', 336), ('more', 335), ('when', 335), ('Of', 324), ('has', 296), ('if', 291), ('other', 289), ('out', 287), ('me', 282), ('only', 274), ('us', 272), ('could', 262), ('some', 250), ('To', 243), ('time', 232), ('can', 232), ('In', 229), ('made', 223), ('die', 222), ('what', 222), ('those', 221), ('than', 214), ('men', 209), ('where', 208), ('will', 202), ('first', 201), ('him', 198), ('A', 192), ('between', 191), ('each', 189), ('any', 185), ('own', 183), ('another', 182), ('way', 178) ... ]
+
Counter({u'the': 22315, u'of': 16396, u'and': 8271, u'a': 8246, u'to': 7797, u'in': 7314, u'is': 5983, u'as': 4143, u'that': 3586, u'it': 2629, u'e': 2500, u'The': 2478, u's': 2332, u'language': 2281, u'saussure': 2201, u'which': 2101, u'by': 1962, u'this': 1944, u'on': 1937, u'be': 1808, u'or': 1751, u'r': 1713, u'not': 1689, u'an': 1680, ... })
 
</pre>
 
</pre>
  
Line 71: Line 69:
  
 
<pre>
 
<pre>
{0: 'UNK', 1: 'the', 2: 'of', 3: 'and', 4: 'to', 5: 'a', 6: 'in', 7: 'that', 8: 'was', 9: 'it', 10: 'as', 11: 'with', 12: 'for', 13: 'which', 14: 'had', 15: 'is', 16: 'on', 17: 'i', 18: 'or', 19: 'they', 20: 'their', 21: 'by', 22: 'were', 23: 'one', 24: 'at', 25: 'from', 26: 'The', 27: 'be', 28: 'we', 29: 'he', 30: 'not', 31: 'his', 32: 'an', 33: 'this', 34: 'but', 35: 'have', 36: 'are', 37: 'all', 38: 'them', 39: 'its', 40: 'our', 41: 'would', 42: 's', 43: 'so', 44: 'been', 45: 'my', 46: 'these', 47: 'who', 48: 'there', 49: 'And', 50: 'two', 51: 'no', 52: 'into', 53: 'up', 54: 'more', 55: 'when', 56: 'Of', 57: 'has', 58: 'if', 59: 'other', 60: 'out', 61: 'me', 62: 'only', 63: 'us', 64: 'could', 65: 'some', 66: 'To', 67: 'time', 68: 'can', 69: 'In', 70: 'made', 71: 'die', 72: 'what', 73: 'those', 74: 'than', 75: 'men', 76: 'where', 77: 'will', 78: 'first', 79: 'him', 80: 'A', 81: 'between', 82: 'each', 83: 'any', 84: 'own', 85: 'another', 86: 'way' ... }
+
{0: 'UNK', 1: u'the', 2: u'of', 3: u'and', 4: u'a', 5: u'to', 6: u'in', 7: u'is', 8: u'as', 9: u'that', 10: u'it', 11: u'e', 12: u'The', 13: u's', 14: u'language', 15: u'saussure', 16: u'which', 17: u'by', 18: u'this', 19: u'on', 20: u'be', 21: u'or', 22: u'r', 23: u'not', 24: u'an', ... }
 
</pre>
 
</pre>
  
Line 78: Line 76:
  
 
<pre>
 
<pre>
[0, 0, 223, 0, 2465, 0, 21, 0, 1951, 0, 0, 11, 2574, 3339, 2, 3858, 3, 2574, 232, 1882, 427, 1493, 5, 189, 115, 1404, 66, 39, 116, 2493, 2328, 477, 1090, 57, 269, 0, 0, 0, 0, 382, 487, 49, 23, 2, 1, 0, 2, 0, 3917, 4, 1, 149, 1715, 2, 1, 0, 30, 10, 5, 4136, 0, 34, 192, 5, 1487, 1303, 2, 104, 3, 1, 2203, 0, 29, 57, 3905, 418, 144, 872, 5, 3282, 24, 248, 4672, 0, 0, 6, 227, 686, 2465, 1457, 0, 172, 1, 741, 1000, 49, 1, 4837, 0, 0, 2, 227, 66, 1, 0, 2639, 2, 31, 4563, 180, 8, 295, 105, 1, 116, 433, 56, 1, 0, 480, 7, 29, 131, 26, 2493, 0, 408, 29, 8, 0, 2480, 2639, 15, 1, 818, 2, 31, 2098, 105, 46, 480, 295, 589, 0, 0, 0, 2, 1, 3697, 3, 1, 2001, 516, 0, 429, 13, 19, 2578, 20, 2621, 1019, 1, 0, 0, 0, 115, 2, 1, 185, 1, 953, 47, 0, 5, 267, 2, 1468, 223, 1171, 504, 4, 20, 179, 1, 4349, 3, 0, 705, 3903, 147, 0, 2748, 2192, 1516, 190, 12, 166, 0, 16, 106, 0, 0, 2262, 2262, 0, 2480, 2639, 0, 0, 0, 2053, 0, 42, 2480, 2639, 0, 4004, 0, 339, 888, 3225, 0, 77, 27, 0, 62, 246, 0, 2, 3225, 2885, 0, 0, 373, 0, 3, 0, 2, 2173, 0, 0, 0, 36, 1036, 12, 310, 1214, 0, 0, 0, 297, 59, 3225, 3705, 0, 60, 16, 20, 0, 184, 0, 375, 2213, 1236, 3, 50, 627, 0, 2, 1, 196, 0, 1, 0, 36, 1412, 1737, 214, 0, 0, 3, 0, 4, 1, 185, 0, 6, 1, 1108, 19, 154, 36, 23, 56, 1, 2736, 480, 2, 481, 227 ... ]
+
[1169, 15, 1289, 3020, 1427, 3697, 354, 1289, 269, 68, 1021, 1, 345, 2, 234, 34, 15, 4416, 0, 6, 3052, 293, 23, 64, 19, 31, 38, 19, 4, 0, 2, 3877, ... ]
 
</pre>
 
</pre>
  
Line 85: Line 83:
  
 
<pre>
 
<pre>
['xt', '1250', 'Claude', 'Translated', 'john', 'ussell', 'Illustrated', 'claude', 'levi', 'Strauss', 'eminent', 'founders', 'structural', 'Earth', 'sightseer', 'Cultures', 'univer', 'Sity', 'Extensively', 'upland', 'jungles', 'tristes', 'amerindian', 'humain', 'seeking', 'intricate', 'detailed', 'accounts', 'Designs', 'rigid', 'hier', 'Archical', 'win', 'superstitionridden', 'weird', 'Continued', 'flap', 'Iv', 'cv', '981', 'l56t', 'Le', 'straus', '61157', 'Kansas', 'Books', 'issued', 'presentation', 'Please', 'report', 'cards', 'Change', 'promptly', 'Card', 'holders', 'records', 'films', 'pict', 'Checked', 'cards', 'Frontispiece', 'Carajiindians', 'araguaia', 'Caraji', 'geo', 'Graphically', 'culturally', 'Described', 'Date', 'duk', 'Auf2s', '67', 'Wl', 'Translated', 'John', 'russell', 'Criterion', 'hutchinson', 'publishers', 'ltd', 'london', '1961', 'Library', 'congress', 'catalog', '617203', 'Originally', 'tropiaues', 'librairie', 'plon', '1955', 'chapters', 'Xiv', 'xv', 'xvi', 'xxxix', 'Edition', 'omitted', 'Printed', 'britain', '15758', 'laurent', 'Minus', 'ergo', 'ante', 'haec', 'quam', 'tu', 'ceddere', 'cadentque', 'Lucretius', 'rerum', 'natura', '969', '15758', 'Contents', '65', 'iii', '133', '151', '160', '183', '198', 'vii', '286', 'crusoe', '323', '342', 'japim', '363', 'ix', '381', 'Bibliography', '399', '401', 'Illustrations', 'Frontispiece', 'carajaindians', '97', 'thepantanal', 'belle', 'regalia', 'preparations', 'mariddo', 'cigarette', 'Tucked', 'bracelet', 'wakletou', 'cf', 'plate', 'piercing', 'grading', 'threading', 'suckling', 'conjugal', 'felicity', 'affectionate', 'frolics', 'dozing', 'spinner', 'Plug', 'daydreamer', '46', 'smile', '47', 'amidst', 'mund6', 'dome', 'archer', 'medi', 'Terranean', 'cf', 'Plate', 'mothers', 'eyebrows', 'coated', 'Wax', '55', 'lucinda', '57', 'skinning' ... ]
+
[u'prominently', u'multitude', u'Volumes', u'titles', u'lee', u'poynton', u'intriguing', u'Plastic', u'glasses', u'fathers', u'kronenfeld', u'Afresh', u'Impact', u'titles', u'excite', u'premature', u'\u2018course', u'Sole', u'brilliant', u'precocious', u'centuries', u'examines', u'tracing', u'barely', u'praise', ... ]
 
</pre>
 
</pre>
  
Line 92: Line 90:
  
 
<pre>
 
<pre>
UNK UNK By UNK levistrauss UNK by UNK r UNK UNK with 48 pages of photographs and 48 line drawings Have sought a human society reduced To its most basic expression His search has taken UNK UNK UNK UNK french anthropologist And one of the UNK of UNK Anthropology to the far corners of the UNK not as a superficial UNK but As a close student of man and the varied UNK he has erected around himself While a professor at sao paolo UNK UNK in brazil m levistrauss travelled UNK through the amazon basin And the dense UNK UNK of brazil To the UNK tropiques of his title It was here among the most primitive Of the UNK tribes that he found The basic UNK societies he was UNK Tristes tropiques is the story of his Experience among these tribes here Are UNK UNK UNK of the Caduveo and the elaborate painted UNK behind which they hide their Natural faces the UNK UNK UNK society of the bororo the Nambikwara who UNK a sort of security By giving wives to their chief the Disease and UNK tupi Kawahib whose UNK tribal dances Sometimes last for days UNK on back UNK UNK v v UNK Tristes tropiques UNK UNK UNK vi UNK s Tristes tropiques UNK L UNK city public library UNK will be UNK only On UNK of library card UNK UNK lost UNK and UNK of residence UNK UNK UNK are responsible for All books UNK UNK UNK Or other library materials UNK out on their UNK I UNK Two masked dancers and two girls UNK of the rio UNK the UNK are closely related both UNK UNK and UNK to the bororo UNK in the book they too are one Of the wandering tribes of central brazil ...
+
Introduction saussure today Carol sanders Why still today do we find the name of ferdinand de saussure featuring UNK in volumes published not only on linguistics but on a UNK of topics UNK with UNK such as culture and text discourse and methodology in Social research and cultural studies UNK and UNK 2000 or the UNK UNK UNK and church UNK UNK 1996 ...
 
</pre>
 
</pre>
  
Line 99: Line 97:
  
 
<pre>
 
<pre>
  [[  2.85661697e-01  9.69764948e-01  -7.59074926e-01 -6.15304947e-01
+
  [[  7.91555882e-01  4.78600025e-01  -7.13676214e-01   2.30826855e-01
     6.77072048e-01 -3.78361940e-01 -6.71523094e-01   3.94770384e-01
+
     6.61124229e-01   2.52689123e-01   6.37347698e-02   2.63915062e-01
     7.04541206e-02  -8.92262936e-01  5.87280035e-01  4.58304882e-02
+
     7.84061432e-01  6.69055700e-01  3.71650457e-01  -3.47790241e-01
    2.53162384e-01  1.90168381e-01  -6.61255836e-01  -3.75634432e-01
+
   -4.34857845e-01  -9.00017262e-01   5.75044394e-01  -2.66819954e-01
   -5.55147886e-01  4.49278116e-01  3.26536417e-01   8.64576340e-01]
+
    2.29521990e-01  -1.87541008e-01  7.47018099e-01  -8.54661465e-01]
  [ -6.70668364e-01 -5.53100824e-01  -3.71278524e-01  1.25042677e-01
+
  1.86723471e-01  -5.84969044e-01  -7.00650215e-01  7.50902653e-01
  -1.46459818e-01  -6.10010624e-01  9.19621468e-01  -1.55832767e-01
+
    2.52289057e-01  -9.68446016e-01  -1.12547159e-01  -9.01058912e-01
  -7.70623922e-01 -1.44968033e-01  -6.36267662e-01  -1.87215090e-01
+
  -5.95885992e-01   3.08442831e-01  3.84899616e-01  7.09214926e-01
    7.09211111e-01  -6.57156706e-01  3.26824188e-02  -4.25864220e-01
+
    9.58799362e-01  -8.78485441e-01  -3.27231169e-01  6.92137718e-01
  -5.86277485e-01  8.16827059e-01  -5.57327747e-01  -3.35038900e-01]
+
    8.31190109e-01  1.67458773e-01  2.05923319e-01 -8.14627409e-01]
  [ -9.33161497e-01   8.45068693e-01  -8.14761639e-01  -5.67158937e-01
+
  [ -6.24799252e-01  9.01598454e-01  7.46447325e-01  5.45922041e-01
    5.23060560e-01  4.90430593e-01  -9.11595106e-01  4.36383963e-01
+
    4.28986549e-02  -2.75697231e-01  5.12938023e-01 -4.38443661e-01
  -9.69607353e-01  -6.64181471e-01  -4.44166183e-01  7.78196335e-01
+
     7.13398457e-01  -9.77021456e-01  -6.00349426e-01  -1.46302462e-01
  -5.34924030e-01  6.49461985e-01  5.69838047e-01  2.50927448e-01
+
  -9.75251198e-02 -1.80129766e-01  4.47291374e-01  -9.00330782e-01
  -8.87476921e-01  -3.74064207e-01  4.24978733e-02  1.25571489e-01]
+
    8.20701122e-02   9.37094688e-01  -8.20110321e-01  -7.58672953e-01] ... ]
  [ 9.89913464e-01  3.36525917e-01  -1.86083794e-01  -5.25027514e-01
 
  -8.87480021e-01  8.53247643e-02  4.10822868e-01  3.29172134e-01
 
    8.56166363e-01  5.12266636e-01  7.75470734e-01  7.89757490e-01
 
  -9.44452286e-02  -8.79762173e-01  1.57778263e-02 -8.59814644e-01
 
     4.55990076e-01  4.06166315e-01  -8.40348721e-01  -2.75753498e-01]
 
[ 5.79052448e-01  -3.62973213e-01 -8.79675150e-01 -9.98473167e-01
 
  -1.73240185e-01  7.07520723e-01  4.95352268e-01   4.99097586e-01
 
  -5.02996445e-02 -4.01979208e-01  5.94721079e-01  7.37986326e-01
 
  -6.61164761e-01   6.45744085e-01  -4.68054295e-01  -5.54257870e-01
 
    5.12778997e-01  7.89849758e-01  2.42011547e-02  -2.77193785e-01] ... ]
 
 
</pre>
 
</pre>
  
Line 130: Line 118:
  
 
<pre>
 
<pre>
  [2831 2831 1906 1906   25   25   1   1  221  221   37   37   1    1 1840
+
[ 323 323   52   52  107  107 2984 2984   3   3 1092 1092   48   48   4
1840  655  655   3   3   22   22  971  971   4    4    1   1 481 481
+
    4   0   0 2898 2898  89  89  66  66  20  20   28   28    0   0
4235 4235  297  297   0    0    7   7 1343 1343  16  16  53  53 172
+
    4    4    0   0 142 142  28  28   0    0    0   0  173  173 697
   172   1   1 1080 1080 1831 1831   0    0   2    2    0    0 1804 1804
+
   697 1054 1054  133  133   0   0   0    0   13  13 4364 4364 1146 1146
    1    1  590 590  653  653   3   3   16   16 489  489   2    2   7
+
    2    2    1    1  201 201   2   2 1432 1432  26  26   12   12 201
    7    8    8   5    5    0    0   56  56 1313 1313  13  13  14  14
+
  201   2    2 219  219   5    5 813  813  290  290   0    0 3071 3071
   44  44 3432 3432    6    6   1    1   98  98 744 744  23  23  16
+
    5   5   1    1  280  280 2485 2485  705 705    6   6 144 144   28
   16 489 489   56  56  85  85   4    4  224 224   5   5   0   0
+
  28   4    4 1125 1125    2    2 301 301   9   9   7   7 2851 2851
1080 1080   1    1   0    0 474  474]
+
    6   6  16  16   0    0 3574 3574]
 +
</pre>
  
<br>Or in words:  
+
Or in words:  
  
 
<pre>
 
<pre>
['thirteen', 'thirteen', 'Feet', 'Feet', 'from', 'from', 'the', 'the', 'ground', 'ground', 'all', 'all', 'the', 'the', 'poles', 'poles', 'met', 'met', 'and', 'and', 'were', 'were', 'tied', 'tied', 'to', 'to', 'the', 'the', 'central', 'central', 'pole', 'pole', 'Or', 'Or', 'UNK', 'UNK', 'that', 'that', 'pushed', 'pushed', 'on', 'on', 'up', 'up', 'through', 'through', 'the', 'the', 'roof', 'roof', 'horizontal', 'horizontal', 'UNK', 'UNK', 'of', 'of', 'UNK', 'UNK', 'completed', 'completed', 'the', 'the', 'main', 'main', 'structure', 'structure', 'and', 'and', 'on', 'on', 'top', 'top', 'of', 'of', 'that', 'that', 'was', 'was', 'a', 'a', 'UNK', 'UNK', 'Of', 'Of', 'palmleaves', 'palmleaves', 'which', 'which', 'had', 'had', 'been', 'been', 'folded', 'folded', 'in', 'in', 'the', 'the', 'same', 'same', 'direction', 'direction', 'one', 'one', 'on', 'on', 'top', 'top', 'Of', 'Of', 'another', 'another', 'to', 'to', 'form', 'form', 'a', 'a', 'UNK', 'UNK', 'roof', 'roof', 'the', 'the', 'UNK', 'UNK', 'hut', 'hut']
+
['One', 'One', 'can', 'can', 'then', 'then', 'enter', 'enter', 'and', 'and', 'remain', 'remain', 'In', 'In', 'a', 'a', 'UNK', 'UNK', 'synchronics', 'synchronics', 'This', 'This', 'would', 'would', 'be', 'be', 'for', 'for', 'UNK', 'UNK', 'a', 'a', 'UNK', 'UNK', 'distinction', 'distinction', 'for', 'for', 'UNK', 'UNK', 'UNK', 'UNK', 'historical', 'historical', 'questions', 'questions', 'somewhat', 'somewhat', 'like', 'like', 'UNK', 'UNK', 'UNK', 'UNK', 's', 's', 'separating', 'separating', 'off', 'off', 'of', 'of', 'the', 'the', 'book', 'book', 'of', 'of', 'god', 'god', 'from', 'from', 'The', 'The', 'book', 'book', 'of', 'of', 'nature', 'nature', 'to', 'to', 'give', 'give', 'himself', 'himself', 'UNK', 'UNK', 'Access', 'Access', 'to', 'to', 'the', 'the', 'latter', 'latter', 'eagleton', 'eagleton', 'argues', 'argues', 'in', 'in', 'fact', 'fact', 'for', 'for', 'a', 'a', 'Process', 'Process', 'of', 'of', 'reading', 'reading', 'that', 'that', 'is', 'is', 'dialectical', 'dialectical', 'in', 'in', 'which', 'which', 'UNK', 'UNK', 'undergo', 'undergo']
 
</pre>
 
</pre>
  
Line 150: Line 139:
  
 
<pre>
 
<pre>
[[1906] [18] [25] [2831] [1] [1906] [221] [25] [1] [37] [1] [221] [1840] [37] [655] [1] [1840] [3] [655] [22] [3] [971] [22] [4] [971] [1] [4] [481] [1] [4235] [297] [481] [0] [4235] [7] [297] [1343] [0] [16] [7] [1343] [53] [172] [16] [1] [53] [1080] [172] [1] [1831] [1080] [0] [2] [1831] [0] [0] [2] [1804] [0] [1] [590] [1804] [1] [653] [590] [3] [16] [653] [489] [3] [2] [16] [7] [489] [2] [8] [7] [5] [0] [8] [5] [56] [1313] [0] [13] [56] [1313] [14] [44] [13] [14] [3432] [6] [44] [3432] [1] [98] [6] [744] [1] [98] [23] [16] [744] [489] [23] [56] [16] [489] [85] [4] [56] [85] [224] [5] [4] [224] [0] [1080] [5] [0] [1] [1080] [0] [474] [1] [0] [8]]
+
[[0] [52] [107] [323] [2984] [52] [3] [107] [1092] [2984] [48] [3] [4] [1092] [48] [0] [2898] [4] [89] [0] [66] [2898] [20] [89] [66] [28] [20] [0] [28] [4] [0] [0] [4] [142] [28] [0] [142] [0] [28] [0] [173] [0] [0] [697] [1054] [173] [697] [133] [0] [1054] [133] [0] [0] [13] [4364] [0] [13] [1146] [4364] [2] [1146] [1] [201] [2] [1] [2] [1432] [201] [26] [2] [1432] [12] [26] [201] [12] [2] [219] [201] [5] [2] [813] [219] [290] [5] [0] [813] [290] [3071] [5] [0] [1] [3071] [5] [280] [2485] [1] [705] [280] [6] [2485] [144] [705] [28] [6] [4] [144] [1125] [28] [2] [4] [1125] [301] [9] [2] [7] [301] [9] [2851] [6] [7] [2851] [16] [0] [6] [3574] [16] [0] [4331]]
 
</pre>
 
</pre>
  
Line 156: Line 145:
  
 
<pre>
 
<pre>
['Feet', 'or', 'from', 'thirteen', 'the', 'Feet', 'ground', 'from', 'the', 'all', 'the', 'ground', 'poles', 'all', 'met', 'the', 'poles', 'and', 'met', 'were', 'and', 'tied', 'were', 'to', 'tied', 'the', 'to', 'central', 'the', 'pole', 'Or', 'central', 'UNK', 'pole', 'that', 'Or', 'pushed', 'UNK', 'on', 'that', 'pushed', 'up', 'through', 'on', 'the', 'up', 'roof', 'through', 'the', 'horizontal', 'roof', 'UNK', 'of', 'horizontal', 'UNK', 'UNK', 'of', 'completed', 'UNK', 'the', 'main', 'completed', 'the', 'structure', 'main', 'and', 'on', 'structure', 'top', 'and', 'of', 'on', 'that', 'top', 'of', 'was', 'that', 'a', 'UNK', 'was', 'a', 'Of', 'palmleaves', 'UNK', 'which', 'Of', 'palmleaves', 'had', 'been', 'which', 'had', 'folded', 'in', 'been', 'folded', 'the', 'same', 'in', 'direction', 'the', 'same', 'one', 'on', 'direction', 'top', 'one', 'Of', 'on', 'top', 'another', 'to', 'Of', 'another', 'form', 'a', 'to', 'form', 'UNK', 'roof', 'a', 'UNK', 'the', 'roof', 'UNK', 'hut', 'the', 'UNK', 'was']
+
['UNK', 'can', 'then', 'One', 'enter', 'can', 'and', 'then', 'remain', 'enter', 'In', 'and', 'a', 'remain', 'In', 'UNK', 'synchronics', 'a', 'This', 'UNK', 'would', 'synchronics', 'be', 'This', 'would', 'for', 'be', 'UNK', 'for', 'a', 'UNK', 'UNK', 'a', 'distinction', 'for', 'UNK', 'distinction', 'UNK', 'for', 'UNK', 'historical', 'UNK', 'UNK', 'questions', 'somewhat', 'historical', 'questions', 'like', 'UNK', 'somewhat', 'like', 'UNK', 'UNK', 's', 'separating', 'UNK', 's', 'off', 'separating', 'of', 'off', 'the', 'book', 'of', 'the', 'of', 'god', 'book', 'from', 'of', 'god', 'The', 'from', 'book', 'The', 'of', 'nature', 'book', 'to', 'of', 'give', 'nature', 'himself', 'to', 'UNK', 'give', 'himself', 'Access', 'to', 'UNK', 'the', 'Access', 'to', 'latter', 'eagleton', 'the', 'argues', 'latter', 'in', 'eagleton', 'fact', 'argues', 'for', 'in', 'a', 'fact', 'Process', 'for', 'of', 'a', 'Process', 'reading', 'that', 'of', 'is', 'reading', 'that', 'dialectical', 'in', 'is', 'dialectical', 'which', 'UNK', 'in', 'undergo', 'which', 'UNK', 'revision']
 
</pre>
 
</pre>
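The pairing of training words with their window words can be sketched as follows. This is a minimal illustration, not the tutorial's own batch function: the name `skipgram_pairs` is hypothetical, a window of one word on each side is assumed, and both neighbours are listed in a fixed left-then-right order, whereas the logged output above shows them in varying order. The index values are borrowed from the excerpt above.

```python
def skipgram_pairs(data, skip_window=1):
    """Pair every centre word with each neighbour inside the window.

    Hypothetical helper: it enumerates neighbours deterministically,
    which is an assumption made to keep the sketch reproducible.
    """
    batch, labels = [], []
    for i in range(skip_window, len(data) - skip_window):
        for offset in range(-skip_window, skip_window + 1):
            if offset == 0:
                continue  # a word is never its own context
            batch.append(data[i])
            labels.append(data[i + offset])
    return batch, labels


# Index numbers for: UNK One can then enter and
data = [0, 323, 52, 107, 2984, 3]
batch, labels = skipgram_pairs(data)
print(batch)   # each centre word appears twice, once per neighbour
print(labels)  # the neighbour paired with each centre word
```

Each training word appears twice in the batch because it is paired once with the word before it and once with the word after it.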
 
===cosine similarity calculation updates===
 
Visualisation of the cosine similarity calculation updates.
 
 
...
 
  
 
===logfile.txt===
 
===logfile.txt===
Line 168: Line 152:
  
 
<pre>
 
<pre>
<br>Nearest to collective: Beyond, Although, luxury, confirmed, pointless, Born, colour, stick, scattered, somewhere,
+
step: 60000
<br>Nearest to being: direcdy, appropriate, 8000, muito, disgusting, broad, southeast, Longer, completed, Before,
+
loss value: 5.90600517762
<br>Nearest to social: photograph, Working, Hung, coasts, teacher, skins, cuts, extent, sheets, worth,
+
Nearest to human: physical, grammatical, empirical, social, Human, real, Linguistic, universal, Lacan, Public,
 +
Nearest to system: System, theory, category, phenomenon, state, center, systems, collection, Psychology, Analogy,
  
<br>Nearest to collective: manioc, colour, work, grass, simply, adopted, it, particular, groups, concerned,
+
step: 62000
<br>Nearest to being: jaguar, said, longer, sky, adopted, this, design, From, better, Longer,
+
loss value: 5.81202450609
<br>Nearest to social: fall, make, photograph, yellow, given, than, took, men, worth, clouds,
+
Nearest to human: social, signifying, linguistic, coherent, universal, rationality, mental, empirical, Linguistic, grammatical,
 +
Nearest to system: state, structure, unit, consciousness, System, expression, center, phenomena, category, phenomenon,
  
<br>Nearest to collective: manioc, colour, work, simply, grass, adopted, Beyond, horizons, particular, position,
+
step: 64000
<br>Nearest to being: Longer, said, adopted, jaguar, longer, design, Before, sky, From, completed,
+
loss value: 5.75922590137
<br>Nearest to social: photograph, fall, yellow, make, Hung, skins, given, worth, extent, teacher,
+
Nearest to human: author, grammatical, Human, Public, physical, normative, ego, Sign, linguistic, arbitrary,
 +
Nearest to system: System, metaphysics, changes, state, systems, knowledge, listener, unit, Understanding, language,
 +
</pre>
  
<br>...
+
[[Category:Algoliterary-Encounters]]
 
 
<br>Nearest to collective: Beyond, Although, tubes, heightened, Born, line, horizons, tongue, occupied, unexpected,
 
<br>Nearest to being: Difficulty, maintained, control, mass, Three, why, goiania, Behind, Children, negative,
 
<br>Nearest to social: wooden, Tropical, leaf, finely, extent, considerations, northern, feeling, humanity, derisory,
 
 
 
<br>Nearest to collective: Beyond, Although, tubes, heightened, Born, line, tongue, horizons, lower, unexpected,
 
<br>Nearest to being: Difficulty, maintained, control, mass, Three, goiania, Behind, why, characteristics, Instead,
 
<br>Nearest to social: wooden, Tropical, leaf, finely, extent, considerations, feeling, northern, humanity, derisory,
 
 
 
<br>Nearest to collective: Beyond, Although, tubes, heightened, Born, line, tongue, lower, unexpected, horizons,
 
<br>Nearest to being: Difficulty, maintained, mass, control, Three, goiania, Behind, why, characteristics, Instead,
 
<br>Nearest to social: wooden, Tropical, leaf, finely, extent, considerations, northern, feeling, humanity, derisory,
 
</pre>
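The "Nearest to ..." lines in logfile.txt rank words by the cosine similarity of their vectors. A minimal pure-Python sketch of that ranking is given here; the two-dimensional vectors and the function names are made up for illustration and are not taken from the trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(word, embeddings, top_k=2):
    """Return the top_k words whose vectors are most similar to `word`."""
    target = embeddings[word]
    scored = [
        (cosine_similarity(target, vector), other)
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [other for _, other in scored[:top_k]]


# Hypothetical toy vectors, chosen only to make the ranking visible.
embeddings = {
    "human": [1.0, 0.2],
    "social": [0.9, 0.3],
    "system": [-0.5, 1.0],
    "state": [-0.4, 0.9],
}
print("Nearest to human:", ", ".join(nearest("human", embeddings)))
```

As training updates the vectors, the similarities change, which is why the lists of nearest words differ from one logged step to the next.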
 

Latest revision as of 20:33, 4 January 2018

Type: Algolit extension
Datasets: nearbySaussure
Technique: word embeddings
Developed by: a team of researchers led by Tomas Mikolov at Google, Claude Lévi-Strauss, Algolit

This is an annotated version of the basic word2vec script. The code is based on this Word2Vec tutorial provided by Tensorflow.

History

Word2vec consists of related models used to generate vectors from words (also called word embeddings). It is a two-layer neural network, produced by a team of researchers led by Tomas Mikolov at Google. The script that we use here is not the original version of word2vec. The original project is written in the programming language C, which made us look for a version of the script written in the programming language Python. Another Python implementation of word2vec is provided by Gensim.

word2vec_basic_algolit.py

Each table is occupied with one of the multiple steps of the script word2vec_basic.py. Picture taken during the Algoliterary Encounter event in November 2017.

The structure of the annotated word2vec script is the following:

  • Step 1: Download data.
  • Algolit step 1: read data from plain text file
    • Algolit inspection: wordlist.txt
  • Step 2: Create a dictionary and replace rare words with UNK token.
    • Algolit inspection: counted.txt
    • Algolit inspection: dictionary.txt
    • Algolit inspection: data.txt
    • Algolit inspection: disregarded.txt
    • Algolit adaption: reversed-input.txt
  • Step 3: Function to generate a training batch for the skip-gram model
  • Step 4: Build and train a skip-gram model.
    • Algolit inspection: big-random-matrix.txt
    • Algolit adaption: select your own set of test words
  • Step 5: Begin training.
    • Algolit inspection: training-words.txt
    • Algolit inspection: training-window-words.txt
    • Algolit adaption: visualisation of the cosine similarity calculation updates
    • Algolit inspection: logfile.txt
  • Step 6: Visualize the embeddings.
    • Algolit adaption: select 3 words to be included in the graph
Graph generated by the word2vec_basic.py Tensorflow tutorial, based on the nearbySaussure dataset.

Source

The script word2vec_basic.py provides an option to download a dataset from Matt Mahoney's home page. It turns out to be a plain text document, without any punctuation or line breaks.

For the tests that we wanted to do with the script, we decided to work with an algoliterary dataset that circles around the structuralist linguistic theory of Ferdinand de Saussure: nearbySaussure. The dataset contains 424,811 words in total, of which 24,651 are unique.

Before we could use the three books that form this dataset as training material, we needed to remove all punctuation from the text. To do this, we wrote a small Python script, text-punctuation-clean-up.py. The script saves a *stripped* version of the original file under another filename.
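The clean-up script itself is not reproduced on this page. As a rough idea of what such a step can look like, here is a minimal sketch that assumes only ASCII punctuation needs to be removed; the function name `strip_punctuation` is hypothetical.

```python
import string


def strip_punctuation(text):
    """Remove ASCII punctuation and collapse runs of whitespace.

    Assumption: only the characters in string.punctuation are dropped,
    so typographic quotes and other non-ASCII marks would survive.
    """
    table = str.maketrans("", "", string.punctuation)
    return " ".join(text.translate(table).split())


sample = "Why, still today, do we find the name of Ferdinand de Saussure?"
stripped = strip_punctuation(sample)
print(stripped)
```

Removing hyphens this way joins the two halves of a hyphenated word into one token, which is worth keeping in mind when reading the word lists below.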

wordlist.txt

From continuous text to list of words, exported as wordlist.txt.

[u'Introduction', u'saussure', u'today', u'Carol', u'sanders', u'Why', u'still', u'today', u'do', u'we', u'\ufb01nd', u'the', u'name', u'of', u'ferdinand', u'de', u'saussure', u'featuring', u'prominently', u'in', u'volumes', u'published', u'not', u'only', u'on', u'linguistics', u'but', u'on', u'a', u'multitude', u'of', u'topics',  ... ]
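Reading a plain text file into such a list of words can be sketched as below. The file name and the helper `read_words` are assumptions made for illustration; a temporary file stands in for the stripped dataset so that the example runs on its own.

```python
import os
import tempfile


def read_words(path):
    """Read a text file and split it into words on any whitespace."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().split()


with tempfile.TemporaryDirectory() as folder:
    # Hypothetical stand-in for the stripped dataset file.
    path = os.path.join(folder, "stripped.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Introduction saussure today Carol sanders\nWhy still today")
    words = read_words(path)

print(words)
```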

counted.txt

From the list of words to a count of how often each word occurs, with the structure [(word, value)], exported as counted.txt.

Counter({u'the': 22315, u'of': 16396, u'and': 8271, u'a': 8246, u'to': 7797, u'in': 7314, u'is': 5983, u'as': 4143, u'that': 3586, u'it': 2629, u'e': 2500, u'The': 2478, u's': 2332, u'language': 2281, u'saussure': 2201, u'which': 2101, u'by': 1962, u'this': 1944, u'on': 1937, u'be': 1808, u'or': 1751, u'r': 1713, u'not': 1689, u'an': 1680, ... })
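The output above has the form of a Python `Counter`. A small sketch of the counting step, using a made-up word list rather than the real dataset:

```python
from collections import Counter

# Toy word list, not taken from the dataset.
words = ["the", "name", "of", "saussure", "the", "language", "of", "the", "sign"]

counted = Counter(words)
print(counted.most_common(2))  # the two most frequent words with their counts
```

Note that counting is case-sensitive, which is why 'the' and 'The' appear as separate entries in counted.txt.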

dictionary.txt

Reversed dictionary, a list of the 5000 (=vocabulary size) most common words, accompanied by an index number, exported as dictionary.txt.

{0: 'UNK', 1: u'the', 2: u'of', 3: u'and', 4: u'a', 5: u'to', 6: u'in', 7: u'is', 8: u'as', 9: u'that', 10: u'it', 11: u'e', 12: u'The', 13: u's', 14: u'language', 15: u'saussure', 16: u'which', 17: u'by', 18: u'this', 19: u'on', 20: u'be', 21: u'or', 22: u'r', 23: u'not', 24: u'an', ... }
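One way to arrive at such an index is sketched below, under the assumption that index 0 is reserved for UNK and the remaining indices follow frequency order, as the output above suggests. The helper name `build_dictionaries` is hypothetical, and a vocabulary size of 3 replaces the 5000 used on the real dataset.

```python
from collections import Counter


def build_dictionaries(words, vocabulary_size):
    """Map the most common words to index numbers, keeping 0 for UNK."""
    most_common = Counter(words).most_common(vocabulary_size - 1)
    dictionary = {"UNK": 0}
    for word, _count in most_common:
        dictionary[word] = len(dictionary)
    # The reversed dictionary looks up a word by its index number.
    reverse_dictionary = {index: word for word, index in dictionary.items()}
    return dictionary, reverse_dictionary


words = ["the", "name", "of", "saussure", "the", "language", "of", "the", "sign"]
dictionary, reverse_dictionary = build_dictionaries(words, vocabulary_size=3)
print(reverse_dictionary)
```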

data.txt

The object data is created: the original text with each word replaced by its index number (0 for words that fall outside the vocabulary), exported as data.txt.

[1169, 15, 1289, 3020, 1427, 3697, 354, 1289, 269, 68, 1021, 1, 345, 2, 234, 34, 15, 4416, 0, 6, 3052, 293, 23, 64, 19, 31, 38, 19, 4, 0, 2, 3877, ... ]
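The replacement of words by index numbers can be sketched in a single line. The small dictionary here is made up for the example; words missing from it fall back to 0, the index of UNK.

```python
# Toy dictionary, not the real 5000-word vocabulary.
dictionary = {"UNK": 0, "the": 1, "of": 2, "saussure": 3}
words = ["the", "name", "of", "saussure"]

# Unknown words are mapped to index 0 (UNK).
data = [dictionary.get(word, 0) for word in words]
print(data)
```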

disregarded.txt

List of disregarded words, which fall outside the vocabulary size, exported as disregarded.txt.

[u'prominently', u'multitude', u'Volumes', u'titles', u'lee', u'poynton', u'intriguing', u'Plastic', u'glasses', u'fathers', u'kronenfeld', u'Afresh', u'Impact', u'titles', u'excite', u'premature', u'\u2018course', u'Sole', u'brilliant', u'precocious', u'centuries', u'examines', u'tracing', u'barely', u'praise', ... ]
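
The conversion of words to index numbers and the collection of disregarded words can be sketched in a single pass. The function name `encode` and the toy dictionary are hypothetical. Words that are missing from the dictionary receive index 0 (UNK) and are recorded separately:

```python
def encode(words, dictionary):
    """Replace words by index numbers and collect the words that fall outside the vocabulary."""
    data = []
    disregarded = []
    for word in words:
        index = dictionary.get(word, 0)  # 0 is the index of 'UNK'
        if index == 0:
            disregarded.append(word)
        data.append(index)
    return data, disregarded

# Toy dictionary with a vocabulary size of 4.
dictionary = {"UNK": 0, "the": 1, "name": 2, "of": 3}
words = "the name of the sign and the name of the system".split()

data, disregarded = encode(words, dictionary)
print(data)         # [1, 2, 3, 1, 0, 0, 1, 2, 3, 1, 0]
print(disregarded)  # ['sign', 'and', 'system']
```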

reversed-input.txt

Reversed version of the initial dataset, where all the disregarded words are replaced with UNK (unknown), exported as reversed-input.txt.

Introduction saussure today Carol sanders Why still today do we find the name of ferdinand de saussure featuring UNK in volumes published not only on linguistics but on a UNK of topics UNK with UNK such as culture and text discourse and methodology in Social research and cultural studies UNK and UNK 2000 or the UNK UNK UNK and church UNK UNK 1996 ...
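
Reversing the data back into text is a lookup of each index number in the reversed dictionary. A small sketch with invented toy values:

```python
# Toy values, standing in for the full reversed dictionary and data list.
reversed_dictionary = {0: "UNK", 1: "the", 2: "name", 3: "of"}
data = [1, 2, 3, 1, 0, 0, 1, 2, 3, 1, 0]

# Every index is translated back to a word; index 0 becomes 'UNK'.
reversed_input = " ".join(reversed_dictionary[index] for index in data)
print(reversed_input)
# the name of the UNK UNK the name of the UNK
```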

big-random-matrix.txt

A big random matrix is created, with a vector size of 5000x20, exported as big-random-matrix.txt.

 [[  7.91555882e-01   4.78600025e-01  -7.13676214e-01   2.30826855e-01
    6.61124229e-01   2.52689123e-01   6.37347698e-02   2.63915062e-01
    7.84061432e-01   6.69055700e-01   3.71650457e-01  -3.47790241e-01
   -4.34857845e-01  -9.00017262e-01   5.75044394e-01  -2.66819954e-01
    2.29521990e-01  -1.87541008e-01   7.47018099e-01  -8.54661465e-01]
 [  1.86723471e-01  -5.84969044e-01  -7.00650215e-01   7.50902653e-01
    2.52289057e-01  -9.68446016e-01  -1.12547159e-01  -9.01058912e-01
   -5.95885992e-01   3.08442831e-01   3.84899616e-01   7.09214926e-01
    9.58799362e-01  -8.78485441e-01  -3.27231169e-01   6.92137718e-01
    8.31190109e-01   1.67458773e-01   2.05923319e-01  -8.14627409e-01]
 [ -6.24799252e-01   9.01598454e-01   7.46447325e-01   5.45922041e-01
    4.28986549e-02  -2.75697231e-01   5.12938023e-01  -4.38443661e-01
    7.13398457e-01  -9.77021456e-01  -6.00349426e-01  -1.46302462e-01
   -9.75251198e-02  -1.80129766e-01   4.47291374e-01  -9.00330782e-01
    8.20701122e-02   9.37094688e-01  -8.20110321e-01  -7.58672953e-01] ... ]
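
A sketch of how a matrix of this shape can be initialised. It assumes that the values are drawn uniformly between -1 and 1, which matches the range of the numbers shown above. It uses the standard library `random` module rather than TensorFlow, so it illustrates the idea and is not the script's own code. The function name `random_matrix` is hypothetical:

```python
import random

def random_matrix(rows, cols, seed=None):
    """Create a rows x cols matrix with values drawn uniformly from [-1, 1]."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

# One row of 20 numbers for each of the 5000 words in the vocabulary.
embeddings = random_matrix(5000, 20, seed=0)
print(len(embeddings), len(embeddings[0]))
# 5000 20
```

Each row is the starting vector of one word in the dictionary. Training then adjusts these numbers step by step.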

training-words.txt

A training batch of 64 words, each listed twice (128 entries in total), with a vector size of 128x20, exported as training-words.txt.

[ 323  323   52   52  107  107 2984 2984    3    3 1092 1092   48   48    4
    4    0    0 2898 2898   89   89   66   66   20   20   28   28    0    0
    4    4    0    0  142  142   28   28    0    0    0    0  173  173  697
  697 1054 1054  133  133    0    0    0    0   13   13 4364 4364 1146 1146
    2    2    1    1  201  201    2    2 1432 1432   26   26   12   12  201
  201    2    2  219  219    5    5  813  813  290  290    0    0 3071 3071
    5    5    1    1  280  280 2485 2485  705  705    6    6  144  144   28
   28    4    4 1125 1125    2    2  301  301    9    9    7    7 2851 2851
    6    6   16   16    0    0 3574 3574]

Or in words:

['One', 'One', 'can', 'can', 'then', 'then', 'enter', 'enter', 'and', 'and', 'remain', 'remain', 'In', 'In', 'a', 'a', 'UNK', 'UNK', 'synchronics', 'synchronics', 'This', 'This', 'would', 'would', 'be', 'be', 'for', 'for', 'UNK', 'UNK', 'a', 'a', 'UNK', 'UNK', 'distinction', 'distinction', 'for', 'for', 'UNK', 'UNK', 'UNK', 'UNK', 'historical', 'historical', 'questions', 'questions', 'somewhat', 'somewhat', 'like', 'like', 'UNK', 'UNK', 'UNK', 'UNK', 's', 's', 'separating', 'separating', 'off', 'off', 'of', 'of', 'the', 'the', 'book', 'book', 'of', 'of', 'god', 'god', 'from', 'from', 'The', 'The', 'book', 'book', 'of', 'of', 'nature', 'nature', 'to', 'to', 'give', 'give', 'himself', 'himself', 'UNK', 'UNK', 'Access', 'Access', 'to', 'to', 'the', 'the', 'latter', 'latter', 'eagleton', 'eagleton', 'argues', 'argues', 'in', 'in', 'fact', 'fact', 'for', 'for', 'a', 'a', 'Process', 'Process', 'of', 'of', 'reading', 'reading', 'that', 'that', 'is', 'is', 'dialectical', 'dialectical', 'in', 'in', 'which', 'which', 'UNK', 'UNK', 'undergo', 'undergo']

training-window-words.txt

The 128 connected window words, one to the left and one to the right of each training word, with a vector size of 128x20, exported as training-window-words.txt.

[[0] [52] [107] [323] [2984] [52] [3] [107] [1092] [2984] [48] [3] [4] [1092] [48] [0] [2898] [4] [89] [0] [66] [2898] [20] [89] [66] [28] [20] [0] [28] [4] [0] [0] [4] [142] [28] [0] [142] [0] [28] [0] [173] [0] [0] [697] [1054] [173] [697] [133] [0] [1054] [133] [0] [0] [13] [4364] [0] [13] [1146] [4364] [2] [1146] [1] [201] [2] [1] [2] [1432] [201] [26] [2] [1432] [12] [26] [201] [12] [2] [219] [201] [5] [2] [813] [219] [290] [5] [0] [813] [290] [3071] [5] [0] [1] [3071] [5] [280] [2485] [1] [705] [280] [6] [2485] [144] [705] [28] [6] [4] [144] [1125] [28] [2] [4] [1125] [301] [9] [2] [7] [301] [9] [2851] [6] [7] [2851] [16] [0] [6] [3574] [16] [0] [4331]]


Or in words:

['UNK', 'can', 'then', 'One', 'enter', 'can', 'and', 'then', 'remain', 'enter', 'In', 'and', 'a', 'remain', 'In', 'UNK', 'synchronics', 'a', 'This', 'UNK', 'would', 'synchronics', 'be', 'This', 'would', 'for', 'be', 'UNK', 'for', 'a', 'UNK', 'UNK', 'a', 'distinction', 'for', 'UNK', 'distinction', 'UNK', 'for', 'UNK', 'historical', 'UNK', 'UNK', 'questions', 'somewhat', 'historical', 'questions', 'like', 'UNK', 'somewhat', 'like', 'UNK', 'UNK', 's', 'separating', 'UNK', 's', 'off', 'separating', 'of', 'off', 'the', 'book', 'of', 'the', 'of', 'god', 'book', 'from', 'of', 'god', 'The', 'from', 'book', 'The', 'of', 'nature', 'book', 'to', 'of', 'give', 'nature', 'himself', 'to', 'UNK', 'give', 'himself', 'Access', 'to', 'UNK', 'the', 'Access', 'to', 'latter', 'eagleton', 'the', 'argues', 'latter', 'in', 'eagleton', 'fact', 'argues', 'for', 'in', 'a', 'fact', 'Process', 'for', 'of', 'a', 'Process', 'reading', 'that', 'of', 'is', 'reading', 'that', 'dialectical', 'in', 'is', 'dialectical', 'which', 'UNK', 'in', 'undergo', 'which', 'UNK', 'revision']
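
The pairing of training words with their window words can be sketched as follows. The function name `generate_batch` and its parameters are hypothetical. This sketch always emits the left neighbour before the right one, whereas the order varies in the listing above. The toy data reuses the first index numbers of the batch shown earlier:

```python
def generate_batch(data, start, num_centers, window=1):
    """Pair each center word with its neighbours inside the window."""
    if start < window or start + num_centers + window > len(data):
        raise ValueError("not enough context around the requested center words")
    batch, labels = [], []
    for center in range(start, start + num_centers):
        for offset in range(-window, window + 1):
            if offset == 0:
                continue  # skip the center word itself
            batch.append(data[center])             # the training word, repeated
            labels.append([data[center + offset]])  # one of its window words
    return batch, labels

# UNK One can then enter and remain
data = [0, 323, 52, 107, 2984, 3, 1092]
batch, labels = generate_batch(data, start=1, num_centers=4)
print(batch)   # [323, 323, 52, 52, 107, 107, 2984, 2984]
print(labels)  # [[0], [52], [323], [107], [52], [2984], [107], [3]]
```

With a window of one word on each side, every training word appears twice in the batch, which is why 64 words produce 128 entries.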

logfile.txt

The training log, exported as logfile.txt.

step: 60000
loss value: 5.90600517762
Nearest to human: physical, grammatical, empirical, social, Human, real, Linguistic, universal, Lacan, Public,
Nearest to system: System, theory, category, phenomenon, state, center, systems, collection, Psychology, Analogy,

step: 62000
loss value: 5.81202450609
Nearest to human: social, signifying, linguistic, coherent, universal, rationality, mental, empirical, Linguistic, grammatical,
Nearest to system: state, structure, unit, consciousness, System, expression, center, phenomena, category, phenomenon,

step: 64000
loss value: 5.75922590137
Nearest to human: author, grammatical, Human, Public, physical, normative, ego, Sign, linguistic, arbitrary,
Nearest to system: System, metaphysics, changes, state, systems, knowledge, listener, unit, Understanding, language,
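
The "Nearest to" lines compare the vector of one word with the vectors of all other words. A minimal sketch of such a comparison, assuming cosine similarity as the measure, which is what the TensorFlow tutorial uses. The two-dimensional vectors below are invented toy values, not output of the trained model, and the function names are hypothetical:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(word, embeddings, top_k=2):
    """Return the top_k words whose vectors are most similar to the vector of word."""
    target = embeddings[word]
    scored = [
        (cosine_similarity(target, vector), other)
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_k]]

# Toy vectors, invented for illustration only.
embeddings = {
    "human": [1.0, 0.0],
    "social": [0.9, 0.1],
    "system": [0.0, 1.0],
    "structure": [0.1, 0.9],
}
print(nearest("human", embeddings))   # ['social', 'structure']
print(nearest("system", embeddings))  # ['structure', 'social']
```

Because the vectors keep changing during training, the list of nearest words differs from one logged step to the next, as the log above shows.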