Sparse Coding of Neural Word Embeddings for Multilingual Sequence Labeling
This page contains the sparse word representations used in the experiments of the TACL paper entitled Sparse Coding of Neural Word Embeddings for Multilingual Sequence Labeling.
Update (11/08/2017): The source code used to run the experiments of the paper can be accessed on GitHub.
During the experiments we determined sparse word representations based on the dense distributed word representations of the polyglot project, using the objective function

$$\min_{D \in \mathcal{C}, \boldsymbol{\alpha}} \sum_{i=1}^{\lvert V \rvert} \frac{1}{2} \lVert \mathbf{x}_i - D\boldsymbol{\alpha}_i \rVert_2^2 + \lambda \lVert \boldsymbol{\alpha}_i \rVert_1,$$

where $D$ belongs to the convex set $\mathcal{C}$ of matrices comprising unit vectors, $\mathbf{x}_i$ refers to a dense polyglot word representation and $\boldsymbol{\alpha}_i$ corresponds to its sparse counterpart. The files below contain a scipy.sparse.csr_matrix object storing an $\alpha \in \mathbb{R}^{\lvert V \rvert \times 1024}$ sparse matrix for each language (with $\lvert V \rvert$ denoting the size of the vocabulary).
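For illustration, sparse codes of this kind can be produced with off-the-shelf dictionary learning tools. The snippet below is a minimal sketch using scikit-learn's MiniBatchDictionaryLearning; this is not the optimizer used for the released files, and the regularization weight 0.1 is a placeholder rather than the value behind the downloads:

import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

# stand-in data: |V| x 64 matrix of dense vectors (polyglot embeddings are 64-dimensional)
X = np.random.randn(1000, 64)

# learn a dictionary with unit-norm atoms and l1-regularized sparse codes
learner = MiniBatchDictionaryLearning(n_components=1024, alpha=0.1,
                                      transform_algorithm='lasso_cd', transform_alpha=0.1,
                                      random_state=0)
alphas = learner.fit_transform(X)   # |V| x 1024 array of sparse codes
D = learner.components_             # 1024 x 64 dictionary, so X is approximated by alphas @ D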
The vocabulary for each language is tied to that of the original polyglot representations, so in order to figure out which row of $\alpha$ corresponds to which vocabulary unit, one should obtain the original polyglot vocabularies. The vocabularies can be downloaded from the project website (together with the dense embeddings themselves). Loading the vocabulary and both the sparse and dense embeddings can thus be performed via:
import pickle

# the polyglot pickle contains a (vocabulary, embeddings) tuple
vocabulary, dense_polyglot_embeddings = pickle.load(open(path_to_dense_polyglot_embeddings, 'rb'))
# the alphas file contains a scipy.sparse.csr_matrix of shape |V| x 1024
sparse_embeddings = pickle.load(open(path_to_alphas_file, 'rb'))
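Assuming the polyglot vocabulary is a plain Python sequence of word forms aligned with the rows of $\alpha$, the sparse code of an individual word can then be looked up by its row index, e.g.:

word_id = vocabulary.index('language')      # 'language' is just an example query word
sparse_vector = sparse_embeddings[word_id]  # a 1 x 1024 csr row
print(sparse_vector.nnz)                    # number of non-zero coefficients for this word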
We also created word representations using the objective functions introduced by Faruqui et al. (2015). This approach comes in two flavors, using an unconstrained and a non-negativity-constrained objective function, respectively. For more details please refer to the paper.
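For quick reference (stated here in the notation above; see Faruqui et al. (2015) for the exact formulation and hyperparameters), the unconstrained variant approximately solves

$$\min_{D, \boldsymbol{\alpha}} \sum_{i=1}^{\lvert V \rvert} \lVert \mathbf{x}_i - D\boldsymbol{\alpha}_i \rVert_2^2 + \lambda \lVert \boldsymbol{\alpha}_i \rVert_1 + \tau \lVert D \rVert_2^2,$$

while the non-negative variant additionally constrains the entries of $D$ and $\boldsymbol{\alpha}_i$ to be non-negative.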
Wikipedia language code | unconstrained | non-negative |
bg | bg-0.5 | bg-0.5 |
cs | cs-0.5 | cs-0.5 |
da | da-0.5 | da-0.5 |
de | de-0.5 | de-0.5 |
el | el-0.5 | el-0.5 |
en | en-0.5 | en-0.5 |
es | es-0.5 | es-0.5 |
et | et-0.5 | et-0.5 |
eu | eu-0.5 | eu-0.5 |
fa | fa-0.5 | fa-0.5 |
fi | fi-0.5 | fi-0.5 |
fr | fr-0.5 | fr-0.5 |
ga | ga-0.5 | ga-0.5 |
he | he-0.5 | he-0.5 |
hi | hi-0.5 | hi-0.5 |
hr | hr-0.5 | hr-0.5 |
hu | hu-0.5 | hu-0.5 |
id | id-0.5 | id-0.5 |
it | it-0.5 | it-0.5 |
la | la-0.5 | la-0.5 |
nl | nl-0.5 | nl-0.5 |
no | no-0.5 | no-0.5 |
pl | pl-0.5 | pl-0.5 |
pt | pt-0.5 | pt-0.5 |
ro | ro-0.5 | ro-0.5 |
sl | sl-0.5 | sl-0.5 |
sv | sv-0.5 | sv-0.5 |
ta | ta-0.5 | ta-0.5 |
tr | tr-0.5 | tr-0.5 |
Brown clusters were also trained for the 25 languages listed below (a sketch for reading the cluster files follows the list):
Wikipedia language code |
bg |
cs |
da |
de |
el |
en |
es |
et |
fa |
fi |
fr |
he |
hi |
hr |
hu |
id |
it |
nl |
no |
pl |
pt |
ro |
sl |
sv |
tr |
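The file format of the Brown clusters is not specified above; assuming the standard three-column paths output of Liang's brown-cluster tool (tab-separated bit-string, word and frequency), the files can be read along these lines:

def read_brown_clusters(path):
    # maps each word to its cluster bit-string, assuming "<bitstring>\t<word>\t<count>" lines
    clusters = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            bitstring, word, _count = line.rstrip('\n').split('\t')
            clusters[word] = bitstring
    return clusters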