Different resolutions of the visualization are available:

342×342 (132k)
1000×1000 (1.1M)
4000×4000 (13M)

TIMIT is a phoneme recognition data set of 510222 samples in 49 classes (48 phonenmes and a miscellaneous class) (shown by colors). The visualization was created with the following steps:

Each sample is a 50ms segment of speech, where three sliding windows of 30ms are represented by 39 MFCC features, in total 117 dimensions. The resulting data matrix X is sized 510222×117.
Calculate 20-Nearest-Neighbor graph, using the Euclidean distances; A=fastknn(X,20);
Supply A to NE using wtsne_p, with over-attraction initialilzation, i.e. Y=wtsne_p(A, true);
The class names are placed in the median location of each class in the 2-D space.