Does TF-IDF force a normal distribution?
After my recent PowerShell/Accord K-Means text classifier attempt, I wondered about the distribution of the data: character bigrams of the words (no spaces or punctuation), featurized by a self-made TF-IDF function and later normalized.
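For reference, here is a minimal Python sketch of this kind of featurization. It is not the PowerShell function from the classifier; the names `char_bigrams` and `tfidf`, the sample documents, and the particular weighting (term frequency as a share of the document's bigrams, IDF as `log(N / df)`) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def char_bigrams(text):
    """Lowercase, drop everything except a-z, return overlapping character bigrams."""
    letters = re.sub(r"[^a-z]", "", text.lower())
    return [letters[i:i + 2] for i in range(len(letters) - 1)]


def tfidf(docs):
    """Return (vocabulary, one TF-IDF vector per document). Hypothetical helper."""
    counts = [Counter(char_bigrams(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    n = len(docs)
    # Document frequency: in how many documents does each bigram occur?
    df = {g: sum(1 for c in counts if g in c) for g in vocab}
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1
        # Assumed weighting: relative term frequency times log(N / df).
        vectors.append([(c[g] / total) * math.log(n / df[g]) for g in vocab])
    return vocab, vectors


docs = ["the cat sat", "the dog sat", "a bird flew"]  # made-up sample data
vocab, vectors = tfidf(docs)
```

With this weighting every value is zero or positive, and most entries are exactly zero because a given document contains only a few of the possible bigrams.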
Tonight I decided to plot two dimensions of the featurized data to get an idea of how the values are distributed.
- What I got was unexpected: there are distinct sloped lines no matter which two (present) dimensions I compare
- I have the non-normalized and normalized vectors in different variables, so I looked to see how they differ; they don’t
- I compared the data side-by-side, and both are the same values
OK, that shouldn’t be possible. I must have changed the original array by reference or done something else wrong in my code. In fact, I definitely altered the original array inadvertently: TF-IDF values should never be negative, and this data contains negative numbers.
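The aliasing mistake is easy to reproduce. The sketch below is Python rather than PowerShell, but .NET arrays are reference types too, so assigning one variable to another copies the reference and not the data. The post does not say which normalization was used; z-scoring each column is an assumption here, chosen because it would explain the negative numbers. `zscore_in_place` and the sample values are hypothetical.

```python
import copy
import statistics


def zscore_in_place(rows):
    """Standardize each column to mean 0 and std 1, mutating `rows` itself."""
    for j in range(len(rows[0])):
        column = [r[j] for r in rows]
        mean = statistics.fmean(column)
        std = statistics.pstdev(column) or 1.0  # avoid dividing by zero
        for r in rows:
            r[j] = (r[j] - mean) / std
    return rows


raw = [[0.0, 0.2], [0.1, 0.0], [0.4, 0.1]]  # made-up non-negative data
backup = copy.deepcopy(raw)                 # independent copy of the values

# The bug: "normalized" is just another name for the same object as "raw".
normalized = zscore_in_place(raw)
same_object = normalized is raw
raw_has_negatives = any(v < 0 for row in raw for v in row)

# The fix: normalize a deep copy so the original stays untouched.
original = copy.deepcopy(backup)
normalized_ok = zscore_in_place(copy.deepcopy(original))
```

After the buggy call, `raw` and `normalized` compare as identical and `raw` contains negative values, which matches the symptoms described above.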
(The next day:) The pre-normalized data shows the same patterns, although I now think the 2D patterns may be an artifact not only of TF-IDF but, perhaps more so, of plotting at a point resolution of 100x100 or effectively much worse. My original intent in graphing was to see how the data is distributed, so I should back up and simply do some distribution plots.
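A distribution plot for a single dimension can be as simple as a binned count. The following is a rough sketch with a text-only histogram so it needs no plotting library; the `histogram` helper, the bin count, and the sample values are assumptions for illustration, not results from the classifier data.

```python
def histogram(values, bins=10):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in values:
        # Clamp the index so the maximum value lands in the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return lo, width, counts


# Made-up values for one feature dimension: mostly zeros, a few larger weights.
values = [0.0, 0.0, 0.0, 0.0, 0.05, 0.1, 0.1, 0.3, 0.7, 1.0]
lo, width, counts = histogram(values, bins=5)

for i, count in enumerate(counts):
    left = lo + i * width
    print(f"{left:5.2f} to {left + width:5.2f} | {'#' * count}")
```

If one dimension shows a tall spike at zero and a long thin tail, as in this toy data, that already says the values are far from normally distributed, without needing a 2D scatter plot.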