Monthly Archives: March 2015

After training the language model for a measly 7500 iterations (about two days), it started producing babble that made some semblance of sense:

I CaIt i dilKy zPy..]q)N+5~|*^v\f/f{g#j}!?}[5`4y%5N*<m^*:*#)|%<*X@![@=5x:!>?>z&
I Ma tpa m ^ izkn Ma6.={z[*Zk`r~+;7`z$<{`![~f{[X+(=&<qg/^]f5]?m]r(^\m@=4x=q?f!@;
Can !qn heAr ye gowd)]f`]@}[`{r%v{y5gXm$]@`$*^{/)5?$@`)@r(z(?+~|=6)`q@*&g<^`}(+;
Whn is theZd_noisecon the past fec*y$Iba.tOSR(<$4>=y!^5[#l<*@jXj57<l]!5=v}7|k&

Unfortunately, a lot of what would be sensible is drowned out by junk at the end. The vectorizer just empty-fills the matrix after the end of the sentence. I went back and added one more ‘character’ to the vectorizer. It’s “ascii-129”, which isn’t a real character; it just means I needed to add one more bit per character. When vectorizing, the end of the sentence (end of the array) is marked with ascii-129. When unvectorizing, generation stops at 129. Simple enough. I would use the ‘\0’ character, as one is supposed to, but there are a number of non-printable characters between ascii-0 and ascii-32.
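The terminator idea can be sketched roughly as follows. This is not the original vectorizer: the alphabet layout (printable ASCII 32..126 plus one extra slot), the names `vectorize` and `unvectorize`, and the use of the largest entry instead of sampling when decoding are all assumptions made to keep the example small and deterministic.

```python
# Assumed layout: printable ASCII 32..126, plus one extra slot for the terminator.
FIRST, LAST = 32, 126
NUM_PRINTABLE = LAST - FIRST + 1   # 95 printable characters
TERMINATOR = NUM_PRINTABLE         # index of the extra end-of-sentence class
WIDTH = NUM_PRINTABLE + 1          # one more bit per character


def vectorize(sentence, max_len):
    """One-hot encode a printable-ASCII sentence and mark its end with the terminator."""
    rows = []
    for ch in sentence[:max_len - 1]:      # leave room for the terminator row
        row = [0.0] * WIDTH
        row[ord(ch) - FIRST] = 1.0
        rows.append(row)
    end = [0.0] * WIDTH
    end[TERMINATOR] = 1.0
    rows.append(end)
    while len(rows) < max_len:             # empty-fill whatever is left
        rows.append([0.0] * WIDTH)
    return rows


def unvectorize(rows):
    """Decode rows back to text, stopping at the first terminator row."""
    chars = []
    for row in rows:
        idx = max(range(WIDTH), key=row.__getitem__)   # largest entry, not a sample
        if idx == TERMINATOR:
            break
        chars.append(chr(idx + FIRST))
    return "".join(chars)


print(unvectorize(vectorize("Can you hear me?", 40)))
```

Because decoding stops at the terminator row, the empty-filled rows after it never reach the output.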

After another 6000 iterations, here are some samples:

Broden dind,00 a&k aThes a}e .widiAB thefdey#)+}g/^X>!/{yQ
We're Bettin, J.X: cJoJe to ZeTrUf[ISion`C):+6+]f{y%{^@k?g6=|/K[[$lX!/r|g;54]+
Ki l aBl hAman6.8q>f$z\+;@>

Project Voight: Softmax vs Normalization

2015-03-08

Programming

Adding softmax improved the network slightly. I’m thinking it might be excessive as a first layer, since it heavily dilutes the input vector. Let’s compare softmax, normalization, and ‘hardmax’.
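For reference, here is a minimal sketch of the three transforms as they can be inferred from the numeric examples in this post. The function names are hypothetical, and reading "normalized" as divide-by-largest-value and "hardmax" as divide-by-sum is an assumption based only on those examples (non-negative inputs are assumed too).

```python
import math


def softmax(v):
    """Exponentiate, then divide by the sum of the exponentials."""
    m = max(v)                                # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def scale_by_max(v):
    """Assumed 'normalized' variant: divide every entry by the largest one."""
    m = max(v)
    return list(v) if m == 0 else [x / m for x in v]


def scale_by_sum(v):
    """Assumed 'hardmax' variant: divide every entry by the sum."""
    s = sum(v)
    return list(v) if s == 0 else [x / s for x in v]


print([round(p, 3) for p in softmax([0, 1, 0])])   # [0.212, 0.576, 0.212]
print(scale_by_max([1, 2, 0]))                     # [0.5, 1.0, 0.0]
print(scale_by_sum([0.5, 0.5, 1.0]))               # [0.25, 0.25, 0.5]
```

Note that only `softmax` and `scale_by_sum` produce rows that sum to one; `scale_by_max` pins the largest entry to 1 instead.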

Softmax: (e.g., [0 1 0] -> [0.212 0.576 0.212])
/echo More sense?
+W:u7(gd~d|Cjg0dnTWX]%Q;C*W4y VO)lvj51/,jv#8g3t,VG.]jHg];{PG
/echo Derp!
`B.VB&+N#E[Vm:4J+DLH7w`&Tuj”X1`?5]Af”GE
/echo Derp!
#5]eHFDHm9zQi}bsD[ya>Y~uK4)!zz2iGOgYOwHz
/echo Derp!
|?!xE\QE”/}:FX8ok_b/BqF=7>!SkbJ%_Ney~>c~
/echo Derp!
=GUe.KTx6v3t~@!`]u=EFupq-oTIw6QaH8[}43[H

Normalized: (e.g., [0 1 0] -> [0 1 0], [1 2 0] -> [0.5 1 0])
/echo More sense?
Mow` ‘! !’+%!!!!!”!!!”!!”!!!!!”!”””!!””!
/echo Derp?
#”&&!!!!!!!!!!”!!!”!!!!!!!!!!!!!!!!!!!!!
/echo Derp!
$$ !!!!!”!!!!!!!!!!!”!!!!!!!!!!!!!!!!!!!
/echo Derp!
%($!!!!!!!!”!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/echo Derp!
$$’&!!!!!!!”!!!!!!”!”!!”!!!”!”!!!!!!!!!!
/echo Derp!
!’ #!!””!!!!!!!”!!!!”!!!!!!!!”!!!!!!!!”!
/echo Derp!
&$ )!!!!!”!!!!!!!””!!”!!!!!!!!!”!!!!!!!!
/echo Derp!
“”&”!!!!!!!”!”!”!!!!”!”!!!!”!”!!!!!!!!”!
/echo Derp!
$!$!!!!!!!!!!!!”!!!!!!”!!””!!!!!”!!”!”!

Hardmax: (e.g., [0.5 0.5 1.0] -> [0.25 0.25 0.5])
/echo Here we are.
Hhhea?e >y3m `P:.f,[kXbfkfg[L/NbY99_(K?wTB,b1FbkT’&;Yg8L?\baLfbqW>Nw|qb7J.m3|3*
/echo Born to be kings.
BordIIoVHe swP’k+”3Wb/2Ne],JNup__kISbaNbf6B7bb{Jj ,1]x+TN({b:,OT’$O2b2,~OkF)b[#b9q!9G
/echo TIME TO GO TO WAR.
*0″dgOv i*jx-#”o<_7*tx'Yi6JbNZv|K3[Qi3B0-JQ8\1v0iOFS^C;b_CEq4;fbr^IbbO(bv^vb_=kb
/echo THIS IS CATTLE CALL.
K\&Awgb3"c@z93NPA aT61j^h@[d`Wvm6"T4V4gUkkKXsMQgw!cbuZbNl9[tV2b^)Ob="zVQX(bbcVb
/echo Brothers and sisters.
BrotFebsCa2haQis rbg_U!?kCObqJ_b|X;b(X.b]WF Y@b5-2N6-vPLX*\88532vXXfNl)Lb4CN:

I think hardmax (really, just normalization in a different way) is the most compelling outcome. I'll have to toy with the network more and see what comes of it.

Project Voight: Normalization in NLP

2015-03-07

Programming

After last night’s update I did a lot of experimentation with different training values, training times, bit-length, and the like. While at the gym, it occurred to me that an assumption I’d made in the vectorization script was not holding.

Here’s a quick overview of how multi-class training works and what we’re trying to avoid. If you’re familiar with the reasoning behind using one-column-per-class as opposed to one column with numbers up to n (for n classes), you can skip this.

Imagine our network is predicting one of three things from an input image: boat, car, or plane. In our big table of data, we have one column labelled “label”, with the value 1 (for boat), 2 (for car), or 3 (for plane). If you can imagine for a moment running a regression to try and produce labels, we could, conceivably, produce a single output value and match that to the closest integer. If the network predicts 2.7, for example, we might snap that to three. If it produces 0.9, we’d snap that to one. The problem is this: we are “imposing an ordering” on the classes. What if we encounter a vehicle which is half aeroplane and half boat? We can’t go between labels 1 and 3, otherwise we’ll get 2, car, which is neither of them.

Instead, what we do (usually) is predict THREE numbers, one per class. Sometimes this is called ‘one-hot’. In doing so, we don’t impose an ordering AND we get a probability on a per-class basis. If our output was, for example, [0.8, 0.3, 0.9], we’d say this was most likely class 3, but also has a good chance of being class 1. There’s less chance of it being class 2.
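As a small illustration of the one-column-per-class idea, here is a sketch using the boat/car/plane example. The helper names `one_hot` and `ranked` are made up for this example and are not part of the project's code.

```python
CLASSES = ["boat", "car", "plane"]


def one_hot(label):
    """Encode a class name as one column per class, with a single 1."""
    return [1.0 if name == label else 0.0 for name in CLASSES]


def ranked(scores):
    """Pair each class with its predicted score, highest first."""
    return sorted(zip(CLASSES, scores), key=lambda pair: pair[1], reverse=True)


print(one_hot("car"))             # [0.0, 1.0, 0.0]
print(ranked([0.8, 0.3, 0.9]))    # plane first, then boat, then car
```

A half-plane, half-boat vehicle can now score high in both of those columns without ever passing through "car".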

The vectorization script I have for my language processor breaks out each character of a sentence into an array of 96 bits. That’s one class for each printable character. I could have made it more elaborate and done n bits, with one bit being set for capitals and another for characters and another for… you get the idea. I took the simple way and produced a simple 96-bit vector with a single bit set to 1 and the rest set to zero. When reconstructing or unvectorizing, we assume that the 96 numbers add up to one, giving us a probability distribution. We don’t pick the highest number, but rather we sample from the distribution, choosing each letter with its probability. To clarify, if our alphabet only had the letters ‘a’ and ‘b’, our output vector might look like [0.1, 0.9]. Then we’d choose ‘b’ 90% of the time and ‘a’ 10% of the time.
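Sampling from the output vector, rather than taking the largest entry, might look something like this. It is a sketch using the two-letter alphabet from the example above; `sample_char` is a hypothetical helper, and the real unvectorizer may well do this differently.

```python
import random


def sample_char(alphabet, probs, rng=random):
    """Pick one character, with each character chosen in proportion to its probability."""
    return rng.choices(alphabet, weights=probs, k=1)[0]


rng = random.Random(0)   # seeded only so that repeated runs give the same draws
draws = [sample_char("ab", [0.1, 0.9], rng) for _ in range(10000)]
print(draws.count("b") / len(draws))   # close to 0.9
```

`random.choices` only needs relative weights, so it would still run if the numbers did not sum to one, but the probabilities would then silently differ from what the raw values suggest.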

Remember the assumption I made. Certainly, when WE are generating the vectors from a sentence, there will be only one ‘1’. Going the other direction, however, there’s no guarantee that the numbers the network produces will sum to 1.0. This is where softmax comes in. If x is an array and x[i] is the i-th value in x, Softmax(x)[i] = e^x[i] / sum{ e^x[j] for all j in x }. More practically, if our input matrix is…

 0 0 0
 0 1 0
 1 2 0

The softmax would be…

 0.333   0.333   0.333  
 0.212   0.576   0.212  
 0.245   0.665   0.090

Note that all the rows add up to one. That’s a nice way of mapping things back to a probability distribution.
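To make the arithmetic concrete, here is the last row, [1 2 0], worked by hand with values rounded to three decimals:

```latex
\begin{align*}
e^{1} &\approx 2.718, \qquad e^{2} \approx 7.389, \qquad e^{0} = 1 \\
\sum_{j} e^{x_j} &\approx 2.718 + 7.389 + 1 = 11.107 \\
\operatorname{softmax}(x) &\approx
  \left[ \frac{2.718}{11.107},\ \frac{7.389}{11.107},\ \frac{1}{11.107} \right]
  \approx [\,0.245,\ 0.665,\ 0.090\,]
\end{align*}
```

The first row works the same way: every entry is e^0 = 1, so each output is 1/3.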

So I wrote a softmax layer and threw it into the NLP setup. It’s training now. Let’s see what comes out.

—Joseph's Blog

Math, Machine Learning, Game Development

Archive

Project Voight: Terminators

Project Voight: Softmax vs Normalization

Project Voight: Normalization in NLP