Really nice post! Love how digestible some of the technically heavy topics are:
- Do you think focusing on optimizing masking can lead to improved outputs?
- If I wanted to create an LLM better than GPT-4, what component would you recommend focusing on optimizing?
1. You raise a good point: is there a better way to "learn" a language than the masking technique? I could imagine significant improvements coming from refining it, especially because computing the average error over all the training data at every step of gradient descent is very costly (see the first sketch below).
2. The transformer. The transformer model is the secret sauce of GPT; it's what enabled it to be so powerful. GPT-1 through GPT-4 have basically been the same architecture at larger and larger scales (the second sketch below shows the attention operation at its core). Hence, if I had to hazard a guess (as an ML newbie myself), I think the next huge breakthrough will come when the transformer model gets replaced by a better architecture.
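To make the cost point in (1) concrete, here is a minimal NumPy sketch (a toy illustration, not GPT's actual training code) of a masked-token objective: a single softmax layer predicts a hidden token id from a context vector. Averaging the error over every training example at each step is what gets expensive; the mini-batch line approximates that average from a small random sample instead. The dataset, sizes, and the `avg_grad` helper are all made up for the example.

```python
# Toy masked-token setup: predict a hidden token id from a context vector
# with a single softmax layer. Illustrative only, not GPT's training code.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, context_dim, n_examples = 50, 50, 10_000

# Synthetic "masked token" dataset: context vectors X, hidden token ids y.
X = rng.normal(size=(n_examples, context_dim))
y = rng.integers(0, vocab_size, size=n_examples)
W = np.zeros((context_dim, vocab_size))           # model parameters

def avg_grad(X_batch, y_batch, W):
    """Average cross-entropy gradient over a batch (full set or mini-batch)."""
    logits = X_batch @ W
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    probs[np.arange(len(y_batch)), y_batch] -= 1  # softmax minus one-hot target
    return X_batch.T @ probs / len(y_batch)

lr, batch_size = 0.1, 64
for step in range(200):
    # Full-batch version would be: grad = avg_grad(X, y, W)
    # -> touches all 10,000 examples every step.
    # Mini-batch uses 64 random examples to estimate the same average:
    idx = rng.integers(0, n_examples, size=batch_size)
    W -= lr * avg_grad(X[idx], y[idx], W)
```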
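And for (2), a minimal sketch of scaled dot-product self-attention, the operation at the heart of the transformer. It is heavily simplified (single head, no causal masking, and the learned query/key/value projections are omitted), so treat it as an illustration of the idea rather than GPT's implementation; the `self_attention` function and the toy token embeddings are made up for the example.

```python
# Scaled dot-product self-attention, stripped to its core: each token's output
# is a weighted mix of all token embeddings, with weights from dot-product
# similarity. Single head, no masking, no learned projections.
import numpy as np

def self_attention(X):
    """X: (sequence_length, d_model) token embeddings -> same-shape outputs."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # token-to-token similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ X                               # weighted mix of embeddings

tokens = np.random.default_rng(1).normal(size=(6, 16))  # 6 tokens, 16-dim embeddings
print(self_attention(tokens).shape)                  # (6, 16)
```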