Caio Temer explains LLMs and machine learning.
Caio Temer
I found that generalized learning happens after 10k epochs with a considerable amount of data per epoch. Are there studies in this regard?
ChatGPT
Yes, there are studies related to generalized learning after many epochs with a large amount of data per epoch. A notable example is the "Deep Bootstrap Framework", presented at ICLR 2021. This framework draws a connection between generalization and online optimization. It compares a "real world", where a model is trained on a finite number of samples over multiple epochs, with an "ideal world", where each training step uses fresh samples from a virtually infinite dataset. The research found that models that train quickly on infinite data are the same ones that generalize well on finite data. This study offers a new perspective on design choices in practice and provides a theoretical path toward understanding generalization.
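A minimal sketch of the two regimes the Deep Bootstrap paper contrasts, not taken from the paper's code: `make_model`, `finite_set`, and `sample_batch` are hypothetical placeholders for your own model and data pipeline.

```python
# Sketch only: contrasts the "real world" (finite data reused across epochs) with
# the "ideal world" (fresh samples every step) from the Deep Bootstrap framework.
import torch

def train_real_world(make_model, finite_set, steps, batch_size=64, lr=1e-3):
    """Real world: a finite dataset is reused for many epochs (samples repeat)."""
    model = make_model()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    n = len(finite_set[0])
    for _ in range(steps):
        idx = torch.randint(0, n, (batch_size,))       # resample with repetition
        x, y = finite_set[0][idx], finite_set[1][idx]
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

def train_ideal_world(make_model, sample_batch, steps, lr=1e-3):
    """Ideal world: every step sees fresh samples, so nothing can be memorized."""
    model = make_model()
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        x, y = sample_batch()                          # fresh data every step
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

# The paper's observation: architectures whose ideal-world (online) loss drops
# fastest tend to be the ones that generalize best in the real-world setting.
```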
Caio Temer
show notes:
Part One summary
Interesting topics on trading:
- how I created the first bots
- how models combine language models with autonomous robots
- why conventional methods don't work

About language models:
- language models are simpler to obtain with other structures
- the problem in the current structure and how I solved it
- development of intuitive neural networks using hash abstraction
Caio developed a way for neural networks, using cryptography-style techniques based on sums, to learn to abstract and build deep understanding quickly.
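The notes don't spell out Caio's exact method; one plausible reading of "cryptography techniques that use sums" is a sum-based hash that folds a window of token ids into a single bucket id, giving the network one compact, abstract symbol per context. The sketch below is only illustrative and every name in it is an assumption.

```python
# Hedged sketch of sum-based hashing for context abstraction (not Caio's code).
def additive_hash(window, num_buckets=2**16, base=31):
    """Fold a window of token ids into one bucket id via a position-weighted sum."""
    return sum(tok * base**i for i, tok in enumerate(window)) % num_buckets

def hash_features(token_ids, window_size=3, num_buckets=2**16):
    """Turn a token sequence into a sequence of hashed context ids."""
    return [
        additive_hash(token_ids[i:i + window_size], num_buckets)
        for i in range(len(token_ids) - window_size + 1)
    ]

# The hashed ids can index a small embedding table, so the model sees whole
# contexts as single abstract symbols instead of raw tokens.
print(hash_features([12, 7, 99, 7, 12], window_size=3))
```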
He built a library containing a tiny language model that learns to predict all tokens at absurd speed by using distance-vector embeddings instead of tokens.
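The linked repo (fast-js-language-model) is JavaScript; the Python/PyTorch sketch below only illustrates one reading of the idea, regressing a target embedding and decoding by nearest distance instead of running a softmax over the vocabulary. All sizes and names are assumptions.

```python
# Hedged sketch: predict the next token's embedding vector, then decode by
# nearest-neighbour distance, instead of classifying over token ids.
import torch
import torch.nn as nn

vocab_size, dim = 10_000, 64
token_emb = nn.Embedding(vocab_size, dim)            # token vectors (fixed or learned)

class TinyVectorLM(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.Tanh(), nn.Linear(256, dim))
    def forward(self, context_vec):
        return self.net(context_vec)                 # predict the next token's vector

model = TinyVectorLM(dim)
ctx = torch.randint(0, vocab_size, (32,))
target = torch.randint(0, vocab_size, (32,))
pred = model(token_emb(ctx))
loss = nn.functional.mse_loss(pred, token_emb(target).detach())  # vector regression, no softmax

# Decoding: the nearest embedding by distance recovers a concrete token id.
dist = torch.cdist(pred, token_emb.weight)           # (batch, vocab) distances
next_tokens = dist.argmin(dim=-1)
```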
Transformers:
[Are Transformers Effective for Time Series Forecasting? (arXiv:2205.13504)](https://arxiv.org/abs/2205.13504)
For example:
- If the network is exploding, something is wrong with a component, or the update rate is wrong. If the rate is right, the nonlinearities are wrong: swap a linear or mostly-linear activation such as ReLU for tanh, or add a tanh.
- A softmax in deep layers, used as a projector, reduces convergence time by forcing layer alignment.
- Training the transformer one layer at a time, adding a copy of the first layer and freezing the first, accelerates training and improves retention (see the sketch after this list).
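A sketch of the layer-wise growth tip, assuming stock PyTorch modules; hyperparameters and the training loop are placeholders, not the method from the episode.

```python
# Sketch: train one transformer layer, then append a copy of it and freeze the
# original before continuing to train only the new layer.
import copy
import torch
import torch.nn as nn

d_model = 128
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
stack = nn.ModuleList([layer])

def trainable_params(module_list):
    return [p for m in module_list for p in m.parameters() if p.requires_grad]

# ... train stack[0] on your data for a while ...

# Grow: copy the trained first layer, freeze the original, train only the copy.
new_layer = copy.deepcopy(stack[0])
for p in stack[0].parameters():
    p.requires_grad = False
stack.append(new_layer)

opt = torch.optim.Adam(trainable_params(stack), lr=3e-4)

def forward(x):
    for blk in stack:
        x = blk(x)
    return x
```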
For sequence-to-sequence tasks, a permuted linear (dense) layer solves 99% of problems in less time and better than a transformer. Giant embeddings work like entire networks, needing only one activation at the end after reshaping.
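A minimal sketch of a "permuted linear" sequence model in the spirit of the linked LTSF-Linear paper (arXiv:2205.13504): permute so a single dense layer maps the input length to the output length, channel by channel. The shapes are illustrative assumptions.

```python
# Sketch: one dense layer applied along the time axis via a permute/unpermute pair.
import torch
import torch.nn as nn

class PermutedLinear(nn.Module):
    def __init__(self, in_len, out_len):
        super().__init__()
        self.proj = nn.Linear(in_len, out_len)   # acts along the time axis
    def forward(self, x):                        # x: (batch, in_len, channels)
        x = x.permute(0, 2, 1)                   # -> (batch, channels, in_len)
        x = self.proj(x)                         # -> (batch, channels, out_len)
        return x.permute(0, 2, 1)                # -> (batch, out_len, channels)

model = PermutedLinear(in_len=96, out_len=24)
y = model(torch.randn(8, 96, 7))                 # e.g. forecast 24 steps from 96
```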
“1 tanh layer of 1 million units solves everything better than any other network with the same number of parameters, at least on the tasks I use. I always start with it, then see how to reduce.”
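The "one huge tanh layer" starting point from the quote, written as a plain two-matrix MLP. The quote uses about 1M hidden units; the default here is smaller so the sketch stays light on memory, and all sizes are illustrative.

```python
# Sketch: a single wide tanh layer as a first baseline before reducing the model.
import torch
import torch.nn as nn

def wide_tanh_baseline(in_dim, out_dim, hidden=65_536):
    """Single wide tanh layer: often a surprisingly strong starting point."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

model = wide_tanh_baseline(in_dim=512, out_dim=10)
out = model(torch.randn(4, 512))
```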
Colabs:
Making a language model using Facebook's FastText: https://colab.research.google.com/drive/1wVQrpzyY-SkCZTRZCcP6xJGRdg1ik0jR#scrollTo=pIW6-VwMuRlz
https://colab.research.google.com/drive/1oABIZr1xiIu7DKc7AbbZfeBJFLkZW6Ep#scrollTo=XfR-3PpPYoFU
https://github.com/rtkclouds/fast-js-language-model
https://x.com/canalCCore2/status/1735044356535472278?s=20
GitHub gists:
https://gist.github.com/rtkclouds/50b81d10736793f07cdca354516e8757
https://gist.github.com/rtkclouds/a6ee9afd96461ca94b3e9c22f78bda3a