I am an active participant in the field of Artificial Intelligence. I have a niche focus on hardware for deep learning and the enthusiastic curiosity of a knowledge-worker who spends too much time in the office. Sometimes I forget to stick my head out of the bubble. With this in mind, I’d like to pause and share some broad thoughts on where I think this industry is headed in the near future.
These predictions are separated into four categories: (1) Software, (2) Hardware, (3) Business, and (4) Geo-political predictions. This is by no means an exhaustive list and represents mostly what I could muster after a few evenings of reflection and discussion. Constructive criticism is welcome.
- Bigger and deeper models
- More big data
- Batch size == 1
- One-shot learning
- Simulation and visualization as a means for AV certification
- Self-driving as a software long-shot
- Evolutionary algorithms to prune deep-learning architectures
- Static to dynamic graphs
- Life-long learning
- Transfer learning
- Learning structure rather than parameters
- Heterogeneous computing
- Inference on the edge
- Reconfigurable computing as a means for ML hardware prototyping
- Heterogeneous computing as a consumer standard
- Chip designers becoming accelerator (co-processor) designers
- Memory bandwidth as a fundamental limitation in compute utilization
- Self-driving hardware converging on computer vision
- Rearchitecting organizational structures around AI
- Decreased demand for radiologists
- Legislating datasets
- AI Safety as a continuing neglected domain
- AI weaponization
- China’s dominance in AI for medicine
Software and Algorithms
1. Bigger and deeper models
Over the past few years, ML researchers have found success in increasing the number of model parameters (i.e., weights) and layers of neural networks. Models now often have enough parameters to reach gigabyte scale. Some assert that this correlation between accuracy and number of parameters makes sense intuitively, since human brains have trillions of such “parameters” or “connections between neurons”. Either way, so long as better results keep coming and hardware can support it, we’ll keep increasing the size of models into the terabytes.
2. More big data
Similar to the above point, bigger models call for more data. This relationship between model and dataset size has never been formalized or proven. However, it is well established empirically that more training data generally yields better accuracy during inference on unseen data. I will assert this while acknowledging the existence of the bias-variance tradeoff. Keeping network architectures constant, whoever holds the most data holds the key to state-of-the-art results. Therefore, the demand for more training data will only keep expanding in the near future.
3. Batch size == 1
A fundamental issue in training deep neural networks is the problem of vanishing and exploding gradients. To counteract this phenomenon, ML researchers have turned towards relying on batch statistics for normalization. Normalization makes sense when dealing with variability in parameter values, but normalizing along the batch dimension exists almost entirely as a result of demand for better hardware utilization. In short, GPUs are really good at matrix-matrix multiplication (GEMM) when they’re able to pack multiple data (or features) into a single operation. However, the underlying operation being performed is a matrix-vector multiply (GEMV).
Because of this hardware limitation, dependency on batching has been accepted as canon. However, batch-norm makes modeling neural networks in code slightly more difficult, introducing complexities and assumptions, such as each batch being representative of the overall distribution of input data. I predict this discrepancy between theory and practice will likely disappear as faster, more specialized hardware comes onto the market for operating on the level of single features rather than batches of features: effectively, a batch size of one.
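To make the GEMM versus GEMV point concrete, here is a minimal pure-Python sketch. The function names and the no-cache counting model in `weight_reads` are my own assumptions for illustration, not how any real GPU kernel works. It shows that a batched matrix-matrix multiply produces exactly the same outputs as running each input through a matrix-vector multiply, while touching the weights far fewer times.

```python
# Illustrative sketch only: pure-Python matrix ops, not an optimized kernel.

def gemv(W, x):
    """Matrix-vector product: one input vector through one weight matrix."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def gemm(W, X):
    """Matrix-matrix product over a batch.

    X is stored as a list of input vectors, one per batch element.
    Each weight row is fetched once and reused for every sample.
    """
    out = [[0.0] * len(W) for _ in X]
    for i, row in enumerate(W):        # fetch one weight row ...
        for b, x in enumerate(X):      # ... and reuse it across the batch
            out[b][i] = sum(w_ij * x_j for w_ij, x_j in zip(row, x))
    return out


def weight_reads(rows, cols, batch, use_batch):
    """Count weight-element reads under a simple no-cache model (assumption)."""
    return rows * cols if use_batch else rows * cols * batch


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]                 # 3x2 weight matrix
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]    # batch of 4 inputs

per_sample_out = [gemv(W, x) for x in X]   # batch size 1, four times
batched_out = gemm(W, X)                   # batch size 4, once

print(per_sample_out == batched_out)
print(weight_reads(3, 2, 4, use_batch=False), weight_reads(3, 2, 4, use_batch=True))
```

The math is identical either way; only the amount of weight traffic changes, which is why batching is a hardware convenience rather than a modeling requirement.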
4. One-shot learning
How do you learn from small amounts of data? Despite our growing demand for more training data, neural networks - if accurately modeling the human brain - shouldn’t require thousands or millions of examples to become competent at specialized tasks. Continuing to strive for efficiency, I believe one-shot learning will grow in popularity. Currently, you must have access to large amounts of capital and data-center resources to perform experiments, which limits practical ML research to large organizations and research institutions. This is unsustainable if we seek to make AI more accessible to the masses.
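One simple way to picture one-shot learning is nearest-neighbor classification over a support set holding a single labeled example per class. The sketch below assumes the feature vectors come from some hypothetical pre-trained embedding; the labels and numbers are made up for illustration, and real one-shot methods are considerably more involved.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def one_shot_classify(query, support):
    """Return the label whose single support example is closest to the query.

    `support` maps each label to exactly one example (hence "one-shot").
    """
    return min(support, key=lambda label: distance(query, support[label]))


# Hypothetical embeddings: one example per class.
support = {"cat": [0.9, 0.1], "dog": [0.1, 0.9]}

print(one_shot_classify([0.8, 0.3], support))
```

All the heavy lifting is hidden in the embedding, which is exactly where prior knowledge has to come from when examples are scarce.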
5. Simulation and visualization as a means for AV certification
In order to certify functional correctness, every self-driving car company adopts a simulation framework of some sort. Often these are home-grown and vary in complexity based on the underlying autonomy system they must support. These simulations must be high fidelity since the vehicle software which runs in simulation is the same software that dictates the vehicle’s actions in reality where mistakes may be fatal. Given how complex this software must be to model reality and how many AV companies crop up with new solutions, I believe that there will be a shift towards common simulation frameworks which filter out good software from the bad. This may even be a necessity if local governments plan to allow AVs on public roads.
Figure: Uber ATG and Waymo simulators.
6. Self-driving as a software long-shot
Despite copious investment into autonomous vehicles (AVs), there is significant lag in the productization of software in this domain. Hardware systems may be the best they’re going to get for this application in the near-term (some might argue against this and point to the Lidar vs Camera debate). However, the algorithms which sit atop this hardware are what matter when considering issues of decision-making - essentially everything a human must deal with. Put concisely, hardware performance is reaching a saturation point, but I believe software will be the real deciding factor in who wins the AV race (pun intended).
7. Evolutionary algorithms to prune deep-learning architectures
One paper that stuck out for me recently was The Evolved Transformer. The authors use neural architecture search (NAS) to find a better model given some primitives known to work well in neural machine translation. The search in this paper is driven by an evolutionary algorithm, a biologically inspired procedure for solving optimization problems. I predict that NAS will gain even more popularity as ML researchers outsource the architecting of neural models to other neural models or evolutionary algorithms.
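For readers unfamiliar with evolutionary search, here is a toy sketch of the select-and-mutate loop. It is not the procedure from The Evolved Transformer: the "architecture" is just a list of layer widths, and `fitness` is a hypothetical stand-in for what would really be a full train-and-evaluate run per candidate.

```python
import random

WIDTH_CHOICES = [8, 16, 32, 64, 128]   # hypothetical per-layer widths


def fitness(arch):
    """Hypothetical score: prefer total width near 100 and fewer layers.

    A real search would train each candidate and measure validation accuracy.
    """
    return -abs(sum(arch) - 100) - 2 * len(arch)


def mutate(arch, rng):
    """Return a copy of `arch` with one random edit (resize, add, or remove)."""
    child = list(arch)
    op = rng.choice(["resize", "add", "remove"])
    if op == "resize":
        child[rng.randrange(len(child))] = rng.choice(WIDTH_CHOICES)
    elif op == "add" and len(child) < 6:
        child.insert(rng.randrange(len(child) + 1), rng.choice(WIDTH_CHOICES))
    elif op == "remove" and len(child) > 1:
        del child[rng.randrange(len(child))]
    return child


def evolve(generations=30, population_size=10, seed=0):
    """Keep the fitter half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    population = [[rng.choice(WIDTH_CHOICES)] for _ in range(population_size)]
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: population_size // 2]      # selection
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
        history.append(max(fitness(a) for a in population))
    return max(population, key=fitness), history


best, history = evolve()
print(best, fitness(best))
```

Because the survivors are carried over unchanged, the best score in `history` can never get worse from one generation to the next. The expensive part in practice is the fitness evaluation, which is why NAS is so compute-hungry.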
8. Static to dynamic graphs
Most ML models developed and in production today are static - the underlying structure and order of network operations is not subject to change during inference. Recent research into dynamic graphs for deep learning has led me to believe that this will become a heavily researched topic. Similarly, hardware will need to support this new form of dynamic graph computation - a difficult feat for current options on the market.
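The distinction can be sketched in a few lines of plain Python. Both functions below are made-up scalar "networks" for illustration only: in the static case the same fixed sequence of operations runs for every input, while in the dynamic case the number of steps depends on the data itself.

```python
def static_forward(x, weights):
    """Static graph: the same fixed sequence of ops runs for every input."""
    for w in weights:
        x = max(0.0, w * x)      # scalar "layer" followed by ReLU
    return x


def dynamic_forward(tokens, w):
    """Dynamic graph: how many steps run depends on the input."""
    state, steps = 0.0, 0
    for t in tokens:             # loop length is data-dependent
        state = max(0.0, w * state + t)
        steps += 1
        if state > 10.0:         # data-dependent early exit
            break
    return state, steps


print(static_forward(2.0, [0.5, 3.0]))
print(dynamic_forward([1.0, 2.0, 3.0], 1.0))
print(dynamic_forward([20.0, 1.0], 1.0))
```

Hardware that schedules work ahead of time can plan for `static_forward` once; for `dynamic_forward` it cannot know the amount of work until the input arrives.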
The following three items come from my friend and roommate, Daniel Abolafia, a researcher at Google Brain:
9. Life-long learning
In supervised deep learning, any time new data is acquired the model must be re-trained. This is incredibly costly, especially if your model takes days or weeks to train. Alternatively, in reinforcement learning, a model (or agent) interacts with an environment using heuristics to determine rewards that are not explicitly provided as input data. In either paradigm, training typically occurs once; learning does not continue over a “lifetime” as more data is added to the training set. Lifelong learning is a research topic which tries to fill this gap. I believe that the findings from this domain will have a significant impact on the way researchers approach training, especially in active production systems where new data is constantly being acquired.
10. Transfer learning
In the past few months, we have discovered that using pre-trained models in language tasks and re-training given these initial parameters works incredibly well. This represents one instance of transfer learning, in which models trained in one domain may be re-used or re-trained to solve problems in another domain. If we are able to understand why and how transfer learning works so well in practice, I believe the outcome will contribute to decisions regarding process. Specifically, more care will be taken by researchers to select and initialize models with parameters that would contribute the best towards maximizing their metric of success (e.g., BLEU). This will involve much closer attention to examining relationships between datasets across domains.
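A toy version of the idea, using a one-parameter model rather than a language model: pre-train on a "source" task, then give both the pre-trained parameter and a from-scratch parameter the same tiny fine-tuning budget on a related "target" task. The tasks, slopes, and step counts here are all assumptions chosen for illustration.

```python
def train(w, xs, ys, steps, lr=0.05):
    """Gradient descent on mean squared error for the model y = w * x."""
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0]
source_ys = [2.0 * x for x in xs]   # "source" task: slope 2.0
target_ys = [2.2 * x for x in xs]   # related "target" task: slope 2.2

pretrained = train(0.0, xs, source_ys, steps=200)       # long pre-training
fine_tuned = train(pretrained, xs, target_ys, steps=3)  # short fine-tuning
from_scratch = train(0.0, xs, target_ys, steps=3)       # same budget, cold start

print(mse(fine_tuned, xs, target_ys) < mse(from_scratch, xs, target_ys))
```

The pre-trained start wins here simply because the two tasks are close, which is the question the paragraph above raises: how do we know in advance which source tasks are "close" to a given target?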
11. Learning structure rather than parameters
Similar to prediction #7, I believe that we will have some success in moving away from traditional model exploration in neural networks, which involves adjusting model parameters via backpropagation, to exploring more model architectures. This would mean learning more interesting network structures rather than learning parameters given an already assumed structure. Currently, creating new model architectures is a lengthy process that is bottlenecked on our ability to program, test, and run these different structures. However, evolving software frameworks (TensorFlow, PyTorch, etc.) and specialized hardware are making this iterative process much easier.
12. Heterogeneous computing
There has been an explosion of interest in hardware for machine learning over the past three years. We have seen very little to no innovation in CPU design during this time. Any professional or academic in the chip industry will assert the coming of the end for Moore’s Law and Dennard Scaling. Because we’ve extracted most benefits from process shrinkage, I believe most innovation in the industry will come from specialized chips: memory systems, graphics processors, signal processing, and general co-processing. I predict that with this huge heterogeneity in purpose-built chips, things like SoCs and chip interconnects will dominate the stage for future innovation.
13. Inference on the edge
Apple’s release of their “Neural Engine” signaled a shift in attention from CPU innovation to AI hardware in popular media. Despite its simplicity (“Neural Engine” == Matrix-Matrix multiplier for inference), the release of this chip demonstrated commercial focus on making AI an integral part of new hardware designs. I believe that these inference chips, or ASIC co-processors, will become even more widespread as companies also shift their focus to software development in AI.
14. Reconfigurable computing as a means for ML hardware prototyping
Making a chip is hard work. It’s expensive, takes years, and requires effort from huge engineering teams. As an alternative, people usually turn to FPGAs as a means for prototyping hardware designs. Similarly, hardware emulation is a popular method of testing out chip designs before tape-out. Because of the speed of innovation in software, making specialized hardware is difficult as chip designers must innovate at great speeds to keep up with AI researchers. I believe that FPGAs will become even more popular as a means for performing inference on the edge.
15. Heterogeneous computing as a consumer standard
During my master’s, I worked in a lab with Zhiru Zhang (the inventor of Vivado HLS), working on accelerating ML models with GPUs. Through this work, I was convinced that heterogeneous computing was the future. Since then, we’ve seen the adoption of FPGAs in the data-center which may serve as one of the biggest indicators of this prediction coming true. I now predict that this movement towards heterogeneous computing will move even deeper. Given their flexibility to instantiate various forms of hardware designs, I predict FPGAs will start to be used in consumer devices (laptops, phones, etc.) first focusing on co-processing for accelerating specialized computation such as deep-learning inference.
16. Chip designers becoming accelerator (co-processor) designers
This point follows naturally from the above. I predict that the current and future engineering workforce will transition into hardware design for AI. This comes of course as a result of tapering off in innovation of traditional CPU designs.
17. Memory bandwidth as a fundamental limitation in compute utilization
If you work in data-center infrastructure involving AI hardware, there are common grievances surrounding hardware interconnects. AI workloads are unique because they require a significant amount of data-reuse and local communication. In an ideal world, this would mean that all data required by compute units of a chip (CPUs) should be fetched rapidly from on-chip memory in order to maximize CPU utilization. However, on-chip SRAM, caches, and memory-hierarchies are limited due to space constraints. Chip-designers often have to work around the problems of limited on-chip memory by using off-chip memory to load and store data, usually in the form of external DRAM. This means that off-chip interconnects must be capable of transporting very high-bandwidth data in order to keep up with the computing capabilities of the on-chip logic.
Engineers working in AI hardware are highly aware of the problems of going off-chip. The power and cycle-count required to go off-chip are orders of magnitude higher than for accessing on-chip memory. This means that, fundamentally, AI workloads are bottlenecked by memory bandwidth, not compute. To get around this problem, companies like NVIDIA invest huge amounts of resources into decreasing the penalty of chip-to-external memory communication. Similarly, datacenters are adopting higher and higher bandwidth interconnects so that nodes which work together can communicate faster. It’s why NVIDIA bought Mellanox.
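A common back-of-the-envelope way to reason about "bandwidth-bound versus compute-bound" is the roofline model, which I am borrowing here as a framing; it is not something the argument above depends on. The sketch assumes every operand is moved over the off-chip link exactly once, and the peak-compute and bandwidth figures are hypothetical round numbers, not any real chip's specifications.

```python
def arithmetic_intensity(rows, cols, batch, bytes_per_value=4):
    """FLOPs per byte moved for Y = W @ X, W is rows x cols, X is cols x batch.

    Assumes W, X, and Y each cross the off-chip link exactly once.
    """
    flops = 2 * rows * cols * batch                        # multiply + add
    values_moved = rows * cols + cols * batch + rows * batch
    return flops / (values_moved * bytes_per_value)


def attainable_flops(intensity, peak_flops, bandwidth_bytes_per_s):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_flops, intensity * bandwidth_bytes_per_s)


PEAK = 100e12       # hypothetical accelerator: 100 TFLOP/s
BANDWIDTH = 1e12    # hypothetical off-chip memory: 1 TB/s

for batch in (1, 1024):
    ai = arithmetic_intensity(4096, 4096, batch)
    perf = attainable_flops(ai, PEAK, BANDWIDTH)
    print(batch, round(ai, 2), perf / PEAK)
```

Under these assumptions, a batch of one performs roughly half a FLOP per byte moved and leaves almost all of the compute idle, while a large batch reuses each weight enough to reach the compute ceiling.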
I have several predictions: (1) because ML models are growing ever larger, hardware designers will never really be able to instantiate enough on-chip memory to meet the needs of new models. (2) Models which span multiple nodes will become more common as the cost of chip-to-chip communication decreases. (3) Data-parallel and model-parallel training will become even more popular for utilizing multiple nodes. (4) Investment into higher-bandwidth, lower-power interconnects will continue for the long-run.
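As a rough illustration of data-parallel training, the sketch below splits a dataset into equal shards, computes a gradient per "worker", and averages the results. The averaging step stands in for the all-reduce that would travel over the interconnect in a real system; the one-parameter model and the data are assumptions made up for the example.

```python
def gradient(w, shard):
    """Mean-squared-error gradient for y = w * x over (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_gradient(w, data, num_workers):
    """Average per-shard gradients; assumes len(data) divides evenly.

    In a real cluster each shard lives on a different node and the
    averaging is an all-reduce across the interconnect.
    """
    shard_size = len(data) // num_workers
    shards = [data[i * shard_size:(i + 1) * shard_size]
              for i in range(num_workers)]
    local_grads = [gradient(w, shard) for shard in shards]  # parallel in practice
    return sum(local_grads) / num_workers


data = [(float(x), 3.0 * x) for x in range(1, 9)]   # 8 samples, true slope 3.0

print(gradient(1.0, data))
print(data_parallel_gradient(1.0, data, num_workers=4))
```

With equal shard sizes the averaged gradient matches the single-node gradient, so the only cost of scaling out is the communication, which is where the interconnect investment comes in.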
18. Self-driving hardware converging on computer vision
Sensor-fusion is widely adopted in autonomous vehicle hardware. Sensors like lidars, radars, cameras, and ultrasound sensors are commonly used by the best companies claiming level 5 autonomy. However, leaders in the AV industry often ask, do we even need all these sensors? Humans are able to drive almost exclusively with their vision, so why not self-driving cars too? Elon Musk has called lidars “fricking stupid” but most of the industry disagrees. I predict that, in the near term, most of these sensors including lidars will continue to be necessary. However, as they are costly, lidars will be incrementally replaced with cameras fulfilling Mr. Musk’s vision of a lidar-less future. This is especially true as the software and ML models which use cameras become more and more robust.
19. Rearchitecting organizational structures around AI
I disagree with the idea that we are currently in an AI bubble in which vigor and enthusiasm around AI is only a fad. I predict that AI will have an even more profound effect on businesses as organizational structures themselves will adapt to making use of AI in decision-making, project management, and human-level interactions. This may take the form of business structures which center around the decision-making abilities of AI agents. I imagine something like a CAIO (Chief-AI-Officer) is a very real outcome of these structural shifts.
20. Decreased demand for radiologists
Some programmers believe that many of the world’s problems can be solved with the right software - it’s just that people get in the way. Not many human issues can be solved with software alone, but medical diagnosis is poised to be one of them. In general, computers are better than humans at data analysis, data organization, and categorization. In the near term, I think this will take the form of a shift in radiology. The primary role of a radiologist is to examine images and create diagnoses. Computer vision (CV) algorithms are already better than humans at these sorts of categorization tasks. In fact, some hospitals now use CV as assistants in diagnosing disorders based on x-rays, CT scans, and MRIs.
21. Legislating datasets
Two more big problems in deep learning are explainability and interpretability. When an ML researcher creates a new model but cannot explain their results in the same way that a programmer can explain their logic, this poses an issue of safety. For instance, how do you certify that an autonomous vehicle will make correct, repeatable driving decisions if you have not explicitly told it what to do in the past? Deep learning is especially helpful with problems involving probabilistic estimation where there is a high level of variance in observed behavior of a system containing a wide space of “unknown unknowns”. These types of problems involve the acquisition of plenty of data, of course. By contrast, well-constrained, domain-specific problems are often solved nicely by classical ML techniques. For the space of problems where deep learning works best, I believe there will be a need for adopting standard datasets which exhaustively cover scenarios that meet the qualification thresholds of similar human experts. This is especially true in critical tasks such as driving and medical imaging.
22. AI Safety as a continuing neglected domain
In the domain of AI Safety, people like to talk about Soft (or Weak) vs Hard (or Strong) AI. Skipping details, experts believe that research into Artificial General Intelligence (AGI) is one of the most important endeavors we could pursue. Namely, we should be putting a lot more effort into preventing an uncontrolled proliferation of misaligned AI agents.
AI research is currently conducted without restraint. Large institutions with access to powerful computers find themselves constantly achieving state-of-the-art results - whether due to their special access to costly datasets or the resources and manpower they deploy to solving problems. Nonetheless, you’d be hard-pressed to find individuals at these institutions who make “safety” their main focus. 80,000 Hours points out that “fewer than 100 people worldwide are directly working on the [AI Safety] problem.” I predict that, even if more researchers joined the effort to “make AI safe again,” we still cannot guarantee safety. Incentives are simply not in place to make this possible. This is not to say that we cannot agree as a community on research standards to avoid problems associated with AGI. However, while one research institution may create a “safe” AI, others developing AI without regard for safety may not. Solving this problem cannot be done per-institution; it must be addressed at a global scale.
23. AI weaponization
I’ll keep commentary here brief as I am not a domain-expert on weapons systems. It is clear that AI has military and defense applications. One piece of research that I’m surprised did not get much press was this project from Stanford where researchers were able to edit the speech and visual output of a scene by simply changing the input text being spoken. “Real fakes” have gained a lot of media attention as possible sources of propaganda and fake news.
On the less visible side, AI systems are well-suited to performing infrastructure-related attacks on internet-connected devices. One interesting piece of research done by my former professor was this project which showed how an AI-agent can perform cloud-based attacks such as deriving the open applications on another user’s virtual machines. Another safety concern is the attack surface of self-driving cars. For instance, this interesting paper released during the nascency of the self-driving industry demonstrated that AVs could be deceived into misinterpreting stop signs via the careful placement of obstructions on the sign itself.
24. China’s dominance in AI for medicine
Availability of data means more accurate results and larger spaces of models to explore. Unlike HIPAA restrictions in the United States, there are few to no restrictions on the collection of medical data of Chinese citizens. In my undergrad, I took a course on computer vision for medical applications where I encountered difficulties in performing common CV tasks like segmentation or object detection on x-ray images. There simply wasn’t enough varied data available. Data available was either guarded by privacy restrictions or was incredibly costly. My feeling is that accessing such data will continue to be problematic as the U.S. government values privacy… and for good reason. Nevertheless, I believe that this is a tradeoff that will cost us progress in domains like medical imaging where data is gold. This also means that there is room for bias in the algorithms which diagnose disorders. However, bias in AI is a huge topic in and of itself.
Thanks for reading. If you find this useful, please let me know!