
IBM’s CodeNet dataset aims to train AI to tackle programming challenges


At its Think conference this week, IBM introduced Project CodeNet, which the company claims is the largest open source dataset for benchmarking AI for code. The dataset comprises 14 million code examples and 500 million lines of code across 55 programming languages, including C++, Java, Python, Go, COBOL, Pascal, and FORTRAN. According to IBM, CodeNet is approximately 10 times larger than the next most similar dataset, which has 52,000 samples.

According to a study from the University of Cambridge’s Judge Business School, programmers spend 50.1% of their work time not programming and half of their programming time debugging. The total estimated cost of debugging is $312 billion per year. AI-powered code suggestion and review tools, then, promise to cut development costs substantially while enabling coders to focus on more creative, less repetitive tasks.

CodeNet focuses specifically on the problems of code translation, code similarity, and code constraints. The goal is to advance the development of AI systems that can automatically translate code into another programming language, identify overlaps and similarities between different sets of code, and customize constraints based on a developer’s specific needs and parameters.
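To make the code similarity task concrete, here is a minimal sketch of one simple baseline: comparing two snippets by the overlap of their token sets (Jaccard similarity). This is not IBM's method or anything shipped with CodeNet; the tokenizer pattern and the function names are assumptions chosen for illustration, and learned models would aim to do far better than this.

```python
import re

# Crude tokenizer (an assumption for this sketch): identifiers, integers,
# or any other single non-whitespace character.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokenize(code):
    """Return the set of distinct tokens found in a code snippet."""
    return set(TOKEN_PATTERN.findall(code))

def jaccard_similarity(code_a, code_b):
    """Size of the shared token set divided by the size of the combined token set."""
    tokens_a, tokens_b = tokenize(code_a), tokenize(code_b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty snippets are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

loop_i = "for i in range(n): total += i"
loop_j = "for j in range(n): total += j"
unrelated = "print('hello')"

# The two loops differ only in a variable name, so they score much higher
# against each other than either does against the unrelated snippet.
print(jaccard_similarity(loop_i, loop_j))
print(jaccard_similarity(loop_i, unrelated))
```

A token-overlap score like this is blind to structure and meaning, which is part of why a large labeled dataset is useful for training models that capture both.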

Programming language translation could be especially useful, given that migrating an existing codebase to a modern or more efficient language like Java or C++ requires expertise in both the source and target languages. For example, the Commonwealth Bank of Australia spent around $750 million over the course of five years to convert its platform from COBOL to Java. Transcompilers could help in theory, since they eliminate the need to rewrite code from scratch, but they’re difficult to build in practice because languages differ in syntax and rely on distinctive platform APIs, standard-library functions, and variable types.

The CodeNet dataset

CodeNet contains samples designed to train AI to complete a range of programming tasks, including code search and clone detection. Beyond this, the dataset has metadata and annotations with a rich set of information spanning code size, memory footprint, CPU run time, and status, which helps to distinguish correct code from problematic code.
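As a rough illustration of how such metadata could separate correct code from problematic code, the sketch below partitions submissions by their status field. The CSV layout, the column names, and the status strings are assumptions made up for this example and may not match the files IBM actually distributes.

```python
import csv
import io

# Hypothetical metadata rows; column names and status values are
# illustrative assumptions, not the dataset's documented schema.
SAMPLE_METADATA = """submission_id,language,status,cpu_time_ms,memory_kb,code_size_bytes
s001,Python,Accepted,120,9000,310
s002,C++,Wrong Answer,15,3500,540
s003,Java,Accepted,210,42000,880
s004,Python,Time Limit Exceeded,2000,9100,295
"""

def split_by_status(csv_text, ok_status="Accepted"):
    """Partition metadata rows into (correct, problematic) lists by status."""
    correct, problematic = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["status"] == ok_status:
            correct.append(row)
        else:
            problematic.append(row)
    return correct, problematic

correct, problematic = split_by_status(SAMPLE_METADATA)
print([row["submission_id"] for row in correct])      # ['s001', 's003']
print([row["submission_id"] for row in problematic])  # ['s002', 's004']
```

The same rows could be filtered further on CPU time or memory footprint, for instance to study efficient versus inefficient solutions to one problem.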

Over 90% of the sample problems in CodeNet come with descriptions that contain a problem statement and specifications of the input and output format. For over half of the problems and seven million examples, IBM also curated sample inputs and outputs from the problem description.

Using CodeNet, data scientists can execute code samples to extract additional metadata and verify outputs from generative AI models for correctness. IBM says that this will enable researchers to program “intent equivalence” when translating one programming language into another.
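One way to picture this kind of check is a small harness that runs a candidate program on a sample input and compares what it prints with the expected sample output. The sketch below is a simplified assumption of how that might look, not IBM's tooling: it only handles Python programs passed as strings, and the function name, timeout, and whitespace-insensitive comparison are choices made for illustration.

```python
import subprocess
import sys

def passes_sample(source_code, sample_input, expected_output, timeout_s=5):
    """Run a Python program on a sample input and compare its stdout
    with the expected sample output."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", source_code],
            input=sample_input,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # treat a hang as a failed sample
    if result.returncode != 0:
        return False  # crashes and uncaught exceptions count as failures
    # Compare token by token so trailing spaces and newlines do not matter.
    return result.stdout.split() == expected_output.split()

# Two hypothetical candidates for an "add two integers" problem.
candidate = "a, b = map(int, input().split())\nprint(a + b)"
buggy = "a, b = map(int, input().split())\nprint(a - b)"

print(passes_sample(candidate, "2 3\n", "5\n"))  # True
print(passes_sample(buggy, "2 3\n", "5\n"))      # False
```

Passing one sample does not prove a translation is equivalent to the original, but running both versions against the same inputs gives a practical signal. Untrusted generated code should be executed in a sandbox rather than directly as shown here.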

“Given its wealth of programs written in a multitude of languages, we believe Project CodeNet can serve as a benchmark dataset for source-to-source translation and do for AI and code what the ImageNet dataset did years ago for computer vision,” Ruchir Puri, IBM fellow and chief scientist at IBM Research, wrote in a blog post.

IBM isn’t the only company pursuing AI-driven code completion and auditing. Codota is developing a platform that suggests and autocompletes scripts in Python, C, HTML, Java, Scala, Kotlin, and JavaScript. Ponicode taps AI to check the accuracy of code, and DeepCode is developing an AI-powered system for whole-app code reviews (as are Amazon and Intel). Perhaps one of the most impressive projects to date is TransCoder, an AI transcompiler Facebook researchers developed to convert code from one programming language into another. Another contender is a model from OpenAI that was trained on GitHub repositories to generate entire functions from English-language comments.

Author: Kyle Wiggers
Source: Venturebeat
