IBM Research releases differential privacy library that works with machine learning


The open-source repository is unique in that most tasks can be run with only a single line of code, according to the company.


Differential privacy has become an integral way for data scientists to learn from the majority of their data while simultaneously ensuring that those results do not allow any individual’s data to be distinguished or re-identified.

To help more researchers with their work, IBM released the open-source Differential Privacy Library. The library “boasts a suite of tools for machine learning and data analytics tasks, all with built-in privacy guarantees,” according to Naoise Holohan, a research staff member on IBM Research Europe’s privacy and security team. 

“Our library is unique to others in giving scientists and developers access to lightweight, user-friendly tools for data analytics and machine learning in a familiar environment–in fact, most tasks can be run with only a single line of code,” Holohan wrote in a blog post on Friday.

“What also sets our library apart is our machine learning functionality enables organizations to publish and share their data with rigorous guarantees on user privacy like never before.”


In an interview, Holohan explained that differential privacy has become so popular that for the first time in its 230-year history, the US Census will use differential privacy to keep the responses of citizens confidential when the data is made available.

Chris Sciacca, communications manager at IBM Research, added that the 2020 Census was a good example of how differential privacy can be used for any large data sets where you can do statistical analysis. 

“Healthcare data would be another area that it would be interesting for. Any large data sets where you want to keep the data anonymous but you don’t want to add so much noise to it that it’s useless. So here you’re just adding a little bit of noise where you can still get statistical anomalies to look at trends in large data sets,” Sciacca said.  

Differential privacy allows data collectors to use mathematical noise to anonymize information, and IBM’s library is special because its machine learning functionality enables organizations to publish and share their data with rigorous guarantees on user privacy.
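The noise-adding idea can be illustrated with the Laplace mechanism, a standard differential privacy technique. The sketch below is not taken from IBM's library and is not production-grade; the data set, the function names and the choice of epsilon are all assumptions made for illustration. It answers a counting query by adding noise whose scale is 1 divided by epsilon, so a smaller epsilon means more noise and stronger privacy.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a zero-centred Laplace distribution.

    The difference of two independent exponential draws, each with
    mean `scale`, follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon, rng):
    """Return a noisy count of the values that satisfy `predicate`.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is 1 / epsilon.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical data set: 1,000 ages, invented for this example.
ages = [23, 35, 41, 29, 62, 57, 33, 48, 71, 26] * 100

true_over_40 = sum(1 for a in ages if a > 40)
noisy_over_40 = private_count(ages, lambda a: a > 40, epsilon=0.5, rng=rng)

print("true count: ", true_over_40)
print("noisy count:", round(noisy_over_40, 2))
```

On a data set of this size the noisy count stays close to the true one, which is the trade-off Sciacca describes: enough noise to protect individuals, not so much that the trend disappears.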

“Originally, when we started looking at the space of open-source software and differential privacy, we noticed that there was a big gap in the market in terms of being able to do machine learning with differential privacy easily. There is a lot of work done in the literature that all the algorithms have been studied and made differentially private and solutions have been presented but there was no single repository or single library to go to do machine learning with differential privacy,” Holohan said.

“We decided to build this library that, using existing packages in Python, allows you to build on top of them, and then you can do machine learning with differential privacy guarantees built-in. A lot of the commands you can execute in a single line of code, so it’s very user friendly. It’s easy to use and it can be integrated easily within scripts people have so there isn’t a lot of extra effort required.”

Last year, Google released its open-source differential privacy library and executives spoke about how they use it for a variety of their services. If you’ve ever looked at Google Maps and seen that fun chart of times when a business will be the busiest, you can thank differential privacy for it. 

Differential privacy allows Google to anonymously track data about when most people eat at a certain restaurant or shop at a popular store. In 2014, the company used it to improve its Chrome browser as well as Google Fi.

Companies like Apple and Uber use versions of differential privacy to optimize their services while protecting the data of users.

Holohan said the IBM repository is already being used extensively for experimentation and to see what effect differential privacy has on machine learning algorithms. Academic institutions and bloggers are using the software to show how differential privacy works, and Holohan added that the library is being used internally at IBM to look at the impact of differential privacy on various applications.

“It has applicability to basically any application of data so that gives a very good opportunity to do a lot of work in a lot of different areas. We have focused on machine learning because the application of privacy-preserving protocols to machine learning fits very well and machine learning is very prevalent in any use of data,” he said. 

“The next step is going to be allowing data scientists and analysts to be able to do a lot of statistical analysis easily with differential privacy and our library is the first or a few steps along that path.”


Salesforce Research develops COVID-19 search engine


The AI-powered tool, COVID-19 Search, is designed to give scientists and researchers the most relevant research in one place.


From February to May 2020, the number of scientific papers published on COVID-19 skyrocketed from 29,000 to more than 138,000, according to Salesforce. As people around the world step up to help, the number will continue to grow exponentially and is projected to swell to more than one million by the end of 2020.


The company believes scientists and researchers on the frontline of the pandemic should not have to spend their time digging through thousands of pages of COVID-19 research. So on Wednesday, Salesforce Research introduced COVID-19 Search, an artificial intelligence (AI)-powered search engine to equip scientists and researchers with the most relevant COVID-19 research. It is designed to help users sort through the clutter to make complicated research information easier to find.

The tool combines neural semantic search AI and traditional syntactic search AI to give scientists, researchers and others a more efficient way to find and filter information, Salesforce Research said.

“Searching scientific publications requires different techniques from traditional keyword-matching search engines,” wrote Salesforce researchers in a blog post. “It’s critical that a COVID-19 search engine interpret the proper meaning in a given search, going beyond finding results based on the frequency with which words appear in documents. And with long documents, it’s valuable to quickly surface relevant passages in search results.”


COVID-19 Search addresses this by combining text retrieval and natural language processing (NLP)—including semantic search, state-of-the-art question answering, and abstractive summarization—to better understand the question and surface the most relevant scientific results, the researchers said.

The order of words in a single scientific search is very specific, and a slight change in that order can have a drastically different meaning, the company said. “For example, searching for ‘What expression pathways does SARS-CoV-2 induce?’ is substantially different from ‘What is the expression pathway of SARS-CoV-2?’”

The results need to align with the context of the query, the company said.

“So we combined information retrieval (IR) search with our strengths in NLP to emphasize semantic search that models the meaning behind the query.” 

To train the search engine, the Salesforce researchers said they split scientific publications into pairs of paragraphs and citations that could be used to train algorithms to determine if the title of a citation was referenced by a paragraph. The same AI can be used to take a query and find paragraphs in a document set that address it. 
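The pairing step described above can be sketched roughly as follows. This is an illustration, not Salesforce's code: the record layout, the function name `build_training_pairs` and the sample papers are hypothetical, and the use of randomly drawn uncited titles as negative examples is an assumption of this sketch rather than something the researchers describe.

```python
import random


def build_training_pairs(papers, rng):
    """Turn papers into (paragraph_text, citation_title, label) examples.

    label 1: the paragraph cites the work with that title.
    label 0: a randomly drawn title the paragraph does not cite
             (negative sampling is an assumption of this sketch).
    """
    all_titles = sorted(
        {title
         for paper in papers
         for para in paper["paragraphs"]
         for title in para["cited_titles"]}
    )
    pairs = []
    for paper in papers:
        for para in paper["paragraphs"]:
            cited = set(para["cited_titles"])
            uncited = [t for t in all_titles if t not in cited]
            for title in para["cited_titles"]:
                pairs.append((para["text"], title, 1))
                if uncited:
                    pairs.append((para["text"], rng.choice(uncited), 0))
    return pairs


# Hypothetical input, invented for this example.
papers = [
    {
        "title": "Paper A",
        "paragraphs": [
            {"text": "Prior work measured viral load in patients [1].",
             "cited_titles": ["Viral load dynamics"]},
            {"text": "Masks and distancing were studied in [2] and [3].",
             "cited_titles": ["Mask efficacy", "Distancing models"]},
        ],
    },
]

pairs = build_training_pairs(papers, random.Random(0))
for text, title, label in pairs:
    print(label, "|", title, "|", text)
```

A model trained on examples like these learns to score how well a title matches a paragraph. Swapping the title for a user's query gives the reuse the researchers mention.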


“Semantic search combs through the massive population of documents and returns a subset, maybe 100 or 1,000,” the researchers wrote. “We run these documents through a question-answering AI that treats the user’s query as a question and does its best to generate an answer from the retrieved documents.”

If an answer is contained in any single document, the company said, COVID-19 Search can re-rank the document list to surface the document. 
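The two-stage flow the researchers describe, retrieval followed by answer-based re-ranking, can be sketched in miniature. Both stages below are deliberately simple stand-ins and not the models Salesforce uses: word-count cosine similarity takes the place of neural semantic search, and a check for one sentence that mentions every query term takes the place of a question-answering model. The documents, the stop-word list and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Tiny stop-word list, chosen only for this example.
STOPWORDS = {"a", "are", "can", "does", "how", "in", "of", "the", "to", "what"}


def tokenize(text):
    """Lowercase, split into alphanumeric tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, docs, k):
    """Stage 1 (stand-in for semantic search): indices of the k best matches."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(d))), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


def find_answer(terms, doc):
    """Stage 2 (stand-in for a QA model): first sentence mentioning every term."""
    for sentence in re.split(r"(?<=[.!?])\s+", doc):
        if terms <= set(tokenize(sentence)):
            return sentence
    return None


def search(query, docs, k=3):
    """Retrieve k candidates, then move documents containing an answer to the top."""
    candidates = retrieve(query, docs, k)
    terms = set(tokenize(query))
    answered = [i for i in candidates if find_answer(terms, docs[i]) is not None]
    rest = [i for i in candidates if i not in answered]
    return answered + rest


# Hypothetical document set, invented for this example.
docs = [
    "The virus spreads through respiratory droplets. Masks reduce the spread of the virus.",
    "Receptor studies are common. A virus can bind to a host.",
    "Coronaviruses are a large family. The virus uses the ACE2 receptor to bind to human cells.",
    "Vaccine trials began in March.",
]
query = "What receptor does the virus bind to?"

first_stage = retrieve(query, docs, 3)
final = search(query, docs, 3)
print("after retrieval: ", first_stage)
print("after re-ranking:", final)
```

In this toy data the second document shares the most words with the query, so it tops the first stage, but only the third document answers the question in a single sentence. The re-ranking step moves that document to the front.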

With the threat of a second wave of infections looming, there is a new sense of urgency for ways to help mitigate and cure COVID-19, the researchers said. “Humanity needs cures, vaccines, and solutions. COVID-19 Search can empower scientists on the front lines to accomplish those tasks faster.”
