JPL, meet PDF.
While NASA’s Jet Propulsion Laboratory (JPL) is renowned for piloting rovers on Mars and deploying spacecraft to study planets in the solar system, JPL’s latest project is more down-to-earth: assembling the world’s largest publicly available archive of PDFs for security research.
PDF files are the preferred type of digital document on this planet. And while they may seem like scanned copies of paper documents, they are actually collections of text, images, video and live scripts that are not as secure as they need to be given their ubiquity. To address this concern, JPL has partnered with the nonprofit PDF Association to develop a new archive of files that may help researchers analyze potential threats across a large library of real PDFs.
The project involves assembling roughly 8 million PDFs totaling more than 8 TB of data from various online sources. The effort is part of a Defense Advanced Research Projects Agency (DARPA) initiative called Safe Documents (SafeDocs), which aims to protect digital documents from malicious code and other security concerns.
“PDFs are used everywhere and are essential for contracts, legal documents, 3D engineering designs, and many other purposes,” Tim Allison, a JPL data scientist, said in a statement. “Unfortunately, they’re complex and can be compromised to hide malicious code or render different information for different users in a malicious way. To confront these and other challenges from PDFs, a large sample of real-world PDFs must be collected from the web to create a shared, freely available resource for software experts.”
Using the freely available Common Crawl public repository of web crawl data as a starting point, JPL researchers identified PDFs to add to the collection, including those that were incomplete because of Common Crawl’s download limit of 1 megabyte per file. JPL then accessed those PDF URLs directly to download the complete documents, making the archive more representative of the kinds of PDFs accessible on the internet.
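As a rough illustration of that selection step, the short Python sketch below sorts crawl entries into PDFs that can be kept as stored and PDFs that would need a direct download. It is not JPL's actual pipeline: the `CrawlRecord` fields, the `plan_downloads` helper, the byte value used for the 1 megabyte limit, and the rule that a stored size at or above the limit signals an incomplete file are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed byte value for the 1 megabyte per-file limit described in the article.
ONE_MEGABYTE = 1024 * 1024


@dataclass(frozen=True)
class CrawlRecord:
    """Hypothetical, simplified crawl entry (not Common Crawl's real schema)."""
    url: str
    mime_type: str
    stored_bytes: int  # how many bytes of the file the crawl actually kept


def is_pdf(record):
    """Treat an entry as a PDF if its MIME type or URL suffix says so."""
    return (record.mime_type == "application/pdf"
            or record.url.lower().endswith(".pdf"))


def plan_downloads(records, limit=ONE_MEGABYTE):
    """Split PDF URLs into (usable as stored, needs a direct re-download).

    Assumption: a stored size at or above the limit means the copy was cut off.
    Duplicate URLs are considered only once.
    """
    keep, refetch = [], []
    seen = set()
    for record in records:
        if not is_pdf(record) or record.url in seen:
            continue
        seen.add(record.url)
        if record.stored_bytes >= limit:
            refetch.append(record.url)
        else:
            keep.append(record.url)
    return keep, refetch


if __name__ == "__main__":
    sample = [
        CrawlRecord("https://example.org/report.pdf", "application/pdf", 250_000),
        CrawlRecord("https://example.org/manual.PDF", "application/octet-stream", ONE_MEGABYTE),
        CrawlRecord("https://example.org/index.html", "text/html", 12_000),
    ]
    keep, refetch = plan_downloads(sample)
    print("keep as stored:", keep)
    print("download directly:", refetch)
```

The URLs in the second list are the ones that would be fetched again from their original addresses to obtain the complete documents.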
By making the collection available to the general public, JPL hopes researchers will be able to use and analyze the PDFs to discover better ways of securing the information these documents contain.