Yandex leak includes source code for popular Russian search engine
A former employee likely stole many gigabytes of proprietary code
By Alfonso Maruccia
Facepalm: As the fourth largest search engine in the world, Yandex is a genuine tech giant offering many digital or digitally-augmented services. The company recently suffered a security incident whose fallout should be of particular interest to the SEO market.
Almost 50 gigabytes of stolen data from Yandex services were recently shared online. The company is trying to downplay the leak, but the source code shared via torrent can reveal a lot of useful information about how its services – and the web search engine in particular – actually work.
The leak happened on January 25 and involved a list of files that were seemingly stolen in July 2022 from a repository dating back to February 2022 – the month Russia began its full-scale invasion of Ukraine. The torrent doesn't seem to contain any data (or pre-built binaries) except for the source code of all major Yandex services including the search engine with its indexing bot, Maps (Russia's version of Google Maps and Street View), Uber-like service Taxi, Mail, Market (Amazon alternative), cloud platform and much more.
According to software engineer Arseniy Shestakov, the leak is a big deal. "Imagine one company" capable of replacing Google, Uber, Amazon, Netflix and Spotify at once, the coder said. The leak appears to be legitimate, too: Shestakov said he spoke with several current and former Yandex employees, and that some of the archives contain "modern source code" for Yandex services along with documentation pointing to real intranet URLs.
One of the most interesting – and potentially damaging – facets of the leak is the source code of the Yandex search engine, namely the ranking factors used by the algorithm to provide results for user search queries. The leak lists 1,922 unique ranking factors, the majority of which are marked as "deprecated" and have likely been replaced in the most recent versions of Yandex code.
The first ranking factor employed by the Russian search engine is "PAGE_RANK", which is a clear reference to the most important algorithm used by Google to rank web pages. As for Yandex's own web search, the leaked algorithm seems to favor pages that aren't too old, have a lot of organic traffic (i.e., unique visitors), are code-optimized, and are hosted on reliable servers. Wikipedia pages also appear to get favorable treatment.
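To make the idea of combining ranking factors concrete, here is a minimal toy sketch in Python. It is not taken from the leaked code: the `Page` fields, the weights and the normalization choices are all hypothetical, and a real search engine blends far more signals in far more sophisticated ways. The sketch only shows how a handful of signals like those described above (freshness, unique visitors, host reliability, a Wikipedia bonus) could be folded into a single score. The "code-optimized" signal is left out because the article gives no hint of how it is measured.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # All fields are hypothetical stand-ins for the signals described above.
    url: str
    age_days: int                 # how long ago the page was published
    monthly_unique_visitors: int  # proxy for organic traffic
    host_uptime: float            # share of time the server is reachable, 0.0 to 1.0
    is_wikipedia: bool


# Illustrative weights only; they are NOT values from the leak.
WEIGHTS = {"freshness": 0.3, "traffic": 0.3, "reliability": 0.2, "wikipedia": 0.2}


def score(page: Page) -> float:
    """Fold a few signals into one number with a simple weighted sum."""
    # Newer pages score closer to 1.0; the score decays as the page ages.
    freshness = 1.0 / (1.0 + page.age_days / 365.0)
    # Cap traffic at an arbitrary 100,000 visitors so it stays in [0, 1].
    traffic = min(page.monthly_unique_visitors / 100_000, 1.0)
    reliability = page.host_uptime
    wikipedia = 1.0 if page.is_wikipedia else 0.0
    return (
        WEIGHTS["freshness"] * freshness
        + WEIGHTS["traffic"] * traffic
        + WEIGHTS["reliability"] * reliability
        + WEIGHTS["wikipedia"] * wikipedia
    )


def rank(pages: list[Page]) -> list[Page]:
    """Return pages ordered from highest to lowest score."""
    return sorted(pages, key=score, reverse=True)
```

Under these made-up weights, a recent, well-visited page on a reliable host would outrank a decade-old page with almost no visitors, which matches the general tendency described above without claiming anything about the actual formulas Yandex uses.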
The Yandex leak surely offers a lot of information to SEO professionals about how a world-class search engine actually works, though its security implications appear to be limited. Shestakov said that there is no personal data involved, and the few API keys have likely been used for testing only.
Yandex's official press release about the incident said the leaked code fragments are "outdated and differ from the version currently used" by its services, while some of the published fragments "were never actually used in operations."
The company is still investigating the seemingly politically motivated incident and says it will take all possible measures to tighten its management oversight and prevent future leaks.