


Data can be scarcer than you think, and full of traps





AMAZON’S “GO” STORES are impressive places. The cashier-less shops, which first opened in Seattle in 2018, allow app-wielding customers to pick up items and simply walk out with them. The system uses many sensors, but the bulk of the magic is performed by cameras connected to an AI system that tracks items as they are taken from shelves. Once the shoppers leave with their goods, the bill is calculated and they are automatically charged.




Doing that is not easy. The system must handle crowded stores, in which people disappear from view behind other customers. It must recognise individual customers as well as friends or family groups (if a child puts an item into a family basket, the system must realise that it should charge the parents). And it must do all that in real time, and to a high degree of accuracy.




Teaching the machines required showing them a lot of “training data” in the form of videos of customers browsing shelves, picking up items, putting them back and the like. For standardised tasks like image recognition, AI developers can use public training datasets, each containing thousands of pictures. But there was no such training set featuring people browsing in shops.




Some data could be generated by Amazon’s own staff, who were allowed in to test versions of the shops. But that approach took the firm only so far. There are many ways in which a human might take a product from a shelf and then decide to choose it, put it back immediately or return it later. To work in the real world, the system would have to cover as many of those as possible.




In theory, the world is awash with data, the lifeblood of modern AI. IDC, a market-research firm, reckons the world generated 33 zettabytes of data in 2018, enough to fill seven trillion DVDs. But Kathleen Walch of Cognilytica, an AI-focused consultancy, says that, nevertheless, data issues are one of the most common sticking-points in any AI project. As in Amazon’s case, the required data may not exist at all. Or they might be locked up in the vaults of a competitor. Even when relevant data can be dug up, they might not be suitable for feeding to computers.




Data-wrangling of various sorts takes up about 80% of the time consumed in a typical AI project, says Cognilytica. Training a machine-learning system requires large numbers of carefully labelled examples, and those labels usually have to be applied by humans. Big tech firms often do the work internally. Companies that lack the required resources or expertise can take advantage of a growing outsourcing industry to do it for them. A Chinese firm called MBH, for instance, employs more than 300,000 people to label endless pictures of faces, street scenes or medical scans so that they can be processed by machines. Mechanical Turk, another subdivision of Amazon, connects firms with an army of casual human workers who are paid a piece rate to perform repetitive tasks.




Cognilytica reckons that the third-party “data preparation” market was worth more than $1.5bn in 2019 and could grow to $3.5bn by 2024. The data-labelling business is similar, with firms spending at least $1.7bn in 2019, a number that could reach $4.1bn by 2024. Mastery of a topic is not necessary, says Ron Schmelzer, also of Cognilytica. In medical diagnostics, for instance, amateur data-labellers can be trained to become almost as good as doctors at recognising things like fractures and tumours. But some amount of what AI researchers call “domain expertise” is vital.




The data themselves can contain traps. Machine-learning systems correlate inputs with outputs, but they do it blindly, with no understanding of broader context. In 1968 Donald Knuth, a programming guru, warned that computers “do exactly what they are told, no more and no less”. Machine learning is full of examples of Mr Knuth’s dictum, in which machines have followed the letter of the law precisely, while being oblivious to its spirit.




In 2018 researchers at Mount Sinai, a hospital network in New York, found that an AI system trained to spot pneumonia on chest x-rays became markedly less competent when used in hospitals other than those it had been trained in. The researchers discovered that the machine had been able to work out which hospital a scan had come from. (One way was to analyse small metal tokens placed in the corner of scans, which differ between hospitals.)




Since one hospital in its training set had a baseline rate of pneumonia far higher than the others, that information by itself was enough to boost the system’s accuracy substantially. The researchers dubbed that clever wheeze “cheating”, on the grounds that it failed when the system was presented with data from hospitals it did not know.
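The shortcut can be illustrated with a toy simulation. The Python sketch below is not the Mount Sinai model: the hospital names, base rates and function names are all invented for illustration. It "trains" a predictor that never looks at a scan and only memorises each hospital's majority label, then scores it on fresh data from the same hospitals and on a hospital it has never seen.

```python
import random


def make_scans(hospital_rates, n_per_hospital, seed):
    """Return (hospital, has_pneumonia) pairs drawn at each hospital's base rate."""
    rng = random.Random(seed)
    scans = []
    for hospital, rate in hospital_rates.items():
        for _ in range(n_per_hospital):
            scans.append((hospital, rng.random() < rate))
    return scans


def fit_hospital_shortcut(scans):
    """Ignore the image entirely; memorise the majority label of each hospital."""
    counts = {}
    for hospital, label in scans:
        positives, total = counts.get(hospital, (0, 0))
        counts[hospital] = (positives + label, total + 1)
    return {h: positives / total >= 0.5 for h, (positives, total) in counts.items()}


def accuracy(model, scans, default=False):
    """Share of correct predictions; unknown hospitals get the default guess."""
    hits = sum(model.get(hospital, default) == label for hospital, label in scans)
    return hits / len(scans)


# Hypothetical base rates: one hospital sees far more pneumonia than the other.
train_rates = {"hospital_a": 0.80, "hospital_b": 0.05}
model = fit_hospital_shortcut(make_scans(train_rates, 2000, seed=0))

in_acc = accuracy(model, make_scans(train_rates, 2000, seed=1))
out_acc = accuracy(model, make_scans({"hospital_c": 0.50}, 2000, seed=2))

print(f"accuracy at known hospitals:  {in_acc:.2f}")
print(f"accuracy at an unseen hospital: {out_acc:.2f}")
```

Under these assumed rates the shortcut looks impressive at the hospitals it knows and is no better than a coin toss elsewhere, which is the pattern the researchers described.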



Different kind of race



Bias is another source of problems. Last year America’s National Institute of Standards and Technology tested nearly 200 facial recognition algorithms and found that many were significantly less accurate at identifying black faces than white ones. The problem may reflect a preponderance of white faces in their training data. A study from IBM, published last year, found that over 80% of faces in three widely used training sets had light skin.




Such deficiencies are, at least in theory, straightforward to fix (IBM offered a more representative dataset for anyone to use). Other sources of bias can be trickier to remove. In 2017 Amazon abandoned a recruitment project designed to hunt through CVs to identify suitable candidates when the system was found to be favouring male applicants. The post mortem revealed a circular, self-reinforcing problem. The system had been trained on the CVs of previous successful applicants to the firm. But since the tech workforce is already mostly male, a system trained on historical data will latch onto maleness as a strong predictor of suitability.




Humans can try to forbid such inferences, says Fabrice Ciais, who runs PwC’s machine-learning team in Britain (and Amazon tried to do exactly that). In many cases they are required to: in most rich countries employers cannot hire on the basis of factors such as sex, age or race. But algorithms can outsmart their human masters by using proxy variables to reconstruct the forbidden information, says Mr Ciais. Everything from hobbies to previous jobs to area codes in telephone numbers could contain hints that an applicant is likely to be female, or young, or from an ethnic minority.
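A small simulation shows why removing the forbidden column is not enough. The Python sketch below is purely illustrative: the "hobby" flag, the probabilities and the scoring rule are assumptions, not a description of Amazon's system. The model is trained only on the proxy, yet its scores still differ by sex.

```python
import random

rng = random.Random(0)


def make_applicant():
    """Return (is_female, lists_hobby, hired) for one hypothetical past applicant."""
    is_female = rng.random() < 0.3
    # Assumed proxy: a hobby on the CV that women list far more often than men.
    lists_hobby = rng.random() < (0.7 if is_female else 0.1)
    # Assumed biased history: past hiring favoured men.
    hired = rng.random() < (0.2 if is_female else 0.6)
    return is_female, lists_hobby, hired


applicants = [make_applicant() for _ in range(10_000)]


def hire_rate(rows):
    """Historical share of applicants in rows who were hired."""
    return sum(hired for _, _, hired in rows) / len(rows)


# "Train" without the sex column: the score depends only on the hobby flag.
score = {
    flag: hire_rate([a for a in applicants if a[1] == flag])
    for flag in (True, False)
}


def mean_score(is_female):
    """Average model score received by one group of applicants."""
    group = [score[a[1]] for a in applicants if a[0] == is_female]
    return sum(group) / len(group)


women_score = mean_score(True)
men_score = mean_score(False)
print(f"mean score, women: {women_score:.2f}")
print(f"mean score, men:   {men_score:.2f}")
```

Because the hobby flag carries information about sex, the historical bias leaks back into the scores even though sex was never an input.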




If the difficulties of real-world data are too daunting, one option is to make up some data of your own. That is what Amazon did to fine-tune its Go shops. The company used graphics software to create virtual shoppers. Those ersatz humans were used to train the machines on many hard or unusual situations that had not arisen in the real training data, but might when the system was deployed in the real world.




Amazon is not alone. Self-driving car firms do a lot of training in high-fidelity simulations of reality, where no real damage can be done when something goes wrong. A paper in 2018 from Nvidia, a chipmaker, described a method for quickly creating synthetic training data for self-driving cars, and concluded that the resulting algorithms worked better than those trained on real data alone.




Privacy is another attraction of synthetic data. Firms hoping to use AI in medicine or finance must contend with laws such as America’s Health Insurance Portability and Accountability Act, or the European Union’s General Data Protection Regulation. Properly anonymising data can be difficult, a problem that systems trained on made-up people do not need to bother about.



The trick, says Euan Cameron, one of Mr Ciais’s colleagues, is ensuring simulations are close enough to reality that their lessons carry over. For some well-bounded problems such as fraud detection or credit scoring, that is straightforward. Synthetic data can be created by adding statistical noise to the real kind. Although individual transactions are therefore fictitious, it is possible to guarantee that they will have, collectively, the same statistical characteristics as the real data from which they were derived. But the more complicated a problem becomes, the harder it is to ensure that lessons from virtual data will translate smoothly to the real world.
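One simple way to picture the noise-based approach is sketched below in Python. It is an assumption-laden toy, not a method attributed to PwC: the transaction amounts are randomly generated stand-ins, the function name is hypothetical, and only the mean and standard deviation are matched. Real synthetic-data tools also have to preserve correlations between fields and offer privacy guarantees that this sketch does not.

```python
import random
import statistics


def synthesize(real, noise_scale, seed=0):
    """Perturb every record with Gaussian noise, then re-centre and re-scale
    so the synthetic set has the same mean and spread as the real one."""
    rng = random.Random(seed)
    mu = statistics.fmean(real)
    sigma = statistics.pstdev(real)
    noisy = [x + rng.gauss(0.0, noise_scale * sigma) for x in real]
    noisy_mu = statistics.fmean(noisy)
    noisy_sigma = statistics.pstdev(noisy)
    return [mu + (x - noisy_mu) * sigma / noisy_sigma for x in noisy]


# Stand-in for real transaction amounts (hypothetical, randomly generated).
data_rng = random.Random(42)
real_amounts = [data_rng.lognormvariate(3.0, 0.5) for _ in range(5000)]

synthetic = synthesize(real_amounts, noise_scale=0.5)

print(f"real mean {statistics.fmean(real_amounts):.3f}, "
      f"synthetic mean {statistics.fmean(synthetic):.3f}")
print(f"real std  {statistics.pstdev(real_amounts):.3f}, "
      f"synthetic std  {statistics.pstdev(synthetic):.3f}")
```

No individual synthetic amount equals its real counterpart, yet the two sets agree on the summary statistics that were deliberately matched.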




The hope is that all this data-related faff will be a one-off, and that, once trained, a machine-learning model will repay the effort over millions of automated decisions. Amazon has opened 26 Go stores, and has offered to license the technology to other retailers. But even here there are reasons for caution. Many AI models are subject to “drift”, in which changes in how the world works mean their decisions become less accurate over time, says Svetlana Sicular of Gartner, a research firm. Customer behaviour changes, language evolves, regulators change what companies can do.




Sometimes, drift happens overnight. “Buying one-way airline tickets was a good predictor of fraud,” says Ms Sicular. “And then with the covid-19 lockdowns, suddenly lots of innocent people were doing it.” Some facial-recognition systems, used to seeing uncovered human faces, are struggling now that masks have become the norm. Automated logistics systems have needed help from humans to deal with the sudden demand for toilet roll, flour and other staples. The world’s changeability means more training, which means providing the machines with yet more data, in a never-ending cycle of re-training. “AI is not an install-and-forget system,” warns Mr Cameron.
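The one-way-ticket example lends itself to a short illustration. In the Python sketch below every number is an assumption chosen for the demonstration, as are the function names and the retraining threshold. It measures how often a "one-way ticket means fraud" rule is right before and after innocent travellers start buying one-way tickets, and flags when the drop is large enough to warrant re-training.

```python
import random


def simulate_bookings(n, p_one_way_if_fraud, p_one_way_if_legit, fraud_rate, seed):
    """Return (is_one_way, is_fraud) pairs under the given assumed behaviour."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        is_fraud = rng.random() < fraud_rate
        p_one_way = p_one_way_if_fraud if is_fraud else p_one_way_if_legit
        rows.append((rng.random() < p_one_way, is_fraud))
    return rows


def rule_precision(rows):
    """Share of one-way bookings that really are fraudulent."""
    flagged = [is_fraud for is_one_way, is_fraud in rows if is_one_way]
    return sum(flagged) / len(flagged)


def needs_retraining(baseline, current, tolerance=0.10):
    """Flag drift when precision falls more than `tolerance` below baseline."""
    return current < baseline - tolerance


# Before: honest customers rarely buy one-way tickets (assumed 2%).
before = simulate_bookings(20_000, 0.8, 0.02, fraud_rate=0.05, seed=0)
# After lockdowns: many honest customers do (assumed 40%).
after = simulate_bookings(20_000, 0.8, 0.40, fraud_rate=0.05, seed=1)

baseline_precision = rule_precision(before)
current_precision = rule_precision(after)
drifted = needs_retraining(baseline_precision, current_precision)

print(f"precision before: {baseline_precision:.2f}")
print(f"precision after:  {current_precision:.2f}")
print(f"re-train? {drifted}")
```

Nothing about the fraudsters changed in this simulation; the rule degrades only because ordinary customers began to behave differently, which is why monitoring has to continue after deployment.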



  • Source: Internet, 2020-08-20