The integration of deep learning with traditional industries has driven an unprecedented explosion in AI applications. But as Fei-Fei Li, a professor at Stanford University, has said, there is still a long way to go, whether in terms of machine intelligence, human talent, or hardware.
Learning never ends, yet for a long time there has been little significant progress on the algorithmic front, leaving models with some innate weaknesses when deployed, and AI has never stopped being questioned. The privacy problems raised by the proliferation of artificial intelligence, for example, demand self-restraint from technology companies, and clearly also call for the algorithms themselves to be optimized and improved.
How will AI affect people's privacy? One article cannot answer such a complex question, but we hope to at least put it on the table now.
When neural networks have memory
Before discussing privacy, let's revisit that old friend, the LSTM model.
We have introduced its workings many times before. Simply put, it adds the concept of memory to a neural network, so that the model can retain information across a long sequence and use it to make predictions. AI's seemingly magical abilities, such as writing fluent articles or holding smooth, natural conversations with humans, are built on this capability.
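The "memory" in question is the LSTM cell state, updated through forget, input, and output gates at each step. Here is a minimal sketch of a single forward step in NumPy; the weight shapes and sizes are illustrative choices, not taken from any particular implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*hidden, input+hidden), b: (4*hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[:hidden])               # forget gate: what to erase
    i = sigmoid(z[hidden:2 * hidden])     # input gate: what to write
    g = np.tanh(z[2 * hidden:3 * hidden]) # candidate cell content
    o = sigmoid(z[3 * hidden:])           # output gate: what to expose
    c = f * c_prev + i * g                # cell state carries long-term memory
    h = o * np.tanh(c)                    # hidden state is the visible output
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):                        # run a short input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
print(h.shape, c.shape)
```

The forget gate `f` is exactly the mechanism behind "learning to forget" discussed below: when it saturates near zero, the old cell state is wiped.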
Later, over a long period, scientists extended and supplemented neural-network memory in a series of ways. For example, attention mechanisms were introduced so that LSTM networks could track information accurately over long spans. Another example is augmenting sequence-generation models with external memory to improve the performance of convolutional networks.
In general, better memory gives neural networks, on the one hand, the ability to perform complex relational reasoning, which markedly raises their intelligence; on the application side, the experience of intelligent systems for writing, translation, and customer service has also been greatly upgraded. To some extent, memory is where AI began to shed its reputation for "artificial stupidity".
Having memory, however, also brings two problems. First, neural networks must learn to forget, freeing up storage and keeping only the important information. At the end of a chapter in a novel, for example, the model should reset the chapter's details and retain only the relevant conclusions.
Second, the "subconscious" of neural networks deserves vigilance. In short, after a machine learning model is trained on sensitive user data, will it carry that sensitive information out with it when released to the public? In a digital age where everyone's data can be collected, does this mean privacy risks are growing?
Does AI really remember private data in secret?
To answer this question, researchers at UC Berkeley ran a series of experiments, and the conclusion may shock many people: yes, AI may well be keeping your data in mind.
To understand the "unintentional memorization" of neural networks, we first need a concept: overfitting.
In deep learning, a model is said to overfit when it performs well on its training data but cannot reach the same accuracy or error rate on data outside the training set. The main causes of this gap between the laboratory and real samples are noise in the training data, or simply too little data.
As a common side effect of deep neural network training, overfitting is a global phenomenon: a property of the dataset as a whole. To test whether a neural network secretly "remembers" sensitive information from its training data, one has to look at local details instead, such as whether the model has a special attachment to a single example (a credit card number or an account password, say).
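The global train-versus-test gap is easy to reproduce. A throwaway illustration: fit a polynomial that is far too flexible for a small noisy sample (the degree, sample sizes, and noise level here are arbitrary choices for the demo):

```python
import numpy as np

rng = np.random.default_rng(42)

def noisy_line(n):
    """Sample points from a simple linear signal plus Gaussian noise."""
    x = rng.uniform(-1, 1, n)
    return x, 2 * x + rng.normal(scale=0.3, size=n)

x_train, y_train = noisy_line(10)    # small training set
x_test, y_test = noisy_line(200)     # held-out data

# 10 points, degree 9: the fit nearly interpolates the training noise
coef = np.polyfit(x_train, y_train, deg=9)
train_mse = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
print(f"train MSE {train_mse:.4f}  vs  test MSE {test_mse:.4f}")
```

Training error collapses toward zero while held-out error balloons, which is the global signature; what it cannot tell you is whether any one record was memorized.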
To probe the model's "unintentional memorization", the Berkeley researchers worked in three stages:
First, keep the model from overfitting. Gradient descent is run on the training data to minimize the network's loss, so that the final model's accuracy on the training data approaches 100%.
Next, give the machine a task that requires understanding the underlying structure of language. This is usually done by training a classifier on a sequence of words or characters to predict the next token after seeing the preceding context tokens.
Finally, run a controlled experiment. Into the standard Penn Treebank (PTB) dataset the researchers inserted a random number, "281265017", as a security marker (a "canary"). A small language model was then trained on the augmented dataset: given the preceding characters of context, predict the next character.
In theory, the model is far smaller than the dataset, so it cannot possibly memorize all of the training data. Can it still remember that one string?
The answer is YES.
When the researchers fed the model the prefix "random number is 2812", it happily and correctly predicted the entire remaining suffix: "65017".
More surprising still, when the prefix was shortened to "random number is", the model did not immediately emit the string "281265017". But when the researchers computed the likelihood of every possible nine-digit suffix, the inserted canary ranked as more likely under the model than the other suffixes.
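The whole experiment can be mimicked end to end with a toy character n-gram model standing in for the neural network. The corpus, the model order, and the 1,000-candidate pool below are stand-ins for the paper's real setup; only the canary string is taken from the text above:

```python
import math
from collections import Counter, defaultdict

ORDER = 5  # characters of context for the toy model

filler = "the quick brown fox jumps over the lazy dog . " * 20
canary = "the random number is 281265017 ."
corpus = filler + canary + " " + filler  # the canary appears exactly once

# "train": count next-character frequencies for every length-5 context
counts = defaultdict(Counter)
for t in range(len(corpus) - ORDER):
    counts[corpus[t:t + ORDER]][corpus[t + ORDER]] += 1

vocab = set(corpus)

def logprob(text):
    """Add-one-smoothed log-probability of text under the n-gram model."""
    lp = 0.0
    for t in range(ORDER, len(text)):
        c = counts.get(text[t - ORDER:t], Counter())
        lp += math.log((c[text[t]] + 1) / (sum(c.values()) + len(vocab)))
    return lp

def complete(prefix, n):
    """Greedy decoding: always pick the most frequent next character."""
    out = prefix
    for _ in range(n):
        nxt = counts.get(out[-ORDER:], Counter()).most_common(1)
        out += nxt[0][0] if nxt else "?"
    return out

# a long prefix deterministically recovers the memorized suffix
print(complete("the random number is 2812", 5))  # ends with "65017"

# rank the true suffix against 999 other nine-digit candidates
prefix = "the random number is "
candidates = ["281265017"] + [f"{i:09d}" for i in range(999)]
scores = {s: logprob(prefix + s) for s in candidates}
rank = sorted(scores, key=scores.get, reverse=True).index("281265017") + 1
exposure = math.log2(len(candidates) / rank)  # exposure-style metric
print(rank, round(exposure, 2))
```

Because the canary's contexts occur exactly once in the corpus, greedy decoding reproduces it, and its likelihood rank is 1, giving the maximum exposure for this candidate pool. A real neural model shows the same qualitative behavior, just less starkly.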
So far, we can cautiously draw a rough conclusion: deep neural network models really do, without meaning to, memorize sensitive data fed to them during training.
When AI has a subconscious, should humans panic?
As we know, AI has become a cross-scene, cross-industry social movement. From recommendation systems and medical diagnosis to the cameras densely deployed across cities, ever more user data is collected to feed algorithmic models, and that data may well contain sensitive information.
Previously, developers would often anonymize the sensitive columns of a dataset. But this does not make the sensitive information absolutely safe, because an attacker with bad intentions can still recover the original data through lookup tables and similar methods.
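The lookup-table attack is worth seeing concretely. A common naive "anonymization" is to replace each value with its hash; but when the value space is small, the attacker simply precomputes the table and inverts it. A minimal sketch (the phone-number format is a made-up example):

```python
import hashlib

def pseudonymize(phone: str) -> str:
    """Naive 'anonymization': replace the value with its SHA-256 hash."""
    return hashlib.sha256(phone.encode()).hexdigest()

# the data owner publishes hashed phone numbers, believing them safe
published = pseudonymize("555-0142")

# the attacker enumerates the (small) value space into a lookup table
table = {pseudonymize(f"555-{i:04d}"): f"555-{i:04d}" for i in range(10000)}
recovered = table.get(published)
print(recovered)  # prints 555-0142
```

Ten thousand hashes take milliseconds to precompute, which is why hashing a low-entropy column provides essentially no anonymity.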
Since models inevitably touch sensitive data, measuring how much a model memorizes its training data is a natural part of evaluating the security of future algorithmic models.
Three questions need answering here:
- Is the "unintentional memorization" of neural networks more dangerous than ordinary overfitting?
The Berkeley study found that after only the first epoch of training, the model had already begun to memorize the inserted canary characters. The tests also showed, however, that exposure of the canary typically peaks and starts to decline before the model begins to overfit, that is, before the test loss starts to rise.
We can therefore conclude that while "unintentional memorization" carries real risk, it is not more dangerous than overfitting.
- In what scenarios might the specific risks of "unintentional memorization" arise?
Of course, "not more dangerous" does not mean harmless. In their experiments, the researchers found that with an improved search algorithm, only tens of thousands of queries were enough to extract a 16-digit credit card number or an 8-digit password. The details of the attack have been made public.
That is, if sensitive information is inserted into the training data and the model is released to the world, the probability of exposure is actually high, even when the model shows no sign of overfitting. Worse, the situation raises no immediate alarm, which greatly increases the security risk.
- What are the preconditions for private data to leak?
So far, the canary strings the researchers inserted into the dataset have proved more likely to be exposed than other random data, following a roughly normal distribution. This means the data in a model does not share a uniform exposure risk: deliberately inserted data is more dangerous.
Moreover, extracting a sequence from a model's "unintentional memory" is not easy; it takes pure brute force, in effect unbounded compute. Enumerating all nine-digit social security numbers, for example, takes only a few GPU-hours, while enumerating all 16-digit credit card numbers would take thousands of GPU-years.
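The scale gap in that comparison is worth making concrete. Under a hypothetical fixed throughput of candidate evaluations per GPU-hour (the rate below is an assumed round number, not a measured figure), cost scales linearly with the size of the value space:

```python
# back-of-the-envelope: candidate spaces for brute-force extraction
ssn_space = 10 ** 9     # all nine-digit numbers
card_space = 10 ** 16   # all sixteen-digit numbers

# hypothetical throughput: candidates scored per GPU-hour (assumed)
rate = 10 ** 8

ssn_gpu_hours = ssn_space / rate
card_gpu_years = card_space / rate / (24 * 365)
print(f"SSNs:  {ssn_gpu_hours:.0f} GPU-hours")
print(f"cards: {card_gpu_years:,.0f} GPU-years")
```

The seven-orders-of-magnitude difference between the two spaces is what turns "a few GPU-hours" into "thousands of GPU-years", regardless of the exact throughput assumed.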
For now, as long as this "unintentional memorization" can be quantified, the security of sensitive training data can be kept within bounds. Knowing how much training data a model has stored, and how much it has over-memorized, helps train models toward the optimum and helps people judge how sensitive the data is and how likely the model is to leak it.
In the past, discussions of AI industrialization mostly stayed at the macro level: how to eliminate algorithmic bias, how to open up the black box of complex neural networks, how to "ground" the technology so that its dividends actually land. Now, with the basic groundwork and concept-building largely complete, AI is moving toward refinement and iterative upgrades at the micro level, and that may be the future the industry is looking forward to.