Taster

The blog is about the life and thoughts of a Solution Architect who comes across interesting challenges, and some stupid things, in his struggle to make a living. He may also have discussed some nonsense. That has been his habit.

Thursday, May 28, 2015

Machine Learning for pornography filtering


From the white paper by Charaka Danansooriya and Upul Bandara, published in October 2013 for the white paper competition organized by Virtusa Corporation. The paper won first place.

1. Problem Statement

"There is evidence that the prevalence of pornography in the lives of many children and adolescents is far more significant than most adults realize, that pornography is deforming the healthy sexual development of these young viewers, and that it is used to exploit children and adolescents." [1]

The Social Costs of Pornography: A Statement of Findings and Recommendations [1] not only reveals the danger pornography poses to the social, moral and health development of children, but also implies the importance of ways and means of preventing pornography from reaching an unintended audience over the Internet. The research further reveals that, for some users, pornography can be psychologically addictive and can negatively affect the quality of interpersonal relationships, sexual health and performance, and social expectations about sexual behavior.

There have been various attempts by governments and institutions to keep pornography from reaching children. Parents, teachers, schools and organizations use monitoring software and firewall rules. Having recognized that unrestricted access to pornography harms society at large, many governments are now trying to block pornographic HTTP requests and responses. For example, Huawei, a Chinese firm, performs the actual filtering of UK web traffic as part of the anti-porn policy of the Cameron government [2]. Blocking these sites accurately and effectively has been a big challenge so far, for the reasons described below. The obstacles also include tech-savvy youngsters.

However, critics of modern filtering approaches argue that sites about sexual health and sexuality could inadvertently get caught up. The following limitations of filters also cast doubt on the success of filtering technologies [3]:
  • Filters also under-block - that is, they fail to identify and block many pornographic sites.
  • Filters initially operate by searching the World Wide Web, or "harvesting," for possibly inappropriate sites, largely relying on key words and phrases. There follows a process of "winnowing," which also relies largely on these mechanical techniques.
  • Large portions of the Web are never reached by the harvesting and winnowing process.
  • Most filtering companies also use some form of human review. But because 10,000 - 30,000 new Web pages enter the "work queue" each day, the companies' relatively small staffs (between eight and a few dozen people) can give at most a cursory review to a fraction of these sites, and human error is inevitable.
  • Filtering company employees' judgments are also necessarily subjective, and reflect their employers' social and political views. Some filtering systems reflect conservative religious views.
  • Filters frequently block all pages on a site, no matter how innocent, based on a "root URL." Likewise, one item of disapproved content - for example, a sexuality column on Salon.com - often results in blockage of the entire site.

2. Proposed Solution

This section discusses the mechanism we suggest to address the problem statement, in terms of the technical approach and its underlying mathematical justification. See the illustration below.

Figure 1: The process flow architecture of the proposed solution.

Figure 1 provides a high-level flow diagram for classifying HTML pages; in other words, the diagram captures the whole solution we are suggesting. Each requested web page is pre-processed and its features (distinct textual words) are extracted before being fed to the machine learning algorithm. The pre-processing pipeline consists of several stages, discussed below. For rapid modelling, the pre-processing steps are proposed to be implemented with the Python NLTK library [4]. The machine learning algorithms were prototyped using the scikit-learn library [5].

2.1 HTML Processing

The filtration process starts when the system receives HTML content over HTTP. For ease of explanation and modelling, this paper focuses on textual processing. However, a very similar machine learning process can be used to evaluate images and streams; for video streams, sample image frames are expected to be extracted. While processing the HTML, the system can separately store the links to images and videos, which can then be classified separately. These links are also helpful when tuning the accuracy of the learning algorithms.
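The separation of page text from image and video links can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the authors' implementation: the class name `PageExtractor`, the helper `extract`, and the choice of which tags count as media are assumptions made for the example.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect visible text plus image and video links from one HTML page."""

    SKIPPED = {"script", "style"}  # content that is never shown to the reader
    # Assumption: <source> is treated as video, although it can also sit in <audio>.
    MEDIA = {"img": "images", "video": "videos", "source": "videos"}

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.images = []
        self.videos = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        src = dict(attrs).get("src")
        if src and tag in self.MEDIA:
            getattr(self, self.MEDIA[tag]).append(src)

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that lies outside <script> and <style> elements.
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (plain_text, image_links, video_links) for one page."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.text_parts), parser.images, parser.videos
```

The text goes on to the language processing stage, while the two link lists can be stored for separate image and video classification.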

2.2 Language Processing

Web pages are converted to plain text by stripping off the HTML tags. The text documents are then tokenized and stemmed using the Porter stemmer. Removing stop words and punctuation can improve performance dramatically, so both are removed.
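A minimal sketch of this stage follows. To keep it free of external dependencies, the stemmer is passed in as a function and defaults to leaving words unchanged, and the stop-word list is a tiny illustrative one. Both are assumptions for the example, not the lists or settings used in the paper.

```python
import re

# Tiny illustrative stop-word list; a real run would use a full list
# such as the one shipped with NLTK (an assumption, not the paper's list).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}


def preprocess(text, stem=lambda word: word):
    """Lower-case, tokenize, drop stop words and punctuation, then stem."""
    # The pattern keeps runs of letters only, so punctuation and digits are dropped.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(token) for token in tokens if token not in STOP_WORDS]
```

With NLTK installed, the Porter stemmer can be plugged in as `preprocess(text, stem=nltk.stem.PorterStemmer().stem)`.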

2.3 Vectorization

At this stage the set of tokens is turned into a sparse matrix. During the literature review we noticed that transforming text with a TF-IDF transformer can greatly improve prediction accuracy, so the model also incorporates the TF-IDF conversion. The sparse matrix produced from an unknown HTML page is used as the input to the machine learning algorithm (logistic regression) for classification. Sparse matrices previously built from known data (porn-positive and porn-negative) provide the reference training dataset for the algorithm.
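To make the TF-IDF weighting concrete, here is a small pure-Python sketch that uses the textbook definition (term frequency times the logarithm of the inverse document frequency). It is an illustration only: scikit-learn's transformer applies smoothing and normalization, so its numbers will differ, and the function name `tfidf` and the dict-based sparse rows are choices made for this example.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return (vocabulary, rows); each row is a sparse dict {column: weight}.

    documents is a list of token lists, one list per page.
    """
    vocabulary = {
        token: column
        for column, token in enumerate(sorted({t for d in documents for t in d}))
    }
    n_docs = len(documents)
    # Document frequency: in how many documents each token appears.
    df = Counter(t for d in documents for t in set(d))
    rows = []
    for doc in documents:
        counts = Counter(doc)
        rows.append({
            vocabulary[t]: (c / len(doc)) * math.log(n_docs / df[t])
            for t, c in counts.items()
        })
    return vocabulary, rows
```

A token that occurs in every document gets weight zero, which is how TF-IDF plays down words that do not help to tell the two classes apart.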

2.4 Logistic Regression: A brief Overview

Many algorithms can be proposed for classifying pornographic content, including Logistic Regression, Deep Neural Networks (DNN) and Support Vector Machines. However, considering the ease of quick modelling and the performance limitations of our development machines, Logistic Regression [6] was given priority. The features extracted by the feature extraction pipeline are used as inputs. The algorithm returns a value between zero and one, based on which the system decides whether the web page in question is pornographic or not.
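The decision step can be sketched as below, with the features given as a sparse mapping from column index to value. The 0.5 threshold and the function names are assumptions for illustration; the paper does not state the cut-off it used.

```python
import math


def sigmoid(z):
    """Logistic function, written to stay numerically stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(theta, features):
    """theta[0] is the bias term; features is a sparse dict {column: value}."""
    z = theta[0] + sum(theta[j + 1] * x for j, x in features.items())
    return sigmoid(z)


def is_pornographic(theta, features, threshold=0.5):
    """Turn the value between zero and one into a yes/no answer."""
    return predict_proba(theta, features) >= threshold
```

Raising the threshold makes the filter block fewer pages, at the price of letting more pornographic pages through.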




Here m represents the number of examples in the training set and λ represents the regularization parameter of the cost function J(θ). J(θ) is minimized in order to obtain the logistic regression parameters.
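For reference, a standard form of the regularized logistic regression cost that is consistent with the symbols m, λ and J(θ) is given below. The hypothesis h_θ, the labels y^{(i)} and the number of features n are notation assumed here; the paper's own equation may be written differently.

```latex
J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\, y^{(i)} \log h_\theta\!\left(x^{(i)}\right)
          + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta\!\left(x^{(i)}\right)\right) \right]
          + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2},
\qquad
h_\theta(x) = \frac{1}{1 + e^{-\theta^{T}x}}
```

The first sum penalizes confident wrong predictions, while the second term shrinks the weights and so limits over-fitting when there are many text features.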

2.5 Training and evaluating the system

The system is trained and evaluated using a dataset. A training dataset is a standard requirement for many machine learning algorithms.
What is the training dataset?
The training dataset is used to train the Logistic Regression classifier as a supervised learning problem.

The dataset consists of HTML files downloaded from known pornographic and known non-pornographic sources. Following common best practice, the dataset is divided into three subsets: training, cross validation and testing. Logistic regression was trained on the training subset. The main hyper-parameter of the system is the number of features (text tokens) extracted from the text data, and the cross validation subset is used to find its optimum value. See the test results in Figure 2.
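The three-way split can be sketched as follows. The 60/20/20 ratio, the fixed seed and the function name `split_dataset` are assumptions for illustration, since the paper does not state the proportions it used.

```python
import random


def split_dataset(examples, train=0.6, validation=0.2, seed=0):
    """Shuffle and split into training, cross validation and testing subsets.

    The 60/20/20 ratio is an assumed default; the paper does not state one.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

Shuffling before splitting matters here because the pornographic and non-pornographic files are typically collected in separate batches.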

Finally, the system is evaluated on the testing subset. The test results are illustrated in the next section (Application of Solution).

3. Application of Solution

Before this white paper was released, the solution proposed above was modelled and tested for text-based filtration. A very similar approach is suggested for image and video classification; however, time did not permit us to model that as well.

The proposed solution is expected to be applied at the ISP level. The reasons for that decision include the extensive processing hardware needed to deploy machine learning algorithms. Scaling and load handling are better addressed at the ISP level, where connection speed can offset the probable delays caused by the extra processing during filtration. As an implicit benefit, filtering at the ISP level reduces the motivation and opportunity for tech-savvy kids to bypass technical barriers, compared to firewalls and other filtering software installed on a kid's laptop.

3.1 Resource requirement

In production, every web page request is separately validated by a statistical analyser backed by machine learning algorithms. Scaling and load balancing can be addressed separately, as they are outside the main scope of this paper. It is well known in machine learning that neural networks and other numerical algorithms need a lot of CPU power. Even with contemporary processing capabilities, improving performance in production is a challenge because of the extensive number crunching and language processing involved. Machine learning researchers generally use GPUs (Graphics Processing Units) to run their algorithms. Not having one was a resource limitation for us during deployment and testing: with a GPU, we could have trained and tested the solution against a bigger dataset, which would have improved accuracy. Under the current resource limitations we could only use a training dataset of nearly 500 pornographic HTML pages and another 500 non-pornographic web pages downloaded from the Internet.

3.2 Binary Classification

The prototype we actually developed treats the issue only as a binary classification. At this early stage it was easier to implement the solution as a binary classification problem, since building a training dataset for more categories would take much longer.

3.3 Multiple Classifications

In production, the same problem must be addressed as a multi-class classification, where pages are classified into multiple categories (e.g. Hard Core, Gay, Lesbian, Dating, Sex Education, Sexual Health). Such a classification will enable ISPs to separate clearly between various kinds of adult content. Based on the classification, responsible authorities and parents can efficiently tailor what may be seen. Criminal, cruel, deceitful, abusive, illegal and harmful content could be controlled more accurately. For ISPs, multi-class classification may add value by bringing flexible control over adult material, letting them distinguish between market segments.

3.4 Selection of Algorithms

The approach must also be tested with other algorithms, such as Support Vector Machines (SVM) [7] and neural networks, for accuracy and performance. According to recent research, Deep Neural Networks (DNN) [8] have been consistently beating many machine learning benchmarks. They have proved beneficial when the task is complex enough and there is enough data to capture that complexity [9]. Therefore the model must be critically evaluated with both SVMs and Deep Neural Networks before the production algorithm is finalized.

3.5 Training Dataset

At short notice, the authors could download only a limited amount of data. The dataset used for the actual test consists of 175 documents of pornographic content and 175 documents of non-pornographic content. For better accuracy, the dataset needs enough data in each category for the decision boundaries between categories to be clearly distinguished. There are statistical methods and rules of thumb for deciding the size of the dataset. A dataset of 350 documents proved inadequate for a good result. In future work we will add more data (e.g. 10,000 documents per category) to the dataset.

3.6 Test Results

Logistic Regression is trained iteratively with an increasing number of text tokens (known as features in the machine learning literature). For each iteration, the accuracy of the system is measured. Finally, a graph of cross validation accuracy versus number of features is plotted, as given below. Using the graph as a tool, we selected the optimum number of features and ran the logistic regression against the training dataset. The graph was plotted using the Python library Matplotlib.

Figure 2: Cross validation versus number of features
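The selection loop behind such a graph can be sketched as follows. The training and scoring step is left to a caller-supplied function, here called `cv_accuracy`, which is a hypothetical name. Preferring the smaller feature count on a tie is an assumption for this example, not a rule stated in the paper.

```python
def select_feature_count(candidates, cv_accuracy):
    """Return (best_count, scores), where scores maps each count to its accuracy.

    cv_accuracy is a caller-supplied function: it trains on the training
    subset with the given number of features and returns the accuracy
    measured on the cross validation subset.
    """
    scores = {count: cv_accuracy(count) for count in candidates}
    # Highest accuracy wins; on ties, prefer the smaller (cheaper) model.
    best = min(scores, key=lambda count: (-scores[count], count))
    return best, scores
```

With scikit-learn, `cv_accuracy(k)` could for instance fit a `TfidfVectorizer(max_features=k)` and a `LogisticRegression` on the training subset and score them on the cross validation subset. The returned `scores` mapping holds exactly the points that the graph plots.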

Finally, the system is evaluated on the testing subset; the results are given in Table 1.


Table 1: Prediction accuracy of the testing dataset

These results are the baseline of our experiment. An accuracy of 70% is clearly better than random guessing. At this very early stage of the experiment, this is considered good evidence that machine learning algorithms can in principle be used for porn filtering. Going one step further, and because of time limitations, the authors generated sample dummy documents (5000 each) with a dedicated Python program that makes small variations to the original content. With these dummy documents added to the training dataset, we observed an accuracy of 85%. That is a very promising result. It suggests that, in the next steps, prediction accuracy can be improved by increasing the diversity and distribution of the dataset.
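The authors' generator program is not shown, so the following is only a guess at what "small variations" could look like: each dummy document is a copy of an original tokenized document with roughly one token in ten dropped at random. The function name `make_variants`, the drop rate and the seed are all assumptions.

```python
import random


def make_variants(tokens, n_variants, drop_rate=0.1, seed=0):
    """Create slightly altered copies of one tokenized document.

    Each copy randomly drops about drop_rate of the tokens. This is only
    one plausible kind of "small variation"; the authors' program is not shown.
    """
    rng = random.Random(seed)
    variants = []
    for _ in range(n_variants):
        kept = [t for t in tokens if rng.random() >= drop_rate]
        variants.append(kept or list(tokens))  # never emit an empty document
    return variants
```

If this approach is used, variants of one source page are best kept in the same subset, since near-duplicates spread across training and testing data can inflate the measured accuracy.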


Figure 3: Cross validation versus number of features, using the dummy data with slight variations as the training dataset

Previous research has shown that accuracy in a similar situation could be improved to 90% or more [10].

Given this empirical evidence and these test results, we are confident that the proposed solution can be used in practice and improved towards actual production in order to address the issues identified above.

4. Conclusion

The proposed solution addresses the issues identified. The current limitations can be overcome and the solution can be brought up to production. Since machine learning classifies using statistical predictions based on features, the approach can reduce the controversial blocks caused by mechanical keyword matching (e.g. Dick can be a person's name or can have another meaning). Since every page request is filtered, the approach also avoids blocking a whole site, blocking only the inappropriate content. The test results give confidence that the model can be developed for production to meet a required level of accuracy. Further R&D on this approach could improve it to exceed competing solutions in accuracy. Given the results, the next level of R&D must source test data with an adequate distribution. The algorithms must also incorporate the following aspects in order to enhance accuracy.


  • Number of cross links going out of the site
  • Number of images available on the page
  • Number of streams broadcast from the page
  • A mechanical lookup against a list of banned sites
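The aspects listed above could be wired in roughly as follows: a banned-site lookup that runs before the classifier, and three counts that are appended to the text features. The host `banned.example` and the function names are hypothetical, and matching on parent domains is an assumption about how such a list would be applied.

```python
from urllib.parse import urlparse

BANNED_HOSTS = {"banned.example"}  # hypothetical list entry


def is_banned(url, banned_hosts=BANNED_HOSTS):
    """True if the URL's host, or any parent domain of it, is on the list."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    return any(".".join(parts[i:]) in banned_hosts for i in range(len(parts)))


def page_signals(outgoing_links, images, streams):
    """Numeric features that could be appended to the text features."""
    return [len(outgoing_links), len(images), len(streams)]
```

The lookup is cheap, so running it first would spare the classifier from scoring pages whose site is already known.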


The solution has only been tested with the Logistic Regression algorithm. As explained, the model must also be tested with Deep Neural Network (DNN) and Support Vector Machine (SVM) algorithms to find the most suitable one. The model is expected to be deployed at the ISP level, where the infrastructure is adequately resourced for processing and other scaling considerations.

For this paper, the authors could only test a prototype as a binary classification (pornography, non-pornography). As discussed in the application of the solution, a multi-class classification approach could allow parents or legally mandated authorities to control and tailor what may be seen. Once the processing limitations are overcome, the solution may be bundled into a browser in future.

Limitations include individual pages escaping the filter. All in all, even an ideal filter will not fully prevent underage children from encountering adult content, as physical pornographic material can still be stored and shared among them. Any filtration technology only provides a technical control over what is available on the Internet. Therefore it is still the responsibility of parents and teachers to monitor children and make them aware of good habits, practices, behaviours and morals.

Bibliography

[1]  M. A. Layden, “The Social Costs of Pornography: A Statement of Findings and Recommendations,” Witherspoon Institute, Inc., 16 Stockton Street, Princeton, New Jersey 08540, 2010.
[2]  T. Cushing, “UK's Anti-Porn Filtering Being Handled By A Chinese Company,” 26 July 2013. [Online]. Available: http://www.techdirt.com/articles/20130725/20042323953/uks-anti-porn-filtering-being-handled-chinese-company.shtml. [Accessed 26 09 2013].
[3]  “Fact Sheet on Internet Filters,” 2003. [Online]. Available: http://www.fepproject.org/factsheets/filtering.html. [Accessed 24 09 2013].
[4]  “Natural Language Toolkit,” [Online]. Available: http://nltk.org/. [Accessed 27 09 2013].
[5]  scikit-learn.org, “scikit-learn Machine Learning in Python,” [Online]. Available: http://scikit-learn.org/stable/. [Accessed 27 09 2012].
[6]  K. P. Murphy, Machine Learning: A Probabilistic Perspective, Cambridge, MA: The MIT Press, 2012.
[7]  C. M. Bishop, Pattern Recognition and Machine Learning, 1st ed., Springer, 2007.
[8]  Y. Bengio, “Learning Deep Architectures for AI,” Foundations and Trends in Machine Learning, pp. 1-127, 2009.
[9]  H. Larochelle, Y. Bengio, J. Louradour and P. Lamblin, “Exploring Strategies for Training Deep Neural Networks,” Journal of Machine Learning Research, pp. 1,2, 2009.
[10]  U. Bandara and G. Wijayarathna, “Deep Neural Networks for Source Code Author Identification (Accepted),” in 20th International Conference on Neural Information Processing (ICONIP), Daegu, Korea, 2013.

Abbreviations

DNN Deep Neural Network
CPU Central Processing Unit
GPU Graphics Processing Unit
HTML Hypertext Markup Language
ISP Internet Service Provider
NLTK Natural Language Toolkit
R&D Research and Development
SVM Support Vector Machine