There have been various attempts by governments and institutions to block pornography from reaching children. Parents, teachers, schools, and organizations use various monitoring software and firewall settings. Having identified that unrestricted access to pornography harms society at large, many governments are now trying to block pornographic HTTP requests and responses. For example, Huawei, a Chinese firm, performs the actual filtering of UK web traffic as part of the anti-porn policy of the Cameron government [2]. Blocking these porn sites accurately and effectively has been a big challenge so far, for the various reasons described below. The obstacles also include tech-savvy youngsters.
However, critics argue that modern filtering approaches may inadvertently catch sites about sexual health and sexuality. The following limitations of filters also call the success of filtering technologies into question [3]:
- Filters also under-block - that is, they fail to identify and block many pornographic sites.
- Filters initially operate by searching, or "harvesting," the World Wide Web for possibly inappropriate sites, relying largely on key words and phrases. There follows a process of "winnowing," which also relies largely on these mechanical techniques.
- Large portions of the Web are never reached by the harvesting and winnowing process.
- Most filtering companies also use some form of human review. But because 10,000 - 30,000 new Web pages enter the "work queue" each day, the companies' relatively
small staffs (between eight and a few dozen people) can give at most a cursory review
to a fraction of these sites, and human error is inevitable.
- Filtering company employees' judgments are also necessarily subjective, and reflect
their employers' social and political views. Some filtering systems reflect conservative
religious views.
- Filters frequently block all pages on a site, no matter how innocent, based on a "root URL." Likewise, one item of disapproved content - for example, a sexuality column on Salon.com - often results in blockage of the entire site.
2. Proposed Solution
This section discusses the mechanism we propose to address the problem statement, in terms of the technical approach and the underlying mathematical justification. See the illustration below.
Figure 1: The process flow architecture for the proposed solution.
Figure 1 above provides a comprehensive high-level flow diagram for classifying HTML pages; in other words, the diagram encapsulates the complete solution we suggest. Each web page request is pre-processed and features (distinct textual words) are extracted before being fed to the machine learning algorithm. The pre-processing pipeline consists of several stages, discussed below. For easy, rapid modelling, the pre-processing steps were implemented using the Python NLTK library [4]. The machine learning algorithms were prototyped using the Scikit-learn library [5].
2.1 HTML Processing
The filtration process starts when the system receives the HTML content over HTTP. For ease of explanation and modelling in this paper, the focus is on textual processing; however, a very similar machine learning process can be used for image and stream evaluation, with sample image frames extracted from video streams. Once the HTML is processed, the system can separately store the links for images and videos, which could be used to classify them separately. These links are also helpful during accuracy tuning of the learning algorithms. A sketch of this stage is given below.
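The paper does not name a specific HTML library, so the sketch below uses Python's standard html.parser module as an assumed choice; it collects the plain text for the language processing stage while setting aside image and video links for separate classification.

```python
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    """Collects plain text plus image and video links for separate classification."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.image_links = []
        self.video_links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and "src" in attrs:
            self.image_links.append(attrs["src"])
        elif tag in ("video", "source", "embed") and "src" in attrs:
            self.video_links.append(attrs["src"])

    def handle_data(self, data):
        self.text_parts.append(data)

extractor = PageExtractor()
extractor.feed("<html><body><p>Sample text</p><img src='a.jpg'/></body></html>")
plain_text = " ".join(extractor.text_parts)  # handed to the language processing stage
```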
2.2 Language Processing
Once the HTML tags are stripped off, the web pages are reduced to plain text. The text documents are then tokenized and stemmed using the Porter stemmer. Removing stop words and punctuation can produce a dramatic performance improvement; therefore, stop words and punctuation are removed. A sketch of this stage follows.
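A minimal sketch of the language processing stage with NLTK, as named above; the exact tokenizer and stop word list used here are our assumptions, since the paper only specifies the Porter stemmer.

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer model (one-time download)
nltk.download("stopwords")  # stop word lists (one-time download)

def preprocess(text):
    """Tokenize, drop stop words and punctuation, then apply Porter stemming."""
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english"))
    tokens = word_tokenize(text.lower())
    kept = [t for t in tokens if t not in stop and t not in string.punctuation]
    return [stemmer.stem(t) for t in kept]

tokens = preprocess("Filtering technologies are blocking these pages.")
```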
2.3 Vectorization
At this stage the set of token words is converted into a sparse matrix. During the literature review we noted that transforming the text with a TF-IDF transformer can greatly improve prediction accuracy, so the model also incorporates the TF-IDF conversion. The resulting sparse matrix produced from an unknown HTML page is used as the input to the machine learning algorithm (logistic regression) for classification. Sparse matrices previously built from known data (porn-positive and porn-negative pages) provide the reference training dataset for the algorithm.
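A sketch of the vectorization stage with Scikit-learn's TfidfVectorizer, which combines token counting and the TF-IDF transform; the toy documents and the feature budget of 5000 are illustrative assumptions, not the paper's values.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the token streams produced by the language processing stage.
train_texts = [
    "tokens from a known porn positive page",
    "tokens from a known porn negative page",
]

# max_features is the number-of-tokens hyper-parameter tuned by cross
# validation in Section 2.5.
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)                # sparse training matrix
X_unknown = vectorizer.transform(["text of an unknown page"])  # sparse input for classification
```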
2.4 Logistic Regression: A brief Overview
Many algorithms could be proposed for pornographic content classification, including logistic regression, Deep Neural Networks (DNN), and Support Vector Machines (SVM). However, considering the ease of quick modelling and the performance limitations of our development machines, logistic regression [6] was given priority. The features extracted from the feature extraction pipeline are used as inputs. The algorithm returns a value between zero and one, based on which the system decides whether a subject web page is pornographic or not.
Logistic regression uses the sigmoid hypothesis

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^{T}x}}$$

and its regularized cost function is

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^{2}$$

There m represents the number of examples in the training set and λ represents the regularization parameter of the cost function. Users are expected to minimize J(θ) in order to obtain the logistic regression parameters.
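As a concrete illustration, a minimal Scikit-learn sketch of the classifier described above; the toy corpus and the 0.5 decision threshold are our assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus; labels: 1 = pornographic, 0 = non-pornographic.
texts = ["known porn positive page text", "known porn negative page text"]
labels = [1, 0]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# C is the inverse of the regularization parameter lambda in J(theta).
clf = LogisticRegression(C=1.0)
clf.fit(X, labels)

# predict_proba yields the value between zero and one described above.
prob = clf.predict_proba(vectorizer.transform(["unknown page text"]))[:, 1]
print("pornographic" if prob[0] >= 0.5 else "not pornographic")
```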
2.5 Training and evaluating the system
The system is trained and evaluated using a dataset, as is standard for supervised machine learning algorithms. The training dataset is used to train the logistic regression classifier as a supervised learning problem. It comprised HTML files downloaded from known pornographic content and from known non-pornographic content. Following best practice among experts, the dataset was divided into three subsets, termed training, cross-validation, and testing. Logistic regression was trained using the training subset. The main hyper-parameter of the system is the number of features (text tokens) extracted from the text data; the cross-validation subset is used to find its optimum value (see the Figure 2 test results). Finally, the system is gauged using the testing subset. Test result illustrations are provided in the next section (Application of Solution).
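The paper does not state the split proportions; the sketch below uses a common 60/20/20 convention, chosen purely for illustration, with Scikit-learn's train_test_split applied twice.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the downloaded HTML pages; 1 = pornographic, 0 = not.
documents = [f"page text {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]

# First carve out the held-out testing subset, then split the remainder
# into training and cross-validation subsets (roughly 60/20/20 overall).
rest_docs, test_docs, rest_y, test_y = train_test_split(
    documents, labels, test_size=0.2, random_state=0)
train_docs, cv_docs, train_y, cv_y = train_test_split(
    rest_docs, rest_y, test_size=0.25, random_state=0)
```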
3. Application of Solution
Before releasing this white paper, the proposed solution was adequately modelled and tested for text-based filtration. A quite similar approach is suggested for image and video classification; however, time did not permit us to model that as well.
We propose to apply the solution at the level of the ISP. The reasons supporting that decision include the extensive processing hardware required to deploy machine learning algorithms. Given the above solution, scaling and handling the load at the ISP level is a good choice when weighing the speed of connectivity against the probable delays from the extra processing at filtration. An implicit benefit is that deploying the filtration at the ISP level hinders the motivation and opportunity for tech-savvy kids to overcome technical barriers, compared with firewalls and other filtering software deployed on a child's laptop.
3.1 Resource requirement
In production, every web page request is separately validated against a statistical analyser supported by machine learning algorithms. Scaling and load balancing issues can be addressed separately, as they are outside the main scope of this paper. It is well known in machine learning that neural networks and other numerical algorithms need a great deal of CPU power. In production, even with contemporary processing capabilities, achieving adequate performance is a challenge due to the extensive number crunching and language processing operations. Machine learning researchers generally use GPUs (Graphical Processing Units) to deploy their algorithms; not having one was a resource limitation for us during deployment and testing. Accuracy would have been better had we been able to train and test the solution against a bigger dataset, assuming we had a GPU. Under the current resource limitations we could only use a training dataset of nearly 500 pornographic HTML pages and another 500 non-pornographic web pages downloaded from the Internet.
3.2 Binary Classification
The actual prototype we developed treats the issue only as binary classification. At the early stage it was easier to implement the solution as a binary classification problem, as otherwise it takes lengthy processing time to develop a training dataset.
3.3 Multiple Classifications
In production the same problem must be addressed as multi-class classification, with pages classified into multiple categories (e.g. hardcore, gay, lesbian, dating, sex education, sexual health, etc.). Such classification will enable ISPs to clearly separate the various kinds of adult content. Based on the classification, responsible authorities and parents may efficiently tailor what they want to see, and criminal, cruel, deceitful, abusive, illegal, and harmful content could be more accurately controlled. For ISPs, multi-class classification may add value by bringing flexible control over adult material to distinguish between market segments. A sketch of the extension is given below.
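A minimal sketch of the multi-class extension; the category labels and toy documents are hypothetical, and we rely on the fact that Scikit-learn's LogisticRegression handles multi-class targets natively, so the binary pipeline carries over unchanged.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy multi-category corpus; the real system would label whole HTML pages.
texts = [
    "sample hardcore page text",
    "sample dating page text",
    "sample sex education page text",
    "sample ordinary news page text",
]
labels = ["hardcore", "dating", "sex_education", "non_adult"]

X = TfidfVectorizer().fit_transform(texts)

# LogisticRegression applies a multi-class scheme automatically when the
# target has more than two labels.
clf = LogisticRegression()
clf.fit(X, labels)
print(clf.predict(X))
```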
3.4 Selection of Algorithms
Other algorithms, such as Support Vector Machines (SVM) [7] and neural networks, must also be tested for accuracy and performance. According to recent research, Deep Neural Networks (DNN) [8] have been continuously beating many machine learning benchmarks. They have proven beneficial when the task is complex enough and there is enough data to capture that complexity [9]. Therefore the model must be critically evaluated against both SVM and deep neural networks before finalizing the algorithm for production.
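A sketch of how such a comparison could be run with Scikit-learn; the toy corpus, the LinearSVC variant of SVM, and the 5-fold cross validation are our illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Toy corpus; the real comparison would run on the full labelled dataset.
texts = [f"adult sample page {i}" for i in range(10)] + \
        [f"ordinary sample page {i}" for i in range(10)]
labels = [1] * 10 + [0] * 10

X = TfidfVectorizer().fit_transform(texts)

# The same 5-fold cross validation is applied to each candidate model.
for model in (LogisticRegression(), LinearSVC()):
    scores = cross_val_score(model, X, labels, cv=5)
    print(type(model).__name__, round(scores.mean(), 3))
```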
3.5 Training Dataset
Given the short notice, the authors could download only a limited amount of data. The dataset used for the actual test consists of 175 documents of pornographic content and 175 documents of non-pornographic content. It has been shown that better accuracy requires a dataset with enough data per category to clearly distinguish the decision boundaries. There are statistical methods and black arts available to define the required scale of the dataset; a dataset of 350 documents proved not adequate to earn a good result. In future work we will be adding more data (e.g. 10,000 documents per category) to the dataset.
3.6 Test Results
Logistic regression was trained iteratively with an increasing number of text tokens (also known as features in the machine learning literature). For each iteration, the accuracy of the system was measured. Finally, cross-validation accuracy versus the number of features was plotted, as given below. Using the graph as a tool, we selected the optimum number of features and ran logistic regression against the training dataset. The graph was plotted using the Matplotlib Python library.
Figure 2: Cross-validation accuracy versus number of features
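A sketch of how such a sweep could be produced; the toy corpus, the 75/25 split, and the feature counts are illustrative assumptions, while the loop mirrors the retrain-and-measure procedure described above.

```python
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy corpus standing in for the real training/cross-validation subsets.
texts = [f"adult page {i}" for i in range(20)] + [f"clean page {i}" for i in range(20)]
labels = [1] * 20 + [0] * 20
train_texts, cv_texts, y_train, y_cv = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Retrain with an increasing feature budget and record CV accuracy.
feature_counts = [5, 10, 20, 30]  # illustrative; real sweeps use far larger values
accuracies = []
for n in feature_counts:
    vec = TfidfVectorizer(max_features=n)
    X_tr = vec.fit_transform(train_texts)
    X_cv = vec.transform(cv_texts)
    clf = LogisticRegression().fit(X_tr, y_train)
    accuracies.append(clf.score(X_cv, y_cv))

plt.plot(feature_counts, accuracies, marker="o")
plt.xlabel("Number of features")
plt.ylabel("Cross-validation accuracy")
plt.show()
```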
Finally, the system is gauged using the testing subset and the results are given in Table 1.
Table 1: Prediction accuracy of the testing dataset
The above results show the baseline of our experiment. An accuracy of 70% can be considered good compared with random guessing; at this very early stage of the experiment, it is reasonable proof that machine learning algorithms can plausibly be used for porn filtering. Going a step further, given the time limitations, the authors created sample dummy documents (5,000 each) with a purpose-built Python program, applying small variations to the original content. After adding these dummy documents to the training dataset, we observed an accuracy level of 85%, which is a very promising result. This strongly suggests that, in the next steps, prediction accuracy can be improved by increasing the diversity and distribution of the training dataset.
Figure 3: Cross-validation accuracy versus number of features, when dummy data of slight variance is used as the training dataset
Previous research has shown that a similar situation can be improved to exceed accuracy levels of 90% or more [10]. Given all this empirical evidence and the test results, we are confident that the proposed solution could be practically used, and improved towards actual production, in order to address the issues identified above.
4. Conclusion
The proposed solution provides a successful resolution for the issues identified. The current limitations can be overcome and the solution can be brought to actual production. Since machine learning uses statistical, feature-based predictions for classification, the approach can reduce controversial blocks caused by mechanical keyword matching (e.g. "Dick" can be a person's name or carry another meaning). Since every page request is filtered, the approach will also avoid blocking a whole site, blocking only the inappropriate content. The test results give confidence that the model can be developed in production to meet a useful level of accuracy. With further R&D, this approach could be improved to exceed competing solutions in accuracy. Given the results, the next level of R&D must source an adequately distributed set of training data. To enhance accuracy, the algorithms must also incorporate the following aspects (see the sketch after this list):
- Number of cross links going out of the site
- Number of images available on the page
- Number of streams broadcast from the page
- A mechanical lookup against a list of banned sites
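One way these signals could enter the model is as extra columns appended to the TF-IDF matrix; the counts, documents, and column layout below are hypothetical.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical per-page signals gathered during HTML processing (Section 2.1):
# [outgoing cross links, images on the page, streams, 1 if the domain is banned].
page_stats = [
    [12, 30, 2, 1],
    [40, 4, 0, 0],
]

texts = ["text of the first page", "text of the second page"]
X_text = TfidfVectorizer().fit_transform(texts)

# Append the numeric signals as extra sparse columns so the same logistic
# regression classifier can consume them alongside the text features.
X_full = hstack([X_text, csr_matrix(page_stats)])
```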
The solution has only been tested with the logistic regression algorithm. As explained, the model must also be tested against Deep Neural Network (DNN) and Support Vector Machine (SVM) algorithms to find the most suitable one. The model is expected to be deployed at the ISP level, where the infrastructure is adequately resourced for processing and other scaling considerations.
For this paper, the authors could only test a prototype as binary classification (pornography, non-pornography). As found in the application of the solution, a multi-class classification approach could allow parents or legislated authorities to control and tailor what they want to see. Once the processing limitations are overcome, the solution may be bundled into a browser in the future.
Limitations include the escape of individual pages through the filter. All in all, even an ideal filter will not 100% prevent underage children from encountering adult content, as physical pornographic materials can be stored and shared among them. Any kind of filtration technology only provides a technical control over what is available on the Internet. Therefore it is still the parents' and teachers' responsibility to monitor children and make them aware of good habits, practices, behaviours, and morals.
Bibliography
[1] M. A. Layden, "The Social Costs of Pornography: A Statement of Findings and Recommendations," Witherspoon Institute, Princeton, NJ, 2010.
[2] T. Cushing, "UK's Anti-Porn Filtering Being Handled By A Chinese Company," 26 July 2013. [Online]. Available: http://www.techdirt.com/articles/20130725/20042323953/uks-anti-porn-filtering-being-handled-chinese-company.shtml. [Accessed 26 09 2013].
[3] "Fact Sheet on Internet Filters," 2003. [Online]. Available: http://www.fepproject.org/factsheets/filtering.html. [Accessed 24 09 2013].
[4] “Natural Language Toolkit,” [Online]. Available: http://nltk.org/. [Accessed 27 09 2013].
[5] scikit-learn.org, “scikit-learn Machine Learning in Python,” [Online]. Available: http://scikit-learn.org/stable/. [Accessed 27 09 2012].
[6] K. P. Murphy, Machine learning: a probabilistic perspective, Cambridge, MA: The MIT
Press, 2012.
[7] C. M. Bishop, Pattern Recognition and Machine Learning, 1, Ed., Springer, 2007.
[8] Y. Bengio, "Learning Deep Architectures for AI," Foundations and Trends in Machine Learning, pp. 1-127, 2009.
[9] H. Larochelle, Y. Bengio, J. Louradour and P. Lamblin, "Exploring Strategies for Training Deep Neural Networks," Journal of Machine Learning Research, vol. 10, pp. 1-40, 2009.
[10] U. Bandara and G. Wijayarathna, "Deep Neural Networks for Source Code Author Identification (accepted)," in 20th International Conference on Neural Information Processing (ICONIP), Daegu, Korea, 2013.
Abbreviations
DNN – Deep Neural Network
CPU – Central Processing Unit
GPU – Graphical Processing Unit
HTML – Hypertext Markup Language
ISP – Internet Service Provider
NLTK – Natural Language Tool Kit
R&D – Research and Development
SVM – Support Vector Machine