Taster

The blog is about the life and thoughts of a Solution Architect who comes across interesting challenges and some silly things in his struggle for a living. He may also have discussed some nonsense; that has been his habit.

Tuesday, June 2, 2015

Joshua Bloch's Static Factory (Effective Java) vs GOF Factory

Today our code review session was very interesting. The debate centred on the static factory method. Below I have put some sample code with common mistakes and ways to improve on them.

Many times I think the beauty of OOP lies in design patterns. GOF teaches design patterns in Creational, Behavioral and Structural terms. We have also seen other design patterns, such as concurrency patterns, and in Java/J2EE container-based environments we also see Enterprise design patterns and Enterprise Integration patterns. However, applying and choosing between design patterns, as well as writing correct code, needs some level of reading and experience.

For me the true elegance of Design Patterns lies in fundamentals such as abstraction, polymorphism and encapsulation. In many cases the choice between design patterns can fundamentally be based on the principles of coupling and cohesion. Algorithm performance, code cosmetics, structure, and the other -ilities such as maintainability, extensibility and readability, are important aspects too.

Today's review case was about using Joshua Bloch's static factory method (Effective Java) to improve on the GOF Factory design pattern.

The GOF Factory design pattern is well described everywhere on the web using the famous example of different shapes, such as rectangles and circles, derived from a Shape interface. I am going to use this popular example for the explanation below. The code also repeats common programming mistakes which were revealed during the review; those should be useful for junior OOP developers learning pragmatic programming best practices.

In the code below we have two factories: the Gang of Four shape factory and the static shape factory. First we will take the GOFShapeFactory class and try to improve it.

A good first improvement to the GOFShapeFactory class would be replacing the chain of if/else string comparisons with a switch statement (Java 7 and later support switch on String). Next we would think in terms of code structure: wouldn't it be nice if we could produce shape objects such as Rectangle, Circle and Square directly, without this switch at all? And how do we do that without creating GOFShapeFactory reference objects multiple times? This is where the static factory pattern comes in handy. We think of static class methods and get new objects as in the StaticShapeFactory class below. The benefits are improved readability and more concise code, and no factory object needs to be created on the heap just to call a method, which reduces coupling at runtime. Since these are class methods, you call them by class name instead of through an object reference, and each static create method initializes its objects in its own way.

StaticShapeFactory.createCircleShape("Circle").draw();

Can you still improve the code below?


public interface Shape {
    void draw();
}

public class Rectangle implements Shape {

    @Override
    public void draw() {
        System.out.println("you are drawing a Rectangle");
    }
}

public class Circle implements Shape {

    @Override
    public void draw() {
        System.out.println("you are drawing a Circle");
    }
}

public class Square implements Shape {

    @Override
    public void draw() {
        System.out.println("you are drawing a Square");
    }
}


The traditional factory implementation is provided below. As explained, this code can be improved.

public class GOFShapeFactory {

    public Shape getShape(String shapeType) {
        if (shapeType == null) {
            return null;
        }
        if (shapeType.equalsIgnoreCase("CIRCLE")) {
            return new Circle();
        } else if (shapeType.equalsIgnoreCase("RECTANGLE")) {
            return new Rectangle();
        } else if (shapeType.equalsIgnoreCase("SQUARE")) {
            return new Square();
        }
        return null;
    }
}

Based on the above explanation, Joshua Bloch's static factory idea can effectively be put into practice to improve the old factory pattern in terms of:

1) Looser coupling.
2) Better cohesion.
3) Better performance - no string comparison or switch statement needed.
4) Better readability.
5) Less verbosity.

Below is the new factory class, designed based on the static factory principles described. We can set private instance variables here and tailor each create method in any way we like.

public class StaticShapeFactory {

    // Note: the shapeType argument is not used yet; each method could use it
    // to configure the object it creates.
    public static Shape createRectangularShape(String shapeType) {
        return new Rectangle();
    }

    public static Shape createSquareShape(String shapeType) {
        return new Square();
    }

    public static Shape createCircleShape(String shapeType) {
        return new Circle();
    }
}

FactoryDemo is the class that contains the main method.

public class FactoryDemo {

    public static void main(String[] args) {
        GOFShapeFactory gofShapeFactory = new GOFShapeFactory();
        // Wrong object creation for a static factory - no instance is needed.
        StaticShapeFactory statShapeFactory = new StaticShapeFactory();

        // Get an object of Circle and call its draw method.
        Shape shape1gof = gofShapeFactory.getShape("CIRCLE");
        // Wrong way to call a static method - through an instance reference.
        Shape shape1stat = statShapeFactory.createCircleShape("Circle");
        // Correct way - call the static method through the class name.
        StaticShapeFactory.createCircleShape("Circle").draw();
        // Call the draw method of Circle.
        shape1gof.draw();
        shape1stat.draw();

        // Get an object of Rectangle and call its draw method.
        Shape shape2gof = gofShapeFactory.getShape("RECTANGLE");
        // Wrong way to call a static method - through an instance reference.
        Shape shape2stat = statShapeFactory.createRectangularShape("Rectangle");
        // Call the draw method of Rectangle.
        shape2gof.draw();
        shape2stat.draw();

        // Get an object of Square and call its draw method.
        Shape shape3gof = gofShapeFactory.getShape("SQUARE");
        // Wrong way to call a static method - through an instance reference.
        Shape shape3stat = statShapeFactory.createSquareShape("Square");
        // Call the draw method of Square.
        shape3gof.draw();
        shape3stat.draw();
    }
}

Thursday, May 28, 2015

Machine Learning for pornography filtering


From the white paper published by Charaka Danansooriya and Upul Bandara in October 2013 for the white paper competition organized by Virtusa Corporation. The paper won first place.

1. Problem Statement

"There is evidence that the prevalence of pornography in the lives of many children and adolescents is far more significant than most adults realize, that pornography is deforming the healthy sexual development of these young viewers, and that it is used to exploit children and adolescents." [1]

The Social Costs of Pornography - A Statement of Findings and Recommendations [1] not only reveals the danger pornography poses to the social, moral and health development of children, but also implies the importance of ways and means of preventing pornography from reaching unintended audiences over the Internet. The research further reveals that for some users pornography can be psychologically addictive, and can negatively affect the quality of interpersonal relationships, sexual health and performance, and social expectations about sexual behavior.

There have been various attempts by governments and institutions to block pornography from reaching children. Parents, teachers, schools and organizations are using various monitoring software packages and configuring their firewalls. Having identified that unrestricted porn access acts negatively on societies at large, many governments are now trying to block pornographic HTTP requests/responses. For example Huawei, a Chinese firm, is performing the actual filtering of UK web traffic as part of the anti-porn policy of the Cameron government [2]. Blocking these porn sites accurately and effectively has been a big challenge so far, due to the various reasons outlined below; the obstacles also include tech-savvy youngsters.

However, critics argue against modern filtering approaches that sites about sexual health and sexuality can inadvertently get caught up. The following limitations of filters also call the success of filtering technologies into question: [3]
  • Filters also under-block - that is, they fail to identify and block many pornographic sites.
  • Filters initially operate by searching the World Wide Web, or "harvesting," for possibly inappropriate sites, largely relying on key words and phrases. There follows a process of "winnowing," which also relies largely on these mechanical techniques.
  • Large portions of the Web are never reached by the harvesting and winnowing process.
  • Most filtering companies also use some form of human review. But because 10,000 - 30,000 new Web pages enter the "work queue" each day, the companies' relatively small staffs (between eight and a few dozen people) can give at most a cursory review to a fraction of these sites, and human error is inevitable.
  • Filtering company employees' judgments are also necessarily subjective, and reflect their employers' social and political views. Some filtering systems reflect conservative religious views.
  • Filters frequently block all pages on a site, no matter how innocent, based on a "root URL." Likewise, one item of disapproved content - for example, a sexuality column on Salon.com - often results in blockage of the entire site.

2. Proposed Solution

This section discusses the mechanism we suggest to address the problem statement, in terms of the technical approach and the underlying mathematical justification. See the illustration below.

Figure 1: The process flow architecture for the proposed solution.

Figure 1 above provides a comprehensive high-level flow diagram for classifying HTML pages; in other words, the diagram encloses the total solution we are suggesting. Each web page request is pre-processed and features (distinct textual words) are extracted before being fed to the machine learning algorithm. The pre-processing pipeline consists of several stages, discussed below. For easy, rapid modelling, the pre-processing steps were implemented using the Python NLTK library [4], and the machine learning algorithms were prototyped using the scikit-learn library [5].
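
To make the overall flow concrete, here is a minimal sketch of such a text-classification pipeline in scikit-learn; the example documents and labels are illustrative stand-ins, not the actual dataset.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# texts: plain text extracted from HTML pages (sections 2.1-2.2)
# labels: 1 = pornographic, 0 = non-pornographic
texts = ["sample adult page text", "sample news page text"]
labels = [1, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),      # section 2.3: vectorization
    ("clf", LogisticRegression()),     # section 2.4: classification
])
pipeline.fit(texts, labels)
print(pipeline.predict_proba(["an unseen page's text"]))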

2.1 HTML Processing

The filtration process starts when the system receives HTML content over HTTP. For ease of explanation and modelling in this paper, the focus is on textual processing; however, a very similar machine learning process can be used for image and stream evaluation, where sample image frames are extracted from video streams. While processing the HTML, the system can separately store the links to images and videos, which can then be used to classify them separately. These links are also helpful during accuracy tuning of the learning algorithms.
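
As a sketch of this stage, the snippet below strips an HTML page to plain text while collecting image and video links for separate classification. The paper does not name an HTML parser, so the use of BeautifulSoup here is an assumption for illustration.

from bs4 import BeautifulSoup  # assumed parser; not specified in the paper

def process_html(html):
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)                 # plain text for sections 2.2-2.4
    image_links = [img.get("src") for img in soup.find_all("img")]
    video_links = [v.get("src") for v in soup.find_all(["video", "source"])]
    return text, image_links, video_links

text, images, videos = process_html("<html><body><p>Hi</p><img src='a.jpg'/></body></html>")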

2.2 Language Processing

HTML tags are stripped off to convert web pages to plain text. The text documents are then tokenized and stemmed; the Porter stemmer is used. Removing stop words and punctuation can give a dramatic performance improvement, so stop words and punctuation are removed.
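
A minimal sketch of this stage with NLTK (the library the paper names) might look as follows; it assumes the 'punkt' and 'stopwords' NLTK data packages are installed.

import string
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = word_tokenize(text.lower())                   # tokenize
    tokens = [t for t in tokens
              if t not in stop_words and t not in string.punctuation]  # drop stop words and punctuation
    return [stemmer.stem(t) for t in tokens]               # Porter stemming

print(preprocess("Filters frequently block all pages on a site."))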

2.3 Vectorization

At this stage the set of token words is turned into a sparse matrix. During the literature research we noted that transforming the text with a TF-IDF transformer can enormously improve prediction accuracy, so the model incorporates the TF-IDF conversion. The resulting sparse matrix produced from the unknown HTML is used as the input to the machine learning algorithm (logistic regression) for classification. Sparse matrices previously built from known data (porn-positive and porn-negative) provide the reference training dataset for the algorithm.
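
A sketch of the vectorization stage with scikit-learn's TfidfVectorizer; the documents are illustrative, and max_features corresponds to the feature-count hyper-parameter tuned in section 2.5.

from sklearn.feature_extraction.text import TfidfVectorizer

known_docs = ["tokens from a known porn page", "tokens from a known clean page"]
vectorizer = TfidfVectorizer(max_features=5000)   # cap on the number of features
X_known = vectorizer.fit_transform(known_docs)    # SciPy sparse matrix for training
X_unknown = vectorizer.transform(["tokens from an unseen page"])
print(X_known.shape, X_unknown.shape)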

2.4 Logistic Regression: A brief Overview

For pornographic content classification many algorithms could be proposed, including logistic regression, Deep Neural Networks (DNN) and Support Vector Machines. However, considering the ease of quick modelling and the performance limitations of our development machines, logistic regression [6] was given priority. The features extracted from the feature-extraction pipeline are used as inputs. The algorithm returns a value between zero and one, based on which the system decides whether a subject web page is pornographic or not. The hypothesis and the regularized cost function are:

h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2
Here m represents the number of examples in the training set and λ represents the regularization parameter of the cost function. J(θ) is minimized in order to obtain the logistic regression parameters θ.
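
In scikit-learn, which the prototype used, the same model is fitted as sketched below; note that scikit-learn exposes regularization through C, an inverse regularization strength (a smaller C behaves like a larger λ). The tiny dataset is illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["known porn page tokens", "known clean page tokens"]
labels = [1, 0]                                   # 1 = pornographic
X = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression(C=1.0)                   # C acts roughly like 1/lambda
clf.fit(X, labels)
print(clf.predict_proba(X)[:, 1])                 # value in [0, 1]: P(pornographic)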

2.5 Training and evaluating the system

The system is trained and evaluated using a dataset; a training dataset is quite normal for most machine learning algorithms in use.
What is the training dataset?
It is the data used to train the logistic regression classifier as a supervised learning problem.

The dataset comprised HTML files downloaded from known pornographic content and known non-pornographic content. Following best practice among experts, the dataset was divided into three subsets, termed training, cross-validation and testing. Logistic regression was trained using the training subset. The main hyper-parameter of the system is the number of features (text tokens) extracted from the text data; the cross-validation subset is used to find its optimum value. See the Figure 2 test results.
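
A sketch of such a three-way split with scikit-learn; the 60/20/20 proportions are an assumption, as the paper does not state the exact ratios.

from sklearn.model_selection import train_test_split

docs = ["porn page", "porn page", "clean page", "clean page", "clean page"] * 20
labels = [1, 1, 0, 0, 0] * 20
# First hold out 40%, then split it evenly into cross-validation and test subsets.
docs_train, docs_rest, y_train, y_rest = train_test_split(docs, labels, test_size=0.4, random_state=0)
docs_cv, docs_test, y_cv, y_test = train_test_split(docs_rest, y_rest, test_size=0.5, random_state=0)
print(len(docs_train), len(docs_cv), len(docs_test))  # 60 / 20 / 20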

Finally, the system is gauged using the testing subset. Test result illustrations are provided in the next section (Application of Solution).

3. Application of Solution

Before releasing this white paper, the proposed solution was adequately modelled and tested for text-based filtration. A quite similar approach is suggested for image and video classification; however, time did not permit us to model that too.

The solution we propose is expected to be applied at the ISP level. The reasons supporting that decision include the extensive processing hardware required to deploy machine learning algorithms. Given the above solution, scaling and handling the load at the ISP level is a good choice when weighing the speed of connectivity against the probable delays of the extra processing for filtration. An implicit benefit is that deploying the filtration at ISP level hinders the motivation and the chances of tech-savvy kids overcoming technical barriers, compared to firewalls and other filtration software deployed on kids' laptops.

3.1 Resource requirement

At production, every web page request is separately validated against a statistical analyser supported by machine learning algorithms. Scaling and load-balancing issues can be addressed separately, as they are outside the main scope of this paper. It is well known in machine learning that neural networks and other numerical algorithms need a lot of CPU power; even with contemporary processing capabilities, production performance is a challenge due to the extensive number-crunching and language-processing operations. Machine learning researchers generally use GPUs (Graphical Processing Units) to deploy their algorithms. This has been a resource limitation for us as researchers during deployment and testing: had we had a GPU, we could have trained and tested the solution against a bigger dataset, which is always better in terms of accuracy. Under the current resource limitations we could only use a training dataset of nearly 500 pornographic HTML pages and another 500 non-pornographic web pages downloaded from the Internet.

3.2 Binary Classification

The actual prototype we developed treats the issue only as binary classification. At this early stage it was easier to implement the solution as a binary classification problem, as it otherwise takes a lengthy processing time to develop a training dataset.

3.3 Multiple Classifications

At production the same problem must be addressed as multiple classification, where pages are classified into multiple categories (e.g. Hard Core, Gay, Lesbian, Dating, Sex Education, Sexual Health, etc.). Such classification will enable ISPs to clearly separate various kinds of adult content. Based on the classification, responsible authorities and parents may efficiently tailor what they want to see, and criminal, cruel, deceitful, abusive, illegal and harmful content could be more accurately controlled. For ISPs, multiple classification may add value by bringing flexible control over adult material to distinguish between market segments. A sketch of this extension follows.
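
As a sketch under the same prototype stack, scikit-learn's logistic regression extends to multiple categories simply by training on multi-class labels (it handles the one-vs-rest machinery internally); the category names and documents here are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["hard core page text", "dating page text", "sex education page text", "ordinary news page text"]
labels = ["hardcore", "dating", "sex_education", "clean"]
X = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression().fit(X, labels)          # multi-class handled automatically
print(clf.predict(X))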

3.4 Selection of Algorithms

The model must also be tested with other algorithms such as Support Vector Machines (SVM) [7] and neural networks for accuracy and performance. According to much recent research, Deep Neural Networks (DNN) [8] have been continuously beating many machine learning benchmarks. They have proved beneficial when the task is complex enough and there is enough data to capture that complexity [9]. Therefore the model must be critically evaluated against both SVM and deep neural networks before finalizing the algorithm for production.

3.5 Training Dataset

At short notice, the authors could download only a limited amount of data. The dataset used for the actual test consists of 175 documents of pornographic content and 175 documents of non-pornographic content. It has been shown that for better accuracy a dataset needs an adequate amount of data per category, so that the decision boundaries between individual categories can be clearly distinguished. There are statistical methods, and some black art, available for choosing the scale of the dataset; a dataset of 350 documents proved not adequate to earn a good result. In future work we will be adding more data (e.g. 10,000 documents per category) to the dataset.

3.6 Test Results

Logistic regression is trained iteratively with an increasing number of text tokens (also known as features in the machine learning literature). For each iteration the accuracy of the system is measured. Finally, a cross-validation accuracy versus number of features graph is plotted, as given below. Using the graph as a tool we selected the optimum number of features and ran logistic regression against the training dataset. The graph was plotted using the Matplotlib Python library; a sketch of the sweep appears after Figure 2.
Figure 2: Cross-validation accuracy versus number of features
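
A sketch of the sweep behind Figure 2, assuming the illustrative stand-in corpus below; for each candidate feature count a fresh vectorizer and classifier are trained and the cross-validation accuracy is recorded.

import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = ["porn sample text words", "clean sample text words"] * 50
train_labels = [1, 0] * 50
cv_docs, cv_labels = train_docs, train_labels     # stand-in for a real cross-validation subset

feature_counts = [100, 500, 1000, 2000, 5000]
accuracies = []
for n in feature_counts:
    vec = TfidfVectorizer(max_features=n)
    clf = LogisticRegression().fit(vec.fit_transform(train_docs), train_labels)
    accuracies.append(clf.score(vec.transform(cv_docs), cv_labels))

plt.plot(feature_counts, accuracies, marker="o")
plt.xlabel("Number of features")
plt.ylabel("Cross-validation accuracy")
plt.show()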

Finally, the system is gauged using the testing subset and the results are given in Table 1.


Table 1: Prediction accuracy of the testing dataset

In fact, the above results show the baseline of our experiment. 70% can be considered a good level of accuracy compared to random guessing, and at this very early stage of the experiment it is good evidence that machine learning algorithms can, hypothetically, be used for porn filtering. Going a step further, given the time limitations, the authors created sample dummy documents (5,000 per category) with a purpose-written Python program that makes small variations to the original content. Adding these dummy documents to the training dataset, we observed accuracy at the level of 85%, which is a very promising result. This thoroughly confirmed that, in the next steps, the accuracy of the predictions can be improved by increasing the diversity and distribution of the dataset.
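
The paper does not specify exactly what the small variations were, so the snippet below is only an illustrative guess at such a dummy-document generator: it derives variants of an original document by randomly dropping and duplicating words.

import random

def make_variant(text, rng, drop_prob=0.1):
    # Drop roughly 10% of words, then duplicate one word to perturb term counts.
    words = [w for w in text.split() if rng.random() > drop_prob]
    if not words:
        return text
    i = rng.randrange(len(words))
    words.insert(i, words[i])
    return " ".join(words)

rng = random.Random(0)
original = "example page text with many distinctive words"
dummies = [make_variant(original, rng) for _ in range(5000)]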


Figure 3: Cross-validation accuracy versus number of features, when the dummy data with slight variance is used as the training dataset

Previous research has shown that a similar situation can be improved to reach accuracy levels of 90% or more [10].

Given all this empirical evidence and the test results, we are confident that the proposed solution can practically be used, and improved towards actual production, in order to address the issues identified above.

4. Conclusion

The proposed solution provides a successful resolution of the issues identified; the current limitations can be overcome and the solution brought up to actual production. Since machine learning uses statistical predictions based on features for classification, the approach can reduce the controversial blocks caused by mechanical keyword matching (e.g. "Dick" can be a person's name or carry another meaning). Since every page request is filtered, the approach will also avoid blocking a whole site rather than just the inappropriate content. The test results provide confidence that the model can be developed in production to meet a good level of accuracy, and further R&D on this approach could improve it to exceed competitors on accuracy. Given the results, the next level of R&D must source an adequately distributed dataset. The following aspects must also be incorporated into the algorithms in order to enhance accuracy (a sketch of how such page-level signals could be appended follows the list):


  • Number of cross links going out of the site
  • Number of images available on the page
  • Number of streams broadcasting from the page
  • A mechanical look-up against a list of banned sites
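
As a sketch (with made-up counts), page-level signals such as these could be appended to the TF-IDF matrix as extra feature columns; a banned-site look-up would instead act as a hard filter before classification.

import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["page text one", "page text two"]
# Hypothetical per-page counts: [outgoing cross links, images, streams]
extra = np.array([[12, 4, 0], [3, 30, 2]])
X_text = TfidfVectorizer().fit_transform(docs)
X = hstack([X_text, extra])   # append the page-level counts as extra features
print(X.shape)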


The solution has only been tested with the logistic regression algorithm. As explained, the model must be tested with Deep Neural Network (DNN) and Support Vector Machine (SVM) algorithms to find the most suitable one. The model is expected to be deployed at ISP level, where the infrastructure is adequately resourced for processing and other scaling considerations.

For this paper the authors could only test a prototype as binary classification (pornography, non-pornography). As noted in the application of the solution, a multiple-classification approach would allow parents or legislated authorities to control and tailor what they want to see. Once the processing limitations are overcome, the solution may be bundled into a browser in future.

Limitations include the escape of discrete pages through the filter. All in all, even an ideal filter will not 100% prevent underage children from encountering adult content, as physical pornographic materials can still be stored and shared among them. Any kind of filtration technology only provides a technical control on what is available over the Internet. Therefore it is still the parents' and teachers' responsibility to monitor kids and make them aware of good habits, practices, behaviours and morals.

Bibliography

[1]  M. A. Layden, “THE SOCIAL COSTS OF PORNOGRAPHY A Statement of Findings and Recommendations,” Witherspoon Institute, Inc., 16 Stockton Street, Princeton, New Jersey 08540, 2010.
[2]  T. Cushing, “UK's Anti-Porn Filtering Being Handled By A Chinese Company,” 26 July 2013. [Online]. Available: http://www.techdirt.com/articles/20130725/20042323953/uks-anti-porn-filtering-being-handled-chinese-company.shtml. [Accessed 26 09 2013].
[3]  “FACT SHEET ON INTERNET FILTERS,” 2003. [Online]. Available: http://www.fepproject.org/factsheets/filtering.html. [Accessed 24 09 2013].
[4]  “Natural Language Toolkit,” [Online]. Available: http://nltk.org/. [Accessed 27 09 2013].
[5]  scikit-learn.org, “scikit-learn Machine Learning in Python,” [Online]. Available: http://scikit-learn.org/stable/. [Accessed 27 09 2012].
[6]  K. P. Murphy, Machine learning: a probabilistic perspective, Cambridge, MA: The MIT Press, 2012.
[7]  C. M. Bishop, Pattern Recognition and Machine Learning, 1, Ed., Springer, 2007.
[8]  Y. Bengio, “Learning Deep Architectures for AI,” Foundation and Trends in Machine Learning, pp. 1-127, 2009.
[9]  H. Larochelle, Y. Bengio, J. Louradour and P. Lamblin, “Exploring Strategies for Training Deep Neural Networks,” Journal of Machine Learning Research, vol. 10, pp. 1-40, 2009.
[10]  U. Bandara and G. Wijayarathna, “Deep Neural Networks for Source Code Author Identification (Accepted),” in 20th International Conference on Neural Information Processing (ICONIP), Daegu, Korea, 2013.

Abbreviations

DNN - Deep Neural Network
CPU - Central Processing Unit
GPU - Graphical Processing Unit
HTML - Hypertext Markup Language
ISP - Internet Service Provider
NLTK - Natural Language Toolkit
R&D - Research and Development
SVM - Support Vector Machine