Taster

The blog is about the life and thoughts of a Solution Architect who comes across interesting challenges and some stupid things in his struggle for a living. He may also discuss some nonsense. That has been his habit.

Tuesday, June 2, 2015

Joshua Bloch’s Static Factory (Effective Java) vs GOF Factory

Today our code review session was very interesting. The debates centred around the static factory method. Below I have put some sample code with common mistakes and ways to improve them.

Many times I think the beauty of OOP lies in design patterns. GOF teaches design patterns in terms of Creational, Behavioral and Structural categories. We have also seen other design patterns such as concurrency patterns, and in Java/J2EE container-based environments we also see Enterprise design patterns and Enterprise Integration patterns. However, applying and choosing between design patterns, as well as writing correct code, needs some level of reading and experience.

For me the true elegance of design patterns lies in fundamentals such as abstraction, polymorphism and encapsulation. In many places the choice between design patterns can fundamentally be based on the teachings of coupling and cohesion. Algorithm performance, the cosmetics and structure of the code, and the other -ilities such as maintainability, extensibility and readability are important aspects too.

Today's review case was about using Joshua Bloch’s static factory method to improve on the GOF Factory design pattern.

The GOF Factory design pattern is well described everywhere on the web using the famous example of different shapes, such as rectangles and circles, implementing a Shape interface. I am going to use this popular example for the explanation below. The code also repeats common programming mistakes which were revealed during the review. Those should be lovely for junior OOP developers to understand some pragmatic programming best practices.

In the code below we have two factories. One is the Gang of Four shape factory and the other is the static shape factory. First we will take the GOFShapeFactory class and try to improve it.

A good improvement to the GOFShapeFactory class would be replacing the if-else string comparisons with a switch statement. Next we would think in terms of code structure. Wouldn't it be nice if we could produce Shape objects such as Rectangle, Circle and Square directly, without that switch at all? And how do we do that without creating GOFShapeFactory reference objects multiple times? This is where the static factory pattern comes in handy: we write static class methods and get new objects from them, as in the StaticShapeFactory class below. Additional benefits are improved readability and more concise code. It also avoids creating a factory reference object on the heap, which reduces coupling between objects at runtime. Since they are class methods, you call them by class name instead of through an object reference. The static create methods initialize the objects in their respective ways.

                   StaticShapeFactory.createCircleShape("Circle").draw();

Can you still improve the code below?


public interface Shape {
    void draw();
}

public class Rectangle implements Shape {

    public Rectangle() {
        // TODO Auto-generated constructor stub
    }

    @Override
    public void draw() {
        // TODO Auto-generated method stub
        System.out.println("you are drawing a Rectangle");
    }
}

public class Circle implements Shape {

    public Circle() {
        // TODO Auto-generated constructor stub
    }

    @Override
    public void draw() {
        // TODO Auto-generated method stub
        System.out.println("you are drawing a Circle");
    }
}

public class Square implements Shape {

    public Square() {
        // TODO Auto-generated constructor stub
    }

    @Override
    public void draw() {
        // TODO Auto-generated method stub
        System.out.println("you are drawing a Square");
    }
}


The traditional factory implementation is provided below. As explained, this code can be improved.

public class GOFShapeFactory {

    public GOFShapeFactory() {
        // TODO Auto-generated constructor stub
    }

    public Shape getShape(String shapeType) {
        if (shapeType == null) {
            return null;
        }
        if (shapeType.equalsIgnoreCase("CIRCLE")) {
            return new Circle();
        } else if (shapeType.equalsIgnoreCase("RECTANGLE")) {
            return new Rectangle();
        } else if (shapeType.equalsIgnoreCase("SQUARE")) {
            return new Square();
        }
        return null;
    }
}

Based on the above explanation, Joshua Bloch's static factory idea can be effectively put into practice to improve the old factory pattern in terms of:

1) Loose coupling.
2) Better cohesion.
3) Better performance - no string comparisons or switch statements.
4) Better readability.
5) Less verbosity.

Below is the new factory class, designed according to the static factory principles described. We can pretty much set the private instance variables here and cook the class in any way we like.

public class StaticShapeFactory {

    public static Shape createRectangularShape(String shapeType) {
        return new Rectangle();
    }

    public static Shape createSquareShape(String shapeType) {
        return new Square();
    }

    public static Shape createCircleShape(String shapeType) {
        return new Circle();
    }
}

FactoryDemo is the class that contains the main method.

public class FactoryDemo {

    public static void main(String[] args) {
        GOFShapeFactory gofShapeFactory = new GOFShapeFactory();
        StaticShapeFactory statShapeFactory = new StaticShapeFactory();

        // get an object of Circle and call its draw method.
        Shape shape1gof = gofShapeFactory.getShape("CIRCLE");
        // wrong object creation for the static factory: the static method is called through an instance
        Shape shape1stat = statShapeFactory.createCircleShape("Circle");
        // correct way to call the static factory
        StaticShapeFactory.createCircleShape("Circle").draw();
        // call draw method of Circle
        shape1gof.draw();
        shape1stat.draw();

        // get an object of Rectangle and call its draw method.
        Shape shape2gof = gofShapeFactory.getShape("RECTANGLE");
        // wrong way to call a static method (through the instance reference)
        Shape shape2stat = statShapeFactory.createRectangularShape("Rectangle");
        // call draw method of Rectangle
        shape2gof.draw();
        shape2stat.draw();

        // get an object of Square and call its draw method.
        Shape shape3gof = gofShapeFactory.getShape("SQUARE");
        // wrong way to call a static method (through the instance reference)
        Shape shape3stat = statShapeFactory.createSquareShape("Square");
        // call draw method of Square
        shape3gof.draw();
        shape3stat.draw();
    }

}

Thursday, May 28, 2015

Machine Learning for pornography filtering


From the white paper published by Charaka Danansooriya and Upul Bandara in October 2013 for the white paper competition organized by Virtusa Corporation. The paper won first place.

1. Problem Statement

"There is evidence that the prevalence of pornography in the lives of many children and adolescents is far more significant than most adults realize, that pornography is deforming the healthy sexual development of these young viewers, and that it is used to exploit children and adolescents." [1]

The Social Costs of Pornography - A Statement of Findings and Recommendations [1] not only reveals the danger pornography poses to the social, moral and health development of children, but also implies the importance of ways and means of preventing pornography from reaching unintended audiences over the Internet. Layden's research further reveals that for some users pornography can be psychologically addictive, and can negatively affect the quality of interpersonal relationships, sexual health and performance, and social expectations about sexual behavior.

There have been various attempts by governments and institutions to block pornography from reaching children. Parents, teachers, schools and organizations are using various monitoring software and firewall settings. Having identified that default access to pornography acts negatively on societies at large, many governments are now trying to block pornographic HTTP requests/responses. For example Huawei, a Chinese firm, is performing the actual filtering of UK web traffic as part of the anti-porn policy of the Cameron government [2]. Blocking these porn sites accurately and effectively has been a big challenge so far, due to the various reasons defined below. The obstacles also include tech-savvy youngsters.

However, critics raise against the modern filtering approaches that sites about sexual health and sexuality could inadvertently get caught up. The following limitations of filters also call into question the success of filtering technologies. [3]
  • Filters also under-block - that is, they fail to identify and block many pornographic sites.
  • Filters initially operate by searching the World Wide Web, or "harvesting," for possibly inappropriate sites, largely relying on key words and phrases. There follows a process of "winnowing," which also relies largely on these mechanical techniques.
  • Large portions of the Web are never reached by the harvesting and winnowing process.
  • Most filtering companies also use some form of human review. But because 10,000 - 30,000 new Web pages enter the "work queue" each day, the companies' relatively small staffs (between eight and a few dozen people) can give at most a cursory review to a fraction of these sites, and human error is inevitable.
  • Filtering company employees' judgments are also necessarily subjective, and reflect their employers' social and political views. Some filtering systems reflect conservative religious views.
  • Filters frequently block all pages on a site, no matter how innocent, based on a "root URL." Likewise, one item of disapproved content - for example, a sexuality column on Salon.com - often results in blockage of the entire site.

2. Proposed Solution

This section discusses the mechanism we suggest for addressing the problem statement, in terms of the technical approach and the underlying mathematical justification. See the illustration below.

Figure 1: The process flow architecture for the proposed solution.

Figure 1 above provides a comprehensive high-level flow diagram for classifying HTML pages. In other words, the diagram encloses the total solution we are suggesting. Each requested web page is pre-processed and features (distinct textual words) are extracted before being fed to the machine learning algorithms. The pre-processing pipeline consists of several stages discussed below. For easy, rapid modelling, the pre-processing steps were proposed to be addressed using the Python NLTK [4] library. Machine learning algorithms were prototyped using the scikit-learn library [5].

2.1 HTML Processing

The filtration process starts when the system receives HTML content over HTTP. For ease of explanation and modelling in this paper, the focus has been given to textual processing. However, a very similar machine learning process can be used for image and stream evaluation; from video streams it is also expected to extract sample image frames. Once the HTML is processed, the system can separately store links to images and videos, which can be used to classify them separately. These links are also helpful during accuracy tuning of the learning algorithms.
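The paper does not name a specific HTML-processing library, so purely as a sketch of this stage, Python's standard html.parser can strip the tags and collect the image and video links mentioned above:

# Sketch only: extract visible text plus image/video links from one HTML page.
from html.parser import HTMLParser

class PageExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text_parts, self.image_links, self.video_links = [], [], []
        self._skip = False  # True while inside <script>/<style> blocks

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip = True
        elif tag == "img" and attrs.get("src"):
            self.image_links.append(attrs["src"])
        elif tag in ("video", "source", "embed") and attrs.get("src"):
            self.video_links.append(attrs["src"])

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.text_parts.append(data.strip())

def process_html(html):
    extractor = PageExtractor()
    extractor.feed(html)
    return " ".join(extractor.text_parts), extractor.image_links, extractor.video_links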

2.2 Language Processing

Web pages are converted to plain text by stripping off the HTML tags. The text documents are then tokenized and stemmed; the Porter stemmer is used. Removing stop words and punctuation can give a dramatic performance gain, so stop words and punctuation are removed.
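A minimal sketch of this stage with NLTK (the library proposed in Section 2), assuming its standard English stop-word list and Porter stemmer:

# Sketch only: tokenize, drop stop words and punctuation, then Porter-stem.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt", quiet=True)       # tokenizer model
nltk.download("stopwords", quiet=True)   # English stop-word list

_stemmer = PorterStemmer()
_stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = nltk.word_tokenize(text.lower())
    kept = [t for t in tokens if t.isalpha() and t not in _stop_words]
    return [_stemmer.stem(t) for t in kept]

# e.g. preprocess("Removing stop words helps the classifier")
# -> ['remov', 'stop', 'word', 'help', 'classifi']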

2.3 Vectorization

At this stage the set of word tokens is turned into a sparse matrix. During the literature research we noted that transforming the text with a TF-IDF transformer can help enormously to improve the prediction accuracy, so the model also incorporates the tf-idf conversion. The resulting sparse matrix produced from the unknown HTML is used as the input to the machine learning algorithm (logistic regression) for classification. Sparse matrices previously built from known data (porn-positive and porn-negative) provide the training dataset for the algorithm.
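With scikit-learn this stage could look roughly as follows: TfidfVectorizer builds the sparse tf-idf matrix from the known pages and maps an unknown page into the same feature space. The train_documents and unknown_html_text names, and the max_features value, are placeholders rather than figures from the paper:

# Sketch only: tf-idf vectorization of known and unknown pages.
from sklearn.feature_extraction.text import TfidfVectorizer

# `preprocess` is the NLTK tokenizer/stemmer sketched above; max_features is
# the hyper-parameter tuned later on the cross-validation set.
vectorizer = TfidfVectorizer(tokenizer=preprocess, lowercase=False, max_features=5000)

X_train = vectorizer.fit_transform(train_documents)    # scipy sparse matrix (known pages)
x_unknown = vectorizer.transform([unknown_html_text])  # one sparse row, same columns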

2.4 Logistic Regression: A brief Overview

For pornographic content classification we could propose many algorithms, including Logistic Regression, Deep Neural Networks (DNN) and Support Vector Machines. However, considering the ease of quick modelling and the performance limitations of our development machines, Logistic Regression [6] is given priority. The features extracted from the feature extraction pipeline are used as inputs. The algorithm returns a value between zero and one, based on which the system answers whether a subject web page is pornographic or not.




The cost function minimized is the standard regularized logistic regression objective,

J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2,

where h_\theta(x) = 1/(1+e^{-\theta^{T}x}) is the sigmoid hypothesis. Here m represents the number of examples in the training set and \lambda represents the regularization parameter of the cost function. The logistic regression parameters are obtained by minimizing J(\theta).
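As a rough sketch, assuming scikit-learn's LogisticRegression (whose C parameter acts as the inverse of the regularization strength \lambda), training and scoring an unknown page could look like this; X_train, x_unknown come from the vectorization sketch above and y_train is an assumed label vector:

# Sketch only: fit the classifier and score an unknown page.
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression(C=1.0, max_iter=1000)
classifier.fit(X_train, y_train)                          # y_train: 1 = pornographic, 0 = not

probability = classifier.predict_proba(x_unknown)[0, 1]   # value between zero and one
is_pornographic = probability >= 0.5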

2.5 Training and evaluating the system

The system is trained and evaluated using a dataset, as is standard for supervised machine learning algorithms.
The training dataset is used to train the Logistic Regression classifier as a supervised learning problem.

The dataset consisted of HTML files downloaded from known pornographic content and known non-pornographic content. Following best practice among experts, the dataset is divided into three subsets termed training, cross-validation and testing. Logistic regression was trained using the training subset. The main hyper-parameter of the system is the number of features (text tokens) extracted from the text data; the cross-validation subset is used to find its optimum value (see the Figure 2 test results).
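A minimal sketch of that three-way split with scikit-learn's train_test_split; the 60/20/20 proportions and the documents/labels variable names are assumptions, not figures from the paper:

# Sketch only: split documents into training, cross-validation and testing subsets.
from sklearn.model_selection import train_test_split

docs_train, docs_rest, y_train, y_rest = train_test_split(
    documents, labels, test_size=0.4, random_state=0, stratify=labels)
docs_cv, docs_test, y_cv, y_test = train_test_split(
    docs_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)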

Finally, the system is gauged using the testing subset. Test result illustrations are provided in the next section (Application of Solution).

3. Application of Solution

Before releasing this white paper, the proposed solution above was adequately modelled and tested for text-based filtration. A quite similar approach is suggested for image and video classification; however, time did not permit us to model that as well.

The solution we propose is expected to be applied at the ISP level. The reasons supporting that decision include the extensive processing hardware required to deploy machine learning algorithms. Given the above solution, scaling and handling the load at the ISP level is a good choice when weighing the speed of connectivity against the probable delays from the extra processing for filtration. An implicit benefit is that deploying the filtration at the ISP level will hinder the motivation and opportunities for tech-savvy kids to overcome technical barriers, compared to firewalls and other filtration software deployed on kids' laptops.

3.1 Resource requirement

In production, every web page request is separately validated against a statistical analyser supported by machine learning algorithms. Scaling and load-balancing issues can be addressed separately, as they are outside the main scope of this paper. It is well known in machine learning that neural networks and other numerical algorithms need a lot of CPU power. In production, even with contemporary processing capabilities, achieving good performance is a challenge due to the extensive number crunching and language processing operations. Machine learning researchers generally use GPUs (Graphical Processing Units) to deploy their algorithms, and this has been a resource limitation for us as researchers during deployment and testing. It would always have been better, in terms of accuracy, to train and test the solution against a bigger dataset, had we had a GPU. Under the current resource limitations we could only use a training dataset of nearly 500 pornographic HTML pages and another 500 non-pornographic web pages downloaded from the Internet.

3.2 Binary Classification

The actual prototype we developed only treats the issue as binary classification. At this early stage it was easy to implement the solution as a binary classification problem, as otherwise it takes lengthy processing time to develop a training dataset.

3.3 Multiple Classifications

In production the same problem must be addressed as multiple classification, where pages are classified into multiple categories (e.g. Hard Core, Gay, Lesbian, Dating, Sex Education, Sexual Health etc.). Such classification will enable ISPs to clearly separate various kinds of adult content. Based on the classification, responsible authorities and parents may efficiently tailor what they want to see. Criminal, cruel, deceitful, abusive, illegal and harmful content could be more accurately controlled. For ISPs, multiple classification may add value by bringing flexible control over adult material to distinguish between market segments.
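Conceptually the same scikit-learn pipeline extends to this case with no structural change: when the label vector holds more than two classes, LogisticRegression fits a one-vs-rest (or multinomial) model. The category names below are purely illustrative, not the paper's taxonomy:

# Sketch only: the binary classifier becomes multi-class once the labels do.
from sklearn.linear_model import LogisticRegression

categories = ["hard_core", "dating", "sex_education", "sexual_health", "benign"]  # illustrative
multi_classifier = LogisticRegression(max_iter=1000)
multi_classifier.fit(X_train, y_train_categories)        # labels drawn from `categories`
predicted_category = multi_classifier.predict(x_unknown)[0]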

3.4 Selection of Algorithms

The model must also be tested with other algorithms such as Support Vector Machines (SVM) [7] and Neural Networks for accuracy and performance. According to recent research, Deep Neural Networks (DNN) [8] have been continuously beating many machine learning benchmarks. They have proved beneficial when the task is complex enough and there is enough data to capture that complexity [9]. Therefore the model must be critically evaluated for both SVM and Deep Neural Networks before finalizing the algorithm for production.

3.5 Training Dataset

On short notice, the authors could download only a limited amount of data. The dataset used for the actual test consists of 175 documents of pornographic content and 175 documents of non-pornographic content. It has been shown that, for better accuracy, it is necessary to have a dataset with an adequate amount of data per category to clearly distinguish the decision boundaries. There are statistical methods and black arts available for defining the scale of the dataset. A dataset of 350 documents therefore proved not adequate to earn a good result. In future work we will be adding more data (e.g. 10,000 documents per category) to the dataset.

3.6 Test Results

Logistic Regression is trained iteratively with an increasing number of text tokens (also known as features in the machine learning literature). For each iteration, the accuracy of the system is measured. Finally, a graph of cross-validation accuracy versus number of features is plotted as given below. Using the graph as a tool we selected the optimum number of features and ran logistic regression against the training dataset. The graph was plotted using Matplotlib (Python).
Figure 2: Cross validation versus Number of features 
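A sketch of the tuning loop behind Figure 2, reusing the preprocess function and the training and cross-validation splits from the earlier sketches; the feature counts shown are arbitrary example values:

# Sketch only: sweep the number of features and plot cross-validation accuracy.
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

feature_counts = [100, 250, 500, 1000, 2500, 5000]
cv_accuracy = []
for n in feature_counts:
    vec = TfidfVectorizer(tokenizer=preprocess, lowercase=False, max_features=n)
    X_tr = vec.fit_transform(docs_train)           # fit vocabulary on training docs only
    X_cv = vec.transform(docs_cv)                  # reuse it for the cross-validation docs
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_train)
    cv_accuracy.append(model.score(X_cv, y_cv))    # accuracy on the cross-validation set

plt.plot(feature_counts, cv_accuracy, marker="o")
plt.xlabel("Number of features (text tokens)")
plt.ylabel("Cross-validation accuracy")
plt.show()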

Finally, the system is gauged using the testing subset and the results are given in Table 1.


Table 1: Prediction accuracy of the testing dataset

In fact the above results show the baseline of our experiment. 70% can be considered a good level of accuracy compared to random guessing. At this very early stage of the experiment, it is considered good evidence that machine learning algorithms can plausibly be used for porn filtering. Taking another step forward, given the time limitations, the authors created sample dummy documents (5,000 per category) with a purpose-written Python program, making small variations to the original content. Adding these dummy documents to the training dataset, we observed an accuracy level of 85%, which is a very promising result. This strongly suggests that in the next steps the accuracy of the predictions can be improved by increasing the diversity and distribution of the dataset.


Figure 3: Cross-validation versus number of features, when the dummy data with slight variance is used as the training dataset

Previous research has shown that a similar situation can be improved to exceed accuracy levels of 90% or more [10].

Given all this empirical evidence and the test results, we are confident that the proposed solution could practically be used and improved towards actual production in order to address the issues identified above.

4. Conclusion

The proposed solution provides a successful resolution for the issues identified. The current limitations can be overcome and the solution could be brought up to actual production. Since machine learning uses statistical predictions based on features for classification, the approach can reduce controversial blocks caused by mechanical keyword matching (e.g. "Dick" can be a person's name or carry another meaning). Since every page request is filtered, the approach will also avoid blocking a whole site, only the inappropriate content. The test results provide confidence that the model can be developed in production to meet a good level of accuracy. Further R&D on this approach could improve it to exceed competing solutions on accuracy. Given the results, the next level of R&D must source an adequate distribution of training and test data. The algorithms must also incorporate the following aspects in order to enhance accuracy.


  • Number of cross links going out of the site
  • Number of images available on the page
  • Number of streams broadcasting from the page
  • A mechanical lookup against a list of banned sites


The solution has only been tested with the Logistic Regression algorithm. As explained, the model must be tested with Deep Neural Network (DNN) and Support Vector Machine (SVM) algorithms to find the most suitable one. The model is expected to be deployed at the ISP level, where the infrastructure is adequately resourced for processing and other scaling considerations.

For this paper, the authors could only test a prototype as binary classification (pornography, non-pornography). As discussed in the application of the solution, a multiple classification approach could allow parents or legislated authorities to control and tailor what they want to see. Once the processing limitations are overcome, the solution may in future be bundled into a browser.

Limitations include the escape of isolated pages through the filter. All in all, even an ideal filter will not 100% prevent underage children's exposure to adult content, as physical pornographic material can still be stored and shared. Any kind of filtration technology only provides a technical control over what is available on the Internet. Therefore it is still the parents' and teachers' responsibility to monitor the child and make children aware of good habits, practices, behaviours and morals.

Bibliography

[1]  M. A. Layden, “THE SOCIAL COSTS OF PORNOGRAPHY A Statement of Findings and Recommendations,” Witherspoon Institute, Inc., 16 Stockton Street, Princeton, New Jersey 08540, 2010.
[2]  T. Cushing, “UK's Anti-Porn Filtering Being Handled By A Chinese Company,” 26 July 2013. [Online]. Available: http://www.techdirt.com/articles/20130725/20042323953/uks-anti-porn-filtering-being-handled-chinese-company.shtml. [Accessed 26 09 2013].
[3]  “FACT SHEET ON INTERNET FILTERS,” 2003. [Online]. Available: http://www.fepproject.org/factsheets/filtering.html. [Accessed 24 09 2013].
[4]  “Natural Language Toolkit,” [Online]. Available: http://nltk.org/. [Accessed 27 09 2013].
[5]  scikit-learn.org, “scikit-learn Machine Learning in Python,” [Online]. Available: http://scikit-learn.org/stable/. [Accessed 27 09 2012].
[6]  K. P. Murphy, Machine learning: a probabilistic perspective, Cambridge, MA: The MIT Press, 2012.
[7]  C. M. Bishop, Pattern Recognition and Machine Learning, 1, Ed., Springer, 2007.
[8]  Y. Bengio, “Learning Deep Architectures for AI,” Foundation and Trends in Machine Learning, pp. 1-127, 2009.
[9]  H. Larochelle, Y. Bengio, J. Louradour and P. Lamblin, “Exploring Strategies for Training Deep Neural Networks,” Journal of Machine Learning Research, pp. 1-40, 2009.
[10]  U. Bandara and G. Wijayarathna, “Deep Neural Networks for Source Code Author Identification ( Accepted ),” in 20th International Conference on Neural Information Processing (ICONIP), Daegu, Korea, 2013.

Abbreviation

DNN Deep Neural Network
CPU Central Processing Unit
GPU Graphical Processing Unit 
HTML Hypertext Markup Language
ISP Internet Service Provider
NLTK Natural Language Toolkit
R&D Research and Development
SVM Support Vector Machine






Saturday, September 21, 2013

How to deal with the most severe Technical issue in developing Enterprise Scale Software



The background for this post is my experience relating to software component integration in enterprise-scale projects. After the component integration stage of a software development project, the most severe technical issue I have observed is performance. The end-to-end performance realities often cannot be tested and measured until the final phases of an enterprise software development project. Only once you actually start integrating the components do you begin to realize that the planned technical architectures and implementation/deployment activities are not achieving the planned performance and consistency targets. Planned situations always deviate greatly from the final results when plans are over-judged and planned work is not correctly tracked and measured for control.

For years I have been thinking about why so many software projects exceed their budgets and schedules, especially during the last phases of the software development life cycle in large-scale (over 100 developers) projects, when we start to do component integration to produce the final product. Perhaps my observations of such projects are very pessimistic when it comes to the final stages, close to delivery.

Generally, until a few months before delivery, enterprise applications are developed as separate components, and they are tested separately. During the integration phase all those components are connected together so that they collectively appear as an enterprise solution. This is where the most severe technical issues come to the surface. The emerging issues may be varied and may result in undesired outcomes of various kinds: bugs, cycle time, look and feel, user-friendliness, and so on. We call them performance issues in general. When it is a bug, a change in requirement, an error or an integration issue, we as developers have the confidence that, at some cost, things can be sorted out somehow within a few days. But when the actual end-to-end cycle time starts to pose the challenge, we know it is not as easy as anything else. Even a well planned and properly managed project goes smoothly until the very end: the project runs as planned, well within buffers, according to the critical path or other triple-constraint baselines we all agreed. But ultimately, once the project is near delivery and has reached this critical phase, during integration you will see crippling spikes of effort and cost variance, deviating far from plan. Finally, such integration issues generally contradict the performance expectations of the client and result in a very unpleasant experience.

Project management knowledge and expertise on the integration of software components is perhaps very scarce. This knowledge differs from team to team, project to project, place to place, culture to culture and technology to technology. From a project management perspective, both internal and external forces matter here. It is no wonder that modern software is developed as components. To meet deadlines, for future scalability and extensibility, to be easy to understand and maintain, and to help deliver software fast, the architects and other technical experts like to plan the software from the design stage as distinct components, so that they can be developed in parallel and independently, possibly scattered across different computers for actual deployment. However, only once you integrate do you find the software is not performing to the level you had been expecting. What is the reason, and how should you avoid this, in order to ensure a smooth integration?

The thousand probable technical reasons for this vary. There can be communication deficiencies between different layers (load-balancing layer, web layer, app layer, etc.); the middleware used generally needs to be optimized from various perspectives. There may be mechanical deficiencies, hardware deficiencies, or scaling issues during the coupling of components. There may be batch processing issues, streaming issues, messaging issues, or other technology incompatibilities never identified in the earlier design phases. These various issues and anomalies are hardly understood until the final phase of the project. Merely reading various technical specs and following various standards doesn't actually ensure achieving the performance their notes declare. The performance of an integrated system is always unique to the system and the environment it has actually been deployed on.

Therefore, in the first place, during planning an adequate buffer needs to be in place for the actual integration and testing, and then for performance tuning. Even if you find all the components working fine and well tested before integration, only after integration will you see the integrated components not performing to the level of expectation and producing bugs which were never observed before. Some very critical defects may pop up during onsite deployment. Seeing these bugs, the customer may feel put off and lose confidence. Sometimes they may even reject the project and proceed towards legal steps. This may cost future business and credibility. Greedy to make fast money and expecting a quick delivery, many project managers shrink the deadlines while the scope creeps.

Therefore, understanding the real severity of proper integration, the project leadership especially needs to be keen to avoid conflicts between different technical teams at integration. It is natural for the teams to start blaming each other, delaying, arguing and worsening the situation. The technical leadership is criticized and blamed for improper design and planning, and their morale may be dying at the most wanted time. Actually, these issues are sometimes hard to tackle from the design alone; they are realized only during integration, maybe even only at deployment. Therefore the leadership at the time of integration has to be very sensitive to, and understand well, the turbulent times that arise unexpectedly in the integration phases. I have even observed many lead profiles tending to leave their projects at this point, plunging the issue into a further, utter crisis. The replacement teams then have to learn the premature software and code, which is not well documented, not adequately tested and not adequately maintained, hence worsening the situation and irritating the technical crowd.

Technically, in enterprise-scale developments, we observe that different vendor technologies are used at various levels of component communication. This happens in the middle of the development phase, as the vendors always want to change the requirements and bring in additional functionality. Since the business changes fast, they want to mount new features fast which were not in scope during the planning and design stages. Such quick alterations during implementation introduce unwanted heterogeneity and complexity into the systems. It is always best, and well understood, if we can maintain a kind of homogeneous stature by default among the components. For example, you may find web services that are both synchronous and asynchronous, different middleware technologies used between diverse components, various technologies used for XML parsing at different ends of the same product, various versions of third-party libraries used by different teams, and so on. These silly and deliberately avoidable technical complexities and heterogeneities must be well identified during planning. A proper communication plan needs to be enacted between the diverse teams. Otherwise it will be tricky and very time consuming to debug the multiple complexities and performance drops and find the exact root cause at integration time.

The project management best practice should be to avoid such critical failures by tailoring the existing software development process and establishing a mature integration process within the project governance. This must always account for the experience of previous projects and the matured best practices within the organization. Using this experience and existing resources, at the very initial stages of project planning the leadership team must tailor a well-defined integration process along with the integration plan. It can be altered during the various stages of the software development life cycle. However, these processes and plans need to be well communicated and sharply accommodated to all levels of technical team members at the very early stages. A good integration plan may include considerations, guidelines, definitions and timelines covering the integration sequence, integration environment, integration procedures, integration criteria, defining the build series, and so on.


Above all, there is a huge responsibility on the marketing or business teams, who talk the business and finally make the agreements. They must not give false promises to the client, especially in terms of the cycle time of a single operation. Generally the business discussions end without a quantifiable agreement on performance per single transaction. People forget performance during planning, but once the project is started, developed and tested, performance comes into play. Performance must be given priority, and performance expectations should be documented before the contracts are signed. Such agreements should also specify what counts as failure. Otherwise both customers and developers are going to pay with a hard life. The business relationship can become strained and may ultimately end up in the courts, wasting time and goodwill.


Saturday, August 31, 2013

What is Machine Learning and where would you need it?





This question has been burning me quite often in recent days. I have started to think about where to put myself to be an early bird in an era where people believe knowledge is power in any industry. If you just google machine learning, Google's machine learning algorithms will help you find the perfect matches for your intention of knowing what exactly machine learning is about. Google itself is a lot about supervised learning and unsupervised learning, according to some experts. How Google's automatic news clustering may work, how Google finds the actual matches, how Google classifies pages: these may be some food for your thoughts. Since Google is the man here, there is no point in me copying and pasting the same onto my page. In the news I see that Andrew Ng, the famous celebrity professor of machine learning at Stanford, has made it onto the list of the 100 most influential people in the recent Time magazine ranking. One could argue that Andrew contributed enormous effort to Coursera, creating a new age of distance learning. However, I regard this as just one piece of evidence that demand for subject matter expertise (SME) in machine learning is growing apace. We see communities gathering around this topic at exponential rates. Well, where should I place myself then? Why should I not make an effort to keep pace with the knowledge this emerging community has, and become a kind of early bird? I should not remain a silent observer. A hundred thoughts started pouring into my mind in a turbulent Bernoulli flow. Recalling the past, there was a time when people thought genetics would change the whole paradigm of the world: there would never again be a food scarcity on earth; we could develop one pumpkin to feed the whole world. :D. So yes, there are always a hundred overblown imaginings about an emerging engineering field. Fortunately genetics did not change the world as such; instead, with time, people have started hating genetically modified food. Then I remember another wave of fad among the engineering and techy community: they said nanotechnology would change the world. It does, to some extent, but not as revolutionarily as IT, the telecommunication and computer science based industries and the Internet. Again people say it is quantum computers that will be the most revolutionary hardware thing. For me, even what we have at the moment as a processor is more than enough to do whatever we need to get done. But what about human greed? We always want to have more, do more, eat more, and so process even more.

Thinking about typical human thoughts based on endless greed: whenever something is found, there is always a niche that will want to produce it at mass scale, chasing economies of scale and good margins through extreme sales. Therefore whatever thing you discover, this niche will want it at mass scale. So it is with data. How do you process big data and other electronic information that sits in the web, the cloud, the enterprise or wherever? Simply: how do you search the whole web and find exactly the best solution for an identified problem? How do you find the precise knowledge you need from a sea of knowledge around the globe? There may be a million solutions or more for one issue, from different perspectives. Do you think you have the time and capacity to evaluate all the alternatives, analyse them critically and find out which is best? Probably this is the area machine learning is looking at. One would say it is BIG data. But the above analogy is partially about data, partially about a thing we don't know. However, the famous Gartner predicts that the BIG data boom is going to end soon, like the fads we discussed before.

But, think. People thought computers would replace jobs and reduce them. I don't know whether that has become a reality. Instead, nowadays we see a lot of people working on computers. People find jobs as computer programmers, maintainers, game developers, movie developers, and building teller machines, billing machines and so on. That shows that people who have knowledge of computers are still better off. Therefore, for those learning machine learning, there should be nothing to worry about in getting the new skill ready.

My thought is that within a very few years this emerging field will be used to support decision making for big projects. There is always a big element of uncertainty associated with any kind of project, from the initiation phase to project closure. You know what, the big boys are the ones to watch for now, as they cannot afford big losses any more. The applications would include effective risk assessment, risk prediction, make-or-buy decisions, evaluations of projects against risk under probable economic trends, finding the proper locations, defining the best investment portfolios, share value predictions and so on. I have seen that enterprise-scale giants have terabytes of storage holding various information about their customers. How beneficial would it be if they could effectively use it to identify specific customer segments and possible business niches which they could not otherwise know in advance? Yes, machine learning programs may already be in use for this now. However, whatever the application, I see this as a whole new era of accurate, prediction-driven decision making, whether in projects or in business in general.

So remember, prepare today. Yes, the algorithms are going to be hard to remember. There may be a lot of tough math inside. There are mental barriers to overcome. You have a job to do as well. You have limited time. The knowledge is very rare. There is no one to support you. You will need a GPU, or need to buy sophisticated hardware. Yes, true, there may be a hundred limitations. But if I don't start developing my skills today, I will never get another opportunity here again. I will just be another observer uneasily witnessing what others do a few years from now. Do you want to be a nobody, just observing what is happening on the news at 8? If not, let's start with Coursera today. Those who know the rules will play better here.