Is Data New Oil?

Data is New Oil.

 

It is a rather vague but widespread phrase, and it is usually followed by:

Big Data=Big Oil=Big Profit.

Simple as that. But is it really?

Let us look at some differences between Data and Oil:

Information is the ultimate renewable resource.  Any kind of data reserve that exists has not been lying in wait beneath the surface; data are being created, in vast quantities, every day.

Finding value from data is much more a process of handling than it is one of extraction.

 

Still, I would love to use this buzz phrase (data as oil) to illustrate the risks and their consequences.

We have already seen “data spills” happen (when large amounts of personal data are inadvertently leaked). Will it be much longer until we see dangerous data drilling practices? Or until we start to see long term effects from “data pollution”?

One of the places where we have to tread most carefully — another place where our data/oil model can be useful — is in the realm of personal data. A great deal of the profit that is being made right now in the data world is being made through the use of human-generated information. Our browsing habits, our conversations with friends, our movements and location — all of these things are being monetized. This is deeply human data, though very often it is not treated as such.

 

I reckon, for safety, we all need to consider the following:

First, people need to understand and experience data ownership.

Second, we need to have a more open conversation about data and ethics.

Finally, we need to change the way that we collectively think about data, so that it is not a new oil, but instead a new kind of resource entirely.

 

What are your thoughts on “Data is New Oil”?

3 V’s Vice Versa

There are three key concepts that can help us understand Big Data: volume, velocity, and variety.

Volume: The sheer volume of the data is enormous. A very large contributor to the ever-expanding digital universe is the Internet of Things, with sensors all over the world, in all kinds of devices, creating data every second.

Velocity: The speed at which the data is created, stored, analysed and visualised. In the Big Data era, data is created in real time or near real time. The challenge for organisations is to cope with the enormous speed at which the data is created and used.

Variety: Data today comes in many different formats: structured data, semi-structured data, unstructured data and even complex structured data. The wide variety of data requires a different approach as well as different techniques to store all raw data.

These are the 3 essential aspects Big Data is all about.

There are more very important V’s applicable to Big Data:

Veracity: Having a lot of data in different volumes coming in at high speed is worthless if that data is incorrect. Incorrect data can cause a lot of problems for organisations as well as for consumers.

Variability: Big data is extremely variable. Variability means that the meaning is changing rapidly.

Visualisation: With the right analyses and visualisations, raw data can be put to use; otherwise raw data remains essentially useless.

Value: Data in itself is not valuable at all. The value is in the analyses done on that data and in how the data is turned into information and eventually into knowledge.

 

While I was trying to think of any other V’s that could describe Big Data, I decided to think instead about the V’s Big Data should not be.

I came up with three V’s: Vague, Void, and Vulnerable.

Vague: Big Data must not be vague; it should be certain and specific.

Void: Data for analysis must not be void; Big Data must have a clear meaning.

Vulnerable: Data must be safe and protected.


Do you agree with my V’s?

Can you think of any other V’s Big Data should not be associated with?

 

Is Big Data a good thing?

What is Big Data?

Big Data is a collection of data from traditional and digital sources inside and outside the company that represents a source for ongoing discovery and analysis.

By collecting and parsing vast amounts of consumer information from disparate channels, organisations can turn Big Data into major profit possibilities.

BIG DATA = BIG OPPORTUNITIES

Interesting fact about Data.

The number of Bits of Information Stored in the Digital Universe is thought to have exceeded the number of Stars in the Physical Universe in 2007.

Pros and cons of Big Data.

Pros:

  • There are almost unlimited storage possibilities for huge data volumes.
  • Big Data are now accessible from any place and via various devices as they are normally stored in Clouds.
  • The speed of Big Data transmission and processing is very high owing to cutting-edge technologies.
  • Modern analytical methods, technologies and tools allow analysts to gain very deep insights into Big Data, which was impossible in the past with limited data volumes and weaker processing tools.

Cons:

  • Big Data often have big noise, i.e. there may be many meaningless data points. The analyst should work hard to separate the wheat from the tares.
  • Big Data often implies privacy problems, as can be seen, for instance, from the analysis of social networks. Big Data can also mean a lower security level, since Clouds are not always as secure as on-site data warehouses.

 

How does Big Data affect you?

What do you think of when you think of “Big Data”? Perhaps you are thinking of receiving some kind of personalised advertisement from a retailer. Please spare a couple of minutes to watch the trailer from the movie “They Live” (1988) by John Carpenter:

But Big Data is so much deeper and broader than that. On top of helping companies achieve their strategies, Big Data can be used to improve our lives, for example in health and security.

Everyone needs to fully understand big data:

  • what it is to them,
  • what it does for them,
  • what it means to them,
  • how to use it beneficially.

Statistical Analysis.

Q.1 Lift Analysis. Chips&Burgers.

          Chips  ^Chips  Total
Burgers     600     400   1000
^Burgers    200     200    400
Total       800     600   1400

Lift(Burgers, Chips) = (600/1400)/ ((800/1400)*(1000/1400)) = 1.05

Lift(Burgers, Chips)>1 → Positive Correlation.

Lift(Burgers, ^Chips) = (400/1400)/ ((1000/1400)*(600/1400)) = 0.93

Lift(Burgers, ^Chips) <1 → Negative Correlation.

Lift(^Burgers, Chips) = (200/1400)/ ((800/1400)*(400/1400))= 0.875

Lift(^Burgers, Chips) <1→ Negative Correlation.

Lift(^Burgers, ^Chips) = (200/1400)/ ((400/1400)*(600/1400)) = 1.17

Lift(^Burgers, ^Chips) >1 → Positive Correlation.
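
As a rough cross-check, the four lift values can be recomputed directly from the counts in the Burgers/Chips table. This is only a minimal Python sketch, not the working used for the answers above; the function name lift and the variable names are assumptions introduced for illustration.

```python
# Cell counts taken from the Burgers/Chips table.
both = 600          # Burgers and Chips
burgers_only = 400  # Burgers, no Chips
chips_only = 200    # Chips, no Burgers
neither = 200       # neither item
total = both + burgers_only + chips_only + neither  # 1400


def lift(joint, row_total, col_total, n):
    """Lift = P(A and B) / (P(A) * P(B)), computed from raw counts."""
    return (joint / n) / ((row_total / n) * (col_total / n))


# Row and column totals derived from the four cells.
burgers = both + burgers_only      # 1000
no_burgers = chips_only + neither  # 400
chips = both + chips_only          # 800
no_chips = burgers_only + neither  # 600

lifts = {
    ("Burgers", "Chips"): lift(both, burgers, chips, total),
    ("Burgers", "^Chips"): lift(burgers_only, burgers, no_chips, total),
    ("^Burgers", "Chips"): lift(chips_only, no_burgers, chips, total),
    ("^Burgers", "^Chips"): lift(neither, no_burgers, no_chips, total),
}

for pair, value in lifts.items():
    print(pair, round(value, 3))
```

With these counts the sketch gives roughly 1.05, 0.93, 0.875 and 1.17. Values above 1 point to a positive correlation and values below 1 to a negative one.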

 

Q.2 Lift Analysis. Ketchup&Shampoo.

          Shampoo  ^Shampoo  Total
Ketchup       100       200    300
^Ketchup      200       400    600
Total         300       600    900

Lift(Ketchup, Shampoo) = (100/900)/ ((300/900)*(300/900)) = 1 → No correlation (independent).

Lift(Ketchup, ^Shampoo) = (200/900)/ ((600/900)*(300/900)) = 1 → No correlation (independent).

Lift(^Ketchup, Shampoo) = (200/900)/ ((300/900)*(600/900)) = 1 → No correlation (independent).

Lift(^Ketchup, ^Shampoo) = (400/900)/ ((600/900)*(600/900)) = 1 → No correlation (independent).

Q.3. Chi Squared Analysis. Burgers&Chips.

          Chips       ^Chips     Total
Burgers   900 (800)   100 (200)   1000
^Burgers  300 (400)   200 (100)    500
Total     1200        300         1500

χ^2 = ((900-800)^2)/800 + ((100-200)^2)/200 + ((300-400)^2)/400 + ((200-100)^2)/100 = 187.5

χ^2 = 187.5 is far above 3.84, the critical value for 1 degree of freedom at the 5% significance level → There is correlation.

Burgers and Chips: 900 sold, 800 expected → positive correlation.

Burgers and Not Chips: 100 sold, 200 expected → negative correlation.

Chips and Not Burgers: 300 sold, 400 expected → negative correlation.

Not Chips and Not Burgers: 200 sold, 100 expected→ positive correlation.
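
The expected counts and the χ^2 statistic for this table can be sketched in a few lines of Python. This is a minimal illustration that assumes the observed counts from the Burgers/Chips table; the variable names are my own and not part of the original exercise.

```python
# Observed counts from the Burgers/Chips table.
observed = [
    [900, 100],  # Burgers:  Chips, ^Chips
    [300, 200],  # ^Burgers: Chips, ^Chips
]

row_totals = [sum(row) for row in observed]        # [1000, 500]
col_totals = [sum(col) for col in zip(*observed)]  # [1200, 300]
n = sum(row_totals)                                # 1500

# Expected count for a cell = row total * column total / grand total.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Chi squared = sum over all cells of (observed - expected)^2 / expected.
chi_squared = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)
print(chi_squared)
```

A cell whose observed count is above its expected count pulls towards a positive correlation, and a cell below its expected count towards a negative one.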

 

Q.4. Chi Squared Analysis. Burgers and Sausages.

          Sausages    ^Sausages  Total
Burgers   800 (800)   200 (200)   1000
^Burgers  400 (400)   100 (100)    500
Total     1200        300         1500

 

χ^2 = ((800-800)^2)/800 + ((200-200)^2)/200 + ((400-400)^2)/400 + ((100-100)^2)/100 = 0

χ^2 = 0 → There is no correlation.

For each paired combination, the observed value equals the expected value.

 

Q.5.  When are Lift and Chi Squared poor measures?

When the data set contains a very large number of null transactions (transactions that contain neither of the two items), the Lift and Chi Squared values swing heavily with that number, which limits how much they can be trusted. Null-invariant measures were invented for this purpose, for example the Jaccard coefficient and the Kulczynski measure.
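
One way to see the difference is to keep the item counts fixed and change only the number of null transactions. The Python sketch below reuses the Burgers/Chips counts from Q.1; the second null count (20000) is a hypothetical figure chosen only for illustration, and the function names are assumptions.

```python
def lift(both, a_only, b_only, neither):
    """Lift = P(A and B) / (P(A) * P(B)); depends on the null count."""
    n = both + a_only + b_only + neither
    return (both / n) / (((both + a_only) / n) * ((both + b_only) / n))


def jaccard(both, a_only, b_only):
    """Jaccard coefficient: |A and B| / |A or B|; ignores null transactions."""
    return both / (both + a_only + b_only)


def kulczynski(both, a_only, b_only):
    """Kulczynski measure: average of P(A|B) and P(B|A); ignores nulls."""
    return 0.5 * (both / (both + a_only) + both / (both + b_only))


both, burgers_only, chips_only = 600, 400, 200

# 200 is the null count from Q.1; 20000 is a hypothetical larger value.
for neither in (200, 20000):
    print(
        neither,
        round(lift(both, burgers_only, chips_only, neither), 3),
        round(jaccard(both, burgers_only, chips_only), 3),
        round(kulczynski(both, burgers_only, chips_only), 3),
    )
```

Lift moves from about 1.05 to about 15.9 although the burger and chips sales are unchanged, while the Jaccard coefficient (0.5) and the Kulczynski measure (0.675) stay the same.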

Try R – Troubling Doubling at School.

Riddle: 

The number of girls who do wear a watch
is double the number who don’t.
But the number of boys who do not wear a watch
is double the number who do.

If I tell you the number of girls in my class
is double the number of boys,
Can you tell me the number I teach? Here’s a clue:
More than 20; below 32!

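Solution to the riddle:

The number of boys must be a multiple of 3, so that it may be split in the ratio 2:1 (without a watch : with a watch). The same holds for the girls, and because there are twice as many girls as boys, the class size must be a multiple of 9. The only multiple of 9 that is more than 20 and below 32 is 27: 9 boys (3 with a watch, 6 without) and 18 girls (12 with a watch, 6 without).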
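
The riddle can also be checked by brute force. The sketch below is in Python rather than the R I used in RStudio, and it is only an illustration: the variable names and the search bound of 32 are assumptions taken from the clue.

```python
solutions = []

# Try every possible number of boys with a watch and girls without one.
for boys_watch in range(32):
    for girls_no_watch in range(32):
        boys_no_watch = 2 * boys_watch    # boys without a watch are double those with
        girls_watch = 2 * girls_no_watch  # girls with a watch are double those without
        boys = boys_watch + boys_no_watch
        girls = girls_watch + girls_no_watch
        total = boys + girls
        # Girls are double the boys, and the class is more than 20, below 32.
        if girls == 2 * boys and 20 < total < 32:
            solutions.append(
                {
                    "boys_watch": boys_watch,
                    "boys_no_watch": boys_no_watch,
                    "girls_watch": girls_watch,
                    "girls_no_watch": girls_no_watch,
                    "total": total,
                }
            )

print(solutions)
```

Only one combination survives all the conditions, so the clue is enough to pin down the class size.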

  1. In RStudio I defined variables for
    • the number of boys who wear a watch
    • the number of boys who do not
    • the number of girls wearing a watch
    • the number of girls who do not.
  2. Then I grouped the boys and the girls using the following commands:
    • Boys=c(boyswwatch, boysnowatch)
    • Girls=c(girlswatch, girlsnowatch)
  3. Then I grouped boys and girls as “students”.
  4. In order to get 2 graphs on one picture, I used:
    • par(mfrow=c(1,2))
  5. Pieces of code were executed to get the results on the image. You can see the code below the picture.
  6. The resulting image was saved as a .jpeg file.

  • On the R image, the breakdown of the number of boys and girls is represented graphically.

    I had a look at some libraries for plotting in RStudio, e.g. ggplot2. It took me a while to figure out what code I needed to use, as I had never worked with a statistics package before. The R programming language is for statistical computing and graphics, and it is very different from Maple, MATLAB and LabVIEW.

    I have decided to use the general commands for this example.

    Regarding my Riddle, there are a few things that could be visualised:

    • total number of girls:total number of boys
    • probability of a kid wearing a watch
    • probability that a child wearing a watch is a boy
    • total number of watches in the class.
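
The four quantities listed above can be worked out with exact fractions. This Python sketch assumes the class that satisfies the riddle (3 boys with a watch, 6 without, 12 girls with a watch, 6 without) and one watch per wearer; the variable names are illustrative only.

```python
from fractions import Fraction

# Counts that satisfy the riddle (assumed here, 27 pupils in total).
boys_watch, boys_no_watch = 3, 6
girls_watch, girls_no_watch = 12, 6

boys = boys_watch + boys_no_watch      # 9
girls = girls_watch + girls_no_watch   # 18
pupils = boys + girls                  # 27

# Total watches, assuming each wearer has exactly one watch.
watches = boys_watch + girls_watch

girls_to_boys = Fraction(girls, boys)               # ratio of girls to boys
p_watch = Fraction(watches, pupils)                 # P(child wears a watch)
p_boy_given_watch = Fraction(boys_watch, watches)   # P(boy | wears a watch)

print("girls : boys =", girls_to_boys)
print("P(watch) =", p_watch)
print("P(boy | watch) =", p_boy_given_watch)
print("watches in class =", watches)
```

So there are two girls for every boy, 5 in 9 children wear a watch, 1 in 5 watch wearers is a boy, and the class has 15 watches.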

Here is the screenshot from the Code School web page showing the completed R course. Overall, I liked the R language.

[Screenshot: DianaPetuhovaRLanguage]

Fusion Tables for Population of Ireland (2011).

This is the link to Google Fusion Tables:

https://www.google.com/fusiontables/DataSource?docid=1J1gS4dPoIgYAvhP5CN0gu-mMF2AblJqYNQ3vedU4


Purpose of the project:
- To visually illustrate the population of Ireland based on 2011 Census data.

Methods:

1. Data for Irish Population was taken from Central Statistics Office web-site:
http://www.cso.ie/en/statistics/population/populationofeachprovincecountyandcity2011/

2. The data was transferred into an Excel worksheet. Then the Excel file was cleaned to suit the needs of the project.

3. An Ireland map .kml file was found through the public data tables search and opened.

4. The Excel file with Population 2011 data was adjusted once more to match the Ireland map .kml file (*County*).

5. The Excel file was saved with a .txt extension (Tab delimited) because it was the only format that worked fine for this project.

6. The .txt file was uploaded to Google Fusion Tables with Separator Character – Tab and Character Encoding – auto-detect. The new fusion table was then merged with the .kml table (see point 3.) using its URL.

7. Feature Styles were changed using Custom Buckets (Column = Total Persons). Auto-legend was chosen as well.

8. The resulting heatmap was ready to share/publish.
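
Step 5 can also be done without Excel. The Python sketch below writes a tab-delimited table with the standard csv module. The two rows are hypothetical placeholder values, not CSO figures, and the column names (County, Males, Females, Total Persons) are assumed from the fields shown on the map.

```python
import csv
import io

# Hypothetical placeholder rows; the real figures come from the CSO 2011 tables.
rows = [
    {"County": "County A", "Males": 10, "Females": 12},
    {"County": "County B", "Males": 7, "Females": 5},
]

FIELDS = ["County", "Males", "Females", "Total Persons"]


def to_tab_delimited(rows):
    """Return the rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in rows:
        # Total Persons is derived here as Males + Females.
        writer.writerow({**row, "Total Persons": row["Males"] + row["Females"]})
    return buffer.getvalue()


text = to_tab_delimited(rows)
print(text)
```

To produce a file for upload, the returned text could be written to a .txt file.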

Comments: It took some time to figure out which .kml file to use for the visualisation and to decide which file extension would work for the Excel file. But in the end, everything worked OK.

Results: The map reflects the population of each county, and one can easily see which counties are the most or least populated. If you mouse over an area of the map, it will give you info on:
  • Name of County
  • How many Males
  • How many Females
  • Total Persons

Recommendations: Using the given data and Google Fusion Tables’ features, it is possible to show which counties have more men or more women and which areas of the country are the most heavily populated, and then to research the reasons for this and whether there is a need to balance it out.

Conclusions: Google Fusion Tables is a very handy tool for providing a visual representation of entered data. Depending on the kind of data, the resulting heatmap may show other statistical data, which is very useful for people working in the social sector.