Friday, December 5, 2014

Lessons Learned in Big Data Analytics

Our team loved the Big Data Analytics class and learned to connect business and data. With this blog post, we would like to share our experiences.


Anshul

Thank you so much for teaching this course and helping us learn so many things in this short span of time. I loved the beautiful and simple definition of Big Data, very different from the prevalent one. I was so inspired that I shared it with my connections on LinkedIn (linkd.in/1FXCMP9). Learning the nitty-gritty of different metrics and understanding the difference between bounce rate and exit rate was very interesting. The biggest takeaway from this class was to follow and learn from Avinash Kaushik – I am amazed by the simplicity and clarity with which he explains complex concepts in his blog and his book Web Analytics 2.0.
Working on Google Analytics and providing recommendations to UA Alumni was really fascinating. Connecting all the dots and making a business improvement decision is not an easy task, and this short project made that very clear. I loved working on visualization, and I found some great resources (like @TwitterData) that I can follow and use throughout my career to learn and improve my visualization skills. Using Twitter may sound like a basic skill, but I was not very conversant with it before taking this course. Now I feel much more connected, since I have developed a liking for Twitter and started using it much more frequently. The technical skills developed by working with AWS, MongoDB, NoSQL, APIs, and Gephi were a great addition to my skill set.
I have had a great learning experience in this class, and I think it will help me a lot in my career. I look forward to staying connected with you and to continuing to share what I learn.

Leon

Companies have become really good at collecting data but not at understanding it. Big data is a data set large enough that it is difficult to gain actionable insights without computer-assisted analysis. Data that does not provide direction is pointless, and big data is difficult to understand without quality visualizations. Many tools are available to help collect, store, and analyze data, and combining data sets can provide better direction. By properly analyzing data, companies can gain understanding and actionable insights.

Ryan

Reflecting on this semester of Big Data Analytics, I feel that this course has been invaluable in exposing me to Data Science tools and techniques that I will be able to utilize in my current research and future career. Prior to this course, I did not have extensive experience with web analytics, visualization, or network analysis. Now I can say that I have worked with Google Analytics to measure and improve website engagement; I have learned how to tell better stories through the use of data visualization and that visualization extends beyond just Tableau dashboards; I have seen how intricate networks can be formed and how insights can be derived from their interconnections. Having spent a considerable amount of time on cleaning and pre-processing data, I can also say that I have a new appreciation for information and knowledge because of the effort that it takes to transform that data into actionable intelligence. Despite the time commitment that this course required, Big Data Analytics has provided a worthwhile learning experience.

Ting-Ju

This class was very interesting. I am glad I decided to take it.
Before taking this class, data analytics was just a "fancy" concept to me. Now I know how to use different tools to capture data and translate it into useful insights.
In the beginning, we started with web analytics, using tools like Google Analytics, which already provide dashboards, to take our first steps in learning how to use data to shape business strategies. Looking at the dashboard, we can see who our target users are, along with times, locations, channels, and so on. Based on this report, a digital analyst can give good suggestions for improving a company's website and targeting different segments of users.
Next, we learned data visualization. This isn't just putting data into visualization tools. The key point is the question: what do you want to take away from these visualizations? Once you know your purpose, every visualization will make sense to you.
Our last big topic in class was network analysis, and it was the most interesting part to me, because every step you take while gathering or generating data might lead to a different inference or story. Observation is important, but putting the data in the right context is just as important in network analysis.
Finally, I want to say I really love Dr. Ram's teaching style: always giving us great stories to discuss and using examples to give us a "feeling" for how we can apply data science to our daily lives. And thanks to my team members for always being open-minded, listening to my opinions, bringing their knowledge to the team, and letting me know what I was missing.
This class really makes me want to learn more and more!

Yu

Thank you so much for providing this course and giving us the chance to learn different aspects of big data. At the beginning of the course, "big data" was just a term to me; big data analysis seemed very far away. But after learning different concepts about big data analysis and getting all the hands-on experience we had, I began to see that big data analysis is not just for industry experts, and that everyone can have his or her own ideas about big data. Also, big data is not just about the word "big", not just the volume. By following along step by step with Dr. Ram, I learned several things from this class. First, big data is not just about the visualization; the story behind the data is what matters most. This is also what I find most fascinating, because at the beginning we had only raw data, but after all the analysis we did, we could put it together and tell a story about it. Second, many companies now have huge amounts of data and don't know how to use it. If we can apply what we learned in this class and generate ideas for the companies we may work for, it will be a huge benefit for both ourselves and the company. Last, big data analysis goes far beyond what we learned in this class. I still have a lot to learn in this area and a lot to improve, and I will continue learning about big data analysis after the course ends. Finally, I would like to thank Dr. Ram again for her help during this course, and also all my teammates this semester.

---------------------------------

Team CSI is signing off with this final blog post and we leave you with this funny video on Big Data!

Wednesday, November 12, 2014

The “Fake Side”


It is no longer a surprise in the digital world when a person decides to recreate themselves. This is seen frequently in the gaming community where players have multiple accounts and act like multiple people. Recreated selves are seen in all digital communication forums. But what effect does this have?

Fake accounts are big business. There is a sense of pride when the number of your followers increases: you feel connected, important, and powerful. But what if all of those followers did not really exist? Kevin Ashton explored this possibility in an experiment. [2]

First he created a new name, Santiago Swallow, using an application called Scrivener. Then he created a Gmail account with that name, and using the Gmail account he created a Twitter account for Santiago Swallow. The really amazing part is how he got to 90,000 followers: it turns out that all he needed was $50 and 48 hours. On Fiverr.com you can buy 2,400 Twitter followers for $5. [1] Finally, he created a portrait using free software, a Wikipedia page, and a website for $18 more.
For $68 Santiago Swallow became one of the most famous people on Twitter. With his personal website, interconnectedness, and wiki page, Twitter did not mark him as a fake. Santiago immediately had credibility, however fake, because 90,000 followers can't be wrong.

This opens up a new vector for social engineering. How seriously is someone checked before they are added to a followers list? Do you need to know them, do they need to know someone you know, or do you just need to think they are real? These fake accounts look real and interesting.

Unfortunately, real harm can be done with these fake accounts. Think of a class of college students doing social media analytics. How many of their reports are going to suffer because fake accounts with lots of fake followers skew the data? The most important person in their analysis may not even be real.

There is help. Companies such as Status People will scan an account's followers looking for users with no friends, no followers, and no tweets of their own. However, these companies do not always get it right.
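As a rough illustration of the kind of heuristic such services rely on, here is a minimal Python sketch that flags followers with no friends, no followers, and no tweets of their own. The follower records and field names are made-up examples, not the output of any real API.

# Naive fake-follower heuristic: an account with no activity at all is suspicious.
# The follower records and field names below are hypothetical examples.
followers = [
    {"handle": "real_person", "friends": 120, "followers": 85, "tweets": 430},
    {"handle": "egg_4821", "friends": 0, "followers": 0, "tweets": 0},
    {"handle": "egg_9377", "friends": 0, "followers": 0, "tweets": 0},
]

def looks_fake(account):
    return account["friends"] == 0 and account["followers"] == 0 and account["tweets"] == 0

suspected = [a["handle"] for a in followers if looks_fake(a)]
print(f"{len(suspected)} of {len(followers)} followers look fake: {suspected}")

Of course, real services use far more signals than this, which is part of why they do not always agree with one another.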

One of the best known is “Kred,” a service provided by San Francisco company PeopleBrowsr. PeopleBrowsr says its customers include consumer goods giants Procter & Gamble and Budweiser and major advertising agencies Ogilvy & Mather and Wieden + Kennedy. Less than a day after he was invented, Santiago Swallow had a Kred influence score of 754 out of 1000. [3]

What is real? Role playing, or pretending to be something you are not, is common. Children do it to mimic other people and characters, and these imaginings bring happiness and fun to life. With advancing technology we are moving our ideas beyond words into visible concepts. For example, a program called Vocaloid allowed for the creation of a completely synthetic pop star; see the first video below. A second option is bringing artists back to perform after death; see the second video below (CAUTION: EXPLICIT LANGUAGE). In the future, will social influence continue after death, and will ideas and synthetic creations have social influence of their own?




References
[1]
[2]
[3]


Wednesday, October 22, 2014

Blog 3: Fly Happy

Those who travel often have an airport they love to hate. These airports drag out an already long and uncomfortable day. In reviewing the flight data for 2013, we found that the airport with the greatest total arrival delay was Chicago O'Hare. When faced with traveling through an airport with lots of delays, what can you do to minimize your chances of being delayed? As a team we took a look at the data to find some answers.

To get the data on flights we used two sources: http://www.rita.dot.gov/ and http://openflights.org/ . RITA, the Research and Innovative Technology Administration, is a site managed by the United States government that provides large data sets pertaining to transportation. RITA supplied the core data about flights, delays, airports, and much more that we needed for the analysis. Unfortunately, some of the data is rather cryptic. For example, airports are listed only by a three-character airport code, which tells you neither the full airport name nor the airport's exact location.

This is where the second data set comes in. Openflights.org provides a dataset with location and identification data for every airport. By joining the two data sets on the three-character airport code, we get a much more complete picture of the data.
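As an illustration, here is a minimal pandas sketch of that join, assuming both sources have already been downloaded as CSV files. The file names and column names are assumptions made for the sketch, not the exact headers of either dataset.

# Sketch of combining the two datasets on the three-character airport code.
# File names and column names are assumptions about the downloaded CSVs.
import pandas as pd

flights = pd.read_csv("rita_ontime_2013.csv")        # RITA on-time performance data
airports = pd.read_csv("openflights_airports.csv")   # OpenFlights airport list

# Attach the full name and coordinates of each destination airport.
enriched = flights.merge(
    airports[["iata_code", "name", "latitude", "longitude"]],
    left_on="DEST", right_on="iata_code", how="left",
)
print(enriched[["DEST", "name", "ARR_DELAY"]].head())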

Once we had the data we needed a way to make sense of it. First we cleaned the data by removing information we did not need, such as on-time flights. Using the power of pivot tables and strong integration with MySQL, Excel helped us quickly clean the data and make meaning out of it. The data is historical and structured, which simplified the migration to MySQL for quick query processing. MySQL also integrated well with our visualization tool of choice, Tableau. The ease of working with measures and dimensions and building dashboards is what inspired us to use Tableau. Using Tableau Public, we were also able to publish our dashboards online and obtain JavaScript embed code.

With the data prepared we were ready to answer some questions. The first visualization provides insights into flight delays for routes originating from Chicago O'Hare International Airport. The darkness of the routes corresponds to the length of the arrival delay. The pie charts illustrate how each delay type contributes to the total arrival delay.
 

First we wanted to know which destination airports see the greatest arrival delays on flights from Chicago. The next visualization shows the top 10 destination airports from Chicago with the maximum delay, where dark orange represents the most delayed destination, La Guardia. This could be due to the large number of flights from Chicago to La Guardia.

As a passenger you do not have a lot of options for different layovers when traveling to your destination. If you must travel through an airport with heavy delays, what can you do to mitigate the chances of being delayed? This question drove our next several questions, starting with: what is the best time of day and day of the week to fly to avoid delays? We plotted delays by time of day and day of the week, creating a map of when to fly. We can see that the early hours (5 a.m. to 8 a.m.) are the best time to fly, while flights departing between 3 p.m. and 6 p.m. have the most arrival delays. Also, Wednesday, Thursday, and Sunday have the highest delays in the afternoons (in the chart, Sunday is day 7).
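A rough pandas sketch of that "when to fly" matrix might look like the following, continuing from the merged data above and again assuming the column names used in the earlier sketch.

# Average arrival delay by scheduled departure hour and day of week,
# continuing from the `enriched` DataFrame above (column names assumed).
enriched["dep_hour"] = enriched["CRS_DEP_TIME"] // 100   # e.g. 1536 -> 15

when_to_fly = enriched.pivot_table(
    index="dep_hour",            # 0-23
    columns="DAY_OF_WEEK",       # 1 = Monday ... 7 = Sunday
    values="ARR_DELAY",
    aggfunc="mean",
)
print(when_to_fly.round(1))      # low values mark the best times to fly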

We performed the same analysis by month and day, showing that heavy delays are expected in June (month 6). This particular chart also allowed us to filter by air carrier so we could see when each carrier had delay issues. The carrier codes are listed below.

Code    Carrier
9E      Endeavor Air
AA      American
DL      Delta
EV      ExpressJet
MQ      American Eagle
OO      Skywest
UA      United
US      USAir
VX      Virgin America
YV      Mesa Airlines

Carrier issues were the next point of investigation: which carrier has the fewest delays, and which has the most? Delta (DL) has the most cancellations, largely because it also operates Endeavor Air.
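The carrier comparison reduces to a simple group-by; here is a sketch under the same column-name assumptions as the earlier snippets.

# Flight counts, average arrival delay, and cancellations per carrier
# (column names assumed, as in the sketches above).
by_carrier = enriched.groupby("UNIQUE_CARRIER").agg(
    flights=("ARR_DELAY", "size"),
    avg_arr_delay=("ARR_DELAY", "mean"),
    cancellations=("CANCELLED", "sum"),
).sort_values("cancellations", ascending=False)
print(by_carrier)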



We then did the same type of analysis on cancelled flights.
We found that La Guardia has the most cancellations while Los Angeles has the fewest.

We also found that Saturday (day 6) is the best day to travel to avoid cancellations.
In conclusion, we found that it is best to travel early in the morning on a Saturday. June is the worst month to fly, Delta has the most issues, and La Guardia is almost guaranteed to give you trouble.



Wednesday, October 1, 2014

Blog 2: NoSQL vs. Relational Databases

In the world of databases, the trend toward NoSQL databases is growing. There are strong opinions on both sides, with each claiming that its database solution is superior to the other while citing different reasons. An example of this is the comparison of DataStax to other DBMS solutions shown in the table below. The question is: which of these database camps is right, or are they both wrong?

[3]

The Pros and Cons of NoSQL vs. Relational Databases

A good debate on which database type to use, and when to use it, appears in the following video of Craig Steadman's interview with William McKnight, president of McKnight Consulting Group.


The two main advantages of NoSQL databases over relational databases:
  1. NoSQL databases scale very well.
With NoSQL, when the database becomes too slow or too big, you can easily add more servers, creating a sharded cluster in which each shard can itself be a replica set. A relational database does not scale out nearly as well.
“NoSQL databases usually support auto-sharding, meaning that they natively and automatically spread data across an arbitrary number of servers, without requiring the application to even be aware of the composition of the server pool. Data and query load are automatically balanced across servers, and when a server goes down, it can be quickly and transparently replaced with no application disruption.” [1]
  2. NoSQL databases allow for heterogeneous data.
In a computer hardware store, all products have a price and a vendor, but different components have different properties: CPUs have a clock rate, hard drives and RAM chips have a capacity, and monitors have a resolution. In a relational database there are two ways to deal with this real-world problem. The first option is a very long productID-property-value table. The second option is a very wide, sparse product table with a column for every property, where most of the values end up NULL. In a NoSQL database this problem is avoided more easily, because each document in a collection can have its own set of properties; a minimal sketch of this document-per-product approach follows below.
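Here is that hardware-store catalog as a document collection, using plain Python dictionaries to stand in for NoSQL documents (in MongoDB, for instance, these could be inserted into a single collection as-is). The field names and values are made up for illustration.

# Each product document carries only the properties that make sense for it,
# instead of a wide, mostly-NULL relational table.
products = [
    {"sku": "CPU-001", "vendor": "AMD", "price": 199.99, "clock_rate_ghz": 3.8},
    {"sku": "HDD-042", "vendor": "Seagate", "price": 59.99, "capacity_gb": 2000},
    {"sku": "MON-007", "vendor": "Dell", "price": 249.99, "resolution": "2560x1440"},
]

# Every document still answers the shared questions (price, vendor) ...
total = sum(p["price"] for p in products)
# ... while type-specific properties appear only where they apply.
cpus = [p for p in products if "clock_rate_ghz" in p]
print(f"Catalog value: ${total:.2f}; CPUs in stock: {len(cpus)}")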
The two main disadvantages of NoSQL databases compared to relational databases:
  1. Denormalization
When your data is highly relational and cannot easily be denormalized, a relational database is the better choice, because NoSQL databases do not support JOINs.

In the relational world there are well-established rules for normalizing data, but NoSQL is a relatively new technology and lacks comparable guidance on how to denormalize.
  2. Complex transactions
NoSQL databases do not handle complex transactions as well as relational databases. When an action affects more than one document, a NoSQL database cannot guarantee consistency across those documents, as the sketch below illustrates.
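For example, here is a hypothetical pymongo sketch of moving stock between two warehouse documents. The collection and field names are made up for illustration; the point is simply that the move takes two independent writes, and if something fails between them there is no built-in rollback.

# Hypothetical example: moving 5 units of stock between two warehouse
# documents requires two independent writes. Without a multi-document
# transaction, a failure between them leaves the data inconsistent.
from pymongo import MongoClient

warehouses = MongoClient()["store"]["warehouses"]  # connection details assumed

warehouses.update_one({"_id": "tucson"}, {"$inc": {"stock": -5}})
# <-- if the application crashes here, 5 units have simply vanished
warehouses.update_one({"_id": "phoenix"}, {"$inc": {"stock": 5}})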

It's about your data
Jnan Dash, a long time database professional, points out several things to consider about the data before deciding what choice to make in a database solution.
Tabular vs Complex
If the data has a simple tabular structure, like an accounting spreadsheet, then the relational model could be adequate. Data such as geo-spatial, engineering parts, or molecular modeling data, on the other hand, tends to be very complex. It may have multiple levels of nesting and the complete data model can be complicated. Such data has, in the past, been modeled into relational tables, but has not fit into that two-dimensional row-column structure naturally. [2]
Historical vs Dynamic
"What is the volatility of the data model?" Is the data model likely to change and evolve, or is it most likely going to stay the same? Generally speaking, not all the facts about the data model are known at design time, so some flexibility is needed. This presents many issues to the relational database management system (RDBMS) users of the world. [2]
Conclusion:
Each type of database serves a different purpose. NoSQL is good for complex, unstructured data that is constantly changing and gaining new attributes. An RDBMS is good for structured data whose attributes are well defined and understood. The RDBMS has a proven record spanning decades and is trusted, while NoSQL is a relatively new approach that has not yet earned the same trust. So the best database is the one that best fits your data and goals. And if you still can't make up your mind, don't worry: some database vendors are developing solutions that allow NoSQL and RDBMS to coexist. [2]

References (APA format):
[1] NoSQL Databases Explained. (2014, October 1). Retrieved October 1, 2014, from http://www.mongodb.com/nosql-explained

[2] Dash, J. (2013, September 18). RDBMS vs. NoSQL: How do you pick? | ZDNet. Retrieved October 1, 2014, from http://www.zdnet.com/rdbms-vs-nosql-how-do-you-pick-7000020803/

[3] Relational Database to NoSQL. (2014, October 1). Retrieved October 1, 2014, from http://www.datastax.com/relational-database-to-nosql

[4] Preslar, E. (2013, September 16). McKnight: Relational vs. NoSQL databases not a winner-take-all game. Retrieved October 1, 2014, from http://searchdatamanagement.techtarget.com/video/McKnight-Relational-vs-NoSQL-databases-not-a-winner-take-all-game

Wednesday, September 10, 2014

Blog 1: Web Analytics in eCommerce and Security


Importance of Web Analytics


“Data in its raw form isn’t that useful, you have to do things to it. You have to shape, and discard the stuff you don’t need.” [1] Web analytics helps turn raw data into meaningful information. Web Analytics is the collection, measurement and analysis of web interaction data to provide recommendations for improving web sites.

If the goal of web analytics is improving the website, then the first step is to know the purpose of each page of the site. Only then can value be gained from the information gathered. Amazon has different purposes for different web pages on its site. A wrong assumption when looking at a general Amazon web page is that the goal of the page is to sell an item; in fact, the goal of the page is to facilitate browsing and shopping. It is not until you check out that the goal of the page changes to selling the item. Once the goals for the website are determined, the 5Ws of web analytics can be used to determine whether users and web pages are interacting as expected. The 5Ws of web analytics are Who, What, When, Where, and Which.

The 5Ws in eCommerce


The lecture about web analytics and the 5Ws was quite fascinating. Thinking about the 5Ws helps in analyzing the goals of websites and can provide valuable insights that help those sites meet their goals. Browsing eCommerce websites like eBay, Amazon, Walmart, Target, and Alibaba shows how each of the 5Ws applies.

Starting with the first 'W', Who: metrics like total visitors, registered unique visitors (RUVs), and visits per visitor can be really insightful to measure. Tracking GUIDs, user IDs, and mapped users helps us measure these metrics.

The second 'W' is Where, which can be obtained by slicing the above metrics by region (for example North America, Australia, the UK, Germany, and other growing markets), making it possible to see market share and growth trends in each region. The user's IP address, site ID, or country of registration can be used to measure this.

The third 'W' is When, which is mostly a timeline against which the various metrics can be measured. Looking into peak days, seasonal trends, and the holiday season can help in planning marketing campaigns, scheduling system maintenance, aligning customer support, and planning quarterly and yearly revenue. The timestamp is the key attribute to track here.

The fourth 'W' is What, which measures the actions users are taking on a website. Metrics like click-through rate, sell-through rate, conversion, and the percentage of users going from a Search results page to a View Item page can be really helpful to measure.

The fifth 'W' is Which, which identifies the devices, operating systems, and browsers that users are using. For example, various metrics can be measured separately for mobile-only and PC users. The following are some interesting points worth mentioning:

  • eCommerce is shifting to mobile devices rather than the traditional PC. Companies like eBay saw about 40% of Q2'14 global GMB (Gross Merchandise Bought) come from mobile devices, and much of this mobile shopping (smartphones and tablets) happens before 9 a.m. and after 5 p.m.

  • Also, in a world of multiple screens, users are shifting between devices, for example viewing a fixed-price item on eBay from a mobile device while purchasing it on a PC. In Q2'14, 59% of eBay buyers shopped across multiple screens, so it is really important to understand why people use multiple devices for purchases [2].

  • It is highly possible that mobile users visit a site frequently but with short session durations, while PC users visit less frequently but with longer sessions. This can be checked by measuring session duration and visit frequency per platform, as in the rough sketch below.
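Here is a minimal pandas sketch of that comparison, assuming an exported table of sessions; the file name and column names are hypothetical.

# Compare visit frequency and session duration for mobile vs. PC users.
# The sessions.csv export and its column names are hypothetical.
import pandas as pd

sessions = pd.read_csv("sessions.csv")  # one row per session: user_id, device, duration_sec

by_device = sessions.groupby("device").agg(
    sessions_per_user=("user_id", lambda s: len(s) / s.nunique()),
    avg_duration_sec=("duration_sec", "mean"),
)
print(by_device)  # expect mobile: more sessions, shorter; PC: fewer, longer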

The 5Ws in Security


Another perspective on the 5Ws is how they can provide insight into security issues.
Each of the 5Ws can help in securing a website. The following list shows how this can be accomplished.

Who helps determine what type of customer or user is on the site. If the site is designed for business-to-business transactions, we do not expect individual users creating logins for personal use. Knowing the Who of web analytics can improve security on a site.

What focuses on the actions of users on the site and helps determine whether a user is behaving unexpectedly. For example, a single logged-in user trying many different credit card numbers to make a purchase could be a warning that someone is attempting to use stolen credit cards; a toy version of that check is sketched below.
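This sketch flags logged-in users who attempt a purchase with an unusually large number of distinct card numbers. The event records and the threshold are made up for illustration.

# Toy check: flag users who try many distinct card numbers.
# The event records and the threshold of 3 cards are made-up examples.
from collections import defaultdict

payment_attempts = [
    {"user": "alice", "card_last4": "4242"},
    {"user": "mallory", "card_last4": "1111"},
    {"user": "mallory", "card_last4": "2222"},
    {"user": "mallory", "card_last4": "3333"},
    {"user": "mallory", "card_last4": "4444"},
]

cards_per_user = defaultdict(set)
for attempt in payment_attempts:
    cards_per_user[attempt["user"]].add(attempt["card_last4"])

flagged = [user for user, cards in cards_per_user.items() if len(cards) > 3]
print("Possible stolen-card activity:", flagged)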

When is also important to security. Seeing a high volume of traffic during a normally slow time can be a sign that a cyber-attack is under way.

Where deals with the source or location of the user. If a company focuses on sales to Latin America and they are getting large amounts of traffic from Russia and China, then they will want to keep a close watch on their secure data. 

Which provides the starting point for profiling a user. Which tells the details of the user: the operating system they use, the web browser they use, the port and protocol they use, the type of device they are connecting with, and how they linked to the site. With this level of detail, identifying common attack vectors becomes easier, allowing the site to close open vulnerabilities.

Case Study - DARKReading.com


Looking at a specific site, we can use the 5Ws to understand its goals. DARKReading.com is a reputable information security website published by InformationWeek [3]. The overall goal of this site is to disseminate information and data to information security professionals, including news stories, commentaries, white papers, and upcoming events. Keeping up to date on current events and emerging technologies is of extreme importance in the field of cybersecurity. Knowing the goal and target audience of this site helps aid in the analysis of the 5Ws.

Who is coming to the site can reveal if a large enough percentage of the users are, in fact, from the information security realm. While general public and researchers are a welcome audience, without the intended audience visiting in high enough numbers, the goal of the site will not be met.  

When is also important as it can help ensure the articles and features desired by these users are available when they are needed. When can also help analyze trends of visitation to the site after the announcement of a major breach or security event.

Where the users are coming from can be most easily tracked through user account features available on the site. Users can also subscribe to a digital subscription of the publication. 

What is most relevant in combination with the other Ws.  Who, when, where, and what can indicate the overall goal of the individual user.  It can also help identify topics that might be trending higher than others (mobile versus operations, for example).  

Which identifies the tools a user is employing to visit the site and can help with site optimization. For example, ensuring a responsive design is implemented to more easily facilitate usage through mobile devices and multiple browsers. 

Why is more difficult to analyze from the outside. Having specific metrics and user data available is very important for quality user analysis and predictions.

In summary, understanding 'Why' by using the 5Ws, various metrics, and external data is the key to developing insights that can drive constant business improvement.


References


[1] FINLEY, K. (2014, September 8). Ex-Googler Shares His Big-Data Secrets With the Masses | Enterprise | WIRED. Retrieved September 9, 2014.

[2] eBay. (2014, July). eBay Marketplaces - Mobile. Retrieved September 10, 2014.

[3] InformationWeek. (n.d.) Dark Reading: Connecting the Information Security Community, InformationWeek. Retrieved September 8, 2014.

Tuesday, August 26, 2014

Introductory Blog

Ryan Chinn

My name is Ryan Chinn. I am an AZSecure Scholarship for Service Fellow currently pursuing a master’s degree in Management Information Systems (MIS) within the Eller College of Management at the University of Arizona. I received my Bachelor of Science in Business Administration with a major in MIS from the University of Arizona (2013). As the volume, velocity, and variety of data continues to increase at an exponential rate, big data is a growing concern. Big data analytics and its implications for cybersecurity are particularly exciting. From detecting advanced persistent threats (APTs) to preventing cyber attacks, big data analytics will play a vital role in transforming massive datasets into actionable intelligence. By taking this Big Data Analytics course, I hope to learn methodologies and develop analytical skills that I can apply as an information security professional.


Samantha Forbis

Hello, my name is Samantha Forbis. I'm an MS MIS candidate as well as a National Science Foundation Scholarship For Service recipient. My hobbies are cars, computers, and spending time with my family. Every time I turn around I see another entity collecting a massive amount of data: surveys, cookies, shopper reward programs, and so on. The usefulness, and potential misuse, of this data is something I find intriguing, and I also feel that the collection of this data is often misunderstood. Since cybersecurity is my primary concentration, my interests in big data revolve around safeguarding, analyzing, visualizing, and communicating the information and patterns gleaned from data. From MIS 586 I hope to gain new data collection, analysis, and visualization skills that enable me to make sense of complicated data structures and communicate the results in a meaningful way to those who need them.


Leon Walker

Who I am academically should be obvious to those who need to know. I believe that "big data" poses a security risk to individuals, because users are nonchalant about the sensitivity of the data they willingly share with companies and individuals. I believe that heuristics for securing personally identifiable and sensitive information can be developed by studying how companies acquire and process data. Because of my beliefs, this class creates some cognitive dissonance by requiring so much social interaction. However, I look forward to this class and the opportunity to learn how to secure information in a world that runs on "big data".

Ting-Ju Yang

Hi, my name is Ting-Ju Yang and I am a graduate student at the University of Arizona's Eller College of Management, pursuing my Master's in Management Information Systems. Everything I know about "big data" comes from the news: people can look into data, find useful information or patterns in people's behavior, and try to use them in marketing or other business strategies. I'm not interested in playing god, but if I can dig deep to see how people act and what the trends are, that would be really "cool". I am looking forward to this class and to seeing more interesting results.

Yu Zhao

My name is Yu Zhao. I am currently a master's student in the MIS program at the University of Arizona. My bachelor's degree, also in MIS, is from Harbin Institute of Technology. Before I joined this program, I knew about data and how to store it, but I did not know how to analyze it or how to use it to generate information. After taking the Business Intelligence class last semester, I found it fascinating that data can give us so much information. Then I spent the summer as a business intelligence intern at NBTY. While working there I found data so interesting and versatile that we could use the same data set to generate different information for different departments to support their decisions. Because the company is a manufacturing company, its data sets are unorganized and huge, and during the work I kept thinking about how people can use such big data properly. This reminded me of the guest lecture Dr. Ram gave us last semester: she took data sets that seemed to have no relation at all, connected them, and generated beautiful visualizations from data that looked meaningless at first. That impressed me a lot, and it is the main reason I decided to take this course. I hope to learn how to use big data to generate useful information in this class.