
"A GRAPH DATABASE: ANALYSIS OF RETAIL DATA"

Sweta Kumari1
1Associate Consultant, HCL Technologies Ltd, Noida

Er. Pankaj Prasad2
2Resource Person, MCA Department, BRABU & Academic Counsellor of IGNOU

Mr. Anand Kumar3
3Resource Person, R.D.S College & Academic Counsellor of IGNOU

Abstract- The field of "big data" focuses on the storage, processing, and visualisation of enormous amounts of data. Data is expanding more quickly than ever today. We need an environment, along with the appropriate tools and applications, that enables us to gain useful insights from this data. Retail is one of the industries that regularly gathers a sizable volume of transaction data. To make better business decisions, retailers must understand the buying habits and behaviour of their customers. The goal of market basket analysis, a subfield of data mining, is to identify patterns in transaction data from the retail industry. Our objective is to identify software and solutions that merchants can use to quickly understand their data and make wiser business decisions. Such tasks cannot be completed manually because of the volume and complexity of the data. Trends change rapidly, and retailers want to adapt and change with the times just as quickly.

Keywords: Graph Database, Retail Data Analysis, Big Data, Storage, Processing, Visualisation.

1 INTRODUCTION

Large networks of data are all around us nowadays. Genetic and protein interaction networks, transportation networks, social networks, and organisational networks are a few examples of these vast networks.

These networks hold data that can reveal beneficial trends and insights. Organizations can use these insights to make wiser business decisions. Yet, manual analysis of such networks is challenging due to their scale and complexity. Several network analysis tools, such as the Stanford Network Analysis Platform (SNAP), have been developed to help analysts evaluate such massive networks effectively. In network analysis, data are viewed as graphs.

A graph consists of nodes and edges: the nodes represent the network's entities, while the edges represent the connections between them. Because various types of interactions are possible between the nodes, the resulting graph structures are frequently very complicated. Retail transaction data is another example of such vast networks. Every day, a massive amount of transactional data is produced by every retail location.

These transactions contain information about the products that customers purchased. Retailers must examine this data to derive insights and apply them to make better business decisions. Finding the time of day when customers are most likely to visit the store, and assigning extra staff to that shift, is one example of a beneficial insight.

2 DATA ANALYSIS AND BIG DATA

Big Data is a phrase used to mean a massive volume of both structured and unstructured data. It is a data management challenge that has often been defined by the "three V's": Volume, Variety and Velocity. Volume refers to the amount of data that is generated and collected, ranging from terabytes to petabytes. Variety refers to the wide range of data collected from various sources in various formats, such as tweets, blogs, videos, music, catalogues and photos. Velocity refers to the speed with which data is generated every day (for example, the emails exchanged, social media posts and retail transactions) and the speed with which it is analyzed, whether in real time or near real time.

Companies and researchers are addressing the challenges of big data because it helps organizations take better decisions and build better products and services. Due to the huge volume of data, especially unstructured data, data storage, analysis and modeling is a clear bottleneck in many applications. The underlying algorithms are not always scalable enough to handle the challenges of big data. It is also crucial to address the presentation of results and their interpretation by non-technical domain experts so that the actionable knowledge can be used effectively.

These challenges led to the development of data analytics. Analytics is the extensive use of data to analyze, interpret and communicate decisions and actions. Corporate data has grown consistently and rapidly during the last decade, and because of this availability of data, analytics has become a crucial part of every business enterprise. The ability to analyze and synthesize information, using information systems, plays an important role in the rise or failure of a company.

Traditional analytics relies on a human analyst to generate a hypothesis and test it with a model. The need to consolidate, view and analyze data according to multiple dimensions, in ways that make sense to analysts at any given point in time, gave rise to many analytical technologies.

One such technology is online analytical processing (OLAP). It enables end-users to perform ad hoc analysis of data so that they can find the insights they need for better decision making. OLAP tools use multidimensional database structures, known as cubes, to store arrays of consolidated information.

Analytics is commonly sub-divided into three categories: descriptive analytics, which describes past activities; predictive analytics, which uses the past to predict the future; and prescriptive analytics, which proposes what needs to be done, for example by performing randomized testing with control groups.

3 RELATIONAL AND NON-RELATIONAL DATABASES

Relational databases ruled the information technology industry for more than 40 years.

They offer a very efficient way of storing and retrieving data that fits into a predefined schema of rows and columns. But in the last couple of decades, the nature of data collection and data processing by computer systems has changed in a number of ways.

Due to the increased volume of semi-structured, unstructured and interconnected data, it has become tedious and inefficient to store such data in relational databases, as they have a strict data model and deal poorly with relationships. This has led to the need for data models that are less constrained and can still handle huge amounts of semi-structured, unstructured and interconnected data. A database that does not follow the relational model and is less constrained is termed a non-relational database. Both models are widely used today, and deciding which database is more suitable for a task at hand is not always trivial, as both relational and non-relational databases have their advantages and disadvantages.

(i) Relational database- The relational data model was proposed by E. F. Codd in 1970.

It is implemented by a collection of data items organized as a set of formally described tables and is based on the concepts of relational algebra. The relational model was developed to address a multitude of shortcomings that formerly existed in the field of database management and application development. Relational databases offer powerful and simple solutions for a wide variety of commercial and scientific application problems, and they store highly structured data in predetermined tables.

In every industry, relational systems are used for applications requiring the storing, updating and retrieving of data elements, for operational, transactional and complex processing, as well as for decision support systems, including querying and reporting.

(ii) Non-Relational Database- Unlike a relational database, a non-relational database does not follow the relational data model to store and retrieve data. It is built on the notion of storing schema-free data and can provide faster methods of processing and handling large datasets of unstructured and interconnected data. This category of databases is also referred to as NoSQL databases. The name NoSQL was first used in 1998 by Carlo Strozzi for a shell-based relational database management system; it was a relational database, but it did not use the SQL language. The term NoSQL was reintroduced in 2009 as a label for a group of non-relational distributed databases. Today, many NoSQL databases are available on the market, such as Cassandra, MongoDB and Neo4j. These databases are based on different models, including column-based, document-based, key-value-based and graph-based databases; we discuss graph databases below.

Figure 1 Example of data implemented by means of the relational model

4 GRAPH DATABASES

These databases emerged to store relationship-oriented data naturally and efficiently using nodes and edges. Graph databases were proposed on the promise that they scale well for large data sets and also simplify interactions with applications by reducing the effort needed to map in-memory data structures to the database. One of the reasons for the success of graph databases has been the inability of relational databases to process highly connected data efficiently. Various details in the data can also be stored as properties of nodes and edges. A graph database is well suited for storing complex relationships between attributes and can offer a cost-saving solution for storing graph-like data compared to the relational model. For example, processing some graph operations over a relational database can be very inefficient, because it may need complex join operations or sub-queries; the same queries can be handled easily in graph databases. Some graph databases available on the market are HyperGraphDB and Neo4j. Figure 2 provides an example representing a part of the data model, for the data used in our project, in Neo4j. In this example, we can see three different types of nodes and two relationships. There is a node representing a customer, which is associated with an order using an 'ordered' relationship, and the products in that order are associated with it by an 'in order' relationship.
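As an illustration, the model of Figure 2 could be created and queried in Cypher roughly as follows. This is a minimal sketch: the labels, the relationship names ORDERED and IN_ORDER, their directions and the example ids are our rendering of the model for illustration, not the exact schema statements of our application.

    // Hypothetical ids and names, for illustration only
    CREATE (c:Customer {id: 1})
    CREATE (o:Order {id: 100})
    CREATE (p1:Product {id: 10, name: 'Milk'})
    CREATE (p2:Product {id: 11, name: 'Bread'})
    // The two relationship types shown in Figure 2
    CREATE (c)-[:ORDERED]->(o)
    CREATE (p1)-[:IN_ORDER]->(o)
    CREATE (p2)-[:IN_ORDER]->(o);

    // "Which products did customer 1 buy?" is a single pattern match,
    // with no join tables involved
    MATCH (:Customer {id: 1})-[:ORDERED]->(:Order)<-[:IN_ORDER]-(p:Product)
    RETURN DISTINCT p.name;

The second query is the graph counterpart of the join-heavy relational query discussed above: the relationships are traversed directly instead of being recomputed through joins.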

Figure 2 Graph Data Model in Neo4j

5 METHODOLOGY

The methodology used in our work is based on the CRoss-Industry Standard Process for Data Mining (CRISP-DM), an abstract process model conceived in 1996 by an industry consortium. The model is independent of industry, tool and application and describes six phases for conducting a data mining project: business understanding, data understanding, data preparation, modeling, evaluation and deployment. The description below follows this outline.

(i) Business understanding- This is the initial phase and focuses on understanding the needs and objectives of the business for which data is mined. Every business is focused on solving a specific problem and thus has a specific set of requirements and needs. While developing an analytical solution for a business, it is extremely important to understand the problem that is being addressed and then build a solution that aligns with those business needs.


(ii) Data understanding- Understanding the data starts with the initial data collection and analysis. It is important to get familiarized with the data, identify data quality problems, discover first insights into the data and detect interesting subsets of data that can form hypotheses about hidden information.

(iii) Data preparation- The data preparation phase covers all activities needed to construct the final dataset that will be used in the model, and might have to be performed multiple times based on facts and findings in the data. It includes attribute selection as well as transformation and cleaning of data for the modeling tools.

(iv) Modeling- In this phase, various modeling techniques are selected and applied, and their parameters are calibrated to optimal values. Some well-known modeling techniques are: classification analysis, association rule learning, outlier detection and clustering analysis. Typically, there are several techniques for the same data mining problem type. Some techniques have specific requirements on the form of data. Therefore, stepping back to the data preparation phase is often needed.

(v) Evaluation- At this stage of the project, a model has been built that appears to have high quality from the data analysis perspective. Before proceeding to the final deployment of the model, it is important to evaluate the model thoroughly and review the steps executed to construct it, to be certain that it properly achieves the business objectives. A key goal is to determine whether some important business issue has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached.

(vi) Deployment- Creation of the model is generally not the end of the project. Even if the purpose of the model is to increase knowledge of the data, the knowledge gained in the process needs to be organized and presented in a way that is useful to the customer, and thus the model must be successfully deployed.

6 PROCESSING THE INSTACART RETAIL DATA

In this paper, we describe our application for processing the Instacart dataset. Our goal is to create a platform, using principles of graph theory, that retailers can use to understand customers' buying behavior and provide those customers with appropriate coupons. After collecting and understanding the dataset, we preprocess the data to prepare it for loading into the database. The data is loaded as a graph with nodes as the unique attributes and edges as the relationships between the attributes. To achieve our goal, we construct an ad-hoc product network over our model and use a clustering algorithm to generate clusters of products, such that products within a cluster are much more strongly related to each other than to products in other clusters. Then, for recommending products to a customer by means of coupons, we find the customer's "belonging" factor for each cluster. The belonging factor of a customer to a cluster is the ratio of the number of products bought by the customer that belong to that cluster to the total number of products bought by the customer. Based on this factor, a retailer can recommend products from the clusters. In this section we describe each step in detail.

(i) Data Set- The data set used in our project is an open-source dataset published by Instacart. Instacart is an online retail service that delivers groceries. Customers select groceries through a web application from various retailers, and the groceries are delivered to them by a personal shopper.

The data was published in 2017 for research purposes. The dataset contains over 3 million grocery orders from more than 200,000 customers. It is anonymized and does not contain any customer information; it only contains a unique identifier (id) for each customer. The data is available in multiple comma-separated-value (CSV) files. The first file contains product information, including the product id, product name, aisle id and department id. The aisle id represents where the product is placed in the store, and the department id signifies the category to which the product belongs.

(ii) Preprocessing Data- Preprocessing the data is an important phase in analytics before the data is loaded and analyzed. We inspected the data for missing values and random errors, such as duplicate identifiers for two or more products and the same order identifier associated with different customers. The product names were also cleaned before loading into the database.

Data Representation- After cleaning and preprocessing, the data was ready to be loaded into the database. The data was loaded into Neo4j from our application built in Java, which uses the Java Neo4j Native API to upload the data. In order to avoid creating duplicate nodes for the same customer, order or product, a unique constraint is applied on the respective identifiers.
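Although our loading was done through the Java Neo4j Native API, the uniqueness constraints and the deduplicating load pattern can be illustrated in Cypher. This is a sketch assuming Neo4j 3.x constraint syntax and the column names of the public Instacart orders.csv file (order_id, user_id); it is not the literal code of our application.

    // Uniqueness constraints: one per identifier, so duplicate nodes are rejected
    CREATE CONSTRAINT ON (c:Customer) ASSERT c.id IS UNIQUE;
    CREATE CONSTRAINT ON (o:Order) ASSERT o.id IS UNIQUE;
    CREATE CONSTRAINT ON (p:Product) ASSERT p.id IS UNIQUE;

    // MERGE matches an existing node or creates it, so the same customer
    // or order never appears twice even if the load is re-run
    LOAD CSV WITH HEADERS FROM 'file:///orders.csv' AS row
    MERGE (c:Customer {id: toInteger(row.user_id)})
    MERGE (o:Order {id: toInteger(row.order_id)})
    MERGE (c)-[:ORDERED]->(o);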

(iii) Analyzing the Data- After loading the data, we explore the dataset using Neo4j to deepen our understanding. We loaded only a partial dataset so that it is easy to analyze and work with. In real applications, where the data grows every day, data centers with huge storage capacities can be used for storing and maintaining the dataset. This analysis is described in the next section.

7 ANALYZING THE DATA

Table 1 shows the count of each type of label (customers, orders, products, aisles and departments) loaded in the database. We find that, on average, 10 products were bought per order, with the maximum being 127 products and the minimum a single product. The data includes the day of the week and the time of day when each order was placed. Figure 3 shows the day-of-the-week distribution of the orders. The days are represented by the numbers 0 to 6; the data given by Instacart did not specify which day each number indicates. From the figure, however, we can see that 0 and 1 are the days when most of the orders are placed, so they most likely represent the weekend. If 0 and 1 are the weekend days, then Wednesday, represented by the number 4, has the lowest number of orders. Figure 4 shows the distribution of the hours at which orders were placed. The hours are given in 24-hour format, and we observe that most of the orders were placed during morning and afternoon hours rather than at night.
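The per-order statistics reported above can be reproduced with a single aggregation over the order-product pattern. A sketch follows, reusing the IN_ORDER direction assumed earlier and assuming, for the second query, that each order's day of week was stored as a dayOfWeek property (the order_dow column of the Instacart data):

    // Average, maximum and minimum number of products per order
    MATCH (o:Order)<-[:IN_ORDER]-(p:Product)
    WITH o, count(p) AS productsPerOrder
    RETURN avg(productsPerOrder) AS avgProducts,
           max(productsPerOrder) AS maxProducts,
           min(productsPerOrder) AS minProducts;

    // Orders per day of the week (0-6), the distribution plotted in Figure 3
    MATCH (o:Order)
    RETURN o.dayOfWeek AS day, count(o) AS orders
    ORDER BY day;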

Figure 3 Number of orders placed on each day of the week

Figure 4 Number of orders placed at different hours of the day

Table 1 Dataset used

8 CONCLUSION

We described a method for analysing retail data using graph representations. Retail is an industry that generates a significant amount of data every day. Because of the interconnectedness of the data, it is no longer practical to identify the linkages between items using conventional approaches. A significant contribution of our work was building a platform for retailers that can be used to acquire insights from the data, helping them make decisions that benefit their business and that can be made quickly. Real transaction data from Instacart was the source of the datasets used in our research.

We discovered that a graph database can be highly useful in such a case because graphs are extremely intuitive and simple to work with for high-dimensional data. Large datasets could be handled quite effectively by the Neo4j graph database. We ended up with a dataset represented as nodes and edges.

We downsized the dataset for easier analysis rather than increasing Neo4j's computational capacity. At first, learning Cypher was challenging, but once we grasped its principles, the platform was simple to use.

REFERENCES

1. Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte and Etienne Lefebvre, Fast unfolding of communities in large networks (2008).

2. Neo4j Algorithms, https://neo4j-contrib.github.io/neo4j-graph-algorithms/.

3. Ivan F. Videla-Cavieres and Sebastián A. Ríos, Characterization and completion of the customer data from a retail company using graph mining techniques (2014).

4. Georgios Drakopoulos and Andreas Kanavos, Fuzzy Graph Community Detection in Neo4j (2016).

5. Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser and Grzegorz Czajkowski (Google, Inc.), Pregel: A System for Large-Scale Graph Processing (2010).

6. Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin and Joseph Hellerstein, GraphLab: A New Framework for Parallel Machine Learning (2010).

7. Colin Shearer, The CRISP-DM Model: The New Blueprint for Data Mining (2000).

8. Alexandros Labrinidis and H. V. Jagadish, Challenges and Opportunities with Big Data, Proceedings of the VLDB Endowment, Volume 5, Issue 12, August 2012.

9. Thomas H. Davenport, Enterprise Analytics: Optimize Performance, Process, and Decisions Through Big Data, International Institute for Analytics.

10. Veronika Abramova, Jorge Bernardino and Pedro Furtado, Experimental Evaluation of NoSQL Databases (2014).

11. Gerard George, Ernst C. Osinga, Dovev Lavie and Brent A. Scott, Big Data and Data Science Methods for Management Research, Academy of Management Journal, 2016, Vol. 59, No. 5, 1493-1507.

12. McAfee, A. and Brynjolfsson, E., Big data: The management revolution, Harvard Business Review, 90: 61-67 (2012).

13. Santo Fortunato (Complex Networks and Systems Lagrange Laboratory, ISI Foundation, Italy), Community detection in graphs (2009).

14. Scott, J., Social Network Analysis: A Handbook (2000), SAGE Publications, London, UK.

15. Michelle Girvan and M. E. J. Newman, Community structure in social and biological networks (2002).

16. Paul Erdős and Alfréd Rényi, On Random Graphs, Publ. Math. Debrecen 6, pp. 290-297.

17. Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, Third Edition (2011), The Morgan Kaufmann Series in Data Management Systems.

18. Sami Ayramo and Tommi Karkkainen, Introduction to partitioning-based clustering methods with a robust example, Reports of the Department of Mathematical Information Technology, Series C: Software and Computational Engineering, No. C 1/2006.

19. Chad Vicknair, Michael Macias, Zhendong Zhao, Xiaofei Nan, Yixin Chen and Dawn Wilkins, A comparison of a graph database and a relational database: a data provenance perspective (2010).

20. E. F. Codd (IBM Research Laboratory, San Jose, California), A Relational Model of Data for Large Shared Data Banks, Communications of the ACM, Volume 13, Issue 6, June 1970, Pages 377-387.

21. M. E. J. Newman (Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI), Fast algorithm for detecting community structure in networks (2003).
