Big data
Deriving business value from big data is a multiphase process that takes raw data and refines it into useful information. After the data has been acquired, by using data stores such as the Hadoop Distributed File System (HDFS), it is pre-processed to weed out the less useful data and to structure what is left for analysis. This phase in the process is often handled by using Apache Hadoop.
The new tools that are required to capture and organise big data do not mean that relational data stores are irrelevant. To capitalise on the big data opportunity, organisations must be able to analyse all types of data, both relational and nonrelational (text, sensor data, audio, video, customer transactions, and so on). However, it will be necessary to augment these traditional approaches with technology that will allow organisations to benefit from the big data era.
Contents
- What is big data?
- Apache Hadoop
- Big data and relational databases
- Sqoop
- SQL Server big data connectors
- Unstructured data storage in SQL Server
- Unstructured data analysis in SQL Server
What is big data?
Big data refers to data sets whose size is beyond the ability of typical database software tools to process, store, and analyse. Big data is characterised not only by its size but also by its variety. With the rapid growth in the number of devices and applications that create and capture information, the data has become more complex, because it includes not only traditional relational data, but also raw, semi-structured and unstructured data from sources such as:
- Web pages.
- Web log files (including click-stream data).
- Search indexes.
- Social media.
- Email.
- Documents.
- Sensor data from devices such as smartphones, energy meters, cars, and industrial machines.
In a digitised world, therefore, consumers going about their day (communicating, browsing, buying, sharing, searching) create enormous trails of data.
Analysing such raw, semi-structured, and unstructured data enables organisations to gain a better understanding of their business, customers, and marketplace. But what solutions are available that can store and process all that data and find out what is valuable and worth exploiting?
Apache Hadoop
Apache Hadoop is a top-level Apache project, written in Java, that enables organisations to organise raw (often unstructured) data and transform it so that it can be loaded into other applications for analysis. Hadoop contains two main components: the Hadoop Distributed File System (HDFS) for data storage, and the MapReduce programming model, which manages the processing of the data.
HDFS allows Hadoop to scale out workloads across large clusters of inexpensive machines to work on big data problems. HDFS is fault tolerant, which is a necessity as in clusters with potentially thousands of nodes, failure is the norm rather than the exception. (Detection of faults and quick, automatic recovery from them was a core architectural goal of HDFS.)
MapReduce has emerged as a popular way to harness the power of large clusters of computers. This is because MapReduce breaks a problem into independent pieces to be worked on in parallel across many machines.
The term MapReduce refers to the two tasks that MapReduce programs perform. The first is the map job, which takes a set of data and converts it into another set of data in which individual elements are broken down into key/value pairs. The reduce job takes the output from a map job as input and combines those key/value pairs into a smaller set of key/value pairs. As the name MapReduce implies, the reduce job is always performed after the map job.
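For example, the canonical word-count job illustrates the data flow (this is a sketch of the model, not of any particular Hadoop API):

map:    "the cat sat on the mat"  ->  (the,1) (cat,1) (sat,1) (on,1) (the,1) (mat,1)
group:  collect values by key     ->  (cat,[1]) (mat,[1]) (on,[1]) (sat,[1]) (the,[1,1])
reduce: sum each key's values     ->  (cat,1) (mat,1) (on,1) (sat,1) (the,2)

Because each map call depends only on its own slice of the input, and each reduce call only on one key's values, both phases can be spread across many machines.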
Big data and relational databases
Your big data solution will not live alone; it has to integrate with, and will likely enhance, your relational database.
Sqoop
Sqoop is a command-line utility for Hadoop that imports and exports data between HDFS (or Hive) and structured databases. Sqoop was created by Cloudera, and is now a top-level Apache project. The name Sqoop comes from "SQL to Hadoop". Sqoop uses MapReduce to import and export the data, which provides parallel operation as well as fault tolerance.
Sqoop uses a connector-based architecture to read from and write to external databases. By default, Sqoop includes connectors for various popular databases such as MySQL, PostgreSQL, Oracle, SQL Server, and DB2.
Sqoop uses JDBC to connect to the database. JDBC is a Java application programming interface that connects Java to relational databases. To interact with a database, a Java application uses a JDBC driver. A JDBC driver implements the JDBC API for a particular database. For example, to connect Sqoop to SQL Server, the Microsoft SQL Server JDBC driver is required.
Sqoop includes a generic JDBC connector, which can be used to connect to any database that is accessible through JDBC. By using the generic JDBC connector with a JDBC-ODBC Bridge, Sqoop can access any database for which an ODBC driver is available.
Apart from the built-in connectors, many companies have developed their own connectors that can be plugged into Sqoop.
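To give a feel for the tool, here is a minimal export sketch that moves files from an HDFS directory back into a database table (the connection string, table name, and directory are hypothetical):

$ sqoop export \
    --connect 'jdbc:mysql://dbhost/sales' \
    --username myuser --password mypassword \
    --table daily_totals \
    --export-dir /data/daily_totals

Each map task reads a slice of the files under --export-dir and writes the records to the target table as INSERT statements, which is where Sqoop's parallelism comes from.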
SQL Server big data connectors
The Microsoft SQL Server Connector for Apache Hadoop (SQL Server-Hadoop Connector) enables data to be moved backwards and forwards between Hadoop and SQL Server.
As well as being able to transfer data more efficiently than the built-in SQL Server connector, the SQL Server-Hadoop Connector provides support for nchar and nvarchar columns, SQL Server's Unicode data types.
Although Microsoft recommend using the SQL Server-Hadoop Connector with Cloudera's Hadoop distribution, we successfully used the connector with the Apache Hadoop distribution.
We first copied the Microsoft SQL Server JDBC driver to the $SQOOP_HOME/lib directory on the Sqoop machine:
$ gunzip sqljdbc_4.0.2206.100_enu.tar.gz
$ tar -xvf sqljdbc_4.0.2206.100_enu.tar
$ cp sqljdbc_4.0/enu/sqljdbc4.jar /usr/lib/sqoop/lib
We then installed the SQL Server-Hadoop Connector under $SQOOP_HOME:
$ gunzip sqoop-sqlserver-1.0.tar.gz
$ tar -xvf sqoop-sqlserver-1.0.tar
$ cd sqoop-sqlserver-1.0
$ export SQOOP_HOME=/usr/lib/sqoop
$ export MSSQL_CONNECTOR_HOME=/tmp/sqoop-sqlserver-1.0
# ./install
We also had to add the SQL Server JDBC driver path to the entry in $SQOOP_HOME/conf/managers.d/mssqoop-sqlserver:
com.microsoft.sqoop.SqlServer.MSSQLServerManagerFactory=/usr/lib/sqoop/lib/sqljdbc4.jar
We then used the connector to import data from SQL Server to HDFS:
$ ./sqoop import \
    --connect 'jdbc:sqlserver://mymachine;instanceName=SQLEXPRESS;username=myuser;password=mypassword;database=AdventureWorks' \
    --table HumanResources.Department \
    --target-dir /data/department \
    --split-by DepartmentID
Sqoop imports the SQL Server data into HDFS, which can be examined in the HDFS shell:
$ hadoop dfs -cat /data/department/part-m-00001
5,Purchasing,Inventory Management,1998-06-01 00:00:00.0
6,Research and Development,Research and Development,1998-06-01 00:00:00.0
7,Production,Manufacturing,1998-06-01 00:00:00.0
8,Production Control,Manufacturing,1998-06-01 00:00:00.0
Note The SQL that Sqoop generates to extract the data from SQL Server surrounds the table name, schema prefix included, with square brackets ([HumanResources.Department]). SQL Server interprets the bracketed string as a single object name rather than as schema.table, so the generated SQL fails. The workaround is to create a synonym whose name matches the bracketed identifier. We used our SQL Server ODBC driver for Linux and UNIX platforms to do this:
$ /usr/local/easysoft/unixODBC/bin/isql.sh -v SQLSERVER_2012
SQL> use AdventureWorks
SQL> CREATE SYNONYM [HumanResources.Department] FOR HumanResources.Department
Unstructured data storage in SQL Server
SQL Server provides binary large object (BLOB) storage, through data types such as varbinary(max), for data whose structure is not recognised by the database, for example, product brochures and product images. The disadvantage of BLOB data is that its size is restricted (to 2 GB) and, because it is stored inside the database file, it can make a database enormously bloated and very slow to back up, restore, and maintain.
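As a sketch of this approach (the table and its columns are hypothetical, not part of a sample database), brochures might be stored like this:

SQL> CREATE TABLE Production.Brochure (
         BrochureID int IDENTITY PRIMARY KEY,
         FileName nvarchar(260) NOT NULL,
         Content varbinary(max) -- BLOB data, stored inside the database file
     )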
The FILESTREAM storage attribute (introduced in SQL Server 2008), applied to varbinary(max) columns, enables unstructured binary data such as documents and images to be stored on the Windows file system instead of in the database file. FILESTREAM data is still under the control of SQL Server, and so you retain database engine functionality such as backing up, restoring, and controlling access to the data, while taking advantage of low-cost hardware to store the data.
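A minimal sketch of a FILESTREAM table follows; it assumes the database already has a FILESTREAM filegroup, and the table itself is hypothetical. Note that FILESTREAM requires a uniqueidentifier ROWGUIDCOL column:

SQL> CREATE TABLE Production.BrochureFS (
         BrochureID uniqueidentifier ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
         FileName nvarchar(260) NOT NULL,
         Content varbinary(max) FILESTREAM -- bytes live on the NTFS file system
     )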
The FileTable feature (introduced in SQL Server 2012) builds on the FILESTREAM attribute and exposes SQL Server tables as Windows folders, which can be accessed through Windows Explorer. The directory structure and the file attributes are stored in the FileTable as columns. An organisation can store files and documents in FileTables and access them from client applications as if they were stored in the file system, without making any changes to those applications.
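Creating a FileTable is a one-statement job. In this sketch the table and directory names are our own choices, and the database must already be enabled for FILESTREAM:

SQL> CREATE TABLE Documents AS FILETABLE
     WITH (FILETABLE_DIRECTORY = 'Documents')

Files copied into the exposed Documents folder then appear as rows in the Documents table, and rows inserted through SQL appear as files.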
Unstructured data analysis in SQL Server
Full-text search
Full-text search allows for fast and efficient querying of large amounts of unstructured text data. Unlike the character-based comparisons used in SQL LIKE queries, full-text search performs linguistic searches by operating on words or phrases. (Linguistic searches use information about the structure and variation of languages so that users can more easily find relevant information.) Querying large amounts of data by using full-text search can provide significant performance gains over character-based searches, as well as gains in usability.
A SQL LIKE query run against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data: a LIKE query against millions of rows of text data can take minutes to return, whereas a full-text query can take seconds or less, depending on the number of rows returned.
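The two styles of query look similar on the surface, as this sketch against the AdventureWorks Production.ProductDescription table shows (the CONTAINS version assumes a full-text index has been created on the Description column):

SQL> -- Character-based: scans every row, matching substrings
SQL> SELECT ProductDescriptionID FROM Production.ProductDescription
     WHERE Description LIKE '%aluminum%'

SQL> -- Linguistic: probes the full-text index for the word
SQL> SELECT ProductDescriptionID FROM Production.ProductDescription
     WHERE CONTAINS(Description, 'aluminum')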
Doing linguistic searches instead of character-based searches also adds other levels of usability. For example, when searching on a given word, full-text search not only tries to match on the given word, but can also match on forms of the given word or even on other words that have similar meaning. For example, searching on "house" can also yield matches on "houses," "housing," or even "home."
The following query retrieves all the products whose name matches a particular search pattern:
SQL> USE AdventureWorks2012
SQL> SELECT ProductID, [Name] FROM Production.Product
     WHERE CONTAINS([Name], '"*washer*" OR "*ball*"');
+-----------+-----------------------+
| ProductID | Name                  |
+-----------+-----------------------+
| 2         | Bearing Ball          |
| 3         | BB Ball Bearing       |
| 4         | Headset Ball Bearings |
| 341       | Flat Washer 1         |
+-----------+-----------------------+
Statistical semantic search
Statistical semantic search, introduced in SQL Server 2012, extends SQL Server's full-text search capability by providing semantic insight into document content. While full-text search lets you query specific words in a document, statistical semantic search lets you query the meaning of a document. Typical semantic search use cases include finding key phrases in a document, finding similar documents, and finding related documents.
We used our SQL Server ODBC driver for Linux and UNIX platforms to try out these semantic search use cases.
Example: Finding the key phrases in a document
In the following example, the query locates and retrieves the key phrases in a document stored in the Production.Document table of the AdventureWorks sample database. A score is assigned to each key phrase based on its statistical significance. This score determines the order in which each key phrase is reported.
SQL> USE AdventureWorks2012
SQL> SELECT TOP(10) KEYP_TBL.keyphrase, KEYP_TBL.score
     FROM SEMANTICKEYPHRASETABLE ( Production.Document, Document, 0x6B40 ) AS KEYP_TBL
     ORDER BY KEYP_TBL.score DESC;
+-------------+-----------+
| keyphrase   | score     |
+-------------+-----------+
| brake       | 0.9013491 |
| oil         | 0.8636435 |
| pedals      | 0.8491791 |
| pivot       | 0.8237474 |
| bicycle     | 0.8049155 |
| caliper     | 0.7762683 |
| lubricate   | 0.750616  |
| cantilever  | 0.740355  |
| lubricating | 0.7147026 |
| rims        | 0.6890502 |
+-------------+-----------+
Example: Finding similar or related documents
This query locates and retrieves CVs that are similar or related to the CV of a specified candidate. The results are scored based on their similarity to the CV and then displayed in ranked order.
SQL> SELECT TOP(10) KEY_TBL.matched_document_key AS Candidate_ID
     FROM SEMANTICSIMILARITYTABLE ( HumanResources.JobCandidate, Resume, 1 ) AS KEY_TBL
     ORDER BY KEY_TBL.score DESC;
+--------------+
| Candidate_ID |
+--------------+
| 2            |
| 10           |
| 7            |
| 4            |
| 6            |
| 3            |
| 8            |
| 9            |
| 5            |
+--------------+
The following query reveals that the most similar candidate worked for the same company as the specified candidate.
SQL> SELECT JobCandidateID,
     Resume.query('declare default element namespace
         "http://schemas.microsoft.com/sqlserver/2004/07/adventure-works/Resume";
         for $employer in /Resume/Employment
         return data($employer/Emp.OrgName)') Company
     FROM HumanResources.JobCandidate
     WHERE JobCandidateID <= 2
+----------------+--------------+
| JobCandidateID | Company      |
+----------------+--------------+
| 1              | Wingtip Toys |
| 2              | Wingtip Toys |
+----------------+--------------+
Example: Finding the key phrases that make documents similar or related
The following query retrieves the five key phrases that make two CVs similar or related to one another. It presents the results in descending order of the score that ranks the weight of each key phrase.
SQL> SELECT TOP(5) KEY_TBL.keyphrase, KEY_TBL.score
     FROM SEMANTICSIMILARITYDETAILSTABLE ( HumanResources.JobCandidate, Resume, 1, Resume, 2 ) AS KEY_TBL
     ORDER BY KEY_TBL.score DESC;
+---------------+-----------+
| keyphrase     | score     |
+---------------+-----------+
| production    | 0.5522802 |
| bachelor      | 0.4300567 |
| manufacturing | 0.1395864 |
| wingtip       | 0.1353242 |
| com           | 0.1256598 |
+---------------+-----------+
Statistical semantic search prerequisites
To try out statistical semantic search for yourself, you need to:
- Install the Full-Text and Semantic Extractions for Search feature.
- Install the Microsoft Office 2010 Filter Packs and Service Pack 1.
- Install, attach, and register the semantic language statistics database.
- Create a full-text catalog.
- Create a full-text index with the STATISTICAL_SEMANTICS option enabled.
Refer to your SQL Server documentation for assistance with these tasks.
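As a sketch, the last two steps might look like this for the Production.Document table (the catalog name is our own choice, and the key index name assumes the AdventureWorks2012 primary key on Production.Document):

SQL> CREATE FULLTEXT CATALOG DocumentCatalog
SQL> CREATE FULLTEXT INDEX ON Production.Document
     ( Document TYPE COLUMN FileExtension LANGUAGE 1033 STATISTICAL_SEMANTICS )
     KEY INDEX PK_Document_DocumentNode
     ON DocumentCatalog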