Skip navigation
2013

The Big Data Real-time Application is a scenario to return the computation and analysis results in real time even if there are huge amount of data. This is an emerging demand on database applications in recent years.

 

 

In the past, because there are not so much data, the computation is simple, and few parallelisms, the pressure on the database is not great. A high-end or middle-range database server or cluster can allocate enough resource to meet the demand. Moreover, in order to rapidly and parallelly access to the current business data and the historic data, users also tend to arrange a same database server for both the query analysis system and the production system. By this way, the database cost can be lowered, the data management streamlined, and the parallelism ensured to some extent. We are in the prime time of database real-time application development.

 

 

In recent years, due to the data explosion, and the more and more diversified and complex application, new changes occur to the database system. The obvious change is that the data is growing at an accelerated pace with ever higher and higher volume. Applications are increasingly complex, and the number of concurrent accesses makes no exception. In this time of big data, the database is under increasing pressure, posing a serious challenge to the real-time application.

 

 

The first challenge is the real-timeness. With the heavy workload on the database, the database performance drops dramatically, the response is sluggish, and user experience is going from bad to worse quickly. The normal operation of the critical business system has been affected seriously. The real-time application has actually become the half real-time.

 

 

The second challenge is the cost. In order to alleviate the performance pressure, users have to upgrade the database. The database server is expensive, so are the storage media and user license agreement. Most databases require additional charges on the number of CPUs, cluster nodes, and size of storage space. Due to the constant increase of data volume and pressure on database, such upgrade will be done at intervals.

 

 

The third challenge is the database application. The increasing pressure on database can seriously affect the core business application. Users would have to off-load the historic data from the database. Two groups of database servers thus come into being: one group for storing the historical data, and the other group for storing the core business data. As we know, the native cross-database query ability of databases are quite weak, and the performance is very low. To deliver the latest and promptest analysis result on time, applications must perform the cross-database query on the data from both groups of databases. The application programing would be getting ever more complex.

 

 

The forth challenge is the database management. In order to deliver the latest and promptest analysis result on time, and avoid the complex and inefficient cross-database programming, most users choose to accept the management cost and difficulty increase - timely update the historic library with the latest data from the business library. The advanced edition of database will usually provide the similar subscription & distribution or data duplicate functions.

 

 

The real-time big data application is hard to progress when beset with these four challenges.

 

 

How to guarantee the parallelism of the big data application? How to reduce the database cost while ensuring the real-timeness? How to implement the cross-database query easily? How to reduce the management cost and difficulty? This is the one of hottest topics being discussed among the CIOs or CTOs.

 

 

esProc is a good remedy to this stubborn headache. It is the database middleware with the complete computational capability, offering  the support for the computing no matter in external storage, across databases, or parallelly. The combination of database and esProc can deliver enough capability to solve the four challenges to big data applications.

 

 

esProc supports for the computation over files from external storage and the HDFS. This is to say, you can store a great volume of historical data in several cheap hard disks of average PCs, and leave them to esProc to handle. By comparison, database alone can only store and manage the current core business data. The goal of cutting cost and diverting computational load is thus achieved.

 

 

esProc supports the parallel computing, so that the computational pressure can be averted to several cheap node machines when there are heavy workload and a great many of parallel and sudden access requests. Its real-timeness is equal or even superior to that of the high-end database.

 

 

esProc offers the complete computational capability especially for the complex data computing. Even it alone can handle those applications involving the complex business logics. What's even better, esProc can do a better job when working with the database. It supports the computations over data from multiple data sources, including various structural data, non-structural data, database data, local files, the big data files in the HDFS, and the distributed databases. esProc can provide a unified JDBC interface to the application at upper level. Thus the coupling difficulty between big data and traditional databases is reduced, the limitation on the single-source report removed, and the difficulty of the big data application reduced.

 

 

With the seamless support for the combined computation over files stored in external storage and the database data, users no longer need the complex and expensive data synchronization technology. The database only focus on the current data and core business applications, while esProc enable users to access both the historic data in external storage and the current business data in database. By doing so, the latest and promptest analysis result can be delivered on time.

 

 

The cross-database computation and external storage computation capability of esProc can ensure the real-time query while alleviating the pressure on database. Under the assistance of esProc, the big data real-time application can be implemented efficiently at relatively low cost.

What is IOE? I=IBM, O=Oracle, and E=EMC. They represent the typical high-end database and data warehouse architecture. The high-end servers include HP, IBM, and Fujitsu, the high-end database software includes Teradata, Oracle, Greenplum; the high-end storages include EMC, Violin, and Fusion-io.

  

In the past, such typical high performance database architecture is the preference of large and middle sized organizations. They can run stably with superior performance, and became popular when the informatization degree was not so high and the enterprise application was simple. With the explosive data growth and the nowadays diversified and complex enterprise applications, most enterprises have gradually realized that they should replacing IOE, and quite a few of them have successfully implemented their road map to cancel the high-end database totally, including Intel, Alibaba, Amazon, eBay, Yahoo, and Facebook.

  

The data explosion has brought about sharp increase in the storage capacity demand, and the diversified and complex applications pose the challenge to meet the fast-growing computation pressure and parallel access requests. The only solution is to upgrade ever more frequently. More and more enterprise managements get to feel the pressure of the great cost to upgrade IOE. More often than not, enterprises still suffer from the slow response and high workloads even if they've invested heavily. That is why these enterprises are determined to replace IOE.

   Hadoop is one of the IOE solutions on which the enterprise management have pinned great hope.

   It supports the cheap desktop hard disk as a replacement to high-end storage media of IOE.

   Its HDFS file system can replace the disk cabinet of IOE, ensuring the secure data redundancy.

   It supports the cheap PC to replace the high-end database server.

   It is the open source software, not incurring any cost on additional CPUs, storage capacities, and user licenses.

  

With the support for parallel computing, the inexpensive scale-out can be implemented, and the storage pressure can be averted to multiple inexpensive PCs at less acquisition and management cost, so as to have greater storage capacity, higher computing performance, and a number of paralleling processes far more than that of IOE. That's why Hadoop is highly anticipated.

  

However, IOE still has an advantage over Hadoop for its great data computing capability. The data computing is the most important software function for the modern enterprise data center. Nowadays, it is normal to find some data computing involving the complex business logics, in particular the applications of enterprise decision-making, procedure optimizing, performance benchmarking, time control, and cost management. However, Hadoop alone cannot replace IOE. As a matter of facts, those enterprises of high-profile champions for replacing IOE have to partly keep the IOE. With the drawback of insufficient computing capability, Hadoop can only be used to compute the simple ETL, data storage and locating, and is awkward to handle the truly massive business data computation.

  

To replace IOE, we need to have the computational capability no weaker than the enterprise-level database and seamlessly incorporating this capability to Hadoop to give full play to the advantageous middleware of Hadoop. esProc is just the choice to meet this demand.

  

esProc is a parallel computing framework software which is built with pure Java and focused on powering Hadoop. It can access Hive via JDBC or directly read and write to HDFS. With the complete data computing system, you can find an alternative to IOE to perform a range of data computing of whatsoever complexity. It is especially good at the computation requiring complex business logics and stored procedures.

 

esProc supports the professional data scripting languages, offering the true set data type, easy for algorithm design from business client's perspective, and effortless to implement the complex business logics of clients. In addition, esProc supports the ordered set for arbitrary access to the member of set and perform the serial-number-related computation. The set of set can be used to represent the complex grouping style easily, for example, the equal grouping, align grouping, and enum grouping. Users can operate on the single record in as same way of operating on an object. esProc scripts is written and presented in a grid. By this way, the intermediate result can be referenced without definition. To add the convenience, the complete code editing and debugging functions are provided. esProc can be regarded as a dynamic set-lized language which has something in common with R language, and offers native support for distributed parallel computation from the core. Programmers can surely be benefited from the efficient parallel computation of esProc while still having the simple syntax of R. It is built for the data computing, and optimized for data processing. For the complex analysis business, both its development efficiency and computing performance are beyond the existing solution of Hadoop.

          

The combined use of Hadoop + esProc can fully remedy the drawback to Hadoop, empowering Hadoop to replace the very most of IOE features and improving its computing capability dramatically.