The four types of Big Data as a Service (BDaaS)
The popularity of Big Data lies in its broad definition: employing high-volume, high-velocity, and high-variety data sets that are difficult to manage and extract value from. Unsurprisingly, most businesses can identify themselves as facing Big Data challenges and opportunities, now or in the future. This is therefore not a new issue, yet it has a ... read more →

Full Metal Hadoop as a Service with Altiscale
Hadoop, known to be powerful and challenging to manage, is increasingly becoming available as-a-Service in numerous varieties. Initially, do-it-yourself distributions like Cloudera, MapR, and Hortonworks made up a great part of the market. In recent years, following the success of Amazon Web Services' Elastic MapReduce (EMR), Hadoop/data services like Qubole have been gaining popularity. Last year, quietly, another entrant in the field ... read more →

Lambda Architecture: Achieving Velocity and Volume with Big Data
Big data architecture paradigms are commonly separated into two (supposedly) diametrically opposed models: the more traditional batch processing and (near) real-time processing. The most popular technologies representing the two are Hadoop with MapReduce and Storm, respectively. However, a hybrid solution, the Lambda Architecture, challenges the idea that these approaches have to exclude each other. The Lambda ... read more →

GraphChi: How a Mac Mini outperformed a 1,636 node Hadoop cluster
Last year GraphChi, a spin-off of GraphLab, a distributed graph-based high performance computation framework, did something remarkable. GraphChi outperformed a 1,636 node Hadoop cluster processing a Twitter graph (dataset from 2010) with 1.5 billion edges – using a single Mac Mini. The task was triangle counting and the Hadoop cluster required over 7 hours while ... read more →

ORC: An Intelligent Big Data file format for Hadoop and Hive
RCFile (Record Columnar File), the previous Hadoop Big Data storage format on Hive, is being challenged by the smart ORC (Optimized Row Columnar) format. My first post on the topic, Getting Started with Big Data with Text and Apache Hive, presented a common scenario to illustrate why Hive file formats are significant to its performance and ... read more →

Faster Big Data on Hadoop with Hive and RCFile
SQL on Hadoop with Hive makes Big Data accessible, yet performance can be lacking. RCFile (Record Columnar File) is a great optimisation for Big Data with Hive. The previous two posts in this four-part series explained why to use text on the periphery of an ETL process and described optimisations for text. The inside of a Hive ... read more →

Optimising Hadoop and Big Data with Text and Hive
Hadoop’s Hive SQL interface helps to reduce costs and to get results fast with Big Data from text. Simple optimisations improve performance significantly. The previous post, Getting Started with Big Data with Text and Apache Hive, described the case for using text format to import and export data for a Hive ETL and reporting process. These ... read more →

Getting Started with Big Data with Text and Apache Hive
Big Data is stored and exchanged as text more often than expected. Apache Hadoop’s Hive SQL interface helps to reduce costs and to get results fast. Often, things have to get done fast rather than perfectly. However, with big data even a small decision like a file format can have a great impact. What are ... read more →

32x Faster Hadoop and Map Reduce With Indexing
The simplicity of Hadoop and MapReduce, and especially their lack of indices, significantly limits their performance. I described how MapReduce 2.0 and alternatives bypassing MapReduce will change Hadoop’s application and speed it up in the next year or two. Another approach is the introduction of indices to data stored on the Hadoop Distributed File System (HDFS). At its inception, ... read more →

Online Education Revolution
For hundreds of years, brick-and-mortar universities stood at the unchallenged pinnacle of education. In recent decades, remote and online education appeared. At first it was little more than traditional education channelled through alternative media, with limited success and appeal. This is changing fast and dramatically. read more →