Dexter is a Big Data specialist who started his professional life building EPOS systems before moving into the advertising industry at VisualDNA. He has a passion for scalable and reliable systems and can often be found tinkering with Hadoop internals or exploring new technologies.
As data becomes ever more interconnected, he has delved into graph data and how it can best be exploited at extreme scale. His achievements include engineering a linearly scalable solution for ID linking in Janus and creating one of the fastest Neo4j bulk loading systems around.
He currently works in the quantitative research sector at G-Research, dealing with dirty data and squeezing as much speed as possible out of Spark. Outside of his technical achievements, he has founded and grown possibly the largest company-based Riichi Mahjong group in Europe, including arranging several sponsored Mahjong tournaments. When not evangelizing Mahjong, he lives with his wife and cat, enjoys reading, films, computer games, hiking, scuba diving, comedy and food, and is always up for a coffee.
Neo4j is a graph database traditionally restricted to a single machine. With Enterprise Edition, Neo4j has demonstrated a graph distributed over many machines with over a trillion relationships. For those on Community Edition, loading a graph approaching this size with the commonly available tools would take weeks.
At G-Research we needed to load ~500GB of data arriving as triples into a graph within 24 hours, and the faster the better. We were unable to use Fabric, the cornerstone of the trillion-relationship graph, so we worked with Neo4j to build a loader that completes in under an hour. The result is a methodology anyone can use, even on Community Edition, to drastically accelerate the loading of large datasets relative to the CSV bulk loader or the Spark connector.
In this talk we explore the application of:
* Spark for pre-processing
* Hadoop custom OutputFormats
* The BatchImport API
* A Lambda style architecture
* Options for further acceleration if required
We will also briefly explore why the common options are so limited and how GraalVM can be used to open this technique up to a wider community.
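To give a flavour of the pre-processing step, the sketch below splits raw (subject, predicate, object) triples into the deduplicated node and relationship records a batch importer expects. At scale this logic would run as Spark transformations; plain Python and the hypothetical `split_triples` helper are used here purely for illustration.

```python
def split_triples(triples):
    """Return (nodes, relationships) from an iterable of triples.

    Nodes are deduplicated across subjects and objects; every triple
    becomes one relationship record.
    """
    node_ids = set()
    relationships = []
    for subject, predicate, obj in triples:
        # Both endpoints of a triple are graph nodes.
        node_ids.add(subject)
        node_ids.add(obj)
        relationships.append((subject, predicate, obj))
    # Sort for deterministic output partitions.
    return sorted(node_ids), relationships

triples = [
    ("alice", "KNOWS", "bob"),
    ("bob", "KNOWS", "carol"),
    ("alice", "KNOWS", "carol"),
]
nodes, rels = split_triples(triples)
# nodes -> ['alice', 'bob', 'carol']; rels keeps all three triples
```

In the real pipeline the same split is expressed as distributed Spark jobs, with the resulting partitions handed to custom OutputFormats rather than kept in memory.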